In this notebook we will use the concepts covered in the previous two notebooks to connect to GitHub repositories.

The following components are used:
-   **LlamaIndex** : Data framework.
    -   *Data Connectors* : Ingest data from various sources.
    -   *Data Indexing* : Structure the ingested data.
    -   *Query Engines* : Enable natural language queries to interact with the stored data.
-   **DeepLake** : Data lake.
    -   *Optimized Storage* : Designed for quick data retrieval.
    -   *Data Type Support* : Handles multiple data types, like images, videos and complex data structures.
-   **Ollama** : LLM models.
-   **python-dotenv** : Library to specify environment variables.


# Basic Implementation

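The first step is to apply nest_asyncio, which allows the asynchronous calls used later in the notebook to run inside Jupyter's already-running event loop. The environment variables are set a few cells further down.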

In [7]:
import nest_asyncio
nest_asyncio.apply()

Set LlamaIndex settings

In [8]:
from llama_index.core import Settings
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import OllamaLLM

Settings.embed_model = OllamaEmbeddings(model="llama3.1:8b") # Embedding model used globally by LlamaIndex
Settings.llm = OllamaLLM(model="llama3.1:8b")

Change the following code so that it loads your own keys.

In [9]:
import json
import os

with open('../data/keys.json', 'r') as file:
    data = json.load(file)
    github_token = data['GitHubToken']
    activeloop_token = data['ActiveLoopKey']
    github_url = data['GitHubUrl']
    activeloop_url = data['ActiveLoopVectorStoreUrl']
    os.environ["GITHUB_TOKEN"] = github_token
    os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
    os.environ["GITHUB_PATH"] = github_url
    os.environ["ACTIVELOOP_PATH"] = activeloop_url
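As a minimal sketch, the key file can be validated before its values are copied into the environment, so that a missing entry fails early with a clear message. The helper name `load_keys` and the placeholder values are assumptions made for illustration; only the four key names come from the cell above.

```python
import json
import os
import tempfile

# Key names expected in the JSON file (taken from the cell above)
REQUIRED_KEYS = ("GitHubToken", "ActiveLoopKey", "GitHubUrl", "ActiveLoopVectorStoreUrl")

def load_keys(path):
    # Hypothetical helper: read the JSON key file and fail early if any key is missing or empty
    with open(path, "r") as file:
        data = json.load(file)
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return data

# Demo with placeholder values written to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({
        "GitHubToken": "<your-github-token>",
        "ActiveLoopKey": "<your-activeloop-token>",
        "GitHubUrl": "https://github.com/<owner>/<repo>",
        "ActiveLoopVectorStoreUrl": "hub://<org>/<dataset>",
    }, tmp)

demo_keys = load_keys(tmp.name)
os.remove(tmp.name)
print(sorted(demo_keys))
```

The returned dictionary can then be copied into `os.environ` exactly as in the cell above.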

In [10]:
import re
from llama_index.readers.github import GithubRepositoryReader, GithubClient
from llama_index.core import download_loader

def parse_github_url(url):
    # Function that takes a GitHub URL and extracts the repository owner and name using regular expressions
    pattern = r"https://github\.com/([^/]+)/([^/]+)"
    match = re.match(pattern, url)
    return match.groups() if match else (None, None)

def validate_owner_repo(owner, repo):
    # Function that checks that both the repository owner and name are present
    return bool(owner) and bool(repo)

def initialize_github_client():
    # Initializes the GitHub client using the token fetched from the environment variables
    github_token = os.getenv("GITHUB_TOKEN")
    return GithubClient(github_token)

# Check for GitHub Token
github_token = os.getenv("GITHUB_TOKEN")
if not github_token:
    raise EnvironmentError("GitHub token not found in environment variables")
# Check for Activeloop Token
active_loop_token = os.getenv("ACTIVELOOP_TOKEN")
if not active_loop_token:
    raise EnvironmentError("Activeloop token not found in environment variables")

github_client = initialize_github_client()
download_loader("GithubRepositoryReader")
github_url = os.getenv("GITHUB_PATH")
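owner, repo = parse_github_url(github_url) # Get the owner and repository name
if not validate_owner_repo(owner, repo):
    raise ValueError(f"Could not extract owner and repository from URL: {github_url}")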

loader = GithubRepositoryReader(
    github_client=github_client,
    owner=owner,
    repo=repo,
    filter_file_extensions=([".py", ".md"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose=False,
    concurrent_requests=5
)
print(f"Loading {repo} repository by {owner}")
docs = loader.load_data(branch="main")
print("Documents uploaded : ")
for doc in docs:
    print(doc.metadata)


Loading RAGCourse repository by AlejandroTorresMunoz
Documents uploaded : 
{'file_path': 'README.md', 'file_name': 'README.md', 'url': 'https://github.com/AlejandroTorresMunoz\\RAGCourse\\blob/main\\README.md'}
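The regular expression in parse_github_url keeps whatever follows the owner as the repository name, so a URL ending in `.git` would yield a name such as `repo.git`. The sketch below is a hypothetical stricter variant (the name `parse_github_url_strict` and the sample URLs are assumptions, not part of the original code) that tolerates a `.git` suffix, a trailing slash, and extra path segments.

```python
import re

def parse_github_url_strict(url):
    # Hypothetical variant of parse_github_url: accepts an optional ".git" suffix,
    # a trailing slash, or extra path segments after the repository name
    pattern = r"https://github\.com/([^/]+)/([^/]+?)(?:\.git)?(?:/.*)?$"
    match = re.match(pattern, url.strip())
    return match.groups() if match else (None, None)

# Placeholder URLs used only to exercise the helper
samples = [
    "https://github.com/some-owner/some-repo",
    "https://github.com/some-owner/some-repo.git",
    "https://github.com/some-owner/some-repo/blob/main/README.md",
    "https://github.com/some-owner",
]
for sample in samples:
    print(sample, "->", parse_github_url_strict(sample))
```

When the URL cannot be parsed the helper returns `(None, None)`, which validate_owner_repo then rejects.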


Once the data from GitHub has been downloaded, we create a vector store and upload the data.

In [11]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

# Create an object to connect with the vector store in ActiveLoop
vector_store = DeepLakeVectorStore(
    dataset_path=os.environ["ACTIVELOOP_PATH"],
    overwrite=True,
    runtime={"tensor_db": True},
)
print("Creating storage context")
storage_context = StorageContext.from_defaults(vector_store=vector_store) # Storage context to store nodes, indexes, vectors...
print("Creating vector store index")
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context) # Vector store index
print("Creating query engine")
query_engine = index.as_query_engine() # Create query engine based on the index created

Your Deep Lake dataset has been successfully created!


 

Creating storage context
Creating vector store index
Uploading data to deeplake dataset.


100%|██████████| 1/1 [00:01<00:00,  1.27s/it]
 

Dataset(path='hub://alejandrotormun/repository_vector_store', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
   text       text      (1, 1)      str     None   
 metadata     json      (1, 1)      str     None   
 embedding  embedding  (1, 4096)  float32   None   
    id        text      (1, 1)      str     None   
Creating query engine


We test the created query engine with a simple question.

In [12]:
import textwrap

intro_question = "What is the repository about?"
print(f"Test question : {intro_question}")
print("=" * 50)
answer = query_engine.query(intro_question)
print(f"Answer: {textwrap.fill(str(answer), 100)} \n")


Test question : What is the repository about?
Answer: The repository is about annotating a course on foundational model certification for Gen AI, with a
focus on using Ollama models (open-source models) instead of the original course's proprietary
models. 



# Diving Deeper : Low-level API

In this section, we will learn how to customize the query logic used in the basic implementation above.

In [13]:

# Create an index of the documents
try:
    # Create the store in ActiveLoop cloud
    vector_store = DeepLakeVectorStore(
        dataset_path=os.environ["ACTIVELOOP_PATH"],
        overwrite=True,
        runtime={"tensor_db": True},
    )
except Exception as e:
    print(f"An unexpected error occurred while creating or fetching the vector store: {str(e)}")

# StorageContext has no from_documents method; from_defaults is the constructor used earlier
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Load the documents into the vector store
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

Your Deep Lake dataset has been successfully created!


The retriever is responsible for fetching relevant nodes from the index. LlamaIndex supports various retrieval modes, allowing you to choose the one that best fits your needs. Here we use a vector index retriever:

In [None]:
from llama_index.core.retrievers import VectorIndexRetriever

# Configure a retriever that returns the 4 nodes most similar to the query
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
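To make the idea of similarity_top_k concrete, here is a minimal sketch of top-k retrieval over toy embeddings. Cosine similarity is assumed for illustration; the metric actually used depends on the vector store. The function names and the two-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors of equal length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, node_embeddings, k):
    # Score every node against the query and keep the k highest scores
    scored = [
        (node_id, cosine_similarity(query_embedding, embedding))
        for node_id, embedding in node_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional embeddings (real embeddings in this notebook have 4096 dimensions)
toy_nodes = {
    "readme": [1.0, 0.0],
    "setup": [0.6, 0.8],
    "license": [0.0, 1.0],
}
results = top_k([1.0, 0.0], toy_nodes, k=2)
for node_id, score in results:
    print(node_id, round(score, 2))
```

A real vector store avoids scoring every node one by one, but the ranking it returns follows the same principle.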

We can also pass some extra parameters.

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Build the response synthesizer with the desired response mode. The mode is set here because
# from_args only uses its own response_mode argument when no synthesizer is passed.
# "refine" creates and refines an answer by going through each retrieved node sequentially
# (older LlamaIndex releases named this mode "default")
response_synthesizer = get_response_synthesizer(response_mode="refine")
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, # Retriever configured to select the most similar nodes
    response_synthesizer=response_synthesizer, # Component that generates the final response using its response mode
    node_postprocessors=[ # List of postprocessors applied to the nodes returned by the retriever
        SimilarityPostprocessor(similarity_cutoff=0.7)]  # Drops nodes whose similarity score is below 0.7
)

response = query_engine.query("What code is in this repository?")
print(response.response)
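The effect of the similarity cutoff can be illustrated with a small stand-alone sketch. The function name `apply_similarity_cutoff` and the scores are hypothetical; this shows the filtering idea and is not the LlamaIndex implementation.

```python
def apply_similarity_cutoff(scored_nodes, cutoff):
    # Keep only the (node_id, score) pairs whose score reaches the cutoff
    return [(node_id, score) for node_id, score in scored_nodes if score >= cutoff]

# Hypothetical scores, as a retriever might return them
retrieved = [("readme", 0.91), ("setup", 0.72), ("license", 0.40)]
kept = apply_similarity_cutoff(retrieved, cutoff=0.7)
print(kept)
```

If the cutoff is set too high, every node is filtered out and the synthesizer has no context to work with, so it is worth checking the scores of the retrieved nodes before choosing a value.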

We can also change the response mode.

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Build the response synthesizer with the desired response mode. The mode is set here because
# from_args only uses its own response_mode argument when no synthesizer is passed.
# "compact" fits as many node text chunks as possible into each prompt; if there are too many
# chunks for one prompt, it refines the answer across multiple prompts
response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, # Retriever configured to select the most similar nodes
    response_synthesizer=response_synthesizer, # Component that generates the final response using its response mode
    node_postprocessors=[ # List of postprocessors applied to the nodes returned by the retriever
        SimilarityPostprocessor(similarity_cutoff=0.7)]  # Drops nodes whose similarity score is below 0.7
)

response = query_engine.query("What code is in this repository?")
print(response.response)
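The packing idea behind the compact mode can be sketched as follows. This is a simplified illustration under stated assumptions, not LlamaIndex's implementation: it counts characters rather than tokens, never splits a chunk, and `pack_chunks` is a hypothetical helper.

```python
def pack_chunks(chunks, max_chars, separator="\n\n"):
    # Greedily merge consecutive chunks while the merged text fits in max_chars
    packed = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if current and len(candidate) > max_chars:
            # The next chunk does not fit: close the current prompt and start a new one
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed

chunks = ["aaaa", "bbbb", "cccc"]
prompts = pack_chunks(chunks, max_chars=10)
print(len(prompts), "prompts instead of", len(chunks))
```

Fewer prompts mean fewer LLM calls than processing one chunk per call, which is the motivation for packing.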

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Build the response synthesizer with the desired response mode. The mode is set here because
# from_args only uses its own response_mode argument when no synthesizer is passed.
# "tree_summarize" builds a tree from the retrieved nodes and returns the root node as the
# response. It is useful for summarization queries.
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, # Retriever configured to select the most similar nodes
    response_synthesizer=response_synthesizer, # Component that generates the final response using its response mode
    node_postprocessors=[ # List of postprocessors applied to the nodes returned by the retriever
        SimilarityPostprocessor(similarity_cutoff=0.7)]  # Drops nodes whose similarity score is below 0.7
)

response = query_engine.query("What code is in this repository?")
print(response.response)
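The bottom-up tree construction can be sketched with a stand-in for the LLM call. Everything here is an assumption made for illustration: `tree_summarize`, `toy_summarize` and the `fan_out` parameter are hypothetical names, and the real mode groups chunks by prompt size rather than by a fixed count.

```python
def tree_summarize(chunks, summarize, fan_out=2):
    # Repeatedly summarize groups of fan_out texts until a single root text remains
    if not chunks:
        raise ValueError("At least one chunk is required")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i:i + fan_out])
            for i in range(0, len(level), fan_out)
        ]
    return level[0]

def toy_summarize(texts):
    # Stand-in for an LLM summarization call: it only shows how texts are grouped
    return "(" + "+".join(texts) + ")"

root = tree_summarize(["a", "b", "c"], toy_summarize)
print(root)
```

The nesting of the parentheses in the printed result mirrors the shape of the tree, with the outermost pair corresponding to the root.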

In [None]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Build the response synthesizer with the desired response mode. The mode is set here because
# from_args only uses its own response_mode argument when no synthesizer is passed.
# "no_text" fetches the nodes but does not send them to the LLM, which makes it possible
# to inspect the retrieved nodes
response_synthesizer = get_response_synthesizer(response_mode="no_text")
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever, # Retriever configured to select the most similar nodes
    response_synthesizer=response_synthesizer, # Component that generates the final response using its response mode
    node_postprocessors=[ # List of postprocessors applied to the nodes returned by the retriever
        SimilarityPostprocessor(similarity_cutoff=0.7)]  # Drops nodes whose similarity score is below 0.7
)

response = query_engine.query("What code is in this repository?")
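# In "no_text" mode no answer text is generated, so inspect the retrieved nodes instead
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata)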