In this notebook, we will see how to apply the concepts learned in the previous two notebooks to connect to GitHub repositories.

The following elements are going to be used : 
-   **LlamaIndex** : Data framework.
    -   *Data Connectors* : Ingest data from various sources.
    -   *Data Indexing* : Structure the ingested data.
    -   *Query Engines* : Enable natural language queries to interact with the stored data.
-   **DeepLake** : Data lake.
    -   *Optimized Storage* : Designed for quick data retrieval.
    -   *Data Type Support* : Handles multiple data types, like images, videos and complex data structures.
-   **Ollama** : LLM models.
-   **python-dotenv** : Library to specify environment variables.


# Set env variables

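The first step is to set the values of the environment variables. Before doing so, we apply `nest_asyncio`, which allows asynchronous code to run inside the event loop that the notebook already has running.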

In [1]:
import nest_asyncio
nest_asyncio.apply()
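
The `nest_asyncio.apply()` call above matters because a Jupyter kernel already runs an event loop, and by default `asyncio` refuses to start a second one inside it. The following is a minimal sketch of that default behaviour using only the standard library; `inner` and `outer` are hypothetical names for illustration, and the snippet is meant to be run as a plain script rather than inside a notebook cell.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine a library might want to run
    return 42


async def outer():
    # We are now inside a running event loop, as in a Jupyter cell
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is refused by default
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call raises a `RuntimeError`; `nest_asyncio` patches the loop so that such nested calls are allowed.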

Next, we configure the LlamaIndex settings with the embedding model and the LLM, both served by Ollama.

In [2]:
from llama_index.core import Settings
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import OllamaLLM

Settings.embed_model = OllamaEmbeddings(model="llama3.1:8b") # Embedding model used by LlamaIndex to vectorize documents and queries
Settings.llm = OllamaLLM(model="llama3.1:8b") # LLM used by LlamaIndex to generate answers

Adapt the following code to load your own keys. It reads them from a JSON file and exports them as environment variables.

In [3]:
import json
import os

with open('../data/keys.json', 'r') as file:
    data = json.load(file)
    github_token = data['GitHubToken']
    activeloop_token = data['ActiveLoopKey']
    github_url = data['GitHubUrl']
    activeloop_url = data['ActiveLoopVectorStoreUrl']
    os.environ["GITHUB_TOKEN"] = github_token
    os.environ["ACTIVELOOP_TOKEN"] = activeloop_token
    os.environ["GITHUB_PATH"] = github_url
    os.environ["ACTIVELOOP_PATH"] = activeloop_url
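
The loading step above can also be written so that a missing or empty key fails early with a clear message. The following is a minimal sketch, assuming the same four JSON key names as in the cell above; `export_keys`, `KEY_TO_ENV`, and every value in the demo are hypothetical placeholders, not real credentials or paths.

```python
import json
import os
import tempfile

# Mapping from JSON key (as used above) to environment variable name
KEY_TO_ENV = {
    "GitHubToken": "GITHUB_TOKEN",
    "ActiveLoopKey": "ACTIVELOOP_TOKEN",
    "GitHubUrl": "GITHUB_PATH",
    "ActiveLoopVectorStoreUrl": "ACTIVELOOP_PATH",
}


def export_keys(path, environ=os.environ):
    """Load the JSON file at `path` and copy its values into `environ`."""
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)
    # Fail early if any expected key is absent or empty
    missing = [key for key in KEY_TO_ENV if not data.get(key)]
    if missing:
        raise KeyError(f"Missing or empty keys in {path}: {missing}")
    for json_key, env_name in KEY_TO_ENV.items():
        environ[env_name] = str(data[json_key])
    return environ


# Demo with placeholder values written to a temporary file
demo = {
    "GitHubToken": "ghp_placeholder",
    "ActiveLoopKey": "al_placeholder",
    "GitHubUrl": "https://github.com/some-owner/some-repo",
    "ActiveLoopVectorStoreUrl": "hub://some-org/some-dataset",
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keys.json")
    with open(demo_path, "w", encoding="utf-8") as file:
        json.dump(demo, file)
    # A plain dict is passed here so the demo does not touch os.environ
    demo_env = export_keys(demo_path, environ={})
print(sorted(demo_env))
```

In the notebook itself, calling `export_keys('../data/keys.json')` with the default `environ` would populate `os.environ` in the same way as the original cell.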

In [4]:
import re
from llama_index.readers.github import GithubRepositoryReader, GithubClient

def parse_github_url(url):
    # Extract the repository owner and name from a GitHub URL using a regular expression
    pattern = r"https://github\.com/([^/]+)/([^/]+)"
    match = re.match(pattern, url)
    return match.groups() if match else (None, None)

def validate_owner_repo(owner, repo):
    # Check that both the repository owner and name are present
    return bool(owner) and bool(repo)

def initialize_github_client():
    # Initialize the GitHub client using the token fetched from the environment variables
    github_token = os.getenv("GITHUB_TOKEN")
    return GithubClient(github_token)

# Check for GitHub Token
github_token = os.getenv("GITHUB_TOKEN")
if not github_token:
    raise EnvironmentError("GitHub token not found in environment variables")
# Check for Activeloop Token
active_loop_token = os.getenv("ACTIVELOOP_TOKEN")
if not active_loop_token:
    raise EnvironmentError("Activeloop token not found in environment variables")

github_client = initialize_github_client()
github_url = os.getenv("GITHUB_PATH", "") # Empty string if unset, so parsing fails cleanly
owner, repo = parse_github_url(github_url) # Get the owner and repository name
if not validate_owner_repo(owner, repo):
    raise ValueError(f"Invalid GitHub URL: {github_url!r}")

loader = GithubRepositoryReader(
    github_client=github_client,
    owner=owner,
    repo=repo,
    filter_file_extensions=([".py", ".md"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose=False,
    concurrent_requests=5
)
print(f"Loading {repo} repository by {owner}")
docs = loader.load_data(branch="main")
print("Documents uploaded : ")
for doc in docs:
    print(doc.metadata)

Loading RAGCourse repository by AlejandroTorresMunoz
Documents uploaded : 
{'file_path': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\BasicConceptsRecap.ipynb', 'file_name': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\BasicConceptsRecap.ipynb', 'url': 'https://github.com/AlejandroTorresMunoz\\RAGCourse\\blob/main\\Module 1 - Basics of RAG with Langchain and LllamaIndex\\BasicConceptsRecap.ipynb'}
{'file_path': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\LlamaIndexIntroduction.ipynb', 'file_name': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\LlamaIndexIntroduction.ipynb', 'url': 'https://github.com/AlejandroTorresMunoz\\RAGCourse\\blob/main\\Module 1 - Basics of RAG with Langchain and LllamaIndex\\LlamaIndexIntroduction.ipynb'}
{'file_path': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\LlamaIndexWithGithubRepos.ipynb', 'file_name': 'Module 1 - Basics of RAG with Langchain and LllamaIndex\\LlamaIndexWithGithubRepos.ipynb'
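
The `parse_github_url` helper above keeps whatever follows the owner as the repository name, so a URL ending in `.git` would yield a name such as `RAGCourse.git`. The following is a hedged sketch of a stricter variant; `parse_github_url_strict` and `GITHUB_URL_PATTERN` are hypothetical names, and raising `ValueError` instead of returning `(None, None)` is a design choice of this sketch, not of the original code.

```python
import re

# owner / repo, with an optional ".git" suffix and an optional deeper path
GITHUB_URL_PATTERN = re.compile(
    r"^https://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?(?:/.*)?$"
)


def parse_github_url_strict(url):
    """Return (owner, repo) or raise ValueError for anything else."""
    match = GITHUB_URL_PATTERN.match(url.strip())
    if not match:
        raise ValueError(f"Not a GitHub repository URL: {url!r}")
    return match.group("owner"), match.group("repo")


print(parse_github_url_strict("https://github.com/AlejandroTorresMunoz/RAGCourse.git"))
```

Raising on invalid input makes a separate validation step unnecessary, at the cost of requiring a `try`/`except` where a malformed URL is expected.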

Once the data from GitHub has been downloaded, we create a vector store and upload the data.

In [5]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

# Create an object to connect with the vector store in ActiveLoop
vector_store = DeepLakeVectorStore(
    dataset_path=os.environ["ACTIVELOOP_PATH"],
    overwrite=True,
    runtime={"tensor_db": True},
)
print("Creating storage context")
storage_context = StorageContext.from_defaults(vector_store=vector_store) # Storage context to store nodes, indexes, vectors...
print("Creating vector store index")
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context) # Vector store index
print("Creating query engine")
query_engine = index.as_query_engine() # Query engine

Your Deep Lake dataset has been successfully created!

Creating storage context
Creating vector store index


KeyboardInterrupt: 
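
To make the role of the vector store index more concrete, the following toy sketch illustrates similarity-based retrieval in plain Python. It is not how DeepLake or LlamaIndex is implemented: the three-dimensional vectors are made up by hand rather than produced by an embedding model, and `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors of equal length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    # Score every stored (text, vector) pair and keep the k most similar texts
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made "embeddings" standing in for real model output
toy_store = [
    ("README.md: course overview", [0.9, 0.1, 0.0]),
    ("loader.py: GitHub reader setup", [0.1, 0.9, 0.1]),
    ("index.py: vector store creation", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]
print(top_k(toy_query, toy_store, k=2))
```

In the real pipeline, the query engine embeds the question, retrieves the most similar chunks from the store, and passes them to the LLM as context for the answer.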

We test the created query engine with a simple question.

In [None]:
import textwrap

intro_question = "What is the repository about?"
print(f"Test question : {intro_question}")
print("=" * 50)
answer = query_engine.query(intro_question)
print(f"Answer: {textwrap.fill(str(answer), 100)} \n")
