# Chat with your Repo - Langchain Example
With this example notebook, you will learn how to use LangChain to add repository context to your queries so that you can ask detailed questions about a codebase.
This example uses OpenAI GPT-3.5 and the Chroma vector store. Because LangChain is modular, you can swap the LLM and the vector store to suit your preferences.

## Setup
First, let us begin by installing the necessary dependencies (if not already installed).

In [None]:
%pip install openai langchain ipywidgets GitPython chromadb tiktoken python-dotenv

Then let us import the necessary packages and load the environment variables that hold our OpenAI key:

Note: Make sure you have created a .env file containing OPENAI_API_KEY=sk-...

In [None]:
import os
from functools import partial

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file for our API Keys
from langchain.chat_models import ChatOpenAI # Switch to another provider as you please
from langchain.document_loaders import GitLoader # Used to load the repo
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
) # Text splitter for our codebase
from langchain.embeddings import OpenAIEmbeddings # Embeddings for the LLM injection
from langchain.vectorstores import Chroma # Our Vectorstore
from langchain.chains import RetrievalQA # Chain to chat with our documents
from langchain.memory import ConversationTokenBufferMemory # Memory for our conversation
import ipywidgets as widgets # Widgets for our chat
from IPython.display import display, Markdown # Used to display the chat
from langchain.prompts import ChatPromptTemplate
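
A missing key usually only surfaces later as an authentication error on the first API call. As a minimal sketch, a small check run after `load_dotenv` can fail early with a clearer message. The helper name `require_env` is hypothetical and not part of LangChain or python-dotenv:

```python
import os


def require_env(name, env=None):
    """Return the value of `name`, or raise a clear error if it is missing or blank.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file before continuing."
        )
    return value


# In the notebook you would call this after load_dotenv(...):
# require_env("OPENAI_API_KEY")
```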

## Parameters
Now let us define the parameters needed to run our notebook:

In [None]:
repo_url = "https://github.com/ganlanyuan/tiny-slider" # Replace with your repo
local_path = os.path.join(os.getcwd(), 'repo') # Replace with your local path
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0) # Replace with your Chat-optimized LLM

Let's define a helper function that creates our loader:

In [None]:
def get_loader(local_path, repo_url, branch = 'master', has_file_ext = ['.md', '.js', '.html'], ignore_paths = ['dist/']):
    """
    Helper function to create a Loader to load the repo

    Args:
        local_path (str): Path to the local repo
        repo_url (str): URL of the repo to clone from
        branch (str): Branch of the repo to checkout
        has_file_ext (list): List of file extensions to load
        ignore_paths (list): List of paths to ignore
    
    Returns:
        GitLoader Object
    """
    
    # If the path already exists, GitLoader raises an error when asked to clone into it,
    # so we skip cloning and reuse the existing checkout
    if os.path.exists(local_path):
        repo_url = None
    
    file_filter_functions = []

    def not_in_ignore_paths(file_path, ignore_paths):
        return all(file_path.find(path) == -1 for path in ignore_paths)

    def has_allowed_extension(file_path, extensions):
        return any(file_path.endswith(ext) for ext in extensions)

    if len(ignore_paths):
        file_filter_functions.append(partial(not_in_ignore_paths, ignore_paths=ignore_paths))

    if len(has_file_ext):
        file_filter_functions.append(partial(has_allowed_extension, extensions=has_file_ext))

    def file_filter_function(file_path):
        return all(func(file_path) for func in file_filter_functions)
    
    return GitLoader(repo_path=local_path, clone_url=repo_url, branch=branch, file_filter=file_filter_function)
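
To see what the combined filter keeps and drops, the same two checks can be exercised on their own, without cloning anything. This is a sketch that mirrors the logic above; `build_file_filter` and the sample paths are made up for illustration:

```python
from functools import partial


def not_in_ignore_paths(file_path, ignore_paths):
    # Substring match: "dist/" anywhere in the path excludes the file
    return all(path not in file_path for path in ignore_paths)


def has_allowed_extension(file_path, extensions):
    return file_path.endswith(tuple(extensions))


def build_file_filter(extensions, ignore_paths):
    """Combine the active checks into one predicate (hypothetical helper)."""
    checks = []
    if ignore_paths:
        checks.append(partial(not_in_ignore_paths, ignore_paths=ignore_paths))
    if extensions:
        checks.append(partial(has_allowed_extension, extensions=extensions))
    # With no checks configured, all() over an empty list is True: every file passes
    return lambda file_path: all(check(file_path) for check in checks)


file_filter = build_file_filter([".md", ".js", ".html"], ["dist/"])
sample_paths = ["README.md", "src/tiny-slider.js", "dist/tiny-slider.js", "package.json"]
kept = [path for path in sample_paths if file_filter(path)]
print(kept)
```

Because the ignore check is a plain substring match, a path such as `src/dist/helper.js` would also be excluded.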

In our implementation, we use the RecursiveCharacterTextSplitter to split the documents. Let us define a helper function that splits the documents into chunks, with one splitter for each language used in the repo:

In [None]:
def split_docs(docs):
    """
    Helper function to split the docs into chunks by supported languages.

    Args:
        docs (list): List of documents to split
    
    Returns:
        list of documents
    """
    
    # One splitter per language, keyed by the file extension it handles
    splitters = {
        ext: RecursiveCharacterTextSplitter.from_language(
            language=language, chunk_size=1024, chunk_overlap=0
        )
        for ext, language in {
            ".js": Language.JS,
            ".html": Language.HTML,
            ".md": Language.MARKDOWN,
        }.items()
    }
    # Generic fallback for any other file type that passed the loader's filter
    default_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=0)

    chunks = []
    for doc in docs:
        # GitLoader stores the file path in the "source" metadata field
        ext = os.path.splitext(doc.metadata.get("source", ""))[1]
        splitter = splitters.get(ext, default_splitter)
        # Split each file once, with its own language's splitter;
        # split_documents keeps the metadata (e.g. the source file) on every chunk
        chunks.extend(splitter.split_documents([doc]))
    return chunks
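
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-width chunker. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter first tries language-specific separators so chunks end at natural boundaries); it only illustrates how the two parameters interact when length is measured in characters. The function name `chunk_text` is hypothetical:

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so the window
    start advances by chunk_size - chunk_overlap each step.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in this notebook, no text is repeated across chunks, which keeps embedding costs down at the price of possibly cutting related lines apart.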

Now let us retrieve our loader and the split docs:

In [None]:
loader = get_loader(local_path, repo_url)
docs = loader.load()
doc_chunks = split_docs(docs) # Use a new name so the split_docs function is not overwritten when the cell is re-run

Now we can convert them to embeddings and store them in our vector store.

(Note: each time we run this cell, we call the embeddings API. For production, you should store the embeddings locally or on a server so that you don't pay the API costs on every run):

In [None]:
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents=doc_chunks, embedding=embeddings)
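
One way to avoid paying for the same embeddings twice is to cache vectors on disk, keyed by a hash of the chunk text. The following is a minimal sketch under stated assumptions, not what Chroma does internally: `embed_with_cache` is a hypothetical helper, and `embed_fn` stands for any function that maps a list of strings to a list of vectors (the `embed_documents` method of the embeddings object has that shape).

```python
import hashlib
import json
from pathlib import Path


def embed_with_cache(texts, embed_fn, cache_path):
    """Return one vector per text, calling `embed_fn` only for texts not yet cached."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    # Key each text by its SHA-256 hash so identical chunks share one entry
    keys = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]
    missing = {key: text for key, text in zip(keys, texts) if key not in cache}

    if missing:
        # A single batched call for everything that is new
        vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
        cache_file.write_text(json.dumps(cache))

    return [cache[key] for key in keys]
```

A JSON file is enough for a small repository; for larger ones, a persistent vector store is likely the better fit.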

Now let's define the retriever:

In [None]:
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4}) # You can adjust search_kwargs here
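
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with brute-force cosine similarity in plain Python. Cosine similarity is an assumption made for illustration: a real vector store may use a different distance metric and an approximate index. The names `cosine_similarity` and `top_k` are hypothetical:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed_chunks, k=4):
    """Return the texts of the k chunks most similar to the query.

    `indexed_chunks` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings", made up for the example
index = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [0.9, 0.1])]
print(top_k([1.0, 0.0], index, k=2))
```

A larger `k` gives the LLM more context per question but also uses more of the prompt's token budget.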

Now let us define our memory to keep some of the conversation for context:

In [None]:
memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=512) # Adjust max_token_limit depending on how much of the conversation you want to keep
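
The idea behind a token buffer is to drop the oldest messages until the remaining history fits the limit. Here is a minimal sketch of that pruning step. The whitespace-based `count_tokens` is a stand-in assumption (the real memory counts tokens with the LLM's tokenizer), and `trim_history` is a hypothetical name:

```python
def count_tokens(text):
    # Stand-in tokenizer: counts whitespace-separated words, not real model tokens
    return len(text.split())


def trim_history(messages, max_token_limit, count=count_tokens):
    """Drop the oldest messages until the total token count fits the limit."""
    trimmed = list(messages)  # Work on a copy so the caller's list is untouched
    while trimmed and sum(count(message) for message in trimmed) > max_token_limit:
        trimmed.pop(0)
    return trimmed


history = ["one two three", "four five", "six"]
print(trim_history(history, max_token_limit=3))
```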

Now let us define a prompt template to use when calling our chain. It is a modified version of the default question-answering prompt that tells the model more about the repository and how to format its answers.

In [None]:
prompt_template = ChatPromptTemplate.from_template(
    """
    Your task is to assist with questions from a code repository. \
    The Repository is a library called tiny slider. \
    Use the following pieces of context to answer the question at the end. \
    The context consists of snippets of files from the repository. \
    When answering with code snippets, make sure to wrap them in the correct syntax using markdown backticks. \
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    Helpful Answer:
    """
)
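
At query time, the chain fills the two placeholders: `{context}` with the retrieved chunks and `{question}` with the user's input. The sketch below imitates that step with plain string formatting. It assumes the chunks are joined with a blank line between them, and both the shortened `TEMPLATE` and the `build_prompt` helper are made up for illustration:

```python
# Shortened stand-in for the prompt above, with the same two placeholders
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_prompt(chunks, question, template=TEMPLATE):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt = build_prompt(
    ["// snippet from file A", "// snippet from file B"],
    "How do I enable autoplay?",
)
print(prompt)
```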

Now we can define the chain that we will use to answer questions about our Git repository:

In [None]:
chain = RetrievalQA.from_llm(
    llm=llm,
    memory=memory, 
    retriever=vector_retriever, 
    prompt=prompt_template, 
)

## Creating an interactive chat interface to chat with our Repo

In [None]:
# Create the text input widget
text_input = widgets.Text(
    value='',
    placeholder='Type something...',
    description='Input:',
    disabled=False
)

# Create an output widget to display the conversation history
output = widgets.Output()

def format_input(user, user_input):
    return f"## {user}: \n --- \n {user_input}"

# Function to handle the text input
def handle_submit(sender):
    with output:
        user_input = text_input.value
        # Clear the input box for the next message
        text_input.value = ''
        # Display the user's input
        display(Markdown(format_input('User', user_input)))
        result = chain.run(user_input)
        display(Markdown(format_input('AI', result)))

with output:
    display(Markdown('# Chat with your repo'))

# Link the function to the text input's submission event
text_input.on_submit(handle_submit)

# Display the widgets
display(output, text_input)