# RAG Model Proof of Concept

- Source: https://python.langchain.com/v0.1/docs/use_cases/code_understanding/
- Code: https://colab.research.google.com/github/langchain-ai/langchain/blob/v0.1/docs/docs/use_cases/code_understanding.ipynb#scrollTo=etu19rb1Oj2d

In [34]:
%pip install --upgrade --quiet langchain-openai tiktoken langchain-chroma langchain langchain-community python-dotenv GitPython

Note: you may need to restart the kernel to use updated packages.


In [13]:
# Import libraries
import os

from dotenv import load_dotenv
from git import Repo
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

In [18]:
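# Load environment variables from a local .env file
load_dotenv()

# Read the OpenAI API key; the langchain_openai clients also pick up OPENAI_API_KEY from the environment
openai_key = os.getenv("OPENAI_API_KEY")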
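
`os.getenv` returns `None` when the variable is missing, so a missing key would only surface later, at the first OpenAI call. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of `dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value
```

In this notebook it could be called as `openai_key = require_env("OPENAI_API_KEY")` right after `load_dotenv()`.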

### Clone repo and load content

In [16]:
# Clone the target repo locally (adjust repo_path for your machine)
target_repo_link = "https://github.com/langchain-ai/langchain"
repo_path = "/Users/djr/Desktop/target_repo"
repo = Repo.clone_from(target_repo_link, to_path=repo_path)
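
Git refuses to clone into an existing non-empty directory, so re-running the clone cell fails once the repo is on disk. One possible guard is sketched below; `needs_clone` and `clone_if_missing` are hypothetical helper names, and the check for a `.git` folder is an assumption about how to detect an earlier clone.

```python
from pathlib import Path


def needs_clone(repo_path: str) -> bool:
    """Return True when repo_path does not already contain a .git directory."""
    return not (Path(repo_path) / ".git").is_dir()


def clone_if_missing(url: str, repo_path: str) -> None:
    """Clone url into repo_path only if no earlier clone is found there."""
    if needs_clone(repo_path):
        # GitPython is imported lazily so the check above works without it
        from git import Repo

        Repo.clone_from(url, to_path=repo_path)
```

With the variables from the cell above, the call would be `clone_if_missing(target_repo_link, repo_path)`.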

In [20]:
# Load content from repository
loader = GenericLoader.from_filesystem(
    repo_path + "/libs/core/langchain_core",
    glob="**/*",
    suffixes=[".py"],
    exclude=["**/non-utf8-encoding.py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
documents = loader.load()
len(documents)
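
Before loading, it can help to preview which files the `suffixes` and `exclude` filters are likely to pick up. The sketch below approximates that selection with `pathlib` only; `preview_files` is a hypothetical helper, it matches excluded files by name rather than by glob, and the loader's exact matching rules may differ.

```python
from pathlib import Path


def preview_files(root, suffixes, exclude_names=()):
    """List files under root with one of the given suffixes, skipping excluded file names."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes and p.name not in exclude_names
    )
```

For the loader call above, a comparable preview would be `preview_files(repo_path + "/libs/core/langchain_core", [".py"], ["non-utf8-encoding.py"])`.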

### Index data and embed to vector store

In [21]:
# Split documents into chunks
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
)
texts = python_splitter.split_documents(documents)
len(texts)

1017
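
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on Python-specific separators such as class and function boundaries); `sliding_chunks` is a hypothetical name used only to show how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 200-character overlap plays for the 2000-character chunks above.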

In [23]:
# generate and save vector embeddings
db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()), persist_directory='vector_store')

### Read embeddings and create retriever

In [31]:
# Reload the persisted vector store
db1 = Chroma(persist_directory='vector_store', embedding_function=OpenAIEmbeddings(disallowed_special=()))

# Build retriever
retriever = db1.as_retriever(
    search_type="mmr",  # Also test "similarity"
    search_kwargs={"k": 8},
)
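
The `"mmr"` search type stands for maximal marginal relevance, which trades relevance to the query against redundancy with documents already selected. The pure-Python sketch below illustrates the selection rule on toy 2-D vectors; `cosine` and `mmr_select` are illustrative names, the default `lambda_mult=0.5` is an assumption chosen for the example, and this is not LangChain's or Chroma's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Pick up to k candidate indices, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs[1] is a near-duplicate of docs[0]
print(mmr_select(query, docs, k=2))                   # prints [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # prints [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the rule reduces to plain similarity ranking, which is the behavior to compare against when testing `search_type="similarity"`.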

### Create LLM + retriever pipeline

In [32]:
llm = ChatOpenAI(model="gpt-4")

# Prompt that asks the LLM to turn the conversation into a search query
query_prompt = ChatPromptTemplate.from_messages(
    [
        ("placeholder", "{chat_history}"),
        ("user", "{input}"),
        (
            "user",
            "Given the above conversation, generate a search query to look up to get information relevant to the conversation",
        ),
    ]
)

# Retriever that rewrites the query with the LLM when chat history is present
retriever_chain = create_history_aware_retriever(llm, retriever, query_prompt)

# Prompt that answers the question from the retrieved documents
answer_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user's questions based on the below context:\n\n{context}",
        ),
        ("placeholder", "{chat_history}"),
        ("user", "{input}"),
    ]
)
document_chain = create_stuff_documents_chain(llm, answer_prompt)

# Full pipeline: retrieve documents, then answer from them
qa = create_retrieval_chain(retriever_chain, document_chain)
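
The two chains compose as follows: the history-aware retriever asks the LLM to rewrite the question into a search query only when `chat_history` is non-empty (otherwise the input goes to the retriever unchanged), and the retrieved documents are then placed into `{context}` for the answering prompt. Below is a plain-Python sketch of that control flow, with `rewrite_query`, `retrieve`, and `generate` as hypothetical stand-ins for the LLM and retriever calls; it mirrors the shape of the pipeline, not LangChain's internals.

```python
def answer_with_history(question, chat_history, rewrite_query, retrieve, generate):
    """Run one retrieval-augmented turn using caller-supplied stand-in functions."""
    # With no history, the raw question is used as the search query
    if chat_history:
        query = rewrite_query(chat_history, question)
    else:
        query = question
    docs = retrieve(query)
    # "Stuff" all retrieved documents into a single context string
    context = "\n\n".join(docs)
    answer = generate(context, chat_history, question)
    return {"input": question, "context": docs, "answer": answer}
```

The returned dictionary uses the same `input`, `context`, and `answer` keys that the result of `qa.invoke(...)` exposes.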

### Test 

In [33]:
question = "How do I make a retriever?"
result = qa.invoke({"input": question})
result["answer"]

'To make a retriever, you need to use the `create_retriever_tool` function. The function takes in the following parameters:\n\n- `retriever`: The retriever to use for the retrieval. This should be an instance of `BaseRetriever` or a subclass.\n- `name`: The name for the tool. This will be passed to the language model, so it should be unique and somewhat descriptive.\n- `description`: The description for the tool. This will be passed to the language model, so it should be descriptive.\n- `document_prompt`: (Optional) An instance of `BasePromptTemplate` for formatting the output documents. If not provided, a default template will be used.\n- `document_separator`: (Optional) A string used to separate multiple documents in the output. The default is two newline characters.\n\nHere\'s an example of how to create a retriever:\n\n```python\nretriever = MyCustomRetriever()  # Replace with your actual retriever\ntool = create_retriever_tool(\n    retriever=retriever,\n    name="My Retriever",\n