# Analysis of source code in a GitHub Repository with LangChain, GPT-4 and Deep Lake
In this tutorial, we use LangChain and Deep Lake with GPT-4 to analyze the code base of GPTeam.

In [5]:
!python3 -m pip install --upgrade langchain deeplake openai tiktoken

Requirement already up-to-date: langchain in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (0.0.189)
Requirement already up-to-date: deeplake in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (3.5.4)
Requirement already up-to-date: openai in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (0.27.7)
Requirement already up-to-date: tiktoken in /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages (0.4.0)
You should consider upgrading via the 'pip install --upgrade pip' command.


Define the OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate. For full documentation of Deep Lake, see the [docs](https://docs.activeloop.ai/) and the [API reference](https://docs.deeplake.ai/en/latest/).

Authenticate with Deep Lake if you want to create and publish your own dataset. You can get an API key from the [platform](https://app.activeloop.ai).

In [6]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

os.environ['OPENAI_API_KEY'] = ''
os.environ['ACTIVELOOP_TOKEN'] = ''
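
Rather than pasting keys into the notebook, one option is to export them in the shell beforehand and fail early if they are missing. The helper below is a minimal sketch of that idea; `require_env` is a hypothetical name introduced here, not part of LangChain or Deep Lake.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before running this notebook")
    return value


# Example (assumes both variables were exported in the shell beforehand):
# require_env("OPENAI_API_KEY")
# require_env("ACTIVELOOP_TOKEN")
```

This keeps secrets out of the saved notebook and turns a confusing authentication failure later on into an immediate, readable error.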

In [7]:
embeddings = OpenAIEmbeddings(disallowed_special=())

`disallowed_special=()` is required to avoid `Exception: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte` from tiktoken for some repositories.

### 1. Index the code base (optional)
You can skip this part and jump straight to using the already indexed dataset. Otherwise, we first clone the repository, then parse and chunk the code base, and finally index it with OpenAI embeddings.

In [8]:
!git clone https://github.com/101dotxyz/GPTeam # replace with any repository of your choice

Cloning into 'GPTeam'...
remote: Enumerating objects: 3047, done.
remote: Counting objects: 100% (435/435), done.
remote: Compressing objects: 100% (155/155), done.
remote: Total 3047 (delta 288), reused 384 (delta 251), pack-reused 2612
Receiving objects: 100% (3047/3047), 1.78 MiB | 2.33 MiB/s, done.
Resolving deltas: 100% (2297/2297), done.


Load all files inside the repository.

In [9]:
import os
from langchain.document_loaders import TextLoader

root_dir = './GPTeam'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that cannot be loaded as UTF-8 text (e.g. images, binaries)
            pass
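
The loop above tries every file and relies on the exception handler to drop the ones that cannot be loaded. An alternative is to filter up front, so that `.git` internals and binary assets are never loaded at all. The following is a minimal sketch under assumed settings: `SKIP_DIRS`, `KEEP_EXTENSIONS` and `iter_source_files` are illustrative names and values, not part of the original notebook.

```python
import os

# Assumed filters for illustration; adjust them to the repository being indexed.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}
KEEP_EXTENSIONS = {".py", ".md", ".toml", ".json"}


def iter_source_files(root_dir):
    """Yield paths of files under root_dir whose extension is in KEEP_EXTENSIONS."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in KEEP_EXTENSIONS:
                yield os.path.join(dirpath, name)
```

Each yielded path could then be passed to `TextLoader` exactly as in the cell above.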

Then, chunk the files.

In [10]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Created a chunk of size 3166, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
Created a chunk of size 1884, which is longer than the specified 1000
Created a chunk of size 1818, which is longer than the specified 1000
Created a chunk of size 1376, which is longer than the specified 1000
Created a chunk of size 2345, which is longer than the specified 1000
Created a chunk of size 1190, which is longer than the specified 1000
Created a chunk of size 1004, which is longer than the specified 1000
Created a chunk of size 1095, which is longer than the specified 1000
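
The warnings above are expected: `CharacterTextSplitter` only cuts at its separator (a blank line, `"\n\n"`, by default), so a block of code with no blank line inside it cannot be made smaller than it already is. The sketch below is a simplified stand-in for that behavior, not LangChain's implementation; `split_on_separator` is a hypothetical helper.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A piece that is itself longer than chunk_size is emitted whole, because
    this strategy never cuts inside a piece.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three blocks separated by blank lines; the middle one has no internal blank line.
sample = "a" * 300 + "\n\n" + "b" * 1500 + "\n\n" + "c" * 200
sizes = [len(chunk) for chunk in split_on_separator(sample)]
print(sizes)  # [300, 1500, 200]
```

The 1500-character block survives intact, which is the situation each "longer than the specified 1000" message reports.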


Execute the indexing. Computing the embeddings and uploading them to Activeloop takes about 4 minutes. With `public=True`, the resulting dataset is publicly available.

In [11]:
username = "hansenchen" # replace with your username from app.activeloop.ai
dataset_name = "GPTeam" # replace with your dataset name
db = DeepLake(dataset_path=f"hub://{username}/{dataset_name}", embedding_function=embeddings, public=True) # the dataset will be publicly available
db.add_documents(texts)

Your Deep Lake dataset has been successfully created!

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/hansenchen/GPTeam

hub://hansenchen/GPTeam loaded successfully.

Evaluating ingest: 100%|██████████| 1/1 [00:42<00:00

Dataset(path='hub://hansenchen/GPTeam', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (451, 1536)  float32   None   
    ids      text     (451, 1)      str     None   
 metadata    json     (451, 1)      str     None   
   text      text     (451, 1)      str     None   


['c76b9fe6-02cc-11ee-9446-8c85907a416d',
 'c76ba20c-02cc-11ee-9446-8c85907a416d',
 'c76ba2de-02cc-11ee-9446-8c85907a416d',
 'c76ba32e-02cc-11ee-9446-8c85907a416d',
 'c76ba360-02cc-11ee-9446-8c85907a416d',
 'c76ba392-02cc-11ee-9446-8c85907a416d',
 'c76ba3c4-02cc-11ee-9446-8c85907a416d',
 'c76ba3f6-02cc-11ee-9446-8c85907a416d',
 'c76ba428-02cc-11ee-9446-8c85907a416d',
 'c76ba45a-02cc-11ee-9446-8c85907a416d',
 'c76ba482-02cc-11ee-9446-8c85907a416d',
 'c76ba4b4-02cc-11ee-9446-8c85907a416d',
 'c76ba4e6-02cc-11ee-9446-8c85907a416d',
 'c76ba50e-02cc-11ee-9446-8c85907a416d',
 'c76ba540-02cc-11ee-9446-8c85907a416d',
 'c76ba572-02cc-11ee-9446-8c85907a416d',
 'c76ba5a4-02cc-11ee-9446-8c85907a416d',
 'c76ba5d6-02cc-11ee-9446-8c85907a416d',
 'c76ba5fe-02cc-11ee-9446-8c85907a416d',
 'c76ba630-02cc-11ee-9446-8c85907a416d',
 'c76ba662-02cc-11ee-9446-8c85907a416d',
 'c76ba68a-02cc-11ee-9446-8c85907a416d',
 'c76ba6bc-02cc-11ee-9446-8c85907a416d',
 'c76ba6ee-02cc-11ee-9446-8c85907a416d',
 'c76ba720-02cc-

### 2. Question Answering on GPTeam codebase
First load the dataset, then construct the retriever, and finally construct the conversational chain.

In [12]:
db = DeepLake(dataset_path="hub://hansenchen/GPTeam", read_only=True, embedding_function=embeddings) # replace dataset_path with the link you specified in the previous cell

 

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/hansenchen/GPTeam

hub://hansenchen/GPTeam loaded successfully.

Deep Lake Dataset in hub://hansenchen/GPTeam already exists, loading from the storage
Dataset(path='hub://hansenchen/GPTeam', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (451, 1536)  float32   None   
    ids      text     (451, 1)      str     None   
 metadata    json     (451, 1)      str     None   
   text      text     (451, 1)      str     None   


In [13]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
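
As we read these settings, the retriever first fetches the `fetch_k` = 100 chunks closest to the query by cosine distance and then keeps `k` = 10 of them using maximal marginal relevance (MMR), which trades relevance against redundancy among the chunks already chosen. The following pure-Python sketch illustrates the general technique only; it is not the Deep Lake or LangChain implementation, and `mmr_select` and its `lambda_mult` weight are names assumed for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    candidates stands in for the fetch_k nearest vectors; lambda_mult is an
    assumed trade-off weight (1.0 = pure relevance, 0.0 = pure diversity).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(mmr_select(query, candidates, k=2))  # [1, 2]
```

With `lambda_mult=1.0` the same call returns `[1, 0]`, the two near-duplicates. For code search, the diversity term helps avoid filling the context window with several almost identical chunks.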

You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter).
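
Before enabling such a filter, its matching rule can be prototyped on plain strings. The sketch below mirrors the two checks used in the commented-out filter in the next cell (chunk content and source path) but compares file extensions instead of substrings; `keep_chunk` and `ALLOWED_EXTENSIONS` are hypothetical names, and the function takes plain strings rather than Deep Lake samples.

```python
import os

# Assumed set of extensions to keep; adjust as needed.
ALLOWED_EXTENSIONS = {".py", ".scala"}


def keep_chunk(text, source):
    """Return True if a chunk should stay in the search results.

    text and source are plain strings standing in for the values a Deep Lake
    filter would read from the 'text' and 'metadata' tensors.
    """
    # Filter based on source code content.
    if "com.google" in text:
        return False
    # Filter based on the file extension of the source path.
    return os.path.splitext(source)[1].lower() in ALLOWED_EXTENSIONS
```

A substring test such as `'py' in source` also matches names like `pyproject.toml`, which is why this sketch checks the extension instead.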

In [14]:
"""
def filter(x):
    # filter based on source code
    if 'com.google' in x['text'].data()['value']:
        return False
    
    # filter based on path e.g. extension
    metadata =  x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']
"""
# Uncomment the line below to enable custom filtering
# retriever.search_kwargs['filter'] = filter


In [15]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-3.5-turbo') # may switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)

In [17]:
questions = [
    "Can you walk me through the process from starting the world to the end by variable stop_condition? You need to specify the important functions and variables used for each steps.",
    "Can you walk me through the workflow of an agent? You need to specify the important functions and variables used for each steps"
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")


-> **Question**: Can you walk me through the process from starting the world to the end by variable stop_condition? You need to specify the important functions and variable used for each steps. 

**Answer**: Sure, here is a high-level overview of the process from starting the world to the end by variable stop_condition, along with the important functions and variables used for each step:

1. Clone the project repository to your local machine.
2. Move to the repository: `cd gpteam`.
3. Run `python setup.py` to check your environment setup and configure it as needed.
4. Update the environment variables in `.env` with your API Keys.
5. Launch the world by running `poetry run world`.

Once the world is running, the agents will be moving around the world and performing tasks in different locations, depending on what they are doing and where other agents are located. They can speak to each other and collaborate on tasks, working in parallel towards common goals.

To end the world based on th