In [1]:
import warnings
warnings.filterwarnings("ignore")
%load_ext watermark
%watermark

Last updated: 2023-11-17T10:23:54.854927+05:30

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
CPU cores   : 8
Architecture: 64bit



### The New Way to Understand Code Repositories

The new way is four steps that take less than an hour to build:

1. Index the codebase
2. Store embeddings and the code in Deep Lake
3. Use Conversational Retriever Chain from LangChain
4. Ask any questions you’d like!


#### How to Build Code Understanding App with LangChain, GPT-4, & Conversational Retriever Chain?

1. Index the Codebase: Clone the target repository, load all of its files, split the files into chunks, and start the indexing process. Alternatively, skip this step and use a pre-indexed dataset.
2. Store Embeddings and the Code: The code chunks are embedded with a code-aware embedding model and saved in the Deep Lake VectorStore.
3. Assemble the Retriever: The Conversational Retriever Chain searches the VectorStore for the code chunks most relevant to a given query.
4. Build the Conversational Chain: Customize the retriever settings and define any user-defined filters as needed.
5. Pose Questions: Create a list of questions about the codebase, then use the Conversational Retrieval Chain to answer them. The LLM (GPT-4, in this case) generates detailed, context-aware answers based on the retrieved code segments and the conversation history.
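
The retrieval part of these steps can be illustrated with a toy sketch that needs no API keys. It is not the method used in this notebook: the bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, the in-memory list stands in for the Deep Lake VectorStore, and the sample chunks are made up. The final step, sending the retrieved chunks to an LLM, is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Index and store: pair every chunk with its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieve: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Made-up code chunks, used only for illustration.
chunks = [
    "def rank_tweets(candidates): return sorted(candidates, key=score)",
    "class SpamFilter: drops tweets flagged as spam",
    "def load_config(path): read settings from disk",
]
index = build_index(chunks)
top = retrieve(index, "how are tweets ranked by score", k=1)
print(top[0])
```

A real embedding model captures meaning rather than exact word overlap, which is why the notebook uses OpenAI embeddings instead of anything like `embed` above.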


#### Step 1: Installing required libraries and authenticating with Deep Lake and OpenAI

First, we will install everything we’ll need.

In [2]:
#pip install --upgrade langchain deeplake openai tiktoken
import langchain
import deeplake
import openai
import tiktoken
%watermark --iversions 

tiktoken : 0.5.1
langchain: 0.0.336
deeplake : 3.8.6
openai   : 1.3.2



Next, let’s import the necessary packages, make sure the Activeloop and OpenAI keys are available in the ACTIVELOOP_TOKEN and OPENAI_API_KEY environment variables, and define the OpenAI embeddings.

In [3]:
# Either store the OpenAI key in the OPENAI_API_KEY environment variable,
# or read it from a config.ini file as shown here.
import configparser
workingFolder='C:\\Users\\jfrancis\\OneDrive - GalaxE. Solutions, Inc\\GalaxE D Drive\\AI Journey\\Gen AI\\'
# Read the configuration file
config = configparser.ConfigParser()
# workingFolder already ends with a path separator
config.read(workingFolder + 'config.ini')
OPENAI_API_KEY=config.get('General','OPENAI_API_KEY')
ACTIVELOOP_TOKEN=config.get('General','ACTIVELOOP_TOKEN')
ACTIVELOOP_ORG_ID=config.get('General','ACTIVELOOP_ORG_ID')
HUGGINGFACEHUB_API_TOKEN=config.get('General','HUGGINGFACEHUB_API_TOKEN')
GOOGLE_API_KEY=config.get('General','GOOGLE_API_KEY')
GOOGLE_CSE_ID=config.get('General','GOOGLE_CSE_ID')
COHERE_API_KEY=config.get('General','COHERE_API_KEY')

In [4]:
# Obtain the tokens from the OpenAI and Activeloop websites beforehand; here they are read from config.ini.
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN
# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = ACTIVELOOP_ORG_ID
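
A missing or empty key usually only surfaces later as an authentication error. A small check like the following could fail fast instead. This is a sketch: `missing_keys` is a hypothetical helper, not part of LangChain, Deep Lake, or the OpenAI client.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


required = ["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"]
missing = missing_keys(required)
if missing:
    print("Missing credentials:", ", ".join(missing))
```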

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings()

#### Step 2: Indexing the Twitter Algorithm Code Base (Optional)

You can skip this part and jump straight to using an already indexed dataset, as this example does in Step 3. To index the code base yourself, first clone the repository, then load the files, break them into chunks, and embed the chunks with OpenAI embeddings:

In [6]:
# Replace with any repository of your choice.
# Run this from Git Bash in the folder where you want the clone.
# git clone https://github.com/twitter/the-algorithm

Next, load all files inside the repository.

In [7]:
import os
from langchain.document_loaders import TextLoader

root_dir = workingFolder + 'Twitter\\the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try: 
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that cannot be read as UTF-8 text (for example, binaries)
            pass
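
The loop above tries every file and silently skips the ones that fail to load. One possible variation is to select files by extension first, so that images and other binaries are never opened. This is a sketch under assumptions: `find_source_files` is a hypothetical helper, and the suffix list is a guess at what matters in a given repository.

```python
import os

# Assumption: these suffixes cover the files of interest; adjust as needed.
SOURCE_SUFFIXES = (".py", ".scala", ".java", ".rs", ".md")


def find_source_files(root_dir, suffixes=SOURCE_SUFFIXES):
    """Walk root_dir and return paths whose names end with one of suffixes."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(suffixes):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Each returned path could then be passed to `TextLoader` exactly as in the cell above.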


Subsequently, divide the loaded files into chunks:

In [8]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Created a chunk of size 2549, which is longer than the specified 1000
Created a chunk of size 2095, which is longer than the specified 1000
Created a chunk of size 1983, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
Created a chunk of size 1245, which is longer than the specified 1000
Created a chunk of size 1257, which is longer than the specified 1000
Created a chunk of size 2273, which is longer than the specified 1000
Created a chunk of size 1411, which is longer than the specified 1000
Created a chunk of size 1263, which is longer than the specified 1000
Created a chunk of size 1672, which is longer than the specified 1000
Created a chunk of size 1794, which is longer than the specified 1000
Created a chunk of size 1034, which is longer than the specified 1000
Created a chunk of size 1201, which is longer than the specified 1000
Created a chunk of s

Created a chunk of size 1136, which is longer than the specified 1000
Created a chunk of size 1658, which is longer than the specified 1000
Created a chunk of size 1135, which is longer than the specified 1000
Created a chunk of size 1123, which is longer than the specified 1000
Created a chunk of size 1483, which is longer than the specified 1000
Created a chunk of size 1651, which is longer than the specified 1000
Created a chunk of size 1371, which is longer than the specified 1000
Created a chunk of size 1123, which is longer than the specified 1000
Created a chunk of size 1085, which is longer than the specified 1000
Created a chunk of size 1300, which is longer than the specified 1000
Created a chunk of size 1730, which is longer than the specified 1000
Created a chunk of size 1260, which is longer than the specified 1000
Created a chunk of size 2134, which is longer than the specified 1000
Created a chunk of size 1806, which is longer than the specified 1000
Created a chunk of s
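
These warnings appear because `CharacterTextSplitter` splits on a single separator (a blank line by default) and then packs the pieces into chunks. A block of code with no blank line inside it cannot be cut further, so it ends up as one chunk larger than `chunk_size`. The following simplified sketch mimics that behavior; it is not LangChain's implementation, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size cannot be divided further by this
    strategy, so it is emitted as one oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one long block without blank lines.
text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 150
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])  # → [62, 150]
```

The oversized chunks are still indexed. If a strict size limit matters, a splitter that falls back to finer separators, such as LangChain's `RecursiveCharacterTextSplitter`, may be a better fit.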

Now run the indexing. Computing the embeddings and uploading them to Activeloop takes roughly 4 minutes. Afterward, you can make the dataset public:

In [None]:
username = ACTIVELOOP_ORG_ID # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding=embeddings)
db.add_documents(texts)

If the dataset has already been created, you can load it later without recomputing the embeddings, as seen below.

#### Step 3: Conversational Retriever Chain

First, load the dataset, establish the retriever, and create the Conversational Chain:

In [15]:
#db = DeepLake(dataset_path="hub://davitbun/twitter-algorithm", read_only=True, embedding=embeddings)
db = DeepLake(dataset_path="hub://activeloop/twitter-algorithm", read_only=True, embedding=embeddings)

Deep Lake Dataset in hub://activeloop/twitter-algorithm already exists, loading from the storage


Configure the retriever settings:

In [16]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
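
With these settings, the retriever fetches 100 candidates by cosine similarity (`fetch_k`) and then uses maximal marginal relevance (MMR) to pick the 10 it returns (`k`), favoring chunks that are relevant to the query but not near-duplicates of each other. The sketch below is a textbook version of MMR in plain Python, not Deep Lake's or LangChain's implementation; `mmr_select` and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    candidates that resemble ones already selected.
    """
    remaining = list(range(len(candidate_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.3],    # most relevant
    [0.9, 0.31],   # near-duplicate of candidate 0
    [0.9, -0.4],   # slightly less relevant, but different
]
picked = mmr_select(query, candidates, k=2)
print(picked)  # → [0, 2]
```

Ranking by relevance alone (`lambda_mult=1.0`) would return candidates 0 and 1, which carry almost the same information.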

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model='gpt-3.5-turbo')  # switch to 'gpt-4' if your account has access
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
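
Conceptually, each turn of a conversational retrieval chain rewrites the follow-up question into a standalone one using the chat history, retrieves code chunks for that standalone question, and then asks the LLM to answer from those chunks. The sketch below shows only this data flow. The three callables are hypothetical stubs, not LangChain components, and the real chain's prompts and options are not reproduced here.

```python
def conversational_turn(question, chat_history, condense, retrieve, answer):
    """Run one turn: condense the question, fetch context, then answer.

    condense, retrieve, and answer are hypothetical callables standing in
    for the question-rewriting LLM call, the retriever, and the answering
    LLM call.
    """
    # With no history there is nothing to resolve, so use the question as is.
    standalone = condense(question, chat_history) if chat_history else question
    documents = retrieve(standalone)
    return answer(standalone, documents)


# Stub components that only illustrate the data flow.
def condense(question, history):
    last_question, _ = history[-1]
    return f"{question} (follow-up to: {last_question})"


def retrieve(query):
    return [f"code chunk matching '{query}'"]


def answer(query, documents):
    return f"Answer to '{query}' using {len(documents)} chunk(s)"


history = [("What does favCountParams do?", "Not enough information.")]
reply = conversational_turn("Is it Likes + Bookmarks?", history,
                            condense, retrieve, answer)
print(reply)
```

This is why a follow-up such as "is it Likes + Bookmarks?" can be answered at all: the history supplies what "it" refers to before retrieval happens.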

#### Step 4: Ask Questions to the Codebase in Natural Language

Define all the juicy questions you want to be answered:

In [18]:
questions = [
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",   
    "How do you get assigned to SimClusters?",
    "What is needed to migrate from one SimClusters to another SimClusters?",
    "How much do I get boosted within my cluster?",   
    "How does Heavy ranker work. what are it’s main inputs?",
    "How can one influence Heavy ranker?",
    "why threads and long tweets do so well on the platform?",
    "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What does favCountParams do? 

**Answer**: I don't have enough information to answer that question. 

-> **Question**: is it Likes + Bookmarks, or not clear from the code? 

**Answer**: It is not clear from the provided code whether favCountParams is equal to the sum of Likes and Bookmarks. 

-> **Question**: What are the major negative modifiers that lower your linear ranking parameters? 

**Answer**: Based on the given code snippets, the main factors that decrease the linear ranking parameters are:

1. Low reputation: If a user has a low reputation, it will decrease the linear ranking parameter.

2. Low text score: If the text score of a user is low, it will decrease the linear ranking parameter.

3. Low retweet count: If the retweet count of a user is low, it will decrease the linear ranking parameter.

4. Low favorite count: If the favorite count of a user is low, it will decrease the linear ranking parameter.

5. Social filter: If a post is filtered out by the soc

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-k9LQtZKllzwGlzidOA94QWrY on requests per min (RPM): Limit 3, Used 3, Requested 1. Please try again in 20s. Visit https://platform.openai.com/account/rate-limits to learn more. You can increase your rate limit by adding a payment method to your account at https://platform.openai.com/account/billing.', 'type': 'requests', 'param': None, 'code': 'rate_limit_exceeded'}}
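
The run stopped because the account allows only 3 requests per minute and the loop sends requests back to back. One way to get through the remaining questions is to retry after a pause. The helper below is a minimal sketch: `call_with_retry` is a hypothetical function, and the 20-second base delay is taken from the "try again in 20s" hint in the error message.

```python
import time


def call_with_retry(func, *args, max_attempts=5, base_delay=20.0,
                    retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff when retry_on is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# In the notebook, the call inside the loop could then become (assumption):
# result = call_with_retry(
#     qa, {"question": question, "chat_history": chat_history},
#     retry_on=(openai.RateLimitError,),
# )
```

Adding a payment method, as the error message suggests, raises the limit and makes the retries unnecessary.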