### LangChain & GPT-4 for Code Understanding: Twitter Algorithm

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings()

In [3]:
import os
from langchain.document_loaders import TextLoader

root_dir = './the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that cannot be loaded as UTF-8 text (e.g. binaries)
            pass

In [4]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Created a chunk of size 2549, which is longer than the specified 1000
Created a chunk of size 2095, which is longer than the specified 1000
Created a chunk of size 1983, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
Created a chunk of size 1245, which is longer than the specified 1000
Created a chunk of size 1257, which is longer than the specified 1000
Created a chunk of size 2273, which is longer than the specified 1000
Created a chunk of size 1411, which is longer than the specified 1000
Created a chunk of size 1263, which is longer than the specified 1000
Created a chunk of size 1672, which is longer than the specified 1000
Created a chunk of size 1794, which is longer than the specified 1000
Created a chunk of size 1034, which is longer than the specified 1000
Created a chunk of size 1201, which is longer than the specified 1000
Created a chunk of s
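
The warnings above appear because `CharacterTextSplitter` splits only on its separator (a blank line by default) and then merges the pieces up to `chunk_size`. A single piece that is already longer than `chunk_size` is kept whole. The following is a simplified pure-Python sketch of that behavior, not the library's implementation; the function name `split_on_separator` and the sample text are made up for illustration.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces (simplified, no overlap)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece with no separator inside cannot be split further,
            # so it may exceed chunk_size on its own.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input: one 2549-character block without any blank line.
text = "a" * 300 + "\n\n" + "b" * 2549 + "\n\n" + "c" * 200
chunks = split_on_separator(text)
oversized = [len(c) for c in chunks if len(c) > 1000]
print([len(c) for c in chunks])
print(oversized)
```

If hard size limits matter, a splitter that falls back to finer separators (for example `RecursiveCharacterTextSplitter`) may be a better fit for source code.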

In [5]:
username = "mbilalshahid" # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding_function=embeddings)
db.add_documents(texts)

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!

Batch upload: 31311 samples are being uploaded in 32 batches of batch size 1000

Evaluating ingest: 100%|██████████| 32/32 [19:14<00:00

Dataset(path='hub://mbilalshahid/twitter-algorithm', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype        shape       dtype  compression
  -------    -------      -------     -------  ------- 
 embedding  embedding  (31311, 1536)  float32   None   
    id        text      (31311, 1)      str     None   
 metadata     json      (31311, 1)      str     None   
   text       text      (31311, 1)      str     None   
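
As a quick sanity check on the numbers in this output, the batch count and the raw size of the embedding tensor can be recomputed from the reported shape. This is only back-of-the-envelope arithmetic, assuming uncompressed `float32` values as shown in the table.

```python
import math

num_samples = 31311      # rows reported in the summary table
embedding_dim = 1536     # second dimension of the embedding tensor
bytes_per_float32 = 4
batch_size = 1000

# 31311 samples in batches of 1000 need one final partial batch.
num_batches = math.ceil(num_samples / batch_size)

# Raw size of the embedding tensor, ignoring metadata and text.
embedding_bytes = num_samples * embedding_dim * bytes_per_float32
embedding_mib = embedding_bytes / 2**20

print(num_batches)
print(embedding_bytes)
print(round(embedding_mib, 1))
```

The result is 32 batches and roughly 183 MiB of embeddings.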


['55a08f80-5550-11ee-b60b-d83bbf21f498',
 '55a09c24-5550-11ee-98a7-d83bbf21f498',
 '55a09c25-5550-11ee-b082-d83bbf21f498',
 '55a09c26-5550-11ee-b8f9-d83bbf21f498',
 '55a09c27-5550-11ee-9aea-d83bbf21f498',
 '55a09c28-5550-11ee-8297-d83bbf21f498',
 '55a09c29-5550-11ee-9092-d83bbf21f498',
 '55a09c2a-5550-11ee-bdca-d83bbf21f498',
 '55a09c2b-5550-11ee-a0b8-d83bbf21f498',
 '55a09c2c-5550-11ee-9f19-d83bbf21f498',
 '55a09c2d-5550-11ee-99ea-d83bbf21f498',
 '55a09c2e-5550-11ee-bac4-d83bbf21f498',
 '55a09c2f-5550-11ee-b229-d83bbf21f498',
 '55a09c30-5550-11ee-b51c-d83bbf21f498',
 '55a09c31-5550-11ee-9369-d83bbf21f498',
 '55a09c32-5550-11ee-bdf0-d83bbf21f498',
 '55a09c33-5550-11ee-9e80-d83bbf21f498',
 '55a09c34-5550-11ee-96ea-d83bbf21f498',
 '55a09c35-5550-11ee-a556-d83bbf21f498',
 '55a09c36-5550-11ee-b910-d83bbf21f498',
 '55a09c37-5550-11ee-ba1e-d83bbf21f498',
 '55a09c38-5550-11ee-8dfa-d83bbf21f498',
 '55a09c39-5550-11ee-a04c-d83bbf21f498',
 '55a09c3a-5550-11ee-88e9-d83bbf21f498',
 '55a09c3b-5550-

In [6]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
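
With these settings the retriever fetches 100 candidates by cosine similarity and then keeps 10 of them using maximal marginal relevance (MMR), which trades relevance to the query against redundancy with chunks already picked. The following is a minimal pure-Python sketch of that selection step, not the library's code; the function names, the `lambda_mult=0.5` weighting, and the toy vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by greedy MMR."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings": candidates 0 and 1 are near-duplicates.
query = [1.0, 0.2]
candidates = [[1.0, 0.0], [1.0, 0.1], [0.6, 0.8]]

top_k = sorted(range(len(candidates)),
               key=lambda i: cosine(query, candidates[i]),
               reverse=True)[:2]
mmr = mmr_select(query, candidates, k=2)
print(top_k)
print(mmr)
```

Plain top-k returns both near-duplicates, while MMR swaps the second one for the more distinct candidate.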

In [7]:
def filter_fn(x):
    # Drop chunks whose text mentions com.google
    if 'com.google' in x['text'].data()['value']:
        return False
    # Keep only chunks whose source path contains 'scala' or 'py'
    metadata = x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']

# Uncomment the following line to apply custom filtering
# retriever.search_kwargs['filter'] = filter_fn
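
Note that the substring checks on `metadata['source']` match anywhere in the path, so a file such as `pyramid.md` would also pass. If only Scala and Python files are wanted, one possible variation is to compare file extensions instead. The sketch below works on plain path strings; the helper name `has_wanted_extension` and the example paths are hypothetical.

```python
import os

# Assumption: only these two extensions are of interest.
WANTED_EXTENSIONS = {".scala", ".py"}


def has_wanted_extension(source):
    """Return True if the path ends in one of the wanted extensions."""
    _, ext = os.path.splitext(source)
    return ext.lower() in WANTED_EXTENSIONS


# Hypothetical source paths, not taken from the repository.
paths = ["./the-algorithm/foo/Bar.scala", "x/train.py", "docs/pyramid.md"]
for path in paths:
    loose = 'scala' in path or 'py' in path
    print(path, loose, has_wanted_extension(path))
```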

In [8]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model='gpt-3.5-turbo') # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [9]:
questions = [
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",   
    "How do you get assigned to SimClusters?",
    "What is needed to migrate from one SimClusters to another SimClusters?",
    "How much do I get boosted within my cluster?",
    "How does Heavy ranker work? What are its main inputs?",
    "How can one influence Heavy ranker?",
    "Why do threads and long tweets do so well on the platform?",
    "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What does favCountParams do? 

**Answer**: The `favCountParams` represents a parameter related to the favorite count of a tweet. However, without further context, it is not possible to determine its exact purpose or functionality. 

-> **Question**: is it Likes + Bookmarks, or not clear from the code? 

**Answer**: No, it is not clear from the given code if `favCountParams` represents the sum of likes and bookmarks. 

-> **Question**: What are the major negative modifiers that lower your linear ranking parameters? 

**Answer**: The major negative modifiers that lower the linear ranking parameters are:
- No text hit demotion
- URL only hit demotion
- Name only hit demotion
- Separate text/name demotion
- Separate text/url demotion 

-> **Question**: How do you get assigned to SimClusters? 

**Answer**: In the given context, it is not explicitly mentioned how users are assigned to SimClusters. However, based on the information provided, SimClusters is a general-purpose r