## ActiveLoop Course - LangChain & Vector Databases in Production

> https://learn.activeloop.ai/enrollments

---

In [1]:
import sys
sys.path.append("/Users/shaunaksen/Documents/personal-projects/Natural-Language-Processing/LLM Concepts/mini_projects/chatgpt_clone")

In [2]:
def print_wrap(text, width=100):
    """
    Prints text wrapped to a set width.

    Args:
        text: The text to wrap.
        width: The maximum width of each line.

    Returns:
        None
    """
    words = text.split()
    current_line = ""
    for word in words:
        # Start a new line when adding the next word would exceed the width.
        if current_line and len(current_line) + len(word) + 1 > width:
            print(current_line)
            current_line = word
        elif current_line:
            current_line += " " + word
        else:
            # First word on a line: no leading space.
            current_line = word
    if current_line:
        print(current_line)

In [12]:
import os

# models and embeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.llms import AzureOpenAI
from langchain.embeddings import AzureOpenAIEmbeddings

# prompts and utils
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# memory
from langchain.memory import ConversationBufferMemory

# chains
from langchain.chains import LLMChain, ConversationChain, RetrievalQA

# vectorstores
from langchain.vectorstores import DeepLake, FAISS
from langchain.vectorstores.chroma import Chroma

# agents and tools
from langchain.agents import initialize_agent, Tool, load_tools
from langchain.agents import AgentType
from langchain.utilities import GoogleSearchAPIWrapper

from creds import AZURE_API_BASE, AZURE_API_KEY, AZURE_API_VERSION, ACTIVELOOP_TOKEN, GOOGLE_API_KEY, GOOGLE_CSE_ID


In [4]:
def init_models() -> list:
    embeddings = AzureOpenAIEmbeddings(
        deployment="text-embedding-ada-002",
        model="text-embedding-ada-002",
        openai_api_type='azure',
        azure_endpoint=AZURE_API_BASE,
        openai_api_key=AZURE_API_KEY,
        openai_api_version=AZURE_API_VERSION,
        chunk_size=1,
        max_retries=1,  # must be an int; 1e0 is a float
    )

    llm_chat_gpt_4 = AzureChatOpenAI(deployment_name='gpt-4-32k',
                          model='gpt-4-32k',
                          openai_api_type='azure',
                          azure_endpoint=AZURE_API_BASE,
                          openai_api_key=AZURE_API_KEY,
                          openai_api_version=AZURE_API_VERSION,
                          max_retries=2,
                          temperature=0,
                          streaming=True
                          )

    llm_gpt_4 = AzureOpenAI(deployment_name='gpt-4-32k',
                            model='gpt-4-32k',
                            openai_api_type='azure',
                            azure_endpoint=AZURE_API_BASE,
                            openai_api_key=AZURE_API_KEY,
                            openai_api_version=AZURE_API_VERSION,
                            max_retries=2,
                            temperature=0,
                            streaming=True
                            )
    
    llm_text_davinci = AzureOpenAI(deployment_name='text-davinci-003',
                          model='text-davinci-003',
                          openai_api_type='azure',
                          azure_endpoint=AZURE_API_BASE,
                          openai_api_key=AZURE_API_KEY,
                          openai_api_version=AZURE_API_VERSION,
                          max_retries=2,
                          temperature=0.5)
    
    return [embeddings, llm_chat_gpt_4, llm_gpt_4, llm_text_davinci]

In [5]:
embeddings, llm_chat_gpt_4, llm_gpt_4, llm_text_davinci = init_models()

In [6]:
text = "Suggest a personalized workout routine for someone looking to improve cardiovascular endurance and prefers outdoor activities."
print(llm_text_davinci(text))



1. Interval runs: Start with a warm-up jog for 5-10 minutes, then alternate between 1 minute of sprinting and 1 minute of jogging for 30 minutes. 

2. Hill sprints: Find a hill that is about 200-400 meters long. Warm up with a jog for 5-10 minutes, then sprint up the hill for 30 seconds, rest for 30 seconds, and repeat for 10 minutes. 

3. Plyometric exercises: Perform exercises such as jump squats, box jumps, and burpees for 30 minutes. 

4. Outdoor circuit: Create an outdoor circuit with exercises such as squats, lunges, push-ups, and jumping jacks. Perform each exercise for 1 minute and rest for 30 seconds between each exercise. Repeat the circuit 3-4 times. 

5. Swimming: Swim laps for 30 minutes in a pool or lake. 

6. Hiking: Go on a hike with some hills for 30-60 minutes. 

7. Cycling: Cycle outdoors for 30-60 minutes. 

8. Rowing: Use a rowing machine or go to a lake to row for 30-60 minutes.


## The Chains

In LangChain, a chain is an end-to-end wrapper around multiple individual components, providing a way to accomplish a common use case by combining these components in a specific sequence. The most commonly used type of chain is the LLMChain, which consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser.

The LLMChain works as follows:



1. Takes one or more input variables.
2. Uses the PromptTemplate to format the input variables into a prompt.
3. Passes the formatted prompt to the model (LLM or ChatModel).
4. If an output parser is provided, uses the OutputParser to parse the output of the LLM into a final format.
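
The steps above can be sketched in plain Python. This is a minimal illustration of the format, call, parse sequence, not LangChain's actual implementation: `run_chain`, `fake_model`, and `comma_list_parser` are hypothetical names, and `fake_model` returns a canned string instead of calling an LLM.

```python
from typing import Callable, Optional


def run_chain(
    template: str,
    model: Callable[[str], str],
    parser: Optional[Callable[[str], object]] = None,
    **inputs: str,
):
    """Format the inputs into a prompt, call the model, optionally parse."""
    prompt = template.format(**inputs)      # steps 1-2: fill the template
    raw_output = model(prompt)              # step 3: call the model
    # Step 4: only parse when a parser was supplied.
    return parser(raw_output) if parser else raw_output


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: always returns the same text.
    return " BrushBoo, GreenGrin "


def comma_list_parser(text: str) -> list:
    # Turn "a, b" into ["a", "b"], dropping surrounding whitespace.
    return [item.strip() for item in text.split(",") if item.strip()]


names = run_chain(
    "Suggest names for a company that makes {product}.",
    fake_model,
    comma_list_parser,
    product="bamboo toothbrushes",
)
print(names)
```

Without a parser, `run_chain` returns the raw model string, which corresponds to the optional nature of the last step.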

In the next example, we create a chain that generates a possible name for a company based on the product it makes (bamboo toothbrushes in the run below). Using LangChain's LLMChain and PromptTemplate classes together with an LLM wrapper (AzureOpenAI here), we can define our prompt, set the input variables, and generate creative outputs.



In [7]:
prompt = PromptTemplate(
    input_variables=['product', 'character'],
    template="What is a good name for a company that makes {product}. Be {character} in your response?"
)

chain = LLMChain(prompt=prompt, llm=llm_text_davinci, verbose=True)

In [8]:
print_wrap(chain.run(product="bamboo toothbrushes", character="like Chandler Bibg"))



> Entering new LLMChain chain...
Prompt after formatting:
What is a good name for a company that makes bamboo toothbrushes. Be like Chandler Bibg in your response?

> Finished chain.
 Bambooth Toothbrushes.


## The Memory


In LangChain, Memory refers to the mechanism that stores and manages the conversation history between a user and the AI. It helps maintain context and coherency throughout the interaction, enabling the AI to generate more relevant and accurate responses. Memory, such as ConversationBufferMemory, acts as a wrapper around ChatMessageHistory, extracting the messages and providing them to the chain for better context-aware generation.
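
To make the idea concrete, here is a minimal sketch of a buffer-style memory, assuming the simplest possible design: every turn is stored and the whole history is pasted into the next prompt. `BufferMemory` and `build_prompt` are hypothetical names for illustration and do not reproduce LangChain's classes.

```python
class BufferMemory:
    """Toy conversation buffer: stores every turn and renders it as text."""

    def __init__(self):
        self.turns = []  # list of (speaker, message) tuples

    def save(self, human: str, ai: str) -> None:
        # Record one full exchange.
        self.turns.append(("Human", human))
        self.turns.append(("AI", ai))

    def render(self) -> str:
        return "\n".join(f"{speaker}: {message}" for speaker, message in self.turns)


def build_prompt(memory: BufferMemory, user_input: str) -> str:
    # The stored history is placed before the new input so the model sees context.
    return f"Current conversation:\n{memory.render()}\nHuman: {user_input}\nAI:"


memory_sketch = BufferMemory()
memory_sketch.save("Tell me about yourself.", "I'm an AI assistant.")
print(build_prompt(memory_sketch, "What can you do?"))
```

Because the full history is resent on every call, the prompt grows with each turn; that is the main trade-off of a plain buffer.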



In [9]:
memory = ConversationBufferMemory()
conversation_chain = ConversationChain(
    llm=llm_text_davinci,
    verbose=True,
    memory=memory
)

In [10]:
# Start the conversation
conversation_chain.predict(input="Tell me about yourself.")

# Continue the conversation
conversation_chain.predict(input="What can you do?")
conversation_chain.predict(input="How can you help me with data analysis?")

# Display the conversation
print(conversation_chain)



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

Human: Tell me about yourself.
AI:

> Finished chain.


> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Tell me about yourself.
AI:  Hi there! I'm an AI created to help people with their everyday tasks. I'm programmed to understand natural language and provide helpful information. I'm also equipped with a knowledge base that I can draw from

## Deep Lake VectorStore

Deep Lake provides storage for embeddings and their corresponding metadata in the context of LLM apps. It enables hybrid searches on these embeddings and their attributes for efficient data retrieval. It also integrates with LangChain, facilitating the development and deployment of applications.



In [7]:
# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

In [8]:
docs

[Document(page_content='Napoleon Bonaparte was born in 15 August 1769'),
 Document(page_content='Louis XIV was born in 5 September 1638')]

In [9]:
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

In [18]:
# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "shaunaksen"  # the org id, not the API token
my_activeloop_dataset_name = "langchain_course_from_zero_to_hero"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings)

Deep Lake Dataset in hub://shaunaksen/langchain_course_from_zero_to_hero already exists, loading from the storage


In [11]:
# add documents to our Deep Lake dataset
db.add_documents(docs)

Creating 2 embeddings in 1 batches of size 2:: 100%|██████████| 1/1 [00:23<00:00, 23.39s/it]

Dataset(path='hub://shaunaksen/langchain_course_from_zero_to_hero', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (12, 1536)  float32   None   
    ids      text     (12, 1)      str     None   
 metadata    json     (12, 1)      str     None   
   text      text     (12, 1)      str     None   





['b1b5b5ca-a1c7-11ee-847c-527b27766e9a',
 'b1b5b6ce-a1c7-11ee-847c-527b27766e9a']

If the cells above ran without errors, you’ve just created your first Deep Lake dataset!

Now, let's create a RetrievalQA chain:



In [19]:
retriever = db.as_retriever()
retriever_chain = RetrievalQA.from_chain_type(
    llm=llm_text_davinci,
    chain_type="stuff",
    retriever=retriever
)

In [20]:
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType

tools = [
    Tool(
        name="Retrieval QA System",
        func=retriever_chain.run,
        description="Useful for answering questions."
    ),
]

agent = initialize_agent(
    tools,
    llm_text_davinci,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

In [None]:
response = agent.run("When was Napoleon born?")
print(response)

In [None]:
retriever_chain.invoke("When was Napoleon born?")

### Basic Retriever

In [40]:
# Use a context manager so the file handle is closed after reading.
with open("state_of_the_union.txt", "r", encoding="utf-8") as f:
    full_text = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(full_text)


In [41]:
texts[:3]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.',
 'Groups of citizens blocking tanks with

In [42]:
db = Chroma.from_texts(texts, embeddings)

In [44]:
retriever = db.as_retriever()

In [45]:
retrieved_docs = retriever.invoke(
    "What did the president say about Ketanji Brown Jackson?"
)
print(retrieved_docs[0].page_content)

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.


### RetrievalQA Chain

In [46]:
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm_text_davinci,
    chain_type="stuff",
    retriever=retriever
)

In [47]:
retrieval_qa.run("What did the president say about Ketanji Brown Jackson?")

" The president said that Ketanji Brown Jackson is one of our nation's top legal minds, a former top litigator in private practice, a former federal public defender, from a family of public school educators and police officers, and a consensus builder. Since she's been nominated, she's received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

### Agent with RetrievalQA chain as a tool

In [48]:
tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    )
]

agent = initialize_agent(
    tools=tools,
    llm=llm_text_davinci,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)


In [49]:
response = agent.run("What did the president say about Ketanji Brown Jackson?")
print(response)



> Entering new AgentExecutor chain...
 I need to find a source that has a statement from the president about Ketanji Brown Jackson.
Action: Retrieval QA System
Action Input: "What did the president say about Ketanji Brown Jackson?"
Observation:  The president said that Ketanji Brown Jackson is one of our nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.
Thought: I now know the final answer
Final Answer: The president said that Ketanji Brown Jackson is one of our nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also

### Vector Databases

**Here's a breakdown of how a basic vector database works and why we need one:**

**Key Concepts:**

- **Vectors:** Numerical representations of data that capture its features and characteristics in a high-dimensional space.
- **Embeddings:** The vectors produced by converting raw data (text, images, audio, etc.) with a machine learning model; "embedding" also names that conversion process.
- **Approximate Nearest Neighbor (ANN) Search:** A technique for finding similar vectors efficiently without exhaustively comparing every pair.

**How It Works:**

1. **Data Embedding:**
   - Raw data is transformed into vectors using pre-trained or custom-built embedding models.
   - These vectors capture semantic relationships and patterns within the data.

2. **Vector Indexing:**
   - The vectors are indexed using specialized algorithms designed for ANN search.
   - Common algorithms include:
     - Product Quantization (PQ)
     - Locality Sensitive Hashing (LSH)
     - Hierarchical Navigable Small World (HNSW)

3. **Querying and Retrieval:**
   - When a query is made, its embedding is calculated.
   - The database efficiently finds the nearest neighbors (most similar vectors) to the query vector using the ANN algorithms.
   - The corresponding raw data associated with those vectors is retrieved and returned to the user.
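
The three stages above can be sketched with a toy in-memory store. This is a simplified illustration under stated assumptions: the two-dimensional vectors are hand-written stand-ins for real embeddings, `TinyVectorStore` is a hypothetical name, and the search is an exhaustive cosine-similarity scan rather than an ANN index such as HNSW.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Keeps (vector, raw_text) pairs and ranks them by similarity to a query."""

    def __init__(self):
        self.items = []

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query_vector, k=1):
        # Exhaustive scan: compare the query against every stored vector.
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([1.0, 0.0], "document about cats")
store.add([0.0, 1.0], "document about stock markets")
print(store.search([0.9, 0.1], k=1))
```

An exhaustive scan compares the query with every stored vector, so its cost grows linearly with the collection; ANN indexes trade a little accuracy to avoid that full comparison.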

**Why We Need Vector Databases:**

- **Unstructured Data Handling:** They excel at managing and querying unstructured data that traditional databases struggle with.
- **Similarity Search:** They enable fast and accurate retrieval of similar items based on their semantic meaning, not just exact matches.
- **Richer Context Understanding:** Vector representations capture deeper semantic relationships, facilitating more nuanced and contextual search results.
- **Scalable AI Integration:** They integrate seamlessly with machine learning models, allowing for real-time analysis and decision-making based on vector data.

**Common Use Cases:**

- **Recommender Systems:** Suggesting similar products, movies, music, articles, etc.
- **Image and Video Search:** Finding visually similar images or videos.
- **Text Search and Semantic Analysis:** Semantic search, document clustering, sentiment analysis, chatbots, plagiarism detection.
- **Facial Recognition:** Identifying similar faces or objects.
- **Drug Discovery:** Identifying potential drug candidates based on molecular structure similarity.
- **Fraud Detection:** Anomaly detection in financial transactions.
- **Personalization:** Tailoring user experiences based on preferences and behavior.


## Agents in LangChain

In LangChain, agents are high-level components that use large language models (LLMs) to determine which actions to take and in what order. An action is either using a tool and observing its output or returning a response to the user. Tools are functions that perform specific duties, such as Google Search, database lookups, or a Python REPL.

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done.
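
The decide, act, observe loop can be sketched as follows. This is a rough illustration, not LangChain's agent executor: `run_agent` and `scripted_decide` are hypothetical names, the decision function is scripted rather than backed by an LLM, and the stop message is made up for this example.

```python
def run_agent(decide, tools, question, max_iterations=3):
    """Loop: decide on an action, run the tool, record the observation."""
    scratchpad = []  # (tool_name, tool_input, observation) triples
    for _ in range(max_iterations):
        step = decide(question, scratchpad)
        if step["type"] == "finish":
            return step["answer"]
        # Take the action and keep the observation for the next decision.
        observation = tools[step["tool"]](step["input"])
        scratchpad.append((step["tool"], step["input"], observation))
    return "Agent stopped: iteration limit reached."


def scripted_decide(question, scratchpad):
    # Hypothetical stand-in for the LLM: look something up once, then answer.
    if not scratchpad:
        return {"type": "action", "tool": "lookup", "input": question}
    return {"type": "finish", "answer": scratchpad[-1][2]}


toy_tools = {"lookup": lambda query: "Napoleon Bonaparte was born on 15 August 1769"}
print(run_agent(scripted_decide, toy_tools, "When was Napoleon born?"))
```

The `max_iterations` cap guarantees the loop ends even if the decision function never chooses to finish.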

Several types of agents are available in LangChain:

- The zero-shot-react-description agent uses the ReAct framework to decide which tool to employ based purely on the tool's description. It necessitates a description of each tool.
- The react-docstore agent engages with a docstore through the ReAct framework. It needs two tools: a Search tool and a Lookup tool. The Search tool finds a document, and the Lookup tool searches for a term in the most recently discovered document.
- The self-ask-with-search agent employs a single tool named Intermediate Answer, which is capable of looking up factual responses to queries. It mirrors the setup of the original self-ask with search paper, where a Google search API was provided as the tool.
- The conversational-react-description agent is designed for conversational situations. It uses the ReAct framework to select a tool and uses memory to remember past conversation interactions.

In our example, the agent uses the Google Search tool to look up recent information about the Mars rover and generates a response based on this information.



In [7]:

os.environ["GOOGLE_CSE_ID"] = GOOGLE_CSE_ID
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [11]:
search = GoogleSearchAPIWrapper()

tool = Tool(
    name="Google Search",
    description="Search Google for recent results.",
    func=search.run,
)

In [9]:
tool.run("Obama's first name?")

"1 Child's First Name. (Type or print). BARACK. CERTIFICATE OF LIVE BIRTH lb ... OBAMA, II. Day. 4. 6b. Island. Year. 5b. Hour. 1961 7:24 P.M.. Oahu. 6d. Is Place\xa0... His last name, Obama, was derived from his Luo descent. Obama's parents met in ... First 100 days. Main article: First 100 days of Barack Obama's presidency. Apr 12, 2017 ... President Barack Obama's full name is Barack Hussein Obama. Does it mean that he is a Muslim? 202,116 Views. First Lady Michelle LaVaughn Robinson Obama is a lawyer, writer, and the wife of the 44th President, Barack Obama. She is the first African-American First\xa0... Apr 2, 2018 ... BARACK : Barkat and Mubarak both are derived from it in Hindi and Urdu. Roughly meaning blessing, abundance etc. Husen or Hussein from which\xa0... Jan 19, 2017 ... Hopeful parents named their sons for the first Black president, whose name is a variation of the Hebrew name Baruch, which means “blessed”\xa0... The Middle East remained a key foreign policy challenge. 

The `Tool` object represents a specific capability or function the system can use. In this case, it's a tool for performing Google searches.

It is initialized with three parameters:

- name parameter: This is a string that serves as a unique identifier for the tool. In this case, the name of the tool is "google-search".
- func parameter: This parameter is assigned the function that the tool will execute when called. In this case, it's the run method of the search object, which presumably performs a Google search.
- description parameter: This is a string that briefly explains what the tool does. The description explains that this tool is helpful when you need to use Google to answer questions about current events.
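
As a rough mental model, a tool can be thought of as a small record holding these three fields. The sketch below is an assumption-laden illustration, not LangChain's `Tool` class: `SimpleTool` and `describe_tools` are hypothetical names, and the example tool just upper-cases its input.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    name: str                   # unique identifier the agent refers to
    func: Callable[[str], str]  # function executed when the tool is called
    description: str            # text that tells the agent when to use the tool

    def run(self, tool_input: str) -> str:
        return self.func(tool_input)


def describe_tools(tools) -> str:
    # The kind of listing that could be placed in an agent prompt.
    return "\n".join(f"{tool.name}: {tool.description}" for tool in tools)


shout = SimpleTool(name="shout", func=str.upper, description="Upper-cases the input text.")
print(shout.run("hello"))
print(describe_tools([shout]))
```

Since a zero-shot ReAct agent picks tools from their descriptions alone, a precise description matters as much as the function itself.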

In [13]:
search = GoogleSearchAPIWrapper()
tools = [
    Tool(
        name="google-search",
        func=search.run,
        description="useful for when you need to search google to answer questions about current events"

    )
]

Next, we create an agent that uses our Google Search tool:

- initialize_agent(): This function call creates and initializes an agent. An agent is a component that determines which actions to take based on user input. These actions can be using a tool, returning a response to the user, or something else.

- tools: represents the list of Tool objects that the agent can use.

- agent="zero-shot-react-description": The "zero-shot-react-description" type of an Agent uses the ReAct framework to decide which tool to use based only on the tool's description.

- verbose=True: when set to True, it will cause the Agent to print more detailed information about what it's doing. This is useful for debugging and understanding what's happening under the hood.

- max_iterations=2: sets a limit on the number of iterations the agent can perform before stopping. This prevents the agent from looping indefinitely, which could incur unwanted monetary costs.

In [14]:
agent = initialize_agent(
    tools=tools,
    llm=llm_text_davinci,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=2
)

In [17]:
response = agent.run("What's the latest news about the Mars rover?")
print_wrap(response)



> Entering new AgentExecutor chain...
 I need to find out what the latest news is about the Mars rover.
Action: google-search
Action Input: "Mars rover news"
Observation: A new video rings in the rover's ninth year on Mars, letting viewers tour Curiosity's location on a Martian mountain. More. NASA's Curiosity Mars rover used its ... Apr 25, 2023 ... The fully robotic Zhurong, named after a mythical Chinese god of fire, was expected to have woken up in December after entering a planned sleep ... NASA's Mars 2020 Perseverance rover will look for signs of past microbial life, cache rock and soil samples, and prepare for future human exploration. Dec 15, 2021 ... These and other findings were presented today during a news briefing at the American Geophysical Union fall science meeting in New Orleans. Even ... NASA InSight Study Provides Clearest Look Ever at Martian Core ... A pair of quakes in 2021 sent seismic waves deep into the Red Planet's core