<font size="6.2">Summary</font>  

LangChain is an open-source framework designed to simplify the deployment of large language models (LLMs) in production. It offers a model-agnostic toolkit that allows developers to experiment with various LLMs through a unified interface, facilitating easy integration with multiple providers without extensive code changes. The notebook showcases diverse use cases, including chat models, prompt templates, memory management, and chains. Additionally, it highlights Retrieval Augmented Generation (RAG) and the implementation of intelligent agents for Multi-Doc-Chatbot. This makes LangChain a versatile tool for building robust applications using LLMs.

Python functions and data files needed to run this notebook are available via this [link](https://github.com/MehdiRezvandehy/LangChain-in-Action-LLM-Applications-with-RAG-and-Agents.git).

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#How-it-works" data-toc-modified-id="How-it-works-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>How it works</a></span></li><li><span><a href="#LongChain-Building-Blocks" data-toc-modified-id="LongChain-Building-Blocks-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>LongChain Building Blocks</a></span></li></ul></li><li><span><a href="#LangChain-Language-Model" data-toc-modified-id="LangChain-Language-Model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>LangChain Language Model</a></span><ul class="toc-item"><li><span><a href="#Chat-Model" data-toc-modified-id="Chat-Model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Chat Model</a></span></li><li><span><a href="#Prompt-Template" data-toc-modified-id="Prompt-Template-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Prompt Template</a></span></li><li><span><a href="#Parsers" data-toc-modified-id="Parsers-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Parsers</a></span><ul class="toc-item"><li><span><a href="#Output-Parsers" data-toc-modified-id="Output-Parsers-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Output Parsers</a></span></li><li><span><a href="#Pydantic-Output-Parser" data-toc-modified-id="Pydantic-Output-Parser-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Pydantic Output Parser</a></span></li></ul></li></ul></li><li><span><a href="#Memory" data-toc-modified-id="Memory-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Memory</a></span></li><li><span><a href="#Chains" data-toc-modified-id="Chains-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Chains</a></span><ul class="toc-item"><li><span><a href="#Simple-Chain" data-toc-modified-id="Simple-Chain-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Simple 
Chain</a></span></li><li><span><a href="#Sequential-Chain" data-toc-modified-id="Sequential-Chain-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Sequential Chain</a></span><ul class="toc-item"><li><span><a href="#using-prompt-|-model" data-toc-modified-id="using-prompt-|-model-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>using <code>prompt | model</code></a></span></li><li><span><a href="#Application-with-Streamlit" data-toc-modified-id="Application-with-Streamlit-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Application with <a href="https://streamlit.io/" rel="nofollow" target="_blank">Streamlit</a></a></span></li></ul></li><li><span><a href="#Router-Chains" data-toc-modified-id="Router-Chains-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Router Chains</a></span></li></ul></li><li><span><a href="#Document-Loading" data-toc-modified-id="Document-Loading-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Document Loading</a></span><ul class="toc-item"><li><span><a href="#Document-Splitting" data-toc-modified-id="Document-Splitting-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Document Splitting</a></span><ul class="toc-item"><li><span><a href="#CharacterTextSplitter" data-toc-modified-id="CharacterTextSplitter-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span><code>CharacterTextSplitter</code></a></span></li><li><span><a href="#RecursiveCharacterTextSplitter" data-toc-modified-id="RecursiveCharacterTextSplitter-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span><code>RecursiveCharacterTextSplitter</code></a></span></li><li><span><a href="#Vectorstore-&amp;-Embeddings" data-toc-modified-id="Vectorstore-&amp;-Embeddings-5.1.3"><span class="toc-item-num">5.1.3&nbsp;&nbsp;</span>Vectorstore &amp; Embeddings</a></span></li></ul></li><li><span><a href="#Hands-on" data-toc-modified-id="Hands-on-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Hands on</a></span><ul class="toc-item"><li><span><a 
href="#Similarity-Search" data-toc-modified-id="Similarity-Search-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Similarity Search</a></span><ul class="toc-item"><li><span><a href="#Saving-Embeddings-to-Chroma-DB" data-toc-modified-id="Saving-Embeddings-to-Chroma-DB-5.2.1.1"><span class="toc-item-num">5.2.1.1&nbsp;&nbsp;</span>Saving Embeddings to Chroma DB</a></span></li></ul></li><li><span><a href="#Retrieval-Augmented-Generation-(RAG)" data-toc-modified-id="Retrieval-Augmented-Generation-(RAG)-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Retrieval Augmented Generation (RAG)</a></span></li></ul></li></ul></li><li><span><a href="#Agents" data-toc-modified-id="Agents-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Agents</a></span><ul class="toc-item"><li><span><a href="#Math-Agent" data-toc-modified-id="Math-Agent-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Math Agent</a></span></li><li><span><a href="#Adding-General-Knowledge-Tool-for-Agent" data-toc-modified-id="Adding-General-Knowledge-Tool-for-Agent-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Adding General Knowledge Tool for Agent</a></span></li><li><span><a href="#Agents-Types" data-toc-modified-id="Agents-Types-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Agents Types</a></span><ul class="toc-item"><li><span><a href="#zero-shot-react-description" data-toc-modified-id="zero-shot-react-description-6.3.1"><span class="toc-item-num">6.3.1&nbsp;&nbsp;</span><code>zero-shot-react-description</code></a></span></li><li><span><a href="#conversational-react-description" data-toc-modified-id="conversational-react-description-6.3.2"><span class="toc-item-num">6.3.2&nbsp;&nbsp;</span><code>conversational-react-description</code></a></span></li><li><span><a href="#react-docstore-(docstore)" data-toc-modified-id="react-docstore-(docstore)-6.3.3"><span class="toc-item-num">6.3.3&nbsp;&nbsp;</span><code>react-docstore (docstore)</code></a></span></li><li><span><a 
href="#Self-Ask-Agent-(google-search)" data-toc-modified-id="Self-Ask-Agent-(google-search)-6.3.4"><span class="toc-item-num">6.3.4&nbsp;&nbsp;</span>Self-Ask Agent (google-search)</a></span></li></ul></li></ul></li><li><span><a href="#Real-World-Use-Cases" data-toc-modified-id="Real-World-Use-Cases-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Real-World Use Cases</a></span><ul class="toc-item"><li><span><a href="#Bill-Extractor" data-toc-modified-id="Bill-Extractor-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Bill Extractor</a></span></li><li><span><a href="#Multi-Doc-Chatbot" data-toc-modified-id="Multi-Doc-Chatbot-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Multi-Doc-Chatbot</a></span><ul class="toc-item"><li><span><a href="#Streamlit-App:-Full-Multi-Document-Chatbot" data-toc-modified-id="Streamlit-App:-Full-Multi-Document-Chatbot-7.2.1"><span class="toc-item-num">7.2.1&nbsp;&nbsp;</span>Streamlit App: Full Multi-Document Chatbot</a></span></li></ul></li><li><span><a href="#Question-answering" data-toc-modified-id="Question-answering-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Question answering</a></span><ul class="toc-item"><li><span><a href="#Streamlit-App" data-toc-modified-id="Streamlit-App-7.3.1"><span class="toc-item-num">7.3.1&nbsp;&nbsp;</span>Streamlit App</a></span></li></ul></li></ul></li></ul></div>

# Introduction 

LangChain is one of the fastest-growing open-source projects, driven by the surge of interest in large language models (LLMs). Key features that make LangChain powerful include its ability to connect data to language models (such as OpenAI’s GPT via API) and create agent workflows.

**Why does LangChain exist?** 🤔
  
> The landscape of language models is still evolving, and developers face challenges due to a lack of sufficient tooling for production-level deployments. LangChain addresses these gaps by offering a model-agnostic toolkit, enabling developers to experiment with multiple LLMs and identify the best fit for their needs, all within a unified interface, so the codebase does not have to grow substantially as more providers are integrated.

**LangChain’s Community** ⭐️

> When evaluating tools, the community around them is crucial—especially for open-source projects. LangChain has a robust community, with over 51k GitHub stars, 1 million downloads per month, and an active presence on Discord and Twitter.

**Agents in LangChain** 🤖

> A popular concept in the LLM space is the use of agents—programmatic entities capable of executing goals and tasks. LangChain simplifies agent creation using its agents API. Developers can leverage OpenAI functions and other task execution tools, allowing agents to act autonomously. LangChain stands out by providing access to multiple tools within a single interface. The "plan and execute" functionality enables agents to autonomously set goals, plan, and perform tasks with minimal human input. Though current models struggle with long-term autonomy, these capabilities will improve over time.

**Memory with Language Models** 🧠

> One challenge with LLM APIs, such as OpenAI’s, is that they are stateless: every new request must resend the context needed to generate a response. While developers can manage this by saving message histories in Python lists or text files, this approach doesn't scale efficiently. LangChain's memory components help address this limitation.

## How it works

* It takes a document and transforms it into a **VectorStore** (which stores the chunks of data)
![image.png](attachment:image.png)

The VectorStore holds **embeddings**: vector representations of the text. **Embeddings** make search easy, because we can look for the pieces of text that are most similar to a query in the vector space.

* How **Vector Store** works
![image.png](attachment:image.png)
Image retrieved from www.langchain.com

The schematic below shows how information can be retrieved from a vector store:

![image.png](attachment:image.png)
Image retrieved from www.langchain.com
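
The retrieval idea above can be sketched in plain Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not LangChain's vector store API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs and ranks them by similarity."""

    def __init__(self):
        self.entries = []

    def add(self, chunk: str) -> None:
        self.entries.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 1) -> list:
        query_vec = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
for chunk in [
    "the caspian sea is the largest inland body of water",
    "paris is known for the eiffel tower",
    "tehran is the capital of iran",
]:
    store.add(chunk)

# The chunk sharing the most words with the query ranks first.
print(store.search("how large is the caspian sea", k=1))
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count stand-in cannot do.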

Here are LangChain's benefits:

- **Accelerate application development**: Easily combine different components to build applications more quickly.
- **Simplify development**: Complexity is abstracted away, allowing developers to focus on creating high-quality applications.
- **Extensive built-in modules and tools**: A wide range of ready-to-use features for seamless integration into AI applications.
- **Open-source platform**: Rapid innovation with continuous addition and maintenance of new tools and features.

To install LangChain, use pip as shown below. This installs only the bare minimum requirements of LangChain; the dependencies for specific integrations are NOT installed by default and must be installed separately.

`pip install langchain`

Install LangChain for OpenAI:

`pip install langchain-openai`

Next we need to set our `OPENAI_API_KEY`. Unlike the packages above, it is not installed: it is stored in a `.env` file and loaded at runtime.

Then install `python_dotenv`, which loads the variables in our `.env` file into Python. Here is an example:

`%pip install python_dotenv`

`%pip install -U langchain-openai`
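
To make the `.env` mechanism concrete, here is a minimal sketch of what loading such a file involves. `load_env_file` is a hypothetical, simplified stand-in written for this illustration; the real `load_dotenv` handles many more cases (quoting rules, variable expansion, file discovery). The variable name `EXAMPLE_API_KEY` and its value are placeholders.

```python
import os
import tempfile


def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines and export them to os.environ.

    Simplified stand-in for python-dotenv; not its actual implementation.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            # Variables that are already set in the environment are not overridden.
            os.environ.setdefault(key, value)
    return loaded


with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# secrets stay out of the notebook\n")
        handle.write("EXAMPLE_API_KEY=placeholder-value\n")
    values = load_env_file(env_path)

print(values)
print(os.getenv("EXAMPLE_API_KEY"))
```

Keeping the key in a `.env` file that is excluded from version control avoids committing secrets together with the notebook.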

See a simple application of LangChain with OpenAI:

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import openai
import os
from dotenv import find_dotenv, load_dotenv

# openai wrapper for langchain (provided by the langchain-openai package installed above)
from langchain_openai import OpenAI

# to find all environmental variables
load_dotenv(find_dotenv())

# create a variable for model
model_llm = "gpt-3.5-turbo"

# set OPENAI_API_KEY explicitly (placeholder value; skip this line if the key is already in .env)
os.environ["OPENAI_API_KEY"] = "........"

openai.api_key = os.getenv("OPENAI_API_KEY")
# llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"))

llm = OpenAI(temperature=0.75)

print (llm.invoke("what is the weather in calgary now"))

As an AI, I do not have access to real-time weather data. Please check a trusted weather source for the current weather in Calgary.


## LangChain Building Blocks

* **Components** ![image.png](attachment:image.png)

> **LLM Wrappers**: These provide a simple interface to connect and interact with large language models (LLMs) like GPT.
  
> **Prompt Templates**: Predefined templates that help structure prompts, ensuring consistent and effective communication with LLMs.

> **Indexes**: Tools for extracting and organizing relevant information from large datasets, making it easier to retrieve key details efficiently during queries.

* **Chains**

![image-2.png](attachment:image-2.png)

> A **chain** involves combining a sequence of function calls to perform a specific task. Each step can execute different operations, such as processing input or interacting with a language model. This modular approach streamlines workflows, enhances efficiency, and ensures smooth data flow between steps.

* **Agents**

![image.png](attachment:image.png)

> Agents enable LLMs to perform actions, such as interacting with external sources like Wikipedia.

See the screenshot below for an overview of LangChain:

![image-2.png](attachment:image-2.png)
Image retrieved from https://daxg39y63pxwu.cloudfront.net/images/blog/langchain/LangChain.webp
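
The chain concept described above, a sequence of calls where each step feeds the next, can be sketched as plain function composition. Everything here is hypothetical and for illustration only: `make_chain`, `build_prompt`, `fake_llm`, and `to_upper` are made-up names, and `fake_llm` merely echoes its input instead of calling a model.

```python
from functools import reduce
from typing import Callable


def make_chain(*steps: Callable) -> Callable:
    """Compose steps left to right: the output of each step is the input of the next."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


def build_prompt(topic: str) -> str:
    # Step 1: turn raw input into a prompt.
    return f"Write one sentence about {topic}."


def fake_llm(prompt: str) -> str:
    # Step 2: stand-in for a real model call; it only echoes the prompt.
    return f"[model reply to: {prompt}]"


def to_upper(text: str) -> str:
    # Step 3: post-process the model output.
    return text.upper()


chain = make_chain(build_prompt, fake_llm, to_upper)
print(chain("vector stores"))
# prints: [MODEL REPLY TO: WRITE ONE SENTENCE ABOUT VECTOR STORES.]
```

Because each step only depends on the previous step's output, steps can be swapped or reordered without touching the others, which is the modularity the chain abstraction aims for.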

# LangChain Language Model 

## Chat Model

A chat model takes a list of messages as input and returns a message:

* The output is a ChatMessage, which has the following components:
   * **Content**: the actual message
   * **Role**: the role of the entity that sent the ChatMessage


In [3]:
import openai
import os
from dotenv import find_dotenv, load_dotenv

# message type used to wrap the prompt for a chat model
from langchain_core.messages import HumanMessage

# openai wrapper for langchain.chat
from langchain_openai import ChatOpenAI

# create a variable for model
model_llm = "gpt-3.5-turbo"

prompt = "How deep is the Caspian Sea?"
# encapsulates the prompt in a format suitable for processing 
# by a language model or conversational agent.
messages = [HumanMessage(content=prompt)]

print("===================")

model_chat = ChatOpenAI(temperature=0.75)
print (model_chat.invoke("what is the weather in calgary now"))

print("===================")

model_chat = ChatOpenAI(temperature=0.75)
print (model_chat.invoke(messages).content)


content='I apologize, but I am not able to provide real-time weather updates. I recommend checking a reliable weather website or app for the current weather in Calgary.' response_metadata={'token_usage': {'completion_tokens': 31, 'prompt_tokens': 15, 'total_tokens': 46, 'prompt_tokens_details': {'cached_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-7c5411ab-961d-4ced-ae9a-12ce796b8d0e-0' usage_metadata={'input_tokens': 15, 'output_tokens': 31, 'total_tokens': 46}
The Caspian Sea is the world's largest inland body of water, with a maximum depth of about 3,363 feet (1,025 meters).


## Prompt Template

Why should we use LangChain prompt templates instead of calling the ChatGPT model directly?

1. **Simplifies Long Templates**: Prompt templates are particularly useful for managing lengthy templates, as they help reduce complexity.
2. **Ready-to-Use Built-in Templates**: LangChain offers a variety of built-in prompt templates that we can immediately implement, further minimizing complexity. 
   - For instance, there are templates available for tasks such as summarization, question-answering, connecting to SQL databases, and interfacing with specific APIs.
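
To show what a prompt template does under the hood, here is a minimal sketch built on the standard library. `SimplePromptTemplate` is a hypothetical class written for this illustration, not LangChain's `ChatPromptTemplate`; it only discovers the `{placeholders}` in a string and fills them in.

```python
from string import Formatter


class SimplePromptTemplate:
    """Hypothetical minimal template: finds {placeholders} and fills them in."""

    def __init__(self, template: str):
        self.template = template
        # Formatter().parse yields (literal_text, field_name, format_spec, conversion) tuples.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **kwargs) -> str:
        # Fail early with a clear message when a variable is missing.
        missing = [name for name in self.input_variables if name not in kwargs]
        if missing:
            raise KeyError(f"missing template variables: {missing}")
        return self.template.format(**kwargs)


template = SimplePromptTemplate(
    "Translate the following text {customer_review} into {language} in a polite manner."
)
print(template.input_variables)
print(template.format(customer_review="food quality is terrible", language="Farsi"))
```

The benefit is that the long, fixed part of the prompt is written once, while only the variables change from call to call.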

In [4]:
customer_review = """ food quality is terrible, 
I will not go to this place anymore """ 

In [5]:
from langchain.prompts import ChatPromptTemplate

# using LangChain & prompt templates
model_chat = ChatOpenAI(temperature=0.7,
                       model=model_llm)

# create a template string
template_string = """
Translate the following text {customer_review}
into Farsi language in a polite manner.
And the restaurant's name is {restaurant_name}.

"""

# create a chat prompt template
template_prompt = ChatPromptTemplate.from_template(template_string)
message_translation = template_prompt.format_messages(
    customer_review = customer_review,
    restaurant_name = "Khazar"
)


In [6]:
# call the model with .invoke(); calling the model object directly is deprecated
response = model_chat.invoke(message_translation)
print(response.content)

کیفیت غذا در رستوران خزر واقعاً بد بود. من دیگر به این مکان نخواهم رفت.


Prompt templates make it easy to write and reuse very long prompts with LangChain.

## Parsers

Parsers are very important because, in most cases, the raw output comes in a format we cannot use directly and must be converted into one we can: a parser structures and formats data for further processing downstream.

![image-2.png](attachment:image-2.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

In [7]:
# create a variable for model
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.0, model=model_llm)

In [8]:
paris_visit = """Upon landing in Paris, the adventure begins with a 
check-in at a charming boutique hotel nestled in the heart of 
the city. The first venture out includes a visit to 
the majestic Eiffel Tower, where panoramic views of Paris await. 
A stroll along the Seine River, crossing its iconic bridges, 
provides a picturesque introduction to 
the city's romantic ambiance. As the day wanes, enjoy a serene evening at a local café, savoring classic French cuisine.

The following day is dedicated to immersing yourself in Parisian 
culture and art. Begin with an early visit to the Louvre Museum 
to admire historical masterpieces, including the Mona Lisa. 
Post-lunch, a journey through the Gothic splendor of the
Notre-Dame Cathedral and a leisurely exploration of the Latin 
Quarter's quaint streets and cozy bookshops reveal the 
city's vibrant heart. The evening offers a chance to experience 
Paris's renowned culinary scene, with a dinner featuring exquisite 
French delicacies.

On the final day, delve into the artistic enclave of Montmartre, 
where the Sacré-Cœur Basilica stands majestically. This district, 
known for its bohemian spirit, offers a glimpse into the city's 
artistic legacy."""


In [9]:
itinerary_template = """
Extract following information 

hotel: hotel to stay

first_day_visit: first day plan 

second_day_visit: second day plan 

final_day_visit: final desitination to visit 

from the itinerary below:
itinerary: {itinerary}

The output should be formated as JSON with the following keys:
hotel
first_day_visit
second_day_visit
final_day_visit

"""

In [10]:
from langchain_core.prompts import ChatPromptTemplate

# create a prompt template
template_prompt = ChatPromptTemplate.from_template(itinerary_template)
print(template_prompt)

input_variables=['itinerary'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['itinerary'], template='\nExtract following information \n\nhotel: hotel to stay\n\nfirst_day_visit: first day plan \n\nsecond_day_visit: second day plan \n\nfinal_day_visit: final desitination to visit \n\nfrom the itinerary below:\nitinerary: {itinerary}\n\nThe output should be formated as JSON with the following keys:\nhotel\nfirst_day_visit\nsecond_day_visit\nfinal_day_visit\n\n'))]


In [11]:
messages = template_prompt.format_messages(itinerary=paris_visit)

In [12]:
response = model_chat.invoke(messages)
print(response.content)

{
  "hotel": "charming boutique hotel nestled in the heart of the city",
  "first_day_visit": "visit to the majestic Eiffel Tower, stroll along the Seine River, evening at a local café",
  "second_day_visit": "visit to the Louvre Museum, journey through Notre-Dame Cathedral, exploration of the Latin Quarter",
  "final_day_visit": "delve into the artistic enclave of Montmartre, visit the Sacré-Cœur Basilica"
}


<span class="mark">LLM responses do not always follow the requested format, and the response above is still a plain string rather than a structured object. LangChain's output parsers address this.</span>

In [13]:
type(response.content)

str

### Output Parsers

The `output_parsers` module lets us define the structure we want the output to follow.

In [14]:
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

In [15]:
# create schema (fields)
hotel_schema = ResponseSchema(name = "hotel", description="the recomended hotel to check in and stay")
first_day_visit_schema = ResponseSchema(name = "first_day_visit", description="the place to visit on first day")
second_day_visit_schema = ResponseSchema(name = "second_day_visit", description="the place to visit on second day")
final_day_visit_schema = ResponseSchema(name = "final_day_visit", description="the place to visit on last day")

In [16]:
# create responses
response_schema = [
    hotel_schema,
    first_day_visit_schema,
    second_day_visit_schema,
    final_day_visit_schema 
]

In [17]:
# setup output parsers
output_parser = StructuredOutputParser.from_response_schemas(response_schema)

format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"hotel": string  // the recomended hotel to check in and stay
	"first_day_visit": string  // the place to visit on first day
	"second_day_visit": string  // the place to visit on second day
	"final_day_visit": string  // the place to visit on last day
}
```


We want the output as a Python dictionary rather than a string, so we parse the response:

In [18]:
output_dict = output_parser.parse(response.content)
output_dict

{'hotel': 'charming boutique hotel nestled in the heart of the city',
 'first_day_visit': 'visit to the majestic Eiffel Tower, stroll along the Seine River, evening at a local café',
 'second_day_visit': 'visit to the Louvre Museum, journey through Notre-Dame Cathedral, exploration of the Latin Quarter',
 'final_day_visit': 'delve into the artistic enclave of Montmartre, visit the Sacré-Cœur Basilica'}
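
The string-to-dictionary step above can be approximated with the standard library. This is a rough sketch, not the implementation of `StructuredOutputParser`: `parse_json_reply` is a hypothetical helper that accepts a reply with or without a fenced `json` block and checks that the expected keys are present.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable in Markdown
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text: str, expected_keys: list) -> dict:
    """Extract a JSON object from a model reply and verify its keys."""
    fenced = FENCED_BLOCK.search(text)
    # Use the fenced content when present; otherwise treat the whole reply as JSON.
    payload = fenced.group(1) if fenced else text
    data = json.loads(payload.strip())
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


reply = (
    FENCE
    + 'json\n{"hotel": "Espinas Palace Hotel", "first_day_visit": "Golestan Palace"}\n'
    + FENCE
)
result = parse_json_reply(reply, ["hotel", "first_day_visit"])
print(type(result).__name__, result["hotel"])
```

Checking the keys at parse time means a malformed reply fails immediately, instead of causing a `KeyError` somewhere downstream.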

* **Example Visit Iran**

In [19]:
itinerary_Iran = """ Upon your arrival in Tehran, the first destination is the luxurious Espinas Palace Hotel. 
Nestled in the city's heart, this accommodation promises modern amenities, breathtaking city views, and exemplary 
service, ensuring a stay that combines comfort with elegance. After checking in, a traditional Iranian breakfast 
awaits at the hotel's restaurant, offering a taste of the local cuisine's rich flavors and culinary heritage.

The adventure begins in the afternoon with a visit to the Golestan Palace. This UNESCO World Heritage site is 
a jewel in Tehran's crown, showcasing the opulence of the Qajar era through its stunning gardens, exquisite 
interiors, and detailed tile work. As the day winds down, the bustling Tehran Bazaar becomes the perfect 
backdrop for an evening stroll. Here, the vibrant chaos, colorful stalls, and the aroma of spices and 
fresh foods provide a sensory feast, offering insights into the daily lives of the city's residents.

The next leg of your journey takes you to the enchanting city of Isfahan, where the Abbasi Hotel awaits. 
Known for its beautiful traditional architecture and lush gardens, this historic hotel serves as a gateway 
to the past, located conveniently close to Isfahan's main attractions. The Si-o-se-pol Bridge, with its 
iconic 33 arches, is a splendid first stop, offering serene views, especially at sunset. Following this, 
the Naqsh-e Jahan Square invites exploration, with the Imam Mosque, Ali Qapu Palace, and the bustling 
bazaar around the square offering endless opportunities for discovery and wonder.

On your final day, delve deeper into Isfahan's artistic heritage with a visit to the Chehel Sotoun 
Palace, a stunning example of Persian garden design and architecture. The palace's mirrored hall 
and the intricate wall paintings provide a glimpse into the royal festivities of the Safavid era. 
As your journey comes to a close, the Armenian Quarter of Jolfa offers a quiet retreat, with its 
quaint cafes, the Vank Cathedral, and art galleries, encapsulating the diversity and cultural 
richness that Iran proudly preserves.

This itinerary, weaving through the heart of Iran, offers a tapestry of experiences that promise 
to enrich, educate, and inspire, making your visit not just a trip but a journey through time and culture.
"""

In [20]:
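# note: `format_instructions` has no effect here, because `itinerary_template` has no
# {format_instructions} placeholder; the template itself already asks for JSON output
messages = template_prompt.format_messages(itinerary=itinerary_Iran,
                                        format_instructions=format_instructions)
print(messages)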

[HumanMessage(content="\nExtract following information \n\nhotel: hotel to stay\n\nfirst_day_visit: first day plan \n\nsecond_day_visit: second day plan \n\nfinal_day_visit: final desitination to visit \n\nfrom the itinerary below:\nitinerary:  Upon your arrival in Tehran, the first destination is the luxurious Espinas Palace Hotel. \nNestled in the city's heart, this accommodation promises modern amenities, breathtaking city views, and exemplary \nservice, ensuring a stay that combines comfort with elegance. After checking in, a traditional Iranian breakfast \nawaits at the hotel's restaurant, offering a taste of the local cuisine's rich flavors and culinary heritage.\n\nThe adventure begins in the afternoon with a visit to the Golestan Palace. This UNESCO World Heritage site is \na jewel in Tehran's crown, showcasing the opulence of the Qajar era through its stunning gardens, exquisite \ninteriors, and detailed tile work. As the day winds down, the bustling Tehran Bazaar becomes the 

In [21]:
response = model_chat.invoke(messages)

# parse into dictionary
output_dict = output_parser.parse(response.content)
print(output_dict)

{'hotel': 'Espinas Palace Hotel', 'first_day_visit': 'Golestan Palace', 'second_day_visit': 'Si-o-se-pol Bridge, Naqsh-e Jahan Square', 'final_day_visit': 'Chehel Sotoun Palace, Armenian Quarter of Jolfa'}


### Pydantic Output Parser

The Pydantic output parser lets us declare the expected output as a Pydantic model: LangChain derives the format instructions from the model's JSON schema and validates the parsed LLM response against it.

We are using the same itinerary example as before:

In [22]:
# Import Pydantic parsers
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, field_validator

Now we can define our desired data structure:

In [23]:
from pydantic import BaseModel, Field, field_validator
from langchain.output_parsers import PydanticOutputParser

class ItineraryInfo(BaseModel):
    hotel: str = Field(description="the recommended hotel to check in and stay")
    first_day_visit: str = Field(description="the place to visit on first day")
    second_day_visit: str = Field(description="the place to visit on second day")
    final_day_visit: str = Field(description="the place to visit on last day")
    num_people: int = Field(description="number of people to join this journey")
    
    # reject non-positive head counts when the model is built
    @field_validator('num_people')
    @classmethod
    def number_people_checking(cls, value, info):
        if value <= 0:
            raise ValueError("Not an accurate number of people to travel")
        return value

# Setup a parser and inject instructions
pydantic_parser = PydanticOutputParser(pydantic_object=ItineraryInfo)
format_instructions = pydantic_parser.get_format_instructions()

print(format_instructions)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"hotel": {"description": "the recommended hotel to check in and stay", "title": "Hotel", "type": "string"}, "first_day_visit": {"description": "the place to visit on first day", "title": "First Day Visit", "type": "string"}, "second_day_visit": {"description": "the place to visit on second day", "title": "Second Day Visit", "type": "string"}, "final_day_visit": {"description": "the place to visit on last day", "title": "Final Day Visit", "type": "string"}, "num_people": {"description": "number of people to join this journey", "

In [24]:
itinerary_template_revised = """
Extract information from the following itinerary:

itinerary: {itinerary}
{format_instructions}
"""

In [25]:
updated_prompt = ChatPromptTemplate.from_template(template=itinerary_template_revised)
messages = updated_prompt.format_messages(itinerary=itinerary_Iran,
                                          format_instructions=format_instructions)
format_response = model_chat.invoke(messages)
print(format_response)

content='{\n  "hotel": "Espinas Palace Hotel",\n  "first_day_visit": "Golestan Palace",\n  "second_day_visit": "Si-o-se-pol Bridge and Naqsh-e Jahan Square",\n  "final_day_visit": "Chehel Sotoun Palace and Armenian Quarter of Jolfa",\n  "num_people": 1\n}' response_metadata={'token_usage': {'completion_tokens': 74, 'prompt_tokens': 798, 'total_tokens': 872, 'prompt_tokens_details': {'cached_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-2fd3fe12-6824-444a-9b9c-ea89a147a9ae-0' usage_metadata={'input_tokens': 798, 'output_tokens': 74, 'total_tokens': 872}


In [26]:
type(format_response.content)

str

In [27]:
print(format_response.content)

{
  "hotel": "Espinas Palace Hotel",
  "first_day_visit": "Golestan Palace",
  "second_day_visit": "Si-o-se-pol Bridge and Naqsh-e Jahan Square",
  "final_day_visit": "Chehel Sotoun Palace and Armenian Quarter of Jolfa",
  "num_people": 1
}


In [28]:
# parse the JSON string into a validated ItineraryInfo object
visit = pydantic_parser.parse(format_response.content)
print(type(visit))
print(visit)

<class '__main__.ItineraryInfo'>
hotel='Espinas Palace Hotel' first_day_visit='Golestan Palace' second_day_visit='Si-o-se-pol Bridge and Naqsh-e Jahan Square' final_day_visit='Chehel Sotoun Palace and Armenian Quarter of Jolfa' num_people=1


In [29]:
print(visit.hotel)

Espinas Palace Hotel


See https://python.langchain.com/docs/modules/model_io/output_parsers/types/pydantic for more information about the Pydantic output parser.
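
The run above only shows the success path. The sketch below illustrates the failure path of validate-on-parse using only the standard library: `ItinerarySketch` is a hypothetical, trimmed-down analogue of the `ItineraryInfo` model (two fields only), and a dataclass `__post_init__` check stands in for the Pydantic field validator. It is not how Pydantic or LangChain implement validation.

```python
import json
from dataclasses import dataclass


@dataclass
class ItinerarySketch:
    """Hypothetical stdlib analogue of the ItineraryInfo model."""
    hotel: str
    num_people: int

    def __post_init__(self):
        # Same rule as the field validator: the head count must be a positive integer.
        if not isinstance(self.num_people, int) or self.num_people <= 0:
            raise ValueError("Not an accurate number of people to travel")


def parse_itinerary(reply: str) -> ItinerarySketch:
    """Turn a JSON string into a validated object, or raise ValueError."""
    return ItinerarySketch(**json.loads(reply))


good = parse_itinerary('{"hotel": "Espinas Palace Hotel", "num_people": 1}')
print(good.hotel, good.num_people)

try:
    parse_itinerary('{"hotel": "Espinas Palace Hotel", "num_people": 0}')
except ValueError as error:
    print("rejected:", error)
```

The point of the design is that invalid model output is caught where it is parsed, so the rest of the application only ever sees well-formed objects.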

# Memory

LLMs do not remember anything between calls: each request is handled independently of the previous ones.

LangChain offers various memory types to manage and store conversational context, ensuring seamless and coherent interactions. The main memory types include:

1. `ConversationBufferMemory`: stores the messages and then extracts them into a variable

2. `ConversationBufferWindowMemory`: keeps a list of the interactions of the conversation over time, using only the last k interactions

3. `ConversationTokenBufferMemory`: keeps a buffer of recent interactions in memory and uses token length, rather than the number of interactions, to decide when to flush interactions

4. **Entity Memory** (`ConversationEntityMemory`): tracks specific entities mentioned in conversations, ensuring consistent and accurate references throughout the interaction
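
The difference between the window-based and token-based strategies in the list above can be sketched in a few lines. Both classes are hypothetical illustrations, not LangChain's memory classes, and the token count is approximated by whitespace-separated words, which real tokenizers do not do.

```python
from collections import deque


class WindowMemory:
    """Hypothetical sketch: remember only the last k (human, ai) exchanges."""

    def __init__(self, k: int):
        self.turns = deque(maxlen=k)  # older exchanges fall off automatically

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


class TokenBudgetMemory:
    """Hypothetical sketch: drop the oldest exchanges once a token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = []

    def _size(self) -> int:
        # Crude token estimate: count whitespace-separated words.
        return sum(len(h.split()) + len(a.split()) for h, a in self.turns)

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))
        while self.turns and self._size() > self.max_tokens:
            self.turns.pop(0)


window = WindowMemory(k=2)
window.save("Hi, my name is Mehdi", "Hello Mehdi")
window.save("Why are there stars?", "They are distant suns.")
window.save("Why fear snakes?", "Some are venomous.")
print(window.render())  # only the last two exchanges remain

budget = TokenBudgetMemory(max_tokens=8)
budget.save("my name is Mehdi", "nice to meet you")
budget.save("why are stars bright", "they emit light")
print(budget.turns)  # the first exchange was dropped to stay within the budget
```

Note the trade-off both strategies share: once an exchange is dropped, facts stated in it (such as the user's name) are no longer available to the model.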


In [30]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

In [31]:
# OpenAI Chat API
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.7,
                       model=model_llm)



In [32]:
print(model_chat.invoke("My name is Mehdi and I am a data scientist and what about you?").content)
print(model_chat.invoke("\n\nCool! can you tell me what my name is?").content) # the model has no memory of the previous call

Nice to meet you Mehdi! I am a language model AI designed to assist with various tasks and conversations. How can I help you today?
I'm sorry, but I do not have access to any personal information about you. However, you can tell me your name and I would be happy to address you by it in our conversation.


How can we solve this memory issue? We can leverage `ConversationChain`.

In [33]:
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=model_chat,
    memory=memory,
    verbose=True # see what is going on in background
)

conversation.invoke(input="Hi there, my name is Mehdi")
conversation.invoke(input="why we have stars at the sky")
conversation.invoke(input="why people are scared of snakes")
conversation.invoke(input="Do you remember what my name is?")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: Hi there, my name is Mehdi
AI:

> Finished chain.


> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Hi there, my name is Mehdi
AI: Hello Mehdi! It's nice to meet you. How can I assist you today?
Human: why we have stars at the sky
AI:

> Finished chain.


> Entering new ConversationChain chain...
Prompt after form

{'input': 'Do you remember what my name is?',
 'history': "Human: Hi there, my name is Mehdi\nAI: Hello Mehdi! It's nice to meet you. How can I assist you today?\nHuman: why we have stars at the sky\nAI: Stars are massive, luminous spheres of plasma that are held together by gravity. They form when clouds of gas and dust in space collapse under their own gravity and begin nuclear fusion, which releases energy in the form of light and heat. These stars then continue to shine for billions of years until they exhaust their nuclear fuel and eventually die. The reason we see stars in the sky is because their light travels through space and reaches our eyes, creating the beautiful nighttime view we all enjoy.\nHuman: why people are scared of snakes\nAI: People are scared of snakes for a variety of reasons. One possible explanation is that snakes have been portrayed negatively in many cultures and religions throughout history, leading to a fear that has been passed down through generations. A

In [34]:
print(memory.load_memory_variables({}))

{'history': "Human: Hi there, my name is Mehdi\nAI: Hello Mehdi! It's nice to meet you. How can I assist you today?\nHuman: why we have stars at the sky\nAI: Stars are massive, luminous spheres of plasma that are held together by gravity. They form when clouds of gas and dust in space collapse under their own gravity and begin nuclear fusion, which releases energy in the form of light and heat. These stars then continue to shine for billions of years until they exhaust their nuclear fuel and eventually die. The reason we see stars in the sky is because their light travels through space and reaches our eyes, creating the beautiful nighttime view we all enjoy.\nHuman: why people are scared of snakes\nAI: People are scared of snakes for a variety of reasons. One possible explanation is that snakes have been portrayed negatively in many cultures and religions throughout history, leading to a fear that has been passed down through generations. Additionally, snakes move in a unique and unpre

Now we can see that the LLM remembers my name!

See this page for more information:
https://python.langchain.com/docs/modules/memory/

![image.png](attachment:image.png)

# Chains

A chain links multiple components (prompts, models, parsers, or other chains) together in a sequence. The output of one component is passed as input to the next, so complex workflows and data processing pipelines can be built by combining simpler steps. The figure below shows a simple chain:

![image-3.png](attachment:image-3.png)

Another schematic for sequential chain:
![image-4.png](attachment:image-4.png)
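The core idea, that each step's output becomes the next step's input, can be sketched without any LLM. In the snippet below, `make_chain`, `format_prompt`, and `fake_model` are hypothetical helpers invented for illustration; `fake_model` only echoes its prompt and stands in for a real model call.

```python
from functools import reduce
from typing import Callable


def make_chain(*steps: Callable) -> Callable:
    """Compose steps left to right: the output of one feeds the next."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


def format_prompt(variables: dict) -> str:
    # Plays the role of a prompt template.
    return "How do you greet in {language}".format(**variables)


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call; it simply echoes the prompt.
    return f"[model reply to: {prompt}]"


simple = make_chain(format_prompt, fake_model, str.upper)
result = simple({"language": "Persian"})
print(result)
```

This mirrors the shape of `prompt | model` used in the cells that follow: a template step, a model step, and optionally a post-processing step, executed in order.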

## Simple Chain

In [35]:
# run a simple LLMChain
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain_community.llms import OpenAI  # completion-style LLM used with LLMChain below

# OpenAI Chat API
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.76, model=model_llm, verbose=True)
open_ai = OpenAI(temperature=0.78)

In [36]:
# LLMChain
prompt = PromptTemplate(
    input_variables=["language"],
    template="How do you greet in {language}"
)

simple_chain = prompt | model_chat
print(simple_chain.invoke({"language": "Persian?"}).content)

In Persian, you can greet someone by saying "سلام" (salaam) which means hello or hi. You can also use "خوبی؟" (khubi?) which means how are you? or "خوش آمدید" (khosh amadid) which means welcome.


## Sequential Chain

In [37]:
open_ai = OpenAI(temperature=0.78)

template = """
write a fake stroy of 100 words for a person living in {location} 
and make a living based on boxing. Make his/her name as {name} 

fake STORY:
"""

prompt = PromptTemplate(input_variables=["location", "name"],
                        template=template, verbose=True)

In [38]:
simple_chain = prompt | model_chat
print(simple_chain.invoke({"location": "bandaeanzali?","name": "Ebi?"}).content)

Ebi was a rising star in the boxing world, known for his lightning-fast punches and unbeatable determination. Living in Bandar-e Anzali, he trained tirelessly in a small gym by the sea, dreaming of one day becoming a champion. Despite facing numerous obstacles and setbacks, Ebi never gave up on his passion for boxing. With each victory in the ring, his reputation grew, attracting the attention of sponsors and fans alike. His resilience and unwavering dedication to his craft inspired those around him, proving that with hard work and perseverance, anything is possible. Ebi was destined for greatness, and nothing could stand in his way.


In [39]:
fake_story_chain = LLMChain(llm=open_ai, prompt=prompt, verbose=True) # see what is going on in background
print(fake_story_chain.run({"location": "bandaeanzali", "name":"Ebi"}))



> Entering new LLMChain chain...
Prompt after formatting:

write a fake stroy of 100 words for a person living in bandaeanzali 
and make a living based on boxing. Make his/her name as Ebi 

fake STORY:


> Finished chain.

Ebi was a rising star in the world of boxing, hailing from the small town of Bandaeanzali. From a young age, Ebi was known for his quick hands and powerful punches. He would spend hours training in the local gym, determined to make a name for himself in the ring.

As he got older, Ebi's skills only improved. He became the talk of the town, with many predicting he would become the next big thing in boxing. His hard work and dedication paid off when he was scouted by a renowned boxing coach from the city.

Ebi left his small town behind and moved to the city to pursue his dreams of becoming a professional boxer. He trained rigorously every day, pushing himself to the limits. And finally, his big break came when he was offered a chance to fight in a televised match.

The entire town of Bandaeanzali gathered around their TVs to watch Ebi in action. And he did not disappoint. In a nail-biting match, Ebi knocked out his opponent in the final round, securing his fi

Now that we have a story, we want to use it as the input to a second chain that translates it. First, we give the story chain an `output_key` so the next chain can refer to its output by name.

In [40]:
fake_story_chain = LLMChain(llm=open_ai, prompt=prompt, 
                            output_key="story",
                            verbose=True) # see what is going on in background

In [41]:
from langchain.chains import LLMChain, SequentialChain

update_template = """
# translate the {story} into {language}. Please ensure that the language is easily understandable and is fun to read.

Translation into {language}: 
"""

translate_prompt = PromptTemplate(input_variables=["story", "language"],
                                 template=update_template)

translate_chain = LLMChain(llm=open_ai, 
                          prompt=translate_prompt, 
                          output_key="translated"
                         ) 

Now we create the `SequentialChain`:

In [42]:
chain_overall = SequentialChain(
    chains=[fake_story_chain, translate_chain],
    input_variables=["location", "name", "language"],
    output_variables=["story", "translated"], # This will return the story and translate it
    verbose=True
)

response = chain_overall({"location": "Bandar-e-Anzali",
                        "name": "Abay",
                        "language": "Persian",
                         })



> Entering new SequentialChain chain...


> Entering new LLMChain chain...
Prompt after formatting:

write a fake stroy of 100 words for a person living in Bandar-e-Anzali 
and make a living based on boxing. Make his/her name as Abay 

fake STORY:


> Finished chain.

> Finished chain.


In [43]:
print(f"English Version is {response['story']} \n\n ")
print(f"Translated Version is {response['translated']} \n\n ")

English Version is In the bustling city of Bandar-e-Anzali, there lived a young man by the name of Abay. He was tall, muscular and had a passion for boxing that burned deep within him. Ever since he was a child, he dreamed of becoming a professional boxer and making a name for himself. With determination and hard work, Abay trained tirelessly every day at the local boxing gym.

His skills caught the attention of a renowned coach who took him under his wing and molded him into a fierce fighter. Soon, Abay was winning matches left and right, gaining fame and fortune with each victory. He became a local hero, with posters of him plastered all over the city.

But with success came jealousy and envy from his opponents. One day, Abay was challenged by a fierce rival who was known for his dirty tricks in the ring. The match was intense, with both fighters going toe to toe. But in the end, it was Abay's determination and sharp skills that led him to emerge victorious.

With his undefeated reco
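The way a sequential chain threads named values from one step to the next can be sketched with a shared dictionary. Each step declares which keys it reads and which key it writes, much like `input_variables` and `output_key` above. The helper `run_sequential` and the two toy steps are hypothetical and only mimic the data flow; no model is called.

```python
def run_sequential(steps, inputs):
    """Run steps in order, storing each result under its output key."""
    state = dict(inputs)
    for required_keys, output_key, func in steps:
        kwargs = {key: state[key] for key in required_keys}
        state[output_key] = func(**kwargs)
    return state


def write_story(location, name):
    # Stand-in for the story-writing LLM call.
    return f"{name} boxes in {location}."


def translate(story, language):
    # Stand-in for the translation LLM call.
    return f"({language}) {story}"


steps = [
    (["location", "name"], "story", write_story),
    (["story", "language"], "translated", translate),
]

sketch_response = run_sequential(
    steps,
    {"location": "Bandar-e-Anzali", "name": "Abay", "language": "Persian"},
)
print(sketch_response["story"])
print(sketch_response["translated"])
```

Because the second step reads the `story` key written by the first, the order of the steps matters, and every key a step needs must come either from the initial inputs or from an earlier step.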

### using `prompt | model`

In [44]:
chain_1 = prompt | model_chat
chain_2 = translate_prompt | model_chat

answer1 = chain_1.invoke({"location": "bandaeanzali?","name": "Ebi?"})
answer2 = chain_2.invoke({"story": answer1.content, "language": "persian"})
print("Fake story:\n", answer1.content)
print("\nTranslation:\n", answer2.content)

Fake story:
 Ebi was a rising star in the boxing world, living in Bandar-e Anzali. With a natural talent for the sport, he quickly made a name for himself in the ring. However, his success came with a price. Ebi's opponents would often try to sabotage him, leading to intense rivalries and dangerous situations. Despite the challenges, Ebi never backed down and continued to train tirelessly, determined to become the champion of Bandar-e Anzali. His hard work and dedication paid off when he finally won the title, solidifying his place as a boxing legend in the city.

Translation:
 ابی ستاره‌ی صعود کننده‌ای در دنیای باکس بود که در بندر انزلی زندگی می‌کرد. با استعداد طبیعی برای ورزش، او به سرعت نامی برای خود در رینگ ایجاد کرد. با این حال، موفقیت ابی با یک قیمت همراه بود. حریفان ابی اغلب سعی می‌کردند او را خراب کنند که منجر به رقابت‌های شدید و وضعیت‌های خطرناک می‌شد. با وجود چالش‌ها، ابی هرگز پشت نکش نکرد و به سرعت ادامه داد تا با تمرین‌های بی‌وقفه، مصمم به تبدیل شدن به قهرمان بندر انزلی 

### Application with [Streamlit](https://streamlit.io/)

We can convert the code above into a Streamlit app as shown below:

In [45]:
#%pip install aiortc==1.3.2 --user
#%pip install matplotlib==3.5.1 --user
#%pip install numpy==1.22.3 --user
#%pip install opencv-python-headless==4.5.5.64 --user
#%pip install pydub==0.25.1 --user
#%pip install streamlit==1.9.0 --user
#%pip install streamlit_webrtc==0.37.0 --user
#%pip install typing_extensions==4.1.1 --user
#%pip install protobuf~=3.19.0 --user
#%pip install "altair<5" --user

In [46]:
import os

import openai
import streamlit as st
from dotenv import find_dotenv, load_dotenv
from langchain.chains import LLMChain, SequentialChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_community.llms import OpenAI

# Load variables from the .env file first, so the API key is available below
load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")

# OpenAI Chat API
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.76, model=model_llm)
open_ai = OpenAI(temperature=0.78)


def lullaby_generate(location, name, language):

    template = """
    write a fake story of 100 words about a person living in {location} 
    who makes a living from boxing. Name the character {name}. 
    
    fake STORY:
    """
    prompt = PromptTemplate(input_variables=["location", "name"],
                            template=template
                           )
    
    fake_story_chain = LLMChain(llm=open_ai, prompt=prompt, 
                                output_key="story",
                                verbose=True) # see what is going on in background
    #
    update_template = """
    # translate the {story} into {language}. Please ensure that the language is easily 
    understandable and is fun to read.
    
    Translation into {language}: 
    """
    
    translate_prompt = PromptTemplate(input_variables=["story", "language"],
                                     template=update_template)
    #
    translate_chain = LLMChain(llm=open_ai, 
                              prompt=translate_prompt, 
                              output_key="translated"
                             )                           
                             
    #
    chain_overall = SequentialChain(
        chains=[fake_story_chain, translate_chain],
        input_variables=["location", "name", "language"],
        output_variables=["story", "translated"], # This will return the story and translate it
        verbose=True
    )
    
    response = chain_overall({"location": location,
                            "name": name,
                            "language": language,
                             })                           
    
    return response
   
    
# Create a user interface here
def main():
    st.set_page_config(page_title="Generate a fake story",
                      layout="centered")
    st.title("Ask AI to write a fake story about a boxer and translate it to another language 📚")
    st.header("Now it is started ...")
    location_input = st.text_input(label="Location for the story")
    name_input = st.text_input(label="What is the name of character")
    language_input = st.text_input(label="Translate story to another language")
    
    submit_button = st.button("Submit")
    if location_input and name_input and language_input:
        if submit_button:
            with st.spinner("Generate a Fake story..."):
                response = lullaby_generate(location=location_input,
                                                    name=name_input,
                                                    language=language_input
                                                    )
                with st.expander("English version"):
                    st.write(response['story'])
                    
                with st.expander(f"{language_input} language"):
                    st.write(response['translated'])
                    
            st.success("Successfully done!")    
    

#Invoking main function
if __name__ == '__main__':
    main()

2024-10-19 06:40:33.649 
  command:

    streamlit run D:\Learning\MyWebsite\LangChain\vm_langchain\lib\site-packages\ipykernel_launcher.py [ARGUMENTS]
2024-10-19 06:40:33.650 Session state does not function when running a script without `streamlit run`


To run the code above with Streamlit, save it in a Python file (for example `app.py`) and launch it from the console with `streamlit run app.py`.

## Router Chains

**Router Chain** is a specialized type of chain designed <span class="mark">to route inputs to different sub-chains based on specific criteria</span> or conditions. This allows for more dynamic and flexible handling of inputs by directing them to the appropriate processing chain.

![image-2.png](attachment:image-2.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

Routers in LangChain are specialized <span class="mark">components designed to manage and direct the flow of tasks or queries within the system</span>. They help in distributing incoming tasks to the appropriate models, agents, or other resources based on specific criteria or requirements. Routers enhance efficiency and optimize resource utilization by ensuring that each task is handled by the most suitable component available. This capability is crucial in complex, multi-agent systems where diverse tasks need different handling strategies. <span class="mark">By using routers, LangChain can dynamically adapt to varying workloads and improve overall performance and accuracy of the language processing pipeline</span>.
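The routing idea can be sketched in a few lines of plain Python: something decides which destination fits the input, and a dispatcher forwards the input there or to a default. In the sketch below, `pick_destination` uses simple keyword matching purely as a stand-in for the LLM-based decision; the destination names and keywords are illustrative assumptions.

```python
def pick_destination(question: str) -> str:
    """Stand-in for the router's decision; a real router asks an LLM."""
    keywords = {
        "weather": ("temperature", "warming", "climate"),
        "sport": ("swim",),
        "physician": ("heart",),
    }
    lowered = question.lower()
    for name, words in keywords.items():
        if any(word in lowered for word in words):
            return name
    return "DEFAULT"


def route(question, destinations, default):
    """Send the question to the chosen handler, or to the default one."""
    handler = destinations.get(pick_destination(question), default)
    return handler(question)


# Each handler stands in for a destination chain with its own prompt.
destinations_sketch = {
    name: (lambda q, name=name: f"[{name}] {q}")
    for name in ("weather", "sport", "physician")
}


def default_handler(question):
    return f"[default] {question}"


print(route("Why is the earth temperature rising?", destinations_sketch, default_handler))
print(route("How old are the stars?", destinations_sketch, default_handler))
```

Keeping a default handler is the important design choice: inputs that match no destination still get an answer instead of raising an error.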

In [47]:
from langchain.prompts import ChatPromptTemplate

# Load variables from the .env file, then read the API key
load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")

# OpenAI Chat API
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.0, model=model_llm)


weather_template = """You are an expert at global warming. You can answer any 
question related to the rise in Earth's temperature.

Here is a question:
{input}"""


sport_template = """You are a very good swim coach. You are great at teaching students how to swim. 

Here is a question:
{input}"""

physician_template = """You are a very good physician specializing in heart disease. 

Here is a question:
{input}"""



prompt_infos = [
    {
        "name": "weather",
        "description": "Good at global warming",
        "prompt_template": weather_template
    },
    {
        "name": "sport",
        "description": "Good for teaching people how to swim",
        "prompt_template": sport_template,
    },
    {
        "name": "physician",
        "description": "Good for healing heart disease",
        "prompt_template": physician_template,
    },
]

destination_chains = {}
for info in prompt_infos:
    name = info["name"]
    prompt_template = info["prompt_template"]
    prompt = ChatPromptTemplate.from_template(template=prompt_template)
    chain = LLMChain(llm=model_chat, prompt=prompt)
    destination_chains[name] = chain

In [48]:
destination_chains["physician"]

LLMChain(prompt=ChatPromptTemplate(input_variables=['input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='You are a very good physician specializing in heart disease. \n\nHere is a question:\n{input}'))]), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x000001BF69EF2460>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x000001BF6B5D8820>, temperature=0.0, openai_api_key='[REDACTED]', openai_proxy=''))

In [49]:
destinations = [f"{p['name']}: {p['description']}" for p in prompt_infos]
destinations_str = "\n".join(destinations)

So far, we have created the *"Destination Chains"*. If a question does not match any of the destinations, we need a fallback, so the next step is to create a *"Default Chain"*.

In [50]:
# Setup the default chain  
default_prompt = ChatPromptTemplate.from_template("{input}")
default_chain = LLMChain(llm=model_chat, prompt=default_prompt)
default_chain

LLMChain(prompt=ChatPromptTemplate(input_variables=['input'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}'))]), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x000001BF69EF2460>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x000001BF6B5D8820>, temperature=0.0, openai_api_key='[REDACTED]', openai_proxy=''))

In [51]:
from langchain.chains.router.multi_prompt_prompt import MULTI_PROMPT_ROUTER_TEMPLATE
from langchain.chains.router.llm_router import LLMRouterChain, RouterOutputParser
from langchain.chains.router import MultiPromptChain

In [52]:
destinations_str

'weather: Good at global warming\nsport: Good for teaching people how to swim\nphysician: Good for healing heart disease'

In [53]:
# create actual router template
router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(destinations=destinations_str)
print(router_template)

Given a raw text input to a language model select the model prompt best suited for the input. You will be given the names of the available prompts and a description of what the prompt is best suited for. You may also revise the original input if you think that revising it will ultimately lead to a better response from the language model.

<< FORMATTING >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
{{
    "destination": string \ name of the prompt to use or "DEFAULT"
    "next_inputs": string \ a potentially modified version of the original input
}}
```

REMEMBER: "destination" MUST be one of the candidate prompt names specified below OR it can be "DEFAULT" if the input is not well suited for any of the candidate prompts.
REMEMBER: "next_inputs" can just be the original input if you don't think any modifications are needed.

<< CANDIDATE PROMPTS >>
weather: Good at global warming
sport: Good for teaching people how to swim
physician: Good for heali

In [54]:
router_prompt = PromptTemplate(
    template=router_template,
    input_variables=["input"],
    output_parser=RouterOutputParser()
)

In [55]:
print(router_prompt)

input_variables=['input'] output_parser=RouterOutputParser() template='Given a raw text input to a language model select the model prompt best suited for the input. You will be given the names of the available prompts and a description of what the prompt is best suited for. You may also revise the original input if you think that revising it will ultimately lead to a better response from the language model.\n\n<< FORMATTING >>\nReturn a markdown code snippet with a JSON object formatted to look like:\n```json\n{{\n    "destination": string \\ name of the prompt to use or "DEFAULT"\n    "next_inputs": string \\ a potentially modified version of the original input\n}}\n```\n\nREMEMBER: "destination" MUST be one of the candidate prompt names specified below OR it can be "DEFAULT" if the input is not well suited for any of the candidate prompts.\nREMEMBER: "next_inputs" can just be the original input if you don\'t think any modifications are needed.\n\n<< CANDIDATE PROMPTS >>\nweather: Goo

In [56]:
router_chain = LLMRouterChain.from_llm(
    llm=model_chat,
    prompt=router_prompt,
) 

In [57]:
router_chain

LLMRouterChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['input'], output_parser=RouterOutputParser(), template='Given a raw text input to a language model select the model prompt best suited for the input. You will be given the names of the available prompts and a description of what the prompt is best suited for. You may also revise the original input if you think that revising it will ultimately lead to a better response from the language model.\n\n<< FORMATTING >>\nReturn a markdown code snippet with a JSON object formatted to look like:\n```json\n{{\n    "destination": string \\ name of the prompt to use or "DEFAULT"\n    "next_inputs": string \\ a potentially modified version of the original input\n}}\n```\n\nREMEMBER: "destination" MUST be one of the candidate prompt names specified below OR it can be "DEFAULT" if the input is not well suited for any of the candidate prompts.\nREMEMBER: "next_inputs" can just be the original input if you don\'t think any modifica


In [59]:
chain = MultiPromptChain(
    router_chain=router_chain,
    destination_chains=destination_chains,
    default_chain=default_chain,
    verbose=True
)

# Test
response = chain.run("can you tell me why the earth temperature is raising?")
#response = chain.run("How old as the stars?")
print(response)



> Entering new MultiPromptChain chain...
weather: {'input': 'can you explain the reasons behind the rise in global temperatures?'}
> Finished chain.
There are several factors that contribute to the rise in global temperatures, with the primary driver being human activities that release greenhouse gases into the atmosphere. These gases, such as carbon dioxide, methane, and nitrous oxide, trap heat from the sun and prevent it from escaping back into space, leading to a warming effect known as the greenhouse effect.

Some of the main human activities that contribute to the increase in greenhouse gases include burning fossil fuels for energy production, transportation, and industrial processes, deforestation, and agriculture practices such as livestock farming. These activities have significantly increased the concentration of greenhouse gases in the atmosphere, leading to a rapid rise in global temperatures.

In addition to human activities, natural factors such as volca

In [60]:
# Test
response = chain.run("can you tell me what is best approach to loss weight?")
#response = chain.run("How old as the stars?")
print(response)



> Entering new MultiPromptChain chain...
physician: {'input': 'can you tell me what is the best approach to losing weight?'}
> Finished chain.
As a physician specializing in heart disease, I recommend a comprehensive approach to weight loss that includes a combination of healthy eating, regular physical activity, and behavior modification. Here are some tips to help you achieve your weight loss goals:

1. Start by setting realistic and achievable goals for weight loss. Aim to lose 1-2 pounds per week, as this is a safe and sustainable rate of weight loss.

2. Focus on making healthy food choices by incorporating plenty of fruits, vegetables, whole grains, lean proteins, and healthy fats into your diet. Limit your intake of processed foods, sugary drinks, and high-fat foods.

3. Practice portion control by measuring your food and paying attention to serving sizes. Eating smaller, more frequent meals throughout the day can help keep you feeling full and satisfied.

4. S

# Document Loading

One great advantage of LangChain is that it lets developers chat with their own data and documents (<span class="mark">URLs, CSV, PDFs, HTML, JSON</span>, ...).
![image-2.png](attachment:image-2.png)

First, the document must be converted into a **vector store**. The steps are:
1. Document Loading
2. Splitting and Chunking
3. Embedding and Storage

![image.png](attachment:image.png)

Once the vector store is built, we can have a conversation with the document: a question is used as a query to find the most relevant passages in the document. This is called the **Retrieval** process.

![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)
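The embed, store, and retrieve steps can be sketched end to end with a toy example. The snippet below assumes a bag-of-words count vector as the "embedding" and cosine similarity as the relevance score; real pipelines use learned embedding models and a dedicated vector store, so this is only an illustration of the flow. The names `embed`, `cosine`, `retrieve`, and `sample_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Pretend these chunks came from the loading and splitting steps.
sample_chunks = [
    "Methane can migrate along imperfectly sealed wells.",
    "Streamlit turns Python scripts into web apps.",
    "Boxing requires daily training.",
]

# The "vector store": each chunk kept together with its embedding.
store = [(chunk, embed(chunk)) for chunk in sample_chunks]


def retrieve(question: str, k: int = 1) -> list:
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("Which wells leak methane?"))
```

In a RAG application, the retrieved chunks are then placed into the prompt so the model can answer using the document's own text.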

In [61]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

from langchain.document_loaders import PyPDFLoader  

# OpenAI Chat API
model_llm = "gpt-3.5-turbo"

model_chat = ChatOpenAI(temperature=0.2, model=model_llm)

Load a paper by `PyPDFLoader`:

In [62]:
# requires: pip install pypdf

loader = PyPDFLoader("./data/paper.pdf")
pages = loader.load()

print(f"Number of pages in this paper: {len(pages)}.")

# first page
page_1 = pages[0]
print(page_1)

Number of pages in this paper: 21.
page_content='RESEARCH ARTICLE\nMachine learning approaches for the prediction of serious\nfluid leakage from hydrocarbon wells\nMehdi Rezvandehy1and Bernhard Mayer2\n1Department of Chemical and Petroleum Engineering, University of Calgary, Calgary, AB, Canada\n2Department of Geoscience, University of Calgary, Calgary, AB, Canada\nCorresponding author: Mehdi Rezvandehy; Email: mehdi.rezvandehy@ucalgary.ca\nReceived: 23 August 2022; Revised: 05 April 2023; Accepted: 14 April 2023\nKeywords: Energy wells; imbalanced class classification; imputation; probability estimation; resampling\nAbstract\nThe exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources,\nand greenhouse gas emissions. Fluids such as methane or CO 2may in some cases migrate toward the groundwater\nzone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing\nregions are routinely

In [63]:
# first 200 characters on the page 1
print(page_1.page_content[0:200]) 

RESEARCH ARTICLE
Machine learning approaches for the prediction of serious
fluid leakage from hydrocarbon wells
Mehdi Rezvandehy1and Bernhard Mayer2
1Department of Chemical and Petroleum Engineering, 


In [64]:
print(page_1.metadata)

{'source': './data/paper.pdf', 'page': 0}


Many other types of document loaders are available; see https://python.langchain.com/docs/modules/data_connection/document_loaders/

![image.png](attachment:image.png)

## Document Splitting 

![image.png](attachment:image.png)

Creating good chunks is essential in semantic search and RAG (Retrieval-Augmented Generation). Effective content division ensures that we maintain coherence and context in the response. If we divide a story into unrelated fragments, we could lose the ability to create a coherent response.

LangChain provides several splitters:
1. Split by character - `CharacterTextSplitter`
2. Split code - `RecursiveCharacterTextSplitter.from_language`
3. Split Markdown by headers - `MarkdownHeaderTextSplitter`
4. Recursively split by character - `RecursiveCharacterTextSplitter`
5. Split by tokens - `TokenTextSplitter`; most LLMs have a token limit, so this splitter helps ensure chunks stay within it
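To see what chunk size and overlap mean in practice, here is a minimal fixed-width sliding-window splitter. It is a deliberately simplified sketch: LangChain's splitters break text on separators and then merge pieces, whereas the hypothetical `split_with_overlap` below cuts purely by character position.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

The overlap repeats the tail of one chunk at the head of the next, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.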

### `CharacterTextSplitter`

In [66]:
from langchain.text_splitter import CharacterTextSplitter

# 1. CharacterTextSplitter
with open("./data/wild_animals_book.txt", encoding="utf8") as paper:
    speech = paper.read()
    
text_splitter = CharacterTextSplitter(
    length_function = len
)

texts = text_splitter.create_documents([speech])
print(texts[0])

page_content='The Project Gutenberg eBook of Wild Animals I Have Known\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: Wild Animals I Have Known\n\n\nAuthor: Ernest Thompson Seton\n\nRelease date: January 1, 2002 [eBook #3031]\n                Most recently updated: March 3, 2017\n\nLanguage: English\n\nCredits: Produced by David Reed, and David Widger\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK WILD ANIMALS I HAVE KNOWN ***\n\nProduced by David Reed\n\nWILD ANIMALS I HAVE KNOWN\n\nBy Ernest Thompson Seton\n\n\nBooks by Ernest Thompson Seton\n\n     Biography of a Grizzly\n

### `RecursiveCharacterTextSplitter`

In [67]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 2. RecursiveCharacterTextSplitter
with open("./data/wild_animals_book.txt", encoding="utf8") as paper:
    speech = paper.read()
    
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 20,
    chunk_overlap = 5,
    length_function = len,
    add_start_index=True
)

docs = text_splitter.create_documents([speech])

print(len(docs))
print(f"Doc 1: {docs[0]}")
print(f"Doc 2: {docs[1]}")

20661
Doc 1: page_content='The Project' metadata={'start_index': 0}
Doc 2: page_content='Gutenberg eBook of' metadata={'start_index': 12}


In [68]:
s = "Python can be easy to pick up whether you're a professional or a beginner."

text = text_splitter.split_text(s)
print(text)


['Python can be easy', 'easy to pick up', "up whether you're a", 'a professional or a', 'or a beginner.']
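
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified, word-level sketch in plain Python. This is not LangChain's implementation: `split_words_with_overlap` is a hypothetical helper that only splits on whitespace, whereas `RecursiveCharacterTextSplitter` works through a list of separators. It greedily fills a chunk up to `chunk_size` characters, then carries over trailing words that fit within `chunk_overlap` characters.

```python
def split_words_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Greedy word-level chunking with overlap (illustrative sketch only)."""
    chunks, current = [], []
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            # The chunk is full: emit it, then keep only the trailing words
            # that fit in the overlap budget and leave room for the new word.
            chunks.append(" ".join(current))
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [word])) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


s = "Python can be easy to pick up whether you're a professional or a beginner."
for chunk in split_words_with_overlap(s, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

For this sentence the sketch yields the same five chunks as the cell above. The overlap means that the last word or two of one chunk reappear at the start of the next, so context is not cut off abruptly at chunk boundaries.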


### Vectorstore & Embeddings

After splitting the document, each chunk is embedded, and the resulting embeddings are stored in a vector store:

![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

Embedding vectors are numerical representations that capture the content and meaning of text. These vectors are designed so that texts with similar content and meaning will have similar vectors in the high-dimensional space.

![image.png](attachment:image.png)

After embedding, the text and the query can be compared in the same vector space, which supports:
* **Searching:** find the results most relevant to the query string
* **Recommendations:** recommend items whose text strings are related
* **Classification:** assign each text string the label it is most similar to

* **Full Overview of Vector Store** for Documents

>![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

* **Overview of Vector Store for query/question**
> ![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

## Hands on 

In [69]:
import numpy as np
# langchain wrapper for embedding
from langchain.embeddings import OpenAIEmbeddings 

embeddings = OpenAIEmbeddings()

corpus = ["Global warming is happening", 
        "The weather is not good to play golf today", 
        "Never compare an apple to an orange", 
        "Apple and orange are completely different from each other"]

for itext in corpus:
    for jtext in corpus:
        embed1 = embeddings.embed_query(itext)
        embed2 = embeddings.embed_query(jtext)
        similarity = np.dot(embed1, embed2)
        print(f"{itext}, {jtext}: Similarity %: {similarity*100}")


Global warming is happening, Global warming is happening: Similarity %: 99.99999999999997
Global warming is happening, The weather is not good to play golf today: Similarity %: 80.02220475419695
Global warming is happening, Never compare an apple to an orange: Similarity %: 73.22129176649644
Global warming is happening, Apple and orange are completely different from each other: Similarity %: 72.1563466439271
The weather is not good to play golf today, Global warming is happening: Similarity %: 80.02220475419695
The weather is not good to play golf today, The weather is not good to play golf today: Similarity %: 99.99999999999999
The weather is not good to play golf today, Never compare an apple to an orange: Similarity %: 74.39217044351511
The weather is not good to play golf today, Apple and orange are completely different from each other: Similarity %: 72.88751127003663
Never compare an apple to an orange, Global warming is happening: Similarity %: 73.21530565210482
Never compare an 
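
The cell above scores pairs with a plain dot product. For unit-length vectors the dot product equals the cosine similarity, and the self-similarity of about 100% in the output is consistent with the embeddings being normalized. A minimal sketch of cosine similarity in plain Python follows; the three-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values for illustration only)
weather = [0.9, 0.1, 0.0]
climate = [0.8, 0.3, 0.1]
fruit = [0.0, 0.2, 0.9]

print(cosine_similarity(weather, climate))  # close to 1: similar direction
print(cosine_similarity(weather, fruit))    # close to 0: nearly orthogonal
```

Dividing by the norms makes the score independent of vector length, so it stays valid for embeddings that are not normalized.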

### Similarity Search

In [70]:
# 1. Load a pdf file
loader = PyPDFLoader("./data/paper.pdf")
pages = loader.load()

# 2. Split the document into chunks
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 400
)
splits = text_splitter.split_documents(pages)
print(len(splits))
# =============== ==================== # 

91


#### Saving Embeddings to Chroma DB 

In [71]:
# Real-world example with embeddings!
# Chroma DB: pip install chromadb
from langchain.vectorstores import Chroma

persist_directory = "./data/chroma"

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings, # openai embeddings
    persist_directory=persist_directory
    )

vectorstore.persist() # save this for later usage!

print(vectorstore._collection.count())

91



Now we can retrieve the chunks that are most similar to a query:

In [72]:
query = "what is gas migration?"

docs_resp = vectorstore.similarity_search(query=query, k=3)

print(len(docs_resp))
print(docs_resp[0].page_content)

3
Abboud et al., 2021 ). The Alberta Energy Regulator (AER) in Alberta, Canada, conducts such field tests
for energy wells within the province. The AER applies two field tests for the identification of fluid
migration after a well is completed to produce hydrocarbon or to inject any fluid:
1. SCVF is the flow of gas (methane, CO
2, etc.) out of the casing annulus or surface casing. SCVF is
often referred to as internal migration. Wells with positive SCVF are considered serious in the
province of Alberta under one or several of the following conditions: (a) gas-flow rates higher than
300 m3/d, (b) stabilized pressure >9.8 kPa/m, (c) liquid-hydrocarbons, and (d) hydrogen sulfide
(H2S) flow (see Alberta Energy Regulator, 2003 , for more information).
2. GM is a flow of any gas that is detectable at surface outside of the outermost casing string. GM is
often referred to as seepage or external migration (Alberta Energy Regulator, 2003 ). A GM is


In [73]:
import sqlite3
import pandas as pd

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/chroma/chroma.sqlite3")

cur = con.cursor()

# Return all results of query
cur.execute('SELECT * FROM embedding_metadata limit 10')
cur.fetchall()

[(1, 'source', './data/paper.pdf', None, None, None),
 (1, 'page', None, 0, None, None),
 (1,
  'chroma:document',
  'RESEARCH ARTICLE\nMachine learning approaches for the prediction of serious\nfluid leakage from hydrocarbon wells\nMehdi Rezvandehy1and Bernhard Mayer2\n1Department of Chemical and Petroleum Engineering, University of Calgary, Calgary, AB, Canada\n2Department of Geoscience, University of Calgary, Calgary, AB, Canada\nCorresponding author: Mehdi Rezvandehy; Email: mehdi.rezvandehy@ucalgary.ca\nReceived: 23 August 2022; Revised: 05 April 2023; Accepted: 14 April 2023\nKeywords: Energy wells; imbalanced class classification; imputation; probability estimation; resampling\nAbstract\nThe exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources,\nand greenhouse gas emissions. Fluids such as methane or CO 2may in some cases migrate toward the groundwater\nzone and atmosphere through and along imperfectly sealed hydrocarbon 

### Retrieval Augmented Generation (RAG)

In LangChain, retrievers help you search and retrieve information from your indexed documents. A retriever is an interface that returns documents based on an unstructured query, which makes it a more general tool than a vector store. Unlike a vector store, a retriever does not need to be able to store documents.
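
To make the distinction concrete, here is a minimal sketch of a retriever that stores nothing itself and only maps a query to matching texts. `FunctionRetriever` and `keyword_fetch` are hypothetical names for illustration; the method name mirrors the `get_relevant_documents` call used in this notebook, but the class is not part of LangChain and returns plain strings rather than `Document` objects.

```python
from typing import Callable, List


class FunctionRetriever:
    """Hypothetical retriever: wraps any function that maps a query to texts."""

    def __init__(self, fetch: Callable[[str], List[str]]):
        self._fetch = fetch

    def get_relevant_documents(self, query: str) -> List[str]:
        return self._fetch(query)


CORPUS = [
    "SCVF is the flow of gas out of the surface casing",
    "GM is a flow of gas outside the outermost casing string",
    "LU simulation generates correlated Gaussian realizations",
]


def keyword_fetch(query: str) -> List[str]:
    """Return every text that shares at least one word with the query."""
    query_words = set(query.lower().split())
    return [text for text in CORPUS if query_words & set(text.lower().split())]


retriever = FunctionRetriever(keyword_fetch)
print(retriever.get_relevant_documents("LU simulation"))
```

The retriever here could just as well call a search API or a database; a vector store is only one possible backend.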

There are 3 primary steps to RAG development in LangChain. 
1. **<span class="mark">Loading the documents into LangChain</span>** with document loaders. 
2. **<span class="mark">Splitting the documents into chunks</span>**. Chunks are units of information that we can index and process individually. 
3. **<span class="mark">Encoding and storing the chunks for retrieval</span>**, which could utilize a vector database if that meets the needs of the use case. Loading, splitting, and storing were each covered in the sections above; the cells below build on them to retrieve chunks and generate answers.

![image.png](attachment:image.png)

![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

In [74]:
## load the persisted db
vector_store = Chroma(persist_directory=persist_directory,
                      embedding_function=embeddings)

In [75]:
# make a retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 5})  # number of documents to retrieve is 5
docs = retriever.get_relevant_documents("Tell me more about ReAct prompting")
print(retriever.search_type)

similarity



In [76]:
print(docs[0].page_content)

and environmental benefits.
Impact Statement
Field test operations to detect methane and CO2 leakages from hydrocarbon wells can be costly. Most wells do
not have leaks or are categorized as non-serious, which means that no repair is needed until they are abandoned.
However, it is crucial to identify and prioritize serious leakages for immediate remediation to prevent environ-
mental pollution. This study developed a reliable predictive model by correlating the results of historical fieldtests with various well properties, including age, depth, production/injection history, and deviation, among
others. The trained model can predict the likelihood of serious leakage for untested wells, allowing for the
prioritization of wells with the highest probability of leaks for field testing. This approach leads to cost-effectivefield testing and environmental benefits.


In [77]:
docs[0]

Document(page_content='and environmental benefits.\nImpact Statement\nField test operations to detect methane and CO2 leakages from hydrocarbon wells can be costly. Most wells do\nnot have leaks or are categorized as non-serious, which means that no repair is needed until they are abandoned.\nHowever, it is crucial to identify and prioritize serious leakages for immediate remediation to prevent environ-\nmental pollution. This study developed a reliable predictive model by correlating the results of historical fieldtests with various well properties, including age, depth, production/injection history, and deviation, among\nothers. The trained model can predict the likelihood of serious leakage for untested wells, allowing for the\nprioritization of wells with the highest probability of leaks for field testing. This approach leads to cost-effectivefield testing and environmental benefits.', metadata={'page': 0, 'source': './data/paper.pdf'})

In [78]:
# Make a chain to answer questions
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=model_chat,
    chain_type="stuff",
    retriever=retriever,
    verbose=True,
    return_source_documents=True
)
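
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea in plain Python is shown below; `build_stuff_prompt` is a hypothetical helper and the template wording is an assumption, not the prompt LangChain actually sends.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved chunks into one prompt (illustrative only)."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "GM is a flow of any gas that is detectable at surface.",
    "SCVF is often referred to as internal migration.",
]
print(build_stuff_prompt("tell me what is gas migration", retrieved))
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason to keep `k` and `chunk_size` moderate.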

In [79]:
## Cite sources - helper function to prettify responses
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

query = "tell me what is gas migration"
llm_response = qa_chain(query)
# process_llm_response prints its output and returns None, so the outer print also shows "None"
print(process_llm_response(llm_response=llm_response))




> Entering new RetrievalQA chain...

> Finished chain.
Gas migration refers to the movement of gases, such as methane and CO2, from underground reservoirs to the surface or into surrounding areas like soils, shallow groundwater, or the atmosphere. In the context of energy wells, gas migration can occur due to improperly sealed wells, leading to the escape of gases through surface casing vent flows or other pathways. Monitoring gas migration is essential to detect leakage and prioritize repairs to prevent environmental contamination and greenhouse gas emissions.


Sources:
./data/paper.pdf
./data/paper.pdf
./data/paper.pdf
./data/paper.pdf
./data/paper.pdf
None


In [80]:
query = "what is the application of LU simulation?"
llm_response = qa_chain(query)
print(llm_response['result'])



> Entering new RetrievalQA chain...

> Finished chain.
The application of LU simulation includes geostatistics for geomo-del modeling, spatial resampling, imputation of missing data, and oversampling to improve the imbalance number of classes for classification. It can also be used for simulating correlated Gaussian realizations and for carrying out conditional simulation to simulate missing data conditioned based on non-missing values.


# Agents

LangChain is a framework for developing applications that utilize large language models (LLMs). A core feature of LangChain is its use of "agents"—systems capable of performing diverse tasks by leveraging LLMs, built around modular components that can interact with each other and external data sources.

![image.png](attachment:image.png)
Image retrieved from 
[Paulo Dichone,The Complete LangChain & LLMs Guide](https://learning.oreilly.com/videos/the-complete-langchain/9781835885925/9781835885925-video1_1/)

## Math Agent 

In [81]:
# Plain LLM call, no agent yet
llm = OpenAI(temperature=0.25)
print(llm.predict("what is the result of 4.2^3.2"))




The result of 4.2^3.2 is approximately 55.78.


In [82]:
4.2**3.2

98.71831395268974

The LLM on its own gives the wrong number: it reports about 55.78, whereas the correct value is about 98.72.

In [83]:
print(llm.predict("what is LangChain"))



LangChain is a decentralized blockchain platform that aims to provide a secure and efficient environment for language learning and teaching. It utilizes blockchain technology to create a transparent and decentralized ecosystem where students and teachers can connect, interact, and exchange language learning services without the need for intermediaries. The platform also offers features such as smart contracts, peer-to-peer payments, and a reputation system to ensure fair and reliable transactions. LangChain aims to revolutionize the traditional language learning industry by providing a more accessible, affordable, and personalized learning experience for individuals around the world.


In [84]:
from langchain.agents import Tool, initialize_agent, load_tools
from langchain.chains import LLMMathChain # to fix math issue

In [85]:
llm_math = LLMMathChain.from_llm(llm=llm)
math_tool = Tool(
    name="Calculator",
    func=llm_math.run,
    description="Useful for when you need to answer questions related to Math."
)

tools = [math_tool]
print(tools[0].name, tools[0].description)

Calculator Useful for when you need to answer questions related to Math.


In [86]:
# ReAct framework = Reasoning and Acting
# if the LLM cannot get the answer on its own, the agent calls a tool to get it
agent = initialize_agent(
    agent="zero-shot-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3 # to avoid high bills from the LLM
)


In [87]:
print(agent("what is the result of 4.2^3.2"))



> Entering new AgentExecutor chain...
I should use a calculator to solve this problem
Action: Calculator
Action Input: 4.2^3.2
Observation: Answer: 98.71831395268974
Thought: I now know the final answer
Final Answer: 98.71831395268974

> Finished chain.
{'input': 'what is the result of 4.2^3.2', 'output': '98.71831395268974'}


In [88]:
4.2**3.2

98.71831395268974

With the agent, the answer is now correct.
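
The agent gets the right number because the arithmetic is delegated to code instead of being guessed by the model. The sketch below shows the idea with a small, restricted expression evaluator built on Python's `ast` module. It is a stand-in under stated assumptions: `calculate` is a hypothetical function, it supports only `+ - * / **` on numbers, and it treats `^` as exponentiation; it is not how `LLMMathChain` is implemented.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> float:
    """Evaluate a purely numerical expression; reject anything else."""
    try:
        tree = ast.parse(expression.replace("^", "**"), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a numerical expression: {expression!r}") from exc

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError(f"not a numerical expression: {expression!r}")

    return _eval(tree.body)


print(calculate("4.2^3.2"))  # same value as 4.2**3.2
```

Walking the syntax tree and allowing only a fixed set of node types avoids calling `eval` on text produced by a model. Non-numerical input raises `ValueError`.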

* **Built-in Math Tool**

In [89]:
tools = load_tools(  # load_tools is a helper function that loads pre-built tools by name
    ['llm-math'],
    llm=llm
)
tools

[Tool(name='Calculator', description='Useful for when you need to answer questions about math.', func=<bound method Chain.run of LLMMathChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['question'], template='Translate a math problem into a expression that can be executed using Python\'s numexpr library. Use the output of running this code to answer the question.\n\nQuestion: ${{Question with math problem.}}\n```text\n${{single line mathematical expression that solves the problem}}\n```\n...numexpr.evaluate(text)...\n```output\n${{Output of running the code}}\n```\nAnswer: ${{Answer}}\n\nBegin.\n\nQuestion: What is 37593 * 67?\n```text\n37593 * 67\n```\n...numexpr.evaluate("37593 * 67")...\n```output\n2518731\n```\nAnswer: 2518731\n\nQuestion: 37593^(1/5)\n```text\n37593**(1/5)\n```\n...numexpr.evaluate("37593**(1/5)")...\n```output\n8.222831614237718\n```\nAnswer: 8.222831614237718\n\nQuestion: {question}\n'), llm=OpenAI(client=<openai.resources.completions.Completions o

In [90]:
agent = initialize_agent(
    agent="zero-shot-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3 # to avoid high bills from the LLM
)

In [91]:
print(agent("what is the result of 4.9^3.2"))



> Entering new AgentExecutor chain...
I should use a calculator to solve this problem
Action: Calculator
Action Input: 4.9^3.2
Observation: Answer: 161.66926210092953
Thought: I now know the final answer
Final Answer: 161.66926210092953

> Finished chain.
{'input': 'what is the result of 4.9^3.2', 'output': '161.66926210092953'}


In [92]:
query = """If I have $ 100.45, and give 20% of that to my brother and 10% to my 
 sister then receive 56.9 from my father, how much money I will have at the end?"""
result = agent(query)
print(result['output'])



> Entering new AgentExecutor chain...
You should always think about what to do
Action: Calculator
Action Input: 100.45 - (100.45 * 0.2) - (100.45 * 0.1) + 56.9
Observation: Answer: 127.215
Thought: I now know the final answer
Final Answer: $127.215

> Finished chain.
$127.215


In [93]:
print(agent("how far (in km) from here to the moon?"))



> Entering new AgentExecutor chain...
We can use the distance formula to calculate the distance between two points.
Action: Calculator
Action Input: distance formula

ValueError: LLMMathChain._evaluate("
sqrt((x2-x1)**2 + (y2-y1)**2)
") raised error: 'x1'. Please try again with a valid numerical expression

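This agent cannot answer the question: its only tool is the calculator, which raises an error when the input is not a valid numerical expression.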

In [94]:
print(agent("What is the capital city of Iran?"))



> Entering new AgentExecutor chain...
I don't know the answer, but I can use a calculator to find it.
Action: Calculator
Action Input: "capital city of Iran"

ValueError: LLMMathChain._evaluate("
"Tehran"
") raised error: data type must provide an itemsize. Please try again with a valid numerical expression

<span class="mark">Since the only tool we created is `llm-math`, the agent cannot answer this question: the only Action available to it is **Calculator**.</span>

## Adding General Knowledge Tool for Agent 

In [95]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.llms import OpenAI
from langchain.agents import Tool, initialize_agent, load_tools

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [96]:
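#==== Using the OpenAI LLM wrapper =======
# note: llm_model is defined but never passed to OpenAI(), which therefore uses its default model
llm_model = "gpt-3.5-turbo"
llm = OpenAI(temperature=0.2)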

In [97]:
# Second Generic Tool
prompt = PromptTemplate(
    input_variables=["query"],
    template="{query}"
)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [98]:
# Initialize the LLM Tool
llm_tool = Tool(
    name="Language Model",
    func=llm_chain.run,
    description="Use this tool for general queries and logic"
)

In [99]:
tools = load_tools(
    ['llm-math'],
    llm=llm
)
tools.append(llm_tool) # adding the new tool to our tools list

In [100]:
#ReAct framework = Reasoning and Action
agent = initialize_agent(
    agent="zero-shot-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3 # to avoid high bills from the LLM
)
query = "What is the capital city of Iran?"
    
print(agent.agent.llm_chain.prompt.template)

Answer the following questions as best you can. You have access to the following tools:

Calculator(*args: Any, callbacks: Union[List[langchain_core.callbacks.base.BaseCallbackHandler], langchain_core.callbacks.base.BaseCallbackManager, NoneType] = None, tags: Optional[List[str]] = None, metadata: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Any - Useful for when you need to answer questions about math.
Language Model(*args: Any, callbacks: Union[List[langchain_core.callbacks.base.BaseCallbackHandler], langchain_core.callbacks.base.BaseCallbackManager, NoneType] = None, tags: Optional[List[str]] = None, metadata: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Any - Use this tool for general queries and logic

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Calculator, Language Model]
Action Input: the input to the action
Observation: the result of the action

In [101]:
result = agent(query)
print(result['output'])



> Entering new AgentExecutor chain...
I should use a language model to answer this question.
Action: Language Model
Action Input: "What is the capital city of Iran?"
Observation: 

The capital city of Iran is Tehran.
Thought: I now know the final answer.
Final Answer: Tehran

> Finished chain.
Tehran


In [102]:
query = """If I have $100.45, and give 20% of that to my brother and 10% to my 
 sister then receive 56.9 from my father, how much money I will have at the end?"""
result = agent(query)
print(result['output'])



> Entering new AgentExecutor chain...
I need to calculate the total amount of money I have and then subtract 20% and 10% from it, and then add 56.9 to the result.
Action: Calculator
Action Input: 100.45 - (20% of 100.45) - (10% of 100.45) + 56.9
Observation: Answer: 127.215
Thought: I now know the final answer
Final Answer: At the end, I will have $127.215.

> Finished chain.
At the end, I will have $127.215.


## Agents Types

LangChain provides several agent types:

* **<span class="mark">zero-shot-react-description</span>**: The agent is not given worked examples of the task; instead, it relies on the general capabilities of the language model to decide what to do from the question and the description of each tool.


* **<span class="mark">conversational-react-description</span>**: This type of agent is also based on the ReAct framework but is tailored for conversational contexts. It can handle back-and-forth dialogue, maintaining context and reasoning through the conversation to perform actions based on descriptions given during the interaction.


* **<span class="mark">react-docstore</span>**: This agent type combines the ReAct framework with the capability to interact with document stores. It is designed to reason about and act on information stored in documents, such as databases or knowledge bases.

### `zero-shot-react-description`

In [103]:
#ReAct framework = Reasoning and Action
agent = initialize_agent(
    agent="zero-shot-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3 # to avoid high bills from the LLM
)
query = "What is the capital city of Iran?"
    
# show the template used by our agent to represent what is going on under the hood
print(agent.agent.llm_chain.prompt.template)

Answer the following questions as best you can. You have access to the following tools:

Calculator(*args: Any, callbacks: Union[List[langchain_core.callbacks.base.BaseCallbackHandler], langchain_core.callbacks.base.BaseCallbackManager, NoneType] = None, tags: Optional[List[str]] = None, metadata: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Any - Useful for when you need to answer questions about math.
Language Model(*args: Any, callbacks: Union[List[langchain_core.callbacks.base.BaseCallbackHandler], langchain_core.callbacks.base.BaseCallbackManager, NoneType] = None, tags: Optional[List[str]] = None, metadata: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Any - Use this tool for general queries and logic

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Calculator, Language Model]
Action Input: the input to the action
Observation: the result of the action

In [104]:
query = """who is donald trump?"""
result = agent(query)
print(result['output'])



> Entering new AgentExecutor chain...
I should use a language model to answer this question.
Action: Language Model
Action Input: "Who is Donald Trump?"
Observation: 

Donald Trump is a businessman, television personality, and politician who served as the 45th President of the United States from 2017 to 2021. He was born on June 14, 1946 in New York City and grew up in Queens. Trump is known for his real estate empire, including the development of luxury properties such as Trump Tower in New York City. He also gained fame as the host of the reality TV show "The Apprentice." In 2016, Trump ran for president as the Republican nominee and won the election. During his presidency, he implemented policies on immigration, trade, and taxes, and faced numerous controversies and impeachment proceedings. He left office in January 2021 after losing the 2020 election to Joe Biden.
Thought: I now know the final answer.
Final Answer: Donald Tr

### `conversational-react-description`

In [105]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.llms import OpenAI
from langchain.agents import Tool, initialize_agent, load_tools

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory

In [106]:
#==== Using the OpenAI LLM wrapper (llm_model below is not passed to it) =======
llm_model = "gpt-3.5-turbo"
llm = OpenAI(temperature=0.0)

In [107]:
# memory
memory = ConversationBufferMemory(memory_key="chat_history")

# Second Generic Tool
prompt = PromptTemplate(
    input_variables=["query"],
    template="{query}"
)

llm_chain = LLMChain(llm=llm, prompt=prompt)

# Initialize the LLM Tool
llm_tool = Tool(
    name="Language Model",
    func=llm_chain.run,
    description="Use this tool for general queries and logic"
)
 
tools = load_tools(
    ['llm-math'],
    llm=llm
)
tools.append(llm_tool) # adding the new tool to our tools list

In [108]:
# Conversational Agent
conversational_agent = initialize_agent(
    agent="conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    memory=memory
)
    
print(conversational_agent.agent.llm_chain.prompt.template)

Assistant is a large language model trained by OpenAI.

Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.

Overall, Assistant is a powerful tool that can help with a wide range of tasks 

In [109]:
query_1 = "I was married at 2012"
query_2 = "I completed my phd 5 years after that"
query_3 = "At what year, I completed my PhD?"

# run the three turns in order; the memory carries earlier turns forward
result = conversational_agent(query_1)
result = conversational_agent(query_2)
result = conversational_agent(query_3)
print(result['output'])  # answer to the final question



> Entering new AgentExecutor chain...

Thought: Do I need to use a tool? Yes
Action: Calculator
Action Input: 2021 - 2012
Observation: Answer: 9
Thought: Do I need to use a tool? No
AI: Congratulations on your marriage! How has your relationship evolved since then?

> Finished chain.


> Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: Language Model
Action Input: Can you tell me more about your PhD?
Observation: 

Sure, my PhD is in the field of psychology, specifically in the area of social psychology. My research focuses on the influence of social media on self-esteem and body image in young adults. I am interested in understanding how social media use affects individuals' perceptions of themselves and their bodies, and how this can impact their mental health and well-being.

To conduct my research, I have been using a combination of quantitativ
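
The memory object is what lets a later turn refer back to an earlier one: previous exchanges are stored and placed in front of each new input. A minimal sketch of a buffer memory in plain Python is shown below. `BufferMemory` and `build_prompt` are hypothetical names, the AI replies are made-up placeholders, and the prompt layout is an assumption rather than the template used by `ConversationBufferMemory`.

```python
class BufferMemory:
    """Keeps every (human, ai) exchange in order."""

    def __init__(self):
        self.turns = []

    def save(self, human, ai):
        self.turns.append((human, ai))

    def as_text(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def build_prompt(memory, new_input):
    """Prepend the stored chat history to the new user input."""
    return f"{memory.as_text()}\nHuman: {new_input}\nAI:".lstrip()


memory = BufferMemory()
memory.save("I was married at 2012", "Noted.")                   # placeholder reply
memory.save("I completed my phd 5 years after that", "Noted.")  # placeholder reply
print(build_prompt(memory, "At what year, I completed my PhD?"))
```

Because the whole history is resent on every turn, the prompt grows with the conversation, which is the main trade-off of a plain buffer.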

### `react-docstore (docstore)`

In [110]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.llms import OpenAI
from langchain.agents import Tool, initialize_agent, load_tools


from langchain import Wikipedia
from langchain.agents.react.base import DocstoreExplorer

#==== Using the OpenAI LLM wrapper (llm_model below is not passed to it) =======
llm_model = "gpt-3.5-turbo"
llm = OpenAI(temperature=0.0)

In [111]:
#pip install wikipedia

In [112]:
docstore = DocstoreExplorer(Wikipedia())
tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="search wikipedia"
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="lookup a term in wikipedia"
    )
]


In [113]:
# initialize our agent
docstore_agent = initialize_agent(
    tools,
    llm,
    agent="react-docstore",
    verbose=True,
    max_iterations=3
)

query = "Who is Ali Daei?"
result = docstore_agent.run(query)
# print(docstore_agent.agent.llm_chain.prompt.template)



> Entering new AgentExecutor chain...
Thought: I need to search Ali Daei and find out who he is.
Action: Search[Ali Daei]
Observation: Could not find [Ali Daei]. Similar: ['Ali Daei', 'Almoez Ali', "List of men's footballers with 50 or more international goals", 'Persepolis F.C.', 'List of international goals scored by Ali Daei', 'Ali Karimi', 'Saipa F.C.', 'Iran national football team', 'List of Iran national football team managers', 'Saba Qom F.C.']
Thought: Ali Daei is not a well-known person, so I need to search for more information about him.
Action: Search[Ali Daei football]
Observation: Ali Daei (Persian:  pronounced [ʔæliː dɑːjiː]; born 21 March 1969) is an Iranian football manager and former professional footballer. A striker, he was the captain of the Iranian national team between 2000 and 2006. He played in the German Bundesliga for Arminia Bielefeld, Bayern Munich and Hertha Berlin. He is regarded as 

### Self-Ask Agent (google-search)

In [114]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.llms import OpenAI
from langchain.agents import Tool, initialize_agent, load_tools
from langchain import SerpAPIWrapper

First, we need access to [SerpApi](https://serpapi.com/?gad_source=1&gclid=Cj0KCQjwqpSwBhClARIsADlZ_Tni__Qr6UXUDLL6N9wX4K39ZccgklyRLtQC6dKCyPyf-1cdfeiaRRgaAtq_EALw_wcB), which the agent uses to run Google searches:

![image.png](attachment:image.png)

Then register to obtain an API key:
![image.png](attachment:image.png)

In [115]:
# Placeholder: get an API key from https://serpapi.com/ and set it here (or store it in .env)
os.environ["SERPAPI_API_KEY"] = "......"
SERP_API_KEY = os.getenv("SERPAPI_API_KEY")

In [116]:
# pip install google-search-results

In [117]:
#==== Using the OpenAI completion-style LLM wrapper =======
llm_model = "gpt-3.5-turbo"  # defined for reference; not passed to OpenAI() below
llm = OpenAI(temperature=0.7)

search = SerpAPIWrapper(serpapi_api_key=SERP_API_KEY)

# tools
tools = [
    Tool(
        name="Intermediate Answer",
        func=search.run,
        description="google search"
    )
]

In [118]:
# initialize our agent
self_ask_with_search = initialize_agent(
    tools,
    llm,
    agent='self-ask-with-search',
    handle_parsing_errors=True,
    verbose=True
)

query = "What is largest ocean in the world?"
result = self_ask_with_search(query)



> Entering new AgentExecutor chain...
Could not parse output:  No.
Intermediate answer: Invalid or incomplete response
So the final answer is: Pacific Ocean 

> Finished chain.


See this page for other agent types: https://python.langchain.com/docs/modules/agents/agent_types/

# Real-World Use Cases

## Bill Extractor

This section shows how to extract the following fields from monthly bills in PDF format:

* "Previous balance"
* "Electricity"
* "Natural Gas"
* "Water Treatment and Supply"
* "Wastewater Collection and Treatment"
* "Stormwater Management"
* "Waste and Recycling"
* "Due Date"
* "Total Amount Due"
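
Once an LLM returns these fields as a JSON-like string, the reply still has to be turned into a Python dictionary. The sketch below shows one cautious way to do that with the standard library only. The `parse_bill_response` helper and the sample reply are hypothetical and serve only as an illustration.

```python
import json
import re


def parse_bill_response(response: str) -> dict:
    """Pull the outermost {...} block out of an LLM reply and parse it as JSON."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the response")
    return json.loads(match.group(0))


# Hypothetical reply; a real reply depends on the model and the bill
sample_reply = 'Result: {"Due Date": "2024 March 05", "Total Amount Due": 4568, "Electricity": 124}'
bill = parse_bill_response(sample_reply)
print(bill["Total Amount Due"])
```

Parsing with `json.loads` fails loudly on malformed output instead of executing it, which is usually preferable when the text comes from a model.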




In [119]:
from langchain.llms import OpenAI
from pypdf import PdfReader
import pandas as pd
import re
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.agents.agent_types import AgentType
import PyPDF2
import openai
import os
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")


# get Info from PDF file
def pdf_text(pdf_doc):
    text=""  # make empty text
    pdf_reader = PyPDF2.PdfReader(pdf_doc)
    for page in pdf_reader.pages:  # read each page and convert to text
        text += page.extract_text()
    return text



# get data from text of pdf
def extracted_data(pages_data):
    template = """Extract all the following values: "Previous balance", "Electricity", "Natural Gas", 
         "Water Treatment and Supply", "Wastewater Collection and Treatment", 
         "Stormwater Management", "Waste and Recycling", "Due Date" and "Total Amount Due".
         First read the text to find the key phrase.
         {pages}

        Expected output: dollar sign should be removed
        {{"Due Date": "2024 March 05", "Total Amount Due": 4568, "Previous balance": 546, "Electricity": 124, "Natural Gas": 452, "Water Treatment and Supply": 456, "Wastewater Collection and Treatment": 145, "Stormwater Management": 12, "Waste and Recycling": 12}}
        Please notice "Due Date" comes after "If payment is received after".
        """
    prompt_template = PromptTemplate(input_variables=["pages"], template=template)
    llm = OpenAI(temperature=0.0)
    full_response = llm(prompt_template.format(pages=pages_data))
    
    full_response = full_response.replace('\n','')
    return full_response


In [120]:
# create a DataFrame from the uploaded pdfs
import ast

def create_docs(user_pdf_list):
    columns = ["Due Date", "Total Amount Due", "Previous balance",
               "Electricity", "Natural Gas",
               "Wastewater Collection and Treatment", "Stormwater Management",
               "Water Treatment and Supply", "Waste and Recycling"]
    rows = []
    for ir, filename in enumerate(user_pdf_list, start=1):
        print(f"File {ir}: {filename}")
        raw_data = pdf_text(filename)
        llm_extracted_data = extracted_data(raw_data)

        # capture everything between the first '{' and the last '}'
        # (re.DOTALL lets '.' match newlines as well)
        pattern = r'{(.+)}'
        match = re.search(pattern, llm_extracted_data, re.DOTALL)

        if not match:
            print("Nothing found.")
            continue  # skip this file instead of reusing data from a previous one

        # Convert the extracted text to a dictionary; ast.literal_eval accepts
        # only Python literals, so it is safer than eval on LLM output
        data_dict = ast.literal_eval('{' + match.group(1) + '}')
        rows.append(data_dict)

    # one row per bill, in a fixed column order
    return pd.DataFrame(rows, columns=columns)

In [121]:
print("----------Load PDF files----------")
print("")
user_pdf_list=[
"BillExtractor/pdfs/2024_January.pdf",
"BillExtractor/pdfs/2024_February.pdf",
"BillExtractor/pdfs/2024_March.pdf",
]
df = create_docs(user_pdf_list)
print('\n')
print('Here is extracted information from the bills')
df

----------Load PDF files----------

File 1: BillExtractor/pdfs/2024_January.pdf


File 2: BillExtractor/pdfs/2024_February.pdf
File 3: BillExtractor/pdfs/2024_March.pdf


Here is extracted information from the bills


| | Due Date | Total Amount Due | Previous balance | Electricity | Natural Gas | Wastewater Collection and Treatment | Stormwater Management | Water Treatment and Supply | Waste and Recycling |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 February 12 | 362.97 | 355.22 | 101.48 | 113.66 | 54.95 | 17.71 | 40.25 | 24.17 |
| 1 | 2024 March 11 | 463.31 | 362.97 | 102.19 | 224.1 | 44.77 | 14.59 | 33.11 | 28.23 |
| 2 | 2024 April 08 | 361.63 | 463.31 | 104.71 | 131.92 | 42.58 | 15.11 | 30.63 | 24.84 |


## Multi-Doc-Chatbot

In this section, a simple chatbot is developed to retrieve information from a PDF file.

In [122]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader

# load_qa_chain builds a chain that answers a question
# from a list of documents passed to it
from langchain.chains.question_answering import load_qa_chain 

load_dotenv(find_dotenv())
openai.api_key = os.getenv("OPENAI_API_KEY")


#==== Using OpenAI Chat API =======
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

#### === packages to install ====
# pip install langchain pypdf openai chromadb tiktoken docx2txt

# load a pdf file
loader_pdf = PyPDFLoader('./MultiDocsChat/SampleResume.pdf')
docs = loader_pdf.load()

#set up question answering chain
chain = load_qa_chain(llm, verbose=True)
query = 'What is the name of person?'
response = chain.run(input_documents=docs,
                     question=query)

print(response)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
DANA LEE
Sales Representative
danalee@email.com (123) 456-7890 Provo, UT
LinkedIn
WORK EXPERIENCE
Sales Representative
Allied
November 2017 - current Provo, UT
Promoted the value of the customer loyalty program, leading to a
12% out-performance of sign-up targets
Performed in the top 5% of sales representatives in 2019 and
2020 in the intermountain region
Collaborated directly with potential clients, providing contract
estimates and building trust that resulted in 5-year customer
loyalty on average
Ensured customers received quality customer service, reducing
the likelihood of negative customer reviews by 80%
Attended 4 networking events annually, building relations with


Now we want to chat with the document. The first step is to divide the document into chunks, so that only the most relevant pieces are passed to the model.
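
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking in plain Python. It slices by character position only, whereas `CharacterTextSplitter` first splits on a separator and then merges the pieces, so treat `split_with_overlap` as an illustrative assumption rather than the library's algorithm.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.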

In [123]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# load a pdf file
loader_pdf = PyPDFLoader('./MultiDocsChat/SampleResume.pdf')
docs = loader_pdf.load()

In [124]:
# Split the data into chunks
text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
docs_chunks = text_splitter.split_documents(docs)

In [125]:
# create our vector db chromadb
vectordb = Chroma.from_documents(
    documents=docs_chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory='./data'
)
vectordb.persist()

In [126]:
# RetrievalQA chain to get info from the vectorstore
chain_qa = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_kwargs={'k':4}),
    return_source_documents=True
)

#result_qa = chain_qa('What is LEE SKILLS in bullet points?')
result_qa = chain_qa('Where LEE works from 2012 to 2013?')
print(result_qa['result'])

Dana Lee worked as a Sales Associate at RMI Distributing from January 2012 to April 2014.


This chatbot does not have any memory: each question is answered independently of the previous ones.
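
One common way to add memory is to keep the earlier turns as a list of (question, answer) pairs and send them along with each new question. The sketch below shows the idea in plain Python. `build_prompt`, the `Human:`/`Assistant:` labels, and the sample answer are assumptions for illustration, not the prompt format of any particular chain.

```python
from typing import List, Tuple

ChatHistory = List[Tuple[str, str]]


def build_prompt(question: str, chat_history: ChatHistory) -> str:
    """Prefix the new question with earlier turns so references can be resolved."""
    lines = []
    for past_question, past_answer in chat_history:
        lines.append(f"Human: {past_question}")
        lines.append(f"Assistant: {past_answer}")
    lines.append(f"Human: {question}")
    return "\n".join(lines)


history: ChatHistory = []
# Hypothetical earlier turn about the sample resume
history.append(("Where does Dana Lee work?", "At Allied, as a Sales Representative."))
prompt = build_prompt("Since when?", history)
print(prompt)
```

Without the first turn, the follow-up "Since when?" would be impossible to answer, which is exactly the limitation of a memory-less chain.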

### Streamlit App: Full Multi-Document Chatbot

In [127]:
import os
from dotenv import find_dotenv, load_dotenv
import openai
import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader # to load pdf files
from langchain.document_loaders import Docx2txtLoader # to load word files
from langchain.document_loaders import TextLoader # to load text files
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain

# load_qa_chain builds a chain that answers a question
# from a list of documents passed to it
from langchain.chains.question_answering import load_qa_chain 
from streamlit_chat import message # pip install streamlit_chat

load_dotenv(find_dotenv())

openai.api_key = os.getenv("OPENAI_API_KEY") 


#==== Using OpenAI Chat API =======
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

#### === packages to install ====
# pip install langchain pypdf openai chromadb tiktoken docx2txt

# upload one or more files
import tempfile

files = st.file_uploader("Please upload your files", accept_multiple_files=True,
                             type=["txt", "docx", "pdf"])

if files:
    documents = []
    loaders = {'.txt': TextLoader, '.docx': Docx2txtLoader, '.pdf': PyPDFLoader}
    for ifiles in files:
        suffix = os.path.splitext(ifiles.name)[1].lower()
        if suffix not in loaders:
            continue  # unsupported extension
        # the loaders read from a path on disk, so save each upload to a temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(ifiles.getvalue())
        documents.extend(loaders[suffix](tmp.name).load())
        os.remove(tmp.name)  # the content is loaded; the temporary file is no longer needed
    
    # split the data into chunks
    text_splitter = CharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=5
    )
    docs = text_splitter.split_documents(documents)
    
    # create vector db chromadb from the chunks (not from the unsplit documents)
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(),
        persist_directory='./MultiDocsChat/data'
    )
    vectordb.persist()
    
    chain_qa = ConversationalRetrievalChain.from_llm(
        llm,
        vectordb.as_retriever(search_kwargs={'k': 5}),
        return_source_documents=True,
        verbose=False
    )
    
    #-------- Streamlit front-end #--------
    st.title("QA Bot for Documents by Langchain")
    st.header("You can ask anything about your document... 🤖")
    
    # Streamlit reruns the script on every interaction, so anything that
    # must survive between questions is kept in st.session_state
    if 'produced' not in st.session_state:
        st.session_state['produced'] = []
        
    if 'old' not in st.session_state:
        st.session_state['old'] = []

    if 'chat_history' not in st.session_state:
        st.session_state['chat_history'] = []
    
    # get the user input
    user_input = st.chat_input("Ask a question from your documents...")
    if user_input:
        result = chain_qa({'question': user_input,
                           'chat_history': st.session_state['chat_history']})
        # remember this turn as a (question, answer) pair for follow-up questions
        st.session_state['chat_history'].append((user_input, result['answer']))
        st.session_state.old.append(user_input)
        st.session_state.produced.append(result['answer'])
        
        
    if st.session_state['produced']:
        for i in range(len(st.session_state['produced'])):
            message(st.session_state['old'][i], is_user=True, key=str(i)+ '_user')
            message(st.session_state['produced'][i], key=str(i))
 

## Question answering

The dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset).

**Overview:**

It contains three question files, one for each student year: S08, S09, and S10, along with 690,000 words of cleaned text from Wikipedia, which was used to generate the questions.

The "question_answer_pairs.txt" files include both the questions and answers. The columns in this file are:

- **ArticleTitle**: The title of the Wikipedia article from which the questions and answers were derived.
- **Question**: The question itself.
- **Answer**: The corresponding answer.
- **DifficultyFromQuestioner**: The difficulty rating assigned by the question-writer.
- **DifficultyFromAnswerer**: The difficulty rating given by the person who answered, which may differ from the one in the previous column.
- **ArticleFile**: The file name of the relevant article.

Questions deemed poor in quality were excluded from this dataset.
We mainly need the **Question** and **Answer** columns; **ArticleTitle** is also kept to give each pair some context.
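
As a small illustration of this column selection, the sketch below reads a tiny in-memory table, keeps three columns, and drops repeated questions using only the standard library. The questions and answers come from the dataset, but the difficulty and file values are made up for the example, and the notebook itself uses pandas for this step.

```python
import csv
import io

# Tiny stand-in for one question/answer file (difficulty and file values are invented)
raw = """ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,medium,example_file
Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,easy,example_file
Abraham_Lincoln,When did Lincoln begin his political career?,1832,medium,easy,example_file
"""

keep = ["ArticleTitle", "Question", "Answer"]
seen = set()
pairs = []
for row in csv.DictReader(io.StringIO(raw)):
    key = (row["ArticleTitle"], row["Question"])
    if key in seen:  # the same question for the same article was already kept
        continue
    seen.add(key)
    pairs.append({name: row[name] for name in keep})

print(len(pairs))  # 2
```

The same question often appears more than once because several people answered it, which is why duplicates are removed on the article and question rather than on the whole row.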

In [128]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain_community.document_loaders.merge import MergedDataLoader

In [129]:
import pandas as pd
pd.set_option('display.max_colwidth', None)


# List of files
csv_files = ['./question_answering/S08_question_answer_pairs.csv',
             './question_answering/S09_question_answer_pairs.csv',
             './question_answering/S10_question_answer_pairs.csv']

## Load and concatenate the DataFrames
#dataframes = [load_txt_file(file) for file in txt_files]
#merged_df = pd.concat(dataframes, ignore_index=True)
#
## Display the merged DataFrame
#merged_df

In [130]:
clmns = ['ArticleTitle', 'Question', 'Answer']
# read the three files, stack them, and keep only the required columns
df = pd.concat(map(pd.read_csv, csv_files))[clmns]

In [131]:
df.reset_index(drop=True, inplace=True)
df = df.drop_duplicates(
    subset=['ArticleTitle', 
            'Question'], 
    keep='first').reset_index(drop=True)

In [132]:
df.head()

| | ArticleTitle | Question | Answer |
|---|---|---|---|
| 0 | Abraham_Lincoln | Was Abraham Lincoln the sixteenth President of the United States? | yes |
| 1 | Abraham_Lincoln | Did Lincoln sign the National Banking Act of 1863? | yes |
| 2 | Abraham_Lincoln | Did his mother die of pneumonia? | no |
| 3 | Abraham_Lincoln | How many long was Lincoln's formal education? | 18 months |
| 4 | Abraham_Lincoln | When did Lincoln begin his political career? | 1832 |


In [133]:
questions_answers = []
for index, row in df.iterrows():
    txt = f"ArticleTitle: {row['ArticleTitle']}, Question: {row['Question']}, Answer: {row['Answer']}"
    questions_answers.append(txt+"\n")
questions_answers = ' '.join(questions_answers)

In [134]:
len(questions_answers)

303781

In [135]:
# Split the data into chunks
text_splitter = CharacterTextSplitter(separator="\n",
    chunk_size=800,
    chunk_overlap=400
)
qa_chunks = text_splitter.split_text(questions_answers)

Created a chunk of size 988, which is longer than the specified 800


In [136]:
qa_chunks[:2]

["ArticleTitle: Abraham_Lincoln, Question: Was Abraham Lincoln the sixteenth President of the United States?, Answer: yes\n ArticleTitle: Abraham_Lincoln, Question: Did Lincoln sign the National Banking Act of 1863?, Answer: yes\n ArticleTitle: Abraham_Lincoln, Question: Did his mother die of pneumonia?, Answer: no\n ArticleTitle: Abraham_Lincoln, Question: How many long was Lincoln's formal education?, Answer: 18 months\n ArticleTitle: Abraham_Lincoln, Question: When did Lincoln begin his political career?, Answer: 1832\n ArticleTitle: Abraham_Lincoln, Question: What did The Legal Tender Act of 1862 establish?, Answer: the United States Note, the first paper currency in United States history",
 "ArticleTitle: Abraham_Lincoln, Question: How many long was Lincoln's formal education?, Answer: 18 months\n ArticleTitle: Abraham_Lincoln, Question: When did Lincoln begin his political career?, Answer: 1832\n ArticleTitle: Abraham_Lincoln, Question: What did The Legal Tender Act of 1862 estab

In [137]:
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain.schema.document import Document
from typing import List

In [138]:
embedding = OpenAIEmbeddings()

persist_directory = "./chroma_qa"
# create our vector db chromadb
vectordb = Chroma.from_texts(
    texts=qa_chunks,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()

In [139]:
import sqlite3
import pandas as pd

# path to the SQLite file inside the Chroma persist directory
sqlite_path = "./chroma_qa/chroma.sqlite3"

# Create a SQL connection to our SQLite database
con = sqlite3.connect(sqlite_path)

cur = con.cursor()

# Return all results of query
cur.execute('SELECT * FROM embedding_metadata limit 10')
#cur.fetchall()

<sqlite3.Cursor at 0x1bf154902d0>

In [140]:
## load the persisted db
vector_store = Chroma(persist_directory="./chroma_qa",
                      embedding_function=OpenAIEmbeddings())

In [141]:
# make a retriever
retriever_qa = vector_store.as_retriever(search_type="similarity", 
                                         search_kwargs={"k": 1})  # retrieve only the single most similar chunk
docs = retriever_qa.get_relevant_documents("When was the Six Day War?")
print(retriever_qa.search_type)

similarity
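
To give a feel for what a similarity search with `k` results does, here is a toy sketch that ranks chunks by cosine similarity between embedding vectors. The three-dimensional vectors are made up, and the `cosine_similarity` and `top_k` helpers are assumptions for illustration. The vector store may use a different distance measure and a far more efficient index.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(indexed,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional "embeddings" for three chunks
indexed = [
    ("chunk about the Six-Day War", [0.9, 0.1, 0.0]),
    ("chunk about Abraham Lincoln", [0.0, 0.2, 0.9]),
    ("chunk about oceans", [0.1, 0.9, 0.1]),
]
print(top_k([1.0, 0.0, 0.1], indexed, k=1))
```

With `k=1` only the best-matching chunk reaches the LLM, which keeps the prompt short but leaves no fallback if the top match is off-topic.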


In [142]:
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

In [143]:
# RetrievalQA chain to get info from the vectorstore
chain_qa = RetrievalQA.from_chain_type(
    llm,
    chain_type="map_reduce",
    retriever=retriever_qa,
    verbose=True,
    return_source_documents=True,
)

In [144]:
result_qa = chain_qa("When was the Six Day War?")
print(result_qa['result'])



> Entering new RetrievalQA chain...

> Finished chain.
The Six-Day War took place from June 5 to June 10, 1967.


In [145]:
prompt_template = """
{question}
"""

In [146]:
# Define the PromptTemplate with the custom template
prompt = PromptTemplate(
    input_variables=["question"],  # The variable used inside the template
    template=prompt_template  # The custom template defined above
)

# Create a custom query with the template
query = "When was the Six Day War?"
formatted_query = prompt.format(question=query)


llm_response = chain_qa({'query': formatted_query})

print(llm_response['result'])



> Entering new RetrievalQA chain...

> Finished chain.
The Six-Day War took place from June 5 to June 10, 1967.


### Streamlit App

In [147]:
import os
import time
from dotenv import find_dotenv, load_dotenv
import openai
import pandas as pd
import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader # to load pdf files
from langchain.document_loaders import Docx2txtLoader # to load word files
from langchain.document_loaders import TextLoader # to load text files
from langchain.chains import RetrievalQA
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import PromptTemplate

# load_qa_chain builds a chain that answers a question
# from a list of documents passed to it
from langchain.chains.question_answering import load_qa_chain 
from streamlit_chat import message # pip install streamlit_chat

load_dotenv(find_dotenv())

openai.api_key = os.getenv("OPENAI_API_KEY") 


#==== Using OpenAI Chat API =======
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

#### === packages to install ====
# pip install langchain pypdf openai chromadb tiktoken docx2txt

#-------- Streamlit front-end #--------
st.title("Document QA Bot powered by LangChain")
st.header("Feel free to ask any questions about your document... 🤖")


# Load CSV files
files = st.file_uploader("Please upload your files", accept_multiple_files=True, type=["csv"])

if files:
    dfs = []
    clmns = ['ArticleTitle', 'Question', 'Answer']
    
    # Read each uploaded CSV file and filter the required columns
    for file in files:
        df = pd.read_csv(file)
        if all(col in df.columns for col in clmns):
            dfs.append(df[clmns])
        else:
            st.warning(f"File {file.name} does not contain the required columns: {clmns}")

    # Concatenate all DataFrames; stop this run if no file had the required columns
    if dfs:
        merged_df = pd.concat(dfs, ignore_index=True)
    else:
        st.warning("No valid files uploaded.")
        st.stop()  # merged_df is undefined, so nothing below can run

    
    # Example function to simulate a time-consuming task
    def long_task():
        time.sleep(3)  # Simulate a 3-second task

    
    # Display spinner while running the long task
    with st.spinner("Please wait, processing..."):
        long_task()

    merged_df.reset_index(drop=True, inplace=True)
    merged_df = merged_df.drop_duplicates(
    subset=['ArticleTitle', 
            'Question'], 
    keep='first').reset_index(drop=True)
    
    st.write(merged_df.head())

    questions_answers = []
    for index, row in merged_df.iterrows():
        txt = f"ArticleTitle: {row['ArticleTitle']}, Question: {row['Question']}, Answer: {row['Answer']}"
        questions_answers.append(txt+"\n")
    questions_answers = ' '.join(questions_answers)

    # Split the data into chunks
    text_splitter = CharacterTextSplitter(separator="\n",
        chunk_size=800,
        chunk_overlap=400
    )
    qa_chunks = text_splitter.split_text(questions_answers)

    prompt_template = """
    {question}
    """

    # Define the PromptTemplate with the custom template
    prompt = PromptTemplate(
        input_variables=["question"],  # The variable used inside the template
        template=prompt_template  # The custom template defined above
    )

    embedding = OpenAIEmbeddings()
    
    # create our vector db chromadb
    vectordb = Chroma.from_texts(
        texts=qa_chunks,
        embedding=embedding,
        persist_directory='./chroma_qa'
    )
    vectordb.persist()
    
    # Display spinner while running the long task
    with st.spinner("Please wait, processing..."):
        long_task()

    # RetrievalQA chain to get info from the vectorstore
    chain_qa = RetrievalQA.from_chain_type(
        llm,
        retriever=vectordb.as_retriever(search_kwargs={'k':2}),
        return_source_documents=True,
        verbose=True
    )
    # keep the questions and answers across Streamlit reruns
    if 'produced' not in st.session_state:
        st.session_state['produced'] = []
        
    if 'old' not in st.session_state:
        st.session_state['old'] = []

    # get the user input
    user_input = st.chat_input("Ask a question from your documents...")
    if user_input:
        # format the query only after the user has typed something
        formatted_query = prompt.format(question=user_input)
        # RetrievalQA reads only the 'query' key; it keeps no chat history
        result = chain_qa({'query': formatted_query})
        st.session_state.old.append(user_input)
        st.session_state.produced.append(result['result'])
        
        
    if st.session_state['produced']:
        for i in range(len(st.session_state['produced'])):
            message(st.session_state['old'][i], is_user=True, key=str(i)+ '_user')
            message(st.session_state['produced'][i], key=str(i))

    # Display spinner while running the long task
    with st.spinner("Please wait, processing..."):
        long_task()