<a href="https://colab.research.google.com/github/AaryanNaruka14/AaryanNaruka14/blob/main/FS23_Workshop_2_LangChain_AI_Club.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 2: Extending Language Models
In this workshop, we build a system that answers questions using a document of your choice as its source.

We'll be using **LangChain**.



## What is LangChain?
LangChain is an open-source framework for building applications powered by large language models (LLMs). It gives developers tools for chaining prompts, models, and data sources together into a single pipeline.

LangChain can be used to build applications that:
- Connect a language model to other sources of context
- Are context-aware
- Integrate with external sources such as Google Drive, Notion, and Wikipedia


# Setup: Install LangChain 🦜🔗

Run the cell below. It will take about 7 minutes to run the first time.

In [1]:
# [all] installs LangChain's optional dependencies along with LangChain itself
!pip install langchain[all] unstructured
!pip install pdf2image
!pip install chromadb

Collecting langchain[all]
  Downloading langchain-0.0.293-py3-none-any.whl (1.7 MB)
Collecting unstructured
  Downloading unstructured-0.10.15-py3-none-any.whl (1.5 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain[all])
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting langsmith<0.1.0,>=0.0.38 (from langchain[all])
  Downloading langsmith-0.0.38-py3-none-any.whl (38 kB)
Collecting O365<3.0.0,>=2.0.26 (from langchain[all])
  Downloading O365-2.0.28-py3-none-any.whl (164 kB)
Collecting aleph-alpha-client<3.0.0,>=2.15.0 (from langchain[all])
  Downloading aleph_alpha_client-2.17.0-py3-none-

Collecting pdf2image
  Downloading pdf2image-1.16.3-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.16.3
Collecting chromadb
  Downloading chromadb-0.4.10-py3-none-any.whl (422 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi<0.100.0,>=0.95.2 (from chromadb)
  Downloading fastapi-0.99.1-py3-none-any.whl (58 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.23.2-py3-none-any.whl (59 kB)

## Importing the Required **LangChain** Components

In [2]:
# Loading all the required functions for building the QA system
from langchain.document_loaders import WebBaseLoader, PDFMinerLoader
from langchain.document_loaders import OnlinePDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

## Setting Your OpenAI API Key
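The key is used to access OpenAI's models (the ones behind ChatGPT) from the notebook.

During the workshop, we'll provide you with a temporary key.

To create your own key, sign in to your OpenAI account and generate a new secret key on the API keys page.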

In [3]:
# Paste the temporary workshop key here (it stops working after the workshop),
# or use your own OpenAI API key. Never publish a real key in a shared notebook.
OPENAI_API_KEY = "sk-..."  # placeholder, replace with your key
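A key written directly into a notebook is visible to anyone who can open the file. The sketch below shows one possible alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`. The helper name `load_api_key` is hypothetical and not part of LangChain or the OpenAI client.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, prompt_fn=getpass):
    """Return the OpenAI key from the environment, or ask for it interactively."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is never saved in the notebook
        key = prompt_fn("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

With this helper in place, the assignment could become `OPENAI_API_KEY = load_api_key()`.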

# Part 1: Data Preparation

## 1A: Getting the Text
You can download the data from an online PDF, such as an academic paper, using LangChain's `OnlinePDFLoader`.

In [4]:
# Defining a Variable to store the file path or article link you want to do question answering on.
# webpage_path = 'https://www.macworld.com/article/2059274/wonderlust-keynote-script-iphone-15-apple-watch-series-9-airpods.html'
webpage_path = 'https://arxiv.org/pdf/2211.12588.pdf'

In [5]:
# Define a loader object.
# The loader is specific to the type of source you selected in the previous cell:
# WebBaseLoader for web pages, OnlinePDFLoader for online PDFs (such as arXiv articles).

# loader = WebBaseLoader(webpage_path)
loader = OnlinePDFLoader(webpage_path)

In [6]:
# loader.load() downloads the source and extracts its text.
# For an article, this step pulls all of the text from the article link.
data = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [7]:
# Let's see what the output looks like
print(data)

[Document(page_content='University of Waterloo Vector Institute, Toronto University of California, Santa Barabra Google Research {wenhuchen,x93ma}@uwaterloo.ca, xinyi_wang@ucsb.edu, wcohen@google.com\n\nProgram of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks ♦ ∗ , Xueguang Ma\n\n♠,♣\n\n∗ , Wenhu Chen\n\n♦\n\n♠\n\n♣\n\n♠\n\n♥\n\nXinyi Wang,\n\n♥\n\nWilliam W. Cohen\n\n2 2 0 2\n\nv o N 9 2\n\n] L C . s c [\n\n3 v 8 8 5 2 1 . 1 1 2 2 : v i X r a\n\nAbstract\n\nRecently, there has been signiﬁcant progress in teaching language models to perform step- by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is the state-of-art method for many of these tasks. CoT uses language models to produce text describing reasoning, and com- putation, and ﬁnally the answer to a ques- tion. Here we propose ‘Program of Thoughts’ (PoT), which uses language models (mainly Codex) to generate text and programming lan- guage st

## 1B: Splitting the Text Into Chunks

#### What is it?
- Splitting text into chunks (often called "chunking" or "segmentation") is a fundamental step in information retrieval systems.

#### Why do we do it?
- Large documents can be unwieldy. Breaking them into smaller pieces lets a retrieval system index the content more efficiently and return only the passages relevant to a question.

#### How are we going to do it?
- Using `RecursiveCharacterTextSplitter` from LangChain


### Recursively split by character
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

How the text is split: by list of characters.<br>
How the chunk size is measured: by number of characters.
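To make the idea concrete, here is a simplified sketch of recursive splitting in plain Python. It is an illustration written for this workshop, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, and details such as chunk overlap are left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("para one is here\n\nshort", chunk_size=10))
```

The paragraph break is tried first, and only the paragraph that is too long gets split further on spaces, so words that fit together stay together.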

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)

In [9]:
# Let's see what the chunk at index 33 (the 34th chunk) looks like
print(all_splits[33].page_content)

concatenation of task instruction, text, linearized table, and question. For conversational question answering, we simply concatenate all the dialog history in the prompt.


In [10]:
# Can you print the chunk at index 34? Can you find it in the document?

# Enter code here

## 1C: Turning the Chunks Into Numbers

We turn each chunk into a list of numbers, called a "vector" (or "embedding").

### Why do we need to represent text as **vectors**?
Representing text as vectors, with methods like Word2Vec or FastText or with embeddings from models like BERT and GPT, captures the semantic meaning of words and sentences. Texts with similar meanings get vectors that lie close together, so the QA system can match questions to passages by meaning rather than by keyword overlap alone.
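As a small illustration of "close in meaning", the sketch below compares vectors with cosine similarity, one common similarity measure. The three-dimensional vectors are made up for this example. Real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings", for illustration only
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))
```

The first score is much higher than the second, which is how a retrieval system would conclude that "kitten" is a better match for "cat" than "invoice" is.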

### How do we store the **vector** representation of text?
In a vector store such as Chroma, FAISS (from Facebook), or Annoy (from Spotify).

### What is a vector store?
In the context of question-answering (QA) systems, a vector store is a database optimized for storing and retrieving high-dimensional vectors. These vectors represent text in a form that can be compared for similarity.
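The sketch below shows the core idea behind a vector store with a tiny in-memory stand-in. The class `TinyVectorStore` is hypothetical and scans every stored vector on each search. Production vector stores rely on specialized indexes to avoid that full scan, and their APIs differ from this one.

```python
import math


class TinyVectorStore:
    """In-memory illustration of a vector store (brute-force search)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query, k=2):
        """Return the k stored texts whose vectors are closest to the query."""
        def distance(item):
            return math.dist(item[0], query)  # Euclidean distance

        return [text for _, text in sorted(self._items, key=distance)[:k]]


store = TinyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "near")
store.add([10.0, 10.0], "far")
print(store.search([0.9, 0.9], k=2))
```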


In our example, we use OpenAI's `text-embedding-ada-002` model to convert text into vectors and Chroma as the vector store. Do explore the other options LangChain provides.

In [12]:
# Create a vector store and fill it with the embedded chunks
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY))

### Document Retrieval

Representing text as vectors lets us measure how similar one piece of text is to another.

Given a question, let's see which chunks Chroma selects as the most similar.

In [13]:
# Our question. Try changing this and see what happens
question = "Can you compare the results of the GPT-3 backend to the Codex backend according to the paper?"

# document retrieval using ChromaDb and provided question
docs = vectorstore.similarity_search(question)

# Let's see which chunks were retrieved
print(docs)

[Document(page_content='Backend GPT-3 vs. Codex We evaluated GPT- 3 (text-davinci-002) with PoT prompting. Un- like Codex, GPT-3 is not optimized for generating programs, and one would expect degraded perfor- mance with GPT-3 as the LLM. We choose three datasets, GSM8K, SVAMP, and FinQA, to analyze the performance difference of PoT and compare that relative to CoT. We show our experimental results in Table 4. We can see that the gap be- tween Codex and GPT-3 with PoT is consistently smaller than their gap with CoT. We', metadata={'source': '/tmp/tmpbmzs_hml/tmp.pdf'}), Document(page_content='Codex PoT GPT3 PoT Codex - GPT3 (PoT)\n\n63.1 46.9 +16.2\n\n71.6 60.4 +11.2\n\n76.4 58.9 +7.5\n\n85.2 80.1 +5.1\n\n40.4 26.1 +14.3\n\n64.5 56.7 +7.8\n\n0.62\n\n0.58\n\n0.55\n\n0.65\n\n0.68\n\n0.6\n\n0.67\n\n0.69\n\n0.65\n\n0.73\n\n0.7\n\n0.74\n\nTable 4: GPT-3 and Codex performance difference un- der CoT and PoT prompting.\n\n2-shots\n\n0.63\n\n0.64\n\n4-shots\n\n0.66\n\n0.63\n\n0.62\n\n6-shots\n\n

In the cell above, try asking a different question about the document, and see which docs are selected. Are they relevant to the question you asked?

# Part 2: Creating the Chain

We have the question we want to ask and have retrieved the relevant documents to answer it. How do we put this all together to generate an answer to the question?

## 2A: Specify the Language Model

### How do we communicate with LLMs?

To communicate with a large language model (LLM), you give it a clear textual prompt or query. The model analyzes the input and generates a coherent, contextually relevant response. For best results, include specific context or details in your queries.

Below, we either write our own prompt or use one of LangChain's templates, fill in the question and the retrieved context, and send the result to the LLM. In this case, the LLM is GPT-3.5 Turbo (the model behind ChatGPT). You can also experiment with other publicly available LLMs, such as Falcon or Llama 2 models.


In [None]:
# defining which LLM to use.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=OPENAI_API_KEY)

## 2B: Specify The Prompt Template

In [None]:
# Defining the prompt
template = """Use the following pieces of a research paper to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
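To see what the placeholders do, the sketch below fills a shortened copy of the template using plain Python string formatting. The chunks and the question are made up for illustration, and the names `demo_template` and `demo_prompt` are hypothetical. Inside the chain, LangChain performs a similar substitution with the chunks returned by the retriever.

```python
demo_template = """Use the following pieces of a research paper to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Made-up chunks standing in for what the vector store would return
retrieved_chunks = [
    "PoT separates computation from reasoning.",
    "Table 4 compares GPT-3 and Codex.",
]

demo_prompt = demo_template.format(
    context="\n\n".join(retrieved_chunks),  # chunks joined into one context block
    question="What does PoT separate?",
)
print(demo_prompt)
```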

## 2C: Putting it Together: Creating the Chain

In [None]:
# combining all the previous steps and creating a nice and clean Chain Object.

qa_chain = RetrievalQA.from_chain_type(
    llm, # llm we created in step 2A
    retriever=vectorstore.as_retriever(), # vector store we created in step 1C
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT} # Prompt template we created in step 2B
)
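Conceptually, a retrieval QA chain performs three steps: retrieve chunks, insert them into the prompt, and call the model. The sketch below spells those steps out with stand-in functions so that it runs without an API key. All names here (`answer_question`, `fake_retrieve`, `fake_llm`) are hypothetical, and the real chain handles more details than this.

```python
def answer_question(question, retrieve, llm, template, k=4):
    """Retrieve k chunks, place them in the prompt, and ask the model."""
    chunks = retrieve(question, k)
    prompt = template.format(context="\n\n".join(chunks), question=question)
    return llm(prompt)


# Stand-ins so the sketch runs offline
def fake_retrieve(question, k):
    corpus = ["chunk about PoT", "chunk about CoT", "chunk about datasets"]
    return corpus[:k]


def fake_llm(prompt):
    return f"(model saw {len(prompt)} characters of prompt)"


print(answer_question(
    "What is PoT?",
    fake_retrieve,
    fake_llm,
    "{context}\nQuestion: {question}\nHelpful Answer:",
))
```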

# Part 3: Using the Chain

To use our chain, simply call the chain like a function, passing in your question.

In [None]:
# our question
question = "Can you provide a summary of the results mentioned in the paper?"

# getting the answer
answer = qa_chain.run(question)
print(answer)

The paper evaluates the performance of PoT prompting on five MWP datasets (GSM8K, AQuA, SVAMP, TabMWP, MultiArith) and three financial datasets (FinQA, ConvFinQA, TATQA). PoT outperforms CoT significantly across all evaluated datasets, with an average gain of around 8% for MWP datasets and 15% for financial datasets under the few-shot setting. Thanks for asking!


# (Bonus) Try It Yourself - AutoGPT

This LangChain agent searches arXiv for information, then answers your question.

Notice how when you run the cell below, the AI recognizes when it does not know the answer to a question, then uses the Search function we provide it to find a paper explaining it.

In [None]:
import langchain
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

# Print every prompt sent to the model so you can follow the agent's reasoning
langchain.verbose = True

llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# The "arxiv" tool requires the `arxiv` Python package to be installed
tools = load_tools(["arxiv"])

chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=False,  # set to True to also print each agent step
)

chain.run("What is contrast consistent search and how does it help with eliciting latent knowledge?")



> Entering new LLMChain chain...
Prompt after formatting:
Answer the following questions as best you can. You have access to the following tools:

arxiv: A wrapper around Arxiv.org Useful for when you need to answer questions about Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering, and Economics from scientific articles on arxiv.org. Input should be a search query.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [arxiv]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: What is contrast consistent search and how does it help with eliciting latent knowledge?
Thought:

'"Contrast consistent search" is a method used to improve the self-consistency of rankings produced by language models. It does not directly relate to eliciting latent knowledge.'
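The prompt above asks the model to reply in a fixed Thought/Action/Action Input format, which the agent then parses to decide whether to call a tool or stop. The sketch below shows how such a loop could work in plain Python. It is a simplified illustration, not LangChain's implementation, and the names `parse_react_step` and `run_agent` are hypothetical.

```python
import re


def parse_react_step(llm_output):
    """Return ("final", answer) or ("action", tool_name, tool_input)."""
    final = re.search(r"Final Answer:\s*(.*)", llm_output, re.DOTALL)
    if final:
        return ("final", final.group(1).strip())
    action = re.search(
        r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", llm_output, re.DOTALL
    )
    if action is None:
        raise ValueError("Output does not follow the Thought/Action format")
    return ("action", action.group(1).strip(), action.group(2).strip())


def run_agent(question, llm, tools, max_steps=5):
    """Alternate between model calls and tool calls until a final answer."""
    scratchpad = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        output = llm(scratchpad)
        step = parse_react_step(output)
        if step[0] == "final":
            return step[1]
        _, tool_name, tool_input = step
        observation = tools[tool_name](tool_input)
        # Append the result so the model sees it on the next call
        scratchpad += f"{output}\nObservation: {observation}\nThought:"
    raise RuntimeError("Agent did not finish within max_steps")
```

Here `llm` is any function that maps the text so far to the model's next reply, and `tools` is a dictionary mapping tool names such as `"arxiv"` to functions.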