# LangChain: Chat with your data

LangChain is an open-source developer framework for building LLM applications.  
An LLM only knows the data it was trained on, so it would be useful if we could also ask it about data that we provide ourselves. That is the topic we explore here using the LangChain framework.

## Retrieval-Augmented Generation (RAG) with LangChain

 A RAG system retrieves relevant information from a predefined knowledge base and then uses a generative language model to produce a response based on that retrieved information.  
 The idea is to ground the model’s responses in specific information, reducing the risk of generating hallucinated answers. It’s particularly useful for applications requiring factual accuracy or answers based on domain-specific knowledge.  
 There are two main steps:  
  - <b>Retrieval:</b> The system searches a database or document collection for relevant pieces of information based on the query. 
  - <b>Generation:</b> A language model takes the retrieved information and makes a contextually relevant response.
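
The two steps above can be pictured with a minimal sketch that uses no LangChain at all. Everything in it is an assumption made for illustration: the three-sentence knowledge base is invented, retrieval is plain word overlap rather than embeddings, and the "generation" step stops at building the prompt that would be sent to a language model.

```python
import re

# Hypothetical three-document knowledge base, invented for illustration.
KNOWLEDGE_BASE = [
    "Linear regression predicts a numeric value from input features.",
    "The least squares method minimizes the sum of squared residuals.",
    "Chroma is a vector store that can be used with LangChain.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Retrieval step: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Generation step input: the retrieved text is placed in front of the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "What does the least squares method minimize?"
top_docs = retrieve(query, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In a real system the word-overlap ranking is replaced by a vector store lookup and the prompt is passed to an LLM, which is what the rest of this notebook builds up to.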
 
LangChain is a framework designed for chaining together various components, such as retrieval systems and language models, to build RAG systems. LangChain modules support:  
- Document loading and preprocessing
- Vector store integration and efficient scaling: as the knowledge base grows, LangChain’s vector store and retrieval functions help keep query processing times low
- Retrieval-generation chains, which supply domain-specific knowledge and reduce hallucination  
  
We are going to start with document loading.


### Document Loading
#### PDF loading

In [2]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("knowledge_base/LinearRegression-Lecture01.pdf")
pages = loader.load()

- `PyPDFLoader` returns a list of documents, one document per page of the PDF
- Each document has `page_content` and `metadata` (the source path and the page number)
- The loader also offers options for extracting images and for lazy loading

In [30]:
print(f"Loaded document has {len(pages)} pages.")
last_page = pages[-1]
print(f"Last page metadata: {last_page.metadata}")
print(last_page.page_content[-602:-2]) # print out the document summary

Loaded document has 13 pages.
Last page metadata: {'source': 'knowledge_base/LinearRegression-Lecture01.pdf', 'page': 12}

Summary
•Regression is used to predict numeric values (dependent variables) based on input exam-
ples (independent variables)
•Inlinear regression , the output value is a linear combination of input values
•The loss function is the quadratic loss (the squared diﬀerence between the predicted and
correct label)
•Optimization uses the least squares method , computed using the pseudoinverse of the
design matrix
•Assuming normally distributed noise, the least squares solution is equivalent to maximizing
the probability of the labels, which gives a probabilistic justiﬁcation for the quadratic loss



#### URL loading
We can use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. 

In [31]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://python.langchain.com/docs/introduction/")
docs = loader.load()
print(f"METADATA: {docs[0].metadata}\n")
content = docs[0].page_content
# Print out a part of the page to explain langchain
start = content.find("LangChain is a framework for developing applications")
end = content.find("LangGraph Cloud.") + len("LangGraph Cloud.")
print(content[start:end])

# print(docs[0].page_content[:])   # print the whole page

USER_AGENT environment variable not set, consider setting it to identify your requests.


METADATA: {'source': 'https://python.langchain.com/docs/introduction/', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}

LangChain is a framework for developing applications powered by large language models (LLMs).
LangChain simplifies every stage of the LLM application lifecycle:

Development: Build your applications using LangChain's open-source building blocks, components, and third-party integrations.
Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.
Productionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.
Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Cloud.


#### Load multiple URLs concurrently  
We can speed up the scraping process by scraping and parsing multiple URLs concurrently.  
The <b>requests_per_second</b> parameter controls the maximum number of concurrent requests. Increasing it speeds up scraping, but may cause the server to block you.


In [14]:
import nest_asyncio # fixes a bug with asyncio and jupyter
nest_asyncio.apply()

loader = WebBaseLoader(["https://python.langchain.com/docs/introduction/", "https://huggingface.co/docs/transformers/index"])
loader.requests_per_second = 1 
docs = loader.aload()
for doc in docs:
    print(f"{doc.metadata}\n")


Fetching pages: 100%|##########| 2/2 [00:00<00:00,  3.65it/s]

{'source': 'https://python.langchain.com/docs/introduction/', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}

{'source': 'https://huggingface.co/docs/transformers/index', 'title': '🤗 Transformers', 'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.'}






#### Loading an XML file  
Load an XML file from its raw URL with the default parser set to 'xml' (this requires `lxml` to be installed).

In [32]:
loader = WebBaseLoader(
    "https://raw.githubusercontent.com/IvaGoluza/backend_RoundRobin/refs/heads/master/pom.xml"
)
loader.default_parser = "xml"
docs = loader.load()
print(f"METADATA: {docs[0].metadata}\n")
docs[0].page_content

METADATA: {'source': 'https://raw.githubusercontent.com/IvaGoluza/backend_RoundRobin/refs/heads/master/pom.xml'}



'\n4.0.0\n\norg.springframework.boot\nspring-boot-starter-parent\n3.1.5\n \n\ncom.web2\nRoundRobin\n0.0.1-SNAPSHOT\nRoundRobin\nDemo project for Spring Boot\n\n17\n3.1.0\n\n\n\norg.springframework.boot\nspring-boot-starter-data-jpa\n\n\norg.springframework.boot\nspring-boot-starter-security\n\n\norg.springframework.boot\nspring-boot-starter-thymeleaf\n\n\norg.springframework.boot\nspring-boot-starter-web\n\n\norg.postgresql\npostgresql\nruntime\n\n\norg.springframework.boot\nspring-boot-configuration-processor\ntrue\n\n\norg.springframework.boot\nspring-boot-starter-validation\n\n\norg.projectlombok\nlombok\ntrue\n\n\norg.modelmapper\nmodelmapper\n${modelmapper.version}\n\n\norg.springframework.boot\nspring-boot-starter-test\ntest\n\n\norg.springframework.security\nspring-security-test\ntest\n\n\n\n\n\norg.springframework.boot\nspring-boot-maven-plugin\n\n\n\norg.projectlombok\nlombok\n\n\n\n\n\n\n'

#### Lazy Load  
We can use lazy loading with the previous loader. Here it is only a demo, but the feature is valuable for large datasets because documents are produced one at a time, which keeps memory requirements low. 
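
The difference between eager and lazy loading can be sketched with a plain Python generator. The two functions below are hypothetical stand-ins and do not reproduce any loader's internals; they only show why a lazy iterator keeps memory usage flat.

```python
from itertools import islice

def load_eagerly(n):
    """Build every document up front; memory grows with n."""
    return [f"document {i}" for i in range(n)]

def load_lazily(n):
    """Yield one document at a time; only the current one is held in memory."""
    for i in range(n):
        yield f"document {i}"

# Taking the first three documents never materializes the remaining ones,
# so even a huge n is harmless here.
first_three = list(islice(load_lazily(10**12), 3))
print(first_three)
```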

In [16]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:])
print(f"METADATA: {pages[0].metadata}\n")


4.0.0

org.springframework.boot
spring-boot-starter-parent
3.1.5
 

com.web2
RoundRobin
0.0.1-SNAPSHOT
RoundRobin
Demo project for Spring Boot

17
3.1.0



org.springframework.boot
spring-boot-starter-data-jpa


org.springframework.boot
spring-boot-starter-security


org.springframework.boot
spring-boot-starter-thymeleaf


org.springframework.boot
spring-boot-starter-web


org.postgresql
postgresql
runtime


org.springframework.boot
spring-boot-configuration-processor
true


org.springframework.boot
spring-boot-starter-validation


org.projectlombok
lombok
true


org.modelmapper
modelmapper
${modelmapper.version}


org.springframework.boot
spring-boot-starter-test
test


org.springframework.security
spring-security-test
test





org.springframework.boot
spring-boot-maven-plugin



org.projectlombok
lombok







METADATA: {'source': 'https://raw.githubusercontent.com/IvaGoluza/backend_RoundRobin/refs/heads/master/pom.xml'}



The loaded documents are large, so the next step is to split them into smaller chunks.

### Document Splitting  
- We split documents into semantically meaningful chunks. Chunk size is measured in characters or tokens.  
- All text splitters in LangChain split with some chunk overlap, so that neighboring chunks share context.  
- All text splitters in LangChain have the <b>create_documents</b> and <b>split_documents</b> methods.  
- Some text splitters focus on metadata. When a document is split into chunks, every chunk must keep the original metadata and may receive additional metadata.
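
To make the idea of chunk overlap concrete, here is a minimal sketch of a fixed-size character splitter. It is a simplification written for this explanation (the function name `split_with_overlap` is hypothetical) and ignores separators entirely, which the LangChain splitters do not.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbor when the overlap is large enough.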


In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [34]:
chunk_size = 30
chunk_overlap = 10
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [19]:
text = 'Today is Tuesday and we are in Natural Language Processing class.'
print(f"Character text splitter:\n {c_splitter.split_text(text)}\n\n")
print(f"Recursive Character text splitter:\n {r_splitter.split_text(text)}\n\n")

Character text splitter:
 ['Today is Tuesday and we are in Natural Language Processing class.']


Recursive Character text splitter:
 ['Today is Tuesday and we are in', 'we are in Natural Language', 'Language Processing class.']




By default, `CharacterTextSplitter` uses "\n\n" as its separator, so it does not split the previous string at all until we set its separator to ' '.

In [20]:
print(f"Character splitter using default separator:\n {c_splitter.split_text(text)}\n\n")
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
print(f"Character splitter using ' ' separator:\n {c_splitter.split_text(text)}")

Character splitter using default separator:
 ['Today is Tuesday and we are in Natural Language Processing class.']


Character splitter using ' ' separator:
 ['Today is Tuesday and we are in', 'we are in Natural Language', 'Language Processing class.']


#### Recursive splitting
In the following examples we compare `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`.  

`RecursiveCharacterTextSplitter` gives better results because it works through a list of separators, here `["\n\n", "\n", "(?<=\. )", " ", ""]`. It splits the text on "\n\n" first, and if a chunk is still too long, it splits that chunk on the next separator in the list.  

Look at the last two chunks of each splitter's output and notice how the recursive splitter does a better job of keeping paragraphs together.
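
The recursive strategy described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it has no chunk overlap, treats separators as literal strings rather than regular expressions, and the function name `recursive_split` is made up for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut into fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces together
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "Para one line one.\nPara one line two.\n\nPara two is short."
print(recursive_split(sample, chunk_size=40))   # paragraphs stay whole
print(recursive_split(sample, chunk_size=20))   # first paragraph falls back to "\n"
```

With a chunk size of 40 both paragraphs fit and are kept intact. With 20, only the first paragraph is too long, so only that paragraph is split further on the next separator.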

In [21]:
text2 = """🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch.

These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
🖼️ Computer Vision: image classification, object detection, and segmentation.
🗣️ Audio: automatic speech recognition and audio classification.
🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

🤗 Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model’s life; train a model in three lines of code in one framework, and load it for inference in another. Models can also be exported to a format like ONNX and TorchScript for deployment in production environments."""

In [22]:
c_splitter = CharacterTextSplitter(
    chunk_size=550,
    chunk_overlap=0,
    separator = ' '
)
print("[CHUNK]  " + "\n\n[CHUNK]  ".join(c_splitter.split_text(text2)))

[CHUNK]  🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch.

These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
🖼️ Computer Vision: image classification,

[CHUNK]  object detection, and segmentation.
🗣️ Audio: automatic speech recognition and audio classification.
🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

🤗 Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage 

In [23]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=550,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning
)
print("[CHUNK]  " + "\n\n[CHUNK]  ".join(r_splitter.split_text(text2)))

[CHUNK]  🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch.

[CHUNK]  These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
🖼️ Computer Vision: image classification, object detection, and segmentation.
🗣️ Audio: automatic speech recognition and audio classification.

[CHUNK]  🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

[CHUNK]  🤗 Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framew

Now we can split the pages from the loaded PDF file into documents using `RecursiveCharacterTextSplitter`.

In [24]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("knowledge_base/LinearRegression-Lecture01.pdf")
pages = loader.load()

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning
)
docs = r_splitter.split_documents(pages)

print(f"{len(pages)} pages were split in {len(docs)} documents.")
# for doc in docs:
#     print(f"[DOC] {doc.page_content} \n")

13 pages were split in 47 documents.


#### Token splitting
We can also split on token count instead of character count. This matters because LLM context windows are usually defined in tokens.

In [25]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
print(text_splitter.split_text(text))

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
print(docs[0].metadata)
print(docs[0].page_content)

['Today', ' is', ' Tuesday', ' and', ' we', ' are', ' in', ' Natural', ' Language', ' Processing', ' class', '.']
{'source': 'knowledge_base/LinearRegression-Lecture01.pdf', 'page': 0}
3. Linear regression
Machine Learning 1, UNIZG FER, AY 2022/2023
Jan Šnajder, lectures, v1.13
Last time we introduced the basic concepts of machine learning: the hypothesis , which is a
function that maps from input data to labels and is deﬁned by parameters theta, and the model,
which is a set of hypotheses indexed by parameters theta. We have said that machine learning
comes down


#### Context aware splitting
    
We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks. Specify headers to split on.

In [26]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# NLP\n\n \
## What is NLP\n\n \
NLP is a field at the intersection of computer science, artificial intelligence (AI), and linguistics.\n\n It is also a subject on FER.\n\n \
### What is that name? \n\n \
Natural Language Processing \n\n 
"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

md_header_splits[0]

Document(metadata={'Header 1': 'NLP', 'Header 2': 'What is NLP'}, page_content='NLP is a field at the intersection of computer science, artificial intelligence (AI), and linguistics.  \nIt is also a subject on FER.')
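
How header metadata ends up on each chunk can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not the `MarkdownHeaderTextSplitter` implementation: the function name `split_by_headers` and the dictionary output format are assumptions made for this example.

```python
def split_by_headers(markdown_text, headers_to_split_on):
    """Group body lines under the headers that are active when the line appears."""
    # Check longer markers first so that "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    active = {}     # header name -> text of the header currently in effect
    sections = []
    body = []

    def flush():
        """Store the collected body lines together with a copy of the active headers."""
        content = "\n".join(body).strip()
        if content:
            sections.append({"metadata": dict(active), "page_content": content})
        body.clear()

    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                # A new header invalidates every header of the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        active.pop(other_name, None)
                active[name] = line[len(marker):].strip()
                break
        else:
            body.append(line)
    flush()
    return sections

doc = "# NLP\n\n## What is NLP\n\nNLP is a field.\n\n### What is that name?\n\nNatural Language Processing\n"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(doc, headers)
for section in sections:
    print(section)
```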

The chunks now have to be stored in a vector store, indexed so that they can be retrieved easily when we ask questions.

### Vectorstores and Embedding
Embeddings take a piece of text and create a numerical vector representation of it. Texts with similar content have similar vectors.  
We can save our data as embedding vectors in a vector store. Later we embed the question in the same way and query the vector store for the most similar vectors. The chunks behind those vectors are passed to the LLM, which produces the answer.   

Now we are going to load a few PDF files, and we will duplicate the first one to create messy data. Then, we will split loaded data into chunks.

In [4]:
# load PDFs
loaders = [
    PyPDFLoader("knowledge_base/LinearRegression-Lecture01.pdf"),
    PyPDFLoader("knowledge_base/LinearRegression-Lecture01.pdf"),
    PyPDFLoader("knowledge_base/LinearRegression-Lecture02.pdf"),
    PyPDFLoader("knowledge_base/LinearDiscriminativeModels-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
  
# split data  
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 200
)
splits = text_splitter.split_documents(docs)

##### Simple example for understanding embeddings
We are using <b>sentence-transformers</b>, an open-source library for generating embeddings.  
Sentence-transformers is built on top of <b>Hugging Face Transformers</b> and provides various pre-trained models for generating embeddings.

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings
import numpy as np

embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

sentence1 = "I like piano"
sentence2 = "I like guitar"
sentence3 = "Entschuldigung, wo ist der Kellner?"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

print(f"E1 vs E2: {np.dot(embedding1, embedding2)}")
print(f"E1 vs E3: {np.dot(embedding1, embedding3)}")
print(f"E2 vs E3: {np.dot(embedding2, embedding3)}")

E1 vs E2: 0.7652282697764397
E1 vs E3: -0.00959956912924178
E2 vs E3: -0.003178931187782787
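
The dot product used above is a similarity score in its own right, and it coincides with cosine similarity when the vectors have unit length. The sketch below shows that relationship on hypothetical three-dimensional vectors; real embedding models produce hundreds of dimensions, and whether their output is normalized depends on the model.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy "embeddings", invented for illustration.
piano = [3.0, 4.0, 0.0]
guitar = [4.0, 3.0, 0.0]
waiter = [0.0, 0.0, 5.0]

print(cosine_similarity(piano, guitar))          # close to 1: similar direction
print(cosine_similarity(piano, waiter))          # 0: orthogonal, unrelated
print(dot(normalize(piano), normalize(guitar)))  # same as the cosine of the raw vectors
```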


##### Vectorstores
LangChain integrates with many vector stores, and we are using Chroma. It is lightweight and runs in-process; here we also give it a `persist_directory` so that the index is saved to disk.  
On the vector DB we can call the `similarity_search` method and pass it the parameter `k`, the number of results we want. For our question, `similarity_search` returns the `k` chunks from the vector store whose embedding vectors are most similar to the question's.
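
What a vector store does during a similarity search can be sketched without any external library. The class below is a toy written for this explanation: `ToyVectorStore` is a hypothetical name, and the "embedding" is a bag-of-words count vector standing in for a neural model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query vector."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore([
    "the least squares method minimizes squared residuals",
    "logistic regression is a classification model",
    "the design matrix stacks all input examples",
])
results = store.similarity_search("what is the least squares method", k=2)
print(results)
```

A production vector store replaces the linear scan with an approximate nearest-neighbor index, but the contract is the same: embed the query, compare it with stored vectors, and return the top `k`.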


In [7]:
import os
from langchain.vectorstores import Chroma

persist_directory = 'chroma/'
    
!rm -rf ./chroma  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [7]:
question = "what is Least-squares method"
docs = vectordb.similarity_search(question,k=2)
for doc in docs:
    print(f"{doc.page_content[:1000]}\n\n")

4 Least-squares method
Let us now consider in more detail the least-squares method, that is, how to obtain the the
parameters wthat are optimal in terms of the sum of squared residuals, an let’s do this for
multiple linear regression, where ną1. We have a set of examples (in regression, these are
input variables or “independent variables”):
X“¨
˚˚˚˚˝1xp1q
1xp1q
2... xp1q
n
1xp2q
1xp2q
2... xp2q
n
...
1xpNq
1xpNq
2... xpNq
n˛
‹‹‹‹‚
Nˆpn`1q
Recall that the matrix Xis called the design matrix . We also have a vector of output values
(i.e., the labels or “dependent variables”):
y“¨
˚˚˚˝yp1q
yp2q
...
ypNq˛
‹‹‹‚
Nˆ1
4.1 A dead end: Exact solution
Let’s try ﬁrst with something other than the least-squares method, and that will turn out to be
a bad idea. Namely, we can think of our optimization problem as a system of equations. Ideally,
we would like to ﬁnd a solution for which
pxpiq,ypiqqPD. hpxpiqq“ypiq
that is
pxpiq,ypiqqPD.wTxpiq“ypiq
This would be an exact solution of our problem: a solut

Notice that the two returned docs are identical, so the second one adds no value. (This is the result of loading the same PDF file twice, which produced chunks with the same content.)

### Retrieval
A query comes in and we have to retrieve the most relevant data (chunks) for that query. 

#### Method: Maximum marginal relevance (MMR)
- The most similar results are not always the best choice; sometimes we need a diverse set of documents 
- The idea is to first query the vector store for a larger candidate set, whose size we control with the fetch_k parameter 
- Choose the <b>fetch_k</b> most similar responses based on semantics 
- Within those fetch_k responses choose the final <b>k</b> most diverse responses for the user
- The <b>lambda_mult</b> parameter sets the degree of diversity among the results, with 0 corresponding to maximum diversity and 1 to minimum diversity. It defaults to 0.5.
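
The selection procedure in the list above can be sketched as a greedy loop. This is an illustrative version under stated assumptions: the two-dimensional vectors are invented, the function name `mmr` is hypothetical, and the scoring rule (relevance weighted by `lambda_mult`, redundancy by `1 - lambda_mult`) is the common textbook form of MMR rather than a copy of any library's code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(query_vec, doc_vecs, k=2, fetch_k=5, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k documents most similar to the query.
    by_relevance = sorted(range(len(doc_vecs)),
                          key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    candidates = by_relevance[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/redundancy trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [
    [0.8, 0.6],    # 0: most relevant
    [0.8, 0.6],    # 1: exact duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but very different from 0
    [0.0, 1.0],    # 3: irrelevant
]
print(mmr(query_vec, doc_vecs, k=2, fetch_k=4, lambda_mult=0.5))  # duplicate is skipped
print(mmr(query_vec, doc_vecs, k=2, fetch_k=4, lambda_mult=1.0))  # pure similarity ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the duplicate is returned, which mirrors the plain similarity search shown earlier.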

In [188]:
docs_mmr = vectordb.max_marginal_relevance_search(question, fetch_k=5, k=2)
for doc in docs_mmr:
    print(f"{doc.page_content[:1000]}\n\n")

4 Least-squares method
Let us now consider in more detail the least-squares method, that is, how to obtain the the
parameters wthat are optimal in terms of the sum of squared residuals, an let’s do this for
multiple linear regression, where ną1. We have a set of examples (in regression, these are
input variables or “independent variables”):
X“¨
˚˚˚˚˝1xp1q
1xp1q
2... xp1q
n
1xp2q
1xp2q
2... xp2q
n
...
1xpNq
1xpNq
2... xpNq
n˛
‹‹‹‹‚
Nˆpn`1q
Recall that the matrix Xis called the design matrix . We also have a vector of output values
(i.e., the labels or “dependent variables”):
y“¨
˚˚˚˝yp1q
yp2q
...
ypNq˛
‹‹‹‚
Nˆ1
4.1 A dead end: Exact solution
Let’s try ﬁrst with something other than the least-squares method, and that will turn out to be
a bad idea. Namely, we can think of our optimization problem as a system of equations. Ideally,
we would like to ﬁnd a solution for which
pxpiq,ypiqqPD. hpxpiqq“ypiq
that is
pxpiq,ypiqqPD.wTxpiq“ypiq
This would be an exact solution of our problem: a solut

Notice that the results are now more diverse: instead of two identical docs we got two different ones, which gives us more information than before.

#### Method: SelfQuery - LLM aided retrieval
- We often want to infer the metadata from the query itself
- We can use `SelfQueryRetriever`, which uses an LLM to extract:  
  - The query string to use for vector search
  - A metadata filter to pass in as well
- Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
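
The two outputs of the self-query step, a search string and a metadata filter, can be illustrated with a toy example. Here a hand-written rule replaces the LLM, which is a strong simplification. The chunk texts, the shortened file names and the helper names `parse_query` and `filtered_search` are all hypothetical.

```python
import re

# Hypothetical chunks whose metadata is shaped like the PDF loader's (source, page).
CHUNKS = [
    {"page_content": "Regression predicts numeric values.",
     "metadata": {"source": "Lecture01.pdf", "page": 0}},
    {"page_content": "Regularized regression penalizes large weights.",
     "metadata": {"source": "Lecture02.pdf", "page": 1}},
    {"page_content": "Logistic regression is a discriminative model.",
     "metadata": {"source": "Lecture03.pdf", "page": 0}},
]
ORDINALS = {"first": "Lecture01.pdf", "second": "Lecture02.pdf", "third": "Lecture03.pdf"}

def parse_query(question):
    """Stand-in for the LLM: split the question into a search string and a metadata filter."""
    metadata_filter = {}
    for word, source in ORDINALS.items():
        if re.search(rf"\b{word} lecture\b", question.lower()):
            metadata_filter["source"] = source
    search_text = re.sub(r"\bin the \w+ lecture\b", "", question, flags=re.IGNORECASE)
    return search_text.strip(" ?"), metadata_filter

def filtered_search(question, chunks):
    """Apply the metadata filter first, then rank the survivors by word overlap."""
    search_text, metadata_filter = parse_query(question)
    matching = [c for c in chunks
                if all(c["metadata"].get(key) == value for key, value in metadata_filter.items())]
    words = set(search_text.lower().split())
    overlap = lambda c: len(words & set(c["page_content"].lower().rstrip(".").split()))
    return sorted(matching, key=overlap, reverse=True)

question = "what did they say about regression in the second lecture?"
print(parse_query(question))
print([c["metadata"] for c in filtered_search(question, CHUNKS)])
```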

In [55]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define metadata field information for retrieval
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of knowledge_base/LinearRegression-Lecture01.pdf, knowledge_base/LinearRegression-Lecture02.pdf, or knowledge_base/LinearDiscriminativeModels-Lecture03.pdf",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
document_content_description = "Lecture script"


In [66]:
import os
from dotenv import load_dotenv
load_dotenv()
token = os.environ["HUGGINGFACEHUB_API_TOKEN"]  

In [67]:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
)  

In [68]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "what did they say about regression in the second lecture?"
docs = retriever.invoke(question)

for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 1, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 5, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 3, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}


We used the open-source <b>Hugging Face LLM meta-llama/Meta-Llama-3-8B-Instruct</b>.  
  
The results contain only documents from the second lecture, just as the query specified. This means the retriever selected documents based on both the document content and the <b>document metadata</b>.

In [69]:
from langchain_openai import OpenAI
load_dotenv()
api_key = os.environ['OPENAI_API_KEY']
llm = OpenAI()

In [70]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "what did they say about regression in the second lecture?"
docs = retriever.invoke(question)

for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 1, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 5, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}
{'page': 3, 'source': 'knowledge_base/LinearRegression-Lecture02.pdf'}


We got the same result using OpenAI LLM.

#### Method: Compression  
- The goal is to pull out only the most relevant parts of the retrieved documents 
- With compression, we run all retrieved documents through a compression LLM that extracts the most relevant segments, and only those are passed to the main LLM that generates the answer
- The cost is more LLM calls, but the final answer is focused on the most relevant content  
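
The effect of compression can be imitated with a crude keyword filter. In the sketch below a word-overlap rule stands in for the compression LLM, which is an assumption made purely for illustration; the function name `extract_relevant` and the sample text are hypothetical.

```python
import re

def extract_relevant(document, question, min_overlap=2):
    """Stand-in for the compressor LLM: keep sentences sharing enough words with the question."""
    question_words = set(re.findall(r"[a-z]+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences
            if len(question_words & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap]
    return " ".join(kept)

document = (
    "This course runs in winter. "
    "The least squares method minimizes the sum of squared residuals. "
    "Lectures are held on Tuesdays."
)
question = "What is the least squares method?"
compressed = extract_relevant(document, question)
print(compressed)
print(f"{len(document)} characters reduced to {len(compressed)}")
```

Only the sentence that bears on the question survives, so the main LLM receives a shorter and more focused context.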

We are using a compressor backed by the open-source Hugging Face LLM meta-llama/Meta-Llama-3-8B-Instruct.

In [19]:
import os
from dotenv import load_dotenv
load_dotenv()
token = os.environ["HUGGINGFACEHUB_API_TOKEN"]          

In [71]:
from langchain_huggingface import HuggingFaceEndpoint

# models tried: tiiuae/falcon-7b-instruct, meta-llama/Meta-Llama-3-8B-Instruct
llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
) 
llm.invoke("what was the first disney movie?") 

' The first Disney movie was Snow White and the Seven Dwarfs, which was released in 1937. It was the first full-length animated feature film produced by Walt Disney Productions and was based on the classic fairy tale "Snow White" by the Brothers Grimm. The movie was a groundbreaking achievement in animation and storytelling, and it became a huge success, earning critical acclaim and breaking box office records. It was also the first Disney movie to be released in theaters, and it helped establish Walt Disney as a major player in the film industry.... Read more\nHow many Disney movies have been made? As of 2022, there have been over 60 animated feature films produced by Walt Disney Animation Studios, including Snow White and the Seven Dwarfs (1937), Pinocchio (1940), Fantasia (1940), Dumbo (1941), Bambi (1942), Cinderella (1950), Mary Poppins (1964), The Jungle Book (1967), The Little Mermaid (1989), Beauty and the Beast (1991), Aladdin (1992), The Lion King (1994), and many others. Add

In [72]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

question = "What is Least-squares method?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

* The pseudoinverse is the solution to the least squares problem.
* Calculating the pseudoinverse gives the parameters w for which the hyperplane is optimal in terms of least square differences from labeled data.
* The solution w is such that the distance in terms of squared differences between vectors Xw and y is minimal, that is, it is a solution minimizing the L2-norm.
* If the system of equations is underdetermined and has multiple solutions, the pseudoinverse gives the solution with the smallest norm.
* If the system is overdetermined, the pseudoinverse gives a solution that minimizes }Xw´y}, but if such a solution is not unique, it again gives the one with the smallest norm.
* The optimization procedure gives a closed-form solution, but obtaining this solution can still be quite computationally demanding.
----------------------------------------------------------------------------------------------------
Document 2:

* The pseudoinverse is the solution to the least s

In [73]:
# NOTE: `docs` should hold the uncompressed results for the same question
# for this comparison to be meaningful
original_contexts_len = len("\n\n".join(d.page_content for d in docs))
compressed_contexts_len = len("\n\n".join(d.page_content for d in compressed_docs))

print("Original context length:", original_contexts_len)
print("Compressed context length:", compressed_contexts_len)
print("Compressed Ratio:", f"{original_contexts_len/(compressed_contexts_len + 1e-5):.2f}x")

Original context length: 6413
Compressed context length: 4500
Compressed Ratio: 1.43x


We can see that the documents are still duplicated (semantic similarity search is used in the background), but the LLM extracted only the parts relevant to the question instead of returning whole documents.

### Question answering
The RetrievalQA chain performs question answering backed by a retrieval step.  
We are using the same open-source LLM as before.

In [75]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)
result = qa_chain.invoke({"query": "What is Least-squares method?"})
result["result"]

' The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.\n\nCorrect Answer: The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.\n\nFinal Answer: The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.'

#### Map-reduce method & RetrievalQA
 - Each document is sent to the LLM on its own to get an answer, and those answers are then combined into a single answer with a final LLM call
 - The downside is that it needs many LLM calls, so it is slow
 - The answer may be worse if the information is spread across several documents, because each document is looked at individually
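The points above can be sketched in plain Python. This is a simplified illustration of the map-reduce idea under stated assumptions, not the code LangChain runs: `map_reduce_answer`, `counting_llm`, the prompt wording, and the toy chunks are hypothetical.

```python
from typing import Callable, List


def map_reduce_answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    # Map step: ask the question against each document independently.
    partial_answers = [
        llm(f"Context:\n{doc}\n\nQuestion: {question}\nAnswer:") for doc in docs
    ]
    # Reduce step: one final call that sees only the partial answers.
    combined = "\n".join(partial_answers)
    return llm(
        f"Partial answers:\n{combined}\n\n"
        f"Combine them into one final answer to: {question}"
    )


# Hypothetical stand-in for a real model: it records every prompt it receives.
mr_calls: List[str] = []


def counting_llm(prompt: str) -> str:
    mr_calls.append(prompt)
    return f"stub answer {len(mr_calls)}"


# Toy chunks invented for this sketch.
mr_docs = [
    "chunk about least squares",
    "chunk about an unrelated speech",
    "chunk about residuals",
]
mr_result = map_reduce_answer("What is Least-squares method?", mr_docs, counting_llm)
print(len(mr_calls))  # len(mr_docs) map calls + 1 reduce call
```

Because the reduce call never sees the original chunks, a single unhelpful partial answer (for example from an unrelated document) can influence the final result.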

In [77]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

" I don't know. The provided content does not discuss the Least-squares method. It seems to be a passage about a president's speech, and has no relation to the topic of Least-squares method."

The result is poor because many of the documents did not contain the answer, so the answer was lost when the individual answers were combined.

#### Refine method
  - The first LLM call produces an initial answer from the first document; each subsequent call tries to improve that answer using the previous answer and the next document in the list
  - Information is carried over between the documents
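As a rough sketch of the sequential process described above (again plain Python with a fake model, not LangChain's internals), the loop below feeds each new document together with the answer so far. `refine_answer`, `drafting_llm`, the prompt wording, and the toy chunks are assumptions made for illustration.

```python
from typing import Callable, List


def refine_answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    if not docs:
        raise ValueError("refine needs at least one document")
    # First call: initial answer from the first document only.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}\nAnswer:")
    # Later calls: previous answer + next document, one document at a time.
    for doc in docs[1:]:
        answer = llm(
            f"Question: {question}\nExisting answer: {answer}\n"
            f"New context:\n{doc}\n\n"
            "Improve the existing answer if the new context helps."
        )
    return answer


# Hypothetical stand-in for a real model: it records every prompt it receives.
refine_calls: List[str] = []


def drafting_llm(prompt: str) -> str:
    refine_calls.append(prompt)
    return f"draft {len(refine_calls)}"


# Toy chunks invented for this sketch.
refine_docs = ["chunk one", "chunk two", "chunk three"]
refine_result = refine_answer("What is Least-squares method?", refine_docs, drafting_llm)
print(len(refine_calls))  # one call per document
```

The calls cannot run in parallel, since each one depends on the answer produced by the previous one.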

In [78]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

' \nSince the original answer was not very detailed, this answer can be refined to provide more context and details about the least-squares method. \n\nThe least-squares method is a mathematical technique used in regression analysis to find the best-fitting linear or nonlinear model for a set of data points. In the context of multiple linear regression, the goal is to find the optimal values of the regression coefficients (w1, w2, …, wn) that minimize the sum of the squared residuals between the observed values of the dependent variable (y) and the predicted values of the dependent variable based on the linear combination of the independent variables (X).\n\nThe least-squares method is often used in statistical modeling to estimate the parameters of a linear regression model. The process involves the following steps:\n\n1. Collect a dataset of input variables (X) and output variables (y).\n2. Define a linear regression model that relates the input variables to the output variable.\n3. 

We got a better result this time because the information was carried over between the documents.

## Chat   
Now we add the concept of chat history.  
ConversationBufferMemory keeps a list of chat messages as history and passes them, along with the new question, to the chatbot.
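To make the idea of a buffer memory concrete, here is a toy version in plain Python. It is a simplified, hypothetical stand-in (`ToyBufferMemory` and its methods are invented for this sketch and are not the LangChain class): every turn is stored verbatim and the full transcript is prepended to the next question.

```python
from typing import List, Tuple


class ToyBufferMemory:
    """Hypothetical, simplified stand-in: stores every turn verbatim."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def save_turn(self, question: str, answer: str) -> None:
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


memory_sketch = ToyBufferMemory()
memory_sketch.save_turn(
    "What is Least-squares method?",
    "It minimizes the sum of squared errors.",  # made-up answer for the sketch
)

# The stored history travels with the follow-up question.
follow_up = "why is that method needed?"
prompt = f"Chat history:\n{memory_sketch.as_text()}\n\nQuestion: {follow_up}"
print(prompt)
```

Since nothing is ever dropped or summarized, the prompt grows with every turn, which is worth keeping in mind for long conversations.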


In [80]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [81]:
from langchain.chains import ConversationalRetrievalChain
retriever = vectordb.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [82]:
question = "What is Least-squares method?" # first question
result = qa({"question": question})

In [83]:
result['answer'] 

' The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.\n\nCorrect Answer: The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.\n\nFinal Answer: The least-squares method is a method used in linear regression to find the best-fitting line that minimizes the sum of the squared errors between the observed and predicted values. It is a way to solve the problem of overfitting by using a closed-form solution.'

In [84]:
question = "why is that method needed?"  # follow-up question
result = qa({"question": question})

In [85]:
result['answer'] 

'  The least-squares method is needed because the exact solution is not always possible due to the limitations of real-world data. In the examples provided, it is shown that the exact solution can fail when the design matrix is not square or when there is noise in the data. The least-squares method is a way to find a solution that is optimal in terms of the sum of squared residuals, even when the exact solution is not possible. It is a more robust and practical approach to solving regression problems.'

We got a good answer to our follow-up question. The chatbot we created keeps track of the conversation history, so it knew which method we were referring to and gave the right answer.
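The reason a vague follow-up such as "why is that method needed?" can still retrieve the right documents is that a conversational retrieval chain first rewrites the follow-up, together with the chat history, into a standalone question and only then runs retrieval. The sketch below imitates that flow in plain Python under stated assumptions: `conversational_answer`, `fake_retrieve`, `scripted_llm`, and the prompt wording are hypothetical, and the scripted model simply returns fixed strings.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def conversational_answer(
    question: str,
    history: History,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    if history:
        # Step 1: condense history + follow-up into a standalone question.
        transcript = "\n".join(f"Human: {q}\nAI: {a}" for q, a in history)
        standalone = llm(
            f"Chat history:\n{transcript}\n\n"
            f"Rephrase this follow-up as a standalone question: {question}"
        )
    else:
        standalone = question
    # Step 2: retrieve with the standalone question, then answer from the context.
    context = "\n\n".join(retrieve(standalone))
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    history.append((question, answer))
    return answer


# Hypothetical stand-ins for a vector store retriever and a real model.
retrieval_queries: List[str] = []


def fake_retrieve(query: str) -> List[str]:
    retrieval_queries.append(query)
    return ["toy chunk about least squares"]


def scripted_llm(prompt: str) -> str:
    if "Rephrase this follow-up" in prompt:
        return "Why is the least-squares method needed?"
    return "stub answer"


chat_history: History = []
conversational_answer("What is Least-squares method?", chat_history, fake_retrieve, scripted_llm)
conversational_answer("why is that method needed?", chat_history, fake_retrieve, scripted_llm)
print(retrieval_queries)
```

In this sketch the second retrieval query names the method explicitly, even though the user's follow-up did not.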