# Document-Aware Chatbot: Building a Conversational Retrieval Chain with LangChain

![Langchain_workflow](./Langchain_workflow.png)

 *source: [DeepLearning.AI](https://learn.deeplearning.ai/courses/langchain-chat-with-your-data/lesson/1/lesson_1)*

This notebook demonstrates how to build a document-aware chatbot using LangChain's Conversational Retrieval Chain. This powerful technique allows us to create a chatbot that can answer questions based on specific documents while maintaining context throughout the conversation. Here's what we'll cover:

1. Load and preprocess documents using LangChain's document loaders and text splitters.
2. Create embeddings and set up a vector store for efficient document retrieval.
3. Implement a conversational retrieval chain that combines document retrieval with language model responses.
4. Add conversational memory to allow the chatbot to remember and refer to previous interactions.
5. Test the chatbot with sample questions and analyze its performance.

Note:
- This notebook was built in Google Colab using a free account.
- This notebook only utilizes open-source language models and packages.
- This notebook loads Llama 3 8B Instruct from Hugging Face. To download Llama 3 from Hugging Face, you first need to obtain an access token.
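
One way to keep the access token out of the notebook itself is to read it from an environment variable. The sketch below is a minimal illustration, not part of the original notebook: the helper name `get_hf_token` is hypothetical, and it assumes the token has been stored under `HF_TOKEN` beforehand.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face access token stored in an environment variable.

    `get_hf_token` is a hypothetical helper for illustration only.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of a confusing download error later.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

The returned string can then be passed wherever a `token` argument is expected, so the secret never appears in a saved cell.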


## Packages


In [None]:
! pip install  langchain
! pip install  pypdf
! pip install  tiktoken
! pip install  -U langchain-cli
! pip install  chromadb
! pip install  sentence_transformers
! pip install  -U bitsandbytes
! pip install  -U accelerate
! pip install langchain-huggingface
! pip install --upgrade triton
! pip install -U langchain-community

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter
import numpy as np
from langchain.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain import HuggingFaceHub
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory,ConversationSummaryBufferMemory
from langchain_core.documents.base import Document
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer,pipeline
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import LlamaCpp
from langchain_huggingface import HuggingFacePipeline
import torch
from transformers import BitsAndBytesConfig

## Data loading and pre-processing
- Individual pages often contain little text, so the average number of tokens per page is relatively small.

- `TokenTextSplitter` splits each document independently and never merges text across documents. If `chunk_size` is 600 tokens and each page has about 300 tokens, the splitter therefore returns 300-token chunks.

- We are retrieving information from history books. History is a complex topic, and good retrieval needs sufficient context, so `chunk_size` should not be too small when splitting the text. We set `chunk_size` to 600 tokens.

- Therefore, after loading the file, we compute `page_combine = round(chunk_size / average_length)`, where `average_length` is the average length per page, approximated in the code below by the number of whitespace-separated words rather than true tokens. For example, if `average_length = 350`, then `page_combine = 2`, and we combine every 2 pages of the document before passing them to `TokenTextSplitter`. If `page_combine` is 1 or 0, no pages are combined.
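
The arithmetic above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `pages_to_combine` and `group_pages` are hypothetical helper names, and the pages are plain strings rather than LangChain `Document` objects.

```python
from typing import List


def pages_to_combine(chunk_size: int, average_length: float) -> int:
    """Number of consecutive pages to merge so a merged page is near chunk_size."""
    # round() can return 0 when pages are much longer than chunk_size; never go below 1.
    return max(1, round(chunk_size / average_length))


def group_pages(page_texts: List[str], n: int) -> List[str]:
    """Join every n consecutive pages; the final group may contain fewer pages."""
    return ["\n\n".join(page_texts[i:i + n]) for i in range(0, len(page_texts), n)]


n = pages_to_combine(chunk_size=600, average_length=350)  # 600 / 350 is about 1.71, rounds to 2
merged = group_pages(["page 0", "page 1", "page 2"], n)
print(n, merged)
```

With three toy pages and `n = 2`, the first two pages are joined and the last page is kept on its own.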

In [None]:
directory = "/content/drive/MyDrive/Text_files/"
book_name = ["World_history_since_1500.pdf","Modern_World_history.pdf","World History_Cultures_States_Societies.pdf","short_world_history.pdf"]
loader = PyPDFLoader(directory + book_name[0])
pages_raw = loader.load()

In [None]:
average_length = np.mean([len(x.page_content.split()) for x in pages_raw])
print("Average text length per page is ", average_length)

Average text length per page is  364.3175965665236


In [None]:
chunk_size = 600
page_combine = round(chunk_size/average_length)
if page_combine == 0:
  page_combine = 1
print("Number of pages to combine is ", page_combine)
source = pages_raw[0].metadata['source']
pages = []
if page_combine > 1:
  # Step through the pages in groups of page_combine; the last group may be shorter.
  # Stepping by page_combine avoids an empty trailing document when the page count divides evenly.
  for start in range(0, len(pages_raw), page_combine):
    end = min(start + page_combine, len(pages_raw))
    content = '\n\n'.join(x.page_content for x in pages_raw[start:end])
    # Record the original page indices covered by this combined document.
    page_num = ','.join(str(p) for p in range(start, end))
    metadata = {'source': source, 'page': page_num}
    pages.append(Document(page_content=content, metadata=metadata))
else:
  pages = pages_raw

Number of pages to combine is  2


In [None]:
average_length = np.mean([len(x.page_content.split()) for x in pages])
print("Average text length per page is ", average_length)

Average text length per page is  725.5213675213676


In [None]:
print(pages[6].metadata)
pages[6].page_content

{'source': '/content/drive/MyDrive/Text_files/World_history_since_1500.pdf', 'page': '12,13'}


'6 CHAPTER 1 THE WORLD IN 1500\nbelt in Central Africa, such as the Luba (16th through \n19th centuries) and Lunda Empires (17th through \n19th centuries), put further pressure on the Kingdom \nof Kongo.\nKINGDOM OF KONGO\nFrom the 1380s to the 1960s, the Kingdom of Kongo \nrepresented one of the most influential monarchies \nin West Central Africa. Located north of the Malebo \nPool of the lower Congo River, the Kingdom profited from fertile soils, abundant rainfall, and the presence \nof valuable iron and copper deposits. Beginning in \nthe late 1300s, the Bakongo people living south of \nthe Congo River unified into a single kingdom with a \ncapital at Mbanza Kongo, from which the Manikongo \n(king of the Kingdom of Kongo) ruled. When the \nPortuguese arrived in 1483, the Manikongos of Kongo \nsaw an immediate material advantage in establishing \ndiplomatic and trade relations with the new arrivals. In \nthe short term, participation in the Atlantic trade gave \nKongo a competitive 

## Data Splitting: set splitter and chunk size
- We originally used a smaller `chunk_size` and overlap, but noticed that the context was not sufficiently preserved. We therefore use a larger chunk size (600 tokens) with an overlap of 80 tokens.

In [None]:
chunk_overlap = 80

In [None]:
text_splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
docs = text_splitter.split_documents(pages)

In [None]:
len(docs)

310
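
To make the effect of chunk size and overlap concrete, here is a minimal sliding-window sketch over a list of token ids. It illustrates the general idea only and is not the library's implementation; `sliding_chunks` is a hypothetical name, and the integer list stands in for real tokenizer output.

```python
from typing import List


def sliding_chunks(tokens: List[int], chunk_size: int, chunk_overlap: int) -> List[List[int]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the text
        start += step
    return chunks


toy_tokens = list(range(1300))  # stand-in for a tokenized combined page
toy_chunks = sliding_chunks(toy_tokens, chunk_size=600, chunk_overlap=80)
print([len(c) for c in toy_chunks])
```

Each window repeats the last 80 tokens of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.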

## Creating a vector store using a sentence transformer

In [None]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
hf_embeddings = HuggingFaceEmbeddings(model_name=model_name)



In [None]:
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=hf_embeddings
)
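
Conceptually, a vector store answers a query by comparing the query embedding against the stored chunk embeddings and returning the closest ones. The sketch below illustrates that idea with cosine similarity on tiny hand-made vectors. It is an assumption-laden toy: `cosine_similarity` and `top_k` are hypothetical helpers, and the distance metric a real Chroma collection uses depends on its configuration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for chunk embeddings
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

A production vector store replaces the exhaustive scan with an index, but the ranking principle is the same.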

## Load Llama 3 8B as the large language model for question answering in the conversational retrieval chain
We load the Llama 3 8B Instruct model as the LLM for the chain. The cell below sets up `quantization_config` to load the model with 4-bit quantization.
- This is necessary because of the limited memory in Google Colab. Loading 8 billion parameters in bfloat16 (2 bytes per parameter) requires at least 16 GB for the weights alone, whereas Colab offers only 15 GB of GPU memory.
- Loading in 4-bit cuts the memory required for the weights by roughly a factor of 4.
- Note that this only quantizes how the weights are stored. Computation during inference still runs in bfloat16, as set by `bnb_4bit_compute_dtype`.
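
The memory estimate above is simple arithmetic, sketched below. It counts the weights only and assumes exactly 8 billion parameters and 1 GB = 10^9 bytes; activations, the key-value cache, and quantization metadata add to the real footprint. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the model weights, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


bf16_gb = weight_memory_gb(8e9, 16)  # bfloat16: 2 bytes per parameter
int4_gb = weight_memory_gb(8e9, 4)   # 4-bit: half a byte per parameter
print(bf16_gb, int4_gb, bf16_gb / int4_gb)
```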

In [None]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Replace the placeholder with your own Hugging Face access token; never share or commit a real token.
hf_token = "YOUR_HUGGINGFACE_TOKEN"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, token=hf_token)

In [None]:
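pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150,                   # cap the length of each generated answer
    pad_token_id=tokenizer.eos_token_id,  # use the EOS token for padding
    repetition_penalty=1.2,               # discourage repeated tokens
    return_full_text=False,               # return only the generated text, not the prompt
#    forced_eos_token_id = 13
)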

llm = HuggingFacePipeline(pipeline=pipe)

### Set up the memory and the conversational retrieval chain
* Retrieval uses similarity search, returning the 6 most similar chunks (k = 6).
* The chain combines the retrieved chunks with the "stuff" method (the `ConversationalRetrievalChain.from_llm` default), which places all of them into a single prompt. Llama 3 8B Instruct has an 8,192-token context window. Because we retrieve only 6 chunks of at most 600 tokens each (about 3,600 tokens in total), the context window is sufficient to hold all the retrieved information.
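
The context-window check can be written out as a rough budget. This is a sketch under assumptions: `prompt_overhead` (the instruction text plus the question) is a guessed value of 200 tokens, and the chunk size is measured with the splitter's tokenizer, which may count tokens somewhat differently from the Llama 3 tokenizer. `prompt_token_budget` is a hypothetical helper.

```python
def prompt_token_budget(k: int, chunk_size: int, prompt_overhead: int, max_new_tokens: int) -> int:
    """Upper bound on tokens used by k retrieved chunks, the prompt text, and the answer."""
    return k * chunk_size + prompt_overhead + max_new_tokens


CONTEXT_WINDOW = 8192  # Llama 3 8B Instruct

needed = prompt_token_budget(k=6, chunk_size=600, prompt_overhead=200, max_new_tokens=150)
print(needed, needed <= CONTEXT_WINDOW)
```

Under these assumptions the budget is well below the window, which leaves headroom if `k` or the chunk size is increased later.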


In [None]:
memory = ConversationBufferMemory(
    memory_key="chat_history",
    output_key='answer',
    return_messages=True
)
retriever=vectordb.as_retriever(search_kwargs={"k": 6})

template = """Use the following pieces of context to answer the question at the end. \
If you cannot find the answer in the context, say "I don't have enough information to answer \
that question." Keep the answer concise, but do not miss key points.

{context}
Question: {question}

Provide your answer below, focusing only on answering the question without repeating the question or context:"""


QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template)


qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    return_source_documents=True,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT}
)
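
As a rough mental model, each turn of a conversational retrieval chain rewrites the follow-up question into a standalone one using the chat history, retrieves chunks for that standalone question, and then answers from the retrieved chunks. The sketch below mimics that flow with toy stand-ins. Every function name here is hypothetical, and the stubs replace the LLM and the retriever, so it shows the control flow only.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs


def conversational_retrieval_step(
    question: str,
    chat_history: History,
    condense: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> str:
    """Run one turn: condense the question, retrieve context, answer, update history."""
    # On the first turn there is no history, so the question is used as-is.
    standalone = condense(question, chat_history) if chat_history else question
    context = retrieve(standalone)
    reply = answer(standalone, context)
    chat_history.append((question, reply))
    return reply


# Toy stand-ins for the LLM calls and the vector-store retriever.
def toy_condense(question: str, history: History) -> str:
    _, last_answer = history[-1]
    return f"{question} (in reference to: {last_answer})"


def toy_retrieve(query: str) -> List[str]:
    return [f"passage matching '{query}'"]


def toy_answer(query: str, context: List[str]) -> str:
    return f"answer to '{query}' using {len(context)} passage(s)"
```

Calling `conversational_retrieval_step` twice with the same `chat_history` list shows why a follow-up such as "Who produced these goods?" depends on the previous turn: the second call condenses the question against the first answer before retrieving.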

## A sequence of question answering examples

In [None]:
question = "What were some main colonization efforts after 1600?"
result = qa.invoke({"question": question})

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [None]:
print(result['answer'])

 After 1600, European colonization efforts targeted various regions including:

* Africa, particularly West Africa, where European traders and missionaries sought to exploit natural resources and spread Christian influence.
* Southeast Asia, where European powers competed for control over spice routes and territorial expansion.
* The Pacific Islands, where European explorers discovered new islands and established settlements.
* The Americas, specifically the southern cone, where European colonization attempts failed due to disease, conflicts with native inhabitants, and harsh environments. Some notable examples include:
	+ The failure of the Roanoke colony in present-day Virginia, USA.
	+ The short-lived settlement of Port-Royal in Acadia (present-day Nova Scotia, Canada). 
	+ The unsuccessful attempt to establish a settlement near Cape


In [None]:
question = "What were the common goods for trading in 1600 - 1900?"
result = qa.invoke({"question": question})

In [None]:
print(result['answer'])

 **Sugar**, **Tobacco**, **Gold**, **Silver**, **Opium**.


In [None]:
question = "Who produced these goods?"
result = qa.invoke({"question": question})

In [None]:
print(result['answer'])

 Silver, Chinese merchants, European traders, Chinese traders, British East India Company, French East India Company, Dutch East India Company, Portuguese East India Company, German East India Company, Italian East India Company

Please note that I will consider any additional answers beyond those listed above as incorrect if they are mentioned in the text.
