# Notebook 2: Document Question-Answering with LangChain
This notebook demonstrates how to use LangChain to build a chatbot that references a custom knowledge base.

Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents. LLMs, given their proficiency in understanding text, are a great tool for this. 

### [LangChain](https://python.langchain.com/docs/get_started/introduction)
[**LangChain**](https://python.langchain.com/docs/get_started/introduction) provides a simple framework for connecting LLMs to your own data sources. LLMs are trained only up to a fixed point in time and do not contain knowledge that is proprietary to an enterprise, so on their own they can't answer questions about new or proprietary information. LangChain helps address this gap by connecting the model to that data.

<div class="alert alert-block alert-info">
    
⚠️ The next notebook, `03_llama_index_simple.ipynb`, provides the same functionality as this one but uses LlamaIndex instead of LangChain. We recommend reading about LangChain vs. LlamaIndex and picking the library, or the components of each library, that make the most sense to you.

</div>

![data_connection](./imgs/data_connection_langchain.jpeg)

### Step 1: Integrate TensorRT-LLM with LangChain [*(Model I/O)*](https://python.langchain.com/docs/modules/model_io/)

#### Custom TRT-LLM LangChain integration
LangChain allows you to [create custom wrappers for your LLM](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm) in case you want to use your own LLM or a different wrapper than the one that is supported in LangChain. Since we are using a custom Llama2 model hosted on Triton with TRT-LLM, we have written a custom wrapper for our LLM.

Below is a snippet of the custom wrapper. Take a look at ```trt_llm.py``` for the full implementation.
```
class TensorRTLLM(LLM):
    server_url: str = Field(None, alias="server_url")

    # some of the optional arguments
    model_name: str = "ensemble"
    temperature: Optional[float] = 1.0
    top_p: Optional[float] = 0

    @property
    def _llm_type(self) -> str:
        return "triton_tensorrt"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs,
    ) -> str:
        """
        Args:
            prompt: The prompt to pass into the model.
            stop: A list of strings to stop generation when encountered

        Returns:
            The string generated by the model
        """

```

The core of the wrapper is a ```_call``` method that takes in a prompt string and some optional stop words, and returns the generated string. Take a look at ```trt_llm.py``` for the code of the LangChain wrapper for a Llama2 model deployed on Triton with TRT-LLM.
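To illustrate what the optional stop words do, here is a minimal sketch of one common way to apply them: cut the generated text at the first stop string that appears. The helper name `truncate_at_stop` and the sample text are hypothetical, and this is not necessarily how `trt_llm.py` handles stop words.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Cut `text` at the earliest occurrence of any stop string (hypothetical helper)."""
    if not stop:
        return text
    # Collect the starting position of every stop string that occurs in the text.
    positions = [text.find(s) for s in stop if s and s in text]
    return text[: min(positions)] if positions else text


# Made-up model output, for illustration only.
sample_output = "Llama 2 is a family of LLMs.</s> Question: what else?"
print(truncate_at_stop(sample_output, stop=["</s>", "Question:"]))
```

Text without any of the stop strings is returned unchanged.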

The wrapper is then instantiated with the server address and generation options:

``llm = TensorRTLLM(server_url="triton_ip:8001", model_name="ensemble", callbacks=callbacks, tokens=500)``

<div class="alert alert-block alert-warning">
    
<b>WARNING!</b> Be sure to replace <code>server_url</code> with the address and port of the Triton server hosting the LLM. If you are running this as part of the generative AI Workflow, you don't have to replace the Triton URL.

</div>

In [None]:
from trt_llm import TensorRTLLM
from langchain.callbacks import streaming_stdout

callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
# Connect to the TRT-LLM Llama-2 model running on the Triton server at the url below
llm = TensorRTLLM(server_url="triton:8001", model_name="ensemble", callbacks=callbacks, tokens=500)

### Step 2: Create a Prompt Template [*(Model I/O)*](https://python.langchain.com/docs/modules/model_io/)

A [**prompt template**](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) is a common paradigm in LLM development.

A prompt template is a predefined set of instructions that is provided to the LLM and guides the output the model produces. It can contain few-shot examples and guidance, and it is a quick way to shape the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we construct from:
- The system prompt
- The context
- The user's question

In [None]:
from langchain.prompts import PromptTemplate

# Llama 2 chat format: one <s>[INST] opener, the system prompt wrapped in
# <<SYS>> ... <</SYS>>, then the user turn, closed by [/INST].
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
    "<</SYS>>\n\n"
    "Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer: [/INST]"
)

LLAMA_PROMPT = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)
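
To see what the model actually receives, the placeholders can be filled in by hand. The sketch below uses plain `str.format` on a shortened template in the Llama 2 format; the template text and the context and question values are made up for illustration. Calling `format(context=..., question=...)` on a `PromptTemplate` performs the equivalent substitution.

```python
# Shortened, illustrative template in the Llama 2 chat format (not the notebook's exact text).
toy_template = (
    "<s>[INST] <<SYS>>\n"
    "Use the following context to answer the user's question.\n"
    "<</SYS>>\n\n"
    "Context: {context} Question: {question} Helpful answer: [/INST]"
)

# Hypothetical values; in the notebook these come from the retriever and the user.
filled_prompt = toy_template.format(
    context="(retrieved text chunks go here)",
    question="Can you talk about safety evaluation of llama2 chat?",
)
print(filled_prompt)
```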

### Step 3: Load Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load various types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites).

Document loaders load data from a source as **Documents**. A **Document** is a piece of text (the page_content) and associated metadata. Document loaders provide a ``load`` method for loading data as documents from a configured source. 
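
As a rough mental model, a Document can be pictured as a small record with two fields, and a loader as any object with a `load` method that returns a list of them. The sketch below uses hypothetical stand-in classes (`Doc`, `InMemoryLoader`) rather than LangChain's own, purely to show the shape of the data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for a Document: a piece of text plus associated metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class InMemoryLoader:
    """Hypothetical loader that mimics the `load` interface."""

    def __init__(self, texts: Dict[str, str]) -> None:
        self.texts = texts  # maps a source name to its raw text

    def load(self) -> List[Doc]:
        # One Doc per source, with the source name recorded in the metadata.
        return [
            Doc(page_content=text, metadata={"source": source})
            for source, text in self.texts.items()
        ]


sample_docs = InMemoryLoader({"notes.txt": "Some text to index."}).load()
print(sample_docs[0].page_content, sample_docs[0].metadata)
```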

In this example, we use a LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a research paper about Llama2 from Meta.

[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain.

In [None]:
! wget -O "llama2_paper.pdf" -nc --user-agent="Mozilla" https://arxiv.org/pdf/2307.09288.pdf

In [None]:
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("llama2_paper.pdf")
data = loader.load()

### Step 4: Transform Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). 

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html#langchain.text_splitter.SentenceTransformersTokenTextSplitter). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents. 

There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together. 
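
The idea behind chunk size and chunk overlap can be sketched with a sliding window. This is a simplified illustration that assumes whitespace-separated words as tokens; the real splitter counts tokens with the sentence-transformer model's tokenizer. The function name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


toy_tokens = "a b c d e f g h i j".split()
toy_chunks = chunk_tokens(toy_tokens, chunk_size=4, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, which is how overlap helps keep related text together across chunk boundaries.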

In [None]:
import time
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_CHUNK_SIZE = 510
TEXT_SPLITTER_CHUNK_OVERLAP = 200

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    chunk_size=TEXT_SPLITTER_CHUNK_SIZE,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)
start_time = time.time()
documents = text_splitter.split_documents(data)
print(f"--- {time.time() - start_time} seconds ---")

Let's view a sample of content that is chunked together in the documents.

In [None]:
documents[40].page_content

### Step 5: Generate Embeddings and Store Embeddings in the Vector Store [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)

#### a) Generate Embeddings
[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. The embedding model used below is [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).

LangChain provides a wide variety of [embedding models](https://python.langchain.com/docs/integrations/text_embedding) from many providers and makes it simple to swap out the models. 

When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows us to find the documents that are most similar (relevant) to the user's query.
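
One common way to measure how similar two embeddings are is cosine similarity. The sketch below uses made-up 3-dimensional vectors (real `e5-large-v2` embeddings have 1024 dimensions), and cosine similarity is only one possible choice of metric, assumed here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of a query and two chunks.
query_vec = [0.9, 0.1, 0.0]
related_chunk_vec = [0.8, 0.2, 0.1]
unrelated_chunk_vec = [0.0, 0.2, 0.9]

print(cosine_similarity(query_vec, related_chunk_vec))
print(cosine_similarity(query_vec, unrelated_chunk_vec))
```

The chunk whose vector points in nearly the same direction as the query scores higher and would be ranked first.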

#### b) Store Document Embeddings in the Vector Store
Once the document embeddings are generated, they are stored in a vector store so that at query time we can:
1) Embed the user query and
2) Retrieve the embedding vectors that are most similar to the query embedding.

A vector store takes care of storing the embedded data and performing a vector search.
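
Conceptually, a vector store needs just two operations: add a vector with its text, and return the texts whose vectors are closest to a query vector. The toy class below is a hypothetical brute-force sketch that scans every row and ranks by Euclidean distance (an assumed metric); production stores such as Milvus rely on index structures to avoid scanning everything.

```python
import math
from typing import List, Sequence, Tuple


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans them all."""

    def __init__(self) -> None:
        self._rows: List[Tuple[Sequence[float], str]] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self._rows.append((vector, text))

    def search(self, query_vector: Sequence[float], k: int = 2) -> List[str]:
        # Rank stored rows by Euclidean (L2) distance to the query; smaller is closer.
        ranked = sorted(self._rows, key=lambda row: math.dist(row[0], query_vector))
        return [text for _, text in ranked[:k]]


toy_store = ToyVectorStore()
toy_store.add([0.8, 0.2, 0.1], "chunk A")
toy_store.add([0.0, 0.2, 0.9], "chunk B")
toy_store.add([0.7, 0.3, 0.0], "chunk C")

toy_hits = toy_store.search([0.9, 0.1, 0.0], k=2)
print(toy_hits)
```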

LangChain provides support for a [great selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/). 

<div class="alert alert-block alert-info">
    
⚠️ For this workflow, the [Milvus](https://milvus.io/) vector database runs as a microservice.

</div>

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
import torch
import time

# Run the embedding model on the CPU here to conserve GPU memory.
# In the production deployment (the API server shown in the 5th notebook), the model runs on the GPU.
model_name = "intfloat/e5-large-v2"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
start_time = time.time()
vectorstore = Milvus.from_documents(documents=documents, embedding=hf_embeddings, connection_args={"host": "milvus", "port": "19530"})
print(f"--- {time.time() - start_time} seconds ---")

In [None]:
# Simple Example: Retrieve Documents from the Vector Database
# note: this is just for demonstration purposes of a similarity search 
question = "Can you talk about safety evaluation of llama2 chat?"
docs = vectorstore.similarity_search(question)
print(docs[2].page_content)

> ### Simple Example: Retrieve Documents from the Vector Database [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
> Given a user query, relevant splits for the question are returned through a **similarity search**. This is also known as a semantic search, because it matches on meaning. It differs from a lexical search, where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query. A semantic search tends to generate more relevant results than a lexical search.
![vector_stores.jpeg](./imgs/vector_stores.jpeg)
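
To make the contrast concrete, here is a minimal sketch of a lexical search that ranks texts by how many literal words they share with the query. The function `lexical_search` and the two sample sentences are hypothetical. Both sentences are arguably relevant to the query, but only the one containing the literal word "safety" is found.

```python
from typing import List


def lexical_search(query: str, texts: List[str]) -> List[str]:
    """Return texts sharing at least one lowercase word with the query, best overlap first."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(t.lower().split())), t) for t in texts]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for score, t in ranked if score > 0]


toy_corpus = [
    "The model was tested for harmful outputs.",
    "Safety benchmarks are listed in the appendix.",
]
lexical_hits = lexical_search("safety evaluation", toy_corpus)
print(lexical_hits)
```

A semantic search over embeddings is intended to surface the first sentence as well, since "tested for harmful outputs" is close in meaning to "safety evaluation" even though no words match.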

### Step 6: Compose a streamed answer using a Chain
We have already integrated the Llama2 TRT LLM into LangChain with a custom wrapper, loaded and transformed documents, and generated and stored document embeddings in a vector database. To finish the pipeline, we need to add a few more LangChain components and combine all the components together with a [chain](https://python.langchain.com/docs/modules/chains/).

A [LangChain chain](https://python.langchain.com/docs/modules/chains/) combines components together. In this case, we use a [RetrievalQA chain](https://js.langchain.com/docs/modules/chains/popular/vector_db_qa/), which is a chain type for question-answering against a vector index. It combines a *Retriever* and a *question answering (QA) chain*.

We pass it three of our LangChain components:
- Our instance of the LLM (from step 1).
- A [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/), which is an interface that returns documents given an unstructured query. In this case, we use our vector store as the retriever.
- Our prompt template constructed from the prompt format for Llama2 (from step 2).

```
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": LLAMA_PROMPT}
)
```
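
Roughly speaking, a chain like this retrieves the relevant chunks, places them into the prompt as context, and sends the filled prompt to the LLM. The sketch below mirrors that flow with plain Python callables; `answer_question`, `fake_retrieve`, `fake_generate`, and the toy template are all hypothetical stand-ins, not the `RetrievalQA` implementation.

```python
from typing import Callable, List


def answer_question(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    prompt_template: str,
) -> str:
    """Retrieve context, insert it into the prompt, and ask the model."""
    chunks = retrieve(query)
    context = "\n\n".join(chunks)  # concatenate all retrieved chunks into one context
    prompt = prompt_template.format(context=context, question=query)
    return generate(prompt)


# Hypothetical stand-ins for the vector store retriever and the LLM.
def fake_retrieve(query: str) -> List[str]:
    return ["(first retrieved chunk)", "(second retrieved chunk)"]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


toy_qa_template = "Context: {context} Question: {question} Helpful answer:"
print(answer_question("How was safety evaluated?", fake_retrieve, fake_generate, toy_qa_template))
```

Swapping the two stand-in functions for a real retriever and a real model call gives the same overall shape as the chain used in this step.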

Lastly, we pass a user query to the chain and stream the result. 

In [None]:
from langchain.chains import RetrievalQA
import time

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": LLAMA_PROMPT}
)
start_time = time.time()
result = qa_chain({"query": question})
print(f"\n--- {time.time() - start_time} seconds ---")