# Ollama PDF RAG Notebook
#### Original Project Idea by @Tony Kimpkemboi on YouTube

## Import Libraries


In [1]:
#Imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain.schema import Document
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

#Suppressing the  warnings
import warnings
warnings.filterwarnings('ignore')

#Jupyter-specific imports
from IPython.display import display, Markdown

#Force the pure-Python protobuf implementation to avoid common protobuf errors
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

## Load PDF

In [2]:
#Load the PDF
#Uses the Transformer paper as the sample PDF; the loader extracts its text
local_path = "transformer-paper.pdf"
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
    print(f"PDF loaded successfully: {local_path}")
else:
    print("Upload a PDF file")

PDF loaded successfully: transformer-paper.pdf
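
Note that `if local_path:` only checks that the path string is non-empty, so a missing file would still reach the loader. A minimal sketch of a stricter check is below; `resolve_pdf_path` is a hypothetical helper name, not part of LangChain.

```python
from pathlib import Path


def resolve_pdf_path(path_str):
    """Return a Path if path_str points to an existing .pdf file, else None."""
    path = Path(path_str)
    if path.is_file() and path.suffix.lower() == ".pdf":
        return path
    return None


# Hypothetical usage: check the file before handing it to the loader
pdf_path = resolve_pdf_path("transformer-paper.pdf")
if pdf_path is None:
    print("PDF not found; check the path before calling the loader")
else:
    print(f"Ready to load: {pdf_path}")
```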


## Split text into chunks

In [3]:
#Split text into chunks
#this uses the RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 51 chunks
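
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines); `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With these toy settings, the last 4 characters of each chunk reappear at the start of the next one, which is the same idea as the 200-character overlap used above.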


In [4]:
#Test Ollama embeddings
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
test_embedding = embedding_model.embed_query("Test query")
print(test_embedding)

[0.021297751, 0.027801337, -0.1695024, -0.0060646716, 0.082031816, -0.036150075, 0.04433832, -0.010158394, 0.050543383, -0.034639053, 0.00083406776, 0.0592905, 0.045455262, -0.019677855, -0.094772786, -0.055518843, 0.049526308, -0.07056261, 0.004090541, -0.0014088722, 0.003917975, -0.01661441, -0.06654988, 0.007856057, 0.13825926, -0.050003737, -0.05544877, 0.040329337, -0.03442401, -0.017394057, 0.0013502602, -0.008086474, 0.050295394, -0.06049735, -0.036090594, -0.0078342, 0.019494731, 0.054925382, -0.015246182, 0.01627023, 0.05144798, 0.0055783708, 0.019464094, -0.04408677, 0.058690503, 0.0048292917, 0.029791553, 0.047673505, 0.04147772, -0.065289736, -0.060502533, -0.04464156, 0.048503365, 0.00051301694, 0.0363556, 0.021212632, -0.022007087, 0.01666229, 0.014475108, -0.016753413, 0.008460819, 0.011121671, -0.0549261, 0.04463891, 0.0418088, -0.0755826, -0.014451789, 0.0153478375, -0.021024639, 0.023898207, 0.024410984, 0.0006479432, 0.03350522, -0.029317653, -0.025719434, -0.0441762
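
Each embedding is just a list of floats, and retrieval works by comparing the query vector with the stored chunk vectors. The sketch below uses cosine similarity on made-up 3-dimensional vectors to show the idea. The metric the vector store actually uses depends on how the collection is configured, so cosine here is an illustrative assumption, and the vectors are toy values rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not produced by nomic-embed-text)
query_vec = [0.2, 0.1, 0.9]
chunk_vecs = {
    "chunk about attention": [0.25, 0.05, 0.85],
    "chunk about optimizers": [0.9, 0.3, 0.1],
}

# Rank chunks from most to least similar to the query
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```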

## Create vector database

In [5]:
#Create vector database
#Embeds each chunk and stores the vectors in a Chroma collection
#(in memory only, since no persist_directory is given)
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag" 
)
print("Vector database created successfully")

Vector database created successfully


## Set up LLM and Retrieval
#### 1. I set up the LLM using the llama2 model.
#### 2. LangChain also has a chat model designed for Ollama (ChatOllama).

In [6]:
#Set up local LLM model and retrieval
local_model = "llama2"  
#ollama chat model from langchain
llm = ChatOllama(model=local_model)

In [7]:
#Query prompt template
#Define the role of the AI
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

#Set up retriever
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)
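
Conceptually, multi-query retrieval asks the LLM for alternative phrasings, runs a similarity search for each one, and returns the de-duplicated union of the results. The sketch below mimics that flow with stand-in functions so it runs without a model or database. `fake_variants` and `fake_search` are hypothetical placeholders, and including the original question in the query list is a choice made for this sketch.

```python
def multi_query_retrieve(question, generate_variants, search):
    """Search with the question and its variants; return unique results in first-seen order."""
    seen = set()
    merged = []
    for query in [question, *generate_variants(question)]:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_variants(question):
    # Stand-in for the LLM call that rewrites the question
    return [f"{question} (rephrased)", f"Explain: {question}"]


def fake_search(query):
    # Stand-in for the vector store: one phrasing surfaces an extra chunk
    results = ["chunk-A", "chunk-B"]
    if query.startswith("Explain"):
        results.append("chunk-C")
    return results


docs = multi_query_retrieve("What is attention?", fake_variants, fake_search)
print(docs)
```

The point of the extra queries is visible in the stand-in data: `chunk-C` is only found by one of the rephrasings, yet it still ends up in the merged result.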

In [8]:
#Verify vector_db and retriever
print(f"Vector DB retriever: {vector_db.as_retriever()}")
print(f"Retriever: {retriever}")

Vector DB retriever: tags=['Chroma', 'OllamaEmbeddings'] vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000019E1EC9B8F0> search_kwargs={}
Retriever: retriever=VectorStoreRetriever(tags=['Chroma', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000019E1EC9B8F0>, search_kwargs={}) llm_chain=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='You are an AI language model assistant. Your task is to generate 2\n    different versions of the given user question to retrieve relevant documents from\n    a vector database. By generating multiple perspectives on the user question, your\n    goal is to help the user overcome some of the limitations of the distance-based\n    similarity search. Provide these alternative questions separated by newlines.\n    Original question: {question}')
| ChatOllama(model='llama2')
| LineListOutputParser()


## Create chain

In [9]:
#Provide context
#RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

In [10]:
#Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
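
In this chain the retriever returns a list of document objects, which is placed into `{context}` as-is. A common refinement is to join only the chunk text before it reaches the prompt. The sketch below shows such a helper using a stand-in class, so it runs without LangChain installed. `FakeDocument` and `format_docs` are hypothetical names, with `page_content` mirroring the attribute LangChain documents expose.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Minimal stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("The Transformer relies on attention."),
    FakeDocument("Adam was used as the optimizer."),
]
print(format_docs(retrieved))
```

If you want to try this in the chain, piping the retriever into the helper (for example `"context": retriever | format_docs`) is the usual pattern, though it has not been run as part of this notebook.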

## Chat with PDF

In [11]:
#the responses will be displayed using markdown
def chat_with_pdf(question):
    """
    Chat with the PDF using the RAG chain.
    """
    return display(Markdown(chain.invoke(question)))

In [12]:
# Example 1
chat_with_pdf("What is the main idea of this PDF document?")

The main idea of this PDF document appears to be the introduction and explanation of a new neural network architecture called the Transformer, which is designed specifically for sequence-to-sequence tasks such as machine translation. The document provides a detailed overview of the Transformer's components, including self-attention mechanisms, multi-head attention, and position-wise feedforward networks. It also discusses the advantages of the Transformer over traditional recurrent neural network (RNN) architectures, including its ability to parallelize computation across input sequences and its efficiency in handling long-distance dependencies. Additionally, the document includes figures and examples to illustrate the attention mechanism's ability to follow long-distance dependencies and resolve anaphora.

In [12]:
#Example 2
chat_with_pdf("Can you tell me what a Transformer is?")

Certainly! A Transformer is a type of neural network architecture that is specifically designed for sequence modeling and transduction tasks, such as language modeling and machine translation. It was introduced in a research paper titled "Attention is All You Need" by Vaswani et al. in 2017.

The Transformer model is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It consists of an encoder and a decoder, each comprised of multiple identical layers, where each layer exhibits self-attention and feed-forward processing. The self-attention mechanism allows the model to attend to different parts of the input sequence simultaneously and weigh their importance when computing the output.

The Transformer has several key advantages over traditional recurrent neural network (RNN) architectures:

1. Parallelization: The self-attention mechanism allows for parallelization, making it much faster and more scalable than RNNs.
2. Efficiency: The Transformer requires fewer parameters and computations compared to RNNs, making it more efficient in terms of both training time and memory usage.
3. Flexibility: The Transformer can handle input sequences of arbitrary length, without the need for segmentation or padding, which is a common challenge in sequence modeling tasks.
4. Quality: The Transformer has been shown to achieve state-of-the-art results in various sequence modeling tasks, such as language modeling and machine translation.

In summary, the Transformer is a powerful neural network architecture that leverages self-attention mechanisms to efficiently process sequential data, making it particularly well-suited for natural language processing tasks.

In [13]:
#Example 3
chat_with_pdf("Explain to me what Multihead attention is?")

Multihead attention is a mechanism introduced in the Transformer model (a popular deep learning architecture for natural language processing tasks) that allows the model to jointly attend to information from different representation subspaces at different positions.

In traditional attention mechanisms, a single attention function is applied to the query, keys, and values, and the output is computed as a weighted sum of these vectors. However, this can lead to the "attention collapse" problem, where the model only focuses on a limited subset of the input sequence and neglects the rest.

To address this issue, Multihead attention introduces multiple attention functions (or heads) that operate in parallel, each with its own set of weights. The outputs of these heads are then combined to form the final output. This allows the model to capture different aspects of the input sequence simultaneously and avoid the collapse problem.

Formally, given a query q and a set of keys k, Multihead attention computes the output o as follows:

o = Concat(h1, ..., hh) * W^O

where h1, ..., hh are the outputs of the multiple attention heads, and W^O is a learnable weight matrix. The Concatenation operator (Concat) combines the outputs of the different heads without any additional processing.

Multihead attention allows the model to capture different contextual relationships in the input sequence, leading to better performance in tasks such as machine translation, question answering, and text classification.

In [15]:
#Example 4
chat_with_pdf("What were the results of this paper?")

The paper "Attention Is All You Need" by Ashish Vaswani et al. (2017) introduced a new neural network architecture for machine translation called the Transformer model, which relies entirely on self-attention mechanisms instead of traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The Transformer model achieved state-of-the-art results in machine translation tasks, outperforming other state-of-the-art models of the time.

Specifically, the paper reported the following results:

1. The Transformer model achieved a BLEU score of 28.7 on the English-German translation task, which was a significant improvement over the previous state-of-the-art score of 25.3.
2. The Transformer model also achieved a ROUGE score of 40.6 on the English-French translation task, which was an improvement over the previous state-of-the-art score of 37.
3. The Transformer model was able to handle long-range dependencies in the input sequence more effectively than other models, as demonstrated by its performance on the WMT17 machine translation competition.
4. The Transformer model required significantly fewer parameters than other state-of-the-art models, which made it more computationally efficient and easier to train.

Overall, the results of the paper demonstrated the effectiveness of the Transformer model in machine translation tasks and its ability to handle long-range dependencies in input sequences. The Transformer model has since become a widely-used architecture in natural language processing tasks.

In [14]:
#Example 5
chat_with_pdf("What optimizer was used during training?")

According to the paper, the Adam optimizer was used during training. Specifically, the authors used the Adam optimizer with the following parameters:

* β1 = 0.9
* β2 = 0.98
* ε = 10^-9

They also varied the learning rate over time according to a specific formula, which is mentioned in the paper.

In [None]:
#Example 6
#enter any question

## Clean up (optional)

In [16]:
#Optional: Clean up the vector database when done 
vector_db.delete_collection()
print("Vector database deleted successfully")

Vector database deleted successfully
