## PROJECT SETUP

Imports:
- langchain for building the LLM pipeline, with langchain-openai for the OpenAI-specific integration
- chromadb is the vector database used to store and query embeddings
- pypdf for parsing and reading PDFs in Python
- pandas for data manipulation and analysis
- streamlit for the app UI
- python-dotenv for loading environment variables from a .env file

In [4]:
!pip3 install --upgrade --quiet langchain-community langchain-openai chromadb
!pip3 install --upgrade --quiet pypdf pandas streamlit python-dotenv

The OpenAI API key is stored in the .env file.

In [None]:
# Import the LangChain modules used throughout the notebook
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

# Other modules and packages that are needed
import os
import tempfile
import streamlit as st  
import pandas as pd
from dotenv import load_dotenv

In [6]:
load_dotenv() # read all variables from the .env file (including the API key)

True

In [8]:
OPENAPI_API_KEY = os.environ.get('OPENAI_API_KEY') # read the API key loaded from .env into a notebook variable
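
Note that `os.environ.get` returns `None` without complaint when the variable is missing, so a typo in the .env file only shows up later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demonstration with a throwaway variable (hypothetical name and value).
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook this could be used as `require_env("OPENAI_API_KEY")` right after `load_dotenv()`.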

## DEFINING LLM

In [None]:
""" ChatOpenAI from langchain_openai is our LLM wrapper; we specify the model (gpt-4o-mini is cheap and fast).
    Passing api_key is optional here: ChatOpenAI reads OPENAI_API_KEY from the environment when it is set. """
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAPI_API_KEY)
llm.invoke("if active respond with active") # send a prompt to the LLM, just like typing a message into ChatGPT

## PROCESSING THE PDF FILE

Loading the PDF file

In [None]:
pdf_loader = PyPDFLoader("./test data/testpaper.pdf") # point the loader at the PDF in our project directory
pdf_pages = pdf_loader.load() # load the PDF, one Document per page
pdf_pages # display all the pages of the PDF

""" pdf_pages is a list of Document objects, each representing one page of the PDF.
    Each Document's metadata holds the source file, the page number, and similar details.
"""

- Problem: right now pdf_pages contains the whole PDF. As you might have guessed, we cannot pass a multi-page research paper straight into OpenAI's LLM. First, there is a token limit. Second, and more importantly, the LLM does not need every word in the PDF: we want to feed only the most relevant parts into the prompt, because passing too much or irrelevant information gives worse results.
- Solution: split the PDF into smaller chunks, such as paragraphs or sentences. Each chunk covers a narrower piece of the document, so we can select only the chunks relevant to a question, which makes the resulting prompt more focused and more likely to get a good answer from the LLM.

In [None]:
# use RecursiveCharacterTextSplitter from LangChain to split the text into chunks
"""
Parameters:
chunk_size is the maximum number of characters in each chunk,
chunk_overlap is the number of characters shared between consecutive chunks, so each chunk keeps some context from the previous one,
length_function is how the length of a chunk is measured (len counts characters),
separators are tried in order so that we avoid splitting in the middle of a word or sentence: first a blank line (paragraph break), then a new line, then a space
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200, length_function=len, separators=['\n\n', '\n', " "])
# run the text splitter on the test paper and store the chunks
pdf_chunks = text_splitter.split_documents(pdf_pages) # returns a list of chunks
pdf_chunks # display the chunks
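
To make chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch that uses a fixed-width sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on the separators), and the function name `split_with_overlap` is hypothetical; it only illustrates how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the one-character overlap requested.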

## TEXT EMBEDDINGS

We need a way to represent the chunks numerically. This is where text embeddings come in.

Text embeddings represent words or documents as numerical vectors that capture their meaning. This way, text is converted into a format that computers can understand and work with. An embedding vector is a list of numbers, where each number is one coordinate of a point in a high-dimensional space. The individual values don't have any real meaning on their own, but the relationships between vectors do. For example, similar words have similar vectors, meaning their vectors are closer together in space, and dissimilar words are farther apart. How do we know whether they are far or close? The distance between vectors can be measured with cosine similarity or Euclidean distance. We don't need to calculate this ourselves, as there are libraries that do it, but some linear algebra helps in understanding how it works. There are also many embedding models, ranging from simple to complex. A better model captures the meaning of text better, so having good embeddings for our chunks is important.
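
As a small illustration of the two measures, the sketch below computes cosine similarity and Euclidean distance in plain Python. The three-dimensional vectors are made-up toy values, not real embeddings (real ones have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors chosen by hand for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen) > cosine_similarity(king, banana))    # True
print(euclidean_distance(king, queen) < euclidean_distance(king, banana))  # True
```

Note the opposite directions: for similarity, higher means closer; for distance, lower means closer.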

Creating Embeddings

In [None]:
# use OpenAI's embeddings integration to embed the chunks
def get_embeddings():
    # load the embeddings model
    embeddings_model = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=OPENAPI_API_KEY)
    return embeddings_model # return the embeddings model

embedding_model = get_embeddings()
test_vector = embedding_model.embed_query("test") # embed the query "test"; this returns a long list of floats

Calculating Distance Between Two Vectors

In [None]:
# use LangChain's evaluator to compare embeddings

from langchain.evaluation import load_evaluator

evaluator = load_evaluator(evaluator="embedding_distance", embeddings=embedding_model) # load the evaluator with our evaluator type and embeddings model

evaluator.evaluate_strings(prediction="Man", reference="Woman") # distance between the embeddings of "Man" and "Woman"
evaluator.evaluate_strings(prediction="Man", reference="Queen") # distance between the embeddings of "Man" and "Queen"
# each call returns a score that is a distance (cosine distance by default), so a LOWER score means the two strings are MORE similar
# the first pair should come out closer (lower score) than the second pair

## VECTOR DATABASE

We have a lot of vectors, one for every chunk, so we need a way to manage and query them. For that we use a vector database; in our case, Chroma DB.

A vector database is like a library: everything is organized so that we can look it up, but instead of books we store chunks of information represented as vectors. Chroma is an open-source, fast, and scalable vector database, but there are others. How does a vector database work? When we make a query, such as asking "How does this book end?", the database creates a vector embedding for the question and searches the stored embeddings for the ones most similar to it. It then returns the corresponding chunks. These relevant chunks can be put together and fed into an LLM like GPT-4o to generate a good answer to our question.
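
The lookup described above can be sketched as a brute-force search over an in-memory list. This is a toy model only: the vectors are hand-picked stand-ins for real embeddings, the names `store` and `top_k` are hypothetical, and production vector databases typically use an index for approximate search rather than comparing against every stored vector.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [text for _, text in scored[:k]]

# (text, vector) pairs; the vectors are invented for illustration.
store = [
    ("The hero defeats the dragon in the final chapter.", [0.9, 0.1, 0.0]),
    ("The author was born in 1950.", [0.1, 0.9, 0.1]),
    ("The book ends with the hero returning home.", [0.8, 0.2, 0.1]),
]

# Stands in for the embedding of "How does this book end?"
query_vector = [1.0, 0.0, 0.0]
results = top_k(query_vector, store, k=2)
print(results)
```

The two chunks about the ending are returned and the biographical chunk is left out, which is the behaviour we rely on when retrieving context for the LLM.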

Creating a Vector Database

In [None]:
# use Chroma to build a vector store holding the vectors of the chunks
# this function creates a whole new vector store. NOTE: embedding the same file more than once stores its chunks twice as duplicates (AVOID THIS)
def create_vector_store(pdf_chunks, embedding_model, store_name):
    # pass our PDF chunks and embeddings model to the database; it is stored in a local folder (store_name) so we can load it later
    vectorstore = Chroma.from_documents(documents=pdf_chunks, embedding=embedding_model, persist_directory=store_name)
    vectorstore.persist() # write the store to disk so the folder is created; recent Chroma versions persist automatically, so this call may be redundant
    return vectorstore # return the vector store
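
One way to guard against the duplicate-chunk problem noted above is to give every chunk a deterministic ID derived from its content, so the same chunk always maps to the same ID. The sketch below shows only the ID scheme in plain Python; the helper `chunk_id` and the sample tuples are hypothetical.

```python
import hashlib

def chunk_id(source: str, page: int, text: str) -> str:
    """Derive a stable ID from where the chunk came from and what it says."""
    raw = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

# (source, page, text) tuples standing in for real chunks.
chunks = [
    ("testpaper.pdf", 0, "Introduction text."),
    ("testpaper.pdf", 1, "Methods text."),
    ("testpaper.pdf", 0, "Introduction text."),  # the same chunk added a second time
]

unique = {}
for source, page, text in chunks:
    # setdefault keeps the first occurrence and ignores repeats of the same ID
    unique.setdefault(chunk_id(source, page, text), text)

ids = list(unique)
print(len(chunks), "chunks in,", len(unique), "unique chunks out")
```

LangChain's `Chroma.from_documents` accepts an optional `ids` argument, so IDs like these could be passed alongside the chunks. How an already-existing ID is handled depends on the installed version, so treat that part as something to verify.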

## QUERY DATABASE FOR RELEVANT DATA

In [None]:
# create the vector database using the create_vector_store function
vectorstore = create_vector_store(pdf_chunks, embedding_model, "vector store")

In [None]:
# create a retriever from the vector store
# with search_type="similarity" the retriever ranks chunks by embedding distance (Chroma's default metric is squared L2 unless the collection is configured for cosine); by default it returns the 4 most relevant chunks
retriever = vectorstore.as_retriever(search_type="similarity")
relevant_chunks = retriever.invoke("What is the test paper about") # ask the retriever for the chunks most relevant to our question
relevant_chunks # display the relevant chunks

## CREATING A PROMPT FOR THE LLM

In [None]:
# prompt template: the start of our prompt, which tells the model the context of what we are doing
# it has 2 placeholders, {context} and {question}, that are filled in when we call the model
prompt_template = """
You are a helpful assistant that can answer questions about a PDF file.
Use the following pieces of context to answer the question, if you don't know the answer, 
just say that you don't know, don't try to make up an answer DONT DO IT DONT DO IT !!!!!

{context}

---

Answer the question based on the context given above: {question}
"""

In [None]:
# concatenate all the relevant context into one string
context_text = "\n\n---\n\n".join([doc.page_content for doc in relevant_chunks])

# create the final prompt
prompt = ChatPromptTemplate.from_template(prompt_template) # build the prompt with ChatPromptTemplate
final_prompt = prompt.format(context=context_text, question="What is the test paper about") # fill the context and question into the prompt
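
Stripped of the library, the step above is placeholder substitution. The sketch below reproduces it with plain `str.format`; the shortened template and the two sample chunks are invented for illustration, and ChatPromptTemplate additionally wraps the result as a chat message, which is not shown here.

```python
# A shortened, hypothetical version of the notebook's template.
template = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n---\n\n"
    "Question: {question}"
)

# Stand-ins for the page_content of the retrieved chunks.
chunks = ["First retrieved chunk.", "Second retrieved chunk."]

# Same separator the notebook uses between chunks.
context_text = "\n\n---\n\n".join(chunks)

filled = template.format(context=context_text, question="What is the test paper about")
print(filled)
```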

## PASSING THE PROMPT TO THE LLM

In [None]:
# finally, we pass this prompt to the actual LLM (like sending ChatGPT a message)
llm.invoke(final_prompt) # call the LLM with the final prompt; we get back a response

## RAG CHAIN EQUIVALENT

In [None]:
def format_docs(docs): # join the page content of the retrieved documents into one string
    return "\n\n".join(doc.page_content for doc in docs)

# this RAG chain first retrieves the relevant chunks from the vector store and concatenates them into one string,
# then fills the context and question into the prompt, and finally passes that prompt to the LLM
rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt # the ChatPromptTemplate built above, not the raw prompt_template string
            | llm
        )
rag_chain.invoke("What's the title of this paper?") # same pipeline as above, run as a single chain
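
To see what the `|` pipeline is doing, it can help to write the same flow as ordinary function calls. In the sketch below every component is a hypothetical stub (`retrieve`, `build_prompt`, and `fake_llm` are stand-ins, and no API is called); only the order of the steps mirrors the chain.

```python
def retrieve(question):
    # Stand-in for the retriever: always returns the same two chunks.
    return ["Chunk one about the paper.", "Chunk two about the paper."]

def format_docs(chunks):
    # Same idea as the notebook's format_docs, but on plain strings.
    return "\n\n".join(chunks)

def build_prompt(inputs):
    # Stand-in for the prompt template: fills context and question into text.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def fake_llm(prompt):
    # Stand-in for the model: reports what it received instead of answering.
    return f"(model reply to a prompt of {len(prompt)} characters)"

def run_chain(question):
    # Step 1: the dict step, retrieving context and passing the question through.
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    # Step 2: fill the prompt. Step 3: call the model.
    prompt = build_prompt(inputs)
    return prompt, fake_llm(prompt)

prompt_text, answer = run_chain("What's the title of this paper?")
print(prompt_text)
print(answer)
```

The chain syntax expresses the same three steps declaratively, which makes it easier to swap in a different retriever, prompt, or model later.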