## PROJECT SETUP

Imports:
langchain for integrating the LLM into the app, with langchain-openai for the OpenAI-specific integration
chromadb as the vector database for storing and querying data
pypdf for parsing and reading PDFs in Python
pandas for data manipulation and analysis
streamlit for the app UI
python-dotenv for managing environment variables

In [4]:
!pip3 install --upgrade --quiet langchain-community langchain-openai chromadb
!pip3 install --upgrade --quiet pypdf pandas streamlit python-dotenv

The OpenAI API key is stored in the .env file.

In [None]:
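# Import the LangChain modules for loading, splitting, embedding, and storing documents
# (the loaders and vector stores live in langchain_community, the splitters in langchain_text_splitters)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma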
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

# Other modules and packages that are needed
import os
import tempfile
import streamlit as st  
import pandas as pd
from dotenv import load_dotenv

In [6]:
load_dotenv() # read all variables (including the API key) from the .env file

True

In [8]:
OPENAPI_API_KEY = os.environ.get('OPENAI_API_KEY') # read the API key loaded from .env into a notebook variable
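
A quick sanity check can catch a missing key before any API call is made. The following is a minimal sketch and not part of the original notebook: `require_env` is a hypothetical helper name, and it only assumes the key lives in an environment variable.

```python
import os

def require_env(name, env=None):
    """Return the value of environment variable `name`; raise if it is unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that the .env file exists and was loaded")
    return value
```

In the notebook this could replace the plain lookup, for example `OPENAPI_API_KEY = require_env("OPENAI_API_KEY")`, so a missing key fails loudly here instead of at the first LLM call.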

## DEFINING LLM

In [None]:
""" ChatOpenAI from langchain_openai is our LLM wrapper; we specify the model here (gpt-4o-mini is cheap and fast).
    Passing api_key is optional: ChatOpenAI reads OPENAI_API_KEY from the environment if it is not given."""
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAPI_API_KEY)
llm.invoke("if active respond with active") # send a prompt to the LLM, just like typing a message into ChatGPT

## PROCESSING THE PDF FILE

Loading the PDF file

In [None]:
pdf_loader = PyPDFLoader("./test data/testpaper.pdf") # point the loader at the PDF file in our project directory
pdf_pages = pdf_loader.load() # load the PDF, one Document per page
pdf_pages # display all the pages of the PDF

""" pdf_pages is a list of Document objects, each one representing a page of the PDF.
    Each Document's metadata holds the source file path, the page number, and similar details.
"""

- Problem: right now pdf_pages contains the whole PDF. As you might have guessed, we cannot feed a multi-page research paper into OpenAI's LLM in one go. Firstly, there is a token limit. Secondly, and more importantly, the LLM does not need every word in the PDF to answer a question, so we only want to feed the most relevant parts into the prompt. Passing too much or irrelevant information to the LLM gives bad results.
- Solution: split the PDF into smaller chunks, such as paragraphs or sentences. Each smaller chunk is more focused and contains less data, which makes the resulting prompt more accurate and more likely to get good results from the LLM.

In [None]:
# use RecursiveCharacterTextSplitter from LangChain to split the text into chunks
"""
Parameters:
chunk_size is the maximum length of each chunk (in characters here, because length_function is len),
chunk_overlap is the number of characters shared between consecutive chunks, so each chunk keeps some context from the previous one,
length_function is how the length of a chunk is measured; len counts characters,
separators are tried in order so that we avoid splitting in the middle of a sentence or word: first paragraph breaks, then new lines, then spaces
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200, length_function=len, separators=['\n\n', '\n', " "])
# run the text splitter on the test paper and store the chunks
pdf_chunks = text_splitter.split_documents(pdf_pages) # returns a list of chunks (Document objects)
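
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch in plain Python. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on the given separators); `sliding_window_chunks` is a hypothetical helper that cuts fixed-size character windows, purely to show how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into character windows of at most `chunk_size`,
    where each window repeats the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it ("hij", "opq", "vwx"). With the notebook's settings, the same idea applies at a larger scale: chunks of up to 1500 characters that can share up to 200 characters with their neighbour.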