## Ingesting PDF

In [1]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

In [2]:
local_path = "data.pdf"

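# Load a local PDF; fail early if no path is set so `data` is never left undefined
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    raise ValueError("Set local_path to a PDF file before running this cell")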

In [3]:
# Preview the loaded text (in the loader's default "single" mode, data[0] holds the whole PDF)
data[0].page_content



## Vector Embeddings

In [17]:
!ollama list

NAME                   	ID          	SIZE  	MODIFIED    
llama3:latest          	a6990ed6be41	4.7 GB	5 hours ago	
mistral:latest         	61e88e884507	4.1 GB	4 hours ago	
nomic-embed-text:latest	0a109f422b47	274 MB	3 hours ago	


In [4]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [5]:
# Split the loaded text into chunks of up to 7500 characters with a 100-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
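
As a rough illustration of what `chunk_size` and `chunk_overlap` control, the sketch below cuts a string into fixed-size windows that share characters with their neighbours. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraphs and newlines), and `sliding_chunks` is a hypothetical helper introduced only for this example.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; the real splitter breaks on separators instead.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one piece is repeated as the first character of the next. That repetition is what lets a sentence cut at a boundary still appear whole in at least one chunk.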

In [6]:
# Embedding model served by the local Ollama instance
embeddingfunc = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

In [7]:
# Embed the chunks and persist them to a local Chroma collection
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddingfunc,
    collection_name="local-rag",
    persist_directory="./chroma_db",
)

OllamaEmbeddings: 100%|██████████| 711/711 [28:05<00:00,  2.37s/it]


In [8]:
# Reopen the persisted collection from disk instead of re-embedding the chunks
db3 = Chroma(
    persist_directory="./chroma_db",
    collection_name="local-rag",
    embedding_function=embeddingfunc,
)

## Retrieval

In [9]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [10]:
# LLM from Ollama
local_model = "llama3"
llm = ChatOllama(model=local_model)

In [11]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

In [12]:
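# Retriever that asks the LLM to rephrase the question, then searches once per rephrasing
retriever = MultiQueryRetriever.from_llm(
    retriever=db3.as_retriever(),
    llm=llm,
    prompt=QUERY_PROMPT,
)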

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
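
The multi-query step can be pictured as: ask the LLM for several rephrasings of the question, run each one through the base retriever, and keep the unique union of the results. The sketch below mimics that flow with plain functions. `multi_query_retrieve`, `fake_rephrase`, and `fake_search` are hypothetical stand-ins, not LangChain APIs, and the real retriever works on `Document` objects rather than strings.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    rephrase: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
) -> list[str]:
    """Search once per rephrased question and return the unique union of hits."""
    seen = set()
    merged = []
    for query in rephrase(question):
        for hit in search(query):
            if hit not in seen:  # drop duplicates, keep first-seen order
                seen.add(hit)
                merged.append(hit)
    return merged


# Hypothetical stand-in for the LLM: returns two canned rephrasings.
def fake_rephrase(question: str) -> list[str]:
    return [question.lower(), question + " explained"]


# Hypothetical stand-in for the vector store: plain substring matching.
def fake_search(query: str) -> list[str]:
    corpus = {
        "chunk-a": "white blood cells",
        "chunk-b": "white blood cells explained in detail",
        "chunk-c": "patent applications",
    }
    return [name for name, text in corpus.items() if query.lower() in text]


hits = multi_query_retrieve("White blood cells", fake_rephrase, fake_search)
print(hits)
```

Every rephrasing triggers its own search, which is consistent with the run of single-item `OllamaEmbeddings` progress bars printed when the chain is invoked.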

In [13]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
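
Reading the chain left to right: the dict step sends the input question to the retriever (filling `context`) and also passes it through unchanged (filling `question`), the prompt formats both into one message, the model answers, and the parser returns plain text. Below is a plain-Python sketch of that data flow, assuming hypothetical stand-ins (`fake_retriever`, `fake_llm`, `run_chain`) in place of the real components and joining the context chunks as simple strings.

```python
TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    # Stand-in for the retriever: always returns the same two chunks.
    return ["Monocytes ingest invading organisms.", "Lymphocytes serve the immune system."]


def fake_llm(prompt_text: str) -> dict:
    # Stand-in for the chat model: wraps a canned reply in a message-like dict.
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)"}


def run_chain(question: str) -> str:
    # 1. Dict step: the same input feeds both keys.
    inputs = {"context": fake_retriever(question), "question": question}
    # 2. Prompt step: fill the template with the context and the question.
    prompt_text = TEMPLATE.format(
        context="\n".join(inputs["context"]), question=inputs["question"]
    )
    # 3. Model step, then 4. parser step: keep only the text of the reply.
    return fake_llm(prompt_text)["content"]


answer = run_chain("What do monocytes do?")
print(answer)
```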

In [41]:
# Read a question interactively and run it through the chain
chain.invoke(input("Question: "))

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.29s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.06s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.16s/it]


'Based on the provided context, this document appears to be discussing the topic of "Global Cooperation" or "International Cooperation", specifically focusing on Pillar 2: Innovation and Technology. The text mentions trends in innovation, technology, and global cooperation, including data flows, IT services trade, cross-border patent applications, and international student flows. It also touches on the impact of the COVID-19 pandemic on these trends.'

In [14]:
chain.invoke("Tell me the organization of the cell")

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.25s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.04s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.08s/it]
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.15s/it]


'Based on the provided context, it appears that the chapter is discussing the topic "Resistance of the Body to Infection: I. Leukocytes, Granulocytes, the Monocyte-Macrophage System, and Inflammation".\n\nWithin this chapter, there is no specific information about the organization of a cell. However, if you are referring to the types of white blood cells (leukocytes), it mentions that:\n\n* Six types of white blood cells are normally present in the blood: polymorphonuclear neutrophils, polymorphonuclear eosinophils, polymorphonuclear basophils, monocytes, lymphocytes, and occasionally plasma cells.\n* The granulocytes and monocytes protect the body against invading organisms mainly by ingesting them (i.e., by phagocytosis).\n* Lymphocytes and plasma cells function mainly in connection with the immune system.\n\nPlease let me know if you have any further questions or if there\'s anything else I can help you with!'

## Deleting the Vector Database

In [65]:
# Drop the "local-rag" collection and the embeddings stored in it
vector_db.delete_collection()