# Ollama PDF RAG Notebook

## Import libraries


In [1]:
# Set the protobuf implementation before importing libraries that may load protobuf
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Suppress warnings (including deprecation warnings raised at import time)
import warnings
warnings.filterwarnings('ignore')

# LangChain imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Jupyter-specific imports
from IPython.display import display, Markdown

## Load PDF

In [2]:
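# Load PDF
from pathlib import Path

local_path = "bib.pdf"
if Path(local_path).is_file():
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
    print(f"PDF loaded successfully: {local_path}")
else:
    # Fail early so later cells do not run with `data` undefined
    raise FileNotFoundError(f"PDF not found: {local_path}. Upload a PDF file first.")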

PDF loaded successfully: bib.pdf


## Split text into chunks

In [3]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 30 chunks
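
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines before falling back to smaller units, whereas this hypothetical `sliding_window_chunks` helper cuts at fixed character offsets. The tiny sizes are chosen only for readability.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each window repeats the last four characters of the previous one, which is the same idea as the 200-character overlap used above: text near a chunk boundary appears in both neighbouring chunks.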


## Create vector database

In [4]:
# Run this cell before the import cell on a fresh environment
!pip install --quiet chromadb
!pip install --quiet langchain-text-splitters

In [5]:
# Create vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
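
As a rough illustration of what a similarity search over this database does, the sketch below ranks a few hand-written vectors by cosine similarity to a query vector. The vectors, the chunk labels, and the `top_k` helper are made up for the example. Real embeddings from `nomic-embed-text` have many more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (text, embedding) pairs with invented 3-dimensional embeddings
toy_index = [
    ("chunk about melanoma", [0.9, 0.1, 0.0]),
    ("chunk about transformers", [0.1, 0.9, 0.1]),
    ("chunk about photography", [0.0, 0.2, 0.9]),
]
nearest = top_k([0.8, 0.2, 0.0], toy_index, k=2)
print(nearest)
```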


## Set up LLM and retrieval

In [6]:
# Set up LLM and retrieval
local_model = "llama3:8b"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

In [7]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    prompt=QUERY_PROMPT,
)
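
The idea behind multi-query retrieval can be sketched without any LangChain objects: generate several phrasings of the question, search with each one, and merge the results while dropping duplicates. The names below (`multi_query_retrieve`, `fake_variants`, `fake_search`) are hypothetical stand-ins for the LLM and the vector store, not part of the library, and the real retriever may differ in details such as ordering.

```python
def multi_query_retrieve(question, generate_variants, search):
    """Search once per question variant and return the de-duplicated union."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for doc in search(query):
            if doc not in seen:  # keep the first occurrence only
                seen.add(doc)
                merged.append(doc)
    return merged


# Stand-ins for the LLM (query rewriting) and the vector store (search)
fake_results = {
    "variant a": ["doc1", "doc2"],
    "variant b": ["doc2", "doc3"],
}


def fake_variants(question):
    # A real implementation would ask the LLM to rephrase `question`
    return ["variant a", "variant b"]


def fake_search(query):
    return fake_results.get(query, [])


merged_docs = multi_query_retrieve("What type of skin cancers?", fake_variants, fake_search)
print(merged_docs)
```

Because `doc2` is returned by both variants, it appears only once in the merged list.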

## Create chain

In [8]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [9]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
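
In the chain above, the retriever's list of documents is passed straight into the `{context}` slot, so the prompt likely receives the list's string form, metadata included. That may be why one of the answers further down mentions `Document(metadata=...)`. A common pattern is to join only the page text first. The sketch below shows this with a hypothetical `FakeDocument` stand-in so that it runs without LangChain; `format_docs` is an illustrative helper, not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Minimal stand-in exposing the same two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


sample_docs = [
    FakeDocument("First chunk.", {"source": "bib.pdf"}),
    FakeDocument("Second chunk.", {"source": "bib.pdf"}),
]
context_text = format_docs(sample_docs)
print(context_text)
```

Assuming the usual LCEL behaviour of wrapping plain functions, the helper could be plugged into the chain as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.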

## Chat with PDF

In [10]:
!ollama list

NAME                        ID              SIZE      MODIFIED     
mistral:latest              f974a74358d6    4.1 GB    7 hours ago     
llama3:8b                   365c0bd3c000    4.7 GB    8 hours ago     
llama3:latest               365c0bd3c000    4.7 GB    9 hours ago     
nomic-embed-text:latest     0a109f422b47    274 MB    14 hours ago    
gemma:latest                a72c7f4d0a15    5.0 GB    38 hours ago    
llama2-uncensored:latest    44040b922233    3.8 GB    40 hours ago    
codellama:7b-code           8df0a30bb1e6    3.8 GB    43 hours ago    


In [11]:
def chat_with_pdf(question):
    """
    Run the question through the RAG chain and render the answer as Markdown.
    """
    # display() returns None, so there is nothing useful to return here
    display(Markdown(chain.invoke(question)))

In [12]:
# Example 1
chat_with_pdf("What is the main idea of this document?")

Based only on the provided context, it appears that the main idea of this document is a collection of references related to skin cancer detection using various methods and techniques, including machine learning, deep learning, and three-dimensional total body photography (3D-TBP). The references cited include studies on comparing the efficacy of different approaches for detecting skin cancer, as well as reviews of existing research in the field.

In [13]:
# Example 2
chat_with_pdf("What type of skin cancers?")

Based on the provided context, the types of skin cancer mentioned are:

* Basal cell carcinoma (BCC)
* Squamous cell carcinoma (SCC)
* Invasive skin tumors (general term, not specific to a particular type)

These references are from documents [Document(metadata={'source': 'bib.pdf'}, page_content='...')], specifically document 3 and 6.

In [14]:
# Example 3
chat_with_pdf("Can ?")

Based only on the provided context, I can answer:

Can Xin et al. propose an improved vision transformer (VIT) network named SkinTrans for the classification of skin cancer?

Answer: YES

## Clean up (optional)

In [15]:
# Optional: clean up when done
vector_db.delete_collection()
print("Vector database deleted successfully")

Vector database deleted successfully
