# Ollama PDF RAG Notebook

## Import & Install Libraries


In [1]:
!pip install langchain_community langchain_ollama langchain_text_splitters langchain_core
%pip install -q unstructured langchain langchain-community
%pip install -q "unstructured[all-docs]" ipywidgets tqdm
%pip install pymupdf
!sudo apt-get update
!sudo apt-get install -y tesseract-ocr poppler-utils
!sudo apt-get install -y tesseract-ocr-hin tesseract-ocr-urd tesseract-ocr-ben tesseract-ocr-eng tesseract-ocr-mar tesseract-ocr-chi-sim
!pip install pytesseract pdf2image Pillow



In [2]:
!pip install langchain-classic



In [3]:
# Imports
from langchain_community.document_loaders import PyMuPDFLoader, OnlinePDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain_classic.retrievers.multi_query import MultiQueryRetriever

# from langchain_community.document_loaders import UnstructuredPDFLoader
from tqdm.autonotebook import tqdm as notebook_tqdm

import pytesseract
from pdf2image import convert_from_path
import os


from langchain_core.documents import Document

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Jupyter-specific imports
from IPython.display import display, Markdown

# Set environment variable for protobuf (os is already imported above)
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

## Combined code for scanned and non-scanned PDFs
##### Remember to set `language_code` and the path to your PDF
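
Tesseract's `lang` argument takes the code of an installed language pack, and several codes can be combined with `+` for mixed-language scans. A minimal sketch of building that argument, assuming the packs installed at the top of this notebook; `build_lang_arg` and `INSTALLED_LANGS` are hypothetical names, and note that the `tesseract-ocr-chi-sim` package corresponds to the code `chi_sim`:

```python
# Codes for the language packs installed earlier in this notebook (assumption).
INSTALLED_LANGS = {"hin", "urd", "ben", "eng", "mar", "chi_sim"}

def build_lang_arg(codes, installed=INSTALLED_LANGS):
    """Join Tesseract language codes with '+', rejecting codes that are not installed."""
    if not codes:
        raise ValueError("At least one language code is required")
    unknown = [code for code in codes if code not in installed]
    if unknown:
        raise ValueError(f"Language pack(s) not installed: {', '.join(unknown)}")
    return "+".join(codes)

print(build_lang_arg(["urd", "eng"]))
```

The resulting string could be passed as the `language` argument of `extract_text_from_pdf`.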

In [59]:
def extract_text_from_pdf(pdf_path, language='ben', min_word_threshold=100):
    """
    Extracts text from a PDF file using PyMuPDF first, then falls back to OCR if needed.

    This function first attempts direct text extraction. If the extracted text
    contains fewer words than the threshold, it switches to OCR extraction.

    Args:
        pdf_path (str): The file path to the PDF.
        language (str): The language code for Tesseract OCR (e.g., 'eng', 'hin').
        min_word_threshold (int): Minimum word count to consider direct extraction successful.

    Returns:
        tuple: (extracted_text, extraction_method) where extraction_method is 'direct' or 'ocr'
    """

    if not os.path.exists(pdf_path):
        return f"Error: The file '{pdf_path}' was not found.", "error"

    print(f"Processing PDF: {pdf_path}")

    # Step 1: Try direct text extraction using PyMuPDF
    try:
        loader = PyMuPDFLoader(pdf_path)
        data = loader.load()

        # Extract text content from all pages
        extracted_text = ""
        for document in data:
            extracted_text += document.page_content + "\n\n"

        # Count words in extracted text
        word_count = len(extracted_text.split())
        print(f"Direct extraction yielded {word_count} words")

        # If we have enough text, return it
        if word_count >= min_word_threshold:
            print("✅ Direct extraction successful - sufficient text found")
            return extracted_text.strip(), "direct"
        else:
            print(f"⚠️ Direct extraction yielded only {word_count} words (< {min_word_threshold})")
            print("🔄 Switching to OCR extraction...")

    except Exception as e:
        print(f"❌ Direct extraction failed: {e}")
        print("🔄 Switching to OCR extraction...")

    # Step 2: Fall back to OCR extraction
    return extract_text_with_ocr(pdf_path, language)

def extract_text_with_ocr(pdf_path, language='ben'):
    """
    Extracts text from a PDF file using Tesseract OCR.

    Args:
        pdf_path (str): The file path to the PDF.
        language (str): The language code for Tesseract OCR.

    Returns:
        tuple: (extracted_text, extraction_method)
    """
    try:
        # Convert PDF pages to high-resolution images
        print("📄 Converting PDF pages to images...")
        images = convert_from_path(pdf_path, dpi=300)
    except Exception as e:
        return f"Error converting PDF to images: {e}", "error"

    full_text = ""
    print(f"🔍 Processing {len(images)} page(s) with OCR (language: '{language}')...")

    # Process each page with OCR
    for i, image in enumerate(images):
        try:
            print(f"Processing page {i + 1}/{len(images)}...", end=" ")
            page_text = pytesseract.image_to_string(image, lang=language)
            full_text += f"--- Page {i + 1} ---\n{page_text}\n\n"
            print("✅")
        except pytesseract.TesseractNotFoundError:
            return ("Tesseract Error: The Tesseract executable was not found. "
                   "Please ensure Tesseract is installed correctly and in your system's PATH."), "error"
        except Exception as e:
            print(f"⚠️ Warning: Could not process page {i + 1}: {e}")

    word_count = len(full_text.split())
    print(f"✅ OCR extraction completed - extracted {word_count} words")
    return full_text.strip(), "ocr"

# Example usage
if __name__ == "__main__":
    # Configuration
    pdf_file_path = "/content/AP Ramjan.pdf"  # Change this to your PDF path
    language_code = "ben"  # Language for OCR
    word_threshold = 100   # Minimum words for direct extraction to be considered successful

    # Extract text using hybrid approach
    extracted_text, method = extract_text_from_pdf(
        pdf_file_path,
        language=language_code,
        min_word_threshold=word_threshold
    )

    # Display results
    if method == "error":
        print(f"\n❌ {extracted_text}")
    else:
        print(f"\n📋 Extraction Method Used: {method.upper()}")
        print(f"📊 Total Characters: {len(extracted_text)}")
        print(f"📊 Total Words: {len(extracted_text.split())}")
        print("\n" + "="*50)
        print("EXTRACTED TEXT:")
        print("="*50)
        print(extracted_text[:2000] + "..." if len(extracted_text) > 2000 else extracted_text)


Processing PDF: /content/AP Ramjan.pdf
Direct extraction yielded 0 words
⚠️ Direct extraction yielded only 0 words (< 100)
🔄 Switching to OCR extraction...
📄 Converting PDF pages to images...

❌ Error converting PDF to images: Unable to get page count.
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
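
The poppler messages above (`Couldn't find trailer dictionary`, `Couldn't read xref table`) usually indicate a file that is truncated, corrupted, or not a PDF at all, so OCR never gets a chance to run. A minimal sketch of a pre-flight check, assuming a well-formed PDF carries a `%PDF-` header near its start and an `%%EOF` marker near its end; `looks_like_pdf` is a hypothetical helper, not part of any library used here:

```python
import os

def looks_like_pdf(path, window=1024):
    """Cheap structural check: '%PDF-' near the start and '%%EOF' near the end.

    This is a heuristic, not a parser; a file can pass and still be damaged.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as fh:
        header = fh.read(window)           # first bytes of the file
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - window))     # last bytes of the file
        tail = fh.read()
    return b"%PDF-" in header and b"%%EOF" in tail
```

Calling `looks_like_pdf(pdf_file_path)` before `extract_text_from_pdf` would let the notebook report a bad upload early instead of failing inside `convert_from_path`.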



## Split text into chunks

In [43]:
# Split text into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(extracted_text)
print(f"Text split into {len(chunks)} chunks")

Text split into 7 chunks
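
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, so its chunks will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap      # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why a larger overlap produces more chunks for the same text.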


In [44]:
print(chunks[0])

--- Page 1 ---
 

4 :ا 2024 ۸۷۰۷۰۵۸۱۔انا-نا1۵ا 12

 

آزراؤککومت ریاست جوں وخیر
زاین نز ایینٹرکین ڈیپرنمنٹف
(غیہتزل)
طز او از جو جا
۱ '' طزپز
فرب رمراجزل۔٭(١نح /0٥0۱د' مور 1 1 تب 2024ء
رگ
ب(رتع:۔-
رر الج ینگھڑیل(عزل),
جنابکائ مور ڑآ ف رو یو
جناب ایل یف مرٹری(ت قات
لی مڑی سا حبانعکومت, آ زاررستریا ست جموں و شی

جناب اس جزل پ٠ مظف رآ پار
ص براپا ن ضل یمک جات یچھول خودختارادار: جات ویش اشن زء
گغزےا بط امم 72 اوج ڈو تڑلئء

عنوان: گرا عیدمیلا دا كت در الاو 1446 جرک بطابق 17 بر2024ء

ملاس میگ ! :
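
The chunk above starts with the `--- Page 1 ---` marker that `extract_text_with_ocr` writes between pages, so those markers end up in the embedded text. If that is unwanted, one option is to strip them before splitting. A minimal sketch using only the standard library; `strip_page_markers` is a hypothetical helper:

```python
import re

# Matches a whole '--- Page N ---' line, including its trailing newline.
PAGE_MARKER = re.compile(r"^--- Page \d+ ---[ \t]*\n?", flags=re.MULTILINE)

def strip_page_markers(text):
    """Remove the '--- Page N ---' lines added by the OCR path."""
    return PAGE_MARKER.sub("", text)

sample = "--- Page 1 ---\nfirst page\n\n--- Page 2 ---\nsecond page"
print(strip_page_markers(sample))
```

Text from the direct PyMuPDF path carries no such markers, so the function leaves it unchanged.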


## Create vector database

In [7]:
!pip install chromadb



In [8]:
!sudo apt update
!sudo apt install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh


In [12]:
# Pull the nomic-embed-text model from Ollama if you don't have it
!ollama pull nomic-embed-text
# List models to confirm it's available
!ollama list
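
`ollama pull` talks to a running Ollama server, and on hosted notebooks the install script may not leave one running in the background. A minimal sketch for checking reachability, assuming the server listens on its default address `http://127.0.0.1:11434`; `ollama_is_running` is a hypothetical helper:

```python
import urllib.request

def ollama_is_running(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, refused connections, and socket timeouts all derive from OSError.
        return False

print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, starting the server first (for example with `ollama serve` in a separate terminal or background process) is the usual remedy before pulling models.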


In [46]:
# Create vector database
vector_db = Chroma.from_texts(
    texts=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
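
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The toy below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity as one common closeness measure; it is not how Chroma is implemented, and the vectors are made up rather than produced by `nomic-embed-text`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec.

    `store` is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up embeddings, for illustration only.
toy_store = [
    ("chunk about holidays", [0.9, 0.1, 0.0]),
    ("chunk about budgets", [0.0, 0.2, 0.9]),
    ("chunk about office hours", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], toy_store, k=2))
```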


## Set up LLM and Retrieval

In [47]:
# Set up LLM and retrieval
local_model = "llama3.2"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

In [15]:
!ollama pull llama3.2  # Pull the model before sending any prompt



In [48]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)
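
In outline, a multi-query retriever asks the LLM for several rephrasings of the question, runs a search for each, and returns the de-duplicated union of the hits. The sketch below mimics that flow with stand-in functions; `fake_llm` and `fake_search` are hypothetical placeholders for the model and the vector store, not LangChain APIs.

```python
def fake_llm(question):
    """Stand-in for the LLM: returns alternative questions separated by newlines."""
    return f"What does the text say regarding: {question}\nSummarize: {question}"

def fake_search(query):
    """Stand-in for a vector search: maps a query's first word to chunk ids."""
    index = {"What": ["chunk-1", "chunk-2"], "Summarize": ["chunk-2", "chunk-3"]}
    first_word = query.split()[0].rstrip(":")
    return index.get(first_word, [])

def multi_query_retrieve(question, generate=fake_llm, search=fake_search):
    """Search once per generated query and merge the results without duplicates."""
    # One query per non-empty line of the generated text.
    queries = [line.strip() for line in generate(question).splitlines() if line.strip()]
    seen, merged = set(), []
    for query in queries:
        for doc in search(query):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged

print(multi_query_retrieve("What is this document about?"))
```

This is why `QUERY_PROMPT` asks for the alternatives "separated by newlines": each line becomes one search query.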

## Create chain

In [49]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [50]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
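
In this chain the retriever's output, a list of document objects, is substituted into `{context}` as-is, so the prompt receives the list's string representation rather than clean text. A common variation joins only the page text first. A minimal sketch; `format_docs` is a hypothetical helper and `Doc` is a stand-in for LangChain's `Document` so the snippet runs on its own:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

docs = [Doc("first chunk"), Doc("second chunk", {"page": 2})]
print(format_docs(docs))
```

In the chain it would likely slot in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since a plain function piped after a runnable is wrapped automatically.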

## Chat with PDF

In [51]:
def chat_with_pdf(question):
    """
    Chat with the PDF using the RAG chain.
    """
    return display(Markdown(chain.invoke(question)))

In [52]:
# Example 1
chat_with_pdf("What is this document about?")

This document appears to be a list of names and titles of individuals, likely related to the government or administration of Pakistan. It includes titles such as "Prime Minister", "President", "Minister", "Governor", and others, along with names.

The document also mentions "آزراؤککومت ریاست جوں وخیر" which translates to "State Council" or "State Government", and "سری ترکی صاحب" which means "Siri Trki Sahib" (a title for the Chief Minister of a state).

It is likely that this document is a list of key officials in Pakistan, possibly related to the government's response to a crisis or emergency situation. However, without more context, it is difficult to determine the exact nature and purpose of the document.

In [53]:
# Example 2
chat_with_pdf("What are some of the names mentioned?")

The names mentioned in the document are:

1. ہبی ا مور (HBI MUR)
2. ناز نز ایینٹرکین ڈیپرنمنٹ ف (Naze Naz Einterklein Department)
3. جنابکائ مر علا نڑا ل وار (Janab Kaimoor Allana Wara)
4. جناب ایل یف مرٹری (Janab Lafat Miri)
5. جناب اس جزل پو (Janab As Jal Po)
6. صدر سماء تقلہ (Sadr Samia Talah)
7. معظم قamamاعضلاع (Maazan Qama' Alaa)
8. دز یام اور ڈیپمشنزاضلارء (Diyam Aor Dismensionz Addalarr)
9. جناب شاہد خان (Janab Shahid Khan)
10. ملس میگ (Malas Meg)
11. رانا کوٹ (Rana Kot)

In [None]:
# Example 3
# chat_with_pdf("What are the various ways in which a person can scam me?")

## Clean up (optional)

In [57]:
# Optional: clean up when done.
# Always run this before loading a new PDF to ask questions about it.

# If the previous vector database is not deleted, the embeddings of the new PDF's text
# are appended to the embeddings of the old PDF.

vector_db.delete_collection()
print("Vector database deleted successfully")

NotFoundError: Collection [local-rag] does not exist