# Solution Architecture




```
┌─────────────────────────────────────────────────────────────────┐
│                         PDF Ingestion                           │
│─────────────────────────────────────────────────────────────────│
│ • Scan /content/data for .pdf files                             │
│ • Load each PDF with pdfplumber                                 │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                    Multimodal Text Extraction                   │
│─────────────────────────────────────────────────────────────────│
│ For each PDF page:                                              │
│  1. Extract “plain” text                                        │
│  2. Extract tables → pandas                                     │
│  3. Render page to image + EasyOCR for scans                    │
│ ⇒ Produce one raw‐text blob per PDF                            │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                       Document Chunking                         │
│─────────────────────────────────────────────────────────────────│
│ • Wrap each blob in a LangChain Document                        │
│ • Split with RecursiveCharacterTextSplitter                     │
│   (chunk_size=500, chunk_overlap=50)                            │
│ ⇒ Yields ~N small “chunks” of text                             │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                        Vector Indexing                          │
│─────────────────────────────────────────────────────────────────│
│ • Embed each chunk with                                         │
│   SentenceTransformerEmbeddings(all‑MiniLM)                     │
│ • Build / load FAISS index                                      │
│   – Save to “shipment_index” for persistence                    │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                         RAG Retrieval                           │
│─────────────────────────────────────────────────────────────────│
│ answer_with_rag(query):                                         │
│  1. FAISS.similarity_search(query, k=1000)                      │
│  2. Take top‑K chunks (metadata.source + text)                  │
│  3. Concatenate into a “context” string                         │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                        LLM Reasoning                            │
│─────────────────────────────────────────────────────────────────│
│ • SystemMessage: “You are a shipping‑document analyst.”         │
│ • HumanMessage:                                                 │
│     – Paste context                                             │
│     – “Group by consignor, document type…                       │
│        respond exactly in this format:                          │
│        I found <N> shipments FOR <X> …”                         │
│ • Call Gemini via ChatGoogleGenerativeAI                        │
│ • Parse AIMessage.content for final output                      │
└─────────────────────────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────────────────────────┐
│                         End User                                │
│─────────────────────────────────────────────────────────────────│
│ “How many shipments are currently tracked?”                     │
│ ⇒ Prints a clean, structured answer:                           │
│    I found 17 shipments FOR …                                   │
│       Shipment1: Reference: …                                   │
│                   …etc.                                         │
└─────────────────────────────────────────────────────────────────┘


```
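
The "RAG Retrieval" box concatenates every retrieved chunk into one context string. With a large `k`, that string can grow quickly, so one option is to cap it by a character budget. The sketch below is only an illustration under assumptions: `build_context` and `max_chars` are hypothetical names that do not appear in the notebook, and chunks are represented as plain `(source, text)` pairs rather than LangChain documents.

```python
def build_context(chunks: list[tuple[str, str]], max_chars: int = 20000) -> str:
    """Join ranked (source, text) pairs until the character budget is used up.

    Hypothetical helper: the notebook itself concatenates all retrieved chunks.
    """
    parts: list[str] = []
    used = 0
    for source, text in chunks:
        entry = f"Source: {source}\n{text}"
        separator = 2 if parts else 0  # length of the "\n\n" joiner
        if used + separator + len(entry) > max_chars:
            break  # stop at the first chunk that would exceed the budget
        parts.append(entry)
        used += separator + len(entry)
    return "\n\n".join(parts)
```

Because the chunks arrive in ranked order, stopping at the first overflow keeps the most similar chunks and drops the tail.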


# How to run this Colab

1. Create a folder named data (it will be located at /content/data).
2. Upload the shipping PDFs into it.
3. Set the runtime to T4 GPU with high RAM and connect.
4. Store the Hugging Face token as a Google Colab Secret; preferably store the Gemini API key there too.
5. Run the cells below.
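
Before running the pipeline, it can help to confirm that the folder exists and actually contains PDFs. The following is a minimal sketch of such a check; `list_pdfs` is a hypothetical helper that is not part of the notebook, and the default path simply mirrors the location named in the steps above.

```python
from pathlib import Path


def list_pdfs(directory: str = "/content/data") -> list[str]:
    """Return the sorted names of .pdf files in `directory` (case-insensitive)."""
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(
            f"{directory} does not exist; create it and upload the PDFs first"
        )
    # Match on the suffix so that files such as "Invoice.PDF" are included.
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdfs()` in a Colab cell either returns the file names that will be ingested or fails early with a message pointing at the missing folder.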

# Install required libraries

In [None]:
!pip install --quiet \
    google-ai-generativelanguage==0.6.15 \
    google-generativeai==0.8.5 \
    pdfplumber \
    easyocr \
    pillow \
    pandas \
    langchain \
    sentence-transformers \
    faiss-cpu \
    langchain-google-genai \
    torch \
    langchain-community



# Import required libraries

In [None]:
import os
import re
import pdfplumber
import easyocr
from PIL import Image
from collections import defaultdict
import pandas as pd

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import SystemMessage, HumanMessage

# Config

In [None]:
PDF_DIRECTORY    = "/content/data/"
VECTORSTORE_PATH = "shipment_index"
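# Read the Gemini API key from Colab Secrets instead of hardcoding it in the notebook.
# The secret name "GOOGLE_API_KEY" is an assumption; use the name you stored the key under.
from google.colab import userdata
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")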

# OCR - EasyOCR

In [None]:
# --- Initialize EasyOCR ---
reader = easyocr.Reader(['en'], gpu=True)

# --- Extract Text + Tables + OCR from PDFs ---
def extract_text_from_pdfs_in_directory(directory_path):
    all_raw, names = [], []
    pdfs = [f for f in os.listdir(directory_path) if f.lower().endswith('.pdf')]
    if not pdfs:
        print("⚠️ No PDFs found.")
        return [], []
    for fn in pdfs:
        pages = []
        try:
            with pdfplumber.open(os.path.join(directory_path, fn)) as pdf:
                for p in pdf.pages:
                    # text
                    txt = p.extract_text() or ""
                    pages.append(txt)
                    # tables
                    for table in p.extract_tables():
                        if table:
                            df = pd.DataFrame(table[1:], columns=table[0]) if table[0] else pd.DataFrame(table)
                            pages.append(df.to_csv(index=False))
                    # OCR
                    img = p.to_image(resolution=300).original
                    tmp = f"/tmp/{fn}-{p.page_number}.png"
                    img.save(tmp)
                    ocr = reader.readtext(tmp, detail=0)
                    os.remove(tmp)
                    if ocr:
                        pages.append(" ".join(ocr))
            all_raw.append("\n".join(pages))
            names.append(fn)
        except Exception as e:
            print(f"Error processing {fn}: {e}")
    return all_raw, names
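
The function above runs EasyOCR on every page, whereas the architecture diagram reserves OCR "for scans". One possible way to narrow it down is to OCR only pages whose embedded text layer is nearly empty. This is a sketch of that heuristic, not the notebook's behavior: `needs_ocr`, `pages_to_ocr`, and the `min_chars` threshold are hypothetical and would need tuning on real documents.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned when its text layer is (nearly) empty."""
    return len(extracted_text.strip()) < min_chars


def pages_to_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return the 1-based page numbers whose extracted text looks empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needs_ocr(text, min_chars)
    ]
```

Inside the page loop, this would amount to wrapping the OCR step in `if needs_ocr(txt):`. The trade-off is that pages mixing a text layer with scanned stamps or handwriting would no longer be OCRed.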




# FAISS db

In [23]:
# load & chunk
texts, files = extract_text_from_pdfs_in_directory(PDF_DIRECTORY)
docs = [Document(page_content=txt, metadata={"source":fn}) for txt, fn in zip(texts, files)]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"About to index {len(chunks)} chunks…")

# embed & save/load
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
if os.path.exists(VECTORSTORE_PATH):
    vectorstore = FAISS.load_local(VECTORSTORE_PATH, embeddings, allow_dangerous_deserialization=True)
else:
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local(VECTORSTORE_PATH)



About to index 276 chunks…
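
To get a feel for what `chunk_size=500` and `chunk_overlap=50` mean, the sketch below uses a plain fixed-width sliding window: each new chunk starts 450 characters after the previous one and repeats its last 50 characters. This is a deliberately simplified stand-in, not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split `text` into fixed-width windows that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # 450 with the notebook's settings
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Under this simplified model, a 1,400-character blob yields three chunks starting at offsets 0, 450, and 900. The real splitter usually produces somewhat different boundaries because it respects separators.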


# LLM & RAG

In [None]:
# Init Gemini LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0, api_key=os.environ["GOOGLE_API_KEY"])

# RAG helper
def answer_with_rag(query: str, top_k: int = 1000) -> str:
    # retrieve
    docs = vectorstore.similarity_search(query, k=top_k)
    if not docs:
        return "Sorry, I found no relevant shipment data."

    # build context snippet
    context = "\n\n".join(
        f"Source: {d.metadata['source']}\n{d.page_content[:1000].replace(chr(10), ' ')}…"
        for d in docs
    )

    # prompt instructions + context + query
    system = SystemMessage(content="You are a shipping‑document analyst.")
    human = HumanMessage(content=(
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {query}\n\n"
        "Group by CONSIGNOR, then by document type (PreAlert, NOA, POD). "
        "If a value is not present in any source, write N/A. "
        "Respond **exactly** in this format (no extra words):\n\n"
        "I found <NUMBER_OF_SHIPMENTS> shipments FOR <CONSIGNOR> (from DocumentType)\n"
        "Shipment1:Reference: <Ocean Bill of lading>\n"
        "            Estimate Departing: <ETD>\n"
        "            Estimate Arriving: <ETA>\n"
        "            Actual Departing: <ATD>\n"
        "            Actual Arriving: <ATA>\n"
        "            Container#: <CONTAINER>\n"
        "            Delivered: <JobDate> at <Time Delivered>"
    ))

    # call LLM
    ai_message = llm.invoke([system, human])
    return ai_message.content
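
Because the prompt asks for a fixed response layout, the returned string can be turned back into structured records. The parser below is a rough sketch that assumes the model follows the requested format closely; `parse_answer` and the dictionary keys are hypothetical, and lines that do not match the layout are silently skipped.

```python
import re

# Patterns mirror the layout requested in the prompt.
HEADER = re.compile(r"^I found (\d+) shipments? FOR (.+?) \(from (.+?)\)\s*$")
SHIPMENT = re.compile(r"^Shipment(\d+):\s*Reference:\s*(.*)$")
FIELD = re.compile(r"^\s+([^:]+):\s*(.*)$")


def parse_answer(answer: str) -> list[dict]:
    """Parse the formatted answer into one dict per consignor/document-type group."""
    groups: list[dict] = []
    current = None  # the shipment currently being filled in
    for line in answer.splitlines():
        if m := HEADER.match(line):
            groups.append({
                "count": int(m.group(1)),
                "consignor": m.group(2),
                "doc_type": m.group(3),
                "shipments": [],
            })
            current = None
        elif (m := SHIPMENT.match(line)) and groups:
            current = {"Reference": m.group(2).strip()}
            groups[-1]["shipments"].append(current)
        elif (m := FIELD.match(line)) and current is not None:
            # Split on the first colon only, so times like 21:00 stay intact.
            current[m.group(1).strip()] = m.group(2).strip()
    return groups
```

With the parsed groups in hand, comparing each group's `count` against `len(group["shipments"])` is one simple way to flag answers whose stated total and listed shipments disagree.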

# Query

In [None]:
question = "How many shipments are currently tracked?"
print(answer_with_rag(question))

I found 10 shipments FOR CDE INTERNATIONAL (from PreAlert)
Shipment1:Reference: 08681143591
            Estimate Departing: 04-May-24 21:00
            Estimate Arriving: 11-May-24 11:00
            Actual Departing: N/A
            Actual Arriving: N/A
            Container#: N/A
            Delivered: N/A at N/A
Shipment2:Reference: HDMUSHAZ03315800
            Estimate Departing: 16-Mar-24
            Estimate Arriving: 05-Apr-24
            Actual Departing: N/A
            Actual Arriving: N/A
            Container#: HDMU5537247
            Delivered: N/A at N/A
I found 2 shipments FOR SOUTH PACIFIC LOGISTICS CO.,LTD (from PreAlert)
Shipment1:Reference: COSU6409766200
            Estimate Departing: 19-Feb-25
            Estimate Arriving: 06-Mar-25
            Actual Departing: N/A
            Actual Arriving: N/A
            Container#: TTNU8531064
            Delivered: N/A at N/A
Shipment2:Reference: AMG0150149
            Estimate Departing: 25-May-25
            Estimate Arr