# Solution Architecture


```
[PDF Files Input] → Google Colab Environment
            ↓
┌───────────────────────────────────────────────┐
│ PDF Processing Pipeline                       │
│ ├─ Extract Text (pdfplumber)                  │
│ ├─ Extract Tables (pdfplumber + pandas)       │
│ └─ OCR Images (EasyOCR + PIL)                 │
└───────────────────────────┬───────────────────┘
                            ↓
┌───────────────────────────────────────────────┐
│ Extraction & Consolidation Logic              │
│ ├─ Identify Document Type (regex patterns)    │
│ ├─ Extract Consignor Information (regex)      │
│ ├─ Extract References & Containers (regex)    │
│ └─ Extract Dates & Times (regex)              │
└───────────────────────────┬───────────────────┘
                            ↓
┌───────────────────────────────────────────────┐
│ Shipment Consolidation                        │
│ └─ Organize into Structured Shipment Data     │
└───────────────────────────┬───────────────────┘
                            ↓
┌───────────────────────────────────────────────┐
│ Query & Display Interface                     │
│ └─ Interactive Query:                         │
│    "How many shipments are currently tracked?"│
└───────────────────────────────────────────────┘
```




```
[ Google Colab Notebook ]
┌─────────────────────────────────────────────────────────────────┐
│ [ PDF Documents Directory: /content/data/ ]                     │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ PDF Document Processing                                         │
│   ├── pdfplumber (extract text & tables from PDF pages)         │
│   ├── EasyOCR (text extraction from images of PDF pages)        │
│   └── PIL (generate and handle images from PDF pages)           │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Consolidated Document Text                                      │
│   └── Raw text from PDF pages, OCR text                         │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Information Extraction (Regex-based)                            │
│   ├── Consignor                                                 │
│   ├── Ocean Bill of Lading (Reference number)                   │
│   ├── Container number                                          │
│   ├── Dates (ETD, ETA, ATD, ATA)                                │
│   └── Delivered (JobDate and Time Delivered)                    │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Shipments Data Consolidation                                    │
│   ├── defaultdict-based grouping by Consignor                   │
│   └── Classified by document types (PreAlert, NOA, POD)         │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ RAG Integration (Vector DB & Retrieval)                         │
│   ├── LangChain Document objects                                │
│   ├── RecursiveCharacterTextSplitter (Text chunking)            │
│   ├── SentenceTransformerEmbeddings (Embeddings generation)     │
│   └── FAISS Vectorstore (Semantic search & indexing)            │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Query Handling                                                  │
│   ├── Natural Language Queries                                  │
│   └── Semantic Retrieval (LangChain RAG)                        │
└─────────────┬───────────────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Results Display                                                 │
│   └── Console/Text-based summary of extracted shipment details  │
│       per consignor and shipment type                           │
└─────────────────────────────────────────────────────────────────┘

```
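
To make the RAG stage in the diagram concrete, here is a dependency-free sketch of the chunk, embed, index, retrieve flow. It deliberately swaps in a bag-of-words cosine similarity for the SentenceTransformer embeddings and the FAISS index, so it illustrates only the shape of the pipeline. `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical names rather than LangChain APIs, and the sample chunks are made up.

```python
import math
import re
from collections import Counter

def chunk_text(text, chunk_size=80, overlap=20):
    """Split text into overlapping character windows (stand-in for a text splitter)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy 'embedding': lowercase token counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (stand-in for a vector store search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Made-up sample chunks, one per document type.
chunks = [
    "PreAlert: ETD 01/02/2024, ETA 15/02/2024",
    "POD: Delivered 16/02/2024 at 10:30",
    "NOA: container arriving at terminal",
]
top = retrieve("when was it delivered", chunks, k=1)
print(top[0])
```

With real embeddings the match would not depend on exact word overlap, which is the main reason to use a sentence-embedding model rather than token counts.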


# How to run this Colab

1. Create a folder named `data` (its path will be `/content/data`).
2. Upload the shipping PDFs into it.
3. Set the runtime to a T4 GPU with high RAM, then connect.
4. Run the cells below.
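
Steps 1 and 2 can also be checked from code before running the pipeline. The sketch below is one way to do that; `list_pdfs` is a hypothetical helper and the file names are made up. The demo uses a temporary directory so it runs anywhere.

```python
import os
import tempfile

def list_pdfs(data_dir):
    """Create data_dir if it is missing and return the PDF filenames inside it, sorted."""
    os.makedirs(data_dir, exist_ok=True)
    return sorted(f for f in os.listdir(data_dir) if f.lower().endswith(".pdf"))

# Demo in a temporary directory; in Colab the call would be list_pdfs("/content/data").
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    before = list_pdfs(data_dir)  # folder is created, nothing uploaded yet
    open(os.path.join(data_dir, "PreAlert_001.PDF"), "wb").close()
    open(os.path.join(data_dir, "notes.txt"), "wb").close()
    after = list_pdfs(data_dir)   # only the PDF is reported

print(before, after)
```

An empty list after the upload step means the files landed somewhere other than the expected folder.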

# Install required libraries

In [None]:
!pip install --quiet \
  pdfplumber \
  easyocr \
  pillow \
  pandas \
  langchain \
  sentence-transformers \
  faiss-cpu \
  torch \
  langchain-community



# Import required libraries

In [None]:
import os
import re
import pdfplumber
import easyocr
from PIL import Image
from collections import defaultdict
import pandas as pd
from langchain_core.documents import Document  # Document class lives in langchain_core


# Config

In [None]:
PDF_DIRECTORY = "/content/data/"

# Data preprocessing

In [None]:

def extract_doc_type(filename):
    n = filename.lower().replace('_','').replace('-','')
    if "prealert" in n: return "PreAlert"
    if "noa" in n: return "NOA"
    if "pod" in n: return "POD"
    return "Other"

def extract_reference_and_container(text):
    reference = None
    container = None
    ocean = re.search(r'Ocean Bill of Lading[:\s]*([A-Z0-9\-]*\d+[A-Z0-9\-]*)', text, re.IGNORECASE)
    if ocean:
        reference = ocean.group(1).strip()
    else:
        generic = re.search(
            r'(?:Reference|B/L[- ]?NO|BILL OF LADING)[:\.\s]*([A-Z0-9]*\d+[A-Z0-9\-]*)',
            text, re.IGNORECASE
        )
        if generic:
            reference = generic.group(1).strip()
    cont = re.search(r'Container[#:\s]*([A-Z0-9]*\d+[A-Z0-9]*)', text, re.IGNORECASE)
    if cont:
        container = cont.group(1).strip()
    return reference, container

def extract_consignor(text):
    lines = text.splitlines()
    for idx, line in enumerate(lines):
        if re.match(r'^\s*PICKUP\b', line, re.IGNORECASE):
            for j in range(idx+1, len(lines)):
                nxt = lines[j].strip()
                if nxt:
                    return nxt
    m = re.search(r'(?:Consignor|Shipper|Client Name)\s*[:\-]\s*([A-Za-z0-9 &,.()]+)', text, re.IGNORECASE)
    if m:
        return m.group(1).strip()
    for i, line in enumerate(lines):
        if re.match(r'^\s*CONSIGNOR\b', line, re.IGNORECASE):
            for j in range(i+1, len(lines)):
                l = lines[j].strip()
                if l:
                    return l.split(',')[0].strip()
    for line in lines:
        l = line.strip()
        if not l or re.match(r'^(INVOICES? &? DOCS?|PAGE|POD|NOA|PreAlert|Shipment)', l, re.IGNORECASE) or len(l) < 3:
            continue
        return l
    return "UNKNOWN"

def extract_date_time_fields(text, doc_type):
    date_rx = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{4}[-/]\d{1,2}[-/]\d{1,2}'
    time_rx = r'\d{1,2}:\d{2}(?::\d{2})?(?:\s*[APMapm]{2})?'
    etd = eta = atd = ata = "N/A"
    dd = dt = "N/A"
    if doc_type == "PreAlert":
        m1 = re.search(rf'ETD[:\s]*({date_rx})', text)
        m2 = re.search(rf'ETA[:\s]*({date_rx})', text)
        etd = m1.group(1) if m1 else etd
        eta = m2.group(1) if m2 else eta
    elif doc_type == "NOA":
        m = re.search(rf'ETA[:\s]*({date_rx})', text)
        eta = m.group(1) if m else eta
    else:
        m3 = re.search(rf'ATD[:\s]*({date_rx})', text)
        m4 = re.search(rf'ATA[:\s]*({date_rx})', text)
        atd = m3.group(1) if m3 else atd
        ata = m4.group(1) if m4 else ata
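        # Capture the date after "Delivered" and, if one follows it, a time (optionally introduced by "at").
        m5 = re.search(rf'Delivered.*?({date_rx})(?:\s*(?:at)?\s*({time_rx}))?', text, re.IGNORECASE|re.DOTALL)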
        if m5:
            dd = m5.group(1)
            dt = m5.group(2) or dt
    return etd, eta, atd, ata, dd, dt
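
One thing to watch in the container pattern above: because it is case-insensitive, a label such as `Container No:` makes the capture group try to consume `No` and then fail on the colon, so nothing is returned. The sketch below reproduces that and shows a stricter variant. `CONTAINER_RX` and `find_container` are hypothetical names, the sample strings are made up, and the four-letters-plus-seven-digits shape assumes ISO 6346 style container numbers.

```python
import re

# Pattern copied from extract_reference_and_container above.
ORIGINAL_RX = re.compile(r'Container[#:\s]*([A-Z0-9]*\d+[A-Z0-9]*)', re.IGNORECASE)

# Hypothetical stricter variant: optional "No."/"Number"/"#" label, then 4 letters + 7 digits.
CONTAINER_RX = re.compile(
    r'Container\s*(?:No\.?|Number|#)?[:\s]*([A-Z]{4}\d{7})\b', re.IGNORECASE
)

def find_container(text):
    """Return the container number in upper case, or None when no match is found."""
    m = CONTAINER_RX.search(text)
    return m.group(1).upper() if m else None

sample = "Container No: MSCU1234567"
print(ORIGINAL_RX.search(sample))                # None: the label blocks the match
print(find_container(sample))                    # MSCU1234567
print(find_container("Container# tghu7654321"))  # TGHU7654321
```

The stricter pattern trades recall for precision: container identifiers that do not follow the assumed shape would be missed, so it is worth checking against the actual documents first.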

# OCR - EasyOCR

In [None]:
# --- Initialize EasyOCR ---
reader = easyocr.Reader(['en'], gpu=True)

# --- Extract Text + Tables + OCR from PDFs ---
def extract_text_from_pdfs_in_directory(directory_path):
    all_raw_texts, filenames = [], []
    pdf_files = [f for f in os.listdir(directory_path) if f.lower().endswith('.pdf')]
    if not pdf_files:
        print(f"No PDF files found in {directory_path}.")
        return [], []
    for pdf_file in pdf_files:
        try:
            pdf_path = os.path.join(directory_path, pdf_file)
            pages_text = []
            with pdfplumber.open(pdf_path) as pdf:
                for page in pdf.pages:
                    txt = page.extract_text() or ""
                    pages_text.append(txt)
                    for table in page.extract_tables():
                        if table:
                            df = pd.DataFrame(table[1:], columns=table[0]) if table[0] else pd.DataFrame(table)
                            pages_text.append(df.to_csv(index=False))
                    img = page.to_image(resolution=300).original
                    tmp = f"/tmp/{pdf_file}-{page.page_number}.png"
                    img.save(tmp)
                    ocr = reader.readtext(tmp, detail=0)
                    os.remove(tmp)
                    if ocr:
                        pages_text.append(" ".join(ocr))
            all_raw_texts.append("\n".join(pages_text))
            filenames.append(pdf_file)
        except Exception as e:
            print(f"Error processing {pdf_file}: {e}")
    return all_raw_texts, filenames
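
The loop above runs OCR on every page, including pages where pdfplumber already returned a text layer, so the same content can appear twice in the consolidated text. One option, sketched below under the assumption that a very short text layer signals a scanned page, is to gate OCR on how much text was extracted. `needs_ocr` and the 20-character threshold are hypothetical and not part of the notebook.

```python
def needs_ocr(page_text, min_chars=20):
    """Return True when the extracted text layer is too short to be trusted."""
    # Count only non-whitespace characters so a whitespace-only layer still triggers OCR.
    visible = "".join((page_text or "").split())
    return len(visible) < min_chars

print(needs_ocr(""))                                        # True
print(needs_ocr("   \n  "))                                 # True
print(needs_ocr("Ocean Bill of Lading: HDMUSHAZ03315800"))  # False
```

Inside the page loop this check would wrap the `reader.readtext(...)` call, for example `if needs_ocr(txt): ...`. Pages that mix a text layer with scanned stamps or signatures would lose the OCR text under this rule, so the threshold is a trade-off.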




# Data Formatting

In [None]:
def consolidate_shipments(docs):
    shipments = defaultdict(lambda: defaultdict(dict))
    for doc in docs:
        ref, cont = extract_reference_and_container(doc.page_content)
        consignor = extract_consignor(doc.page_content)
        doc_type = extract_doc_type(doc.metadata.get('source',''))
        etd, eta, atd, ata, dd, dt = extract_date_time_fields(doc.page_content, doc_type)
        key = ref or cont or doc.metadata.get('source','unknown_shipment')
        shipments[consignor][key][doc_type] = {
            'reference': ref or 'N/A',
            'container': cont or 'N/A',
            'etd': etd,
            'eta': eta,
            'atd': atd,
            'ata': ata,
            'delivered_date': dd,
            'delivered_time': dt
        }
    return shipments

def display_all_shipments(shipments):
    for consignor, groups in shipments.items():
        first_type = next(iter(next(iter(groups.values())).keys()))
        print(f"I found {len(groups)} shipments FOR {consignor}  (from {first_type})")
        for i, (shipment_key, parts) in enumerate(groups.items(), 1):
            data = parts[next(iter(parts))]
            print(f"           Shipment{i}:  Reference: {data['reference']}  ")
            print(f"                         Estimate Departing:  {data['etd']}  ")
            print(f"                         Estimate Arriving:  {data['eta']}  ")
            print(f"                         Actual Departing: {data['atd']}")
            print(f"                         Actual Arriving: {data['ata']}    ")
            print(f"                         Container#:  {data['container']}  ")
            print(f"                         Delivered: {data['delivered_date']} at {data['delivered_time']}  \n")
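
`display_all_shipments` prints only the first document type stored for each shipment (`parts[next(iter(parts))]`), so a field that appears only in a later document, such as a POD delivery date, is not shown. A minimal sketch of one way to combine the records follows. It assumes that any value other than `'N/A'` should win and that later documents override earlier ones. `merge_parts` is a hypothetical helper and the sample values are made up.

```python
def merge_parts(parts):
    """Combine per-document-type records into one record, preferring real values over 'N/A'."""
    merged = {}
    for record in parts.values():
        for field, value in record.items():
            # Take the value if it is real, or if nothing has been stored for this field yet.
            if value != "N/A" or field not in merged:
                merged[field] = value
    return merged

# Made-up example: the PreAlert carries the estimates, the POD carries the delivery date.
parts = {
    "PreAlert": {"reference": "REF123", "etd": "01/02/2024", "eta": "15/02/2024", "delivered_date": "N/A"},
    "POD": {"reference": "REF123", "etd": "N/A", "eta": "N/A", "delivered_date": "16/02/2024"},
}
merged = merge_parts(parts)
print(merged)
```

In the display loop, `data = merge_parts(parts)` could replace the lookup of the first document type, so each shipment shows its estimates and its delivery details together.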



# Main operations

In [None]:
texts, files = extract_text_from_pdfs_in_directory(PDF_DIRECTORY)
docs = [Document(page_content=text, metadata={'source':filename}) for text, filename in zip(texts, files)]
shipments = consolidate_shipments(docs)



# Query

In [None]:
query = "How many shipments are currently tracked?"
if "how many" in query.lower() and "shipment" in query.lower():
    display_all_shipments(shipments)

I found 1 shipments FOR TAX INVOICE MEDITERRANEAN SHIPPING COMPANY(AUST) PTY LIMITED  (from POD)
           Shipment1:  Reference: HDMUSHAZ03315800  
                         Estimate Departing:  N/A  
                         Estimate Arriving:  N/A  
                         Actual Departing: N/A
                         Actual Arriving: N/A    
                         Container#:  N/A  
                         Delivered: N/A at N/A  

I found 2 shipments FOR SOUTH PACIFIC LOGISTICS CO.,LTD  (from PreAlert)
           Shipment1:  Reference: N/A  
                         Estimate Departing:  N/A  
                         Estimate Arriving:  N/A  
                         Actual Departing: N/A
                         Actual Arriving: N/A    
                         Container#:  N/A  
                         Delivered: N/A at N/A  

           Shipment2:  Reference: N/A  
                         Estimate Departing:  N/A  
                         Estimate Arriving:  N/A  
      