# Qwen3-VL-8B as a powerful OCR and document processor
This notebook accompanies the [talk](https://hong-kong.aitinkerers.org/talks/rsvp_wdu0jEPpJYA) given by Marcus Leiwe at the [AI Tinkerers Hong Kong](https://hong-kong.aitinkerers.org/) meetup on 27 November 2025.

The original project solved a critical data bottleneck for a charity client, [Branches of Hope](https://branchesofhope.org.hk/): digitising thousands of handwritten and scanned forms into a queryable database without sending sensitive PII to closed-source providers such as OpenAI. Running an open-weight model kept costs down and supported GDPR compliance.

NB: Due to privacy concerns, this notebook relies on synthetic data and does not replicate the full database.

## Switch runtime to T4
To run this notebook, you will need to use some of your T4 GPU hours on Google Colab.

In [None]:
import torch

# Ensure you have selected a T4 GPU runtime (Runtime > Change runtime type > T4 GPU)
# You can check your current GPU with:
if torch.cuda.is_available():
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected. Please change your runtime type to GPU (e.g., T4 GPU).")

## 1. Setup Environment & Data
This cell clones the repository to get the synthetic data and installs necessary libraries.
*(Run this once. It takes about 2-3 minutes.)*

In [None]:
import os

# Clone repo if not already present
if not os.path.exists("qwen3-vision-structured-extraction"):
    !git clone https://github.com/LeiweAndPartners/qwen3-vision-structured-extraction.git
    %cd qwen3-vision-structured-extraction
else:
    %cd qwen3-vision-structured-extraction
    !git pull

# Install dependencies (Quiet mode to reduce clutter)
print("Installing dependencies...")
!pip install -q -r requirements.txt
!pip install -q git+https://github.com/huggingface/transformers.git # Ensure latest Transformers for Qwen2.5/3
!sudo apt-get install -y poppler-utils  # Needed by pdf2image for PDF conversion; -y skips the confirmation prompt

print("✅ Environment Ready. Data available in ./data/synthetic_samples")

## 2. Load Qwen Model
We use 4-bit quantization to ensure this runs on a standard Google Colab T4 GPU.
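
As a rough back-of-the-envelope check on why 4-bit loading matters here, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 8 billion parameters (the "8B" in the model name; the exact count differs somewhat) and ignores activations, the KV cache, image inputs, and quantization metadata, so treat the numbers as lower bounds rather than measurements. `weight_memory_gib` is a hypothetical helper written for this illustration, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 8e9  # assumption: nominal parameter count implied by "8B"

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

At 16 bits per parameter the weights alone would nearly fill a T4's 16 GB, leaving little room for anything else; at 4 bits they need roughly a quarter of that.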

In [None]:
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig


# Configuration for 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# default: Load the model on the available device(s)
print("Loading Qwen3-VL-8B-Instruct model...")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", 
    quantization_config=bnb_config,
    dtype="auto", 
    device_map="auto"
)
print("✅ Model Loaded Successfully!")

## 3. Helper Functions

In [None]:
# Define Helper Functions
import os
import json
from PIL import Image
from pdf2image import convert_from_path
from qwen_vl_utils import process_vision_info
import matplotlib.pyplot as plt

def load_document_image(filepath):
    """Converts PDF/Image into a PIL Image for the model."""
    ext = os.path.splitext(filepath)[1].lower()
    
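    if ext in ['.jpg', '.jpeg', '.png']:
        # Normalise to RGB so palette, alpha, and greyscale images share one format
        return Image.open(filepath).convert("RGB")
    elif ext == '.pdf':
        # Render only the first page of the PDF as an image
        images = convert_from_path(filepath, first_page=1, last_page=1)
        return images[0] if images else None
    # Unsupported file type
    return None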

def run_qwen_inference(image, prompt_text):
    """Sends image + prompt to the model."""
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]
    
    # Prepare inputs
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Move inputs to wherever the model was placed instead of hard-coding "cuda"
    inputs = inputs.to(model.device)

    # Generate
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    return output_text[0]
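
`model.generate` returns the prompt tokens followed by the newly generated ones, which is why `run_qwen_inference` slices each output by the length of its input before decoding. The toy example below applies the same slicing to plain Python lists; the token IDs are made up for illustration and do not come from the real tokenizer.

```python
# Made-up token IDs standing in for inputs.input_ids and generated_ids
input_ids = [[101, 7, 8, 9], [101, 5, 6, 0]]
generated_ids = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 0, 77, 78]]

# Drop the echoed prompt from each sequence, keeping only the new tokens
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43], [77, 78]]
```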

## 4. The pipeline

In [None]:
# @title 4. Run Extraction Pipeline
# @markdown This script iterates through the synthetic data folder, classifies each document, and runs the matching extractor.

DATA_DIR = "./data/synthetic_samples"
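# Filter for only the files we want to demo (images and PDFs), matching the
# extensions that load_document_image supports, regardless of case
files = [f for f in os.listdir(DATA_DIR) if f.lower().endswith(('.jpg', '.jpeg', '.png', '.pdf'))]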
files.sort()

# --- Prompts ---
PROMPT_CLASSIFY = """
Classify this document into one of the following categories. 
Return ONLY the category name.
Options:
1. Immigration Recognizance Form
2. Tenancy Agreement
3. Other
"""

PROMPT_EXTRACT_RECOGNIZANCE = """
You are a data extraction assistant. Extract the following fields into a valid JSON object.
Ensure keys are: serial_no, name, recognizance_no, dob, address, reporting_condition_summary.
If the document is bilingual, prefer English.
"""

PROMPT_EXTRACT_TENANCY = """
You are a data extraction assistant. Extract the following fields into a valid JSON object.
Ensure keys are: landlord, tenant, monthly_rent, lease_term_months, rent_payment_date.
"""

print(f"Found {len(files)} documents to process.\n")

for filename in files:
    filepath = os.path.join(DATA_DIR, filename)
    image = load_document_image(filepath)
    
    if image is None:
        continue
        
    print(f"--- Processing: {filename} ---")
    
    # 1. Show Image Preview
    plt.figure(figsize=(4, 6))
    plt.imshow(image)
    plt.axis('off')
    plt.show()
    
    # 2. Classify
    doc_type = run_qwen_inference(image, PROMPT_CLASSIFY)
    print(f"📂 Classification: {doc_type}")
    
    # 3. Route & Extract
    extraction_result = "{}"
    if "Recognizance" in doc_type:
        extraction_result = run_qwen_inference(image, PROMPT_EXTRACT_RECOGNIZANCE)
    elif "Tenancy" in doc_type:
        extraction_result = run_qwen_inference(image, PROMPT_EXTRACT_TENANCY)
    else:
        extraction_result = "Skipped extraction for this type."
        
    # 4. Display Result
    print("📊 Extraction Result:")
    # Clean up markdown code blocks if present
    clean_json = extraction_result.replace("```json", "").replace("```", "").strip()
    print(clean_json)
    print("\n" + "="*50 + "\n")
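
The string clean-up above leaves the result as text. If you want a Python dictionary instead (for example, to write rows to a database), one possible approach is sketched below. `parse_model_json` and `RECOGNIZANCE_KEYS` are hypothetical names introduced for this illustration, and the sample string is invented. The sketch assumes the model's answer contains a single JSON object, possibly wrapped in a Markdown code fence or surrounded by extra text.

```python
import json
import re

# Keys requested in PROMPT_EXTRACT_RECOGNIZANCE
RECOGNIZANCE_KEYS = {
    "serial_no", "name", "recognizance_no",
    "dob", "address", "reporting_condition_summary",
}

def parse_model_json(raw, expected_keys=None):
    """Return (data, missing_keys); data is None if no valid JSON object is found."""
    expected = set(expected_keys or ())
    # Grab the outermost {...} span, ignoring code fences or chatter around it
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None, expected
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None, expected
    return data, expected - set(data)

# Invented model output, wrapped in a Markdown fence as models often do
fence = "`" * 3
sample = f'{fence}json\n{{"serial_no": "A123", "name": "Jane Doe"}}\n{fence}'

data, missing = parse_model_json(sample, RECOGNIZANCE_KEYS)
print(data)
print("Missing keys:", sorted(missing))
```

Returning the missing keys alongside the data makes it easy to flag incomplete extractions for manual review rather than silently storing partial records.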