<a href="https://colab.research.google.com/github/0x0is1/0x0is1/blob/master/notebooks/chatpdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Engineering Intern Assessment

### Objective
This assessment is designed to evaluate your end-to-end understanding of AI engineering, including
OCR, text preprocessing, embedding-based retrieval, and LLM integration.

### OCR model selection
There are many lightweight options for LLM-based OCR, such as TrOCR.

One can also go with the traditional, vision-based **Tesseract** engine.

I chose a fine-tuned version of Qwen2-VL by Jack Chew, as it is open source and lightweight while reportedly doing better on many benchmarks.

https://huggingface.co/JackChew/Qwen2-VL-2B-OCR

In [1]:
from transformers import AutoProcessor, AutoModelForImageTextToText

ocr_processor = AutoProcessor.from_pretrained("JackChew/Qwen2-VL-2B-OCR")
ocr_model = AutoModelForImageTextToText.from_pretrained("JackChew/Qwen2-VL-2B-OCR")


In [3]:
! pip install pdf2image



## Bonus point 1: Handle PDF and image upload

### Step 1: Reading input file (uploading)


In [4]:
! sudo apt-get install -y poppler-utils


In [5]:
import io

import cv2
import numpy as np
from google.colab import files
from pdf2image import convert_from_path
from PIL import Image

# Colab saves the uploaded file to disk and also returns its bytes
uploaded = files.upload()
file_name = list(uploaded.keys())[0]
ext = file_name.split(".")[-1].lower()
image_exts = ["png", "jpg", "jpeg", "bmp", "tiff", "webp"]

if ext in image_exts:
    image = Image.open(io.BytesIO(uploaded[file_name]))
elif ext == "pdf":
    # Only the first page of the PDF is used
    image = convert_from_path(file_name, dpi=300)[0]
else:
    raise ValueError(f"Unsupported file type: .{ext}")

Saving test.png to test.png
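
The extension check above can also be factored into a small helper. The following is a minimal sketch, assuming a hypothetical helper name `classify_upload` that is not part of the notebook; it uses `pathlib` so that file names without an extension, or with an upper-case suffix, are handled explicitly.

```python
from pathlib import Path

# Same image extensions as in the upload cell above
IMAGE_EXTS = {"png", "jpg", "jpeg", "bmp", "tiff", "webp"}


def classify_upload(file_name: str) -> str:
    """Return "image" or "pdf" for a supported file name, else raise ValueError.

    `classify_upload` is a hypothetical helper, not part of the notebook.
    """
    # Path.suffix is "" when there is no extension, so "README" is rejected
    ext = Path(file_name).suffix.lower().lstrip(".")
    if ext in IMAGE_EXTS:
        return "image"
    if ext == "pdf":
        return "pdf"
    raise ValueError(f"Unsupported file type: {file_name!r}")
```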


### Preprocessing image

Applying CLAHE (contrast-limited adaptive histogram equalization) so that the model receives a higher-contrast input.


In [None]:
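# Convert to RGB first so RGBA or palette images do not break the grayscale conversion
img_cv = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2GRAY)
# Contrast-limited adaptive histogram equalization on an 8x8 tile grid
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
image = Image.fromarray(clahe.apply(img_cv))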
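
To make the idea concrete, here is plain global histogram equalization on a flat list of 8-bit gray values, in pure Python. This is a simplified illustration and not CLAHE itself: CLAHE applies the same kind of remapping per tile and clips the histogram (`clipLimit`) to limit noise amplification. The function name `equalize_histogram` is hypothetical.

```python
def equalize_histogram(pixels):
    """Globally equalize a flat list of 8-bit gray values (0..255)."""
    if not pixels:
        return []
    # Histogram of the 256 possible gray levels
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution: cdf[v] = number of pixels with value <= v
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = cdf[min(pixels)]
    total = len(pixels)
    if total == cdf_min:
        # Constant image: nothing to stretch
        return list(pixels)
    # Map each level so the cumulative distribution becomes roughly linear
    return [round((cdf[p] - cdf_min) / (total - cdf_min) * 255) for p in pixels]
```

A low-contrast input such as `[100, 100, 110, 120]` is stretched so that its darkest value maps to 0 and its brightest to 255.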

In [2]:
import torch

# Move the OCR model to the GPU (this notebook assumes a CUDA runtime)
ocr_model = ocr_model.to("cuda")


### Step 2: Text extraction

In [6]:
conversation = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
            },
            {
                "type":"text",
                "text":"extract all data from this document without missing anything"
            }
        ]
    }
]


text_prompt = ocr_processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = ocr_processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

output_ids = ocr_model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = ocr_processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)

["### Personal Information\n- **Name:** [Your Name]\n- **Address:** Worcester, MA 01601\n\n### Objective\nDetail-oriented and results-driven Marketing Specialist with 10 years of experience in digital marketing and brand management. Seeking to leverage my expertise in campaign strategy and social media optimization to contribute as a Senior Marketing Manager. Committed to delivering high-quality work and driving continuous improvement.\n\n### Professional Experience\n#### Senior Marketing Manager\n**InnovaTech Solutions - New York, NY**\n- **Date:** January 2059 – Present\n  - Led a team of 15 marketers in developing and executing digital campaigns that increased company revenue by 35% in 2059.\n  - Pioneered a social media strategy that boosted engagement by 50% and expanded the company's online presence across multiple platforms.\n  - Collaborated with product teams to launch three new products in 2059, achieving a 20% market share within the first six months.\n\n#### Marketing Speci
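
The slicing step in the extraction cell works because `generate` returns, for each sequence, the prompt tokens followed by the newly generated tokens. A toy sketch with plain Python lists standing in for token-id tensors (the ids are made up for illustration, and `strip_prompt` is a hypothetical name):

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Remove each prompt prefix from the corresponding generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up token ids: the first three ids of the output repeat the prompt
prompt_ids = [[101, 7, 8]]
full_ids = [[101, 7, 8, 42, 43, 2]]
new_ids = strip_prompt(prompt_ids, full_ids)
```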

In [7]:
!pip install faiss-cpu



In [8]:
text_data = output_text[0].strip()

### Step 3: Preprocessing the raw extracted text by splitting it into chunks at paragraph breaks, newlines, and periods

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

docs = splitter.create_documents([text_data])
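
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window chunker. It is not the algorithm that `RecursiveCharacterTextSplitter` uses (which prefers to break at the listed separators), and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text
            break
        start += step
    return chunks
```

The overlap means that a sentence cut at a window boundary still appears whole, or nearly whole, in one of the two neighboring chunks.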

In [10]:
docs[0]

Document(metadata={}, page_content='### Personal Information\n- **Name:** [Your Name]\n- **Address:** Worcester, MA 01601\n\n### Objective\nDetail-oriented and results-driven Marketing Specialist with 10 years of experience in digital marketing and brand management. Seeking to leverage my expertise in campaign strategy and social media optimization to contribute as a Senior Marketing Manager. Committed to delivering high-quality work and driving continuous improvement.')

In [11]:
doc_texts = [d.page_content for d in docs]

### Step 4: Embedding-based retrieval model selection

As with OCR, there are many options here; this time I am going with a sentence-transformers retrieval model: all-MiniLM-L6-v2

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [12]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embedder.encode(doc_texts, show_progress_bar=True)


#### Normalizing embeddings

In [13]:
# normalizing
import numpy as np
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
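
Why normalize? For unit-length vectors the inner product equals the cosine similarity, so an inner-product index ranks results by cosine. Below is a small pure-Python check of that identity. The helper names are hypothetical, and the zero-vector guard is an addition that the NumPy one-liner above does not have.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Inner product of the normalized vectors matches the cosine of the originals
similarity = dot(l2_normalize(a), l2_normalize(b))
```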

#### Adding the embeddings to a FAISS index that ranks by cosine similarity

In [14]:
import faiss, os, pickle, numpy as np

index_file, mapping_file = "ocr_docs_index.faiss", "docs_mapping.pkl"

# Note: the cache is keyed only by these file names, so delete both files
# before indexing a different document.
if os.path.exists(index_file) and os.path.exists(mapping_file):
    # bonus point 3: load the cached index and chunks if available
    index = faiss.read_index(index_file)
    with open(mapping_file, "rb") as f:
        doc_texts = pickle.load(f)
else:
    # otherwise build the index from the normalized embeddings;
    # inner product on unit vectors equals cosine similarity
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)
    index.add(np.array(embeddings, dtype=np.float32))
    faiss.write_index(index, index_file)
    with open(mapping_file, "wb") as f:
        pickle.dump(doc_texts, f)
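
`IndexFlatIP` performs an exact (brute-force) inner-product search. Conceptually that amounts to the following pure-Python sketch. The function name `top_k_inner_product` is hypothetical, and this shows only what is computed, not how FAISS implements it internally.

```python
def top_k_inner_product(query, vectors, k):
    """Return (scores, indices) of the k vectors with the largest inner product."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Highest score first; in this sketch ties are broken by lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]


# Unit-length toy vectors, so inner product equals cosine similarity
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
scores, indices = top_k_inner_product([0.0, 1.0], vectors, k=2)
```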

### Step 5: LLM selection for the RAG
This time I am going with TinyLlama, as it is lightweight, at just 1.1 billion parameters, while still being reasonably accurate.

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

In [15]:
import torch
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

`torch_dtype` is deprecated! Use `dtype` instead!

Device set to use cuda


### Final step: Inference function

In [16]:
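def query_document(query_text, top_k=2, max_tokens=200):
    index_file, mapping_file = "ocr_docs_index.faiss", "docs_mapping.pkl"

    # Reuse the cached index and chunks when present; otherwise rebuild them
    # from the notebook-level `embeddings` and `doc_texts`.
    if os.path.exists(index_file) and os.path.exists(mapping_file):
        index = faiss.read_index(index_file)
        with open(mapping_file, "rb") as f:
            chunks = pickle.load(f)
    else:
        # A separate local name avoids shadowing the global `doc_texts`
        chunks = doc_texts
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(dimension)
        index.add(np.array(embeddings, dtype=np.float32))
        faiss.write_index(index, index_file)
        with open(mapping_file, "wb") as f:
            pickle.dump(chunks, f)

    # Embed and normalize the query the same way as the document chunks
    query_emb = embedder.encode([query_text])
    query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    D, I = index.search(np.array(query_emb, dtype=np.float32), top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors exist
    retrieved_chunks = [chunks[i] for i in I[0] if i != -1]

    # Join outside the f-string: a backslash inside an f-string expression
    # is a syntax error before Python 3.12
    context = "\n\n".join(retrieved_chunks)
    prompt = f"""
You are an intelligent assistant reading document text.
Here are the most relevant sections:

{context}

User query: {query_text}

Answer:
"""
    response = pipe(prompt, max_new_tokens=max_tokens, do_sample=False)
    # generated_text contains the prompt followed by the model's answer
    return response[0]["generated_text"]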
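
Because the cache files have fixed names, a second document would silently reuse the first document's index. One possible way around this, sketched under the assumption that the extracted text is available as a string, is to derive the file names from a hash of that text. The helper `cache_paths` and its naming scheme are hypothetical.

```python
import hashlib
import os


def cache_paths(text, cache_dir="."):
    """Return (index_path, mapping_path) derived from the document's content."""
    # Short content digest: identical text maps to identical file names
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    index_path = os.path.join(cache_dir, f"ocr_docs_index_{digest}.faiss")
    mapping_path = os.path.join(cache_dir, f"docs_mapping_{digest}.pkl")
    return index_path, mapping_path
```

With this, `index_file, mapping_file = cache_paths(text_data)` could replace the hard-coded names, so each distinct document gets its own cache entry.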

### Bonus step 1: Summarizing cell

In [17]:
query = "Summarize this document"
result = query_document(query)
print(result)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



You are an intelligent assistant reading document text.
Here are the most relevant sections:

### Additional Information
- **Education:** Not applicable
- **Skills:** Digital marketing, brand management, campaign strategy, social media optimization
- **Certifications:** Not applicable
- **Languages:** Fluent in English
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable

- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:** Not applicable
- **Projects:**

User quer
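
The output above starts by echoing the prompt, since the function returns the full `generated_text`. TinyLlama-1.1B-Chat is a chat-tuned model, so one option worth trying is to pass the context as chat messages rather than as a raw string. Below is a minimal sketch of building such messages. `build_messages` is a hypothetical helper, and whether it improves the answers here is untested.

```python
def build_messages(retrieved_chunks, query_text):
    """Build a system/user message list for a chat-tuned model."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {
            "role": "system",
            "content": "You are an intelligent assistant reading document text.",
        },
        {
            "role": "user",
            "content": (
                "Here are the most relevant sections:\n\n"
                f"{context}\n\n"
                f"User query: {query_text}"
            ),
        },
    ]


messages = build_messages(["chunk one", "chunk two"], "Summarize this document")
```

With the pipeline from Step 5, the messages could then be turned into a prompt string via `pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.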

### Bonus point 4: Store chat history
You can copy and paste this cell as many times as you like; each copy, with its query and output, stays in the notebook and so acts as the chat history.

In [None]:
query = "More questions"
result = query_document(query)
print(result)


You are an intelligent assistant reading document text.
Here are the most relevant sections:

Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

User query: More questions

Answer:
Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

User query: More questions

Answer:
Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

User query: More questions

Answer:
Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

User query: More questions

Answer:
Dexter Jones
Marketing Specialist
222 555 777
your@email.com
Worcester, MA 01601

User query
