## Comparison of Untrained and Trained Models

| Model | Average WER | Average CER |
|---|---|---|
| Baseline (Untrained) | 1.000 | 0.989 |
| Fine-tuned (Trained) | 1.001 | 0.982 |

**Observations:**

* **WER (Word Error Rate):** The fine-tuned model's WER (1.001) is essentially the same as the baseline's (1.000), indicating no measurable change in word-level accuracy.
* **CER (Character Error Rate):** The fine-tuned model shows a slightly lower CER (0.982 vs. 0.989), a small improvement in recognizing individual characters.

**Conclusion:**

Fine-tuning the vision-language model led to a small reduction in CER (0.989 to 0.982) but no reduction in WER. This suggests that the training process can improve text extraction, although the gain under the current settings is marginal.
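
To put the table in perspective, the absolute and relative changes can be computed directly from the reported averages. The snippet below is a small illustrative calculation; the variable and function names are mine and are not part of the notebook.

```python
# Averages copied from the comparison table above.
baseline = {"wer": 1.000, "cer": 0.989}
finetuned = {"wer": 1.001, "cer": 0.982}

def relative_change(before, after):
    """Relative change in percent; negative means the error went down."""
    return (after - before) / before * 100

for metric in ("wer", "cer"):
    delta = finetuned[metric] - baseline[metric]
    rel = relative_change(baseline[metric], finetuned[metric])
    print(f"{metric.upper()}: absolute {delta:+.3f}, relative {rel:+.2f}%")
```

This gives a relative CER reduction of about 0.7% and a relative WER increase of about 0.1%, so both changes are well under one percent.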

**Notes:**
1. Because little GPU RAM is available on the free tier of Colab, I had to use very small LoRA parameters. Even so, fine-tuning gave a slight improvement.
2. I use the text extracted by Tesseract OCR directly as the ground truth. No human annotation is involved.
3. I process the full dataset this time. Tesseract OCR runs on the CPU, so I did not need to truncate the dataset; with EasyOCR, which uses the GPU, I could only use a portion of the data.

# Code and Step-by-Step Approach

# 1. Setting up the Environment

This section installs the necessary libraries for the project. It includes:

- **opencv-python-headless:** For image processing.
- **jiwer:** For calculating the Word Error Rate (WER).
- **textstat:** For text statistics.
- **transformers:** For using pre-trained transformer models.
- **tesseract-ocr:** For optical character recognition (OCR).
- **tesseract-ocr-guj:** Gujarati language support for Tesseract.
- **pytesseract:** Python wrapper for Tesseract.
- **bitsandbytes:** For efficient model loading and training.
- **unsloth:** For fast loading and fine-tuning of the vision-language model.

In [None]:
!pip install opencv-python-headless jiwer
!pip install textstat
!pip install transformers
!sudo apt-get update
!sudo apt-get install tesseract-ocr tesseract-ocr-guj
!pip install pytesseract
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install -U bitsandbytes


# 2. Importing Libraries

Here, we import the necessary libraries for the project:

- **os, zipfile, random, json, shutil, pandas, glob:** For file handling, data manipulation, and other utilities.
- **pytesseract, PIL:** For OCR using Tesseract.
- **jiwer:** For calculating the Word Error Rate (WER).
- **torch, transformers:** For using pre-trained transformer models and fine-tuning.
- **peft:** For parameter-efficient fine-tuning (PEFT) techniques like LoRA.

In [None]:

import os
import zipfile
import random
import json
import shutil
import pandas as pd
from glob import glob

# For OCR using Tesseract
import pytesseract
from PIL import Image

# For evaluation metrics
from jiwer import wer

# For transformer model and fine-tuning
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForSeq2Seq
!pip install peft
from peft import LoraConfig, get_peft_model

# Set random seed for reproducibility
random.seed(43)



# Data Preparation: Loading and Splitting

This section prepares the data for training and evaluation:

1. **Extracting the images:** Unzips the image archive into a working directory, removing any previous extraction so that reruns start clean.
2. **Listing the image files:** Collects every `.png`, `.jpg`, and `.jpeg` file. No ground-truth CSV is loaded; the Tesseract OCR output produced later serves as the reference text.
3. **Splitting the dataset:** Divides the files into training and testing sets (80% training, 20% testing) for model training and evaluation.

In [None]:
from sklearn.model_selection import train_test_split

# Set paths for the ZIP file and extraction directory.
zip_path = "/content/images-20250330T094124Z-001.zip"  # <-- change to your ZIP file path
extract_path = "extracted_images"
image_dir = "extracted_images/images"

# Remove any previous extraction (for reruns), then recreate the directory.
if os.path.exists(extract_path):
    shutil.rmtree(extract_path)
os.makedirs(extract_path, exist_ok=True)

# Unzip the file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

all_files = [f for f in os.listdir(image_dir) if f.endswith(('.png', '.jpg', '.jpeg'))]
#train_files, test_files = train_test_split(all_files, test_size=0.1, random_state=43)
num_files_to_use = int(len(all_files) * 1)  # Fraction of files to use (1 = full dataset)
selected_files = random.sample(all_files, num_files_to_use)

# Now split the selected files into train and test sets
train_files, test_files = train_test_split(selected_files, test_size=0.2, random_state=42)

print(f"Train: {len(train_files)}, Test: {len(test_files)}")


Train: 160, Test: 40
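
The 160/40 split printed above follows from 200 images and `test_size=0.2`. As a minimal sketch of the same idea, the snippet below performs a seeded shuffle-and-cut split using only the standard library. It is a stand-in for `train_test_split`, not its implementation, and the file names are made up for illustration.

```python
import random

def split_files(files, test_fraction=0.2, seed=42):
    """Shuffle a copy of `files` with a fixed seed and cut it into train/test."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    shuffled = list(files)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names; 200 of them to mirror the dataset size.
files = [f"page_{i:06d}.jpg" for i in range(200)]
train, test = split_files(files)
print(f"Train: {len(train)}, Test: {len(test)}")
```

Fixing the seed means the same files land in the test set on every run, which keeps the baseline and fine-tuned evaluations comparable.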


# Preprocessing and OCR

This section outlines the functions used for preprocessing images and extracting text using OCR:

- **`preprocess_image`:** Prepares the image for OCR by converting it to grayscale, applying a Gaussian blur, and applying adaptive thresholding.
- **`extract_text_from_image`:** Extracts text from the preprocessed image using Tesseract OCR with Gujarati language support.
- **`postprocess_text`:** Cleans the extracted text by collapsing extra whitespace and newlines into single spaces.

In [None]:
import cv2
import numpy as np
from PIL import Image
import pytesseract
import unsloth  # Ensure you have the correct version installed

def preprocess_image(image_path):
    """
    Preprocess the image to enhance OCR quality.
    Steps include converting to grayscale, blurring, and adaptive thresholding.
    """
    # Read the image using OpenCV
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"Could not read image from {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Reduce noise with Gaussian Blur
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Adaptive thresholding to create a binary image
    thresh = cv2.adaptiveThreshold(
        blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    return thresh

def extract_text_from_image(image_path, lang="guj"):
    """
    Uses Tesseract OCR to extract text from a pre-processed image.
    """
    # Preprocess the image first
    processed_image = preprocess_image(image_path)

    # Convert the processed image to a PIL Image
    pil_image = Image.fromarray(processed_image)

    # Extract text using Tesseract OCR
    text = pytesseract.image_to_string(pil_image, lang=lang)
    return text.strip()

def postprocess_text(text):
    """
    Clean the OCR output text by removing extra whitespace and unwanted characters.
    Customize this function for your specific postprocessing needs.
    """
    # Remove extra spaces and newlines
    cleaned_text = " ".join(text.split())
    return cleaned_text


Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # Ensure you have the correct version installed


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


# Baseline OCR

This section runs the baseline OCR process using Tesseract:

1. **Iterates through images:** Processes each image in the train and test sets using `extract_text_from_image` to get the OCR output.
2. **Stores the results:** Saves the extracted text in a dictionary named `baseline_results`.
3. **Displays some results:** Prints the extracted text for a few sample images.

In [None]:
# Run baseline OCR only on train and test images.
baseline_results = {}
for filename in os.listdir(image_dir):
    if filename.endswith(('.png', '.jpg', '.jpeg')) and (filename in train_files or filename in test_files):
        img_path = os.path.join(image_dir, filename)
        raw_text = extract_text_from_image(img_path, lang="guj")
        ocr_text = postprocess_text(raw_text)
        print(f"Image: {filename}")
        baseline_results[filename] = ocr_text

# Optionally, display a few results.
for i, (fname, txt) in enumerate(baseline_results.items()):
    print(f"Image: {fname}\nExtracted Text: {txt}\n{'-'*40}")
    if i >= 2:
        break

Image: hathi_book_p000130.jpg
Image: hathi_book_p000146.jpg
Image: hathi_book_p000098.jpg
Image: hathi_book_p000092.jpg
Image: hathi_book_p000143.jpg
Image: hathi_book_p000089.jpg
Image: hathi_book_p000153.jpg
Image: hathi_book_p000049.jpg
Image: hathi_book_p000046.jpg
Image: hathi_book_p000055.jpg
Image: hathi_book_p000155.jpg
Image: hathi_book_p000016.jpg
Image: hathi_book_p000077.jpg
Image: hathi_book_p000121.jpg
Image: hathi_book_p000001.jpg
Image: hathi_book_p000031.jpg
Image: hathi_book_p000034.jpg
Image: hathi_book_p000041.jpg
Image: hathi_book_p000161.jpg
Image: hathi_book_p000123.jpg
Image: hathi_book_p000129.jpg
Image: hathi_book_p000149.jpg
Image: hathi_book_p000040.jpg
Image: hathi_book_p000029.jpg
Image: hathi_book_p000176.jpg
Image: hathi_book_p000052.jpg
Image: hathi_book_p000020.jpg
Image: hathi_book_p000071.jpg
Image: hathi_book_p000054.jpg
Image: hathi_book_p000038.jpg
Image: hathi_book_p000154.jpg
Image: hathi_book_p000188.jpg
Image: hathi_book_p000165.jpg
Image: hat

In [None]:
!pip install --upgrade transformers



# Load the Vision-Language Model

This section loads the pre-trained Qwen2-VL-2B-Instruct model (4-bit quantized) and prepares it for fine-tuning:

- **FastVisionModel:** Loads the model for multimodal inputs (image and text).
- **4bit pre-quantized models:** Lists the supported pre-quantized models, which download faster and use less memory.
- **AutoProcessor:** Imported here but not used directly; `FastVisionModel.from_pretrained` returns both the model and its tokenizer.
- **load_in_4bit:** Uses 4-bit quantization to save memory.
- **use_gradient_checkpointing:** Enables gradient checkpointing for long context.

In [None]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in a 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

from transformers import AutoProcessor


model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

==((====))==  Unsloth 2025.3.19: Fast Qwen2_Vl patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.54G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/238 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/572 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/4.33k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

# Configure LoRA for Fine-Tuning

This section configures LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning:

- **LoraConfig:** Sets up the LoRA configuration.
- **get_peft_model:** Applies LoRA to the loaded model.
- **print_trainable_parameters:** Shows the number of trainable parameters after LoRA.

In [None]:
!pip install peft
from peft import LoraConfig, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # adjust depending on model architecture
    lora_dropout=0.1,
    bias="none"
)

model = get_peft_model(model, lora_config)
print("Trainable parameters:")
model.print_trainable_parameters()

Trainable parameters:
trainable params: 1,089,536 || all params: 2,210,075,136 || trainable%: 0.0493
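
The trainable-parameter count printed above can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes that the language model of Qwen2-VL-2B has 28 decoder layers, a hidden size of 1536, and a key/value projection width of 256. These figures come from the published model configuration as I understand it, not from this notebook, so treat them as assumptions.

```python
def lora_params(r, d_in, d_out):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Qwen2-VL-2B language-model dimensions (not stated in the notebook).
NUM_LAYERS = 28
HIDDEN = 1536
KV_DIM = 256  # assumed output width of v_proj under grouped-query attention

r = 8
per_layer = lora_params(r, HIDDEN, HIDDEN) + lora_params(r, HIDDEN, KV_DIM)  # q_proj + v_proj
trainable = per_layer * NUM_LAYERS
total = 2_210_075_136  # "all params" from the output above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable / total * 100:.4f}")
```

Under these assumptions the result matches the reported 1,089,536 parameters, which suggests that only the `q_proj` and `v_proj` layers of the language model received adapters.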


In [None]:
import torch
torch.cuda.empty_cache()  # Release cached GPU memory that is no longer in use

# Further Customize LoRA Fine-tuning

This code further customizes the application of LoRA (Low-Rank Adaptation) for fine-tuning the vision-language model.

- `FastVisionModel.get_peft_model`: Applies LoRA to the model with specific configurations.
- `finetune_vision_layers`, `finetune_language_layers`, etc.: Control which parts of the model are fine-tuned.
- `r`, `lora_alpha`, `lora_dropout`: LoRA hyperparameters, kept very small here (`r = 2`, `lora_alpha = 2`) to fit within the limited GPU memory.
- `target_modules`: Specifies the specific layers to be modified by LoRA, allowing for more precise control over fine-tuning.

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = False,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,

    r = 2,
    lora_alpha = 2,
    lora_dropout = 0,
    bias = "none",
    random_state = 3957,
    use_rslora = False,
    loftq_config = None,
    # Instead of 'all-linear', provide specific layer names or patterns:
    target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "fc1", "fc2"]
)

Unsloth: Making `model.base_model.model.base_model.model.visual` require gradients


# Baseline Inference and Error Calculation

This section performs inference using the untrained model on the test set:

1. Enables inference mode: Sets the model to evaluation mode.
2. Defines the instruction: Specifies the instruction for text extraction from the image.
3. Iterates through test images: Processes each test image, extracts text using the model, and stores the predictions.
4. Calculates error metrics: Computes WER and CER for the baseline predictions on the test set.
5. Prints results: Displays the average WER and CER for the baseline model on the test set.

In [None]:
import os
import cv2
import numpy as np
from PIL import Image
import re
from collections import defaultdict
from transformers import TextStreamer
import torch
import jiwer


# Helper function for CER using Levenshtein Distance
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def compute_cer(pred, ref):
    return levenshtein_distance(pred, ref) / len(ref) if len(ref) > 0 else 0

# Enable inference mode for the untrained model
FastVisionModel.for_inference(model)

# Define the instruction for text extraction
instruction = "Extract the text from the given image."

# --- Baseline Inference on Test Set Using Untrained Model ---
baseline_predictions_test = {}
for file in test_files:
    try:
        file_path = os.path.join(image_dir, file)
        image = Image.open(file_path).convert("RGB")
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": instruction}
                ]
            }
        ]
        input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        inputs = tokenizer(
            image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt"
        ).to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            use_cache=True,
            temperature=1.5,
            min_p=0.1
        )
        pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        baseline_predictions_test[file] = pred_text
        print(f"Baseline inference processed for test image: {file}")
    except Exception as e:
        print(f"Error during inference for {file}: {e}")


Baseline inference processed for test image: hathi_book_p000117.jpg
Baseline inference processed for test image: hathi_book_p000059.jpg
Baseline inference processed for test image: hathi_book_p000078.jpg
Baseline inference processed for test image: hathi_book_p000049.jpg
Baseline inference processed for test image: hathi_book_p000131.jpg
Baseline inference processed for test image: hathi_book_p000190.jpg
Baseline inference processed for test image: hathi_book_p000038.jpg
Baseline inference processed for test image: hathi_book_p000159.jpg
Error during inference for hathi_book_p000000.jpg: CUDA out of memory. Tried to allocate 6.50 GiB. GPU 0 has a total capacity of 14.74 GiB of which 5.51 GiB is free. Process 16006 has 9.23 GiB memory in use. Of the allocated memory 9.04 GiB is allocated by PyTorch, and 47.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do
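
One detail worth checking in the loop above: for decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding `outputs[0]` in full also decodes the chat template and the instruction. The sketch below shows one way to keep only the new tokens before decoding. `strip_prompt` is a hypothetical helper, and plain lists stand in for the token tensors so that the example runs without a GPU.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

# Toy token ids: the first four stand for the prompt, the rest for the answer.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

generated = strip_prompt(output_ids, len(prompt_ids))
print(generated)
```

In the notebook this would correspond to slicing `outputs[0]` at `inputs["input_ids"].shape[1]` before calling `tokenizer.decode`. Whether this changes the reported WER and CER has not been tested here.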

In [None]:
# --- Calculate Error Metrics for Test Set ---
total_wer_test = 0.0
total_cer_test = 0.0
num_samples_test = 0

# The Tesseract OCR output stored in baseline_results serves as the reference text.
for file, pred in baseline_predictions_test.items():
    ref = baseline_results.get(file, "")
    error_wer = jiwer.wer(ref, pred)
    error_cer = compute_cer(pred, ref)
    print(f"{file}: WER = {error_wer:.3f}, CER = {error_cer:.3f}")
    total_wer_test += error_wer
    total_cer_test += error_cer
    num_samples_test += 1

if num_samples_test > 0:
    print("\n--- Overall Performance on Test Dataset ---")
    print(f"Average WER: {total_wer_test/num_samples_test:.3f}")
    print(f"Average CER: {total_cer_test/num_samples_test:.3f}")
else:
    print("No test samples processed.")

hathi_book_p000117.jpg: WER = 1.000, CER = 0.990
hathi_book_p000059.jpg: WER = 1.000, CER = 0.989
hathi_book_p000078.jpg: WER = 1.000, CER = 0.990
hathi_book_p000049.jpg: WER = 1.000, CER = 0.989
hathi_book_p000131.jpg: WER = 1.000, CER = 0.990
hathi_book_p000190.jpg: WER = 1.000, CER = 0.991
hathi_book_p000038.jpg: WER = 1.000, CER = 0.988
hathi_book_p000159.jpg: WER = 1.000, CER = 0.991
hathi_book_p000142.jpg: WER = 1.000, CER = 0.992
hathi_book_p000033.jpg: WER = 1.000, CER = 0.987
hathi_book_p000087.jpg: WER = 1.000, CER = 0.990
hathi_book_p000052.jpg: WER = 1.000, CER = 0.988
hathi_book_p000111.jpg: WER = 1.000, CER = 0.989
hathi_book_p000136.jpg: WER = 1.000, CER = 0.989
hathi_book_p000083.jpg: WER = 1.000, CER = 0.989
hathi_book_p000067.jpg: WER = 1.000, CER = 0.988
hathi_book_p000037.jpg: WER = 1.000, CER = 0.989
hathi_book_p000110.jpg: WER = 1.000, CER = 0.991
hathi_book_p000161.jpg: WER = 1.000, CER = 0.991
hathi_book_p000053.jpg: WER = 1.000, CER = 0.988
hathi_book_p000196.j
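
For reference, both metrics are edit distances normalized by the length of the reference: WER counts word-level edits and CER counts character-level edits. The sketch below is a self-contained illustration on made-up English strings. The `edit_distance`, `simple_cer`, and `simple_wer` helpers are my own stand-ins and are not how `jiwer` is implemented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def simple_cer(pred, ref):
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(pred, ref) / len(ref) if ref else 0.0

def simple_wer(pred, ref):
    """Word-level edits divided by the number of reference words."""
    ref_words = ref.split()
    return edit_distance(pred.split(), ref_words) / len(ref_words) if ref_words else 0.0

print(round(simple_cer("kitten", "sitting"), 3))  # 3 edits over 7 characters
print(simple_wer("a x c", "a b c d"))             # 2 edits over 4 words
```

Because the number of edits can exceed the length of the reference, both metrics can be greater than 1, which is how an average WER of 1.001 is possible.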

# Data Conversion for Fine-tuning

This cell defines a function `convert_to_conversation` that prepares the training data for fine-tuning. It takes an image filename, loads the image, looks up the corresponding OCR text in `baseline_results`, and formats both into the conversation structure that the vision-language model expects for training.

In [None]:
from PIL import Image
import os

def convert_to_conversation(file):
    file_path = os.path.join(image_dir, file)
    image = Image.open(file_path).convert("RGB")
    instruction = "Extract the text from the given image."
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": instruction}
            ]
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": baseline_results[file]}
            ]
        }
    ]
    return {"messages": conversation}




In [None]:
converted_train_dataset = [
    convert_to_conversation(file)
    for file in train_files if file in baseline_results
]

converted_test_dataset = [
    convert_to_conversation(file)
    for file in test_files if file in baseline_results  # use the test split, not the train split
]

## Training Setup

This cell defines a generic Hugging Face training setup. The Unsloth `SFTTrainer` in the next section is built with its own `UnslothVisionDataCollator` and `SFTConfig`, so the two objects below are not used by the fine-tuning run reported here:

1. **Creating a `data_collator`:** Handles dynamic padding of input sequences for efficient batch processing.
2. **Defining `training_args`:** Sets hyperparameters such as batch size, learning rate, number of epochs, and saving frequency.

In [None]:

# Data collator for seq2seq training
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = TrainingArguments(
    output_dir="./qwen_finetuned_gujarati",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    push_to_hub=False,
)






## Fine-tuning with Unsloth

This cell sets up the Unsloth trainer for efficient fine-tuning:

1. **Imports:** Imports necessary components for Unsloth training.
2. **Training Mode:** Enables training mode for the model.
3. **Trainer Initialization:** Creates an `SFTTrainer` instance with configurations for data handling, optimization, and training parameters specific to vision-language tasks.

In [None]:
# Fine-tuning using Unsloth's trainer:
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# Enable training mode for the model.
FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # Must use!
    train_dataset=converted_train_dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,  # For a quick run; for full training you may set num_train_epochs instead.
        learning_rate=2e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Disable external logging; adjust if needed.
        # The following parameters are required for vision fine-tuning:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=4,
        max_seq_length=128,
    ),
)


Unsloth: Model does not have a default image size - using 512


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
12.465 GB of memory reserved.


## Fine-tune and Save

This cell initiates the fine-tuning process and saves the trained model and tokenizer for later use.

- `trainer.train()`: Starts the training loop defined by the Unsloth trainer.
- `model.save_pretrained()`, `tokenizer.save_pretrained()`: Saves the fine-tuned model and tokenizer to the specified directory ("qwen_finetuned_gujarati").

In [None]:
trainer_stats = trainer.train()

# Save the fine-tuned model
model.save_pretrained("qwen_finetuned_gujarati")
tokenizer.save_pretrained("qwen_finetuned_gujarati")


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 160 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 3,127,296/2,000,000,000 (0.16% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.0619
2,1.9799
3,2.0986
4,2.0271
5,2.004
6,2.0073
7,2.0977
8,1.9781
9,2.0004
10,1.9188
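
The header of the training log can be cross-checked with a little arithmetic: the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, and the number of samples seen is that value times the number of optimizer steps. The calculation below uses only values from the configuration and log above.

```python
per_device_batch = 1
grad_accum_steps = 4
num_gpus = 1
max_steps = 30
num_examples = 160

effective_batch = per_device_batch * grad_accum_steps * num_gpus
samples_seen = effective_batch * max_steps
epochs_covered = samples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen: {samples_seen} of {num_examples}")
print(f"Fraction of one epoch: {epochs_covered:.2f}")
```

So 30 steps cover 120 of the 160 training examples, which is three quarters of one pass over the training set.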


[]

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

513.4462 seconds used for training.
8.56 minutes used for training.
Peak reserved memory = 12.465 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 84.56 %.
Peak reserved memory for training % of max memory = 0.0 %.
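
The 0.0 GB reported for training follows from how the two readings are taken: both use `torch.cuda.max_memory_reserved()`, which returns the peak since the start of the process or since the last reset. The peak of 12.465 GB had already been reached before training began, so the difference is zero. The arithmetic below reproduces the printed figures from the logged values; it does not call PyTorch.

```python
max_memory = 14.741        # GB, total GPU memory from the log
start_gpu_memory = 12.465  # GB, peak reserved before training
used_memory = 12.465       # GB, peak reserved after training

used_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"Peak reserved memory for training = {used_for_training} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
```

To isolate the memory used by training, one option is to call `torch.cuda.reset_peak_memory_stats()` before `trainer.train()` and read the peak afterwards.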


## Evaluate Fine-tuned Model

These cells perform inference with the fine-tuned model and evaluate its performance on the test set.

1. **Inference:** The first cell switches the model to inference mode and then iterates through the test images, extracting text using the model and storing the predictions.
2. **Evaluation:** The second cell calculates the Word Error Rate (WER) and Character Error Rate (CER) by comparing the model's predictions to the ground truth (OCR output) for each test image. It then prints the average WER and CER, providing an overall assessment of the fine-tuned model's performance.

In [None]:
FastVisionModel.for_inference(model)  # Enable inference mode
instruction = "Extract the text from the given image."

finetuned_predictions_test = {}
for file in test_files:
    try:
        file_path = os.path.join(image_dir, file)
        image = Image.open(file_path).convert("RGB")
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": instruction}
                ]
            }
        ]
        input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        inputs = tokenizer(
            image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt"
        ).to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            use_cache=True,
            temperature=1.5,  # Consider adjusting
            min_p=0.1        # Consider adjusting
        )
        pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        finetuned_predictions_test[file] = pred_text
        print(f"Fine-tuned inference processed for test image: {file}")
    except Exception as e:
        print(f"Error during inference for {file}: {e}")

Fine-tuned inference processed for test image: hathi_book_p000117.jpg
Fine-tuned inference processed for test image: hathi_book_p000059.jpg
Fine-tuned inference processed for test image: hathi_book_p000078.jpg
Fine-tuned inference processed for test image: hathi_book_p000049.jpg
Fine-tuned inference processed for test image: hathi_book_p000131.jpg
Fine-tuned inference processed for test image: hathi_book_p000190.jpg
Fine-tuned inference processed for test image: hathi_book_p000038.jpg
Fine-tuned inference processed for test image: hathi_book_p000159.jpg
Error during inference for hathi_book_p000000.jpg: CUDA out of memory. Tried to allocate 6.50 GiB. GPU 0 has a total capacity of 14.74 GiB of which 5.44 GiB is free. Process 16006 has 9.30 GiB memory in use. Of the allocated memory 9.06 GiB is allocated by PyTorch, and 82.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragme

In [None]:
total_wer_test = 0.0
total_cer_test = 0.0
num_samples_test = 0

for file, pred in finetuned_predictions_test.items():
    ref = baseline_results.get(file, "")  # Reference text is the Tesseract OCR output
    error_wer = jiwer.wer(ref, pred)
    error_cer = compute_cer(pred, ref)
    print(f"{file}: WER = {error_wer:.3f}, CER = {error_cer:.3f}")
    total_wer_test += error_wer
    total_cer_test += error_cer
    num_samples_test += 1

if num_samples_test > 0:
    print("\n--- Overall Performance on Test Dataset (Fine-tuned) ---")
    print(f"Average WER: {total_wer_test/num_samples_test:.3f}")
    print(f"Average CER: {total_cer_test/num_samples_test:.3f}")
else:
    print("No test samples processed.")

hathi_book_p000117.jpg: WER = 1.000, CER = 0.990
hathi_book_p000059.jpg: WER = 1.000, CER = 0.989
hathi_book_p000078.jpg: WER = 1.000, CER = 0.990
hathi_book_p000049.jpg: WER = 1.000, CER = 0.990
hathi_book_p000131.jpg: WER = 1.000, CER = 0.990
hathi_book_p000190.jpg: WER = 1.000, CER = 0.991
hathi_book_p000038.jpg: WER = 1.000, CER = 0.984
hathi_book_p000159.jpg: WER = 1.000, CER = 0.991
hathi_book_p000142.jpg: WER = 1.000, CER = 0.992
hathi_book_p000033.jpg: WER = 1.000, CER = 0.989
hathi_book_p000087.jpg: WER = 1.000, CER = 0.968
hathi_book_p000052.jpg: WER = 1.000, CER = 0.989
hathi_book_p000111.jpg: WER = 1.000, CER = 0.989
hathi_book_p000136.jpg: WER = 1.000, CER = 0.990
hathi_book_p000083.jpg: WER = 1.000, CER = 0.989
hathi_book_p000067.jpg: WER = 1.000, CER = 0.988
hathi_book_p000037.jpg: WER = 1.000, CER = 0.989
hathi_book_p000110.jpg: WER = 1.000, CER = 0.991
hathi_book_p000161.jpg: WER = 1.000, CER = 0.991
hathi_book_p000053.jpg: WER = 1.000, CER = 0.990
hathi_book_p000196.j
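
Comparing the per-image CER values that are visible in the two logs shows where the average change comes from. Of the 20 test images listed in full in both outputs, 13 have identical CER before and after fine-tuning; the snippet below tabulates the remaining seven. The values are copied from the logs above, and both logs are truncated, so this covers only part of the test set.

```python
# (baseline CER, fine-tuned CER) for the visible images whose CER changed.
changed = {
    "hathi_book_p000049.jpg": (0.989, 0.990),
    "hathi_book_p000038.jpg": (0.988, 0.984),
    "hathi_book_p000033.jpg": (0.987, 0.989),
    "hathi_book_p000087.jpg": (0.990, 0.968),
    "hathi_book_p000052.jpg": (0.988, 0.989),
    "hathi_book_p000136.jpg": (0.989, 0.990),
    "hathi_book_p000053.jpg": (0.988, 0.990),
}

improved = [f for f, (before, after) in changed.items() if after < before]
worsened = [f for f, (before, after) in changed.items() if after > before]
net_change = sum(after - before for before, after in changed.values())

print(f"Improved: {len(improved)}, worsened: {len(worsened)}")
print(f"Net CER change over these images: {net_change:+.3f}")
```

Among these images, two improve and five get slightly worse, and most of the net reduction comes from a single image (`hathi_book_p000087.jpg`). Because generation uses sampling (`temperature=1.5`), differences of this size may also reflect run-to-run variation rather than an effect of training.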