# Day 5 - Benchmarking Pre-Trained Arabic OCR Models
### Objective: Exercise on Benchmarking Pre-Trained Arabic OCR Models
### Dataset: Labeled Handwritten Arabic Words
### Please fill in all sections that start with "# Task"; sections that start with "# Step" are pre-implemented.

#### Section 1 - Dependencies & Libraries

In [None]:
# Step 1.1. - Install required libraries
!pip3 install numpy pandas matplotlib opencv-python opencv-contrib-python python-Levenshtein easyocr pytesseract "transformers[torch]==4.26.0" torch torchvision

# Step 1.2. - Restart Kernel Manually
# Toolbar -> Kernel -> Restart & Clear Output -> Restart & Clear All Outputs

In [1]:
# Step 1.3. - Import required libraries
import matplotlib.pyplot as plt, numpy as np, glob, cv2, pandas as pd, easyocr, time, pytesseract
from Levenshtein import distance as lev
from transformers import TrOCRProcessor, VisionEncoderDecoderModel


#### Section 2 - Read Dataset

In [2]:
# Step 2.1. Load Images and Label Files into Pandas DataFrame
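import os
dataset_path = "Labeled_Handwritten_Arabic_Words/"
# Use the file name without its extension as the sample id (works with both "\\" and "/" path separators)
id_from_path = lambda x : os.path.splitext(os.path.basename(x))[0]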
images = [{"id": id_from_path(path), "img": cv2.resize(cv2.imread(path), (128, 64), interpolation=cv2.INTER_LINEAR)} for path in glob.glob(f"{dataset_path}/images/*.jpg")]
labels = [{"id": id_from_path(path), "label": open(path, encoding="utf8").read().strip()} for path in glob.glob(f"{dataset_path}/labels/*.txt")]
df = pd.merge(pd.DataFrame(images), pd.DataFrame(labels), on='id')
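
`pd.merge` performs an inner join by default, so an image without a matching label file (or a label without an image) is dropped from `df` without any warning. A minimal sketch of a sanity check that could run before the merge; `find_unmatched_ids` and the sample ids are hypothetical and only illustrate the idea:

```python
def find_unmatched_ids(image_ids, label_ids):
    """Return (ids with an image but no label, ids with a label but no image)."""
    image_ids, label_ids = set(image_ids), set(label_ids)
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)


# Hypothetical ids standing in for the "id" values built in Step 2.1
missing_labels, missing_images = find_unmatched_ids(
    ["w001", "w002", "w003"],
    ["w002", "w003", "w004"],
)
print("Images without a label:", missing_labels)
print("Labels without an image:", missing_images)
```

With the notebook's own lists, the call would be `find_unmatched_ids([r["id"] for r in images], [r["id"] for r in labels])`.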

#### Section 3 - Benchmark EasyOCR - Calculating the Character Error Rate and Prediction Run-time
#### More Information available @ https://pypi.org/project/easyocr/

In [3]:
# Task 3.1. Instantiate EasyOCR Reader for Arabic Language
reader = easyocr.Reader(['ar'])

# Task 3.2. Store Current Time using time.time
st = time.time()

# Task 3.3. Define variables to calculate CER; two variables 
#          s_d_i captures the number of character changes required 
#          c captures the total number of characters within the ground truth
s_d_i = 0
c = 0

# Task 3.4. Iterate through a sample of 50 images and labels
for i, row in df.iloc[:50].iterrows():
    # Task 3.5. Store the recognized arabic text in a variable
    result = reader.readtext(row['img'])
    pred_text = result[0][1] if result else ''
    # Task 3.6. Use Levenshtein (lev) function to calculate the number of changes required between prediction and truth
    # Adding the value to s_d_i
    s_d_i += lev(row['label'], pred_text)
    # Task 3.7. Add number of characters in the original label to c
    c += len(row['label'])

# Task 3.8. Print the Calculated CER and Time Taken
# Note: the printed ratio is s_d_i / (s_d_i + c); the conventional CER is s_d_i / c
print(f"Character Error Rate: {s_d_i/(s_d_i+c)} , Time Taken (Seconds): {time.time()-st}")

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


Character Error Rate: 0.37327188940092165 , Time Taken (Seconds): 4.8026206493377686
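
The value printed above is `s_d_i / (s_d_i + c)`, whereas the Character Error Rate is conventionally defined as the total edit distance divided by the number of ground-truth characters, `s_d_i / c`. The two are related: if `r` is the printed ratio, then `s_d_i / c = r / (1 - r)`. The sketch below illustrates both, using a small pure-Python edit distance as a stand-in for `Levenshtein.distance`; the function names and the sample word pairs are hypothetical:

```python
def edit_distance(truth, pred):
    """Levenshtein distance via dynamic programming (stand-in for Levenshtein.distance)."""
    previous = list(range(len(pred) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, p_char in enumerate(pred, start=1):
            cost = 0 if t_char == p_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def conventional_cer(pairs):
    """Total edit distance divided by total ground-truth length."""
    s_d_i = sum(edit_distance(truth, pred) for truth, pred in pairs)
    c = sum(len(truth) for truth, _ in pairs)
    return s_d_i / c


def printed_ratio_to_cer(ratio):
    """Convert r = s_d_i / (s_d_i + c) into s_d_i / c."""
    return ratio / (1 - ratio)


# Hypothetical (label, prediction) pairs, not taken from the dataset
pairs = [("كتاب", "كتب"), ("قلم", "قلم")]
print(conventional_cer(pairs))
print(printed_ratio_to_cer(0.37327188940092165))
```

Under this conversion the 0.373 printed for EasyOCR corresponds to a conventional CER of roughly 0.596. Unlike the printed ratio, the conventional CER can exceed 1 when a model emits many more characters than the label contains.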


#### Section 4 - Benchmark UBC-NLP/ArOCR-handwritting-v2 - Calculating the Character Error Rate and Prediction Run-time
#### More Information available @ https://huggingface.co/UBC-NLP/ArOCR-handwritting-v2

In [4]:
# Task 4.1. Instantiate UBC-NLP/ArOCR-handwritting-v2 using TrOCRProcessor and VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained('UBC-NLP/ArOCR-handwritting-v2')
model = VisionEncoderDecoderModel.from_pretrained('UBC-NLP/ArOCR-handwritting-v2')

# Task 4.2. Store Current Time using time.time
st = time.time()

# Task 4.3. Define variables to calculate CER; two variables 
#          s_d_i captures the number of character changes required 
#          c captures the total number of characters within the ground truth
s_d_i = 0
c = 0

# Task 4.4. Iterate through a sample of 50 images and labels
for i, row in df.iloc[:50].iterrows():
    # Task 4.5. Store the recognized arabic text in a variable
    pixel_values = processor(images=row['img'], return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_length=16)
    pred_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Task 4.6. Use Levenshtein (lev) function to calculate the number of changes required between prediction and truth
    # Adding the value to s_d_i
    s_d_i += lev(row['label'], pred_text)
    # Task 4.7. Add number of characters in the original label to c
    c += len(row['label'])

# Task 4.8. Print the Calculated CER and Time Taken
# Note: the printed ratio is s_d_i / (s_d_i + c); the conventional CER is s_d_i / c
print(f"Character Error Rate: {s_d_i/(s_d_i+c)} , Time Taken (Seconds): {time.time()-st}")

Character Error Rate: 0.5915915915915916 , Time Taken (Seconds): 84.89572191238403
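
The cell above reports a single total for 50 images, which hides per-image variation and any one-off warm-up cost on the first call. A minimal sketch of per-call timing, assuming the model call is wrapped in a function; `time_predictions` and `fake_predict` are hypothetical names, and `time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring intervals:

```python
import statistics
import time


def time_predictions(predict, inputs):
    """Run predict on each input and return (predictions, per-call latencies in seconds)."""
    predictions, latencies = [], []
    for item in inputs:
        start = time.perf_counter()
        predictions.append(predict(item))
        latencies.append(time.perf_counter() - start)
    return predictions, latencies


# Stand-in predictor; in the notebook this would wrap the processor/model calls from Task 4.5
def fake_predict(image):
    return str(image).upper()


predictions, latencies = time_predictions(fake_predict, ["a", "b", "c"])
print(f"total={sum(latencies):.6f}s  mean={statistics.mean(latencies):.6f}s  "
      f"median={statistics.median(latencies):.6f}s  max={max(latencies):.6f}s")
```

Comparing the median with the maximum gives a quick indication of whether a few slow calls dominate the total.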


#### Section 5 - Benchmark gagan3012/ArOCR - Calculating the Character Error Rate and Prediction Run-time
#### More Information available @ https://huggingface.co/gagan3012/ArOCR

In [5]:
# Task 5.1. Instantiate gagan3012/ArOCR using TrOCRProcessor and VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained('gagan3012/ArOCR')
model = VisionEncoderDecoderModel.from_pretrained('gagan3012/ArOCR')

# Task 5.2. Store Current Time using time.time
st = time.time()

# Task 5.3. Define variables to calculate CER; two variables 
#          s_d_i captures the number of character changes required 
#          c captures the total number of characters within the ground truth
s_d_i = 0
c = 0

# Task 5.4. Iterate through a sample of 50 images and labels
for i, row in df.iloc[:50].iterrows():
    # Task 5.5. Store the recognized arabic text in a variable
    pixel_values = processor(images=row['img'], return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, max_length=16)
    pred_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # Task 5.6. Use Levenshtein (lev) function to calculate the number of changes required between prediction and truth
    # Adding the value to s_d_i
    s_d_i += lev(row['label'], pred_text)
    # Task 5.7. Add number of characters in the original label to c
    c += len(row['label'])

# Task 5.8. Print the Calculated CER and Time Taken
# Note: the printed ratio is s_d_i / (s_d_i + c); the conventional CER is s_d_i / c
print(f"Character Error Rate: {s_d_i/(s_d_i+c)} , Time Taken (Seconds): {time.time()-st}")

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


Character Error Rate: 0.6738609112709832 , Time Taken (Seconds): 18.45766305923462


#### Section 6 - Benchmark PyTesseract - Calculating the Character Error Rate and Prediction Run-time
#### More Information available @ https://github.com/tesseract-ocr/tesseract

In [6]:
# Step 6.1. Instantiate PyTesseract Library
# Follow Guide on Tesseract Installation @ https://medium.com/@marioruizgonzalez.mx/how-install-tesseract-orc-and-pytesseract-on-windows-68f011ad8b9b
# Path to the Tesseract executable on Windows; adjust if it is installed elsewhere
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Task 6.2. Store Current Time using time.time
st = time.time()

# Task 6.3. Define variables to calculate CER; two variables 
#          s_d_i captures the number of character changes required 
#          c captures the total number of characters within the ground truth
s_d_i = 0
c = 0

# Task 6.4. Iterate through a sample of 50 images and labels
for i, row in df.iloc[:50].iterrows():
    # Task 6.5. Store the recognized arabic text in a variable
    pred_text = pytesseract.image_to_string(row['img'], lang='ara').replace('\n', '')
    # Task 6.6. Use Levenshtein (lev) function to calculate the number of changes required between prediction and truth
    # Adding the value to s_d_i
    s_d_i += lev(row['label'], pred_text)
    # Task 6.7. Add number of characters in the original label to c
    c += len(row['label'])

# Task 6.8. Print the Calculated CER and Time Taken
# Note: the printed ratio is s_d_i / (s_d_i + c); the conventional CER is s_d_i / c
print(f"Character Error Rate: {s_d_i/(s_d_i+c)} , Time Taken (Seconds): {time.time()-st}")

Character Error Rate: 0.4581673306772908 , Time Taken (Seconds): 1.1063807010650635
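
The four results printed in Sections 3 to 6 can be collected into one table for comparison. A minimal sketch that hard-codes the figures reported by those cells (rounded) and sorts them by the printed error ratio; the `results` structure is only illustrative:

```python
# (engine, printed error ratio, seconds for 50 images), copied from the cell outputs above
results = [
    ("EasyOCR", 0.373, 4.80),
    ("UBC-NLP/ArOCR-handwritting-v2", 0.592, 84.90),
    ("gagan3012/ArOCR", 0.674, 18.46),
    ("PyTesseract", 0.458, 1.11),
]

# Lower ratio is better, so sort in ascending order
ranked = sorted(results, key=lambda r: r[1])
print(f"{'Engine':<32}{'Ratio':>8}{'Seconds':>10}")
for name, ratio, seconds in ranked:
    print(f"{name:<32}{ratio:>8.3f}{seconds:>10.2f}")
```

On this 50-image sample, EasyOCR has the lowest printed ratio and PyTesseract the shortest run time. With a sample this small and all timings taken on CPU, the ranking is indicative only.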
