# Assessing the Accuracy of the TrOCR Model Using the IAM Dataset

In this project, I evaluate TrOCR, a recent Transformer-based optical character recognition (OCR) model introduced in this paper: https://arxiv.org/pdf/2109.10282

I begin by applying the TrOCR model to some handwritten text of my own. I determine the effectiveness of the model both without noise and in the presence of Gaussian and Poisson noise.

Next, I feed the model text from the well-known IAM Handwriting Database. I assess how accurately the model identifies written words both before and after noise is added.

## Loading Libraries

In [3]:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image, ImageDraw
import requests
import pandas as pd
import numpy as np
import cv2
import random
import os

## Loading Pretrained Model and Applying to First Image

In [7]:
# Load a trial image from the IAM database
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image.show()

In [9]:
# Access Pretrained TrOCR Model
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

Config of the encoder: <class 'transformers.models.vit.modeling_vit.ViTModel'> is overwritten by shared encoder config: ViTConfig {
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "qkv_bias": false,
  "transformers_version": "4.47.0"
}

Config of the decoder: <class 'transformers.models.trocr.modeling_trocr.TrOCRForCausalLM'> is overwritten by shared decoder config: TrOCRConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_cross_attention": true,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "cross_attention_hidden_size": 768,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder

In [11]:
pixel_values = processor(images=image, return_tensors="pt").pixel_values

In [13]:
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [15]:
print(generated_text)

indus the


As we can see, the handwritten word (industrie) is somewhat difficult to read, and the model recovers only part of it, returning "indus the". Next, we give the model more neatly written text. Then, we determine whether the model can still read that text in the presence of noise.

## Applying Model to Image Used in Project Proposal

In [17]:
# Loading in Personal Image 
personalImage = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/ProjectProposal/brothersGonnaWorkIt.jpg").convert("RGB")
personalImage.show()

In [19]:
pixel_values = processor(images=personalImage, return_tensors="pt").pixel_values

In [21]:
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

The brothers gonna work it out


Running TrOCR works exactly as it should! Very exciting! Indeed, this system is already working better than the model applied in the project proposal (the previous model, PaddleOCR, read the "g" as a "q"). This is a great start, but let's see how this model performs in the presence of noise.

## Helper Functions - Altering Images, Adding Noise, Checking Arrays

Below, we write out the functions for adding noise to our text. This was done in the project proposal, but we clean up the code here so that others can apply it more easily.

### Changing Image to Grayscale (if necessary)

The following code converts the text image to grayscale (all of the text in the IAM dataset is already grayscale, but this step may be necessary if the user wants to apply TrOCR to their own images).

In [23]:
def converting_grayscale(image):
    image = image.convert("L")
    return image

### Adding Gaussian Noise

In [25]:
def gauss_noise(image, variance):
    # Adds zero-mean Gaussian noise with the given variance to a NumPy image array.
    # Noise is drawn independently for every pixel (and every color channel, if present).
    mean = 0
    sigma = variance**0.5
    gauss = np.random.normal(mean, sigma, image.shape)
    # The result is a float array whose values may fall outside [0, 255]
    noisy = image + gauss
    return noisy
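
To put the `variance` argument in perspective, here is a minimal sketch (not part of the original notebook) that estimates how many pixel values fall outside the 8-bit range after noise is added. The synthetic mid-gray image, the seeded generator, and the helper name `saturated_fraction` are assumptions made for illustration; the variances are the ones used in this notebook.

```python
import numpy as np

def saturated_fraction(image, variance, rng):
    """Fraction of values pushed outside [0, 255] by zero-mean Gaussian noise."""
    sigma = variance ** 0.5
    noisy = image + rng.normal(0.0, sigma, image.shape)
    return float(np.mean((noisy < 0) | (noisy > 255)))

rng = np.random.default_rng(0)
# Hypothetical stand-in for a photo: a uniform mid-gray RGB image
image = np.full((64, 64, 3), 128.0)

for variance in (6400, 10000, 100000):
    frac = saturated_fraction(image, variance, rng)
    print(f"variance={variance}: sigma={variance ** 0.5:.1f}, saturated={frac:.2f}")
```

At a variance of 100000 the standard deviation is about 316, which is wider than the whole 0 to 255 range, so in this synthetic example most values end up clipped to pure black or white once the image is saved in an 8-bit format.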

## Applying Model to Noisy Image 

First, we add Gaussian noise using the function defined above; the noise is drawn independently for each pixel in each color channel.

In [27]:
# Adding Gaussian noise to the personal image and saving the result to disk
personalImage = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/ProjectProposal/brothersGonnaWorkIt.jpg").convert("RGB")
personalImage_GaussNoise = gauss_noise(np.array(personalImage), 100000)
cv2.imwrite('personalImage_GaussNoise.jpg', personalImage_GaussNoise)

True

Next, we reopen the noisy image with the Pillow library.

In [29]:
personalImage_GaussNoise = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/personalImage_GaussNoise.jpg")
personalImage_GaussNoise.show()

Finally, we can assess model performance on this image.

In [31]:
pixel_values = processor(images=personalImage_GaussNoise, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [33]:
print(generated_text)

The brothers gave back it out


As we can see, when the noise is sufficiently strong, the model begins to make mistakes: here it reads "gonna work" as "gave back". That said, it is still performing *much* better than the model that was tested for the project proposal. Next, we begin reading in text from the IAM dataset.

## Checking Whether a Threshold Fraction of Array Elements Match

In [35]:
# This function checks whether the number of elements of output_array found in correct_array,
# divided by the length of correct_array, is at least threshold (a fraction such as 0.7)
def check_arrays(correct_array, output_array, threshold):
    correct = 0
    for element in output_array:
        if element in correct_array:
            correct += 1
    return (correct / len(correct_array)) >= threshold
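
As a quick worked example of how this check behaves, the sketch below reuses the two transcriptions of the personal image obtained earlier. The function body is repeated only so that the snippet runs on its own, and the last call uses made-up word lists to show one caveat of this matching rule.

```python
def check_arrays(correct_array, output_array, threshold):
    # Same logic as the notebook's function, repeated so the example is self-contained
    correct = 0
    for element in output_array:
        if element in correct_array:
            correct += 1
    return (correct / len(correct_array)) >= threshold

truth = "The brothers gonna work it out".split(" ")
noisy_output = "The brothers gave back it out".split(" ")

# A perfect transcription passes
print(check_arrays(truth, truth, 0.7))         # True
# 4 of the 6 output words appear in the truth, and 4/6 is below 0.7
print(check_arrays(truth, noisy_output, 0.7))  # False

# Caveat: a repeated output word is counted every time it occurs
print(check_arrays(["the", "cat", "sat"], ["the", "the", "the"], 0.7))  # True
```

Because word order and repetitions are ignored, this is a lenient measure; it should be read as an approximate sentence-level accuracy rather than an exact match.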

## Reading in Text from the IAM Dataset

### IAM Image without Noise

In [37]:
firstIMAImage = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/words/b02/b02-013/b02-013-08-02.png").convert("RGB")
firstIMAImage.show()

In [39]:
pixel_values = processor(images=firstIMAImage, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [41]:
print(generated_text)

should


The model correctly identified the text. Next, we determine how the model performs on this image in the presence of noise. After that, we can begin to apply the model to a larger dataset to assess the model's accuracy.

### IAM Image with Noise

In [43]:
firstIMAImage_noise = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/words/b02/b02-013/b02-013-08-02.png").convert("RGB")
firstIMAImage_noise.show()

In [45]:
firstIMAImage_noise = gauss_noise(np.array(firstIMAImage_noise), 10000)
cv2.imwrite('firstIMAImage_noise.jpg', firstIMAImage_noise)

True

In [47]:
firstIMAImage_noise = Image.open("/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/firstIMAImage_noise.jpg").convert("RGB")

In [57]:
pixel_values = processor(images=firstIMAImage_noise, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [59]:
print(generated_text)

of the team's life


The model *completely* failed with this level of noise. We now proceed to the thrust of this project: we will determine the model performance under different levels of noise.

## Reading in Directory with Correct Sentences and Labels

In [61]:
flatten_directory = "/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/sentences_flatten"
corrupted_directory = "/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/sentences_corrupted"

In [63]:
# The following function selects n files randomly (without replacement) from a given path
def select_random_file(directory, n=1000):
    files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]
    if not files:
        return None
    return np.random.choice(files, n, replace=False)
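
One possible variant of this sampling step is sketched below. It is not the notebook's function: the name `sample_files`, the `extension` filter, and the `seed` argument are assumptions added for illustration. Filtering by extension keeps stray files (for example, previously written `.jpg` images) out of the sample, and a seeded generator makes the selection repeatable across runs.

```python
import os
import tempfile
import numpy as np

def sample_files(directory, n, extension=".png", seed=0):
    """Return n distinct file names with the given extension, chosen reproducibly."""
    files = sorted(
        f for f in os.listdir(directory)
        if f.endswith(extension) and os.path.isfile(os.path.join(directory, f))
    )
    if len(files) < n:
        raise ValueError(f"requested {n} files but only {len(files)} match")
    rng = np.random.default_rng(seed)
    # Convert NumPy strings back to plain Python strings
    return [str(f) for f in rng.choice(files, size=n, replace=False)]

# Demonstration on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a01-000u-s00-00.png", "a01-000u-s00-01.png", "stray.jpg"):
        open(os.path.join(tmp, name), "w").close()
    picked = sample_files(tmp, n=2)
    print(sorted(picked))
```

Sorting the directory listing before sampling matters here, since `os.listdir` returns names in arbitrary order and the seed alone would not guarantee the same selection.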

In [65]:
# The following stores the random selection of files
my_file_list = select_random_file(flatten_directory)

## Applying Model to IAM Handwritten Dataset, No Noise

In [67]:
# The following maps each sentence ID to its list of correct words, based on ascii/sentences_manipulate.txt
sentence_oracle_dir = "/Users/chandle/Desktop/STATS507/Data_Project/FinalProject_chandle/STATS507_FinalProject/ascii/sentences_manipulate.txt"
file_and_correct_sentence = {}
# The with-block closes the file automatically, so no explicit close() is needed
with open(sentence_oracle_dir, 'r') as file:
    for line in file:
        file_name = line.split(" ")[0]
        # The last space-separated field holds the sentence, with words joined by "|"
        correct_sentence = (" ".join(line.split(" ")[-1].split("|"))).replace("\n", "")
        file_and_correct_sentence[file_name] = correct_sentence.split(" ")
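
To illustrate what this parsing produces, the sketch below applies the same split logic to a single made-up line. The layout of the line (an ID first, the `|`-separated transcription last) is an assumption inferred from how the code indexes the fields; the actual contents of `sentences_manipulate.txt` are not shown here, and `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split one label line into (sentence ID, list of words)."""
    fields = line.split(" ")
    file_name = fields[0]
    # The transcription is the last field, with words separated by "|"
    words = fields[-1].replace("\n", "").split("|")
    return file_name, words

# Hypothetical line: an ID, some metadata fields, then the transcription
sample = "x00-000-s00-00 0 ok 154 The|brothers|gonna|work|it|out\n"
file_name, words = parse_line(sample)
print(file_name)  # x00-000-s00-00
print(words)      # ['The', 'brothers', 'gonna', 'work', 'it', 'out']
```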

In [69]:
correct = 0
for i in range(len(my_file_list)):
    
    path_to_pic = flatten_directory + '/' + my_file_list[i]
    readingImage = Image.open(path_to_pic).convert("RGB")
    pixel_values = processor(images=readingImage, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    my_file_list_final = my_file_list[i].replace(".png", "")
    correct += check_arrays(file_and_correct_sentence[my_file_list_final], generated_text.split(" "), 0.7)


KeyError: 'd06-020-s03-03_corrupt_var=6400 .jpg'

In [60]:
print(correct)

897


## Applying Model in the Presence of Gaussian Noise, $\text{Noise} \sim \mathcal{N}(0, \sigma^2)$

Having measured the proportion of sentences the model reads correctly without noise, we now determine its performance in the presence of noise. The pipeline is the same as in the previous section, except that each image is corrupted with Gaussian noise before it is passed to the model. We steadily increase the noise and record the deterioration in model performance.

In [1]:
correct = 0
var = 6400
for i in range(len(my_file_list)):

    # Reads in original image
    path_to_pic = flatten_directory + '/' + my_file_list[i]
    readingImage = Image.open(path_to_pic).convert("RGB")

    # Writes Corrupted Image
    corrupt_path = corrupted_directory + '/' + my_file_list[i].replace(".png", "") + f"_corrupt_var={var}" + ".jpg"
    temp_corrupt = gauss_noise(np.array(readingImage), var)
    cv2.imwrite(corrupt_path, temp_corrupt)
    
    # Read in Corrupted Image
    corruptedImage = Image.open(corrupt_path).convert("RGB")
    
    # Process Corrupted Image
    pixel_values = processor(images=corruptedImage, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    my_file_list_final = my_file_list[i].replace(".png", "")
    correct += check_arrays(file_and_correct_sentence[my_file_list_final], generated_text.split(" "), 0.7)


NameError: name 'my_file_list' is not defined

In [None]:
print(correct)
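
The single-variance loop above could be generalized to a sweep over several noise levels. The following is a minimal sketch under stated assumptions, not the notebook's pipeline: `accuracy_under_noise` and `toy_recognize` are hypothetical names, the recognizer is a trivial stand-in rather than TrOCR, images stay in memory instead of being written to disk as JPEGs, and scoring uses exact matches rather than `check_arrays`.

```python
import numpy as np

def accuracy_under_noise(images, labels, recognize, variances, seed=0):
    """Return {variance: fraction of images whose recognized text equals the label}."""
    rng = np.random.default_rng(seed)
    results = {}
    for var in variances:
        hits = 0
        for image, label in zip(images, labels):
            noise = rng.normal(0.0, var ** 0.5, image.shape)
            # Clip to the valid 8-bit range before handing the image to the recognizer
            noisy = np.clip(image + noise, 0, 255).astype(np.uint8)
            hits += int(recognize(noisy) == label)
        results[var] = hits / len(images)
    return results

# Stand-in recognizer (NOT TrOCR): labels an image by whether it is mostly bright
def toy_recognize(image):
    return "light" if image.mean() > 127 else "dark"

images = [np.full((32, 32, 3), 200.0), np.full((32, 32, 3), 50.0)]
labels = ["light", "dark"]
print(accuracy_under_noise(images, labels, toy_recognize, [0, 6400]))
```

To use this with the real model, `recognize` would wrap the processor, `model.generate`, and `batch_decode` calls from the cells above, and the equality test could be swapped for `check_arrays` with the 0.7 threshold.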