# Using ML to improve the recognition of Handwritten Ge'ez Characters

## - Motivation:

This project is motivated by the goal of advancing handwritten image recognition for the Ge'ez (Ethiopic) script. The Ge'ez script is the writing system of more than 20 Afro-Asiatic and Nilo-Saharan languages in Ethiopia and Eritrea and has more than 50 million users. It is also the script of Amharic, the official language of the Federal Government of Ethiopia. The Ge'ez script is also culturally valuable because it makes Ethiopia the only country in Africa with its own unique alphabet system. Because of this and because of its large user base, the Ge'ez script needs highly reliable handwriting recognition software for purposes such as preserving ancient historical documents (digital transcription), securely processing bank checks, and quickly sorting postal addresses. This project aims to replicate (and, if possible, improve) the accuracy of existing handwritten Ethiopic recognition programs by combining two kinds of neural networks: a convolutional neural network (CNN) and a recurrent neural network (RNN).

## - Paper used for reference

<a href="https://link.springer.com/article/10.1007/s42452-021-04742-x#Bib1" target="_blank">AHWR-Net: offline handwritten amharic word recognition using convolutional recurrent neural network</a>


## - Dataset Used

In this project I use the <a href="https://sites.google.com/view/hawdb-v1?pli=1" target="_blank">HARD-1 dataset</a>, a publicly available dataset for handwritten Ge'ez recognition collected from native Amharic speakers and writers. The dataset contains 33,672 labeled handwritten Amharic word images: 12,064 original handwritten images and 21,608 images augmented from the originals by randomly applying transformations such as rotation, shifting, shrinking, expanding, degrading, and varying amounts of Gaussian noise and blurring. The images are single-channel grayscale images of 32 × 128 pixels. The images in the HARD-1 training and testing sets have been manually labeled (numbered) using the table below. Each input image contains no more than 11 Ge'ez characters (letters, numbers, punctuation, etc.).

#### Table: Ge'ez characters used for this project 
![Table](images/ethiopic_numbering.png)


## - Final Progress of Project

### Importing Datasets

- Successfully imported the training, validation, and testing datasets
- Displayed some of their properties, such as their number of dimensions, shape, and size

### Visualizing inputs

- Generated random numbers to select several random inputs from the training dataset X_train
- Since my input data was stored in NumPy arrays, I was able to convert them into PNG files using cv2 and display them with their corresponding labels
- Verified labels using the above table

### Created DataLoader objects 

- Created three DataLoader objects from existing datasets using batch size 64
- Tried to replicate the DataLoader objects used by the CNN model in our textbook

### Building a CNN model

- To make my training process faster, I built a CNN model with 4 convolutional layers
- The input size is [64, 1, 32, 128], corresponding to [batch_size, channels, H, W]
- The CNN converts this input into a feature map of size [64, 512, 1, 32]
    - This gives 512 feature maps of size 1 × 32 for each sample
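The reduction from 32 × 128 to 1 × 32 comes entirely from the pooling layers, because every convolution in the CNN uses `kernel_size=3` with `padding=1` and therefore keeps H and W unchanged. A minimal sketch (the helper names `pooled_size` and `trace_shape` are illustrative, not part of the project code) that applies the `MaxPool2d` output-size formula, `(L - k) // s + 1` with no padding and the stride defaulting to the kernel size:

```python
def pooled_size(size, kernel, stride=None):
    """Output length of MaxPool2d along one axis (padding=0, dilation=1, ceil_mode=False)."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

def trace_shape(height, width, pools):
    """Apply a list of (kernel_h, kernel_w) poolings and record (H, W) after each one."""
    shapes = [(height, width)]
    for kernel_h, kernel_w in pools:
        height = pooled_size(height, kernel_h)
        width = pooled_size(width, kernel_w)
        shapes.append((height, width))
    return shapes

# pool1 and pool2 use kernel 2 (halve both axes); pool3, pool4, pool5 use (2, 1) (halve height only)
POOLS = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]

shapes = trace_shape(32, 128, POOLS)
print(shapes)  # [(32, 128), (16, 64), (8, 32), (4, 32), (2, 32), (1, 32)]
```

Five height-halving poolings take H from 32 to 1, while only two of them halve the width, leaving 32 columns to serve as time steps for the RNN.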

### Preparing CNN output for RNN input

- Converted the feature map into a sequence of length 32 with 512 features per step by doing column-wise concatenation for each sample
- The [64, 512, 1, 32] tensor is converted to a sequence of size [64, 32, 512]
- It took me a long time to understand how to convert the output of my CNN model into something that can be used by a Bi-LSTM model
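Column-wise concatenation of a [64, 512, 1, 32] feature map amounts to dropping the size-1 height axis and swapping the channel and width axes. The sketch below uses a random tensor as a stand-in for the real CNN output and checks that a single `squeeze`/`permute` gives the same result as an explicit per-sample, per-column loop:

```python
import torch

# Stand-in feature map with the CNN's output shape [batch, channels, height=1, width=32]
feature_map = torch.randn(64, 512, 1, 32)

# Vectorized conversion: drop the height axis, then make each column one time step
sequence = feature_map.squeeze(2).permute(0, 2, 1).contiguous()  # [64, 32, 512]

# Explicit loop over samples (b) and columns (t), for comparison
looped = torch.stack([
    torch.stack([feature_map[b, :, 0, t] for t in range(feature_map.shape[3])])
    for b in range(feature_map.shape[0])
])

print(sequence.shape)                 # torch.Size([64, 32, 512])
print(torch.equal(sequence, looped))  # True
```

Both versions place channel `c` of column `t` at position `[b, t, c]`; the vectorized form avoids Python-level loops and keeps each sample's own data.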

### Building a Bi-LSTM model

- Created a bidirectional Long Short-Term Memory (Bi-LSTM) RNN model that has 2 layers, a hidden size of 512, and dropout


### Creating a combined Model

- Despite my best efforts, I was not able to replicate the findings of the paper I referenced without running into numerous issues
    - Problems faced
        - I found converting the feature map output of my CNN model into sequential data really challenging
        - I had trouble merging my two neural network models into a single model that can be trained using a single loss function for parameter tuning
        - I had difficulty interpreting the feature extraction and sequence encoding process used by the researchers in the study
        (there is no guide to how they built their models; they only mention what the models were. Because of this I spent a lot of time debugging code trying to replicate their findings rather than making meaningful progress on my project)
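One common way to address the second problem above is to make the CNN and the Bi-LSTM submodules of a single `nn.Module`, so one optimizer and one loss update both. The sketch below is hypothetical (the class name `CRNN` and the output size of 301, assumed to be 300 character classes plus one CTC blank, are illustrations, not the paper's architecture); it reuses the notebook's CNN layer order and emits one class distribution per image column instead of only the last time step:

```python
import torch
import torch.nn as nn


class CRNN(nn.Module):
    """Hypothetical CNN + Bi-LSTM model that outputs one class distribution per image column."""

    def __init__(self, num_outputs, hidden_size=512, num_layers=2, dropout=0.5):
        super().__init__()

        def conv_block(c_in, c_out):
            # 3x3 convolution with padding=1 keeps H and W; BatchNorm + ReLU as in the notebook CNN
            return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.BatchNorm2d(c_out), nn.ReLU()]

        # Same layer order as the notebook CNN: [B, 1, 32, 128] -> [B, 512, 1, 32]
        self.cnn = nn.Sequential(
            *conv_block(1, 64), nn.MaxPool2d(2),
            *conv_block(64, 128), nn.MaxPool2d(2),
            *conv_block(128, 256), nn.MaxPool2d((2, 1)), nn.MaxPool2d((2, 1)),
            *conv_block(256, 512), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(512, hidden_size, num_layers, batch_first=True,
                           bidirectional=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size * 2, num_outputs)

    def forward(self, images):
        features = self.cnn(images)                                   # [B, 512, 1, 32]
        sequence = features.squeeze(2).permute(0, 2, 1).contiguous()  # [B, 32, 512]
        out, _ = self.rnn(sequence)                                   # [B, 32, 2 * hidden_size]
        logits = self.fc(self.dropout(out))                           # [B, 32, num_outputs]
        # Return log-probabilities as [time, batch, classes], the layout nn.CTCLoss expects
        return logits.log_softmax(dim=2).permute(1, 0, 2)


crnn = CRNN(num_outputs=301)  # assumption: 300 character classes + 1 CTC blank
log_probs = crnn(torch.randn(4, 1, 32, 128))
print(log_probs.shape)  # torch.Size([32, 4, 301])
```

Because both networks are registered submodules, `crnn.parameters()` yields the CNN and LSTM weights together, so `torch.optim.Adam(crnn.parameters())` trains the whole pipeline end to end.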



### Results and Conclusion

- Because of the numerous problems I faced building my program, I was not able to draw any meaningful conclusions with regard to the aim of my project
(other than that building several neural network models from scratch is really hard)
- On the other hand, I have learned a lot about image processing with machine learning models, and I hope to use what I have learned to keep building on this project
- Despite the challenges I faced, this project has widened (exponentially!) my knowledge of visual/image processing



### Future Work 

- Since I have put a lot of time and energy into this project, and because I am interested in its potential outcome, I plan on uploading my code to GitHub and continuing to work on it until I manage to create a program that does what I hoped to achieve (improve the recognition of handwritten Ge'ez characters).
- Hopefully someday I will manage to build a program that improves upon the work of the paper I cited (I plan on contacting Professor Lee Spector if I manage to do so)!

### Importing the HARD-1 (Handwritten Amharic text Recognition Dataset) Training, Validation and Testing Datasets

```python
import numpy as np

X_train = np.load("HARD-1/X_train.npy")
y_train = np.load("HARD-1/y_train.npy")
X_val = np.load("HARD-1/X_val.npy")
y_val = np.load("HARD-1/y_val.npy")
X_test = np.load("HARD-1/X_test.npy")
y_test = np.load("HARD-1/y_test.npy")

print("X_train (", "ndim:", X_train.ndim, " shape:", X_train.shape, " size:", X_train.size , ")")
print("y_train (", "ndim:", y_train.ndim, " shape:", y_train.shape, " size:", y_train.size, ") \n")
print("X_val (", "ndim:", X_val.ndim, " shape:", X_val.shape, " size:", X_val.size , ")")
print("y_val (", "ndim:", y_val.ndim, " shape:", y_val.shape, " size:", y_val.size, ") \n")
print("X_test (", "ndim:", X_test.ndim, " shape:", X_test.shape, " size:", X_test.size , ")")
print("y_test (", "ndim:", y_test.ndim, " shape:", y_test.shape, " size:", y_test.size, ")")
```

```
X_train ( ndim: 4  shape: (9622, 32, 128, 1)  size: 39411712 )
y_train ( ndim: 2  shape: (9622, 11)  size: 105842 )

X_val ( ndim: 4  shape: (1202, 32, 128, 1)  size: 4923392 )
y_val ( ndim: 2  shape: (1202, 11)  size: 13222 )

X_test ( ndim: 4  shape: (1200, 32, 128, 1)  size: 4915200 )
y_test ( ndim: 2  shape: (1200, 11)  size: 13200 )
```


### Visualizing the input and labels of our dataset 

```python
from numpy.random import seed
from numpy.random import randint
# seed the random number generator
seed(1)
# generate three random indices into the training set
rand_numbers = randint(0, len(X_train), 3)

# use the random indices to get random words from our dataset
rand_word1 = X_train[rand_numbers[0]]
rand_word2 = X_train[rand_numbers[1]]
rand_word3 = X_train[rand_numbers[2]]


# convert to the uint8 data type used by cv2 (assumes pixel values lie in [0, 1])
rand_word1_img = rand_word1 * 255
rand_word1_img = rand_word1_img.astype(np.uint8)
rand_word2_img = rand_word2 * 255
rand_word2_img = rand_word2_img.astype(np.uint8)
rand_word3_img = rand_word3 * 255
rand_word3_img = rand_word3_img.astype(np.uint8)

# convert the NumPy arrays to images using cv2.imwrite
# images used in the description section of this project
# import cv2
# cv2.imwrite("images/rand_amharic_word1.png", rand_word1_img)
# cv2.imwrite("images/rand_amharic_word2.png", rand_word2_img)
# cv2.imwrite("images/rand_amharic_word3.png", rand_word3_img)
```
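The `* 255` conversion above assumes the stored pixel values lie in [0, 1]; the notebook does not print the value range, so that is an assumption. A slightly more defensive sketch (the helper name `to_uint8_image` is hypothetical) clips out-of-range values, rounds instead of truncating, and drops the channel axis so the result is a plain 2-D grayscale image:

```python
import numpy as np

def to_uint8_image(word):
    """Convert one sample shaped (32, 128, 1) into a 2-D uint8 image for cv2.imwrite.

    Assumes the stored pixel values are meant to lie in [0, 1].
    """
    img = np.squeeze(word, axis=-1)         # drop the channel axis -> (32, 128)
    img = np.clip(img, 0.0, 1.0) * 255.0    # clip stray values, then scale to [0, 255]
    return np.rint(img).astype(np.uint8)    # round to the nearest integer instead of truncating

img = to_uint8_image(np.random.rand(32, 128, 1))
print(img.shape, img.dtype)  # (32, 128) uint8
```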

##### Fig: Sample images of 3 random inputs (X) with their corresponding labels (y)

Word1 = ![Image1](images/rand_amharic_word1.png)

Word2 = ![Image2](images/rand_amharic_word2.png)

Word3 = ![Image3](images/rand_amharic_word3.png)

```python
print("Word1 Label:", y_train[rand_numbers[0]])
print("Word2 Label:", y_train[rand_numbers[1]])
print("Word3 Label:", y_train[rand_numbers[2]])
```

```
Word1 Label: [150  89 188  43 109  91 114 300 300 300 300]
Word2 Label: [125 245 191 188  67 300 300 300 300 300 300]
Word3 Label: [272 300 300 300 300 300 300 300 300 300 300]
```
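Each label row has 11 entries; judging from the printed labels, the value 300 appears to mark unused positions after the last character (an inference from these outputs, not something stated in the dataset description). A small sketch, with the hypothetical helpers `strip_padding` and `label_lengths`, that recovers the character indices and word lengths under that assumption:

```python
import numpy as np

PAD_VALUE = 300  # assumption: inferred from the trailing 300s in the printed labels

def strip_padding(label, pad_value=PAD_VALUE):
    """Return the character indices in one label row, dropping every padding entry."""
    label = np.asarray(label)
    return label[label != pad_value].tolist()

def label_lengths(labels, pad_value=PAD_VALUE):
    """Number of non-padding characters in each row of an (N, 11) label array."""
    return (np.asarray(labels) != pad_value).sum(axis=1)

labels = np.array([
    [150, 89, 188, 43, 109, 91, 114, 300, 300, 300, 300],
    [125, 245, 191, 188, 67, 300, 300, 300, 300, 300, 300],
    [272, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300],
])
print(strip_padding(labels[0]))  # [150, 89, 188, 43, 109, 91, 114]
print(label_lengths(labels))     # [7 5 1]
```

These per-word lengths are exactly what a sequence loss such as CTC needs alongside the padded label rows.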


### Global Variables

```python
import torch
import torch.utils.data

# device configuration (is_available must be called; the bare function object is always truthy)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
NUM_EPOCHS = 1
BATCH_SIZE = 64
LEARNING_RATE = 0.001
```

### Creating DataLoader objects from the training, validation and testing datasets

```python
# Convert NumPy arrays to PyTorch tensors (images become channels-first: N x 1 x 32 x 128)
X_train_torch = torch.tensor(X_train.transpose(0, 3, 1, 2))
y_train_torch = torch.tensor(y_train.reshape(-1, 1, 11))
X_val_torch = torch.tensor(X_val.transpose(0, 3, 1, 2))
y_val_torch = torch.tensor(y_val.reshape(-1, 1, 11))
X_test_torch = torch.tensor(X_test.transpose(0, 3, 1, 2))
y_test_torch = torch.tensor(y_test.reshape(-1, 1, 11))

# Create custom dataset class
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

# Create dataset objects
train_dataset = MyDataset(X_train_torch, y_train_torch)
val_dataset = MyDataset(X_val_torch, y_val_torch)
test_dataset = MyDataset(X_test_torch, y_test_torch)


# Create dataloader objects
train_dl = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_dl = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
```
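PyTorch already provides `torch.utils.data.TensorDataset`, which indexes several tensors in parallel just like the custom `MyDataset` class above. A minimal sketch using random stand-in arrays with the same layout as `X_train` and `y_train`; it keeps the labels as (N, 11) rather than (N, 1, 11), which is the padded-target layout `nn.CTCLoss` accepts:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins with the same layout as X_train (N, 32, 128, 1) and y_train (N, 11)
X = np.random.rand(100, 32, 128, 1).astype(np.float32)
y = np.random.randint(0, 301, size=(100, 11))

dataset = TensorDataset(
    torch.from_numpy(X).permute(0, 3, 1, 2),  # channels-first images: (N, 1, 32, 128)
    torch.from_numpy(y).long(),               # labels kept as (N, 11)
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)  # torch.Size([64, 1, 32, 128]) torch.Size([64, 11])
print(len(loader))                   # 2 batches: 64 + 36 samples
```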


### Building a CNN model

```python
import torch
import torch.nn as nn

# Build the CNN layer by layer, printing the output shape after each step
model = nn.Sequential()
x = torch.ones((64, 1, 32, 128))  # dummy batch: [batch_size, channels, H, W]
print(model(x).shape)

model.add_module('Conv1', nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1))
model.add_module('BN1', nn.BatchNorm2d(64))
model.add_module('ReLU1', nn.ReLU())
print(model(x).shape)

model.add_module('pool1', nn.MaxPool2d(kernel_size=2))
print(model(x).shape)

model.add_module('Conv2', nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1))
model.add_module('BN2', nn.BatchNorm2d(128))
model.add_module('ReLU2', nn.ReLU())
print(model(x).shape)

model.add_module('pool2', nn.MaxPool2d(kernel_size=2))
print(model(x).shape)

model.add_module('Conv3', nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1))
model.add_module('BN3', nn.BatchNorm2d(256))
model.add_module('ReLU3', nn.ReLU())
print(model(x).shape)

# (2, 1) pooling halves the height only, preserving the width for the sequence steps
model.add_module('pool3', nn.MaxPool2d((2, 1)))
print(model(x).shape)

model.add_module('pool4', nn.MaxPool2d((2, 1)))
print(model(x).shape)

model.add_module('Conv4', nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1))
model.add_module('BN4', nn.BatchNorm2d(512))
model.add_module('ReLU4', nn.ReLU())
print(model(x).shape)

model.add_module('pool5', nn.MaxPool2d((2, 1)))
print(model(x).shape)

feature_map = model(x)
```

```
torch.Size([64, 1, 32, 128])
torch.Size([64, 64, 32, 128])
torch.Size([64, 64, 16, 64])
torch.Size([64, 128, 16, 64])
torch.Size([64, 128, 8, 32])
torch.Size([64, 256, 8, 32])
torch.Size([64, 256, 4, 32])
torch.Size([64, 256, 2, 32])
torch.Size([64, 512, 2, 32])
torch.Size([64, 512, 1, 32])
```


### Converting CNN output to RNN input

```python
# Convert the feature map into a sequence of 32 time steps with 512 features each
# by column-wise concatenation: [64, 512, 1, 32] -> [64, 32, 512].
# squeeze(2) drops the height axis (size 1); permute(0, 2, 1) turns each of the
# 32 columns of every sample into one time step holding its 512 channel values.
sequential_features = feature_map.squeeze(2).permute(0, 2, 1).contiguous()

print(f'Batch Size = {sequential_features.shape[0]}')
print(sequential_features[0].shape)
```

```
Batch Size = 64
torch.Size([32, 512])
```


#### Creating a Bidirectional LSTM model with dropout and 512 hidden units

```python
class BILSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout):
        super(BILSTM, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # dropout=dropout applies dropout between the stacked LSTM layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True,
                            bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size*2, num_classes)  # 2 for bidirection
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Set initial hidden and cell states on the same device as the input
        h0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size, device=x.device)  # 2 for bidirection
        c0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size, device=x.device)

        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))  # out: tensor of shape (batch_size, seq_length, hidden_size*2)

        # Decode the hidden state of the last time step
        out = self.fc(self.dropout(out[:, -1, :]))
        return out
```

```python
my_lstm = BILSTM(512, 512, 2, 11, 0.5)  # input_size must match the 512 features per time step
print(model)  # prints the CNN defined above
# result = my_lstm(torch.randn(64, 32, 512))
```

```
Sequential(
  (Conv1): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (BN1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (ReLU1): ReLU()
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (Conv2): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (BN2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (ReLU2): ReLU()
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (Conv3): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (BN3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (ReLU3): ReLU()
  (pool3): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
  (pool4): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
  (Conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), paddi
```

```python
# class BIGRU(nn.Module):
#     def __init__(self, input_size, hidden_size):
#         super().__init__()
#         # self.rnn = nn.RNN(input_size, hidden_size, num_layers=2, batch_first=True)
#         self.rnn = nn.GRU(input_size, hidden_size, num_layers = 2, batch_first=True, bidirectional=True)
#         # self.rnn = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
#         self.fc = nn.Linear(hidden_size, 1)
#     def forward(self, x):
#         _, hidden = self.rnn(x)
#         out = hidden[-1, :, :] # we use the final hidden state
#         # from the last hidden layer as
#         # the input to the fully connected
#         # layer
#         out = self.fc(out)
#         return out
```

```python
# Creating the LSTM model
lstm_model = BILSTM(512, 512, 2, 11, 0.5)
z = torch.ones((64, 32, 512))  # [batch, seq_length, features], the layout expected with batch_first=True
# lstm_model(z)
```

### Initializing the loss function and optimizer

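```python
loss_fn = nn.CrossEntropyLoss()
# Optimize the CNN and the Bi-LSTM parameters together so both are updated by the same loss
optimizer = torch.optim.Adam(list(model.parameters()) + list(lstm_model.parameters()), lr=LEARNING_RATE)
```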
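`nn.CrossEntropyLoss` applied to the last LSTM time step (as in `BILSTM` above) yields one class per image, while each HARD-1 label is a sequence of up to 11 characters. A loss commonly paired with CNN + RNN text recognizers is Connectionist Temporal Classification (CTC), available as `torch.nn.CTCLoss`. The sketch below is illustrative only: it assumes 300 is the padding value, that character indices are below 300, and that index 0 is reserved for the CTC blank (so labels are shifted by +1); the helper `ctc_targets` is hypothetical, and random log-probabilities stand in for a model's output:

```python
import torch
import torch.nn as nn

PAD_VALUE = 300    # assumption: padding value seen in the HARD-1 label rows
BLANK = 0          # assumption: index 0 is reserved for the CTC blank symbol
NUM_OUTPUTS = 301  # assumption: characters 0-299 shifted to 1-300, plus the blank

ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)

def ctc_targets(y_batch):
    """Turn padded label rows (B, 11) into CTC targets and per-sample target lengths."""
    mask = y_batch != PAD_VALUE
    # shift real characters by +1 so index 0 stays free for the blank; padding slots are ignored
    targets = torch.where(mask, y_batch + 1, torch.zeros_like(y_batch))
    target_lengths = mask.sum(dim=1)
    return targets, target_lengths

# Stand-in model output: log-probabilities shaped [time=32, batch=2, classes]
log_probs = torch.randn(32, 2, NUM_OUTPUTS).log_softmax(dim=2)
y_batch = torch.tensor([[150, 89, 188, 43, 109, 91, 114, 300, 300, 300, 300],
                        [272, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300]])
targets, target_lengths = ctc_targets(y_batch)
# every sample uses all 32 time steps produced by the CNN
input_lengths = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(target_lengths.tolist(), loss.item() > 0)  # [7, 1] True
```

CTC lets the network emit a prediction at each of the 32 columns without knowing where each character starts and ends, which is why the per-time-step output (rather than only the last step) is needed.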

### Creating a function that trains our model using both the CNN and RNN I built and a single loss function

```python
def train(model, num_epochs, train_dl, valid_dl):
    loss_hist_train = [0] * num_epochs
    accuracy_hist_train = [0] * num_epochs
    loss_hist_valid = [0] * num_epochs
    accuracy_hist_valid = [0] * num_epochs
    model.to(device)
    for epoch in range(num_epochs):
        # Training pass: one optimizer step per batch
        model.train()
        for x_batch, y_batch in train_dl:
            x_batch, y_batch = x_batch.float().to(device), y_batch.to(device)
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            loss_hist_train[epoch] += loss.item() * y_batch.size(0)
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            accuracy_hist_train[epoch] += is_correct.sum().item()
        loss_hist_train[epoch] /= len(train_dl.dataset)
        accuracy_hist_train[epoch] /= len(train_dl.dataset)

        # Validation pass on the valid_dl loader, without gradient tracking
        model.eval()
        with torch.no_grad():
            for x_batch, y_batch in valid_dl:
                x_batch, y_batch = x_batch.float().to(device), y_batch.to(device)
                pred = model(x_batch)
                loss = loss_fn(pred, y_batch)
                loss_hist_valid[epoch] += loss.item() * y_batch.size(0)
                is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
                accuracy_hist_valid[epoch] += is_correct.sum().item()
        loss_hist_valid[epoch] /= len(valid_dl.dataset)
        accuracy_hist_valid[epoch] /= len(valid_dl.dataset)

        print(f'Epoch {epoch+1} accuracy: '
              f'{accuracy_hist_train[epoch]:.4f} val_accuracy: '
              f'{accuracy_hist_valid[epoch]:.4f}')
    return loss_hist_train, loss_hist_valid, \
        accuracy_hist_train, accuracy_hist_valid
```
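The accuracy in `train` compares `argmax(pred, dim=1)` with the label tensor, which fits single-class outputs but not per-column sequence outputs. For CTC-style outputs, a common approach is greedy (best-path) decoding followed by exact-match word accuracy. A minimal sketch; the helper names are hypothetical and the blank index 0 is an assumption matching the CTC setup sketched earlier in the notebook:

```python
import torch

BLANK = 0  # assumption: index reserved for the CTC blank symbol

def greedy_ctc_decode(log_probs, blank=BLANK):
    """Best-path decoding: argmax per time step, merge repeated indices, drop blanks.

    log_probs: tensor shaped [time, batch, classes]; returns one list of class indices per sample.
    """
    best = log_probs.argmax(dim=2).t().tolist()  # [batch, time]
    decoded = []
    for path in best:
        seq, prev = [], None
        for idx in path:
            if idx != prev and idx != blank:
                seq.append(idx)
            prev = idx
        decoded.append(seq)
    return decoded

def sequence_accuracy(decoded, targets):
    """Fraction of samples whose decoded sequence matches the target sequence exactly."""
    correct = sum(d == t for d, t in zip(decoded, targets))
    return correct / len(targets)

# One-hot scores stand in for log-probabilities: path 3 3 _ 3 5 5 _ decodes to [3, 3, 5]
path = torch.tensor([3, 3, 0, 3, 5, 5, 0])
log_probs = torch.nn.functional.one_hot(path, num_classes=6).float().unsqueeze(1)  # [7, 1, 6]
print(greedy_ctc_decode(log_probs))                    # [[3, 3, 5]]
print(sequence_accuracy([[3, 3, 5], [1]], [[3, 3, 5], [2]]))  # 0.5
```

The blank between the two 3s is what allows a genuinely repeated character to survive the merge step.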

```python
# torch.manual_seed(1)
# num_epochs = 1
# hist = train(model, NUM_EPOCHS, train_dl, val_dl)
```