# **Multimodal Document Image Classification**
# The models will be configured and trained using TensorFlow/Keras

## **I- Visual Modality**

In this part, we will fine-tune a visual model pretrained on the ImageNet dataset (a large-scale dataset of more than 1M images in 1000 classes). The aim of this first part is to extract visual features from document images and use them to perform document image classification on the Tobacco-3482 dataset.

### A - Load Tobacco-3482 Dataset

The Tobacco-3482 dataset is a document image dataset of 3482 images in 10 classes. It is available on Moodle.

### **B - Import packages**

In [5]:
!pip install opencv-python
!pip install pandas
!pip install tensorflow


In [6]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.image as mpimg
import matplotlib.pyplot as plt


import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, CSVLogger, ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import SGD, Adam
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


### **C - Configure GPUs and Enable Distributed Strategy**

In [7]:
# Memory growth must be set before the GPUs are initialized (if there are any)
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0




Enable GPU memory growth so that TensorFlow does not reserve the whole GPU memory when the model is loaded

In [None]:
#Configure CPU/GPU Distributed Training Strategy
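One possible way to fill in this cell is sketched below. It assumes a single machine and picks `tf.distribute.MirroredStrategy`, which is a choice made for this illustration and not something the notebook prescribes. The `strategy` name matches the `strategy.scope()` calls used later.

```python
import tensorflow as tf

# Memory growth has to be enabled before any GPU is initialized.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# MirroredStrategy replicates the model on every visible GPU.
# Without a GPU (as in the run above) it uses a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
```

With several GPUs, the global batch is split across the replicas, so `batch_size` is often scaled by `strategy.num_replicas_in_sync`.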


Parameters to consider while building your visual *model*

In [None]:
image_path = "/content/tobacco/Image/"
text_path = "/content/tobacco/Text/"
batch_size = 16
output_classes = 10
image_size = (224, 224)
channels = 3
dropout_rate = 0.1
project_dim = 128
EPOCHS = 10
buffer_size = 3482

### **G - Preprocess Inputs**



1.   Create a function that processes the images as below
    *   Read image
    *   Resize image to image_size parameter
    *   Normalize image so that our input is in the range of [0, 1]
2.   Create a function that returns the class of the image given its path 
3.   Create a function that returns two lists: a list of image paths, and a list of labels (classes)
4.   Create a function that:
    *   Shuffles the data, with the seed equal to the number of samples
    *   Converts the target classes to categorical (one-hot) labels
    *   Splits the data into train, validation, and test sets with ratios of 0.8, 0.1, and 0.1






In [15]:
def getClass(image):
    return image.split("/")[4]

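def prepare_image_data(data_path):
    images = []
    labels = []

    # Keep every path and derive its label from the class folder name
    for path in data_path:
        images.append(path)
        labels.append(getClass(path))
    return images, labels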

data_path = ['/content/tobacco/Image/Scientific/image1.jpg', '/content/tobacco/Image/ADVE/image2.jpg']
prepare_image_data(data_path=data_path)

(['/content/tobacco/Image/Scientific/image1.jpg',
  '/content/tobacco/Image/ADVE/image2.jpg'],
 ['Scientific', 'ADVE'])

In [None]:
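def Preprocess_Image(image):
    # Read the image with OpenCV (channels are loaded in BGR order)
    img = cv2.imread(image, cv2.IMREAD_COLOR)
    # Convert to RGB, the channel order expected by ImageNet-pretrained models
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Resize the image with tf.image (the result is a float32 tensor)
    img_resized = tf.image.resize(img, image_size)
    # Normalize the image to the range [0, 1]
    img_resized = tf.cast(img_resized, "float32") / 255

    return img_resized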

def getClass(image):
    return image.split("/")[4] # Return image class based on list entry (path) (e.g, input="/content/tobacco/Image/Scientific/image.jpg" ==> output= "Scientific")

def prepare_image_data(data_path):
    # Two separate lists: `images, labels = []` would raise a ValueError
    images, labels = [], []

    for path in data_path:
        images.append(path)
        labels.append(getClass(path))
    return images, labels  # (e.g. images=['/content/tobacco/Image/Scientific/image1.jpg', '/content/tobacco/Image/ADVE/image2.jpg', ...], labels=['Scientific', 'ADVE'])
                  
def create_image_split_dataset(data_path):
    # Load image paths and labels with the prepare_image_data function
    images, labels = prepare_image_data(data_path)

    # Shuffle images and labels using sklearn.utils.shuffle
    
    # Convert categories to binary labels (from sklearn.preprocessing import LabelBinarizer)
 
    # Split images into training, validation, and test sets

    return data_train, labels_train, data_valid, labels_valid, data_test, labels_test
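The three commented steps can be sketched as follows. This is one reading of the instructions, not a reference solution. The helper name `split_paths_and_labels` is hypothetical, and the 80/10/10 split is obtained by holding out 20 % and then halving that part.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import shuffle


def split_paths_and_labels(images, labels, seed=None):
    """Shuffle, one-hot encode, and split into 80/10/10 train/valid/test."""
    # Seed with the number of samples unless another seed is given.
    seed = len(images) if seed is None else seed
    images, labels = shuffle(images, labels, random_state=seed)

    # One column per class (with only two classes LabelBinarizer
    # returns a single column instead).
    one_hot = LabelBinarizer().fit_transform(labels)

    # Hold out 20 %, then cut the held-out part in half.
    x_train, x_rest, y_train, y_rest = train_test_split(
        images, one_hot, test_size=0.2, random_state=seed)
    x_valid, x_test, y_valid, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=seed)
    return x_train, y_train, x_valid, y_valid, x_test, y_test
```

Passing `stratify=` to `train_test_split` would keep the class proportions equal across the three sets, which can matter for the smaller Tobacco-3482 classes.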



Load Train, Valid, and Test Data

In [2]:
# data_train, labels_train, data_valid, labels_valid, data_test, labels_test = create_image_split_dataset(image_path)

# **Data Visualization**

Visualize the distribution of the data across the classes of the Tobacco dataset (number of samples in each class). Use the Pandas library.

In [None]:
#Visualize the data distribution across classes
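A small sketch of the counting step with Pandas. The `class_counts` helper and the example labels are made up for illustration. In the notebook the class names would come from `prepare_image_data`.

```python
import pandas as pd


def class_counts(labels):
    """Return the number of samples per class, sorted by class name."""
    return pd.Series(labels, name="samples").value_counts().sort_index()


# Hypothetical labels, only to show the call.
example_labels = ["ADVE", "Scientific", "ADVE", "Email"]
counts = class_counts(example_labels)
print(counts)
# counts.plot(kind="bar") would draw the distribution with matplotlib.
```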

Visualize Some samples of the training data with matplotlib library

In [None]:
#Visualize some samples of data

# **Prepare Data for Training**
# Create a Data Loader to Load Data from Disk.
The use of data generators is preferable when dealing with large amounts of data. Two methods can be used to load data:

*   "tf.data.Dataset.from_tensor_slices" for small amounts of data
*   "tf.data.Dataset.from_generator" for large amounts of data


Create training, validation, and test data generators that load the data in batches instead of holding the entire dataset in memory. Each data generator yields images and labels.

In [None]:
#example:
def train_batch_generator():
    while True:
        #Loop through images and labels
        #use the Preprocess_Image function to read and process images
        yield train_images, train_labels



In [None]:
#Use tf.data.Dataset.from_generator() to load data in batches from the example above (train_batch_generator will be the input to tf.data.Dataset.from_generator)
# Batch data to be loaded in batches of batch size parameter

# Disable AutoShard.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF

train_dataset = tf.data.Dataset.from_generator(...).batch(...).with_options(options)
valid_dataset = tf.data.Dataset.from_generator(...).batch(...).with_options(options)
test_dataset = tf.data.Dataset.from_generator(...).batch(...).with_options(options)
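The generator and the `from_generator` call can be combined roughly as below. This is a sketch under a few assumptions: the helper name `make_dataset` is hypothetical, `load_fn` stands for an image loader such as `Preprocess_Image`, labels are one-hot vectors, and the generator makes a single finite pass instead of the `while True` loop of the stub above.

```python
import numpy as np
import tensorflow as tf


def make_dataset(paths, labels, load_fn, image_size=(224, 224), batch_size=16):
    """Build a batched tf.data pipeline that loads one image at a time."""

    def generator():
        # One pass over the data; a new iterator is created for every epoch.
        for path, label in zip(paths, labels):
            image = np.asarray(load_fn(path), dtype=np.float32)
            yield image, np.asarray(label, dtype=np.float32)

    # Shapes and dtypes of one (image, label) pair.
    signature = (
        tf.TensorSpec(shape=(*image_size, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(len(labels[0]),), dtype=tf.float32),
    )
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.OFF)
    dataset = tf.data.Dataset.from_generator(
        generator, output_signature=signature)
    return dataset.batch(batch_size).with_options(options)
```

Because only the paths are kept in memory, each image is read from disk when its batch is requested.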

## **H - Load a Deep CNNs model to train our document image classification model (e.g. NASNetMobile)**

## We will train our model using transfer learning from pre-trained ImageNet weights

Here we initialize the model with pre-trained ImageNet weights, and we fine-tune it on our own dataset.

In [None]:
from tensorflow.keras.applications import NASNetMobile

In [None]:
def visual_model():

    #Initialize your model with a Input layer

    #Load the NASNetMobile model with ImageNet weights and include_top set to False (the 10-class output layer is added below)

    #Add a Globalaveragepooling2D layer

    #Add a dropout layer with dropout rate equal to 0.1

    #Add a dense layer with 128 units (project_dim parameter), and set "relu" as the activation function 

    #Add a dropout layer with dropout rate equal to 0.1

    #Add a Dense layer with 10 units (each unit corresponds to a class), and an activation function="softmax" to perform classification

    #Define your functional with: tf.keras.Model()
    model = tf.keras.Model(inputs = , outputs = )
    return model
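The layer stack described in the comments could look like the sketch below. It is an illustration, not the expected solution. The function name `build_visual_model`, the layer names, and the `weights` argument (exposed so the network can also be built without downloading ImageNet weights) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import NASNetMobile


def build_visual_model(weights="imagenet", num_classes=10,
                       project_dim=128, dropout_rate=0.1):
    inputs = layers.Input(shape=(224, 224, 3), name="image")
    # Convolutional backbone without its ImageNet classification head.
    backbone = NASNetMobile(include_top=False, weights=weights,
                            input_shape=(224, 224, 3))
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(dropout_rate)(x)
    # 128-unit projection; its output is reused for early fusion later on.
    x = layers.Dense(project_dim, activation="relu", name="visual_features")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```

Calling the backbone as a single nested layer makes it easy to freeze or unfreeze it as a whole through its `trainable` attribute.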

In [None]:
# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.

    #Load Model
    freezed_model = visual_model()

    # Freeze the pretrained weights from NasNet model, we won't train the whole model. 
    # We want to train and update the weights of only the added last layers

    #compile your model using :
      #Stochastic gradient Descent (SGD) as optimizer (set the learning rate to 1e-3, and momentum=0.9)
      #Use an appropriate loss function
      #Use an appropriate metric
    freezed_model.compile(optimizer = , 
                  loss = , 
                  metrics = ,
    )
#model summary
freezed_model.summary()
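Freezing and compiling can be sketched as below on a tiny stand-in network, so that the example runs quickly. The helper `freeze_and_compile`, the layer name `"backbone"`, and the choice of categorical cross-entropy with accuracy (which assumes one-hot labels) are assumptions, not the notebook's prescribed answer.

```python
import tensorflow as tf
from tensorflow.keras import layers


def freeze_and_compile(model, backbone_layer_name):
    """Freeze one named sub-model and compile for one-hot labels."""
    model.get_layer(backbone_layer_name).trainable = False
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Tiny stand-in for the pretrained network, only to keep the example fast.
backbone = tf.keras.Sequential(
    [layers.Conv2D(4, 3, activation="relu"), layers.GlobalAveragePooling2D()],
    name="backbone")
inputs = layers.Input(shape=(32, 32, 3))
outputs = layers.Dense(10, activation="softmax")(backbone(inputs))
toy_model = freeze_and_compile(tf.keras.Model(inputs, outputs), "backbone")
print(len(toy_model.trainable_weights))
```

`trainable` has to be set before `compile()`. Changing it afterwards only takes effect once the model is compiled again.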

# ***I - Training***


Note: the accuracy will increase very slowly, and the model may overfit.

Use model.fit to train your model. Define all the parameters.

In [None]:
# Train the model on all available devices.
t0=time.time()

freezed_history = freezed_model.fit()

print('')
print('total computing time: '+str(time.time()-t0))

Evaluate the performance of your model on the Test set

Use model.evaluate to evaluate the performance of your model

In [None]:
# Test the model after training
scores = freezed_model.evaluate()

#print loss and test accuracies

Plot models' history (training accuracy, validation accuracy)

In [None]:
#Create a function that plots the models' training accuracy, validation accuracy, training loss, and validation loss

def plot_hist(hist):
    
plot_hist(freezed_history)
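A possible body for such a plotting function is sketched below under the hypothetical name `plot_history`. It assumes the model was compiled with the `"accuracy"` metric and trained with validation data, so that the history contains the keys `accuracy`, `val_accuracy`, `loss`, and `val_loss`.

```python
import matplotlib.pyplot as plt


def plot_history(hist):
    """Plot accuracy and loss curves from a Keras History (or its dict)."""
    # Accept either the History object returned by fit() or a plain dict.
    history = getattr(hist, "history", hist)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, key in ((ax_acc, "accuracy"), (ax_loss, "loss")):
        ax.plot(history[key], label="train")
        ax.plot(history["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(key)
        ax.legend()
    fig.tight_layout()
    return fig
```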

Re-train the model and update all parameters. (Unfreeze the first 2 layers.)

*   Re-compile your second model
*   Re-Train your second model
*   Compare the training time between the two trained models
*   Compare Test accuracy between the two trained models



In [None]:
# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.

    #Load Model
    unfreezed_model = visual_model()

    # Unlike the first model, do not freeze the pretrained NASNet weights here:
    # unfreeze the layers named in the instructions so that they are updated during training
    

    #compile your model using :
      #Stochastic gradient Descent (SGD) as optimizer (set the learning rate to 1e-3, and momentum=0.9)
      #Use an appropriate loss function
      #Use an appropriate metric
    unfreezed_model.compile(optimizer = , 
                  loss = , 
                  metrics = ,
    )
#model summary
unfreezed_model.summary()

In [None]:
#Re-train the model and unfreeze the first two layers
# Train the model on all available devices.
t1=time.time()

unfreezed_history = unfreezed_model.fit()

print('')
print('total computing time: '+str(time.time()-t1))

In [None]:
plot_hist(unfreezed_history)

Create a directory ("weights") to save the weights of your trained model in the .h5 format

In [None]:
!mkdir -p /content/weights
# Save the weights of the last model trained above
unfreezed_model.save_weights(filepath="/content/weights/visual_model.h5")

------
------
------

## **II- Textual Modality**

In this part, we will train a textual model. The aim of this second part is to extract the textual features of document images to perform document text classification on the Tobacco-3482 dataset.

Load Train, Valid, and Test Data

In [None]:
def getClass(text):
    #return category (e.g, input="/content/tobacco/Text/Scientific/text.txt" ==> output= "Scientific")
    return

def prepare_text_data(data_path):

    return texts , labels #(e.g. texts=['/content/tobacco/Text/Scientific/text1.txt', '/content/tobacco/Text/ADVE/text2.txt', ...], labels=['Scientific', 'ADVE'])
                  
def create_text_split_dataset(data_path):
    # Load Text Data and Labels from prepare_text_data function
    texts, labels = prepare_text_data(data_path)
    # Shuffle texts and labels using sklearn.utils.shuffle (use the random_state parameter)

    # Convert labels to categorical (use from tensorflow.keras.utils import to_categorical) 

    # Split text data into training, validation, and test sets

    return data_train, labels_train, data_valid, labels_valid, data_test, labels_test



In [None]:
text_data_train, text_labels_train, text_data_valid, text_labels_valid, text_data_test, text_labels_test = create_text_split_dataset(text_path)

Import Libraries to process the text

In [None]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

Parameters to consider while building your *model*

In [None]:
vocab_size = 20000
buffer_size = 3482

Create a function that does the same thing as prepare_image_data(), but instead of appending the text file paths to the new list, it appends the text read from each file.

In [None]:
def feed_text(data, label):
    text_data = []
    text_labels = []

    # Loop over your text data and labels

    # Read the text of each file (.txt)

    # Append the text read and its label to the text_data and text_labels lists respectively

    return text_data, text_labels
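The reading loop can be sketched as follows. The helper name `read_texts` is hypothetical, and the UTF-8 decoding with `errors="ignore"` is an assumption about the OCR text files, which the notebook does not specify.

```python
from pathlib import Path


def read_texts(paths, labels):
    """Return the content of each text file together with its label."""
    texts, kept_labels = [], []
    for path, label in zip(paths, labels):
        # errors="ignore" drops bytes that cannot be decoded as UTF-8.
        texts.append(Path(path).read_text(encoding="utf-8", errors="ignore"))
        kept_labels.append(label)
    return texts, kept_labels
```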

Read your Data

In [None]:
text_data_train, text_labels_train = feed_text(...)
text_data_valid, text_labels_valid = feed_text(...)
text_data_test, text_labels_test = feed_text(...)

Now, instead of a data generator with "tf.data.Dataset.from_generator", we will use a data loader built with "tf.data.Dataset.from_tensor_slices": we are dealing with texts rather than images, and they take far less memory.

It is a good opportunity to learn both ways of preparing and loading data.

Note: Use the methods of the dataset returned by "tf.data.Dataset.from_tensor_slices(...)" (e.g. .batch(), .prefetch(), .shuffle()). See the TensorFlow/Keras documentation for more information.

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(...).batch(...)
#Add parameters

valid_dataset = tf.data.Dataset.from_tensor_slices(...).batch(...)
#Add parameters

test_dataset = tf.data.Dataset.from_tensor_slices(...).batch(...)
#Add parameters
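As an illustration of how these methods chain, here is a sketch on placeholder data. The texts, labels, and batch size are made up. In the notebook the inputs would be the lists returned by `feed_text`.

```python
import tensorflow as tf

# Placeholder texts and one-hot labels, for illustration only.
texts = ["first document", "second document", "third document"]
labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

dataset = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .shuffle(buffer_size=len(texts))   # reshuffled at every epoch
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)        # overlap loading with training
)
for text_batch, label_batch in dataset.take(1):
    print(text_batch.shape, label_batch.shape)
```

Shuffling is usually applied to the training set only. The validation and test sets only need `.batch()` and `.prefetch()`.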


In [None]:
# The raw text loaded by the function feed_text() needs to be processed before it can be used in a model. 
# The simplest way to process text for training is using the "tf.keras.layers.TextVectorization" layer.
# Create the TextVectorization layer. (use the max_tokens parameter)
encoder = tf.keras.layers.TextVectorization(...)

# Use the .adapt() method to set the layer's vocabulary
# After the padding and unknown tokens they're sorted by frequency:
encoder.adapt(train_dataset.map(lambda text, label: text))

In [None]:
# Here are the first 20 tokens.
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

In [None]:
# Initially this returns a batch of (text, label) pairs from the dataset:
for example, label in train_dataset.take(1):
    print('text: ', example.numpy())
    print('label: ', label.numpy())

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed output_sequence_length):

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example
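The zero-padding behaviour can be seen on a toy example. The two sentences and the `toy_encoder` name are made up for this sketch, so that it does not interfere with the notebook's `encoder`.

```python
import tensorflow as tf

samples = ["the cat sat", "the dog sat down"]

toy_encoder = tf.keras.layers.TextVectorization(max_tokens=20000)
# adapt() builds the vocabulary from a batched dataset of strings.
toy_encoder.adapt(tf.data.Dataset.from_tensor_slices(samples).batch(2))

encoded = toy_encoder(tf.constant(samples)).numpy()
print(toy_encoder.get_vocabulary())
print(encoded)  # the 3-word sentence is padded with a trailing 0
```

Index 0 is reserved for padding and index 1 for unknown tokens, so real words start at index 2.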

Now, let's define our RNN model to perform text classification. We defined the visual model as a functional model (tf.keras.Model()). For the textual model, we will define a sequential model using tf.keras.Sequential().

In [None]:
def textual_model():

    model = tf.keras.Sequential([
      # Add the encoder layer
      encoder,
      # Add an Embedding layer (with input_dim == len(encoder.get_vocabulary()) and output_dim == 512)
      
      # Add a Bidirectional layer (wrapping a tf.keras.layers.LSTM layer with 512 units, activation="tanh", and return_sequences=True)
      
      # Add a Bidirectional layer (wrapping a tf.keras.layers.LSTM layer with 256 units, activation="tanh", and return_sequences=False)
      
      # Add a Dense Layer(with 128 units, and relu as activation function)
      
      # Add a Dropout Layer (with a dropout rate of 0.5)
      
      # Add a last dense Layer (with 10 units, which corresponds to the number of classes, and a "softmax" activation function)
      ])
    return model

# Open a strategy scope.
with strategy.scope():

  #Re-define the TextVectorization Layer and the .adapt() in the strategy scope 
  encoder = ...
  encoder.adapt(...) 

  print('Build model...')
  
  text_model = textual_model()
  #compile your model using :
      #Adam optimizer (set the learning rate to 1e-4)
      #Use an appropriate loss function
      #Use an appropriate metric
  text_model.compile(optimizer = , 
                     loss = , 
                     metrics = ,
    )

text_model.summary()
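The layer list from the comments can be written out as in the sketch below. The function name `build_text_model` is hypothetical, and `mask_zero=True` is an addition not requested by the comments: it makes the LSTM layers ignore padded positions.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_text_model(encoder, num_classes=10):
    """Stack the adapted TextVectorization layer and a bidirectional LSTM."""
    return tf.keras.Sequential([
        encoder,
        layers.Embedding(input_dim=len(encoder.get_vocabulary()),
                         output_dim=512, mask_zero=True),
        layers.Bidirectional(
            layers.LSTM(512, activation="tanh", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(256, activation="tanh")),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
```

With one-hot labels, a matching compile call would use `tf.keras.optimizers.Adam(learning_rate=1e-4)`, `loss="categorical_crossentropy"`, and `metrics=["accuracy"]`.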

In [None]:
t3=time.time()

# Train the model on all available devices.
text_history = text_model.fit(...)

print('')
print('total computing time: '+str(time.time()-t3))

In [None]:
# Test the model after training
text_scores = text_model.evaluate()

#print loss and test accuracies

In [None]:
plot_hist(text_history)

In [None]:
text_model.save_weights(filepath="/content/weights/textual_model.h5")

---
---
---

## **III- Fusion Modality (Multimodal Modality)**

In this part, we will load our trained visual and textual models. The aim of this third part is to combine the features extracted by the visual and textual modalities. We will then use two fusion strategies (early and late fusion) to perform multimodal document image classification.

# ***A - Late Fusion***


## 1. Load the Visual and Textual Models with their trained weights inside the strategy scope, and Define your final Model

In [None]:
# Open a strategy scope.

#---- 
# Load Visual Model
# Load the weights saved in the "weights" directory
# Get the output(last layer) of the visual model

#---- 
# Load Textual Model
# Load the weights saved in the "weights" directory
# Get the output(last layer) of the textual model

#---- 
# Concatenate the outputs of the visual and textual models
# Add a Dense layer on top of the concatenated outputs (with 20 units, and a "relu" activation function)
# Add a Dense layer (which corresponds to the last layer (with 10 units (number of classes), and a "softmax" activation function))

#----
# Finally, define your Functional Model (your inputs should be the image and text, and the output should be the last dense layer added to the model)

#----
# Compile your model:
    #Take into account that now you have two inputs instead of only one (both image and text data)
    #Use SGD optimizer, set the learning rate to 1e-3, and momentum to 0.9
    #Use an appropriate loss function
    #Use an appropriate metric
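The late-fusion head can be sketched as below. It is shown on two small stand-in models so that it runs without the trained weights. The helper `build_late_fusion` is hypothetical and assumes that both unimodal models expose `.input` and `.output`, which is the case for functional models and for sequential models once they are built. In the notebook, each model would first be created and given its saved weights with `load_weights`, inside the strategy scope.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_late_fusion(visual, textual, num_classes=10):
    """Fuse the class probabilities predicted by two unimodal models."""
    merged = layers.Concatenate()([visual.output, textual.output])
    x = layers.Dense(20, activation="relu")(merged)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs=[visual.input, textual.input],
                           outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Small stand-ins for the trained unimodal models (hypothetical shapes).
img_in = layers.Input(shape=(8,), name="image_features")
txt_in = layers.Input(shape=(6,), name="text_features")
visual = tf.keras.Model(img_in, layers.Dense(10, activation="softmax")(img_in))
textual = tf.keras.Model(txt_in, layers.Dense(10, activation="softmax")(txt_in))
fusion = build_late_fusion(visual, textual)
print(fusion.output_shape)
```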


### 2. Modify the Data Loader to take into account a list of two inputs, instead of one.
NB: use a data generator through the tf.data.Dataset.from_generator() function.

In [None]:
#Modify Data Loader
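A two-input loader could follow the sketch below. The helper `make_multimodal_dataset` is hypothetical, `load_fn` stands for an image loader such as `Preprocess_Image`, and `texts` holds the raw strings read earlier. Each element has the structure `((image, text), label)`, which Keras accepts for a model with two inputs.

```python
import numpy as np
import tensorflow as tf


def make_multimodal_dataset(image_paths, texts, labels, load_fn,
                            image_size=(224, 224), batch_size=16):
    """Yield ((image, text), label) batches for a two-input model."""

    def generator():
        for path, text, label in zip(image_paths, texts, labels):
            image = np.asarray(load_fn(path), dtype=np.float32)
            yield (image, text), np.asarray(label, dtype=np.float32)

    signature = (
        (tf.TensorSpec(shape=(*image_size, 3), dtype=tf.float32),
         tf.TensorSpec(shape=(), dtype=tf.string)),
        tf.TensorSpec(shape=(len(labels[0]),), dtype=tf.float32),
    )
    dataset = tf.data.Dataset.from_generator(
        generator, output_signature=signature)
    return dataset.batch(batch_size)
```

The image and text lists have to be shuffled and split together, so that each image stays aligned with the text of the same document.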

### 3. Your Model is now ready for Training

In [None]:
#Train your model here

### 4. Your Model is now ready for Testing

In [None]:
#Evaluate your model here

### Compare the Test accuracies of Visual, Textual, and Fusion Modalities

Is there any improvement in accuracy?

# ***B - Early Fusion***


## 1. Repeat Step A.1

Here, we will compare early fusion methods that differ only in the merging layer: "add", "concatenate", and "average".

In [None]:
# Open a strategy scope.

#---- 
# Load Visual Model
# Load the weights saved in the "weights" directory
# Get the output of the layer with 128 units of the visual model

#---- 
# Load Textual Model
# Load the weights saved in the "weights" directory
# Get the output of the layer with 128 units of the textual model

#---- 1st Method: "Add"
# Add the outputs of the visual and textual models using the "Add" layer
# Add a Dense layer on top of the added outputs (with 128 units, and a "relu" activation function)
# Add a Dropout Layer (with a dropout rate of 0.5)
# Add a Dense layer (which corresponds to the last layer (with 10 units (number of classes), and a "softmax" activation function))

#----
# Finally, define your Functional Model (your inputs should be the image and text, and the output should be the last dense layer added to the model)

#----
# Compile your model:
    #Take into account that now you have two inputs instead of only one (both image and text data)
    #Use SGD optimizer, set the learning rate to 1e-3, and momentum to 0.9
    #Use an appropriate loss function
    #Use an appropriate metric
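Since the early-fusion variants differ only in the merging layer, they can share one builder, as in this sketch. The helper `build_early_fusion`, the `MERGE_LAYERS` table, and the stand-in inputs are assumptions. In the notebook, the two feature tensors would be the outputs of the 128-unit layers of the trained visual and textual models, and `inputs` would be their image and text inputs.

```python
import tensorflow as tf
from tensorflow.keras import layers

MERGE_LAYERS = {
    "add": layers.Add,
    "average": layers.Average,
    "concatenate": layers.Concatenate,
}


def build_early_fusion(visual_features, textual_features, inputs,
                       method="add", num_classes=10):
    """Merge two 128-d feature tensors, then classify."""
    merged = MERGE_LAYERS[method]()([visual_features, textual_features])
    x = layers.Dense(128, activation="relu")(merged)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


# Stand-in feature extractors (hypothetical input shapes).
img_in = layers.Input(shape=(8,))
txt_in = layers.Input(shape=(6,))
img_feat = layers.Dense(128, activation="relu")(img_in)
txt_feat = layers.Dense(128, activation="relu")(txt_in)
models = {name: build_early_fusion(img_feat, txt_feat, [img_in, txt_in], name)
          for name in MERGE_LAYERS}
```

"add" and "average" require both feature vectors to have the same size, while "concatenate" doubles the width of the fused vector.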


### 2. Your Model is now ready for Training

In [None]:
#Train your model here

### 3. Your Model is now ready for Testing

In [None]:
#Evaluate your model here

### Compare the Test accuracies of Visual, Textual, and Fusion Modalities

Is there any improvement in accuracy?

### 4. Repeat the exact same steps, changing only the fusion method, and compare the results

### 5. Save the weights of your fusion model

In [None]:
model.save_weights(filepath="/content/weights/fusion_model.h5")