# Final data science deliverable

### Context

TouNum is a document digitization company. It offers several services, including basic digitization of paper documents for business clients. TouNum wants to optimize this scanning process and make it smarter by adding Machine Learning tools. The time savings would be significant given the large volume of data the company has to scan and label.
To that end, TouNum has asked CESI to carry out this work.

### Objective

The CESI data science team is to build a solution that analyzes photographs and automatically generates a descriptive caption for each one. The solution must also improve the quality of scanned images, which varies (some are blurry or noisy).

<img src="imageSrc/caption image.PNG"/>

### Stakes

Until now, TouNum had to sort and label every scanned document itself. The solution delivered by CESI automates these tasks and saves a significant amount of time. TouNum will therefore be able to take on more contracts and improve customer satisfaction.

### Technical constraints

The algorithms must be implemented in Python, relying in particular on the Scikit and TensorFlow libraries. Pandas must be used to manipulate the dataset and ImageIO to load it. NumPy and Matplotlib will be needed for scientific computing and modeling.

The delivered program must follow this workflow:

<img src="imageSrc/workflow.PNG"/>

#### Classification

Image classification will rely on neural networks. It must distinguish photos from other kinds of documents, such as diagrams, scanned text, or even paintings.
TouNum has a dataset of varied images for training the neural network.

#### Preprocessing

Preprocessing must use convolutional filters to improve image quality. It must strike a compromise between denoising and sharpening.
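
To make this compromise concrete, here is a minimal NumPy-only sketch. It is not the notebook's implementation (which relies on OpenCV); the helper `convolve3x3` and the synthetic image are assumptions made for illustration. It applies a 3x3 sharpening kernel and a 3x3 box blur to a flat, noisy image and compares the remaining noise level.

```python
import numpy as np

def convolve3x3(img, kernel):
    """Apply a symmetric 3x3 kernel to a 2-D float image (edge pixels replicated)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    # Accumulate the nine shifted copies of the image, weighted by the kernel
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

# Both kernels sum to 1, so flat regions keep their brightness
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
box_blur = np.full((3, 3), 1.0 / 9.0)

# Flat gray image with Gaussian noise (synthetic test data)
rng = np.random.default_rng(0)
flat_noisy = 100.0 + rng.normal(0.0, 5.0, size=(64, 64))

noise_before = flat_noisy.std()
noise_sharpened = convolve3x3(flat_noisy, sharpen).std()
noise_blurred = convolve3x3(flat_noisy, box_blur).std()

print("noise before:   ", round(noise_before, 2))
print("after sharpen:  ", round(noise_sharpened, 2))
print("after box blur: ", round(noise_blurred, 2))
```

Sharpening raises the noise level while blurring lowers it, which is why the two treatments have to be balanced rather than applied at full strength.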

#### Captioning

Captioning must automatically generate captions for the images. It will use two Machine Learning techniques: convolutional neural networks (CNN) to preprocess the image by identifying regions of interest, and recurrent neural networks (RNN) to generate the labels. RAM usage will need to be watched closely. A standard captioning dataset is available for supervised learning.

### Deliverable

The solution must be delivered as a fully automated Jupyter notebook. It must be designed to be easy to put into production and to maintain.
The relevance of the model must be demonstrated in a rigorous and instructive way.

#### Milestones

CESI must deliver the complete, working prototype of the program by January 23.
TouNum also requires 3 delivery dates to track the project's progress.
<ul>
    <li>18/12/20: Image preprocessing</li>
    <li>15/01/21: Binary classification</li>
    <li>20/01/21: Image captioning</li>
    <li>22/01/21: Demonstration</li>
</ul>


### *Importing the libraries used*

In [None]:
import os
import time
import random

# Check if imageio package is installed
try:
    import imageio
except ImportError:
    !pip install imageio
    
import imageio
import matplotlib.pyplot as plt
import numpy as np

# Check if scikit-image package is installed
try:
    import skimage
except ImportError:
    !pip install scikit-image

from skimage import io
from skimage.restoration import estimate_sigma

# Check if opencv-python package is installed
try:
    import cv2
except ImportError:
    !pip install opencv-python

import cv2

import threading
from queue import Queue
from multiprocessing import Pool

import pathlib

# Check if pandas package is installed
try:
    import pandas
except ImportError:
    !pip install pandas

import pandas as pd

# Deliverable 2
import PIL
import PIL.Image

# Check if tensorflow package is installed
try:
    import tensorflow
except ImportError:
    !pip install tensorflow

# Check if tensorflow_datasets package is installed
try:
    import tensorflow_datasets
except ImportError:
    !pip install tensorflow_datasets    


import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# Check if keras package is installed
try:
    import keras
except ImportError:
    !pip install keras    
from keras.preprocessing import image
from keras import backend as K

# Deliverable 3
import json
import collections

from PIL import Image

# Check if tqdm package is installed
try:
    import tqdm
except ImportError:
    !pip install -q tqdm

from tqdm import tqdm

import shutil

### *Filesystem paths*

In [None]:
# Deliverable 2
classification_dataset_path = "./Dataset/2/dataset/"
classification_model_path = "./Models/classification/"
classification_input_path = './Dataset/2/input'
classification_output_path = './Dataset/2/output'

# Deliverable 1
blurry_dataset_path = "./Dataset/1/dataset/Blurry/"
noisy_dataset_path = "./Dataset/1/dataset/Noisy/"
deblured_dataset_path = "./Dataset/1/dataset/deblurred/"
denoised_dataset_path = "./Dataset/1/dataset/denoised/"
treatments_output_path = './Dataset/1/output'

In [None]:
# Basic environment checks for image classification
print("executing tensorflow version " + tf.__version__)

# Report a GPU as soon as at least one is visible (not only when exactly one is)
if len(tf.config.experimental.list_physical_devices('GPU')) >= 1:
    print("GPU is detected")
else:
    print("GPU isn't detected")

## Deliverable 2 - Binary classification

In [None]:
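# Dataset parameters (images per batch, image resolution and validation fraction)
batch_size = 16
img_height = 200
img_width = 200
validation_split = 0.2

epochs = 6
# Labels used to decode predictions; their order must match train_ds.class_names,
# which image_dataset_from_directory sorts alphanumerically by folder name
classes = ['Photo', 'Other']

# F1 score, taken from old Keras source code.
# NOTE: it is written for binary labels and probabilities in [0, 1]; the model
# below outputs 2 logits, so the reported value should be read with caution.
def get_f1(y_true, y_pred):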
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val

def generate_model():
    #generation of the training dataset
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
      classification_dataset_path,
      validation_split=validation_split,
      subset="training",
      seed=123,
      image_size=(img_height, img_width),
      batch_size=batch_size)

    #generation of the validation dataset
    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
      classification_dataset_path,
      validation_split=validation_split,
      subset="validation",
      seed=123,
      image_size=(img_height, img_width),
      batch_size=batch_size)
    
    #retrieve the amount of classes for the model
    num_classes = len(train_ds.class_names)
    print("Classes found : " + str(num_classes))
    print(train_ds.class_names)

    # Cache and prefetch the datasets to avoid disk I/O bottlenecks during training
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
    
    #Structure of the neural network
    model = tf.keras.Sequential([
      layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
      layers.Conv2D(16, 3, activation='relu'),
      layers.MaxPooling2D(),
      layers.Conv2D(32, 3, activation='relu'),
      layers.MaxPooling2D(),
      layers.Conv2D(64, 3, activation='relu'),
      layers.MaxPooling2D(),
      layers.Dense(128, activation='relu'),
      layers.GlobalAveragePooling2D(),
      layers.Dropout(0.2),
      layers.Dense(2)
    ])
    
    #display neural network structure
    model.summary()

    #compile the model
    model.compile(
        optimizer='adam',
        loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy', get_f1])
    
    # train the model for the configured number of epochs
    history = model.fit(
      train_ds,
      validation_data=val_ds,
      epochs=epochs
    )
    
    # display statistics on training and validation accuracy and loss
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs_range = range(epochs)

    plt.figure(figsize=(8, 8))
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, label='Training Accuracy')
    plt.plot(epochs_range, val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.title('Training and Validation Accuracy')

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, label='Training Loss')
    plt.plot(epochs_range, val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.title('Training and Validation Loss')
    plt.show()
    
    return model

def classify_image(model, classes, impath):
    # load the image from disk and add a batch dimension
    img = image.load_img(impath, target_size=(img_height, img_width))
    img = image.img_to_array(img)
    img = img.reshape((1,) + img.shape)

    # use the model to predict the class
    prediction = model.predict(img)
    score = tf.nn.softmax(prediction[0])
    #     print(
    #         "This image most likely belongs to {} with a {:.2f} percent confidence."
    #         .format(classes[np.argmax(score)], 100 * np.max(score))
    #     )
    # return the class and the confidence percentage
    return [classes[np.argmax(score)], 100 * np.max(score)]
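
As a sanity check on the F1 metric defined above, the following NumPy sketch reproduces the same precision/recall arithmetic on a small hand-made example. The function name `f1_score_np` and the sample values are assumptions for illustration; it expects binary labels and probabilities in [0, 1].

```python
import numpy as np

def f1_score_np(y_true, y_prob, eps=1e-7):
    """F1 score for binary labels and predicted probabilities in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    # Round the clipped probabilities to the nearest label (0 or 1)
    y_pred = np.round(np.clip(y_prob, 0.0, 1.0))
    true_positives = np.sum(y_true * y_pred)
    precision = true_positives / (np.sum(y_pred) + eps)
    recall = true_positives / (np.sum(y_true) + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Hand-made example: 3 positives, of which 2 are found, plus 1 false positive
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
score = f1_score_np(y_true, y_prob)
print(round(float(score), 3))
```

Here precision and recall are both 2/3, so the F1 score is also close to 2/3.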

In [None]:
def get_model(newModel=False):
    # Train and save a new model when requested
    if newModel:
        model = generate_model()
        tf.keras.models.save_model(model, classification_model_path)
        return model
    # Otherwise load the previously saved model
    else:
        return tf.keras.models.load_model(classification_model_path)
    
model = get_model(True)

In [None]:
# Confidence threshold (in percent) above which an image is considered a photo
confidence_threshold = 90

# Classify the images and copy those labelled "Photo" to the output folder
for pictures in os.listdir(classification_input_path):
    res_classe, res_score = classify_image(model, classes, os.path.join(classification_input_path, pictures))

    print(pictures, res_classe, res_score, '%')

    if res_classe == 'Photo' and res_score > confidence_threshold:
        shutil.copy2(os.path.join(classification_input_path, pictures), classification_output_path)

# Single image test
#img_name = 'photo_0001.jpg'
#result = classify_image(model, classes, './Dataset/2/input/'+img_name)
#print(result)
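
The decision rule used in the loop above (softmax over the two logits, then a confidence threshold in percent) can be illustrated without TensorFlow. This is a minimal sketch under assumptions: the helpers `softmax` and `decide` and the example logits are hypothetical, not part of the notebook.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def decide(logits, classes, threshold_percent=90.0):
    """Return (label, confidence in percent, keep?) for one image."""
    probs = softmax(np.asarray(logits, dtype=float))
    idx = int(np.argmax(probs))
    confidence = 100.0 * float(probs[idx])
    # Keep the image only if it is a confident "Photo"
    keep = classes[idx] == "Photo" and confidence > threshold_percent
    return classes[idx], confidence, keep

classes = ["Photo", "Other"]
print(decide([4.0, 0.5], classes))  # confident "Photo": kept
print(decide([1.0, 0.5], classes))  # "Photo" but below the threshold: rejected
print(decide([0.0, 3.0], classes))  # "Other": rejected
```

A high threshold such as 90 favors precision: uncertain photos are dropped rather than passed on to the next stages.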

## Deliverable 1 - Preprocessing (denoising/sharpening…)

In [None]:
# Deblurring function
def remove_blur(img, high):
    if high:
        # Strong sharpening kernel (identity minus the 8-neighbour Laplacian)
        kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
    else:
        # Milder sharpening kernel (identity minus the 4-neighbour Laplacian)
        kernel = np.array([[0,-1,0], [-1,5,-1], [0,-1,0]])
    
    # Convolve the image passed as parameter with the kernel
    return cv2.filter2D(img, -1, kernel)

def get_blurry_indicator(image):
    # Variance of the Laplacian: the lower the value, the blurrier the image
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    fm = cv2.Laplacian(gray_image, cv2.CV_64F).var()
    return fm

# Thread execution
def process_fpath(name):
    img = get_image(blurry_dataset_path, name)

    # Get the initial blur metric
    original_blur_metric = get_blurry_indicator(img)
    pre_processed_data.append(original_blur_metric)
    
    # Remove blur from the color image
    deblurred_img = remove_blur(img, high=True)

    # Get the blur metric after processing
    processed_blur_metric = get_blurry_indicator(deblurred_img)
    post_processed_data.append(processed_blur_metric)
    
    #print("image " + name + " - initial : " + str(original_blur_metric)
     #   + " - processed : " + str(processed_blur_metric)
     #   + " - difference : " + str(processed_blur_metric - original_blur_metric)+"\n")

    data_preview_blurr.append([name, original_blur_metric, processed_blur_metric])

    # Save the deblurred image
    save_image(deblured_dataset_path, name, deblurred_img)
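
The blur indicator used above is the variance of the Laplacian. The following NumPy-only sketch shows why it works as a sharpness score; the helpers `laplacian_variance` and `box_blur` and the synthetic checkerboard are assumptions for illustration, not the notebook's OpenCV code.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian of a 2-D grayscale image."""
    g = np.asarray(gray, dtype=float)
    # Sum of the four neighbours minus four times the center, on interior pixels
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def box_blur(gray):
    """3x3 mean filter on the interior pixels (the 1-pixel border is dropped)."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    acc = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            acc += g[dy:dy + h - 2, dx:dx + w - 2]
    return acc / 9.0

# Synthetic checkerboard with 8x8 tiles: many sharp edges
i, j = np.indices((64, 64))
sharp = 255.0 * ((i // 8 + j // 8) % 2)
blurred = box_blur(sharp)

sharp_score = laplacian_variance(sharp)
blurred_score = laplacian_variance(blurred)
print("sharp:", round(sharp_score, 1), "blurred:", round(blurred_score, 1))
```

Blurring spreads each edge over several pixels, so the second derivative shrinks and the score drops. The absolute values depend on image content and resolution, which is why thresholds have to be tuned on the dataset.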

In [None]:
# List the files to process
listing = os.listdir(blurry_dataset_path)

# Shared lists filled in by the worker threads
threads = []
pre_processed_data = []
post_processed_data = []
data_preview_blurr = []

if __name__ == '__main__':
    for name in listing:
        #process_fpath(name)
        t = threading.Thread(target=process_fpath, args=(name,))
        threads.append(t)
        
    # Start them all
    for thread in threads:
        thread.start()

    # Wait for all to complete
    for thread in threads:
        thread.join()
    
    get_metric_stat(pre_processed_data, post_processed_data)
    get_list_data(data_preview_blurr)
    
plt.figure(figsize=(24, 8))

def display_image_diff(originalPath, diffPath, filename=None):
    if not filename:
        # Get a random file from directory
        filename = random.choice(os.listdir(originalPath)) 
    
    plt.subplot(121)
    plt.imshow(get_image(originalPath, filename))
    plt.axis('off')
    plt.title("Original Image")
    
    # Corrected image
    plt.subplot(122)
    plt.imshow(get_image(diffPath, filename))
    plt.axis('off')
    plt.title("Corrected Image")
    
# Filename MUST be the same for both directories    
display_image_diff(blurry_dataset_path, deblured_dataset_path)
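
The cell above starts one thread per file, which can create a very large number of threads on a big dataset. As a possible alternative, here is a minimal sketch using a bounded thread pool from the standard library. The names `process_all` and `fake_worker` are hypothetical; `fake_worker` only stands in for a per-file function that returns its metrics instead of appending to shared lists.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(names, worker, max_workers=8):
    """Run worker(name) for each name on a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, names))

def fake_worker(name):
    # Stand-in for the real per-file processing:
    # returns [name, metric_before, metric_after]
    return [name, len(name), 2 * len(name)]

rows = process_all(["a.jpg", "bb.jpg", "ccc.jpg"], fake_worker)
print(rows)
```

Returning the rows from the worker keeps the results in the same order as the input list, which is not guaranteed when threads append to a shared list.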

In [None]:
# Noise removal function (non-local means denoising)
# Arguments after the image: dst, h (luminance strength), hColor (color strength),
# templateWindowSize, searchWindowSize
def remove_noise(image, high):
    if high == 2:
        # strong denoising
        return cv2.fastNlMeansDenoisingColored(image, None, 10, 10, 7, 15)
    elif high == 1:
        # medium denoising
        return cv2.fastNlMeansDenoisingColored(image, None, 5, 10, 7, 15)
    else:
        # light denoising
        return cv2.fastNlMeansDenoisingColored(image, None, 3, 3, 7, 15)

def estimate_noise(img):
    # Estimate the noise standard deviation on the grayscale image
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return estimate_sigma(img)

# Thread execution
def process_fpath(name):
    img = get_image(noisy_dataset_path, name)
    
    # Get the initial noise metric
    original_noise_metric = estimate_noise(img)
    pre_processed_data.append(original_noise_metric)
    
    denoised_img = remove_noise(img, high=2)
    
    # Get the noise metric after processing
    processed_noise_metric = estimate_noise(denoised_img)
    post_processed_data.append(processed_noise_metric)

    data_preview_denoised.append([name, original_noise_metric, processed_noise_metric])
    
    # Save the denoised image
    save_image(denoised_dataset_path, name, denoised_img)
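
To give an intuition for what a noise estimate measures, here is a deliberately simple NumPy sketch. It is not the estimator used by `estimate_sigma` in scikit-image, which is wavelet-based; it is a different, simpler technique based on the median absolute deviation (MAD) of neighbouring pixel differences. The function name `estimate_sigma_mad` and the synthetic data are assumptions.

```python
import numpy as np

def estimate_sigma_mad(gray):
    """Rough Gaussian noise estimate from horizontal pixel differences."""
    g = np.asarray(gray, dtype=float)
    # The difference of two independent noisy pixels has std sigma * sqrt(2)
    diff = g[:, 1:] - g[:, :-1]
    mad = np.median(np.abs(diff - np.median(diff)))
    # 0.6745 converts the MAD of a Gaussian into a standard deviation
    return mad / 0.6745 / np.sqrt(2.0)

# Flat image plus Gaussian noise with a known standard deviation of 5
rng = np.random.default_rng(42)
clean = np.full((200, 200), 128.0)
noisy = clean + rng.normal(0.0, 5.0, size=clean.shape)

print("clean:", estimate_sigma_mad(clean))
print("noisy:", round(float(estimate_sigma_mad(noisy)), 2))
```

On real photos this naive estimator is biased upward by edges and texture, which is one reason to prefer a more robust library estimator in practice.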

In [None]:
# List the files to process
listing = os.listdir(noisy_dataset_path)

# Shared lists filled in by the worker threads
threads = []
pre_processed_data = []
post_processed_data = []
data_preview_denoised = []

if __name__ == '__main__':
    for name in listing:
        #process_fpath(name)
        t = threading.Thread(target=process_fpath, args=(name,))
        threads.append(t)
        
    # Start them all
    for thread in threads:
        thread.start()

    # Wait for all to complete
    for thread in threads:
        thread.join()
    
    get_metric_stat(pre_processed_data, post_processed_data)
    get_list_data(data_preview_denoised)
    
plt.figure(figsize=(24, 8))

# Filename MUST be the same for both directories    
display_image_diff(noisy_dataset_path, denoised_dataset_path)

## Balancing deblurring and denoising

In [None]:
def clear_image(img):
    
    # for random testing:
    #img = get_image("./Dataset/Blurry/", random.choice(os.listdir("./Dataset/Blurry/")))

    # initial image measurements
    initial_noise = estimate_noise(img)
    initial_blur = get_blurry_indicator(img)
    # print(initial_noise, initial_blur)

    # image is blurry
    if initial_blur < 3000:
        # low deblur of the image
        img = remove_blur(img, high=False)

        # second image measurements
        second_noise = estimate_noise(img)
        second_blur = get_blurry_indicator(img)

        # image does not meet the noise requirement after sharpening
        if second_noise > 1:
            # two passes of low denoising
            img = remove_noise(img, 0)

            img = remove_noise(img, 0)

    # image is noisy
    if initial_noise > 1:
        # high denoise of the image, followed by a medium pass
        img = remove_noise(img, 2)

        img = remove_noise(img, 1)

        # second image measurements
        second_noise = estimate_noise(img)
        second_blur = get_blurry_indicator(img)

        # image does not meet the sharpness requirement after denoising
        if second_blur < 48000:
            # two passes of low deblurring
            img = remove_blur(img, high=False)
            img = remove_blur(img, high=False)
            
    print('Image cleaning is done ;)')
    return img

In [None]:
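for pictures in os.listdir(classification_output_path):
    # load the classified photo with the get_image helper used in the previous cells, then clean it
    img = get_image(classification_output_path + "/", pictures)
    c_img = clear_image(img)
    
    # display the cleaned image
    plt.figure(figsize=(24, 8))
    plt.imshow(c_img)
    plt.axis('off')
    plt.show()

    # NOTE: this copies the original file; the cleaned array c_img is only displayed
    shutil.copy2(classification_output_path + "/" + pictures, treatments_output_path)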

## Deliverable 3 - Captioning

In [None]:
# Feel free to adjust these parameters to suit your machine
BATCH_SIZE = 64 # batch size
BUFFER_SIZE = 1000 # buffer size used to shuffle the data
embedding_dim = 256
units = 512 # size of the hidden layer in the RNN
vocab_size = top_k + 1
num_steps = len(img_name_train) // BATCH_SIZE

# The feature vector extracted from InceptionV3 has shape (64, 2048)
# The next two variables hold the dimensions of that vector
features_shape = 2048
attention_features_shape = 64

# Load the numpy file holding the preprocessed features of an image
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

# Create a dataset of tensors (suited to representing large datasets)
# The dataset is built from "img_name_train" and "cap_train"
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# map loads the numpy files (possibly in parallel)
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Shuffle the data and split it into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

class CNN_Encoder(tf.keras.Model):
    # Since you have already extracted the features and dumped it using pickle
    # This encoder passes those features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
    
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, features, hidden):
        # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)
        
        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        
        # attention_hidden_layer shape == (batch_size, 64, units)
        attention_hidden_layer = (tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        
        # score shape == (batch_size, 64, 1)
        # This gives you an unnormalized score for each image feature.
        score = self.V(attention_hidden_layer)
        
        # attention_weights shape == (batch_size, 64, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        return context_vector, attention_weights
    
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        
        self.attention = BahdanauAttention(self.units)
        
    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)
        
        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))
        
        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)
        
        return x, state, attention_weights
    
    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
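
The attention step in `BahdanauAttention` can be followed shape by shape with plain NumPy. This is a minimal sketch under assumptions: the function `additive_attention` is hypothetical, the weights are random rather than trained, and the bias terms of the Dense layers are omitted.

```python
import numpy as np

def additive_attention(features, hidden, W1, W2, v):
    """features: (batch, locations, feat_dim); hidden: (batch, hidden_dim)."""
    # Project both inputs to a common size and combine them: (batch, locations, units)
    combined = np.tanh(features @ W1 + (hidden @ W2)[:, None, :])
    # One unnormalized score per image location: (batch, locations, 1)
    score = combined @ v
    # Softmax over the locations axis (shifted for numerical stability)
    score = score - score.max(axis=1, keepdims=True)
    weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    # Weighted sum of the features: (batch, feat_dim)
    context = (weights * features).sum(axis=1)
    return context, weights

rng = np.random.default_rng(0)
batch, locations, feat_dim, hidden_dim, units = 2, 64, 256, 512, 512
features = rng.normal(size=(batch, locations, feat_dim))
hidden = rng.normal(size=(batch, hidden_dim))
W1 = rng.normal(scale=0.05, size=(feat_dim, units))
W2 = rng.normal(scale=0.05, size=(hidden_dim, units))
v = rng.normal(scale=0.05, size=(units, 1))

context, weights = additive_attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)
```

The weights sum to 1 over the 64 image locations, so the context vector is a weighted average of the encoder features and has the same last dimension as they do.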

In [None]:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

In [None]:
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))
    # start from a zeroed decoder state
    hidden = decoder.reset_state(batch_size=1)
    # extract the image features and flatten the spatial grid
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        # sample the next word id from the predicted distribution
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])

        # stop as soon as the end token is generated
        if tokenizer.index_word[predicted_id] == '<end>':
            break
        
        # feed the predicted word back as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)
    
    # keep only the attention rows of the generated words
    attention_plot = attention_plot[:len(result), :]
    
    return result, attention_plot

# Plot the attention weights over the image for each generated word
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    # square grid with enough cells for every word
    grid_size = max(int(np.ceil(np.sqrt(len_result))), 1)
    for idx in range(len_result):
        temp_att = np.resize(attention_plot[idx], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, idx+1)
        ax.set_title(result[idx])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()
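
The `evaluate` function above samples each next word from the predicted distribution rather than always taking the most likely one. The difference between the two strategies can be sketched in NumPy as follows; the helper `next_token` and the example logits are assumptions for illustration.

```python
import numpy as np

def next_token(logits, rng=None):
    """Pick the next token id: greedy if rng is None, else sample from softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    if rng is None:
        return int(np.argmax(logits))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [0.1, 2.0, 1.5, -1.0]

# Greedy decoding always returns the same id
greedy = next_token(logits)

# Sampling returns each id with a frequency close to its probability
rng = np.random.default_rng(0)
samples = [next_token(logits, rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=len(logits))
print("greedy:", greedy, "sample counts:", counts.tolist())
```

Sampling makes captions more varied but not reproducible from one run to the next; greedy decoding is deterministic and may be preferable when the caption is used as a file name.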

In [None]:
def save_picture_to_output(img, result):
    # Remove the '<end>' and '<unk>' tokens and line breaks from the caption
    result = [word for word in result if word not in ('<end>', '<unk>', '\n')]
    
    # Copy the image to the output folder, using its caption as the file name
    # (captioning_output_path is not defined in the cells shown here and must be set beforehand)
    src_path = treatments_output_path + '/' + img
    dest_path = captioning_output_path + '/' + img + ' - ' + ' '.join(result) + '.jpg'
    shutil.copy(src_path, dest_path)

In [None]:
# Caption every picture in the preprocessing output folder
for img in os.listdir(treatments_output_path):
    image_path = treatments_output_path + '/' + img
    
    # Evaluate the given image
    result, attention_plot = evaluate(image_path)
    print(img, 'Predicted caption:', ' '.join(result))

    # Show where the model looked for each generated word
    # plot_attention(image_path, result, attention_plot)

    # Save the image to the output folder
    save_picture_to_output(img, result)
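
Because the caption ends up in the output file name, it can help to build that name in one small, testable function. The sketch below is one possible approach, not the notebook's behavior: `caption_filename` and the `SPECIAL_TOKENS` set are hypothetical, and unlike the cell above it keeps the original extension instead of appending `.jpg` after the full file name.

```python
import os

# Tokens that should never appear in a caption used as a file name (assumed set)
SPECIAL_TOKENS = {"<start>", "<end>", "<unk>", "<pad>"}

def caption_filename(image_name, tokens):
    """Build '<stem> - <caption><ext>' from an image name and predicted tokens."""
    # Drop special tokens and empty or whitespace-only tokens such as '\n'
    words = [t.strip() for t in tokens if t.strip() and t not in SPECIAL_TOKENS]
    stem, ext = os.path.splitext(image_name)
    return f"{stem} - {' '.join(words)}{ext or '.jpg'}"

print(caption_filename("photo_0001.jpg", ["a", "dog", "<unk>", "runs", "<end>"]))
```

This prints `photo_0001 - a dog runs.jpg`. Characters that are invalid in file names on the target system would still need to be handled separately.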