# Artificial Neural Networks & Deep Learning
# Homework 3 - Visual Question Answering

**Development Team:**
- Acquati Marco - 10583134 
- Brugali Giorgio - 10794550
- Puoti Francesco - 10595640 


# *1. Data acquisition and augmentation*
> We decided not to apply data augmentation since the dataset is large enough. We had to load the images in separate splits to stay within the RAM limit. The images were resized to 299 x 299 to match the input size expected by the Convolutional Neural Network we used for image feature extraction.


> **1.1.Tokenizer and word Embedding**
>> For the sake of simplicity and clarity, all the comments are written alongside the code, **as the code flows**.
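
To illustrate what tokenization and padding do before the embedding layer, here is a minimal pure-Python sketch. It is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences` used in the notebook, not their actual implementation: `build_vocab` and `encode` are hypothetical helpers, indices are assigned in order of first appearance (Keras orders them by frequency), and the example sentences are made up.

```python
def build_vocab(sentences, oov_token="<unk>"):
    """Map each distinct lowercase word to an integer; index 0 is kept for padding."""
    vocab = {oov_token: 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, maxlen, oov_token="<unk>"):
    """Convert a sentence to word indices, then post-pad with zeros up to maxlen."""
    ids = [vocab.get(word, vocab[oov_token]) for word in sentence.lower().split()]
    ids = ids[:maxlen]  # simple truncation, assumed here for brevity
    return ids + [0] * (maxlen - len(ids))


vocab = build_vocab(["is the dog sleeping <eos>"])
# "cat" was never seen while building the vocabulary, so it maps to <unk> (index 1)
encoded = encode("is the cat sleeping <eos>", vocab, maxlen=8)
print(encoded)  # [2, 3, 1, 5, 6, 0, 0, 0]
```

Reserving 0 for padding is what allows an embedding layer with `mask_zero = True` to ignore the padded positions.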



# *2. Model overview*


> **2.1. The Network**
>> It is composed of three main components:
- Convolutional Neural Network (Inception V3) for image analysis and feature extraction 
- Embedding layer + LSTM network for word analysis.
- Classification part composed of a merge layer (the code below combines the two branches with an element-wise `Multiply` layer) and of a Dense layer with softmax activation function.

Note that we also used two dense layers (1024 neurons each), one after Inception and one after the LSTM, each followed by a ReLU activation layer.

We also used: 
- Weight decay ( l2_norm )
- Weight initialization ( he_normal since it works well with ReLU activation function)
- Batch normalization between the dense layers and the ReLU layers
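
As a rough illustration of the two quantities behind the first two items, the sketch below computes the standard deviation targeted by He-normal initialization, `sqrt(2 / fan_in)`, and an L2 penalty of the form `factor * sum(w^2)`. The helper names `he_normal_stddev` and `l2_penalty` are hypothetical, and the weights are made-up numbers; the factor 0.001 is the one passed to the regularizer in the model definition.

```python
import math


def he_normal_stddev(fan_in):
    """Standard deviation used by He-normal initialization for a layer with fan_in inputs."""
    return math.sqrt(2.0 / fan_in)


def l2_penalty(weights, factor=0.001):
    """L2 regularization term added to the loss: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)


stddev = he_normal_stddev(512)            # e.g. a dense layer fed by 512 units
penalty = l2_penalty([0.5, -1.0, 2.0])    # toy weights, for illustration only
print(stddev, penalty)
```

A larger `fan_in` gives a smaller initial spread, which keeps the variance of the activations roughly stable across ReLU layers.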


> **2.2. Optimizer & LossFunction** 
>> - Adam (which adapts the learning rate per parameter), with a starting learning rate of 1e-4 and amsgrad = True, so as to prevent the network from being stuck on a suboptimal solution.
>> - Loss function: Sparse Categorical Crossentropy, since the targets are integer class indices rather than one-hot encoded vectors.
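
The following pure-Python sketch shows why the sparse variant fits integer targets: for a single sample it yields the same value as the one-hot formulation, without building the one-hot vector. The function names and the logits are made up for illustration; this is not the Keras implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Cross-entropy when the target is an integer class index."""
    return -math.log(probs[label])


def categorical_crossentropy(probs, one_hot):
    """Cross-entropy when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = softmax([2.0, 1.0, 0.1])
sparse_loss = sparse_categorical_crossentropy(probs, 0)
dense_loss = categorical_crossentropy(probs, [1.0, 0.0, 0.0])
print(sparse_loss, dense_loss)  # the two values coincide
```

Note that both functions here take probabilities (the softmax output), not raw logits.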

> **2.3. Further information about the implementation process**
>> No EarlyStopping was used in the final model: in our trials it halted training even though continuing would have led to noteworthy improvements.

> **2.4. Training Process**
>> The number of epochs is set to 10, since we noticed that beyond this threshold the accuracy and loss values plateau without further improvement.

>> To save RAM, we explicitly invoked the garbage collector and used the `del` keyword to delete image arrays that were no longer needed.

>> As mentioned above, we split the dataset to comply with RAM limits. To reproduce a regular fitting process as closely as possible, we iterate 10 times over the whole dataset, loading it one split at a time. 
We opted for this type of data loading instead of a custom dataset, since we found the former to be much faster than the latter.
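
The split-wise loading can be pictured with the sketch below, which divides a dataset into contiguous index ranges and visits every range once per epoch. It is a simplified illustration under stated assumptions, not the exact index arithmetic of the notebook: `split_bounds` is a hypothetical helper, the sample count is made up, and the 13 splits and 10 epochs are the values used in the code cells.

```python
def split_bounds(n_samples, n_splits):
    """Return (start, end) pairs, end exclusive, that cover all samples contiguously."""
    bounds = []
    for run in range(n_splits):
        start = (n_samples * run) // n_splits
        end = (n_samples * (run + 1)) // n_splits
        bounds.append((start, end))
    return bounds


bounds = split_bounds(n_samples=100, n_splits=13)

visited = 0
for epoch in range(10):
    for start, end in bounds:
        # In the notebook, this is where one split of images is loaded,
        # fitted for a single epoch, and then deleted to free RAM.
        visited += end - start

print(bounds[0], bounds[-1], visited)
```

Only one split is held in memory at a time, so peak RAM usage is bounded by the largest split rather than by the whole dataset.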

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import os
import gc
import tensorflow as tf
import numpy as np

# Set the seed for random operations.
# This makes our experiments reproducible.
SEED = 1234
tf.random.set_seed(SEED)  
np.random.seed(SEED)
cwd = os.getcwd()

In [None]:
import json 

# Use context managers so that the files are closed after loading
with open('../input/anndlo2020vqa/VQA_Dataset/train_questions_annotations.json', 'r') as f:
    train_jsonLoad = json.load(f)
with open('../input/anndlo2020vqa/VQA_Dataset/test_questions.json', 'r') as f:
    test_jsonLoad = json.load(f)

In [None]:
# At the loading time, we already compose the input and the target sequences by :
#   1) Appending <EOS> to the source_sequence
#   2) Prepending <go> and appending <eos> to obtain the target_sequence.

def json_analyzer(json_, isTrain) :
  X = []
  for key in json_ :
    image = os.path.join('../input/anndlo2020vqa/VQA_Dataset/Images/', (json_[key]['image_id']+'.png'))
    # Parenthesize so that lower() and replace() apply to the whole question, not only to ' <eos>'
    question = (json_[key]['question'] + ' <eos>').lower().replace("/" , " ")

    if isTrain :
      answer = '<go> ' + json_[key]['answer'].lower() +' <eos>'     
      X.append( (question, image, answer) ) 
      
    else:
      X.append( (question , image) )
  return X


In [None]:
train = json_analyzer(train_jsonLoad, isTrain = True)
train = np.array(train)

In [None]:
question_idx = 0
images_idx = 1
answer_idx = 2

# Converting to a list is necessary since the tokenizer works with lists, even though NumPy arrays are handier
train_questions = list(train[:, question_idx])
train_answers = list(train [:, answer_idx])
#there's no need for train_images to be converted to a list
train_images = train[:, images_idx]

In [None]:
MAX_QUESTION_WORD_LENGTH = 0
MAX_ANSWER_WORD_LENGTH = 0
QUESTION_WORDS = 0
ANSWER_WORDS = 0 

#QUESTIONS
phrases = (train_questions)
all_words= []

# The following loop scans the questions to collect every word and to track the length
# (in characters) of the longest word; the number of distinct words gives the vocabulary size.
# These values are used for the tokenizer and, further below, as the padding length.
for p in phrases : 
  words_ = p.replace("?" , "").split()
  for w in words_ :
    all_words.append(w)
    MAX_QUESTION_WORD_LENGTH =  max(len(w), MAX_QUESTION_WORD_LENGTH)

QUESTION_WORDS = len(np.unique(np.array(all_words)))

#ANSWERS
labels_dict = {
        '0': 0,
        '1': 1,
        '2': 2,
        '3': 3,
        '4': 4,
        '5': 5,
        'apple': 6,
        'baseball': 7,
        'bench': 8,
        'bike': 9,
        'bird': 10,
        'black': 11,
        'blanket': 12,
        'blue': 13,
        'bone': 14,
        'book': 15,
        'boy': 16,
        'brown': 17,
        'cat': 18,
        'chair': 19,
        'couch': 20,
        'dog': 21,
        'floor': 22,
        'food': 23,
        'football': 24,
        'girl': 25,
        'grass': 26,
        'gray': 27,
        'green': 28,
        'left': 29,
        'log': 30,
        'man': 31,
        'monkey bars': 32,
        'no': 33,
        'nothing': 34,
        'orange': 35,
        'pie': 36,
        'plant': 37,
        'playing': 38,
        'red': 39,
        'right': 40,
        'rug': 41,
        'sandbox': 42,
        'sitting': 43,
        'sleeping': 44,
        'soccer': 45,
        'squirrel': 46,
        'standing': 47,
        'stool': 48,
        'sunny': 49,
        'table': 50,
        'tree': 51,
        'watermelon': 52,
        'white': 53,
        'wine': 54,
        'woman': 55,
        'yellow': 56,
        'yes': 57,
        '<go>' : 58,
        '<eos>' : 59,
        '<unk>' :60
} 



# Notice that we increased the answer vocabulary values by 1 in order to use 0 as the padding value.
# Later, when generating the CSV file, we subtract 1 to recover the initial values.
# Initially, we tried adding a dedicated padding entry to the answer dictionary, as was done for '<go>', '<eos>' and '<unk>'.
# However, we noticed that using different padding values led the network to worse results. Therefore, we opted for the solution above,
# so that the questions and the answers share the same padding value.
for key in labels_dict :
  MAX_ANSWER_WORD_LENGTH =  max(len(key), MAX_ANSWER_WORD_LENGTH)
  labels_dict[key] += 1


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Create Tokenizer to convert words to integers
# Replace out of vocabulary (OOV) tokens with <UNK>.
questions_tokenizer = Tokenizer(num_words= QUESTION_WORDS, filters='?!,."/', oov_token = '<unk>')
questions_tokenizer.fit_on_texts(train_questions)

train_questions_tokenized = questions_tokenizer.texts_to_sequences(train_questions)

questions_wtoi = questions_tokenizer.word_index

answers_tokenizer = Tokenizer(oov_token = '<unk>', filters='?!,."/', num_words =len(labels_dict)) 
answers_tokenizer.word_index = labels_dict # This assignment is to use the given vocabulary for the answers
answers_tokenized = answers_tokenizer.texts_to_sequences(train_answers)

answers_wtoi = answers_tokenizer.word_index

In [None]:
#WE NEED THE SAME PADDING FOR QUESTION AND ANSWERS
MAX_WORD_LENGTH = max(MAX_ANSWER_WORD_LENGTH, MAX_QUESTION_WORD_LENGTH)

pad_train_questions = pad_sequences(train_questions_tokenized, maxlen=MAX_WORD_LENGTH, padding='post', value = 0)

pad_answers = pad_sequences(answers_tokenized, maxlen=MAX_WORD_LENGTH, padding='post', value = 0)

#  Model definition

In [None]:
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, ReLU, BatchNormalization, Softmax, Dropout, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.applications import InceptionV3
import tensorflow.keras

img_height = 299 
img_width = 299


initializer = tf.keras.initializers.he_normal(seed=SEED)
regularizer = tf.keras.regularizers.l2(0.001) #weight decay

# IMAGE FEATURE EXTRACTOR
pre_trained_model = InceptionV3(include_top=False, weights='imagenet', input_shape = (img_height, img_width, 3))
for layer in pre_trained_model.layers:
    layer.trainable = False

img_features_extractor = Flatten()(pre_trained_model.output)

img_features_extractor = Dense(1024,kernel_regularizer=regularizer, kernel_initializer = initializer)(img_features_extractor)
img_features_extractor = BatchNormalization()(img_features_extractor)
img_features_extractor = ReLU()(img_features_extractor)

# LSTM FOR QUESTION EMBEDDING
# The questions are padded to MAX_WORD_LENGTH, so the input layer must expect that same length
question_input = Input(shape=[MAX_WORD_LENGTH], dtype='int32')

embedded_question = Embedding(input_dim=len(questions_wtoi) + 1, output_dim=64, input_length=MAX_WORD_LENGTH, mask_zero = True)(question_input)
encoded_question = LSTM(512 ,input_shape=(None, ),return_sequences = True)(embedded_question)

encoded_question = Dense(1024, kernel_regularizer=regularizer, kernel_initializer = initializer)(encoded_question)
encoded_question = BatchNormalization()(encoded_question)
encoded_question = ReLU()(encoded_question)

# COMBINE CNN AND LSTM TO MAKE UP THE FINAL MODEL
merged = tensorflow.keras.layers.Multiply()([img_features_extractor,encoded_question])
merged = tensorflow.keras.layers.BatchNormalization()(merged)

#OUTPUT CLASSIFICATION LAYER
output = Dense(len(answers_wtoi)+1, activation = 'softmax')(merged)

vqa_model = Model(inputs=[pre_trained_model.input, question_input], outputs=output)

In [None]:
vqa_model.summary()

In [None]:
# The output layer already applies softmax, so the loss receives probabilities, not logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

lr = 1e-4  # learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=lr, amsgrad=True)

metrics = ['accuracy']

# Compile Model
vqa_model.compile(optimizer=optimizer, loss=loss, metrics=metrics )

In [None]:
import PIL
from PIL import Image
from tqdm import tqdm

SPLITS = 13

def loadImages(list_, no_run):
    gc.collect()
    start = int((list_.shape[0]/SPLITS)*no_run)
    end = int( (list_.shape[0]/SPLITS)*no_run+(list_.shape[0]/SPLITS) -1 )
    no_data = end - start

    print( 'Split n°: ' + str(no_run+1) + ' --> From index: ' + str(start) + ' to index: ' + str(end))

    img_list = np.empty((no_data, img_height, img_width, 3) , dtype=np.float32)
    q_list = np.empty((no_data, MAX_WORD_LENGTH), dtype=np.int32)
    targets = np.empty((no_data, MAX_WORD_LENGTH), dtype=np.int32)
    
    k = 0  # write position inside the pre-allocated arrays

    for i in tqdm(range(start, end)):
        targets[k] = (pad_answers[i])
        img = PIL.Image.open(list_[i][images_idx])
        img = img.convert('RGB')
        img = img.resize((img_height, img_width), resample=Image.NEAREST)
        img = np.array(img)  
        img_list[k] = tf.keras.applications.inception_v3.preprocess_input(img)
        q_list[k] = (list_[i][0])
        k += 1
    gc.collect()
    return q_list, img_list, targets

In [None]:
train_valid_ = np.array(list(zip(pad_train_questions, np.array(train_images))))
target = pad_answers

In [None]:
num_ep = 10
bs = 32

for ep in range(0, num_ep) :
  for i in range(0, SPLITS) :
    print('Epoch n°: ' + str(ep+1))
    # Load one split at a time to stay within the RAM limit
    questions, images, targets = loadImages(train_valid_,i)
    vqa_model.fit(x = [images, questions],
                y = targets,
                validation_split = 0.2,
                epochs = 1,
                batch_size = bs,
                steps_per_epoch = (len(questions)*0.8)//bs,
                validation_steps = (len(questions)*0.2)//bs,
                verbose = 1
                )
    # Free the memory held by the current split before loading the next one
    del questions 
    del images
    del targets 


In [None]:
#--------------Saving the model---------------
#---------------------------------------------

from datetime import datetime

savedir ='./savedModels'

if not os.path.exists(savedir):
  os.makedirs(savedir) 
  
savePath =  os.path.join(savedir, 'VQA_model' + datetime.now().strftime('%b%d_%H-%M-%S')+'.h5')
vqa_model.save(savePath)

# ***TESTING***

In [None]:
# This cell is used to select which model to test ( either one of those saved or the one just trained )

# model = tf.keras.models.load_model('path_to_the_saved_model_to_use')
model = vqa_model

In [None]:
def test_json_analyzer(json_) :
  X = []
  for key in json_ :
    image = os.path.join('../input/anndlo2020vqa/VQA_Dataset/Images/', (json_[key]['image_id']+'.png'))
    # Parenthesize so that lower() and replace() apply to the whole question, not only to ' <eos>'
    question = (json_[key]['question'] + ' <eos>').lower().replace("/" , " ")
    X.append( (key, question , image) )
  return X

In [None]:
test = test_json_analyzer(test_jsonLoad)
test = np.array(test)

id_idx = 0
question_idx = 1
images_idx = 2

test_ids = test[:, id_idx]
test_questions = list(test[:, question_idx])
test_images = list(test[:, images_idx])

In [None]:
test_questions_tokenized = questions_tokenizer.texts_to_sequences(test_questions)
pad_test_questions = pad_sequences(test_questions_tokenized, maxlen=MAX_WORD_LENGTH, padding='post')

test = np.array(list(zip(pad_test_questions, np.array(test_images))))

In [None]:
gc.collect()

In [None]:
import PIL
from PIL import Image
from tqdm import tqdm

img_height = 299
img_width = 299

def load_test_images(list_):
    gc.collect()
    no_data = list_.shape[0]

    img_list = np.empty((no_data, img_height, img_width, 3) , dtype=np.float32)
    q_list = np.empty((no_data, MAX_WORD_LENGTH), dtype=np.int32)
    k = 0  # write position inside the pre-allocated np.arrays

    for i in tqdm(range(0, no_data)):
        img = PIL.Image.open(list_[i][1])
        img = img.convert('RGB')
        img = img.resize((img_height, img_width), resample=Image.NEAREST)
        img = np.array(img)  
        img_list[k] = tf.keras.applications.inception_v3.preprocess_input(img)
        q_list[k] = (list_[i][0])
        k += 1
    gc.collect()
    return q_list, img_list

In [None]:
questions, images = load_test_images(test)

In [None]:
gc.collect()

pred = model.predict(x=[images,questions])

del questions
del images
gc.collect()

In [None]:

def take_test_result(pred_) :
  results = []
  for k in pred_:
    result = []
    for i in k:
      first = True
      maxx = []
      idx = 0
      for j in i:
        if first:
          first = False
          maxx = [j, idx]
        if j > maxx[0]:
          maxx = [j, idx]
        idx += 1
      result.append(maxx)
    results.append(result)
  return np.array(results)

In [None]:
results = take_test_result(pred)  # 'pred' is the name assigned by model.predict above

In [None]:
def get_test_word(list_):
  predicted_aswers = []
  for k in list_:
    predicted_asw = []
    # k.shape = (20,2)
    for el in k:
      # el is the tuple ( probability , predicted word )
      ch = int(el[1]) - 1 # the padding value now becomes -1. This is where we recover the initial values of the dictionary.
      if (ch!= -1) :
        predicted_asw.append(ch)
    predicted_aswers.append(predicted_asw)
  return np.array(predicted_aswers)

In [None]:
answers = get_test_word(results)

In [None]:
csv_results = []
i=0
for el in test_ids :
  asw = answers[i][1]
  csv_results.append(
      [str(el), asw]
      )
  i+=1
csv_results = np.array(csv_results)

In [None]:
import os
from datetime import datetime

def create_csv(results, results_dir='./'):

    csv_fname = 'results_'
    csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

    with open(os.path.join(results_dir, csv_fname), 'w') as f:

        f.write('Id,Category\n')

        for el in results:
            f.write(str(el[0])+ ',' + str(el[1]) + '\n')

In [None]:
create_csv(csv_results)