
# **Group "Deep Learning Warlords"**

Jean Paul Guglielmo **Baroni**\
Maurizio **Cerisola**\
Davide **Maran**

**INDEX:**
1. Dataset management
2. Embedding Matrix
3. Hyperparameter tuning
4. Final Model


## 1
After loading and unzipping the dataset, we defined two dictionaries: one maps each of the 58 answer classes to an integer, the other maps each class to a "macro label". The macro labels are 11 categories of answers, such as "numbers", "yes/no answers", "colors" and so on.

The key property of the macro labels is that they can be inferred from the text of the question alone, through the function `tipo`: e.g. if the word "ball" appears in the question, the macro label will almost surely be "game".
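To illustrate how keyword rules of this kind route a question, here is a minimal sketch. The function `guess_macro_label` and its three rules are a simplified, hypothetical stand-in for `tipo`, not the full rule set used in the notebook. The point it shows is that rules are checked in order and the first match wins.

```python
# Hypothetical, simplified stand-in for `tipo`: three rules, checked in order.
YESNO_STARTS = {"Is", "Are", "Does", "Do", "Can"}

def guess_macro_label(question):
    words = question.split()
    # Rule 1: counting questions
    if words[:2] == ["How", "many"]:
        return "number"
    # Rule 2: questions that open with an auxiliary verb
    if words and words[0] in YESNO_STARTS:
        return "yesno"
    # Rule 3: a keyword anywhere in the question
    if "color" in question:
        return "color"
    return ""  # no rule matched: fall back to a generic model

for q in ["How many dogs are there?",
          "Is the color of the ball red?",
          "What color is the ball?"]:
    print(q, "->", guess_macro_label(q))
```

The second question contains the keyword "color" but is still routed to `yesno`, because the auxiliary-verb rule is checked first.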

Having partitioned the questions this way, we can identify the macro label up front and train one neural network per macro label. Within each partition, the data generator randomly selects a batch of questions from the training set and returns:
1. an array representing the images in the current batch;
2. an array containing the questions in the current batch;
3. an array of the one-hot-encoded expected outputs in the current batch.

The test data is instead returned by a similar function (not a generator), which takes the codes identifying the samples and needs neither batching nor expected outputs.

## 2
To let the network capture the semantics of the questions, we relied on the GloVe embedding space.

First, we computed the embedding matrix, in particular: 
1. we downloaded the GloVe dictionary; 
2. we looked for our words;
3. we truncated the GloVe representation according to the chosen embedding dimension;
4. we filled the embedding matrix.
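The steps above can be sketched on toy data. This is only an illustration under stated assumptions: the in-memory `glove_txt` stands in for the real GloVe file, and the 4-dimensional vectors, the three-word `word_index` and the truncation to 3 dimensions are made up for the example.

```python
import io
import numpy as np

# Toy stand-in for the GloVe text file: one "word v1 v2 ..." entry per line.
glove_txt = io.StringIO(
    "dog 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "car 0.9 1.0 1.1 1.2\n"
)

word_index = {"dog": 1, "cat": 2, "zebra": 3}  # index 0 is reserved for padding
embedding_dim = 3                               # truncate the 4-d toy vectors to 3

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for line in glove_txt:
    word, *vector = line.split()
    if word in word_index:
        # Parse the vector, truncate it, and store it at the word's row
        row = np.array(vector, dtype=np.float32)[:embedding_dim]
        embedding_matrix[word_index[word]] = row

print(embedding_matrix.shape)  # (4, 3)
```

Words that do not appear in the embedding file (here "zebra") keep an all-zero row, as does the reserved padding index 0.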

The whole operation is quite expensive, but for a fixed embedding dimension it needs to be done only once for all the training runs.
The embedding dimension is a hyperparameter; it was set to 300 after a trial-and-error procedure.

The embedding matrix is then used to initialize the weights of the embedding layer at the input of the network branch that handles the questions.

## 3
Since the input consists of both an image and a question, we built the network with two branches, one extracting features from the image and the other from the question. For the first branch, we used transfer learning from popular architectures. We chose InceptionResNetV2 after also trying VGG16, DenseNet and MobileNet, because of its slightly better performance.

For the question branch, we inserted two bidirectional LSTM layers after the previously described embedding layer. The outputs of the two branches are concatenated and then merged by a dense layer which connects to the final output layer.

The main problem of this network is overfitting:
1. for the first branch, we addressed it with an average pooling layer after the convolutional part and L2 regularization on the dense layer that follows, which is prone to overfitting;
2. for the second branch, we preferred to use several dropout layers: after each of the two LSTMs and before the last layer.

We tried to enlarge the first branch by adding a second fully connected layer, and the second branch by stacking three LSTM layers, but both changes increased overfitting.

On the other hand, shrinking the model by reducing the number of neurons in each layer leads to a drop in overall performance.

Finally, we added a BatchNormalization layer just before the head, which slightly improves the results.

## 4
To start training, we only need to choose the learning rate and the batch size. The learning rate is chosen to be relatively large, since otherwise the network stops learning too early in the epochs. As for the batch size, we did not encounter any memory issues; we settled on 12 by trial and error.
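The training cell further down pairs the initial learning rate of 1e-3 with a `ReduceLROnPlateau` callback using `factor=0.2`. As a rough illustration of what each reduction does, the sketch below multiplies the rate by 0.2 three times. The helper `decayed_rates` is hypothetical and ignores the callback's patience and `min_delta` logic.

```python
def decayed_rates(initial_lr, factor, n_reductions):
    """Learning rates after 0, 1, ..., n_reductions plateau reductions."""
    rates = [initial_lr]
    for _ in range(n_reductions):
        rates.append(rates[-1] * factor)
    return rates

for lr in decayed_rates(1e-3, 0.2, 3):
    print(f"{lr:.1e}")
```

Starting from a relatively large rate leaves room for a few such reductions before the steps become too small to make progress.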

# **Setup**

In [None]:
import json
import random
import os
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf



In [None]:
from google.colab import drive  
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir('/content/drive/My Drive/AN2DLthird/')
random.seed(0)

# **Dataset Extraction (only first time)**

In [None]:
os.chdir('/content/')
!unzip /content/drive/MyDrive/AN2DLthird/anndl-2020-vqa.zip

Output streaming truncated to the last 5000 lines.
  inflating: VQA_Dataset/Images/5390.png  
  inflating: VQA_Dataset/Images/5391.png  
  inflating: VQA_Dataset/Images/5392.png  
  inflating: VQA_Dataset/Images/5393.png  
  inflating: VQA_Dataset/Images/5394.png  
  inflating: VQA_Dataset/Images/5395.png  
  inflating: VQA_Dataset/Images/5396.png  
  inflating: VQA_Dataset/Images/5397.png  
  inflating: VQA_Dataset/Images/5398.png  
  inflating: VQA_Dataset/Images/5399.png  
  inflating: VQA_Dataset/Images/54.png  
  inflating: VQA_Dataset/Images/540.png  
  inflating: VQA_Dataset/Images/5400.png  
  inflating: VQA_Dataset/Images/5401.png  
  inflating: VQA_Dataset/Images/5402.png  
  inflating: VQA_Dataset/Images/5403.png  
  inflating: VQA_Dataset/Images/5404.png  
  inflating: VQA_Dataset/Images/5405.png  
  inflating: VQA_Dataset/Images/5406.png  
  inflating: VQA_Dataset/Images/5407.png  
  inflating: VQA_Dataset/Images/5408.png  
  inflating: VQA_Dataset/Images/5409

In [None]:
# Copies the extracted dataset into the working directory (only if it is not already there)
import shutil, os
if not os.path.exists("VQA_Dataset"):
  shutil.copytree("/content/VQA_Dataset","VQA_Dataset")

https://www.kaggle.com/takuok/glove840b300dtxt

In [None]:
!unzip /content/drive/MyDrive/AN2DLthird/glove.840B.300d.txt.zip

Archive:  /content/drive/MyDrive/AN2DLthird/glove.840B.300d.txt.zip
  inflating: glove.840B.300d.txt     


# **Dataset Management**

In [None]:
# Intended Outputs
labels_dict = {
  '0': 0,
  '1': 1,
  '2': 2,
  '3': 3,
  '4': 4,
  '5': 5,
  'apple': 6,
  'baseball': 7,
  'bench': 8,
  'bike': 9,
  'bird': 10,
  'black': 11,
  'blanket': 12,
  'blue': 13,
  'bone': 14,
  'book': 15,
  'boy': 16,
  'brown': 17,
  'cat': 18,
  'chair': 19,
  'couch': 20,
  'dog': 21,
  'floor': 22,
  'food': 23,
  'football': 24,
  'girl': 25,
  'grass': 26,
  'gray': 27,
  'green': 28,
  'left': 29,
  'log': 30,
  'man': 31,
  'monkey bars': 32,
  'no': 33,
  'nothing': 34,
  'orange': 35,
  'pie': 36,
  'plant': 37,
  'playing': 38,
  'red': 39,
  'right': 40,
  'rug': 41,
  'sandbox': 42,
  'sitting': 43,
  'sleeping': 44,
  'soccer': 45,
  'squirrel': 46,
  'standing': 47,
  'stool': 48,
  'sunny': 49,
  'table': 50,
  'tree': 51,
  'watermelon': 52,
  'white': 53,
  'wine': 54,
  'woman': 55,
  'yellow': 56,
  'yes': 57
}

labels_macro = {
  '0': 'number',
  '1': 'number',
  '2': 'number',
  '3': 'number',
  '4': 'number',
  '5': 'number',
  'apple': 'food',
  'baseball': 'game',
  'bench': 'object',
  'bike': 'object',
  'bird': 'animal',
  'black': 'color',
  'blanket': 'object',
  'blue': 'color',
  'bone': 'object',
  'book': 'object',
  'boy': 'person',
  'brown': 'color',
  'cat': 'animal',
  'chair': 'object',
  'couch': 'object',
  'dog': 'animal',
  'floor': 'object',
  'food': 'food',
  'football': 'game',
  'girl': 'person',
  'grass': 'object',
  'gray': 'color',
  'green': 'color',
  'left': 'position',
  'log': 'object',
  'man': 'person',
  'monkey bars': 'object',
  'no': 'yesno',
  'nothing': 'action',
  'orange': 'color',
  'pie': 'food',
  'plant': 'object',
  'playing': 'action',
  'red': 'color',
  'right': 'position',
  'rug': 'object',
  'sandbox': 'object',
  'sitting': 'action',
  'sleeping': 'action',
  'soccer': 'game',
  'squirrel': 'animal',
  'standing': 'action',
  'stool': 'object',
  'sunny': 'weather',
  'table': 'object',
  'tree': 'object',
  'watermelon': 'food',
  'white': 'color',
  'wine': 'drink',
  'woman': 'person',
  'yellow': 'color',
  'yes': 'yesno'
}

yesnoverbs = ['Is','Are','Does','Has','Can','Do','Could','Should','The','Will','Did','Would']
foods = ['fruit','pie','eat','eating','food']
colors = ['color','What kind of wine','What type of wine']
games = ['ball']
positions = ['direction']
animals = ['animal','pet']
weathers = ['weather','sunny','cloudy','rainy']
drinks = ['drinking']

def tipo(s):
  seq=s.split()
  if (seq[0]=='How' and seq[1]=='many') or (seq[0]=='What' and seq[1]=='number'):
    return 'number'
  if yesnoverbs.count(seq[0]):
    return 'yesno'
  if any(w in s for w in weathers):
    return 'weather'
  if any(w in s for w in colors):
    return 'color'
  if seq[0] == 'What' and any(w in seq for w in games):
    return 'game'
  if any(w in seq for w in positions) or 'Which side' in s or 'What hand' in s:
    return 'position'
  if any(w in seq for w in animals):
    return 'animal'
  if 'doing' in s:
    return 'action'
  if 'What' in s and any(w in s for w in drinks):
    return 'drink'
  if 'Where is' in s: #  10% false positive
    return 'object'
  if 'Who' in s: #  20% false positive
    return 'person'
  return ''

In [None]:
# Opens the training questions

# Train & Val
with open('/content/drive/MyDrive/AN2DLthird/VQA_Dataset/train_questions_annotations.json', 'r') as f:
  data_raw = json.load(f)

selected_macro = 'yesno' #### SELECT MACROLABEL
data_raw = {dr:data_raw[dr] for dr in data_raw if tipo(data_raw[dr]['question'])==selected_macro}

# Test
with open('VQA_Dataset/test_questions.json', 'r') as f:
  test_raw = json.load(f)

# Splits Validation and Training Datasets
val_split = 0.02  ## yesno = 0.02
if selected_macro != 'yesno':
  val_split = 0.2
data_keys = list(data_raw.keys())
random.shuffle(data_keys)
val_code = data_keys[:round(val_split*len(data_keys))]
train_code = data_keys[round(val_split*len(data_keys)):]
test_code = list(test_raw.keys())
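The split above is a shuffle followed by a slice. A self-contained sketch with placeholder keys follows; the helper `split_keys` is hypothetical, and 31250 is the number of elements reported in the training log for the `yesno` macro label.

```python
import random

def split_keys(keys, val_split, seed=0):
    """Shuffle a copy of `keys` and slice off the first `val_split` fraction."""
    keys = list(keys)
    random.Random(seed).shuffle(keys)
    n_val = round(val_split * len(keys))
    return keys[n_val:], keys[:n_val]  # (train, val)

train, val = split_keys(range(31250), val_split=0.02)
print(len(train), len(val))  # 30625 625
```

With `val_split = 0.02`, only 625 questions are held out for validation, and the two sets are disjoint.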

In [None]:
macro = selected_macro

true_positive=[]
false_positive=[]
false_negative=[]
for key in data_raw:
  el = data_raw[key]
  label = labels_macro[el['answer']]
  pred = tipo(el['question'])
  if pred != '':
    if label == macro and pred == macro:
      true_positive.append(el)
    if label != macro and pred == macro:
      false_positive.append(el)
    if label == macro and pred != macro:
      false_negative.append(el)

print(len(true_positive), "true positives")
print(len(false_positive), "false positives")
print(len(false_negative), "false negatives")
#for key in data_raw:
#  el = data_raw[key]
#  if(labels_macro[el['answer']]==macro):
#    print(el['question'])

30930 true positives
320 false positives
0 false negatives
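From these counts, the precision of the `yesno` rule follows directly. The calculation below is a small worked example. Note that `data_raw` was already filtered to the questions that `tipo` assigns to `selected_macro`, so `pred == macro` holds for every element and the false-negative count is zero by construction; the recall computed here is therefore not informative.

```python
tp, fp, fn = 30930, 320, 0  # counts printed above

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"precision = {precision:.4f}")  # 0.9898
print(f"recall    = {recall:.4f}")     # 1.0000 (trivially, see the note above)
```

In other words, about 1% of the questions routed to the `yesno` model have an answer outside that macro label.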


In [None]:
# Dataset Settings

random.seed(0)

img_h = 400
img_w = 700 

max_words = 20 # max words per sentence
embedding_dim = 300 # dim of embedding (suggested 300)
max_len = 20 # max sequence length (in tokens) after padding
batch_size = 32 #12 

classes = list(labels_dict.keys())
num_classes = len(classes)

# Encodes the classes
label_encoder = LabelEncoder()
integer_encoder_ = label_encoder.fit(classes)
integer_encoded = integer_encoder_.transform(classes)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoder_ = onehot_encoder.fit(integer_encoded)
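`LabelEncoder` assigns integer codes in sorted order of the class names, and the one-hot vector simply has a 1 at that position. A pure-Python sketch of the same mapping on a three-class subset follows; the helper `one_hot` is hypothetical and only meant to illustrate the encoding.

```python
classes = ["yes", "no", "apple"]  # small subset of the 58 answers
# Sorted order gives apple=0, no=1, yes=2, as LabelEncoder would
index_of = {c: i for i, c in enumerate(sorted(classes))}

def one_hot(label):
    vec = [0.0] * len(index_of)
    vec[index_of[label]] = 1.0
    return vec

print(one_hot("no"))  # [0.0, 1.0, 0.0]
```

Since the keys of `labels_dict` are listed in sorted order, the integer codes produced by the encoder coincide with the values in `labels_dict`.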

In [None]:
# Data Generator

def data_generator(tokenizer, dataset='train', batch_size = 12):

  raw = data_raw

  # Sets the dataset
  if dataset == 'train':
    codes = train_code
  elif dataset == 'val':
    codes = val_code
  else:
    raise ValueError('Dataset must be either "train" or "val"')


  idxs_list = []
  # Generates
  while True:

    # Inits batches
    batch_input_img = []
    batch_input_txt = []
    batch_output = [] 

    if len(idxs_list) == 0:
      range_idxs = list(range(0, len(codes)))
      random.shuffle(range_idxs)
      idxs_list = [range_idxs[i*batch_size:(i+1)*batch_size] for i in range(len(codes)//batch_size)]

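    idxs = idxs_list[0].copy()
    idxs_list.pop(0)
    # NOTE: the next line overrides the shuffled batch taken above,
    # so each batch is an independent random sample of `codes`
    idxs = np.array(random.sample(range(0, len(codes)), batch_size))
    batch_addresses = [codes[i] for i in idxs]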

    # Goes through the selected batches elements
    for i in batch_addresses:
      # IMAGE
      image_name = raw[i]['image_id']
      img = Image.open('VQA_Dataset/Images/' + str(image_name)+'.png').convert('RGB')
      img_array = np.array(img)
      img_array = np.expand_dims(img_array, 0) # gets the batch dim
      batch_input_img += [ np.true_divide(img_array,255) ]

      # QUESTION
      batch_input_txt += [ raw[i]['question'] ]
      
      # ANSWER
      output = raw[i]['answer']
      batch_output += [ output ]
    
    # Stacks the batch lists into arrays
    batch_x_img = np.array( batch_input_img )
    batch_x_txt = np.array( batch_input_txt )
    batch_x_resp = np.array( batch_output )
    batch_x_img = batch_x_img[:,-1]  # drops the singleton axis added by expand_dims
    # Prepares sequences with tokens and padding
    tokenized = tokenizer.texts_to_sequences(batch_x_txt)
    batch_x_txt = pad_sequences(tokenized, padding = 'post', maxlen = max_len) 
    
    # Yields the processed data
    batch_y = np.array( batch_output )
    y_c = integer_encoder_.transform(batch_y)
    y_c = y_c.reshape(len(y_c), 1)
    batch_y = onehot_encoder_.transform(y_c)
    yield ([batch_x_img,batch_x_txt], batch_y )
    
def data_test(tokenizer,codes):
  raw = test_raw

  batch_input_img = []
  batch_input_txt = []

  # Generates
  for i in codes:
    # IMAGE
    image_name = raw[i]['image_id']
    img = Image.open('VQA_Dataset/Images/' + str(image_name)+'.png').convert('RGB')
    img_array = np.array(img)
    img_array = np.expand_dims(img_array, 0) # gets the batch dim
    batch_input_img += [ np.true_divide(img_array,255) ]

    # QUESTION
    batch_input_txt += [ raw[i]['question'] ]
  
  # Stacks the inputs into arrays (the test set has no expected outputs)
  batch_x_img = np.array( batch_input_img )
  batch_x_txt = np.array( batch_input_txt )
  batch_x_img = batch_x_img[:,-1]    
  # Prepares sequences with tokens and padding
  tokenized = tokenizer.texts_to_sequences(batch_x_txt)
  batch_x_txt = pad_sequences(tokenized, padding = 'post', maxlen = max_len) 
    
  return [batch_x_img,batch_x_txt]
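With `padding='post'` and `maxlen=max_len`, `pad_sequences` appends zeros to short sequences; since `truncating` is left at its default (`'pre'`), sequences longer than `maxlen` lose their first tokens. The pure-Python sketch below mimics that behaviour for illustration only. `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(sequences, maxlen):
    """Zero-pad at the end; truncate from the front (Keras default truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

print(pad_post([[4, 7, 2], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[4, 7, 2, 0, 0], [2, 3, 4, 5, 6]]
```

Because the opening words of a question ("How many", "Is", "What color") carry most of the type information, a question longer than `max_len` tokens would lose exactly those words under the default truncation.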

In [None]:
# Tokenizer

ita_tokenizer = Tokenizer(num_words = 200000)

def create_tokens(tokenizer):
    tot_txt = []
    for key in data_raw:
      tot_txt += [data_raw[key]['question']] 
        
    tokenizer.fit_on_texts(tot_txt)
    return tokenizer 

Token = create_tokens(ita_tokenizer)
word_index = Token.word_index
vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index

# **Embedding Matrix**

In [None]:
# Embedding Matrix

# Builds doms, the list of all the training questions
doms=[]
for d in train_code:
  doms.append(data_raw[d]['question'])

vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Scans the GloVe text file for the words in our vocabulary to fill the embedding matrix
with open('/content/drive/MyDrive/AN2DLthird/glove.840B.300d.txt') as f:
  count = 0
  for line in f:
    word, *vector = line.split()
    if word in word_index and count<(len(word_index)-1):
      idx = word_index[word] 
      try:
        embedding_matrix[idx] = np.array(
          vector, dtype=np.float32)[:embedding_dim]
        count += 1
      except ValueError:
        # skips malformed lines whose vector cannot be parsed as floats
        pass


# **Model**

In [None]:
from tensorflow.keras import regularizers

#arch =  tf.keras.applications.vgg16.VGG16(include_top=False, weights='imagenet', input_shape=(img_h, img_w, 3))

arch = tf.keras.applications.InceptionResNetV2(include_top=False, weights='imagenet', input_shape=(img_h, img_w, 3))

freeze_until = -1

for layer in arch.layers[:freeze_until]:
      layer.trainable = False

branch1 = arch.output

branch1 = tf.keras.layers.AveragePooling2D(pool_size=(4,4), strides=4, padding="valid") (branch1)
branch1 = tf.keras.layers.Flatten() (branch1)
#branch1 = tf.keras.layers.GlobalMaxPooling2D() (branch1)
branch1 = tf.keras.layers.Dense(128, activation='tanh', trainable = True, kernel_regularizer = regularizers.l2(0.04)) (branch1)

#branch1 = tf.keras.layers.Dense(128, activation='tanh') (branch1)

text_inputs = tf.keras.Input(shape=[max_len])

#bidirectional 
emb = tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                            input_length=max_words,
                            weights=[embedding_matrix], 
                            trainable=True) (text_inputs)

branch2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128,return_sequences=True))(emb)
branch2 = tf.keras.layers.Activation('tanh')(branch2)
branch2 = tf.keras.layers.Dropout(0.25)(branch2)

branch2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(branch2)
branch2 = tf.keras.layers.Activation('tanh')(branch2)
branch2 = tf.keras.layers.Dropout(0.25)(branch2)

#straightforward concatenation
joint = tf.keras.layers.concatenate([branch1, branch2])
#joint = tf.keras.layers.Dropout(0.5)(joint)
joint = tf.keras.layers.Dense(256, activation='relu', kernel_regularizer = regularizers.l2(0.01))(joint)
joint = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.01, epsilon=0.001, center=True, scale=True)(joint)
joint = tf.keras.layers.Activation('relu')(joint)
joint = tf.keras.layers.Dropout(0.25)(joint)

predictions = tf.keras.layers.Dense(num_classes, activation='softmax')(joint)

model = tf.keras.models.Model(inputs=[arch.input, text_inputs], outputs=[predictions])



model.summary()

loss = tf.keras.losses.CategoricalCrossentropy()
lr = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

model.compile(loss = loss,
                    optimizer = optimizer,
                    metrics = ['accuracy'])

callbacks=[]
callbacks.append(tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 10,restore_best_weights=True))


callbacks.append(tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, verbose=1, mode='auto', min_delta=0.001, cooldown=0, min_lr=0))



Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_resnet_v2/inception_resnet_v2_weights_tf_dim_ordering_tf_kernels_notop.h5
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 400, 700, 3) 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 199, 349, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 199, 349, 32) 96          conv2d[0][0]                     
__________________________________________________________________________________________________
activation (Activation)   

In [None]:

n_ds = len(data_raw)
epochs = 20
#spe = min(1000,n_ds//epochs)
spe = 3*n_ds//batch_size//epochs # goes 3 times on the whole dataset
print(n_ds, "elements, we chose", spe, "steps for each of the", epochs, "epochs")

model.fit(data_generator(Token,"train",batch_size), validation_data = data_generator(Token,"val",batch_size), steps_per_epoch = spe, validation_steps = min(spe,200), epochs=epochs, callbacks=callbacks, verbose=1) #, workers=8, use_multiprocessing=True, max_queue_size=100)

    


31250 elements, we chose 146 steps for each of the 20 epochs
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20

Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 11/20
Epoch 12/20
Epoch 13/20

Epoch 00013: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20

Epoch 00019: ReduceLROnPlateau reducing learning rate to 8.000000525498762e-06.
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fd3372338d0>
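The step count printed in the log follows from the integer arithmetic in `spe`. A worked check with the values from this run (`n_ds = 31250`, `batch_size = 32`, `epochs = 20`):

```python
n_ds, batch_size, epochs = 31250, 32, 20

spe = 3 * n_ds // batch_size // epochs    # steps per epoch
samples_seen = spe * batch_size * epochs  # samples drawn over the whole training
passes = samples_seen / n_ds

print(spe)               # 146
print(samples_seen)      # 93440
print(round(passes, 2))  # 2.99
```

Because the generator draws each batch by random sampling, this amounts to roughly three datasets' worth of samples rather than three exact passes over every question.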

In [None]:
os.chdir('/content/drive/My Drive/AN2DLthird/')
model.save('modelli/model_'+selected_macro+'.h5')

# **Macro labels models Merge**

In [None]:
from tensorflow.keras.models import load_model

#Load models 
model_number=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_number')
model_yesno=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_yesno')
model_weather=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_weather')
model_color=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_color')
model_game=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_game')
model_position=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_position')
model_animal=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_animal')
model_action=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_action')
model_drink=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_drink')
model_object=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_object')
model_person=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_person')
model_=load_model('/content/drive/MyDrive/AN2DLthird/modelli/model_') #residuals


In [None]:
#Predict
result = []
risposta = {}  # separate dict, so that test_raw is not overwritten while it is read

# Macro label -> dedicated model ('' is the fallback for unrecognized questions)
models = {
  'number': model_number, 'yesno': model_yesno, 'color': model_color,
  'game': model_game, 'position': model_position, 'animal': model_animal,
  'action': model_action, 'object': model_object, 'person': model_person,
  '': model_,
}
# Macro labels with a single possible answer need no model
constant_answers = {'weather': 'sunny', 'drink': 'wine'}

cont = 0
for key in test_raw:

  cont = cont + 1
  if cont % 100 == 0:
    print(cont)

  num = tipo(test_raw[key]['question'])

  if num in constant_answers:
    # np.argmax of a string does not give the class index: look it up directly
    res = constant_answers[num]
    risposta[key] = labels_dict[res]
  else:
    # single-element batch for this question, built with data_test
    inp = data_test(Token, [key])
    res = models[num].predict(inp)
    risposta[key] = int(np.argmax(res))

  result.append(res)



100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
5000
5100
5200
5300
5400
5500
5600
5700
5800
5900
6000
6100
6200
6300


# **Create CSV**

In [None]:
#Create CSV
import os
from datetime import datetime

def create_csv(results, results_dir='/content/drive/MyDrive/AN2DLthird/'):

    csv_fname = 'results_'
    csv_fname += datetime.now().strftime('%b%d_%H-%M-%S') + '.csv'

    with open(os.path.join(results_dir, csv_fname), 'w') as f:

        f.write('Id,Category\n')

        for key, value in results.items():
            f.write(key + ',' + str(value) + '\n')

create_csv(risposta)