# Multimodal Classification Kaggle Competition

Thomas Nigoghossian 

This multimodal classification problem is a supervised classification problem: from labeled images and sound files, a model capable of classifying new inputs must be established.

This notebook was made during an ENSTA competition on Kaggle and it ranked 1st with an accuracy of 0.99942.

In this notebook, I explain my approach to achieve this result and I propose some analysis of the result (including the quality of the dataset).

# Initialisation

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

datadir = '/kaggle/input/multimodal-classification-2021-mi203'

import os
'''
for dirname, _, filenames in os.walk(datadir):
    for filename in filenames:
        print(os.path.join(dirname, filename))
'''
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


traindata = pd.read_csv(os.path.join(datadir,'data/data_train.csv'), delimiter=',', nrows = None)
data = np.array(traindata)

y_train = data[:,-1].astype('int32')

audio_train = data[:, 1:-1]

trainimg_list = traindata['IMAGE']

testdata = pd.read_csv(os.path.join(datadir,'data/data_test_novt.csv'), delimiter=',', nrows = None)
data = np.array(testdata)

audio_test = np.array(data[:, 1:], dtype='float64')

testimg_list = testdata['IMAGE']



In [None]:
#====================================================================
# Split the data into training and validation sets

import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn.model_selection import train_test_split

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(audio_train)

audio_train_scaled = scaler.transform(audio_train)

Xn, Xv, yn, yv = train_test_split(audio_train_scaled, y_train,
                                                    random_state=42,)

# I. Sound approach

**In this first part, we focus only on the audio data**.
That is, for each element of the dataset we have about a hundred Mel-Frequency Cepstral Coefficients (MFCC), which are widely used for feature extraction from audio files.

# 1.  Decision Tree

We have a classification problem: from numerical data, we have to assign a label to a sound file.
Our first approach is to build a simple model to establish a baseline and to check whether a more complex model is necessary.

For this we chose to start with decision trees.
We could have tried a k-nearest neighbors model. However, our dataset contains too many features to hope for a good result: we would almost certainly have run into the curse of dimensionality.
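
The concern about k-nearest neighbors can be illustrated with a small experiment. The sketch below uses synthetic uniform random points (an assumption made for illustration, not the MFCC data) and measures how the relative gap between the nearest and the farthest neighbor of a query point shrinks as the number of features grows. `relative_contrast` is a hypothetical helper name.

```python
import math
import random

def relative_contrast(n_features, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_features)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_features)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# The contrast drops sharply as the dimension increases.
for d in (2, 10, 100):
    print(d, round(relative_contrast(d), 2))
```

When this contrast is small, "nearest" neighbors are barely closer than any other point, which is why distance-based methods tend to struggle with about a hundred raw features.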

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion="gini",max_depth=30,min_samples_split=50, splitter="best")
clf = clf.fit(Xn,yn )

preds = clf.predict(Xv)


In [None]:
import sklearn.metrics as perf

# Display the scores
def show_result(preds):
    oa = perf.accuracy_score(yv, preds)
    bas = perf.balanced_accuracy_score(yv, preds)  # mean of the per-class recall
    cm = perf.confusion_matrix(yv, preds)
    print("Accuracy_score : ",oa,"Balanced_accuracy_score : ",bas)
    print("Confusion matrix")
    print(cm)

In [None]:
show_result(preds)

We define the tree_optimizer function. It allows us to determine the parameters that will give the best results for our decision tree. The parameters in question are: max_depth and min_samples_split.

In [None]:
def tree_optimizer():
    # Exhaustive search over max_depth and min_samples_split,
    # scored by balanced accuracy on the validation split (Xv, yv).
    res = 0
    best_max_depth = 2
    best_min_samples_split = 2
    for i in range(2,50):
        for j in range(2,50):
            clf = tree.DecisionTreeClassifier(criterion="gini",max_depth=i,min_samples_split=j, splitter="best")
            clf = clf.fit(Xn, yn)
            preds = clf.predict(Xv)
            acc = perf.balanced_accuracy_score(yv, preds)
            if (acc > res):
                res = acc
                best_max_depth = i
                best_min_samples_split = j
                print(acc)
    return best_max_depth, best_min_samples_split

Result
==
Thanks to this function, we quickly observe a plateau in the results.
Indeed, the maximum score obtained with a tree is: 0.6935632245695139  
Knowing that there are 9 different classes, a random model would score about 0.11 (1/9). A decision tree thus gives satisfactory results simply and quickly.
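
The 0.11 baseline and the balanced accuracy reported by `show_result` can both be reproduced by hand. The following is a minimal pure-Python sketch with made-up labels. `balanced_accuracy` is an illustrative re-implementation (mean of the per-class recall), not the scikit-learn function itself.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)

n_classes = 9
chance_level = 1 / n_classes  # uniform random guessing over 9 classes

# Toy labels (made up): class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1]
always_zero = [0, 0, 0, 0, 0]

print(round(chance_level, 2))
print(balanced_accuracy(y_true, always_zero))
```

A classifier that always answers the frequent class reaches 80% plain accuracy on these toy labels, but only 0.5 balanced accuracy, which is why the balanced score is the more informative one when classes are unevenly represented.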

# 2. Random Forest / Bagging

The next question is how to improve this first model. To do so, we will perform bagging with decision trees, i.e. we will build a random forest.

In [None]:
from sklearn.model_selection import train_test_split
from collections import Counter


def learn_forest(XTrain, yTrain, nb_trees, depth=15):
    forest = []
    singleperf=[]

    for ss in range(nb_trees):
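        # draw a random 80% subsample for this tree (without replacement);
        # the held-out 20% is used below to evaluate the single tree
        X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(
            XTrain, yTrain, test_size=0.2 )

        # single tree training (note: the `depth` argument is not used here)
        clf = tree.DecisionTreeClassifier(criterion="gini",max_depth=30,min_samples_split=50, splitter="best")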
        clf = clf.fit( X_train_sub, y_train_sub )

        # grow the forest
        forest.append( clf )

        # single tree evaluation
        curr_train_pred=clf.predict(X_train_sub)
        curr_test_pred=clf.predict(X_test_sub)
        singleperf.append([perf.balanced_accuracy_score( y_train_sub, curr_train_pred ), perf.balanced_accuracy_score( y_test_sub,curr_test_pred)])

    return forest,singleperf

# Return an array containing the predictions of our random forest (majority vote per sample)
def most_frequent(preds):
    res = np.empty((preds.shape[0],1))
    for i in range(preds.shape[0]):
        occurence_count = Counter(preds[i].tolist())
        res[i] = occurence_count.most_common(1)[0][0]
    return res

# Transpose the matrix in which each row holds the predictions of one tree into a matrix in which each row holds the predictions of all trees for the same sample
def array_transform(preds):
    final_array = np.empty((preds.shape[1],preds.shape[0]))
    for i in range(preds.shape[1]):
        for j in range(preds.shape[0]):
            final_array[i][j] = preds[j][i]
    return final_array

def predict_forest(forest, XTest, yTest = None):
  
    singleperf=[]
    all_preds=[]
    nb_trees = len(forest)
    for ss in forest:# nb_trees
        test_pred=ss.predict(XTest)
        all_preds.append(test_pred)

        if (yTest is not None):
            singleperf.append(perf.balanced_accuracy_score( yTest, test_pred ))

    all_preds=np.array(all_preds)
    #print(all_preds)

  # Vote
    final_preds = array_transform(all_preds)
    final_pred = most_frequent(final_preds)

    if (yTest is not None):
        return final_pred,singleperf
    else:
        return final_pred


F,singleperf = learn_forest(Xn,yn, 20, depth=15)
pred = predict_forest(F, Xv)

In [None]:
show_result(pred)

Result
==
With a random forest, we obtain a score of about 79%.
This improves on the single tree by about 10 points, but it is still not enough.
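
The voting step performed by `array_transform` and `most_frequent` can also be written more compactly. The sketch below is a pure-Python equivalent with made-up votes. `majority_vote` is an illustrative name, and ties are resolved in favor of the label seen first.

```python
from collections import Counter

def majority_vote(per_tree_predictions):
    """per_tree_predictions[t][i] is the label given by tree t to sample i."""
    per_sample = zip(*per_tree_predictions)  # transpose: one tuple of votes per sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

votes = [
    [0, 3, 5],  # tree 1
    [0, 4, 5],  # tree 2
    [1, 4, 5],  # tree 3
]
print(majority_vote(votes))
```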

# 3. Boosting / XGBoost

Next, we will use boosting to try to improve the model. To do this, we will use the xgboost library, which makes it simple to train a gradient boosting model.

https://xgboost.readthedocs.io/en/latest/index.html
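
To make the idea of boosting concrete, here is a minimal sketch on a toy one-dimensional regression problem. It is not what XGBoost does internally (XGBoost adds regularization, second-order information and much more), and it is not the classifier used in this notebook. `fit_stump` and `boost` are illustrative names. Each round fits a one-split tree (a stump) to the current residuals and adds it to the model, scaled by the learning rate.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimising squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Additive model: mean of ys plus a sum of scaled stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up): y = x^2 on 20 points.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
for rounds in (1, 10, 50):
    print(rounds, mean_squared_error(boost(xs, ys, n_rounds=rounds), xs, ys))
```

The training error decreases with the number of rounds. The learning rate controls how much each new tree is trusted: a large value converges in fewer rounds but fits the training data more aggressively.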

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, KFold

xgbc = XGBClassifier(booster='gbtree',max_depth=15,n_estimators=500,use_label_encoder=False,learning_rate=0.8)

xgbc.fit(Xn, yn)
scores = cross_val_score(xgbc, Xn, yn, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
 
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, Xn, yn, cv=kfold )
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())



In [None]:
ypred = xgbc.predict(Xv)

show_result(ypred)

In [None]:
from xgboost import plot_importance

plot_importance(xgbc)

Result
==
After a (long) training, we obtain a score of 90%, an increase of about 10 points over the previous method.  
The disadvantage of this method is that training is very slow, which makes it difficult to search for the best parameters.  
The advantage of this method, in addition to its performance, is that the classifier plugs directly into scikit-learn's cross-validation tools (`cross_val_score`, `KFold`).

# 4. SVM

Finally, we will leave the trees aside and experiment with SVMs.  
Using SVMs makes sense here since we have to separate points in a high-dimensional space. We will see that a non-linear kernel gives very good results.

NB: This approach is provided by the teacher. Only the optimization of the parameters has been modified.

# Linear SVM

In [None]:
#====================================================================
#   Training (SVM)

from sklearn import svm
svc = svm.LinearSVC(max_iter=10000, tol=1e-4, verbose=True, dual=False)

svc.fit(Xn, yn)

score_train = svc.score(Xn, yn)
score_test = svc.score(Xv, yv)

# Recognition rate on the training split / validation split
print("Recognition rate = {:.2f}% / {:.2f}%".format(score_train*100, score_test*100))

# Prediction on the competition test set
y_pred = svc.predict(scaler.transform(audio_test))
print(y_pred)

# RBF Kernel

In [None]:
#====================================================================
#   Training (SVM)

from sklearn import svm
svc = svm.SVC(kernel='rbf', max_iter=-1, verbose = True, C=10, gamma='auto')

svc.fit(Xn, yn)

score_train = svc.score(Xn, yn)
score_test = svc.score(Xv, yv)

# Recognition rate on the training split / validation split
print("Recognition rate = {:.2f}% / {:.2f}%".format(score_train*100, score_test*100))

# Prediction on the competition test set

y_pred = svc.predict(scaler.transform(audio_test))
print(y_pred)


In [None]:
y_pred = svc.predict(Xv)
show_result(y_pred)

Result
==
With an SVM we quickly get very good results. The best configuration uses an RBF (radial basis function) kernel.  
However, there is a gap between the score on the training set and the score on the validation set: the model is overfitting. Indeed, we obtain almost 100% accuracy on the training set but "only" 93% on the validation set.
To reduce overfitting, we have to tune C, which trades margin width against training errors, and gamma, which sets the reach of each training point under the RBF kernel.  
Note that polynomial kernels also give good results (around 90%).
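
To see what gamma does, the RBF kernel can be evaluated by hand. The sketch below is pure Python with made-up 4-dimensional points. `rbf_kernel` is an illustrative helper, and `gamma_auto` mirrors scikit-learn's `gamma='auto'` setting, which uses 1 / n_features.

```python
import math

def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x = [0.0, 0.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0, 0.0]
gamma_auto = 1 / len(x)  # 1 / n_features, as with gamma='auto'

# The similarity between x and y collapses as gamma grows.
for gamma in (gamma_auto, 1.0, 10.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

With a large gamma, two points must be very close to be considered similar, so the decision boundary can wrap tightly around individual training points, which favors overfitting. A smaller gamma (or a smaller C) gives a smoother boundary.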


# II. Image Approach

In this part, we will try to develop a model using only the images.

# Initialisation

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

datadir = '/kaggle/input/multimodal-classification-2021-mi203/data'
#datadir = 'data'

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

data_df = pd.read_csv(os.path.join(datadir,'data_train.csv'), delimiter=',', nrows = None)
data = np.array(data_df)

labels = data[:,-1].astype('int32')

audio = data[:, 1:-1].astype('float32')

img_list = data_df['IMAGE']


In [None]:
import torch
import random

torch.manual_seed(0)
torch.cuda.manual_seed(0)
random.seed(0)
np.random.seed(0)

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

In [None]:
import matplotlib.pyplot as plt
from PIL import Image

# display an image
idx = 0
class_list = ['FOREST', 'CITY', 'BEACH', 'CLASSROOM', 'RIVER', 'JUNGLE', 'RESTAURANT', 'GROCERY-STORE', 'FOOTBALL-MATCH']
img = Image.open(os.path.join(datadir, img_list.iloc[idx]))
plt.imshow(np.asarray(img))
print(class_list[labels[idx]])

In [None]:
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImageAudioDataset(Dataset):
    def __init__(self, root_dir, files, audio, labels=None, img_transform=None, audio_transform=None):
        self.root_dir = root_dir
        self.files = files
        self.audio = audio
        self.labels = labels
        self.img_transform = img_transform
        self.audio_transform = audio_transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = Image.open(os.path.join(self.root_dir, self.files.iloc[idx]))
        audio = self.audio[idx,:]
        if self.img_transform is not None:
            img = self.img_transform(img)
        if self.audio_transform is not None:
            audio = self.audio_transform(audio)
        if self.labels is not None:
            return img, audio, int(self.labels[idx])
        else:
            return img, audio

import torchvision

img_list_transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224,224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# img_list_transform = None
audio_transform = None

img_dataset = ImageAudioDataset(root_dir=datadir,
                               files=img_list,
                                 audio=audio,
                                 labels=labels,
                              img_transform=img_list_transform,
                                 audio_transform=audio_transform)
## Batch size
nsample = 100

# img_dataset = ImageAudioDataset(root_dir=datadir,
#                                files=img_list[:10*nsample],
#                                  audio=audio[:10*nsample,:],
#                                  labels=labels[:10*nsample],
#                               img_transform=img_list_transform,
#                                  audio_transform=audio_transform)

# shuffle=False so that the data are sampled in order
img_loader = DataLoader(img_dataset, batch_size=nsample, shuffle=False, num_workers=4, pin_memory=True)

In [None]:
import torchvision.models
import torch
import torch.nn as nn
import torch.nn.functional as F

resnext = torchvision.models.resnext50_32x4d(pretrained=True, progress=True)

# Replace the last layer if needed (here: 9 output classes)
num_ftrs = resnext.fc.in_features
resnext.fc = nn.Linear(num_ftrs, 9)


if device.type == 'cuda':
    resnext = resnext.cuda()

model = resnext


#print(model)

In [None]:
resnext.fc

In [None]:
from tqdm import tqdm

feat = []
for i, data in enumerate(tqdm(img_loader)):   ## iterate over the data
    img, audio, targets = data
    #print(targets)
    if device.type == 'cuda':
        img, audio, targets = img.cuda(), audio.cuda(), targets.cuda()
    with torch.no_grad():
        outputs = model(img)
    if device.type == 'cuda':
        feat.append(outputs.cpu().numpy().squeeze())
    else:
        feat.append(outputs.numpy().squeeze())

In [None]:
torch.relu(torch.tensor(feat[-1][-1]))

# 1. Neural Network

# 1.1 MLP

# 1.2 CNN

We turn to a convolutional neural network (CNN) to solve this task. CNNs are indeed widely used in image recognition.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision

**Creation of train and validation dataloader**

In [None]:
img_dataset_train,img_dataset_val = torch.utils.data.random_split(img_dataset, [10000, 3802])
img_loader_train = DataLoader(img_dataset_train, batch_size=nsample, shuffle=False, num_workers=4, pin_memory=True)
img_loader_val = DataLoader(img_dataset_val, batch_size=nsample, shuffle=False, num_workers=4, pin_memory=True)

In [None]:
class MonReseauCNN(nn.Module):
    def __init__(self):
        super(MonReseauCNN, self).__init__()
        
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)

        
        self.final = nn.Linear(50176,9)
      
    def forward(self, x):
        x = F.leaky_relu(self.conv1(x)) ## the 3x224x224 image becomes 32x224x224
        x = F.max_pool2d(x, kernel_size=2, stride=2) ## then 32x112x112
        x = F.leaky_relu(self.conv2(x)) ## then 64x112x112
        x = F.max_pool2d(x, kernel_size=2, stride=2) ## then 64x56x56
        x = F.leaky_relu(self.conv3(x)) ## shape unchanged
        x = F.max_pool2d(x, kernel_size=2, stride=2) ## then 64x28x28
        x = F.leaky_relu(self.conv4(x)) ## shape unchanged
        
        
        x = x.view(-1,50176) ## 64x28x28 is flattened to 50176
        
        x = self.final(x) 
        return x

monreseauCNN = MonReseauCNN()
monreseauCNN = monreseauCNN.cuda()

In [None]:
optimizerCNN = optim.Adam(monreseauCNN.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()
nbepoch = 15

In [None]:
from sklearn.metrics import confusion_matrix

for epoch in range(nbepoch):
    monreseauCNN.train()
    print("epoch", epoch)
    for batch_idx, data in enumerate(img_loader_train):   ## iterate over the data
        inputs, audio, targets = data
        inputs, targets = inputs.cuda(), targets.cuda()

        mespredictions = monreseauCNN(inputs)    ## forward pass through the network
        loss = criterion(mespredictions, targets)    ## compare the current output with the target

        optimizerCNN.zero_grad() ## clear the current gradients
        loss.backward() ## compute the gradients
        optimizerCNN.step() ## update the weights to bring the output closer to the target

        if random.randint(0,90)==0:
            print("\tloss=", loss.item()) ## occasional print to check that training does not diverge

cm = np.zeros((9,9),dtype=int)
monreseauCNN.eval()
with torch.no_grad():
    for batch_idx, data in enumerate(img_loader_val):
        inputs, audio, targets = data
        inputs = inputs.cuda()
        outputs = monreseauCNN(inputs)
        _, pred = outputs.max(1)
        # rows = true labels, columns = predictions
        cm += confusion_matrix(targets.cpu().numpy(), pred.cpu().numpy(), labels=list(range(9)))

# print(cm)
# print(accuracy(cm))

Result
==
Using a CNN with 4 convolutional layers, we obtain a score of 93.8% on the validation set with nsample = 100, nbepoch = 15 and a learning rate of 0.0001.  
Remark:  
It is interesting to experiment with different CNN architectures (number of layers, number of channels, padding, stride, ...). However, as training is quite slow, it is simpler to limit ourselves to light networks.
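
When changing the architecture, the input size of the final linear layer (50176 in `MonReseauCNN`) has to be recomputed. The sketch below applies the standard output-size formula for convolution and pooling layers to the layers defined above. `conv_output_size` is an illustrative helper, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 224
size = conv_output_size(size, kernel_size=5, padding=2)  # conv1
size = conv_output_size(size, kernel_size=2, stride=2)   # max pooling
size = conv_output_size(size, kernel_size=3, padding=1)  # conv2
size = conv_output_size(size, kernel_size=2, stride=2)   # max pooling
size = conv_output_size(size, kernel_size=3, padding=1)  # conv3
size = conv_output_size(size, kernel_size=2, stride=2)   # max pooling
size = conv_output_size(size, kernel_size=3, padding=1)  # conv4

channels = 64  # output channels of conv4
flattened = channels * size * size
print(size, flattened)
```

The result matches the 64x28x28 = 50176 features expected by `self.final`. Changing a kernel size, padding or stride without updating this number leads to a shape error in the forward pass.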



In [None]:

def accuracy(cm):
    return np.sum(cm.diagonal())/np.sum(cm)
print(accuracy(cm))
print(cm)

# 2.  Resnet

In this part, we will use a pre-trained network, ResNet, to recognize the locations. We will look at the detailed structure of this neural network further below.
Moreover, we are going to use fastai, a high-level library built on top of PyTorch, which has several advantages:  
* It provides many functions for evaluating the model.
* Its functions are well optimized and simple to use.


* **Initialisation**

In [None]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *
from fastai.vision.widgets import *

In [None]:
# Return the list of training images found in a data directory
def getImage(data):
    L= []
    for filename in os.listdir(data):
        if filename.endswith(".png") and filename.startswith("train"):
            L.append((os.path.join(data, filename)))
    return L

In [None]:
# Return the label of an image from its file name
def label_img(fname):
    # the 5 digits before ".png" are the zero-padded index of the image in `labels`
    indice = int(str(fname)[-9:-4])
    return class_list[labels[indice]]


We create a DataBlock that stores the images with their labels, applies some transformations (including a normalization) and splits the dataset into training and validation sets.  
Normalizing the data is very important: without it we get a lot of overfitting (about 20% more error in general).
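
The normalization itself is a per-channel standardization, (value - mean) / std. The sketch below applies it to a single RGB pixel with the ImageNet statistics that also appear in the torchvision transform earlier in this notebook. `normalize_pixel` is an illustrative helper and assumes channel values already scaled to [0, 1].

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Standardise one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(value - mean) / std
            for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

print(normalize_pixel([0.485, 0.456, 0.406]))  # a pixel equal to the mean maps to zeros
print(normalize_pixel([1.0, 1.0, 1.0]))        # a white pixel
```

Using the same statistics as those used to pre-train the network keeps the inputs in the range the pre-trained weights expect.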

In [None]:
scenes = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = getImage,
    splitter = RandomSplitter(valid_pct=0.2,seed = 42),
    get_y=label_img,
    batch_tfms=[*aug_transforms(size=(224,224)), Normalize.from_stats(*imagenet_stats)])


In [None]:
dls = scenes.dataloaders(datadir)

One of the advantages of fastai is that it has many functions to display the data, such as show_batch, which displays the images of a batch with their labels.

In [None]:
dls.valid.show_batch(max_n=10,nrows=2)

* **Training**

Training resnet50 for 4 epochs with a learning rate of 0.001

In [None]:
learn = cnn_learner(dls,resnet50,metrics=error_rate,lr=0.001)
learn.fine_tune(4)

**Example results**

In [None]:
learn.show_results()

**Confusion matrix**

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

**Display of the least reliable predictions**

In [None]:
interp.plot_top_losses(5,nrows=5)

**Saving the model so that it does not have to be retrained each time.**

In [None]:
learn.save("fastaiCNN_Error_001")

In [None]:
learn.export()

Result
==
The model takes less than 2 min to train. This is longer than the methods used in the first part, but it is still very reasonable given the results. Indeed, the model obtained is very good, with a loss of 0.007 on both the training and validation sets, so there is no sign of overfitting or underfitting.  
However, we have to choose the parameters carefully before launching the training (such as the learning rate, the number of epochs or the architecture, here resnet50).  


Finally, we have to be careful with the input data, which shows 2 types of problems: 
* at the image level (blur, poor quality): even for a human, it is impossible to guess the scene represented  
* at the label level (ambiguous scenes): an apron can be worn by a salesperson as well as by a waiter



Before using the model on the competition, we will briefly analyze its architecture.

In [None]:
learn.model

The model is thus similar to the hand-built model of the CNN part in that it uses Conv2d and ReLU layers. In addition, it uses BatchNorm2d: the activations within each batch are normalized in order to speed up training.  
At the end, the model uses dropout with a probability of 0.5, which is meant to regularize the model.  
Finally, a linear layer produces one score for each of our 9 classes, from which the class probabilities are obtained.
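
The last step, going from the outputs of the final linear layer to class probabilities, can be sketched as follows. This assumes the usual softmax is applied to those outputs. The scores below are made up, and `softmax` is written by hand for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # made-up scores for 9 classes
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(predicted_class, round(max(probs), 3))
```

The highest probability is what the rest of this notebook calls the "confidence" of the model for a given image.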

# Application to the Kaggle Competition

* **Loading the model and predicting on the test data**

In [None]:
learn = cnn_learner(dls,resnet50,metrics=error_rate,lr=0.001)
learn.load("fastaiCNN_Error_001")

In [None]:
datadir = '/kaggle/input/multimodal-classification-2021-mi203/data'
#datadir = 'data'

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

data_df = pd.read_csv(os.path.join(datadir,'data_test_novt.csv'), delimiter=',', nrows = None)
data_test = np.array(data_df)

# labels_test = data[:,-1].astype('int32')

# audio_test = data[:, 1:-1].astype('float32')

img_list_test = data_df['IMAGE']

In [None]:
# Return an array of the test images
def testImg(img_test):
    L = []
    for img in img_test:
        L.append(np.asarray(Image.open(os.path.join(datadir, str(img)))))
    return np.asarray(L)

In [None]:
test_img = testImg(img_list_test)

In [None]:
# Return a list of [label, confidence] pairs, one per test image
def learner_results(test_img):
    L = []
    for img in test_img:
        pred = learn.predict(img)
        L.append([pred[0],pred[2][pred[1]]])  # store the label and the model's "confidence" for it
    return L

In [None]:
# Return the list of predicted labels
def learner_labels(L):
    # each element of L is a [label, confidence] pair
    return [pair[0] for pair in L]

# Return the list of "confidences"
def learner_confiance(L):
    return [pair[1] for pair in L]

In [None]:
# Return the class indices corresponding to the labels
def learner_submission(L):
    class_dict = { class_list[i] : i for i in range(len(class_list))}    
    res = []
    for i in L:
        res.append(class_dict[i])
    return res

In [None]:
L = learner_results(test_img)  # run the model on every test image
y_pred = learner_labels(L)
y_confidence = learner_confiance(L)

In [None]:
res = learner_submission(y_pred)

Kaggle Result
==
With this model we get a score of 0.99594 (about 99.6%) on the Kaggle competition. 
3450 * (1-0.99594) ≈ 14, so about 14 images are misclassified.  
However, when analyzing the test images, we observe the same types of problems as in the training images.

For example, we can select the images for which the model is the least certain. Displaying them shows that they are very ambiguous.

In [None]:
print(y_confidence.index(min(y_confidence)))
img = Image.open(os.path.join(datadir, 'testimg_03303.png'))
plt.imshow(np.asarray(img))

In [None]:
y_pred[3303]

This is the image for which the model is the least certain. It classifies it as "GROCERY-STORE". We can understand this choice: the scene appears to be indoors and the blue element looks like a shopping cart. However, it is very likely that this image is labeled "CLASSROOM" in the dataset.

In [None]:
predictions_copy = y_confidence.copy()
predictions_copy.sort()  # sort the confidences in ascending order
predictions_copy

In [None]:
# index of the image whose predicted probability is about 0.7618
print(next(i for i, c in enumerate(y_confidence) if abs(float(c) - 0.7618) < 1e-4))


In [None]:
y_pred[64]  # the predicted label for that image

# Images with the lowest confidence in our model:
testimg_00651.png => Oversaturated image, predicted as "RESTAURANT", but the scene is not recognizable  
testimg_01910.png => Looks like a casino     
testimg_02004.png => Predicted as "GROCERY-STORE" but looks more like a restaurant kitchen  
testimg_02813.png => Predicted as "GROCERY-STORE". However the image is blurred and it is impossible to discern anything  
testimg_02817.png => Predicted as "RESTAURANT". However the scene does not seem to be classifiable.



In [None]:
# Display test image number x
def show_image(x):
    img = Image.open(os.path.join(datadir, 'testimg_'+x+'.png'))
    plt.imshow(np.asarray(img))

In [None]:
show_image('00651')
# show_image('01910')
# show_image('02004')
# show_image('02813')
# show_image('02817')

# III. Multimodal

In view of the results obtained in the last section, it seems unnecessary to make the model more complex to obtain better results.
However, it is interesting to examine some approaches for solving multimodal problems.  
For example, one approach in the literature uses 3 neural networks:
the first takes the images as input, the second takes the MFCC coefficients, and the last takes the outputs of these 2 networks as input and returns the prediction.  
Indeed, nothing prevented us from using a neural network in the first part. However, the performance of CNNs on image recognition encouraged us to reserve this method for the second part.

There is also an intuitive and fairly simple multimodal method, which we will try to implement.  
It consists in re-examining the elements for which the image model is the least certain using the tools of the first part. Indeed, if the image does not clearly identify the location, perhaps the audio does. It is unlikely that both the image and the sound are unusable for the same element.
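
The idea can be sketched independently of the actual models. The example below uses made-up predictions and confidences. `fuse_with_fallback` is a hypothetical helper, not the function used in this notebook. It sorts indices by confidence rather than looking values up with `list.index`, which avoids ambiguity when two elements share the same confidence.

```python
def fuse_with_fallback(image_preds, image_confidences, audio_preds, n_fallback):
    """Replace the n_fallback least confident image predictions with audio ones."""
    order = sorted(range(len(image_confidences)), key=lambda i: image_confidences[i])
    fused = list(image_preds)  # copy, so the input list is left untouched
    for i in order[:n_fallback]:
        fused[i] = audio_preds[i]
    return fused

# Made-up example with 4 elements.
image_preds = [2, 7, 3, 0]
image_confidences = [0.99, 0.55, 0.98, 0.61]
audio_preds = [2, 6, 3, 1]
print(fuse_with_fallback(image_preds, image_confidences, audio_preds, n_fallback=2))
```

Only the two least confident elements (indices 1 and 3) take the audio prediction. The others keep the image prediction.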

In [None]:
# Return a list containing the confidences of the elements the model is least sure about
def least_confident(pred):
    predictions_copy = pred.copy()
    predictions_copy.sort()
    return predictions_copy[:25]  # we keep the 25 elements with the lowest probabilities; this choice is arbitrary

# Overwrite entries of the prediction list res with new predictions from the SVM of part I.
def new_predictions(probas):
    y_pred_SVM = svc.predict(scaler.transform(audio_test))  # predict audio_test with the SVM of part I.
    for x in probas:
        index = y_confidence.index(x)
        res[index] = y_pred_SVM[index]  # replace the final result
    

In [None]:
new_predictions(predictions_copy[:28])

Result 
==
The intuitive idea paid off. We now get a score of 0.99942 instead of 0.99594. This means that out of 3450 items to classify, our model makes 2 errors.
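
The error counts quoted in this notebook follow directly from the scores, assuming the score is computed over all 3450 test items. `expected_errors` is an illustrative helper.

```python
n_test = 3450

def expected_errors(accuracy, n_items=n_test):
    """Number of misclassified items implied by an accuracy score."""
    return round(n_items * (1 - accuracy))

print(expected_errors(0.99594))  # image model alone -> 14
print(expected_errors(0.99942))  # after the audio fallback -> 2
```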

# Creating the submission file

In [None]:
#====================================================================
# Create the submission file

submission = pd.DataFrame({'CLASS':res})
submission=submission.reset_index()
submission = submission.rename(columns={'index': 'Id'})

#======================================================================
# Save the file
submission.to_csv('ThomasNigoghossian3.csv', index=False)


# Going further
Lectures on multimodal prediction (CMU 11-777 course channel):

https://www.youtube.com/watch?v=VIq5r7mCAyw&list=PLTLz0-WCKX616TjsrgPr2wFzKF54y-ZKc&ab_channel=CMU11-777course

MFCC coefficients, see https://github.com/devinvenable/mfcc-audio-experiment/blob/master/MFCC%20transforms.ipynb
