<a id='data_section'></a>
# Data section

This section describes the data used to train and test the model. Knowing the dataset well is essential if we want to explain the current model's behavior and improve it.

## Imports

In [1]:
import pandas as pd
import requests as rq
import numpy as np
import os
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import seaborn as sns

## Data Loading

In [2]:
# Load training and test sets

df_train = pd.read_json('datas/training_set.json')
df_test = pd.read_json('datas/testing_set.json')

print(f"Train shape : {df_train.shape}")
print(f"Test shape : {df_test.shape}")

Train shape : (6035, 2)
Test shape : (1065, 2)


## Dataset information and visualisations

In [3]:
# Stats on the training set
df_train.describe()

| | intent | sentence |
|---|---|---|
| count | 6035 | 6035 |
| unique | 8 | 6035 |
| top | irrelevant | Je voudrais que tu me dises si le dernier Mari... |
| freq | 3852 | 1 |


### Comments
We see here that there are indeed 8 different intents in total.
Moreover, the *irrelevant* intent is heavily represented in the dataset (3852/6035).  

This can be a good thing, since *irrelevant* is the intent for every sentence that doesn't fit one of the 7 others. It is less specific than the others, so it may need more examples to be recognized well. Nevertheless, it can also lead the model to detect the other intents poorly, because they have too few examples.

In [4]:
# Information about columns
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6035 entries, 0 to 6034
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   intent    6035 non-null   object
 1   sentence  6035 non-null   object
dtypes: object(2)
memory usage: 94.4+ KB


### Comments
The dataset has 2 columns:
- `intent` is our target
- `sentence` is the input: the text the user gives to the model

Both columns hold strings (`object` dtype): `intent` is categorical and `sentence` is free text, so there are no numerical values here.
In addition, there are no missing values (every column has 6035 *non-null* entries).  
Finally, because the input is text, a significant preprocessing step is needed before feeding the data to a model (*e.g.* tokenization, feature extraction, ...).
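
As an illustration of the kind of preprocessing mentioned above, here is a minimal sketch of tokenization followed by a bag-of-words feature extraction, written with the standard library only. The helper names `tokenize` and `bag_of_words` are hypothetical, and nothing here is known to match what the deployed model actually does; it only shows how raw sentences can become numerical vectors.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3)
    return re.findall(r"\w+", sentence.lower())

def bag_of_words(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word, 0 when the word is absent
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Two sentences from the training set
sentences = ["D'imprimer", "J'ai besoin d'acheter un fusil"]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
print(vectors)
```

A real pipeline would probably rely on a dedicated library and handle accents, typos and rare words more carefully; this sketch only makes the idea concrete.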

In [5]:
# Show the first 10 rows
df_train.head(n=10)

| | intent | sentence |
|---|---|---|
| 0 | irrelevant | 850€ maximum pour le loyer, à partir de janvie... |
| 1 | irrelevant | D'imprimer |
| 2 | purchase | Le meilleur cabriolet hybrid moins de 5m10 min... |
| 3 | find-hotel | en ce moment je cher un location pour les vaca... |
| 4 | irrelevant | c'est possible de t'utiliser la nuit ? |
| 5 | irrelevant | J'ai besoin d'acheter un fusil |
| 6 | irrelevant | Vous pouvez réserver pour 09h oui |
| 7 | irrelevant | Du 20 au 22 novembre pour 100-150 euros la nuit |
| 8 | purchase | Mon docteur m'a suggéré de porter des bandes p... |
| 9 | purchase | Commande à effectuer : 30 bloc note petits car... |


### Comments
<p>This gives us some example rows from the dataset. We notice there are both short and long sentences, some well written and some not, which is good for training the model on various writing styles (so that it works well with the different users' styles in production).</p>

In [6]:
# Show the row count for each of the 8 intents
df_train["intent"].value_counts()

irrelevant           3852
purchase              613
find-restaurant       469
find-around-me        383
find-hotel            316
find-train            143
find-flight           142
provide-showtimes     117
Name: intent, dtype: int64

In [7]:
# Ratios (%) for each intent
total = df_train.shape[0]
round(df_train['intent'].value_counts() / total * 100 , 2)

irrelevant           63.83
purchase             10.16
find-restaurant       7.77
find-around-me        6.35
find-hotel            5.24
find-train            2.37
find-flight           2.35
provide-showtimes     1.94
Name: intent, dtype: float64

### Comments
Here we get a more detailed row count for each intent. As we said before, most rows are *irrelevant* (about 64%), so the dataset is heavily unbalanced. We will need to think carefully about the metrics used to evaluate any model trained on this dataset.  

It is also unbalanced between the 7 'specific' intents. This may reflect how clients use the app, asking more often about purchases than about showtimes.
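
To put a number on the imbalance between the 7 'specific' intents, the short sketch below recomputes each intent's share once *irrelevant* is left out, along with the ratio between the most and least frequent specific intents. The counts are copied from the `value_counts()` output above; the variable names are only illustrative.

```python
# Row counts of the 7 specific intents, copied from the value_counts() output
specific_counts = {
    "purchase": 613,
    "find-restaurant": 469,
    "find-around-me": 383,
    "find-hotel": 316,
    "find-train": 143,
    "find-flight": 142,
    "provide-showtimes": 117,
}

total_specific = sum(specific_counts.values())

# Share (%) of each intent among the specific intents only
shares = {
    intent: round(count / total_specific * 100, 2)
    for intent, count in specific_counts.items()
}

# Ratio between the most and the least represented specific intent
imbalance_ratio = max(specific_counts.values()) / min(specific_counts.values())

for intent, share in shares.items():
    print(f"{intent:<18} {share:5.2f} %")
print(f"Most/least frequent ratio: {imbalance_ratio:.2f}")
```

Even without *irrelevant*, *purchase* has roughly five times more examples than *provide-showtimes*, so per-intent scores should be read with these sample sizes in mind.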

# Model section

The only information we have about the model is the set of measures given in the base project's README file. <br>
The main problem is that we don't know how they were computed (which data, with cross-validation or not, etc.), so in the following cells we will try to get our own measures on the data we were given.

In [None]:
# Launch the application to get access to the model
# Pull the image before doing anything else: docker pull wiidiiremi/projet_industrialisation_ia_3a
# This may take some time

# Set your custom host port here
port = '8080'
# -p maps <host port>:<container port>; the container is assumed to listen on 8080
os.system('docker run -p ' + port + ':8080 3eec8ccf7aec &')

## About the model
The goal of the model is to find which of the 8 intents best fits the user's request (*i.e.* sentence).
Consequently, it is a **classification problem with 8 classes**.  
We can confirm the model is a classifier from the 8 probabilities it returns (*json* response) when given a sentence.  
However, we cannot know what it is made of. It could be a Neural Network or a Softmax Regression, for example.
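
To make the shape of that response concrete, here is a small sketch of how a predicted intent can be read from it. The dictionary below is a hypothetical response: the intent names come from the dataset, but the probability values are made up, and we assume (without confirmation from the source) that the 8 values sum to 1.

```python
# Hypothetical JSON response, already decoded into a dict.
# Intent names match the dataset; the probability values are invented.
response = {
    "find-train": 0.01,
    "irrelevant": 0.12,
    "find-flight": 0.02,
    "find-restaurant": 0.70,
    "purchase": 0.05,
    "find-around-me": 0.06,
    "provide-showtimes": 0.01,
    "find-hotel": 0.03,
}

# The predicted intent is the key with the highest probability
predicted_intent = max(response, key=response.get)
confidence = response[predicted_intent]

# Sanity check under the assumption that the values form a probability distribution
total = sum(response.values())

print(f"Predicted intent: {predicted_intent} ({confidence:.0%})")
print(f"Sum of probabilities: {total:.2f}")
```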

## Split datasets in inputs and labels

In [None]:
df_x_train = df_train['sentence']
df_y_train = df_train['intent']
df_x_test = df_test['sentence']
df_y_test = df_test['intent']

print(f"Train data shape : {df_x_train.shape}")
print(f"Train labels shape : {df_y_train.shape}")
print(f"Test data shape : {df_x_test.shape}")
print(f"Test labels shape : {df_y_test.shape}")

## Get model predictions for both datasets

In [None]:
# Uses the host port chosen in the docker run cell
route = 'http://localhost:' + port + '/api/intent'

# Function to get the model's predictions for a given dataset
def predict(datas):
    
    # List of predicted intents
    predicted_labels = []
    # List of the 8 probabilities returned for each sentence
    prediction_probabilities = []
    
    # Query the model once per sentence
    for data in datas:
        
        try:
            res = rq.get(route, params={'sentence': data}).json()
        except (rq.exceptions.RequestException, ValueError):
            # Network failure or non-JSON answer: return what was collected so far
            print("Request Error: Service not available")
            return predicted_labels, prediction_probabilities
        
        # The predicted intent is the one with the highest probability
        predicted_class = max(res, key=res.get)
        predicted_values = list(res.values())
 
        predicted_labels.append(predicted_class)
        prediction_probabilities.append(predicted_values)
        
    return predicted_labels, prediction_probabilities

In [None]:
# Get both datasets predictions from the model
train_predicted_labels, train_predicted_probabilities = predict(df_x_train)
test_predicted_labels, test_predicted_probabilities = predict(df_x_test)

# Check that every sentence received both a label and a probabilities array
assert len(train_predicted_labels) == df_x_train.shape[0]
assert len(train_predicted_probabilities) == df_x_train.shape[0]
assert len(test_predicted_labels) == df_x_test.shape[0]
assert len(test_predicted_probabilities) == df_x_test.shape[0]

## Compute model's various scores

In [None]:
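# Confusion matrix on the training set (rows: true intents, columns: predicted intents)
labels = sorted(df_y_train.unique())
conf_matrix = confusion_matrix(df_y_train, train_predicted_labels, labels=labels)
sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)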

In [None]:
# Training scores
print(classification_report(df_y_train, train_predicted_labels))

In [None]:
# Test scores
print(classification_report(df_y_test, test_predicted_labels))

In [None]:
# List of all intents in the same order as the model's output
intents = ["find-train", "irrelevant", "find-flight", "find-restaurant", "purchase", "find-around-me", "provide-showtimes", "find-hotel"]

# Function mapping a true label to a one-hot probabilities array
def labelToProbs(label):
    
    assert label in intents
    
    # All probabilities are 0...
    probs = np.zeros(len(intents))
    # ...except the true label's probability, set to 1
    probs[intents.index(label)] = 1
    
    return probs

In [None]:
# ROC AUC scores

# Map true labels to one-hot arrays in order to compute the ROC AUC score
y_true_train = [labelToProbs(label) for label in df_y_train.to_numpy()]
y_true_test = [labelToProbs(label) for label in df_y_test.to_numpy()]

# Note: with one-hot y_true, scikit-learn treats the target as multilabel and
# averages per-class AUCs; multi_class only applies to 1-D multiclass labels
train_roc = roc_auc_score(y_true_train, train_predicted_probabilities, multi_class='ovo')
test_roc = roc_auc_score(y_true_test, test_predicted_probabilities, multi_class='ovo')

print(f"Training ROC AUC : {round(train_roc,3)}")
print(f"Test ROC AUC : {round(test_roc, 3)}")

### Comments

The classification reports show an accuracy of **81%** on the training set (**80%** on the test set), which is not bad for a first model.
However, as we saw in the [Data Section](#data_section), our dataset is unbalanced (64:36 ratio between the *irrelevant* class and the seven others). Consequently, accuracy alone is not a reliable way to rate the model: a model that tends to choose *irrelevant* is rewarded with a high accuracy even if it detects the other intents poorly ([see 'Accuracy Paradox'](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/)).
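
The arithmetic behind this remark can be sketched with the training set counts alone. The snippet below scores a hypothetical baseline that always answers *irrelevant* (it is not the real model): its accuracy looks decent, while its macro-averaged recall, which gives each intent the same weight, collapses.

```python
# Training set row counts per intent, copied from the Data section
counts = {
    "irrelevant": 3852,
    "purchase": 613,
    "find-restaurant": 469,
    "find-around-me": 383,
    "find-hotel": 316,
    "find-train": 143,
    "find-flight": 142,
    "provide-showtimes": 117,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Hypothetical baseline: always predict the majority intent
baseline_accuracy = counts[majority] / total

# Per-intent recall of that baseline: 1 for the majority intent, 0 for the others
recalls = {intent: 1.0 if intent == majority else 0.0 for intent in counts}
macro_recall = sum(recalls.values()) / len(recalls)

print(f"Baseline accuracy     : {baseline_accuracy:.2%}")
print(f"Baseline macro recall : {macro_recall:.2%}")
```

With almost 64% accuracy for a model that never detects a single specific intent, the measured 81% should be compared to this baseline rather than to 0%, and per-intent precision, recall and F1 from the classification reports are more informative.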


# Application section

## Comments about the UI

...

## Comments about the performance

...

## Comments about the documentation

...