# ABOUT


Datascientest's Data Scientist continuous bootcamp - March 2022 cohort - AeroBOT project

**Tutor**

* Alban THUET

**Authors:**

* Hélène ASSIR
* Hichem HADJI  
* [Ioannis STASINOPOULOS](https://www.linkedin.com/in/ioannis-stasinopoulos/)

<br>

---
<br>

**Version History**

Version | Date       | Author(s)  | Modification
--------|----------- | ---------  | --------------------------
X.X     | XX/XX/2022 | A.B        | modif
1.0     | 14/09/2022 | I.S        | Document creation

This notebook creates classification reports in two formats (dictionary and `pd.DataFrame`) from .pkl files containing `y_pred_proba` and `y_test`.

Reason: Ioannis had saved only the model, `y_pred_proba` and `y_test` when running his BERT experiments. Hence, the classification reports had to be recreated.

**Note that the clf_rep in pd.DataFrame format contains metadata, i.e. experiment parameters.**

It only became clear what the tuning parameters should be after several BERT experiments had been run, so the columns to include in the `pd.DataFrame` could not have been predicted from the beginning.
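
As a minimal sketch of the idea, the snippet below builds both report formats from a thresholded probability matrix. The arrays are small made-up values (an assumption for illustration only, not project data), and `zero_division=0` is an optional choice that is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy multilabel data: 4 samples, 3 labels (hypothetical values, not project data)
y_test = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred_proba = np.array([[0.9, 0.2, 0.7],
                         [0.1, 0.8, 0.3],
                         [0.6, 0.4, 0.1],
                         [0.2, 0.1, 0.9]])

# Each label is thresholded independently (multilabel setting)
y_pred = (y_pred_proba >= 0.5).astype(int)

# Format 1: nested dictionary, keyed by label index ('0', '1', ...) and averages
clf_rep = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

# Format 2: DataFrame with one column per label/average and one row per metric
clf_rep_df = pd.DataFrame(clf_rep)
print(clf_rep_df)
```

The DataFrame built here is only the raw table; the notebook later reshapes it and adds the experiment metadata columns.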

# IMPORT PACKAGES


In [39]:
#@title Import packages
#######################
# Import packages
#######################
import numpy as np
import seaborn as sns
import math # for math.pi etc.
import time # time code execution

#######################
# Pandas
#######################
import pandas as pd
# Set pandas settings to show all data when using .head(), .columns etc.
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option("display.colheader_justify","left") # left-justify the print output of pandas

### Display full columnwidth
# Set pandas settings to display full text columns
#pd.options.display.max_colwidth = None
# Restore pandas settings to display standard colwidth
pd.reset_option('display.max_colwidth')

import itertools # To create iterators

# Package to show the progression of pandas operations
from tqdm import tqdm
# from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
# simply use .progress_apply() instead of .apply() on your pd.DataFrame

######################
# PLOTTING
######################
import matplotlib.pyplot as plt
%matplotlib inline
# Define global plot parameters for better readability and consistency among plots
# A complete list of the rcParams keys can be retrieved via plt.rcParams.keys() function
plt.rcParams['axes.titlesize'] = 30
plt.rcParams['axes.labelsize'] = 23
plt.rcParams['xtick.labelsize'] = 23
plt.rcParams['ytick.labelsize'] = 23
plt.rc('legend', fontsize=23)    # legend fontsize

# BOKEH 
from bokeh.plotting import figure # Import the figure class, which is used to create a bokeh plot.
from bokeh.io import  push_notebook, output_notebook, show
output_notebook() # displays all subsequent plots in the output of a jupyter cell. If this instruction is not run, the figure opens in a new tab.
from bokeh.models import ColumnDataSource, Label
from bokeh.transform import dodge
from bokeh.models.tools import HoverTool


###############################
# Other
###############################
import pickle as pkl # Saving data externally
from sklearn.metrics import classification_report, confusion_matrix

## Mount GDrive

In [40]:
#@title
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive/')

#check your present working directory 
%pwd

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


'/content/drive/My Drive/data/saved models/Yannis/BERT/7_3_9_3_UNfrozen_2022_09_14'

In [41]:
#@title
# move to the transformed data location (you can create a deeper structure, if needed, e.g. to save a trained model):
%cd /content/drive/MyDrive/data/transformed/

/content/drive/MyDrive/data/transformed


In [42]:
#@title
!ls # list the content of the pwd

#!ls "/content/drive/MyDrive/Data_Science/Formations/DataScienceTest/projet/AeroBot/" # list content of a specific folder

 2022_09_11_7_4_3_raw_narr_BERT_BASE_frozen_max_length_345.pkl
 complaints-2022-08-05_13_55.csv
'Copy of Qualified abbreviations_20220718.xlsx.gsheet'
'Data Dictionnary.xlsx'
 data_for_BERT_multilabel_20220805.pkl
 logs
 model.png
 model_results
 Narrative_PP_stemmed_24072022_TRAIN.pkl
 Narrative_Raw_Stemmed_24072022_TRAIN.pkl
 Narrative_RegEx_subst_21072022_TRAIN.pkl
'Qualified abbreviations_20220707_test.csv'
'Qualified abbreviations_20220708.csv'
'Qualified abbreviations_20220718.csv'
'Qualified abbreviations_20220718_Google_sheet.gsheet'
 test_data_final.pkl
 train_data_final.pkl


# Create classification report
Only the model, `y_pred_proba` and `y_test` had been saved.
Hence, we have to create the classification report.

## Function definitions

In [43]:
def get_filenames_in_dir(dir, extension = '.pkl', include_path = False):
  '''
  Find all files with the given extension in a directory (including its
  subdirectories) and return a list of their names.
  
  Input: 
  - dir: directory to search
  - extension: file extension to match; default: '.pkl'
  - include_path: whether to include the entire path in the filename; default: False
  
  Return:
  - list of matching file names (or full paths if include_path is True)
  '''
  import os
  files_to_import = []
  # traverse the whole directory tree
  for root, dirs, files in os.walk(dir):
      for file in files:
          # keep only files with the requested extension
          if file.endswith(extension):
            if include_path:
              files_to_import.append(os.path.join(root, file)) # full path
            else:  
              files_to_import.append(file) # bare file name
  
  return files_to_import
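
For comparison, here is a hedged sketch of the same lookup written with `pathlib`. The name `list_files_with_suffix` and the demo file names are hypothetical; unlike the helper above, this variant sorts its result so the order is reproducible.

```python
from pathlib import Path
import tempfile

def list_files_with_suffix(directory, suffix=".pkl", include_path=False):
    """Hypothetical pathlib-based variant: recursive search, sorted result."""
    paths = sorted(p for p in Path(directory).rglob(f"*{suffix}") if p.is_file())
    return [str(p) if include_path else p.name for p in paths]

# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["y_test_demo.pkl", "y_pred_proba_demo.pkl", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_files_with_suffix(tmp)

print(found)
```

Sorting matters whenever later code picks files by position, because a directory walk does not promise any particular order.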

In [44]:
def y_prob_to_y_pred_ML(y_pred_proba, threshold = 0.5):
  """
  Converts probabilities into 0's and 1's. We are still in the MULTILABEL context.
  Input: MULTILABEL predictions coming directly from the model
  (probabilities whose sum for each sample may exceed 1)
  Using a user-defined threshold, return a MULTILABEL prediction array 'y_pred' containing 0's and 1's
  """
  y_pred=[]
  for sample in y_pred_proba:
    y_pred.append([1 if i>= threshold else 0 for i in sample])
  y_pred = np.array(y_pred)

  return y_pred
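
The same thresholding can also be expressed as a single vectorized NumPy comparison. The sketch below uses a hypothetical function name, `threshold_probabilities`, and made-up probabilities; it is an equivalent formulation for illustration, not a replacement the notebook depends on.

```python
import numpy as np

def threshold_probabilities(y_pred_proba, threshold=0.5):
    """Vectorized multilabel thresholding: 1 where proba >= threshold, else 0."""
    return (np.asarray(y_pred_proba) >= threshold).astype(int)

# Made-up probabilities for 2 samples and 3 labels
proba = np.array([[0.9, 0.5, 0.1],
                  [0.2, 0.7, 0.49]])
y_pred = threshold_probabilities(proba)
print(y_pred)
```

Note that a probability exactly equal to the threshold is mapped to 1, matching the `>=` used in the loop version.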

In [45]:
def create_clf_rep_dict_from_saved_y_test_y_pred_proba(dir, threshold = 0.5):
  '''
  - Load y_test and y_pred_proba from their respective .pkl files, located in dir
  - Calculate y_pred from y_pred_proba using the function y_prob_to_y_pred_ML()
  - Create a classification report

  Return a classification report 'clf_rep' in dictionary format.
  '''

  # Change into the experiment directory (IPython magic)
  %cd $dir
  # the '$' expands the Python variable 'dir'. Don't put any comments in the line above

  files_to_import = get_filenames_in_dir(dir, extension = '.pkl', include_path = False)

  print('\nFiles found:')
  for filename in files_to_import:
    print(filename)

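  # Load y_test. Select the file by its name prefix rather than by list position,
  # because os.walk() does not guarantee any particular file order.
  filename = next(name for name in files_to_import if name.startswith('y_test'))
  with open(filename, "rb") as f:
    y_test = pkl.load(f)

  # Load y_pred_proba (again selected by name prefix)
  filename = next(name for name in files_to_import if name.startswith('y_pred_proba'))
  with open(filename, "rb") as f:
    y_pred_proba = pkl.load(f)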

  # Calculate y_pred given a specific threshold
  y_pred = y_prob_to_y_pred_ML(y_pred_proba, threshold = threshold)

  anomalies = ['Anomaly_Aircraft Equipment', 
              'Anomaly_Airspace Violation',
              'Anomaly_ATC Issue', 
              'Anomaly_Flight Deck / Cabin / Aircraft Event',
              'Anomaly_Conflict', 
              'Anomaly_Deviation - Altitude',
              'Anomaly_Deviation - Speed', 
              'Anomaly_Deviation - Track / Heading',
              'Anomaly_Deviation / Discrepancy - Procedural',
              'Anomaly_Ground Excursion', 
              'Anomaly_Ground Incursion',
              'Anomaly_Ground Event / Encounter',
              'Anomaly_Inflight Event / Encounter',
              'Anomaly_No Specific Anomaly Occurred']
  # I got this list from df.columns
  # 14 labels

  clf_rep = classification_report(y_test, y_pred, output_dict = True)
  print(f"\n\n Classification Report: \n {classification_report(y_test, y_pred, target_names = anomalies)}\n")

  return clf_rep

In [46]:
# Global copy of the label list, used by convert_clf_rep_to_df_multilabel_BERT()
anomalies = ['Anomaly_Aircraft Equipment', 
             'Anomaly_Airspace Violation',
             'Anomaly_ATC Issue', 
             'Anomaly_Flight Deck / Cabin / Aircraft Event',
             'Anomaly_Conflict', 
             'Anomaly_Deviation - Altitude',
             'Anomaly_Deviation - Speed', 
             'Anomaly_Deviation - Track / Heading',
             'Anomaly_Deviation / Discrepancy - Procedural',
             'Anomaly_Ground Excursion', 
             'Anomaly_Ground Incursion',
             'Anomaly_Ground Event / Encounter',
             'Anomaly_Inflight Event / Encounter',
             'Anomaly_No Specific Anomaly Occurred']

## Import files

### clf_rep as dictionary

In [47]:
# Call the function
experiment_name = '2022_09_15_7_3_9_4_UNfrozen_random_state_222'
dir = '/content/drive/MyDrive/data/saved models/Yannis/BERT/' + experiment_name

clf_rep = create_clf_rep_dict_from_saved_y_test_y_pred_proba(dir, threshold = 0.5)

/content/drive/MyDrive/data/saved models/Yannis/BERT/2022_09_15_7_3_9_4_UNfrozen_random_state_222

Files found:
y_pred_proba_2022_09_15_7_3_9_4_UNfrozen_random_state_222.pkl
y_test_2022_09_15_7_3_9_4_UNfrozen_random_state_222.pkl
2022_09_15_7_3_9_4_UNfrozen_random_state_222.pkl


ValueError: ignored

### clf_rep as pd.DataFrame

In [None]:
def convert_clf_rep_to_df_multilabel_BERT(clf_rep):
  '''
  Returns the classification report as a pd.DataFrame in long format
  (one row per anomaly and metric), with extra metadata columns.
  Tailored for extracting MULTILABEL BERT experiment results.
  
  Input: 
  - multilabel classification report in dictionary format
  (does not contain '0' and '1' keys)

  Returns:
  - classification report in form of a pd.DataFrame
  '''

  # Write the classification report dictionary into a pd.DataFrame
  metrics = pd.DataFrame(clf_rep)

  # The rest of the code reshapes the report into long format
  # and adds extra columns with parameter values

  # Rename columns with anomaly names
  # Create a dictionary mapping label indices to anomaly names
  anomaly_labels = dict(zip(metrics.columns[0:14], anomalies))
  metrics = metrics.rename(columns = anomaly_labels)

  ##########################################################
  # Create DataFrame in the right format for the plotting of results
  clf_rep_df = pd.DataFrame()
  for anomaly in metrics.columns[0:14]:

    temp_df = pd.DataFrame(index = metrics.index) # create temporary DataFrame with the 4 metrics as index
    temp_df['values'] = metrics.filter(items = [anomaly]).values # write the 4 values for the selected anomaly
    temp_df['anomaly'] = anomaly # fill in the column with the selected anomaly label
    clf_rep_df = pd.concat([clf_rep_df, temp_df])

  clf_rep_df = clf_rep_df.reset_index().rename(columns = {'index': 'metric'})

  # Fill in additional columns with metadata
  clf_rep_df['classifier'] = 'BERT_BASE'        # 'BERT_BASE' or 'DistilBERT' 
  clf_rep_df['preprocessing'] = 'raw'           # 'raw' or 'raw_stem' or 'PP'
  clf_rep_df['undersampling'] = 0               # 1 if undersampling was applied

  # layers run from 1 to 12
  clf_rep_df['UNfrozen_layers'] = '9_10_11_12'     # last 4 layers = '9_10_11_12', 'NO' if all layers frozen
  clf_rep_df['concat_layers']   = 'NO'       # '8_9_10_11' or 'NO' if no layers concatenation
  
  clf_rep_df['comments'] = 'last_hidden_state_CLS_random_state_222' # misc. comments, e.g. 'Flatten layer X' or 'max_length_345' or 'last_hidden_state_CLS'
  clf_rep_df['experiment_ID'] = '7_3_9_4'              # e.g. '7_5_2_1' if available
  #clf_rep_df['padding'] = padding              # 'pre' or 'post'
  #clf_rep_df['truncating'] = truncating        # 'pre' or 'post'
  #clf_rep_df['maxlen'] = maxlen   
  #clf_rep_df['num_words'] = num_words 

  # Reorder columns
  clf_rep_df = clf_rep_df[[\
                           'experiment_ID',
                           'classifier', 
                           'preprocessing', 
                           'undersampling',
                           'UNfrozen_layers',
                           'concat_layers',
                           'comments',
                           'anomaly', 
                           #'num_words', 
                           #'maxlen', 
                           #'padding', 
                           #'truncating', 
                           'metric', 
                           'values']]

  print("DataFrame length:", len(clf_rep_df)) #should be 56 = 14 anomalies * 4 metrics

  return clf_rep_df
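
The reshaping step (one row per anomaly and metric) can also be sketched with `DataFrame.melt`. Everything in this example is illustrative: the report values, the two label names and the `'demo'` experiment ID are made up, and this is an alternative formulation rather than the code used for the experiments.

```python
import pandas as pd

# Hypothetical two-label report with made-up values
clf_rep = {
    "0": {"precision": 0.8, "recall": 0.6, "f1-score": 0.69, "support": 50},
    "1": {"precision": 0.7, "recall": 0.9, "f1-score": 0.79, "support": 30},
    "micro avg": {"precision": 0.75, "recall": 0.71, "f1-score": 0.73, "support": 80},
}
label_names = ["Label_A", "Label_B"]  # stand-ins for the 14 anomaly names

metrics = pd.DataFrame(clf_rep)  # rows: metrics, columns: labels and averages

# Keep only the per-label columns and give them readable names
label_cols = metrics.columns[:len(label_names)]
per_label = metrics[label_cols].rename(columns=dict(zip(label_cols, label_names)))

# Wide -> long: one row per (anomaly, metric) pair
long_df = (per_label
           .rename_axis("metric")
           .reset_index()
           .melt(id_vars="metric", var_name="anomaly", value_name="values"))

# Metadata columns are plain scalar assignments, as in the function above
long_df["experiment_ID"] = "demo"
print(long_df)
```

With 2 labels and 4 metrics the result has 8 rows; with the 14 anomalies of this project the same logic yields the 56 rows checked in the function above.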

In [None]:
# Convert the classification report into pd.DataFrame format 
clf_rep_df = convert_clf_rep_to_df_multilabel_BERT(clf_rep)
clf_rep_df.head(5)

# Save the clf reports as .pkl

In [None]:
# Save the classification report to the pwd in 2 formats (dict, DataFrame)
!pwd
filename = 'clf_rep_' + experiment_name + '.pkl'
with open(filename, 'wb') as f: # the context manager closes the file after writing
  pkl.dump(clf_rep, f)

filename = 'clf_rep_df_' + experiment_name + '.pkl'
with open(filename, 'wb') as f:
  pkl.dump(clf_rep_df, f)
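
A quick way to check that a saved report can be read back is a pickle round trip. The sketch below uses a temporary directory and a made-up miniature report, so it does not touch the experiment folders; the file name `clf_rep_demo.pkl` is hypothetical.

```python
import pickle as pkl
import tempfile
from pathlib import Path

# Made-up miniature report, standing in for the real clf_rep
clf_rep = {"0": {"precision": 1.0, "recall": 0.5}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clf_rep_demo.pkl"

    # Write: the context manager closes the file handle for us
    with open(path, "wb") as f:
        pkl.dump(clf_rep, f)

    # Read back and compare with the original object
    with open(path, "rb") as f:
        restored = pkl.load(f)

print(restored == clf_rep)
```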