# ABOUT


Datascientest's Data Scientist continuous bootcamp - March 2022 cohort - AeroBOT project

**Tutor**

* Alban THUET

**Authors:**

* [Ioannis STASINOPOULOS](https://www.linkedin.com/in/ioannis-stasinopoulos/)

<br>

---
<br>

**Version History**

Version | Date       | Author(s)  | Modification
--------|----------- | ---------  | --------------------------
4.1     | 23/10/2022 | I.S        | Migrate function/class definitions to GitHub
4.0     | 23/10/2022 | I.S        | Reproduce the training of model 7.3.9.3
3.0     | 06/10/2022 | I.S        | Global function calling the classes
2.0     | 06/10/2022 | I.S        | Pipeline with dummy test set
1.1     | 30/09/2022 | I.S        | Remove preprocessing part by using `04.1_Anomaly - Feature definition.ipynb`
1.0     | 24/09/2022 | I.S        | Document creation

This notebook is the refactored version of the BERT code, organized into functions and classes.
It is based on `7_3_9_3_UNfrozen_2022_09_14.ipynb`, the best-performing model in AeroBOT so far.

The notebook calls a 'global function' that is defined, together with all of its dependencies, in AeroBOT's GitHub repo. For more details, refer to the docstrings on GitHub.

**This notebook runs considerably better on a Premium GPU with extended RAM.**

# IMPORT PACKAGES


In [1]:
#######################
# Import packages
#######################
import numpy as np

#######################
# Pandas
#######################
import pandas as pd
# Set pandas settings to show all data when using .head(), .columns etc.
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option("display.colheader_justify","left") # left-justify the print output of pandas


### Display full columnwidth
# Set pandas settings to display full text columns
#pd.options.display.max_colwidth = None
# Restore pandas settings to display standard colwidth
pd.reset_option('display.max_colwidth')

######################
# PLOTTING
######################
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.titlesize'] = 30
plt.rcParams['axes.labelsize'] = 23
plt.rcParams['xtick.labelsize'] = 23
plt.rcParams['ytick.labelsize'] = 23
plt.rc('legend', fontsize=23)    # legend fontsize

###############################
# Other
###############################
import pickle as pkl # Saving data externally

# LOAD DATA

## Mount GDrive

In [2]:
#@title
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive/')

# Check your present working directory
%pwd

Mounted at /content/drive/


'/content'

In [3]:
#@title
# move to the transformed data location (you can create a deeper structure, if needed, e.g. to save a trained model):
%cd /content/drive/MyDrive/data/transformed/

/content/drive/MyDrive/data/transformed


In [4]:
#@title
!ls # list the content of the pwd

#!ls "/content/drive/MyDrive/Data_Science/Formations/DataScienceTest/projet/AeroBot/" # list the content of a specific folder

 2022_09_11_7_4_3_raw_narr_BERT_BASE_frozen_max_length_345.pkl
 complaints-2022-08-05_13_55.csv
'Copy of Qualified abbreviations_20220718.xlsx.gsheet'
'Data Dictionnary.xlsx'
 data_for_BERT_multilabel_20220805.pkl
 df_for_Anomaly_prediction.pkl
 df_test_for_Anomaly_prediction.pkl
 model.png
 model_results
 Narrative_PP_stemmed_24072022_TRAIN.pkl
 Narrative_Raw_Stemmed_24072022_TRAIN.pkl
 Narrative_RegEx_subst_21072022_TRAIN.pkl
'Qualified abbreviations_20220707_test.csv'
'Qualified abbreviations_20220708.csv'
'Qualified abbreviations_20220718.csv'
'Qualified abbreviations_20220718_Google_sheet.gsheet'
 test_data_final.pkl
 train_data_final.pkl


## Import AeroBOT packages from GitHUB

In [5]:
%cd /content/drive/MyDrive/

/content/drive/MyDrive


In [6]:
# Create temporary folders for importing the entire remote repo
# An attempt to import only the aerobotpackages folder failed: it is read as a repo
!mkdir AeroBOTTemp -p     # temporary folder to store the repo
!mkdir aerobotpackages -p # temporary folder to store the aerobotpackages

# Fetch data from GitHub
from getpass import getpass

username = 'DataScientest-Studio'
repository = 'Aerobot'
# Do not hard-code a personal access token in the notebook; enter it at run time
git_token = getpass('GitHub personal access token: ')

!git clone https://{git_token}@github.com/{username}/{repository} ./AeroBOTTemp --dissociate

# Copy the aerobotpackages into temp folder defined above
!cp ./AeroBOTTemp/aerobotpackages/* ./aerobotpackages

# Delete temp repo folder 
!rm AeroBOTTemp -r

Cloning into './AeroBOTTemp'...
remote: Enumerating objects: 661, done.
remote: Counting objects: 100% (279/279), done.
remote: Compressing objects: 100% (202/202), done.
remote: Total 661 (delta 158), reused 162 (delta 77), pack-reused 382
Receiving objects: 100% (661/661), 89.44 MiB | 21.62 MiB/s, done.
Resolving deltas: 100% (306/306), done.
Checking out files: 100% (117/117), done.


In [7]:
from aerobotpackages import train_load_transformer_model

# Functions for threshold optimization
from aerobotpackages import (
    y_prob_to_y_pred, y_multilabel_to_binary, find_opt_threshold_PR,
    get_list_of_opt_thresholds, plot_PR_curve_opt_thresh,
    convert_clf_rep_to_df_multilabel_BERT_kw_args,
)

# this explicit function import is necessary to avoid typing 
# aerobotpackages.[FUNCTION or CLASS NAME] at each function / class call
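The imported helpers deal with threshold optimization. As a rough illustration of the general idea, and not the code in `aerobotpackages`, the sketch below assumes that "PR" stands for precision-recall and that the optimal threshold for a label is the one maximizing F1. The names `precision_recall_f1`, `best_f1_threshold` and `probs_to_preds`, as well as the toy data, are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary label (plain Python lists)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, y_prob):
    """Try every predicted probability as a threshold; keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(y_prob)):
        y_pred = [1 if p >= t else 0 for p in y_prob]
        f1 = precision_recall_f1(y_true, y_pred)[2]
        if f1 > best_f1:  # strict: ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


def probs_to_preds(prob_rows, thresholds):
    """Apply one threshold per label column to a matrix of probabilities."""
    return [[1 if p >= t else 0 for p, t in zip(row, thresholds)]
            for row in prob_rows]


# Toy data for a single label (hypothetical values)
y_true = [0, 0, 1, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.90]
threshold, f1 = best_f1_threshold(y_true, y_prob)
print(threshold, round(f1, 3))

# Two samples, two labels, one threshold per label
preds = probs_to_preds([[0.2, 0.7], [0.6, 0.3]], [0.5, 0.25])
print(preds)
```

In a multilabel setting such as this one, the search would be repeated for each label column, giving a list of thresholds rather than a single global cut-off at 0.5.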

In [8]:
# After the import, delete the packages folder
# This does not affect the imported packages
!rm aerobotpackages -r

## Load data from .pkl file


In [9]:
# # Load the TRAIN data (97417 entries)
# # Do not touch the TEST data until the end of the project,
# # or the curse of the Greek gods will fall upon you!

# %cd /content/drive/MyDrive/data/transformed/
# with open("df_for_Anomaly_prediction.pkl", "rb") as f:
#     loaded_data = pkl.load(f)

# df = loaded_data
# print("\nA Dataframe with", len(df), "entries has been loaded")

In [10]:
# Load the FINAL TEST data (10805 entries)
# Do not touch the TEST data until the end of the project,
# or the curse of the Greek gods will fall upon you!

%cd /content/drive/MyDrive/data/transformed/
with open("df_test_for_Anomaly_prediction.pkl", "rb") as f:
    loaded_data = pkl.load(f)

df = loaded_data
print("\nA Dataframe with", len(df), "entries has been loaded")

/content/drive/MyDrive/data/transformed

A Dataframe with 10805 entries has been loaded


In [11]:
df.head(2)

ACN | Narrative | Anomaly | Anomaly_Deviation / Discrepancy - Procedural | Anomaly_Aircraft Equipment | Anomaly_Conflict | Anomaly_Inflight Event / Encounter | Anomaly_ATC Issue | Anomaly_Deviation - Altitude | Anomaly_Deviation - Track / Heading | Anomaly_Ground Event / Encounter | Anomaly_Flight Deck / Cabin / Aircraft Event | Anomaly_Ground Incursion | Anomaly_Airspace Violation | Anomaly_Deviation - Speed | Anomaly_Ground Excursion | Anomaly_No Specific Anomaly Occurred
--------|-----------|---------|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1014798 | Flying into SLC on the DELTA THREE RNAV arriva... | Aircraft Equipment Problem Less Severe; Deviat... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1806744 | ORD was on a very busy east flow arrival push.... | ATC Issue All Types; Conflict NMAC; Deviation ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0


## Define Anomaly_RootLabels_columns list from data set

In [12]:
# Retrieve the list of Anomaly label columns
Anomaly_RootLabels_columns = []

for col in df.columns:
    if 'Anomaly_' in str(col):
        Anomaly_RootLabels_columns.append(col)

In [13]:
Anomaly_RootLabels_columns

['Anomaly_Deviation / Discrepancy - Procedural',
 'Anomaly_Aircraft Equipment',
 'Anomaly_Conflict',
 'Anomaly_Inflight Event / Encounter',
 'Anomaly_ATC Issue',
 'Anomaly_Deviation - Altitude',
 'Anomaly_Deviation - Track / Heading',
 'Anomaly_Ground Event / Encounter',
 'Anomaly_Flight Deck / Cabin / Aircraft Event',
 'Anomaly_Ground Incursion',
 'Anomaly_Airspace Violation',
 'Anomaly_Deviation - Speed',
 'Anomaly_Ground Excursion',
 'Anomaly_No Specific Anomaly Occurred']
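The order of this list defines the meaning of each position in a one-hot label vector. As a small illustration, the sketch below maps the 14 flags of the first row of `df.head(2)` back to anomaly names. The helper `decode_multilabel` is hypothetical and not part of `aerobotpackages`.

```python
anomaly_columns = [
    'Anomaly_Deviation / Discrepancy - Procedural',
    'Anomaly_Aircraft Equipment',
    'Anomaly_Conflict',
    'Anomaly_Inflight Event / Encounter',
    'Anomaly_ATC Issue',
    'Anomaly_Deviation - Altitude',
    'Anomaly_Deviation - Track / Heading',
    'Anomaly_Ground Event / Encounter',
    'Anomaly_Flight Deck / Cabin / Aircraft Event',
    'Anomaly_Ground Incursion',
    'Anomaly_Airspace Violation',
    'Anomaly_Deviation - Speed',
    'Anomaly_Ground Excursion',
    'Anomaly_No Specific Anomaly Occurred',
]


def decode_multilabel(labels, columns, prefix='Anomaly_'):
    """Return the anomaly names whose flag is 1 (hypothetical helper)."""
    if len(labels) != len(columns):
        raise ValueError('labels and columns must have the same length')
    return [col[len(prefix):] if col.startswith(prefix) else col
            for col, flag in zip(columns, labels) if flag == 1]


# Flags of report 1014798, the first row shown by df.head(2)
first_row_labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
decoded = decode_multilabel(first_row_labels, anomaly_columns)
print(decoded)
# ['Deviation / Discrepancy - Procedural', 'Aircraft Equipment', 'Deviation - Altitude']
```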

# Multilabel with BERT


## Install 🤗 Hugging Face

In [None]:
! pip install transformers 

#! pip install datasets
# Use this instead (see https://github.com/huggingface/datasets/pull/5120): 
! pip install git+https://github.com/huggingface/datasets#egg=datasets

! pip install huggingface_hub

# GLOBAL FUNCTION


In [15]:
# Create the desired directory for saving the outputs
dir_name = '/content/drive/MyDrive/data/saved models/Yannis/BERT/2022_10_25_11_3_5_repeatability_test2_of_11_3_3/'
experiment_name = '11_3_5' # the subdirectory is created automatically
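How the global function creates the subdirectory is not shown in this notebook. A minimal sketch of one way to do it, assuming the outputs simply go to `<dir_name>/<experiment_name>`, could look as follows. The helper `make_experiment_dir` is hypothetical, and a temporary folder stands in for the Google Drive path.

```python
import tempfile
from pathlib import Path


def make_experiment_dir(dir_name, experiment_name):
    """Create <dir_name>/<experiment_name> if it is missing and return it.

    Hypothetical helper: the layout is an assumption, not the
    documented behavior of train_load_transformer_model.
    """
    target = Path(dir_name) / experiment_name
    # parents=True builds missing parent folders;
    # exist_ok=True makes repeated runs harmless
    target.mkdir(parents=True, exist_ok=True)
    return target


# A temporary folder replaces the Drive path for this demonstration
base_dir = tempfile.mkdtemp()
out_dir = make_experiment_dir(base_dir, '11_3_5')
print(out_dir.name, out_dir.is_dir())
```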

In [16]:
train_load_transformer_model(dir_name = dir_name,
          experiment_name = experiment_name,
          df = df,
          anomalies = Anomaly_RootLabels_columns, 
          train_mode = False, 
          num_epochs = 20,
          load_model = True, 
          save_and_overwrite_model = True)

Creating multilabels...
Example of text and corresponding multilabel:

labels           [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
text      Flying into SLC on the DELTA THREE RNAV arriva...
Name: 0, dtype: object


get_text_and_labels() done
****************************** 

Building dummy train and validation datasets: they contain only the first two entries of the test set.
dummy train set length: 2
dummy validation set length: 2
test set length: 10805


build_dummy_datasets() done
****************************** 

Combining pd.DataFrames into a HuggingFace dataset...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



 Structure of Hugging Face dataset (train): Dataset({
    features: ['labels', 'text'],
    num_rows: 2
})

 Structure of the complete Hugging Face dataset:
 DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 2
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 2
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10805
    })
})

 First entry of the train dataset:
 {'labels': [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], 'text': "Flying into SLC on the DELTA THREE RNAV arrival (DELTA.DELTA3). Somewhere prior to MLF; we were told to descend via the arrival and delete speed restrictions. We already were descending in VNAV to JAMMN at 17;000 FT so we selected 11;000 FT (lowest crossing altitude). We had thoroughly briefed the descent/arrival and approach prior to top of descent. We were monitoring the descent at each point and the aircraft was doing a crossing JAMMN at 17;000 FT; D

Performing tokenization in batches on: train, validation, test sets...


Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']

 Structure of the complete tokenized Hugging Face dataset:
 DatasetDict({
    train: Dataset({
        features: ['labels', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2
    })
    validation: Dataset({
        features: ['labels', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2
    })
    test: Dataset({
        features: ['labels', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10805
    })
})


tokenize_the_BERT_way() done
****************************** 
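The tokenizer adds `input_ids`, `token_type_ids` and `attention_mask`, and the tensors produced in this run have a fixed length of 200. The sketch below illustrates what padding and truncation to a fixed length mean for these three fields. It is not the Hugging Face tokenizer: the token ids are hypothetical, a padding id of 0 is assumed, and special tokens are ignored when truncating.

```python
MAX_LENGTH = 200  # sequence length seen in the tensor shapes of this run
PAD_ID = 0        # assumption: padding id 0


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut or pad a list of token ids to exactly max_length entries.

    Simplification: a real tokenizer keeps the closing special token
    when it truncates; this sketch just cuts the list.
    """
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,
        # single-sentence input: every position belongs to segment 0
        'token_type_ids': [0] * max_length,
        # 1 marks real tokens, 0 marks padding to be ignored by attention
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7632, 102])       # hypothetical ids
long_ = pad_or_truncate(list(range(1, 301)))    # 300 ids, gets truncated

print(len(short['input_ids']), sum(short['attention_mask']))
print(len(long_['input_ids']), sum(long_['attention_mask']))
```

Narratives longer than the fixed length lose their tail under this scheme, which is worth keeping in mind when choosing the maximum length.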

Converting tokenized_dataset into tf.data.Dataset datasets...

 Structure of the train tf.Data.dataset:
 <PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, 200), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(None, 200), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 200), dtype=tf.int64, name=None)}, Tensor

  _warn_prf(average, modifier, msg_start, len(result))
