**Kaggle competition**: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)

- The metric used is "F1-score".

- The algorithm developed in this notebook achieved an F1-score of about **82%** on the test set (the score is calculated by the Kaggle platform).
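For reference, F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch on made-up labels: `f1_from_labels`, `y_true` and `y_pred` are illustrative names and values, not taken from the competition data. The notebook itself imports `f1_score` from `sklearn.metrics` for this purpose.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (3 true positives, 1 false positive, 1 false negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
demo_f1 = f1_from_labels(y_true, y_pred)
print(f"F1 = {demo_f1:.2f}")
```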

# Install libraries

In [None]:
# A dependency of the preprocessing for BERT inputs
!pip install -q -U tensorflow-text

# AdamW optimizer
!pip install -q tf-models-official

# Optuna: model optimization lib
!pip install -q optuna


# Import libraries

In [None]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, jaccard_score, precision_score, recall_score

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer
import tensorflow_addons as tfa

import optuna

nlp = spacy.load("en_core_web_sm")

# Set the configuration of this project

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

CONFIGURATION = dict (
    seed = 0,
    nbr_classes = 1,
    nbr_folds = 5,
    batch_size = 32,
    learning_rate_min = 1e-7,
    epochs = 20
)

Check if a GPU is available

In [None]:
gpus = tf.config.list_physical_devices('GPU')

if gpus:
  try:    
    # Currently, memory growth needs to be the same across GPUs
    
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

1 Physical GPUs, 1 Logical GPUs


# Access to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Data is saved in a zip file

zip_path = '/content/drive/MyDrive/Gaetan_Travail/ML/Projets_perso/NLP/Kaggle/Disaster_Tweets/nlp-getting-started.zip'

!cp {zip_path} /content/
!unzip /content/nlp-getting-started.zip -d /content/
!rm /content/nlp-getting-started.zip

Archive:  /content/nlp-getting-started.zip
  inflating: /content/sample_submission.csv  
  inflating: /content/test.csv       
  inflating: /content/train.csv      


# Check the DataFrames

In [None]:
df_train = pd.read_csv("/content/train.csv")
df_test = pd.read_csv("/content/test.csv")

print(f"df_train shape = {df_train.shape}\ndf_test shape = {df_test.shape}")

df_train shape = (7613, 5)
df_test shape = (3263, 4)


In [None]:
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [None]:
df_train.isna().sum()/len(df_train)*100

id           0.000000
keyword      0.801261
location    33.272035
text         0.000000
target       0.000000
dtype: float64

In [None]:
df_train_clean = df_train[["text", "target"]].reset_index(drop=True)
print(f"df_train_clean shape = {df_train_clean.shape}")
df_train_clean.head()

df_train_clean shape = (7613, 2)


| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |


# Define functions for the pipeline

In [None]:
# https://www.kaggle.com/alifvianmarco/nlp-disaster-tweets-classification

abbreviations = {
    "$" : " dollar ",
    "€" : " euro ",
    "4ao" : "for adults only",
    "a.m" : "before midday",
    "a3" : "anytime anywhere anyplace",
    "aamof" : "as a matter of fact",
    "acct" : "account",
    "adih" : "another day in hell",
    "afaic" : "as far as i am concerned",
    "afaict" : "as far as i can tell",
    "afaik" : "as far as i know",
    "afair" : "as far as i remember",
    "afk" : "away from keyboard",
    "app" : "application",
    "approx" : "approximately",
    "apps" : "applications",
    "asap" : "as soon as possible",
    "asl" : "age, sex, location",
    "atk" : "at the keyboard",
    "ave." : "avenue",
    "aymm" : "are you my mother",
    "ayor" : "at your own risk", 
    "b&b" : "bed and breakfast",
    "b+b" : "bed and breakfast",
    "b.c" : "before christ",
    "b2b" : "business to business",
    "b2c" : "business to customer",
    "b4" : "before",
    "b4n" : "bye for now",
    "b@u" : "back at you",
    "bae" : "before anyone else",
    "bak" : "back at keyboard",
    "bbbg" : "bye bye be good",
    "bbc" : "british broadcasting corporation",
    "bbias" : "be back in a second",
    "bbl" : "be back later",
    "bbs" : "be back soon",
    "be4" : "before",
    "bfn" : "bye for now",
    "blvd" : "boulevard",
    "bout" : "about",
    "brb" : "be right back",
    "bros" : "brothers",
    "brt" : "be right there",
    "bsaaw" : "big smile and a wink",
    "btw" : "by the way",
    "bwl" : "bursting with laughter",
    "c/o" : "care of",
    "cet" : "central european time",
    "cf" : "compare",
    "cia" : "central intelligence agency",
    "csl" : "can not stop laughing",
    "cu" : "see you",
    "cul8r" : "see you later",
    "cv" : "curriculum vitae",
    "cwot" : "complete waste of time",
    "cya" : "see you",
    "cyt" : "see you tomorrow",
    "dae" : "does anyone else",
    "dbmib" : "do not bother me i am busy",
    "diy" : "do it yourself",
    "dm" : "direct message",
    "dwh" : "during work hours",
    "e123" : "easy as one two three",
    "eet" : "eastern european time",
    "eg" : "example",
    "embm" : "early morning business meeting",
    "encl" : "enclosed",
    "encl." : "enclosed",
    "etc" : "and so on",
    "faq" : "frequently asked questions",
    "fawc" : "for anyone who cares",
    "fb" : "facebook",
    "fc" : "fingers crossed",
    "fig" : "figure",
    "fimh" : "forever in my heart", 
    "ft." : "feet",
    "ft" : "featuring",
    "ftl" : "for the loss",
    "ftw" : "for the win",
    "fwiw" : "for what it is worth",
    "fyi" : "for your information",
    "g9" : "genius",
    "gahoy" : "get a hold of yourself",
    "gal" : "get a life",
    "gcse" : "general certificate of secondary education",
    "gfn" : "gone for now",
    "gg" : "good game",
    "gl" : "good luck",
    "glhf" : "good luck have fun",
    "gmt" : "greenwich mean time",
    "gmta" : "great minds think alike",
    "gn" : "good night",
    "g.o.a.t" : "greatest of all time",
    "goat" : "greatest of all time",
    "goi" : "get over it",
    "gps" : "global positioning system",
    "gr8" : "great",
    "gratz" : "congratulations",
    "gyal" : "girl",
    "h&c" : "hot and cold",
    "hp" : "horsepower",
    "hr" : "hour",
    "hrh" : "his royal highness",
    "ht" : "height",
    "ibrb" : "i will be right back",
    "ic" : "i see",
    "icq" : "i seek you",
    "icymi" : "in case you missed it",
    "idc" : "i do not care",
    "idgadf" : "i do not give a damn fuck",
    "idgaf" : "i do not give a fuck",
    "idk" : "i do not know",
    "ie" : "that is",
    "i.e" : "that is",
    "ifyp" : "i feel your pain",
    "IG" : "instagram",
    "iirc" : "if i remember correctly",
    "ilu" : "i love you",
    "ily" : "i love you",
    "imho" : "in my humble opinion",
    "imo" : "in my opinion",
    "imu" : "i miss you",
    "iow" : "in other words",
    "irl" : "in real life",
    "j4f" : "just for fun",
    "jic" : "just in case",
    "jk" : "just kidding",
    "jsyk" : "just so you know",
    "l8r" : "later",
    "lb" : "pound",
    "lbs" : "pounds",
    "ldr" : "long distance relationship",
    "lmao" : "laugh my ass off",
    "lmfao" : "laugh my fucking ass off",
    "lol" : "laughing out loud",
    "ltd" : "limited",
    "ltns" : "long time no see",
    "m8" : "mate",
    "mf" : "motherfucker",
    "mfs" : "motherfuckers",
    "mfw" : "my face when",
    "mofo" : "motherfucker",
    "mph" : "miles per hour",
    "mr" : "mister",
    "mrw" : "my reaction when",
    "ms" : "miss",
    "mte" : "my thoughts exactly",
    "nagi" : "not a good idea",
    "nbc" : "national broadcasting company",
    "nbd" : "not big deal",
    "nfs" : "not for sale",
    "ngl" : "not going to lie",
    "nhs" : "national health service",
    "nrn" : "no reply necessary",
    "nsfl" : "not safe for life",
    "nsfw" : "not safe for work",
    "nth" : "nice to have",
    "nvr" : "never",
    "nyc" : "new york city",
    "oc" : "original content",
    "og" : "original",
    "ohp" : "overhead projector",
    "oic" : "oh i see",
    "omdb" : "over my dead body",
    "omg" : "oh my god",
    "omw" : "on my way",
    "p.a" : "per annum",
    "p.m" : "after midday",
    "pm" : "prime minister",
    "poc" : "people of color",
    "pov" : "point of view",
    "pp" : "pages",
    "ppl" : "people",
    "prw" : "parents are watching",
    "ps" : "postscript",
    "pt" : "point",
    "ptb" : "please text back",
    "pto" : "please turn over",
    "qpsa" : "what happens", #"que pasa",
    "ratchet" : "rude",
    "rbtl" : "read between the lines",
    "rlrt" : "real life retweet", 
    "rofl" : "rolling on the floor laughing",
    "roflol" : "rolling on the floor laughing out loud",
    "rotflmao" : "rolling on the floor laughing my ass off",
    "rt" : "retweet",
    "ruok" : "are you ok",
    "sfw" : "safe for work",
    "sk8" : "skate",
    "smh" : "shake my head",
    "sq" : "square",
    "srsly" : "seriously", 
    "ssdd" : "same stuff different day",
    "tbh" : "to be honest",
    "tbs" : "tablespooful",
    "tbsp" : "tablespooful",
    "tfw" : "that feeling when",
    "thks" : "thank you",
    "tho" : "though",
    "thx" : "thank you",
    "tia" : "thanks in advance",
    "til" : "today i learned",
    "tl;dr" : "too long i did not read",
    "tldr" : "too long i did not read",
    "tmb" : "tweet me back",
    "tntl" : "trying not to laugh",
    "ttyl" : "talk to you later",
    "u" : "you",
    "u2" : "you too",
    "u4e" : "yours for ever",
    "utc" : "coordinated universal time",
    "w/" : "with",
    "w/o" : "without",
    "w8" : "wait",
    "wassup" : "what is up",
    "wb" : "welcome back",
    "wtf" : "what the fuck",
    "wtg" : "way to go",
    "wtpa" : "where the party at",
    "wuf" : "where are you from",
    "wuzup" : "what is up",
    "wywh" : "wish you were here",
    "yd" : "yard",
    "ygtr" : "you got that right",
    "ynk" : "you never know",
    "zzz" : "sleeping bored and tired"
}

def word_abbrev(word):
    # Look up the lowercased token; return the word unchanged if it is not an abbreviation
    return abbreviations.get(word.lower(), word)

# Replace all abbreviations
def replace_abbrev(text):
    # Expand each whitespace-separated token and rebuild the sentence
    return " ".join(word_abbrev(word) for word in text.split())
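
The lookup above works token by token on whitespace-separated words. Here is a minimal, self-contained sketch of that behaviour, assuming a tiny hypothetical table (`demo_abbreviations`) and a helper named `expand`, neither of which is part of the notebook. It also shows one limitation: a token with attached punctuation is not a dictionary key, so it is left untouched.

```python
# Tiny hypothetical subset of the abbreviation table, for illustration only
demo_abbreviations = {"omg": "oh my god", "w/": "with", "ppl": "people"}


def expand(text, table):
    """Replace each whitespace-separated token that is a key of `table`."""
    return " ".join(table.get(token.lower(), token) for token in text.split())


expanded = expand("OMG so many ppl w/ no power", demo_abbreviations)
print(expanded)

# "omg!" and "ppl," are not keys of the table, so they pass through unchanged
unchanged = expand("omg! ppl, stay safe", demo_abbreviations)
print(unchanged)
```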

In [None]:
def keep_text_column(df, train=False):
  if train:
    return df[["text", "target"]].reset_index(drop=True)
  else:
    return df[["text"]].reset_index(drop=True)

def delete_special_char(df_column):
  # df_column = df_column.apply(lambda x: re.sub(r'^RT[\s]+', '', x))
  # Remove URLs (the pattern also removes any text that follows the URL on the same line)
  df_column = df_column.apply(lambda x: re.sub(r'https?:\/\/.*[\r\n]*', '', x))
  # Remove the hashtag symbol but keep the tag text
  df_column = df_column.apply(lambda x: re.sub(r'#', '', x))
  # Replace mentions with a space, then drop every character that is not an ASCII letter, digit or space
  df_column = df_column.apply(lambda x: re.sub(r'@\w+',' ', x))
  df_column = df_column.apply(lambda x: re.sub(r'[^A-Za-z0-9 ]+', '', x))
  return df_column

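def spacy_tokenizer(sentence):
  # Blank English pipeline used to re-tokenize the filtered text
  parser = English()
  # Drop numerals. "-PRON-" is a spaCy v2 lemma, not a POS tag, so that test never excludes a token.
  mytokens = parser(" ".join([str(x) for x in nlp(sentence) if x.pos_ != "NUM" and x.pos_ != "-PRON-"]))
  # Lowercase and strip the lemma of each token
  mytokens = [word.lemma_.lower().strip() for word in mytokens]
  # Remove stop words and punctuation
  mytokens = [word for word in mytokens if word not in STOP_WORDS and word not in string.punctuation]
  mytokens = " ".join(mytokens)

  return mytokens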

def pipeline_for_disater_tweets(df, train=False):
  df_clean = keep_text_column(df, train)
  df_clean.text = df_clean.text.apply(replace_abbrev)
  df_clean.text = delete_special_char(df_clean.text)
  df_clean.text = df_clean.text.apply(lambda x: spacy_tokenizer(x))
  
  return df_clean
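
To see what the regular expressions in `delete_special_char` do to a single tweet, here is a small sketch on a made-up example. `clean_greedy` applies the same four substitutions in the same order; `clean_url_only` is a hypothetical variant, not used in this notebook, whose URL pattern stops at the first whitespace. Both helpers also collapse repeated spaces, which is an addition made only to keep the printed result readable.

```python
import re


def clean_greedy(text):
    """Apply the four substitutions of delete_special_char to one string."""
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)   # URL and the rest of the line
    text = re.sub(r'#', '', text)                      # hashtag symbol
    text = re.sub(r'@\w+', ' ', text)                  # mentions
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)         # remaining special characters
    return " ".join(text.split())                      # collapse spaces (illustration only)


def clean_url_only(text):
    """Hypothetical variant: the URL pattern ends at the first whitespace."""
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'#', '', text)
    text = re.sub(r'@\w+', ' ', text)
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)
    return " ".join(text.split())


# Made-up tweet, not taken from the dataset
sample = "Forest fire near La Ronge http://t.co/abc #wildfire @user stay safe!"
print(clean_greedy(sample))
print(clean_url_only(sample))
```

With the pattern used in the notebook, the words that follow a URL are dropped together with it; the variant keeps them.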

In [None]:
new_df = pipeline_for_disater_tweets(df_train, train=True)

In [None]:
print(f"new df shape = {new_df.shape}")
new_df.head()

new df shape = (7613, 2)


| | text | target |
|---|---|---|
| 0 | deeds reason earthquake allah forgive | 1 |
| 1 | forest fire near la ronge sask canada | 1 |
| 2 | residents asked shelter place notified officer... | 1 |
| 3 | people receive wildfires evacuation orders cal... | 1 |
| 4 | got sent photo ruby alaska smoke wildfires pou... | 1 |


In [None]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    7613 non-null   object
 1   target  7613 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 119.1+ KB


In [None]:
# saving this new df
new_df.to_csv("/content/drive/MyDrive/Gaetan_Travail/ML/Projets_perso/NLP/Kaggle/Disaster_Tweets/df_cleaned_v2.csv", index=False)

# Imbalance of the dataset

In [None]:
df_clean = pd.read_csv("/content/drive/MyDrive/Gaetan_Travail/ML/Projets_perso/NLP/Kaggle/Disaster_Tweets/df_cleaned_v2.csv")
# Drop the rows whose text is empty after cleaning (they are read back as NaN)
df_clean = df_clean[df_clean.text.notna()].reset_index(drop=True)
print(f"new df shape = {df_clean.shape}")
df_clean.head()

new df shape = (7551, 2)


| | text | target |
|---|---|---|
| 0 | deeds reason earthquake allah forgive | 1 |
| 1 | forest fire near la ronge sask canada | 1 |
| 2 | residents asked shelter place notified officer... | 1 |
| 3 | people receive wildfires evacuation orders cal... | 1 |
| 4 | got sent photo ruby alaska smoke wildfires pou... | 1 |


In [None]:
# Estimate class weights based on the imbalance of the data set

neg = df_clean.target.value_counts()[0]
pos = df_clean.target.value_counts()[1]

print("neg = {}\npos = {}\n".format(neg, pos))

# Weights to correct the imbalance:
total = neg + pos

# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)

class_weights = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}\n'.format(weight_for_1))

# Initial bias:

initial_bias = np.log([pos/neg])
print("initial_bias: {:.2f}".format(initial_bias[0]))

neg = 4309
pos = 3242

Weight for class 0: 0.88
Weight for class 1: 1.16

initial_bias: -0.28
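
The numbers printed above can be checked by hand. The sketch below redoes the arithmetic with the standard library only, using the class counts from the output. It also illustrates why the bias is set to `log(pos / neg)`: when the weighted input to the output unit is zero, the sigmoid of that bias equals the share of positive examples.

```python
import math

neg, pos = 4309, 3242  # class counts printed above
total = neg + pos

weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)

# Each class contributes total / 2, so the weighted number of examples is still total
weighted_sum = neg * weight_for_0 + pos * weight_for_1

# Bias chosen so that sigmoid(bias) matches the positive rate of the data
initial_bias = math.log(pos / neg)
initial_prob = 1 / (1 + math.exp(-initial_bias))

print(f"weights: {weight_for_0:.2f}, {weight_for_1:.2f}")
print(f"weighted sum: {weighted_sum:.1f} (total = {total})")
print(f"sigmoid(bias) = {initial_prob:.4f}, positive rate = {pos / total:.4f}")
```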


# Cross validation (to obtain train and validation sets)

In [None]:
kfold = StratifiedKFold(n_splits=CONFIGURATION["nbr_folds"], shuffle=True, random_state=CONFIGURATION["seed"])

# Tag each row with the index of the fold in which it is used for validation
for n, (_, val_index) in enumerate(kfold.split(df_clean, df_clean['target'])):
  df_clean.loc[val_index, 'fold'] = n

df_clean['fold'] = df_clean['fold'].astype(int)

df_clean.groupby(['fold', 'target']).size()

fold  target
0     0         862
      1         649
1     0         861
      1         649
2     0         862
      1         648
3     0         862
      1         648
4     0         862
      1         648
dtype: int64
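
The table above shows that every fold keeps roughly the same class ratio. As a rough illustration of the idea behind stratification, the sketch below shuffles the indices of each class and deals them to the folds in turn. This is not scikit-learn's implementation (which fold receives the extra samples differs), and `stratified_folds` is a hypothetical helper written for this example only.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds, seed=0):
    """Assign a fold index to every sample, dealing each class round-robin."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            folds[i] = k % n_folds
    return folds


# Same class counts as the cleaned training set (4309 negatives, 3242 positives)
labels = [0] * 4309 + [1] * 3242
folds = stratified_folds(labels, n_folds=5)
counts = Counter(zip(folds, labels))

for fold in range(5):
    print(f"fold {fold}: class 0 = {counts[(fold, 0)]}, class 1 = {counts[(fold, 1)]}")
```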

In [None]:
df_clean.head()

| | text | target | fold |
|---|---|---|---|
| 0 | deeds reason earthquake allah forgive | 1 | 1 |
| 1 | forest fire near la ronge sask canada | 1 | 4 |
| 2 | residents asked shelter place notified officer... | 1 | 4 |
| 3 | people receive wildfires evacuation orders cal... | 1 | 1 |
| 4 | got sent photo ruby alaska smoke wildfires pou... | 1 | 0 |


In [None]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7551 entries, 0 to 7550
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    7551 non-null   object
 1   target  7551 non-null   int64 
 2   fold    7551 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 177.1+ KB


# Creating Datasets

In [None]:
# Dataloaders for the train set and the validation set

def get_dataloaders(train_df, valid_df):
    trainloader = tf.data.Dataset.from_tensor_slices((train_df.text.values, train_df.target.values))
    validloader = tf.data.Dataset.from_tensor_slices((valid_df.text.values, valid_df.target.values))
    
    trainloader = (
        trainloader
        .shuffle(1024)
        .batch(CONFIGURATION['batch_size'])
        .prefetch(AUTOTUNE)
    )

    validloader = (
        validloader
        .batch(CONFIGURATION['batch_size'])
        .prefetch(AUTOTUNE)
    )
    
    return trainloader, validloader

In [None]:
def get_dataloader_test(test_df):
    testloader = tf.data.Dataset.from_tensor_slices(test_df.text.values)
    
    testloader = (
        testloader
        .batch(CONFIGURATION['batch_size'])
        .prefetch(AUTOTUNE)
    )
    
    return testloader

# Embedding - BERT

In [None]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


# Model optimization with Optuna

In [None]:
def objective(trial):
  # Clear clutter from previous Keras session graphs.
  tf.keras.backend.clear_session()

  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)

  x = outputs['pooled_output']
  x = tf.keras.layers.Dropout(
      rate = trial.suggest_float('dropout_rate', 0.1, 0.5)
  )(x)      

  output_ = tf.keras.layers.Dense(
      CONFIGURATION['nbr_classes'],
      activation="sigmoid",
      name="optimized_model",
      bias_initializer=tf.keras.initializers.Constant(initial_bias)  
  )(x)

  model = tf.keras.Model(text_input, output_)

  #----------------------#

  # Prepare train and valid df
  fold = 0
  df = df_clean.copy()
  train_df = df.loc[df.fold != fold].reset_index(drop=True)
  valid_df = df.loc[df.fold == fold].reset_index(drop=True)
  trainloader, validloader = get_dataloaders(train_df, valid_df)

  #----------------------#

  steps_per_epoch = tf.data.experimental.cardinality(trainloader).numpy()
  num_train_steps = steps_per_epoch * CONFIGURATION["epochs"]
  num_warmup_steps = int(0.1*num_train_steps)

  lr = trial.suggest_float('lr', 1e-6, 1e-4, log=True)
  optimizer_ = optimization.create_optimizer(init_lr=lr,
                                            num_train_steps=num_train_steps,
                                            num_warmup_steps=num_warmup_steps,
                                            optimizer_type='adamw')
  
  # We compile our model with a sampled learning rate.  
  
  model.compile(
      optimizer = optimizer_,
      loss = tf.keras.losses.BinaryCrossentropy(),
      metrics = tf.metrics.BinaryAccuracy()
  )

  #----------------------#

  # Model training (the batch size is already set by the tf.data pipeline)
  model.fit(
    trainloader,
    epochs=CONFIGURATION["epochs"],
    validation_data=validloader,
    class_weight=class_weights,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
    ]
  )

  #----------------------#

  # Evaluate the model accuracy on the validation set.
  score = model.evaluate(validloader)
  return score[1]
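
The step counts passed to `create_optimizer` in the objective can be worked out by hand for fold 0. The sketch below does the arithmetic with the fold sizes and configuration values shown earlier. The `lr_at` function is only an assumption about the general shape of the schedule (linear warmup, then linear decay to zero); the exact curve produced by `create_optimizer` may differ.

```python
import math

n_train = 7551 - 1511          # rows outside fold 0 (fold 0 holds 862 + 649 rows)
batch_size, epochs = 32, 20    # values from CONFIGURATION

# The last, smaller batch is kept, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)


def lr_at(step, init_lr):
    """Assumed schedule: linear warmup to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = num_train_steps - step
    return init_lr * max(0, remaining) / (num_train_steps - num_warmup_steps)


print(steps_per_epoch, num_train_steps, num_warmup_steps)
for step in (0, num_warmup_steps // 2, num_warmup_steps, num_train_steps):
    print(step, lr_at(step, init_lr=1e-4))
```

Because early stopping usually ends training after a few epochs, only the first part of such a schedule is actually used.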

In [None]:
# Create a study object and optimize the objective function.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20, timeout=600)

[I 2021-08-16 14:26:19,755] A new study created in memory with name: no-name-9846440d-00cc-496e-a174-a4e731bbc299

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20

[I 2021-08-16 14:30:06,113] Trial 0 finished with value: 0.8133686184883118 and parameters: {'dropout_rate': 0.39168283760237466, 'lr': 7.74060918356966e-05}. Best is trial 0 with value: 0.8133686184883118.

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20

[I 2021-08-16 14:34:31,670] Trial 1 finished with value: 0.790205180644989 and parameters: {'dropout_rate': 0.4963480362076478, 'lr': 1.152995463262945e-05}. Best is trial 0 with value: 0.8133686184883118.

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20

[I 2021-08-16 14:39:34,256] Trial 2 finished with value: 0.8146922588348389 and parameters: {'dropout_rate': 0.13355663396579953, 'lr': 4.779455018530521e-05}. Best is trial 2 with value: 0.8146922588348389.


In [None]:
print(f"Number of finished trials: {len(study.trials)}")

trial = study.best_trial
print("Best trial:")  
print(f"\tValue: {trial.value}")
print("\tParams: ")
for key, value in trial.params.items():
  print(f"\t\t{key}: {value}")

Number of finished trials: 3
Best trial:
	Value: 0.8146922588348389
	Params: 
		dropout_rate: 0.13355663396579953
		lr: 4.779455018530521e-05


# Model with the best hyperparameters

In [None]:
def build_classifier_model(model_name):
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)

  x = outputs['pooled_output']
  x = tf.keras.layers.Dropout(
      trial.params["dropout_rate"]
  )(x)

  output_ = tf.keras.layers.Dense(
      CONFIGURATION['nbr_classes'],
      activation="sigmoid",
      name=model_name,
      bias_initializer=tf.keras.initializers.Constant(initial_bias)  
  )(x)

  return tf.keras.Model(text_input, output_)

In [None]:
def model_cv(model_name, df, fold):
  
  print(f"Model trained: {model_name}")

  #----------------------#

  print("\n******************\n")

  print(f"Fold {fold}:\n")

  # Prepare train and valid df
  train_df = df.loc[df.fold != fold].reset_index(drop=True)
  valid_df = df.loc[df.fold == fold].reset_index(drop=True)
  trainloader, validloader = get_dataloaders(train_df, valid_df)

  #----------------------#

  # Creating the model
  tf.keras.backend.clear_session()
  model = build_classifier_model(model_name)

  #----------------------#

  steps_per_epoch = tf.data.experimental.cardinality(trainloader).numpy()
  num_train_steps = steps_per_epoch * CONFIGURATION["epochs"]
  num_warmup_steps = int(0.1 * num_train_steps)

  optimizer_ = optimization.create_optimizer(
      init_lr=trial.params["lr"],
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      optimizer_type='adamw'
  )
  
  # compiling the model
  model.compile(
      optimizer = optimizer_,
      loss = tf.keras.losses.BinaryCrossentropy(),
      metrics = tf.metrics.BinaryAccuracy()
  ) 

  #----------------------#
        
  # Model training (the loaders are already batched, so `batch_size` is not passed to fit)
  model.fit(
    trainloader,
    epochs=CONFIGURATION["epochs"],
    validation_data=validloader,
    class_weight=class_weights,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
    ]
  )

  #----------------------#

  # Saving the model
  # model.save(f'/content/drive/MyDrive/Gaetan_Travail/ML/Projets_perso/NLP/Kaggle/Disaster_Tweets/Models/{model_name}_{fold}.h5')

  return model
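
The warmup arithmetic used in `model_cv` can be illustrated with a small standalone sketch. The helper name `schedule_steps` is hypothetical, and the value of 192 training batches per epoch is an assumption (roughly four times the 48 validation batches reported per fold with 5 folds); in the notebook the real value comes from the dataset cardinality. The 20 epochs match the `Epoch 1/20` lines of the training log.

```python
def schedule_steps(steps_per_epoch, epochs, warmup_fraction=0.1):
    """Return (total training steps, warmup steps) for a warmup + decay schedule."""
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return num_train_steps, num_warmup_steps


# steps_per_epoch=192 is an assumed value for illustration only
total_steps, warmup_steps = schedule_steps(steps_per_epoch=192, epochs=20)
print(total_steps, warmup_steps)  # 3840 384
```

Since early stopping ends each fold after 4 to 6 epochs in the log, only the first part of a schedule sized for 20 epochs is likely to be traversed.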

In [None]:
list_models = []

for fold_ in range(CONFIGURATION["nbr_folds"]):
  model = model_cv(model_name="model_optimized", df=df_clean, fold=fold_)
  list_models.append(model)

Model trained: model_optimized

******************

Fold 0:

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Model trained: model_optimized

******************

Fold 1:

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Model trained: model_optimized

******************

Fold 2:

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Model trained: model_optimized

******************

Fold 3:

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Model trained: model_optimized

******************

Fold 4:

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20


# Checking F1 score and threshold to use

In [None]:
# Scan thresholds from min_range/100 up to (max_range - 1)/100 and return the one that maximizes the weighted F1 score
def max_scores(target, y_pred, min_range=10, max_range=90):

    result_threshold = []
    result_f1 = []
    result_jaccard = []
    result_prec = []
    result_recall = []

    for i in range(min_range, max_range, 1):
      threshold = i/100

      y_pred_threshold = [1 if x > threshold else 0 for x in y_pred]
                                                           
      result_threshold.append(threshold)
      result_jaccard.append(jaccard_score(target, y_pred_threshold, average='weighted'))
      result_f1.append(f1_score(target, y_pred_threshold, average='weighted'))        
      result_prec.append(precision_score(target, y_pred_threshold, average='weighted'))
      result_recall.append(recall_score(target, y_pred_threshold, average='weighted'))

    # we want to maximize the F1 score:
    index = result_f1.index(max(result_f1))
    
    print("Threshold = ", result_threshold[index])
    print("\nScores (average == weighted):\n")
    print("Jaccard score    = {:.4f}".format(result_jaccard[index]))
    print("F1 score         = {:.4f}".format(result_f1[index]))
    print("Precision score  = {:.4f}".format(result_prec[index]))
    print("Recall score     = {:.4f}".format(result_recall[index]))
  
    return result_threshold[index]

In [None]:
list_threshold = []

for fold, model in enumerate(list_models):

  print(f"Fold {fold}:")

  train_df = df_clean.loc[df_clean.fold != fold].reset_index(drop=True)
  valid_df = df_clean.loc[df_clean.fold == fold].reset_index(drop=True)
  trainloader, validloader = get_dataloaders(train_df, valid_df)

  valid_df["y_pred"] = model.predict(validloader)

  list_threshold.append(max_scores(valid_df.target, valid_df.y_pred, min_range=10, max_range=90))
  
  print()
  results = model.evaluate(validloader, verbose=2)
  
  print("\n******************\n")

Fold 0:
Threshold =  0.3

Scores (average == weighted):

Jaccard score    = 0.6790
F1 score         = 0.8080
Precision score  = 0.8081
Recall score     = 0.8087

48/48 - 5s - loss: 0.4706 - binary_accuracy: 0.8028

******************

Fold 1:
Threshold =  0.47

Scores (average == weighted):

Jaccard score    = 0.6831
F1 score         = 0.8104
Precision score  = 0.8160
Recall score     = 0.8132

48/48 - 5s - loss: 0.4276 - binary_accuracy: 0.8113

******************

Fold 2:
Threshold =  0.51

Scores (average == weighted):

Jaccard score    = 0.7077
F1 score         = 0.8278
Precision score  = 0.8314
Recall score     = 0.8298

48/48 - 5s - loss: 0.4361 - binary_accuracy: 0.8272

******************

Fold 3:
Threshold =  0.42

Scores (average == weighted):

Jaccard score    = 0.6966
F1 score         = 0.8204
Precision score  = 0.8207
Recall score     = 0.8212

48/48 - 5s - loss: 0.4216 - binary_accuracy: 0.8166

******************

Fold 4:
Threshold =  0.32

Scores (average == weighted):


The best model (i.e. the one with the highest F1-score) used fold 2 as the validation set and the remaining folds as the training set.

It achieved an F1-score of about 83% on the validation set.
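
This selection can be made explicit with a few lines of code. The following is a minimal sketch, not part of the notebook: `pick_best_fold` is a hypothetical helper, and the dictionary holds the weighted F1-scores printed above for folds 0 to 3 (the fold 4 scores do not appear in the output, so that fold is left out here).

```python
def pick_best_fold(f1_by_fold):
    """Return (fold, f1) for the fold with the highest validation F1-score."""
    best = max(f1_by_fold, key=f1_by_fold.get)
    return best, f1_by_fold[best]


# Weighted F1-scores copied from the validation output (folds 0 to 3 only)
f1_by_fold = {0: 0.8080, 1: 0.8104, 2: 0.8278, 3: 0.8204}

best_fold, best_f1 = pick_best_fold(f1_by_fold)
print(best_fold, best_f1)  # 2 0.8278
```

If `max_scores` were changed to return the F1-score alongside the threshold, such a dictionary could be filled inside the evaluation loop instead of by hand.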

# Test set forecasting

Predict the labels of the test set for a Kaggle submission.

In [None]:
new_df_test = pipeline_for_disater_tweets(df_test, train=False)

In [None]:
# Best model -> Fold 2
fold_ = 2

testloader = get_dataloader_test(new_df_test)

y_pred_test = list_models[fold_].predict(testloader)

submit = pd.DataFrame({
    'id': df_test.id,
    'target': [1 if x > list_threshold[fold_] else 0 for x in y_pred_test]
})

name_ = f"submit_v2_fold_{fold_}"
submit.to_csv(f"/content/drive/MyDrive/Gaetan_Travail/ML/Projets_perso/NLP/Kaggle/Disaster_Tweets/{name_}.csv", index=False)
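
The thresholding applied to `y_pred_test` above can also be written in vectorised form. This is a sketch under assumptions: `to_labels` is a hypothetical helper, the probabilities are made-up values shaped like the `(N, 1)` array that Keras `predict` returns for a single-unit output layer, and 0.51 is the fold 2 threshold reported earlier.

```python
import numpy as np


def to_labels(probs, threshold):
    """Convert predicted probabilities to 0/1 labels using a strict '>' comparison."""
    probs = np.asarray(probs).ravel()  # flatten (N, 1) to (N,)
    return (probs > threshold).astype(int)


# Hypothetical probabilities, for illustration only
probs = np.array([[0.10], [0.51], [0.52], [0.93]])

labels = to_labels(probs, threshold=0.51)
print(labels.tolist())  # [0, 0, 1, 1]
```

As in the list comprehension, a probability exactly equal to the threshold is mapped to class 0.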