# 1-Introduction

Based on: https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b

## Objectives
We take part in the Kaggle competition 'Real or Not', which uses tweet data that Kaggle provides in 2 CSV files. The task is to separate the tweets that talk about real disasters from those that do NOT (the latter usually use the same words "metaphorically").

Dataset and competition link: https://www.kaggle.com/c/nlp-getting-started

As stated above, we must classify whether or not each tweet talks about a catastrophe. The 'train' dataset has a 'target' column that labels each tweet as a real disaster (1) or not (0). This is a hard task because the linguistic structure of tweets is ambiguous, so it is not always clear whether a person's words are actually announcing a disaster. For example, suppose a person tweets:
“On the plus side look at the sky last night, it was ablaze”
Here 'ablaze' does not mean that the sky was literally on fire; it is a metaphor for a sky glowing orange. This is easy for us to understand, but not for a machine.
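
To make the ambiguity concrete, here is a minimal sketch of a naive keyword matcher. The word list `DISASTER_WORDS` and the helper `naive_flag` are hypothetical and exist only for this illustration; they are not part of the approach used later in this notebook.

```python
import re

# Hypothetical list of disaster-related words, chosen only for this illustration
DISASTER_WORDS = {"ablaze", "fire", "earthquake", "flood"}

def naive_flag(tweet):
    """Return True if the tweet contains any word from DISASTER_WORDS."""
    words = re.findall(r"[a-z]+", tweet.lower())
    return any(word in DISASTER_WORDS for word in words)

metaphor = "On the plus side look at the sky last night, it was ablaze"
literal = "Forest fire near La Ronge Sask. Canada"

print(naive_flag(metaphor))  # True, although the tweet uses 'ablaze' as a metaphor
print(naive_flag(literal))   # True
```

Both tweets are flagged, so matching on words alone cannot tell the metaphor from the real event; the model needs the surrounding context.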





## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

# 2-Data Preparation

## Loading the local CSV data downloaded from Kaggle

In [2]:
df_train = pd.read_csv('Dataset/train.csv')
df_test = pd.read_csv('Dataset/test.csv')
df_Sample_Subm = pd.read_csv('Dataset/sample_submission.csv')

## Minimal data exploration (the full exploration was done in TP1)


The columns of our dataset are:
 - id: unique identifier of each tweet
 - keyword: a particular keyword from the tweet (may be NaN)
 - location: the place the tweet was sent from (may be NaN)
 - text: the text of the tweet
 - target: 1 if the tweet is about a real disaster, 0 otherwise (only in train.csv).

In [3]:
print (df_train.shape, df_test.shape, df_Sample_Subm.shape) 

(7613, 5) (3263, 4) (3263, 2)


In [4]:
df_train.head(10)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | 8 | NaN | NaN | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | 10 | NaN | NaN | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | 13 | NaN | NaN | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | 14 | NaN | NaN | There's an emergency evacuation happening now ... | 1 |
| 9 | 15 | NaN | NaN | I'm afraid that the tornado is coming to our a... | 1 |


In [5]:
df_test.head(10)

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
| 5 | 12 | NaN | NaN | We're shaking...It's an earthquake |
| 6 | 21 | NaN | NaN | They'd probably still show more life than Arse... |
| 7 | 22 | NaN | NaN | Hey! How are you? |
| 8 | 27 | NaN | NaN | What a nice hat? |
| 9 | 29 | NaN | NaN | Fuck off! |


In [6]:
df_Sample_Subm.head(10)

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |
| 5 | 12 | 0 |
| 6 | 21 | 0 |
| 7 | 22 | 0 |
| 8 | 27 | 0 |
| 9 | 29 | 0 |


In [7]:
# Data cleaning

In [8]:
# use the text alone when there is no keyword, otherwise prepend the keyword to the text
df_train.loc[df_train['keyword'].isnull(), 'cleaned_text'] = df_train['text']
df_train.loc[df_train['keyword'].notnull(), 'cleaned_text'] = df_train['keyword'] + ' ' + df_train['text']

In [9]:
# same rule for the test set
df_test.loc[df_test['keyword'].isnull(), 'cleaned_text'] = df_test['text']
df_test.loc[df_test['keyword'].notnull(), 'cleaned_text'] = df_test['keyword'] + ' ' + df_test['text']
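
The two-branch assignment above is needed because concatenating a missing keyword with the text yields a missing value. A minimal sketch on a toy DataFrame shows this, along with an equivalent one-step alternative based on `fillna`. The second row is made up for illustration, and the variable names `toy` and `combined` are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Toy data: the first row has no keyword; the second row is a made-up example
toy = pd.DataFrame({
    "keyword": [np.nan, "ablaze"],
    "text": ["Forest fire near La Ronge Sask. Canada",
             "the sky was ablaze last night"],
})

# String concatenation propagates the missing keyword
combined = toy["keyword"] + " " + toy["text"]
print(combined.isnull().tolist())  # [True, False]

# Fall back to the plain text wherever the keyword is missing
toy["cleaned_text"] = combined.fillna(toy["text"])
print(toy["cleaned_text"].tolist())
```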

In [10]:
import re

In [11]:
def new_line(text):
    text = re.sub(r'\t', ' ', text) # replace tabs with a space
    text = re.sub(r'\n', ' ', text) # replace line breaks with a space
    return text

In [12]:
def url(text):
    # Many tweets are truncated right before the link, e.g. "Experts in France
    # begin examining airplane debris found on Reunion Island: French air
    # accident experts o... http://t.co/YVVPznZmXg #news". Remove the cut-off
    # fragment (1 to 3 characters plus the ellipsis) together with the URL.
    text = re.sub(r' \w{1,3}\.{3} http\S*', ' ', text)
    # same pattern, with 'Û_' in place of the three dots
    text = re.sub(r' \w{1,3}Û_ http\S*', ' ', text)
    # Some tokens directly before 'http' are removed along with the URL; they are
    # assumed to carry no semantic meaning or predictive power in that position.
    text = re.sub(r"mp3 http\S*", r" ", text)
    text = re.sub(r"rar http\S*", r" ", text)
    pattern = re.compile(r'( pin\:\d+ | via )http\S*')
    text = pattern.sub(r' ', text)
    # These phrases (often truncated) carry little meaning in a tweet; removing
    # them makes the strings more uniform.
    pattern = re.compile(r'Full read by|Full read b|Full read|Full rea|Full re|Full r')
    text = pattern.sub(r' ', text)
    pattern = re.compile(r'Full story at|Full story a|Full story|Full stor|Full sto|Full st|Full s')
    text = pattern.sub(r' ', text)

    return text
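
As a quick check of the first substitution in `url`, the sketch below applies it to the truncated tweet quoted in the comment. It uses only the standard `re` module; the variable names are illustrative.

```python
import re

tweet = ("Experts in France begin examining airplane debris found on Reunion "
         "Island: French air accident experts o... http://t.co/YVVPznZmXg #news")

# Remove the cut-off fragment ("o...") together with the shortened link
stripped = re.sub(r' \w{1,3}\.{3} http\S*', ' ', tweet)
# Collapse the double space left behind by the substitution
stripped = re.sub(r'\s+', ' ', stripped)

print(stripped)
```

The hashtag after the link survives, while the meaningless one-letter fragment and the URL are gone.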

In [13]:
def clean(text):
    text = new_line(text)
    # replace the HTML entities &amp;, &gt; and &lt; with a space
    text = re.sub(r'(&amp;|&gt;|&lt;)', " ", text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = url(text)

    # @mentions are replaced by the token 'USER';
    # https://www.kaggle.com/quentinsarrazin/tweets-preprocessing uses a similar replacement,
    # and https://arxiv.org/ftp/arxiv/papers/1807/1807.07752.pdf uses 'USER_NAME'
    text = re.sub(r'@\S*', ' USER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    # collapse repeated USER USER USER ... into a single USER
    text = re.sub(r'\b(USER)( \1\b)+', r'\1', text)

    # runs of a repeated letter, as in 'Oooooohhh', are shortened to 2 letters; shortening
    # to 1 letter could change the meaning, e.g. 'good' would become 'god'
    text = re.sub(r'([a-zA-Z])\1+', r'\1\1', text)

    # remove any URLs not already removed by url()
    text = re.sub(r"htt\S*", " ", text)

    # replace every character outside [a-zA-Z\d\s] with a space
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)

    # tokens that start with a digit are replaced by 'NUMBER';
    # https://www.kaggle.com/quentinsarrazin/tweets-preprocessing uses a similar replacement
    text = re.sub(r'^\d\S*| \d\S*| \d\S*$', ' NUMBER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    # collapse repeated NUMBER NUMBER ... into a single NUMBER
    text = re.sub(r'\b(NUMBER)( \1\b)+', r'\1', text)

    # remove any digits that survived the NUMBER replacement
    text = re.sub(r"[0-9]", " ", text)

    text = text.strip() # remove spaces at the beginning and at the end of the string
    # drop a trailing ' via USER' so that more tweets collapse to the same string
    text = re.sub(r' via\s+USER$', ' ', text)

    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = text.strip() # remove spaces at the beginning and at the end of the string

    return text
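
The following sketch isolates three of the substitutions used in `clean` and applies them to short made-up strings, to show what each one does on its own. The helper names are hypothetical, and the patterns closely follow (but slightly simplify) the ones in the function.

```python
import re

def replace_mentions(text):
    """Replace @mentions with 'USER' and collapse consecutive USER tokens."""
    text = re.sub(r'@\S*', ' USER ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\b(USER)( \1\b)+', r'\1', text)
    return text.strip()

def squeeze_letters(text):
    """Shorten runs of the same letter to at most two."""
    return re.sub(r'([a-zA-Z])\1+', r'\1\1', text)

def replace_numbers(text):
    """Replace tokens that start with a digit with 'NUMBER'."""
    text = re.sub(r'^\d\S*| \d\S*', ' NUMBER ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(replace_mentions("@alice @bob stay safe"))  # USER stay safe
print(squeeze_letters("sooooo goood"))            # soo good
print(replace_numbers("13000 people evacuated"))  # NUMBER people evacuated
```

Note that `squeeze_letters` leaves 'good' untouched, which is the reason for keeping two letters instead of one.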

In [14]:
df_train.cleaned_text = df_train.cleaned_text.apply(clean)
df_test.cleaned_text = df_test.cleaned_text.apply(clean)

In [15]:
max_length_tr = df_train.cleaned_text.map(len).max()
max_length_te = df_test.cleaned_text.map(len).max()
max_length = max(max_length_tr, max_length_te)

print("At the stage of text processing:")
print(f"...the size of longest text string in train set is  {max_length_tr}")
print(f"...the size of longest text string in test set is  {max_length_te}")

At the stage of text processing:
...the size of longest text string in train set is  171
...the size of longest text string in test set is  161


In [16]:
# The new maximum length is (max_len - delta); strings longer than that are
# truncated to the new maximum.
def cut(max_len, delta, x):
    new_max = max_len - delta
    # slicing leaves shorter strings unchanged and truncates longer ones
    return x[:new_max]
    

delta = 25 
df_train.cleaned_text = df_train.cleaned_text.map(lambda x: cut(max_length, delta, x))
df_test.cleaned_text = df_test.cleaned_text.map(lambda x: cut(max_length, delta, x))

new_max_length_tr = df_train.cleaned_text.map(len).max()
new_max_length_te = df_test.cleaned_text.map(len).max()

print("After we cut tails of the longest tweets:")
print(f"...the size of longest text string in train set is  {new_max_length_tr}")
print(f"...the size of longest text string in test set is  {new_max_length_te}")

After we cut tails of the longest tweets:
...the size of longest text string in train set is  146
...the size of longest text string in test set is  146


# 3-BERT Approach

BERT (Bidirectional Encoder Representations from Transformers) is an open-source deep learning model developed by Google. It is widely used in research and industry to solve many NLP tasks.

ktrain (https://github.com/amaiya/ktrain) is a lightweight wrapper for the TensorFlow Keras deep learning library (https://www.tensorflow.org/guide/keras/sequential_model) that helps build, train and deploy neural networks and other ML models. It is designed to make deep learning and AI more accessible and easier to apply.

ktrain supports many pre-trained deep learning architectures in the NLP domain, and BERT is one of them. To solve this problem, we use the pre-trained BERT implementation provided by ktrain and fine-tune it to classify whether disaster tweets are real or not.

We are ONLY interested in the text and target columns, which we use to classify our tweets.

## Importing the libraries for training on the training CSV (train.csv)

In [17]:
import tensorflow as tf
print(tf.__version__)

import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split

2.2.0
using Keras version: 2.3.0-tf



## Getting the predictor variable

In [18]:
# Our train.csv is in the DataFrame 'df_train'
random_seed = 12342
x_train, x_val, y_train, y_val = train_test_split(
    df_train['cleaned_text'], df_train['target'],
    shuffle=True, test_size=0.2, random_state=random_seed,
    stratify=df_train['target'])
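
To illustrate what `stratify` contributes, here is a minimal standard-library sketch of a stratified split on toy labels. It is a simplified stand-in and not how scikit-learn implements `train_test_split`; the function `stratified_split` and the 60/40 toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels: 60 examples of class 0 and 40 of class 1
toy_labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(toy_labels)

val_counts = Counter(toy_labels[i] for i in val_idx)
print(val_counts[0], val_counts[1])  # 12 8
```

The validation part keeps the 60/40 class ratio of the full set, which makes the validation metrics comparable to the class balance seen during training.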

In [19]:
(x_train_bert, y_train_bert), (x_val_bert, y_val_bert), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_val, y_test=y_val,
    class_names=["0", "1"],
    preprocess_mode='bert',
    lang='en',
    maxlen=65,
    max_features=35000)

preprocessing train...
language: en


preprocessing test...
language: en


In [20]:
model = text.text_classifier('bert', train_data=(x_train_bert, y_train_bert), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train_bert, y_train_bert), val_data=(x_val_bert, y_val_bert), batch_size=16)

Is Multi-Label? False
maxlen is 65
done.


In [21]:
#learner.lr_find()    # Simulate a training run to find the best LR.

In [22]:
# To view the LR plot:
#learner.lr_plot()

In [23]:
learner.autofit(1e-5)

early_stopping automatically enabled at patience=5
reduce_on_plateau automatically enabled at patience=2


begin training using triangular learning rate policy with max lr of 1e-05...
Train on 6090 samples, validate on 1523 samples
Epoch 1/1024
Epoch 2/1024
Epoch 3/1024
Epoch 4/1024
Epoch 00004: Reducing Max LR on Plateau: new max lr will be 5e-06 (if not early_stopping).
Epoch 5/1024
Epoch 6/1024
Epoch 00006: Reducing Max LR on Plateau: new max lr will be 2.5e-06 (if not early_stopping).
Epoch 7/1024
Epoch 00007: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f9ac0367f10>
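
The training log above can be read as the interplay of two patience counters. The sketch below is a simplified simulation and not ktrain's implementation: `simulate_callbacks` is a hypothetical helper, the validation losses are made up, and the halving factor is inferred from the learning rates printed in the log. With a loss sequence whose best value falls at epoch 2, it produces the same sequence of events as the log.

```python
def simulate_callbacks(val_losses, max_lr=1e-5, reduce_patience=2,
                       stop_patience=5, factor=0.5):
    """Return (epoch, event, max_lr) tuples for a sequence of validation losses."""
    best = float("inf")
    since_best = 0      # epochs without improvement, used for early stopping
    since_reduce = 0    # epochs without improvement since the last LR reduction
    events = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_reduce = 0
            continue
        since_best += 1
        since_reduce += 1
        if since_best >= stop_patience:
            events.append((epoch, "early stopping", max_lr))
            break
        if since_reduce >= reduce_patience:
            max_lr *= factor
            since_reduce = 0
            events.append((epoch, "reduce max lr", max_lr))
    return events

# Made-up validation losses with the best value at epoch 2
hypothetical_losses = [0.45, 0.40, 0.42, 0.43, 0.44, 0.46, 0.47]
events = simulate_callbacks(hypothetical_losses)
for event in events:
    print(event)
```

The simulation reduces the maximum learning rate at epochs 4 and 6 and stops at epoch 7, matching the order of messages in the log.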

In [24]:
learner.validate(val_data=(x_val_bert, y_val_bert), class_names=['No Disaster', 'Disaster'])

              precision    recall  f1-score   support

 No Disaster       0.86      0.88      0.87       869
    Disaster       0.83      0.80      0.82       654

    accuracy                           0.85      1523
   macro avg       0.84      0.84      0.84      1523
weighted avg       0.84      0.85      0.84      1523



array([[762, 107],
       [129, 525]])
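
The figures in the report can be recomputed from the confusion matrix, reading rows as true classes and columns as predicted classes (the row sums 869 and 654 match the support column). A short sketch, with illustrative variable names:

```python
# Confusion matrix from the validation output: rows = true class, columns = predicted class
cm = [[762, 107],
      [129, 525]]
class_names = ["No Disaster", "Disaster"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for k, name in enumerate(class_names):
    true_pos = cm[k][k]
    predicted = cm[0][k] + cm[1][k]   # column sum: everything predicted as class k
    actual = sum(cm[k])               # row sum: everything that truly is class k
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)
print(round(accuracy, 2))  # 0.85
```

For example, 107 of the 869 non-disaster tweets were predicted as disasters, and 129 of the 654 disaster tweets were missed, which is why recall for the Disaster class (0.80) is the weakest figure in the report.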

In [25]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [26]:
learner.model.save_weights("model-bert-more-cleaning.h5")
print("Saved model to disk")

Saved model to disk


## Predicting on the test CSV

In [None]:
df_test["target"] = predictor.predict(df_test["cleaned_text"].tolist())
df_test = df_test[["id", "target"]]
df_test.to_csv("submission_bert_more_cleaned.csv", index=False)
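
Before uploading, it can be useful to check that the file has the same layout as sample_submission.csv (an `id` column and a `target` column with 0/1 values). The following is a minimal sketch using only the standard library; `validate_submission` is a hypothetical helper, and the in-memory CSV with made-up targets stands in for the real file.

```python
import csv
import io

def validate_submission(csv_file):
    """Return the number of rows if the header is id,target and every target is 0 or 1."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"invalid target for id {row['id']}: {row['target']!r}")
        n_rows += 1
    return n_rows

# Stand-in for open("submission_bert_more_cleaned.csv", newline="")
sample = io.StringIO("id,target\n0,1\n2,1\n3,1\n")
print(validate_submission(sample))  # 3
```

For the real submission the returned count should be 3263, the number of rows in both test.csv and sample_submission.csv.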