# 1-Introduction

Based on: https://medium.com/analytics-vidhya/finetuning-bert-using-ktrain-for-disaster-tweets-classification-18f64a50910b

## Objectives
We take part in the Kaggle competition 'Real or Not', which uses the tweet data that Kaggle provides in 2 CSV files. The task is to separate the tweets that talk about natural disasters from those that do NOT (the latter usually use the same words "metaphorically").

Dataset and competition link: https://www.kaggle.com/c/nlp-getting-started

As stated above, we must identify and classify whether each tweet talks about a catastrophe or not. The 'train' dataset has a 'target' column that labels each tweet as a real disaster (1) or not (0). This is a hard task because of the ambiguity in the linguistic structure of tweets: it is not always clear whether a person's words are actually announcing a disaster. For example, suppose a person tweets:
“On the plus side look at the sky last night, it was ablaze”
Here 'ablaze' does not mean that the sky is literally on fire; it is a metaphor indicating that the sky is orange. This is easy for us to understand, but not for machines.
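
To make the ambiguity concrete, here is a minimal sketch of a naive keyword matcher. The word list and the function name `naive_keyword_flag` are hypothetical and chosen only for illustration; they are not part of the competition or of the approach used later in this notebook.

```python
# Hypothetical keyword list, chosen only for illustration.
DISASTER_WORDS = {"ablaze", "fire", "earthquake", "flood"}

def naive_keyword_flag(tweet: str) -> int:
    """Return 1 if any disaster word appears in the tweet, else 0."""
    tokens = {word.strip(".,!?#\"'").lower() for word in tweet.split()}
    return int(bool(tokens & DISASTER_WORDS))

metaphor = "On the plus side look at the sky last night, it was ablaze"
literal = "Forest fire near La Ronge Sask. Canada"

# Both tweets are flagged, although only the second one reports a disaster.
print(naive_keyword_flag(metaphor), naive_keyword_flag(literal))
```

A matcher like this cannot tell the metaphor from the real report, which is the reason for using a context-aware model.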





## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

# 2-Data Preparation

## Loading the local CSV data downloaded from Kaggle

In [2]:
df_train = pd.read_csv('Dataset/train.csv')
df_test = pd.read_csv('Dataset/test.csv')
df_Sample_Subm = pd.read_csv('Dataset/sample_submission.csv')

## Minimal data exploration (the full exploration was done in TP1)


The columns of the dataset are:
 - id: unique identifier of each tweet
 - keyword: a particular keyword from the tweet (may be NaN)
 - location: the place the tweet was sent from (may be NaN)
 - text: the text of the tweet
 - target: 1 if the tweet is about a real disaster, 0 otherwise (only in train.csv).
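
Since 'keyword' and 'location' may be NaN, a quick check of the share of missing values and of the class balance is often useful. The sketch below runs on a toy frame with the same column names; all of its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with the same columns as train.csv; every value is made up.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [np.nan, "wildfire", np.nan],
    "location": [np.nan, np.nan, "Canada"],
    "text": ["first toy tweet", "second toy tweet", "third toy tweet"],
    "target": [1, 1, 0],
})

# Share of missing values per optional column
missing_share = toy[["keyword", "location"]].isna().mean()
# Relative frequency of each class in 'target'
class_balance = toy["target"].value_counts(normalize=True)

print(missing_share)
print(class_balance)
```

On the real data the same two lines would be applied to `df_train`.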

In [3]:
print (df_train.shape, df_test.shape, df_Sample_Subm.shape) 

(7613, 5) (3263, 4) (3263, 2)


In [4]:
df_train.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [5]:
df_test.head(10)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
5,12,,,We're shaking...It's an earthquake
6,21,,,They'd probably still show more life than Arse...
7,22,,,Hey! How are you?
8,27,,,What a nice hat?
9,29,,,Fuck off!


In [6]:
df_Sample_Subm.head(10)

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,0
3,9,0
4,11,0
5,12,0
6,21,0
7,22,0
8,27,0
9,29,0


In [7]:
# Data cleaning

In [None]:
import re

In [None]:
def new_line(text):
    text = re.sub(r'\t', ' ', text) # remove tabs
    text = re.sub(r'\n', ' ', text) # remove line jump
    return text

In [None]:
def url(text):
# Many tweets are truncated, e.g. "Experts in France
# begin examining airplane debris found on Reunion Island: French air
# accident experts o... http://t.co/YVVPznZmXg #news"; the cut-off word and the URL are removed
    text = re.sub(r' \w{1,3}\.{3,3} http\S{0,}', ' ', text)
    text = re.sub(r' \w{1,3}Û_ http\S{0,}', ' ', text)
# some symbols and words right before 'http' are removed; they are assumed to carry no
# semantic meaning or predictive power in that position
    text = re.sub(r"mp3 http\S{0,}", r" ", text)
    text = re.sub(r"rar http\S{0,}", r" ", text)
    pattern = re.compile(r'( pin\:\d+ | via )http\S{0,}')
    text = pattern.sub(r' ', text)
# this pattern carries little meaning in a tweet, and removing the words
# unifies the structure of the strings
    pattern = re.compile(r'Full read by|Full read b|Full read|Full rea|Full re|Full r')
    text = pattern.sub(r' ', text)
    pattern = re.compile(r'Full story at|Full story a|Full story|Full stor|Full sto|Full st|Full s')
    text = pattern.sub(r' ', text)
    
    return text

In [None]:
def clean(text):    
    text = new_line(text)
# replace the HTML entities &amp;, &gt; and &lt; with a space
    text = re.sub(r'(&amp;|&gt;|&lt;)', " ", text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = url(text)
    
# mentions (@name) are replaced by the token 'USER'
# a similar replacement is used in https://www.kaggle.com/quentinsarrazin/tweets-preprocessing
# and in https://arxiv.org/ftp/arxiv/papers/1807/1807.07752.pdf (there as 'USER_NAME')
    text = re.sub(r'@\S{0,}', ' USER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces  
# shrink multiple USER USER USER ... to USER
    text = re.sub(r'\b(USER)( \1\b)+', r'\1', text)
    
# runs of a repeated letter, as in 'Oooooohhh', are truncated to 2 letters; truncating
# to 1 letter is not possible because it could create a false meaning, e.g. 'good' -> 'god'
    text = re.sub(r'([a-zA-Z])\1{1,}', r'\1\1', text)
    
# URLs not already removed by the url function are removed here
    text = re.sub(r"htt\S{0,}", " ", text)
    
# remove all characters if not in the list [a-zA-Z\d\s]
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)
    
# the digit(s) pattern is 'translated' to 'NUMBER'
# in https://www.kaggle.com/quentinsarrazin/tweets-preprocessing similar 'translation' is used
    text = re.sub(r'^\d\S{0,}| \d\S{0,}| \d\S{0,}$', ' NUMBER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces 
# shrink multiple NUMBER NUMBER  ... to NUMBER
    text = re.sub(r'\b(NUMBER)( \1\b)+', r'\1', text)
    
# remove digits if not eliminated above in 'NUMBER translation'
    text = re.sub(r"[0-9]", " ", text)
    
    text = text.strip() # remove spaces at the beginning and at the end of string    
# to reveal more equivalence classes the ' via USER' at the end of string is eliminated
    text = re.sub(r' via\s{1,}USER$', ' ', text)
    
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = text.strip() # remove spaces at the beginning and at the end of string
    
    return text
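
To see what a few of these substitutions do, the sketch below applies four of the regular expressions from `clean`, in the same relative order, to a made-up tweet. The helper name `demo_steps` and the sample string are hypothetical; the full `clean` function performs more steps than shown here.

```python
import re

def demo_steps(text: str) -> str:
    """Apply a subset of the cleaning regexes to a single string."""
    text = re.sub(r"@\S{0,}", " USER ", text)           # mentions -> USER
    text = re.sub(r"([a-zA-Z])\1{1,}", r"\1\1", text)  # letter runs cut to 2 (case-sensitive)
    text = re.sub(r"htt\S{0,}", " ", text)             # drop URLs
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)         # drop non-alphanumeric characters
    text = re.sub(r"\s+", " ", text)                   # collapse whitespace
    return text.strip()

sample = "@bob Oooooohhh nooo!!! http://t.co/abc123"
print(demo_steps(sample))  # USER Ooohh noo
```

The capital 'O' survives next to the two lowercase letters because the back-reference is case-sensitive, so 'O' and 'o' count as different letters.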

# 3-BERT Approach

BERT (Bidirectional Encoder Representations from Transformers) is an open-source deep learning model developed by Google. Many researchers and companies use it to solve a wide range of NLP tasks.

Ktrain (https://github.com/amaiya/ktrain) is a lightweight wrapper for the TensorFlow Keras deep learning library (https://www.tensorflow.org/guide/keras/sequential_model) that helps build, train, and deploy ANNs and other ML models. It is designed to make deep learning and AI more accessible and easier to apply.

Ktrain supports many pre-trained deep learning architectures in the NLP domain, and BERT is one of them. To solve this problem, we use the pre-trained BERT implementation provided by ktrain and fine-tune it to classify whether the disaster tweets are real or not.

We are ONLY interested in the 'text' and 'target' columns, which we use to classify our tweets.

## Importing the libraries to read the training CSV (train.csv)

In [12]:
import tensorflow as tf
print(tf.__version__)

import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split

2.2.0
using Keras version: 2.3.0-tf



## Obtaining the predictor

In [22]:
# Our train.csv is in the DataFrame 'df_train'
random_seed = 12342
x_train, x_val, y_train, y_val = train_test_split(df_train['text'], df_train['target'], shuffle=True, test_size = 0.2, random_state=random_seed, stratify=df_train['target'])
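
The `stratify` argument keeps the class proportions the same in the training and validation parts. A minimal sketch on made-up labels (60 negatives, 40 positives) illustrates this; the toy texts and labels are assumptions for illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 60 negatives and 40 positives
labels = [0] * 60 + [1] * 40
texts = [f"tweet {i}" for i in range(100)]

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, shuffle=True,
    random_state=12342, stratify=labels)

# Both parts keep the 60/40 ratio of the full toy set
print(Counter(y_tr), Counter(y_va))
```

Without `stratify`, the validation part could end up with a noticeably different share of disaster tweets than the training part.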

In [23]:
(x_train_bert, y_train_bert), (x_val_bert, y_val_bert), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_val, y_test=y_val,
    class_names=["0", "1"],
    preprocess_mode='bert',
    lang='en',
    maxlen=65,
    max_features=35000)

preprocessing train...
language: en


preprocessing test...
language: en


In [24]:
model = text.text_classifier('bert', train_data=(x_train_bert, y_train_bert), preproc=preproc)

Is Multi-Label? False
maxlen is 65
done.


In [25]:
model.load_weights("model-bert-best-score.h5")

In [27]:
predictor = ktrain.get_predictor(model, preproc)

In [29]:
test_df = pd.read_csv("Dataset/test.csv")
test_df["target"] = predictor.predict(test_df["text"].tolist())
test_df = test_df[["id", "target"]]
test_df.to_csv("submission_bert_load_model.csv", index=False)

The predictor is obtained by passing the model and the 'preproc' object to the 'get_predictor' method. This 'predictor' can then be used to make predictions directly on our TEST data.

In [20]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [21]:
learner.print_layers()

0 (trainable=True) : <tensorflow.python.keras.engine.input_layer.InputLayer object at 0x7fc437160390>
1 (trainable=True) : <tensorflow.python.keras.engine.input_layer.InputLayer object at 0x7fc49403e090>
2 (trainable=True) : <keras_bert.layers.embedding.TokenEmbedding object at 0x7fc494153d10>
3 (trainable=True) : <tensorflow.python.keras.layers.embeddings.Embedding object at 0x7fc437160710>
4 (trainable=True) : <tensorflow.python.keras.layers.merge.Add object at 0x7fc4368384d0>
5 (trainable=True) : <keras_pos_embd.pos_embd.PositionEmbedding object at 0x7fc436838a50>
6 (trainable=True) : <tensorflow.python.keras.layers.core.Dropout object at 0x7fc436879cd0>
7 (trainable=True) : <keras_layer_normalization.layer_normalization.LayerNormalization object at 0x7fc4366d1fd0>
8 (trainable=True) : <keras_multi_head.multi_head_attention.MultiHeadAttention object at 0x7fc43676c9d0>
9 (trainable=True) : <tensorflow.python.keras.layers.core.Dropout object at 0x7fc4370d0890>
10 (trainable=True) : <t

In [27]:
learner.model.save_weights("model-bert.h5")
print("Saved model to disk")

Saved model to disk


## Predicting on the TEST CSV

In [22]:
test_df = pd.read_csv("Dataset/test.csv")
test_df["target"] = predictor.predict(test_df["text"].tolist())
test_df = test_df[["id", "target"]]
test_df.to_csv("submission_bert_cleaned.csv", index=False)

In [23]:
#Keywords

In [24]:
df_train = pd.read_csv("Dataset/train.csv")
train_df_copy = df_train
train_df_copy = train_df_copy.fillna('None')
ag = train_df_copy.groupby('keyword').agg({'text':np.size, 'target':np.mean}).rename(columns={'text':'Count', 'target':'Disaster Probability'})

ag.sort_values('Disaster Probability', ascending=False).head(20)

keyword,Count,Disaster Probability
wreckage,39,1.0
debris,37,1.0
derailment,39,1.0
outbreak,40,0.975
oil%20spill,38,0.973684
typhoon,38,0.973684
suicide%20bombing,33,0.969697
suicide%20bomber,31,0.967742
bombing,29,0.931034
suicide%20bomb,35,0.914286


In [25]:
count = 2
prob_disaster = 0.9
keyword_list_disaster = list(ag[(ag['Count']>count) & (ag['Disaster Probability']>=prob_disaster)].index)
#we print the list of keywords which will be used for prediction correction 
keyword_list_disaster

['bombing',
 'debris',
 'derailment',
 'nuclear%20disaster',
 'oil%20spill',
 'outbreak',
 'rescuers',
 'suicide%20bomb',
 'suicide%20bomber',
 'suicide%20bombing',
 'typhoon',
 'wreckage']
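
The idea behind this list is to override the model's prediction whenever a test tweet carries a keyword that was almost always a disaster in training. A minimal sketch on toy data follows; both frames are made up for illustration, and the column names `Count` and `Prob` are assumptions that mirror the aggregation above.

```python
import pandas as pd

# Toy training data (made up): keyword and label per tweet
train = pd.DataFrame({
    "keyword": ["debris", "debris", "debris", "ablaze", "ablaze", "ablaze"],
    "target": [1, 1, 1, 1, 0, 0],
})

# Per-keyword count and share of disaster tweets
stats = train.groupby("keyword")["target"].agg(Count="size", Prob="mean")
strong = stats[(stats["Count"] > 2) & (stats["Prob"] >= 0.9)].index.tolist()

# Toy test predictions (made up); the keyword column must still be present
test = pd.DataFrame({
    "id": [10, 11, 12],
    "keyword": ["debris", "ablaze", None],
    "target": [0, 0, 1],
})

# .loc assigns in place and avoids pandas' chained-assignment pitfall
test.loc[test["keyword"].isin(strong), "target"] = 1
print(test)
```

Only the 'debris' row is corrected here, because 'ablaze' falls below the 0.9 threshold.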

In [26]:
# test_df only keeps 'id' and 'target' at this point, so the keywords are
# looked up in df_test, which was loaded from the same Dataset/test.csv
ids_disaster = df_test.loc[df_test['keyword'].isin(keyword_list_disaster), 'id'].values
test_df.loc[test_df['id'].isin(ids_disaster), 'target'] = 1

In [None]:
test_df = test_df[["id", "target"]]

In [None]:
test_df.head()

In [None]:
#test_df.to_csv("submission_bert_cleaned.csv", index=False)

## Uploading the TEST predictions to Kaggle

As a last step, we upload our predictions to Kaggle and check the SCORE obtained.

TO CHECK.... We achieved an accuracy of ......83.4% on the test set.
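
As far as we know, the Kaggle leaderboard for this competition reports F1 rather than plain accuracy, so it can help to compute both on a held-out split before submitting. The sketch below uses made-up labels and a hypothetical helper `accuracy_and_f1`; it does not reproduce the score quoted above.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

The two numbers can differ considerably when the classes are imbalanced, which is why it matters which one a reported score refers to.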

# 4-Conclusions

We used the features of Ktrain to implement the complex BERT model in a simple way. In the end we were able to reach an accuracy on TEST of .........

One of the biggest problems with BERT is that it takes a long time to train. To improve this, we could use a lighter version of BERT such as DistilBERT. Also, to reduce training time, the weights of all layers except the final one can be frozen.