<a href="https://colab.research.google.com/github/DiegoBores/Clickbait-detector/blob/main/DetectorClickbait.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [2]:
!pip install tensorflow



In [3]:
!pip install -q tf-models-official==2.7.0

In [4]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import pandas as pd
from official.nlp import optimization  # provides the AdamW optimizer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score


# Dataset

In [5]:
# URLs of the training and test datasets
TRAIN_DATA_URL = "https://ml-coding-test.s3.eu-west-1.amazonaws.com/webis_train.csv"  # training set URL
TEST_DATA_URL = "https://ml-coding-test.s3.eu-west-1.amazonaws.com/webis_test.csv"  # test set URL

# Download the datasets into ./datasets
train_file_path = tf.keras.utils.get_file("webis_train.csv", TRAIN_DATA_URL,
                                          cache_dir='.', cache_subdir='datasets')
test_file_path = tf.keras.utils.get_file('webis_test.csv', TEST_DATA_URL,
                                         cache_dir='.', cache_subdir='datasets')

In [6]:
# Read the CSV files into two pandas DataFrames
train_df = pd.read_csv(train_file_path) 
test_df = pd.read_csv(test_file_path)

In [7]:
# Show a random sample of the training dataset
train_df.sample(5)

Unnamed: 0.1,Unnamed: 0,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription,truthJudgments,truthMean,truthClass,truthMedian,truthMode
4399,4399,[],Woman pulled alive from the River Thames besid...,844599361149571072,['The woman receives treatment after being pul...,['A woman has been pulled alive from the River...,Woman pulled alive from River Thames after Wes...,Wed Mar 22 17:19:14 +0000 2017,"Parliament shooting,Terrorism,Westminster,Stan...",A woman has been pulled alive from the River T...,"[1.0, 0.0, 0.0, 0.0, 0.0]",0.2,no-clickbait,0.0,0.0
5288,5288,['media/photo_820407651452338176.jpg'],Brain cancer survivor running 7 marathons on 7...,820407654275223552,[],"['January 14, 2017, 2:31 PM| Some people spend...",Brain cancer survivor running 7 marathons on 7...,Sat Jan 14 23:10:01 +0000 2017,Brain cancer survivor running 7 marathons on 7...,Some people spend months training for one mara...,"[1.0, 0.0, 0.0, 0.0, 0.0]",0.2,no-clickbait,0.0,0.0
18488,18488,[],#Bengaluru #police is being #accused of refusi...,832658468708323328,"['Flame', 'Police']",['In a shocking case of indifference and \xa0r...,Five-Year-Old Bengaluru Girl Burnt Alive After...,Fri Feb 17 18:30:23 +0000 2017,"Bengaluru Police, Child Rape, KSCPCR, Bannergh...",Five Year-Old Bengaluru Girl Burnt Alive After...,"[0.0, 0.0, 0.0, 0.0, 0.33333333330000003]",0.066667,no-clickbait,0.0,0.0
31,31,['media/photo_857416781736157184.jpg'],Congresswomen meet to discuss missing women of...,857416784953188352,['PHOTO: Exterior view of the U.S. Capitol bui...,"[""Following last month's spike in social media...",Congresswomen meet to discuss missing women of...,Thu Apr 27 02:11:06 +0000 2017,"congresswomen, D.C., capitol hill, missing gir...",Congresswomen and law enforcement representati...,"[1.0, 0.0, 0.0, 0.0, 0.0]",0.2,no-clickbait,0.0,0.0
8967,8967,['media/photo_827573620830507012.jpg'],Some patients are testing psychedelic drug the...,827573622982197248,['Some doctors and patients are testing psyche...,"['SAN FRANCISCO — In the 1950s and 60s, psyche...",Some patients try psychedelic drug therapy for...,Fri Feb 03 17:45:01 +0000 2017,"psilobycin, Dr. Charles Grob, Johns Hopkins, m...",Some doctors are testing the use of drugs like...,"[0.33333333330000003, 0.6666666666000001, 0.33...",0.266667,no-clickbait,0.333333,0.333333


Description of the training dataset (fields to be used)

We check whether the fields we are interested in contain null values

In [8]:
train_df.isnull().any()

Unnamed: 0           False
postMedia            False
postText              True
id                   False
targetCaptions       False
targetParagraphs     False
targetTitle          False
postTimestamp        False
targetKeywords        True
targetDescription     True
truthJudgments       False
truthMean            False
truthClass           False
truthMedian          False
truthMode            False
dtype: bool

In [9]:
test_df.isnull().any()

Unnamed: 0           False
id                   False
postMedia            False
targetCaptions       False
postText              True
postTimestamp        False
targetTitle          False
targetDescription     True
targetKeywords        True
targetParagraphs     False
truthJudgments       False
truthClass           False
truthMedian          False
truthMode            False
truthMean            False
dtype: bool

Since the text column we are interested in (`postText`) contains null values, we drop those rows so that they do not cause problems when training the model



In [10]:
train_df = train_df.dropna(how='any',subset=['postText', 'truthClass'], axis=0)
test_df = test_df.dropna(how='any',subset=['postText', 'truthClass'], axis=0)

In [11]:
# Map 'truthClass' to 1 (clickbait) or 0 (no-clickbait) and rename the column to 'target'
# Training set
train_df['truthClass'] = train_df['truthClass'].apply(lambda x: 1 if x=='clickbait' else 0 )
train_df.rename(columns={'truthClass':'target'}, inplace=True)
# Test set
test_df['truthClass'] = test_df['truthClass'].apply(lambda x: 1 if x=='clickbait' else 0 )
test_df.rename(columns={'truthClass':'target'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [12]:
# Count how many training examples are clickbait and how many are not
train_df['target'].value_counts()

0    14768
1     4716
Name: target, dtype: int64

About 24% of the examples are clickbait (4,716 of 19,484). This class imbalance could bias the model toward the majority class during training. To avoid this, we balance the data by discarding 'no-clickbait' examples until both classes have the same size.
In addition, given our computational and time constraints, the smaller dataset makes fine-tuning faster.
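
As a quick check of the figures involved, the following sketch recomputes the class share and the effect of downsampling from the counts printed by `value_counts()` above. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
n_no_clickbait = 14768
n_clickbait = 4716
total = n_no_clickbait + n_clickbait

# Fraction of training examples labelled as clickbait.
clickbait_share = n_clickbait / total

# Rows that must be dropped from the majority class so both classes match.
rows_to_drop = n_no_clickbait - n_clickbait
balanced_total = 2 * n_clickbait

print(f"clickbait share: {clickbait_share:.1%}")
print(f"rows dropped by downsampling: {rows_to_drop}")
print(f"balanced training set size: {balanced_total}")
```

Downsampling therefore discards roughly half of the available training rows, which is the price paid for a balanced and faster fine-tuning run.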

In [13]:
df_clickbait = train_df[train_df['target']==1]
df_noclickbait = train_df[train_df['target']==0]

df_noclickbait_downsampled=df_noclickbait.sample(df_clickbait.shape[0])

train_df = pd.concat([df_noclickbait_downsampled, df_clickbait])
train_df=train_df.sample(frac=1)

In [14]:
train_df['target'].value_counts()

1    4716
0    4716
Name: target, dtype: int64

In [15]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription,truthJudgments,truthMean,target,truthMedian,truthMode
10935,10935,['media/photo_826028891798196228.jpg'],The world's 20 best beaches,826056122708680704,"['Seychelles', ""A broad stretch of sand is the...","['30 Jan 2017', 'A broad stretch of sand is th...",Beach holidays The world's 20 greatest beaches,Mon Jan 30 13:15:01 +0000 2017,"travel,beaches,Family holidays",,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.0,1,1.0,1.0
5526,5526,['media/photo_813617413665255424.jpg'],Biggest celebrity feuds of 2016,813617415561052160,['<p>Tempers were flying high on the set of th...,"['More', 'Tempers were flying high on the set ...",Biggest Celebrity Feuds of 2016,Tue Dec 27 05:28:02 +0000 2016,,With all the horrible things going on in the w...,"[0.6666666666000001, 0.6666666666000001, 1.0, ...",0.866667,1,1.0,1.0
8619,8619,['media/photo_836208042022596609.jpg'],#Tennis Australia 🎾 to set up help line for al...,836208098167529472,"['BXJ mother', 'Tennis organisation defends ho...",['Tennis Australia is hoping to set up a Crime...,Tennis Australia to set up help line for alleg...,Mon Feb 27 13:35:21 +0000 2017,"tennis australia, noel callaghan, sexual abuse...",Tennis Australia is hoping to set up a Crimest...,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.0,0,0.0,0.0
16085,16085,[],The Final Four is set.\n\nWho ya got?,846346154115825664,[],"[""The Final Four of the 2017 NCAA tournament i...",Vote: Four burning questions heading into Fina...,Mon Mar 27 13:00:22 +0000 2017,"gonzaga bulldogs, gonzaga, zags, final four, m...",The Final Four is set. Here are four important...,"[0.33333333330000003, 1.0, 0.6666666666000001,...",0.733333,1,0.666667,0.666667
7573,7573,['media/photo_811317767441879041.jpg'],"Bring Me The Horizon's @olobersyko created a ""...",811317770042408960,['Joseph Okpako/WireImage Oliver Sykes of Brin...,['Bring Me The Horizon frontman\xa0Oli Sykes i...,Bring Me The Horizon's Oli Sykes Created a 'St...,Tue Dec 20 21:10:04 +0000 2016,,Bring Me The Horizon frontman Oli Sykes releas...,"[0.0, 0.33333333330000003, 0.33333333330000003...",0.2,0,0.333333,0.333333


We split the dataset into training and validation sets in an 80/20 ratio

In [16]:
X_train, X_val, y_train, y_val = train_test_split(train_df['postText'],train_df['target'], train_size=0.8)

From the test set we extract the columns we are interested in, in the same format as the training ones:

In [17]:
X_test = test_df['postText']
y_test = test_df['target']

# Models used:

We will use three models:


*   A simple BERT model, called Small BERT
*   ELECTRA base
*   A simple baseline to compare the results against




> # Small BERT

It is a model based on the original BERT, but with fewer and/or smaller Transformer blocks.
We use the Small BERT model with the configuration set in the following cell.




In [None]:
# TF Hub handles (URLs); build_classifier_model wraps them in hub.KerasLayer itself
smallBERT_preprocess_handle = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
smallBERT_encoder_handle = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1'


> > > # Preprocessing model


Text inputs must be transformed into numeric token ids and arranged in tensors before being fed to BERT. TensorFlow Hub provides a matching preprocessing model for each of its BERT models, so no Python code outside the TensorFlow model is needed to preprocess the text.

We use the preprocessing model indicated in the documentation of the BERT model.

The preprocessing produces three main outputs, which are the inputs of the BERT model:
input_word_ids, input_mask and input_type_ids
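
To make the three outputs concrete, here is a toy sketch of what they contain. It is an assumption-laden illustration: `TOY_VOCAB` and `toy_preprocess` are hypothetical, the tokenizer is a plain whitespace split rather than the WordPiece tokenizer of the real TF Hub model, and the ids of the special tokens are made up.

```python
# Toy illustration only: made-up vocabulary and whitespace tokenization,
# NOT the WordPiece tokenizer of the real TF Hub preprocessing model.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "you": 4, "will": 5, "not": 6, "believe": 7, "this": 8}


def toy_preprocess(sentence, seq_length=8):
    """Return the three BERT-style inputs for one sentence, padded to seq_length."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    words = sentence.lower().split()[: seq_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    word_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    n_real = len(word_ids)
    padding = seq_length - n_real
    return {
        # Token ids, padded with the [PAD] id.
        "input_word_ids": word_ids + [TOY_VOCAB["[PAD]"]] * padding,
        # 1 for real tokens, 0 for padding positions.
        "input_mask": [1] * n_real + [0] * padding,
        # Segment ids: all 0 because there is a single input sentence.
        "input_type_ids": [0] * seq_length,
    }


example = toy_preprocess("You will not believe this")
for key, value in example.items():
    print(key, value)
```

The real preprocessing model works the same way in spirit but pads to a longer fixed length (128 in the model summary printed later in this notebook).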


> > > # BERT model

BERT models return a map with 3 important keys: pooled_output, sequence_output and encoder_outputs:

pooled_output represents each input sentence as a whole. Its shape is [batch_size, H].
sequence_output represents each input token in context. Its shape is [batch_size, seq_length, H].
encoder_outputs are the intermediate activations of the L Transformer blocks. outputs["encoder_outputs"][i] is a tensor of shape [batch_size, seq_length, H] with the outputs of the i-th Transformer block, for 0<=i<L. The last value of this list equals sequence_output.
For fine-tuning we use the pooled_output array.

> > > # Model definition

The model is a simple one: the preprocessing model, the Small BERT encoder, a Dropout layer and a Dense layer.

The Dropout layer helps prevent the model from overfitting. Its input is the pooled_output of the BERT model.
The Dense layer has a single unit. We use a sigmoid activation function, which maps its output to a value between 0 and 1 that can be read as the probability of clickbait.

In [None]:
# Build the model from the URLs of the preprocessing module and of the BERT encoder module
def build_classifier_model(preprocess_handle, encoder_handle):
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(preprocess_handle, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(encoder_handle, trainable = True, name='encoder')
  outputs = encoder(encoder_inputs)
  net=outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation='sigmoid', name='classifier')(net)
  return tf.keras.Model(text_input, net)
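
The classification head on top of the encoder (Dropout followed by a single sigmoid unit) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the 4-dimensional `pooled` vector, the `weights` and the `bias` are hypothetical values chosen for illustration (the real H is much larger), and `classifier_head` is not a function of the notebook.

```python
import math
import random


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dropout(vector, rate, training, rng):
    """Inverted dropout: zero units with probability `rate` and rescale the rest.

    At inference time (training=False) the input is returned unchanged.
    """
    if not training:
        return list(vector)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def classifier_head(pooled_output, weights, bias, rate=0.1, training=False, rng=None):
    """Dropout, then one dense unit with sigmoid activation."""
    rng = rng or random.Random(0)
    hidden = dropout(pooled_output, rate, training, rng)
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return sigmoid(logit)


pooled = [0.5, -1.0, 2.0, 0.0]   # hypothetical pooled_output
weights = [0.4, 0.3, 0.2, -0.1]  # hypothetical Dense kernel
bias = -0.1                      # hypothetical Dense bias

prob = classifier_head(pooled, weights, bias)
label = int(prob >= 0.5)  # 1 = clickbait, 0 = no-clickbait
print(round(prob, 4), label)
```

Because the output is a single probability, a threshold (0.5 here) is needed to turn it into a class label.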


Let's take a look at the structure of the model



In [None]:
smallBERT_classifier_model = build_classifier_model(smallBERT_preprocess_handle , smallBERT_encoder_handle)

In [None]:
tf.keras.utils.plot_model(smallBERT_classifier_model)

In [None]:
smallBERT_classifier_model.summary()

> > > # Loss function, optimizer and metrics

Since this is a binary classification problem and the model outputs a probability (a single-unit layer), we use the losses.BinaryCrossentropy loss function:



In [None]:
loss = tf.keras.losses.BinaryCrossentropy()
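
For reference, binary cross-entropy averages -(y·log(p) + (1-y)·log(1-p)) over the examples. The following plain-Python sketch shows the computation; the helper name `binary_crossentropy` and the clipping constant `eps` are choices made for this illustration, not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
uninformative = binary_crossentropy([1, 0], [0.5, 0.5])

print(round(confident_right, 4), round(confident_wrong, 4), round(uninformative, 4))
```

Confident correct predictions give a loss close to 0, confident wrong ones are penalized heavily, and a model that always outputs 0.5 scores ln 2.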

Metrics let us monitor the model's performance during training and evaluation. We track BinaryAccuracy (reported under the name 'accuracy'), Precision and Recall.

In [None]:
METRICS = [
           tf.keras.metrics.BinaryAccuracy(name='accuracy'),
           tf.keras.metrics.Precision(name='precision'),
           tf.keras.metrics.Recall(name='recall')
]

For fine-tuning we use the same optimizer BERT was originally trained with: "Adaptive Moments" (Adam). This optimizer minimizes the prediction loss and applies regularization through weight decay (not through moments), a variant known as AdamW.

For the learning rate (init_lr) we use the same schedule as in BERT pre-training: linear decay of a preset initial rate, preceded by a linear warm-up phase over the first 10% of the training steps (num_warmup_steps). According to the BERT paper, the initial learning rate should be smaller for fine-tuning (the best of 5e-5, 3e-5, 2e-5).
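
The shape of such a schedule can be sketched as a small function. This is a simplified illustration of linear warm-up followed by linear decay to zero, not the exact implementation inside `optimization.create_optimizer`; the function name and the step counts in the example are assumptions.

```python
def warmup_linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at `step`: linear warm-up to init_lr, then linear decay to 0."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        # Warm-up phase: grow linearly from 0 to init_lr.
        return init_lr * step / num_warmup_steps
    # Decay phase: shrink linearly from init_lr to 0 at the last step.
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / max(num_train_steps - num_warmup_steps, 1)


# Hypothetical run: 1000 training steps, 10% of them used for warm-up.
num_train_steps = 1000
num_warmup_steps = int(0.1 * num_train_steps)
for step in (0, 50, 100, 550, 1000):
    lr = warmup_linear_decay_lr(step, 3e-5, num_train_steps, num_warmup_steps)
    print(step, lr)
```

The rate peaks at init_lr exactly when the warm-up ends, which keeps the first updates small while the new classifier layer is still untrained.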

In [None]:
epochs = 1  # number of epochs, initially 1
batch_size = 32  # assumed: the default batch size used by fit()
# X_train is a pandas Series, not a tf.data.Dataset, so the steps are derived from its length
steps_per_epoch = -(-len(X_train) // batch_size)  # ceiling division
num_train_steps = steps_per_epoch * epochs  # total number of training steps
num_warmup_steps = int(0.1*num_train_steps)  # number of warm-up steps

# Initial learning rate
init_lr = 3e-5
# Note: this optimizer is not passed to compile() below, which uses the 'adam' string
optimizer = optimization.create_optimizer(init_lr = init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')


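> > > # Compiling the model and fine-tuning

Next, we compile the model defined above. Note that the call passes the string 'adam', which selects the default Keras Adam optimizer rather than an AdamW optimizer with the warm-up schedule described above.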



In [None]:
smallBERT_classifier_model.compile(optimizer= 'adam',
                                   loss=loss,
                                   metrics = METRICS)

And we run the fine-tuning. Due to computational and time constraints, we limit the number of epochs to 4, although this number should be considerably higher.

In [None]:
smallBERT_history = smallBERT_classifier_model.fit(X_train,y_train, 
                                                   validation_data=(X_val, y_val), 
                                                   epochs=4)

Analysis of the training and validation accuracy, its relation to overfitting, etc.

> > > # Model evaluation

Let's look at the model's performance. `evaluate` returns four values: the loss (a number representing the error; lower is better), followed by the accuracy, precision and recall defined in METRICS



In [24]:
smallBERT_metrics= smallBERT_classifier_model.evaluate(X_test[0:50], y_test[0:50])

y_predicted = smallBERT_classifier_model.predict(X_test[0:10])



In [25]:
print(smallBERT_metrics)

[0.6501409411430359, 0.6800000071525574, 0.0, 0.0]


In [None]:
print(y_predicted)

[[0.1287171 ]
 [0.0844087 ]
 [0.63812006]
 [0.10491264]
 [0.4223712 ]
 [0.54527897]
 [0.18271327]
 [0.6933144 ]
 [0.17541182]
 [0.10552812]]


In [None]:
print(y_test[0:10])

0    0
1    0
2    1
3    0
4    0
5    0
6    0
7    1
8    0
9    0
Name: target, dtype: int64
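
To relate the printed probabilities to the metrics, the sketch below thresholds the ten predictions shown above at 0.5 and computes accuracy, precision and recall by hand. The values are copied from the two outputs above; the 0.5 threshold is assumed to match the default of the Keras metrics.

```python
# Predicted probabilities and true labels copied from the outputs above.
y_prob = [0.1287171, 0.0844087, 0.63812006, 0.10491264, 0.4223712,
          0.54527897, 0.18271327, 0.6933144, 0.17541182, 0.10552812]
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

threshold = 0.5  # assumed decision threshold
y_pred = [int(p > threshold) for p in y_prob]

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(y_pred)
print(accuracy, precision, recall)
```

On these ten rows both clickbait posts are detected and one regular post is flagged by mistake. These figures are not comparable with the `evaluate` output printed earlier, which covers 50 rows and reports zero precision and recall; the two cells may come from different runs.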


We can plot how the loss and the accuracy evolve over the epochs

In [None]:
history_dict = smallBERT_history.history
print(history_dict.keys())

# The keys follow the metric names set in METRICS ('accuracy', not 'binary_accuracy')
smallBERT_acc = history_dict['accuracy']
smallBERT_val_acc = history_dict['val_accuracy']
smallBERT_loss = history_dict['loss']
smallBERT_val_loss = history_dict['val_loss']

epochs = range(1, len(smallBERT_acc) +1)
fig = plt.figure(figsize = (10,6))

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, smallBERT_loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, smallBERT_val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, smallBERT_acc, 'r', label='Training acc')
plt.plot(epochs, smallBERT_val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Adjust the spacing once both subplots exist
fig.tight_layout()

Finally, we export the resulting model so that it can be used later:

In [None]:
dataset_name = 'webis'
saved_model_path = './{}_smallBERT'.format(dataset_name.replace('/', '_'))
smallBERT_classifier_model.save(saved_model_path, include_optimizer = False)




> # ELECTRA

ELECTRA is a model based on BERT, but pre-trained as a discriminator in a setup that resembles a GAN (Generative Adversarial Network). We use the intermediate size, base. It is available on TensorFlow Hub, so we can easily compare it with the Small BERT result.

> > > # Model definition

We use the same model as for Small BERT, changing only the encoder, which is now electra_base, and its corresponding preprocessor.


In [17]:
# TF Hub URLs; build_classifier_model wraps them in hub.KerasLayer itself
electraBase_preprocess_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
electraBase_encoder_url = 'https://tfhub.dev/google/electra_base/2'

In [20]:
electraBase_classifier_model = build_classifier_model(electraBase_preprocess_url , electraBase_encoder_url)

Let's take a look at the structure of the model



In [None]:
tf.keras.utils.plot_model(electraBase_classifier_model)

In [73]:
electraBase_classifier_model.summary()

Model: "model_12"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_type_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                   

The loss function, the performance metrics and the optimizer are the same as in the previous case. Therefore, all that remains is to compile the model and run the fine-tuning:

In [21]:
electraBase_classifier_model.compile(optimizer= 'adam',
                                   loss=loss,
                                   metrics = METRICS)

In [22]:
electra_history = electraBase_classifier_model.fit(X_train,y_train, validation_data=(X_val, y_val), epochs=1)

 13/488 [..............................] - ETA: 7:27:41 - loss: 1.0447 - accuracy: 0.6851 - precision: 0.1818 - recall: 0.1348

KeyboardInterrupt: ignored

> > > # Model evaluation

Let's look at the model's performance in the same way as in the previous case.



In [None]:
# evaluate() returns [loss, accuracy, precision, recall], matching METRICS
electra_metrics = electraBase_classifier_model.evaluate(X_test, y_test)

And we draw the corresponding plots

In [None]:
electra_history_dict = electra_history.history
print(electra_history_dict.keys())

# The keys follow the metric names set in METRICS ('accuracy', not 'binary_accuracy')
electra_acc = electra_history_dict['accuracy']
electra_val_acc = electra_history_dict['val_accuracy']
electra_loss = electra_history_dict['loss']
electra_val_loss = electra_history_dict['val_loss']

epochs = range(1, len(electra_acc) +1)
fig = plt.figure(figsize = (10,6))

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, electra_loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, electra_val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, electra_acc, 'r', label='Training acc')
plt.plot(epochs, electra_val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Adjust the spacing once both subplots exist
fig.tight_layout()

Finally, we export the resulting model so that it can be used later:

In [None]:
dataset_name = 'webis'
saved_model_path = './{}_electraBase'.format(dataset_name.replace('/', '_'))
electraBase_classifier_model.save(saved_model_path, include_optimizer = False)




> # Simple classifier



In [None]:
clf = DummyClassifier()
scores = cross_val_score(clf, X_train, y_train)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std()*2))

Dummy classifier score: 0.758 (+/- 0.00)
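
To put this score in context, the sketch below computes the accuracy of a classifier that always predicts the most frequent class, using the class counts printed earlier in the notebook. It assumes the dummy classifier behaves this way; `majority_class_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
def majority_class_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Class counts before and after downsampling, copied from the outputs above.
unbalanced = {0: 14768, 1: 4716}
balanced = {0: 4716, 1: 4716}

print(round(majority_class_accuracy(unbalanced), 3))
print(round(majority_class_accuracy(balanced), 3))
```

The unbalanced figure coincides with the 0.758 printed above, which suggests (but does not prove) that the baseline cell was run on the data before downsampling. On the balanced training set the same baseline would score about 0.5, so that is the relevant reference for models trained on it.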
