# Applying BERT to Twitter Bot Detection

Student: Gustavo Monteiro

In [2]:
pip install tensorflow transformers scikit-learn pandas gdown



## Dataset Download

The dataset was downloaded from Google Drive with the gdown module, saved locally as twitter_bot_detection.csv, and loaded with pandas.

In [3]:
import gdown
url = "https://drive.google.com/uc?id=1et-4zd_KETINZmIzjTET3YNfru3LhDcB"

output = "twitter_bot_detection.csv"
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1et-4zd_KETINZmIzjTET3YNfru3LhDcB
To: /content/twitter_bot_detection.csv
100%|██████████| 7.46M/7.46M [00:00<00:00, 123MB/s]


'twitter_bot_detection.csv'

In [4]:
import pandas as pd
df = pd.read_csv("twitter_bot_detection.csv")

print(df.head())

   User ID        Username                                              Tweet  \
0   132131           flong  Station activity person against natural majori...   
1   289683  hinesstephanie  Authority research natural life material staff...   
2   779715      roberttran  Manage whose quickly especially foot none to g...   
3   696168          pmason  Just cover eight opportunity strong policy which.   
4   704441          noah87                      Animal sign six data good or.   

   Retweet Count  Mention Count  Follower Count  Verified  Bot Label  \
0             85              1            2353     False          1   
1             55              5            9617      True          0   
2              6              2            4363      True          0   
3             54              5            2242      True          1   
4             26              3            8438     False          1   

       Location           Created At            Hashtags  
0     Adkinston  2020

## Preprocessing

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer

df = pd.read_csv("twitter_bot_detection.csv")

print(df.head())

# Drop rows with missing values, then drop exact duplicate rows.
df = df.dropna().drop_duplicates()

print(df.columns)

# Model input is the tweet text; the target is the binary bot label.
texts = df['Tweet'].values
labels = df['Bot Label'].values

# The labels are already 0/1, so the encoder only guarantees integer classes.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(texts, encoded_labels, test_size=0.2, random_state=42)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_texts(texts, tokenizer, max_len=128):
    # Pad or truncate every tweet to max_len tokens and return TensorFlow tensors.
    encodings = tokenizer(
        list(texts),
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='tf'
    )
    return encodings

train_encodings = tokenize_texts(X_train, tokenizer)
test_encodings = tokenize_texts(X_test, tokenizer)

Rows with null values and duplicate rows were removed.

The data was split into training and test sets with train_test_split, using 80% for training and 20% for testing.

The tweet texts were encoded with the BERT tokenizer (BertTokenizer), padded or truncated to 128 tokens, which produces the input_ids and attention_mask tensors that the BERT model expects.
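
To make the padding and truncation step more concrete, here is a minimal pure-Python sketch of what `padding='max_length'` and `truncation=True` amount to for a single sequence of token ids. It is a simplification and not the tokenizer's actual implementation (the real tokenizer, for instance, keeps the closing [SEP] token when it truncates). The helper name `pad_or_truncate` and the example ids are assumptions made for illustration; 101, 102, and 0 are the [CLS], [SEP], and [PAD] ids in the bert-base-uncased vocabulary.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len items long.

    Hypothetical helper for illustration only; not part of transformers.
    """
    ids = list(token_ids)[:max_len]   # truncation: keep at most max_len ids
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad           # padding: fill the rest with pad_id
    mask += [0] * n_pad               # 0 marks padding the model should ignore
    return ids, mask


# A short sequence is padded, a long one is cut to max_len.
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_len=8)
long_ids, long_mask = pad_or_truncate(range(1, 13), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with the same length, which is why the model inputs can be declared with a fixed shape of 128.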

## Model Implementation

In [6]:
import tensorflow as tf
from transformers import TFBertModel

# Load the pre-trained BERT encoder (no task-specific head).
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Two integer inputs of length 128, matching max_len used during tokenization.
input_ids = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")

# Call BERT inside a Lambda layer and return the per-token hidden states.
# Note: Keras reports 0 parameters for this layer (see the summary below).
def bert_lambda(inputs):
    return bert_model(input_ids=inputs[0], attention_mask=inputs[1]).last_hidden_state

bert_outputs = tf.keras.layers.Lambda(bert_lambda, output_shape=(128, 768))([input_ids, attention_mask])

# The hidden state at position 0 corresponds to the [CLS] token.
cls_token = bert_outputs[:, 0, :]

# Single sigmoid unit for binary classification.
output = tf.keras.layers.Dense(1, activation='sigmoid')(cls_token)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 128)]                0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 128)]                0         []                            
 )                                                                                                
                                                                                                  
 lambda (Lambda)             (None, 128, 768)             0         ['input_ids[0][0]',           
                                                                     'attention_mask[0][0]']      
                                                                                              

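The [CLS] token representation from BERT's last hidden state was connected to a dense layer with a sigmoid activation, which returns the probability that a sample is a bot (label 1) rather than a human (label 0). Note that the summary reports 0 parameters for the Lambda layer: Keras does not track weights used inside a Lambda, so in this setup only the dense layer appears to be trained and BERT itself is not fine-tuned.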
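
As a rough illustration of what the final layer computes, the sketch below applies a single dense unit with a sigmoid to a small vector and thresholds the result at 0.5, the same cut-off used later for evaluation. It uses plain Python rather than Keras, and the 3-dimensional vector, weights, and bias are made-up stand-ins for the 768-dimensional [CLS] vector and the learned parameters.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(cls_vector, weights, bias):
    """One dense unit: weighted sum plus bias, followed by a sigmoid."""
    logit = sum(x * w for x, w in zip(cls_vector, weights)) + bias
    return sigmoid(logit)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
cls_vector = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, 0.1]
bias = 0.5

prob_bot = dense_sigmoid(cls_vector, weights, bias)  # logit is about 0.4
label = int(prob_bot > 0.5)                          # 1 = bot, 0 = human

print(round(prob_bot, 3), label)
```

A logit of 0 maps to exactly 0.5, so the 0.5 threshold corresponds to the sign of the weighted sum.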

## Model Training

In [7]:
from tensorflow.keras.callbacks import EarlyStopping

# The model built and compiled in the previous cell is reused here.

# Stop once val_loss has not improved for 3 consecutive epochs and
# restore the weights from the best epoch.
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

batch_size = 64

history = model.fit(
    x=[train_encodings['input_ids'], train_encodings['attention_mask']],
    y=y_train,
    validation_split=0.2,  # hold out 20% of the training rows for validation
    batch_size=batch_size,
    epochs=2,
    callbacks=[early_stopping]
)

Epoch 1/2
417/417 ━━━━━━━━━━━━━━━━━━━━ 0s 30s/step - accuracy: 0.4919 - loss: 0.7929

KeyboardInterrupt: 

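The Adam optimizer was used with a learning rate of 2e-5, together with the binary_crossentropy loss function.

EarlyStopping was added to stop training when the validation loss stops improving. With epochs=2 and patience=3, however, the callback can never trigger, because it needs three consecutive epochs without improvement.

batch_size was set to 64 to reduce training time, since each epoch was slow given the complexity of the model and the size of the data. The run shown above was interrupted manually (KeyboardInterrupt) during the first epoch.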
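
The patience rule behind EarlyStopping can be modelled roughly as follows. This is a simplified pure-Python sketch of the idea (minimum-delta of zero, lower loss is better), not the Keras implementation; the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop,
    or None if it runs through every epoch.

    Hypothetical helper that mimics a patience counter on val_loss.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Loss improves until epoch 2, then stalls for three epochs.
stop_at = early_stopping_epoch([0.70, 0.65, 0.66, 0.67, 0.68, 0.60], patience=3)

# A two-epoch run can accumulate at most one epoch without improvement.
two_epochs = early_stopping_epoch([0.70, 0.71], patience=3)

print(stop_at, two_epochs)
```

The second call shows why a patience of 3 has no effect on a run limited to two epochs.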

In [8]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Predicted probabilities, flattened from shape (n, 1) to (n,).
y_pred = model.predict([test_encodings['input_ids'], test_encodings['attention_mask']]).ravel()
# Threshold at 0.5 to obtain hard labels.
y_pred_labels = (y_pred > 0.5).astype(int)

print(confusion_matrix(y_test, y_pred_labels))
print(classification_report(y_test, y_pred_labels))
# AUC-ROC is computed from the probabilities, not from the thresholded labels.
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred)}")

261/261 ━━━━━━━━━━━━━━━━━━━━ 3889s 15s/step
[[2786 1419]
 [2712 1415]]
              precision    recall  f1-score   support

           0       0.51      0.66      0.57      4205
           1       0.50      0.34      0.41      4127

    accuracy                           0.50      8332
   macro avg       0.50      0.50      0.49      8332
weighted avg       0.50      0.50      0.49      8332

AUC-ROC: 0.5067604738609781
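
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below does so in plain Python, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is made up for illustration.

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [[2786, 1419],
      [2712, 1415]]


def per_class_metrics(cm, cls):
    """Precision, recall, and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm)) if r != cls)
    fn = sum(cm[cls][c] for c in range(len(cm)) if c != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(map(sum, cm))
accuracy = (cm[0][0] + cm[1][1]) / total

for cls in (0, 1):
    p, r, f1 = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The first column sums to 5498 of 8332 samples, so the model predicts class 0 about 66% of the time; together with an accuracy of 0.50 and an AUC-ROC of about 0.51, the results are at chance level.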


## Objective

The goal of this project was to train a BERT-based neural network in Keras to detect bots on Twitter, using the "Twitter-Bot Detection" dataset and the pre-trained BERT model (Bidirectional Encoder Representations from Transformers) for binary classification (human or bot).

## Challenges Faced

Input format error: During implementation we ran into a format error caused by input tensors that did not match what the BERT model expects.

Long training time: Initial training was very slow, with some epochs taking hours. To mitigate this, batch_size was set to 64 and EarlyStopping was added to avoid wasting time on epochs without improvement.
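
A back-of-the-envelope calculation shows where the hours come from. The sketch below assumes an exact 80/20 split, so the training set size is inferred from the 8,332 test rows in the classification report rather than read from the notebook; the 30 s/step figure is taken from the training log. The helper name `steps_per_epoch` is made up for illustration.

```python
import math


def steps_per_epoch(n_rows, batch_size):
    """Number of batches needed to see every row once."""
    return math.ceil(n_rows / batch_size)


n_test = 8332                               # support from classification_report
n_train_full = n_test * 4                   # assumption: exact 80/20 split
n_fit = math.floor(n_train_full * 0.8)      # rows left after validation_split=0.2

steps = steps_per_epoch(n_fit, batch_size=64)
seconds_per_epoch = steps * 30              # about 30 s/step in the training log

print(steps, round(seconds_per_epoch / 3600, 1))
```

Under these assumptions the step count matches the 417 steps shown in the log, and one epoch takes roughly three and a half hours. A larger batch size lowers the number of steps, but each step processes more rows, so the time per step normally grows as well.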

## Lessons Learned

The success of a BERT model depends heavily on the quality and format of the input data. The texts must be tokenized correctly, and the input tensors must have the shape the model expects.

Training complex models such as BERT can take a long time. We learned that adjusting batch_size and applying early-stopping techniques such as EarlyStopping helps make better use of compute resources and time.

## Next Steps

Result analysis: Carry out a detailed analysis of the results, including evaluation metrics such as precision, recall, and F1-score. The ROC curve and the AUC can also help in understanding the model's performance. The evaluation above (accuracy 0.50, AUC-ROC 0.507) is at chance level, so this analysis is the most pressing step.
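
One way to read the AUC is as the probability that a randomly chosen bot receives a higher score than a randomly chosen human, with ties counted as one half. The sketch below computes it that way in plain Python. It is meant for intuition on small inputs only; in practice `roc_auc_score` from scikit-learn, already used in this notebook, is the tool to use. The function name `roc_auc` and the toy data are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Quadratic in the number of samples; intended for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)

print(auc)
```

Under this reading, an AUC close to 0.5 means the scores barely separate the two classes.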

Hyperparameter optimization: Explore different hyperparameter values, such as learning rate, batch size, and number of epochs, to find the configuration that gives the best performance.
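
A simple way to organise such a search is to enumerate the candidate configurations up front. The sketch below builds a small grid with `itertools.product`; the candidate values are hypothetical and were not tested in this project, and each resulting dictionary would still have to be passed to a training routine.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [16, 32, 64]
epoch_counts = [2, 3, 4]

# One dictionary per combination of the three hyperparameters.
grid = [
    {"learning_rate": lr, "batch_size": bs, "epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_counts)
]

print(len(grid))
print(grid[0])
```

Since a single epoch took hours here, a full grid of this size would be expensive; trying a few configurations on a subsample of the data first is a cheaper starting point.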

Integration with other models: Experiment with combining BERT with other models based on convolutional (CNN) or recurrent (RNN) neural networks to better handle different types of text.