# Aplicando BERT para Detec√ß√£o de Bots do Twitter

## Inteli - Sistemas de Informa√ß√£o - Programa√ß√£o
- **Professor**üë®‚Äçüè´: Jefferson de Oliveira Silva
- **Aluno**üë®‚Äçüéì: Pedro de Carvalho Rezende

### Objetivoüö®
Treinar uma rede neural BERT no Keras para detectar bots no Twitter utilizando o dataset Twitter-Bot Detection, dispon√≠vel no Kaggle.

### Entreg√°veis üìÉ
- Relat√≥rio do Projeto: Documenta√ß√£o detalhando a an√°lise, implementa√ß√£o e resultados.
- Modelo treinado e persistido

# Conjunto de Dados de Detec√ß√£o de Bots ü§ñüîç

Bem-vindo ao Conjunto de Dados de Detec√ß√£o de Bots! Este conjunto de dados foi criado para **facilitar a an√°lise e detec√ß√£o de contas automatizadas (bots) no Twitter**. Ele cont√©m uma cole√ß√£o de perfis de usu√°rios e dados de tweets associados, junto com um r√≥tulo bin√°rio indicando se cada usu√°rio √© um bot ou n√£o.



## Informa√ß√µes do Conjunto de Dados üìä

O conjunto de dados √© fornecido no formato de arquivo CSV, nomeado 'bot_detection_dataset.csv'. Ele inclui as seguintes colunas:

- **ID do Usu√°rio**: Identificador √∫nico para cada usu√°rio no conjunto de dados.
- **Nome de Usu√°rio**: O nome de usu√°rio associado ao usu√°rio.
- **Tweet**: O conte√∫do textual do tweet.
- **Contagem de Retweets**: O n√∫mero de vezes que o tweet foi retweetado.
- **Contagem de Men√ß√µes**: O n√∫mero de men√ß√µes no tweet.
- **Contagem de Seguidores**: O n√∫mero de seguidores que o usu√°rio tem.
- **Verificado**: Um valor booleano indicando se o usu√°rio √© verificado ou n√£o.
- **R√≥tulo de Bot**: Um r√≥tulo indicando se o usu√°rio √© um bot (1) ou n√£o (0).
- **Localiza√ß√£o**: A localiza√ß√£o associada ao usu√°rio.
- **Criado em**: A data e hora em que a conta foi criada.

## Como Usar üìù - REFAZER

1. Carregue o conjunto de dados: Leia o arquivo 'bot_detection_dataset.csv' na sua ferramenta ou biblioteca de an√°lise de dados ou machine learning preferida.
2. Pr√©-processamento dos dados: Realize a limpeza necess√°ria dos dados, trate os valores ausentes e fa√ßa a engenharia de features.
3. Divida os dados: Separe o conjunto de dados em conjuntos de treinamento e teste.
4. Escolha um Algoritmo de Machine Learning: Selecione um ou mais algoritmos adequados para classifica√ß√£o bin√°ria, como Regress√£o Log√≠stica, Random Forest, Gradient Boosting, M√°quinas de Vetores de Suporte (SVM) ou Redes Neurais.
5. Treine o modelo: Treine o(s) algoritmo(s) escolhido(s) nos dados de treinamento.
6. Avalie o modelo: Avalie o desempenho do modelo usando as m√©tricas de avalia√ß√£o adequadas.
7. Predizer Bot ou N√£o: Aplique o modelo treinado em novos dados para prever se um usu√°rio √© ou n√£o um bot.

## Algoritmos de Machine Learning para Detec√ß√£o de Bots üß†üí°

V√°rios algoritmos de machine learning podem ser aplicados para prever contas de bots usando este conjunto de dados. Alguns algoritmos comumente utilizados incluem:

- Regress√£o Log√≠stica
- Random Forest
- Gradient Boosting (XGBoost, LightGBM)
- M√°quinas de Vetores de Suporte (SVM)
- Redes Neurais (MLPs, CNNs)

No nosso caso aqui, farei o treinamento da rede neural BERT no Keras. Por√©m, sugiro experimentar diferentes algoritmos e considere realizar ajuste de hiperpar√¢metros para otimizar o desempenho do modelo.

Lembre-se de reconhecer a fonte do conjunto de dados e fornecer as devidas cita√ß√µes se usar este conjunto de dados para pesquisas ou an√°lises.

Aproveite a explora√ß√£o do Conjunto de Dados de Detec√ß√£o de Bots e descubra insights sobre contas de bots no Twitter! üöÄüîç

# **IMPORTANTE**:
- Este notebook est√° sendo trabalhado com GPUs.
- Por isso √© aplicado cuDF.
- Verifique se voc√™ est√° rodando em um tempo de execu√ß√£o com GPU, pois se n√£o, ser√° necess√°rio pequenas mudan√ßas no c√≥digo. Al√©m de que ir√° demorar um pouco mais.

---

# Implementando cuDF

- O principal motivo de estarmos utilizando a GPU √© para acelerar o processamento dos dados, visto que o cuDF √© uma biblioteca que permite a manipula√ß√£o de dados em GPU, o que torna o processamento mais r√°pido.
- Um exemplo claro disso √© percebido no momento de rodar qualquer pre processo dos dados

In [1]:
!nvidia-smi

Wed Oct  9 12:10:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
%load_ext cudf.pandas

# Importa√ß√µes e instala√ß√µes de bibliotecas

In [3]:
import gdown
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import seaborn as sns
import string
import re

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
nltk.download('vader_lexicon')
nltk.download('punkt')

from collections import Counter
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

import tensorflow as tf
import torch
from transformers import TFBertModel, BertTokenizer
from tensorflow.keras.layers import Input, Dense, Concatenate, Embedding, Flatten, Dropout, BatchNormalization, Activation, Layer, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Baixando os dados

In [4]:
!gdown 1zASdz9hUY6NC6p7h9ef5trbv4B3uUpJr

Downloading...
From: https://drive.google.com/uc?id=1zASdz9hUY6NC6p7h9ef5trbv4B3uUpJr
To: /content/bot_detection_data.csv
  0% 0.00/7.46M [00:00<?, ?B/s]100% 7.46M/7.46M [00:00<00:00, 154MB/s]


In [5]:
df = pd.read_csv('/content/bot_detection_data.csv')
df

Unnamed: 0,User ID,Username,Tweet,Retweet Count,Mention Count,Follower Count,Verified,Bot Label,Location,Created At,Hashtags
0,132131,flong,Station activity person against natural majori...,85,1,2353,False,1,Adkinston,2020-05-11 15:29:50,
1,289683,hinesstephanie,Authority research natural life material staff...,55,5,9617,True,0,Sanderston,2022-11-26 05:18:10,both live
2,779715,roberttran,Manage whose quickly especially foot none to g...,6,2,4363,True,0,Harrisonfurt,2022-08-08 03:16:54,phone ahead
3,696168,pmason,Just cover eight opportunity strong policy which.,54,5,2242,True,1,Martinezberg,2021-08-14 22:27:05,ever quickly new I
4,704441,noah87,Animal sign six data good or.,26,3,8438,False,1,Camachoville,2020-04-13 21:24:21,foreign mention
...,...,...,...,...,...,...,...,...,...,...,...
49995,491196,uberg,Want but put card direction know miss former h...,64,0,9911,True,1,Lake Kimberlyburgh,2023-04-20 11:06:26,teach quality ten education any
49996,739297,jessicamunoz,Provide whole maybe agree church respond most ...,18,5,9900,False,1,Greenbury,2022-10-18 03:57:35,add walk among believe
49997,674475,lynncunningham,Bring different everyone international capital...,43,3,6313,True,1,Deborahfort,2020-07-08 03:54:08,onto admit artist first
49998,167081,richardthompson,Than about single generation itself seek sell ...,45,1,6343,False,0,Stephenside,2022-03-22 12:13:44,star


# An√°lise e prepara√ß√£o dos dados

## Pequena explorat√≥ria

In [6]:
df

Unnamed: 0,User ID,Username,Tweet,Retweet Count,Mention Count,Follower Count,Verified,Bot Label,Location,Created At,Hashtags
0,132131,flong,Station activity person against natural majori...,85,1,2353,False,1,Adkinston,2020-05-11 15:29:50,
1,289683,hinesstephanie,Authority research natural life material staff...,55,5,9617,True,0,Sanderston,2022-11-26 05:18:10,both live
2,779715,roberttran,Manage whose quickly especially foot none to g...,6,2,4363,True,0,Harrisonfurt,2022-08-08 03:16:54,phone ahead
3,696168,pmason,Just cover eight opportunity strong policy which.,54,5,2242,True,1,Martinezberg,2021-08-14 22:27:05,ever quickly new I
4,704441,noah87,Animal sign six data good or.,26,3,8438,False,1,Camachoville,2020-04-13 21:24:21,foreign mention
...,...,...,...,...,...,...,...,...,...,...,...
49995,491196,uberg,Want but put card direction know miss former h...,64,0,9911,True,1,Lake Kimberlyburgh,2023-04-20 11:06:26,teach quality ten education any
49996,739297,jessicamunoz,Provide whole maybe agree church respond most ...,18,5,9900,False,1,Greenbury,2022-10-18 03:57:35,add walk among believe
49997,674475,lynncunningham,Bring different everyone international capital...,43,3,6313,True,1,Deborahfort,2020-07-08 03:54:08,onto admit artist first
49998,167081,richardthompson,Than about single generation itself seek sell ...,45,1,6343,False,0,Stephenside,2022-03-22 12:13:44,star


In [7]:
df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   User ID         50000 non-null  int64
 1   Username        50000 non-null  object
 2   Tweet           50000 non-null  object
 3   Retweet Count   50000 non-null  int64
 4   Mention Count   50000 non-null  int64
 5   Follower Count  50000 non-null  int64
 6   Verified        50000 non-null  bool
 7   Bot Label       50000 non-null  int64
 8   Location        50000 non-null  object
 9   Created At      50000 non-null  object
 10  Hashtags        41659 non-null  object
dtypes: bool(1), int64(5), object(5)
memory usage: 8.6+ MB


- Retweet Count: A m√©dia de retweets √© 50, com um desvio padr√£o de 29. A contagem m√≠nima de retweets √© 0 e a m√°xima √© 100.
- Mention Count: A m√©dia de men√ß√µes por tweet √© 2,5, com um desvio padr√£o de 1,7. O m√°ximo de men√ß√µes √© 5.
- Follower Count: A m√©dia de seguidores por usu√°rio √© de aproximadamente 4.988, com um m√°ximo de 10.000 seguidores.

In [8]:
df.describe()

Unnamed: 0,User ID,Retweet Count,Mention Count,Follower Count,Bot Label
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,548890.68054,50.0056,2.51376,4988.60238,0.50036
std,259756.681425,29.18116,1.708563,2878.742898,0.500005
min,100025.0,0.0,0.0,0.0,0.0
25%,323524.25,25.0,1.0,2487.75,0.0
50%,548147.0,50.0,3.0,4991.5,1.0
75%,772983.0,75.0,4.0,7471.0,1.0
max,999995.0,100.0,5.0,10000.0,1.0


- Aqui conseguimos ver um bom equil√≠brio de usu√°rio bots e de usu√°rios n√£o bots

In [9]:
df['Bot Label'].value_counts()

Unnamed: 0_level_0,count
Bot Label,Unnamed: 1_level_1
1,25018
0,24982


In [10]:
teste = df.isnull().sum()
teste

Unnamed: 0,0
User ID,0
Username,0
Tweet,0
Retweet Count,0
Mention Count,0
Follower Count,0
Verified,0
Bot Label,0
Location,0
Created At,0


In [11]:
# Changing every title into better version (lowwercase + "_" instead of spaces)
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
df

Unnamed: 0,user_id,username,tweet,retweet_count,mention_count,follower_count,verified,bot_label,location,created_at,hashtags
0,132131,flong,Station activity person against natural majori...,85,1,2353,False,1,Adkinston,2020-05-11 15:29:50,
1,289683,hinesstephanie,Authority research natural life material staff...,55,5,9617,True,0,Sanderston,2022-11-26 05:18:10,both live
2,779715,roberttran,Manage whose quickly especially foot none to g...,6,2,4363,True,0,Harrisonfurt,2022-08-08 03:16:54,phone ahead
3,696168,pmason,Just cover eight opportunity strong policy which.,54,5,2242,True,1,Martinezberg,2021-08-14 22:27:05,ever quickly new I
4,704441,noah87,Animal sign six data good or.,26,3,8438,False,1,Camachoville,2020-04-13 21:24:21,foreign mention
...,...,...,...,...,...,...,...,...,...,...,...
49995,491196,uberg,Want but put card direction know miss former h...,64,0,9911,True,1,Lake Kimberlyburgh,2023-04-20 11:06:26,teach quality ten education any
49996,739297,jessicamunoz,Provide whole maybe agree church respond most ...,18,5,9900,False,1,Greenbury,2022-10-18 03:57:35,add walk among believe
49997,674475,lynncunningham,Bring different everyone international capital...,43,3,6313,True,1,Deborahfort,2020-07-08 03:54:08,onto admit artist first
49998,167081,richardthompson,Than about single generation itself seek sell ...,45,1,6343,False,0,Stephenside,2022-03-22 12:13:44,star


In [12]:
username_tweet_counts = df.groupby('username')['tweet'].count()
users_with_multiple_tweets = username_tweet_counts[username_tweet_counts > 10]
users_with_multiple_tweets

Unnamed: 0_level_0,tweet
username,Unnamed: 1_level_1
bjohnson,11
bjones,11
bwilliams,11
djohnson,12
dsmith,13
ejohnson,12
fsmith,12
fwilliams,12
ismith,13
ksmith,21


In [13]:
ksmith_tweets = df[df['username'] == 'bjohnson']
ksmith_bot_counts = ksmith_tweets.groupby('bot_label')['tweet'].count()
print(ksmith_bot_counts)

bot_label
0    7
1    4
Name: tweet, dtype: int64


In [14]:
teste = df[[col for col in df.columns if col not in ['user_id', 'username', 'tweet', 'location', 'created_at', 'hashtags']]]

In [15]:
teste.corr()['bot_label'].sort_values(ascending=False)

Unnamed: 0,bot_label
bot_label,1.0
retweet_count,0.00125
follower_count,0.001162
verified,-0.00264
mention_count,-0.006912


In [16]:
import plotly.express as px
import plotly.figure_factory as ff

correlation_matrix = df[[col for col in df.columns if col not in ['user_id', 'username', 'tweet', 'location', 'created_at', 'hashtags']]].corr()

fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    annotation_text=np.around(correlation_matrix.values, decimals=2),
    colorscale='Viridis'
)

fig.update_layout(title='Matriz de Correla√ß√£o', xaxis_title='Vari√°veis', yaxis_title='Vari√°veis')
fig.show()

- √â poss√≠vel perceber que as vari√°veis num√©ricas n√£o s√£o nem um pouco efetivas para a predi√ß√£o dessa classifica√ß√£o, por isso vou dropar.

## Tratamento de colunas

In [17]:
df

Unnamed: 0,user_id,username,tweet,retweet_count,mention_count,follower_count,verified,bot_label,location,created_at,hashtags
0,132131,flong,Station activity person against natural majori...,85,1,2353,False,1,Adkinston,2020-05-11 15:29:50,
1,289683,hinesstephanie,Authority research natural life material staff...,55,5,9617,True,0,Sanderston,2022-11-26 05:18:10,both live
2,779715,roberttran,Manage whose quickly especially foot none to g...,6,2,4363,True,0,Harrisonfurt,2022-08-08 03:16:54,phone ahead
3,696168,pmason,Just cover eight opportunity strong policy which.,54,5,2242,True,1,Martinezberg,2021-08-14 22:27:05,ever quickly new I
4,704441,noah87,Animal sign six data good or.,26,3,8438,False,1,Camachoville,2020-04-13 21:24:21,foreign mention
...,...,...,...,...,...,...,...,...,...,...,...
49995,491196,uberg,Want but put card direction know miss former h...,64,0,9911,True,1,Lake Kimberlyburgh,2023-04-20 11:06:26,teach quality ten education any
49996,739297,jessicamunoz,Provide whole maybe agree church respond most ...,18,5,9900,False,1,Greenbury,2022-10-18 03:57:35,add walk among believe
49997,674475,lynncunningham,Bring different everyone international capital...,43,3,6313,True,1,Deborahfort,2020-07-08 03:54:08,onto admit artist first
49998,167081,richardthompson,Than about single generation itself seek sell ...,45,1,6343,False,0,Stephenside,2022-03-22 12:13:44,star


- A partir dessa an√°lise de correla√ß√£o, n√£o vamos fazer a utiliza√ß√£o da vari√°vel verified, j√° que ela n√£o causa certa influ√™ncia na verifica√ß√£o de bots e n√£o bots

In [18]:
crosstab_verified_botlabel = pd.crosstab(df['verified'], df['bot_label'])
print(crosstab_verified_botlabel)

bot_label      0      1
verified               
False      12456  12540
True       12526  12478


- S√≥ estarei usando o mais importante, o target e o tweet mesmo

In [19]:
# Tirando location, pois s√£o valores artificiais
df_clean = df[['tweet', 'bot_label']]
df_clean

Unnamed: 0,tweet,bot_label
0,Station activity person against natural majori...,1
1,Authority research natural life material staff...,0
2,Manage whose quickly especially foot none to g...,0
3,Just cover eight opportunity strong policy which.,1
4,Animal sign six data good or.,1
...,...,...
49995,Want but put card direction know miss former h...,1
49996,Provide whole maybe agree church respond most ...,1
49997,Bring different everyone international capital...,1
49998,Than about single generation itself seek sell ...,0


### Tratamento de caracteres

- Aqui, estou fazendo uma limpeza b√°sica que remove qualquer tipo de pontua√ß√£o, caracteres especiais (como !, @, #, etc.), deixando apenas letras, n√∫meros e espa√ßos.


In [20]:
df_clean['tweet'] = df_clean['tweet'].apply(lambda x: re.sub(r'[^\w\s]', '', str(x)))

In [21]:
df_clean.head()

Unnamed: 0,tweet,bot_label
0,Station activity person against natural majori...,1
1,Authority research natural life material staff...,0
2,Manage whose quickly especially foot none to g...,0
3,Just cover eight opportunity strong policy which,1
4,Animal sign six data good or,1


In [22]:
df_clean.shape

(50000, 2)

In [23]:
df_clean.columns

Index(['tweet', 'bot_label'], dtype='object')

## Estrutura√ß√£o dos dados

In [24]:
df_clean

Unnamed: 0,tweet,bot_label
0,Station activity person against natural majori...,1
1,Authority research natural life material staff...,0
2,Manage whose quickly especially foot none to g...,0
3,Just cover eight opportunity strong policy which,1
4,Animal sign six data good or,1
...,...,...
49995,Want but put card direction know miss former half,1
49996,Provide whole maybe agree church respond most ...,1
49997,Bring different everyone international capital...,1
49998,Than about single generation itself seek sell ...,0


In [25]:
# Inicializando o tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # aqui √© onde instanciamos o BERT tokenizer para fazer a tokeniza√ß√£o dos conte√∫dos textuais



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]


`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884



In [26]:
# Fazendo uma fun√ß√£ozinha para tokenizar os tweets
def tokenize_data(texts, max_length=128):
    tokens = tokenizer(
        texts.to_list(),
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='tf'
    )
    return tokens['input_ids'], tokens['attention_mask']

In [27]:
feature = df['tweet']
labels = df['bot_label']

In [28]:
input_ids, attention_masks = tokenize_data(feature)

In [29]:
input_ids_np = input_ids.numpy()
attention_masks_np = attention_masks.numpy()

- Aqui vou estar fazendo a separa√ß√£o entre treino e teste dos tokens e label que demonstrei anteriormente

In [30]:
X_train_ids, X_test_ids, X_train_masks, X_test_masks, y_train, y_test = train_test_split(input_ids_np, attention_masks_np, labels, test_size=0.2, random_state=42)

In [31]:
X_train_ids = tf.convert_to_tensor(X_train_ids)
X_test_ids = tf.convert_to_tensor(X_test_ids)
X_train_masks = tf.convert_to_tensor(X_train_masks)
X_test_masks = tf.convert_to_tensor(X_test_masks)
y_train = tf.convert_to_tensor(y_train)
y_test = tf.convert_to_tensor(y_test)

# Treinamento do modelo

In [32]:
# Definindo o modelo para classifica√ß√£o bin√°ria
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Otimizador Adam com taxa de aprendizado ajustada
optimizer = Adam(learning_rate=2e-5)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [33]:
# Camadas de entrada
input_ids_layer = Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask_layer = Input(shape=(128,), dtype=tf.int32, name="attention_mask")

In [34]:
# Fun√ß√£o para pegar o output do BERT
def bert_layer(inputs):
    input_ids, attention_mask = inputs
    # Pegando o √∫ltimo estado escondido do BERT
    output = bert_model(input_ids, attention_mask=attention_mask)
    return output.last_hidden_state

# Aplicando a camada Lambda para BERT
bert_output = Lambda(bert_layer, output_shape=(128, 768))([input_ids_layer, attention_mask_layer])

In [35]:
# Pegando o √∫ltimo token da sequ√™ncia (correspondente ao [CLS])
last_token = bert_output[:, 0, :]

# Dropout para evitar overfitting
dropout = Dropout(0.3)(last_token)

# Camada densa final para classifica√ß√£o
output = Dense(1, activation='sigmoid')(dropout)

In [36]:
# Constru√ß√£o do modelo
model = Model(inputs=[input_ids_layer, attention_mask_layer], outputs=output)

# Compila√ß√£o do modelo com otimizador AdamW
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

In [39]:
# Treinamento do modelo
history = model.fit(
    [X_train_ids, X_train_masks], y_train,
    validation_data=([X_test_ids, X_test_masks], y_test),
    epochs=2,
    batch_size=16
)

Epoch 1/2
[1m2500/2500[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m357s[0m 143ms/step - accuracy: 0.5053 - loss: 0.7227 - val_accuracy: 0.4859 - val_loss: 0.7020
Epoch 2/2
[1m2500/2500[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m357s[0m 143ms/step - accuracy: 0.4994 - loss: 0.7206 - val_accuracy: 0.4859 - val_loss: 0.7011


In [40]:
# Avalia√ß√£o do modelo
loss, accuracy = model.evaluate([X_test_ids, X_test_masks], y_test)
print(f"Loss: {loss}, Accuracy: {accuracy}")

[1m313/313[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m85s[0m 242ms/step - accuracy: 0.4831 - loss: 0.7016
Loss: 0.7011010050773621, Accuracy: 0.48590001463890076


# Resultados

In [41]:
y_pred_probs = model.predict([X_test_ids, X_test_masks])
y_pred = (y_pred_probs > 0.5).astype(int)
cm = confusion_matrix(y_test, y_pred)

fig = ff.create_annotated_heatmap(
    z=cm,
    x=['Predito Negativo', 'Predito Positivo'],
    y=['Real Negativo', 'Real Positivo'],
    annotation_text=cm,
    colorscale='Viridis'
)

fig.update_layout(title='Matriz de Confus√£o', xaxis_title='Predito', yaxis_title='Real')
fig.show()

[1m313/313[0m [32m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[37m[0m [1m85s[0m 254ms/step


Essa matriz de confus√£o mostra a compara√ß√£o entre os valores reais e os valores previstos de um modelo de classifica√ß√£o bin√°ria. Ela est√° organizada da seguinte forma:

- No eixo Y ("Real"), temos as classes reais: "Real Positivo" (cima) e "Real Negativo" (baixo).
- No eixo X ("Predito"), temos as classes previstas: "Predito Negativo" (esquerda) e "Predito Positivo" (direita).

Os n√∫meros no centro das caixas representam a quantidade de ocorr√™ncias em cada combina√ß√£o:

1. **2319**: N√∫mero de casos onde o modelo previu **negativo** corretamente (verdadeiro negativo).
2. **2713**: N√∫mero de casos onde o modelo previu **positivo**, mas na verdade era **negativo** (falso positivo).
3. **2146**: N√∫mero de casos onde o modelo previu **negativo**, mas na verdade era **positivo** (falso negativo).
4. **2822**: N√∫mero de casos onde o modelo previu **positivo** corretamente (verdadeiro positivo).

### An√°lise:
- O modelo est√° cometendo **2146 falsos negativos** e **2713 falsos positivos**, o que sugere que h√° uma quantidade consider√°vel de erros em ambas as dire√ß√µes.
- A quantidade de **verdadeiros positivos (2822)** √© alta, assim como a de **verdadeiros negativos (2319)**, indicando que o modelo est√° acertando uma boa parte das predi√ß√µes, mas ainda tem espa√ßo para melhorar a precis√£o.