# Introduction

This notebook addresses a demand from the Government of the Federal District (Governo do Distrito Federal), which was presented through an Ideathon in 2026. It contains the Python code used for that purpose.

## 1. Initial Setup

### 1.1 Installing Dependencies


In [1]:
!pip install -q transformers accelerate torch scikit-learn pandas openpyxl


### 1.2 Importing the Libraries

In [2]:
import pandas as pd
import numpy as np
import torch

from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score, precision_score
from sklearn.utils.class_weight import compute_class_weight


### 1.3 Model Definition and Hyperparameters

In [3]:
MODEL_NAME = "neuralmind/bert-base-portuguese-cased"
MAX_LEN = 256
BATCH_SIZE = 8
EPOCHS = 3
LR = 2e-5
THRESHOLD = 0.45

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


Device: cpu


### 1.4 Loading the Data

In [6]:
df = pd.read_csv("/content/AMOSTRA_e-SIC.csv")

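# Rename the masked-text column; "label" already has the expected name.
df = df.rename(columns={"Texto Mascarado": "text"})

# Keep only the model inputs and drop rows with missing values.
df = df[["text", "label"]].dropna()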

df.head()

| | text | label |
|---|---|---|
| 0 | Solicito cópia do cadastro que preenchi virtua... | 1 |
| 1 | Gostaria de saber da defensoria se q irão impl... | 1 |
| 2 | Oi estou chateada o meu companheiro está estra... | 1 |
| 3 | Prezados senhores, boa tarde!\n\nSolicito aces... | 1 |
| 4 | Solicito acesso a um laudo de adicional de per... | 1 |


### 1.5 Custom Dataset

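A custom `Dataset` is required so that each text reaches BERT in the format it expects: tokenized, padded or truncated to `MAX_LEN` tokens, and returned as `input_ids`, `attention_mask`, and `labels` tensors.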

In [7]:
class TextDataset(Dataset):
    """Tokenizes each text on access and returns fixed-length tensors for BERT."""

    def __init__(self, texts, labels, tokenizer):
        self.texts = texts.tolist()
        self.labels = labels.tolist()
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            padding="max_length",
            truncation=True,
            max_length=MAX_LEN,
            return_tensors="pt"
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }
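
To make the effect of `padding="max_length"` and `truncation=True` concrete, the sketch below mimics them on plain lists of token ids. It is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` and `PAD_ID = 0` are hypothetical names and values chosen for this example, and a real BERT tokenizer also keeps its special tokens (such as `[SEP]`) in place when truncating.

```python
# Simplified illustration of fixed-length padding and truncation.
# PAD_ID and pad_or_truncate are hypothetical, chosen for this example only.
PAD_ID = 0


def pad_or_truncate(token_ids, max_len):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncation: drop the overflow
    n_pad = max_len - len(kept)                     # padding needed to reach max_len
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_len=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up with the same length, `DataLoader` can stack the items into a batch without a custom collate function.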


## 2. Building the Predictive Model

### 2.1 Training and Evaluation Function

In [8]:
def train_and_evaluate(train_texts, train_labels, val_texts, val_labels):
    """Fine-tune a fresh model on one fold; return validation probabilities and labels."""

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    train_dataset = TextDataset(train_texts, train_labels, tokenizer)
    val_dataset = TextDataset(val_texts, val_labels, tokenizer)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # Weight each class inversely to its frequency to counter class imbalance.
    class_weights = compute_class_weight(
        class_weight="balanced",
        classes=np.array([0, 1]),
        y=train_labels
    )

    class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2
    ).to(device)

    optimizer = AdamW(model.parameters(), lr=LR)
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights_tensor)

    # -------- TRAINING --------
    for epoch in range(EPOCHS):
        model.train()
        total_loss = 0

        for batch in train_loader:
            optimizer.zero_grad()

            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            loss = loss_fn(outputs.logits, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1} | Loss: {total_loss/len(train_loader):.4f}")

    # -------- EVALUATION --------
    model.eval()
    probs = []
    y_true = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            # Softmax over the two logits; column 1 is the positive-class probability.
            prob = torch.softmax(outputs.logits, dim=1)[:, 1]
            probs.extend(prob.cpu().numpy())
            y_true.extend(batch["labels"].numpy())

    return np.array(probs), np.array(y_true)
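
The `class_weight="balanced"` option used in `train_and_evaluate` weights each class by `n_samples / (n_classes * count)`, so the rarer class contributes more to the loss. The sketch below reproduces that formula in plain Python. The helper `balanced_weights` and the toy labels are made up for illustration and are not taken from the e-SIC sample.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy labels (hypothetical): 6 examples of class 0 and 2 of class 1.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_weights(toy_labels)
print(weights)
```

In this toy case the minority class receives a weight three times larger than the majority class, which mirrors the 6-to-2 imbalance in the labels.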


### 2.2 Cross-Validation

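Cross-validation is a technique for checking that the model has not simply memorized the training data and that its performance holds up on data it has not seen. Here the data is split into five stratified folds, and each fold serves once as the validation set.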
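
As a rough picture of what stratification means, the sketch below splits a toy label list into folds while keeping the class proportions equal in each one. It is a simplified stand-in written for this explanation, not scikit-learn's `StratifiedKFold` algorithm; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Yield (train_idx, val_idx) pairs with class balance kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx


# Toy labels (hypothetical): 10 negatives and 5 positives.
toy_labels = [0] * 10 + [1] * 5
splits = list(stratified_folds(toy_labels, n_splits=5))

for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    positives = sum(toy_labels[i] for i in val_idx)
    print(f"Fold {fold_number}: {len(val_idx)} validation items, {positives} positive")
```

Every example is validated exactly once and used for training in the remaining folds, which is the property the per-fold metrics rely on.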

In [9]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

recalls = []
precisions = []

fold = 1

for train_idx, val_idx in skf.split(df["text"], df["label"]):

    print("\n======================")
    print(f"🔁 Fold {fold}")
    print("======================")

    train_texts = df.iloc[train_idx]["text"]
    train_labels = df.iloc[train_idx]["label"]

    val_texts = df.iloc[val_idx]["text"]
    val_labels = df.iloc[val_idx]["label"]

    probs, y_true = train_and_evaluate(
        train_texts, train_labels,
        val_texts, val_labels
    )

    # Predict the positive class when its probability reaches THRESHOLD.
    y_pred = (probs >= THRESHOLD).astype(int)

    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)

    print(f"Recall: {recall:.3f}")
    print(f"Precision: {precision:.3f}")

    recalls.append(recall)
    precisions.append(precision)

    fold += 1
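
The step from probabilities to metrics can be traced by hand. The sketch below applies the notebook's `THRESHOLD = 0.45` to a few made-up probabilities and computes recall and precision from the confusion counts. The probabilities, labels, and the helper name `recall_precision` are hypothetical and serve only as an illustration.

```python
THRESHOLD = 0.45  # same value as in section 1.3


def recall_precision(y_true, y_pred):
    """Compute recall and precision for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up positive-class probabilities and true labels.
toy_probs = [0.90, 0.47, 0.30, 0.60, 0.10]
toy_true = [1, 1, 1, 0, 0]

toy_pred = [int(p >= THRESHOLD) for p in toy_probs]
toy_recall, toy_precision = recall_precision(toy_true, toy_pred)
print(toy_pred, toy_recall, toy_precision)
```

A threshold below 0.5 flags more items as positive, which tends to raise recall at the cost of precision: the second probability (0.47) counts as positive here but would not at 0.5.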



🔁 Fold 1



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 | Loss: 0.6977
Epoch 2 | Loss: 0.5622
Epoch 3 | Loss: 0.3993
Recall: 0.938
Precision: 1.000

🔁 Fold 2



Epoch 1 | Loss: 0.6645
Epoch 2 | Loss: 0.5058
Epoch 3 | Loss: 0.3676
Recall: 0.800
Precision: 0.750

🔁 Fold 3



Epoch 1 | Loss: 0.6935
Epoch 2 | Loss: 0.5547
Epoch 3 | Loss: 0.4204
Recall: 0.867
Precision: 0.867

🔁 Fold 4



Epoch 1 | Loss: 0.6669
Epoch 2 | Loss: 0.5439
Epoch 3 | Loss: 0.3872
Recall: 0.867
Precision: 0.929

🔁 Fold 5



Epoch 1 | Loss: 0.7623
Epoch 2 | Loss: 0.5040
Epoch 3 | Loss: 0.4027
Recall: 1.000
Precision: 1.000


## 3. Final Metrics

The per-fold recall and precision values are aggregated into means, with the standard deviation reported for recall.

In [11]:
print("\n📊 Final result after cross-validation:")

print(f"Mean recall: {np.mean(recalls):.3f}")
print(f"Recall std: {np.std(recalls):.3f}")

print(f"Mean precision: {np.mean(precisions):.3f}")



📊 Final result after cross-validation:
Mean recall: 0.894
Recall std: 0.068
Mean precision: 0.909
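
The summary above reports a spread only for recall. As a sketch, the same statistics can be recomputed for both metrics from the per-fold values printed in section 2.2. The inputs here are the three-decimal values copied from that output, so the last digit may differ from the notebook's own figures. `np.std` defaults to the population standard deviation, which corresponds to `statistics.pstdev`.

```python
import statistics

# Per-fold values copied from the cross-validation output (rounded to 3 decimals).
fold_recalls = [0.938, 0.800, 0.867, 0.867, 1.000]
fold_precisions = [1.000, 0.750, 0.867, 0.929, 1.000]

summary = {}
for name, values in [("Recall", fold_recalls), ("Precision", fold_precisions)]:
    mean = statistics.fmean(values)
    population_std = statistics.pstdev(values)  # same convention as np.std (ddof=0)
    sample_std = statistics.stdev(values)       # ddof=1, slightly larger
    summary[name] = (mean, population_std, sample_std)
    print(f"{name}: mean={mean:.3f} pstdev={population_std:.3f} stdev={sample_std:.3f}")
```

With only five folds, the sample standard deviation is noticeably larger than the population one, so it is worth stating which convention a reported spread uses.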
