# 💬 TMCD – Final Project
## Sentiment Analysis on Movie Reviews

### 👥 Group Trab-grupo-30
- **Rafael Alexandre Dias Andorinha**, no. 131000  
- **Pedro Fonte Santa**, no. 105306  

---

📅 **Submission date:** April 26  

📊 **Purpose of this script:**
This notebook corresponds to Task 2.5 of the project.

The goal of this stage is to apply a pre-trained transformer-based model to sentiment analysis on the IMDB dataset. The task is split into two steps:

- **Step a)**: apply pre-trained models directly, using Hugging Face pipelines.
- **Step b)**: fine-tune a pre-trained model, adapting it to the project's data.

The model chosen for the first step is `distilbert-base-uncased-finetuned-sst-2-english`, and `distilbert-base-uncased` is used for fine-tuning. The implementation is based on the class notebook "BERT-based classification", adapted to this project.

---

# 🗂️ Dataset: IMDB Reviews

### 📘 Step a) – Direct pipeline with Hugging Face

In [9]:
!pip install transformers datasets evaluate --quiet
!pip install torch



In [7]:
from transformers import pipeline
import pandas as pd
from sklearn.metrics import classification_report
from tqdm import tqdm

In [2]:
df_test = pd.read_csv('../dataset/imdb_reviews_test.csv')

In [3]:
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    framework="pt"  # force the PyTorch backend
)

Device set to use cpu


In [None]:
texts = df_test['text'].tolist()
batch_size = 32
preds = []

# Process in batches with a progress bar
for i in tqdm(range(0, len(texts), batch_size)):
    batch = texts[i:i+batch_size]
    preds.extend(clf(batch, truncation=True))

df_test['pred_pipeline'] = [p['label'].lower() for p in preds]

100%|██████████| 688/688 [2:23:02<00:00, 12.47s/it]  
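The batching loop can be isolated into a small helper, which also makes the iteration count in the progress bar easy to check. This is a minimal sketch: `iter_batches` and `count_batches` are hypothetical names introduced here, not part of the notebook or of any library.

```python
import math
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def count_batches(n_items: int, batch_size: int) -> int:
    """Number of batches produced; the last one may be shorter."""
    return math.ceil(n_items / batch_size)


# 21,996 test reviews in batches of 32 give 688 iterations,
# which matches the 688/688 shown by the progress bar.
n_batches = count_batches(21996, 32)
```

Only the final batch is short (21,996 = 687 × 32 + 12), so the batch size has little effect on how many reviews each pipeline call receives.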


In [17]:
print(df_test['label'].unique())
print(df_test['pred_pipeline'].unique())

['pos' 'neg']
['positive' 'negative']


In [18]:
# Convert the pipeline predictions to the same format as the dataset labels
df_test['pred_pipeline'] = df_test['pred_pipeline'].map({'positive': 'pos', 'negative': 'neg'})
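One caveat with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN`, so running the mapping cell a second time would silently wipe the predictions. A stricter alternative is sketched below in plain Python. `normalize_labels` and `PIPELINE_TO_DATASET` are hypothetical names, and the label vocabulary is assumed from the outputs shown in this notebook.

```python
from typing import Dict, Iterable, List

# Assumed vocabulary: the SST-2 pipeline emits "POSITIVE"/"NEGATIVE",
# while the CSV labels are "pos"/"neg". Already-normalized values map to
# themselves so the function can be applied more than once.
PIPELINE_TO_DATASET: Dict[str, str] = {
    "positive": "pos",
    "negative": "neg",
    "pos": "pos",
    "neg": "neg",
}


def normalize_labels(raw_labels: Iterable[str]) -> List[str]:
    """Map pipeline labels to dataset labels, failing loudly on unknown values."""
    normalized = []
    for raw in raw_labels:
        key = raw.strip().lower()
        if key not in PIPELINE_TO_DATASET:
            raise ValueError(f"Unexpected pipeline label: {raw!r}")
        normalized.append(PIPELINE_TO_DATASET[key])
    return normalized


labels = normalize_labels(["POSITIVE", "negative", "pos"])
```

In the notebook this could be applied as `df_test['pred_pipeline'] = normalize_labels(df_test['pred_pipeline'])`, assuming the column holds strings.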


In [None]:
print(classification_report(
    df_test['label'],
    df_test['pred_pipeline'],
    target_names=['neg', 'pos']
))

              precision    recall  f1-score   support

         neg       0.89      0.92      0.90     11050
         pos       0.91      0.89      0.90     10946

    accuracy                           0.90     21996
   macro avg       0.90      0.90      0.90     21996
weighted avg       0.90      0.90      0.90     21996
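Each row of the report is derived from true-positive, false-positive, and false-negative counts for that class, with F1 being the harmonic mean of precision and recall. The sketch below reproduces that arithmetic in plain Python on a few invented labels; they are not taken from the IMDB run, and `binary_scores` is a hypothetical helper.

```python
from typing import Dict, Sequence


def binary_scores(y_true: Sequence[str], y_pred: Sequence[str], target: str) -> Dict[str, float]:
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up purely for illustration.
toy_true = ["pos", "pos", "neg", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "pos", "pos"]
scores_pos = binary_scores(toy_true, toy_pred, target="pos")
```

For the `pos` class in this toy data there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3.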



### ⚙️ Step b) – Fine-tuning with the Hugging Face Trainer

In [27]:
!pip install transformers datasets evaluate --quiet
!pip install --upgrade transformers




In [3]:
import transformers
print(transformers.__version__)

# Ideally this prints 4.38.2 or later; otherwise, transformers needs to be upgraded

4.51.3


In [2]:
import pandas as pd
import random
import numpy as np
import evaluate

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

2025-04-24 16:16:14.264653: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# Load the CSV files
df_train = pd.read_csv('../dataset/imdb_reviews_train.csv')
df_test = pd.read_csv('../dataset/imdb_reviews_test.csv')

# Map labels to 0 and 1 (only needed if they are not already in that form)
label_map = {'neg': 0, 'pos': 1}
df_train['label'] = df_train['label'].map(label_map)
df_test['label'] = df_test['label'].map(label_map)

# Convert to Hugging Face datasets
train_ds = Dataset.from_pandas(df_train[['text', 'label']])
test_ds = Dataset.from_pandas(df_test[['text', 'label']])

dataset = DatasetDict({'train': train_ds, 'test': test_ds})

In [5]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = dataset["train"].map(preprocess_function, batched=True)
tokenized_test = dataset["test"].map(preprocess_function, batched=True)

Map: 100%|██████████| 21754/21754 [00:23<00:00, 920.17 examples/s] 
Map: 100%|██████████| 21996/21996 [00:21<00:00, 1042.03 examples/s]


In [6]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
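Because the tokenizer is called with `truncation=True` but without padding, the sequences still have different lengths; `DataCollatorWithPadding` then pads each batch to the length of its longest member. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are illustrative, and the pad id of 0 is an assumption about the `distilbert-base-uncased` vocabulary.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0  # assumption: id of [PAD] in the distilbert-base-uncased vocabulary


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_TOKEN_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative sequences of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
```

Padding per batch rather than to a fixed global length keeps short batches small, which matters for long-tailed review lengths like those in IMDB.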


In [7]:
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "precision": precision.compute(predictions=predictions, references=labels, average='binary')["precision"],
        "recall": recall.compute(predictions=predictions, references=labels, average='binary')["recall"],
        "f1": f1.compute(predictions=predictions, references=labels, average='binary')["f1"]
    }
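`compute_metrics` takes the argmax of the raw logits without applying softmax first. That is safe because softmax is monotonic, so the largest logit always has the largest probability. The sketch below checks this in plain Python on invented logits; `softmax` and `argmax` are written out here for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Invented logits for two reviews; columns are (class 0 = neg, class 1 = pos).
example_logits = [[-1.2, 2.3], [0.7, -0.4]]
predictions = [argmax(row) for row in example_logits]
probabilities = [softmax(row) for row in example_logits]
```

Softmax is only needed when calibrated-looking scores are required, for example to rank reviews by confidence.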

In [8]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    # `evaluation_strategy` was renamed to `eval_strategy`; with the installed
    # version (4.51.3) the old name raises:
    # TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
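Recent transformers releases name this argument `eval_strategy`, while older ones use `evaluation_strategy`. If the notebook has to run on both, one option is to inspect the constructor signature and pick whichever name is accepted. The sketch below shows the idea against a stand-in function, so it runs without transformers installed; `pick_supported_kwarg` and `fake_training_arguments` are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, Sequence


def pick_supported_kwarg(func: Callable[..., Any], candidates: Sequence[str]) -> str:
    """Return the first candidate name that `func` accepts as a parameter."""
    params = inspect.signature(func).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"{func!r} accepts none of {list(candidates)}")


# Hypothetical stub standing in for the TrainingArguments constructor of a
# recent release; it only knows the new argument name.
def fake_training_arguments(output_dir: str, eval_strategy: str = "no") -> Dict[str, str]:
    return {"output_dir": output_dir, "eval_strategy": eval_strategy}


name = pick_supported_kwarg(fake_training_arguments, ["eval_strategy", "evaluation_strategy"])
kwargs = {"output_dir": "./results", name: "epoch"}
config = fake_training_arguments(**kwargs)
```

With the real library the same helper would presumably be called on `TrainingArguments.__init__`, on the assumption that its keyword arguments appear in the inspected signature. Pinning a single transformers version is the simpler fix when portability is not needed.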

### 🚀 Train the model

In [None]:
trainer.train()


In [None]:
results_finetuned = trainer.evaluate()
print(results_finetuned)