# Extracting Genre from Overview

## 1 - Libraries and Data

In [12]:
pip install transformers datasets torch scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
     -------------------------------------- 494.8/494.8 kB 3.1 MB/s eta 0:00:00
Collecting multiprocess<0.70.17
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
     ---------------------------------------- 134.8/134.8 kB ? eta 0:00:00
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
     -------------------------------------- 561.5/561.5 kB 8.9 MB/s eta 0:00:00
Collecting pyarrow>=15.0.0
  Downloading pyarrow-21.0.0-cp310-cp310-win_amd64.whl (26.2 MB)
     --------------------------------------- 26.2/26.2 MB 38.4 MB/s eta 0:00:00
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ---------------------------------------- 64.7/64.7 kB 3.4 MB/s eta 0:00:00
Collecting fsspec[http]<=2025.3.0,>=2023.1.0
  Downloading fsspec-2025.3.0-py3-none-any

ERROR: Could not install packages due to an OSError: [WinError 5] Acesso negado: 'C:\\Users\\User\\AppData\\Roaming\\Python\\Python310\\site-packages\\~yarrow\\arrow.dll'
Check the permissions.



In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments

# Load the movie data; the cells below rely on the 'Genre' and 'Overview' columns
df = pd.read_csv('data/df_split.csv')

## 2 - Preprocessing, Training, and Evaluation

In [19]:
# Use the first listed genre as the main genre
df['Main_Genre'] = df['Genre'].str.split(',').str[0]

# Filter out rare genres (fewer than 5 movies)
genre_counts = df['Main_Genre'].value_counts()
genres_to_keep = genre_counts[genre_counts >= 5].index
df_filtered = df[df['Main_Genre'].isin(genres_to_keep)].copy()
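
To make the filtering step concrete, here is a minimal sketch of the same logic in plain Python on a made-up list of genre strings. The sample values, the `min_count` name, and the extra `.strip()` call are illustrative assumptions, not part of the notebook. It takes the first comma-separated entry as the main genre and drops classes that occur fewer than `min_count` times.

```python
from collections import Counter

# Hypothetical genre strings in the same comma-separated format as df['Genre']
genres = ["Drama, Crime", "Drama", "Action, Adventure", "Drama, Romance", "Horror"]
min_count = 2  # the notebook uses 5; lowered here to suit the toy data

# The first comma-separated entry is treated as the main genre
# (.strip() is an added guard against stray spaces, not in the notebook code)
main_genres = [g.split(",")[0].strip() for g in genres]

# Keep only entries whose main genre appears at least min_count times
counts = Counter(main_genres)
kept = [g for g in main_genres if counts[g] >= min_count]

print(counts)
print(kept)
```

On this toy input only the three `Drama` entries survive, because `Action` and `Horror` each appear once.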

In [20]:
le = LabelEncoder()
df_filtered['genre_label'] = le.fit_transform(df_filtered['Main_Genre'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df_filtered['Overview'],
    df_filtered['genre_label'],
    test_size=0.2,
    random_state=42,
    stratify=df_filtered['genre_label']
)

In [21]:
# DistilBERT tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True, max_length=128)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development



In [22]:
class MovieDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = MovieDataset(train_encodings, list(y_train))
test_dataset = MovieDataset(test_encodings, list(y_test))

In [23]:
num_labels = len(df_filtered['genre_label'].unique())
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=num_labels
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [25]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,           # fewer epochs
    per_device_train_batch_size=8, # smaller batch
    per_device_eval_batch_size=8,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=50,
    learning_rate=5e-5,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
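
The cell numbering skips In [24], and this export never shows where the `trainer` object used in the next cell is created. The sketch below is an assumed reconstruction rather than the author's actual cell: `build_trainer` is a hypothetical helper that wraps the model, the training arguments, and the two datasets in a `transformers.Trainer`.

```python
def build_trainer(model, training_args, train_dataset, eval_dataset):
    """Assumed reconstruction of the missing cell: wrap the model in a Trainer."""
    # Imported inside the helper so it can be defined before transformers is loaded
    from transformers import Trainer

    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

# Usage in the notebook (not run here):
# trainer = build_trainer(model, training_args, train_dataset, test_dataset)
# trainer.train()
```

Calling `trainer.train()` before `trainer.predict(...)` is what fine-tunes the newly initialized classification head; without it, predictions come from untrained classifier weights.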


In [26]:
# `trainer` must already exist here; the cell that creates and trains it is not part of this export
preds = trainer.predict(test_dataset)
# Pick the highest-scoring class for each example
y_pred = preds.predictions.argmax(-1)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=le.classes_))

***** Running Prediction *****
  Num examples = 198
  Batch size = 16

Accuracy: 0.29292929292929293
              precision    recall  f1-score   support

      Action       0.00      0.00      0.00        35
   Adventure       0.00      0.00      0.00        14
   Animation       0.00      0.00      0.00        17
   Biography       0.00      0.00      0.00        18
      Comedy       0.00      0.00      0.00        31
       Crime       0.00      0.00      0.00        21
       Drama       0.29      1.00      0.45        58
      Horror       0.00      0.00      0.00         2
     Mystery       0.00      0.00      0.00         2

    accuracy                           0.29       198
   macro avg       0.03      0.11      0.05       198
weighted avg       0.09      0.29      0.13       198



  _warn_prf(average, modifier, msg_start, len(result))
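
The report shows recall 1.00 for Drama and 0.00 for every other class, which means the model predicted Drama for all 198 test examples. As a quick check, the sketch below recomputes the accuracy of a classifier that always predicts the most frequent class, using only the support counts from the report; the variable names are illustrative.

```python
# Support counts copied from the classification report above
support = {
    "Action": 35, "Adventure": 14, "Animation": 17, "Biography": 18,
    "Comedy": 31, "Crime": 21, "Drama": 58, "Horror": 2, "Mystery": 2,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the most frequent class
baseline_accuracy = support[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 4))
# prints: Drama 198 0.2929
```

The baseline matches the reported accuracy of 0.2929, so this run has not yet learned anything beyond the class prior. A likely contributor is the single training epoch combined with the class imbalance, though the export does not contain enough information to confirm the cause.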
