# Explanation of the Email Spam Classification Code

## Introduction

This code implements a spam email detection system using three popular transformer models: BERT, DistilBERT, and RoBERTa. Each model is fine-tuned on the Enron dataset and evaluated either on a held-out Enron split or, in cross-dataset mode, on a separate email dataset.

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import os
from datasets import Dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    RobertaTokenizer,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, classification_report,
    f1_score, recall_score, confusion_matrix
)


We use **PyTorch** as the backend framework, the Hugging Face **Transformers** library for access to pre-trained models, and the **datasets** library for efficient data handling.


In [None]:
MODEL_CONFIGS = {
    'bert': {
        'name': 'bert-base-uncased',
        'tokenizer': BertTokenizer,
        'model': BertForSequenceClassification
    },
    'distilbert': {
        'name': 'distilbert-base-uncased',
        'tokenizer': DistilBertTokenizer,
        'model': DistilBertForSequenceClassification
    },
    'roberta': {
        'name': 'roberta-base',
        'tokenizer': RobertaTokenizer,
        'model': RobertaForSequenceClassification
    }
}

We define three models:
- **BERT** - the original bidirectional encoder model that the other two build on
- **DistilBERT** - a lighter, distilled version of BERT (faster, uses fewer resources)
- **RoBERTa** - a version of BERT with an optimized pre-training procedure

### Dataset preparation function

In [None]:
def prepare_datasets(X_train, y_train, X_val, y_val, X_test, y_test):


Converts the NumPy arrays into Hugging Face `Dataset` objects, the format that the tokenization and `Trainer` steps work with.

### Tokenization

In [None]:
def preprocess_function(examples, tokenizer, max_length=128):


**Tokenization** converts text into numeric token ids that the model can process. We set:
- `max_length=128` - the maximum sequence length
- `padding='max_length'` - pads shorter texts up to that length
- `truncation=True` - truncates longer texts to that length
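To make the effect of these settings concrete, here is a minimal pure-Python sketch of truncation and padding to a fixed length. It does not call the Hugging Face tokenizer: `toy_encode`, the whitespace split, the padding id, and the tiny vocabulary are all illustrative assumptions. Real tokenizers use subword vocabularies and add special tokens.

```python
# Illustrative sketch only: a toy whitespace "tokenizer" (hypothetical),
# showing what padding='max_length' and truncation=True do to sequence length.
PAD_ID = 0  # assumed padding id for this toy example


def toy_encode(text, vocab, max_length):
    """Map words to ids, then truncate or pad to exactly max_length."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    ids = ids[:max_length]              # truncation: cut inputs that are too long
    attention_mask = [1] * len(ids)     # 1 marks a real token
    pad = max_length - len(ids)
    ids = ids + [PAD_ID] * pad          # padding: fill inputs that are too short
    attention_mask = attention_mask + [0] * pad  # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


vocab = {}
short_example = toy_encode("win a free prize", vocab, max_length=6)
long_example = toy_encode(
    "please review the attached quarterly report before friday", vocab, max_length=6
)

print(short_example)
print(long_example)
```

Both outputs have exactly six positions: the short text is filled with padding ids whose mask value is 0, and the long text loses its last two words.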

### Model evaluation metrics

In [None]:
def compute_metrics(eval_pred):


We compute the key metrics:
- **Accuracy** - the overall fraction of correct predictions
- **F1 Score** - the harmonic mean of precision and recall
- **Precision** - the fraction of emails flagged as spam that really are spam
- **Recall** - the fraction of actual spam emails that were detected
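These definitions can be checked by hand on a small example. The sketch below computes the four metrics from raw logits in plain Python, treating spam (label 1) as the positive class. The function name `binary_metrics` and the sample values are made up for illustration; the actual body of `compute_metrics` is not shown in this document and may differ.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 with class 1 (spam) as positive."""
    # Predicted class is the index of the larger of the two logits (argmax)
    preds = [0 if row[0] >= row[1] else 1 for row in logits]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # spam caught
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # ham flagged as spam
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # spam missed
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Five made-up emails: two-class logits and their true labels (0 = ham, 1 = spam)
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]]
labels = [0, 1, 1, 1, 0]
result = binary_metrics(logits, labels)
print(result)
```

In this sample, one spam email is missed and one ham email is flagged, so precision and recall are both 2/3 while accuracy is 3/5.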

### Model training
This function:
- Splits the data into train/val/test sets
- Tokenizes the data
- Initializes the model and trains it with the Hugging Face `Trainer` class
- Evaluates performance and returns the key results
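The two split modes in `train_with_trainer` produce different proportions. In the single-dataset branch, the second split takes 17.5% of what remains after the test split, not 17.5% of the whole dataset. The short calculation below works out the resulting shares; the helper name `split_fractions` is hypothetical and only restates the arithmetic.

```python
def split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) shares of the full dataset for two chained splits."""
    test = test_size
    remainder = 1.0 - test_size
    val = remainder * val_size_of_remainder  # taken from what is left after the test split
    train = remainder - val
    return train, val, test


# Single-dataset mode: 15% test, then 17.5% of the remaining 85% for validation
train, val, test = split_fractions(0.15, 0.175)
print(f"train={train:.3f} val={val:.3f} test={test:.3f}")

# Cross-dataset mode: a single 85/15 train/validation split of the training data;
# the test set comes entirely from the second dataset
cross_train, cross_val = 1.0 - 0.15, 0.15
print(f"cross-dataset: train={cross_train:.2f} val={cross_val:.2f}")
```

The single-dataset mode therefore ends up close to a 70/15/15 split, while the cross-dataset mode keeps 85% of the first dataset for training.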

In [3]:
def train_with_trainer(X_train, y_train, X_test, y_test, model_type='bert', use_cross_dataset=True):
    config = MODEL_CONFIGS[model_type]
    MODEL_NAME = config['name']
    TokenizerClass = config['tokenizer']
    ModelClass = config['model']

    MAX_LENGTH = 128
    BATCH_SIZE = 16
    EPOCHS = 3
    LEARNING_RATE = 2e-5
    WEIGHT_DECAY = 0.01

    if use_cross_dataset:
        X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
            X_train, y_train, test_size=0.15, stratify=y_train, random_state=42
        )
        X_test_final, y_test_final = X_test, y_test
    else:
        X_temp, X_test_final, y_temp, y_test_final = train_test_split(
            X_train, y_train, test_size=0.15, stratify=y_train, random_state=42
        )
        X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
            X_temp, y_temp, test_size=0.175, stratify=y_temp, random_state=42
        )

    tokenizer = TokenizerClass.from_pretrained(MODEL_NAME)

    train_dataset, val_dataset, test_dataset = prepare_datasets(
        X_train_split, y_train_split,
        X_val_split, y_val_split,
        X_test_final, y_test_final
    )

    train_dataset = train_dataset.map(
        lambda x: preprocess_function(x, tokenizer, MAX_LENGTH),
        batched=True
    )
    val_dataset = val_dataset.map(
        lambda x: preprocess_function(x, tokenizer, MAX_LENGTH),
        batched=True
    )
    test_dataset = test_dataset.map(
        lambda x: preprocess_function(x, tokenizer, MAX_LENGTH),
        batched=True
    )

    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])


### Tokenization output:
- `input_ids` - an array of numbers that represent the tokens
- `attention_mask` - an array marking which tokens the model should process (1) and which it should ignore (0)
- `label` - the true class (0 for ham, 1 for spam)

### Training
In this part we initialize the model, define the training parameters, and start the training run. After training, the model is evaluated on the validation and test sets and saved to the `./fine_tuned_{model_type}_spam_classifier` folder.
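The training arguments in this section combine `learning_rate=2e-5` with `warmup_steps=500`. Under the linear schedule that `Trainer` uses by default, the learning rate rises from 0 to its peak over the warmup steps and then decays linearly to 0 by the last step. The sketch below reproduces that shape in plain Python. The function name is hypothetical, and `total_steps=5000` is an assumed value for illustration, since the real number depends on dataset size, batch size, and epoch count.

```python
def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=5000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step count
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down proportionally to the steps that remain
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, linear_warmup_lr(step))
```

If a run has fewer total steps than `warmup_steps`, the learning rate never reaches its peak under this schedule, which is worth checking when training on a small dataset.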


In [None]:
    model = ModelClass.from_pretrained(MODEL_NAME, num_labels=2)

    training_args = TrainingArguments(
        output_dir='./results',
        eval_strategy='epoch',
        save_strategy='epoch',
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=EPOCHS,
        weight_decay=WEIGHT_DECAY,
        load_best_model_at_end=True,
        metric_for_best_model='f1',
        greater_is_better=True,
        save_total_limit=2,
        warmup_steps=500,
        fp16=torch.cuda.is_available(),
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
    )

    trainer.train()

    val_results = trainer.evaluate(val_dataset)
    test_predictions = trainer.predict(test_dataset)
    predicted_labels = np.argmax(test_predictions.predictions, axis=1)
    true_labels = test_dataset['label']

    report = classification_report(
        true_labels, predicted_labels,
        target_names=['ham', 'spam'], zero_division=0
    )
    acc = accuracy_score(true_labels, predicted_labels)
    prec, rec, f1, _ = precision_recall_fscore_support(
        true_labels, predicted_labels, average='binary', pos_label=1, zero_division=0
    )
    cm = confusion_matrix(true_labels, predicted_labels)

    model_save_path = f'./fine_tuned_{model_type}_spam_classifier'
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)

    return {
        "model": model,
        "tokenizer": tokenizer,
        "trainer": trainer,
        "val_results": val_results,
        "test_report": report,
        "test_metrics": {
            "accuracy": acc,
            "precision": prec,
            "recall": rec,
            "f1": f1,
            "confusion_matrix": cm
        }
    }
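One interaction in this configuration is worth spelling out: with `EPOCHS = 3`, evaluation once per epoch, and `early_stopping_patience=2`, early stopping has little room to act. The sketch below is a simplified model of patience-based stopping, not the library's implementation. The helper `epochs_run` and the F1 values are hypothetical; it counts how many epochs would complete for a given sequence of validation F1 scores.

```python
def epochs_run(f1_per_epoch, patience=2):
    """Number of epochs completed before patience-based stopping triggers."""
    best = None
    bad_epochs = 0  # consecutive evaluations without improvement
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        if best is None or f1 > best:
            best = f1
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return epoch  # stop after this epoch
    return len(f1_per_epoch)


# Worst case for a 3-epoch run: the first epoch is the best one
three_epoch_run = epochs_run([0.97, 0.96, 0.95])

# The same start with a longer schedule: stopping now saves two epochs
five_epoch_run = epochs_run([0.97, 0.96, 0.95, 0.98, 0.99])

print(three_epoch_run, five_epoch_run)
```

Under this simplified model, a three-epoch run always completes all three epochs, so the callback would only shorten training if `EPOCHS` were raised. `load_best_model_at_end=True` still matters in either case, because it restores the checkpoint with the best validation F1.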

### Running training and displaying results
In this cell we train the model and display the key metrics (accuracy, precision, recall, F1, confusion matrix).


In [None]:
def main():
    df_enron = pd.read_csv("/content/drive/MyDrive/EmailDatasets/enron_mails.csv").dropna(subset=['Message'])
    X_enron = df_enron['Message'].values
    y_enron = df_enron['Spam/Ham'].map({'ham': 0, 'spam': 1}).values

    # dropna (not drop) removes rows with a missing 'text' value
    df_venky = pd.read_csv("/content/drive/MyDrive/EmailDatasets/venky_spam_ham_dataset.csv").dropna(subset=['text'])
    X_venky = df_venky['text'].values
    y_venky = df_venky['label'].map({'ham': 0, 'spam': 1}).values

    SELECTED_MODEL = 'distilbert'
    USE_CROSS_DATASET = True

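    # train_with_trainer returns a dict, so read the results by key
    results = train_with_trainer(
        X_enron, y_enron,
        X_venky, y_venky,
        model_type=SELECTED_MODEL,
        use_cross_dataset=USE_CROSS_DATASET
    )

    # Display the key test metrics
    print(results["test_report"])
    print(results["test_metrics"])

    return results["model"], results["tokenizer"], results["trainer"]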