# LLM - Classical text classification methods - LAB

# Task

Adapt the code from the notebook *LLM - Classical text classification methods - Overview* to the problem of predicting the number of stars for reviews from Yelp.
You can write the training loop either in plain PyTorch or with the PyTorch Lightning library.

* Use the `Yelp/yelp_review_full` dataset ([link](https://huggingface.co/datasets/Yelp/yelp_review_full)), which contains Yelp reviews (column: `text`) and a label (column: `label`) with values $0,1,2,3,4$ giving the number of stars awarded by the user (more precisely, the number of stars minus one). Since there are five classes, the last linear layer of the network must output five values.
    * Following good practice, split off an additional validation part from the training part.
    * (optional) Limit the size of each part of the dataset (training, validation, and test). The training part should contain no more than 100k examples.
* To extract features from text, use **TF-IDF** (*term frequency-inverse document frequency*), which is based on the bag-of-words approach. Use the `TfidfVectorizer` class from the `scikit-learn` library.
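
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes plain whitespace tokenisation (simpler than what `TfidfVectorizer` does) and uses the smoothed IDF formula and L2 row normalisation that scikit-learn documents for its default settings. The helper name `tfidf_rows` and the toy reviews are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights.

    Assumptions: whitespace tokenisation, raw term counts as TF, and the
    smoothed IDF idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


toy_docs = ["great food great service", "terrible food", "great place"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy review, the rarer word `terrible` receives a larger weight than `food`, which appears in two of the three documents.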


## Steps to complete

1.   Write a function that finds and displays the $k$ test-set examples on which the model is most wrong, i.e. those with the lowest predicted probability for the true class. For a single example, softmax is strictly increasing in each logit, so as a simple proxy you can look for the examples with the smallest unnormalized network output (logit) for the true class. Note that this ranking can differ from the ranking by probability, because the probability also depends on the other logits of the example.
2.   Examine how selected parameters of the `TfidfVectorizer` feature extractor affect the performance of the trained model. Run several experiments with different parameter values and compare the accuracy of the trained model on the validation set.
3.   Examine how selected hyperparameters of the model (e.g. number of linear layers, layer sizes) and of the training process (e.g. initial learning rate, number of epochs, type and parameters of the learning-rate scheduler, type and parameters of the optimizer) affect the performance of the trained model. Run several experiments with different hyperparameter values and compare the accuracy of the trained model on the validation set. Then run the final evaluation of the best model on the test set.


Importing libraries.

In [1]:
import torch
import torch.nn as nn
import numpy as np
import heapq
import pandas as pd
import matplotlib.pyplot as plt
from itertools import product
from torch.utils.data import DataLoader, TensorDataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from datasets import load_dataset
from sklearn.metrics import (
    f1_score,
    recall_score,
    accuracy_score,
    precision_score,
    confusion_matrix,
)

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.9.0+cu128




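Checking GPU availability.

In [2]:
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # get_device_name raises if no CUDA device is present, so guard the call
    print(f"GPU type: {torch.cuda.get_device_name(0)}")

GPU available: True
GPU type: NVIDIA GeForce RTX 5070 Ti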


In [3]:
import wandb

# Log in to Weights & Biases, which tracks the experiment runs.
# The API key is read from the WANDB_API_KEY environment variable (or an
# interactive prompt) rather than being hard-coded in the notebook.
wandb.login()

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/atarsander/.netrc
[34m[1mwandb[0m: Currently logged in as: [33matarsander[0m ([33matarsander-warsaw-university-of-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

# Solution

In [4]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 128
NUM_WORKERS = 8

In [5]:
dataset_train = load_dataset("Yelp/yelp_review_full", split="train[:100000]")
dataset_test = load_dataset("Yelp/yelp_review_full", split="test[:20000]")

In [6]:
ds = dataset_train.train_test_split(test_size=0.15)
dataset_train, dataset_val = ds["train"], ds["test"]

In [7]:
print(f"Train dataset size: {len(dataset_train)}")
print(f"Validation dataset size: {len(dataset_val)}")
print(f"Test dataset size: {len(dataset_test)}")

Train dataset size: 85000
Validation dataset size: 15000
Test dataset size: 20000


In [8]:
class MLP(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for layer in layers:
            if "dropout" in layer:
                self.layers.append(nn.Dropout(layer["dropout"]))
            if "linear" in layer:
                self.layers.append(nn.Linear(*layer["linear"]))
            if "batch_norm" in layer:
                self.layers.append(nn.BatchNorm1d(layer["batch_norm"]))
            if "relu" in layer:
                self.layers.append(nn.ReLU())

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

Because the TF-IDF matrix is sparse and very high-dimensional, I apply truncated SVD to reduce the number of features and to extract the most important latent semantic relationships between words. This should give dense, lower-dimensional text representations that are more informative for the downstream MLP models.
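
The projection step can be sketched with plain NumPy. This is only an illustration on a small random dense matrix that stands in for the TF-IDF features; `TruncatedSVD` works on sparse input and uses a randomized solver by default, so it will not go through an exact full SVD as done here. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((6, 5))  # stand-in for the training TF-IDF matrix
X_new = rng.random((2, 5))    # stand-in for validation/test rows
k = 2                         # number of components to keep

# Thin SVD: X_train = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(X_train, full_matrices=False)
components = Vt[:k]  # top-k right singular vectors, shape (k, n_features)

# Reduced features: project rows onto the top-k directions.
Z_train = X_train @ components.T
# New data must use the SAME projection that was fitted on the training set.
Z_new = X_new @ components.T

# Rank-k reconstruction and its error (what the dropped components carried).
X_approx = Z_train @ components
error = np.linalg.norm(X_train - X_approx)
print(Z_train.shape, Z_new.shape, round(float(error), 4))
```

The key point for the pipeline is that the components are fitted on the training part only and then reused unchanged for the validation and test parts.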

In [9]:
def make_loaders_from_text(vectorizer_params, svd_components, batch_size, num_workers=2, pin_memory=True):
    vectorizer = TfidfVectorizer(**vectorizer_params)
    X_train_tf = vectorizer.fit_transform(dataset_train["text"])
    X_val_tf   = vectorizer.transform(dataset_val["text"])
    X_test_tf  = vectorizer.transform(dataset_test["text"])

    svd = TruncatedSVD(n_components=svd_components)
    X_train = svd.fit_transform(X_train_tf)
    X_val   = svd.transform(X_val_tf)
    X_test  = svd.transform(X_test_tf)

    y_train = np.asarray(dataset_train["label"])
    y_val   = np.asarray(dataset_val["label"])
    y_test  = np.asarray(dataset_test["label"])

    train_ds = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
    val_ds   = TensorDataset(torch.tensor(X_val),   torch.tensor(y_val))
    test_ds  = TensorDataset(torch.tensor(X_test),  torch.tensor(y_test))

    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True,  num_workers=num_workers, pin_memory=pin_memory)
    val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)
    test_loader  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)
    return vectorizer, svd, (train_loader, val_loader, test_loader)

In [10]:
INPUT_SIZE = 300
vectorizer, svd, (train_loader, val_loader, test_loader) = make_loaders_from_text(
    vectorizer_params={
        "max_features": 30000,
        "min_df": 2,
        "max_df": 0.95,
    },
    svd_components=INPUT_SIZE,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    pin_memory=True,
)

In [11]:
values, counts = np.unique(dataset_train["label"], return_counts=True)
label_counts = dict(zip(values, counts))
label_counts

{0: 19457, 1: 17390, 2: 16811, 3: 16355, 4: 14987}

The classes are not severely imbalanced, but to account for the differences in class sizes I will use the macro-averaged **f1_score** as the main metric when evaluating the model.
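
As a reminder of what macro averaging does, the sketch below computes per-class F1 by hand and then takes the unweighted mean, so every class counts equally regardless of its size. The function name `macro_f1` and the toy labels are made up for illustration; in the notebook itself the value comes from `sklearn.metrics.f1_score(..., average="macro")`.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Return (macro-averaged F1, list of per-class F1 scores)."""
    per_class = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class.append(2 * precision * recall / denom if denom else 0.0)
    # Unweighted mean: each class contributes equally.
    return sum(per_class) / num_classes, per_class


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score, per_class = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4), [round(f, 4) for f in per_class])
```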

In [12]:
def setup_wandb(project, group, run_name, config):
    run = wandb.init(
        project=project,
        group=group,
        name=run_name,
        config=config,
        job_type="train",
        reinit="finish_previous",
    )
    wandb.define_metric("epoch")
    wandb.define_metric("train/*", step_metric="epoch")
    wandb.define_metric("val/*", step_metric="epoch")
    return run

In [13]:
def evaluate(model, loader, device):
    model.eval()
    y_pred = []
    y_true = []
    with torch.inference_mode():
        for X, y in loader:
            X, y = X.to(dtype=torch.float32, device=device), y.to(device)
            logits = model(X)
            y_pred.extend(logits.argmax(dim=1).detach().cpu().numpy().tolist())
            y_true.extend(y.detach().cpu().numpy().tolist())
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision_macro": float(precision_score(y_true, y_pred, average="macro")),
        "recall_macro": float(recall_score(y_true, y_pred, average="macro")),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "f1_score": float(f1_score(y_true, y_pred, average="macro")),
        "y_true": y_true,
        "y_pred": y_pred,
    }

In [14]:
def train(
    train_loader,
    val_loader,
    model,
    epochs,
    optim_type,
    optim_params,
    criterion,
    device,
    project,
    group,
    run_name,
    config,
):
    run = setup_wandb(project, group, run_name, config)
    model = model.to(device)
    optimizer = optim_type(model.parameters(), *optim_params)
    losses = []
    wandb.watch(model, log="gradients", log_freq=200)
    for epoch in range(epochs):
        model.train()
        avg_loss = 0
        total_count = 0
        for X, y in train_loader:
            X, y = X.to(dtype=torch.float32, device=device), y.to(device)
            optimizer.zero_grad()
            logits = model(X)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()
            avg_loss += y.size(0) * loss.item()
            total_count += y.size(0)
        losses.append(avg_loss / total_count)

        val_metrics = evaluate(model, val_loader, device)
        wandb.log(
            {
                "epoch": epoch,
                "train/loss": avg_loss / total_count,
                "val/accuracy": val_metrics["accuracy"],
                "val/precision": val_metrics["precision_macro"],
                "val/recall": val_metrics["recall_macro"],
                "val/f1_score": val_metrics["f1_score"],
            }
        )
        try:
            wandb.log(
                {
                    "val/confusion_matrix": wandb.plot.confusion_matrix(
                        preds=val_metrics["y_pred"].astype(int).tolist(),
                        y_true=val_metrics["y_true"].astype(int).tolist(),
                    )
                }
            )
        except Exception:
            ...
        run.summary["train/losses"] = losses
        run.summary["val/accuracy"] = val_metrics["accuracy"]
        run.summary["val/f1_score"] = val_metrics["f1_score"]

    return losses, run

In [15]:
def count_params(model):
    return sum(p.numel() for p in model.parameters())

In [17]:
def tune_architecture(
    model_class,
    train_loader,
    val_loader,
    architectures,
    training_setup,
    device,
    project_name="LLM_lab1",
    group_name="exp_01",
    best_metric="f1_score",
):
    results = []
    best = {}
    best_metric_value = -1
    for i, architecture in enumerate(architectures):
        model = model_class(architecture).to(device)
        run_name = f"Architecture class: {getattr(model_class, '__name__', None)} | setup: {str(i)}"
        print(f"[W&B] Training model {i}")
        config = dict(
            epochs=training_setup["epochs"],
            optim_type=training_setup["optim_type"],
            optim_params=training_setup["optim_params"],
            criterion=str(training_setup["criterion"]),
            batch_size=getattr(train_loader, "batch_size", None),
            device=str(device),
        )
        loss, run = train(
            train_loader,
            val_loader,
            model,
            training_setup["epochs"],
            training_setup["optim_type"],
            training_setup["optim_params"],
            training_setup["criterion"],
            device=device,
            project=project_name,
            group=group_name,
            run_name=run_name,
            config=config,
        )
        print("[W&B] Training done")

        train_metrics = evaluate(model, train_loader, device)
        val_metrics = evaluate(model, val_loader, device)

        wandb.log(
            {
                "train/accuracy": train_metrics["accuracy"],
                "train/precision_macro": train_metrics["precision_macro"],
                "train/recall_macro": train_metrics["recall_macro"],
                "train/f1_score": train_metrics["f1_score"],
            }
        )
        run.summary["train/accuracy"] = train_metrics["accuracy"]
        run.summary["train/f1_score"] = train_metrics["f1_score"]

        record = {
            "train/loss": loss,
            "train": train_metrics,
            "val": val_metrics,
        }

        results.append(record)

        if val_metrics[best_metric] > best_metric_value:
            best = {run_name: val_metrics}
            best_metric_value = val_metrics[best_metric]
        wandb.finish()
    return results, best

## 3. Comparing different architectures

In [18]:
architectures = [
    [
        {"linear": (INPUT_SIZE, 256), "relu": True},
        {"linear": (256, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 128), "relu": True},
        {"linear": (128, 64), "relu": True},
        {"linear": (64, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 64), "relu": True},
        {"linear": (64, 32), "relu": True},
        {"linear": (32, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 256), "relu": True},
        {"linear": (256, 128), "relu": True},
        {"linear": (128, 64), "relu": True},
        {"linear": (64, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 128), "relu": True},
        {"dropout": 0.3, "linear": (128, 64), "relu": True},
        {"dropout": 0.3, "linear": (64, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 64), "relu": True},
        {"dropout": 0.3, "linear": (64, 32), "relu": True},
        {"dropout": 0.3, "linear": (32, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 768), "batch_norm": 768, "relu": True},
        {"dropout": 0.2, "linear": (768, 384), "batch_norm": 384, "relu": True},
        {"dropout": 0.2, "linear": (384, 128), "batch_norm": 128, "relu": True},
        {"dropout": 0.3, "linear": (128, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 512), "batch_norm": 512, "relu": True},
        {"dropout": 0.2, "linear": (512, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 1024), "relu": True},
        {"dropout": 0.1, "linear": (1024, 5)},
    ],
    [
        {"linear": (INPUT_SIZE, 512), "batch_norm": 512, "relu": True},
        {"dropout": 0.5, "linear": (512, 256), "batch_norm": 256, "relu": True},
        {"dropout": 0.5, "linear": (256, 5)},
    ],
]

In [19]:
training_setup = {
    "epochs": 25,
    "optim_type": torch.optim.Adam,
    "optim_params": [3e-4],
    "criterion": nn.CrossEntropyLoss(),
}

In [20]:
results, best = tune_architecture(
    MLP, train_loader, val_loader, architectures, training_setup, device=DEVICE
)

[W&B] Training model 0


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▄▆▆▇█▇█▇▆▇▇▇▇█▅▆▆▆▇▇▆▇▇█
val/f1_score,▁▆▇▇▇█▇▇▆▆▇▆▆▇▇▅▆▅▆▇▇▆▇▇▇
val/precision,▁▅▇▇▇█▇▇▆▆▇▆▆▇█▄▆▄▆▆▆▆▇█▇
val/recall,▁▅▆▆▇█▇▇▇▆▇▇▇▇▇▆▆▅▆▇▇▆▇▇▇

0,1
epoch,24.0
train/accuracy,0.59387
train/f1_score,0.58657
train/loss,0.95532
train/precision_macro,0.58624
train/recall_macro,0.58945
val/accuracy,0.569
val/f1_score,0.55981
val/precision,0.55945
val/recall,0.56308


[W&B] Training model 1


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▄▄▄▅▅▆▆▆▆▆▆▇▆▆▆▇▇█▇█████
val/f1_score,▁▄▄▅▄▆▇▅▅▅▇▆▇▆▅▅▆▇█▇▇████
val/precision,▁▄▄▅▅▆▇▅▅▅▇▅█▆▅▅▆▇█▇▇█▇██
val/recall,▁▄▄▅▅▅▆▅▆▆▆▆▆▆▆▅▆▇█▇▇████

0,1
epoch,24.0
train/accuracy,0.60581
train/f1_score,0.59971
train/loss,0.92315
train/precision_macro,0.5989
train/recall_macro,0.6014
val/accuracy,0.57487
val/f1_score,0.56791
val/precision,0.56727
val/recall,0.56927


[W&B] Training model 2


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/accuracy,▁
train/f1_score,▁
train/loss,█▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▅▆▆▆▇▇▆▇▇▇▇▇█▇███▇██████
val/f1_score,▁▆▆▆▆▆▇▆▇▇▇▇▇▇▇▇██▇█▇█▇▇▇
val/precision,▁▆▆▅▅▆▇▅▇▇▆▆▆▇▇▇██▆▇▇█▇▆▇
val/recall,▁▆▆▆▆▇▇▇▇▇▇▇▇█▇███▇███▇▇▇

0,1
epoch,24.0
train/accuracy,0.57975
train/f1_score,0.57174
train/loss,0.97641
train/precision_macro,0.57133
train/recall_macro,0.57445
val/accuracy,0.56747
val/f1_score,0.55825
val/precision,0.55778
val/recall,0.56077


[W&B] Training model 3


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▅▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▆▇▆▇▇███████▇▇▇▇▆▆▅▅▃▄▄▃▁
val/f1_score,▅▆▆▆▇█▇██▆▇▇▇▇▆▇▅▅▅▄▃▃▂▂▁
val/precision,▄▄▆▅▆██▇█▆▇▆▇▆▆▆▅▄▃▃▄▁▁▂▁
val/recall,▆▇▆▇▇█▇██▇█▇▇▇▇▇▅▆▆▅▃▄▃▃▁

0,1
epoch,24.0
train/accuracy,0.69755
train/f1_score,0.69753
train/loss,0.73832
train/precision_macro,0.70566
train/recall_macro,0.69422
val/accuracy,0.5288
val/f1_score,0.52792
val/precision,0.53624
val/recall,0.5234


[W&B] Training model 4


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▃▄▅▅▇▇▆▇▆▇█▇▇██▇██████▇█
val/f1_score,▁▃▄▅▅▆▆▆▇▆▆▇▇▇▇▇▇▇▇▇█▇▇▇▇
val/precision,▁▃▄▅▄▆▅▅▆▆▆▇▆▆▇▆▆▆▇▇█▇▇▇▇
val/recall,▁▄▄▅▅▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇██▇▇█

0,1
epoch,24.0
train/accuracy,0.61916
train/f1_score,0.61402
train/loss,0.93767
train/precision_macro,0.614
train/recall_macro,0.61464
val/accuracy,0.57673
val/f1_score,0.57044
val/precision,0.57044
val/recall,0.57102


[W&B] Training model 5


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/accuracy,▁
train/f1_score,▁
train/loss,█▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▆▆▇▇▇▇▇▇▇▇▇█▇███████████
val/f1_score,▁▆▇▇▇▇▇▇▇▇█▇████▇████████
val/precision,▁▆▇▆▆▇▇▇▇▇█▇▇▇██▇████████
val/recall,▁▆▇▇▇▇▇▇▇▇▇▇████▇████████

0,1
epoch,24.0
train/accuracy,0.58868
train/f1_score,0.5804
train/loss,0.99453
train/precision_macro,0.57848
train/recall_macro,0.58458
val/accuracy,0.57213
val/f1_score,0.56331
val/precision,0.56153
val/recall,0.56689


[W&B] Training model 6


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▇▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▆▇██▇▇▇▅▅▅▅▃▄▄▃▂▃▃▁▂▂▁▁▁▂
val/f1_score,▅▇▇█▇▇▇▅▆▄▄▃▃▄▃▃▂▂▁▂▁▁▂▁▂
val/precision,▅▇██▇██▆▆▄▄▃▄▅▄▄▁▂▁▃▁▂▃▃▂
val/recall,▆▇██▇▇▇▆▆▄▅▃▄▄▄▃▃▃▂▂▂▁▂▁▂

0,1
epoch,24.0
train/accuracy,0.93526
train/f1_score,0.93468
train/loss,0.49906
train/precision_macro,0.93548
train/recall_macro,0.93448
val/accuracy,0.54553
val/f1_score,0.53784
val/precision,0.53742
val/recall,0.5392


[W&B] Training model 7


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/accuracy,▁
train/f1_score,▁
train/loss,█▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▅▆▆█▇▇▆▇▅▅▄▄▄▄▃▄▄▃▃▃▃▂▂▂▁
val/f1_score,▅▆▅█▇▆▆▇▅▅▄▄▄▄▃▃▃▃▂▂▃▂▂▁▂
val/precision,▅▇▅█▇▆▅▇▅▄▃▃▃▃▃▃▃▃▂▂▂▁▁▁▂
val/recall,▅▆▆█▆▇▆▇▅▅▄▄▄▄▃▄▃▃▂▂▃▂▂▁▁

0,1
epoch,24.0
train/accuracy,0.85264
train/f1_score,0.85192
train/loss,0.63015
train/precision_macro,0.85179
train/recall_macro,0.85236
val/accuracy,0.54193
val/f1_score,0.53779
val/precision,0.53873
val/recall,0.53733


[W&B] Training model 8


[W&B] Training done


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇███
train/accuracy,▁
train/f1_score,▁
train/loss,█▄▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▅▅▃▅▄▅▆▆▆▆▇▇▅▇▆▆▇█▇█▆▇▇▆
val/f1_score,▂▃▇▁▆▃▆▆▄▅▆▆▆▄▄▅▄▄▇▇█▆▅▆▇
val/precision,▄▃▇▁▆▂▆▆▄▅▇▇▅▄▄▄▄▄▇▇█▅▅▅█
val/recall,▁▅▆▄▆▅▆▆▆▆▆█▇▆▆▆▆▇█▇█▇▇▇▆

0,1
epoch,24.0
train/accuracy,0.68114
train/f1_score,0.67742
train/loss,0.83516
train/precision_macro,0.68015
train/recall_macro,0.67702
val/accuracy,0.57093
val/f1_score,0.56401
val/precision,0.56655
val/recall,0.56416


[W&B] Training model 9


[W&B] Training done


0,1
epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/accuracy,▁
train/f1_score,▁
train/loss,█▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁
train/precision_macro,▁
train/recall_macro,▁
val/accuracy,▁▄▂▅▅▆▅▇▆▇█▇▆▇▇█▇▆▇▇▆▆▆▄▆
val/f1_score,▂▃▁▅▆▆▅▆▇▇█▆▆▆▇▇▇▆▆▆▆▅▆▅▆
val/precision,▁▃▁▄▆▅▄▅▇▇█▆▅▆▆▇▇▆▆▆▆▅▅▅▆
val/recall,▁▄▂▅▅▆▆▇▇▇█▆▆▆▇▇▇▆▇▇▆▆▆▄▆

0,1
epoch,24.0
train/accuracy,0.71118
train/f1_score,0.70427
train/loss,0.84655
train/precision_macro,0.70448
train/recall_macro,0.7085
val/accuracy,0.57267
val/f1_score,0.56229
val/precision,0.56088
val/recall,0.56711


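The weakest validation results came from the deeper architectures and from those using batch normalization (setups 3, 6, and 7, with a validation f1_score of about 0.53 to 0.54). These runs also show the largest gap between training and validation scores (training accuracy of about 0.70, 0.94, and 0.85), which points to overfitting. Architectures with a very wide first hidden layer (setups 8 and 9) overfit noticeably as well, although their validation scores stayed close to those of the smaller models.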
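
This pattern can be read off the W&B run summaries above. The sketch below copies the final training and validation `f1_score` values from those summaries into two lists (index = setup number) and ranks the setups by validation score and by the train/validation gap. The variable names are illustrative.

```python
# Final macro-F1 per setup, copied from the W&B run summaries (index = setup).
train_f1 = [0.58657, 0.59971, 0.57174, 0.69753, 0.61402,
            0.58040, 0.93468, 0.85192, 0.67742, 0.70427]
val_f1 = [0.55981, 0.56791, 0.55825, 0.52792, 0.57044,
          0.56331, 0.53784, 0.53779, 0.56401, 0.56229]

# Generalisation gap: how much better the model does on data it has seen.
gaps = [t - v for t, v in zip(train_f1, val_f1)]

by_val = sorted(range(len(val_f1)), key=lambda i: val_f1[i], reverse=True)
by_gap = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)

print("setup  val_f1   gap")
for i in by_val:
    print(f"{i:>5}  {val_f1[i]:.5f}  {gaps[i]:.5f}")
print("largest gaps:", by_gap[:3])
```

The three setups with the largest gap are the same three that rank last on validation `f1_score`.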

In [21]:
df = pd.DataFrame(best)
df

Unnamed: 0,Architecture class: MLP | setup: 4
accuracy,0.576733
precision_macro,0.570442
recall_macro,0.571022
confusion_matrix,"[[2674, 636, 94, 23, 60], [746, 1469, 581, 118..."
f1_score,0.570444
y_true,"[0, 4, 1, 4, 4, 0, 1, 2, 4, 2, 1, 1, 1, 2, 4, ..."
y_pred,"[4, 4, 2, 0, 4, 0, 1, 1, 4, 3, 1, 1, 2, 3, 3, ..."


The best f1_score was achieved by the model from setup 4:
```
[
    {"linear": (INPUT_SIZE, 128), "relu": True},
    {"dropout": 0.3, "linear": (128, 64), "relu": True},
    {"dropout": 0.3, "linear": (64, 5)},
]
```

Notably, the second-best result was achieved by the same architecture without dropout (setup 1).

The following experiments continue with this architecture.

## 2. Comparing different TF-IDF and SVD parameters

In [22]:
def adjust_input_dim(base_arch, input_dim):
    # Work on a copy so the caller's architecture spec is not mutated.
    arch = [dict(layer) for layer in base_arch]
    _, out_dim = arch[0]["linear"]
    arch[0]["linear"] = (input_dim, out_dim)
    return arch

In [23]:
def tune_text_pipeline_for_architecture(
    base_arch,
    tfidf_grid,
    svd_components_grid,
    training_setup,
    batch_size,
    num_workers,
    device,
    best_metric,
    project_name="LLM_lab1",
    group_name="tfidf_svd_search",
):
    results = []
    best = None
    best_metric_value = -1

    for tfidf_params, n_comp in product(tfidf_grid, svd_components_grid):
        vectorizer, svd, loaders = make_loaders_from_text(
            vectorizer_params=tfidf_params,
            svd_components=n_comp,
            batch_size=batch_size,
            num_workers=num_workers,
            pin_memory=True,
        )
        tr_loader, va_loader, _ = loaders

        arch = adjust_input_dim(base_arch, input_dim=n_comp)
        model = MLP(arch).to(device)

        config = dict(
            epochs=training_setup["epochs"],
            optim_type=training_setup["optim_type"],
            optim_params=training_setup["optim_params"],
            criterion=str(training_setup["criterion"]),
            batch_size=getattr(tr_loader, "batch_size", None),
            device=str(device),
            tfidf_params=tfidf_params,
            svd_components=n_comp,
        )

        run_name = f"TFIDF:{tfidf_params}|SVD:{n_comp}"
        print(f"[Search] {run_name}")
        _, run = train(
            tr_loader,
            va_loader,
            model,
            training_setup["epochs"],
            training_setup["optim_type"],
            training_setup["optim_params"],
            training_setup["criterion"],
            device,
            project_name,
            group_name,
            run_name,
            config,
        )

        val_metrics = evaluate(model, va_loader, device)
        train_metrics = evaluate(model, tr_loader, device)

        record = {
            "tfidf": tfidf_params,
            "svd_components": n_comp,
            "train": train_metrics,
            "val": val_metrics,
            "model": model,
            "vectorizer": vectorizer,
            "svd": svd,
            "loaders": loaders,
        }
        results.append(record)

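        if val_metrics[best_metric] > best_metric_value:
            best = {run_name: val_metrics}
            # Track the best score so far; without this update every run
            # would overwrite `best` and the last run would always win.
            best_metric_value = val_metrics[best_metric]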

        wandb.finish()

    return results, best

In [24]:
tfidf_grid = [
    {"max_features": 30000, "min_df": 2, "max_df": 0.95, "ngram_range": (1,1)},
    {"max_features": 60000, "min_df": 2, "max_df": 0.95, "ngram_range": (1,1)},
    {"max_features": None,  "min_df": 5, "max_df": 0.90, "ngram_range": (1,2)},
]
svd_components_grid = [128, 300, 512]

In [26]:
best_arch = [
    {"linear": (INPUT_SIZE, 128), "relu": True},
    {"dropout": 0.3, "linear": (128, 64), "relu": True},
    {"dropout": 0.3, "linear": (64, 5)},
]
results, best = tune_text_pipeline_for_architecture(
    base_arch=best_arch,
    tfidf_grid=tfidf_grid,
    svd_components_grid=svd_components_grid,
    training_setup=training_setup,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    device=DEVICE,
    best_metric="f1_score",
)

[Search] TFIDF:{'max_features': 30000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:128


0,1
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
val/accuracy,▁▃▄▄▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇██████
val/f1_score,▁▃▄▄▅▅▆▅▆▆▇▇▇▇▇▇▇▇██████▇
val/precision,▁▃▄▄▄▅▅▅▅▆▇▇▇▇▇▇▇▇██████▇
val/recall,▁▃▄▄▄▅▅▅▅▆▇▇▆▇▇▇▇▇▇██████

0,1
epoch,24.0
train/loss,1.0213
val/accuracy,0.5586
val/f1_score,0.54866
val/precision,0.54791
val/recall,0.5518


[Search] TFIDF:{'max_features': 30000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:300


0,1
epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁
val/accuracy,▁▂▃▄▄▅▅▅▅▆▆▆▆▆▆▇▆▇█▇▇██▇█
val/f1_score,▁▂▃▄▃▅▅▅▅▆▅▆▆▆▇▆▇▇█▇▇██▇█
val/precision,▁▂▃▄▃▅▅▅▆▆▅▆▆▆▇▆▇███▇██▆█
val/recall,▁▂▃▃▄▅▅▅▅▆▅▅▆▆▆▆▆▇█▇▇█▇▇█

0,1
epoch,24.0
train/loss,0.942
val/accuracy,0.57793
val/f1_score,0.56927
val/precision,0.56834
val/recall,0.5718


[Search] TFIDF:{'max_features': 30000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:512


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
train/loss,█▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
val/accuracy,▁▃▃▄▄▅▅▅▅▆▅▆▆▇▇▆▇▇▇▇▇▇▇██
val/f1_score,▁▃▃▄▅▅▅▆▆▆▆▆▇▇▇▆▇▇▇▇▇███▇
val/precision,▁▃▃▄▅▅▅▆▅▆▅▆▆▆▇▆▇▇▇▇▇█▇▇▇
val/recall,▁▃▃▄▄▅▅▆▅▆▅▆▆▆▇▆▇▇▇▇█▇▇██

0,1
epoch,24.0
train/loss,0.88877
val/accuracy,0.58327
val/f1_score,0.57464
val/precision,0.57378
val/recall,0.57721


[Search] TFIDF:{'max_features': 60000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:128


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇██
train/loss,█▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
val/accuracy,▁▃▄▃▄▄▅▅▅▆▆▆▆▇▇▇▇▇█▇██▇▇▇
val/f1_score,▁▂▃▃▄▅▅▄▅▆▇▆▇▇█▇███▇██▇██
val/precision,▁▁▂▃▃▄▄▄▄▅▇▆▇▇█▇███▇█▇▇█▇
val/recall,▁▃▄▃▄▄▅▅▅▆▆▆▇▇▇▇▇▇█▇██▇▇▇

0,1
epoch,24.0
train/loss,1.01697
val/accuracy,0.55667
val/f1_score,0.54895
val/precision,0.54761
val/recall,0.55119


[Search] TFIDF:{'max_features': 60000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:300


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁
val/accuracy,▁▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███████
val/f1_score,▁▄▄▅▆▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇█▇██▇
val/precision,▁▄▄▅▅▅▆▆▇▆▇▆▆▇▇▇▇▆▇▇█▇██▇
val/recall,▁▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇███████

0,1
epoch,24.0
train/loss,0.94016
val/accuracy,0.57833
val/f1_score,0.56908
val/precision,0.56811
val/recall,0.57275


[Search] TFIDF:{'max_features': 60000, 'min_df': 2, 'max_df': 0.95, 'ngram_range': (1, 1)}|SVD:512


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
val/accuracy,▁▃▄▅▅▅▆▅▅▆▆▆▆▇▆▇▇▇██▇█▇█▇
val/f1_score,▁▅▅▅▅▆▆▆▆▆▆▆▅▇▆▇▇▇██▇█▇▇▇
val/precision,▁▅▅▅▅▆▆▆▅▆▆▆▅▇▆▇▇▇█▇▇█▆▇▇
val/recall,▁▃▄▅▅▅▅▆▆▆▆▆▆▇▆▇▇▇██▇█▇█▇

0,1
epoch,24.0
train/loss,0.87956
val/accuracy,0.58227
val/f1_score,0.57396
val/precision,0.57247
val/recall,0.57688


[Search] TFIDF:{'max_features': None, 'min_df': 5, 'max_df': 0.9, 'ngram_range': (1, 2)}|SVD:128


0,1
epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val/accuracy,▁▅▅▆▆▆▆▇▇▆▇▇▇▇▇▇▇▇▇▇▇█▇██
val/f1_score,▁▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█▇████
val/precision,▁▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█▇████
val/recall,▁▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█▇████

0,1
epoch,24.0
train/loss,1.00175
val/accuracy,0.56713
val/f1_score,0.55845
val/precision,0.55745
val/recall,0.56101


[Search] TFIDF:{'max_features': None, 'min_df': 5, 'max_df': 0.9, 'ngram_range': (1, 2)}|SVD:300


0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
train/loss,█▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
val/accuracy,▁▆▇▇▇▇▇▇▇▇███████████████
val/f1_score,▁▆▆▇▇▇▇▇▇▇███████████████
val/precision,▁▆▆▇▇▇▇▇▇▇█▇▇▇█▇██▇██████
val/recall,▁▆▇▇▇▇▇▇▇▇███████████████

0,1
epoch,24.0
train/loss,0.92253
val/accuracy,0.59013
val/f1_score,0.58314
val/precision,0.58297
val/recall,0.58417


[Search] TFIDF:{'max_features': None, 'min_df': 5, 'max_df': 0.9, 'ngram_range': (1, 2)}|SVD:512


0,1
epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██
train/loss,█▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁
val/accuracy,▁▆▆▇▇▇▇▇█▇██▇█▇██████████
val/f1_score,▁▆▆▇▇▇█▇▇██▇██▇██████████
val/precision,▁▆▆▆▇▇██▇▇▇▇▇█▇███████▇██
val/recall,▁▆▆▇▇▇▇▇█▇████▇██████████

0,1
epoch,24.0
train/loss,0.86732
val/accuracy,0.59627
val/f1_score,0.58897
val/precision,0.58788
val/recall,0.59123


In [29]:
df = pd.DataFrame(best)
df

Unnamed: 0,"TFIDF:{'max_features': None, 'min_df': 5, 'max_df': 0.9, 'ngram_range': (1, 2)}|SVD:512"
accuracy,0.596267
precision_macro,0.587884
recall_macro,0.591234
confusion_matrix,"[[2731, 629, 63, 17, 47], [749, 1557, 526, 95,..."
f1_score,0.588969
y_true,"[0, 4, 1, 4, 4, 0, 1, 2, 4, 2, 1, 1, 1, 2, 4, ..."
y_pred,"[1, 4, 2, 0, 4, 0, 2, 1, 4, 2, 1, 1, 1, 4, 3, ..."


The best f1_score was achieved for TF-IDF with the following parameters:
- max_features = None
- min_df = 5
- max_df = 0.9
- ngram_range = (1,2)
- SVD n_comp = 512
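
To illustrate what `ngram_range = (1,2)` adds to the vocabulary, the sketch below lists the unigrams and bigrams of a short phrase. It assumes plain whitespace tokenisation, which is simpler than the tokenisation `TfidfVectorizer` applies; the helper `word_ngrams` is hypothetical.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List word n-grams for every n in the inclusive ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("not good at all", ngram_range=(1, 1)))
print(word_ngrams("not good at all", ngram_range=(1, 2)))
```

With bigrams enabled, a phrase such as `not good` becomes a feature of its own, which may be one reason this setting scored best here.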

In [30]:
def train_best_model(
    best_arch,
    tfidf_params,
    n_comp,
    training_setup,
    batch_size,
    num_workers,
    device,
    project_name="LLM_lab1",
    group_name="final_model",
):
    vectorizer, svd, loaders = make_loaders_from_text(
        vectorizer_params=tfidf_params,
        svd_components=n_comp,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=True,
    )
    tr_loader, va_loader, te_loader = loaders
    model = MLP(best_arch).to(device)
    config = dict(
        epochs=training_setup["epochs"],
        optim_type=training_setup["optim_type"],
        optim_params=training_setup["optim_params"],
        criterion=str(training_setup["criterion"]),
        batch_size=getattr(tr_loader, "batch_size", None),
        device=str(device),
        tfidf_params=tfidf_params,
        svd_components=n_comp,
    )
    run_name = "Final model run"
    _, run = train(
        tr_loader,
        va_loader,
        model,
        training_setup["epochs"],
        training_setup["optim_type"],
        training_setup["optim_params"],
        training_setup["criterion"],
        device,
        project_name,
        group_name,
        run_name,
        config,
    )

    test_metrics = evaluate(model, te_loader, device)

    result = {"Best_model": test_metrics}

    return result, model

In [31]:
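# Note: this spec has no dropout layers, so it corresponds to setup 1
# (the runner-up) rather than setup 4; the results below are for this variant.
best_arch = [
    {"linear": (512, 128), "relu": True},
    {"linear": (128, 64), "relu": True},
    {"linear": (64, 5)},
]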
result, model = train_best_model(
    best_arch,
    tfidf_params={"max_features": None, "min_df": 5, "max_df": 0.9, "ngram_range": (1,2)},
    n_comp=512,
    training_setup=training_setup,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    device=DEVICE,
    project_name="LLM_lab1",
    group_name="final_model",
)

In [47]:
wandb.finish()

0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/loss,█▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁
val/accuracy,▁▅▅▅▅▅▆█▆█▇███▇█▇▇█▇▇▇▆▆▆
val/f1_score,▁▅▃▆▄▅▆▆▃▇█▆█▆▆█▅▇▇▇▇▇▇▇▆
val/precision,▁▄▃▅▄▅▆▆▃▇▇▅▇▅▆▇▆█▆▆▆▇▆▇▇
val/recall,▁▆▅▅▅▅▆▇▅▇▇▇██▆█▅▆▇██▇▇▇▆

0,1
epoch,24.0
train/loss,0.86545
val/accuracy,0.59107
val/f1_score,0.58505
val/precision,0.58642
val/recall,0.58481


In [32]:
df = pd.DataFrame(result)
df

Unnamed: 0,Best_model
accuracy,0.57205
precision_macro,0.574797
recall_macro,0.573012
confusion_matrix,"[[3092, 796, 98, 35, 64], [1052, 2089, 816, 14..."
f1_score,0.572952
y_true,"[0, 0, 0, 0, 0, 2, 1, 3, 3, 2, 1, 3, 0, 1, 3, ..."
y_pred,"[0, 0, 0, 0, 0, 2, 0, 3, 3, 3, 2, 3, 0, 2, 3, ..."


On the test set, the best model achieved:
- accuracy = 57.2%
- f1_score = 57.3%
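
The reported metrics can all be derived from the confusion matrix. The sketch below shows the arithmetic on a small hypothetical 3-class matrix (the real 5-class matrix is truncated in the output above, so it is not reproduced here). It assumes the scikit-learn convention of rows as true classes and columns as predicted classes, and that `f1_score` is macro-averaged like `precision_macro` and `recall_macro`; neither assumption is confirmed by the notebook's `evaluate` function.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
])

tp = np.diag(cm).astype(float)        # correctly classified examples per class
precision = tp / cm.sum(axis=0)       # column sums = number of predictions per class
recall = tp / cm.sum(axis=1)          # row sums = number of true examples per class
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_precision = precision.mean()    # macro average: unweighted mean over classes
macro_recall = recall.mean()
macro_f1 = f1.mean()

print(f"accuracy={accuracy:.4f}, precision_macro={macro_precision:.4f}, "
      f"recall_macro={macro_recall:.4f}, f1_macro={macro_f1:.4f}")
```

Macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, which is why the three macro values in the table need not satisfy the usual F1 identity.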

## 1. Finding the worst-classified examples

We transform the test set using the best TF-IDF and SVD parameters found.

In [39]:
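vectorizer = TfidfVectorizer(max_features=None, min_df=5, max_df=0.9, ngram_range=(1, 2))
X_train_tf = vectorizer.fit_transform(dataset_train["text"])
X_test_tf  = vectorizer.transform(dataset_test["text"])

# Caveat: this refits TF-IDF and SVD. The features match the ones the model was trained on
# only if both fits are reproducible (TruncatedSVD is randomized unless random_state is fixed).
svd = TruncatedSVD(n_components=512)
svd.fit(X_train_tf)  # fit() returns the estimator itself, so there is no feature matrix to assign
X_test  = svd.transform(X_test_tf)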

y_test  = np.asarray(dataset_test["label"])

test_ds  = TensorDataset(torch.tensor(X_test),  torch.tensor(y_test))
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True)
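
The model can only be evaluated meaningfully on features produced by transformers fitted the same way as those that produced its training inputs. One way to make this explicit is to bundle TF-IDF and SVD into a single scikit-learn pipeline and fix the SVD seed. The sketch below uses a hypothetical toy corpus and `n_components=2` instead of 512; the helper `build_features` is illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the Yelp reviews.
train_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "amazing burgers, will come back",
    "slow service, rude staff",
    "friendly staff and great burgers",
    "cold fries and terrible burgers",
]
test_texts = ["great burgers and friendly service", "rude staff and cold food"]


def build_features(seed):
    """TF-IDF followed by truncated SVD, with a fixed seed for reproducibility."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=2, random_state=seed),
    )


# Fit on the training texts only, then apply the fitted pipeline to the test texts.
features = build_features(seed=0)
X_train = features.fit_transform(train_texts)
X_test = features.transform(test_texts)

# Refitting with the same seed reproduces the same test features.
refit = build_features(seed=0)
refit.fit(train_texts)
X_test_refit = refit.transform(test_texts)

print(X_train.shape, X_test.shape, np.allclose(X_test, X_test_refit))
```

Keeping the fitted pipeline object from training (rather than refitting it later) avoids the question altogether, at the cost of passing one more object around.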

Next, we look for the examples with the smallest logit for their true class.

In [40]:
def show_k_most_wrong(model, test_loader, true_texts, k, device):
    """Return the k test examples with the lowest true-class logit as a list of dicts."""
    model.eval()
    all_logits, all_y = [], []
    with torch.no_grad():  # inference only, gradients are not needed
        for X, y in test_loader:
            X = X.to(dtype=torch.float32, device=device)
            all_logits.append(model(X).cpu())
            all_y.append(y)

    logits, y_true = torch.cat(all_logits, dim=0), torch.cat(all_y, dim=0)
    row_idx = torch.arange(logits.size(0))
    true_logits = logits[row_idx, y_true]  # logit assigned to the correct class

    p_true = torch.softmax(logits, dim=1)[row_idx, y_true]  # probability of the correct class
    y_pred = torch.argmax(logits, dim=1)

    # Ascending sort, so the smallest true-class logits come first.
    idx_sorted = torch.argsort(true_logits, dim=0, descending=False)[:k].tolist()

    rows = []
    for idx in idx_sorted:
        rows.append({
            "idx": idx,
            "true_label": int(y_true[idx].item()),
            "pred_label": int(y_pred[idx].item()),
            "true_logit": float(true_logits[idx].item()),
            "pred_logit": float(logits[idx, y_pred[idx]].item()),
            "p_true": float(p_true[idx].item()),
            "text": str(true_texts[idx]["text"])[:1200],  # truncate long reviews for display
        })
    return rows

In [41]:
rows = show_k_most_wrong(model, test_loader, dataset_test, k=20, device=DEVICE)
df = pd.DataFrame(rows)
df

,idx,true_label,pred_label,true_logit,pred_logit,p_true,text
0,2806,3,0,-6.309756,5.898983,5e-06,This is the worst food I have ever had. Smelly...
1,8214,4,1,-5.47843,2.136673,0.000361,Started off with the skillet cornbread and the...
2,4578,4,1,-5.412669,1.787483,0.000378,I wanted to take off a star for the weird bend...
3,10410,3,0,-5.237564,3.326813,9.8e-05,Yummmmmmmm eeeeeeeee! Taste eeeeeeee!
4,11362,4,1,-5.193672,2.089184,0.000488,Came here a few nights ago for some brain food...
5,12786,4,1,-5.157664,1.864151,0.000599,"The night I went to DDS there was this awful, ..."
6,12285,4,0,-5.104719,3.716515,0.000139,We at first had a ok experience than it went d...
7,4640,4,0,-4.917395,3.192785,0.000261,Nothing Short of a Miracle!! \nI had a beautif...
8,15171,4,1,-4.785367,0.863193,0.001542,"C-Fu, Great Wall, Golden Buddha ... blah blah ..."
9,11305,0,4,-4.709807,5.638265,3.1e-05,Love love love this place!!! Everyone is amazi...
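
Note that ranking by the raw true-class logit and ranking by the true-class probability need not agree, because softmax also depends on the other logits of the same example. The table shows this: row 3 (idx 10410) has a higher `true_logit` than rows 1 and 2 but a lower `p_true`. The sketch below reproduces the effect on hypothetical logits; NumPy is used only to keep it dependency-light, and in the notebook the equivalent change would be sorting by `p_true` instead of `true_logits`.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical logits for 3 examples and 5 classes.
logits = np.array([
    [2.0, -5.0, 0.0, 0.0, 0.0],   # true class 1: very low logit, weak competitor
    [6.0, 0.0, 0.0, -4.5, 0.0],   # true class 3: slightly higher logit, strong competitor
    [0.0, 0.0, 1.0, 0.0, 0.5],    # true class 2: correctly classified
])
y_true = np.array([1, 3, 2])

rows = np.arange(len(y_true))
true_logits = logits[rows, y_true]
p_true = softmax(logits)[rows, y_true]

order_by_logit = np.argsort(true_logits)  # smallest true-class logit first
order_by_prob = np.argsort(p_true)        # smallest true-class probability first

print("by logit:", order_by_logit.tolist())
print("by prob: ", order_by_prob.tolist())
```

Example 0 has the lowest true-class logit, but example 1 has the lowest true-class probability because its competing logit is much larger.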


In [42]:
df["true_label"].value_counts()

true_label
4    12
3     4
0     4
Name: count, dtype: int64

In [43]:
df["pred_label"].value_counts()

pred_label
1    9
0    7
2    2
4    1
3    1
Name: count, dtype: int64

In [44]:
print(dataset_test["text"][2806])

This is the worst food I have ever had. Smelly Salmon, mushy broccoli, saw dust dry burger, rock hard ribs. Even the mashed potatoes were thick and dry. I understand this place's forte' is rock memorabilia, they should at least provide palatable food, not saw dust. They would be better off serving TV dinners. Stay away from this place. if you must go, then just have some appetizers and leave.


In [45]:
print(dataset_test["text"][8214])

Started off with the skillet cornbread and the grilled artichokes for appetizers. Both were unbelievably good. The grilled artichokes were just short of a spiritual experience! I could almost hear an angelic choir hit the high notes after my first bite. \n\nFor the main courses, we ordered the Kobe beef tips and the grilled salmon. The beef tips were the clear winner, tender, cooked to perfection, and with a nice wood-grilled flavor. The salmon was good, but the \"green rice\" that came along side of it was a little on the tangy side for me. The salmon had a sweet glaze over the top which was good, but I guess I was not in the mood for that flavor. \n\nThe bottom line is, the Yelpers were right again. This place is a winner. You won't be disappointed.


In [46]:
print(dataset_test["text"][4578])

I wanted to take off a star for the weird bendy line that I always have to wait in when I go through the drive thru. However, the truth is that if I simply parked my car and walked my lazy a** inside, I would not have to wait.\n\nSpeaking of their drive thru, wow. You could be in a line all the way out to Eastern and you would be through that line in 5 minutes. They really have an operation going on there. This is what happens when you have the budget to hire sufficient amounts of employees.\n\nOne time this one girl was kinda mean to me...but I can't blame McDonalds.  It felt like an anger caused by something such as a breakup or a stressful issue with not paying rent, so I let it go.\n\nHAVE YOU HAD THEIR FRIES?!?!!?  I don't care what people say. I think are putting crack in those. Salt and crack. If one day down the line we find out that McDonalds fries were so good because they were laced with something, I won't be shocked.


Above are the three worst-classified reviews. The errors on them are understandable. The first review is extremely negative, so the model predicts 0, even though the true label is 3.
In the second example the error is more evident, but a word such as "disappointed" may have misled the model.
The third example is a very idiosyncratic review, and its rating is hard to infer from individual words.
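
The hypothesis about "disappointed" can be examined by looking at the features TF-IDF extracts from the closing sentence of the second review. The sketch below assumes scikit-learn's default tokenizer settings together with the `ngram_range=(1, 2)` used above. It only shows which n-grams exist as features; whether they actually drove the prediction is not verified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn raw text into n-grams.
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

sentence = "You won't be disappointed."
ngrams = analyzer(sentence)
print(ngrams)

# The default token pattern keeps only tokens of two or more word characters,
# so "won't" is split at the apostrophe and the single-letter "t" is dropped.
has_negated_phrase = any("won't" in gram or "not" in gram.split() for gram in ngrams)
print("negation preserved:", has_negated_phrase)
```

With unigrams and bigrams, the features include `disappointed` and `be disappointed`, while the negation survives only as the token `won`, so a bag-of-words model sees a mostly negative cue in a positive sentence.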