# Задание 4 - 10 баллов

## Задание

Исходный набор данных - [Fake and real news dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)

- Реализовать классификацию двумя моделями: CNN, LSTM - **6 баллов = 3 + 3**
- Сравнить качество обученных моделей **1 балл**


- Обеспечена воспроизводимость решения: зафиксированы random_state, ноутбук воспроизводится от начала до конца без ошибок - **2 балла**

- Соблюден code style на уровне pep8 и [On writing clean Jupyter notebooks](https://ploomber.io/blog/clean-nbs/)  - **1 балл**
 
Примеры: [Using Convolution Neural Networks to Classify Text in PyTorch](https://tzuruey.medium.com/using-convolution-neural-networks-to-classify-text-in-pytorch-3b626a42c3ca), [LSTM in Pytorch](https://wandb.ai/sauravmaheshkar/LSTM-PyTorch/reports/Using-LSTM-in-PyTorch-A-Tutorial-With-Examples--VmlldzoxMDA2NTA5)



In [1]:
import numpy as np
import pandas as pd

import warnings
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

from nltk.probability import FreqDist

import spacy

from gensim.models import KeyedVectors

import torch
import torch.nn as nn
import torch.optim as optim

warnings.simplefilter(action='ignore')

SEED = 566
# data dir
DATA_DIR = "../data"

DATA = "../data/spam_or_not_spam.csv"


def classification_report_pd(y_test, y_pred):
    report = pd.DataFrame(classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)).transpose()
    report.support = report.support.astype(int)
    report.loc['accuracy', 'support'] = report.loc['macro avg', 'support']
    report.loc['accuracy', 'precision'] = np.nan
    report.loc['accuracy', 'recall'] = np.nan
    return report

# Установка torch

Попробую на своем компуктере запустить на CPU, если все будет плохо -- перейду в коллаб

In [3]:
# ! pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Подготовка данных

Шаг 0:
надо скачать и положить в директорию `../data/` эмбеддинги word2vec: [https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g](ссылка).

И распаковать (чтобы был файл `.bin`).

Шаг 1: качаю спам/нотспам датасет

In [2]:
# !kaggle datasets download -d ozlerhakan/spam-or-not-spam-dataset
# !mv ./spam-or-not-spam-dataset.zip ../data/
# !unzip ../data/spam-or-not-spam-dataset.zip
# !mv ./spam_or_not_spam.csv ../data/
# !python -m spacy download en_core_web_sm

##  Чтение данных
Сделаем все как в домашках ранее.

In [3]:
NLP = spacy.load("en_core_web_sm")


def check_token(token):
    return not (token.is_stop
                or token.is_punct
                or token.is_digit
                or token.like_email
                or token.like_num)  # like num в данном случае делать не обязательно (тут уже заменены все NUMBER), но сделал это для общности -- на будущее


def tokenize_clean(text):
    return [token.lemma_.lower() for token in NLP(text) if check_token(token)]

In [4]:
data = pd.read_csv(DATA)
data = data[(~data['email'].isna()) & (data['label'].isin([0, 1]))]
print('Shape:', data.shape)
print('Classes:', data.label.value_counts())
data.head()

Shape: (2999, 2)
Classes: label
0    2500
1     499
Name: count, dtype: int64


Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [8]:
data.tail(80)

Unnamed: 0,email,label
2919,register domains for just NUMBER NUMBER the ne...,1
2920,generic v agra NUMBER per NUMBERmg generic v ...,1
2921,hello you may have seen this business before a...,1
2922,hello you may have seen this business before a...,1
2923,the best mortage rates simple easy and free h...,1
...,...,...
2995,abc s good morning america ranks it the NUMBE...,1
2996,hyperlink hyperlink hyperlink let mortgage le...,1
2997,thank you for shopping with us gifts for all ...,1
2998,the famous ebay marketing e course learn to s...,1


In [7]:
%%time
data['email_tokenized'] = data['email'].apply(tokenize_clean)
data.head()

CPU times: user 1min 7s, sys: 469 ms, total: 1min 8s
Wall time: 1min 8s


Unnamed: 0,email,label,email_tokenized
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0,"[ , date, d, number, aug, number, number, numb..."
1,martin a posted tassos papadopoulos the greek ...,0,"[martin, post, tassos, papadopoulo, greek, scu..."
2,man threatens explosion in moscow thursday aug...,0,"[man, threaten, explosion, moscow, thursday, a..."
3,klez the virus that won t die already the most...,0,"[klez, virus, win, t, die, prolific, virus, kl..."
4,in adding cream to spaghetti carbonara which ...,0,"[ , add, cream, spaghetti, carbonara, effect, ..."


## Подгрузим эмбеддинги

In [49]:
w2v_model = KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin', binary=True)
vector = w2v_model['computer']
vector[:10]

array([ 0.10742188, -0.20117188,  0.12304688,  0.21191406, -0.09130859,
        0.21679688, -0.13183594,  0.08300781,  0.20214844,  0.04785156],
      dtype=float32)

In [9]:
def text_to_embedding(text):
    vectors = []

    for word in text:
        if word in w2v_model:
            vectors.append(w2v_model[word])

    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(w2v_model.vector_size)

In [10]:
%%time
data['email_embedded'] = data['email_tokenized'].apply(text_to_embedding)
data.head()

CPU times: user 464 ms, sys: 3.99 ms, total: 468 ms
Wall time: 468 ms


Unnamed: 0,email,label,email_tokenized,email_embedded
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0,"[ , date, d, number, aug, number, number, numb...","[0.033716675, 0.039779373, 0.018512992, 0.0509..."
1,martin a posted tassos papadopoulos the greek ...,0,"[martin, post, tassos, papadopoulo, greek, scu...","[-0.012349759, 0.035399687, -0.036836125, 0.12..."
2,man threatens explosion in moscow thursday aug...,0,"[man, threaten, explosion, moscow, thursday, a...","[0.018030634, 0.0055707716, 0.026558831, 0.046..."
3,klez the virus that won t die already the most...,0,"[klez, virus, win, t, die, prolific, virus, kl...","[0.051892176, 0.024502225, -0.029774984, 0.115..."
4,in adding cream to spaghetti carbonara which ...,0,"[ , add, cream, spaghetti, carbonara, effect, ...","[-0.07272888, 0.0076953126, 0.011092937, 0.189..."


In [11]:
X, y = data[['email_tokenized', 'email_embedded']], data.label

X_train, X_tv, y_train, y_tv = train_test_split(X,
                                                y,
                                                test_size=0.3,
                                                random_state=SEED,
                                                stratify=y)

X_val, X_test, y_val, y_test = train_test_split(X_tv,
                                                y_tv,
                                                test_size=0.5,
                                                random_state=SEED,
                                                stratify=y_tv)

display(y_train.value_counts())
display(y_val.value_counts())
display(y_test.value_counts())
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
y_val = y_val.to_numpy()

label
0    1750
1     349
Name: count, dtype: int64

label
0    375
1     75
Name: count, dtype: int64

label
0    375
1     75
Name: count, dtype: int64

## Text 2 sequence

Кроме эмбеддингов превратим еще тексты в последовательности с помощью словаря токенов (1 слово -- 1 токен). Именно этот метод будет использоваться для обучения нейронок в этом дз, но эмбеддинги выше решил оставить -- чтобы не потерялось, когда буду делать дальнейшие ДЗ.

In [12]:
def text_to_sequence(text, maxlen):
    result = []
    for word in text:
        if word in vocabulary:
            result.append(vocabulary[word])
    padding = [0] * (maxlen - len(result))
    return padding + result[-maxlen:]


tokens = []
for text in X_train:
    tokens.extend(text)

max_words = 2000
dist = FreqDist(tokens)
tokens_filtered_top = [pair[0] for pair in dist.most_common(max_words - 1)]
tokens_filtered_top[:10]

vocabulary = {v: k for k, v in dict(enumerate(tokens_filtered_top, 1)).items()}
len(vocabulary)

13

In [13]:
max_len = int(np.quantile(data.email_tokenized.apply(lambda x: len(x)), 0.85))
print(f"Максимальная длина предложения: {max_len} ")
Xp_train = np.array([text_to_sequence(text, max_len) for text in X_train['email_tokenized']], dtype=np.int32)
Xp_test = np.array([text_to_sequence(text, max_len) for text in X_test['email_tokenized']], dtype=np.int32)
Xp_val = np.array([text_to_sequence(text, max_len) for text in X_val['email_tokenized']], dtype=np.int32)

Максимальная длина предложения: 184 


# Torch

Способ выше -- превратили в числа, и приделали паддинги -- как на уроке, 

Еще научимся эмбеденные слова превращать в тензоры (это при построении нейронок не использовалось, но чтобы не потерять, оставлю это тут на будущее -- надеюсь не критично). 

In [15]:
y_train

array([0, 1, 0, ..., 0, 0, 0])

In [16]:
Xt_train, yt_train = torch.tensor([x for x in X_train['email_embedded']]), torch.tensor(y_train)
Xt_val, yt_val = torch.tensor([x for x in X_val['email_embedded']]), torch.tensor(y_val)
Xt_test, yt_test = torch.tensor([x for x in X_test['email_embedded']]), torch.tensor(y_test)

In [17]:
from torch.utils.data import DataLoader, Dataset
from copy import deepcopy


class TextDataWrapper(Dataset):
    def __init__(self, data, target=None, transform=None):
        self.data = torch.from_numpy(data).long()
        if target is not None:
            self.target = torch.from_numpy(target).long()
        else:
            self.target = None
        self.transform = transform

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index] if self.target is not None else -1

        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.data)

## Models 

In [18]:
class CNNClassifier(nn.Module):
    def __init__(self, vocab_size=2000, embedding_dim=128, out_channel=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, out_channel, kernel_size=3)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(out_channel, num_classes)

    def forward(self, x):
        output = self.embedding(x)
        output = output.permute(0, 2, 1)  # bs, emb_dim, len
        output = self.conv(output)
        output = self.relu(output)
        output = torch.max(output, axis=2).values
        output = self.linear(output)
        return output
    
    
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=2000, embedding_dim=128, hidden_dim=64, num_layers=1, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True, bidirectional=False)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output = self.embedding(x)
        lstm_out, _ = self.lstm(output)
        final_hidden_state = lstm_out[:, -1, :]
        output = self.linear(final_hidden_state)
        return output


model_cnn = CNNClassifier()
model_lstm = LSTMClassifier()

### Prepare and train CNN

In [19]:
batch_size = 256
num_epochs = 100
print(model_cnn)
print("Parameters:", sum([param.nelement() for param in model_cnn.parameters()]))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model_cnn.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

train_dataset = TextDataWrapper(Xp_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = TextDataWrapper(Xp_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)


CNNClassifier(
  (embedding): Embedding(2000, 128)
  (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
  (relu): ReLU()
  (linear): Linear(in_features=128, out_features=2, bias=True)
)
Parameters: 305538


In [20]:
%%time
train_accuracies = []
val_accuracies = []
train_losses = []

for epoch in range(num_epochs):
    model.train()
    temp_train_losses = []
    for i, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        temp_train_losses.append(loss.float().item())
    train_losses.append(np.mean(temp_train_losses))

    if epoch % 10 == 9:
        # Validation accuracy
        model.eval()
        with torch.no_grad():
            temp_train_acc = []
            temp_val_acc = []
            for i, (data, target) in enumerate(train_loader):
                train_outputs = model(data).squeeze()
                temp_train_acc.append(np.array((model(data).argmax(1) == target).int()).mean())
            for i, (data, target) in enumerate(val_loader):
                val_outputs = model(data).squeeze()
                temp_val_acc.append(np.array((model(data).argmax(1) == target).int()).mean())
            val_accuracies.append(np.mean(temp_val_acc))
            train_accuracies.append(np.mean(temp_train_acc))

        print(
            f'Epoch {epoch + 1}/{num_epochs}, Loss: {train_losses[-1]}, Train Accuracy: {train_accuracies[-1] * 100:.2f}%, Val Accuracy: {val_accuracies[-1] * 100:.2f}%')


Epoch 10/100, Loss: 0.3727441496319241, Train Accuracy: 86.15%, Val Accuracy: 84.85%
Epoch 20/100, Loss: 0.35570215847757125, Train Accuracy: 87.02%, Val Accuracy: 85.11%
Epoch 30/100, Loss: 0.3505771789285872, Train Accuracy: 87.10%, Val Accuracy: 85.49%
Epoch 40/100, Loss: 0.32878520091374713, Train Accuracy: 87.41%, Val Accuracy: 85.17%
Epoch 50/100, Loss: 0.33881500032212997, Train Accuracy: 87.54%, Val Accuracy: 84.92%
Epoch 60/100, Loss: 0.34424579805798, Train Accuracy: 87.19%, Val Accuracy: 85.24%
Epoch 70/100, Loss: 0.3126346833176083, Train Accuracy: 86.97%, Val Accuracy: 84.53%
Epoch 80/100, Loss: 0.33494001626968384, Train Accuracy: 87.02%, Val Accuracy: 85.48%
Epoch 90/100, Loss: 0.3336748215887282, Train Accuracy: 86.93%, Val Accuracy: 85.05%
Epoch 100/100, Loss: 0.3204502794477675, Train Accuracy: 87.71%, Val Accuracy: 85.42%
CPU times: user 7min 12s, sys: 34.3 s, total: 7min 46s
Wall time: 60 s


In [21]:
CNN_MODEL_TRAINED = deepcopy(model)

### Prepare and train LSTM

In [22]:
batch_size = 256
num_epochs = 100
print(model_lstm)
print("Parameters:", sum([param.nelement() for param in model_cnn.parameters()]))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model_lstm.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

train_dataset = TextDataWrapper(Xp_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = TextDataWrapper(Xp_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)


LSTMClassifier(
  (embedding): Embedding(2000, 128)
  (lstm): LSTM(128, 64, batch_first=True)
  (linear): Linear(in_features=64, out_features=2, bias=True)
)
Parameters: 305538


In [23]:
%%time
train_accuracies = []
val_accuracies = []
train_losses = []

for epoch in range(num_epochs):
    model.train()
    temp_train_losses = []
    for i, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        temp_train_losses.append(loss.float().item())
    train_losses.append(np.mean(temp_train_losses))

    if epoch % 10 == 9:
        # Validation accuracy
        model.eval()
        with torch.no_grad():
            temp_train_acc = []
            temp_val_acc = []
            for i, (data, target) in enumerate(train_loader):
                train_outputs = model(data).squeeze()
                temp_train_acc.append(np.array((model(data).argmax(1) == target).int()).mean())
            for i, (data, target) in enumerate(val_loader):
                val_outputs = model(data).squeeze()
                temp_val_acc.append(np.array((model(data).argmax(1) == target).int()).mean())
            val_accuracies.append(np.mean(temp_val_acc))
            train_accuracies.append(np.mean(temp_train_acc))

        print(
            f'Epoch {epoch + 1}/{num_epochs}, Loss: {train_losses[-1]}, Train Accuracy: {train_accuracies[-1] * 100:.2f}%, Val Accuracy: {val_accuracies[-1] * 100:.2f}%')


Epoch 10/100, Loss: 0.4076301058133443, Train Accuracy: 83.97%, Val Accuracy: 85.56%
Epoch 20/100, Loss: 0.37447400556670296, Train Accuracy: 85.76%, Val Accuracy: 86.51%
Epoch 30/100, Loss: 0.35908017887009513, Train Accuracy: 86.54%, Val Accuracy: 86.19%
Epoch 40/100, Loss: 0.35851213998264736, Train Accuracy: 86.54%, Val Accuracy: 85.55%
Epoch 50/100, Loss: 0.3640291723940108, Train Accuracy: 87.10%, Val Accuracy: 86.21%
Epoch 60/100, Loss: 0.3537207245826721, Train Accuracy: 87.32%, Val Accuracy: 86.14%
Epoch 70/100, Loss: 0.33135310146543717, Train Accuracy: 87.37%, Val Accuracy: 85.57%
Epoch 80/100, Loss: 0.3299876418378618, Train Accuracy: 87.58%, Val Accuracy: 85.57%
Epoch 90/100, Loss: 0.3293519483672248, Train Accuracy: 88.11%, Val Accuracy: 85.17%
Epoch 100/100, Loss: 0.3050866706503762, Train Accuracy: 86.54%, Val Accuracy: 85.25%
CPU times: user 8min 29s, sys: 1min 48s, total: 10min 17s
Wall time: 1min 22s


In [27]:
LSTM_MODEL_TRAINED = deepcopy(model)

## Сравним две модели

In [48]:
for name, model in zip(['LSTM', 'CNN'], [LSTM_MODEL_TRAINED, CNN_MODEL_TRAINED]):
    print(name)    
    model.eval()
    with torch.no_grad():
        y_pred = model(data).squeeze().argmax(1).int().numpy()
        display(classification_report_pd(y_test, y_pred))
        display(confusion_matrix(y_test, y_pred))

LSTM


Unnamed: 0,precision,recall,f1-score,support
0,0.860577,0.954667,0.905183,375
1,0.5,0.226667,0.311927,75
accuracy,,,0.833333,450
macro avg,0.680288,0.590667,0.608555,450
weighted avg,0.800481,0.833333,0.806307,450


array([[358,  17],
       [ 58,  17]])

CNN


Unnamed: 0,precision,recall,f1-score,support
0,0.856459,0.954667,0.9029,375
1,0.46875,0.2,0.280374,75
accuracy,,,0.828889,450
macro avg,0.662605,0.577333,0.591637,450
weighted avg,0.791841,0.828889,0.799146,450


array([[358,  17],
       [ 60,  15]])

Из выведенных метрик видно, что LST лучше справилась с задачей, чем CNN. Возможно, если начать подбор гиперпараметров (количество слоев, нейронов в каждом слое), то модели выучатся лучше, и метрики будут поближе к единице. 

Кроме того, данных определенно не хватает. Если взять другой (побольше) датасет -- определенно получилось бы лучше. 