# Глубокое обучение и обработка естественного языка

## Домашняя работа №4

Исходный набор данных - [Fake and real news dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)
2. Реализовать классификацию двумя моделями: CNN, LSTM - 6 баллов = 3 + 3
3. Сравнить качество обученных моделей 1 балл
4. Обеспечена воспроизводимость решения: зафиксированы random_state, ноутбук воспроизводится от начала до конца без ошибок - 2 балла
5. Соблюден code style на уровне pep8 и [On writing clean Jupyter notebooks](https://ploomber.io/blog/clean-nbs/) - 1 балл

Примеры: [Using Convolution Neural Networks to Classify Text in PyTorch](https://tzuruey.medium.com/using-convolution-neural-networks-to-classify-text-in-pytorch-3b626a42c3ca), [LSTM in Pytorch](https://wandb.ai/sauravmaheshkar/LSTM-PyTorch/reports/Using-LSTM-in-PyTorch-A-Tutorial-With-Examples--VmlldzoxMDA2NTA5)

In [79]:
# установка torchmetrics
!pip install torchmetrics



In [80]:
# подключение библиотек
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torchmetrics import F1Score

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
nltk.download("punkt")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [81]:
seed = 2023

np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x7f39bdb82df0>

In [82]:
# функция, переводит текст в список id-слов
def text_to_sequence(text, maxlen, vocabulary):
    result = []
    tokens = word_tokenize(text.lower())
    tokens_filtered = [word for word in tokens if word.isalnum()]

    for word in tokens_filtered:
        if word in vocabulary:
            result.append(vocabulary[word])
    padding = [0]*(maxlen-len(result))

    return padding + result[-maxlen:]

In [83]:
# функция обучения модели
def train(model, train_loader, epochs=10):
  model.train()

  f1 = F1Score(task="binary")
  optimizer = torch.optim.Adam(model.parameters(), lr=10e-3)
  criterion = nn.CrossEntropyLoss()

  for epoch in range(1, epochs + 1):
      print(f"Train epoch {epoch}/{epochs}")
      temp_loss = []
      temp_metrics = []
      for i, (data, target) in enumerate(train_loader):
          optimizer.zero_grad()
          output = model(data)

          loss = criterion(output, target)
          loss.backward()

          optimizer.step()
          temp_loss.append(loss.float().item())
          temp_metrics.append(f1(output.argmax(1), target).item())

      epoch_loss = np.array(temp_loss).mean()
      epoch_f1 = np.array(temp_metrics).mean()
      print(f'Loss: {epoch_loss}, f1 score: {epoch_f1}')

In [84]:
# функция оценки модели
def eval(model, val_loader):
  f1 = F1Score(task="binary")
  temp_metrics = []

  for i, (data, target) in enumerate(val_loader):
    output = model(data)
    temp_metrics.append(f1(output.argmax(1), target).item())

  f1_mean = np.array(temp_metrics).mean()
  print(f'F1 score: {f1_mean}')

In [85]:
# класс, обертка над данными
class TextDataWrapper(Dataset):
    def __init__(self, data, target=None, transform=None):
        self.data = torch.from_numpy(data).long()
        if target is not None:
            self.target = torch.from_numpy(target).long()
        else:
          self.target = None
        self.transform = transform

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index] if self.target is not None else -1

        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.data)

In [86]:
# класс, CNN
class ConvTextClassifier(nn.Module):
    def __init__(self, vocab_size=2000, embedding_dim=128, out_channel=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, out_channel, kernel_size=3)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(out_channel, num_classes)

    def forward(self, x):
        output = self.embedding(x)
        output = output.permute(0, 2, 1) # bs, emb_dim, len
        output = self.conv(output)
        output = self.relu(output)
        output = torch.max(output, axis=2).values
        output = self.linear(output)
        return output

In [87]:
# класс, LSTM
class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size=2000, embedding_dim=128, out_channel=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, out_channel, batch_first=True)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(out_channel, num_classes)

    def forward(self, x):
        output = self.embedding(x)
        output, (hn, cn) = self.lstm(output)
        hn = hn.squeeze()
        output = self.relu(hn)
        output = self.linear(output)
        return output

## Загрузка данных


In [88]:
df_fake = pd.read_csv('Fake.csv')
df_true = pd.read_csv('True.csv')

In [89]:
df_fake['class'] = 0
df_true['class'] = 1

In [90]:
df = pd.concat([df_fake, df_true], axis=0)

In [91]:
df

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
...,...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1


Работа с полем `text`

In [92]:
df.drop(columns=['title', 'subject', 'date'], axis=0, inplace=True)

## Предобработка

Создание корпуса

In [93]:
train_corpus = list(df['text'])
tokens = []

for text in tqdm(train_corpus):
  tokens.extend(word_tokenize(text.lower()))
tokens_filtered = [word for word in tokens if word.isalnum()]

  0%|          | 0/44898 [00:00<?, ?it/s]

In [94]:
max_words = 2000
dist = FreqDist(tokens_filtered)
tokens_filtered_top = [pair[0] for pair in dist.most_common(max_words-1)]

In [95]:
vocabulary = {v: k for k, v in dict(enumerate(tokens_filtered_top, 1)).items()}

Создание `train` и `test`

In [96]:
batch_size = 256
max_len = 40

In [97]:
df_train, df_test = train_test_split(df, test_size=0.2)

x_train = np.array([text_to_sequence(text, max_len, vocabulary) for text in tqdm(df_train["text"])], dtype=np.int32)
x_test = np.array([text_to_sequence(text, max_len, vocabulary) for text in tqdm(df_test["text"])], dtype=np.int32)
y_train = np.array(df_train["class"])
y_test = np.array(df_test["class"])

  0%|          | 0/35918 [00:00<?, ?it/s]

  0%|          | 0/8980 [00:00<?, ?it/s]

In [98]:
train_dataset = TextDataWrapper(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TextDataWrapper(x_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

## Классификация. CNN и LSTM

In [99]:
epochs = 10

### CNN

In [100]:
cnn = ConvTextClassifier()
print(cnn)
print("Parameters:", sum([param.nelement() for param in cnn.parameters()]))

ConvTextClassifier(
  (embedding): Embedding(2000, 128)
  (conv): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
  (relu): ReLU()
  (linear): Linear(in_features=128, out_features=2, bias=True)
)
Parameters: 305538


In [101]:
train(cnn, train_loader, epochs=10)

Train epoch 1/10
Loss: 0.25232159378046687, f1 score: 0.8849495133609636
Train epoch 2/10
Loss: 0.10074995880854046, f1 score: 0.9595868240856955
Train epoch 3/10
Loss: 0.06951293296425055, f1 score: 0.9727606464785041
Train epoch 4/10
Loss: 0.08964658450460762, f1 score: 0.9717014178316644
Train epoch 5/10
Loss: 0.022595512904457958, f1 score: 0.9918189395404031
Train epoch 6/10
Loss: 0.008140199879433709, f1 score: 0.997181256612142
Train epoch 7/10
Loss: 0.003930475542324691, f1 score: 0.9988840382995335
Train epoch 8/10
Loss: 0.0011929141159180935, f1 score: 0.9998496770858765
Train epoch 9/10
Loss: 0.0005957238341283081, f1 score: 0.9999703253414614
Train epoch 10/10
Loss: 0.0005124539895198768, f1 score: 0.9999698206042567


In [102]:
eval(cnn, test_loader)

F1 score: 0.9685539801915487


### LSTM

In [103]:
lstm = LSTMTextClassifier()
print(lstm)
print("Parameters:", sum([param.nelement() for param in lstm.parameters()]))

LSTMTextClassifier(
  (embedding): Embedding(2000, 128)
  (lstm): LSTM(128, 128, batch_first=True)
  (relu): ReLU()
  (linear): Linear(in_features=128, out_features=2, bias=True)
)
Parameters: 388354


In [104]:
train(lstm, train_loader, epochs=10)

Train epoch 1/10
Loss: 0.21653707108810438, f1 score: 0.9117162485494681
Train epoch 2/10
Loss: 0.11361222595293471, f1 score: 0.9568147185846423
Train epoch 3/10
Loss: 0.07518546293813286, f1 score: 0.9719462914669768
Train epoch 4/10
Loss: 0.052561427314653464, f1 score: 0.9805174055674397
Train epoch 5/10
Loss: 0.04264911350047757, f1 score: 0.9850408456004258
Train epoch 6/10
Loss: 0.03007368272144004, f1 score: 0.989530103849181
Train epoch 7/10
Loss: 0.02314805885750457, f1 score: 0.9916127514331898
Train epoch 8/10
Loss: 0.02889553175129472, f1 score: 0.9891546520781009
Train epoch 9/10
Loss: 0.029155359719054284, f1 score: 0.9891024413683736
Train epoch 10/10
Loss: 0.022231496097718146, f1 score: 0.9920127459451662


In [105]:
eval(lstm, test_loader)

F1 score: 0.9652212378051546


In [106]:
!pip freeze > requirements.txt