Using the class notebook (also available in the Materials folder) and the fakenews data, obtain an f1 score above 0.91 with sklearn methods and above 0.52 with pytorch methods on the classification task, three times and each time with a different approach.

# Text classification

## Fakenews

1. We will work with the fakenews data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize the text.
4. Train a classification algorithm on the resulting vectors.

We have already seen how this task is solved with Word2vec. Let's recap.

In [1]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2024-08-17 16:09:42--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1,2M) [text/plain]
Saving to: 'Constraint_Train.csv.3'


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('Constraint_Train.csv')

In [4]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [5]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yuril\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:01<00:00, 4340.75it/s]


In [8]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=6, vector_size=300, min_count=3, window=5, epochs=30)

CPU times: total: 5.56 s
Wall time: 3.63 s


In [9]:
model_tweets.wv.most_similar('vaccine')

[('drug', 0.5828571915626526),
 ('vaccines', 0.5670225024223328),
 ('developed', 0.5512670874595642),
 ('trial', 0.5361182689666748),
 ('cure', 0.5307860970497131),
 ('manufacturing', 0.5240650773048401),
 ('remedy', 0.49537548422813416),
 ('company', 0.4888715147972107),
 ('trials', 0.48215413093566895),
 ('therapeutics', 0.4667731523513794)]

In [10]:
model_tweets.wv.fill_norms()
#model_tweets.wv.init_sims()

In [11]:
import numpy as np

In [12]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.average(result, axis=0)
    else:
        result = np.zeros(300)
    return result
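The function above represents a tweet as the average of its known word vectors and falls back to a zero vector when no token is in the vocabulary. A minimal sketch of the same idea on toy data follows; `toy_vectors` and `mean_embedding` are hypothetical names, and the 3-dimensional vectors only stand in for the 300-dimensional Word2Vec vectors.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for model_tweets.wv
toy_vectors = {
    "covid": np.array([1.0, 0.0, 2.0]),
    "vaccine": np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# The unknown token is skipped, so only two vectors are averaged
doc_vec = mean_embedding(["covid", "vaccine", "unknownword"], toy_vectors, dim=3)
# No known tokens at all: the fallback zero vector is returned
empty_vec = mean_embedding(["unknownword"], toy_vectors, dim=3)
print(doc_vec)
print(empty_vec)
```

Averaging discards word order, which is one reason a sequence model such as an LSTM is tried later in the notebook.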

In [13]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:02<00:00, 3048.64it/s]


In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.3,  random_state=42)

In [16]:
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

In [17]:
from sklearn.metrics import classification_report

In [18]:
predicted = model.predict(X_test)

In [19]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.93      0.88      0.90       920
        real       0.90      0.94      0.92      1006

    accuracy                           0.91      1926
   macro avg       0.91      0.91      0.91      1926
weighted avg       0.91      0.91      0.91      1926



In [20]:
# goal: f1 above 0.91 for sklearn methods; here macro avg f1 is 0.91 (0.90 for fake, 0.92 for real)

### What happens if we use the most naive method?

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
vec = CountVectorizer()

In [23]:
bow = vec.fit_transform(df.tweet)
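`CountVectorizer` turns each text into a vector of word counts over a shared vocabulary. The sketch below illustrates the idea in plain Python on two made-up sentences. It assumes simple whitespace tokenization, which is not how `CountVectorizer` tokenizes, and `count_vector` is a hypothetical helper.

```python
from collections import Counter

# Two made-up documents; whitespace tokenization is a simplifying assumption
docs = ["covid cases rise", "covid vaccine trial covid"]

# Shared vocabulary, sorted so that column order is deterministic
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return the count of every vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

matrix = [count_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, so real corpora produce very wide and mostly zero matrices, which is why scikit-learn returns a sparse matrix.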

In [24]:
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

In [25]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.91      0.93      0.92       920
        real       0.93      0.92      0.92      1006

    accuracy                           0.92      1926
   macro avg       0.92      0.92      0.92      1926
weighted avg       0.92      0.92      0.92      1926



In [26]:
# f1 above 0.91 for sklearn methods: macro avg f1 is 0.92 here

In [27]:
# let's try the Tf-Idf vectorizer with default parameters
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [28]:
bow = tfidf.fit_transform(df.tweet)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

In [30]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.90      0.93      0.92       920
        real       0.94      0.91      0.92      1006

    accuracy                           0.92      1926
   macro avg       0.92      0.92      0.92      1926
weighted avg       0.92      0.92      0.92      1926



In [31]:
# the f1-score stayed the same: 0.92
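Tf-Idf reweights raw counts so that words occurring in many documents contribute less. The sketch below reproduces the weighting by hand on three toy documents, assuming the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`), for which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The documents and the helper names `smoothed_idf` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical data)
docs = [
    ["covid", "vaccine", "trial"],
    ["covid", "cases", "rise"],
    ["covid", "vaccine"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf_row(doc):
    """Raw count times idf, followed by L2 normalization of the row."""
    counts = Counter(doc)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row = tfidf_row(docs[0])
# "covid" appears in every document, so it gets the smallest weight in the row
print(sorted(row.items(), key=lambda item: item[1]))
```

Since the logistic regression can rescale individual features anyway, it is plausible that this reweighting changes little on this dataset, which would be consistent with the unchanged score above.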

### PyTorch + LSTM

In [32]:
labels = (df.label == 'real').astype(int).to_list()

The maximum sentence length has to be fixed in advance.

In [33]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [34]:
max_len

1592

That is too long. But what is the typical length?

In [35]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [36]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Let's set the maximum to 300.
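One way to sanity-check such a cap is to look at a percentile of the length distribution and at the share of texts that fit under the cap. The sketch below uses a made-up list of lengths; in the notebook the values would come from `[len(tokens) for tokens in token_lists]`. The helper `coverage` is a hypothetical name.

```python
import numpy as np

# Hypothetical token counts per tweet, including one extreme outlier
lengths = np.array([15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 1592])

def coverage(lengths, cap):
    """Share of texts whose length does not exceed the cap."""
    return float(np.mean(lengths <= cap))

# 90th percentile of the lengths: most texts are far shorter than the maximum
p90 = int(np.percentile(lengths, 90))
fits_300 = coverage(lengths, 300)
print(p90, fits_300)
```

A smaller cap than 300 would shorten the padded sequences and speed up training, at the cost of truncating more of the long texts.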

We'll use the same w2v embeddings.

In [37]:
def get_word_embedding(tokens, max_len):
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result
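The function above pads every text with zero vectors up to `max_len` and also uses a zero vector for unknown words. A possible variant, sketched below under the assumption that a dict-like lookup replaces `model_tweets.wv`, writes directly into a preallocated `float32` array. Stacking such arrays gives a single ndarray, which is typically faster to convert to a tensor than a list of lists of arrays. The names `pad_embeddings` and `toy_vectors` are hypothetical.

```python
import numpy as np

def pad_embeddings(tokens, vectors, max_len, dim):
    """Return a (max_len, dim) matrix: truncated to max_len, zero-padded after the text."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, word in enumerate(tokens[:max_len]):
        if word in vectors:
            out[i] = vectors[word]  # unknown words keep their zero row
    return out

# Hypothetical 4-dimensional vectors standing in for the Word2Vec model
toy_vectors = {"covid": np.ones(4), "vaccine": np.full(4, 2.0)}

matrix = pad_embeddings(["covid", "unknownword", "vaccine"], toy_vectors, max_len=5, dim=4)
# Stack per-text matrices into one (n_texts, max_len, dim) array
batch = np.stack([matrix, matrix])
print(matrix.shape, batch.shape)
```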

In [38]:
features = [get_word_embedding(text, 300) for text in tqdm(token_lists)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:02<00:00, 2183.04it/s]


In [39]:
import torch
import torch.nn as nn
import torch.optim as optim

In [40]:
class Net(nn.Module):

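    def __init__(self):
        super().__init__()
        # input: 300-dimensional word vectors; hidden size: 100
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default, hence the transpose;
        # it returns output, (h_n, c_n): the final hidden state and the final cell state
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        # the classifier reads the final cell state (c_n); the final hidden state (h_n) is the more common choice
        prediction = torch.sigmoid(self.out(longterm))
        return prediction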


net = Net()
#net.cuda()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


In [41]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state = 42)

In [42]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()



In [43]:
in_data.shape

torch.Size([4815, 300, 300])

In [44]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [45]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
#        batch_x = in_data[i:i + batch_size].cuda()
#        batch_y = targets[i:i + batch_size].cuda()
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [46]:
train_one_epoch(in_data, targets)

100%|████████████████████████████████████████████████████████████████████████████████| 301/301 [00:10<00:00, 27.80it/s]

tensor(0.6880, grad_fn=<BinaryCrossEntropyBackward0>)





What did we get?

In [47]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [48]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [49]:
# softmax over dim=0 normalizes across test examples, so this value is not a class probability
max(torch.nn.functional.softmax(output, dim=0))


tensor(0.0007)

In [50]:
targets_test

tensor([0., 1., 1.,  ..., 0., 1., 0.])

In [51]:
result = (output.cpu() > 0.5) == targets_test

In [52]:
result.sum().item() / len(result)

0.5214953271028038

In [53]:
# accuracy ≈ 0.5215 (the share of correct predictions; this is not the f1-score)

In [54]:
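# switch to Adam and MSE loss; net is not re-initialized, so training continues from the previous weights
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.MSELoss()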

In [55]:
train_one_epoch(in_data, targets)

100%|████████████████████████████████████████████████████████████████████████████████| 301/301 [00:11<00:00, 26.09it/s]

tensor(0.2496, grad_fn=<MseLossBackward0>)





In [56]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [57]:
result = (output > 0.5) == targets_test

In [58]:
result.sum().item() / len(result)

0.5221183800623053

In [59]:
# accuracy ≈ 0.5221 (again the share of correct predictions, not the f1-score)
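The value computed with `result.sum().item() / len(result)` is accuracy, which can differ noticeably from f1. The sketch below computes both from thresholded probabilities in plain Python; the labels and probabilities are made up, and `binary_scores` is a hypothetical helper. With tensors from the notebook, `output.tolist()` and `targets_test.tolist()` could be passed instead.

```python
def binary_scores(y_true, y_prob, threshold=0.5):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Made-up example: every probability is above 0.5, so everything is predicted as class 1
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.6, 0.7, 0.55, 0.52, 0.58, 0.51]

accuracy, precision, recall, f1 = binary_scores(y_true, y_prob)
print(accuracy, precision, recall, f1)
```

In this toy case the model predicts a single class for every example, yet f1 for that class is higher than accuracy, so the two numbers should not be reported interchangeably.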