Предварительно про PyTorch:
* [Про тензоры в pytorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Про автоматическое дифференцирование и что такое .backwards()](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [Очень простая нейронка на pytorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Классификация текстов

## Fakenews

1. Мы будем работать с данными fakenews отсюда: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Проведите препроцессинг текста. Разбейте данные на train и test для задачи классификации.
3. Векторизуйте.
4. Обучите на полученных векторах алгоритм классификации.

Мы уже видели как эта задача выполняется с помощью Word2vec. Давайте вспомним.

In [None]:
#!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2021-10-22 18:12:21--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv’


2021-10-22 18:12:21 (21.7 MB/s) - ‘Constraint_Train.csv’ saved [1253562/1253562]



In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier



In [2]:
df = pd.read_csv('Constraint_Train.csv')

In [3]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [4]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BEU_RU1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
from nltk.corpus import stopwords

In [8]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:05<00:00, 1106.89it/s]


In [9]:
# увеличим количество эпох

In [23]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=4, vector_size=300, min_count=3, window=5, epochs=25)

Wall time: 4.79 s


In [24]:
model_tweets.wv.most_similar('france')

[('2015', 0.7670807242393494),
 ('victims', 0.7295958399772644),
 ('floor', 0.7201323509216309),
 ('arrest', 0.719237208366394),
 ('originated', 0.705829381942749),
 ('bags', 0.7043509483337402),
 ('corpses', 0.7031402587890625),
 ('spain', 0.7030721306800842),
 ('streets', 0.7030181288719177),
 ('lying', 0.6849498748779297)]

In [25]:
model_tweets.init_sims()

In [26]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        result = np.zeros(300)
    return result

In [27]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:03<00:00, 1904.34it/s]


In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)

In [29]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [20]:
from sklearn.metrics import classification_report

In [30]:
predicted = model.predict(X_test)

In [31]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.91      0.93      0.92      1034
        real       0.93      0.91      0.92      1085

    accuracy                           0.92      2119
   macro avg       0.92      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119



###  Что будет, если использовать самый наивный метод?

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
vec = CountVectorizer()

In [34]:
bow = vec.fit_transform(df.tweet)

In [35]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [36]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.93      0.94      0.93      1001
        real       0.94      0.93      0.94      1118

    accuracy                           0.93      2119
   macro avg       0.93      0.93      0.93      2119
weighted avg       0.93      0.93      0.93      2119



Конечно, мы всегда можем поиграться с предобработкой.

In [37]:
vec = CountVectorizer(ngram_range=(1, 1), stop_words=stopwords.words('english'))
bow = vec.fit_transform(df.tweet)

In [38]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(bow, df.label, test_size=0.33)
lr2 = LogisticRegression()
lr2.fit(X_train_2, y_train_2)

LogisticRegression()

In [39]:
predicted_2 = lr2.predict(X_test_2)
print(classification_report(y_test_2, predicted_2))

              precision    recall  f1-score   support

        fake       0.93      0.93      0.93      1001
        real       0.94      0.93      0.94      1118

    accuracy                           0.93      2119
   macro avg       0.93      0.93      0.93      2119
weighted avg       0.93      0.93      0.93      2119



### PyTorch + LSTM

In [40]:
labels = (df.label == 'real').astype(int).to_list()

Нужно заранее задать размер для макксимальной длины предложений.

In [41]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [42]:
max_len

1592

Это слишком много. Но какая длина обычно?

In [43]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [44]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Зададим максимум 200.

Возьмём те же w2v эмбеддинги.

In [127]:
def get_word_embedding(tokens, max_len):
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result

In [128]:
features = [get_word_embedding(text, 200) for text in tqdm(token_lists)]

100%|██████████| 6420/6420 [00:05<00:00, 1140.71it/s]


In [129]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)

In [50]:
import torch
import torch.nn as nn
import torch.optim as optim

In [130]:
len(features[0][0])

300

In [131]:
len(X_train)

4301

In [132]:
len(X_train[0])

200

In [133]:
len(X_train[0][0])

300

In [134]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        prediction = torch.sigmoid(self.out(longterm))
        return prediction


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


In [135]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()

In [136]:
in_data.shape

torch.Size([4301, 200, 300])

In [137]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [138]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [139]:
train_one_epoch(in_data, targets)

100%|██████████| 269/269 [02:05<00:00,  2.15it/s]

tensor(0.6909, grad_fn=<BinaryCrossEntropyBackward0>)





Что получилось?

In [140]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [141]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [142]:
result = (output > 0.5) == targets_test

In [143]:
result.sum().item() / len(result)

0.512977819726286

Но такую модель надо учить дольше(

In [167]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(300, 100, dropout=0.5)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        prediction = torch.nn.functional.sigmoid(self.out(longterm))
        return prediction


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100, dropout=0.5)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


In [168]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()

In [169]:
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
criterion = nn.BCELoss()

In [170]:
def train_one_epoch(in_data, targets, batch_size=16):
    for b in tqdm(range(0, in_data.shape[0])):
        for i in range(0, 200, batch_size): #tqdm(range(0, in_data.shape[0], batch_size)):
            batch_x = in_data[b:b+1, i:i + batch_size]
            batch_y = targets[b:b+1]
            optimizer.zero_grad()
            output = net(batch_x)
            loss = criterion(output.reshape(-1), batch_y)
            loss.backward()
            optimizer.step()
    print(loss)

In [171]:
train_one_epoch(in_data, targets)

100%|██████████| 4301/4301 [06:23<00:00, 11.21it/s]

tensor(0.3721, grad_fn=<BinaryCrossEntropyBackward0>)





In [172]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [173]:
in_data_test.shape, targets_test.shape

(torch.Size([2119, 200, 300]), torch.Size([2119]))

In [174]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [175]:
result = (output > 0.5) == targets_test

In [176]:
result.sum().item() / len(result)

0.5884851344974045