Preliminary reading on PyTorch:
* [Tensors in PyTorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Automatic differentiation and what .backward() does](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [A very simple neural network in PyTorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Text classification

## Fakenews

1. We will work with the fake-news data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize the texts.
4. Train a classification algorithm on the resulting vectors.

We have already seen how this task is solved with Word2vec. Let's recap.
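Before going through the real data, steps 2 to 4 can be sketched end to end on a toy corpus. This is only an illustration under assumptions: the `toy_texts` and `toy_labels` lists are invented, and a bag-of-words vectorizer stands in for whichever vectorization is chosen later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for the `tweet` and `label` columns.
toy_texts = [
    "officials report new cases today",
    "health ministry reports testing numbers",
    "states reported a small rise in deaths",
    "the agency published updated case counts",
    "miracle cure found in kitchen spice",
    "drinking bleach kills the virus",
    "secret lab created the virus on purpose",
    "garlic water cures the infection overnight",
]
toy_labels = ["real"] * 4 + ["fake"] * 4

# Step 2: split into train and test (stratified so both classes appear in each part).
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Steps 3 and 4: vectorize (lowercasing included) and fit a classifier.
clf = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is fitted on the training part only, which avoids leaking test-set words into the features.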

In [None]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2021-10-22 18:12:21--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv’


2021-10-22 18:12:21 (21.7 MB/s) - ‘Constraint_Train.csv’ saved [1253562/1253562]



In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('Constraint_Train.csv')

In [5]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [6]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Pavel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:01<00:00, 3479.27it/s]


In [9]:
sentences[0]

['the',
 'cdc',
 'currently',
 'reports',
 '99031',
 'deaths',
 '.',
 'in',
 'general',
 'the',
 'discrepancies',
 'in',
 'death',
 'counts',
 'between',
 'different',
 'sources',
 'are',
 'small',
 'and',
 'explicable',
 '.',
 'the',
 'death',
 'toll',
 'stands',
 'at',
 'roughly',
 '100000',
 'people',
 'today',
 '.']

In [11]:
from gensim.models.word2vec import Word2Vec

model_tweets = Word2Vec(sentences, workers=6, vector_size=300, min_count=3, window=5, epochs=15)

In [12]:
model_tweets.wv.most_similar('bad')

[('similar', 0.8360995650291443),
 ('never', 0.8355663418769836),
 ('autopsies', 0.8266803622245789),
 ('maybe', 0.8262165188789368),
 ('coronil', 0.8224976658821106),
 ('lysol', 0.8209189176559448),
 ('ask', 0.8208632469177246),
 ('weed', 0.8206198215484619),
 ('natural', 0.8197526335716248),
 ('say', 0.8176798820495605)]

In [8]:
import numpy as np

In [53]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        result = np.zeros(300)
    return result

In [160]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:02<00:00, 2618.74it/s]


In [85]:
def minmaxnorm(array):
    min_, max_ = min(array), max(array)
    for i in range(len(array)):
        if min_ != max_:
            array[i] = (array[i] - min_) / (max_ - min_)
    return array
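The element-by-element loop above can also be written with NumPy array operations. The sketch below is a hypothetical alternative (the name `minmaxnorm_vectorized` is not part of the notebook) that applies the same formula and likewise leaves a constant vector unscaled.

```python
import numpy as np

def minmaxnorm_vectorized(array):
    """Scale a 1-D vector to the range [0, 1]; a constant vector is returned unscaled."""
    array = np.asarray(array, dtype=float)
    min_, max_ = array.min(), array.max()
    if min_ == max_:
        # Avoid division by zero, mirroring the `min_ != max_` check in the loop version.
        return array
    return (array - min_) / (max_ - min_)

example = minmaxnorm_vectorized([2.0, 4.0, 6.0])
print(example)  # [0.  0.5 1. ]
```

Unlike the loop version, this variant does not modify its input in place when scaling; it returns a new array.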

In [158]:
features = [minmaxnorm(feat) for feat in features]

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [149]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.25, random_state=42)

In [150]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [151]:
from sklearn.metrics import classification_report

In [152]:
predicted = model.predict(X_test)

In [153]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92       768
           1       0.92      0.92      0.92       837

    accuracy                           0.92      1605
   macro avg       0.92      0.92      0.92      1605
weighted avg       0.92      0.92      0.92      1605



### Let's build a simple bag of words based on one-hot vectors

In [107]:
from sklearn.feature_extraction.text import CountVectorizer

In [108]:
vec = CountVectorizer()

In [154]:
bow = vec.fit_transform(df.tweet)
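To make the representation concrete, here is a small sketch of what `CountVectorizer` produces. The two-sentence `toy_corpus` is invented for illustration; each row is the sum of the one-hot vectors of the words in that text, so a repeated word yields a count above 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented two-document corpus, not taken from the dataset.
toy_corpus = ["masks help", "masks masks work"]

toy_vec = CountVectorizer()
toy_bow = toy_vec.fit_transform(toy_corpus)  # sparse matrix: documents x vocabulary

# Recover the vocabulary in column order from the fitted word -> column mapping.
vocabulary = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
dense_counts = toy_bow.toarray()

print(vocabulary)    # ['help', 'masks', 'work']
print(dense_counts)  # [[1 1 0]
                     #  [0 2 1]]
```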

In [155]:
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.25, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

In [157]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.91      0.92      0.92       768
           1       0.93      0.92      0.92       837

    accuracy                           0.92      1605
   macro avg       0.92      0.92      0.92      1605
weighted avg       0.92      0.92      0.92      1605



### We tried min-max normalization of the data for the word2vec model, but the results dropped slightly, so no further preprocessing was applied

### PyTorch + LSTM

In [14]:
labels = (df.label == 'real').astype(int).to_list()

We need to set the maximum sentence length in advance.

In [15]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [16]:
max_len

1592

That is too long. But what length is typical?

In [17]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [18]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Let's set the maximum to 200.
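One way to sanity-check such a cutoff is to measure what share of texts fits within it and then truncate the rest. The sketch below uses invented token lists (`toy_token_lists`) in place of the real `token_lists`, so the printed share says nothing about the actual dataset.

```python
import numpy as np

# Invented token lists of various lengths, standing in for `token_lists`.
toy_token_lists = [["tok"] * n for n in (5, 18, 20, 25, 300)]
lengths = np.array([len(tokens) for tokens in toy_token_lists])

max_len = 200

# Share of texts that are not cut by the chosen maximum length.
share_covered = (lengths <= max_len).mean()

# Truncate every text to at most `max_len` tokens.
truncated = [tokens[:max_len] for tokens in toy_token_lists]

print(share_covered)                            # 0.8
print(max(len(tokens) for tokens in truncated)) # 200
```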

Let's take the same w2v embeddings.

In [19]:
'''def get_word_embedding(tokens, max_len):
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result'''

In [51]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.average(result, axis=0)
    else:
        result = np.zeros(300)
    return result

In [52]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:02<00:00, 2362.69it/s]


In [53]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25)

In [54]:
import torch
import torch.nn as nn
import torch.optim as optim

In [55]:
len(X_train)

4815

In [56]:
len(X_train[0])

300

In [128]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.out = nn.Linear(300, 1)

    def forward(self, x):
        return torch.sigmoid(self.out(x))

net = Net()
print(net)

Net(
  (out): Linear(in_features=300, out_features=1, bias=True)
)


In [129]:
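# Stack the list of vectors into one array first; converting a list of arrays directly is slow.
in_data = torch.tensor(np.array(X_train)).float()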
targets = torch.tensor(y_train).float()

In [130]:
in_data.shape

torch.Size([4815, 300])

In [131]:
optimizer = optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.BCELoss()

In [144]:
def train_one_epoch(in_data, targets, batch_size=64):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)
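The function above prints only the loss of the last batch and always visits the batches in the same order. A possible variant, sketched here on random stand-in data, shuffles the samples each epoch and returns the mean loss over the whole epoch. All `toy_*` names and the function name `train_one_epoch_shuffled` are assumptions made for this illustration, not part of the notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Random stand-in data with the same layout as `in_data` / `targets` (not the real features).
toy_x = torch.randn(256, 300)
toy_y = (toy_x[:, 0] > 0).float()

# Same architecture as `Net`: one linear layer followed by a sigmoid.
toy_net = nn.Sequential(nn.Linear(300, 1), nn.Sigmoid())
toy_optimizer = optim.Adam(toy_net.parameters(), lr=1e-3)
toy_criterion = nn.BCELoss()

def train_one_epoch_shuffled(x, y, batch_size=64):
    """Run one epoch over shuffled mini-batches and return the mean per-sample loss."""
    permutation = torch.randperm(x.shape[0])
    total_loss = 0.0
    for start in range(0, x.shape[0], batch_size):
        idx = permutation[start:start + batch_size]
        toy_optimizer.zero_grad()
        output = toy_net(x[idx]).reshape(-1)
        loss = toy_criterion(output, y[idx])
        loss.backward()
        toy_optimizer.step()
        # Weight by batch size so a smaller final batch does not skew the average.
        total_loss += loss.item() * len(idx)
    return total_loss / x.shape[0]

epoch_losses = [train_one_epoch_shuffled(toy_x, toy_y) for _ in range(5)]
```

Tracking the epoch average rather than a single batch gives a less noisy picture of whether training is still making progress.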

In [145]:
for _ in range(50):
    train_one_epoch(in_data, targets)

100%|██████████| 76/76 [00:00<00:00, 1407.04it/s]


tensor(0.2647, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.81it/s]


tensor(0.2643, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1489.86it/s]


tensor(0.2639, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1151.25it/s]


tensor(0.2635, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 863.45it/s]


tensor(0.2631, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 926.62it/s]


tensor(0.2628, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1381.57it/s]


tensor(0.2624, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1726.85it/s]


tensor(0.2621, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.77it/s]


tensor(0.2618, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1616.65it/s]


tensor(0.2615, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.76it/s]


tensor(0.2612, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1616.60it/s]


tensor(0.2609, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1583.01it/s]


tensor(0.2606, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.77it/s]


tensor(0.2603, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1187.16it/s]


tensor(0.2600, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1026.85it/s]


tensor(0.2598, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1206.07it/s]


tensor(0.2595, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1616.62it/s]


tensor(0.2593, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1519.70it/s]


tensor(0.2590, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.78it/s]


tensor(0.2588, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1899.61it/s]


tensor(0.2585, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1767.10it/s]


tensor(0.2583, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1853.24it/s]


tensor(0.2580, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1767.03it/s]


tensor(0.2578, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1809.11it/s]


tensor(0.2576, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1582.94it/s]


tensor(0.2574, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1616.67it/s]


tensor(0.2571, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 938.06it/s]


tensor(0.2569, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1333.07it/s]


tensor(0.2567, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1809.08it/s]


tensor(0.2565, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1809.15it/s]


tensor(0.2563, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1809.14it/s]


tensor(0.2561, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1767.08it/s]


tensor(0.2559, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1726.90it/s]


tensor(0.2557, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1489.81it/s]


tensor(0.2555, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1550.68it/s]


tensor(0.2553, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1407.17it/s]


tensor(0.2551, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1550.71it/s]


tensor(0.2549, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1688.55it/s]


tensor(0.2547, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1407.09it/s]


tensor(0.2546, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1356.82it/s]


tensor(0.2544, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 883.53it/s]


tensor(0.2542, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1433.63it/s]


tensor(0.2540, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1651.81it/s]


tensor(0.2539, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1726.86it/s]


tensor(0.2537, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1688.54it/s]


tensor(0.2535, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1550.67it/s]


tensor(0.2534, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1433.60it/s]


tensor(0.2532, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1519.68it/s]


tensor(0.2530, grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 76/76 [00:00<00:00, 1407.10it/s]

tensor(0.2529, grad_fn=<BinaryCrossEntropyBackward0>)





What did we get?

In [146]:
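# Stack the list of vectors into one array first; converting a list of arrays directly is slow.
in_data_test = torch.tensor(np.array(X_test)).float()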
targets_test = torch.tensor(y_test).float()

In [147]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [148]:
result = (output > 0.5) == targets_test

In [149]:
result.sum().item() / len(result)

0.9065420560747663
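To compare the network with the earlier models on the same terms, the thresholded outputs can also be passed to `classification_report`. The sketch below uses four invented probabilities and targets (`toy_output`, `toy_targets`), so its numbers are illustrative only.

```python
import torch
from sklearn.metrics import classification_report

# Invented sigmoid outputs and true labels, standing in for `output` / `targets_test`.
toy_output = torch.tensor([0.9, 0.2, 0.7, 0.4])
toy_targets = torch.tensor([1.0, 0.0, 0.0, 0.0])

# Threshold the probabilities at 0.5 to get hard class predictions.
toy_predicted = (toy_output > 0.5).int()

# Accuracy: share of predictions that match the targets.
accuracy = (toy_predicted == toy_targets.int()).float().mean().item()

# Per-class precision, recall and F1, in the same format as the earlier reports.
report = classification_report(
    toy_targets.int().numpy(), toy_predicted.numpy(), zero_division=0
)

print(accuracy)  # 0.75
print(report)
```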

## On this dataset, the best results came from the simplest model, bag of words, and from the word2vec embedding model. The recurrent neural network with an LSTM layer gave very poor results after training for 30 epochs, which is rather strange. After removing the LSTM layer, the result improved considerably and became almost equal to that of the previous model; however, for the network as built, this value is close to its "ceiling".
## How can the model be improved without resorting to transformers or other model types?
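One direction worth trying, offered here as an untested assumption rather than a confirmed fix, is to keep the LSTM but stop it from reading the zero padding: when most tweets have around 20 tokens and sequences are padded to 200, the last hidden state may be dominated by padding. The sketch below packs the padded batch with the true lengths. The class name `LSTMClassifier`, the hidden size and the random `toy_*` tensors are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, padded, lengths):
        # Packing tells the LSTM where each sequence really ends,
        # so padded positions are never processed.
        packed = pack_padded_sequence(
            padded, lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the hidden state at the last real token of each sequence.
        return self.out(h_n[-1]).reshape(-1)  # raw logits, no sigmoid

torch.manual_seed(0)

# Random stand-in batch: 4 sequences padded to 10 steps of 300-dim embeddings.
toy_batch = torch.randn(4, 10, 300)
toy_lengths = torch.tensor([10, 7, 3, 5])  # true lengths; must live on the CPU
toy_labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

toy_model = LSTMClassifier()
toy_logits = toy_model(toy_batch, toy_lengths)

# BCEWithLogitsLoss combines the sigmoid and binary cross-entropy in one step.
toy_loss = nn.BCEWithLogitsLoss()(toy_logits, toy_labels)
```

Because the model returns logits, predictions at evaluation time would be obtained with `torch.sigmoid(logits) > 0.5`. Whether this actually closes the gap to the simpler models on this dataset would have to be checked experimentally.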