Предварительно про PyTorch:
* [Про тензоры в pytorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Про автоматическое дифференцирование и что такое .backwards()](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [Очень простая нейронка на pytorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Классификация текстов

## Fakenews

1. Мы будем работать с данными fakenews отсюда: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Проведите препроцессинг текста. Разбейте данные на train и test для задачи классификации.
3. Векторизуйте.
4. Обучите на полученных векторах алгоритм классификации.

Мы уже видели как эта задача выполняется с помощью Word2vec. Давайте вспомним.

In [None]:
#!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2021-10-22 18:12:21--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv’


2021-10-22 18:12:21 (21.7 MB/s) - ‘Constraint_Train.csv’ saved [1253562/1253562]



In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Constraint_Train.csv')

In [3]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [4]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ivana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:04<00:00, 1358.16it/s]


In [7]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=4, size=300, min_count=3, window=5, iter=15)

Wall time: 12 s


In [8]:
model_tweets.wv.most_similar('vaccine')

[('cure', 0.8175490498542786),
 ('drug', 0.7992066144943237),
 ('developed', 0.7906081676483154),
 ('warned', 0.7869542241096497),
 ('novel', 0.7819766402244568),
 ('fight', 0.775766909122467),
 ('against', 0.7646459341049194),
 ('remedy', 0.7579882144927979),
 ('pandemic', 0.7363992929458618),
 ('crown', 0.7284566164016724)]

In [9]:
model_tweets.init_sims()

In [10]:
import numpy as np

In [11]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        result = np.zeros(300)
    return result

In [12]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:05<00:00, 1143.16it/s]


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)

### Логистическая регрессия

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [17]:
from sklearn.metrics import classification_report

In [18]:
predicted = model.predict(X_test)

In [19]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.88      0.92      0.90       995
        real       0.93      0.89      0.91      1124

    accuracy                           0.91      2119
   macro avg       0.91      0.91      0.91      2119
weighted avg       0.91      0.91      0.91      2119



### Деревья решений 

In [20]:
from sklearn.tree import DecisionTreeClassifier

In [21]:
clf = DecisionTreeClassifier(max_depth=7, random_state=0)

clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=7, random_state=0)

In [22]:
predicted_clf = clf.predict(X_test)

In [23]:
print(classification_report(y_test, predicted_clf))

              precision    recall  f1-score   support

        fake       0.87      0.86      0.87       995
        real       0.88      0.89      0.88      1124

    accuracy                           0.87      2119
   macro avg       0.87      0.87      0.87      2119
weighted avg       0.87      0.87      0.87      2119



### Случайный лес

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
rfc = RandomForestClassifier(n_jobs = -1, n_estimators = 100, max_depth=10, min_samples_leaf=5, bootstrap = True, 
                             max_features = .1, random_state=0)
rfc.fit(X_train, y_train) 

RandomForestClassifier(max_depth=10, max_features=0.1, min_samples_leaf=5,
                       n_jobs=-1, random_state=0)

In [26]:
predicted_rfc = rfc.predict(X_test)

In [27]:
print(classification_report(y_test, predicted_rfc))

              precision    recall  f1-score   support

        fake       0.92      0.91      0.92       995
        real       0.92      0.93      0.93      1124

    accuracy                           0.92      2119
   macro avg       0.92      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119



### KNeighbors

In [28]:
from sklearn.neighbors import KNeighborsClassifier

In [29]:
neigh = KNeighborsClassifier(n_neighbors=3)

neigh.fit(X_train, y_train) 

KNeighborsClassifier(n_neighbors=3)

In [30]:
predicted_neigh = neigh.predict(X_test)

In [31]:
print(classification_report(y_test, predicted_neigh))

              precision    recall  f1-score   support

        fake       0.91      0.92      0.91       995
        real       0.92      0.92      0.92      1124

    accuracy                           0.92      2119
   macro avg       0.92      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119



### SVM

In [32]:
from sklearn import svm

In [33]:
clf_svm = svm.SVC()
clf_svm.fit(X_train, y_train) 

SVC()

In [34]:
predicted_svm = clf_svm.predict(X_test)

In [35]:
print(classification_report(y_test, predicted_svm))

              precision    recall  f1-score   support

        fake       0.90      0.92      0.91       995
        real       0.93      0.91      0.92      1124

    accuracy                           0.92      2119
   macro avg       0.91      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119



###  Что будет, если использовать самый наивный метод?

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
vec = CountVectorizer()

In [38]:
bow = vec.fit_transform(df.tweet)

In [39]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [198]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.90      0.91      0.91      1016
        real       0.92      0.91      0.91      1103

    accuracy                           0.91      2119
   macro avg       0.91      0.91      0.91      2119
weighted avg       0.91      0.91      0.91      2119



Конечно, мы всегда можем поиграться с предобработкой.

### PyTorch + LSTM

In [40]:
labels = (df.label == 'real').astype(int).to_list()

Нужно заранее задать размер для макксимальной длины предложений.

In [41]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [42]:
max_len

1592

Это слишком много. Но какая длина обычно?

In [43]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [44]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Зададим максимум 200.

Возьмём те же w2v эмбеддинги.

In [45]:
def get_word_embedding(tokens, max_len):
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result

In [205]:
features = [get_word_embedding(text, 200) for text in tqdm(token_lists)]

100%|████████████████████████████████████████████████████████████████████████████| 6420/6420 [00:06<00:00, 1025.56it/s]


In [206]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)

In [207]:
import torch
import torch.nn as nn
import torch.optim as optim

In [208]:
len(features[0][0])

300

In [209]:
len(X_train)

4301

In [210]:
len(X_train[0])

200

In [211]:
len(X_train[0][0])

300

In [212]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        prediction = torch.sigmoid(self.out(longterm))
        return prediction


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


In [213]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()

In [214]:
in_data.shape

torch.Size([4301, 200, 300])

In [215]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [216]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [217]:
train_one_epoch(in_data, targets)

100%|████████████████████████████████████████████████████████████████████████████████| 269/269 [01:25<00:00,  3.15it/s]

tensor(0.6826, grad_fn=<BinaryCrossEntropyBackward0>)





Что получилось?

In [218]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [219]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [220]:
result = (output > 0.5) == targets_test

In [221]:
result.sum().item() / len(result)

0.5106182161396885

Но такую модель надо учить дольше(

### Вариант 2

In [None]:
features = [get_word_embedding(text, 1592) for text in tqdm(df.tweet)]

  8%|██████▎                                                                       | 520/6420 [00:04<00:49, 119.93it/s]

In [240]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        return torch.sigmoid(self.out(x))


net = Net()
print(net)

Net(
  (out): Linear(in_features=100, out_features=1, bias=True)
)


In [241]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

In [242]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)

In [243]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [244]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [245]:
for i in range(10):
    train_one_epoch(in_data, targets)

  0%|                                                                                          | 0/269 [00:00<?, ?it/s]


RuntimeError: mat1 and mat2 shapes cannot be multiplied (3200x300 and 100x1)

In [227]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [228]:
result = (output > 0.5) == targets_test

In [229]:
result.sum().item() / len(result)

0.5106182161396885