Предварительно про PyTorch:
* [Про тензоры в pytorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Про автоматическое дифференцирование и что такое .backwards()](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [Очень простая нейронка на pytorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Классификация текстов

## Fakenews
Используя ноутбук занятия (также размещен в папке Materials) и данные fakenews, 3 раза разными способами получить на задаче классификации значение f1 выше 0.91 для методов на sklearn и выше 0.52 для методов на pytorch.

1. Мы будем работать с данными fakenews отсюда: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Проведите препроцессинг текста. Разбейте данные на train и test для задачи классификации.
3. Векторизуйте.
4. Обучите на полученных векторах алгоритм классификации.

Мы уже видели как эта задача выполняется с помощью Word2vec. Давайте вспомним.

#### **1. Загрузка данных**

In [38]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2024-12-11 12:51:19--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv.1’


2024-12-11 12:51:19 (17.6 MB/s) - ‘Constraint_Train.csv.1’ saved [1253562/1253562]



In [39]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [40]:
df = pd.read_csv('/content/Constraint_Train.csv')

In [41]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6420 entries, 0 to 6419
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      6420 non-null   int64 
 1   tweet   6420 non-null   object
 2   label   6420 non-null   object
dtypes: int64(1), object(2)
memory usage: 150.6+ KB


In [43]:
df.label.value_counts()

#Данные сбалансированы

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
real,3360
fake,3060


In [44]:
# Переведем метку класса в значения 0-fake, 1-real

label_map = {'real': 1, 'fake': 0}
df['label'] = df['label'].map(label_map)

In [45]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,1
1,2,States reported 1121 deaths a small rise from ...,1
2,3,Politically Correct Woman (Almost) Uses Pandem...,0
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,1
4,5,Populous states can generate large case counts...,1


In [46]:
#Случайным образом перемешаем датасет
df = df.sample(frac = 1)
df.head()

Unnamed: 0,id,tweet,label
3725,3726,Close to half (48.45%) of the Active Cases are...,1
1222,1223,American's arm themselves as Coronavirus panic...,0
6239,6240,???Anybody that wants a test (for the coronavi...,0
42,43,including that there will again be testing of ...,1
941,942,The Bill and Melinda Gates Foundation headquar...,0


In [47]:
df.reset_index(inplace = True)
df.drop(['index'], axis = 1, inplace = True)

df.head()

Unnamed: 0,id,tweet,label
0,3726,Close to half (48.45%) of the Active Cases are...,1
1,1223,American's arm themselves as Coronavirus panic...,0
2,6240,???Anybody that wants a test (for the coronavi...,0
3,43,including that there will again be testing of ...,1
4,942,The Bill and Melinda Gates Foundation headquar...,0


#### **2. Препроцессинг текстовых данных**

In [48]:
'''
Функция для перевода текста в нижний регистр,
удаления лишних пробелов,
специальных символов,
URL-адресов и ссылок.
'''
def wordopt(text):
    text = text.lower()
    text = re.sub('\[.*?\]','',text)
    text = re.sub("\\W"," ",text)
    text = re.sub('https?://\S+|www\.\S+','',text)
    text = re.sub('<.*?>+',b'',text)
    text = re.sub('[%s]' % re.escape(string.punctuation),'',text)
    text = re.sub('\w*\d\w*','',text)
    return text

In [49]:
df['tweet'] = df['tweet'].apply(wordopt)
df.head()

Unnamed: 0,id,tweet,label
0,3726,close to half of the active cases are con...,1
1,1223,american s arm themselves as coronavirus panic...,0
2,6240,anybody that wants a test for the coronavi...,0
3,43,including that there will again be testing of ...,1
4,942,the bill and melinda gates foundation headquar...,0


In [50]:
import nltk

# подключаем модуль со стоп-словами
from nltk.corpus import stopwords
nltk.download('stopwords')

stopwords_en = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [51]:
def remove_stopwords(text, stopwords=stopwords_en):
    try:
        return " ".join([token for token in text.split() if not token in stopwords])
    except:
        return ""

In [52]:
df['tweet_no_stopwords'] = df.tweet.apply(remove_stopwords)
df.head()

Unnamed: 0,id,tweet,label,tweet_no_stopwords
0,3726,close to half of the active cases are con...,1,close half active cases concentrated states ma...
1,1223,american s arm themselves as coronavirus panic...,0,american arm coronavirus panic spreads https c...
2,6240,anybody that wants a test for the coronavi...,0,anybody wants test coronavirus get test
3,43,including that there will again be testing of ...,1,including testing asymptomatic workers involve...
4,942,the bill and melinda gates foundation headquar...,0,bill melinda gates foundation headquarters cal...


Зададим переменные для таргета и признака

In [53]:
x = df['tweet_no_stopwords']
y = df['label']

Разобьем данные на тренировочную и тестовую выборки

In [54]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25)

In [55]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4815,), (1605,), (4815,), (1605,))

In [56]:
y_train.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,2485
0,2330


#### **3. Векторизация данных**

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)
xv_test = vectorization.transform(X_test)

In [58]:
xv_train

<4815x10954 sparse matrix of type '<class 'numpy.float64'>'
	with 74004 stored elements in Compressed Sparse Row format>

#### **4. Построение моделей**

**Logistic Regression**

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.simplefilter('ignore')

In [60]:
lr_params = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['liblinear', 'newton-cholesky', 'newton-cg', 'sag', 'saga', 'lbfgs'],
    'C': [0.1, 0.25, 0.5, 1.0]
}

In [61]:
lr_model = LogisticRegression(random_state = 42)
rscv_lr = RandomizedSearchCV(lr_model, lr_params, cv=3, scoring='accuracy', random_state=42)
rscv_lr.fit(xv_train, y_train)

In [62]:
print(rscv_lr.best_params_)
print(rscv_lr.best_score_)
print(rscv_lr.best_estimator_)

{'solver': 'newton-cg', 'penalty': 'l2', 'C': 1.0}
0.9169262720664589
LogisticRegression(random_state=42, solver='newton-cg')


In [63]:
pred_lr_model = rscv_lr.predict(xv_test)

In [64]:
rscv_lr.score(xv_test, y_test)


0.9339563862928348

In [65]:
print (classification_report(y_test, pred_lr_model))


              precision    recall  f1-score   support

           0       0.91      0.95      0.93       730
           1       0.96      0.92      0.94       875

    accuracy                           0.93      1605
   macro avg       0.93      0.94      0.93      1605
weighted avg       0.94      0.93      0.93      1605



**Random Forest Classifier**

In [66]:
from sklearn.ensemble import RandomForestClassifier

rfc_params = {
    'n_estimators': [20, 40, 100, 200],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [1, 3, 5, 7, 9, None],
    'min_samples_leaf': [1, 2, 4, 8, 16]
}

In [67]:
rfc = RandomForestClassifier(random_state=42)
rscv_rfc = RandomizedSearchCV(rfc, rfc_params, cv=3, scoring='accuracy', random_state=42)
rscv_rfc.fit(xv_train, y_train)

In [68]:
print(rscv_rfc.best_params_)
print(rscv_rfc.best_score_)
print(rscv_rfc.best_estimator_)

{'n_estimators': 20, 'min_samples_leaf': 2, 'max_depth': None, 'criterion': 'entropy'}
0.9109034267912772
RandomForestClassifier(criterion='entropy', min_samples_leaf=2, n_estimators=20,
                       random_state=42)


In [69]:
pred_rf = rscv_rfc.predict(xv_test)


In [70]:
rscv_rfc.score(xv_test, y_test)


0.9171339563862928

In [71]:
print (classification_report(y_test, pred_rf))


              precision    recall  f1-score   support

           0       0.90      0.92      0.91       730
           1       0.93      0.92      0.92       875

    accuracy                           0.92      1605
   macro avg       0.92      0.92      0.92      1605
weighted avg       0.92      0.92      0.92      1605



**Gradient Boost Classifier**

In [37]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

In [72]:
gbc_params = {
    'learning_rate': [0.025, 0.075],
    'max_depth':[5, 8, 10],
    'n_estimators':[40, 100, 150]
}

In [73]:
gbc = GradientBoostingClassifier(random_state = 42)
gscv_gbc = GridSearchCV(gbc, gbc_params, cv=3, scoring='accuracy')
gscv_gbc.fit(xv_train, y_train)

In [74]:
print(gscv_gbc.best_params_)
print(gscv_gbc.best_score_)
print(gscv_gbc.best_estimator_)

{'learning_rate': 0.075, 'max_depth': 10, 'n_estimators': 150}
0.9075804776739357
GradientBoostingClassifier(learning_rate=0.075, max_depth=10, n_estimators=150,
                           random_state=42)


In [75]:
pred_gbc = gscv_gbc.predict(xv_test)

In [76]:
gscv_gbc.score(xv_test, y_test)

0.9227414330218069

In [77]:
print (classification_report(y_test, pred_gbc))

              precision    recall  f1-score   support

           0       0.90      0.93      0.92       730
           1       0.94      0.92      0.93       875

    accuracy                           0.92      1605
   macro avg       0.92      0.92      0.92      1605
weighted avg       0.92      0.92      0.92      1605



**Сравним метрику f1-score моделей по macro_avg**

In [78]:
repot_lr = classification_report(y_test, pred_lr_model, output_dict=True)['macro avg']['f1-score']
repot_rf = classification_report(y_test, pred_rf, output_dict=True)['macro avg']['f1-score']
repot_gbc = classification_report(y_test, pred_gbc, output_dict=True)['macro avg']['f1-score']

In [79]:
print(f'F-1 score для LogisticRegression: {repot_lr:.6f}')
print(f'F-1 score для RandomForrestClassifier: {repot_rf:.6f}')
print(f'F-1 score для GradientBoostingClassifier: {repot_gbc:.6f}')

F-1 score для LogisticRegression: 0.933662
F-1 score для RandomForrestClassifier: 0.916535
F-1 score для GradientBoostingClassifier: 0.922270


Метрика f-1 score соответствует требованиям задания.

**PyTorch**

In [80]:
import torch
import torch.nn as nn
import torch.optim as optim

nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize
from tqdm import tqdm

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [81]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet_no_stopwords)]

100%|██████████| 6420/6420 [00:01<00:00, 5911.83it/s]


In [82]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=4, vector_size=300, min_count=3, window=5, epochs=15)

CPU times: user 5.57 s, sys: 60.8 ms, total: 5.63 s
Wall time: 3.12 s


In [83]:
import numpy as np

In [84]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        result = np.zeros(300)
    return result

In [85]:
features = [get_text_embedding(text) for text in tqdm(df.tweet_no_stopwords)]

100%|██████████| 6420/6420 [00:01<00:00, 4213.13it/s]


In [86]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.out = nn.Linear(300, 1)

    def forward(self, x):
        return torch.sigmoid(self.out(x))


net = Net()
print(net)

Net(
  (out): Linear(in_features=300, out_features=1, bias=True)
)


In [87]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

In [88]:
labels = (df.label == 'real').astype(int).to_list()

In [89]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=42)

In [90]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()

In [91]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.squeeze(), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [92]:
for i in range(1):
    train_one_epoch(in_data, targets)

100%|██████████| 301/301 [00:00<00:00, 763.92it/s]


tensor(0.6953, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


Что получилось?

In [93]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [94]:
with torch.no_grad():
    output = net(in_data_test).squeeze()

In [95]:
result = (output.cpu() > 0.5) == targets_test

In [96]:
result.sum().item() / len(result)

0.9993769470404984

In [97]:
for i in range(10):
    train_one_epoch(in_data, targets)

100%|██████████| 301/301 [00:00<00:00, 1648.86it/s]


tensor(0.6942, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1795.60it/s]


tensor(0.6938, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1685.62it/s]


tensor(0.6936, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1615.82it/s]


tensor(0.6935, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1747.60it/s]


tensor(0.6934, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1264.48it/s]


tensor(0.6934, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1754.43it/s]


tensor(0.6933, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1617.81it/s]


tensor(0.6933, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1660.60it/s]


tensor(0.6933, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)


100%|██████████| 301/301 [00:00<00:00, 1526.91it/s]

tensor(0.6933, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)





In [98]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [99]:
with torch.no_grad():
    output = net(in_data_test).squeeze(1)

In [100]:
result = (output.cpu() > 0.5) == targets_test

In [101]:
result.sum().item() / len(result)

1.0