## Project description

The online store Wikishop («Викишоп») is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.
Train a model to classify comments as positive or negative. A dataset with toxicity labels for the edits is available.
Build a model with an F1 score of at least 0.75.
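
For reference, F1 is the harmonic mean of precision and recall. A minimal sketch in plain Python, using made-up confusion counts (the numbers and the helper name `f1_from_counts` are illustrative assumptions, not taken from this dataset):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 60 toxic comments caught, 20 false alarms, 20 missed
example_f1 = f1_from_counts(tp=60, fp=20, fn=20)
print(round(example_f1, 3))
```

With these counts both precision and recall equal 0.75, so the score sits exactly at the project target.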

In [1]:
#!python -m nltk.downloader all

In [2]:
#!pip install wordcloud

## Preparation

In [72]:
import numpy as np
import pandas as pd
import nltk
import spacy
import re
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, make_scorer
from catboost import CatBoostClassifier
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

In [73]:
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # fall back to a local copy of the dataset
    data = pd.read_csv('C:/Users/Mikhail/оформление/Машинное обучение для текстов/toxic_comments.csv')

In [74]:
data.sample(20, random_state=12345)

Unnamed: 0,text,toxic
146790,Ahh shut the fuck up you douchebag sand nigger...,1
2941,"""\n\nREPLY: There is no such thing as Texas Co...",0
115087,"Reply\nHey, you could at least mention Jasenov...",0
48830,"Thats fine, there is no deadline ) chi?",0
136034,"""\n\nDYK nomination of Mustarabim\n Hello! You...",0
121992,"""\n\nSockpuppetry case\n \nYou have been accus...",0
37282,"Judging by what I've just read in an article, ...",0
64488,Todd and Copper\nIn the first film they were l...,0
16992,"""\n\n \nYou have been blocked from editing for...",0
138230,| decline=Can't find evidence of block either ...,0


In [75]:
data.shape

(159571, 2)

In [76]:
def miss_sorted(data):
    report = data.isna().sum().to_frame()
    report = report.rename(columns = {0: 'missing_values'})
    report['% of total'] = (report['missing_values'] / data.shape[0]).round(2)
    print(report.sort_values(by = 'missing_values', ascending = False))

In [77]:
miss_sorted(data)

       missing_values  % of total
text                0         0.0
toxic               0         0.0


In [78]:
# remove non-letter characters and convert to lower case
data['low_words'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x).lower())

# stop words
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
data['stop_words'] = data['low_words'].apply(lambda x: [w for w in x.split() if w not in stopwords])

# lemmas (WordNetLemmatizer needs the nltk 'wordnet' corpus to be available)
lemmatizer = nltk.WordNetLemmatizer()
data['lemmas'] = data['stop_words'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
# store each token list as its string representation so it can be passed to TfidfVectorizer
data['lemmas'] = data['lemmas'].astype('U')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mikhail\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
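
The cleaning and stop-word steps can also be wrapped in one small function. The sketch below uses only the standard library; `TINY_STOPWORDS` and `clean_and_tokenize` are hypothetical names, the stop-word list is a tiny stand-in for nltk's English list, and lemmatization is left out on purpose:

```python
import re

# Hypothetical mini stop-word list; the notebook uses nltk's full English list instead
TINY_STOPWORDS = frozenset({"the", "a", "is", "are", "you", "to", "and"})

def clean_and_tokenize(text: str, stopwords=TINY_STOPWORDS) -> list:
    """Replace non-letters with spaces, lower-case, split, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return [word for word in letters_only.split() if word not in stopwords]

tokens = clean_and_tokenize("You, sir, are my hero!\nAny chance you remember the page?")
print(tokens)
```

Keeping the steps in one function makes it easy to apply exactly the same preprocessing to new comments at prediction time.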


In [79]:
data = data[['text', 'toxic', 'lemmas']]
data

Unnamed: 0,text,toxic,lemmas
0,Explanation\nWhy the edits made under my usern...,0,"['explanation', 'edits', 'made', 'username', '..."
1,D'aww! He matches this background colour I'm s...,0,"['aww', 'match', 'background', 'colour', 'seem..."
2,"Hey man, I'm really not trying to edit war. It...",0,"['hey', 'man', 'really', 'trying', 'edit', 'wa..."
3,"""\nMore\nI can't make any real suggestions on ...",0,"['make', 'real', 'suggestion', 'improvement', ..."
4,"You, sir, are my hero. Any chance you remember...",0,"['sir', 'hero', 'chance', 'remember', 'page']"
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,"['second', 'time', 'asking', 'view', 'complete..."
159567,You should be ashamed of yourself \n\nThat is ...,0,"['ashamed', 'horrible', 'thing', 'put', 'talk'..."
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,"['spitzer', 'umm', 'there', 'actual', 'article..."
159569,And it looks like it was actually you who put ...,0,"['look', 'like', 'actually', 'put', 'speedy', ..."


In [80]:
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

Toxic comments are almost 9 times rarer than non-toxic ones (16,225 vs. 143,346). Before applying down- or upsampling, let's see how the model performs without it and compare it with a DummyClassifier baseline.
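
The size of the imbalance can be read directly from the class counts printed above. A short check in plain Python (the variable names are illustrative):

```python
counts = {0: 143346, 1: 16225}  # from data['toxic'].value_counts() above

total = sum(counts.values())
imbalance_ratio = counts[0] / counts[1]   # non-toxic comments per toxic comment
toxic_share = counts[1] / total           # fraction of toxic comments

print(f"non-toxic : toxic = {imbalance_ratio:.2f} : 1")
print(f"toxic share = {toxic_share:.1%}")
```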

### Split the dataset into training, validation, and test sets

In [81]:
data_train, data_valid = train_test_split(data, test_size=0.25, random_state=12345)
data_train, data_test = train_test_split(data_train, test_size=0.25, random_state=12345)
data_train.shape, data_valid.shape, data_test.shape

((89758, 3), (39893, 3), (29920, 3))
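
Two consecutive 25% splits leave roughly 56% of the rows for training, 25% for validation and 19% for testing. The arithmetic below reproduces the shapes shown above; it assumes the held-out part is rounded up and the remainder is kept, which matches the reported sizes:

```python
import math

n_total = 159571          # rows in the dataset
test_fraction = 0.25      # test_size used in both train_test_split calls

# Assumption: the held-out part is rounded up and the remainder is kept
n_valid = math.ceil(n_total * test_fraction)
n_rest = n_total - n_valid
n_test = math.ceil(n_rest * test_fraction)
n_train = n_rest - n_test

print(n_train, n_valid, n_test)
print([round(n / n_total, 4) for n in (n_train, n_valid, n_test)])
```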

In [82]:
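# target_train_upsampled is created in cells In [87]-In [89] below; this output comes from a run made after those cells
target_train_upsampled.value_counts()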

0    64529
1    36385
Name: toxic, dtype: int64

In [83]:
corpus = data_train['lemmas'].values.astype('U')

In [84]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

# recent scikit-learn versions expect stop_words to be 'english', a list, or None, so the set is converted to a list
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf = count_tf_idf.fit_transform(corpus)

print("Matrix shape:", tf_idf.shape)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mikhail\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Matrix shape: (89758, 113164)


To make predictions, the TF-IDF values for the validation and test sets are computed with the vectorizer that was already fitted on the training corpus: call transform() on the TfidfVectorizer object instead of fitting it again.
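
Why fitting happens only on the training corpus can be seen in a hand-rolled miniature of the idea. This is a simplified sketch, not a re-implementation of TfidfVectorizer: `fit_idf` and `transform` are hypothetical helpers, the corpus is toy data, and the weighting is modelled on smoothed IDF with L2 normalisation:

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(word for doc in train_docs for word in set(doc.split()))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by the learned IDF; words unseen during fitting are ignored."""
    counts = Counter(word for word in doc.split() if word in idf)
    weights = {word: tf * idf[word] for word, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()} if norm else {}

train_corpus = ["nice edit thanks", "awful edit", "thanks friend"]  # toy data
idf = fit_idf(train_corpus)

# "vandal" never appeared in the training corpus, so it gets no feature
test_vector = transform("awful vandal edit", idf)
print(test_vector)
```

Because the vocabulary and IDF weights come from the training data alone, no information from the validation or test comments leaks into the features, and all three matrices share the same columns.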

In [85]:
features_train = data_train['lemmas']
target_train = data_train['toxic']
#__________________________
features_valid = data_valid['lemmas']
target_valid = data_valid['toxic']
#__________________________
features_test = data_test['lemmas']
target_test = data_test['toxic']

In [86]:
features_train.shape,  features_valid.shape, features_test.shape, target_train.shape, target_valid.shape, target_test.shape

((89758,), (39893,), (29920,), (89758,), (39893,), (29920,))

In [87]:
# the first 71,806 rows (about 80% of the training split) serve as the base for upsampling
features_train_upsampled = data_train[:71806]['lemmas']
target_train_upsampled = data_train[:71806]['toxic']

In [88]:
def upsample(features, target, repeat):
    """Repeat the positive-class rows `repeat` times and shuffle the result."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # keep every negative row once, add `repeat` copies of every positive row
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # shuffle features and target together so the rows stay aligned
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled

In [89]:
features_train_upsampled, target_train_upsampled = upsample(features_train_upsampled, target_train_upsampled, 5)
features_train_upsampled.shape, target_train_upsampled.shape

((100914,), (100914,))

**The classes are not balanced to a clean 50/50; the upsampled set is roughly 65/35 (64,529 non-toxic vs. 36,385 toxic). Because toxic comments are rare in the original dataset, this ratio avoids tilting the model too strongly toward the toxic class.**
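
The resulting class share follows from the counts reported for the upsampled training set. A small check in plain Python (`minority_share_after_upsampling` is a hypothetical helper):

```python
def minority_share_after_upsampling(n_majority: int, n_minority: int, repeat: int) -> float:
    """Share of the minority class once its rows are repeated `repeat` times."""
    n_minority_upsampled = n_minority * repeat
    return n_minority_upsampled / (n_majority + n_minority_upsampled)

n_nontoxic = 64529          # class 0 rows in the upsampled training set
n_toxic = 36385 // 5        # class 1 rows before they were repeated 5 times

share = minority_share_after_upsampling(n_nontoxic, n_toxic, repeat=5)
print(n_toxic, round(share, 3))
```

The same function shows that a repeat factor of about 9 would be needed to reach an even split.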

In [90]:
features_train = count_tf_idf.transform(features_train)
features_valid = count_tf_idf.transform(features_valid)
features_test = count_tf_idf.transform(features_test)
features_train_upsampled = count_tf_idf.transform(features_train_upsampled)

In [91]:
features_train.shape, features_valid.shape, features_test.shape, features_train_upsampled.shape

((89758, 113164), (39893, 113164), (29920, 113164), (100914, 113164))

## Training

### Before upsampling

### LogisticRegression

In [92]:
%%time
model_lr = LogisticRegression(max_iter=2000, class_weight='balanced', penalty='l2').fit(features_train, target_train)

y_pred = model_lr.predict(features_valid)
y_true = target_valid
f1_score(y_true, y_pred)

Wall time: 2.22 s


0.748597324126025

### On the test set

In [93]:
y_pred_test = model_lr.predict(features_test)
y_true_test = target_test
f1_score(y_true_test, y_pred_test)

0.7452816386247255
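
The model above relies on `class_weight='balanced'` instead of resampling. According to the scikit-learn documentation, this heuristic weights each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to the class counts of the whole dataset purely as an illustration (the model itself sees only the training split, and `balanced_class_weights` is a hypothetical helper):

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Counts for the whole dataset, used here only as an illustration
weights = balanced_class_weights({0: 143346, 1: 16225})
print({label: round(w, 3) for label, w in weights.items()})
```

Each toxic comment ends up counting about 8.8 times as much as a non-toxic one, which is the inverse of the class ratio.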

### DummyClassifier

In [94]:
%%time
dummy_clf = DummyClassifier(strategy="uniform").fit(features_train, target_train)

y_pred = dummy_clf.predict(features_valid)
y_true = target_valid
f1_score(y_true, y_pred)

Wall time: 16 ms


0.17021276595744683

### On the test set

In [95]:
y_pred_test = dummy_clf.predict(features_test)
y_true_test = target_test
f1_score(y_true_test, y_pred_test)

0.16708382001223648
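
The baseline score is close to what a uniformly random guess should give. A rough estimate, assuming each comment is labelled toxic with probability 0.5 and using the toxic share of the whole dataset (`expected_f1_uniform_guess` is a hypothetical helper):

```python
def expected_f1_uniform_guess(positive_share: float) -> float:
    """Approximate F1 when each sample is labelled positive with probability 0.5."""
    precision = positive_share   # a random positive guess is right as often as positives occur
    recall = 0.5                 # half of the true positives are hit by chance
    return 2 * precision * recall / (precision + recall)

toxic_share = 16225 / 159571     # share of toxic comments in the whole dataset
print(round(expected_f1_uniform_guess(toxic_share), 3))
```

Any useful model has to beat this level by a wide margin, which logistic regression does.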

### After upsampling

### LogisticRegression

In [96]:
%%time
model_lr = LogisticRegression(max_iter=2000, penalty='l2').fit(features_train_upsampled, target_train_upsampled)

y_pred = model_lr.predict(features_valid)
y_true = target_valid
f1_score(y_true, y_pred)

Wall time: 4.06 s


0.7745734198509975

### On the test set

In [97]:
y_pred_test = model_lr.predict(features_test)
y_true_test = target_test
f1_score(y_true_test, y_pred_test)

0.7685064935064937

### DummyClassifier

In [98]:
%%time

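# fit on the upsampled training set to match this section (the "uniform" strategy ignores class frequencies)
dummy_clf = DummyClassifier(strategy="uniform").fit(features_train_upsampled, target_train_upsampled)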

y_pred = dummy_clf.predict(features_valid)
y_true = target_valid
f1_score(y_true, y_pred)

Wall time: 15 ms


0.1738336377220654

### On the test set

In [99]:
y_pred_test = dummy_clf.predict(features_test)
y_true_test = target_test
f1_score(y_true_test, y_pred_test)

0.16778411924419573

## Conclusions

In [100]:
# F1 values are the test-set scores and the times are the %%time wall times reported in the cells above
d = {'Model_name': ['LogisticRegression', 'LogisticRegression_upsample', 'DummyClassifier', 'DummyClassifier_upsample'],
     'f1_score': [0.745, 0.769, 0.167, 0.168],
     'wall_time_seconds': [2.22, 4.06, 0.016, 0.015]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,Model_name,f1_score,wall_time_seconds
0,LogisticRegression,0.745,2.22
1,LogisticRegression_upsample,0.769,4.06
2,DummyClassifier,0.167,0.016
3,DummyClassifier_upsample,0.168,0.015


1. Upsampling raised F1 on the test set from 0.745 to an acceptable 0.769, above the 0.75 target, so the balance between toxic and non-toxic comments matters.
2. DummyClassifier performs far worse than logistic regression (F1 of about 0.17).
3. Logistic regression handled the task well and is simple and undemanding to use; the other models could not even be run.

- [x]  Jupyter Notebook is open
- [x]  All code runs without errors
- [x]  Code cells are arranged in execution order
- [x]  Data is loaded and prepared
- [x]  Models are trained
- [x]  The *F1* score is at least 0.75
- [x]  Conclusions are written