# Project for «Викишоп»

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which each edit is labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
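For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand; the helper name `f1_from_labels` and the label lists are made up for illustration and are not taken from the dataset.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels: 3 toxic comments, 2 of them found, 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1 = f1_from_labels(y_true, y_pred)
print("F1:", f1)
```

Here precision and recall are both 2/3, so F1 is also 2/3, which would fall short of the 0.75 target.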

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.


**Data description**

In the dataset, the *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [109]:
import re

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

In [110]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [111]:
display(df.head(10))
display(df.isna().sum())
display(df.shape)
df.info()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


text     0
toxic    0
dtype: int64

(159571, 2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [112]:
# take only a random sample of the data to speed up training
df = df.sample(n=20000, random_state=12345)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 146790 to 2449
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    20000 non-null  object
 1   toxic   20000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 468.8+ KB


In [113]:
df.head(10)

Unnamed: 0,text,toxic
146790,Ahh shut the fuck up you douchebag sand nigger...,1
2941,"""\n\nREPLY: There is no such thing as Texas Co...",0
115087,"Reply\nHey, you could at least mention Jasenov...",0
48830,"Thats fine, there is no deadline ) chi?",0
136034,"""\n\nDYK nomination of Mustarabim\n Hello! You...",0
121992,"""\n\nSockpuppetry case\n \nYou have been accus...",0
37282,"Judging by what I've just read in an article, ...",0
64488,Todd and Copper\nIn the first film they were l...,0
16992,"""\n\n \nYou have been blocked from editing for...",0
138230,| decline=Can't find evidence of block either ...,0


In [114]:
# lemmatize and clean the text
nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):    
    text = text.lower()
    doc = nlp(text)
    lemm_text = " ".join([token.lemma_ for token in doc])
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', lemm_text) 
    return " ".join(cleared_text.split())

df['text'] = df['text'].apply(lemmatize_text)
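The cleaning step inside `lemmatize_text` can be looked at in isolation. The sketch below reproduces only the regular-expression stage, without spaCy; the helper name `strip_non_letters` and the sample string are illustrative. Because every non-letter character becomes a space, fragments such as `s` or `t` can remain where an apostrophe was removed.

```python
import re


def strip_non_letters(text):
    """Replace every character outside a-z/A-Z with a space, then collapse whitespace."""
    cleared = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(cleared.split())


# Illustrative input with punctuation, a newline, and repeated spaces
sample = "can't find  evidence\nof block (yet)!"
cleaned = strip_non_letters(sample)
print(cleaned)
```

If these leftover single letters are unwanted, one option is to remove apostrophes before the substitution instead of replacing them with spaces.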

In [115]:
df.head(10)

Unnamed: 0,text,toxic
146790,ahh shut the fuck up you douchebag sand nigger...,1
2941,reply there be no such thing as texas commerce...,0
115087,reply hey you could at least mention jasenovac...,0
48830,that s fine there be no deadline chi,0
136034,dyk nomination of mustarabim hello your submis...,0
121992,sockpuppetry case you have be accuse of sockpu...,0
37282,judge by what I ve just read in an article the...,0
64488,todd and copper in the first film they be litt...,0
16992,you have be block from edit for a period of we...,0
138230,decline can t find evidence of block either as...,0


In [116]:
target = df['toxic']
features = df.drop(['toxic'], axis=1)

In [117]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1, random_state=12345)

In [118]:
# compute TF-IDF for the texts
nltk.download('stopwords')
# keep the stop words as a list: newer scikit-learn versions reject a set for stop_words
stopwords = nltk_stopwords.words('english')

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train = count_tf_idf.fit_transform(features_train['text'])

features_test = count_tf_idf.transform(features_test['text'])

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
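The vectorizer is fitted on the training texts only and then reused to transform the test texts, so the test set does not influence the vocabulary or the IDF weights. A small sketch of that behavior on invented sentences (the texts below are illustrative and stop words are left at their defaults for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented examples, not taken from the dataset
train_texts = ["you be a hero", "shut up you fool", "thank you for the edit"]
test_texts = ["you fool", "completely unseen words"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_texts)        # reuses them unchanged

print(train_matrix.shape, test_matrix.shape)
print("non-zero entries per test row:", [test_matrix[i].nnz for i in range(2)])
```

Both matrices have the same number of columns, and a test text made up entirely of words missing from the training vocabulary becomes an all-zero row.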


In [119]:
target_train.value_counts(normalize=True)

0    0.898222
1    0.101778
Name: toxic, dtype: float64
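Only about 10% of the comments are toxic, so a purely random split can leave the train and test sets with slightly different class shares. One possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with a similar imbalance.

```python
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the class balance shown above: 10% toxic
labels = [1] * 100 + [0] * 900
texts = [f"comment {i}" for i in range(len(labels))]

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=12345, stratify=labels
)

# Share of the positive class in each part
train_share = sum(y_train) / len(y_train)
test_share = sum(y_test) / len(y_test)
print(train_share, test_share)
```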

### Conclusion

I preprocessed the data: lemmatized and cleaned the text, then computed TF-IDF features for it.

## Training

In [120]:
# dictionary with the results
best = {}  # F1 on the test set, keyed by model name

In [121]:
model = LogisticRegression(random_state=12345, class_weight='balanced')
model.fit(features_train, target_train)
predicted_test = model.predict(features_test)
f_1 = f1_score(target_test, predicted_test)
print("F1:", f_1)
best['LogisticRegression'] = f_1

F1: 0.75


In [122]:
model = DecisionTreeClassifier(random_state=12345)
params = {'max_depth':list(range(1,101,20))}
tree_gs = GridSearchCV(model, params, cv=5, scoring='f1').fit(features_train, target_train)
print(f'Best cross-validated F1 for the decision tree: {tree_gs.best_score_} with hyperparameters: {tree_gs.best_params_}')

Best cross-validated F1 for the decision tree: 0.6986409391330228 with hyperparameters: {'max_depth': 61}


In [123]:
# evaluate on the test set
model = DecisionTreeClassifier(random_state=12345, max_depth=61)
model.fit(features_train, target_train)
predicted_test = model.predict(features_test)
f_1 = f1_score(target_test, predicted_test)
print("F1:", f_1)

best['DecisionTreeClassifier'] = f_1

F1: 0.7180722891566266


In [124]:
model = RandomForestClassifier(random_state=12345)
params = {'n_estimators': range(10, 101, 20)}
tree_gs = GridSearchCV(model, params, cv=5, scoring='f1').fit(features_train, target_train)
print(f'Best cross-validated F1 for the random forest: {tree_gs.best_score_} with hyperparameters: {tree_gs.best_params_}')

Best cross-validated F1 for the random forest: 0.6641575707747658 with hyperparameters: {'n_estimators': 90}


In [127]:
# evaluate on the test set
model = RandomForestClassifier(random_state=12345, n_estimators=90)
model.fit(features_train, target_train)
predicted_test = model.predict(features_test)
f_1 = f1_score(target_test, predicted_test)
print("F1:", f_1)

best['RandomForestClassifier'] = f_1

F1: 0.7184986595174263
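In the grid searches above, TF-IDF was fitted on the whole training split before cross-validation, so each validation fold also contributed to the vocabulary and IDF weights. A possible alternative, shown here only as a sketch, is to wrap the vectorizer and the classifier in a `Pipeline` so that the vectorizer is refitted inside every fold. The toy corpus, the `clf__C` grid, and `cv=2` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration (1 = toxic, 0 = not toxic)
texts = [
    "you be a stupid idiot", "shut up you fool",
    "what a stupid idiot edit", "you fool shut up",
    "thank you for the edit", "nice work on the article",
    "thank you nice article", "good work on the edit",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it is fitted on each training fold only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=12345, class_weight="balanced")),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the real data the same pattern would take the raw `text` column as input instead of the precomputed TF-IDF matrix.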


## Conclusions

In [128]:
pd.DataFrame(list(best.items()), columns=['model', 'f1']).sort_values(by='f1', ascending=False)

Unnamed: 0,model,f1
0,LogisticRegression,0.75
2,RandomForestClassifier,0.718499
1,DecisionTreeClassifier,0.718072


I preprocessed the data: lemmatized and cleaned the text, then computed TF-IDF features for it.
I trained three models and tuned hyperparameters for the decision tree and the random forest. LogisticRegression achieved the best F1 on the test set, 0.75.