
# Project for «Викишоп» (Wikishop)

The Wikishop online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

**Plan:**

1. Load and inspect the data.

2. Preprocess the text.

3. Train several models and evaluate their predictions with the F1 metric.

## Preparation

In [4]:
import pandas as pd
import numpy as np
import re 
import spacy
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [5]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [6]:
# Try the local path first, then fall back to the datasets/ folder
try:
    df = pd.read_csv('toxic_comments.csv')
except FileNotFoundError:
    df = pd.read_csv('datasets/toxic_comments.csv')

In [7]:
# Inspect the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [8]:
# Check for duplicate rows
df.duplicated().sum()

0

In [9]:
# Drop the column that duplicates the index
df = df.drop('Unnamed: 0', axis=1)

In [10]:
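# Helper that strips everything except Latin letters and spaces, then lowercases.
# Note: characters are deleted rather than replaced with a space, so words separated
# only by a newline or punctuation get merged (e.g. "explanationwhy" in the output below).
def cleaning(text):
    clear = re.sub(r"[^a-zA-Z ]", "", text)
    clear = clear.lower()
    return clear

df['text'] = df['text'].apply(cleaning)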
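
Because `cleaning` deletes non-letter characters outright, tokens separated only by a newline or punctuation end up glued together. Below is a minimal sketch of a variant that substitutes a space instead and then collapses repeated whitespace. The name `clean_text` is hypothetical and not part of the original notebook, and switching to it would change the scores reported further down.

```python
import re

def clean_text(text):
    """Lowercase text and keep only Latin letters separated by single spaces."""
    # Replace every run of non-letter characters (digits, punctuation, newlines) with a space
    letters_only = re.sub(r"[^a-zA-Z]+", " ", text)
    # Collapse to single spaces, trim the ends, and lowercase
    return " ".join(letters_only.split()).lower()

print(clean_text("Explanation\nWhy the edits, made under my username?"))
```

One trade-off of this variant: apostrophes also become spaces, so "can't" turns into "can t" rather than "cant".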

In [11]:
# Lemmatize the text with spaCy
df["lemmatized"] = df['text'].apply(lambda x: " ".join([y.lemma_ for y in nlp(x)]))
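
Calling `nlp(x)` once per row can be slow on roughly 159 thousand comments. The sketch below assumes the same spaCy pipeline object `nlp` as above and streams the texts through `nlp.pipe` in batches, which is usually faster. The helper name `lemmatize_texts` and the batch size are illustrative choices, not part of the original notebook.

```python
def lemmatize_texts(texts, nlp, batch_size=1000):
    """Return one space-joined string of lemmas per input text.

    `nlp` is expected to be a loaded spaCy pipeline, for example
    spacy.load("en_core_web_sm", disable=["parser", "ner"]).
    """
    lemmatized = []
    # nlp.pipe processes the texts in batches instead of one call per text
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmatized.append(" ".join(token.lemma_ for token in doc))
    return lemmatized

# Hypothetical usage in this notebook:
# df["lemmatized"] = lemmatize_texts(df["text"], nlp)
```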

In [12]:
df

| | text | toxic | lemmatized |
|---|---|---|---|
| 0 | explanationwhy the edits made under my usernam... | 0 | explanationwhy the edit make under my username... |
| 1 | daww he matches this background colour im seem... | 0 | daww he match this background colour I m seemi... |
| 2 | hey man im really not trying to edit war its j... | 0 | hey man I m really not try to edit war its jus... |
| 3 | morei cant make any real suggestions on improv... | 0 | morei can not make any real suggestion on impr... |
| 4 | you sir are my hero any chance you remember wh... | 0 | you sir be my hero any chance you remember wha... |
| ... | ... | ... | ... |
| 159287 | and for the second time of asking when your vi... | 0 | and for the second time of ask when your view ... |
| 159288 | you should be ashamed of yourself that is a ho... | 0 | you should be ashamed of yourself that be a ho... |
| 159289 | spitzer umm theres no actual article for prost... | 0 | spitzer umm there s no actual article for pros... |
| 159290 | and it looks like it was actually you who put ... | 0 | and it look like it be actually you who put on... |


In [13]:
# Split the data into training (60%), validation (20%) and test (20%) sets
X = df['lemmatized']
y = df['toxic']
X_train, X_q, y_train, y_q = train_test_split(X, y, random_state=42, test_size=0.4)
X_valid, X_test, y_valid, y_test = train_test_split(X_q, y_q, random_state=42, test_size=0.5)
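
Holding out 40% of the rows and then halving the hold-out gives roughly a 60/20/20 split. The sketch below works out the resulting row counts for the 159,292 comments. It assumes the held-out part is rounded up when the fraction does not divide evenly, which is my understanding of how `train_test_split` treats a fractional `test_size`. The helper `split_sizes` is hypothetical.

```python
from math import ceil

def split_sizes(n_samples, holdout=0.4, test_share=0.5):
    """Row counts (train, valid, test) for a two-step split."""
    # First split: hold out `holdout` of all rows (assumed to be rounded up)
    n_holdout = ceil(holdout * n_samples)
    n_train = n_samples - n_holdout
    # Second split: `test_share` of the hold-out becomes the test set
    n_test = ceil(test_share * n_holdout)
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(159292)
print(n_train, n_valid, n_test)
```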

In [14]:
# Download the NLTK English stop-word list used when vectorizing the text
corpus = X_train
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
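# Fit TF-IDF on the training texts only, then reuse the fitted vocabulary for the other splits.
# stop_words is documented as a list; recent scikit-learn versions reject a set.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
corpus = X_train
tf_idf_train = count_tf_idf.fit_transform(corpus)

corpus_test = X_test
tf_idf_test = count_tf_idf.transform(corpus_test)

corpus_valid = X_valid
tf_idf_valid = count_tf_idf.transform(corpus_valid)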
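
To make the vectorization step concrete, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings as I understand them: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The function `tfidf` and the toy documents are illustrative only. Tokenization is simplified to whitespace splitting and stop words are not removed.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_docs = ["you be my hero", "you be a troll", "edit war"]
vectors = tfidf(toy_docs)
print(vectors[0])
```

In the first toy document, "hero" receives a larger weight than "you" because it occurs in fewer documents.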

## Training

In [24]:
# Train a random forest classifier
# Note: the grid search is scored with macro-averaged F1, while the validation
# score at the end of the cell uses the default binary F1 for the toxic class.
f1 = make_scorer(f1_score, average='macro')
parameters = {'n_estimators': range(10, 51, 10),
              'max_depth': range(1, 13, 2)}
RFC = RandomForestClassifier()
grid = GridSearchCV(RFC, parameters, cv=5, scoring=f1)
grid.fit(tf_idf_train, y_train)
best_params_RFC = grid.best_params_
print(best_params_RFC)
pred_RFC = grid.predict(tf_idf_valid)
f1_sc = f1_score(y_valid, pred_RFC)

{'max_depth': 9, 'n_estimators': 10}
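
The scorer passed to the grid search uses `average='macro'`, whereas the plain `f1_score` calls in this notebook report binary F1 for the toxic class. The two can differ noticeably on imbalanced data. Below is a small plain-Python sketch with made-up labels; the helper names are hypothetical.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels: 8 non-toxic comments, 2 toxic, one toxic comment missed
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 8 + [1, 0]

binary = f1_for_class(y_true_toy, y_pred_toy, positive=1)
macro = macro_f1(y_true_toy, y_pred_toy)
print(f"binary F1 (toxic class): {binary:.3f}")
print(f"macro F1: {macro:.3f}")
```

Here the macro score is higher than the binary one because the well-predicted majority class pulls the average up.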


In [18]:
%%time
# Baseline: word counts -> TF-IDF -> logistic regression with default regularization
pipe_lr = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(random_state=42, max_iter=10000))])
model_pipe_lr = pipe_lr.fit(X_train, y_train)
prediction = model_pipe_lr.predict(X_test)

CPU times: total: 25.7 s
Wall time: 10.6 s


In [19]:
print('f1:', f1_score(y_test, prediction))

f1: 0.7364634608102991


In [20]:
# Hyperparameter grid for the logistic regression step of the pipeline
grid_params_lr = [{'clf__penalty': ['l1', 'l2'],
                   'clf__C': [0.1, 10, 100],
                   'clf__solver': ['liblinear']}]

In [21]:
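# `%time` on a line of its own times an empty statement, hence the 0 ns reported below;
# `%%time` as the first line of the cell would time the whole search.
%time
grid = GridSearchCV(pipe_lr, grid_params_lr, cv=5, scoring=f1)
grid.fit(X_train, y_train)
best_params_lr = grid.best_params_
print(best_params_lr)
# F1 on the training data (an optimistic estimate: the model was fit on these rows)
pipe_lr_pred = grid.predict(X_train)
f1_sc = f1_score(y_train, pipe_lr_pred)
f1_sc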

CPU times: total: 0 ns
Wall time: 0 ns
{'clf__C': 10, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}


0.9743377910879331
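
The F1 of 0.97 above is measured on the rows the model was trained on, so it overstates how well the model generalizes. The sketch below, on toy data invented for illustration, reads the cross-validated score from `best_score_` instead and selects by binary F1 via `scoring="f1"`, which matches the metric reported elsewhere in this notebook. The texts, labels, grid values and `cv=2` are assumptions chosen to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
toy_texts = [
    "you be a great editor", "thank for the helpful edit",
    "nice work on the article", "I agree with this change",
    "you be a stupid idiot", "shut up you moron",
    "what a dumb idiot", "you stupid moron",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer inside the pipeline: it is refit on the training part of every fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])
param_grid = {"clf__C": [0.1, 10], "clf__penalty": ["l1", "l2"]}

# scoring="f1" selects by binary F1 for the positive (toxic) class
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
# Mean cross-validated F1 of the best parameter combination
print(search.best_score_)
```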

In [22]:
# F1 of the tuned pipeline on the test set
test_rl = grid.predict(X_test)
f1_test_rl = f1_score(y_test, test_rl)
f1_test_rl

0.7808010816292039

In [23]:
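# Refit logistic regression with the selected hyperparameters on the
# stop-word-filtered TF-IDF features and score it on the test set
model = LogisticRegression(random_state=42, solver='liblinear', max_iter=300, penalty='l1', C=10)
model.fit(tf_idf_train, y_train)
test_pred = model.predict(tf_idf_test)
f1_score(y_test, test_pred)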

0.7738014854827819

## Conclusions



The text was cleaned of extraneous characters, lemmatized, and converted to TF-IDF vectors.

We trained RandomForestClassifier and LogisticRegression. Training used a Pipeline for preprocessing, and hyperparameters were tuned with GridSearchCV.

**Recommended model:** LogisticRegression, which achieved the best F1 score: 0.78 on the test set (C=10, L1 penalty, liblinear solver).
