# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Необходимо обучить модель классифицировать комментарии на позитивные и негативные. <br>В нашем распоряжении набор данных с разметкой о токсичности правок.

## Подготовка

**Импортируем библиотеки:**

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import re 
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

**Загружаю данные:**

In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
df.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [5]:
def clear_text(text):
    clear_t = re.sub(r"[^a-zA-Z]", ' ', text) 
    cleat_t = clear_t.split()
    clear_t = " ".join(cleat_t)
    
    return clear_t

In [6]:
df['text'] = df['text'].apply(lambda x: clear_text(x))

In [7]:
df

Unnamed: 0,text,toxic
0,Explanation Why the edits made under my userna...,0
1,D aww He matches this background colour I m se...,0
2,Hey man I m really not trying to edit war It s...,0
3,More I can t make any real suggestions on impr...,0
4,You sir are my hero Any chance you remember wh...,0
...,...,...
159566,And for the second time of asking when your vi...,0
159567,You should be ashamed of yourself That is a ho...,0
159568,Spitzer Umm theres no actual article for prost...,0
159569,And it looks like it was actually you who put ...,0


In [8]:
corpus = df['text'].astype('U').values

In [9]:
corpus

array(['Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren t vandalisms just closure on some GAs after I voted at New York Dolls FAC And please don t remove the template from the talk page since I m retired now',
       'D aww He matches this background colour I m seemingly stuck with Thanks talk January UTC',
       'Hey man I m really not trying to edit war It s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info',
       ...,
       'Spitzer Umm theres no actual article for prostitution ring Crunch Captain',
       'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it',
       'And I really don t think you understand I came here and my idea was bad right away What kind of community goes you have bad ideas go away instead of helping rewrite them'],

In [10]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    txt = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    text=" ".join(txt)
    return text
    
df['text_lemmatized'] = df['text'].apply(lambda x: lemmatize_text(x))

In [11]:
df

Unnamed: 0,text,toxic,text_lemmatized
0,Explanation Why the edits made under my userna...,0,Explanation Why the edits made under my userna...
1,D aww He matches this background colour I m se...,0,D aww He match this background colour I m seem...
2,Hey man I m really not trying to edit war It s...,0,Hey man I m really not trying to edit war It s...
3,More I can t make any real suggestions on impr...,0,More I can t make any real suggestion on impro...
4,You sir are my hero Any chance you remember wh...,0,You sir are my hero Any chance you remember wh...
...,...,...,...
159566,And for the second time of asking when your vi...,0,And for the second time of asking when your vi...
159567,You should be ashamed of yourself That is a ho...,0,You should be ashamed of yourself That is a ho...
159568,Spitzer Umm theres no actual article for prost...,0,Spitzer Umm there no actual article for prosti...
159569,And it looks like it was actually you who put ...,0,And it look like it wa actually you who put on...


In [12]:
from nltk.stem import WordNetLemmatizer

l = WordNetLemmatizer()
print(l.lemmatize("was"))
l.lemmatize("was was")

wa


'was was'

**Разобью данные на выборки:**

In [13]:
features = df.drop(['toxic'], axis=1)
target = df['toxic']

In [14]:
#Проверю соотношение классов в целевом признаке
df.toxic.value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

In [15]:
class_frequency = df['toxic'].value_counts(normalize=True)
print(class_frequency)

0    0.898321
1    0.101679
Name: toxic, dtype: float64


In [16]:
features_train, features_test, target_train, target_test = train_test_split(features, target, stratify = df['toxic'], test_size=0.25, random_state=42)

In [17]:
features_train.shape

(119678, 2)

In [18]:
target_train.value_counts()

0    107509
1     12169
Name: toxic, dtype: int64

In [19]:
features_train

Unnamed: 0,text,text_lemmatized
75703,How is it known that they originate from Israe...,How is it known that they originate from Israe...
145906,This article is nominated for the below mentio...,This article is nominated for the below mentio...
125334,And dispute resolution How many times have I t...,And dispute resolution How many time have I tr...
27244,Perhaps Perhaps you didn t read the article wh...,Perhaps Perhaps you didn t read the article wh...
42461,Pushing it Unclear what you think is pushing i...,Pushing it Unclear what you think is pushing i...
...,...,...
115421,Go watch Buffy the Vampire Slayer or what ever...,Go watch Buffy the Vampire Slayer or what ever...
127947,its MY TALK PAGE HELLO CAN YOU NOT SEE THAT I ...,it MY TALK PAGE HELLO CAN YOU NOT SEE THAT I A...
112003,The question is what should I do Alex C E,The question is what should I do Alex C E
61703,oops I saw blocking warning on your page as we...,oops I saw blocking warning on your page a wel...


In [20]:
features_test.shape

(39893, 2)

In [21]:
features_test

Unnamed: 0,text,text_lemmatized
55010,germany took over most of it and,germany took over most of it and
81189,I just expanded the article and added Infobox ...,I just expanded the article and added Infobox ...
37372,Skater chick skater chick,Skater chick skater chick
75112,I am the creator of the Ctrl Alt Del Webcomic,I am the creator of the Ctrl Alt Del Webcomic
3806,I Butter is a butter substitute The references...,I Butter is a butter substitute The reference ...
...,...,...
136534,To the guy claiming S Korea has more patents t...,To the guy claiming S Korea ha more patent tha...
47896,And i censored the previous bad word Don t cal...,And i censored the previous bad word Don t cal...
151020,Corvus cornix John Reaves an admirer of femme ...,Corvus cornix John Reaves an admirer of femme ...
11658,Please stop Please stop blocking me It s just ...,Please stop Please stop blocking me It s just ...


In [22]:
stopwords = set(nltk_stopwords.words('english'))

## Обучение

**Модель "Линейная регрессия"**

In [24]:
corpus = features_train['text_lemmatized'].astype('U').values
count_tf_idf = TfidfVectorizer(stop_words = stopwords)
tf_idf = count_tf_idf.fit_transform(corpus)

In [31]:
corpus_test = features_test['text_lemmatized'].astype('U').values
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf_test = TfidfVectorizer(stop_words=stopwords)
tf_idf_test = count_tf_idf.transform(corpus_test)

In [32]:
LR_best_result = 0
for i in range(1, 51, 10):
    LR_model = LogisticRegression(C=i)
    LR_model.fit(tf_idf, target_train)
    LR_predictions = LR_model.predict(tf_idf_test)
    LR_result = f1_score(target_test, LR_predictions)
    if LR_result > LR_best_result:
        LR_best_result = LR_result
        SL_best_i = i

print("F1-мера наилучшей модели:", LR_best_result, "Параметр С:", SL_best_i)

F1-мера наилучшей модели: 0.7778228532792427 Папраметр С: 21


In [33]:
LR_model = LogisticRegression(C=21.0)
LR_model.fit(tf_idf, target_train)

LogisticRegression(C=21.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [34]:
LR_predictions = LR_model.predict(tf_idf_test)

In [35]:
LR_result = f1_score(target_test, LR_predictions)
LR_result

0.7778228532792427

In [29]:
corpus_train = features_train['text_lemmatized'].astype('U').values
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
sco = cross_validate(LR_model, tf_idf_train, target_train, cv=3, scoring = ['accuracy', 'recall', 'f1'])
sco



{'fit_time': array([8.55251813, 8.23401093, 8.12599325]),
 'score_time': array([0.07227874, 0.05802727, 0.05067229]),
 'test_accuracy': array([0.950945  , 0.95104282, 0.95164444]),
 'test_recall': array([0.55287158, 0.54832347, 0.5520217 ]),
 'test_f1': array([0.69625951, 0.69489142, 0.69892305])}

**Модель "Дерево решений"**

In [30]:
%%time
DR_model = DecisionTreeClassifier(random_state=12345)
DR_parameters = {'max_depth': range(1,8)}
DR_grid = GridSearchCV(DR_model, DR_parameters, cv=3, scoring = 'f1')
DR_grid.fit(tf_idf_train, target_train)
DR_best_parameters = DR_grid.best_params_
print('Оптимальные параметры модели:', DR_best_parameters)

Оптимальные параметры модели: {'max_depth': 7}
CPU times: user 3min 21s, sys: 34.5 ms, total: 3min 21s
Wall time: 3min 22s


In [31]:
%%time
DR_model = DecisionTreeClassifier(random_state=12345, max_depth=7)
DR_model.fit(tf_idf_train, target_train)
DR_predictions = DR_model.predict(tf_idf_test) 

CPU times: user 5.49 s, sys: 2.52 ms, total: 5.5 s
Wall time: 5.62 s


In [32]:
DR_result = f1_score(target_test, DR_predictions)
print("f1_score модели дерево решений на тестовой выборке:", DR_result)

f1_score модели дерево решений на тестовой выборке: 0.5413846954711088


**Модель "Случайный лес"**

In [33]:
%%time
SL_model = RandomForestClassifier(random_state=12345)
SL_parameters = {'n_estimators': range(1,12), 'max_depth': range(1,8)}
SL_grid = GridSearchCV(SL_model, SL_parameters, cv=3, scoring= 'f1')
SL_grid.fit(tf_idf_train, target_train)
SL_best_parameters = SL_grid.best_params_
print('Оптимальные параметры модели:', SL_best_parameters)

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision'

Оптимальные параметры модели: {'max_depth': 6, 'n_estimators': 1}
CPU times: user 2min 44s, sys: 264 ms, total: 2min 44s
Wall time: 2min 45s


In [None]:
%%time
SL_model = RandomForestClassifier(random_state=12345, max_depth=6, n_estimators=1) 
SL_model.fit(tf_idf_train, target_train) 
SL_predictions = SL_model.predict(tf_idf_test)

In [None]:
SL_result = f1_score(target_test, SL_predictions) 
print("f1_score модели случайный лес на тестовой выборке:", SL_result)

**Модель градиентного бустинга**

In [37]:
param_grid = {'learning_rate': [0.01, 0.1, 1], 'n_estimators': [20, 40, 60]}
LGB_model = LGBMClassifier(random_state=12345, num_leaves=31)
LGB_grid = GridSearchCV(LGB_model, param_grid, cv=3)
LGB_grid.fit(tf_idf, target_train)
print('Оптимальное значение параметра:', LGB_grid.best_params_)

Оптимальное значение параметра: {'learning_rate': 0.1, 'n_estimators': 60}


In [38]:
LGB_model = LGBMClassifier(random_state=12345, num_leaves = 31, learning_rate = 0.1, n_estimators = 60)
LGB_model.fit(tf_idf, target_train, eval_set=[(tf_idf_test, target_test)], 
             eval_metric='f1', early_stopping_rounds=5)
predictions_test = LGB_model.predict(tf_idf_test)
print("f1:", f1_score(target_test, predictions_test))

[1]	valid_0's binary_logloss: 0.28419
Training until validation scores don't improve for 5 rounds
[2]	valid_0's binary_logloss: 0.263066
[3]	valid_0's binary_logloss: 0.248963
[4]	valid_0's binary_logloss: 0.238214
[5]	valid_0's binary_logloss: 0.22962
[6]	valid_0's binary_logloss: 0.22271
[7]	valid_0's binary_logloss: 0.216826
[8]	valid_0's binary_logloss: 0.211873
[9]	valid_0's binary_logloss: 0.207654
[10]	valid_0's binary_logloss: 0.203687
[11]	valid_0's binary_logloss: 0.199943
[12]	valid_0's binary_logloss: 0.196543
[13]	valid_0's binary_logloss: 0.19296
[14]	valid_0's binary_logloss: 0.189989
[15]	valid_0's binary_logloss: 0.187147
[16]	valid_0's binary_logloss: 0.18443
[17]	valid_0's binary_logloss: 0.18216
[18]	valid_0's binary_logloss: 0.180042
[19]	valid_0's binary_logloss: 0.17801
[20]	valid_0's binary_logloss: 0.176083
[21]	valid_0's binary_logloss: 0.17426
[22]	valid_0's binary_logloss: 0.172477
[23]	valid_0's binary_logloss: 0.171016
[24]	valid_0's binary_logloss: 0.1696

In [39]:
%%time
model = LGBMClassifier(boosting_type='gbdt', device='cpu', verbose=0, seed=42)
model.fit(tf_idf, target_train, verbose=10)
LGBM_predictions = model.predict(tf_idf_test)

CPU times: user 2min 56s, sys: 1.03 s, total: 2min 57s
Wall time: 2min 59s


In [40]:
LGBMR_result = f1_score(target_test, LGBM_predictions)
print("f1_score модели LGBMClassifier на тестовой выборке:",LGBMR_result)

f1_score модели LGBMClassifier на тестовой выборке: 0.7433347744631792


## Выводы

В ходе выполнения этого проекта были обучены модели классификации комментариев на позитивные и негативные. 
Модель со значением метрики качества F1 не меньше 0.75 - модель линейной регрессии.
Значение метрики f1_score = 0.78