# Project for "Wikishop"

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
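
For reference, *F1* is the harmonic mean of precision and recall for the positive (here, toxic) class. The sketch below computes it from raw counts in plain Python; the helper name `f1_from_counts` and the counts are made up for illustration and do not come from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))
```

With these counts precision is 0.8 and recall is about 0.727, so F1 comes out at about 0.762, just above the 0.75 target.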


**Data description**
The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [None]:
import re

import lightgbm as lgb
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [None]:
print(pd.__version__)

1.2.4


In [None]:
df = pd.read_csv('/datasets/toxic_comments.csv', decimal = ',')
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
print('number of toxic comments in the dataset:', df.toxic.sum(), 
      ', which is', round(100*df.toxic.sum()/df.shape[0], 2), '%')

number of toxic comments in the dataset: 16225 , which is 10.17 %


Toxic comments make up only about 10% of the data. This class imbalance hurts the model, so I will upsample the training set by duplicating the toxic comments.

In [None]:
# upsample the toxic comments to balance the classes
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled
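
To check what the function above does, here is a small self-contained run on a toy series. The toy data is invented for illustration; the function body is copied from the cell above.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    # Same logic as in the notebook: keep class 0 once, repeat class 1 `repeat` times.
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    return shuffle(features_upsampled, target_upsampled, random_state=12345)

# Toy data: four non-toxic comments and one toxic comment.
toy_features = pd.Series(["fine", "ok", "nice", "good", "rude"])
toy_target = pd.Series([0, 0, 0, 0, 1])

up_features, up_target = upsample(toy_features, toy_target, 4)
print(len(up_features), up_target.mean())
```

The single toxic row is repeated four times, so the result has eight rows with an even class split. Features and target are shuffled together, so their indices stay aligned.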

In [None]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [None]:
features, target = df['text'], df['toxic']

# Build the abbreviation -> expansion map from the "ABBR=Expansion" lines
chat_words_map_dict = {}
for line in chat_words_str.split("\n"):
    if line != "":
        cw, cw_expanded = line.split("=", 1)
        chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_map_dict)

def chat_words_conversion(text):
    """Replace chat abbreviations (matched case-insensitively) with their expansions."""
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)
df['no_cw'] = df['text'].apply(chat_words_conversion)

def remove_punctuation_and_numbers(text):
    """Lowercase the text and replace every run of non-letters with one space."""
    return re.sub('[^A-Za-z]+', ' ', text.lower().strip())
df['no_punct'] = df['no_cw'].apply(remove_punctuation_and_numbers)

# Lemmatization and stop-word removal need the NLTK 'wordnet' and 'stopwords' corpora
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)

# Lemmatize the comments
lemmatizer = WordNetLemmatizer()
df['lemm'] = df['no_punct'].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split(" ")]))

# Remove English stop words; the set is built once instead of on every call
sw_nltk = set(stopwords.words('english'))
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word.lower() not in sw_nltk])
df['no_sw'] = df['lemm'].apply(remove_stopwords)

In [None]:
df['no_sw'][1]

'aww match background colour seemingly stuck thanks talk january utc'
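
The cleaning steps can be traced on a single made-up sentence. This is a simplified sketch: `TOY_CHAT_WORDS` and `TOY_STOPWORDS` are hypothetical miniature stand-ins for the full abbreviation map and the NLTK stop-word list, and lemmatization is skipped so that the example needs no NLTK data.

```python
import re

# Hypothetical miniature versions of the chat-word map and the stop-word list.
TOY_CHAT_WORDS = {"BTW": "By The Way", "THX": "Thank You"}
TOY_STOPWORDS = {"by", "the", "you", "for", "is", "a"}

def toy_clean(text: str) -> str:
    # 1. Expand chat abbreviations token by token (case-insensitive match).
    expanded = " ".join(TOY_CHAT_WORDS.get(w.upper(), w) for w in text.split())
    # 2. Lowercase and replace every run of non-letters with a single space.
    letters_only = re.sub("[^A-Za-z]+", " ", expanded.lower())
    # 3. Drop stop words (lemmatization is omitted in this sketch).
    return " ".join(w for w in letters_only.split() if w not in TOY_STOPWORDS)

cleaned = toy_clean("BTW thx for the 2 edits!!!")
print(cleaned)  # way thank edits
```

Abbreviations are expanded before punctuation is stripped, so a token such as `thx!` with trailing punctuation would not be matched. The notebook's pipeline has the same ordering.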

In [None]:
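# split into train / test / validation sets in a 3:1:1 ratio
# (the test split is used for model selection, the validation split for the final check)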
features_80, features_valid, target_80, target_valid = train_test_split(
    df['no_sw'], target, test_size=0.2, random_state=12345)

features_train, features_test, target_train, target_test = train_test_split(
    features_80, target_80, test_size=0.25, random_state=12345)
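
The two-step split gives a 60/20/20 partition: 20% is held out first, then 25% of the remaining 80% (that is, 20% of the total). A quick check on 100 dummy rows, using shorter variable names than the notebook:

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # dummy row ids standing in for the comments

# Step 1: hold out 20% of all rows.
rest, valid = train_test_split(rows, test_size=0.2, random_state=12345)
# Step 2: take 25% of the remaining 80%, i.e. 20% of the total.
train, test = train_test_split(rest, test_size=0.25, random_state=12345)

print(len(train), len(test), len(valid))  # 60 20 20
```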

In [None]:
# balance the classes (training set only)
features_train, target_train = upsample(features_train, target_train, 10) 

In [None]:
# convert the text to TF-IDF vectors (fit on the training set only)
vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)
features_valid = vectorizer.transform(features_valid)
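
The vectorizer is fitted on the training texts only and then reused for the other splits, so all matrices share the same columns and words unseen in training are ignored. A toy illustration with invented texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["thanks for the edit", "you are awful", "awful edit"]
new_texts = ["thanks you idiot"]  # "idiot" never appears in the training texts

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
new_matrix = vectorizer.transform(new_texts)          # reuses them, learns nothing new

print(train_matrix.shape, new_matrix.shape)
print("idiot" in vectorizer.vocabulary_)
```

Both matrices have seven columns, one per word seen during fitting. The unseen word contributes nothing to `new_matrix`.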

In [None]:
del features_80, target_80, df
features_train.shape, features_test.shape, features_valid.shape

((182745, 116617), (31914, 116617), (31915, 116617))

In [None]:
print('number of toxic comments in the training set:', target_train.sum(), 
      ', which is', round(100*target_train.sum()/target_train.shape[0], 2), '%')

number of toxic comments in the training set: 96670 , which is 52.9 %


The classes in the training set are now roughly balanced (52.9% toxic).
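
The 52.9% figure follows directly from the row counts. A short arithmetic check; the helper name `upsampled_share` is made up for this sketch, and the counts are derived from the outputs above.

```python
def upsampled_share(n_negative: int, n_positive: int, repeat: int) -> float:
    """Share of the positive class after repeating each positive row `repeat` times."""
    positives = n_positive * repeat
    return positives / (n_negative + positives)

# Counts implied by the outputs above: 96670 toxic rows after 10x duplication
# means 9667 original toxic rows, and 182745 - 96670 = 86075 non-toxic rows.
share = upsampled_share(n_negative=86075, n_positive=9667, repeat=10)
print(round(100 * share, 2))  # 52.9
```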

## Training

### Hyperparameter search for the decision tree

In [None]:
best_result_tree = 0
best_depth_tree = 0
for depth in range(10, 20):    # 14
    # train a tree with the given maximum depth
    model_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_tree.fit(features_train, target_train)
    # score on the test split, which serves as the model-selection set here
    result_tree = f1_score(target_test, model_tree.predict(features_test))
    if result_tree > best_result_tree:
        best_result_tree = result_tree
        best_depth_tree = depth
print("F1 of the best model:", best_result_tree)
print("Depth of the best model:", best_depth_tree)

F1 of the best model: 0.5982698961937716
Depth of the best model: 19
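
The same selection pattern on synthetic data, as a self-contained sketch. The dataset from `make_classification`, the depth range, and the variable names are assumptions chosen only to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data (roughly 20% positives).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_fit, X_select, y_fit, y_select = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

best_f1, best_depth = -1.0, None
for depth in range(2, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    score = f1_score(y_select, tree.predict(X_select))  # score on the held-out split
    if score > best_f1:
        best_f1, best_depth = score, depth

print(best_depth, round(best_f1, 3))
```

When the best value sits at the edge of the searched range, as depth 19 does above, widening the range may be worth trying.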


### Hyperparameter search for the random forest

In [None]:
max_trees = 0
max_f1 = 0
for trees in range(11, 30, 5):    # 20
    Forest_model = RandomForestClassifier(n_estimators=trees, random_state=999)
    Forest_model.fit(features_train, target_train)
    result = f1_score(target_test, Forest_model.predict(features_test))
    if max_f1 < result:
        max_f1 = result
        max_trees = trees
print('n_estimators =', max_trees) 
print('max_f1', max_f1)  

n_estimators = 21
max_f1 0.6727886056971515


### Hyperparameter search for the LightGBM model

In [None]:
max_leaves = 0
max_f1 = 0
for leaves in range(11, 30, 5):    # 20
    GBM_model = lgb.LGBMClassifier(boosting_type = 'gbdt', num_leaves = leaves, n_estimators = 1, class_weight = None)
    GBM_model.fit(features_train, target_train)
    result = f1_score(target_test, GBM_model.predict(features_test))
    if max_f1 < result:
        max_f1 = result
        max_leaves = leaves
    
print('max_leaves =', max_leaves) 
print('max_f1 =', max_f1)

max_leaves = 26
max_f1 = 0.23815134354452516


In [None]:
del model_tree, Forest_model, GBM_model

### CatBoost

In [None]:
from catboost import CatBoostClassifier

CatBoost_model = CatBoostClassifier(random_state=11111, iterations = 550).fit(features_train, target_train)
print("CatBoost F1: ", f1_score(target_test, CatBoost_model.predict(features_test)))

Learning rate set to 0.164752
0:	learn: 0.6233008	total: 3.23s	remaining: 29m 35s
1:	learn: 0.5863373	total: 5.82s	remaining: 26m 34s
2:	learn: 0.5629210	total: 8.42s	remaining: 25m 34s
3:	learn: 0.5437186	total: 11.1s	remaining: 25m 14s
4:	learn: 0.5321561	total: 13.7s	remaining: 24m 57s
5:	learn: 0.5217580	total: 16.2s	remaining: 24m 32s
6:	learn: 0.5114217	total: 18.9s	remaining: 24m 22s
7:	learn: 0.4997315	total: 21.4s	remaining: 24m 6s
8:	learn: 0.4937245	total: 23.9s	remaining: 23m 57s
9:	learn: 0.4873235	total: 26.5s	remaining: 23m 50s
10:	learn: 0.4788657	total: 29.1s	remaining: 23m 44s
11:	learn: 0.4729093	total: 31.6s	remaining: 23m 35s
12:	learn: 0.4677870	total: 34.1s	remaining: 23m 29s
13:	learn: 0.4622688	total: 36.7s	remaining: 23m 26s
14:	learn: 0.4576585	total: 39.3s	remaining: 23m 20s
15:	learn: 0.4519713	total: 41.8s	remaining: 23m 14s
16:	learn: 0.4474559	total: 44.3s	remaining: 23m 8s
17:	learn: 0.4434344	total: 46.9s	remaining: 23m 5s
18:	learn: 0.4379305	total: 4

In [None]:
from catboost import CatBoostClassifier

CatBoost_model_2 = CatBoostClassifier(random_state=98765, iterations = 550).fit(features_train, target_train)
print("CatBoost (second seed) F1: ", f1_score(target_test, CatBoost_model_2.predict(features_test)))

Learning rate set to 0.164752
0:	learn: 0.6242676	total: 3.4s	remaining: 31m 4s
1:	learn: 0.5879425	total: 5.95s	remaining: 27m 10s
2:	learn: 0.5643847	total: 8.53s	remaining: 25m 55s
3:	learn: 0.5471847	total: 11.1s	remaining: 25m 15s
4:	learn: 0.5344312	total: 13.7s	remaining: 24m 50s
5:	learn: 0.5251229	total: 16.3s	remaining: 24m 37s
6:	learn: 0.5128329	total: 18.9s	remaining: 24m 23s
7:	learn: 0.5013665	total: 21.5s	remaining: 24m 18s
8:	learn: 0.4920651	total: 24.1s	remaining: 24m 9s
9:	learn: 0.4860315	total: 26.7s	remaining: 24m 2s
10:	learn: 0.4795967	total: 29.3s	remaining: 23m 56s
11:	learn: 0.4750772	total: 32s	remaining: 23m 52s
12:	learn: 0.4694948	total: 34.6s	remaining: 23m 47s
13:	learn: 0.4652705	total: 37.3s	remaining: 23m 49s
14:	learn: 0.4580304	total: 39.9s	remaining: 23m 43s
15:	learn: 0.4516015	total: 42.6s	remaining: 23m 40s
16:	learn: 0.4460995	total: 45.2s	remaining: 23m 35s
17:	learn: 0.4417787	total: 47.8s	remaining: 23m 31s
18:	learn: 0.4381228	total: 50.4

In [None]:
print("CatBoost F1 (test): ", f1_score(target_test, CatBoost_model.predict(features_test)))

print("CatBoost F1 (validation): ", f1_score(target_valid, CatBoost_model.predict(features_valid)))


CatBoost F1 (test):  0.7470108695652175
CatBoost F1 (validation):  0.740976351818559


The result on the validation set is slightly lower (0.741 vs 0.747). Both values fall just short of the 0.75 target.

In [None]:
model_tree = DecisionTreeClassifier(random_state=12345, max_depth=19).fit(features_train, target_train)
print("Decision tree F1: ", f1_score(target_test, model_tree.predict(features_test)))

Forest_model = RandomForestClassifier(n_estimators=21, random_state=12345).fit(features_train, target_train)
print("Random forest F1: ", f1_score(target_test, Forest_model.predict(features_test)))

model_gbm = lgb.LGBMClassifier(boosting_type = 'gbdt', num_leaves = 26, n_estimators = 15).fit(features_train, target_train)
print("LightGBM F1: ", f1_score(target_test, model_gbm.predict(features_test)))

Decision tree F1:  0.5982698961937716
Random forest F1:  0.6732232325106697
LightGBM F1:  0.6449520586576425


## Testing

In [None]:
print("Decision tree F1: ", f1_score(target_valid, model_tree.predict(features_valid)))

print("Random forest F1: ", f1_score(target_valid, Forest_model.predict(features_valid)))

print("LightGBM F1: ", f1_score(target_valid, model_gbm.predict(features_valid)))

Decision tree F1:  0.5830218900160171
Random forest F1:  0.6712224753227033
LightGBM F1:  0.6310191174356764


## Conclusions

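None of the models reached the target F1 of 0.75. CatBoost came closest, with 0.747 on the test set and 0.741 on the validation set. Among the other models, random forest scored highest (0.673 and 0.671), followed by LightGBM (0.645 and 0.631) and the decision tree (0.598 and 0.583). The decision tree was the fastest to train. For every model, F1 on the test and validation sets differs by no more than about 1.5 percentage points.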