# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import re
import time
import warnings

import nltk
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle

In [2]:
warnings.filterwarnings('ignore')
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\reiji\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
#df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
df = pd.read_csv('/datasets/toxic_comments.csv') 

In [4]:
df.head()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [6]:
df['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

**- All data loaded correctly**  
**- The dataset has 2 columns and 159571 rows**  
**- There are no missing values**  
**- The target is strongly imbalanced: "0" is 89.8% of the sample, "1" is 10.2%.**

In [7]:
X = df.drop('toxic', axis=1)
y = df['toxic']

**Lemmatize the text data:**

In [8]:
%%time
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

X = X.text.apply(lemmatize_text)

Wall time: 1min 13s


In [9]:
X = X.map(lambda x: ' '.join(x))

In [10]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z ]',' ', text)
    text = text.strip(' ')
    return text
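The order of the substitutions in `clean_text` matters: `can't` has to be expanded before the generic `n't` rule, otherwise it turns into `ca not`. The sketch below illustrates this with a reduced copy of the rules; the names `RULES` and `expand` are hypothetical and are not part of the notebook.

```python
import re

# Reduced, illustrative copy of the contraction rules from clean_text.
RULES = [
    (r"can't", "cannot "),
    (r"n't", " not "),
    (r"i'm", "i am "),
]

def expand(text, rules=RULES):
    """Lower-case the text and apply the rules in the given order."""
    text = text.lower()
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    # Collapse repeated whitespace, as the notebook does in a later cell.
    return " ".join(text.split())

print(expand("I can't say I'm sure, but it isn't bad"))
# i cannot say i am sure, but it is not bad

# Applying the same rules in reverse order breaks "can't".
print(expand("I can't", rules=list(reversed(RULES))))
# i ca not
```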

In [11]:
%%time
X = X.map(lambda x: clean_text(x))

Wall time: 13.7 s


In [12]:
X = X.map(lambda x: ' '.join(x.split()))    
print (X.head())

0    explanation why the edits made under my userna...
1    d aww he match this background colour i am see...
2    hey man i am really not trying to edit war it ...
3    more i cannot make any real suggestion on impr...
4    you sir are my hero any chance you remember wh...
Name: text, dtype: object


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=12345)

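**Define an upsampling helper to address the imbalance in the target (its call in the next cell is commented out, so the models below are trained on the original, imbalanced split):**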

In [14]:
def upsample(features, target, repeat): 
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]  
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat) 
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    features_upsampled, target_upsampled = shuffle( features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

In [15]:
#print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)
#print (y_train.value_counts())
#X_train, y_train = upsample(X_train, y_train, 8)
#print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)
#print (y_train.value_counts())

(119678,) (119678,) (39893,) (39893,)
0    107540
1     12138
Name: toxic, dtype: int64
(119678,) (119678,) (39893,) (39893,)
0    107540
1     12138
Name: toxic, dtype: int64


In [None]:
#**Class imbalance removed: "0" - 53%, "1" - 47%**
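The 53% / 47% figures in the commented-out note can be checked with plain arithmetic from the training-split counts printed above, assuming the positive class is repeated 8 times as in the commented-out `upsample` call. This is only a sketch of the calculation; the variable names are illustrative.

```python
# Class counts in the training split, taken from the output above.
n_zeros, n_ones = 107540, 12138
repeat = 8  # value used in the commented-out upsample call

# Upsampling repeats every positive example `repeat` times.
n_ones_up = n_ones * repeat
total_up = n_zeros + n_ones_up

share_zeros = 100 * n_zeros / total_up
share_ones = 100 * n_ones_up / total_up

print(f"before: {100 * n_ones / (n_zeros + n_ones):.1f}% positive")
print(f"after:  {share_zeros:.1f}% negative, {share_ones:.1f}% positive")
```

Upsampling like this should be applied to the training split only, which is what the helper is set up for here since it runs after `train_test_split`.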

## Training

In [16]:
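def no_parametrs (model):
    # Fit on the training split and report F1 on the held-out test split
    model.fit(X_train, y_train)
    predict = model.predict(X_test)
    # f1_score expects (y_true, y_pred)
    f1 = f1_score(y_test, predict)
    print('f1:', round(f1, 3))
    return f1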
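For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels (the data and helper names are hypothetical). It also shows that for binary labels the value does not change when true and predicted labels are swapped, although the order documented by scikit-learn is `(y_true, y_pred)`.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN) for the positive class 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_manual(y_true, y_pred):
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 4 toxic comments, the model finds 2 and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

f1 = f1_manual(y_true, y_pred)
f1_swapped = f1_manual(y_pred, y_true)  # FP and FN trade places
print(round(f1, 3), round(f1_swapped, 3))
```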

In [17]:
# One random train/validation split (ShuffleSplit holds out 10% by default)
cv=ShuffleSplit(n_splits=1, random_state=12345)

def giper (model, params):
    # refit=False: only the best parameters are reported; the model is rebuilt in a later cell
    grid = GridSearchCV(estimator=model, param_grid=params, cv=cv, scoring='f1', n_jobs=-1, refit=False)
    grid.fit(X_train, y_train)
#    f1 = grid.best_score_
    print('Best model parameters:', grid.best_params_) 
#    print('f1 with hyperparameters:', round (f1, 3))
#    return f1
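`ShuffleSplit(n_splits=1)` produces a single random train/validation partition, not k-fold cross-validation, so each parameter combination in the grid search is scored on one held-out sample. A small sketch on 20 hypothetical samples, relying on the default `test_size` of 0.1:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 20 hypothetical samples, only used to inspect the split sizes.
X_toy = np.arange(20).reshape(-1, 1)

cv_toy = ShuffleSplit(n_splits=1, random_state=12345)
splits = list(cv_toy.split(X_toy))
train_idx, valid_idx = splits[0]

# One split: 90% of the rows for fitting, 10% for scoring.
print(len(splits), len(train_idx), len(valid_idx))
```

With a single split, the chosen parameters depend on which rows land in the validation part; raising `n_splits` would average the score over several random partitions at the cost of longer runs.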

In [18]:
value = []

### LogisticRegression

In [19]:
lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('clf', LogisticRegression(random_state=12345))])
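The `tfidf` step turns each comment into a vector of term weights. As a rough illustration of the idea, the sketch below computes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the default formula in scikit-learn's `TfidfVectorizer`. The three documents are made up, and the tokenization is a plain `split()` without stop-word removal, which is simpler than what the pipeline does.

```python
import math

# Made-up mini corpus.
docs = [
    "you are my hero",
    "you are wrong",
    "thank you",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

vocab = sorted({token for doc in tokenized for token in doc})

# Document frequency: in how many documents each term occurs.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed IDF: terms found in every document get the minimum weight 1.0.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

for term in ("you", "are", "hero"):
    print(term, df[term], round(idf[term], 3))
```

Rare terms such as `hero` get a higher weight than terms that occur in every document, which is why frequent function words contribute little even before the stop-word list is applied.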

In [20]:
%%time
print ('Default hyperparameters:')
value.append(no_parametrs (lr_pipe))

Default hyperparameters:
f1: 0.737
Wall time: 16.3 s


In [21]:
lr_params= {'clf__C': [1,3,5],
      'clf__class_weight': ['balanced', None]}
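The `'balanced'` option in the grid reweights classes inversely to their frequency, using `n_samples / (n_classes * count_of_class)` as described in the scikit-learn documentation. A sketch of the resulting weights for the training-split counts printed earlier (variable names are illustrative):

```python
# Class counts in the training split, taken from the earlier output.
counts = {0: 107540, 1: 12138}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(label, round(w, 3))
```

Each toxic example therefore counts roughly nine times as much as a non-toxic one in the loss, which usually raises recall at the cost of precision; the grid search lets the F1 score decide between the two settings.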

In [22]:
%%time
giper(lr_pipe, lr_params)

Best model parameters: {'clf__C': 5, 'clf__class_weight': None}
Wall time: 45.4 s


In [23]:
lr_pipe_2= Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('clf', LogisticRegression(random_state=12345, C=5, class_weight=None))])

In [24]:
%%time
print ('Tuned hyperparameters:')
value.append(no_parametrs (lr_pipe_2))

Tuned hyperparameters:
f1: 0.774
Wall time: 16.8 s


### CatBoostClassifier

In [25]:
cat_pipe = Pipeline([
    ('tfidf', TfidfVectorizer (stop_words=stopwords)),
    ('clf', CatBoostClassifier(random_state=12345, iterations=20))])

In [26]:
%%time
print ('Default hyperparameters:')
value.append(no_parametrs (cat_pipe))

Default hyperparameters:
Learning rate set to 0.5
0:	learn: 0.3487411	total: 2.83s	remaining: 53.7s
1:	learn: 0.2611765	total: 5.05s	remaining: 45.5s
2:	learn: 0.2379450	total: 7.2s	remaining: 40.8s
3:	learn: 0.2207759	total: 9.4s	remaining: 37.6s
4:	learn: 0.2125078	total: 11.6s	remaining: 34.7s
5:	learn: 0.2066996	total: 13.8s	remaining: 32.1s
6:	learn: 0.2005670	total: 16s	remaining: 29.6s
7:	learn: 0.1959569	total: 18.2s	remaining: 27.3s
8:	learn: 0.1925906	total: 20.4s	remaining: 25s
9:	learn: 0.1881789	total: 22.7s	remaining: 22.7s
10:	learn: 0.1848464	total: 24.9s	remaining: 20.4s
11:	learn: 0.1812500	total: 27.3s	remaining: 18.2s
12:	learn: 0.1784575	total: 29.4s	remaining: 15.8s
13:	learn: 0.1765541	total: 31.6s	remaining: 13.5s
14:	learn: 0.1744910	total: 33.9s	remaining: 11.3s
15:	learn: 0.1727788	total: 36s	remaining: 9.01s
16:	learn: 0.1710551	total: 38.3s	remaining: 6.76s
17:	learn: 0.1692989	total: 40.4s	remaining: 4.49s
18:	learn: 0.1668213	total: 42.6s	remaining: 2.24s
19:

In [27]:
cat_params= {'clf__max_depth': [1, 3, -1]}

In [28]:
%%time
giper(cat_pipe, cat_params)

Best model parameters: {'clf__max_depth': 3}
Wall time: 1min 1s


In [29]:
cat_pipe_2= Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('clf', CatBoostClassifier(random_state=12345, iterations=20, max_depth = 3))])

In [30]:
%%time
print ('Tuned hyperparameters:')
value.append(no_parametrs (cat_pipe_2))

Tuned hyperparameters:
Learning rate set to 0.5
0:	learn: 0.3538501	total: 684ms	remaining: 13s
1:	learn: 0.2779995	total: 1.36s	remaining: 12.2s
2:	learn: 0.2544812	total: 2.02s	remaining: 11.5s
3:	learn: 0.2427986	total: 2.67s	remaining: 10.7s
4:	learn: 0.2348714	total: 3.36s	remaining: 10.1s
5:	learn: 0.2299637	total: 4.09s	remaining: 9.55s
6:	learn: 0.2229257	total: 4.76s	remaining: 8.85s
7:	learn: 0.2180596	total: 5.47s	remaining: 8.21s
8:	learn: 0.2140510	total: 6.14s	remaining: 7.51s
9:	learn: 0.2104940	total: 6.83s	remaining: 6.83s
10:	learn: 0.2062758	total: 7.53s	remaining: 6.16s
11:	learn: 0.2040850	total: 8.2s	remaining: 5.46s
12:	learn: 0.2020268	total: 8.88s	remaining: 4.78s
13:	learn: 0.1997674	total: 9.54s	remaining: 4.09s
14:	learn: 0.1980011	total: 10.2s	remaining: 3.41s
15:	learn: 0.1961883	total: 10.9s	remaining: 2.72s
16:	learn: 0.1943882	total: 11.5s	remaining: 2.04s
17:	learn: 0.1926547	total: 12.2s	remaining: 1.35s
18:	learn: 0.1902146	total: 12.9s	remaining: 678ms

### LGBM

In [31]:
lgb_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('clf', LGBMClassifier(random_state=12345))])

In [32]:
%%time
print ('Default hyperparameters:')
value.append(no_parametrs (lgb_pipe))

Default hyperparameters:
f1: 0.745
Wall time: 53.5 s


In [33]:
lgb_params = {'clf__n_estimators': [20],
    'clf__max_depth': [1, 3, -1]}

In [34]:
%%time
giper(lgb_pipe, lgb_params)

Best model parameters: {'clf__max_depth': -1, 'clf__n_estimators': 20}
Wall time: 44.1 s


In [35]:
lgb_pipe_2= Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('clf', LGBMClassifier(random_state=12345, max_depth = -1, n_estimators = 20))])

In [36]:
%%time
print ('Tuned hyperparameters:')
value.append(no_parametrs (lgb_pipe_2))

Tuned hyperparameters:
f1: 0.643
Wall time: 26.5 s


## Conclusions

In [37]:
rezult=pd.DataFrame(value, index=['LogisticRegression, default hyperparameters', 'LogisticRegression, tuned hyperparameters',
                                  'CatBoostClassifier, default hyperparameters', 'CatBoostClassifier, tuned hyperparameters',
                                  'LGBMClassifier, default hyperparameters', 'LGBMClassifier, tuned hyperparameters'])

In [38]:
# Assign the column name directly; set_axis(..., inplace=True) was removed in pandas 2.0
rezult.columns = ['F1']

In [39]:
rezult

| | F1 |
|---|---|
| LogisticRegression, default hyperparameters | 0.737417 |
| LogisticRegression, tuned hyperparameters | 0.773656 |
| CatBoostClassifier, default hyperparameters | 0.67804 |
| CatBoostClassifier, tuned hyperparameters | 0.611394 |
| LGBMClassifier, default hyperparameters | 0.744678 |
| LGBMClassifier, tuned hyperparameters | 0.642515 |


**- The best result came from logistic regression with C = 5 and class_weight = None**  
**- Its F1 on the test set was 0.774, above the required 0.75**  
**- C = 5 is the upper edge of the searched grid, so a larger C may raise the metric further; this was not tested, and training time grows considerably as C increases**