<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Загрузка-данных,-описание" data-toc-modified-id="Загрузка-данных,-описание-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading and describing the data</a></span></li><li><span><a href="#Конвертация,-лемматизация,-замена,-подготовка-текстовых-данных" data-toc-modified-id="Конвертация,-лемматизация,-замена,-подготовка-текстовых-данных-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Conversion, lemmatization, replacement, and preparation of the text data</a></span></li><li><span><a href="#Выделяем-признаки,-делим-на-train-и-test" data-toc-modified-id="Выделяем-признаки,-делим-на-train-и-test-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Selecting features and splitting into train and test</a></span></li><li><span><a href="#Pipeline" data-toc-modified-id="Pipeline-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Pipeline</a></span></li><li><span><a href="#Creation-of-different-Pipelines" data-toc-modified-id="Creation-of-different-Pipelines-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Creation of different Pipelines</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

## Preparation

### Loading and describing the data

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.utils import shuffle

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import catboost as cb
import lightgbm as lgb
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import re

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score


import pymorphy2

from sklearn.pipeline import Pipeline

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

stopwords = set(nltk_stopwords.words('english'))

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Анжела\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Анжела\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Анжела\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv(r'D:\MyDesktop\toxic_comments.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
print(df.duplicated().sum())

0


In [5]:
df.sample(3)

| | text | toxic |
|---|---|---|
| 69749 | References specifically for the current defini... | 0 |
| 83651 | Apparently I wasn't clear enough the first tim... | 1 |
| 126199 | "\nHi there, I noticed your comments on Tim's ... | 0 |


The dataset contains user comments (`text`) and a binary toxicity label (`toxic`). It has 159,571 rows, with no missing values and no duplicates.

### Conversion, lemmatization, replacement, and preparation of the text data

In [6]:
df_clear_text = df.copy()

In [7]:
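def clean_text(text):
    """Lowercase the text, keep only the letters a-z, and drop one-letter tokens."""
    text = text.lower().strip()
    # Replace every character that is not a lowercase Latin letter with a space
    text = re.sub(r"[^a-z]", ' ', text)
    # Splitting and rejoining collapses repeated spaces; one-letter tokens are dropped
    text = ' '.join(word for word in text.split() if len(word) > 1)
    return text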

<ul>
  <li>convert the text to lowercase</li>
  <li>keep only the Latin letters a-z; every other character is replaced with a space</li>
  <li>keep only words longer than one character</li>
</ul>
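To make the three steps concrete, here is a small self-contained check of the same cleaning logic. The sample string is made up for illustration and is not taken from the dataset.

```python
import re

def clean_text(text):
    # Same logic as the notebook's clean_text, repeated so the example runs on its own
    text = text.lower().strip()
    text = re.sub(r"[^a-z]", ' ', text)
    text = ' '.join(word for word in text.split() if len(word) > 1)
    return text

sample = "\nMore\nI can't make any real suggestions!"  # hypothetical input
cleaned = clean_text(sample)
print(cleaned)  # more can make any real suggestions
```

Note one side effect: the apostrophe splits "can't" into "can" and "t", and the one-letter "t" is then dropped, so the negation is lost.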

In [8]:
df_clear_text['text'] = df_clear_text['text'].apply(clean_text)

In [9]:
lemmatizer = WordNetLemmatizer()
# Tokenize each comment and lemmatize every token (lemmatize() treats words as nouns by default)
df_clear_text['text'] = df_clear_text['text'].apply(
    lambda text: ' '.join(lemmatizer.lemmatize(w) for w in nltk.word_tokenize(text))
)

In [10]:
df.loc[3]

text     "\nMore\nI can't make any real suggestions on ...
toxic                                                    0
Name: 3, dtype: object

In [11]:
df_clear_text.loc[3]

text     more can make any real suggestion on improveme...
toxic                                                    0
Name: 3, dtype: object

The extra characters are gone and the text is lemmatized: suggestions -> suggestion
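`WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, so plural nouns are reduced but verb forms are mostly left as they are. One possible refinement, which the notebook does not use, is to pass a POS tag for each token. The sketch below is only an illustration: `penn_to_wordnet_pos` and `lemmatize_with_pos` are hypothetical helper names, and `nltk.pos_tag` needs an extra tagger resource to be downloaded first.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default


def lemmatize_with_pos(text):
    """Hypothetical helper: lemmatize each token using its POS tag."""
    # Imported here so the mapping above can be used without NLTK installed
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return ' '.join(
        lemmatizer.lemmatize(word, penn_to_wordnet_pos(tag)) for word, tag in tagged
    )
```

POS tagging 159,571 comments is noticeably slower than plain lemmatization, so whether the extra normalization is worth it would have to be checked against the F1 score.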

### Selecting features and splitting into train and test

In [12]:
features = df_clear_text['text']
target = df_clear_text['toxic']

In [13]:
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=12345, stratify=target
)

The split is stratified by `target`, so classes 0 and 1 appear in the same proportions in the train and test sets.
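As a rough illustration of what stratification achieves, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper, and the labels are made up (10% positives).

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up labels: 100 positives and 900 negatives
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=12345)

train_share = Counter(labels[i] for i in train_idx)[1] / len(train_idx)
test_share = Counter(labels[i] for i in test_idx)[1] / len(test_idx)
print(train_share, test_share)
```

Both parts end up with the same share of positives, which matters when the positive class is rare and the metric is F1.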

### Pipeline

In [14]:
my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stopwords)),
    ('clf', 'passthrough')
])

In [15]:
parameters = [
    {
        'clf':[LogisticRegression(random_state = 12345,class_weight='balanced',n_jobs=-1)],
        'clf__max_iter':[100,200]
    },
    {
        'clf':[lgb.LGBMClassifier(random_state=1234)],
        'clf__learning_rate':[0.1,0.15],
        'clf__n_estimators':[100,200,500],
        'clf__num_leaves':[31,41]
    },    
    {
        'clf':[RandomForestClassifier(random_state=1234)],
        'clf__n_estimators':range(1,100,30),
        'clf__max_depth':range(2,100,30)
    }, 
    {
        'clf':[cb.CatBoostClassifier(loss_function='Logloss', random_state=12345, custom_metric='F1', verbose=False)],
        'clf__iterations':[150, 300]
    },

]

In [16]:
grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, n_jobs=-1, scoring='f1')
grid_search.fit(x_train, y_train)

Fitting 3 folds for each of 32 candidates, totalling 96 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vectorizer',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
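The "32 candidates, totalling 96 fits" line in the log follows directly from the sizes of the four sub-grids in `parameters`. The short check below reproduces that arithmetic; the `grid_sizes` dictionary is written by hand for illustration and is not part of the notebook.

```python
from math import prod

# Number of values per hyperparameter in each sub-grid, copied from `parameters`
grid_sizes = {
    'LogisticRegression': [2],        # max_iter
    'LGBMClassifier': [2, 3, 2],      # learning_rate, n_estimators, num_leaves
    'RandomForestClassifier': [
        len(range(1, 100, 30)),       # n_estimators: 1, 31, 61, 91
        len(range(2, 100, 30)),       # max_depth: 2, 32, 62, 92
    ],
    'CatBoostClassifier': [2],        # iterations
}

candidates = {name: prod(sizes) for name, sizes in grid_sizes.items()}
n_candidates = sum(candidates.values())
n_fits = n_candidates * 3  # cv=3 means every candidate is fitted three times

print(candidates)
print(n_candidates, n_fits)  # 32 96
```

Note that the random forest grid includes `n_estimators=1`, that is, a single tree.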
    

In [17]:
grid_search.best_params_
#grid_search.cv_results_

{'clf': LGBMClassifier(learning_rate=0.15, n_estimators=500, random_state=1234),
 'clf__learning_rate': 0.15,
 'clf__n_estimators': 500,
 'clf__num_leaves': 31}

In [44]:
print(grid_search.best_score_)  # mean cross-validated F1 of the best candidate

0.769803534019645

In [20]:
predict_test=grid_search.predict(x_test)

In [21]:
print("F1 of the best model on the test set = ", np.round(f1_score(y_test, predict_test), 3))

F1 of the best model on the test set =  0.783
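For reference, with binary labels both `scoring='f1'` and `f1_score` report by default the harmonic mean of precision and recall for the positive class (here `toxic = 1`). The sketch below computes it from confusion-matrix counts; the counts are made up for illustration and are not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 toxic comments found, 2 false alarms, 4 toxic comments missed
f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(round(f1, 3))  # 0.727
```

True negatives do not enter the formula, which is why F1 is a reasonable choice when the negative class dominates.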


### Creation of different Pipelines

In [14]:
# LogisticRegression
pipe_lr = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(random_state=42))])
# LGBMClassifier
pipe_lg = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', lgb.LGBMClassifier(random_state=42))])
# Random Forest
pipe_rf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', RandomForestClassifier(random_state=42))])

# CatBoost
pipe_cb = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', cb.CatBoostClassifier(loss_function='Logloss', random_state=12345, custom_metric='F1', verbose=False))])

In [15]:
param_grid_lr = [
    {
        'clf__max_iter':[100,200]
    }]

param_grid_lg = [
    {
        'clf__learning_rate':[0.1,0.15],
        'clf__n_estimators':[100,200,500],
        'clf__num_leaves':[31,41]
    }]

param_grid_rf = [
    {
        'clf__n_estimators':range(1,100,30),
        'clf__max_depth':range(2,100,30)
    }]

param_grid_cb = [
    {
        'clf__iterations':[150, 300]
    }]

In [16]:
LR = GridSearchCV(estimator=pipe_lr,
            param_grid=param_grid_lr,
            scoring='f1',
            cv=3) 

LG = GridSearchCV(estimator=pipe_lg,
            param_grid=param_grid_lg,
            scoring='f1',
            cv=3) 

RF = GridSearchCV(estimator=pipe_rf,
            param_grid=param_grid_rf,
            scoring='f1',
            cv=3, 
            n_jobs=-1)


CB = GridSearchCV(estimator=pipe_cb,
            param_grid=param_grid_cb,
            scoring='f1',
            cv=3,
            n_jobs=-1)

In [17]:
grids = [LR,LG,RF,CB]

In [18]:
grid_dict = {0: 'Logistic Regression', 
        1: 'LGBMClassifier',     
        2: 'Random Forest',
        3: 'CatBoost'}

In [19]:
print('Performing model optimizations...')
best_f1 = 0.0
best_clf = 0
best_gs = None
spr = {}

for idx, gs in enumerate(grids):
    print('\nEstimator: %s' % grid_dict[idx])
    gs.fit(x_train, y_train)
    print('Best params are : %s' % gs.best_params_)
    # best_score_ is the mean cross-validated F1 on the training data
    print('Best training F1: %.3f' % gs.best_score_)

    # Score the refitted best estimator on the hold-out test set once and reuse the value
    y_pred = gs.predict(x_test)
    test_f1 = f1_score(y_test, y_pred)
    print('Test set F1 score for best params: %.3f ' % test_f1)

    spr[grid_dict[idx]] = [gs.best_score_, test_f1]

    # Keep track of the grid search with the best test F1
    if test_f1 > best_f1:
        best_f1 = test_f1
        best_gs = gs
        best_clf = idx

Performing model optimizations...

Estimator: Logistic Regression
Best params are : {'clf__max_iter': 100}
Best training F1: 0.727
Test set F1 score for best params: 0.749 

Estimator: LGBMClassifier
Best params are : {'clf__learning_rate': 0.15, 'clf__n_estimators': 500, 'clf__num_leaves': 31}
Best training F1: 0.784
Test set F1 score for best params: 0.798 

Estimator: Random Forest
Best params are : {'clf__max_depth': 92, 'clf__n_estimators': 1}
Best training F1: 0.340
Test set F1 score for best params: 0.328 

Estimator: CatBoost
Best params are : {'clf__iterations': 150}
Best training F1: 0.758
Test set F1 score for best params: 0.773 


## Conclusions

In [21]:
data_spr = pd.DataFrame(data=spr, index=['F1 on train (CV)', 'F1 on test'])

In [22]:
data_spr

| | Logistic Regression | LGBMClassifier | Random Forest | CatBoost |
|---|---|---|---|---|
| F1 on train (CV) | 0.726769 | 0.783616 | 0.33955 | 0.757992 |
| F1 on test | 0.748895 | 0.798011 | 0.327609 | 0.773026 |


In [31]:
print('Classifier with best test set F1: {}'.format(grid_dict[best_clf]))
print('\nTest set F1 score for best params: {}'.format(round(best_f1,3)))

Classifier with best test set F1: LGBMClassifier

Test set F1 score for best params: 0.798
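The loop above picks the winner by its test-set F1. A common alternative, sketched here as a suggestion rather than as the notebook's method, is to select by the cross-validated score and keep the test set only for the final estimate. The numbers below are copied from the results table; the variable names are assumptions for illustration.

```python
# [mean CV F1 on train, F1 on test], copied from the results table
spr = {
    'Logistic Regression': [0.726769, 0.748895],
    'LGBMClassifier': [0.783616, 0.798011],
    'Random Forest': [0.33955, 0.327609],
    'CatBoost': [0.757992, 0.773026],
}

# Select by cross-validated F1 (index 0), then report the test F1 of that model
best_by_cv = max(spr, key=lambda name: spr[name][0])
cv_f1, test_f1 = spr[best_by_cv]
print(best_by_cv, round(cv_f1, 3), round(test_f1, 3))
```

With these results both criteria select the same model, LGBMClassifier, so the conclusion does not change.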
