<h1>Identifying the tone of text in product card descriptions of an online store.<span class="tocSkip"></span></h1>
<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Description

Users can edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

We train a model to classify comments as positive or negative, using a dataset in which edits are labeled for toxicity.

Model quality is measured with the *F1* metric; the target is at least 0.75.
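
As a reminder of what this target measures, here is a minimal sketch of F1 computed from predictions in plain Python. The function name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 1 = toxic, 0 = non-toxic.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
toy_f1 = f1_from_labels(y_true, y_pred)
print(round(toy_f1, 3))
```

Because F1 ignores true negatives, a model that labels almost everything as non-toxic can score poorly on it even when its accuracy looks high.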

**Solution steps:**

1. Load and prepare the data.
2. Train the models.
3. Conclusions.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and the *toxic* column is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import torch
import transformers
import re 
from tqdm import notebook
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords as nltk_stopwords
from transformers import BertTokenizer, BertModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data.head(10)

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


Drop the redundant column.

In [3]:
data = data.drop(columns=["Unnamed: 0"])

In [4]:
data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
data.duplicated().sum()

0

There are no missing values or exact duplicates in the data.

Text lemmatization:

In [7]:
data['text_lemmatized'] = data['text']

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    # Map a Penn Treebank tag to a WordNet POS tag; anything else is treated as a noun.
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    return morphy_tag.get(penntag[:2], 'n')

def clear_text(text):
    # Replace every non-Latin-letter character with a space, then lemmatize
    # each token with the default (noun) POS. This is why "was" becomes "wa"
    # and "as" becomes "a" in the output below.
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(wnl.lemmatize(word) for word in cleaned.split())

def lemmafunction(text):
    # Takes a string; returns a list of lowercased, POS-aware lemmas.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]

from tqdm.notebook import tqdm
tqdm.pandas()

data['text_lemmatized'] = data['text_lemmatized'].progress_apply(clear_text)
data['text_lemmatized'] = data['text_lemmatized'].progress_apply(lemmafunction)

data.head(10)

,text,toxic,text_lemmatized
0,Explanation\nWhy the edits made under my usern...,0,"[explanation, why, the, edits, make, under, my..."
1,D'aww! He matches this background colour I'm s...,0,"[d, aww, he, match, this, background, colour, ..."
2,"Hey man, I'm really not trying to edit war. It...",0,"[hey, man, i, m, really, not, try, to, edit, w..."
3,"""\nMore\nI can't make any real suggestions on ...",0,"[more, i, can, t, make, any, real, suggestion,..."
4,"You, sir, are my hero. Any chance you remember...",0,"[you, sir, be, my, hero, any, chance, you, rem..."
5,"""\n\nCongratulations from me as well, use the ...",0,"[congratulation, from, me, a, well, use, the, ..."
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,"[cocksucker, before, you, piss, around, on, my..."
7,Your vandalism to the Matt Shirvington article...,0,"[your, vandalism, to, the, matt, shirvington, ..."
8,Sorry if the word 'nonsense' was offensive to ...,0,"[sorry, if, the, word, nonsense, wa, offensive..."
9,alignment on this subject and which are contra...,0,"[alignment, on, this, subject, and, which, be,..."


In [8]:
# Join each token list back into a single space-separated string.
data['text_lemmatized'] = data['text_lemmatized'].apply(" ".join)
data.head(10)

,text,toxic,text_lemmatized
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits make under my userna...
1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour i m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not try to edit war it s ju...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...
5,"""\n\nCongratulations from me as well, use the ...",0,congratulation from me a well use the tool wel...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,cocksucker before you piss around on my work
7,Your vandalism to the Matt Shirvington article...,0,your vandalism to the matt shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,sorry if the word nonsense wa offensive to you...
9,alignment on this subject and which are contra...,0,alignment on this subject and which be contrar...
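
The tokens `i m` and `can t` in the table come from the cleaning regex in `clear_text`, which turns apostrophes and punctuation into spaces. A minimal sketch of that step alone, using only the standard library; the helper name `strip_non_letters` is an assumption for illustration, and lemmatization is left out, so `trying` stays as is instead of becoming `try`.

```python
import re

def strip_non_letters(text):
    """Replace every character that is not a Latin letter with a space, then split."""
    return re.sub(r'[^a-zA-Z]', ' ', text).split()

# Visible beginning of row 2 from the table above.
sample = "Hey man, I'm really not trying to edit war."
tokens = [t.lower() for t in strip_non_letters(sample)]
print(tokens)
```

Digits and non-Latin characters are removed in the same way, so the features are built from English letter sequences only.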


Defining the features and the target:

In [9]:
target = data['toxic']
features = data['text_lemmatized']

Splitting into training and test sets:

In [10]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1, random_state=0)

Building the TF-IDF matrix with stop words removed:

In [11]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))  # a list is accepted across scikit-learn versions
train_tf_idf = count_tf_idf.fit_transform(features_train)

print("Matrix shape:", train_tf_idf.shape)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Matrix shape: (143362, 145383)


In [12]:
test_tf_idf = count_tf_idf.transform(features_test)
print("Matrix shape:", test_tf_idf.shape)

Matrix shape: (15930, 145383)
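
Both matrices have the same number of columns because the vocabulary and IDF weights are learned from the training texts only (`fit_transform`) and then reused for the test texts (`transform`). The sketch below illustrates that idea in plain Python under simplifying assumptions: whitespace tokenization, no stop-word filtering, and the smoothed IDF formula that scikit-learn documents for its default `smooth_idf=True`. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn a vocabulary with smoothed IDF weights from training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Return an L2-normalized {term: weight} vector; unseen terms are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["edit war again", "thank you for the edit", "you vandal"]
idf = fit_idf(train_docs)

# "newword" never appeared in training, so it gets no column.
test_vec = transform("edit war newword", idf)
print(test_vec)
```

Fitting on the training set alone keeps information from the test texts out of the features, which is why the test matrix is produced with `transform` rather than a second `fit_transform`.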


## Training

Logistic regression on TF-IDF features:

In [45]:
model_lr = LogisticRegression(random_state=12345,
                                solver='liblinear')

In [46]:
model_lr.get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

We tune regularization through the `C` parameter (the inverse of regularization strength, so larger values mean weaker regularization).

In [47]:
lr_param = {'C': [0.1, 1, 5, 10, 15]}

In [58]:
gsearch = GridSearchCV(estimator=model_lr, 
                       cv=5, 
                       param_grid=lr_param,  
                       scoring='f1')

gsearch.fit(train_tf_idf, target_train)

GridSearchCV(cv=5,
             estimator=LogisticRegression(random_state=12345,
                                          solver='liblinear'),
             param_grid={'C': [0.1, 1, 5, 10, 15]}, scoring='f1')

Best model:

In [59]:
gsearch.best_estimator_

LogisticRegression(C=15, random_state=12345, solver='liblinear')

Cross-validated F1 of the best model:

In [60]:
gsearch.best_score_

0.7667905969902142

Parameters of the best model:

In [61]:
gsearch.best_params_

{'C': 15}
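
To make the selection procedure concrete, here is a simplified sketch of what a grid search with `cv=5` does: each candidate is scored on five held-out folds and the candidate with the highest mean score wins. It is an illustration, not scikit-learn's implementation: the folds are contiguous and unstratified (for a classifier with an integer `cv`, scikit-learn uses stratified folds), and `fake_fit_and_score` is a hypothetical stand-in for fitting a model and computing F1.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) for k contiguous folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def grid_search(candidates, fit_and_score, n_samples, k=5):
    """Return (best_candidate, best_mean_score) across k folds."""
    mean_scores = {}
    for cand in candidates:
        scores = [fit_and_score(cand, tr, va)
                  for tr, va in kfold_indices(n_samples, k)]
        mean_scores[cand] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores[best]

# Hypothetical scorer: an arbitrary curve that peaks at 1 and ignores the fold.
def fake_fit_and_score(cand, train_idx, valid_idx):
    return 1.0 - abs(cand - 1) / 100

best_c, best_score = grid_search([0.1, 1, 10], fake_fit_and_score, n_samples=23)
print(best_c, best_score)
```

With the default `refit=True`, `GridSearchCV` then refits the winning configuration on the whole training set, which is what `best_estimator_` holds.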

CatBoost model:

In [63]:
# shrink the dataset because the full one takes too long to process:
features_train_cb = train_tf_idf[:900]
features_test_cb = test_tf_idf[:300]
target_train_cb = target_train[:900]
target_test_cb = target_test[:300]

In [69]:
%%time
zeroes = data['toxic'].value_counts()[0]
ones = data['toxic'].value_counts()[1]

cb = CatBoostClassifier(verbose=False, iterations=50)
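# parameter grid (note: CatBoost documents a maximum depth of 16 on CPU,
# so the depth=20 candidate is likely rejected at fit time):
cb_param = {'depth': [2, 4, 6, 8, 20]}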

gsearch_cb = GridSearchCV(estimator=cb, 
                       cv=5, 
                       param_grid=cb_param,  
                       scoring='f1')

gsearch_cb.fit(features_train_cb, target_train_cb)

CPU times: user 1min 1s, sys: 2.04 s, total: 1min 3s
Wall time: 1min 14s


GridSearchCV(cv=5,
             estimator=<catboost.core.CatBoostClassifier object at 0x7f33ec25d070>,
             param_grid={'depth': [2, 4, 6, 8, 20]}, scoring='f1')

In [71]:
gsearch_cb.best_score_

0.32001367053998636

In [72]:
gsearch_cb.best_params_

{'depth': 4}

Random forest:

In [73]:
rf = RandomForestClassifier(class_weight='balanced')

rf_param = {'n_estimators': range(20, 40, 5),
            'max_depth': range(2, 10, 2)}


In [77]:
gsearch_rf = GridSearchCV(estimator=rf, 
                       cv=5, 
                       param_grid=rf_param,  
                       scoring='f1')

gsearch_rf.fit(train_tf_idf, target_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(class_weight='balanced'),
             param_grid={'max_depth': range(2, 10, 2),
                         'n_estimators': range(20, 40, 5)},
             scoring='f1')

In [78]:
gsearch_rf.best_score_

0.33395510216101343

In [81]:
gsearch_rf.best_params_

{'max_depth': 8, 'n_estimators': 35}
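
The forest above is built with `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`. A minimal sketch of that formula in plain Python follows; the helper name and the toy labels are assumptions for illustration, not the project's actual class counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * number of samples in class c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Toy labels (hypothetical): nine non-toxic comments and one toxic one.
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class gets the larger weight, so errors on toxic comments cost more during training than errors on the majority class.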

Logistic regression achieved the best cross-validation score, so we evaluate it on the test set:

In [82]:
lr_model_test = gsearch.best_estimator_

lr_model_test.fit(train_tf_idf, target_train)

lr_predictions = lr_model_test.predict(test_tf_idf)
lr_f1 = round(f1_score(target_test, lr_predictions), 3) 
print(lr_f1)

0.766


## Conclusions

In the course of the project we:

- Prepared the data for model training
- Trained the models and selected the best one by cross-validation

LogisticRegression had the best cross-validated F1 (0.767) and reached 0.766 on the test set, above the 0.75 target. Random forest (0.334) and CatBoost (0.320) scored far lower in cross-validation.
CatBoost could not be evaluated on the full data because the kernel crashed. Its score was obtained on a dataset cut down to 900 training rows, so it is not directly comparable with the other models.

The most sensible choice is logistic regression.