<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#DecisionTreeClassifier" data-toc-modified-id="DecisionTreeClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>DecisionTreeClassifier</a></span></li><li><span><a href="#CatBoostClassifier" data-toc-modified-id="CatBoostClassifier-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>CatBoostClassifier</a></span></li><li><span><a href="#Тестируем-лучшую-модель" data-toc-modified-id="Тестируем-лучшую-модель-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Testing the best model</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project: Toxic Comment Detection

## Preparation

In [2]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier 
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score, roc_curve
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm
tqdm.pandas()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load the data
df = pd.read_csv('/datasets/toxic_comments.csv')

In [4]:
# look at general information
display(df.info())
display(df.head(10))
display(df.describe())
print('Missing values:', df.isna().sum())
print('Duplicates:', df.duplicated().sum())
print('Dataset shape:', df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


None

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


Unnamed: 0.1,Unnamed: 0,toxic
count,159292.0,159292.0
mean,79725.697242,0.101612
std,46028.837471,0.302139
min,0.0,0.0
25%,39872.75,0.0
50%,79721.5,0.0
75%,119573.25,0.0
max,159450.0,1.0


Missing values: Unnamed: 0    0
text          0
toxic         0
dtype: int64
Duplicates: 0
Dataset shape: (159292, 3)


In [5]:
# count toxic and non-toxic texts
display(df['toxic'].value_counts())

0    143106
1     16186
Name: toxic, dtype: int64

In [6]:
# look at the class ratio, in percent
df.toxic.value_counts() / df.shape[0] * 100 

0    89.838787
1    10.161213
Name: toxic, dtype: float64

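There are no duplicates or missing values.
Apart from the leftover index column Unnamed: 0, the data has two columns: text and toxic. The text column holds the comment texts. The toxic column is a binary label: 1 for a toxic comment, 0 otherwise. Almost 90% of the comments (89.8%) are non-toxic, so the classes are heavily imbalanced.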
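
With roughly 90% of the texts in the negative class, accuracy is a misleading metric, which is presumably why F1 is used for model selection further down. The sketch below is plain Python on a synthetic label vector with the same 10% positive share (not the real data). It shows that a classifier that always answers "non-toxic" reaches 90% accuracy while its F1 is zero. The helper name `f1_from_labels` is introduced here for illustration only.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic labels: 10 toxic texts out of 100, mimicking the class ratio above.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a "model" that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
f1 = f1_from_labels(y_true, always_negative)
print(accuracy, f1)
```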

In [7]:
# POS-tag a single word and map the tag to a WordNet part of speech
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,               # adjective
                "N": wordnet.NOUN,              # noun
                "V": wordnet.VERB,              # verb
                "R": wordnet.ADV                # adverb
               }  
    return tag_dict.get(tag, wordnet.NOUN)      # fall back to noun

# lemmatize the text
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    text = text.lower()
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    lemm_text = " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in cleared_text.split()])
    return " ".join(lemm_text.split())

df['lemm_text'] = df['text'].progress_apply(lemmatize_text)

df = df.drop(['text'], axis=1)

100%|██████████| 159292/159292 [20:35<00:00, 128.88it/s]
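
The lemmatization pass takes more than 20 minutes. Since `get_wordnet_pos` tags every word in isolation, its result depends only on the word itself, so repeated words could be served from a cache. The sketch below illustrates the idea with `functools.lru_cache`. Here `slow_pos_lookup` is a hypothetical stand-in for the real tagger call, and the call counter exists only to make the saving visible.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive lookup actually runs

def slow_pos_lookup(word):
    """Hypothetical stand-in for nltk.pos_tag([word]); not a real tagger."""
    global calls
    calls += 1
    return "V" if word.endswith("ing") else "N"

@lru_cache(maxsize=None)
def cached_pos(word):
    # The tag depends only on the word, so memoizing it is safe.
    return slow_pos_lookup(word)

tokens = "you keep editing and editing and editing the page".split()
tags = [cached_pos(w) for w in tokens]
print(len(tokens), calls)  # number of tokens vs. number of real lookups
```

In the notebook the same effect would be expected from decorating `get_wordnet_pos` with `lru_cache`. The actual speed-up was not measured here.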


In [9]:
# check the result
display(df.head(10))

Unnamed: 0.1,Unnamed: 0,toxic,lemm_text
0,0,0,explanation why the edits make under my userna...
1,1,0,d aww he match this background colour i m seem...
2,2,0,hey man i m really not try to edit war it s ju...
3,3,0,more i can t make any real suggestion on impro...
4,4,0,you sir be my hero any chance you remember wha...
5,5,0,congratulation from me a well use the tool wel...
6,6,1,cocksucker before you piss around on my work
7,7,0,your vandalism to the matt shirvington article...
8,8,0,sorry if the word nonsense be offensive to you...
9,9,0,alignment on this subject and which be contrar...
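
Fragments such as "d aww" and "i m" in the output come from the cleaning step, which replaces every non-letter character, apostrophes included, with a space before lemmatization. A minimal sketch of just that step, without the lemmatizer, is shown below. The function name `clean_text` is introduced for illustration.

```python
import re

def clean_text(text):
    """Lower-case, replace every non-letter with a space, collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text.lower())
    return " ".join(letters_only.split())

sample = "D'aww! He matches this background colour I'm"
print(clean_text(sample))
```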


In [11]:
# split the data into training and test sets
target = df['toxic']
features = df.drop(['toxic'], axis=1)

features_train, features_test, target_train, target_test = train_test_split(features, target,test_size=.1,random_state=1515)

In [12]:
train_sample = features_train.shape[0] / features.shape[0]
test_sample = features_test.shape[0] / features.shape[0]

print('Training set size: {:.0%}'.format(train_sample))
print('Test set size: {:.0%}'.format(test_sample))

Training set size: 90%
Test set size: 10%
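
The split above is purely random, so the share of toxic texts in the test set can drift slightly from the overall 10%. `train_test_split` accepts a `stratify` argument for this case. The plain-Python sketch below shows what stratification does, using synthetic labels. The function `stratified_split` is a hypothetical helper, not part of scikit-learn.

```python
import random

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 100 + [0] * 900  # synthetic labels with a 10% positive share
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=1515)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```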


In [14]:
# download the NLTK stop-word list and load the English stop words
nltk.download('stopwords') 
# newer scikit-learn versions expect a list here (a set is rejected), so keep the list as is
stop_words = nltk_stopwords.words('english')

count_tf_idf = TfidfVectorizer(stop_words=stop_words)

features_train = count_tf_idf.fit_transform(features_train['lemm_text'])
features_test = count_tf_idf.transform(features_test['lemm_text'])
print(features_train.shape)
print(features_test.shape)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(143362, 142167)
(15930, 142167)
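
Here the vectorizer is fitted once on the whole training set, and the resulting matrix is then reused for cross-validation. Within each fold, the vocabulary and IDF weights have therefore already seen the validation texts. The effect is probably small. A `Pipeline` (imported above but not used) avoids it by refitting the vectorizer on the training part of every fold. A minimal sketch on an invented toy corpus follows. The step names `tfidf` and `clf` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; not taken from the dataset.
texts = [
    "you be a stupid idiot", "thank you for the helpful edit",
    "shut up you moron", "please see the talk page",
    "what a dumb idiot", "i add a source to the article",
    "you stupid moron", "good point i agree with the change",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

# The vectorizer is refitted on the training part of every fold.
scores = cross_val_score(pipeline, texts, labels, scoring="f1", cv=2)
print(scores)
```

With a pipeline, grid-search parameter names take the step prefix, for example `clf__C` instead of `C`.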


## Training

### LogisticRegression

In [15]:
%%time

model_lr = LogisticRegression()
params = [{'solver':['newton-cg', 'lbfgs', 'liblinear'], 'C':[0.1, 1, 10],'class_weight':['balanced']}]

grid_lr = GridSearchCV(model_lr, params, scoring='f1',cv=3)
grid_lr.fit(features_train, target_train)

lr_best_params = grid_lr.best_params_
print(lr_best_params)

{'C': 10, 'class_weight': 'balanced', 'solver': 'liblinear'}
CPU times: user 5min 33s, sys: 6min 12s, total: 11min 45s
Wall time: 11min 46s
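
Every candidate in the grid uses `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts of the full dataset printed earlier. The counts in the training split are not shown in the notebook, so the actual weights would differ slightly. The helper name `balanced_class_weights` is introduced for illustration.

```python
def balanced_class_weights(class_counts):
    """Weights computed as n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts of the full dataset, as printed by value_counts() earlier.
weights = balanced_class_weights({0: 143106, 1: 16186})
print(weights)
```

The toxic class ends up with a weight of roughly 4.92 against roughly 0.56 for the non-toxic class, so an error on a toxic comment costs almost nine times as much during training.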


In [17]:
%%time

model_lr.set_params(**lr_best_params)
model_lr.fit(features_train, target_train)
model_lr_val = cross_val_score(model_lr, features_train, target_train, scoring='f1', cv=3).mean() 
print('Mean cross-validated F1 of LogisticRegression:', model_lr_val)

Mean cross-validated F1 of LogisticRegression: 0.7587517503201725
CPU times: user 52.2 s, sys: 52.6 s, total: 1min 44s
Wall time: 1min 44s


### DecisionTreeClassifier

In [18]:
%%time

model_tree = DecisionTreeClassifier()
params = [{'max_depth':[x for x in range(50,100,10)], 'random_state':[12345], 'class_weight':['balanced']}]

grid_tree = GridSearchCV(model_tree, params, scoring='f1',cv=3)
grid_tree.fit(features_train, target_train)

tree_best_params = grid_tree.best_params_
print(tree_best_params)

{'class_weight': 'balanced', 'max_depth': 90, 'random_state': 12345}
CPU times: user 15min 25s, sys: 2.93 s, total: 15min 27s
Wall time: 15min 28s


In [20]:
%%time

model_tree.set_params(**tree_best_params)
model_tree.fit(features_train, target_train)
model_tree_val = cross_val_score(model_tree, features_train, target_train, scoring='f1', cv=3).mean() 
print('Mean cross-validated F1 of DecisionTreeClassifier:', model_tree_val)

Mean cross-validated F1 of DecisionTreeClassifier: 0.6345772613823034
CPU times: user 5min 4s, sys: 650 ms, total: 5min 4s
Wall time: 5min 5s


### CatBoostClassifier

In [22]:
%%time

model_cat = CatBoostClassifier(verbose=False, iterations=250)
model_cat.fit(features_train, target_train)
model_cat_val = cross_val_score(model_cat, features_train, target_train, scoring='f1', cv=3).mean() 
print('Mean cross-validated F1 of CatBoostClassifier:', model_cat_val)

Mean cross-validated F1 of CatBoostClassifier: 0.7380767377277108
CPU times: user 33min 53s, sys: 21.7 s, total: 34min 15s
Wall time: 34min 20s


In [23]:
# compare the models
index = ['LogisticRegression', 'DecisionTreeClassifier','CatBoostClassifier']
data = {'Mean CV F1': [model_lr_val,  model_tree_val,  model_cat_val]}

models_f1 = pd.DataFrame(data=data, index=index)
print(models_f1)

                        Mean CV F1
LogisticRegression        0.758752
DecisionTreeClassifier    0.634577
CatBoostClassifier        0.738077


The best model is LogisticRegression, with a mean cross-validated F1 of about 0.76.

### Testing the best model

In [24]:
predict = grid_lr.predict(features_test)
print('F1 on the test set =', f1_score(target_test, predict))

F1 on the test set = 0.7595375722543353


## Conclusions

<p>Data preparation showed no duplicates or missing values. Apart from the leftover index column Unnamed: 0, the data has two columns: text and toxic. The text column holds the comment texts; the toxic column is a binary label, 1 for a toxic comment and 0 otherwise. Almost 90% of the comments are non-toxic.
<p>The data was split into a training set (90%) and a test set (10%); validation was done with 3-fold cross-validation on the training set.
<p>Three models were trained and the best one was selected: LogisticRegression, with an F1 of about 0.76 both on cross-validation and on the test set.