<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Выводы</a></span></li></ul></div>

## Подготовка

In [1]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as ltb
from sklearn.metrics import f1_score

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv')
display(df)
df.info()
# загрузили данные

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemma = lemmatizer.lemmatize(text)
    return "".join(lemma).strip()
# функция лемматизации

In [5]:
def clear_text(text):
    sub = re.sub(r'[^a-zA-Z ]', ' ', text)
    lem = sub.split()
    return " ".join(lem)
# функция удаления лишних символов

In [6]:
df['clear_text'] = df['text'].apply(lambda x: lemmatize(clear_text(x)))
# лемматизировали корпус и убрали лишние слова

In [7]:
corpus = df['clear_text'].values
# создали корпус

In [8]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
# загрузили стоп-слова

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
# инициализировали TF-IDF-векторайзер

In [10]:
corpus_train, corpus_test = train_test_split(corpus, test_size=0.25, random_state=12345)
print(corpus_train.shape, corpus_test.shape)

target_train, target_test = train_test_split(df['toxic'], test_size=0.25, random_state=12345)
print(target_train.shape, target_test.shape)
# разделили корпус на выборки

(119678,) (39893,)
(119678,) (39893,)


## Обучение

In [11]:
features_train = count_tf_idf.fit_transform(corpus_train)
features_test = count_tf_idf.transform(corpus_test)
# создали матрицу со значениями TF-IDF по корпусу комментариев

In [12]:
clf = LogisticRegression()
parametrs = {'C': [1, 8, 15]}
grid = GridSearchCV(clf, parametrs, cv=5, scoring='f1')
grid.fit(features_train, target_train)
print(grid.best_params_)
grid.best_score_

{'C': 15}


0.7642714339773453

In [13]:
model = LogisticRegression(C=15)
model.fit(features_train, target_train)
predictions = model.predict(features_test)
f1_score(target_test, predictions)

0.7820790471034109

In [14]:
clf = DecisionTreeClassifier(random_state = 12345)
parametrs = {'max_depth': range (1, 10), 
            'max_features': range(1, 5)}
grid = GridSearchCV(clf, parametrs, cv=5, scoring='f1')
grid.fit(features_train, target_train)
print(grid.best_params_)
grid.best_score_

{'max_depth': 8, 'max_features': 4}


0.0022996436731455927

In [15]:
model = DecisionTreeClassifier(random_state=12345, max_depth=8, max_features=4)
model.fit(features_train, target_train)
predictions = model.predict(features_test) 
f1_score(target_test, predictions)

0.004873294346978557

In [16]:
model = ltb.LGBMClassifier(random_state = 12345)
model.fit(features_train, target_train)
predictions = model.predict(features_test)
f1_score(target_test, predictions)
#обучили модель без поиска по сетке

0.7461362335432169

In [17]:
clf = RandomForestClassifier(random_state = 12345)
parametrs = { 'n_estimators': range (10, 51, 10),
              'max_depth': range (1,13, 2) }
grid = GridSearchCV(clf, parametrs, cv=5, scoring='f1')
grid.fit(features_train, target_train)
grid.best_params_

{'max_depth': 11, 'n_estimators': 10}

In [18]:
model = RandomForestClassifier(random_state=12345, n_estimators=10, max_depth=11) 
model.fit(features_train, target_train) 
predictions = model.predict(features_test)
f1_score(target_test, predictions)

0.0

Обучили разные модели и получили величины метрики f1

## Выводы

В результате модель для классификации комментариев подобрана - логистическая регрессия с параметром С=15. F1-мера на тестовой выборке = 0,78. Нужный результат достигнут. 