## Online wiki store
In the online store's wiki, users can add and edit product descriptions, suggest their own edits, and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Objective: train a model to classify comments as toxic or non-toxic. We have a dataset in which each comment is labelled for toxicity.

The model must reach an F1 score of at least 0.75.
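
For reference, F1 is the harmonic mean of precision and recall for the positive (here, toxic) class. The sketch below computes it from scratch on made-up labels; the function name `f1_from_labels` and the sample data are illustrative and not part of the project.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two equal-length label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is reported as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = toxic, 0 = not toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_from_labels(y_true, y_pred))
```

Here two of the three toxic comments are found and one of the three flagged comments is a false alarm, so precision and recall are both 2/3 and so is F1.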

## Data overview and preprocessing

In [1]:
import pandas as pd
import re 
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from tqdm.notebook import tqdm
import spacy


STATE = 1337

In [2]:
data = pd.read_csv('D:\\jupyter_clone\\yandex-practicum\\18. Machine learning for texts\\toxic_comments.csv')
data.info()
data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


The data contains three columns: an index-like column that carries no information (Unnamed: 0), the comment text (text) and the target attribute (toxic).
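
The uninformative column looks like a row index that was saved into the CSV. A minimal sketch of two common ways to get rid of it, using a tiny in-memory CSV as a stand-in for `toxic_comments.csv` (the sample rows are made up):

```python
import io

import pandas as pd

# Stand-in for toxic_comments.csv: the first, unnamed column is a saved row index
csv_text = ",text,toxic\n0,first comment,0\n1,second comment,1\n"

# Option 1: treat the unnamed first column as the index while reading
by_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read everything, then drop the column by the name pandas gives it
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

print(list(by_index.columns))
print(list(dropped.columns))
```

The notebook itself simply ignores the column, which works just as well because only `text` and `toxic` are used afterwards.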

In [3]:
data.toxic.value_counts()

toxic
0    143106
1     16186
Name: count, dtype: int64

The classes are imbalanced: only about 10% of the comments are toxic. We will correct this by upsampling when preparing the training sample, although it could also be handled at training time for models that support class weights.
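
As a sketch of the in-model alternative: many estimators accept per-class weights, and scikit-learn's `class_weight='balanced'` option computes them as `n_samples / (n_classes * class_count)`. The snippet below applies that formula to the class counts printed above; the variable names are illustrative.

```python
# Class counts from data.toxic.value_counts() above
class_counts = {0: 143106, 1: 16186}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" weighting: rarer classes get proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights each class contributes the same total weight to the loss, so no rows need to be duplicated. A dictionary like this could be passed as `class_weight=class_weights` to `LogisticRegression` or `RandomForestClassifier`; this notebook uses upsampling instead.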

In [4]:
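def upsample(features, target):
    """Balance classes by repeating the toxic (target == 1) rows."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    # Repeat the minority class a whole number of times, so the balance is approximate
    repeat = int(round(len(features_zeros) / len(features_ones)))
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    # Rows are not shuffled: all zeros come first, then the repeated ones
    return features_upsampled, target_upsampled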

Let’s split off the training set and balance only it, leaving the validation and test data untouched.

In [5]:
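features = data.text
target = data.toxic
# Hold out 2/5 of the data; `features` and `target` now refer to that held-out part
features_train, features, target_train, target = train_test_split(
    features, target, test_size=2/5, random_state=STATE)
# Upsample the training part only
features_train, target_train = upsample(features_train, target_train)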

In [6]:
target_train.value_counts()

toxic
1    87498
0    85853
Name: count, dtype: int64
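
The counts above can be checked by hand. The sketch below assumes that the 2/5 split holds out 63,717 of the 159,292 rows, which is the size reported for the held-out data later in this notebook; the variable names are illustrative.

```python
total_rows = 159292     # from data.info()
held_out_rows = 63717   # validation + test pool (2/5 of the data)
train_zeros = 85853     # non-toxic training rows, unchanged by upsampling

train_rows = total_rows - held_out_rows
train_ones = train_rows - train_zeros

# Same rule as upsample(): repeat the minority class a whole number of times
repeat = int(round(train_zeros / train_ones))
upsampled_ones = train_ones * repeat

print(train_ones, repeat, upsampled_ones, train_zeros + upsampled_ones)
```

Because `repeat` is rounded to a whole number, the classes end up close to balanced rather than exactly equal, which is why the toxic class is now slightly larger than the non-toxic one.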

Now we'll lemmatize and clean the text, then split the held-out data into validation and test samples.

In [7]:
def lemma(text):
    """Lemmatize a comment with spaCy, then keep only Latin letters and single spaces."""
    text = nlp(text)
    text = " ".join([token.lemma_ for token in text])
    text = ' '.join(re.sub(r'[^a-zA-Z ]', ' ', text).split())
    return text

nlp = spacy.load('en_core_web_sm')
tqdm.pandas(desc='Lemmatizing and cleaning features_train')
features_train = features_train.progress_apply(lemma)
tqdm.pandas(desc='Lemmatizing and cleaning features')
features = features.progress_apply(lemma)

features_test, features_valid, target_test, target_valid = train_test_split(features, target, 
                                                                            test_size=1/2, random_state=STATE)

Lemmatizing and cleaning features_train:   0%|          | 0/173351 [00:00<?, ?it/s]

Lemmatizing and cleaning features:   0%|          | 0/63717 [00:00<?, ?it/s]
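
To make the cleaning step concrete, here is the regular-expression part of `lemma` on its own, without the spaCy lemmatization. The helper name `clean` and the sample string are illustrative.

```python
import re


def clean(text):
    """Replace everything except ASCII letters and spaces, then collapse whitespace."""
    return " ".join(re.sub(r"[^a-zA-Z ]", " ", text).split())


sample = "D'aww!  He matches\nthis background colour, 100%."
print(clean(sample))
```

Digits, punctuation, line breaks and non-Latin letters are all dropped at this stage, while letter case is left alone by the regular expression; `TfidfVectorizer` lowercases the text by default later on.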

<div class="alert alert-warning">
<b>Comment</b>

Here I was, waiting for hours for lemmatization to complete, only to realize 2/3 of the way through that I could've carried out upsampling AFTER lemmatization. Smart, very smart.
</div>
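
A minimal sketch of the reordering described in the comment: lemmatize each original training comment once, then duplicate the already-processed rows. `slow_lemma` is a hypothetical stand-in for the spaCy-based `lemma` function, and the tiny lists stand in for the real training data.

```python
calls = 0


def slow_lemma(text):
    """Hypothetical stand-in for the expensive spaCy-based lemma() function."""
    global calls
    calls += 1
    return text.lower()


# Toy training set: 1 = toxic, 0 = not toxic
texts = ["Fine edit", "Nice work", "Good point", "YOU IDIOT"]
labels = [0, 0, 0, 1]

# Step 1: lemmatize every original comment exactly once
lemmatized = [slow_lemma(t) for t in texts]

# Step 2: upsample the already-lemmatized minority rows
zeros = [t for t, y in zip(lemmatized, labels) if y == 0]
ones = [t for t, y in zip(lemmatized, labels) if y == 1]
repeat = int(round(len(zeros) / len(ones)))
features_upsampled = zeros + ones * repeat
labels_upsampled = [0] * len(zeros) + [1] * len(ones) * repeat

print(calls, len(features_upsampled))
```

On the real data this order would mean lemmatizing the 95,575 original training comments (159,292 minus the 63,717 held out) instead of the 173,351 upsampled rows.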

In [13]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=list(stopwords))
features_train = vectorizer.fit_transform(features_train)
features_valid = vectorizer.transform(features_valid)
features_test = vectorizer.transform(features_test)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bkilf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
features_train.shape

(173351, 115525)

## Model creation and training
### LogisticRegression

In [15]:
model = LogisticRegression(verbose=1, max_iter=200)
model.fit(features_train, target_train)
predict_lr = model.predict(features_valid)
predict_lr_test = model.predict(features_test)

In [16]:
print('LogisticRegression f1 score')
round(f1_score(target_valid,predict_lr),2)

LogisticRegression f1 score


0.75

### RandomForestClassifier

In [17]:
model = RandomForestClassifier(verbose=2,n_estimators=40, random_state=STATE)
model.fit(features_train, target_train)
predict_rf = model.predict(features_valid)
predict_rf_test = model.predict(features_test)

building tree 1 of 40
building tree 2 of 40
building tree 3 of 40
building tree 4 of 40
building tree 5 of 40
building tree 6 of 40
building tree 7 of 40
building tree 8 of 40
building tree 9 of 40
building tree 10 of 40
building tree 11 of 40
building tree 12 of 40
building tree 13 of 40
building tree 14 of 40
building tree 15 of 40
building tree 16 of 40
building tree 17 of 40
building tree 18 of 40
building tree 19 of 40
building tree 20 of 40
building tree 21 of 40
building tree 22 of 40
building tree 23 of 40
building tree 24 of 40
building tree 25 of 40
building tree 26 of 40
building tree 27 of 40
building tree 28 of 40
building tree 29 of 40
building tree 30 of 40
building tree 31 of 40
building tree 32 of 40
building tree 33 of 40
building tree 34 of 40
building tree 35 of 40
building tree 36 of 40
building tree 37 of 40
building tree 38 of 40
building tree 39 of 40
building tree 40 of 40


[Parallel(n_jobs=1)]: Done  40 tasks      | elapsed: 10.4min
[Parallel(n_jobs=1)]: Done  40 tasks      | elapsed:    1.3s
[Parallel(n_jobs=1)]: Done  40 tasks      | elapsed:    1.2s


In [18]:
print('RandomForestClassifier f1 score')
round(f1_score(target_valid,predict_rf),2)

RandomForestClassifier f1 score


0.66

### CatBoostClassifier

In [44]:
model = CatBoostClassifier()
model.fit(features_train, target_train)
predict_cb_1 = model.predict(features_valid)
predict_cb_test_1 = model.predict(features_test)

Learning rate set to 0.093101
0:	learn: 0.6479185	total: 795ms	remaining: 13m 13s
1:	learn: 0.6193654	total: 1.52s	remaining: 12m 40s
2:	learn: 0.5944593	total: 2.26s	remaining: 12m 30s
3:	learn: 0.5782119	total: 2.98s	remaining: 12m 21s
4:	learn: 0.5638264	total: 3.7s	remaining: 12m 15s
5:	learn: 0.5541246	total: 4.41s	remaining: 12m 11s
6:	learn: 0.5433112	total: 5.14s	remaining: 12m 9s
7:	learn: 0.5356645	total: 5.88s	remaining: 12m 8s
8:	learn: 0.5281099	total: 6.65s	remaining: 12m 12s
9:	learn: 0.5218729	total: 7.41s	remaining: 12m 13s
10:	learn: 0.5153962	total: 8.16s	remaining: 12m 13s
11:	learn: 0.5097012	total: 8.91s	remaining: 12m 13s
12:	learn: 0.5049186	total: 9.64s	remaining: 12m 11s
13:	learn: 0.4988983	total: 10.4s	remaining: 12m 13s
14:	learn: 0.4931300	total: 11.2s	remaining: 12m 13s
15:	learn: 0.4888924	total: 11.9s	remaining: 12m 12s
16:	learn: 0.4855286	total: 12.6s	remaining: 12m 10s
17:	learn: 0.4823756	total: 13.4s	remaining: 12m 8s
18:	learn: 0.4772713	total: 14

154:	learn: 0.3183707	total: 1m 50s	remaining: 10m 1s
155:	learn: 0.3179716	total: 1m 51s	remaining: 10m 1s
156:	learn: 0.3175455	total: 1m 51s	remaining: 10m
157:	learn: 0.3170567	total: 1m 52s	remaining: 9m 59s
158:	learn: 0.3165453	total: 1m 53s	remaining: 9m 58s
159:	learn: 0.3159927	total: 1m 53s	remaining: 9m 58s
160:	learn: 0.3156294	total: 1m 54s	remaining: 9m 57s
161:	learn: 0.3152231	total: 1m 55s	remaining: 9m 56s
162:	learn: 0.3147874	total: 1m 56s	remaining: 9m 55s
163:	learn: 0.3143485	total: 1m 56s	remaining: 9m 55s
164:	learn: 0.3137921	total: 1m 57s	remaining: 9m 54s
165:	learn: 0.3133866	total: 1m 58s	remaining: 9m 53s
166:	learn: 0.3129857	total: 1m 58s	remaining: 9m 52s
167:	learn: 0.3125506	total: 1m 59s	remaining: 9m 52s
168:	learn: 0.3121329	total: 2m	remaining: 9m 51s
169:	learn: 0.3117475	total: 2m	remaining: 9m 50s
170:	learn: 0.3113724	total: 2m 1s	remaining: 9m 49s
171:	learn: 0.3109181	total: 2m 2s	remaining: 9m 48s
172:	learn: 0.3103431	total: 2m 3s	remain

307:	learn: 0.2625788	total: 3m 37s	remaining: 8m 8s
308:	learn: 0.2623481	total: 3m 37s	remaining: 8m 7s
309:	learn: 0.2621669	total: 3m 38s	remaining: 8m 6s
310:	learn: 0.2619240	total: 3m 39s	remaining: 8m 5s
311:	learn: 0.2617408	total: 3m 40s	remaining: 8m 5s
312:	learn: 0.2615217	total: 3m 40s	remaining: 8m 4s
313:	learn: 0.2612789	total: 3m 41s	remaining: 8m 3s
314:	learn: 0.2610171	total: 3m 42s	remaining: 8m 2s
315:	learn: 0.2605549	total: 3m 42s	remaining: 8m 2s
316:	learn: 0.2603349	total: 3m 43s	remaining: 8m 1s
317:	learn: 0.2600698	total: 3m 44s	remaining: 8m
318:	learn: 0.2598877	total: 3m 44s	remaining: 7m 59s
319:	learn: 0.2596552	total: 3m 45s	remaining: 7m 59s
320:	learn: 0.2593843	total: 3m 46s	remaining: 7m 58s
321:	learn: 0.2591873	total: 3m 46s	remaining: 7m 57s
322:	learn: 0.2589416	total: 3m 47s	remaining: 7m 56s
323:	learn: 0.2586988	total: 3m 48s	remaining: 7m 56s
324:	learn: 0.2583983	total: 3m 48s	remaining: 7m 55s
325:	learn: 0.2582448	total: 3m 49s	remain

460:	learn: 0.2308259	total: 5m 21s	remaining: 6m 16s
461:	learn: 0.2306297	total: 5m 22s	remaining: 6m 15s
462:	learn: 0.2303963	total: 5m 23s	remaining: 6m 14s
463:	learn: 0.2302121	total: 5m 23s	remaining: 6m 14s
464:	learn: 0.2301087	total: 5m 24s	remaining: 6m 13s
465:	learn: 0.2299721	total: 5m 25s	remaining: 6m 12s
466:	learn: 0.2298045	total: 5m 26s	remaining: 6m 12s
467:	learn: 0.2296618	total: 5m 26s	remaining: 6m 11s
468:	learn: 0.2294430	total: 5m 27s	remaining: 6m 10s
469:	learn: 0.2292590	total: 5m 28s	remaining: 6m 9s
470:	learn: 0.2289640	total: 5m 28s	remaining: 6m 9s
471:	learn: 0.2287618	total: 5m 29s	remaining: 6m 8s
472:	learn: 0.2285443	total: 5m 30s	remaining: 6m 7s
473:	learn: 0.2283552	total: 5m 30s	remaining: 6m 7s
474:	learn: 0.2281917	total: 5m 31s	remaining: 6m 6s
475:	learn: 0.2277238	total: 5m 32s	remaining: 6m 5s
476:	learn: 0.2275520	total: 5m 32s	remaining: 6m 4s
477:	learn: 0.2274067	total: 5m 33s	remaining: 6m 4s
478:	learn: 0.2272314	total: 5m 34s	r

614:	learn: 0.2070923	total: 7m 6s	remaining: 4m 27s
615:	learn: 0.2070208	total: 7m 7s	remaining: 4m 26s
616:	learn: 0.2068618	total: 7m 8s	remaining: 4m 25s
617:	learn: 0.2067068	total: 7m 9s	remaining: 4m 25s
618:	learn: 0.2065747	total: 7m 9s	remaining: 4m 24s
619:	learn: 0.2064703	total: 7m 10s	remaining: 4m 23s
620:	learn: 0.2064004	total: 7m 11s	remaining: 4m 23s
621:	learn: 0.2062046	total: 7m 11s	remaining: 4m 22s
622:	learn: 0.2061340	total: 7m 12s	remaining: 4m 21s
623:	learn: 0.2060666	total: 7m 13s	remaining: 4m 20s
624:	learn: 0.2058806	total: 7m 13s	remaining: 4m 20s
625:	learn: 0.2057987	total: 7m 14s	remaining: 4m 19s
626:	learn: 0.2056750	total: 7m 15s	remaining: 4m 18s
627:	learn: 0.2055943	total: 7m 15s	remaining: 4m 18s
628:	learn: 0.2055257	total: 7m 16s	remaining: 4m 17s
629:	learn: 0.2054592	total: 7m 17s	remaining: 4m 16s
630:	learn: 0.2053566	total: 7m 17s	remaining: 4m 16s
631:	learn: 0.2052029	total: 7m 18s	remaining: 4m 15s
632:	learn: 0.2051158	total: 7m 1

767:	learn: 0.1896855	total: 8m 51s	remaining: 2m 40s
768:	learn: 0.1895385	total: 8m 51s	remaining: 2m 39s
769:	learn: 0.1894043	total: 8m 52s	remaining: 2m 39s
770:	learn: 0.1893317	total: 8m 53s	remaining: 2m 38s
771:	learn: 0.1892340	total: 8m 53s	remaining: 2m 37s
772:	learn: 0.1891776	total: 8m 54s	remaining: 2m 36s
773:	learn: 0.1890604	total: 8m 55s	remaining: 2m 36s
774:	learn: 0.1890025	total: 8m 55s	remaining: 2m 35s
775:	learn: 0.1889006	total: 8m 56s	remaining: 2m 34s
776:	learn: 0.1888159	total: 8m 57s	remaining: 2m 34s
777:	learn: 0.1886911	total: 8m 57s	remaining: 2m 33s
778:	learn: 0.1885543	total: 8m 58s	remaining: 2m 32s
779:	learn: 0.1884996	total: 8m 59s	remaining: 2m 32s
780:	learn: 0.1883470	total: 9m	remaining: 2m 31s
781:	learn: 0.1882609	total: 9m	remaining: 2m 30s
782:	learn: 0.1881308	total: 9m 1s	remaining: 2m 30s
783:	learn: 0.1879384	total: 9m 2s	remaining: 2m 29s
784:	learn: 0.1878508	total: 9m 2s	remaining: 2m 28s
785:	learn: 0.1877262	total: 9m 3s	rema

920:	learn: 0.1746040	total: 10m 34s	remaining: 54.5s
921:	learn: 0.1744800	total: 10m 35s	remaining: 53.8s
922:	learn: 0.1743956	total: 10m 36s	remaining: 53.1s
923:	learn: 0.1742984	total: 10m 37s	remaining: 52.4s
924:	learn: 0.1741589	total: 10m 37s	remaining: 51.7s
925:	learn: 0.1740697	total: 10m 38s	remaining: 51s
926:	learn: 0.1739753	total: 10m 39s	remaining: 50.3s
927:	learn: 0.1739300	total: 10m 39s	remaining: 49.6s
928:	learn: 0.1738319	total: 10m 40s	remaining: 49s
929:	learn: 0.1737819	total: 10m 41s	remaining: 48.3s
930:	learn: 0.1736939	total: 10m 41s	remaining: 47.6s
931:	learn: 0.1735977	total: 10m 42s	remaining: 46.9s
932:	learn: 0.1735500	total: 10m 43s	remaining: 46.2s
933:	learn: 0.1733841	total: 10m 43s	remaining: 45.5s
934:	learn: 0.1733017	total: 10m 44s	remaining: 44.8s
935:	learn: 0.1732064	total: 10m 45s	remaining: 44.1s
936:	learn: 0.1731558	total: 10m 45s	remaining: 43.4s
937:	learn: 0.1731062	total: 10m 46s	remaining: 42.7s
938:	learn: 0.1730066	total: 10m

In [45]:
print('CatBoostClassifier f1 score')
round(f1_score(target_valid,predict_cb_1),2)

CatBoostClassifier f1 score


0.76

### Best model evaluation

The best model turned out to be CatBoostClassifier, so let’s evaluate it on the test sample.

In [47]:
round(f1_score(target_test,predict_cb_test_1),2)

0.76

The test F1 score matches the validation score (0.76) and is above the required 0.75.
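
The selection logic can be written down explicitly. The sketch below uses the validation scores reported above and the 0.75 requirement from the task statement; the dictionary and variable names are illustrative.

```python
# Validation F1 scores reported in this notebook
valid_f1 = {
    "LogisticRegression": 0.75,
    "RandomForestClassifier": 0.66,
    "CatBoostClassifier": 0.76,
}
required_f1 = 0.75

# Pick the model with the highest validation score
best_model = max(valid_f1, key=valid_f1.get)
meets_requirement = valid_f1[best_model] >= required_f1

print(best_model, valid_f1[best_model], meets_requirement)
```

Choosing the model on the validation sample and scoring only the chosen model on the test sample keeps the test estimate from being influenced by the selection itself.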

In [42]:
from sklearn.dummy import DummyClassifier
# Constant baseline; it ignores the features, so the held-out text can be passed as is
model = DummyClassifier()
model.fit(features, target)
predict_d = model.predict(features)
print('DummyClassifier f1 score')
round(f1_score(target,predict_d),2)

DummyClassifier f1 score


0.0

The sanity check is passed: a constant baseline scores an F1 of 0.0, far below the trained models.

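
The 0.0 is expected rather than suspicious: with its default strategy `DummyClassifier` always predicts the most frequent class, so it never predicts toxic and has no true positives. A slightly more demanding baseline is to label every comment toxic, which gives a recall of 1 and a precision equal to the share of toxic comments. The sketch below works that out from the class counts of the full dataset; the variable names are illustrative.

```python
# Class counts of the full dataset, from data.toxic.value_counts()
non_toxic, toxic = 143106, 16186

# "Always toxic" baseline: every toxic comment is found, but most flags are wrong
precision = toxic / (toxic + non_toxic)
recall = 1.0
f1_always_toxic = 2 * precision * recall / (precision + recall)

print(round(f1_always_toxic, 2))
```

This baseline comes out at roughly 0.18, so all three trained models are well above both constant baselines.
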
## General conclusion
Models were trained to detect toxic comments. The workflow included splitting the data into training, validation and test samples and balancing the classes in the training sample by upsampling. The spaCy library was used for lemmatization. Three models were compared with mostly default hyperparameters: LogisticRegression, RandomForestClassifier and CatBoostClassifier. CatBoostClassifier scored highest, with an F1 of 0.76 on both the validation and test samples, narrowly ahead of LogisticRegression (0.75) and well ahead of RandomForestClassifier (0.66).

These results point to CatBoostClassifier as the model to use for flagging toxic comments: it meets the required F1 of at least 0.75.