# Comment analysis

An online store is launching a new service: users can now edit and supplement product descriptions, as in wiki communities. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative (that is, non-toxic or toxic). You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train different models.
3. Draw conclusions.


## Loading and processing data 

In [1]:
import re
import warnings

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, recall_score, precision_score
from sklearn.model_selection import train_test_split

# Suppress library warnings to keep the notebook output readable
warnings.simplefilter(action='ignore', category=Warning)


In [2]:
nltk.download('stopwords')
nltk.download('wordnet')  # WordNet data is needed by WordNetLemmatizer below
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pavel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Initial analysis

In [3]:
df = pd.read_csv('datasets/toxic_comments.csv')

In [4]:
df.query('toxic==1').head(20)

Unnamed: 0,text,toxic
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
12,Hey... what is it..\n@ | talk .\nWhat is it......,1
16,"Bye! \n\nDon't look, come or think of comming ...",1
42,You are gay or antisemmitian? \n\nArchangel WH...,1
43,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",1
44,I'm Sorry \n\nI'm sorry I screwed around with ...,1
51,GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK T...,1
55,Stupid peace of shit stop deleting my stuff as...,1
56,=Tony Sidaway is obviously a fistfuckee. He lo...,1
58,My Band Page's deletion. You thought I was gon...,1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [6]:
df.describe()

Unnamed: 0,toxic
count,159571.0
mean,0.101679
std,0.302226
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


### Processing data

In [7]:
def clear_text(text):
    x = re.sub(r'[^a-zA-Z ]', ' ', text) 
    return " ".join(x.split())
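
To make the effect of `clear_text` concrete, here is a minimal sketch that applies the same function to a shortened comment from the sample above. Note that apostrophes are replaced too, so a contraction such as "Don't" becomes "Don t", which is why fragments like "I m" appear in the cleaned data.

```python
import re

def clear_text(text):
    # Replace every character except Latin letters and spaces with a space
    x = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(x.split())

# Shortened version of one of the comments shown earlier
sample = "Bye! \n\nDon't look, come or think of comming"
cleaned = clear_text(sample)
print(cleaned)
```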

In [8]:
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)])

In [9]:
df['text'] = df['text'].apply(clear_text)

In [10]:
df['text_lemmatized'] = df['text'].apply(lemmatize_text)

In [11]:
df

Unnamed: 0,text,toxic,text_lemmatized
0,Explanation Why the edits made under my userna...,0,Explanation Why the edits made under my userna...
1,D aww He matches this background colour I m se...,0,D aww He match this background colour I m seem...
2,Hey man I m really not trying to edit war It s...,0,Hey man I m really not trying to edit war It s...
3,More I can t make any real suggestions on impr...,0,More I can t make any real suggestion on impro...
4,You sir are my hero Any chance you remember wh...,0,You sir are my hero Any chance you remember wh...
...,...,...,...
159566,And for the second time of asking when your vi...,0,And for the second time of asking when your vi...
159567,You should be ashamed of yourself That is a ho...,0,You should be ashamed of yourself That is a ho...
159568,Spitzer Umm theres no actual article for prost...,0,Spitzer Umm there no actual article for prosti...
159569,And it looks like it was actually you who put ...,0,And it look like it wa actually you who put on...


In [12]:
stop_words = set(nltk_stopwords.words('english'))

In [13]:
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))  # recent scikit-learn versions expect a list, not a set

In [14]:
df['Uppercase'] = df['text_lemmatized'].str.count(r'[A-Z]')
df['Lowercase'] = df['text_lemmatized'].str.count(r'[a-z]')
df['share_of_capital'] = df['Uppercase']/(df['Lowercase']+df['Uppercase'])

In [15]:
df = df.drop(['text', 'Uppercase', 'Lowercase'], axis=1)

In [16]:
df

Unnamed: 0,toxic,text_lemmatized,share_of_capital
0,0,Explanation Why the edits made under my userna...,0.084158
1,0,D aww He match this background colour I m seem...,0.112676
2,0,Hey man I m really not trying to edit war It s...,0.021505
3,0,More I can t make any real suggestion on impro...,0.023158
4,0,You sir are my hero Any chance you remember wh...,0.040000
...,...,...,...
159566,0,And for the second time of asking when your vi...,0.008929
159567,0,You should be ashamed of yourself That is a ho...,0.030303
159568,0,Spitzer Umm there no actual article for prosti...,0.064516
159569,0,And it look like it wa actually you who put on...,0.022472


**Conclusions**

1. The data has been loaded.
2. The share of uppercase characters may help identify toxic comments.
3. Toxic comments make up only about 10% of the data, which may be too few for the models.
4. Lemmatization seems to be less useful here than it is for Russian text.
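
The uppercase-share feature mentioned in point 2 can also be written as a plain function, which makes the edge case of a comment without any letters explicit. This is a minimal sketch: the example strings are made up for illustration, and returning `0.0` for letter-free text is an assumption (the pandas division used above would produce `NaN` instead).

```python
def share_of_capital(text):
    """Return the fraction of ASCII letters in `text` that are uppercase."""
    upper = sum(1 for ch in text if 'A' <= ch <= 'Z')
    lower = sum(1 for ch in text if 'a' <= ch <= 'z')
    total = upper + lower
    # Assumption: treat text without letters as having no capitals
    return upper / total if total else 0.0

# Hypothetical examples, not taken from the dataset
examples = ["STOP DELETING MY STUFF", "Thanks for the helpful edit", "12345 !!!"]
shares = [share_of_capital(t) for t in examples]
print(shares)
```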

## Training

In [17]:
train, test = train_test_split(df, test_size = 0.2, random_state=12)

In [18]:
corpus = train['text_lemmatized'].values.astype('U')

In [19]:
tf_idf = count_tf_idf.fit_transform(corpus)

In [20]:
target_train = train['toxic']

In [21]:
corpus_test = test['text_lemmatized'].values.astype('U')
tf_idf_test = count_tf_idf.transform(corpus_test)
target_test = test['toxic']
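
The cells above fit the vectorizer on the training corpus only and then reuse it to transform the test corpus, which keeps test vocabulary out of the IDF statistics. One way to make that coupling harder to break is scikit-learn's `make_pipeline`. The sketch below is an alternative arrangement, not the approach used in this notebook; the toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real train/test split
train_texts = [
    "you are a stupid idiot",
    "thanks for the helpful edit",
    "shut up you moron",
    "please see the talk page for sources",
]
train_labels = [1, 0, 1, 0]
test_texts = ["what a stupid edit", "thanks for adding sources"]

# The pipeline fits TF-IDF on the training texts only,
# then applies the same fitted vocabulary when predicting
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
print(predictions)
```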

In [22]:
model = LogisticRegression()
model.fit(tf_idf, target_train)
predict_train = model.predict(tf_idf)
print(model.score(tf_idf, target_train))

0.9592811932067431


In [23]:
predict_test = model.predict(tf_idf_test)
print(model.score(tf_idf_test, target_test))
print(f1_score(target_test, predict_test))
print(confusion_matrix(target_test, predict_test))

0.9562901456995143
0.7463175122749591
[[28468   122]
 [ 1273  2052]]


In [24]:
test.groupby('toxic')['toxic'].count()

toxic
0    28590
1     3325
Name: toxic, dtype: int64

The model missed about 38% of toxic comments (1273 of 3325).
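
The share of missed toxic comments follows directly from the confusion matrix printed above, as do recall, precision, and *F1*. A short sketch of that arithmetic, using the layout scikit-learn uses for `confusion_matrix` (rows are true classes, columns are predicted classes):

```python
# Confusion matrix of the unweighted logistic regression on the test set,
# laid out as [[TN, FP], [FN, TP]]
tn, fp = 28468, 122
fn, tp = 1273, 2052

recall = tp / (tp + fn)           # share of toxic comments that were caught
miss_rate = fn / (tp + fn)        # share of toxic comments that were missed
precision = tp / (tp + fp)        # share of flagged comments that are toxic
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall

print(f"recall={recall:.3f} missed={miss_rate:.1%} "
      f"precision={precision:.3f} f1={f1:.4f}")
```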

In [25]:
model_l = LogisticRegression(class_weight='balanced')
model_l.fit(tf_idf, target_train)
predict_train = model_l.predict(tf_idf)
print(model_l.score(tf_idf, target_train))
predict_test = model_l.predict(tf_idf_test)
print(model_l.score(tf_idf_test, target_test))
print(f1_score(target_test, predict_test))
print(confusion_matrix(target_test, predict_test))

0.9608479037413048
0.9429735234215886
0.7578499201703034
[[27247  1343]
 [  477  2848]]


This model produces more false positives, but it reaches the required *F1* value.
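
The shift toward more flagged comments comes from how `class_weight='balanced'` reweights the classes. The scikit-learn documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on a toy label list with a 9:1 imbalance, chosen here only to resemble the roughly 10% toxic share; it is not the actual training set.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Compute weights as n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: nine non-toxic comments for every toxic one
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With such weights, an error on a toxic comment costs the model several times more than an error on a normal one, so it trades precision for recall.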

In [26]:
model_f = RandomForestClassifier(class_weight='balanced')
model_f.fit(tf_idf, target_train)
predict_train = model_f.predict(tf_idf)
print(model_f.score(tf_idf, target_train))
predict_test = model_f.predict(tf_idf_test)
print(model_f.score(tf_idf_test, target_test))
print(f1_score(target_test, predict_test))
print(confusion_matrix(target_test, predict_test))

0.9996553236823964
0.942754190819364
0.6343806283770262
[[28503    87]
 [ 1740  1585]]


**Results**

1. Three models were built.
2. Two of them are logistic regressions; the second one uses balanced class weights.
3. The random forest model gives a low *F1*. A likely reason is that its hyperparameters were left at their defaults. With this many features, tuning a good forest takes too long.

## Final model

In [27]:
model_l = LogisticRegression(class_weight='balanced')
model_l.fit(tf_idf, target_train)
predict_train = model_l.predict(tf_idf)
print("Score on train:", model_l.score(tf_idf, target_train))
predict_test = model_l.predict(tf_idf_test)
print("Score on test:", model_l.score(tf_idf_test, target_test))
print("F1:", f1_score(target_test, predict_test))
print(confusion_matrix(target_test, predict_test))

Score on train: 0.9608400701886319
Score on test: 0.9430048566504778
F1: 0.7579507651363938
[[27248  1342]
 [  477  2848]]


In [28]:
print("Overall score:",accuracy_score(target_test, predict_test))
print("Recall:",recall_score(target_test, predict_test))
print("Precision:",precision_score(target_test, predict_test))
print("f1 score:",f1_score(target_test, predict_test))


Overall score: 0.9430048566504778
Recall: 0.8565413533834586
Precision: 0.6797136038186158
f1 score: 0.7579507651363938


Examples of toxic comments that the model did not catch:

In [29]:
test = test.assign(pred=predict_test)  # assign returns a new frame and avoids SettingWithCopyWarning
test.query('toxic == 1 and pred == 0').head(20)

Unnamed: 0,toxic,text_lemmatized,share_of_capital,pred
69591,1,fukk it im goin encyclopedia dramatica,0.0,0
21861,1,Bullshit Why doe Amy Roloff get her own entry ...,0.065789,0
5020,1,give a on them I inserted,0.05,0
144387,1,oo if you have ANY actual evidence show it You...,0.047945,0
28987,1,Al Qaedia is after you,0.111111,0
30237,1,you re so idi o t so st u p id guy fu c k i n ...,0.0,0
129934,1,Your ongoing effort to vandalize the popper in...,0.023256,0
61143,1,Don t be an as It wa a clear edit conflict And...,0.12069,0
45404,1,Secular Humanism Islam s best friend I know yo...,0.11399,0
55053,1,Your work will be deleted or bastardized wheth...,0.005263,0


## Conclusions

1. The resulting model sends about 86% of toxic comments for moderation (recall 0.857).
2. At the same time, it flags many normal comments: about 1 in 3 comments that the model marks as toxic is a false positive (precision 0.68).
3. Hypothesis: to improve quality, we can additionally look at:
* the proportion of uppercase letters
* a dictionary check for curse words
* the number of question and exclamation marks
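
The punctuation and dictionary ideas from the list above could be prototyped as follows. Because `clear_text` removes all punctuation, these features would have to be computed on the raw comment before cleaning. This is a rough sketch only: `CURSE_WORDS` is a hypothetical placeholder, and `extra_features` is an illustrative helper that does not exist in the notebook.

```python
import re

# Hypothetical placeholder list; a real check would use a curated dictionary
CURSE_WORDS = {"idiot", "stupid"}

def extra_features(raw_text):
    """Compute simple surface features from the raw (uncleaned) comment."""
    words = re.findall(r"[a-z]+", raw_text.lower())
    return {
        "exclamations": raw_text.count("!"),
        "questions": raw_text.count("?"),
        "curse_hits": sum(1 for w in words if w in CURSE_WORDS),
    }

# Made-up example comment
features = extra_features("Are you STUPID?! Stop it, idiot!!")
print(features)
```

Such columns could then be stacked next to the TF-IDF matrix. Whether they actually raise *F1* would need to be checked on the data.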