## Step 1. Data preparation

Loading the packages

In [2]:
!python -m spacy download en_core_web_sm
import spacy
import string
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import tqdm
from sklearn.model_selection import GridSearchCV
import nltk
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression as lr

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Loading the dataset into the `data` variable.

In [2]:
data = pd.read_csv("toxic_comments.csv")

# show the full comment text instead of truncating long cells
pd.set_option('display.max_colwidth', None)
data.head()



Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


The shape of the dataset is:

In [3]:
data.shape

(159571, 2)

As expected, we have two columns: the raw text of the comments and the target, an indicator of whether the comment is toxic. Let's check which target values are present.

In [4]:
data['toxic'].value_counts()

0    143346
1    16225 
Name: toxic, dtype: int64
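
The counts above point to a strongly imbalanced target, which is one reason plain accuracy would be a weak metric here. A quick sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_clean, n_toxic = 143346, 16225

total = n_clean + n_toxic
toxic_share = n_toxic / total

print(total)                  # 159571, matches data.shape[0]
print(round(toxic_share, 3))  # 0.102

# A classifier that always predicts "not toxic" would be right this often,
# while its F1 for the toxic class would be 0
print(round(n_clean / total, 3))  # 0.898
```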

Based on the output, the target is binary and marks the presence or absence of toxicity in a comment.  
Now we can start processing the data. First, let's deal with punctuation. For reference, here are the characters that `string.punctuation` contains:

In [5]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

This set suits our needs, so let's write a function that strips these characters from the text and apply it.

In [6]:
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
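
As a possible alternative, the same filtering can be done with `str.translate`, which is usually faster on long strings than a character-by-character list comprehension. This is only a sketch; `remove_punctuation_fast` is a hypothetical name, not part of the original notebook:

```python
import string

# Translation table that maps every punctuation character to None (i.e. deletes it)
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    """Return `text` with all characters from string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

sample = "D'aww! He matches this background colour I'm seemingly stuck with."
print(remove_punctuation_fast(sample))
# Daww He matches this background colour Im seemingly stuck with
```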

In [7]:
# new column for the processed data
data['clean_msg'] = data['text'].apply(remove_punctuation)
data.head()

Unnamed: 0,text,toxic,clean_msg
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted They werent vandalisms just closure on some GAs after I voted at New York Dolls FAC And please dont remove the template from the talk page since Im retired now892053827
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,Daww He matches this background colour Im seemingly stuck with Thanks talk 2151 January 11 2016 UTC
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,Hey man Im really not trying to edit war Its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,\nMore\nI cant make any real suggestions on improvement I wondered if the section statistics should be later on or a subsection of types of accidents I think the references may need tidying so that they are all in the exact same format ie date format etc I can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up Its listed in the relevant form eg WikipediaGoodarticlenominationsTransport
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,You sir are my hero Any chance you remember what page thats on


Now the text should be lowercased.

In [8]:
data['msg_lower']= data['clean_msg'].apply(lambda x: x.lower())
data.head()

Unnamed: 0,text,toxic,clean_msg,msg_lower
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted They werent vandalisms just closure on some GAs after I voted at New York Dolls FAC And please dont remove the template from the talk page since Im retired now892053827,explanation\nwhy the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now892053827
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,Daww He matches this background colour Im seemingly stuck with Thanks talk 2151 January 11 2016 UTC,daww he matches this background colour im seemingly stuck with thanks talk 2151 january 11 2016 utc
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,Hey man Im really not trying to edit war Its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info,hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,\nMore\nI cant make any real suggestions on improvement I wondered if the section statistics should be later on or a subsection of types of accidents I think the references may need tidying so that they are all in the exact same format ie date format etc I can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up Its listed in the relevant form eg WikipediaGoodarticlenominationsTransport,\nmore\ni cant make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\n\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up its listed in the relevant form eg wikipediagoodarticlenominationstransport
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,You sir are my hero Any chance you remember what page thats on,you sir are my hero any chance you remember what page thats on


The final step is to lemmatize the data. I'll use the spaCy library, which tokenizes the text itself, so no separate tokenization step is needed. Again, let's add a column with lemmas to the dataset.

***Note***  
*To avoid re-running the lengthy lemmatization, I save the lemmatized dataset to disk so it can be loaded immediately in later sessions.*

In [9]:
en_core = spacy.load('en_core_web_sm')
data["lemmas"] = data['msg_lower'].apply(lambda x: " ".join([y.lemma_ for y in en_core(x)]))

In [10]:
data.to_csv(r'lemmatized_data.csv', index=False)

In [3]:
df = pd.read_csv('lemmatized_data.csv')
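
The save-then-reload idea from the note can be wrapped in a small helper so the notebook can be re-run from top to bottom without recomputing. This is only a sketch: `load_or_build` and its `build` callback are hypothetical names, and the demo uses a tiny stand-in frame instead of the real lemmatization.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build):
    """Read a cached CSV if it exists; otherwise call build(), save, and return the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame

# Demo with a stand-in for the expensive lemmatization step
calls = []
def expensive_step():
    calls.append(1)
    return pd.DataFrame({'lemmas': ['you sir be my hero'], 'toxic': [0]})

cache_path = os.path.join(tempfile.mkdtemp(), 'lemmatized_demo.csv')
first = load_or_build(cache_path, expensive_step)   # computes and saves
second = load_or_build(cache_path, expensive_step)  # reads from disk
print(len(calls))  # 1: the expensive step ran only once
```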

Let's look at the resulting dataset: the *lemmas* column contains the lemmatized text, and the data is ready for model training.

In [12]:
df.head()

Unnamed: 0,text,toxic,clean_msg,msg_lower,lemmas
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted They werent vandalisms just closure on some GAs after I voted at New York Dolls FAC And please dont remove the template from the talk page since Im retired now892053827,explanation\nwhy the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now892053827,explanation \n why the edit make under my username hardcore metallica fan be revert they be not vandalism just closure on some gas after I vote at new york doll fac and please do not remove the template from the talk page since I m retire now892053827
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,Daww He matches this background colour Im seemingly stuck with Thanks talk 2151 January 11 2016 UTC,daww he matches this background colour im seemingly stuck with thanks talk 2151 january 11 2016 utc,daww he match this background colour I m seemingly stuck with thank talk 2151 january 11 2016 utc
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,Hey man Im really not trying to edit war Its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info,hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,hey man I m really not try to edit war its just that this guy be constantly remove relevant information and talk to I through edit instead of my talk page he seem to care more about the formatting than the actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,\nMore\nI cant make any real suggestions on improvement I wondered if the section statistics should be later on or a subsection of types of accidents I think the references may need tidying so that they are all in the exact same format ie date format etc I can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up Its listed in the relevant form eg WikipediaGoodarticlenominationsTransport,\nmore\ni cant make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if noone else does first if you have any preferences for formatting style on references or want to do it yourself please let me know\n\nthere appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up its listed in the relevant form eg wikipediagoodarticlenominationstransport,\n more \n I can not make any real suggestion on improvement I wonder if the section statistic should be later on or a subsection of type of accident I think the reference may need tidy so that they be all in the exact same format ie date format etc I can do that later on if noone else do first if you have any preference for format style on reference or want to do it yourself please let I know \n\n there appear to be a backlog on article for review so I guess there may be a delay until a reviewer turn up its list in the relevant form eg wikipediagoodarticlenominationstransport
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,You sir are my hero Any chance you remember what page thats on,you sir are my hero any chance you remember what page thats on,you sir be my hero any chance you remember what page that s on


### Conclusion

In the preparation step, the data was pre-processed for model training: punctuation was removed, the text was lowercased, and lemmatization was carried out.

## Step 2. Learning

Let's create two variables, `X` and `y`, for the texts and the target respectively, and split the dataset into training and test (10%) sets. A separate validation set is not needed, as models will be selected by cross-validation.

In [4]:
X, y = df['lemmas'], df['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
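
Because the toxic class is a minority, it may be worth stratifying the split and fixing the seed so that a run can be reproduced. The notebook's own split does neither; the sketch below uses toy data, and `stratify` and `random_state=123` are illustrative choices rather than the author's settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10% positive, mimicking the class imbalance
X_toy = np.arange(100)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.1,
    stratify=y_toy,    # keep the class ratio the same in both parts
    random_state=123,  # make the split reproducible
)

print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```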

We will select the model using a *pipeline* and *grid search*: the models are placed into pipelines, and cross-validation is then run on those pipelines. For comparison, let's take a random forest and a naive Bayes model, which often works well for text classification. Both pipelines use the TF-IDF vectorizer to turn the text into features.

In [14]:
# Naive Bayes pipeline
tvc_pipe = Pipeline([('tvec', TfidfVectorizer()), ('mb', MultinomialNB())])

# random forest pipeline
rf_pipe = Pipeline([('tvec', TfidfVectorizer()), ('rf', RandomForestClassifier())])

Next, we fit both pipelines with their default parameters.

In [15]:
tvc_pipe.fit(X_train, y_train)

Pipeline(steps=[('tvec', TfidfVectorizer()), ('mb', MultinomialNB())])

In [16]:
rf_pipe.fit(X_train, y_train)

Pipeline(steps=[('tvec', TfidfVectorizer()), ('rf', RandomForestClassifier())])

Next comes parameter set-up.

In [17]:
# Bayes model parameters
tf_params = {
 'tvec__max_features':[50, 100],
 'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
 'tvec__stop_words': [None, 'english'],
 
}

# Random forest parameters
rf_params = {
 'tvec__max_features':[100, 200],
 'tvec__ngram_range': [(1, 2)],
 'tvec__stop_words': ['english'],
 'rf__max_depth': [100],
 'rf__min_samples_split': [100],
 'rf__max_leaf_nodes': [None]
}
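
The number of candidates a grid search evaluates is the product of the list lengths in the grid, and each candidate is fitted once per fold. A quick sanity check with the two grids above, assuming the 5-fold cross-validation used in the following cells (`count_fits` is a hypothetical helper):

```python
from math import prod

tf_params = {
    'tvec__max_features': [50, 100],
    'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tvec__stop_words': [None, 'english'],
}
rf_params = {
    'tvec__max_features': [100, 200],
    'tvec__ngram_range': [(1, 2)],
    'tvec__stop_words': ['english'],
    'rf__max_depth': [100],
    'rf__min_samples_split': [100],
    'rf__max_leaf_nodes': [None],
}

def count_fits(grid, cv=5):
    """Return (number of candidates, total number of fits) for one parameter grid."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * cv

print(count_fits(tf_params))  # (12, 60)
print(count_fits(rf_params))  # (2, 10)
```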

Finally, let's run the grid search over the specified parameters. For the *scoring* parameter we use the F1 measure, as requested by the customer.

In [18]:
# Grid search for the Bayes model
tvc_gs = GridSearchCV(tvc_pipe, param_grid=tf_params, scoring='f1',
                      cv = 5, verbose =1, n_jobs = -1)

# model fitting
tvc_gs.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('mb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'tvec__max_features': [50, 100],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
                         'tvec__stop_words': [None, 'english']},
             scoring='f1', verbose=1)

In [19]:
# Random forest gridsearch
rf_gs = GridSearchCV(rf_pipe, param_grid=rf_params, scoring='f1',
                     cv = 5, verbose = 1, n_jobs = -1)

# model fitting
rf_gs.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'rf__max_depth': [100], 'rf__max_leaf_nodes': [None],
                         'rf__min_samples_split': [100],
                         'tvec__max_features': [100, 200],
                         'tvec__ngram_range': [(1, 2)],
                         'tvec__stop_words': ['english']},
             scoring='f1', verbose=1)
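
Once a search has finished, the winning configuration and its cross-validated score can be read from `best_params_` and `best_score_`. The notebook does not print them, so the sketch below uses a tiny made-up corpus and a reduced grid purely to show the attributes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = toxic, 0 = not toxic)
texts = ["you idiot", "thanks for the help", "stupid idiot", "nice work thanks",
         "what an idiot", "great edit thanks", "stupid edit idiot", "thanks a lot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([('tvec', TfidfVectorizer()), ('mb', MultinomialNB())])
grid = {'tvec__ngram_range': [(1, 1), (1, 2)]}

gs = GridSearchCV(pipe, param_grid=grid, scoring='f1', cv=2)
gs.fit(texts, labels)

print(gs.best_params_)           # the winning parameter combination
print(round(gs.best_score_, 3))  # mean cross-validated F1 of that combination
```

With the real searches, the same attributes would be read from `tvc_gs` and `rf_gs`.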

Also, for comparison, let's look at a logistic regression model, as in the theoretical part of the training. To do this, we vectorize the text again, this time filtering out the stop words from the nltk module before applying the TF-IDF vectorizer.

In [5]:
lemm_train = X_train
lemm_test = X_test

nltk.download('stopwords')
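# keep the stop words under a separate name so the imported `stopwords` module is not shadowed;
# a list is used because newer scikit-learn versions expect a list for `stop_words`
stop_words = list(stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stop_words)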
tf_idf = count_tf_idf.fit_transform(lemm_train) 
print("Matrix size:", tf_idf.shape)

features_train = tf_idf
target_train = y_train

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rebek.000\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Matrix size: (143613, 213817)


Now let's train a logistic regression model with default hyperparameters for later evaluation.

In [10]:
model = lr(random_state=123)
model.fit(features_train, target_train)

LogisticRegression(random_state=123)

Let's also wrap the logistic regression classifier in a pipeline, feeding it the data prepared above: already vectorized and with stop words removed.

In [6]:
lr_pipe = Pipeline([('classifier' , lr())])
lr_pipe.fit(features_train, target_train)

Pipeline(steps=[('classifier', LogisticRegression())])
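
A possible refinement: when TF-IDF is fitted on the whole training set before cross-validation, each validation fold has already influenced the vocabulary and the IDF weights. Putting the vectorizer inside the pipeline lets it be refitted on every training fold. A minimal sketch with made-up data (`lr_text_pipe` is a hypothetical name, and the stop-word list is scikit-learn's built-in `'english'` rather than the NLTK list used above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = toxic, 0 = not toxic)
texts = ["you idiot", "thanks for the help", "stupid idiot", "nice work thanks",
         "what an idiot", "great edit thanks", "stupid edit idiot", "thanks a lot"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Raw text goes in; the vectorizer is refitted inside each cross-validation split
lr_text_pipe = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(solver='liblinear')),
])

scores = cross_val_score(lr_text_pipe, texts, labels, scoring='f1', cv=2)
print(scores.shape)  # (2,): one F1 value per fold
```

In practice the difference is often small, but this layout also lets the vectorizer's own parameters be tuned in the same grid.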

Setting the parameters for iteration:

In [7]:
param_grid = {
    'classifier' : [lr()],
    'classifier__penalty' : ['l1', 'l2'],
    'classifier__C' : np.logspace(-4, 4, 20),
    'classifier__solver' : ['liblinear']
}

Finally, we run a grid search over these parameters (200 fits: 40 candidates times 5 folds) and keep the fitted search object, which exposes the best model.

In [8]:
model_lr = GridSearchCV(lr_pipe, param_grid = param_grid, cv = 5, verbose=True, scoring='f1', n_jobs=-1)

best_lr = model_lr.fit(features_train, target_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


### Conclusion

The data was split into a training set and a test set (10%); a separate validation set is not needed because hyperparameters are selected by cross-validation. Two models were chosen: a random forest and a multinomial naive Bayes model. Each is trained as a pipeline whose parameters are tuned by grid search on the F1 measure. For comparison, a logistic regression model was also trained, both on its own and inside its own pipeline with cross-validation.

## Step 3. Comparison of the models and conclusions

Let's compare the F1 scores of the models on both the training and test sets, starting with the training set.

In [25]:
# Bayes model on train data
tvc_gs_pred = tvc_gs.predict(X_train)
print(f1_score(y_train, tvc_gs_pred))

# Random forest on train data
rf_gs_pred = rf_gs.predict(X_train)
print(f1_score(y_train, rf_gs_pred))

0.2745421353277192
0.49972404796548087


In [42]:
# Logistic regression on train data
predictions = model.predict(features_train)
print(f1_score(target_train, predictions))

0.7761049445005045
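
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. A small sketch computing it from confusion counts; the counts below are invented for illustration and are not taken from the models above:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: 80 toxic comments caught, 20 false alarms, 30 toxic comments missed
print(round(f1_from_counts(tp=80, fp=20, fn=30), 4))  # 0.7619
```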


Now let us study the test data.

In [2]:
# Bayes model on test data
tvc_gs_pred_test = tvc_gs.predict(X_test)
print(f1_score(y_test, tvc_gs_pred_test))

# Random forest on test data
rf_gs_pred_test = rf_gs.predict(X_test)
print(f1_score(y_test, rf_gs_pred_test))

In [11]:
# reuse the vectorizer fitted on the training data; the test texts are only transformed
tf_idf1 = count_tf_idf.transform(lemm_test)
print("Matrix size:", tf_idf1.shape)

# Logistic regression on test data
features_test = tf_idf1
predict_test = model.predict(features_test)
print(f1_score(y_test, predict_test))

Matrix size: (15958, 213817)
0.7323420074349442


Additionally, I will check the performance of the new model - the best logistic regression according to cross-validation.

In [12]:
predict_test_pipe = best_lr.predict(features_test)
print(f1_score(y_test, predict_test_pipe))

0.7899794097460535


So, both **logistic regressions** perform best, with the tuned one reaching an F1 of about 0.79 on the test set against 0.73 for the default one, and the random forest takes second place.

For further analysis, it would be interesting to see which words the random forest considered most important when classifying comments. To do this, we build and display the corresponding dataframe.

In [23]:
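# importances from the fitted random forest, labeled with that pipeline's own TF-IDF vocabulary
# (get_feature_names_out requires scikit-learn >= 1.0)
tvc_title = pd.DataFrame(rf_pipe.named_steps['rf'].feature_importances_,
                         rf_pipe.named_steps['tvec'].get_feature_names_out(), columns=['Significance'])
tvc_title.sort_values('Significance', ascending = False).head(20)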

Unnamed: 0,Significance
fuck,0.037444
fucking,0.016716
suck,0.014041
you,0.013708
shit,0.01364
stupid,0.010475
ass,0.01005
bitch,0.008329
the,0.008146
idiot,0.007999


### Conclusion

Interestingly, logistic regression scored much higher than the other two models, whose vectorization happened inside the pipeline. A likely contributor is that the grids for those two models cap the vocabulary at 50 to 200 features, while the logistic regression was trained on all 213,817 TF-IDF features. As for the most important words, they turned out to be abusive expressions, which is logical given the task.

## General conclusion

For the launch of a new service, the Wikishop online store ordered a tool that finds toxic customer comments and sends them for moderation.

As part of the project, models were trained to classify comments as toxic or non-toxic. A dataset of comments labeled for toxicity was available for training.

So, **the logistic regression tuned by grid search, with an F1 of about 0.79 on the test set,** turned out to be the best.

The following steps have been taken:
- the data was loaded and prepared
- three models were trained: naive Bayes, random forest, and logistic regression
- the target F1 threshold (0.75) was exceeded and the best model was selected
- the words most indicative of toxicity were identified
- conclusions were drawn