<h1>Content<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Preparation" data-toc-modified-id="Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Model training" data-toc-modified-id="Model training-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#DecisionTree" data-toc-modified-id="DecisionTree-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>DecisionTree</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Checklist" data-toc-modified-id="Checklist-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Checklist</a></span></li></ul></div>

# Project for Wikishop

The online store "Wikishop" is launching a new service: users can now edit and supplement product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that detects toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled with the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Instructions for completing the project**.

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

**Data description**.

The data is in the file `toxic_comments.csv`. The *text* column in it contains the text of the comment, and *toxic* contains the target attribute.

In [1]:
import string

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler  # used by ev_model when scale=True
from sklearn.tree import DecisionTreeClassifier

# Tokenizer data required by word_tokenize.
# stopwords.words() also needs nltk.download('stopwords') if that corpus is not installed yet.
nltk.download('punkt')


[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Preparation

In [2]:
# read data
df = pd.read_csv('/datasets/toxic_comments.csv')
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [3]:
df.shape

(159571, 2)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
# Check imbalance
df['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The comments are in English. The data preparation consists of the following steps:

1) Tokenize the text.

2) Remove punctuation and stop words.

3) Stem the remaining tokens.

4) Split the data into training and test samples.

5) Vectorize the text with TF-IDF.

6) Account for class imbalance when training the models.

Let's get started.
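Step 6 can be made concrete with a little arithmetic. The sketch below uses the class counts printed above and the formula that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. The variable names are illustrative only. The models later in the notebook are fitted on the training split, so the weights they actually use are computed from `y_train` and may differ slightly from these.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 143346, 1: 16225}

n_samples = sum(counts.values())
n_classes = len(counts)

# Formula documented for class_weight='balanced':
# n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

# How many non-toxic comments there are per toxic comment.
imbalance_ratio = counts[0] / counts[1]

print(f"non-toxic comments per toxic comment: {imbalance_ratio:.2f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, an error on a toxic comment costs the model roughly nine times as much as an error on a non-toxic one, which offsets the imbalance in the data.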

In [6]:
# Separate target
y = df['toxic']
X = df['text']

In [7]:
# Initialize the stemmer and load the stop-word list (a set gives fast membership checks)
snowball = SnowballStemmer(language="english")
stop_words = set(stopwords.words("english"))

In [8]:
# Text preparation: tokenize, drop punctuation and stop words, then stem
def tokenize_sentence(sentence: str, remove_stop_words: bool = True):
    # split the text into tokens
    tokens = word_tokenize(sentence, language="english")

    # drop punctuation characters (multi-character tokens such as '...' are kept)
    tokens = [i for i in tokens if i not in string.punctuation]

    # remove stop words
    if remove_stop_words:
        tokens = [i for i in tokens if i not in stop_words]

    # reduce each token to its stem
    tokens = [snowball.stem(i) for i in tokens]

    return tokens

In [9]:
# Split the data into training and test samples, stratified by the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1123, stratify=y)
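As a small illustration of what `stratify=y` does, the sketch below splits a hypothetical toy label array (10% positives, not the project data) with the same `test_size` and `random_state`, and compares the share of positives in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 900 negatives and 100 positives (10% positive).
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1123, stratify=y_toy)

# With stratification both parts keep the original share of positives.
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"positive share in train: {train_share:.3f}, in test: {test_share:.3f}")
```

Without `stratify`, the share of the rare class in the test sample would vary from split to split, which makes the F1 estimate noisier.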

In [10]:
# Create a TF-IDF vectorizer that uses the custom tokenization function
# (tokenize_sentence removes stop words by default)
vectorizer = TfidfVectorizer(tokenizer=tokenize_sentence)
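To show what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF weighting, assuming the `TfidfVectorizer` defaults (raw term counts, `smooth_idf=True`, `norm='l2'`). The toy corpus and the names `docs`, `doc_freq`, `idf` and `tfidf` are hypothetical and not part of the project code.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy corpus.
docs = [
    ["good", "edit"],
    ["bad", "edit"],
    ["bad", "bad", "comment"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(doc):
    # Term count times IDF, followed by L2 normalisation of the document vector.
    term_counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in term_counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

In the first document, "good" occurs in fewer documents than "edit", so it receives the larger weight: rare terms count for more than common ones.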

In [11]:
# Fit the vectorizer on the training sample only, then transform the test sample with it to avoid leakage
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
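The difference between `fit_transform` and `transform` can be seen on a toy example. The sketch below uses hypothetical texts and the default tokenizer, not the project's `tokenize_sentence`: the vocabulary is learned from the training texts only, so words that appear only in the test texts are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts, not the project data.
train_texts = ["good edit", "bad edit", "bad bad comment"]
test_texts = ["unseen vandalism", "good comment"]

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
test_matrix = toy_vectorizer.transform(test_texts)        # reuses them unchanged

vocabulary = sorted(toy_vectorizer.vocabulary_)
print("vocabulary:", vocabulary)
print("test matrix shape:", test_matrix.shape)
# The first test text contains only unseen words, so its row has no non-zero entries.
print("non-zero entries in first test row:", test_matrix[0].nnz)
```

Fitting the vectorizer on the full dataset would let test-set words and document frequencies influence the features, which is the leakage the cell above avoids.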

## Model training

For the rest of the work, two helper functions automate the calculations:

1) `f1_metric` computes the F1 score of a fitted model on a given sample.

2) `ev_model` trains a model and returns the F1 score on the training and test samples.

In [12]:
def f1_metric(model, X, y):
    """Return the F1 score of a fitted model on the sample (X, y)."""
    y_pred = model.predict(X)
    return metrics.f1_score(y, y_pred)
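For reference, the F1 score is the harmonic mean of precision and recall for the positive (toxic) class. The following pure-Python sketch reproduces the calculation on hypothetical toy labels. The helper `f1_from_labels` is illustrative only; the notebook itself relies on `metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 3 toxic and 5 non-toxic comments.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]

toy_f1 = f1_from_labels(y_true_toy, y_pred_toy)
toy_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"F1 = {toy_f1:.3f}, accuracy = {toy_accuracy:.3f}")
```

Because F1 ignores true negatives, it is a stricter measure than accuracy when the positive class is rare, as it is in this dataset.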


In [13]:
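def ev_model(model, X_train, y_train, X_test, y_test, scale=False):
    """Fit the model and return the F1 score on the training and test samples."""
    if scale:
        # Requires: from sklearn.preprocessing import StandardScaler
        # with_mean=False is needed for sparse TF-IDF matrices:
        # StandardScaler raises an error when asked to center sparse input.
        sc = StandardScaler(with_mean=False)
        X_train = sc.fit_transform(X_train)
        X_test = sc.transform(X_test)

    model.fit(X_train, y_train)

    metric_train = f1_metric(model, X_train, y_train)
    metric_test = f1_metric(model, X_test, y_test)

    return metric_train, metric_test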


### LogisticRegression

In [14]:
# Parameter grid for the search
param_grid = [{'penalty': ['l1', 'l2'], 'C': [10**i for i in range(-20, 7)]}]

# Stratified 5-fold cross-validation keeps the class ratio in every fold
scv = StratifiedKFold(n_splits=5)

# solver='liblinear' supports both 'l1' and 'l2'; the default 'lbfgs' solver rejects 'l1'
LR = GridSearchCV(LogisticRegression(class_weight='balanced', solver='liblinear'), param_grid=param_grid, scoring='f1', cv=scv, verbose=True, n_jobs=-1)

best_LR = LR.fit(X_train, y_train)

print('best CV f1 score = %0.3f' % best_LR.best_score_)
print("best parameters:\n", best_LR.best_params_)
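The role of `StratifiedKFold` in the search can be checked on a toy example. The sketch below uses hypothetical labels with 10% positives and counts how many positives land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)
]
print(positives_per_fold)
```

Every validation fold receives the same number of positives, so the F1 score averaged by `GridSearchCV` is computed on folds with the same class ratio as the full training sample.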

In [15]:
# Fit the selected configuration with the helper functions and report F1 on the training and test samples
LR_report=ev_model(LogisticRegression(penalty='l1', C=10, solver = 'liblinear', class_weight='balanced'), X_train, y_train, X_test, y_test, scale=False)

print('f1-score of the best model on train sample = %0.3f:' %LR_report[0])
print('f1-score of the best model on test sample = %0.3f:' %LR_report[1] )

f1-score of the best model on train sample = 0.954:
f1-score of the best model on test sample = 0.764:


### DecisionTree

In [16]:
# Parameter grid for the search
param_grid = [{
    'max_depth': list(range(2, 4)),
    'min_samples_leaf': list(range(2, 4)),
    'min_samples_split': list(range(2, 4))
}]

# Grid search with the default cross-validation
DT = GridSearchCV(DecisionTreeClassifier(class_weight='balanced'), param_grid=param_grid, scoring='f1', verbose=True, n_jobs=-1)

best_DT = DT.fit(X_train, y_train)

print('best CV f1 score = %0.3f' % best_DT.best_score_)
print("best parameters:\n", best_DT.best_params_)

In [17]:
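# Evaluate a decision tree with manually set hyperparameters (these values lie outside the grid searched above)
DT_report=ev_model(DecisionTreeClassifier(class_weight='balanced', max_depth = 20, min_samples_split = 8, min_samples_leaf = 4), X_train, y_train, X_test, y_test, scale=False)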

print('f1-score of the best model on train sample = %0.3f:' %DT_report[0])
print('f1-score of the best model on test sample = %0.3f:' %DT_report[1])

f1-score of the best model on train sample = 0.671:
f1-score of the best model on test sample = 0.627:


## Conclusion

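In this work, we used stemming and TF-IDF to vectorize the texts and trained two models: logistic regression and a decision tree. Logistic regression performed better (F1 of 0.764 vs. 0.627 on the test sample) and exceeded the required threshold of 0.75. Its F1 on the training sample (0.954) is noticeably higher than on the test sample, which points to some overfitting.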

## Checklist

- [x] All code is executed without errors
- [x] Cells with code are in the order of execution
- [x] Data is loaded and prepared
- [x] Models are trained
- [x] Metric value *F1* is at least 0.75
- [x] Conclusions are written