<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Work-plan" data-toc-modified-id="Work-plan-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Work plan</a></span></li><li><span><a href="#Data-description" data-toc-modified-id="Data-description-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Data description</a></span></li></ul></li><li><span><a href="#Load-and-preprocessing" data-toc-modified-id="Load-and-preprocessing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load and preprocessing</a></span></li><li><span><a href="#Build-models" data-toc-modified-id="Build-models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Build models</a></span></li><li><span><a href="#Best-model-testing" data-toc-modified-id="Best-model-testing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Best model testing</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Negative comment detection for «Wikishop»

The online store "Wikishop" is launching a new service. Users can now edit and add product descriptions, just like in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to build and train a model that classifies comments as positive or negative (toxic).

The F1 score must be higher than **0.75**.
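For reference, F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch that computes it from confusion counts; the function name and the counts are illustrative assumptions, not values from the project data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative counts only: 75 true positives, 20 false positives, 30 false negatives.
score = f1_from_counts(tp=75, fp=20, fn=30)
print(round(score, 3))  # 0.75
```

With these counts the score sits exactly on the 0.75 threshold, so a model would need fewer errors than this to pass.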

### Work plan

1. Load and preprocess the data.
2. Build different models.
3. Draw conclusions.

### Data description

The *text* column contains the comment text; *toxic* is the target.

## Load and preprocessing

In [1]:
import pandas as pd
import numpy as np
import warnings                   
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
data.sample(20)

index,Unnamed: 0,text,toxic
103318,103415,""":::He can also be classified under WP:NPOL pe...",0
152102,152259,"""\n\nThe Bravery of the Scorpion\n\nWhy would ...",0
123043,123152,"On second thought, disregard the above questio...",0
150159,150315,"""\n\n Tip \n\nDiannaa, when you block a user a...",0
89424,89511,"""\n\n Copyright violation? NPOV? \n\nThe text ...",0
133457,133595,will u be REVERTING the pages I edited to the ...,0
79864,79940,"Thank you for your apology. I accept it, such ...",0
29497,29536,This is not content dispute. It's user conduct...,0
104748,104845,I am dieing stop undoing my edits of the jack ...,0
54252,54313,"When you are on the north pole, there is nowhe...",0


The text contains many unnecessary characters (newlines, quotes, punctuation) and mixed letter case.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
data.duplicated().sum()

0

There are no missing values and no duplicates.

Lemmatize

In [6]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [7]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [8]:
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    # Tokenize the text, lemmatize each token with its POS tag, and join back into one string
    return ' '.join(
        lemmatizer.lemmatize(w, get_wordnet_pos(w))
        for w in nltk.word_tokenize(text)
    )

In [9]:
data['lemm_text'] = data['text'].apply(lemmatize)

Clean text

In [10]:
def clear_text(text):
    return " ".join(re.sub(r'[^a-zA-Z ]', 
                           ' ', 
                           text).lower().split())

In [11]:
data['lemm_text'] = data['lemm_text'].apply(clear_text)

In [12]:
data['lemm_text'][0]

'explanation why the edits make under my username hardcore metallica fan be revert they be n t vandalism just closure on some gas after i vote at new york dolls fac and please do n t remove the template from the talk page since i m retire now'

The text is now lowercase and contains only letters and single spaces.
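As a quick illustration of what `clear_text` does, the sketch below repeats the function on a made-up string (the sample text is an assumption for demonstration). It also shows why fragments such as `n t` and `i m` appear in the output above: apostrophes are replaced by spaces, which splits contractions.

```python
import re


def clear_text(text: str) -> str:
    # Replace every character that is not an ASCII letter or a space,
    # then lowercase and collapse runs of whitespace into single spaces.
    return " ".join(re.sub(r"[^a-zA-Z ]", " ", text).lower().split())


# Made-up sample with newlines, punctuation, digits and mixed case.
sample = "Hello,\n\nWORLD!!  It's 2024..."
cleaned = clear_text(sample)
print(cleaned)  # hello world it s
```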

## Build models

Train-test split

In [13]:
(features_train, 
 features_test, 
 target_train, 
 target_test) = train_test_split(
                                 data['lemm_text'], 
                                 data['toxic'], 
                                 test_size=0.25, 
                                 random_state=3101)
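The split above does not stratify by the target. With an imbalanced target, passing `stratify` keeps the class ratio the same in both parts. The sketch below demonstrates this on synthetic labels; the data and variable names are assumptions for illustration and are not taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (10% positive), used only for illustration.
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=3101, stratify=y
)

# With stratify=y both parts keep the same share of positive labels.
print(y_train.mean(), y_test.mean())
```

Note that switching the notebook's own split to a stratified one would change the reported scores.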

Import stopwords

In [14]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Compute TF-IDF features for the train and test samples

In [15]:
# Pass a list: newer scikit-learn versions reject a set for stop_words
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf_train = count_tf_idf.fit_transform(features_train)
tf_idf_test = count_tf_idf.transform(features_test)
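The vectorizer above is fit on the whole training sample before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. In addition, the pipelines below apply `TfidfTransformer` to a matrix that is already TF-IDF weighted. One possible alternative, shown here as a sketch on a tiny made-up corpus, is to put `TfidfVectorizer` inside the `Pipeline` so that it is refit on each training fold. The corpus, labels and step names are assumptions, and this is not the setup that produced the scores in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
texts = [
    "you are awful", "awful stupid edit", "stupid awful person", "you are stupid",
    "thanks for the edit", "nice helpful edit", "thanks you are helpful", "nice person",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # raw text in, TF-IDF matrix out
    ("lr", LogisticRegression(random_state=3101)),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the vocabulary never sees the validation texts.
scores = cross_val_score(text_clf, texts, labels, cv=2, scoring="f1")
print(scores)
```

With this layout, the raw text column would be passed to the grid search instead of a precomputed TF-IDF matrix.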

Train the models

In [16]:
pipe_lr = Pipeline(
    [
        ('tfidf', TfidfTransformer()),
        ('lr', LogisticRegression(random_state=3101)),
    ]
)

In [17]:
lr_parameters = {'lr__class_weight':['balanced'], 
                 'lr__C':[0.5,1,3,5],
                 'lr__penalty': ['l1', 'l2']}
             
grid_lr = GridSearchCV(pipe_lr,  
                       lr_parameters,
                       scoring = 'f1',
                       verbose=False,
                       cv=10,
                       n_jobs=-1)

grid_lr.fit(tf_idf_train, target_train)

print('Best f1 is equal to', grid_lr.best_score_, 
      'with parameters', grid_lr.best_params_)

Best f1 is equal to 0.7589166065102113 with parameters {'lr__C': 3, 'lr__class_weight': 'balanced', 'lr__penalty': 'l2'}
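A caveat about the grid above: as far as I know, the default `lbfgs` solver of `LogisticRegression` does not support the `l1` penalty, so those candidates fail to fit and are scored as NaN (the warnings are suppressed in this notebook). The sketch below uses `solver="liblinear"`, which accepts both penalties, on synthetic data. The data set and parameter grid are illustrative assumptions, and switching the solver in the notebook would change the reported scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the TF-IDF features; illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=3101)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear", class_weight="balanced", random_state=3101),
    {"C": [0.5, 1, 3, 5], "penalty": ["l1", "l2"]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)

# Every candidate, including the l1 ones, should now have a finite mean score.
mean_scores = grid.cv_results_["mean_test_score"]
all_finite = bool(np.all(np.isfinite(mean_scores)))
print(all_finite, grid.best_params_)
```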


In [18]:
pipe_rfc = Pipeline(
    [
        ("tfidf", TfidfTransformer()),
        ("clf", RandomForestClassifier(random_state=3101)),
    ]
)

In [19]:
rfc_parameters = {'clf__class_weight': ['balanced'],
                  'clf__n_estimators': range(10, 510, 50),
                  'clf__max_depth': range(1, 9, 3)}
             
grid_rfc = GridSearchCV(pipe_rfc,  
                        rfc_parameters,
                        scoring = 'f1',
                        verbose=False,
                        cv=10,
                        n_jobs=-1)

grid_rfc.fit(tf_idf_train, target_train)

print('Best f1 is equal to', grid_rfc.best_score_, 
      'with parameters', grid_rfc.best_params_)

Best f1 is equal to 0.3550176070393337 with parameters {'clf__class_weight': 'balanced', 'clf__max_depth': 7, 'clf__n_estimators': 360}


## Best model testing

Evaluate both tuned models on the test sample

In [20]:
lr_pred = grid_lr.predict(tf_idf_test)
lr = f1_score(target_test, lr_pred)
lr

0.759665385484965

In [21]:
rfc_pred = grid_rfc.predict(tf_idf_test)
rfc = f1_score(target_test, rfc_pred)
rfc

0.3613801452784504

In [22]:
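# Note: classification_report expects (y_true, y_pred). The arguments are swapped here,
# so in the report below precision and recall are interchanged and support counts predictions.
print(classification_report(lr_pred, target_test))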

              precision    recall  f1-score   support

           0       0.96      0.98      0.97     35035
           1       0.83      0.70      0.76      4788

    accuracy                           0.95     39823
   macro avg       0.89      0.84      0.86     39823
weighted avg       0.94      0.95      0.94     39823
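scikit-learn metrics take their arguments in the order `(y_true, y_pred)`. The sketch below uses made-up labels to show that swapping the two exchanges precision and recall, while F1 stays the same. All values are illustrative assumptions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Correct order: (y_true, y_pred).
precision = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are correct
recall = recall_score(y_true, y_pred)        # 2 of 4 actual positives are found

# Swapped order: the two metrics trade places.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

# F1 is symmetric in its arguments, so it does not reveal the swap.
f1_correct = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)
print(precision, recall, swapped_precision, swapped_recall)
```

This is why the F1 values reported in this notebook are unaffected by the argument order, even though the per-class precision and recall columns are.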



Compare against a constant baseline that always predicts the toxic class

In [23]:
dummy_clf = DummyClassifier(strategy="constant", 
                            constant=1, 
                            random_state=3101)
dummy_clf.fit(tf_idf_train, target_train)
predicted_dummy = dummy_clf.predict(tf_idf_test)
dummy = f1_score(target_test, predicted_dummy)
dummy

0.18495476402087463
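The baseline value can be checked by hand. A model that labels every comment as toxic has recall 1 and precision equal to the share of toxic comments p, so its F1 is 2p / (1 + p). The sketch below inverts that formula to estimate the toxic share in the test sample from the score above; the function names are illustrative assumptions.

```python
def f1_constant_positive(positive_share: float) -> float:
    """F1 of a model that labels every sample positive: precision = share, recall = 1."""
    return 2 * positive_share / (1 + positive_share)


def positive_share_from_f1(f1: float) -> float:
    """Invert F1 = 2p / (1 + p) to recover the positive share p."""
    return f1 / (2 - f1)


# Baseline F1 reported above for the constant model.
share = positive_share_from_f1(0.18495476402087463)
print(round(share, 3))  # 0.102
```

So roughly one comment in ten in the test sample is toxic, which confirms that the classes are imbalanced.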

In [24]:
result = pd.DataFrame([lr, rfc, dummy], 
                       columns=['f1'], 
                       index=['LogisticRegression',
                              'RandomForestClassifier',
                              'DummyClassifier'])
result

model,f1
LogisticRegression,0.759665
RandomForestClassifier,0.36138
DummyClassifier,0.184955


## Conclusion
- The best model is **Logistic Regression**.
- Its F1 score on the test sample is **0.76**, which is above the required 0.75.
- The best model clearly outperforms the constant baseline (F1 of 0.18), so the sanity check passes.