# Project for Wikishop

Online store "Wikishop" launches a new service. Now users can edit and supplement product descriptions, just like in wiki communities. That is, clients propose their edits and comment on the changes of others. The store needs a tool that will look for toxic comments and submit them for moderation.

Train the model to classify comments into positive and negative. At your disposal is a dataset with markup on the toxicity of edits.

Build a model with a quality metric *F1* of at least 0.75.

**Instructions for the implementation of the project**

1. Download and prepare data.
2. Train different models.
3. Make conclusions.


**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment and *toxic* is the target attribute.

## Data preprocessing

In [1]:
# !pip install fast_ml
!pip install nltk



In [2]:
import pandas as pd
from fast_ml.model_development import train_valid_test_split
import re
from pymystem3 import Mystem
# m = Mystem()
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\leint\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\leint\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\leint\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\leint\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
data = pd.read_csv('toxic_comments.csv', index_col='Unnamed: 0')
data.head()
m = WordNetLemmatizer()

In [4]:
# function to clean the text from extra characters
def reg_data(row):
    row = re.sub(r"(?:\n|\r)", " ", row)
    row = re.sub(r"[^a-zA-Z ]+", "", row).strip() # strip whitespace left and right with strip
    row = row.lower()                             # lower case
    return row

data['text'] = data['text'].apply(reg_data)
data.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,daww he matches this background colour im seem...,0
2,hey man im really not trying to edit war its j...,0
3,more i cant make any real suggestions on impro...,0
4,you sir are my hero any chance you remember wh...,0


In [5]:
data.describe()

Unnamed: 0,toxic
count,159292.0
mean,0.101612
std,0.302139
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


Data preprocessed:

- Loaded the main libraries;
- the text is cleared of extra characters and the case is lowered;
- data is separated for training;
- data is lemmatized;
- evaluated by TF-IDF.

In [7]:
# dividing the dataset into samples

train_features, target_train, features_valid, target_valid, features_test, target_test = train_valid_test_split(data, 
                                                                                                        target = 'toxic', 
                                                                                                        train_size=0.6, 
                                                                                                        valid_size=0.2, 
                                                                                                        test_size=0.2)

In [8]:
# Creating a text corpus and lemmatization function

train_corpus = train_features['text'].values

def lemmatize(text):
    word_list = nltk.word_tokenize(text)
    return ' '.join([m.lemmatize(i) for i in word_list])

train_corpus[0] = lemmatize(train_corpus[0])

In [9]:
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords, ngram_range=(1,1)) # stopword list 
tf_idf_train = count_tf_idf.fit_transform(train_corpus)

In [10]:
valid_corpus = features_valid['text'].values
valid_corpus[0] = lemmatize(valid_corpus[0])
tf_idf_valid = count_tf_idf.transform(valid_corpus)

In [11]:
test_corpus = features_test['text'].values
test_corpus[0] = lemmatize(test_corpus[0])
tf_idf_test = count_tf_idf.transform(test_corpus)
test_corpus

array(['well fuckhead seemed a little over the top',
       'scarborough nicknames   while i dont really want to expand on this section there are in fact some web references to scompton  you know youre from scarborugh when busta rules  there is also a wikipedia page called canadian slang which references scompton as well as a couple of other nicknames not mentioned in the scarborough article i used clusty search engine to find these references up to you whether you want to put this reference back i dont care one way or the other  regards    youre absolutely correct  serves me right for editing before my first cup of coffee in the morning  i found nothing relevant for scompton but when i redid the search later i did get results thru a search for scompton and scarborough i ought not to have deleted the entry based on a google search turning up nothingpersonally i generally dont see any value in having a section for nicknames for the municipality  but there is obviously precedent for havi

## Model training

**Model training:**
    
- Logistic regression;
- Logistic regression with cross-validation;
- Decision tree.

In [12]:
# Logistic regression
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(tf_idf_train, target_train)
model_lr_answer = model_lr.predict(tf_idf_valid)
f1_score_lr = f1_score(target_valid, model_lr_answer)
print ("f1_score", f1_score_lr)

f1_score 0.7079377136346372


In [13]:
# Logistic regression with cross-validation
LogisticRegression = LogisticRegression(random_state=1, solver='liblinear', max_iter=100)
params = {
   'penalty':['l1'],        
   'C':list(range(1,5)) 
}
LR = GridSearchCV(LogisticRegression, params, cv=4, scoring='f1').fit(tf_idf_train, target_train)
print ("Best Params", LR.best_params_)
print ("Best Score", LR.best_score_)

Best Params {'C': 3, 'penalty': 'l1'}
Best Score 0.7705894922938182


In [14]:
# Decision tree
tree_model = DecisionTreeClassifier(random_state = 12345, max_depth=10)
tree_model.fit(tf_idf_train, target_train)
tree_model_answer = tree_model.predict(tf_idf_valid)
tree_model_f1_score = f1_score(target_valid, tree_model_answer)
print ("f1_score", tree_model_f1_score)

f1_score 0.5728073856483423


In [15]:
# The best model is logistic regression. Let's check it on a test sample
best = LR.predict(tf_idf_test)      
f1_lr = f1_score(target_test, best)     
print ("f1_score of best model", f1_lr)

f1_score of best model 0.7871162556618017


# Total conclusion

Models trained:
    
- Logistic regression;
- Logistic regression with cross-validation;
- Decision tree.
    
Logistic regression with cross-validation turned out to be the best. She showed f1 = 0.76 on the validation set. On the test sample 0.78. The mark of 0.75 has been overcome.