## Khomkin Konstantin, cohort 53DS
Machine Learning for Texts Project, v.2.0 23.02.2023

## Introduction

The Wikishop online store is launching a new service: users can now edit and supplement product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that detects toxic comments and sends them for moderation.

We will train a model to classify comments as toxic or non-toxic, using a dataset in which each edit is labelled for toxicity.
The target is an F1 score of at least 0.75.
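
For reference, F1 is the harmonic mean of precision and recall computed for the positive (here, toxic) class. A minimal sketch with made-up confusion counts; the helper name `f1_from_counts` and the numbers are illustrative assumptions, not results from this project:

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only
print(round(f1_from_counts(tp=60, fp=20, fn=20), 2))
```

Because F1 ignores true negatives, it is a reasonable choice when the toxic class is much rarer than the non-toxic one.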

In [1]:
#Import libraries

from nltk.tokenize import SpaceTokenizer
from nltk.stem import WordNetLemmatizer

import numpy as np

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import f1_score

import warnings

import re 

import nltk


In [2]:
# Suppress warnings
warnings.filterwarnings(action='ignore')

In [3]:
# Try the simulator path first, then fall back to a local copy of the file

try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    df = pd.read_csv('toxic_comments.csv')

In [4]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


The "Unnamed: 0" column carries no useful information, so we drop it.

In [5]:
df = df.drop(['Unnamed: 0'], axis=1)

In [6]:
# load the dictionary
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# load a simple lemmatizer
wnl = WordNetLemmatizer()

In [8]:
# Define a function for row-by-row processing (regex cleaning, tokenization and lemmatization)

def lemmatize(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text)  # replace every non-letter character with a space
    tokens = SpaceTokenizer().tokenize(text)  # tokenize by splitting on spaces
    lemmas = [wnl.lemmatize(token) for token in tokens]  # lemmatize each token

    return " ".join(lemmas)  # join the lemmas back into one string
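
The cleaning step replaces every non-letter with a space, so punctuation next to a space leaves two spaces in a row. Splitting on a single space (which, as far as I know, is what `SpaceTokenizer` does) then yields empty tokens. The sketch below uses only the standard library to show the effect; `clean_and_split` is a hypothetical helper, not part of the notebook, and it drops the empty tokens by calling `split()` without an argument:

```python
import re

def clean_and_split(text):
    """Hypothetical helper: keep Latin letters only, then split on any run of whitespace."""
    cleaned = re.sub(r'[^a-zA-Z ]', ' ', text)
    return cleaned.split()  # split() with no argument drops empty tokens

sample = "D'aww! He matches this"
single_space_split = re.sub(r'[^a-zA-Z ]', ' ', sample).split(' ')
print(single_space_split)       # contains an empty string where "! " used to be
print(clean_and_split(sample))  # no empty tokens
```

The empty tokens are harmless for TF-IDF, since the vectorizer tokenizes the joined string again, but dropping them keeps the lemmatized column tidier.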


In [9]:
# apply the function to the "text" column and store the result in "lemm_text"
df['lemm_text'] = df['text'].apply(lemmatize)

In [10]:
df.head(5)

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m see...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestion on imp...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember...


In [11]:
# collect corpus of words from lemmatized column
corpus = df['lemm_text'].values

In [15]:
# Features are collected in the corpus, targets in the "toxic" column

#v2
features = corpus
target = df['toxic']

In [16]:
# divided into training and test samples, 4:1 ratio

#v2
corpus_train, corpus_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=12345) 

In [17]:
# Build a TF-IDF word importance score matrix

# v2
count_tf_idf = TfidfVectorizer() 

# Vectorizer is trained on a training sample
count_tf_idf.fit(corpus_train) 

# Vectorize the training and test samples
tf_idf_train = count_tf_idf.transform(corpus_train) 

tf_idf_test = count_tf_idf.transform(corpus_test) 

print("Shape of the tf_idf_train matrix:", tf_idf_train.shape)
print("Shape of the tf_idf_test matrix:", tf_idf_test.shape)

Shape of the tf_idf_train matrix: (127433, 144134)
Shape of the tf_idf_test matrix: (31859, 144134)
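
To make explicit what these matrices hold: each cell is a term count multiplied by an inverse document frequency learned from the training documents only, and each row is then scaled to unit length. The pure-Python sketch below follows the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default. The toy corpus and the helper names `fit_idf` and `transform` are assumptions for illustration, and plain whitespace splitting stands in for the vectorizer's own tokenization and lowercasing:

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """TF-IDF vector (as a dict) for one document; terms unseen in training are ignored."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

# Toy corpus, for illustration only
train = ["you are my hero", "you are wrong", "edit war again"]
idf = fit_idf(train)
vec = transform("you are a hero", idf)
print(sorted(vec))
```

The word "a" never appeared in the toy training corpus, so it gets no column. This is the same reason the test matrix above has exactly as many columns as the training matrix.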


In [18]:
# we save matrices of vectorized words as features
features_train = tf_idf_train
features_test = tf_idf_test

In [19]:
# create dictionaries to store the models, their scores and their hyperparameter grids
models = {}
models_scores = {}
param_search = {}

In [20]:
# the models
models[0] = LogisticRegression(class_weight='balanced', random_state=12345) 
models[1] = DecisionTreeClassifier(class_weight='balanced', random_state=12345)
models[2] = RandomForestClassifier(class_weight='balanced', random_state=12345)
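
All three models use `class_weight='balanced'`, which weights each class inversely to its frequency as `n_samples / (n_classes * class_count)`. A minimal sketch of that heuristic; the 90/10 label split is a made-up example rather than the actual class share in the dataset, and `balanced_class_weights` is a hypothetical helper:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 90 non-toxic, 10 toxic
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights an error on the rare class costs the model more than an error on the frequent one, which usually raises recall for the toxic class at some cost in precision.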

In [21]:
#  hyperparameter sets for selection
param_search[0] = {'solver' : ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']}
param_search[1] = {'max_depth' : [25, 50]}
param_search[2] = {'n_estimators' : [5, 15], 'max_depth' : [25, 50]}
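
Before running the search it helps to estimate its cost. A full grid search fits every hyperparameter combination once per fold and, with the default `refit=True`, once more on the whole training sample. A rough sketch of that arithmetic, assuming 10 folds as in the search below; `count_fits` is a hypothetical helper:

```python
from itertools import product

def count_fits(param_grid, cv):
    """Fits performed by a full grid search: combinations x folds, plus one final refit."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv + 1

grids = {
    'logistic regression': {'solver': ['lbfgs', 'liblinear', 'newton-cg',
                                       'newton-cholesky', 'sag', 'saga']},
    'decision tree': {'max_depth': [25, 50]},
    'random forest': {'n_estimators': [5, 15], 'max_depth': [25, 50]},
}
for name, grid in grids.items():
    print(name, count_fits(grid, cv=10))
```

On a matrix with over a hundred thousand rows and columns each fit is expensive, so the fold count and grid size largely determine the run time.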

In [22]:
# we loop through all models and determine the best model using GridSearchCV
for i in range(len(models)):
    gsearch = GridSearchCV(estimator=models[i], 
                           cv=10,
                           param_grid=param_search[i],
                           scoring='f1')
    model_grid = gsearch.fit(features_train, target_train)
    models_scores[i] =   abs(model_grid.best_score_)
    print('Model: ', models[i])
    print('Best hyperparameters: '+str(model_grid.best_params_))
    print(f'F1: {models_scores[i]:.2f}')
    print()


Model:  LogisticRegression(class_weight='balanced', random_state=12345)
Best hyperparameters: {'solver': 'saga'}
F1: 0.75

Model:  DecisionTreeClassifier(class_weight='balanced', random_state=12345)
Best hyperparameters: {'max_depth': 50}
F1: 0.56

Model:  RandomForestClassifier(class_weight='balanced', random_state=12345)
Best hyperparameters: {'max_depth': 50, 'n_estimators': 15}
F1: 0.42
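
One caveat about these cross-validation scores: the vectorizer was fitted on the whole training sample before the search, so every validation fold contributed to the IDF weights. A possible alternative, shown only as a sketch on a toy corpus, wraps the vectorizer and the classifier in a `Pipeline` so that the vectorizer is refitted inside each fold. The texts, labels, two-solver grid, `cv=2` and the step names `tfidf` and `clf` are all assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only; 1 marks a toxic comment
texts = ["you are an idiot", "thanks for the edit", "stupid useless page",
         "nice work on the article", "shut up idiot", "please add a source",
         "what a stupid edit", "great explanation thanks"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', random_state=12345)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipeline,
                      param_grid={'clf__solver': ['lbfgs', 'liblinear']},
                      cv=2, scoring='f1')
search.fit(texts, labels)
print(search.best_params_)
```

The search takes raw text as input, and `search.best_estimator_` is a fitted pipeline that can be applied to raw test text directly.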



In [23]:
# display the best model and hyperparameter information on the screen
print('Best model:',models[max(models_scores, key=models_scores.get)])
print(f'F1 metric = {models_scores[max(models_scores, key=models_scores.get)]:.2}' )


Best model: LogisticRegression(class_weight='balanced', random_state=12345)
F1 metric = 0.75


We now apply the best model to the test sample and compute the F1 metric.

In [25]:
model = LogisticRegression(class_weight='balanced', solver='saga', random_state=12345)
               
model.fit(features_train, target_train)

predict = model.predict(features_test)

score = f1_score(target_test, predict)

print('Selected model: ', model)
print(f'Final F1 metric on the test sample: {score:.2}')

Selected model:  LogisticRegression(class_weight='balanced', random_state=12345, solver='saga')
Final F1 metric on the test sample: 0.75


##### Conclusions:
1. The original data was preprocessed: the uninformative "Unnamed: 0" column was removed.
2. The original comments were cleaned with a regular expression, tokenized and lemmatized; a function for row-by-row processing was defined for this.
3. The lemmatized text was added to the original DataFrame, and a corpus was collected from it.
4. Word importance was scored with TF-IDF; the resulting matrix serves as the feature matrix for model training.
5. The target feature is the "toxic" column.
6. Four classifiers were declared:
<br>6.1. Logistic regression
<br>6.2. Logistic regression with the solver found by GridSearchCV
<br>6.3. Decision tree
<br>6.4. Random forest
7. The original sample was split into training and test samples in an 80:20 ratio.
8. GridSearchCV with 10-fold cross-validation searched for the best hyperparameters, comparing candidates by the F1 metric.
9. The best cross-validation score was achieved by logistic regression with the saga solver, F1 = 0.75.
10. The best model was applied to the test sample; the final metric is F1 = 0.75.

Unfortunately, we were unable to apply BERT in this project: local resources were insufficient, the pymystem3 library did not work locally, and the kernel crashed in the simulator when lemmatizing the 150K rows of the task. We therefore used the simple tokenization and lemmatization tools from the nltk library.


