# Project description

Project for the «Викишоп» marketplace

The online marketplace «Викишоп» is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by other users. The store needs a tool that detects toxic comments and sends them for moderation.

**Project execution plan**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.


**Data description**

The data is stored in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

# Data download

In [1]:
import re

import lightgbm as lgb
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# WordNet data is required by WordNetLemmatizer.
# nltk.word_tokenize (used below) additionally needs NLTK's Punkt tokenizer data.
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Рус\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv(r'C:\Users\Рус\Desktop\python\Новая папка\toxic_comments.csv')

# Preprocessing

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159571 non-null  int64 
 1   text        159571 non-null  object
 2   toxic       159571 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.7+ MB


There are no missing values, and the column dtypes are as expected.

In [4]:
df['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The classes are strongly imbalanced: only about 10% of the comments (16,225 of 159,571) are labeled toxic.
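To put numbers on the imbalance, the sketch below computes the share of toxic comments and the per-class weights that scikit-learn's `class_weight='balanced'` heuristic would assign, `n_samples / (n_classes * class_count)`. The counts are copied from the output above; the notebook itself does not apply class weights, so this is only an illustration of one possible remedy.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 143346, 1: 16225}

n_samples = sum(counts.values())
n_classes = len(counts)

# Fraction of comments labeled toxic.
toxic_share = counts[1] / n_samples

# "balanced" heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
print(balanced_weights)
```

The minority class receives a weight roughly nine times larger than the majority class, which is what a model such as `LogisticRegression(class_weight='balanced')` would use internally.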

I will write a function that removes non-letter characters and lemmatizes the text.

In [5]:
lemmatizer = WordNetLemmatizer()

def lemmatize(sentence):
    # Replace every non-letter character with a space.
    text = re.sub(r'[^a-zA-Z]', ' ', sentence)
    word_list = nltk.word_tokenize(text)
    # Without a POS tag, WordNetLemmatizer treats each word as a noun.
    return " ".join(lemmatizer.lemmatize(w) for w in word_list)

df['text_final'] = df['text'].apply(lemmatize)

print(df['text_final'][0])

Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren t vandalism just closure on some GAs after I voted at New York Dolls FAC And please don t remove the template from the talk page since I m retired now
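The output above keeps the original letter case and splits contractions into stray tokens ("weren t", "don t"). As a hedged alternative, the sketch below lowercases the text and keeps apostrophes that sit between two letters. The function name `clean_text` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

def clean_text(sentence: str) -> str:
    """Lowercase, keep letters and in-word apostrophes, collapse whitespace."""
    text = sentence.lower()
    # Replace everything except letters and apostrophes with a space.
    text = re.sub(r"[^a-z']", " ", text)
    # Drop apostrophes that are not between two letters (e.g. quote marks).
    text = re.sub(r"(?<![a-z])'|'(?![a-z])", " ", text)
    return " ".join(text.split())

cleaned = clean_text("They weren't vandalism, just 'closure' on some GAs!")
print(cleaned)
```

Note that scikit-learn's default token pattern splits on apostrophes anyway, so this kind of cleanup only changes the features when a custom tokenizer is used.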


Splitting the data into training and test sets (80/20).

In [6]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=12345)
df_train_target = df_train['toxic']
df_test_target = df_test['toxic']
df_train_features = df_train.drop(['toxic'], axis = 1)
df_test_features = df_test.drop(['toxic'], axis = 1)
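Because the classes are imbalanced, a stratified split may be worth considering so that both sets keep the same share of toxic comments. The sketch below is an assumption-laden illustration on made-up labels with a 10% positive share; the notebook's own split above is not stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 10% positive share, similar to the dataset (assumption).
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

print(y_train.mean(), y_test.mean())
```

In the notebook this would correspond to passing `stratify=df['toxic']` to `train_test_split`.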

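Converting the text to Unicode and loading stop words. Note that the corpora are built from the original `text` column, so the lemmatized `text_final` column is not used by the models below.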

In [7]:
corpus_train = df_train_features['text'].values.astype('U')
corpus_test = df_test_features['text'].values.astype('U')
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Рус\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


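Vectorizing with TfidfVectorizer. Two variants are fitted: first with English stop words removed, then without stop-word removal. Because the second cell reuses the same variable names, the TF-IDF models below are trained on the variant without stop-word removal.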

In [8]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_test = count_tf_idf.transform(corpus_test)

In [9]:
count_tf_idf = TfidfVectorizer()
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_test = count_tf_idf.transform(corpus_test)
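To make the effect of the `stop_words` argument concrete, here is a minimal sketch on a made-up three-sentence corpus. It uses scikit-learn's built-in `"english"` list instead of the NLTK list loaded above, purely to stay self-contained; that substitution and the toy sentences are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, only for illustration.
toy_corpus = [
    "you are a great editor",
    "you are a terrible editor",
    "great article",
]

# Same vectorizer, with and without stop-word removal.
with_stop = TfidfVectorizer(stop_words="english").fit(toy_corpus)
without_stop = TfidfVectorizer().fit(toy_corpus)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

The vocabulary fitted without stop-word removal is larger because it keeps function words such as "you" and "are".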

Let's try another vectorization method, CountVectorizer (raw token counts), this time with stop words removed.

In [10]:
vectorizer = CountVectorizer(stop_words=stopwords, dtype=np.float32)
CV_train = vectorizer.fit_transform(corpus_train)
CV_test = vectorizer.transform(corpus_test)

# Model training

In [11]:
models = [
    LogisticRegression(max_iter=1000),
    lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.1),
    KNeighborsClassifier(n_neighbors=100),
]

# Models trained on TF-IDF features.
for model in models:
    # An empty parameter grid means GridSearchCV only runs 5-fold CV and refits the model.
    clf_gs = GridSearchCV(model, {}, cv=5, scoring='f1')
    clf_gs.fit(tf_idf_train, df_train_target)

    train_f1_score = f1_score(df_train_target, clf_gs.predict(tf_idf_train))
    test_f1_score = f1_score(df_test_target, clf_gs.predict(tf_idf_test))
    print(model, train_f1_score)  # F1 on the training set
    print(model, test_f1_score)   # F1 on the test set

LogisticRegression(max_iter=1000) 0.7702411021814006
LogisticRegression(max_iter=1000) 0.7376835843093511
LGBMClassifier(n_estimators=1000) 0.971304347826087
LGBMClassifier(n_estimators=1000) 0.7829684537148768
KNeighborsClassifier(n_neighbors=100) 0.5783586103043993
KNeighborsClassifier(n_neighbors=100) 0.5835304235648248
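The loop above passes an empty parameter grid, and the vectorizer is fitted on the whole training set before cross-validation, so each validation fold has already influenced the IDF statistics. One possible alternative, sketched here on a tiny made-up corpus, is to wrap the vectorizer and the classifier in a `Pipeline` and search over actual hyperparameters. The corpus, labels and grid values are assumptions chosen only to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (1 = toxic, 0 = not toxic), repeated to allow 5-fold CV.
texts = [
    "you are an idiot", "thanks for the edit",
    "stupid useless article", "nice work on the page",
    "shut up idiot", "please add a source",
    "what a stupid edit", "great article thanks",
] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

# The vectorizer is refitted inside every CV fold, together with the model.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; the values are not tuned for the real dataset.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

With this layout the raw text corpus is passed to `fit` directly, and `search.predict` applies the fitted vectorizer automatically.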


In [12]:
models = [
    LogisticRegression(max_iter=1000),
    lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.1),
    KNeighborsClassifier(n_neighbors=10),
]

# Models trained on CountVectorizer features.
for model in models:
    # An empty parameter grid means GridSearchCV only runs 5-fold CV and refits the model.
    clf_gs = GridSearchCV(model, {}, cv=5, scoring='f1')
    clf_gs.fit(CV_train, df_train_target)

    train_f1_score = f1_score(df_train_target, clf_gs.predict(CV_train))
    test_f1_score = f1_score(df_test_target, clf_gs.predict(CV_test))
    print(model, train_f1_score)  # F1 on the training set
    print(model, test_f1_score)   # F1 on the test set

LogisticRegression(max_iter=1000) 0.9019363934154331
LogisticRegression(max_iter=1000) 0.7609739368998628
LGBMClassifier(n_estimators=1000) 0.9143891998682911
LGBMClassifier(n_estimators=1000) 0.7760013752793535
KNeighborsClassifier(n_neighbors=10) 0.411332127787824
KNeighborsClassifier(n_neighbors=10) 0.37106299212598426


# Conclusions

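Three models (LogisticRegression, LGBMClassifier, KNeighborsClassifier) and two vectorization methods (CountVectorizer and TfidfVectorizer) were tested. With the parameters considered, the best F1 score on the test set, about 0.783, was achieved by LGBMClassifier on TF-IDF features. This model also shows the largest gap between training and test F1 (0.971 vs. 0.783), which indicates overfitting. KNeighborsClassifier was run with different `n_neighbors` values for the two vectorizers (100 and 10), so its two results are not directly comparable.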
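For a side-by-side comparison, the scores printed by the two training cells can be collected and ranked by test F1. The sketch below copies those values, rounded to four decimals, and also reports the train/test gap as a rough overfitting indicator; the dictionary layout is just one convenient assumption.

```python
# F1 scores copied from the cell outputs above, rounded: (train, test).
results = {
    ("TfidfVectorizer", "LogisticRegression"): (0.7702, 0.7377),
    ("TfidfVectorizer", "LGBMClassifier"): (0.9713, 0.7830),
    ("TfidfVectorizer", "KNeighborsClassifier(100)"): (0.5784, 0.5835),
    ("CountVectorizer", "LogisticRegression"): (0.9019, 0.7610),
    ("CountVectorizer", "LGBMClassifier"): (0.9144, 0.7760),
    ("CountVectorizer", "KNeighborsClassifier(10)"): (0.4113, 0.3711),
}

# Rank the combinations by test F1, best first.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for (vectorizer, model), (train_f1, test_f1) in ranked:
    gap = train_f1 - test_f1
    print(f"{vectorizer:16} {model:26} test={test_f1:.4f} gap={gap:+.4f}")

best = ranked[0][0]
```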