# Final Project by Abay Jumabayev

## Topic: Toxicity detection
## Description: 
I want to predict toxicity in the comments. I found a [data](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data) with human-labeled information about toxicity. The data contains text from Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
* toxic
* severe_toxic
* obscene
* threat
* insult
* identity_hate
I am going to create a model that will predict whether the message is toxic (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion).

## Steps:
1. I am going to do a preprocessing.

2. I will use three types of word embeddings:
* TF-IDF
* Word2Vec
* GloVe

3. For each of word embeddings, I will train six models:
* Linear regression
* Logistic regression
* Naive Bayes
* Decision Tree
* KNN
* SVC

Overall, I will have $3*6=18$ different models and choose one which has the highest metric score. My metric score will be precision $ = \frac{tp}{tp + fp}$. I want to decrease type II error, that is to decrease false positives. I don't want model to make errors and predict non-toxic comment as toxic.

4. I will apply trained model to Dota 2 in-game chats. 
Dota 2 community is assumed toxic and I want to test whether this is true. I assume here that Dota 2 chat messages are similar to the Wikipedia comments as they are written online, seen by people that do not know each other.

# 1. Preprocessing
Let's start with preprocessing. First thing is to load the data and look at it.

In [6]:
import os
import re
import nltk
import string
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, LancasterStemmer 

from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

import warnings
warnings.filterwarnings("ignore")

In [1]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


I don't need *id*, and other types of toxicity except *toxic* and *severe_toxic*. Let's get rid of non-necessary columns and look how balanced our data is:

In [3]:
# drop columns
df = df.drop(['id', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1)

# look how balanced the data is
df.groupby(['toxic', 'severe_toxic']).size()

toxic  severe_toxic
0      0               144277
1      0                13699
       1                 1595
dtype: int64

Out of 159,571 comments, 15,294 (about 10% of the data) are toxic and 1595 (about 1% of the data) are severe toxic messages.

The data is very unbalanced, that is why I will need to stratify data when splitting data

Now, I am switching to text preprocessing. I will define a function that takes text as input an does the following:
* replaces all non-alphanumeric characters with spaces
* replaces 3+ consecutive letters to one. e.g. looooove -> love
* removes all emails
* removes all urls
* replaces all letters to lower case
* removes stopwords (nltk stopwords)
* removes words with digits
* conducts a lemmatization

In [7]:
# load the stopword list provided by the NLTK library
stop_words = stopwords.words('english')

In [8]:
def preprocessing_text(text):
    text = re.sub(r'[^A-Za-z0-9 ]+', ' ', text) #remove all non‐alphanumeric characters except white space
    text = re.sub("(.)\\1{2,}", "\\1", text) #replace 3+ consecutive letters to 1 (loooovvvve -> love)
    text = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", text) # remove emails
    text = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , text) # remove urls
    words = word_tokenize(text.lower())
    tokens = [word for word in words if word not in stop_words]
    tokens = [token for token in tokens if not any(c.isdigit() for c in token)] #remove everything containing digits
    lemmatizer = WordNetLemmatizer()
    tokens_lematized = [lemmatizer.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens_lematized)
    return preprocessed_text

In [19]:
%%time
df['comment_text'] = df['comment_text'].apply(lambda x: preprocessing_text(x))

Wall time: 1min 32s


Now as the preprocessing done, we can save our data to pickle so we can easily load it later

# 2.1 TF-IDF

For TF-IDF I think that *max_df* should not be set to any value because the comments are not related to each other and we will be analyzing each of them on being toxic.

*min_df* is also should be set to 1, because as we saw, there are only 1% of severe toxic comments. By setting min df even to 1% we can get rid of important features that define toxicity.  

In [11]:
vectorizer = TfidfVectorizer() 
tfidf = vectorizer.fit_transform(df.comment_text)

NameError: name 'df' is not defined

## 2.1.1. Linear regression

In [191]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, df['toxic'], test_size=0.2, random_state=1, stratify = df['toxic'])

In [42]:
linear_regressor = LinearRegression()

In [43]:
linear_regressor.fit(X_train, y_train)

LinearRegression()

In [44]:
predictions = linear_regressor.predict(X_test)

In [51]:
# convert continuous values to binary
predictions[predictions >= 0.5] = 1
predictions[predictions < 0.5] = 0

In [52]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.95      0.94      0.95     28856
           1       0.52      0.58      0.55      3059

    accuracy                           0.91     31915
   macro avg       0.74      0.76      0.75     31915
weighted avg       0.91      0.91      0.91     31915



In [54]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

[[27258  1598]
 [ 1294  1765]]


We can see that accuracy is high, but that's because most of the data has the class 0.

The metrics of interest, precision is 0.52 which is not that good. Let's try changing the threshold by which we assign the class (used 0.5).

In [62]:
linear_regressor.fit(X_train, y_train)
predictions = linear_regressor.predict(X_test)

# convert continuous values to binary
predictions[predictions >= 0.6] = 1
predictions[predictions < 0.6] = 0

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95     28856
           1       0.57      0.48      0.52      3059

    accuracy                           0.91     31915
   macro avg       0.76      0.72      0.74     31915
weighted avg       0.91      0.91      0.91     31915



The results are not good either, although we increased the precision from 0.52 to 0.57.

Let's see how other models will perform.

## 2.1.2. Logistic regression


In [56]:
logistic_regressor = LogisticRegression(random_state=0)

logistic_regressor.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [59]:
predictions = logistic_regressor.predict(X_test)

In [60]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     28856
           1       0.93      0.61      0.73      3059

    accuracy                           0.96     31915
   macro avg       0.94      0.80      0.86     31915
weighted avg       0.96      0.96      0.95     31915



In [61]:
print(confusion_matrix(y_test, predictions))

[[28712   144]
 [ 1205  1854]]


Logistic regression with l2 penalty is doing a better job in classification than linear regression.

Precision is now 0.93 and accuracy is 0.96 compared to 0.57 and 0.91 given by linear model.

## 2.1.3 Naive Bayes

In [63]:
NB = BernoulliNB()

NB.fit(X_train, y_train)
predictions = NB.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97     28856
           1       0.75      0.61      0.67      3059

    accuracy                           0.94     31915
   macro avg       0.86      0.79      0.82     31915
weighted avg       0.94      0.94      0.94     31915



In [64]:
print(confusion_matrix(y_test, predictions))

[[28252   604]
 [ 1206  1853]]


I used Bernoulli NB because this is the one which predicts boolean values.

The precision is 0.75 which is worse than the one logistic regression had. Accuracy also dropped down a little bit.

## 2.1.4 Decision Tree


In [65]:
tree = DecisionTreeClassifier()

In [73]:
param_grid = {'max_depth': np.arange(1,10),
             'criterion': ['gini', 'entropy']}

In [74]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(tree, param_grid=param_grid, cv=5)

In [75]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})

In [76]:
grid.best_params_

{'criterion': 'gini', 'max_depth': 9}

Grid search found that gini criterion and a depth of 9 layers are the best parameters.

In [77]:
predictions = grid.predict(X_test)

In [78]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     28856
           1       0.88      0.44      0.59      3059

    accuracy                           0.94     31915
   macro avg       0.91      0.72      0.78     31915
weighted avg       0.94      0.94      0.93     31915



In [79]:
print(confusion_matrix(y_test, predictions))

[[28680   176]
 [ 1705  1354]]


Precision is quite high, however recall is too low. The model classifies many of toxic messages as non-toxic.

Accuracy is 0.94. Still Logistic model is the best among the four.

## 2.1.5 KNN

In [80]:
knn = KNeighborsClassifier()

In [84]:
param_grid = {'n_neighbors': [1,5,10,15,20,25,30]}

I decreased the size of param grid from 30 to 7 (by using this list [1,5,10,15,20,25,30]) because it takes a lot of time to fit the model.

In [85]:
grid = GridSearchCV(knn, param_grid=param_grid, cv=5)

In [86]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 5, 10, 15, 20, 25, 30]})

In [87]:
grid.best_params_

{'n_neighbors': 1}

In [88]:
predictions = grid.predict(X_test)

In [89]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96     28856
           1       0.73      0.27      0.39      3059

    accuracy                           0.92     31915
   macro avg       0.83      0.63      0.67     31915
weighted avg       0.91      0.92      0.90     31915



In [90]:
print(confusion_matrix(y_test, predictions))

[[28561   295]
 [ 2242   817]]


KNN with 1 neighbor provides not good results. recall is too low.

## 2.1.6 SVM

In [192]:
svc = LinearSVC()

svc.fit(X_train, y_train)
predictions = svc.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     28856
           1       0.88      0.69      0.77      3059

    accuracy                           0.96     31915
   macro avg       0.92      0.84      0.87     31915
weighted avg       0.96      0.96      0.96     31915



In [193]:
print(confusion_matrix(y_test, predictions))

[[28556   300]
 [  952  2107]]


SVC provides the highest accuarcy as well as logistic regression, but precision is lower (0.88).

Overall, using TF-IDF, the best performance was computed with Logistic model (2.1.2), which has precision of 0.93 and accuracy of 0.96.

Let's move to Word2Vec now and see whether this type of word embedding is better.

# 2.2 Word2Vec

I am going to use SpaCy to get a pretrained model. This pretrained model will allow me to convert string to a vector of length 300.

In [91]:
# load the model
import spacy
nlp = spacy.load('en_core_web_lg')

In [92]:
# function that takes a text as input and returns a vector
def get_vec(x):
    doc = nlp(x)
    vec = doc.vector
    return vec

In [93]:
%%time
df['vec'] = df['comment_text'].apply(lambda x: get_vec(x))
df.head()

Wall time: 21min 24s


Unnamed: 0,comment_text,toxic,severe_toxic,vec
0,explanation edits made username hardcore metal...,0,0,"[0.033511575, 0.017874392, -0.12182923, 0.0465..."
1,aww match background colour seemingly stuck th...,0,0,"[-0.042986494, 0.22736022, -0.048771024, 0.053..."
2,hey man really trying edit war guy constantly ...,0,0,"[-0.20488231, 0.004869846, -0.23974332, 0.0255..."
3,make real suggestion improvement wondered sect...,0,0,"[-0.03875695, 0.10936845, -0.21767448, 0.00962..."
4,sir hero chance remember page,0,0,"[-0.087694, 0.12292659, -0.1436986, 0.08244420..."


Here we see how the *vec* column looks like. Each row in *vec* has a length of 300.

In [94]:
X = df['vec'].to_numpy()
X = X.reshape(-1, 1)
X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, 300)
X.shape

(159571, 300)

In [95]:
y = df['toxic']

In [194]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)

## 2.2.1. Linear regression


In [97]:
linear_regressor = LinearRegression()

In [98]:
%%time
linear_regressor.fit(X_train, y_train)

LinearRegression()

In [99]:
predictions = linear_regressor.predict(X_test)

In [100]:
# convert continuous values to binary
predictions[predictions >= 0.5] = 1
predictions[predictions < 0.5] = 0

In [101]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     28856
           1       0.92      0.39      0.55      3059

    accuracy                           0.94     31915
   macro avg       0.93      0.69      0.76     31915
weighted avg       0.94      0.94      0.93     31915



In [102]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

[[28755   101]
 [ 1874  1185]]


Pecision and accuracy are quite high, but still worse that the model 2.1.2 Logistic regression.

## 2.2.2. Logistic regression


In [103]:
logistic_regressor = LogisticRegression(random_state=0)
%%time
logistic_regressor.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [104]:
predictions = logistic_regressor.predict(X_test)

In [105]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97     28856
           1       0.84      0.60      0.70      3059

    accuracy                           0.95     31915
   macro avg       0.90      0.79      0.84     31915
weighted avg       0.95      0.95      0.95     31915



In [106]:
print(confusion_matrix(y_test, predictions))

[[28511   345]
 [ 1220  1839]]


With word2vec precision decreases comparing to the logistic model 2.1.2. 

## 2.2.3 Naive Bayes


In [107]:
NB = BernoulliNB()
%%time
NB.fit(X_train, y_train)
predictions = NB.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.97      0.91      0.94     28856
           1       0.47      0.78      0.59      3059

    accuracy                           0.90     31915
   macro avg       0.72      0.84      0.76     31915
weighted avg       0.93      0.90      0.91     31915



In [108]:
print(confusion_matrix(y_test, predictions))

[[26199  2657]
 [  673  2386]]


The precision is very low. NB did a very bad job in predicting toxicity.

## 2.2.4 Decision Tree


In [109]:
tree = DecisionTreeClassifier()

In [110]:
param_grid = {'max_depth': np.arange(1,10),
             'criterion': ['gini', 'entropy']}

In [111]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(tree, param_grid=param_grid, cv=5)

In [112]:
%%time
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})

In [113]:
grid.best_params_

{'criterion': 'entropy', 'max_depth': 8}

Grid search found that entropy criterion and a depth of 8 layers are the best parameters.

In [114]:
predictions = grid.predict(X_test)

In [115]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     28856
           1       0.82      0.45      0.58      3059

    accuracy                           0.94     31915
   macro avg       0.88      0.72      0.77     31915
weighted avg       0.93      0.94      0.93     31915



In [116]:
print(confusion_matrix(y_test, predictions))

[[28541   315]
 [ 1670  1389]]


This model does not beat the model 2.1.2 as well. Although, the performance is not bad, precision is 0.82 and accuracy is 0.94

## 2.2.5 KNN

In [117]:
knn = KNeighborsClassifier()

In [118]:
param_grid = {'n_neighbors': [1,5,10,15,20,25,30]}

In [119]:
grid = GridSearchCV(knn, param_grid=param_grid, cv=5)

In [121]:
%%time
grid.fit(X_train, y_train)

Wall time: 44min 25s


GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 5, 10, 15, 20, 25, 30]})

In [122]:
grid.best_params_

{'n_neighbors': 20}

In [123]:
predictions = grid.predict(X_test)

In [124]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     28856
           1       0.84      0.54      0.66      3059

    accuracy                           0.95     31915
   macro avg       0.90      0.77      0.82     31915
weighted avg       0.94      0.95      0.94     31915



In [125]:
print(confusion_matrix(y_test, predictions))

[[28545   311]
 [ 1398  1661]]


KNN with 20 neighbors performed relatively good. Precision is not as good as the model 2.1.2 has.

## 2.2.6 SVM

In [195]:
from sklearn.svm import LinearSVC
svc = LinearSVC()

svc.fit(X_train, y_train)
predictions = svc.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     28856
           1       0.80      0.40      0.53      3059

    accuracy                           0.93     31915
   macro avg       0.87      0.69      0.75     31915
weighted avg       0.93      0.93      0.92     31915



In [196]:
print(confusion_matrix(y_test, predictions))

[[28553   303]
 [ 1838  1221]]


SVC does not provide the best results. Precision is low (0.8)

Overall, for Word2Vec the linear regression (2.2.1) had the highest precision of 0.92.

Nevertheless, TF-IDF did a better job classifying toxic comments.

Let's see how GloVe will help models perform 

# 2.3 GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Read more [here](https://nlp.stanford.edu/projects/glove/))

First, we need to load a GloVe vector. I will use the file `glove.6B.100d.txt`, which contains pre-trained word vectors from Wikipedia. You can download it [here](https://nlp.stanford.edu/data/glove.6B.zip)


In [126]:
glove_vectors = dict()

file = open('../../Inputs/glove/glove.6B.100d.txt', encoding='utf-8') # make sure you have the file and that the location is correct

for line in file:
    values = line.split()
    word  = values[0]
    vectors = np.asarray(values[1:])
    glove_vectors[word] = vectors
    
file.close()

Now we need a function that will transform a string to a 100 length vector.
Then we will 

In [155]:
vec_shape = 100

def get_vec(x):
    arr = np.zeros(vec_shape)
    text = str(x).split() 
    if len(text)==0:
        return arr
    else:      
        for t in text:
            try:
                vec = glove_vectors.get(t).astype(float)
                arr = arr + vec
            except:
                pass      
        arr = arr.reshape(1, -1)[0]

        return arr/len(text)

In [156]:
%%time
df['glo_vec'] = df['comment_text'].apply(lambda x: get_vec(x))

Wall time: 4min 25s


In [157]:
df.head()

Unnamed: 0,comment_text,toxic,severe_toxic,vec,glo_vec
0,explanation edits made username hardcore metal...,0,0,"[0.033511575, 0.017874392, -0.12182923, 0.0465...","[-0.06319991304347827, 0.09004426086956521, 0...."
1,aww match background colour seemingly stuck th...,0,0,"[-0.042986494, 0.22736022, -0.048771024, 0.053...","[-0.16120359999999997, -0.016441500000000008, ..."
2,hey man really trying edit war guy constantly ...,0,0,"[-0.20488231, 0.004869846, -0.23974332, 0.0255...","[-0.2687100476190477, 0.1700534895238095, 0.33..."
3,make real suggestion improvement wondered sect...,0,0,"[-0.03875695, 0.10936845, -0.21767448, 0.00962...","[-0.18971845384615385, 0.20216402692307683, 0...."
4,sir hero chance remember page,0,0,"[-0.087694, 0.12292659, -0.1436986, 0.08244420...","[-0.28627062, 0.387602, 0.24465479999999995, -..."


In [158]:
X = df['glo_vec']
y = df['toxic']
X = np.concatenate(X, axis = 0).reshape(-1, vec_shape)

In [159]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

## 2.3.1 Linear regression


In [160]:
linear_regressor = LinearRegression()

In [161]:
linear_regressor.fit(X_train, y_train)

LinearRegression()

In [162]:
predictions = linear_regressor.predict(X_test)

In [163]:
# convert continuous values to binary
predictions[predictions >= 0.5] = 1
predictions[predictions < 0.5] = 0

In [164]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96     28856
           1       0.87      0.29      0.43      3059

    accuracy                           0.93     31915
   macro avg       0.90      0.64      0.70     31915
weighted avg       0.92      0.93      0.91     31915



In [165]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

[[28724   132]
 [ 2176   883]]


The model is bad because it has very low recall. Although the accuracy is 0.93.

## 2.3.2 Logistic regression


In [166]:
logistic_regressor = LogisticRegression(random_state=0)

logistic_regressor.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [167]:
predictions = logistic_regressor.predict(X_test)

In [168]:
print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.94      0.98      0.96     28856
           1       0.76      0.45      0.57      3059

    accuracy                           0.93     31915
   macro avg       0.85      0.72      0.77     31915
weighted avg       0.93      0.93      0.93     31915



In [169]:
print(confusion_matrix(y_test, predictions))

[[28418   438]
 [ 1673  1386]]


Logistic regression performs much better compared to linear one. But the performance is still not good.

## 2.3.3 Naive Bayes


In [170]:
NB = BernoulliNB()

NB.fit(X_train, y_train)
predictions = NB.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.96      0.89      0.92     28856
           1       0.39      0.65      0.49      3059

    accuracy                           0.87     31915
   macro avg       0.67      0.77      0.70     31915
weighted avg       0.91      0.87      0.88     31915



In [171]:
print(confusion_matrix(y_test, predictions))

[[25675  3181]
 [ 1060  1999]]


I have got a very low precision with NB. Even accuracy is low (0.87)

## 2.3.4 Decision Tree


In [172]:
tree = DecisionTreeClassifier()

In [173]:
param_grid = {'max_depth': np.arange(1,10),
             'criterion': ['gini', 'entropy']}

In [174]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(tree, param_grid=param_grid, cv=5)

In [175]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})

In [176]:
grid.best_params_

{'criterion': 'entropy', 'max_depth': 7}

Grid search found that entropy criterion and a depth of 7 layers are the best parameters.

In [177]:
predictions = grid.predict(X_test)

In [178]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96     28856
           1       0.69      0.28      0.40      3059

    accuracy                           0.92     31915
   macro avg       0.81      0.63      0.68     31915
weighted avg       0.91      0.92      0.90     31915



In [179]:
print(confusion_matrix(y_test, predictions))

[[28479   377]
 [ 2201   858]]


Recall and precision are very low in a tree model (0.28 and 0.69 respectively)

## 2.3.5 KNN

In [180]:
knn = KNeighborsClassifier()

In [181]:
param_grid = {'n_neighbors': [1,5,10,15,20,25,30]}

In [182]:
grid = GridSearchCV(knn, param_grid=param_grid, cv=5)

In [183]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 5, 10, 15, 20, 25, 30]})

In [184]:
grid.best_params_

{'n_neighbors': 20}

In [185]:
predictions = grid.predict(X_test)

In [186]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96     28856
           1       0.76      0.47      0.58      3059

    accuracy                           0.94     31915
   macro avg       0.85      0.73      0.77     31915
weighted avg       0.93      0.94      0.93     31915



In [187]:
print(confusion_matrix(y_test, predictions))

[[28404   452]
 [ 1618  1441]]


In KNN, precision is low as well (0.76)

## 2.3.6 SVM

In [188]:
from sklearn.svm import LinearSVC
svc = LinearSVC()

svc.fit(X_train, y_train)
predictions = svc.predict(X_test)

print(classification_report(y_test,  predictions))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     28856
           1       0.80      0.41      0.54      3059

    accuracy                           0.93     31915
   macro avg       0.87      0.70      0.75     31915
weighted avg       0.93      0.93      0.92     31915



In [189]:
print(confusion_matrix(y_test, predictions))

[[28545   311]
 [ 1813  1246]]


SVC provides not the best results as well.

Overall, for GloVe, linear regression (2.3.1) has the best performance with 0.87 precision and 0.93 accuracy.

Out of all 18 models, TF-IDF logistic regression has the best performance with 0.93 precision and 0.96 accuracy.
Let's save the model and check couple of comments on whether they are toxic.

# 3. Best model

In [197]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, df['toxic'], test_size=0.2, random_state=1, stratify = df['toxic'])

In [198]:
logistic_regressor = LogisticRegression(random_state=0)

logistic_regressor.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [199]:
predictions = logistic_regressor.predict(X_test)

In [200]:
# save model
pickle.dump(logistic_regressor, open('log_model.pkl', 'wb'))

In [2]:
import pickle

In [3]:
# load model
model = pickle.load(open('log_model.pkl', 'rb'))

In [9]:
def get_pred(s):
    x = []
    x.append(s)
    x[0] = preprocessing_text(x[0])
    vec = vectorizer.transform(x)
    label = model.predict(vec)[0]
    return label

Function above returns 1 if the message is classified toxic and 0 otherwise. For input it takes string, which is then preprocessed and converted to vector.

Let's check some comments and see whether they are classified as toxic

In [10]:
get_pred("Love you")

NameError: name 'vectorizer' is not defined

In [239]:
get_pred("I hate you")

1

In [245]:
get_pred("Hate")

1

In [241]:
get_pred("Hate is normal, but we need to get better")

0

In [244]:
get_pred("I would suggest you to cut off internet and jump off the roof")

0

We get good predictions most of the time. However, the model sometimes ignores toxic messages and consider them non-toxic. This is because the recall is low. But we focused on precision, because we want the model to classify message as toxic only if it is really toxic.

Let's import chat messages from Dota2 and check how many messages are toxic there.

# 4. Dota 2 game chat analysis

In [254]:
chat = pd.read_csv('chat.csv')

In [255]:
chat.head()

Unnamed: 0.1,Unnamed: 0,match_id,comment_text,slot,time,unit,id
0,0,0,force it,6,-8,6k Slayer,1
1,1,0,space created,1,5,Monkey,2
2,2,0,hah,1,6,Monkey,3
3,3,0,ez 500,6,9,6k Slayer,4
4,4,0,mvp ulti,4,934,Kira,5


The important columns that I will use are match_id, comment_text, and slot.
* match_id is an ID of the match, there are 50k matches, I will use only first 5k matches just to check.
* comment_text is a chat message
* slot is a player slot. There are 10 players so slot column ranges from 0 to 9.

For each game, I am going to have 10 documents, all messages sent from each of a player.

Then, I am going to create a column toxic which will identify whether the player was toxic.

> By creating a column toxic I assume that the dota2 comments are similar to Wikipedia comments that the model were trained on

In [259]:
# drop columns
chat = chat.drop(['Unnamed: 0', 'time', 'unit', 'id'], axis=1)

Let's check the values of slot:

In [260]:
chat.groupby(['slot']).size()

slot
-9         4
 0    149898
 1    141096
 2    140451
 3    142879
 4    140744
 5    151895
 6    141741
 7    144117
 8    143298
 9    143365
dtype: int64

In [261]:
# remove slot -9 (assume those are mistakes)
chat = chat[chat.slot != -9]

Now let's reduce the df to only 5000 matches

In [264]:
chat = chat[chat.match_id < 5000]

Now we can transform the dataframe:

In [272]:
chat = chat.groupby(['match_id', 'slot'])['comment_text'].apply(' '.join).reset_index()

In [273]:
chat.head()

Unnamed: 0,match_id,slot,comment_text
0,0,0,fuck my ass ka bu tooooooooooooo 6k slayer haha
1,0,1,space created hah hah wtf TA? u srsly? why aly...
2,0,2,lol really ?
3,0,4,mvp ulti
4,0,6,force it ez 500 bye fate is cruel sad spec noe...


So, now we concatenated all the messages by each player in each game. We can assess whether player in a game was toxic

In [274]:
chat['toxic'] = chat['comment_text'].apply(lambda x: get_pred(x))

In [277]:
chat.groupby(['toxic']).size()

toxic
0    26738
1     3618
dtype: int64

What we now have is that 3618 out of 30,356 players (those are the ones who wrote at least something in the chat. Ideally there should be 50,000 players (5,000 games x 10 players in each)) were toxic. That is about 10% of a selected games.

Let's have a brief look on the toxic comments:
> IMPORTANT: data contains text that may be considered profane, vulgar, or offensive.

In [287]:
chat[chat['toxic']==0].head(20)

Unnamed: 0,match_id,slot,comment_text,toxic
1,0,1,space created hah hah wtf TA? u srsly? why aly...,0
2,0,2,lol really ?,0
3,0,4,mvp ulti,0
4,0,6,force it ez 500 bye fate is cruel sad spec noe...,0
5,0,7,wat that one i cant even run everyone kiting m...,0
7,1,1,our eshaker afk it will be abandoned he say sl...,0
8,1,2,HE WANT SLEEP HAHA ULTI ZEUS,0
9,1,3,lol gege,0
10,1,4,4 v 4 he lost hope,0
11,1,5,gg,0


Apparently, the comments contain profane words.

Overall, I think I have a good model which predicts whether the text is toxic. And the model is doing this fast. One implementation of a model could be in any online community with comments. Thus, whether the person writes a toxic message, my model could send a preventive notification like "The message might contain toxicity. You may need to reconsider the content of your message."