# **“Civil Comment” Case: A Study on the Class Imbalance Problem and Toxicity Detection**

This is the Google Colab notebook for the assignment of the Text and Media Analytics course.

Ezgi Günbatar - 8835780

Reading the toxicity dataset

In [1]:
import pandas as pd

# read data
df = pd.read_csv('jigsawToxicitySample.csv')
df.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


Importing the libraries used in the rest of the notebook.

In [2]:
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, classification_report, confusion_matrix

The target variable was converted into two labels: toxic (target greater than 0.5) and non-toxic (target of 0.5 or lower).

In [3]:
# select labels (<=0.5 is 'not_toxic', >0.5 is 'toxic')
labels = []
for index, row in df.iterrows():
  tox = row['target']
  if tox > 0.5:
    labels.append('toxic')
  else:
    labels.append('not_toxic')

print('toxic_count',labels.count('toxic'))
print('non_toxic_count',labels.count('not_toxic'))

# add these labels as a new column 'Labels' to the dataframe
df.insert(6,"Labels", labels, True)


toxic_count 657
non_toxic_count 14617
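
The labeling loop above can also be written as one vectorized expression. A minimal sketch, assuming a small made-up dataframe `toy` in place of `df` (the scores are illustrative and only the last one mirrors a value shown in the preview):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; only the 'target' column matters here.
toy = pd.DataFrame({"target": [0.0, 0.5, 0.51, 0.893617]})

# Same rule as the loop: strictly greater than 0.5 is 'toxic'.
toy["Labels"] = np.where(toy["target"] > 0.5, "toxic", "not_toxic")

label_counts = toy["Labels"].value_counts()
print(label_counts)
```

On the real data, the counts printed above correspond to a toxic share of 657 out of 15274 comments, roughly 4.3%, which is the class imbalance the rest of the notebook deals with.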


Before starting the analysis, the data was split into train (70%) and test (30%) sets.


In [13]:
# make train-test split
X = df['comment_text'].tolist()
labels = df['Labels'].tolist()

# Label encoding (converting the labels into numbers)
le = preprocessing.LabelEncoder()
le.fit(labels)
y = le.transform(labels)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
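
With so few toxic comments, a purely random split can leave the train and test sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10 of them in the positive class.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same 10% positive rate in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```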


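Vectorizing the comments with the TF-IDF method. The vocabulary is built from the training set only, so no information from the test set leaks into the features.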

In [14]:
# vectorize train and test set
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train_raw) # create the vocabulary

X_train = vectorizer.transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
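
To illustrate what fitting on the training set only means in practice, here is a small sketch with made-up sentences (`train_docs` and `test_docs` are hypothetical). Words that appear only in the test documents are not in the vocabulary and are simply ignored at transform time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, not taken from the dataset.
train_docs = ["you are great", "you are a loser"]
test_docs = ["you are a total loser"]

vec = TfidfVectorizer()
vec.fit(train_docs)  # vocabulary comes from the training documents only

# The default tokenizer drops one-character tokens such as "a".
vocab = sorted(vec.vocabulary_)
X_te_toy = vec.transform(test_docs)

print(vocab)
print(X_te_toy.shape, "non-zero entries:", X_te_toy.nnz)
```

The test row has the same number of columns as the training vocabulary, and "total" contributes nothing because it was never seen during `fit`.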

## Models on Original Data


Two ML models are applied to the original data, without any resampling.


**1- Logistic Regression Model**

In [15]:
# Training and testing a Logistic Regression model

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4394
           1       1.00      0.01      0.01       189

    accuracy                           0.96      4583
   macro avg       0.98      0.50      0.49      4583
weighted avg       0.96      0.96      0.94      4583
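
The 0.96 accuracy above is about what a model would score by never predicting the toxic class at all. The arithmetic below uses only the support counts from the report:

```python
# Support counts taken from the classification report above.
support_not_toxic = 4394
support_toxic = 189
total = support_not_toxic + support_toxic

# Accuracy of always predicting class 0 ('not_toxic').
baseline_accuracy = support_not_toxic / total

# Macro-averaged recall of that baseline: 1.0 on class 0, 0.0 on class 1.
baseline_macro_recall = (1.0 + 0.0) / 2

print(round(baseline_accuracy, 2), baseline_macro_recall)
```

This is why the per-class recall and f1-score for class 1 are more informative here than overall accuracy.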



**2- Stochastic Gradient Descent Classifier**

In [16]:
# Training and testing a Stochastic Gradient Descent Classifier

sg = SGDClassifier()
sg.fit(X_train, y_train)

y_pred_sg = sg.predict(X_test)

print(classification_report(y_test, y_pred_sg))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4394
           1       0.92      0.06      0.11       189

    accuracy                           0.96      4583
   macro avg       0.94      0.53      0.54      4583
weighted avg       0.96      0.96      0.94      4583
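
The f1-score column is the harmonic mean of precision and recall, which is why a high precision cannot compensate for a very low recall. A small check against the toxic-class row of the report above; the inputs are the rounded values printed there, so a recomputed score may differ in the last digit for other rows (`f1_from` is a helper defined here for illustration):

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toxic-class (label 1) precision and recall of the SGD classifier above.
sgd_toxic_f1 = f1_from(0.92, 0.06)
print(round(sgd_toxic_f1, 2))
```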



## Undersampling

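Creating a balanced training set with random undersampling and applying the same two ML models. Only the training set is resampled; the test set keeps its original class distribution.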

In [12]:
# Undersampling

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_train_undersampled, y_train_undersampled = rus.fit_resample(X_train, y_train)
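# LabelEncoder sorts classes alphabetically ('not_toxic' -> 0, 'toxic' -> 1), so this yields [1, 0]
label_indices = le.transform(['toxic','not_toxic'])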
print(X_train_undersampled.shape,X_train.shape)
print('Original Toxic',y_train.tolist().count(label_indices[0]),'Original Not toxic',y_train.tolist().count(label_indices[1]))
print('Undersampled Toxic:',y_train_undersampled.tolist().count(1),'Undersampled Not toxic',y_train_undersampled.tolist().count(0))


(936, 26231) (10691, 26231)
Original Toxic 468 Original Not toxic 10223
Undersampled Toxic: 468 Undersampled Not toxic 468
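
The output above shows the effect of undersampling: the majority class is cut down to the size of the minority class (468 + 468 = 936 rows). The idea can be sketched by hand with NumPy. This is an illustration on made-up data, not the `imblearn` implementation, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced data: 95 majority (0) and 5 minority (1) samples.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority sample and an equally sized random subset of the majority.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

X_bal, y_bal = X_toy[keep], y_toy[keep]
print(X_bal.shape, int((y_bal == 0).sum()), int((y_bal == 1).sum()))
```

The cost of this balance is that most majority-class examples are discarded, which is worth keeping in mind when reading the results below.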


**1- Logistic Regression Model**

In [17]:
# Logistic regression on undersampled data

rus_lr = LogisticRegression(solver='lbfgs')
rus_lr.fit(X_train_undersampled, y_train_undersampled)
y_pred_rus_lr = rus_lr.predict(X_test)
print(classification_report(y_test, y_pred_rus_lr))

              precision    recall  f1-score   support

           0       0.98      0.74      0.84      4394
           1       0.10      0.69      0.18       189

    accuracy                           0.74      4583
   macro avg       0.54      0.72      0.51      4583
weighted avg       0.95      0.74      0.82      4583



**2- Stochastic Gradient Descent Classifier**

In [18]:
# Stochastic Gradient Descent Classifier on undersampled data


rus_sg = SGDClassifier()
rus_sg.fit(X_train_undersampled, y_train_undersampled)

y_pred_rus_sg = rus_sg.predict(X_test)

print(classification_report(y_test, y_pred_rus_sg))

              precision    recall  f1-score   support

           0       0.98      0.74      0.84      4394
           1       0.10      0.70      0.18       189

    accuracy                           0.73      4583
   macro avg       0.54      0.72      0.51      4583
weighted avg       0.95      0.73      0.81      4583



## Oversampling with SMOTE


Creating a balanced training set with SMOTE and applying the same two ML models. As with undersampling, only the training set is resampled.

**1- Logistic Regression Model**

In [19]:
# Logistic Regression with SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
smote_lr = LogisticRegression(solver='lbfgs')
smote_lr.fit(X_train_smote, y_train_smote)
y_pred_smote_lr = smote_lr.predict(X_test)
print(classification_report(y_test, y_pred_smote_lr ))


              precision    recall  f1-score   support

           0       0.98      0.98      0.98      4394
           1       0.45      0.44      0.45       189

    accuracy                           0.95      4583
   macro avg       0.71      0.71      0.71      4583
weighted avg       0.95      0.95      0.95      4583
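
SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up points; it is not the `imblearn` implementation, and the neighbour search is left out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up minority-class points; 'neighbor' stands in for a nearest neighbour of 'x'.
x = np.array([0.0, 1.0])
neighbor = np.array([1.0, 3.0])

# Random position along the segment from x to neighbor, drawn from [0, 1).
lam = rng.random()
synthetic = x + lam * (neighbor - x)

print(lam, synthetic)
```

Every synthetic point lies on the line segment between two real minority samples, so the minority class grows without exact duplicates.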



**2- Stochastic Gradient Descent Classifier**

In [20]:
# Stochastic Gradient Descent Classifier with SMOTE

from sklearn.linear_model import SGDClassifier

smote_sg = SGDClassifier()
smote_sg.fit(X_train_smote, y_train_smote)

y_pred_smote_sg = smote_sg.predict(X_test)

print(classification_report(y_test, y_pred_smote_sg))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      4394
           1       0.62      0.39      0.48       189

    accuracy                           0.97      4583
   macro avg       0.80      0.69      0.73      4583
weighted avg       0.96      0.97      0.96      4583
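
To compare the six runs side by side, the toxic-class (label 1) rows of the reports above can be collected in one table. The numbers below are copied from those reports; the column names are chosen here for illustration:

```python
import pandas as pd

# Precision, recall and f1-score for class 1 (toxic), copied from the reports above.
toxic_class = pd.DataFrame(
    [
        ("original", "LogisticRegression", 1.00, 0.01, 0.01),
        ("original", "SGDClassifier", 0.92, 0.06, 0.11),
        ("undersampled", "LogisticRegression", 0.10, 0.69, 0.18),
        ("undersampled", "SGDClassifier", 0.10, 0.70, 0.18),
        ("SMOTE", "LogisticRegression", 0.45, 0.44, 0.45),
        ("SMOTE", "SGDClassifier", 0.62, 0.39, 0.48),
    ],
    columns=["training_data", "model", "precision", "recall", "f1"],
)

best = toxic_class.loc[toxic_class["f1"].idxmax()]
print(toxic_class.to_string(index=False))
print("highest toxic-class f1:", best["training_data"], best["model"], best["f1"])
```

These values come from a single run. The resamplers and `SGDClassifier` are created without a fixed `random_state`, so a rerun may give slightly different numbers.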

