**Toxic comment classification: Evaluating the Debias Method with Decision Trees, and Modelling with SVM**

- Apply TF-IDF, split the datasets, train a Decision Tree, and evaluate the impact of the debias method.
- Apply word embeddings, split the datasets, train a Decision Tree, and compare the results.
- Train SVM models on the debiased dataset (df_train2) with TF-IDF and word-embedding features.
- Tune the hyperparameters of the best Decision Tree and SVM models, then evaluate them against real-world data (the test dataset).

<div style="border-top: 7px solid #800080; animation: sparkling 2s linear infinite;"></div>

<style>
@keyframes sparkling {
  0% { background-position: 0 0; }
  100% { background-position: 100% 0; }
}
</style>

**Part 1**

- Apply the TF-IDF feature representation method.
- Split both the original dataset and the dataset with the sensitive words masked into training and test sets.
- Train a Decision Tree on each dataset, then evaluate the debias method's impact on model performance.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [2]:
# Load the datasets
df_train1 = pd.read_csv('processed_train_data.csv')  # Original data, sensitive words not masked
df_train2 = pd.read_csv('ready_train.csv')  # Data with sensitive words masked

# Vectorize lemmas for dataset 1
vectorizer = TfidfVectorizer()
X1 = vectorizer.fit_transform(df_train1['lemmas'])
y1 = df_train1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

# Split data into train and test sets for dataset 1
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=42)

# Vectorize lemmas for dataset 2
# Note: this refits the same vectorizer, so from here on it holds the dataset 2 vocabulary
X2 = vectorizer.fit_transform(df_train2['lemmas'])
y2 = df_train2[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

# Split data into train and test sets for dataset 2
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=42)

In [3]:
X1.shape, X2.shape

((32450, 57644), (32450, 51860))
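The cells above fit the vectorizer on all rows before splitting, so the held-out rows also contribute to the vocabulary and IDF weights. The following is a minimal sketch of an alternative that splits the raw text first and fits TF-IDF inside a `Pipeline`. The toy corpus and the single label column are hypothetical stand-ins, and this is not what the notebook ran, so the scores reported here would likely shift somewhat.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for df_train2['lemmas'] and one label column.
texts = [
    "you be awful", "have a nice day", "awful stupid person", "thank for the help",
    "stupid idea", "great work", "you awful idiot", "nice helpful answer",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so test rows never influence the vocabulary or IDF weights.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the tree on those features.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])
pipe.fit(texts_train, y_train)

# predict() applies the already-fitted vectorizer to the unseen texts.
predictions = pipe.predict(texts_test)
```

A fitted pipeline of this kind can also be saved as a single object, which keeps the vectorizer and the model together when they are reloaded later.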

In [4]:
# Define the Decision Tree classifier (it is trained in the following cells)
dt = DecisionTreeClassifier(class_weight='balanced', random_state=0)

In [5]:
# Train and evaluate model for dataset 1
dt.fit(X_train1, y_train1)
# Predict labels for the test set
y_pred1 = dt.predict(X_test1)
# Evaluate the model
print("Classification report for dataset 1:")
print(classification_report(y_test1, y_pred1, zero_division=0))

Classification report for dataset 1:
              precision    recall  f1-score   support

           0       0.75      0.74      0.74      4577
           1       0.22      0.41      0.28       496
           2       0.59      0.71      0.65      2531
           3       0.23      0.41      0.29       144
           4       0.52      0.64      0.57      2390
           5       0.24      0.42      0.31       413

   micro avg       0.57      0.68      0.62     10551
   macro avg       0.42      0.55      0.47     10551
weighted avg       0.61      0.68      0.64     10551
 samples avg       0.28      0.32      0.27     10551



In [6]:
# Train and evaluate model for dataset 2
dt.fit(X_train2, y_train2)
# Predict labels for the test set
y_pred2 = dt.predict(X_test2)
# Evaluate the model
print("Classification report for dataset 2:")
print(classification_report(y_test2, y_pred2, zero_division=0))

Classification report for dataset 2:
              precision    recall  f1-score   support

           0       0.74      0.75      0.75      4577
           1       0.19      0.50      0.27       496
           2       0.60      0.75      0.67      2531
           3       0.17      0.36      0.23       144
           4       0.50      0.66      0.57      2390
           5       0.10      0.30      0.14       413

   micro avg       0.53      0.69      0.60     10551
   macro avg       0.38      0.55      0.44     10551
weighted avg       0.59      0.69      0.63     10551
 samples avg       0.27      0.33      0.28     10551



<div style="border-top: 7px solid #800080; animation: sparkling 2s linear infinite;"></div>

**Part 2**

- Apply the word-embedding feature representation method.
- Split both the original dataset and the dataset with the sensitive words masked into training and test sets.
- Train a Decision Tree on each dataset and compare the results.

In [7]:
import spacy
import numpy as np
nlp = spacy.load("en_core_web_lg")

In [8]:
# Function to build one embedding per comment by averaging the vectors of its tokens
def generate_word_embeddings(df):
    embeddings = []
    for tokens in df["token_nonstop"]:
        sentence_embedding = np.zeros(nlp.vocab.vectors_length)  
        word_count = 0  
        for token in tokens:
            if token in nlp.vocab:
                word_embedding = nlp.vocab[token].vector 
                sentence_embedding += word_embedding 
                word_count += 1  
        if word_count > 0:
            sentence_embedding /= word_count 
        embeddings.append(sentence_embedding)
    return np.array(embeddings)
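One thing worth checking before calling this function: if `token_nonstop` was saved to CSV as a Python list, `pd.read_csv` returns each cell as a string, and the inner loop would then iterate over characters rather than tokens. Whether that applies here is an assumption, since the notebook does not show how the file was written. A small sketch of a guard, with `parse_token_list` as a hypothetical helper name:

```python
import ast

def parse_token_list(value):
    """Return a list of tokens from either a list or its string representation."""
    if isinstance(value, list):
        return value
    # literal_eval safely parses a Python literal such as "['a', 'b']".
    return ast.literal_eval(value)

# How a list column typically looks after a CSV round trip.
raw = "['you', 'awful', 'person']"

# Iterating over the raw string yields single characters.
first_chars = [ch for ch in raw][:3]

# Parsing first yields the actual tokens.
tokens = parse_token_list(raw)
```

If the column does hold strings, applying the helper once, for example with `df["token_nonstop"].apply(parse_token_list)`, before generating embeddings would restore token-level averaging.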

In [9]:
# Generate word embeddings for token_nonstop in df_train1
X_1 = generate_word_embeddings(df_train1)
# Check the shape of the feature matrix
print("Shape of feature matrix X:", X_1.shape)

Shape of feature matrix X: (32450, 300)


In [10]:
y_label = df_train1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']] # same for both datasets

In [11]:
# Split data into train and test sets for dataset 1
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_label, test_size=0.3, random_state=42)

In [12]:
# Define the Decision Tree classifier (it is trained in the following cells)
dt_1 = DecisionTreeClassifier(class_weight='balanced', random_state=0)

In [13]:
# Train and evaluate model for dataset 1
dt_1.fit(X_train_1, y_train_1)
# Predict labels for the test set
y_pred_1 = dt_1.predict(X_test_1)
# Evaluate the model
print("Classification report for dataset 1:")
print(classification_report(y_test_1, y_pred_1, zero_division=0))

Classification report for dataset 1:
              precision    recall  f1-score   support

           0       0.58      0.61      0.59      4577
           1       0.19      0.23      0.21       496
           2       0.43      0.47      0.45      2531
           3       0.04      0.06      0.05       144
           4       0.37      0.41      0.39      2390
           5       0.10      0.13      0.12       413

   micro avg       0.44      0.49      0.46     10551
   macro avg       0.29      0.32      0.30     10551
weighted avg       0.45      0.49      0.47     10551
 samples avg       0.24      0.25      0.22     10551



In [14]:
# Generate word embeddings for token_nonstop in df_train2
X_2 = generate_word_embeddings(df_train2)
# Check the shape of the feature matrix
print("Shape of feature matrix X:", X_2.shape)

Shape of feature matrix X: (32450, 300)


In [15]:
# Split data into train and test sets for dataset 2
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_label, test_size=0.3, random_state=42)

In [16]:
# Train and evaluate model for dataset 2
dt_1.fit(X_train_2, y_train_2)
# Predict labels for the test set
y_pred_2 = dt_1.predict(X_test_2)
# Evaluate the model
print("Classification report for dataset 2:")
print(classification_report(y_test_2, y_pred_2, zero_division=0))

Classification report for dataset 2:
              precision    recall  f1-score   support

           0       0.58      0.62      0.60      4577
           1       0.21      0.24      0.22       496
           2       0.43      0.49      0.46      2531
           3       0.08      0.11      0.10       144
           4       0.38      0.43      0.40      2390
           5       0.10      0.13      0.12       413

   micro avg       0.45      0.50      0.47     10551
   macro avg       0.30      0.34      0.32     10551
weighted avg       0.46      0.50      0.48     10551
 samples avg       0.24      0.25      0.22     10551



**Both datasets were represented with TF-IDF and with word embeddings and then used to train a Decision Tree. Overall performance barely changes when sensitive words are masked: micro-averaged F1 is 0.62 vs. 0.60 with TF-IDF and 0.46 vs. 0.47 with word embeddings. The main exception is class 5 (identity_hate) with TF-IDF, where F1 drops from 0.31 to 0.14. Masking therefore removes the sensitive words as features at little cost to overall performance, so the masked dataset will be used in the project from now on.**
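The per-class differences behind this comparison can be tabulated directly from the two TF-IDF reports above. This is a small sketch that assumes classes 0 to 5 follow the order of the label columns; the F1 values are copied from the printed reports.

```python
# Label order as used when building y1 and y2 (assumed to match class indices 0-5).
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Per-class F1 from the two TF-IDF Decision Tree reports.
f1_original = [0.74, 0.28, 0.65, 0.29, 0.57, 0.31]  # dataset 1, no masking
f1_masked = [0.75, 0.27, 0.67, 0.23, 0.57, 0.14]    # dataset 2, sensitive words masked

# Positive delta: masking helped; negative delta: masking hurt.
deltas = {
    label: round(masked - original, 2)
    for label, original, masked in zip(labels, f1_original, f1_masked)
}

# The class that loses the most F1 after masking.
largest_drop = min(deltas, key=deltas.get)

for label, delta in deltas.items():
    print(f"{label:>14}: {delta:+.2f}")
```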

<div style="border-top: 7px solid #800080; animation: sparkling 2s linear infinite;"></div>

**Part 3**

- Train SVM models on the debiased dataset (df_train2) with TF-IDF features.
- Train SVM models on the same dataset with word-embedding features.
- Experiment with the linear and RBF kernels.

In [17]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
import warnings
from sklearn.exceptions import ConvergenceWarning

In [18]:
# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Define and train the classifier with linear kernel
clf = OneVsRestClassifier(SVC(kernel='linear', class_weight='balanced', max_iter=1000, random_state=0))
clf.fit(X_train2, y_train2) # tf-idf vectors

# Predict labels for the test set
y_pred2 = clf.predict(X_test2)

# Evaluate the model
print("Classification report for tf-idf based linear SVM:")
print(classification_report(y_test2, y_pred2, zero_division=0))

Classification report for tf-idf based linear SVM:
              precision    recall  f1-score   support

           0       0.47      1.00      0.64      4577
           1       0.05      1.00      0.10       496
           2       0.26      1.00      0.41      2531
           3       0.24      0.65      0.36       144
           4       0.25      1.00      0.39      2390
           5       0.04      1.00      0.08       413

   micro avg       0.21      1.00      0.35     10551
   macro avg       0.22      0.94      0.33     10551
weighted avg       0.33      1.00      0.48     10551
 samples avg       0.21      0.50      0.28     10551



In [19]:
# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Define and train the SVM classifier with RBF kernel
clf_svm_rbf = OneVsRestClassifier(SVC(kernel='rbf', class_weight='balanced', max_iter=1000, random_state=0))
clf_svm_rbf.fit(X_train2, y_train2) 

# Predict labels for the test set
y_pred_svm_rbf = clf_svm_rbf.predict(X_test2)

# Evaluate the performance
print("Classification report for tf-idf based rbf SVM:")
print(classification_report(y_test2, y_pred_svm_rbf, zero_division=0))

Classification report for tf-idf based rbf SVM:
              precision    recall  f1-score   support

           0       0.47      1.00      0.64      4577
           1       0.05      1.00      0.10       496
           2       0.26      1.00      0.41      2531
           3       0.54      0.38      0.45       144
           4       0.25      1.00      0.39      2390
           5       0.04      1.00      0.08       413

   micro avg       0.21      0.99      0.35     10551
   macro avg       0.27      0.90      0.35     10551
weighted avg       0.33      0.99      0.48     10551
 samples avg       0.21      0.50      0.28     10551



In [20]:
# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Define and train the classifier
clf_1 = OneVsRestClassifier(SVC(kernel='linear', class_weight='balanced', max_iter=1000, random_state=0))
clf_1.fit(X_train_2, y_train_2) # word embeddings

# Predict labels for the test set
y_pred_2 = clf_1.predict(X_test_2)

# Evaluate the model
print("Classification report for word-embedding based linear SVM:")
print(classification_report(y_test_2, y_pred_2, zero_division=0))

Classification report for word-embedding based linear SVM:
              precision    recall  f1-score   support

           0       0.47      0.98      0.64      4577
           1       0.05      0.85      0.09       496
           2       0.26      0.80      0.39      2531
           3       0.01      0.74      0.03       144
           4       0.25      0.97      0.39      2390
           5       0.04      0.94      0.08       413

   micro avg       0.19      0.92      0.31     10551
   macro avg       0.18      0.88      0.27     10551
weighted avg       0.33      0.92      0.47     10551
 samples avg       0.19      0.47      0.25     10551



In [21]:
# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Define and train the SVM classifier with RBF kernel
clf_svm_rbf_1 = OneVsRestClassifier(SVC(kernel='rbf', class_weight='balanced', max_iter=1000, random_state=0))
clf_svm_rbf_1.fit(X_train_2, y_train_2)

# Predict labels for the test set
y_pred_svm_rbf_2 = clf_svm_rbf_1.predict(X_test_2)

# Evaluate the performance
print("Classification report for word-embedding based rbf SVM:")
print(classification_report(y_test_2, y_pred_svm_rbf_2, zero_division=0))

Classification report for word-embedding based rbf SVM:
              precision    recall  f1-score   support

           0       0.47      0.99      0.64      4577
           1       0.05      1.00      0.10       496
           2       0.26      1.00      0.41      2531
           3       0.01      1.00      0.03       144
           4       0.25      1.00      0.39      2390
           5       0.04      1.00      0.08       413

   micro avg       0.18      1.00      0.31     10551
   macro avg       0.18      1.00      0.28     10551
weighted avg       0.33      1.00      0.47     10551
 samples avg       0.18      0.50      0.25     10551
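A recall of 1.00 combined with a precision equal to the share of positive rows is what a classifier produces when it labels every row as positive. The check below compares the reported precision of the word-embedding RBF SVM with each class's prevalence in the test split. The test-split size of 9,735 rows is an assumption (30% of 32,450). If the match holds, one possible explanation is that the `max_iter=1000` cap stopped the solver before it converged, which the suppressed `ConvergenceWarning` would otherwise have reported.

```python
# Assumption: the test split holds 30% of the 32,450 rows.
n_test = 9735

# Support (number of positive rows) per class, from the classification reports.
support = {0: 4577, 1: 496, 2: 2531, 3: 144, 4: 2390, 5: 413}

# Precision per class as printed for the word-embedding RBF SVM.
reported_precision = {0: 0.47, 1: 0.05, 2: 0.26, 3: 0.01, 4: 0.25, 5: 0.04}

# Precision of an "always positive" classifier equals the class prevalence.
prevalence = {cls: round(n_pos / n_test, 2) for cls, n_pos in support.items()}

# Classes whose reported precision matches the prevalence to two decimals.
matches = [cls for cls in support if prevalence[cls] == reported_precision[cls]]

print("prevalence:", prevalence)
print("classes matching an all-positive prediction:", matches)
```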



<div style="border-top: 7px solid #800080; animation: sparkling 2s linear infinite;"></div>

**Part 4**

- Tune the hyperparameters of the best Decision Tree and SVM models.
- Evaluate the tuned models on the held-out split.
- Evaluate the tuned models against the real-world test data.

In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# Set parameter range
parameters = {
    'estimator__C': [0.01, 0.1, 1, 10],
    'estimator__gamma': [0.01, 0.1, 1, 10]
}

# Create an SVM classifier
svm_classifier = SVC(class_weight='balanced', max_iter=1000, random_state=0)

# Wrap it with OneVsRestClassifier
ovr_classifier = OneVsRestClassifier(svm_classifier)

# Perform search
grid_search = GridSearchCV(ovr_classifier, parameters, cv=5, scoring='f1_micro')
grid_search.fit(X_train2, y_train2) # tf-idf based

# Get the best model
best_rbf = grid_search.best_estimator_

# Make prediction with the best model
y_pred_best_rbf = best_rbf.predict(X_test2)

# Evaluation
f1 = f1_score(y_test2, y_pred_best_rbf, average='micro')

print(f'Best C: {best_rbf.estimator.C}')
print(f'Best gamma: {best_rbf.estimator.gamma}')
print(f'F1 score: {f1}')

Best C: 10
Best gamma: 1
F1 score: 0.6714457541496271


In [23]:
from sklearn.metrics import precision_recall_fscore_support

# Calculate precision, recall, and F1 score for each class
precision, recall, f1, _ = precision_recall_fscore_support(y_test2, y_pred_best_rbf, average=None)

# Print precision, recall, and F1 score for each class
for i in range(len(precision)):
    print(f'Class {i}: Precision={precision[i]:.2f}, Recall={recall[i]:.2f}, F1 Score={f1[i]:.2f}')

Class 0: Precision=0.82, Recall=0.75, F1 Score=0.79
Class 1: Precision=0.34, Recall=0.43, F1 Score=0.38
Class 2: Precision=0.73, Recall=0.73, F1 Score=0.73
Class 3: Precision=0.54, Recall=0.23, F1 Score=0.32
Class 4: Precision=0.62, Recall=0.54, F1 Score=0.57
Class 5: Precision=0.21, Recall=0.40, F1 Score=0.28


In [24]:
# Set parameter range
parameters = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6],
    'max_features': [ None, 'sqrt', 'log2']
}

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(class_weight='balanced', random_state=0)

# Perform grid search
grid_search = GridSearchCV(dt_classifier, parameters, cv=5, scoring='f1_micro')
grid_search.fit(X_train2, y_train2)

# Get the best model
best_dt = grid_search.best_estimator_

# Make prediction with the best model
y_pred_best_dt = best_dt.predict(X_test2)

# Evaluation
f1 = f1_score(y_test2, y_pred_best_dt, average='micro')

print(f'Best max_depth: {best_dt.max_depth}')
print(f'Best min_samples_split: {best_dt.min_samples_split}')
print(f'Best min_samples_leaf: {best_dt.min_samples_leaf}')
print(f'Best max_features: {best_dt.max_features}')
print(f'F1 score: {f1}')

Best max_depth: None
Best min_samples_split: 5
Best min_samples_leaf: 1
Best max_features: None
F1 score: 0.6000079481778803
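To get a feel for the cost of the two searches, the number of candidate settings and cross-validation fits can be counted from the grids themselves. This sketch copies the grids from the cells above; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Grids copied from the two GridSearchCV cells.
svm_grid = {
    "estimator__C": [0.01, 0.1, 1, 10],
    "estimator__gamma": [0.01, 0.1, 1, 10],
}
dt_grid = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 3, 4, 5, 6],
    "max_features": [None, "sqrt", "log2"],
}

def count_fits(grid, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * cv

svm_candidates, svm_fits = count_fits(svm_grid, cv=5)
dt_candidates, dt_fits = count_fits(dt_grid, cv=5)

# With one-vs-rest over six labels, each SVM fit trains six binary SVCs.
svm_binary_fits = svm_fits * 6

print(f"SVM: {svm_candidates} candidates, {svm_fits} fits, {svm_binary_fits} binary SVCs")
print(f"Decision Tree: {dt_candidates} candidates, {dt_fits} fits")
```

By default `GridSearchCV` also refits the best candidate once on the full training data, which adds one more fit per search.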


In [25]:
# Calculate precision, recall, and F1 score for each class
precision, recall, f1, _ = precision_recall_fscore_support(y_test2, y_pred_best_dt, average=None)

# Print precision, recall, and F1 score for each class
for i in range(len(precision)):
    print(f'Class {i}: Precision={precision[i]:.2f}, Recall={recall[i]:.2f}, F1 Score={f1[i]:.2f}')

Class 0: Precision=0.73, Recall=0.76, F1 Score=0.74
Class 1: Precision=0.18, Recall=0.52, F1 Score=0.27
Class 2: Precision=0.59, Recall=0.78, F1 Score=0.67
Class 3: Precision=0.18, Recall=0.42, F1 Score=0.25
Class 4: Precision=0.49, Recall=0.70, F1 Score=0.58
Class 5: Precision=0.09, Recall=0.31, F1 Score=0.15


In [26]:
from joblib import dump

# Save the best models for later use
dump(best_rbf, 'best_svm_rbf.joblib')
dump(best_dt, 'best_decision_tree.joblib')

['best_decision_tree.joblib']

In [27]:
# Import saved model for validation against real-world data
from joblib import load

# Load the saved models to validate against real-world test data
dt_model = load('best_decision_tree.joblib')
svm_model = load('best_svm_rbf.joblib')

In [28]:
# Load the processed test data
test_df = pd.read_csv('ready_test.csv')

In [29]:
# Transform the test data with the same TfidfVectorizer (last fitted on df_train2)
X_test = vectorizer.transform(test_df['lemmas'])
y_test = test_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

In [30]:
# Validation of the decision tree model
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support

y_pred_test = dt_model.predict(X_test)

f1 = f1_score(y_test, y_pred_test, average='micro')
print(f1)

# Calculate precision, recall, and F1 score for each class
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_test, average=None)

# Print precision, recall, and F1 score for each class
for i in range(len(precision)):
    print(f'Class {i}: Precision={precision[i]:.2f}, Recall={recall[i]:.2f}, F1 Score={f1[i]:.2f}')

0.2173226433430515
Class 0: Precision=0.18, Recall=0.82, F1 Score=0.30
Class 1: Precision=0.02, Recall=0.47, F1 Score=0.05
Class 2: Precision=0.16, Recall=0.75, F1 Score=0.26
Class 3: Precision=0.03, Recall=0.45, F1 Score=0.06
Class 4: Precision=0.13, Recall=0.70, F1 Score=0.22
Class 5: Precision=0.03, Recall=0.44, F1 Score=0.05


In [31]:
# Validation of the svm model
y_pred_test_svm = svm_model.predict(X_test)

f1 = f1_score(y_test, y_pred_test_svm, average='micro')
print(f1)

# Calculate precision, recall, and F1 score for each class
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_test_svm, average=None)

# Print precision, recall, and F1 score for each class
for i in range(len(precision)):
    print(f'Class {i}: Precision={precision[i]:.2f}, Recall={recall[i]:.2f}, F1 Score={f1[i]:.2f}')

0.3808105995757192
Class 0: Precision=0.30, Recall=0.83, F1 Score=0.44
Class 1: Precision=0.08, Recall=0.48, F1 Score=0.13
Class 2: Precision=0.34, Recall=0.71, F1 Score=0.46
Class 3: Precision=0.37, Recall=0.32, F1 Score=0.34
Class 4: Precision=0.25, Recall=0.54, F1 Score=0.34
Class 5: Precision=0.09, Recall=0.67, F1 Score=0.16
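The gap between the held-out split and the real-world data can be expressed as a relative drop in micro F1 for each tuned model. The sketch below only rearranges the four scores printed in this part; the dictionary keys are labels chosen for illustration.

```python
# Micro F1 on the held-out split (30% of the training data), as printed above.
held_out = {"decision_tree": 0.6000079481778803, "svm_rbf": 0.6714457541496271}

# Micro F1 on the real-world test data, as printed above.
real_world = {"decision_tree": 0.2173226433430515, "svm_rbf": 0.3808105995757192}

# Share of the held-out score that is lost on the real-world data.
relative_drop = {
    model: (held_out[model] - real_world[model]) / held_out[model]
    for model in held_out
}

# The model that keeps the larger share of its held-out score.
most_robust = min(relative_drop, key=relative_drop.get)

for model, drop in relative_drop.items():
    print(f"{model}: {held_out[model]:.3f} -> {real_world[model]:.3f} ({drop:.1%} lower)")
```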


<div style="border-top: 7px solid #800080; animation: sparkling 2s linear infinite;"></div>

**Conclusion:**

- Decision Trees on Original vs. Debiased Dataset:
  - Models: Decision Trees trained on both the original dataset and the dataset after applying the debias method.
  - Purpose: Evaluate the impact of the debias method on the performance of Decision Tree models.

- Decision Trees with TF-IDF vs. Word Embedding Features:
  - Models: Decision Tree models trained using both TF-IDF and word embedding feature representations.
  - Purpose: Compare the performance of Decision Tree models when using different feature representations.

- SVM Models with Linear and RBF Kernel on TF-IDF vs. Word Embedding Features:
  - Models: SVM models trained with both linear and RBF kernels, using TF-IDF and word embedding features.
  - Purpose: Compare the performance of SVM models with different kernels and feature representations.

- Hyperparameter Tuning for Best Models:
  - Purpose: Perform hyperparameter tuning for the best-performing models obtained from the previous experiments to evaluate potential improvements in model performance.

**Result:**
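- The debias method has minimal impact on overall model performance, although with TF-IDF the F1 score for class 5 (identity_hate) falls from 0.31 to 0.14.
- Decision Tree models perform better with TF-IDF features (micro F1 of 0.60 to 0.62) than with word embeddings (0.46 to 0.47).
- Before tuning, all SVM variants score almost identically and reach a recall close to 1.00 at very low precision. The RBF kernel with TF-IDF features has a marginally higher macro F1 (0.35) than the other variants.
- Hyperparameter tuning raises the micro F1 of the TF-IDF RBF SVM from 0.35 to 0.67 on the held-out split, while the Decision Tree stays at about 0.60. For classes with lower frequencies, the models still struggle to capture patterns during training, and both perform poorly on real-world data. The SVM holds up better there than the Decision Tree (micro F1 of 0.38 vs. 0.22).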

**Decision for Next Step:**
- Design and implement neural networks to compare the performance with the existing models.