# SVM Classifier for Toxic Content Detection
This notebook shows the training and testing of a Support Vector Machine classifier for the
Toxic Content Detection task.

The data were processed as follows:
1. Text cleaning (see 'dataset_preprocessing.py')
2. Lemmatization with NLTK
3. Vectorization with TF-IDF
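
As a rough illustration of step 1, the sketch below shows one plausible cleaning routine. The function `clean_text` and its rules (lowercasing, dropping apostrophes, keeping only letters) are assumptions inferred from how the cleaned comments look; the actual logic lives in 'dataset_preprocessing.py' and may differ.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaning step: lowercase, drop apostrophes, keep letters only."""
    text = text.lower()
    text = text.replace("'", "")               # "don't" -> "dont"
    text = re.sub(r"[^a-z\s]", " ", text)      # digits and punctuation become spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Don't  LOOK, come back!"))
```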

In [15]:
import pandas as pd
from sklearn import metrics
import pickle
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from datetime import datetime
from sklearn.metrics import accuracy_score

# Training the System
The system can of course be retrained at will. However, given the time needed to retrain the classifier, we recommend using the pickle file 'svm_classifier' to run the experiments right away.
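
One way to get this "reuse if available, otherwise retrain" behaviour is sketched below. The helper `load_or_train` and the stand-in `fake_train` function are hypothetical and not part of this notebook; in practice `train_fn` would wrap the actual classifier training. Keep in mind that pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

def load_or_train(path, train_fn):
    """Load a pickled model from `path`; if it is missing, train and save one."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model

# Stand-in for an expensive training run (hypothetical, for illustration only).
calls = []
def fake_train():
    calls.append(1)
    return {"weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    first = load_or_train(path, fake_train)    # trains and writes the file
    second = load_or_train(path, fake_train)   # reads the cached file instead

print(first == second, len(calls))
```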

## Loading the training set


In [10]:
training_set = pd.read_csv("./../../datasets/training_set.csv")
training_set_lem = pd.read_csv("./../../datasets/training_set_lemmatized.csv")

# Note: the training set has already been cleaned
training_set

| Unnamed: 0 | comment_text | toxic |
|---|---|---|
| 0 | cocksucker before you piss around on my work | 1 |
| 1 | hey what is it talk what is it an exclusive gr... | 1 |
| 2 | bye dont look come or think of comming back to... | 1 |
| 3 | you are gay or antisemmitian archangel white t... | 1 |
| 4 | fuck your filthy mother in the ass dry | 1 |
| ... | ... | ... |
| 30572 | chris i dont know who you are talking to but i... | 0 |
| 30573 | operation condor is also named a dirty war can... | 0 |
| 30574 | there is no evidence that this block has anyth... | 0 |
| 30575 | thanks hey utkarshraj thanks for the kindness ... | 0 |


Both training and testing are run on the "non-lemmatized" dataset and on the "lemmatized" dataset. Note right away that the feature space of the "lemmatized" dataset is smaller (49188 $<$ 56091) than that of the "non-lemmatized" dataset. This affects both the time needed to train the classifier and the accuracy of the model, as shown later in the notebook.
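
The reason the feature space shrinks is that lemmatization maps several inflected forms to one token, so the vocabulary has fewer distinct entries. The toy sketch below illustrates this effect only; `TOY_LEMMAS` is a hand-written lookup table used as a stand-in, not the NLTK lemmatizer applied to the real dataset.

```python
# Toy stand-in for a lemmatizer: a hand-written lookup table (hypothetical,
# NOT the NLTK WordNetLemmatizer used for the real dataset).
TOY_LEMMAS = {
    "comments": "comment", "commented": "comment",
    "talking": "talk", "talks": "talk",
}

def vocabulary(docs):
    """Return the set of distinct whitespace-separated tokens in the corpus."""
    return {tok for doc in docs for tok in doc.split()}

def lemmatize(doc):
    """Map each token to its lemma when the toy table knows it."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in doc.split())

docs = ["he talks and she commented", "talking about comments", "talk comment"]
vocab_raw = vocabulary(docs)
vocab_lem = vocabulary(lemmatize(d) for d in docs)

# Each distinct token becomes one TF-IDF feature, so fewer tokens means fewer columns.
print(len(vocab_raw), len(vocab_lem))
```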

In [11]:
# TF-IDF vectorization
vectorizer = TfidfVectorizer()
vectorizer_lem = TfidfVectorizer()

X_train = vectorizer.fit_transform(training_set['comment_text'])
y_train = training_set['toxic']

X_train_lem = vectorizer_lem.fit_transform(training_set_lem['comment_text'])
y_train_lem = training_set_lem['toxic']

print("X_train.shape: " + str(X_train.shape))
print("y_train.shape: " + str(y_train.shape))

print("X_train_lem.shape: " + str(X_train_lem.shape))
print("y_train_lem.shape: " + str(y_train_lem.shape))

X_train.shape: (30577, 56091)
y_train.shape: (30577,)
X_train_lem.shape: (30577, 49188)
y_train_lem.shape: (30577,)
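
To make the shapes above more concrete, here is a minimal pure-Python sketch of what a TF-IDF vectorizer computes: one row per document and one column per vocabulary term. It assumes the smoothed IDF formula and L2 normalization that scikit-learn uses by default, and it simplifies tokenization to a whitespace split, so it is not a drop-in replacement for `TfidfVectorizer`. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows): one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows

vocab, rows = tfidf(["thanks for the help", "thanks for nothing"])
print("shape:", (len(rows), len(vocab)))
```

Terms that appear in fewer documents (here "help") receive a larger weight than terms shared by every document (here "thanks").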


## Training the Model

In [12]:
# File names used to persist the trained models
import pickle
model_filename = 'svm_classifier.pkl'
model_lem_filename = 'svm_classifier_lem.pkl'
clf, clf_lem = None, None

In [13]:
print("Training started on not-Lemmatized Dataset...")
# Create an SVM classifier with a linear kernel
clf = SVC(kernel='linear')
# Train the model on the training set and time the fit
start = datetime.now()
clf.fit(X_train, y_train)
end = datetime.now()
print("Training completed! Required time: " + str(end-start))
# Persist the trained classifier so it can be reloaded without retraining
with open(model_filename, 'wb') as f:
    pickle.dump(clf, f)

Training started on not-Lemmatized Dataset...
Training completed! Required time: 0:03:21.515343


In [14]:
print("Training started on Lemmatized Dataset...")
# Create an SVM classifier with a polynomial kernel
# (note: the non-lemmatized model above uses a linear kernel)
clf_lem = SVC(kernel='poly')
# Train the model on the lemmatized training set and time the fit
start = datetime.now()
clf_lem.fit(X_train_lem, y_train_lem)
end = datetime.now()
print("Training completed! Required time: " + str(end-start))
# Persist the trained classifier so it can be reloaded without retraining
with open(model_lem_filename, 'wb') as f:
    pickle.dump(clf_lem, f)

Training started on Lemmatized Dataset...
Training completed! Required time: 0:06:28.751011


In [None]:
# Reload the pickled classifier instead of retraining
with open(model_filename, 'rb') as f:
    clf = pickle.load(f)

In [None]:
# Reload the pickled classifier trained on the lemmatized dataset
with open(model_lem_filename, 'rb') as f:
    clf_lem = pickle.load(f)

# Testing the System

In [16]:
test_set = pd.read_csv("./../../datasets/test_set.csv")
test_set_lem = pd.read_csv("./../../datasets/test_set_lemmatized.csv")

test_set.dropna(inplace=True)
test_set_lem.dropna(inplace=True)

In [17]:
# Set aside the rows with toxic == -1 first, then drop them from the test set
other_set = test_set[test_set['toxic'] == -1]
test_set = test_set[test_set['toxic'] != -1]

other_set_lem = test_set_lem[test_set_lem['toxic'] == -1]
test_set_lem = test_set_lem[test_set_lem['toxic'] != -1]

In [18]:
X_test = vectorizer.transform(test_set['comment_text'])
y_test = test_set['toxic']

print("X_test.shape: " + str(X_test.shape))
print("y_test.shape: " + str(y_test.shape))

X_test.shape: (63842, 56091)
y_test.shape: (63842,)


In [19]:
X_test_lem = vectorizer_lem.transform(test_set_lem['comment_text'])
y_test_lem = test_set_lem['toxic']

print("X_test_lem.shape: " + str(X_test_lem.shape))
print("y_test_lem.shape: " + str(y_test_lem.shape))

X_test_lem.shape: (63842, 49188)
y_test_lem.shape: (63842,)


In [20]:
#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Model Precision: what percentage of tuples predicted as positive are actually positive?
print("Precision:",metrics.precision_score(y_test, y_pred))

# Model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test, y_pred))

Accuracy: 0.8520253124902102
Precision: 0.3845995329028713
Recall: 0.9198291440775423
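
The gap between a high recall and a much lower precision is easier to read with the definitions written out. The sketch below computes the three metrics from confusion-matrix counts on a small made-up label vector. The data and the helper `binary_metrics` are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, not flagged
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up imbalanced example: 2 toxic comments out of 10, with 2 false alarms.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

When the positive class is rare, a handful of false positives is enough to pull precision down while accuracy and recall stay high, which is the same qualitative pattern as in the output above.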


In [21]:
# Predict the response for the lemmatized test dataset
y_pred_lem = clf_lem.predict(X_test_lem)

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test_lem, y_pred_lem))

# Model Precision: what percentage of tuples predicted as positive are actually positive?
print("Precision:",metrics.precision_score(y_test_lem, y_pred_lem))

# Model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test_lem, y_pred_lem))


Accuracy: 0.8432066664578177
Precision: 0.3632243218743463
Recall: 0.8557581731559061
