DESCRIPTION

    Using NLP and machine learning, build a model to identify toxic comments from the Talk edit pages on Wikipedia. Help identify the words that make a comment toxic.

Problem Statement:  

    Wikipedia is the world’s largest and most popular reference work on the internet, with about 500 million unique visitors per month. It also has millions of contributors who can make edits to pages. The Talk edit pages are the key community interaction forum, where contributors discuss and debate changes pertaining to a particular topic.

    Wikipedia continuously strives to help online discussion become more productive and respectful. You are a data scientist at Wikipedia who will help build a predictive model, using NLP and machine learning, that identifies toxic comments in the discussion and marks them for cleanup. After that, help identify the top terms from the toxic comments.

Domain: Internet

    Analysis to be done: Build a text classification model using NLP and machine learning that detects toxic comments.

Content: 

    id: identifier number of the comment

    comment_text: the text in the comment

    toxic: 0 (non-toxic) /1 (toxic)

Steps to perform:

    Clean up the text data, convert it to a vector space representation using TF-IDF, and use Support Vector Machines to detect toxic comments. Finally, get the list of the top 15 toxic terms from the comments identified by the model.

Tasks: 

    Load the data using read_csv function from pandas package

    Get the comments into a list, for easy text cleanup and manipulation

Cleanup: 

    Using regular expressions, remove IP addresses

    Using regular expressions, remove URLs

    Normalize the casing

    Tokenize using word_tokenize from NLTK

    Remove stop words

    Remove punctuation

    Define a function to perform all these steps; you’ll use this later on the actual test set

    Using a counter, find the top terms in the data. 

    Can any of these be considered contextual stop words? 

    Words like “Wikipedia”, “page”, “edit” are examples of contextual stop words

    If yes, drop these from the data
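The cleanup steps above can be sketched for a single comment as follows. This is a minimal illustration using only the standard library: the two regular expressions are simplified, the stop word sets are tiny stand-ins (assumptions, not the NLTK list), and whitespace splitting stands in for word_tokenize.

```python
import re
import string

# Simplified patterns for illustration; real comments may need stricter ones.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Tiny stand-in stop word sets (assumptions, not the NLTK list).
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "this", "you"}
CONTEXT_STOP_WORDS = {"wikipedia", "page", "edit"}


def clean_comment(text):
    """Return the list of cleaned tokens for a single comment."""
    text = IP_PATTERN.sub(" ", text)      # remove IP addresses
    text = URL_PATTERN.sub(" ", text)     # remove URLs
    text = text.lower()                   # normalize the casing
    # Strip punctuation, then split on whitespace as a stand-in tokenizer.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [
        tok for tok in text.split()
        if tok not in STOP_WORDS and tok not in CONTEXT_STOP_WORDS
    ]


sample = "This page is rubbish, see http://example.com 188.23.179.183"
print(clean_comment(sample))
```

IP addresses and URLs are removed before punctuation is stripped, because both patterns rely on the dots and slashes being present.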

Train-Test Split:

    Separate into train and test sets

    Use train-test method to divide your data into 2 sets: train and test

    Use a 70-30 split
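With a rare positive class, a plain random split can leave the two parts with different toxic shares. One option, shown here as a sketch on hypothetical labels, is to pass stratify to train_test_split so that both parts keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 non-toxic (0) and 10 toxic (1).
labels = np.array([0] * 90 + [1] * 10)
comments = np.array([f"comment {i}" for i in range(len(labels))])

train_text, test_text, y_train, y_test = train_test_split(
    comments,
    labels,
    test_size=0.3,      # 70-30 split
    stratify=labels,    # keep the toxic share equal in both parts
    random_state=0,
)

print(len(train_text), len(test_text))
print(y_train.sum(), y_test.sum())
```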

Tf-idf transformation

    Use TF-IDF values for the terms as features to get into a vector space model

    Import TF-IDF vectorizer from sklearn

    Instantiate with a maximum of 4000 terms in your vocabulary

    Fit and apply on the train set

    Apply on the test set
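A minimal sketch of these steps, assuming a handful of hypothetical, already-cleaned comments in place of the real split. The vocabulary and IDF weights are learned from the train set only and then reused on the test set, so no test-set statistics leak into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned comments standing in for the real split.
train_comments = ["thanks helpful source", "stupid idiot", "please add source"]
test_comments = ["idiot vandal", "helpful thanks"]

vectorizer = TfidfVectorizer(max_features=4000)

# Learn the vocabulary and IDF weights from the train set only ...
X_train_tfidf = vectorizer.fit_transform(train_comments)
# ... and reuse them unchanged on the test set.
X_test_tfidf = vectorizer.transform(test_comments)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print("vandal" in vectorizer.vocabulary_)
```

Terms that appear only in the test comments, such as "vandal" here, are not in the learned vocabulary and are ignored by transform.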

Model building: Support Vector Machine

    Instantiate SVC from sklearn with a linear kernel

    Fit on the train data

    Make predictions for the train and the test set

    Model evaluation: Accuracy, recall, and f1_score

    Report the accuracy on the train set

    Report the recall on the train set: decent, high, low?

    Get the f1_score on the train set

Looks like you need to adjust for the class imbalance, as the model seems to focus on the 0s

    Adjust the appropriate parameter in the SVC module

    Train again with the adjustment and evaluate

    Train the model on the train set

    Evaluate the predictions on the validation set: accuracy, recall, f1_score

Hyperparameter tuning

    Import GridSearch and StratifiedKFold (because of class imbalance)

    Provide the parameter grid to choose for ‘C’

    Use a balanced class weight while instantiating the Support Vector Classifier

    Find the parameters with the best recall in cross validation

    Choose ‘recall’ as the metric for scoring

    Choose stratified 5 fold cross validation scheme
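The tuning setup described above could look roughly like this. The toy data and the values tried for C are assumptions for illustration; scoring="recall" measures recall of the positive class (1), which is the toxic class here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Hypothetical toy data: 40 non-toxic and 10 toxic samples, 5 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(2.0, 1.0, (10, 5))])
y = np.array([0] * 40 + [1] * 10)

# Stratified folds keep the toxic share the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10]},   # illustrative values
    scoring="recall",                        # recall of the positive class (1)
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```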

Fit on the train set

    What are the best parameters?

    Predict and evaluate using the best estimator

    Use best estimator from the grid search to make predictions on the test set

    What is the recall on the test set for the toxic comments?

    What is the f1_score?

    What are the most prominent terms in the toxic comments?

Separate the comments from the test set that the model identified as toxic

    Make one large list of the terms

    Get the top 15 terms
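These last steps can be sketched with collections.Counter. The comments and predictions below are hypothetical placeholders for the cleaned test comments and the model output.

```python
from collections import Counter

# Hypothetical cleaned test comments and model predictions (1 = toxic).
test_comments = ["stupid idiot", "thanks helpful source", "idiot vandal idiot"]
predictions = [1, 0, 1]

# Keep only the comments the model flagged as toxic.
flagged = [c for c, p in zip(test_comments, predictions) if p == 1]

# Flatten them into one list of terms and count the occurrences.
terms = [term for comment in flagged for term in comment.split()]
top_terms = Counter(terms).most_common(15)

print(top_terms)
```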

In [None]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from nltk import word_tokenize
from collections import Counter

In [None]:
data = pd.read_csv("wikipedia.csv")

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
data.head(5)

Unnamed: 0,id,comment_text,toxic
0,e617e2489abe9bca,"""\r\n\r\n A barnstar for you! \r\n\r\n The De...",0
1,9250cf637294e09d,"""\r\n\r\nThis seems unbalanced. whatever I ha...",0
2,ce1aa4592d5240ca,"Marya Dzmitruk was born in Minsk, Belarus in M...",0
3,48105766ff7f075b,"""\r\n\r\nTalkback\r\n\r\n Dear Celestia... """,0
4,0543d4f82e5470b6,New Categories \r\n\r\nI honestly think that w...,0


In [None]:
data.shape

(5000, 3)

Cleanup: 

    Using regular expressions, remove IP addresses

    Using regular expressions, remove URLs

    Normalize the casing

    Tokenize using word_tokenize from NLTK

    Remove stop words

    Remove punctuation

    Drop contextual stop words

In [None]:
def textPreProcess(comment_list):
    
    #Remove IPs
    comment_list_without_ip = []
    for comment in comment_list:
        # Raw string avoids invalid escape sequence warnings for \d and \.
        comment_list_without_ip.append(re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '', comment))

    del comment_list

    #Remove URLs
    comment_list_without_url = []
    regex_url = r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
    for comment in comment_list_without_ip:
        comment_list_without_url.append( re.sub(regex_url, '', comment))    

    del comment_list_without_ip

    #Remove Punctuation
    comment_list_without_punctuation = []
    for comment in comment_list_without_url:
        removePunctuation = [char for char in comment if char not in string.punctuation]
        modifiedcomment = ''.join(removePunctuation)
        comment_list_without_punctuation.append(modifiedcomment)

    del comment_list_without_url

    #Remove StopWords and Normalize
    comment_list_without_stopwords = []
    english_stopwords = set(stopwords.words('english'))  # build once; a set gives fast lookups
    for comment in comment_list_without_punctuation:
        words = comment.split(" ")
        wordNormalized = [word.lower() for word in words]
        finalWords = [word for word in wordNormalized if word not in english_stopwords]
        comment_list_without_stopwords.append(' '.join(finalWords))

    del comment_list_without_punctuation

    #Remove Contextual StopWords
    comment_list_without_context_stopwords = []
    contextual_prefixes = ("wikipe", "wikipi", "wikipp", "edit", "page")  # str.startswith accepts a tuple
    for comment in comment_list_without_stopwords:
        words = comment.split(" ")
        wordListWithoutContextualStopWords = [word for word in words if not word.startswith(contextual_prefixes)]
        comment_list_without_context_stopwords.append(' '.join(wordListWithoutContextualStopWords))

    del comment_list_without_stopwords

    #Tokenize using Word_Tokenizer
    sentences = ' '.join(comment for comment in comment_list_without_context_stopwords)
    words = word_tokenize(sentences)
    
    return words, comment_list_without_context_stopwords

In [None]:
comment_list = data['comment_text'].to_list()
wordList, commentList = textPreProcess(comment_list)

Display the top 15 most repeated words 

In [None]:
count_words = Counter(wordList)
count_words.most_common(15)

[('article', 1659),
 ('talk', 1047),
 ('please', 1033),
 ('would', 965),
 ('one', 856),
 ('like', 836),
 ('dont', 784),
 ('ass', 709),
 ('also', 657),
 ('i', 643),
 ('think', 630),
 ('fuck', 630),
 ('see', 628),
 ('know', 595),
 ('im', 561)]

In [None]:
#Build Vocabulary
#Note: the vocabulary is fit on the raw comments (comment_list), not on the cleaned commentList
wordVector = CountVectorizer()
features = np.array(comment_list)
finalWordVocab = wordVector.fit(features)

Separate features and labels and form the bag of words

In [None]:
#Separate data into features and label
features = data.iloc[:,1].values
label = data.iloc[:,2].values
bagOfWords = finalWordVocab.transform(features)

In [None]:
pd.Series(label).value_counts()

0    4563
1     437
dtype: int64
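The counts above show that only about 9% of the comments are toxic. As a rough sketch of what class_weight='balanced' does with such counts, scikit-learn documents the weights as n_samples / (n_classes * count per class), which can be reproduced with NumPy:

```python
import numpy as np

# Class counts reported by value_counts() above.
counts = np.array([4563, 437])        # index 0: non-toxic, index 1: toxic
n_samples = counts.sum()
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight='balanced'.
weights = n_samples / (n_classes * counts)

for cls, w in enumerate(weights):
    print(f"class {cls}: weight {w:.3f}")
print(f"toxic share: {counts[1] / n_samples:.1%}")
```

With these counts, an error on a toxic comment is weighted roughly ten times as heavily as an error on a non-toxic one (about 5.72 versus 0.55).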

Apply the TF-IDF transformation

In [None]:
#Calc IDF values
tfidfObject = TfidfTransformer().fit(bagOfWords)

#Transform data 
finalFeatureApply = tfidfObject.transform(bagOfWords)

Split Train and Test Data

In [None]:
# Apply TrainTestSplit
X_train,X_test,y_train,y_test,indices_train,indices_test = train_test_split(finalFeatureApply,
                                                                            label,
                                                                            range(5000),
                                                                            test_size=0.3,
                                                                            random_state=6)

In [None]:
X_train.shape, X_test.shape,y_train.shape,y_test.shape

((3500, 22886), (1500, 22886), (3500,), (1500,))

Scale the data, apply the SVC classifier, and print the accuracy, precision, recall, and F1 score

In [None]:
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = SVC(gamma='auto')
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

In [None]:
y_train_pred = clf.predict(X_train_scaled)
print("Train Accuracy: ",metrics.accuracy_score(y_train,y_train_pred)*100)
print("Train Recall: ",metrics.recall_score(y_train,y_train_pred,average='weighted')*100)
print("Train f1 score: ",metrics.f1_score(y_train,y_train_pred,average='weighted')*100)
print("Test Accuracy: ",metrics.accuracy_score(y_test,y_pred)*100)
print("Test Recall: ",metrics.recall_score(y_test,y_pred,average='weighted')*100)
print("Test f1 score: ",metrics.f1_score(y_test,y_pred,average='weighted')*100)

Train Accuracy:  92.82857142857142
Train Recall:  92.82857142857142
Train f1 score:  90.39978901353544
Test Accuracy:  90.93333333333334
Test Recall:  90.93333333333334
Test f1 score:  86.61527001862197


In [None]:
print(metrics.classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1364
           1       0.00      0.00      0.00       136

    accuracy                           0.91      1500
   macro avg       0.45      0.50      0.48      1500
weighted avg       0.83      0.91      0.87      1500



  _warn_prf(average, modifier, msg_start, len(result))


We see that the model predicts only class 0 and never class 1, so we need to tune the hyperparameters and apply different weights to each class, cross-validating with StratifiedKFold.
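The weighted recall printed earlier is identical to the accuracy, which is why it hides this failure. A small check with the per-class figures from the classification report illustrates the point: support-weighted recall always reduces to accuracy, while the recall of the toxic class alone is 0.

```python
import numpy as np

# Per-class figures taken from the classification report above.
support = np.array([1364, 136])       # class 0, class 1
recall = np.array([1.00, 0.00])       # class 0, class 1

correct = support * recall            # correctly classified samples per class
accuracy = correct.sum() / support.sum()
weighted_recall = np.average(recall, weights=support)
macro_recall = recall.mean()

print(f"accuracy:        {accuracy:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
print(f"macro recall:    {macro_recall:.4f}")
print(f"toxic recall:    {recall[1]:.4f}")
```

To report the toxic-class recall directly, metrics.recall_score can be called with its default average='binary' and pos_label=1 instead of average='weighted'.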

In [None]:
clf.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [None]:
param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [0.1,0.01,0.001,0.0001], 
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'break_ties' : [False,True],
              'class_weight': ['balanced']
}
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1, scoring = 'recall_weighted',cv = 5) 

grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print(grid.best_estimator_)

Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 284 tasks      | elapsed: 11.7min
[Parallel(n_jobs=-1)]: Done 508 tasks      | elapsed: 20.7min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 32.1min
[Parallel(n_jobs=-1)]: Done 800 out of 800 | elapsed: 32.3min finished


{'C': 1, 'break_ties': False, 'class_weight': 'balanced', 'gamma': 0.0001, 'kernel': 'sigmoid'}
SVC(C=1, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


Classify using the best params

In [None]:
grid_predictions = grid.best_estimator_.predict(X_test_scaled) 
print("Test Accuracy: ",metrics.accuracy_score(y_test,grid_predictions)*100)
print("Test Precision: ",metrics.precision_score(y_test,grid_predictions,average='weighted')*100)
print("Test Recall: ",metrics.recall_score(y_test,grid_predictions,average='weighted')*100)
print("Test f1 score: ",metrics.f1_score(y_test,grid_predictions,average='weighted')*100)
# print classification report
print(metrics.classification_report(y_test, grid_predictions)) 

Test Accuracy:  92.93333333333334
Test Precision:  92.05329212807966
Test Recall:  92.93333333333334
Test f1 score:  92.23728557705503
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1364
           1       0.67      0.43      0.53       136

    accuracy                           0.93      1500
   macro avg       0.81      0.71      0.74      1500
weighted avg       0.92      0.93      0.92      1500



Both precision and recall for the toxic class (1) improved after tuning the hyperparameters: precision rose from 0.00 to 0.67 and recall from 0.00 to 0.43.

In [None]:
type(grid_predictions)

numpy.ndarray

Form a dataframe, containing the actual and predicted values of test data along with their indices

In [None]:
df = pd.DataFrame({'Pred':pd.Series(grid_predictions),'Actual':pd.Series(y_test),'test_indices':pd.Series(indices_test)})
df.head(5)

Unnamed: 0,Pred,Actual,test_indices
0,0,0,2191
1,0,0,529
2,0,0,2541
3,0,0,2416
4,0,0,2049


Find the comments that were classified as toxic

In [None]:
toxic_comment_list = list(data.iloc[df[df['Pred']==1.0]['test_indices'].to_list()]['comment_text'])

Display the first 10 comments

In [None]:
toxic_comment_list[:10]

['Piss Off \r\n\r\nSuck my dick you pussy',
 'Fuck wiki\r\n\r\nFuck this piece of shit called Wikipedia, it bullshit of misinformation and Zionist propaganda! 188.23.179.183',
 "The elephant population has tripled over the last decade... \r\n\r\nI'm not particularly fond of your level of douchebag.",
 '"\r\n\r\nATTENTION ""MIND CONTROLLED DIS INFO AGENT""  KEEP IT REAL!  YOU THOUGHT YOU COULD USE WIKIPEDIA TO MISLEAD THE PUBLIC ABOUT ELECTRONIC HARASSMENT AND IT IS JUST NOT GOING TO HAPPEN.  YOU HAVE BEEN EXPOSED ESPECIALLY BY YOU UNPROFESSIONAL REMARKS ABOVE."',
 'NIGHTSTALLIONS WIFE GOT FUCKED BY A NIGGER AND HAD HIS BABY\r\n\r\nAND IT SMELLED OF FRIED CHICKEN',
 "Stop deleting my jont.  I told you! You can ask him he'll tell you that we're friends we went to the same preschool!\r\nStop being so damn ignorant.  I hate people like you. Yall think you're soo cool but really all u are is just a buttfart!  GO rub you're nipples sherlock.  Leave me alone!",
 'QUIT Threatening Me... \r\n\r

Display the top 15 toxic words

In [None]:
toxicWordList, toxicCommentList = textPreProcess(toxic_comment_list)
toxic_word_count = Counter(toxicWordList)
toxic_word_count.most_common(15)

[('assfuck', 277),
 ('gay', 212),
 ('fucking', 118),
 ('shit', 104),
 ('eat', 96),
 ('admins', 95),
 ('cocksucking', 94),
 ('cunts', 94),
 ('fuck', 16),
 ('bitch', 13),
 ('like', 10),
 ('dont', 10),
 ('stop', 8),
 ('talk', 8),
 ('cult', 8)]

Let's try other models and see if we get better recall values

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
ypred = model.predict(X_test)
print(metrics.classification_report(y_test, ypred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1364
           1       1.00      0.15      0.27       136

    accuracy                           0.92      1500
   macro avg       0.96      0.58      0.61      1500
weighted avg       0.93      0.92      0.90      1500



In [None]:
C = [10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200]
class_weight = ['dict', 'balanced', 'none']
dual = [True,False]
fit_intercept = [True,False]
penalty = ['l1', 'l2', 'elasticnet', 'none']
l1_ratio = [0,0.2,0.4,0.6,0.8,1]
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [11000]
random_grid = {'C': C,
               'class_weight': class_weight,
               'fit_intercept':fit_intercept,
               'solver':solver,
               'dual': dual,
               'penalty': penalty,
               'l1_ratio': l1_ratio,
               'max_iter': max_iter
              }
print(random_grid)

{'C': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200], 'class_weight': ['dict', 'balanced', 'none'], 'fit_intercept': [True, False], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'dual': [True, False], 'penalty': ['l1', 'l2', 'elasticnet', 'none'], 'l1_ratio': [0, 0.2, 0.4, 0.6, 0.8, 1], 'max_iter': [11000]}


In [None]:
from sklearn.model_selection import RandomizedSearchCV
clf = LogisticRegression(random_state=0)
clf_random = RandomizedSearchCV(estimator = clf, param_distributions = random_grid, n_iter = 1000, cv = 3, verbose=2, n_jobs = -1)
clf_random.fit(X_train, y_train)
clf_random.best_params_

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   33.0s
[Parallel(n_jobs=-1)]: Done 299 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 732 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 1414 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 2033 tasks      | elapsed: 23.9min
[Parallel(n_jobs=-1)]: Done 2780 tasks      | elapsed: 36.8min
[Parallel(n_jobs=-1)]: Done 3000 out of 3000 | elapsed: 47.8min finished
  "(penalty={})".format(self.penalty))


{'C': 70,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'l1_ratio': 0,
 'max_iter': 11000,
 'penalty': 'l1',
 'solver': 'saga'}

In [None]:
grid_predictions = clf_random.best_estimator_.predict(X_test) 
print("Test Accuracy: ",metrics.accuracy_score(y_test,grid_predictions)*100)
print("Test Precision: ",metrics.precision_score(y_test,grid_predictions,average='weighted')*100)
print("Test Recall: ",metrics.recall_score(y_test,grid_predictions,average='weighted')*100)
print("Test f1 score: ",metrics.f1_score(y_test,grid_predictions,average='weighted')*100)
# print classification report
print(metrics.classification_report(y_test, grid_predictions))

Test Accuracy:  94.73333333333333
Test Precision:  94.30277796883479
Test Recall:  94.73333333333333
Test f1 score:  94.27972833626713
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1364
           1       0.81      0.55      0.66       136

    accuracy                           0.95      1500
   macro avg       0.88      0.77      0.81      1500
weighted avg       0.94      0.95      0.94      1500



In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# evaluate predictions
accuracy = metrics.accuracy_score(y_test, y_pred)
recall = metrics.recall_score(y_test,y_pred,average='weighted')
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("recall weighted: %.2f%%" % (recall * 100.0) )
print(metrics.classification_report(y_test, y_pred, labels=[0,1]))

scores = cross_val_score(model, finalFeatureApply, label, cv=5,scoring='recall_weighted')
print(scores*100)
print("mean recall_weighted score: ",scores.mean()*100,"%")
print("std dev: ",scores.std()*100,"%")

Accuracy: 93.60%
recall weighted: 93.60%
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      1364
           1       0.93      0.32      0.47       136

    accuracy                           0.94      1500
   macro avg       0.94      0.66      0.72      1500
weighted avg       0.94      0.94      0.92      1500

[94.4 93.5 94.6 93.4 93.5]
mean recall_weighted score:  93.88000000000002 %
std dev:  0.511468474101772 %


Comparing all 3 models, we see that Logistic Regression gives the best recall on the toxic class (0.55), ahead of the tuned SVC (0.43) and the untuned XGBoost (0.32).