# Wikipedia Toxicity

### DESCRIPTION

### Using NLP and machine learning, build a model to identify toxic comments from the Talk edit pages on Wikipedia. Help identify the words that make a comment toxic.

### Problem Statement:  

### Wikipedia is the world’s largest and most popular reference work on the internet, with about 500 million unique visitors per month. It also has millions of contributors who can make edits to pages. The Talk edit pages are the key community interaction forum, where the contributing community discusses and debates changes pertaining to a particular topic.

### Wikipedia continuously strives to help online discussion become more productive and respectful. You are a data scientist at Wikipedia who will help build a predictive model, using NLP and machine learning, that identifies toxic comments in the discussion and marks them for cleanup. After that, help identify the top terms from the toxic comments.

### Domain: Internet

### Analysis to be done: Build a text classification model using NLP and machine learning that detects toxic comments.

### Content: 

### id: identifier number of the comment

### comment_text: the text in the comment

### toxic: 0 (non-toxic) /1 (toxic)

## IMPORTS

In [1]:
import pandas as pd
import seaborn as sns
import scipy.stats as ss
import numpy as np
import warnings
warnings.filterwarnings(action="ignore")
import re
import matplotlib.pyplot as plt
import string

In [2]:
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')
#nltk.download('punkt')
#import spacy
#nlp = spacy.load('en')

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from nltk import word_tokenize
from collections import Counter

## Steps to perform:

### Clean up the text data, convert it to a vector space representation using TF-IDF, and use Support Vector Machines to detect toxic comments. Finally, get the list of the top 15 toxic terms from the comments identified by the model.

## Tasks: 

### 1. Load the data using the read_csv function from the pandas package

In [4]:
Wiki = pd.read_csv("train.csv")

In [5]:
Wiki.head()

                 id                                        comment_text  toxic
0  e617e2489abe9bca    "\r\n\r\n A barnstar for you! \r\n\r\n The De...      0
1  9250cf637294e09d    "\r\n\r\nThis seems unbalanced. whatever I ha...      0
2  ce1aa4592d5240ca  Marya Dzmitruk was born in Minsk, Belarus in M...      0
3  48105766ff7f075b        "\r\n\r\nTalkback\r\n\r\n Dear Celestia... "      0
4  0543d4f82e5470b6   New Categories \r\n\r\nI honestly think that w...      0


In [6]:
Wiki.shape

(5000, 3)

In [7]:
Wiki.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            5000 non-null   object
 1   comment_text  5000 non-null   object
 2   toxic         5000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 117.3+ KB


In [8]:
Wiki.isna().sum()

id              0
comment_text    0
toxic           0
dtype: int64

In [9]:
#Wiki.describe()

## 2. Get the comments into a list, for easy text cleanup and manipulation

In [10]:
Comments = Wiki['comment_text'].to_list()

In [11]:
Comments[:2]

['"\r\n\r\n A barnstar for you! \r\n\r\n  The Defender of the Wiki Barnstar I like your edit on the Kayastha page. Lets form a solidarity group against those who malign the article and its subject matter. I propose the folloing name for the group.\r\n\r\nUnited intellectuals\' front of Kayastha ethinicty against racist or castist abuse (UIFKEARCA)   "',
 '"\r\n\r\nThis seems unbalanced.  whatever I have said about Mathsci, he has said far more extreme and unpleasant things about me (not to mention others), and with much greater frequency.  I\'m more than happy to reign myself in, if that\'s what you\'d like (ruth be told, I was just trying to get Mathsci to pay attention and stop being uncivil).  I would expect you to issue the same request to Mathsci.  \r\n\r\n If this is intentionally unbalanced (for whatever reason), please let me know, and I will voluntarily close this account and move on to other things.  I like wikipedia, and I have a lot to contribute in my own way, but there is

## 3. Cleanup: 

### a) Using regular expressions, remove IP addresses

### b) Using regular expressions, remove URLs

### c) Normalize the casing

### d) Tokenize using word_tokenize from NLTK

### e) Remove stop words

### f) Remove punctuation

### g) Define a function to perform all these steps, you’ll use this later on the actual test set

In [12]:
def textPreProcess(comment_list):
    
    #Remove IPs
    comment_list_without_ip = []
    for comment in comment_list:
        comment_list_without_ip.append(re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '', comment))  # raw string avoids invalid-escape warnings

    del comment_list

    #Remove URLs
    comment_list_without_url = []
    regex_url = r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
    for comment in comment_list_without_ip:
        comment_list_without_url.append( re.sub(regex_url, '', comment))    

    del comment_list_without_ip

    #Remove Punctuation
    comment_list_without_punctuation = []
    for comment in comment_list_without_url:
        removePunctuation = [char for char in comment if char not in string.punctuation]
        modifiedcomment = ''.join(removePunctuation)
        comment_list_without_punctuation.append(modifiedcomment)

    del comment_list_without_url

    #Remove StopWords and Normalize
    english_stopwords = set(stopwords.words('english'))  # build the stop list once instead of once per word
    comment_list_without_stopwords = []
    for comment in comment_list_without_punctuation:
        words = comment.split(" ")
        wordNormalized = [word.lower() for word in words]
        finalWords = [word for word in wordNormalized if word not in english_stopwords]
        comment_list_without_stopwords.append(' '.join(finalWords))

    del comment_list_without_punctuation

    #Remove Contextual StopWords
    contextual_prefixes = ("wikipe", "wikipi", "wikipp", "edit", "page")  # str.startswith accepts a tuple
    comment_list_without_context_stopwords = []
    for comment in comment_list_without_stopwords:
        words = comment.split(" ")
        wordListWithoutContextualStopWords = [word for word in words if not word.startswith(contextual_prefixes)]
        comment_list_without_context_stopwords.append(' '.join(wordListWithoutContextualStopWords))

    del comment_list_without_stopwords

    #Tokenize using Word_Tokenizer
    sentences = ' '.join(comment for comment in comment_list_without_context_stopwords)
    words = word_tokenize(sentences)
    
    return words, comment_list_without_context_stopwords

In [13]:
wordList, commentList = textPreProcess(Comments)
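
A quick way to sanity-check the IP and URL removal is to run the patterns on a single string. The sketch below reuses the IP pattern from `textPreProcess`; the URL pattern is a simplified stand-in (an assumption, not the longer expression used in the function), and the sample sentence is made up for illustration.

```python
import re

# Same IP pattern as in textPreProcess, written as a raw string.
ip_pattern = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
# Simplified URL pattern (assumption); the notebook uses a longer expression.
url_pattern = re.compile(r"(?i)\b(?:https?://|www\.)\S+")

# Hypothetical sample comment containing a URL and a trailing IP address.
sample = "See http://example.com/page for details. 188.23.179.183"

cleaned = url_pattern.sub("", ip_pattern.sub("", sample))
cleaned = " ".join(cleaned.split())  # collapse the whitespace left behind
print(cleaned)
```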

## 4. Using a counter, find the top terms in the data. 

In [14]:
vect = CountVectorizer()
#vect.fit(train)

In [15]:
count_words = Counter(wordList)
count_words.most_common(15)

[('article', 1659),
 ('talk', 1047),
 ('please', 1033),
 ('would', 965),
 ('one', 856),
 ('like', 836),
 ('dont', 784),
 ('ass', 709),
 ('also', 657),
 ('i', 643),
 ('think', 630),
 ('fuck', 630),
 ('see', 628),
 ('know', 595),
 ('im', 561)]
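
Stop words such as 'i' still show up in the counts above. One plausible cause (an assumption, not verified against the full dataset) is that `textPreProcess` splits on a single space, so a word glued to `\r\n` is never compared against the stop list on its own. A minimal sketch, using a tiny stand-in stop list rather than NLTK's and a made-up comment:

```python
# Stand-in stop list (assumption); the notebook uses NLTK's English stop words.
stop_words = {"i", "a", "the", "to"}

# Hypothetical comment with the "\r\n\r\n" line breaks seen in the data.
comment = "Thanks\r\n\r\nI like the edit"

# Splitting on a single space keeps "\r\n\r\n" glued to the neighbouring words,
# so "thanks\r\n\r\ni" is treated as one token and "i" is never filtered.
tokens_space = [w.lower() for w in comment.split(" ")]
kept_space = [w for w in tokens_space if w not in stop_words]

# str.split() with no argument splits on any run of whitespace.
tokens_any = [w.lower() for w in comment.split()]
kept_any = [w for w in tokens_any if w not in stop_words]

print(kept_space)
print(kept_any)
```

Similarly, removing punctuation before the stop-word check turns "don't" into "dont", which would no longer match a stop-list entry spelled with the apostrophe.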

In [16]:
features = np.array(Comments)
finalWordVocab = vect.fit(features)

### a) Can any of these be considered contextual stop words? 

### b) Words like “Wikipedia”, “page”, “edit” are examples of contextual stop words

### c) If yes, drop these from the data

In [17]:
# No. These terms are important and should not be dropped: removing them could change the
# meaning and sentiment of a comment, which is not acceptable, so they stay in the list.

In [18]:
features = Wiki.iloc[:,1].values
label = Wiki.iloc[:,2].values
bagOfWords = finalWordVocab.transform(features)

In [19]:
pd.Series(label).value_counts()

0    4563
1     437
dtype: int64

## 5. Separate into train and test sets

### a) Use train-test method to divide your data into 2 sets: train and test

### b) Use a 70-30 split

In [20]:
# Implementing Question 5 and 6 at the same time.

## 6. Use TF-IDF values for the terms as features to get into a vector space model

### a) Import TF-IDF vectorizer from sklearn

### b) Instantiate with a maximum of 4000 terms in your vocabulary

### c) Fit and apply on the train set

### d) Apply on the test set
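
Steps 5 and 6 as listed can be sketched with `TfidfVectorizer`, splitting first so that the vocabulary and IDF weights are learned from the training comments only. This is a minimal illustration on a made-up corpus, not the notebook's own cells; `texts` and `labels` are hypothetical stand-ins for the cleaned comments and the `toxic` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels (hypothetical stand-ins for the cleaned comments).
texts = [
    "thanks for the helpful reply", "please see the talk section",
    "i think this source is fine", "good work on the references",
    "would you check the citation", "one more note on the summary",
    "also see the related article", "you are a stupid idiot",
    "shut up you moron", "what a stupid piece of garbage",
]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# 70-30 split; stratify keeps the toxic share similar in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=6, stratify=labels
)

vectorizer = TfidfVectorizer(max_features=4000)   # cap the vocabulary at 4000 terms
X_train = vectorizer.fit_transform(train_texts)   # fit and apply on the train set
X_test = vectorizer.transform(test_texts)         # apply only on the test set

print(X_train.shape, X_test.shape)
```

Fitting on the training part only means test-set words outside the learned vocabulary are simply ignored at transform time.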

In [21]:
#Calc IDF values
tfidfObject = TfidfTransformer().fit(bagOfWords)

#Transform data 
finalFeatureApply = tfidfObject.transform(bagOfWords)

In [22]:
X_train,X_test,y_train,y_test,indices_train,indices_test = train_test_split(finalFeatureApply,
                                                                            label,
                                                                            range(5000),
                                                                            test_size=0.3,
                                                                            random_state=6)

In [23]:
X_train.shape, X_test.shape,y_train.shape,y_test.shape

((3500, 22886), (1500, 22886), (3500,), (1500,))

## 7. Model building: Support Vector Machine

### a) Instantiate SVC from sklearn with a linear kernel

### b) Fit on the train data

### c) Make predictions for the train and the test set

In [24]:
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = SVC(gamma='auto')  # kernel is left at its default ('rbf'); step 7a asks for kernel='linear'
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

## 8. Model evaluation: Accuracy, recall, and f1_score

### a) Report the accuracy on the train set

### b) Report the recall on the train set: decent, high, or low?

### c) Get the f1_score on the train set

In [25]:
y_train_pred = clf.predict(X_train_scaled)
print("Train Accuracy: ",metrics.accuracy_score(y_train,y_train_pred)*100)
print("Train Recall: ",metrics.recall_score(y_train,y_train_pred,average='weighted')*100)
print("Train f1 score: ",metrics.f1_score(y_train,y_train_pred,average='weighted')*100)
print("Test Accuracy: ",metrics.accuracy_score(y_test,y_pred)*100)
print("Test Recall: ",metrics.recall_score(y_test,y_pred,average='weighted')*100)
print("Test f1 score: ",metrics.f1_score(y_test,y_pred,average='weighted')*100)

Train Accuracy:  92.82857142857142
Train Recall:  92.82857142857142
Train f1 score:  90.39978901353544
Test Accuracy:  90.93333333333334
Test Recall:  90.93333333333334
Test f1 score:  86.61527001862197


In [26]:
print(metrics.classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1364
           1       0.00      0.00      0.00       136

    accuracy                           0.91      1500
   macro avg       0.45      0.50      0.48      1500
weighted avg       0.83      0.91      0.87      1500
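
The test recall printed earlier equals the test accuracy because `average='weighted'` weights each class's recall by its support, which reduces to the overall fraction of correct predictions. The sketch below rebuilds labels consistent with the report above (1364 non-toxic and 136 toxic test comments, all predicted as non-toxic; this reconstruction is an assumption based on the rounded figures) and computes the metrics by hand.

```python
# Labels consistent with the report: every test comment predicted as class 0.
y_true = [0] * 1364 + [1] * 136
y_pred = [0] * 1500

def recall_for(cls):
    """Fraction of comments of class `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / y_true.count(cls)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Support-weighted mean of the per-class recalls.
weighted_recall = sum(recall_for(c) * y_true.count(c) for c in (0, 1)) / len(y_true)
toxic_recall = recall_for(1)

print(round(accuracy * 100, 2), round(weighted_recall * 100, 2), toxic_recall)
```

For this task the recall of class 1 alone is the more informative number, since it shows how many toxic comments the model actually catches.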



## 9. Looks like you need to adjust for the class imbalance, as the model seems to focus on the 0s

### a) Adjust the appropriate parameter in the SVC module
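
The parameter in question is `class_weight`. With `class_weight='balanced'`, scikit-learn scales the penalty for each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts of the full dataset shown earlier (4563 and 437); in practice the weights are derived from whatever labels are passed to `fit`, so the exact values for the training split would differ slightly.

```python
# Class counts taken from the value_counts() output for the full dataset.
counts = {0: 4563, 1: 437}

n_samples = sum(counts.values())
n_classes = len(counts)

# Formula used for class_weight='balanced'.
weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

print({cls: round(w, 3) for cls, w in weights.items()})
```

A misclassified toxic comment is therefore penalised roughly ten times as heavily as a misclassified non-toxic one, for example via `SVC(kernel='linear', class_weight='balanced')`.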


In [27]:
clf.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

## 10. Train again with the adjustment and evaluate

### a) Train the model on the train set

### b) Evaluate the predictions on the validation set: accuracy, recall, f1_score

## 11. Hyperparameter tuning

### a) Import GridSearch and StratifiedKFold (because of class imbalance)

### b) Provide the parameter grid to choose for ‘C’

### c) Use a balanced class weight while instantiating the Support Vector Classifier

In [28]:
param_grid = {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [0.1,0.01,0.001,0.0001], 
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'break_ties' : [False,True],
              'class_weight': ['balanced']
}
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1, scoring = 'recall_weighted',cv = 5) 

grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print(grid.best_estimator_)

Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   56.8s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 22.6min
[Parallel(n_jobs=-1)]: Done 800 out of 800 | elapsed: 22.8min finished


{'C': 1, 'break_ties': False, 'class_weight': 'balanced', 'gamma': 0.0001, 'kernel': 'sigmoid'}
SVC(C=1, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


## 12. Find the parameters with the best recall in cross validation

### a) Choose ‘recall’ as the metric for scoring

### b) Choose stratified 5 fold cross validation scheme

### c) Fit on the train set
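
Steps 11 and 12 as listed can be sketched as follows: a grid over `C` only, a balanced linear SVC, `scoring='recall'` (recall of the positive class, unlike `'recall_weighted'`), and an explicit `StratifiedKFold`. The data here is synthetic and the grid values are illustrative assumptions, so this is a sketch rather than a reproduction of the notebook's search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic imbalanced data (hypothetical stand-in for the TF-IDF features).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=6
)

# Stratified folds keep the minority-class share similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6)

search = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative values
    scoring="recall",                      # recall of label 1 only
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Passing an integer such as `cv=5` with a classifier also yields stratified folds; the explicit object just makes the scheme visible and allows shuffling.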

## 13. What are the best parameters?


## 14. Predict and evaluate using the best estimator

### a) Use best estimator from the grid search to make predictions on the test set

### b) What is the recall on the test set for the toxic comments?

### c) What is the f1_score?


In [29]:
grid_predictions = grid.best_estimator_.predict(X_test_scaled) 
print("Test Accuracy: ",metrics.accuracy_score(y_test,grid_predictions)*100)
print("Test Precision: ",metrics.precision_score(y_test,grid_predictions,average='weighted')*100)
print("Test Recall: ",metrics.recall_score(y_test,grid_predictions,average='weighted')*100)
print("Test f1 score: ",metrics.f1_score(y_test,grid_predictions,average='weighted')*100)
# print classification report
print(metrics.classification_report(y_test, grid_predictions))

Test Accuracy:  92.93333333333334
Test Precision:  92.05329212807966
Test Recall:  92.93333333333334
Test f1 score:  92.23728557705503
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1364
           1       0.67      0.43      0.53       136

    accuracy                           0.93      1500
   macro avg       0.81      0.71      0.74      1500
weighted avg       0.92      0.93      0.92      1500



In [30]:
type(grid_predictions)

numpy.ndarray

## 15. What are the most prominent terms in the toxic comments?

### a) Separate the comments from the test set that the model identified as toxic

### b) Make one large list of the terms

### c) Get the top 15 terms

In [31]:
df = pd.DataFrame({'Pred':pd.Series(grid_predictions),'Actual':pd.Series(y_test),'test_indices':pd.Series(indices_test)})
df.head(5)

   Pred  Actual  test_indices
0     0       0          2191
1     0       0           529
2     0       0          2541
3     0       0          2416
4     0       0          2049


In [32]:
toxic_comment_list = list(Wiki.iloc[df[df['Pred']==1.0]['test_indices'].to_list()]['comment_text'])

In [33]:
toxic_comment_list[:5]

['Piss Off \r\n\r\nSuck my dick you pussy',
 'Fuck wiki\r\n\r\nFuck this piece of shit called Wikipedia, it bullshit of misinformation and Zionist propaganda! 188.23.179.183',
 "The elephant population has tripled over the last decade... \r\n\r\nI'm not particularly fond of your level of douchebag.",
 'penis as I write this ==',
 '"\r\n\r\nATTENTION ""MIND CONTROLLED DIS INFO AGENT""  KEEP IT REAL!  YOU THOUGHT YOU COULD USE WIKIPEDIA TO MISLEAD THE PUBLIC ABOUT ELECTRONIC HARASSMENT AND IT IS JUST NOT GOING TO HAPPEN.  YOU HAVE BEEN EXPOSED ESPECIALLY BY YOU UNPROFESSIONAL REMARKS ABOVE."']

In [34]:
toxicWordList, toxicCommentList = textPreProcess(toxic_comment_list)
toxic_word_count = Counter(toxicWordList)
toxic_word_count.most_common(15)

[('suck', 361),
 ('mexicans', 356),
 ('assfuck', 277),
 ('gay', 214),
 ('fucking', 118),
 ('shit', 104),
 ('eat', 96),
 ('admins', 95),
 ('cocksucking', 94),
 ('cunts', 94),
 ('like', 20),
 ('fuck', 16),
 ('bitch', 13),
 ('dont', 12),
 ('go', 11)]
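
The sharp drop from counts in the hundreds to counts around 20 suggests that a few comments repeating the same word many times dominate the raw totals. One possible complement (a suggestion, not part of the original task) is to count the number of comments that contain each term. A minimal sketch on made-up comments:

```python
from collections import Counter

# Hypothetical cleaned comments; the first one repeats a word many times.
comments = ["spam spam spam spam rude", "rude reply", "rude spam"]

# Raw term frequency: every occurrence counts.
term_freq = Counter(word for c in comments for word in c.split())

# Document frequency: each term counts at most once per comment.
doc_freq = Counter(word for c in comments for word in set(c.split()))

print(term_freq.most_common(2))
print(doc_freq.most_common(2))
```

Here the repeated word leads the raw count, while the word that appears in every comment leads the per-comment count.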