# Toxic Comment Classification

## Topic Modeling

Topic modeling summarizes a large corpus by inferring the general themes (topics) that run through its text.

##### Steps:
* Preprocessing (Tokenization using gensim's simple_preprocess)
* Cleaning
    * Stop word removal
    * Bigram collation
    * Lemmatization
* Creation of dictionary (mapping from each unique word in the cleaned text to an integer id)
* Topic modeling using LDA
* Visualization with pyLDAvis

##### Import Essential Libraries

In [1]:
# Data manipulation
import pandas as pd 
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# LDA topic modeling
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
import gensim
from gensim.models import LdaModel
from gensim.corpora import Dictionary

# NLTK (requires the "stopwords" and "wordnet" corpora, e.g. via nltk.download)
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
eng_stopwords = set(stopwords.words("english"))


In [2]:
# Import the data
df = pd.read_csv("toxic_comments_dataset.csv")

In [3]:
df.head()

| Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


##### Data Preprocessing

In [4]:
# Tokenization: lowercase, strip accents (deacc=True), drop tokens shorter than 3 characters
def preprocess(comment):
    return gensim.utils.simple_preprocess(comment, deacc=True, min_len=3)

df_text = df.comment_text.apply(preprocess)

In [5]:
# Corpus size and a before/after example
print("Total number of comments: ",len(df_text))
print("Before preprocessing: ",df.comment_text.iloc[30])
print("After preprocessing: ",df_text.iloc[30])

Total number of comments:  159571
Before preprocessing:  How could I post before the block expires?  The funny thing is, you think I'm being uncivil!
After preprocessing:  ['how', 'could', 'post', 'before', 'the', 'block', 'expires', 'the', 'funny', 'thing', 'you', 'think', 'being', 'uncivil']
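
For intuition, the tokenization step can be approximated with the standard library alone. This is a sketch, not gensim's implementation: `simple_tokenize` and `strip_accents` are hypothetical helpers, and the behavior they mimic (lowercasing, accent stripping when `deacc=True`, splitting on non-letter characters, a default `max_len` of 15) is an assumption about what `simple_preprocess` does.

```python
import re
import unicodedata

# Runs of word characters that are not digits (letters and underscore).
_TOKEN_RE = re.compile(r"[^\W\d]+")

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def simple_tokenize(text, deacc=True, min_len=3, max_len=15):
    """Hypothetical stand-in for the tokenization step (an approximation only)."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    # Keep only tokens whose length falls inside [min_len, max_len].
    return [tok for tok in _TOKEN_RE.findall(text) if min_len <= len(tok) <= max_len]

comment = ("How could I post before the block expires?  "
           "The funny thing is, you think I'm being uncivil!")
print(simple_tokenize(comment))
```

On comment 30 the sketch yields the same token list as the cell above: short pieces such as "I", "is" and the "m" left over from "I'm" fall below `min_len=3` and are dropped.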


In [6]:
# Detect frequently co-occurring word pairs and join them: new + york --> new_york
bigram = gensim.models.Phrases(df_text)
print(bigram[df_text.iloc[30]])

['how', 'could', 'post', 'before', 'the', 'block_expires', 'the', 'funny_thing', 'you', 'think', 'being_uncivil']
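
To see why some pairs are joined and others are not, here is a toy version of bigram scoring. It assumes a score of the form `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size`, which is believed to match the default scorer in `Phrases`, but treat that as an assumption. The helpers `bigram_scores` and `merge_bigrams` are hypothetical, and the vocabulary size here counts unigrams only, which is a simplification.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs; higher means the pair co-occurs more than chance."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: unigrams only
    return {
        (a, b): (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n in bigrams.items()
    }

def merge_bigrams(sentence, scores, threshold):
    """Greedily join adjacent pairs whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2  # both words were consumed by the merged token
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Toy corpus (illustrative data, not taken from the comment dataset).
toy = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["the", "city", "is", "new"],
]
scores = bigram_scores(toy, min_count=1)
print(merge_bigrams(["i", "love", "new", "york"], scores, threshold=1.0))
```

In this toy corpus only `new york` clears the threshold. Pairs seen once score zero because `min_count` is subtracted from their count, which is what keeps rare coincidental pairs from being merged.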


In [7]:
def clean(word_list):
    # Remove stop words
    clean_words = [w for w in word_list if w not in eng_stopwords]
    # Collate bigrams
    clean_words = bigram[clean_words]
    # Lemmatize, treating each token as a verb (pos="v")
    clean_words = [lem.lemmatize(word, "v") for word in clean_words]
    return clean_words

print("Before cleaning: ",df_text.iloc[1])
print("After cleaning: ",clean(df_text.iloc[1]))

all_text = df_text.apply(clean)

Before cleaning:  ['aww', 'matches', 'this', 'background', 'colour', 'seemingly', 'stuck', 'with', 'thanks', 'talk', 'january', 'utc']
After cleaning:  ['aww', 'match', 'background', 'colour', 'seemingly', 'stick', 'thank', 'talk', 'january_utc']


In [8]:
# Create the dictionary
dictionary = Dictionary(all_text)
print("There are",len(dictionary)," words in the final dictionary")

There are 171254  words in the final dictionary


In [9]:
# Convert a document into (token_id, count) tuples using doc2bow
print(dictionary.doc2bow(all_text.iloc[1]))
print("Wordlist from the sentence: ",all_text.iloc[1])

# Look the ids back up in the dictionary (ids 21..29 belong to this comment)
print("Wordlist from the dictionary lookup: ",
      *(dictionary[i] for i in range(21, 30)))

corpus = [dictionary.doc2bow(text) for text in all_text]

[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]
Wordlist from the sentence:  ['aww', 'match', 'background', 'colour', 'seemingly', 'stick', 'thank', 'talk', 'january_utc']
Wordlist from the dictionary lookup:  aww background colour january_utc match seemingly stick talk thank
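
The bag-of-words encoding can be reproduced in a few lines of plain Python. This is a minimal sketch with hypothetical helpers (`build_vocab`, `to_bow`), not gensim's code. It assumes new tokens are numbered in alphabetical order within each document, which is consistent with the output above (ids 21 to 29 map to the comment's words in sorted order).

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, document by document."""
    token2id = {}
    for doc in docs:
        # Assumption: unseen tokens in a document receive ids in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy documents (illustrative only).
docs = [
    ["thank", "talk", "thank"],
    ["aww", "match", "talk"],
]
token2id = build_vocab(docs)
print(token2id)
print(to_bow(docs[0], token2id))
print(to_bow(["aww", "unknown", "aww"], token2id))
```

Word order is discarded and only counts survive. That is the representation LDA consumes, and it explains why the dictionary lookup prints the words alphabetically rather than in sentence order.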


##### LDA Topic Modeling

In [13]:
# Create the LDA model
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [14]:
pyLDAvis.enable_notebook()

In [15]:
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)