# Experiment 1
## Comparing Multiple Lemmatizations
In this experiment, I successfully used the NLTK and spaCy lemmatizers. I also attempted to use Gensim, TreeTaggerWrapper and TextBlob, but ran into the issues described below.

Gensim's lemmatizer relied on Pattern, which has not been updated for several years, so the lemmatize function was removed in Gensim 4.0. The discussion can be found here: https://github.com/RaRe-Technologies/gensim/issues/2716

TreeTaggerWrapper requires several local installation steps, whereas the other libraries need only a single pip install command: first the TreeTagger package, then the tagging scripts, then the installation script and finally the parameter files. I was concerned that such a multi-step installation would not work reliably when the notebook is run on a different computer. (https://www.cis.lmu.de/~schmid/tools/TreeTagger/)

Lastly, TextBlob works with TextBlob and Word objects. A TextBlob only accepts an untokenized string and a Word only accepts a single-word string. To use TextBlob, I would have to pass in untokenized text, which would not make for a fair experiment because NLTK and spaCy lemmatize sentences tokenized by the same library (NLTK). While the documentation describes WordList objects, which contain the Words in a TextBlob, WordLists do not take a POS tag parameter, so I could not use them for lemmatization.

In [245]:
import keras
import numpy as np
import pandas as pd
import pickle
import sklearn
import tensorflow as tf

In [246]:
# File paths

# Data Directory
DATA_DIR = "data"

# Balanced datasets
BALANCED_TRAIN_DATASET = "data/balanced_dataset.pickle"
BALANCED_TEST_DATASET = "data/balanced_test_dataset.pickle"

# Preprocessed balanced test dataset
PREPROCESSED_BAL_TEST_DATASET = "data/preprocessed_test.pickle"

In [247]:
# Function to save data as a .pickle file
# Params: 
    # List or Dataframe - @data: Data to be saved as pickle
    # Str - @folder: folder name
    # Str - @file_name: file name (without the .pickle extension)
# Output: Pickle file in directory/repo 
def save_pickle(data, folder, file_name):
    with open("{0}/{1}.pickle".format(folder, file_name), 'wb') as f:
        pickle.dump(data, f)
    print(f"Saved data is stored in \'{folder}\' in the form of {file_name}.pickle")
    #pickle.dump(data, open("data/{0}.pickle".format(file_name),"wb"))

# Function to load pickle file
# Params:
    # Str - @file_path: File path of pickle file
# Output:
    # Saved object in original file type (list/dataframe)
def load_pickle(file_path):
    # Use a context manager so the file handle is always closed
    with open(file_path, "rb") as f:
        return pickle.load(f)

In [248]:
# Load datasets

# Balanced, unprocessed datasets
bal_train_df = load_pickle(BALANCED_TRAIN_DATASET)
bal_test_df = load_pickle(BALANCED_TEST_DATASET)

# Get test dataset
bal_test_dataset = load_pickle(PREPROCESSED_BAL_TEST_DATASET)

# Get train_y
bal_train_y = pd.read_pickle(BALANCED_TRAIN_DATASET)
bal_train_y = bal_train_y.drop(columns="comment_text")

# Get test_y
bal_test_y = pd.read_pickle(BALANCED_TEST_DATASET)
bal_test_y = bal_test_y.drop(columns="comment_text")

In [249]:
# Pre-processing imports
import functools
import nltk

from functools import lru_cache
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [250]:
# Pre-processing functions

# Function to clean comments in train dataset
# Params: 
#   pd dataframe    - @train_dataset: Training dataset
# Output: 
#   2D List         - @comment_list: cleaned comments
def clean_data(train_dataset):
    # Remove punctuation
    regex_str = r"[^a-zA-Z\s]"
    train_dataset['comment_text'] = train_dataset['comment_text'].replace(regex=regex_str, value="")

    # Remove extra whitespaces
    regex_space = r"\s+"
    train_dataset['comment_text'] = train_dataset['comment_text'].replace(regex=regex_space, value=" ")

    # Strip whitespaces
    train_dataset['comment_text'] = train_dataset['comment_text'].str.strip()

    # Lowercase
    train_dataset['comment_text'] = train_dataset['comment_text'].str.lower()

    # Convert comment_text column into a list
    comment_list = train_dataset['comment_text'].tolist()

    return comment_list

# Function to get NLTK POS Tagger
# Params: 
#   Str - @word: Token
# Output:
#   Str - WordNet POS constant for the token (defaults to noun)
def nltk_get_wordnet_pos(word):
    
    tag = nltk.pos_tag([word])[0][1][0].upper()

    # Convert NLTK to wordnet POS notations

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN) # Default to noun if not found

# Function to use NLTK lemmatizer
# Params: 2D List - Tokenized comments with stopwords removed
# Returns: 2D List - lemmatized tokens
def nltk_lemmatize(comment_stop):

    nltk.download('averaged_perceptron_tagger')
    comment_lemma = []
    lemmatizer = WordNetLemmatizer()
    lemmatizer_cache = lru_cache(maxsize=50000)(lemmatizer.lemmatize)

    for comment in comment_stop:
        # Lemmatize each token using the POS tag predicted for it
        comment_lemma.append([lemmatizer_cache(word, pos=nltk_get_wordnet_pos(word)) for word in comment])

    return comment_lemma

# Function to remove NLTK stopwords
# Params: 
#   2D List - @comment_token: cleaned & tokenized comments
# Output:
#   2D List - @comment_stop: cleaned tokens with stopwords removed
def nltk_stopwords(comment_token):
    # Stopwords in English only
    STOP_WORDS = set(stopwords.words('english'))

    # Remove stopwords
    comment_stop = []

    for comment in comment_token:
        
        temp_word = []

        for word in comment:
            
            if word not in STOP_WORDS:
                temp_word.append(word)

        comment_stop.append(temp_word)
    
    return comment_stop

# Function to tokenize comments using NLTK Word Tokenize
# Params: 
#   2D List - @text: cleaned comments
# Output: 
#   2D List - tokenized comments
def nltk_tokenize(text):
    return [word_tokenize(word) for word in text]

# Function for all pre-processing functions
# Params:
    # Pandas Dataframe  - @dataset: Dataset to be pre-processed (train/test)
    # Str               - @file_name: File name to save pre-processed data as pickle
# Output: Pickle file in directory/repo
def preprocess_data(dataset, file_name):

    comment_cleaned = clean_data(dataset)
    
    # NLTK Tokenize
    comment_token = nltk_tokenize(comment_cleaned)

    # Remove NLTK stopwords
    comment_stop = nltk_stopwords(comment_token)

    # NLTK Lemmatization
    comment_lemma = nltk_lemmatize(comment_stop)

    # 'folder' was undefined here; save into the data directory defined above
    save_pickle(comment_lemma, DATA_DIR, file_name)

In [251]:
# Prepare basic pre-processing steps until before lemmatization

# Train dataset
train_clean = clean_data(bal_train_df)
train_token = nltk_tokenize(train_clean)
train_stopwords = nltk_stopwords(train_token)

# Test dataset
test_clean = clean_data(bal_test_df)
test_token = nltk_tokenize(test_clean)
test_stopwords = nltk_stopwords(test_token)
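
As a minimal sketch of what the cleaning step does to a single comment, the snippet below applies the same two regular expressions with the standard-library `re` module instead of pandas. `clean_comment` and the sample string are hypothetical and exist only for illustration. It also shows why contractions such as "don't" reach the lemmatizers as `dont`.

```python
import re

def clean_comment(text: str) -> str:
    """Mirror the clean_data steps on one string (hypothetical helper)."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # drop everything except letters and whitespace
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace into one space
    return text.strip().lower()              # trim the ends, then lowercase

sample = "Don't  wish to talk, anymore!!  "
print(clean_comment(sample))  # dont wish to talk anymore
```

Because the apostrophe is removed before tokenization, neither lemmatizer ever sees the original contraction.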

# NLTK Lemmatization
The tokens were already lemmatized with NLTK during basic pre-processing, but since it is one of the lemmatization methods being compared, I am including it in the experiment.

NLTK's WordNetLemmatizer lemmatizes each token according to a part-of-speech (POS) tag, expressed as a WordNet POS constant. NLTK's own tagger (nltk.pos_tag) returns Penn Treebank tags, so we need a dictionary that maps the first letter of each tag to its WordNet equivalent.

In [252]:
# Function to get NLTK POS Tagger
# Params: 
#   Str - @word: Token
# Output:
#   Str - WordNet POS constant for the token (defaults to noun)
def nltk_get_wordnet_pos(word):
    
    tag = nltk.pos_tag([word])[0][1][0].upper()

    # Convert NLTK to wordnet POS notations
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN) # Default to noun if not found

# Function to use NLTK lemmatizer
# Params: 2D List - Tokenized comments with stopwords removed
# Returns: 2D List - lemmatized tokens
def nltk_lemmatize(comment_stop):

    nltk.download('averaged_perceptron_tagger')
    comment_lemma = []
    lemmatizer = WordNetLemmatizer()
    lemmatizer_cache = lru_cache(maxsize=50000)(lemmatizer.lemmatize)

    for comment in comment_stop:
        # Lemmatize each token using the POS tag predicted for it
        comment_lemma.append([lemmatizer_cache(word, pos=nltk_get_wordnet_pos(word)) for word in comment])

    return comment_lemma
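
The `lru_cache` wrapper used in `nltk_lemmatize` can be illustrated in isolation. The sketch below is an assumption-laden toy: `slow_lemmatize` is a hypothetical stand-in for `lemmatizer.lemmatize` (it only strips a trailing "s" and counts its own calls), so that the effect of caching is visible without NLTK installed.

```python
from functools import lru_cache

calls = 0

def slow_lemmatize(word, pos="n"):
    """Hypothetical stand-in for lemmatizer.lemmatize; counts real invocations."""
    global calls
    calls += 1
    return word.rstrip("s")  # toy rule, not a real lemmatizer

# Same wrapping pattern as in nltk_lemmatize
cached_lemmatize = lru_cache(maxsize=50000)(slow_lemmatize)

tokens = ["pages", "go", "pages", "go", "words"]
lemmas = [cached_lemmatize(token, pos="n") for token in tokens]

print(lemmas)                         # ['page', 'go', 'page', 'go', 'word']
print(calls)                          # 3: repeated tokens are served from the cache
print(cached_lemmatize.cache_info())  # hits=2, misses=3
```

Only the wrapped function is cached. In the notebook, `nltk_get_wordnet_pos` is still called once per token because it is not wrapped.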

In [253]:
# Get comments lemmatized with NLTK to compare
comments_nltk = nltk_lemmatize(train_stopwords)
print(comments_nltk[:2])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\lamxw\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jew', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality', 'deadly', 'sin', 'make', 'pentagram', 'tatoo', 'forehead', 'go', 'satanistic', 'mass', 'gay', 'pal', 'first', 'last', 'warn', 'fuck', 'gay', 'wont', 'appreciate', 'nazi', 'shwain', 'would', 'write', 'page', 'dont', 'wish', 'talk', 'anymore', 'beware', 'dark', 'side']]


# spaCy Lemmatization

In [254]:
import spacy
from spacy.tokens import Doc

In [255]:
# Load small version of spacy's language model
nlp = spacy.load("en_core_web_sm", disable=["textcat"], exclude=["parser", "ner"])

Since spaCy tokenizes text automatically as part of its pipeline, I have to bypass its tokenizer and supply the tokens produced by our own tokenizer (NLTK), so that spaCy's tokenization does not affect the result. To do this, each tokenized comment is wrapped in a spaCy Doc object.

In [256]:
# Create each tokenized comment as a spacy Doc object, then add it to a list
# Params:
#   2D List - @comment_stopwords: Tokenized comments with stopwords removed
# Output:
#   List    - @doc_list: List of Doc objects
def create_docs(comment_stopwords):

    doc_list = []

    for comment in comment_stopwords:
        # Build the Doc directly from the existing tokens so spaCy does not re-tokenize
        doc_list.append(Doc(nlp.vocab, words=list(comment)))
    
    return doc_list

In [257]:
# Create list of Docs
doc_list = create_docs(train_stopwords)
print(len(doc_list))

7132


spaCy's lemmatizer has a known bug for the word 'first': the lemma is returned as '1' prefixed with a Unicode character ('\ufeff1'). I am unsure whether any other words are affected by a similar bug.

Reference: https://github.com/explosion/spaCy/issues/6281

I have chosen to work around it by hardcoding the lemma to 'first'.

In [258]:
# Perform lemmatization with spaCy
# Params:
#   List    - @doc_list: List of Doc objects
# Output:
#   2D List - @comment_list: List of lemmatized tokens
def spacy_lemmatize(doc_list):

    comment_list = []

    for doc in doc_list:

        token_list = []

        for token in doc:

            lemma = token.lemma_

            # Map the BOM-prefixed lemma ('\ufeff1') to 'first' to work around the spaCy bug
            if lemma and ord(lemma[0]) > 127:
                lemma = 'first'
                token_list.append(lemma)

            # spaCy lemmatization returns pronouns as '-PRON-', so exclude it
            elif lemma != '-PRON-':
                token_list.append(lemma)
        
        comment_list.append(token_list)
    
    return comment_list

I attempted to resolve it by encoding and decoding the token, but the result was always '\ufeff1', even though it printed as '1'. The following cell shows this attempt with 'convert_tokens()'.

In [259]:
# Perform lemmatization with spaCy
# Params:
#   List    - @doc_list: List of Doc objects
# Output:
#   2D List - @comment_list: List of lemmatized tokens
def spacy_lemmatize_test(doc_list):

    comment_list = []

    for doc in doc_list:

        token_list = []

        #token_list.append([token.lemma_ for token in doc if token.lemma != '-PRON'])

        for token in doc:

            lemma = token.lemma_

            # spaCy lemmatization returns pronouns as '-PRON-', so exclude it
            if lemma != '-PRON-':
                token_list.append(lemma)
        
        comment_list.append(token_list)
    
    return comment_list

# Encode and decode unicode bug
# Does not work
# Params:
#   List    - @spacy_list: List of spaCy lemmatized tokens
# Output:
#   2D List - @comment_list: List of decoded tokens as required
def convert_tokens(spacy_list):

    comment_list = []

    for doc in spacy_list:

        token_list = []

        for token in doc:
            
            # If unicode
            if ord(token[0]) > 127:

                token = token.encode('utf-8')
                token = token.decode('utf-8')
                token_list.append(token)
            
            else:
                token_list.append(token)
        
        comment_list.append(token_list)
    
    return comment_list

comments_spacy_test = spacy_lemmatize_test(doc_list)
print(comments_spacy_test[:2])

# Result remains the same
comments_spacy_test = convert_tokens(comments_spacy_test)
print(comments_spacy_test[:2])

[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jews', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality', 'deadly', 'sin', 'make', 'pentagram', 'tatoo', 'forehead', 'go', 'satanistic', 'masse', 'gay', 'pal', '\ufeff1', 'last', 'warn', 'fuck', 'gay', 'wont', 'appreciate', 'nazi', 'shwain', 'would', 'write', 'page', 'do', 'wish', 'talk', 'anymore', 'beware', 'dark', 'side']]
[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jews', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality', 'deadly', 'sin', 'make', 'pentagra
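
A likely reason the round trip changes nothing, sketched here with only the standard library: `\ufeff` is U+FEFF, the byte order mark, which is a valid zero-width character. Encoding to UTF-8 and decoding again therefore reproduces it exactly, and it is invisible when printed. The snippet is an illustration under that assumption and is not part of the experiment, which keeps the 'first' workaround.

```python
token = "\ufeff1"  # the lemma seen in the spaCy output above

# A UTF-8 round trip returns an identical string, so the BOM stays in place
round_trip = token.encode("utf-8").decode("utf-8")
print(round_trip == token)             # True

# U+FEFF is zero-width, which is why print() appears to show only '1'
print(len(token), hex(ord(token[0])))  # 2 0xfeff

# Removing the character explicitly leaves only the visible part
stripped = token.replace("\ufeff", "")
print(repr(stripped))                  # '1'
```

Stripping the character would yield '1' rather than 'first', so a mapping step would still be needed to match the NLTK output.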

In [260]:
comments_spacy = spacy_lemmatize(doc_list)
print(comments_spacy[:2])

[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jews', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality', 'deadly', 'sin', 'make', 'pentagram', 'tatoo', 'forehead', 'go', 'satanistic', 'masse', 'gay', 'pal', 'first', 'last', 'warn', 'fuck', 'gay', 'wont', 'appreciate', 'nazi', 'shwain', 'would', 'write', 'page', 'do', 'wish', 'talk', 'anymore', 'beware', 'dark', 'side']]


## Compare NLTK lemmatization with spaCy's lemmatization

In [261]:
print("NLTK lemmatization")
print(comments_nltk[:2])

print("spaCy lemmatization")
print(comments_spacy[:2])

NLTK lemmatization
[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jew', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality', 'deadly', 'sin', 'make', 'pentagram', 'tatoo', 'forehead', 'go', 'satanistic', 'mass', 'gay', 'pal', 'first', 'last', 'warn', 'fuck', 'gay', 'wont', 'appreciate', 'nazi', 'shwain', 'would', 'write', 'page', 'dont', 'wish', 'talk', 'anymore', 'beware', 'dark', 'side']]
spaCy lemmatization
[['cocksucker', 'piss', 'around', 'work'], ['gay', 'antisemmitian', 'archangel', 'white', 'tiger', 'meow', 'greetingshhh', 'uh', 'two', 'way', 'erase', 'comment', 'ww', 'holocaust', 'brutally', 'slay', 'jews', 'gaysgypsysslavsanyone', 'antisemitian', 'shave', 'head', 'bald', 'go', 'skinhead', 'meeting', 'doubt', 'word', 'bible', 'homosexuality
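
To make the differences easier to spot than by reading the two lists side by side, here is a small sketch that compares tokens position by position. The two short lists are excerpts copied from the second comment in the outputs above; `diff_lemmas` is a hypothetical helper. It assumes both lists have the same length, which will not hold for comments where spaCy's '-PRON-' lemmas were dropped.

```python
# Shortened excerpts from the second comment in the two outputs above
nltk_tokens = ["slay", "jew", "satanistic", "mass", "page", "dont", "wish"]
spacy_tokens = ["slay", "jews", "satanistic", "masse", "page", "do", "wish"]

def diff_lemmas(left, right):
    """Return (index, left_token, right_token) for positions that differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(left, right)) if a != b]

for index, nltk_lemma, spacy_lemma in diff_lemmas(nltk_tokens, spacy_tokens):
    print(f"{index}: NLTK={nltk_lemma!r} spaCy={spacy_lemma!r}")
# 1: NLTK='jew' spaCy='jews'
# 3: NLTK='mass' spaCy='masse'
# 5: NLTK='dont' spaCy='do'
```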

## Fitting to Keras model

In [262]:
# Imports
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Dropout, GlobalMaxPooling1D, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [263]:
# Constants for keras model
NUM_WORDS = 20000
MAX_LEN = 100

In [264]:
# Function taken from utils
def build_model(num_words):
    EPOCHS = 30
    INIT_LR = 1e-3

    model = Sequential()

    model.add(Embedding(num_words, 128))
    model.add(Dropout(0.4))
    model.add(Conv1D(128, 7, padding="valid", activation="relu", strides=3))
    model.add(Conv1D(128, 7, padding="valid", activation="relu", strides=3))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(6, activation='softmax'))

    adam = tf.keras.optimizers.Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)

    model.compile(loss='binary_crossentropy',
                optimizer=adam,
                metrics=['accuracy'])

    return model

# Model for NLTK

In [265]:
# Use Tensorflow's Tokenizer for text featurization
tokenizer = Tokenizer(NUM_WORDS)

# Update internal vocabulary
tokenizer.fit_on_texts(comments_nltk)

# Create an integer index of each word
corpus = tokenizer.word_index

# Integer to word
reverse_corpus = dict(map(reversed, corpus.items()))

In [266]:
# Turn each word into its corresponding integer
nltk_train_x = tokenizer.texts_to_sequences(comments_nltk)

# Pad sequences
nltk_train_x = keras.preprocessing.sequence.pad_sequences(nltk_train_x, MAX_LEN)
nltk_train_x = np.array(nltk_train_x)
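
For readers unfamiliar with these two steps, the following pure-Python sketch approximates what happens with the default settings: words are indexed by descending frequency starting at 1, and each sequence is left-padded with zeros (or left-truncated) to a fixed length. `fit_word_index` and `to_padded_sequences` are hypothetical names, and this is a simplified imitation rather than the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Index words by descending frequency, starting at 1 (0 is left for padding)."""
    counts = Counter(word for tokens in texts for word in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, num_words, max_len):
    """Map tokens to integers, then left-pad or left-truncate to max_len."""
    padded = []
    for tokens in texts:
        # Unknown words and words ranked at or beyond num_words are skipped
        seq = [word_index[w] for w in tokens
               if w in word_index and word_index[w] < num_words]
        seq = seq[-max_len:]  # keep the last max_len tokens
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

toy = [["write", "page", "wish"], ["wish", "talk", "wish", "anymore"]]
index = fit_word_index(toy)
print(index)
# {'wish': 1, 'write': 2, 'page': 3, 'talk': 4, 'anymore': 5}
print(to_padded_sequences(toy, index, num_words=20000, max_len=5))
# [[0, 0, 2, 3, 1], [0, 1, 4, 1, 5]]
```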

In [267]:
model = build_model(NUM_WORDS)

model.fit(nltk_train_x, bal_train_y, batch_size=60, epochs=30)

# Save model to use for evaluation
model.save('models/exp1_nltk')

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
INFO:tensorflow:Assets written to: models/exp1_nltk\assets


In [268]:
# Prepare test_x
# No preprocessing needs to be done on the dataset as the base model already made use of NLTK's lemmatization

# Turn each word into its corresponding integer
nltk_test_x = tokenizer.texts_to_sequences(bal_test_dataset)

# Pad sequences
nltk_test_x = keras.preprocessing.sequence.pad_sequences(nltk_test_x, MAX_LEN)
nltk_test_x = np.array(nltk_test_x)

print(nltk_test_x.shape)

(6978, 100)


In [269]:
# Evaluate model for NLTK
model.evaluate(nltk_test_x, bal_test_y, batch_size=60)



[0.9349310398101807, 0.3505302369594574]

# Model for spaCy

In [270]:
# Use Tensorflow's Tokenizer for text featurization
tokenizer = Tokenizer(NUM_WORDS)

# Update internal vocabulary
tokenizer.fit_on_texts(comments_spacy)

# Create an integer index of each word
corpus = tokenizer.word_index

# Integer to word
reverse_corpus = dict(map(reversed, corpus.items()))

In [271]:
# Turn each word into its corresponding integer
spacy_train_x = tokenizer.texts_to_sequences(comments_spacy)

# Pad sequences
spacy_train_x = keras.preprocessing.sequence.pad_sequences(spacy_train_x, MAX_LEN)
spacy_train_x = np.array(spacy_train_x)

In [272]:
model = build_model(NUM_WORDS)

model.fit(spacy_train_x, bal_train_y, batch_size=60, epochs=30)

# Save model to use for evaluation
model.save('models/exp1_spacy')

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
INFO:tensorflow:Assets written to: models/exp1_spacy\assets


In [273]:
# Prepare test_x

# Pre-process test_x 
spacy_clean = clean_data(bal_test_df)
spacy_tokens = nltk_tokenize(spacy_clean)
spacy_stopwords = nltk_stopwords(spacy_tokens)
spacy_docs = create_docs(spacy_stopwords)
spacy_test_x = spacy_lemmatize(spacy_docs)

# Turn each word into its corresponding integer
spacy_test_x = tokenizer.texts_to_sequences(spacy_test_x)

# Pad sequences
spacy_test_x = keras.preprocessing.sequence.pad_sequences(spacy_test_x, MAX_LEN)
spacy_test_x = np.array(spacy_test_x)

print(spacy_test_x.shape)

(6978, 100)


In [274]:
# Evaluate model for spaCy
model.evaluate(spacy_test_x, bal_test_y, batch_size=60)



[1.0017470121383667, 0.3220120370388031]
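
The two evaluation outputs can be put side by side with a short sketch. Since the model was compiled with `metrics=['accuracy']`, each list is read here as `[loss, accuracy]`. The numbers are copied from the cells above, the `results` dictionary is introduced only for illustration, and each figure comes from a single training run.

```python
# [loss, accuracy] pairs copied from the two model.evaluate() calls above
results = {
    "NLTK": [0.9349310398101807, 0.3505302369594574],
    "spaCy": [1.0017470121383667, 0.3220120370388031],
}

for name, (loss, accuracy) in results.items():
    print(f"{name:<6} loss={loss:.4f} accuracy={accuracy:.4f}")

# Pick the lemmatizer whose model scored the higher test accuracy
best = max(results, key=lambda name: results[name][1])
gap = results["NLTK"][1] - results["spaCy"][1]
print(f"Higher test accuracy: {best} (difference of {gap:.4f})")
```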