# Applying Machine Learning To The 20 Newsgroups Dataset

The 20 Newsgroups dataset is a corpus of text data downloaded from Usenet groups about 20 years ago that has become a standard in Natural Language Processing projects. The goal of this project is to train a Machine Learning model to predict which Newsgroup each entry in the dataset belongs to. Three types of models were tested: a simple Multinomial Naive Bayes classifier, an Artificial Neural Network and a Convolutional Neural Network. As the dataset is considered small by today's standards, the Easy Data Augmentation method was also tested to see whether it improves the accuracy of each model.

### Import the libraries needed for Machine Learning

In [100]:
import pandas as pd
import numpy as np
from scipy import stats

import os
import re
import glob
import random
from random import shuffle
import pickle

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Uncomment these if you are running for the first time
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('tagsets')
# nltk.download('wordnet')

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.datasets import _twenty_newsgroups
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline

from geneticalgorithm import geneticalgorithm as ga

from keras.models import Sequential
from keras.models import load_model
from keras import layers
from keras.layers.recurrent import LSTM
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import GridSearchCV

from pypet import Environment, cartesian_product, pypetconstants
from pypet.parameter import ArrayParameter

import smart_open
from gensim.models import Word2Vec

from IPython.display import display
from matplotlib import pyplot as plt

get_ipython().run_line_magic("matplotlib", "inline")

## Import and process the 20 Newsgroups Dataset

The data is stored in 20 text files, one per Newsgroup, which are imported into a single pandas dataframe. Details about the import can be found in the other Jupyter Notebook, ```20_Newsgroups_Exploration.ipynb```.

In [101]:
# Importing Newsgroups Manually
cwd = os.getcwd()
files = glob.glob(cwd + "/data/raw/*.txt")

# Dictionary to contain the data
raw_data_dict = {}

# For each newsgroup file
for file in files:
    # Save the category (the file name without its extension) to be used as the dict key
    category = os.path.splitext(os.path.basename(file))[0]

    # Open the file and save the contents to the dict
    with open(file, "rb") as datafile:
        raw_data_dict[category] = datafile.read().decode("iso-8859-1")

# Create a pandas dataframe to contain the data with two columns
data_frame = pd.DataFrame(columns=["entry", "newsgroup"])

# For each Newsgroup
for key in raw_data_dict.keys():
    # Define the Newsgroup line
    divider = "Newsgroup: " + key
    # print(divider)

    # Split the data using the line as the divider
    # It is assumed that each unique entry has the Newsgroup in its header
    raw_data_dict[key] = raw_data_dict[key].split(divider)

    temp_list = []
    doc_id_list = []
    missing_doc_id = 0

    # Filter out any duplicates based on document_id
    for i in range(len(raw_data_dict[key])):
        # Split the entry into a list
        doc_list = raw_data_dict[key][i].split("\n")

        doc_id = False
        doc_index = []

        # Isolate and save the list element with the document_id
        for j, elem in enumerate(doc_list):
            # Sometimes document_id can use either an upper or lower case d
            if "ocument_id: " in elem:
                doc_index.append(j)
                doc_id = elem.split("ocument_id: ")[1]

        # If the entry has a document_id
        if doc_id:
            # And if the document_id is not a duplicate
            if doc_id not in doc_id_list:
                # Strip the document_id and save the entry to a list
                temp_list.append(raw_data_dict[key][i].split(doc_list[doc_index[0]])[1])
                # Add the document_id to the list of encountered document_ids
                doc_id_list.append(doc_id)

        # Otherwise log that the document_id was missing
        else:
            if len(doc_list) > 1:
                missing_doc_id += 1

    # Append the entry list to the dataframe with its corresponding Newsgroup
    temp_frame = pd.DataFrame(
        list(zip(temp_list, [key] * len(temp_list))), columns=["entry", "newsgroup"]
    )
    data_frame = data_frame.append(temp_frame)

print("\nResulting Dataframe:")
display(data_frame.describe())


Resulting Dataframe:


|        | entry | newsgroup |
|--------|-------|-----------|
| count  | 18828 | 18828 |
| unique | 18828 | 20 |
| top    | \nFrom: enis@cbnewsg.cb.att.com (enis.surensoy... | rec.sport.hockey |
| freq   | 1 | 999 |


## Cleaning the data

The first step in preparing the data for modeling is to clean it. The most basic text cleaning steps are:

- Removing any punctuation
- Setting all words to lower case
- Removing stop words such as ```the```, ```to```, ```and```, etc. 
   
Cleaning reduces the amount of data and its complexity so the model has an easier time processing it. Words in the dataset can also be stemmed or lemmatized to reduce the complexity further. Stemming reduces a word to its stem so that, for example, ```car```, ```cars```, ```car's```, and ```cars'``` all become ```car```. Stemming algorithms can be quite crude, however, and the stemmed result may not be a valid word (e.g. ```something``` reduces to ```someth```). Lemmatization is similar to stemming but in most cases returns a valid word and tries to reduce a word according to its type (noun, verb, etc.). After some testing, the function ```strip_stopwords_lem``` was chosen as the main text cleaning function, as it produces a clean dataset that works well with the later text processing.
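The cleaning and stemming steps described above can be sketched on a single sentence. This is a minimal illustration, not the notebook's cleaning function: `STOP_WORDS` is a tiny stand-in list assumed here so that the example runs without downloading the NLTK stop word corpus, and `clean` is a hypothetical helper name.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny stand-in stop word list (assumption; the notebook uses NLTK's full English list)
STOP_WORDS = {"the", "to", "and", "a", "is", "of"}


def clean(text):
    # Replace anything that is not a letter with a space, lower-case, and split into words
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    # Drop the stop words
    return [word for word in words if word not in STOP_WORDS]


ps = PorterStemmer()
tokens = clean("The cars, and something to drive!")
stems = [ps.stem(word) for word in tokens]

print(tokens)
print(stems)
```

Comparing the two printed lists shows the trade-off: the plural is reduced to a valid word, while ```something``` is cut down to a stem that is not a word.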

In [102]:
# Function to remove stop words and clean a given string
def strip_stopwords(text):
    # Remove punctuation and numbers, set all words to lower case and create a list of the words
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()
    text = text.split()

    # Build the set of stop words once so membership checks are fast
    all_stopwords = set(stopwords.words("english"))

    # Remove any stop words
    text = [word for word in text if word not in all_stopwords]

    # Recreate the string and return
    return " ".join(text)


# Function to remove stop words, stem words, and clean a given string
def strip_stopwords_stem(text):
    # Remove punctuation and numbers, set all words to lower case and create a list of the words
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()
    text = text.split()

    # Initialize the Porter stemmer and build the set of stop words once
    ps = PorterStemmer()
    all_stopwords = set(stopwords.words("english"))

    # Stem all the words in the list and remove any stop words
    text = [ps.stem(word) for word in text if word not in all_stopwords]

    # Recreate the string and return
    return " ".join(text)


# Function to remove stop words, lemmatize words, and clean a given string
def strip_stopwords_lem(text):
    # Remove punctuation and numbers, set all words to lower case and create a list of the words
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()
    text = text.split()

    # Initialize the lemmatizer and build the set of stop words once
    lem = WordNetLemmatizer()
    all_stopwords = set(stopwords.words("english"))

    # Lemmatize all the words in the list and remove any stop words
    text = [lem.lemmatize(word) for word in text if word not in all_stopwords]

    # Recreate the string and return
    return " ".join(text)

In [103]:
# Remove the stop words, punctuation and set all words to lower case
data_frame["entry-s"] = data_frame["entry"].apply(strip_stopwords_lem)

## Set up a Baseline Model

A simple Multinomial Naive Bayes (MNB) model is used as a baseline against which to compare changes made to the dataset while preparing it for more complex machine learning techniques. A Naive Bayes model is a probabilistic model that, given an input, predicts the probability that it belongs to each output class. The model makes the naive assumption that all features are independent given the class, which in most cases is not true. Even though the assumption does not always hold, Naive Bayes models have been shown to give good results on even complex classification problems. The model was picked over other simple models because it is quick to compute and has few hyperparameters, so it was used here without tuning.
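The naive independence assumption can be written out as a decision rule. The notation is introduced here for illustration only: $c$ is a Newsgroup, $w_1, \dots, w_n$ are the words (features) of an entry, and $\hat{c}$ is the predicted Newsgroup.

$$
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

Each word contributes its own factor $P(w_i \mid c)$ regardless of the other words in the entry, which is what makes the model quick to train.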

Before the data can be processed by the model it must first be vectorized, that is, converted to a list of numbers. At first, this is done using the simple CountVectorizer, which turns each entry into a vector with one element per unique word in the dataset, holding the number of times that word occurs in the entry. For example, ```The dog chased the cat``` becomes ```[2, 1, 1, 1]```, which corresponds to ```[the, dog, chased, cat]```.
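The example sentence can be run through CountVectorizer directly. This is a minimal sketch on a one-entry "dataset". Note that scikit-learn lower-cases the text and orders the vocabulary alphabetically, so the columns come out in a different order from the illustration above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The dog chased the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()[0]

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)   # ['cat', 'chased', 'dog', 'the']
print(counts)  # [1 1 1 2]
```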

The model is first trained on the raw data before text cleaning, as a baseline. 

In [104]:
# Define a table to keep track of how well models perform
score_table = pd.DataFrame()

In [105]:
# Function to show the top features used by a classifier to make its predictions
def show_top10(pipeline, categories):

    vectorizer = pipeline.named_steps["vectorizer"]
    classifier = pipeline.named_steps["classifier"]

    # Save the feature names from the vectorizer as an array
    feature_names = np.asarray(vectorizer.get_feature_names())

    # Define a pandas dataframe to return
    data_table = pd.DataFrame()

    # For each category used in training
    for i, category in enumerate(categories):
        # Extract the top 10 features from the classifier
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]

        # Append the features to a row of a pandas dataframe
        data_row = pd.Series(feature_names[top10], name=category)
        data_table = data_table.append(data_row)

    # Reset the index to start from 1 and return the dataframe
    data_table = data_table.T
    data_table.index = data_table.index + 1
    return data_table.T

#### Multinomial Naive Bayes using CountVectorizer on the raw data

In [106]:
# Multinomial NB using CountVectorizer on the raw data
MultiNB_Count = Pipeline(
    [("vectorizer", CountVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups
entries_train, entries_test, y_train, y_test = train_test_split(
    entries, y, test_size=0.2, random_state=1234
)

print("Multinomial Naive Bayes with CountVectorizer:")
MultiNB_Count.fit(entries_train, y_train)

# Print the score
train_score = MultiNB_Count.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Count.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Count-Raw", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Count, sorted(set(data_frame["newsgroup"])))
# display(top_10)

Multinomial Naive Bayes with CountVectorizer:
Training Accuracy: 0.9234
Testing Accuracy:  0.8505


As you can see, the model's accuracy is initially quite high at ~85% on the testing data. The gap to the ~92% training accuracy indicates that some overfitting might be occurring.

#### Multinomial Naive Bayes using TfidfVectorizer on the raw data

In [107]:
# Multinomial NB using TfidfVectorizer on the raw data
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups
entries_train, entries_test, y_train, y_test = train_test_split(
    entries, y, test_size=0.2, random_state=1234
)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf-Raw", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

Multinomial Naive Bayes with TfidfVectorizer:
Training Accuracy: 0.9276
Testing Accuracy:  0.8555


Instead of CountVectorizer, TfidfVectorizer was tested on the raw data to see what effect it has on the accuracy. Like CountVectorizer, TfidfVectorizer (Term Frequency Inverse Document Frequency) counts the words in a given entry, but it scales each count down according to how many entries in the dataset contain the word. This penalizes words that appear often across the dataset and highlights words that appear in only a few entries.

TfidfVectorizer gives a marginal improvement in testing accuracy (~85.6% against ~85.1%), while the gap to the training accuracy remains, meaning that there is still some overfitting of the data.
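The penalizing effect of TF-IDF can be seen on a toy corpus. This is a minimal sketch with three made-up entries (assumptions for illustration, not taken from the dataset): ```the``` appears in every entry, ```cat``` in two, and ```dog``` in one, and each occurs once in the first entry.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog chased cat", "the cat sat", "the bird sang"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Look up the weights that the first entry receives for each word
first_row = tfidf[0].toarray().ravel()
weights = {word: first_row[idx] for word, idx in vectorizer.vocabulary_.items()}

for word in ["the", "cat", "dog"]:
    print(word, round(weights[word], 3))
```

Although all three words have the same raw count in the first entry, the word found in every entry receives the lowest weight and the word unique to that entry receives the highest.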

#### Multinomial Naive Bayes using TfidfVectorizer on the data with basic cleaning

In [108]:
# Multinomial NB using TfidfVectorizer on the data without stop words
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry-s"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups
entries_train, entries_test, y_train, y_test = train_test_split(
    entries, y, test_size=0.2, random_state=1234
)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf-s", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

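print("Classification Report:")
y_pred = MultiNB_Tfidf.predict(entries_test)
# Note: classification_report expects (y_true, y_pred). The predictions are passed
# first here, so the "support" column counts predictions per class and the
# precision and recall columns are swapped relative to the usual convention.
print(classification_report(y_pred, y_test))

Multinomial Naive Bayes with TfidfVectorizer: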
Training Accuracy: 0.9421
Testing Accuracy:  0.8733
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.81      0.87      0.84       141
           comp.graphics       0.79      0.85      0.82       187
 comp.os.ms-windows.misc       0.88      0.79      0.83       199
comp.sys.ibm.pc.hardware       0.88      0.78      0.82       233
   comp.sys.mac.hardware       0.83      0.91      0.87       181
          comp.windows.x       0.85      0.93      0.89       176
            misc.forsale       0.74      0.92      0.82       155
               rec.autos       0.92      0.91      0.92       202
         rec.motorcycles       0.96      0.95      0.95       206
      rec.sport.baseball       0.96      0.98      0.97       218
        rec.sport.hockey       0.98      0.95      0.96       215
               sci.crypt       0.97      0.88      0.92       229
         sci.electro

After removing the stop words from the data there was a further improvement, with the model now achieving over 87% accuracy overall and scores above 95% for some topics! For such a simple model this appears too good to be true, so the data needs further processing to make the predictions more realistic.

## Further Data Cleaning

One possible cause of such high accuracy is the email addresses in the dataset: perhaps the model is identifying repeat users rather than the substance of the text. To eliminate this possibility, all email addresses are identified and removed from the dataset.

In [109]:
def strip_email(text):
    return " ".join([i for i in text.split(" ") if "@" not in i])
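One detail of this approach is worth illustrating: because the text is split on spaces only, a token is everything between two spaces, so text joined to an address by a newline is dropped along with it. The sketch below reproduces the function next to a hypothetical regex-based variant, `strip_email_regex`, which is an assumption introduced here for comparison and not the method used in this notebook. The sample string is made up.

```python
import re


def strip_email(text):
    # Same logic as above: drop every space-delimited token that contains "@"
    return " ".join([i for i in text.split(" ") if "@" not in i])


def strip_email_regex(text):
    # Hypothetical alternative: remove only the run of non-whitespace characters around each "@"
    return re.sub(r"\S*@\S*", "", text)


sample = "Thanks\njane@example.com wrote this"

print(repr(strip_email(sample)))        # 'wrote this'
print(repr(strip_email_regex(sample)))  # 'Thanks\n wrote this'
```

For this notebook the difference is small, since the text is cleaned again afterwards, but it explains why a few extra words can disappear next to an address.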

In [110]:
# Print the number of emails in the dataset
data_frame_email = data_frame[data_frame["entry"].str.contains("@", na=False)]
print("Number of entries with emails: %i" % len(data_frame_email))

# Remove emails
data_frame["entry-e"] = data_frame["entry"].apply(strip_email)

# Print any remaining emails in the dataset
data_frame_email = data_frame[data_frame["entry-e"].str.contains("@", na=False)]
print("Number of entries with emails: %i" % len(data_frame_email))

# Remove the stop words, punctuation and set all words to lower case
data_frame["entry-es"] = data_frame["entry-e"].apply(strip_stopwords_lem)

Number of entries with emails: 18816
Number of entries with emails: 0


#### Multinomial Naive Bayes using TfidfVectorizer on clean data without emails

In [111]:
# Multinomial NB using TfidfVectorizer on the data without emails or stop words
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry-es"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups
entries_train, entries_test, y_train, y_test = train_test_split(
    entries, y, test_size=0.2, random_state=1234
)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf-es", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

Multinomial Naive Bayes with TfidfVectorizer:
Training Accuracy: 0.9289
Testing Accuracy:  0.8609


#### Removing Headers, Footers and Quotes

After removing the email addresses, the accuracy dropped, but only marginally. Further research showed that the 20 Newsgroups dataset is actually built into scikit-learn, along with functions to remove the header, footer and quote blocks from the entries. The reasoning for removing these features is that they are too similar between Newsgroups, and the classifier has an easier time learning them than learning to identify the substance of the topics being discussed. The functions to remove the header, footer and quote blocks are applied to the data, as well as the function to remove any remaining email addresses. The text is then cleaned using the ```strip_stopwords_lem``` function from before.
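What the three scikit-learn functions do can be sketched on a small made-up message. The message text is an assumption for illustration. The functions live in a private module (`_twenty_newsgroups`), so their exact behavior may differ between scikit-learn versions.

```python
from sklearn.datasets import _twenty_newsgroups as tn

# A made-up post with a header, a quoted reply and a signature
message = (
    "From: someone@example.com (Some One)\n"
    "Subject: Re: test\n"
    "\n"
    "In article <123@example.com> other@example.com writes:\n"
    "> quoted line\n"
    "My own reply text.\n"
    "\n"
    "--\n"
    "Some One, signature"
)

# Apply the functions in the same order as the notebook: header, quotes, footer
body = tn.strip_newsgroup_header(message)
body = tn.strip_newsgroup_quoting(body)
body = tn.strip_newsgroup_footer(body)

print(repr(body))
```

When the dataset is loaded through scikit-learn itself, the public `fetch_20newsgroups` loader accepts a `remove=("headers", "footers", "quotes")` argument that applies the same stripping.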

In [112]:
# Remove headers using scikit-learn function
data_frame["entry-h"] = data_frame["entry"].apply(
    _twenty_newsgroups.strip_newsgroup_header
)

# Remove quotes using scikit-learn function
data_frame["entry-hq"] = data_frame["entry-h"].apply(
    _twenty_newsgroups.strip_newsgroup_quoting
)

# Remove footers using scikit-learn function
data_frame["entry-hqf"] = data_frame["entry-hq"].apply(
    _twenty_newsgroups.strip_newsgroup_footer
)

# Remove remaining emails
data_frame["entry-hqfe"] = data_frame["entry-hqf"].apply(strip_email)

# Remove the stop words, punctuation and set all words to lower case
data_frame["entry-hqfes"] = data_frame["entry-hqfe"].apply(strip_stopwords_lem)

#### Example Entry

In [113]:
# Print an entry before removing the header, footer, quote blocks and emails
print("Entry before removing the header, footer and quote block")
print(data_frame["entry"].iloc[0])

# Print it after removing those features
print("\n\nentry after removing the header, footer and quote block")
print(data_frame["entry-hqfe"].iloc[0])

Entry before removing the header, footer and quote block

From: et@teal.csn.org (Eric H. Taylor)
Subject: Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise

In article <C4KvJF.4qo@well.sf.ca.us> metares@well.sf.ca.us (Tom Van Flandern) writes:
>crb7q@kelvin.seas.Virginia.EDU (Cameron Randale Bass) writes:
>> Bruce.Scott@launchpad.unc.edu (Bruce Scott) writes:
>>> "Existence" is undefined unless it is synonymous with "observable" in
>>> physics.
>> [crb] Dong ....  Dong ....  Dong ....  Do I hear the death-knell of
>> string theory?
>
>     I agree.  You can add "dark matter" and quarks and a lot of other
>unobservable, purely theoretical constructs in physics to that list,
>including the omni-present "black holes."
>
>     Will Bruce argue that their existence can be inferred from theory
>alone?  Then what about my original criticism, when I said "Curvature
>can only exist relative to something non-curved"?  Bruce replied:
>"'Existence' is undefined unless it 

As you can see, removing these features significantly reduces the size of a given entry. While the scikit-learn functions work well in most cases, they are not perfect. In the case above, ```strip_newsgroup_footer``` failed to remove the footer, removing only the dashed line from the bottom. In most of the other examples explored the functions worked well, so applying them should make a difference to the overall accuracy score.

#### Multinomial Naive Bayes using TfidfVectorizer on clean data without headers, footers, quotes or emails

In [114]:
# Multinomial NB using TfidfVectorizer on the data without headers, footers, quotes, emails or stop words
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry-hqfes"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups
entries_train, entries_test, y_train, y_test = train_test_split(
    entries, y, test_size=0.2, random_state=1234
)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf-hqfes", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

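print("Classification Report:")
y_pred = MultiNB_Tfidf.predict(entries_test)
# Note: classification_report expects (y_true, y_pred). The predictions are passed
# first here, so the "support" column counts predictions per class and the
# precision and recall columns are swapped relative to the usual convention.
print(classification_report(y_pred, y_test))

Multinomial Naive Bayes with TfidfVectorizer: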
Training Accuracy: 0.8602
Testing Accuracy:  0.7286
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.32      0.75      0.45        65
           comp.graphics       0.69      0.66      0.67       212
 comp.os.ms-windows.misc       0.73      0.66      0.69       200
comp.sys.ibm.pc.hardware       0.80      0.64      0.71       255
   comp.sys.mac.hardware       0.67      0.84      0.75       159
          comp.windows.x       0.80      0.83      0.82       184
            misc.forsale       0.72      0.80      0.76       176
               rec.autos       0.78      0.79      0.78       197
         rec.motorcycles       0.73      0.90      0.81       167
      rec.sport.baseball       0.88      0.93      0.91       210
        rec.sport.hockey       0.88      0.91      0.89       202
               sci.crypt       0.84      0.69      0.76       252
         sci.electro

The accuracy score has now dropped significantly, to ~73% overall. This appears to be a more realistic score given the simplicity of the model. Looking at the classification report, some topics are still identified easily, with scores of over 90%, whereas the model struggles with others. The largest drops are in Newsgroups with similar topics, such as the ```alt.atheism``` and ```talk.religion``` groups, where the model failed to identify more than one entry as from the ```talk.religion``` group. This makes sense, as the topics discussed in these groups are very similar and so the model has a hard time distinguishing between them.
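When reading the report, keep in mind that the cell above calls `classification_report(y_pred, y_test)`, while scikit-learn's metric functions expect the true labels first. The toy labels below (made up for illustration) show the consequence for a single class: swapping the arguments swaps precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: three true "a" entries, one of which is predicted as "b"
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

# Conventional argument order: (y_true, y_pred)
precision_a = precision_score(y_true, y_pred, pos_label="a")
recall_a = recall_score(y_true, y_pred, pos_label="a")

# Swapped argument order, as in classification_report(y_pred, y_test)
swapped_precision_a = precision_score(y_pred, y_true, pos_label="a")

print(precision_a, recall_a, swapped_precision_a)
```

The overall accuracy is unaffected by the argument order, so the ~73% figure stands. Only the per-class precision, recall and support columns need to be read with the swap in mind.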

In [115]:
display(score_table)

| Model | Accuracy Score |
|-------|----------------|
| MultiNB_Count-Raw | 0.850505 |
| MultiNB_Tfidf-Raw | 0.85555 |
| MultiNB_Tfidf-s | 0.87334 |
| MultiNB_Tfidf-es | 0.86086 |
| MultiNB_Tfidf-hqfes | 0.728625 |


Looking at the final scores after each step of preprocessing, switching to TfidfVectorizer and removing stop words each marginally improved the score, removing email addresses lowered it slightly, and the last step produced a significant but expected reduction. ~73% accuracy is now the baseline to beat when applying more sophisticated techniques.

## Implementing EDA

Easy Data Augmentation (EDA) is the process of augmenting the entries of a dataset in order to improve text classification models. Inspired by similar processes used in machine vision projects, the idea is that adding noisy copies of existing entries increases the size of the dataset, giving the model more data to work with. The process was introduced by Jason Wei and Kai Zou in their paper ```EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks``` (which can be found here: https://arxiv.org/pdf/1901.11196.pdf). They use four different methods to randomly apply noise to a dataset:
- Synonym Replacement (SR): Words are randomly chosen and replaced with one of their synonyms, with the synonym itself being chosen at random
- Random Insertion (RI): A synonym of a word in the entry is inserted in a random position
- Random Swap (RS): Two words are randomly selected and their positions swapped
- Random Deletion (RD): Words are randomly deleted from an entry

New entries are created by randomly applying one of these techniques to an existing entry and adding the augmented entry to the dataset. The paper reports that on a small dataset, EDA can have a similar effect to using a dataset twice the size. EDA is not built into any data science toolkits, so it has been implemented below by modifying the code that the authors used in their paper.
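The overall augmentation loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this notebook: `augment`, `random_deletion` and `random_swap_once` are hypothetical helpers, only two simplified techniques are included so that the example needs no WordNet data, and the deletion probability `p` is an arbitrary choice.

```python
import random


def random_deletion(text, p=0.1, rng=random):
    # Drop each word with probability p, always keeping at least one word
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [word for word in words if rng.random() > p]
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


def random_swap_once(text, rng=random):
    # Swap the positions of two randomly chosen words
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment(entries, labels, techniques, rng=random):
    # Keep the originals and add one augmented copy of each entry with the same label
    new_entries, new_labels = list(entries), list(labels)
    for entry, label in zip(entries, labels):
        technique = rng.choice(techniques)
        new_entries.append(technique(entry, rng=rng))
        new_labels.append(label)
    return new_entries, new_labels


rng = random.Random(0)
entries = ["space becomes curved near large body", "goalie saved every shot tonight"]
labels = ["sci.space", "rec.sport.hockey"]

aug_entries, aug_labels = augment(
    entries, labels, [random_deletion, random_swap_once], rng=rng
)
print(len(aug_entries), aug_labels)
```

Passing the techniques in as a list keeps the loop independent of which augmentations are available, so synonym-based techniques can be added to the same list.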

#### Synonym Replacement (SR)

In [116]:
# Function to randomly replace words in a string with their synonyms
def synonym_replacement(words, n):

    # Split the words into a list
    words = words.split(" ")

    # Create a copy of the word list
    new_words = words.copy()

    # Create a randomly shuffled list of the unique words in the entry
    random_word_list = list(set(words))
    random.shuffle(random_word_list)

    num_replaced = 0

    # For each word in the random word list
    for random_word in random_word_list:
        # Find its synonyms
        synonyms = get_synonyms(random_word)

        # If there are synonyms for the word
        if len(synonyms) >= 1:
            # Choose a random synonym
            synonym = random.choice(list(synonyms))

            # Replace each instance of the random word with its synonym
            new_words = [synonym if word == random_word else word for word in new_words]

            # print("\treplaced", random_word, "with", synonym)

            # Increment the counter
            num_replaced += 1

        # Stop once n words have been replaced
        if num_replaced >= n:
            break

    return " ".join(new_words)


# Function to return a list of synonyms for a given word
def get_synonyms(word):

    # Define the set of synonyms
    synonyms = set()

    # For each synset (group of synonymous senses) of the word
    for syn in wordnet.synsets(word):
        # For each lemma in the synset
        for l in syn.lemmas():
            # Normalize the lemma name, keeping only lowercase letters and spaces
            synonym = l.name().replace("_", " ").replace("-", " ").lower()
            synonym = "".join(
                [char for char in synonym if char in " qwertyuiopasdfghjklzxcvbnm"]
            )
            synonyms.add(synonym)

    # If the word itself is in the set of synonyms then remove it
    if word in synonyms:
        synonyms.remove(word)

    # Return the set of synonyms as a list
    return list(synonyms)


test = data_frame["entry-hqfe"].iloc[0]
test = strip_stopwords_lem(test)

print("Test before synonym replacement:")
print(test)
print("\nTest after synonym replacement:")
print(synonym_replacement(test, 5))

Test before synonym replacement:
hold space cannot curved simple reason property property speak dealing matter filling space say presence large body space becomes curved equivalent stating something act upon nothing one refuse subscribe view nikola tesla et tesla year ahead time perhaps time come

Test after synonym replacement:
hold space cannot curved simple reasonableness attribute attribute speak dealing matter filling space say presence large body space becomes curved equivalent state something dissemble upon nothing one defy subscribe view nikola tesla et tesla year ahead time perhaps time come


#### Random Insertion (RI)

In [117]:
# Function to randomly insert a synonym for a word in a string into that string n times
def random_insertion(words, n):

    # Split the words into a list
    words = words.split(" ")

    # Create a copy of the word list
    new_words = words.copy()

    # For the number of words to be inserted
    for _ in range(n):
        # Add a new word to the entry
        add_word(new_words)

    return " ".join(new_words)


# Function to randomly insert a synonym for a word in a string into that string
def add_word(new_words):

    # Define the synonyms list and counter
    synonyms = []
    counter = 0

    # While the number of synonyms is less than 1
    while len(synonyms) < 1:
        # Choose a random word
        random_word = new_words[random.randint(0, len(new_words) - 1)]

        # Find its synonyms and increment the counter
        synonyms = get_synonyms(random_word)
        counter += 1

        # If the counter reaches 10 before finding a synonym, return nothing
        if counter >= 10:
            return

    # Pick the first synonym in the list
    random_synonym = synonyms[0]

    # Pick a random spot
    random_idx = random.randint(0, len(new_words) - 1)

    # Insert the synonym into the random spot
    new_words.insert(random_idx, random_synonym)

    # print('\tInserting %s at %i' % (random_synonym, random_idx))


test = data_frame["entry-hqfe"].iloc[0]
test = strip_stopwords_lem(test)

print("Test before random insertion:")
print(test)
print("\nTest after random insertion:")
print(random_insertion(test, 5))

Test before random insertion:
hold space cannot curved simple reason property property speak dealing matter filling space say presence large body space becomes curved equivalent stating something act upon nothing one refuse subscribe view nikola tesla et tesla year ahead time perhaps time come

Test after random insertion:
hold space cannot curved simple reason property playact property speak dealing physical structure matter filling space say presence large body space becomes curved quadriceps femoris equivalent stating something act upon distribute nothing one refuse subscribe view nikola tesla et tesla year ahead quad time perhaps time come


#### Random Swap (RS)

In [118]:
# Function to randomly swap two words in a string n times
def random_swap(words, n):

    # Split the words into a list
    words = words.split(" ")

    # Create a copy of the word list
    new_words = words.copy()

    # For the number of words to be swapped
    for _ in range(n):
        # Swap words
        new_words = swap_word(new_words)

    return " ".join(new_words)


# Function to randomly swap two words in a string
def swap_word(new_words):

    # Pick a random spot in the entry and define a counter
    random_idx_1 = random.randint(0, len(new_words) - 1)
    random_idx_2 = random_idx_1
    counter = 0

    # While the random spots are the same
    while random_idx_2 == random_idx_1:

        # Pick another random spot and increment the counter
        random_idx_2 = random.randint(0, len(new_words) - 1)
        counter += 1

        # If a different spot has not been found after more than 3 attempts, return the list of words unchanged
        if counter > 3:
            return new_words

    # Swap the two words at the two random spots and return
    new_words[random_idx_1], new_words[random_idx_2] = (
        new_words[random_idx_2],
        new_words[random_idx_1],
    )
    return new_words


test = data_frame["entry-hqfe"].iloc[0]
test = strip_stopwords_lem(test)

print("Test before random swap:")
print(test)
print("\nTest after random swap:")
print(random_swap(test, 5))

Test before random swap:
hold space cannot curved simple reason property property speak dealing matter filling space say presence large body space becomes curved equivalent stating something act upon nothing one refuse subscribe view nikola tesla et tesla year ahead time perhaps time come

Test after random swap:
hold space cannot curved simple reason something property speak dealing view filling space say presence large body curved becomes space equivalent stating property act upon nothing one refuse nikola matter subscribe tesla et tesla year ahead time come time perhaps


#### Random Deletion (RD)

In [119]:
# Function to randomly delete words in a string with a given probability p
def random_deletion(words, p):

    # Split the words into a list
    words = words.split(" ")

    # If there's only one word, don't delete it and return
    if len(words) == 1:
        return " ".join(words)

    new_words = []

    # For each word in the list of words
    for word in words:
        # Pick a random number between 0 and 1
        r = random.uniform(0, 1)

        # If the number is greater than p, add the word to the new list
        if r > p:
            new_words.append(word)

    # If you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return words[rand_int]

    return " ".join(new_words)


test = data_frame["entry-hqfe"].iloc[0]
test = strip_stopwords_lem(test)

print("Test before random deletion:")
print(test)
print("\nTest after random deletion:")
print(random_deletion(test, 0.1))

Test before random deletion:
hold space cannot curved simple reason property property speak dealing matter filling space say presence large body space becomes curved equivalent stating something act upon nothing one refuse subscribe view nikola tesla et tesla year ahead time perhaps time come

Test after random deletion:
hold space cannot simple reason property property speak dealing filling space presence body space becomes curved equivalent stating something act upon nothing one refuse subscribe nikola tesla et tesla year ahead time perhaps time come


#### EDA Function

The EDA function takes an entry and its Newsgroup and returns a given number of augmented entries, each labelled with that Newsgroup. The function also has four inputs, one for each EDA technique, which control the fraction of words to be replaced, inserted or swapped and the probability that a word is deleted. These inputs are set to the defaults found in the EDA paper.

In [120]:
# Function to apply the EDA method to a given entry and its newsgroup
def eda(
    entry, newsgroup, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9
):

    # Calculate the number of words in the entry
    num_words = len(entry.split(" "))

    augmented_entries = []

    # Calculate the number of new entries per technique (min 1)
    num_new_per_technique = int(num_aug / 4) + 1

    # Synonym Replacement
    if alpha_sr > 0:
        # Calculate the number of words to replace (min 1)
        n_sr = max(1, int(alpha_sr * num_words))

        # For the number of augments
        for _ in range(num_new_per_technique):

            # Replace a number of words in the entry and add the new entry to the list
            a_words = synonym_replacement(entry, n_sr)
            augmented_entries.append(a_words)

    # Random Insertion
    if alpha_ri > 0:
        # Calculate the number of words to insert (min 1)
        n_ri = max(1, int(alpha_ri * num_words))

        # For the number of augments
        for _ in range(num_new_per_technique):

            # Insert a number of words in the entry and add the new entry to the list
            a_words = random_insertion(entry, n_ri)
            augmented_entries.append(a_words)

    # Random Swap
    if alpha_rs > 0:
        # Calculate the number of words to swap (min 1)
        n_rs = max(1, int(alpha_rs * num_words))

        # For the number of augments
        for _ in range(num_new_per_technique):

            # Swap a number of words in the entry and add the new entry to the list
            a_words = random_swap(entry, n_rs)
            augmented_entries.append(a_words)

    # Random Deletion
    if p_rd > 0:

        # For the number of augments
        for _ in range(num_new_per_technique):

            # Delete a percentage of words in the entry and add the new entry to the list
            a_words = random_deletion(entry, p_rd)
            augmented_entries.append(a_words)

    shuffle(augmented_entries)

    # trim so that we have the desired number of augmented sentences
    if num_aug >= 1:
        augmented_entries = augmented_entries[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_entries)
        augmented_entries = [
            s for s in augmented_entries if random.uniform(0, 1) < keep_prob
        ]

    # append the original sentence
    augmented_entries.append(entry)

    # Return the list of entries as a pandas dataframe with its newsgroup
    return pd.DataFrame(
        list(zip(augmented_entries, [newsgroup] * len(augmented_entries))),
        columns=["entry", "newsgroup"],
    )
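
As a rough illustration of the bookkeeping inside `eda`, the sketch below reproduces only its counting logic: how many entries each technique generates, how many words are changed per entry, and how many entries are returned after trimming. The helper name `eda_counts` is hypothetical and is not part of the notebook. It assumes all four techniques are enabled and `num_aug >= 1`.

```python
def eda_counts(num_words, alpha=0.1, num_aug=9, techniques=4):
    """Mirror the counting logic of eda() without touching any text."""
    # Entries generated by each enabled technique
    per_technique = int(num_aug / 4) + 1
    # Words changed per augmented entry (at least one)
    n_changed = max(1, int(alpha * num_words))
    # Entries generated before trimming
    generated = per_technique * techniques
    # Entries kept after trimming to num_aug, plus the original entry
    returned = min(num_aug, generated) + 1
    return per_technique, n_changed, generated, returned


print(eda_counts(num_words=40))
```

With the default settings, a 40-word entry yields 3 entries per technique (12 in total), which are shuffled and trimmed to 9 before the original entry is appended.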

#### Apply EDA and split the dataset into training and test sets

In [121]:
# Function to split a dataset into training and testing groups and then apply EDA to the training data
def eda_split(
    entries,
    y,
    test_size=0.2,
    random_state=1234,
    alpha_sr=0.1,
    alpha_ri=0.1,
    alpha_rs=0.1,
    p_rd=0.1,
    num_aug=9,
):
    eda_list = []

    # Split the data into test and training groups
    X_train, X_test, y_train, y_test = train_test_split(
        entries, y, test_size=test_size, random_state=random_state
    )

    # For each entry in the training list
    for i, entry in enumerate(X_train):
        # Apply the EDA method to the entry and append
        # the resulting augmented entries to a list
        eda_list.append(
            eda(
                entry,
                y_train[i],
                alpha_sr=alpha_sr,
                alpha_ri=alpha_ri,
                alpha_rs=alpha_rs,
                p_rd=p_rd,
                num_aug=num_aug,
            )
        )

    # Concatenate the dataframes, shuffle the rows and then extract the entries and newsgroups
    eda_data_frame = pd.concat(eda_list)
    eda_data_frame = eda_data_frame.sample(frac=1)
    X_train_eda = eda_data_frame["entry"].values
    y_train_eda = eda_data_frame["newsgroup"].values

    return X_train_eda, X_test, y_train_eda, y_test

## Baseline Model using EDA

#### Multinomial Naive Bayes using TfidfVectorizer on cleaned data after applying EDA

In [122]:
# Multinomial NB using TfidfVectorizer on the data without stop words
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry-hqfes"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups applying EDA to the training group
entries_train, entries_test, y_train, y_test = eda_split(entries, y)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)
score = MultiNB_Tfidf.score(entries_test, y_test)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf_EDA-hqfes", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

print("Classification Report:")
y_pred = MultiNB_Tfidf.predict(entries_test)
# Note: classification_report expects (y_true, y_pred). With the arguments in
# this order, the precision and recall columns of the report are swapped and
# the support column counts predictions rather than true labels.
print(classification_report(y_pred, y_test))

Multinomial Naive Bayes with TfidfVectorizer:
Training Accuracy: 0.9341
Testing Accuracy:  0.7693
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.56      0.70      0.62       122
           comp.graphics       0.72      0.70      0.71       207
 comp.os.ms-windows.misc       0.74      0.70      0.72       190
comp.sys.ibm.pc.hardware       0.80      0.67      0.73       248
   comp.sys.mac.hardware       0.72      0.82      0.77       174
          comp.windows.x       0.81      0.83      0.82       187
            misc.forsale       0.71      0.82      0.76       168
               rec.autos       0.81      0.81      0.81       199
         rec.motorcycles       0.79      0.87      0.83       185
      rec.sport.baseball       0.92      0.93      0.93       220
        rec.sport.hockey       0.88      0.93      0.90       200
               sci.crypt       0.84      0.78      0.81       224
         sci.electro

#### EDA Parameter Space Exploration - Grid Search

Using the pypet toolkit, the parameter space for the EDA method can be explored with a grid search. This involves creating an EDA dataset for each combination of the parameters and running the baseline Multinomial Naive Bayes model on the resulting dataset. At about two minutes per run and with 3750 combinations (5 values for each of the four EDA parameters and 6 values for the number of augmented entries), a full search would take roughly 125 hours, so it has not been executed yet. The parameter space could be explored if the EDA function were optimized or parallelized, or if this script were run on a more powerful computer.
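
The size of this grid and the resulting runtime can be checked with a short sketch. It uses the standard library's `itertools.product` rather than pypet, and the two-minute figure is the approximate per-run time quoted above, so the total is only an estimate.

```python
from itertools import product

param_space = {
    "alpha_sr": [0.0, 0.1, 0.2, 0.3, 0.4],
    "alpha_ri": [0.0, 0.1, 0.2, 0.3, 0.4],
    "alpha_rs": [0.0, 0.1, 0.2, 0.3, 0.4],
    "p_rd": [0.0, 0.1, 0.2, 0.3, 0.4],
    "num_aug": [2, 4, 6, 8, 10, 12],
}

# Every combination of the parameter values (the Cartesian product)
combinations = list(product(*param_space.values()))
n_runs = len(combinations)

minutes_per_run = 2  # approximate figure quoted in the text
total_hours = n_runs * minutes_per_run / 60

print(n_runs, "runs, about", total_hours, "hours")
```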

In [123]:
def eda_run(traj):

    # Multinomial NB using TfidfVectorizer on the data without stop words
    MultiNB_Tfidf = Pipeline(
        [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())],
        verbose=False,
    )

    # Split the data into test and training groups applying EDA to the training group
    entries_train, entries_test, y_train, y_test = eda_split(
        traj.entries,
        traj.y,
        alpha_sr=traj.alpha_sr,
        alpha_ri=traj.alpha_ri,
        alpha_rs=traj.alpha_rs,
        p_rd=traj.p_rd,
        num_aug=traj.num_aug,
    )

    MultiNB_Tfidf.fit(entries_train, y_train)
    score = MultiNB_Tfidf.score(entries_test, y_test)

    traj.f_add_result("score", score, comment="Result of NB")


# Create an environment that handles running our simulation
env = Environment(
    trajectory="EDA_Hyper",
    filename="./HDF/eda_hyper.hdf5",
    file_title="EDA_Hyper",
    overwrite_file=True,
    comment="Exploring the hyperparameters of the EDA method",
)

# Get the trajectory from the environment
traj = env.trajectory

# Add parameters
traj.f_add_parameter(
    "entries", list(data_frame["entry-hqfes"].values), comment="entries"
)
traj.f_add_parameter("y", list(data_frame["newsgroup"].values), comment="y values")

traj.f_add_parameter("alpha_sr", 0.0, comment="Synonym Replacement alpha")
traj.f_add_parameter("alpha_ri", 0.0, comment="Random Insertion alpha")
traj.f_add_parameter("alpha_rs", 0.0, comment="Random Swap alpha")
traj.f_add_parameter("p_rd", 0.0, comment="Random Deletion probability")
traj.f_add_parameter("num_aug", 0, comment="Number of Augmented Entries")

param_space = {
    "alpha_sr": [0.0, 0.1, 0.2, 0.3, 0.4],
    "alpha_ri": [0.0, 0.1, 0.2, 0.3, 0.4],
    "alpha_rs": [0.0, 0.1, 0.2, 0.3, 0.4],
    "p_rd": [0.0, 0.1, 0.2, 0.3, 0.4],
    "num_aug": [2, 4, 6, 8, 10, 12],
}

# space_product = cartesian_product(param_space)
# print(len(space_product['alpha_sr']))

# Explore the parameters with a cartesian product
traj.f_explore(cartesian_product(param_space))

# Run the simulation with all parameter combinations
# env.run(eda_run)

MainProcess pypet.storageservice.HDF5StorageService INFO     I will use the hdf5 file `./HDF/eda_hyper.hdf5`.
MainProcess pypet.storageservice.HDF5StorageService INFO     You specified ``overwrite_file=True``, so I deleted the file `./HDF/eda_hyper.hdf5`.
MainProcess pypet.environment.Environment INFO     Environment initialized.


#### EDA Parameter Space Exploration - Genetic Algorithm

Alternatively, the parameter space for the EDA method could be explored using a genetic algorithm (GA). A GA takes a set of random parameter combinations, called a population, and compares them to one another to find the best-performing individuals. The parameters, also called genes, that make up the best individuals are then passed on to the next generation, where the process repeats. Before the genes are passed on, random fluctuations called mutations are applied to them to add some variability to the next generation. By iterating through this process of selection and mutation, the algorithm should converge on an optimum result. This can be quicker than a coarse grid search like the one set up above, but a full run is still expected to take several days, so only a coarse test has been run so far. The geneticalgorithm library, which implements an elitist genetic algorithm, is used rather than implementing a GA from scratch.
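
To make the selection and mutation loop concrete, here is a minimal toy GA written from scratch. It is not the geneticalgorithm library's implementation. The objective `toy_fitness`, its target values and the function `run_toy_ga` are made up for illustration, and lower fitness is better, matching the `-score` convention used in the cell below.

```python
import random


def toy_fitness(genes):
    # Hypothetical objective: distance from an arbitrary target (lower is better)
    target = [3, 1, 4, 1, 5]
    return sum(abs(g - t) for g, t in zip(genes, target))


def run_toy_ga(bounds, pop_size=20, generations=50, mutation_prob=0.1, seed=0):
    rng = random.Random(seed)

    def random_individual():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    population = [random_individual() for _ in range(pop_size)]
    history = []

    for _ in range(generations):
        # Selection: rank by fitness and keep the best third as parents
        population.sort(key=toy_fitness)
        history.append(toy_fitness(population[0]))
        parents = population[: max(2, pop_size // 3)]

        # Elitism: the best individual survives unchanged
        next_gen = [population[0][:]]

        while len(next_gen) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from either parent
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Mutation: occasionally replace a gene with a random value
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_prob:
                    child[i] = rng.randint(lo, hi)
            next_gen.append(child)

        population = next_gen

    best = min(population, key=toy_fitness)
    return best, history


bounds = [(0, 5), (0, 5), (0, 5), (0, 5), (2, 12)]
best, history = run_toy_ga(bounds)
print(best, toy_fitness(best))
```

Because the best individual is always carried over, the best fitness recorded in `history` can never get worse from one generation to the next, which is the property that makes an elitist GA converge steadily.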

In [124]:
# Global parameters which the EDA function uses. They cannot be given to the function
# directly as it must only accept an array of numbers for the GA library to work
entries = data_frame["entry-hqfes"].values
y = data_frame["newsgroup"].values

# Function to create and train an EDA dataset for given parameters 
def eda_run(X):
    
    # Multinomial NB using TfidfVectorizer on the data without stop words
    MultiNB_Tfidf = Pipeline(
        [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())],
        verbose=False,
    )

    # Split the data into test and training groups applying EDA to the training group
    entries_train, entries_test, y_train, y_test = eda_split(
        entries,
        y,
        alpha_sr=X[0]/10,
        alpha_ri=X[1]/10,
        alpha_rs=X[2]/10,
        p_rd=X[3]/10,
        num_aug=int(X[4]),
    )

    MultiNB_Tfidf.fit(entries_train, y_train)
    score = MultiNB_Tfidf.score(entries_test, y_test)
    
    # print(X, score)
    
    return -score


# Define the parameter space for each parameter 
varbound = np.array([[0, 5], [0, 5], [0, 5], [0, 5], [2, 12]])
vartype = np.array([["int"], ["int"], ["int"], ["int"], ["int"]])

# Set the parameters for the algorithm
algorithm_param = {
    "max_num_iteration": 100,
    "population_size": 20,
    "mutation_probability": 0.1,
    "elit_ratio": 0.01,
    "crossover_probability": 0.5,
    "parents_portion": 0.3,
    "crossover_type": "uniform",
    "max_iteration_without_improv": 5,
}

# Define the algorithm and execute it
model = ga(
    function=eda_run,
    dimension=5,
    variable_type_mixed=vartype,
    variable_boundaries=varbound,
    algorithm_parameters=algorithm_param,
    function_timeout=600
)

# model.run()

After running a coarse GA test the algorithm began to converge on a result of:

- 20% Synonym Replacement 
- 50% Random Insertion  
- 10% Random Swap
- 40% Random Deletion Chance 
- 11 Augmented Sentences

A more in-depth test could be carried out, but these values are tested on the baseline model below to see whether they improve the result.

#### Multinomial Naive Bayes using TfidfVectorizer on cleaned data after applying optimized EDA

In [125]:
# Multinomial NB using TfidfVectorizer on the data without stop words
MultiNB_Tfidf = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())], verbose=False
)

entries = data_frame["entry-hqfes"].values
y = data_frame["newsgroup"].values

# Split the data into test and training groups applying EDA to the training group
entries_train, entries_test, y_train, y_test = eda_split(
    entries, y, alpha_sr=0.2, alpha_ri=0.5, alpha_rs=0.1, p_rd=0.4, num_aug=11
)

print("Multinomial Naive Bayes with TfidfVectorizer:")
MultiNB_Tfidf.fit(entries_train, y_train)
score = MultiNB_Tfidf.score(entries_test, y_test)

# Print the score
train_score = MultiNB_Tfidf.score(entries_train, y_train)
print("Training Accuracy: {:.4f}".format(train_score))
test_score = MultiNB_Tfidf.score(entries_test, y_test)
print("Testing Accuracy:  {:.4f}".format(test_score))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_score, name="MultiNB_Tfidf_oEDA-hqfes", index=["Accuracy Score"])
)

top_10 = show_top10(MultiNB_Tfidf, sorted(set(data_frame["newsgroup"])))
# display(top_10)

Multinomial Naive Bayes with TfidfVectorizer:
Training Accuracy: 0.9234
Testing Accuracy:  0.7685


In [126]:
display(score_table)

| | Accuracy Score |
|---|---|
| MultiNB_Count-Raw | 0.850505 |
| MultiNB_Tfidf-Raw | 0.85555 |
| MultiNB_Tfidf-s | 0.87334 |
| MultiNB_Tfidf-es | 0.86086 |
| MultiNB_Tfidf-hqfes | 0.728625 |
| MultiNB_Tfidf_EDA-hqfes | 0.769251 |
| MultiNB_Tfidf_oEDA-hqfes | 0.768455 |


Using the EDA method, the accuracy of the baseline model increased from ~73% to ~77%. This is a notable increase, suggesting that the original dataset was indeed small and that a more accurate result can be achieved with more entries. The EDA parameters suggested by the genetic algorithm gave essentially the same score as the default parameters (76.8% versus 76.9%). Only a coarse exploration of the EDA parameter space was carried out using the genetic algorithm; in the future, a more thorough exploration should be conducted to see if any further improvements can be made.

## Deep Learning Models

Now that an adequate score has been achieved using the baseline model with the text preprocessing done so far, the augmented dataset can be used to train more sophisticated machine learning models. Two types of Deep Learning model are tested: a simple Artificial Neural Network and a Convolutional Neural Network. A Recurrent Neural Network was also implemented but could not be tested, as it was too computationally expensive to train.

#### Define the dataset 

Since training takes a long time, the resulting models are saved after training. The problem with saving the models is that the data they are trained on is randomly generated, so the dataset would be different the next time a model is loaded. To fix this issue, the data is saved along with the models and loaded again the next time that a model is used.

In [127]:
# Shuffle the dataframe and extract the entries and Newsgroups
data_frame_shuffle = data_frame.sample(frac=1, random_state=1234)
entries = data_frame_shuffle["entry-hqfes"].values
y = data_frame_shuffle["newsgroup"].values

# Define a percentage of the dataset to run on
# This is used to speed up processing during development and should be set to 1 for a full run
percentage = 1
data_len = int(len(entries) * percentage)
entries = entries[:data_len]
y = y[:data_len]

# Label encode the Newsgroups
encoder = LabelEncoder()
y_label = encoder.fit_transform(y)

# The label encoded Newsgroups are then one hot encoded
encoder = OneHotEncoder(sparse=False)
y_label = y_label.reshape((len(y_label), 1))
y_ohe = encoder.fit_transform(y_label)

# Print the number of Newsgroups in the dataset
# This checks that the slice contains all the Newsgroups
print("Number of Newsgroups sanity check:", len(set(y)))

# Define the number of epochs, the patience of the early stopping algorithm
# and the batch size for all models
epochs = 100
patience = max(2, epochs // 10)
batch = 128

# A flag used in the below cells to dictate when a network is imported or trained from scratch
import_model = False

entries_filename = "data/entries" + str(percentage * 100) + ".pickle"
eda_filename = "data/entries_eda" + str(percentage * 100) + ".pickle"

# If import_model is true and the data has been saved before then import the data
if import_model and os.path.exists(entries_filename):
    with open(entries_filename, "rb") as entryfile:
        entries_train, entries_test, y_train, y_test = pickle.load(entryfile)

    with open(eda_filename, "rb") as edafile:
        entries_train_eda, entries_test_eda, y_train_eda, y_test_eda = pickle.load(
            edafile
        )

# Otherwise split the data into training and test sets and pickle them
else:
    # Split the data into test and training groups
    entries_train, entries_test, y_train, y_test = train_test_split(
        entries, y_ohe, test_size=0.2, random_state=1234
    )

    # Split the data into test and training groups using EDA
    entries_train_eda, entries_test_eda, y_train_eda, y_test_eda = eda_split(
        entries, y_ohe, alpha_sr=0.2, alpha_ri=0.5, alpha_rs=0.1, p_rd=0.4, num_aug=11,
    )

    with open(entries_filename, "wb") as entryfile:
        data = (entries_train, entries_test, y_train, y_test)
        pickle.dump(data, entryfile, protocol=pickle.HIGHEST_PROTOCOL)

    with open(eda_filename, "wb") as edafile:
        eda_data = (entries_train_eda, entries_test_eda, y_train_eda, y_test_eda)
        pickle.dump(eda_data, edafile, protocol=pickle.HIGHEST_PROTOCOL)

Number of Newsgroups sanity check: 20


### Artificial Neural Network

The first model to be tested is a simple Artificial Neural Network (ANN). An ANN is a network of connected nodes called neurons which work together to relate a given input to a predicted output. The neurons are arranged in layers, and the output of one neuron can be the input to multiple neurons in the next layer. Each neuron applies an activation function to the weighted sum of its inputs, which dictates how strongly the neuron is 'activated'. By adjusting the weights of the connections between neurons, an ANN can be trained to recognize patterns in data, such as identifying subjects in photos or, as in this case, classifying texts.
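
As a minimal sketch of what a layer of neurons computes, the pure-Python example below runs one forward pass through a tiny network with a ReLU hidden layer and a softmax output. The weights, biases and inputs are made-up numbers for illustration, and the helper `dense_layer` is hypothetical. Keras performs the same arithmetic with matrices.

```python
import math


def relu(x):
    return max(0.0, x)


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def softmax(values):
    # Subtract the maximum for numerical stability
    largest = max(values)
    shifted = [math.exp(v - largest) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]


inputs = [0.5, 0.0, 1.0]
hidden = dense_layer(inputs, [[1.0, 2.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 0.1], relu)
scores = dense_layer(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], lambda x: x)
probabilities = softmax(scores)

print(hidden)
print(probabilities)
```

The first hidden neuron receives a negative weighted sum, so ReLU leaves it inactive, while the softmax turns the two output scores into probabilities that sum to one.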

In order to train the network, the Newsgroup categories must first be one hot encoded. One hot encoding converts each category into a binary vector with a single 1 in the position belonging to that category, so that each of the network's outputs corresponds to one Newsgroup. For this model, TfidfVectorizer is again used to encode the entries.
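
One hot encoding can be sketched in a few lines of plain Python. The function `one_hot_encode` below is a hypothetical stand-in for the LabelEncoder and OneHotEncoder pair used in this notebook, shown on three Newsgroup names for brevity.

```python
def one_hot_encode(labels):
    """Return the sorted class names and one binary vector per label."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = []
    for label in labels:
        vec = [0] * len(classes)
        vec[index[label]] = 1  # a single 1 marks the label's class
        vectors.append(vec)
    return classes, vectors


classes, vectors = one_hot_encode(
    ["sci.space", "rec.autos", "sci.space", "alt.atheism"]
)
print(classes)
print(vectors)
```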

The input to the neural network has a length of the order of 75000 (the size of the dataset vocabulary) and the output has length 20, one output for each Newsgroup. Initially a single hidden layer with 1000 neurons was chosen based on a rule of thumb which says:

    number of hidden nodes ~ sqrt(input layer nodes * output layer nodes)

but these hyperparameters will be tuned to get the best result. The model is first trained on the cleaned data and then trained again on the EDA data to see what effect it has on the accuracy.
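
As a quick check of the rule of thumb, the approximate sizes quoted above can be plugged in directly. The input size is the rounded figure from the text, not the exact vocabulary size.

```python
import math

input_nodes = 75000  # approximate vocabulary size quoted above
output_nodes = 20  # one output per Newsgroup

hidden_nodes = math.sqrt(input_nodes * output_nodes)
print(round(hidden_nodes))
```

This gives a little over 1200 hidden nodes, which is the same order of magnitude as the 1000 neurons used as a starting point.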

In [128]:
# Fit the vectorizer to the training data
vectorizer = TfidfVectorizer()
vectorizer.fit(entries_train)

# Transform the data using the training data fit
X_train = vectorizer.transform(entries_train)
X_test = vectorizer.transform(entries_test)

# The indices of the sparse matrices are then sorted
X_train.sort_indices()
X_test.sort_indices()

input_dim = X_train.shape[1]  # Number of features

Note: The training and test data are returned as sparse matrices, initially with their indices out of order. The TensorFlow backend to Keras will not run when the indices are out of order, so they are sorted using ```sort_indices()```. Sorting only changes the order in which the non-zero values of each row are stored; the matrix itself is unchanged, so this has no effect on the overall accuracy.
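
The pure-Python sketch below illustrates why sorting the stored indices is harmless. It models one row of a sparse matrix as a list of column indices and a list of values, which is a simplified assumption about the storage format. The helpers `to_dense` and `sort_row` are hypothetical and are not scipy functions.

```python
def to_dense(indices, data, n_cols):
    """Expand one sparse row (column indices + values) into a dense list."""
    row = [0.0] * n_cols
    for col, value in zip(indices, data):
        row[col] = value
    return row


def sort_row(indices, data):
    """Sort a sparse row's stored entries by column index."""
    pairs = sorted(zip(indices, data))
    return [c for c, _ in pairs], [v for _, v in pairs]


indices, data = [4, 0, 2], [0.3, 0.7, 0.1]
sorted_indices, sorted_data = sort_row(indices, data)

print(sorted_indices, sorted_data)
print(to_dense(indices, data, 5) == to_dense(sorted_indices, sorted_data, 5))
```

Each value stays attached to its column index, so the dense row is identical before and after sorting.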

In [129]:
# Define the model
def ANN_model(input_dim, neurons=1000, hidden=1):
    model = Sequential()

    for i in range(hidden):
        model.add(layers.Dense(neurons, input_dim=input_dim, activation="relu"))

    model.add(layers.Dense(20, activation="softmax"))
    model.compile(
        loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    # model.summary()

    return model


# Define EarlyStopping function
es = EarlyStopping(monitor="loss", mode="min", patience=patience)

#### Hyperparameter Tuning

There are multiple hyperparameters to choose when defining a machine learning model. A straightforward way to optimize a model is to test multiple values for each hyperparameter and see which set gives the best results. In the case of an ANN, the number of hidden layers in the network and the number of neurons in each layer can be tuned. Here, 1 to 3 hidden layers were tested, each containing 100 to 1500 neurons in steps of 100. Training each model takes about 20 minutes on my local PC using the CPU, so the hyperparameter tuning was done on Google Colab, where a more powerful GPU could be used. The best results from Google Colab are then used to train a model on this system.
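
The size of this search can be estimated with a short sketch. It assumes every combination of layer count and neuron count is trained once and that each takes the roughly 20 minutes quoted above, so the total is only indicative of the cost on the local CPU.

```python
layer_options = [1, 2, 3]
neuron_options = list(range(100, 1501, 100))

# One configuration per (number of hidden layers, neurons per layer) pair
configs = [(n_layers, n_neurons) for n_layers in layer_options for n_neurons in neuron_options]

minutes_per_model = 20  # approximate CPU training time quoted in the text
total_hours = len(configs) * minutes_per_model / 60

print(len(configs), "configurations, about", total_hours, "hours on CPU")
```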

#### Artificial Neural Network using TfidfVectorizer on cleaned data

In [130]:
model = ANN_model(input_dim, neurons=250, hidden=1)
model.summary()

model_name = "ANN_Tfidf-hqfes_" + str(epochs) + str(batch) + str(percentage * 100)

if import_model and os.path.exists("models/" + model_name):
    model = load_model("models/" + model_name)
else:
    # fit network
    model.fit(
        X_train,
        y_train,
        epochs=epochs,
        verbose=True,
        batch_size=batch,
        shuffle=True,
        callbacks=[es],
    )
    model.save("models/" + model_name)

# Print the score
train_loss, train_accuracy = model.evaluate(
    X_train, y_train, batch_size=128, verbose=False
)
print("Training Accuracy: {:.4f}".format(train_accuracy))
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(test_accuracy))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_accuracy, name=model_name, index=["Accuracy Score"])
)

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_10 (Dense)             (None, 250)               15722500  
_________________________________________________________________
dense_11 (Dense)             (None, 20)                5020      
Total params: 15,727,520
Trainable params: 15,727,520
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100 ... Epoch 39/100 (per-epoch progress lines omitted; the captured log is cut off after epoch 39)

MainProcess tensorflow INFO     Assets written to: models/ANN_Tfidf-hqfes_100128100/assets


Training Accuracy: 0.9944
Testing Accuracy:  0.7650


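After extensive testing, ANNs with a single hidden layer and between 100 and 500 neurons all performed similarly, so a value of 250 was chosen. This model achieves a testing accuracy of ~76.5%, which is about as good as the MNB model using EDA. The training accuracy of ~99.4% is much higher, which suggests that the network is overfitting the training data.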

#### Artificial Neural Network using TfidfVectorizer on cleaned EDA data

Now that the optimum values for the network have been chosen, the model can be trained again on the data after applying EDA to see what effect the method has on the result.

In [131]:
# Fit the vectorizer to the training data
vectorizer = TfidfVectorizer()
vectorizer.fit(entries_train_eda)

# Transform the data using the training data fit
X_train = vectorizer.transform(entries_train_eda)
X_test = vectorizer.transform(entries_test_eda)

# The indices of the sparse matrices are then sorted
X_train.sort_indices()
X_test.sort_indices()

input_dim = X_train.shape[1]  # Number of features

# eda_split passes the one-hot labels through a DataFrame column, so they come back as an
# object array of row vectors and must be converted to a 2D numpy array before fitting
y_train_eda = np.array([np.array(x) for x in y_train_eda])

In [132]:
# Define the model
model = ANN_model(input_dim, neurons=250, hidden=1)
model.summary()

model_name = "ANN_Tfidf_oEDA-hqfes_" + str(epochs) + str(batch) + str(percentage * 100)

if import_model and os.path.exists("models/" + model_name):
    model = load_model("models/" + model_name)
else:
    # fit network
    model.fit(
        X_train,
        y_train_eda,
        epochs=epochs,
        verbose=True,
        batch_size=batch,
        shuffle=True,
        callbacks=[es],
    )
    model.save("models/" + model_name)

# Print the score
train_loss, train_accuracy = model.evaluate(
    X_train, y_train_eda, batch_size=128, verbose=False
)
print("Training Accuracy: {:.4f}".format(train_accuracy))
test_loss, test_accuracy = model.evaluate(X_test, y_test_eda, verbose=False)
print("Testing Accuracy:  {:.4f}".format(test_accuracy))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_accuracy, name=model_name, index=["Accuracy Score"])
)

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 250)               18794750  
_________________________________________________________________
dense_13 (Dense)             (None, 20)                5020      
Total params: 18,799,770
Trainable params: 18,799,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100 ... Epoch 39/100 (per-epoch progress lines omitted; the captured log is cut off after epoch 39)

MainProcess tensorflow INFO     Assets written to: models/ANN_Tfidf_oEDA-hqfes_100128100/assets


Training Accuracy: 0.9947
Testing Accuracy:  0.7350


In [133]:
display(score_table)

| | Accuracy Score |
|---|---|
| MultiNB_Count-Raw | 0.850505 |
| MultiNB_Tfidf-Raw | 0.85555 |
| MultiNB_Tfidf-s | 0.87334 |
| MultiNB_Tfidf-es | 0.86086 |
| MultiNB_Tfidf-hqfes | 0.728625 |
| MultiNB_Tfidf_EDA-hqfes | 0.769251 |
| MultiNB_Tfidf_oEDA-hqfes | 0.768455 |
| ANN_Tfidf-hqfes_100128100 | 0.765003 |
| ANN_Tfidf_oEDA-hqfes_100128100 | 0.734997 |


Comparing the results of the ANN to the MNB model, the ANN trained on the cleaned data without EDA scored about the same as the MNB model with EDA (~76.5% versus ~76.9%), but its accuracy decreased to ~73.5% when it was trained on the EDA data. This is a surprising result and it is unclear why the accuracy decreased. Perhaps the EDA method creates too many augmented entries, so the network overfits the training data. The optimum EDA parameters from the GA were determined using the MNB model and were assumed to be optimal for the ANN too, but that assumption may not be correct. Ideally the optimum EDA parameters would be determined using the ANN, but that would be too computationally expensive to test on this PC.

### Convolutional Neural Network

The next model to be tested is a Convolutional Neural Network (CNN). A CNN is similar to an ANN in that it is made up of connected layers of nodes, but in a CNN the nodes perform convolutions. When using a 2D input such as a matrix, a convolutional layer slides a small matrix of learned weights (a kernel, or filter) across the input; at each position it multiplies the kernel element-wise with the section of the input underneath it and sums the results. These sums form a smaller matrix called a feature map. The effect is that the convolutional layers extract features from the input data, which later layers can then relate to a given output. CNNs are often used in machine vision projects but can also be used in NLP projects if the data is processed correctly.
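
As a rough illustration of the multiply-and-sum step described above, the following NumPy sketch slides a small kernel over a toy 2D input to build a feature map. The input values, the kernel and the helper name `convolve2d_valid` are made up for this example; a Keras convolutional layer learns its kernels during training and also adds a bias and an activation function.

```python
import numpy as np


def convolve2d_valid(matrix, kernel):
    """Slide `kernel` over `matrix` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = matrix.shape[0] - kh + 1
    out_w = matrix.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = matrix[i : i + kh, j : j + kw]
            # Multiply the window by the kernel element-wise, then sum
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map


toy_input = np.arange(16).reshape(4, 4)   # arbitrary 4x4 example input
toy_kernel = np.array([[1, 0], [0, -1]])  # arbitrary 2x2 example kernel
feature_map = convolve2d_valid(toy_input, toy_kernel)
print(feature_map.shape)  # the feature map is smaller than the input
print(feature_map)
```

The 1D convolution used later in this notebook works the same way, except that the kernel only slides along the word axis of the embedded entry.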

A common way to represent text data for a CNN is to use word embedding. Similar to TfidfVectorizer, this involves converting the entries into vectors, but this time each word is itself represented by an n-dimensional vector. To convert a word to a vector, an embedding layer is added to the network, which encodes the words based on their relationship to one another. The vector for one word should be close to the vector of a related word in n-dimensional space. Embedding can also be done using external algorithms, with the resulting vectors fed into the convolution layers of the network directly. Both the Keras embedding layer and the Word2Vec algorithm will be tested.
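
To make the idea of related words being "close" a little more concrete, here is a minimal sketch that compares toy word vectors with cosine similarity. The three-dimensional vectors are hand-picked for illustration only; they are not learned embeddings, and cosine similarity is just one common choice of closeness measure.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-picked 3-dimensional toy vectors, not learned embeddings
toy_embeddings = {
    "rocket": np.array([0.9, 0.1, 0.0]),
    "shuttle": np.array([0.8, 0.2, 0.1]),
    "hockey": np.array([0.0, 0.1, 0.9]),
}

related = cosine_similarity(toy_embeddings["rocket"], toy_embeddings["shuttle"])
unrelated = cosine_similarity(toy_embeddings["rocket"], toy_embeddings["hockey"])
print(f"rocket vs shuttle: {related:.3f}")
print(f"rocket vs hockey:  {unrelated:.3f}")
```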

In [134]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(entries_train)

X_train = tokenizer.texts_to_sequences(entries_train)
X_test = tokenizer.texts_to_sequences(entries_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

# A max length of 600 was chosen as the majority of entries are no longer than 600 words
maxlen = 600

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)
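
The cell above turns each entry into a fixed-length row of integers. The sketch below mimics that behaviour in plain Python and NumPy on a toy corpus. It is a simplified stand-in, not the Keras implementation: the helper names are hypothetical and the real `Tokenizer` also filters punctuation. One detail worth noting is that `pad_sequences` truncates from the front by default (`truncating="pre"`), so with the settings above an entry longer than `maxlen` keeps its last 600 tokens.

```python
from collections import Counter

import numpy as np


def build_word_index(texts):
    """Map each word to an integer, most frequent word first; 0 stays reserved."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def to_padded_sequences(texts, word_index, maxlen):
    """Encode texts as integers, drop unknown words and zero-pad at the end."""
    padded = np.zeros((len(texts), maxlen), dtype=int)
    for row, text in enumerate(texts):
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # keep the last maxlen tokens, like truncating="pre"
        if ids:
            padded[row, : len(ids)] = ids  # zeros remain at the end (padding="post")
    return padded


train_texts = ["the space shuttle launched", "the hockey game"]
word_index = build_word_index(train_texts)
X = to_padded_sequences(train_texts + ["the unseen shuttle"], word_index, maxlen=3)
print(word_index)
print(X)
```

Words that never appeared in the training texts (here "unseen") are simply dropped, which is also what happens to test entries in the notebook because the tokenizer is fitted on the training entries only.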

In [135]:
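def CNN_model(
    vocab_size, embedding_dim, maxlen, filters=128, kernals=3, hidden=1, neurons=100
):
    """Build a 1D CNN text classifier.

    `kernals` is the kernel size of each Conv1D layer and `hidden` is the
    number of stacked Conv1D layers.
    """
    model = Sequential()
    # Learn an embedding_dim-dimensional vector for every word index
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))

    for _ in range(hidden):
        model.add(layers.Conv1D(filters, kernals, activation="relu"))

    # Halve the sequence length, then flatten for the dense layers
    model.add(layers.MaxPooling1D())
    model.add(layers.Flatten())
    model.add(layers.Dense(neurons, activation="relu"))
    # One output per newsgroup
    model.add(layers.Dense(20, kernel_initializer="normal", activation="softmax"))
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    # model.summary()

    return model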

#### Hyperparameter tuning

Since each CNN takes about 30 minutes to train, even when using Google Colab, the hyperparameter space could not be explored extensively. The initial plan was to test a wider range of parameters, but since training takes so long this was not possible. In the end, the same parameters that the authors used in the EDA paper were chosen for the structure of the CNN so that the results generated here could be compared to the results in the paper. The authors of the paper used a single convolutional layer with 128 filters and a kernel size of 5, connected to a dense layer with 100 neurons. The authors used Word2Vec as the input to the convolutional layer, but in this case an embedding layer is tested first and Word2Vec afterwards so that the results can be compared.
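
To give a feel for why an exhaustive search was out of reach, the sketch below counts the configurations in a small grid and multiplies by the roughly 30 minutes per run quoted above. The search space is purely hypothetical and chosen for illustration; it is not the grid that was originally planned for this project.

```python
from itertools import product

# Hypothetical search space, chosen only to illustrate the cost
search_space = {
    "filters": [64, 128, 256],
    "kernel_size": [3, 5, 7],
    "hidden": [1, 2],
    "neurons": [50, 100, 200],
}
minutes_per_run = 30  # approximate training time quoted above

# Every combination of one value per parameter
configurations = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]
total_hours = len(configurations) * minutes_per_run / 60
print(f"{len(configurations)} configurations, roughly {total_hours:.0f} hours")
```

Even this modest grid would take more than a day of continuous training, before any cross-validation is considered.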

####  Convolutional Neural Network on cleaned data using word embedding

In [136]:
embedding_dim = 100
model = CNN_model(
    vocab_size, embedding_dim, maxlen, filters=128, kernals=5, hidden=1, neurons=100
)
model.summary()

model_name = "CNN_embed-hqfes_" + str(epochs) + str(batch) + str(percentage * 100)

if import_model and os.path.exists("models/" + model_name):
    model = load_model("models/" + model_name)
else:
    # fit network
    model.fit(
        X_train,
        y_train,
        epochs=epochs,
        verbose=True,
        batch_size=batch,
        shuffle=True,
        callbacks=[es],
    )
    model.save("models/" + model_name)

# Print the score
train_loss, train_accuracy = model.evaluate(
    X_train, y_train, batch_size=128, verbose=False
)
print("Training Accuracy: {:.4f}".format(train_accuracy))
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(test_accuracy))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_accuracy, name=model_name, index=["Accuracy Score"])
)

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 600, 100)          6291500   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 596, 128)          64128     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 298, 128)          0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 38144)             0         
_________________________________________________________________
dense_14 (Dense)             (None, 100)               3814500   
_________________________________________________________________
dense_15 (Dense)             (None, 20)                2020      
Total params: 10,172,148
Trainable params: 10,172,148
Non-trainable params: 0
__________________________________________

MainProcess tensorflow INFO     Assets written to: models/CNN_embed-hqfes_100128100/assets


Training Accuracy: 0.9948
Testing Accuracy:  0.6216


After training, the CNN produces a testing accuracy of ~62%, which is lower than even the worst performing MNB model (72.9%). The training accuracy of 99.5% suggests that the network is overfitting the training data.

#### Convolutional Neural Network on cleaned EDA data using word embedding

In [137]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(entries_train_eda)

X_train = tokenizer.texts_to_sequences(entries_train_eda)
X_test = tokenizer.texts_to_sequences(entries_test_eda)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

maxlen = 600

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)

y_train_eda = np.array([np.array(x) for x in y_train_eda])

In [138]:
embedding_dim = 100
model = CNN_model(
    vocab_size, embedding_dim, maxlen, filters=128, kernals=3, hidden=1, neurons=100
)
model.summary()

model_name = "CNN_embed_oEDA-hqfes_" + str(epochs) + str(batch) + str(percentage * 100)

if import_model and os.path.exists("models/" + model_name):
    model = load_model("models/" + model_name)
else:
    # fit network
    model.fit(
        X_train,
        y_train_eda,
        epochs=epochs,
        verbose=True,
        batch_size=batch,
        shuffle=True,
        callbacks=[es],
    )
    model.save("models/" + model_name)

# Print the score
train_loss, train_accuracy = model.evaluate(
    X_train, y_train_eda, batch_size=128, verbose=False
)
print("Training Accuracy: {:.4f}".format(train_accuracy))
test_loss, test_accuracy = model.evaluate(X_test, y_test_eda, verbose=False)
print("Testing Accuracy:  {:.4f}".format(test_accuracy))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_accuracy, name=model_name, index=["Accuracy Score"])
)

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 600, 100)          7520500   
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 598, 128)          38528     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 299, 128)          0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 38272)             0         
_________________________________________________________________
dense_16 (Dense)             (None, 100)               3827300   
_________________________________________________________________
dense_17 (Dense)             (None, 20)                2020      
Total params: 11,388,348
Trainable params: 11,388,348
Non-trainable params: 0
__________________________________________

MainProcess tensorflow INFO     Assets written to: models/CNN_embed_oEDA-hqfes_100128100/assets


Training Accuracy: 0.9941
Testing Accuracy:  0.6819


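After applying EDA the accuracy score has increased this time, from 62% to 68%. This is about the same level of improvement seen when using the MNB model, but it is still quite a poor result. Note that this run used a kernel size of 3 rather than 5 (see the `kernals` argument and the model summary above), so the two CNN runs differ in more than the training data.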
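
The output lengths and parameter counts in the two model summaries above can be checked by hand. The sketch below does the arithmetic for both runs; the helper name `cnn_summary` is hypothetical, and the vocabulary sizes are inferred from the embedding parameter counts in the summaries (6,291,500 / 100 and 7,520,500 / 100) rather than read from the tokenizer.

```python
def cnn_summary(vocab_size, embedding_dim, maxlen, filters, kernel_size, neurons, classes=20):
    """Reproduce the layer sizes and parameter counts of the CNN by hand."""
    conv_len = maxlen - kernel_size + 1  # unpadded convolution with stride 1
    pooled_len = conv_len // 2           # MaxPooling1D default pool size of 2
    flat = pooled_len * filters          # Flatten output size
    params = {
        "embedding": vocab_size * embedding_dim,
        "conv1d": kernel_size * embedding_dim * filters + filters,
        "dense_hidden": flat * neurons + neurons,
        "dense_output": neurons * classes + classes,
    }
    return conv_len, flat, params, sum(params.values())


# Run without EDA (kernel size 5) and run with EDA (kernel size 3)
no_eda = cnn_summary(62915, 100, 600, 128, 5, 100)
with_eda = cnn_summary(75205, 100, 600, 128, 3, 100)
print(no_eda)
print(with_eda)
```

In both runs most of the weights sit in the embedding layer and in the first dense layer after `Flatten`, not in the convolution itself.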

#### Convolutional Neural Network on cleaned EDA data using Word2Vec

In [140]:
entries_eda = list(entries_train_eda) + list(entries_test_eda)
entries_eda_list = [text.split(" ") for text in entries_eda]

w2v_model = Word2Vec(entries_eda_list, size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
words = list(w2v_model.wv.vocab)
print("Vocabulary size: %d" % len(words))
# print(w2v_model.wv['space'])


Vocabulary size: 82789


In [141]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(entries_train_eda)

X_train = tokenizer.texts_to_sequences(entries_train_eda)
X_test = tokenizer.texts_to_sequences(entries_test_eda)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

# A max length of 600 was chosen as the majority of entries are no longer than 600 words
maxlen = 600

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)

y_train_eda = np.array([np.array(x) for x in y_train_eda])

In [142]:
# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 1 because index 0 is reserved for padding
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, embedding.vector_size))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        # words missing from the embedding keep an all-zero row
        if word in embedding:
            weight_matrix[i] = embedding[word]
    return weight_matrix


# get vectors in the right order
embedding_vectors = get_weight_matrix(w2v_model.wv, tokenizer.word_index)

# create the embedding layer
embedding_layer = layers.Embedding(
    vocab_size, 100, weights=[embedding_vectors], input_length=maxlen, trainable=False
)
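
As a small illustration of what the weight matrix and the frozen embedding layer do together, the sketch below builds the same kind of matrix from toy data and then looks up the rows for a padded sequence. The four-dimensional vectors and the word index are made up for this example; in the notebook they come from `w2v_model.wv` and `tokenizer.word_index`.

```python
import numpy as np

# Toy stand-ins: a 4-dimensional "embedding" and a Tokenizer-style word index
toy_vectors = {
    "space": np.array([0.1, 0.2, 0.3, 0.4]),
    "hockey": np.array([0.5, 0.6, 0.7, 0.8]),
}
toy_word_index = {"space": 1, "hockey": 2, "unseen": 3}

# Row i holds the vector of the word with index i; row 0 is left for padding
weights = np.zeros((len(toy_word_index) + 1, 4))
for word, i in toy_word_index.items():
    if word in toy_vectors:  # words without a vector keep an all-zero row
        weights[i] = toy_vectors[word]

# With frozen weights, the embedding layer amounts to a row lookup
sequence = np.array([2, 1, 0])  # "hockey space <padding>"
embedded = weights[sequence]
print(embedded.shape)
print(embedded)
```

Because the rows are indexed by the tokenizer's integers, the order of the vectors must follow `tokenizer.word_index`, which is why the matrix is built by iterating over that mapping.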

In [143]:
# Custom model which uses Word2Vec
model = Sequential()
model.add(embedding_layer)
model.add(layers.Conv1D(128, 5, activation="relu"))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Flatten())
model.add(layers.Dense(100, activation="relu"))
model.add(layers.Dense(20, kernel_initializer="normal", activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

model_name = (
    "CNN_word2vec_oEDA-hqfes_" + str(epochs) + str(batch) + str(percentage * 100)
)

if import_model and os.path.exists("models/" + model_name):
    model = load_model("models/" + model_name)
else:
    # fit network
    model.fit(
        X_train,
        y_train_eda,
        epochs=epochs,
        verbose=True,
        batch_size=batch,
        shuffle=True,
        callbacks=[es],
    )
    model.save("models/" + model_name)

# Print the score
train_loss, train_accuracy = model.evaluate(
    X_train, y_train_eda, batch_size=128, verbose=False
)
print("Training Accuracy: {:.4f}".format(train_accuracy))
test_loss, test_accuracy = model.evaluate(X_test, y_test_eda, verbose=False)
print("Testing Accuracy:  {:.4f}".format(test_accuracy))

# Append the score to the leaderboard
score_table = score_table.append(
    pd.Series(test_accuracy, name=model_name, index=["Accuracy Score"])
)

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 600, 100)          7520500   
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 596, 128)          64128     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_18 (Dense)             (None, 100)               12900     
_________________________________________________________________
dense_19 (Dense)             (None, 20)                2020      
Total params: 7,599,548
Trainable params: 79,048
Non-trainable params: 7,520,500
_______________________________________

Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100

MainProcess tensorflow INFO     Assets written to: models/CNN_word2vec_oEDA-hqfes_100128100/assets


Training Accuracy: 0.9845
Testing Accuracy:  0.6710
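
One difference between this model and `CNN_model` is the pooling layer: the Word2Vec model uses `GlobalMaxPooling1D`, while `CNN_model` uses `MaxPooling1D` followed by `Flatten`. The NumPy sketch below contrasts the two on random data with the shapes from the summary above, and checks the trainable parameter count. The random feature maps are placeholders, not real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
# (time steps, filters), matching the Conv1D output shape in the summary above
feature_maps = rng.normal(size=(596, 128))

# MaxPooling1D with pool size 2, then Flatten: one value per pair of steps per filter
local_pooled = feature_maps.reshape(298, 2, 128).max(axis=1).reshape(-1)

# GlobalMaxPooling1D: only the strongest response of each filter over the whole entry
global_pooled = feature_maps.max(axis=0)

print(local_pooled.shape, global_pooled.shape)

# Trainable parameters: Conv1D + hidden Dense + output Dense (the embedding is frozen)
trainable = (5 * 100 * 128 + 128) + (128 * 100 + 100) + (100 * 20 + 20)
print(trainable)
```

Global pooling discards where in the entry a feature occurred, which shrinks the input to the dense layer from 38,144 values to 128 and accounts for the much smaller number of trainable parameters.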


In [144]:
display(score_table)

| Model | Accuracy Score |
|---|---|
| MultiNB_Count-Raw | 0.850505 |
| MultiNB_Tfidf-Raw | 0.85555 |
| MultiNB_Tfidf-s | 0.87334 |
| MultiNB_Tfidf-es | 0.86086 |
| MultiNB_Tfidf-hqfes | 0.728625 |
| MultiNB_Tfidf_EDA-hqfes | 0.769251 |
| MultiNB_Tfidf_oEDA-hqfes | 0.768455 |
| ANN_Tfidf-hqfes_100128100 | 0.765003 |
| ANN_Tfidf_oEDA-hqfes_100128100 | 0.734997 |
| CNN_embed-hqfes_100128100 | 0.621614 |


Comparing the accuracy of the CNN to the other models tested, it performs worse, although in this case the EDA method has increased the performance of the model (from 62% to 68% with the embedding layer, and 67% with Word2Vec). The poor performance is a surprising result, as CNNs are frequently used in text classification problems and have been shown to give good results. Perhaps the parameters chosen are not a good fit for this dataset; in the future the parameter space could be explored to better tune the model.

### Recurrent Neural Network

Initially the plan was to test and compare the performance of a Recurrent Neural Network (RNN) against the other models. Unfortunately, the RNN consistently crashed during training, regardless of the structure picked for the model. In the future, the RNN could be tested and trained on a more powerful PC with better GPU support.

#### Recurrent Neural Network on cleaned EDA data using word embedding

#### Recurrent Neural Network on cleaned EDA data using Word2Vec

## Final Results

In [145]:
display(score_table)

| Model | Accuracy Score |
|---|---|
| MultiNB_Count-Raw | 0.850505 |
| MultiNB_Tfidf-Raw | 0.85555 |
| MultiNB_Tfidf-s | 0.87334 |
| MultiNB_Tfidf-es | 0.86086 |
| MultiNB_Tfidf-hqfes | 0.728625 |
| MultiNB_Tfidf_EDA-hqfes | 0.769251 |
| MultiNB_Tfidf_oEDA-hqfes | 0.768455 |
| ANN_Tfidf-hqfes_100128100 | 0.765003 |
| ANN_Tfidf_oEDA-hqfes_100128100 | 0.734997 |
| CNN_embed-hqfes_100128100 | 0.621614 |


The final scores have produced a surprising result. On the cleaned data, the ANN trained without EDA achieved an accuracy score of 76.5%, essentially level with the Multinomial Naive Bayes models using EDA (76.9% and 76.8%) and well ahead of the Multinomial Naive Bayes model without EDA (72.9%). The CNN performed the worst of the three models, achieving only 62% on the embedded data without EDA and 68% with EDA.
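
For a quick side-by-side view, the sketch below ranks the models that were trained on the fully cleaned (`hqfes`) data. The accuracies are copied from the leaderboard above; filtering on the `hqfes` suffix is just a convenience based on the naming scheme used in this notebook.

```python
# Test accuracies copied from the leaderboard above
scores = {
    "MultiNB_Count-Raw": 0.850505,
    "MultiNB_Tfidf-Raw": 0.85555,
    "MultiNB_Tfidf-s": 0.87334,
    "MultiNB_Tfidf-es": 0.86086,
    "MultiNB_Tfidf-hqfes": 0.728625,
    "MultiNB_Tfidf_EDA-hqfes": 0.769251,
    "MultiNB_Tfidf_oEDA-hqfes": 0.768455,
    "ANN_Tfidf-hqfes_100128100": 0.765003,
    "ANN_Tfidf_oEDA-hqfes_100128100": 0.734997,
    "CNN_embed-hqfes_100128100": 0.621614,
}

# Keep only the models trained on the fully cleaned data, best first
cleaned = {name: acc for name, acc in scores.items() if "hqfes" in name}
ranked = sorted(cleaned.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranked:
    print(f"{name:<32} {acc:.1%}")
```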

## Conclusions

While the ANN produced adequate results, the deep learning models have not performed as well as expected. There are several possible reasons for this. The first is that no extensive hyperparameter tuning has been done on these models: different model structures were tested for the ANN, but only one CNN structure was tested. With some tuning the accuracy scores could perhaps be improved, but since each round of training takes anywhere from 30 minutes to 4 hours, an extensive search is not possible using this computer or Google Colab. The optimizer and the neuron activation functions used by the models were picked based on previous work, but trying different types might also yield some improvements.

As well as the models' hyperparameters, the values picked for the EDA method itself have not been extensively tuned. The genetic algorithm began to converge on a result after a coarse search, but if the process had been run for longer a different set of parameters might have emerged. Efforts could be made to parallelize the EDA function to speed up the search, but that is outside the scope of this project. Also, the GA was run on the MNB model with the assumption that the optimum result for that model would also be the optimum for the others. This assumption might not be correct, as the ANN showed a decrease in performance after applying EDA.

Finally, this classification problem might not be well suited to deep learning methods. In the future, the performance of simpler machine learning models could be tested, such as Support Vector Machine (SVM) and Random Forest classifiers.
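
As a starting point for that future work, here is a minimal sketch of how the two suggested models could be dropped into a scikit-learn pipeline with `TfidfVectorizer`. The tiny corpus, the labels and the hyperparameters are placeholders chosen for illustration; neither model was trained or evaluated in this project, so no accuracy is implied.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the cleaned newsgroup entries
toy_entries = [
    "the shuttle launch was delayed by nasa",
    "the rocket reached orbit around the moon",
    "the goalie made a great save in the hockey game",
    "the team won the hockey playoff in overtime",
]
toy_labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

# Same vectorizer for both candidates so only the classifier differs
candidates = {
    "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0)),
    "RandomForest": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}

predictions = {}
for name, pipeline in candidates.items():
    pipeline.fit(toy_entries, toy_labels)
    predictions[name] = pipeline.predict(["the rocket reached orbit"])[0]
    print(name, predictions[name])
```

On the real data the same pipelines could be fitted on the training entries and scored on the test entries in the same way as the MNB model.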

## References 

- https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
- https://www.upgrad.com/blog/multinomial-naive-bayes-explained/
- https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- https://scikit-learn.org/stable/user_guide.html
- https://scikit-learn.org/stable/modules/classes.html
- https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
- https://arxiv.org/pdf/1901.11196.pdf
- https://github.com/jasonwei20/eda_nlp
- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
- https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
- https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
- https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/
- https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-to-classify-satellite-photos-of-the-amazon-rainforest/
- https://www.analyticsvidhya.com/blog/2020/11/text-cleaning-nltk-library/
- https://machinelearningmastery.com/clean-text-machine-learning-python/
- https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
- https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
- https://realpython.com/python-keras-text-classification/#a-primer-on-deep-neural-networks
- https://stackabuse.com/text-classification-with-python-and-scikit-learn/
- https://towardsdatascience.com/text-classification-in-python-dd95d264c802
- https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
- https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
- https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
- https://github.com/scikit-learn/scikit-learn/blob/95119c13a/sklearn/datasets/_twenty_newsgroups.py#L329
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html?highlight=news#sklearn.datasets.fetch_20newsgroups
- https://stackoverflow.com/questions/61961042/indices201-0-8-is-out-of-order-many-sparse-ops-require-sorted-indices-use
- https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
- https://en.wikipedia.org/wiki/Convolutional_neural_network
- https://ai.stackexchange.com/questions/5546/what-is-the-difference-between-a-convolutional-neural-network-and-a-regular-neur
- https://machinelearningmastery.com/cnn-models-for-human-activity-recognition-time-series-classification/
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D
- https://www.tensorflow.org/versions
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://machinelearningmastery.com/recommendations-for-deep-learning-neural-network-practitioners/
- https://towardsdatascience.com/a-walkthrough-of-convolutional-neural-network-7f474f91d7bd
- https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
- https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
- https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work
- https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do
- https://machinelearningmastery.com/what-are-word-embeddings/
- https://stats.stackexchange.com/questions/31060/bag-of-words-vs-vector-space-model
- https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
- https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- https://machinelearningmastery.com/check-point-deep-learning-models-keras/
- https://towardsdatascience.com/keras-callbacks-and-how-to-save-your-model-from-overtraining-244fc1de8608