# Assignment L10: Topic Identification with an ANN

**Author: Preston Went**  
**Course: DATASCI 420**  
**DATE (MM/DD/YYYY): 03/15/2021**

## Introduction

In this assignment, I will use an artificial neural network (ANN) to identify the topics of given news articles.

As always, I will import all necessary libraries before starting.

In [3]:
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set();

# Modeling
import tensorflow as tf
from tensorflow import keras

# Data preprocessing
from tensorflow.keras.utils import to_categorical

# Model evaluation
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

# Set our random seeds (NumPy and TensorFlow)
RAND_SEED = 5743829
np.random.seed(RAND_SEED)
tf.random.set_seed(RAND_SEED)

# Filter some unavoidable warnings
import warnings
warnings.filterwarnings("ignore")

## Importing and Preparing the Data

We will be classifying news articles from the Reuters newswire dataset, which contains 11,228 newswires across 46 topics.

In [4]:
# Load the Reuters newswire dataset
vocab_size = 15000
data = keras.datasets.reuters
(X_train, y_train), (X_test, y_test) = data.load_data(num_words=vocab_size,
                                                      seed=RAND_SEED)

# Properly encode the label vectors (one-hot, one column per topic)
num_topics = 46
y_train = to_categorical(y_train, num_classes=num_topics)
y_test = to_categorical(y_test, num_classes=num_topics)

# Set maximum length of newswires
max_wire_len = 600
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_wire_len)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_wire_len)
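The two preprocessing steps above can be illustrated on toy data. This is a NumPy-only sketch of the behavior I understand `to_categorical` and `pad_sequences` (with its default "pre" padding and truncation) to have, not the Keras implementation; the helper names `one_hot` and `pad_pre` and the toy inputs are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` tokens of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        tail = list(seq)[-maxlen:]
        if tail:
            padded[row, maxlen - len(tail):] = tail
    return padded

# Hypothetical toy data: three topic labels and two short "newswires" of word indices
toy_labels = [3, 0, 2]
toy_wires = [[5, 8], [4, 9, 7, 2, 6]]

toy_one_hot = one_hot(toy_labels, num_classes=4)
toy_padded = pad_pre(toy_wires, maxlen=4)
print(toy_one_hot)
print(toy_padded)
```

The short wire is padded on the left with zeros, and the long wire loses its first token, so with `max_wire_len = 600` any newswire longer than 600 tokens would keep only its final 600.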

## The Initial Model

We will start with a simple recurrent neural network (RNN).

In [5]:
# Construct the RNN
embed_dim = 64
rnn_mdl = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size,
                           output_dim=embed_dim,
                           input_length=max_wire_len),
    keras.layers.SimpleRNN(50),
    keras.layers.Dense(num_topics,
                       activation=tf.nn.sigmoid)
])

Now we can train our model on the training set. We will use categorical cross-entropy as our loss function and Adam as our optimizer.

In [129]:
# Define the training and evaluation process
rnn_mdl.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
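The loss named in the cell above can be written out by hand. The following NumPy sketch is illustrative only and is not what Keras runs internally. It pairs the loss with a softmax, which is the activation usually used with categorical cross-entropy when each article has exactly one topic; the models in this notebook use sigmoid outputs instead, so treat the numbers as a toy example. The function names and toy arrays are assumptions.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -log(probability assigned to the true class)."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Hypothetical toy batch: two samples, three classes
toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])
toy_truth = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

toy_probs = softmax(toy_logits)
toy_loss = categorical_crossentropy(toy_truth, toy_probs)
print(toy_probs.round(3))
print(round(toy_loss, 3))
```

The second sample has uniform scores, so its contribution to the loss is log(3); a model that has learned nothing about 46 topics would sit near log(46) per sample by the same reasoning.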

In [130]:
# Train the neural network
rnn_mdl.fit(X_train, y_train, epochs=3)

Train on 8982 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1c8636d9108>

The loss and training accuracy improved very little. Let us see how the model performs on our testing set using the average precision, AUC-ROC, and classification report.

In [131]:
# Get predictions on the testing set
y_pred_proba = rnn_mdl.predict(X_test)
y_pred = np.zeros_like(y_pred_proba)
y_pred[np.arange(len(y_pred_proba)), y_pred_proba.argmax(1)] = 1

# Get and print the evaluation metrics
print('Avg. Precision: {:.2f}'.format(average_precision_score(y_test, y_pred_proba)))
print('AUC-ROC: {:.2f}'.format(roc_auc_score(y_test, y_pred_proba)))
print(classification_report(y_test, y_pred, zero_division=1))

Avg. Precision: 0.03
AUC-ROC: 0.57
              precision    recall  f1-score   support

           0       1.00      0.00      0.00        12
           1       1.00      0.00      0.00       121
           2       1.00      0.00      0.00        21
           3       0.36      1.00      0.53       812
           4       1.00      0.00      0.00       458
           5       1.00      0.00      0.00         4
           6       1.00      0.00      0.00         7
           7       1.00      0.00      0.00         6
           8       1.00      0.00      0.00        38
           9       1.00      0.00      0.00        21
          10       1.00      0.00      0.00        34
          11       1.00      0.00      0.00        96
          12       1.00      0.00      0.00        10
          13       1.00      0.00      0.00        35
          14       1.00      0.00      0.00         9
          15       1.00      0.00      0.00         4
          16       1.00      0.00      0.00   

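The model performs quite poorly, if still slightly better than random guesswork (AUC-ROC of 0.57). In the rows shown, it assigns essentially every article to topic 3, the largest class: recall is 1.00 for topic 3 and 0.00 for every other topic. The precision of 1.00 reported for those other topics is an artifact of `zero_division=1`, which substitutes 1 when a topic is never predicted.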
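To make the pattern in that report concrete, here is a small NumPy sketch of a classifier that always predicts the same topic. The helper `per_class_precision_recall` and the toy labels are hypothetical; the one-hot conversion mirrors the `argmax` step in the evaluation cell, and the `zero_division` argument imitates the scikit-learn option of the same name rather than calling it.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, zero_division=1.0):
    """Per-class precision and recall from one-hot truth and one-hot predictions."""
    tp = (y_true * y_pred).sum(axis=0)
    predicted = y_pred.sum(axis=0)
    actual = y_true.sum(axis=0)
    # Where a class is never predicted (or never occurs), fall back to zero_division
    precision = np.where(predicted > 0, tp / np.maximum(predicted, 1), zero_division)
    recall = np.where(actual > 0, tp / np.maximum(actual, 1), zero_division)
    return precision, recall

# Hypothetical toy test set: 10 articles, 3 topics, topic 1 is the majority
toy_labels = np.array([0, 1, 1, 1, 1, 2, 2, 1, 0, 1])
toy_true = np.eye(3)[toy_labels]

# A model that gives every article the same scores, so argmax is always topic 1
toy_proba = np.tile([0.2, 0.5, 0.3], (10, 1))
toy_pred = np.zeros_like(toy_proba)
toy_pred[np.arange(len(toy_proba)), toy_proba.argmax(1)] = 1

toy_precision, toy_recall = per_class_precision_recall(toy_true, toy_pred)
print(toy_precision)  # topic 1 precision equals its share of the test set
print(toy_recall)
```

The same arithmetic matches the report above: 812 of the 2,246 test articles (11,228 in total minus the 8,982 training samples) belong to topic 3, and 812 / 2246 is about 0.36, the precision shown for that topic.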

## Improving the Model

Let us try to improve the model. I will do this by replacing the simple RNN layer with a Long Short-Term Memory (LSTM) layer and adding a dense hidden layer.

In [124]:
# Construct the LSTM
embed_dim = 256
lstm_mdl = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size,
                           output_dim=embed_dim,
                           input_length=max_wire_len),
    keras.layers.LSTM(128),
    keras.layers.Dense(64,
                      activation=tf.nn.relu),
    keras.layers.Dense(num_topics,
                       activation=tf.nn.sigmoid)
])
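For a sense of how much larger this model is than the first one, the trainable parameters of both can be counted by hand. This sketch assumes the standard layer formulas with bias terms enabled (the Keras defaults) and uses hypothetical helper names; `model.summary()` would be the way to confirm the figures on an actual model.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + bias
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # Four gates, each with the same shape as a simple RNN layer
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix + bias
    return input_dim * units + units

vocab_size, num_topics = 15000, 46

rnn_total = (embedding_params(vocab_size, 64)
             + simple_rnn_params(64, 50)
             + dense_params(50, num_topics))

lstm_total = (embedding_params(vocab_size, 256)
              + lstm_params(256, 128)
              + dense_params(128, 64)
              + dense_params(64, num_topics))

print(rnn_total)   # 968096
print(lstm_total)  # 4048366
```

Most of the growth comes from the wider embedding (256 instead of 64 dimensions); the LSTM layer itself accounts for 197,120 parameters.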

The training process will move faster with this LSTM than with the RNN, so we can train for more epochs in the same amount of time. Let us do so now.

In [125]:
# Define the training and evaluation process
lstm_mdl.compile(optimizer='adam',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])

In [126]:
# Train the neural network
lstm_mdl.fit(X_train, y_train, epochs=20)

Train on 8982 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1c85f2e32c8>

The loss and training accuracy improved dramatically, both relative to the start of training and relative to the simple RNN. Let us see if this is reflected in our evaluation.

In [127]:
# Get predictions on the testing set
y_pred_proba = lstm_mdl.predict(X_test)
y_pred = np.zeros_like(y_pred_proba)
y_pred[np.arange(len(y_pred_proba)), y_pred_proba.argmax(1)] = 1

# Get and print the evaluation metrics
print('Avg. Precision: {:.2f}'.format(average_precision_score(y_test, y_pred_proba)))
print('AUC-ROC: {:.2f}'.format(roc_auc_score(y_test, y_pred_proba)))
print(classification_report(y_test, y_pred, zero_division=1))

Avg. Precision: 0.28
AUC-ROC: 0.84
              precision    recall  f1-score   support

           0       0.64      0.58      0.61        12
           1       0.66      0.75      0.70       121
           2       0.77      0.48      0.59        21
           3       0.92      0.92      0.92       812
           4       0.83      0.88      0.85       458
           5       0.00      0.00      0.00         4
           6       0.60      0.43      0.50         7
           7       0.33      0.17      0.22         6
           8       0.35      0.50      0.41        38
           9       0.50      0.86      0.63        21
          10       0.77      0.71      0.74        34
          11       0.72      0.65      0.68        96
          12       0.14      0.20      0.17        10
          13       0.57      0.66      0.61        35
          14       0.25      0.11      0.15         9
          15       0.00      0.00      0.00         4
          16       0.65      0.56      0.60   

The LSTM is a dramatic improvement over the simple RNN. It still has problems identifying the less common topics, but not nearly so much as the simple RNN. The roughly ninefold increase in average precision (from 0.03 to 0.28) is indicative of this.
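Since average precision carries much of this comparison, a short sketch of how it is computed may help. This is a simplified NumPy version that assumes no tied scores within a class; `average_precision_score` handles ties differently, so this is an illustration of the idea rather than a drop-in replacement. The function names and toy arrays are hypothetical.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@k taken at each rank k where a true positive appears."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def macro_average_precision(Y_true, Y_score):
    """Unweighted mean of the per-class average precision."""
    per_class = [average_precision(Y_true[:, c], Y_score[:, c])
                 for c in range(Y_true.shape[1])]
    return float(np.mean(per_class))

# Hypothetical toy scores: 4 articles, 2 topics
toy_truth = np.array([[1, 0],
                      [0, 1],
                      [1, 0],
                      [0, 0]])
toy_scores = np.array([[0.9, 0.1],
                       [0.8, 0.9],
                       [0.7, 0.2],
                       [0.1, 0.3]])

toy_ap_class0 = average_precision(toy_truth[:, 0], toy_scores[:, 0])
toy_macro_ap = macro_average_precision(toy_truth, toy_scores)
print(toy_ap_class0, toy_macro_ap)
```

Because the macro average weights every topic equally, a handful of rare topics with near-zero average precision can hold the overall figure down even when the large topics are ranked well.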

## Conclusion

We constructed two ANNs, a simple RNN and an LSTM, to classify the topics of news articles, training and testing them on the Reuters newswire dataset. The LSTM dramatically outperformed the simple RNN, particularly on the smaller classes. The remaining challenge with those classes appears to be a lack of training examples. Naively duplicating their examples risks overfitting, and generating synthetic examples is difficult with data as complex as text. I plan on looking into this problem more later.
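One direction I might try first is weighting the loss by class instead of resampling. The sketch below computes inverse-frequency ("balanced") weights from integer labels; the helper name and toy labels are hypothetical, and nothing here has been run against the models above. Keras `fit` accepts a `class_weight` dictionary, so the result could in principle be passed there, with the integer labels recovered from the one-hot targets via `y_train.argmax(1)`.

```python
import numpy as np

def balanced_class_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count)."""
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=num_classes)
    weights = {}
    for cls, count in enumerate(counts):
        if count > 0:  # skip classes with no examples to avoid dividing by zero
            weights[cls] = len(labels) / (num_classes * count)
    return weights

# Hypothetical toy labels: one common topic, one medium, one rare
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels, num_classes=3)
print(toy_weights)
```

With these weights each class contributes the same total weight to the loss, so the single example of the rare topic counts as much as the six examples of the common one combined. Whether that helps or simply trades accuracy on the large topics for the small ones would need to be tested.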