# Appendix _______________________ - Supporting Python Code

#### Steve Desilets

#### November 5, 2023

## 1) Introduction

In this notebook, we build natural language processing pipelines that classify news articles into their respective categories. The underlying data come from the AG News corpus, a collection of more than one million news articles; we use the `ag_news_subset` version, which contains 127,600 articles across four categories. We build a variety of models, including artificial neural networks, recurrent neural networks, long short-term memory (LSTM) models, and transformer-based models, to discover which methods classify the articles most effectively.

### 1.1) Notebook Setup

First, let's set up this notebook by importing the relevant packages and by defining functions that we will use throughout our analysis.

In [5]:
import datetime
from packaging import version
from collections import Counter
import numpy as np
import pandas as pd
import time
import os
import re
import string

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

import nltk
from nltk.corpus import stopwords

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from tensorflow.keras import layers
import tensorflow.keras.backend as k


In [6]:
%matplotlib inline
np.set_printoptions(precision=3, suppress=True)

We can verify the installed version of TensorFlow.

In [7]:
print("This notebook requires TensorFlow 2.0 or above")
print("TensorFlow version: ", tf.__version__)
assert version.parse(tf.__version__).release[0] >=2

This notebook requires TensorFlow 2.0 or above
TensorFlow version:  2.12.0


Let's define visualization functions that we'll use throughout this analysis.

In [8]:
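def print_validation_report(test_labels, predictions):
    """Print per-class precision/recall/F1, overall accuracy, and RMSE."""
    print("Classification Report")
    print(classification_report(test_labels, predictions))
    print('Accuracy Score: {}'.format(accuracy_score(test_labels, predictions)))
    # RMSE treats the class indices as numbers, so read it only as a rough diagnostic.
    print('Root Mean Square Error: {}'.format(np.sqrt(MSE(test_labels, predictions))))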

def plot_confusion_matrix(y_true, y_pred):
    mtx = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(8,8))
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.75,  cbar=False, ax=ax,cmap='Blues',linecolor='white')
    #  square=True,
    plt.ylabel('true label')
    plt.xlabel('predicted label')

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

def display_training_curves(training, validation, title, subplot):
  ax = plt.subplot(subplot)
  ax.plot(training)
  ax.plot(validation)
  ax.set_title('model '+ title)
  ax.set_ylabel(title)
  ax.set_xlabel('epoch')
  ax.legend(['training', 'validation'])


If running in Google Colab, we can mount Google Drive (the commands are commented out here).

In [None]:
#from google.colab import drive
#drive.mount('/content/gdrive')

### 1.2) Exploratory Data Analysis

Now that we've set up our notebook, let's load the subset of the AG news dataset.

### Load AG News Subset Data
<div class="alert alert-block alert-info">
    <b> ag_news_subset</b><br>
    See https://www.tensorflow.org/datasets/catalog/ag_news_subset
    </div>

Next, we get all the words in the documents (as well as the number of words in each document) by using the encoder to get the index associated with each token and then translating the indices back to tokens. First, though, we need the "unpadded" news articles so that we can measure their lengths.

In [None]:
#register  ag_news_subset so that tfds.load doesn't generate a checksum (mismatch) error
!python -m tensorflow_datasets.scripts.download_and_prepare \
        --register_checksums --datasets=ag_news_subset

# https://www.tensorflow.org/datasets/splits
# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
dataset_all, info = tfds.load('ag_news_subset', with_info=True,  split=ri, as_supervised=True)
text_only_dataset_all=dataset_all.map(lambda x, y: x)

Let's conduct exploratory data analysis of this ag_news_subset dataset. We combined the training and test data for a total of 127,600 news articles. We can begin by observing the first 10 rows of this dataset.

In [None]:
tfds.as_dataframe(dataset_all.take(10),info)

Let's review the labels for the categories of articles in this dataset.

In [None]:
categories =dict(enumerate(info.features["label"].names))
print('Dictionary:', categories)

Let's observe the number of observations that correspond to each class in the dataset.

In [None]:
train_categories = [categories[label] for label in dataset_all.map(lambda text, label: label).as_numpy_iterator()]
Counter(train_categories).most_common()

We see that the 127,600 articles are evenly distributed across the four classes.

Let's do a bit of data preprocessing to enable further exploratory data analysis. 

We can start by making the corpus all lowercase, stripping punctuation, and removing stopwords.

In [None]:
def custom_stopwords(input_text):
    lowercase = tf.strings.lower(input_text)
    stripped_punct = tf.strings.regex_replace(lowercase
                                  ,'[%s]' % re.escape(string.punctuation)
                                  ,'')
    return tf.strings.regex_replace(stripped_punct, r'\b(' + r'|'.join(STOPWORDS) + r')\b\s*',"")

In [None]:
nltk.download('stopwords',quiet=True)
STOPWORDS = stopwords.words("english")
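To make the effect of this standardization concrete, here is a minimal plain-Python sketch that mirrors the three steps of `custom_stopwords` on a single string. It uses Python's `re` module rather than TensorFlow's string ops, and `DEMO_STOPWORDS` and `standardize` are hypothetical names with a tiny illustrative stop word list, not NLTK's full English list.

```python
import re
import string

# Hypothetical, tiny stop word list for illustration only;
# the notebook itself uses NLTK's English stop words.
DEMO_STOPWORDS = ["the", "a", "of", "in", "to", "and"]

def standardize(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, then drop stop words (same order as custom_stopwords)."""
    lowered = text.lower()
    no_punct = re.sub("[%s]" % re.escape(string.punctuation), "", lowered)
    pattern = r"\b(" + "|".join(stop_words) + r")\b\s*"
    return re.sub(pattern, "", no_punct)

example = "The Fed raises rates, and stocks fall in the U.S."
print(standardize(example))  # fed raises rates stocks fall us
```

Because punctuation is stripped before the stop words are removed, a stop word that contains an apostrophe (such as "don't") becomes "dont" and no longer matches the stop word pattern.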

In [None]:
%%time
max_tokens = None
text_vectorization=layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    standardize=custom_stopwords
)
text_vectorization.adapt(text_only_dataset_all)

In [None]:
%%time
doc_sizes = []
corpus = []
for example, _ in dataset_all.as_numpy_iterator():
  enc_example = text_vectorization(example)
  doc_sizes.append(len(enc_example))
  corpus+=list(enc_example.numpy())

In [None]:
print(f"There are {len(corpus)} words in the corpus of {len(doc_sizes)} news articles.")
print(f"Each news article has between {min(doc_sizes)} and {max(doc_sizes)} tokens in it.")

In [None]:
print(f"There are {len(text_vectorization.get_vocabulary())} vocabulary words in the corpus.")

Let's observe the first 50 words in our vocabulary.

In [None]:
vocab = np.array(text_vectorization.get_vocabulary())
print(vocab[:50])

Let's examine the distribution of the number of tokens per document.

In [None]:
plt.figure(figsize=(15,9))
plt.hist(doc_sizes, bins=20,range = (0,80))
plt.xlabel("Tokens Per Document")
plt.ylabel("Number of AG News Articles");
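One way to read this distribution is to ask what share of articles would fit, untruncated, within a candidate sequence length. The sketch below illustrates that calculation; `demo_doc_sizes` holds made-up token counts rather than the notebook's `doc_sizes`, and `coverage` is a hypothetical helper.

```python
# Made-up token counts standing in for doc_sizes; these are not the notebook's data.
demo_doc_sizes = [12, 18, 20, 22, 25, 25, 27, 30, 34, 41, 58, 95]

def coverage(doc_sizes, max_length):
    """Return the fraction of documents that fit within max_length tokens."""
    kept = sum(1 for size in doc_sizes if size <= max_length)
    return kept / len(doc_sizes)

for candidate in (32, 64, 96):
    share = coverage(demo_doc_sizes, candidate)
    print(f"max_length={candidate}: {share:.1%} of documents fit")
```

Running the same helper on the real `doc_sizes` list would show how many articles a given `output_sequence_length` truncates.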

### 1.3) Data Pre-Processing

Now that we've conducted exploratory data analysis, let's preprocess the text.

In [None]:
# register  ag_news_subset so that tfds.load doesn't generate a checksum (mismatch) error
!python -m tensorflow_datasets.scripts.download_and_prepare --register_checksums --datasets=ag_news_subset

dataset, info = tfds.load(
    'ag_news_subset',
    with_info=True,
    split=['train[:95%]', 'train[95%:]', 'test'],
    batch_size=32,
    as_supervised=True,
)

train_ds, val_ds, test_ds = dataset
text_only_train_ds = train_ds.map(lambda x, y: x)
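As a quick sanity check on these splits, the sketch below works out how many examples and batches each one should contain. It assumes the published `ag_news_subset` split sizes (120,000 training and 7,600 test articles) and that the final partial batch is kept; `num_batches` is a hypothetical helper.

```python
import math

TRAIN_EXAMPLES = 120_000  # size of the ag_news_subset train split
TEST_EXAMPLES = 7_600     # size of the ag_news_subset test split
BATCH_SIZE = 32

train_size = TRAIN_EXAMPLES * 95 // 100  # 'train[:95%]'
val_size = TRAIN_EXAMPLES - train_size   # 'train[95%:]'

def num_batches(n_examples, batch_size=BATCH_SIZE):
    """Batches per epoch, assuming the final partial batch is kept."""
    return math.ceil(n_examples / batch_size)

split_sizes = {"train": train_size, "validation": val_size, "test": TEST_EXAMPLES}
for name, size in split_sizes.items():
    print(f"{name}: {size} examples, {num_batches(size)} batches")
```

These batch counts should match the step counts that Keras reports per epoch during training and evaluation.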

In [None]:
max_length = 96
max_tokens = 1000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
    standardize=custom_stopwords
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
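The following plain-Python sketch illustrates what integer vectorization with a capped vocabulary and a fixed output length does to one document. In `int` output mode, Keras reserves index 0 for padding and index 1 for out-of-vocabulary tokens; the helpers `build_vocab` and `vectorize` are hypothetical simplifications, and their tie-breaking among equally frequent tokens may differ from the Keras layer's.

```python
from collections import Counter

def build_vocab(docs, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (0 = padding, 1 = unknown)."""
    counts = Counter(token for doc in docs for token in doc.split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(doc, vocab, max_length):
    """Encode tokens as integers, then truncate or zero-pad to max_length."""
    ids = [vocab.get(tok, 1) for tok in doc.split()][:max_length]
    return ids + [0] * (max_length - len(ids))

docs = ["stocks fall", "stocks rise again", "oil prices rise"]
vocab = build_vocab(docs, max_tokens=4)  # room for only two real tokens
print(vocab)                                              # {'stocks': 2, 'rise': 3}
print(vectorize("stocks rise sharply", vocab, max_length=5))  # [2, 3, 1, 0, 0]
```

With `max_tokens = 1000`, every word outside the most frequent vocabulary entries collapses to the same unknown index, so the models below only distinguish among the most common tokens.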

## 2) Model 1 - Bi-Directional Long Short-Term Memory (LSTM) Model

### 2.1) Build The Model

In [None]:
k.clear_session()
inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens
                            ,output_dim=256
                            ,mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="SparseCategoricalCrossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("Model_One",save_best_only=True)
    ,tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)
]



start_time = datetime.datetime.now()

history=model.fit(int_train_ds, validation_data=int_val_ds, epochs=200, callbacks=callbacks)

end_time = datetime.datetime.now()
runtime = end_time - start_time
print(f"The runtime to fit this model was: {runtime}.")


model = keras.models.load_model("Model_One")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
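As a cross-check on the `model.summary()` output, the parameter counts for this architecture can be worked out by hand. The sketch below uses the standard Keras LSTM formula (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes from the cell above; the constant names are introduced here for illustration only.

```python
VOCAB_SIZE = 1000   # max_tokens
EMBED_DIM = 256     # output_dim of the Embedding layer
LSTM_UNITS = 32
NUM_CLASSES = 4

embedding_params = VOCAB_SIZE * EMBED_DIM

# Each LSTM direction has 4 gates; each gate has input weights,
# recurrent weights, and a bias vector.
lstm_params_one_direction = 4 * ((EMBED_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)
bilstm_params = 2 * lstm_params_one_direction

# The dense layer sees the concatenated forward and backward states (2 * 32 values).
dense_params = (2 * LSTM_UNITS) * NUM_CLASSES + NUM_CLASSES

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 256000 73984 260 330244
```

Most of the parameters sit in the embedding matrix, so the vocabulary size and embedding dimension dominate the model's size.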

### 2.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)
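The number of epochs recorded in the history depends on when the `EarlyStopping` callback halted training. The sketch below is a simplified illustration of the patience rule for a metric that should increase, using made-up validation accuracies; `early_stopping_epoch` is a hypothetical helper, not the Keras implementation.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the 1-based epoch at which training stops under a patience rule."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, wait = acc, 0   # improvement: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_accuracies)    # never triggered: ran all epochs

# Made-up validation accuracies, not results from this notebook.
demo_val_acc = [0.80, 0.84, 0.85, 0.85, 0.84, 0.83, 0.86]
print(early_stopping_epoch(demo_val_acc))  # 6
```

In this example the best value occurs at epoch 3, and training stops three epochs later, before the later improvement at epoch 7 is ever seen.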

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)
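To clarify how the heatmap is laid out, here is a small plain-Python sketch of a confusion matrix in scikit-learn's convention, where rows are true labels and columns are predicted labels. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

# Made-up labels for four classes, not results from this notebook.
demo_true = [0, 0, 1, 2, 3, 3]
demo_pred = [0, 1, 1, 2, 3, 2]

matrix = confusion_counts(demo_true, demo_pred, num_classes=4)
for row in matrix:
    print(row)

# Accuracy is the sum of the diagonal divided by the number of examples.
accuracy = sum(matrix[i][i] for i in range(4)) / len(demo_true)
print(round(accuracy, 3))  # 0.667
```

Off-diagonal cells show which pairs of categories the model confuses, which the overall accuracy alone cannot reveal.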

## 3) Model 2 - 1-Dimensional Convolutional Neural Network

### 3.1) Build The Model

In [None]:
k.clear_session()
inputs = tf.keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Conv1D(filters=32, kernel_size=3, activation='relu')(embedded)
x = layers.Dropout(0.5)(x)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(256, activation='relu')(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="SparseCategoricalCrossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("Model_Two",save_best_only=True)
    ,tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)
]



start_time = datetime.datetime.now()

history=model.fit(int_train_ds, validation_data=int_val_ds, epochs=200, callbacks=callbacks)

end_time = datetime.datetime.now()
runtime = end_time - start_time
print(f"The runtime to fit this model was: {runtime}.")

model = keras.models.load_model("Model_Two")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
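The shapes and parameter counts in this model's summary can also be derived by hand. The sketch below assumes the default `'valid'` padding and stride 1 for `Conv1D`, and uses the sequence length of 96 set in the vectorization step (the summary itself shows `None` for the sequence dimension because the input shape is unspecified); the constant names are introduced here for illustration.

```python
SEQ_LEN = 96       # max_length used by the TextVectorization layer
VOCAB_SIZE = 1000  # max_tokens, which is also the one-hot depth
FILTERS = 32
KERNEL_SIZE = 3
POOL_SIZE = 2
HIDDEN_UNITS = 256
NUM_CLASSES = 4

# Sequence length after each layer ('valid' padding, stride 1).
conv_len = SEQ_LEN - KERNEL_SIZE + 1   # 94
pooled_len = conv_len // POOL_SIZE     # 47; global max pooling then leaves 32 features

# Trainable parameters: one-hot encoding, dropout, and pooling have none.
conv_params = (KERNEL_SIZE * VOCAB_SIZE + 1) * FILTERS
hidden_params = FILTERS * HIDDEN_UNITS + HIDDEN_UNITS
output_params = HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES
total_params = conv_params + hidden_params + output_params

print(conv_len, pooled_len)
print(conv_params, hidden_params, output_params, total_params)
# 94 47
# 96032 8448 1028 105508
```

Because the convolution operates directly on one-hot vectors, its kernel spans all 1,000 vocabulary positions, which is why it holds most of the model's parameters.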

### 3.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 4) Model 3 - _______________________

### 4.1) Build The Model

### 4.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 5) Model 4 - _______________________

### 5.1) Build The Model

### 5.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 6) Model 5 - ____________________

### 6.1) Build The Model

### 6.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 7) Model 6 - _____________________

### 7.1) Build The Model

### 7.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 8) Model 7 - _________________________

### 8.1) Build The Model

### 8.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 9) Model 8 - ___________________________

### 9.1) Build The Model

### 9.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 10) Model 9 - ___________________________

### 10.1) Build The Model

### 10.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)

## 11) Model 10 - _______________________

### 11.1) Build The Model

### 11.2) Evaluate Model Performance

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)

In [None]:
plt.subplots(figsize=(16,12))
plt.tight_layout()
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)

In [None]:
y_test = np.concatenate([y for x, y in int_test_ds], axis=0)
pred_classes = np.argmax(model.predict(int_test_ds), axis=-1)

In [None]:
print_validation_report(y_test, pred_classes)

In [None]:
plot_confusion_matrix(y_test,pred_classes)