# Lab Assignment Seven: Recurrent Network Architectures

CS 5324

2021-05-12

Anthony Wang

## Preparation

[The dataset](https://www.kaggle.com/hetulmehta/website-classification) is a list of websites that have been scraped for text and classified into one of sixteen categories based on their content. For training a recurrent network, the length of each site's text will be homogenized to 100 words via clipping and padding. This length errs on the side of clipping: although the average site contains over 700 words of text, that figure is heavily skewed by large outliers. Most sites are expected to contain at least 100 words, which should be sufficient to determine the purpose of the website. The labels are one-hot encoded for model training and classification.
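
As a rough illustration of the clipping and padding described above, the following plain-Python sketch forces a token sequence to a fixed length. The helper name `pad_or_clip` and the padding value `0` are assumptions made for this example; in the notebook itself this work is delegated to `TextVectorization` through `output_sequence_length`.

```python
def pad_or_clip(tokens, length=100, pad_token=0):
    """Return exactly `length` tokens: clip long inputs, right-pad short ones."""
    clipped = tokens[:length]                      # drop everything past `length`
    padding = [pad_token] * (length - len(clipped))  # empty when already long enough
    return clipped + padding

# A short "site" is padded, a long one is clipped (length=5 keeps the demo readable)
short_site = pad_or_clip([5, 9, 2], length=5)
long_site = pad_or_clip(list(range(1, 11)), length=5)
print(short_site)
print(long_site)
```

Every sequence comes out with the same length, which is what allows sites of very different sizes to be batched together for the recurrent network.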

In [1]:
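import numpy as np
import tensorflow as tf
from pandas import DataFrame, read_csv
from tensorflow import keras
# Recent TensorFlow releases expose TextVectorization here; older ones keep it
# under tensorflow.keras.layers.experimental.preprocessing
from tensorflow.keras.layers import TextVectorization

dataframe = read_csv("website_classification.csv")
print(dataframe.Category.unique())

dataframe.drop(['Unnamed: 0', 'website_url'], axis=1, inplace=True)

# Category holds strings, so map it to integer codes before one-hot encoding.
# The one-hot matrix is kept separate because it does not fit in a single column.
category_codes = dataframe["Category"].astype("category").cat.codes
labels = keras.utils.to_categorical(category_codes)

# Assumes the only remaining non-label column holds the scraped text
train_samples = dataframe.drop(columns="Category").iloc[:, 0].astype(str).values

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=100)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)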

dataframe.sample(n=5)

ModuleNotFoundError: No module named 'tensorflow'

F-score will be used to evaluate performance. Applications of a site classifier are unlikely to strongly favor precision over recall or vice versa, so a metric that balances the two is appropriate. The data is also imbalanced, which makes accuracy unsuitable for evaluation: it would reward performance on the more frequent classes at the expense of the rarer ones.
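
To make the point about imbalance concrete, here is a small sketch comparing accuracy with a per-class averaged F-score. The class names, the 90/10 split, and the choice of macro averaging are all assumptions made for illustration; they are not taken from the dataset and do not necessarily match the metric used for the models.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced labels: 90 sites of one class, 10 of another
y_true = ["news"] * 90 + ["games"] * 10
# A degenerate classifier that always predicts the majority class
y_pred = ["news"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
classes = sorted(set(y_true))
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

The classifier never identifies the minority class, yet accuracy stays high, while the averaged F-score drops sharply because the minority class contributes a score of zero.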

In [None]:
# https://aakashgoel12.medium.com/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
# Definition of custom f-score function

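# Use the backend bundled with TensorFlow so it matches `from tensorflow import keras`
from tensorflow.keras import backend as K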

def F1(y_true, y_pred): #taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
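
The following plain-Python mirror of the function above is a sketch for checking by hand what the metric computes on a small batch. It assumes one-hot targets and probability-like predictions; `EPS` stands in for `K.epsilon()`, and the batch values are invented for the example.

```python
EPS = 1e-7  # stand-in for K.epsilon(), assuming the Keras default value

def batch_f1(y_true, y_pred):
    """Pooled F1 over every (sample, class) cell, rounding predictions at 0.5."""
    cells = [(t, round(p))
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    true_positives = sum(t * p for t, p in cells)
    possible_positives = sum(t for t, _ in cells)
    predicted_positives = sum(p for _, p in cells)
    precision = true_positives / (predicted_positives + EPS)
    recall = true_positives / (possible_positives + EPS)
    return 2 * precision * recall / (precision + recall + EPS)

# Invented batch: three samples, three classes
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.8, 0.1, 0.1],    # confident and correct
          [0.4, 0.35, 0.25],  # no class reaches 0.5, so nothing is predicted
          [0.2, 0.2, 0.6]]    # confident and correct
score = batch_f1(y_true, y_pred)
print(round(score, 3))
```

Because predictions are rounded rather than taken by argmax, the second sample counts as a missed positive (lower recall) but not as a false positive, and the counts are pooled over all classes in the batch rather than averaged per class.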

This dataset contains just 1408 instances. This is not a large amount, and a holdout split would further reduce the quantity of valuable training data. Usage of data will be maximized with stratified 10-fold cross-validation, in which every instance is used for both training and validation and each fold preserves the class proportions. This also ensures that training and evaluation do not depend on any single subset of the data.
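
The idea behind stratification can be sketched without any library: group the indices by class, then deal each group out across the folds in turn. This is a simplified stand-in for illustration, not scikit-learn's implementation (it does no shuffling), and the labels and helper name are made up.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign indices to folds so each fold keeps roughly the overall class mix."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in indices_by_class.values():
        # Deal each class out round-robin so no fold is starved of it
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
labels = ["a"] * 80 + ["b"] * 20
folds = stratified_folds(labels, n_splits=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts[0])
```

Each fold serves once as the validation set while the other nine are used for training, so every instance is validated on exactly once and the minority class appears in every fold.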

In [None]:
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

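# Features: each site's text as a fixed-length sequence of token indices
X = vectorizer(train_samples).numpy()
# Targets for stratification: one class label per site
y = dataframe["Category"].values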

## Modeling

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

# Utility function for plotting model statistics.
# `title` is optional, e.g. f'Fold {i+1}' when called from inside a cross-validation loop.
def plot_histories(histories, title=None):
    # Lowercase names so the lists do not shadow the F1 metric function
    f1 = []
    val_f1 = []
    loss = []
    val_loss = []
    for h in histories:
        f1.extend(h.history['F1'])
        val_f1.extend(h.history['val_F1'])
        loss.extend(h.history['loss'])
        val_loss.extend(h.history['val_loss'])

    plt.figure(figsize=(10,4))
    if title is not None:
        plt.suptitle(title)

    plt.subplot(2,2,1)
    plt.plot(f1)
    plt.ylabel('F-score')
    plt.title('Training')

    plt.subplot(2,2,2)
    plt.plot(val_f1)
    plt.title('Validation')

    plt.subplot(2,2,3)
    plt.plot(loss)
    plt.ylabel('Loss')
    plt.xlabel('epochs')

    plt.subplot(2,2,4)
    plt.plot(val_loss)
    plt.xlabel('epochs')

In [None]:
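# Assumes the 50-dimensional GloVe embedding file is in the local directory
embeddings_index = {}  # maps each word to its embedding vector
with open("glove.6B.50d.txt", encoding="utf8") as f:
    for line in f:
        # Each line is a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs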
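
A likely next step is to turn this index into an embedding matrix whose rows line up with the vectorizer's vocabulary. The sketch below shows that lookup on made-up data: the two 3-dimensional vectors, the five-word vocabulary, and the variable names are all hypothetical, and it assumes the first two vocabulary slots are reserved for padding and out-of-vocabulary tokens.

```python
import io

# Made-up stand-in for the GloVe file: one word and its coefficients per line
sample_file = io.StringIO("the 0.1 0.2 0.3\nsite 0.4 0.5 0.6\n")

embeddings_index = {}
for line in sample_file:
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = [float(x) for x in coefs.split()]

# Hypothetical vocabulary; slots 0 and 1 are assumed to be padding and unknown
vocabulary = ["", "[UNK]", "the", "site", "zebra"]
embedding_dim = 3

# Words missing from the index keep an all-zero row
embedding_matrix = [[0.0] * embedding_dim for _ in vocabulary]
for index, word in enumerate(vocabulary):
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

for word, row in zip(vocabulary, embedding_matrix):
    print(repr(word), row)
```

Row `i` of the matrix is then the vector for token index `i`, which is the layout an embedding layer expects when it is initialized from pretrained weights.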

