# Section 2: Experiments with Deep Learning Models

In this section, we build five deep learning models using **TensorFlow**. Later, we will build the same models in PyTorch and learn how to translate models between TensorFlow and PyTorch.

    2a. A simple dense model with only an embedding layer and a pooling layer between the input and the output layers
    2b. A deeper model with more layers
    2c. A Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM)
    2d. An RNN with Gated Recurrent Unit (GRU)
    2e. A Convolutional Neural Network (CNN)


**The Dataset** we will use is CiteSeer, and the task is to classify its documents (the nodes of the citation graph). This dataset is a popular benchmark for graph-based machine learning. As of January 2025, the best reported accuracy is **82.07 ± 1.04**, achieved by ["ACMII-Snowball-2"](https://paperswithcode.com/paper/is-heterophily-a-real-nightmare-for-graph). A live update on the rankings can be found in this [link](https://paperswithcode.com/sota/node-classification-on-citeseer).

Can we beat it? Perhaps not so easily, as brilliant ML scientists and engineers have already thrown the kitchen sink at it. But we can definitely try! Why not dream? We will see how close we can get.

The information within the dataset: This dataset contains a set of 3327 scientific papers represented by binary vectors of 3703 words, with the values representing the presence or absence of the words in the document. A **key feature** of the dataset is that, along with the text data, it also contains the citations among the papers as a citation graph or network. Here we only use the text data. In later sections, we will incorporate the graph data and see how it changes things. The availability of both types of data is the biggest reason we picked this dataset.

**The General Plan**:
1. <u>Build our Deep Learning models</u>: For each model, we will set up all layers between the inputs and the outputs. Some layers after the input layer may process the text data and convert it to numbers, such as tokenizing/vectorizing and embedding. Then, we may have a convolutional or a recurrent layer. We may also have pooling in between to reduce the dimensionality of the vectors. Finally, as we have a multi-class classification problem at hand, we will use the softmax function at the output layer.

2. <u>Train, Validate, and Test</u>: After training, we will check the validation and the test accuracies. 

3. <u>Save the Models</u>: We will then save the models so that we can call them up again in later sections.
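
To make the softmax output in step 1 concrete, here is a minimal pure-Python sketch of softmax together with the sparse categorical cross-entropy loss that the models in this section are compiled with. The raw scores are made-up numbers, and the function names are illustrative, not part of any library. One useful reference point: a model that has learned nothing and predicts 1/6 for each of the 6 classes has a loss of ln 6 ≈ 1.79, and most of the first-epoch losses in the training logs of this section start close to that value.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, true_class):
    """Loss for one example when the label is an integer class index."""
    return -math.log(probs[true_class])

# Hypothetical raw scores for one document over the 6 classes
logits = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3]
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # same idea as tf.argmax on the probabilities
loss = sparse_categorical_crossentropy(probs, true_class=0)

# A model that has learned nothing predicts 1/6 for every class
uniform_loss = sparse_categorical_crossentropy([1 / 6] * 6, true_class=0)
print(predicted_class, round(loss, 4), round(uniform_loss, 4))
```

The lower the probability assigned to the true class, the higher the loss, which is what training tries to push down.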

It is almost as simple as it sounds. Of course, there are some nuances to these methods, but we do not need to worry too much about them now. We will discuss things as they become necessary.

Finally, a big thanks to [Daniel Bourke](https://github.com/mrdbourke) for his awesome, student-friendly courses on Deep Learning, which helped me a lot in building the models in this section. Please consider taking his courses if you want a more detailed understanding of the deep learning models here.

Enough talking! Let's get started!

In [22]:
# First thing, get some essential Packages
# We also create a new directory to save the models

# Numpy for matrices
import numpy as np
import pandas as pd
np.random.seed(0)

# Visualization
import networkx as nx
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

import itertools
from collections import Counter

import os


# NO NEED FOR THE FOLLOWING AS IMPLEMENTED IN THE TENSORBOARD_CALLBACKS
# Define the name of the directory to be created
# directory_name = "Saved_ML_models_Exp2"

# # Get the current working directory
# current_working_directory = os.getcwd()
# # Create the full path for the new directory
# new_directory_path = os.path.join(current_working_directory, directory_name)

# # Check if the directory exists, and create it if it does not
# if not os.path.exists(new_directory_path):
#     os.makedirs(new_directory_path)
#     print(f"Directory '{directory_name}' created at {new_directory_path}")
# else:
#     print(f"Directory '{directory_name}' already exists at {new_directory_path}")


## Get the CiteSeer Dataset
This dataset is available through PyTorch Geometric, a package dedicated to graph neural networks. CiteSeer is one of several datasets it provides.

In [2]:
from torch_geometric.datasets import Planetoid

# Import dataset from PyTorch Geometric
dataset = Planetoid(root=".", name="CiteSeer")

data = dataset[0] # We extract the data we need.

In [3]:
# Print information about the dataset
print("Dataset name:", dataset)
print("Input Text Data shape:", data.x.shape)
print("First five rows of the text data:\n", data.x[0:5, :])

Dataset name: CiteSeer()
Input Text Data shape: torch.Size([3327, 3703])
First five rows of the text data:
 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


As we see, the dataset has 3327 documents as rows, described by 3703 unique words. Each document is represented as a binary vector of length 3703 (often loosely called "one-hot", although "multi-hot" is more precise, since a document usually contains many words). This simply means that if a word occurs in the document, we set its entry to 1, and if not, we set it to 0. We just need to follow the same order of words for each document, and that is it.
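
As a small illustration of this representation, the sketch below builds binary vectors for three toy documents. The vocabulary and the documents are invented for the example; the real dataset ships with the vectors already computed.

```python
# Toy vocabulary and documents (hypothetical; the real dataset has 3703 words)
vocabulary = ["graph", "learning", "network", "neural", "query"]
documents = [
    "neural network learning",
    "graph query",
    "learning learning learning",  # repeated words still give a single 1
]

def to_binary_vector(text, vocab):
    """Return 1 where the vocabulary word occurs in the text, else 0."""
    words_present = set(text.split())
    return [1 if word in words_present else 0 for word in vocab]

binary_vectors = [to_binary_vector(doc, vocabulary) for doc in documents]
for vector in binary_vectors:
    print(vector)
```

Note how the third document loses its word counts: the vector only records that "learning" is present.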

An interesting point is the array type, which is `torch.Tensor`. Torch tensors on the CPU convert easily to and from NumPy arrays (for example, with `.numpy()`), so we should be fine.

**Important**: Please note that we are not using all of the available data, only about half of the documents (120 for training, 500 for validation, and 1000 for testing). These splits are the ones used in benchmarking the different models that we saw earlier. We keep the split as is to be able to compare with the state-of-the-art results.

Now, we are ready to get modeling!

## Model Set 2: Deep Learning Models

In the next block, we load all the packages we need. We create a function to calculate several performance metrics (accuracy, precision, recall, and F1 score) from the true and the predicted labels. In this work, we will use some "helper" functions (such as "create_tensorboard_callback" and "unzip_data") developed by Daniel Bourke. Please see [here](https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/extras/helper_functions.py) for the functions and to modify as necessary. In the code below, we also specify directories to save our models and their training logs, which we will see shortly.

In [27]:
import tensorflow as tf
from tensorflow.keras import layers
tf.random.set_seed(42)


# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pickle # For saving models

def calculate_results(y_true, y_pred):
    """Calculate accuracy, precision, recall, and F1 score.

    y_true: true labels in the form of a 1D array
    y_pred: predicted labels in the form of a 1D array
    Returns a dictionary with the accuracy (in percent) and the
    weighted-average precision, recall, and F1 score (as fractions).
    """
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Calculate model precision, recall and f1 score using "weighted" average
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {"accuracy": model_accuracy,
                     "precision": model_precision,
                     "recall": model_recall,
                     "f1": model_f1}
    return model_results

!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

# Directories for the TensorBoard logs and for the entire saved models
SAVE_DIR_for_checkpoints = "Saved_DL_models_Exp2/Model_checkpoints"
SAVE_DIR_for_entire_models = "Saved_DL_models_Exp2"

--2025-01-18 14:05:10--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: 'helper_functions.py.4'

     0K ..........                                            100% 15.3M=0.001s

2025-01-18 14:05:10 (15.3 MB/s) - 'helper_functions.py.4' saved [10246/10246]
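
To clarify what the "weighted" average in `calculate_results` means, here is a hand-rolled sketch in pure Python. It is not how scikit-learn implements it; it only reproduces the definition: compute precision and recall per class, then average them with weights equal to each class's share of the true labels. The labels below are invented. A side effect worth knowing is that weighted recall always equals plain accuracy, which is why the "recall" values reported at the end of this section match the accuracy (e.g., 0.5 and 50.0).

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Per-class precision/recall averaged with weights = class share in y_true."""
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = 0.0
    for cls, count in support.items():
        weight = count / n
        # Convention used here: precision is 0 for a class that is never predicted
        cls_precision = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        cls_recall = correct[cls] / count
        precision += weight * cls_precision
        recall += weight * cls_recall
    return precision, recall

# Hypothetical labels for eight documents in three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 0]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 4), round(recall, 4), round(accuracy, 4))
```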



## Preparing Texts in two steps
### Tokenizing


Tokenizing simply means assigning each unique word in a set of documents a unique token (an ID) and then representing each document as a list of those tokens. The tokens allow us to represent the documents as numeric vectors. A related step we complete here is to "pad" the vectors, which simply makes them all the same size by adding zeros to the end. Please see the printout below for an example. The tokenized vectors are then used to create embeddings, which we look into in the next subsection.

We have a binary (multi-hot encoded) version of the documents at hand. This representation is essentially a vector representation we could use directly. However, we derive the tokenized version from it, as we want to use an embedding layer from TensorFlow, which expects documents as sequences of token IDs. We tried using TensorFlow's tokenizer function to tokenize, but it did not work so well with our binary data. It was easier to do the tokenizing ourselves than trying to fix it.

The code below assigns each word a token ID between 1 and the vocabulary size (3703). We do not use 0 as a token; instead, we pad the vectors with 0s so that they all have the same length, namely the number of words in the longest document.

In [12]:
def pad_sequences_upto_a_certain_length(i, x, desired_length):
    x = list(x)
    current_length = len(x)
    if current_length < desired_length:
        x[current_length: ] = np.zeros(desired_length - current_length, dtype = int)
    elif (current_length == desired_length):
        print("%i^{th} item: current_length == desired_length. So did nothing."%i)
    elif (current_length > desired_length):
        print("%i^{th} item: current_length > desired_length. Shouldn't happen. Please check"%i, current_length, desired_length)
    return(x)

# How many words are there in the document with the most words? We will pad the vectors up to that length.
max_tokens = int(data.x.sum(axis= 1).max())

temp_x = [np.argwhere(data.x[i, :]>0)+1 for i in range(data.x.shape[0])] # add 1 as we do not want to use 0 as a token. We will use it for padding.
temp_x = [i.squeeze().tolist() for i in temp_x]

padded_x = [pad_sequences_upto_a_certain_length(i_item, temp_x[i_item], desired_length = max_tokens) for i_item in range(len(temp_x))]
padded_x = np.array(padded_x) # np arrays would allow using the data masks

print("\nFirst two documents:\n", padded_x[0:2, :])

3046^{th} item: current_length == desired_length. So did nothing.

First two documents:
 [[ 185  258  363  561  566  598  601  602  638  730  806  817  943 1117
  1436 1546 1624 1636 1847 2086 2339 2344 2566 2605 2697 2742 2919 2971
  3503 3549 3648    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]
 [  83  103  115  418  654  798  806  832  850  893 1074 1084 1166 1289
  1954 2437 2511 2734 2742 2880 2910 2931 3017 3127 3161 3229 3255 3331
  3365 3448 3462 3640 3641    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]]
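
For comparison, the same tokenize-and-pad idea can be written more compactly with NumPy. This is a sketch on a toy matrix, not a drop-in replacement that has been tested on the real data; the function name is made up for the example. With the real dataset, one would presumably pass `data.x.numpy()` (assuming the tensor lives on the CPU).

```python
import numpy as np

def binary_rows_to_padded_tokens(binary_matrix):
    """Convert 0/1 rows into right-padded rows of 1-based token IDs."""
    binary_matrix = np.asarray(binary_matrix)
    max_len = int(binary_matrix.sum(axis=1).max())    # length of the longest document
    padded = np.zeros((binary_matrix.shape[0], max_len), dtype=int)
    for row_idx, row in enumerate(binary_matrix):
        tokens = np.flatnonzero(row) + 1               # shift by 1 so that 0 stays free for padding
        padded[row_idx, :len(tokens)] = tokens
    return padded

# Toy 3-document, 5-word matrix (hypothetical values)
toy = np.array([[0, 1, 1, 1, 0],
                [1, 0, 0, 0, 1],
                [0, 0, 0, 0, 0]])                      # a document with no words becomes all padding
toy_padded = binary_rows_to_padded_tokens(toy)
print(toy_padded)
```

Because the tokens come out in vocabulary order, their position in the sequence says nothing about word order in the original text.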


### Embedding

After tokenizing, we embed our vectors. In the last part, we discussed representing *each document* as a vector of tokens (words). In contrast, embedding turns *each word* into a vector in a latent space. The embeddings allow models to capture the relationships between words based on their context and usage in the text. Popular methods for generating embeddings include Word2Vec, GloVe, and BERT.

In summary, tokenizing breaks text into pieces, and embedding transforms those pieces into numerical representations that capture their meanings. This combination is fundamental for many NLP tasks.

Below, we set up the embedding layer. Its `input_dim` must be larger than the largest token ID. Since our tokens run from 1 to the number of unique words and 0 is reserved for padding, the minimum is the number of unique words plus 1. We set it 10 higher than the number of unique words as a cushion to ensure we are indeed taking all words into account.

In [13]:
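max_vocab_length = data.x.shape[1] + 10 # Token IDs run from 1 to 3703 and 0 is the padding token, so at least 3704 rows are needed; the rest is a cushion.

# Note: this single layer object is reused by every model below, so its weights are shared and keep training from one experiment to the next.
embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary (rows of the embedding table)
                             output_dim=128, # size of each embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=None, # how long is each input (None: taken from the input itself)
                             name="embedding_1")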
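
Conceptually, an embedding layer is a lookup table: each token ID selects one row of a trainable matrix. The NumPy sketch below mimics that lookup with toy sizes and random values (not trained weights), followed by the averaging that a global average pooling layer performs. It also shows a masked average that ignores padding positions. In Keras, passing `mask_zero=True` to `layers.Embedding` is the usual way to request that behaviour; whether it would help on this dataset is untested here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                  # toy sizes; the notebook uses 3713 and 128
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# Two padded documents of length 3 (token 0 is padding)
token_ids = np.array([[2, 3, 4],
                      [1, 5, 0]])

embedded = embedding_matrix[token_ids]        # lookup: shape (2, 3, 4)
plain_mean = embedded.mean(axis=1)            # averages the padding position as well

# Masked mean: ignore positions where the token is 0
mask = (token_ids != 0)[..., None]            # shape (2, 3, 1)
masked_mean = (embedded * mask).sum(axis=1) / mask.sum(axis=1)

print(embedded.shape, plain_mean.shape, masked_mean.shape)
```

The two averages agree for the first document (no padding) and differ for the second, where one of three positions is padding.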

## Train-Validation-Test Splitting
We already have the mask values in the dataset. We just use them to split it as necessary. Note that the inputs below are the padded token sequences we built above, not the original binary vectors, so the distinction between the two representations is worth keeping in mind.

In [14]:
train_sentences = padded_x[data.train_mask]
train_labels = data.y[data.train_mask]

val_sentences = padded_x[data.val_mask]
val_labels = data.y[data.val_mask]

test_sentences = padded_x[data.test_mask]
test_labels = data.y[data.test_mask]

print(train_sentences.shape, val_sentences.shape, test_sentences.shape)

(120, 54) (500, 54) (1000, 54)
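
The split works through boolean masks: each mask has one True/False entry per document, and indexing with it keeps only the True rows. The sketch below shows the mechanics on toy arrays (all values invented) and also counts documents that belong to no mask. In the real dataset, the three sets add up to 120 + 500 + 1000 = 1620 of the 3327 documents, so the remaining 1707 are simply unused.

```python
import numpy as np

# Toy stand-ins for padded_x, data.y, and the three masks (hypothetical values)
features = np.arange(12).reshape(6, 2)        # 6 documents, 2 tokens each
labels = np.array([0, 1, 2, 0, 1, 2])
train_mask = np.array([True, True, False, False, False, False])
val_mask = np.array([False, False, True, False, False, False])
test_mask = np.array([False, False, False, True, True, False])

x_train, y_train = features[train_mask], labels[train_mask]
x_val, y_val = features[val_mask], labels[val_mask]
x_test, y_test = features[test_mask], labels[test_mask]

# The masks should not overlap; documents covered by no mask are left out entirely
overlap = (train_mask & val_mask) | (train_mask & test_mask) | (val_mask & test_mask)
unused = ~(train_mask | val_mask | test_mask)
print(x_train.shape, x_val.shape, x_test.shape, int(overlap.sum()), int(unused.sum()))
```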


## Exp 2a: A not-so-deep deep learning model

Between the input and the output layers, we just have the embedding layer followed by an average pooling layer.
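
Even this small model has many trainable parameters relative to the training set. The arithmetic below is a back-of-the-envelope count based on the layer sizes used in this notebook (an embedding table with `max_vocab_length` rows of 128 values, and a 6-unit dense output with biases); the pooling layer has no parameters. `model_2a.summary()` should be the authoritative source for the actual numbers.

```python
vocab_rows = 3703 + 10     # input_dim of the embedding layer (max_vocab_length)
embed_dim = 128            # output_dim of the embedding layer
num_classes = 6            # units in the softmax output layer

embedding_params = vocab_rows * embed_dim                 # one 128-dim vector per token ID
output_params = embed_dim * num_classes + num_classes     # weights + biases
total_params = embedding_params + output_params
train_docs = 120

print(embedding_params, output_params, total_params)
print(f"about {total_params / train_docs:.0f} trainable parameters per training document")
```

With so few training documents per parameter, a large gap between training and validation accuracy is not surprising.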

With this simple model, we will also see how to save (1) training logs and (2) the entire model. We save the logs with a TensorBoard callback WHILE FITTING THE MODEL (Step 3). For this step, we use a convenient "helper" function, "create_tensorboard_callback", developed by Daniel Bourke. Please see [here](https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/extras/helper_functions.py) for the functions and to modify as necessary. Note that this callback writes TensorBoard log files (as the printout below confirms), not model weights; to save weight checkpoints during training, a `tf.keras.callbacks.ModelCheckpoint` callback could be added to the same list. After the model is fit, we save the model (Step 4). We use the new ".keras" format. Please see [here](https://www.tensorflow.org/tutorials/keras/save_and_load) for details and for the legacy save formats (i.e., .h5 and SavedModel). The logs and the model are saved in the directories we specified earlier.

In [31]:

# 1. Define or Build Model -- Define its layers (input, hidden layers, output)
input_length = train_sentences.shape[1]

inputs = layers.Input(shape=(input_length, ), dtype="int") # each input is a 1-dimensional array of token IDs of length input_length
x = embedding(inputs) # look up the embedding vector of each token ID
print(x.shape)
x = layers.GlobalAveragePooling1D()(x) # average over the sequence dimension to get one vector per document (try running the model without this layer and see what happens)
print(x.shape)
outputs = layers.Dense(6, activation="softmax")(x) # create the output layer: 6 classes, so 6 units with softmax activation
model_2a = tf.keras.Model(inputs, outputs, name="model_2a") # construct the model

# 1a. Print a model summary describing the layers and the parameters.
model_2a.summary()

# 2. Compile model -- set up the loss, optimizer, and accuracy metrics to use in training the model
model_2a.compile(loss= "sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# 3. Train or Fit model
model_2a_history = model_2a.fit(train_sentences, # padded arrays of token IDs, shape (120, 54)
                              train_labels,
                              epochs=50,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR_for_checkpoints, experiment_name="Model_2a")])

# 3a: Test accuracy across sets

model_2a_train_pred_probs = model_2a.predict(train_sentences)
model_2a_train_preds = tf.argmax(model_2a_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2a_val_pred_probs = model_2a.predict(val_sentences)
model_2a_val_preds = tf.argmax(model_2a_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2a_test_pred_probs = model_2a.predict(test_sentences)
model_2a_test_preds = tf.argmax(model_2a_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = model_2a_train_preds, y_true = train_labels))
print(calculate_results(y_pred = model_2a_val_preds, y_true = val_labels))
print(calculate_results(y_pred = model_2a_test_preds, y_true = test_labels))

# 4: Save Model in *.keras format
model_2a.save(SAVE_DIR_for_entire_models + "/Model_2a.keras")

(None, 54, 128)
(None, 128)


Saving TensorBoard log files to: Saved_DL_models_Exp2/Model_checkpoints/Model_2a/20250118-140716
Epoch 1/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 60ms/step - accuracy: 0.1629 - loss: 1.7587 - val_accuracy: 0.3560 - val_loss: 1.7530
Epoch 2/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.3640 - loss: 1.6731 - val_accuracy: 0.4080 - val_loss: 1.7392
Epoch 3/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - accuracy: 0.6304 - loss: 1.5938 - val_accuracy: 0.4540 - val_loss: 1.7260
Epoch 4/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - accuracy: 0.6719 - loss: 1.5170 - val_accuracy: 0.4720 - val_loss: 1.7132
Epoch 5/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - accuracy: 0.7013 - loss: 1.4425 - val_accuracy: 0.4860 - val_loss: 1.7007
Epoch 6/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.8542 - loss

## Exp 2b: Adding more layers

We add a dense layer with 128 neurons after the embedding layer.


In [None]:

# 1. Define or Build Model -- Define its layers (input, hidden layers, output)
input_length = train_sentences.shape[1]

inputs = layers.Input(shape=(input_length, ), dtype="int") # each input is a 1-dimensional array of token IDs of length input_length
x = embedding(inputs) # look up the embedding vector of each token ID
# -----> NEW LAYER ADDED ----> A Dense layer (applied to each token's embedding separately)
x = layers.Dense(128, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x) # Other options: GlobalMaxPooling1D
outputs = layers.Dense(6, activation="softmax")(x) # create the output layer: 6 classes, so 6 units with softmax activation

model_2b = tf.keras.Model(inputs, outputs, name="model_2b") # construct the model

# 1a. Print a model summary describing the layers and the parameters.
model_2b.summary()

# 2. Compile model -- set up the loss, optimizer, and accuracy metrics to use in training the model
model_2b.compile(loss= "sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# 3. Train or Fit model
model_2b_history = model_2b.fit(train_sentences, # padded arrays of token IDs, shape (120, 54)
                              train_labels,
                              epochs=50,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR_for_checkpoints, experiment_name="model_2b")])

# 3a: Test accuracy across sets

model_2b_train_pred_probs = model_2b.predict(train_sentences)
model_2b_train_preds = tf.argmax(model_2b_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2b_val_pred_probs = model_2b.predict(val_sentences)
model_2b_val_preds = tf.argmax(model_2b_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2b_test_pred_probs = model_2b.predict(test_sentences)
model_2b_test_preds = tf.argmax(model_2b_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = model_2b_train_preds, y_true = train_labels))
print(calculate_results(y_pred = model_2b_val_preds, y_true = val_labels))
print(calculate_results(y_pred = model_2b_test_preds, y_true = test_labels))

# 4: Save Model in *.keras format
model_2b.save(SAVE_DIR_for_entire_models + "/Model_2b.keras")

Saving TensorBoard log files to: Saved_DL_models_Exp2/Model_checkpoints/model_2b/20250118-140722
Epoch 1/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 69ms/step - accuracy: 0.4481 - loss: 1.7432 - val_accuracy: 0.2920 - val_loss: 1.7439
Epoch 2/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 0.6890 - loss: 1.4558 - val_accuracy: 0.3680 - val_loss: 1.6932
Epoch 3/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 1.0000 - loss: 1.2134 - val_accuracy: 0.4640 - val_loss: 1.6490
Epoch 4/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - accuracy: 1.0000 - loss: 1.0047 - val_accuracy: 0.5000 - val_loss: 1.6088
Epoch 5/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - accuracy: 1.0000 - loss: 0.8252 - val_accuracy: 0.5180 - val_loss: 1.5715
Epoch 6/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 1.0000 - loss

## Exp 2c: An RNN with LSTM
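We add an LSTM layer in the middle. There are two other changes from Model 2b. First, we no longer need pooling: by default, the LSTM layer returns only its final hidden state, which already collapses the sequence into a single vector per document. Second, we saw that the dense layer was not really helping, so we take that out too.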

In [33]:

# 1. Define or Build Model -- Define its layers (input, hidden layers, output)
input_length = train_sentences.shape[1]

inputs = layers.Input(shape=(input_length, ), dtype="int") # each input is a 1-dimensional array of token IDs of length input_length
x = embedding(inputs) # look up the embedding vector of each token ID
# x = layers.Dense(128, activation="relu")(x) # LAYER REMOVED -- as not helpful
# -----> NEW LAYER ADDED ----> LSTM
x = layers.LSTM(64)(x)
# x = layers.GlobalAveragePooling1D()(x) # LAYER REMOVED -- No longer needed
outputs = layers.Dense(6, activation="softmax")(x) # create the output layer: 6 classes, so 6 units with softmax activation

model_2c = tf.keras.Model(inputs, outputs, name="model_2c") # construct the model

# 1a. Print a model summary describing the layers and the parameters.
model_2c.summary()

# 2. Compile model -- set up the loss, optimizer, and accuracy metrics to use in training the model
model_2c.compile(loss= "sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# 3. Train or Fit model
model_2c_history = model_2c.fit(train_sentences, # padded arrays of token IDs, shape (120, 54)
                              train_labels,
                              epochs=50,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR_for_checkpoints, experiment_name="model_2c")])

# 3a: Test accuracy across sets

model_2c_train_pred_probs = model_2c.predict(train_sentences)
model_2c_train_preds = tf.argmax(model_2c_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2c_val_pred_probs = model_2c.predict(val_sentences)
model_2c_val_preds = tf.argmax(model_2c_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2c_test_pred_probs = model_2c.predict(test_sentences)
model_2c_test_preds = tf.argmax(model_2c_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = model_2c_train_preds, y_true = train_labels))
print(calculate_results(y_pred = model_2c_val_preds, y_true = val_labels))
print(calculate_results(y_pred = model_2c_test_preds, y_true = test_labels))

# 4: Save Model in *.keras format
model_2c.save(SAVE_DIR_for_entire_models + "/Model_2c.keras")

Saving TensorBoard log files to: Saved_DL_models_Exp2/Model_checkpoints/model_2c/20250118-140738
Epoch 1/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 103ms/step - accuracy: 0.1615 - loss: 1.8050 - val_accuracy: 0.1780 - val_loss: 1.7862
Epoch 2/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.3263 - loss: 1.7566 - val_accuracy: 0.2120 - val_loss: 1.7846
Epoch 3/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.6071 - loss: 1.6959 - val_accuracy: 0.1840 - val_loss: 1.7745
Epoch 4/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.7002 - loss: 1.5538 - val_accuracy: 0.2700 - val_loss: 1.7141
Epoch 5/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.7312 - loss: 1.2798 - val_accuracy: 0.3480 - val_loss: 1.6067
Epoch 6/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - accuracy: 0.7602 - los

## Exp 2d: An RNN with GRU

Change from 2c: Just replacing the LSTM layer with a GRU layer.

In [None]:

# 1. Define or Build Model -- Define its layers (input, hidden layers, output)
input_length = train_sentences.shape[1]

inputs = layers.Input(shape=(input_length, ), dtype="int") # each input is a 1-dimensional array of token IDs of length input_length
x = embedding(inputs) # look up the embedding vector of each token ID
# x = layers.Dense(128, activation="relu")(x) # LAYER REMOVED -- as not helpful
# -----> NEW LAYER ADDED ----> GRU
x = layers.GRU(64)(x)
# x = layers.GlobalAveragePooling1D()(x) # LAYER REMOVED -- No longer needed
outputs = layers.Dense(6, activation="softmax")(x) # create the output layer: 6 classes, so 6 units with softmax activation

model_2d = tf.keras.Model(inputs, outputs, name="model_2d") # construct the model

# 1a. Print a model summary describing the layers and the parameters.
model_2d.summary()

# 2. Compile model -- set up the loss, optimizer, and accuracy metrics to use in training the model
model_2d.compile(loss= "sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# 3. Train or Fit model
model_2d_history = model_2d.fit(train_sentences, # padded arrays of token IDs, shape (120, 54)
                              train_labels,
                              epochs=50,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR_for_checkpoints, experiment_name="model_2d")])

# 3a: Test accuracy across sets

model_2d_train_pred_probs = model_2d.predict(train_sentences)
model_2d_train_preds = tf.argmax(model_2d_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2d_val_pred_probs = model_2d.predict(val_sentences)
model_2d_val_preds = tf.argmax(model_2d_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2d_test_pred_probs = model_2d.predict(test_sentences)
model_2d_test_preds = tf.argmax(model_2d_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = model_2d_train_preds, y_true = train_labels))
print(calculate_results(y_pred = model_2d_val_preds, y_true = val_labels))
print(calculate_results(y_pred = model_2d_test_preds, y_true = test_labels))

# 4: Save Model in *.keras format
model_2d.save(SAVE_DIR_for_entire_models + "/Model_2d.keras")

Saving TensorBoard log files to: Saved_DL_models_Exp2/Model_checkpoints/model_2d/20250118-140813
Epoch 1/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 2s 117ms/step - accuracy: 0.1833 - loss: 1.8361 - val_accuracy: 0.2120 - val_loss: 1.7806
Epoch 2/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - accuracy: 0.1663 - loss: 1.7938 - val_accuracy: 0.2340 - val_loss: 1.7974
Epoch 3/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 0.1615 - loss: 1.8061 - val_accuracy: 0.0580 - val_loss: 1.8071
Epoch 4/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 0.1192 - loss: 1.8027 - val_accuracy: 0.0580 - val_loss: 1.8045
Epoch 5/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step - accuracy: 0.1254 - loss: 1.7956 - val_accuracy: 0.1720 - val_loss: 1.7984
Epoch 6/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step - accuracy: 0.1990 - los

## Exp 2e: A CNN
This time, we use a convolution layer followed by a pooling layer. Essentially, we add a convolution layer to model_2a before the pooling layer and switch the pooling from global average to global max.
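
To see what these two layers do to the shapes, here is a NumPy sketch of a 1D convolution ("valid" padding, stride 1, ReLU) followed by global max pooling, using the same sizes as the model in this experiment: 54 tokens, 128-dimensional embeddings, 32 filters, and a kernel size of 5. The weights are random stand-ins, and the loop is written for clarity rather than speed; it is not how Keras implements the layer.

```python
import numpy as np

def conv1d_valid_relu(sequence, kernels, bias):
    """1D convolution over the token axis ('valid' padding, stride 1), then ReLU.

    sequence: (steps, channels); kernels: (kernel_size, channels, filters)
    """
    kernel_size, _, filters = kernels.shape
    out_steps = sequence.shape[0] - kernel_size + 1
    output = np.empty((out_steps, filters))
    for t in range(out_steps):
        window = sequence[t:t + kernel_size]               # (kernel_size, channels)
        output[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(output, 0.0)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(54, 128))       # one document: 54 tokens x 128-dim embeddings
kernels = rng.normal(size=(5, 128, 32))     # kernel_size=5, 32 filters (random stand-ins)
bias = np.zeros(32)

feature_map = conv1d_valid_relu(sequence, kernels, bias)   # 54 - 5 + 1 = 50 positions
pooled = feature_map.max(axis=0)                           # global max pooling: one value per filter
print(feature_map.shape, pooled.shape)
```

Each filter looks at windows of 5 consecutive tokens, and the max pooling keeps only the strongest response of each filter across the document.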

In [36]:

# 1. Define or Build Model -- Define its layers (input, hidden layers, output)
input_length = train_sentences.shape[1]

inputs = layers.Input(shape=(input_length, ), dtype="int") # each input is a 1-dimensional array of token IDs of length input_length
x = embedding(inputs) # look up the embedding vector of each token ID
# x = layers.Dense(128, activation="relu")(x) # LAYER REMOVED -- as not helpful
# -----> NEW LAYERS ADDED ----> Conv1D + Pooling
x = layers.Conv1D(filters=32, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x) # Alt option: GlobalAveragePooling1D()(x)
outputs = layers.Dense(6, activation="softmax")(x) # create the output layer: 6 classes, so 6 units with softmax activation

model_2e = tf.keras.Model(inputs, outputs, name="model_2e") # construct the model

# 1a. Print a model summary describing the layers and the parameters.
model_2e.summary()

# 2. Compile model -- set up the loss, optimizer, and accuracy metrics to use in training the model
model_2e.compile(loss= "sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# 3. Train or Fit model
model_2e_history = model_2e.fit(train_sentences, # padded arrays of token IDs, shape (120, 54)
                              train_labels,
                              epochs=50,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR_for_checkpoints, experiment_name="model_2e")])

# 3a: Test accuracy across sets

model_2e_train_pred_probs = model_2e.predict(train_sentences)
model_2e_train_preds = tf.argmax(model_2e_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2e_val_pred_probs = model_2e.predict(val_sentences)
model_2e_val_preds = tf.argmax(model_2e_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

model_2e_test_pred_probs = model_2e.predict(test_sentences)
model_2e_test_preds = tf.argmax(model_2e_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = model_2e_train_preds, y_true = train_labels))
print(calculate_results(y_pred = model_2e_val_preds, y_true = val_labels))
print(calculate_results(y_pred = model_2e_test_preds, y_true = test_labels))

# 4: Save Model in *.keras format
model_2e.save(SAVE_DIR_for_entire_models + "/Model_2e.keras")

Saving TensorBoard log files to: Saved_DL_models_Exp2/Model_checkpoints/model_2e/20250118-140847
Epoch 1/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 67ms/step - accuracy: 0.2463 - loss: 2.1721 - val_accuracy: 0.3140 - val_loss: 1.8433
Epoch 2/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 0.7727 - loss: 0.7792 - val_accuracy: 0.3400 - val_loss: 1.5684
Epoch 3/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 0.9852 - loss: 0.2694 - val_accuracy: 0.4100 - val_loss: 1.4627
Epoch 4/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - accuracy: 1.0000 - loss: 0.1088 - val_accuracy: 0.4440 - val_loss: 1.4260
Epoch 5/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - accuracy: 1.0000 - loss: 0.0562 - val_accuracy: 0.4420 - val_loss: 1.4118
Epoch 6/50
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 1.0000 - loss

## How to Load and Use the Saved Models?

We show an example of how to load the models back and use them. Below, we load the last model we developed and check the architecture of the loaded model. As we see, it is the same as the one we had developed.

Then, we check the prediction accuracies again. Once again, we find a match with our original model.

In [39]:
new_model_2e = tf.keras.models.load_model(SAVE_DIR_for_entire_models + "/Model_2e.keras")

# Check its architecture
new_model_2e.summary() # summary() prints the table itself and returns None, so no print() is needed

# Check prediction accuracy across sets, this time using the loaded model

new_model_2e_train_pred_probs = new_model_2e.predict(train_sentences)
new_model_2e_train_preds = tf.argmax(new_model_2e_train_pred_probs, axis = 1) # Taking the most likely class as the prediction

new_model_2e_val_pred_probs = new_model_2e.predict(val_sentences)
new_model_2e_val_preds = tf.argmax(new_model_2e_val_pred_probs, axis = 1) # Taking the most likely class as the prediction

new_model_2e_test_pred_probs = new_model_2e.predict(test_sentences)
new_model_2e_test_preds = tf.argmax(new_model_2e_test_pred_probs, axis = 1) # Taking the most likely class as the prediction

print(calculate_results(y_pred = new_model_2e_train_preds, y_true = train_labels))
print(calculate_results(y_pred = new_model_2e_val_preds, y_true = val_labels))
print(calculate_results(y_pred = new_model_2e_test_preds, y_true = test_labels))

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 50.0, 'precision': 0.553013233374945, 'recall': 0.5, 'f1': 0.51651009579847}
{'accuracy': 52.6, 'precision': 0.5602566288976001, 'recall': 0.526, 'f1': 0.5373783896950333}


## Some Observations about the Models' Performance

We see that we could not do much better than the shallow ML methods. Of course, a different network architecture could yield improvements, but that seems improbable, as we have tested quite a few types of models by now without seeing a gain. Possible reasons are (1) the limited amount of training data we used and (2) the lack of information about the text contents of the documents, as our data only records the presence or absence of each word. We do not even have the word counts or the order in which the words occur. Therefore, it seems that there is not enough information for our deep learning models to find "deep" and complex relationships.

In the next section, we will incorporate the "citation network" (i.e., the edges between the documents) as additional information and see how it improves things. We will also turn off and on the two datastreams and see what happens, as a sort of an ablation study. Let's go then!
