# Section 2: Experiments with Deep Learning Models

In this section, we will build eight "shallow" ML classifiers for text/node classification. We will learn how to implement each of them through the Scikit-Learn interface (XGBoost, CatBoost, and LightGBM are separate packages that provide Scikit-Learn-compatible estimators) and save the models for later comparison with other models. The models we will develop are:

    1a. Naive Bayes
    1b. XGBoost
    1c. Decision Trees
    1d. Random Forest
    1e. Gradient Boosting
    1f. CatBoost
    1g. LightGBM
    1h. Support Vector Machine (SVM) Classifiers


**The Dataset**: We will use the CiteSeer dataset and classify its documents, which are the nodes of a citation graph. This dataset is a popular benchmark for graph-based ML. As of January 2025, the best reported accuracy is **82.07 ± 1.04**, achieved by ["ACMII-Snowball-2"](https://paperswithcode.com/paper/is-heterophily-a-real-nightmare-for-graph). Live rankings can be found at this [link](https://paperswithcode.com/sota/node-classification-on-citeseer).

Can we beat it? Perhaps not so easily, as brilliant ML scientists and engineers have already thrown the kitchen sink at it. But we can definitely try! Why not dream? We will see how close we can get.

The information within the dataset: This dataset contains a set of 3327 scientific papers represented by binary vectors over a vocabulary of 3703 words, where each value represents the presence or absence of a word in the document. A **key feature** of the dataset is that, along with the text data, it also contains the citations among the papers as a citation graph or network. Here we only use the text data. In later sections, we will incorporate the graph data and see how it changes things. The availability of both types of data is the biggest reason we picked this dataset.

**The General Plan**:
1. <u>Build a Modeling Pipeline</u>: For each model, we will create a "pipeline". A pipeline can include everything between the inputs and the outputs. For example, we may want to represent our texts as a certain kind of vector (e.g., one_hot, TF-IDF). Then, we may want to transform the vectors and reduce their dimensions using methods such as Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF). Finally, we feed the result into our model. This workflow can be conveniently represented as a pipeline, as we will see.

2. <u>Train, Validate, and Test</u>: After training, we will check the validation and the test accuracies. 

3. <u>Save the Models</u>: We will then save the models so that we can call them up again in later sections.
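
The three steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the exact setup used later: the data is a small synthetic binary matrix (not CiteSeer), the 60/20/20 split, the choice of TF-IDF + SVD + linear SVM, the number of SVD components, and the file name `svm_pipeline.joblib` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a binary word-presence matrix (NOT the CiteSeer data).
rng = np.random.default_rng(0)
n_docs, n_words, n_classes = 300, 50, 3
y = rng.integers(0, n_classes, size=n_docs)

# Each class favors its own block of 10 words, so the toy problem is learnable.
word_probs = np.full((n_classes, n_words), 0.05)
for c in range(n_classes):
    word_probs[c, c * 10:(c + 1) * 10] = 0.5
X = (rng.random((n_docs, n_words)) < word_probs[y]).astype(np.float64)

# Hypothetical 60/20/20 split into train, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

# Step 1: build the pipeline (re-weight -> reduce dimensions -> classify).
pipe = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("svd", TruncatedSVD(n_components=10, random_state=0)),
    ("clf", SVC(kernel="linear")),
])

# Step 2: train, then check validation and test accuracy.
pipe.fit(X_train, y_train)
val_acc = pipe.score(X_val, y_val)
test_acc = pipe.score(X_test, y_test)
print(f"Validation accuracy: {val_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")

# Step 3: save the fitted pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "svm_pipeline.joblib")
joblib.dump(pipe, model_path)
reloaded = joblib.load(model_path)
```

Because the whole pipeline is saved as one object, the reloaded model applies the same TF-IDF and SVD transformations before predicting, so it can be called directly on raw feature vectors.
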

It is almost as simple as it sounds. Of course, there are some nuances to these methods, but we do not need to worry too much about them now. We will discuss things as they become necessary.

Enough talking! Let's get started!

In [None]:
# First thing, get some essential Packages
# We also create a new directory to save the models

# NumPy for matrices, pandas for tabular data
import numpy as np
import pandas as pd
np.random.seed(0)

# Visualization
import networkx as nx
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

import itertools
from collections import Counter

import os

# Define the name of the directory to be created
directory_name = "Saved_ML_models_Exp1"

# Get the current working directory
current_working_directory = os.getcwd()
# Create the full path for the new directory
new_directory_path = os.path.join(current_working_directory, directory_name)

# Check if the directory exists, and create it if it does not
if not os.path.exists(new_directory_path):
    os.makedirs(new_directory_path)
    print(f"Directory '{directory_name}' created at {new_directory_path}")
else:
    print(f"Directory '{directory_name}' already exists at {new_directory_path}")


Directory 'Saved_ML_models_Exp1' created at c:\Users\rouss\Documents\GitHub\Many_MLs_for_Node_Classification\Saved_ML_models_Exp1


## Get the CiteSeer Dataset
This dataset is available through PyTorch Geometric, a package dedicated to graph neural networks. CiteSeer is one of several datasets it provides.

In [None]:
from torch_geometric.datasets import Planetoid

# Import dataset from PyTorch Geometric
dataset = Planetoid(root=".", name="CiteSeer")

data = dataset[0] # We extract the data we need.

In [None]:
# Print information about the dataset
print("Dataset name:", dataset)
print("Input Text Data shape:", data.x.shape)
print("First five rows of the text data:\n", data.x[0:5, :])

Dataset name: CiteSeer()
Input Text Data shape: torch.Size([3327, 3703])
First five rows of the text data:
 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


As we see, the dataset has 3327 documents as rows, described by 3703 unique words. Each document is represented as a binary vector of length 3703. We call these one-hot vectors here, although strictly speaking they are "multi-hot", since many entries of a vector can be 1. The idea is simple: if a word appears in the document, we set its entry to 1, and if not, we set it to 0. We just need to follow the same order of words for each document, and that is it.

An interesting point is the array type, which is `torch.Tensor`. A torch tensor that lives on the CPU can be converted to a NumPy array with `.numpy()`, so we should be fine when passing the data to Scikit-Learn.
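
As a quick sanity check of that claim, the sketch below converts a small tensor to NumPy. The tensor here is a made-up stand-in for `data.x`, and the `detach().cpu()` chain is a defensive habit rather than something this dataset requires.

```python
import numpy as np
import torch

# A small made-up tensor standing in for data.x (NOT the real features).
x = torch.zeros(5, 8)
x[0, 2] = 1.0

# For a CPU tensor, .numpy() returns an array that shares memory with it.
x_np = x.numpy()

# A more defensive form that also works for tensors on a GPU
# or tensors that track gradients.
x_np_safe = x.detach().cpu().numpy()

print(type(x_np), x_np.shape, x_np.dtype)
```

The resulting array keeps the tensor's shape and `float32` dtype, which Scikit-Learn estimators generally accept as input.
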

Now, we are ready to get modeling!