# Section 1: Experiments with "Shallow" Machine Learning Algorithms

In this section, we will build eight "shallow" ML classifiers for text/node classification. We will implement each of them as a Scikit-Learn pipeline (XGBoost, CatBoost, and LightGBM come from their own packages but expose Scikit-Learn-compatible classifiers) and save the models for later comparison with other models. The models we will develop are:

    1a. Naive Bayes
    1b. XGBoost
    1c. Decision Trees
    1d. Random Forest
    1e. Gradient Boosting
    1f. CatBoost
    1g. LightGBM
    1h. Support Vector Machine (SVM) Classifiers


**The Dataset**: We will use the CiteSeer dataset and classify its documents, which are also the nodes of a citation graph. This dataset is a popular benchmark for graph-based ML. As of January 2025, the best reported accuracy is **82.07 ± 1.04**, achieved by ["ACMII-Snowball-2"](https://paperswithcode.com/paper/is-heterophily-a-real-nightmare-for-graph). A live update on the rankings can be found at this [link](https://paperswithcode.com/sota/node-classification-on-citeseer).

Can we beat it? Perhaps not so easily, as brilliant ML scientists and engineers have already thrown the kitchen sink at it. But we can definitely try! Why not dream? We will see how close we can get.

The information within the dataset: it contains 3327 scientific papers, each represented by a binary vector over 3703 words, where each value indicates the presence or absence of a word in the document. A **key feature** of the dataset is that, along with the text data, it also contains the citations among the papers as a citation graph (network). Here we only use the text data. In later sections, we will incorporate the graph data and see how it changes things. The availability of both types of data is the biggest reason we picked this dataset.

**The General Plan**:
1. <u>Build a Modeling Pipeline</u>: For each model, we will create a "pipeline". A pipeline can include everything between the inputs and the outputs. For example, we may want to represent our texts as a certain kind of vector (e.g., one_hot, TF-IDF). Then, we may want to transform these vectors and reduce their dimensions using methods such as Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF). Finally, we feed the result into our model. This workflow can be conveniently represented as a pipeline, as we will see.

2. <u>Train, Validate, and Test</u>: After training, we will check the validation and the test accuracies. 

3. <u>Save the Models</u>: We will then save the models so that we can call them up again in later sections.

It is almost as simple as it sounds. There are some nuances to these methods, but we do not need to worry about them now. We will discuss them as they become necessary.
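To make the plan concrete, here is a minimal sketch of a three-step pipeline (TF-IDF re-weighting, then SVD, then a classifier). It runs on a small synthetic binary matrix, not on CiteSeer; the names `X_demo`, `y_demo`, and `demo_pipeline`, the matrix size, and `n_components=10` are all illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary word-presence matrix (NOT the CiteSeer data)
rng = np.random.default_rng(0)
X_demo = (rng.random((60, 40)) < 0.2).astype(float)   # 60 "documents", 40 "words"
y_demo = (X_demo[:, :5].sum(axis=1) > 0).astype(int)  # toy labels derived from the first 5 words

demo_pipeline = Pipeline([
    ("tfidf", TfidfTransformer()),                           # re-weight the word indicators
    ("svd", TruncatedSVD(n_components=10, random_state=0)),  # reduce 40 dimensions to 10
    ("clf", DecisionTreeClassifier(random_state=0)),         # final estimator
])

# One call fits every step in order; predict() replays the same transformations
demo_pipeline.fit(X_demo, y_demo)
demo_pred = demo_pipeline.predict(X_demo)

print(list(demo_pipeline.named_steps))
print(demo_pred.shape)
```

A decision tree is used as the final step here because SVD output can contain negative values, which `MultinomialNB` does not accept.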

Enough talking! Let's get started!

In [25]:
# First thing, get some essential Packages
# We also create a new directory to save the models

# NumPy and pandas for arrays and tables
import numpy as np
import pandas as pd
np.random.seed(0)

# Visualization
import networkx as nx
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

import itertools
from collections import Counter

import os

# Define the name of the directory to be created
directory_name = "Saved_ML_models_Exp1"

# Get the current working directory
current_working_directory = os.getcwd()
# Create the full path for the new directory
new_directory_path = os.path.join(current_working_directory, directory_name)

# Check if the directory exists, and create it if it does not
if not os.path.exists(new_directory_path):
    os.makedirs(new_directory_path)
    print(f"Directory '{directory_name}' created at {new_directory_path}")
else:
    print(f"Directory '{directory_name}' already exists at {new_directory_path}")


Directory 'Saved_ML_models_Exp1' created at c:\Users\rouss\Documents\GitHub\Many_MLs_for_Node_Classification\Saved_ML_models_Exp1


## Get the CiteSeer Dataset
This dataset is available through PyTorch Geometric, a package dedicated to graph neural networks. CiteSeer is one of several datasets it provides.

In [3]:
from torch_geometric.datasets import Planetoid

# Import dataset from PyTorch Geometric
dataset = Planetoid(root=".", name="CiteSeer")

data = dataset[0] # We extract the data we need.

In [20]:
# Print information about the dataset
print("Dataset name:", dataset)
print("Input Text Data shape:", data.x.shape)
print("First five rows of the text data:\n", data.x[0:5, :])

Dataset name: CiteSeer()
Input Text Data shape: torch.Size([3327, 3703])
First five rows of the text data:
 tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


As we see, the dataset has 3327 documents as rows, described by a vocabulary of 3703 unique words. Each document is represented as a binary vector of length 3703: an entry is 1 if the corresponding word appears in the document and 0 if it does not. Strictly speaking, this is a multi-hot (binary bag-of-words) encoding, since a document can contain many words; we keep calling it "one-hot" to match the variable names below. The only requirement is that every document follows the same order of words.
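As a small illustration of this encoding, the sketch below builds binary vectors for two toy documents. The five-word vocabulary, the documents, and the helper `to_binary_vector` are hypothetical and are not taken from CiteSeer.

```python
import numpy as np

# Hypothetical toy vocabulary; the position of each word fixes its column
vocabulary = ["graph", "learning", "neural", "network", "query"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def to_binary_vector(tokens, word_to_index):
    """Return a 0/1 vector marking which vocabulary words occur in `tokens`."""
    vec = np.zeros(len(word_to_index), dtype=np.float32)
    for token in tokens:
        idx = word_to_index.get(token)
        if idx is not None:      # words outside the vocabulary are ignored
            vec[idx] = 1.0       # presence only: repeated words still give 1
    return vec

doc_a = ["neural", "network", "learning", "neural"]
doc_b = ["graph", "query", "database"]

X_toy = np.vstack([to_binary_vector(doc, word_to_index) for doc in (doc_a, doc_b)])
print(X_toy)
# [[0. 1. 1. 1. 0.]
#  [1. 0. 0. 0. 1.]]
```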

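An interesting point is the array type: `data.x` is a `torch.Tensor`, not a NumPy array. CPU tensors convert to NumPy easily (for example with `np.array(...)` or `.numpy()`), and most of the estimators below accept them directly. Where one does not, as with CatBoost later on, we convert explicitly.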

### Train-Validation-Test Splitting
We already have the mask values in the dataset, so we just use them to split the data. Because the inputs are in one_hot format, we add a `_one_hot` suffix to the variable names. Later, we will use a list-of-tokens/features representation, so the distinction may be helpful.

In [None]:
train_sentences_one_hot = data.x[data.train_mask]
train_labels = data.y[data.train_mask]

val_sentences_one_hot = data.x[data.val_mask]
val_labels = data.y[data.val_mask]

test_sentences_one_hot = data.x[data.test_mask]
test_labels = data.y[data.test_mask]

print(train_sentences_one_hot.shape, val_sentences_one_hot.shape, test_sentences_one_hot.shape)

torch.Size([120, 3703]) torch.Size([500, 3703]) torch.Size([1000, 3703])


**Important**: Please note that we are not using all of the data available: the three splits cover 1620 of the 3327 documents, about half. Moreover, we are using just 120 documents for training. These splits are the ones stipulated for benchmarking the models we saw earlier, so we keep them as they are to be able to compare with the state-of-the-art results.
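The mask-based split can be sketched with plain NumPy boolean indexing. The ten-node toy arrays below are invented for illustration; the point is that rows not covered by any mask are simply left out, just as 3327 - (120 + 500 + 1000) = 1707 CiteSeer documents are.

```python
import numpy as np

n_nodes = 10
features = np.arange(n_nodes * 3).reshape(n_nodes, 3)  # toy feature matrix
labels = np.arange(n_nodes) % 2                        # toy labels

# Boolean masks of length n_nodes, one per split (hypothetical sizes)
train_mask = np.zeros(n_nodes, dtype=bool)
val_mask = np.zeros(n_nodes, dtype=bool)
test_mask = np.zeros(n_nodes, dtype=bool)
train_mask[:2] = True
val_mask[2:5] = True
test_mask[5:9] = True          # node 9 belongs to no split

# Indexing with a mask keeps only the rows where the mask is True
X_train, y_train = features[train_mask], labels[train_mask]
X_val, y_val = features[val_mask], labels[val_mask]
X_test, y_test = features[test_mask], labels[test_mask]

unused = ~(train_mask | val_mask | test_mask)
print(X_train.shape, X_val.shape, X_test.shape, int(unused.sum()))
# (2, 3) (3, 3) (4, 3) 1
```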

Now, we are ready to get modeling!

## Model Set 1: Shallow Machine Learning Models

In the next block, we load all the packages we will need. We also create a function that computes several evaluation metrics (accuracy, precision, recall, and F1-score) from the true labels and the predicted labels.

In [21]:
from sklearn.pipeline import Pipeline
from sklearn import set_config

# Vector representations and transformations
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD, NMF

# Classifiers assemble!
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.svm import SVC

# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pickle # For saving models

def calculate_results(y_true, y_pred):
    """Return a dictionary of accuracy (in percent), precision, recall, and f1-score.

    y_true: true labels in the form of a 1D array
    y_pred: predicted labels in the form of a 1D array
    """
    # Calculate model accuracy as a percentage (0-100)
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Calculate model precision, recall and f1 score using "weighted" average (0-1 range)
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {"accuracy": model_accuracy,
                     "precision": model_precision,
                     "recall": model_recall,
                     "f1": model_f1}
    return model_results
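
To see what the "weighted" average in `calculate_results` means, here is a hand computation on six invented labels: each class's precision and recall is weighted by the number of true samples of that class (its support). A side effect worth knowing is that weighted recall always equals plain accuracy, which is why the `recall` values in the outputs below match `accuracy / 100`.

```python
import numpy as np

# Invented toy labels, used only to illustrate the arithmetic
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])

classes = np.unique(y_true)
support = np.array([(y_true == c).sum() for c in classes])  # true samples per class

precision, recall = [], []
for c in classes:
    true_positive = ((y_pred == c) & (y_true == c)).sum()
    n_predicted = (y_pred == c).sum()
    n_actual = (y_true == c).sum()
    precision.append(true_positive / n_predicted if n_predicted else 0.0)
    recall.append(true_positive / n_actual)

# Weighted average: each class contributes in proportion to its support
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)
accuracy = (y_true == y_pred).mean()

print(weighted_precision)         # 0.75
print(weighted_recall, accuracy)  # both are 4/6
```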

### Model 1a: Multinomial Naive Bayes

In [27]:
# 1. Create modeling pipeline
model_1a = Pipeline([
                    ("tfidf", TfidfTransformer()), # re-weight the binary word indicators using tfidf
                    ("MNB_clf", MultinomialNB()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1a.fit(train_sentences_one_hot, train_labels)
print(model_1a)

print(calculate_results(y_pred = model_1a.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1a.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1a.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1a_pipeline_MNB.pkl", "wb") as file_to_write:
    pickle.dump(model_1a, file_to_write)


# ---------- A demo for later: How to load these models ----------------------------
# Load/read our model!
# with open("Saved_ML_models_Exp1/model_1a_pipeline_MNB.pkl", "rb") as file_to_read:
#     model_1a_loaded = pickle.load(file_to_read)

# Check for sameness of predictions
# (model_1a.predict(train_sentences_one_hot) == model_1a_loaded.predict(train_sentences_one_hot)).all()

# How to get prediction probabilities, instead of just the predictions?
# This will be important when we try to combine the predictions from multiple models.
# model_1a.predict_proba(test_sentences_one_hot)


Pipeline(steps=[('tfidf', TfidfTransformer()), ('MNB_clf', MultinomialNB())])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 54.0, 'precision': 0.5777780591193468, 'recall': 0.54, 'f1': 0.5512981198811956}
{'accuracy': 56.3, 'precision': 0.5820479271082984, 'recall': 0.563, 'f1': 0.568783273700513}


### Model 1b: XGBoost

In [28]:
# 1. Create modeling pipeline
model_1b = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=1500)), # Truncated SVD
                    ("XGB_clf", XGBClassifier()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1b.fit(train_sentences_one_hot, train_labels)
print(model_1b)

print(calculate_results(y_pred = model_1b.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1b.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1b.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1b_pipeline_XGB.pkl", "wb") as file_to_write:
    pickle.dump(model_1b, file_to_write)

Pipeline(steps=[('XGB_clf',
                 XGBClassifier(base_score=None, booster=None, callbacks=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, device=None,
                               early_stopping_rounds=None,
                               enable_categorical=False, eval_metric=None,
                               feature_types=None, gamma=None, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, multi_strategy=None,
                               n_estimators=N

### Model 1c: Decision Trees

In [41]:
# 1. Create modeling pipeline
model_1c = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=3700)), # Truncated SVD
                    ("DT_clf", DecisionTreeClassifier()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1c.fit(train_sentences_one_hot, train_labels)
print(model_1c)

print(calculate_results(y_pred = model_1c.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1c.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1c.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1c_pipeline_DT.pkl", "wb") as file_to_write:
    pickle.dump(model_1c, file_to_write)

Pipeline(steps=[('DT_clf', DecisionTreeClassifier())])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 41.4, 'precision': 0.473685047803572, 'recall': 0.414, 'f1': 0.4268444624555068}
{'accuracy': 43.7, 'precision': 0.474738278321096, 'recall': 0.437, 'f1': 0.44206067878426303}


### Model 1d: Random Forest

In [44]:
# 1. Create modeling pipeline
model_1d = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=3700)), # Truncated SVD
                    ("RF_clf", RandomForestClassifier()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1d.fit(train_sentences_one_hot, train_labels)
print(model_1d)

print(calculate_results(y_pred = model_1d.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1d.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1d.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1d_pipeline_RF.pkl", "wb") as file_to_write:
    pickle.dump(model_1d, file_to_write)

Pipeline(steps=[('RF_clf', RandomForestClassifier())])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 54.400000000000006, 'precision': 0.5930279530821303, 'recall': 0.544, 'f1': 0.5557055661914357}
{'accuracy': 57.099999999999994, 'precision': 0.6131233852966536, 'recall': 0.571, 'f1': 0.5753765396638579}


### Model 1e: Gradient Boosting

In [45]:
# 1. Create modeling pipeline
model_1e = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=3700)), # Truncated SVD
                    ("GB_clf", GradientBoostingClassifier()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1e.fit(train_sentences_one_hot, train_labels)
print(model_1e)

print(calculate_results(y_pred = model_1e.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1e.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1e.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1e_pipeline_GB.pkl", "wb") as file_to_write:
    pickle.dump(model_1e, file_to_write)

Pipeline(steps=[('GB_clf', GradientBoostingClassifier())])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 47.0, 'precision': 0.5394820512820513, 'recall': 0.47, 'f1': 0.49385439878950516}
{'accuracy': 53.400000000000006, 'precision': 0.5870791451412424, 'recall': 0.534, 'f1': 0.5510373154313499}


### Model 1f: CatBoost

In [46]:
# 1. Create modeling pipeline
model_1f = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=500)), # Truncated SVD
                    ("CB_clf", CatBoostClassifier(silent = True)) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1f.fit(np.array(train_sentences_one_hot), np.array(train_labels))
print(model_1f)

print(calculate_results(y_pred = model_1f.predict(np.array(train_sentences_one_hot)), y_true = np.array(train_labels)))
print(calculate_results(y_pred = model_1f.predict(np.array(val_sentences_one_hot)), y_true = np.array(val_labels)))
print(calculate_results(y_pred = model_1f.predict(np.array(test_sentences_one_hot)), y_true = np.array(test_labels)))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1f_pipeline_CB.pkl", "wb") as file_to_write:
    pickle.dump(model_1f, file_to_write)

Pipeline(steps=[('CB_clf',
                 <catboost.core.CatBoostClassifier object at 0x000001B63FF38BC0>)])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 52.800000000000004, 'precision': 0.5786610684770186, 'recall': 0.528, 'f1': 0.5395437756984717}
{'accuracy': 59.599999999999994, 'precision': 0.6164890745863009, 'recall': 0.596, 'f1': 0.6020877472151334}


### Model 1g: LightGBM

In [58]:
# 1. Create modeling pipeline
model_1g = Pipeline([
                    ("tfidf", TfidfTransformer()), # re-weight the binary word indicators using tfidf
                    # ("svd", TruncatedSVD(n_components=3700)), # Truncated SVD
                    ("LGBM_clf", LGBMClassifier(verbose = 0, num_leaves = 20)) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1g.fit(train_sentences_one_hot, train_labels)
print(model_1g)

print(calculate_results(y_pred = model_1g.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1g.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1g.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1g_pipeline_LGBM.pkl", "wb") as file_to_write:
    pickle.dump(model_1g, file_to_write)

Pipeline(steps=[('tfidf', TfidfTransformer()),
                ('LGBM_clf', LGBMClassifier(num_leaves=20, verbose=0))])
{'accuracy': 57.49999999999999, 'precision': 0.5765988037727168, 'recall': 0.575, 'f1': 0.570177902406568}
{'accuracy': 35.0, 'precision': 0.38156110363793977, 'recall': 0.35, 'f1': 0.3594627867221727}
{'accuracy': 33.5, 'precision': 0.3559731855780759, 'recall': 0.335, 'f1': 0.33669096582481817}


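The warnings that LightGBM prints during training indicate that the tree-growing process was stopped before reaching the max_depth/num_leaves. LightGBM's default number of leaves is 31; here we set num_leaves = 20. By trial and error, we found that any num_leaves > 2 generates this warning. The reason is that we have a very small number of samples in the training data (120), too small for the tree-growing process to keep finding valid splits. This likely also explains why, unlike the other models whose results are shown, LightGBM does not reach 100% training accuracy.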
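
A rough back-of-the-envelope sketch of why so few samples limit tree growth. It assumes LightGBM's default `min_child_samples` of 20 is in effect (the pipeline above does not override it), and it uses a synthetic binary matrix with an assumed 1% density in place of the real training data. This is a simplified illustration, not a reproduction of LightGBM's split search.

```python
import numpy as np

n_train = 120            # training documents in the benchmark split
n_words = 3703           # vocabulary size
min_child_samples = 20   # LightGBM default; assumed to apply to the model above

# Every leaf must hold at least min_child_samples rows, which caps the leaf count
max_leaves = n_train // min_child_samples
print(max_leaves)  # 6, far below num_leaves = 20

# Synthetic sparse binary features (assumed ~1% density, NOT the CiteSeer data)
rng = np.random.default_rng(0)
X_synth = (rng.random((n_train, n_words)) < 0.01).astype(int)

# A present/absent split on a word puts doc_freq rows on one side and the rest
# on the other; both sides must satisfy the minimum leaf size.
doc_freq = X_synth.sum(axis=0)
splittable = (doc_freq >= min_child_samples) & (doc_freq <= n_train - min_child_samples)
print(int(splittable.sum()), "of", n_words, "synthetic words allow a valid first split")
```

Real word frequencies are skewed, so some frequent words will pass this test on CiteSeer. Applying the same count to the training matrix would show how many.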

### Model 1h: SVM Classifier

In [59]:
# 1. Create modeling pipeline
model_1h = Pipeline([
                    # ("tfidf", TfidfTransformer()), # convert words to numbers using tfidf
                    # ("svd", TruncatedSVD(n_components=1500)), # Truncated SVD
                    ("SVM_clf", SVC()) # model the text
])

# 2. Fit the model pipeline to the training data and calculate accuracies on different sets
model_1h.fit(train_sentences_one_hot, train_labels)
print(model_1h)

print(calculate_results(y_pred = model_1h.predict(train_sentences_one_hot), y_true = train_labels))
print(calculate_results(y_pred = model_1h.predict(val_sentences_one_hot), y_true = val_labels))
print(calculate_results(y_pred = model_1h.predict(test_sentences_one_hot), y_true = test_labels))

# 3. Save/dump/write our model!
with open("Saved_ML_models_Exp1/model_1h_pipeline_SVM.pkl", "wb") as file_to_write:
    pickle.dump(model_1h, file_to_write)

Pipeline(steps=[('SVM_clf', SVC())])
{'accuracy': 100.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'accuracy': 55.60000000000001, 'precision': 0.6345146732750749, 'recall': 0.556, 'f1': 0.5777567998764525}
{'accuracy': 56.39999999999999, 'precision': 0.635814400637842, 'recall': 0.564, 'f1': 0.5798618433841595}
