### Notebook of Codes for ***Vis-SWNIR Spectroscopic and Hyperspectral Imaging Sensor Integrated with Artificial Intelligence for Early Diagnosis of Breast Cancer***


This notebook implements and evaluates four machine learning classifiers: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), and Random Forest (RF). The goal is to classify samples loaded from `.mat` files into two classes, "HC" and "BC".

The general workflow for each model is:

1.  Load training and testing datasets.
2.  Define the classifier with fixed hyperparameters that were previously tuned by grid search.
3.  Perform 5-fold stratified cross-validation on the training data.
4.  Train the model on the entire training set.
5.  Evaluate the final model on the unseen test set and generate a classification report.
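
Steps 3 to 5 can be sketched as one reusable function. This is a minimal sketch, not the notebook's own code: the helper name `evaluate_classifier`, the synthetic data from `make_classification`, and the small KNN model are all assumptions chosen so the example runs without the `.mat` files.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_classifier(model, X_train, y_train, X_test, y_test, n_splits=5, seed=42):
    """Hypothetical helper: cross-validate, refit on all training data, score on the test set."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Step 3: out-of-fold predictions for every training sample
    y_pred_cv = cross_val_predict(model, X_train, y_train, cv=cv)
    cv_acc = accuracy_score(y_train, y_pred_cv)
    # Step 4: refit on the entire training set
    model.fit(X_train, y_train)
    # Step 5: evaluate once on the held-out test set
    y_pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred_test)
    report = classification_report(y_test, y_pred_test, target_names=["HC", "BC"])
    return cv_acc, test_acc, report


# Synthetic stand-in with the same shape as the notebook's data (145 x 301); not the study data.
X, y = make_classification(n_samples=145, n_features=301, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=29, stratify=y, random_state=0)

cv_acc, test_acc, report = evaluate_classifier(
    KNeighborsClassifier(n_neighbors=5), X_tr, y_tr, X_te, y_te
)
print(f"CV accuracy: {cv_acc:.4f}  Test accuracy: {test_acc:.4f}")
print(report)
```

The cells below repeat these steps inline for each model rather than calling a shared helper.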


#### _1\. Utility Function for ***Data Loading***_

A helper function, `load_mat_data`, loads a numerical array from a `.mat` file. It tries two libraries in turn:

  * `h5py`: for MATLAB v7.3 `.mat` files, which are stored in HDF5 format.
  * `scipy.io.loadmat`: as a fallback for older `.mat` file versions.

If no `variable_name` is given, the first variable in the file that is not a metadata entry is used. The matrix can optionally be transposed on load, which is useful because `h5py` typically returns MATLAB arrays with their dimensions reversed.

In [1]:
import numpy as np
import h5py
from scipy.io import loadmat

def load_mat_data(filepath, variable_name=None, transpose=False):
    """
    Loads a numerical array from a .mat file, trying both h5py and scipy.
    """
    try:  # Method 1: Try h5py (for v7.3+ .mat files)
        with h5py.File(filepath, "r") as f:
            if variable_name:
                key = variable_name
            else:
                key = [k for k in f.keys() if not k.startswith("#")][0]
            data = np.array(f[key])
        print(f"Loaded '{filepath}' using h5py.")
    except Exception:  # Method 2: Try scipy.io.loadmat
        try:
            mat_contents = loadmat(filepath)
            if variable_name:
                key = variable_name
            else:
                key = [k for k in mat_contents.keys() if not k.startswith("__")][0]
            data = mat_contents[key]
            print(f"Loaded '{filepath}' using scipy.")
        except Exception as e:
            raise IOError(f"Failed to load data from {filepath}") from e

    if transpose:
        data = data.T
    print(f"  Final matrix shape: {data.shape}")
    return data
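
The key-selection logic in the `scipy` branch can be checked on a small file written with `scipy.io.savemat`. This is an illustrative sketch only: the file name `demo.mat` and the variable name `X` are made up here and are not part of the study data.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Write a small pre-v7.3 .mat file; "X" and "demo.mat" are hypothetical names.
demo = np.arange(6, dtype=float).reshape(2, 3)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.mat")
    savemat(path, {"X": demo})
    contents = loadmat(path)  # data is read into memory before the file is removed

# loadmat adds metadata entries such as __header__, __version__ and __globals__;
# filtering on the "__" prefix leaves only the stored variables.
data_keys = [k for k in contents if not k.startswith("__")]
loaded = contents[data_keys[0]]

print(data_keys)      # ['X']
print(loaded.shape)   # (2, 3)

# Label vectors come back as (n, 1) columns; ravel() flattens them to (n,).
labels = np.array([[0], [1], [1]])
print(labels.ravel().shape)  # (3,)
```

This is why the model cells call `.ravel()` on `Y_Train.mat` and `Y_Test.mat`: scikit-learn expects a one-dimensional label vector.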

#### _2\. Model 1: ***Support Vector Machine (SVM)***_

This section trains and evaluates a Support Vector Classifier (`SVC`) with a Radial Basis Function (`rbf`) kernel.

The code loads pre-split training and testing data, defines an `SVC` model, and then evaluates it using cross-validation and a final test set prediction.
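
To make the role of `gamma` concrete, the RBF kernel value between two samples is K(x, z) = exp(-gamma * ||x - z||^2). The sketch below compares a manual calculation with scikit-learn's `rbf_kernel`; the two 3-feature vectors are invented for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.003  # same value as the SVC defined in this section

# Two made-up samples with three features each (illustrative only).
x = np.array([[1.0, 2.0, 3.0]])
z = np.array([[2.0, 0.0, 3.0]])

# Manual RBF kernel: squared Euclidean distance is 1 + 4 + 0 = 5
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(f"manual={manual:.6f}  sklearn={library:.6f}")
```

A small `gamma` such as 0.003 makes the kernel decay slowly with distance, so each training sample influences a wide neighborhood.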


In [None]:
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

X_train = load_mat_data(r"X_Train.mat")
X_test = load_mat_data(r"X_Test.mat")
y_train = load_mat_data(r"Y_Train.mat").ravel()
y_test = load_mat_data(r"Y_Test.mat").ravel()

# Define model
svm = SVC(C=100, kernel="rbf", gamma=0.003, probability=True)

# Cross-validation on training data
print("\n=== Cross-validation on Training Data ===")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

y_pred_cv = cross_val_predict(svm, X_train, y_train, cv=cv)
train_accuracy_cv = accuracy_score(y_train, y_pred_cv)

print("Training Accuracy (CV): {:.4f}".format(train_accuracy_cv))
print("Classification Report (CV):")
print(classification_report(y_train, y_pred_cv, target_names=["HC", "BC"]))

# Final prediction on test data
print("\n=== Final Prediction on Test Data ===")
svm.fit(X_train, y_train)
y_pred_test = svm.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print("Training Accuracy: {:.4f}".format(svm.score(X_train, y_train)))
print("Test Accuracy: {:.4f}".format(test_accuracy))
print("Classification Report (Test):")
print(classification_report(y_test, y_pred_test, target_names=["HC", "BC"]))

Loaded 'X_Train.mat' using scipy.
  Final matrix shape: (116, 301)
Loaded 'X_Test.mat' using scipy.
  Final matrix shape: (29, 301)
Loaded 'Y_Train.mat' using scipy.
  Final matrix shape: (116, 1)
Loaded 'Y_Test.mat' using scipy.
  Final matrix shape: (29, 1)

=== Cross-validation on Training Data ===
[0.66666667 0.73913043 0.82608696 0.52173913 0.56521739]
Training Accuracy (CV): 0.6638
Classification Report (CV):
              precision    recall  f1-score   support

          HC       0.67      0.59      0.63        56
          BC       0.66      0.73      0.69        60

    accuracy                           0.66       116
   macro avg       0.67      0.66      0.66       116
weighted avg       0.66      0.66      0.66       116


=== Final Prediction on Test Data ===
Training Accuracy: 0.8534
Test Accuracy: 0.6207
Classification Report (Test):
              precision    recall  f1-score   support

          HC       0.70      0.74      0.72        19
          BC       0.44     

#### _3\. Model 2: ***K-Nearest Neighbors (KNN)***_

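Here, a K-Nearest Neighbors classifier is implemented with `k=20` neighbors (`n_neighbors=20`). It uses the `distance` weighting scheme, in which each neighbor's vote is weighted by the inverse of its Euclidean distance to the query point, so closer neighbors have a greater influence on the prediction.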

The process mirrors the SVM workflow. Data is loaded, a `KNeighborsClassifier` is defined, and performance is measured with cross-validation and on the test set.
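
The effect of `weights="distance"` can be seen on a toy problem. The sketch below uses three invented one-dimensional training points and `n_neighbors=3` (not the notebook's 20) so the inverse-distance vote can be reproduced by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (illustrative): two class-0 points far from the query, one class-1 point close to it.
X_toy = np.array([[0.0], [1.0], [4.0]])
y_toy = np.array([0, 0, 1])
x_query = np.array([[3.0]])

knn_dist = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="euclidean")
knn_dist.fit(X_toy, y_toy)
knn_unif = KNeighborsClassifier(n_neighbors=3, weights="uniform", metric="euclidean")
knn_unif.fit(X_toy, y_toy)

# Reproduce the weighted vote by hand: each neighbor contributes 1 / distance.
dist, idx = knn_dist.kneighbors(x_query)
w = 1.0 / dist[0]
votes = np.array([w[y_toy[idx[0]] == c].sum() for c in knn_dist.classes_])
manual_proba = votes / votes.sum()

print(knn_unif.predict(x_query))  # [0]: two of three neighbors are class 0
print(knn_dist.predict(x_query))  # [1]: the single nearby class-1 point outweighs them
print(manual_proba, knn_dist.predict_proba(x_query)[0])
```

With uniform weights the majority class wins; with distance weights the nearest point dominates the vote.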

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
# load_mat_data is defined in the utility cell in section 1

X_train = load_mat_data(r"X_Train.mat")
X_test = load_mat_data(r"X_Test.mat")
y_train = load_mat_data(r"Y_Train.mat").ravel()
y_test = load_mat_data(r"Y_Test.mat").ravel()

# Define model
knn = KNeighborsClassifier(n_neighbors=20, weights="distance", metric="euclidean")

# Cross-validation on training data
print("\n=== Cross-validation on Training Data ===")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

y_pred_cv = cross_val_predict(knn, X_train, y_train, cv=cv)
train_accuracy_cv = accuracy_score(y_train, y_pred_cv)

print("Training Accuracy (CV): {:.4f}".format(train_accuracy_cv))
print("Classification Report (CV):")
print(classification_report(y_train, y_pred_cv, target_names=["HC", "BC"]))

# Final prediction on test data
print("\n=== Final Prediction on Test Data ===")
knn.fit(X_train, y_train)
y_pred_test = knn.predict(X_test)

train_acc = knn.score(X_train, y_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("Training Accuracy: {:.4f}".format(train_acc))
print("Test Accuracy: {:.4f}".format(test_acc))
print("Classification Report (Test):")
print(classification_report(y_test, y_pred_test, target_names=["HC", "BC"]))

#### _4\. Model 3: ***Multi-layer Perceptron (MLP)***_

This section implements a neural network classifier. Features are first standardized with `StandardScaler` (zero mean and unit variance per feature), because gradient-based MLP training is sensitive to feature scale. The scaler is fit on the training set only and then applied to the test set.

The network architecture consists of two hidden layers with 100 and 50 neurons, respectively. It uses the `relu` activation function and the `sgd` solver with an adaptive learning rate.
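
One possible variation, shown here as a sketch rather than as the notebook's method, is to wrap the scaler and the MLP in a scikit-learn pipeline. Cross-validation then refits the scaler on each training fold instead of on the full training set. The data below is synthetic (`make_classification`) and stands in for the spectra.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the study spectra); 30 features keeps the demo fast.
X, y = make_classification(n_samples=116, n_features=30, random_state=0)

# Scaler and MLP combined into one estimator, with the section's hyperparameters.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation="relu",
        solver="sgd",
        alpha=0.001,
        learning_rate="adaptive",
        max_iter=10000,
        random_state=42,
    ),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# For each fold, the whole pipeline (scaling included) is fit on the training part only.
y_pred_cv = cross_val_predict(pipe, X, y, cv=cv)
cv_acc = accuracy_score(y, y_pred_cv)
print(f"CV accuracy with per-fold scaling: {cv_acc:.4f}")
```

Whether this changes the cross-validated scores noticeably depends on the data; on small datasets it removes one possible source of optimistic bias.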

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score
# load_mat_data is defined in the utility cell in section 1

X_train = load_mat_data(r"X_Train.mat")
X_test = load_mat_data(r"X_Test.mat")
y_train = load_mat_data(r"Y_Train.mat").ravel()
y_test = load_mat_data(r"Y_Test.mat").ravel()

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define model
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation="relu",
    solver="sgd",
    alpha=0.001,
    learning_rate="adaptive",
    max_iter=10000,
    random_state=42,
)

# Cross-validation on training data
print("\n=== Cross-validation on Training Data ===")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

y_pred_cv = cross_val_predict(mlp, X_train_scaled, y_train, cv=cv)
train_accuracy_cv = accuracy_score(y_train, y_pred_cv)

print("Training Accuracy (CV): {:.4f}".format(train_accuracy_cv))
print("Classification Report (CV):")
print(classification_report(y_train, y_pred_cv, target_names=["HC", "BC"]))

# Final prediction on test data
print("\n=== Final Prediction on Test Data ===")
mlp.fit(X_train_scaled, y_train)
y_pred_test = mlp.predict(X_test_scaled)

train_acc = mlp.score(X_train_scaled, y_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("Training Accuracy: {:.4f}".format(train_acc))
print("Test Accuracy: {:.4f}".format(test_acc))
print("Classification Report (Test):")
print(classification_report(y_test, y_pred_test, target_names=["HC", "BC"]))

#### _5\. Model 4: ***Random Forest (RF)***_

This final section uses a Random Forest classifier, an ensemble of 500 decision trees. No `random_state` is set for the forest, so its results can vary slightly between runs.

The process mirrors the SVM workflow. Data is loaded, a `RandomForestClassifier` is defined, and performance is measured with cross-validation and on the test set.
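
As a rough illustration of what "ensemble" means here, scikit-learn's forest averages the class probabilities predicted by its individual trees. The sketch below checks this on synthetic data; `random_state=42` is an assumption added for repeatability and is not set in the notebook's cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the study spectra).
X, y = make_classification(n_samples=116, n_features=30, random_state=0)

# random_state is an assumption for repeatability; the notebook cell leaves it unset.
rf_demo = RandomForestClassifier(n_estimators=500, random_state=42)
rf_demo.fit(X, y)

# The forest's probability estimate is the mean of the per-tree estimates.
tree_mean = np.mean([tree.predict_proba(X[:5]) for tree in rf_demo.estimators_], axis=0)
forest_proba = rf_demo.predict_proba(X[:5])

print(len(rf_demo.estimators_))
print(np.allclose(tree_mean, forest_proba))
```

The predicted class is the one with the highest averaged probability.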

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# load_mat_data is defined in the utility cell in section 1

X_train = load_mat_data(r"X_Train.mat")
X_test = load_mat_data(r"X_Test.mat")
y_train = load_mat_data(r"Y_Train.mat").ravel()
y_test = load_mat_data(r"Y_Test.mat").ravel()

# Define model
rf = RandomForestClassifier(
    n_estimators=500,
    n_jobs=-1,
)

# Cross-validation on training data
print("\n=== Cross-validation on Training Data ===")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

y_pred_cv = cross_val_predict(rf, X_train, y_train, cv=cv)
train_accuracy_cv = accuracy_score(y_train, y_pred_cv)

print("Training Accuracy (CV): {:.4f}".format(train_accuracy_cv))
print("Classification Report (CV):")
print(classification_report(y_train, y_pred_cv, target_names=["HC", "BC"]))

# Final prediction on test data
print("\n=== Final Prediction on Test Data ===")
rf.fit(X_train, y_train)
y_pred_test = rf.predict(X_test)

train_acc = rf.score(X_train, y_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("Training Accuracy: {:.4f}".format(train_acc))
print("Test Accuracy: {:.4f}".format(test_acc))
print("Classification Report (Test):")
print(classification_report(y_test, y_pred_test, target_names=["HC", "BC"]))