# **Support Vector Machine (SVM) Classifier - CMPT 459 Course Project**

This notebook demonstrates how to train an **SVM classifier** and evaluate it on a train/test split of the diabetic patient dataset. At the end, the evaluation metrics are printed and visualized.

We do the following:
- Data preprocessing consistent with the project pipeline
- PCA for dimensionality reduction and for visualizing the results
- 2D visualization of the separating hyperplane superimposed on the data
- Custom implementation of a **soft-margin SVM classifier without a kernel** (`svm_classifier.py`)
- Accuracy, precision, recall and F-score metrics to assess the correctness of the SVM classifier

This notebook is part of our group’s modular report and references:
- `svm_classifier.py` (original script version)

In [3]:
import argparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support
)

from svm_classifier import SVMClassifier

##  Loading and Preprocessing Data 

We load and preprocess the data the same way as the rest of the project, applying the following steps:

- Replace `'?'` values with `NaN`  
- Drop columns with >40% missing values 
- One-hot encode high-cardinality categorical columns  
- Label-encode low-cardinality categorical features  
- Normalize numerical features  
- Encode our target variable `readmitted` as integers:  
  - `NO → 0`, `>30 → 1`, `<30 → 2`  
- Remove sensitive/identifying data: `encounter_id`, `patient_nbr`

In [2]:
def load_and_preprocess_data(path: str):

    print("Loading data...")
    df = pd.read_csv(path)
    print(f"Original shape: {df.shape}")

    df = df.replace('?', np.nan)

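    # Drop columns with >40% missing values.
    # `thresh` is the minimum number of non-missing values a column needs to be kept,
    # so a 40% missing limit corresponds to at least 60% present values.
    threshold = int(np.ceil(0.6 * len(df)))
    df = df.dropna(thresh=threshold, axis=1)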

    # Fill categorical NAs
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].fillna("Unknown")

    # Encode target
    df["readmitted"] = df["readmitted"].map({'NO': 0, '>30': 1, '<30': 2})

    # Encode categorical
    cat_cols = df.select_dtypes(include='object').columns
    le = LabelEncoder()
    for col in cat_cols:
        if df[col].nunique() < 10:
            df[col] = le.fit_transform(df[col].astype(str))
        else:
            df = pd.get_dummies(df, columns=[col], drop_first=True)

    # Drop IDs
    for col in ["encounter_id", "patient_nbr"]:
        if col in df.columns:
            df = df.drop(columns=[col])

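    # Scale numeric features; exclude the target so the class labels stay 0/1/2
    num_cols = df.select_dtypes(include=['int64', 'float64']).columns
    num_cols = num_cols.drop("readmitted", errors="ignore")
    scaler = StandardScaler()
    df[num_cols] = scaler.fit_transform(df[num_cols])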

    X = df.drop(columns=["readmitted"]).values
    y = df["readmitted"].values
    print("Preprocessing complete! Final shape:", X.shape)
    return X, y
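
The meaning of `thresh` in `DataFrame.dropna` is easy to invert, so here is a small sketch on a toy frame. The frame and the names `toy`, `max_missing` and `min_present` are made up for illustration and are not part of the project pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): "a" is 25% missing, "b" is 75% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "readmitted": [0, 1, 2, 0],
})

max_missing = 0.4  # drop columns with more than 40% missing values

# `thresh` counts the non-missing values a column must have to be kept,
# so the missing-value limit is converted into a minimum number of present values.
min_present = int(np.ceil((1 - max_missing) * len(toy)))
kept = toy.dropna(thresh=min_present, axis=1)

print("Missing fraction per column:")
print(toy.isna().mean())
print("Columns kept:", list(kept.columns))
```

Only column `b` exceeds the 40% limit here, so it is the only one removed.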

## PCA Dimensionality Reduction 

We use PCA to reduce the dataset to **50 principal components**, preserving roughly 85-90% of the variance. Doing so speeds up the classification process.

In [None]:
n_components = 50
random_state = 42 
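# X is the feature matrix returned by load_and_preprocess_data
# random_state must be passed by keyword
pca = PCA(n_components=n_components, random_state=random_state)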
X_pca = pca.fit_transform(X)

print("PCA shape:", X_pca.shape)
print("PCA done running. Explained variance:", np.sum(pca.explained_variance_ratio_))
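
One way to sanity-check a component count is to look at the cumulative explained variance. The sketch below runs on synthetic data rather than the diabetic dataset, and `X_demo` and the 85% target are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for a preprocessed feature matrix:
# 200 samples, 20 features driven by 5 latent factors plus a little noise.
latent = rng.normal(size=(200, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA(random_state=42).fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 85%.
k = int(np.searchsorted(cumulative, 0.85) + 1)
print(f"{k} components retain {cumulative[k - 1]:.3f} of the variance")
```

Running the same check on the real feature matrix would show whether 50 components actually land in the stated 85-90% range.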

##  Training SVM Classifier

Before training the SVM classifier, we split the preprocessed data into train and test sets, holding out 20% of the samples for testing (`test_size = 0.2`).

The accuracy of the separating hyperplane depends heavily on the learning rate (alpha), the margin tradeoff (lambda) and the number of iterations the classifier is allowed to run.

In [None]:
X, y = load_and_preprocess_data("data/diabetic_data.csv")
test_size = 0.2 
random_state = 42
alpha = 0.001
lmda = 0.01
iterations = 100
# Create train/test split of data 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = test_size, random_state = random_state, stratify = y
)

# Train Soft SVM
svm = SVMClassifier(
    alpha = alpha,
    lmda = lmda,
    num_iterations = iterations
)
svm.fit(X_train, y_train)
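
To show how alpha, lambda and the iteration count interact, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. It is a hypothetical stand-in and not the code in `svm_classifier.py`. It assumes binary labels in {-1, +1}, the decision function `X @ w - b`, and the per-sample objective `lmda * ||w||^2 + max(0, 1 - y * (x @ w - b))`.

```python
import numpy as np


class LinearSoftSVMSketch:
    """Hypothetical illustration only; not the project's SVMClassifier."""

    def __init__(self, alpha=0.001, lmda=0.01, num_iterations=100):
        self.alpha = alpha                  # learning rate
        self.lmda = lmda                    # margin tradeoff (regularization strength)
        self.num_iterations = num_iterations
        self.w = None
        self.b = 0.0

    def fit(self, X, y):
        # Assumes y contains only -1 and +1.
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.num_iterations):
            for x_i, y_i in zip(X, y):
                if y_i * (x_i @ self.w - self.b) >= 1:
                    # Point is outside the margin: only the regularizer contributes.
                    self.w -= self.alpha * (2 * self.lmda * self.w)
                else:
                    # Point violates the margin: add the hinge-loss subgradient.
                    self.w -= self.alpha * (2 * self.lmda * self.w - y_i * x_i)
                    self.b -= self.alpha * y_i
        return self

    def predict(self, X):
        return np.where(X @ self.w - self.b >= 0, 1, -1)


# Two well-separated synthetic blobs as a smoke test.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y_demo = np.array([-1] * 50 + [1] * 50)

model = LinearSoftSVMSketch(alpha=0.01, lmda=0.01, num_iterations=50).fit(X_demo, y_demo)
train_acc = np.mean(model.predict(X_demo) == y_demo)
print(f"Training accuracy on the toy blobs: {train_acc:.2f}")
```

A larger `lmda` shrinks the weights and widens the margin at the cost of more margin violations, while `alpha` and `num_iterations` together control how far the weights can move from zero.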

## 2D Visualization of Separating Hyperplane

The helper `getHyperplane` computes the line to draw from the weights, the intercept and an offset: an offset of 0 gives the separating hyperplane, and a non-zero offset shifts the line to a margin.

For visualization, we reduce the data to two principal components and plot it in 2D. Then, we superimpose the generated hyperplane onto the graph, separating the class labels.

*Dotted line = hyperplane*  
*Solid line = margin*

In [None]:
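def getHyperplane(X: np.ndarray, weights, b, offset):
    """Helper function for visualization of hyperplanes."""
    # Solves weights[0] * x + weights[1] * y - b = offset for y.
    # This assumes the classifier's decision function is X @ weights - b.
    # offset = 0 draws the hyperplane; a non-zero offset draws a soft margin.
    hyperplane = (-weights[0] * X + b + offset) / weights[1]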
    return hyperplane

print("Reducing data via PCA for 2D Visualization")
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 8))
plt.title("Plot for linear SVM Classification", fontsize = 14, fontweight = "bold")
plt.tight_layout()
plt.show()
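
As a rough sketch of how such a plot could be assembled, the example below draws a hyperplane and its two margins over synthetic 2D points. The helper `line_for_offset`, the points and the weights `w_demo` and `b_demo` are all hypothetical. The weights are hand-picked rather than fitted, and the sketch assumes the decision function `x @ w - b` with margins at offsets of -1 and +1.

```python
import numpy as np
import matplotlib.pyplot as plt


def line_for_offset(x1, w, b, offset):
    """Solve w[0] * x1 + w[1] * x2 - b = offset for x2 (hypothetical helper)."""
    return (-w[0] * x1 + b + offset) / w[1]


# Synthetic 2D points standing in for data projected onto two principal components.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
labels = np.array([-1] * 40 + [1] * 40)

# Illustrative 2D weights, chosen by hand and not produced by training.
w_demo = np.array([0.5, 0.5])
b_demo = 0.0

# A straight line only needs its two end points.
x1 = np.array([pts[:, 0].min(), pts[:, 0].max()])

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(pts[:, 0], pts[:, 1], c=labels, cmap="coolwarm", edgecolor="k")
ax.plot(x1, line_for_offset(x1, w_demo, b_demo, 0), "k:", label="hyperplane")
ax.plot(x1, line_for_offset(x1, w_demo, b_demo, -1), "k-", label="margin")
ax.plot(x1, line_for_offset(x1, w_demo, b_demo, 1), "k-")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.legend()
fig.tight_layout()
```

The weights passed to the helper have to live in the same 2D space as the plotted points, so a classifier trained on the full feature matrix would need to be refit on the two-component projection before its hyperplane could be drawn this way.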


## Classification Evaluation Metrics

We use the following metrics to evaluate the predictions made by the SVM classifier. Precision, recall and F-score are computed per class and then averaged with weights proportional to each class's support (`average = 'weighted'`):
- **Accuracy**: the fraction of labels that are classified correctly
- **Precision**: the ratio of true positives to all predicted positives (true positives plus false positives)
- **Recall**: the ratio of true positives to all actual positives (true positives plus false negatives)
- **F-score**: the harmonic mean of precision and recall, in the range [0, 1]

In [None]:
label_pred = svm.predict(X_test)
accuracy = accuracy_score(y_test, label_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, label_pred, average = 'weighted', zero_division = 0
)

print("\nClassification Results:")
print("=" * 70)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall: {rec:.4f}")
print(f"F1-score: {f1:.4f}")
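
To make the weighted averaging concrete, the sketch below recomputes weighted precision and recall by hand on a tiny made-up label set and compares them with scikit-learn. The arrays `y_true` and `y_pred` are illustrative and are not results from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up three-class labels for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])

classes = np.unique(y_true)
support, precisions, recalls = [], [], []
for c in classes:
    tp = np.sum((y_pred == c) & (y_true == c))    # true positives for class c
    predicted = np.sum(y_pred == c)               # TP + FP
    actual = np.sum(y_true == c)                  # TP + FN
    precisions.append(tp / predicted if predicted else 0.0)
    recalls.append(tp / actual)
    support.append(actual)

# Weighted average: each class counts in proportion to its support.
weights = np.array(support) / len(y_true)
manual_precision = float(np.dot(weights, precisions))
manual_recall = float(np.dot(weights, recalls))

sk_prec, sk_rec, sk_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"manual precision {manual_precision:.4f} vs sklearn {sk_prec:.4f}")
print(f"manual recall    {manual_recall:.4f} vs sklearn {sk_rec:.4f}")
print(f"accuracy         {accuracy_score(y_true, y_pred):.4f}")
```

Support-weighted recall always equals accuracy, so a gap between the two in a report usually points to a different averaging mode.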

# **Interpretation & Discussion**
## **Strengths of Soft SVM Classification**

* Soft margin constraints with slack variables let the classifier tolerate overlapping classes and outliers instead of requiring perfectly separable data. Non-linear decision surfaces would additionally need a kernel, which this implementation does not use.
* Effective in high-dimensional spaces where dimensions may surpass the number of samples.
* Memory efficient as SVM utilizes a subset of training points (support vectors).

---

## **Problems of Soft SVM Classification**

* A single separating hyperplane distinguishes only two classes, so the SVM classifier handles binary targets directly; multi-class labels require a decomposition scheme such as one-vs-rest or one-vs-one.
* Worst-case running time (which depends heavily on kernel choice, data size and the misclassification penalty C) grows from **O(n^2)** to **O(n^3)** as C goes from small to large, so training time is at least quadratic.
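
One common workaround for the binary limitation is one-vs-rest, which trains one binary classifier per class. The sketch below uses scikit-learn's `OneVsRestClassifier` around `LinearSVC` on synthetic blobs. It swaps in library classes for illustration and is not how the project's `SVMClassifier` works.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Three synthetic, well-separated clusters standing in for a three-class target.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 5.0]])
X_demo = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

# One binary linear SVM is fitted per class (that class against all others).
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X_demo, y_demo)

n_binary_models = len(ovr.estimators_)
train_acc = ovr.score(X_demo, y_demo)
print(f"{n_binary_models} binary classifiers, training accuracy {train_acc:.2f}")
```

At prediction time, each sample receives the label of the binary classifier with the highest decision score.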