Train an SVM Classifier on the MNIST Dataset

In [1]:
import numpy as np
from sklearn import svm
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import time

# --- 1. Data Loading ---
print("Loading MNIST dataset... (This may take a moment and requires internet access)")
# Fetch the full MNIST dataset (70,000 images, 784 features)
# 'mnist_784' is the dataset name on OpenML
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

# Features (X) are the flattened images (784 pixels)
X = mnist.data.astype('float32')
# Labels (y) are the digits (0-9)
y = mnist.target.astype('int')

print(f"Dataset loaded: {X.shape[0]} samples, {X.shape[1]} features.")

# --- 2. Data Preprocessing and Splitting ---

# Normalize the data: scale pixel values from [0, 255] to [0, 1]
# This is crucial for most ML algorithms, especially SVMs.
X /= 255.0

# Optional: To save significant computation time, you can work with a smaller subset.
# Uncomment the following lines to use only the first 10,000 samples.
# N_SAMPLES = 10000
# print(f"Using a subset of {N_SAMPLES} samples for faster training.")
# X = X[:N_SAMPLES]
# y = y[:N_SAMPLES]

# Split the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# --- 3. Model Training ---

# Create an SVM Classifier with a Radial Basis Function (RBF) kernel
# RBF kernel is generally a good choice for classification problems like this.
# C: Regularization parameter. Higher C means lower tolerance for misclassification.
# gamma: Kernel coefficient. 'scale' is a good default: 1 / (n_features * X.var())
clf = svm.SVC(gamma='scale', kernel='rbf', C=10)

print("\nStarting SVM training... (This may take a long time on a large dataset)")
start_time = time.time()

# Train the classifier on the training data
clf.fit(X_train, y_train)

end_time = time.time()
print(f"Training finished in {end_time - start_time:.2f} seconds.")

# --- 4. Model Evaluation ---

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

print("\n--- Model Evaluation ---")
print(f"SVM Classifier Accuracy on Test Set: {accuracy * 100:.2f}%")

# Print a detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Loading MNIST dataset... (This may take a moment and requires internet access)
Dataset loaded: 70000 samples, 784 features.
Training set size: 56000 samples
Test set size: 14000 samples

Starting SVM training... (This may take a long time on a large dataset)
Training finished in 184.25 seconds.

--- Model Evaluation ---
SVM Classifier Accuracy on Test Set: 98.34%

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1381
           1       0.99      0.99      0.99      1575
           2       0.98      0.98      0.98      1398
           3       0.99      0.98      0.98      1428
           4       0.98      0.98      0.98      1365
           5       0.98      0.98      0.98      1263
           6       0.99      0.99      0.99      1375
           7       0.98      0.98      0.98      1459
           8       0.98      0.98      0.98      1365
           9       0.97      0.98      0.98      1391

    accuracy  
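
The `gamma='scale'` setting used in the cell above resolves to `1 / (n_features * X.var())` at fit time. As a minimal sketch of that formula, the value can be computed by hand and passed explicitly. The example uses scikit-learn's small bundled `load_digits` dataset as a stand-in for MNIST (an assumption, made so it runs offline and quickly), and the names `X_demo`, `gamma_manual`, `clf_scale`, and `clf_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, pixels in [0, 16]
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # scale to [0, 1], mirroring the X /= 255.0 step for MNIST

# What gamma='scale' resolves to: 1 / (n_features * X.var())
gamma_manual = 1.0 / (X_demo.shape[1] * X_demo.var())

# Fit one model with the string setting and one with the hand-computed value
clf_scale = SVC(kernel='rbf', C=10, gamma='scale').fit(X_demo, y_demo)
clf_manual = SVC(kernel='rbf', C=10, gamma=gamma_manual).fit(X_demo, y_demo)

same_predictions = np.array_equal(
    clf_scale.predict(X_demo), clf_manual.predict(X_demo)
)
print(f"gamma computed by hand: {gamma_manual:.4f}")
print(f"Identical predictions: {same_predictions}")
```

Because the value depends on the variance of the training data, changing the normalization (or the subset used for training) also changes the effective `gamma`.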

Use Grid/Random Search with Cross-Validation to find the best hyperparameter values for the SVM classifier.
For the Polynomial Kernel, optimize the degree, C, and coef0 hyperparameters.
For the RBF Kernel, focus on optimizing C and gamma.
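
The cell below uses Grid Search only. For the Random Search option, a minimal sketch with `RandomizedSearchCV` might look like the following. It is not the search that was run in this notebook: it uses the bundled `load_digits` dataset as an offline stand-in for MNIST, and the sampling ranges, `n_iter=8`, and the names `param_dist_rbf` and `random_search_rbf` are illustrative assumptions rather than tuned choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): small bundled digits dataset instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Sample C and gamma on a log scale instead of enumerating a fixed grid.
# These ranges are illustrative only, not tuned for MNIST.
param_dist_rbf = {
    'C': loguniform(1e-1, 1e2),
    'gamma': loguniform(1e-3, 1e0),
}

random_search_rbf = RandomizedSearchCV(
    SVC(kernel='rbf'),
    param_distributions=param_dist_rbf,
    n_iter=8,          # number of sampled candidates
    cv=3,              # 3-fold cross-validation per candidate
    scoring='accuracy',
    random_state=42,   # makes the sampled candidates reproducible
    n_jobs=1,
)
random_search_rbf.fit(X_tr, y_tr)

print(f"Best parameters: {random_search_rbf.best_params_}")
print(f"Best CV score: {random_search_rbf.best_score_:.4f}")
print(f"Held-out accuracy: {random_search_rbf.score(X_te, y_te):.4f}")
```

With a fixed budget of fits, sampling on a log scale covers several orders of magnitude of `C` and `gamma`, which a coarse grid can only do at a handful of points.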

In [2]:
import time
from sklearn import svm
from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV, train_test_split

# --- 1. Load and Prepare Data (Using a Subset for Efficient Search) ---

print("Loading MNIST dataset for hyperparameter tuning...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_full = mnist.data.astype('float32') / 255.0  # Normalize
y_full = mnist.target.astype('int')

# CRITICAL STEP: Use a small, manageable subset for GridSearchCV.
# Training a kernel SVC scales roughly between O(n^2) and O(n^3) in the number of
# samples, and Grid Search repeats that cost for every candidate and every CV fold.
N_SAMPLES = 6000  # Use 6,000 samples for the search (fast enough for testing)
X_subset = X_full[:N_SAMPLES]
y_subset = y_full[:N_SAMPLES]

# Split the subset for the hyperparameter search.
# Note: GridSearchCV runs K-fold cross-validation internally on X_train_search only;
# X_test_search is held out to check the best model after the search.
X_train_search, X_test_search, y_train_search, y_test_search = train_test_split(
    X_subset, y_subset, test_size=0.2, random_state=42, stratify=y_subset
)

print(f"Using a subset of {X_subset.shape[0]} samples for Grid Search.")
print(f"Grid Search Training Set Size: {X_train_search.shape[0]}")

# --- 2. Grid Search for RBF Kernel ---
print("\n" + "="*50)
print("Starting Grid Search for RBF Kernel (C, gamma)")
print("="*50)

# Define the parameter grid for the RBF kernel
# C: Regularization parameter. Values are typically powers of 10.
# gamma: Kernel coefficient. 'scale' is a robust default, but we test others.
param_grid_rbf = [
    {'C': [1, 10], 'gamma': [0.001, 0.01, 'scale']},
]

# Create the SVC model
svc_rbf = svm.SVC(kernel='rbf')

# Create the Grid Search object
# cv=3 means 3-fold cross-validation: the training set is split into 3 folds, and
# each candidate is trained 3 times, each time on 2 folds and scored on the remaining fold.
grid_search_rbf = GridSearchCV(
    svc_rbf, param_grid_rbf, cv=3, verbose=2, n_jobs=-1, scoring='accuracy'
)

start_time_rbf = time.time()
grid_search_rbf.fit(X_train_search, y_train_search)
end_time_rbf = time.time()

# Print results
print(f"\nRBF Grid Search finished in {end_time_rbf - start_time_rbf:.2f} seconds.")
print(f"Best RBF parameters found: {grid_search_rbf.best_params_}")
print(f"Best RBF Cross-Validation score: {grid_search_rbf.best_score_:.4f}")

# Optional: Evaluate on the hold-out test set
best_rbf_clf = grid_search_rbf.best_estimator_
test_accuracy_rbf = best_rbf_clf.score(X_test_search, y_test_search)
print(f"Best RBF Model Accuracy on Test Subset: {test_accuracy_rbf:.4f}")

# --- 3. Grid Search for Polynomial Kernel ---
print("\n" + "="*50)
print("Starting Grid Search for Polynomial Kernel (degree, C, coef0)")
print("="*50)

# Define the parameter grid for the Polynomial kernel
# degree: Degree of the polynomial kernel function (e.g., degree 3 for cubic)
# C: Regularization parameter.
# coef0: Independent term in the kernel function.
param_grid_poly = [
    {
        'degree': [2, 3],
        'C': [1, 10],
        'coef0': [0.0, 1.0]
    }
]

# Create the SVC model
svc_poly = svm.SVC(kernel='poly')

# Create the Grid Search object
grid_search_poly = GridSearchCV(
    svc_poly, param_grid_poly, cv=3, verbose=2, n_jobs=-1, scoring='accuracy'
)

start_time_poly = time.time()
grid_search_poly.fit(X_train_search, y_train_search)
end_time_poly = time.time()

# Print results
print(f"\nPolynomial Grid Search finished in {end_time_poly - start_time_poly:.2f} seconds.")
print(f"Best Polynomial parameters found: {grid_search_poly.best_params_}")
print(f"Best Polynomial Cross-Validation score: {grid_search_poly.best_score_:.4f}")

# Optional: Evaluate on the hold-out test set
best_poly_clf = grid_search_poly.best_estimator_
test_accuracy_poly = best_poly_clf.score(X_test_search, y_test_search)
print(f"Best Polynomial Model Accuracy on Test Subset: {test_accuracy_poly:.4f}")

Loading MNIST dataset for hyperparameter tuning...
Using a subset of 6000 samples for Grid Search.
Grid Search Training Set Size: 4800

Starting Grid Search for RBF Kernel (C, gamma)
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] END ....................................C=1, gamma=0.01; total time=  17.0s
[CV] END ...................................C=1, gamma=0.001; total time=  21.1s
[CV] END ...................................C=1, gamma=0.001; total time=  23.0s
[CV] END ...................................C=1, gamma=0.001; total time=  24.8s
[CV] END ....................................C=1, gamma=0.01; total time=  14.8s
[CV] END ....................................C=1, gamma=0.01; total time=  14.5s
[CV] END ...................................C=1, gamma=scale; total time=  12.2s
[CV] END ...................................C=1, gamma=scale; total time=  15.7s
[CV] END ...................................C=1, gamma=scale; total time=  10.0s
[CV] END ...................
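
The verbose log above reports only per-fit timings. The per-candidate scores are stored on the fitted search object in `cv_results_`. The following is a sketch of how they could be ranked, run on the bundled `load_digits` dataset as an offline stand-in (an assumption), with the same RBF grid as the cell above. The names `grid_demo` and `ranked_candidates` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data (assumption): small bundled digits dataset instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0

grid_demo = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': [1, 10], 'gamma': [0.001, 0.01, 'scale']},
    cv=3,
    scoring='accuracy',
)
grid_demo.fit(X_demo, y_demo)

results = grid_demo.cv_results_
# Order candidates by rank (1 = best); a stable sort keeps ties in grid order
order = np.argsort(results['rank_test_score'], kind='stable')
ranked_candidates = [
    (results['params'][i], results['mean_test_score'][i], results['std_test_score'][i])
    for i in order
]

for params, mean_score, std_score in ranked_candidates:
    print(f"{mean_score:.4f} (+/- {std_score:.4f})  {params}")
```

Reading the whole ranking, rather than only `best_params_`, shows whether the best candidate sits on the edge of the grid, which would suggest widening the search range.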

# Compare the performance of the SVM with different kernels (Linear, Polynomial, and RBF) and select the best one based on test set accuracy and other metrics like precision, recall, and F1-score.
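
One compact way to organize this comparison is to fit each kernel in a loop and collect the same metrics for all of them. The sketch below does that on the bundled `load_digits` dataset as an offline stand-in for MNIST (an assumption). The hyperparameters in `kernel_configs` are placeholders, not the values found by the searches above, and should be replaced with the best parameters from each search.

```python
import time

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): small bundled digits dataset instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Placeholder settings (assumption): substitute the best parameters from each search
kernel_configs = {
    'linear': SVC(kernel='linear', C=1.0),
    'poly': SVC(kernel='poly', degree=3, C=1.0, coef0=1.0),
    'rbf': SVC(kernel='rbf', C=10, gamma='scale'),
}

kernel_results = {}
for name, model in kernel_configs.items():
    start = time.time()
    model.fit(X_tr, y_tr)
    fit_seconds = time.time() - start

    y_hat = model.predict(X_te)
    # Macro averaging weights every digit class equally
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, y_hat, average='macro', zero_division=0
    )
    kernel_results[name] = {
        'accuracy': accuracy_score(y_te, y_hat),
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'fit_seconds': fit_seconds,
    }

for name, m in kernel_results.items():
    print(
        f"{name:<7} acc={m['accuracy']:.4f} prec={m['precision']:.4f} "
        f"rec={m['recall']:.4f} f1={m['f1']:.4f} fit={m['fit_seconds']:.2f}s"
    )
```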

Part A: Train and Evaluate Linear Kernel SVM

In [3]:
import numpy as np
from sklearn import svm
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import time

# --- 1. Data Loading and Preparation (Full Dataset) ---
print("Loading MNIST dataset for Linear SVM...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data.astype('float32') / 255.0  # Normalize
y = mnist.target.astype('int')

# Split the full data (same split as Task 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# --- 2. Train Linear Kernel SVM ---
# Use the default C=1.0 for the Linear Kernel SVC
print("\nStarting SVM training with LINEAR Kernel...")
clf_linear = svm.SVC(kernel='linear', C=1.0)
start_time = time.time()

# Train the classifier
clf_linear.fit(X_train, y_train)

end_time = time.time()
print(f"Training finished in {end_time - start_time:.2f} seconds.")

# --- 3. Model Evaluation ---
y_pred_linear = clf_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

print("\n--- Linear Kernel Model Evaluation ---")
print(f"Linear SVM Classifier Accuracy on Test Set: {accuracy_linear * 100:.2f}%")

# Print detailed classification report
report_linear = classification_report(y_test, y_pred_linear, output_dict=True)
print("\nClassification Report (Linear Kernel):")
print(classification_report(y_test, y_pred_linear))

Loading MNIST dataset for Linear SVM...


Exception ignored in: <function ResourceTracker.__del__ at 0x7dffcc382660>
Traceback (most recent call last):
  File "/home/clauds/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/home/clauds/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/home/clauds/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes


Training set size: 56000 samples
Test set size: 14000 samples

Starting SVM training with LINEAR Kernel...
Training finished in 244.00 seconds.

--- Linear Kernel Model Evaluation ---
Linear SVM Classifier Accuracy on Test Set: 93.59%

Classification Report (Linear Kernel):
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1381
           1       0.96      0.98      0.97      1575
           2       0.92      0.93      0.93      1398
           3       0.90      0.92      0.91      1428
           4       0.94      0.94      0.94      1365
           5       0.92      0.90      0.91      1263
           6       0.96      0.96      0.96      1375
           7       0.95      0.95      0.95      1459
           8       0.93      0.89      0.91      1365
           9       0.94      0.90      0.92      1391

    accuracy                           0.94     14000
   macro avg       0.94      0.93      0.93     14000
weighted avg       0.

## Compare the SVM classifier’s performance with your classifiers from Assignment 4, i.e., KNN, SGD, and Random Forest. Pay attention to accuracy, precision, recall, and other evaluation metrics. Also, include training time (computational complexity) as an evaluation metric.
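
A comparison of this kind can be sketched with a small helper that records training time, prediction time, accuracy, and macro F1 for each classifier. The example runs on the bundled `load_digits` dataset as an offline stand-in for MNIST (an assumption). The Assignment 4 settings are not shown in this notebook, so the models below use scikit-learn defaults, and the helper name `benchmark` is hypothetical.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data (assumption): small bundled digits dataset instead of MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def benchmark(model):
    """Fit and evaluate one model, timing fit and predict separately."""
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_hat = model.predict(X_te)
    predict_seconds = time.perf_counter() - t0

    return {
        'accuracy': accuracy_score(y_te, y_hat),
        'macro_f1': f1_score(y_te, y_hat, average='macro'),
        'fit_seconds': fit_seconds,
        'predict_seconds': predict_seconds,
    }


# Default hyperparameters (assumption): replace with the Assignment 4 settings
models = {
    'SVM (RBF)': SVC(kernel='rbf', C=10, gamma='scale'),
    'KNN': KNeighborsClassifier(),
    'SGD': SGDClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}

benchmark_results = {name: benchmark(model) for name, model in models.items()}

for name, row in benchmark_results.items():
    print(
        f"{name:<14} acc={row['accuracy']:.4f} f1={row['macro_f1']:.4f} "
        f"fit={row['fit_seconds']:.3f}s predict={row['predict_seconds']:.3f}s"
    )
```

Timing `fit` and `predict` separately matters for KNN in particular, since it does almost no work during training and defers the cost to prediction.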

Final Conclusion and Trade-Off
The comparison highlights the classic accuracy vs. speed trade-off in machine learning:

The SVM with the RBF kernel is the best classifier in terms of accuracy (≈98.34% on the test set), but it has the highest training cost in this comparison (about 184 seconds on the 56,000-sample training set), making it the slowest choice to train.

The SGD Classifier is the best classifier in terms of speed and scalability, offering near-instantaneous training at the expense of a significant drop in accuracy (down to ≈91.39%).

| Choice | Best For... |
| --- | --- |
| SVM (RBF) | Maximum Accuracy: When high performance is paramount and training time/resources are not a major constraint. |
| KNN | Fast Training: When a high-accuracy, simple model is needed, and you can tolerate a slow prediction time. |
| SGD | Speed and Scalability: When data size is enormous or training must be done quickly (e.g., streaming data), even if accuracy is sacrificed. |