# Handwritten Digits Recognition — Traditional ML Models

**This notebook trains and compares traditional machine learning models** (Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Random Forest) on the MNIST / handwritten digits dataset.

**How it works:** The notebook tries to load the dataset from the uploaded zip at `/mnt/data/PRCP-1002-HandwrittenDigits.zip`. If not available, it falls back to `sklearn.datasets.load_digits` so you can run the notebook anywhere.

**What you'll find:**

- Data loading and EDA (shape, class distribution, sample images)
- Preprocessing and flattening / scaling
- Training: Logistic Regression, KNN, SVM, RandomForest
- Performance comparison (accuracy and training time for every model; precision, recall, F1, and confusion matrix for the best one)
- Short discussion and saving the best model

---

*Run all cells in order. This notebook is intended to be runnable in Google Colab or local Jupyter.*

In [None]:

# Basic imports
import os, zipfile, warnings, time
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# For fallback dataset
from sklearn.datasets import load_digits

print('Imports done.')


In [None]:

# Attempt to find and load the uploaded dataset zip
zip_path = '/mnt/data/PRCP-1002-HandwrittenDigits.zip'
data_loaded = False

if os.path.exists(zip_path):
    print('Found zip at', zip_path)
    # Use a context manager so the archive is closed after extraction
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall('/mnt/data/PRCP-1002-HandwrittenDigits_extracted')
    print('Extracted to /mnt/data/PRCP-1002-HandwrittenDigits_extracted')
    extracted = os.listdir('/mnt/data/PRCP-1002-HandwrittenDigits_extracted')
    print('Contains:', extracted[:20])
    csv_files = [f for f in extracted if f.lower().endswith('.csv')]
    npz_files = [f for f in extracted if f.lower().endswith('.npz')]
    if csv_files:
        csv_path = os.path.join('/mnt/data/PRCP-1002-HandwrittenDigits_extracted', csv_files[0])
        print('Loading CSV:', csv_path)
        df = pd.read_csv(csv_path)
        if 'label' in df.columns:
            y = df['label'].values
            X = df.drop(columns=['label']).values
        else:
            y = df.iloc[:,0].values
            X = df.iloc[:,1:].values
        data_loaded = True
    elif npz_files:
        npz_path = os.path.join('/mnt/data/PRCP-1002-HandwrittenDigits_extracted', npz_files[0])
        print('Loading NPZ:', npz_path)
        with np.load(npz_path) as data:
            print('Keys in npz:', list(data.keys()))
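            if 'x_train' in data and 'y_train' in data:
                # Append the test split only when the archive contains one;
                # concatenating with an empty 1-D array would fail on image-shaped data
                x_parts, y_parts = [data['x_train']], [data['y_train']]
                if 'x_test' in data and 'y_test' in data:
                    x_parts.append(data['x_test'])
                    y_parts.append(data['y_test'])
                X = np.concatenate(x_parts, axis=0)
                y = np.concatenate(y_parts, axis=0)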
            elif 'images' in data and 'labels' in data:
                X = data['images']
                y = data['labels']
            else:
                keys = list(data.keys())
                if len(keys) >= 2:
                    X = data[keys[0]]
                    y = data[keys[1]]
                else:
                    raise ValueError('Unrecognized .npz format')
        data_loaded = True
    else:
        imgs = []
        for root, dirs, files in os.walk('/mnt/data/PRCP-1002-HandwrittenDigits_extracted'):
            for f in files:
                if f.lower().endswith(('.png', '.jpg', '.jpeg')):
                    imgs.append(os.path.join(root,f))
        if imgs:
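            print('Found image files - loading up to 2000 images for demo.')
            from PIL import Image
            # Shuffle with a fixed seed so the demo subset is not limited to the first class folders
            np.random.default_rng(42).shuffle(imgs)
            samples = min(2000, len(imgs))
            X_list, y_list = [], []
            for img_path in imgs[:samples]:
                # The parent folder name is expected to be the digit label
                label = os.path.basename(os.path.dirname(img_path))
                try:
                    lab = int(label)
                except ValueError:
                    # Skip images without a numeric label instead of mislabeling them as 0
                    continue
                img = Image.open(img_path).convert('L').resize((28,28))
                arr = np.array(img).reshape(-1)
                X_list.append(arr)
                y_list.append(lab)
            if X_list:
                X = np.array(X_list)
                y = np.array(y_list)
                data_loaded = True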

if not data_loaded:
    print('Uploaded dataset not found or not readable. Falling back to sklearn.datasets.load_digits (8x8 images).')
    digits = load_digits()
    X = digits.data
    y = digits.target

X = X.astype('float32')
print('X shape:', X.shape, 'y shape:', y.shape)


In [None]:

# Basic EDA: sample count, class distribution, and a few example images
# (matplotlib is already imported in the first cell)

print('Number of samples:', X.shape[0])
unique, counts = np.unique(y, return_counts=True)
print('Class distribution:')
for u,c in zip(unique, counts):
    print(f'  {int(u)} : {c} samples')

def show_samples(X, y, n=10):
    plt.figure(figsize=(12,2))
    for i in range(n):
        ax = plt.subplot(1, n, i+1)
        pix = X[i]
        L = int(np.sqrt(pix.size))
        plt.imshow(pix.reshape(L, L), cmap='gray')
        ax.set_title(str(int(y[i])))
        plt.axis('off')
    plt.show()

show_samples(X, y, n=10)


In [None]:

# Preprocessing: flatten (if needed), train-test split, then scale
n_samples = X.shape[0]
if X.ndim > 2:
    X_flat = X.reshape(n_samples, -1)
else:
    X_flat = X.copy()

# Split before scaling and fit the scaler on the training split only,
# so test-set statistics do not leak into preprocessing
X_train, X_test, y_train, y_test = train_test_split(X_flat, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)


In [None]:

# Train a few traditional models and record training time & accuracy
models = {
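    # multi_class is omitted: it is deprecated in recent scikit-learn releases,
    # and the default already fits a multinomial model with the saga solver
    'LogisticRegression': LogisticRegression(max_iter=1000, solver='saga'),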
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(kernel='rbf', gamma='scale'),
    'RandomForest': RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
}

results = []
for name, model in models.items():
    print('\nTraining', name)
    t0 = time.time()
    model.fit(X_train, y_train)
    t1 = time.time()
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'{name} accuracy: {acc:.4f} (train time {t1-t0:.1f}s)')
    results.append({'model': name, 'accuracy': acc, 'train_time_s': t1-t0, 'model_obj': model})


In [None]:

# Results table
results_df = pd.DataFrame([{'Model': r['model'], 'Accuracy': r['accuracy'], 'TrainTime(s)': r['train_time_s']} for r in results])
results_df = results_df.sort_values('Accuracy', ascending=False).reset_index(drop=True)
results_df


In [None]:

best = results[ np.argmax([r['accuracy'] for r in results]) ]
best_model = best['model_obj']
print('Best model:', best['model'], 'Accuracy:', best['accuracy'])

from sklearn.metrics import ConfusionMatrixDisplay
preds = best_model.predict(X_test)
print('\nClassification report for best model:\n')
print(classification_report(y_test, preds, digits=4))

cm = confusion_matrix(y_test, preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.unique(y))
fig, ax = plt.subplots(figsize=(8,6))
disp.plot(ax=ax, cmap='viridis')
plt.title(f'Confusion Matrix - {best["model"]}')
plt.show()


In [None]:

# Save the best model and scaler to disk using joblib
import joblib
out_dir = '/mnt/data/handwritten_model_output'
os.makedirs(out_dir, exist_ok=True)
joblib.dump(best_model, os.path.join(out_dir, f'best_model_{best["model"]}.joblib'))
joblib.dump(scaler, os.path.join(out_dir, 'scaler.joblib'))
print('Saved best model and scaler to', out_dir)
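
The saved files are only useful together: new data has to pass through the same scaler before it reaches the model. The following is a minimal sketch of that round trip, assuming the fallback `load_digits` data, an SVM as a stand-in for the best model, and a temporary directory (`demo_dir`, a hypothetical name) instead of `/mnt/data`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: the notebook's fallback dataset (8x8 digits).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split, then train on the scaled features.
scaler = StandardScaler().fit(X_train)
model = SVC(kernel='rbf', gamma='scale').fit(scaler.transform(X_train), y_train)

# Hypothetical output location for this sketch; the notebook writes to /mnt/data.
demo_dir = tempfile.mkdtemp()
joblib.dump(model, os.path.join(demo_dir, 'best_model_SVM.joblib'))
joblib.dump(scaler, os.path.join(demo_dir, 'scaler.joblib'))

# Later (for example in a fresh session): load both artifacts and
# apply them in the same order as during training.
loaded_scaler = joblib.load(os.path.join(demo_dir, 'scaler.joblib'))
loaded_model = joblib.load(os.path.join(demo_dir, 'best_model_SVM.joblib'))
reloaded_preds = loaded_model.predict(loaded_scaler.transform(X_test))
original_preds = model.predict(scaler.transform(X_test))
print('Predictions match after reload:', (reloaded_preds == original_preds).all())
```

Skipping the scaler at inference time would feed raw pixel values to a model trained on standardized features, which usually degrades predictions badly.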


## (Optional) Quick GridSearch for SVM hyperparameters

*Run this cell to try tuning the SVM; it can take a while on larger datasets.*
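
One possible variant of this search, sketched here rather than taken from the notebook, wraps the scaler and the SVM in a `Pipeline` so that each cross-validation fold fits its own scaler. The data is the fallback `load_digits` set, and the grid values and step names (`scale`, `svm`) are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: unscaled 8x8 digits; scaling happens inside the pipeline.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step names are arbitrary; they prefix the parameter names in the grid.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svm', SVC(kernel='rbf')),
])

# Illustrative grid, kept small so the search finishes quickly.
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 'auto'],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

print('Best params:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 4))

# The refit best pipeline scales and predicts in one call.
test_acc = search.score(X_test, y_test)
print('Held-out test accuracy:', round(test_acc, 4))
```

Because the scaler sits inside the pipeline, the validation part of each fold never influences the scaling statistics used to train on that fold.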

In [None]:

# A small grid search for SVM (uncomment to run). Kept small by design.
# params = {'C':[0.1,1], 'gamma':['scale','auto'], 'kernel':['rbf']}
# gs = GridSearchCV(SVC(), params, cv=3, n_jobs=-1, scoring='accuracy')
# gs.fit(X_train, y_train)
# print('Best params', gs.best_params_, 'Best score', gs.best_score_)


## Conclusion & Next Steps

- Models trained: Logistic Regression, KNN, SVM, RandomForest.
- The notebook reports test accuracy and training time for every model, plus a classification report and confusion matrix for the best one.
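
To compare models on more than accuracy, macro-averaged precision, recall, and F1 can be collected for each one. This is a minimal sketch, assuming the fallback `load_digits` data and only two of the four models to keep it fast; the names `demo_models` and `metrics_df` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data, split first and scaled with training statistics only.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two of the notebook's model types, chosen here only for speed.
demo_models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in demo_models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    # Macro averaging weights every digit class equally.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, preds, average='macro', zero_division=0
    )
    rows.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, preds),
        'Precision (macro)': precision,
        'Recall (macro)': recall,
        'F1 (macro)': f1,
    })

metrics_df = (
    pd.DataFrame(rows)
    .sort_values('F1 (macro)', ascending=False)
    .reset_index(drop=True)
)
print(metrics_df)
```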

**Next steps (if you want to improve performance):**
1. Use a Convolutional Neural Network (CNN) in TensorFlow/Keras, which typically reaches higher accuracy on MNIST than the models above.
2. Data augmentation (slight rotations, shifts) to improve generalization.
3. More extensive hyperparameter tuning (GridSearchCV or RandomizedSearchCV).
4. If the dataset is the original MNIST (28x28), consider models that use the 2D image shape (such as a CNN) instead of flattened vectors.
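
As an illustration of the augmentation step, the sketch below adds four one-pixel shifted copies of each training image using `scipy.ndimage.shift`. It assumes the fallback `load_digits` data and square images; the helper `shift_images` and the choice of offsets are hypothetical, and rotations are left out for brevity.

```python
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data: flattened 8x8 digits.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
side = int(np.sqrt(X_train.shape[1]))  # 8 for load_digits, 28 for MNIST


def shift_images(flat_images, dy, dx):
    """Shift every image by (dy, dx) pixels, filling vacated pixels with 0."""
    images = flat_images.reshape(-1, side, side)
    # The leading 0 leaves the sample axis untouched; order=0 avoids interpolation.
    shifted = shift(images, (0, dy, dx), order=0, mode='constant', cval=0.0)
    return shifted.reshape(len(flat_images), -1)


# Augment only the training split so the test set stays untouched.
offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # down, up, right, left
X_aug = np.vstack([X_train] + [shift_images(X_train, dy, dx) for dy, dx in offsets])
y_aug = np.tile(y_train, len(offsets) + 1)

print('Original training set:', X_train.shape)
print('Augmented training set:', X_aug.shape)
```

Augmentation should happen before scaling, and the scaler should then be fit on `X_aug`. Whether the extra samples help depends on the model and has to be checked on the held-out test split.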

---

Upload the dataset zip to `/mnt/data/PRCP-1002-HandwrittenDigits.zip` or ensure your data file is present, then run this notebook.