# MNIST SI4 CNN

## Imports

In [11]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Flatten, Dense
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

## Load and format MNIST dataset

In [None]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255
x_test  = x_test.astype('float32')  / 255
x_train = x_train.reshape((60000, 28, 28, 1)) # 'channels_last' format
x_test  = x_test.reshape((10000, 28, 28, 1)) # 'channels_last' format
y_train = to_categorical(y_train, 10)
y_test  = to_categorical(y_test,  10)
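
The one-hot encoding that `to_categorical` produces can be reproduced in plain NumPy to see what the labels look like after this step. The sketch below uses a hypothetical `one_hot` helper and a few made-up labels; it illustrates the encoding and is not Keras's implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a float32 one-hot matrix of shape (len(labels), num_classes)."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # Set a single 1.0 per row, in the column given by the label.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample_labels = [5, 0, 4]            # made-up labels for illustration
encoded = one_hot(sample_labels, 10)
print(encoded.shape)                 # (3, 10)
print(encoded.argmax(axis=1))        # [5 0 4]
```

Taking `argmax(axis=1)` recovers the integer labels, which is how the evaluation cell later turns one-hot rows back into class indices.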

## Save validation data to CSV

In [13]:
np.savetxt('x_test.csv', x_test.reshape((x_test.shape[0], -1))[0:250], delimiter=',', fmt='%s') 
np.savetxt('y_test.csv', y_test[0:250], delimiter=',', fmt='%s')
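
Each CSV row written above holds one flattened image (784 values), so a consumer has to restore the image shape after loading. A minimal round-trip sketch, assuming random stand-in images and a hypothetical temporary file name rather than the real `x_test.csv`:

```python
import os
import tempfile
import numpy as np

# Stand-in for the first rows of x_test: 4 random "images" with values in [0, 1).
rng = np.random.default_rng(0)
images = rng.random((4, 28, 28, 1)).astype("float32")

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "x_demo.csv")
np.savetxt(path, images.reshape((images.shape[0], -1)), delimiter=",", fmt="%s")

# Load the flattened rows and restore the 'channels_last' image shape.
restored = np.loadtxt(path, delimiter=",", dtype="float32").reshape((-1, 28, 28, 1))
print(restored.shape)  # (4, 28, 28, 1)
```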

## Build model

## Model Architecture Improvements Explained

### What Changed and Why It Matters

#### 1. **Increased Convolutional Filters (2 → 32, then 64)**
- **Original**: 2 filters
- **Improved**: 32 filters in first layer, 64 in second layer

**Why this helps:**
- Each filter learns to detect different features (edges, curves, corners, patterns)
- With only 2 filters, the model could only learn 2 different features total
- 32 filters allows learning 32 diverse patterns simultaneously (e.g., horizontal lines, diagonal lines, corners, circles)
- More filters = richer representation of what makes each digit unique
- *Analogy*: Like having more "detectives" looking for different clues to identify digits
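
To make the cost of extra filters concrete, the number of trainable parameters in a Conv2D layer can be computed by hand. The sketch below uses a hypothetical `conv2d_params` helper and assumes 3×3 kernels throughout, including for the original 2-filter layer, whose kernel size is not stated here.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Trainable parameters of a Conv2D layer: one kernel plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

print(conv2d_params(3, 3, 1, 2))    # 20     (2 filters on a grayscale input, assumed 3x3)
print(conv2d_params(3, 3, 1, 32))   # 320    (first Conv2D layer)
print(conv2d_params(3, 3, 32, 64))  # 18496  (second Conv2D layer, 32 input channels)
```

Even with 64 filters the convolutional layers stay small, because each kernel is shared across every position in the image.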

#### 2. **Multiple Convolutional Layers (1 → 2)**
- **Original**: Single Conv2D layer
- **Improved**: Two stacked Conv2D layers

**Why this helps:**
- First layer learns low-level features (simple edges, lines)
- Second layer learns high-level features by combining outputs from first layer (shapes, loops, structures)
- This hierarchical learning is more efficient and biologically inspired (like human visual cortex)
- Stacking layers creates a feature hierarchy: each layer detects increasingly complex patterns
- *Analogy*: First layer = learning individual brush strokes; Second layer = recognizing full shapes

#### 3. **MaxPooling Layers (0 → 2)**
- **Original**: No pooling despite importing it
- **Improved**: MaxPool2D after each Conv layer

**Why this helps:**
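- Reduces spatial dimensions: with the unpadded 3×3 convolutions used here, feature maps go 28×28 → 26×26 → 13×13 → 11×11 → 5×5
- Keeps only the "most important" values in each region (takes the maximum)
- **Robustness**: Model becomes more tolerant of small shifts of the digits (pooling does not provide rotation invariance)
- **Efficiency**: Pooling has no trainable parameters itself, but smaller feature maps mean the first Dense layer needs far fewer weights, which reduces overfitting risk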
- **Speed**: Smaller feature maps = faster computation
- *Example*: If a "7" is shifted 1-2 pixels, MaxPool helps the model still recognize it
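
The shift tolerance described above can be seen on a tiny array. This is a NumPy sketch with a hypothetical `max_pool_2x2` helper and made-up values, not the Keras layer itself.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    # Split into (h/2, 2, w/2, 2) blocks and take the maximum of each 2x2 window.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
])
shifted = np.array([
    [9, 0, 0, 0],   # the 9 moved one pixel up and left, still in the same window
    [0, 0, 0, 0],
    [0, 0, 0, 5],   # the 5 moved one pixel up, still in the same window
    [0, 0, 0, 0],
])
print(max_pool_2x2(feature_map))
print(max_pool_2x2(shifted))
```

Both calls print the same 2×2 result because each activation stays inside its pooling window. A shift that crosses a window boundary does change the output, so the tolerance is only partial.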

#### 4. **Larger Dense Layers (10 → 128 → 64 → 10)**
- **Original**: Single dense layer with 10 units (one per digit class)
- **Improved**: Two hidden dense layers (128 and 64 units) then output layer

**Why this helps:**
- Learns non-linear decision boundaries in feature space
- First dense layer (128 units) = complex classifier combining all features
- Second dense layer (64 units) = further refining the classification
- More parameters in hidden layers allow learning more complex patterns
- *Analogy*: Like having multiple "expert committees" voting on what digit it is, rather than one quick decision
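
Most of the model's parameters sit in these dense layers, which a quick count makes visible. The sketch below uses a hypothetical `dense_params` helper and assumes the flattened input has 5 × 5 × 64 values, which is what two unpadded 3×3 convolutions and two 2×2 poolings give for a 28×28 input.

```python
def dense_params(inputs, units):
    """Trainable parameters of a Dense layer: one weight per input plus one bias, per unit."""
    return (inputs + 1) * units

flattened = 5 * 5 * 64  # assumed size of the Flatten output (1600 values)

layers = [(flattened, 128), (128, 64), (64, 10)]
for inputs, units in layers:
    print(f"Dense({units}): {dense_params(inputs, units)} parameters")

total = sum(dense_params(i, u) for i, u in layers)
print(f"Total dense parameters: {total}")
```

The first dense layer alone accounts for 204,928 of the 213,834 dense parameters, so its input size matters far more than the width of the later layers.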

#### 5. **More Training Epochs (3 → 15)**
- **Original**: 3 epochs (only 3 passes through entire dataset)
- **Improved**: 15 epochs

**Why this helps:**
- More opportunities to update weights and reduce loss
- Model converges better with a more powerful architecture
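- The gap between training and validation accuracy becomes visible: validation accuracy that plateaus while training accuracy keeps rising indicates overfitting
- *Trade-off*: More epochs = longer training, and extra epochs help only until validation loss stops improving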
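
One common way to handle this trade-off is patience-based early stopping, which Keras offers through its `EarlyStopping` callback. The plain-Python sketch below shows only the underlying logic; the `stop_with_patience` helper and the validation losses are made up for illustration and do not come from this notebook's run.

```python
def stop_with_patience(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses, one per epoch.
losses = [0.060, 0.045, 0.040, 0.042, 0.041, 0.043, 0.039]
print(stop_with_patience(losses, patience=3))  # (6, 3)
print(stop_with_patience(losses, patience=5))  # (7, 7)
```

With a patience of 3, training stops at epoch 6 and misses the later improvement at epoch 7, which shows why the patience value is itself a judgment call.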

#### 6. **Batch Size Optimization (32 → 128)**
- **Original**: Default batch size (32)
- **Improved**: Batch size of 128

**Why this helps:**
- Larger batches = more stable gradient estimates
- Faster training with better GPU utilization
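- Averaging gradients over more samples reduces noise per update; whether this improves generalization depends on the learning rate and dataset, so it is worth checking against validation accuracy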
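
Batch size also determines how many weight updates happen per epoch. The sketch below (with a hypothetical `steps_per_epoch` helper) works this out for the 60,000 training images.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches, and therefore gradient updates, in one pass over the data."""
    return math.ceil(num_samples / batch_size)

print(steps_per_epoch(60000, 32))   # 1875
print(steps_per_epoch(60000, 128))  # 469
```

The value 469 matches the `469/469` step counter in the training log. A larger batch therefore means about four times fewer updates per epoch, which is one reason the epoch count was raised at the same time.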

### Summary: How These Work Together

```
Input (28×28×1) 
    ↓
Conv2D (32 filters) → learns basic features (edges, corners)
    ↓
MaxPool (2×2) → reduces noise, keeps important features
    ↓
Conv2D (64 filters) → learns complex patterns (digit shapes, structures)
    ↓
MaxPool (2×2) → further noise reduction
    ↓
Flatten → converts 2D features to 1D vector
    ↓
Dense (128 units) → complex decision boundary
    ↓
Dense (64 units) → refined classification
    ↓
Dense (10 units, softmax) → probability for each digit (0-9)
```
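
The spatial size at each stage of this pipeline can be traced with a few lines of arithmetic. This is a sketch assuming stride-1 convolutions without padding and non-overlapping 2×2 pooling, which are the Keras defaults used in the model cell; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel):
    """Output size of a stride-1 convolution without padding ('valid')."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of non-overlapping pooling without padding; leftover rows are dropped."""
    return size // pool

size = 28
trace = [size]
for op, k in [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]:
    size = conv_out(size, k) if op == "conv" else pool_out(size, k)
    trace.append(size)

print(trace)             # [28, 26, 13, 11, 5]
print(size * size * 64)  # 1600 values after Flatten
```

The same numbers should appear in the output-shape column of `model.summary()`.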

**Expected Impact:**
- **Original model**: ~95-97% accuracy (limited feature learning)
- **Improved model**: ~98-99% accuracy (hierarchical feature learning with more capacity; the run below reaches 0.9925 on the test set)

The key principle: **Deep learning excels when you give it multiple layers to progressively extract higher-level features from raw input.**


In [23]:
model = Sequential()
model.add(Input(shape=(28, 28, 1)))
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=10, activation='softmax'))
model.summary()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])

## Train model

In [24]:
model.fit(x_train, y_train, epochs=15, validation_data=(x_test, y_test), batch_size=128)

Epoch 1/15
469/469 ━━━━━━━━━━━━━━━━━━━━ 13s 27ms/step - categorical_accuracy: 0.8461 - loss: 0.5352 - val_categorical_accuracy: 0.9813 - val_loss: 0.0553
Epoch 2/15
469/469 ━━━━━━━━━━━━━━━━━━━━ 12s 27ms/step - categorical_accuracy: 0.9803 - loss: 0.0630 - val_categorical_accuracy: 0.9846 - val_loss: 0.0463
Epoch 3/15
469/469 ━━━━━━━━━━━━━━━━━━━━ 12s 26ms/step - categorical_accuracy: 0.9865 - loss: 0.0419 - val_categorical_accuracy: 0.9859 - val_loss: 0.0437
Epoch 4/15
469/469 ━━━━━━━━━━━━━━━━━━━━ 12s 26ms/step - categorical_accuracy: 0.9912 - loss: 0.0291 - val_categorical_accuracy: 0.9861 - val_loss: 0.0422
Epoch 5/15
469/469 ━━━━━━━━━━━━━━━━━━━━ 12s 27ms/step - categorical_accuracy: 0.9919 - loss: 0.0255 - val_categorical_accuracy: 0.9915 - val_loss: 0.0252
Epoch 6/15
469/469 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1699ae310>

## Evaluate model

In [25]:
# Evaluate on the test set (the same data was used as validation data during training)
loss, val_accuracy = model.evaluate(x_test, y_test, verbose=2)
print(f"Validation Accuracy: {val_accuracy:.4f}")

# Get predictions and confusion matrix
pred_test = model.predict(x_test)
print(tf.math.confusion_matrix(y_test.argmax(axis=1), pred_test.argmax(axis=1)))

313/313 - 1s - 3ms/step - categorical_accuracy: 0.9925 - loss: 0.0325
Validation Accuracy: 0.9925
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
tf.Tensor(
[[ 976    1    0    0    0    0    0    2    1    0]
 [   0 1133    0    0    0    0    1    1    0    0]
 [   0    0 1027    0    0    0    1    2    2    0]
 [   1    0    4  998    0    2    0    2    3    0]
 [   0    0    0    0  977    0    2    0    0    3]
 [   0    0    1    7    0  880    3    1    0    0]
 [   4    2    0    0    1    1  948    0    2    0]
 [   0    1    2    0    1    0    0 1022    1    1]
 [   2    0    1    1    0    0    0    2  967    1]
 [   0    0    0    0    6   
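
A confusion matrix like the one above (rows are true classes, columns are predicted classes) can be reduced to per-class recall and precision. The sketch below uses a made-up 3-class matrix for illustration, not the values from this run.

```python
import numpy as np

# Toy 3-class confusion matrix: rows = true class, columns = predicted class.
# The numbers are made up for illustration.
cm = np.array([
    [50,  2,  0],
    [ 1, 45,  4],
    [ 0,  5, 43],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all samples of that true class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of that class
accuracy = cm.trace() / cm.sum()            # correct / all samples

print(np.round(recall, 3))
print(np.round(precision, 3))
print(round(float(accuracy), 3))
```

Per-class recall highlights which digits are most often missed, which a single overall accuracy figure hides.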

## Save trained model

In [26]:
model.save('mnist_lenet5.h5')

