# LDA Classification on MNIST Dataset

This notebook demonstrates how to apply Linear Discriminant Analysis (LDA) for digit classification using the MNIST dataset. The notebook is organized into sections covering data loading, preprocessing, model training, and evaluation.

---

## 1. Import Libraries

We start by importing the necessary libraries, including the utility functions from `utils.py`.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import utils
import os


ModuleNotFoundError: No module named 'utils'

## 2. Load the MNIST Dataset

Next, we load the MNIST dataset using the `load_data` function from `utils.py`. It returns the training features and labels (`X_train`, `y_train`) and the test features (`X_test`); no test labels are loaded.


In [None]:
# Paths to the dataset CSV files
train_path = 'data/mnist_train_small.csv'
test_path = 'data/mnist_test.csv'

# Load data
X_train, y_train, X_test = utils.load_data(train_path, test_path)

# Check the shape of the datasets
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
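
Because `utils.py` is not shown in this notebook, the following is only a sketch of what a loader with the same call signature might look like. It assumes the training CSV has a header row with the label in the first column and that the test CSV contains pixel columns only; both are assumptions, and the function body is hypothetical rather than the actual `utils.load_data`.

```python
import io

import pandas as pd


def load_data(train_source, test_source):
    """Hypothetical stand-in for utils.load_data.

    Assumes: the training CSV has a header row and the label in the first
    column; the test CSV has a header row and pixel columns only.
    """
    train_df = pd.read_csv(train_source)
    test_df = pd.read_csv(test_source)
    y_train = train_df.iloc[:, 0].to_numpy()
    X_train = train_df.iloc[:, 1:].to_numpy(dtype=float)
    X_test = test_df.to_numpy(dtype=float)
    return X_train, y_train, X_test


# Tiny in-memory CSVs stand in for the real files on disk
train_csv = io.StringIO("label,p0,p1\n3,0,255\n7,128,64\n")
test_csv = io.StringIO("p0,p1\n10,20\n")
X_tr, y_tr, X_te = load_data(train_csv, test_csv)
print(X_tr.shape, y_tr.tolist(), X_te.shape)
```

If the real CSV files have no header row, `pd.read_csv(..., header=None)` would be needed instead.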


## 3. Data Preprocessing

We normalize the pixel values to [0, 1] and split the training data into training and validation sets (80% and 20%, respectively).


In [None]:
# Normalize the data
X_train = utils.normalize_data(X_train)
X_test = utils.normalize_data(X_test)

# Split the training data into training and validation sets
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train_split.shape}, Validation set shape: {X_val.shape}")
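
The normalization step can be sketched as a simple rescaling. The version below assumes raw pixel intensities in the range 0 to 255, which is the usual MNIST encoding; the function is a hypothetical stand-in, since the real `utils.normalize_data` is not shown here.

```python
import numpy as np


def normalize_data(X, max_value=255.0):
    """Hypothetical stand-in for utils.normalize_data.

    Assumes raw pixel intensities lie in [0, max_value].
    """
    X = np.asarray(X, dtype=np.float64)
    return X / max_value


pixels = np.array([[0, 51, 255]])
scaled = normalize_data(pixels)
print(scaled)
```

Dividing by a fixed constant, rather than by each dataset's own maximum, keeps the training and test sets on the same scale.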


## 4. Visualize Sample Data

Before training, let's visualize some sample images and their corresponding labels.


In [None]:
# Plot a few sample images from the training set
sample_indices = [0, 1, 2, 3, 4]
utils.plot_multiple_samples(X_train_split, y_train_split, sample_indices)
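
For readers without `utils.py`, a plotting helper with the same call signature could look roughly like the sketch below. It assumes each row of `X` is a flattened 28x28 image stored in a NumPy array; the function body and the `side` parameter are hypothetical, not the actual implementation.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_multiple_samples(X, y, indices, side=28):
    """Hypothetical stand-in for utils.plot_multiple_samples.

    Assumes X is a NumPy array whose rows are flattened side x side images.
    """
    fig, axes = plt.subplots(
        1, len(indices), figsize=(2 * len(indices), 2), squeeze=False
    )
    for ax, idx in zip(axes[0], indices):
        ax.imshow(np.asarray(X[idx]).reshape(side, side), cmap="gray")
        ax.set_title(f"Label: {y[idx]}")
        ax.axis("off")
    return fig


# Random pixels stand in for real digit images
rng = np.random.default_rng(0)
X_demo = rng.random((5, 784))
y_demo = np.array([5, 0, 4, 1, 9])
fig = plot_multiple_samples(X_demo, y_demo, [0, 1, 2])
```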


## 5. Apply Linear Discriminant Analysis (LDA)

We now train an LDA model on the training data and evaluate its performance on the validation set.


In [None]:
# Initialize and train the LDA model
lda = LDA()
lda.fit(X_train_split, y_train_split)

# Predict on the validation set
y_val_pred = lda.predict(X_val)

# Evaluate the model
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy:.4f}")

# Confusion matrix and classification report
conf_matrix = confusion_matrix(y_val, y_val_pred)
class_report = classification_report(y_val, y_val_pred)

print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)
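
To try the same fit, predict, and evaluate sequence without the MNIST CSV files, the sketch below swaps in scikit-learn's bundled 8x8 digits dataset (`load_digits`). This is a different and much smaller dataset than MNIST, so its scores are not comparable to the ones produced above; the variable names are chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in dataset: 8x8 digit images with pixel values from 0 to 16
digits = load_digits()
X = digits.data / 16.0
y = digits.target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LinearDiscriminantAnalysis()
clf.fit(X_tr, y_tr)
y_va_pred = clf.predict(X_va)

demo_accuracy = accuracy_score(y_va, y_va_pred)
demo_conf = confusion_matrix(y_va, y_va_pred)
print(f"Validation accuracy on the stand-in dataset: {demo_accuracy:.4f}")
```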


## 6. Save Results

Now, let's save the validation accuracy, the classification report, and a heatmap of the confusion matrix to the `results` directory.


In [None]:
# Create results directory if it doesn't exist
os.makedirs('results', exist_ok=True)

# Save the validation accuracy
with open('results/validation_accuracy.txt', 'w') as f:
    f.write(f"Validation Accuracy: {val_accuracy:.4f}\n")

# Save the classification report
with open('results/classification_report.txt', 'w') as f:
    f.write(class_report)

# Plot and save the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.savefig('results/confusion_matrix.png')
plt.show()
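
When reading the heatmap, it helps to remember that `confusion_matrix` puts true labels on the rows and predicted labels on the columns. The sketch below uses a made-up 3-class matrix (not results from this notebook) to show how per-class recall, per-class precision, and overall accuracy follow from that layout.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true labels, columns are predictions
demo_matrix = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])

# Recall: correct predictions divided by the number of true samples per class (row sums)
recall = np.diag(demo_matrix) / demo_matrix.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class (column sums)
precision = np.diag(demo_matrix) / demo_matrix.sum(axis=0)

# Overall accuracy: diagonal total divided by all samples
overall_accuracy = np.trace(demo_matrix) / demo_matrix.sum()

print(recall, precision, overall_accuracy)
```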


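## 7. Visualize the LDA 2D Projection

Next, we fit a second LDA model that projects the training data onto its first two discriminant axes, then plot the projection and save it to the `results` directory.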


In [None]:
# Apply LDA to reduce dimensions to 2
lda_2d = LDA(n_components=2)
X_train_2d = lda_2d.fit_transform(X_train_split, y_train_split)

# Plot the 2D projection, colored by digit label
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train_split, cmap='tab10', alpha=0.6)
plt.title('LDA 2D Projection of MNIST Data')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.colorbar(scatter, label='Digit')
plt.savefig('results/lda_2d_projection.png')
plt.show()
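
In scikit-learn, `n_components` for LDA cannot exceed `min(n_classes - 1, n_features)`, so a 10-digit problem allows at most 9 discriminant axes. The sketch below, again using the bundled `load_digits` data as a stand-in for MNIST, checks the shape of a 2D projection and reads `explained_variance_ratio_` to see how much of the between-class variance the two plotted axes capture.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in dataset: 8x8 digit images, 10 classes
digits = load_digits()
X, y = digits.data, digits.target

projector = LinearDiscriminantAnalysis(n_components=2)
X_2d = projector.fit_transform(X, y)

# One entry per retained discriminant axis
ratio = projector.explained_variance_ratio_
print(X_2d.shape, ratio)
```

A ratio sum well below 1 suggests that the 2D scatter plot hides class separation that exists along the remaining axes.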


## 8. Predict on the Test Set and Save Predictions

Finally, we use the trained LDA model to make predictions on the test set and save the predictions to a CSV file.


In [None]:
# Predict on the test set
y_test_pred = lda.predict(X_test)

# Save the test set predictions
pd.DataFrame(y_test_pred, columns=['Label']).to_csv('results/test_predictions.csv', index_label='ImageId')
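
Note that `to_csv` writes the DataFrame's default index, which starts at 0, under the `ImageId` column. If the consumer of the file expects 1-based ids (an assumption; the notebook does not say), the index can be shifted before saving. The sketch below uses made-up predictions and an in-memory buffer to show the resulting file layout.

```python
import io

import numpy as np
import pandas as pd

# Made-up predictions for illustration only
y_pred_demo = np.array([7, 2, 1])

submission = pd.DataFrame({"Label": y_pred_demo})
submission.index = submission.index + 1  # assumption: ids should start at 1

# Write to an in-memory buffer instead of results/test_predictions.csv
buffer = io.StringIO()
submission.to_csv(buffer, index_label="ImageId")

# Read the CSV back to confirm its columns and contents
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```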


## 9. Conclusion

This notebook demonstrated how to use Linear Discriminant Analysis (LDA) for digit classification using the MNIST dataset. The results, including the confusion matrix, classification report, 2D projection, and test predictions, have been saved to the `results` folder for further analysis.
