### Classification on MNIST - Logistic Regression, LDA, QDA, and Naive Bayes

In [8]:
# loading in packages
import pandas as pd
import os

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.metrics import accuracy_score

### Loading in data

In [2]:
# loading in data
file_path_train = os.path.join("data", "mnist_train.csv")
file_path_test = os.path.join("data", "mnist_test.csv")

train_df = pd.read_csv(file_path_train)
test_df = pd.read_csv(file_path_test)

x_train = train_df.drop('label', axis = 1)
y_train = train_df['label']

x_test = test_df.drop('label', axis = 1)
y_test = test_df['label']

### Data Preprocessing
We normalize the predictors by dividing every column except the label by 255. Each pixel is stored as an 8-bit grayscale intensity between 0 and 255, so this rescales every feature to the range [0, 1].

In [3]:
# Divide all predictors by 255
x_train = x_train/255
x_test = x_test/255

# Alternative if labels and predictors had not been split into separate dataframes:
# train_df.loc[:, train_df.columns != 'label'] /= 255
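
To make the effect of the scaling step concrete, here is a minimal sketch on a tiny made-up array. The `pixels` values are hypothetical and only stand in for the real pixel columns; the division itself is the same operation applied to `x_train` and `x_test` above.

```python
import numpy as np

# Hypothetical 2x3 "image" of raw 8-bit pixel intensities (0 = black, 255 = white).
pixels = np.array([[0, 128, 255],
                   [64, 32, 16]], dtype=np.uint8)

# Dividing by 255 maps the 8-bit range [0, 255] onto [0.0, 1.0];
# NumPy promotes the result to floating point automatically.
scaled = pixels / 255

print(scaled.dtype, scaled.min(), scaled.max())
```

Keeping all features on the same [0, 1] scale also tends to help gradient-based solvers, such as the one used later for logistic regression.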

Because the dataset is high-dimensional (one column per pixel), we use PCA for dimensionality reduction. We keep just enough principal components to explain 99% of the variance in the training data.

In [4]:
# fitting the PCA
pca = PCA()
pca.fit(x_train)

In [12]:
# getting enough principal components to explain 99% of variance in the training data.
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.99) + 1

# pca.transform returns a numpy array
pca_train = pca.transform(x_train)[:, :n_components]
pca_test = pca.transform(x_test)[:, :n_components]

# we turn the numpy array into a pandas dataframe
pca_train_df = pd.DataFrame(pca_train, columns=[f"PC{i+1}" for i in range(n_components)])
pca_test_df = pd.DataFrame(pca_test, columns=[f"PC{i+1}" for i in range(n_components)])

print('We keep', n_components, 'principal components')

We keep 331 principal components
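
As a cross-check on the component selection, scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components` (with `svd_solver="full"`) and chooses the number of components itself. The sketch below compares that shortcut with the cumulative-sum approach used above. It runs on synthetic data, so `X` and its shape are assumptions for illustration and not the MNIST matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the pixel matrix (assumption: 200 samples, 20 features
# with decreasing spread so that a few directions dominate the variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.5, 20)

threshold = 0.99

# Manual approach: fit all components, then find the first position where the
# cumulative explained-variance ratio reaches the threshold.
full_pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= threshold) + 1)

# Built-in alternative: pass the variance fraction directly as n_components.
auto_pca = PCA(n_components=threshold, svd_solver="full").fit(X)
n_auto = int(auto_pca.n_components_)

print(n_manual, n_auto)
```

On this toy data both routes should select the same number of components, and `auto_pca.transform(X)` should match the first `n_manual` columns of `full_pca.transform(X)`. The manual version is still useful when you want to inspect the full variance curve before choosing a cutoff.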


### Logistic Regression

In [13]:
# Initializing model
log_reg = LogisticRegression()

# Fitting the model
log_reg_model = log_reg.fit(pca_train_df, y_train)

# predicting with the model
predictions_log_reg = log_reg_model.predict(pca_test_df)

# calculating accuracy
accuracy_log_reg = accuracy_score(y_test, predictions_log_reg)

print('Accuracy of Logistic regression model: ', accuracy_log_reg)

Accuracy of Logistic regression model:  0.9247


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
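
The message above is scikit-learn's convergence warning: the solver stopped at its iteration limit before converging, so the reported accuracy comes from a model that was not fully optimized. One common remedy, suggested by the warning itself, is to raise `max_iter`. The sketch below reproduces the situation on a small synthetic dataset; the data, the helper name `fit_and_check`, and the value `max_iter=1000` are illustrative assumptions, and the iteration count needed for the MNIST model may well differ.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the PCA features (assumption: 300 samples, 10 features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_and_check(max_iter):
    """Fit a logistic regression and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned


# A deliberately tiny iteration budget triggers the warning ...
_, warned_low = fit_and_check(max_iter=1)

# ... while a generous budget lets the solver stop on its own.
model_high, warned_high = fit_and_check(max_iter=1000)

print(warned_low, warned_high, model_high.n_iter_)
```

Checking `n_iter_` against `max_iter` after fitting is a quick way to confirm that the solver finished before hitting the limit.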

