# G Logistic regression MNIST
_4 points_

Evaluate logistic regression as B on MNIST

In [2]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [3]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

In [4]:
x_train_flat = x_train.reshape(x_train.shape[0], x_train.shape[1] * x_train.shape[2])
x_test_flat = x_test.reshape(x_test.shape[0], x_test.shape[1] * x_test.shape[2])

In [5]:
log_regression = LogisticRegression(n_jobs=-1, max_iter=500) # request all CPU cores (the liblinear solver ignores n_jobs)
log_regression.fit(x_train_flat[:2000], y_train[:2000]) # subsample because this model is very slow to train

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
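
The output above reports `multi_class='ovr'`, meaning the ten digit classes are handled one-vs-rest: one binary classifier per class, with the highest-scoring class winning. The following is a minimal sketch of that decision rule only. The function `ovr_predict` and the toy weights are hypothetical and hand-picked for illustration; they are not the fitted MNIST coefficients and not scikit-learn's internal code.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def ovr_predict(x, weights, biases):
    """One-vs-rest decision rule (illustrative helper, not a library API).

    Each (w, b) pair is one binary "class k vs. rest" classifier.
    The predicted label is the class whose classifier scores highest.
    """
    scores = []
    for w, b in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        scores.append(sigmoid(z))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical toy model: three classes, two features (assumed values).
toy_weights = [[2.0, -1.0], [-1.0, 2.0], [-1.0, -1.0]]
toy_biases = [0.0, 0.0, 0.5]

label, scores = ovr_predict([1.0, 0.0], toy_weights, toy_biases)
print(label, [round(s, 3) for s in scores])
```

For MNIST the same rule would apply with ten weight vectors of length 784, one per digit.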

In [6]:
predictions = log_regression.predict(x_test_flat[:500]) # again, subsample

In [7]:
f1_score(y_test[:500], predictions, average="micro")

0.824

In [8]:
# Because the default solver (liblinear) is not optimized for multiclass problems and cannot run on multiple cores,
# we tried the "saga" solver, which is better suited to multiclass problems

log_regression_saga = LogisticRegression(solver="saga", n_jobs=-1, max_iter=500)
log_regression_saga.fit(x_train_flat[:2000], y_train[:2000])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='saga', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
prediction_saga = log_regression_saga.predict(x_test_flat[:500])

In [13]:
f1_score(y_test[:500], prediction_saga, average="micro")

0.842

In [15]:
# The run above only compares the two solvers on the same subsample. Because the saga solver scored better,
# we also ran it on the full dataset
log_regression_saga.fit(x_train_flat, y_train)
prediction_saga = log_regression_saga.predict(x_test_flat)
f1_score(y_test, prediction_saga, average="micro")



0.9184

## Answer

Logistic regression can be used to predict discrete labels (such as categories) in datasets. The given dataset consists of 10 different categories.
Because the default solver used here (liblinear, see the output above) is optimized for binary problems (two classes) and handles multiple classes one-vs-rest, it did not perform particularly well on this dataset. It is also quite slow, because this solver cannot distribute its work across CPU cores. Training on a subsample of 2000 examples and testing on 500 test examples yields an F1 score of 82.4%. (Again, we use the F1 score because it combines precision and recall and thus accounts for both false positives and false negatives; with `average="micro"` on a single-label problem it equals the accuracy.)<br>
We also tested another solver, "saga", which the documentation recommends for large datasets and which supports multiclass problems. On the same subsample, the saga solver achieves an F1 score of 84.2%.
Investing more time and training the saga solver on the full dataset yields an F1 score of 91.84%, which is a very good result.
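
A note on the metric: when every example has exactly one label, the micro-averaged F1 score pools true positives, false positives and false negatives over all classes, and the result coincides with plain accuracy. The sketch below illustrates this on a small made-up label list. The helpers `micro_f1` and `accuracy` are written here for illustration and are not scikit-learn functions; the labels are invented, not MNIST predictions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in labels:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1          # predicted c, truly c
            elif p == c and t != c:
                fp += 1          # predicted c, truly something else
            elif p != c and t == c:
                fn += 1          # truly c, predicted something else
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Invented toy labels for three classes (illustration only).
toy_true = [0, 1, 2, 2, 1, 0, 2, 1]
toy_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(micro_f1(toy_true, toy_pred), accuracy(toy_true, toy_pred))
```

Each misclassified example counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers agree.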

Logistic regression is an acceptable model for this dataset: its predictions are fairly accurate, but KNN performed better, with results above 96%. On the other hand, it performed far better than k-means, which made errors all over the place.
Nevertheless, if accurate predictions with as few mistakes as possible are required, this model is not recommended.