In [1]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784")

In [2]:
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [3]:
mnist["DESCR"]

"**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  \n**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  \n**Please cite**:  \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 

In [4]:
X = mnist["data"]
y = mnist["target"]

In [5]:
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prep_pipeline = Pipeline([
    ("scaler", StandardScaler())
])
X_train = prep_pipeline.fit_transform(X_train)
X_test = prep_pipeline.transform(X_test)

In [7]:
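# Tune hyperparameters on a 10,000-sample subset of the scaled training set to keep the search fast.
X_val, y_val = X_train[:10000], y_train[:10000]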

In [8]:
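# Let's start with LinearSVC.
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

lin_svm_clf = LinearSVC(random_state=42)
# Note: penalty="l1" is not supported together with dual=True (the default here),
# so every "l1" candidate fails to fit and is scored as nan.
param = {
    "penalty": ["l1", "l2"],
    "C": [1, 2, 3, 5, 10, 20, 50]
}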
grid = GridSearchCV(lin_svm_clf, param, n_jobs=-1, cv=5)
grid.fit(X_val, y_val)
grid.best_params_

35 fits failed out of a total of 70.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.10/site-packages/sklearn/svm/_classes.py", line 257, in fit
    self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/ML/lib/python3.10/site-packages/sklearn/svm/_base.py", line 1185, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, p

{'C': 5, 'penalty': 'l2'}
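
The 35 failed fits correspond to the 7 values of C times 5 folds for the "l1" penalty. A minimal sketch of why they fail, and one way to make "l1" usable, assuming a small synthetic dataset from make_classification as a stand-in for MNIST (the names X_demo, y_demo, and l1_clf are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Small synthetic binary problem (assumption: stand-in for the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# penalty="l1" with the dual formulation is rejected with a ValueError.
try:
    LinearSVC(penalty="l1", dual=True, random_state=42).fit(X_demo, y_demo)
    l1_dual_failed = False
except ValueError:
    l1_dual_failed = True

# The same penalty fits once the primal problem is solved instead.
l1_clf = LinearSVC(penalty="l1", dual=False, random_state=42, max_iter=5000)
l1_clf.fit(X_demo, y_demo)

print(l1_dual_failed, l1_clf.coef_.shape)
```

Passing dual=False to the estimator given to GridSearchCV would let both penalties be scored, at the cost of a different solver for the "l2" candidates as well.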

In [9]:
lin_svm_clf = LinearSVC(**grid.best_params_)
lin_svm_clf.fit(X_train, y_train)



LinearSVC(C=5)

In [10]:
from sklearn.metrics import accuracy_score

def get_accuracy_score(model, X_train, y_train, X_test, y_test):
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    acs_train = accuracy_score(y_train, y_pred_train)
    acs_test = accuracy_score(y_test, y_pred_test)

    print(f"Train accuracy score: {acs_train}.")
    print(f"Test accuracy score: {acs_test}.")

get_accuracy_score(lin_svm_clf, X_train, y_train, X_test, y_test)

Train accuracy score: 0.89465.
Test accuracy score: 0.8882.


In [11]:
X_train2, X_test2 = X[:60000], X[60000:]
y_train2, y_test2 = y[:60000], y[60000:]

In [12]:
import numpy as np

# Casting to float32 before scaling gave slightly higher scores here.
scaler = StandardScaler()
X_train2 = scaler.fit_transform(X_train2.astype(np.float32))
# Reuse the statistics fitted on the training set; do not refit on the test set.
X_test2 = scaler.transform(X_test2.astype(np.float32))

clf = LinearSVC(**grid.best_params_)
clf.fit(X_train2, y_train2)
get_accuracy_score(clf, X_train2, y_train2, X_test2, y_test2)

Train accuracy score: 0.898.
Test accuracy score: 0.8912.
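
When standardizing a held-out set, the usual practice is to fit the scaler on the training data only and then apply the same statistics to the test data. A minimal sketch of that pattern, assuming synthetic float32 arrays rather than MNIST (train, test, and manual are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stand-in for the real train/test split).
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3)).astype(np.float32)
test = rng.normal(loc=5.0, scale=2.0, size=(200, 3)).astype(np.float32)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# transform() is equivalent to applying the training mean and scale by hand.
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual, atol=1e-4))
```

Calling fit_transform on the test set instead would standardize it with its own mean and variance, so the model would see test features on a slightly different scale than the one it was trained on.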




In [13]:
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
param = {
    "C": [1, 5, 10, 25, 50, 100],
    "kernel": ["linear", "poly", "rbf"],
    "degree": [2, 3, 4],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1, 0.5],
    "coef0": [0.0, 0.2, 0.5, 1]
}
grid = GridSearchCV(svm_clf, param, n_jobs=-1, cv=5)
grid.fit(X_val, y_val)
grid.best_params_
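
A single dictionary grid of this size crosses every kernel with every value of degree, gamma, and the independent term, even where the kernel ignores them. A rough sketch of how to count the candidates with ParameterGrid and how a per-kernel list of dictionaries shrinks the search, using the same value lists and SVC's `coef0` parameter (the variable names are hypothetical):

```python
from sklearn.model_selection import ParameterGrid

C_values = [1, 5, 10, 25, 50, 100]
degrees = [2, 3, 4]
gammas = ["scale", "auto", 0.001, 0.01, 0.1, 0.5]
coef0_values = [0.0, 0.2, 0.5, 1]

# One dictionary: the full Cartesian product of all lists.
full_grid = ParameterGrid({
    "C": C_values, "kernel": ["linear", "poly", "rbf"],
    "degree": degrees, "gamma": gammas, "coef0": coef0_values,
})

# A list of dictionaries: each kernel only varies the parameters it uses.
per_kernel_grid = ParameterGrid([
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["poly"], "C": C_values, "degree": degrees,
     "gamma": gammas, "coef0": coef0_values},
    {"kernel": ["rbf"], "C": C_values, "gamma": gammas},
])

print(len(full_grid), len(per_kernel_grid))
```

The full product has 1,296 candidates (6,480 fits with cv=5), while the per-kernel version has 474. GridSearchCV accepts the same list-of-dictionaries form for its parameter grid.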

from scipy.stats import reciprocal, uniform
from sklearn.model_selection import RandomizedSearchCV

param = {
    # Log-uniform distribution over gamma; the bounds are an assumed, illustrative range.
    "gamma": reciprocal(0.001, 0.1)
}
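
A randomized search samples each hyperparameter from a distribution instead of enumerating a grid. A minimal sketch of that idea, assuming the small bundled digits dataset from load_digits instead of MNIST, an RBF kernel, and illustrative bounds for gamma and C (none of these choices come from the cells above):

```python
from scipy.stats import reciprocal, uniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small bundled dataset (assumption: stand-in for MNIST to keep the run short).
X_digits, y_digits = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X_digits[:500])
y_small = y_digits[:500]

param_distributions = {
    "gamma": reciprocal(0.001, 0.1),  # log-uniform over [0.001, 0.1]
    "C": uniform(1, 10),              # uniform over [1, 11] (loc=1, scale=10)
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_distributions,
    n_iter=5, cv=3, random_state=42,
)
search.fit(X_small, y_small)
print(search.best_params_)
```

A log-uniform distribution is a common choice for gamma because plausible values span several orders of magnitude; n_iter controls the compute budget directly, independent of how many hyperparameters are searched.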