## Training Notebook

This notebook trains a simple model to classify handwritten digits from the MNIST dataset. The same code was used to train the model included with the templates. It is meant as a starter model that shows you how to set up Serverless applications to run inference. For a deeper understanding of how to train a good model for MNIST, we recommend the literature listed on the [MNIST website](http://yann.lecun.com/exdb/mnist/). The dataset is made available under a [Creative Commons Attribution-Share Alike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../../../chapter1/stream-classifier/data/bank-additional/bank-additional/bank-additional-full.csv', sep=';')

# Load the MNIST dataset

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# as_frame=False returns NumPy arrays rather than pandas objects
X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)

# We limit training to 10000 images for faster training. Remove train_size to use all examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, train_size=10000)

In [None]:
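from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression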

numeric_features = ['age', 'balance']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['job', 'marital', 'education', 'contact', 'housing', 'loan', 'default', 'day', 'poutcome']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


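# Add classifier to the preprocessing pipeline
clf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', LogisticRegression())])

# The preprocessor selects columns by name, so X_train must be a DataFrame
# that contains the numeric and categorical features listed above.
clf_pipeline.fit(X_train, y_train)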


## Scikit-learn Model Training

For this example, we train a simple SVM classifier with scikit-learn to classify the MNIST digits, then serialize the fitted model in the `.joblib` format. This is the same as the starter model file included with the SAM templates.

In [None]:
%%time

import sklearn
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn import svm

print(f'Using scikit-learn version: {sklearn.__version__}')

# Fit our training data.
# Note: `degree` only affects the 'poly' kernel; SVC's default 'rbf' kernel ignores it.
clf = svm.SVC(degree=5)
clf.fit(X_train, y_train)

# Measure the accuracy of the fitted model on the held-out test set
accuracy = accuracy_score(y_test, clf.predict(X_test))

print('Test accuracy without deskewing:', accuracy)
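
The next cell calls `deskew_images`, whose definition does not appear in this notebook. The sketch below shows one common way such a helper can be written, using image moments and a shear transform from SciPy. It is an assumption about what the helper does, not the original implementation: the names `image_moments` and `deskew_image`, the `side=28` parameter, and the choice of bilinear interpolation (`order=1`) are all illustrative.

```python
import numpy as np
from scipy.ndimage import affine_transform


def image_moments(image):
    """Return the intensity-weighted centroid and 2x2 covariance of an image."""
    rows, cols = np.mgrid[:image.shape[0], :image.shape[1]]
    total = image.sum()
    mean_r = (rows * image).sum() / total
    mean_c = (cols * image).sum() / total
    var_r = ((rows - mean_r) ** 2 * image).sum() / total
    var_c = ((cols - mean_c) ** 2 * image).sum() / total
    cov_rc = ((rows - mean_r) * (cols - mean_c) * image).sum() / total
    centroid = np.array([mean_r, mean_c])
    covariance = np.array([[var_r, cov_rc], [cov_rc, var_c]])
    return centroid, covariance


def deskew_image(image):
    """Shear one 2-D image to remove the linear row/column slant of its pixels."""
    if image.sum() == 0:
        # Blank image: no moments to compute, return it unchanged.
        return image.copy()
    centroid, covariance = image_moments(image)
    if covariance[0, 0] == 0:
        # All mass lies in a single row: the slant is undefined.
        return image.copy()
    # Slope of column position as a function of row position.
    alpha = covariance[0, 1] / covariance[0, 0]
    shear = np.array([[1.0, 0.0], [alpha, 1.0]])
    image_center = np.array(image.shape) / 2.0
    # Map the output image center onto the input centroid.
    offset = centroid - shear @ image_center
    # order=1 (bilinear) keeps pixel values within the original range.
    return affine_transform(image, shear, offset=offset, order=1)


def deskew_images(X, side=28):
    """Deskew a batch of flattened square images; returns shape (n, side*side)."""
    images = np.asarray(X, dtype=float).reshape(-1, side, side)
    return np.stack([deskew_image(img).ravel() for img in images])
```

Because the helper returns flattened rows in the same layout as its input, its result can be passed straight to `clf.fit` and `clf.predict`. The same transform has to be applied to any image sent to the model at inference time.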

In [None]:
%%time

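# Let's try this again with deskewing on.
# deskew_images must already be defined in an earlier cell; it is expected
# to take and return an array of flattened images.

# Fit our training data on the deskewed images
# (`degree` is ignored by the default 'rbf' kernel)
clf = svm.SVC(degree=5)
clf.fit(deskew_images(X_train), y_train)

# Measure accuracy on the held-out test set, deskewed the same way
accuracy = accuracy_score(y_test, clf.predict(deskew_images(X_test)))

print('Test accuracy with deskewing:', accuracy)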

In [None]:
import joblib

# Save the model to disk with compression to keep size low
joblib.dump(clf, 'digit_classifier.joblib', compress=3)
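
To check that a saved file can be used for inference, the model can be loaded back with `joblib.load` and asked for predictions. The sketch below is a self-contained illustration of that round trip. As an assumption made only so it runs without downloading MNIST, it trains a small stand-in classifier on scikit-learn's bundled 8x8 digits and writes to a temporary directory; the variable names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn import svm
from sklearn.datasets import load_digits

# Stand-in data: scikit-learn's bundled 8x8 digits, used here only so the
# sketch runs offline. The notebook's real model is trained on MNIST.
X_small, y_small = load_digits(return_X_y=True)
model = svm.SVC()
model.fit(X_small[:500], y_small[:500])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'digit_classifier.joblib')
    # Same dump call as in the notebook, followed by a reload from disk.
    joblib.dump(model, model_path, compress=3)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
sample = X_small[500:505]
predictions = restored.predict(sample)
print(predictions)
```

Joblib files are pickle-based, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them. The version printed during training is worth recording next to the model file for that reason.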