```
Copyright 2021 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Support Vector Machine on MNIST8M Dataset

## Background 

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

## Source

We use an inflated version of the dataset (`mnist8m`) from the paper:

Gaëlle Loosli, Stéphane Canu and Léon Bottou: *Training Invariant Support Vector Machines using Selective Sampling*, in [Large Scale Kernel Machines](https://leon.bottou.org/papers/lskm-2007), Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston editors, 301–320, MIT Press, Cambridge, MA., 2007.

We download the pre-processed dataset from the [LIBSVM dataset repository](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
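
Files in that repository are distributed in the sparse LIBSVM text format, where each line holds a label followed by `index:value` pairs for the non-zero features. The sketch below only illustrates that format; the `Mnist8m` helper used later in this notebook takes care of loading and caching, and its internals are not shown here. The function name `parse_libsvm_line` and the sample line are made up for this example.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# Made-up sample line: label 5 with four non-zero pixel values.
sample = "5 153:3 154:18 155:18 156:126"
label, features = parse_libsvm_line(sample)
print(label, features)
```

In practice, `sklearn.datasets.load_svmlight_file` reads this format directly into a sparse matrix, so a hand-written parser is rarely needed.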

## Goal

The goal of this notebook is to illustrate how Snap ML can accelerate training of a support vector machine model on this dataset.

## Code

In [1]:
cd ../../

/Users/tpa/Code/snapml-examples/examples


In [2]:
CACHE_DIR='cache-dir'

In [3]:
import numpy as np
import time
from datasets import Mnist8m
from sklearn.svm import LinearSVC
from snapml import SupportVectorMachine as SnapSupportVectorMachine
from sklearn.metrics import accuracy_score as score

In [4]:
dataset = Mnist8m(cache_dir=CACHE_DIR)
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

Reading binary Mnist8m dataset (cache) from disk.


In [5]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

Number of examples: 6075000
Number of features: 784
Number of classes:  10


In [6]:
model = LinearSVC(loss='hinge', multi_class='ovr')
t0 = time.time()
model.fit(X_train, y_train)
t_fit_sklearn = time.time()-t0
score_sklearn = score(y_test, model.predict(X_test))
print("Training time  (sklearn): %6.2f seconds" % (t_fit_sklearn))
print("Accuracy score (sklearn): %.4f" % (score_sklearn))

Training time  (sklearn): 1249.79 seconds
Accuracy score (sklearn): 0.8452


In [7]:
model = SnapSupportVectorMachine(n_jobs=8)
t0 = time.time()
model.fit(X_train, y_train)
t_fit_snapml = time.time()-t0
score_snapml = score(y_test, model.predict(X_test))
print("Training time  (snapml): %6.2f seconds" % (t_fit_snapml))
print("Accuracy score (snapml): %.4f" % (score_snapml))

Training time  (snapml): 184.53 seconds
Accuracy score (snapml): 0.8452


In [8]:
speed_up = t_fit_sklearn/t_fit_snapml
score_diff = (score_snapml-score_sklearn)/score_sklearn
print("Speed-up:                %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))

Speed-up:                6.8 x
Relative diff. in score: 0.0000
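
As a quick sanity check, the speed-up can be reproduced from the timings printed by cells 6 and 7. The values below are copied from those outputs; because they are rounded, the result may differ from the unrounded computation in the last digits.

```python
# Timings and scores copied from the printed outputs above (rounded values).
t_fit_sklearn = 1249.79
t_fit_snapml = 184.53
score_sklearn = 0.8452
score_snapml = 0.8452

# Same formulas as in cell 8.
speed_up = t_fit_sklearn / t_fit_snapml
score_diff = (score_snapml - score_sklearn) / score_sklearn

print("Speed-up:                %.1f x" % speed_up)
print("Relative diff. in score: %.4f" % score_diff)
```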


## Disclaimer

Performance results always depend on the hardware and software environment. 

Information regarding the environment that was used to run this notebook is provided below:

In [9]:
import utils
environment = utils.get_environment()
for k,v in environment.items():
    print("%15s: %s" % (k, v))

       platform: macOS-10.16-x86_64-i386-64bit
      cpu_count: 8
   cpu_freq_min: 2300
   cpu_freq_max: 2300
   total_memory: 32.0
 snapml_version: 1.7.0
sklearn_version: 0.23.2
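
`utils.get_environment()` is a helper from this repository, and its implementation is not shown here. As a rough sketch of how comparable information could be gathered with the Python standard library alone, the hypothetical function `basic_environment` below collects the platform string, CPU count, and Python version. CPU frequency, memory size, and library versions are left out because they would need additional packages.

```python
import os
import platform


def basic_environment():
    """Hypothetical stand-in: collect a few environment facts via the stdlib."""
    return {
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),  # may be None if it cannot be determined
        "python_version": platform.python_version(),
    }


# Print in the same aligned "key: value" layout used above.
for k, v in basic_environment().items():
    print("%15s: %s" % (k, v))
```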


## Record Statistics

Finally, we record the environment and performance statistics for analysis outside of this standalone notebook.

In [10]:
import scrapbook as sb
sb.glue("result", {
    'dataset': dataset.name,
    'n_examples_train': X_train.shape[0],
    'n_examples_test': X_test.shape[0],
    'n_features': X_train.shape[1],
    'n_classes': len(np.unique(y_train)),
    'model': type(model).__name__,
    'score': score.__name__,
    't_fit_sklearn': t_fit_sklearn,
    'score_sklearn': score_sklearn,
    't_fit_snapml': t_fit_snapml,
    'score_snapml': score_snapml,
    'score_diff': score_diff,
    'speed_up': speed_up,
    **environment,
})