```
Copyright 2021 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Logistic Regression on Epsilon Dataset

## Background 

Epsilon is a synthetic dataset from the [PASCAL Large Scale Learning Challenge](https://www.k4all.org/project/large-scale-learning-challenge/). The challenge concerns the scalability and efficiency of existing ML approaches with respect to computational, memory, or communication resources. Such constraints can result from high algorithmic complexity, from the size or dimensionality of the dataset, or from the trade-off between distributed resolution and communication costs.

## Source

In this example, we download the dataset (`epsilon_normalized.bz2`) from the [LIBSVM binary classification datasets page](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html).

## Goal
The goal of this notebook is to illustrate how Snap ML can accelerate training of a logistic regression model on this dataset.

## Code

In [1]:
cd ../../

/Users/tpa/Code/snapml-examples/examples


In [2]:
CACHE_DIR='cache-dir'

In [3]:
import numpy as np
import time
from datasets import Epsilon
from sklearn.linear_model import LogisticRegression
from snapml import LogisticRegression as SnapLogisticRegression
from sklearn.metrics import roc_auc_score

In [4]:
X_train, X_test, y_train, y_test = Epsilon(cache_dir=CACHE_DIR).get_train_test_split()

Creating working directory: cache-dir/Epsilon
Downloading Epsilon dataset.
Downloading file: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2


  0%|          | 0.00/3.87G [00:00<?, ?iB/s]

Preprocessing Epsilon dataset.


KeyboardInterrupt: 
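
The download log above shows that the raw data arrives as a bz2-compressed file from the LIBSVM datasets site. The internals of the `datasets.Epsilon` helper are not shown in this notebook, so the following is only a minimal sketch of how a file in LIBSVM text format could be read with scikit-learn's `load_svmlight_file`. The three-row file, the `load_tiny_example` function, and the feature count of 4 are invented for illustration and are not part of the Epsilon data.

```python
import bz2
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Hypothetical three-example file in LIBSVM format: "<label> <index>:<value> ..."
# with one-based feature indices. These values are invented for illustration.
TINY_LIBSVM = (
    "1 1:0.5 3:-1.25\n"
    "-1 2:0.25 4:0.75\n"
    "1 1:-0.5 2:0.5 4:0.25\n"
)


def load_tiny_example():
    """Write the tiny file as .bz2, then read it back as (X, y)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "tiny_example.bz2")
        with bz2.open(path, "wt") as f:
            f.write(TINY_LIBSVM)
        # Paths ending in .bz2 are decompressed on the fly by scikit-learn.
        X, y = load_svmlight_file(path, n_features=4, zero_based=False)
    return X, y


# X_demo is a SciPy sparse matrix, y_demo a NumPy array of labels.
X_demo, y_demo = load_tiny_example()
print("Number of examples: %d" % X_demo.shape[0])
print("Number of features: %d" % X_demo.shape[1])
```

Parsing a multi-gigabyte text file this way takes time, which is presumably why the helper works out of a cache directory (`CACHE_DIR`), as the `Creating working directory` message suggests.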

In [None]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

In [None]:
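# Baseline: scikit-learn logistic regression without an intercept term.
lr = LogisticRegression(fit_intercept=False, n_jobs=4)
# Time the call to fit() only; scoring is excluded from the measurement.
t0 = time.time()
lr.fit(X_train, y_train)
t_fit_sklearn = time.time()-t0
# ROC AUC is computed from the predicted probability of the positive class.
score_sklearn = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
print("Training time (sklearn): %6.2f seconds" % (t_fit_sklearn))
print("ROC AUC score (sklearn): %.4f" % (score_sklearn))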

In [None]:
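# Snap ML logistic regression, configured with the same arguments as the baseline.
lr = SnapLogisticRegression(fit_intercept=False, n_jobs=4)
# Time the call to fit() only; scoring is excluded from the measurement.
t0 = time.time()
lr.fit(X_train, y_train)
t_fit_snapml = time.time()-t0
# ROC AUC is computed from the predicted probability of the positive class.
score_snapml = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
print("Training time (snapml): %6.2f seconds" % (t_fit_snapml))
print("ROC AUC score (snapml): %.4f" % (score_snapml))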
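
The two training cells above repeat the same pattern: record a start time, call `fit`, and take the difference. As a sketch, that pattern could be factored into a small helper. `timed_fit` and `SleepyModel` are hypothetical names introduced here and are not part of scikit-learn or Snap ML. The sketch uses `time.perf_counter` rather than `time.time` because it is a monotonic clock intended for measuring elapsed time.

```python
import time


def timed_fit(model, X, y):
    """Fit `model` on (X, y) and return (model, elapsed_seconds)."""
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    return model, elapsed


class SleepyModel:
    """Hypothetical stand-in estimator, used only to exercise the helper."""

    def fit(self, X, y):
        time.sleep(0.01)  # pretend to do some training work
        self.fitted_ = True
        return self


# Any object with a fit(X, y) method can be timed the same way.
demo_model, demo_elapsed = timed_fit(SleepyModel(), X=[[0.0], [1.0]], y=[0, 1])
print("Training time (dummy): %6.2f seconds" % demo_elapsed)
```

With such a helper, the scikit-learn and Snap ML estimators could each be passed to `timed_fit` in turn, so that both measurements are taken in exactly the same way.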

In [None]:
speed_up = t_fit_sklearn/t_fit_snapml
score_diff = (score_snapml-score_sklearn)/score_sklearn
print("Speed-up:                %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))
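
The two quantities printed above are simple ratios: the speed-up divides the scikit-learn training time by the Snap ML training time, and the relative difference expresses the gap in ROC AUC as a fraction of the scikit-learn score. The sketch below wraps both in functions and evaluates them on made-up numbers. The function names and the input values are illustrative assumptions, not measurements from this notebook.

```python
def compute_speed_up(t_baseline, t_candidate):
    """Return how many times faster the candidate trained than the baseline."""
    if t_candidate <= 0:
        raise ValueError("candidate training time must be positive")
    return t_baseline / t_candidate


def compute_relative_diff(score_candidate, score_baseline):
    """Return the candidate's score change as a fraction of the baseline score."""
    return (score_candidate - score_baseline) / score_baseline


# Hypothetical inputs, chosen only to show the arithmetic.
example_speed_up = compute_speed_up(t_baseline=60.0, t_candidate=12.0)
example_score_diff = compute_relative_diff(score_candidate=0.899, score_baseline=0.900)

print("Speed-up:                %.1f x" % example_speed_up)
print("Relative diff. in score: %.4f" % example_score_diff)
```

A speed-up above 1 means the candidate trained faster than the baseline. A relative difference close to 0 means the two models reach nearly the same ROC AUC, and a negative value means the candidate scored slightly lower.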

## Disclaimer

Performance results always depend on the hardware and software environment. 

This notebook was run on the following machine:
* OS: macOS 11.1 (Big Sur)
* CPU: 2.3 GHz Quad-Core Intel Core i7
* Memory: 32 GB

The versions of the relevant software packages are given below:

In [None]:
import snapml
import sklearn
print("scikit-learn version: %s" % (sklearn.__version__))
print("      snapml version: %s" % (snapml.__version__))