```
Copyright 2021 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Logistic Regression on Allstate Dataset

## Background 

The goal of the Allstate Claim Prediction Challenge is to predict Bodily Injury Liability Insurance claim payments based on the characteristics of the insured’s vehicle.

## Source

The raw dataset can be obtained directly from the [Allstate Claim Prediction Challenge](https://www.kaggle.com/c/ClaimPredictionChallenge).

In this example, we download the dataset directly from Kaggle using their API. For this to work, you must:
1. Log in to Kaggle and accept the [competition rules](https://www.kaggle.com/c/ClaimPredictionChallenge/rules).
2. Follow [these instructions](https://www.kaggle.com/docs/api) to install your API token on your machine.
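
Before running the notebook, it can help to confirm that the API token is in place. The sketch below assumes the default token location described in the Kaggle API instructions, `~/.kaggle/kaggle.json`, and that the file is a JSON object with `username` and `key` fields. The helper name `kaggle_token_status` is hypothetical and not part of the Kaggle client or of this repository.

```python
import json
from pathlib import Path


def kaggle_token_status(token_path=None):
    """Return "ok", "missing", or "invalid" for a Kaggle API token file.

    Assumption: the token lives at ~/.kaggle/kaggle.json unless a
    different path is passed in.
    """
    path = Path(token_path) if token_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return "missing"
    try:
        token = json.loads(path.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "invalid"
    # A usable token file is a JSON object containing both of these keys.
    if not isinstance(token, dict) or not {"username", "key"} <= token.keys():
        return "invalid"
    return "ok"


print("Kaggle API token:", kaggle_token_status())
```

If the status is anything other than `ok`, the download step in the notebook is unlikely to succeed until the token is installed.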

## Goal
The goal of this notebook is to illustrate how Snap ML can accelerate training of a logistic regression model on this dataset.

## Code

In [1]:
cd ../../

/Users/tpa/Code/snapml-examples/examples


In [2]:
CACHE_DIR='cache-dir'

In [3]:
import numpy as np
import time
from datasets import Allstate
from sklearn.linear_model import LogisticRegression
from snapml import LogisticRegression as SnapLogisticRegression
from sklearn.metrics import roc_auc_score

In [4]:
# Note: when the notebook is executed for the first time, the dataset is downloaded and cached.
# Subsequent runs of the notebook simply read the cached dataset and should be much faster.
X_train, X_test, y_train, y_test = Allstate(cache_dir=CACHE_DIR).get_train_test_split()

Creating working directory: cache-dir/Allstate
Downloading Allstate dataset.
Preprocessing Allstate dataset.
File Name                                             Modified             Size
dictionary.html                                2019-12-11 04:17:20        24768
example_compressed_entry.csv.gz                2019-12-11 04:17:20      9534348
example_compressed_entry.zip                   2019-12-11 04:17:22      5120252
example_uncompressed_entry.csv                 2019-12-11 04:17:22     51778401
kaggle_srm6_stormod.sas7bitm                   2019-12-11 04:17:26       325632
test_set.7z                                    2019-12-11 04:17:26     34841099
test_set.zip                                   2019-12-11 04:17:30    116285954
train_set.7z                                   2019-12-11 04:17:44    110548523
train_set.zip                                  2019-12-11 04:17:56    380542114



            Row_ID  Household_ID  Vehicle  Calendar_Year  Model_Year  \
0                1             1        3           2005        2005   
1                2             2        2           2005        2003   
2                3             3        1           2005        1998   
3                4             3        1           2006        1998   
4                5             3        2           2005        2001   
...            ...           ...      ...            ...         ...   
13184285  13184286       7542110        1           2005        1987   
13184286  13184287       7542111        1           2005        1989   
13184287  13184288       7542112        1           2005        1990   
13184288  13184289       7542112        2           2005        1998   
13184289  13184290       7542113        1           2005        1996   

         Blind_Make Blind_Model Blind_Submodel Cat1 Cat2  ...      Var5  \
0                 K        K.78         K.78.2    D    C  ..

TypeError: cannot unpack non-iterable NoneType object

In [None]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

In [None]:
lr = LogisticRegression(fit_intercept=False, n_jobs=4)
t0 = time.time()
lr.fit(X_train, y_train)
t_fit_sklearn = time.time()-t0
score_sklearn = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
print("Training time (sklearn): %6.2f seconds" % (t_fit_sklearn))
print("ROC AUC score (sklearn): %.4f" % (score_sklearn))

In [None]:
lr = SnapLogisticRegression(fit_intercept=False, n_jobs=4)
t0 = time.time()
lr.fit(X_train, y_train)
t_fit_snapml = time.time()-t0
score_snapml = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
print("Training time (snapml): %6.2f seconds" % (t_fit_snapml))
print("ROC AUC score (snapml): %.4f" % (score_snapml))

In [None]:
speed_up = t_fit_sklearn/t_fit_snapml
score_diff = (score_snapml-score_sklearn)/score_sklearn
print("Speed-up:                %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))
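
The timing and comparison pattern used in the cells above can be factored into two small helpers. This is a minimal sketch, not part of the notebook: the names `timed` and `compare` are hypothetical, `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals, and the numbers passed to `compare` are made up for illustration and are not measured results.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


def compare(t_baseline, t_candidate, score_baseline, score_candidate):
    """Return (speed-up, relative score difference) of candidate vs. baseline."""
    speed_up = t_baseline / t_candidate
    score_diff = (score_candidate - score_baseline) / score_baseline
    return speed_up, score_diff


# Made-up inputs for illustration only, NOT measured results.
speed_up, score_diff = compare(10.0, 2.0, 0.60, 0.60)
print("Speed-up:                %.1f x" % speed_up)
print("Relative diff. in score: %.4f" % score_diff)
```

With the notebook's variables, a call such as `timed(lr.fit, X_train, y_train)` would return the fitted model together with the training time.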

## Disclaimer

Performance results always depend on the hardware and software environment. 

This notebook was run on the following machine:
* OS: macOS 11.1 (Big Sur)
* CPU: 2.3 GHz Quad-Core Intel Core i7
* Memory: 32 GB

The versions of the relevant software packages are given below:

In [None]:
import snapml
import sklearn
print("scikit-learn version: %s" % (sklearn.__version__))
print("      snapml version: %s" % (snapml.__version__))
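
Part of the machine description in this section can also be collected programmatically, which makes results easier to compare across environments. The sketch below uses only Python's standard `platform` module; the helper name `environment_summary` is hypothetical, and the values it reports depend on the machine it runs on.

```python
import platform


def environment_summary():
    """Return basic OS, architecture, and Python version information."""
    return {
        "os": "%s %s" % (platform.system(), platform.release()),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


for name, value in environment_summary().items():
    print("%8s: %s" % (name, value))
```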