# SLU09 - Classification with Logistic Regression: Example notebook
This notebook shows how to use scikit-learn's implementation of logistic regression to solve the last exercise of the SLU09 Exercise Notebook.

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### The Banknote Authentication Dataset

The dataset has 1372 items, each taken from an image of a banknote (think of a euro or dollar bill). There are 4 predictor variables (variance of image, skewness, kurtosis, entropy). The variable to predict is encoded as 0 (authentic) or 1 (forgery).

Your quest is to first explore this dataset using the material from the previous SLUs, and then create a logistic regression model that can correctly distinguish forged banknotes from authentic ones.

The data is loaded for you below.

In [2]:
columns = ['variance','skewness','kurtosis','entropy', 'forgery']
data = pd.read_csv('data/data_banknote_authentication.txt',names=columns).sample(frac=1, random_state=1)
X_train = data.drop(columns='forgery').values
Y_train = data.forgery.values

What do the features and the target look like?

In [3]:
X_train

array([[-3.551   ,  1.8955  ,  0.1865  , -2.4409  ],
       [ 1.3114  ,  4.5462  ,  2.2935  ,  0.22541 ],
       [-4.0173  , -8.3123  , 12.4547  , -1.4375  ],
       ...,
       [-4.3667  ,  6.0692  ,  0.57208 , -5.4668  ],
       [ 2.0466  ,  2.03    ,  2.1761  , -0.083634],
       [-2.3147  ,  3.6668  , -0.6969  , -1.2474  ]])

In [4]:
Y_train

array([1, 0, 1, ..., 1, 0, 1])
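Before modelling, it helps to look at the class balance and at how the features differ between classes. The sketch below runs on only the six rows visible in the truncated outputs above, so the numbers it prints are illustrative; on the real data the same calls would be applied to `data`. The names `sample`, `class_counts` and `class_means` are introduced here for illustration.

```python
import pandas as pd

columns = ['variance', 'skewness', 'kurtosis', 'entropy', 'forgery']

# The six rows visible in the truncated outputs above (not the full dataset)
rows = [
    [-3.551, 1.8955, 0.1865, -2.4409, 1],
    [1.3114, 4.5462, 2.2935, 0.22541, 0],
    [-4.0173, -8.3123, 12.4547, -1.4375, 1],
    [-4.3667, 6.0692, 0.57208, -5.4668, 1],
    [2.0466, 2.03, 2.1761, -0.083634, 0],
    [-2.3147, 3.6668, -0.6969, -1.2474, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Class balance: how many forged (1) vs authentic (0) notes
class_counts = sample['forgery'].value_counts()

# Per-class feature means hint at which features separate the classes
class_means = sample.groupby('forgery').mean()

print(class_counts)
print(class_means)
```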

### StandardScaler
Standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every feature ends up with mean 0 and variance 1.

In [5]:
# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler: learn each feature's mean and standard deviation
scaler.fit(X_train)

StandardScaler()

In [6]:
# Transform your data
X_train = scaler.transform(X_train)
X_train

array([[-1.4022234 , -0.00457705, -0.28110447, -0.5948078 ],
       [ 0.30884914,  0.44722827,  0.20793348,  0.67471412],
       [-1.56631379, -1.74447154,  2.56636367, -0.11705454],
       ...,
       [-1.68926722,  0.70681989, -0.19161076, -2.03554288],
       [ 0.5675651 ,  0.01834815,  0.18068476,  0.52756764],
       [-0.96717096,  0.29733669, -0.48614297, -0.02654139]])

Check the mean and variance of each feature (the means are zero up to floating-point error):

In [7]:
X_train.mean(axis=0)

array([-2.33049731e-17,  2.00681713e-17, -4.91993877e-17, -4.91993877e-17])

In [8]:
X_train.var(axis=0)

array([1., 1., 1., 1.])
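To see what the scaler does under the hood, the transformation can be reproduced by hand. This is a minimal sketch on a toy matrix (the values are made up, not the banknote data), assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values, not the banknote data)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 60.0]])

# Standardize by hand: subtract the column mean, divide by the column std.
# np.std defaults to ddof=0 (population std), matching StandardScaler.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same result with scikit-learn (fit and transform in one call)
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
```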

### [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
Logistic regression (aka logit, MaxEnt) classifier. In this case, let us use the L2 penalty for regularization (argument: `penalty='l2'`). This is also the default, which is why the fitted model's printout below does not list it.

In [9]:
# init with your arguments
logit_clf = LogisticRegression(penalty='l2', random_state=1)

# Fit it!
logit_clf.fit(X_train, Y_train)

LogisticRegression(random_state=1)
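The strength of the L2 penalty is controlled by `C`, the inverse of the regularization strength: smaller `C` means a stronger penalty and smaller coefficients. The sketch below illustrates this on synthetic data from `make_classification`, used here as a stand-in for the banknote data. The values of `C` and the name `coef_norms` are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the banknote data (4 features, binary target)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # L2 is the default penalty, so only C is varied here
    clf = LogisticRegression(C=C, max_iter=1000, random_state=1)
    clf.fit(X, y)
    # Euclidean norm of the coefficient vector
    coef_norms[C] = np.linalg.norm(clf.coef_)

print(coef_norms)
```

The coefficient norm should grow as `C` grows, since a weaker penalty lets the model fit the training data more closely.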

What are the predicted probabilities of class `1` (forgery) that our logistic regression classifier gives for the first 10 training samples?

In [10]:
logit_clf.predict_proba(X_train)[:, 1][:10]

array([9.98531263e-01, 2.14047304e-03, 9.55819104e-01, 9.92970566e-01,
       2.06357021e-03, 3.24642334e-02, 3.70083125e-05, 4.80702178e-04,
       2.24702191e-04, 1.04187107e-02])
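For a binary problem, these probabilities are the sigmoid of a linear score built from the fitted intercept and coefficients. The sketch below checks this on synthetic data (a stand-in for the banknote data); `z`, `p_manual` and `p_sklearn` are names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the banknote data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
clf = LogisticRegression(random_state=1).fit(X, y)

# Linear score for each sample: z = intercept + x . coef
z = clf.intercept_ + X @ clf.coef_.ravel()

# The sigmoid maps the score to a probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = clf.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```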

What about the predicted classes?

In [11]:
logit_clf.predict(X_train)[:10]

array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])

And the accuracy on the training data?

In [12]:
logit_clf.score(X_train, Y_train)

0.9810495626822158
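For a classifier, `score` returns the fraction of samples whose predicted class matches the true class. A minimal sketch on synthetic data (a stand-in for the banknote data; `manual_accuracy` is a name introduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the banknote data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
clf = LogisticRegression(random_state=1).fit(X, y)

# Accuracy by hand: share of predictions equal to the true labels
manual_accuracy = np.mean(clf.predict(X) == y)

print(manual_accuracy, clf.score(X, y))
```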

How can we change the threshold from the default (0.5) to 0.9?

In [13]:
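# Probability of class 1 for every training sample
predictions = logit_clf.predict_proba(X_train)[:, 1]
# Label a sample as 1 only when its probability is at least 0.9, else 0
predictions = np.where(predictions >= 0.9, 1.0, 0.0)
predictions[:10]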

array([1., 0., 1., 1., 0., 0., 0., 0., 0., 0.])
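The thresholding step can be wrapped in a small function so that different thresholds are easy to compare. `predict_with_threshold` below is a hypothetical helper, not part of scikit-learn, and the data is a synthetic stand-in for the banknote data. With a threshold of 0.5 it should reproduce `predict` (barring a probability of exactly 0.5), and raising the threshold can only reduce the number of samples labelled `1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def predict_with_threshold(model, X, threshold=0.5):
    """Hypothetical helper: label a sample 1 when P(class 1) >= threshold."""
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int)


# Synthetic stand-in for the banknote data
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
clf = LogisticRegression(random_state=1).fit(X, y)

default_preds = predict_with_threshold(clf, X, threshold=0.5)
strict_preds = predict_with_threshold(clf, X, threshold=0.9)

# Number of samples labelled 1 under each threshold
print(default_preds.sum(), strict_preds.sum())
```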

See the model coefficients:

In [14]:
logit_clf.intercept_,logit_clf.coef_

(array([-1.56830481]),
 array([[-4.92846897, -5.0408545 , -4.61464127,  0.23687598]]))
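Because the features were standardized, each coefficient is the change in the log-odds of forgery for a one-standard-deviation increase in that feature, with the others held fixed. Exponentiating gives an odds multiplier, which is often easier to read. The sketch below reuses the coefficient values printed above; `feature_names` and `odds_ratios` are names introduced for illustration.

```python
import numpy as np

feature_names = ['variance', 'skewness', 'kurtosis', 'entropy']

# Coefficients copied from the output above
coef = np.array([-4.92846897, -5.0408545, -4.61464127, 0.23687598])

# exp(coef) is the factor by which the odds of forgery are multiplied
# when the feature increases by one standard deviation
odds_ratios = np.exp(coef)

for name, ratio in zip(feature_names, odds_ratios):
    print(f"{name}: odds multiplied by {ratio:.4f}")
```

A multiplier below 1 (here variance, skewness and kurtosis) means that higher values make forgery less likely according to the model; a multiplier above 1 (entropy) means the opposite.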

See the model hyperparameters (the constructor arguments, including defaults):

In [15]:
logit_clf.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 1,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
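`get_params` returns every constructor argument, while the short `LogisticRegression(random_state=1)` printout after fitting lists only the ones that differ from the defaults. The sketch below recovers that difference and then changes a hyperparameter with `set_params`. The names `defaults` and `changed`, and the value `C=0.1`, are chosen for illustration.

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=1)

# Hyperparameters of an estimator built with no arguments
defaults = LogisticRegression().get_params()

# Keep only the hyperparameters that differ from the defaults
changed = {name: value for name, value in clf.get_params().items()
           if value != defaults[name]}
print(changed)

# Update a hyperparameter in place; the model must be refit afterwards
clf.set_params(C=0.1)
print(clf.get_params()['C'])
```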