# Logistic Regression

Logistic regression is a supervised learning algorithm that classifies an object into one of two classes. Such a classifier is called a binary classifier.
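To make the idea concrete, here is a minimal sketch of how a logistic regression model could turn a feature vector into one of two classes. It assumes the weights are already known; the function names `sigmoid` and `predict_class` and the numeric values of `w`, `b` and `x` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    """Return (label, probability); label is 1 if the probability reaches the threshold."""
    probability = sigmoid(np.dot(w, x) + b)
    return int(probability >= threshold), probability

# Hypothetical weights, bias and input (not learned from any data)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

label, probability = predict_class(x, w, b)
print(label, round(float(probability), 3))
```

Training a model amounts to finding values of `w` and `b` that make these probabilities agree with the known labels as closely as possible.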

A good introductory reading is available at [deeplearning.stanford.edu](http://deeplearning.stanford.edu/tutorial/supervised/LogisticRegression/).

### Predicting cancer type using [LogisticRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

### Load and analyze data

In [1]:
import numpy as np
from sklearn import datasets
patients = datasets.load_breast_cancer()

In [2]:
print(patients.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [3]:
print(patients.data.shape)
print(patients.target.shape)

(569, 30)
(569,)


In [4]:
print("Second patient in database (index 1)")
print(patients['data'][1,:])

Second patient in database (index 1)
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
 7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
 5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
 2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
 2.750e-01 8.902e-02]


In [5]:
print(patients['target'][1])  # class label of the same patient (0 = malignant, 1 = benign)

0


In [6]:
print("---Mean---")
print(patients['data'].mean(axis=0))
print("---std---")
print(patients['data'].std(axis=0))

---Mean---
[1.41272917e+01 1.92896485e+01 9.19690334e+01 6.54889104e+02
 9.63602812e-02 1.04340984e-01 8.87993158e-02 4.89191459e-02
 1.81161863e-01 6.27976098e-02 4.05172056e-01 1.21685343e+00
 2.86605923e+00 4.03370791e+01 7.04097891e-03 2.54781388e-02
 3.18937163e-02 1.17961371e-02 2.05422988e-02 3.79490387e-03
 1.62691898e+01 2.56772232e+01 1.07261213e+02 8.80583128e+02
 1.32368594e-01 2.54265044e-01 2.72188483e-01 1.14606223e-01
 2.90075571e-01 8.39458172e-02]
---std---
[3.52095076e+00 4.29725464e+00 2.42776193e+01 3.51604754e+02
 1.40517641e-02 5.27663291e-02 7.96497253e-02 3.87687325e-02
 2.73901809e-02 7.05415588e-03 2.77068942e-01 5.51163427e-01
 2.02007710e+00 4.54510134e+01 2.99987837e-03 1.78924359e-02
 3.01595231e-02 6.16486075e-03 8.25910439e-03 2.64374475e-03
 4.82899258e+00 6.14085432e+00 3.35730016e+01 5.68856459e+02
 2.28123569e-02 1.57198171e-01 2.08440875e-01 6.56745545e-02
 6.18130785e-02 1.80453893e-02]


### Let's scale the data

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(patients.data)
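As a sanity check on what the scaling step does, the following sketch reproduces it by hand: each feature has its mean subtracted and is then divided by its standard deviation. The variable names `X`, `manual` and `scaled` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data

# StandardScaler uses the population standard deviation (ddof=0), like np.std
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point rounding
print(np.allclose(manual, scaled))
```

After this transformation every feature has a mean close to 0 and a standard deviation close to 1, so features measured on very different scales (for example area and smoothness) contribute comparably to the model.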

In [8]:
print("Second patient in database (index 1)")
print(scaled_data[1,:])
print('---Mean---')
print(scaled_data.mean(axis=0))
print('---std---')
print(scaled_data.std(axis=0))

Second patient in database (index 1)
[ 1.82982061e+00 -3.53632408e-01  1.68595471e+00  1.90870825e+00
 -8.26962447e-01 -4.87071673e-01 -2.38458552e-02  5.48144156e-01
  1.39236330e-03 -8.68652457e-01  4.99254601e-01 -8.76243603e-01
  2.63326966e-01  7.42401948e-01 -6.05350847e-01 -6.92926270e-01
 -4.40780058e-01  2.60162067e-01 -8.05450380e-01 -9.94437403e-02
  1.80592744e+00 -3.69203222e-01  1.53512599e+00  1.89048899e+00
 -3.75611957e-01 -4.30444219e-01 -1.46748968e-01  1.08708430e+00
 -2.43889668e-01  2.81189987e-01]
---Mean---
[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2

### Split data into training and test sets

In [9]:
from sklearn.model_selection import train_test_split
# Split the scaled data: 90% for training, 10% for testing
patients_train_data, patients_test_data, \
patients_train_target, patients_test_target = \
train_test_split(scaled_data, patients.target, test_size=0.1)

In [10]:
print("Training dataset:")
print("patients_train_data:", patients_train_data.shape)
print("patients_train_target:", patients_train_target.shape)

Training dataset:
patients_train_data: (512, 30)
patients_train_target: (512,)


In [11]:
print("Testing dataset:")
print("patients_test_data:", patients_test_data.shape)
print("patients_test_target:", patients_test_target.shape)

Testing dataset:
patients_test_data: (57, 30)
patients_test_target: (57,)
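The notebook scales the whole dataset before splitting it, which lets the test rows influence the scaling statistics. A common alternative, sketched here as one possible variant rather than the notebook's own method, is to split first and fit the scaler on the training rows only. The names `X_train`, `X_test`, `y_train`, `y_test` and the choices `random_state=0` and `stratify=...` are assumptions made for reproducibility and balanced classes.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

patients = datasets.load_breast_cancer()

# Split the raw data first; stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    patients.data, patients.target,
    test_size=0.1, random_state=0, stratify=patients.target,
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering, the test set stays unseen until evaluation, so the reported accuracy is a more honest estimate of performance on new patients.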


### Initialize and train the model on training data

In [12]:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(patients_train_data, patients_train_target)

### Using classifier

In [13]:
patient_id = 7  # avoid shadowing the built-in id()
prediction = logistic_regression.predict(patients_test_data[patient_id,:].reshape(1,-1))
print("Model predicted for patient {0} value {1}".format(patient_id, prediction))

print("Real value for patient \"{0}\" is {1}".format(patient_id, patients_test_target[patient_id]))

Model predicted for patient 7 value [1]
Real value for patient "7" is 1


In [14]:
# Class probabilities for the same patient: [P(class 0), P(class 1)]
prediction_probability = logistic_regression.predict_proba(patients_test_data[patient_id,:].reshape(1,-1))
print(prediction_probability)

[[5.90850652e-04 9.99409149e-01]]
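The probabilities returned by `predict_proba` are closely tied to the linear score of the model. The sketch below, which assumes a binary `LogisticRegression` with default settings, checks that the probability of class 1 equals the sigmoid of `decision_function`, and that `predict` corresponds to checking whether that score is positive. The model is refit on the full scaled dataset here only to keep the example self-contained; the names `model`, `scores` and `probabilities` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

patients = datasets.load_breast_cancer()
X = StandardScaler().fit_transform(patients.data)
model = LogisticRegression().fit(X, patients.target)

scores = model.decision_function(X[:5])          # linear score w . x + b for each row
probabilities = 1.0 / (1.0 + np.exp(-scores))    # sigmoid of the score

# Probability of class 1 should match the second column of predict_proba
print(np.allclose(probabilities, model.predict_proba(X[:5])[:, 1]))
# A positive score corresponds to predicting class 1
print(np.array_equal(model.predict(X[:5]), (scores > 0).astype(int)))
```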


### Evaluate a classifier using the test data

In [15]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(patients_test_target, logistic_regression.predict(patients_test_data))
print("Model accuracy is {0:0.2f}".format(acc))

Model accuracy is 0.98


### Evaluating the classifier using a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

In [16]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(patients_test_target, logistic_regression.predict(patients_test_data))
print(conf_matrix)

[[21  1]
 [ 0 35]]
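In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted ones. The following sketch takes the matrix printed above and derives accuracy, precision and recall from it, treating class 1 (benign in this dataset) as the "positive" class. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = np.array([[21, 1],
                        [0, 35]])

# For a binary problem the flattened order is: TN, FP, FN, TP (class 1 = positive)
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()   # share of all correct predictions
precision = tp / (tp + fp)                 # how many predicted positives are truly positive
recall = tp / (tp + fn)                    # how many true positives were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.982 precision=0.972 recall=1.000
```

The single off-diagonal entry is a patient whose true class is 0 (malignant) but who was predicted as class 1 (benign). The accuracy of 56/57 agrees with the value of 0.98 reported earlier.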


## [Multiclass classification](https://scikit-learn.org/stable/modules/multiclass.html) using [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier)

In [17]:
from sklearn.multiclass import OneVsRestClassifier

In [18]:
# load_wine is available since scikit-learn version 0.18
multiple_classes_data = datasets.load_wine()

In [19]:
#print(multiple_classes_data['DESCR'])

In [20]:
print("Classes in data:", np.unique(multiple_classes_data.target))

Classes in data: [0 1 2]


In [21]:
wine_train_data, wine_test_data, \
wine_train_target, wine_test_target = \
train_test_split(multiple_classes_data.data, multiple_classes_data.target, test_size=0.1)

In [22]:
#initiate classifier
multiclass_classifier = OneVsRestClassifier(LogisticRegression())

#fit classifier
multiclass_classifier.fit(wine_train_data, wine_train_target);

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [23]:
# check classifier for some object
wine_id = 17  # avoid shadowing the built-in id()
prediction = multiclass_classifier.predict(wine_test_data[wine_id,:].reshape(1,-1))
print("Multiclass model predicted for wine {0} class {1}".format(wine_id, prediction))

print("Real class for wine \"{0}\" is {1}".format(wine_id, wine_test_target[wine_id]))

Multiclass model predicted for wine 17 class [1]
Real class for wine "17" is 1


In [24]:
conf_matrix = confusion_matrix(wine_test_target, multiclass_classifier.predict(wine_test_data))
print(conf_matrix)

[[7 1 0]
 [0 8 0]
 [0 0 2]]
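The convergence warnings shown during fitting suggest scaling the data, which the wine example skips. One possible way to follow that advice, sketched here under the assumption that standardisation is an acceptable preprocessing step for this dataset, is to wrap the scaler and the one-vs-rest classifier in a pipeline. The names `wine`, `scaled_ovr` and `wine_accuracy`, as well as `random_state=0` and `stratify=...`, are illustrative choices.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.1, random_state=0, stratify=wine.target
)

# The pipeline fits the scaler on the training data only and
# reuses the same scaling when predicting on the test data
scaled_ovr = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression()),
)
scaled_ovr.fit(X_train, y_train)

wine_accuracy = accuracy_score(y_test, scaled_ovr.predict(X_test))
print(f"Test accuracy: {wine_accuracy:.2f}")
```

`OneVsRestClassifier` trains one binary logistic regression per class, so three models are fitted here, which is also why the warning appears once per class in the unscaled run.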


# Student task 

Using the data in the file `credit_clients.xls`, train a logistic regression model to predict whether a client will default on their payment next month. Evaluate the model with the accuracy score and a confusion matrix.
Check the model with __cross-validation__.

More info about data can be found [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

Hint: Consider standardisation of the data!
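Cross-validation has not been demonstrated earlier in this notebook, so here is a minimal sketch of the technique on the breast cancer dataset rather than on the credit data. It assumes 5 folds and accuracy as the metric; both are illustrative choices, as are the names `model` and `scores`. Placing the scaler inside a pipeline means it is refit on the training part of every fold.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

patients = datasets.load_breast_cancer()

# Scaling and classification are combined so that each fold is scaled independently
model = make_pipeline(StandardScaler(), LogisticRegression())

# Train and evaluate the model 5 times, each time holding out a different fifth of the data
scores = cross_val_score(model, patients.data, patients.target, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy: {0:0.3f} (std {1:0.3f})".format(scores.mean(), scores.std()))
```

Compared with a single train/test split, the mean over several folds is less sensitive to which rows happen to land in the test set.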

In [25]:
import pandas as pd
import numpy as np

# Reading .xls files requires an Excel engine such as xlrd to be installed
df = pd.read_excel('credit_clients.xls')

In [26]:
df.head()

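| | Unnamed: 0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
| 1 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |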


In [27]:
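# Row 0 holds the descriptive column names, so the actual records start at row 1.
# Note: the first selected column is the client ID, an identifier rather than a predictive feature.
data = df.iloc[1:,0:-1]
target = df.iloc[1:,-1]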

In [28]:
data.head()

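| | Unnamed: 0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 |
| 2 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 |
| 3 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 |
| 4 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 |
| 5 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 |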


In [29]:
# int64 is used because amounts such as 120000 do not fit into int16 (maximum 32767)
data_np = np.array(data, dtype=np.int64)
target_np = np.array(target, dtype=np.int64)

print(type(data_np))
print(type(target_np))

print(data_np.shape)
print(target_np.shape)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(30000, 24)
(30000,)
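Because the descriptive column names sit in row 0 of the loaded frame, positional slicing works but hides what each column means. The sketch below shows one possible way to promote that row to column names and to drop the ID column. It uses a small toy frame that mimics the layout above, with values copied from the first three clients and only three of the columns; the names `raw`, `clients`, `features` and `target` are illustrative.

```python
import pandas as pd

# Toy frame mimicking the loaded layout: generic labels on top,
# descriptive names stored in row 0, actual records from row 1 on
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "default payment next month"],
     [1, 20000, 1],
     [2, 120000, 1],
     [3, 90000, 0]],
    columns=["Unnamed: 0", "X1", "Y"],
)

clients = raw.iloc[1:].copy()
clients.columns = raw.iloc[0].tolist()   # promote row 0 to column names
clients = clients.astype("int64")        # object columns become integer columns

# The ID is an identifier, not a feature; the last column is the label
features = clients.drop(columns=["ID", "default payment next month"])
target = clients["default payment next month"]

print(features.shape, target.tolist())
```

With named columns it is easier to verify that only meaningful features reach the model and that the label column has not leaked into the inputs.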


In [30]:
...

Ellipsis