# Logistic Regression

Logistic regression is a supervised learning algorithm that classify an object into one of two classes. That type of classifier is called a binary classifier.

Good introducion reading at [deeplearning.stanford.edu](http://deeplearning.stanford.edu/tutorial/supervised/LogisticRegression/)

### Predicting cancer type using [LinearRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

#### Load and analyze data

In [9]:
import numpy as np
from sklearn import datasets
patients = datasets.load_breast_cancer()

In [10]:
print(patients.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [11]:
print(patients.data.shape)
print(patients.target.shape)

(569, 30)
(569,)


In [12]:
print("First patient in database")
print(patients['data'][0,:])

First patient in database
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


In [13]:
print(patients['target'][0])

0


#### Split data in train and test batches

In [14]:
from sklearn.model_selection import train_test_split

patients_train_data, patients_test_data, \
patients_train_target, patients_test_target = \
train_test_split(patients.data,patients.target, test_size=0.1, random_state=101)

In [15]:
print("Training dataset:")
print("patients_train_data:", patients_train_data.shape)
print("patients_train_target:", patients_train_target.shape)

Training dataset:
patients_train_data: (512, 30)
patients_train_target: (512,)


In [16]:
print("Testing dataset:")
print("patients_test_data:", patients_test_data.shape)
print("patients_test_target:", patients_test_target.shape)

Testing dataset:
patients_test_data: (57, 30)
patients_test_target: (57,)


#### Initiate and train model on training data

In [19]:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(patients_train_data, patients_train_target)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

#### Using classifier

In [23]:
id=6
prediction = logistic_regression.predict(patients_test_data[id,:].reshape(1,-1))
print("Model predicted for patient {0} value {1}".format(id, prediction))

print("Real value for patient \"{0}\" is {1}".format(id, patients_test_target[id]))

Model predicted for patient 6 value [1]
Real value for patient "6" is 1


In [25]:
prediction_probability = logistic_regression.predict_proba(patients_test_data[id,:].reshape(1,-1))
print(prediction_probability)

[[0.00290373 0.99709627]]


### Evaluate a classifier using the test data

In [26]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(patients_test_target, logistic_regression.predict(patients_test_data))
print("Model accuracy is {0:0.2f}".format(acc))

Model accuracy is 0.93


#### Evaluating classifier using [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

In [28]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(patients_test_target, logistic_regression.predict(patients_test_data))
print(conf_matrix)

[[19  3]
 [ 1 34]]


### [Multiclass classification](https://scikit-learn.org/stable/modules/multiclass.html) using [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier)

In [29]:
from sklearn.multiclass import OneVsRestClassifier

In [30]:
#New in version 0.18
multiple_classes_data = datasets.load_wine()

In [31]:
#print(multiple_classes_data['DESCR'])

In [32]:
print("Classes in data:", np.unique(multiple_classes_data.target))

Classes in data: [0 1 2]


In [33]:
wine_train_data, wine_test_data, \
wine_train_target, wine_test_target = \
train_test_split(multiple_classes_data.data, multiple_classes_data.target, test_size=0.1)

In [34]:
#initiate classifier
multiclass_classifier = OneVsRestClassifier(LogisticRegression())

#fit classifier
multiclass_classifier.fit(wine_train_data, wine_train_target);



In [35]:
# check classifier for some object
id=17
prediction = multiclass_classifier.predict(wine_test_data[id,:].reshape(1,-1))
print("Multiclass model predicted for wine {0} class {1}".format(id, prediction))

print("Real class for wine \"{0}\" is {1}".format(id, wine_test_target[id]))

Multiclass model predicted for wine 17 class [0]
Real class for wine "17" is 0


In [36]:
conf_matrix = confusion_matrix(wine_test_target, multiclass_classifier.predict(wine_test_data))
print(conf_matrix)

[[6 0 0]
 [0 7 1]
 [0 0 4]]


# Student task 

Using data in a file `credit_clients.xls` train the logistic regression model to predict whether a client will be given a credit. Evaluate model with accuracy score and confusion matrix.

More info about data can be found [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

Hint: Consider standardisation the data!

In [21]:
import pandas as pd
import numpy as np

from pandas import ExcelWriter
from pandas import ExcelFile
 
df = pd.read_excel('credit_clients.xls')

In [22]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [23]:
data = df.iloc[1:,1:-1]
target = df.iloc[1:,-1]

In [24]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
1,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
2,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
3,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
4,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
5,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [25]:
data_ = np.array(data, dtype=np.int16)
target_ = np.array(target, dtype=np.int16)

In [37]:
no_standar_data_ = data_
no_standar_target_ = target_

#Krok 1 - standaryzacja danych
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
credit_data_scaled = scaler.fit_transform(data_)

#Krok 2 - podzial danych na testujace i uczace
credit_train_data, credit_test_data, credit_train_target, credit_test_target = \
train_test_split(credit_data_scaled, target_, test_size=0.1, random_state=101)

print("Zbiór danych uczących")
print("credit_train_data:", credit_train_data.shape)
print("credit_train_target:", credit_train_target.shape)

print("Zbór danych testujących")
print("credit_test_data:", credit_test_data.shape)
print("credit_test_target:", credit_test_target.shape)

#Krok 3 - uczymy nasz model
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(credit_train_data, credit_train_target)

#Krok 4 - sprawdzamy skuteczność

from sklearn.metrics import accuracy_score
acc = accuracy_score(credit_test_target, logistic_regression.predict(credit_test_data))
print("Model accuracy is {0:0.2f}".format(acc))

from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(credit_test_target, logistic_regression.predict(credit_test_data))
print(conf_matrix)

#Krok 5 - confusion matrix

from sklearn.multiclass import OneVsRestClassifier
wine_train_data, wine_test_data, \
wine_train_target, wine_test_target = \
train_test_split(credit_data_scaled, target_, test_size=0.1)
multiclass_classifier = OneVsRestClassifier(LogisticRegression())
multiclass_classifier.fit(wine_train_data, wine_train_target);
conf_matrix = confusion_matrix(wine_test_target, multiclass_classifier.predict(wine_test_data))
print(conf_matrix)

# BEZ STANDARYZACJI

#Krok 2 - podzial danych na testujace i uczace
credit_train_data, credit_test_data, credit_train_target, credit_test_target = \
train_test_split(no_standar_data_, no_standar_target_, test_size=0.1, random_state=101)

print("Zbiór danych uczących")
print("credit_train_data:", credit_train_data.shape)
print("credit_train_target:", credit_train_target.shape)

print("Zbór danych testujących")
print("credit_test_data:", credit_test_data.shape)
print("credit_test_target:", credit_test_target.shape)

#Krok 3 - uczymy nasz model
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(credit_train_data, credit_train_target)

#Krok 4 - sprawdzamy skuteczność

from sklearn.metrics import accuracy_score
acc = accuracy_score(credit_test_target, logistic_regression.predict(credit_test_data))
print("Model accuracy is {0:0.2f}".format(acc))

from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(credit_test_target, logistic_regression.predict(credit_test_data))
print(conf_matrix)

#Krok 5 - confusion matrix

from sklearn.multiclass import OneVsRestClassifier
wine_train_data, wine_test_data, \
wine_train_target, wine_test_target = \
train_test_split(no_standar_data_, no_standar_target_, test_size=0.1)
multiclass_classifier = OneVsRestClassifier(LogisticRegression())
multiclass_classifier.fit(wine_train_data, wine_train_target);
conf_matrix = confusion_matrix(wine_test_target, multiclass_classifier.predict(wine_test_data))
print(conf_matrix)

Zbiór danych uczących
credit_train_data: (27000, 23)
credit_train_target: (27000,)
Zbór danych testujących
credit_test_data: (3000, 23)
credit_test_target: (3000,)
Model accuracy is 0.82
[[2312   45]
 [ 501  142]]




[[2280   88]
 [ 489  143]]




Zbiór danych uczących
credit_train_data: (27000, 23)
credit_train_target: (27000,)
Zbór danych testujących
credit_test_data: (3000, 23)
credit_test_target: (3000,)
Model accuracy is 0.82
[[2308   49]
 [ 495  148]]




[[2280   74]
 [ 497  149]]
