<b><font size="6">Logistic Regression</font><a class="anchor"></a><a id='toc'></a></b><br>

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression passes its linear output through the logistic sigmoid function to return a probability, which can then be mapped to two or more discrete classes. The logistic function maps any real value to a value between 0 and 1. The model is represented by:

$$
\hat{y} = \sigma(w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n)
$$

where $\sigma$ is the logistic sigmoid function given by:

$$
\sigma(t) = \frac{1}{1 + e^{-t}}
$$

In binary classification problems, the logistic regression model calculates the probability of an event occurring. If the calculated probability is greater than 0.5, the observation is assigned to class 1. If it is less than 0.5, the observation is assigned to class 0. In this way, the output of logistic regression can be read as the probability of a certain event given the values of the independent variables.

__`Step 1`__ - Import pandas and load the data

In [1]:
import pandas as pd
tugas = pd.read_csv('datasets/final_tugas.csv')
tugas

Unnamed: 0,Custid,Year_Birth,Dependents,Income,Rcn,Frq,Mnt,Clothes,Kitchen,SmallAppliances,...,DepVar,Gender_M,Education_Basic,Education_Graduation,Education_Master,Education_PhD,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Together,Marital_Status_Widow
0,1003,1991,1,29761.20,69,11,45.76,32,19,24,...,0,1,0,1,0,0,0,1,0,0
1,1004,1956,1,98249.55,10,26,923.52,60,10,19,...,0,1,0,0,1,0,0,1,0,0
2,1006,1983,1,23505.30,65,14,58.24,47,2,48,...,0,0,0,0,0,1,0,0,1,0
3,1007,1970,1,72959.25,73,18,358.80,71,7,13,...,0,0,0,1,0,0,0,0,0,0
4,1009,1941,0,114973.95,75,30,1457.04,38,9,35,...,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2495,10989,1996,1,29551.20,41,10,47.84,11,40,24,...,0,0,1,0,0,0,0,0,0,0
2496,10991,1940,0,132566.70,36,46,2320.24,32,4,47,...,0,0,0,1,0,0,0,1,0,0
2497,10993,1955,0,91768.95,1,25,870.48,56,8,27,...,0,0,0,1,0,0,0,0,1,0
2498,10994,1961,1,99085.35,1,28,931.84,68,5,21,...,0,0,1,0,0,0,0,1,0,0


In [2]:
tugas.DepVar.value_counts()

DepVar
0    2325
1     175
Name: count, dtype: int64

__`Step 2`__ - Data partition
- Assign all the variables excluding the DepVar to the object `data`
- Assign the dependent variable to the object `target`
- Import the needed library to make the partition of the dataset
- Split the data and the target into `X_train`, `X_test`, `y_train` and `y_test`, with `test_size` equal to 0.2, `random_state` equal to 5 and `stratify` equal to `target`

In [3]:
data = tugas.drop(['DepVar'], axis=1)
target = tugas['DepVar']

In [4]:
#make the split here
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=5, stratify=target)
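
To see what `stratify` does, here is a small sketch on synthetic labels that have the same class counts as `DepVar` (2325 zeros and 175 ones). The feature matrix is random noise and the names `X_demo` and `y_demo` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as DepVar, random features.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 2325 + [1] * 175)
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

# With stratify, both partitions keep the same share of class 1.
print(y_tr.mean(), y_te.mean())
print(np.bincount(y_tr), np.bincount(y_te))
```

Without `stratify`, the number of class 1 observations in the test set would vary with the random seed, which matters when the positive class is this rare.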

__`Step 3`__ - Import the model and create an instance

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression(fit_intercept=True,...)</a>

__Definition:__ <br>
Applies Logistic Regression classifier.

__Parameters:__ <br>
*fit_intercept*: whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations; <br>
...
</div>

In [6]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

__`Step 4`__ - Fit the model to the train data

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().fit(X,y,...)</a>

__Definition:__ <br>
Fit the logistic model to the training data.

__Parameters:__ <br>
X : The regressors (features) of the training dataset; <br>
y : The target of the training dataset; <br>
...
</div>

In [7]:
log_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
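
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, assuming a synthetic dataset in place of the notebook's file, could look like this. The dataset, the inflated first column and the names `X_demo`, `y_demo` and `scaled_model` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption: about 7% positives).
X_demo, y_demo = make_classification(
    n_samples=2500, n_features=10, weights=[0.93], random_state=5
)
# Inflate one feature's scale to mimic a column measured in large units.
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied before the model.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"solver iterations used: {n_iter}")
print(f"test accuracy: {scaled_model.score(X_te, y_te):.3f}")
```

Applying the same idea to `X_train` would change the fitted coefficients, so the outputs shown in the rest of this notebook correspond to the unscaled fit.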


__`Step 5`__ - Use the model to predict the labels of the test data. Assign them to **y_pred**.

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().predict(X)</a>

__Definition:__ <br>
Predict class labels for samples in X.

__Parameters:__ <br>
X : Samples to predict; <br>
...

</div>

In [8]:
y_pred = log_model.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

***Note:*** You can get the estimated probabilities of each class for every sample, instead of the assigned class, using the method `predict_proba()`.

In [9]:
pred_prob = log_model.predict_proba(X_test)
pred_prob

array([[9.95812975e-01, 4.18702479e-03],
       [9.96600365e-01, 3.39963464e-03],
       [9.84339615e-01, 1.56603847e-02],
       [3.55099519e-01, 6.44900481e-01],
       [9.94975402e-01, 5.02459751e-03],
       [9.88851007e-01, 1.11489933e-02],
       [9.99239016e-01, 7.60984179e-04],
       [9.92480285e-01, 7.51971501e-03],
       [6.78270998e-01, 3.21729002e-01],
       [9.85948364e-01, 1.40516358e-02],
       [9.29017609e-01, 7.09823913e-02],
       [9.99009040e-01, 9.90959762e-04],
       [9.88435782e-01, 1.15642181e-02],
       [9.97977746e-01, 2.02225382e-03],
       [9.99015701e-01, 9.84298914e-04],
       [9.97624324e-01, 2.37567611e-03],
       [9.90221692e-01, 9.77830765e-03],
       [9.94959609e-01, 5.04039114e-03],
       [8.79297757e-01, 1.20702243e-01],
       [9.61229402e-01, 3.87705982e-02],
       [9.98068072e-01, 1.93192790e-03],
       [9.92849318e-01, 7.15068237e-03],
       [2.11353066e-01, 7.88646934e-01],
       [9.83470702e-01, 1.65292975e-02],
       [4.661811
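
The link between `predict_proba()` and `predict()` can be checked on a small synthetic problem. This is a sketch under assumed data: `X_demo`, `y_demo` and `demo_model` are hypothetical names, and the 0.3 threshold is an arbitrary value used to show the effect of moving the cut-off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only so the snippet runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)   # shape (n_samples, 2)

# Columns follow demo_model.classes_, here [0, 1]; each row sums to 1.
print(demo_model.classes_)
print(proba[:3])

# Thresholding P(class 1) at 0.5 reproduces predict() ...
default_labels = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(default_labels, demo_model.predict(X_demo)))

# ... while a lower, hypothetical threshold flags more observations as class 1.
lenient_labels = (proba[:, 1] > 0.3).astype(int)
print(default_labels.sum(), lenient_labels.sum())
```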

***Note:*** In the same way as for linear regression, you can get the coefficients and the intercept through the attributes `coef_` and `intercept_`.

In [10]:
log_model.coef_

array([[ 2.77693965e-05, -3.83120988e-03, -1.06686560e-04,
         1.50884240e-05, -4.11164482e-04,  1.26769135e-03,
         2.43708857e-03,  1.77732538e-02, -2.45855522e-03,
        -9.39248361e-03, -3.54536211e-03, -2.69711604e-03,
         5.35314725e-03, -5.76795555e-03,  1.44287924e-04,
        -5.28020194e-05,  7.41479986e-05, -6.27865721e-05,
         2.15574183e-05, -1.52744011e-04,  1.09980171e-04,
        -1.25473674e-04,  6.45288797e-06,  5.65712375e-06]])
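
As a sketch of how these attributes relate to the model equation, the snippet below rebuilds the class 1 probability by hand from `coef_` and `intercept_` and compares it with `predict_proba()`. It uses synthetic data and the hypothetical names `X_demo`, `y_demo` and `demo_model`, not the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

w = demo_model.coef_[0]          # one weight per feature, shape (n_features,)
w0 = demo_model.intercept_[0]    # the bias term w_0

# Rebuild P(class 1) by hand: sigma(w_0 + w_1 x_1 + ... + w_n x_n)
manual_p1 = 1.0 / (1.0 + np.exp(-(w0 + X_demo @ w)))
print(np.allclose(manual_p1, demo_model.predict_proba(X_demo)[:, 1]))

# exp(w_j) is the multiplicative change in the odds of class 1
# for a one-unit increase in feature j, holding the others fixed.
print(np.exp(w))
```

Because each weight is expressed per unit of its feature, coefficients of features on very different scales are not directly comparable in size.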

__`Step 6`__ - Evaluate the model

***Note:*** Since we are predicting a categorical target (classification), we evaluate the model with different metrics from those used for a regression problem. Also, for logistic regression the R-squared cannot be obtained in the same way as in the linear case.

### The confusion matrix

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the confusion matrix to evaluate the accuracy of a classification.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [11]:
from sklearn.metrics import confusion_matrix

In [12]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[455,  10],
       [ 29,   6]], dtype=int64)

The confusion matrix in sklearn is presented in the following format, where rows are the actual classes and columns are the predicted classes: <br>
[ [ TN  FP  ] <br>
    [ FN  TP ] ]
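
A short sketch of how the four cells can be pulled out of the matrix, using the counts printed above. The array is retyped by hand here so the snippet runs on its own.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[455, 10],
               [29, 6]])

# For binary labels, ravel() flattens row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Row sums are the actual class sizes, column sums the predicted ones.
print(cm.sum(axis=1))   # actual 0s, actual 1s
print(cm.sum(axis=0))   # predicted 0s, predicted 1s
```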

### The accuracy score
<img src="img/accuracy.png" alt="Drawing" style="width: 300px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

__Definition:__ <br>
Accuracy classification score.

__Interpretation:__ <br>
If normalize is True, then the best performance is 1. When normalize = False, then the best performance is the number of samples.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
_normalize_: If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples; <br>
...
</div>

In [13]:
from sklearn.metrics import accuracy_score

In [14]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.922
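
The same value can be recomputed from the confusion matrix counts. The sketch below also computes, as a point of comparison, the accuracy of a rule that always predicts the majority class 0. That baseline is an addition for illustration and is not part of the original workflow.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 455, 10, 29, 6
total = tn + fp + fn + tp

# Accuracy: share of observations classified correctly.
accuracy = (tp + tn) / total

# Baseline: labelling every observation as class 0 gets all actual 0s
# right and all actual 1s wrong.
baseline_accuracy = (tn + fp) / total

print(accuracy)           # 0.922
print(baseline_accuracy)  # 0.93
```

With a target this imbalanced, accuracy alone says little about how well class 1 is detected, which is why the metrics that follow focus on the positive class.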

### The precision
<img src="img/precision.png" alt="Drawing" style="width: 200px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the precision.

__Interpretation:__ <br>
The best value is 1, and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [15]:
from sklearn.metrics import precision_score

In [16]:
precision = precision_score(y_test, y_pred)
precision

0.375

### The recall
<img src="img/recall.png" alt="Drawing" style="width: 180px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the recall.

__Interpretation:__ <br>
The best value is 1 and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [17]:
from sklearn.metrics import recall_score

In [18]:
recall_score(y_test, y_pred)

0.17142857142857143
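
Both precision and recall can be recomputed by hand from the confusion matrix counts reported earlier, which makes the difference between their denominators explicit.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 455, 10, 29, 6

# Precision: of the observations predicted as class 1, how many really are class 1?
precision = tp / (tp + fp)

# Recall: of the observations that really are class 1, how many were found?
recall = tp / (tp + fn)

print(precision)  # 0.375
print(recall)     # 0.17142857142857143
```

Here the model finds only 6 of the 35 actual positives, and 10 of its 16 positive predictions are wrong.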

### The F1 Score
<img src="img/f1.png" alt="Drawing" style="width: 270px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the F1 score, also known as balanced F-score or F-measure.

__Interpretation:__ <br>
F1 score reaches its best value at 1 and worst score at 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [19]:
from sklearn.metrics import f1_score

In [20]:
f1 = f1_score(y_test, y_pred)
f1

0.23529411764705882
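
The F1 score is the harmonic mean of precision and recall. As a cross-check, the sketch below computes it in two equivalent ways from the confusion matrix counts and compares the result with `f1_score` on label vectors rebuilt from those counts. The vectors `y_true` and `y_hat` are reconstructions for illustration, not the notebook's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 455, 10, 29, 6

# F1 as the harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts.
f1_counts = 2 * tp / (2 * tp + fp + fn)

# Rebuild label vectors with the same counts to cross-check against scikit-learn.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

print(f1_manual, f1_counts, f1_score(y_true, y_hat))
```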