# Logistic Regression Step by Step

The following script walks through the Logistic Regression technique step by step and checks whether the outputs match the results from the [statsmodels](https://www.statsmodels.org/) library. The outputs are also compared with those of the BinomialLogisticRegression function that I wrote.



## Dataset Description

The "customer_fidelity" dataset contains real information on 3000 customers of a retail group. The dependent variable is 'fidelity', which indicates whether the customer returned to make purchases at the supermarket within a given period of time. The predictor variables are the customer's sex and age, plus four qualitative variables (service, assortment, accessibility, and price) rated on a Likert scale. The goal of the script is to build a Logistic Regression model that predicts whether a customer returns to make purchases at the establishment.

## Implementation

- Check the dataset.


- Get the qualitative variables *dummies*.


- Define the model equation using the maximum likelihood method.


- Check the statistical significance of each independent variable individually (and drop columns that are not significant).


- Calculate the main metrics (Log-likelihood, AIC, BIC, Confusion Matrix).


- Verify that the results match those from the [statsmodels](https://www.statsmodels.org/) library.


- Verify that the results match those from the .... function.

***

## Checking the Dataset

In [98]:
# Libraries needed
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import statsmodels.formula.api as smf
from stepwise_process.statsmodels import stepwise

sns.set()
warnings.filterwarnings("ignore")

In [99]:
# Load and check the data
df_fid = pd.read_csv('Data/customer_fidelity.csv', index_col=0)
df_fid

Unnamed: 0,fidelity,sex,age,service,assortment,accessibility,price
0,n,f,34,2,2,1,1
1,n,f,34,2,2,1,1
2,n,m,34,3,2,4,2
3,n,f,34,4,3,3,3
4,n,f,34,4,3,1,4
...,...,...,...,...,...,...,...
2995,y,m,34,4,4,1,3
2996,y,f,34,4,5,4,2
2997,y,m,36,4,4,3,3
2998,y,f,35,4,4,5,4


In [100]:
# Checking NaN values
df_fid.isna().sum()

fidelity         0
sex              0
age              0
service          0
assortment       0
accessibility    0
price            0
dtype: int64

There are no missing values in the dataset.

In [101]:
# Checking the column types
df_fid.dtypes

fidelity         object
sex              object
age               int64
service           int64
assortment        int64
accessibility     int64
price             int64
dtype: object

## Getting Dummies

The qualitative columns need to be converted into numerical variables. Some of them hold numbers, but those numbers are only labels for qualitative values. Each qualitative variable therefore generates n-1 binary (dummy) variables, where n is the number of categories of that variable; the first category is dropped and acts as the reference level.

For example, the 'service' column had values of 1, 2, 3, 4, and 5. Therefore, the 'service_2', 'service_3', 'service_4', and 'service_5' columns were created, so that for each observation, the column corresponding to the value of the original variable will be filled with 1, while the others will be filled with 0.
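
As a small illustration of this encoding, the sketch below applies `pd.get_dummies` to a made-up `service` column (the toy values are an assumption, not rows from the dataset). With `drop_first=True` the first category has no column of its own, so an observation with `service = 1` gets zeros everywhere.

```python
import pandas as pd

# Toy column with Likert labels 1, 2, 3 and 5 (illustrative values only)
toy = pd.DataFrame({"service": [1, 2, 3, 5, 2]})

# drop_first=True removes the column of the first category (service_1),
# which becomes the reference level; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(toy["service"], prefix="service", drop_first=True, dtype=int)

print(dummies)
```

Note that only categories present in the data produce a column: the toy column has no value 4, so no `service_4` column is created here.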

In [102]:
# Dummies
quali_variables = ['sex', 'service', 'assortment', 'accessibility', 'price', 'fidelity']
for var in quali_variables:
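    # dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans)
    dummies = pd.get_dummies(df_fid[var], drop_first=True, prefix=var, dtype=int)
    df_fid = df_fid.merge(dummies, how='outer', left_index=True, right_index=True)
    df_fid = df_fid.drop(columns=var)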
df_fid

Unnamed: 0,age,sex_m,service_2,service_3,service_4,service_5,assortment_2,assortment_3,assortment_4,assortment_5,accessibility_2,accessibility_3,accessibility_4,accessibility_5,price_2,price_3,price_4,price_5,fidelity_y
0,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,34,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0
3,34,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
4,34,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,34,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1
2996,34,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1
2997,36,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1
2998,35,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1


## Define the model equation using the maximum likelihood method

The probability of the event occurring (in this case, the probability that the customer returns to make purchases at the supermarket) is given by:

$p1 = \LARGE \frac{1}{1 + e^{-({\alpha} + {\beta}_1 x_{1} + {\beta}_2 x_{2} + ... + {\beta}_k x_{k})}}$

Where:

$x_{1}, x_{2},..., x_{k}$ = explanatory variables;

${\alpha}$ = intercept (constant term);

${\beta}_1$, ${\beta}_2$,..., ${\beta}_k$ = explanatory variable coefficients;

The unknown parameters ${\alpha}$ and ${\beta}_1$, ${\beta}_2$,..., ${\beta}_k$ are usually estimated by maximum likelihood. The method seeks the parameter values that maximize the probability of the observed data under the assumed model (in this case, a Bernoulli distribution for each observation). The first step is to initialize the parameter vector theta with zeros and to set a tolerance level for convergence.

The vector has k + 1 elements: one for each of the k explanatory variables plus one for the intercept.
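
To make the quantity being maximized concrete, here is a minimal sketch of the Bernoulli log-likelihood as a function of theta. The function name `log_likelihood` and the four-row toy data are assumptions for illustration; they are not part of the original script.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood of a logistic model (X already has the intercept column)."""
    z = X @ theta
    # log(p) = -log(1 + exp(-z)) and log(1 - p) = -log(1 + exp(z));
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow
    return np.sum(-y * np.logaddexp(0, -z) - (1 - y) * np.logaddexp(0, z))

# Toy design matrix: a column of ones (intercept) and one explanatory variable
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])

ll_zero = log_likelihood(np.zeros(2), X, y)             # every p is 0.5 when theta = 0
ll_better = log_likelihood(np.array([-1.5, 1.0]), X, y)  # a theta that fits the toy data better

print(ll_zero, ll_better)
```

With theta initialized to zeros every predicted probability is 0.5, so the starting log-likelihood is simply N * log(0.5); the iterations then move theta towards higher values of this function.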

In [107]:
# Initializing theta and defining tolerance
theta = np.zeros(df_fid.shape[1])
tolerance = 0.0001

In [108]:
# Logistic Function 
def logistic(x):
    return 1 / (1 + np.exp(-x))

The optimal parameter vector 'theta' is found with an iterative loop: the Newton-Raphson method updates theta until the change between two iterations falls below the convergence threshold set by the tolerance.

Each iteration computes the predicted probabilities by applying the logistic function to the matrix product of the predictor variables and the current theta, and then the error between the predicted and the actual values. The algorithm also builds two matrices: W, a diagonal matrix with the weights p(1 - p), and H, the Hessian matrix X'WX, which scales the gradient in the update step.
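
Because the update uses both the gradient and the Hessian, each iteration is a Newton-Raphson step. The quantities involved can be written out as follows, a sketch in which $\theta = (\alpha, \beta_1, ..., \beta_k)$, $x_i$ is the i-th row of X (including the leading 1), and the gradient is taken of the negative log-likelihood so that the signs agree with the code:

```latex
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],
\qquad p_i = \frac{1}{1 + e^{-x_i^{\top} \theta}} \\
g(\theta) &= -\nabla_{\theta}\, \ell(\theta) = X^{\top} (p - y) \\
H(\theta) &= X^{\top} W X, \qquad W = \mathrm{diag}\big( p_i (1 - p_i) \big) \\
\theta_{new} &= \theta - H(\theta)^{-1} g(\theta)
\end{aligned}
```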

To start, the variables need to be separated into dependent (y) and predictors (X).

In [111]:
# Separating the variables
y = df_fid['fidelity_y'].values
X = df_fid.drop(columns='fidelity_y').values

In [112]:
# Adding a column of 1's for the intercept
X = np.hstack((np.ones((X.shape[0], 1)), X))

In [113]:
# Iterating for getting the best theta
for i in range(1000):
    
    # Calculating the error
    y_pred = logistic(np.dot(X, theta))    
    error = y_pred - y

    # Getting the gradient and Hessian
    gradient = X.T @ error
    W = np.diag(y_pred * (1 - y_pred))
    H = X.T @ W @ X
    
    # Getting the new theta
    new_theta = theta - np.linalg.inv(H) @ gradient
    if np.allclose(theta, new_theta, rtol=tolerance):
        break
    theta = new_theta
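
The loop above builds W as a full N x N diagonal matrix and inverts H explicitly. The following variant is a sketch of the same Newton-Raphson update, not the method used in this script: it keeps the weights as a vector and solves the linear system with `np.linalg.solve`. The function name `fit_logistic_newton`, the stopping rule on the step size, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_logistic_newton(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson fit of a logistic regression; X must include the intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        gradient = X.T @ (p - y)                  # gradient of the negative log-likelihood
        w = p * (1.0 - p)                         # diagonal of W, kept as a vector
        hessian = X.T @ (X * w[:, None])          # X' W X without building an N x N matrix
        step = np.linalg.solve(hessian, gradient)  # solves H @ step = gradient
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Synthetic data (illustrative only): intercept plus two explanatory variables
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_theta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_theta)))).astype(float)

theta_hat = fit_logistic_newton(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))
print(theta_hat)
```

At the maximum likelihood estimate the gradient X'(p - y) is numerically zero, which is a convenient check that the loop has converged.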

In [116]:
# Checking Theta
theta

array([-68.98648366,   1.68703626,   1.76952011,   1.6807917 ,
         1.81721998,   3.31677318,   4.31191967,   1.85025263,
         2.05112249,   3.32897139,   5.93652285,   2.34754573,
         2.92291518,   4.29066536,   5.36614975,   0.5705581 ,
         2.92160597,   3.03928282,   3.9141728 ])

The probability of the event occurring (in this case, the probability that the customer returns to make purchases at the supermarket) could be given by:

$p1 = \LARGE \frac{1}{1 + e^{-({-68.99} + 1.69 * age + 1.77 * sex\_m + ... + 3.91 * price\_5)}}$

However, before that, it is necessary to check the statistical significance of the coefficients.

## Checking the variables' statistical significance

Once theta has been calculated, it is necessary to verify the statistical significance of each coefficient. For this, the Wald z-statistic of each parameter (the coefficient divided by its standard error) is needed.

In [118]:
# Getting Logit and p1
logit = np.dot(X, theta)
p1 = 1 / (1 + np.exp(-logit))

In [119]:
# Wald z-statistics and two-sided p-values
W = np.diag(p1*(1-p1))
H = np.dot(X.T, np.dot(W, X))
I = np.linalg.inv(H)
se = np.sqrt(np.diagonal(I))
z = theta / se
p_val = (1 - stats.norm.cdf(abs(z)))*2

Coefficients whose p-value for the z-statistic is greater than 0.05 are not statistically significant at the 5% level and should be excluded from the model.
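
A note on the p-value computation: `(1 - stats.norm.cdf(abs(z)))*2` returns exactly 0.0 for large |z|, because the cdf rounds to 1.0 in floating point. The sketch below shows an equivalent formulation with the survival function `stats.norm.sf`, which keeps the tiny tail probabilities. The helper name `two_sided_p_value` is an assumption for illustration.

```python
from scipy import stats

def two_sided_p_value(z):
    """Two-sided p-value of a standard normal test statistic."""
    # sf(x) equals 1 - cdf(x) but is evaluated directly in the tail
    return 2.0 * stats.norm.sf(abs(z))

p_threshold = two_sided_p_value(1.96)     # close to the usual 0.05 threshold
p_intercept = two_sided_p_value(-11.394)  # z reported for the intercept by statsmodels
p_naive = (1 - stats.norm.cdf(abs(-11.394))) * 2

print(p_threshold, p_intercept, p_naive)
```

Both formulations lead to the same decision at the 0.05 level; the difference only matters when the exact magnitude of a very small p-value is of interest.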

In [134]:
pd.DataFrame(index = ['intercept'] + list(df_fid.columns[0:-1]), data = {'p_val' : p_val, 'Not significant at 0.05':p_val > .05})

Unnamed: 0,p_val,Not significant at 0.05
intercept,0.0,False
age,0.0,False
sex_m,0.0,False
service_2,5.603975e-07,False
service_3,1.031541e-07,False
service_4,0.0,False
service_5,0.0,False
assortment_2,2.995996e-06,False
assortment_3,1.664693e-10,False
assortment_4,0.0,False


In this case, 'price_2' is not statistically significant, so it will be dropped and the whole process repeated without this variable.

## Repeating the Processes

In [149]:
# Initializing theta and defining tolerance
theta = np.zeros(df_fid.shape[1] - 1)
tolerance = 0.0001

# Separating the variables
y = df_fid['fidelity_y'].values
X = df_fid.drop(columns=['fidelity_y', 'price_2']).values

# Adding a column of 1's for the intercept
X = np.hstack((np.ones((X.shape[0], 1)), X))

# Iterating for getting the best theta
for i in range(1000):
    y_pred = logistic(np.dot(X, theta))    
    error = y_pred - y

    gradient = X.T @ error
    W = np.diag(y_pred * (1 - y_pred))
    H = X.T @ W @ X
    
    new_theta = theta - np.linalg.inv(H) @ gradient
    if np.allclose(theta, new_theta, rtol=tolerance):
        break
    theta = new_theta

In [150]:
# Getting Logit and p1
logit = np.dot(X, theta)
p1 = 1 / (1 + np.exp(-logit))

# Wald z-statistics and two-sided p-values
W = np.diag(p1*(1-p1))
H = np.dot(X.T, np.dot(W, X))
I = np.linalg.inv(H)
se = np.sqrt(np.diagonal(I))
z = theta / se
p_val = (1 - stats.norm.cdf(abs(z)))*2

pd.DataFrame(index = ['intercept'] + [x for x in list(df_fid.columns[0:-1]) if x != 'price_2'], data = {'p_val' : p_val, 'Not significant at 0.05':p_val > .05})

Unnamed: 0,p_val,Not significant at 0.05
intercept,0.0,False
age,0.0,False
sex_m,0.0,False
service_2,5.164217e-07,False
service_3,9.48327e-08,False
service_4,0.0,False
service_5,0.0,False
assortment_2,2.272191e-06,False
assortment_3,1.219038e-10,False
assortment_4,0.0,False


Now that all the remaining variables are statistically significant, the model parameters are defined.

It is now correct to say that the probability of the event occurring (in this case, the probability that the customer returns to make purchases at the supermarket) is given by:

$p1 = \LARGE \frac{1}{1 + e^{-({-69} + 1.68 * age + 1.76 * sex\_m + ... + 3.90 * price\_5)}}$

In [178]:
# Creating a DataFrame for the probabilities
df_prob = df_fid['fidelity_y'].reset_index()
df_prob['p1'] = p1
df_prob['p0'] = 1 - p1 
df_prob = df_prob.drop(columns='index')

In [179]:
df_prob

Unnamed: 0,fidelity_y,p1,p0
0,0,0.000307,0.999693
1,0,0.000307,0.999693
2,0,0.208086,0.791914
3,0,0.398124,0.601876
4,0,0.038641,0.961359
...,...,...,...
2995,1,0.428042,0.571958
2996,1,0.923038,0.076962
2997,1,0.997545,0.002455
2998,1,0.994022,0.005978


As an example, the first observation has a probability of 99.97% of not returning to the market, while the observation with index 2995 has a 42.8% chance of returning.

Once the probabilities for each observation based on the logistic regression model are calculated, their classification depends on the set "cutoff" value. For this case, we will consider a cutoff value of 0.5.
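
The effect of the cutoff can be seen with a handful of the `p1` values shown in the table above. The sketch below classifies them at 0.5 and, purely for comparison, at a hypothetical cutoff of 0.4 that is not used anywhere in this script.

```python
import numpy as np

# A few p1 values taken from the probability table (indices 0, 2, 3, 2995, 2996)
p1_sample = np.array([0.000307, 0.208086, 0.398124, 0.428042, 0.923038])

# An observation is classified as 1 (returns) when p1 exceeds the cutoff
pred_at_05 = np.where(p1_sample > 0.5, 1, 0)
pred_at_04 = np.where(p1_sample > 0.4, 1, 0)  # hypothetical alternative cutoff

print(pred_at_05)
print(pred_at_04)
```

Lowering the cutoff turns the observation with p1 = 0.428042 into a predicted positive, which shows why every metric based on the classification has to be reported together with the cutoff used.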

In [181]:
# Adding the predicted values based on the cutoff
df_fid['predicted_values'] = np.where(p1 > 0.5, 1, 0)
df_fid.head(10)

Unnamed: 0,age,sex_m,service_2,service_3,service_4,service_5,assortment_2,assortment_3,assortment_4,assortment_5,accessibility_2,accessibility_3,accessibility_4,accessibility_5,price_2,price_3,price_4,price_5,fidelity_y,predicted_values
0,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,34,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0
3,34,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
4,34,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
5,34,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
6,34,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0
7,34,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
8,34,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
9,34,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1


## Metrics

The metrics below depend on the chosen cutoff.

### Confusion Matrix:

A confusion matrix is a table used to evaluate the performance of a classification model. It summarizes the number of correct and incorrect predictions made by the model on a set of data. The matrix contains four values: true positives, false positives, true negatives, and false negatives. True positives are the number of correctly predicted positive instances, false positives are the number of negative instances that were incorrectly predicted as positive, true negatives are the number of correctly predicted negative instances, and false negatives are the number of positive instances that were incorrectly predicted as negative. The actual values are represented by the columns while the predicted values are represented by the index.
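
As a compact illustration of this layout, the sketch below builds a confusion matrix with `pd.crosstab` on made-up labels (the eight observations are an assumption, not data from this notebook). Rows hold the predicted class and columns the actual class, both ordered as 1, 0.

```python
import pandas as pd

# Toy labels (illustrative only)
actual = pd.Series([1, 1, 1, 0, 0, 1, 0, 0], name="actual")
predicted = pd.Series([1, 1, 0, 0, 1, 1, 0, 0], name="predicted")

# Rows = predicted class, columns = actual class, both ordered as 1, 0
conf = pd.crosstab(predicted, actual).reindex(index=[1, 0], columns=[1, 0], fill_value=0)

tp = conf.loc[1, 1]  # predicted 1, actually 1
fp = conf.loc[1, 0]  # predicted 1, actually 0
fn = conf.loc[0, 1]  # predicted 0, actually 1
tn = conf.loc[0, 0]  # predicted 0, actually 0

print(conf)
```

The `reindex` call guarantees that all four cells exist even if the model never predicts one of the classes.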

In [195]:
# Creating the Confusion Matrix
conf_mat = pd.DataFrame(index=[1,0], columns=[1,0])
for i in conf_mat.index:
    for j in conf_mat.columns:
        # Rows (i) hold the predicted class and columns (j) the actual class
        conf_mat.loc[i, j] = len(df_fid.loc[(df_fid['fidelity_y']==j)&(df_fid['predicted_values']==i)])
conf_mat

Unnamed: 0,1,0
1,1470,210
0,210,1110


### Accuracy:

Accuracy measures the percentage of correctly classified instances out of all instances in the dataset (here, the same data used to fit the model).

In [197]:
# Calculating accuracy
accuracy = (conf_mat[0].loc[0] + conf_mat[1].loc[1])/conf_mat.sum().sum()
accuracy

0.86

### Precision

Precision measures the percentage of correctly classified positive instances out of all instances that were predicted to be positive.

In [198]:
# Calculating precision
precision = conf_mat[1].loc[1]/conf_mat.sum(1)[1]
precision

0.875

### Sensitivity

Sensitivity (also called recall) measures the percentage of correctly classified positive instances out of all instances that are actually positive.

In [200]:
# Calculating sensitivity
recall = conf_mat[1].loc[1]/conf_mat.sum(0)[1]
recall

0.875

### F1-Score

The F1-score is the harmonic mean of precision and recall and provides a balanced measure of the two metrics. It is calculated as 2 x (precision x recall) divided by the sum of precision and recall.

In [201]:
f1_score = (precision * recall * 2) / (precision + recall)
f1_score

0.875
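
The four cutoff-dependent metrics can also be gathered in one small helper that takes the cells of the confusion matrix. This is a sketch; the function name `classification_metrics` is an assumption, and the counts passed to it are the ones from the confusion matrix above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the confusion matrix: 1470 true positives, 210 false positives,
# 210 false negatives and 1110 true negatives
metrics = classification_metrics(tp=1470, fp=210, fn=210, tn=1110)
print(metrics)
```

Precision and recall coincide here because the number of false positives equals the number of false negatives, and the F1-score of two equal values is that same value.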

Calculating metrics that do not depend on the cutoff.

### Log-likelihood:



The log-likelihood of the model is the sum of the contributions of all observations:

$loglike = \sum_{i} [class_i * log(p1_i) + (1 - class_i) * log(p0_i)]$

In [205]:
# Calculating Loglike
ll = sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
ll

-773.6044089085042

### AIC:

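$AIC = -2 * loglike(model) + 2 * (k+1)$, where k = the number of explanatory variables in the model, so that k+1 also counts the intercept.

In [206]:
# Calculating AIC. k is taken from the design matrix X (intercept plus the 17 explanatory
# variables kept in the model), because df_fid now also holds the 'predicted_values' column
k = X.shape[1] - 1
aic = -2 * ll + 2 * (k + 1)
round(aic, 2)

1583.21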

### BIC:

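$BIC = -2 * loglike(model) + (k+1) * ln(N)$, where k = the number of explanatory variables in the model and N = the number of observations.

In [207]:
# X.shape[1] = k + 1 parameters (intercept plus the explanatory variables kept in the model)
bic = -2 * ll + np.log(len(df_fid)) * X.shape[1]
round(bic, 2)

1691.32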
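
Both criteria follow the same pattern: minus twice the log-likelihood plus a penalty that grows with the number of estimated parameters. The sketch below wraps them in one helper; the function name `information_criteria` and the toy inputs are assumptions for illustration, not values from this model.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """AIC and BIC; n_params counts every estimated parameter, intercept included."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    bic = -2.0 * log_likelihood + math.log(n_obs) * n_params
    return aic, bic

# Toy inputs (illustrative only): log-likelihood of -100 with 3 parameters and 50 observations
aic, bic = information_criteria(log_likelihood=-100.0, n_params=3, n_obs=50)
print(aic, bic)
```

Since ln(N) exceeds 2 for any sample with 8 or more observations, BIC penalizes each extra parameter more heavily than AIC in practically every real dataset.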

## Comparing the results

Now the results from the step-by-step implementation will be compared with the results from the statsmodels library.

Some processes will need to be repeated.

In [248]:
# Importing statsmodels.api
import statsmodels.api as sm

In [249]:
# Load the data
df_fid_comp1 = pd.read_csv('Data/customer_fidelity.csv', index_col=0)

# Dummies
quali_variables = ['sex', 'service', 'assortment', 'accessibility', 'price', 'fidelity']
for var in quali_variables:
    # dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans)
    dummies = pd.get_dummies(df_fid_comp1[var], drop_first=True, prefix=var, dtype=int)
    df_fid_comp1 = df_fid_comp1.merge(dummies, how='outer', left_index=True, right_index=True)
    df_fid_comp1 = df_fid_comp1.drop(columns=var)
df_fid_comp1

Unnamed: 0,age,sex_m,service_2,service_3,service_4,service_5,assortment_2,assortment_3,assortment_4,assortment_5,accessibility_2,accessibility_3,accessibility_4,accessibility_5,price_2,price_3,price_4,price_5,fidelity_y
0,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,34,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,34,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0
3,34,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
4,34,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,34,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1
2996,34,0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1
2997,36,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1
2998,35,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1


In [250]:
# Creating the formula
formula = 'fidelity_y ~ age'
for col in df_fid_comp1.columns[1:-1]:
    formula += f'+ {col}'

In [251]:
# Creating the model
sm_model = smf.glm(formula=formula, data=df_fid_comp1,
                         family=sm.families.Binomial()).fit()

In [252]:
# Model parameters
sm_model.summary()

0,1,2,3
Dep. Variable:,fidelity_y,No. Observations:,3000.0
Model:,GLM,Df Residuals:,2981.0
Model Family:,Binomial,Df Model:,18.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-773.57
Date:,"Thu, 20 Apr 2023",Deviance:,1547.1
Time:,16:30:25,Pearson chi2:,1730.0
No. Iterations:,8,Pseudo R-squ. (CS):,0.5752
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-68.9866,6.055,-11.394,0.000,-80.854,-57.120
age,1.6870,0.176,9.561,0.000,1.341,2.033
sex_m,1.7695,0.197,8.962,0.000,1.383,2.157
service_2,1.6808,0.336,5.004,0.000,1.023,2.339
service_3,1.8172,0.342,5.321,0.000,1.148,2.487
service_4,3.3168,0.311,10.651,0.000,2.706,3.927
service_5,4.3119,0.432,9.977,0.000,3.465,5.159
assortment_2,1.8503,0.396,4.671,0.000,1.074,2.627
assortment_3,2.0511,0.321,6.389,0.000,1.422,2.680


In [253]:
# Applying the stepwise process
sm_model_step = stepwise(sm_model)

Regression type: GLM 

Estimating model...: 
 fidelity_y ~ age + sex_m + service_2 + service_3 + service_4 + service_5 + assortment_2 + assortment_3 + assortment_4 + assortment_5 + accessibility_2 + accessibility_3 + accessibility_4 + accessibility_5 + price_2 + price_3 + price_4 + price_5

 Family type...: 
 Binomial

 Discarding atribute "price_2" with p-value equal to 0.7880570537038649 

Estimating model...: 
 fidelity_y ~ age + sex_m + service_2 + service_3 + service_4 + service_5 + assortment_2 + assortment_3 + assortment_4 + assortment_5 + accessibility_2 + accessibility_3 + accessibility_4 + accessibility_5 + price_3 + price_4 + price_5

 Family type...: 
 Binomial

 No more atributes with p-value higher than 0.05

 Atributes discarded on the process...: 

{'atribute': 'price_2', 'p-value': 0.7880570537038649}

 Model after stepwise process...: 
 fidelity_y ~ age + sex_m + service_2 + service_3 + service_4 + service_5 + assortment_2 + assortment_3 + assortment_4 + assortment_5 

In [261]:
# Getting the predictions from statsmodels
df_fid['predicted_values_sm'] = np.where(sm_model_step.predict() > .5, 1, 0)
df_fid[['predicted_values','predicted_values_sm']].value_counts()

predicted_values  predicted_values_sm
1                 1                      1680
0                 0                      1320
dtype: int64

As can be observed, the outputs obtained step by step were identical to those obtained with the statsmodels library, with the help of the stepwise function. The classifications were exactly the same, as was the log-likelihood.

Now the values will be compared to those of the function.