# Logistic regression in Python

Different ways to perform logistic regression:
- `scikit-learn`: the class `sklearn.linear_model.LogisticRegression`.
- `statsmodels`: the `Logit` model, which also fits the coefficients by maximizing the log-likelihood.
- Other packages, such as `tensorflow` (mainly for deep learning) and generalized linear model (GLM) packages.

We will illustrate `scikit-learn` and `statsmodels`.
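
For reference, the quantity that both libraries optimize can be written out as follows. The symbols are notation assumed here for illustration and are not taken from the notebook: $x_i$ is the feature vector of sample $i$, $y_i \in \{0, 1\}$ its label, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $n$ the number of samples.

$$
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}, \qquad
\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

Maximizing the log-likelihood $\ell$ is equivalent to minimizing the binary cross-entropy (log loss) $-\ell / n$.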

In [100]:
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

from sklearn.preprocessing import StandardScaler

## Data set

Predict whether the `mpg` of a car is good (1) or bad (0) based on its other features. `mpg` is labeled good if it is at or above the median value and bad if it is below the median value.

This data contains some missing values indicated by '?'. 

The columns in the data set are as follows. The last column, `name`, is removed before the analysis.

- mpg: miles per gallon
- cylinders: Number of cylinders between 4 and 8
- displacement: Engine displacement (cu. inches)
- horsepower: Engine horsepower
- weight: Vehicle weight (lbs.)
- acceleration: Time to accelerate from 0 to 60 mph (sec.)
- year: Model year (modulo 100)
- origin: Origin of car (1. American, 2. European, 3. Japanese)
- name: Vehicle name

## Check for missing values

In [54]:
auto_df = pd.read_csv('Auto.csv')
display(auto_df)
(auto_df=='?').sum()

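|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 392 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 393 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 394 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 395 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |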


mpg             0
cylinders       0
displacement    0
horsepower      5
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64
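
As an alternative to comparing against `'?'` by hand, pandas can treat the placeholder as missing when the file is read. The sketch below uses a small made-up CSV (hypothetical rows, not the real `Auto.csv`) only to show the idea: with `na_values='?'`, the `horsepower` column is parsed as numeric and the usual `isna` and `dropna` tools apply.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like Auto.csv (values are illustrative only)
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
30.0,4,70,car c
"""

# na_values='?' converts the '?' placeholders to NaN while parsing
sample_df = pd.read_csv(io.StringIO(csv_text), na_values='?')
n_missing = sample_df.isna().sum()   # number of missing values per column
clean_df = sample_df.dropna()        # keep only the complete rows

print(n_missing)
print(clean_df.dtypes)
```

With this approach `horsepower` comes out as a floating-point column instead of a column of strings, so no separate type conversion is needed later.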

### Remove rows with missing values and binarize `mpg`

In [82]:
# Drop the rows whose horsepower is missing ('?'); copy so that auto_df is not modified
auto_df_ = auto_df[auto_df['horsepower'] != '?'].copy()
# Drop the vehicle name, which is not used as a feature
auto_df_ = auto_df_.drop(columns=['name'])
# Binarize mpg: 1 (good) if at or above the median, 0 (bad) otherwise
mpg_median = auto_df_['mpg'].median()
auto_df_['mpg'] = (auto_df_['mpg'] >= mpg_median).astype(int)
auto_df_

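|  | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 392 | 1 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 |
| 393 | 1 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 |
| 394 | 1 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 |
| 395 | 1 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 |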


### Train-test split

- Split the data into training (80%) and testing (20%) sets.
- Scale the features. This helps the optimization algorithm converge. The scaler is fit on the training set only and then applied to the test set.

In [94]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
auto_x_train, auto_x_test, y_train, y_test = train_test_split(
    auto_df_.drop(columns=['mpg']), auto_df_['mpg'], test_size=0.20, random_state=42
)
# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
auto_x_train_scaled = scaler.fit_transform(auto_x_train)
auto_x_test_scaled = scaler.transform(auto_x_test)
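
To see what the scaling step does, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made up for illustration and stand in for the car features and labels). It checks that the training features end up with zero mean and unit standard deviation, while the test features are transformed with the training statistics and are therefore only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (assumed values, not the Auto data)
X = rng.normal(loc=[100.0, 3000.0], scale=[30.0, 800.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std estimated from training rows only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled.mean(axis=0))   # near 0, but not exactly
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.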

### Function to print accuracy and log loss

In [182]:
def model_summary(model,x,y): #x = features, y = class label
    y_pred_prob = model.predict_proba(x)
    # y_pred_prob has shape (n_samples,2)
    # y_pred_prob[:,0] = prob that belongs to class 0
    # y_pred_prob[:,1] = prob that belongs to class 1
    y_pred_class = np.copy(y_pred_prob[:,1])
    y_pred_class[y_pred_class<0.5] = 0
    y_pred_class[y_pred_class>=0.5] = 1
    
    print('coef',model.coef_) #coefficients
    print('intercept',model.intercept_) #intercept
    print('accuracy ',accuracy_score(y,y_pred_class))
    print('Log loss/ -loglikelihood / cross-entropy loss', log_loss(y,y_pred_prob[:,1],normalize=True))
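
The two metrics reported by this function can be reproduced by hand. The following sketch uses a handful of made-up labels and probabilities (hypothetical values, not model output) to show that `log_loss` is the mean negative Bernoulli log-likelihood and that the accuracy comes from thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_class1 = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # hypothetical predicted P(class 1)

# Mean negative log-likelihood of the labels under the predicted probabilities
manual_loss = -np.mean(y_true * np.log(p_class1) + (1 - y_true) * np.log(1 - p_class1))
sklearn_loss = log_loss(y_true, p_class1)

# Thresholding at 0.5 gives the predicted class labels
y_pred_class = (p_class1 >= 0.5).astype(int)
accuracy = np.mean(y_pred_class == y_true)

print(manual_loss, sklearn_loss, accuracy)
```

The fourth sample (true label 1, probability 0.4) is the only one misclassified, so four of the five predictions are correct.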

- Fit the model.
- Print the results for the training and testing data.

In [183]:
logit       = LogisticRegression(penalty=None) # no regularization as indicated by parameter penalty
model_logit = logit.fit(auto_x_train_scaled, y_train)
print('training')
model_summary(model_logit,auto_x_train_scaled,y_train)
print('testing')
model_summary(model_logit,auto_x_test_scaled,y_test)

training
coef [[-0.82680498  2.18976282 -1.91010141 -4.65619846  0.05607149  1.92871216
   0.96097533]]
intercept [-0.71923424]
accuracy  0.9265175718849841
Log loss/ -loglikelihood / cross-entropy loss 0.18004807244772605
testing
coef [[-0.82680498  2.18976282 -1.91010141 -4.65619846  0.05607149  1.92871216
   0.96097533]]
intercept [-0.71923424]
accuracy  0.8734177215189873
Log loss/ -loglikelihood / cross-entropy loss 0.31965127704075136


## Logistic regression using `statsmodels`

In [159]:
# Note: sm.Logit does not add an intercept by itself; no constant column is included here
log_reg = sm.Logit(y_train, auto_x_train_scaled).fit()
print(log_reg.summary())

Optimization terminated successfully.
         Current function value: 0.188808
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                    mpg   No. Observations:                  313
Model:                          Logit   Df Residuals:                      306
Method:                           MLE   Df Model:                            6
Date:                Sun, 01 Sep 2024   Pseudo R-squ.:                  0.7276
Time:                        20:42:56   Log-Likelihood:                -59.097
converged:                       True   LL-Null:                       -216.92
Covariance Type:            nonrobust   LLR p-value:                 3.641e-65
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -1.0910      0.832     -1.311      0.190      -2.721       0.540
x2             3.0113      1.

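- The `statsmodels` `Log-Likelihood` is summed over the samples rather than normalized by their number. To get the binary cross-entropy, multiply it by -1 and divide by the number of samples: here -(-59.097)/313 ≈ 0.1888, which matches the `Current function value` printed by the optimizer.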

### Model summary for `statsmodels`

In [194]:
def model_summary_stats_model(model,x,y): # x = features, y = class label
    y_pred_prob  = model.predict(x)
    # y_pred_prob  = prob that belongs to class 1
    y_pred_class = np.copy(y_pred_prob)
    y_pred_class[y_pred_class<0.5] = 0
    y_pred_class[y_pred_class>=0.5] = 1
    
    #print('coef',model.coef_)
    #print('intercept',model.intercept_)
    print('accuracy ',accuracy_score(y,y_pred_class))
    print('Log loss/ -loglikelihood / cross-entropy loss', log_loss(y,y_pred_prob,normalize=True))

In [196]:
print('training')
model_summary_stats_model(log_reg,auto_x_train_scaled,y_train)
print('testing')
model_summary_stats_model(log_reg,auto_x_test_scaled,y_test)

training
accuracy  0.9201277955271565
Log loss/ -loglikelihood / cross-entropy loss 0.18880811630204328
testing
accuracy  0.8354430379746836
Log loss/ -loglikelihood / cross-entropy loss 0.3766704927453016


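- Note that the gap between training and testing accuracy is larger for the `statsmodels` fit (0.920 vs. 0.835) than for the `scikit-learn` fit (0.927 vs. 0.873).
- The `scikit-learn` cross-entropy is smaller than the `statsmodels` cross-entropy on both the training and the testing data.
- The two fits are not the same model. `LogisticRegression` fits an intercept by default, whereas `sm.Logit` does not add one unless a constant column is included in the features (for example with `sm.add_constant`). The summary above reports Df Residuals of 306 = 313 - 7, i.e. only the seven feature coefficients and no intercept, which accounts for the different results.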