# AccelerateAI: Logistic Regression

## Generalized Linear Model

* Simple Generalized Linear Model using statsmodels
* For Univariate and Multivariate analysis
* Dataset reference: diabetes data from UCI ML (https://archive.ics.uci.edu/ml/index.php)

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# A Jupyter notebook-specific command that lets you see the plots in the notebook itself.
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("C:/Users/mishr/Desktop/Notebooks/data/diabetes.csv")

df.shape

(768, 9)

In [3]:
df.head(5)

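|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |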


## Univariate: with one feature

In [4]:
model = sm.GLM.from_formula("Outcome ~ Age", family=sm.families.Binomial(), data=df)

result = model.fit()

In [5]:
print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                Outcome   No. Observations:                  768
Model:                            GLM   Df Residuals:                      766
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -475.36
Date:                Sat, 04 Jun 2022   Deviance:                       950.72
Time:                        00:14:54   Pearson chi2:                     761.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.0475      0.239     -8.572      0.0

In this single-feature model, `Age` is a statistically significant predictor of the diabetes `Outcome`, as its p-value is less than 0.05.


Note:
* IRLS stands for Iteratively Reweighted Least Squares. It is used to find the maximum likelihood estimates of a GLM. It is an iterative method in which each step solves a weighted least squares (WLS) problem.
* For GLMs, three fitting procedures are closely connected: Newton-Raphson, Fisher scoring, and IRLS. With a canonical link such as the logit, Newton-Raphson and Fisher scoring coincide, and IRLS is the usual way of carrying out Fisher scoring.
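The IRLS procedure described in the note can be sketched in plain Python for a one-feature logistic regression. This is only a minimal illustration, not how statsmodels implements it: the function name `irls_logistic` and the eight-point toy data set are hypothetical, and the 2x2 weighted least squares system is solved in closed form rather than with a linear algebra library.

```python
import math


def irls_logistic(x, y, max_iter=25, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by IRLS (hypothetical helper, toy-sized data only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Linear predictor and fitted probabilities at the current estimate.
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # IRLS weights and working response for the logit link.
        w = [m * (1.0 - m) for m in mu]
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical toy data (not the diabetes data set); the classes overlap, so the MLE is finite.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(x, y)

# At the maximum likelihood estimate the score equations hold:
# sum(y - mu) = 0 and sum(x * (y - mu)) = 0.
mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
score_intercept = sum(yi - m for yi, m in zip(y, mu))
score_slope = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))
print(b0, b1, score_intercept, score_slope)
```

Both score values should print as numerically zero, which is one way to check that the iteration has reached the maximum likelihood estimate.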

## Multivariate: with multiple features

In [6]:
model = sm.GLM.from_formula("Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age", family=sm.families.Binomial(), data=df)

result = model.fit()

In [7]:
print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                Outcome   No. Observations:                  768
Model:                            GLM   Df Residuals:                      759
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -361.72
Date:                Sat, 04 Jun 2022   Deviance:                       723.45
Time:                        00:14:54   Pearson chi2:                     836.
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept               

* Looking at the p-values, the variables 'SkinThickness', 'Insulin', and 'Age' do not appear to be statistically significant predictors at the 0.05 level. All the other variables have p-values smaller than 0.05 and are therefore significant.
* The coefficients are interpreted on the logit (log odds) scale. In simple terms, for the output above, holding the other variables fixed:
    * The log odds of a positive diabetes 'Outcome' increase by 0.089 for each unit increase in 'BMI'.
    * The log odds of a positive diabetes 'Outcome' increase by 0.035 for each unit increase in 'Glucose'.
    * The log odds of a positive diabetes 'Outcome' increase by 0.015 for each unit increase in 'Age', and so on.
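Log odds are easier to read once they are exponentiated into odds ratios. The short sketch below does this for the three rounded coefficients quoted above; the dictionary names are illustrative and the values are typed in by hand rather than read from the fitted model.

```python
import math

# Log-odds coefficients quoted in the text above (rounded to three decimals).
log_odds = {"BMI": 0.089, "Glucose": 0.035, "Age": 0.015}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in log_odds.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f} ({(ratio - 1) * 100:.1f}% change in odds per unit)")
```

For example, a coefficient of 0.089 corresponds to roughly 9% higher odds per unit of BMI. With the fitted model, the same transformation can be applied to all coefficients at once with `np.exp(result.params)`.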

## Test for normality (Omnibus test)

In [8]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [9]:
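alpha = 0.05  # significance level
# stats.normaltest is the D'Agostino-Pearson omnibus test.
# Null hypothesis: the column comes from a normal distribution.
for i in df.columns:
    print([i])
    # Passing a one-column DataFrame (df[[i]]) returns the statistic and p-value as 1-element arrays.
    a, b = stats.normaltest(df[[i]])
    print(a,b)
    if b < alpha:
        print("The null hypothesis can be rejected")
    else:
        print("The null hypothesis cannot be rejected")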

['Pregnancies']
[80.16379459] [3.91429164e-18]
The null hypothesis can be rejected
['Glucose']
[12.38505662] [0.00204465]
The null hypothesis can be rejected
['BloodPressure']
[305.88688721] [3.78012708e-67]
The null hypothesis can be rejected
['SkinThickness']
[17.34487054] [0.00017124]
The null hypothesis can be rejected
['Insulin']
[387.57777337] [6.89534274e-85]
The null hypothesis can be rejected
['BMI']
[86.14248429] [1.96968695e-19]
The null hypothesis can be rejected
['DiabetesPedigreeFunction']
[321.83907808] [1.29876975e-70]
The null hypothesis can be rejected
['Age']
[119.87763596] [9.30898004e-27]
The null hypothesis can be rejected
['Outcome']
[4556.98719518] [0.]
The null hypothesis can be rejected


## Test for term significance (Wald test)

In [10]:
result.wald_test_terms()

<class 'statsmodels.stats.contrast.WaldTestResults'>
                                chi2        P>chi2  df constraint
Intercept                 137.545680  9.161142e-32              1
Pregnancies                14.746678  1.229640e-04              1
Glucose                    89.896822  2.509097e-21              1
BloodPressure               6.453713  1.107207e-02              1
SkinThickness               0.008048  9.285152e-01              1
Insulin                     1.748502  1.860652e-01              1
BMI                        35.346996  2.758937e-09              1
DiabetesPedigreeFunction    9.982933  1.579978e-03              1
Age                         2.537198  1.111920e-01              1
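Each term in the table has a single constraint, so its Wald chi2 statistic is the square of the corresponding z statistic, and the p-value is the upper tail of a chi-square distribution with 1 degree of freedom. The sketch below checks this for three rows using only the standard library; `chi2_sf_df1` is a hypothetical helper name and the statistics are copied by hand from the table.

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z ~ N(0, 1) then Z**2 ~ chi2(1), so P(Z**2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# chi2 statistics copied from the Wald test table above.
wald_chi2 = {"Glucose": 89.896822, "SkinThickness": 0.008048, "Age": 2.537198}

for term, stat in wald_chi2.items():
    abs_z = math.sqrt(stat)      # |z| as it would appear in the regression summary
    p_value = chi2_sf_df1(stat)  # should reproduce the P>chi2 column
    print(f"{term}: |z| = {abs_z:.3f}, p = {p_value:.6e}")
```

The printed p-values should agree with the `P>chi2` column, which is why the Wald test and the coefficient-level z-tests lead to the same conclusions about 'SkinThickness', 'Insulin', and 'Age'.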