# Statistical Analysis

## Logit Regression

### Imports

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
import statsmodels.api as sm

### Retrieve data

In [2]:
df = pd.read_csv('ready_df.csv')

In [3]:
df.head()

,ammonia,arsenic,barium,cadmium,copper,fluoride,bacteria,viruses,lead,nitrates,mercury,perchlorate,radium,selenium,uranium,potability
0,9.08,0.2,2.85,0.083666,0.412311,0.05,0.0016,0.0,0.054,16.08,0.007,2030803.0,6.78,0.08,0.02,1
1,21.16,0.1,3.31,0.044721,0.812404,0.9,0.178506,0.4225,0.1,2.01,0.003,1083072.0,3.21,0.08,0.05,1
2,14.02,0.2,0.58,0.089443,0.141421,0.99,6e-06,9e-06,0.078,14.16,0.006,6391180.0,7.07,0.07,0.01,0
3,11.33,0.2,2.96,0.031623,1.28841,1.08,0.254117,0.5041,0.016,1.41,0.004,6917.981,1.72,0.02,0.05,1
4,24.33,0.173205,0.2,0.07746,0.754983,0.61,0.000286,1e-06,0.117,6.74,0.003,81573.07,2.41,0.02,0.02,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7459 entries, 0 to 7458
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ammonia      7459 non-null   float64
 1   arsenic      7459 non-null   float64
 2   barium       7459 non-null   float64
 3   cadmium      7459 non-null   float64
 4   copper       7459 non-null   float64
 5   fluoride     7459 non-null   float64
 6   bacteria     7459 non-null   float64
 7   viruses      7459 non-null   float64
 8   lead         7459 non-null   float64
 9   nitrates     7459 non-null   float64
 10  mercury      7459 non-null   float64
 11  perchlorate  7459 non-null   float64
 12  radium       7459 non-null   float64
 13  selenium     7459 non-null   float64
 14  uranium      7459 non-null   float64
 15  potability   7459 non-null   int64  
dtypes: float64(15), int64(1)
memory usage: 932.5 KB


#### Distinguish the explanatory and outcome variables

In [5]:
# Let explanatory variables be X and the outcome variable be y

X = df.iloc[:, 0:15]
y = df['potability']

#### Split data into a training set and a test set (3:1)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
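As a quick sanity check on the 3:1 split, the expected set sizes can be worked out by hand. This is a minimal sketch that assumes the test share is rounded up and the remaining rows go to training; `n_samples` is copied from the `df.info()` output above.

```python
import math

n_samples = 7459   # rows reported by df.info()
test_size = 0.25   # value passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The training count should agree with the `No. Observations` value in the model summaries, and the test count with the sum of the confusion matrix entries.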

### Model development
#### Fit model to appropriate data

In [7]:
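# Fit a binomial GLM (default logit link). statsmodels does not add an intercept
# automatically, so this model has none; sm.add_constant(X_train) would add one.
logit_fit = sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()
logit_fit.summary2()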

Model:,GLM,AIC:,3292.9881
Link Function:,logit,BIC:,-44880.7128
Dependent Variable:,potability,Log-Likelihood:,-1631.5
Date:,2022-02-01 11:49,LL-Null:,-2019.9
No. Observations:,5594,Deviance:,3263.0
Df Model:,14,Pearson chi2:,6390.0
Df Residuals:,5579,Scale:,1.0
Method:,IRLS,,

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
ammonia,-0.0075,0.0050,-1.4779,0.1394,-0.0173,0.0024
arsenic,-1.2973,0.2691,-4.8199,0.0000,-1.8248,-0.7697
barium,0.3801,0.0396,9.6071,0.0000,0.3026,0.4577
cadmium,-8.6422,0.5673,-15.2333,0.0000,-9.7541,-7.5302
copper,0.1231,0.1077,1.1429,0.2531,-0.0880,0.3343
fluoride,0.1493,0.1002,1.4893,0.1364,-0.0472,0.3457
bacteria,2.1130,0.5361,3.9413,0.0001,1.0622,3.1638
viruses,-2.3217,0.3568,-6.5068,0.0000,-3.0210,-1.6223
lead,-0.0872,0.7802,-0.1118,0.9110,-1.6163,1.4420


#### Drop the columns whose coefficients are not statistically significant at the 0.05 level. Each dropped column was also found to have no or only weak correlation with potability during data exploration

In [8]:
# Reassign instead of using inplace=True, which can trigger a SettingWithCopyWarning on the split frames
drop_cols = ['ammonia', 'copper', 'fluoride', 'lead', 'radium', 'selenium']
X_train = X_train.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)

#### Fit the model again

In [9]:
logit_model = sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()
logit_model.summary2()

Model:,GLM,AIC:,3291.1991
Link Function:,logit,BIC:,-44922.2785
Dependent Variable:,potability,Log-Likelihood:,-1636.6
Date:,2022-02-01 11:49,LL-Null:,-2019.9
No. Observations:,5594,Deviance:,3273.2
Df Model:,8,Pearson chi2:,6370.0
Df Residuals:,5585,Scale:,1.0
Method:,IRLS,,

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
arsenic,-1.2766,0.2666,-4.7877,0.0000,-1.7992,-0.7540
barium,0.3884,0.0370,10.4925,0.0000,0.3159,0.4610
cadmium,-8.7706,0.5453,-16.0827,0.0000,-9.8395,-7.7018
bacteria,2.1857,0.5278,4.1409,0.0000,1.1512,3.2202
viruses,-2.3759,0.3497,-6.7945,0.0000,-3.0612,-1.6905
nitrates,-0.0235,0.0072,-3.2592,0.0011,-0.0377,-0.0094
mercury,-28.3978,13.7573,-2.0642,0.0390,-55.3616,-1.4340
perchlorate,-0.0000,0.0000,-4.5324,0.0000,-0.0000,-0.0000
uranium,-8.7432,1.5599,-5.6051,0.0000,-11.8005,-5.6859


#### The low p-values of the remaining attributes indicate that each one's effect on potability is statistically significant at the 0.05 level.
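Because the coefficients are on the log-odds scale, exponentiating them can make their size easier to read. The sketch below copies the coefficients from the reduced-model summary above and converts them to odds ratios. Perchlorate is left out because its coefficient is displayed as -0.0000 after rounding, so its value cannot be recovered from the table.

```python
import math

# Coefficients copied from the reduced-model summary (log-odds scale)
coefs = {
    "arsenic": -1.2766,
    "barium": 0.3884,
    "cadmium": -8.7706,
    "bacteria": 2.1857,
    "viruses": -2.3759,
    "nitrates": -0.0235,
    "mercury": -28.3978,
    "uranium": -8.7432,
}

# exp(coefficient) is the multiplicative change in the odds of potability = 1
# for a one-unit increase in the attribute, holding the other attributes fixed
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# Print from the strongest negative effect to the strongest positive effect
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {ratio:.4g}")
```

A ratio above 1 raises the odds and a ratio below 1 lowers them. Note that a one-unit change is very large for attributes such as cadmium or mercury, whose observed values are small fractions, so their extreme ratios should be read with the measurement scale in mind.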

### Evaluating the logit regression with metric interpretation:
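One way to read the fit metrics is to compare the two summaries above directly. The sketch below is a rough illustration, not part of the original analysis: it recomputes McFadden's pseudo R-squared and a likelihood-ratio test for the six dropped columns from the reported log-likelihoods. These values are rounded to one decimal in the summaries, so the results are approximate, and the helper `chi2_sf_even_df` is a hypothetical name introduced here.

```python
import math

# Values copied from the two model summaries
ll_full, ll_reduced, ll_null = -1631.5, -1636.6, -2019.9
aic_full, aic_reduced = 3292.9881, 3291.1991
dropped = 6  # ammonia, copper, fluoride, lead, radium, selenium

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null)
pseudo_r2 = 1 - ll_reduced / ll_null

# Likelihood-ratio statistic for removing the six columns
lr_stat = 2 * (ll_full - ll_reduced)


def chi2_sf_even_df(x, df):
    """Chi-squared survival function; this closed form is valid for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


p_value = chi2_sf_even_df(lr_stat, dropped)

print(f"pseudo R2 = {pseudo_r2:.4f}")
print(f"LR statistic = {lr_stat:.1f}, p = {p_value:.4f}")
print(f"AIC full = {aic_full:.1f}, AIC reduced = {aic_reduced:.1f}")
```

Under these assumptions, a large p-value for the likelihood-ratio test together with a lower AIC for the reduced model would suggest that dropping the six columns does not noticeably worsen the fit.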

### Evaluating the logit regression with prediction:

In [10]:
y_pred = logit_model.predict(X_test)

In [11]:
# Round the predicted probabilities to class labels (a 0.5 threshold)
metrics.confusion_matrix(y_test, y_pred.round(0))

array([[1621,   16],
       [ 221,    7]], dtype=int64)
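The confusion matrix is easier to interpret once it is turned into rates. The sketch below uses only the four counts reported above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order.

```python
# Counts from the confusion matrix: [[tn, fp], [fn, tp]]
tn, fp = 1621, 16
fn, tp = 221, 7

total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test rows classified correctly
precision = tp / (tp + fp)     # share of predicted class-1 rows that are truly class 1
recall = tp / (tp + fn)        # share of true class-1 rows the model finds
baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"accuracy  = {accuracy:.4f}")
print(f"baseline  = {baseline:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Because class 0 dominates the test set, accuracy alone is misleading here: comparing it with the majority-class baseline and checking recall for class 1 shows how few class-1 samples the 0.5 threshold actually captures. A lower threshold or class weighting may be worth exploring.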