## Diabetes analysis

The purpose of this analysis is to build a predictive model that examines how biological features influence the probability of developing diabetes in women. Each variable represents the level of a biological characteristic of a subject. All subjects in this dataset are women aged 21 or older.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import chi2

In [2]:
data_backup = pd.read_csv('dataset_zadanie.csv')

In [3]:
data = data_backup.copy()  # work on a copy so data_backup keeps the raw values
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,1,122,90,51,220,49.7,0.325,31,1
1,1,163,72,0,0,39.0,1.222,33,1
2,1,151,60,0,0,26.1,0.179,22,0
3,0,125,96,0,0,22.5,0.262,21,0
4,1,81,72,18,40,26.6,0.283,24,0
...,...,...,...,...,...,...,...,...,...
744,1,0,48,20,0,24.7,0.140,22,0
745,7,62,78,0,0,32.6,0.391,41,0
746,5,95,72,33,0,37.7,0.370,27,0
747,0,131,0,0,0,43.2,0.270,26,1


## Exploring data

In [4]:
data.isna().sum()
#no NA values

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:
(data == 0).sum()
#plenty of 0 values

Pregnancies                 110
Glucose                       5
BloodPressure                34
SkinThickness               223
Insulin                     365
BMI                          10
DiabetesPedigreeFunction      0
Age                           0
Outcome                     484
dtype: int64

### Problem with Insulin variable

During the analysis, we noticed that 'Insulin' equals zero in almost 49% of cases. An insulin level of exactly zero is biologically implausible for a living person, so these zeros most likely stand for missing measurements.

In [6]:
((data == 0).sum() / data.count()) * 100

Pregnancies                 14.686248
Glucose                      0.667557
BloodPressure                4.539386
SkinThickness               29.773031
Insulin                     48.731642
BMI                          1.335113
DiabetesPedigreeFunction     0.000000
Age                          0.000000
Outcome                     64.619493
dtype: float64

'Insulin' equals zero in almost 49 percent of all observations, so replacing the zeros with another value (e.g. the sample mean) could distort the analysis. Later on we may therefore decide to drop 'Insulin' from the data.

### Replacing zero values with the mean

In [7]:
data['Glucose'] = data['Glucose'].replace(0, data['Glucose'].mean())
data['BloodPressure'] = data['BloodPressure'].replace(0, data['BloodPressure'].mean())
data['SkinThickness'] = data['SkinThickness'].replace(0, data['SkinThickness'].mean())
data['BMI'] = data['BMI'].replace(0, data['BMI'].mean())
data['Insulin'] = data['Insulin'].replace(0, data['Insulin'].mean())
(data == 0).sum()

Pregnancies                 110
Glucose                       0
BloodPressure                 0
SkinThickness                 0
Insulin                       0
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                     484
dtype: int64
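
Note that the means used above are computed with the zeros still included, which pulls the fill value down (for example 80.44 for 'Insulin'). One possible alternative, sketched below on a small made-up frame rather than on the real dataset, is to mark zeros as missing first and then fill them with the mean of the observed values only. The `toy` frame and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros stand for missing measurements.
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 140.0, 120.0],
    "Insulin": [0.0, 0.0, 90.0, 110.0],
})
cols = ["Glucose", "Insulin"]

# Mean with zeros counted as real values (the approach of the cell above).
mean_with_zeros = toy[cols].mean()

# Alternative: treat zeros as NaN so they do not affect the mean.
masked = toy[cols].replace(0, np.nan)
mean_without_zeros = masked.mean()

# Fill the missing entries with the mean of the observed values only.
imputed = masked.fillna(mean_without_zeros)

print(mean_with_zeros)
print(mean_without_zeros)
print(imputed)
```

Whether this is preferable depends on the data; with almost half of 'Insulin' missing, any single fill value remains a rough approximation.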

### Calculating significance of the variables

In [8]:
A = data.iloc[:,0:-1]
b = data.iloc[:,-1]

In [9]:
A

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,1,122.000000,90.000000,51.000000,220.000000,49.7,0.325,31
1,1,163.000000,72.000000,20.520694,80.444593,39.0,1.222,33
2,1,151.000000,60.000000,20.520694,80.444593,26.1,0.179,22
3,0,125.000000,96.000000,20.520694,80.444593,22.5,0.262,21
4,1,81.000000,72.000000,18.000000,40.000000,26.6,0.283,24
...,...,...,...,...,...,...,...,...
744,1,121.255007,48.000000,20.000000,80.444593,24.7,0.140,22
745,7,62.000000,78.000000,20.520694,80.444593,32.6,0.391,41
746,5,95.000000,72.000000,33.000000,80.444593,37.7,0.370,27
747,0,131.000000,69.194927,20.520694,80.444593,43.2,0.270,26


In [10]:
chi_scores = chi2(A,b)
chi_values = pd.Series(chi_scores[0], index=A.columns).sort_values(ascending=False)
chi_scores
# First array below stands for chi2 values, second one for p-values of the variables

(array([ 106.68483416, 1366.99421761,   36.14162864,   83.14060406,
        1765.36811619,  104.69751208,    5.31038776,  165.30212532]),
 array([5.21905372e-025, 3.12404191e-299, 1.83485008e-009, 7.64197391e-020,
        0.00000000e+000, 1.42279583e-024, 2.11986377e-002, 7.85799461e-038]))

In [11]:
chi_values

Insulin                     1765.368116
Glucose                     1366.994218
Age                          165.302125
Pregnancies                  106.684834
BMI                          104.697512
SkinThickness                 83.140604
BloodPressure                 36.141629
DiabetesPedigreeFunction       5.310388
dtype: float64

#### According to the chi^2 statistic, 'Insulin', 'Glucose' and 'Age' have the highest scores with respect to the 'Outcome' variable.

In [12]:
chi_scores[1] < 0.05
# Checking whether p-values are below the 5% level of significance

array([ True,  True,  True,  True,  True,  True,  True,  True])

#### Additionally, every variable's p-value is smaller than 5%, so we keep all the variables in the model.
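
A caveat worth keeping in mind: scikit-learn's `chi2` is intended for non-negative features such as counts or frequencies. For continuous measurements the score grows with the magnitude of the feature, which may partly explain why 'Insulin' and 'Glucose' rank highest. The sketch below uses synthetic data (the arrays `x` and `y` are made up) and feeds the same feature in two different units to show the effect.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Hypothetical binary target and one non-negative feature loosely related to it.
y = rng.integers(0, 2, size=200)
x = rng.uniform(1.0, 5.0, size=200) + y

# The same feature expressed in two units that differ by a factor of 100.
X = np.column_stack([x, 100.0 * x])

scores, p_values = chi2(X, y)

# Both columns carry the same information about y,
# yet the score of the rescaled column is 100 times larger.
ratio = scores[1] / scores[0]
print(scores)
print(ratio)
```

For that reason the ranking of chi^2 scores across variables measured in different units should be read with caution.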

## Logistic Regression
Now we will build a logistic regression model and show the coefficients of the variables.

In [13]:
input_train, input_test, outcome_train, outcome_test = train_test_split(A, b, test_size=0.33, random_state=42)

In [14]:
print(input_train.shape)
print(input_test.shape)

(501, 8)
(248, 8)


In [15]:
logreg = LogisticRegression(max_iter=500)
logreg.fit(input_train, outcome_train)

In [16]:
coefficients = logreg.coef_[0]

In [17]:
pd.DataFrame({'Variables': data.columns[:-1], 'coefficients' : coefficients})
#coefficients of variables of the model

Unnamed: 0,Variables,coefficients
0,Pregnancies,0.133555
1,Glucose,0.041437
2,BloodPressure,-0.018357
3,SkinThickness,0.012587
4,Insulin,-0.002021
5,BMI,0.090161
6,DiabetesPedigreeFunction,0.769135
7,Age,0.005883
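
The raw coefficients above depend on the units of each variable, so the large value for 'DiabetesPedigreeFunction' does not by itself mean it is the most influential feature. A minimal sketch of one way to make coefficients comparable, assuming standardization is acceptable for the analysis, is shown below. It uses synthetic data; the column names `small_scale` and `large_scale` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Hypothetical features measured on very different scales.
X = pd.DataFrame({
    "small_scale": rng.normal(0.5, 0.3, size=n),
    "large_scale": rng.normal(120.0, 30.0, size=n),
})

# Synthetic binary target that depends on both features.
logit = 2.0 * (X["small_scale"] - 0.5) + 0.03 * (X["large_scale"] - 120.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize the features first, so each coefficient refers to
# a change of one standard deviation in that feature.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
model.fit(X, y)

coef_table = pd.DataFrame({
    "Variables": X.columns,
    "standardized_coefficient": model.named_steps["logisticregression"].coef_[0],
})
print(coef_table)
```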


### Predicting outcomes and model evaluation

In [18]:
outcome_predict = logreg.predict(input_test)

In [19]:
evaluation_table = pd.DataFrame({'Real_Outcome':outcome_test, 'Predicted_Outcome': outcome_predict})
evaluation_table['Correct_preciction'] = np.where((evaluation_table['Real_Outcome'] == evaluation_table['Predicted_Outcome']), 1, 0)
evaluation_table

Unnamed: 0,Real_Outcome,Predicted_Outcome,Correct_preciction
581,0,0,1
356,1,1,1
133,0,0,1
250,1,0,0
299,0,0,1
...,...,...,...
338,0,0,1
658,1,0,0
74,0,0,1
331,0,0,1


In [20]:
evaluation_table['Correct_preciction'].mean()

0.7540322580645161

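We can conclude that the model classified 75.4% of the test observations correctly. For reference, 64.6% of all observations have 'Outcome' equal to 0, so always predicting 0 would already be right about that often. In the author's subjective opinion, 75.4% is a relatively good accuracy.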
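
Accuracy alone does not show which kind of errors the model makes. The sketch below, using small made-up label arrays rather than the predictions above, shows how accuracy, recall, a confusion matrix and a majority-class baseline could be computed with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels, not taken from the dataset above.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)

# Share of actual positive cases that the model detected.
recall = recall_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Accuracy obtained by always predicting the most frequent class.
baseline = np.bincount(y_true).max() / len(y_true)

print(accuracy, recall, baseline)
print(cm)
```

In a medical setting, recall for the positive class is often of particular interest, since a missed case and a false alarm usually have different costs.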

## Logistic regression without Insulin Variable
In one of the previous sections, we noticed that 'Insulin' is zero, i.e. effectively missing, in almost 49% of cases, and we filled those values with the mean of 'Insulin'. Now we will drop the 'Insulin' variable, rerun the logistic regression, and see whether the model performs better.

In [21]:
data2 = data.drop(columns='Insulin')

In [22]:
A2 = data2.iloc[:,0:-1]
b2 = data2.iloc[:,-1]
input_train, input_test, outcome_train, outcome_test = train_test_split(A2, b2, test_size=0.33, random_state=42)
logreg = LogisticRegression(max_iter=500)
logreg.fit(input_train, outcome_train)

In [23]:
outcome_predict = logreg.predict(input_test)
evaluation_table = pd.DataFrame({'Real_Outcome':outcome_test, 'Predicted_Outcome': outcome_predict})
evaluation_table['Correct_preciction'] = np.where((evaluation_table['Real_Outcome'] == evaluation_table['Predicted_Outcome']), 1, 0)
evaluation_table['Correct_preciction'].mean()

0.7459677419354839

The model without the 'Insulin' variable was correct in 74.6% of cases, slightly worse than the 75.4% of the full model (a difference of 2 of the 248 test observations). The hypothesis that dropping 'Insulin' would improve the model's accuracy was therefore not confirmed.
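
Because the two results differ by only a couple of test observations, a single train/test split may not be enough to separate the models. One possible check, sketched here on synthetic data from `make_classification` (a stand-in, not the diabetes dataset), is to compare cross-validated accuracy with and without one column. The choice of column index 4 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: 8 features, binary target.
X, y = make_classification(
    n_samples=750, n_features=8, n_informative=4, random_state=42
)

model = LogisticRegression(max_iter=500)

# 5-fold cross-validated accuracy with all features.
scores_full = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# Same evaluation with one column removed (index chosen arbitrarily).
X_reduced = np.delete(X, 4, axis=1)
scores_reduced = cross_val_score(model, X_reduced, y, cv=5, scoring="accuracy")

print(scores_full.mean(), scores_full.std())
print(scores_reduced.mean(), scores_reduced.std())
```

If the difference between the two means is small relative to the spread across folds, the data do not clearly favor either model.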