# 1.2 Initial Data Analysis
The National Institute of Diabetes and Digestive and Kidney
Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The
following variables were recorded:
* Pregnancies: number of times pregnant, 
* Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test, 
* BloodPressure: diastolic blood pressure (mmHg), 
* SkinThickness: triceps skin fold thickness (mm), 
* Insulin: 2-hour serum insulin (mu U/ml), 
* BMI: body mass index (weight in kg / (height in m)²), 
* DiabetesPedigreeFunction: diabetes pedigree function, 
* Age: age (years), 
* Outcome: result of a test of whether the patient showed signs of diabetes (coded zero if negative, one if positive).

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# The 'seaborn-white' style was renamed 'seaborn-v0_8-white' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-white')

In [2]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

##### Load the `Pima Indians` dataset: www.ics.uci.edu/~mlearn/MLRepository.html

In [4]:
pima=pd.read_csv('./Data/pima-indians-diabetes.csv')
pima.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


##### Check the NAs in the dataset

In [5]:
pima.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

##### Construct numerical summary
The `describe()` method is a quick way to get the usual univariate summary
information. At this stage, we are looking for anything unusual or unexpected, perhaps
indicating a data-entry error. For this purpose, a close look at the minimum and
maximum values of each variable is worthwhile. Starting with `Pregnancies`, we see a maximum
value of 17. This is large, but not impossible. However, we then see that the next
five variables have minimum values of zero. No blood pressure is not good for the
health, so something must be wrong.

In [6]:
pima.describe().round(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.85,120.89,69.11,20.54,79.8,31.99,0.47,33.24,0.35
std,3.37,31.97,19.36,15.95,115.24,7.88,0.33,11.76,0.48
min,0.0,0.0,0.0,0.0,0.0,0.0,0.08,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.37,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.63,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
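
The scan of minimum values described above can also be done programmatically. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the study data) that lists the columns whose minimum is exactly zero. Deciding which of those zeros are implausible still requires subject-matter judgment.

```python
import pandas as pd

# Hypothetical toy data, used only so the example is self-contained.
toy = pd.DataFrame({
    "Pregnancies":   [6, 1, 0, 3],
    "BloodPressure": [72, 0, 64, 66],
    "BMI":           [33.6, 26.6, 0.0, 28.1],
    "Age":           [50, 31, 32, 21],
})

# Column-wise minima, the same numbers as the "min" row of describe().
minima = toy.min()

# Names of the columns whose smallest recorded value is zero.
zero_min_cols = minima[minima == 0].index.tolist()
print(zero_min_cols)
```

In this toy frame `Pregnancies` is flagged as well, even though zero is a legitimate value for it, so the list is only a starting point for closer inspection.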


Sort `BloodPressure` and show the first few values.

In [7]:
pima.BloodPressure.sort_values().head(5)

347    0
494    0
222    0
81     0
78     0
Name: BloodPressure, dtype: int64

In [8]:
(pima.BloodPressure==0).sum()

35

We see that the 35 smallest values are all zero. The description that comes with the data says
nothing about this, but it seems likely that zero has been used as a missing-value code. For
one reason or another, the researchers did not obtain the blood pressures of 35 patients.
In a real investigation, one would likely be able to question the researchers about what
really happened. Nevertheless, this does illustrate the kind of misunderstanding that can
easily occur. A careless statistician might overlook these presumed missing values and
complete an analysis assuming that these were real observed zeros. If the error was later discovered,
they might then blame the researchers for using zero as a missing value code (not a good choice
since it is a valid value for some of the variables) and not mentioning it in their data description.
Unfortunately, such oversights are not uncommon, particularly with datasets of any size
or complexity. The statistician bears some share of responsibility for spotting these mistakes.
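
The zero count shown for `BloodPressure` can be extended to several columns at once. This is a small sketch on made-up data (the frame `toy` and its values are assumptions for illustration only), not output from the study file.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file.
toy = pd.DataFrame({
    "Pregnancies":   [6, 1, 0, 2],
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 0, 0, 66],
    "Insulin":       [0, 0, 0, 94],
})

# Comparing the whole frame with 0 gives a boolean frame;
# summing it counts the exact zeros in each column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

A column in which zero is physically impossible but which still has a nonzero count is a candidate for a hidden missing-value code.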

##### We set all zero values of the five variables to NA.

In [9]:
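# Zero is a valid value for Pregnancies and Outcome, so only these five columns are recoded.
# The dict maps each column name to the value that should be treated as missing in it.
missing_rep = {'Glucose': 0, 'BloodPressure': 0, 'SkinThickness': 0, 'Insulin': 0, 'BMI': 0}
pima.replace(missing_rep, np.nan, inplace=True)
pima.describe().round(2)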

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0,768.0
mean,3.85,121.69,72.41,29.15,155.55,32.46,0.47,33.24,0.35
std,3.37,30.54,12.38,10.48,118.78,6.92,0.33,11.76,0.48
min,0.0,44.0,24.0,7.0,14.0,18.2,0.08,21.0,0.0
25%,1.0,99.0,64.0,22.0,76.25,27.5,0.24,24.0,0.0
50%,3.0,117.0,72.0,29.0,125.0,32.3,0.37,29.0,0.0
75%,6.0,141.0,80.0,36.0,190.0,36.6,0.63,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
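
To see why the replacement is restricted to named columns, here is a minimal sketch on made-up data (`toy` and its values are assumptions, not the study data). It uses the same column-to-value dictionary form of `DataFrame.replace` and then checks that genuine zeros in `Pregnancies` survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the second row has zero pregnancies, which is valid.
toy = pd.DataFrame({
    "Pregnancies":   [6, 0, 8],
    "Glucose":       [148, 0, 183],
    "BloodPressure": [72, 66, 0],
})

# Only the listed columns treat 0 as a missing-value code.
missing_rep = {"Glucose": 0, "BloodPressure": 0}
cleaned = toy.replace(missing_rep, np.nan)

# Number of missing values per column after recoding.
na_counts = cleaned.isna().sum()
print(na_counts)
```

After recoding, `count` in `describe()` drops for the affected columns because missing values are excluded from the summary, which matches the reduced counts in the table above.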
