# EDA: Diagnosing Diabetes

This notebook explores a dataset on how certain diagnostic factors relate to the diabetes outcome of female patients.

Exploratory data analysis (EDA) techniques are used to inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skinfold thickness (mm)
- `Insulin`: 2-hour serum insulin (μU/mL)
- `BMI`: Body mass index (weight in kg / (height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

In [12]:
import pandas as pd
import numpy as np

In [11]:
# load the data
diabetes_data = pd.read_csv('diabetes.csv', sep=',', encoding='utf-8')

In [43]:
# print the first 10 rows
diabetes_data.head(10)

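|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | NaN | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5.0 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3.0 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10.0 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2.0 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8.0 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |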


In [46]:
# number of rows and columns, via the .shape attribute
diabetes_data.shape

(768, 9)

In [47]:
# list the column names
diabetes_data.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [48]:
# check the data type of each column, via the .dtypes attribute
diabetes_data.dtypes

Pregnancies                 float64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object
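
`Outcome` is documented as a 0/1 class variable, yet it is reported with dtype `object`, which usually means at least one entry is not numeric. A hedged sketch of how that could be checked follows; the frame `sample` and the stray value `'O'` in it are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for diabetes_data['Outcome'] (values are invented)
sample = pd.DataFrame({"Outcome": ["1", "0", "O", "1"]})

# List the distinct values to spot anything other than 0 and 1
print(sample["Outcome"].unique())

# Coerce to numbers; entries that cannot be parsed become NaN
as_number = pd.to_numeric(sample["Outcome"], errors="coerce")

# Keep only the rows whose Outcome could not be parsed as a number
bad_rows = sample[as_number.isnull()]
print(bad_rows)
```

With the real data, the same two steps would be applied to `diabetes_data['Outcome']`.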

In [17]:
# check which columns contain missing data
print(diabetes_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB
None


In [23]:
# another way to check which columns contain missing data
print(diabetes_data.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [44]:
# calculate summary statistics on diabetes_data using the .describe() method
diabetes_data.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 657.0 | 763.0 | 733.0 | 541.0 | 394.0 | 757.0 | 768.0 | 768.0 |
| mean | 4.494673 | 121.686763 | 72.405184 | 29.15342 | 155.548223 | 32.457464 | 0.471876 | 33.240885 |
| std | 3.217291 | 30.535641 | 12.382158 | 10.476982 | 118.775855 | 6.924988 | 0.331329 | 11.760232 |
| min | 1.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 |
| 25% | 2.0 | 99.0 | 64.0 | 22.0 | 76.25 | 27.5 | 0.24375 | 24.0 |
| 50% | 4.0 | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 | 0.3725 | 29.0 |
| 75% | 7.0 | 141.0 | 80.0 | 36.0 | 190.0 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


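The summary above was produced after the zero replacement in the next cell (the cells were run out of order), so its `min` row no longer shows zeros. In the raw data, before that replacement, the `min` value of `Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` is 0.

The maximum value of the `Insulin` column is 846, which is abnormally high.
The maximum value of the `Pregnancies` column is 17.

These values need a closer look.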
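
One way to look closer is to count how many zeros each measurement column holds, since a zero in a column such as `Glucose` or `BMI` is more likely a placeholder for a missing reading than a real measurement. The sketch below uses a small made-up frame named `sample` (hypothetical values, not rows from the dataset) so that it runs on its own; with the real data, the same expression would be applied to the relevant columns of `diabetes_data`.

```python
import pandas as pd

# Hypothetical toy data standing in for diabetes_data (values are invented)
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Compare every cell with 0, then sum the True values per column
zero_counts = (sample == 0).sum()
print(zero_counts)
```

A column with many zeros is a candidate for having its zeros converted to `NaN`, so that they are treated as missing rather than as real measurements.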

In [30]:
# replace the instances of 0 with NaN in these six columns
zero_as_missing = ['Pregnancies',
                   'Glucose',
                   'BloodPressure',
                   'SkinThickness',
                   'Insulin',
                   'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)

In [33]:
# check again for missing data, using two methods
print(diabetes_data.isnull().sum())
print()
print(diabetes_data.info())

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               657 non-null    float64
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object

In [50]:
# print the first 10 rows that contain missing (null) values
diabetes_data[diabetes_data.isnull().any(axis=1)].head(10)

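|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 4 | NaN | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5.0 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 7 | 10.0 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 9 | 8.0 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |
| 10 | 4.0 | 110.0 | 92.0 | NaN | NaN | 37.6 | 0.191 | 30 | 0 |
| 11 | 10.0 | 168.0 | 74.0 | NaN | NaN | 38.0 | 0.537 | 34 | 1 |
| 12 | 10.0 | 139.0 | 80.0 | NaN | NaN | 27.1 | 1.441 | 57 | 0 |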


Most rows with missing data have missing values in more than one column. Of the ten rows shown, nine are also missing a value in the `Insulin` column; the exception is row 4, which is missing only `Pregnancies`.
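
These observations can be checked across the whole table rather than read off the first ten rows. A minimal sketch, on an invented frame (`sample` and its values are assumptions for illustration, not rows from the dataset): it counts the missing values in each incomplete row and measures what share of those rows are also missing `Insulin`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for diabetes_data (values are invented)
sample = pd.DataFrame({
    "Pregnancies": [6.0, np.nan, 8.0, 1.0],
    "SkinThickness": [35.0, 35.0, np.nan, 23.0],
    "Insulin": [np.nan, 168.0, np.nan, 94.0],
})

# Rows with at least one missing value
incomplete = sample[sample.isnull().any(axis=1)]

# Number of missing values in each incomplete row
missing_per_row = incomplete.isnull().sum(axis=1)

# Share of incomplete rows that are missing more than one value
share_multi = (missing_per_row > 1).mean()

# Share of incomplete rows that are also missing Insulin
share_insulin = incomplete["Insulin"].isnull().mean()

print(share_multi, share_insulin)
```

A `share_insulin` below 1 would mean that some incomplete rows still have an `Insulin` reading.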