# EDA: Diagnosing Diabetes

Inspecting, Cleaning and Validating Data

## Initial Inspection

In [2]:
import pandas as pd
import numpy as np

# load in data
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head(5))


   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


In [13]:
# print number of columns
print(len(diabetes_data.columns))

9


In [16]:
# print a summary of the DataFrame; the RangeIndex line reports the number of rows
print(diabetes_data.info())

# the data has 768 rows


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB
None
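If only the dimensions are needed, `DataFrame.shape` returns them directly as a `(rows, columns)` tuple. A minimal sketch, using a small made-up frame (`toy`, not part of the original notebook) in place of `diabetes.csv`:

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": ["1", "0", "1"],
})

# .shape is an attribute, not a method: (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```

Applied to `diabetes_data`, the same attribute should agree with the 768 entries and 9 columns reported by `.info()`.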


## Further Inspection

In [11]:
# check whether any columns contain null values; there are two ways to do this:

# print(diabetes_data.info())
# OR
print(diabetes_data.isnull().sum())
# there are no null values

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


While it's technically true that none of the columns contain null values, that doesn't necessarily mean that the data isn't missing any values.
To investigate further, calculate the summary statistics on `diabetes_data` using the `.describe()` method.

In [10]:
# perform summary statistics
print(diabetes_data.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age  
count  768.000000                768.000000  768.000000  
mean    31.992578                  0.471876   33.240885  
std      7.884160                  0.331329   11.760232  
min      0.000000                  0.078000   21.000000  
25%     27.300000        

Looking at the summary statistics, I noticed that the following columns have a minimum value of 0, which isn't possible for a living person:

   - `Glucose`
   - `BloodPressure`
   - `SkinThickness`
   - `Insulin`
   - `BMI`

Additionally, `BMI`, `Pregnancies` and `Insulin` appear to contain outliers, since their maximum values lie far above the rest of their distributions (for example, `Insulin` peaks at 846 while its 75th percentile is 127.25).
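Before treating the zeros as missing, it can help to count how many there are in each suspect column. The following is a rough sketch on a made-up frame (`toy` and `suspect_cols` are illustrative names, not from the original notebook):

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data (values are illustrative only)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
})

suspect_cols = ["Glucose", "BloodPressure", "Insulin"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

On the real data, these counts should match the null counts obtained once the zeros are replaced with `NaN`.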

To get a more accurate view of the missing values in the data, replace the instances of `0` with `NaN` in the five columns mentioned:

In [18]:
# replace instances of 0 with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)

Re-check for missing (null) values in all of the columns, just like earlier.


In [19]:
# find whether columns contain null values after replacements are made
print(diabetes_data.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
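Raw counts are easier to judge as a share of the 768 rows; for instance, 374 missing `Insulin` values is roughly 49% of the dataset. One way to compute this, sketched on a made-up frame (`toy` is an illustrative name, not from the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data after the zero-to-NaN replacement
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
})

# The mean of a boolean mask is the fraction of True values in each column
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```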


Take a closer look at these rows to get a better idea of _why_ some data might be missing.

    

In [20]:
# print rows with missing values
print(diabetes_data[diabetes_data.isnull().any(axis=1)])

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6    148.0           72.0           35.0      NaN  33.6   
1              1     85.0           66.0           29.0      NaN  26.6   
2              8    183.0           64.0            NaN      NaN  23.3   
5              5    116.0           74.0            NaN      NaN  25.6   
7             10    115.0            NaN            NaN      NaN  35.3   
..           ...      ...            ...            ...      ...   ...   
761            9    170.0           74.0           31.0      NaN  44.0   
762            9     89.0           62.0            NaN      NaN  22.5   
764            2    122.0           70.0           27.0      NaN  36.8   
766            1    126.0           60.0            NaN      NaN  30.1   
767            1     93.0           70.0           31.0      NaN  30.4   

     DiabetesPedigreeFunction  Age Outcome  
0                       0.627   50       1  
1                    

There seem to be patterns, or overlaps, in the missing data: missing `SkinThickness` and `Insulin` values mostly occur together in the same rows.
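The overlap can be checked more rigorously than by eye. A possible approach, sketched here on a made-up frame (`toy`, `overlap` and `both_missing` are illustrative names), is to cross-tabulate the two missingness masks:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data after the zero-to-NaN replacement
toy = pd.DataFrame({
    "SkinThickness": [35.0, np.nan, np.nan, 23.0, np.nan],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, np.nan],
})

skin_missing = toy["SkinThickness"].isnull()
insulin_missing = toy["Insulin"].isnull()

# 2x2 table: how often each combination of missing / present occurs
overlap = pd.crosstab(
    skin_missing,
    insulin_missing,
    rownames=["SkinThickness missing"],
    colnames=["Insulin missing"],
)
print(overlap)

# Number of rows where both measurements are missing at once
both_missing = (skin_missing & insulin_missing).sum()
print(both_missing)
```

If most rows with a missing `SkinThickness` also have a missing `Insulin`, the values are probably not missing independently of each other.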

In [22]:
# take a closer look at the data types of each column in diabetes_data
# print data types using the .dtypes attribute
print(diabetes_data.dtypes)

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object


The `Outcome` column doesn't match the expected dtype: it's of type `object` (string) instead of `int64`. I therefore printed out the unique values in the `Outcome` column.

In [23]:
# print unique values of Outcome column
print(diabetes_data.Outcome.unique())

['1' '0' 'O']


In [27]:
# the column mixes the digit strings '1' and '0' with the letter 'O';
# resolve this by replacing 'O' with '0' before casting to int

diabetes_data['Outcome'] = diabetes_data['Outcome'].replace('O', '0').astype('int')
print(diabetes_data.Outcome.unique())

[1 0]
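A slightly more defensive variant of this cleanup is sketched below. It assumes the only known typo is the letter `'O'` and checks that nothing else unexpected slipped through before casting (`raw`, `cleaned` and `unexpected` are illustrative names, not from the original notebook):

```python
import pandas as pd

# Hypothetical stand-in for the raw Outcome column
raw = pd.Series(["1", "0", "O", "1"], name="Outcome")

# Fix the known typo, then convert; errors="coerce" turns any other
# non-numeric entry into NaN instead of raising
cleaned = pd.to_numeric(raw.replace("O", "0"), errors="coerce")

# Anything that is not 0 or 1 (including NaN) is flagged as unexpected
unexpected = cleaned[~cleaned.isin([0, 1])]
assert unexpected.empty, f"unexpected Outcome values: {unexpected.tolist()}"

cleaned = cleaned.astype("int64")
print(cleaned.unique())
```

The check fails loudly if a new kind of bad entry appears, rather than silently producing a wrong label.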


## Next Steps:

<!-- building a machine learning model to accurately predict whether or not the patients in the dataset have diabetes -->