# Exploratory Data Analysis (EDA): Diagnosing Diabetes

The goal of this project is to explore data on how certain diagnostic factors affect the diabetes outcome of female patients. We are going to use EDA skills to inspect, clean, and validate the data.

This dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. 


First, let's import libraries and the dataset itself to prepare it for further work.

In [95]:
import pandas as pd
import numpy as np

# Import dataset
diabetes = pd.read_csv('diabetes.csv')

Now let's look at the first few rows.

In [96]:
diabetes.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


We can see that it contains the following columns:

- `Pregnancies`: number of times pregnant
- `Glucose`: plasma glucose concentration (2 hours into an oral glucose tolerance test)
- `BloodPressure`: diastolic blood pressure (mm Hg)
- `SkinThickness`: triceps skinfold thickness (mm)
- `Insulin`: 2-hour serum insulin (mu U/ml)
- `BMI`: body mass index (weight in kg / (height in m)^2)
- `DiabetesPedigreeFunction`: diabetes pedigree function
- `Age`: age (years)
- `Outcome`: class variable (0 or 1)

We can quickly get the number of rows (768) and columns (9).

In [97]:
diabetes.shape

(768, 9)

We can also get a technical summary of the whole dataset.

In [98]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


It seems that no values are missing, which is unusual for real-world data, so we'll need to verify this as we go.

We can also note that every column is numeric except `Outcome`, which is stored as `object`. That doesn't make much sense, because the column should contain only numbers. So let's try to find the cause of this flaw.

In [99]:
diabetes.Outcome.unique()

array(['1', '0', 'O'], dtype=object)

We see the letter "O" here as a third kind of value. Since the column otherwise consists of ones and zeros, it's probably just a typo for the digit 0, so let's replace it with 0.

In [100]:
# Replace the stray letter 'O' with the digit '0', then convert the column to numbers
diabetes['Outcome'] = pd.to_numeric(diabetes['Outcome'].replace('O', '0'))

That should also fix the issue with the datatype.

In [101]:
diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

And it did. 
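
Before applying a fix like this, it can help to see exactly which entries fail to parse and how often they occur. The following is a minimal sketch on a toy Series (the values are made up for illustration and are not the real `Outcome` column); it assumes `pd.to_numeric` with `errors='coerce'` is an acceptable way to flag the offending entries.

```python
import pandas as pd

# Toy stand-in for the Outcome column (hypothetical values, not the real data)
outcome = pd.Series(['1', '0', 'O', '1', 'O'], name='Outcome')

# Entries that cannot be parsed as numbers become NaN instead of raising an error
parsed = pd.to_numeric(outcome, errors='coerce')

# Show the raw entries that failed to parse and how often each occurs
bad_entries = outcome[parsed.isna()].value_counts()
print(bad_entries)

# Replace only exact matches of the stray letter, then convert
cleaned = pd.to_numeric(outcome.replace('O', '0'))
print(cleaned.tolist())
```

Counting the bad entries first makes it easier to judge whether a simple replacement is safe or whether the affected rows need a closer look.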

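What about missing values? `info()` reported none, but a value of 0 is not physiologically plausible for glucose, blood pressure, skin thickness, insulin, or BMI, so zeros in these columns most likely stand for missing measurements. Let's replace them with `NaN` to find out how many values are actually missing.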

In [102]:
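# Columns where 0 is not a plausible measurement
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Use np.nan; the np.NaN alias was removed in NumPy 2.0
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)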

Let's check the result.

In [103]:
print(diabetes.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

We can also list the rows that contain at least one missing value.


In [104]:
diabetes[diabetes.isnull().any(axis=1)]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
5,5,116.0,74.0,,,25.6,0.201,30,0
7,10,115.0,,,,35.3,0.134,29,0
...,...,...,...,...,...,...,...,...,...
761,9,170.0,74.0,31.0,,44.0,0.403,43,1
762,9,89.0,62.0,,,22.5,0.142,33,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
766,1,126.0,60.0,,,30.1,0.349,47,1
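
Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that calculation on a toy frame (the values are made up for illustration); the same expression could be applied to `diabetes`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0],
    'Insulin': [np.nan, np.nan, 94.0, 168.0],
    'Age': [50, 31, 32, 21],
})

# isnull() yields booleans; their mean per column is the fraction missing
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show the columns with the most gaps first
print(missing_pct.sort_values(ascending=False))
```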


Almost half of the `Insulin` values (374 of 768, about 49%) and roughly 30% of the `SkinThickness` values (227 of 768) turn out to be missing, even though `info()` initially reported none. That's a lot. Depending on how much data is missing, we might choose to remove specific rows, impute the missing values, or get the absent data from the original source if possible.
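
Two of these options can be sketched as follows. This is only an illustration on a toy frame with made-up values; median imputation is one assumption among many possible strategies, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0, 137.0],
    'Insulin': [np.nan, 94.0, np.nan, 168.0, 100.0],
    'Outcome': [1, 0, 1, 0, 1],
})

# Option 1: drop rows where a rarely missing column is empty
dropped = toy.dropna(subset=['Glucose'])

# Option 2: fill each gap with the median of its own column
cols = ['Glucose', 'Insulin']
imputed = toy.copy()
imputed[cols] = imputed[cols].fillna(imputed[cols].median())

print(len(toy), len(dropped))
print(imputed)
```

Dropping rows is usually reasonable when only a handful are affected, while imputation keeps the sample size but can distort the distribution when a large share of a column is filled in.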

As part of our EDA, we can also look at a short statistical summary of the dataset.

In [105]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885,0.348958
std,3.369578,30.535641,12.382158,10.476982,118.775855,6.924988,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.0,64.0,22.0,76.25,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.0,29.0,125.0,32.3,0.3725,29.0,0.0
75%,6.0,141.0,80.0,36.0,190.0,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


If we check this summary for outliers, we find that the `Insulin` maximum (846) looks suspicious next to the 75th percentile (190), and 17 pregnancies is not very likely to occur. Both definitely deserve a more thorough investigation outside of this EDA.
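
One common way to flag such values is the 1.5 x IQR rule. The sketch below applies it to a toy Series with made-up values; the rule itself is an assumption here, chosen for illustration rather than taken from the original analysis.

```python
import pandas as pd

# Toy insulin-like values (hypothetical, not the real data)
insulin = pd.Series([14, 76, 100, 125, 150, 190, 846], name='Insulin')

# Interquartile range: spread of the middle 50% of the values
q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = insulin[(insulin < lower) | (insulin > upper)]

print(outliers)
```

A flagged value is only a candidate for review; whether it is a measurement error or a genuine extreme needs domain knowledge.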

Finally, we can count how often each distinct row (combination of column values) occurs with the `value_counts` method.

In [106]:
diabetes.value_counts()

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
0            74.0     52.0           10.0           36.0     27.8  0.269                     22   0          1
4            117.0    64.0           27.0           120.0    33.2  0.230                     24   0          1
             111.0    72.0           47.0           207.0    37.1  1.390                     56   1          1
             110.0    76.0           20.0           100.0    28.4  0.118                     27   0          1
             109.0    64.0           44.0           99.0     34.8  0.905                     26   1          1
                                                                                                            ..
1            131.0    64.0           14.0           415.0    23.7  0.389                     21   0          1
             130.0    70.0           13.0           105.0    25.9  0.472                     22   0          1
      

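By default, the resulting Series is sorted in descending order, so the first element is the most frequent combination. Here every count is 1, which means none of the counted rows is duplicated. Note that rows containing `NaN` are excluded by default.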
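
To see how often each value occurs within a single column, `value_counts` can be called on that column rather than on the whole frame. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Pregnancies': [6, 1, 8, 1, 0, 1],
    'Outcome': [1, 0, 1, 0, 0, 0],
})

# Absolute frequency of each class label, most frequent first
outcome_counts = toy['Outcome'].value_counts()

# normalize=True returns proportions instead of counts
outcome_share = toy['Outcome'].value_counts(normalize=True)

# The same call works for any other column
preg_counts = toy['Pregnancies'].value_counts()

print(outcome_counts)
print(outcome_share)
print(preg_counts)
```

For a class column like `Outcome`, the proportions give a quick read on class balance.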

## Conclusion

In this project, we saw how EDA can help with the initial data inspection and cleaning process. This is an important step, as it helps keep our datasets clean and reliable.
Going forward with the analysis, we will of course need to deal with all the flaws mentioned above, but the important point here is that EDA tells us which parts of the data require special attention.