# EDA: Diagnosing Diabetes

In this project, I explore data on how certain diagnostic factors affect the diabetes outcome of female patients.

I will inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure
- `SkinThickness`: Triceps skinfold thickness
- `Insulin`: 2-Hour serum insulin
- `BMI`: Body mass index
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

## Initial Inspection

1. First, let's familiarize ourselves with the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

Expected data type for each column:

- `Pregnancies`: integer
- `Glucose`: integer
- `BloodPressure`: integer
- `SkinThickness`: integer
- `Insulin`: integer
- `BMI`: float
- `DiabetesPedigreeFunction`: float
- `Age`: integer
- `Outcome`: integer (0 or 1)

2. Next, let's load in the diabetes data to start exploring.

   Load the data in a variable called `diabetes_data` and print the first few rows.
   
   **Note**: The data is stored in a file called `diabetes.csv`.

In [24]:
import pandas as pd
import numpy as np

# load in data
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data

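| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |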


3. How many columns (features) does the data contain?

In [3]:
print(len(diabetes_data.columns))

9


_The dataset has **9 columns**._

4. How many rows (observations) does the data contain?

In [4]:
print(len(diabetes_data))

768


_The dataset has **768 rows**._
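
As a side note, both counts can be read in one step from the `.shape` attribute. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `diabetes.csv`):

```python
import pandas as pd

# Toy stand-in for diabetes_data (hypothetical values)
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "Outcome": [1, 0, 1],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real data, `diabetes_data.shape` should agree with the two `len(...)` calls above.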

## Further Inspection

5. Let's check whether any of the columns in `diabetes_data` contain null (missing) values.

In [5]:
print(diabetes_data.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
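
A few other ways to ask the same question are sketched below on a made-up frame (the `toy` data is an assumption for illustration and deliberately contains one `NaN`): `.isna().any()` gives a yes/no answer per column, a double `.sum()` gives the overall count, and `.info()` prints non-null counts next to the dtypes.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing BMI value (hypothetical values)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, np.nan, 23.3],
})

# True for every column that holds at least one null
has_nulls = toy.isna().any()
print(has_nulls)

# Total number of null cells in the whole frame
total_nulls = int(toy.isna().sum().sum())
print(total_nulls)

# Non-null counts and dtypes per column
toy.info()
```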


6. At first glance, the answer to the previous question is 'no'. However, while it's technically true that none of the columns contain null values, that doesn't necessarily mean that no data is missing. When exploring data, we should always question our assumptions and dig deeper.
   
   To investigate further, let's calculate summary statistics on `diabetes_data` using the `.describe()` method.

In [7]:
print(diabetes_data.describe(include='all').round(2))

        Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin     BMI  \
count        768.00   768.00         768.00         768.00   768.00  768.00   
unique          NaN      NaN            NaN            NaN      NaN     NaN   
top             NaN      NaN            NaN            NaN      NaN     NaN   
freq            NaN      NaN            NaN            NaN      NaN     NaN   
mean           3.85   120.89          69.11          20.54    79.80   31.99   
std            3.37    31.97          19.36          15.95   115.24    7.88   
min            0.00     0.00           0.00           0.00     0.00    0.00   
25%            1.00    99.00          62.00           0.00     0.00   27.30   
50%            3.00   117.00          72.00          23.00    30.50   32.00   
75%            6.00   140.25          80.00          32.00   127.25   36.60   
max           17.00   199.00         122.00          99.00   846.00   67.10   

        DiabetesPedigreeFunction     Age Outcome  


7. Looking at the summary statistics, what can we say about the following columns?

   - `Glucose`
   - `BloodPressure`
   - `SkinThickness`
   - `Insulin`
   - `BMI`

_The minimum value for each of these five columns is `0`, which is not plausible for a real measurement. These zeros also sit far from the respective medians and means, another sign that something is off. One way to interpret this is that `0` stands in for missing values in the data._
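
To see how common these zeros are, one option is to count them per column. This is a minimal sketch on made-up data; `toy` and `suspect_cols` are hypothetical names, and the values are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for a few of the affected columns (hypothetical values)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

suspect_cols = ["Glucose", "Insulin", "BMI"]

# Comparing to 0 gives a boolean frame; summing counts the True cells per column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```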

8. Let's try to find any other outliers in the data.

In addition to the `0` values that show up for the five columns above, there appear to be additional outliers, such as:

- The maximum value of the `Insulin` column is 846, which is far above the 75th percentile of 127.25.
- The maximum value of the `Pregnancies` column is 17. While having 17 pregnancies is not impossible, this case might be worth a closer look to determine its accuracy.
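
One common rule of thumb for flagging such values is the 1.5 × IQR rule. It is not part of the original analysis; the sketch below applies it to a made-up insulin series, and the 1.5 multiplier is a convention rather than a clinical threshold:

```python
import pandas as pd

# Toy insulin readings (hypothetical values)
insulin = pd.Series([14, 76, 94, 125, 168, 190, 846], name="Insulin")

# Interquartile range
q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = insulin[(insulin < lower) | (insulin > upper)]
print(outliers)
```

A flagged value is only a candidate for review; whether it is an error still depends on domain knowledge.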

9. Let's see if we can get a more accurate view of the missing values in the data. We first replace the instances of `0` with `NaN` in the five columns mentioned:
   

In [17]:
# columns where a 0 is treated as a missing measurement
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)
diabetes_data.describe(include='all')

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 763.0 | 733.0 | 541.0 | 394.0 | 757.0 | 768.0 | 768.0 | 768.0 |
| unique | | | | | | | | | 3.0 |
| top | | | | | | | | | 0.0 |
| freq | | | | | | | | | 494.0 |
| mean | 3.845052 | 121.686763 | 72.405184 | 29.15342 | 155.548223 | 32.457464 | 0.471876 | 33.240885 | |
| std | 3.369578 | 30.535641 | 12.382158 | 10.476982 | 118.775855 | 6.924988 | 0.331329 | 11.760232 | |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | |
| 25% | 1.0 | 99.0 | 64.0 | 22.0 | 76.25 | 27.5 | 0.24375 | 24.0 | |
| 50% | 3.0 | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 | 0.3725 | 29.0 | |
| 75% | 6.0 | 141.0 | 80.0 | 36.0 | 190.0 | 36.6 | 0.62625 | 41.0 | |


_In the context of the dataset, a `0` in the `Pregnancies` column is a valid value, indicating that the individual has never been pregnant, so it should not be replaced. After the replacement, the `min` row shows non-zero values for the five affected columns._

10. Now, let's check for missing (null) values in all of the columns just like we did in Step 5.

    For the same five columns, how many missing values are there now?

In [14]:
print(diabetes_data.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64


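_After replacing `0` with `NaN`, the five columns **now contain missing values**: `Glucose` has 5, `BloodPressure` 35, `SkinThickness` 227, `Insulin` 374, and `BMI` 11._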

11. Let's take a closer look at these rows to get a better idea of _why_ some data might be missing.

In [18]:
print(diabetes_data[diabetes_data.isnull().any(axis=1)])

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6    148.0           72.0           35.0      NaN  33.6   
1              1     85.0           66.0           29.0      NaN  26.6   
2              8    183.0           64.0            NaN      NaN  23.3   
5              5    116.0           74.0            NaN      NaN  25.6   
7             10    115.0            NaN            NaN      NaN  35.3   
..           ...      ...            ...            ...      ...   ...   
761            9    170.0           74.0           31.0      NaN  44.0   
762            9     89.0           62.0            NaN      NaN  22.5   
764            2    122.0           70.0           27.0      NaN  36.8   
766            1    126.0           60.0            NaN      NaN  30.1   
767            1     93.0           70.0           31.0      NaN  30.4   

     DiabetesPedigreeFunction  Age Outcome  
0                       0.627   50       1  
1                    

12. Going through the rows with missing data, we can look for patterns or overlaps between the missing values.

_Most rows with missing data have missing values in more than one column. In fact, almost all rows with at least one missing value are also missing a value in the `Insulin` column. This is a clue as to why the data is missing: patients whose insulin was not measured may also not have had these other measurements taken._
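
This overlap can also be checked numerically instead of by eye. The sketch below runs on a made-up frame (`toy` and its values are assumptions for illustration); the same expressions should work on `diabetes_data` once the zeros have been replaced with `NaN`:

```python
import numpy as np
import pandas as pd

# Toy frame with overlapping gaps (hypothetical values)
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "SkinThickness": [35.0, np.nan, np.nan, 23.0, np.nan],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
})

# Rows with at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

# Share of those rows that are also missing Insulin
share_missing_insulin = rows_with_missing["Insulin"].isnull().mean()
print(share_missing_insulin)

# Number of missing values in each row
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row.value_counts().sort_index())
```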

13. Now, let's take a closer look at the data types of each column in `diabetes_data` to check whether the results match what we expect.

In [20]:
#print(diabetes_data.head())
print(diabetes_data.dtypes)

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object


14. Let's try to figure out why the `Outcome` column is of type `object` (string) instead of type `int64`.

In [21]:
print(diabetes_data.Outcome.unique())

['1' '0' 'O']


_There is a typo: **an `'O'` (the capital letter) was entered as a value instead of a `'0'`**, so pandas stored the column as strings (`object`). We need to fix this before changing the `Outcome` column's data type from `object` to a numeric or boolean type (if we want to do so)._
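
When a column has too many distinct values for `.unique()` to be readable, one way to isolate the offending entries is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. A minimal sketch on a made-up series (the values are illustrative, not the real column):

```python
import pandas as pd

# Toy Outcome column with a letter 'O' typo (hypothetical values)
outcome = pd.Series(["1", "0", "O", "1", "0"], name="Outcome")

# Unparseable strings become NaN instead of raising an error
as_numbers = pd.to_numeric(outcome, errors="coerce")

# Keep only the original entries that failed to parse
bad_values = outcome[as_numbers.isna()]
print(bad_values.unique())
```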

15. How might we resolve this issue?

In [22]:
# replace the letter 'O' with the digit '0', then convert the column to integers
diabetes_data['Outcome'] = diabetes_data['Outcome'].replace('O', '0')
#print(diabetes_data.Outcome.unique())
diabetes_data['Outcome'] = diabetes_data['Outcome'].astype('int64')
print(diabetes_data['Outcome'].unique())

[1 0]


16. How many patients have a positive diabetes outcome (`1`)? How many do not?

In [23]:
# store the counts in a new variable so the DataFrame is not overwritten
outcome_counts = diabetes_data.Outcome.value_counts()
print(outcome_counts)

Outcome
0    500
1    268
Name: count, dtype: int64


_268 patients have a positive diabetes outcome (`1`); 500 do not (`0`)._
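
Raw counts are easier to compare as proportions. As a small sketch on a made-up series (values are illustrative), `value_counts(normalize=True)` returns the share of each class:

```python
import pandas as pd

# Toy Outcome column (hypothetical values)
outcome = pd.Series([0, 0, 0, 1, 0, 1, 1, 0], name="Outcome")

# Absolute counts per class
counts = outcome.value_counts()
print(counts)

# Relative frequency of each class; the shares sum to 1
shares = outcome.value_counts(normalize=True)
print(shares)
```

Applied to the real column, this shows how imbalanced the two classes are, which matters if the data is later used for modelling.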