# EDA: Diagnosing Diabetes

In this project, you'll imagine you are a data scientist interested in exploring data that looks at how certain diagnostic factors affect the diabetes outcome of women patients.

You will use your EDA skills to help inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure
- `SkinThickness`: Triceps skinfold thickness
- `Insulin`: 2-Hour serum insulin
- `BMI`: Body mass index
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

Let's get started!

## Initial Inspection

1. First, familiarize yourself with the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

   Look at each of the nine columns in the documentation.
   
   What do you expect each data type to be?

Expected data type for each column:

- `Pregnancies`: INT
- `Glucose`: FLOAT
- `BloodPressure`: FLOAT
- `SkinThickness`: INT
- `Insulin`: FLOAT
- `BMI`: FLOAT
- `DiabetesPedigreeFunction`: FLOAT
- `Age`: INT
- `Outcome`: BOOL/BINARY

2. Next, let's load in the diabetes data to start exploring.

   Load the data in a variable called `diabetes_data` and print the first few rows.
   
   **Note**: The data is stored in a file called `diabetes.csv`.

In [50]:
import pandas as pd
import numpy as np

# load in data
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
0            6      148             72             35        0  33.6  \
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


3. How many columns (features) does the data contain?

In [51]:
# print number of columns
print(len(diabetes_data.columns))


9


4. How many rows (observations) does the data contain?

In [52]:
# print number of rows
print(len(diabetes_data))

768
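
Steps 3 and 4 can also be answered in one call. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real `diabetes_data`) to show that `.shape` returns the row and column counts together.

```python
import pandas as pd

# Hypothetical three-row frame standing in for diabetes_data.
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "Outcome": ["1", "0", "1"],
})

# .shape is a (rows, columns) tuple, so both counts come from one attribute.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real data, `diabetes_data.shape` should agree with the two `len()` calls above.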


## Further Inspection

5. Let's inspect `diabetes_data` further.

   Do any of the columns in the data contain null (missing) values?

In [53]:
# find whether columns contain null values
print(diabetes_data.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


6. If you answered no to the question above, not so fast!

   While it's technically true that none of the columns contain null values, that doesn't necessarily mean that the data isn't missing any values.
   
   When exploring data, you should always question your assumptions and try to dig deeper.
   
   To investigate further, calculate summary statistics on `diabetes_data` using the `.describe()` method.

In [54]:
# perform summary statistics
print(diabetes_data.info())

print('\n')

print(diabetes_data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB
None


       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin   
count   768.000000  768.000000     768.000000     768.000000  768.000000  \
mean      3.845052  120.894531      69.105469      20.536458   79.799

7. Looking at the summary statistics, do you notice anything odd about the following columns?

   - `Glucose`
   - `BloodPressure`
   - `SkinThickness`
   - `Insulin`
   - `BMI`

In [55]:
# comparing some variables side by side
for column in diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]:
    print(column, {'MEAN': diabetes_data[column].mean(), 'MAX': diabetes_data[column].max(), 'STD': diabetes_data[column].std()})


Glucose {'MEAN': 120.89453125, 'MAX': 199, 'STD': 31.97261819513622}
BloodPressure {'MEAN': 69.10546875, 'MAX': 122, 'STD': 19.355807170644777}
SkinThickness {'MEAN': 20.536458333333332, 'MAX': 99, 'STD': 15.952217567727637}
Insulin {'MEAN': 79.79947916666667, 'MAX': 846, 'STD': 115.24400235133817}
BMI {'MEAN': 31.992578124999998, 'MAX': 67.1, 'STD': 7.884160320375446}


**Your response to question 7**:
# <span style='color:red'>The mean seems a little off in several columns when you compare the mean, max, and standard deviation values. I've read that if the max is more than three standard deviations away from the mean, the column most likely contains some outliers.</span>
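
The three-standard-deviation rule of thumb mentioned in this response can be sketched as a z-score check. The values below are made up for illustration (they are not rows from `diabetes.csv`), and the rule itself is only a heuristic that works best on roughly bell-shaped data.

```python
import pandas as pd

# Made-up readings: twenty ordinary values plus one extreme value.
values = pd.Series(list(range(80, 120, 2)) + [846], name="Insulin")

# z-score: distance from the mean, measured in standard deviations.
z_scores = (values - values.mean()) / values.std()

# Keep only the readings more than three standard deviations from the mean.
flagged = values[z_scores.abs() > 3]
print(flagged.tolist())
```

Note that an extreme value also inflates the mean and standard deviation it is compared against, so on very small samples this check can miss outliers entirely.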

8. Do you spot any other outliers in the data?

**Your response to question 8**:
# <span style='color:red'>There are quite a few missing values, and the maximum `Insulin` value seems high. Both could distort the mean and the apparent spread of the data.</span>
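
One way to make the hidden missing values visible is to count how often each column contains exactly `0`. This is a minimal sketch on made-up rows (`toy` is hypothetical), assuming that `0` was used as a placeholder for "not recorded" in these columns.

```python
import pandas as pd

# Made-up rows; 0 stands in for a measurement that was not recorded.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Comparing to 0 gives a True/False frame; summing counts the True values per column.
zero_counts = (toy == 0).sum()
print(zero_counts)

# A minimum of 0 is the same warning sign that .describe() would show.
print(toy.min())
```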

9. Let's see if we can get a more accurate view of the missing values in the data.

   Use the following code to replace the instances of `0` with `NaN` in the five columns mentioned:
   
   ```py
   diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)
   ```

In [56]:
# replace instances of 0 with np.nan
# diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)

# Replace 0 with np.nan using for loop
for column in diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']]:
    diabetes_data[column] = diabetes_data[column].replace(0, np.nan)

print(diabetes_data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
0            6    148.0           72.0           35.0      NaN  33.6  \
1            1     85.0           66.0           29.0      NaN  26.6   
2            8    183.0           64.0            NaN      NaN  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


10. Next, check for missing (null) values in all of the columns just like you did in Step 5.

    Now how many missing values are there?

In [57]:
# find whether columns contain null values after replacements are made
print(diabetes_data.isnull().sum())

print(diabetes_data.head(10))

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
0            6    148.0           72.0           35.0      NaN  33.6  \
1            1     85.0           66.0           29.0      NaN  26.6   
2            8    183.0           64.0            NaN      NaN  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   
5            5    116.0           74.0            NaN      NaN  25.6   
6            3     78.0           50.0           32.0     88.0  31.0   
7           10    115.0            NaN            NaN      NaN  35.3   
8            2    197.0           70.0           45

11. Let's take a closer look at these rows to get a better idea of _why_ some data might be missing.

    Print out all the rows that contain missing (null) values.

In [58]:
print(diabetes_data[diabetes_data.isnull().any(axis=1)])
print(diabetes_data.info())
 

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
0              6    148.0           72.0           35.0      NaN  33.6  \
1              1     85.0           66.0           29.0      NaN  26.6   
2              8    183.0           64.0            NaN      NaN  23.3   
5              5    116.0           74.0            NaN      NaN  25.6   
7             10    115.0            NaN            NaN      NaN  35.3   
..           ...      ...            ...            ...      ...   ...   
761            9    170.0           74.0           31.0      NaN  44.0   
762            9     89.0           62.0            NaN      NaN  22.5   
764            2    122.0           70.0           27.0      NaN  36.8   
766            1    126.0           60.0            NaN      NaN  30.1   
767            1     93.0           70.0           31.0      NaN  30.4   

     DiabetesPedigreeFunction  Age Outcome  
0                       0.627   50       1  
1                    

12. Go through the rows with missing data. Do you notice any patterns or overlaps between the missing data?

**Your response to question 12**: 
# <span style="color:red">Several rows are missing both `SkinThickness` and `Insulin`, suggesting the two variables might not be completely independent. Possibly there is some connection between the thickness of someone's skin and the ability to administer or check insulin and its levels.</span>
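
The overlap described in this response can be quantified rather than eyeballed. The sketch below cross-tabulates the missingness of two columns on made-up rows (`toy` is hypothetical); the same two lines could be pointed at `diabetes_data` after the `0` values have been replaced.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in both columns.
toy = pd.DataFrame({
    "SkinThickness": [35.0, 29.0, np.nan, 23.0, np.nan, np.nan],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, np.nan, 88.0],
})

skin_missing = toy["SkinThickness"].isnull()
insulin_missing = toy["Insulin"].isnull()

# Rows are SkinThickness missing (False/True), columns are Insulin missing (False/True).
overlap = pd.crosstab(skin_missing, insulin_missing)
print(overlap)

# Number of rows where both measurements are missing at once.
both_missing = (skin_missing & insulin_missing).sum()
print(both_missing)
```

A large count in the True/True cell relative to the other cells would support the idea that the two measurements tend to be missing together.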

13. Next, take a closer look at the data types of each column in `diabetes_data`.

    Does the result match what you would expect?

In [59]:
# print data types using .info() method
print(diabetes_data.info())
print(diabetes_data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB
None
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin   
count   768.000000  763.000000     733.000000     541.000000  394.000000  \
mean      3.845052  121.686763      72.405184      29.153420  155.54822

# <span style='color:red'>The four integer columns we called the replace method on (`Glucose`, `BloodPressure`, `SkinThickness`, and `Insulin`) changed data type from int64 to float64; `BMI` was already float64. I wasn't expecting that, though those columns do make more sense as continuous values than as discrete ones.</span>

### <span style='color:red'>I did a little research and experimenting: replacing an integer with NaN converts the whole column to float. NaN is a special floating-point value defined by the IEEE 754 standard, and the int64 type has no bit pattern reserved for "missing", so pandas upcasts the column to float64 in order to hold it. (pandas also offers a nullable `Int64` type that can store missing values alongside integers.)</span>
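
The dtype change can be reproduced on a tiny Series. This is a sketch with made-up numbers; the last step uses pandas' nullable `Int64` extension type as one possible way to keep integers alongside missing values, which is not something the project itself asks for.

```python
import numpy as np
import pandas as pd

# A plain integer Series is stored as int64.
counts = pd.Series([0, 5, 7])

# Introducing NaN forces an upcast, because int64 cannot represent NaN.
with_nan = counts.replace(0, np.nan)

# The nullable Int64 dtype stores the gap as <NA> and keeps the rest as integers.
nullable = with_nan.astype("Int64")

print(counts.dtype, with_nan.dtype, nullable.dtype)
print(nullable)
```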

14. To figure out why the `Outcome` column is of type `object` (string) instead of type `int64`, print out the unique values in the `Outcome` column.

In [60]:
# print unique values of Outcome column
print(diabetes_data.Outcome.unique())
# print count of 'O' values
print(list(diabetes_data.Outcome).count('O'))
# print rows with 'O' values
print(diabetes_data[diabetes_data.Outcome == 'O'])
print('\n')
# comparing 'O' rows with '0' rows to find some connection.. which I didn't find
print(diabetes_data[diabetes_data.Outcome == '0'].head(6))


['1' '0' 'O']
6
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
67             2    109.0           92.0            NaN      NaN  42.7  \
112            1     89.0           76.0           34.0     37.0  31.2   
134            2     96.0           68.0           13.0     49.0  21.1   
147            2    106.0           64.0           35.0    119.0  30.5   
166            3    148.0           66.0           25.0      NaN  32.5   
234            3     74.0           68.0           28.0     45.0  29.7   

     DiabetesPedigreeFunction  Age Outcome  
67                      0.845   54       O  
112                     0.192   23       O  
134                     0.647   26       O  
147                     1.400   34       O  
166                     0.256   22       O  
234                     0.293   23       O  


    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
1             1     85.0           66.0           29.0      NaN  26.6  \
3 

15. How might you resolve this issue?

**Your response to question 15**:
# <span style='color:red'>There are 6 people whose outcome was recorded as the letter 'O' instead of a digit, for a reason that isn't clear to me. Since that is such a small number of rows, one option is to remove them. For the same reason, we could instead replace every 'O' with '0' and then change the type to INT/BOOL.</span>
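
Both options from this response can be sketched on a small made-up Series (`outcome` below is hypothetical). The replacement route assumes that the letter `'O'` is a typo for the digit `'0'`, which the data alone cannot confirm.

```python
import pandas as pd

# Made-up Outcome values, including the stray letter 'O'.
outcome = pd.Series(["1", "0", "O", "1", "O"], name="Outcome")

# Option 1: treat 'O' as a typo for '0', then convert the strings to numbers.
cleaned = pd.to_numeric(outcome.replace("O", "0"))
print(cleaned.tolist())

# Option 2: drop the ambiguous rows instead of guessing their value.
dropped = pd.to_numeric(outcome[outcome != "O"])
print(dropped.tolist())
```

Dropping rows is the more cautious choice when the true value is unknown; replacing keeps all 768 observations but bakes in the typo assumption.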

## Next Steps:

16. Congratulations! In this project, you saw how EDA can help with the initial data inspection and cleaning process. This is an important step as it helps to keep your datasets clean and reliable.

    Here are some ways you might extend this project if you'd like:
    - Use `.value_counts()` to more fully explore the values in each column.
    - Investigate other outliers in the data that may be easily overlooked.
    - Instead of changing the `0` values in the five columns to `NaN`, try replacing the values with the median or mean of each column.
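
The last extension idea, filling gaps with a typical value, might look like the sketch below. It uses made-up rows (`toy` is hypothetical) and the column median, which is usually less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Made-up rows where NaN marks a missing measurement.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BMI": [33.6, np.nan, 23.3, 28.1],
})

# .median() skips NaN by default and returns one value per column;
# .fillna() then uses each column's own median to fill that column's gaps.
filled = toy.fillna(toy.median())
print(filled)
```

Imputation changes the distribution of the column (many rows end up sharing the same value), so it is worth comparing summary statistics before and after.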

In [61]:
for column in diabetes_data:
    print(diabetes_data[column].value_counts())
