# EXPLORATORY DATA ANALYSIS IN PYTHON
# EDA: Diagnosing Diabetes
In this project, you’ll imagine you are a data scientist interested in exploring data that looks at how certain diagnostic factors affect the diabetes outcome of women patients.

You will use your EDA skills to help inspect, clean, and validate the data.

Note: This dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure
- SkinThickness: Triceps skinfold thickness
- Insulin: 2-Hour serum insulin
- BMI: Body mass index
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

Let’s get started!

# Tasks




### Initial Inspection
1. First, familiarize yourself with the dataset here.

Look at each of the nine columns in the documentation.

What do you expect each data type to be?


<b>Hint


Take a look at the structure of the data and each of the columns.</b>

Here are the expected data types for each column:

- Pregnancies: int64
- Glucose: int64
- BloodPressure: int64
- SkinThickness: int64
- Insulin: int64
- BMI: float64
- DiabetesPedigreeFunction: float64
- Age: int64
- Outcome: int64

2. Next, let’s load in the diabetes data to start exploring.

Load the data in a variable called diabetes_data and print the first few rows.

Note: The data is stored in a file called diabetes.csv.


<b>Hint


Use Pandas to load in the data and then print out the first few rows:</b>

In [None]:
import pandas as pd

# Read the CSV file into a DataFrame
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

3. How many columns (features) does the data contain?


<b>Hint</b>


There are 9 columns in diabetes_data.

One method to find the number of columns is to use .columns:

In [None]:
print(len(diabetes_data.columns))

which outputs:

In [None]:
9

Another method to find the number of columns is to use .shape, which gives the full dimensions of the data as (rows, columns):

In [None]:
print(diabetes_data.shape)

which outputs:

In [None]:
(768, 9)

4. How many rows (observations) does the data contain?


<b>Hint</b>


There are 768 rows in diabetes_data.

One method to find the number of rows is to print the following:

In [None]:
print(len(diabetes_data))

which outputs:

In [None]:
768

Another method to find the number of rows is to use .shape, which gives the full dimensions of the data as (rows, columns):

In [None]:
print(diabetes_data.shape)

which outputs:

In [None]:
(768, 9)
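
Since .shape is a plain (rows, columns) tuple, both counts can be read in one step. The sketch below uses a small made-up DataFrame named `sample` as a stand-in for diabetes_data; the values are illustrative only and are not taken from diabetes.csv.

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data: three rows and two columns.
sample = pd.DataFrame({
    "Glucose": [120, 95, 140],
    "Age": [45, 29, 33],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Applied to diabetes_data, the same unpacking should give the 768 rows and 9 columns reported above.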

### Further Inspection
5. Let’s inspect diabetes_data further.

Do any of the columns in the data contain null (missing) values?


<b>Hint</b>


To find whether any columns contain missing data, you can use the .isnull().sum() method:



In [None]:
print(diabetes_data.isnull().sum())

The output of this code is:

In [None]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Alternatively, you can use the .info() method, which will print out a concise summary of the DataFrame:

In [None]:
print(diabetes_data.info())

The output should look like:

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB
None

According to the output, there are no columns with null values.

But is this really true?

6. If you answered no to the question above, not so fast!

While it’s technically true that none of the columns contain null values, that doesn’t necessarily mean that the data isn’t missing any values.

When exploring data, you should always question your assumptions and try to dig deeper.

To investigate further, calculate summary statistics on diabetes_data using the .describe() method.


<b>Hint</b>
To calculate summary statistics on diabetes_data, use the .describe() method:

In [None]:
print(diabetes_data.describe())

7. Looking at the summary statistics, do you notice anything odd about the following columns?

- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI

<b>Hint
If you take a look at the minimum values for these five columns, you’ll notice that they are all 0.

How can Blood Pressure or BMI be 0? That makes no sense! These values also seem to be way off from their respective medians and means, another indicator that something is off.

One way to interpret this is that these are missing values in the data.
</b>
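
To see how many suspicious zeros each column holds, rather than only noticing that the minimum is 0, one option is to compare the columns to 0 and sum the result. This is a minimal sketch on a made-up DataFrame (`sample` and its values are assumptions for illustration, not rows from diabetes.csv).

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "Glucose": [120, 0, 95, 140],
    "BloodPressure": [70, 80, 0, 0],
    "BMI": [31.2, 0.0, 27.5, 24.9],
})

# Comparing to 0 gives a boolean DataFrame; summing counts the True values per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

The same expression could be applied to the five columns listed in Step 7 of diabetes_data.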

8. Do you spot any other outliers in the data?


<B>Hint
In addition to the 0 values that show up for the five columns above, there appear to be additional outliers, such as:

- The maximum value of the Insulin column is 846, which is abnormally high.
- The maximum value of the Pregnancies column is 17. While having 17 pregnancies is not impossible, this case might be worth a closer look to determine its accuracy.

As you can see, EDA helps inform the data cleaning process by helping catch things that aren’t immediately obvious.
</b>
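
Spotting outliers by eye in the .describe() output works for a first pass. One common rule of thumb, which this project does not itself prescribe, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a made-up Series; only the value 846 comes from the hint above, and the other numbers are invented for illustration.

```python
import pandas as pd

# Illustrative values; 846 is the Insulin maximum mentioned above, the rest are made up.
insulin = pd.Series([80, 95, 110, 120, 130, 150, 846], name="Insulin")

# Quartiles and interquartile range
q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = insulin[(insulin < lower) | (insulin > upper)]
print(outliers)
```

A flagged value is only a candidate for review. Whether it is an error or a genuine extreme still needs domain judgment.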

9. Let’s see if we can get a more accurate view of the missing values in the data.

Use the following code to replace the instances of 0 with NaN in the five columns mentioned:

In [None]:
diabetes_data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = diabetes_data[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)

<b>Hint</b>


Since 0 is a non-null value, replacing these instances with NaN makes it easier to find possible missing values in these columns.
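
A side effect worth knowing: NaN is a floating-point value, so an integer column that receives NaN is stored as float64 afterwards. The sketch below shows this on a made-up DataFrame (`sample` is a hypothetical stand-in), and also shows that columns where 0 is a valid value, such as Pregnancies, are left untouched.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "Glucose": [120, 0, 95],
    "Pregnancies": [0, 2, 1],
})

# Only replace zeros in columns where 0 is not a plausible measurement.
cols = ["Glucose"]
sample[cols] = sample[cols].replace(0, np.nan)

print(sample.dtypes)          # Glucose is now a float column
print(sample.isnull().sum())  # the former 0 is counted as missing
```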

10. Next, check for missing (null) values in all of the columns just like you did in Step 5.

Now how many missing values are there?


<b>Hint</b>


To get the number of null values for each of the columns you can use the .isnull().sum() method:

In [None]:
print(diabetes_data.isnull().sum())

`The output looks like:`

In [None]:
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

`Alternatively, you can use the .info() method, which will print out a concise summary of the DataFrame:`

In [None]:
print(diabetes_data.info())

`The output should look like:`

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     763 non-null float64
BloodPressure               733 non-null float64
SkinThickness               541 non-null float64
Insulin                     394 non-null float64
BMI                         757 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null object
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB
None

`As you can see from the output, some values in the data do seem to be missing, even though an initial look at the data told us something different.`

`For example, the SkinThickness column has 227 instances of 0 out of 768 rows, meaning that nearly 30% of the values in this column might be missing.`

`Note: you might choose to deal with these values differently in your analysis, but the important point here is that EDA allows you to dig deeper into the data and help inform you which parts need cleaning.`
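
Raw counts can be hard to compare across columns, so it may help to express missing values as a percentage. Taking the mean of .isnull() gives the fraction of missing values per column. This is a sketch on made-up data (`sample` is hypothetical); the last lines simply redo the 227 out of 768 arithmetic from the text.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "SkinThickness": [35.0, np.nan, 29.0, np.nan],
    "Age": [50, 31, 32, 21],
})

# The mean of a boolean column is the fraction of True values.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# The SkinThickness figure quoted above: 227 missing values out of 768 rows.
skin_thickness_pct = round(227 / 768 * 100, 1)
print(skin_thickness_pct)
```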

11. Let’s take a closer look at these rows to get a better idea of why some data might be missing.

Print out all of the rows that contain missing (null) values.

<b>Hint</b>


To print out the rows with missing values, you can use the following code:

In [None]:
print(diabetes_data[diabetes_data.isnull().any(axis=1)])

12. Go through the rows with missing data. Do you notice any patterns or overlaps between the missing data?


`Hint`<br>


`One thing you might notice is that most rows with missing data have missing values in more than one column. In fact, every single row with at least one missing value also has a missing value in the insulin column. This is a clue as to why this data is missing! If patients did not have their insulin measured, why might they also not have had these other measurements taken?`

`Depending on how much data is missing, you might choose to remove specific rows or impute the missing values somehow.`
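
Rather than scanning the rows by eye, the overlap can be checked directly. The sketch below, on a made-up DataFrame (`sample` and its values are assumptions, not rows from diabetes.csv), counts missing values per row and measures how often Insulin is among them.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "SkinThickness": [35.0, np.nan, 29.0, np.nan],
    "Insulin": [94.0, np.nan, np.nan, np.nan],
    "BMI": [33.6, 26.6, np.nan, 30.1],
})

# Number of missing values in each row
missing_per_row = sample.isnull().sum(axis=1)
print(missing_per_row)

# Among rows with at least one missing value, the share where Insulin is missing
rows_with_missing = sample[sample.isnull().any(axis=1)]
share_missing_insulin = rows_with_missing["Insulin"].isnull().mean()
print(share_missing_insulin)
```

A share close to 1.0 on the real data would support the pattern described in the hint.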

13. Next, take a closer look at the data types of each column in diabetes_data.


Does the result match what you would expect?

`Hint`<br>
`To print the data types of each column, you can use .dtypes:`

In [None]:
print(diabetes_data.dtypes)

`The output should look like:`

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object

`Alternatively, you can use the .info() method, which will print out a concise summary of the DataFrame with the data types of each column included:`

In [None]:
print(diabetes_data.info())

`The output should look like this:`

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB
None

`It looks like the Outcome column is of type object (string) even though in our initial inspection we expected it to be of type int64.`

`Let’s figure out why this might be.`
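
Comparing expected and actual types can also be done programmatically, which scales better than reading the listing when there are many columns. This is a minimal sketch: `sample` and the `expected` dictionary are hypothetical, standing in for diabetes_data and the types listed in Step 1.

```python
import pandas as pd

# Made-up stand-in for diabetes_data with one numeric and one text column.
sample = pd.DataFrame({
    "Age": [50, 31],
    "Outcome": ["1", "O"],
})

# Expected types, as written down during the initial inspection.
expected = {"Age": "int64", "Outcome": "int64"}

# Collect every column whose actual dtype differs from the expectation.
actual = sample.dtypes.astype(str)
mismatches = {
    col: (exp, actual[col])
    for col, exp in expected.items()
    if actual[col] != exp
}
print(mismatches)
```

Any column that shows up in `mismatches` is a candidate for the kind of follow-up done in the next step.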

14. To figure out why the Outcome column is of type object (string) instead of type int64, print out the unique values in the Outcome column.


`Hint`<br>
`To print out the unique values of Outcome, you can use the .unique() method:`

In [None]:
print(diabetes_data.Outcome.unique())

`You should see the following output:`

In [None]:
['1' '0' 'O']

`Notice that we have instances of the character 'O' in addition to the number 0.`

`The documentation tells us that the value of the Outcome column should either be a 0 or a 1, so it seems likely that instances of the character 'O' are misentries.`

15. How might you resolve this issue?


`Hint`<br>
`A possible next step would be to replace instances of 'O' with 0 and convert the Outcome column to type int64.`
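
One way to carry out that step is sketched below on a made-up Series (`outcome` is a hypothetical stand-in for diabetes_data.Outcome). It assumes 'O' is the only invalid entry; any other unexpected string would make the conversion raise an error, which is a useful safeguard in itself.

```python
import pandas as pd

# Made-up values mirroring the unique values seen above: '1', '0' and the letter 'O'.
outcome = pd.Series(["1", "0", "O", "1"], name="Outcome")

# Swap the letter 'O' for the digit '0', then convert the strings to integers.
cleaned = outcome.replace("O", "0").astype("int64")

print(cleaned.unique())
print(cleaned.dtype)
```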


#### Next Steps
16. Congratulations! In this project, you saw how EDA can help with the initial data inspection and cleaning process. This is an important step as it helps to keep your datasets clean and reliable.


Here are some ways you might extend this project if you’d like:


- Use .value_counts() to more fully explore the values in each column.
- Investigate other outliers in the data that may be easily overlooked.
- Instead of changing the 0 values in the five columns to NaN, try replacing the values with the median or mean of each column.
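
For the last idea, a minimal sketch of median imputation is shown below on made-up data (`sample` is hypothetical). It assumes the zeros have already been converted to NaN; otherwise the zeros themselves would pull the median down before they are replaced.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; NaN marks a value that was originally 0.
sample = pd.DataFrame({"BMI": [30.0, np.nan, 26.0, 34.0]})

# .median() skips NaN by default, so it reflects only the observed values.
median_bmi = sample["BMI"].median()

# Fill the gaps with the median of the observed values.
imputed = sample["BMI"].fillna(median_bmi)
print(imputed)

# .value_counts() with dropna=False also reports how many NaN values remain.
print(imputed.value_counts(dropna=False))
```

Imputation keeps every row but narrows the spread of the column, so it is worth comparing results with and without it.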

In [None]:
import codecademylib3
import pandas as pd
import numpy as np

# code goes here
#task 2
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

#task 3
print(len(diabetes_data.columns))
#print(diabetes_data.shape[1])

#task 4
print(len(diabetes_data))
#print(diabetes_data.shape[0])

#task 5 
print(diabetes_data.isnull().sum())
print(diabetes_data.info())

#task 6
print(diabetes_data.describe())

#task 9
diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)

#task 10
print(diabetes_data.isnull().sum())
print(diabetes_data.info())

#task 11
print(diabetes_data[diabetes_data.isnull().any(axis=1)])

#task 13
print(diabetes_data.dtypes)
print(diabetes_data.info())

#task 14
print(diabetes_data.Outcome.unique())

#task 15
# Replace the mis-entered letter 'O' with '0', then convert the column to int64
diabetes_data['Outcome'] = diabetes_data['Outcome'].replace('O', '0').astype('int64')
print(diabetes_data.Outcome.unique())

#task 16
print(diabetes_data.Outcome.value_counts())
