# EDA: Inspect, Clean, and Validate a Dataset 1

## Introduction
One of the most challenging parts of data cleaning is diagnosing data issues and figuring out HOW to most effectively address them. In order to accomplish this, exploratory data analysis (EDA) can be an extremely useful tool. In this article, we’ll walk through an example dataset to demonstrate how EDA can inform the initial data inspection, cleaning, and validation process.

While this article serves as an introduction to EDA for data cleaning, it is important to note that every dataset is different, and therefore will require different exploration. EDA is all about following the data, verifying your assumptions, and investigating anything that is unexpected.

## Initial Data Inspection
Before analysis or cleaning, it is useful to print a few rows of data. This helps ensure that the data is properly loaded. It also allows us to compare the observed data to the data dictionary and determine whether the coding appears to match our expectations. For example, let’s load and inspect the first few rows of a dataset of heart disease patients (downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease)).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Variables Table**
| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
| --- | --- | --- | --- | --- | --- | --- |
| age | Feature | Integer | Age | age in years | years | no |
| sex | Feature | Categorical | Sex | sex (1 = male; 0 = female) | | no |
| cp | Feature | Categorical |  | **chest pain type** <br> --Value `1`: typical angina <br> --Value `2`: atypical angina <br> --Value `3`: non-anginal pain <br> --Value `4`: asymptomatic | | no |
| trestbps | Feature | Integer |  | resting blood pressure (on admission to the hospital) | mm Hg | no |
| chol | Feature | Integer |  | serum cholesterol | mg/dl | no |
| fbs | Feature | Categorical |  | fasting blood sugar > 120 mg/dl (`1` = true; `0` = false) | | no |
| restecg | Feature | Categorical |  | **resting electrocardiographic results** <br>-- Value `0`: normal <br>-- Value `1`: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) <br>-- Value `2`: showing probable or definite left ventricular hypertrophy by Estes' criteria | | no |
| thalach | Feature | Integer |  | maximum heart rate achieved | | no |
| exang | Feature | Categorical |  | exercise induced angina | | no |
| oldpeak | Feature | Integer |  | ST depression induced by exercise relative to rest | | no |
| slope | Feature | Categorical |  | **the slope of the peak exercise ST segment** <br>-- Value `1`: upsloping <br>-- Value `2`: flat <br>-- Value `3`: downsloping | | no |
| ca | Feature | Integer |  | number of major vessels (0-3) colored by fluoroscopy | | yes |
| thal | Feature | Categorical |  | `3` = normal; `6` = fixed defect; `7` = reversible defect | | yes |
| heart_disease (*`num` in original dataset*) | Target | Integer |  | **diagnosis of heart disease**  <br>-- Value `0`: < 50% diameter narrowing <br>-- Value `1`: > 50% diameter narrowing | | no |

In [2]:
df = pd.read_csv('processed.cleveland.data.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


There are a few things we might want to inspect. For example, the data dictionary gives the following information about the `cp` column:

`cp`: chest pain type
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic

Based on this information, it’s not necessarily clear whether the data is going to be coded as numerical values (e.g., `1`, `2`, `3`, or `4`) or with strings (e.g., `'typical angina'`). Data inspection allows us to clarify that this column contains numerical values.
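A quick programmatic check can confirm how a column is coded. The sketch below is only illustrative: it retypes the `cp` values from the five rows printed above into a small frame (the name `sample` is an assumption, used so the snippet runs without the CSV file), then checks the dtype and the distinct codes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hand-typed copy of the `cp` values from the five rows printed above
sample = pd.DataFrame({"cp": [1.0, 4.0, 4.0, 3.0, 2.0]})

# True means the column holds numbers rather than label strings
cp_is_numeric = is_numeric_dtype(sample["cp"])
print(cp_is_numeric)

# Sorted distinct codes, to compare against the data dictionary (1 to 4)
cp_codes = sorted(sample["cp"].unique())
print(cp_codes)
```

On the full dataset, the same two checks would be run on `df["cp"]`.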

Similarly, there is some conflicting information in the data dictionary about the target column (note: we renamed this column as `heart_disease` before loading it, but it was originally coded as `num`). The list of features contains the following information about this column:

`num`: diagnosis of heart disease (angiographic disease status)
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing

However, the initial data description suggests that the target field is integer valued from 0-4, where 0 indicates no heart disease, and values 1-4 indicate the presence of heart disease.

By inspecting the first few rows of data, we see at least one instance of the value `2` in the `heart_disease` column. This suggests that the values probably range from 0-4 instead of just 0-1. We could verify this with further exploration (e.g., by using `df.heart_disease.value_counts()` to get a table of values in this column).
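As a rough sketch of that verification, the snippet below tabulates the target values and flags anything outside the 0/1 coding from the feature list. It uses only the five target values printed above, retyped by hand into an illustrative frame called `sample`, so the counts are not those of the full dataset.

```python
import pandas as pd

# Hand-typed copy of the target column from the five rows printed above
sample = pd.DataFrame({"heart_disease": [0, 2, 1, 0, 0]})

# Frequency table of the observed target values, ordered by value
counts = sample["heart_disease"].value_counts().sort_index()
print(counts)

# Values that fall outside the 0/1 coding given in the feature list
unexpected = sorted(set(sample["heart_disease"]) - {0, 1})
print(unexpected)
```

A non-empty `unexpected` list is a signal that the data dictionary and the data disagree and that the coding needs a closer look.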

## Data Information
Once we’ve taken a first look at some data, a common next step is to address questions such as:
- How many (non-null) observations do we have?
- How many unique columns/features do we have?
- Which columns (if any) contain missing data?
- What is the data type of each column?

Using pandas, we can easily address these questions using the `.info()` method. For example:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             303 non-null    object 
 12  thal           303 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


There are a few interesting pieces of information that we can glean from this output:
- There are 303 rows and 14 columns of data
- At first glance, there are no null (i.e., missing) values in any column (we’ll come back to this)
- The `ca` and `thal` columns have a data type of `object` (which suggests that they are strings), even though we saw in our initial inspection that these columns appear to contain numerical values
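The same facts can also be pulled out individually, which is handy when we want to act on them in code rather than read them off a printout. The following is a minimal sketch on a small hypothetical frame (the `sample` data is made up to mimic a `'?'` hiding in an otherwise numeric column), not on the real file.

```python
import pandas as pd

# Small hypothetical frame mimicking the situation above:
# one numeric column, and one column where '?' forces a non-numeric dtype
sample = pd.DataFrame({
    "age": [63.0, 67.0, 52.0],
    "ca": ["0.0", "3.0", "?"],
})

n_rows, n_cols = sample.shape      # number of rows and columns
null_counts = sample.isna().sum()  # missing values per column
# Columns that are not numeric and may deserve a closer look
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(null_counts)
print(non_numeric_cols)
```

Note that `null_counts` reports zero missing values here even though `'?'` is present, which is exactly why the non-numeric columns are worth inspecting.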

To investigate the unexpected output here, we might want to take a look at the unique values in the `ca` column:

In [4]:
df.ca.unique()

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

We note that at least one row contains a `'?'` in this column. We can probably assume that this indicates mis-coded missing data. The `'?'` also probably forced the column to be coded as a string because there is no obvious way to cast a `'?'` to a numerical value.

Given this information, we now have more to do! We can replace any instance of `'?'` with `np.nan`, change the data type of this column back to a float or integer, and then rerun `df.info()` to see how many missing values we have. Then, we probably want to do a similar inspection of the `thal` column.

In [5]:
df['ca'] = df['ca'].replace('?', np.nan)

In [6]:
df.thal.unique() 

array(['6.0', '3.0', '7.0', '?'], dtype=object)

In [7]:
df['thal'] = df['thal'].replace('?', np.nan)
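The full sequence described above (replace the marker, cast the column, recount the missing values) might look like the sketch below. It runs on a tiny made-up CSV string (`raw` is hypothetical, not the real file) and also shows one possible alternative: telling `pd.read_csv` about the marker up front with its `na_values` parameter.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of the raw file, with '?' marking missing values
raw = "ca,thal\n0.0,6.0\n3.0,?\n?,7.0\n"

# Option 1: clean after loading
sample = pd.read_csv(io.StringIO(raw))
sample = sample.replace("?", np.nan)
for col in ["ca", "thal"]:
    # Cast the remaining numeric strings to floats
    sample[col] = pd.to_numeric(sample[col])
print(sample.dtypes)
print(sample.isna().sum())

# Option 2: declare the missing-value marker while reading
sample_alt = pd.read_csv(io.StringIO(raw), na_values="?")
print(sample_alt.isna().sum())
```

Both options should end with float columns and the same missing-value counts. The second is shorter, but it only works once we already know which marker the file uses, which is what the inspection above uncovered.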

## Inspecting Missing Data
After identifying that there is some missing data and converting it to a format that Python can recognize, it’s often a good idea to take a closer look at those rows. Sometimes, we can find clues as to WHY the data is missing, which can help us make decisions about whether to get rid of the rows altogether or impute the missing values somehow.

In [8]:
df[df.isnull().any(axis=1)]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
87,53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,,0
166,52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,,3.0,0
192,43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,,7.0,1
266,52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,,2
287,58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,,7.0,0
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,,3.0,0


Looking at this output, we note that there is no overlap between the rows with missing `ca` data and missing `thal` data. This suggests that these patients are missing `ca` and `thal` information for different reasons. We don’t see any immediate clues as to why the data is missing in the first place, but we can inspect this further once we start digging into individual features.
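With only six rows, the lack of overlap is easy to see by eye, but on a larger dataset it helps to check it in code. The sketch below retypes the `ca` and `thal` values from the six rows above into an illustrative frame (`missing_rows` is an assumed name) and cross-tabulates the two missingness indicators.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the `ca` and `thal` values from the six rows above,
# keeping the original row labels as the index
missing_rows = pd.DataFrame(
    {
        "ca": [0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
        "thal": [np.nan, 3.0, 7.0, np.nan, 7.0, 3.0],
    },
    index=[87, 166, 192, 266, 287, 302],
)

# Number of rows where both columns are missing at once
both_missing = (missing_rows["ca"].isna() & missing_rows["thal"].isna()).sum()
print(both_missing)

# Cross-tabulate the two missingness indicators (rows: ca, columns: thal)
overlap_table = pd.crosstab(missing_rows["ca"].isna(), missing_rows["thal"].isna())
print(overlap_table)
```

On the full dataset, the same expressions can be applied directly to `df["ca"]` and `df["thal"]`.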

## Data Exploration in Real-Time
If you’d like to watch us inspect this dataset in real-time, feel free to check out the livestream recording below:

[Video](https://www.youtube.com/watch?v=YwadRm2sfpQ)

## Steps from the Video

Full Notebook: [GitHub](https://github.com/Codecademy/Master-Statistics-Live-Series/blob/main/Codecademy%20Live%20Stats%20%231/Final%20Code.ipynb)

In [9]:
heart = pd.read_csv('processed.cleveland.data.csv')

In [10]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [11]:
heart.describe(include='all')   # summary statistics

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
unique,,,,,,,,,,,,5.0,4.0,
top,,,,,,,,,,,,0.0,3.0,
freq,,,,,,,,,,,,176.0,166.0,
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,,,0.937294
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,,,1.228536
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,,,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,,,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,,,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,,,2.0


In [12]:
heart.info()   # data types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             303 non-null    object 
 12  thal           303 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


In [14]:
heart.ca.unique()  # unique values of 'ca'

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

In [15]:
heart.thal.unique()  # unique values of 'thal'

array(['6.0', '3.0', '7.0', '?'], dtype=object)

In [18]:
heart = heart.replace('?', np.nan)  # or heart.replace('?', np.nan, inplace=True)

In [22]:
heart.ca = heart.ca.astype("float")

In [23]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    float64
 12  thal           301 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(12), int64(1), object(1)
memory usage: 33.3+ KB


In [25]:
heart.cp = heart.cp.replace({1.0: 'typical angina', 2.0: 'atypical angina', 3.0: 'non-anginal pain', 4.0: 'asymptomatic'})

In [27]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    object 
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    float64
 12  thal           301 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


In [28]:
heart.describe(include='all')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
count,303.0,303.0,303,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,299.0,301.0,303.0
unique,,,4,,,,,,,,,,3.0,
top,,,asymptomatic,,,,,,,,,,3.0,
freq,,,144,,,,,,,,,,166.0,
mean,54.438944,0.679868,,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.672241,,0.937294
std,9.038662,0.467299,,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,0.937438,,1.228536
min,29.0,0.0,,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,,0.0
25%,48.0,0.0,,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,,0.0
50%,56.0,1.0,,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,,0.0
75%,61.0,1.0,,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,,2.0


In [29]:
heart.slope = heart.slope.replace({1.0: 'upsloping', 2.0: 'flat', 3.0: 'downsloping'})

In [30]:
heart.slope = pd.Categorical(heart.slope, categories=['upsloping', 'flat', 'downsloping'], ordered=True)

In [32]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,typical angina,145.0,233.0,1.0,2.0,150.0,0.0,2.3,downsloping,0.0,6.0,0
1,67.0,1.0,asymptomatic,160.0,286.0,0.0,2.0,108.0,1.0,1.5,flat,3.0,3.0,2
2,67.0,1.0,asymptomatic,120.0,229.0,0.0,2.0,129.0,1.0,2.6,flat,2.0,7.0,1
3,37.0,1.0,non-anginal pain,130.0,250.0,0.0,0.0,187.0,0.0,3.5,downsloping,0.0,3.0,0
4,41.0,0.0,atypical angina,130.0,204.0,0.0,2.0,172.0,0.0,1.4,upsloping,0.0,3.0,0


In [33]:
heart.slope.unique()

['downsloping', 'flat', 'upsloping']
Categories (3, object): ['upsloping' < 'flat' < 'downsloping']

In [36]:
heart.slope.cat.codes

0      2
1      1
2      1
3      2
4      0
      ..
298    1
299    1
300    1
301    1
302    0
Length: 303, dtype: int8
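To illustrate what the ordered categorical buys us, here is a small sketch using the five `slope` labels shown in the `heart.head()` output above, retyped by hand into a standalone Series. Treating `slope` as ordinal is the choice made in the cell above, and the snippet assumes the same category order.

```python
import pandas as pd

# Hand-typed copy of the `slope` labels from the five rows printed by heart.head()
slope = pd.Series(
    pd.Categorical(
        ["downsloping", "flat", "flat", "downsloping", "upsloping"],
        categories=["upsloping", "flat", "downsloping"],
        ordered=True,
    )
)

# Integer codes follow the declared category order:
# upsloping = 0, flat = 1, downsloping = 2
codes = slope.cat.codes.tolist()
print(codes)

# An ordered categorical supports comparisons and order-aware summaries
at_least_flat = (slope >= "flat").tolist()
print(at_least_flat)
print(slope.min(), slope.max())
```

The codes depend only on the order passed to `categories`, not on alphabetical order or on the order in which labels appear in the data.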

In [39]:
heart.describe(include='all')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
count,303.0,303.0,303,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303,299.0,301.0,303.0
unique,,,4,,,,,,,,3,,3.0,
top,,,asymptomatic,,,,,,,,upsloping,,3.0,
freq,,,144,,,,,,,,142,,166.0,
mean,54.438944,0.679868,,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,,0.672241,,0.937294
std,9.038662,0.467299,,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,,0.937438,,1.228536
min,29.0,0.0,,94.0,126.0,0.0,0.0,71.0,0.0,0.0,,0.0,,0.0
25%,48.0,0.0,,120.0,211.0,0.0,0.0,133.5,0.0,0.0,,0.0,,0.0
50%,56.0,1.0,,130.0,241.0,0.0,1.0,153.0,0.0,0.8,,0.0,,0.0
75%,61.0,1.0,,140.0,275.0,0.0,2.0,166.0,1.0,1.6,,1.0,,2.0
