# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset/code

Import the necessary libraries and create your dataframe(s).

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

shl_df = pd.read_csv("Sleep_health_and_lifestyle_dataset.csv")


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
shl_df.isnull().sum()

Person ID                  0
Gender                     0
Age                        0
Occupation                 0
Sleep Duration             0
Quality of Sleep           0
Physical Activity Level    0
Stress Level               0
BMI Category               0
Blood Pressure             0
Heart Rate                 0
Daily Steps                0
Sleep Disorder             0
dtype: int64

In [None]:
# There is no missing data: every column reports zero null values.
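`isnull()` only counts real null values, so blank strings would not show up in the totals above. The following is a minimal sketch of a stricter check, run on a small made-up frame (`toy` is hypothetical and is not the real CSV), in case you want to rule out blank placeholders as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for shl_df; the values are made up.
toy = pd.DataFrame({
    "Occupation": ["Doctor", "", "Nurse"],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", None],
})

# Plain null count: only the real None/NaN is detected.
null_counts = toy.isnull().sum()

# Treat empty or whitespace-only strings as missing too, then recount.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
null_counts_strict = cleaned.isnull().sum()

print(null_counts)
print(null_counts_strict)
```

If both counts agree on the real dataframe, the "no missing data" conclusion also holds for blank strings.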

In [6]:
shl_df.head()

| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


In [4]:
shl_df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
369    False
370    False
371    False
372    False
373    False
Length: 374, dtype: bool

In [None]:
# There are no fully duplicated rows either.
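The truncated boolean output is hard to read at a glance, and because `Person ID` is unique, no row can ever match another on every column. The following is a minimal sketch, on a toy frame that copies a few columns from the first three `head()` rows, of counting duplicates with and without the ID column. Whether rows that match on everything except the ID are true duplicates or simply different people with identical measurements is a judgment call this sketch does not settle.

```python
import pandas as pd

# Toy frame copying a few columns of the first three rows shown by head().
toy = pd.DataFrame({
    "Person ID": [1, 2, 3],
    "Gender": ["Male", "Male", "Male"],
    "Age": [27, 28, 28],
    "Occupation": ["Software Engineer", "Doctor", "Doctor"],
    "Sleep Duration": [6.1, 6.2, 6.2],
})

# Counting across all columns: the unique ID keeps every row distinct.
dupes_with_id = int(toy.duplicated().sum())

# Ignoring the ID: rows that agree on every other column are flagged.
other_cols = toy.columns.drop("Person ID").tolist()
dupes_without_id = int(toy.duplicated(subset=other_cols).sum())

print(dupes_with_id, dupes_without_id)
```

On the real dataframe, `shl_df.duplicated().sum()` gives the same single-number summary.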

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [5]:
# Summary statistics at every 10th percentile (0% to 90%) to look for extreme values.
shl_df.describe(percentiles=[x / 10 for x in range(10)])


| | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
|---|---|---|---|---|---|---|---|---|
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.84492 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 0% | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 10% | 38.3 | 31.0 | 6.1 | 6.0 | 30.0 | 3.0 | 65.0 | 5000.0 |
| 20% | 75.6 | 33.0 | 6.3 | 6.0 | 40.0 | 4.0 | 68.0 | 5000.0 |
| 30% | 112.9 | 37.0 | 6.5 | 6.0 | 45.0 | 4.0 | 68.0 | 6000.0 |
| 40% | 150.2 | 39.0 | 6.8 | 7.0 | 45.4 | 5.0 | 68.0 | 6000.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
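A percentile table shows the spread but does not flag individual outliers. One common convention, used here as an assumption rather than as the method this notebook applied, is the 1.5 × IQR rule. The following is a minimal sketch; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical heart-rate values, made up for illustration only.
heart_rate = pd.Series([65, 68, 68, 70, 70, 72, 72, 75, 110], name="Heart Rate")
flagged = iqr_outliers(heart_rate)
print(flagged)
```

On the real data this could be called as `iqr_outliers(shl_df["Heart Rate"])`, one numeric column at a time. A flagged value is only a candidate for review, not automatically an error.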


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address them as needed. Make sure to use code comments to illustrate your thought process.

In [12]:
# The Person ID column is only a row identifier and is not needed for analysis, so I dropped it.

shl_df.drop(columns='Person ID', inplace=True)

In [13]:
shl_df.head()

| | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | |
| 1 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 3 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


## Inconsistent Data

Check for inconsistent data and address any that arise. As always, use code comments to illustrate your thought process.

In [10]:
# The counts below show two different labels for the normal BMI category ("Normal" and "Normal Weight"); the next cell merges them.
# Select the column by name rather than by position, so the cell still works after Person ID is dropped.
shl_df[['BMI Category']].value_counts()

BMI Category 
Normal           195
Overweight       148
Normal Weight     21
Obese             10
dtype: int64

In [11]:
# Relabel "Normal Weight" as "Normal" so the category has a single label.
shl_df.loc[shl_df['BMI Category'] == 'Normal Weight', 'BMI Category'] = 'Normal'
shl_df[['BMI Category']].value_counts()

BMI Category
Normal          216
Overweight      148
Obese            10
dtype: int64

In [None]:
# After the correction, 216 of the 374 people have a Normal BMI, 148 are Overweight, and the remaining 10 are Obese.
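The same relabeling can also be written as a mapping, which scales more easily if further label variants turn up. The following is a minimal sketch on a toy column; the `expected` set is an assumption based on the three categories left after the correction.

```python
import pandas as pd

# Toy column using the same four labels seen in the value counts above.
bmi = pd.Series(
    ["Normal", "Normal Weight", "Overweight", "Obese", "Normal Weight"],
    name="BMI Category",
)

# Map each variant label to its canonical form; unlisted labels are left unchanged.
bmi_clean = bmi.replace({"Normal Weight": "Normal"})

# Guard against any label outside the expected set slipping through.
expected = {"Normal", "Overweight", "Obese"}
unexpected = set(bmi_clean.unique()) - expected

print(bmi_clean.value_counts())
print(unexpected)
```

An empty `unexpected` set confirms that every row now carries one of the canonical labels.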

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset? No, the dataset had no duplicates and no missing data.
2. Did the process of cleaning your data give you new insights into your dataset? The data was almost clean, so I only noticed a couple of inconsistencies.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? 

In [15]:
# Exporting the DataFrame.
shl_df.to_csv("shl_df.csv")
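By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The following is a minimal sketch of the difference, using an in-memory buffer and a made-up two-row frame instead of a real file.

```python
import io

import pandas as pd

# Hypothetical two-row frame, made up for illustration.
toy = pd.DataFrame({"Gender": ["Male", "Female"], "Age": [27, 30]})

# Default export: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```

If the index carries no meaning, as with the default 0 to 373 range here, `shl_df.to_csv("shl_df.csv", index=False)` keeps the exported file free of the extra column.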