# ETL Pipeline that reads from a CSV

### Objectives 

1. Importing Modules
2. Importing the data
### Data Transformation.
#### Data Cleaning.

3. Checking the data for inconsistencies
4. Dropping the rows whose User IDs are unknown
5. Converting the ``User ID`` from float to int
6. Removing the duplicates, if any
7. Converting the ``Sleep Duration (hours)`` column to float


#### Data Standardization.
8. Fixing the underscores in the ``Activity Level`` column
9. Replacing the 'Very High' entries in the ``Stress Level`` column with 9
10. Fixing the 'Actve' typo in ``Activity Level``
11. Filling in the empty entries
12. Rounding the columns with data type float to 2 decimal places
13. Removing the outliers from the dataframe
14. Removing the rows with illogical values (to ensure logical consistency)
15. Converting the ``Stress Level`` to ``int`` once the data has been fixed
16. Checking for inconsistent categorical values in the ``Activity Level`` column
17. Resetting the index

18. Checking if all the changes have been implemented or not.


19. Saving the data into a csv file


### Requirements

In [1]:
pip install great_expectations

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

### Extract

In [3]:
def extract(file_to_process):
    """Read the raw CSV file into a DataFrame."""
    data = pd.read_csv(file_to_process)

    return data


#### Data Exploration

In [4]:
data = extract('unclean_smartwatch_health_data.csv')

In [5]:
data.sample(5)

| | User ID | Heart Rate (BPM) | Blood Oxygen Level (%) | Step Count | Sleep Duration (hours) | Activity Level | Stress Level |
|---|---|---|---|---|---|---|---|
| 5009 | 2313.0 | 89.398918 | 97.932791 | 5060.952124 | 5.036170722569186 | Sedentary | 10 |
| 2965 | 1788.0 | 78.017781 | 98.736544 | 518.977421 | 5.559555798224284 | Highly Active | 5 |
| 2113 | 2325.0 | 79.641981 | 97.559194 | 6559.260033 | 4.503353301797565 | Seddentary | 6 |
| 3796 | 1867.0 | 85.83612 | 100.0 | 21398.378954 | 8.899756954238846 | Actve | 4 |
| 1683 | 1193.0 | 71.754302 | 96.077817 | 3179.476407 | 6.962624786346553 | Seddentary | 7 |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   User ID                 9799 non-null   float64
 1   Heart Rate (BPM)        9600 non-null   float64
 2   Blood Oxygen Level (%)  9700 non-null   float64
 3   Step Count              9900 non-null   float64
 4   Sleep Duration (hours)  9850 non-null   object 
 5   Activity Level          9800 non-null   object 
 6   Stress Level            9800 non-null   object 
dtypes: float64(4), object(3)
memory usage: 547.0+ KB


In [7]:
data.describe()

| | User ID | Heart Rate (BPM) | Blood Oxygen Level (%) | Step Count |
|---|---|---|---|---|
| count | 9799.0 | 9600.0 | 9700.0 | 9900.0 |
| mean | 3007.480253 | 76.035462 | 97.841581 | 6985.685885 |
| std | 1150.581542 | 19.412483 | 1.732863 | 6885.80968 |
| min | 1001.0 | 40.0 | 90.791208 | 0.910138 |
| 25% | 1997.5 | 64.890152 | 96.662683 | 2021.039657 |
| 50% | 2998.0 | 75.220601 | 98.010642 | 4962.534599 |
| 75% | 4004.0 | 85.198249 | 99.376179 | 9724.90288 |
| max | 4999.0 | 296.59397 | 100.0 | 62486.690753 |
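
The `object` dtype reported by `info()` for `Sleep Duration (hours)` and `Stress Level` suggests that these columns contain non-numeric entries. A minimal sketch of how such entries could be listed before cleaning follows; the helper name `non_numeric_values` and the toy values are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def non_numeric_values(series: pd.Series) -> pd.Series:
    """Count the non-null entries that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Entries that are present in the raw column but became NaN after parsing.
    bad = series[parsed.isna() & series.notna()]
    return bad.value_counts()

# Toy column standing in for 'Sleep Duration (hours)'; the values are made up.
sample = pd.Series(["7.1", "ERROR", "6.5", None, "ERROR"])
print(non_numeric_values(sample))
```

Running the same helper on the real columns would show which placeholder strings need handling in the transform step.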


### Transform

##### Data Cleaning.

3. Checking the data for inconsistencies
4. Dropping the rows whose User IDs are unknown
5. Converting the ``User ID`` from float to int
6. Removing the duplicates, if any
7. Converting the ``Sleep Duration (hours)`` column to float

##### Data Standardization.
8. Fixing the underscores in the ``Activity Level`` column
9. Replacing the 'Very High' entries in the ``Stress Level`` column with 9
10. Fixing the 'Actve' typo in ``Activity Level``
11. Filling in the empty entries
12. Rounding the columns with data type float to 2 decimal places
13. Removing the outliers from the dataframe
14. Removing the rows with illogical values (to ensure logical consistency)
15. Converting the ``Stress Level`` to ``int`` once the data has been fixed
16. Checking for inconsistent categorical values in the ``Activity Level`` column
17. Resetting the index
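
Steps 9 and 15 in the list above could be sketched roughly as follows. The helper name `clean_stress_level` is hypothetical, and filling missing entries with the mode is an assumption; the notebook does not state which fill value it intends for this column.

```python
import pandas as pd

def clean_stress_level(series: pd.Series) -> pd.Series:
    """Map 'Very High' to 9, fill gaps, and return an integer column."""
    s = series.replace("Very High", "9")      # step 9: textual label -> numeric scale
    s = pd.to_numeric(s, errors="coerce")     # anything unparseable becomes NaN
    s = s.fillna(s.mode().iloc[0])            # assumption: impute with the most common value
    return s.astype(int)                      # step 15: safe once no NaN remains

# Toy column; the values are made up.
sample = pd.Series(["1", "5", "Very High", None, "5"])
result = clean_stress_level(sample)
print(result.tolist())
```

The cast to `int` is placed last because an integer column cannot hold missing values.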

In [8]:
data.head(2)

| | User ID | Heart Rate (BPM) | Blood Oxygen Level (%) | Step Count | Sleep Duration (hours) | Activity Level | Stress Level |
|---|---|---|---|---|---|---|---|
| 0 | 4174.0 | 58.939776 | 98.80965 | 5450.390578 | 7.167235622316564 | Highly Active | 1 |
| 1 | | | 98.532195 | 727.60161 | 6.538239375570314 | Highly_Active | 5 |


In [17]:
data['Activity Level'].unique()

array(['Highly Active', 'Highly_Active', 'Actve', 'Seddentary',
       'Sedentary', 'Active', nan], dtype=object)
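
The `unique()` output shows three misspelled variants next to the three intended labels. One way to standardize them, and to check that nothing unexpected remains (step 16), is an explicit mapping. This is only a sketch: the names `ACTIVITY_FIXES`, `ALLOWED_LEVELS` and `standardize_activity` are introduced here for illustration.

```python
import pandas as pd

# Variants observed in unique(), mapped to the intended label.
ACTIVITY_FIXES = {
    "Highly_Active": "Highly Active",
    "Actve": "Active",
    "Seddentary": "Sedentary",
}
ALLOWED_LEVELS = {"Highly Active", "Active", "Sedentary"}

def standardize_activity(series: pd.Series) -> pd.Series:
    """Replace known misspellings; other values (including missing) pass through."""
    return series.replace(ACTIVITY_FIXES)

sample = pd.Series(["Highly Active", "Highly_Active", "Actve",
                    "Seddentary", "Sedentary", "Active", None])
cleaned = standardize_activity(sample)

# Any label outside the allowed set signals a variant the mapping does not cover.
unexpected = set(cleaned.dropna().unique()) - ALLOWED_LEVELS
print(sorted(cleaned.dropna().unique()), unexpected)
```

An exact-match mapping avoids the substring side effects that chained string replacements can have.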

In [9]:
from scipy import stats

def transform(data):
    """Clean and standardize the raw smartwatch readings."""
    # Remove fully duplicated rows; copy so the assignments below act on a new frame.
    data = data.drop_duplicates().copy()

    # Series.drop_duplicates() keeps only the first occurrence of each ID.
    # Assigning it back aligns on the index, so later readings from the same
    # user end up with a missing ID (no rows are removed here).
    data['User ID'] = data['User ID'].drop_duplicates()
    # Cast to string: missing IDs become the literal 'nan' and no longer count as null.
    data['User ID'] = data['User ID'].astype(str)

    # Round to 2 decimals, then fill gaps with the column mean
    # (the fill value itself is not rounded).
    data['Heart Rate (BPM)'] = data['Heart Rate (BPM)'].astype(float).round(2)
    data['Heart Rate (BPM)'] = data['Heart Rate (BPM)'].fillna(data['Heart Rate (BPM)'].mean())

    data['Blood Oxygen Level (%)'] = data['Blood Oxygen Level (%)'].round(2)
    data['Blood Oxygen Level (%)'] = data['Blood Oxygen Level (%)'].fillna(data['Blood Oxygen Level (%)'].mean())

    # NaN cannot be cast to int, so fill missing values first.
    data['Step Count'] = data['Step Count'].fillna(data['Step Count'].mean())
    data['Step Count'] = data['Step Count'].astype(int)

    # 'ERROR' marks an unreadable entry; treat it as missing, then cast to float.
    sleep = data['Sleep Duration (hours)'].map(lambda x: np.nan if x == 'ERROR' else x)
    data['Sleep Duration (hours)'] = sleep.astype(float).round(2)
    sleep_duration_mean = data['Sleep Duration (hours)'].mean()
    data['Sleep Duration (hours)'] = data['Sleep Duration (hours)'].fillna(sleep_duration_mean)

    # Standardize the spelling variants of the activity labels.
    data['Activity Level'] = data['Activity Level'].str.replace('Actve', 'Active', regex=False)
    data['Activity Level'] = data['Activity Level'].str.replace('Highly_Active', 'Highly Active', regex=False)
    data['Activity Level'] = data['Activity Level'].str.replace('Seddentary', 'Sedentary', regex=False)

    # Map the textual label to the numeric scale ('Very High' -> 9).
    data['Stress Level'] = data['Stress Level'].str.replace('Very High', '9', regex=False)

    # Flag outliers in the numerical columns using z-scores.
    numeric_cols = ['Heart Rate (BPM)', 'Blood Oxygen Level (%)', 'Step Count', 'Sleep Duration (hours)']
    numerical_data = data[numeric_cols]
    z_scores = np.abs(stats.zscore(numerical_data))
    threshold = 3

    # True for rows where every numerical column lies within the threshold.
    within_threshold = (z_scores < threshold).all(axis=1)

    # Index alignment leaves NaN in the numerical columns of the excluded rows;
    # a later dropna() removes those rows.
    data[numeric_cols] = numerical_data[within_threshold]

    return data

In [10]:
transform(data).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   User ID                 10000 non-null  object 
 1   Heart Rate (BPM)        9720 non-null   float64
 2   Blood Oxygen Level (%)  9720 non-null   float64
 3   Step Count              9720 non-null   float64
 4   Sleep Duration (hours)  9720 non-null   float64
 5   Activity Level          9800 non-null   object 
 6   Stress Level            9800 non-null   object 
dtypes: float64(4), object(3)
memory usage: 547.0+ KB
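
The `info()` output above reports `User ID` as a fully populated `object` column, whereas steps 4 to 6 of the objectives call for dropping rows with unknown IDs, removing duplicate rows, and converting the IDs to `int`. A minimal sketch of that intent is shown below, under the assumption that repeated readings from the same user should be kept; the helper name `clean_user_ids` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_user_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and rows without an ID, then cast the ID to int."""
    out = df.drop_duplicates()                    # step 6: remove fully duplicated rows
    out = out.dropna(subset=["User ID"]).copy()   # step 4: drop rows whose ID is unknown
    out["User ID"] = out["User ID"].astype(int)   # step 5: safe now that no NaN is left
    return out.reset_index(drop=True)

# Toy frame (made up): one missing ID, one exact duplicate row,
# and two distinct readings from the same user.
toy = pd.DataFrame({
    "User ID": [4174.0, np.nan, 1860.0, 1860.0, 4174.0],
    "Step Count": [5450.0, 727.0, 100.0, 200.0, 5450.0],
})
cleaned = clean_user_ids(toy)
print(cleaned)
```

Filtering rows with `dropna(subset=...)` changes the number of rows, while assigning a filtered column back to the frame only introduces missing values.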


In [11]:
transform(data)

| | User ID | Heart Rate (BPM) | Blood Oxygen Level (%) | Step Count | Sleep Duration (hours) | Activity Level | Stress Level |
|---|---|---|---|---|---|---|---|
| 0 | 4174.0 | 58.940000 | 98.81 | 5450.0 | 7.170000 | Highly Active | 1 |
| 1 | | 76.035481 | 98.53 | 727.0 | 6.540000 | Highly Active | 5 |
| 2 | 1860.0 | | | | | Highly Active | 5 |
| 3 | 2294.0 | 40.000000 | 96.89 | 13797.0 | 7.370000 | Active | 3 |
| 4 | 2130.0 | 61.950000 | 98.58 | 15679.0 | 6.505476 | Highly Active | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | | 78.820000 | 98.93 | 2948.0 | 7.400000 | Active | 7 |
| 9996 | | 48.630000 | 95.77 | 4725.0 | 6.380000 | Sedentary | 2 |
| 9997 | | 73.830000 | 97.95 | 2571.0 | 6.920000 | Sedentary | 4 |
| 9998 | | 76.035481 | 98.40 | 3364.0 | 5.690000 | Active | 8 |


In [12]:
round(transform(data).isnull().sum()/len(data)*100, 2)

User ID                   0.0
Heart Rate (BPM)          2.8
Blood Oxygen Level (%)    2.8
Step Count                2.8
Sleep Duration (hours)    2.8
Activity Level            2.0
Stress Level              2.0
dtype: float64
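
The 2.8% of missing values in the numerical columns corresponds to rows flagged as outliers. An alternative that removes such rows directly could look like the sketch below. It computes the z-scores with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`; the helper name `drop_outlier_rows` and the toy data are assumptions for illustration.

```python
import pandas as pd

def drop_outlier_rows(df: pd.DataFrame, cols, threshold: float = 3.0) -> pd.DataFrame:
    """Return df without the rows whose |z-score| reaches the threshold in any of cols."""
    # Population standard deviation (ddof=0).
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    # A NaN z-score compares as False, so rows with missing values are dropped as well.
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep].reset_index(drop=True)

# Toy data (made up): 19 ordinary heart-rate readings and one implausible spike.
heart_rate = [70.0] * 20
heart_rate[5] = 300.0
toy = pd.DataFrame({
    "Heart Rate (BPM)": heart_rate,
    "Step Count": [float(s) for s in range(1000, 21000, 1000)],
})
filtered = drop_outlier_rows(toy, ["Heart Rate (BPM)", "Step Count"])
print(len(toy), "->", len(filtered))
```

Dropping the rows at this point keeps integer columns such as `Step Count` from being converted back to float by inserted missing values.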

### Load

In [13]:
target_file = "transformed_data.csv"

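def load_data(target_file, transformed_data):
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read back.
    transformed_data.to_csv(target_file, index=False)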
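
`DataFrame.to_csv` writes the row index as an extra column unless `index=False` is passed. The short sketch below illustrates the difference with in-memory buffers, so no file is touched; the toy frame is made up.

```python
import io
import pandas as pd

df = pd.DataFrame({"User ID": [4174, 1860], "Stress Level": [1, 5]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```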

## Implementing the ETL Data Pipeline.

Call the functions sequentially (extract, transform, then load) to run the data pipeline.

In [14]:
extracted_data = extract('unclean_smartwatch_health_data.csv')

In [24]:
transformed_data = transform(extracted_data)
# Drop the rows that still contain missing values, then renumber the index.
transformed_data = transformed_data.dropna().reset_index(drop=True)

In [26]:
load_data(target_file, transformed_data)

In [27]:
round(transformed_data.isnull().sum()/len(transformed_data)*100, 2)

User ID                   0.0
Heart Rate (BPM)          0.0
Blood Oxygen Level (%)    0.0
Step Count                0.0
Sleep Duration (hours)    0.0
Activity Level            0.0
Stress Level              0.0
dtype: float64
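
Beyond the null percentages, step 18 of the objectives (checking that the changes took effect) could be covered by a few explicit checks. The sketch below uses plain pandas; the function name `validate`, the allowed label set, and the toy frames are assumptions for illustration.

```python
import pandas as pd

ALLOWED_LEVELS = {"Highly Active", "Active", "Sedentary"}

def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df.isnull().any().any():
        problems.append("null values remain")
    unexpected = set(df["Activity Level"].dropna()) - ALLOWED_LEVELS
    if unexpected:
        problems.append(f"unexpected activity levels: {sorted(unexpected)}")
    # Stress Level should be fully numeric once 'Very High' has been replaced.
    if pd.to_numeric(df["Stress Level"], errors="coerce").isna().any():
        problems.append("non-numeric Stress Level entries")
    return problems

# Toy frames (made up) showing a passing and a failing case.
good = pd.DataFrame({"Heart Rate (BPM)": [60.0, 70.0],
                     "Activity Level": ["Active", "Sedentary"],
                     "Stress Level": ["3", "9"]})
bad = pd.DataFrame({"Heart Rate (BPM)": [None, 70.0],
                    "Activity Level": ["Actve", "Active"],
                    "Stress Level": ["Very High", "3"]})
print(validate(good))
print(validate(bad))
```

Returning a list of problems rather than raising on the first failure makes it easier to see every remaining issue in one pass.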