# Data Cleaning

## Objectives

- Evaluate missing data
- Clean data

## Inputs

- /workspace/Heart_attack_risk/outputs/datasets/collection/heart.csv

## Outputs

- Cleaned Train and Test sets, saved to outputs/datasets/cleaned

## Conclusions

- No variables have missing data, so the data cleaning pipeline needs no imputation step

---

## Setting the Working Directory

The steps below set Heart_attack_risk (the project root) as the new working directory.

- Get the current directory and print it


In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heart_attack_risk/jupyer_notebooks'

- Set the working directory to the parent of the current directory
- As a result, Heart_attack_risk becomes the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
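
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guard against that is shown here; the helper name `project_root_from` is hypothetical, and the folder name is taken from the path printed earlier.

```python
import os


def project_root_from(cwd: str, notebooks_dirname: str = "jupyer_notebooks") -> str:
    """Return the parent of `cwd` only when `cwd` is the notebooks folder.

    `notebooks_dirname` is an assumption based on the path printed above.
    Any other directory is returned unchanged, so repeated calls are safe.
    """
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_dirname:
        return os.path.dirname(normalized)
    return normalized


# Illustration with the path shown in this notebook (no directory change is made here).
first = project_root_from("/workspace/Heart_attack_risk/jupyer_notebooks")
second = project_root_from(first)  # a second call leaves the path as it is
print(first)
print(second)
```

In the notebook this could be used as `os.chdir(project_root_from(os.getcwd()))`.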


## Load dataset

In [3]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/heart.csv")
df.head(3)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0


## Missing Data Evaluation

List the variables that contain missing values and, if there are any, profile their distribution.

In [5]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

[]

In [6]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

There are no variables with missing data


The dataset contains no variables with missing data.
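
For a dataset that does contain gaps, a per-variable count and percentage is often easier to scan than a bare list of column names. The sketch below is one possible way to build that summary; the helper `missing_data_summary` and the toy DataFrame are hypothetical and are not part of this notebook's data.

```python
import pandas as pd


def missing_data_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy data for illustration only: one missing value in "Age".
toy = pd.DataFrame({"Age": [40, 49, None], "Sex": ["M", "F", "M"]})
summary = missing_data_summary(toy)
print(summary)
```

On the heart dataset loaded above, this summary would be empty, which matches the empty list returned earlier.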

## Split Dataset into Train and Test 

In [6]:
from sklearn.model_selection import train_test_split
# The target stays inside each set, so only the full DataFrame is split
TrainSet, TestSet = train_test_split(
    df,
    test_size=0.2,
    random_state=0,
)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

TrainSet shape: (734, 12) 
TestSet shape: (184, 12)
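
The two shapes can be checked by hand. The sketch below assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the train set; that rounding rule is stated here as an assumption rather than taken from this notebook.

```python
import math

n_rows = 734 + 184  # total rows implied by the two shapes printed above
test_size = 0.2

# Assumed rule: the test count is rounded up, the train set takes the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(f"rows: {n_rows}, train: {n_train}, test: {n_test}")
```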


### Save datasets to the new output folder

In [7]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned') 
except Exception as e:
  print(e)

[Errno 17] File exists: 'outputs/datasets/cleaned'
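
The "File exists" message above is harmless: it only means the folder was created on an earlier run. As an alternative sketch, `os.makedirs` accepts `exist_ok=True`, which makes the call silent when the folder is already there. The temporary base directory below is used only so the example does not touch the project's outputs folder.

```python
import os
import tempfile

# Temporary base directory, for illustration only.
base = tempfile.mkdtemp()
target = os.path.join(base, "outputs", "datasets", "cleaned")

os.makedirs(target, exist_ok=True)  # creates the folder and any missing parents
os.makedirs(target, exist_ok=True)  # second call raises no error

print(os.path.isdir(target))
```

In the notebook the equivalent call would be `os.makedirs('outputs/datasets/cleaned', exist_ok=True)`.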


#### Train Set

In [8]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

#### Test Set

In [9]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
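
After saving, it can be worth reloading a file to confirm that the columns and values survived the round trip. The sketch below does this with a small hypothetical DataFrame and a temporary folder; it does not read the files written above.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only; the column names mirror two of the dataset's variables.
toy = pd.DataFrame({"Age": [40, 49, 37], "HeartDisease": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TrainSetCleaned.csv")
    toy.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

Writing with `index=False`, as the cells above do, keeps the reloaded file free of an extra unnamed index column.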