
<center>

**Data cleaning**

</center>
    
Data cleaning handles missing values, outliers, errors, duplicates, and other inconsistencies so that the data is accurate, consistent, and reliable for later processing and analysis. Common data cleaning techniques include:

- Imputation: Replacing missing values with a reasonable estimate, such as the mean, median, mode, or a constant value.
- Removal: Deleting rows or columns that contain missing values, outliers, or errors.
- Correction: Fixing typos, spelling mistakes, formatting issues, or incorrect values in the data.
- Deduplication: Identifying and removing duplicate records in the data.
- Standardization: Converting values to a common format or unit, for example for dates, times, or currencies.
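
The techniques above can be sketched on a small toy table. This is only an illustration: the column names, the values, and the age threshold of 120 are assumptions made for the example and are not taken from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "age": [51.0, np.nan, 60.0, 60.0, 200.0],
        "sex": ["M", "m", "F", "F", "M"],
        "visit": ["2021-01-05", "2021-01-06", "2021-01-07", "2021-01-07", "2021-01-08"],
    }
)

# Correction: normalise inconsistent category spelling.
toy["sex"] = toy["sex"].str.upper()

# Standardization: parse date strings into a single datetime dtype.
toy["visit"] = pd.to_datetime(toy["visit"], format="%Y-%m-%d")

# Deduplication: drop exact duplicate rows.
toy = toy.drop_duplicates().reset_index(drop=True)

# Removal: drop rows with an implausible age (the threshold is an assumption).
toy = toy[toy["age"].isna() | (toy["age"] <= 120)].reset_index(drop=True)

# Imputation: fill the remaining missing age with the column median.
toy["age"] = toy["age"].fillna(toy["age"].median())

print(toy)
```

The order matters here: duplicates and implausible rows are removed before imputing, so they do not distort the median used as the fill value.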

In [72]:
TRAIN_DATA_PATH = "../../data/interim/train_data.pkl"
TEST_DATA_PATH = "../../data/interim/test_data.pkl"
CLEANED_TRAIN_DATA_PATH = "../../data/interim/cleaned_train_data.pkl"
CLEANED_TEST_DATA_PATH = "../../data/interim/cleaned_test_data.pkl"


In [73]:
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer


# Functions


In [74]:
def remove_duplicates(data: pd.DataFrame) -> None:
    """Drop duplicate rows from `data` in place and report the counts."""
    print(f"Duplicates count before dropping: {data.duplicated().sum()}")
    data.drop_duplicates(inplace=True)
    print(f"Duplicates count after dropping: {data.duplicated().sum()}")
    print(f"Data dimension: {data.shape}")
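
The helper above modifies the DataFrame it receives. As a possible alternative, a variant that returns a new DataFrame leaves the caller's data untouched, which can be easier to reason about when cells are re-run out of order. The function name `drop_duplicate_rows` and the example frame are hypothetical and not part of this notebook.

```python
import pandas as pd


def drop_duplicate_rows(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `data` without exact duplicate rows (hypothetical helper)."""
    n_duplicates = int(data.duplicated().sum())
    deduped = data.drop_duplicates()
    print(f"Dropped {n_duplicates} duplicate rows; shape is now {deduped.shape}")
    return deduped


# Made-up example frame with one duplicated row.
example = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
deduped = drop_duplicate_rows(example)
```

The original index labels are kept, so rows can still be traced back to the source data.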

# Read data


In [75]:
df_train = pd.read_pickle(TRAIN_DATA_PATH)
df_test = pd.read_pickle(TEST_DATA_PATH)

# Replace missing-value placeholders

In [76]:
# Replace the '?' placeholder with np.nan so pandas treats it as missing
df_train.replace("?", np.nan, inplace=True)
df_test.replace("?", np.nan, inplace=True)
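
After replacing the placeholder, it can help to check which columns actually contain missing values before choosing an imputation strategy. The sketch below uses a made-up frame named `sample` in place of `df_train`; only the two column names are borrowed from this notebook, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up.
sample = pd.DataFrame(
    {
        "number_of_major_vessels": ["0.0", "?", "1.0"],
        "thallium_stress_result": ["7.0", "3.0", "?"],
        "age": [51.0, 56.0, 60.0],
    }
)

cleaned = sample.replace("?", np.nan)

# Count missing values per column and keep only the affected columns.
missing = cleaned.isna().sum()
missing = missing[missing > 0]
print(missing)
```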

In [79]:
# Impute missing values with the most frequent value of each column.
# The imputer is fitted on the training set only and reused on the test set.
impute_cols = ["number_of_major_vessels", "thallium_stress_result"]
imputer = SimpleImputer(strategy="most_frequent")
df_train[impute_cols] = imputer.fit_transform(df_train[impute_cols])
df_test[impute_cols] = imputer.transform(df_test[impute_cols])

# Duplicates


In [68]:
remove_duplicates(df_train)

Duplicates count before dropping: 0
Duplicates count after dropping: 0
Data dimension: (271, 14)


In [69]:
df_train.head()


| | age | sex | chest_pain_type | resting_blood_pressure | serum_cholestoral | fasting_blood_sugar | resting_electrocardiographic | maximum_heart_rate | exercise_induced_angina | ST_depression | slope_peak_exercise_ST_segment | number_of_major_vessels | thallium_stress_result | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 155 | 51.0 | 1.0 | 4.0 | 140.0 | 299.0 | 0.0 | 0.0 | 173.0 | 1.0 | 1.6 | 1.0 | 0.0 | 7.0 | 1 |
| 10 | 56.0 | 0.0 | 2.0 | 140.0 | 294.0 | 0.0 | 2.0 | 153.0 | 0.0 | 1.3 | 2.0 | 0.0 | 3.0 | 0 |
| 53 | 60.0 | 1.0 | 4.0 | 130.0 | 253.0 | 0.0 | 0.0 | 144.0 | 1.0 | 1.4 | 1.0 | 1.0 | 7.0 | 1 |
| 122 | 55.0 | 1.0 | 4.0 | 140.0 | 217.0 | 0.0 | 0.0 | 111.0 | 1.0 | 5.6 | 3.0 | 0.0 | 7.0 | 3 |
| 208 | 62.0 | 0.0 | 4.0 | 150.0 | 244.0 | 0.0 | 0.0 | 154.0 | 1.0 | 1.4 | 2.0 | 0.0 | 3.0 | 1 |


# Convert columns to numeric types


In [70]:
# Convert all columns to numeric in both sets so the saved files share dtypes
df_train = df_train.apply(pd.to_numeric)
df_test = df_test.apply(pd.to_numeric)
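
By default `pd.to_numeric` raises an error on any value it cannot parse. If stray non-numeric entries might remain, one option is `errors="coerce"`, which turns them into `NaN` so they can be inspected. The sketch below uses a made-up frame named `raw`; whether coercion is appropriate for this dataset is an assumption to check, since it can silently introduce missing values.

```python
import pandas as pd

# Made-up frame with one unparseable entry ("n/a").
raw = pd.DataFrame({"age": ["51.0", "56.0", "n/a"], "target": ["1", "0", "1"]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Report columns where coercion introduced new missing values.
introduced = numeric.isna().sum() - raw.isna().sum()
print(introduced[introduced > 0])
print(numeric.dtypes)
```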


# Save processed data


In [71]:
df_train.to_pickle(CLEANED_TRAIN_DATA_PATH)
df_test.to_pickle(CLEANED_TEST_DATA_PATH)