## Data Wrangling 

Data wrangling with Pandas: loading, cleaning, and rescaling the dataset before feature engineering.

#### Required Dependencies 

In [14]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

#### Store data as Local Variable 

`data_frame` is a Pandas `DataFrame` that holds the tabular data in a structured format. `read_csv` loads the complete dataset into memory so it is ready for preprocessing.

In [15]:
data_frame = pd.read_csv("2.1.2.cardiovascular_disease_dataset.csv")

#### Null Values 

The `isnull().sum()` call returns the number of null values in each column. Every count below is 0, so there are no missing values to drop or impute.

In [16]:
data_frame.isnull().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
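
Since this dataset has no nulls, the check above never shows a non-zero count. The sketch below shows what it reports when values are missing, using a hypothetical toy frame (`toy` and its values are made up for illustration and are not part of the cardiovascular dataset). Dropping incomplete rows with `dropna()` is only one possible response; imputation is another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries (illustration only)
toy = pd.DataFrame({
    "age": [50.0, np.nan, 61.0],
    "weight": [62.0, 85.0, np.nan],
    "cardio": [0, 1, 1],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Number of rows that contain at least one missing value
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# One possible response: drop every row that has a missing value
cleaned = toy.dropna()

print(null_counts)
print(f"Rows with at least one null: {rows_with_nulls}")
print(f"Rows left after dropna(): {len(cleaned)}")
```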

#### Remove Duplicates 
Duplicates can affect the ML model by reducing data diversity and representativeness, potentially leading to overfitting or biased models.

The `duplicated().sum()` call returns the number of duplicate rows in the data frame.

In [17]:
data_frame.duplicated().sum()

np.int64(0)

The `drop_duplicates()` method returns a copy with duplicate rows removed, which is assigned back to `data_frame`.

In [18]:
data_frame = data_frame.drop_duplicates()
data_frame.duplicated().sum()

np.int64(0)
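
By default `duplicated()` compares every column, so a unique `id` column means no two rows can ever match in full. The sketch below illustrates this on a hypothetical toy frame (the values are invented) by restricting the comparison with the `subset` parameter. Whether repeated measurements under different ids should count as duplicates in this dataset is an assumption that would need checking, not something the notebook establishes.

```python
import pandas as pd

# Hypothetical toy frame: ids are unique, but rows 2 and 3 share the same readings
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "ap_hi": [120, 140, 140, 110],
    "ap_lo": [80, 90, 90, 70],
})

# Comparing all columns: the unique id hides the repeated readings
full_row_dupes = int(toy.duplicated().sum())

# Comparing only the measurement columns
measurement_cols = ["ap_hi", "ap_lo"]
measurement_dupes = int(toy.duplicated(subset=measurement_cols).sum())

# Keep the first occurrence of each repeated measurement
deduped = toy.drop_duplicates(subset=measurement_cols).reset_index(drop=True)

print(full_row_dupes, measurement_dupes, len(deduped))
```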

#### Remove Outliers
Removing outliers is necessary because they can skew analysis of numerical columns. The 25th and 75th percentiles (the first and third quartiles, Q1 and Q3) give the interquartile range, IQR = Q3 - Q1. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR fall outside the acceptable range and are filtered out.

In [25]:
# Change the column name ('ap_lo' here) to match the variable being checked

print(data_frame['ap_lo'].describe())
Q1 = data_frame['ap_lo'].quantile(0.25)
Q3 = data_frame['ap_lo'].quantile(0.75)
IQR = Q3 - Q1
print(f'Outliers of ap_lo are above {Q3 + IQR * 1.5} or below {Q1 - IQR * 1.5}')


count    66414.000000
mean        94.062065
std        181.418988
min          0.000000
25%         80.000000
50%         80.000000
75%         90.000000
max      10000.000000
Name: ap_lo, dtype: float64
Outliers of ap_lo are above 105.0 or below 65.0


In [26]:
# Filter to an acceptable range 
data_frame = data_frame[(data_frame['ap_lo'] >= Q1 - 1.5 * IQR) & (data_frame['ap_lo'] <= Q3 + 1.5 * IQR)]
print(data_frame['ap_lo'].describe())

count    62505.000000
mean        81.698904
std          7.673364
min         65.000000
25%         80.000000
50%         80.000000
75%         90.000000
max        105.000000
Name: ap_lo, dtype: float64
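
Because the same two cells are re-run for each column, the bounds and the filter could be wrapped in small helpers. The following is a minimal sketch, not the notebook's own code: `iqr_bounds` and `filter_iqr` are hypothetical names, and the toy values are invented so that the bounds match the 65 and 105 reported above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_iqr(df, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR limits (inclusive)."""
    lower, upper = iqr_bounds(df[column], k)
    return df[df[column].between(lower, upper)].copy()

# Hypothetical toy data with one implausible reading
toy = pd.DataFrame({"ap_lo": [70, 80, 80, 80, 90, 90, 1000]})

lower, upper = iqr_bounds(toy["ap_lo"])
filtered = filter_iqr(toy, "ap_lo")

print(f"Acceptable range: {lower} to {upper}")
print(f"Rows kept: {len(filtered)} of {len(toy)}")
```

`Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` comparisons in the cell above. Returning a `.copy()` avoids later assignments being made on a view of the original frame.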


#### Scaling Features to Common Range
Scaling features to a common range helps machine learning algorithms find the optimal solution more easily, because features measured on larger numeric scales no longer dominate the others.

In [None]:
# Change the feature to scale accordingly
scale_feature = 'height'

# The minimum value, with room for outliers
MIN_value = 140

# The maximum value, with room for outliers
MAX_value = 185

# Min-max scale the feature using the fixed bounds above (vectorized)
data_frame[scale_feature] = (data_frame[scale_feature] - MIN_value) / (MAX_value - MIN_value)

data_frame.describe()
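
With fixed bounds, any value outside `MIN_value` to `MAX_value` is mapped outside the 0 to 1 range. The sketch below shows that effect on invented heights and adds an optional clip. `scale_fixed_range` is a hypothetical helper name, and clipping is an assumption about how out-of-range values might be handled, not something the notebook does.

```python
import pandas as pd

def scale_fixed_range(series, min_value, max_value, clip=False):
    """Min-max scale with fixed bounds: (x - min) / (max - min)."""
    scaled = (series - min_value) / (max_value - min_value)
    # Optionally force out-of-range values back into [0, 1]
    return scaled.clip(0.0, 1.0) if clip else scaled

# Hypothetical heights in cm; the last one exceeds the upper bound of 185
heights = pd.Series([140, 162.5, 185, 200], name="height")

scaled = scale_fixed_range(heights, 140, 185)
clipped = scale_fixed_range(heights, 140, 185, clip=True)

print(scaled.tolist())
print(clipped.tolist())
```

The `MinMaxScaler` imported at the top of the notebook takes a different approach: it learns the minimum and maximum from the data it is fitted on rather than from hand-picked bounds.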

#### Save Wrangled Data to CSV

In [27]:
data_frame.to_csv('../2.2.Feature_Engineering/2.2.1.wrangled_data.csv', index=False)