## Data Wrangling 

This notebook uses Pandas to clean and prepare the cardiovascular disease dataset for analysis.

#### Required Dependencies 

In [22]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

#### Store data as Local Variable 

`data_frame` is a Pandas `DataFrame` that holds the tabular data. `read_csv` loads the complete dataset into memory, ready for preprocessing.

In [23]:
data_frame = pd.read_csv("2.1.2.cardiovascular_disease_dataset.csv")

#### Null Values 

The `isnull().sum()` method returns the number of null values in each column. A count of 0 in every column means there are no missing values to handle.

In [24]:
data_frame.isnull().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
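The dataset above has no missing values, so nothing further is needed here. As a minimal sketch of what could be done if `isnull().sum()` did report nulls, the toy frame below (hypothetical values, not taken from the dataset) shows two common options: dropping incomplete rows or filling gaps with the column median.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and two missing entries
toy = pd.DataFrame({
    "weight": [62.0, np.nan, 80.0, 71.0],
    "ap_hi": [110.0, 120.0, np.nan, 130.0],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill each missing value with its column's median
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```

Which option is appropriate depends on how much data is missing and why; dropping rows is simplest when only a small share of rows is affected.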

#### Remove Duplicates 
Duplicate rows can affect the ML model by over-weighting repeated observations and reducing how representative the data is, potentially leading to overfitting or biased models.

The `duplicated().sum()` method returns the count of duplicate rows in the data frame.

In [25]:
data_frame.duplicated().sum()

np.int64(0)

The `drop_duplicates()` method returns a copy of the data with duplicates removed, which is assigned back to `data_frame`.

In [26]:
data_frame = data_frame.drop_duplicates()
data_frame.duplicated().sum()

np.int64(0)
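One caveat worth illustrating: `duplicated()` compares entire rows, so if the `id` column is unique per record (an assumption, not verified here), two otherwise identical records would never be flagged. The sketch below uses a toy frame with hypothetical values to show the difference between a full-row check and a check restricted to the feature columns via the `subset` parameter.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 0 and 2 differ only in `id`
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [50, 61, 50, 47],
    "weight": [62.0, 85.0, 62.0, 56.0],
})

# Full-row comparison: the unique ids hide the repeated record
full_row_dupes = int(toy.duplicated().sum())

# Compare on every column except `id`
feature_cols = [c for c in toy.columns if c != "id"]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

# Keep the first occurrence of each repeated record
deduped = toy.drop_duplicates(subset=feature_cols)

print(full_row_dupes, feature_dupes, len(deduped))
```

Whether repeated feature values are true duplicates or simply two similar patients is a judgment call for the dataset at hand.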

#### Remove Outliers
Removing outliers is necessary because they can skew analysis of numerical columns. The 25th and 75th percentiles (the first and third quartiles) of a numerical column give the interquartile range (IQR) below. Values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers and filtered out.

In [37]:
# Change the column name ('weight' here) to match the variable being checked

print(data_frame['weight'].describe())
Q1 = data_frame['weight'].quantile(0.25)
Q3 = data_frame['weight'].quantile(0.75)
IQR = Q3 - Q1
print(f'Outliers are weights above {Q3 + IQR * 1.5} or below {Q1 - IQR * 1.5}')


count    64033.000000
mean        74.209633
std         14.045185
min         11.000000
25%         65.000000
50%         72.000000
75%         82.000000
max        200.000000
Name: weight, dtype: float64
Outliers are weights above 107.5 or below 39.5


In [38]:
# Keep only rows within the acceptable range; .copy() avoids a SettingWithCopyWarning on later assignments
data_frame = data_frame[(data_frame['weight'] >= Q1 - 1.5 * IQR) & (data_frame['weight'] <= Q3 + 1.5 * IQR)].copy()
print(data_frame['weight'].describe())

count    62505.000000
mean        73.180776
std         12.272097
min         40.000000
25%         65.000000
50%         72.000000
75%         81.000000
max        107.000000
Name: weight, dtype: float64
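Because the same steps are repeated for each numerical column, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function names `iqr_bounds` and `filter_iqr` are hypothetical and not part of the notebook. The toy weights are chosen so that the quartiles match the ones reported above (65 and 82).

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_iqr(df, column, k=1.5):
    """Return a copy of df keeping only rows whose value lies within the limits."""
    lower, upper = iqr_bounds(df[column], k)
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)].copy()

# Toy data with hypothetical values: Q1 = 65 and Q3 = 82
toy = pd.DataFrame({"weight": [60.0, 65.0, 72.0, 82.0, 200.0]})
print(iqr_bounds(toy["weight"]))
kept = filter_iqr(toy, "weight")
print(kept)
```

Here the 200 kg entry falls above the upper limit and is dropped, while the other four rows are kept.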


#### Scaling Features to Common Range
Scaling features to a common range helps many machine learning algorithms converge on a good solution, because features with larger numeric ranges no longer dominate those with smaller ones.
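The scaling applied in this section is min-max scaling with fixed bounds. Written out, with $x_{\min}$ and $x_{\max}$ standing for the chosen lower and upper bounds:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

As a worked example using the bounds 38 and 109 set for `weight` in this notebook, the median weight of 72 kg maps to

$$
x' = \frac{72 - 38}{109 - 38} = \frac{34}{71} \approx 0.4789
$$

which agrees with the 50% value reported for `weight` in the summary statistics.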

In [39]:
# Change the scale feature accordingly 
scale_feature = 'weight'

# Lower bound, set slightly below the smallest value kept after outlier removal
MIN_value = 38

# Upper bound, set slightly above the largest value kept after outlier removal
MAX_value = 109

# Min-max scale the feature (vectorized over the whole column)
data_frame[scale_feature] = (data_frame[scale_feature] - MIN_value) / (MAX_value - MIN_value)

data_frame.describe()

| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 | 62505.0 |
| mean | 49941.600336 | 19493.337909 | 1.348564 | 0.508461 | 0.495504 | 0.457379 | 0.437753 | 1.357699 | 1.220638 | 0.086345 | 0.052172 | 0.80432 | 0.49388 |
| std | 28864.503664 | 2458.750218 | 0.476519 | 0.156903 | 0.172846 | 0.170117 | 0.170519 | 0.675192 | 0.56759 | 0.280875 | 0.222375 | 0.396727 | 0.499967 |
| min | 0.0 | 10859.0 | 1.0 | 0.0625 | 0.028169 | 0.02381 | 0.066667 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 24876.0 | 17724.0 | 1.0 | 0.395833 | 0.380282 | 0.380952 | 0.4 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 49980.0 | 19718.0 | 1.0 | 0.520833 | 0.478873 | 0.380952 | 0.4 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 74849.0 | 21341.0 | 2.0 | 0.625 | 0.605634 | 0.619048 | 0.622222 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 99999.0 | 23713.0 | 2.0 | 0.958333 | 0.971831 | 0.97619 | 0.955556 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |
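The summary shows `height`, `ap_hi` and `ap_lo` in the 0 to 1 range as well, which suggests the scaling cell was run once per feature. A minimal sketch of scaling several columns in one pass follows. The helper name `scale_columns` is hypothetical, and only the `weight` bounds come from this notebook; the `height` bounds are placeholders chosen for illustration.

```python
import pandas as pd

# (lower, upper) bounds per feature. Only the weight bounds are from the
# notebook; the height bounds are hypothetical placeholders.
bounds = {
    "weight": (38, 109),
    "height": (140, 190),
}

def scale_columns(df, bounds):
    """Return a copy of df with each listed column min-max scaled to its bounds."""
    scaled = df.copy()
    for column, (lower, upper) in bounds.items():
        scaled[column] = (scaled[column] - lower) / (upper - lower)
    return scaled

# Toy data with hypothetical values
toy = pd.DataFrame({
    "weight": [38.0, 73.5, 109.0],
    "height": [140.0, 165.0, 190.0],
})
scaled = scale_columns(toy, bounds)
print(scaled)
```

The `MinMaxScaler` imported at the top of the notebook is an alternative, though it derives the minimum and maximum from the data it is fitted on rather than from fixed bounds like these.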


#### Save Wrangled Data to CSV

In [40]:
data_frame.to_csv('../2.2.Feature_Engineering/2.2.1.wrangled_data.csv', index=False)