## Feature Engineering 
The feature engineering steps below are applied to the wrangled data before training, with the aim of maximising the performance of the machine learning model.

#### Feature Engineering Process (understanding)
Feature engineering involves:
1. Deriving new variables from existing ones
2. Combining features (feature interactions)
3. Identifying the most relevant features for the model
4. Transforming features
   - dividing data into categories
   - applying mathematical transformations
5. Creating domain-specific features that use domain knowledge to capture important characteristics of the data

#### Required Dependencies 

In [2]:
# Import frameworks
import pandas as pd

#### Store data as local variable 

In [3]:
data_frame = pd.read_csv("2.2.1.wrangled_data.csv")
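
Output later in this notebook shows an `Unnamed: 0` column, which usually means an earlier save wrote the DataFrame index into the CSV. A minimal sketch of loading the file and dropping that column if it is present follows. The helper name `load_wrangled` and the in-memory sample are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

def load_wrangled(source):
    """Read the wrangled CSV and drop a leftover index column if present."""
    frame = pd.read_csv(source)
    # 'Unnamed: 0' appears when an earlier to_csv call wrote the index.
    return frame.drop(columns=["Unnamed: 0"], errors="ignore")

# Tiny in-memory stand-in for the real file (hypothetical sample rows).
sample_csv = io.StringIO(
    "Unnamed: 0,id,age,gender\n"
    "0,0,18393,2\n"
    "1,1,20228,1\n"
)
sample = load_wrangled(sample_csv)
print(list(sample.columns))
```

With the real data, the same helper could be called with the path to `2.2.1.wrangled_data.csv`.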

#### Deriving New Variables from Existing Ones 



##### Combining features/feature interactions
An interaction feature is a new feature that represents the interaction between two or more existing features.
Features that affect the chance of cardiovascular disease include high blood pressure and high cholesterol levels.

_Potential features_:
- **Blood pressure and cholesterol levels** (the two are often related: high cholesterol levels may contribute to high blood pressure)
- **Smoking and number of years smoking**
- **Physical inactivity and age**
- **BMI**
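
As a rough illustration of two of the candidate interactions listed above, the sketch below builds them on a toy DataFrame. The column names `inactive_age` and `bp_chol`, the `age_years` column, and the choice of a simple product are all assumptions for illustration. They are not features this notebook goes on to create.

```python
import pandas as pd

# Toy rows using the same column meanings as the data set (values are illustrative).
toy = pd.DataFrame({
    "age_years": [50.0, 55.0, 52.0],
    "active": [1, 1, 0],          # 1 = physically active, 0 = inactive
    "ap_hi": [110, 140, 130],     # systolic blood pressure
    "cholesterol": [1, 3, 3],     # ordinal cholesterol level
})

# Physical inactivity x age: zero for active people, age for inactive people.
toy["inactive_age"] = (1 - toy["active"]) * toy["age_years"]

# Blood pressure x cholesterol: grows when both are high.
toy["bp_chol"] = toy["ap_hi"] * toy["cholesterol"]

print(toy[["inactive_age", "bp_chol"]])
```

A product is only one way to express an interaction. Whether it helps the model would need to be checked during training.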

#### Encoding values for gender
In the wrangled data, gender is encoded as 1 for female and 2 for male. The code below re-encodes male as -1 and leaves female as 1.

In [13]:
# Change the value for males from 2 to -1
data_frame['gender'] = data_frame['gender'].replace(2, -1)

# Save the updated data back to the CSV file
data_frame.to_csv('2.2.1.wrangled_data.csv', index=False)

# Display the first few rows to verify the changes
data_frame.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,-1,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,-1,169,82.0,150,100,1,1,0,0,1,1
4,8,21914,1,151,67.0,120,80,2,2,0,0,0,0


#### Calculating Age
Age is currently stored in days rather than years. The code below converts it to years by dividing by 365.25, the average length of a year once leap years are included.

In [14]:
# Convert age from days to years
data_frame['age'] = data_frame['age'] / 365.25

# Save the updated data back to the CSV file
data_frame.to_csv('2.2.1.wrangled_data.csv', index=False)

# Display the first few rows to verify the changes
print(data_frame.head())

   id        age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  \
0   0  50.357290      -1     168    62.0    110     80            1     1   
1   1  55.381246       1     156    85.0    140     90            3     1   
2   2  51.627652       1     165    64.0    130     70            3     1   
3   3  48.249144      -1     169    82.0    150    100            1     1   
4   8  59.997262       1     151    67.0    120     80            2     2   

   smoke  alco  active  cardio  
0      0     0       1       0  
1      0     0       1       1  
2      0     0       0       1  
3      0     0       1       1  
4      0     0       0       0  
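
Because the cell above writes the converted ages back to the same CSV it was read from, running it a second time would divide the ages by 365.25 again. One possible guard is sketched below. The function name `age_in_years` and the 130-year plausibility threshold are assumptions, not part of the original notebook.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25

def age_in_years(age: pd.Series, max_plausible_years: float = 130.0) -> pd.Series:
    """Convert ages from days to years unless they already look like years."""
    if age.max() <= max_plausible_years:
        # Values are already small enough to be years, so leave them alone.
        return age.astype(float)
    return age / DAYS_PER_YEAR

days = pd.Series([18393, 20228, 18857])
once = age_in_years(days)
twice = age_in_years(once)  # second call is a no-op

print(once.round(2).tolist())
print(twice.equals(once))
```

An alternative design is to write the converted ages to a new column or a new file so the source data is never overwritten.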


#### Combining Features/Feature Interactions
__Feature Engineering__: creating new features that represent the interaction between two or more features


##### BMI 
The code below calculates the BMI of each patient. BMI combines two existing features, height and weight: weight in kilograms divided by the square of height in metres. The height column is in centimetres, so it is divided by 100 first. This feature is engineered because it is a useful indicator alongside cholesterol levels, blood pressure and smoking status.

In [15]:
# Create a BMI column and calculate the BMI of each person
data_frame['BMI'] = data_frame['weight'] / (data_frame['height'] / 100) ** 2
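
As a quick check of the formula, the sketch below recomputes BMI for the first two patients shown earlier (168 cm and 62 kg, 156 cm and 85 kg). The helper name `bmi` is an assumption for illustration.

```python
import pandas as pd

def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

rows = pd.DataFrame({"height": [168, 156], "weight": [62.0, 85.0]})
rows["BMI"] = bmi(rows["weight"], rows["height"])
print(rows.round(2))
```

The results should agree with the BMI values printed in the Risk output further down (about 21.97 and 34.93).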

#### Risk - BMI and Age
Age and BMI are both risk multipliers. When both the age and BMI are high, there is an increased chance of cardiovascular disease. 

In [16]:
# Create a 'Risk' column as the product of BMI and age
data_frame['Risk'] = data_frame['BMI'] * data_frame['age']

# Express risk as a fraction of the maximum risk (0 to 1), rounded to 2 decimal places
data_frame['Risk%'] = (data_frame['Risk'] / data_frame['Risk'].max()).round(2)

# Print results
print(data_frame[['BMI', 'Risk', 'Risk%']].head())

         BMI         Risk  Risk%
0  21.967120  1106.204631   0.35
1  34.927679  1934.338382   0.61
2  23.507805  1213.652800   0.38
3  28.710479  1385.256063   0.44
4  29.384676  1763.000116   0.56


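Rescale Risk% so that 0.15 maps to 0 and 0.85 maps to 1. Values below 0.15 become negative and values above 0.85 become greater than 1.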

In [19]:
min_val = 0.15
max_val = 0.85

data_frame['Risk%'] = (data_frame['Risk%'] - min_val) / (max_val - min_val)
print(data_frame['Risk%'].head())

0    0.285714
1    0.657143
2    0.328571
3    0.414286
4    0.585714
Name: Risk%, dtype: float64
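
The formula in the cell above treats 0.15 and 0.85 as fixed lower and upper bounds, so it does not confine the result to the interval from 0.15 to 0.85. The sketch below contrasts that formula with a min-max scaling that does squeeze values into that interval, in case that is what is wanted. Both function names are assumptions for illustration, and the second function assumes the values are not all identical.

```python
import pandas as pd

def rescale_from_bounds(values, low=0.15, high=0.85):
    """Map low -> 0 and high -> 1 (the formula used in the cell above)."""
    return (values - low) / (high - low)

def squeeze_into_range(values, low=0.15, high=0.85):
    """Min-max scale so the smallest value becomes low and the largest high."""
    span = values.max() - values.min()  # assumes span is non-zero
    unit = (values - values.min()) / span
    return low + unit * (high - low)

# First five Risk% values from the output above.
risk_pct = pd.Series([0.35, 0.61, 0.38, 0.44, 0.56])

from_bounds = rescale_from_bounds(risk_pct)
squeezed = squeeze_into_range(risk_pct)

print(from_bounds.round(6).tolist())
print(squeezed.round(6).tolist())
```

Which of the two is appropriate depends on how the model is expected to use the feature.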


Save BMI, Risk and Risk% to the wrangled data CSV

In [17]:
# Save the updated data back to the CSV file
data_frame.to_csv('2.2.1.wrangled_data.csv', index=False)

#### Mean Arterial Pressure (Average Blood Pressure)
Rather than using systolic and diastolic blood pressure as separate features, the two can be combined into a single mean arterial pressure (MAP), the average arterial pressure over one heartbeat. It is usually estimated with the formula:

__MAP = DP + 1/3 × (SP - DP)__

where DP is diastolic blood pressure (`ap_lo`) and SP is systolic blood pressure (`ap_hi`).

In [4]:
# Calculate the mean arterial pressure
data_frame['meanBP'] = data_frame['ap_lo'] + 1/3 * (data_frame['ap_hi'] - data_frame['ap_lo'])

# Print results
print(data_frame['meanBP'].head())

0     90.000000
1    106.666667
2     90.000000
3    116.666667
4     93.333333
Name: meanBP, dtype: float64
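
To make the formula concrete, the sketch below applies it to the first two patients (110/80 and 140/90). The function name `mean_arterial_pressure` is an assumption for illustration.

```python
def mean_arterial_pressure(systolic, diastolic):
    """Estimate MAP as diastolic plus one third of the pulse pressure."""
    return diastolic + (systolic - diastolic) / 3

first = mean_arterial_pressure(110, 80)
second = mean_arterial_pressure(140, 90)

print(round(first, 2))
print(round(second, 2))
```

The two results match the first two values in the output above. Because the function only uses arithmetic, it also works on pandas Series such as `data_frame['ap_hi']` and `data_frame['ap_lo']`.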


Categorise the mean arterial pressure values as shown below:

|Low (0)| Normal (1) | High (2) | 
|-------|------------|----------|
| <70   |   70-100   |  >100    |

In [5]:
# Categorize meanBP into 0, 1, or 2 based on conditions in table above
def categorize_meanBP(value):
    if value < 70:
        return 0
    elif 70 <= value <= 100:
        return 1
    else:
        return 2

# Apply the function to create a new column 'meanBP_category'
data_frame['meanBP_category'] = data_frame['meanBP'].apply(categorize_meanBP)

# Print results to verify the changes
print(data_frame[['meanBP', 'meanBP_category']].head())

       meanBP  meanBP_category
0   90.000000                1
1  106.666667                2
2   90.000000                1
3  116.666667                2
4   93.333333                1
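
On a large data set, a vectorised version of the same categorisation may run faster than `apply`. The sketch below uses `numpy.select` with the same boundaries, where 70 and 100 both count as normal. The function name `categorize_meanBP_vectorized` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def categorize_meanBP_vectorized(mean_bp: pd.Series) -> pd.Series:
    """Return 0 for <70, 1 for 70-100 inclusive, 2 for >100."""
    # Conditions are checked in order, so the second only sees values >= 70.
    codes = np.select([mean_bp < 70, mean_bp <= 100], [0, 1], default=2)
    return pd.Series(codes, index=mean_bp.index, name="meanBP_category")

sample = pd.Series([65.0, 70.0, 90.0, 100.0, 106.67])
categories = categorize_meanBP_vectorized(sample)
print(categories.tolist())
```

`pandas.cut` is another option, but its bins are closed on one side only, so it cannot reproduce both the `>= 70` and the `<= 100` boundaries exactly without adjusting the bin edges.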


Save the new mean BP calculation

In [6]:
# Save the updated data back to the CSV file
data_frame.to_csv('2.2.1.wrangled_data.csv', index=False)

##### Existing Features 

- Age and BMI risk multiplier 
- Physical Activity 
- Smoking status (0 (no)/1 (yes))
- Mean Arterial Pressure 
- Gender

##### Engineered Features

- BMI
- Age/BMI risk multiplier (Risk%)
- Mean Arterial Pressure 

##### Save the wrangled and engineered data to CSV

In [7]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data.csv', index=False)
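
Since notebook cells can be run out of order, it may be worth confirming that the engineered columns are present before exporting. The sketch below is one way to do that. The helper name `missing_engineered_columns` is an assumption, and the column list is taken from the cells above.

```python
import pandas as pd

# Columns created by the feature engineering cells in this notebook.
ENGINEERED_COLUMNS = ["BMI", "Risk", "Risk%", "meanBP", "meanBP_category"]

def missing_engineered_columns(frame: pd.DataFrame) -> list:
    """Return the engineered columns that are absent from the frame."""
    return [name for name in ENGINEERED_COLUMNS if name not in frame.columns]

# Toy frame that lacks two of the engineered columns.
toy = pd.DataFrame({"BMI": [21.97], "Risk": [1106.2], "meanBP": [90.0]})
missing = missing_engineered_columns(toy)
print(missing)
```

An empty list would indicate that every engineered column is available for the export.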