[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of feature engineering processes you can apply to your data before model training to maximise the performance of your machine learning model. For this demonstration we will engineer new or improved features for the diabetes data you previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example logarithmic transformations). Logarithmic transformations are a powerful tool in the world of statistical analysis. They are often used to transform data that exhibit skewness or other irregularities, making it easier to analyze, visualize, and interpret the results.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
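
The logarithmic transformation mentioned in the list above can be illustrated with a short sketch. The values and the column names `value` and `value_log` are hypothetical and are not taken from the course data set; the example only shows how `np.log1p` can reduce right skew.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (illustration only, not course data)
skewed = pd.DataFrame({"value": [1, 2, 3, 5, 8, 40, 250]})

# log1p computes log(1 + x), so zero values do not cause errors
skewed["value_log"] = np.log1p(skewed["value"])

# Skewness closer to 0 indicates a more symmetric distribution
print("skew before:", skewed["value"].skew())
print("skew after: ", skewed["value_log"].skew())
```

Whether a log transformation helps depends on the feature: it suits positive, right-skewed values and is not defined for values at or below -1 when using `log1p`.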

#### Load the required dependencies

In [None]:
# Import frameworks
import pandas as pd

#### Store the data as a local variable

The data frame is a Pandas object that structures your tabular data into an appropriate format. It loads the complete data in memory so it is now ready for preprocessing.

In [32]:
data_frame = pd.read_csv("2.2.1.wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data Encoding converts textual data into numerical format, so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers and not with text or categorical variables.

To encode the 'Seasons' column you will assign a numeric value to each season. Because the data set only provides 4 values, we will use 1, 2, 3 and 4.

In [33]:
# Map each season name (case-insensitive) to a number; unmatched names become NaN
season_codes = {'spring': 1, 'summer': 2, 'fall': 3, 'winter': 4}
data_frame['Seasons'] = data_frame['Seasons'].str.lower().map(season_codes)
print(data_frame['Seasons'].head())

0    4.0
1    4.0
2    4.0
3    4.0
4    4.0
Name: Seasons, dtype: float64
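
Integer codes such as 1 to 4 imply an order between the seasons, which some models may treat as meaningful. One common alternative is one-hot encoding. The following is a minimal sketch on a hypothetical sample (the `sample` frame and the `Season` prefix are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical sample; the real column comes from the wrangled CSV
sample = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Winter"]})

# One indicator column per season, so no ordering is implied
one_hot = pd.get_dummies(sample["Seasons"], prefix="Season", dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1. The trade-off is more columns, which matters little for four seasons but can matter for features with many categories.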


##### Calculating Age

In the context of medical diagnosis of a lifestyle disease, a person's date of birth has limited influence on the target. However, their age is highly relevant, so we will convert the two dates into an age. You could consider further encoding this into age brackets.

In [34]:
# Convert the 'DoB' and 'DoT' columns to datetime
data_frame['DoB'] = pd.to_datetime(data_frame['DoB'], format='%d/%m/%Y')
data_frame['DoT'] = pd.to_datetime(data_frame['DoT'], format='%d/%m/%Y')

# Calculate the year difference
data_frame['Age'] = ((data_frame['DoT'] - data_frame['DoB']).dt.days  / 365.25).round()

# Print the result
print(data_frame[['DoB', 'DoT', 'Age']])

KeyError: 'DoB'
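
The age brackets suggested above could be derived with `pd.cut`. This is a minimal sketch: the ages, the bracket edges and the `AgeBracket` column name are hypothetical choices for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical ages (illustration only)
ages = pd.DataFrame({"Age": [23, 35, 47, 58, 71]})

# Bracket edges are an assumption; each interval includes its right edge
bins = [0, 29, 44, 59, 120]
labels = ["<30", "30-44", "45-59", "60+"]

# Assign every age to the bracket whose interval contains it
ages["AgeBracket"] = pd.cut(ages["Age"], bins=bins, labels=labels)
print(ages)
```

Bracket edges are best chosen from domain knowledge, because arbitrary cut points discard information held in the continuous value.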

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

In this case, some domain knowledge (urban mobility and transportation) and data analysis have informed you that the BMI and AGE are risk multipliers (the greater the age and the greater the BMI, the greater the risk). From this we can calculate a risk value based on the feature interactions.

* More exposure combined with high humidity levels (indicated by high dew point temperatures) can lead to heat-related illnesses such as heat exhaustion or heat stroke. Understanding this interaction can help in predicting lower bike-sharing usage during such conditions to ensure user safety.

* Differences between 1% to 100%:
    * Humidity and Temp = ~0.06%
    * DewPointTemp and Temp = ~0.10%
    * Hour and Humidity =  ~0.30%

* Hour and Humidity show a clear correlation.

In [35]:
# Create the 'WHR' interaction column (Hour multiplied by Humidity)
data_frame['WHR'] = data_frame['Hour'] * data_frame['Humidity']

# Express each value as a fraction of the maximum WHR
data_frame['WHR%'] = (data_frame['WHR'] / data_frame['WHR'].max()).round(2)

# Print the result
print(data_frame[['DewPointTemp', 'Temp', 'WHR%']])

      DewPointTemp  Temp  WHR%
0            -17.6  -5.2  0.00
1            -17.6  -5.5  0.02
2            -17.7  -6.0  0.03
3            -17.6  -6.2  0.05
4            -18.6  -6.0  0.06
...            ...   ...   ...
8597         -10.3   4.2  0.29
8598          -9.9   3.4  0.33
8599          -9.9   2.6  0.36
8600          -9.8   2.1  0.40
8601          -9.3   1.9  0.44

[8602 rows x 3 columns]


#### Transforming Features

Filtering is like applying a WHERE clause in a database. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where the 'SEX' is 'Male'. There is no dedicated method call for this; conditional (boolean) indexing fulfils our purpose.

In this case, some domain knowledge and data analysis have informed you that there is 'bimodality' in the data and that males and females follow different trends.

In [41]:
# Filter the data to rows where 'Seasons' equals 1
data_frame = data_frame[data_frame['Seasons'] == 1]

# Print the result
print(data_frame[['Hour', 'Seasons', 'Count']])

Empty DataFrame
Columns: [Hour, Seasons, Count]
Index: []
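
Because a filter that reassigns `data_frame` discards the other rows for every later cell, one option is to check which values exist and keep the subset in a separate variable. The sketch below uses a hypothetical `sample` frame with made-up counts and a hypothetical `spring_only` variable.

```python
import pandas as pd

# Hypothetical sample with made-up values (illustration only)
sample = pd.DataFrame({
    "Hour": [0, 1, 2, 3],
    "Seasons": [4, 4, 1, 1],
    "Count": [254, 204, 173, 107],
})

# Check which values are present before filtering on one of them
print(sample["Seasons"].value_counts())

# Boolean mask: True for the rows to keep
mask = sample["Seasons"] == 1

# Store the subset separately so the full frame stays available
spring_only = sample[mask].copy()
print(len(sample), len(spring_only))
```

Keeping the original frame intact also means the cell can be re-run without shrinking the data further.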


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case the domain is 'health' and more specifically 'epidemiology', which is the study of how often diseases occur in different groups of people and why.

The column called '1st Degree Relatives' is a domain-specific feature, as it records the number of family members in the individual's direct bloodline who have developed type 2 adult-onset diabetes. The relevant domain knowledge is that a family history of disease in first-degree relatives is a major risk factor, especially for premature events.

First we will convert the FDR value to a risk percentage. Because the risk can never be 0 (will never happen) or 100% (will definitely happen), we will scale the result between 0.15 and 0.85.

In [None]:
# Calculate the family history risk
data_frame['FHRisk'] = (data_frame['FDR'] / data_frame['FDR'].max())

# Scale the result between 0.15 and 0.85
min_val = 0.15
max_val = 0.85
data_frame['FHRisk'] = (((data_frame['FHRisk'] - data_frame['FHRisk'].min()) / (data_frame['FHRisk'].max() - data_frame['FHRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'FDR', 'FHRisk']])
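
The min-max scaling expression in the cell above is long and is repeated later, so it could be factored into a helper. This is a sketch only: `scale_to_range` is a hypothetical function name, its defaults follow the `min_val` and `max_val` values used in the cell, and the guard for identical values is an added assumption.

```python
import pandas as pd

def scale_to_range(series, min_val=0.15, max_val=0.85):
    """Min-max scale a Series into the range [min_val, max_val]."""
    span = series.max() - series.min()
    if span == 0:
        # All values identical: avoid dividing by zero, return the midpoint
        return pd.Series((min_val + max_val) / 2, index=series.index)
    scaled = (series - series.min()) / span
    return (scaled * (max_val - min_val) + min_val).round(2)

# Hypothetical first-degree-relative counts (illustration only)
fdr = pd.Series([0, 2, 4])
print(scale_to_range(fdr).tolist())
```

With such a helper, a scaling cell reduces to a single call, for example `data_frame['FHRisk'] = scale_to_range(data_frame['FDR'])`.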

Then to make it even more meaningful, we will combine it with the `Risk` feature we engineered using the `AGE` and `BMI` features to create a combined risk 'interaction feature' that captures real-world relationships between the features.

Again we will scale the result between 0.15 and 0.85.

In [None]:
data_frame['CombRisk'] = (data_frame['FHRisk'] * data_frame['Risk%']).round(2)

min_val = 0.15
max_val = 0.85
data_frame['CombRisk'] = (((data_frame['CombRisk'] - data_frame['CombRisk'].min()) / (data_frame['CombRisk'].max() - data_frame['CombRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'Risk%', 'FHRisk', 'CombRisk']])

#### Save the wrangled and engineered data to CSV

In [None]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data.csv', index=False)