[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes you can apply to your data before model training to maximise the performance of your machine learning model. For this demonstration we will engineer new or improved features for the bike-sharing data you previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example logarithmic transformations). Logarithmic transformations are a powerful tool in the world of statistical analysis. They are often used to transform data that exhibit skewness or other irregularities, making it easier to analyze, visualize, and interpret the results.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
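
Two of the transformations listed above, a logarithmic transformation and dividing a continuous value into categories, can be sketched as follows. This is a minimal illustration on made-up numbers; the `demo` data frame, its column names and the bin edges are assumptions and are not part of the wrangled data set.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, for illustration only
demo = pd.DataFrame({'value': [1, 2, 3, 5, 8, 13, 21, 400]})

# log1p computes log(1 + x), so zero values are handled safely
demo['value_log'] = np.log1p(demo['value'])

# Divide the continuous values into three ordered categories
# (the bin edges are arbitrary and chosen only for this example)
demo['value_band'] = pd.cut(
    demo['value'], bins=[0, 5, 50, 1000], labels=['low', 'medium', 'high']
)

print(demo)
print('Skew before:', round(demo['value'].skew(), 2))
print('Skew after: ', round(demo['value_log'].skew(), 2))
```

The skew of the transformed column is smaller than that of the raw column, which is the effect a logarithmic transformation is usually applied for.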

#### Load the required dependencies

In [4]:
# Import frameworks
import pandas as pd

####  Store the data as a local variable

The data frame is a Pandas object that structures your tabular data into an appropriate format. It loads the complete data in memory so it is now ready for preprocessing.

In [5]:
data_frame = pd.read_csv("2.2.1.wrangled_data.csv")

####  Deriving new variables from existing ones

##### Encoding categorical variables

Data Encoding converts textual data into numerical format, so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers and not with text or categorical variables.

To encode the 'Seasons' column you will assign a numeric value to each season. Because the data set only provides 4 values we will use 1, 2, 3 and 4.

In [6]:
# Map each season name (case-insensitive) to a number; unmatched names become NaN
season_codes = {'spring': 1, 'summer': 2, 'fall': 3, 'winter': 4}
data_frame['Seasons'] = data_frame['Seasons'].str.lower().map(season_codes)
print(data_frame['Seasons'].head())

0    4.0
1    4.0
2    4.0
3    4.0
4    4.0
Name: Seasons, dtype: float64
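
Numbering the seasons 1 to 4 implies an order between them. If that order is not wanted, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a small made-up sample; the `demo` data frame and the `Season` prefix are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample, for illustration only
demo = pd.DataFrame({'Seasons': ['Winter', 'Spring', 'Summer', 'Winter']})

# One 0/1 column per season, so no artificial ordering is implied
one_hot = pd.get_dummies(demo['Seasons'].str.lower(), prefix='Season', dtype=int)

print(one_hot)
```

Each row has exactly one column set to 1. The trade-off is that one column becomes several, which matters more for features with many distinct values.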


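##### Calculating Day of the Week and Rush Hour

In the context of bike-sharing demand, a raw calendar date has limited influence on the target. However, the day of the week and whether the hour falls in a commuting peak are likely to be more relevant. So we will derive the day of the week from the date and flag the rush hours (hours 7 to 9 and 17 to 19). You could consider further encoding the day of the week into a weekday/weekend flag.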

In [7]:

# Convert 'Date' column to datetime
data_frame['Date'] = pd.to_datetime(data_frame['Date'], format='%d/%m/%Y')

# Create Day of the Week feature
data_frame['DayOfWeek'] = data_frame['Date'].dt.dayofweek

# Create Rush Hour feature
data_frame['RushHour'] = data_frame['Hour'].apply(lambda x: 1 if (7 <= x <= 9) or (17 <= x <= 19) else 0)

# Print the result to verify the new features
print(data_frame[['Date', 'Hour', 'DayOfWeek', 'RushHour']].head())

        Date  Hour  DayOfWeek  RushHour
0 2017-12-01     0          4         0
1 2017-12-01     1          4         0
2 2017-12-01     2          4         0
3 2017-12-01     3          4         0
4 2017-12-01     4          4         0
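
The same features can also be built without `apply`, using vectorised comparisons. The following is a sketch on a few made-up rows; the `demo` data frame and the `IsWeekend` column are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
demo = pd.DataFrame({
    'Date': pd.to_datetime(['01/12/2017', '02/12/2017', '04/12/2017'], format='%d/%m/%Y'),
    'Hour': [8, 8, 12],
})

# dayofweek counts from Monday=0 to Sunday=6
demo['DayOfWeek'] = demo['Date'].dt.dayofweek

# Hypothetical extra feature: 1 for Saturday or Sunday, otherwise 0
demo['IsWeekend'] = (demo['DayOfWeek'] >= 5).astype(int)

# between() is inclusive on both ends by default
in_rush = demo['Hour'].between(7, 9) | demo['Hour'].between(17, 19)
demo['RushHour'] = in_rush.astype(int)

print(demo)
```

Vectorised operations like `between` run over the whole column at once, which is typically faster than calling a Python lambda for every row.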


#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

In this case, some domain knowledge (urban mobility and transportation) and data analysis have informed us that more exposure combined with high humidity levels (indicated by high dew point temperatures) can lead to heat-related illnesses such as heat exhaustion or heat stroke. Understanding this interaction can help in predicting lower bike-sharing usage during such conditions to ensure user safety.

* Differences between 1% and 100%:
    * Humidity and Temp = ~0.06%
    * DewPointTemp and Temp = ~0.10%

* Hour and DewPointTemp show clear correlation and effectiveness.
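
One way to judge whether an interaction feature is worth keeping is to compare its correlation with the target against that of the individual features. The sketch below does this on synthetic data where the target is built from a product of two columns; the names `a`, `b`, `a_x_b` and `target` are hypothetical and the numbers say nothing about the bike-sharing data.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only: the target depends on the product a * b
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'a': rng.uniform(-1, 1, 500),
    'b': rng.uniform(-1, 1, 500),
})
demo['target'] = demo['a'] * demo['b'] + rng.normal(0, 0.05, 500)

# Interaction feature: the product of the two individual features
demo['a_x_b'] = demo['a'] * demo['b']

# Pearson correlation of every column with the target
print(demo.corr()['target'].round(2))
```

Here neither `a` nor `b` correlates strongly with the target on its own, while the product does. Real data is rarely this clean, so treat the correlation as one signal among several.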

In [None]:
# Create the visibility/dew point interaction feature
data_frame['VisibilityTemp'] = data_frame['Visibility'] * data_frame['DewPointTemp']

# Express the interaction as a fraction of its maximum value
data_frame['VisibilityTemp%'] = (data_frame['VisibilityTemp'] / data_frame['VisibilityTemp'].max()).round(2)

# Print the result
print(data_frame[['Visibility', 'DewPointTemp', 'VisibilityTemp%']])


#### Transforming Features

Filtering is like applying the WHERE clause in a database. It is widely used and can help when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where 'Seasons' is 1 (spring). No dedicated method call is needed; conditional indexing is enough.

In this case, some domain knowledge and data analysis have informed you that the data is multimodal and that each season follows a different trend.

In [9]:
# Keep only the rows where Seasons is 1 (spring)
data_frame = data_frame[data_frame['Seasons'] == 1]

# Print the result
print(data_frame[['Hour', 'Seasons', 'Count']])

      Hour  Seasons     Count
2160     0      1.0 -0.083158
2161     1      1.0 -0.003158
2162     2      1.0  0.031579
2163     3      1.0 -0.061053
2164     4      1.0 -0.128421
...    ...      ...       ...
4330    17      1.0  1.996842
4331    20      1.0  2.120000
4332    21      1.0  2.021053
4333    22      1.0  1.842105
4334    23      1.0  1.255789

[2175 rows x 3 columns]
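
Conditional indexing works in two steps: the comparison produces a Series of `True`/`False` values (a mask), and indexing with that mask keeps only the `True` rows. The sketch below shows this on a few made-up rows, together with the equivalent `query` call; the `demo` data frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
demo = pd.DataFrame({
    'Hour': [0, 1, 2, 3],
    'Seasons': [1.0, 4.0, 1.0, 2.0],
    'Count': [10, 20, 30, 40],
})

# Step 1: the comparison yields one True/False value per row
mask = demo['Seasons'] == 1

# Step 2: indexing with the mask keeps the True rows;
# copy() makes the subset independent of the original frame
spring = demo[mask].copy()

# The same filter written as a query string
spring_q = demo.query('Seasons == 1')

print(spring)
```

Taking a `copy()` is optional, but it avoids warnings about modifying a slice if you go on to add columns to the filtered frame.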


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case the domain is urban mobility and transportation, and more specifically how weather conditions affect bike-sharing usage.

The column called 'ComfortIndex' is a domain-specific feature, as it combines temperature and humidity into a single simplified measure of how comfortable the weather feels. The domain-specific knowledge is that perceived comfort depends on humidity as well as on temperature.

First we will calculate the comfort index, then we will scale the result between 0.15 and 0.95 so that the scaled value never reaches the extremes of 0 or 1.

In [10]:
# Create Weather Comfort Index (simplified)
data_frame['ComfortIndex'] = data_frame['Temp'] - (0.55 * (1 - (data_frame['Humidity'] / 100)) * (data_frame['Temp'] - 14.5))

# Scale the ComfortIndex between 0.15 and 0.95
min_val = 0.15
max_val = 0.95
data_frame['ComfortIndexScaled'] = (((data_frame['ComfortIndex'] - data_frame['ComfortIndex'].min()) / (data_frame['ComfortIndex'].max() - data_frame['ComfortIndex'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['ComfortIndex', 'ComfortIndexScaled', 'Temp', 'Humidity']])

      ComfortIndex  ComfortIndexScaled  Temp  Humidity
2160       2.27500                0.26   2.0        96
2161       2.30460                0.26   2.1        97
2162       2.20625                0.25   2.0        97
2163       1.81285                0.24   1.6        97
2164       1.81285                0.24   1.6        97
...            ...                 ...   ...       ...
4330      23.14870                0.92  27.2        42
4331      21.11340                0.85  23.1        58
4332      20.27200                0.83  21.9        60
4333      19.61170                0.81  21.1        59
4334      19.24600                0.79  20.5        62

[2175 rows x 4 columns]
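
The scaling step above is a min-max rescale into a chosen range. Because it is used more than once in this notebook, it could be wrapped in a small helper. The function below is a hypothetical sketch, not part of the original notebook; the name `scale_to_range` and the handling of constant columns are assumptions.

```python
import pandas as pd


def scale_to_range(series, min_val=0.15, max_val=0.95):
    """Min-max scale a Series into [min_val, max_val] (hypothetical helper)."""
    span = series.max() - series.min()
    if span == 0:
        # All values are identical: avoid dividing by zero, return the midpoint
        return pd.Series((min_val + max_val) / 2, index=series.index)
    return (series - series.min()) / span * (max_val - min_val) + min_val


# Made-up values, for illustration only
demo = pd.Series([2.0, 10.0, 18.0])
scaled = scale_to_range(demo).round(2)

print(scaled.tolist())
```

The smallest input maps to `min_val`, the largest to `max_val`, and everything else falls proportionally in between.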


Then, to make it even more meaningful, we will combine it with the `VisibilityTemp%` feature we engineered from the `Visibility` and `DewPointTemp` features to create a combined 'interaction feature' that captures real-world relationships between the features.

Again we will scale the result between 0.15 and 0.95.

In [14]:
# Combine the comfort index with the visibility/dew point interaction feature
# (requires 'VisibilityTemp%' from the feature interaction cell above)
data_frame['CombRisk'] = (data_frame['ComfortIndex'] * data_frame['VisibilityTemp%']).round(2)

# Scale the combined feature between 0.15 and 0.95
min_val = 0.15
max_val = 0.95
data_frame['CombRisk'] = (((data_frame['CombRisk'] - data_frame['CombRisk'].min()) / (data_frame['CombRisk'].max() - data_frame['CombRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['ComfortIndex', 'VisibilityTemp%', 'CombRisk']])

#### Save the wrangled and engineered data to CSV

In [3]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data.csv', index=False)
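
A CSV file stores only text, so column types such as datetimes are not preserved when the file is read back. The sketch below illustrates this with an in-memory buffer instead of a real file; the `demo` data frame is made up, and passing `parse_dates` when reloading is a suggestion rather than something the next notebook is known to do.

```python
import io

import pandas as pd

# Hypothetical sample rows, for illustration only
demo = pd.DataFrame({
    'Date': pd.to_datetime(['01/12/2017', '02/12/2017'], format='%d/%m/%Y'),
    'RushHour': [0, 1],
})

# Write to an in-memory buffer; index=False leaves out the row index column
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Without parse_dates the Date column is read back as plain text
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates the Date column is converted back to datetimes
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=['Date'])

print(as_text.dtypes)
print(as_dates.dtypes)
```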
