[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes you can apply to your data before model training to maximise the performance of your machine learning model. For this demonstration we will engineer new or improved features for the bike-sharing data you previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example logarithmic transformations). Logarithmic transformations are a powerful tool in the world of statistical analysis. They are often used to transform data that exhibit skewness or other irregularities, making it easier to analyze, visualize, and interpret the results.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
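
As a minimal sketch of the logarithmic transformation mentioned in the list above, the cell below applies `numpy.log1p` to a small right-skewed sample and compares the skewness before and after. The `sample` values are hypothetical and are not taken from the course data set.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; not taken from the course data set
sample = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 40, 250], name="value")

# log1p computes log(1 + x), so zero values are handled safely
log_sample = np.log1p(sample)

# Skewness closer to 0 indicates a more symmetric distribution
skew_before = sample.skew()
skew_after = log_sample.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformation can be reversed with `numpy.expm1` when predictions need to be reported on the original scale.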

#### Load the required dependencies

In [None]:
# Import frameworks
import pandas as pd

####  Store the data as a local variable

The data frame is a Pandas object that structures your tabular data into an appropriate format. It loads the complete data in memory so it is now ready for preprocessing.

In [None]:
data_frame = pd.read_csv("2.2.1.wrangled_data.csv")

####  Deriving new variables from existing ones

##### Encoding categorical variables

Data Encoding converts textual data into numerical format, so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers and not with text or categorical variables.

* To encode the 'Seasons' column we will assign a number to each season. Because the data set only contains 4 distinct values, we will use 1, 2, 3 and 4.

In [None]:
# Map each season name (case-insensitive) to a number; unknown values become missing
season_codes = {'spring': 1, 'summer': 2, 'fall': 3, 'winter': 4}
data_frame['Seasons'] = data_frame['Seasons'].str.lower().map(season_codes)
print(data_frame['Seasons'].head())

* We will do the same for 'FunctioningDay' and 'Holiday', except that these columns only have 2 values, so we will use 1 and 0.

In [None]:
data_frame['FunctioningDay'] = data_frame['FunctioningDay'].apply(lambda day: 1 if day.lower() == 'yes' else 0)
print(data_frame['FunctioningDay'].head())

In [None]:
data_frame['Holiday'] = data_frame['Holiday'].apply(lambda holiday: 1 if holiday.lower() == 'yes' else 0)
print(data_frame['Holiday'].head())
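
Numbering the seasons 1 to 4 implies an order (winter > spring) that some algorithms may treat as meaningful. One common alternative is one-hot encoding, sketched below on a hypothetical `toy` frame with `pandas.get_dummies`. This is an illustration only and is not used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical sample; the real column comes from the wrangled CSV
toy = pd.DataFrame({"Seasons": ["Spring", "Summer", "Fall", "Winter", "Summer"]})

# One-hot encoding: one 0/1 column per season, with no implied order
one_hot = pd.get_dummies(toy["Seasons"].str.lower(), prefix="Season", dtype=int)

# Attach the new columns next to the original one for comparison
toy = pd.concat([toy, one_hot], axis=1)
print(toy)
```

Each row has exactly one season column set to 1, at the cost of adding one column per category.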

##### Calculating new features from existing features

* In the context of urban transportation, rush hour is between the hours of 7-9 (AM) and 5-7 (PM). We will convert the 'Date' column to a datetime, derive the day of the week from it, and encode the 'Hour' column into a rush hour flag.

In [None]:
# Convert 'Date' column to datetime
data_frame['Date'] = pd.to_datetime(data_frame['Date'], format='%d/%m/%Y')

# Create Day of the Week feature
data_frame['DayOfWeek'] = data_frame['Date'].dt.dayofweek

# Create Rush Hour feature
data_frame['RushHour'] = data_frame['Hour'].apply(lambda x: 1 if (7 <= x <= 9) or (17 <= x <= 19) else 0)

# Print the result to verify the new features
print(data_frame[['Date', 'Hour', 'DayOfWeek', 'RushHour']].head())
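
The same features can also be computed without `apply`, using vectorised pandas operations. The sketch below runs on a hypothetical `toy` frame. The `IsWeekend` column is an assumed extra feature for illustration and is not part of this notebook.

```python
import pandas as pd

# Hypothetical rows in the same day/month/year format as the 'Date' column
toy = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "04/12/2017"],
    "Hour": [8, 18, 12],
})
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# dt.dayofweek numbers the days Monday=0 through Sunday=6
toy["DayOfWeek"] = toy["Date"].dt.dayofweek

# Assumed extra feature (not in the notebook): 1 for Saturday or Sunday
toy["IsWeekend"] = (toy["DayOfWeek"] >= 5).astype(int)

# Vectorised rush hour flag; Series.between is inclusive at both ends by default
morning = toy["Hour"].between(7, 9)
evening = toy["Hour"].between(17, 19)
toy["RushHour"] = (morning | evening).astype(int)
print(toy)
```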

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

* In this case, some domain knowledge (urban mobility and transportation) and data analysis have informed us that more exposure combined with high humidity levels (indicated by high dew point temperatures) can lead to heat-related illnesses such as heat exhaustion or heat stroke. Understanding this interaction can help in predicting lower bike-sharing usage during such conditions to ensure user safety.

In [None]:
# Create the 'VSW' interaction column
data_frame['VSW'] = data_frame['DewPointTemp'] * data_frame['Hour']

# Express 'VSW' as a fraction of its maximum value
data_frame['VSW%'] = (data_frame['VSW'] / data_frame['VSW'].max()).round(2)

# Print the result
print(data_frame[['DewPointTemp', 'Hour', 'VSW%']])

#### Transforming Features

Filtering is like applying the `WHERE` clause in a database. It is widely used and can help when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where 'Seasons' is spring (encoded as 1). No dedicated method call is needed for this; we can use conditional indexing.

* In this case, some domain knowledge and data analysis have informed us that there is 'bimodality' in the data and each season has a different trend.

In [None]:
# Filter the data to season 1 (spring) only
data_frame = data_frame[data_frame['Seasons'] == 1]

# Print the result
print(data_frame[['Hour', 'Seasons', 'Count']])
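
To make the conditional indexing step more concrete, here is a minimal sketch on a hypothetical `toy` frame. It builds the boolean mask separately and stores the result in a new variable (`subset`, an assumed name), so the unfiltered data stays available.

```python
import pandas as pd

# Hypothetical rows using the notebook's encoded 'Seasons' values
toy = pd.DataFrame({
    "Hour": [0, 1, 0, 1],
    "Seasons": [1, 1, 2, 4],
    "Count": [120, 95, 310, 40],
})

# The comparison yields a boolean Series, one True/False per row
mask = toy["Seasons"] == 1

# Indexing with the mask keeps only the True rows; .copy() makes the subset independent
subset = toy[mask].copy()

# The original frame is untouched, so other seasons remain available
print(len(toy), len(subset))
```

Taking a copy is a design choice: it avoids pandas warnings when new columns are later added to the filtered subset.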

#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case, the domain is urban mobility and transportation, which involves understanding how environmental and temporal factors influence bike-sharing demand.

The column called Comfort Index is a domain-specific feature as it combines temperature and humidity to reflect how comfortable the weather feels for biking. Domain-specific knowledge indicates that weather comfort significantly impacts bike-sharing usage, as extreme discomfort (e.g., high humidity and temperature) can deter users.

* First, we will calculate the Comfort Index and convert it to a scaled value. Because comfort can never be 0% (completely uncomfortable) or 100% (perfectly comfortable), we will scale the result between 0.15 and 0.95 to normalize the values for better interpretability and use in predictive modeling.

In [None]:
# Create Weather Comfort Index (simplified)
data_frame['ComfortIndex'] = data_frame['Temp'] - (0.55 * (1 - (data_frame['Humidity'] / 100)) * (data_frame['Temp'] - 14.5))

# Scale the ComfortIndex between 0.15 and 0.95
min_val = 0.15
max_val = 0.95
data_frame['ComfortIndexScaled'] = (((data_frame['ComfortIndex'] - data_frame['ComfortIndex'].min()) / (data_frame['ComfortIndex'].max() - data_frame['ComfortIndex'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['ComfortIndex', 'ComfortIndexScaled', 'Temp', 'Humidity']])

Then, to make it even more meaningful, we will combine the Comfort Index with the `VSW%` feature we engineered from the `Hour` and `DewPointTemp` features to create a combined risk 'interaction feature' that captures real-world relationships between the features.

This time we will scale the result between 0.15 and 0.85.

In [None]:
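# Combine the Comfort Index with the 'VSW%' interaction feature
data_frame['ComfortAdd'] = (data_frame['ComfortIndex'] * data_frame['VSW%']).round(2)

# Scale 'ComfortAdd' between 0.15 and 0.85
min_val = 0.15
max_val = 0.85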
data_frame['ComfortAdd'] = (((data_frame['ComfortAdd'] - data_frame['ComfortAdd'].min()) / (data_frame['ComfortAdd'].max() - data_frame['ComfortAdd'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Hour', 'VSW%', 'ComfortIndex', 'ComfortAdd']])
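
Both scaling steps in this section apply the same min-max formula, `(x - min) / (max - min) * (high - low) + low`. As a sketch, it could be wrapped in a small helper. The name `scale_to_range` and the midpoint fallback for constant columns are assumptions for illustration, not part of this notebook.

```python
import pandas as pd

def scale_to_range(values: pd.Series, low: float, high: float) -> pd.Series:
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    span = values.max() - values.min()
    if span == 0:
        # All values identical: assumed fallback is the midpoint of the range
        return pd.Series((low + high) / 2, index=values.index)
    return (values - values.min()) / span * (high - low) + low

# Hypothetical values; the smallest maps to 0.15 and the largest to 0.95
toy = pd.Series([10.0, 15.0, 20.0])
scaled = scale_to_range(toy, 0.15, 0.95).round(2)
print(scaled.tolist())
```

The guard for a zero span avoids a division by zero, which the inline expressions above would hit if a filtered subset contained only one distinct value.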

#### Save the wrangled and engineered data to CSV

In [None]:
data_frame.to_csv('../2.3.model_training/2.3.1.model_ready_data.csv', index=False)