[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook covers the feature engineering applied to the data before model training, with the aim of maximising the performance of the machine learning model. We will engineer new or improved features for the bike demand data we previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features
  - [Dividing data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example, logarithmic transformations). Logarithmic transformations are often used on data that exhibit skewness or other irregularities, making the results easier to analyze, visualize, and interpret.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
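
The logarithmic transformation mentioned in the list above is not applied later in this notebook, so here is a minimal sketch of what it does to skewed values. The numbers in `counts` are made up for illustration and are not taken from the bike demand data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (not taken from the bike demand data)
counts = pd.Series([1, 2, 2, 3, 5, 8, 13, 40, 120, 900], dtype="float64")

# log1p computes log(1 + x), so zero values would stay defined
log_counts = np.log1p(counts)

# Compare the sample skewness before and after the transformation
skew_before = counts.skew()
skew_after = log_counts.skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution, which is usually what makes the transformed feature easier to work with.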

#### Load the required dependencies

In [2]:
# Import frameworks
import pandas as pd

#### Store the data as a local variable

A data frame is a pandas object that holds tabular data in a structured format. `read_csv` loads the complete data set into memory, so it is ready for preprocessing.

In [3]:
data_frame = pd.read_csv("2.2.1.wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data encoding converts textual data into a numerical format so that it can be used as input for algorithms. We encode because most machine learning algorithms work with numbers and not with text or categorical labels.

* To encode the `Seasons` column we will assign a number value to each season

    * Since the data set only provides 4 values we will use 1, 2, 3 and 4

In [4]:
# Map each season name to its number; names not in the dictionary become NaN
season_codes = {'spring': 1, 'summer': 2, 'fall': 3, 'winter': 4}
data_frame['Seasons'] = data_frame['Seasons'].str.lower().map(season_codes)
print(data_frame['Seasons'].head())

0    4.0
1    4.0
2    4.0
3    4.0
4    4.0
Name: Seasons, dtype: float64
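
The `float64` dtype in the output above is consistent with at least one label not matching any of the four season names, because unmatched labels are stored as missing values. The following is a minimal sketch of how to check for that, assuming a dictionary-based mapping. The sample series is hypothetical, and `'Autumn'` is included only to show what an unmatched label looks like.

```python
import pandas as pd

# Hypothetical sample; 'Autumn' is included only to show an unmatched label
seasons = pd.Series(["Winter", "Spring", "Summer", "Autumn"])

# Same codes as in the encoding cell above
season_codes = {"spring": 1, "summer": 2, "fall": 3, "winter": 4}

encoded = seasons.str.lower().map(season_codes)

# Labels missing from the dictionary become NaN, which forces a float dtype
unmapped = seasons[encoded.isna()].unique()
print(encoded.dtype)
print("Unmapped labels:", list(unmapped))
```

If the list of unmapped labels is not empty, the dictionary can be extended before any later step relies on the encoded column.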


* We will do the same for `FunctioningDay` and `Holiday`, which each take only two values, so we encode them as 1 and 0

In [4]:
data_frame['FunctioningDay'] = data_frame['FunctioningDay'].apply(lambda day: 1 if day.lower() == 'yes' else 0)
print(data_frame['FunctioningDay'].head())

0    1
1    1
2    1
3    1
4    1
Name: FunctioningDay, dtype: int64


In [5]:
data_frame['Holiday'] = data_frame['Holiday'].apply(lambda holiday: 1 if holiday.lower() == 'yes' else 0)
print(data_frame['Holiday'].head())

0    0
1    0
2    0
3    0
4    0
Name: Holiday, dtype: int64


##### Calculating new features from existing features

* In the context of urban transportation, time is a very important factor in determining human behaviour

    * Rush hour falls between 7-9 AM and 5-7 PM. We will convert the `Date` column to a datetime type and encode rush hour as a binary flag

    * The day of the week is also derived from the date

In [7]:
# Convert 'Date' column to datetime
data_frame['Date'] = pd.to_datetime(data_frame['Date'], format='%d/%m/%Y')

# Create Day of the Week feature
data_frame['DayOfWeek'] = data_frame['Date'].dt.dayofweek

# Create Rush Hour feature
data_frame['RushHour'] = data_frame['Hour'].apply(lambda x: 1 if (7 <= x <= 9) or (17 <= x <= 19) else 0)

# Print the result to verify the new features
print(data_frame[['Date', 'Hour', 'DayOfWeek', 'RushHour']].head())

           Date  Hour  DayOfWeek  RushHour
2160 2018-03-01     0          3         0
2161 2018-03-01     1          3         0
2162 2018-03-01     2          3         0
2163 2018-03-01     3          3         0
2164 2018-03-01     4          3         0
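
As a cross-check of the two time features, here is a small sketch on a hypothetical one-day frame. It uses `Series.between` as a vectorized alternative to the lambda, which is my own substitution and not the method used in the cell above.

```python
import pandas as pd

# Hypothetical frame: one day of hourly rows for 1 March 2018
sample = pd.DataFrame({
    "Date": pd.to_datetime(["01/03/2018"] * 24, format="%d/%m/%Y"),
    "Hour": range(24),
})

# dt.dayofweek numbers the days from Monday=0 to Sunday=6
sample["DayOfWeek"] = sample["Date"].dt.dayofweek

# between() is inclusive on both ends by default
morning = sample["Hour"].between(7, 9)
evening = sample["Hour"].between(17, 19)
sample["RushHour"] = (morning | evening).astype(int)

print(sample.loc[sample["RushHour"] == 1, "Hour"].tolist())
```

1 March 2018 was a Thursday, so `DayOfWeek` is 3 for every row, which agrees with the output above.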


* We can also create a separate feature for the month, which gives finer-grained insight than `Seasons`

In [8]:
# Extract the month from the 'Date' column
data_frame['Month'] = data_frame['Date'].dt.month

# Print the first few rows to verify
print(data_frame[['Date', 'Month']].head())

           Date  Month
2160 2018-03-01      3
2161 2018-03-01      3
2162 2018-03-01      3
2163 2018-03-01      3
2164 2018-03-01      3


#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

* In this case, domain knowledge (urban mobility and transportation) and data analysis suggest that high dew point temperatures (indicating humidity) combined with peak riding hours can influence bike-sharing demand. Specifically, hot and humid conditions during busy times (e.g., midday or evening commutes) may lead to reduced usage due to discomfort or heat-related health risks.

In [9]:
data_frame['HourDPT'] = (data_frame['DewPointTemp']) * data_frame['Hour']

# Express HourDPT as a fraction of its maximum value
data_frame['HourDPT%'] = (data_frame['HourDPT'] / data_frame['HourDPT'].max()).round(2)

# Print the result
print(data_frame[['DewPointTemp', 'Hour', 'HourDPT%']])

      DewPointTemp  Hour  HourDPT%
2160           1.4     0      0.00
2161           1.6     1      0.00
2162           1.5     2      0.01
2163           1.1     3      0.01
2164           1.1     4      0.01
...            ...   ...       ...
4363           0.0    19      0.00
4364          14.3    20      0.58
4365          13.7    21      0.58
4366          12.7    22      0.57
4367          12.9    23      0.60

[2208 rows x 3 columns]


* A high dew point in the afternoon marks hours when users may avoid biking

* A low dew point at night is more comfortable and corresponds to stable demand

#### Transforming Features

Filtering is like applying a `WHERE` clause in a database. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where `Seasons` is Spring (encoded as 1). No dedicated method call is needed; conditional indexing is enough.

* In this case, domain knowledge and data analysis have informed us that each season follows a different trend.

    * (Filtering will also lower our sample size)

In [10]:
# Keep only the Spring rows (Seasons == 1)
data_frame = data_frame[data_frame['Seasons'] == 1]

# Print the result
print(data_frame[['Hour', 'Seasons', 'Count']])

      Hour  Seasons     Count
2160     0      1.0  0.029583
2161     1      1.0  0.061250
2162     2      1.0  0.075000
2163     3      1.0  0.038333
2164     4      1.0  0.011667
...    ...      ...       ...
4363    19      1.0  1.055833
4364    20      1.0  0.901667
4365    21      1.0  0.862500
4366    22      1.0  0.791667
4367    23      1.0  0.559583

[2208 rows x 3 columns]
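
Conditional indexing can be sketched on a small hypothetical frame as follows. The `.copy()` call and the `query()` alternative are my own additions for illustration and are not used in the cell above. Taking a copy is one way to make sure that columns added later belong to an independent frame.

```python
import pandas as pd

# Hypothetical encoded frame (1 = spring, 2 = summer, as in the encoding step)
sample = pd.DataFrame({
    "Hour": [0, 1, 0, 1],
    "Seasons": [1, 1, 2, 2],
    "Count": [0.03, 0.06, 0.40, 0.35],
})

# Boolean mask: True for the rows to keep
is_spring = sample["Seasons"] == 1

# .copy() gives an independent frame, so adding columns later is unambiguous
spring = sample[is_spring].copy()
spring["HourSquared"] = spring["Hour"] ** 2  # illustrative new column only

# query() expresses the same filter as a string
same_rows = sample.query("Seasons == 1")

print(len(sample), "rows before,", len(spring), "rows after")
```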


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case, the domain is urban mobility and transportation, which involves understanding how environmental and temporal factors influence bike-sharing demand.

* The column called **Comfort Index** is a domain-specific feature as it combines temperature and humidity to reflect how comfortable the weather feels for biking. Domain-specific knowledge indicates that weather comfort significantly impacts bike-sharing usage, as extreme discomfort (e.g., high humidity and temperature) can deter users. I also considered other features like:

    * **Rainfall to rush hour interaction:** may impact commuting patterns

    * **Rainfall to snowfall interaction:** precipitation intensity (these values were too low and had too little variance)

* First, we will convert the Comfort Index to a scaled value, with a lower bound above 0 because comfort can never be 0. We will scale the result between 0.15 and 0.90 to normalize the values for better interpretability and use in predictive modeling.

In [11]:
# Create Weather Comfort Index
data_frame['ComfortIndex'] = data_frame['Temp'] - (0.55 * (1 - (data_frame['Humidity'] / 100)) * (data_frame['Temp'] - 14.5))

print("Raw ComfortIndex Min:", data_frame['ComfortIndex'].min())
print("Raw ComfortIndex Max:", data_frame['ComfortIndex'].max())

# Scale the ComfortIndex between 0.15 and 0.90
min_val = 0.15
max_val = 0.90
data_frame['ComfortIndexScaled'] = (((data_frame['ComfortIndex'] - data_frame['ComfortIndex'].min()) / (data_frame['ComfortIndex'].max() - data_frame['ComfortIndex'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['ComfortIndex', 'ComfortIndexScaled', 'Temp', 'Humidity']])
print(data_frame['ComfortIndexScaled'].describe())

Raw ComfortIndex Min: -1.0295999999999994
Raw ComfortIndex Max: 24.1214
      ComfortIndex  ComfortIndexScaled  Temp  Humidity
2160       2.27500                0.25   2.0        96
2161       2.30460                0.25   2.1        97
2162       2.20625                0.25   2.0        97
2163       1.81285                0.23   1.6        97
2164       1.81285                0.23   1.6        97
...            ...                 ...   ...       ...
4363      22.43405                0.85  25.2        53
4364      21.11340                0.81  23.1        58
4365      20.27200                0.79  21.9        60
4366      19.61170                0.77  21.1        59
4367      19.24600                0.75  20.5        62

[2208 rows x 4 columns]
count    2208.000000
mean        0.574284
std         0.152157
min         0.150000
25%         0.460000
50%         0.590000
75%         0.690000
max         0.900000
Name: ComfortIndexScaled, dtype: float64
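
To make the arithmetic concrete, the sketch below recomputes the first row of the output by hand. The function names `comfort_index` and `scale_to_range` are hypothetical helpers introduced only for this example. The formulas are the ones from the cell above, and the raw minimum and maximum are the printed values.

```python
def comfort_index(temp_c: float, humidity_pct: float) -> float:
    """Comfort index formula from the cell above (hypothetical helper)."""
    return temp_c - 0.55 * (1 - humidity_pct / 100) * (temp_c - 14.5)


def scale_to_range(value, raw_min, raw_max, new_min, new_max):
    """Min-max scale a value from [raw_min, raw_max] to [new_min, new_max]."""
    fraction = (value - raw_min) / (raw_max - raw_min)
    return fraction * (new_max - new_min) + new_min


# First row of the output: Temp = 2.0, Humidity = 96
raw = comfort_index(2.0, 96)

# Raw minimum and maximum as printed above
scaled = scale_to_range(raw, -1.0296, 24.1214, 0.15, 0.90)

print(round(raw, 3), round(scaled, 2))
```

The rounded results match the `ComfortIndex` and `ComfortIndexScaled` values of the first row (2.275 and 0.25). By construction, the raw minimum maps to 0.15 and the raw maximum to 0.90.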


Then, to make it more meaningful, we will combine it with the `HourDPT%` feature we engineered from `Hour` and `DewPointTemp`. The result is a combined comfort 'interaction feature' intended to capture real-world relationships between the features.

In [11]:
data_frame['ComfortAdd'] = (data_frame['ComfortIndex'] * data_frame['HourDPT%']).round(2)

min_val = 0.15
max_val = 0.80
data_frame['ComfortAddScaled'] = (((data_frame['ComfortAdd'] - data_frame['ComfortAdd'].min()) / (data_frame['ComfortAdd'].max() - data_frame['ComfortAdd'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Hour', 'HourDPT%', 'ComfortIndex', 'ComfortAdd', 'ComfortAddScaled']])
print(data_frame['ComfortIndexScaled'].describe())

      Hour  HourDPT%  ComfortIndex  ComfortAdd  ComfortAddScaled
2160     0      0.00       2.27500        0.00              0.32
2161     1      0.00       2.30460        0.00              0.32
2162     2      0.01       2.20625        0.02              0.32
2163     3      0.01       1.81285        0.02              0.32
2164     4      0.01       1.81285        0.02              0.32
...    ...       ...           ...         ...               ...
4363    19      0.00      22.43405        0.00              0.32
4364    20      0.49      21.11340       10.35              0.59
4365    21      0.49      20.27200        9.93              0.58
4366    22      0.48      19.61170        9.41              0.56
4367    23      0.51      19.24600        9.82              0.57

[2208 rows x 5 columns]
count    2208.000000
mean        0.574284
std         0.152157
min         0.150000
25%         0.460000
50%         0.590000
75%         0.690000
max         0.900000
Name: ComfortIndexScaled, dtype: float64

* (This feature proved ineffective later on)

In [12]:
# Select only numeric columns
numeric_columns = data_frame.select_dtypes(include=['number'])

# Count negative values for all numeric columns
negative_values = (numeric_columns < 0).sum()

# Print the result
print("Number of negative values in each column:")
print(negative_values)

Number of negative values in each column:
Count                   0
Hour                    0
Temp                   22
Humidity                0
WindSpeed               0
Visibility              0
DewPointTemp          662
SolarRadiation          0
Rainfall                0
Snowfall                0
Seasons                 0
Holiday                 0
FunctioningDay          0
DayOfWeek               0
RushHour                0
Month                   0
HourDPT               637
HourDPT%              602
ComfortIndex            5
ComfortIndexScaled      0
ComfortAdd            596
ComfortAddScaled        0
dtype: int64


* I noticed negative values; however, this is normal for `Temp` and `DewPointTemp`

* Scaled values such as `ComfortIndexScaled` were negative, so to counteract this I used a smaller range (a lower `max_val`)

I also considered scaling some features such as `Temp`; however, all features are already varied, so this would be futile

#### Save the wrangled and engineered data to CSV

In [13]:
data_frame.to_csv('../2.3.model_training/2.3.1.model_ready_data.csv', index=False)