# **Feature Engineering**

As the name suggests, this part of the ML Engineering Workflow covers the *synthesis* of new feature variables, the *extraction* of additional feature variables from existing ones, and the *concatenation* of new feature variables that are appropriate for the context of the problem.

Common examples of creating new feature variables are
- **Polynomial Features**: $x_1, x_2 \rightarrow x_1^2, x_2^2, x_1x_2$
- **Interaction Features**: $x_1, x_2 \rightarrow x_1x_2$
- **Binning**: $x_1 \rightarrow \text{bin}(x_1)$
- **One-Hot Encoding**: $x_1 \rightarrow \text{one-hot}(x_1)$
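
As a minimal sketch of the four transformations listed above, the snippet below applies each one to a toy dataframe. The column names (`x1`, `x2`, `colour`) and the bin edges are illustrative assumptions and are not part of the taxi dataset.

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
    "colour": ["red", "blue", "red", "green"],
})

# Polynomial features: the square of each input
toy["x1^2"] = toy["x1"] ** 2
toy["x2^2"] = toy["x2"] ** 2

# Interaction feature: the product of the two inputs
toy["x1*x2"] = toy["x1"] * toy["x2"]

# Binning: map a continuous value to a labelled interval, here (0, 2] and (2, 4]
toy["x1_bin"] = pd.cut(toy["x1"], bins=[0, 2, 4], labels=["low", "high"])

# One-hot encoding: one indicator column per category
toy = pd.concat([toy, pd.get_dummies(toy["colour"], prefix="colour")], axis=1)

print(toy)
```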

Other more basic feature engineering examples would be 
- extracting *date-related* feature variables from existing data columns
- translating existing columns into new data columns

In [1]:
import pandas as pd
import numpy as np

# **Initial Inspection**

In [2]:
df = pd.read_csv('./nyc-taxi-trip-duration/train.csv', parse_dates=['pickup_datetime', 'dropoff_datetime'])
print(f'Dataset Shape: {df.shape}')

df.sample(3)

Dataset Shape: (1458644, 11)


Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
1265721,id2036437,2,2016-02-08 22:22:59,2016-02-08 22:45:08,2,-73.869873,40.772362,-73.974892,40.696232,N,1329
1209556,id3512847,2,2016-03-23 16:50:16,2016-03-23 17:29:10,2,-73.981956,40.76236,-73.872269,40.774509,N,2334
1042914,id2215112,2,2016-03-17 19:31:24,2016-03-17 19:38:57,6,-73.980759,40.75996,-73.969307,40.753349,N,453


In [3]:
df.dtypes

id                            object
vendor_id                      int64
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
pickup_longitude             float64
pickup_latitude              float64
dropoff_longitude            float64
dropoff_latitude             float64
store_and_fwd_flag            object
trip_duration                  int64
dtype: object

In [3]:
df['id'].nunique(), df['vendor_id'].nunique(), df.shape[0]

(1458644, 2, 1458644)

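We should **drop** the `dropoff_datetime` feature since it effectively *tells us* the target variable, `trip_duration` (the time elapsed between pickup and dropoff). We treat this information as **unknown** at inference time. We also drop `id`, which is unique for every row (see the counts above) and therefore carries no predictive signal.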
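
To see why `dropoff_datetime` leaks the target, the short check below recomputes the elapsed seconds for the three rows shown in the sample output above and compares them with `trip_duration`. It only verifies those three rows; it is a sketch, not a check of the full dataset.

```python
import pandas as pd

# The three rows are copied from the df.sample(3) output above
rows = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-02-08 22:22:59", "2016-03-23 16:50:16", "2016-03-17 19:31:24",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2016-02-08 22:45:08", "2016-03-23 17:29:10", "2016-03-17 19:38:57",
    ]),
    "trip_duration": [1329, 2334, 453],
})

# Seconds elapsed between pickup and dropoff
elapsed = (rows["dropoff_datetime"] - rows["pickup_datetime"]).dt.total_seconds()

# True when the elapsed time reproduces the target exactly
print((elapsed == rows["trip_duration"]).all())
```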

In [4]:
df_clean = df.drop(labels=['id', 'dropoff_datetime'], axis=1, errors='raise')
df.shape, df_clean.shape

((1458644, 11), (1458644, 9))

In [5]:
df_clean.to_csv('./nyc-taxi-trip-duration/train_clean.csv', index=False)

# **Date and Time Related Feature Engineering**

We can create a `Day of Week` feature using the `pickup_datetime` column from the dataset. Other date-related features we can engineer include
- `Day of Month`
- `Holiday?`
- `Time of Day` to categorize whether the pickup time was during `Early Morning`, `Morning`, `Afternoon`, `Evening`, or `Night`

In [13]:
df = pd.read_csv('./nyc-taxi-trip-duration/train_clean.csv', parse_dates=['pickup_datetime'])
df.sample(7)

Unnamed: 0,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
1187433,1,2016-06-14 20:27:10,2,-73.969215,40.764439,-74.0,40.741787,N,1073
1261132,2,2016-04-11 21:27:07,1,-74.00959,40.71146,-73.968895,40.693119,N,659
627895,2,2016-03-01 11:49:11,1,-73.962692,40.771599,-73.961594,40.77409,N,62
407279,1,2016-03-24 07:11:51,1,-73.951462,40.774094,-73.977257,40.754604,N,798
996747,2,2016-04-19 14:25:33,1,-73.978943,40.761829,-73.972382,40.794151,N,1175
1136095,2,2016-06-12 19:18:11,1,-73.959373,40.77417,-73.970032,40.765369,N,510
1079009,2,2016-03-23 23:08:42,1,-73.977547,40.749371,-73.955269,40.767387,N,536


In [6]:
from tqdm import tqdm

def timeOfDayDetector(datetime_series):
    '''
    Detects the time of day from a datetime series, and categorize it as either: Early Morning, Morning, Afternoon, Evening, Night
    
    Parameters:
        datetime_series (Series): A series of datetime values
        
    Returns:
        Series (Series): A series of time of day values
    '''
    
    timeOfDay = []
    
    # Name the loop variable `ts` so it does not shadow the `datetime` module name
    for ts in tqdm(datetime_series):
        hour = ts.hour
        
        if 0 <= hour < 5:
            timeOfDay.append('Early Morning')
        elif 5 <= hour < 12:
            timeOfDay.append('Morning')
        elif 12 <= hour < 17:
            timeOfDay.append('Afternoon')
        elif 17 <= hour < 21:
            timeOfDay.append('Evening')
        else:
            timeOfDay.append('Night')
            
    # Keep the input's index so the result aligns when assigned back to the dataframe
    return pd.Series(timeOfDay, index=datetime_series.index)
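
The loop above visits every row in Python. As a possible alternative, the same bucketing can be sketched with `pd.cut` on the hour component, which avoids the per-row loop. The function name `time_of_day_vectorized` is hypothetical, and the result is a categorical series rather than plain strings.

```python
import pandas as pd

def time_of_day_vectorized(datetime_series):
    """Bucket pickup hours into the same five labels without a Python-level loop."""
    labels = ['Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night']
    # Left-closed bins: [0, 5), [5, 12), [12, 17), [17, 21), [21, 24)
    return pd.cut(
        datetime_series.dt.hour,
        bins=[0, 5, 12, 17, 21, 24],
        labels=labels,
        right=False,
    )

# Small illustrative sample with one timestamp near each bin boundary
sample = pd.Series(pd.to_datetime([
    '2016-01-01 00:00:17', '2016-01-01 04:59:00', '2016-01-01 05:00:00',
    '2016-01-01 12:00:00', '2016-01-01 17:00:00', '2016-01-01 23:59:39',
]))
print(time_of_day_vectorized(sample).tolist())
```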

In [19]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
cal = calendar()
holidays = cal.holidays(start=df['pickup_datetime'].min(), end=df['pickup_datetime'].max())
holidays

DatetimeIndex(['2016-01-18', '2016-02-15', '2016-05-30'], dtype='datetime64[ns]', freq=None)

In [20]:
df['pickup_datetime'].min(), df['pickup_datetime'].max()

(Timestamp('2016-01-01 00:00:17'), Timestamp('2016-06-30 23:59:39'))

In [7]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

def dateFeatureEngineering(df):
    '''
    Generate date-related new features from the date column of the dataframe.
    
    Parameters:
        df: Pandas dataframe.
        
    Returns:
        df: Pandas dataframe with new features.
    '''

    df['Day of Week'] = df['pickup_datetime'].dt.dayofweek
    df['Day of Month'] = df['pickup_datetime'].dt.day
    
    # Check if date is a US holiday
    cal = calendar()
    holidays = cal.holidays(start=df['pickup_datetime'].min(), end=df['pickup_datetime'].max())
    df['Holiday'] = df['pickup_datetime'].isin(holidays)
    
    # Categorize the pickup time of day
    df['Time of Day'] = timeOfDayDetector(df['pickup_datetime'])
    
    return df

In [8]:
df_with_date_features = dateFeatureEngineering(df.copy())

print(f'Original: {df.shape}')
print(f'New: {df_with_date_features.shape}')

100%|██████████| 1458644/1458644 [00:20<00:00, 70230.65it/s] 


Original: (1458644, 11)
New: (1458644, 15)


In [9]:
df_with_date_features.sample(2)

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,Day of Week,Day of Month,Holiday,Time of Day
368154,id2529897,1,2016-05-31 18:01:01,2016-05-31 18:09:51,2,-73.991356,40.742172,-73.991638,40.754719,N,530,1,31,False,Evening
702888,id3950599,2,2016-05-18 23:28:54,2016-05-18 23:31:59,1,-73.952744,40.776779,-73.960121,40.768608,N,185,2,18,False,Night


In [10]:
min(df.pickup_datetime), max(df.pickup_datetime)

(Timestamp('2016-01-01 00:00:17'), Timestamp('2016-06-30 23:59:39'))

In [17]:
df_with_date_features.Holiday.value_counts()

False    1458644
Name: Holiday, dtype: int64
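
Every row is flagged `False` above, even though the six-month window spans three federal holidays. A likely cause is that `isin` compares full timestamps, while the holiday calendar holds midnight-valued dates, so only a pickup at exactly `00:00:00` on a holiday would match. The sketch below illustrates the difference on three made-up pickup times and shows one possible fix, which is to strip the time component with `.dt.normalize()` before comparing.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Illustrative pickups: two on 2016-01-18 (a federal holiday), one on an ordinary day
pickups = pd.Series(pd.to_datetime([
    '2016-01-18 00:00:00', '2016-01-18 09:30:00', '2016-01-19 11:35:24',
]))
holidays = USFederalHolidayCalendar().holidays(start='2016-01-01', end='2016-06-30')

# Exact timestamp match: only the pickup at midnight is flagged
exact = pickups.isin(holidays)

# Date-level match: drop the time of day first, then compare
by_date = pickups.dt.normalize().isin(holidays)

print(exact.tolist(), by_date.tolist())
```

Passing date-only bounds to `cal.holidays` may also matter: the earliest pickup is at `00:00:17`, which appears to be why 2016-01-01 is absent from the holiday list printed earlier.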

In [21]:
df_with_date_features['Time of Day'].value_counts()

Morning          372479
Afternoon        353762
Evening          341463
Night            234462
Early Morning    156478
Name: Time of Day, dtype: int64

In [24]:
# Save the new dataframe
df_with_date_features.to_csv('./nyc-taxi-trip-duration/train_clean_with_features.csv', index=False)

# **Geographical Feature Engineering**

Observe that the dataset contains `pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, and `dropoff_longitude` columns. We can use these to create new features such as
- `Distance`: using the Haversine formula
- `Bearing`: using the Bearing formula. Bearing refers to the angle between the north line and the line connecting the pickup and dropoff points.
- `Pickup Zone`: using the `pickup_latitude` and `pickup_longitude` columns to determine which zone the pickup point is in (e.g. Manhattan, Brooklyn, etc.)
- `Dropoff Zone`: using the `dropoff_latitude` and `dropoff_longitude` columns to determine which zone the dropoff point is in (e.g. Manhattan, Brooklyn, etc.)
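
The bearing described above can be sketched with the standard initial-bearing formula on a sphere. The function name `bearing_degrees` is hypothetical; it takes `(latitude, longitude)` tuples in degrees and returns an angle in `[0, 360)`, measured clockwise from north.

```python
import numpy as np

def bearing_degrees(pickup, dropoff):
    """Initial compass bearing from pickup to dropoff, in degrees within [0, 360)."""
    lat1, lon1 = np.radians(pickup)
    lat2, lon2 = np.radians(dropoff)
    dlon = lon2 - lon1

    # East-west and north-south components of the direction of travel
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)

    # arctan2 returns (-180, 180]; shift into [0, 360)
    return (np.degrees(np.arctan2(x, y)) + 360.0) % 360.0

# Sanity checks on simple directions
print(bearing_degrees((0.0, 0.0), (1.0, 0.0)))  # heading due north
print(bearing_degrees((0.0, 0.0), (0.0, 1.0)))  # heading due east
```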

Note that, unlike `dropoff_datetime`, the **dropoff** location features can be used since they are **known at the time of inference**. For example, when booking a ride, we already know the dropoff point. This can therefore be used to predict how long the trip will be, along with the other cleaned and generated feature variables.

For the zone extraction, we can use the `geopandas` library to create a `GeoDataFrame` object that contains the zones of New York City. We can then use the `shapely` library to determine whether a point is within a zone or not.

In [25]:
df = pd.read_csv('./nyc-taxi-trip-duration/train_clean_with_features.csv', parse_dates=['pickup_datetime'])
df.head()

Unnamed: 0,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,Day of Week,Day of Month,Holiday,Time of Day
0,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,N,455,0,14,False,Evening
1,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,12,False,Early Morning
2,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,19,False,Morning
3,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,N,429,2,6,False,Evening
4,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,N,435,5,26,False,Afternoon


In [27]:
import math

def distanceCalculator(pickup, dropoff):
    '''
    Calculate the distance using the latitude and longitude of the PICKUP and DROPOFF with the Haversine formula.
    
    Parameters:
        pickup (tuple): A tuple of (latitude, longitude) for the pickup location.
        dropoff (tuple): A tuple of (latitude, longitude) for the dropoff location.
        
    Returns:
        float: The distance between the pickup and dropoff locations in kilometers.
    '''
    
    # Convert coordinates to radians
    pickup_lat, pickup_long = pickup
    dropoff_lat, dropoff_long = dropoff
    pickup_lat, pickup_long, dropoff_lat, dropoff_long = map(np.radians, [pickup_lat, pickup_long, dropoff_lat, dropoff_long])
    
    # Haversine formula
    dlat = dropoff_lat - pickup_lat
    dlong = dropoff_long - pickup_long
    a = np.sin(dlat/2.0)**2 + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlong/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    
    # Radius of earth in kilometers is 6371
    return 6371 * c

In [None]:
pickup = (40.767937, -73.982155)
dropoff = (40.765602, -73.964630)

distanceCalculator(pickup, dropoff)

1.4985518720659607

In [30]:
tqdm.pandas()

# Calculate the distance between pickup and dropoff locations
df['Distance (km)'] = df.progress_apply(lambda row: distanceCalculator((row['pickup_latitude'], row['pickup_longitude']), (row['dropoff_latitude'], row['dropoff_longitude'])), axis=1)
df.sample(5)

100%|██████████| 1458644/1458644 [02:04<00:00, 11716.62it/s]


Unnamed: 0,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,Day of Week,Day of Month,Holiday,Time of Day,Distance (km)
297694,1,2016-04-12 20:14:03,2,-73.96492,40.772579,-73.953362,40.785381,N,355,1,12,False,Evening,1.724422
78652,2,2016-06-19 02:41:30,1,-73.988762,40.745396,-73.980751,40.743633,N,261,6,19,False,Early Morning,0.702747
487011,1,2016-01-14 16:12:09,3,-73.987885,40.754753,-73.97226,40.750713,N,763,3,14,False,Afternoon,1.3907
951886,2,2016-05-05 00:15:05,1,-74.002495,40.746948,-74.003586,40.732254,N,524,3,5,False,Early Morning,1.636505
804030,1,2016-06-20 08:28:55,1,-73.977959,40.752258,-73.958023,40.764744,N,543,0,20,False,Morning,2.178727


In [31]:
# Save the new dataframe
df.to_csv('./nyc-taxi-trip-duration/train_clean_with_features.csv', index=False)
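
The row-wise `progress_apply` above took about two minutes. Because the Haversine formula only uses NumPy functions, it can also be applied to whole columns at once. The sketch below uses a hypothetical helper named `haversine_km` and two rows copied from the `df.head()` output above; the coordinates are the rounded values as displayed, so the results only approximate the full-precision distances.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    # Mean Earth radius of 6371 km, as in distanceCalculator
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# First two rows of the df.head() output (coordinates as displayed)
trips = pd.DataFrame({
    'pickup_latitude': [40.767937, 40.738564],
    'pickup_longitude': [-73.982155, -73.980415],
    'dropoff_latitude': [40.765602, 40.731152],
    'dropoff_longitude': [-73.964630, -73.999481],
})

# One call over entire columns instead of one call per row
trips['Distance (km)'] = haversine_km(
    trips['pickup_latitude'], trips['pickup_longitude'],
    trips['dropoff_latitude'], trips['dropoff_longitude'],
)
print(trips['Distance (km)'])
```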

# **Kinda Feature Engineering: One-Hot Encoding**

We can also use the `pd.get_dummies()` function to create one-hot encoded features for the `vendor_id` and `store_and_fwd_flag` columns. This is not really feature engineering but it is a good practice to do this since we are dealing with categorical variables. We can perform this **during the preprocessing step** as well.

Alternatively, we can use the `sklearn.preprocessing.OneHotEncoder()` class to perform one-hot encoding. This is useful if we want to use the one-hot encoder in a pipeline.

It is CRUCIAL that you understand the differences between *alternative methods* in ANY of the steps in the ML Engineering Workflow, because this is what lets you choose the best method for your problem. Furthermore, the choice is sometimes dependent on certain *nuances* of your data. For example, between `pd.get_dummies()` and `sklearn.preprocessing.OneHotEncoder()`, the former does not remember the categories it has seen, so it CANNOT consistently encode new data whose categories differ from the training data. (Missing values are encoded as all zeros unless `dummy_na=True` is passed.) On the other hand, `sklearn.preprocessing.OneHotEncoder()` learns the categories during `fit` and can handle unseen categories at transform time by setting the `handle_unknown` parameter to `'ignore'`. Furthermore, the former returns a `pd.DataFrame` with **meaningful columns** which can be easier to debug, but heavier on memory. The latter returns a SciPy sparse matrix (CSR format) by default, which is more memory efficient but harder to debug.

> There is an important concept to remember when one-hot encoding: **data leakage**. What is the **objective** of creating machine learning models again? To give predictions on **unseen data**. How does this concept relate to OHE?
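
One way to approach this question is to learn the set of categories from the training split only and then force any later data into that same set of columns. The sketch below does this with `pd.get_dummies()` plus `reindex`; the `'X'` category is made up for illustration, since the real `store_and_fwd_flag` column is not claimed to contain it. Fitting `OneHotEncoder(handle_unknown='ignore')` on the training split is the pipeline-friendly equivalent.

```python
import pandas as pd

train = pd.DataFrame({'store_and_fwd_flag': ['N', 'Y', 'N']})
# 'X' is a hypothetical category that never appears in the training split
test = pd.DataFrame({'store_and_fwd_flag': ['N', 'X']})

# Learn the dummy columns from the training split only
train_ohe = pd.get_dummies(train['store_and_fwd_flag'], prefix='flag')

# Encode the test split, then align it to the training columns:
# unseen categories are dropped and absent ones are filled with 0
test_ohe = (
    pd.get_dummies(test['store_and_fwd_flag'], prefix='flag')
    .reindex(columns=train_ohe.columns, fill_value=0)
)
print(test_ohe)
```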

# **Feature Selection**

This is the process of selecting the most important features for the problem. Training on fewer features can reduce the training time and memory usage of our model, and it can also reduce the risk of overfitting.

This step builds on the results of EDA, particularly the **correlation matrix**. We can use it to determine which features are highly correlated with the target variable, and which features are highly correlated with each other. When two features are highly correlated with each other, one of them is largely redundant and is a candidate for removal. The analysis of **multi-collinearity** is useful for the same reason: it helps us determine which redundant features to remove.
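
As a rough sketch of correlation-based selection, the snippet below builds a synthetic dataframe in which one column is nearly a copy of another, then drops any column whose absolute correlation with an earlier column exceeds a threshold. The data, the column names, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)

features = pd.DataFrame({
    'a': a,
    'a_copy': 2.0 * a + 0.01 * rng.normal(size=n),  # nearly redundant with 'a'
    'b': b,
})

# Absolute pairwise correlations; keep the upper triangle so each pair is checked once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with a column to its left
threshold = 0.9  # arbitrary cut-off, tune for your problem
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
selected = features.drop(columns=to_drop)

print(to_drop)
print(selected.columns.tolist())
```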

You can perform this step **after** *or before* creating the first ML model. Sometimes, it is best to start with ALL of the features to create a **baseline model**. Later, when optimizing and tweaking the model, you can perform feature selection to improve the model (hopefully).

# **Feature Scaling**

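This is the process of rescaling the features to a common range or distribution. This is important since some ML models are sensitive to the scale of the features. For example, distance-based models such as `KNeighborsRegressor` and regularized linear models such as `Ridge` and `Lasso` are sensitive to the scale of the features. A plain `LinearRegression` model fitted by ordinary least squares gives the same predictions whether or not the features are scaled, although the magnitudes of its coefficients change. On the other hand, the `DecisionTreeRegressor` model is NOT sensitive to the scale of the features.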

In [32]:
df = pd.read_csv('./nyc-taxi-trip-duration/train_clean_with_features.csv', parse_dates=['pickup_datetime'])
df.head()

Unnamed: 0,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,Day of Week,Day of Month,Holiday,Time of Day,Distance (km)
0,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,N,455,0,14,False,Evening,1.498521
1,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,N,663,6,12,False,Early Morning,1.805507
2,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,1,19,False,Morning,6.385098
3,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,N,429,2,6,False,Evening,1.485498
4,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,N,435,5,26,False,Afternoon,1.188588


In [33]:
df.describe()

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,Day of Week,Day of Month,Distance (km)
count,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0
mean,1.53495,1.66453,-73.97349,40.75092,-73.97342,40.7518,959.4923,3.050375,15.50402,3.440864
std,0.4987772,1.314242,0.07090186,0.03288119,0.07064327,0.03589056,5237.432,1.954039,8.703135,4.296538
min,1.0,0.0,-121.9333,34.3597,-121.9333,32.18114,1.0,0.0,1.0,0.0
25%,1.0,1.0,-73.99187,40.73735,-73.99133,40.73588,397.0,1.0,8.0,1.231837
50%,2.0,1.0,-73.98174,40.7541,-73.97975,40.75452,662.0,3.0,15.0,2.093717
75%,2.0,2.0,-73.96733,40.76836,-73.96301,40.76981,1075.0,5.0,23.0,3.875337
max,2.0,9.0,-61.33553,51.88108,-61.33553,43.92103,3526282.0,6.0,31.0,1240.909


For the **numerical columns** above, you can *normalize* the scale **LATER**, after creating the baseline model, to check whether such a step results in an improvement or not. Refer to the discussion on EDA.
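
When you do get to that step, a minimal sketch of standardization (zero mean, unit variance) looks like the following. The tiny `train` and `test` frames are made up. The key assumption is that the mean and standard deviation are computed from the training split only and then reused on the test split, which is also how `sklearn.preprocessing.StandardScaler` behaves when fitted on training data.

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset
train = pd.DataFrame({'Distance (km)': [1.0, 2.0, 3.0], 'passenger_count': [1, 1, 4]})
test = pd.DataFrame({'Distance (km)': [2.0, 10.0], 'passenger_count': [2, 1]})

# Learn the scaling statistics from the training split only
mean = train.mean()
std = train.std(ddof=0)  # population standard deviation

# Apply the same statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```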