## 3. Feature Engineering

### Encoding categorical values into smart targets
1. Time: rush hour or not (`IsRushHour` 1/0); weekend vs. weekday
2. Time: time of day (4 values), weighted differently
3. Weather: binary flags (`IsRain` 1/0, `IsSnow` 1/0)
4. Traffic: a binary `IsTraffic` flag? EDA indicates that Unknown/Medium traffic has a real bearing on trip price
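
The binary flags in the list above can be sketched as boolean comparisons cast to integers. This is a minimal illustration on a made-up three-row frame (`toy`), not the notebook's final encoding; the column names `IsWeekend`, `IsRain`, and `IsSnow` are assumptions taken from the list.

```python
import pandas as pd

# Hypothetical toy rows using the same category labels as the dataset.
toy = pd.DataFrame({
    "Day_of_Week": ["Weekday", "Weekend", "Unknown"],
    "Weather": ["Rain", "Clear", "Snow"],
})

# Each comparison yields a boolean Series; astype(int) turns it into 1/0.
toy["IsWeekend"] = (toy["Day_of_Week"] == "Weekend").astype(int)
toy["IsRain"] = (toy["Weather"] == "Rain").astype(int)
toy["IsSnow"] = (toy["Weather"] == "Snow").astype(int)
```

Note that with this approach an 'Unknown' row gets 0 in every flag, so it is silently merged with the baseline category unless it receives a flag of its own.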

In [32]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [33]:
df = pd.read_csv("df_filled.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 875 entries, 0 to 874
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Trip_Distance_km       875 non-null    float64
 1   Time_of_Day            875 non-null    object 
 2   Day_of_Week            875 non-null    object 
 3   Passenger_Count        875 non-null    float64
 4   Traffic_Conditions     875 non-null    object 
 5   Weather                875 non-null    object 
 6   Base_Fare              875 non-null    float64
 7   Per_Km_Rate            875 non-null    float64
 8   Per_Minute_Rate        875 non-null    float64
 9   Trip_Duration_Minutes  875 non-null    float64
 10  Trip_Price             875 non-null    float64
dtypes: float64(7), object(4)
memory usage: 75.3+ KB


In [34]:
df["Traffic_Conditions"].value_counts()

Traffic_Conditions
Low        344
Medium     329
High       156
Unknown     46
Name: count, dtype: int64

### Drill down into 'Unknown' values in categorical columns

In [35]:
# Mean price per traffic level: 'Unknown' is priced similarly to 'High'
mean_traffic_price = df.groupby("Traffic_Conditions")["Trip_Price"].mean()
mean_traffic_price

Traffic_Conditions
High       54.287284
Low        52.239369
Medium     51.161354
Unknown    54.067491
Name: Trip_Price, dtype: float64

In [37]:
mean_tod_price = df.groupby("Time_of_Day")["Trip_Price"].mean()
mean_tod_price

Time_of_Day
Afternoon    52.809558
Evening      52.952252
Morning      52.274430
Night        51.039348
Unknown      48.296307
Name: Trip_Price, dtype: float64

In [38]:
df["Time_of_Day"].value_counts()

Time_of_Day
Afternoon    325
Morning      242
Evening      182
Night         81
Unknown       45
Name: count, dtype: int64

In [39]:
mean_dow_price = df.groupby("Day_of_Week")["Trip_Price"].mean()
mean_dow_price

Day_of_Week
Unknown    53.538713
Weekday    52.746890
Weekend    51.157111
Name: Trip_Price, dtype: float64

In [40]:
df["Day_of_Week"].value_counts()

Day_of_Week
Weekday    568
Weekend    268
Unknown     39
Name: count, dtype: int64

In [41]:
mean_weather_price = df.groupby("Weather")["Trip_Price"].mean()
mean_weather_price

Weather
Clear      51.799166
Rain       52.753097
Snow       53.913840
Unknown    55.027851
Name: Trip_Price, dtype: float64

In [42]:
df["Weather"].value_counts()

Weather
Clear      581
Rain       201
Snow        52
Unknown     41
Name: count, dtype: int64
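
The per-column checks above (mean price, then counts) can be condensed into one table. The sketch below uses a hypothetical helper, `unknown_summary`, which is not part of the original notebook, and runs it on made-up rows so it is self-contained.

```python
import pandas as pd

def unknown_summary(frame, columns, target="Trip_Price", label="Unknown"):
    """Hypothetical helper: share and mean target of `label` rows per column."""
    rows = []
    for col in columns:
        mask = frame[col] == label
        rows.append({
            "column": col,
            "unknown_share": mask.mean(),                 # fraction of rows labelled 'Unknown'
            "unknown_mean": frame.loc[mask, target].mean(),
            "known_mean": frame.loc[~mask, target].mean(),
        })
    return pd.DataFrame(rows).set_index("column")

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Weather": ["Clear", "Unknown", "Rain", "Clear"],
    "Trip_Price": [10.0, 40.0, 20.0, 30.0],
})
summary = unknown_summary(toy, ["Weather"])
```

On the notebook's frame the call would be `unknown_summary(df, ["Traffic_Conditions", "Time_of_Day", "Day_of_Week", "Weather"])`.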

### The question is what to do with 'Unknown'. It accounts for only about 5% of each column (39 to 46 of 875 rows), which is very small. Even so, 'Unknown' has the highest mean price for Day_of_Week and Weather and is a close second to 'High' for Traffic_Conditions; only for Time_of_Day is it the lowest.

### It is hard to know why data is missing in the categorical features. 'Unknown' might reflect real-world gaps in logging (peak, busy, or messy rides), a driver's failure to log, or a lack of training. It could also be a proxy for something that is not available as a category.

### For a software company, profit is business critical, so exploring 'Unknown' is worthwhile because it is associated with some of the highest mean fares. Which features in combination produce the highest fare?

In [51]:
# Mean trip price for every combination of the four categorical features, highest first
rushhour = (
    df.groupby(["Traffic_Conditions", "Time_of_Day", "Day_of_Week", "Weather"])["Trip_Price"]
    .mean()
    .reset_index()
    .sort_values(by="Trip_Price", ascending=False)
)
rushhour[:15]

Unnamed: 0,Traffic_Conditions,Time_of_Day,Day_of_Week,Weather,Trip_Price
127,Unknown,Unknown,Weekend,Rain,110.2544
119,Unknown,Evening,Weekend,Clear,102.0011
103,Medium,Night,Weekday,Unknown,98.3796
117,Unknown,Evening,Weekday,Clear,97.0573
90,Medium,Evening,Weekend,Unknown,96.2845
54,Low,Morning,Unknown,Unknown,90.5124
76,Medium,Afternoon,Weekday,Snow,79.29288
8,High,Evening,Unknown,Clear,78.194833
22,High,Morning,Weekend,Unknown,77.1963
21,High,Morning,Weekend,Clear,73.948575
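
A caveat on the ranking above: 875 rows are spread across up to 240 possible combinations (4 × 5 × 3 × 4), so a top-ranked mean may rest on very few trips. One way to check, sketched here on made-up rows with only two grouping columns, is to report the group size next to the mean. The names `mean_price`, `n_trips`, and the `min_trips` threshold are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Traffic_Conditions": ["High", "High", "Unknown", "Low", "Low", "Low"],
    "Weather": ["Clear", "Clear", "Rain", "Clear", "Clear", "Clear"],
    "Trip_Price": [50.0, 70.0, 110.0, 30.0, 40.0, 50.0],
})

# Named aggregation: mean price and number of trips per combination.
combo = (
    toy.groupby(["Traffic_Conditions", "Weather"])["Trip_Price"]
    .agg(mean_price="mean", n_trips="count")
    .reset_index()
    .sort_values("mean_price", ascending=False)
)

# Keep only combinations backed by at least `min_trips` rows (arbitrary threshold).
min_trips = 2
reliable = combo[combo["n_trips"] >= min_trips]
```

In the toy data the most expensive combination is a single trip, so it drops out of `reliable`. The same pattern applies to the four-column grouping in the notebook.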


In [44]:
df

Unnamed: 0,Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
0,19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.800000,0.320000,53.82,36.2624
1,36.87,Evening,Weekend,1.0,High,Clear,2.70,1.210000,0.150000,37.27,52.9032
2,30.33,Evening,Weekday,4.0,Low,Unknown,3.48,0.510000,0.150000,116.81,36.4698
3,8.64,Afternoon,Weekend,2.0,Medium,Clear,2.55,1.710000,0.480000,89.33,60.2028
4,3.85,Afternoon,Weekday,4.0,High,Rain,3.51,1.660000,0.292647,5.05,11.2645
...,...,...,...,...,...,...,...,...,...,...,...
870,5.49,Afternoon,Weekend,4.0,Medium,Clear,2.39,0.620000,0.490000,58.39,34.4049
871,45.95,Night,Weekday,4.0,Medium,Clear,3.12,0.610000,0.292647,61.96,62.1295
872,7.70,Morning,Weekday,3.0,Low,Rain,2.08,1.780000,0.292647,54.18,33.1236
873,47.56,Morning,Weekday,1.0,Low,Clear,2.67,0.820000,0.170000,114.94,61.2090


In [46]:
# RushHour flag =
# Weekday + (Morning or Evening) + High traffic
# 'Unknown' traffic is grouped with 'High' because its mean price is similar
df["IsRushHour"] = (
    (df["Day_of_Week"] == "Weekday") &
    (df["Time_of_Day"].isin(["Morning", "Evening"])) &
    (df["Traffic_Conditions"].isin(["High", "Unknown"]))
).astype(int)
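
A quick sanity check for a flag like this is to see how many rows it marks and whether the marked rows are priced differently. The sketch below assumes a frame that already has an `IsRushHour` column and uses made-up values; `flag_check` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "IsRushHour": [1, 0, 0, 1, 0],
    "Trip_Price": [60.0, 40.0, 50.0, 80.0, 30.0],
})

# Row count and mean price for flagged (1) and unflagged (0) trips.
flag_check = toy.groupby("IsRushHour")["Trip_Price"].agg(["count", "mean"])
```

If the flagged group is tiny or its mean barely differs from the rest, the flag may add little signal for the model.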
