# Rideshare Prediction Project

In [1]:
%matplotlib inline
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import itertools
random.seed(42)
np.random.seed(42)

## 🏁 1. Introduction: Predicting Rideshare Demand with an MLP

This project explores how a Multi-Layer Perceptron (MLP) can predict rideshare prices from structured trip and weather data. I wanted to test how a simple feedforward network compares with a more traditional model (Ridge regression) at handling non-linear relationships between trip features.

In [2]:
# Import any package you need here
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [3]:
df = pd.read_csv('rideshare_train.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42500 entries, 0 to 42499
Data columns (total 57 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           42500 non-null  object 
 1   timestamp                    42500 non-null  object 
 2   hour                         42500 non-null  int64  
 3   day                          42500 non-null  int64  
 4   month                        42500 non-null  int64  
 5   datetime                     42500 non-null  object 
 6   timezone                     42500 non-null  object 
 7   source                       42500 non-null  object 
 8   destination                  42500 non-null  object 
 9   cab_type                     42500 non-null  object 
 10  product_id                   42500 non-null  object 
 11  name                         42500 non-null  object 
 12  price                        42500 non-null  float64
 13  distance        

In [5]:
df.head()

Unnamed: 0,id,timestamp,hour,day,month,datetime,timezone,source,destination,cab_type,...,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,d3d168df-a1b6-4de3-b1a1-7d1fb464bd41,2018-11-26 03:40:46.000000000,3,26,11,2018-11-26 03:40:46,America/New_York,Beacon Hill,Boston University,Lyft,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
1,f3f35cea-9a3f-4534-930d-da42e7610f77,2018-11-26 03:40:46.000000000,3,26,11,2018-11-26 03:40:46,America/New_York,Beacon Hill,Boston University,Lyft,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
2,ed02bff5-5c02-4483-843b-6a571def3fab,2018-11-26 03:40:47.000000000,3,26,11,2018-11-26 03:40:46,America/New_York,Theatre District,Boston University,Uber,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
3,c6d8818c-6943-4f8f-9e6c-13e786df5fa4,2018-11-26 03:40:47.000000000,3,26,11,2018-11-26 03:40:47,America/New_York,Northeastern University,Theatre District,Uber,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
4,90afafea-b369-4195-938b-66409ee62e84,2018-11-26 03:40:47.000000000,3,26,11,2018-11-26 03:40:47,America/New_York,Beacon Hill,North End,Uber,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800


**Dataset Preprocessing**

In [6]:
# Since it's Ridge, I need to make most of the data that I'll be using a numerical form to feed it to the model
df[:2]  
# I can drop ID, timestamp too, I think the other features capture it, datetime is better for this
# How many timezones do I have?

Unnamed: 0,id,timestamp,hour,day,month,datetime,timezone,source,destination,cab_type,...,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,d3d168df-a1b6-4de3-b1a1-7d1fb464bd41,2018-11-26 03:40:46.000000000,3,26,11,2018-11-26 03:40:46,America/New_York,Beacon Hill,Boston University,Lyft,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
1,f3f35cea-9a3f-4534-930d-da42e7610f77,2018-11-26 03:40:46.000000000,3,26,11,2018-11-26 03:40:46,America/New_York,Beacon Hill,Boston University,Lyft,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800


In [7]:
df['timezone'].unique() # Just 1, so I can discard this column too; with zero variance it won't aid my predictions

array(['America/New_York'], dtype=object)
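The same zero-variance check can be run over every column at once instead of one column at a time. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the rideshare data) and flags any column with a single distinct value.

```python
import pandas as pd

# Toy frame standing in for the rideshare data (values are made up).
toy = pd.DataFrame({
    "timezone": ["America/New_York"] * 4,
    "source": ["Beacon Hill", "Fenway", "Back Bay", "Fenway"],
    "hour": [3, 3, 4, 5],
})

# nunique() counts distinct values per column; a count of 1 means zero variance.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

On the real frame, the resulting list could be passed straight to `drop(columns=...)`.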

In [8]:
df['source'].unique() # Need to feature encode this one

array(['Beacon Hill', 'Theatre District', 'Northeastern University',
       'South Station', 'Back Bay', 'West End', 'Financial District',
       'North End', 'Fenway', 'North Station', 'Haymarket Square',
       'Boston University'], dtype=object)

## ⚙️ 2. Data Preparation and Feature Engineering

After importing the dataset, I cleaned, encoded, and scaled the features to prepare them for the models. Scaling puts the continuous inputs on comparable ranges, which helps gradient-based training converge.

In [9]:
one_hot_encoded_df = df.copy()

In [10]:
one_hot_encoded_df['product_id'].value_counts() # I can drop this too

product_id
9a0e7b09-b92b-4c41-9779-2ad22b4d779d    3706
6c84fd89-3f11-4782-9b50-97c468b19529    3684
55c66225-fbe7-4fd5-9072-eab1ece5e23e    3675
6f72dfc5-27f1-42e8-84db-ccc7a75f6969    3655
997acbb5-e102-41e1-b155-9df7de0a73f2    3646
6d318bcc-22a3-4af6-bddd-b409bfce1546    3646
lyft_lux                                3473
lyft_plus                               3453
lyft_line                               3452
lyft_luxsuv                             3395
lyft_premier                            3380
lyft                                    3335
Name: count, dtype: int64

In [11]:
one_hot_encoded_df = one_hot_encoded_df.drop(columns=['timezone', 'timestamp', 'id', 'datetime', 'product_id'])
# I can drop these columns; I don't think they add any predictive value

In [12]:
# I want to check whether 'visibility.1' and 'visibility' are the same thing; maybe the column was just duplicated
comparison_result = one_hot_encoded_df['visibility.1'] == one_hot_encoded_df['visibility']
comparison_result = comparison_result.astype(int)
comparison_result.sum() == one_hot_encoded_df.shape[0] # True means every row matches, so the columns are identical and I can drop one

one_hot_encoded_df = one_hot_encoded_df.drop(columns='visibility.1')
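A shorter way to run the same duplicate-column check is `Series.equals`, which returns a single boolean and also treats NaNs in matching positions as equal. This is only a sketch on made-up values; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "visibility": [10.0, 4.8, 9.9],
    "visibility.1": [10.0, 4.8, 9.9],
    "humidity": [0.9, 0.7, 0.8],
})

# equals() compares the two columns element by element and returns one bool.
same = toy["visibility.1"].equals(toy["visibility"])
different = toy["humidity"].equals(toy["visibility"])
print(same, different)
```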

In [13]:
one_hot_encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42500 entries, 0 to 42499
Data columns (total 51 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   hour                         42500 non-null  int64  
 1   day                          42500 non-null  int64  
 2   month                        42500 non-null  int64  
 3   source                       42500 non-null  object 
 4   destination                  42500 non-null  object 
 5   cab_type                     42500 non-null  object 
 6   name                         42500 non-null  object 
 7   price                        42500 non-null  float64
 8   distance                     42500 non-null  float64
 9   surge_multiplier             42500 non-null  float64
 10  latitude                     42500 non-null  float64
 11  longitude                    42500 non-null  float64
 12  temperature                  42500 non-null  float64
 13  apparentTemperat

In [14]:
one_hot_encoded_df['name'].unique() # I think I should do the same thing here, before I run it to my model

array(['Lyft', 'Lux', 'UberPool', 'UberX', 'UberXL', 'Black SUV', 'Black',
       'WAV', 'Lyft XL', 'Lux Black XL', 'Lux Black', 'Shared'],
      dtype=object)

In [15]:
one_hot_encoded_df['price'].dtype # price is in float, I can work with that
# Keeping in mind this will NOT be part of my training data

dtype('float64')

In [16]:
one_hot_encoded_df['short_summary']
# There are a couple of features that I have to dummy code, I think I can just use d matrices to do this or I can dummy code them all in one for loop

0            Foggy 
1            Foggy 
2            Foggy 
3            Foggy 
4            Foggy 
            ...    
42495     Overcast 
42496     Overcast 
42497     Overcast 
42498     Overcast 
42499     Overcast 
Name: short_summary, Length: 42500, dtype: object

In [17]:
one_hot_encoded_df[:5]

Unnamed: 0,hour,day,month,source,destination,cab_type,name,price,distance,surge_multiplier,...,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,3,26,11,Beacon Hill,Boston University,Lyft,Lyft,9.0,2.3,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
1,3,26,11,Beacon Hill,Boston University,Lyft,Lux,16.5,2.3,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
2,3,26,11,Theatre District,Boston University,Uber,UberPool,8.5,2.62,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
3,3,26,11,Northeastern University,Theatre District,Uber,UberX,9.5,2.05,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
4,3,26,11,Beacon Hill,North End,Uber,UberXL,14.0,1.35,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800


In [18]:
cat_cols = ['destination', 'cab_type', 'name', 'source',
            'short_summary', 'long_summary', 'icon']

df_encoded_v2 = one_hot_encoded_df.copy()

for col in cat_cols:
    dummies = pd.get_dummies(df_encoded_v2[col], prefix=col, drop_first=True)
    # make sure they are 0/1 integers
    dummies = dummies.astype(int)
    # add to the dataframe
    df_encoded_v2 = pd.concat([df_encoded_v2.drop(columns=[col]), dummies], axis=1)
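For comparison, `pd.get_dummies` can encode several columns in one call through its `columns` argument, which should give the same kind of result as the loop above. A minimal sketch on a made-up frame (`toy` is an assumption; only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows; the column names come from the rideshare dataset.
toy = pd.DataFrame({
    "cab_type": ["Lyft", "Uber", "Uber"],
    "name": ["Lux", "UberX", "UberPool"],
    "distance": [2.3, 2.05, 1.35],
})

# Encode both categorical columns at once; untouched columns are kept as-is.
encoded = pd.get_dummies(
    toy, columns=["cab_type", "name"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

With `drop_first=True` the alphabetically first level of each column (`Lyft`, `Lux`) becomes the implicit baseline.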

In [19]:
df_encoded_v2[:2] # 102 columns here, which is expected since I dummy coded all of the variables above
one_hot_encoded_df[:2] # 51 columns here; I was just checking that the encoding worked, and it seems it did

# So far: I dummy coded the categorical variables, dropped some columns that didn't help, and checked that visibility == visibility.1

Unnamed: 0,hour,day,month,source,destination,cab_type,name,price,distance,surge_multiplier,...,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,3,26,11,Beacon Hill,Boston University,Lyft,Lyft,9.0,2.3,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800
1,3,26,11,Beacon Hill,Boston University,Lyft,Lux,16.5,2.3,1.0,...,0.1396,1543161600,40.61,1543122000,46.15,1543154400,38.23,1543136400,43.17,1543186800


In [20]:
df_encoded_v2[:5]

Unnamed: 0,hour,day,month,price,distance,surge_multiplier,latitude,longitude,temperature,apparentTemperature,...,long_summary_ Possible drizzle in the morning.,long_summary_ Rain in the morning and afternoon.,long_summary_ Rain throughout the day.,"long_summary_ Rain until morning, starting again in the evening.",icon_ clear-night,icon_ cloudy,icon_ fog,icon_ partly-cloudy-day,icon_ partly-cloudy-night,icon_ rain
0,3,26,11,9.0,2.3,1.0,42.3429,-71.1003,41.83,41.83,...,0,1,0,0,0,0,1,0,0,0
1,3,26,11,16.5,2.3,1.0,42.3429,-71.1003,41.83,41.83,...,0,1,0,0,0,0,1,0,0,0
2,3,26,11,8.5,2.62,1.0,42.3429,-71.1003,41.83,41.83,...,0,1,0,0,0,0,1,0,0,0
3,3,26,11,9.5,2.05,1.0,42.3429,-71.1003,41.83,41.83,...,0,1,0,0,0,0,1,0,0,0
4,3,26,11,14.0,1.35,1.0,42.3429,-71.1003,41.83,41.83,...,0,1,0,0,0,0,1,0,0,0


In [21]:
comparison = df_encoded_v2['temperature'] == df_encoded_v2['apparentTemperature']
comparison = comparison.astype(int) # I see some zeros, so the two columns differ in places. How often?
comparison.sum() / df_encoded_v2.shape[0] # Share of rows where they are identical: about 27%, so they differ about 73% of the time. I'll keep both; I'm not sure yet if the difference is meaningful

np.float64(0.26635294117647057)

In [22]:
df_encoded_v2['surge_multiplier'].nunique() # What, there's six values here
df_encoded_v2['surge_multiplier'].unique() # Interesting, keeping them

array([1.  , 1.5 , 1.25, 2.  , 1.75, 2.5 ])

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [24]:
df_encoded_v2.dtypes

hour                            int64
day                             int64
month                           int64
price                         float64
distance                      float64
                               ...   
icon_ cloudy                    int64
icon_ fog                       int64
icon_ partly-cloudy-day         int64
icon_ partly-cloudy-night       int64
icon_ rain                      int64
Length: 102, dtype: object

In [25]:
numeric_cols = df_encoded_v2.select_dtypes(include=['number']).columns
print(numeric_cols) 
# All of them are numeric now: the index lists all 102 columns of df_encoded_v2. Next I need to scale the continuous ones

Index(['hour', 'day', 'month', 'price', 'distance', 'surge_multiplier',
       'latitude', 'longitude', 'temperature', 'apparentTemperature',
       ...
       'long_summary_ Possible drizzle in the morning. ',
       'long_summary_ Rain in the morning and afternoon. ',
       'long_summary_ Rain throughout the day. ',
       'long_summary_ Rain until morning, starting again in the evening. ',
       'icon_ clear-night ', 'icon_ cloudy ', 'icon_ fog ',
       'icon_ partly-cloudy-day ', 'icon_ partly-cloudy-night ',
       'icon_ rain '],
      dtype='object', length=102)


In [26]:
# Scaling
# df_encoded_v2.to_excel('test.xlsx') # I needed to check this out

In [27]:
# Note: 'surge_multiplier' is listed twice, so the scaler below is fit on 20 columns;
# the test-time list in cell 51 repeats the same entries, and the two lists must stay in sync
numeric_to_scale = ['distance','surge_multiplier','temperature','surge_multiplier',
                    'apparentTemperature', 'humidity', 'windSpeed','windGust','visibility' , 'temperatureHigh',
                   'temperatureLow' ,'apparentTemperatureHigh' ,'apparentTemperatureLow' , 'dewPoint','pressure',
                   'windBearing', 'ozone', 'temperatureMax', 'apparentTemperatureMin', 'apparentTemperatureMax' ]  # continuous

In [28]:
df_encoded_v2.columns

Index(['hour', 'day', 'month', 'price', 'distance', 'surge_multiplier',
       'latitude', 'longitude', 'temperature', 'apparentTemperature',
       ...
       'long_summary_ Possible drizzle in the morning. ',
       'long_summary_ Rain in the morning and afternoon. ',
       'long_summary_ Rain throughout the day. ',
       'long_summary_ Rain until morning, starting again in the evening. ',
       'icon_ clear-night ', 'icon_ cloudy ', 'icon_ fog ',
       'icon_ partly-cloudy-day ', 'icon_ partly-cloudy-night ',
       'icon_ rain '],
      dtype='object', length=102)

In [29]:
scaler = StandardScaler()

# work on a copy so you keep the original untouched
df_scaled = df_encoded_v2.copy()

df_scaled[numeric_to_scale] = scaler.fit_transform(df_scaled[numeric_to_scale])
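The key property of this step is that the scaler learns its mean and standard deviation from the training data only and is later reused, unchanged, on held-out data. A minimal sketch on synthetic numbers (`toy_scaler`, `train`, and `held_out` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 2))     # synthetic "training" features
held_out = rng.normal(loc=5.0, scale=2.0, size=(20, 2))   # synthetic "test" features

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)    # learns mean/std from train only
held_out_scaled = toy_scaler.transform(held_out)  # reuses the train statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The held-out columns will generally not have exactly zero mean after the transform, which is expected: they are expressed in the training set's units.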

In [30]:
# From what I gather, these are all trips within Boston (see the source/destination neighborhoods), so I'm going to drop lat and long; I don't think they add much predictive value
df_scaled = df_scaled.drop(columns=['latitude','longitude'])

In [32]:
# I think columns like windGustTime or temperatureHighTime don't add much to the model, and they are stored as Unix timestamps (weird)
# I would have to turn them into datetimes and then do some more work on them
# I'm going to take the easy way out and get rid of them; if I get an R2 under 0.92 I can revisit this
# Or I can just discard them when training my Ridge Regression Model

columns_to_drop = ['windGustTime', 'temperatureHighTime','apparentTemperatureHighTime', 'apparentTemperatureLowTime', 'sunsetTime', 'uvIndexTime',
                   'temperatureMinTime', 'temperatureMaxTime', 'apparentTemperatureMinTime','apparentTemperatureMaxTime','price'] # price being key here

## 🧠 3. Model Architecture and Training

I built a multi-layer perceptron with scikit-learn's MLPRegressor, experimenting with hidden layers and activation functions to capture complex interactions between ride attributes. The MLP was trained on 80% of the training file, with the remaining 20% held out for validation. A Ridge regression fit on the full training file serves as the baseline.

I tuned hyperparameters like the number of neurons and learning rate to balance bias and variance.

In [33]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score

In [34]:
# Implement your work below, feel free to open up new cells.
X_train = df_scaled.drop(columns=columns_to_drop, errors = 'ignore')
y_train = df_scaled['price']

In [35]:
ridge_model = Ridge(alpha=5.0) # alpha is tunable: a larger value means a stronger penalty and a simpler model
## Why did I choose this value for the hyperparameter?
## I explain this in a markdown cell below, when processing the test data and running the predictions.
# Fit the model to the scaled training data
ridge_model.fit(X_train, y_train)

In [36]:
df_scaled.T.duplicated().sum()  # duplicates
corr = df_scaled.corr().abs()

In [37]:
corr # There are some strong correlations, but the penalty term should help with those

Unnamed: 0,hour,day,month,price,distance,surge_multiplier,temperature,apparentTemperature,precipIntensity,precipProbability,...,long_summary_ Possible drizzle in the morning.,long_summary_ Rain in the morning and afternoon.,long_summary_ Rain throughout the day.,"long_summary_ Rain until morning, starting again in the evening.",icon_ clear-night,icon_ cloudy,icon_ fog,icon_ partly-cloudy-day,icon_ partly-cloudy-night,icon_ rain
hour,1.000000,0.070744,0.092459,0.005805,0.006373,0.001728,0.244010,0.236868,0.208068,0.085789,...,0.013409,0.038155,0.056527,0.040627,0.280490,0.052820,0.066862,0.309556,0.110354,0.061329
day,0.070744,1.000000,0.911377,0.003451,0.000891,0.002750,0.092113,0.262413,0.099537,0.040657,...,0.015756,0.025892,0.140703,0.234680,0.049262,0.029950,0.066192,0.021555,0.023793,0.030158
month,0.092459,0.911377,1.000000,0.002853,0.001458,0.001610,0.014835,0.174004,0.181584,0.154950,...,0.057446,0.035547,0.182999,0.308993,0.016875,0.026640,0.022213,0.030457,0.046535,0.148283
price,0.005805,0.003451,0.002853,1.000000,0.346838,0.239623,0.001746,0.002067,0.003423,0.004842,...,0.002416,0.001884,0.004419,0.000424,0.000652,0.004418,0.002265,0.002541,0.002899,0.005342
distance,0.006373,0.000891,0.001458,0.346838,1.000000,0.020361,0.003394,0.003938,0.006736,0.007920,...,0.005648,0.001270,0.003288,0.000144,0.002130,0.006343,0.007959,0.004098,0.002651,0.008191
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
icon_ cloudy,0.052820,0.029950,0.026640,0.004418,0.006343,0.003341,0.163618,0.199949,0.194818,0.206591,...,0.094379,0.044936,0.012268,0.071536,0.210191,1.000000,0.078219,0.292418,0.356893,0.248453
icon_ fog,0.066862,0.066192,0.022213,0.002265,0.007959,0.000404,0.070365,0.117279,0.040733,0.051885,...,0.007382,0.034220,0.023517,0.138318,0.042193,0.078219,1.000000,0.058699,0.071642,0.049874
icon_ partly-cloudy-day,0.309556,0.021555,0.030457,0.002541,0.004098,0.002099,0.098082,0.093647,0.152276,0.193968,...,0.027598,0.016237,0.087916,0.141140,0.157736,0.292418,0.058699,1.000000,0.267827,0.186449
icon_ partly-cloudy-night,0.110354,0.023793,0.046535,0.002899,0.002651,0.003133,0.202856,0.229078,0.185852,0.236736,...,0.033683,0.019817,0.020113,0.172261,0.192516,0.356893,0.071642,0.267827,1.000000,0.227560


In [38]:
df_scaled.T.duplicated().sum() 

np.int64(2)

In [39]:
dupes = df_scaled.T.duplicated()
dupe_cols = df_scaled.columns[dupes]
print("Exact duplicate columns:", dupe_cols.tolist()) # These repeat earlier columns: a collinearity problem, I should drop them!

Exact duplicate columns: ['icon_ cloudy ', 'icon_ fog ']
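`duplicated()` only reports the later copy of each pair, so it does not say which earlier column is being repeated. One way to recover the pairs is a pairwise `equals` check, sketched below on a made-up frame. The column `short_summary_ Foggy ` is a hypothetical partner chosen for illustration; the notebook does not show which columns the two duplicates actually mirror.

```python
import pandas as pd

# Made-up frame; 'short_summary_ Foggy ' is a hypothetical duplicate partner.
toy = pd.DataFrame({
    "short_summary_ Foggy ": [1, 0, 0, 1],
    "distance": [2.3, 2.6, 2.0, 1.3],
    "icon_ fog ": [1, 0, 0, 1],
})

# Compare every column with every later column and record identical pairs.
pairs = []
cols = toy.columns.tolist()
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if toy[first].equals(toy[second]):
            pairs.append((first, second))
print(pairs)
```

For roughly a hundred columns this is a few thousand comparisons, which is cheap enough to run interactively.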


In [40]:
# -------------------------------------------------------------------------------
# I think my data preprocessing worked for Ridge, but for the MLP I can keep some of the features; I just have to make sure
# they are numeric before feeding them to the network, so I'm going to process the data again
# Data preprocessing for MLP

In [41]:
# I'm going to try and fit the same preprocessed data set to my MLP, if this works I'm golden.
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score, mean_squared_error

In [42]:
df_scaled.shape[0]

42500

In [43]:
#Define feature groups from your existing df (train) after your initial basic drops
cat_cols = ['destination', 'cab_type', 'name', 'source', 'short_summary', 'long_summary', 'icon']
num_cols = ['distance','temperature','surge_multiplier','apparentTemperature','humidity','windSpeed','windGust',
            'visibility','temperatureHigh','temperatureLow','apparentTemperatureHigh','apparentTemperatureLow',
            'dewPoint','pressure','windBearing','ozone','temperatureMax','apparentTemperatureMin','apparentTemperatureMax']
# I want to scale all of the num_cols before feeding it to my Model

In [44]:
pre = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(), num_cols),
    ],
    remainder='drop'  # drop other columns (or 'passthrough' if you need them)
)

In [45]:
mlp = MLPRegressor(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='sgd', 
    learning_rate='adaptive',
    learning_rate_init=1e-2,  
    momentum=0.9,
    alpha=1e-6,
    batch_size=16,
    early_stopping=True,
    max_iter=5000,
)

The most important choices here are the SGD solver and a small batch size. At 16 samples per batch the gradient estimates are noisy, but each update is cheap, and training still reaches the R² I needed. The initial learning rate (1e-2) is not especially small and training still converged, which leads me to believe I don't have a convergence problem on my error surface. Momentum is left at the default of 0.9; I didn't tweak it at all. Alpha, the L2 penalty seen in ridge, doesn't have to be large given my dataset. So most, if not all, of the hyperparameters worked as intended. I did try Adam as the solver first, and it performed much better, but I did not include it since we haven't seen it in class. This is my explanation of why I picked these hyperparameters.

In [46]:
model = Pipeline(steps=[
    ('pre', pre),
    ('reg', TransformedTargetRegressor(regressor=mlp, transformer=StandardScaler()))
])
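`TransformedTargetRegressor` standardizes the target before fitting and inverts that transform at prediction time, so predictions come back in the original price units. The sketch below illustrates this with a plain `LinearRegression` in place of the MLP (a substitution made only to keep the example fast and deterministic) on made-up, exactly linear data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 * X_toy.ravel() + 20.0   # made-up "prices", exactly linear in X

# The regressor is trained on the standardized target...
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler()
)
ttr.fit(X_toy, y_toy)

# ...but predict() returns values on the original scale.
toy_pred = ttr.predict(X_toy)
print(toy_pred[:3])
```

This is why the RMSE reported for the pipeline can be read directly in the units of `price`.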

In [47]:
X = df.drop(columns=['price'])
y = df['price'].values

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [48]:
model.fit(X_train, y_train)

pred = model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
r2   = r2_score(y_valid, pred)

print(f"RMSE: {rmse:.3f}")
print(f"R²  : {r2:.4f}")

# This is doing it for the validation set, I will try it out on the test set down below.

RMSE: 1.829
R²  : 0.9609
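For reference, both metrics can be written out directly: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks these formulas against scikit-learn. The true values are the first five prices shown earlier; the predictions are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([9.0, 16.5, 8.5, 9.5, 14.0])   # first five prices in the training file
y_hat = np.array([10.0, 15.0, 9.0, 9.0, 13.0])   # made-up predictions

residuals = y_true - y_hat

# RMSE: typical size of an error, in the same units as the target.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the target's variance that the predictions account for.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, r2_manual)
```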


## 📊 4. Results and Takeaways

Once trained, I evaluated the models using RMSE, MAE, and R² and visualized predicted vs actual values. The MLP captured patterns reasonably well, showing that even a simple neural network can model nonlinearities in rideshare data.

Future improvements could include comparing against tree-based methods or adding feature interactions for better explainability.
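A predicted-vs-actual plot of the kind mentioned here could look roughly like the sketch below. It uses synthetic stand-in arrays rather than the notebook's results; with the real pipeline, `actual` would be `y_valid` and `predicted` would be `model.predict(X_valid)`.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
actual = rng.uniform(5, 40, size=200)             # stand-in for y_valid
predicted = actual + rng.normal(0, 2, size=200)   # stand-in for model predictions

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(actual, predicted, s=8, alpha=0.5)

# Points on the y = x line are perfect predictions.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, color="black", linewidth=1)

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Predicted vs. actual (synthetic stand-in data)")
```

Systematic curvature away from the diagonal would point to price ranges where the model is biased.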

In [50]:
df_test = pd.read_csv('rideshare_test.csv')
y_test = df_test['price']

In [51]:

# Keep raw target for evaluation
y_test = df_test["price"].copy()

# Start from a copy
one_hot_encoded_test = df_test.copy()

# Same drops as training (use errors='ignore' just in case)
one_hot_encoded_test = one_hot_encoded_test.drop(
    columns=['timezone','timestamp','id','datetime','product_id'],
    errors='ignore'
)
one_hot_encoded_test = one_hot_encoded_test.drop(columns='visibility.1', errors='ignore')


# One-hot encode with the SAME categorical columns used in training
# Caveat: drop_first=True drops the first category present in *this* frame, which may not be
# the baseline dropped in training; the reindex in cell 54 aligns column names but cannot undo that
cat_cols = ['destination', 'cab_type', 'name', 'source',
            'short_summary', 'long_summary', 'icon']

df_encoded_test_v2 = one_hot_encoded_test.copy()
for col in cat_cols:
    dummies = pd.get_dummies(df_encoded_test_v2[col], prefix=col, drop_first=True, dtype=int)
    df_encoded_test_v2 = pd.concat([df_encoded_test_v2.drop(columns=[col]), dummies], axis=1)

# Scale the same numeric columns with the scaler FIT ON TRAIN
numeric_to_scale = ['distance','surge_multiplier','temperature','surge_multiplier',
                    'apparentTemperature','humidity','windSpeed','windGust','visibility',
                    'temperatureHigh','temperatureLow','apparentTemperatureHigh','apparentTemperatureLow',
                    'dewPoint','pressure','windBearing','ozone','temperatureMax',
                    'apparentTemperatureMin','apparentTemperatureMax']

df_scaled_test = df_encoded_test_v2.copy()
df_scaled_test[numeric_to_scale] = scaler.transform(df_scaled_test[numeric_to_scale])

# Drop columns not used for modeling (same as training)
columns_to_drop = [
    'windGustTime','temperatureHighTime','apparentTemperatureHighTime','apparentTemperatureLowTime',
    'sunsetTime','uvIndexTime','temperatureMinTime','temperatureMaxTime',
    'apparentTemperatureMinTime','apparentTemperatureMaxTime'
]
# (lat/long were dropped on train after scaling)
df_scaled_test = df_scaled_test.drop(columns=['latitude','longitude'], errors='ignore')
df_scaled_test = df_scaled_test.drop(columns=columns_to_drop, errors='ignore')


In [52]:
df_scaled_test[:5]

Unnamed: 0,hour,day,month,price,distance,surge_multiplier,temperature,apparentTemperature,precipIntensity,precipProbability,...,short_summary_ Partly Cloudy,short_summary_ Possible Drizzle,short_summary_ Rain,long_summary_ Light rain in the morning.,long_summary_ Mostly cloudy throughout the day.,long_summary_ Rain throughout the day.,icon_ cloudy,icon_ partly-cloudy-day,icon_ partly-cloudy-night,icon_ rain
0,10,16,12,22.5,0.155066,-0.157538,-0.005742,-0.188933,0.0,0.0,...,0,0,0,0,0,1,1,0,0,0
1,10,16,12,7.0,-0.981178,-0.157538,-0.005742,-0.188933,0.0,0.0,...,0,0,0,0,0,1,1,0,0,0
2,10,16,12,21.5,0.586662,-0.157538,-0.005742,-0.188933,0.0,0.0,...,0,0,0,0,0,1,1,0,0,0
3,10,16,12,22.5,0.762824,-0.157538,-0.005742,-0.188933,0.0,0.0,...,0,0,0,0,0,1,1,0,0,0
4,10,16,12,11.5,0.877329,-0.157538,-0.005742,-0.188933,0.0,0.0,...,0,0,0,0,0,1,1,0,0,0


In [53]:
# Final feature matrix (X_test) and target (y_test)
X_test = df_scaled_test.drop(columns=[
    'windGustTime','temperatureHighTime','apparentTemperatureHighTime','apparentTemperatureLowTime',
    'sunsetTime','uvIndexTime','temperatureMinTime','temperatureMaxTime',
    'apparentTemperatureMinTime','apparentTemperatureMaxTime','price'], errors = 'ignore')
y_test = df_scaled_test['price']

In [54]:
X_test = X_test.reindex(columns=ridge_model.feature_names_in_, fill_value=0) 
# Making sure I have the same variables and if not add columns with 0s
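What `reindex` does here is easiest to see on a small example: columns the model expects but the test frame lacks are created and filled with 0, columns the model never saw are discarded, and the order follows the training columns. The frame and the `icon_ snow ` category below are made up for illustration.

```python
import pandas as pd

# Column order the (hypothetical) model was trained on.
train_cols = ["distance", "icon_ fog ", "icon_ rain "]

toy_test = pd.DataFrame({
    "icon_ rain ": [1, 0],
    "distance": [0.16, -0.98],
    "icon_ snow ": [0, 1],   # hypothetical category never seen in training
})

# Missing 'icon_ fog ' is added as zeros; unseen 'icon_ snow ' is dropped.
aligned = toy_test.reindex(columns=train_cols, fill_value=0)
print(aligned)
```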

In [55]:
# Prediction and metrics
y_pred = ridge_model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [60]:
print(r2) # YAY, R² is about 0.928!
print('------------------------------')
print(rmse)

0.9278015943845713
------------------------------
2.5000525303296963


After testing several values of the regularization parameter (λ), I found that R² stays almost unchanged from λ = 0.01 up to about 50. This indicates the model already generalizes well and does not require strong regularization. In other words, the predictors provide enough information to explain the variance without overfitting. When λ is increased to very large values (for example 500 or 5000), the penalty forces the coefficients to shrink too much, the model becomes overly simple, and the R² score drops. For this reason a moderate value such as λ = 5 works fine, but the exact choice within this low-to-moderate range is not critical.

However, there might be a sweet spot for λ because some variables are collinear. I hope I got rid of most of them in my preprocessing, but a deeper dive might help. Lastly, I did not hold out a validation set for Ridge, which is to say I missed an important safeguard. I did get the R² that was needed on the first go-around with alpha = 5. I could evaluate further values of λ, but picking among them by test-set score would overfit to the test set, since I already have an idea of what works there.
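The missing safeguard could be added by sweeping λ (scikit-learn's `alpha`) against a validation split carved out of the training data, leaving the test file untouched until the end. The sketch below shows the pattern on synthetic data from `make_regression`; the numbers it produces say nothing about the rideshare results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the training file.
X_syn, y_syn = make_regression(
    n_samples=500, n_features=20, noise=10.0, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fit one Ridge model per alpha and score it on the validation split only.
val_scores = {}
for alpha in [0.01, 0.5, 5.0, 50.0, 500.0, 5000.0]:
    fitted = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_scores[alpha] = r2_score(y_val, fitted.predict(X_val))

best_alpha = max(val_scores, key=val_scores.get)
for alpha, score in val_scores.items():
    print(f"alpha={alpha:<7} validation R2={score:.4f}")
```

Once `best_alpha` is chosen this way, the model would be refit and scored on the test file a single time.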

In [57]:
mlp_test = pd.read_csv('rideshare_test.csv')
X_test_mlp = mlp_test.drop(columns=['price'], errors='ignore').copy()

# Make sure all columns expected by the pipeline exist in the test frame
need_cols = cat_cols + num_cols
missing = [c for c in need_cols if c not in X_test_mlp.columns]
if missing:
    raise KeyError(f"Test data is missing columns the pipeline expects: {missing}")

y_test_mlp_pred = model.predict(X_test_mlp)

In [62]:
# Model Prediction
y_test_ridge_pred = ridge_model.predict(X_test)

y_test_mlp_pred = model.predict(X_test_mlp)

def evaluate_test_predictions(y_test_pred, y_test_true):
    y_pred = np.array(y_test_pred)
    y_true = np.array(y_test_true)

    if y_pred.shape != y_true.shape:
        raise ValueError(f"Shape mismatch: predictions {y_pred.shape} vs true values {y_true.shape}")

    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"\nTest Set Evaluation:")
    print("=" * 50)
    print(f"MSE:   {mse:.4f}")
    print(f"MAE:   {mae:.4f}")
    print(f"R²:    {r2:.4f}")

print("Ridge Test Set Evaluation:")
evaluate_test_predictions(y_test_ridge_pred, y_test)

print ('--------------------------------------')

print("MLP Test Set Evaluation:")
evaluate_test_predictions(y_test_mlp_pred, y_test)

Ridge Test Set Evaluation:

Test Set Evaluation:
MSE:   6.2503
MAE:   1.7474
R²:    0.9278
--------------------------------------
MLP Test Set Evaluation:

Test Set Evaluation:
MSE:   4.3234
MAE:   1.5193
R²:    0.9501


The MLP outperformed the Ridge regression model across all key metrics on the test set, with a lower MSE (4.32 vs 6.25), a lower MAE (1.52 vs 1.75), and a higher R² (0.950 vs 0.928). This suggests that the neural network was better at capturing nonlinear relationships in the rideshare data, likely reflecting interactions between variables such as distance, ride type, and surge pricing. While the Ridge model performed respectably and offered interpretability, the MLP demonstrated stronger predictive power, showing how even a relatively simple feedforward architecture can adapt to complex real-world patterns. Future work could explore deeper architectures or ensemble methods to see if performance can be improved further.
