<a href="https://colab.research.google.com/github/9mithun9/NY--Taxi-Fare-Prediction-Model/blob/main/taxi_fare_NY_prediction_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installation & Dataset**


*   Install the necessary libraries
*   Download the dataset from Kaggle

In [8]:
!pip install opendatasets --quiet

In [9]:
import opendatasets as od

In [10]:
dataset_url = 'https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data'

In [12]:
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: mehedferewrfs
Your Kaggle Key: ··········
Extracting archive ./new-york-city-taxi-fare-prediction/new-york-city-taxi-fare-prediction.zip to ./new-york-city-taxi-fare-prediction


In [13]:
data_dir = 'new-york-city-taxi-fare-prediction'

In [14]:
!ls -lh {data_dir}

total 5.4G
-rw-r--r-- 1 root root  486 May 13 08:32 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 root root 336K May 13 08:32 sample_submission.csv
-rw-r--r-- 1 root root 960K May 13 08:32 test.csv
-rw-r--r-- 1 root root 5.4G May 13 08:33 train.csv


In [15]:
!wc -l {data_dir}/train.csv

55423856 new-york-city-taxi-fare-prediction/train.csv


In [16]:
!head {data_dir}/train.csv

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.7267

**Sampling 1% of Data**


*   The training set is huge (about 55 million rows), so we keep only a random 1% sample (~550,000 rows)
*   This sample is used for both training and validation
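
The row-skipping idea can be tried without the 5.4 GB file. The sketch below is plain Python and only simulates row indices; `skip_row` follows the same contract as the callable passed to `skiprows` (return True to skip), and `n_rows` is a made-up row count used for illustration.

```python
import random

sample_fraction = 0.01   # keep roughly 1% of the data rows
n_rows = 200_000         # hypothetical row count, for illustration only

def skip_row(row_idx):
    """Return True when the row should be skipped."""
    if row_idx == 0:      # row 0 is the header, so it is always kept
        return False
    return random.random() > sample_fraction

random.seed(42)          # fixed seed so the same rows are picked on every run
kept = [i for i in range(n_rows + 1) if not skip_row(i)]
kept_fraction = (len(kept) - 1) / n_rows   # exclude the header from the count

print(f"kept {len(kept) - 1} of {n_rows} rows ({kept_fraction:.2%})")
```

Because each row is kept independently with probability 0.01, the sample size is only approximately 1% of the file, not exactly 1%.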

In [17]:
import pandas as pd

In [18]:
cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')
cols

['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [19]:
d_types = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude' : 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

In [20]:
sample_fraction = 0.01

In [21]:
import random

In [22]:
def skip_rows(row_idx):
  if row_idx == 0:
    return False
  return random.random() > sample_fraction

In [23]:
random.seed(42)
df = pd.read_csv(data_dir+'/train.csv',
                 usecols=cols,
                 dtype=d_types,
                 skiprows=skip_rows,
                 parse_dates=['pickup_datetime'])


In [None]:
df.info()

In [None]:
df

In [None]:
df.isna().sum()

In [24]:
test_df = pd.read_csv(data_dir+'/test.csv',dtype=d_types, parse_dates=['pickup_datetime'])

In [None]:
test_df

In [None]:
df.describe()

In [None]:
test_df.info()

In [None]:
test_df.describe()

**Splitting Data**


*   Hold out 20% of the 1% sample for validation and train on the remaining 80%



In [25]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(val_df))

441960 110490


In [None]:
train_df = train_df.dropna()
val_df = val_df.dropna()  # drop missing values from the validation set as well
test_df = test_df.dropna()

In [None]:
train_df.columns


In [None]:
input_cols = ['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_cols = 'fare_amount'

In [None]:
train_input = train_df[input_cols]
train_target = train_df[target_cols]
val_input = val_df[input_cols]
val_target = val_df[target_cols]

In [None]:
len(train_target), len(val_target)

In [None]:
test_inputs = test_df[input_cols]

**RMSE function**


*   Defines an RMSE function to measure prediction error, as required by the competition instructions
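
As a quick sanity check of the metric, here is a plain-Python sketch of the same calculation. `rmse_plain` and the fare values are hypothetical and exist only for this illustration.

```python
import math

def rmse_plain(targets, predictions):
    """Root mean squared error for two equal-length sequences of numbers."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares (in dollars) and predictions, for illustration only
fares = [4.5, 16.9, 5.7, 7.7]
preds = [5.5, 14.9, 5.7, 9.7]

# Errors are -1, 2, 0, -2, so the mean squared error is 9 / 4 = 2.25
print(rmse_plain(fares, preds))
```

The result is about 1.5, the square root of 2.25. Larger errors are penalized more heavily than small ones because each error is squared before averaging.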



In [26]:
import numpy as np  # needed here; the notebook otherwise imports numpy only later

def rmse(targets, predictions):
  return np.sqrt(np.mean(np.square(targets - predictions)))


**Function for Submission**


*   The function takes any fitted model and predicts fares for the test inputs
*   It also writes a submission file for the competition



In [27]:
def predict_and_submit(model, fname, test_inputs):
  test_preds = model.predict(test_inputs)
  submission_df = pd.DataFrame({
      'key': test_df.key,
      'fare_amount': test_preds
  })
  submission_df.to_csv(fname, index=False)
  return submission_df

**Feature Engineering**

**Extracting information from Dates**


*   Split the pickup timestamp into year, month, day, hour, day of week, day of year, and a weekend flag
*   Also add a time-of-day feature: Night, Morning, Afternoon, Evening
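
To make the features concrete, the sketch below derives the same parts for a single timestamp using only the standard library. `date_parts` is a hypothetical helper, not part of the notebook; it assumes Monday = 0 for the day of week and six-hour, half-open time-of-day bins.

```python
from datetime import datetime, timezone

def date_parts(ts):
    """Split a timestamp into the parts used as features."""
    # Half-open bins [0, 6), [6, 12), [12, 18), [18, 24)
    labels = ['Night', 'Morning', 'Afternoon', 'Evening']
    return {
        'year': ts.year,
        'month': ts.month,
        'day': ts.day,
        'hour': ts.hour,
        'dayofweek': ts.weekday(),          # Monday = 0 ... Sunday = 6
        'weekend': int(ts.weekday() >= 5),  # 1 for Saturday or Sunday
        'time_of_day': labels[ts.hour // 6],
    }

# Pickup time of the first row in the train.csv preview
parts = date_parts(datetime(2009, 6, 15, 17, 26, 21, tzinfo=timezone.utc))
print(parts)
```

For this pickup (a Monday at 17:26 UTC) the weekend flag is 0 and the time of day is Afternoon.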




In [29]:
# prompt: create an add_dateparts function
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_hour'] = df[col].dt.hour
    df[col + '_dayofweek'] = df[col].dt.dayofweek
    df[col + '_dayofyear'] = df[col].dt.dayofyear
    df[col + '_weekend'] = (df[col].dt.dayofweek // 5 == 1).astype(int) # 1 for weekend, 0 for weekday

    return df

In [30]:
add_dateparts(train_df, 'pickup_datetime')
add_dateparts(val_df, 'pickup_datetime')
add_dateparts(test_df, 'pickup_datetime')

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_datetime_year,pickup_datetime_month,pickup_datetime_day,pickup_datetime_hour,pickup_datetime_dayofweek,pickup_datetime_dayofyear,pickup_datetime_weekend
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.973320,40.763805,-73.981430,40.743835,1,2015,1,27,13,1,27,0
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1,2015,1,27,13,1,27,0
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44+00:00,-73.982521,40.751259,-73.979652,40.746140,1,2011,10,8,11,5,281,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12+00:00,-73.981163,40.767807,-73.990448,40.751637,1,2012,12,1,21,5,336,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12+00:00,-73.966049,40.789776,-73.988564,40.744427,1,2012,12,1,21,5,336,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9909,2015-05-10 12:37:51.0000002,2015-05-10 12:37:51+00:00,-73.968124,40.796997,-73.955643,40.780388,6,2015,5,10,12,6,130,1
9910,2015-01-12 17:05:51.0000001,2015-01-12 17:05:51+00:00,-73.945511,40.803600,-73.960213,40.776371,6,2015,1,12,17,0,12,0
9911,2015-04-19 20:44:15.0000001,2015-04-19 20:44:15+00:00,-73.991600,40.726608,-73.789742,40.647011,6,2015,4,19,20,6,109,1
9912,2015-01-31 01:05:19.0000005,2015-01-31 01:05:19+00:00,-73.985573,40.735432,-73.939178,40.801731,6,2015,1,31,1,5,31,1


In [31]:
train_df['time_of_day'] = pd.cut(train_df['pickup_datetime_hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'], right=False)
val_df['time_of_day'] = pd.cut(val_df['pickup_datetime_hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'], right=False)
test_df['time_of_day'] = pd.cut(test_df['pickup_datetime_hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'], right=False)

**Haversine Function**


*   A function to compute the distance in km between the pickup and dropoff points
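
A scalar sketch of the haversine formula is useful for checking the numbers by hand. `haversine_km` is a hypothetical standard-library counterpart of the notebook's vectorized function; the JFK and Times Square coordinates are the ones used in the landmarks table of this notebook.

```python
import math

EARTH_RADIUS_KM = 6371

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude along a meridian is about 111 km
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)

# JFK Airport to Times Square
jfk_to_times_square = haversine_km(40.6413, -73.7781, 40.7580, -73.9855)

print(round(one_degree, 1), round(jfk_to_times_square, 1))
```

The result is a straight-line distance over the Earth's surface, so it will generally be shorter than the distance a taxi actually drives.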


In [32]:
# prompt: create a haversine function to calculate the distance between two points: pick up and drop off using lat and long

import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


In [33]:
train_df['distance'] = haversine(train_df['pickup_latitude'], train_df['pickup_longitude'], train_df['dropoff_latitude'], train_df['dropoff_longitude'])
val_df['distance'] = haversine(val_df['pickup_latitude'], val_df['pickup_longitude'], val_df['dropoff_latitude'], val_df['dropoff_longitude'])
test_df['distance'] = haversine(test_df['pickup_latitude'], test_df['pickup_longitude'], test_df['dropoff_latitude'], test_df['dropoff_longitude'])

In [None]:
train_df.sample(5)

**Incorporating Landmarks**

*   A DataFrame of famous NYC landmarks, used to explore how drop-off locations relate to them
*   The haversine function computes the distance from each landmark to the drop-off point



In [34]:
# prompt: 'Landmark': ['JFK Airport', 'LaGuardia Airport', 'Times Square', 'Central Park', 'Empire State Building'],
#     'Latitude': [40.6413, 40.7769, 40.7580, 40.7829, 40.7484],
#     'Longitude': [-73.7781, -73.8740, -73.9855, -73.9654, -73.9857]      add more landmarks of NY more

import pandas as pd
landmarks = {
    'Landmark': ['JFK Airport', 'LaGuardia Airport', 'Times Square', 'Central Park', 'Empire State Building'],
    'Latitude': [40.6413, 40.7769, 40.7580, 40.7829, 40.7484],
    'Longitude': [-73.7781, -73.8740, -73.9855, -73.9654, -73.9857]
}


landmarks_df = pd.DataFrame(landmarks)
landmarks_df


Unnamed: 0,Landmark,Latitude,Longitude
0,JFK Airport,40.6413,-73.7781
1,LaGuardia Airport,40.7769,-73.874
2,Times Square,40.758,-73.9855
3,Central Park,40.7829,-73.9654
4,Empire State Building,40.7484,-73.9857


In [35]:
# prompt: now create a function that creates columns for all the landmarks showing their haversine distance from the drop off lat lon. u can use the haversine function created above

def add_landmark_distances(df, landmarks_df):
    for index, row in landmarks_df.iterrows():
        landmark_name = row['Landmark']
        landmark_lat = row['Latitude']
        landmark_lon = row['Longitude']

        df[f'distance_to_{landmark_name}'] = haversine(df['dropoff_latitude'], df['dropoff_longitude'], landmark_lat, landmark_lon)
    return df

train_df = add_landmark_distances(train_df, landmarks_df)
val_df = add_landmark_distances(val_df, landmarks_df)
test_df = add_landmark_distances(test_df, landmarks_df)


In [None]:
train_df.sample(5)

In [None]:
test_df.describe()

In [None]:
val_df.isna().sum()

**Handling Outliers**


*   Some rows have clearly unreasonable values, such as 208 passengers
*   Because the dataset is huge, these rows can simply be dropped
*   Coordinates are limited to the min-max range found in test_df; passenger_count is limited to 1-6 and fare_amount to 1-500
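
The filtering rule can be shown on a handful of rows. The sketch below uses plain dictionaries and made-up values, and for brevity it checks only the pickup coordinates; `is_valid` and all data in it are hypothetical.

```python
# Hypothetical miniature data, for illustration only
test_rows = [
    {'pickup_longitude': -74.02, 'pickup_latitude': 40.70},
    {'pickup_longitude': -73.78, 'pickup_latitude': 40.80},
]
train_rows = [
    {'pickup_longitude': -73.98, 'pickup_latitude': 40.75, 'passenger_count': 1, 'fare_amount': 7.5},
    {'pickup_longitude': 0.0, 'pickup_latitude': 0.0, 'passenger_count': 1, 'fare_amount': 9.0},       # bad coordinates
    {'pickup_longitude': -73.90, 'pickup_latitude': 40.72, 'passenger_count': 208, 'fare_amount': 5.0},  # bad passenger count
    {'pickup_longitude': -73.95, 'pickup_latitude': 40.78, 'passenger_count': 2, 'fare_amount': -3.0},   # bad fare
]

coord_cols = ['pickup_longitude', 'pickup_latitude']

# Allowed range per coordinate column, taken from the test rows
bounds = {c: (min(r[c] for r in test_rows), max(r[c] for r in test_rows)) for c in coord_cols}

def is_valid(row):
    """Keep a row only if every checked value lies inside its allowed range."""
    in_bounds = all(bounds[c][0] <= row[c] <= bounds[c][1] for c in coord_cols)
    return in_bounds and 1 <= row['passenger_count'] <= 6 and 1 <= row['fare_amount'] <= 500

clean_rows = [r for r in train_rows if is_valid(r)]
print(len(clean_rows))
```

Only the first training row survives; each of the other three violates exactly one rule.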



In [36]:
# prompt: write a function that drops all rows of train and val df that are not within the max-min range of test_df for the pickup and drop off lat and long columns, and also rows where passenger count is not in the range (1-10) or fare_amount is not in the range (1-500)
# Note: the code below limits passenger_count to 1-6, not the 1-10 mentioned in the prompt

def drop_outliers(train_df, val_df, test_df):
    # Define the columns to check for outliers
    columns_to_check = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

    # Calculate the min and max values for each column in the test set
    test_min_max = test_df[columns_to_check].agg(['min', 'max'])

    # Filter the train and validation sets based on the min and max values from the test set
    for col in columns_to_check:
        train_df = train_df[(train_df[col] >= test_min_max[col]['min']) & (train_df[col] <= test_min_max[col]['max'])]
        val_df = val_df[(val_df[col] >= test_min_max[col]['min']) & (val_df[col] <= test_min_max[col]['max'])]

    # Filter the train and validation sets for passenger_count and fare_amount
    train_df = train_df[(train_df['passenger_count'] >= 1) & (train_df['passenger_count'] <= 6)]
    train_df = train_df[(train_df['fare_amount'] >= 1) & (train_df['fare_amount'] <= 500)]

    val_df = val_df[(val_df['passenger_count'] >= 1) & (val_df['passenger_count'] <= 6)]
    val_df = val_df[(val_df['fare_amount'] >= 1) & (val_df['fare_amount'] <= 500)]

    return train_df, val_df


In [37]:
train_df, val_df = drop_outliers(train_df, val_df, test_df)

In [None]:
train_df.describe()

In [38]:
# prompt: save the parquet format for train and val

# Save the train and validation DataFrames to parquet files
train_df.to_parquet('train.parquet')
val_df.to_parquet('val.parquet')


**Evaluate Function**


*   Evaluates the model and calculates the RMSE on the training and validation sets



In [54]:
def evaluate(model):
  train_preds = model.predict(train_inputs)
  val_preds = model.predict(val_inputs)
  train_rmse = rmse(train_targets, train_preds)
  val_rmse = rmse(val_targets, val_preds)
  print(f'Train RMSE: {train_rmse:.2f}, Val RMSE: {val_rmse:.2f}')
  return train_preds, val_preds, train_rmse, val_rmse

**One Hot Encoder**


*   Applied to the time_of_day and pickup_datetime_dayofweek columns
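
One-hot encoding turns each category into its own 0/1 column. The sketch below is a plain-Python illustration, not the scikit-learn implementation; `one_hot` is a hypothetical helper, and mapping an unseen category to all zeros is meant to mirror what `handle_unknown='ignore'` does.

```python
def one_hot(values, categories):
    """Return one row of 0/1 indicators per value; unknown values become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Alphabetical order, matching the order of the encoded time_of_day column names
categories = sorted(['Night', 'Morning', 'Afternoon', 'Evening'])

# 'Dusk' is a made-up category that the encoder has never seen
encoded = one_hot(['Morning', 'Night', 'Dusk'], categories)

for row in encoded:
    print(row)
```

Each known value produces exactly one 1 in its row, so the model never sees an artificial ordering between categories.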



In [41]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)


encoder.fit(train_df[['time_of_day', 'pickup_datetime_dayofweek']])
encoded_cols = encoder.get_feature_names_out(['time_of_day', 'pickup_datetime_dayofweek'])
train_df[encoded_cols] = encoder.transform(train_df[['time_of_day', 'pickup_datetime_dayofweek']])
val_df[encoded_cols] = encoder.transform(val_df[['time_of_day', 'pickup_datetime_dayofweek']])
test_df[encoded_cols] = encoder.transform(test_df[['time_of_day', 'pickup_datetime_dayofweek']])

encoded_cols

array(['time_of_day_Afternoon', 'time_of_day_Evening',
       'time_of_day_Morning', 'time_of_day_Night',
       'pickup_datetime_dayofweek_0', 'pickup_datetime_dayofweek_1',
       'pickup_datetime_dayofweek_2', 'pickup_datetime_dayofweek_3',
       'pickup_datetime_dayofweek_4', 'pickup_datetime_dayofweek_5',
       'pickup_datetime_dayofweek_6'], dtype=object)

In [40]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 430925 entries, 353352 to 121958
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype              
---  ------                             --------------   -----              
 0   fare_amount                        430925 non-null  float32            
 1   pickup_datetime                    430925 non-null  datetime64[ns, UTC]
 2   pickup_longitude                   430925 non-null  float32            
 3   pickup_latitude                    430925 non-null  float32            
 4   dropoff_longitude                  430925 non-null  float32            
 5   dropoff_latitude                   430925 non-null  float32            
 6   passenger_count                    430925 non-null  uint8              
 7   pickup_datetime_year               430925 non-null  int32              
 8   pickup_datetime_month              430925 non-null  int32              
 9   pickup_datetime_day                43

**Columns**


*   Separate the input and target columns
*   Create train_inputs, train_targets, val_inputs, val_targets



In [42]:
train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_hour', 'pickup_datetime_dayofweek',
       'pickup_datetime_dayofyear', 'pickup_datetime_weekend', 'time_of_day',
       'distance', 'distance_to_JFK Airport', 'distance_to_LaGuardia Airport',
       'distance_to_Times Square', 'distance_to_Central Park',
       'distance_to_Empire State Building', 'time_of_day_Afternoon',
       'time_of_day_Evening', 'time_of_day_Morning', 'time_of_day_Night',
       'pickup_datetime_dayofweek_0', 'pickup_datetime_dayofweek_1',
       'pickup_datetime_dayofweek_2', 'pickup_datetime_dayofweek_3',
       'pickup_datetime_dayofweek_4', 'pickup_datetime_dayofweek_5',
       'pickup_datetime_dayofweek_6'],
      dtype='object')

In [49]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_hour', 'pickup_datetime_dayofweek',
       'pickup_datetime_dayofyear', 'pickup_datetime_weekend',
       'distance', 'distance_to_JFK Airport', 'distance_to_LaGuardia Airport',
       'distance_to_Times Square', 'distance_to_Central Park',
       'distance_to_Empire State Building', 'time_of_day_Afternoon',
       'time_of_day_Evening', 'time_of_day_Morning', 'time_of_day_Night',
       'pickup_datetime_dayofweek_0', 'pickup_datetime_dayofweek_1',
       'pickup_datetime_dayofweek_2', 'pickup_datetime_dayofweek_3',
       'pickup_datetime_dayofweek_4', 'pickup_datetime_dayofweek_5',
       'pickup_datetime_dayofweek_6']
target_cols = 'fare_amount'

In [50]:
train_inputs.columns

Index(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'passenger_count', 'pickup_datetime_year',
       'pickup_datetime_month', 'pickup_datetime_day', 'pickup_datetime_hour',
       'pickup_datetime_dayofweek', 'pickup_datetime_dayofyear',
       'pickup_datetime_weekend', 'time_of_day', 'distance',
       'distance_to_JFK Airport', 'distance_to_LaGuardia Airport',
       'distance_to_Times Square', 'distance_to_Central Park',
       'distance_to_Empire State Building', 'time_of_day_Afternoon',
       'time_of_day_Evening', 'time_of_day_Morning', 'time_of_day_Night',
       'pickup_datetime_dayofweek_0', 'pickup_datetime_dayofweek_1',
       'pickup_datetime_dayofweek_2', 'pickup_datetime_dayofweek_3',
       'pickup_datetime_dayofweek_4', 'pickup_datetime_dayofweek_5',
       'pickup_datetime_dayofweek_6'],
      dtype='object')

In [51]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]
val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

In [56]:
test_inputs = test_df[input_cols]

In [52]:
train_inputs.isna().sum()

Unnamed: 0,0
pickup_longitude,0
pickup_latitude,0
dropoff_longitude,0
dropoff_latitude,0
passenger_count,0
pickup_datetime_year,0
pickup_datetime_month,0
pickup_datetime_day,0
pickup_datetime_hour,0
pickup_datetime_dayofweek,0


**Gradient Boosting**

In [46]:
train_inputs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 430925 entries, 353352 to 121958
Data columns (total 30 columns):
 #   Column                             Non-Null Count   Dtype   
---  ------                             --------------   -----   
 0   pickup_longitude                   430925 non-null  float32 
 1   pickup_latitude                    430925 non-null  float32 
 2   dropoff_longitude                  430925 non-null  float32 
 3   dropoff_latitude                   430925 non-null  float32 
 4   passenger_count                    430925 non-null  uint8   
 5   pickup_datetime_year               430925 non-null  int32   
 6   pickup_datetime_month              430925 non-null  int32   
 7   pickup_datetime_day                430925 non-null  int32   
 8   pickup_datetime_hour               430925 non-null  int32   
 9   pickup_datetime_dayofweek          430925 non-null  int32   
 10  pickup_datetime_dayofyear          430925 non-null  int32   
 11  pickup_datetime_weekend   

In [63]:
from xgboost import XGBRegressor

# Initialize and train the XGBRegressor (gradient-boosted trees)
xgb_regressor = XGBRegressor(n_estimators=400, learning_rate=0.1, max_depth=5, random_state=42, n_jobs=-1, objective='reg:squarederror') # You can adjust hyperparameters
xgb_regressor.fit(train_inputs, train_targets)

# Evaluate the model
train_preds_xgb, val_preds_xgb, train_rmse_xgb, val_rmse_xgb = evaluate(xgb_regressor)

# Create a submission file using the XGBRegressor
predict_and_submit(xgb_regressor, 'submission_gb.csv', test_inputs)


Train RMSE: 3.28, Val RMSE: 3.91


Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,10.846735
1,2015-01-27 13:08:24.0000003,11.253995
2,2011-10-08 11:53:44.0000002,4.693129
3,2012-12-01 21:12:12.0000002,8.882232
4,2012-12-01 21:12:12.0000003,16.003944
...,...,...
9909,2015-05-10 12:37:51.0000002,9.019261
9910,2015-01-12 17:05:51.0000001,11.570228
9911,2015-04-19 20:44:15.0000001,53.473495
9912,2015-01-31 01:05:19.0000005,19.358086


**Ensembling XGB, RF, Lasso**
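
A weighted ensemble is a per-row weighted average of each model's predictions. The sketch below shows the arithmetic on made-up numbers; `weighted_ensemble` and the prediction values are hypothetical. It compares the weight sum with `math.isclose` because `0.6 + 0.3 + 0.1` evaluates to `0.9999999999999999` in double precision, so an exact comparison with 1 would report a mismatch.

```python
import math

def weighted_ensemble(pred_lists, weights):
    """Per-row weighted average of several models' predictions."""
    total = sum(weights)
    if not math.isclose(total, 1.0):
        weights = [w / total for w in weights]   # normalize so the weights sum to 1
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*pred_lists)]

# Hypothetical predictions from three models for two rides
xgb_preds = [10.0, 20.0]
rf_preds = [12.0, 18.0]
lasso_preds = [8.0, 25.0]

blended = weighted_ensemble([xgb_preds, rf_preds, lasso_preds], [0.6, 0.3, 0.1])
print(blended)
```

The first blended value is 0.6 * 10 + 0.3 * 12 + 0.1 * 8, which is about 10.4. Passing unnormalized weights such as `[6, 3, 1]` gives the same result after normalization.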

In [64]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Initialize and train the RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_regressor.fit(train_inputs, train_targets)

# Evaluate the RandomForestRegressor
train_preds_rf, val_preds_rf, train_rmse_rf, val_rmse_rf = evaluate(rf_regressor)

# Initialize and train the Lasso Regression model
lasso_regressor = Lasso(alpha=0.1, random_state=42) # You can adjust the alpha value
lasso_regressor.fit(train_inputs, train_targets)

# Evaluate the Lasso Regression model
train_preds_lasso, val_preds_lasso, train_rmse_lasso, val_rmse_lasso = evaluate(lasso_regressor)


# Ensemble predictions with custom weights
# Example weights (you can adjust these)
weight_xgb = 0.6
weight_rf = 0.3
weight_lasso = 0.1

# Check that the weights sum to 1 (math.isclose tolerates floating-point rounding)
import math
total_weight = weight_xgb + weight_rf + weight_lasso
if not math.isclose(total_weight, 1.0):
    print("Warning: Weights do not sum to 1. Normalizing")
    weight_xgb = weight_xgb / total_weight
    weight_rf = weight_rf / total_weight
    weight_lasso = weight_lasso / total_weight


ensemble_val_preds = (weight_xgb * val_preds_xgb) + (weight_rf * val_preds_rf) + (weight_lasso * val_preds_lasso)
ensemble_rmse = rmse(val_targets, ensemble_val_preds)

print(f'Ensemble Validation RMSE: {ensemble_rmse:.2f}')

# Predict on test data using the ensemble
ensemble_test_preds = (weight_xgb * xgb_regressor.predict(test_inputs)) + (weight_rf * rf_regressor.predict(test_inputs)) + (weight_lasso * lasso_regressor.predict(test_inputs))

# Create a submission file for the ensemble
submission_df = pd.DataFrame({
    'key': test_df.key,
    'fare_amount': ensemble_test_preds
})
submission_df.to_csv('submission_ensemble.csv', index=False)


Train RMSE: 1.43, Val RMSE: 3.96
Train RMSE: 5.19, Val RMSE: 5.28
Ensemble Validation RMSE: 3.88


In [59]:
# prompt: apply gridsearch for hypertuning params of the xgb_regressor

from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300, ],  # Example values, adjust as needed
    'learning_rate': [0.01, 0.1, 0.2], # Example values, adjust as needed
    'max_depth': [3, 5, 7, ], # Example values, adjust as needed
}

# Initialize the XGBRegressor
xgb_regressor = XGBRegressor(random_state=42, n_jobs=-1, objective='reg:squarederror')

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_regressor,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # Use RMSE for scoring
    cv=3,  # Number of cross-validation folds
    n_jobs=-1, # Use all available cores
    verbose=2  # Print progress updates
)


# Fit the GridSearchCV object to the data
grid_search.fit(train_inputs, train_targets)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model
train_preds_best, val_preds_best, train_rmse_best, val_rmse_best = evaluate(best_model)
print(f'Train RMSE: {train_rmse_best:.2f}, Val RMSE: {val_rmse_best:.2f}')
# Create a submission file using the best model
predict_and_submit(best_model, 'submission_gridsearch.csv', test_inputs)


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300}
Train RMSE: 2.82, Val RMSE: 3.91
Train RMSE: 2.82, Val RMSE: 3.91


Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,10.532259
1,2015-01-27 13:08:24.0000003,11.049778
2,2011-10-08 11:53:44.0000002,4.494516
3,2012-12-01 21:12:12.0000002,8.987752
4,2012-12-01 21:12:12.0000003,16.923683
...,...,...
9909,2015-05-10 12:37:51.0000002,8.950566
9910,2015-01-12 17:05:51.0000001,11.566434
9911,2015-04-19 20:44:15.0000001,53.765228
9912,2015-01-31 01:05:19.0000005,18.889082


In [None]:
# prompt: generate a test_params function

def test_params(model_class, **params):
  model = model_class(**params).fit(train_inputs, train_targets)
  train_preds, val_preds, train_rmse, val_rmse = evaluate(model)
  return train_preds, val_preds, train_rmse, val_rmse


In [None]:
# prompt: generate a test_params_and_plot function

import matplotlib.pyplot as plt  # required for the plotting calls below

def test_params_and_plot(model_class, param_name, param_values, **other_params):
    train_errors, val_errors = [], []
    for value in param_values:
        params = dict(other_params)
        params[param_name] = value
        train_preds, val_preds, train_rmse, val_rmse = test_params(model_class, **params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)

    plt.figure(figsize=(10, 6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_errors, 'b-o')
    plt.plot(param_values, val_errors, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend(['Training', 'Validation'])  # labels must be passed as a list
    plt.show()

In [None]:
best_params={
    'random_state': 42,
    'n_jobs':-1,
    'objective': 'reg:squarederror',
    'n_estimators': 500,
    'max_depth': 5,
    'learning_rate': 0.08,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

In [None]:
test_params_and_plot(XGBRegressor, 'min_child_weight', [3], **best_params)