# New York City Taxi Fare Prediction
![](https://i.imgur.com/ecwUY8F.png)

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given the pickup date and time, pickup location, drop-off location, and number of passengers.


Here's an outline of the project:

1. Download the dataset
2. Explore & analyze the dataset
3. Prepare the dataset for ML training
4. Train hardcoded & baseline models
5. Make predictions & submit to Kaggle
6. Perform feature engineering
7. Train & evaluate different models
8. Tune hyperparameters for the best models
9. Train on a GPU with the entire dataset
10. Document & publish the project online


## 1. Download the Dataset

Steps:

- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas


### Install Required Libraries

In [74]:
!pip install opendatasets pandas numpy scikit-learn xgboost --quiet

In [75]:
import opendatasets as od

In [76]:
dataset_url='https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'

- **Enter your Kaggle credentials when prompted.**

In [77]:
od.download(dataset_url)

Skipping, found downloaded files in "./new-york-city-taxi-fare-prediction" (use force=True to force download)


In [78]:
data_dir='new-york-city-taxi-fare-prediction'

### View Dataset Files

Let's look at the size, no. of lines and first few lines of each file.

In [79]:
!ls -lh {data_dir} # list the dataset files and their sizes

total 5.4G
-rw-r--r-- 1 root root  486 Mar  8 09:24 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 root root 336K Mar  8 09:24 sample_submission.csv
-rw-r--r-- 1 root root 960K Mar  8 09:24 test.csv
-rw-r--r-- 1 root root 5.4G Mar  8 09:25 train.csv


In [80]:
!wc -l {data_dir}/train.csv # count the lines in train.csv

55423856 new-york-city-taxi-fare-prediction/train.csv


In [81]:
!wc -l {data_dir}/test.csv # count the lines in test.csv

9914 new-york-city-taxi-fare-prediction/test.csv


In [82]:
!wc -l {data_dir}/sample_submission.csv # count the lines in sample_submission.csv

9915 new-york-city-taxi-fare-prediction/sample_submission.csv


In [83]:
!head {data_dir}/train.csv # preview the first few rows and the column names

key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.7267

In [84]:
!head {data_dir}/sample_submission.csv

key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35


Observations:

- This is a supervised learning regression problem
- Training data is 5.4 GB in size
- Training data has ~55 million rows
- Test set is much smaller (< 10,000 rows)
- The training set has 8 columns:
    - `key` (a unique identifier)
    - `fare_amount` (target column)
    - `pickup_datetime`
    - `pickup_longitude`
    - `pickup_latitude`
    - `dropoff_longitude`
    - `dropoff_latitude`
    - `passenger_count`
- The test set has all columns except the target column `fare_amount`.
- The submission file should contain the `key` and `fare_amount` for each test sample.

### Load Training Set



In [85]:
import pandas as pd

In [86]:
cols=['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude','passenger_count']

In [87]:
dtypes={
    'fare_amount':'float32', # float32 keeps about 7 significant digits, enough for fares and coordinates
    'pickup_longitude':'float32',
    'pickup_latitude':'float32',
    'dropoff_longitude':'float32',
    'dropoff_latitude':'float32',
    'passenger_count':'uint8',
}
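
To see why the smaller dtypes matter, here is a minimal sketch on synthetic data (the frame below is made up for illustration and is not taken from the dataset). It compares the memory footprint of the default 64-bit types with `float32` and `uint8`, and checks how much precision a longitude loses when stored as `float32`.

```python
import numpy as np
import pandas as pd

n_rows = 100_000  # arbitrary size chosen for the illustration

# Default dtypes: float64 for fares, int64 for passenger counts
df64 = pd.DataFrame({
    'fare_amount': np.linspace(2.5, 100.0, n_rows),
    'passenger_count': np.ones(n_rows, dtype='int64'),
})

# Same data with the compact dtypes used in this notebook
df32 = df64.astype({'fare_amount': 'float32', 'passenger_count': 'uint8'})

bytes64 = int(df64.memory_usage(index=False).sum())
bytes32 = int(df32.memory_usage(index=False).sum())
print(bytes64, bytes32)  # 16 bytes per row vs 5 bytes per row

# Rounding error (in degrees) when a longitude is stored as float32
lon = -73.979813
lon_error = abs(float(np.float32(lon)) - lon)
print(lon_error)
```

An error below 1e-5 degrees is about a metre on the ground, which should be negligible for fare prediction.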

We'll pass a `skip_row` function to `read_csv` so that it loads a random 1% sample of the rows instead of the full file.

In [88]:
import random
sample_fraction=0.01

def skip_row(row_idx) :
  if row_idx == 0:
    return False # always keep the header row
  return random.random() > sample_fraction # skip each data row with probability 1 - sample_fraction


In [89]:
random.seed(42)
train_df=pd.read_csv(data_dir+'/train.csv',
               parse_dates=['pickup_datetime'],
               usecols=cols,dtype=dtypes,skiprows=skip_row)

In [90]:
train_df.head(2)  

Unnamed: 0,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,4.0,2014-12-06 20:36:22+00:00,-73.979813,40.751904,-73.979446,40.755482,1
1,8.0,2013-01-17 17:22:00+00:00,0.0,0.0,0.0,0.0,2


> _**TIP #3**: Fix the seeds for random number generators so that you get the same results every time you run your notebook._
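
As a small illustration of the tip above, the sketch below samples rows from an in-memory CSV with a `skiprows` callable and confirms that reseeding gives the same sample. The CSV contents and the helper names `make_skip_row` and `load_sample` are hypothetical and exist only for this example.

```python
import io
import random

import pandas as pd

def make_skip_row(sample_fraction):
    """Return a skiprows callable that keeps roughly `sample_fraction` of the data rows."""
    def skip_row(row_idx):
        if row_idx == 0:
            return False  # always keep the header row
        return random.random() > sample_fraction
    return skip_row

# Synthetic CSV with a header and 1000 data rows
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

def load_sample(seed, sample_fraction=0.1):
    random.seed(seed)  # reseed right before reading so the sample is repeatable
    return pd.read_csv(io.StringIO(csv_text), skiprows=make_skip_row(sample_fraction))

sample_a = load_sample(42)
sample_b = load_sample(42)
print(len(sample_a), sample_a.equals(sample_b))
```

Note that the seed has to be set immediately before the read: any other call to `random` in between would shift the sequence and change which rows are kept.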


### Load Test Set

For the test set, we'll simply provide the data types.

In [91]:
test_df=pd.read_csv(data_dir+'/test.csv',parse_dates=['pickup_datetime'],dtype=dtypes)

In [92]:
test_df.head(2)

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24+00:00,-73.97332,40.763805,-73.98143,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24+00:00,-73.986862,40.719383,-73.998886,40.739201,1


## 2. Explore the Dataset

- Basic info about training set
- Basic info about test set
- Exploratory data analysis & visualization
- Ask & answer questions

### Training Set

In [26]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552450 entries, 0 to 552449
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        552450 non-null  float32            
 1   pickup_datetime    552450 non-null  datetime64[ns, UTC]
 2   pickup_longitude   552450 non-null  float32            
 3   pickup_latitude    552450 non-null  float32            
 4   dropoff_longitude  552450 non-null  float32            
 5   dropoff_latitude   552450 non-null  float32            
 6   passenger_count    552450 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.3 MB




* **Note that half of the fares are \$8.50 or less.**
* **The extreme longitude and latitude values (in the thousands) cannot be real coordinates.**
* **To handle this, we'll look at the distribution of the test set and clean the training data accordingly.**








In [27]:
train_df.describe()

Unnamed: 0,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,552450.0,552450.0,552450.0,552450.0,552450.0,552450.0
mean,11.354059,-72.497063,39.9105,-72.504326,39.934265,1.684983
std,9.811924,11.618246,8.061114,12.074346,9.255057,1.337664
min,-52.0,-1183.362793,-3084.490234,-3356.729736,-2073.150635,0.0
25%,6.0,-73.99202,40.734875,-73.991425,40.73399,1.0
50%,8.5,-73.981819,40.752621,-73.980179,40.753101,1.0
75%,12.5,-73.967155,40.767036,-73.963737,40.768059,2.0
max,499.0,2420.209473,404.983337,2467.752686,3351.403076,208.0


Observations about training data:

- 550k+ rows, as expected
- No missing data (in the sample)
- `fare_amount` ranges from \$-52.0 to \$499.0 
- `passenger_count` ranges from 0 to 208 
- There seem to be some errors in the latitude & longitude values
- Dates range from 1st Jan 2009 to 30th June 2015
- The dataset takes up ~15 MB of RAM

We may need to deal with outliers and data entry errors before we train our model.


### Test Set

In [28]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9914 entries, 0 to 9913
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   key                9914 non-null   object             
 1   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 2   pickup_longitude   9914 non-null   float32            
 3   pickup_latitude    9914 non-null   float32            
 4   dropoff_longitude  9914 non-null   float32            
 5   dropoff_latitude   9914 non-null   float32            
 6   passenger_count    9914 non-null   uint8              
dtypes: datetime64[ns, UTC](1), float32(4), object(1), uint8(1)
memory usage: 319.6+ KB


In [29]:
test_df.describe()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,9914.0,9914.0,9914.0,9914.0,9914.0
mean,-73.974716,40.751041,-73.973656,40.75174,1.671273
std,0.042774,0.033541,0.039072,0.035435,1.278747
min,-74.25219,40.573143,-74.263245,40.568974,1.0
25%,-73.9925,40.736125,-73.991249,40.735253,1.0
50%,-73.982327,40.753052,-73.980015,40.754065,1.0
75%,-73.968012,40.767113,-73.964062,40.768757,2.0
max,-72.986534,41.709557,-72.990967,41.696682,6.0


Some observations about the test set:

- 9914 rows of data
- No missing values
- No obvious data entry errors
- 1 to 6 passengers (we can limit training data to this range)
- Latitudes lie between 40 and 42
- Longitudes lie between -75 and -72
- Pickup dates range from Jan 1st 2009 to Jun 30th 2015 (same as training set)

We can use the ranges of the test set to drop outliers/invalid data from the training set.
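
One way to apply this idea is sketched below on toy data: take the minimum and maximum of each column in the test set and keep only the training rows that fall inside those ranges. The two small frames are made up for illustration, and the exact bounds used later in the notebook may be rounded differently.

```python
import pandas as pd

# Toy stand-ins for the test and training sets (values are illustrative)
toy_test = pd.DataFrame({
    'pickup_latitude': [40.57, 40.75, 41.71],
    'passenger_count': [1, 2, 6],
})
toy_train = pd.DataFrame({
    'pickup_latitude': [40.72, -3084.49, 40.76, 404.98],
    'passenger_count': [1, 2, 208, 0],
})

# Start with every row kept, then require each column to lie within the test range
mask = pd.Series(True, index=toy_train.index)
for col in toy_test.columns:
    lower, upper = toy_test[col].min(), toy_test[col].max()
    mask &= toy_train[col].between(lower, upper)

filtered = toy_train[mask]
print(filtered)
```

Only the first toy row survives: the others have either an impossible latitude or a passenger count outside 1 to 6.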

**We'll check the date range of the training and test sets.**

**Training data ranges from 2009 to 2015**

In [30]:
print(train_df['pickup_datetime'].min())
print(train_df['pickup_datetime'].max())

2009-01-01 00:11:46+00:00
2015-06-30 23:59:54+00:00


**Test data also ranges from 2009 to 2015**

In [31]:
print(test_df['pickup_datetime'].min())
print(test_df['pickup_datetime'].max())

2009-01-01 11:04:24+00:00
2015-06-30 20:03:50+00:00


### Exploratory Data Analysis and Visualization

**Exercise**: Create graphs (histograms, line charts, bar charts, scatter plots, box plots, geo maps etc.) to study the distribution of values in each column, and the relationship of each input column to the target.


### Ask & Answer Questions

**Exercise**: Ask & answer questions about the dataset: 

1. What is the busiest day of the week?
2. What is the busiest time of the day?
3. In which month are fares the highest?
4. Which pickup locations have the highest fares?
5. Which drop locations have the highest fares?
6. What is the average ride distance?
7. ???

Performing EDA on your dataset and asking questions will help you develop a deeper understanding of the data and give you ideas for feature engineering.


## 3. Prepare Dataset for Training

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data. 

Since the test set and training set have the same date ranges, we can pick a random 20% fraction.

> _**TIP #5**: Your validation set should be as similar to the test set or real-world data as possible i.e. the evaluation metric score of a model on validation & test sets should be very close, otherwise you're shooting in the dark._


In [93]:
from sklearn.model_selection import train_test_split

In [94]:
train_df,val_df=train_test_split(train_df,test_size=0.2,random_state=42)
# random_state fixes the shuffle, so the split is the same on every run

In [95]:
len(train_df),len(val_df)

(441960, 110490)
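
The effect of `random_state` can be checked on a toy frame. This is a minimal sketch with made-up data: two calls with the same `random_state` should produce identical splits, and the training and validation parts should not share any rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows (illustrative only)
toy_df = pd.DataFrame({'fare_amount': np.arange(100, dtype='float32')})

train_a, val_a = train_test_split(toy_df, test_size=0.2, random_state=42)
train_b, val_b = train_test_split(toy_df, test_size=0.2, random_state=42)

print(len(train_a), len(val_a))  # 80 20

same_split = val_a.index.equals(val_b.index)             # identical rows in both runs
no_overlap = set(train_a.index).isdisjoint(val_a.index)  # no row appears in both parts
print(same_split, no_overlap)
```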

### Fill/Remove Missing Values

There are no missing values in our sample, but if there were, we could simply drop the affected rows instead of trying to fill them in (since we have a lot of training data).

In [96]:
train_df=train_df.dropna()
val_df=val_df.dropna()

### Extract Inputs and Outputs

In [97]:
input_cols=['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count']
target_cols='fare_amount'

#### Training

In [98]:
train_inputs=train_df[input_cols]
train_targets=train_df[target_cols]

#### Validation

In [99]:
validation_inputs=val_df[input_cols]
validation_targets=val_df[target_cols]

#### Test

In [100]:
test_inputs=test_df[input_cols]

## 4. Train Baseline Model

> _**TIP #6**: Always create a simple baseline model to establish the minimum score any proper ML model should beat._

- Baseline model: Linear regression 

The competition is evaluated using root mean squared error (RMSE):
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/evaluation

### Train & Evaluate Model

Let's train a linear regression model on the raw input columns as our baseline.

In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [41]:
linear_model=LinearRegression()

In [42]:
linear_model.fit(train_inputs,train_targets)

In [43]:
val_preds=linear_model.predict(validation_inputs)
mean_squared_error(validation_targets,val_preds)**0.5 # RMSE on the validation set

The linear regression model is off by \$9.898 (RMSE), which isn't much better than simply predicting the average.

This is mainly because the training data (geocoordinates) is not in a format that's useful for the model, and we're not using one of the most important columns: pickup date & time.

However, now we have a baseline that our other models should ideally beat.
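
For reference, the "predict the average" comparison mentioned above can be sketched with NumPy alone. The fares below are a handful of values copied from the first rows of `train.csv` shown earlier and are used purely as toy data, so the resulting number is not the baseline score for the real dataset. The `rmse` helper is a hypothetical name introduced for this example.

```python
import numpy as np

def rmse(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.sqrt(np.mean((targets - preds) ** 2)))

# Toy fares (illustrative split into "training" and "validation" values)
train_fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])
val_fares = np.array([12.1, 7.5, 16.5])

# The hardcoded model ignores the inputs and always predicts the training mean
mean_fare = train_fares.mean()
mean_preds = np.full(len(val_fares), mean_fare)

baseline_rmse = rmse(val_fares, mean_preds)
print(mean_fare, baseline_rmse)
```

Any trained model that cannot beat this kind of constant prediction on the validation set is not learning anything useful from its inputs.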

## 5. Make Predictions and Submit to Kaggle

> _**TIP #7**: When working on a Kaggle competition, submit early and submit often (ideally daily). The best way to improve your models is to try & beat your previous score._

- Make predictions for test set
- Generate submissions CSV
- Submit to Kaggle
- Record in experiment tracking sheet

In [44]:
test_preds=linear_model.predict(test_inputs)

In [137]:
def Pred_and_submit(model,fname) :
  test_preds=model.predict(test_inputs) # predictions from the model passed in, not the global variable
  data_dir='new-york-city-taxi-fare-prediction'
  sub_df=pd.read_csv(data_dir+'/sample_submission.csv')
  sub_df['fare_amount']=test_preds
  sub_df.to_csv(fname,index=False)
  return sub_df

In [46]:
Pred_and_submit(linear_model,'submission_file.csv')

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,11.284280
1,2015-01-27 13:08:24.0000003,11.284634
2,2011-10-08 11:53:44.0000002,11.284384
3,2012-12-01 21:12:12.0000002,11.284222
4,2012-12-01 21:12:12.0000003,11.284050
...,...,...
9909,2015-05-10 12:37:51.0000002,11.720277
9910,2015-01-12 17:05:51.0000001,11.720225
9911,2015-04-19 20:44:15.0000001,11.721249
9912,2015-01-31 01:05:19.0000005,11.720798
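
Before uploading a file, it can help to run a couple of sanity checks on it. The sketch below is one possible approach, assuming a hypothetical helper named `build_submission`; the keys are taken from the sample submission shown earlier and the predictions are made-up numbers.

```python
import numpy as np
import pandas as pd

def build_submission(keys, preds):
    """Hypothetical helper: pair each test key with its prediction after basic checks."""
    preds = np.asarray(preds, dtype=float)
    if len(keys) != len(preds):
        raise ValueError('number of predictions does not match number of keys')
    if not np.isfinite(preds).all():
        raise ValueError('predictions contain NaN or infinite values')
    return pd.DataFrame({'key': list(keys), 'fare_amount': preds})

toy_keys = ['2015-01-27 13:08:24.0000002', '2015-01-27 13:08:24.0000003']
toy_sub = build_submission(toy_keys, [11.35, 9.80])
print(toy_sub.shape)  # (2, 2)

# A mismatched number of predictions is rejected instead of being written to disk
try:
    build_submission(toy_keys, [11.35])
    length_check_failed = False
except ValueError:
    length_check_failed = True
print(length_check_failed)
```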


## 6. Feature Engineering

> _**TIP #10**: Take an iterative approach to feature engineering. Add some features, train a model, evaluate it, keep the features if they help, otherwise drop them, then repeat._

- Extract parts of date
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks

Exercise: We're going to apply all of the above together, but you should observe the effect of adding each feature individually.

### Extract Parts of Date

- Year
- Month
- Day
- Weekday
- Hour



In [101]:
def add_dateparts(df, col):
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour


In [102]:
add_dateparts(test_df,'pickup_datetime')
add_dateparts(val_df,'pickup_datetime')
add_dateparts(train_df,'pickup_datetime')
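
To make the extracted values concrete, here is a minimal sketch of the same `.dt` accessors on two timestamps taken from the training rows previewed earlier. The toy frame is built only for this example.

```python
import pandas as pd

# Two pickup times from the preview of train.csv, parsed as UTC
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(
        ['2009-06-15 17:26:21', '2012-04-21 04:30:42'], utc=True),
})

parts = pd.DataFrame({
    'year': toy['pickup_datetime'].dt.year,
    'month': toy['pickup_datetime'].dt.month,
    'day': toy['pickup_datetime'].dt.day,
    'weekday': toy['pickup_datetime'].dt.weekday,  # Monday is 0, Sunday is 6
    'hour': toy['pickup_datetime'].dt.hour,
})
print(parts)
```

The first ride falls on a Monday (weekday 0) and the second on a Saturday (weekday 5). All parts are in UTC, so the hour is not New York local time.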

- **To measure the distance between two points, we'll use the haversine distance.**

The **haversine distance** measures the **distance between two points on a sphere (such as the Earth) from their latitude and longitude coordinates**. It accounts for the curvature of the sphere and is commonly used in geospatial applications such as mapping and navigation. With latitudes and longitudes in radians and $r$ the radius of the sphere, the formula is:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\mathrm{lat}_2 - \mathrm{lat}_1}{2}\right) + \cos(\mathrm{lat}_1)\cos(\mathrm{lat}_2)\sin^2\left(\frac{\mathrm{lon}_2 - \mathrm{lon}_1}{2}\right)}\right)$$

In [112]:
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r
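
A few quick checks can confirm that a haversine implementation behaves sensibly. The sketch below re-implements the formula under the hypothetical name `haversine_km` so that it runs on its own, then tests it on cases with known answers. The JFK and LGA coordinates are the landmark values used later in this notebook.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in kilometers; inputs are decimal degrees (scalars or arrays)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# A point is at distance zero from itself
same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)

# One degree of latitude is 2*pi*r/360, about 111.19 km for r = 6371 km
one_degree_lat = haversine_km(0.0, 0.0, 0.0, 1.0)

# JFK to LGA airports, roughly 17 km apart in a straight line
jfk_to_lga = haversine_km(-73.7781, 40.6413, -73.8740, 40.7769)

# The same function works element-wise on arrays (and on pandas columns)
vectorized = haversine_km(np.zeros(2), np.zeros(2), np.zeros(2), np.array([1.0, 2.0]))

print(same_point, one_degree_lat, jfk_to_lga, vectorized)
```

Keep in mind that this is a straight-line distance over the Earth's surface. Actual driving distances are longer, so the feature is only a proxy for trip length.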

### Add Distance Between Pickup and Drop

We can use the haversine distance: 
- https://en.wikipedia.org/wiki/Haversine_formula
- https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas

In [113]:
def add_trip_distance (df) :
  df['trip_distance_km']=haversine(df['pickup_longitude'],
                                df['pickup_latitude'],
                                df['dropoff_longitude'],
                                df['dropoff_latitude'])

In [114]:
add_trip_distance(train_df)
add_trip_distance(val_df)
add_trip_distance(test_df)

### Add Distance From Popular Landmarks

> _**TIP #11**: Creative feature engineering (generally involving human insight or external data) is a lot more effective than excessive hyperparameter tuning. Just one or two good features can improve the model's performance drastically._

- JFK Airport
- LGA Airport
- EWR Airport
- Times Square
- Met Museum
- World Trade Center

We'll add the distance from drop location. 

In [117]:
# Popular landmarks as (longitude, latitude) pairs
jfk_longlat=-73.7781,40.6413
lga_lonlat=-73.8740,40.7769
ewr_lonlat=-74.1745,40.6895
met_lonlat=-73.9632,40.7794
wtc_lonlat=-74.0099,40.7126

In [118]:
def add_land_mark_dropoff_distance(df,landmark_name,landmark_lonlat) :
  lon,lat=landmark_lonlat
  df[landmark_name+'_drop_distance_km']=haversine(lon,lat,df['dropoff_longitude'],df['dropoff_latitude'])

In [56]:
def add_land_marks(df) :
  landmarks= [('jfk',jfk_longlat),('lga',lga_lonlat),('ewr',ewr_lonlat),('wtc',wtc_lonlat)]
  for name, lonlat in landmarks :
    add_land_mark_dropoff_distance(df,name,lonlat)

In [119]:
add_land_marks(train_df)
add_land_marks(val_df)
add_land_marks(test_df)

### Remove Outliers and Invalid Data

In [121]:
def remove_outliers(df) :
  return df[(df['fare_amount']>=1.) &
            (df['fare_amount']<=500.) &
            (df['pickup_longitude']>=-75) &
            (df['pickup_longitude']<=-72) &
            (df['dropoff_longitude']>=-75) &
            (df['dropoff_longitude']<=-72) &
            (df['pickup_latitude']>=40) &
            (df['pickup_latitude']<=42) &
            (df['dropoff_latitude']>=40) &
            (df['dropoff_latitude']<=42) &
            (df['passenger_count']>=1)&
            (df['passenger_count']<=6)
            ]

In [122]:
train_df=remove_outliers(train_df)
val_df=remove_outliers(val_df)

### Save Intermediate DataFrames

> _**TIP #12**: Save preprocessed & prepared data files to save time & experiment faster. You may also want to create different notebooks for EDA, feature engineering and model training._

Let's save the processed datasets in the Apache Parquet format, so that we can load them back easily to resume our work from this point.




- **Parquet files are quick to load and take up less storage space than CSV.**

In [123]:
train_df.to_parquet('train.parquet')
val_df.to_parquet('val.parquet')

## 7. Train & Evaluate Different Models

We'll train each of the following & submit predictions to Kaggle:

- Linear Regression (Ridge)
- Random Forests
- Gradient Boosting

Exercise: Train Ridge, SVM, KNN, Decision Tree models

In [124]:
input_cols = ['pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count',
       'pickup_datetime_year', 'pickup_datetime_month', 'pickup_datetime_day',
       'pickup_datetime_weekday', 'pickup_datetime_hour', 'trip_distance_km',
       'jfk_drop_distance_km', 'lga_drop_distance_km', 'ewr_drop_distance_km'
       , 'wtc_drop_distance_km']

In [125]:
target_col = 'fare_amount'


In [126]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_col]

In [127]:
val_inputs = val_df[input_cols]
val_targets = val_df[target_col]

In [128]:
test_inputs = test_df[input_cols]


In [129]:
def evaluate(model):
    # RMSE is the square root of the MSE; newer scikit-learn releases no longer accept `squared=False`
    train_preds = model.predict(train_inputs)
    train_rmse = mean_squared_error(train_targets, train_preds) ** 0.5
    val_preds = model.predict(val_inputs)
    val_rmse = mean_squared_error(val_targets, val_preds) ** 0.5
    return train_rmse, val_rmse, train_preds, val_preds

- **Using Ridge Regression Model**

In [130]:
from sklearn.linear_model import Ridge

In [131]:
model1 = Ridge(random_state=42)

In [132]:
%%time
model1.fit(train_inputs, train_targets)

CPU times: user 43.6 ms, sys: 13 ms, total: 56.6 ms
Wall time: 53.3 ms


In [133]:
evaluate(model1)

(5.117309083016065,
 5.297373373605723,
 array([ 8.21093129,  3.43957364,  9.2404262 , ..., 10.10100683,
         8.14726819,  9.98992574]),
 array([11.01006698,  6.36000993, 48.01074195, ...,  7.85744665,
        22.38779338,  9.09231632]))

In [138]:
Pred_and_submit(model1, 'ridge_submission.csv')



- **Using Random Forest Model**

In [139]:
from sklearn.ensemble import RandomForestRegressor

In [140]:
model2 = RandomForestRegressor(max_depth=10, n_jobs=-1, random_state=42, n_estimators=50)

In [141]:
model2.fit(train_inputs, train_targets)

In [142]:
evaluate(model2)

(3.604004050636313,
 4.169615793817497,
 array([ 7.12146703,  9.05792898,  9.09745305, ..., 10.38977691,
         7.73276412, 10.36404777]),
 array([12.60790157,  6.17960147, 47.31852767, ...,  8.32851844,
        29.20883778,  8.25065798]))

In [144]:
Pred_and_submit(model2, 'rf_submission.csv')



- **Using Gradient Boosting Model**

In [145]:
from xgboost import XGBRegressor

In [146]:
model3 = XGBRegressor(random_state=42, n_jobs=-1, objective='reg:squarederror')


In [147]:
model3.fit(train_inputs, train_targets)


In [148]:
evaluate(model3)


(3.065759,
 3.9424372,
 array([ 6.497749,  9.234839,  9.840836, ..., 11.329951,  9.33291 ,
        10.200343], dtype=float32),
 array([15.936949,  5.799245, 48.289013, ...,  8.023396, 30.849865,
         8.801597], dtype=float32))

In [149]:
Pred_and_submit(model3, 'xgb_submission.csv')
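
To compare the three models side by side, the RMSE values returned by `evaluate` above can be collected into a small table. This is just one way to organize the results; the numbers are the scores reported in this section, rounded to three decimals.

```python
import pandas as pd

# Train and validation RMSE reported above for each model
results = pd.DataFrame({
    'model': ['Ridge', 'Random Forest', 'XGBoost'],
    'train_rmse': [5.117, 3.604, 3.066],
    'val_rmse': [5.297, 4.170, 3.942],
})

# A large gap between validation and training error hints at overfitting
results['gap'] = results['val_rmse'] - results['train_rmse']

ranked = results.sort_values('val_rmse').reset_index(drop=True)
best_model = ranked.loc[0, 'model']
print(ranked)
print(best_model)
```

XGBoost has the lowest validation RMSE but also the widest gap between training and validation error, which suggests there may be room to improve it with regularization or hyperparameter tuning.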


