### Abstract:
In the last few years, the number of for-hire vehicles operating in NY has grown from 
63,000 to more than 100,000. However, while the number of trips in app-based vehicles 
has increased from 6 million to 17 million a year, taxi trips have fallen from 11 million 
to 8.5 million. In response, the NY Yellow Cab organization decided to become more data-centric. App-based services such as Uber, OLA, Lyft, and Gett show a fare up front, which raises the question of how that fare is computed. After 
all, the quoted price is not a random guess.

### Problem Statement:
Given pickup and dropoff locations, the pickup timestamp, and the passenger count, the 
objective is to predict the fare of the taxi ride using Random Forest.

### Dataset Information:
Column Description:
* unique_id => A unique identifier or key for each record in the dataset.
* date_time_of_pickup => The time when the ride started.
* longitude_of_pickup => Longitude of the taxi ride pickup point.
* latitude_of_pickup => Latitude of the taxi ride pickup point.
* longitude_of_dropoff => Longitude of the taxi ride dropoff point.
* latitude_of_dropoff => Latitude of the taxi ride dropoff point.
* no_of_passenger => Count of the passengers during the ride.
* amount => (Target variable) Dollar amount of the cost of the taxi ride.

### Scope:
* Prepare and analyse data 
* Perform feature engineering wherever applicable
* Check the distribution of key numerical variables
* Train a Random Forest model on the data and check its performance
* Perform hyperparameter tuning

### Learning Outcome:
The students will get a better understanding of how the variables are linked to each 
other and how the EDA approach helps them gain more insight into the data. They will 
also train a Random Forest model on the data and use GridSearchCV to find the best 
hyperparameters for an optimized Random Forest model for prediction.

In [1]:
# Imports

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from haversine import haversine, Unit
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

### Loading Dataset into a DataFrame

In [2]:
taxi_df = pd.read_csv(r'TaxiFare.csv', header=0, index_col=0)
taxi_df.head()

unique_id,amount,date_time_of_pickup,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger
26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1


In [3]:
taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 26:21.0 to 13:14.0
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   amount                50000 non-null  float64
 1   date_time_of_pickup   50000 non-null  object 
 2   longitude_of_pickup   50000 non-null  float64
 3   latitude_of_pickup    50000 non-null  float64
 4   longitude_of_dropoff  50000 non-null  float64
 5   latitude_of_dropoff   50000 non-null  float64
 6   no_of_passenger       50000 non-null  int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 3.1+ MB


In [4]:
# info() shows that the date_time_of_pickup field has dtype object, not datetime.
# Hence, we convert the field to datetime.

taxi_df.date_time_of_pickup = pd.to_datetime(taxi_df.date_time_of_pickup)

# We confirm using info() function.
taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 26:21.0 to 13:14.0
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   amount                50000 non-null  float64            
 1   date_time_of_pickup   50000 non-null  datetime64[ns, UTC]
 2   longitude_of_pickup   50000 non-null  float64            
 3   latitude_of_pickup    50000 non-null  float64            
 4   longitude_of_dropoff  50000 non-null  float64            
 5   latitude_of_dropoff   50000 non-null  float64            
 6   no_of_passenger       50000 non-null  int64              
dtypes: datetime64[ns, UTC](1), float64(5), int64(1)
memory usage: 3.1+ MB


In [5]:
taxi_df.describe()

Unnamed: 0,amount,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,11.364171,-72.509756,39.933759,-72.504616,39.926251,1.66784
std,9.685557,10.39386,6.224857,10.40757,6.014737,1.289195
min,-5.0,-75.423848,-74.006893,-84.654241,-74.006377,0.0
25%,6.0,-73.992062,40.73488,-73.991152,40.734372,1.0
50%,8.5,-73.98184,40.752678,-73.980082,40.753372,1.0
75%,12.5,-73.967148,40.76736,-73.963584,40.768167,2.0
max,200.0,40.783472,401.083332,40.851027,43.41519,6.0


* The min of amount is negative (-5); a fare amount cannot be negative.
* The coordinates are in decimal degrees, so valid values range from -90 to 90 for latitude and -180 to 180 for longitude.
* latitude_of_pickup has a max value of 401.08, which is outside the valid range. This is most likely a data entry error, so we replace 401.083332 with 40.1083332.
* The minimum no_of_passenger is 0.
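
The range check described in the observations above can be sketched as a boolean mask. This is a minimal illustration, not part of the original notebook: the helper name `invalid_coordinate_mask` and the three sample rows are assumptions made for the example, while the column names follow the dataset.

```python
import pandas as pd

# Valid decimal-degree ranges, as stated in the observations above.
LAT_RANGE = (-90.0, 90.0)
LON_RANGE = (-180.0, 180.0)

def invalid_coordinate_mask(df, lat_cols, lon_cols):
    """Return a boolean Series that is True for rows with any out-of-range coordinate.

    Hypothetical helper written for this sketch.
    """
    bad_lat = ~df[lat_cols].apply(lambda s: s.between(*LAT_RANGE)).all(axis=1)
    bad_lon = ~df[lon_cols].apply(lambda s: s.between(*LON_RANGE)).all(axis=1)
    return bad_lat | bad_lon

# Illustrative rows only; the second row carries the out-of-range latitude 401.083332.
sample = pd.DataFrame({
    "latitude_of_pickup":   [40.721319, 401.083332, 40.761270],
    "longitude_of_pickup":  [-73.844311, -73.982738, -73.987130],
    "latitude_of_dropoff":  [40.712278, 40.750562, 40.758092],
    "longitude_of_dropoff": [-73.841610, -73.991242, -73.991567],
})

mask = invalid_coordinate_mask(
    sample,
    lat_cols=["latitude_of_pickup", "latitude_of_dropoff"],
    lon_cols=["longitude_of_pickup", "longitude_of_dropoff"],
)
print(sample[mask])  # rows that need fixing or dropping
```

A mask like this only flags rows; whether to correct or drop them remains a judgment call, as with the 401.08 value here.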

In [6]:
# unique_id is taken as index by pandas. 
# We reset index and drop the "unique_id" column

taxi_df = taxi_df.reset_index()
taxi_df.drop("unique_id", axis=1, inplace=True)
taxi_df.head()

Unnamed: 0,amount,date_time_of_pickup,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger
0,4.5,2009-06-15 17:26:21+00:00,-73.844311,40.721319,-73.84161,40.712278,1
1,16.9,2010-01-05 16:52:16+00:00,-74.016048,40.711303,-73.979268,40.782004,1
2,5.7,2011-08-18 00:35:00+00:00,-73.982738,40.76127,-73.991242,40.750562,2
3,7.7,2012-04-21 04:30:42+00:00,-73.98713,40.733143,-73.991567,40.758092,1
4,5.3,2010-03-09 07:51:00+00:00,-73.968095,40.768008,-73.956655,40.783762,1


In [7]:
# Here we find the index numbers for all records where amount is zero or negative (<=0).

taxi_df[taxi_df.amount<=0].index

Int64Index([2039, 2486, 10002, 13032, 27891, 28839, 36722, 42337, 47302], dtype='int64')

In [8]:
# We find 9 indexes where the amount is zero or negative (<=0).
# We use the indexes from the above code and drop those records.

print("Shape before dropping", taxi_df.shape)
taxi_df.drop(taxi_df[taxi_df.amount<=0].index, axis=0, inplace=True)
print("Shape After dropping", taxi_df.shape)

Shape before dropping (50000, 7)
Shape After dropping (49991, 7)


In [9]:
# Extracting various date time components as separate variables

taxi_df = taxi_df.assign(hour = taxi_df.date_time_of_pickup.dt.hour,
                        day = taxi_df.date_time_of_pickup.dt.day,
                        month = taxi_df.date_time_of_pickup.dt.month,
                        year = taxi_df.date_time_of_pickup.dt.year,
                        dayofweek = taxi_df.date_time_of_pickup.dt.dayofweek,)

taxi_df.head()

Unnamed: 0,amount,date_time_of_pickup,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger,hour,day,month,year,dayofweek
0,4.5,2009-06-15 17:26:21+00:00,-73.844311,40.721319,-73.84161,40.712278,1,17,15,6,2009,0
1,16.9,2010-01-05 16:52:16+00:00,-74.016048,40.711303,-73.979268,40.782004,1,16,5,1,2010,1
2,5.7,2011-08-18 00:35:00+00:00,-73.982738,40.76127,-73.991242,40.750562,2,0,18,8,2011,3
3,7.7,2012-04-21 04:30:42+00:00,-73.98713,40.733143,-73.991567,40.758092,1,4,21,4,2012,5
4,5.3,2010-03-09 07:51:00+00:00,-73.968095,40.768008,-73.956655,40.783762,1,7,9,3,2010,1


In [10]:
# Now that the datetime components are stored as separate columns in our DataFrame,
# we drop the original datetime field.

taxi_df.drop("date_time_of_pickup", axis=1, inplace=True)
taxi_df.head()

Unnamed: 0,amount,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger,hour,day,month,year,dayofweek
0,4.5,-73.844311,40.721319,-73.84161,40.712278,1,17,15,6,2009,0
1,16.9,-74.016048,40.711303,-73.979268,40.782004,1,16,5,1,2010,1
2,5.7,-73.982738,40.76127,-73.991242,40.750562,2,0,18,8,2011,3
3,7.7,-73.98713,40.733143,-73.991567,40.758092,1,4,21,4,2012,5
4,5.3,-73.968095,40.768008,-73.956655,40.783762,1,7,9,3,2010,1


In [11]:
taxi_df.shape

(49991, 11)

In [12]:
taxi_df.latitude_of_pickup.value_counts()

0.000000     955
41.366138     18
40.756007     10
40.763975      8
40.757500      7
            ... 
40.725295      1
40.768578      1
40.740649      1
40.772588      1
40.768434      1
Name: latitude_of_pickup, Length: 36587, dtype: int64

* We have 955 entries with latitude = 0. We will drop these records.

In [13]:
taxi_df.drop(taxi_df[taxi_df.latitude_of_pickup ==0].index, inplace=True)

# We confirm using .shape
taxi_df.shape

(49036, 11)

In [14]:
# We fix the max value of "latitude_of_pickup", which was the result of a data entry error.
taxi_df["latitude_of_pickup"] = taxi_df["latitude_of_pickup"].replace(401.083332, 40.1083332)

taxi_df.latitude_of_pickup.value_counts().index.sort_values(ascending=True)
# We confirm that the last value is no longer 401.08 and is now 43.10.

Float64Index([-74.006893,  -74.00621, -74.004027,    -73.995, -73.992947,
               -73.99184, -73.988467, -73.987585, -73.987307, -73.986968,
              ...
                 41.0091,   41.03249,  41.035688,  41.150487,  41.366138,
               41.391042,  41.523217,      41.65,  42.160275,  43.098708],
             dtype='float64', length=36586)

In [15]:
# We use the haversine library to calculate the distance travelled by passing the lat & long of pickup and dropoff.
# Install with: pip install haversine

# Syntax:-
#df.apply(lambda x: func(x['col1'], x['col2']), axis=1)
#haversine((lat1,long1), (lat2,long2))

#from haversine import haversine, Unit
taxi_df['travel_dist_km'] = taxi_df.apply(lambda row: haversine((row["latitude_of_pickup"],row["longitude_of_pickup"]),
                                                              (row["latitude_of_dropoff"],row["longitude_of_dropoff"]),
                                                              unit=Unit.KILOMETERS),axis=1)
taxi_df.head()

Unnamed: 0,amount,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger,hour,day,month,year,dayofweek,travel_dist_km
0,4.5,-73.844311,40.721319,-73.84161,40.712278,1,17,15,6,2009,0,1.030765
1,16.9,-74.016048,40.711303,-73.979268,40.782004,1,16,5,1,2010,1,8.450145
2,5.7,-73.982738,40.76127,-73.991242,40.750562,2,0,18,8,2011,3,1.389527
3,7.7,-73.98713,40.733143,-73.991567,40.758092,1,4,21,4,2012,5,2.799274
4,5.3,-73.968095,40.768008,-73.956655,40.783762,1,7,9,3,2010,1,1.99916
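
The row-wise `apply` above calls `haversine` once per ride. As a rough alternative, the same great-circle formula can be written with NumPy so that it operates on whole columns at once. This is a sketch under assumptions: the function name `haversine_km` and the Earth-radius constant are choices made for this example, not part of the notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the first two rides from the head() output above.
pickup_lat = np.array([40.721319, 40.711303])
pickup_lon = np.array([-73.844311, -74.016048])
dropoff_lat = np.array([40.712278, 40.782004])
dropoff_lon = np.array([-73.841610, -73.979268])

dist = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(dist)  # should land close to the travel_dist_km values shown above
```

On a DataFrame, the four arguments would be the corresponding columns, which avoids the Python-level loop that `apply(..., axis=1)` performs.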


In [16]:
# We drop the longitudes and latitudes since we now have the distance travelled.

taxi_df.drop(['longitude_of_pickup', 'latitude_of_pickup', 'longitude_of_dropoff', 'latitude_of_dropoff'], axis=1, inplace=True)
taxi_df.columns

Index(['amount', 'no_of_passenger', 'hour', 'day', 'month', 'year',
       'dayofweek', 'travel_dist_km'],
      dtype='object')

In [17]:
taxi_df.head()

Unnamed: 0,amount,no_of_passenger,hour,day,month,year,dayofweek,travel_dist_km
0,4.5,1,17,15,6,2009,0,1.030765
1,16.9,1,16,5,1,2010,1,8.450145
2,5.7,2,0,18,8,2011,3,1.389527
3,7.7,1,4,21,4,2012,5,2.799274
4,5.3,1,7,9,3,2010,1,1.99916


In [18]:
# In our DataFrame, the target variable is the 1st col, hence [:,0]
# Col 1 onwards are our independent variables, hence [:,1:]

X = taxi_df.values[:,1:]
Y = taxi_df.values[:,0]

In [19]:
# Scaling the Independent variables

#from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X)
X = scaler.transform(X)
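
The cell above fits the scaler on all rows before the train/test split, so the test rows contribute to the scaling statistics. Tree-based models are generally insensitive to feature scaling, so the effect here is likely small, but a common way to avoid it is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `pipe`); it is not the notebook's own workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 3 features, target driven by the first feature.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Split first, so the test rows stay unseen by every fitted step.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

# The pipeline fits StandardScaler on X_tr only, then trains the tree on the scaled X_tr.
pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor(max_depth=5, random_state=10))
pipe.fit(X_tr, y_tr)

# At prediction time the same training-set statistics are applied to X_te.
print("Test R-squared:", pipe.score(X_te, y_te))
```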

In [20]:
# Splitting our data into train and test data sets (70/30 split, as we have a large amount of data).

#from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

In [21]:
# predicting using the DecisionTreeRegressor

#from sklearn.tree import DecisionTreeRegressor

model_DecisionTree = DecisionTreeRegressor(criterion="squared_error", random_state=10)

# Fitting the model on the data and predict the values
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)
#print(Y_pred)
print("Model Score:", model_DecisionTree.score(X_train, Y_train))

# Evaluating metrics:

#from sklearn.metrics import r2_score, mean_squared_error
#import numpy as np

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.9999999011754527
R-squared: 0.5054342181418084
RMSE: 6.707341049541443
Adj R-square: 0.5053636062369172


* The model scores almost perfectly on the training data (R-squared of about 1.00) but much worse on the test data (R-squared of about 0.51).
* This is a clear sign that our model is overfitted.
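
The same three metrics are recomputed in every modelling cell, so they can be wrapped in one helper. The sketch below is an assumption-laden illustration: `regression_report` is a hypothetical name, and it takes the sample count from the evaluated set (`len(y_true)`), which is the more conventional choice for adjusted R-squared, whereas the notebook cells pass `len(Y)`, the full row count. The toy arrays are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return R-squared, RMSE and adjusted R-squared for one set of predictions.

    Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - p - 1),
    with n = number of evaluated rows and p = number of features.
    """
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"r2": r2, "rmse": rmse, "adj_r2": adj_r2}

# Toy values only, to show the call.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5, 10.5])
report = regression_report(y_true, y_pred, n_features=2)
print(report)
```

In the notebook this would be called as `regression_report(Y_test, Y_pred, X.shape[1])` after each `predict`. With roughly 14,700 test rows and 7 features, the adjusted value differs from R-squared only in the fourth decimal place either way.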

## Pruning

In [22]:
# predicting using the DecisionTreeRegressor

#from sklearn.tree import DecisionTreeRegressor

model_DecisionTree = DecisionTreeRegressor(max_depth=5,
                                           min_samples_leaf=5,
                                           random_state=10) # Using default criterion

# Fitting the model on the data and predict the values
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)
#print(Y_pred)
print("Model Score:", model_DecisionTree.score(X_train, Y_train))
# Our model score decreased from 99% to 74.76%

# Evaluating metrics:

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.7476080446835321
R-squared: 0.7845022137366695
RMSE: 4.427512170288781
Adj R-square: 0.7844714459202413


* Compared with the unpruned tree, we get a higher test R-squared and adjusted R-squared (0.78 vs 0.51) and a lower RMSE (4.43 vs 6.71).

In [23]:
# We build a DataFrame to view the feature importance and sort for a better visual.

feature_imp = pd.DataFrame()
feature_imp["Feature"] = taxi_df.columns[1:]
feature_imp["Importance"] = model_DecisionTree.feature_importances_ 
# feature_importances_ only for Decision Tree & Random Forest (tree based models)

feature_imp.sort_values("Importance", ascending = False)

Unnamed: 0,Feature,Importance
6,travel_dist_km,0.976843
4,year,0.021625
1,hour,0.000885
5,dayofweek,0.000461
2,day,0.000187
0,no_of_passenger,0.0
3,month,0.0


* We see that no_of_passenger and month have 0 importance, while travel_dist_km accounts for almost all of it (0.977).

In [24]:
# Generating graph for our Tree

from sklearn import tree
# export_graphviz writes the tree in Graphviz DOT format to the open file handle.
with open(r"model_DecisionTree.txt", "w") as f:
    tree.export_graphviz(model_DecisionTree, feature_names=list(taxi_df.columns[1:]), out_file=f)

## Random Forest

In [25]:
# predicting using the RandomForestRegressor

#from sklearn.ensemble import RandomForestRegressor

model_DecisionTree = RandomForestRegressor(n_estimators=100, random_state=10) # Default: n_estimators=100 (RF and EXT)

# Fitting the model on the data and predict the values
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)
#print(Y_pred)
print("Model Score:", model_DecisionTree.score(X_train, Y_train))

# Evaluating metrics:

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.9607301673132398
R-squared: 0.770669705796814
RMSE: 4.567400248298248
Adj R-square: 0.7706369630363624


### Trying to prune RandomForest

In [26]:
#from sklearn.ensemble import RandomForestRegressor

model_DecisionTree = RandomForestRegressor(n_estimators=100,
                                           max_depth=5,
                                           min_samples_leaf=5,
                                           random_state=10)

# Fitting the model on the data and predict the values
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)
#print(Y_pred)
print("Model Score:", model_DecisionTree.score(X_train, Y_train))

# Evaluating metrics:

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.7547707985082028
R-squared: 0.7919078207039847
RMSE: 4.3507711124121204
Adj R-square: 0.7918781102272148


We see that the test metrics improved slightly after pruning (R-squared 0.77 -> 0.79, RMSE 4.57 -> 4.35).

In [27]:
# We build a DataFrame to view the feature importance and sort for a better visual.

feature_imp = pd.DataFrame()
feature_imp["Feature"] = taxi_df.columns[1:]
feature_imp["Importance"] = model_DecisionTree.feature_importances_ 
# feature_importances_ only for Decision Tree & Random Forest (tree based models)

feature_imp.sort_values("Importance", ascending = False)

Unnamed: 0,Feature,Importance
6,travel_dist_km,0.976056
4,year,0.021545
1,hour,0.001618
3,month,0.000278
5,dayofweek,0.000265
2,day,0.000236
0,no_of_passenger,3e-06
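
The `max_depth` and `min_samples_leaf` values above were set by hand. The Scope and Learning Outcome sections mention hyperparameter tuning with GridSearchCV, which could look roughly like the sketch below. Everything in it is an assumption made for illustration: the synthetic data, the small parameter grid, the reduced `n_estimators`, and the choice of RMSE as the scoring metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: one distance-like feature that drives the target, one noise feature.
rng = np.random.default_rng(10)
X_demo = rng.uniform(0, 20, size=(300, 2))
y_demo = 2.5 + 1.6 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

# Candidate values to try; every combination is cross-validated.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=10),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negated RMSE
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

In the notebook, the search would be fitted on `X_train, Y_train` only, and `search.best_estimator_` would then be evaluated once on the held-out test set.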


## Trying with ExtraTreesRegressor

In [28]:
# Predicting using the ExtraTreesRegressor

#from sklearn.ensemble import ExtraTreesRegressor

model_EXT = ExtraTreesRegressor(n_estimators=100,
                                random_state=10,
                                bootstrap=True)   # bootstrap=True is optional; ExtraTreesRegressor defaults to bootstrap=False

# Fitting the model on the data and predict the values
model_EXT.fit(X_train, Y_train)

Y_pred = model_EXT.predict(X_test)
#print(Y_pred)
print("Model Score:", model_EXT.score(X_train, Y_train))

# Evaluating metrics:

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.9597581526056705
R-squared: 0.773893154951798
RMSE: 4.535187136373466
Adj R-square: 0.7738608724210944


### After pruning ExtraTreesRegressor

In [29]:
#from sklearn.ensemble import ExtraTreesRegressor

model_EXT = ExtraTreesRegressor(n_estimators=100,
                                random_state=10,
                                max_depth=10,
                                min_samples_leaf=5,
                                bootstrap=True)   # Here, bootstrap=True is not mandatory

# Fitting the model on the data and predict the values
model_EXT.fit(X_train, Y_train)

Y_pred = model_EXT.predict(X_test)
#print(Y_pred)
print("Model Score:", model_EXT.score(X_train, Y_train))

# Evaluating metrics:

r2=r2_score(Y_test,Y_pred)  # R-Square value
print("R-squared:",r2)

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

adjusted_r_squared = 1 - (1-r2)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
print("Adj R-square:",adjusted_r_squared)

Model Score: 0.6072723027454293
R-squared: 0.6049334087643743
RMSE: 5.994783587824736
Adj R-square: 0.6048770029118278


* Linear regression is suitable when there is a linear relationship between X & Y.
* We plotted the variables and found that none of them has a linear relationship with Y, which is why only tree-based models are compared here.

<head>
	<title>Metrics Summary</title> 
	<style>
		table td {
			text-align:center;
		}
	</style>
</head>
<body>
	<table>
		<thead>
			<tr>
                <th><u>Metrics</u></th>
                <th><u>Base DecisionTreeRegressor</u></th>
				<th>Pruned DecisionTreeRegressor</th>
				<th>Base RandomForestRegressor</th>
				<th>Pruned RandomForestRegressor</th>
				<th>Base ExtraTreesRegressor</th>
				<th>Pruned ExtraTreesRegressor</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>Model Score (training R-squared)</td>
				<td>100.00%</td>
				<td>74.76%</td>
				<td>96.07%</td>
				<td>75.48%</td>
				<td>95.98%</td>
				<td>60.73%</td>
			</tr>
			<tr>
				<td>R-squared</td>
				<td>0.50</td>
				<td>0.78</td>
				<td>0.77</td>
				<td>0.79</td>
				<td>0.77</td>
                <td>0.60</td>
			</tr>
			<tr>
				<td>Adj R-square</td>
				<td>0.50</td>
				<td>0.78</td>
				<td>0.77</td>
				<td>0.79</td>
				<td>0.77</td>
                <td>0.60</td>
			</tr>
			<tr>
				<td>RMSE</td>
				<td>6.71</td>
				<td>4.43</td>
				<td>4.57</td>
				<td>4.35</td>
				<td>4.54</td>
				<td>5.99</td>
			</tr>
		</tbody>
	</table>
</body>

* We see that the pruned RandomForestRegressor gives us the highest R-squared and adjusted R-squared values and the lowest RMSE.
* We will now rebuild that model and use it to predict fares.

In [30]:
model_DecisionTree = RandomForestRegressor(n_estimators=100,
                                           max_depth=5,
                                           min_samples_leaf=5,
                                           random_state=10)

# Fitting the model on the training data and predicting for every row in X (train and test)
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X)
print(Y_pred)
print("Length of predicted values", len(Y_pred))

[ 5.44367337 21.85361238  7.13117599 ...  8.22979661  5.44367337
 12.38343785]
Length of predicted values 49036


In [31]:
# Reload the raw data and apply the same row filters as before,
# so that the remaining rows line up one-to-one with Y_pred.
df_final = pd.read_csv(r'TaxiFare.csv', header=0)
df_final.drop(df_final[df_final.amount<=0].index, axis=0, inplace=True)
df_final["latitude_of_pickup"] = df_final["latitude_of_pickup"].replace(401.083332, 40.1083332)
df_final.drop(df_final[df_final.latitude_of_pickup ==0].index, inplace=True)
df_final['Predicted Amount'] = Y_pred
df_final['Deviation'] = abs(df_final['amount']-df_final['Predicted Amount'])
df_final.head()

Unnamed: 0,unique_id,amount,date_time_of_pickup,longitude_of_pickup,latitude_of_pickup,longitude_of_dropoff,latitude_of_dropoff,no_of_passenger,Predicted Amount,Deviation
0,26:21.0,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1,5.443673,0.943673
1,52:16.0,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1,21.853612,4.953612
2,35:00.0,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2,7.131176,1.431176
3,30:42.0,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1,9.474145,1.774145
4,51:00.0,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1,7.734998,2.434998
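
Attaching `Y_pred` by position works only as long as the reloaded frame is filtered to exactly the same rows, in the same order, as the training frame. One way to make that less fragile is to keep the original row labels through the filtering and attach predictions by index. The sketch below is a toy illustration with made-up values; `raw`, `clean`, and `preds` are hypothetical names, and `preds` stands in for the output of `model.predict(...)`.

```python
import numpy as np
import pandas as pd

# Toy raw data; the second row has an invalid (negative) fare.
raw = pd.DataFrame({
    "amount": [4.5, -5.0, 16.9, 5.7],
    "travel_dist_km": [1.0, 0.5, 8.5, 1.4],
})

# Boolean filtering keeps the original row labels (0, 2, 3).
clean = raw[raw["amount"] > 0]

# Stand-in for model.predict(...) on the cleaned rows, one value per row of `clean`.
preds = np.array([5.4, 21.9, 7.1])

# Wrapping the predictions in a Series with clean.index aligns them by label,
# so dropped rows simply receive NaN instead of shifting every later prediction.
raw["Predicted Amount"] = pd.Series(preds, index=clean.index)
raw["Deviation"] = (raw["amount"] - raw["Predicted Amount"]).abs()
print(raw)
```

This relies on the row labels not having been reset between cleaning and prediction.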


In [32]:
df_final.to_csv("Predicted_amount.csv")

### End of project