
# <b><u>Project Title: Taxi Trip Time Prediction: Predicting the Total Ride Duration of Taxi Trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC) and was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store-and-forward trip)
* #### trip_duration - duration of the trip in seconds
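
The relationship between the two timestamp fields and `trip_duration` can be checked directly. The sketch below uses only the standard library and the first three records from the `df.head()` output shown later in this notebook; it confirms that the difference between dropoff and pickup, in seconds, equals the recorded `trip_duration`. The helper name `duration_seconds` and the constant `FMT` are introduced here for illustration and are not part of the dataset or the project code.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the CSV


def duration_seconds(pickup: str, dropoff: str) -> int:
    """Seconds elapsed between two timestamps in the dataset's format."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return int(delta.total_seconds())


# (pickup_datetime, dropoff_datetime, trip_duration) copied from df.head()
sample = [
    ("2016-03-14 17:24:55", "2016-03-14 17:32:30", 455),
    ("2016-06-12 00:43:35", "2016-06-12 00:54:38", 663),
    ("2016-01-19 11:35:24", "2016-01-19 12:10:48", 2124),
]

for pickup, dropoff, recorded in sample:
    # Computed and recorded durations should agree for every row
    print(pickup, duration_seconds(pickup, dropoff), recorded)
```

If the two values agree across the full dataset, `dropoff_datetime` carries the same information as the target and would leak it into a model, so it is worth checking before choosing features.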

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy as np 
import pandas as pd 
import math

In [None]:
dir_path="/content/drive/MyDrive/NYC Taxi Trip Time Prediction/NYC Taxi Data.csv"

In [None]:
df= pd.read_csv(dir_path)

In [None]:
df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


In [None]:
df.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration'],
      dtype='object')

In [None]:
df['id']

0          id2875421
1          id2377394
2          id3858529
3          id3504673
4          id2181028
             ...    
1458639    id2376096
1458640    id1049543
1458641    id2304944
1458642    id2714485
1458643    id1209952
Name: id, Length: 1458644, dtype: object

In [None]:
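def cal_distance(x1, y1, x2, y2):
  """Straight-line (Euclidean) distance; with longitude/latitude inputs the result is in degrees, not km."""
  return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)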

In [None]:
df.iloc[[2896]].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 2896 to 2896
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1 non-null      object 
 1   vendor_id           1 non-null      int64  
 2   pickup_datetime     1 non-null      object 
 3   dropoff_datetime    1 non-null      object 
 4   passenger_count     1 non-null      int64  
 5   pickup_longitude    1 non-null      float64
 6   pickup_latitude     1 non-null      float64
 7   dropoff_longitude   1 non-null      float64
 8   dropoff_latitude    1 non-null      float64
 9   store_and_fwd_flag  1 non-null      object 
 10  trip_duration       1 non-null      int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 96.0+ bytes


In [None]:
df[2898:]

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
2898,id1119347,2,2016-04-12 23:06:31,2016-04-12 23:18:20,1,-73.973007,40.756409,-73.991196,40.756306,N,709
2899,id1526407,2,2016-05-19 20:35:54,2016-05-19 20:49:39,2,-73.985298,40.761044,-74.011192,40.728642,N,825
2900,id1923168,2,2016-01-21 08:27:45,2016-01-21 08:35:31,1,-73.988411,40.748913,-73.996681,40.756039,N,466
2901,id0313424,1,2016-05-12 22:21:20,2016-05-12 22:46:18,1,-73.863609,40.769768,-73.970306,40.786518,N,1498
2902,id0971925,2,2016-06-08 23:03:55,2016-06-08 23:21:42,4,-73.984261,40.743267,-73.987396,40.779396,N,1067
...,...,...,...,...,...,...,...,...,...,...,...
1458639,id2376096,2,2016-04-08 13:31:04,2016-04-08 13:44:02,4,-73.982201,40.745522,-73.994911,40.740170,N,778
1458640,id1049543,1,2016-01-10 07:35:15,2016-01-10 07:46:10,1,-74.000946,40.747379,-73.970184,40.796547,N,655
1458641,id2304944,2,2016-04-22 06:57:41,2016-04-22 07:10:25,1,-73.959129,40.768799,-74.004433,40.707371,N,764
1458642,id2714485,1,2016-01-05 15:56:26,2016-01-05 16:02:39,1,-73.982079,40.749062,-73.974632,40.757107,N,373


In [None]:
df['dist']=np.sqrt((df.pickup_longitude-df.dropoff_longitude)**2+(df.pickup_latitude-df.dropoff_latitude)**2)
df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,dist,dist_1
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,0.01768,1.01768
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,0.020456,1.020456
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,0.059934,1.059934
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,0.013438,1.013438
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,0.01069,1.01069
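
The `dist` column is measured in degrees, which is hard to interpret and distorts east-west distances, because a degree of longitude covers less ground than a degree of latitude away from the equator. As a rough illustration (not the notebook's method), the sketch below converts the first trip's coordinate differences to kilometres with an equirectangular approximation. The names `euclid_degrees`, `approx_km`, and `KM_PER_DEGREE` are assumptions introduced for this example, and the Earth radius of 6371 km is a commonly used mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE = math.pi * EARTH_RADIUS_KM / 180.0  # about 111.19 km per degree


def euclid_degrees(lon1, lat1, lon2, lat2):
    """Euclidean distance in raw degrees, as in the dist column."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def approx_km(lon1, lat1, lon2, lat2):
    """Equirectangular approximation: scale longitude by cos(mean latitude)."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = (lon2 - lon1) * math.cos(mean_lat) * KM_PER_DEGREE
    dy = (lat2 - lat1) * KM_PER_DEGREE
    return math.hypot(dx, dy)


# pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude of row 0
first_trip = (-73.982155, 40.767937, -73.96463, 40.765602)

print(round(euclid_degrees(*first_trip), 5))  # 0.01768, matching the dist column
print(approx_km(*first_trip))                 # the same trip expressed in km
```

For short city trips this approximation stays close to a great-circle distance, but the haversine formula is the safer choice when trips span larger distances.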



In [None]:
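# Spherical law of cosines; 3963.0 is the Earth's radius in miles, and angles must be in radians.
lat1, lat2, dlon = (np.radians(s) for s in (df.pickup_latitude, df.dropoff_latitude, df.dropoff_longitude - df.pickup_longitude))
df['dist_a'] = 3963.0 * np.arccos(np.clip(np.sin(lat1) * np.sin(lat2) + np.cos(lat1) * np.cos(lat2) * np.cos(dlon), -1.0, 1.0))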

In [None]:
def haversine_dist(row):
  """Great-circle (haversine) distance in km between a row's pickup and dropoff points."""
  dlat = np.radians(row.dropoff_latitude - row.pickup_latitude)
  dlon = np.radians(row.dropoff_longitude - row.pickup_longitude)
  r = 6371  # mean Earth radius in km
  h = np.sin(dlat/2)**2 + np.cos(np.radians(row.dropoff_latitude))*np.cos(np.radians(row.pickup_latitude))*np.sin(dlon/2)**2
  return 2*r*np.arcsin(np.sqrt(h))


In [None]:
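# Sanity check with the four-argument haversine_dist defined further down (latitude, longitude of each point, in degrees)
haversine_dist(51.5007,0.1246,40.6892,74.0445)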

5574.840456848555

In [None]:
# axis=1 passes one row (trip) at a time to haversine_dist; the new column name must be a quoted string
df['dist_haversine']=df[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']].apply(haversine_dist,axis=1)

In [None]:
sub_data=df.loc[:,['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']]

In [None]:
sub_data

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
0,-73.982155,40.767937,-73.964630,40.765602
1,-73.980415,40.738564,-73.999481,40.731152
2,-73.979027,40.763939,-74.005333,40.710087
3,-74.010040,40.719971,-74.012268,40.706718
4,-73.973053,40.793209,-73.972923,40.782520
...,...,...,...,...
1458639,-73.982201,40.745522,-73.994911,40.740170
1458640,-74.000946,40.747379,-73.970184,40.796547
1458641,-73.959129,40.768799,-74.004433,40.707371
1458642,-73.982079,40.749062,-73.974632,40.757107


In [None]:
# Pass the function itself (not a call) and apply it row by row to the coordinate columns
df['distance_3']=sub_data.apply(haversine_dist,axis=1)

In [None]:
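def haversine_dist(lat1, lon1, lat2, lon2):
  """Haversine distance in km; inputs in degrees, either scalars or NumPy/pandas arrays."""
  dlat = np.radians(lat2 - lat1)
  dlon = np.radians(lon2 - lon1)
  r = 6371  # mean Earth radius in km
  h = np.sin(dlat/2)**2 + np.cos(np.radians(lat1))*np.cos(np.radians(lat2))*np.sin(dlon/2)**2
  return 2*r*np.arcsin(np.sqrt(h))

# Vectorised usage (whole columns at once, no apply needed):
# sub_data['haver_distance'] = haversine_dist(sub_data['pickup_latitude'], sub_data['pickup_longitude'], sub_data['dropoff_latitude'], sub_data['dropoff_longitude'])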

In [None]:
bsas = [-34.83333, -58.5166646]
paris = [49.0083899664, 2.53844117956]
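
These two coordinate pairs (latitude, longitude in degrees) give a long-distance sanity check for the haversine formula. The sketch below is a scalar, standard-library version of the four-argument function defined in the previous cell. The name `haversine_km` and its `radius_km` parameter are introduced here for illustration, and the Earth radius of 6371 km is the same value the notebook uses.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


bsas = [-34.83333, -58.5166646]
paris = [49.0083899664, 2.53844117956]

# Distance between the two points defined in the notebook
print(haversine_km(*bsas, *paris))

# Repeats the earlier check in this notebook, which returned about 5574.84 km
print(haversine_km(51.5007, 0.1246, 40.6892, 74.0445))
```

Checking the function on a pair of well-separated points before applying it to 1.4 million trips helps catch swapped latitude and longitude arguments, which would otherwise go unnoticed on short city trips.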