### NY City Taxi Fare Prediction Exploration using Dask

This is data on New York City Taxi Cab trips,the Dataset published by NYC,there is over 15 million trips.Our dataset includes every cab ride in the city of New York in the year of 2015, including when and where it started and stopped, a breakdown of the fare, etc.
The data is from [here](https://www.kaggle.com/kentonnlp/2014-new-york-city-taxi-trips?select=nyc_taxi_data_2014.csv).

 Features I am using for my prediction are
- vendor_id - str
- fare_amount - float dollar amount of the cost of the taxi ride
- pickup_datetime - timestamp value indicating when the taxi ride started.
- dropoff_datetime - timestamp value indicating when the taxi ride ended.
- pickup_longitude - float for longitude coordinate of where the taxi ride started.
- pickup_latitude - float for latitude coordinate of where the taxi ride started.
- dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
- dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
- trip_distance -  float for distance of where the taxi ride traveled.
- passenger_count - integer indicating the number of passengers in the taxi ride(driver entered value).
- payment_type - str of what kind of payment the passenger done.
- tip amount - float how much  tip the passenger given.



This data is too large to fit into Pandas on a single computer. However, it can fit in memory if we break it up into many small pieces and load these pieces onto different computers across a cluster.

We connect a client to our Dask cluster, composed of one centralized dask-scheduler process and several dask-worker processes running on each of the machines in our cluster.

#### importing necessary libraries

In [1]:
%%time
import numpy as np
import pandas as pd
import urllib.request
import math
import seaborn as sns     
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline
from geopy.distance import great_circle
import warnings
warnings.filterwarnings('ignore')
import folium                                        # map visualizations
from folium.plugins import HeatMap
import dask
import dask.dataframe as dd
import dask.multiprocessing
from dask.distributed import Client,progress 
client=Client('172.20.115.128:8786')
#dask.config.set({"distributed.comm.timeouts.connect": "50s"})
client

CPU times: user 3.87 s, sys: 3.09 s, total: 6.95 s
Wall time: 6 s


0,1
Connection method: Direct,
Dashboard: http://172.20.115.128:42571/status,

0,1
Comm: tcp://172.20.115.128:8786,Workers: 2
Dashboard: http://172.20.115.128:42571/status,Total threads: 16
Started: 17 minutes ago,Total memory: 24.61 GiB

0,1
Comm: tcp://172.20.115.128:43651,Total threads: 8
Dashboard: http://172.20.115.128:46573/status,Memory: 12.31 GiB
Nanny: tcp://172.20.115.128:45541,
Local directory: /home/sulochana/dask-worker-space/worker-v1a4mqee,Local directory: /home/sulochana/dask-worker-space/worker-v1a4mqee
Tasks executing: 8,Tasks in memory: 8
Tasks ready: 6,Tasks in flight: 0
CPU usage: 325.0%,Last seen: Just now
Memory usage: 5.13 GiB,Spilled bytes: 0 B
Read bytes: 11.89 kiB,Write bytes: 16.33 kiB

0,1
Comm: tcp://172.20.115.128:44367,Total threads: 8
Dashboard: http://172.20.115.128:45499/status,Memory: 12.31 GiB
Nanny: tcp://172.20.115.128:43035,
Local directory: /home/sulochana/dask-worker-space/worker-dmuzvgb9,Local directory: /home/sulochana/dask-worker-space/worker-dmuzvgb9
Tasks executing: 8,Tasks in memory: 5
Tasks ready: 2,Tasks in flight: 1
CPU usage: 104.8%,Last seen: Just now
Memory usage: 3.95 GiB,Spilled bytes: 0 B
Read bytes: 4.08 kiB,Write bytes: 10.90 kiB


#### Extracting zipfile

In [2]:
import zipfile
with zipfile.ZipFile("archive (1).zip") as zip_ref:
    zip_ref.extractall("taxidata")

 Set columns to most suitable type to optimize for memory usage and speed-up the loading and select the columns (names) that you truly need for analysis

In [3]:
%%time
# Set columns to most suitable type to optimize for memory usage and speed-up the loading
data_types = {'vendor_id' : 'str',
               'fare_amount'      : 'float32',
               'pickup_datetime'  : 'str', 
               'dropoff_datetime' : 'str',
               'passenger_count' :'float',
               'trip_distance':   'float',
               'pickup_longitude' : 'float32',
               'pickup_latitude'  : 'float32',
               #'store_and_fwd_flag' : 'str',
               'dropoff_longitude': 'float32',
               'dropoff_latitude' : 'float32',
               'payment_type' :   'str',
               'tip_amount' :'float',
        
             }

#select the columns (names) that you truly need for analysis
data_cols = list(data_types.keys())

CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 12.4 µs


#### Reading csv file,creating dask data frame and understanding data

And we load our CSV data using dask.dataframe which looks and feels just like Pandas, even though it’s actually coordinating hundreds of small Pandas dataframes. This takes about a minute to load and parse.

In [4]:
%%time
taxi_data= dd.read_csv("taxidata/nyc_taxi_data_2014.csv",sep=',',usecols=data_cols, dtype=data_types,engine='python')
taxi_data.head()

CPU times: user 135 ms, sys: 17.9 ms, total: 153 ms
Wall time: 3.47 s


Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,tip_amount
0,CMT,2014-01-09 20:45:25,2014-01-09 20:52:31,1.0,0.7,-73.994766,40.736828,-73.982224,40.731789,CRD,6.5,1.4
1,CMT,2014-01-09 20:46:12,2014-01-09 20:55:12,1.0,1.4,-73.982391,40.77338,-73.960449,40.763996,CRD,8.5,1.9
2,CMT,2014-01-09 20:44:47,2014-01-09 20:59:46,2.0,2.3,-73.988571,40.739407,-73.986626,40.765217,CRD,11.5,1.5
3,CMT,2014-01-09 20:44:57,2014-01-09 20:51:40,1.0,1.7,-73.960213,40.770466,-73.979866,40.77705,CRD,7.5,1.7
4,CMT,2014-01-09 20:47:09,2014-01-09 20:53:32,1.0,0.9,-73.995369,40.717247,-73.984367,40.720524,CRD,6.0,1.75


In [5]:
taxi_data.compute().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14999999 entries, 0 to 209999
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   vendor_id          object 
 1   pickup_datetime    object 
 2   dropoff_datetime   object 
 3   passenger_count    float64
 4   trip_distance      float64
 5   pickup_longitude   float32
 6   pickup_latitude    float32
 7   dropoff_longitude  float32
 8   dropoff_latitude   float32
 9   payment_type       object 
 10  fare_amount        float32
 11  tip_amount         float64
dtypes: float32(5), float64(3), object(4)
memory usage: 1.2+ GB


In [6]:
#%%time
#taxi_data.compute().info()

#### percentage of rows to load from 15 million rows

In [7]:
%%time
sample=taxi_data.sample(0.001)
sample.head()


CPU times: user 120 ms, sys: 88.1 ms, total: 209 ms
Wall time: 5.17 s


Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,tip_amount
281338,CMT,2014-01-11 22:47:28,2014-01-11 23:16:48,1.0,6.2,-73.968864,40.755829,-73.947762,40.815372,CRD,24.0,5.0
158910,CMT,2014-01-10 23:33:57,2014-01-10 23:42:20,1.0,2.7,-73.979576,40.74395,-73.952164,40.771236,CRD,10.0,2.0
54326,CMT,2014-01-10 08:20:27,2014-01-10 08:57:21,1.0,8.2,-73.933434,40.758213,-74.006035,40.705967,CRD,31.5,5.0
324542,CMT,2014-01-11 20:33:49,2014-01-11 20:38:19,1.0,0.9,-73.971642,40.795433,-73.966255,40.804573,CRD,5.5,1.3
208927,CMT,2014-01-11 03:38:35,2014-01-11 03:49:27,2.0,2.8,-73.988998,40.727135,-73.997444,40.75629,CRD,11.0,2.4


Lets explore data

In [8]:
%%time
sample.compute()

CPU times: user 10.7 s, sys: 3.59 s, total: 14.3 s
Wall time: 4min 48s


Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,tip_amount
281338,CMT,2014-01-11 22:47:28,2014-01-11 23:16:48,1.0,6.20,-73.968864,40.755829,-73.947762,40.815372,CRD,24.0,5.0
158910,CMT,2014-01-10 23:33:57,2014-01-10 23:42:20,1.0,2.70,-73.979576,40.743950,-73.952164,40.771236,CRD,10.0,2.0
54326,CMT,2014-01-10 08:20:27,2014-01-10 08:57:21,1.0,8.20,-73.933434,40.758213,-74.006035,40.705967,CRD,31.5,5.0
324542,CMT,2014-01-11 20:33:49,2014-01-11 20:38:19,1.0,0.90,-73.971642,40.795433,-73.966255,40.804573,CRD,5.5,1.3
208927,CMT,2014-01-11 03:38:35,2014-01-11 03:49:27,2.0,2.80,-73.988998,40.727135,-73.997444,40.756290,CRD,11.0,2.4
...,...,...,...,...,...,...,...,...,...,...,...,...
147693,VTS,2014-02-06 23:37:00,2014-02-07 00:07:00,5.0,11.35,-73.994125,40.761444,-73.839493,40.722160,CRD,34.0,0.3
113593,VTS,2014-02-03 17:16:00,2014-02-03 17:21:00,5.0,1.34,-73.978188,40.745556,-73.961464,40.768742,CRD,6.5,1.5
59981,VTS,2014-02-03 15:37:00,2014-02-03 15:48:00,1.0,1.34,-73.951370,40.782463,-73.963219,40.769039,CRD,8.5,1.5
133608,VTS,2014-02-07 00:16:00,2014-02-07 00:22:00,2.0,1.18,-74.004333,40.742294,-73.994301,40.752724,CSH,6.5,0.0


#### checking number of unique values

In [9]:
sample.compute().nunique()

CancelledError: ('sample-3b0e482293143f1ce8851855de5e32e6', 18)

#### describing data

In [None]:
sample.compute().describe()

- Passenger count has a minimum of 0 which means either it is an error entered or the drivers deliberately entered 0 to complete a target number of rides.

### Data Cleaning and persisting to memory

Lets see if there is any null values in our dataset

In [None]:
%%time
# remove rows with null values
sample.isnull().sum().compute()

In [None]:
#sample = sample.drop(['store_and_fwd_flag','mta_tax'],axis=1)

#### Lets convert pickup and dropoff times to datetime format

In [None]:
%%time
# converting pickup and dropoff times to datetime format
sample["pickup_datetime"] =dd.to_datetime(sample["pickup_datetime"],format='%Y/%m/%d %H:%M')

sample["dropoff_datetime"] =dd.to_datetime(sample["dropoff_datetime"],format='%Y/%m/%d %H:%M')

#### Creating 3 new columns called Duration, Average_Speed and Price_Per_Mile

In [None]:
sample['duration'] = sample['dropoff_datetime']-sample['pickup_datetime']

In [None]:
def get_seconds(a):
    return a.seconds
    

In [None]:
sample['duration'] = sample['duration'].apply(get_seconds)

In [None]:
#Creating a speed feature (MPH)
sample['average_speed'] = (sample['trip_distance']/(sample['duration']/3600)).round(3)

In [None]:
sample['price_per_mile']= (sample['trip_distance']/sample['fare_amount']).round(2)

#### checking null values after creating new columns

In [None]:
sample.isnull().sum().compute()

There is 31 null values in average_speed columns,lets check index of null values

In [None]:
sample.loc[sample['average_speed'].isnull()].index.compute()

Assign null values to variable 

In [None]:
cancelled_trips = sample[sample['average_speed'].isnull()].compute()

In [None]:
sample = sample.dropna(subset=['average_speed'])

In [None]:
sample = sample[sample['average_speed'] != sample['average_speed'].max()]

In [None]:
sample['average_speed'].compute().describe()

In [None]:
sample['average_speed'].compute().value_counts()

In [None]:
sample[sample['average_speed'] == sample['average_speed'].max()].compute()

In [None]:
sample.pickup_datetime.compute()

In [None]:
sample.dropoff_datetime.compute()

#### persisting to memory all cleaned dataframe  further analysis

In [None]:
%%time
sample= sample.persist()

In [None]:
sample.compute()

### Univariate Analysis

The univariate analysis involves studying patterns of all variables individually.

#### Trips for day

 pickup_day and dropoff_day which will contain the name of the day on which the ride was taken.

In [None]:
sample['pickup_day']=sample['pickup_datetime'].dt.day_name()
sample['dropoff_day']=sample['dropoff_datetime'].dt.day_name()

In [None]:
sample['pickup_day'].describe(include=object).compute()

In [None]:
figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
ax1.set_title('Pickup Days')
ax=sns.countplot(x="pickup_day",data=sample.compute(),ax=ax1)

ax2.set_title('Dropoff Days')
ax=sns.countplot(x="dropoff_day",data=sample.compute(),ax=ax2)

 We see thursday is the busiest day

In [None]:
sample.groupby('pickup_day').mean().compute()[['passenger_count','fare_amount','tip_amount','duration']].style.highlight_max(color='lightblue').highlight_min(color='pink')

#### Pickup Day Fare Amount

In [None]:
##fare_amount per day
p=sns.lineplot(x='pickup_day',y='fare_amount',data=sample)
p.set(title="Fare amount frequency for piyckup day");

As we see Fridays's are highest paid rides and sunday's are low fair rates.

##### pickup_hour and dropoff_hour with an hour of the day in the 24-hour format.

In [None]:
sample['pickup_hour']=sample['pickup_datetime'].dt.hour
sample['dropoff_hour']=sample['dropoff_datetime'].dt.hour

In [None]:
figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
ax1.set_title('Pickup hour')
ax=sns.countplot(x="pickup_hour",data=sample.compute(),ax=ax1)

ax2.set_title('dropoff hour')
ax=sns.countplot(x="dropoff_hour",data=sample.compute(),ax=ax2)

We see the busiest hours are 6:00 pm to 7:00 pm and that makes sense as this is the time when people return from their offices.

In [None]:
sample.groupby('pickup_hour').mean().compute()[['passenger_count','fare_amount','tip_amount','duration']].style.highlight_max(color='lightblue').highlight_min(color='pink')

#### Fare amount for hour

In [None]:
##fare_amount per hour
p=sns.lineplot(x='pickup_hour',y='fare_amount',data=sample)
p.set(title="Fare amount frequency for piyckup hour")

As we see 5pm   the highest fair amount paid by passengers.

#### pickup_month and dropoff_month with month number with January=1 and December=12.

In [None]:
sample['pickup_month']=sample['pickup_datetime'].dt.month
sample['dropoff_month']=sample['dropoff_datetime'].dt.month

In [None]:
figure,(ax1,ax2)=plt.subplots(ncols=2,figsize=(20,5))
ax1.set_title('Pickup Month')
ax=sns.countplot(x="pickup_month",data=sample.compute(),ax=ax1)

ax2.set_title('Dropoff Month')
ax=sns.countplot(x="dropoff_month",data=sample.compute(),ax=ax2)



In January month  most rides taken 

#### pickup_month fare amount

In [None]:
##fare_amount by month
p=sns.lineplot(x='pickup_month',y='fare_amount',data=sample)
p.set(title="Fare amount frequency for piyckup month")

#### Time of day

I have defined a function that lets us determine what time of the day the ride was taken. I have created 4 time zones ‘Morning’ (from 6:00 am to 11:59 pm), ‘Afternoon’ (from 12 noon to 3:59 pm), ‘Evening’ (from 4:00 pm to 9:59 pm), and ‘Late Night’ (from 10:00 pm to 5:59 am)

In [None]:
def time_of_day(x):
    if x in range(6,12):
        return 'Morning'
    elif x in range(12,16):
        return 'Afternoon'
    elif x in range(16,22):
        return 'Evening'
    else:
        return 'Late night'

In [None]:
sample['pickup_timeofday']=sample['pickup_hour'].apply(time_of_day)
sample['dropoff_timeofday']=sample['dropoff_hour'].apply(time_of_day)

In [None]:
figure,(ax3,ax4)=plt.subplots(ncols=2,figsize=(20,5))
ax3.set_title('Pickup Time of Day')
ax=sns.countplot(x="pickup_timeofday",data=sample.compute(),ax=ax3)
ax4.set_title('Dropoff Time of Day')
ax=sns.countplot(x="dropoff_timeofday",data=sample.compute(),ax=ax4)

As we saw above, evenings are the busiest

In [None]:
sample.groupby('pickup_timeofday').mean().compute()[['passenger_count','fare_amount','tip_amount','duration']].style.highlight_max(color='lightblue').highlight_min(color='pink')

#### time of day with fare amount

In [None]:
##fare_amount by time of day
p=sns.lineplot(x='pickup_timeofday',y='fare_amount',data=sample)
p.set(title="Fare amount frequency for piyckup time of day")

As we see late night the fair amount  is more.

In [None]:
 
plt.figure(figsize=(10, 8))
sns.heatmap(sample.corr(),xticklabels=sample.columns[:-8],yticklabels=sample.columns[:-8])
plt.suptitle('Pearson Correlation Heatmap')
plt.show();

#### type of payment by passenger

In [None]:
##fare_amount per hour
p=sns.lineplot(x='payment_type',y='passenger_count',data=sample)
p.set(title="passenger count by type of payment")


Maximum payments done by UNK. Equal no of passengers used card and cash payments.

`Credit card
Cash
No charge
Dispute
Unknown
Voided trip'

#### passenger count

In [None]:
sample.passenger_count.value_counts().compute()

In [None]:
p=sns.countplot(x='passenger_count',data=sample.compute())
p.set(title="passenger frequency");

Most rides taken by the single passenger

In [None]:
sample.groupby('passenger_count').mean().compute()[['fare_amount','tip_amount','duration']].style.highlight_max(color='lightblue').highlight_min(color='pink')

Tip amount and duration of travel show upward trend with increasing number of passengers. 

In [None]:
#Fare amount mean and standard deviation:
fare_amount_mean = sample["fare_amount"].mean()
#fare_amount_standard_deviation = math.sqrt(((sample["fare_amount"] - fare_amount_mean) ** 2).mean())
fare_amount_standard_deviation=sample["fare_amount"].std()

print("average fair amount (mean) : ${0:.2f}".format(fare_amount_mean.compute()))
print("fare amount standard deviation : ${0:.2f}\n".format(fare_amount_standard_deviation.compute()))

#### check the taxi’s fair amount requency by  using Histogram.

In [None]:
# plot histogram of fare
plt.figure(figsize=(10,6))
sns.set(color_codes=True)
ax = sns.distplot(sample.fare_amount, bins=15, kde=False)
plt.xlabel('Fare $USD')
plt.ylabel('Frequency')

    
plt.title('Fare Amount Histogram', fontsize=15)
plt.show()

It shows up, greater part of taxi ride charges are between five to 10 dollars.

In [None]:
sample.head()

#### Distance and Vendor

In [None]:
plt.figure(figsize=(10,6))
p=sns.barplot(y='trip_distance',x='vendor_id',data=sample.compute(),estimator=np.mean)
plt.title("distance frequency by vender id", fontsize=15)

The distribution for both vendors is very similar.

In [None]:
sample.groupby('vendor_id').mean().compute()[['passenger_count','fare_amount','tip_amount','duration','trip_distance']].style.highlight_max(color='lightblue').highlight_min(color='pink')

#### Distance per passenger count

In [None]:
plt.figure(figsize=(10,6))
p=sns.catplot(y='trip_distance',x='passenger_count',data=sample.compute(),kind="strip")
p.set(title="passenger count by distance")

We see some of the longer distances are covered by either 1 or 2 passenger rides.

#### Distance per day of week

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='pickup_day',y='trip_distance',data=sample)
p.set(title="distance frequency for pickup day")

- Distances are longer on Sunday and friday probably because it’s weekend.
- Tuesday trip distances are also quite high.


#### Distance per hour of day

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='pickup_hour',y='trip_distance',data=sample)
p.set(title="distance frequency for pickup hour")

Distances are the longest around 5 am.

#### Distance per time of day

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='pickup_timeofday',y='trip_distance',data=sample)
p.set(title="distance frequency for pickup time of day")

As seen above also, distances being the longest during late night or it maybe called as early morning too.

This can probably point to outstation trips where people start early for the day.

#### Distance per month

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='pickup_month',y='trip_distance',data=sample)
p.set(title="distance frequency for pickup month")

As we also saw during trip duration per month, similarly trip distance is same beacuse here my datails using only two months

#### price_per_mile by pickup day

In [None]:
plt.figure(figsize=(10,6))
p= sns.lineplot(x='pickup_day',y='price_per_mile',data=sample)
p.set(title="fare per mile for pickup day")


As we saw sundays are higher rate for mile.

In [None]:
plt.figure(figsize=(10,6));
p=sns.lineplot(x='pickup_hour',y='price_per_mile',data=sample)
p.set(title="fare per mile for pickup hour")


As we saw 5 am is peak hour paying by highest per mile

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='pickup_timeofday',y='price_per_mile',data=sample)
p.set(title="fare per mile for pickup time of day")

Latenight times are highest rate per mile

#### Passenger Count and Vendor id

In [None]:
plt.figure(figsize=(10,6))
p=sns.barplot(y='passenger_count',x='vendor_id',data=sample.compute())
p.set(title="vendor_id shared by passenger")

This shows that vendor VTS generally carries more passengers than vendor CMT  passenger rides.

#### Relationship between fareamount and tip amount 

In [None]:
#p=sns.lineplot(y='tip_amount',x='fare_amount',data=sample)
plt.figure(figsize=(10,6))
p=sns.scatterplot(x="fare_amount", y="tip_amount", data=sample);
p.set(title="tip amount accodring to fare amount")

We can see fare_amount is between 5 to 10 USD the tip mount share is greater.

#### Relationship between tip amount and passenger

In [None]:
plt.figure(figsize=(10,6))
p=sns.relplot( x="passenger_count",y="tip_amount", kind="line", data=sample);
p.set(title="tip amount accodring to passenger")

the tip amount is varrying by number of passengers.When 5 passengers are taking taxi paying more tip.

In [None]:
sample.columns

In [None]:
plt.figure(figsize=(10,6))
p=sns.lineplot(x='dropoff_timeofday',y='tip_amount',data=sample)
p.set(title="fare per mile for pickup time of day")

- High tip amount is often provided during late night.
- Morming ad evening hours show almost same trend in tip amount. 
- Afternoon tip amount is the least.

#### Relationshipb between Distance and tipamount

In [None]:
plt.figure(figsize=(10,6))
p=sns.scatterplot( x="trip_distance",y="tip_amount", data=sample);
p.set(title="tip amount according to distance")

- Higher trip distance does not yield to higher tip amount

In [None]:
#sns.catplot(y='distance',x='fare_amount',data=sample.compute(),kind="strip");

In [None]:
sample.columns[:-8]

In [None]:
# get rid fro now !! 
plt.figure(figsize=(14, 10))
sns.heatmap(sample.corr(),xticklabels=sample.columns[:-8],yticklabels=sample.columns[:-8],annot=True,fmt='.2f')
plt.suptitle('Pearson Correlation Heatmap')
plt.show();

In [None]:
pickup_locations = sample[["pickup_latitude", "pickup_longitude", "pickup_day","pickup_timeofday","dropoff_latitude", "dropoff_longitude", "dropoff_day","dropoff_timeofday"]].compute()
pickup_location_sample = pickup_locations.sample(frac=0.001)

In [None]:
map1 = folium.Map(location=[pickup_location_sample.pickup_latitude.mean(), pickup_location_sample.pickup_longitude.mean()], zoom_start=10, control_scale=True)

In [None]:
for index, location_info in pickup_location_sample.iterrows():
    folium.Marker([location_info["pickup_latitude"], location_info["pickup_longitude"]], popup=location_info[["pickup_day","pickup_timeofday"]],icon=folium.Icon(color='red')).add_to(map1)
for index, location_info in pickup_location_sample.iterrows():
    folium.Marker([location_info["dropoff_latitude"], location_info["dropoff_longitude"]], popup=location_info[["dropoff_day","dropoff_timeofday"]],icon=folium.Icon(color='blue')).add_to(map1)

In [None]:
map1

In [None]:
pickup_location_sample1 = pickup_locations.sample(frac=0.01)

In [None]:
#create a map
this_map = folium.Map(location=[pickup_location_sample.pickup_latitude.mean(), pickup_location_sample.pickup_longitude.mean()], zoom_start=10, control_scale=True)

# List comprehension to make out list of lists
heat_data = [[row['pickup_latitude'],row['pickup_longitude']] for index, row in pickup_location_sample1.iterrows()]

# Plot it on the map
HeatMap(heat_data,
             color='blue',
             fill_color='#FD8A6C').add_to(this_map)


#Set the zoom to the maximum possible
#this_map.fit_bounds(this_map.get_bounds())
    
this_map  

In [None]:
#create a map
this_map1 = folium.Map(location=[pickup_location_sample.pickup_latitude.mean(), pickup_location_sample.pickup_longitude.mean()], zoom_start=10, control_scale=True)

# List comprehension to make out list of lists
heat_data = [[row['dropoff_latitude'],row['dropoff_longitude']] for index, row in pickup_location_sample1.iterrows()]

# Plot it on the map
HeatMap(heat_data,
             color='red',
             fill_color='#FD8A6C').add_to(this_map1)

#Set the zoom to the maximum possible
#this_map.fit_bounds(this_map.get_bounds())
    
this_map1  

location analysis based on fare and tip amount 

In [None]:
sample.columns


In [None]:
df_=sample[["tip_amount","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","trip_distance","fare_amount"]].compute().nlargest(20,["tip_amount","fare_amount"])
type(df_)
df_


In [None]:
# -*- coding: utf-8 -*-
'''Test of Folium polylines'''

import folium

# Coordinates are 15 points on the great circle from Boston to 
# San Francisco.
# Reference: http://williams.best.vwh.net/avform.htm#Intermediate
#
coordinates = []
for i in range(len(df_)):
    coordinates.append(
        [[df_.iloc[i].pickup_latitude, df_.iloc[i].pickup_longitude],
        [df_.iloc[i].dropoff_latitude,df_.iloc[i].dropoff_longitude]]
        )


m = folium.Map(location=[40.715271,-74.009209], zoom_start=12)
folium.PolyLine(coordinates, line_color='#FF0000', line_weight=5).add_to(m)

for index, location_info in df_.iterrows():
    folium.Marker([location_info["pickup_latitude"], location_info["pickup_longitude"]], popup=location_info[["tip_amount","fare_amount","trip_distance"]],icon=folium.Icon(color='green')).add_to(m)
for index, location_info in df_.iterrows():
    folium.Marker([location_info["dropoff_latitude"], location_info["dropoff_longitude"]], popup=location_info[["tip_amount","fare_amount","trip_distance"]],icon=folium.Icon(color='red')).add_to(m)

m

In [None]:
coordinates = [
    [df_.iloc[0].pickup_latitude, df_.iloc[0].pickup_longitude],
    [df_.iloc[0].dropoff_latitude,df_.iloc[0].dropoff_longitude]
    ]
coordinates

In [None]:
df_.iloc[0]['pickup_longitude']

In [None]:
df_.pickup_latitude

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError


### Conclusion

- The day with most pickups is Thursday, Friday and Wednesday
- The hour with the most rides seems to be between 6pm and 10pm
- Sundays people tend to travel longer distances
- Journeys with one or two passengers ten to tip more
- journeys to and from the airport seem to give higher tips
- Also trip late at night tend to give higher tips
- Park Avenue, Theatre district and New york Penn Station and good locations for a taxi to be to get a customer