# <Center> Introduction to Data Visualisation & Exploration (COMS4060A)
# <Center> Assignment 2
### Group Members: 
Joshua Wacks - 2143116 <br>
Matthew Dacre - 2091295 <br>
Alex Vogt - 2152320 <br>
Sonia Bullah - 2107762

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
import sympy as sym
from sympy import sin, cos, pi
from math import radians
from sklearn.cluster import DBSCAN
import geopandas as gpd
%matplotlib inline
from shapely.geometry import Point

In [None]:
df = pd.read_csv('train.csv')
df.head()

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 km is an approximate Earth radius
    return km
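
As a quick sanity check on the formula, the sketch below re-implements it for scalar inputs using only the standard `math` module. The name `haversine_scalar` and the test coordinates are illustrative and not part of the assignment code. With the same 6367 km radius, one degree of latitude should come out at roughly 111 km.

```python
import math

EARTH_RADIUS_KM = 6367  # same approximate radius as haversine() above


def haversine_scalar(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two points on the same meridian, exactly one degree of latitude apart
one_degree_km = haversine_scalar(-74.0, 40.0, -74.0, 41.0)
print(round(one_degree_km, 1))
```

This figure is also where the divisor of about 111 km per degree, used later for the bounding boxes, comes from.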

## Q2: Feature generation

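Trip distance, the day of the week, the hour of the day, and the trip's average speed are computed below. These features are generated before the Q1 cleaning step because the cleaning criteria depend on trip distance and average speed.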

In [None]:
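# Q2

# Parse the pickup timestamps once and reuse them
pickup_dt = pd.to_datetime(df.pickup_datetime)

# Great-circle distance between pickup and dropoff, in km
df['trip_distance'] = haversine(df.pickup_longitude, df.pickup_latitude, df.dropoff_longitude, df.dropoff_latitude)
df['day_of_week'] = pickup_dt.dt.day_name()
df['hour_of_day'] = pickup_dt.dt.hour
# trip_duration is in seconds, so km/s * 3600 gives km/h
df['average_speed'] = (df.trip_distance / df.trip_duration) * 3600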

df[['trip_distance', 'day_of_week', 'hour_of_day', 'average_speed']].head()

## Q1: Data Cleaning

A trip is removed if it meets any of the following criteria:

- The trip is 30 s long or less
- The trip is 12 hours long or more
    - Uber requires drivers to stop driving after a continuous 12-hour period
- The trip is 50 m in distance or less
- The trip is 500 km in distance or more
- The average speed of the trip is 150 km/h or more
- The average speed of the trip is 1 km/h or less
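
The criteria above can also be read as a single row-level test. The following is a minimal sketch of that test, assuming durations in seconds and distances in kilometres, with the boundaries treated as inclusive as in the cleaning cell. The function name `is_valid_trip` is hypothetical and is not used elsewhere in the notebook.

```python
def is_valid_trip(duration_s, distance_km):
    """Return True if a trip passes all six cleaning criteria."""
    # Duration: drop trips of 30 s or less, or 12 hours or more
    if duration_s <= 30 or duration_s >= 12 * 3600:
        return False
    # Distance: drop trips of 50 m or less, or 500 km or more
    if distance_km <= 0.05 or distance_km >= 500:
        return False
    # Average speed in km/h: keep only trips strictly between 1 and 150
    speed_kmh = distance_km / duration_s * 3600
    return 1 < speed_kmh < 150


print(is_valid_trip(600, 3.0))    # 10-minute, 3 km trip at 18 km/h
print(is_valid_trip(20, 3.0))     # too short in time
print(is_valid_trip(600, 30.0))   # 180 km/h, implausibly fast
print(is_valid_trip(3600, 0.5))   # 0.5 km/h, implausibly slow
```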

In [None]:
# Q1
print("The number of rows before removal", len(df))

# Trip less than 30 seconds
df.drop(df[df.trip_duration <= 30].index, inplace=True)

# Trip more than 12 hours
# https://www.uber.com/en-ZA/blog/driving-hours-limit/
df.drop(df[df.trip_duration >= 43200].index, inplace=True)

# trip less than 50m
df.drop(df[(df['trip_distance'] <= 0.05)].index, inplace=True)

# trip more than 500kms
df.drop(df[(df['trip_distance'] >= 500)].index, inplace=True)

# Average speed more than 150km/h
df.drop(df[(df['average_speed'] >= 150)].index, inplace=True)

# Average speed <= 1km/h
df.drop(df[(df['average_speed'] <= 1)].index, inplace=True)

print("The number of rows after removal", len(df))

## Q3.1: Most popular weekdays

As shown in the plot below, the most popular weekday for trips is Friday.

In [None]:
# Q3.1
# Most popular weekdays

df['day_of_week'].value_counts().plot(kind='bar', figsize=(15, 10), title="Most popular day of the week for trips")

## Q3.2: Most popular hours of the day

As shown in the plots below, the most popular hour for a trip on a weekday is either 6pm or 7pm, with the exception of Thursday. During the week, the most likely use for a taxi ride is to go home after work, which ends at around 5-6pm. Thursday may be an outlier because it is a popular day for drinks specials. On the weekend, the most popular times shift much later, to 11pm or 12am. The most likely use for a taxi on these days is to leave a restaurant or bar, where people would not drive after drinking alcohol.

In [None]:
# Q3.2
# Most popular time of day (in 24hr format)

fig, axs = plt.subplots(7, 1, figsize=(15, 20))
ax = axs.ravel()

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# One bar chart per day, each on its own axis
for i, day in enumerate(days):
    df.hour_of_day[df['day_of_week'] == day].value_counts().plot(kind='bar', ax=ax[i], title=day)

fig.suptitle("Most popular trip times for each day")
fig.tight_layout()
fig.subplots_adjust(top=0.93)


plt.show()

## Q3.3: Differences between weekdays and weekends

As can be seen in the plots below, a trip is much more likely to be ordered late at night or early in the morning on the weekend than on a weekday. This could be because people are more likely to stay up later on the weekend, when they do not have to wake up early for work the next day. Another difference is that there are many more trips at 7am-9am during the week than on the weekend, because commuters travel to work during the week but not on the weekend. On the weekend, there are many more trips in the early afternoon, at 12pm-2pm, than during the week.

In [None]:
# Most popular times on the weekend

fig, axs = plt.subplots(1, 2, figsize=(25, 10))

ax = axs.ravel()

df['hour_of_day'].loc[~df.day_of_week.isin(['Saturday', 'Sunday'])].value_counts().plot(kind='bar', ax=ax[0], title="Most popular trip times on the weekday")
df['hour_of_day'].loc[df.day_of_week.isin(['Saturday', 'Sunday'])].value_counts().plot(kind='bar', ax=ax[1], title="Most popular trip times on the weekend")

fig.tight_layout()
plt.show()

## Q3.4: Holiday Days

Holidays were only considered for 2016, as this is the only year present in the data set. While the distributions have very similar shapes, the most popular trip times differ for some of them. Holidays on which alcohol is likely to be consumed have more trips late at night and early in the morning. Holidays on which people are more likely to visit family, such as Memorial Day, have the majority of their trips in the afternoon. The distribution for Valentine's Day is very similar to that of a normal day, likely because it is not a public holiday in the USA.
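
Two of the holidays used here move from year to year: Martin Luther King Day is the third Monday of January and Memorial Day is the last Monday of May. The sketch below, using only the standard library, shows one way their 2016 dates could be derived rather than typed in by hand. The helper names `nth_weekday` and `last_weekday` are illustrative. Easter depends on a lunar calculation and is left hard-coded, as are the fixed-date holidays.

```python
from datetime import date, timedelta


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def last_weekday(year, month, weekday):
    """Date of the last given weekday (Monday=0) in a month."""
    # Step back from the first day of the following month
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    last = next_month - timedelta(days=1)
    return last - timedelta(days=(last.weekday() - weekday) % 7)


mlk = nth_weekday(2016, 1, 0, 3)      # third Monday of January
memorial = last_weekday(2016, 5, 0)   # last Monday of May
print(mlk, memorial)
```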

In [None]:
# Extracting data
# All trips are from the year 2016, so only dates for that year are considered

patrick_day, patrick_month = 17, 3
easter_day, easter_month = 27, 3
memorial_day, memorial_month = 30, 5
valentine_day, valentine_month = 14, 2
mlk_day, mlk_month = 18, 1

# Parse the pickup timestamps once and reuse them for every holiday filter
pickup_dt = pd.to_datetime(df['pickup_datetime'])

patrick_df = df[(pickup_dt.dt.day == patrick_day) & (pickup_dt.dt.month == patrick_month)]
easter_df = df[(pickup_dt.dt.day == easter_day) & (pickup_dt.dt.month == easter_month)]
memorial_df = df[(pickup_dt.dt.day == memorial_day) & (pickup_dt.dt.month == memorial_month)]
valentine_df = df[(pickup_dt.dt.day == valentine_day) & (pickup_dt.dt.month == valentine_month)]
mlk_df = df[(pickup_dt.dt.day == mlk_day) & (pickup_dt.dt.month == mlk_month)]

In [None]:
# Plotting for each of the holidays

fig, axs = plt.subplots(5, 1, figsize=(20, 15))

ax = axs.ravel()

# St Patricks Day

patrick_df['hour_of_day'].value_counts().plot(kind='bar', ax=ax[0], title="St Patrick's Day")
easter_df['hour_of_day'].value_counts().plot(kind='bar', ax=ax[1], title="Easter")
memorial_df['hour_of_day'].value_counts().plot(kind='bar', ax=ax[2], title="Memorial Day")
valentine_df['hour_of_day'].value_counts().plot(kind='bar', ax=ax[3], title="Valentines Day")
mlk_df['hour_of_day'].value_counts().plot(kind='bar', ax=ax[4], title="Martin Luther King Day")

fig.suptitle("Most popular trip times")
fig.tight_layout()
fig.subplots_adjust(top=0.93)


plt.show()

## Q3.5: Average Speed

As shown in the plot below, the fastest trips were completed in the early morning, when there are likely to be far fewer cars on the road. The slowest average speeds all occurred at times when there would be many cars on the road, as commuters travel to and from work.

In [None]:
# Average speed for each hour of the day

df.groupby(['hour_of_day'])['average_speed'].mean().sort_values(ascending=False).plot(kind='bar', figsize=(20,10), title="Average Speed per hour of day")

## Q4

The cell below produces a heatmap of pickup locations for trips on weekdays and on the weekend. On weekdays, the pickup locations appear to be more spread out: there are more pickups further away from the most popular pickup locations, which appear to be around Brooklyn and The Bronx. These are likely people who live further out, where living expenses are lower, coming into the city to work.

In [None]:
#1(a, b)

df_heat_weekend = df[['pickup_latitude', 'pickup_longitude']].loc[df['day_of_week'].isin(['Saturday', 'Sunday'])]
df_heat_weekday = df[['pickup_latitude', 'pickup_longitude']].loc[~df['day_of_week'].isin(['Saturday', 'Sunday'])]

NYCmap = folium.Map(location=[40.716662,-74.009899], tiles= "Stamen Terrain")
HeatMap(df_heat_weekend, name= 'Weekend').add_to(NYCmap)
HeatMap(df_heat_weekday,name= 'Weekday').add_to(NYCmap)
folium.LayerControl(collapsed=False).add_to(NYCmap)
NYCmap

The cell below produces a heatmap of pickup locations for trips in the morning and in the evening. In the morning, there appears to be a central hotspot that is less spread out than for evening pickups, as well as a hotspot in the north around Poughkeepsie. The evening central hotspot appears more spread out, with more pickups further south in Philadelphia and fewer in the north.

In [None]:
#2(a, b)

df_heat_morning = df[['pickup_latitude', 'pickup_longitude']].loc[(df['hour_of_day'] >= 4) & (df['hour_of_day'] <= 11)]
df_heat_evening = df[['pickup_latitude', 'pickup_longitude']].loc[(df['hour_of_day'] >= 16) & (df['hour_of_day'] <= 23)]
NYCmap2 = folium.Map(location=[40.716662,-74.009899] , tiles= "Stamen Terrain")
HeatMap(df_heat_morning, name = 'Morning').add_to(NYCmap2)
HeatMap(df_heat_evening,name = 'Evening').add_to(NYCmap2)
folium.LayerControl(collapsed=False).add_to(NYCmap2)
NYCmap2

## Q4.2

A distance of 100 meters is used. This ensures that any inaccuracies in the GPS data do not significantly affect the clustering.
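
Because scikit-learn's haversine metric works on angles in radians, the 100 m distance has to be converted into an angle before it can be passed to DBSCAN as `eps`. The sketch below shows that conversion on its own, using the same Earth radius as the cell that follows. The constant and variable names are illustrative.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km
eps_km = 0.1                # 100 m neighbourhood radius

# Arc length = radius * angle, so angle = arc length / radius
eps_radians = eps_km / KMS_PER_RADIAN
eps_degrees = math.degrees(eps_radians)

print(f"eps = {eps_radians:.3e} rad")
print(f"eps = {eps_degrees:.6f} degrees of arc")
```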

In [None]:
# Between 23:00 on Friday and 02:59 on Saturday, and 17:00-20:59 on a Thursday
dfHotspot = df[((df['day_of_week']=='Friday') & (df['hour_of_day']>= 23)) | ((df['day_of_week']=='Saturday') & (df['hour_of_day']<= 2)) | ((df['day_of_week']=='Thursday') & ((df['hour_of_day']>= 17)&(df['hour_of_day']<=20 )))].copy()

kms_per_radian = 6371.0088

print("Average trip distance on hotspot nights",dfHotspot['trip_distance'].mean())
print("Max trip distance on hotspot nights",dfHotspot['trip_distance'].max())

# The haversine metric works in radians, so convert 100 m (0.1 km) to radians
epsilon = 0.1 / kms_per_radian

# scikit-learn's haversine metric expects [latitude, longitude] pairs in radians
coords = np.radians(dfHotspot[['pickup_latitude', 'pickup_longitude']])
dbscan_pick = DBSCAN(eps=epsilon, min_samples=500, algorithm='ball_tree', metric='haversine').fit(coords)
labels_pick = dbscan_pick.labels_

dfHotspot['Cluster'] = labels_pick

dfHotspot['Cluster'].head(20)

As shown in the map below, 14 clusters were identified.
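
DBSCAN marks noise points with the label `-1`, so the number of clusters is the number of distinct labels once noise is excluded. The following is a small sketch of that count on a made-up list of labels. The values are hypothetical and are not taken from the assignment data.

```python
from collections import Counter

# Hypothetical labels in the form DBSCAN returns them (-1 marks noise)
labels = [0, 0, 1, -1, 2, 1, -1, 0, 2, 2]

# Count the members of each cluster, ignoring noise
cluster_sizes = Counter(label for label in labels if label != -1)
n_clusters = len(cluster_sizes)
n_noise = sum(1 for label in labels if label == -1)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("sizes:", dict(cluster_sizes))
```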

In [None]:
noise_count = dfHotspot[(dfHotspot['Cluster']==-1)].count()
print("Number of noise entries:", noise_count['Cluster'])
dfHotspot.drop(dfHotspot[dfHotspot['Cluster'] == -1].index, inplace=True)
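# create_map is a helper that is not defined in this notebook; it is assumed to
# return a folium map with the points coloured by the given column
hotspotMap = create_map(dfHotspot, 'Cluster')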

hotspotMap

## 5. Airports

We need to find out how long it takes, on average, to travel from the Empire State Building to JFK airport.
We can assume that the coordinates for the centre point of the locations are:
- Empire State Building: (40.756724, -73.983806)
- JFK Airport: (40.647929, -73.777813)

Since the points above are centre point locations, we can use a reasonable radius of 5 kilometers around these locations to create a bounding box when determining if a GPS coordinate is at that location. This is a reasonable radius because people will be dropped off or picked up within the region of the two specified locations, but not necessarily at the exact locations themselves.
The upper and lower bounds of these points can be calculated as follows:

$ \text{minimum latitude} = \text{latitude} - \frac{\text{radius}}{111} $ <br>
$ \text{maximum latitude} = \text{latitude} + \frac{\text{radius}}{111} $ <br>

$ \text{minimum longitude} = \text{longitude} - \frac{\text{radius}}{\cos(\text{latitude}) \times 111.32} $ <br>
$ \text{maximum longitude} = \text{longitude} + \frac{\text{radius}}{\cos(\text{latitude}) \times 111.32} $

The aim is to now find points within the given search radius of the two locations and use these points only as the pickup and dropoff locations.
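
The four formulas above can be wrapped in one small function. The following is a minimal sketch using only the standard `math` module. The name `bounding_box` is hypothetical and is not used in the cells that follow, which compute the same bounds inline.

```python
import math


def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) for a box around a point."""
    # One degree of latitude is taken as 111 km
    dlat = radius_km / 111
    # One degree of longitude shrinks with the cosine of the latitude
    dlon = radius_km / (math.cos(math.radians(lat)) * 111.32)
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


# Empire State Building coordinates as assumed above, with a 5 km radius
min_lat, max_lat, min_lon, max_lon = bounding_box(40.756724, -73.983806, 5)
print("latitude: ", min_lat, max_lat)
print("longitude:", min_lon, max_lon)
```

The same call with the JFK or Newark coordinates gives the dropoff bounds.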

In [None]:
# Create the upper and lower bounds of the relevant pickup and dropoff points:

# Empire State Building:
long1 = -73.983806
lat1 = 40.756724

max_long_pickup = long1 + (5/(cos(radians(lat1)) * 111.32))
min_long_pickup = long1 - (5/(cos(radians(lat1)) * 111.32))

max_lat_pickup = lat1 + (5/111)
min_lat_pickup = lat1 - (5/111)

print("pickup: minimum longitude is", min_long_pickup)
print("pickup: maximum longitude is", max_long_pickup, "\n")

print("pickup: minimum latitude is", min_lat_pickup)
print("pickup: maximum latitude is", max_lat_pickup, "\n")


# JFK Airport:
long2 = -73.777813
lat2 = 40.647929

max_long_dropoff = long2 + (5/(cos(radians(lat2)) * 111.32))
min_long_dropoff = long2 - (5/(cos(radians(lat2)) * 111.32))

max_lat_dropoff = lat2 + (5/111)
min_lat_dropoff = lat2 - (5/111)

print("dropoff: minimum longitude is", min_long_dropoff)
print("dropoff: maximum longitude is", max_long_dropoff, "\n")

print("dropoff: minimum latitude is", min_lat_dropoff)
print("dropoff: maximum latitude is", max_lat_dropoff)


In [None]:
# Create a new dataframe that only contains the relevant information:
locations = df[['pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'trip_duration']]

locations.head()

In [None]:
# Adjust the dataframe to contain only coordinates within the given radius of the two points.
# The bounds computed above are used directly; float() guards against sympy number types.
locations = locations[locations['pickup_longitude'].between(float(min_long_pickup), float(max_long_pickup))]
locations = locations[locations['pickup_latitude'].between(float(min_lat_pickup), float(max_lat_pickup))]


locations = locations[locations['dropoff_longitude'].between(float(min_long_dropoff), float(max_long_dropoff))]
locations = locations[locations['dropoff_latitude'].between(float(min_lat_dropoff), float(max_lat_dropoff))].copy()

locations.head()

Based on the two dataframes above, it can be seen that the number of pickup and dropoff locations has decreased considerably.

We can now use the information from the dataframe above to find out the average time taken to travel from the Empire State Building to JFK airport. This can be done as follows:

In [None]:
average_time = locations["trip_duration"].mean()
print("Average time in seconds of trips:", average_time)

Therefore, on average, it takes 2 788.21 seconds (or 46.47 minutes) to travel from the Empire State Building to JFK airport.

Lastly, we can analyse the travel time by time of day.

In [None]:
# Create new column containing only the time of day:
locations['time_of_day_pickup'] = pd.to_datetime(locations['pickup_datetime']).dt.time
locations['time_of_day_dropoff'] = pd.to_datetime(locations['dropoff_datetime']).dt.time
locations.head()

In [None]:
# Plotting:
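# Sort by pickup time so that the line is drawn in chronological order
_ = locations.sort_values('time_of_day_pickup').plot.line(x='time_of_day_pickup', y='trip_duration', figsize=(20,10), title="Travel Time By Time of Day: Empire State Building to JFK Airport")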


The graph above indicates that the trip duration from the Empire State Building to JFK airport spikes at several points during the day. These spikes occur during the morning, which could be a result of early-morning traffic as people are on their way to work, as well as at various times in the afternoon, which is generally when rush hour would be. However, the lowest trip durations occur late at night and in the very early hours of the morning, at around 03:00. The city and its roads are quieter at these times because most people are already at home, so there is little traffic and taxis are able to get around the city faster.
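
A per-trip line plot is noisy, so one way to make the pattern by time of day easier to read is to average the trip duration per pickup hour. The sketch below illustrates the idea with the standard library on a handful of made-up records. The data and variable names are hypothetical. In the notebook, the equivalent would be a `groupby` on the pickup hour, as done for average speed in Q3.5.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical (pickup_datetime, trip_duration in seconds) records
trips = [
    ("2016-03-14 03:10:00", 1500),
    ("2016-03-14 03:45:00", 1700),
    ("2016-03-14 08:05:00", 3200),
    ("2016-03-14 17:30:00", 3600),
    ("2016-03-15 17:50:00", 3000),
]

# Group the durations by the hour of the pickup time
durations_by_hour = defaultdict(list)
for pickup, duration in trips:
    hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
    durations_by_hour[hour].append(duration)

# Average duration per hour, ordered by hour of the day
mean_by_hour = {hour: mean(values) for hour, values in sorted(durations_by_hour.items())}
for hour, avg in mean_by_hour.items():
    print(f"{hour:02d}:00  {avg / 60:.1f} min")
```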

We can now compare this with Newark Airport.

- Empire State Building: (40.756724, -73.983806)
- Newark Airport: (40.689442, -74.173242)

The same steps followed above will be repeated.

In [None]:
# Create the upper and lower bounds of the relevant pickup and dropoff points:

# Empire State Building:
print("pickup: minimum longitude is", min_long_pickup)
print("pickup: maximum longitude is", max_long_pickup, "\n")

print("pickup: minimum latitude is", min_lat_pickup)
print("pickup: maximum latitude is", max_lat_pickup, "\n")


# Newark Airport:
long3 = -74.173242
lat3 = 40.689442

max_long_dropoff = long3 + (5/(cos(radians(lat3)) * 111.32))
min_long_dropoff = long3 - (5/(cos(radians(lat3)) * 111.32))

max_lat_dropoff = lat3 + (5/111)
min_lat_dropoff = lat3 - (5/111)

print("dropoff: minimum longitude is", min_long_dropoff)
print("dropoff: maximum longitude is", max_long_dropoff, "\n")

print("dropoff: minimum latitude is", min_lat_dropoff)
print("dropoff: maximum latitude is", max_lat_dropoff)

In [None]:
locations2 = df[['pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 'trip_duration']]

locations2.head()

In [None]:
# Adjust the dataframe to contain only coordinates within the given radius of the two points.
# The bounds computed above are used directly; float() guards against sympy number types.
locations2 = locations2[locations2['pickup_longitude'].between(float(min_long_pickup), float(max_long_pickup))]
locations2 = locations2[locations2['pickup_latitude'].between(float(min_lat_pickup), float(max_lat_pickup))]


locations2 = locations2[locations2['dropoff_longitude'].between(float(min_long_dropoff), float(max_long_dropoff))]
locations2 = locations2[locations2['dropoff_latitude'].between(float(min_lat_dropoff), float(max_lat_dropoff))].copy()

locations2.head()

In [None]:
average_time = locations2["trip_duration"].mean()
print("Average time in seconds:", average_time)

Therefore, on average, it takes 2 264.74 seconds (or 37.75 minutes) to travel from the Empire State Building to Newark airport.

In [None]:
# Create new column containing only the time of day:
locations2['time_of_day_pickup'] = pd.to_datetime(locations2['pickup_datetime']).dt.time
locations2['time_of_day_dropoff'] = pd.to_datetime(locations2['dropoff_datetime']).dt.time
locations2.head()

In [None]:
# Plotting:

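# Sort by pickup time so that the line is drawn in chronological order
line_plot = locations2.sort_values('time_of_day_pickup').plot.line(x='time_of_day_pickup', y='trip_duration', figsize=(20,10), title="Travel Time By Time of Day: Empire State Building to Newark Airport")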


Based on the line plot above, the trip duration from the Empire State Building to Newark airport tends to be at its lowest during the early hours of the morning, specifically between 05:00 and 06:00. This could be because there is hardly any traffic at this time of day, so trips are shorter. In contrast, the trip duration reaches a maximum at around 14:30, in the middle of the afternoon, when there is likely to be a significant amount of traffic and congestion in the city.

The information above indicates that the average travel time from the Empire State Building to JFK airport was 2 788.21 seconds (or 46.47 minutes), whereas the average travel time from the Empire State Building to Newark airport was 2 264.74 seconds (or 37.75 minutes). Newark airport is 10.9 miles away from the Empire State Building, while JFK airport is 12.9 miles away. It therefore makes sense that the average travel time to Newark airport is 8.72 minutes shorter than the average travel time to JFK airport.
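
The arithmetic behind this comparison can be checked in a few lines. The sketch below uses only the averages and distances quoted above. The implied average speeds are an extra derived figure, computed on the assumption that the quoted distances approximate the distance actually driven.

```python
# Average durations reported above, in seconds
jfk_s, newark_s = 2788.21, 2264.74
# Distances quoted above, in miles
jfk_miles, newark_miles = 12.9, 10.9

# Difference in average travel time, in minutes
diff_min = (jfk_s - newark_s) / 60
# Implied average speed = distance / time in hours
jfk_mph = jfk_miles / (jfk_s / 3600)
newark_mph = newark_miles / (newark_s / 3600)

print(f"difference: {diff_min:.2f} min")
print(f"implied speed to JFK: {jfk_mph:.1f} mph")
print(f"implied speed to Newark: {newark_mph:.1f} mph")
```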

### Question 6

In [None]:
nycShape = gpd.read_file('geo_export_d94b29ac-01af-4732-8181-b6e311d3cad7.shp')
nycShape.head()

In [None]:
# Build the point geometries in one vectorised call instead of looping over rows
pickup_points = gpd.points_from_xy(df['pickup_longitude'], df['pickup_latitude'])
dropoff_points = gpd.points_from_xy(df['dropoff_longitude'], df['dropoff_latitude'])

pickup_points_df = gpd.GeoDataFrame({'geometry': pickup_points},crs='EPSG:4326')
dropoff_points_df = gpd.GeoDataFrame({'geometry': dropoff_points},crs='EPSG:4326')

In [None]:
print("Unique borough names in the shapefile:", nycShape.boro_name.unique())

6.1 The neighbourhoods for the trip start and end locations are:\
Queens\
Brooklyn\
Bronx\
Staten Island\
Manhattan

In [None]:
boroughs = nycShape.dissolve(by = 'boro_name')
boroughs.head()

In [None]:
figsize = (20,11)
boroughs.plot(figsize = figsize)

In [None]:
innyc_pickups = pickup_points_df.sjoin(nycShape, how = 'left')
pickup_count = innyc_pickups.groupby('boro_name').count().sort_values('geometry',ascending = False)['geometry'] #Geometry was just the field where the count was placed into
print(pickup_count)

In [None]:
innyc_dropoffs = dropoff_points_df.sjoin(nycShape, how = 'left')
dropoff_count = innyc_dropoffs.groupby('boro_name').count().sort_values('geometry',ascending = False)['geometry'] #Geometry was just the field where the count was placed into
print(dropoff_count)

In [None]:
boroughs['dropoffs'] = dropoff_count
boroughs['pickups'] = pickup_count

## Q6.2

As shown in the plots below, Manhattan has a higher proportion of the dropoffs than of the pickups, which means that more people are traveling into Manhattan than traveling out of it. The less popular boroughs, Bronx and Staten Island, have more pickups than they do dropoffs.

In [None]:
# 6.2

fig, ax = plt.subplots(1, 1,figsize = figsize)
boroughs.plot('pickups', ax = ax, legend = True,
              legend_kwds = {'label': "Pickups by Boroughs", "orientation": 'horizontal'},
              )

plt.show()

As shown in the plot above, the most popular borough is Manhattan, with significantly more pickups than any other borough.

In [None]:
fig, ax = plt.subplots(1, 1,figsize = figsize)
boroughs.plot('dropoffs', ax = ax, legend = True,
              legend_kwds = {'label': "Dropoffs by Boroughs", "orientation": 'horizontal'}
              )

plt.show()

As shown in the plot above, the most popular borough is Manhattan, with significantly more dropoffs than any other borough.

Question 6.4

In [None]:
latenight_df = df[(df['hour_of_day'] > 0) & (df['hour_of_day'] < 5)]

latenight_df.head()

## Q6.4 & 6.5

The busiest borough late at night is Manhattan, while The Bronx and Staten Island are by far the least busy. This distribution matches the distributions when all the dropoffs/pickups are considered.

In [None]:

# Rebuild the pickup points for the late-night trips only
pickup_points = gpd.points_from_xy(latenight_df['pickup_longitude'], latenight_df['pickup_latitude'])
pickup_points_df = gpd.GeoDataFrame({'geometry': pickup_points},crs='EPSG:4326')

nyc_pickups = pickup_points_df.sjoin(nycShape, how = 'left')
# Count the late-night pickups per borough and attach them to the borough shapes
# (geometry is just the field where the count is placed)
boroughs['latenight_pickups'] = nyc_pickups.groupby('boro_name').count()['geometry']

fig, ax = plt.subplots(1, 1,figsize = figsize)
boroughs.plot('latenight_pickups', ax = ax, legend = True,
              legend_kwds = {'label': "Late-night pickups by Boroughs", "orientation": 'horizontal'}
              )

plt.show()