# NY taxi data

This notebook practices several map visualization tools on the given NY taxi dataset.

![TaxiUrll](https://media.giphy.com/media/3ohfFH6Evbu6ElsiIw/giphy.gif "taxi")

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import datetime

sns.set()
%matplotlib inline


In [None]:
dataset = pd.read_csv('taxi.csv')

In [None]:
dataset = pd.read_csv('taxi.csv', usecols = np.arange(21))  # Read only the first 21 columns so they line up with their headers
dataset.head()

In [None]:
dataset.dtypes

"lpep_pickup_datetime" and "Lpep_dropoff_datetime" were read as plain text; they should be datetime columns, so let's parse them while reading the file.

In [None]:
dataset = pd.read_csv('taxi.csv', usecols = np.arange(21), parse_dates = ['lpep_pickup_datetime','Lpep_dropoff_datetime']) 

In [None]:
dataset.dtypes
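
The effect of `parse_dates` can be checked on a small sample. This is a minimal sketch that uses a tiny in-memory CSV instead of `taxi.csv`; the two rows are made up for illustration, and only the column names come from the dataset.

```python
import io
import pandas as pd

# Two made-up rows that reuse the datetime column names of the taxi file.
csv_text = (
    "VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime\n"
    "2,2015-03-01 00:02:00,2015-03-01 00:10:00\n"
    "1,2015-03-01 00:05:00,2015-03-01 00:31:00\n"
)
date_cols = ["lpep_pickup_datetime", "Lpep_dropoff_datetime"]

# Option 1: parse the columns while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

# Option 2: read them as text first, then convert explicitly.
raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.copy()
for col in date_cols:
    converted[col] = pd.to_datetime(converted[col])

print(raw.dtypes)     # datetime columns are still text here
print(parsed.dtypes)  # datetime columns are datetime64 here
```

Both options should end up with the same dtypes; parsing at read time simply saves a step.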

In [None]:
dataset.describe().T

## Let's look at the unique values in each column

In [None]:
for col in dataset:
    print(col)  # column name, so each list of values can be identified
    print(dataset[col].unique())
    print("------------")

"Ehail_fee" contains only NaN values.
The dataset contains records from 2 vendors.

In [None]:
dataset = dataset.drop(['Ehail_fee'], axis = 1) # It contains no useful data
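
Columns that are entirely empty can also be found without reading through the printed values. The sketch below assumes a small made-up frame (the `Fare_amount` values are invented) and shows one way to list all-NaN columns and count distinct values per column.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column is entirely missing, like "Ehail_fee" in the taxi data.
df = pd.DataFrame({
    "VendorID": [1, 2, 2, 1],
    "Ehail_fee": [np.nan, np.nan, np.nan, np.nan],
    "Fare_amount": [7.5, np.nan, 12.0, 5.0],
})

# Names of the columns where every value is NaN.
empty_cols = df.columns[df.isna().all()].tolist()

# Number of distinct non-null values per column.
distinct = df.nunique()

print(empty_cols)
print(distinct)
```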

Now that we have our dataset, let's check the [data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to get a better understanding of it. I'll rename some columns to clearer names and replace some of the coded values in the dataframe with readable labels.

In [None]:
dataset = dataset.rename(columns={'VendorID': 'Provider',
                                  'lpep_pickup_datetime': 'Date_Engaged',
                                  'Lpep_dropoff_datetime': 'Date_Disengaged',
                                  'Store_and_fwd_flag':'Store_and_forward',
                                  'RateCodeID':'Final_Rate',
                                  'Tip_amount':'Credit_card_tips'})

In [None]:
dataset = dataset.drop(['Trip_type '], axis = 1) # Dropped because we don't have more information about this column (note the trailing space in its name)

In [None]:
dataset['Provider'] = dataset['Provider'].replace([1,2],['Creative Mobile Technologies','VeriFone Inc.'])
dataset['Store_and_forward'] = dataset['Store_and_forward'].replace(['Y','N'],['Yes','No'])
dataset['Final_Rate'] = dataset['Final_Rate'].replace([1,2,3,4,5,6],['Standard rate','JFK','Newark','Nassau or Westchester','Negotiated fare','Group ride'])
dataset['Payment_type'] = dataset['Payment_type'].replace([1,2,3,4,5,6],['Credit card','Cash', 'No charge','Dispute','Unknown','Voided trip'])
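
One thing to keep in mind: `replace` leaves any code that is not in the list untouched, so unexpected codes would stay in the column as numbers. A possible alternative, sketched here with a made-up series, is a dictionary plus `map`, which makes unlisted codes easy to spot. The code `9` and the label `"Unmapped code"` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Same labels as in the cell above, written as a lookup table.
payment_labels = {
    1: "Credit card", 2: "Cash", 3: "No charge",
    4: "Dispute", 5: "Unknown", 6: "Voided trip",
}

# Made-up codes; 9 is a hypothetical value that the lookup table does not cover.
codes = pd.Series([1, 2, 2, 9, 4], name="Payment_type")

# map() returns NaN for codes missing from the dictionary.
labels = codes.map(payment_labels)

# Collect the codes that had no label before filling them in.
unmapped = codes[labels.isna()].unique().tolist()
labels = labels.fillna("Unmapped code")

print(unmapped)
print(labels.tolist())
```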

Let's see where dates begin and end.

In [None]:
print(dataset.Date_Engaged.min())
print(dataset.Date_Disengaged.max())

Our analysis covers a one-month span between March and April 2015.

# EDA

### First let's visualize the trip counts by Provider, Final Rate and Payment Type.

In [None]:
Final_Rate = pd.DataFrame(dataset['Final_Rate'].value_counts()).reset_index()
Final_Rate.columns = ['Final_Rate','Count']
fig = px.bar(Final_Rate, x = 'Final_Rate', y = 'Count', color = 'Final_Rate', title='Final Rate vs Count')        
fig.show()

In [None]:
Provider = pd.DataFrame(dataset['Provider'].value_counts()).reset_index()
Provider.columns = ['Provider','Count']
fig = px.bar(Provider, x = 'Provider', y = 'Count', color = 'Provider', title='Provider vs Count')        
fig.show()

In [None]:
Payment_type = pd.DataFrame(dataset['Payment_type'].value_counts()).reset_index()
Payment_type.columns = ['Payment_type','Count']
fig = px.pie(Payment_type, values='Count', names='Payment_type', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Now we have some useful information:
- Most trips were paid in cash or by credit card.
- Most trip records were provided by VeriFone Inc.
- Most trips were charged at the Standard rate.
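
Statements like "most trips" can be backed with shares instead of being read off the charts. A minimal sketch with a made-up payment column (the eight values are invented):

```python
import pandas as pd

# Made-up payment labels, only to show the mechanics.
payments = pd.Series(
    ["Cash", "Credit card", "Cash", "Cash",
     "No charge", "Credit card", "Cash", "Credit card"],
    name="Payment_type",
)

# normalize=True returns fractions instead of raw counts, sorted from largest to smallest.
shares = payments.value_counts(normalize=True)

# Combined share of the two most common payment types.
top_two = shares.iloc[:2].sum()

print(shares)
print(top_two)
```

The same call on `dataset['Provider']` or `dataset['Final_Rate']` would give the shares behind the other two bullets.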

### Working with the date column

Let's see how many trips were made at each hour.

In [None]:
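# Round both timestamps to the nearest hour (this overwrites the original values)
dataset['Date_Engaged'] = dataset['Date_Engaged'].dt.round('h')
dataset['Date_Disengaged'] = dataset['Date_Disengaged'].dt.round('h')
# Count the rows that fall into each rounded pickup hour
dataset.groupby(dataset['Date_Engaged'].dt.round('h')).count().head()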

In [None]:
# Number of trips per hour (count of non-null 'Provider' values in each hourly group)
Hourly_count = dataset.groupby(dataset['Date_Engaged'].dt.round('h')).count()['Provider'].to_frame()
fig = go.Figure([go.Scatter(x = Hourly_count.index.to_series(), y=Hourly_count['Provider'])])
fig.show()

- The fewest trips were ordered between 3 and 6 am every day.
- We can see some peaks... Those are weekends!
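
Two details behind these observations can be checked directly: `round` moves a timestamp to the nearest hour, while `floor` keeps it in its clock hour, and the weekday of each timestamp is available through `dt.day_name()`. A small sketch with made-up pickup times:

```python
import pandas as pd

# Made-up pickup times; the weekday comments follow the 2015 calendar.
pickups = pd.Series(pd.to_datetime([
    "2015-03-06 22:10",  # Friday
    "2015-03-06 22:40",  # Friday
    "2015-03-07 23:05",  # Saturday
    "2015-03-08 01:20",  # Sunday
    "2015-03-10 04:15",  # Tuesday
]))

rounded = pickups.dt.round("h")  # nearest hour: 22:40 becomes 23:00
floored = pickups.dt.floor("h")  # start of the clock hour: 22:40 becomes 22:00

# Weekday name and a weekend flag (Monday is 0, so Saturday and Sunday are 5 and 6).
day_names = pickups.dt.day_name()
is_weekend = pickups.dt.dayofweek >= 5

print(pd.DataFrame({"rounded": rounded, "floored": floored,
                    "day": day_names, "weekend": is_weekend}))
```

Grouping the hourly counts by a flag like `is_weekend` would be one way to confirm whether the peaks really fall on weekends.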

### Let's see where the most pickups take place

In [None]:
dataset2 = dataset.sample(frac = 0.002)
dataset2.head()

In [None]:
import folium

# Base map centred on New York City
map_osm = folium.Map(
    location = [40.75, -73.9],
    zoom_start = 10
)

# Add one red marker per sampled pickup
for _, row in dataset2.iterrows():
    folium.Marker(
        location=[row["Pickup_latitude"], row["Pickup_longitude"]],
        icon=folium.Icon(color='red')
    ).add_to(map_osm)
map_osm


Most pickups take place in Brooklyn, Central Park and on Roosevelt Avenue in Queens.

### Let's see where the most dropoffs take place

In [None]:
# Fresh base map for the dropoff locations
map_osm = folium.Map(
    location = [40.75, -73.9],
    zoom_start = 10
)

# Add one red marker per sampled dropoff
for _, row in dataset2.iterrows():
    folium.Marker(
        location=[row["Dropoff_latitude"], row["Dropoff_longitude"]],
        icon=folium.Icon(color='red')
    ).add_to(map_osm)
map_osm

There seems to be no clear destination for dropoffs.
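
Reading hotspots off a marker map is subjective, so a numeric cross-check can help. One possible approach, sketched with made-up coordinates, is to round latitude and longitude to two decimals (cells of about a kilometre) and count the points per cell.

```python
import pandas as pd

# Made-up dropoff coordinates around two locations.
trips = pd.DataFrame({
    "Dropoff_latitude":  [40.7512, 40.7508, 40.6931, 40.7489, 40.6929],
    "Dropoff_longitude": [-73.9012, -73.9008, -73.9871, -73.9031, -73.9868],
})

# Snap every point to a coarse grid cell.
cells = trips.round({"Dropoff_latitude": 2, "Dropoff_longitude": 2})

# Count the points in each cell, busiest cell first.
hotspots = (
    cells.groupby(["Dropoff_latitude", "Dropoff_longitude"])
    .size()
    .sort_values(ascending=False)
)

print(hotspots)
```

If the counts are spread evenly over many cells, that would support the impression that dropoffs have no dominant destination.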

### Let's see if there is any relationship between the fare amount and the pickup place

In [None]:
fig = px.density_mapbox(dataset, 
                        lat ='Pickup_latitude', 
                        lon ='Pickup_longitude', 
                        z = 'Fare_amount', 
                        color_continuous_scale  = 'solar',
                        radius = 1,
                        center = dict(lat=40.75, lon=-73.9), 
                        zoom = 10,
                        mapbox_style = "carto-darkmatter",
                        )
fig.update_layout(
    title='NYC Taxi Pickups vs Fare_amount',
    height=800,
    template="plotly_dark",
)

fig.show()
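
In a density map, `z` acts as a weight for each point, so an area with many cheap trips can look as intense as an area with a few expensive ones. To separate volume from price, the average fare per grid cell can be computed next to the trip count. This is a sketch with made-up values; the cell size (two decimals) and the column names `lat_cell`, `lon_cell`, `n_trips`, `total_fare` and `mean_fare` are assumptions for the illustration.

```python
import pandas as pd

# Made-up pickups: four cheap trips in one area, two pricier trips in another.
trips = pd.DataFrame({
    "Pickup_latitude":  [40.7512, 40.7508, 40.7489, 40.7493, 40.6931, 40.6929],
    "Pickup_longitude": [-73.9012, -73.9008, -73.9031, -73.8987, -73.9871, -73.9868],
    "Fare_amount":      [6.0, 8.0, 7.0, 7.0, 15.0, 11.0],
})

# Snap the pickups to a coarse grid.
trips["lat_cell"] = trips["Pickup_latitude"].round(2)
trips["lon_cell"] = trips["Pickup_longitude"].round(2)

# Trip count, summed fare and average fare for every cell.
per_cell = (
    trips.groupby(["lat_cell", "lon_cell"])["Fare_amount"]
    .agg(n_trips="count", total_fare="sum", mean_fare="mean")
    .reset_index()
)

busiest = per_cell.sort_values("n_trips").iloc[-1]
priciest = per_cell.sort_values("mean_fare").iloc[-1]

print(per_cell)
```

In this toy data the busiest cell has the larger summed fare but the lower average fare, which is the distinction a weighted density map cannot show.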

# Getting address from latitude / longitude

In [None]:
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import tqdm
# tqdm._tqdm_notebook is a deprecated private module; tqdm.notebook is the public one
from tqdm.notebook import tqdm as tqdm_notebook

In [None]:
dataset['geom'] = dataset['Pickup_latitude'].map(str) + ',' + dataset['Pickup_longitude'].map(str)
dataset['geom'][0]

In [None]:
dataset2 = dataset.copy()
dataset2 = dataset2.sample(frac = 0.005)

In [None]:
locator = Nominatim(user_agent='myGeocoder', timeout=10)
# The public Nominatim server allows at most one request per second
rgeocode = RateLimiter(locator.reverse, min_delay_seconds=1)

In [None]:
tqdm_notebook.pandas()
dataset2['address'] = dataset2['geom'].progress_apply(rgeocode)
dataset2.head()
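
Reverse geocoding is slow because every row triggers a network request. One way to cut the number of requests, sketched below, is to geocode each distinct coordinate string only once and map the result back to the rows. `fake_reverse` is a hypothetical stand-in for the rate-limited geocoder so that the sketch runs offline, and rounding to four decimals (roughly 10 m) is an assumption about how much precision is needed.

```python
import pandas as pd

calls = []

def fake_reverse(query):
    """Hypothetical stand-in for the geocoder; records each query it receives."""
    calls.append(query)
    return f"address for {query}"

# Made-up pickups: three of them are within a few metres of each other.
df = pd.DataFrame({
    "Pickup_latitude":  [40.75121, 40.75124, 40.69310, 40.75119],
    "Pickup_longitude": [-73.90118, -73.90121, -73.98709, -73.90123],
})

# Round first so that near-identical points share one lookup key.
df["geom"] = (
    df["Pickup_latitude"].round(4).map(str)
    + ","
    + df["Pickup_longitude"].round(4).map(str)
)

# Geocode every distinct key once, then map the results back onto all rows.
lookup = {key: fake_reverse(key) for key in df["geom"].unique()}
df["address"] = df["geom"].map(lookup)

print(len(df), "rows,", len(calls), "geocoder calls")
```

With the real geocoder, the dictionary comprehension would call the rate-limited function in place of `fake_reverse`.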

In [None]:
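import geopandas as gpd
# gpd.datasets was removed in geopandas 1.0; on newer versions the "nybb" file is available through the geodatasets package
gdf = gpd.read_file(gpd.datasets.get_path("nybb"))
gdf.head(20)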

## Working with hour

In [None]:
# Hourly pickup counts, computed once and then sliced by day
# 2015-03-03 is a Tuesday and 2015-03-06 is a Friday
hourly_counts = dataset.groupby(dataset['Date_Engaged'].dt.round('h')).count()['Provider']
Hour_count1 = hourly_counts.loc['2015-03-03'].to_frame()
Hour_count2 = hourly_counts.loc['2015-03-06'].to_frame()

Hour_count1 = Hour_count1.reset_index()
Hour_count1['time'] = [d.time() for d in Hour_count1['Date_Engaged']]
Hour_count1.columns = ['Date_Engaged1','2015-03-03','time']

Hour_count2 = Hour_count2.reset_index()
Hour_count2['time'] = [d.time() for d in Hour_count2['Date_Engaged']]
Hour_count2.columns = ['Date_Engaged2','2015-03-06','time']

In [None]:
Hour = pd.merge(Hour_count1,Hour_count2,on='time',how='outer')
Hour = Hour.drop(['Date_Engaged1','Date_Engaged2'], axis = 1)  # the positional axis argument no longer works in pandas 2.0+
Hour = Hour.set_index('time')
Hour.head()

In [None]:
fig = px.line(Hour, x=Hour.index, y=['2015-03-03','2015-03-06'], title='Hourly rides comparison')

fig.update_xaxes(rangeslider_visible=True)
fig.show()

We can see that on Friday (2015-03-06) there were many more rides at nighttime than on Tuesday (2015-03-03)!
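
A quick way to confirm which weekdays are being compared, and to build the same hour-by-day table in one step, is sketched below. The pickup times are made up; only the two dates come from the cells above.

```python
import pandas as pd

# Made-up pickup times on the two dates that are compared above.
pickups = pd.Series(pd.to_datetime([
    "2015-03-03 08:15", "2015-03-03 08:50", "2015-03-03 23:10",
    "2015-03-06 08:05", "2015-03-06 23:20", "2015-03-06 23:45", "2015-03-06 23:55",
]), name="Date_Engaged")

days = ["2015-03-03", "2015-03-06"]

# Weekday name of each compared date.
weekdays = {d: pd.Timestamp(d).day_name() for d in days}

# Keep only the two days, then count rides per (hour, date) pair.
selected = pickups[pickups.dt.strftime("%Y-%m-%d").isin(days)]
hourly = (
    selected.groupby([
        selected.dt.hour.rename("hour"),
        selected.dt.strftime("%Y-%m-%d").rename("date"),
    ])
    .size()
    .unstack("date", fill_value=0)  # rows: hour of day, columns: date
)

print(weekdays)
print(hourly)
```

A table shaped like `hourly` can be passed to `px.line` directly, with one line per date column.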