# Project 1: Analysis and Forecasting of NYC Taxi Rides

In [None]:
# Automatically reload modules in the 'src/' folder whenever they change,
# so edits are picked up in the notebook without restarting the kernel.
%reload_ext autoreload
%autoreload 2
%aimport src

### Task 1: Understanding the Data

Yellow and green taxi trip records include fields capturing pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The TLC did not create the trip data and makes no representations as to their accuracy.

In [66]:
import pandas as pd
from src.utils import load_data_from_google_drive

# Define the base URLs for the yellow and green taxi data
base_url_yellow = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month}.parquet"
base_url_green = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_{year}-{month}.parquet"
zones_url = 'https://drive.google.com/file/d/12VgjWXkyEBsxzuKFxIkAevEbao85ei0T/view?usp=sharing'

# Define the months and year you're interested in
months = ['01', '02']
year = '2022'

# Create empty lists to store the dataframes
df_yellow_list = []
df_green_list = []

# Loop over the months
for month in months:
    # Build each month's URL and download the yellow and green taxi data
    dfy = pd.read_parquet(base_url_yellow.format(year=year, month=month))
    dfg = pd.read_parquet(base_url_green.format(year=year, month=month))

    # Append the monthly dataframes to their lists
    df_yellow_list.append(dfy)
    df_green_list.append(dfg)

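# Concatenate the monthly dataframes into a single dataframe per taxi type.
# ignore_index=True avoids duplicate row labels across months.
df_yellow = pd.concat(df_yellow_list, ignore_index=True)
df_green = pd.concat(df_green_list, ignore_index=True)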

df_zones = load_data_from_google_drive(url=zones_url)

In [67]:
df_zones

Unnamed: 0,Shape_Leng,Shape_Area,zone,LocationID,borough,lat,lng
0,0.116357,0.000782,Newark Airport,1,EWR,40.689516,-74.176786
1,0.433470,0.004866,Jamaica Bay,2,Queens,40.625724,-73.826126
2,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,40.865888,-73.849479
3,0.043567,0.000112,Alphabet City,4,Manhattan,40.724152,-73.977023
4,0.092146,0.000498,Arden Heights,5,Staten Island,40.550340,-74.189930
...,...,...,...,...,...,...,...
258,0.126750,0.000395,Woodlawn/Wakefield,259,Bronx,40.899103,-73.856351
259,0.133514,0.000422,Woodside,260,Queens,40.746798,-73.903713
260,0.027120,0.000034,World Trade Center,261,Manhattan,40.708976,-74.012919
261,0.049064,0.000122,Yorkville East,262,Manhattan,40.776534,-73.945830


In [43]:
df_zones = pd.read_csv('taxi_zones.csv')

In [53]:
from pyproj import Transformer
transformer = Transformer.from_crs("EPSG:2263", "EPSG:4269")

In [54]:
df_zones

Unnamed: 0,X,Y,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,lat,lng
0,9.352230e+05,190535.052575,1,0.116357,0.000782,Newark Airport,1,EWR,1.711351,8.401251
1,1.032516e+06,167292.493195,2,0.433470,0.004866,Jamaica Bay,2,Queens,1.502642,9.275246
2,1.025883e+06,254779.600631,3,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,2.288116,9.215661
3,9.906188e+05,203105.532318,4,0.043567,0.000112,Alphabet City,4,Manhattan,1.824220,8.898880
4,9.314680e+05,139837.478389,5,0.092146,0.000498,Arden Heights,5,Staten Island,1.256081,8.367519
...,...,...,...,...,...,...,...,...,...,...
258,1.023962e+06,266878.034018,259,0.126750,0.000395,Woodlawn/Wakefield,259,Bronx,2.396707,9.198408
259,1.010930e+06,211369.883417,260,0.133514,0.000422,Woodside,260,Queens,1.898421,9.081336
260,9.806682e+05,197575.689861,261,0.027120,0.000034,World Trade Center,261,Manhattan,1.774569,8.809492
261,9.992531e+05,222193.818758,262,0.049064,0.000122,Yorkville East,262,Manhattan,1.995597,8.976443


In [59]:
# EPSG:4269 uses (latitude, longitude) axis order, so transform() returns latitude first.
# Transform all rows in one vectorized call instead of twice per row.
df_zones['lat'], df_zones['lng'] = transformer.transform(
    df_zones['X'].to_numpy(), df_zones['Y'].to_numpy())

In [61]:
df_zones = df_zones.drop(columns=['X', 'Y', 'OBJECTID'])

In [64]:
df_zones.to_csv('zones.csv', index=False)

In [65]:
pd.read_csv('zones.csv')

Unnamed: 0,Shape_Leng,Shape_Area,zone,LocationID,borough,lat,lng
0,0.116357,0.000782,Newark Airport,1,EWR,40.689516,-74.176786
1,0.433470,0.004866,Jamaica Bay,2,Queens,40.625724,-73.826126
2,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,40.865888,-73.849479
3,0.043567,0.000112,Alphabet City,4,Manhattan,40.724152,-73.977023
4,0.092146,0.000498,Arden Heights,5,Staten Island,40.550340,-74.189930
...,...,...,...,...,...,...,...
258,0.126750,0.000395,Woodlawn/Wakefield,259,Bronx,40.899103,-73.856351
259,0.133514,0.000422,Woodside,260,Queens,40.746798,-73.903713
260,0.027120,0.000034,World Trade Center,261,Manhattan,40.708976,-74.012919
261,0.049064,0.000122,Yorkville East,262,Manhattan,40.776534,-73.945830


In [None]:
from src.utils import create_scatterplot

create_scatterplot(
    df_yellow.sample(50),
    x_col='trip_distance',
    y_col='fare_amount',
    title='trip_distance vs fare_amount',
    xlabel='trip_distance',
    ylabel='fare_amount',
)

### Task 2: Exploratory Data Analysis
Conduct exploratory data analysis to understand the patterns and relationships in the data. This includes analyzing the distribution of trip distances, fares, and passenger counts, as well as the relationship between these variables.
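
As a possible starting point, the sketch below computes descriptive statistics and pairwise correlations for the three fields named above. It assumes the TLC column names `trip_distance`, `fare_amount`, and `passenger_count`; the helper name `summarize_trips`, the filtering thresholds, and the toy data are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

def summarize_trips(df):
    """Return (descriptive statistics, correlation matrix) for key trip fields."""
    cols = ["trip_distance", "fare_amount", "passenger_count"]
    # Drop records with non-positive distance or fare. These thresholds are
    # illustrative assumptions; tune them after inspecting the real data.
    clean = df.loc[(df["trip_distance"] > 0) & (df["fare_amount"] > 0), cols]
    stats = clean.describe(percentiles=[0.5, 0.95])
    corr = clean.corr()
    return stats, corr

# Tiny made-up sample so the sketch runs without downloading anything.
toy = pd.DataFrame({
    "trip_distance": [1.0, 2.5, 0.0, 10.0, 4.2],
    "fare_amount": [6.0, 11.0, 3.0, 35.0, -5.0],
    "passenger_count": [1, 2, 1, 3, 1],
})
stats, corr = summarize_trips(toy)
print(stats)
print(corr.round(2))
```

On the real data, the same call would be `summarize_trips(df_yellow)` or `summarize_trips(df_green)`; histograms of each column are a natural next step.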

In [None]:
# Write your code

### Task 3: Spatial Analysis
Use Kepler.gl or a similar tool to visualize the spatial patterns of taxi rides. This includes the pickup and dropoff locations, as well as the routes taken. Analyze the spatial patterns to identify hotspots of taxi demand.
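
One way to prepare data for a map is to count pickups per zone and attach the zone centroids from `df_zones`. The sketch below assumes the trip table has the TLC column `PULocationID` and the zone table has `LocationID`, `zone`, `borough`, `lat`, and `lng` as shown earlier; the helper name `pickup_hotspots` and the toy trips are hypothetical.

```python
import pandas as pd

def pickup_hotspots(trips, zones, top_n=10):
    """Count pickups per zone and join zone names and centroid coordinates."""
    counts = (
        trips.groupby("PULocationID")
        .size()
        .rename("pickups")
        .reset_index()
    )
    merged = counts.merge(
        zones, left_on="PULocationID", right_on="LocationID", how="inner"
    )
    ranked = merged.sort_values("pickups", ascending=False).head(top_n)
    return ranked[["zone", "borough", "lat", "lng", "pickups"]]

# Three zones copied from the zone table above, plus made-up trips.
zones = pd.DataFrame({
    "LocationID": [4, 260, 261],
    "zone": ["Alphabet City", "Woodside", "World Trade Center"],
    "borough": ["Manhattan", "Queens", "Manhattan"],
    "lat": [40.724152, 40.746798, 40.708976],
    "lng": [-73.977023, -73.903713, -74.012919],
})
trips = pd.DataFrame({"PULocationID": [4, 4, 261, 4, 260, 261]})

hot = pickup_hotspots(trips, zones)
print(hot)
```

The resulting table has one row per zone with coordinates and a count, which should be straightforward to export to CSV and load as a point layer in a mapping tool. The same pattern with `DOLocationID` would give drop-off hotspots.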

In [None]:
# Write your code

### Task 4: Temporal Analysis
Analyze the temporal patterns of taxi rides. This includes the number of rides by time of day, day of the week, and month of the year. Also, analyze the relationship between temporal patterns and other variables, such as trip distance and fare.
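
A minimal sketch of this kind of aggregation follows. It assumes the pickup timestamp column is `tpep_pickup_datetime` (yellow) or `lpep_pickup_datetime` (green) and is passed in by name; the helper names and the toy rows are hypothetical.

```python
import pandas as pd

def rides_by_weekday_and_hour(df, time_col):
    """Cross-tabulate ride counts by weekday name (rows) and hour of day (columns)."""
    ts = pd.to_datetime(df[time_col])
    return pd.crosstab(
        ts.dt.day_name(), ts.dt.hour, rownames=["weekday"], colnames=["hour"]
    )

def mean_by_hour(df, time_col, value_col):
    """Average a numeric column (for example fare or distance) per pickup hour."""
    hours = pd.to_datetime(df[time_col]).dt.hour.rename("hour")
    return df.groupby(hours)[value_col].mean()

# Made-up rides: three on Monday 2022-01-03 and one on Saturday 2022-01-08.
toy = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2022-01-03 08:15:00",
        "2022-01-03 08:45:00",
        "2022-01-03 17:30:00",
        "2022-01-08 23:05:00",
    ],
    "fare_amount": [10.0, 14.0, 30.0, 20.0],
})

table = rides_by_weekday_and_hour(toy, "tpep_pickup_datetime")
fare_by_hour = mean_by_hour(toy, "tpep_pickup_datetime", "fare_amount")
print(table)
print(fare_by_hour)
```

With only January and February loaded, a month-of-year comparison covers just two months; loading more months would be needed for a seasonal view.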

In [None]:
# Write your code

### Task 5: Time-Series Forecasting
Use Prophet or a similar tool to forecast the future number of taxi rides, separately for green and yellow taxis. This includes creating a time-series model, tuning its parameters, and validating its performance. Also, interpret the model's predictions and identify the factors driving the forecasted trends.
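
Before fitting a model, the trips need to be aggregated into a regular time series. The sketch below builds daily ride counts in the two-column `ds`/`y` layout that Prophet expects, and adds a seasonal-naive baseline (repeat the value from seven days earlier) to compare any fitted model against. This baseline is a swapped-in simple technique, not Prophet itself; the helper names, the 7-day season, and the toy data are assumptions.

```python
import pandas as pd

def daily_ride_counts(df, time_col):
    """Aggregate trips into one row per calendar day with columns ds and y."""
    ts = pd.to_datetime(df[time_col])
    counts = ts.dt.floor("D").value_counts().sort_index()
    # Insert explicit zeros for days with no rides so the series is regular.
    full = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
    counts = counts.reindex(full, fill_value=0)
    return pd.DataFrame({"ds": counts.index, "y": counts.to_numpy()})

def seasonal_naive_forecast(daily, horizon=7, season=7):
    """Forecast each future day as the value observed `season` days earlier."""
    history = daily["y"].tolist()
    for _ in range(horizon):
        history.append(history[-season])
    start = daily["ds"].iloc[-1] + pd.Timedelta(days=1)
    future = pd.date_range(start, periods=horizon, freq="D")
    return pd.DataFrame({"ds": future, "yhat": history[-horizon:]})

# Made-up data: 14 days starting 2022-01-01 with a repeating weekly
# pattern of 1, 2, ..., 7 rides per day.
stamps = [
    pd.Timestamp("2022-01-01") + pd.Timedelta(days=day, hours=ride)
    for day in range(14)
    for ride in range(day % 7 + 1)
]
toy = pd.DataFrame({"tpep_pickup_datetime": stamps})

daily = daily_ride_counts(toy, "tpep_pickup_datetime")
forecast = seasonal_naive_forecast(daily, horizon=7)
print(forecast)
```

Holding out the last week or two of real data and comparing a model's error against this baseline gives a simple validation check; a model that cannot beat the baseline is probably not capturing more than weekly seasonality.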

In [None]:
# Write your code

### Task 6: Report and Presentation
Compile your findings into a final report and presentation. This includes summarizing your methodology, presenting your results, and discussing your conclusions. Also, identify the limitations of your analysis and suggest areas for future research.