# Comparing Traffic Patterns of NYC Taxis Pre, During, and Post-COVID-19: A SARIMA Time Series Analysis with Weather Consideration

> Author: 2802453   
> Date: 25.03.2023


## Introduction



The COVID-19 pandemic has had a profound impact on many aspects of daily life, including transportation. New York City (NYC), as one of the largest and most densely populated cities in the world, has experienced significant changes in its transportation system during the pandemic. With the implementation of social distancing guidelines and the closure of many businesses, there has been a reduction in the number of commuters and tourists, leading to changes in the traffic patterns of taxis, one of the main modes of transportation in the city.

This study aims to analyze the traffic patterns of NYC taxis before, during, and after the COVID-19 pandemic. Specifically, we will compare the taxi flow during December of 2017-2019 (before the pandemic), 2020 (during the pandemic), and 2021 (post-pandemic), to understand the effects of the pandemic on the transportation system. We will also identify and analyze the areas with higher taxi flow in the city to gain insights into the impact of the pandemic on the taxi system. In addition, we will use SARIMA time series modeling to consider the influence of weather on taxi flow.

This project primarily investigates the following questions:

1. Which zones are popular origins and destinations for taxis? Do the popular zones differ across the five years?
2. Do taxi traffic patterns differ between these popular zones? What has changed over the past five years?
3. Can a time series model of traffic flow between zones that accounts for weather be established, and how well does it apply to the three time periods?


In [99]:
# import all necessary libraries

## basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import datetime

## geo data processing
import geopandas as gpd
import folium

## API requests
import time
import requests
import json

## ARIMA model
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace import sarimax
from pmdarima import auto_arima

## others
import matplotlib.image as mpimg
from IPython.display import IFrame
from IPython.display import HTML
import plotly.graph_objects as go
import pickle
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Data

### Data collection

To investigate these issues, the following datasets were primarily utilized in this research:

1. The yellow and green taxi data for the month of December from 2017 to 2021 in New York City - [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
2. The shapefile corresponding to the data zone - [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
3. Weather data for the corresponding time period (obtained through API) - [OpenWeather](https://openweathermap.org/api/one-call-3)

The following code was used to obtain the weather data and save it as a dataframe:

In [None]:
def transform_weather_data_from_json_to_dataframe(weather_data):
    '''
    This function is used to transform the weather data into a dataframe
    '''
    # get the number of hours
    num_hours = len(weather_data)

    # get the date and time
    hourly_date = []
    hourly_time = []
    for i in range(num_hours):
        date_time = datetime.datetime.fromtimestamp(weather_data[i]['data'][0]['dt']).strftime('%Y-%m-%d %H:%M:%S')
        hourly_date.append(int(date_time.split(' ')[0].split('-')[2]))
        hourly_time.append(int(date_time.split(' ')[1].split(':')[0]))
    
    # get the hourly temperature
    hourly_temp = []
    for i in range(num_hours):
        hourly_temp.append(weather_data[i]['data'][0]['temp'])
    
    # get the hourly feels like temperature
    hourly_feels_like = []
    for i in range(num_hours):
        hourly_feels_like.append(weather_data[i]['data'][0]['feels_like'])
    
    # get the hourly humidity
    hourly_humidity = []
    for i in range(num_hours):
        hourly_humidity.append(weather_data[i]['data'][0]['humidity'])
    
    # get the hourly visibility
    hourly_visibility = []
    for i in range(num_hours):
        hourly_visibility.append(weather_data[i]['data'][0]['visibility'])
    
    # get the hourly wind speed
    hourly_wind_speed = []
    for i in range(num_hours):
        hourly_wind_speed.append(weather_data[i]['data'][0]['wind_speed'])
    
    # get the hourly weather description
    hourly_weather_description = []
    for i in range(num_hours):
        hourly_weather_description.append(weather_data[i]['data'][0]['weather'][0]['description'])
    
    # get the hourly rain volume (the 'rain' key is missing when no rain was recorded)
    hourly_rain = []
    for i in range(num_hours):
        hourly_rain.append(weather_data[i]['data'][0].get('rain', {}).get('1h', 0))
    
    # get the hourly snow volume (the 'snow' key is missing when no snow was recorded)
    hourly_snow = []
    for i in range(num_hours):
        hourly_snow.append(weather_data[i]['data'][0].get('snow', {}).get('1h', 0))
    
    # save the all the data into a dataframe
    weather_df = pd.DataFrame({'date': hourly_date, 'time': hourly_time,
                                'temp': hourly_temp, 'feels_like': hourly_feels_like,
                                'humidity': hourly_humidity, 'visibility': hourly_visibility, 
                                'wind_speed': hourly_wind_speed, 
                                'weather_description': hourly_weather_description, 
                                'rain': hourly_rain, 'snow': hourly_snow})
    
    # return the dataframe
    return weather_df

def get_weather_data_from_WeatherData(lat, lon, start_date, end_date, api_key):
    '''
    This function is to get hourly history weather data from openweathermap.org
    '''
    # get the start and end time in unix time
    start_time = int(time.mktime(datetime.datetime.strptime(start_date, "%Y-%m-%d").timetuple()))
    end_time = int(time.mktime(datetime.datetime.strptime(end_date, "%Y-%m-%d").timetuple()))
    
    # get the number of hours between start and end time
    num_hours = int((end_time - start_time) / 3600)
    
    # get the hourly history weather data
    weather_data = []
    for i in range(num_hours):
        # get the time in unix time
        time_unix = start_time + i * 3600
        # get the weather data
        url = f'http://api.openweathermap.org/data/3.0/onecall/timemachine?lat={lat}&lon={lon}&dt={time_unix}&appid={api_key}&units=metric'
        response = requests.get(url)
        data = json.loads(response.text)
        weather_data.append(data)
        # wait for 1 second
        time.sleep(1)
    
    # transform the weather data into a dataframe
    weather_df = transform_weather_data_from_json_to_dataframe(weather_data)

    # save the dataframe to a file
    folder_path = '../data/weather_data/csv/'
    file_name = f'weather_data_{start_date}_{end_date}.csv'
    file_path = folder_path + file_name
    weather_df.to_csv(file_path, index=False)
    print(f'The weather data has been saved to {file_path}.')

    # return the weather data
    return f'The weather data of {start_date} to {end_date} has been saved to {file_path}.'

# get the hourly history weather data
lat,lon = 40.730610,-73.935242  # New York City
# get API from file
with open('../data/documentations/api.txt', 'r') as f:
    api_key = f.read().strip()  # strip whitespace so a trailing newline does not end up in the request URL

# get the weather data for 2017 to 2021
for year in range(2017,2022):
    start_date = f'{year}-12-01'
    end_date = f'{year+1}-01-01'
    weather_data = get_weather_data_from_WeatherData(lat, lon, start_date, end_date, api_key)

### Data processing

#### Geodata of data zone
The coordinate system of the geodata of NYC zones was converted to WGS84 for the convenience of subsequent visualization.

In [None]:
# read the geodata and convert the coordinate system
nyc_zones_geo = gpd.read_file('../data/taxi_zones/taxi_zones.shp')
nyc_zones_geo = nyc_zones_geo.to_crs(epsg=4326)

#### NYC taxi trip data

The raw NYC taxi trip data looks as follows (using the green taxi data for December 2020 as an example):

In [7]:
# show an example of the original trip data
trip_data_example = pd.read_parquet('../data/dataexample/green_tripdata_2020-12.parquet')
trip_data_example.head(5)

| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2020-12-01 00:29:37 | 2020-12-01 00:32:51 | N | 1.0 | 75 | 75 | 1.0 | 0.59 | 4.5 | 0.5 | 0.5 | 1.16 | 0.0 | | 0.3 | 6.96 | 1.0 | 1.0 | 0.0 |
| 1 | 2 | 2020-12-01 00:41:46 | 2020-12-01 00:46:31 | N | 1.0 | 75 | 74 | 1.0 | 1.24 | 6.0 | 0.5 | 0.5 | 1.46 | 0.0 | | 0.3 | 8.76 | 1.0 | 1.0 | 0.0 |
| 2 | 2 | 2020-12-01 00:05:46 | 2020-12-01 00:10:48 | N | 1.0 | 244 | 243 | 2.0 | 1.19 | 6.0 | 0.5 | 0.5 | 1.46 | 0.0 | | 0.3 | 8.76 | 1.0 | 1.0 | 0.0 |
| 3 | 2 | 2020-11-30 23:59:17 | 2020-12-01 00:16:06 | N | 1.0 | 75 | 68 | 1.0 | 5.08 | 17.0 | 0.5 | 0.5 | 2.0 | 0.0 | | 0.3 | 23.05 | 1.0 | 1.0 | 2.75 |
| 4 | 2 | 2020-12-01 00:57:03 | 2020-12-01 01:00:29 | N | 1.0 | 74 | 263 | 1.0 | 1.24 | 5.5 | 0.5 | 0.5 | 0.0 | 0.0 | | 0.3 | 9.55 | 2.0 | 1.0 | 2.75 |


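Converting the raw trip records into traffic flow data involves three steps:
1. Cleaning - remove trips whose origin or destination is unknown
2. Counting - aggregate the trips by date, hour, origin, and destination into flow counts
3. Merging - add the green and yellow taxi flows for the same time slot to obtain the final flow for that slot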
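
As a minimal sketch of these three steps on toy data (not the notebook's actual pipeline), the snippet below cleans, counts, and merges two tiny trip tables. The column name `pickup_datetime`, the helper names `count_trips` and `merge_counts`, and the toy rows are assumptions made for illustration. The month filter is also an addition that the original code does not apply; it is included because the sample above contains a pickup on 2020-11-30 inside the December file, which would otherwise be counted as day 30.

```python
import pandas as pd

KEYS = ['day', 'hour', 'PULocationID', 'DOLocationID']

def count_trips(trips, year, month):
    """Clean one trip table and count trips per day, hour, origin, and destination."""
    # Cleaning: keep only trips whose origin and destination are known zones (IDs 1-263)
    known = trips['PULocationID'].between(1, 263) & trips['DOLocationID'].between(1, 263)
    trips = trips[known].assign(pickup=lambda d: pd.to_datetime(d['pickup_datetime']))
    # Extra filter (an assumption, not part of the original pipeline): drop stray
    # records from other months so that e.g. 30 November is not counted as day 30
    in_month = (trips['pickup'].dt.year == year) & (trips['pickup'].dt.month == month)
    trips = trips[in_month].assign(day=lambda d: d['pickup'].dt.day,
                                   hour=lambda d: d['pickup'].dt.hour)
    # Counting: one row per (day, hour, origin, destination)
    return trips.groupby(KEYS).size().rename('trip_count').reset_index()

def merge_counts(green, yellow):
    """Merging: add green and yellow counts that share the same keys."""
    return pd.concat([green, yellow]).groupby(KEYS, as_index=False)['trip_count'].sum()

# Toy trip tables (hypothetical rows)
green_trips = pd.DataFrame({
    'pickup_datetime': ['2020-12-01 00:29:37', '2020-12-01 00:41:46',
                        '2020-11-30 23:59:17', '2020-12-01 00:05:46'],
    'PULocationID': [75, 75, 75, 264],
    'DOLocationID': [75, 75, 68, 243],
})
yellow_trips = pd.DataFrame({
    'pickup_datetime': ['2020-12-01 00:10:00', '2020-12-01 01:00:00'],
    'PULocationID': [75, 236],
    'DOLocationID': [75, 237],
})

flow = merge_counts(count_trips(green_trips, 2020, 12),
                    count_trips(yellow_trips, 2020, 12))
print(flow)
```

In this toy case the November trip and the trip from zone 264 are dropped, and the remaining green and yellow trips from zone 75 to zone 75 in hour 0 are added together.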

The code is as follows:

In [None]:
# functions for process parquet files from folder
def process_data(df):
    '''
    INPUT: df - dataframe of parquet files
    OUTPUT: df - counted dataframe of parquet files
    '''
    # select columns that are needed
    df = df[['lpep_pickup_datetime', 'passenger_count', 
              'PULocationID', 'DOLocationID', 'tip_amount', 'total_amount']]

    # delete rows with values >263 in 'PULocationID' and 'DOLocationID' columns
    df = df[(df['PULocationID'] <= 263) & (df['DOLocationID'] <= 263)]
    # fill missing values in 'passenger_count' column with 1
    df['passenger_count'] = df['passenger_count'].fillna(1)
    # dropna in 'total_amount' and 'tip_amount' columns
    df = df.dropna(subset=['total_amount','tip_amount'])
    # calculate the trip_fee = Total_amount - Tip_amount
    df['trip_fee'] = df['total_amount'] - df['tip_amount']


    # delete columns that are not needed anymore
    df = df.drop(columns=['total_amount','tip_amount'])
    # delete rows with negative values and 0 in 'trip_fee' column
    df = df[df['trip_fee'] > 0]

    # derive the day and hour of each pickup
    df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
    df['day'] = df['lpep_pickup_datetime'].dt.day
    df['hour'] = df['lpep_pickup_datetime'].dt.hour

    # for each day, hour, PULocationID and DOLocationID: count the trips,
    # sum the passengers (a plain count() would only repeat the trip count),
    # and average the trip_fee
    count_df = (df.groupby(['day','hour','PULocationID','DOLocationID'])
                  .agg(trip_count=('lpep_pickup_datetime', 'count'),
                       passenger_count=('passenger_count', 'sum'),
                       trip_fee=('trip_fee', 'mean'))
                  .reset_index())
    
    return count_df

def read_and_count_parquet(folder_path):
    '''
    INPUT: folder_path - path to folder containing parquet files
    OUTPUT: saves the processed dataframe to csv file
    '''
    
    for filename in os.listdir(folder_path):
        if filename.endswith(".parquet"):
            file_path = os.path.join(folder_path, filename)
            df = pd.read_parquet(file_path)

            # rename columns for yellow taxi data
            if filename.split('_')[0] == 'yellow':
                df = df.rename(columns={'tpep_pickup_datetime':'lpep_pickup_datetime',
                                                           'tpep_dropoff_datetime':'lpep_dropoff_datetime'})
            df = process_data(df)  # process the dataframe
            # e.g. green_tripdata_2020-12.parquet -> green_countdata_2020-12.csv
            stem = os.path.splitext(filename)[0].replace('trip', 'count')
            save_path = f'../data/processed_nyc_data/{stem}.csv'
            df.to_csv(save_path, index=False)  # save the dataframe to csv file
            print(f'{filename} is processed')
            
    return 'Done!'

# process parquet files from folder
folder_path = '../data/NYC_taxi_data/'
read_and_count_parquet(folder_path)

In [None]:
# merge the green and yellow taxi data
for i in range(2017,2022):
    df_green = pd.read_csv(f'../data/processed_nyc_data/green_countdata_{i}-12.csv')
    df_yellow = pd.read_csv(f'../data/processed_nyc_data/yellow_countdata_{i}-12.csv')
    
    # merge the green and yellow taxi data by day, hour, PULocationID and DOLocationID
    # which trip_count = yellow + green
    # passenger_count = yellow + green
    # trip_fee = (yellow * yellow_trip_fee + green * green_trip_fee) / (yellow + green)
    df = pd.merge(df_green, df_yellow, how='outer', on=['day','hour','PULocationID','DOLocationID'])
    df = df.fillna(0)
    df['trip_count'] = df['trip_count_x'] + df['trip_count_y']
    df['passenger_count'] = df['passenger_count_x'] + df['passenger_count_y']
    df['trip_fee'] = round((df['trip_fee_x'] * df['trip_count_x'] + df['trip_fee_y'] * df['trip_count_y']) / df['trip_count'],2)  # round 2
    df = df.drop(columns=['trip_count_x','trip_count_y','passenger_count_x','passenger_count_y',
                            'trip_fee_x','trip_fee_y'])
    
    # save the dataframe to csv file
    df.to_csv(f'../data/processed_nyc_data/countdata_{i}-12.csv', index=False)
    print(f'countdata_{i}-12.csv is saved')

After this step, the aggregated data looks as follows:

In [14]:
# show an example of the counted data
counted_data_2020 = pd.read_csv('../data/dataexample/countdata_2020-12.csv')
counted_data_2020.head(5)

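| | day | hour | PULocationID | DOLocationID | trip_count | passenger_count | trip_fee |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 32 | 159 | 1.0 | 1.0 | 27.57 |
| 1 | 1 | 0 | 37 | 217 | 1.0 | 1.0 | 9.3 |
| 2 | 1 | 0 | 41 | 41 | 1.0 | 1.0 | 5.3 |
| 3 | 1 | 0 | 41 | 42 | 1.0 | 1.0 | 7.8 |
| 4 | 1 | 0 | 41 | 47 | 1.0 | 1.0 | 25.11 |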


After aggregating the raw data, the data is processed further for each of the three research questions.

#### Dataset for Question 1

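To identify the hot zones for taxis in New York City, the aggregated flow data is processed as follows.

For the aggregated flow data of each year from 2017 to 2021:
1. Compute, for every zone and every hour, the total inbound flow and the total outbound flow. Because this computation exceeded the capacity of the local machine, it was run on an Amazon Web Services instance of type t2.2xlarge.
2. Compute total_volume = in_volume + out_volume (the corresponding code columns are spelled `total_volumn`, `in_volumn`, and `out_volumn`).
3. Rank all zones by total_volume within each day and hour to obtain volume_rank, and derive volume_score from it, where:
   $$volume\_score = num(zones)\times \frac{1}{volume\_rank}$$
4. Sum volume_score and total_volume over all hours for each zone to obtain that zone's year_score and year_volume for the year.
5. Rank the zones by year_score and by year_volume to obtain year_score_rank and year_volume_rank.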
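
The five steps above can also be sketched in vectorized form with `groupby`. This is a hedged illustration on toy data, not the code that was run on the AWS instance: the function name `zone_scores`, the toy table, and the three-zone setup are assumptions, and unlike the original loop the sketch only ranks the (day, hour) slots that actually occur in the data.

```python
import pandas as pd

def zone_scores(flows, zone_ids):
    """Hourly in/out volumes, rank-based scores, and yearly ranks per zone."""
    keys = ['day', 'hour']
    zone_keys = keys + ['location_id']
    # Step 1: inbound = trips dropped off in a zone, outbound = trips picked up there
    in_vol = (flows.groupby(keys + ['DOLocationID'], as_index=False)['trip_count'].sum()
                   .rename(columns={'DOLocationID': 'location_id', 'trip_count': 'in_volumn'}))
    out_vol = (flows.groupby(keys + ['PULocationID'], as_index=False)['trip_count'].sum()
                    .rename(columns={'PULocationID': 'location_id', 'trip_count': 'out_volumn'}))
    # Cross every (day, hour) slot found in the data with every zone,
    # so that zones without trips in a slot get a volume of 0
    grid = flows[keys].drop_duplicates().merge(
        pd.DataFrame({'location_id': zone_ids}), how='cross')
    vol = (grid.merge(in_vol, on=zone_keys, how='left')
               .merge(out_vol, on=zone_keys, how='left')
               .fillna(0))
    # Step 2: total volume
    vol['total_volumn'] = vol['in_volumn'] + vol['out_volumn']
    # Step 3: rank within each (day, hour) slot, then score = num(zones) / rank
    vol['volume_rank'] = vol.groupby(keys)['total_volumn'].rank(ascending=False)
    vol['volume_score'] = len(zone_ids) / vol['volume_rank']
    # Step 4: sum scores and volumes per zone
    yearly = vol.groupby('location_id').agg(year_score=('volume_score', 'sum'),
                                            year_volumn=('total_volumn', 'sum'))
    # Step 5: rank zones by yearly score and by yearly volume
    yearly['year_score_rank'] = yearly['year_score'].rank(ascending=False)
    yearly['year_volume_rank'] = yearly['year_volumn'].rank(ascending=False)
    return yearly.reset_index()

# Toy aggregated flows with the same columns as the countdata files (hypothetical values)
flows = pd.DataFrame({
    'day':          [1, 1, 1, 1],
    'hour':         [0, 0, 0, 1],
    'PULocationID': [1, 2, 1, 3],
    'DOLocationID': [2, 1, 3, 1],
    'trip_count':   [5, 3, 2, 4],
})
scores = zone_scores(flows, zone_ids=[1, 2, 3])
print(scores)
```

Because each `groupby` makes a single pass over the table, this form avoids filtering the full dataframe once per zone, day, and hour, which may make the computation feasible on a smaller machine.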

The code used to process the data and the resulting data are shown below:

In [82]:
# get the in-and-out volumn for each zone by day and hours, and the score
## ================== below code was run on the aws instance ==================

# def get_in_out_volumn_score(df, day_list, hour_list, year):
#     """
#     Calculate the in-and-out volumn for each zone by day and hours, and give each zone a score based on the volumn.
#     Input:
#         df: a dataframe containing the counted data
#         day_list: a list of days
#         hour_list: a list of hours
#     Output:
#         df_in_out_volumn: a dataframe containing the in-and-out volumn for each zone by day and hours, and the score
#     """
#     # calculate the in-and-out volumn for each zone by day and hours
#     ## set the location id
#     location_id_list = list(range(1,264))
#     ## get the day and hour
#     day_list = day_list
#     hour_list = hour_list
#     ## get the in-and-out volumn for each zone by day and hours, and store them in a dictionary
#     in_out_volumn = {}
#     for location_id in location_id_list:
#         for day in day_list:
#             for hour in hour_list:
#                 in_out_volumn[(location_id, day, hour)] = [df.loc[(df['DOLocationID']==location_id) & (df['day']==day) & (df['hour']==hour), 'trip_count'].sum(), df.loc[(df['PULocationID']==location_id) & (df['day']==day) & (df['hour']==hour), 'trip_count'].sum()]
#         print(f'Finish calculating in-and-out volumn for zone {location_id} of {year}.')
#     # convert the dictionary to a dataframe
#     df_in_out_volumn = pd.DataFrame.from_dict(in_out_volumn, orient='index', columns=['in_volumn', 'out_volumn'])
#     # add day hour and location id as columns
#     df_in_out_volumn['day'] = df_in_out_volumn.index.map(lambda x: x[1])
#     df_in_out_volumn['hour'] = df_in_out_volumn.index.map(lambda x: x[2])
#     df_in_out_volumn['location_id'] = df_in_out_volumn.index.map(lambda x: x[0])
#     # add the total volumn
#     df_in_out_volumn['total_volumn'] = df_in_out_volumn['in_volumn'] + df_in_out_volumn['out_volumn']
#     # reorder the columns
#     df_in_out_volumn = df_in_out_volumn[['day', 'hour', 'location_id', 'in_volumn', 'out_volumn', 'total_volumn']]
#     # reset the index
#     df_in_out_volumn = df_in_out_volumn.reset_index(drop=True)

#     # calculate the rank of the total volumn for each zone by day and hours
#     df_in_out_volumn['total_volumn_rank'] = df_in_out_volumn.groupby(['day', 'hour'])['total_volumn'].rank(ascending=False)
#     # give the zone a score = 1/rank * number of zones
#     df_in_out_volumn['total_volumn_score'] = (1/df_in_out_volumn['total_volumn_rank']) * 263
    
#     return df_in_out_volumn


# def get_hot_score(year_list, day_list, hour_list):
#     for year in year_list:
#         # read the data
#         df = pd.read_csv(f'/data/countdata_{year}-12.csv')
#         # calculate the in-and-out volumn for each zone by day and hours, and give each zone a score based on the volumn
#         df_in_out_volumn = get_in_out_volumn_score(df, day_list, hour_list,year)
#         df_in_out_volumn.to_csv(f'../data/in_out_volumn_score_{year}-12.csv', index=False)
#         print(f'Finish getting score for {year}-12.')


# year_list = range(2019, 2022)
# day_list = range(1, 32)
# hour_list = range(0, 24)
# get_hot_score(year_list, day_list, hour_list)

# ================== above code was run on the aws instance ==================

def get_ranks(year):
    df = pd.read_csv(f'../data/processed_nyc_data/{year}.csv')
    df['year_score'] = df.groupby('location_id')['total_volumn_score'].transform('sum')
    df['year_volumn'] = df.groupby('location_id')['total_volumn'].transform('sum')
    df_hot = df[['location_id', 'year_score','year_volumn']].drop_duplicates()

    df_hot['volumn_rank'] = df_hot['year_volumn'].rank(ascending=False)
    df_hot['score_rank'] = df_hot['year_score'].rank(ascending=False)
    
    df_hot['fluctuation_coef'] = df_hot['score_rank'] / df_hot['volumn_rank']
        
    df_hot = df_hot.sort_values(by='year_score', ascending=False)
    df_hot = df_hot[['location_id', 'year_score', 'score_rank', 'year_volumn', 'volumn_rank', 'fluctuation_coef']]

    return df_hot


# get the rank of each zone for each year
def get_zones_rank(year_list):
    df_hots = {}
    for year in year_list:
        df_hot = get_ranks(year)
        # reset the index
        df_hot = df_hot.reset_index(drop=True)
        df_hots[year] = df_hot

    return df_hots

zones_rank = get_zones_rank(range(2017, 2022))



In [88]:
# show an example of the rank dictionary
zones_rank[2019].head(5)

| | location_id | year_score | score_rank | year_volumn | volumn_rank | fluctuation_coef |
|---|---|---|---|---|---|---|
| 0 | 237 | 81220.20891 | 1.0 | 606782.0 | 1.0 | 1.0 |
| 1 | 236 | 70439.11933 | 2.0 | 589538.0 | 2.0 | 1.0 |
| 2 | 161 | 61489.543556 | 3.0 | 529001.0 | 3.0 | 1.0 |
| 3 | 48 | 52746.6502 | 4.0 | 413657.0 | 8.0 | 0.5 |
| 4 | 230 | 48159.681958 | 5.0 | 444629.0 | 5.0 | 1.0 |


#### Dataset for Question 2

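To study taxi flow between the hot zones, the data is constructed as follows:
1. Find the 10 most popular zones over the five years: sum each zone's year_score over the 5 years and take the top ten as the study zones.
2. For December of each year, generate the time series of taxi flow from every study zone to every other study zone, giving $5\times10\times9=450$ series in total.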
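
As a small illustration of step 2, the sketch below builds one zero-filled hourly December series for a single origin-destination pair with `reindex`. It is an alternative formulation under stated assumptions rather than the notebook's own implementation: the function name `hourly_series` and the toy counts are hypothetical.

```python
import pandas as pd

def hourly_series(counts, origin, dest, year):
    """Hourly trip counts from origin to dest for December of the given year."""
    pair = counts[(counts['PULocationID'] == origin) & (counts['DOLocationID'] == dest)]
    # Assemble a timestamp for every (day, hour) row of this pair
    stamps = pd.to_datetime(pd.DataFrame({'year': year, 'month': 12,
                                          'day': pair['day'], 'hour': pair['hour']}))
    # All 31 * 24 hours of December; hours without trips are filled with 0
    full_index = pd.date_range(f'{year}-12-01', periods=31 * 24, freq='h')
    series = pd.Series(pair['trip_count'].to_numpy(), index=pd.DatetimeIndex(stamps))
    return series.reindex(full_index, fill_value=0)

# Toy aggregated counts (hypothetical values)
counts = pd.DataFrame({
    'day':          [1, 1, 2, 1],
    'hour':         [0, 3, 23, 0],
    'PULocationID': [236, 236, 236, 237],
    'DOLocationID': [162, 162, 162, 162],
    'trip_count':   [2, 1, 4, 9],
})
series_236_to_162 = hourly_series(counts, origin=236, dest=162, year=2017)
print(series_236_to_162.head())
```

Looping this over the ordered pairs of the ten study zones and the five years yields the 450 series mentioned above.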

The code used is as follows:


In [50]:
# find the 10 hottest zones over the five years by adding the 5 years' scores together
df_hot_5years = pd.DataFrame()
df_hot_5years['location_id'] = range(1, 264)
for year in range(2017, 2022):
    # add the score of each year for each zone
    df_hot_5years = df_hot_5years.merge(zones_rank[year][['location_id', 'year_score']], on='location_id', how='left')
    df_hot_5years = df_hot_5years.rename(columns={'year_score': f'{year}_score'})

# add the total score (sum of the five yearly score columns)
df_hot_5years['total_score'] = df_hot_5years.filter(like='_score').sum(axis=1)

df_hot_5years = df_hot_5years.sort_values(by='total_score', ascending=False)
df_hot_5years = df_hot_5years.reset_index(drop=True)
df_hot_5years = nyc_zones_geo.merge(df_hot_5years, left_on='LocationID', right_on='location_id', how='right')


# show the 10 hottest zones over the five years
df_hot_5years[['location_id','zone', 'total_score']].head(10)


| | location_id | zone | total_score |
|---|---|---|---|
| 0 | 236 | Upper East Side North | 402109.543625 |
| 1 | 237 | Upper East Side South | 384653.539341 |
| 2 | 48 | Clinton East | 270290.50712 |
| 3 | 161 | Midtown Center | 259331.130219 |
| 4 | 186 | Penn Station/Madison Sq West | 208677.405776 |
| 5 | 230 | Times Sq/Theatre District | 198370.576496 |
| 6 | 79 | East Village | 195969.058117 |
| 7 | 142 | Lincoln Square East | 158488.126822 |
| 8 | 162 | Midtown East | 142121.385078 |
| 9 | 132 | JFK Airport | 141099.765288 |


In [58]:
# create time series for each pair of zones, and save them to a dictionary
def create_time_series(year, location_id_list):
    # create a template time series dataframe
    count_df = pd.read_csv(f'../data/processed_nyc_data/countdata_{year}-12.csv')

    temp_time_series = pd.DataFrame()
    temp_time_series['date'] = np.repeat(np.arange(1, 32), 24)
    temp_time_series['hour'] = np.tile(np.arange(0, 24), 31)
    
    all_time_series = {}
    
    for location_id1 in location_id_list:
        for location_id2 in location_id_list:
            if location_id1 != location_id2:
                series_name = f'{location_id1}to{location_id2}'
                all_time_series[series_name] = temp_time_series.copy()
                temp_count = count_df.loc[(count_df['PULocationID']==location_id1) & (count_df['DOLocationID']==location_id2), :]
                all_time_series[series_name] = all_time_series[series_name].merge(temp_count, left_on=['date', 'hour'], right_on=['day', 'hour'], how='left')
                # fill the nan with 0
                all_time_series[series_name] = all_time_series[series_name].fillna(0)
                # keep the columns we need
                all_time_series[series_name] = all_time_series[series_name][['date','hour','trip_count']]
                # create a datetime column
                all_time_series[series_name]['datetime']  = pd.to_datetime(all_time_series[series_name]['date'].astype(str) + '-12-' + str(year) + ' ' + all_time_series[series_name]['hour'].astype(str) + ':00:00', format='%d-%m-%Y %H:%M:%S')
                # set the datetime column as index
                all_time_series[series_name] = all_time_series[series_name].set_index('datetime')
    return all_time_series



location_id_list = df_hot_5years['location_id'].head(10).tolist()

all_time_series_2017 = create_time_series(2017, location_id_list)
all_time_series_2018 = create_time_series(2018, location_id_list)
all_time_series_2019 = create_time_series(2019, location_id_list)
all_time_series_2020 = create_time_series(2020, location_id_list)
all_time_series_2021 = create_time_series(2021, location_id_list)

In [57]:
all_time_series_2017['236to162'].head(5)

| datetime | date | hour | trip_count |
|---|---|---|---|
| 2017-12-01 00:00:00 | 1 | 0 | 2.0 |
| 2017-12-01 01:00:00 | 1 | 1 | 1.0 |
| 2017-12-01 02:00:00 | 1 | 2 | 0.0 |
| 2017-12-01 03:00:00 | 1 | 3 | 1.0 |
| 2017-12-01 04:00:00 | 1 | 4 | 1.0 |


#### Dataset for Question 3

The data for Question 3 only requires adding weather information to the time series built for Question 2, so the details are omitted here.

## Analysis of zones

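The top 30 zones of each year (those with a `volumn_rank` of 30 or better, as filtered in the code below) are visualized with `folium`: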

In [81]:
# plot the first 30 hot zone's fluctuation rank on the map

def plot_hot_zone(dict_hot, df_geo, year):
    # merge the year score to the geodata
    df_hot_geo = df_geo.merge(dict_hot[year], left_on='LocationID', right_on='location_id', how='right')

    # filter the hot zone which volumn_rank is less than 31
    df_hot_geo = df_hot_geo.loc[df_hot_geo['volumn_rank']<=30, :]

    # add center point coordinates to the dataframe
    df_hot_geo['center_lat'] = df_hot_geo['geometry'].centroid.y
    df_hot_geo['center_lon'] = df_hot_geo['geometry'].centroid.x

    # plot the hot zone with base map in folium
    threshold_scale = [0,0.99999,1.000001,2.5]

    m = folium.Map(location=[40.75, -73.9], zoom_start=11, tiles='Stamen Toner')
    folium.Choropleth(
        geo_data=df_hot_geo,
        name='choropleth',
        data=df_hot_geo,
        columns=['location_id', 'fluctuation_coef'],
        key_on='feature.properties.location_id',
        fill_color='RdYlBu',
        fill_opacity=0.7,
        line_opacity=1.0,
        threshold_scale=threshold_scale).add_to(m)
    # always show zone id to the map as a label 
    folium.features.GeoJson(
        df_hot_geo,
        style_function=lambda feature: {
            'fillColor': 'transparent',
            'color': 'transparent',
            'weight': 0,
            'dashArray': '5, 5'
        },
        highlight_function=lambda x: {'weight':0.1, 'color':'black'},
        tooltip=folium.features.GeoJsonTooltip(
            fields=['location_id', 'zone'],
            aliases=['Zone ID:', 'Zone Name:'],
            localize=True,
            sticky=False
        )
    ).add_to(m)

    folium.LayerControl().add_to(m)
    m.save(f'../data/figures/hot_zone_{year}.html')
    print(f'Hot zones of December {year} as below:')
    display(m)
    return None


plot_hot_zone(zones_rank, nyc_zones_geo, 2017)
plot_hot_zone(zones_rank, nyc_zones_geo, 2018)
plot_hot_zone(zones_rank, nyc_zones_geo, 2019)
plot_hot_zone(zones_rank, nyc_zones_geo, 2020)
plot_hot_zone(zones_rank, nyc_zones_geo, 2021)

[Output: five interactive `folium` maps showing the hot zones of December 2017, 2018, 2019, 2020, and 2021]


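In the hot-zone maps above, a colored zone is one whose year_volume ranked in the top 30 of all zones in that year. The maps show that the vast majority of hot zones are concentrated in the southern part of Manhattan, which is the commercial and tourist center of New York City.

Notably, John F. Kennedy International Airport (JFK Airport) and LaGuardia Airport are also hot zones in every year except 2020, which indirectly reflects how hard taxi traffic at the airports was hit by the COVID-19 pandemic.

To examine how flow characteristics differ across zones, I computed a fluctuation coefficient for each zone's taxi flow, defined as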
$$
fluctuation\_ coef = \frac{year\_ score\_ rank}{year\_ volume\_ rank}
$$

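This coefficient is the ratio between a zone's rank by hourly popularity over the whole of December (year_score_rank) and its rank by total December flow (year_volume_rank). A value below 1 means that year_score_rank is numerically lower than year_volume_rank, which indicates that the zone's flow fluctuates more strongly than that of other zones; a value above 1 indicates that the zone's flow is comparatively steady. For example, in December 2019 zone 48 (Clinton East) has a score rank of 4 and a volume rank of 8, which gives a coefficient of 0.5.

In the maps, the colors encode each zone's fluctuation_coef: red zones have fluctuation_coef < 1, while blue zones have fluctuation_coef > 1.

The maps show that the zones with more volatile taxi flow are concentrated at the two airports and in southwestern Manhattan. This makes sense: these areas are dense commercial and residential districts that attract large numbers of visitors, and travel-related demand is more sensitive to factors such as the pandemic, the weather, and holidays, which makes taxi flow in these zones more volatile.

The figure below shows how the yearly rank of these zones (the `volumn_rank` column plotted in the code below) changed over the five years: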


In [117]:
rank_change = pd.DataFrame()
rank_change['location_id'] = zones_rank[2017]['location_id'][0:30]
# merge the zone name for each zone from the geodata
rank_change['zone'] = rank_change.merge(nyc_zones_geo, left_on='location_id', right_on='LocationID', how='left')['zone']


# get the volume rank for each zone in each year
for year in range(2017, 2022):
    rank_change = rank_change.merge(zones_rank[year][['location_id', 'volumn_rank']], left_on='location_id', right_on='location_id', how='left')
    rank_change = rank_change.rename(columns={'volumn_rank': year})


# plot the rank change of each zone with plotly
# (use only the year columns, skipping 'location_id' and 'zone')
year_cols = list(range(2017, 2022))
fig = go.Figure()
for i in range(0, 30):
    fig.add_trace(go.Scatter(x=year_cols, y=rank_change.loc[i, year_cols], name=rank_change.loc[i, 'zone']))
fig.update_layout(title='Rank Change of Hot Zones in NYC',
                     xaxis_title='Year',
                        yaxis_title='Rank')
# reverse the y axis
fig.update_yaxes(autorange="reversed")

# set ticks on the x axis
fig.update_xaxes(tickmode = 'array',
                    tickvals = [2017, 2018, 2019, 2020, 2021],
                    ticktext = ['2017', '2018', '2019', '2020', '2021'])

# set the width of the line
fig.update_traces(line=dict(width=4))

# save the plot as html file
fig.write_html("../data/figures/rank_change.html")

# show the plot from the html file (IFrame is already imported at the top of the notebook)
IFrame(src = '../data/figures/rank_change.html', width=1000, height=600)


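As the figure above shows, the zone rankings changed little from 2017 to 2019, but the picture changed in 2020.  

In 2020 the rankings of some zones shifted markedly: some dropped sharply while others rose. Interestingly, whether a zone rose or fell appears to be related to the volatility of its taxi flow. Zones with more volatile flow dropped more noticeably, which suggests that the pandemic hit taxi flow in those zones harder, in line with the analysis above.

By December 2021 the rankings had returned to close to their levels in 2019 and earlier. From this perspective, the pandemic's effect on taxi flow seems to have passed.

Note that a rise or fall in a zone's rank does not directly reflect the trend in its taxi flow. In fact, the flow between almost all zones fell significantly in December 2020, which is discussed in the next section.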

## Analysis of taxi trip flows

## Analysis of weather impact on taxi trip flows

### SARIMA
SARIMA, or Seasonal Autoregressive Integrated Moving Average, is a statistical model used for time series analysis and forecasting. It is an extension of the popular ARIMA model that takes into account seasonal patterns in the data.

In SARIMA, the "S" stands for the seasonal component, which means that the model captures patterns that repeat with a fixed period, such as daily, weekly, or yearly cycles. The "AR" refers to the autoregressive component, which models the dependence of the current value on its past values. The "I" stands for integrated: the series is differenced to remove trends and make it stationary. The "MA" stands for the moving average component, which models the dependence of the current value on past forecast errors.

SARIMA models are typically written as  
$$SARIMA(p,d,q)(P,D,Q)_m$$
where $p$, $d$, and $q$ are the orders of the non-seasonal autoregressive, differencing, and moving average terms, $P$, $D$, and $Q$ are the corresponding seasonal orders, and $m$ is the number of time steps in each seasonal cycle (for example, $m=24$ for hourly data with a daily cycle). The orders are usually chosen by comparing candidate models with a criterion such as the Akaike Information Criterion (AIC), and the coefficients of the chosen model are estimated by maximum likelihood.
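
For reference, the standard form of this model can be written with the backshift operator $B$, defined by $B y_t = y_{t-1}$. The symbols $\phi$, $\Phi$, $\theta$, $\Theta$, and $\varepsilon_t$ are notation introduced here for illustration and are not used elsewhere in this notebook:

$$\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t$$

where

$$\phi_p(B) = 1-\sum_{i=1}^{p}\phi_i B^i,\qquad \Phi_P(B^m) = 1-\sum_{i=1}^{P}\Phi_i B^{im},$$

$$\theta_q(B) = 1+\sum_{i=1}^{q}\theta_i B^i,\qquad \Theta_Q(B^m) = 1+\sum_{i=1}^{Q}\Theta_i B^{im},$$

and $\varepsilon_t$ is white noise. The factors $(1-B)^d$ and $(1-B^m)^D$ are the non-seasonal and seasonal differencing that correspond to $d$ and $D$.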

SARIMA models are widely used in various fields, including finance, economics, and environmental sciences, for forecasting future values based on historical patterns. They are particularly useful when the data exhibit seasonal patterns or have non-stationary behavior, such as trends or cycles. 

## Conclusion