For the midterm proposal, data cleaning and EDA are performed only on the most recent dataset (2023-04).

Import Packages and Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

In [2]:
fhvhv_202304 = pd.read_parquet('../fhvhv_tripdata_2023-04.parquet', engine='pyarrow')

In [3]:
fhvhv_202304.head()

Unnamed: 0,hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,...,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag
0,HV0003,B03404,B03404,2023-04-01 00:01:28,2023-04-01 00:02:41,2023-04-01 00:03:03,2023-04-01 00:15:42,174,126,5.1,...,1.82,0.0,0.0,0.0,13.83,N,N,,N,N
1,HV0003,B03404,B03404,2023-04-01 00:39:40,2023-04-01 00:41:20,2023-04-01 00:43:42,2023-04-01 01:01:08,211,163,3.11,...,3.56,2.75,0.0,9.52,24.84,N,N,,N,N
2,HV0003,B03404,B03404,2023-03-31 23:56:31,2023-04-01 00:00:20,2023-04-01 00:01:01,2023-04-01 00:09:48,222,76,2.4,...,1.08,0.0,0.0,0.0,8.1,Y,N,,N,N
3,HV0003,B03404,B03404,2023-04-01 00:12:35,2023-04-01 00:16:04,2023-04-01 00:17:17,2023-04-01 00:19:12,76,124,0.43,...,0.7,0.0,0.0,0.0,5.4,N,N,,N,N
4,HV0003,B03404,B03404,2023-04-01 00:46:23,2023-04-01 00:47:30,2023-04-01 00:47:49,2023-04-01 01:05:23,263,247,4.18,...,1.48,2.75,0.0,0.0,15.65,N,N,,N,N


In [4]:
fhvhv_202304.shape

(19144903, 24)

Keep only the Uber (HV0003) records

In [5]:
# Filter to Uber trips and renumber the rows from 0.
# Chaining reset_index avoids an in-place change on a filtered slice.
uber_202304 = fhvhv_202304[fhvhv_202304['hvfhs_license_num'] == 'HV0003'].reset_index(drop=True)
uber_202304.shape

(13998413, 24)

# Remove Shared-Ride Records

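Remove the shared-ride records: roughly 100,000 trips (about 0.75%).  
Uber's shared-ride service was formerly called Uber Pool; it was suspended on March 17, 2020 because of the pandemic.  
Its successor, UberX Share, launched on November 1, 2022, and Uber states that shared rides are discounted by up to 20%.  
Processing the full dataset would require accounting for the suspension and resumption dates, and we cannot determine whether the discounts differ between the old and new services.  
Predicting prices accurately would require modeling shared rides separately, which would make the project too complex.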

In [6]:
share_records = uber_202304[uber_202304.shared_match_flag == "Y"].shape[0]
all_records = uber_202304.shape[0]
p_share_in_all = round((share_records / all_records * 100), 2)
print(f'Shared records: {share_records}')
print(f'Percentage of shared records in all records: {p_share_in_all}%')

Shared records: 105461
Percentage of shared records in all records: 0.75%
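
The cell above only counts the shared trips. A minimal sketch of the removal step itself, on a small made-up frame, might look like the following. The toy values are illustrative, and whether to filter on `shared_match_flag` alone or also on `shared_request_flag` is an assumption here, not something this notebook settles.

```python
import pandas as pd

# Toy stand-in for uber_202304 (values are made up for illustration).
toy = pd.DataFrame({
    "shared_request_flag": ["N", "Y", "Y", "N"],
    "shared_match_flag": ["N", "N", "Y", "N"],
    "base_passenger_fare": [20.52, 12.16, 9.80, 7.86],
})

# Keep only trips that were not matched into a shared ride.
not_shared = toy[toy["shared_match_flag"] != "Y"].reset_index(drop=True)

# Stricter variant (assumption): also drop trips that only requested sharing.
strict = toy[
    (toy["shared_match_flag"] != "Y") & (toy["shared_request_flag"] != "Y")
].reset_index(drop=True)

print(len(not_shared), len(strict))
```

The stricter variant may matter if a shared request that goes unmatched is still priced differently, which the data alone does not confirm.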


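Columns removed:  
● *'hvfhs_license_num'*: the data is already filtered to Uber, so every value is 'HV0003';  
● *'dispatching_base_num'*, *'originating_base_num'*: which base a trip belongs to is not important, since most drivers are probably not departing from the base when they receive a request;  
● *'bcf'*, *'sales_tax'*: both are calculated as a fixed percentage, so they do not need to be predicted;  
● *'driver_pay'*: the amount Uber pays the driver, which passengers do not need;  
● *'shared_request_flag'*, *'shared_match_flag'*: no longer needed once the shared-ride records are removed;  
● *'access_a_ride_flag'*: not relevant to fare prediction;  
● *'wav_request_flag'*, *'wav_match_flag'*: Uber states that WAV trips cost about the same as UberX, so this information is unnecessary unless the fares turn out to differ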

In [7]:
# Columns that are constant, derivable, or irrelevant to the passenger fare.
cols_to_drop = [
    'hvfhs_license_num', 'dispatching_base_num', 'originating_base_num',
    'bcf', 'sales_tax', 'driver_pay',
    'shared_request_flag', 'shared_match_flag', 'access_a_ride_flag',
    'wav_request_flag', 'wav_match_flag',
]
uber_202304 = uber_202304.drop(columns=cols_to_drop)

In [8]:
uber_202304.head()

Unnamed: 0,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,congestion_surcharge,airport_fee,tips,shared_request_flag,shared_match_flag
0,2023-04-01 00:01:28,2023-04-01 00:02:41,2023-04-01 00:03:03,2023-04-01 00:15:42,174,126,5.1,759,20.52,0.0,0.0,0.0,0.0,N,N
1,2023-04-01 00:39:40,2023-04-01 00:41:20,2023-04-01 00:43:42,2023-04-01 01:01:08,211,163,3.11,1046,40.12,0.0,2.75,0.0,9.52,N,N
2,2023-03-31 23:56:31,2023-04-01 00:00:20,2023-04-01 00:01:01,2023-04-01 00:09:48,222,76,2.4,527,12.16,0.0,0.0,0.0,0.0,Y,N
3,2023-04-01 00:12:35,2023-04-01 00:16:04,2023-04-01 00:17:17,2023-04-01 00:19:12,76,124,0.43,115,7.86,0.0,0.0,0.0,0.0,N,N
4,2023-04-01 00:46:23,2023-04-01 00:47:30,2023-04-01 00:47:49,2023-04-01 01:05:23,263,247,4.18,1054,16.73,0.0,2.75,0.0,0.0,N,N


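Fee-related features kept: 'base_passenger_fare', 'airport_fee', 'congestion_surcharge'  
'airport_fee': $2.50 when either the pickup or the drop-off is at an airport, and $5.00 when both are  
'congestion_surcharge': charged on trips that start in, end in, or pass through southern Manhattan; the pickup and drop-off zones should be compared to identify which origin-destination pairs are likely to be charged  
'tips': can be used to suggest a tip amount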
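
The airport-fee rule above can be sketched as a function of the pickup and drop-off zones. This is an illustration only: the zone IDs 1 (Newark), 132 (JFK) and 138 (LaGuardia) are assumed from the TLC taxi-zone lookup, and `expected_airport_fee` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

# Assumption: TLC taxi-zone IDs for Newark (1), JFK (132) and LaGuardia (138).
AIRPORT_ZONES = [1, 132, 138]
FEE_PER_AIRPORT_END = 2.5  # USD per airport trip end, per the rule described above


def expected_airport_fee(trips: pd.DataFrame) -> pd.Series:
    """Return 0.0, 2.5 or 5.0 depending on how many trip ends are airports."""
    pickup_at_airport = trips["PULocationID"].isin(AIRPORT_ZONES).astype(int)
    dropoff_at_airport = trips["DOLocationID"].isin(AIRPORT_ZONES).astype(int)
    return (pickup_at_airport + dropoff_at_airport) * FEE_PER_AIRPORT_END


# Made-up trips: no airport, airport pickup only, airport at both ends.
toy = pd.DataFrame({"PULocationID": [174, 132, 138], "DOLocationID": [126, 163, 132]})
toy["expected_airport_fee"] = expected_airport_fee(toy)
print(toy)
```

Comparing this expected value against the recorded 'airport_fee' column would be one way to check whether the rule holds in the data.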

The congestion surcharge ('congestion_surcharge') is $2.75, or $0.75 per passenger on a shared ride

In [9]:
uber_202304.congestion_surcharge.value_counts()

0.00    7964577
2.75    5912874
0.75     120962
Name: congestion_surcharge, dtype: int64
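
One way to find the origin-destination pairs that are likely to be charged is to measure, for each zone pair, the share of trips with a non-zero surcharge. The sketch below uses made-up rows; the 0.5 cutoff (`MIN_SHARE`) is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Toy trips (made-up values); the real input would be uber_202304.
toy = pd.DataFrame({
    "PULocationID": [211, 211, 174, 263, 263],
    "DOLocationID": [163, 163, 126, 247, 247],
    "congestion_surcharge": [2.75, 2.75, 0.0, 2.75, 0.0],
})

# Per zone pair: number of trips and fraction of them that paid a surcharge.
charged_share = (
    toy.assign(charged=toy["congestion_surcharge"] > 0)
       .groupby(["PULocationID", "DOLocationID"])["charged"]
       .agg(trips="size", share_charged="mean")
       .reset_index()
)

# Assumption: treat a pair as "likely charged" when at least half its trips were.
MIN_SHARE = 0.5
likely_charged = charged_share[charged_share["share_charged"] >= MIN_SHARE]
print(likely_charged)
```

Pairs with a share strictly between 0 and 1 are probably the "passes through" cases, where the surcharge depends on the route actually taken.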

Compare the pickup and drop-off zones to work out the tolls ('tolls')
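
A possible sketch of this step: build a lookup of the typical toll per zone pair from past trips, then attach it to new trips. Using the median as the "typical" toll is an assumption, and the frames and the `typical_tolls` column name are hypothetical.

```python
import pandas as pd

# Made-up historical trips; the real input would be uber_202304.
history = pd.DataFrame({
    "PULocationID": [132, 132, 132, 76],
    "DOLocationID": [163, 163, 163, 124],
    "tolls": [6.94, 6.94, 0.0, 0.0],
})

# Typical toll per zone pair (median chosen as an assumption).
toll_lookup = (
    history.groupby(["PULocationID", "DOLocationID"], as_index=False)["tolls"]
           .median()
           .rename(columns={"tolls": "typical_tolls"})
)

# New trips get the typical toll of their zone pair; unseen pairs stay NaN.
new_trips = pd.DataFrame({"PULocationID": [132, 76, 1], "DOLocationID": [163, 124, 2]})
estimated = new_trips.merge(toll_lookup, on=["PULocationID", "DOLocationID"], how="left")
print(estimated)
```

Zone pairs that never appear in the historical data come back as missing, so they would need a fallback before modeling.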