In [None]:
from src.data_preprocessing import *

Let's prepare the series from `2020-01` through `2021-07`.

Since we are working with time series, we want complete data for every month in the range. **That's why the start and end are given as months, without days.**
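
As a rough illustration of what a month-level range covers, the sketch below enumerates the months between the two bounds, assuming both ends are inclusive. `month_range` is a hypothetical helper written for this note, not a function from `data_preprocessing.py`.

```python
def month_range(start, end):
    """Return all 'YYYY-MM' labels from start to end, both inclusive.

    Hypothetical helper for illustration only.
    """
    start_year, start_month = map(int, start.split('-'))
    end_year, end_month = map(int, end.split('-'))

    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f'{year:04d}-{month:02d}')
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


months = month_range('2020-01', '2021-07')
print(len(months), months[0], months[-1])
```

Under that assumption the range holds 19 months, so specifying only `YYYY-MM` bounds never cuts a month in half.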

All preprocessing functions were written in advance and live in the `data_preprocessing.py` module.

**There is also a PySpark implementation** in the `data_preprocessing_pyspark.py` module.

In [2]:
# Main Parameters
DATE_START = '2020-01'
DATE_END = '2021-07'
PROJECT_SOURCE = 'D:/AI-Projects/Machine_Learning/ENG/TLC_NY_Trip_Demand_Prediction'

In [3]:
# GeoData Processing (polygons) 
polygon_data = preprocess_polygon_data('taxi_zones_polygons.geojson', file_path=PROJECT_SOURCE)
UNIQUE_DISTRICTS = polygon_data['location_id'].unique().tolist()

In [4]:
# ETL
f_path = PROJECT_SOURCE+'/taxi_data'
trips_df, mean_trips_df = get_series_data(file_path=f_path,
                                          start_date=DATE_START, 
                                          end_date=DATE_END,
                                          unique_districts=UNIQUE_DISTRICTS,
                                          get_monthly_avg=True)

 11%|█         | 2/19 [00:55<07:50, 27.70s/it]

Missing Date Found:  {Timestamp('2020-03-08 02:00:00', freq='H')}


 74%|███████▎  | 14/19 [02:49<00:52, 10.44s/it]

Missing Date Found:  {Timestamp('2021-03-14 02:00:00', freq='H')}


100%|██████████| 19/19 [04:02<00:00, 12.77s/it]
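
Both missing timestamps fall at 02:00 on a Sunday in March, which is consistent with the start of US daylight saving time, when clocks jump from 02:00 to 03:00 and that local hour does not exist. This reading is an assumption, since the preprocessing output does not state the cause. A minimal check, using a hypothetical helper `second_sunday_of_march` and the rule (in force since 2007) that US daylight saving time begins on the second Sunday of March:

```python
from datetime import date, timedelta


def second_sunday_of_march(year):
    """Return the date of the second Sunday of March (hypothetical helper)."""
    first_of_march = date(year, 3, 1)
    # weekday(): Monday == 0 ... Sunday == 6
    days_to_first_sunday = (6 - first_of_march.weekday()) % 7
    return first_of_march + timedelta(days=days_to_first_sunday + 7)


for year in (2020, 2021):
    print(year, second_sunday_of_march(year))
```

The two dates produced are 2020-03-08 and 2021-03-14, the same days reported as missing above.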


A couple of city districts have no recorded trips at all, so we exclude them.

**We cannot forecast for districts 103 and 104 (no data)**; in further analysis these series would only add noise.

In [7]:
# Find districts with only zero values
print('Districts with only 0 values:')
for district in trips_df['PULocationID'].unique():
    if trips_df[trips_df['PULocationID'] == district]['n_trips'].sum() == 0:
        print(district)

Districts with only 0 values:
103
104
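
The loop above filters the whole frame once per district. A vectorized alternative, sketched here on a small made-up frame with the same `PULocationID` and `n_trips` columns, sums trips per district in one pass with `groupby`. The toy data and the `zero_districts` name are illustrative only, and the check assumes `n_trips` is never negative, so a zero sum means every value is zero.

```python
import pandas as pd

# Toy data standing in for trips_df (values are made up)
toy_df = pd.DataFrame({
    'PULocationID': [1, 1, 103, 103, 104, 104],
    'n_trips':      [5, 0, 0, 0, 0, 0],
})

# Total trips per district, computed in a single pass
totals = toy_df.groupby('PULocationID')['n_trips'].sum()

# Districts whose total is zero have no trips in any hour
zero_districts = totals[totals == 0].index.tolist()
print(zero_districts)  # [103, 104]
```

The resulting list could then be passed to `isin` instead of hard-coding the district IDs.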


In [8]:
# Exclude districts 103 and 104
trips_df = trips_df[~trips_df['PULocationID'].isin([103, 104])]
mean_trips_df = mean_trips_df[~mean_trips_df['PULocationID'].isin([103, 104])]

### Saving Results

In [None]:
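# Save processed data
import os

# Build explicit output paths instead of changing the working directory
out_dir = os.path.join(PROJECT_SOURCE, 'processed_data')

trips_df.to_csv(os.path.join(out_dir, 'processed_ts_main.csv'), index=False)
mean_trips_df.to_csv(os.path.join(out_dir, 'processed_ts_monthly.csv'), index=False)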