### Automating 1, 2, 3: From raw data to training data

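In this notebook we combine the steps that take us from raw data to training data for modeling: loading the raw data, transforming it into time series data, and transforming that into features and targets.

The functions and paths used here are defined in the `data.py` and `paths.py` modules of the `src` package.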

In [2]:
# Tell Jupyter to reload modules automatically before executing code
%reload_ext autoreload
%autoreload 2

Let us load the raw ride data for 2022.

In [6]:
import sys
sys.path.append(r"C:\Users\User\capstone_project")
from src.data import load_raw_data

rides = load_raw_data(year=2022) # fetches 2022 data
rides

Downloading file 2022-01
Downloading file 2022-02
Downloading file 2022-03
Downloading file 2022-04
Downloading file 2022-05
Downloading file 2022-06
Downloading file 2022-07
Downloading file 2022-08
Downloading file 2022-09
Downloading file 2022-10
Downloading file 2022-11
Downloading file 2022-12


Unnamed: 0,pickup_datetime,pickup_location_id
0,2022-01-01 00:35:40,142
1,2022-01-01 00:33:43,236
2,2022-01-01 00:53:21,166
3,2022-01-01 00:25:21,114
4,2022-01-01 00:36:48,68
...,...,...
3399544,2022-12-31 23:46:00,16
3399545,2022-12-31 23:13:24,75
3399546,2022-12-31 23:00:49,168
3399547,2022-12-31 23:02:50,238


Let us transform the raw data into time series data.

In [7]:
from src.data import transform_raw_data_into_ts_data

ts_data = transform_raw_data_into_ts_data(rides)
ts_data

100%|██████████| 265/265 [00:02<00:00, 92.05it/s] 


Unnamed: 0,pickup_hour,rides,pickup_location_id
0,2022-01-01 00:00:00,0,1
1,2022-01-01 01:00:00,0,1
2,2022-01-01 02:00:00,0,1
3,2022-01-01 03:00:00,0,1
4,2022-01-01 04:00:00,1,1
...,...,...,...
2321395,2022-12-31 19:00:00,2,265
2321396,2022-12-31 20:00:00,2,265
2321397,2022-12-31 21:00:00,7,265
2321398,2022-12-31 22:00:00,3,265
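
To make this step concrete, here is a minimal sketch of how pickups could be aggregated into hourly counts per location, with hours that have no rides filled with 0. It is not the actual implementation of `transform_raw_data_into_ts_data`; the helper name `rides_to_hourly_counts` is hypothetical, and unlike the output above it only covers locations that appear in the input.

```python
import pandas as pd


def rides_to_hourly_counts(rides: pd.DataFrame) -> pd.DataFrame:
    """Count pickups per (location, hour) and fill missing hours with 0.

    Hypothetical helper for illustration only; assumes the columns
    `pickup_datetime` (datetime64) and `pickup_location_id`.
    """
    rides = rides.copy()
    # Truncate each pickup time to the start of its hour
    rides["pickup_hour"] = rides["pickup_datetime"].dt.floor("h")

    # Number of pickups per (location, hour) pair that actually occurred
    counts = (
        rides.groupby(["pickup_location_id", "pickup_hour"])
        .size()
        .reset_index(name="rides")
    )

    # Full grid of every observed location and every hour in the range
    full_hours = pd.date_range(
        counts["pickup_hour"].min(), counts["pickup_hour"].max(), freq="h"
    )
    locations = sorted(counts["pickup_location_id"].unique())
    full_index = pd.MultiIndex.from_product(
        [locations, full_hours], names=["pickup_location_id", "pickup_hour"]
    )

    # Reindex onto the full grid so hours without rides become 0
    ts = (
        counts.set_index(["pickup_location_id", "pickup_hour"])["rides"]
        .reindex(full_index, fill_value=0)
        .reset_index()
    )
    return ts[["pickup_hour", "rides", "pickup_location_id"]]


# Tiny made-up sample, not taken from the real dataset
sample = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            [
                "2022-01-01 00:35:40",
                "2022-01-01 00:33:43",
                "2022-01-01 02:10:00",
                "2022-01-01 01:05:00",
            ]
        ),
        "pickup_location_id": [142, 142, 142, 236],
    }
)
ts_sample = rides_to_hourly_counts(sample)
print(ts_sample)
```

The key design point is the reindex onto a complete (location, hour) grid: without it, hours with zero rides would simply be absent, and the series for different locations would have different lengths.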


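Let us now transform the time series data into tabular data (features and targets). Each example uses the previous `input_seq_len` hours of rides as features and the rides in the next hour as the target.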

In [8]:
from src.data import transform_ts_data_into_features_and_target

features, targets = transform_ts_data_into_features_and_target(
    ts_data,
    input_seq_len=24*28*1, # 28 days (four weeks) of hourly values
    step_size=24,
)

print(f'{features.shape=}')
print(f'{targets.shape=}')

100%|██████████| 265/265 [00:20<00:00, 13.02it/s]

features.shape=(89305, 674)
targets.shape=(89305,)
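
The shapes above can be checked with a small sliding-window sketch. This is an illustration under assumptions, not the code in `src/data.py`: the helper names are hypothetical, and it assumes each example takes `input_seq_len` consecutive hours as features, the following hour as the target, and that the window start advances by `step_size` hours.

```python
import numpy as np


def sliding_window_indices(n_hours, input_seq_len, step_size):
    """Return (start, end) index pairs; the target sits at index `end`.

    Hypothetical helper for illustration only.
    """
    indices = []
    start = 0
    # Each example needs input_seq_len feature hours plus 1 target hour
    while start + input_seq_len + 1 <= n_hours:
        indices.append((start, start + input_seq_len))
        start += step_size
    return indices


def make_features_and_target(series, input_seq_len, step_size):
    """Slice one location's hourly series into features x and target y."""
    idx = sliding_window_indices(len(series), input_seq_len, step_size)
    x = np.array([series[a:b] for a, b in idx])
    y = np.array([series[b] for _, b in idx])
    return x, y


# One location over 2022: 365 * 24 = 8760 hourly values (dummy data)
series = np.arange(8760)
x, y = make_features_and_target(series, input_seq_len=24 * 28, step_size=24)
print(f"{x.shape=}")
print(f"{y.shape=}")
print(f"{x.shape[0] * 265=}")
```

Under these assumptions each location yields 337 examples of 672 lagged values, and 337 × 265 locations gives the 89305 rows reported above. The two remaining feature columns (674 − 672) are presumably `pickup_hour` and `pickup_location_id`, although the notebook output does not show them.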





Now let us save our data to be used for modeling.

In [9]:
from src.paths import TRANSFORMED_DATA_DIR

# Copy so that adding the target column does not modify `features` in place
tabular_data = features.copy()
tabular_data['target_rides_next_hour'] = targets

tabular_data.to_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')

We can now move to the next step of model building.