# Near-time Forecasting

## Recalculating Features

**Select the data, but this time cut it off at an arbitrary point**

Let's say we want to make a prediction now, given the data we have.

In [1]:
import pandas as pd
csv_file_path = 'data/energy_data_new.csv'
df = (pd.read_csv(csv_file_path, parse_dates=['period'])
      .set_index('period')
      .sort_index(ascending=False))
df = df.tail(-81)  # drop the 81 most recent rows to simulate an arbitrary cut-off
df

period,subba,subba-name,parent,parent-name,value,value-units
2024-04-04 14:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5752,megawatthours
2024-04-04 13:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5630,megawatthours
2024-04-04 12:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5420,megawatthours
2024-04-04 11:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,4989,megawatthours
2024-04-04 10:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,4567,megawatthours
...,...,...,...,...,...,...
2024-01-01 04:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,4956,megawatthours
2024-01-01 03:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5112,megawatthours
2024-01-01 02:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5257,megawatthours
2024-01-01 01:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5417,megawatthours


**To make a prediction, we need the minimum number of historic timesteps (the offset) required to calculate our features**

We read this from the YAML config file.

In [2]:
import yaml
import os
directory = "feature_store"
yaml_file_path = os.path.join(directory, 'config_v1.yaml')
with open(yaml_file_path, 'r') as file:
    config = yaml.safe_load(file)

max_offset_days = int(config['feature_store']['feature_offset'])

print(max_offset_days)

13


**Convert 13 days into hours**

In [3]:
max_offset_hours = (max_offset_days + 1) * 24 # +1 because we need the current day + the offset
max_offset_hours

336
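
The `+ 1` matters because the offset counts only the earlier days; the current day adds one more. A minimal sketch of that arithmetic, where the helper name `required_history_hours` is hypothetical and not part of the feature store:

```python
HOURS_PER_DAY = 24

def required_history_hours(offset_days: int) -> int:
    """Hours of history covering the current day plus `offset_days` earlier days."""
    if offset_days < 0:
        raise ValueError("offset_days must be non-negative")
    return (offset_days + 1) * HOURS_PER_DAY

hours = required_history_hours(13)
print(hours)  # 336
```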

**Select this mini-batch of data**

This is the minimum set of data required to make a prediction with our model.

In [4]:
# Get one mini-batch of data
mini_batch_df = df[:max_offset_hours]
mini_batch_df

period,subba,subba-name,parent,parent-name,value,value-units
2024-04-04 14:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5752,megawatthours
2024-04-04 13:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5630,megawatthours
2024-04-04 12:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5420,megawatthours
2024-04-04 11:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,4989,megawatthours
2024-04-04 10:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,4567,megawatthours
...,...,...,...,...,...,...
2024-03-21 19:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5767,megawatthours
2024-03-21 18:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5816,megawatthours
2024-03-21 17:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5852,megawatthours
2024-03-21 16:00:00,ZONJ,New York City,NYIS,New York Independent System Operator,5896,megawatthours


**Implement a new sampling strategy**

The model is trained on daily data. We don't have full calendar days, so we need to resample into 24-hour intervals counted back from the latest timestamp.

In [5]:
chunk_size = 24
periods = mini_batch_df.index[::chunk_size]  # Select every chunk_size-th index as the period
sums = [mini_batch_df.iloc[i:i + chunk_size]['value'].sum() for i in range(0, len(mini_batch_df), chunk_size)]
resampled_df = pd.DataFrame({'period': periods, 'value': sums})
resampled_df.set_index('period', inplace=True)
resampled_df

period,value
2024-04-04 14:00:00,131001
2024-04-03 14:00:00,126545
2024-04-02 14:00:00,122018
2024-04-01 14:00:00,110076
2024-03-31 14:00:00,109811
2024-03-30 14:00:00,118276
2024-03-29 14:00:00,125478
2024-03-28 14:00:00,125171
2024-03-27 14:00:00,124573
2024-03-26 14:00:00,125662
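
Position-based chunking like this assumes the mini-batch holds exactly one row per hour with no gaps; otherwise a 24-row chunk would cover more or less than one day. A hedged sketch of a guard that could run before resampling, using synthetic data. `check_hourly_contiguous` is a hypothetical helper, not part of the pipeline:

```python
import pandas as pd

def check_hourly_contiguous(df: pd.DataFrame, chunk_size: int = 24) -> bool:
    """True if the newest-first index steps back exactly one hour per row
    and the row count splits into whole chunks."""
    if len(df) == 0 or len(df) % chunk_size != 0:
        return False
    # In a newest-first index every consecutive difference should be -1 hour
    steps = df.index.to_series().diff().dropna()
    return bool((steps == pd.Timedelta(hours=-1)).all())

# Synthetic newest-first hourly index (49 timestamps)
idx = pd.date_range(end="2024-04-04 14:00:00", periods=49,
                    freq=pd.Timedelta(hours=1))[::-1]

complete = pd.DataFrame({"value": range(48)}, index=idx[:48])
# 48 rows as well, but one hour is missing in the middle
with_gap = pd.DataFrame({"value": range(49)}, index=idx).drop(idx[10])

print(check_hourly_contiguous(complete))
print(check_hourly_contiguous(with_gap))
```

If the check fails, one option is to reindex to a full hourly range and decide explicitly how to fill the missing hours before summing.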


**Implement the resampling strategy in our feature pipeline**

In [6]:
# Recalculating features

def feature_pipeline_online(mini_batch_df):
    
    # Resample into 24-hour chunks counted back from the latest timestamp
    chunk_size = 24
    periods = mini_batch_df.index[::chunk_size]  # Select every chunk_size-th index as the period
    sums = [mini_batch_df.iloc[i:i + chunk_size]['value'].sum() for i in range(0, len(mini_batch_df), chunk_size)]
    resampled_df = pd.DataFrame({'period': periods, 'value': sums})
    resampled_df.set_index('period', inplace=True)
    resampled_df = resampled_df.sort_index(ascending = True)

    batch_df = pd.DataFrame()

    # Lagging features
    batch_df['lag_1'] = resampled_df['value'].shift(1) # Energy demand -1 day

    batch_df['lag_4'] = resampled_df['value'].shift(4) # Energy demand 4 days earlier (7 days back, +3)
    batch_df['lag_5'] = resampled_df['value'].shift(5) # Energy demand 5 days earlier (7 days back, +2)
    batch_df['lag_6'] = resampled_df['value'].shift(6) # Energy demand 6 days earlier (7 days back, +1)

    batch_df['lag_11'] = resampled_df['value'].shift(11) # Energy demand 11 days earlier (14 days back, +3)
    batch_df['lag_12'] = resampled_df['value'].shift(12) # Energy demand 12 days earlier (14 days back, +2)
    batch_df['lag_13'] = resampled_df['value'].shift(13) # Energy demand 13 days earlier (14 days back, +1)

    # Rolling statistics
    batch_df['rolling_mean_7'] = resampled_df['value'].rolling(window=7).mean().round(2)
    batch_df['rolling_std_7'] = resampled_df['value'].rolling(window=7).std().round(2) 
    
    batch_df = batch_df.dropna()
    
    return batch_df

**Test the new pipeline**

In [7]:
feature_pipeline_online(mini_batch_df)

period,lag_1,lag_4,lag_5,lag_6,lag_11,lag_12,lag_13,rolling_mean_7,rolling_std_7
2024-04-04 14:00:00,126545.0,109811.0,118276.0,125478.0,121841.0,125884.0,130810.0,120457.86,8182.77
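
Only one row comes back because `lag_13` is defined only for the newest of the 14 daily values; every older row contains a NaN and is removed by `dropna()`. A small sketch on synthetic daily values (the numbers are made up for illustration) shows the effect with a subset of the features:

```python
import pandas as pd

# 14 synthetic daily totals, oldest first: 100, 101, ..., 113
daily = pd.Series(
    range(100, 114),
    index=pd.date_range("2024-03-22", periods=14, freq="D"),
    name="value",
)

features = pd.DataFrame({
    "lag_1": daily.shift(1),
    "lag_13": daily.shift(13),
    "rolling_mean_7": daily.rolling(window=7).mean(),
}).dropna()

print(features)
```

With fewer than 14 daily values, `lag_13` would be NaN everywhere and the result would be empty, which is why the offset from the config defines the minimum batch size.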


**Finally, let's simulate online feature calculation for a given date**

In [8]:
# Try with a different cut-off time
datetime = "2024-04-03 10:00:00"
csv_file_path = 'data/energy_data_new.csv'

# This could also be a SQL statement
mini_batch_df = (pd.read_csv(csv_file_path, parse_dates=['period'])
      .set_index('period')
      .sort_index(ascending=False)
      .query("period <= @datetime"))[:max_offset_hours]

# Process batch on the fly
feature_pipeline_online(mini_batch_df)

period,lag_1,lag_4,lag_5,lag_6,lag_11,lag_12,lag_13,rolling_mean_7,rolling_std_7
2024-04-03 10:00:00,121369.0,120419.0,126130.0,125423.0,129076.0,130762.0,126434.0,119585.43,7639.02
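
The comment in the cell notes that the selection could also be a SQL statement. A minimal sketch of what that might look like, using an in-memory SQLite database. The table name `demand`, its columns, and the synthetic values are assumptions for illustration, not the actual data store:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demand (period TEXT PRIMARY KEY, value INTEGER)")

# Synthetic hourly rows; ISO-style timestamp strings sort chronologically as text
start = datetime(2024, 3, 1, 0, 0)
rows = [
    ((start + timedelta(hours=h)).strftime("%Y-%m-%d %H:%M:%S"), 5000 + h % 24)
    for h in range(24 * 40)
]
conn.executemany("INSERT INTO demand VALUES (?, ?)", rows)

def select_mini_batch(conn, cutoff: str, max_offset_hours: int):
    """Newest-first rows at or before `cutoff`, limited to the required history."""
    query = """
        SELECT period, value
        FROM demand
        WHERE period <= ?
        ORDER BY period DESC
        LIMIT ?
    """
    return conn.execute(query, (cutoff, max_offset_hours)).fetchall()

mini_batch = select_mini_batch(conn, "2024-04-03 10:00:00", 336)
print(len(mini_batch), mini_batch[0][0], mini_batch[-1][0])
```

Pushing the filter and the row limit into the query means only the rows needed for the features leave the database, instead of reading the full history and slicing it afterwards.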


## Frequency considerations

**First consideration: Time to load the data**

Batch data I/O operations typically add overhead.

In [19]:
import pandas as pd
import time
import yaml

# Load offset from yaml
yaml_file_path = 'feature_store/config_v1.yaml'
with open(yaml_file_path, 'r') as file:
    config = yaml.safe_load(file)
max_offset_days = int(config['feature_store']['feature_offset'])
max_offset_hours = (max_offset_days + 1) * 24 

# Measure time to read data
start_time = time.time()

datetime = "2024-04-03 10:00:00"
csv_file_path = 'data/energy_data_new.csv'

# This could also be a SQL statement
mini_batch_df = (pd.read_csv(csv_file_path, parse_dates=['period'])
      .set_index('period')
      .sort_index(ascending=False)
      .query("period <= @datetime"))[:max_offset_hours]

elapsed_time = time.time() - start_time

print(f"Time to read data: {elapsed_time:.3f} seconds")

Time to read data: 0.010 seconds


**Second consideration: Time for the feature processing**

For large datasets, moving data is typically much slower than processing it, because processing can often happen in memory or draw on cached results.

In [20]:
from scripts import feature_processing

# Start the timer after the import so module loading is not counted
start_time = time.time()

feature_processing.feature_pipeline_online(mini_batch_df)

elapsed_time = time.time() - start_time

print(f"Time to process batch: {elapsed_time:.3f} seconds")

Time to process batch: 0.008 seconds


**Remember:**

The frequency at which you can run your batch pipeline is constrained by the time it takes to load your data and to process it, and of course by the availability of your data.

So far, we're dealing just with one data source!
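
A rough way to check this constraint is to time each stage and compare the total with the interval at which the pipeline should run. The sketch below uses `time.perf_counter`, which is generally better suited to measuring short durations than `time.time`. `timed` and `fits_schedule` are hypothetical helpers, and the two stub functions stand in for the real load and processing steps:

```python
import time

def timed(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def fits_schedule(stage_seconds, interval_seconds: float) -> bool:
    """True if all stages together finish within one scheduling interval."""
    return sum(stage_seconds) < interval_seconds

def load_stub():
    # Stand-in for reading the mini-batch
    return list(range(336))

def process_stub(values):
    # Stand-in for the feature pipeline
    return sum(values)

batch, load_s = timed(load_stub)
features, process_s = timed(process_stub, batch)

ok = fits_schedule([load_s, process_s], interval_seconds=3600)
print(f"load={load_s:.6f}s process={process_s:.6f}s fits hourly schedule: {ok}")
```

With more than one data source, each additional load would contribute another entry to the list of stage times.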

## Online Prediction

**Let's simulate new data coming in**

(This data was not in the training set)

In [1]:
import pandas as pd
from scripts import feature_processing
import yaml

yaml_file_path = 'feature_store/config_v1.yaml'
with open(yaml_file_path, 'r') as file:
    config = yaml.safe_load(file)
max_offset_days = int(config['feature_store']['feature_offset'])
max_offset_hours = (max_offset_days + 1) * 24 

datetime = "2024-04-05 10:00:00"
csv_file_path = 'data/energy_data_new.csv'

mini_batch_df = (pd.read_csv(csv_file_path, parse_dates=['period'])
      .set_index('period')
      .sort_index(ascending=False)
      .query("period <= @datetime"))[:max_offset_hours]

online_features_df = feature_processing.feature_pipeline_online(mini_batch_df)
online_features_df

period,lag_1,lag_4,lag_5,lag_6,lag_11,lag_12,lag_13,rolling_mean_7,rolling_std_7
2024-04-05 10:00:00,130946.0,106949.0,111094.0,120419.0,117601.0,122134.0,129076.0,120358.0,8558.07


**Load our model for next-day prediction trained on batch data and predict the next day**

In [2]:
import joblib
import xgboost  # must be importable so the pickled model can be loaded

filename = 'models/batch_demand_forecaster_model_1.pkl'
model = joblib.load(filename)
prediction = model.predict(online_features_df)
prediction

array([129164.41], dtype=float32)

**How good is this prediction?**

Load the next 24 hours

In [3]:
new_data = (pd.read_csv(csv_file_path, parse_dates=['period'])
            .set_index('period')
            .sort_index()[datetime:][:24]['value'])
new_data

period
2024-04-05 10:00:00    4434
2024-04-05 11:00:00    4867
2024-04-05 12:00:00    5249
2024-04-05 13:00:00    5462
2024-04-05 14:00:00    5588
2024-04-05 15:00:00    5613
2024-04-05 16:00:00    5717
2024-04-05 17:00:00    5719
2024-04-05 18:00:00    5650
2024-04-05 19:00:00    5614
2024-04-05 20:00:00    5620
2024-04-05 21:00:00    5685
2024-04-05 22:00:00    5691
2024-04-05 23:00:00    5574
2024-04-06 00:00:00    5566
2024-04-06 01:00:00    5498
2024-04-06 02:00:00    5348
2024-04-06 03:00:00    5110
2024-04-06 04:00:00    4844
2024-04-06 05:00:00    4598
2024-04-06 06:00:00    4418
2024-04-06 07:00:00    4279
2024-04-06 08:00:00    4205
2024-04-06 09:00:00    4186
Name: value, dtype: int64

**Find the true value by summing them**

In [4]:
true_value = new_data.sum()

true_value

np.int64(124535)
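
To put a number on the question, the prediction can be compared with the true value. A small sketch using the two figures printed in the cells above; absolute and percentage error are one reasonable choice of metric here, not necessarily the one used when the model was trained:

```python
prediction = 129164.41  # model output from the prediction cell
true_value = 124535     # sum of the next 24 hourly values

abs_error = abs(prediction - true_value)
pct_error = 100 * abs_error / true_value

print(f"Absolute error: {abs_error:.2f} MWh")
print(f"Percentage error: {pct_error:.2f}%")
```

With these numbers the forecast overshoots the actual demand by roughly 3.7%.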

For a near-time workflow, we could schedule these predictions to run every hour, with each run giving us the forecast for the next 24 hours.