# Adapt Training Data for Prediction Task

We have previously been training the model for reconstruction, i.e.:

\begin{equation}
    \hat{\mathcal{G}}_y(t) = f(~\mathcal{G}_x(t)~)
\end{equation}

where $\hat{\mathcal{G}}_y(t)$ is the complete graph signal at timestep $t$, $\mathcal{G}_x(t)$ is the partial graph signal, and $f$ is the learned reconstruction function.

For successful model invalidation, however, we need to predict the complete graph signal at $t+1$. To learn the optimal prediction function $f$, we may need an arbitrary number of previous timesteps:

\begin{equation}
    \hat{\mathcal{G}}_y(t+1) = f(~\mathcal{G}_x(t)~,~\mathcal{G}_x(t-1)~,~\mathcal{G}_x(t-2)~,~\dots~,~\mathcal{G}_x(t-n)~)
\end{equation}

This means our data handler must generate the training data so that a single training sample $i$ consists of:

\begin{equation}
x_i = \mathcal{G}_x(t)~,~\mathcal{G}_x(t-1)~,~\mathcal{G}_x(t-2)~,~\dots~,~\mathcal{G}_x(t-n)
\end{equation}

\begin{equation}
y_i = \hat{\mathcal{G}}_y(t+1)
\end{equation}
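
As a worked instance of the two definitions above, take $n=2$ (an arbitrary value chosen only for illustration). A single sample then pairs three consecutive partial signals with the complete signal one step ahead:

\begin{equation}
x_i = \mathcal{G}_x(t)~,~\mathcal{G}_x(t-1)~,~\mathcal{G}_x(t-2)
\qquad
y_i = \hat{\mathcal{G}}_y(t+1)
\end{equation}

Moving $t$ forward by one step gives the next sample, which shares two of its three inputs with this one. Note that the list $t, \dots, t-n$ contains $n+1$ signals, whereas the `n_timesteps` argument used later in this notebook counts the signals in the window directly.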

Let's make it happen!

## Load the Nominal Model Pressure Data

This is the simulation data from the BattLeDIM competition. <br>
We load it from a `.csv` file into a pandas `DataFrame`.

In [3]:
import os 
import yaml
import numpy as np
import pandas as pd

from torch_geometric.data import DataLoader

In [4]:
path_to_data = 'data/l-town-data/'
nominal_pressure = pd.read_csv(path_to_data + 'nominal_pressure.csv', index_col = 'Unnamed: 0')

In [5]:
nominal_pressure

Unnamed: 0,n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,...,n773,n774,n775,n776,n777,n778,n779,n780,n781,n782
0,28.885649,28.229593,28.925112,33.828160,36.537888,31.185562,26.183756,37.625713,32.829617,27.756403,...,52.457670,50.842834,51.985100,45.279144,48.613106,46.187430,46.464058,47.713550,49.513973,49.027523
300,28.900486,28.244280,28.939800,33.842846,36.552296,31.200220,26.198536,37.640057,32.844303,27.771090,...,52.476357,50.860176,52.003483,45.305267,48.638573,46.213013,46.489710,47.739033,49.539490,49.053160
600,28.915424,28.259079,28.954600,33.857647,36.566822,31.214993,26.213410,37.654522,32.859100,27.785880,...,52.494904,50.877390,52.021740,45.331276,48.663930,46.238487,46.515255,47.764404,49.564884,49.078682
900,28.930391,28.273897,28.969416,33.872475,36.581398,31.229792,26.228312,37.669020,32.873928,27.800697,...,52.512928,50.894110,52.039470,45.356632,48.688644,46.263317,46.540150,47.789130,49.589645,49.103565
1200,28.945330,28.288706,28.984215,33.887300,36.595974,31.244581,26.243195,37.683544,32.888744,27.815506,...,52.530228,50.910160,52.056496,45.381070,48.712467,46.287260,46.564156,47.812970,49.613506,49.127550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603600,28.254526,27.599018,28.294537,33.197620,35.908325,30.555126,25.552855,36.996414,32.199078,27.125828,...,52.390083,50.780120,51.918594,45.184193,48.520725,46.094650,46.371037,47.621143,49.421486,48.934597
603900,28.268282,27.612644,28.308163,33.211240,35.921730,30.568724,25.566566,37.009750,32.212696,27.139463,...,52.404682,50.793670,51.932964,45.204330,48.540360,46.114370,46.390812,47.640790,49.441154,48.954360
604200,28.282366,27.626598,28.322117,33.225180,35.935430,30.582640,25.580593,37.023396,32.226640,27.153416,...,52.421234,50.809032,51.949250,45.227543,48.562996,46.137093,46.413580,47.663430,49.463800,48.977104
604500,28.296803,27.640894,28.336412,33.239470,35.949455,30.596899,25.594973,37.037357,32.240917,27.167704,...,52.438650,50.825188,51.966385,45.251743,48.586594,46.160793,46.437347,47.687035,49.487434,49.000850


## Load the Sensor Locations

We need the names of the nodes that are equipped with pressure sensors.

In [6]:
# Open the dataset configuration file
with open(path_to_data + 'dataset_configuration.yml') as file:

    # Load the configuration to a dictionary
    config = yaml.load(file, Loader=yaml.FullLoader) 

# Generate a list of integers, indicating the number of each node
# at which a pressure sensor is present
sensors = [int(string.replace("n", "")) for string in config['pressure_sensors']]
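
To make the parsing step concrete, here is a minimal sketch that runs without the configuration file. The sensor names and the network size are hypothetical stand-ins, not values read from `dataset_configuration.yml`. It also shows how such a list of 1-based node numbers can be turned into a 0/1 mask over all nodes, which is the role the list plays in `dataCleaner( )`.

```python
import numpy as np

# Hypothetical stand-in for config['pressure_sensors'] (real names come from the YAML file)
sensor_names = ["n1", "n4", "n31"]
n_nodes = 40  # Hypothetical network size, for illustration only

# Same parsing as in the cell above: strip the "n" prefix and convert to int
sensor_ids = [int(name.replace("n", "")) for name in sensor_names]

# Node numbers are 1-based while array positions are 0-based, hence the -1
mask = np.zeros(n_nodes, dtype=int)
mask[np.array(sensor_ids) - 1] = 1

print(sensor_ids)
print(mask[:5])
```

The mask has a 1 at every observed node and a 0 everywhere else, so multiplying a pressure row by it zeroes out the unobserved nodes.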

## `dataCleaner( )` Function

This is the function we need to adapt for the new prediction learning task!<br>
We add a parameter to it, `n_timesteps`, which gives the number of timesteps contained in the training sample $x_i$ for a given target $y_i$.
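
A window of `n_timesteps` rows cannot start at every row of the data, so the number of samples shrinks as the window grows. The helper below is a hypothetical sketch (its name and signature are not part of the notebook) of the expected sample count, assuming the window slides one row at a time and the target is either the last row inside the window (reconstruction) or the row right after it (prediction).

```python
def expected_samples(n_obs, n_timesteps, task):
    """Hypothetical helper: number of (x, y) pairs a sliding window yields."""
    if task == "prediction":
        # The target sits one row after the window, so one more row is consumed
        return n_obs - n_timesteps
    if task == "reconstruction":
        # The target is the last row inside the window
        return n_obs - n_timesteps + 1
    raise ValueError(f"unknown task: {task}")

print(expected_samples(10, 1, "reconstruction"))
print(expected_samples(10, 1, "prediction"))
print(expected_samples(10, 3, "prediction"))
```

Comparing these counts with the first dimension of the arrays returned by the data handler is a cheap way to catch off-by-one errors in the slicing.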

In [271]:
# Function to clean the nominal pressure dataframe 
def dataCleaner(pressure_df, observed_nodes, 
                rescale=None, mode='sensor_mask', task='reconstruction', n_timesteps=None):
    '''
    Function for cleaning the pressure dataframes obtained by simulation of the
    nominal system model supplied with the BattLeDIM competition.
    The output format is suitable for ingestion by the GNN model.
    
    Parameters
    ----------
    pressure_df : pd.DataFrame
        Pandas dataframe where: 
            columns (x) = nodes
            index   (y) = observations
    observed_nodes : list of ints
        A list of numerical values indicating the nodes at which sensors are placed.
    rescale : str or None
        'standard' - standard scaling
        'minmax'   - min/max scaling
        None       - no scaling
    mode : str
        'sensor_mask' - A per timestep stacked feature output np.array as per below
        'n_timesteps' - A t-n timestep stacked feature output np.array as per below
    task : str (only used when mode='n_timesteps')
        'reconstruction' - Returns y[t]   for x[t],x[t-1]...x[t-n] timesteps
        'prediction'     - Returns y[t+1] for x[t],x[t-1]...x[t-n] timesteps
    n_timesteps : int (only used when mode='n_timesteps')
        The number of timesteps contained in each training sample
        
    Returns
    -------
    if mode='sensor_mask'
    
    x : np.array(n_obs,n_nodes,2)
        The incomplete pressure signal matrix w/ 'n' number of observations.
        This is the feature vector (x) for the GNN model
        
        x =
        [21.57, 1    <- n1, pressure at node 1 is observed
         0.0  , 0    <- n2, pressure at node 2 is unknown
         0.0  , 0    <- n3, ... unknown
         22.43, 1    <- n4, ... observed
         0.0  , 0    <- n5, ... unknown
         ...     ]   etc.
         
    if mode='n_timesteps'
    
    x : np.array(n_obs,n_nodes,n_timesteps)
        The incomplete pressure signal matrix w/ 'n' number of observations, for n timesteps
        This is the feature vector (x) for the GNN model
        
        x =
        [21.57, 22.81, 23.13, ... , t-n    <- n1, pressure at node 1 is observed
         0.0  , 0.0  , 0.0  , ... , t-n    <- n2, pressure at node 2 is unknown
         0.0  , 0.0  , 0.0  , ... , t-n    <- n3, ... unknown
         22.43, 22.51, 23.41, ... , t-n    <- n4, ... observed
         0.0  , 0.0  , 0.0  , ... , t-n    <- n5, ... unknown
         ...     ]   etc.
        
    y : np.array(n_obs,n_nodes,1)
        The complete pressure signal matrix w/ 'n' number of observations.
        With this we may train the GNN in a supervised manner.
        
        y =
        [21.57    <- n1, all values are observed
         21.89    <- n2, 
         22.17    <- n3
         22.43    <- n4
         23.79    <- n5
         ...  ]   etc.
        
    scale, bias : float or None
        The scaling range and offset that were applied, None if no scaling was done.
        
    '''     
    # The number of nodes in the passed dataframe
    n_nodes = len(pressure_df.columns)
    
    # Rename the columns (n1, n2, ...) to numerical values (1, 2, ...)
    pressure_df.columns = [number for number in range(1,n_nodes+1)]
    
    # Perform scaling on the initial Pandas DataFrame for simplicity
    # This is simpler than applying it to the numpy arrays generated later
    
    # Standard scale:
    if rescale == 'standard':
        _avg        = pressure_df.stack().mean()        # Calc. avg. over entire df.
        _std        = pressure_df.stack().std(ddof=0)   # Calc. std.. over entire df.
        bias        = _avg                              # Avg. is the scaling bias
        scale       = _std                              # Std.dev. is the scaling range
        pressure_df = (pressure_df - bias) / scale      # Scale to range
        
    # Min/max scaling (normalising):
    elif rescale == 'minmax':
        _min        = min(pressure_df.min())            # Find the absolute minimum value 
        _max        = max(pressure_df.max())            # Find the absolute maximum value
        _rng        = _max - _min                       # Calculate the difference between (range)
        bias        = _min                              # Scaling bias is the min value
        scale       = _rng                              # Scaling range is the min-max range
        pressure_df = (pressure_df - bias) / scale      # Scale to range
        
    # Perform no scaling
    else:
        bias        = None
        scale       = None
    
    # DataFrame where the index is the node number holding the sensor and the value is set to 1
    sensor_df = pd.DataFrame(data=[1 for i in observed_nodes],index=observed_nodes)
    
    # Filled single row of DataFrame with the complete number of nodes, the unmonitored nodes are set to 0 
    sensor_df = sensor_df.reindex(list(range(1,n_nodes+1)),fill_value=0)
    
    # Find the number of rows in the DataFrame to be masked...
    n_rows = len(pressure_df)
    
    # ... and complete a mask DataFrame, where all the observations to keep are set to 1 and the rest to 0
    mask_df = pd.concat([sensor_df.T] * n_rows, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    # Enforce matching indices of the two DataFrames to be broadcast together
    mask_df.index = pressure_df.index
    
    # Returns a (n_observations, n_nodes, 2) feature vector (x) where the 3rd dimension is a 0/1 mask 
    # of the observed nodes
    if mode=='sensor_mask':
        
        # Generating the incomplete feature matrix (x)
        x_mask = np.array(mask_df)
        x_arr  = np.array(pressure_df.where(cond=mask_df==1,other = 0.0))
        x      = np.stack((x_arr,x_mask),axis=2)

        # Generating the complete label matrix (y)
        y_arr  = np.array(pressure_df)
        y      = np.stack((y_arr, ),axis=2)
    
    # Returns a (n_observations, n_nodes, n_timesteps) feature vector (x) where the 3rd dimension
    # is the timesteps t, t-1, t-2 ... t-n leading to the observation to be predicted, at t+1
    elif mode=='n_timesteps':
        
        x_df         = pressure_df.where(cond=mask_df==1,other = 0.0)   # The feature dataframe (missing observations)
        y_df         = pressure_df                                      # The label dataframe (complete observation)
        
        if task == 'prediction':                                        # If we're doing prediction we set the
            n_samples = len(x_df)                                       # no.of samples as length of DF
            
        elif task == 'reconstruction':                                  # If we're doing reconstruction we set the
            n_samples = len(x_df)+1                                     # no.of samples as length of DF + 1 due to
                                                                        # slicing
                
        window_start = 0                                                # Set the start/end of the rolling window
        window_end   = n_timesteps                                      # to be used to retrieve t-n timesteps for x

        x_ = []                                                         # Initialise temp x_ and y_ lists
        y_ = []                                                         # to contain our features and labels

        for i in range(n_timesteps,n_samples):                          # For each training sample
            x_arr = (x_df.iloc[window_start:window_end].to_numpy().T)   # Add the t-n partial pressure signals
            x_.append(np.flip(x_arr,axis=1))                            # Flip the order so that t is at index 0
                                                                        # t-1 is at index 1, and so on
            
            if task == 'prediction':                                    # For prediction
                y_.append(y_df.iloc[i])                                 # Add complete observation at t+1 as label
                
            elif task == 'reconstruction':                              # For reconstruction
                y_.append(y_df.iloc[i-1])                               # Add complete observation at t as label
            
            window_start+=1                                             # Increment the
            window_end  +=1                                             # rolling window

        x = np.array(x_)                                                # Dump our lists 
        y = np.array(y_)                                                # to arrays 
        
        row,col = y.shape                                               # Reshape the label array y
        shape   = (row,col,1)                                           # so its dimensions are (n_observations, n_nodes, 1)
        y = y.reshape(shape)                                            # not (n_observations, n_nodes)
        
    return x,y,scale,bias                                               # Return the features, labels, scale & bias

In [272]:
x,y,scale,bias = dataCleaner(pressure_df    = nominal_pressure, 
                             observed_nodes = sensors,
                             rescale        = 'minmax',
                             mode           = 'n_timesteps',
                             task           = 'prediction',
                             n_timesteps    = 2)
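
The returned `scale` and `bias` make it possible to map scaled values back to pressures. The sketch below uses a small hypothetical array rather than the dataset, and assumes the min/max convention used in `dataCleaner( )`, where `bias` is the global minimum and `scale` is the global range.

```python
import numpy as np

# Hypothetical pressures (not taken from the dataset)
pressure = np.array([[28.9, 33.8],
                     [26.2, 52.5]])

# Min/max scaling: bias is the global minimum, scale is the global range
bias   = pressure.min()
scale  = pressure.max() - bias
scaled = (pressure - bias) / scale

# Inverse transform: multiply by the range, then add the minimum back
restored = scaled * scale + bias

print(np.allclose(restored, pressure))
```

This inverse is meaningful for the labels `y` and for the observed entries of `x`. Unobserved entries of `x` are set to 0.0 after scaling, so inverting them would only return the bias, and a 0/1 mask channel should not be rescaled at all.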

In [273]:
timesteps = [1,1,1,2,3,6,12,24]
modes     = ['n_timesteps' for i in range(len(timesteps)-1)]
modes.insert(0,'sensor_mask')
tasks     = ['prediction' for i in range(len(timesteps)-2)]
tasks.insert(0,'reconstruction')
tasks.insert(0,'reconstruction')

In [274]:
def print_task_name(task, mode, timestep):
    message = (' '*5 + str(timestep)+' '*5+task.upper()+' '*5+mode.upper()+' '*5).replace('_',' ')
    print('+' + '-' * len(message) + '+')
    print('|' + ' ' * len(message) + '|')
    print('|' +           message  + '|')
    print('|' + ' ' * len(message) + '|')
    print('+' + '-' * len(message) + '+')

In [275]:
for task, mode, timestep in zip(tasks,modes,timesteps):
    print_task_name(task, mode, timestep)
    
    x,y,scale,bias = dataCleaner(pressure_df    = nominal_pressure, 
                             observed_nodes = sensors,
                             rescale        = 'minmax',
                             mode           = mode,
                             task           = task,
                             n_timesteps    = timestep)
    
    print('x-shape:\t{}'.format(x.shape))
    print('y-shape:\t{}'.format(y.shape))
    

+----------------------------------------------+
|                                              |
|     1     RECONSTRUCTION     SENSOR MASK     |
|                                              |
+----------------------------------------------+
x-shape:	(2017, 782, 2)
y-shape:	(2017, 782, 1)
+----------------------------------------------+
|                                              |
|     1     RECONSTRUCTION     N TIMESTEPS     |
|                                              |
+----------------------------------------------+
x-shape:	(2017, 782, 1)
y-shape:	(2017, 782, 1)
+------------------------------------------+
|                                          |
|     1     PREDICTION     N TIMESTEPS     |
|                                          |
+------------------------------------------+
x-shape:	(2016, 782, 1)
y-shape:	(2016, 782, 1)
+------------------------------------------+
|                                          |
|     2     PREDICTION     N TIMESTEPS     |
|     

## Sanity Check `for`-Loop to Split the Training Data

Let's see if we can make sense of this. <br>
We define a window size `n` that gives the number of timesteps for each target. <br>
To slide the window over the training data, we create `first` and `last` indices, which we increment simultaneously. <br>
`last` is initialised as `n`, so at the first iteration for `n=3` we have `first=0` and `last=3`. <br>
This means we can slice from `x` with `x[first:last]`, i.e. `x[0:3]`. <br>
This slice returns `x[0]`, `x[1]` and `x[2]`. <br>
Our `for`-loop runs from `n` to `len(y)` with `i` as the loop index. <br>
Thus the first target for the training samples `x[0:3]` is `y[i]`, which is `y[3]` in the first iteration. <br>
The code below runs the same loop with `n=2` on two short dummy lists.

In [30]:
window = (0,3)

In [31]:
n     = 2   # The number of timesteps to contain in x
first = 0   # The first index at the first timestep, we will increment this
last  = n   # The last index at the first timestep, this will also be incremented
x_p   = []  # An empty list to contain the training samples
y_p   = []  # An empty list to contain the targets

_y = ['a', 'b', 'c', 'd', 'e', 'f', 'g']  # Dummy lists for
_x = [ 1 ,  2 ,  3 ,  4 ,  5 ,  6 ,  7 ]  # sanity checking

for i in range(n,len(_y)):
    x_p.append(np.array(_x[first:last]))  # Get a window of training samples
    y_p.append(_y[i])                     # Get a single target for the given window
    first += 1                            # Increment first window index
    last  += 1                            # Increment second window index
    
print(x_p)
print(y_p)

[array([1, 2]), array([2, 3]), array([3, 4]), array([4, 5]), array([5, 6])]
['c', 'd', 'e', 'f', 'g']
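
For comparison, the same windows can be produced without an explicit loop. This is an alternative sketch, not the approach used in the rest of the notebook, and it assumes NumPy 1.20 or newer, where `sliding_window_view` is available.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n  = 2
_x = np.array([1, 2, 3, 4, 5, 6, 7])
_y = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# All length-n windows over _x; the last window has no target after it, so drop it
windows = sliding_window_view(_x, n)[:len(_x) - n]

# The target of the window starting at index k is the element at index k + n
targets = _y[n:]

print(windows.tolist())
print(targets)
```

`sliding_window_view` returns a read-only view onto the original array, so call `.copy()` on the result if the windows need to be modified.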


Let's create a function out of the loop above:

In [9]:
def predictionTaskDataSplitter(x, y, n_timesteps):
    window_start = 0
    window_end = n_timesteps
    n_samples = len(y)
    x_new = []
    y_new = []
    
    for i in range(n_timesteps, n_samples):
        x_new.append( np.array( x[window_start:window_end] ) )
        y_new.append( y[i] )
        window_start += 1
        window_end   += 1  
        
    return np.array(x_new) , np.array(y_new)

In [10]:
x_p, y_p = predictionTaskDataSplitter(x=x, 
                                      y=y, 
                                      n_timesteps=3)

In [11]:
x_p

array([[[[0.08288128, 1.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         ...,
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.08318296, 1.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         ...,
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.0834867 , 1.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         ...,
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]]],


       [[[0.08318296, 1.        ],
         [0.        , 0.        ],
         [0.        , 0.        ],
         ...,
         [0.        , 0.        ],
         [0.        , 0.        ],
         [0.        , 0.        ]],

        [[0.0834867 , 1.        ],
         [0.        , 0.        ],
         [0.        , 0.

In [34]:
x_p[0].shape

(3, 782, 2)

In [35]:
y_p[0].shape

(782, 1)

In [37]:
x_p.shape

(2014, 3, 782, 2)
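
The first dimension matches the 2017 observations minus the window length of 3. To check that each window is also paired with the right target, the sketch below restates the splitter logic under a hypothetical name (`split_windows`) and runs it on small random arrays with the same layout as `x` and `y`. The array sizes are arbitrary.

```python
import numpy as np

def split_windows(x, y, n_timesteps):
    # Same logic as predictionTaskDataSplitter, restated so this cell runs on its own
    idx   = range(n_timesteps, len(y))
    x_new = [x[i - n_timesteps:i] for i in idx]
    y_new = [y[i] for i in idx]
    return np.array(x_new), np.array(y_new)

# Small random stand-ins with the layout (n_obs, n_nodes, 2) and (n_obs, n_nodes, 1)
rng = np.random.default_rng(0)
n_obs, n_nodes, n = 10, 4, 3
x_demo = rng.random((n_obs, n_nodes, 2))
y_demo = rng.random((n_obs, n_nodes, 1))

x_w, y_w = split_windows(x_demo, y_demo, n)

print(x_w.shape)
print(y_w.shape)

# Window k covers rows k .. k+n-1 and its target is row k+n
for k in range(len(y_w)):
    assert np.array_equal(x_w[k], x_demo[k:k + n])
    assert np.array_equal(y_w[k], y_demo[k + n])
```

The windows here are in chronological order (oldest timestep first), whereas the `n_timesteps` mode of `dataCleaner( )` flips each window so that timestep `t` sits at index 0.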