# Imports and Description

This notebook reads the .hdf5 file containing the Pandas dataframe that was produced by converting the simulator cache.  It then converts that dataframe into the numpy array format needed for training the LSTM network.

In [1]:
# Data Processing Libraries
import numpy as np
import pandas as pd
# Libraries for file reading
import h5py
# Memory management libraries for Python
import gc
# Progress Bar Libraries
from tqdm import tqdm

# Important Links

* Main link on machinelearningmastery for LSTM networks from Gary: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/ 
* Kanna LSTM search link: https://machinelearningmastery.com/?s=LSTM&post_type=post&submit=Search


# Open and Read in the Pandas Dataframe

In [2]:
data_directory = 'input_data/'
data_file = 'sim_data_df-body-ts.hdf5'
raw_df = pd.read_hdf(data_directory + data_file)
raw_df

body,time_step,mass,acc_x,acc_y,vel_x,vel_y,dis_x_1,dis_y_1,vel_x_1,vel_y_1,dis_x_2,...,vel_x_8,vel_y_8,dis_x_9,dis_y_9,vel_x_9,vel_y_9,dis_x_10,dis_y_10,vel_x_10,vel_y_10
0.0,1.0,1390.0,2.594044,2.233110,-9031.438477,14691.001953,-7231071.0,13187125.0,-9038.838867,16483.906250,-7492852.5,...,-10083.470703,18032.125000,-8099020.5,14453898.0,-10123.775391,18067.373047,-8125220.5,14476922.0,-10156.525391,18096.152344
0.0,2.0,1390.0,-0.009251,2.241132,-9038.838867,16483.906250,-7492852.5,13767511.0,-9366.065430,17209.388672,-7683851.0,...,-10123.775391,18067.373047,-8125220.5,14476922.0,-10156.525391,18096.152344,-8146808.5,14496177.0,-10183.510742,18120.220703
0.0,3.0,1390.0,-0.409034,0.906853,-9366.065430,17209.388672,-7683851.0,14036962.0,-9604.813477,17546.203125,-7814419.5,...,-10156.525391,18096.152344,-8146808.5,14496177.0,-10183.510742,18120.220703,-8164801.5,14512605.0,-10206.001953,18140.755859
0.0,4.0,1390.0,-0.298435,0.421017,-9604.813477,17546.203125,-7814419.5,14186097.0,-9768.024414,17732.621094,-7906783.5,...,-10183.510742,18120.220703,-8164801.5,14512605.0,-10206.001953,18140.755859,-8179942.0,14526864.0,-10224.927734,18158.580078
0.0,5.0,1390.0,-0.204014,0.233022,-9768.024414,17732.621094,-7906783.5,14279605.0,-9883.479492,17849.505859,-7974726.0,...,-10206.001953,18140.755859,-8179942.0,14526864.0,-10224.927734,18158.580078,-8192775.0,14539422.0,-10240.968750,18174.277344
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9.0,197216.0,2689.0,0.002917,0.004731,-19296.751953,11917.413086,-15435534.0,9536958.0,-19294.417969,11921.197266,-15433667.0,...,-19278.064453,11947.682617,-15420581.0,9561172.0,-19275.726562,11951.464844,-15418711.0,9564198.0,-19273.388672,11955.247070
9.0,197217.0,2689.0,0.002918,0.004731,-19294.417969,11921.197266,-15433667.0,9539985.0,-19292.083984,11924.981445,-15431798.0,...,-19275.726562,11951.464844,-15418711.0,9564198.0,-19273.388672,11955.247070,-15416841.0,9567223.0,-19271.050781,11959.029297
9.0,197218.0,2689.0,0.002918,0.004730,-19292.083984,11924.981445,-15431798.0,9543012.0,-19289.748047,11928.765625,-15429930.0,...,-19273.388672,11955.247070,-15416841.0,9567223.0,-19271.050781,11959.029297,-15414969.0,9570248.0,-19268.710938,11962.810547
9.0,197219.0,2689.0,0.002919,0.004730,-19289.748047,11928.765625,-15429930.0,9546040.0,-19287.412109,11932.549805,-15428061.0,...,-19271.050781,11959.029297,-15414969.0,9570248.0,-19268.710938,11962.810547,-15413097.0,9573273.0,-19266.371094,11966.591797


# Convert Pandas Dataframe to LSTM Numpy Arrays

We will need to split the data into groups of time_steps.  The time_steps value is the number of time steps in each sequence we feed to the LSTM network.

Training input will be the sequence group of time_steps of the mass, acc_x, acc_y, vel_x, vel_y columns.  This will result in an input numpy array with a shape of (time_steps, 5).  This numpy array is then added to a 3D input array of shape (num_samples, time_steps, 5). 

Target values are the dis_x, dis_y, vel_x, vel_y for an arbitrary number of time steps in the future.  For the initial example, we are looking to predict 10 time steps into the future for every inference.  Since we have 4 target attributes, that will give us a target numpy array of (10, 4) or (num_pred_ts, 4).  We can then take each sample and construct a 3D numpy array of (num_samples, num_pred_ts, 4).

The overall approach is to slice out one body at a time, then extract the sequences from that body's time step data.
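
As a minimal sketch of the array shapes described above, the following uses a toy single-body array filled with random values.  The sizes and the `toy_` names are hypothetical and chosen only for illustration; they are not the real simulator data.  The sample count and the choice of target row follow the convention used by the loops later in this notebook.

```python
import numpy as np

# Hypothetical toy sizes, for illustration only.
toy_num_ts = 20      # time steps available for one body
time_steps = 4       # length of each input sequence
num_pred_ts = 10     # future time steps predicted per inference

rng = np.random.default_rng(0)
# Columns stand in for mass, acc_x, acc_y, vel_x, vel_y.
toy_inputs = rng.random((toy_num_ts, 5)).astype(np.float32)
# Columns stand in for dis_x, dis_y, vel_x, vel_y repeated for 10 future steps.
toy_targets = rng.random((toy_num_ts, num_pred_ts * 4)).astype(np.float32)

# Sample count per body as used in this notebook: time steps minus sequence length.
num_samples = toy_num_ts - time_steps

# Each input sample is a (time_steps, 5) window; stacking gives a 3D array.
input_samples = np.stack(
    [toy_inputs[i:i + time_steps] for i in range(num_samples)]
)
# Each target is the row at the last input time step, reshaped to (num_pred_ts, 4).
target_samples = np.stack(
    [toy_targets[i + time_steps - 1].reshape(num_pred_ts, 4) for i in range(num_samples)]
)

print(input_samples.shape)   # (16, 4, 5)
print(target_samples.shape)  # (16, 10, 4)
```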

## Grab the Input Vector Data

In [3]:
# Get some values from the dataframe to help calculate shapes of numpy arrays.
# Need to be careful of modified dataframe index from previous processing.
# https://stackoverflow.com/questions/28772494/how-do-you-update-the-levels-of-a-pandas-multiindex-after-slicing-its-dataframe
min_body_index = int(min(raw_df.index.get_level_values(0)))
max_body_index = int(max(raw_df.index.get_level_values(0)))
min_ts_index = int(min(raw_df.index.get_level_values(1)))
max_ts_index = int(max(raw_df.index.get_level_values(1)))
num_bodies = len(raw_df.index.get_level_values(0).unique())
num_ts = len(raw_df.index.get_level_values(1).unique())
print(min_body_index)
print(max_body_index)
print(min_ts_index)
print(max_ts_index)
print(num_bodies)
print(num_ts)

0
9
1
197220
10
197220
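
The caution about the modified index can be illustrated with a small sketch.  It assumes a toy dataframe (the `toy_df` name and its values are hypothetical) and shows that after filtering rows, `index.levels` can still list values that no longer appear, while `get_level_values` reflects the rows actually present.  That is one reason to compute the min, max and counts from `get_level_values` as done above.

```python
import pandas as pd

# Hypothetical toy dataframe with the same (body, time_step) index layout.
toy_index = pd.MultiIndex.from_product(
    [[0, 1, 2], [1, 2, 3]], names=["body", "time_step"]
)
toy_df = pd.DataFrame({"mass": range(9)}, index=toy_index)

# Drop body 2 with a boolean mask, as earlier processing might have done.
filtered_df = toy_df[toy_df.index.get_level_values("body") < 2]

# The stored levels still remember body 2.
stale_levels = list(filtered_df.index.levels[0])
# The level values reflect only the rows that remain.
actual_bodies = sorted(filtered_df.index.get_level_values("body").unique())
# remove_unused_levels() rebuilds the levels from the remaining rows.
cleaned_levels = list(filtered_df.index.remove_unused_levels().levels[0])

print(stale_levels)
print(actual_bodies)
print(cleaned_levels)
```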


For example, suppose I have 1000 time steps per body in the raw dataset, I want a sequence of 4 input time steps for training the LSTM network, and I advance only 1 time step per sequence (overlapping samples).  I then create samples until I reach a starting index that is 4 time steps short of the max time step.  In this case, I would be grabbing 996 samples from the original set of 1000 time steps.

The total number of samples from all bodies and time steps would then be num_bodies * 996.
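
The arithmetic in this example can be checked with a short sketch.  The `example_` names and the body count of 10 are assumptions for illustration; the range of starting time steps mirrors the one used by the extraction loop in this notebook.

```python
# Hypothetical numbers matching the worked example above.
example_num_ts = 1000        # time steps per body
example_num_input_ts = 4     # input sequence length
example_num_bodies = 10      # assumed body count for illustration

# 1-based starting time steps, advancing one step per sample.
start_steps = range(1, example_num_ts - example_num_input_ts + 1)

example_samples_per_body = len(start_steps)
example_total_samples = example_samples_per_body * example_num_bodies

# Last starting time step and the last time step covered by that sequence.
last_start = start_steps[-1]
last_covered = last_start + example_num_input_ts - 1

print(example_samples_per_body)  # 996
print(example_total_samples)     # 9960
print(last_start, last_covered)  # 996 999
```

Under this convention the final time step (1000 here) never appears in an input sequence, which is why the count is 996 rather than 997.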

In [4]:
# Set the number of time steps to include for the input sequence to the LSTM.
num_input_ts = 4
# Set the number of features in the input sequence.
num_input_features = 5

In [5]:
# Calculate the number of samples per body.
num_samples_per_body = int(max_ts_index - num_input_ts)
# Calculate the resulting number of samples from the entire dataset.
num_samples = int(num_samples_per_body * num_bodies)
# Create an empty numpy array of the sample dimensions to store all samples.
# Fill with nan then overwrite.  Preallocating memory then writing is muuuuccchhh faster.
input_samples_np = np.full(
    (num_samples, num_input_ts, num_input_features),
    np.nan,
    dtype=np.float32
)
input_samples_np.shape

(1972160, 4, 5)

In [6]:
# JUST TESTING CODE BLOCK
# Try slicing out the beginning 5 columns.
idx = pd.IndexSlice
test_df = raw_df.loc[idx[0, :], :].iloc[:, 0:5]
curr_ts = 5
test_df.loc[idx[0, curr_ts:curr_ts+num_input_ts-1], :]#.to_numpy().shape

body,time_step,mass,acc_x,acc_y,vel_x,vel_y
0.0,5.0,1390.0,-0.204014,0.233022,-9768.024414,17732.621094
0.0,6.0,1390.0,-0.144319,0.146106,-9883.479492,17849.505859
0.0,7.0,1390.0,-0.10616,0.099909,-9968.407227,17929.433594
0.0,8.0,1390.0,-0.080736,0.072773,-10032.996094,17987.652344


In [7]:
# Pandas index slicer to help with slicing.
idx = pd.IndexSlice
# Keep track of the sample index
curr_sample_index = 0
# Loop over each body and extract the data for that body.
for body_index in range(min_body_index, max_body_index+1):
    # Extract the current body's data.  Assume columns to left are input data.
    # Use iloc to only extract the first 5 columns of data.
    current_body_np = raw_df.loc[idx[body_index, :], :].iloc[:, 0:num_input_features].to_numpy()
    # Loop over time steps for the num_samples_per_body to extract input sequences.
    # Store the extracted sequence as a sample in the input_samples_np numpy array.
    for curr_ts in tqdm(range(min_ts_index, max_ts_index - num_input_ts + 1)):
        # Be careful accessing time step indexes in numpy version of time series.
        # In Pandas dataframe, indexes start at 1 while indexes start at 0 in numpy.
        input_samples_np[curr_sample_index, :, :] = \
            current_body_np[(curr_ts-1):(curr_ts+num_input_ts-1), :]
        # Advance the counter that keeps track of the sample we are on.
        curr_sample_index += 1    
# Swapped to using numpy arrays on the inner loop since Pandas dataframes take
# so long to index and slice.

100%|██████████| 197216/197216 [00:00<00:00, 634142.60it/s]
100%|██████████| 197216/197216 [00:00<00:00, 653033.01it/s]
100%|██████████| 197216/197216 [00:00<00:00, 597650.72it/s]
100%|██████████| 197216/197216 [00:00<00:00, 632102.49it/s]
100%|██████████| 197216/197216 [00:00<00:00, 657385.20it/s]
100%|██████████| 197216/197216 [00:00<00:00, 644496.74it/s]
100%|██████████| 197216/197216 [00:00<00:00, 666184.41it/s]
100%|██████████| 197216/197216 [00:00<00:00, 670801.29it/s]
100%|██████████| 197216/197216 [00:00<00:00, 659586.28it/s]
100%|██████████| 197216/197216 [00:00<00:00, 666271.87it/s]
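
As an alternative to the Python inner loop, the same windows can likely be built without looping.  This is a sketch of a different technique, not what this notebook does: it assumes NumPy 1.20 or newer for `sliding_window_view`, and it uses a small toy array (the `toy_` names are hypothetical) to compare against a loop written in the same style as the one above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical toy sizes standing in for one body's input columns.
toy_num_ts, toy_num_features, toy_num_input_ts = 12, 5, 4
toy_body_np = np.arange(
    toy_num_ts * toy_num_features, dtype=np.float32
).reshape(toy_num_ts, toy_num_features)

# Loop version, same sample count convention as this notebook.
toy_num_samples = toy_num_ts - toy_num_input_ts
looped = np.full(
    (toy_num_samples, toy_num_input_ts, toy_num_features), np.nan, dtype=np.float32
)
for start in range(toy_num_samples):
    looped[start] = toy_body_np[start:start + toy_num_input_ts]

# Vectorized version: windows along axis 0 have shape
# (num_windows, num_features, window_length), so swap the last two axes.
windows = sliding_window_view(toy_body_np, window_shape=toy_num_input_ts, axis=0)
# Drop the final window to match the notebook's sample count.
vectorized = windows.transpose(0, 2, 1)[:toy_num_samples]

print(vectorized.shape)                    # (8, 4, 5)
print(np.array_equal(looped, vectorized))  # True
```

The result is a read-only view into the original array, so it would need to be copied (for example by assigning it into the preallocated `input_samples_np` slice for that body) before being modified.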


In [8]:
input_samples_np.shape

(1972160, 4, 5)

In [9]:
input_samples_np[1972159]

array([[ 2.6890000e+03,  2.9173568e-03,  4.7311750e-03, -1.9296752e+04,
         1.1917413e+04],
       [ 2.6890000e+03,  2.9179130e-03,  4.7308332e-03, -1.9294418e+04,
         1.1921197e+04],
       [ 2.6890000e+03,  2.9184693e-03,  4.7304914e-03, -1.9292084e+04,
         1.1924981e+04],
       [ 2.6890000e+03,  2.9190257e-03,  4.7301496e-03, -1.9289748e+04,
         1.1928766e+04]], dtype=float32)

## Grab the Target Data

The LSTM network is trying to shotgun predict a certain number of future time steps in a single inference, given an input sequence of length num_input_ts.

Depending on the number of features in that target, we can calculate the number of time steps being shotgun predicted.  For example, if there are 4 features in the output (dis_x, dis_y, vel_x, vel_y) and 40 target columns in the Pandas dataframe, then we know we are trying to predict 40/4 = 10 time steps in the future at each inference.
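
A small sketch of that calculation, assuming the target columns are ordered as dis_x, dis_y, vel_x, vel_y for step 1, then step 2, and so on (the order shown in the dataframe output).  The `toy_` names and the placeholder row values are hypothetical.

```python
import numpy as np

toy_num_pred_feat = 4
# Rebuild the 40 target column names in dataframe order.
toy_column_names = [
    f"{name}_{step}"
    for step in range(1, 11)
    for name in ("dis_x", "dis_y", "vel_x", "vel_y")
]

# 40 target columns / 4 features = 10 predicted time steps.
toy_num_pred_ts = len(toy_column_names) // toy_num_pred_feat

# One flat target row (placeholder values 0..39) reshaped so that each
# row of the result holds the 4 features of one future time step.
toy_row = np.arange(len(toy_column_names), dtype=np.float32)
toy_target = toy_row.reshape(toy_num_pred_ts, toy_num_pred_feat)

print(toy_num_pred_ts)       # 10
print(toy_target.shape)      # (10, 4)
print(toy_column_names[4:8]) # ['dis_x_2', 'dis_y_2', 'vel_x_2', 'vel_y_2']
```

Because the reshape is row-major, `toy_target[1]` lines up with the `*_2` columns, so the column order in the dataframe matters.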

In [10]:
# Grab the target data from the raw dataframe.
target_df = raw_df.iloc[:, num_input_features:]
target_df.head(10)

body,time_step,dis_x_1,dis_y_1,vel_x_1,vel_y_1,dis_x_2,dis_y_2,vel_x_2,vel_y_2,dis_x_3,dis_y_3,...,vel_x_8,vel_y_8,dis_x_9,dis_y_9,vel_x_9,vel_y_9,dis_x_10,dis_y_10,vel_x_10,vel_y_10
0.0,1.0,-7231071.0,13187125.0,-9038.838867,16483.90625,-7492852.5,13767511.0,-9366.06543,17209.388672,-7683851.0,14036962.0,...,-10083.470703,18032.125,-8099020.5,14453898.0,-10123.775391,18067.373047,-8125220.5,14476922.0,-10156.525391,18096.152344
0.0,2.0,-7492852.5,13767511.0,-9366.06543,17209.388672,-7683851.0,14036962.0,-9604.813477,17546.203125,-7814419.5,14186097.0,...,-10123.775391,18067.373047,-8125220.5,14476922.0,-10156.525391,18096.152344,-8146808.5,14496177.0,-10183.510742,18120.220703
0.0,3.0,-7683851.0,14036962.0,-9604.813477,17546.203125,-7814419.5,14186097.0,-9768.024414,17732.621094,-7906783.5,14279605.0,...,-10156.525391,18096.152344,-8146808.5,14496177.0,-10183.510742,18120.220703,-8164801.5,14512605.0,-10206.001953,18140.755859
0.0,4.0,-7814419.5,14186097.0,-9768.024414,17732.621094,-7906783.5,14279605.0,-9883.479492,17849.505859,-7974726.0,14343547.0,...,-10183.510742,18120.220703,-8164801.5,14512605.0,-10206.001953,18140.755859,-8179942.0,14526864.0,-10224.927734,18158.580078
0.0,5.0,-7906783.5,14279605.0,-9883.479492,17849.505859,-7974726.0,14343547.0,-9968.407227,17929.433594,-8026397.0,14390122.0,...,-10206.001953,18140.755859,-8179942.0,14526864.0,-10224.927734,18158.580078,-8192775.0,14539422.0,-10240.96875,18174.277344
0.0,6.0,-7974726.0,14343547.0,-9968.407227,17929.433594,-8026397.0,14390122.0,-10032.996094,17987.652344,-8066776.5,14425700.0,...,-10224.927734,18158.580078,-8192775.0,14539422.0,-10240.96875,18174.277344,-8203719.5,14550620.0,-10254.649414,18188.275391
0.0,7.0,-8026397.0,14390122.0,-10032.996094,17987.652344,-8066776.5,14425700.0,-10083.470703,18032.125,-8099020.5,14453898.0,...,-10240.96875,18174.277344,-8203719.5,14550620.0,-10254.649414,18188.275391,-8213097.0,14560716.0,-10266.371094,18200.894531
0.0,8.0,-8066776.5,14425700.0,-10083.470703,18032.125,-8099020.5,14453898.0,-10123.775391,18067.373047,-8125220.5,14476922.0,...,-10254.649414,18188.275391,-8213097.0,14560716.0,-10266.371094,18200.894531,-8221159.5,14569903.0,-10276.449219,18212.378906
0.0,9.0,-8099020.5,14453898.0,-10123.775391,18067.373047,-8125220.5,14476922.0,-10156.525391,18096.152344,-8146808.5,14496177.0,...,-10266.371094,18200.894531,-8221159.5,14569903.0,-10276.449219,18212.378906,-8228110.0,14578338.0,-10285.137695,18222.921875
0.0,10.0,-8125220.5,14476922.0,-10156.525391,18096.152344,-8146808.5,14496177.0,-10183.510742,18120.220703,-8164801.5,14512605.0,...,-10276.449219,18212.378906,-8228110.0,14578338.0,-10285.137695,18222.921875,-8234110.0,14586138.0,-10292.637695,18232.671875


In [11]:
# Set the number of features we are trying to predict in the target
num_pred_feat = 4
# Get number of predicted time steps in data.
num_pred_ts = int(target_df.shape[1] / num_pred_feat)
num_pred_ts

10

In [12]:
# Create an empty numpy array of the sample dimensions to store all samples.
# Fill with nan then overwrite.  Preallocating memory then writing is muuuuccchhh faster.
target_samples_np = np.full(
    (num_samples, num_pred_ts, num_pred_feat),
    np.nan,
    dtype=np.float32
)
target_samples_np.shape

(1972160, 10, 4)

In [17]:
# Pandas index slicer to help with slicing.
idx = pd.IndexSlice
# Keep track of the sample index
curr_sample_index = 0
# Loop over each body and extract the data for that body.
for body_index in range(min_body_index, max_body_index+1):
    # Extract the current body's data.
    current_body_np = target_df.loc[body_index].to_numpy()
    # Loop over time steps for the num_samples_per_body to extract target rows.
    # Store the extracted target as a sample in the target_samples_np numpy array.
    for curr_ts in tqdm(range(min_ts_index, max_ts_index - num_input_ts + 1)):
        # Be careful accessing time step indexes in numpy version of time series.
        # In Pandas dataframe, indexes start at 1 while indexes start at 0 in numpy.
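        # Extract whole row of data then reshape knowing the size of the features.
        # Row (curr_ts+num_input_ts-2) is the 0-based index of the last time step
        # in the matching input sequence.
        temp_row = current_body_np[curr_ts+num_input_ts-2].reshape(num_pred_ts, num_pred_feat)
        # Save to the target samples array.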
        target_samples_np[curr_sample_index, :, :] = temp_row
        # Advance the counter that keeps track of the sample we are on.
        curr_sample_index += 1

100%|██████████| 197216/197216 [00:00<00:00, 524513.07it/s]
100%|██████████| 197216/197216 [00:00<00:00, 530149.89it/s]
100%|██████████| 197216/197216 [00:00<00:00, 513588.67it/s]
100%|██████████| 197216/197216 [00:00<00:00, 527317.76it/s]
100%|██████████| 197216/197216 [00:00<00:00, 467336.83it/s]
100%|██████████| 197216/197216 [00:00<00:00, 531582.52it/s]
100%|██████████| 197216/197216 [00:00<00:00, 535915.78it/s]
100%|██████████| 197216/197216 [00:00<00:00, 533017.41it/s]
100%|██████████| 197216/197216 [00:00<00:00, 541804.91it/s]
100%|██████████| 197216/197216 [00:00<00:00, 500553.61it/s]


In [18]:
target_samples_np.shape

(1972160, 10, 4)

In [23]:
target_samples_np[0]

array([[-7.8144195e+06,  1.4186097e+07, -9.7680244e+03,  1.7732621e+04],
       [-7.9067835e+06,  1.4279605e+07, -9.8834795e+03,  1.7849506e+04],
       [-7.9747260e+06,  1.4343547e+07, -9.9684072e+03,  1.7929434e+04],
       [-8.0263970e+06,  1.4390122e+07, -1.0032996e+04,  1.7987652e+04],
       [-8.0667765e+06,  1.4425700e+07, -1.0083471e+04,  1.8032125e+04],
       [-8.0990205e+06,  1.4453898e+07, -1.0123775e+04,  1.8067373e+04],
       [-8.1252205e+06,  1.4476922e+07, -1.0156525e+04,  1.8096152e+04],
       [-8.1468085e+06,  1.4496177e+07, -1.0183511e+04,  1.8120221e+04],
       [-8.1648015e+06,  1.4512605e+07, -1.0206002e+04,  1.8140756e+04],
       [-8.1799420e+06,  1.4526864e+07, -1.0224928e+04,  1.8158580e+04]],
      dtype=float32)