# InputData tutorial

The InputData object is the default format in which the infrastructure sends data to the model. You can think of it as a dictionary whose keys are unit tags, strings of the form `"ID:TYPE"` that combine the sensor name and the signal type, and whose values are pandas DataFrames. It also performs internal validation, which ensures that the data always has the same format.

A simple way to create one is the classmethod `InputData.from_long_df`, which takes a pandas DataFrame in long format with columns `["ID", "TYPE", "TIME", "VALUE"]` and returns an InputData object:

In [1]:
import pandas as pd
from twinn_ml_interface.input_data import InputData

df = pd.read_parquet("tutorial_data/mock_data.parquet")
df.head()

,TIME,ID,TYPE,VALUE
0,2024-01-01 00:00:00,SENSOR1,TYPE1,32
1,2024-01-01 01:00:00,SENSOR1,TYPE1,30
2,2024-01-01 02:00:00,SENSOR1,TYPE1,43
3,2024-01-01 03:00:00,SENSOR1,TYPE1,63
4,2024-01-01 04:00:00,SENSOR1,TYPE1,63


In [2]:
input_data = InputData.from_long_df(df)
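
To make the dictionary analogy concrete, here is a minimal sketch in plain pandas of how a long-format frame could be split into one time-indexed DataFrame per `"ID:TYPE"` key. It is not the library's implementation: `split_long_df` is a hypothetical helper, the values are made up, and the real `InputData` adds validation on top.

```python
import pandas as pd

# Small hand-made long-format frame (illustrative values, not the tutorial data).
long_df = pd.DataFrame(
    {
        "ID": ["SENSOR1", "SENSOR1", "SENSOR2", "SENSOR2"],
        "TYPE": ["TYPE1", "TYPE1", "TYPE2", "TYPE2"],
        "TIME": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 01:00",
                "2024-01-01 00:00",
                "2024-01-01 01:00",
            ]
        ),
        "VALUE": [32, 30, 7, 9],
    }
)


def split_long_df(df: pd.DataFrame) -> dict:
    """Hypothetical helper: build one time-indexed frame per "ID:TYPE" key."""
    frames = {}
    for (sensor_id, sensor_type), group in df.groupby(["ID", "TYPE"]):
        key = f"{sensor_id}:{sensor_type}"
        frames[key] = (
            group.set_index("TIME")[["VALUE"]]
            .rename(columns={"VALUE": key})  # name the column after the key
            .sort_index()
        )
    return frames


frames = split_long_df(long_df)
print(sorted(frames))
```

Each value in `frames` is an ordinary DataFrame, so a single signal can be read or modified without touching the others.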

Using the InputData format instead of the long format lets us work with each signal independently, with no need to repeatedly filter a long DataFrame by sensor and type. For example, to set `"SENSOR2:TYPE2"` to a constant value, we can do:

In [3]:
my_signal = "SENSOR2:TYPE2"
input_data[my_signal].loc[:, my_signal] = 0
input_data[my_signal].head()

TIME,SENSOR2:TYPE2
2024-01-01 00:00:00,0
2024-01-01 01:00:00,0
2024-01-01 02:00:00,0
2024-01-01 03:00:00,0
2024-01-01 04:00:00,0
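
For comparison, the same update on a long-format frame needs a boolean mask over both the `ID` and `TYPE` columns. The sketch below uses plain pandas with made-up values. It does not involve `InputData` and only illustrates the filtering that the per-signal format avoids.

```python
import pandas as pd

# Illustrative long-format data with two signals.
long_df = pd.DataFrame(
    {
        "ID": ["SENSOR1", "SENSOR1", "SENSOR2", "SENSOR2"],
        "TYPE": ["TYPE1", "TYPE1", "TYPE2", "TYPE2"],
        "TIME": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 01:00",
                "2024-01-01 00:00",
                "2024-01-01 01:00",
            ]
        ),
        "VALUE": [32, 30, 7, 9],
    }
)

# Select the rows of one signal, then overwrite only those values.
mask = (long_df["ID"] == "SENSOR2") & (long_df["TYPE"] == "TYPE2")
long_df.loc[mask, "VALUE"] = 0
```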


Because a long-format DataFrame is convenient for getting aggregate information, InputData exposes some of that information as properties:

In [4]:
print("max time:", input_data.max_datetime)
print("min time:", input_data.min_datetime)
print("unit codes:", input_data.unit_codes)
print("unit tags:", input_data.unit_tags)

max time: 2024-01-02 01:00:00
min time: 2024-01-01 00:00:00
unit codes: {'SENSOR3', 'SENSOR1', 'SENSOR4', 'SENSOR2'}
unit tags: {'SENSOR4:TYPE4', 'SENSOR2:TYPE2', 'SENSOR3:TYPE3', 'SENSOR1:TYPE1'}
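
As a rough point of reference, similar aggregates can be computed directly from a long-format frame with plain pandas. This is only a sketch under the assumption that the properties correspond to these quantities. It uses made-up data and does not show how `InputData` computes them internally.

```python
import pandas as pd

# Illustrative long-format data with two signals.
long_df = pd.DataFrame(
    {
        "ID": ["SENSOR1", "SENSOR1", "SENSOR2", "SENSOR2"],
        "TYPE": ["TYPE1", "TYPE1", "TYPE2", "TYPE2"],
        "TIME": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 01:00",
                "2024-01-01 00:00",
                "2024-01-01 01:00",
            ]
        ),
        "VALUE": [32, 30, 7, 9],
    }
)

max_time = long_df["TIME"].max()  # latest timestamp across all signals
min_time = long_df["TIME"].min()  # earliest timestamp across all signals
unit_codes = set(long_df["ID"])  # distinct sensor names
unit_tags = set(long_df["ID"] + ":" + long_df["TYPE"])  # distinct "ID:TYPE" keys
```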


The InputData object guarantees that any DataFrame added to it is stored in sorted order, so a shuffled DataFrame that is assigned back compares equal to the original:

In [5]:
unit_tag = "SENSOR1:TYPE1"
original_df = input_data[unit_tag].copy()

# Shuffle the rows so they are no longer in their original order.
shuffled_data = input_data[unit_tag].sample(frac=1)
assert not shuffled_data.equals(original_df)

# Assigning the shuffled frame back lets InputData sort it again.
input_data[unit_tag] = shuffled_data
assert input_data[unit_tag].equals(original_df)
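
The effect can be reproduced with plain pandas, assuming the sorting is done on the time index (an assumption, since the internals are not shown here). In this sketch the frame is reversed rather than randomly shuffled so that the result is deterministic, and `sort_index` restores the original order.

```python
import pandas as pd

# Illustrative time-indexed frame for a single signal.
times = pd.to_datetime(
    [
        "2024-01-01 00:00",
        "2024-01-01 01:00",
        "2024-01-01 02:00",
        "2024-01-01 03:00",
        "2024-01-01 04:00",
    ]
)
original = pd.DataFrame(
    {"SENSOR1:TYPE1": [32, 30, 43, 63, 63]},
    index=pd.DatetimeIndex(times, name="TIME"),
)

# Reverse the row order to simulate unsorted input.
unsorted = original.iloc[::-1]

# Sorting on the time index brings the rows back to the original order.
restored = unsorted.sort_index()
```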