# Deep Learning Preprocessing Pipeline

In this tutorial, we build a data preprocessing pipeline whose output is meant to be fed into an LSTM neural network, a recurrent architecture often used for time series. Learn more about LSTM neural networks here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

As a data source for our pipeline, we are going to use the `existenz_api_fetcher` package.

## Requirements
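To run this notebook you will need the following Python packages, in addition to `existenz_api_fetcher` (note that `sklearn` is installed under the name `scikit-learn`):
```
numpy
pandas
scikit-learn
```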

Let's assume that we want our neural network to figure out the relationship between the river flow and the river height, using the data from the Aare river in Bern, Switzerland.
Let's first import the necessary modules from the `existenz_api_fetcher` package.

In [1]:
from existenz_api_fetcher import locations, hydro, pipelines

# Display maps to look up the FOEN station code for the Aare river in Bern
locations.show()

We now merge the dataframes for the river flow and height.

In [2]:
df = pipelines.merge(hydro.flow('2159'), hydro.height('2159'))
df.columns = ['flow', 'height']
df

DateTime,flow,height
2021-06-21,3.632812,519.239766
2021-06-22,8.549306,519.345347
2021-06-23,11.272028,519.494895
2021-06-24,8.320000,519.423600
2021-06-25,18.329000,519.648000
...,...,...
2023-06-19,1.734713,519.149524
2023-06-20,1.855140,519.157049
2023-06-20,1.855140,519.229205
2023-06-20,3.127682,519.157049
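
The last rows of the output above repeat the date 2023-06-20, so it can be worth checking the merged index for duplicates before going further. The following is a minimal sketch on invented data; the toy values and the policy of keeping the last row per timestamp are assumptions, not something `existenz_api_fetcher` prescribes.

```python
import pandas as pd

# Toy frame standing in for the merged data (values are invented)
toy = pd.DataFrame(
    {"flow": [1.7, 1.9, 1.9, 3.1], "height": [519.15, 519.16, 519.23, 519.16]},
    index=pd.to_datetime(["2023-06-19", "2023-06-20", "2023-06-20", "2023-06-20"]),
)
toy.index.name = "DateTime"

# Count index entries that repeat an earlier timestamp
n_duplicates = int(toy.index.duplicated().sum())
print(f"Duplicated timestamps: {n_duplicates}")

# One possible policy (an assumption): keep the last row for each timestamp
deduplicated = toy[~toy.index.duplicated(keep="last")]
print(deduplicated)
```

Whether to keep the first row, the last row or an average per timestamp depends on how the duplicates arise in the source data, so the policy above should be treated as a placeholder.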


A data pipeline automates the ingestion, batch processing and storage of data used in data science projects and research. In this tutorial, we focus on the preprocessing part.
In our case, three processing operations are needed before we can train an LSTM neural network:
1) Splitting the raw data chronologically into training, validation and testing data
2) Scaling the data with a scaler that is robust to outliers (flood events for example)
3) Creating fixed-length sequences to feed into the neural network
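
To illustrate the second step: with its default settings, scikit-learn's `RobustScaler` centers each column on its median and divides by its interquartile range, so a single flood peak barely changes how ordinary values are scaled. A minimal sketch with invented flow values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Invented flow values with one flood-like outlier
flow = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Defaults: center on the median, scale by the interquartile range (IQR)
scaler = RobustScaler()
scaled = scaler.fit_transform(flow)

print(scaler.center_)  # median of the column: [3.]
print(scaler.scale_)   # IQR (75th minus 25th percentile): [2.]
print(scaled.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

The outlier is still present after scaling, but it does not compress the ordinary values toward zero the way a mean and standard deviation computed on the same data would.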

We can create a class with a method for each one of these processing operations.

In [3]:
from sklearn.preprocessing import RobustScaler
import numpy as np


# Create pipeline class
class LSTMPipeline(RobustScaler):
    
    
    # Split DataFrame chronologically into train (70%), validation (20%) and test (remainder) data
    def split(self, df):
        
        train_size = int(len(df) * 0.7)
        validation_size = int(len(df) * 0.2)
        validation_end = train_size + validation_size
        
        # The three slices must not overlap; copies avoid writing into views of df later on
        train = df.iloc[:train_size].copy()
        validation = df.iloc[train_size:validation_end].copy()
        test = df.iloc[validation_end:].copy()
        print(f"Shape of training data: {train.shape}\n"
              f"Shape of validation data: {validation.shape}\n"
              f"Shape of test data: {test.shape}\n")
        
        return train, validation, test
    
    
    # Scale both columns in place, using scalers fitted on the training data only
    def scale(self, train_df, test_df):
        
        # Fitting on the training data keeps test statistics out of the scaling
        flow_transformer = RobustScaler()
        flow_transformer = flow_transformer.fit(train_df[['flow']])
        train_df['flow'] = flow_transformer.transform(train_df[['flow']])
        test_df['flow'] = flow_transformer.transform(test_df[['flow']])

        height_transformer = RobustScaler()
        height_transformer = height_transformer.fit(train_df[['height']])
        train_df['height'] = height_transformer.transform(train_df[['height']])
        test_df['height'] = height_transformer.transform(test_df[['height']])
    
    
    # Function to prepare sequences
    def create_dataset(self, X, y, time_steps):
        
        Xs, ys = [], []
        for i in range(len(X) - time_steps):
            v = X.iloc[i:(i + time_steps)].values
            Xs.append(v)
            ys.append(y.iloc[i + time_steps])
            
        return np.array(Xs), np.array(ys)
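
To make the windowing concrete, here is a small sketch on a toy DataFrame. `make_windows` is a hypothetical standalone copy of the `create_dataset` logic, written as a plain function so the example runs on its own; the toy values are invented.

```python
import numpy as np
import pandas as pd


def make_windows(X, y, time_steps):
    """Standalone copy of the windowing logic, for illustration only."""
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X.iloc[i:i + time_steps].values)  # time_steps consecutive rows
        ys.append(y.iloc[i + time_steps])           # the value right after the window
    return np.array(Xs), np.array(ys)


toy = pd.DataFrame({"flow": [0, 1, 2, 3, 4, 5], "height": [10, 11, 12, 13, 14, 15]})
X_toy, y_toy = make_windows(toy, toy["flow"], time_steps=3)

print(X_toy.shape)  # (3, 3, 2): samples, time steps, features
print(X_toy[0])     # first window: rows 0 to 2 of both columns
print(y_toy)        # [3 4 5]: the flow value following each window
```

Six rows with a window of three give `6 - 3 = 3` samples, which is the same arithmetic that turns 513 training rows into 503 sequences when `time_steps=10`.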

We can now create an instance of our pipeline class and use its methods to process our Aare river data.

In [4]:
import pandas as pd
pd.options.mode.chained_assignment = None  # Removes false positive SettingWithCopyWarning

# Instantiate pipeline class
pipeline = LSTMPipeline()

train, validation, test = pipeline.split(df)
pipeline.scale(train, test)
# Inputs are windows of 10 time steps of flow and height; the target is the flow at the next step
X_train, y_train = pipeline.create_dataset(train, train.flow, time_steps=10)
print(f"Shape of input sequences (X_train): {X_train.shape}\n"
      f"Shape of flow targets (y_train): {y_train.shape} \n")

Shape of training data: (513, 2)
Shape of validation data: (146, 2)
Shape of test data: (74, 2)

Shape of input sequences (X_train): (503, 10, 2)
Shape of flow targets (y_train): (503,) 
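
Note that the `scale` call above only receives the training and test frames, so the validation frame stays in its original units. Below is a minimal sketch of one way to scale all three splits with statistics from the training data and to keep the fitted scaler for mapping values back to the original units. `scale_splits` is a hypothetical helper, not part of the pipeline class above, and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler


def scale_splits(train_df, other_dfs, columns):
    """Hypothetical helper: fit on train, transform every split, return the scaler."""
    scaler = RobustScaler().fit(train_df[columns])
    scaled_train = train_df.copy()
    scaled_train[columns] = scaler.transform(train_df[columns])
    scaled_others = []
    for frame in other_dfs:
        scaled = frame.copy()
        scaled[columns] = scaler.transform(frame[columns])
        scaled_others.append(scaled)
    return scaled_train, scaled_others, scaler


# Invented data standing in for the flow and height columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({"flow": rng.gamma(2.0, 3.0, 100), "height": 519 + rng.random(100)})
toy_train, toy_validation, toy_test = toy.iloc[:70], toy.iloc[70:90], toy.iloc[90:]

columns = ["flow", "height"]
train_s, (validation_s, test_s), scaler = scale_splits(
    toy_train, [toy_validation, toy_test], columns
)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(validation_s[columns])
print(np.allclose(restored, toy_validation[columns].to_numpy()))
```

Returning the scaler is a design choice made for this sketch: without it, model predictions made on scaled data cannot be converted back to physical units.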



Everything is now ready to feed our data into an LSTM neural network.