# DeepTime - Deep Learning Studio for Time Series



This codeless studio focuses on:

1. Mapping columns of a Pandas DataFrame to model-ready values through
  (1) one-hot coding;
  (2) categorical indicators;
  (3) normalizing functions.

2. Windowing historical data automatically, so that an RNN model can be fitted.

3. Training with built-in LSTM and Seq2Seq models.
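
The normalizing functions in item 1 are generated as pairs of lambda expressions in source form (a normalizer and its inverse), so that the generated code stays short and readable. The following is a minimal sketch of that idea in plain Python, without TensorFlow. The name make_min_max_pair is hypothetical, and the statistics are made-up illustration values, not taken from any dataset.

```python
def make_min_max_pair(min_v, max_v, arg="x"):
    """Return (normalizer_src, denormalizer_src) as lambda source strings."""
    span = f"({max_v} - {min_v})"
    normalizer_src = f"lambda {arg}: ({arg} - {min_v}) / {span}"
    denormalizer_src = f"lambda {arg}: {arg} * {span} + {min_v}"
    return normalizer_src, denormalizer_src


# Made-up statistics, for illustration only.
norm_src, denorm_src = make_min_max_pair(10.0, 50.0)

# The strings are turned into callables with eval. This is only reasonable
# because the strings are generated locally and are not user input.
normalize = eval(norm_src)
denormalize = eval(denorm_src)

scaled = normalize(30.0)        # (30 - 10) / (50 - 10)
restored = denormalize(scaled)  # maps the scaled value back
print(norm_src)
print(scaled, restored)
```

Keeping the inverse next to the normalizer makes it possible to map predictions back to the original scale later.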

I am working on Deep Time (https://github.com/MRYingLEE/DeepTime-Deep-Learning-Framework-for-Time-Series-Forecasting). This tool is part of my research work.

TensorFlow 2.x is used.


![alt text](https://www.tensorflow.org/tutorials/structured_data/images/time_series.png)

# Import TensorFlow and other libraries

Support for the scikit-learn preprocessing functions (https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) may be added later.

So far, only train_test_split from scikit-learn is used.

In [0]:
!pip install scikit-learn

In [0]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.feature_column import *
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

from io import StringIO

# ipywidgets (https://github.com/jupyter-widgets/ipywidgets) makes the Jupyter Notebook interactive.
from ipywidgets import *

import sys
import os

# Use seaborn for pairplot
!pip install -q seaborn
import seaborn as sns
from os import path

# Useful helper functions for Map transformation

I created a few helper functions to keep the generated code short and easy to read.
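
The helpers in the next cell look a value up in a vocabulary table with one out-of-vocabulary bucket and then one-hot encode the index with a depth of vocabulary size + 1. The sketch below shows the same index arithmetic in plain Python, without TensorFlow. The name one_hot_with_oov and the weekday vocabulary are hypothetical and only serve as an illustration.

```python
def one_hot_with_oov(vocabulary):
    """Build a function mapping a value to a one-hot list of length len(vocabulary) + 1."""
    index_of = {value: i for i, value in enumerate(vocabulary)}
    depth = len(vocabulary) + 1  # one extra slot for out-of-vocabulary values

    def encode(value):
        # Unknown values fall into the last slot.
        idx = index_of.get(value, len(vocabulary))
        return [1.0 if i == idx else 0.0 for i in range(depth)]

    return encode


encode_weekday = one_hot_with_oov(["mon", "tue", "wed"])
print(encode_weekday("tue"))  # known value
print(encode_weekday("sun"))  # unseen value lands in the extra slot
```

The extra slot means a value that was never seen when the vocabulary was built still gets a valid encoding instead of raising an error.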

In [0]:
def categorical_strings(column,vocabulary_list):
  """A function to generate a one-hot column by the vocabulary list."""
  count_v=len(vocabulary_list)
  table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
    vocabulary_list, range(count_v), key_dtype=tf.string, value_dtype=tf.int64, name=column
    ),
    1)
    
  def one_hot_column(row):
    out = table.lookup(row)
    return tf.one_hot(out,count_v+1)

  return one_hot_column

def categorical_identitys(column,vocabulary_list):
  """A function to generate a one-hot column by the vocabulary list for an integer column."""
  count_v=len(vocabulary_list)
  table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
    vocabulary_list, range(count_v), key_dtype=tf.int64, value_dtype=tf.int64, name=column
    ),
    1)
  def one_hot_column(row):
    out = table.lookup(row)
    return tf.one_hot(out,count_v+1)

  return one_hot_column

# Class of an Estimator

I like the idea of an estimator, which makes machine learning easier, but this class is NOT a subclass of tf.estimator.Estimator (https://www.tensorflow.org/guide/estimator).

tf.estimator.Estimator depends heavily on feature_column (https://www.tensorflow.org/api_docs/python/tf/feature_column). I like the idea of feature_column too, but feature_column doesn't work well with time series and doesn't support the Keras functional API well. (If you find a way to solve my headache, please let me know.)


## The class that defines the grid column layout

In [0]:
class I_Column:
  p_Column=0
  p_dtype=1
  p_Input=2
  p_Label=3
  p_future=4
  p_FeatureKind=5
  p_Numeric_Normalizer=8

## The TsEstimator - The core of this studio

In [0]:
class TsEstimator:
  """
   An estimator for time series. This is NOT a subclass of tf.estimator.Estimator (https://www.tensorflow.org/guide/estimator).
  """
  # The AVAILABLE feature kinds for each Pandas dtype
  dtype_mapping_cross = StringIO("""Kind,object,int64,float64,bool,datetime64,timedelta[ns],category,cat_int64,cat_string
    categorical_identitys,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE
    categorical_strings,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE
        """)
  df_dtype_mapping_cross = pd.read_csv(dtype_mapping_cross, sep=",")

  def __init__(self, fn_get_configs):
    """
    The initializer with assigned function
    """
    df_all, df_train, df_val, df_test,default_inputs,default_labels,default_future, past_history, future_target,categories_limit, batch_size,single_step=fn_get_configs()

    self._df_all=df_all

    assert(df_all is not None)

    # if (df_train is None):
    #   self._df_train=df_all
    # else:
    #   self._df_train=df_train
    self._df_train=df_train
    self._df_val=df_val  # Validation split; may be None
    self._df_test=df_test # Not used so far 

    self.input_lambdas=[]  # In the form of (column_name, lambda)
    self.future_lambdas=[]  # In the form of (column_name, lambda)
    self.label_lambdas=[]  # In the form of (column_name, lambda)

    self.global_normalizers={} # Not used so far 
    self.categorical_columns=[]
    self.categories_limit=categories_limit
    self.grid_features=None
    self.grid_periods=None
    self.category_lists= self.__df_desc() # A column with fewer unique values than categories_limit (20 in the demos) is treated as a category column.
    self.code=""

    self.default_inputs=default_inputs
    self.default_labels=default_labels
    self.default_future=default_future

    self.past_history = past_history
    self.future_target = future_target
    self.batch_size = batch_size
    self.single_step = single_step

    self.v_labels=0
    self.v_future=0

  @classmethod
  def get_available_mapping(cls,col_dtype):
    """
    to get available mapping for the assigned dtype
    """
    return set(cls.df_dtype_mapping_cross[["Kind",col_dtype]][cls.df_dtype_mapping_cross[col_dtype]]["Kind"].unique())

  # ## To generate normalizer lambdas and the matching denormalizer lambdas
  # So far, only 2 kinds of normalizer and denormalizer are supported: min-max and mean-std.

  @staticmethod
  def min_max_normalizer(min_v,max_v, v_str="by_train",is_int64=False):
    """
      To generate min-max normalizer and denormalizer lambda statements
      min-max  : (value-min)/(max-min)      
    """
    if is_int64:
      ext_v_str="tf.cast("+v_str+",tf.float32)"
    else:
      ext_v_str=v_str
    
    return "lambda "+v_str+": ("+ext_v_str+ " -"+str(min_v)+")/("+str(max_v)+"-"+str(min_v)+")","lambda "+v_str+": "+ext_v_str+ " *("+str(max_v)+"-"+str(min_v)+")+"+str(min_v)

  @staticmethod
  def std_normalizer(v_mean,v_std, v_str="by_train",is_int64=False):
    """
      To generate mean-std normalizer and denormalizer lambda statements
      mean-std  : (value-mean)/std      
    """
    if is_int64:
      ext_v_str="tf.cast("+v_str+",tf.float32)"
    else:
      ext_v_str=v_str

    return "lambda "+v_str+": ("+ext_v_str+ " -"+str(v_mean)+")/"+str(v_std),"lambda "+v_str+": "+ext_v_str+ " *"+str(v_std)+"+"+str(v_mean)

  @staticmethod
  def create_local_normalizers(col_name,df_statistics, v_str="by_train",is_int64=False):
    """
      To generate min-max/mean-std normalizer and denormalizer lambda statements from the given statistics
    """
    v_min=df_statistics.loc[col_name]["min"]
    v_max=df_statistics.loc[col_name]["max"]
    v_mean=df_statistics.loc[col_name]["mean"]
    v_std=df_statistics.loc[col_name]["std"]

    n1,d1=TsEstimator.min_max_normalizer(v_min,v_max,v_str,is_int64=is_int64)
    n2,d2=TsEstimator.std_normalizer(v_mean,v_std,v_str,is_int64=is_int64)

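    # Map each normalizer source to its denormalizer source (avoids shadowing the builtin locals)
    local_normalizers={n1:d1,n2:d2}
    return local_normalizers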

  @staticmethod
  def int_list_as_string(a):
    """To generate a suitable string for an integer list"""
    s = [str(i) for i in a]
    return  "["+",".join(s)+"]"

  @staticmethod
  def string_list_as_string(s):
    """To generate a suitable string for a string list"""
    return  "['"+"','".join(s)+"']"

  def __df_desc(self):
    """
      To generate available feature kinds and suitable normalizer lambda statements for every column.
      Please note the whole dataframe and the train part are both required.
      The whole dataframe is used to decide the vocabulary list for each column.
      Both the whole dataframe and the train part are used to generate lambda statements for NUMERIC columns. 
      So normalizing can be based on the whole data or only the train part. It's up to the data scientist.
    """
    df_all=self._df_all
    df_train=self._df_train

    if (df_train is None):
      df_statistics_train=None
    else:
      df_statistics_train=df_train.describe().T # to use train part to normalize!
    df_statistics_all=df_all.describe().T # to use the whole data to normalize
    
    category_lists={}
    
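    for c in df_all.columns:  # df_all is always available, while df_train may be None
      dtype_name=df_all[c].dtype.name

      availables=self.get_available_mapping(dtype_name)

      if availables is None:
        availables=set()  # Must be a set, because .add() is called on it below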

      feature="numeric_column('"+c+"')"

      local_normalizers={}

      if ((dtype_name=="int64") or (dtype_name=="object")):
        is_int64=(dtype_name=="int64")

        values_unique=df_all[c].unique()
        f=len(values_unique)   # All rows are used to decide the category list
        if f<self.categories_limit: #Category
          if is_int64:
            feature=categorical_identitys.__name__+"('"+c+"',"+self.int_list_as_string(values_unique)+")"
          else:
            feature=categorical_strings.__name__+"('"+c+"',"+self.string_list_as_string(values_unique)+")"
          self.categorical_columns.append(c)
        else:
          if is_int64:
            feature="numeric_column('"+c+"')"
            # local_normalizers=({"lambda x:x": "lambda x:x" })
            if (df_statistics_train is not None):
              local_normalizers0=self.create_local_normalizers(c,df_statistics_train,v_str="by_train", is_int64=True)
              local_normalizers.update(local_normalizers0)
              self.global_normalizers.update(local_normalizers0)
            local_normalizers1=self.create_local_normalizers(c,df_statistics_all,v_str="by_all", is_int64=True)
            self.global_normalizers.update(local_normalizers1)
            local_normalizers.update(local_normalizers1)
            local_normalizers.update({"lambda x:x": "lambda x:x" })
          else:
            feature="embedding_column('"+c+"')"
      else:
        if (dtype_name=="float64"):
            feature="numeric_column('"+c+"')"
            # local_normalizers=({"lambda x:x": "lambda x:x" })
            if (df_statistics_train is not None):
              local_normalizers0=self.create_local_normalizers(c,df_statistics_train,v_str="by_train", is_int64=False)
              local_normalizers.update(local_normalizers0)
              self.global_normalizers.update(local_normalizers0)
            local_normalizers1=self.create_local_normalizers(c,df_statistics_all,v_str="by_all", is_int64=False)
            self.global_normalizers.update(local_normalizers1)
            local_normalizers.update(local_normalizers1)
            local_normalizers.update({"lambda x:x": "lambda x:x" })
        elif  (dtype_name=="bool"):
            feature="numeric_column('"+c+"')"
        elif (dtype_name=="category"):
          feature="categorical_column_with_vocabulary_list('"+c+"')"
          self.categorical_columns.append(c)
        else:
          # Fallback (an assumption): dtype_defaults is not defined anywhere in this notebook
          feature="numeric_column('"+c+"')"
      
      availables.add(feature)

      availables={s.replace("(...)","('"+c+"')") for s in availables}
      category_lists[c]={"default":feature,"available":availables,"normalizers": local_normalizers}

    return category_lists

  def get_feature_grid(self):
    """
    To get or create an ipywidgets grid for the data source settings
    """
    if self.grid_features is not None:
      return self.grid_features

    default_inputs=self.default_inputs
    default_labels=self.default_labels
    default_future=self.default_future

    df_all=self._df_all
    df_train=self._df_train

    cols=len(df_train.columns)
    grid = GridspecLayout(cols+1, 12)
    # To add a header at row 0
    grid[0,I_Column.p_Column]= widgets.Label(value="Column")
    grid[0,I_Column.p_dtype]= widgets.Label(value="dtype")
    grid[0,I_Column.p_Input]= widgets.Label(value="Input?")
    grid[0,I_Column.p_Label]= widgets.Label(value="Label?")
    grid[0,I_Column.p_future]= widgets.Label(value="Future?")  
    grid[0,I_Column.p_FeatureKind:(I_Column.p_FeatureKind+3)]= widgets.Label(value="Feature Kind")
    grid[0,I_Column.p_Numeric_Normalizer:]= widgets.Label(value="Numeric Normalizer")

    for i in range(cols):
      feature_option=self.category_lists[df_train.columns[i]]
      grid[i+1,int(I_Column.p_Column)]= widgets.Label(value=df_train.columns[i])
      grid[i+1,I_Column.p_dtype]= widgets.Label(value=df_train.dtypes.iloc[i].name)
      grid[i+1,I_Column.p_Input]=widgets.Checkbox(value=(df_train.columns[i] in default_inputs),description='',indent=False,layout=Layout(height='auto', width='auto'))
      grid[i+1,I_Column.p_Label]=widgets.Checkbox(value=(df_train.columns[i] in default_labels),indent=False,description='',layout=Layout(height='auto', width='auto'))
      grid[i+1,I_Column.p_future]=widgets.Checkbox(value=(df_train.columns[i] in default_future),indent=False,description='',layout=Layout(height='auto', width='auto'))

      grid[i+1,I_Column.p_FeatureKind:(I_Column.p_FeatureKind+3)]= widgets.Dropdown(
        options=list(feature_option['available']),
        value=feature_option['default'],
        description="",
        layout=Layout(height='auto', width='auto')
        )
      
      if len(feature_option['normalizers'])>0:
        grid[i+1,I_Column.p_Numeric_Normalizer:]=widgets.Dropdown(
          options=list(feature_option['normalizers'].keys()),
          value=list(feature_option['normalizers'].keys())[0],
          layout=Layout(height='auto', width='auto'),
          description=""
          )
    
    self.grid_features=grid

    return grid

  @property
  def seq2seq_mode(self):
    """To decide whether the model should be seq2seq instead of a vanilla LSTM"""
    return len(self.future_lambdas)>0

  def update_by_grid_features(self):
    """ To update datasource settings according to the ipywidget grid"""
    self.input_lambdas.clear()
    self.label_lambdas.clear()
    self.future_lambdas.clear()

    grid=self.grid_features
    for i in range(1,grid.n_rows):
      f_col=grid[i,int(I_Column.p_Column)].value
      if (grid[i,I_Column.p_FeatureKind].value.startswith("numeric_column(")):
        if (grid[i,I_Column.p_dtype].value =="bool"):
          lambda_f=lambda x: x
        else:
          n_str=grid[i,I_Column.p_Numeric_Normalizer].value
          lambda_f= eval(n_str)
      else:
        n_str=grid[i,I_Column.p_FeatureKind].value     
        lambda_f= eval(n_str)

      if (grid[i,I_Column.p_Input].value==True):
        # code_generator.append("input_lambdas.append(('"+f_col+"',"+lambda_f+"))")
        self.input_lambdas.append((f_col,lambda_f))
      if (grid[i,I_Column.p_Label].value==True):
        # code_generator.append("label_lambdas.append(('"+f_col+"',"+lambda_f+"))")
        self.label_lambdas.append((f_col,lambda_f))
      if (grid[i,I_Column.p_future].value==True):
        # code_generator.append("future_lambdas.append(('"+f_col+"',"+lambda_f+"))")
        self.future_lambdas.append((f_col,lambda_f))

  def input_future_label_1d(self):
    """To create a mapping function according to the input, label, future lambdas"""
    input_lambdas=self.input_lambdas
    label_lambdas=self.label_lambdas
    future_lambdas=self.future_lambdas

    def transform(row):
      i_result1=[tf.reshape(tf.cast(y(row[x]),tf.float64),[-1]) for (x, y) in input_lambdas ]
      i_result2=[tf.reshape(tf.cast(y(row[x]),tf.float64),[-1]) for (x, y) in label_lambdas ]
      i_result3=[tf.reshape(tf.cast(y(row[x]),tf.float64),[-1]) for (x, y) in future_lambdas ]

      return tf.concat(i_result1+i_result2+i_result3,0)
    return transform

  def label_1d(self):
    """To create a mapping function according to the label lambdas."""
    label_lambdas=self.label_lambdas
    def transform(row):
      i_result2=[tf.reshape(tf.cast(y(row[x]),tf.float64),[-1]) for (x, y) in label_lambdas ]

      return tf.concat(i_result2,0)
    return transform

  def future_1d(self):
    """To create a mapping function according to the future lambdas"""
    future_lambdas=self.future_lambdas

    def transform(row):
      i_result2=[tf.reshape(tf.cast(y(row[x]),tf.float64),[-1]) for (x, y) in future_lambdas ]

      return tf.concat(i_result2,0)
    return transform

  def post_windowing_split2_single(self):
    """
      To split a window into (input, label) when there are no future columns.
      Single step: the label is taken from the last row of the window only.
    """
    future_target=self.future_target
    v_labels=self.v_labels
    single_step=self.single_step

    def input_label_single(row):
      i_input=row[:-future_target,:]
      i_label=row[-1,-v_labels:]  # Slice, so that every label column is kept
      return i_input, i_label

    return input_label_single


  def post_windowing_split2(self):
    """
      To split a window into (input, label) when there are no future columns.
      Multi step: the labels of the last future_target rows are flattened into a 1-d tensor.
    """
    future_target=self.future_target
    v_labels=self.v_labels
    single_step=self.single_step
    def input_label(row):
      i_input=row[:-future_target,:]
      i_label=tf.reshape(row[-future_target:,-v_labels:],[-1]) # To reshape to a 1-d tensor
      return i_input, i_label

    return input_label

  def post_windowing_split3_single(self):
    """
      To split a window into ((input, future), label) when there are future columns.
      Single step: the label and the future values are taken from the last row of the window only.
    """
    future_target=self.future_target
    v_labels=self.v_labels
    v_future= self.v_future
    single_step=self.single_step

    def input_future_label_single(row):
      i_input=row[:-future_target,:]
      i_label=row[-1,-(v_labels+v_future):-v_future]
      i_future=row[-1,-v_future:]
      return (i_input,i_future), i_label

    return input_future_label_single

      
  def post_windowing_split3(self):
    """
      To split a window into ((input, future), label) when there are future columns.
      Multi step: the label and the future values are taken from the last future_target rows.
    """
    future_target=self.future_target
    v_labels=self.v_labels
    v_future= self.v_future
    single_step=self.single_step

    def input_future_label(row):
      i_input=row[:-future_target,:]
      i_label=row[-future_target:,-(v_labels+v_future):-v_future] 
      # if (v_labels==1):
      #   i_label=tf.reshape(i_label,[-1]) # To reshape to a 1-d tensor?
      i_future=row[-future_target:,-v_future:]
      return (i_input,i_future), i_label

    return input_future_label

  def df_to_windoweddataset(self, dataframe, shuffle=False):
    """
    A utility method to create a windowed tf.data dataset from a Pandas Dataframe
    """
    ds = tf.data.Dataset.from_tensor_slices(dict(dataframe))
    if self.seq2seq_mode:
      self.v_future=ds.map(self.future_1d()).element_spec.shape[-1]
    else:
      self.v_future=0

    self.v_labels=ds.map(self.label_1d()).element_spec.shape[-1]
    ds_map=ds.map(self.input_future_label_1d())
    
    # Feel free to play with shuffle buffer size
    shuffle_buffer_size = len(dataframe)
    # Total size of window is given by the number of steps to be considered
    # before prediction time + steps that we want to forecast

    total_size = self.past_history + self.future_target

    # Selecting windows
    data = ds_map.window(total_size, shift=1, drop_remainder=True)
    data = data.flat_map(lambda k: k.batch(total_size))
    if shuffle:
      data = data.shuffle(shuffle_buffer_size, seed=42)

    if self.seq2seq_mode:
      # Extracting past features  + labels + future
      if self.single_step:
        data = data.map(self.post_windowing_split3_single())
      else:
        data = data.map(self.post_windowing_split3())
    else:
      if self.single_step:
        data = data.map(self.post_windowing_split2_single())
      else:
        data = data.map(self.post_windowing_split2())

    ds_4_train= data.batch(self.batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    return ds_4_train

  def update_by_grid_periods(self):
    self.past_history = self.grid_periods[0,1].value
    self.future_target = self.grid_periods[1,1].value
    self.batch_size = self.grid_periods[2,1].value # A small batch size is used for demonstration purposes
    self.single_step = self.grid_periods[3,1].value

  def get_periods_grid(self):
    if self.grid_periods is not None:
      return self.grid_periods

    a_past_history=widgets.Label("Past History Periods:")
    v_past_history = widgets.IntText(value=self.past_history)

    a_future_target=widgets.Label("Future_target Periods:")
    v_future_target = widgets.IntText(value=self.future_target)

    a_batch_size=widgets.Label("Batch Size:")
    v_batch_size = widgets.IntText(value=self.batch_size)

    a_single_step=widgets.Label("Single Step?")
    v_single_step= widgets.Checkbox(value=self.single_step)

    grid_periods = widgets.GridspecLayout(4, 5)

    grid_periods[0,0]=a_past_history; grid_periods[0,1:]=v_past_history
    grid_periods[1,0]=a_future_target; grid_periods[1,1:]=v_future_target
    grid_periods[2,0]=a_batch_size; grid_periods[2,1:]=v_batch_size
    grid_periods[3,0]=a_single_step; grid_periods[3,1:]=v_single_step

    self.grid_periods=grid_periods

    return grid_periods

  @staticmethod
  def create_vanilla_LSTM_model(input_shape, label_shape):
    """Slightly modified from Multi-Step model for a multivariate time series in https://www.tensorflow.org/tutorials/structured_data/time_series"""
    multi_step_model = tf.keras.models.Sequential()
    multi_step_model.add(tf.keras.layers.LSTM(32,
                                              return_sequences=True,
                                              input_shape=input_shape))
    multi_step_model.add(tf.keras.layers.LSTM(16, activation='relu'))
    multi_step_model.add(tf.keras.layers.Dense(label_shape))

    multi_step_model.compile(optimizer=tf.keras.optimizers.RMSprop(clipvalue=1.0), loss='mae')

    return multi_step_model

  @staticmethod 
  def create_seq2seq_model(input_shape,label_shape, future_shape):
    """Slightly modified from Encoder/Decoder model of https://www.angioi.com/time-series-encoder-decoder-tensorflow/"""
    latent_dim = 16

    # First branch of the net is an lstm which finds an embedding for the past
    past_inputs = tf.keras.Input(shape=input_shape, name='past_inputs')
    # Encoding the past
    encoder = tf.keras.layers.LSTM(latent_dim, return_state=True , name="lstm_encoder")
    encoder_outputs, state_h, state_c = encoder(past_inputs)

    future_inputs = tf.keras.Input(shape=future_shape, name='future_inputs')
    # Combining future inputs with recurrent branch output
    decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True, name="lstm_decoder")
    x = decoder_lstm(future_inputs, 
                                  initial_state=[state_h, state_c])

    x = tf.keras.layers.Dense(16, activation='relu')(x)
    x = tf.keras.layers.Dense(16, activation='relu')(x)
    output = tf.keras.layers.Dense(label_shape[-1], activation='relu')(x)

    model = tf.keras.models.Model(inputs=[past_inputs,future_inputs], outputs=output)

    optimizer = tf.keras.optimizers.Adam()
    loss = tf.keras.losses.Huber()
    model.compile(loss=loss, optimizer=optimizer, metrics=["mae"])

    return model

  def train_by_model(self,model_fn, epochs=1):
    """
    Train with a model that you provide.
    model_fn should build a vanilla LSTM style model, called as model_fn(input_shape, label_shape),
    or, in seq2seq mode, a Seq2Seq style model, called as model_fn(input_shape, label_shape, future_shape).
    """
    train_ds = self.df_to_windoweddataset(self._df_train, shuffle=False)

    if (self._df_val is None):
      val_ds=None
    else:
      val_ds = self.df_to_windoweddataset(self._df_val, shuffle=False)
    # test_ds = self.df_to_windoweddataset(dataframe_test, shuffle=False)

    if not self.seq2seq_mode:
      input_shape=tuple(next(iter(train_ds))[0].shape[-2:].as_list())
      label_shape=next(iter(train_ds))[1].shape[-1]
      model = model_fn(input_shape, label_shape)
    else:
      input_shape=tuple(next(iter(train_ds))[0][0].shape[1:].as_list())
      future_shape=tuple(next(iter(train_ds))[0][1].shape[1:].as_list())
      label_shape=tuple(next(iter(train_ds))[1].shape[1:].as_list())
      model = model_fn(input_shape, label_shape, future_shape)
    
    history = model.fit(train_ds, epochs=epochs, validation_data=val_ds)

    return history

  def train_by_builtin_model(self,epochs=1):
    """
    Train with a built-in model: the vanilla LSTM model or the Seq2Seq model
    """
    if not self.seq2seq_mode:
      model_fn=TsEstimator.create_vanilla_LSTM_model
    else:
      model_fn=TsEstimator.create_seq2seq_model
    
    return self.train_by_model(model_fn, epochs)

  def pairplot(self):
    """
    To chart the combination of categorical columns.
    In order to find the balance of the data, you may need to inspect in detail.
    """
    if len(self.categorical_columns)>0:
      sns.pairplot(self._df_all[self.categorical_columns], diag_kind="kde")
    else:
      print("There are no categorical columns.")

The basic preprocessing steps:

1. Pre-windowing transform: each column value is mapped to a 1-d vector (for example by one-hot coding). Embedding features (static, dynamic) are still an open question; a label cannot be an embedding.

2. Windowing, for different lookback and forecast periods. (Into a dictionary?)

3. Post-windowing transform: could embedding happen here?
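
The windowing step can be sketched in plain Python as follows. This is only an illustration of the index arithmetic, assuming a shift of one step and dropping incomplete windows at the end, which matches the window(total_size, shift=1, drop_remainder=True) call used by the estimator. The name sliding_windows and the toy series are hypothetical.

```python
def sliding_windows(rows, past_history, future_target):
    """Yield (past, future) pairs from consecutive windows with a shift of 1."""
    total_size = past_history + future_target
    # Incomplete windows at the end are dropped.
    for start in range(len(rows) - total_size + 1):
        window = rows[start:start + total_size]
        yield window[:past_history], window[past_history:]


series = list(range(10))  # stand-in for rows that are already transformed
pairs = list(sliding_windows(series, past_history=4, future_target=2))
print(len(pairs))
print(pairs[0])
print(pairs[-1])
```

A series of N rows yields N - (past_history + future_target) + 1 windows, so long lookback periods need correspondingly long series.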



# A Seq2Seq model demo

Let's start with a practical example of a time series and look at [the UCI Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). For each hour, it gives the number of bikes rented by customers of a bike sharing service in Washington DC, together with other features such as whether a certain day was a national holiday and which day of the week it was.
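
In this demo, the columns marked as future (calendar values such as hr or holiday, which are known ahead of time) feed the decoder. The sketch below shows, in plain Python without TensorFlow, how one window could be split into ((input, future), label). It assumes each row is laid out as input values, then label values, then future values, which is the order in which the estimator concatenates them. The name split_window and the toy numbers are hypothetical.

```python
def split_window(window, future_target, v_labels, v_future):
    """Split one window (a list of rows) into ((past, future), label).

    Each row is assumed to hold input values, then label values,
    then future values.
    """
    past = window[:-future_target]     # all columns, past steps only
    horizon = window[-future_target:]  # the steps to forecast
    width = len(window[0])
    label_start = width - (v_labels + v_future)
    label_end = width - v_future
    label = [row[label_start:label_end] for row in horizon]
    future = [row[label_end:] for row in horizon]
    return (past, future), label


# Toy window: 3 past steps + 2 future steps; columns = [input, label, future].
window = [[t, 10 * t, 100 * t] for t in range(5)]
(past, future), label = split_window(window, future_target=2, v_labels=1, v_future=1)
print(past)
print(future)
print(label)
```

The past part keeps every column, while the forecast horizon is separated into what is known in advance (future) and what has to be predicted (label).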

## Your configuration here

In [0]:
def get_configs():
  if not path.exists('bike_data/hour.csv'):
    ! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
    ! unzip Bike-Sharing-Dataset.zip -d bike_data

  df = pd.read_csv('bike_data/hour.csv', index_col='instant')
  cols_to_keep = [       
    'cnt',
    'temp',
    'hum',
    'windspeed',
    'yr',
    'mnth', 
    'hr', 
    'holiday', 
    'weekday', 
    'workingday'
  ]
  dataframe = df[cols_to_keep]
  default_inputs=['cnt','temp','hum','windspeed']  # The default features list for inputs
  default_labels=['cnt']  # The default features list for labels
  default_future=list(set(cols_to_keep)-set(default_inputs))
    
  dataframe.head()

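  # shuffle=False keeps the rows in time order, so that every window is contiguous
  dataframe_train, dataframe_test = train_test_split(dataframe, test_size=0.2, shuffle=False)
  dataframe_train, dataframe_val = train_test_split(dataframe_train, test_size=0.2, shuffle=False)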
  print(len(dataframe_train), 'train examples')
  print(len(dataframe_val), 'validation examples')

  past_history = 24 * 7 * 3 
  future_target = 24 * 5
  categories_limit=20
  batch_size = 32
  single_step=False

  return dataframe, dataframe_train, dataframe_val, dataframe_test,default_inputs,default_labels,default_future, past_history, future_target,categories_limit, batch_size,single_step
  # The order of the returned values must match the order that TsEstimator.__init__ unpacks.


## Create an estimator

In [0]:
estimator=TsEstimator(get_configs)

## Inspect the data by categorical columns

In [0]:
estimator.pairplot()

## To create an interactive features grid

You may try the builder INTERACTIVELY.

In [0]:
grid=estimator.get_feature_grid()
grid

<font color=red>**RERUN** the following cells once you change the above settings.</font>

In [0]:
estimator.update_by_grid_features()

assert(len(estimator.input_lambdas)>0)
assert(len(estimator.label_lambdas)>0)
# code_generator
estimator.input_lambdas,estimator.label_lambdas, estimator.future_lambdas

## To create an interactive periods grid

In [0]:
grid_periods=estimator.get_periods_grid()
grid_periods

In [0]:
estimator.update_by_grid_periods()

## Train the model by the built-in models



In [0]:
estimator.train_by_builtin_model(1)

# An LSTM model demo

## Your configuration here

**The weather dataset**

This demo uses a <a href="https://www.bgc-jena.mpg.de/wetter/" class="external">weather time series dataset</a> recorded by the <a href="https://www.bgc-jena.mpg.de" class="external">Max Planck Institute for Biogeochemistry</a>.

This dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. For efficiency, you will use only the data collected between 2009 and 2016. This section of the dataset was prepared by François Chollet for his book [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python).
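
Because the records are 10 minutes apart, the window settings in the configuration cell below (past_history = 720 and future_target = 72) translate into wall-clock spans as sketched here. The helper window_count and the row count of 1000 are hypothetical and only illustrate the arithmetic; they are not properties of the dataset.

```python
SAMPLING_MINUTES = 10  # stated above: one record every 10 minutes
past_history = 720     # steps, as set in the configuration cell
future_target = 72     # steps, as set in the configuration cell

past_hours = past_history * SAMPLING_MINUTES / 60
future_hours = future_target * SAMPLING_MINUTES / 60


def window_count(n_rows, past_history, future_target):
    """Number of complete windows when sliding one step at a time."""
    return max(0, n_rows - (past_history + future_target) + 1)


print(past_hours / 24, "days of history per window")
print(future_hours, "hours forecast per window")
print(window_count(1000, past_history, future_target), "windows from 1000 rows")
```

Each window therefore covers five days of history and a twelve-hour forecast horizon.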

In [0]:
def get_configs_lstm():
  if not path.exists('/root/.keras/datasets/jena_climate_2009_2016.csv'):
    zip_path = tf.keras.utils.get_file(
      origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
      fname='jena_climate_2009_2016.csv.zip',
      extract=True)
    csv_path, _ = os.path.splitext(zip_path)
    print(csv_path)

  dataframe = pd.read_csv('/root/.keras/datasets/jena_climate_2009_2016.csv', index_col='Date Time')
  default_inputs=['p (mbar)', 'T (degC)', 'rho (g/m**3)']  # The default features list for inputs
  default_labels=['T (degC)'] # The default features list for labels
  default_future=[]
    
  dataframe.head()

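  # shuffle=False keeps the rows in time order, so that every window is contiguous
  dataframe_train, dataframe_test = train_test_split(dataframe, test_size=0.2, shuffle=False)
  dataframe_train, dataframe_val = train_test_split(dataframe_train, test_size=0.2, shuffle=False)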
  print(len(dataframe_train), 'train examples')
  print(len(dataframe_val), 'validation examples')

  past_history = 720 
  future_target = 72
  categories_limit=20
  batch_size = 32
  single_step=False

  return dataframe, dataframe_train, dataframe_val, dataframe_test,default_inputs,default_labels,default_future, past_history, future_target,categories_limit, batch_size,single_step
  # The order of the returned values must match the order that TsEstimator.__init__ unpacks.


## Create an estimator

In [0]:
estimator_lstm=TsEstimator(get_configs_lstm)

## Inspect the data by categorical columns

In [0]:
estimator_lstm.pairplot()

## To create an interactive features grid

You may try the builder INTERACTIVELY.

In [0]:
grid_lstm=estimator_lstm.get_feature_grid()
grid_lstm

<font color=red>**RERUN** the following cells once you change the above settings.</font>

In [0]:
estimator_lstm.update_by_grid_features()

assert(len(estimator_lstm.input_lambdas)>0)
assert(len(estimator_lstm.label_lambdas)>0)
# code_generator
estimator_lstm.input_lambdas,estimator_lstm.label_lambdas, estimator_lstm.future_lambdas

## To create an interactive periods grid

In [0]:
grid_periods_lstm=estimator_lstm.get_periods_grid()
grid_periods_lstm

In [0]:
estimator_lstm.update_by_grid_periods()

## Train the model by the built-in models



In [0]:
estimator_lstm.seq2seq_mode

In [0]:
estimator_lstm.train_by_builtin_model(1)

