### Embeddings
- For each categorical variable we want one embedding layer
- See https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
- For each non-categorical (continuous) variable we want a dense layer with the same output dimension

- We will work with data of shape `(batch_size, time_steps, features)`
- For *each feature* we want an embedding of shape `(batch_size, time_steps, embedding_dim)`
- Finally, all embedded features are concatenated along the last (feature) axis
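
The shape bookkeeping described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the toy sizes, the `kernel`/`bias` arrays (standing in for a dense layer) and the `table` array (standing in for an embedding layer) are hypothetical and are not the Keras layers used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, time_steps, embedding_dim = 4, 6, 3

# Hypothetical toy input: feature 0 is continuous, feature 1 holds category codes 0..4.
continuous = rng.normal(size=(batch_size, time_steps, 1))
categorical = rng.integers(0, 5, size=(batch_size, time_steps, 1))
x = np.concatenate([continuous, categorical], axis=-1)  # (batch, time, 2)

# Continuous feature: keep the feature axis, then apply a dense map 1 -> embedding_dim.
kernel = rng.normal(size=(1, embedding_dim))
bias = np.zeros(embedding_dim)
cont_embedded = x[..., 0, None] @ kernel + bias  # (batch, time, embedding_dim)

# Categorical feature: drop the feature axis and look the codes up in a table.
table = rng.normal(size=(5, embedding_dim))
cat_embedded = table[x[..., 1].astype(int)]  # (batch, time, embedding_dim)

# Concatenate the per-feature embeddings along the last (feature) axis.
embedded = np.concatenate([cont_embedded, cat_embedded], axis=-1)
print(cont_embedded.shape, cat_embedded.shape, embedded.shape)
```

Each feature contributes `embedding_dim` columns, so two features give a last axis of size `2 * embedding_dim`.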

### Lessons learned
- **Continuous variables**
- Use a `TimeDistributed` layer on top of a `Dense(embedding_dim)` layer in order to have the same embedding for each time step
- We need to keep the dimensions when slicing out a feature, `input = tensor[..., feature_idx, None]`
- Strictly speaking, `TimeDistributed` is not necessary for dense layers, because `Dense` already acts on the last axis only; wrapping it still states the intent more clearly


- **Categorical variables**
- First encode the values with scikit-learn's `OrdinalEncoder`
- Construct the embedding layer, `Embedding(input_dim=input_dim, output_dim=embedding_dim, input_length=time_steps)`
- Just slice the data along the feature axis (keeping dims is not required!), `input = tensor[..., feature_idx]`


- **Making it static**
- Assume we have data where a feature is static, that is, time independent: for a given feature and sample (batch index) the tensor entries are constant along the time axis.

- In order to get a static entity, one could slice the time-dependent embeddings (take an arbitrary time index of the embedding)
- Or learn the embedding on a single time slice (any time index will do, as the data are static).
- These two approaches are equivalent.
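
A minimal NumPy sketch of why the two options give the same values: an embedding is a row lookup, so for static data it does not matter whether the lookup or the time slice happens first. The `table` array is a hypothetical stand-in for the embedding weights, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, time_steps, n_categories, embedding_dim = 4, 5, 7, 3

# Static categorical feature: one code per sample, repeated over every time step.
codes_per_sample = rng.integers(0, n_categories, size=(batch_size, 1))
codes = np.repeat(codes_per_sample, time_steps, axis=1)  # (batch, time)

# Hypothetical stand-in for the weights of an embedding layer.
table = rng.normal(size=(n_categories, embedding_dim))

# Option 1: embed all time steps, then keep an arbitrary time index.
slice_after = table[codes][:, 2, :]
# Option 2: pick a time index first, then embed only that slice.
slice_before = table[codes[:, 0]]

print(slice_after.shape, np.array_equal(slice_after, slice_before))
```

Both results have shape `(batch, embedding_dim)`; the comparison only holds because the same weights are used for both lookups.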


In [2]:
import tensorflow as tf
from sklearn.preprocessing import OrdinalEncoder
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### TensorFlow: keeping dims when slicing a single index
- either slice an interval of length one, e.g. `t[..., 1:2]`
- or slice one index and add a new axis, e.g. `t[..., 1, None]`

In [3]:
import numpy as np
t = tf.convert_to_tensor(np.array([[[1,2,3],[4,5,6], [7,8,9]],
                              [[10,20,30],[40,50,60], [70,80,90]]]))
t

<tf.Tensor: shape=(2, 3, 3), dtype=int64, numpy=
array([[[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9]],

       [[10, 20, 30],
        [40, 50, 60],
        [70, 80, 90]]])>

In [4]:
t[...,None, 1] ==  t[...,1, None]

<tf.Tensor: shape=(2, 3, 1), dtype=bool, numpy=
array([[[ True],
        [ True],
        [ True]],

       [[ True],
        [ True],
        [ True]]])>

In [5]:
t[...,None, 1] == t[..., 1:2]

<tf.Tensor: shape=(2, 3, 1), dtype=bool, numpy=
array([[[ True],
        [ True],
        [ True]],

       [[ True],
        [ True],
        [ True]]])>

In [6]:
t[1]

<tf.Tensor: shape=(3, 3), dtype=int64, numpy=
array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])>

In [7]:
batch_size = 32
time_steps = 100
n_features = 20
hidden_dim = 60
feature_index=1
#xs = tf.random.normal(shape=(batch_size, time_steps, n_features))
xs = tf.convert_to_tensor(np.random.randint(100, size=(batch_size, time_steps, n_features)))
print(f"input data shape, {xs.shape}")


# time-dependent continuous variables: keep the feature axis when slicing
input = xs[..., feature_index, None]
dense_and_td = tf.keras.layers.Dense(hidden_dim, name="dense_and_td")
continuous_time_dependent = tf.keras.layers.TimeDistributed(dense_and_td)(input)
print(f"Continuous data with TimeDistributed: expected (batch_size, time, hidden_dim) {continuous_time_dependent.shape}")
print(f"    Dimensions of underlying dense layer {[info.shape for info in dense_and_td.trainable_weights]}")



# Experiment: time-dependent continuous variables without the TimeDistributed wrapper
input = xs[..., feature_index, None]
dense = tf.keras.layers.Dense(hidden_dim, name="dense_no_td")
print(f"Continuous data without TimeDistributed: expected (batch_size, time, hidden_dim) {dense(input).shape}")
print(f"    Dimensions of underlying dense layer {[info.shape for info in dense.trainable_weights]}")



# categorical embedding (input_dim = number of distinct values, output_dim = 2, input_length = time_steps)
# NOTE: len(unique) is only a valid input_dim if the codes are gap-free (0 .. n-1);
# see the ordinal-encoding excursion further down
input = xs[:, :, 2]
unique, _ = tf.unique(tf.reshape(input, [-1]))
categorical_embedding = tf.keras.layers.Embedding(input_dim=len(unique), output_dim=2, input_length=time_steps)(input)
print(f"Categorical embedding shape: expected (batch, time, out_dim) {categorical_embedding.shape}")

# make a static variable (take t=0) from the categorical embedding
print("Static slice of categorical embedding: expected (batch, out_dim)", categorical_embedding[:, 0, :].shape)


# Lessons learned:
# keep dims for the dense-layer embedding of continuous variables
# do not keep dims for categorical embeddings

input data shape, (32, 100, 20)
Continuous data with TimeDistributed: expected (batch_size, time, hidden_dim) (32, 100, 60)
    Dimensions of underlying dense layer [TensorShape([1, 60]), TensorShape([60])]
Continuous data without TimeDistributed: expected (batch_size, time, hidden_dim) (32, 100, 60)
    Dimensions of underlying dense layer [TensorShape([1, 60]), TensorShape([60])]
Categorical embedding shape: expected (batch, time, out_dim) (32, 100, 2)
Static slice of categorical embedding: expected (batch, out_dim) (32, 2)


### Excursus: time distributed dense layers for continuous embeddings
- Let's look at a static dataset, where the entries are equal at every time step
- Is there a difference between a time distributed dense layer and a plain dense layer?
- Are the results equal for each time step?

### Conclusions:
- For dense layers, time distribution does not make a difference
    - The number of parameters in the layer is the same
    - The result is the same for each time step
    - The overall result is the same
- This follows from how the dense layer is implemented: it acts on the last axis only, so every time step sees the same kernel and bias
- It still seems more appropriate to use time distributed layers, as this matches the problem requirement
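
The reason behind these conclusions can be sketched in NumPy. A dense map is a matrix product over the last axis plus a bias, so applying it to the whole 3D tensor or to each time step separately gives the same numbers. The `kernel` and `bias` arrays are hypothetical stand-ins for the layer weights; this sketches the arithmetic and is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
batch_size, time_steps, hidden_dim = 4, 5, 3

# One continuous feature with the feature axis kept: (batch, time, 1).
x = rng.normal(size=(batch_size, time_steps, 1))
kernel = rng.normal(size=(1, hidden_dim))
bias = rng.normal(size=(hidden_dim,))

# "Plain dense": a single matmul over the last axis of the whole 3D tensor.
plain = x @ kernel + bias

# "Time distributed": apply the same map to each time step, then restack along time.
per_step = np.stack(
    [x[:, t, :] @ kernel + bias for t in range(time_steps)], axis=1
)

print(plain.shape, np.allclose(plain, per_step))
```

In both variants the only parameters are a `(1, hidden_dim)` kernel and a `(hidden_dim,)` bias, which matches the weight shapes printed in this notebook.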

In [8]:
# set up toy data
x = tf.convert_to_tensor(np.random.randint(100, size=(batch_size, 1, n_features)))
xs = tf.concat([x]*time_steps, axis=1)

In [9]:
input = xs[..., feature_index, None]
kernel_initializer = tf.keras.initializers.Constant(2.)
bias_initializer = tf.keras.initializers.Constant(3.)

# time distributed dense
td_dense = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(hidden_dim, name="time_distributed_dense", 
                                                                 kernel_initializer=kernel_initializer, 
                                                                 bias_initializer=bias_initializer))
# simple dense layer
dense = tf.keras.layers.Dense(hidden_dim, name="plain_dense", kernel_initializer=kernel_initializer, 
                                                                 bias_initializer=bias_initializer)

td_dense_result = td_dense(input)
dense_result = dense(input)

In [10]:
print(f"simple dense {[info.shape for info in dense.trainable_weights]}")
print(f"time distributed dense {[info.shape for info in td_dense.trainable_weights]}")

simple dense [TensorShape([1, 60]), TensorShape([60])]
time distributed dense [TensorShape([1, 60]), TensorShape([60])]


In [11]:
agrees = []
for t in range(time_steps):
    agree = tf.reduce_all((td_dense_result[:, t, :]) == (dense_result[:, t, :])).numpy()
    agrees.append(agree)
print(f"both approaches  agree on all timestep: {all(agrees)}")

both approaches  agree on all timestep: True


In [12]:
agrees = []
for t in range(time_steps):
    for tau in range(time_steps):
        agree = tf.reduce_all((dense_result[:, t, :]) == (dense_result[:, tau, :])).numpy()
        agrees.append(agree)
print(f"Dense gives same result for each timestep: {all(agrees)}")

Dense gives same result for each timestep: True


In [13]:
agrees = []
for t in range(time_steps):
    for tau in range(time_steps):
        agree = tf.reduce_all((td_dense_result[:, t, :]) == (td_dense_result[:, tau, :])).numpy()
        agrees.append(agree)
print(f"Time distributed dense gives same result for each timestep: {all(agrees)}")

Time distributed dense gives same result for each timestep: True


### Excursus: check time-dependent categorical variables
- We need to ordinal-encode categorical variables!
- Integer labels alone are not enough: the embedding layer additionally requires that there are no "gaps" between the possible values, i.e. the codes must run from 0 to `input_dim - 1`
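
A small NumPy sketch of the "no gaps" requirement. Here `np.unique(..., return_inverse=True)` is used as a stand-in for an ordinal encoder, and `table` is a hypothetical lookup table with one row per distinct value; the raw labels are made up for illustration.

```python
import numpy as np

# Integer labels with gaps, shape (batch, time).
raw = np.array([[3, 3, 42], [7, 42, 3]])

# Map the labels to gap-free codes 0 .. n_unique-1.
values, inverse = np.unique(raw, return_inverse=True)
codes = inverse.reshape(raw.shape)
n_unique = len(values)

# A lookup table with one row per distinct value.
table = np.zeros((n_unique, 2))

# Indexing with the raw labels fails, because 42 is not a valid row index.
try:
    table[raw]
    raw_ok = True
except IndexError:
    raw_ok = False

# Indexing with the gap-free codes works.
embedded = table[codes]
print(codes.tolist(), n_unique, raw_ok, embedded.shape)
```

With the gap-free codes, `input_dim` can safely be set to the number of distinct values.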

In [14]:
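from sklearn.preprocessing import OrdinalEncoder
# NOTE: OrdinalEncoder encodes each column (here: each time step) independently.
# That is fine for this static toy data, where all columns are identical;
# for time-varying data, flatten to shape (-1, 1) before fitting.
oe = OrdinalEncoder()
oe.fit(xs[..., 1])
encoded = tf.convert_to_tensor(oe.transform(xs[..., 1]))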

In [15]:
uniques, _ = tf.unique(tf.reshape(encoded, [-1]))
input_dim = len(uniques)
output_dim = 5
input_length = time_steps
print(f"Input Dimension {input_dim}")

Input Dimension 29


In [16]:
embeddings_initializer = tf.keras.initializers.Constant(2.)
embedding_layer = tf.keras.layers.Embedding(input_dim=input_dim, 
                                     output_dim=output_dim, 
                                     input_length=input_length, 
                                     embeddings_initializer=embeddings_initializer)
embedded = embedding_layer(encoded)

In [17]:
agrees = []
for t in range(time_steps):
    for tau in range(time_steps):
        agree = tf.reduce_all(embedded[:, t, :] == embedded[:, tau, :]).numpy()
        agrees.append(agree)
print(f"Embeddings over all time steps agree {all(agrees)}")

Embeddings over all time steps agree True


In [18]:
static_embedding_layer = tf.keras.layers.Embedding(input_dim=input_dim, 
                                                   output_dim=output_dim, 
                                                   embeddings_initializer=embeddings_initializer)
static_embedded = static_embedding_layer(encoded[:, 1])

In [19]:
agrees = tf.reduce_all(static_embedded == embedded[:, 0, :]).numpy()
print(f"Static embedding without input_length agrees with time dependent embedding layer? {agrees}")

Static embedding without input_length agrees with time dependent embedding layer? True
