---
# [Tabular Playground Series - Apr 2022][1]

- This challenge is a time series classification problem.

- The goal of this competition is to predict the state of each sequence from its sensor data.

---

#### **The aim of this notebook is to**
- **1. Conduct Exploratory Data Analysis with TensorFlow Data Validation (TFDV)**
- **2. Compare the customized TabTransformer model and RNN(LSTM) model.**

#### **Conclusions**
- **RNN(LSTM) model seems to be more suitable for this competition's task.**


---
**References:** Thanks to the authors of these helpful guides and notebooks.
- [Get started with Tensorflow Data Validation][2]
- [Migrating feature_columns to TF2's Keras Preprocessing Layers][3]
- [Classify structured data using Keras preprocessing layers][4]
- [Sachin's Blog Tensorflow Learning Rate Finder][5]
- [🔥🔥[TensorFlow]TabTransformer🔥🔥][6]
- [Top 1% | TPS APR 22 EDA | LSTM][7]
- [TPS Apr22 - EDA / FE + LSTM Tutorial][8]

---

#### **If you find this notebook useful, please do give me an upvote. It helps to keep up my motivation.**

---

[1]: https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/overview
[2]: https://www.tensorflow.org/tfx/data_validation/get_started
[3]: https://www.tensorflow.org/guide/migrate/migrating_feature_columns
[4]: https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers
[5]: https://sachinruk.github.io/blog/tensorflow/learning%20rate/2021/02/15/Tensorflow-Learning-Rate-Finder.html
[6]: https://www.kaggle.com/code/usharengaraju/tensorflow-tabtransformer
[7]: https://www.kaggle.com/code/kartushovdanil/top-1-tps-apr-22-eda-lstm
[8]: https://www.kaggle.com/code/javigallego/tps-apr22-eda-fe-lstm-tutorial

# 0. Settings

In [None]:
# Import dependencies 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

import seaborn as sns

import os
import pathlib
import gc
import sys
import re
import math 
import random
import time 
import datetime as dt
from tqdm import tqdm 
from pprint import pprint

import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow_addons as tfa

print('import done!')

In [None]:
# For reproducible results    
def seed_all(s):
    random.seed(s)
    np.random.seed(s)
    tf.random.set_seed(s)
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    os.environ['PYTHONHASHSEED'] = str(s) 
    print('Seeds set!')
    
global_seed = 42
seed_all(global_seed)

# 1. Data Loading & Preprocessing

## 1.1 Data Loading

---
### [Files Descriptions](https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/data)

- **train.csv** - the training set, comprising ~26,000 60-second recordings of thirteen biological sensors for almost one thousand experimental participants

- **train_labels.csv** - the class label for each sequence.

- **test.csv** - the test set. For each of the ~12,000 sequences, you should predict a value for that sequence's state.

- **sample_submission.csv** - a sample submission file in the correct format.

---

In [None]:
data_config = {'train_csv_path': '../input/tabular-playground-series-apr-2022/train.csv',
               'train_labels_path': '../input/tabular-playground-series-apr-2022/train_labels.csv',
               'test_csv_path': '../input/tabular-playground-series-apr-2022/test.csv',
               'sample_submission_path': '../input/tabular-playground-series-apr-2022/sample_submission.csv',
              }

train_df = pd.read_csv(data_config['train_csv_path'])
train_labels_df = pd.read_csv(data_config['train_labels_path'])
test_df = pd.read_csv(data_config['test_csv_path'])
submission_df = pd.read_csv(data_config['sample_submission_path'])

print(f'train_length: {len(train_df)}')
print(f'train_labels_length: {len(train_labels_df)}')
print(f'test_length: {len(test_df)}')
print(f'submission_length: {len(submission_df)}')

## 1.2 Data Check

---
### [Field Descriptions](https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/data)

- **train.csv**
 - `sequence` - a unique id for each sequence
 - `subject` - a unique id for the subject in the experiment
 - `step` - time step of the recording, in one second intervals
 - `sensor_00` - `sensor_12` - the value for each of the thirteen sensors at that time step

- **train_labels.csv** - the class label for each sequence.
 - `sequence` - the unique id for each sequence.
 - `state` - the state associated to each sequence. This is the target which you are trying to predict.
 
---
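
The long layout described above (one row per `sequence` and `step`) is usually regrouped into one block of steps x sensors per sequence before modeling. Here is a minimal sketch in plain Python, using made-up rows and a hypothetical helper `group_by_sequence` that is not part of this notebook:

```python
from collections import defaultdict

N_STEPS = 3    # the competition data has 60 steps; 3 keeps the toy example short
N_SENSORS = 2  # the competition data has 13 sensors

# Toy rows in the same long layout as train.csv:
# (sequence, subject, step, sensor values...)
rows = [
    (0, 47, 0, 0.1, 1.0),
    (0, 47, 1, 0.2, 1.1),
    (0, 47, 2, 0.3, 1.2),
    (1, 66, 0, -0.5, 2.0),
    (1, 66, 1, -0.4, 2.1),
    (1, 66, 2, -0.3, 2.2),
]

def group_by_sequence(rows):
    """Hypothetical helper: map each sequence id to its per-step sensor readings."""
    grouped = defaultdict(list)
    # Sort by (sequence, step) so that the readings stay in chronological order.
    for sequence, _subject, _step, *sensor_values in sorted(rows, key=lambda r: (r[0], r[2])):
        grouped[sequence].append(sensor_values)
    return dict(grouped)

sequences = group_by_sequence(rows)
for seq_id, steps in sequences.items():
    print(seq_id, len(steps), len(steps[0]))
```

With the real data, each sequence would hold 60 rows of 13 values, and `train_labels.csv` would supply one `state` per sequence id.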

In [None]:
# Null value check. DataFrame.info() prints its report itself and returns None,
# so it is called directly instead of being wrapped in print().
for name, df in [('train_df', train_df), ('train_labels_df', train_labels_df),
                 ('test_df', test_df), ('submission_df', submission_df)]:
    print(f'{name}.info()')
    df.info()
    print()

In [None]:
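# Subjects that appear in the test set but not in the training set.
train_subject_set = set(train_df['subject'].unique())
test_only_subject = [s for s in test_df['subject'].unique() if s not in train_subject_set]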
#print(len(test_only_subject))

if len(test_only_subject) == len(test_df['subject'].unique()):
    print('There is no overlap in "subject" between train and test data.')
else:
    print('There are some overlaps in "subject" between train and test data.')

In [None]:
# train_df Check
train_df.head()

In [None]:
print(train_df.duplicated().value_counts())
train_df = train_df.drop_duplicates()

In [None]:
def print_unique_category(df, column):
    print(f'feature_name: {column}')
    print(f'unique_category_number: {df[column].nunique()}')
    print(f'categories: {df[column].unique()}\n')

# Categories in train_df
print_unique_category(train_df, 'sequence')
print_unique_category(train_df, 'subject')
print_unique_category(train_df, 'step')

In [None]:
# train_labels_df Check
train_labels_df.head()

In [None]:
# Categories in test_df 
print_unique_category(test_df, 'sequence')
print_unique_category(test_df, 'subject')
print_unique_category(test_df, 'step')

# test_df Check
test_df.head()

In [None]:
# Submission_df check
submission_df.head()

## 1.3 Data Scaling
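
The scaling cells in this section end up using a signed logarithmic transform, `sign(x) * log(1 + |x|)`, which compresses large magnitudes while keeping the sign of each reading. The sketch below shows the idea on single values in plain Python; `signed_log1p` and `inverse_signed_log1p` are illustrative names that do not appear in the notebook.

```python
import math

def signed_log1p(x):
    """Return sign(x) * log(1 + |x|) for a single float."""
    return math.copysign(math.log1p(abs(x)), x)

def inverse_signed_log1p(y):
    """Undo signed_log1p: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

# Large magnitudes are pulled toward zero; the sign and the ordering are preserved.
for value in [-1000.0, -1.0, 0.0, 1.0, 1000.0]:
    print(value, round(signed_log1p(value), 4))
```

Because the transform is odd and strictly increasing, it does not need any statistics from the training set, so the same function can be applied to train and test data.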

### 1.3.1 Train Data

In [None]:
sensors = np.array(['sensor_00', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06',
           'sensor_07', 'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12',])

train_mean = train_df.mean()
train_var = train_df.var()
train_std = train_df.std()
train_min = train_df.min()
train_max = train_df.max()
train_minus_flg = pd.DataFrame(np.where(train_df.values < 0, -1, 1), columns=train_df.columns)

train_sensors_mean = train_mean.drop(['sequence', 'subject', 'step'])
train_sensors_std = train_std.drop(['sequence', 'subject', 'step'])
train_sensors_var = train_var.drop(['sequence', 'subject', 'step'])
train_sensors_min = train_min.drop(['sequence', 'subject', 'step'])
train_sensors_max = train_max.drop(['sequence', 'subject', 'step'])
train_sensors_minus_flg = train_minus_flg.drop(['sequence', 'subject', 'step'], axis=1)

train_df.describe() # Before Cleaning

In [None]:
train_quantiles = train_df.quantile([0.05, 0.95])
train_quantiles

In [None]:
def outlier_check(dataframe, means, stds, factor):
    outlier_counter = dataframe.copy()
    outlier_counter['count'] = 1
    for sensor in sensors:
        sensor_values = outlier_counter[sensor].values
        
        mean = means[sensor]
        std = stds[sensor]
        threshold_1 = mean - std * factor
        threshold_2 = mean + std * factor
        
        sensor_values = np.where((sensor_values < threshold_1) | (sensor_values > threshold_2), 1, 0)
        outlier_counter[sensor] = sensor_values
    
    outlier_counter = outlier_counter.drop(['sequence', 'subject', 'step'], axis=1)
    print(outlier_counter.sum(axis=0))
    
outlier_check(train_df, train_sensors_mean, train_sensors_std, 3)

In [None]:
train_thresholds = []

for sensor in sensors:
    sensor_values = train_df[sensor].values
    
    # Clipping on (mean ± std * 3)
    #mean = train_sensors_mean[sensor]
    #std = train_sensors_std[sensor]
    #threshold_1 = mean - std * 3
    #threshold_2 = mean + std * 3
    #train_thresholds.append((threshold_1, threshold_2))
    #sensor_values = np.where(sensor_values < threshold_1, threshold_1, sensor_values)
    #sensor_values = np.where(sensor_values > threshold_2, threshold_2, sensor_values)
    
    # Clipping on Quantile.
    #threshold_1 = train_quantiles[sensor].values[0]
    #threshold_2 = train_quantiles[sensor].values[1]
    #sensor_values = np.where(sensor_values < threshold_1, threshold_1, sensor_values)
    #sensor_values = np.where(sensor_values > threshold_2, threshold_2, sensor_values)
    
    # Min-Max Scaling
    #sensor_min = train_sensors_min[sensor]
    #sensor_max = train_sensors_max[sensor]
    #sensor_values = (sensor_values - sensor_min) / (sensor_max - sensor_min)
    
    # Logarithmic transformation_1
    #sensor_values = sensor_values - train_sensors_min[sensor]
    #sensor_values = np.log(1 + sensor_values)
    
    # Logarithmic transformation_2   
    sensor_values = np.log(1 + np.abs(sensor_values))
    sensor_values *= train_sensors_minus_flg[sensor].values
    train_df[sensor] = sensor_values
    
train_df.describe() # After Cleaning

### 1.3.2 Test Data

In [None]:
test_df.describe() # Before Cleaning

In [None]:
for sensor in sensors:
    sensor_values = test_df[sensor].values
    
    # Clipping on (mean ± std * 3)
    #mean = train_sensors_mean[sensor]
    #std = train_sensors_std[sensor]
    #threshold_1 = mean - std * 3
    #threshold_2 = mean + std * 3
    #sensor_values = np.where(sensor_values < threshold_1, threshold_1, sensor_values)
    #sensor_values = np.where(sensor_values > threshold_2, threshold_2, sensor_values)
    
    # Clipping on Quantile.
    #threshold_1 = train_quantiles[sensor].values[0]
    #threshold_2 = train_quantiles[sensor].values[1]
    #sensor_values = np.where(sensor_values < threshold_1, threshold_1, sensor_values)
    #sensor_values = np.where(sensor_values > threshold_2, threshold_2, sensor_values)
    
    # Min-Max Scaling
    #sensor_min = train_sensors_min[sensor]
    #sensor_max = train_sensors_max[sensor]
    #sensor_values = (sensor_values - sensor_min) / (sensor_max - sensor_min)
    
    # Logarithmic transformation_1
    #sensor_values = sensor_values - train_sensors_min[sensor]
    #sensor_values = np.log(1 + sensor_values)
    
    # Logarithmic transformation_2   
    # Compute the sign flag for this sensor only, instead of rebuilding a
    # sign table for the whole dataframe on every loop iteration.
    sensor_sign = np.where(sensor_values < 0, -1, 1)
    sensor_values = np.log(1 + np.abs(sensor_values)) * sensor_sign
    
    test_df[sensor] = sensor_values
    
test_df.describe() # After Cleaning

### 1.3.3 Statistics Update

In [None]:
# Statistics Update
train_mean = train_df.mean()
train_var = train_df.var()
train_std = train_df.std()
train_min = train_df.min()
train_max = train_df.max()

train_sensors_mean = train_mean.drop(['sequence', 'subject', 'step'])
train_sensors_std = train_std.drop(['sequence', 'subject', 'step'])
train_sensors_var = train_var.drop(['sequence', 'subject', 'step'])
train_sensors_min = train_min.drop(['sequence', 'subject', 'step'])
train_sensors_max = train_max.drop(['sequence', 'subject', 'step'])

## 1.4 Feature Engineering
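
The cells in this section ask tsfresh for five simple statistics per sequence and sensor. To make their meaning concrete, here is a plain-Python sketch of those statistics as I understand their tsfresh definitions; it is an illustration, not the library's implementation, so check the tsfresh documentation for the authoritative formulas.

```python
def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)

def count_above_mean(x):
    """Number of values strictly greater than the mean."""
    mean = sum(x) / len(x)
    return sum(1 for v in x if v > mean)

def count_below_mean(x):
    """Number of values strictly smaller than the mean."""
    mean = sum(x) / len(x)
    return sum(1 for v in x if v < mean)

def mean_abs_change(x):
    """Mean of the absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)

def mean_change(x):
    """Mean of the differences between consecutive values."""
    return (x[-1] - x[0]) / (len(x) - 1)

series = [1.0, 2.0, 4.0, 7.0]  # toy series standing in for one sensor of one sequence
print(abs_energy(series), count_above_mean(series), count_below_mean(series),
      mean_abs_change(series), mean_change(series))
```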

In [None]:
!pip install -q --user tsfresh

In [None]:
from tsfresh import extract_features
#extracted_features = extract_features(train_tmp, column_id="sequence", column_sort="step") # This creates too many features!

features_df = train_df.drop(['subject'], axis=1)
fc_parameters = {
    'abs_energy': None,
    'count_above_mean': None,
    'count_below_mean': None,
    'mean_abs_change': None,
    'mean_change': None,
}
extracted_features = extract_features(features_df, column_id="sequence", column_sort="step", default_fc_parameters=fc_parameters)
print(extracted_features.shape)
print(extracted_features.columns)

In [None]:
test_features_df = test_df.drop(['subject'], axis=1)
test_extracted_features = extract_features(test_features_df,
                                           column_id="sequence",
                                           column_sort="step",
                                           default_fc_parameters=fc_parameters)
print(test_extracted_features.shape)
print(test_extracted_features.columns)

## 1.5 EDA

In [None]:
sequences = [0, 1]
sensors = np.array(['sensor_00', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06',
           'sensor_07', 'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12',])
colors = ['#7A5197', '#BB5098']

figure, axes = plt.subplots(13, 2, sharex=True, figsize=(20, 16))
for i, sequence in enumerate(sequences):
    for j, sensor in enumerate(sensors):
        ax = plt.subplot(13, len(sequences), j * len(sequences) + (i + 1))
        plt.plot(range(60), train_df[train_df.sequence == sequence][sensor],
                color=colors[i])
        if j == 0: 
            if sequence==0:
                plt.title("Sequence 0: state=0");
            else:
                plt.title("Sequence 1: state=1")
        if sequence == sequences[0]: plt.ylabel(sensor)
figure.tight_layout(w_pad=0.1)
plt.suptitle('Selected Time Series', y=1.02, fontweight='bold')
plt.show()

In [None]:
aggs = {}
aggs['sequence'] = ['nunique', 'size']
gdf = train_df.groupby('subject').agg(aggs)
#print(gdf)

fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(1, 1, 1)
plt.hist(gdf['sequence']['nunique'], bins=50)
ax.set_xlabel('number of associated sequences')
#ax.set_ylabel('number of subjects')
plt.suptitle('Distribution of subjects associated with multiple sequences', y=1.02, fontweight='bold')

plt.show()

In [None]:
figure = plt.figure(figsize=(16, 8))

for sensor in range(13):
    sensor_name = f"sensor_{sensor:02d}"
    ax = figure.add_subplot(4, 4, sensor+1)
    ax.hist(train_df[f"{sensor_name}"], bins=100)
    ax.axes.yaxis.set_visible(False)
    ax.set_title(f"{sensor_name}")
plt.suptitle('Distribution of Sensor Signals', y=1.02, fontweight='bold')
figure.tight_layout()
plt.show()

### 1.5.1 EDA with [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)

"TensorFlow Data Validation (TFDV) is a library for analyzing and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TFX. TFDV includes:

- Scalable calculation of summary statistics of training and test data.
- Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of datasets (Facets).
- Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies.
- A schema viewer to help you inspect the schema.
- Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
- An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them. "

(from "[The TFX User Guide](https://www.tensorflow.org/tfx/guide)")

In [None]:
!pip install -q --user tensorflow_data_validation[visualization]

In [None]:
import tensorflow_data_validation as tfdv 
print(f'TFDV version: {tfdv.version.__version__}')

### 1.5.2 Statistics of Training data

In [None]:
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
tfdv.visualize_statistics(train_stats)

### 1.5.3 Comparing Test Data with Training Data

In [None]:
test_stats = tfdv.generate_statistics_from_dataframe(test_df)

tfdv.visualize_statistics(
    lhs_statistics=test_stats, 
    rhs_statistics=train_stats, 
    lhs_name='TEST_DATASET', 
    rhs_name='TRAIN_DATASET')

### 1.5.4 Anomaly Detection

In [None]:
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema)

In [None]:
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)
tfdv.display_anomalies(anomalies)

## 1.6 Train Validation Split
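
The split in this section is done by `subject`, so that all sequences from one participant land on the same side and the validation score reflects unseen subjects. The sketch below shows the same idea in plain Python with toy data. Note one deliberate difference: the hypothetical helper `split_subjects` shuffles the subjects with a fixed seed, whereas the notebook simply takes the first 600 subjects in order of appearance.

```python
import random

def split_subjects(subject_ids, valid_fraction=0.25, seed=42):
    """Hypothetical helper: return disjoint (train_subjects, valid_subjects) sets."""
    unique_subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_subjects)
    n_valid = max(1, int(len(unique_subjects) * valid_fraction))
    return set(unique_subjects[n_valid:]), set(unique_subjects[:n_valid])

# Toy (sequence, subject) pairs; several sequences can share one subject.
sequence_subjects = [(0, 'a'), (1, 'a'), (2, 'b'), (3, 'c'), (4, 'c'), (5, 'd')]

train_subjects_demo, valid_subjects_demo = split_subjects([s for _, s in sequence_subjects])
train_sequences = [seq for seq, subj in sequence_subjects if subj in train_subjects_demo]
valid_sequences = [seq for seq, subj in sequence_subjects if subj in valid_subjects_demo]
print(len(train_sequences), len(valid_sequences))
```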

In [None]:
unique_subjects = train_df['subject'].unique()
train_subjects = unique_subjects[:600]
valid_subjects = unique_subjects[600:]
print(len(train_subjects), len(valid_subjects))

train = train_df[train_df['subject'].isin(train_subjects)].reset_index(drop=True)
valid = train_df[train_df['subject'].isin(valid_subjects)].reset_index(drop=True)
print(len(train), len(valid))

In [None]:
train_mean = train.mean()
train_var = train.var()
train_std = train.std()
#print(train_mean, '\n', train_var, '\n', train_std)

train_sensors_mean = train_mean.drop(['sequence', 'subject', 'step'])
train_sensors_std = train_std.drop(['sequence', 'subject', 'step'])
train_sensors_var = train_var.drop(['sequence', 'subject', 'step'])
print(train_sensors_mean.shape)

train_labels = train_labels_df[train_labels_df['sequence'].isin(train['sequence'])].reset_index(drop=True)
valid_labels = train_labels_df[train_labels_df['sequence'].isin(valid['sequence'])].reset_index(drop=True)
print('train_labels: \n', train_labels['state'].value_counts(), '\n')
print('valid_labels: \n', valid_labels['state'].value_counts())

### 1.6.1 Extracted Features by tsfresh

In [None]:
tmp_df = train_df.query('step==0')
train_tmp_df = tmp_df[tmp_df['subject'].isin(train_subjects)]
train_seq_index = train_tmp_df['sequence'].values

valid_tmp_df = tmp_df[tmp_df['subject'].isin(valid_subjects)]
valid_seq_index = valid_tmp_df['sequence'].values

print(train_seq_index.shape, valid_seq_index.shape)

In [None]:
# extracted_features is indexed by sequence id, so select rows by label rather than by position.
train_extracted_features = extracted_features.loc[train_seq_index]
valid_extracted_features = extracted_features.loc[valid_seq_index]
print(train_extracted_features.shape, valid_extracted_features.shape)

extracted_features_mean = train_extracted_features.mean()
extracted_features_var = train_extracted_features.var()
print(extracted_features_mean.shape, extracted_features_var.shape)

# 2. Model Training, Prediction and Submission

## 2.1 TabTransformer Model [TensorFlow]

In [None]:
# Limit GPU Memory in TensorFlow
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    for device in physical_devices:
        tf.config.experimental.set_memory_growth(device, True)
        print('{} memory growth: {}'.format(device, tf.config.experimental.get_memory_growth(device)))
else:
    print("Not enough GPU hardware devices available")

### 2.1.1 Dataset for TabTransformer

In [None]:
sensors = np.array(['sensor_00', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_05', 'sensor_06',
           'sensor_07', 'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12',])

def dataframe_to_dataset(dataframe, extracted_features_df=None, sensors=sensors):
    seq_data = {}
    
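    # One (n_sequences, n_sensors) array per time step.
    for i in range(dataframe['step'].nunique()):
        tmp_df = dataframe.query(f'step=={i}')
        seq_data[f'x_{i}'] = tmp_df[sensors].values
    
    # Optional per-sequence features (e.g. from tsfresh), one (n_sequences, 1) array per column.
    if extracted_features_df is not None:
        seq_data.update(**{column: np.expand_dims(extracted_features_df[column].values, -1) for column in extracted_features_df.columns})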
        
    # Like the [CLS] token in BERT, I prepare an embedding for classification.
    seq_data['cls'] = np.ones((len(tmp_df), 1))
    seq_data['cls'] = seq_data['cls'].astype('int64')
        
    ds = tf.data.Dataset.from_tensor_slices(seq_data)     
    return ds


train_data_ds = dataframe_to_dataset(train, train_extracted_features)
valid_data_ds = dataframe_to_dataset(valid, valid_extracted_features)

train_labels_ds = tf.data.Dataset.from_tensor_slices(train_labels['state'])
valid_labels_ds = tf.data.Dataset.from_tensor_slices(valid_labels['state'])

train_ds = tf.data.Dataset.zip((train_data_ds, train_labels_ds))
valid_ds = tf.data.Dataset.zip((valid_data_ds, valid_labels_ds))

# Display a sample in train_ds.
print(f'length: {len(train_ds)}')
for example in train_ds.take(1):
    input_keys_list = list(example[0].keys())
    print(input_keys_list)
    print(example[1])

In [None]:
batch_size= 512

train_ds = train_ds.shuffle(buffer_size=(len(train_ds)))
train_ds = train_ds.batch(batch_size)
# After batching, prefetch counts batches, so let tf.data tune the buffer size.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

valid_ds = valid_ds.batch(batch_size)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)

# Display a sample in batched train_ds.
example = next(iter(train_ds))[0]
for key in example:
    print(f'{key}, shape:{example[key].shape}, {example[key].dtype}')

### 2.1.2 Preprocessing Model

In [None]:
preprocess_inputs = {}

# Note: (1) is just the integer 1; (1,) is the one-element tuple Keras expects as a shape.
preprocess_inputs['cls'] = tf.keras.Input(shape=(1,), dtype='int64')

sensor_names = [f'x_{i}' for i in range(60)]
for key in input_keys_list:
    if key != 'cls':
        if key in sensor_names:
            preprocess_inputs[key] = tf.keras.Input(shape=(13,), dtype='float64')
        else:
            preprocess_inputs[key] = tf.keras.Input(shape=(1,), dtype='float64')

preprocess_outputs = {}
cls_output = tf.keras.layers.IntegerLookup(
    vocabulary=tf.constant([1]), output_mode='int')(preprocess_inputs['cls'])
preprocess_outputs['cls'] = cls_output

for key in next(iter(train_ds))[0]:
    if key != 'cls':
        if key in sensor_names:
            preprocess_outputs[key] = tf.keras.layers.Normalization(
                axis=1, mean=train_sensors_mean.values, variance=train_sensors_var.values,
            )(preprocess_inputs[key]) 
        else:
            preprocess_outputs[key] = tf.keras.layers.Normalization(
                mean=extracted_features_mean[key],
                variance=extracted_features_var[key],
            )(preprocess_inputs[key]) 
    
preprocessing_model = tf.keras.Model(preprocess_inputs, 
                                     preprocess_outputs)

In [None]:
# Apply the preprocessing in tf.data.Dataset.map
train_ds = train_ds.map(lambda x, y: (preprocessing_model(x), y), 
                        num_parallel_calls=tf.data.AUTOTUNE)

valid_ds = valid_ds.map(lambda x, y: (preprocessing_model(x), y), 
                        num_parallel_calls=tf.data.AUTOTUNE)

# Display a preprocessed input sample
example = next(train_ds.take(1).as_numpy_iterator())
for key in example[0]:
    print(f'{key}, shape:{example[0][key].shape}, {example[0][key].dtype}')

### 2.1.3 [Tab Transformer Model](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/structured_data/ipynb/tabtransformer.ipynb) (Customized)

The TabTransformer architecture works as follows:

- All the categorical features are encoded as embeddings, using the same embedding_dims. This means that each value in each categorical feature will have its own embedding vector.

- A column embedding, one embedding vector for each categorical feature, is added (point-wise) to the categorical feature embedding.

- The embedded categorical features are fed into a stack of Transformer blocks. Each Transformer block consists of a multi-head self-attention layer followed by a feed-forward layer.

- The outputs of the final Transformer layer, which are the contextual embeddings of the categorical features, are concatenated with the input numerical features, and fed into a final MLP block.

<img src="https://raw.githubusercontent.com/keras-team/keras-io/master/examples/structured_data/img/tabtransformer/tabtransformer.png" width="500"/>
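
To make the Transformer-block step above more concrete, here is a deliberately stripped-down sketch of self-attention in plain Python. It is a single head with identity projections (queries, keys, and values are the tokens themselves) followed by a skip connection; the learned projections, multiple heads, dropout, layer normalization, and feed-forward layer of a real block are all omitted. The names `softmax` and `self_attention` are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy embeddings of size 2
attended, weights = self_attention(tokens)
# Skip connection: add the attention output back onto the input embeddings.
residual = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
print(residual)
```

Each row of `weights` sums to 1, so every contextual embedding is a convex combination of the input embeddings before the skip connection is added.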

---

#### **In this notebook, I customized TabTransformer in the following ways.**

- I treated the values of the thirteen sensors at one time step as one embedding feature.
- Thus, the sensor-value embeddings at the 60 time steps serve as the categorical features of a sequence.
- I added a 'cls' embedding to the categorical features and used it for the classification task after the transformer blocks (an idea borrowed from the [CLS] token in BERT).
- I treated the features extracted by tsfresh from the time-ordered data as the numerical (continuous) features.
- I concatenated the 'cls' embedding, the other 60 contextual features, and the normalized numerical features, and fed them into a final MLP block.

---
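
As a sanity check on the customization above, the feature widths can be worked out by hand. The short calculation below uses the layer sizes that appear in the model cells of this notebook (`embedding_dim = 64`, `Dense(8)` and `Dense(64)` on the step tokens, MLP factors `[2, 1]`). It assumes that tsfresh returns one column per sensor and feature, which would give 13 x 5 numerical features.

```python
n_steps = 60                 # time steps per sequence
n_sensors = 13               # sensor readings per step
embedding_dim = 64           # width of each token embedding
n_tsfresh_features = 5       # statistics requested per sensor (assumed one column each)
mlp_hidden_units_factors = [2, 1]

n_tokens = 1 + n_steps                      # 'cls' token plus one token per step
cls_dim = embedding_dim                     # contextual 'cls' embedding
flattened_dim = n_steps * 8                 # step tokens after Dense(8) and Flatten
step_summary_dim = 64                       # step tokens after the following Dense(64)
numerical_dim = n_sensors * n_tsfresh_features

final_dim = cls_dim + step_summary_dim + numerical_dim
mlp_hidden_units = [factor * final_dim for factor in mlp_hidden_units_factors]

print(n_tokens, flattened_dim, numerical_dim, final_dim, mlp_hidden_units)
```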

In [None]:
embedding_dim = 64

model_inputs = {}

model_inputs['cls'] = tf.keras.Input(shape=(), dtype='int64')  

sensor_names = [f'x_{i}' for i in range(60)]
for key in input_keys_list:
    if key != 'cls':
        if key in sensor_names:
            model_inputs[key] = tf.keras.Input(shape=(13,), dtype='float64')
        else:
            model_inputs[key] = tf.keras.Input(shape=(1,), dtype='float64')
            

cls_embedding = tf.keras.layers.Embedding(2, embedding_dim)
cls_features = cls_embedding(model_inputs['cls'])
cls_features = tf.expand_dims(cls_features, axis=1)

first_dense = tf.keras.Sequential([
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.BatchNormalization()])

sensor_signals = []
for key in model_inputs:
    if key in sensor_names:
        sensor_signals.append(model_inputs[key])
sensor_signals = tf.stack(sensor_signals, axis=1)
sensor_signals = first_dense(sensor_signals)

input_features = tf.concat([cls_features, sensor_signals], axis=1)

In [None]:
# Add column embedding to categorical feature embeddings.
num_columns = input_features.shape[1]
column_embedding = tf.keras.layers.Embedding(
    input_dim=num_columns, output_dim=embedding_dim)
column_indices = tf.range(start=0, limit=num_columns, delta=1)
encoded_features = input_features + column_embedding(column_indices)

In [None]:
# Create TabTransformer Model.
num_transformer_blocks = 8
num_heads = 4
dropout_rate = 0.2
mlp_hidden_units_factors= [2, 1] 

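def create_mlp(hidden_units, dropout_rate, activation, normalization_layer, name=None):
    mlp_layers = []
    for units in hidden_units:
        # Build a fresh normalization layer per block from the template's config,
        # so that blocks with different widths do not share one layer instance.
        norm_config = normalization_layer.get_config()
        norm_config.pop('name', None)
        mlp_layers.append(type(normalization_layer).from_config(norm_config))
        mlp_layers.append(tf.keras.layers.Dense(
            units, activation=activation))
        mlp_layers.append(tf.keras.layers.Dropout(dropout_rate))
    return tf.keras.Sequential(mlp_layers, name=name)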

# Create multiple layers of the Transformer block.
for block_idx in range(num_transformer_blocks):
    # Create a multi-head attention layer.
    attention_output = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads,
        key_dim=embedding_dim, 
        dropout=dropout_rate, 
        name=f'multi-head_attention_{block_idx}'
    )(encoded_features, encoded_features)
    # Skip connection 1.
    x = tf.keras.layers.Add(
        name=f'skip_connection1_{block_idx}'
    )([attention_output, encoded_features])
    # Layer normalization 1.
    x = tf.keras.layers.LayerNormalization(
        name=f'layer_norm1_{block_idx}', epsilon=1e-6
    )(x)
    # Feedforward.
    feedforward_output =  tf.keras.Sequential([
                        tf.keras.layers.Dense(embedding_dim, activation=keras.activations.gelu),
                        tf.keras.layers.Dropout(dropout_rate)
                        ], name=f'feedforward_{block_idx}'
    )(x)
    # Skip connection 2.
    x = tf.keras.layers.Add(
        name=f'skip_connection2_{block_idx}'
    )([feedforward_output, x])
    # Layer normalization 2.
    encoded_features = tf.keras.layers.LayerNormalization(
        name=f'layer_norm2_{block_idx}', epsilon=1e-6
    )(x)

# Numerical features
numerical_feature_list = []
for numerical_feature_name in train_extracted_features.columns:
    numerical_feature_list.append(model_inputs[numerical_feature_name])
numerical_features = tf.keras.layers.concatenate(numerical_feature_list)
# Apply layer normalization to the numerical features.
numerical_features = tf.keras.layers.LayerNormalization(epsilon=1e-6)(numerical_features)


features_1 = encoded_features[:, 0, :]

features_2 = encoded_features[:, 1:, :]
features_2 = tf.keras.layers.Dense(8, activation='relu')(features_2)
features_2 = tf.keras.layers.BatchNormalization()(features_2) 
features_2 = tf.keras.layers.Flatten()(features_2)
features_2 = tf.keras.layers.Dense(64, activation='relu')(features_2)
features_2 = tf.keras.layers.BatchNormalization()(features_2) 

# Prepare the input for the final MLP block.
features = tf.keras.layers.concatenate([features_1,
                                        features_2,
                                        numerical_features], axis=-1)

# Compute MLP hidden_units.
mlp_hidden_units = [
    factor * features.shape[-1] for factor in mlp_hidden_units_factors
]
# Create final MLP.
features = create_mlp(
    hidden_units=mlp_hidden_units, 
    dropout_rate=dropout_rate,
    activation=tf.keras.activations.selu,
    normalization_layer=tf.keras.layers.BatchNormalization(),
    name='MLP',
)(features)

model_outputs = tf.keras.layers.Dense(
    units=1, activation='sigmoid', name='sigmoid'
)(features)

training_tt_model = keras.Model(inputs=model_inputs, outputs=model_outputs)

In [None]:
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.0001

optimizer = tfa.optimizers.AdamW(
        learning_rate=LEARNING_RATE, weight_decay=WEIGHT_DECAY)

training_tt_model.compile(optimizer=optimizer,
                       loss=tf.keras.losses.BinaryCrossentropy(),
                       metrics=["accuracy"])

training_tt_model.summary()

In [None]:
#tf.keras.utils.plot_model(training_tt_model, show_shapes=True, rankdir="LR")
#tf.keras.utils.plot_model(training_tt_model, show_shapes=True)

### 2.1.4 [Learning Rate Finder](https://sachinruk.github.io/blog/tensorflow/learning%20rate/2021/02/15/Tensorflow-Learning-Rate-Finder.html)
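
The callback in this section raises the learning rate geometrically after every batch: with `step_up = (max_lr / min_lr) ** (1 / n_rounds)`, the rate after `k` batches is `min_lr * step_up ** k`, so it reaches `max_lr` after `n_rounds` batches. A small plain-Python sketch of that schedule (the function name `lr_sweep` is illustrative):

```python
def lr_sweep(min_lr, max_lr, n_rounds):
    """Learning rates visited by a geometric sweep from min_lr to max_lr."""
    step_up = (max_lr / min_lr) ** (1 / n_rounds)
    return [min_lr * step_up ** k for k in range(n_rounds + 1)]

lrs = lr_sweep(1e-7, 5e-2, 100)
print(lrs[0], lrs[50], lrs[-1])
```

Because the steps are multiplicative, the sweep spends equal numbers of batches in each decade of learning rates, which is why the loss is plotted against a logarithmic x-axis.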

In [None]:
class LRFind(tf.keras.callbacks.Callback):
    def __init__(self, min_lr, max_lr, n_rounds):
        super().__init__()  # initialize the base Callback state
        self.min_lr = min_lr 
        self.max_lr = max_lr 
        self.step_up = (max_lr / min_lr) ** (1 / n_rounds)
        self.lrs = []
        self.losses = []

    def on_train_begin(self, logs=None):
        self.weights= self.model.get_weights()
        self.model.optimizer.lr = self.min_lr 

    def on_train_batch_end(self, batch, logs=None):
        self.lrs.append(self.model.optimizer.lr.numpy())
        self.losses.append(logs['loss'])
        self.model.optimizer.lr = self.model.optimizer.lr * self.step_up 
        if self.model.optimizer.lr > self.max_lr:
            self.model.stop_training = True 

    def on_train_end(self, logs=None):
        self.model.set_weights(self.weights)

lr_find_epochs = 1
lr_finder_steps = 100
lr_find = LRFind(1e-7, 5e-2, lr_finder_steps)

In [None]:
lr_find_batch_size = 256
lr_find_sequence_n = lr_find_batch_size * lr_finder_steps
lr_find_sample_n = 60 * lr_find_sequence_n

lr_find_data_ds = dataframe_to_dataset(train[:lr_find_sample_n], train_extracted_features[:lr_find_sequence_n])
# Labels are stored per sequence, so slice by the number of sequences, not the number of rows.
lr_find_labels_ds = tf.data.Dataset.from_tensor_slices(train_labels['state'][:lr_find_sequence_n])
lr_find_ds = tf.data.Dataset.zip((lr_find_data_ds, lr_find_labels_ds))
lr_find_ds = lr_find_ds.batch(batch_size=lr_find_batch_size)
lr_find_ds = lr_find_ds.prefetch(lr_find_batch_size)

lr_find_ds = lr_find_ds.map(lambda x, y: (preprocessing_model(x), y), 
                            num_parallel_calls=tf.data.AUTOTUNE)

training_tt_model.fit(lr_find_ds,
                   steps_per_epoch=lr_finder_steps,
                   epochs=lr_find_epochs,
                   callbacks=[lr_find])

plt.plot(lr_find.lrs, lr_find.losses)
plt.xscale('log')
plt.show()

### 2.1.5 Model Training
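
Training in this section uses a cosine learning-rate decay. As I understand the Keras `CosineDecay` schedule, the rate at step `t` is `initial_lr * ((1 - alpha) * 0.5 * (1 + cos(pi * t / decay_steps)) + alpha)`, with `t` capped at `decay_steps`. The plain-Python sketch below reproduces that formula for illustration; the step count is a made-up number, not the one used in the notebook.

```python
import math

def cosine_decay(step, initial_lr, decay_steps, alpha=0.0):
    """Learning rate at a given step under cosine decay."""
    step = min(step, decay_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1 - alpha) * cosine + alpha)

total_steps = 1000  # hypothetical: epochs * optimizer steps per epoch
for step in [0, 250, 500, 750, 1000]:
    print(step, cosine_decay(step, initial_lr=5e-4, decay_steps=total_steps))
```

The schedule only reaches its floor at the end of training if `decay_steps` matches the real number of optimizer steps, that is, epochs times batches per epoch.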

In [None]:
# Re-construct the model
model_config = training_tt_model.get_config()
training_tt_model = tf.keras.Model.from_config(model_config)

epochs = 15
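# `train` holds 60 rows per sequence, so count sequences (one label each) to get the batches per epoch.
steps_per_epoch = math.ceil(len(train_labels) / batch_size)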

learning_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=5e-4,
    decay_steps=epochs*steps_per_epoch,
    alpha=0.0)
weight_decay = 0.0001

optimizer = tfa.optimizers.AdamW(
        learning_rate=learning_schedule,
        weight_decay=weight_decay)

training_tt_model.compile(optimizer=optimizer,
                       loss=tf.keras.losses.BinaryCrossentropy(),
                       metrics=["accuracy"])

checkpoint_filepath = '/tmp/tt/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath, 
    save_weights_only=True,
    monitor='val_accuracy', 
    mode='max', 
    save_best_only=True)

In [None]:
training_tt_model.fit(train_ds, epochs=epochs, shuffle=True,
                   validation_data=valid_ds, 
                   callbacks=[model_checkpoint_callback])

training_tt_model.load_weights(checkpoint_filepath)

### 2.1.6 Prediction and Submission

In [None]:
# At inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs.
tt_inputs = preprocessing_model.input 
tt_outputs = training_tt_model(preprocessing_model(tt_inputs))
inference_tt_model = tf.keras.Model(tt_inputs, tt_outputs)

In [None]:
test_tt_ds = dataframe_to_dataset(test_df, test_extracted_features)
test_tt_ds = test_tt_ds.batch(batch_size=batch_size,
                        drop_remainder=False)
test_tt_ds = test_tt_ds.prefetch(buffer_size=batch_size)

tt_pred = inference_tt_model.predict(test_tt_ds)
tt_pred = np.squeeze(tt_pred)
#tt_pred = np.where(tt_pred <  0.5, 0, 1)
submission_df['state'] = tt_pred
submission_df.to_csv('tt_submission.csv', index=False)
submission_df.head()

In [None]:
# Free the GPU memory held by TensorFlow before switching to PyTorch
tf.keras.backend.clear_session()
gc.collect()

from numba import cuda
cuda.select_device(0)
cuda.close()

## 2.2 RNN Model [PyTorch Lightning]

In [None]:
import torch
import pytorch_lightning as pl
import torchmetrics

!pip install torchinfo -q --user
from torchinfo import summary

print(pl.__version__)

### 2.2.1 Dataset for RNN

In [None]:
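class TPSApr22RNNDataset(torch.utils.data.Dataset):
    """Serves one 60-step sensor sequence per item, standardized per sensor."""
    
    def __init__(self, dataframe, labels=None, 
                 mean_df=train_sensors_mean, std_df=train_sensors_std, 
                 eps=1e-7):
        # Keep only the sensor columns; rows must stay ordered by sequence and step.
        self.df = dataframe.drop(['sequence', 'subject', 'step'], axis=1)
        self.data = self.df.values
        
        # Labels hold one 'state' per sequence; the test set has none.
        if labels is not None:
            self.labels = labels['state'].values
        else:
            self.labels = None 
            
        # Training-set statistics are reused for the validation and test data.
        self.means = mean_df.values
        self.stds = std_df.values
        self.eps = np.array(eps)
        
    def __len__(self):
        # Each sequence spans 60 consecutive rows.
        return len(self.data) // 60
    
    def __getitem__(self, idx):
        tmp_data = self.data[idx*60: (idx+1)*60]
        # Standardize each sensor; eps guards against division by zero.
        tmp_data = (tmp_data - self.means) / (self.stds + self.eps)
        seq_data = {'x': torch.tensor(tmp_data, dtype=torch.float32)}
        
        if self.labels is not None:
            label = self.labels[idx]
            label = torch.tensor(label, dtype=torch.float32)
            # Shape (1,) so that a batch of labels matches the (batch, 1) logits.
            seq_data['labels'] = torch.unsqueeze(label, -1)
            
        return seq_data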
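
The slicing in `__getitem__` relies on every sequence occupying 60 consecutive rows. Under that assumption the same windows can be obtained with a single reshape, as this small NumPy sketch shows. The random stand-in data and all names in it are illustrative, not taken from the competition files.

```python
import numpy as np

SEQ_LEN, N_SENSORS, N_SEQ = 60, 13, 4

# Stand-in for the sensor matrix: N_SEQ sequences stacked row-wise.
rng = np.random.default_rng(0)
data = rng.normal(size=(N_SEQ * SEQ_LEN, N_SENSORS))
means = data.mean(axis=0)   # one mean per sensor
stds = data.std(axis=0)     # one standard deviation per sensor
eps = 1e-7

def get_sequence(idx):
    """Mirror of the Dataset's __getitem__ slicing and standardization."""
    window = data[idx * SEQ_LEN:(idx + 1) * SEQ_LEN]
    return (window - means) / (stds + eps)

# Standardize everything once, then view it as (sequence, step, sensor).
all_sequences = ((data - means) / (stds + eps)).reshape(N_SEQ, SEQ_LEN, N_SENSORS)

print(all_sequences.shape)  # (4, 60, 13)
print(np.allclose(all_sequences[2], get_sequence(2)))
```

If the rows were not sorted by sequence and step, both approaches would silently mix steps from different sequences, so the ordering is worth checking before training.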

In [None]:
batch_size = 512

train_rnn_ds = TPSApr22RNNDataset(train, train_labels)
valid_rnn_ds = TPSApr22RNNDataset(valid, valid_labels)
test_rnn_ds = TPSApr22RNNDataset(test_df)

train_rnn_dl = torch.utils.data.DataLoader(train_rnn_ds, 
                                           batch_size=batch_size, 
                                           shuffle=True)
valid_rnn_dl = torch.utils.data.DataLoader(valid_rnn_ds, 
                                           batch_size=batch_size, 
                                           shuffle=False)
test_rnn_dl = torch.utils.data.DataLoader(test_rnn_ds, 
                                          batch_size=batch_size, 
                                          shuffle=False, 
                                          drop_last=False)

print('------ train_rnn_dl ------')
# Fetch a single batch to check the tensor shapes.
batch_sample = next(iter(train_rnn_dl))
print(f"x : {batch_sample['x'].shape}")
print(f"labels: {batch_sample['labels'].shape}")
print(f"n_samples: {len(train_rnn_ds)}")
print(f"n_batches: {len(train_rnn_dl)}")
print()

print('------ test_rnn_dl ------')
batch_sample = next(iter(test_rnn_dl))
print(f"x : {batch_sample['x'].shape}")
print(f"n_samples: {len(test_rnn_ds)}")
print(f"n_batches: {len(test_rnn_dl)}")

In [None]:
class TPSApr22RNNDataModule(pl.LightningDataModule):
    
    def __init__(self, train, train_labels, 
                 valid, valid_labels, test, 
                 batch_size=512):
        super().__init__()
        self.train = train
        self.train_labels = train_labels
        self.valid = valid
        self.valid_labels = valid_labels
        self.test = test
        self.batch_size = batch_size
        
    def prepare_data(self):
        pass
    
    def setup(self, stage=None):
        # The Trainer passes `stage` ('fit', 'validate', 'test' or 'predict').
        if stage == 'fit' or stage is None:
            self.train_set = TPSApr22RNNDataset(self.train, 
                                                self.train_labels)
            self.valid_set = TPSApr22RNNDataset(self.valid, 
                                                self.valid_labels)
            self.n_train_samples = len(self.train_set)
            self.steps_per_epoch = self.n_train_samples // self.batch_size
            
        if stage == 'predict' or stage is None:
            self.test_set = TPSApr22RNNDataset(self.test)
            
    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.train_set, 
                                           batch_size=self.batch_size, 
                                           shuffle=True)
    
    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.valid_set, 
                                           batch_size=self.batch_size, 
                                           shuffle=False)
    
    def predict_dataloader(self):
        return torch.utils.data.DataLoader(self.test_set, 
                                           batch_size=self.batch_size, 
                                           shuffle=False, 
                                           drop_last=False)
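
`setup` computes `steps_per_epoch` with floor division, while a `DataLoader` without `drop_last=True` also yields the final partial batch. The sketch below, using made-up sample counts, shows when the two numbers differ; this matters for any learning-rate schedule that is defined over an exact number of steps. The helper `batches_per_epoch` is hypothetical.

```python
import math

def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        # The incomplete final batch is discarded.
        return n_samples // batch_size
    # The incomplete final batch is kept.
    return math.ceil(n_samples / batch_size)

batch_size = 512
for n_samples in (1024, 1025, 20000):  # illustrative sizes
    floor_steps = n_samples // batch_size
    actual_steps = batches_per_epoch(n_samples, batch_size)
    print(f"n={n_samples}: floor={floor_steps}, actual={actual_steps}")
```

The two values agree only when the number of samples is an exact multiple of the batch size.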

### 2.2.2 [RNN Model (LSTM)](https://www.kaggle.com/code/kartushovdanil/top-1-tps-apr-22-eda-lstm)

In [None]:
class TPSApr22RNN_pl(pl.LightningModule):
    
    def __init__(self, lr, steps_per_epoch):
        super().__init__()
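        # Store lr and steps_per_epoch in the checkpoint so that
        # load_from_checkpoint can rebuild the model without extra arguments.
        self.save_hyperparameters()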
        self.lr = lr
        self.steps_per_epoch = steps_per_epoch
        
        # Layers for the model
        self.lstm1 = torch.nn.LSTM(13, 512, bidirectional=True, 
                                   batch_first=True)
        self.lstm2 = torch.nn.LSTM(1024, 256, bidirectional=True, 
                                   batch_first=True)
        self.gru = torch.nn.GRU(1024, 256, bidirectional=True, 
                                 batch_first=True)
        self.lstm3 = torch.nn.LSTM(1024, 128, bidirectional=True, 
                                   batch_first=True)
        self.dense1 = torch.nn.Linear(256, 128)
        self.dense2 = torch.nn.Linear(128, 1)
        
        # Metrics
        self.train_acc = torchmetrics.Accuracy()
        self.valid_acc = torchmetrics.Accuracy()
        
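    def forward(self, x):
        # x: (batch, 60, 13)
        x, _ = self.lstm1(x)             # (batch, 60, 1024)
        y, _ = self.lstm2(x)             # (batch, 60, 512)
        z, _ = self.gru(x)               # (batch, 60, 512)
        c = torch.cat((y, z), dim=2)     # (batch, 60, 1024)
        x1, _ = self.lstm3(c)            # (batch, 60, 256)
        x2, _ = torch.max(x1, dim=1)     # max over time steps: (batch, 256)
        x3 = self.dense1(x2)             # (batch, 128)
        x4 = torch.nn.functional.selu(x3)
        output = self.dense2(x4)         # logits: (batch, 1)
        return output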
    
    def training_step(self, batch, batch_idx):
        x = batch['x']
        labels = batch['labels']
        
        logits = self(x)
        loss = torch.nn.BCEWithLogitsLoss()(logits, labels)
        preds = (logits > 0).int()
        
        #self.log('train_loss', loss)
        results = {'loss': loss, 'logits': logits, 
                   'preds': preds, 'labels': labels}
        return results
    
    def training_epoch_end(self, train_step_outputs):
        # This function is called after 'validation_epoch_end'
        # It receives the list of returns of all training_steps.
        preds = torch.cat([items['preds'] for items in train_step_outputs], dim=0)
        labels = torch.cat([items['labels'] for items in train_step_outputs], dim=0)
        losses = torch.tensor([items['loss'] for items in train_step_outputs])
        
        #num_correct = (preds == labels).sum()
        #acc = (num_correct / preds.size(0)).item()
        acc = self.train_acc(preds, labels.int())
        
        epoch_loss = losses.sum() / len(train_step_outputs)
        
        self.log('train_loss', epoch_loss)
        self.log('train_acc', acc)
        
        print(f"train loss: {epoch_loss:.4f}, train acc: {acc:.4f}")
        
    def validation_step(self, batch, batch_idx):
        x = batch['x']
        labels = batch['labels']
        
        logits = self(x)
        loss = torch.nn.BCEWithLogitsLoss()(logits, labels)
        preds = (logits > 0).int()
        
        #self.log('val_loss', loss)
        results = {'loss': loss, 'logits': logits, 
                   'preds': preds, 'labels': labels}
        return results
    
    def validation_epoch_end(self, val_step_outputs):
        preds = torch.cat([items['preds'] for items in val_step_outputs], dim=0)
        labels = torch.cat([items['labels'] for items in val_step_outputs], dim=0)
        losses = torch.tensor([items['loss'] for items in val_step_outputs])
        
        #num_correct = (preds == labels).sum()
        #acc = (num_correct / preds.size(0)).item()
        acc = self.valid_acc(preds, labels.int())
        
        epoch_loss = losses.sum() / len(val_step_outputs)
        
        self.log('val_loss', epoch_loss)
        self.log('val_acc', acc)
        
        print(f"------ Epoch{self.current_epoch+1} ------")
        print(f"valid loss: {epoch_loss:.4f}, valid acc: {acc:.4f}")
        
    def predict_step(self, batch, batch_idx):
        x = batch['x']
        
        logits = self(x)
        probs = torch.sigmoid(logits)
        preds = (logits > 0).int()
        
        results = {'logits': logits, 'probs': probs, 'preds': preds}
        return results
    
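    def configure_optimizers(self):
        #return torch.optim.Adam(self.parameters(), lr=self.lr)
        
        optimizer = torch.optim.AdamW(params=self.parameters(), 
                                      lr=self.lr)
        # The train DataLoader keeps the last partial batch, so an epoch can run
        # one step more than n_samples // batch_size; the +1 keeps the schedule
        # long enough for every optimizer step.
        scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, 
                                                        max_lr=self.lr, 
                                                        pct_start=0.1, 
                                                        steps_per_epoch=self.steps_per_epoch + 1, 
                                                        epochs=self.trainer.max_epochs)
        # OneCycleLR is meant to be stepped after every batch; Lightning steps
        # schedulers once per epoch unless interval='step' is set.
        return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]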
        

In [None]:
batch_size = 512
dm = TPSApr22RNNDataModule(train, train_labels, valid, valid_labels, 
                           test_df, batch_size=batch_size)
dm.prepare_data()
dm.setup('fit')

model = TPSApr22RNN_pl(lr=1e-3, steps_per_epoch=dm.steps_per_epoch)

checkpoint = pl.callbacks.ModelCheckpoint(
    monitor='val_loss', 
    mode='min', 
    save_top_k=1, 
    save_weights_only=True, 
    dirpath='/tmp/rnn/checkpoint',)

trainer = pl.Trainer(
    gpus=1, 
    max_epochs=15,  
    callbacks=[checkpoint])

summary(
    model, 
    input_size=(batch_size, 60, 13), 
    col_names=['output_size', 'num_params'],)
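
The feature sizes that `summary` reports can be checked by hand: a bidirectional recurrent layer outputs twice its hidden size, and concatenating two branches adds their widths. A minimal sketch of that bookkeeping follows. The helper `rnn_out` is hypothetical; the layer sizes are copied from `TPSApr22RNN_pl`.

```python
def rnn_out(hidden_size, bidirectional=True):
    """Feature size produced by an LSTM/GRU layer."""
    return hidden_size * (2 if bidirectional else 1)

n_sensors = 13                 # input features per time step
lstm1 = rnn_out(512)           # output of self.lstm1
lstm2 = rnn_out(256)           # output of self.lstm2 (fed by lstm1)
gru = rnn_out(256)             # output of self.gru (also fed by lstm1)
concat = lstm2 + gru           # torch.cat along the feature axis
lstm3 = rnn_out(128)           # output of self.lstm3

# Each layer's declared input size must equal what the previous stage emits.
assert lstm1 == 1024           # input size of lstm2 and gru
assert concat == 1024          # input size of lstm3
assert lstm3 == 256            # input size of dense1

print(f"{n_sensors} -> {lstm1} -> ({lstm2} + {gru}) -> {lstm3} -> 128 -> 1")
```

The max over the time axis leaves the feature size unchanged, which is why `dense1` takes 256 inputs.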

### 2.2.3 Model Training

In [None]:
trainer.fit(model, dm)

### 2.2.4 Prediction and Submission

In [None]:
# Reload the best checkpoint (lowest val_loss) before predicting.
model = TPSApr22RNN_pl.load_from_checkpoint(checkpoint.best_model_path)

# trainer.predict returns one dict per batch.
rnn_outputs = trainer.predict(model, dm)
print(len(rnn_outputs))

rnn_probs = torch.cat([items['probs'] for items in rnn_outputs], dim=0).numpy()
rnn_preds = torch.cat([items['preds'] for items in rnn_outputs], dim=0).numpy()

# Submit the probabilities rather than the thresholded predictions.
rnn_probs = np.squeeze(rnn_probs)
submission_df['state'] = rnn_probs
submission_df.to_csv('rnn_submission_1.csv', index=False)
submission_df.head()
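
`predict_step` returns both probabilities and hard 0/1 predictions. Thresholding the logit at 0 is equivalent to thresholding the sigmoid probability at 0.5, which the following standalone check illustrates on a few arbitrary values (the logits are made up for the example).

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [-3.0, -0.1, 0.0, 0.2, 4.0]   # arbitrary example values
probs = [sigmoid(z) for z in logits]

# Two ways to obtain hard predictions.
preds_from_logits = [int(z > 0) for z in logits]
preds_from_probs = [int(p > 0.5) for p in probs]

for z, p, pred in zip(logits, probs, preds_from_logits):
    print(f"logit={z:+.1f}  prob={p:.3f}  pred={pred}")
print(preds_from_logits == preds_from_probs)
```

The probabilities carry the ranking information that the hard predictions discard, which is useful whenever a submission is scored on ranking quality rather than on exact labels.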