# <span style="color:#7A5197;">***TPS APR 22***</span>


### <span style="color:#7A5197;">*Table of contents*</span>
<a id="table-of-contents"></a>
- [1. Introduction](#1)
    - [1.1 Storytelling](#1.1)
    - [1.2 Evaluation](#1.2)
- [2. Preparations](#2)
- [3. Dataset Overview](#3)
    - [3.1 Train Dataset](#3.1)
        - [3.1.1 Quick view](#3.1.1)
        - [3.1.2 Data types](#3.1.2)
        - [3.1.3 Basic Statistics](#3.1.3)
        - [3.1.4 Target Column](#3.1.4)
    - [3.2 Test Dataset](#3.2)
        - [3.2.1 Quick view](#3.2.1)
        - [3.2.2 Data types](#3.2.2)
        - [3.2.3 Basic Statistics](#3.2.3)
    - [3.3 Submission](#3.3)
- [4. Exploratory Data Analysis](#4)
    - [4.1 Target](#4.1)
    - [4.2 Sequence distribution by subject](#4.2)
    - [4.3 Features distribution](#4.3)
    - [4.4 Selected Time Series](#4.4)
    - [4.5 Profile per step with Quantiles](#4.5)
    - [4.6 Profile per Subject with Quantiles](#4.6)
    - [4.7 Scatter plot features](#4.7)
    - [4.8 Correlations](#4.8)
- [5. LSTM](#5)
- [6. References](#6)


[back to top](#table-of-contents)
<a id="1"></a>
# **<span style="color:#7A5197;">1. Introduction</span>**
<a id="1.1"></a>
## **<span style="color:#7A5197;">1.1 Storytelling</span>**

Welcome to the April edition of the 2022 Tabular Playground Series! This month's challenge is a time series classification problem.

You've been provided with thousands of sixty-second sequences of biological sensor data recorded from several hundred participants who could have been in either of two possible activity states. Can you determine what state a participant was in from the sensor data?

<a id="1.2"></a>
## **<span style="color:#7A5197;">1.2 Evaluation</span>**

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
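
As a rough illustration of the metric, AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher predicted probability. The helper `pairwise_auc` and the toy values below are hypothetical and only meant as a sketch; the modeling section of this notebook uses `sklearn.metrics.roc_auc_score`, which computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_prob)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly

# Only the ranking matters: a monotonic transform of the scores gives the same AUC.
auc_squared = pairwise_auc(y_true, [p ** 2 for p in y_prob])
print(auc_squared)  # 0.75
```

Because only the ordering of the predictions matters, the submitted probabilities do not need to be calibrated to score well on this metric.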

[back to top](#table-of-contents)
<a id="2"></a>
# **<span style="color:#7A5197;">2. Preparations</span>**

This section loads the packages and data used in the analysis. The packages are mainly for data manipulation, data visualization and modeling. Two datasets are used: train and test. The train dataset is used to fit models, which are then used to predict the test dataset. The sample submission file shows participants the expected submission format for the competition.


In [None]:
%reset -sf

# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

# setting up options
import warnings
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
from cycler import cycler
from IPython.display import HTML


!pip install bloxs
from bloxs import B

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'



# read datasets
train_df = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
test_df = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
ssub = pd.read_csv('../input/tabular-playground-series-apr-2022/sample_submission.csv')

# sensor columns only (this dataset has no 'PassengerId' column)
features = [col for col in test_df.columns if col.startswith('sensor_')]

def multi_table(table_list):
    return HTML(
        f"<table><tr> {''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list])} </tr></table>")

def formatter(v):
    if type(v) is str:
        return v
    if pd.isna(v) or v <= 0:
        return ''
    if v == int(v):
        return f'{v:.0f}'
    return f'{v:.1f}'

from numpy import float64, float32, int64, int32, dtype

f'From {train_df.memory_usage().sum() / 1000000:,.2f}Mbs...'

def reduce_mem(df):
    """Return a copy of df with 64-bit numeric columns downcast to 32-bit."""
    df = df.copy()

    for col in df:
        if df[col].dtype == dtype(int64):
            df[col] = df[col].astype(int32)
        elif df[col].dtype == dtype(float64):
            df[col] = df[col].astype(float32)
    return df

train_df = reduce_mem(train_df)
test_df = reduce_mem(test_df)
        
f'...to just {train_df.memory_usage().sum() / 1000000:,.2f}Mbs. Nice!'

[back to top](#table-of-contents)
<a id="3"></a>
# **<span style="color:#7A5197;">3. Dataset Overview</span>**

The intent of this overview is to get a feel for the data and its structure in the train, test and submission files. The overview of the train and test datasets includes a quick check for missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

|Variable|Definition|
|------|---|
| sequence | a unique id for each sequence |
| subject |  a unique id for the subject in the experiment |
| step | time step of the recording, in one second intervals |
| sensor_00 - sensor_12 | the value for each of the thirteen sensors at that time step |
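
Since every sequence has the same number of steps, the long table described above can be viewed as a 3-D array of shape (sequences, steps, sensors). The sketch below uses a made-up toy frame with 2 sequences, 3 steps and 2 sensors (the real data has 60 steps and 13 sensors); the names `toy` and `cube` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy long-format frame: one row per (sequence, step), one column per sensor.
n_seq, n_steps = 2, 3
toy = pd.DataFrame({
    "sequence": np.repeat(np.arange(n_seq), n_steps),
    "subject": np.repeat([7, 9], n_steps),
    "step": np.tile(np.arange(n_steps), n_seq),
    "sensor_00": np.arange(6, dtype=float),
    "sensor_01": np.arange(6, dtype=float) * 10,
})

# Sort so that the rows of each sequence are contiguous and ordered by step,
# then fold the sensor columns into (sequences, steps, sensors).
toy = toy.sort_values(["sequence", "step"])
sensor_cols = [c for c in toy.columns if c.startswith("sensor_")]
cube = toy[sensor_cols].to_numpy().reshape(n_seq, n_steps, len(sensor_cols))

print(cube.shape)  # (2, 3, 2)
print(cube[1, 0])  # sensors of sequence 1 at step 0 -> [ 3. 30.]
```

The same kind of reshape, with 60 steps per sequence, is what the modeling section relies on before feeding the data to a recurrent network.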

[back to top](#table-of-contents)
<a id="3.1"></a>
### **<span style="color:#7A5197;">3.1 Train Dataset</span>**

As stated before, the train dataset is mainly used to train the predictive model, since the target variable is available for it. It is also used to explore the data itself, including the relationship between each predictor and the target variable.


[back to top](#table-of-contents)
<a id="3.1.1"></a>
#### **<span style="color:#7A5197;">3.1.1 Quick view</span>**


In [None]:

B(train_df['subject'].nunique(), 'Subjects')
B(train_df['sequence'].nunique(), 'Sequences')
B(train_df['step'].nunique(), 'Steps')

train_df.head()

In [None]:
print(f'Number of rows: {train_df.shape[0]};  Number of columns: {train_df.shape[1]}; No of missing values: {sum(train_df.isna().sum())}')

train_df.isna().sum()

[back to top](#table-of-contents)
<a id="3.1.2"></a>
#### **<span style="color:#7A5197;">3.1.2 Data types</span>**


In [None]:
train_df.dtypes

[back to top](#table-of-contents)
<a id="3.1.3"></a>
#### **<span style="color:#7A5197;">3.1.3 Basic Statistics</span>**
Below are the basic statistics for each variable: count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile and maximum.

In [None]:
train_df.describe()

[back to top](#table-of-contents)
<a id="3.1.4"></a>
#### **<span style="color:#7A5197;">3.1.4 Target Column</span>**


In [None]:
print('Target column basic statistics:')
train_labels['state'].describe()

In [None]:
print('Frequency of each target classes:')
train_labels['state'].value_counts()

[back to top](#table-of-contents)
<a id="3.2"></a>
### **<span style="color:#7A5197;">3.2 Test Dataset</span>**
The test dataset is used to make predictions with the previously trained model. Exploring this dataset is also needed to see how the data is structured and, in particular, how similar it is to the train dataset.



[back to top](#table-of-contents)
<a id="3.2.1"></a>
#### **<span style="color:#7A5197;">3.2.1 Quick view</span>**


In [None]:

B(test_df['subject'].nunique(), '(Subjects)')
B(test_df['sequence'].nunique(), '(Sequences)')
B(test_df['step'].nunique(), 'Steps')

#test_df.head()

[back to top](#table-of-contents)
<a id="3.2.2"></a>
#### **<span style="color:#7A5197;">3.2.2 Data types</span>**

In [None]:
test_df.dtypes

[back to top](#table-of-contents)
<a id="3.2.3"></a>
#### **<span style="color:#7A5197;">3.2.3 Basic Statistics</span>**
Below are the basic statistics for each variable: count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile and maximum.

In [None]:
test_df.describe()

[back to top](#table-of-contents)
<a id="3.3"></a>
### **<span style="color:#7A5197;">3.3 Submission</span>**
Below are the first 5 rows of the sample submission file:

In [None]:
ssub.head()

[back to top](#table-of-contents)
<a id="4"></a>
# **<span style="color:#7A5197;">4. Exploratory Data Analysis</span>**

<a id="4.1"></a>
### **<span style="color:#7A5197;">4.1 Target</span>**


In [None]:
colors = ['#7A5197', '#BB5098', '#5344A9', '#F5C63C', '#F47F6B']

plt.subplots(figsize=(25, 10), facecolor='#f6f5f5')
plt.pie(train_labels['state'].value_counts(), startangle=90, wedgeprops={'width':0.3}, colors=['#F5C63C', '#7A5197'] )
plt.title('Target Balance Pie Chart', loc='center', fontsize=24, color='#7A5197', fontweight='bold');
plt.text(0, 0, f"{train_labels['state'].value_counts()[0] / train_labels['state'].count() * 100:.2f}%", ha='center', va='center', fontweight='bold', fontsize=42, color='#7A5197');
plt.legend(train_labels['state'].value_counts().index, ncol=2, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center', fontsize=16);
plt.show();

[back to top](#table-of-contents)

<a id="4.2"></a>
### **<span style="color:#7A5197;">4.2 Sequence distribution by subject</span>**


In [None]:
colors = ['#7A5197', '#BB5098', '#5344A9', '#F5C63C', '#F47F6B']

plt.figure(figsize=(13, 4), facecolor='#f6f5f5')
ax = plt.subplot(1,1,1)
temp = train_df.subject.value_counts().sort_values() // 60
ax.bar(range(len(temp)), temp, width=1, color=colors[0], zorder=2);
for s in ["top","right"]:
    ax.spines[s].set_visible(False)
ax.set_facecolor('#f6f5f5') 

ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)

plt.xlabel('subject');
plt.ylabel('sequence count');
plt.suptitle('Sequence distribution by subject', y=1.02, fontweight='bold');
plt.show();

[back to top](#table-of-contents)

<a id="4.3"></a>
### **<span style="color:#7A5197;">4.3 Features distribution</span>**


In [None]:
FEATURES = [f'sensor_0{i+1}' for i in range(9)] + [f'sensor_{i+1}' for i in range(9, 12)]
colors = ['#7A5197', '#BB5098', '#5344A9', '#F5C63C', '#F47F6B']
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(15, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 3)
gs.update(wspace=0.3, hspace=0.2)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 4):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].set_yticklabels([])
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1
        
        

run_no = 0
for col in FEATURES:
    sns.kdeplot(train_df[col], ax=locals()["ax"+str(run_no)], shade=True, color=colors[0], 
                edgecolor='black', linewidth=0, alpha=1, zorder=3);
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0);
    locals()["ax"+str(run_no)].yaxis.set_label_coords(1.1, 0);
    locals()["ax"+str(run_no)].set_xlabel('');
    run_no += 1
    
plt.suptitle('Features distribution', fontweight='bold');


[back to top](#table-of-contents)

<a id="4.4"></a>
### **<span style="color:#7A5197;">4.4 Selected Time Series</span>**


In [None]:
sequences = [0, 1, 2, 8364, 15404]
figure, axes = plt.subplots(13, len(sequences), sharex=True, figsize=(16, 16), facecolor='#f6f5f5')
colors = ['#7A5197', '#BB5098', '#5344A9', '#F5C63C', '#F47F6B']

for i, sequence in enumerate(sequences):
    for sensor in range(13):
        sensor_name = f"sensor_{sensor:02d}"
        ax = plt.subplot(13, len(sequences), sensor * len(sequences) + i + 1)
        for s in ["top","right"]:
                ax.spines[s].set_visible(False)
        ax.set_facecolor('#f6f5f5')
        plt.plot(range(60), train_df[train_df.sequence == sequence][sensor_name],
                color=colors[i]);        #plt.rcParams['axes.prop_cycle'].by_key()['color'][i % 10])
        if sensor == 0: plt.title(f"Sequence {sequence}");
        if sequence == sequences[0]: plt.ylabel(sensor_name);
figure.tight_layout(w_pad=0.1)
plt.suptitle('Selected Time Series', y=1.02, fontweight='bold');
plt.show();

[back to top](#table-of-contents)

<a id="4.5"></a>
### **<span style="color:#7A5197;">4.5 Profile per step with Quantiles</span>**


In [None]:
train_df = pd.merge(train_df, train_labels, left_on='sequence', right_on='sequence')



sensors = train_df.columns[train_df.columns.str.contains('sensor')]


state_0 = f"{int(round((train_df['state']==0).sum() / len(train_df)*100))}"
#B('~'+state_0+'%', 'State 0', progress=state_0)
state_1 = f"{int(round((train_df['state']==1).sum() / len(train_df)*100))}"
#B('~'+state_1+'%', 'State 1', progress=state_1)

steps_0 = train_df[train_df['state']==0].pivot_table(
    index=['step'],
    values=sensors,
    aggfunc='mean'
)

steps_1 = train_df[train_df['state']==1].pivot_table(
    index=['step'],
    values=sensors,
    aggfunc='mean'
)

fig, ax = plt.subplots(1, 2, figsize=(15,8), sharey=True, constrained_layout=True, facecolor='#f6f5f5')
_ = ax[0].plot(steps_0.median(axis=1), label='PCT 50%', c=colors[0])
_ = ax[0].plot(steps_0.quantile(0.9, axis=1), label='PCT 90%', c=colors[0], alpha=0.25)
_ = ax[0].plot(steps_0.quantile(0.8, axis=1), label='PCT 80%', c=colors[0], alpha=0.5)
_ = ax[0].plot(steps_0.quantile(0.2, axis=1), label='PCT 20%', c=colors[0], alpha=0.5)
_ = ax[0].plot(steps_0.quantile(0.1, axis=1), label='PCT 10%', c=colors[0], alpha=0.25)
_ = ax[0].plot([0]*len(steps_0), c='black')
_ = ax[0].legend(ncol=5, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center')
_ = ax[0].set_xlabel('Step')
_ = ax[0].set_ylabel('Value')
_ = ax[0].set_title('State 0')
for s in ["top","right"]:
    ax[0].spines[s].set_visible(False)
_ = ax[0].set_facecolor('#f6f5f5')


_ = ax[1].plot(steps_1.median(axis=1), label='PCT 50%', c=colors[1])
_ = ax[1].plot(steps_1.quantile(0.9, axis=1), label='PCT 90%', c=colors[1], alpha=0.25)
_ = ax[1].plot(steps_1.quantile(0.8, axis=1), label='PCT 80%', c=colors[1], alpha=0.5)
_ = ax[1].plot(steps_1.quantile(0.2, axis=1), label='PCT 20%', c=colors[1], alpha=0.5)
_ = ax[1].plot(steps_1.quantile(0.1, axis=1), label='PCT 10%', c=colors[1], alpha=0.25)
_ = ax[1].plot([0]*len(steps_0), c='black')
_ = ax[1].set_xlabel('Step')
_ = ax[1].legend(ncol=5, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center')
#plt.legend(train_labels['state'].value_counts().index, ncol=2, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center', fontsize=16);

_ = ax[1].set_title('State 1')
for s in ["top","right"]:
    ax[1].spines[s].set_visible(False)
_ = ax[1].set_facecolor('#f6f5f5')
        

_ = fig.suptitle('States 0|1 (mean) Profile per step with Quantiles', fontweight='bold')

[back to top](#table-of-contents)

<a id="4.6"></a>
### **<span style="color:#7A5197;">4.6 Profile per Subject with Quantiles</span>**


In [None]:
# Can we see that but through Subjects instead of Steps?

steps_0 = train_df[train_df['state']==0].pivot_table(
    index=['subject'],
    values=sensors,
    aggfunc='mean'
)

steps_1 = train_df[train_df['state']==1].pivot_table(
    index=['subject'],
    values=sensors,
    aggfunc='mean'
)

fig, ax = plt.subplots(2, 1, figsize=(15,10), sharey=True, constrained_layout=True, facecolor='#f6f5f5')
_ = ax[0].plot(steps_0.median(axis=1), label='PCT 50%', c=colors[0])
_ = ax[0].plot(steps_0.quantile(0.9, axis=1), label='PCT 90%', c=colors[0], alpha=0.25)
_ = ax[0].plot(steps_0.quantile(0.8, axis=1), label='PCT 80%', c=colors[0], alpha=0.5)
_ = ax[0].plot(steps_0.quantile(0.2, axis=1), label='PCT 20%', c=colors[0], alpha=0.5)
_ = ax[0].plot(steps_0.quantile(0.1, axis=1), label='PCT 10%', c=colors[0], alpha=0.25)
_ = ax[0].plot([0]*len(steps_0), c='black')
_ = ax[0].legend(ncol=5, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center')
_ = ax[0].set_ylabel('Value')
_ = ax[0].set_title('State 0')
for s in ["top","right"]:
    ax[0].spines[s].set_visible(False)
_ = ax[0].set_facecolor('#f6f5f5')


_ = ax[1].plot(steps_1.median(axis=1), label='PCT 50%', c=colors[1])
_ = ax[1].plot(steps_1.quantile(0.9, axis=1), label='PCT 90%', c=colors[1], alpha=0.25)
_ = ax[1].plot(steps_1.quantile(0.8, axis=1), label='PCT 80%', c=colors[1], alpha=0.5)
_ = ax[1].plot(steps_1.quantile(0.2, axis=1), label='PCT 20%', c=colors[1], alpha=0.5)
_ = ax[1].plot(steps_1.quantile(0.1, axis=1), label='PCT 10%', c=colors[1], alpha=0.25)
_ = ax[1].plot([0]*len(steps_0), c='black')
_ = ax[1].set_ylabel('Value')
_ = ax[1].set_xlabel('Subject')
_ = ax[1].legend(ncol=5, facecolor='#f6f5f5', edgecolor='#f6f5f5', loc='lower center')
_ = ax[1].set_title('State 1')
for s in ["top","right"]:
    ax[1].spines[s].set_visible(False)
_ = ax[1].set_facecolor('#f6f5f5')

_ = fig.suptitle('States 0|1 (mean) Profile per Subject with Quantiles', fontweight='bold')


[back to top](#table-of-contents)

<a id="4.7"></a>
### **<span style="color:#7A5197;">4.7 Scatter plot features</span>**


In [None]:
FEATURES = [f'sensor_0{i+1}' for i in range(9)] + [f'sensor_{i+1}' for i in range(9, 12)]
colors = ['#7A5197', '#BB5098', '#5344A9', '#F5C63C', '#F47F6B']
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(15, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(4, 3)
gs.update(wspace=0.3, hspace=0.2)

background_color = "#f6f5f5"

run_no = 0
for row in range(0, 4):
    for col in range(0, 3):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].set_yticklabels([])
        locals()["ax"+str(run_no)].tick_params(axis='y', which=u'both',length=0)
        for s in ["top","right", 'left']:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1
        
        

run_no = 0
for col in FEATURES:
    # pass x and y by keyword: recent seaborn versions reject positional data arguments
    sns.scatterplot(x=train_df[col], y=train_df.index, ax=locals()["ax"+str(run_no)], color=colors[0], 
                edgecolor=None, linewidth=0, alpha=1, zorder=3);
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0);
    locals()["ax"+str(run_no)].yaxis.set_label_coords(1.1, 0);
    locals()["ax"+str(run_no)].set_xlabel('');
    run_no += 1
    
plt.suptitle('Scatter plot features', fontweight='bold');


[back to top](#table-of-contents)

<a id="4.8"></a>
### **<span style="color:#7A5197;">4.8 Correlations</span>**


In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
sns.heatmap(train_df.corr(), cmap='BuPu')


[back to top](#table-of-contents)

<a id="5"></a>
# **<span style="color:#7A5197;">5. LSTM</span>**


In [None]:
import tensorflow as tf
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.layers import Concatenate, LSTM, GRU
from tensorflow.keras.layers import Bidirectional, Multiply

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import KFold, GroupKFold

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
test_df = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
ssub = pd.read_csv('../input/tabular-playground-series-apr-2022/sample_submission.csv')


In [None]:
features = train_df.columns.tolist()[3:]
def prep(df):
    for feature in features:
        df[feature + '_lag1'] = df.groupby('sequence')[feature].shift(1)
        df.fillna(0, inplace=True)
        df[feature + '_diff1'] = df[feature] - df[feature + '_lag1']    

prep(train_df)
prep(test_df)

features = train_df.columns.tolist()[3:]
sc = StandardScaler()
train_df[features] = sc.fit_transform(train_df[features])
test_df[features] = sc.transform(test_df[features])

groups = train_df["sequence"]
labels = train_labels["state"]

train_df = train_df.drop(["sequence", "subject", "step"], axis=1).values
train_df = train_df.reshape(-1, 60, train_df.shape[-1])

test_df = test_df.drop(["sequence", "subject", "step"], axis=1).values
test_df = test_df.reshape(-1, 60, test_df.shape[-1])
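
To make the feature engineering in `prep` easier to follow, here is a small sketch of the lag and difference features on a made-up toy frame with a single sensor. The values are invented for illustration; the logic mirrors the `groupby('sequence').shift(1)`, `fillna(0)` and subtraction steps used above.

```python
import pandas as pd

# Two toy sequences with three steps each (made-up sensor values).
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "sensor_00": [1.0, 3.0, 6.0, 10.0, 10.0, 4.0],
})

# shift(1) inside each sequence: the first step of a sequence has no predecessor (NaN),
# so values never leak from the end of one sequence into the start of the next.
toy["sensor_00_lag1"] = toy.groupby("sequence")["sensor_00"].shift(1)

# As in prep(), missing lags are replaced by 0 before differencing.
toy["sensor_00_lag1"] = toy["sensor_00_lag1"].fillna(0)
toy["sensor_00_diff1"] = toy["sensor_00"] - toy["sensor_00_lag1"]

print(toy)
```

One consequence of filling the missing lag with 0 is that the first difference of every sequence equals the raw sensor value at step 0 (here 1.0 and 10.0) rather than a true change.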

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
    BATCH_SIZE = tpu_strategy.num_replicas_in_sync * 64
    print("Running on TPU:", tpu.master())
    print(f"Batch Size: {BATCH_SIZE}")
    
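except ValueError:
    # No TPU available: fall back to the default strategy under the same name,
    # so that the later `with tpu_strategy.scope():` block still works.
    tpu_strategy = tf.distribute.get_strategy()
    BATCH_SIZE = 256
    print(f"Running on {tpu_strategy.num_replicas_in_sync} replicas")
    print(f"Batch Size: {BATCH_SIZE}")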
    
    
def dnn_model():
    
    x_input = Input(shape=(train_df.shape[-2:]))
    
    x1 = Bidirectional(LSTM(units=512, return_sequences=True))(x_input)
    x2 = Bidirectional(LSTM(units=256, return_sequences=True))(x1)
    z1 = Bidirectional(GRU(units=256, return_sequences=True))(x1)
    
    c = Concatenate(axis=2)([x2, z1])
    
    x3 = Bidirectional(LSTM(units=128, return_sequences=True))(c)
    
    x4 = GlobalMaxPooling1D()(x3)
    x5 = Dense(units=128, activation='selu')(x4)
    x_output = Dense(1, activation='sigmoid')(x5)

    model = Model(inputs=x_input, outputs=x_output, name='lstm_model')
    
    return model

model = dnn_model()

In [None]:
with tpu_strategy.scope():
    VERBOSE = True
    predictions, scores = [], []
    k = GroupKFold(n_splits = 10)

    for fold, (train_idx, val_idx) in enumerate(k.split(train_df, labels, groups.unique())):
        print('-'*15, '>', f'Fold {fold+1}', '<', '-'*15)
    
        X_train, X_val = train_df[train_idx], train_df[val_idx]
        y_train, y_val = labels.iloc[train_idx].values, labels.iloc[val_idx].values
        
        model = dnn_model()
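        # Name the metric explicitly so the callbacks can always find "val_auc".
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=[tf.keras.metrics.AUC(name="auc")])

        # AUC should be maximized, so the scheduler needs mode="max".
        lr = ReduceLROnPlateau(monitor="val_auc", factor=0.6, 
                               patience=4, verbose=VERBOSE, mode="max")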

        es = EarlyStopping(monitor="val_auc", patience=7, 
                           verbose=VERBOSE, mode="max", 
                           restore_best_weights=True)
        
        save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
        chk_point = ModelCheckpoint(f'./TPS_model_2022_{fold+1}C.h5', options=save_locally, 
                                    monitor='val_auc', verbose=VERBOSE, 
                                    save_best_only=True, mode='max')
        
        model.fit(X_train, y_train, 
                  validation_data=(X_val, y_val), 
                  epochs=15,
                  verbose=VERBOSE,
                  batch_size=BATCH_SIZE, 
                  callbacks=[lr, chk_point, es])
        
        load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
        model = load_model(f'./TPS_model_2022_{fold+1}C.h5', options=load_locally)
        
        y_pred = model.predict(X_val, batch_size=BATCH_SIZE).squeeze()
        score = roc_auc_score(y_val, y_pred)
        scores.append(score)
        predictions.append(model.predict(test_df, batch_size=BATCH_SIZE).squeeze())
        print(f"Fold-{fold+1} | OOF Score: {score}")
    
    print(f'Mean AUC over {k.n_splits} folds: {np.mean(scores)}')
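
A note on the split above: `groups.unique()` holds one id per sequence, so every sequence forms its own group and the folds are not separated by subject. A possible variation, which this notebook does not use, is to group by subject so that no participant appears in both the training and the validation part of a fold. The sketch below runs on made-up toy data; `subject_per_sequence` is a hypothetical array that, in this notebook, could be built before the reshape with something like `train_df.groupby('sequence')['subject'].first()`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 6 sequences recorded from 3 hypothetical subjects.
X = np.zeros((6, 60, 13))                    # (sequences, steps, sensors)
y = np.array([0, 1, 0, 1, 0, 1])
subject_per_sequence = np.array([10, 10, 20, 20, 30, 30])

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=subject_per_sequence))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_subjects = set(subject_per_sequence[train_idx])
    val_subjects = set(subject_per_sequence[val_idx])
    # With subject-level groups, the two sets never overlap.
    print(fold, sorted(val_subjects), train_subjects.isdisjoint(val_subjects))
```

Whether subject-level grouping gives a more realistic validation score depends on whether the test set contains unseen subjects, which would need to be checked against the data.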


In [None]:
ssub["state"] = sum(predictions)/k.n_splits 
ssub.to_csv('submission.csv', index=False)
ssub.head(3)
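
The submission above is a simple blend: the test predictions of all folds are averaged element-wise. A minimal sketch with made-up numbers (3 hypothetical folds, 4 test sequences) shows that `sum(predictions) / n_splits` is the same as taking the mean across folds:

```python
import numpy as np

# Hypothetical per-fold test predictions; values are invented for illustration.
predictions = [
    np.array([0.9, 0.2, 0.6, 0.1]),
    np.array([0.8, 0.3, 0.5, 0.2]),
    np.array([0.7, 0.1, 0.7, 0.3]),
]
n_splits = len(predictions)

# Element-wise average over folds, as done for the submission.
blend = sum(predictions) / n_splits

# Equivalent formulation with NumPy.
blend_np = np.mean(predictions, axis=0)

print(blend)
print(np.allclose(blend, blend_np))
```

Averaging keeps each prediction inside the range spanned by the individual fold models, which tends to smooth out fold-specific noise.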

[back to top](#table-of-contents)

<a id="6"></a>
# **<span style="color:#7A5197;">6. References</span>**

[@ambrosm](https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense)

[@torchme](https://www.kaggle.com/code/kartushovdanil/bird-eda-dark-charts)

[@maulberto3](https://www.kaggle.com/code/maulberto3/tps-april-2022-easy-eda)

[@javigallego](https://www.kaggle.com/code/javigallego/tps-apr22-eda-skewness-outliers-and-more)

[@dmitryuarov](https://www.kaggle.com/code/dmitryuarov/tps-sensors-auc-0-964)