## 4.Modeling

This stage we apply the model on the dataset prepared on the last stage. First, we will make a simple baseline. The test data set doesn't have a target to check, we will utilize cross-validation on the training set for evaluation.

A)Baseline

For the baseline we will use the logistic regression and the cross-validation on the training set. At the end we also will submit the results on the kaggle with the test set. Let's start loading the data.

In [1]:
from dask.distributed import Client
import dask.dataframe as dd
import logging
import numpy as np
import pandas as pd
import tensorflow as tf

2023-08-24 04:37:48.482314: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
data_path = "../data"
#Loading the data
y_train = pd.read_csv(f"{data_path}/originalSet/train_labels.csv")
x_train = pd.read_parquet(f"{data_path}/V1Set/train/")
#x_test = dd.read_parquet(f"{data_path}/V1Set/test/")

In [48]:
x_train = x_train.merge(y_train, right_on='customer_ID', left_on='customer_ID')

In [25]:
x_train.to_parquet('./merged_train/merged_train.parquet')

In [26]:
drop_columns = x_train.columns[:2]

With the data loaded we need to build a class to feed the model with the samples on a time-series way. Which means, we need to feed the model with windows of time that will shift feeding the model. The model will make a set of predictions based on a windows of consecutive samples from the data. The template of the modeling and dataset window class is extracted from the tensorflow website and can be accesed by:<link>https://www.tensorflow.org/tutorials/structured_data/time_series#split_the_data</link>

The main features of the input windows are:

- The width (number of time steps) of the inpyt and label windows.
- The time offset between them.
- Which features are used as inputs, labels, or both.

After prepare the dataset, we gonna use deep learning model to solve the problem. The problem we are trying to solve, receive the months and has to predict the default or not of the customer. Then we have to choose the correct architecture to solve this kind of problem. The model receives n entries and predict one unique label, this is called multi-input single output. The right deep learning architecture to solve this is a LSTM n-by-one. 

Now let's customize the template from the tensorflows website to fullfill ours necessity.

First we will split the train data in train, test and validation. 

In [27]:
#split x_train in 3 new divisions, train, test, validation.
def get_dataset_partitions_pd(df, y_train, train_split=0.7, val_split=0.2, test_split=0.1):
    assert (train_split + test_split + val_split) == 1

    #specify seed to always have the same split distribution between runs
    customer_ids = y_train.sample(frac=1, random_state=7)['customer_ID'].values
    #splitting
    #train
    train_ds = df[df['customer_ID'].isin(customer_ids[:int(train_split * len(customer_ids))])]   
    
    #val
    val_ds = df[df['customer_ID'].isin(customer_ids[int(train_split * len(customer_ids)):
                                       int(train_split * len(customer_ids))+int(val_split*len(customer_ids))])]   
    #test
    test_ds = df[df['customer_ID'].isin(customer_ids[int(train_split * len(customer_ids))+int(val_split*len(customer_ids)):])]
    
    return train_ds, val_ds, test_ds

In [28]:
train_ds, val_ds, test_ds = get_dataset_partitions_pd(x_train, y_train)

In [29]:
train_ds = train_ds.drop(drop_columns ,axis=1)
val_ds = val_ds.drop(drop_columns ,axis=1)
test_ds = test_ds.drop(drop_columns ,axis=1)

In [30]:
len(x_train.columns)

163

In [31]:
x_train.columns

Index(['customer_ID', 'S_2', 'P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'D_41', 'B_3',
       'D_44',
       ...
       'D_64_O', 'D_64_R', 'D_64_U', 'D_114_0.0', 'D_114_1.0', 'D_116_0.0',
       'D_116_1.0', 'D_120_0.0', 'D_120_1.0', 'target'],
      dtype='object', length=163)

In [32]:
del y_train

In [33]:
print(f'Number of customers per set: \n train:{len(train_ds)/13} \n validation:{len(val_ds)} \n test:{len(test_ds)}')

Number of customers per set: 
 train:321239.0 
 validation:1193166 
 test:596596


In [34]:
class WindowGenerator():
    def __init__(self, input_width, label_width, shift,
                   train_df=train_ds, val_df=val_ds, test_df=test_ds,
                   label_columns=['target']):
        # Store the raw data.
        self.train_df = train_df
        self.val_df = val_df
        self.test_df = test_df
        
        # Work out the label column indices.
        self.label_columns = label_columns
        if label_columns is not None:
          self.label_columns_indices = {name: i for i, name in
                                        enumerate(label_columns)}
        self.column_indices = {name: i for i, name in
                               enumerate(train_df.columns)}
        
        # Work out the window parameters.
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift
        
        self.total_window_size = input_width + shift
        
        self.input_slice = slice(0, input_width)
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]
        
        self.label_start = self.total_window_size - self.label_width
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = 13

    def __repr__(self):
        return '\n'.join([
            f'Total window size: {self.total_window_size}',
            f'Input indices: {self.input_indices}',
            f'Label indices: {self.label_indices}',
            f'Label column name(s): {self.label_columns}'])
    def split_window(self, features):
        inputs = features[:, self.input_slice, :-1]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
            [labels[:, :, self.column_indices[name]] for name in self.label_columns],
            axis=-1)
        
        # Slicing doesn't preserve static shape information, so set the shapes
        # manually. This way the `tf.data.Datasets` are easier to inspect.
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])
        
        return inputs, labels
    def make_dataset(self, data):
        
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.utils.timeseries_dataset_from_array(data=data,
                                                          targets=None,
                                                          sequence_length=self.total_window_size,
                                                          sequence_stride=1,
                                                          shuffle=False,
                                                          batch_size=32,)
        ds = ds.map(self.split_window)
            
        return ds
    
    @property
    def train(self):
        return self.make_dataset(self.train_df)
            
    @property
    def val(self):
        return self.make_dataset(self.val_df)
    
    @property
    def test(self):
        return self.make_dataset(self.test_df)
    

In [35]:
w1 = WindowGenerator(input_width=13, label_width=1, shift=0)
w1

Total window size: 13
Input indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12]
Label indices: 13
Label column name(s): ['target']

In [39]:
# Stack three slices, the length of the total window.
example_window = tf.stack([np.array(train_ds[:w1.total_window_size].astype('float32')),
                           np.array(train_ds[130:130+w1.total_window_size].astype('float32')),
                           np.array(train_ds[260:260+w1.total_window_size].astype('float32'))])

example_inputs, example_labels = w1.split_window(example_window)

print('All shapes are: (batch, time, features)')
print(f'Window shape: {example_window.shape}')
print(f'Inputs shape: {example_inputs.shape}')
print(f'Labels shape: {example_labels.shape}')

All shapes are: (batch, time, features)
Window shape: (3, 13, 161)
Inputs shape: (3, 13, 160)
Labels shape: (3, 1, 1)


2023-08-23 09:42:29.862084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3103 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 980, pci bus id: 0000:65:00.0, compute capability: 5.2


In [49]:
MAX_EPOCHS = 2
def compile_and_fit(model, train, val, patience=2):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor=tf.keras.metrics.Recall(),
                                                      patience=patience,
                                                      mode='min')
    
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(),   optimizer=tf.keras.optimizers.AdamW(use_ema=True),
                  metrics=['acc',tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
    
    history = model.fit(window.train, epochs=MAX_EPOCHS,
                        validation_data=window.val,verbose=1,
                        callbacks=[early_stopping])
    return history

In [2]:
import pyarrow.parquet as pq
MAX_EPOCHS=4
patience=2
lstm_model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, time, lstm_units]
    tf.keras.layers.LSTM(32, return_sequences=True, dropout=0.8),
    tf.keras.layers.LSTM(32, return_sequences=True, dropout=0.7),
    tf.keras.layers.LSTM(32, return_sequences=True, dropout=0.6),
    tf.keras.layers.LSTM(32, return_sequences=False, dropout=0.5),
    #tf.keras.layers.Flatten(),
    # Shape => [batch, time, features]
    tf.keras.layers.Dense(units=128, activation= tf.keras.activations.relu),
    tf.keras.layers.Dense(units=64, activation= tf.keras.activations.relu),
    tf.keras.layers.Dense(units=1, activation= tf.keras.activations.sigmoid)
])
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='recall', patience=patience, mode='min')
lstm_model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.AdamW(use_ema=True),
              metrics=['acc',tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
parquet_file = pq.ParquetFile('./merged_train/merged_train.parquet')
history = []
for me in range(MAX_EPOCHS):
    for batch in parquet_file.iter_batches(batch_size=650000):
        df = batch.to_pandas('./merged_train/merged_train.parquet')
        #data = np.array(data, dtype=np.float32)
        ds_train = tf.keras.utils.timeseries_dataset_from_array(data=df.iloc[:, 2:-1].to_numpy(),
                                                              targets=df.iloc[:, -1].to_numpy(),
                                                              sequence_length=13,
                                                              sequence_stride=13,
                                                              shuffle=False,
                                                              batch_size=32,)
        history.append( lstm_model.fit(ds_train, epochs=1,shuffle=False,verbose=1,
                            callbacks=[early_stopping]))

2023-08-24 04:37:55.561380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3125 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 980, pci bus id: 0000:65:00.0, compute capability: 5.2


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4


2023-08-24 04:38:03.386922: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8904
2023-08-24 04:38:03.451563: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fef1dedaa00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-08-24 04:38:03.451590: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 980, Compute Capability 5.2
2023-08-24 04:38:03.455928: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-08-24 04:38:03.558890: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


TypeError: Cannot convert str to pyarrow.lib.MemoryPool

Exception ignored in: 'pyarrow.lib._convert_pandas_options'
Traceback (most recent call last):
  File "/home/codemaster/anaconda3/envs/america-exp/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot convert str to pyarrow.lib.MemoryPool


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [3]:
lstm_model.save_weights('lstm_model')

In [5]:
parquet_file = pq.ParquetFile('./unified_test/test.parquet')
count = 0
for batch in parquet_file.iter_batches(batch_size=130000):
    df = batch.to_pandas()
    customers = df['customer_ID'].copy()
    test = pd.DataFrame(customers.unique().to_numpy().reshape((-1,1)),columns=['customer_ID'])
    
    #data = np.array(data, dtype=np.float32)
    ds_test = tf.keras.utils.timeseries_dataset_from_array(data=df.iloc[:, 2:].to_numpy(),
                                                          targets=None,
                                                          sequence_length=13,
                                                          sequence_stride=13,
                                                          shuffle=False,
                                                          batch_size=1)
    predictions = lstm_model.predict(ds_test)
    test['predictions'] = np.array(predictions).reshape((-1,1))
    test['predictions'] = test['predictions'].apply(lambda x: 1 if x >=0.5 else 0)
    test.to_parquet(f'../data/submit_data/answer{count}.parquet')
    count += 1



In [7]:
sub = pd.read_parquet("../data/submit_data/")

In [11]:
sub = sub.rename({'predictions': 'prediction'}, axis=1)

In [12]:
sub.to_csv("sub_final.csv", index=False)

In [15]:
!kaggle competitions submit -c amex-default-prediction -f 'sub_final.csv' -m "LSTM+DENSE"

/bin/bash: line 1: kaggle: command not found


#### Final score:
#### score:0.44
