### This notebook builds on the notebooks listed below, which cover detailed analysis, a reasonable way of handling missing values, and feature generation and selection. A few other good notebooks also contributed ideas. I wish to thank my fellow Kagglers, who compel me to learn and grow!

### [Kaggle Notebook] [Jane TF Keras LSTM](https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm) (to fill missing values)
### [Kaggle Notebook] [Jane Day 242 Feature Generation and Selection](https://www.kaggle.com/rajkumarl/jane-day-242-feature-generation-and-selection) (to generate and select features)


### The training part of this work is available in the following notebook. Three models were trained on three folds of the data. Their weights are saved in h5 format and loaded in this notebook for inference.

### [Kaggle Notebook] [TF Residual Network on Select Features](https://www.kaggle.com/rajkumarl/tf-residual-network-on-select-features-training) (for model training)

# 1. IMPORT LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import models, layers, regularizers
import datatable
import warnings
# ignore warnings during notebook running
warnings.filterwarnings('ignore')
SEED = 2222
# set seed
tf.random.set_seed(SEED)
np.random.seed(SEED)

# 2. LOAD DATA AND OPTIMIZE MEMORY

In [2]:
# path of train data file
train_path = '../input/jane-street-market-prediction/train.csv'

# use datatable to load big data file
train_file = datatable.fread(train_path).to_pandas()
train_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float64(135), int32(3)
memory usage: 2.4 GB


In [3]:
# It is found from info() that there are only two datatypes - float64 and int32
# try to convert to low-memory-space data types by comparing max and min values of data
# with the preset max and min values of low-memory-space data types
for c in train_file.columns:
    min_val, max_val = train_file[c].min(), train_file[c].max()
    if train_file[c].dtype == 'float64':
        if min_val>np.finfo(np.float16).min and max_val<np.finfo(np.float16).max:
            train_file[c] = train_file[c].astype(np.float16)
        elif min_val>np.finfo(np.float32).min and max_val<np.finfo(np.float32).max:
            train_file[c] = train_file[c].astype(np.float32)
    elif train_file[c].dtype == 'int32':
        if min_val>np.iinfo(np.int8).min and max_val<np.iinfo(np.int8).max:
            train_file[c] = train_file[c].astype(np.int8)
        elif min_val>np.iinfo(np.int16).min and max_val<np.iinfo(np.int16).max:
            train_file[c] = train_file[c].astype(np.int16)
train_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float16(135), int16(1), int32(1), int8(1)
memory usage: 631.5 MB


### That's a great reduction in memory usage (from 2.4 GB to 631.5 MB, around 74%)! It will help us go further efficiently!
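
The range-based downcasting above can be sketched as a small reusable helper on a toy frame. The function name `downcast` and the toy columns are assumptions made for illustration, not part of the original notebook. Note that the check only compares ranges: float16 has a 10-bit mantissa (roughly three significant decimal digits), so some precision is traded for memory.

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy of df with each numeric column in the smallest dtype whose range holds it."""
    out = df.copy()
    for c in out.columns:
        lo, hi = out[c].min(), out[c].max()
        if pd.api.types.is_float_dtype(out[c]):
            candidates, info = (np.float16, np.float32), np.finfo
        elif pd.api.types.is_integer_dtype(out[c]):
            candidates, info = (np.int8, np.int16, np.int32), np.iinfo
        else:
            continue  # leave non-numeric columns untouched
        for t in candidates:
            # same strict range comparison as in the cell above
            if info(t).min < lo and hi < info(t).max:
                out[c] = out[c].astype(t)
                break
    return out

# hypothetical toy data
toy = pd.DataFrame({
    "small_float": [0.5, -1.25, 3.0],
    "big_float": [1.0e5, -2.0e5, 3.0e5],  # outside the float16 range
    "small_int": np.array([1, -2, 3], dtype=np.int32),
    "mid_int": np.array([1000, -2000, 3000], dtype=np.int32),
})
small = downcast(toy)
print(small.dtypes)
```

Returning a copy keeps the original frame intact, which makes it easy to compare values before and after the conversion.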

# 3. HANDLING MISSING VALUES

In [4]:
# take useful features only...
features = train_file.columns[train_file.columns.str.contains('feature')]
# find range of values
val_range = train_file[features].max()-train_file[features].min()
# filler value: 1% of the range below the minimum of each feature
filler = pd.Series(train_file[features].min()-0.01*val_range, index=features)
# This filler value will be used as a constant replacement of missing values


def fill_missing(df):
    """
    Fill all missing values with negative outliers, as discussed in the referred notebook
    https://www.kaggle.com/rajkumarl/jane-tf-keras-lstm
    """
    df[features] = df[features].fillna(filler)
    return df

train = fill_missing(train_file)
train = train.loc[train.weight > 0]
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1981287 entries, 1 to 2390489
Columns: 138 entries, date to ts_id
dtypes: float16(135), int16(1), int32(1), int8(1)
memory usage: 538.5 MB


In [5]:
print("Now we have %d missing values in our data" %train.isnull().sum().sum())

Now we have 0 missing values in our data
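
To make the filler rule concrete, here is a minimal sketch on a toy frame. The column names `feature_a` and `feature_b` and the name `toy_filler` are hypothetical. For each column the filler sits 1% of the value range below the minimum, so a filled cell is always smaller than every observed value in that column.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 5.0],
    "feature_b": [-2.0, 0.0, np.nan, 2.0],
})

# per-column range, ignoring NaN
val_range = toy.max() - toy.min()
# filler: 1% of the range below the column minimum
toy_filler = toy.min() - 0.01 * val_range
# a Series passed to fillna is matched to the columns by name
filled = toy.fillna(toy_filler)

print(toy_filler)
print(filled)
```

Both toy columns have a range of 4, so the fillers are the minimum minus 0.04 in each case.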


# 4. FEATURE GENERATION AND SELECTION

In [56]:
"""
from notebook
https://www.kaggle.com/rajkumarl/jane-day-242-feature-generation-and-selection/comments
"""
def feature_transforms(df):
    # Generate Features using Linear shifting, Natural Logarithm and Square Root
    for f in [f'feature_{i}' for i in range(1,130)]: 
        # linear shifting to value above 1.0
        df['pos_'+str(f)] = (df[f]+abs(train[f].min())+1).astype(np.float16)
    for f in [f'feature_{i}' for i in range(1,130)]: 
        # Natural log of all the values
        df['log_'+str(f)] = np.log(df['pos_'+str(f)]).astype(np.float16)
    for f in [f'feature_{i}' for i in range(1,130)]: 
        # Square root of all the values
        df['sqrt_'+str(f)] = np.sqrt(df['pos_'+str(f)]).astype(np.float16)
    
    # Linearly shifted values are used for log and sqrt transformations
    # However they are useless since we have our original values which are 100% correlated
    # Let's drop them from our data
    df.drop([f'pos_feature_{i}' for i in range(1,130)], inplace=True, axis=1)
    
    # From the Shap Dependence plots, the following features seem to have cubic relationship with target
    cubic = [37, 39, 67, 68, 89, 98, 99, 118, 119, 121, 124, 125, 127]
    for i in cubic:
        f = f'feature_{i}'
        threes = np.array([3])
        df['cub_'+f] =np.power(df[f], threes) 
        
    # From the Shap Dependence plots, the following features seem to have quadratic relationship with target
    quad = [6, 37, 39, 40, 53, 60, 61, 62, 63, 64, 67, 68, 89, 98, 99, 101, 113, 116, 118, 119, 121, 123, 124, 125, 127]
    for i in quad:
        f = f'feature_{i}'
        df['quad_'+f] =np.square(df[f]) 
    
    return df
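
The shift, log and square-root steps can be checked on a tiny column. This is a sketch only: the helper name `shift_log_sqrt` is hypothetical, and it stays in float64 for readability instead of casting to float16 as the notebook does.

```python
import numpy as np
import pandas as pd

def shift_log_sqrt(col, train_min):
    """Shift a column by |train_min| + 1, then return its natural log and square root."""
    pos = col + abs(train_min) + 1
    return np.log(pos), np.sqrt(pos)

# hypothetical toy column whose minimum is -3, so the shifted values are 1, 4, 9
col = pd.Series([-3.0, 0.0, 5.0])
log_col, sqrt_col = shift_log_sqrt(col, train_min=col.min())

print(log_col.tolist())
print(sqrt_col.tolist())
```

Because the shift uses the minimum seen in the training data, an unseen value that falls more than 1 below that minimum would produce a non-positive shifted value, and the log and square root would then not be finite.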

In [7]:
"""
from notebook
https://www.kaggle.com/rajkumarl/jane-day-242-feature-generation-and-selection/comments
"""
def manipulate_pairs(df):
    # features that can be added together or subtracted
    add_pairs = [(3,6), (15,26), (19,26), (30,37), (34,33), (35,39),(94,65), (101,4)]
    for i,j in add_pairs:
        df[f'add_{i}_{j}'] = df[f'feature_{i}']+df[f'feature_{j}']
        df[f'sub_{i}_{j}'] = df[f'feature_{i}']-df[f'feature_{j}']

    add_log_pairs = [(9,20), (22,37), (28,39), (29,25), (65,91), (74,103),(99,126), (109,7), (111,87), (112,97), (118,112)]
    for i,j in add_log_pairs:
        df[f'add_{i}_log{j}'] = df[f'feature_{i}']+df[f'log_feature_{j}']
        df[f'sub_{i}_log{j}'] = df[f'feature_{i}']-df[f'log_feature_{j}']
    # features that can be multiplied together
    mul_pairs = [(5,42), (12,66), (37,45), (39,95), (122,35)]
    for i,j in mul_pairs:
        df[f'mul_{i}_{j}'] = df[f'feature_{i}']*df[f'feature_{j}']

    mul_log_pairs = [(5,42), (6,42), (11,99), (21,42), (81,66), (98,20), (122,35)]
    for i,j in mul_log_pairs:
        df[f'mul_{i}_log{j}'] = df[f'feature_{i}']*df[f'log_feature_{j}']
    return df

In [8]:
"""
from notebook
https://www.kaggle.com/rajkumarl/jane-day-242-feature-generation-and-selection/comments
"""
selected_features = ['weight', 'feature_1', 'feature_2', 'feature_6', 'feature_9',
       'feature_10', 'feature_16', 'feature_20', 'feature_29', 'feature_37',
       'feature_38', 'feature_39', 'feature_40', 'feature_51', 'feature_52',
       'feature_53', 'feature_54', 'feature_69', 'feature_70', 'feature_71',
       'feature_83', 'feature_100', 'feature_109', 'feature_112',
       'feature_122', 'feature_123', 'feature_124', 'feature_126',
       'feature_128', 'feature_129', 'log_feature_1', 'log_feature_2',
       'log_feature_6', 'log_feature_37', 'log_feature_38', 'log_feature_39',
       'log_feature_40', 'log_feature_50', 'log_feature_51', 'log_feature_52',
       'log_feature_53', 'log_feature_54', 'log_feature_69', 'log_feature_70',
       'log_feature_71', 'log_feature_109', 'log_feature_112',
       'log_feature_122', 'log_feature_123', 'log_feature_126',
       'log_feature_128', 'log_feature_129', 'sqrt_feature_1',
       'sqrt_feature_2', 'sqrt_feature_6', 'sqrt_feature_9', 
       'sqrt_feature_10', 'sqrt_feature_37', 'sqrt_feature_38', 'sqrt_feature_39',
       'sqrt_feature_40', 'sqrt_feature_50', 'sqrt_feature_51',
       'sqrt_feature_52', 'sqrt_feature_53', 'sqrt_feature_54',
       'sqrt_feature_56', 'sqrt_feature_69', 'sqrt_feature_70',
       'sqrt_feature_71', 'sqrt_feature_83', 'sqrt_feature_109',
       'sqrt_feature_112', 'sqrt_feature_122', 'sqrt_feature_123',
       'sqrt_feature_124', 'sqrt_feature_126', 'sqrt_feature_128',
       'sqrt_feature_129', 'cub_feature_37', 'cub_feature_39',
       'quad_feature_53', 'quad_feature_64', 'quad_feature_67',
       'quad_feature_68', 'sub_3_6', 'sub_30_37', 'add_35_39', 'add_9_log20',
       'sub_9_log20', 'add_29_log25', 'sub_29_log25', 'add_109_log7',
       'sub_109_log7', 'add_112_log97', 'sub_112_log97', 'mul_39_95',
       'mul_122_35', 'mul_6_log42', 'mul_122_log35']
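
Since the model depends on every name in this list being present after feature generation, a quick consistency check can help. The helper `missing_columns` below is a hypothetical addition, shown on a toy frame.

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df`, in their original order."""
    present = set(df.columns)
    return [name for name in wanted if name not in present]

# hypothetical toy frame and feature list
toy = pd.DataFrame({"weight": [1.0], "feature_1": [0.2], "log_feature_1": [0.1]})
wanted = ["weight", "feature_1", "sqrt_feature_1"]

absent = missing_columns(toy, wanted)
print(absent)
```

In this notebook, the same call could be made as `missing_columns(train, selected_features)` once `feature_transforms` and `manipulate_pairs` have been applied, and `len(selected_features)` could be compared with the width of the model's input layer.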

# 5. RESIDUAL NETWORK MODELING

In [10]:
class Residual(tf.keras.Model):  
    """The Residual layer of ResNet"""
    def __init__(self, units):
        super().__init__()
        # initialize necessary dense and batch norm layers
        self.d1 = layers.Dense(units, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))
        self.d2 = layers.Dense(units, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))
        self.d3 = layers.Dense(units, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))
        self.bn1 = layers.BatchNormalization()
        self.bn2 = layers.BatchNormalization()

    def call(self, X):
        # stack two dense layers in series...
        Y = tf.keras.activations.relu(self.bn1(self.d1(X)))
        Y = layers.Dropout(0.3)(self.bn2(self.d2(Y)))
        # ... and add the result to a third dense layer applied to the input (skip connection)
        X = self.d3(X)
        Y += X
        # apply dropout to avoid overfitting
        return layers.Dropout(0.3)(tf.keras.activations.relu(Y))

In [11]:
class ResnetBlock(layers.Layer):
    def __init__(self, num_units, num_residuals, **kwargs):
        super(ResnetBlock, self).__init__(**kwargs)
        # initialize a list of layers
        self.residual_layers = []
        for i in range(num_residuals):
            # append list with residual layers
            self.residual_layers.append(Residual(num_units))

    def call(self, X):
        # iterate over the tracked list directly; this does not rely on a `.layers` attribute
        for layer in self.residual_layers:
            # stack residual layers in series
            X = layer(X)
        return X
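
The data flow through one residual unit can be sketched in plain NumPy. This is an inference-time simplification, not the Keras implementation above: biases, batch normalization and dropout are left out, and the function name and random weights are hypothetical. It shows that the skip path is a dense projection that is added element-wise to the main path, so both must have the same width.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_forward(x, w1, w2, w3):
    """Two dense layers in series plus a projected skip path (no bias, batch norm or dropout)."""
    y = relu(x @ w1)       # first dense layer with ReLU
    y = y @ w2             # second dense layer
    skip = x @ w3          # projection so the skip path matches the width of y
    return relu(y + skip)  # element-wise addition, then ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))           # batch of 4 rows, 64 inputs each
w1 = rng.normal(size=(64, 128)) * 0.1  # hypothetical weights
w2 = rng.normal(size=(128, 128)) * 0.1
w3 = rng.normal(size=(64, 128)) * 0.1

out = residual_forward(x, w1, w2, w3)
print(out.shape)
```

The projection on the skip path is what lets a block change width, for example from 64 to 128 units, while still adding the two paths together.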

In [12]:
def create_model():
    # a keras Sequential model
    model= tf.keras.Sequential([
        # model receives data with 100 features
        layers.Input(shape=(100,)),
        # incorporate noise to avoid overfitting
        layers.GaussianNoise(0.2),
        # introduce first layer before ResNet blocks with regularizers and relu activation
        layers.Dense(64, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        # a dropout layer to avoid overfitting
        layers.Dropout(0.5),
        # four subsequent ResNet blocks
        ResnetBlock(64, 2),
        ResnetBlock(128, 2),
        ResnetBlock(256, 2),
        ResnetBlock(512, 2),
        # two layers after ResNet blocks
        layers.Dense(64, activation='relu'),
        # output layer - binary classification - sigmoid activation
        layers.Dense(1, activation='sigmoid')])
    
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), 
                  loss=tf.keras.losses.BinaryCrossentropy(), 
                  metrics=['accuracy'])
    return model


In [13]:
# find out the files in the training data path
!ls '../input/tf-residual-network-on-select-features-training'

__notebook__.ipynb  __results___files		resnet_select_feature_2.h5
__output__.json     custom.css			resnet_select_feature_3.h5
__results__.html    resnet_select_feature_1.h5


In [16]:
# path of our saved model weights
PATH = '../input/tf-residual-network-on-select-features-training/'
models = []
folds = 3
for i in range(folds):
    model = create_model()
    model.load_weights(PATH+f'resnet_select_feature_{i+1}.h5')
    models.append(model)
print('Modeling phase completed')

Modeling phase completed


# 6. INFERENCE

In [51]:
# test the pipeline with the available example test file
t_file = pd.read_csv('../input/jane-street-market-prediction/example_test.csv')
# sample one row, kept as a one-row DataFrame so the column-based helpers below work
test = t_file.loc[[1]].copy()
# fill missing values, if any
if test[features].isna().any().any():
    test = fill_missing(test)
# feature generation
test = feature_transforms(test)
test = manipulate_pairs(test)
# feature selection
test = test[selected_features]
# convert to a (1, 100) array of a model-supported dtype
# (np.float was removed in NumPy 1.24; the builtin float is the same float64 type)
test = np.array(test, dtype=float)
# predict target as the mean over the fold models
action = np.mean([model(test).numpy() for model in models])
# binary classification
# set threshold as 0.5
action = 1 if action>0.5 else 0
# output prediction
action
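
The averaging and thresholding step can be isolated in a small function. This is a sketch under assumptions: the name `ensemble_action` and the three probabilities are hypothetical stand-ins for the sigmoid outputs of the fold models.

```python
import numpy as np

def ensemble_action(fold_probs, threshold=0.5):
    """Average per-fold probabilities; return (action, mean) with action 1 if mean > threshold."""
    mean_prob = float(np.mean(fold_probs))
    return (1 if mean_prob > threshold else 0), mean_prob

# hypothetical sigmoid outputs of three fold models for one row
action_demo, mean_prob = ensemble_action([0.62, 0.48, 0.55])
print(action_demo, mean_prob)
```

Because the comparison is strict, a mean of exactly 0.5 maps to action 0, which matches the `action>0.5` test used in the notebook.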

In [None]:
from tqdm.auto import tqdm
import janestreet
janestreet.make_env.__called__ = False
env = janestreet.make_env()
for test,pred in tqdm(env.iter_test()):
    if test.weight.item()==0:
        pred.action = 0
    else:
        if test[features].isna().any().sum():
            test[features] = fill_missing(test[features])
        test = feature_transforms(test)
        test = manipulate_pairs(test)
        # np.float was removed in NumPy 1.24; the builtin float is the same float64 type
        test = np.array(test[selected_features], dtype=float)
        action = np.mean([model(test).numpy() for model in models])
        pred.action = 1 if action>0.5 else 0
    env.predict(pred)


### Thank you for your time!