# Mechanisms of Action (MoA) Prediction - Final Classifier
## Test

### Introduction
In this notebook, we will create the test data prediction pipeline for our final multi-label classifier in order to produce a submission file. Depending on how effective they prove to be, the models created for our 0-label classifiers may also be added to this pipeline.

Because our strategy was to conduct nested cross-validation with a Bayesian hyperparameter search in each fold, each fold may have produced a model using different parameters, so a dataframe of these parameters (written during the training process) will be imported. With this, we can ensemble all the models created during nested cross-validation (some may be the same, some may be different!) as well as the models created using different random states.
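
As a rough illustration of the ensembling idea, the sketch below averages the predictions of several models element-wise. Simple averaging is an assumption made for illustration (this notebook does not state how the per-model predictions are combined), and `average_predictions` and the toy arrays are hypothetical names and values.

```python
import numpy as np

def average_predictions(pred_list):
    """Average a list of (n_samples, n_labels) prediction arrays element-wise."""
    stacked = np.stack(pred_list, axis=0)  # shape: (n_models, n_samples, n_labels)
    return stacked.mean(axis=0)

# Toy predictions from three hypothetical models (2 samples, 2 labels)
fold_preds = [
    np.array([[0.2, 0.8], [0.6, 0.1]]),
    np.array([[0.4, 0.6], [0.6, 0.3]]),
    np.array([[0.3, 0.7], [0.9, 0.2]]),
]
ensemble = average_predictions(fold_preds)
print(ensemble)
```

Every model must output predictions in the same row and label order for this to be meaningful.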

Due to the computationally intensive nature of the nested cross-validation strategy, we don't end up with many models to ensemble. However, the expectation is that each of these fewer models performs better, so there should be no need for more models to ensemble. Provided this is true, the test prediction pipeline is also much faster (as each test record doesn't need to be passed through a ridiculous number of models before a prediction is created).

This entire process has been novel to me. I have realised the **need** for a robust and effective cross-validation strategy - this can make or break a data science experiment. So with that in mind, I have not been making any submissions to the Kaggle public test set (that is used for the public leaderboard). My hope is that by having such a robust cross-validation strategy, I don't *have to* constantly evaluate the model performance based on the leaderboard, which is essentially overfitting to the leaderboard test data. 

One aspect of the training pipeline that can be improved is that the sub-validation set is created through Keras's fit method, rather than being a sub-validation set of my own creation. Scaling, feature selection, etc. are all done without using the OOF validation set (hence nested cross-validation), but we can further improve this by keeping the sub-validation data out of those steps too! Now we're getting serious about over-fitting...
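
A minimal sketch of that improvement, assuming the sub-validation rows are simply the last fraction of the training fold (which is how Keras's `validation_split` picks them, before any shuffling). The helper `split_inner` and the median-centring step are hypothetical stand-ins for the real split and scaler.

```python
import numpy as np

def split_inner(train_idx, validation_split=0.2):
    """Hold out the last fraction of train_idx as a sub-validation set."""
    cut = int(len(train_idx) * (1 - validation_split))
    return train_idx[:cut], train_idx[cut:]

# Toy feature matrix: 10 rows, 2 columns
X = np.arange(20, dtype=float).reshape(10, 2)
outer_train_idx = np.arange(10)
inner_train_idx, subval_idx = split_inner(outer_train_idx, validation_split=0.2)

# Fit the preprocessing statistic on the inner-train rows only...
median = np.median(X[inner_train_idx], axis=0)
# ...then apply it unchanged to both parts
X_inner_train = X[inner_train_idx] - median
X_subval = X[subval_idx] - median
print(subval_idx, median)
```

The sub-validation rows can then be passed to `fit` through `validation_data` rather than `validation_split`.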

Also, there is certainly room for more research into machine learning based feature selection techniques for multi-label problems. Currently I employ a SelectFromModel technique *per label* (in a similar vein to one-vs-all classification). Storing the features selected for each label, by the end I rank the features in terms of how many times they were selected. The features to use in the final model are then defined by a parameter of num_features (that is included in our Bayesian hyperparameter search). This is rudimentary and could *definitely* be improved!
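
The ranking step described above can be sketched with a simple counter. This is an illustrative sketch, not the training notebook's code: `rank_selected_features`, the toy label names and the alphabetical tie-break are all assumptions.

```python
from collections import Counter

def rank_selected_features(selected_per_label, num_features):
    """Rank features by how many labels selected them and keep the top num_features."""
    counts = Counter()
    for features in selected_per_label.values():
        counts.update(set(features))  # count each feature at most once per label
    # Most frequently selected first; ties broken by name (an assumption)
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return ranked[:num_features]

# Hypothetical per-label selections, e.g. from one SelectFromModel fit per label
selected_per_label = {
    'label_a': ['g-0', 'g-1', 'c-0'],
    'label_b': ['g-0', 'c-0'],
    'label_c': ['g-0', 'g-2'],
}
top_features = rank_selected_features(selected_per_label, num_features=2)
print(top_features)
```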

## 1.00 Import Packages

In [1]:
# General packages
import pandas as pd
import numpy as np
import os
import gc
import random
from tqdm import tqdm, tqdm_notebook
import json # For reading in csv with string list representation values

import time
import warnings
warnings.filterwarnings('ignore')

# Data vis packages
import matplotlib.pyplot as plt
%matplotlib inline

# Data prep
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

# Modelling packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.keras import backend as k
# Key layers
from tensorflow.keras.models import load_model
# Cross validation
from sklearn.model_selection import KFold
from sklearn import metrics

In [2]:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

strategy = tf.distribute.get_strategy()
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

# GPU memory options (allow memory use to grow as needed)
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)

Num GPUs Available:  1
REPLICAS: 1


## 2.00 Read in Data

In [3]:
# Directory and file paths
input_dir                 = '../input/lish-moa/'
train_features_path       = os.path.join(input_dir, 'train_features.csv')
test_features_path        = os.path.join(input_dir, 'test_features.csv')
train_targets_scored_path = os.path.join(input_dir, 'train_targets_scored.csv')
sample_submission_path    = os.path.join(input_dir, 'sample_submission.csv')

# Read in data
train_features       = pd.read_csv(train_features_path)
test_features        = pd.read_csv(test_features_path)
train_targets_scored = pd.read_csv(train_targets_scored_path)
sample_submission    = pd.read_csv(sample_submission_path)

del train_features_path, test_features_path, train_targets_scored_path, sample_submission_path

print(f'train_features shape: \t\t{train_features.shape}')
print(f'test_features shape: \t\t{test_features.shape}')
print(f'train_targets_scored shape: \t{train_targets_scored.shape}')
print(f'sample_submission shape: \t{sample_submission.shape}')

train_features shape: 		(23814, 876)
test_features shape: 		(3982, 876)
train_targets_scored shape: 	(23814, 207)
sample_submission shape: 	(3982, 207)


In [4]:
# Define key parameters
SCALER_METHOD = RobustScaler()

MODEL_TO_USE = 'nn'
MODEL_NAME = MODEL_TO_USE + '_final_classifier'

print(f'Model name: {MODEL_NAME}')

Model name: nn_final_classifier


## 3.00 Data Preparation

In [5]:
def get_transformed_row_features(df):
    """
    Input data and returns transformed features using row level statistics.
    """
    
    def get_row_stat(df, stat, feat_type):
        """
        Input data and returns row level statistics.
        stat: str ['sum','mean','med','std','min','max']
        feat_type: str [None,'g','c']
        """

        # Separate features into numerical and categorical (and by feature type if specified)
        if feat_type == None:
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
        elif feat_type == 'g':
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
            # Subset to g features
            df_numerical = df_numerical[df_numerical.columns[df_numerical.columns.str.startswith('g-')]]
            df_categorical = df_categorical[df_categorical.columns[df_categorical.columns.str.startswith('g-')]]
        elif feat_type == 'c':
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
            # Subset to c features
            df_numerical = df_numerical[df_numerical.columns[df_numerical.columns.str.startswith('c-')]]
            df_categorical = df_categorical[df_categorical.columns[df_categorical.columns.str.startswith('c-')]]

        # Add statistic feature
        if stat == 'sum':
            stat_feat = df_numerical.sum(axis=1)
        elif stat == 'mean':
            stat_feat = df_numerical.mean(axis=1)
        elif stat == 'med':
            stat_feat = df_numerical.median(axis=1)
        elif stat == 'std':
            stat_feat = df_numerical.std(axis=1)
        elif stat == 'min':
            stat_feat = df_numerical.min(axis=1)
        elif stat == 'max':
            stat_feat = df_numerical.max(axis=1)

        return(stat_feat)
    
    
    # Get list of original column names (so we don't make transformations using new features)
    df_cols = df.columns
    
    # Total row stats
    df['row_sum']  = get_row_stat(df=df[df_cols], stat='sum' , feat_type=None)
    df['row_mean'] = get_row_stat(df=df[df_cols], stat='mean', feat_type=None)
    df['row_med']  = get_row_stat(df=df[df_cols], stat='med' , feat_type=None)
    df['row_std']  = get_row_stat(df=df[df_cols], stat='std' , feat_type=None)
    df['row_min']  = get_row_stat(df=df[df_cols], stat='min' , feat_type=None)
    df['row_max']  = get_row_stat(df=df[df_cols], stat='max' , feat_type=None)
    # G feature row stats
    df['row_sum_g']  = get_row_stat(df=df[df_cols], stat='sum' , feat_type='g')
    df['row_mean_g'] = get_row_stat(df=df[df_cols], stat='mean', feat_type='g')
    df['row_med_g']  = get_row_stat(df=df[df_cols], stat='med' , feat_type='g')
    df['row_std_g']  = get_row_stat(df=df[df_cols], stat='std' , feat_type='g')
    df['row_min_g']  = get_row_stat(df=df[df_cols], stat='min' , feat_type='g')
    df['row_max_g']  = get_row_stat(df=df[df_cols], stat='max' , feat_type='g')
    # C feature row stats
    df['row_sum_c']  = get_row_stat(df=df[df_cols], stat='sum' , feat_type='c')
    df['row_mean_c'] = get_row_stat(df=df[df_cols], stat='mean', feat_type='c')
    df['row_med_c']  = get_row_stat(df=df[df_cols], stat='med' , feat_type='c')
    df['row_std_c']  = get_row_stat(df=df[df_cols], stat='std' , feat_type='c')
    df['row_min_c']  = get_row_stat(df=df[df_cols], stat='min' , feat_type='c')
    df['row_max_c']  = get_row_stat(df=df[df_cols], stat='max' , feat_type='c')
    
    # G features row stats / row sum
    df['row_sum_g_by_row_sum']  = df['row_sum_g']  / df['row_sum']
    df['row_mean_g_by_row_sum'] = df['row_mean_g'] / df['row_sum']
    df['row_med_g_by_row_sum']  = df['row_med_g']  / df['row_sum']
    df['row_std_g_by_row_sum']  = df['row_std_g']  / df['row_sum']
    df['row_min_g_by_row_sum']  = df['row_min_g']  / df['row_sum']
    df['row_max_g_by_row_sum']  = df['row_max_g']  / df['row_sum']
    # C features row stats / row sum
    df['row_sum_c_by_row_sum']  = df['row_sum_c']  / df['row_sum']
    df['row_mean_c_by_row_sum'] = df['row_mean_c'] / df['row_sum']
    df['row_med_c_by_row_sum']  = df['row_med_c']  / df['row_sum']
    df['row_std_c_by_row_sum']  = df['row_std_c']  / df['row_sum']
    df['row_min_c_by_row_sum']  = df['row_min_c']  / df['row_sum']
    df['row_max_c_by_row_sum']  = df['row_max_c']  / df['row_sum']
    
    # G features row stats / row mean
    df['row_sum_g_by_row_mean']  = df['row_sum_g']  / df['row_mean']
    df['row_mean_g_by_row_mean'] = df['row_mean_g'] / df['row_mean']
    df['row_med_g_by_row_mean']  = df['row_med_g']  / df['row_mean']
    df['row_std_g_by_row_mean']  = df['row_std_g']  / df['row_mean']
    df['row_min_g_by_row_mean']  = df['row_min_g']  / df['row_mean']
    df['row_max_g_by_row_mean']  = df['row_max_g']  / df['row_mean']
    # C features row stats / row mean
    df['row_sum_c_by_row_mean']  = df['row_sum_c']  / df['row_mean']
    df['row_mean_c_by_row_mean'] = df['row_mean_c'] / df['row_mean']
    df['row_med_c_by_row_mean']  = df['row_med_c']  / df['row_mean']
    df['row_std_c_by_row_mean']  = df['row_std_c']  / df['row_mean']
    df['row_min_c_by_row_mean']  = df['row_min_c']  / df['row_mean']
    df['row_max_c_by_row_mean']  = df['row_max_c']  / df['row_mean']
    
    # G features row stats / row med
    df['row_sum_g_by_row_med']  = df['row_sum_g']  / df['row_med']
    df['row_mean_g_by_row_med'] = df['row_mean_g'] / df['row_med']
    df['row_med_g_by_row_med']  = df['row_med_g']  / df['row_med']
    df['row_std_g_by_row_med']  = df['row_std_g']  / df['row_med']
    df['row_min_g_by_row_med']  = df['row_min_g']  / df['row_med']
    df['row_max_g_by_row_med']  = df['row_max_g']  / df['row_med']
    # C features row stats / row med
    df['row_sum_c_by_row_med']  = df['row_sum_c']  / df['row_med']
    df['row_mean_c_by_row_med'] = df['row_mean_c'] / df['row_med']
    df['row_med_c_by_row_med']  = df['row_med_c']  / df['row_med']
    df['row_std_c_by_row_med']  = df['row_std_c']  / df['row_med']
    df['row_min_c_by_row_med']  = df['row_min_c']  / df['row_med']
    df['row_max_c_by_row_med']  = df['row_max_c']  / df['row_med']
    
    # G features row stats / row std
    df['row_sum_g_by_row_std']  = df['row_sum_g']  / df['row_std']
    df['row_mean_g_by_row_std'] = df['row_mean_g'] / df['row_std']
    df['row_med_g_by_row_std']  = df['row_med_g']  / df['row_std']
    df['row_std_g_by_row_std']  = df['row_std_g']  / df['row_std']
    df['row_min_g_by_row_std']  = df['row_min_g']  / df['row_std']
    df['row_max_g_by_row_std']  = df['row_max_g']  / df['row_std']
    # C features row stats / row std
    df['row_sum_c_by_row_std']  = df['row_sum_c']  / df['row_std']
    df['row_mean_c_by_row_std'] = df['row_mean_c'] / df['row_std']
    df['row_med_c_by_row_std']  = df['row_med_c']  / df['row_std']
    df['row_std_c_by_row_std']  = df['row_std_c']  / df['row_std']
    df['row_min_c_by_row_std']  = df['row_min_c']  / df['row_std']
    df['row_max_c_by_row_std']  = df['row_max_c']  / df['row_std']
    
    # G features row stats / row min
    df['row_sum_g_by_row_min']  = df['row_sum_g']  / df['row_min']
    df['row_mean_g_by_row_min'] = df['row_mean_g'] / df['row_min']
    df['row_med_g_by_row_min']  = df['row_med_g']  / df['row_min']
    df['row_std_g_by_row_min']  = df['row_std_g']  / df['row_min']
    df['row_min_g_by_row_min']  = df['row_min_g']  / df['row_min']
    df['row_max_g_by_row_min']  = df['row_max_g']  / df['row_min']
    # C features row stats / row min
    df['row_sum_c_by_row_min']  = df['row_sum_c']  / df['row_min']
    df['row_mean_c_by_row_min'] = df['row_mean_c'] / df['row_min']
    df['row_med_c_by_row_min']  = df['row_med_c']  / df['row_min']
    df['row_std_c_by_row_min']  = df['row_std_c']  / df['row_min']
    df['row_min_c_by_row_min']  = df['row_min_c']  / df['row_min']
    df['row_max_c_by_row_min']  = df['row_max_c']  / df['row_min']
    
    # G features row stats / row max
    df['row_sum_g_by_row_max']  = df['row_sum_g']  / df['row_max']
    df['row_mean_g_by_row_max'] = df['row_mean_g'] / df['row_max']
    df['row_med_g_by_row_max']  = df['row_med_g']  / df['row_max']
    df['row_std_g_by_row_max']  = df['row_std_g']  / df['row_max']
    df['row_min_g_by_row_max']  = df['row_min_g']  / df['row_max']
    df['row_max_g_by_row_max']  = df['row_max_g']  / df['row_max']
    # C features row stats / row max
    df['row_sum_c_by_row_max']  = df['row_sum_c']  / df['row_max']
    df['row_mean_c_by_row_max'] = df['row_mean_c'] / df['row_max']
    df['row_med_c_by_row_max']  = df['row_med_c']  / df['row_max']
    df['row_std_c_by_row_max']  = df['row_std_c']  / df['row_max']
    df['row_min_c_by_row_max']  = df['row_min_c']  / df['row_max']
    df['row_max_c_by_row_max']  = df['row_max_c']  / df['row_max']
    
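    # G features row stats / C features row stats
    # NOTE: as written, each line below divides a G statistic by itself (giving 1, or NaN
    # when the statistic is 0), and the column names do not match the numerators.
    # Left unchanged on the assumption that the train notebook built these features the same way.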
    df['row_sum_g_by_row_sum_c']  = df['row_sum_g']  / df['row_sum_g']
    df['row_sum_g_by_row_mean_c'] = df['row_mean_g'] / df['row_mean_g']
    df['row_sum_g_by_row_med_c']  = df['row_med_g']  / df['row_med_g']
    df['row_sum_g_by_row_std_c']  = df['row_std_g']  / df['row_std_g']
    df['row_sum_g_by_row_min_c']  = df['row_min_g']  / df['row_min_g']
    df['row_sum_g_by_row_max_c']  = df['row_max_g']  / df['row_max_g']
    
    # Row stats / cp_time
    df['row_sum_by_cp_time']  = df['row_sum']  / df['cp_time']
    df['row_mean_by_cp_time'] = df['row_mean'] / df['cp_time']
    df['row_med_by_cp_time']  = df['row_med']  / df['cp_time']
    df['row_std_by_cp_time']  = df['row_std']  / df['cp_time']
    df['row_min_by_cp_time']  = df['row_min']  / df['cp_time']
    df['row_max_by_cp_time']  = df['row_max']  / df['cp_time']
    
    # G features row stats / cp_time
    df['row_sum_g_by_cp_time']  = df['row_sum_g']  / df['cp_time']
    df['row_mean_g_by_cp_time'] = df['row_mean_g'] / df['cp_time']
    df['row_med_g_by_cp_time']  = df['row_med_g']  / df['cp_time']
    df['row_std_g_by_cp_time']  = df['row_std_g']  / df['cp_time']
    df['row_min_g_by_cp_time']  = df['row_min_g']  / df['cp_time']
    df['row_max_g_by_cp_time']  = df['row_max_g']  / df['cp_time']
    
    # C features row stats / cp_time
    df['row_sum_c_by_cp_time']  = df['row_sum_c']  / df['cp_time']
    df['row_mean_c_by_cp_time'] = df['row_mean_c'] / df['cp_time']
    df['row_med_c_by_cp_time']  = df['row_med_c']  / df['cp_time']
    df['row_std_c_by_cp_time']  = df['row_std_c']  / df['cp_time']
    df['row_min_c_by_cp_time']  = df['row_min_c']  / df['cp_time']
    df['row_max_c_by_cp_time']  = df['row_max_c']  / df['cp_time']
    
    return(df, df_cols)
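
The row-level statistics above can also be produced in a more compact, data-driven way. The sketch below is one possible refactoring on a toy frame; `row_stats`, `STATS` and the toy values are hypothetical, and only the column naming (`row_sum_g`, `row_med_c`, ...) follows the function above.

```python
import pandas as pd

# Short stat names used in the feature names, mapped to pandas method names
STATS = {'sum': 'sum', 'mean': 'mean', 'med': 'median',
         'std': 'std', 'min': 'min', 'max': 'max'}

def row_stats(df, prefix=None):
    """Row-level statistics over numeric columns (excluding cp_time),
    optionally restricted to columns starting with prefix ('g-' or 'c-')."""
    numeric = df.select_dtypes('number').drop(columns='cp_time')
    if prefix is not None:
        numeric = numeric.loc[:, numeric.columns.str.startswith(prefix)]
    suffix = '' if prefix is None else '_' + prefix.rstrip('-')
    return pd.DataFrame({
        f'row_{name}{suffix}': getattr(numeric, method)(axis=1)
        for name, method in STATS.items()
    })

toy = pd.DataFrame({
    'cp_type': ['trt_cp', 'ctl_vehicle'],
    'cp_time': [24, 48],
    'g-0': [1.0, -2.0],
    'g-1': [3.0, 0.0],
    'c-0': [2.0, 4.0],
})
g_stats = row_stats(toy, prefix='g-')
print(g_stats)
```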

In [6]:
def get_transformed_col_features(df, df_cols, stat, row_feat_type, col_feat_type, feature_name):
    """
    Input data and returns transformed features using column level statistics.
    stat: str ['sum','mean','med','std','min','max']
    row_feat_type: str [None,'g','c']
    col_feat_type: str [None,'g','c']
    feature_name: str, name to call new outputted feature
    df_cols: list of column names from original dataset
    """
    
    def get_column_stat(df, stat, feat_type):
        """
        Input data and returns column level statistics.
        stat: str ['sum','mean','med','std','min','max']
        feat_type: str [None,'g','c']
        """

        # Separate features into numerical and categorical (and by feature type if specified)
        if feat_type == None:
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
        elif feat_type == 'g':
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
            # Subset to g features
            df_numerical = df_numerical[df_numerical.columns[df_numerical.columns.str.startswith('g-')]]
            df_categorical = df_categorical[
                df_categorical.columns[df_categorical.columns.astype(str).str.startswith('g-')]
            ]
        elif feat_type == 'c':
            df_numerical = df.select_dtypes('number').drop('cp_time', axis=1)
            df_categorical = df.select_dtypes('object')
            # Subset to c features
            df_numerical = df_numerical[df_numerical.columns[df_numerical.columns.str.startswith('c-')]]
            df_categorical = df_categorical[
                df_categorical.columns[df_categorical.columns.astype(str).str.startswith('c-')]
            ]

        # Add statistic feature
        if stat == 'sum':
            stat_feat = np.sum(df_numerical.values)
        elif stat == 'mean':
            stat_feat = np.mean(df_numerical.values)
        elif stat == 'med':
            stat_feat = np.median(df_numerical.values)
        elif stat == 'std':
            stat_feat = np.std(df_numerical.values)
        elif stat == 'min':
            stat_feat = np.min(df_numerical.values)
        elif stat == 'max':
            stat_feat = np.max(df_numerical.values)

        return(stat_feat)
    
    # Get column level statistic
    col_stat = get_column_stat(df=df[df_cols], stat=stat, feat_type=col_feat_type)
    
    # Redefine the feature suffix based on row_feat_type
    if row_feat_type == None:
        row_feat_type = ''
    elif row_feat_type == 'g':
        row_feat_type = '_g'
    elif row_feat_type == 'c':
        row_feat_type = '_c'
    
    # Get transformed feature
    if stat == 'sum':
        df[feature_name] = df['row_sum' + row_feat_type] / col_stat
    elif stat == 'mean':
        df[feature_name] = df['row_mean' + row_feat_type] / col_stat
    elif stat == 'med':
        df[feature_name] = df['row_med' + row_feat_type] / col_stat
    elif stat == 'std':
        df[feature_name] = df['row_std' + row_feat_type] / col_stat
    elif stat == 'min':
        df[feature_name] = df['row_min' + row_feat_type] / col_stat
    elif stat == 'max':
        df[feature_name] = df['row_max' + row_feat_type] / col_stat
    
    return(df)
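
To make the "row statistic / column statistic" features concrete: the column-level statistic above is a single scalar computed over every value in the chosen block of columns, and each row statistic is divided by it. A toy sketch with made-up numbers:

```python
import numpy as np

# Toy block of numerical features: 2 rows, 2 columns
block = np.array([[1.0, 3.0],
                  [2.0, 6.0]])

row_sum = block.sum(axis=1)    # one value per row
block_sum = np.sum(block)      # one scalar for the whole block
row_sum_by_col_sum = row_sum / block_sum
print(row_sum_by_col_sum)
```

Because the scalar is computed from whichever dataframe is passed in, the train and test sets are each normalised by their own block statistic.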

In [7]:
def transform_feature_set(X_train, X_test, y_train, 
                          selected_features,
                          num_features,
                          pca, 
                          num_components,
                          seed,
                          scaler=SCALER_METHOD,
                          verbose=0):
    """
    Takes in X_train and X_test datasets, and applies feature transformations,
    feature selection, scaling and pca (dependent on arguments). 
    
    Returns transformed X_train and X_test data ready for training/prediction, and returns
    list of numerical cols and categorical cols, for the use of creating embeddings.
    """
    
    ## DATA PREPARATION ##
    
    # Drop unique ID feature
    X_train = X_train.drop('sig_id', axis=1)
    X_test  = X_test.drop('sig_id', axis=1)
    # Get indices for train and test dfs - we'll need these later
    train_idx = list(X_train.index)
    test_idx  = list(X_test.index)
    
    
    ## IN-FOLD FEATURE ENGINEERING ##

    if verbose == 1:
        print('ENGINEERING FEATURES...')
        
    for X_dataset in [X_train, X_test]:
        # Row transformations
        df, df_cols = get_transformed_row_features(X_dataset)
    for X_dataset in [X_train, X_test]:
        # Total row stats / column stats
        get_transformed_col_features(X_dataset, df_cols, 'sum', None, None, 'row_sum_by_col_sum')
        get_transformed_col_features(X_dataset, df_cols, 'mean',None, None, 'row_mean_by_col_mean')
        get_transformed_col_features(X_dataset, df_cols, 'med', None, None, 'row_med_by_col_med')
        get_transformed_col_features(X_dataset, df_cols, 'std', None, None, 'row_std_by_col_std')
        get_transformed_col_features(X_dataset, df_cols, 'min', None, None, 'row_min_by_col_min')
        get_transformed_col_features(X_dataset, df_cols, 'max', None, None, 'row_max_by_col_max')
        # G features row stats / column stats
        get_transformed_col_features(X_dataset, df_cols, 'sum', 'g', None, 'row_sum_g_by_col_sum')
        get_transformed_col_features(X_dataset, df_cols, 'mean','g', None, 'row_mean_g_by_col_mean')
        get_transformed_col_features(X_dataset, df_cols, 'med', 'g', None, 'row_med_g_by_col_med')
        get_transformed_col_features(X_dataset, df_cols, 'std', 'g', None, 'row_std_g_by_col_std')
        get_transformed_col_features(X_dataset, df_cols, 'min', 'g', None, 'row_min_g_by_col_min')
        get_transformed_col_features(X_dataset, df_cols, 'max', 'g', None, 'row_max_g_by_col_max')    
        # C features row stats / column stats
        get_transformed_col_features(X_dataset, df_cols, 'sum', 'c', None, 'row_sum_c_by_col_sum')
        get_transformed_col_features(X_dataset, df_cols, 'mean','c', None, 'row_mean_c_by_col_mean')
        get_transformed_col_features(X_dataset, df_cols, 'med', 'c', None, 'row_med_c_by_col_med')
        get_transformed_col_features(X_dataset, df_cols, 'std', 'c', None, 'row_std_c_by_col_std')
        get_transformed_col_features(X_dataset, df_cols, 'min', 'c', None, 'row_min_c_by_col_min')
        get_transformed_col_features(X_dataset, df_cols, 'max', 'c', None, 'row_max_c_by_col_max')
        # G features row stats / C features column stats
        get_transformed_col_features(X_dataset, df_cols, 'sum', 'g', 'c', 'row_sum_g_by_col_sum_c')
        get_transformed_col_features(X_dataset, df_cols, 'mean','g', 'c', 'row_mean_g_by_col_mean_c')
        get_transformed_col_features(X_dataset, df_cols, 'med', 'g', 'c', 'row_med_g_by_col_med_c')
        get_transformed_col_features(X_dataset, df_cols, 'std', 'g', 'c', 'row_std_g_by_col_std_c')
        get_transformed_col_features(X_dataset, df_cols, 'min', 'g', 'c', 'row_min_g_by_col_min_c')
        get_transformed_col_features(X_dataset, df_cols, 'max', 'g', 'c', 'row_max_g_by_col_max_c')
        # C features row stats / G features column stats
        get_transformed_col_features(X_dataset, df_cols, 'sum', 'c', 'g', 'row_sum_c_by_col_sum_g')
        get_transformed_col_features(X_dataset, df_cols, 'mean','c', 'g', 'row_mean_c_by_col_mean_g')
        get_transformed_col_features(X_dataset, df_cols, 'med', 'c', 'g', 'row_med_c_by_col_med_g')
        get_transformed_col_features(X_dataset, df_cols, 'std', 'c', 'g', 'row_std_c_by_col_std_g')
        get_transformed_col_features(X_dataset, df_cols, 'min', 'c', 'g', 'row_min_c_by_col_min_g')
        get_transformed_col_features(X_dataset, df_cols, 'max', 'c', 'g', 'row_max_c_by_col_max_g')    

    # Replace any infinite values generated with 0
    X_train.replace(to_replace=[np.inf, -np.inf, np.nan], value=0, inplace=True)
    X_test.replace(to_replace=[np.inf, -np.inf, np.nan], value=0, inplace=True)
    
    # Separate train data types
    X_train_numerical   = X_train.select_dtypes('number')
    X_train_categorical = X_train.select_dtypes('object')
    X_train_categorical = X_train_categorical.astype('category')
    # Separate test data types
    X_test_numerical   = X_test.select_dtypes('number')
    X_test_categorical = X_test.select_dtypes('object')
    X_test_categorical = X_test_categorical.astype('category')
    
    # Get colnames before scaling and feature selection
    num_cols = X_train_numerical.columns
    cat_cols = X_train_categorical.columns
    
    ## SCALING ##
    
    if scaler != None:
        if verbose == 1:
            print('APPLYING SCALER...')
            
        # Fit scaler on train, then transform train and test
        scaler.fit(X_train_numerical)
        X_train_numerical = scaler.transform(X_train_numerical)
        X_test_numerical  = scaler.transform(X_test_numerical)
        # Convert back to dataframe
        X_train_numerical = pd.DataFrame(X_train_numerical, index=train_idx, columns=num_cols)
        X_test_numerical  = pd.DataFrame(X_test_numerical, index=test_idx, columns=num_cols)
    
    
    ## FEATURE SELECTION ##
    
    # Subset to features selected during train process (and stored in corresponding parameters file)
    if verbose == 1:
        print('APPLYING FEATURE SELECTOR...')
        num_cols = X_train_numerical.shape[1]
                
    # Subset datasets to selected features only
    X_train_numerical = X_train_numerical[selected_features]
    X_test_numerical  = X_test_numerical[selected_features]
    if verbose == 1: 
        print(f'{num_cols - X_train_numerical.shape[1]} features removed in feature selection.')
        del num_cols

            
    ## PCA ##
    
    if pca != None:
        if verbose == 1:
            print('APPLYING PCA...')
            
        # Fit pca on train, then transform train and test
        pca.fit(X_train_numerical)
        X_train_numerical = pca.transform(X_train_numerical)
        X_test_numerical  = pca.transform(X_test_numerical)
        if verbose == 1:
            print(f'NUMBER OF PRINCIPAL COMPONENTS: {pca.n_components_}')
        # Convert numerical features into pandas dataframe and clean colnames
        X_train_numerical = pd.DataFrame(X_train_numerical, index=train_idx).add_prefix('pca_')
        X_test_numerical  = pd.DataFrame(X_test_numerical, index=test_idx).add_prefix('pca_')
    
    
    ## CATEGORICAL FEATURES ##
    
    # Get categorical and numerical column names
    num_cols = X_train_numerical.columns
    cat_cols = X_train_categorical.columns

    # Encode categorical features
    X_train_categorical = X_train_categorical.apply(lambda x: x.cat.codes)
    X_test_categorical  = X_test_categorical.apply(lambda x: x.cat.codes)

    
    # Concatenate transformed categorical features with transformed numerical features  
    X_train = pd.concat([X_train_categorical, X_train_numerical], axis=1)
    X_test  = pd.concat([X_test_categorical, X_test_numerical], axis=1)
    
    if verbose == 1:
        print(f'TRAIN SHAPE: \t\t{X_train.shape}')
        print(f'TEST SHAPE: \t\t{X_test.shape}')
    
    return X_train, X_test, num_cols, cat_cols
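
One caveat with encoding the categorical columns of train and test independently via `cat.codes`: the integer codes only line up if both sets contain the same categories. That should hold for `cp_type` and `cp_dose` here, but a defensive variant fixes the categories from the training data. The sketch below is illustrative only; `encode_with_train_categories` and the toy series are hypothetical.

```python
import pandas as pd

def encode_with_train_categories(train_col, test_col):
    """Encode both columns with the category order found in train.
    Values unseen in train are encoded as -1."""
    categories = sorted(train_col.unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes

train_dose = pd.Series(['D1', 'D2', 'D1'])
test_dose = pd.Series(['D2', 'D2'])  # toy test set that happens to lack 'D1'

train_codes, test_codes = encode_with_train_categories(train_dose, test_dose)
independent_codes = test_dose.astype('category').cat.codes
print(list(test_codes), list(independent_codes))
```

In this toy case the independent encoding maps `'D2'` to 0 in test but 1 in train, which is the mismatch the shared categories avoid.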

In [8]:
X_train = train_features
y_train = train_targets_scored.drop('sig_id', axis=1)

## 4.00 Test Predictions

Because we performed an in-fold Bayesian hyperparameter search for each model in the train pipeline, the model architecture is expected to be slightly different for each of the 10 folds. Consequently, we'll need to read in the csv of parameters to prepare the test prediction pipeline before we start to make the predictions (we can't feed the same dataset into each model, as differing transformations will be required per model).

In future, I'd like to automate this step. In order to do this, more work will need to be carried out on the train notebook, but due to time constraints and resource limits, we will have to move on for now without making those amendments. 
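
The parameter csv stores each `selected_features` value as the string representation of a Python list, so it has to be converted back into a list before use. One way to do this is `ast.literal_eval` from the standard library, which parses Python literals directly. A minimal sketch, where `parse_feature_list` and the sample string are hypothetical:

```python
import ast

def parse_feature_list(raw):
    """Turn a string such as "['a', 'b']" into a list of feature names,
    dropping the non-numerical cp_type / cp_dose columns."""
    features = ast.literal_eval(raw)
    if not isinstance(features, list):
        raise ValueError(f'Expected a list, got {type(features).__name__}')
    return [f for f in features if f not in ('cp_type', 'cp_dose')]

raw = "['row_std', 'cp_type', 'g-12', 'row_max_g_by_row_min']"
selected = parse_feature_list(raw)
print(selected)
```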

### 4.01 Prepare Prediction Pipeline

In [9]:
def make_test_predictions(X_test,
                          selected_features,
                          num_features,
                          num_components, 
                          use_embedding, 
                          seed, 
                          kfold,
                          num_folds,
                          X_train=X_train, 
                          y_train=y_train, 
                          model_name=MODEL_NAME,
                          submission=sample_submission):
    """
    Reads in X_test feature set, replicates the transformations applied at the 
    given kfold and seed (as per selected_features, num_components and use_embedding),
    and loads the model that was saved for that kfold and seed.
    
    Returns the model's predictions for X_test.
    """
    
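    # Retrieve the dataframe ids that were used in kfold during cross validation (using specified seed)
    # NOTE: without shuffle=True the splits do not depend on random_state, and recent
    # scikit-learn versions raise a ValueError for this combination of arguments.
    # The arguments here should match those used in the train notebook.
    kf = KFold(n_splits=num_folds, random_state=seed)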
    for fold, (tdx, vdx) in enumerate(kf.split(X_train, y_train)):
        if fold == kfold:
            # End the loop when it gets to kfold so we can retain tdx for kfold
            break
            
    # Instantiate PCA method
    pca = PCA(n_components=num_components, random_state=seed)
        
    # Subset X_train and y_train as per what occurred during cross validation for kfold and seed
    X_train, y_train = X_train.iloc[tdx, :], y_train.iloc[tdx, :]
    
    # Transform data - again to replicate what occurred at kfold and seed
    X_train, X_test, num_cols, cat_cols = transform_feature_set(X_train           = X_train, 
                                                                X_test            = X_test, 
                                                                y_train           = y_train, 
                                                                selected_features = selected_features,
                                                                num_features      = num_features,
                                                                pca               = pca,
                                                                num_components    = num_components,
                                                                seed              = seed)
        
    # Further transformations if an embedding was used at kfold and seed
    if use_embedding == True:
        # Separate data to fit into embedding and numerical input layers
        X_train = [np.absolute(X_train[i]) for i in cat_cols] + [X_train[num_cols]]
        X_test = [np.absolute(X_test[i]) for i in cat_cols] + [X_test[num_cols]]
            
    # Get the model name and file path for kfold and seed, then load that model
    model_name = model_name + '_seed' + str(seed)
    model_path = 'models/' + model_name + '/' + model_name + '_' + str(kfold) + '.h5'
    model = load_model(model_path)
    
    # Make test predictions using the model created at kfold and seed
    preds = model.predict(X_test)
            
    return(preds)
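
To see which rows the fold lookup above recovers, the sketch below re-creates unshuffled k-fold indices with NumPy. It assumes the train notebook also split without shuffling; with `shuffle=True` the indices would depend on the seed and differ from these. `unshuffled_fold_indices` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def unshuffled_fold_indices(n_samples, n_splits, fold):
    """Train/validation row positions for one fold of an unshuffled k-fold split.
    Folds are consecutive blocks; the first n_samples % n_splits folds get one extra row."""
    indices = np.arange(n_samples)
    val_idx = np.array_split(indices, n_splits)[fold]
    train_idx = np.setdiff1d(indices, val_idx)
    return train_idx, val_idx

train_idx, val_idx = unshuffled_fold_indices(n_samples=10, n_splits=3, fold=1)
print(train_idx, val_idx)
```

Only the training positions matter in the prediction pipeline, since the scaler and PCA are re-fitted on exactly those rows.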

In [10]:
# Compile model parameters for all models produced
parameter_files = os.listdir('final_classifier_parameters')

parameter_frames = []
# Skip anything that can't be read as a parameter csv (e.g. the .ipynb_checkpoints directory)
for file in parameter_files:
    try:
        parameter_frames.append(pd.read_csv(f'final_classifier_parameters/{file}'))
    except (ValueError, OSError):
        print(f'Passing file {file}')

# Stack the per-file parameters into one dataframe with a clean 0..n-1 index
model_parameters = pd.concat(parameter_frames, ignore_index=True)

# Print model parameters
model_parameters

Passing file .ipynb_checkpoints


| | kfold | selected_features | num_features | num_components | use_embedding | seed |
|---|---|---|---|---|---|---|
| 0 | 0 | ['row_std_by_col_std', 'row_max_g_by_row_min',... | 810 | 233 | 1 | 14 |
| 1 | 1 | ['row_std_by_col_std', 'row_max_g_by_row_min',... | 757 | 194 | 1 | 14 |
| 2 | 2 | ['row_max_g_by_row_min', 'row_std_by_col_std',... | 810 | 233 | 1 | 14 |
| 3 | 3 | ['row_std_by_col_std', 'row_std', 'row_max_g_b... | 810 | 233 | 1 | 14 |
| 4 | 4 | ['row_max_g_by_row_min', 'row_std', 'row_std_b... | 500 | 200 | 1 | 14 |
| 5 | 5 | ['row_max_g_by_row_min', 'row_std_g', 'row_std... | 500 | 200 | 1 | 14 |
| 6 | 6 | ['row_max_g_by_row_min', 'row_std_by_cp_time',... | 810 | 233 | 1 | 14 |
| 7 | 7 | ['row_max_g_by_row_min', 'row_std', 'row_std_b... | 810 | 233 | 1 | 14 |
| 8 | 8 | ['row_std', 'row_std_by_cp_time', 'row_min_by_... | 717 | 274 | 1 | 14 |
| 9 | 9 | ['row_std_by_col_std', 'row_max_g_by_row_min',... | 810 | 233 | 1 | 14 |


### 4.02 Make Test Predictions

In [11]:
# Make final classifier test predictions for all models created during CV for all seeds
preds = []
for idx in tqdm(model_parameters.index):
    
    # Get number of folds for each seed - add 1 because of zero indexing
    # (idx is an index label, so use .loc rather than positional .iloc)
    seed      = model_parameters.loc[idx, 'seed']
    num_folds = model_parameters.loc[model_parameters.seed == seed, 'kfold'].max() + 1
    
    # Convert string list representation to list of strings for selected_features
    selected_features = model_parameters.loc[idx, 'selected_features'].replace("'", '"')
    selected_features = json.loads(selected_features)
    # Remove non-numerical features from selected_features list
    if 'cp_type' in selected_features:
        selected_features.remove('cp_type')
    if 'cp_dose' in selected_features:
        selected_features.remove('cp_dose')
    
    # Make test predictions
    fold_preds = make_test_predictions(
        X_test            = test_features,
        selected_features = selected_features,
        num_features      = model_parameters.loc[idx, 'num_features'],
        num_components    = model_parameters.loc[idx, 'num_components'],
        use_embedding     = model_parameters.loc[idx, 'use_embedding'],
        seed              = seed,
        kfold             = model_parameters.loc[idx, 'kfold'],
        num_folds         = num_folds
    )
    preds.append(fold_preds)

  0%|          | 0/10 [00:00<?, ?it/s]

Making predictions...


100%|██████████| 10/10 [08:36<00:00, 51.52s/it]
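
The `selected_features` column stores each feature list as the string form of a Python list. Swapping the quote characters and calling `json.loads`, as in the cell above, works as long as no feature name contains a quote character. As a hedged alternative, the standard library's `ast.literal_eval` parses that string form directly. The helper name `parse_feature_list` and the example string below are illustrative only.

```python
import ast

def parse_feature_list(raw, drop=('cp_type', 'cp_dose')):
    """Parse a stringified Python list and drop the non-numerical features."""
    features = ast.literal_eval(raw)
    return [feature for feature in features if feature not in drop]

# Hypothetical value in the same format as the selected_features column
raw = "['row_std_by_col_std', 'cp_type', 'row_max_g_by_row_min', 'cp_dose']"

parsed = parse_feature_list(raw)
print(parsed)
# → ['row_std_by_col_std', 'row_max_g_by_row_min']
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for strings read back from disk.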


In [13]:
# Ensemble predictions and generate submission
for idx, fold_preds in enumerate(preds):
    # Convert fold_preds to dataframe
    fold_preds = pd.DataFrame(fold_preds, columns=sample_submission.columns[1:])

    # Update the submission for the first round of preds
    if idx == 0:
        # Add sig_id feature to fold_preds
        fold_preds['sig_id'] = sample_submission['sig_id']
        sample_submission.update(fold_preds)
    # Add to the preds following the first round of preds
    else:
        sample_submission.iloc[:, 1:] = sample_submission.iloc[:, 1:] + fold_preds

# Divide summed preds by the number of models ensembled (all folds across all seeds)
sample_submission.iloc[:, 1:] = sample_submission.iloc[:, 1:] / len(preds)

sample_submission.head()

| | sig_id | 5-alpha_reductase_inhibitor | 11-beta-hsd1_inhibitor | acat_inhibitor | acetylcholine_receptor_agonist | acetylcholine_receptor_antagonist | acetylcholinesterase_inhibitor | adenosine_receptor_agonist | adenosine_receptor_antagonist | adenylyl_cyclase_activator | ... | tropomyosin_receptor_kinase_inhibitor | trpv_agonist | trpv_antagonist | tubulin_inhibitor | tyrosine_kinase_inhibitor | ubiquitin_specific_protease_inhibitor | vegfr_inhibitor | vitamin_b | vitamin_d_receptor_agonist | wnt_inhibitor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id_0004d9e33 | 0.001691 | 0.001426 | 0.002286 | 0.017347 | 0.028212 | 0.007495 | 0.002672 | 0.004651 | 0.001185 | ... | 0.001537 | 0.002621 | 0.0031 | 0.003465 | 0.002033 | 0.001523 | 0.004193 | 0.002999 | 0.00371 | 0.002713 |
| 1 | id_001897cda | 0.000917 | 0.002057 | 0.002178 | 0.004902 | 0.004709 | 0.004658 | 0.004267 | 0.006555 | 0.004035 | ... | 0.002343 | 0.001881 | 0.003351 | 0.002179 | 0.017245 | 0.002102 | 0.032521 | 0.001765 | 0.003567 | 0.002584 |
| 2 | id_002429b5b | 0.001654 | 0.001826 | 0.001684 | 0.007391 | 0.008268 | 0.00149 | 0.003062 | 0.004547 | 0.00179 | ... | 0.002616 | 0.001503 | 0.003741 | 0.007312 | 0.005048 | 0.001836 | 0.021108 | 0.002582 | 0.004846 | 0.002329 |
| 3 | id_00276f245 | 0.001403 | 0.001257 | 0.001412 | 0.013701 | 0.006676 | 0.006088 | 0.00256 | 0.005352 | 0.001283 | ... | 0.001351 | 0.001475 | 0.004 | 0.04955 | 0.006395 | 0.00109 | 0.006839 | 0.002154 | 0.003174 | 0.003203 |
| 4 | id_0027f1083 | 0.002768 | 0.002004 | 0.002412 | 0.018689 | 0.024226 | 0.003102 | 0.003743 | 0.004358 | 0.001314 | ... | 0.001216 | 0.001106 | 0.004042 | 0.001914 | 0.001901 | 0.00159 | 0.004227 | 0.002139 | 0.00195 | 0.002504 |
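
The ensembling loop above is an equal-weight average: every model's predictions are summed and the total is divided by the number of models. As a small sanity check, the sketch below shows that this is the same as stacking the prediction arrays and taking the mean along the model axis with `np.mean`. The toy arrays are made up for illustration and are not real model outputs.

```python
import numpy as np

# Hypothetical predictions from three models: 2 test rows x 2 targets each
preds = [
    np.array([[0.1, 0.2], [0.3, 0.4]]),
    np.array([[0.3, 0.2], [0.1, 0.4]]),
    np.array([[0.2, 0.2], [0.2, 0.4]]),
]

# Running sum followed by a division, mirroring the loop in the cell above
summed = np.zeros_like(preds[0])
for fold_preds in preds:
    summed = summed + fold_preds
loop_mean = summed / len(preds)

# The same equal-weight average in a single call
stacked_mean = np.mean(preds, axis=0)

print(np.allclose(loop_mean, stacked_mean))
```

If some models later deserve more weight than others, `np.average` with its `weights` argument would be a natural extension of the single-call form.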


In [None]:
sample_submission.to_csv('submissions/submission.csv', index=False)