# XGBoost 🚀 with feature engineering and feature selection.

My approach to this dataset was:

- Forget about the time-series structure (i.e. the `step` variable) and let the algorithm hopefully find some structure in the data
- Use the subject as a group for cross-validation model selection (hyperparameter tuning)
- Use XGBoost (I originally considered a neural network but XGBoost came out better)
- Pivot the `step` data into columns, since we are interested in predicting the `state` for each `sequence` rather than for each `sequence`-`step` combination
- Aggregate sensor data across each sequence and across each subject
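
To make the pivot step in the list above concrete, here is a minimal sketch on a made-up toy frame (the values and the two-sensor layout are assumptions for illustration, not the competition data). It mirrors the melt-then-pivot pattern used later in this notebook and ends with one row per sequence.

```python
import pandas as pd

# Toy long-format data: 2 sequences x 2 steps x 2 sensors (values are made up)
toy = pd.DataFrame({
    "sequence":  [0, 0, 1, 1],
    "subject":   [7, 7, 9, 9],
    "step":      [0, 1, 0, 1],
    "sensor_00": [0.1, 0.2, 0.3, 0.4],
    "sensor_01": [1.0, 2.0, 3.0, 4.0],
})

# Melt the sensors into rows, then build one column name per (step, sensor) pair
toy_long = toy.melt(id_vars=["sequence", "subject", "step"])
toy_long["step_sensor"] = (
    toy_long["step"].map(lambda s: "step_%02d" % s) + "_" + toy_long["variable"]
)

# One row per sequence, one column per step/sensor combination
toy_wide = toy_long.pivot(
    index=["sequence", "subject"], columns="step_sensor", values="value"
).reset_index()

print(toy_wide)
```

With 60 steps and 13 sensors, the same pattern would give 780 step/sensor columns per sequence.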

I take advantage of the GPU accelerator resources available to us using the option `tree_method='gpu_hist'` in the XGBoost constructor.

Useful notebooks:

https://www.kaggle.com/code/cv13j0/tps-apr-2022-xgboost-model

https://www.kaggle.com/code/hasanbasriakcay/tpsapr22-fe-pseudo-labels-bi-lstm

https://www.kaggle.com/code/tyrionlannisterlzy/xgboost-dnn-ensemble-lb-0-980

https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense

https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/discussion/318527

Thanks to everyone who entered this competition and shared notebooks and ideas. I continue to learn so much here at Kaggle.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# PARAMETERS


N_FEATURES = 200
N_ESTIMATORS = 500

        

# Data wrangling

## Load data and pivot (put step variable as a column combined with the sensor)

In [None]:
%%time 


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from scipy.stats import kurtosis as kurt

import matplotlib.pyplot as plt
import seaborn as sns

if 'train' not in locals():
    print('loading data', end='...')
    train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/train.csv')
    test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/test.csv')
print('')

ntrain = train.shape[0]
train_sequences = train['sequence']
test_sequences = test['sequence']
both = pd.concat([train, test])

print('pivoting data', end='...')
both_long = both.melt(id_vars = ['sequence','subject','step'])
both_long['step_sensor'] = both_long['step'].map(lambda x: 'step_%02d' % x) + '_' + both_long['variable']
both_wide = both_long.pivot(index   = ['sequence','subject'], 
                            columns = 'step_sensor',
                            values  = 'value')
both_wide = both_wide.reset_index()
print('')

metrix = ['mean','max','min','var','median','skew',kurt]



## Feature engineering: 

Aggregate the sensor data by sequence/subject and by subject alone, and introduce a [subject count variable](https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense).
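
As a rough illustration of the aggregation and the count feature, the sketch below uses a made-up toy frame with a single sensor (the data and the names `per_subject`, `subject_count`, and `features` are hypothetical). It shows how the MultiIndex columns produced by `agg` can be flattened and how a per-subject sequence count can be attached.

```python
import pandas as pd

# Toy data: subject 7 has two sequences, subject 9 has one (values are made up)
toy = pd.DataFrame({
    "sequence":  [0, 0, 1, 1, 2, 2],
    "subject":   [7, 7, 7, 7, 9, 9],
    "sensor_00": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0],
})

# Per-subject summary statistics; the result has MultiIndex columns
per_subject = toy.groupby("subject")[["sensor_00"]].agg(["mean", "max"])
# Flatten ('sensor_00', 'mean') -> 'sensor_00_mean'
per_subject.columns = ["_".join(col) for col in per_subject.columns]

# Number of distinct sequences recorded for each subject
subject_count = (
    toy.groupby("subject")["sequence"].nunique().rename("subject_count")
)

features = per_subject.join(subject_count)
print(features)
```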

In [None]:

print('Aggregating sensor by subject and sequence', end='...')
t1 = both.filter(regex='sensor_|subject|sequence', axis=1).\
    groupby(['sequence','subject']).\
    aggregate(metrix)
# Flatten multiindex column names; the prefix marks these as per-sequence statistics
t1.columns = ["sequence_"+"_".join(x) for x in t1.columns]
print('')

print('Aggregating sensor by subject only', end='...')
t2 = both.filter(regex='sensor_|subject', axis=1).\
    groupby(['subject']).\
    aggregate(metrix)
# Flatten multiindex column names; the prefix marks these as per-subject statistics
t2.columns = ["subject_"+"_".join(x) for x in t2.columns]
print('')

print('merging', end='...')
#Merge

both_all = both_wide.merge(t1, right_index=True, left_on = ['sequence', 'subject'])
both_all = both_all.merge(t2, right_index = True, left_on = 'subject')

# Now add subject count variable (number of sequences per subject in the combined data)
# Naming the Series before to_frame() gives the same column name across pandas versions
count = both_all['subject'].value_counts().rename('subject_count').to_frame()

both_all = both_all.merge(count, left_on='subject', right_index=True)

print('')



## Separate the data, making sure it stays in the correct order

I'm pretty sure pivoting the data reorders the subjects and sequences so I merge the data back with the retained sequences from test and training data. This ensures that I properly separate the sequences from the training and test datasets.
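
A small sketch of the same idea, using membership tests instead of the merge used in this notebook (so this is an alternative, not the method below). The IDs and the `wide` frame are made up, and it assumes that training and test sequence IDs never overlap.

```python
import pandas as pd

train_ids = pd.Series([2, 0, 1])  # made-up training sequence IDs
test_ids = pd.Series([4, 3])      # made-up test sequence IDs

# Combined wide table, sorted by sequence as it would be after a pivot
wide = pd.DataFrame({
    "sequence": [0, 1, 2, 3, 4],
    "feature":  [10, 11, 12, 13, 14],
})

# Split the combined table back into its two parts by sequence ID
train_part = wide[wide["sequence"].isin(train_ids)]
test_part = wide[wide["sequence"].isin(test_ids)]

# Sanity checks: no sequence is in both parts, and none was lost
no_overlap = set(train_part["sequence"]).isdisjoint(test_part["sequence"])
nothing_lost = len(train_part) + len(test_part) == len(wide)
print(no_overlap, nothing_lost)
```

Checks like these are cheap and catch a mis-split before any model is fitted.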

In [None]:

newtrain = pd.DataFrame(np.unique(train_sequences)).merge(both_all, left_on=0, right_on='sequence')
newtrain = newtrain.drop(0, axis=1)

newtest = pd.DataFrame(np.unique(test_sequences)).merge(both_all, left_on=0, right_on='sequence')
newtest = newtest.drop(0, axis=1)

labels = pd.read_csv("/kaggle/input/tabular-playground-series-apr-2022/train_labels.csv")

newtrain_with_labels = newtrain.merge(labels, how = 'left', on = 'sequence')
train_sub = newtrain_with_labels['subject']
train_seq = newtrain_with_labels['sequence']
ytrain = newtrain_with_labels['state']
Xtrain = newtrain_with_labels.drop(['subject','sequence','state'], axis=1)

test_seq = newtest['sequence']
Xtest = newtest.drop(['subject', 'sequence'], axis=1)

Xtrain.head()
Xtrain.shape
Xtest.head()
Xtest.shape

# Feature selection

Reducing the number of variables (columns) makes fitting and hyperparameter tuning quicker. Also, [AmbrosM](https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/discussion/318527) claims that passing too many variables into the model reduces model performance. I use `SelectKBest` with `f_classif`, which ranks each variable by its ANOVA F-statistic (i.e. it keeps the variables with the largest between-state differences relative to their within-state spread). There are probably (definitely) more sophisticated ways to select variables, but I ran out of time to explore the differences between methods.
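
To show what the filter does, here is a minimal sketch on synthetic data (the column names and the data are made up for illustration). One column is shifted by the label and two are pure noise, so the F-test should rank the shifted column first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)  # 50 samples per state

# One column whose mean depends on the state, two that ignore it
X = pd.DataFrame({
    "informative": y * 3.0 + rng.normal(size=100),
    "noise_a": rng.normal(size=100),
    "noise_b": rng.normal(size=100),
})

# Keep the single column with the largest ANOVA F-score
skb = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = X.columns[skb.get_support()]

print(dict(zip(X.columns, skb.scores_.round(1))))
print(list(selected))
```

Note that `f_classif` scores each column on its own, so it cannot detect variables that only matter in combination with others.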

In [None]:
%%time
print(f'Reduce {Xtrain.shape[1]:d} columns down to {N_FEATURES:d}')
# Select (filter) important columns

from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.feature_selection import SequentialFeatureSelector, SelectKBest,f_classif
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GroupKFold

filter_columns = True
if filter_columns:
    import time
    tt = time.time()
    # Keep the N_FEATURES columns with the largest ANOVA F-scores
    skb = SelectKBest(score_func = f_classif,
                      k = N_FEATURES)
    skb.fit(Xtrain, ytrain)
    print(f'Selection took {time.time()-tt:.1f} s')
    support = skb.get_support()  # boolean mask over the columns of Xtrain
    print(f'Kept {int(support.sum()):d} of {len(support):d} columns')
else:
    # Keep every column
    support = np.ones(Xtrain.shape[1], dtype=bool)


print('Selected variables:')
print(Xtrain.columns[support])
    




# Fit an XGB model

I use GridSearchCV to tune a few hyperparameters of the XGBoost model. Cross-validation proceeds with [GroupKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html) using `subject` as the group. This makes sense because sensor and state information should be similar for the same subject, so we want all sequences from a subject kept together in one fold; that way [no subject appears in both the training and validation parts of a split](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualize-cross-validation-indices-for-many-cv-objects).
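
The grouping behaviour can be checked on a toy example. The sketch below uses made-up data with six hypothetical subjects and records, for each split, which groups occur on both the training and validation side.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)   # 12 made-up samples
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])  # hypothetical subject IDs

cv = GroupKFold(n_splits=3)
overlaps = []
for train_idx, valid_idx in cv.split(X, y, groups=groups):
    # Groups that occur on both sides of this split (should be none)
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(shared)
    print("validation subjects:", sorted(set(groups[valid_idx])), "shared:", shared)
```

Every `shared` set should be empty, which is what keeps the validation score from being inflated by subjects the model has already seen.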

In [None]:
%%time 
# Fit an XGB model

from sklearn.model_selection import GridSearchCV, GroupKFold


cv = GroupKFold(n_splits = 3) # Subject is used as group, passed through the fit method below...

from xgboost import XGBClassifier

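# Binary classification, so monitor log loss rather than RMSE
xgbc = XGBClassifier(n_estimators = N_ESTIMATORS, eval_metric='logloss', tree_method='gpu_hist',
                    use_label_encoder=False)
# learning_rate is the scikit-learn wrapper's name for eta; 0 is dropped from the
# grid because a zero learning rate never updates the model
gscv = GridSearchCV(estimator = xgbc,
                    param_grid = {'learning_rate': [0.1,0.2,0.3,0.5],
                                 'max_depth': [3,6,9],
                                 'gamma': [0,0.5,1,1.5,2]},
                    scoring = 'roc_auc', # AUC from predicted probabilities, not hard labels
                    cv = cv,
                    n_jobs = -1, verbose = 1, refit=True)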


gscv.fit(Xtrain.loc[:,support], ytrain,
         groups=train_sub)
print(gscv.best_estimator_)
print(gscv.best_score_)




# Make predictions

In [None]:

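# Predicted probability of state 1 rather than hard labels: AUC (the score used
# in the grid search above) ranks sequences by predicted score
ypred = gscv.best_estimator_.predict_proba(Xtest.loc[:,support])[:, 1]

# Submission file:

In [None]:
submission = pd.DataFrame({'sequence':newtest['sequence'], 'state': ypred})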

submission
submission.to_csv('submission.csv', index=False)

# Check feature importance

Fitted XGBoost models provide the `feature_importances_` attribute. [Tyrion Lannister-lzy](https://www.kaggle.com/code/tyrionlannisterlzy/xgboost-dnn-ensemble-lb-0-980) provided a nice function that plots these easily, so I borrowed/stole it (thanks Tyrion!).

In [None]:
# Function courtesy of Tyrion Lannister-lzy:
# https://www.kaggle.com/code/tyrionlannisterlzy/xgboost-dnn-ensemble-lb-0-980

def plot_feature_importance(importance, names, model_type, max_features = 10):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance and keep the top max_features
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df.head(max_features)

    #Define size of bar plot
    plt.figure(figsize=(8,6))

    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

import seaborn as sns
import matplotlib.pyplot as plt
plot_feature_importance(gscv.best_estimator_.feature_importances_,Xtrain.loc[:,support].columns,
                        'XG BOOST ', max_features = 25)


