## Introduction
This notebook is an experiment.  
I want to find out how effective lag and lead features are for prediction.  
I would be happy if it is useful for someone's modeling.  
  
This is my first notebook in Python.  
I learned to use Python after joining Kaggle.  
Thanks to Kagglers!

### Update
Added null importance. To save computation time, only fold 1 is calculated.  
Corrected typographical errors and wording.

In [None]:
import os
import gc
import numpy as np
import pandas as pd
import time

import seaborn as sns
import matplotlib.pyplot as plt

import lightgbm as lgb
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold
from sklearn.metrics import cohen_kappa_score, accuracy_score, mean_squared_error, f1_score

import warnings
warnings.filterwarnings('ignore')

import eli5
from eli5.sklearn import PermutationImportance

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
train = pd.read_feather("../input/ion-switching-expriment-lag-and-lead-features/train_lag30.feather")
test = pd.read_feather("../input/ion-switching-expriment-lag-and-lead-features/test_lag30.feather")
sample_submission = pd.read_csv("../input/liverpool-ion-switching/sample_submission.csv")

In [None]:
print(train.shape)
print(test.shape)

## Features
Lag, lead, and a minimal set of other features were created.  
* Batch: 50s, 10s, 5s, 0.5s, 0.05s (not used in the model)
* mean, sd: computed for the 5s, 0.5s, and 0.05s batches.
* signal lag: 30 features. NAs are filled with the first value of the 10s batch.
* signal lead: 30 features. NAs are filled with the last value of the 10s batch.
* rolling mean: centered over the lag and lead range (61-point moving average). NAs are filled with the mean of the signal.

[I created the features in R.](https://www.kaggle.com/kei96kag/ion-switching-expriment-1-lag-and-lead-features)
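
For readers who work in Python, here is a minimal pandas sketch of the lag and lead construction described above. The original features were built in R, so this is not the code that produced the feather files; the column names `batch` and `signal`, the helper `add_lag_lead`, and the toy data are assumptions for illustration.

```python
import pandas as pd


def add_lag_lead(df, n_shift=2, batch_col="batch", signal_col="signal"):
    """Add lag/lead columns of the signal, computed within each batch.

    Hypothetical helper: NAs at the start of a batch are filled with the
    first value of that batch, NAs at the end with the last value.
    """
    out = df.copy()
    grouped = out.groupby(batch_col)[signal_col]
    first = grouped.transform("first")  # first signal value of each batch
    last = grouped.transform("last")    # last signal value of each batch
    for k in range(1, n_shift + 1):
        out[f"lag{k}"] = grouped.shift(k).fillna(first)
        out[f"lead{k}"] = grouped.shift(-k).fillna(last)
    return out


# Toy data: two batches of three points each.
demo = pd.DataFrame({
    "batch": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
demo_feat = add_lag_lead(demo, n_shift=2)
print(demo_feat)
```

Shifting inside `groupby` keeps values from leaking across batch boundaries, which is the reason the NAs are filled per batch.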

In [None]:
train.columns

## Validation

In [None]:
## Preparation

drop_cols = [
    'time','open_channels'
    ]

train['open_channels'] = train['open_channels'].astype(int)

X = train.drop(drop_cols, axis = 1)
X_test = test.drop(drop_cols, axis = 1)

y = train['open_channels']

print(f'Number of features = {len(X.columns)}')

del train
del test
gc.collect()


In [None]:
#--------------------------------------------------------
# Parameter Setting
#--------------------------------------------------------

# I also use 'tracking'. Thanks, Rob!!
#
TOTAL_FOLDS = 5
MODEL_TYPE = 'LGBM'
SHUFFLE = True
NUM_BOOST_ROUND = 2_500
EARLY_STOPPING_ROUNDS = 50
VERBOSE_EVAL = 500
RANDOM_SEED = 31452

##
calc_feature_imp = True
calc_perm_imp = True  ## !! Calculation takes time.

# params for validation
params = {
    'learning_rate': 0.03, 
    'max_depth': -1,
    'num_leaves': 2**8+1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.82,
    'bagging_freq': 0,
    'n_jobs': 8,
    'random_state': 31452,
    'metric': 'rmse',
    'objective' : 'regression'
    }

params2 = {
    #'eval_metric': 'rmse', #This did not work.4/26
    'n_estimators': 50000,
    'early_stopping_rounds': 50
    }

params.update(params2)


Permutation importance takes a long time to calculate, so this notebook computes it only once (fold 1).
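
As a reminder of what permutation importance measures, here is a small NumPy-only sketch: shuffle one feature column of the validation data at a time and record how much the error grows. It uses a least-squares model on synthetic data as a stand-in, not the LightGBM model or the eli5 call used in this notebook, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): column 0 matters most, column 2 is pure noise.
n = 2000
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

# Simple train / validation split.
X_tr, X_va = X_demo[:1500], X_demo[1500:]
y_tr, y_va = y_demo[:1500], y_demo[1500:]

# Stand-in model: ordinary least squares without an intercept.
coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)


def mse(X_, y_):
    """Mean squared error of the fitted linear model."""
    return float(np.mean((X_ @ coef - y_) ** 2))


baseline = mse(X_va, y_va)

# Permutation importance = increase in validation error after shuffling a column.
perm_importance = []
for j in range(X_va.shape[1]):
    X_perm = X_va.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    perm_importance.append(mse(X_perm, y_va) - baseline)

print(perm_importance)
```

Each feature needs its own round of predictions on the validation data, which is why the cost grows with the number of features.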

In [None]:
#--------------------------------------------------------
# Validation
#--------------------------------------------------------

fold = StratifiedKFold(n_splits=TOTAL_FOLDS, shuffle=SHUFFLE, random_state=RANDOM_SEED)

df_feature_importance = pd.DataFrame()
oof_pred = np.ones(len(X))*-1
models = []


for i, (train_index, valid_index) in enumerate(fold.split(X, y)):
    print(f'Fold-{i+1} started at {time.ctime()}')
    X_train, X_valid = X.iloc[train_index, :], X.iloc[valid_index, :]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    model = lgb.LGBMRegressor()
    model.set_params(**params)
    model.fit(X_train, y_train,
              eval_set = [(X_train, y_train),(X_valid, y_valid)],
              verbose = 500
              )
    
    y_pred_valid = model.predict(X_valid, num_iteration = model.best_iteration_)
    y_pred_valid = np.round(np.clip(y_pred_valid, 0, 10)).astype(int)
    f1_valid = f1_score(y_valid, y_pred_valid, average = 'macro')
    
    oof_pred[valid_index] = y_pred_valid
    #len(oof_pred[oof_pred >= 0])
    print(f'Fold-{i+1} f1 score = {f1_valid:0.5f}')
    
    models.append(model)
    
    #del X_train, y_train
    gc.collect()
    
    ## Feature Importance
    
    if i == 0: # fold 1 only, to save time
        if calc_feature_imp:
            #model.feature_importances_
            fold_importance = pd.DataFrame()
            fold_importance['feature'] = X.columns.tolist()
            fold_importance['fold'] = i + 1
            fold_importance['split_importance'] = model.booster_.feature_importance(importance_type='split')
            fold_importance['gain_importance'] = model.booster_.feature_importance(importance_type='gain')


        if calc_feature_imp and calc_perm_imp:
            # from corochann's notebooks
            print("Calculating permutation importance... ")
            perm = PermutationImportance(model, random_state = 1, cv = 'prefit')
            perm.fit(X_valid, y_valid)
            fold_importance['permutation_importance'] = perm.feature_importances_

        if calc_feature_imp:
            df_feature_importance =\
                pd.concat([df_feature_importance, fold_importance], axis= 0)

    del X_valid, y_valid
    gc.collect()
    
    break # <- !!! fold1 only, add 5/2

In [None]:
# Only fold 1 is calculated, so the OOF score is commented out here.

#oof_f1 = f1_score(y, oof_pred, average = "macro")
#print(f'f1_score_oof = {oof_f1:0.5f}')

There was almost no difference in the score between StratifiedKFold and KFold.  
I would also like to consider GroupKFold.
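
A minimal sketch of how GroupKFold could be applied, assuming the rows are grouped by recording batch. The grouping choice, the batch size, and the `groups` array are assumptions for illustration; this notebook does not define them.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (assumption): 1000 rows split into 10 consecutive batches of 100 rows.
n_rows = 1000
X_demo = np.arange(n_rows, dtype=float).reshape(-1, 1)
y_demo = np.zeros(n_rows)
groups = np.arange(n_rows) // 100  # hypothetical batch id for each row

gkf = GroupKFold(n_splits=5)
splits = list(gkf.split(X_demo, y_demo, groups=groups))

for fold_no, (train_index, valid_index) in enumerate(splits, start=1):
    valid_groups = sorted(set(groups[valid_index]))
    print(f"Fold-{fold_no}: validation batches = {valid_groups}")
```

Unlike StratifiedKFold with shuffling, every batch falls entirely in either the training part or the validation part of a fold, so neighbouring points of the same batch cannot appear on both sides.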

Let's display the calculated feature importances side by side.  
I wrote the plotting code with reference to Andrew's notebook.

In [None]:
def plot_feature_importance_by3(df_feature_importance, null_imp = False):
    
    fig, ax = plt.subplots(1, 3, figsize = (18, 14))
    plt.rcParams["font.size"] = 15

    if not null_imp:
        col_name = ["split_importance", "gain_importance", "permutation_importance"]
    else:
        col_name = ["p_by_split_distribution", "p_by_gain_distribution", "dummy"]


    for i, importance_name in enumerate(col_name):
        try:
            cols = (df_feature_importance[["feature", importance_name]]\
                    .groupby("feature").mean()\
                    .sort_values(by = importance_name, ascending=False)[:100].index)

            best_features = df_feature_importance.loc[df_feature_importance.feature.isin(cols), ['feature', importance_name]]
            sns.barplot(x = importance_name, y ="feature", ax=ax[i], 
                        data = best_features.sort_values(by = importance_name, ascending=False))
            plt.tight_layout()
            ax[i].set_title(f'{importance_name} (averaged over folds)')
        except KeyError:
            # Skip importance columns that were not calculated (e.g. "dummy").
            pass
    plt.tight_layout()

### Feature importance: All features

In [None]:
#Name shortening
st1 = df_feature_importance[df_feature_importance['feature'].str.contains("_10s")]['feature'].str.replace("_10s", "")
df_feature_importance.loc[df_feature_importance['feature'].str.contains("_10s"), 'feature'] = st1

plot_feature_importance_by3(df_feature_importance)

### Feature importance: Lag features

In [None]:
df_lags = df_feature_importance[df_feature_importance['feature']\
                      .str.startswith("lag")]
plot_feature_importance_by3(df_lags)

### Feature importance: Lead features

In [None]:
df_leads = df_feature_importance[df_feature_importance['feature']\
                      .str.startswith("lead")]
plot_feature_importance_by3(df_leads)

By split importance, all the lag and lead features appear to be used, but by gain importance only lag1 and lead1 seem to contribute to accuracy.  
By permutation importance, there seems to be a contribution up to about lag6 and lead6.

## Prediction
Not done here.

In [None]:
#--------------------------------------------------------
# Test Prediction and Submission
#--------------------------------------------------------  

flg_submit = False

if flg_submit:
    
    test_preds = pd.DataFrame()
    for i, model in enumerate(models):
        print(f'Predicting with model {i+1}...')
        test_pred = model.predict(X_test, num_iteration = model.best_iteration_)
        test_pred = np.round(np.clip(test_pred, 0, 10)).astype(int)
        test_preds[f'Fold{i+1}'] = test_pred

    sample_submission['open_channels'] = test_preds.median(axis=1).astype(int)
    #sample_submission.open_channels.value_counts()

    save_sub_name = 'submission.csv'

    sample_submission.to_csv(save_sub_name,
            index=False,
            float_format='%0.4f')

## CV and LB score

I ran additional calculations and got the following results.  
Based on the permutation importance results, I compared models with and without lag1-6 and lead1-6.
  
### * 30 lag + 30 lead features + minimal features
**CV score: 0.93589 ---> LB score: 0.937** 
### * 6 lag + 6 lead features + minimal features
**CV score: 0.93584 ---> LB score: 0.937**
### * 24 lag (without lag1-6) + 24 lead (without lead1-6) + minimal features
**CV score: 0.93084 ---> LB score: 0.933**

Lag7, lead7, and above do not seem to affect accuracy whether they are included or not.  
However, lag1-6 and lead1-6 do seem to affect prediction accuracy.  
The smaller the lag or lead value, the more it seems to contribute to accuracy.  
I plan to build a model using lag1-6 (or fewer) and lead1-6 (or fewer).
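
A small sketch of how the reduced feature set could be selected by column name. The column names below (`lag1_10s`, `lead1_10s`, `signal`, `mean_5s`) and the helper `keep_column` are assumptions; adjust the pattern to the real names shown by `train.columns`.

```python
import re

# Hypothetical column names, modelled on the "_10s" suffix used in this notebook.
all_cols = (
    [f"lag{k}_10s" for k in range(1, 31)]
    + [f"lead{k}_10s" for k in range(1, 31)]
    + ["signal", "mean_5s"]
)

# Matches "lag<N>" or "lead<N>" at the start of a name and captures N.
shift_pattern = re.compile(r"^(lag|lead)(\d+)")


def keep_column(col, max_shift=6):
    """Keep non-shift features, and lag/lead features up to max_shift."""
    m = shift_pattern.match(col)
    return m is None or int(m.group(2)) <= max_shift


selected_cols = [c for c in all_cols if keep_column(c)]
print(len(selected_cols), selected_cols)
```

With a feature frame such as `X`, the selection would then be `X[selected_cols]`.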

In [None]:
#dir()
del df_lags, df_leads, fold_importance, models, sample_submission, oof_pred
gc.collect()

## Additional study: Null importance
I'm going to consider the [null importance that olivier shared](https://www.kaggle.com/ogrellier/feature-selection-with-null-importances).  
I use only the first-fold data from above to reduce the time spent writing code.  
And sorry for the unsophisticated code.  
  
The target is permuted and a LightGBM model is fit on it. The best_iteration value from the original fit is used as n_estimators. Please refer to the code.  

This notebook uses 30 rounds to save time.
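
The idea can be sketched on toy data without LightGBM. In this sketch the "importance" is the absolute least-squares coefficient, which is only a stand-in for split or gain importance, and the data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): only column 0 is related to the target.
n = 1000
X_demo = rng.normal(size=(n, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=n)


def importance(X_, y_):
    """Stand-in importance: absolute least-squares coefficient per feature."""
    coef, *_ = np.linalg.lstsq(X_, y_, rcond=None)
    return np.abs(coef)


# Importance with the real target.
actual_imp = importance(X_demo, y_demo)

# Null importance: refit on a shuffled target, repeated n_round times.
n_round = 30
null_imp = np.array(
    [importance(X_demo, rng.permutation(y_demo)) for _ in range(n_round)]
)

print("actual   :", actual_imp)
print("null mean:", null_imp.mean(axis=0))
```

Shuffling the target breaks its relation to every feature, so the null rounds show how much importance a feature can collect by chance alone.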

In [None]:
#--------------------------------------
# Calculate null importance
#--------------------------------------

## re-define params

n_est = model.booster_.best_iteration

# params for validation
params_null = {
    'learning_rate': 0.03, 
    'max_depth': -1,
    'num_leaves': 2**8+1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.82,
    'bagging_freq': 0,
    'n_jobs': 8,
    'random_state': 31452,
    'metric': 'rmse',
    'objective' : 'regression'
    }

params_null2 = {
    #'eval_metric': 'rmse', #This did not work.4/26
    'n_estimators': n_est #,
    #'early_stopping_rounds': 50
    }

params_null.update(params_null2)


n_round = 30 #!!

df_null_importance = pd.DataFrame()

for i in range(n_round):
    if i % 10 == 0:
        print(f'Calculating null importance round {i}')
    y_null = y_train.copy().sample(frac = 1.0)
    
    model_null = lgb.LGBMRegressor()
    model_null.set_params(**params_null)
    
    model_null.fit(X_train, y_null, 
                   eval_set = [(X_train, y_null)], #,(X_valid, y_valid)],
                   verbose = 300)
    
    tmp_importance = pd.DataFrame()
    tmp_importance['feature'] = X.columns.tolist()
    tmp_importance['round'] = i + 1 
    tmp_importance['split_importance'] =\
        model_null.booster_.feature_importance(importance_type = 'split')
    tmp_importance['gain_importance'] =\
        model_null.booster_.feature_importance(importance_type = 'gain')
    
    df_null_importance =\
        pd.concat([df_null_importance, tmp_importance], axis = 0)

#Name shortening
st1 = df_null_importance[df_null_importance['feature'].str.contains("_10s")]['feature'].str.replace("_10s", "")
df_null_importance.loc[df_null_importance['feature'].str.contains("_10s"), 'feature'] = st1

### Distribution plot: Split Importance  

The magenta line marks the actual importance, and the blue histogram shows the null importance distribution. The farther the actual importance lies above the null distribution, the higher the feature importance.

In [None]:
def plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp):

    fig, ax = plt.subplots(3, 4, figsize = (15, 10))

    k = 0
    for i in range(3):
        for j in range(4):
            try:
                disp_col = X_cols[k]

                act_imp =\
                    df_feature_importance[df_feature_importance['feature'] == disp_col][imp].values
                null_imp =\
                    df_null_importance[df_null_importance['feature'] == disp_col][imp]

                f = ax[i, j].hist(null_imp, alpha = 0.8, color = "dodgerblue", 
                             label = "Null importance")
                y_max = np.max(f[0])

                ax[i, j].plot([act_imp, act_imp], [0.0, y_max], linewidth = 7, color ="magenta", label = "Real target")
                ax[i, j].set_title(disp_col.replace("_10s", ""),  fontsize = 16)
                k += 1
            except:
                pass
    plt.tight_layout()
    fig.suptitle("Distribution of " + imp + ": actual(magenta), target_permutation(blue) ", fontsize=20)
    plt.subplots_adjust(top=0.9)
    plt.show()

**Without lag and lead features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("mean|sd")]['feature'].unique()
imp = "split_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

**Lag features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("lag")]['feature'].unique()
imp = "split_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

**Lead features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("lead")]['feature'].unique()
imp = "split_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

### Distribution plot: Gain Importance

**Without lag and lead features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("mean|sd")]['feature'].unique()
imp = "gain_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

**Lag features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("lag")]['feature'].unique()
imp = "gain_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

**Lead features**

In [None]:
X_cols = df_feature_importance[df_feature_importance['feature'].str.contains("lead")]['feature'].unique()
imp = "gain_importance"
plot_null_dist(df_feature_importance, df_null_importance, X_cols, imp)

### Quantification of null importance
I examined how to quantify the distribution plots above and use the result as a feature importance.  
Here is the method I tried.  
* The target-permuted (null) importance distribution was assumed to be Gaussian.  
* The probability value was calculated from the actual importance, normalized by the mean and standard deviation of the null distribution.  
* p = norm.cdf(x=normalized_actual_importance_value, loc=0, scale=1)
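
A worked example of the steps above for a single feature, with made-up importance values. The numbers and variable names are assumptions for illustration; `ddof=1` is used so the standard deviation matches the pandas default.

```python
import numpy as np
from scipy.stats import norm

# Made-up null importances of one feature over five permutation rounds.
null_values = np.array([10.0, 12.0, 11.0, 9.0, 13.0])
# Made-up importance of the same feature with the real target.
actual_value = 20.0

# Normalize the actual importance by the null mean and standard deviation.
null_mean = null_values.mean()
null_sd = null_values.std(ddof=1)  # sample standard deviation
z = (actual_value - null_mean) / null_sd

# Probability that a Gaussian null importance falls below the actual value.
p = norm.cdf(z, loc=0, scale=1)

print(f"z = {z:.4f}, p = {p:.6f}")
```

A p value close to 1 means the actual importance sits far above the null distribution, while a value near 0.5 means the feature is no more important than it would be by chance.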

In [None]:
## evaluation of null_importance

#df_null_p = df_null_importance.groupby('feature').agg(["mean", "std"], sort = False).reset_index()
df_null_mean = df_null_importance.groupby('feature', sort = False).mean().reset_index()
df_null_mean.drop('round', axis = 1, inplace = True)
df_null_mean.rename(columns = {'split_importance':'split_importance_mean', 
                               'gain_importance':'gain_importance_mean'}, inplace = True)

df_null_sd = df_null_importance.groupby('feature', sort = False).std().reset_index()
df_null_sd.drop('round', axis = 1, inplace = True)
df_null_sd.rename(columns = {'split_importance':'split_importance_sd', 
                               'gain_importance':'gain_importance_sd'}, inplace = True)

df_act = df_feature_importance.groupby('feature', sort =False).mean().reset_index()
df_act.drop('fold', axis = 1, inplace = True)

df_null_summary = df_act.merge(df_null_mean, how = "left", on = 'feature')
df_null_summary = df_null_summary.merge(df_null_sd, how = "left", on = 'feature')
df_null_summary['z_split'] = (df_null_summary['split_importance']-df_null_summary['split_importance_mean'])/df_null_summary['split_importance_sd'] 
df_null_summary['z_gain'] = (df_null_summary['gain_importance']-df_null_summary['gain_importance_mean'])/df_null_summary['gain_importance_sd'] 

from scipy.stats import norm
df_null_summary['p_by_split_distribution'] = norm.cdf(x = df_null_summary['z_split'], loc = 0, scale = 1)
df_null_summary['p_by_gain_distribution'] = norm.cdf(x = df_null_summary['z_gain'], loc = 0, scale = 1)

#df_null_summary.columns

### Feature importance: All features

In [None]:
df_null_summary_tmp =\
df_null_summary[~df_null_summary['feature'].str.contains("lag2[0-9]|lead2[0-9]|lag3[0-9]|lead3[0-9]")]
plot_feature_importance_by3(df_null_summary_tmp, null_imp = True)

### Feature importance: Lag features

In [None]:
df_null_summary_tmp = df_null_summary[df_null_summary['feature'].str.contains("lag")]
plot_feature_importance_by3(df_null_summary_tmp, null_imp = True)

### Feature importance: Lead features

In [None]:
df_null_summary_tmp = df_null_summary[df_null_summary['feature'].str.contains("lead")]
plot_feature_importance_by3(df_null_summary_tmp, null_imp = True)

* The important variables by "split" are almost the same as those by permutation importance.  
* By "gain", only Lag-1 / Lead-1,2 came out as important.

I think the accuracy improves as the number of iterations increases. That takes more time to calculate, but I think it still takes less time than permutation importance.

## Acknowledgment
Thanks to Chris for sharing clean data: [Data Without Drift](https://www.kaggle.com/cdeotte/data-without-drift), and for his always amazing work.  
Thanks to corochann for many suggestions: [Permutation importance for feature selection part1](https://www.kaggle.com/corochann/permutation-importance-for-feature-selection-part1)  
Thanks to Gabriel for sharing study materials for modeling: [Ion Switching Advanced EDA and Prediction](https://www.kaggle.com/gpreda/ion-switching-advanced-eda-and-prediction)  
Thanks to Rob for sharing a really helpful notebook: [Ion Switching - 5kfold LGBM & Tracking](https://www.kaggle.com/robikscube/ion-switching-5kfold-lgbm-tracking)  
Thanks to Andrew for your notebooks (I have learned a lot from them): [EDA and models](https://www.kaggle.com/artgor/eda-and-models) etc.  
Thanks to olivier for sharing the notebook about null importance: [Feature Selection with Null Importances](https://www.kaggle.com/ogrellier/feature-selection-with-null-importances)