# Permutation Feature Importance Practical

In this practical we look at Permutation Feature Importance for a dataset that contains both numerical and categorical variables. We investigate what happens when we add random features to the data and (bonus part) look at how adding a correlated feature impacts the permutation feature importance scores.

First we'll need a few imports:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.metrics import balanced_accuracy_score
from lightgbm.sklearn import LGBMClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Introduction to Practical

### The Dataset

The data (sourced [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing)) is from marketing campaigns of a Portuguese bank. The features in the data are:

| Feature     | Description                                                                                                                                                              | Data type   |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| Age         | Age                                                                                                                                                                      | Numeric     |
| Job         | Type Of Job Out Of   ('Admin.','Blue-Collar','Entrepreneur','Housemaid','Management','Retired','Self-Employed','Services','Student','Technician','Unemployed','Unknown') | Categorical |
| Marital     | Marital Status ('Divorced','Married','Single','Unknown'; Note: 'Divorced'   Means Divorced Or Widowed)                                                                   | Categorical |
| Education   | Education Type   ('Basic.4Y','Basic.6Y','Basic.9Y','High.School','Illiterate','Professional.Course','University.Degree','Unknown')                                       | Categorical |
| Default     | Has Credit In Default? ('No','Yes','Unknown')                                                                                                                            | Categorical |
| Housing     | Has Housing Loan? ('No','Yes','Unknown')                                                                                                                                 | Categorical |
| Loan        | Has Personal Loan? ('No','Yes','Unknown')                                                                                                                                | Categorical |
| Contact     | Contact Communication Type ('Cellular','Telephone')                                                                                                                      | Categorical |
| Month       | Last Contact Month Of Year ('Jan', 'Feb', 'Mar', ..., 'Nov', 'Dec')                                                                                                      | Categorical |
| Day of Week | Last Contact Day Of The Week ( 'Mon','Tue','Wed','Thu','Fri')                                                                                                            | Categorical |
| Campaign    | Number Of Contacts Performed During This Campaign And For This Client                                                                                                    | Numeric     |
| pdays       | Number Of Days That Passed By After The Client Was Last Contacted From A   Previous Campaign (999 Means Client Was Not Previously Contacted)                             | Numeric     |
| Previous    | Number Of Contacts Performed Before This Campaign And For This Client                                                                                                    | Numeric     |
| poutcome    | Outcome Of The Previous Marketing Campaign   ('Failure','Nonexistent','Success')                                                                                         | Categorical |
| y           | Has The Client Subscribed To A Term Deposit? ('Yes','No')                                                                                                                | Binary      |

### Functions

A function has been pre-written to train a LightGBM model for data with a mixture of numeric and categorical features. It takes as inputs: `X_train`, `X_test`, `y_train`, `y_test`, a list of the names of the numerical features `numerical_features` and a list of the names of the categorical features `categorical_features`. It outputs a trained LightGBM classifier `lgbm`.

In [None]:
def train_lgbm(X_train, X_test, y_train, y_test, numerical_features, categorical_features):
    
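    # One-hot encode categorical columns; categories unseen during training are encoded as all zeros.
    categorical_encoder = OneHotEncoder(handle_unknown='ignore')
    
    # Replace missing numerical values with the column mean.
    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean'))
    ])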

    preprocessing = ColumnTransformer(
        [('cat', categorical_encoder, categorical_features),
         ('num', numerical_pipe, numerical_features)])

    lgbm = Pipeline([
        ('preprocess', preprocessing),
        ('classifier', LGBMClassifier(class_weight="balanced", n_jobs=-1))
    ])

    lgbm.fit(X_train, y_train)
    
    print("LightGBM train accuracy: %0.3f" % lgbm.score(X_train, y_train))
    print("LightGBM test accuracy: %0.3f" % lgbm.score(X_test, y_test))
    print("LightGBM balanced test accuracy: %0.3f" % balanced_accuracy_score(y_test, lgbm.predict(X_test)))
    
    return lgbm

Another function has been pre-written to compute the permutation feature importance scores and plot them. The function takes a trained model `model`, train or test set features `X` and the corresponding labels `y`. It produces box plots of the feature importance scores for that model and data, where importance is measured as the drop in balanced accuracy when a feature's values are shuffled. Feel free to play around with the parameters `n_repeats` and `random_state`.

In [None]:
def plot_pfi(model, X, y):

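    # Shuffle each column of X n_repeats times and record the drop in balanced accuracy.
    result = permutation_importance(model, X, y, n_repeats=10,
                                    random_state=42, n_jobs=2, scoring="balanced_accuracy")
    # Order the features from least to most important (by mean drop in score).
    sorted_idx = result.importances_mean.argsort()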

    fig, ax = plt.subplots()
    ax.boxplot(result.importances[sorted_idx].T,
               vert=False, labels=X.columns[sorted_idx])
    ax.set_title("Permutation Feature Importances")
    fig.tight_layout()
    plt.show()
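
To make the idea behind `permutation_importance` concrete, here is a minimal sketch of the same procedure written by hand with NumPy. It is an illustration under simplifying assumptions, not scikit-learn's implementation: `manual_permutation_importance`, `toy_predict`, `accuracy` and the toy data are hypothetical names introduced only for this example, and plain accuracy stands in for balanced accuracy.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))


def manual_permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Return an array of shape (n_features, n_repeats) of score drops.

    Each entry is baseline score minus the score obtained after shuffling
    one column of X, which breaks that column's link to the labels.
    """
    rng = np.random.default_rng(seed)
    baseline = score(y, predict(X))
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j, r] = baseline - score(y, predict(X_perm))
    return importances


# Toy data (assumption): column 0 fully determines the label, column 1 is noise.
toy_rng = np.random.default_rng(1)
X_toy = toy_rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)


def toy_predict(X):
    """A hand-written 'model' that only looks at column 0."""
    return (X[:, 0] > 0).astype(int)


toy_importances = manual_permutation_importance(toy_predict, X_toy, y_toy, accuracy)
print("mean importance per column:", toy_importances.mean(axis=1))
```

Shuffling column 0 should cost roughly half of the accuracy, while shuffling column 1 costs nothing because the toy model never reads it. The box plots produced by `plot_pfi` summarise the same kind of per-repeat score drops for each feature.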

# Exercises

### Part 1: Load Data, Train LightGBM Model and Plot Permutation Feature Importances

**Exercise 1:** use the `pd.read_csv()` function to load in the file `data/bank.csv`. Call the dataframe `df`.

In [None]:
df = pd.read_csv("data/bank.csv")


**Exercise 2:** use the `head()` function to view the first few samples in the dataset.

In [None]:
df.head()


**Exercise 3:** split the data into labels `y` and features `X`. Hint: use the `.drop("y", axis=1)` function to extract a dataframe without the labels.

In [None]:
y = df["y"]
X = df.drop("y", axis=1)


**Exercise 4:** run the command `y.map({"no": 0, "yes": 1})` to map the labels from "no" to 0 and "yes" to 1. This makes it easier to train our model.

In [None]:
y = y.map({"no": 0, "yes": 1})


**Exercise 5:** define two lists `numerical_features` and `categorical_features` which contain the names of the numerical and categorical features respectively. Hint: use `X.dtypes` to look at the data types (columns with dtype `object` are categorical). We have to treat these two types of variables separately when training the model.

In [None]:
print(X.dtypes)  # printed explicitly, since only the last expression in a cell is displayed
numerical_features = ["age", "campaign", "pdays", "previous"]
categorical_features = ["job", "marital", "education", "default", "housing", "loan",
                        "contact", "month", "day_of_week", "poutcome"]


**Exercise 6:** use the `train_test_split()` function to split the data into `X_train`, `X_test`, `y_train`, `y_test` to prepare for training the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)


**Exercise 7:** use the `train_lgbm()` function defined above to train an LGBM classifier on our data and observe its performance. Remember to pass in the correct arguments (see the function in the Introduction to Practical section above). Because there is a class imbalance in our data, the function also prints the balanced test accuracy.

In [None]:
lgbm = train_lgbm(X_train, X_test, y_train, y_test, numerical_features, categorical_features)
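
To see why balanced accuracy is reported alongside plain accuracy, the sketch below compares the two on a small imbalanced example. The labels and the helper functions `accuracy` and `balanced_accuracy` are made up for illustration. Balanced accuracy is computed here as the mean of the per-class recalls, which is how scikit-learn documents `balanced_accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_true_demo = [0] * 90 + [1] * 10
# A useless "model" that always predicts the majority class.
y_pred_demo = [0] * 100

acc_demo = accuracy(y_true_demo, y_pred_demo)
bal_acc_demo = balanced_accuracy(y_true_demo, y_pred_demo)
print("accuracy:", acc_demo)                # 0.9
print("balanced accuracy:", bal_acc_demo)   # 0.5
```

A classifier that ignores the minority class entirely still scores 0.9 on accuracy, but only 0.5 on balanced accuracy, the same as guessing.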


**Exercise 8:** use the `plot_pfi()` function defined above to plot the permutation feature importance scores on the *training* set. Hint: look above to see which arguments you need to pass in.

In [None]:
plot_pfi(lgbm, X_train, y_train)


**Exercise 9:** repeat exercise 8 on the *test* set and observe differences between the two plots.

In [None]:
plot_pfi(lgbm, X_test, y_test)


Note that the permutation feature importance of a feature can be negative on the test set, whereas this rarely, if ever, happens on the training set. Why is this the case?

### Part 2: Add Random Features

The following piece of code loads in the data (as you did in Exercises 1, 3 and 4). It then defines two new features, `random_cat` and `random_num`, which are completely random categorical and numerical variables respectively. This part of the practical looks at how adding random features (which should not be informative to the model) can affect permutation feature importance.

Run the following cell to define a new dataset with the `random_cat` and `random_num` features added, and train a new model `lgbm_rand` on this data.

In [None]:
df_rand = pd.read_csv("data/bank.csv")
y_rand = df_rand["y"]
X_rand = df_rand.drop("y", axis=1)
y_rand = y_rand.map({"no": 0, "yes": 1})

rng = np.random.RandomState(seed=42)
X_rand["random_cat"] = rng.randint(3, size=X_rand.shape[0])
X_rand["random_num"] = rng.randn(X_rand.shape[0])

numerical_features_rand = ["age", "campaign", "pdays", "previous", "random_num"]
categorical_features_rand = ["job", "marital", "education","default", "housing", "loan",
            "contact", "month", "day_of_week", "poutcome", "random_cat"]
    
X_train_rand, X_test_rand, y_train_rand, y_test_rand = train_test_split(X_rand, y_rand)

lgbm_rand = train_lgbm(X_train_rand, X_test_rand, y_train_rand, y_test_rand, numerical_features_rand, categorical_features_rand)

**Exercise 10:** use the `plot_pfi()` function to plot the permutation feature importance of the `lgbm_rand` model on the *training* data `X_train_rand` and `y_train_rand`.

In [None]:
plot_pfi(lgbm_rand, X_train_rand, y_train_rand)


**Exercise 11:** use the `plot_pfi()` function to plot the permutation feature importance of the `lgbm_rand` model on the *test* data `X_test_rand` and `y_test_rand`.

In [None]:
plot_pfi(lgbm_rand, X_test_rand, y_test_rand)


**Interpretation of results:** 

On the test set in Exercise 11, both random features have very low importances (close to 0), as expected. This is good news!

However, the training set plot in Exercise 10 shows that `random_num` gets a noticeably higher importance than it does on the test set. The difference between the two plots confirms that the model has enough capacity to overfit using the random numerical feature: it has fitted patterns in the random numbers that we know carry no real information about the label.
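
The same effect can be reproduced on a small synthetic problem. The sketch below is an illustration only: it uses a fully grown scikit-learn `DecisionTreeClassifier` as a stand-in for a high-capacity model (not the LightGBM pipeline above), plain accuracy as the score, and made-up data in which one column is signal and the other is pure noise. All variable names are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng_demo = np.random.RandomState(0)
n_samples = 1000
signal = rng_demo.randn(n_samples)   # informative feature
noise = rng_demo.randn(n_samples)    # carries no information about the label
# Noisy labels, so that the signal alone cannot explain every training point.
y_demo = (signal + rng_demo.randn(n_samples) > 0).astype(int)
X_demo = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# An unrestricted tree keeps splitting until it has memorised the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

pfi_train = permutation_importance(tree, X_tr, y_tr, n_repeats=10, random_state=0)
pfi_test = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)

noise_train = pfi_train.importances_mean[1]
noise_test = pfi_test.importances_mean[1]
print("noise importance on train:", round(noise_train, 3))
print("noise importance on test: ", round(noise_test, 3))
```

The tree needs the noise column to separate training points that the signal alone cannot, so shuffling it hurts the training score. On held-out data those memorised splits do not help, and the importance of the noise column should be close to zero.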

### Bonus Part 3: Add a Correlated Feature (Season)

The following piece of code loads in the data (as you did in Exercises 1, 3 and 4). It then defines a new feature `season`, which is derived from the `month` feature. Since `season` is fully determined by `month`, the two are highly correlated, so this part of the practical looks at how adding correlated features can affect permutation feature importance.

Run the following cell to define a new dataset with the `season` feature, and train a new model `lgbm_corr` on this data.

In [None]:
df_corr = pd.read_csv("data/bank.csv")
y_corr = df_corr["y"]
X_corr = df_corr.drop("y", axis=1)
y_corr = y_corr.map({"no": 0, "yes": 1})

# Map each month to its season; months missing from the mapping become NaN.
month_to_season = {
    "dec": "winter", "jan": "winter", "feb": "winter",
    "mar": "spring", "apr": "spring", "may": "spring",
    "jun": "summer", "jul": "summer", "aug": "summer",
    "sep": "autumn", "oct": "autumn", "nov": "autumn",
}
X_corr["season"] = X_corr["month"].map(month_to_season)

numerical_features_corr = ["age", "campaign", "pdays", "previous"]
categorical_features_corr = ["job", "marital", "education","default", "housing", "loan",
            "contact", "month", "day_of_week", "poutcome", "season"]
    
X_train_corr, X_test_corr, y_train_corr, y_test_corr = train_test_split(X_corr, y_corr)

lgbm_corr = train_lgbm(X_train_corr, X_test_corr, y_train_corr, y_test_corr, numerical_features_corr, categorical_features_corr)

**Exercise 12:** use the `plot_pfi()` function to plot the permutation feature importance of the `lgbm_corr` model on the *training* data with the `season` variable.

In [None]:
plot_pfi(lgbm_corr, X_train_corr, y_train_corr)


**Exercise 13:** use the `plot_pfi()` function to plot the permutation feature importance of the `lgbm_corr` model on the *test* data with the `season` variable.

In [None]:
plot_pfi(lgbm_corr, X_test_corr, y_test_corr)


We have to be slightly careful when comparing these results directly to the original model in Part 1, because the two models differ: they were trained on different features and on different random train/test splits. But comparing the training plot here to the training plot produced in Part 1, we see that the importance score for `month` decreases slightly when `season` is added. Comparing the training and test plots in this part, we see that `season` has a larger importance on the test set than on the training set. To read more about Permutation Feature Importance with correlated variables, see [this](https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py) tutorial.
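
One reason a correlated feature can lower the importance of the original is that the model can fall back on the copy when the original is shuffled. The sketch below illustrates this with an extreme, hand-built case rather than a trained model: a hypothetical predictor that averages a feature and its exact duplicate. The data, the function names and the use of plain accuracy are all assumptions made for the example.

```python
import numpy as np

rng_corr = np.random.default_rng(0)
x = rng_corr.normal(size=2000)
y_dup = (x > 0).astype(int)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))


def mean_importance(predict, X, y, column, n_repeats=20):
    """Average drop in accuracy when one column of X is shuffled."""
    baseline = accuracy(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, column] = rng_corr.permutation(X_perm[:, column])
        drops.append(baseline - accuracy(y, predict(X_perm)))
    return float(np.mean(drops))


def predict_single(X):
    """Hand-built model A: uses the only column it is given."""
    return (X[:, 0] > 0).astype(int)


def predict_double(X):
    """Hand-built model B: averages the feature and its duplicate."""
    return (X.mean(axis=1) > 0).astype(int)


X_single = x.reshape(-1, 1)
X_double = np.column_stack([x, x])  # second column is an exact copy

imp_single = mean_importance(predict_single, X_single, y_dup, column=0)
imp_double = mean_importance(predict_double, X_double, y_dup, column=0)
print("importance of x when it is the only feature:", round(imp_single, 3))
print("importance of x when a duplicate is present:", round(imp_double, 3))
```

With the duplicate present, shuffling `x` removes only part of the information the predictor relies on, so its measured importance should be roughly half of what it is alone. Real models share importance between correlated features less evenly, but the direction of the effect is the same.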