# Tutorial 3: Feature Engineering and Selection

---

## Introduction

Welcome! This tutorial will show you how to preprocess features for classification. This step is essential because the infrared spectrum data are large and highly correlated. The preprocessed features will be used to train the machine learning methods in the upcoming tutorials.

There are two general approaches to feature preprocessing. _Feature engineering_ creates new features that distill the important information needed for classification. _Feature selection_ keeps only the best existing features, in order to reduce the noise that can confuse the classifier. Both approaches cut down on redundant information and eliminate irrelevant information. They are complementary, and often both are used. In this notebook we will apply both feature engineering and feature selection to our spectral data.

As usual, we begin with the necessary imports.

In [None]:
# ___Cell no. 1___

import pandas as pd
import numpy as np

Next, we retrieve the data stored in the previous notebook and display the first few rows.

In [None]:
# ___Cell no. 2___

%store -r X
%store -r Y
%store -r df

In [None]:
# ___Cell no. 3___

X.head(3)

---

### Feature Engineering

We notice that consecutive columns differ only slightly, so they contain redundant information. One way to reduce the amount of data is to average groups of 𝑛 consecutive columns of the dataframe, which yields a smaller dataframe. To illustrate this, we first demonstrate the idea on a small dataframe.

##### **Example of consecutive column averaging for small dataframe**

First, let us create a dummy dataframe.

In [None]:
# ___Cell no. 4___

df_dummy = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [0, 10, 20, 30, 40], 'C': [0, 100, 200, 300, 400], 'D': [0, 1, 2, 3, 4],  'E': [0, 1, 2, 3, 4],  'F': [0, 1, 2, 3, 4]})
df_dummy


Now let us take the sum of each group of 3 consecutive columns (dividing these sums by 3 would give the averages).

In [None]:
# ___Cell no. 5___

# transpose so that the rolling window runs over the columns
# (the axis=1 argument of DataFrame.rolling is deprecated since pandas 2.1)
df_dummyRoll = df_dummy.T.rolling(3).sum().T
print(df_dummyRoll)


The column sums overlap, but we are only interested in non-overlapping sums. So we select every third sum. 

In [None]:
# ___Cell no. 6___

df_dummyRoll = df_dummyRoll.iloc[:, 2::3]
df_dummyRoll
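
As a cross-check on the rolling approach above, non-overlapping group averages can also be computed in one step by giving each column a group number and grouping on it. This is a minimal sketch rather than the method used in the rest of the tutorial; the names `group_ids` and `df_grouped_mean` are introduced here for illustration only, and the dataframe repeats the dummy data from Cell no. 4.

```python
import numpy as np
import pandas as pd

# same dummy data as in Cell no. 4
df_dummy = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [0, 10, 20, 30, 40],
                         'C': [0, 100, 200, 300, 400], 'D': [0, 1, 2, 3, 4],
                         'E': [0, 1, 2, 3, 4], 'F': [0, 1, 2, 3, 4]})

n = 3  # number of consecutive columns per group
group_ids = np.arange(df_dummy.shape[1]) // n  # one group number per column: 0, 0, 0, 1, 1, 1

# group the transposed frame by these numbers, average, and transpose back
df_grouped_mean = df_dummy.T.groupby(group_ids).mean().T
print(df_grouped_mean)
```

Each value equals the corresponding rolling sum divided by 3. One difference to keep in mind: if the number of columns is not a multiple of `n`, this version keeps the trailing incomplete group (averaged over fewer columns), whereas selecting every third rolling sum drops it.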

Hopefully the idea is clear by now, so let's do the same on our main dataset.

##### **Now applying column averaging on GS data**

We have created a Python function in `source.utils` called `creat_rollingData`. We will be using this function to create aggregated rolling windows for a given data frame; the function has the following arguments:
 - df: the data frame
 - window_arr: an array of window sizes; we will be using this to create several aggregated datasets
 - axis: specifies the axis of aggregation (0 = rows, 1 = columns; the default is 1)
 - method: the function used to aggregate the data. Several such functions are provided (e.g. `mean_df`, `skew_df`, and `kurt_df` calculate the mean, skewness, and kurtosis, respectively, of each group of columns)
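
The source of `creat_rollingData` is not reproduced in this notebook. To make the interface concrete, here is a minimal sketch of a function with similar arguments. Everything in it is an assumption made for illustration: the names `rolling_data_sketch` and `mean_cols_sketch` are hypothetical, the aggregator is assumed to take one group of columns and return one value per row, only column-wise aggregation is covered, and a trailing incomplete group is assumed to be dropped. The real function in `source.utils` may differ on any of these points.

```python
import pandas as pd

def mean_cols_sketch(block):
    """Hypothetical aggregator: row-wise mean of one group of columns."""
    return block.mean(axis=1)

def rolling_data_sketch(df, window_arr, method=mean_cols_sketch):
    """Hypothetical stand-in for creat_rollingData (column-wise only).

    Returns one aggregated dataframe per window size in window_arr.
    """
    results = []
    for window in window_arr:
        n_groups = df.shape[1] // window  # assumption: incomplete trailing group is dropped
        cols = {}
        for g in range(n_groups):
            block = df.iloc[:, g * window:(g + 1) * window]
            # label each new column by the first and last original column it covers
            cols[f"{block.columns[0]}..{block.columns[-1]}"] = method(block)
        results.append(pd.DataFrame(cols, index=df.index))
    return results

demo = pd.DataFrame({'A': [0, 1], 'B': [2, 3], 'C': [4, 5], 'D': [6, 7]})
out = rolling_data_sketch(demo, window_arr=[2, 4])
for frame in out:
    print(frame.shape)
```

With window sizes 2 and 4, the four demo columns collapse to two columns and one column, respectively.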

In [None]:
# ___Cell no. 7___

import sys
sys.path.append("..")
from source.utils import creat_rollingData, skew_df, mean_df, kurt_df

First, let us run the `creat_rollingData` function on the dummy data we created above. We use two different window sizes (2 and 3), which produce two different feature sets. Either feature set can be used as an input.

In [None]:
# ___Cell no. 8___

df_arrayRol_dummy  = creat_rollingData (df = df_dummy, window_arr = [2,3], method =  mean_df)
df_arrayRol_dummy

The results above confirm that the function works as expected. Next, we create four different sets of engineered features using four different window sizes. Later we will compare the performance obtained using these four sets.

In [None]:
# ___Cell no. 9___

X_arrayRol  = creat_rollingData (df = X, window_arr = [10, 30, 50, 100], method =  mean_df )

# let us see the shape of the created rolled dataframes
for x in X_arrayRol:
    print(x.shape)

Let us now visualise the data. We will make use of a custom function called `graph_df`, which is defined in `source.graphs`.

In [None]:
# ___Cell no. 10___

from source.graphs import graph_df

In [None]:
# ___Cell no. 11___

graph_df (X_arrayRol, Y, n = 50) # where n is the number of randomly selected samples


Notice that the resolution decreases as the window size grows, since each averaged feature replaces a whole group of original features.

**Exercise 1:** Apply column averaging on the other 2 datasets
<br>

In [None]:
#  ___ code here ____


**Exercise 2:** Using the function `mean_df` as an example, write another function to compute the variance of the column groups. Call your function `var_df`.

<br>

In [None]:
#  ___ code here ____


---

###  Feature Selection 

The idea is to select the best N features from each of the feature sets we created above. We set N = 10 (you are encouraged to try other values of N).

For feature selection, we will make use of sequential feature selection (SFS), implemented as `SequentialFeatureSelector` in `sklearn`. For more information about the method, visit the following [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html).

We need to consider the following when using SFS:
   - SFS is an optimization tool, so we need to split the data into training and test sets and perform feature selection on the training set only.
   - SFS adds features (forward selection) or removes them (backward selection) one at a time based on a _cross-validation score_. To obtain this score, we need to specify an ML model that is fitted on the candidate feature sets. For more information about cross-validation, please visit the following [link](https://scikit-learn.org/stable/modules/cross_validation.html).
   
   In this tutorial, we will be using logistic regression. As mentioned in a previous tutorial, we are interested in the precision of the classifier, so the cross-validation score will be based on the precision of the classification.

`Notice`: It is *essential* to perform optimizations on the training set rather than the test set to avoid information leakage. Otherwise, model performance will be overestimated.
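
To make the two considerations above concrete, the following sketch splits a synthetic dataset, then computes a cross-validated precision score on the training part only. It is an illustration under stated assumptions, not the tutorial's pipeline: the data come from `make_classification`, and scikit-learn's `train_test_split` stands in for the `split` helper from `source.utils`, whose implementation is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic two-class data (assumption: stands in for the spectral features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

# hold out a test set that is never touched during optimization
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

# 5-fold cross-validated precision, computed on the training set only
scores = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr,
                         cv=5, scoring='precision')
print(scores.mean())
```

SFS computes a score of this kind for every candidate feature set, which is why the test set must stay out of the procedure entirely.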

We begin with the necessary imports.

In [None]:
# ___Cell no. 12___

from source.utils import split #  a pre-defined function to split the data into training and testing
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

Before we perform feature selection, let's convert the `Y` values from `S` and `B` to `1` and `0`, respectively.

In [None]:
# ___Cell no. 13___

Y = Y.map({'S': 1, 'B': 0})
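
A point worth knowing about `Series.map` with a dictionary: any value that is not a key of the dictionary becomes `NaN`. The short sketch below shows this on a made-up label series (the names `labels` and `encoded` are illustrative and are not part of the tutorial's data).

```python
import pandas as pd

labels = pd.Series(['S', 'B', 'B', 'S'])   # made-up labels for illustration
mapping = {'S': 1, 'B': 0}

encoded = labels.map(mapping)
print(encoded.tolist())

# mapping the already-encoded values again finds no matching keys
encoded_twice = encoded.map(mapping)
print(encoded_twice.isna().all())
```

Running Cell no. 13 a second time would therefore turn every entry of `Y` into `NaN`, so it is worth checking `Y.isna().any()` before moving on.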

Now let us perform the feature selection, using the methods from `sklearn`.

In [None]:
# ___Cell no. 14___

selected_indexes = []

for x_roll in X_arrayRol:
    Xtrain, Xtest, Ytrain, Ytest  = split( x_roll, Y) # splitting the data
    print("(Number of samples, number of features) = ", Xtrain.shape)
    # define SFS: forward selection of 10 features, scored by cross-validated precision
    sfs = SequentialFeatureSelector(estimator=LogisticRegression(solver='newton-cg'),
                                    n_features_to_select=10, direction='forward', scoring='precision')
    sfs.fit(Xtrain.values, Ytrain)
    selected_indexes.append(sfs.support_)  # boolean mask (not integer indices) of the selected columns
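
The `support_` attribute stored above is a boolean mask with one entry per feature, not a list of integer positions. The sketch below, run on synthetic data with hypothetical names (`sfs_demo`, `mask`, `idx`), shows how such a mask can be converted to column positions and used to subset the feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# small synthetic problem so that the selection runs quickly
X_demo, y_demo = make_classification(n_samples=120, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)

sfs_demo = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2, direction='forward',
                                     scoring='precision', cv=3)
sfs_demo.fit(X_demo, y_demo)

mask = sfs_demo.get_support()   # boolean mask, same content as sfs_demo.support_
idx = np.flatnonzero(mask)      # integer positions of the selected columns
X_sel = X_demo[:, mask]         # keep only the selected columns
print(idx, X_sel.shape)
```

For a dataframe, the same mask can be applied with `df.loc[:, mask]`, and `df.columns[mask]` gives the names of the selected features.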


If you want to know how to choose the number of features N automatically based on the score, you can visit this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html), or this [link](https://machinelearningmastery.com/rfe-feature-selection-in-python/).

**Exercise 3:** Perform feature selection on the other 2 data sets


**Exercise 4:** Visualise the selected features using a scatter plot for the 3 datasets, since we would like to know where the best 10 features lie on the spectrum.
<br>


---

<b><i> Saving data for later use </i></b>

We can save the data so that we can load it again in the next notebooks.

In [None]:
# ___Cell no. 15___

%store  X_arrayRol
%store  selected_indexes