# Feature Selection With Pipelines

This notebook is a template for common feature selection activities. There are many [feature selection techniques](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection); this template demonstrates three: feature construction, recursive feature elimination, and variance threshold reduction. Whichever techniques you choose, it's beneficial to apply them through a feature selection pipeline. You will likely go through the cycle of building models, getting feedback, and adjusting your data several times, and keeping track of which transformations you made in previous iterations helps you refine your model.

## Setup

This section defines variables used in the following sections to construct and run the pipeline. You can change the values of the variables to meet your needs. 

In [None]:
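data_file = './data/feature.csv'      # CSV file containing the features and the target
name_space = 'example'                # namespace that groups this project's artifacts
dataset_name = 'sample02'
full_ds_name = '/'.join([name_space, dataset_name])  # qualified name: 'example/sample02'
ds_pred_trgt_name = 'CATEGORY'        # column holding the prediction target
rfe_feat_ct = 4                       # number of features RFE should keep
vtr_threshold = 0.16                  # variance cutoff, 0.8 * (1 - 0.8)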

## Pipeline

Import pandas and the other required libraries, and then create a dataframe:

In [None]:
from cortex import Cortex
import pandas as pd
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.feature_selection import VarianceThreshold

data_frame = pd.read_csv(data_file)
data_frame

The next cell creates a local Cortex client. To create a server-side client instead, replace `Cortex.local()` with `Cortex.client()`; the dataset is then persisted in Cortex.

In [None]:
cortex = Cortex.local()

Now create a dataset using the Cortex builder. The pipeline is created from this dataset.

In [None]:
builder = cortex.builder()
data_set_builder = builder.dataset(full_ds_name)
feat_sel_dataset = data_set_builder.from_csv(data_file).build()

A namespace is an organizational element for data saved in the Cortex infrastructure. Use it to keep the various artifacts related to a particular project together. This template combines the namespace with the dataset name to create a qualified dataset name, and with the pipeline name to create a qualified pipeline name.

In [None]:
pipeline_name = 'feat_sel'
full_pipeline_name = '/'.join([name_space, pipeline_name])

feat_sel_dataset = cortex.dataset(full_ds_name) 
pipeline = feat_sel_dataset.pipeline(full_pipeline_name)
pipeline.reset()

You can choose from the following techniques to curate the dataset. Adding the feature engineering steps to the pipeline in the order you want them run gives you a well-defined, reproducible method of preparing the data and improving the model's predictive performance.
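
To make the effect of step order concrete, here is a minimal stand-in for a step pipeline written with plain pandas. It is not the Cortex implementation: `run_steps`, `add_ratio`, and `drop_raw` are hypothetical names, and the only thing borrowed from this template is the `(pipeline, df)` step signature. The sketch shows that each step receives the dataframe returned by the previous step.

```python
import pandas as pd

def run_steps(steps, df):
    """Apply each step in order, feeding each result into the next step."""
    for step in steps:
        # The first argument mirrors the (pipeline, df) signature used in this
        # template; this sketch has no pipeline object, so it passes None.
        df = step(None, df)
    return df

def add_ratio(pipeline, df):
    # Hypothetical step: derive a new column from two existing ones
    return df.assign(RATIO=df['A'] / df['B'])

def drop_raw(pipeline, df):
    # Hypothetical step: remove the raw columns once they are no longer needed
    return df.drop(columns=['A', 'B'])

toy = pd.DataFrame({'A': [1.0, 4.0], 'B': [2.0, 8.0]})
result = run_steps([add_ratio, drop_raw], toy)
print(list(result.columns))
```

Reversing the two steps would raise a `KeyError`, because `add_ratio` would run after its input columns had been dropped.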

## Feature Construction

Feature construction adds a new feature to the data that is some function of existing features. For example:

In [None]:
def calc_theta(pipeline, df):
    df['THETA'] = df['ETA'] * df['ZETA']
    return df
    
pipeline.add_step(calc_theta)
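
As a quick check outside the pipeline, the sketch below applies the same kind of step to a small hand-made dataframe. The values are invented for illustration; only the column names `ETA`, `ZETA`, and `THETA` come from the step above. Unlike `calc_theta`, this variant (hypothetically named `calc_theta_copy`) uses `assign`, so the input dataframe is left unchanged.

```python
import pandas as pd

def calc_theta_copy(pipeline, df):
    # assign() returns a new dataframe; the caller's dataframe is not modified
    return df.assign(THETA=df['ETA'] * df['ZETA'])

# Invented sample values, for illustration only
toy = pd.DataFrame({'ETA': [1.0, 2.0, 3.0], 'ZETA': [10.0, 20.0, 30.0]})
out = calc_theta_copy(None, toy)
print(out)
```

Whether a step should modify its input in place or return a copy depends on how the pipeline reuses the dataframe; returning a copy is the more conservative choice when the same dataframe is run through the pipeline more than once.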


## Recursive Feature Elimination

Recursive feature elimination (RFE) uses a model (in this case a [Support Vector Regression](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) model with a linear kernel) to repeatedly fit the features to the target. After each fit, the features that contribute the least to the prediction are eliminated, and the model is refit on the remaining features until only the requested number of features (`rfe_feat_ct`) is left.

In [None]:
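def calc_rfe(pipeline, df):
    # Separate the target from the candidate features so the target is not ranked against itself
    y = df[ds_pred_trgt_name].values
    X = df.drop(columns=[ds_pred_trgt_name])
    estmtr = SVR(kernel='linear')
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    rfe = RFE(estmtr, n_features_to_select=rfe_feat_ct)
    rfe = rfe.fit(X, y)
    selected_features_rfe = list(X.columns[rfe.support_])
    # Keep the target column so later steps and model training can still use it
    return df[selected_features_rfe + [ds_pred_trgt_name]]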
    
pipeline.add_step(calc_rfe)
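
To see RFE on its own, the following sketch runs it on synthetic data. Everything in it is an assumption made for illustration: the column names `x0` to `x3`, the `TARGET` column, and the rule that generates the target are invented and are not part of the sample dataset. The target is kept out of the feature matrix, and `ranking_` shows the order in which features were eliminated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 80
toy = pd.DataFrame({
    'x0': rng.normal(size=n),
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
# Hypothetical target: driven by x0 and x1 only, plus a little noise
toy['TARGET'] = 3.0 * toy['x0'] - 2.0 * toy['x1'] + rng.normal(scale=0.1, size=n)

# Fit on the features only; the target must not be part of X
X = toy.drop(columns=['TARGET'])
y = toy['TARGET'].values

rfe = RFE(SVR(kernel='linear'), n_features_to_select=2)
rfe.fit(X, y)

selected = list(X.columns[rfe.support_])
# ranking_ is 1 for selected features; larger numbers were eliminated earlier
ranking = dict(zip(X.columns, rfe.ranking_))
print(selected)
print(ranking)
```

RFE needs an estimator that exposes `coef_` or `feature_importances_` after fitting, which is why the SVR uses a linear kernel here.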

## Variance Threshold Reduction

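Features that have a low variance across records (in other words, features whose values are nearly the same in every record) are unlikely to contribute much to a model's predictive ability. `VarianceThreshold` keeps only the features whose variance is greater than `vtr_threshold`. The value used here, `0.16 = 0.8 * (1 - 0.8)`, is the variance of a Boolean feature that takes the same value in 80% of the records. Eliminating low-variance features also decreases model training time.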

In [None]:
def calc_vtr(pipeline, df):
    sel = VarianceThreshold(threshold=vtr_threshold)
    # Only the fitted variances are needed; the transformed array is not used
    sel.fit(df)
    # get_support() is a boolean mask over the columns that pass the threshold
    selected_features_vtr = df.columns[sel.get_support()]
    return df[selected_features_vtr]
    
pipeline.add_step(calc_vtr)
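
The threshold `0.8 * (1 - 0.8)` follows from the variance of a Boolean feature, which is `p * (1 - p)` when a fraction `p` of the records are 1. The sketch below checks this on a small invented dataframe; the column names and values are assumptions chosen so that one column falls below the threshold and two lie above it.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    'mostly_zero': [0, 0, 1, 0, 0, 0],   # p = 1/6, variance = 5/36, about 0.139
    'mostly_one':  [0, 1, 0, 1, 1, 1],   # p = 4/6, variance = 8/36, about 0.222
    'balanced':    [1, 0, 0, 1, 0, 1],   # p = 1/2, variance = 0.25
})

threshold = 0.8 * (1 - 0.8)  # 0.16
sel = VarianceThreshold(threshold=threshold)
sel.fit(toy)

# Columns whose variance exceeds the threshold are kept
kept = list(toy.columns[sel.get_support()])
print(dict(zip(toy.columns, sel.variances_.round(3))))
print(kept)
```

Only `mostly_zero` is dropped, because more than 80% of its values are the same. For non-Boolean features the variance depends on the scale of the values, so a single threshold is only meaningful if the features are on comparable scales.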

## Run the Pipeline

In [None]:
feat_sel_ds = pipeline.run(data_frame)

## Display the Results

In [None]:
feat_sel_ds