# SelectByFIRST

We now demonstrate how to use FIRST for factor selection within a `sklearn.pipeline.Pipeline` via the `SelectByFIRST` class. If you have not installed `pyfirst`, uncomment and run `%pip install pyfirst` below before proceeding.

In [1]:
# %pip install pyfirst

## Imports

In [2]:
import numpy as np
from pyfirst import SelectByFIRST
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing, load_breast_cancer

## Regression

### Fetch Data

In [3]:
housing = fetch_california_housing()
X = housing.data
y = np.log(housing.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

### Build Pipeline and Train

In [4]:
pipe = Pipeline([
    ('selector', SelectByFIRST(regression=True, approx_knn=True, random_state=43)),
    ('estimator', RandomForestRegressor(random_state=43))
]).fit(X_train, y_train)
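
To make the role of the selector step concrete, here is a minimal sketch that uses scikit-learn's `SelectKBest` as a stand-in for `SelectByFIRST`. This is an assumption made so the example runs without `pyfirst`; `SelectKBest` is a different selection method, and the synthetic data and `demo_`/`Xd_` names are illustrative only. It shows that slicing a fitted pipeline with `[:-1]` exposes the reduced feature matrix the final estimator actually sees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: 8 factors, only 3 of which drive the response.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=1.0, random_state=43
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=43
)

# Stand-in selector (NOT FIRST): keep the 3 factors with the highest F-score.
demo_pipe = Pipeline([
    ('selector', SelectKBest(score_func=f_regression, k=3)),
    ('estimator', RandomForestRegressor(n_estimators=20, random_state=43)),
]).fit(Xd_train, yd_train)

# Everything except the last step: the selector applied to the test set.
reduced_shape = demo_pipe[:-1].transform(Xd_test).shape
print(reduced_shape)  # (40, 3): 40 test rows, 3 retained factors
```

Because the selector is fitted inside the pipeline, it only ever sees the training split, so the held-out score is not contaminated by the selection step.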

### Test $R^2$ vs Full Model

In [5]:
pipe.score(X_test, y_test)

0.8536755908380326

In [6]:
full = RandomForestRegressor(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)

0.845913188558511

In [7]:
pipe['selector'].get_feature_importance()

array([0.00734335, 0.        , 0.        , 0.        , 0.        ,
       0.01010999, 0.13941309, 0.15473792])

The random forest fitted on the 4 factors identified by FIRST (the nonzero entries above) attains a test $R^2$ of about 0.854, comparable to the 0.846 of the random forest fitted on all 8 factors.
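
To read the importance vector in terms of named factors, the following sketch pairs the values printed above with the California housing feature names. The names are hard-coded in the order documented by scikit-learn so that the snippet is self-contained; in the notebook they are available as `housing.feature_names`. Treating every strictly positive entry as "selected" is an assumption based on the output above.

```python
# Feature order of the California housing dataset as documented by scikit-learn
# (in the notebook: housing.feature_names).
feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']

# Importance values copied from the output of get_feature_importance() above.
importance = [0.00734335, 0.0, 0.0, 0.0, 0.0,
              0.01010999, 0.13941309, 0.15473792]

# Keep factors with nonzero importance, most important first.
ranked = sorted(
    ((name, score) for name, score in zip(feature_names, importance) if score > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:<10s} {score:.4f}")
```

With the values above, the retained factors are `Longitude`, `Latitude`, `AveOccup`, and `MedInc`, in decreasing order of importance.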

## Binary Classification

### Fetch Data

In [8]:
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

### Build Pipeline and Train

In [9]:
pipe = Pipeline([
    ('selector', SelectByFIRST(regression=False, random_state=43)),
    ('estimator', RandomForestClassifier(random_state=43))
]).fit(X_train, y_train)

### Test Accuracy vs Full Model

In [10]:
pipe.score(X_test, y_test)

0.9736842105263158

In [11]:
full = RandomForestClassifier(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)

0.9912280701754386

In [12]:
pipe['selector'].get_feature_importance()

array([0.        , 0.04416144, 0.01358814, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00339703, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00339703, 0.0101911 , 0.14267543,
       0.        , 0.        , 0.        , 0.        , 0.        ])

The random forest fitted on the 6 factors identified by FIRST (the nonzero entries above) attains a test accuracy of about 0.974, close to the 0.991 of the random forest fitted on all 30 factors.
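
As a rough check on the two numbers being compared, the sketch below recovers the indices of the retained factors from the importance vector and converts both accuracies into counts of correctly classified test samples. It assumes the standard 569-sample breast cancer dataset and that `train_test_split` rounds the test-set size up; the variable names are illustrative.

```python
import math

# Importance values copied from the output of get_feature_importance() above.
importance = [
    0.0, 0.04416144, 0.01358814, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.00339703, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.00339703, 0.0101911, 0.14267543,
    0.0, 0.0, 0.0, 0.0, 0.0,
]

# Column indices of the factors with nonzero importance.
selected = [i for i, score in enumerate(importance) if score > 0]
print(selected)

# Test-set size for test_size=0.2 on 569 samples (rounded up).
n_test = math.ceil(0.2 * 569)

# Convert the reported accuracies into numbers of correct predictions.
correct_pipe = round(0.9736842105263158 * n_test)
correct_full = round(0.9912280701754386 * n_test)
print(n_test, correct_pipe, correct_full)
```

Under these assumptions the gap between the two models corresponds to two test samples (111 versus 113 correct out of 114), which puts the difference in accuracy in perspective.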