# Feature Engineering

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data."
Jason Brownlee, [Machine Learning Mastery](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

**Construct new features** using algorithms or domain knowledge

**Select features** 

- **Use statistical tests** to determine each feature's usefulness in predicting the target variable. Rank the features and then select the K best features (Select K Best). (Filter Methods)

- **Use the ML models** to determine each feature's usefulness in predicting the target variable. (Wrapper Methods)


## Lesson Goals 

- We will use SelectKBest to select the top 2 features based on how correlated each feature is with the target variable. 
- We will use Recursive Feature Elimination and a linear regression algorithm to keep the top 2 features based on which features lead to the best performing linear regression model. 


In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings("ignore")

import wrangle_grades
import split_scale

# acquire data and remove null values 
df = wrangle_grades.wrangle_grades()

In [2]:
df = df.drop(columns=['student_id'])

In [3]:
# split into train and test
train, test = split_scale.split_my_data(df)

In [4]:
train.head()

| | exam1 | exam2 | exam3 | final_grade |
|---|---|---|---|---|
| 93 | 85 | 83 | 87 | 87 |
| 35 | 62 | 70 | 79 | 70 |
| 72 | 73 | 70 | 75 | 76 |
| 87 | 62 | 70 | 79 | 70 |
| 81 | 83 | 80 | 86 | 85 |


For the feature engineering methods, we want to use the scaled data:

In [5]:
# scale data using standard scaler
scaler, train, test = split_scale.standard_scaler(train, test)

# to return to original values
# scaler, train, test = split_scale.my_inv_transform(scaler, train, test)

X_train = train.drop(columns='final_grade')
y_train = train[['final_grade']]
X_test = test.drop(columns='final_grade')
y_test = test[['final_grade']]
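
The `wrangle_grades` and `split_scale` helpers are not shown in this lesson. As a rough, self-contained sketch of the same split-then-scale idea, the block below uses synthetic data and plain scikit-learn calls. The data, the 70/30 split, and every `*_demo` name are assumptions made for illustration, not the helpers' actual behavior.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the grades data (assumption: not the real dataset)
rng = np.random.default_rng(123)
exam1 = rng.integers(55, 100, size=100)
df_demo = pd.DataFrame({
    'exam1': exam1,
    'exam2': exam1 + rng.integers(-10, 10, size=100),
    'exam3': exam1 + rng.integers(-5, 5, size=100),
})
exam_mean = df_demo[['exam1', 'exam2', 'exam3']].mean(axis=1)
df_demo['final_grade'] = (exam_mean + rng.normal(0, 2, size=100)).round()

# Split first, so the scaler never sees the test rows
train_demo, test_demo = train_test_split(df_demo, train_size=0.7, random_state=123)

# Learn the scaling parameters on train only, then apply them to both sets
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = pd.DataFrame(scaler_demo.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler_demo.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

X_train_demo = train_scaled.drop(columns='final_grade')
y_train_demo = train_scaled['final_grade']
```

Fitting the scaler on the training rows only keeps information from the test set out of the features we select later.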

____________________________

## Select K Best 

- Filter method
- Provide the algorithm with the test statistic to use to rank the variables. 
- Provide the desired number of features to end up with (K)

### Steps to implement

Behind the scenes, `SelectKBest` with `f_regression` scores each feature in 2 steps:

1. The correlation between each regressor and the target is computed.    
2. It is converted to an F score then to a p-value.    
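
To make these two steps concrete, here is a small sketch that reproduces the `f_regression` scores by hand on synthetic data. It assumes the default `center=True` behavior, where F = r² / (1 - r²) × (n - 2) and the p-value comes from an F distribution with 1 and n - 2 degrees of freedom. The data and the `*_demo` names are made up for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data (assumption): 71 rows, 3 features, target depends on features 0 and 2
rng = np.random.default_rng(0)
n = 71
X_demo = rng.normal(size=(n, 3))
y_demo = 2 * X_demo[:, 0] + 0.5 * X_demo[:, 2] + rng.normal(size=n)

# Step 1: Pearson correlation between each feature and the target
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])

# Step 2: convert each correlation to an F score, then to a p-value
dof = n - 2
f_manual = r ** 2 / (1 - r ** 2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

# Compare with what scikit-learn computes
f_sklearn, p_sklearn = f_regression(X_demo, y_demo)
print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))
```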

____________________________

1. Initialize the f_selector object, which defines the test for scoring the features and the number of features we want to keep (i.e. `k`).

In [6]:
f_selector = SelectKBest(f_regression, k = 2)

2. Fit the object to our data. In doing this, our selector is scoring, ranking, and identifying the top `k` features. 

In [7]:
# running correlation test between each x and y and returning the score, f-statistic

f_selector.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x104635840>)

3. Transform our dataset to reduce to the `k` best features. 

In [9]:
# select the k best features
X2 = f_selector.transform(X_train)

print(X2.shape)
print(X_train.shape)

(71, 2)
(71, 3)


### SelectKBest in one line

We can simplify steps 1-3 in the following way: 

In [13]:
X2 = SelectKBest(f_regression, k = 2).fit_transform(X_train, y_train)
print(X2.shape)

X2[0:2]

(71, 2)


array([[ 0.53924057,  0.39096601],
       [-1.16072052, -0.56211852]])

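We can use the selector's `inverse_transform` method to map the reduced array back to the original column layout. The columns that were dropped are filled with zeros, so the removed values themselves are not recovered.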
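
As a small illustration of what `inverse_transform` gives back, the sketch below fits a selector on synthetic data (an assumption, not the grades data) and checks the shape and contents of the result. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): 10 rows, 3 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(10, 3))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + rng.normal(scale=0.1, size=10)

selector_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

X_reduced = selector_demo.transform(X_demo)               # shape (10, 2)
X_restored = selector_demo.inverse_transform(X_reduced)   # shape (10, 3)

# The dropped column comes back as zeros; the kept columns are unchanged
dropped = ~selector_demo.get_support()
print(X_restored.shape)
print(X_restored[:, dropped].ravel())
```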

Get a list of the features selected using `get_support`.

### Get list of features selected

0. Use `.get_support` to get a list of booleans, a mask for the feature names or columns in X_train. 
1. Use `.loc` with our mask to subset to the features selected
2. Use `.columns` to get the column names
3. Convert the values to a list using `.tolist()`. 

In [11]:
f_support = f_selector.get_support()
f_support

array([ True, False,  True])

In [17]:
f_feature = X_train.loc[:,f_support].columns.tolist()
f_feature

['exam1', 'exam3']
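
Beyond the boolean mask, a fitted selector also stores the score and p-value for every feature in `scores_` and `pvalues_`. The sketch below puts them side by side with the mask. It uses synthetic data that borrows the lesson's column names; the values and the `*_demo` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption) with the same column names as the lesson
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(71, 3)),
                      columns=['exam1', 'exam2', 'exam3'])
y_demo = 2 * X_demo['exam1'] + X_demo['exam3'] + rng.normal(size=71)

selector_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# One row per feature: its F score, p-value, and whether it was kept
summary = pd.DataFrame({
    'feature': X_demo.columns,
    'f_score': selector_demo.scores_,
    'p_value': selector_demo.pvalues_,
    'selected': selector_demo.get_support(),
}).sort_values('f_score', ascending=False)

# Same idea as the .loc approach above, indexing the columns directly
selected_features = X_demo.columns[selector_demo.get_support()].tolist()
print(summary)
```

Looking at the scores can help judge how close the call was between the last feature kept and the first one dropped.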

### Summary of SelectKBest

We used the `SelectKBest` method to select the top `k` features. The features are scored and ranked using a statistical test; in this case, we used the f-regression test. 


## Recursive Feature Elimination

- wrapper method
- iterative & computationally expensive
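- often more accurate than SelectKBest, because features are evaluated together in a model rather than scored one at a time, and more flexible, because any estimator that exposes coefficients or feature importances can be used
- ranks the variables: selected features get a rank of 1 and a support value of `True`; eliminated features get a rank greater than 1 and a support value of `False`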


These are the steps we will take to implement RFE:  

1. Initialize the linear regression object. `sklearn.linear_model.LinearRegression`  
2. Initialize the RFE object. `sklearn.feature_selection.RFE`  
3. Fit the RFE object to our data. `rfe.fit()`  
4. Transform our X dataframe to include only those 2 features. `rfe.transform()`  

Optional: Get list of features selected and/or ranking of all variables  

1. Initialize the linear regression object

In [18]:
lm = LinearRegression()

2. Initialize the RFE object, setting the hyperparameters to be our linear regression object created above (as the algorithm to test the features on) and the number of features to return to be 2.   

In [19]:
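# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=2)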

3. Fit the RFE object to our data. This means fitting the linear regression model on all of the features, dropping the feature with the smallest coefficient in absolute value, and repeating on the remaining features until only the requested number is left. Those are the features we want.   

In [20]:
rfe.fit(X_train, y_train)

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False),
  n_features_to_select=2, step=1, verbose=0)

4. Transform our X dataframe to include only those 2 features. `.transform()` *or do both of those steps together with `.fit_transform()`*

In [22]:
X_rfe = rfe.transform(X_train)
X_rfe[0:2]

array([[ 0.53924057,  0.39096601],
       [-1.16072052, -0.56211852]])

When we move on to modeling, we will use this reduced X array as our feature set. As a sneak peek...

In [23]:
lm.fit(X_rfe, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### RFE all together

In [25]:
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=2)
X_rfe = rfe.fit_transform(X_train, y_train)

For extra fun, and as a sneak peek, let's build the model with those features and make predictions...

In [26]:
# fit on the target column only, so the cell still works if it is re-run
y_train['predicted_final_grade'] = lm.fit(X_rfe, y_train[['final_grade']]).predict(X_rfe)
y_train

| | final_grade | predicted_final_grade |
|---|---|---|
| 93 | 0.559281 | 0.505315 |
| 35 | -1.059549 | -1.026019 |
| 72 | -0.488197 | -0.499429 |
| 87 | -1.059549 | -1.026019 |
| 81 | 0.368830 | 0.364153 |
| ... | ... | ... |
| 85 | -0.488197 | -0.499429 |
| 19 | 1.130632 | 1.091416 |
| 100 | -1.059549 | -1.026019 |
| 94 | 0.368830 | 0.364153 |
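
The fitted RFE object can also be applied to held-out data, so that the test set ends up with the same columns as the training set. The sketch below shows one way this might look, using synthetic data and hypothetical `*_demo` names rather than the lesson's `X_test`; treat it as an illustration under those assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): target depends on features 0 and 2
rng = np.random.default_rng(5)
X_all = rng.normal(size=(100, 3))
y_all = 2 * X_all[:, 0] + 1.5 * X_all[:, 2] + rng.normal(scale=0.5, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, train_size=0.7,
                                          random_state=123)

# Learn which features to keep from the training data only
rfe_demo = RFE(LinearRegression(), n_features_to_select=2)
X_tr_rfe = rfe_demo.fit_transform(X_tr, y_tr)

# Apply the same column selection to the test data (no refitting)
X_te_rfe = rfe_demo.transform(X_te)

# Fit on the reduced training set, then score on the reduced test set
model_demo = LinearRegression().fit(X_tr_rfe, y_tr)
test_r2 = model_demo.score(X_te_rfe, y_te)
print(X_te_rfe.shape, round(test_r2, 3))
```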


### Get list of features selected and ranking of features

For a list of the features that remain, use `.support_` (like `.get_support()` with `SelectKBest`).

In [29]:
mask = rfe.support_
rfe_features = X_train.loc[:,mask].columns.tolist()
rfe_features

['exam1', 'exam3']

Get ranking of the features using `rfe.ranking_`. 

- Features that were selected get a rank of 1. 
- Features that were eliminated get ranks of 2 and higher; the larger the rank, the earlier the feature was dropped. 

In [32]:
var_ranks = rfe.ranking_
var_names = X_train.columns.tolist()

pd.DataFrame({'Feature': var_names, 'Rank': var_ranks})

| | Feature | Rank |
|---|---|---|
| 0 | exam1 | 1 |
| 1 | exam2 | 2 |
| 2 | exam3 | 1 |
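
To show where these ranks come from, the sketch below walks through the elimination loop by hand and compares the result with `RFE`. It assumes a linear model whose squared coefficients serve as feature importances and one feature dropped per round. The synthetic data and the `manual_rfe` helper are hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): feature 1 contributes very little
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(71, 3))
y_demo = (2 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=71))

def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(model.coef_ ** 2))  # smallest squared coefficient
        eliminated.append(remaining.pop(weakest))
    # Kept features get rank 1; the last feature dropped gets rank 2, and so on
    ranking = np.ones(X.shape[1], dtype=int)
    for rank, col in enumerate(reversed(eliminated), start=2):
        ranking[col] = rank
    return ranking

manual_ranking = manual_rfe(X_demo, y_demo, n_keep=2)
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
print(manual_ranking, rfe_demo.ranking_)
```

Because the coefficients are compared with each other directly, the features should be on a common scale, which is one reason the lesson works with scaled data.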


### Summary of RFE

Here we used a LinearRegression model and asked RFE to keep 2 features, which produced the ranking above. The choice of 2 was arbitrary, though. If you would like to learn how to find the optimum number of features, i.e. the number for which model performance is best, see the extended lesson in the appendix of ds.codeup.com.  
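
One possible way to choose the number of features, which may differ from what the appendix lesson does, is scikit-learn's `RFECV`. It runs RFE inside cross-validation and keeps the number of features with the best average score. The sketch below uses synthetic data; the 5 folds and the `r2` scoring are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 5 features, only features 0 and 2 matter
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 5))
y_demo = 2 * X_demo[:, 0] + 1.5 * X_demo[:, 2] + rng.normal(scale=0.5, size=100)

# Cross-validated RFE: tries each feature count and scores it with R^2
rfecv_demo = RFECV(LinearRegression(), step=1, cv=5, scoring='r2')
rfecv_demo.fit(X_demo, y_demo)

print(rfecv_demo.n_features_)   # number of features chosen by cross-validation
print(rfecv_demo.support_)      # mask of the features that were kept
print(rfecv_demo.ranking_)      # 1 for kept features, higher for dropped ones
```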

## Summary

- We used SelectKBest to select the top 2 features based on how correlated each feature is with the target variable. We ended up with exam1 and exam3.    
- We used RFE and a linear regression algorithm to keep the top 2 features based on which features lead to the best performing linear regression model. This eliminated exam2 and also left us with exam1 and exam3, like SelectKBest. 