# Feature Engineering

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data."
Jason Brownlee, [Machine Learning Mastery](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

For example, you can construct new features out of existing features, select the best features, remove the worst features, penalize features by giving them little or no weight in the model, and transform features.

Some feature engineering methods include:  

- Construct new features: Use domain knowledge, create products of features, etc. 

- Use statistical tests to determine each feature's usefulness in predicting the target variable. Rank the features and then select the `K` best features (Select K Best). 

- Recursively remove attributes to meet the number of required features and then build a model on those attributes that remain, to see if you can match or improve performance with a smaller subset (Recursive Feature Elimination).   
- Recursively remove the worst-performing features one by one until the overall performance of the model falls within an acceptable range (Backward Elimination).

- Incorporate features one by one, starting from the predictor that exhibits the highest correlation with the dependent variable. Variables of greater theoretical importance are entered first. Once in the equation, the variable remains there (Forward Selection). 

- many, many more...

**In general, we want to use scaled data for the methods discussed in this lesson.**
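
Scaling matters most for methods that compare coefficient magnitudes, because a coefficient's size depends on the units of its feature. The sketch below shows one common way to scale, assuming `MinMaxScaler` and made-up columns; the `wrangle` module used in this lesson may scale differently. The scaler is fit on the training split only and then applied to the other split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw features on very different scales (illustration only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "age": rng.integers(15, 23, size=100),
    "absences": rng.integers(0, 76, size=100),
})

demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

# Fit the scaler on train only, then reuse it on test to avoid leakage.
scaler = MinMaxScaler()
demo_train_scaled = pd.DataFrame(
    scaler.fit_transform(demo_train),
    columns=demo_train.columns,
    index=demo_train.index,
)
demo_test_scaled = pd.DataFrame(
    scaler.transform(demo_test),
    columns=demo_test.columns,
    index=demo_test.index,
)

print(demo_train_scaled.describe().loc[["min", "max"]])
```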

In [1]:
import pandas as pd
import numpy as np
import wrangle
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Here's the source for the dataset and data dictionary https://archive.ics.uci.edu/ml/datasets/student+performance
path = "https://gist.githubusercontent.com/ryanorsinger/55ccfd2f7820af169baea5aad3a9c60d/raw/da6c5a33307ed7ee207bd119d3361062a1d1c07e/student-mat.csv"

df, X_train_explore, \
    X_train_scaled, y_train, \
    X_validate_scaled, y_validate, \
    X_test_scaled, y_test = wrangle.wrangle_student_math(path)

## Select K Best 

Select K Best is a *filter* method: each feature is scored against the target variable on its own, before any model is built, and only the highest-scoring features are kept. Because every feature is scored independently, `SelectKBest` does not check whether two of the selected features are highly correlated with each other; if you want to drop one of a redundant pair, that is a separate step. With filter methods, the model is built after selecting the features, using only that subset of the data.

`SelectKBest` will identify the `K` most relevant features and subset the data with only those features. Relevancy is determined by the test statistic for the chosen function or test (Chi-squared, F-regression, etc.). For regression, we will use the f-regression test to score the individual effect of each of the features (aka regressors). 

In [3]:
from sklearn.feature_selection import SelectKBest, f_regression

1. Initialize the f_selector object, setting the parameters, or instructions for the method to follow: "use the f_regression test for scoring the features, and return to me the top 2 features", for example.

In [4]:
f_selector = SelectKBest(f_regression, k=2)

2. Fit the object to our data. In doing this, our selector scores, ranks, and identifies the top `k` features.

In [5]:
f_selector.fit(X_train_scaled, y_train)

3. Transform our dataset to reduce to the `k` best features. 

In [6]:
X_reduced = f_selector.transform(X_train_scaled)

print(X_train_scaled.shape)
print(X_reduced.shape)

(221, 41)
(221, 2)


We can simplify Steps 1-3 in the following way: 

In [7]:
X_reduced2 = SelectKBest(f_regression, k=2).fit_transform(X_train_scaled, y_train)
print(X_reduced2.shape)

(221, 2)


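We can use the `inverse_transform` function to return to the original number of columns. It does not recover the dropped data, though: the columns that were not selected are filled with zeros.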
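
As a quick illustration of what `inverse_transform` gives back, here is a sketch on synthetic data (the columns `a` through `d` are assumptions for illustration). The restored array has the original width, with zeros in the positions of the features that were dropped.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: only columns a and c drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
y = 2 * X["a"] + X["c"] + rng.normal(scale=0.1, size=50)

selector = SelectKBest(f_regression, k=2).fit(X, y)

X_small = selector.transform(X)                    # 2 columns
X_restored = selector.inverse_transform(X_small)   # back to 4 columns

mask = selector.get_support()  # True where a feature was kept
print(X_small.shape, X_restored.shape)
print(X_restored[:3])
```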

Let's say we want a list of the features we have selected. Why? Maybe we want to run various feature selection methods and keep track of how many times each feature was selected. Note that `transform` returns a NumPy array by default, so the reduced data no longer carries column names. Instead, we can call `get_feature_names_out()` on the selector once it has been fit; no transform is needed.

In [8]:
f_selector.get_feature_names_out()

array(['G1', 'G2'], dtype=object)

To summarize, we used the `SelectKBest` method to select the top `k` features which were scored and ranked using the f-regression statistical test.
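
If you would rather keep a dataframe with its column names instead of a bare array, one option is to use the selector's boolean mask from `get_support()` to subset the original columns. This is a sketch on synthetic data; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: only columns a and d drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y = 2 * X["a"] - 3 * X["d"] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(f_regression, k=2).fit(X, y)

# Boolean mask over the original columns: True means "selected".
mask = selector.get_support()
selected_columns = X.columns[mask].tolist()

# Subset the original dataframe so the column names are preserved.
X_selected = X[selected_columns]
print(X_selected.head())
```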

## Recursive Feature Elimination

Recursive Feature Elimination is a *wrapper* method for feature selection. This means that it works by using the output of a machine learning algorithm as the evaluation criteria for eliminating features; in the case of linear regression, it uses the resulting coefficients.

You feed all the features to the selected machine learning algorithm, and features are removed according to the hyperparameters you have set. **One word of caution...this is an iterative and computationally expensive process!** The pro is that, because the model weighs the features together rather than scoring each one in isolation, it *often finds a better subset than `SelectKBest`*, though this is not guaranteed.

RFE recursively removes attributes and then builds a model on those attributes that remain. The RFE method takes the machine learning algorithm to be used and the number of required features as input. It returns the ranking of all the variables, `1` being the most important, along with its support: a list of boolean values, `True` indicating relevant features and `False` indicating irrelevant features.

These are the steps we will take to implement RFE:  

1. Initialize the linear regression object. `sklearn.linear_model.LinearRegression`  
2. Initialize the RFE object. `sklearn.feature_selection.RFE`  
3. Fit the RFE object to our data. `rfe.fit()`  
4. Transform our X dataframe to include only `n` number of features. `rfe.transform()`  
5. Optional: Get a list of features selected.
6. Optional: Get the ranking of all variables.

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

1. Initialize the linear regression object

In [10]:
lm = LinearRegression()

2. Initialize the RFE object, setting the hyperparameters to be our linear regression object created above (as the algorithm to test the features on) and the number of features to return to be 2.   

In [11]:
rfe = RFE(lm, n_features_to_select=2)

3. Fit the RFE object to our data. This fits a linear regression on all the features, drops the feature with the smallest coefficient magnitude, refits on the features that remain, and repeats until only the requested number of features is left. Those will be the features we want.

4. Transform our X dataframe to include only those `n` (here, 2) features with `.transform()`, *or do both of those steps together with `.fit_transform()`*.

In [12]:
# Transforming data using RFE
X_rfe = rfe.fit_transform(X_train_scaled,y_train)  

We would then use our new, reduced X as the one to move forward with for actual modeling. As a sneak peek...

In [13]:
#Fitting the data to model
lm.fit(X_rfe,y_train)

5. If we want a list of the features that remain, we can use `get_feature_names_out()`, just like `SelectKBest`. 

In [14]:
rfe_features = rfe.get_feature_names_out()

In [15]:
print(str(len(rfe_features)), 'selected features')
print(rfe_features)

2 selected features
['G1' 'G2']


6. We can also get a ranking of the features using `rfe.ranking_`. The selected features all get a rank of `1`, so since we asked for 2 features to remain, the top two features will both have a rank of `1`. The eliminated features are ranked by how late they were removed: the last feature to be eliminated has a rank of `2`, the one before it a rank of `3`, and so on. Because RFE removes one feature per iteration by default (`step=1`), each eliminated feature gets its own rank.

In [16]:
var_ranks = rfe.ranking_
var_names = X_train_scaled.columns.tolist()

pd.DataFrame({'Var': var_names, 'Rank': var_ranks})

|   | Var | Rank |
|---|---|---|
| 0 | age | 3 |
| 1 | Medu | 13 |
| 2 | Fedu | 15 |
| 3 | traveltime | 5 |
| 4 | studytime | 34 |
| 5 | failures | 9 |
| 6 | famrel | 4 |
| 7 | freetime | 39 |
| 8 | goout | 18 |
| 9 | Dalc | 19 |


Here we used a `LinearRegression` model and asked RFE for 2 features, which produced the ranking above. However, the choice of `2` was arbitrary.

## Summary

### SelectKBest

Select the `K` best features using a statistical test to compare each X with y and find which X's have the strongest relationship with y. For regression, we will use the F-test provided by `f_regression`, which is derived from each feature's correlation with y, to score the relationships.

1. **Initialize the f_selector object**, setting the parameters, or instructions for the method to follow: "use the *f_regression* test for scoring the features, and return to me the top *10* features", for example. 
2. **Fit the object to our data.** That is, run a correlation test for every X variable with our y variable, and then rank the X variables based on how correlated they are with the y/target variable. Then give me the top *10* features. 
3. **Use get_feature_names_out()** to get the list of features, and save them to a variable that you can use to filter your dataframe in modeling. 

In [17]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 10 features
f_selector = SelectKBest(f_regression, k=10)

# find the top 10 X's correlated with y
f_selector.fit(X_train_scaled, y_train)

# save top 10 features
f_features = f_selector.get_feature_names_out()
f_features

array(['age', 'Medu', 'Fedu', 'traveltime', 'failures', 'G1', 'G2',
       'sex_M', 'guardian_other', 'higher_yes'], dtype=object)

### Recursive Feature Elimination

Recursive Feature Elimination will create a model with all the features, rank the features by importance (the coefficients, in the case of linear regression), remove the weakest feature, then create a new model with the remaining features, rank them again, remove the weakest, and so on, until it gets down to the number of features you asked for when creating the RFE object. You will also need to indicate which machine learning algorithm you want to use.

1. **Initialize the machine learning algorithm**, in this case, LinearRegression
2. **Initialize the RFE object**, and provide the ML algorithm object from Step 1
3. **Fit the RFE object to our data.** Doing this will provide us with a list of features (the number we asked for) as well as a ranking of all the features. 
4. **Assign the list** of selected features to a variable. 
5. **Optional:** Get a ranking of all variables (1 being the most important)

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# initialize the ML algorithm
lm = LinearRegression()

# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with. 
rfe = RFE(lm, n_features_to_select=2)

# fit the data using RFE
rfe.fit(X_train_scaled,y_train)  

# get list of the column names. 
rfe_feature = rfe.get_feature_names_out()
rfe_feature

array(['G1', 'G2'], dtype=object)

In [19]:
# view list of columns and their ranking

# get the ranks
var_ranks = rfe.ranking_
# get the variable names
var_names = X_train_scaled.columns.tolist()
# combine ranks and names into a df for clean viewing
rfe_ranks_df = pd.DataFrame({'Var': var_names, 'Rank': var_ranks})
# sort the df by rank
rfe_ranks_df.sort_values('Rank').head(10)

|   | Var | Rank |
|---|---|---|
| 14 | G2 | 1 |
| 13 | G1 | 1 |
| 12 | absences | 2 |
| 0 | age | 3 |
| 6 | famrel | 4 |
| 3 | traveltime | 5 |
| 20 | Mjob_health | 6 |
| 22 | Mjob_services | 7 |
| 21 | Mjob_other | 8 |
| 5 | failures | 9 |


## Exercises

Do your work for this exercise in a Jupyter notebook named `feature_engineering` within the `regression-exercises` repo. Add, commit, and push your work.

1. Load the `tips` dataset.

    1. Create a column named `price_per_person`. This should be the total bill divided by the party size.
    1. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? 
    1. Use Select K Best to select the top 2 features for predicting tip amount. What are they?
    1. Use Recursive Feature Elimination to select the top 2 features for tip amount. What are they?
    1. Why do you think Select K Best and Recursive Feature Elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

1. Write a function named `select_kbest` that takes in the predictors (X), the target (y), and the number of features to select (`k`) and returns the names of the top `k` selected features based on the `SelectKBest` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

1. Write a function named `rfe` that takes in the predictors, the target, and the number of features to select. It should return the top `n` features based on the `RFE` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

1. Load the `swiss` dataset and use all the other features to predict Fertility. Find the top 3 features using both Select K Best and Recursive Feature Elimination (use the functions you just built to help you out).