# Feature Engineering

[Lesson link](https://ds.codeup.com/regression/feature-engineering/#selectkbest)

In [1]:
# feature eng imports
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE  # Recursive Feature Elimination

# other imports
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import sys
warnings.filterwarnings('ignore')
import pydataset

# set a default theme for all my visuals
sns.set_theme(style="whitegrid")

sys.path.append("./util_")
# Personal libraries
import prepare_


**Load the tips dataset.**

In [2]:
# import tips data set
tips = pydataset.data("tips")
tips.head()

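| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |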



**Create a column named price_per_person. This should be the total bill divided by the party size.**

In [3]:
tips["price_per_person"] = tips.total_bill / tips["size"]
tips.head(2)

| | total_bill | tip | sex | smoker | day | time | size | price_per_person |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 8.495 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.446667 |


**Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?**
- `total_bill`, `size`, and `price_per_person`

**Split data**

In [4]:
train, validate, test = prepare_.split_data_(tips, random_state=59)
train.shape, validate.shape, test.shape

((146, 8), (49, 8), (49, 8))
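
`prepare_.split_data_` comes from a personal library that is not shown here. The shapes above are consistent with a 60/20/20 split, so the sketch below is one hypothetical way such a helper could be written with scikit-learn's `train_test_split`. The function name `split_60_20_20` and the demo dataframe are assumptions for illustration, not the actual helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df, random_state=59):
    """Hypothetical stand-in for prepare_.split_data_: a 60/20/20 split."""
    # hold out 20% of all rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=random_state
    )
    # 25% of the remaining 80% is 20% of the original rows
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=random_state
    )
    return train, validate, test


# demo dataframe with the same number of rows as tips (244)
demo = pd.DataFrame({"x": np.arange(244)})
train_demo, validate_demo, test_demo = split_60_20_20(demo)
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

With 244 rows this gives 146, 49, and 49 rows, matching the row counts in the output above.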

### SelectKBest

**Use select k best to select the top 2 features for predicting tip amount. What are they?**

- Scores each feature in isolation against the target; with `f_regression` the score is an F-statistic derived from the feature's correlation with the target.
- Fastest of the approaches covered in this lesson.
- Doesn't consider feature interactions.
- After fitting: `.scores_`, `.pvalues_`, `.get_support()`, and `.transform()`.
- `k`: number of top features to select.
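
To make the "each feature in isolation" point concrete, here is a minimal sketch on synthetic data (the data and variable names are made up for illustration). It rebuilds the `f_regression` scores from each feature's Pearson correlation `r` with the target, using `F = r² / (1 - r²) * (n - 2)`, which is how the score is computed with the default centering.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# synthetic data: feature 0 matters most, feature 2 is pure noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# univariate F-statistics and p-values, one per feature
f_scores, p_values = f_regression(X, y)

# rebuild the same scores from each feature's correlation with y
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_from_r = r**2 / (1 - r**2) * (n - 2)

print(f_scores)
print(f_from_r)
print(np.allclose(f_scores, f_from_r))
```

Because the score depends only on one feature's correlation with the target, ranking by F-score is the same as ranking by absolute correlation.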

In [5]:
# separate features from target
xtrain = train.select_dtypes("number").drop(columns="tip")
ytrain = train.tip

In [6]:
# parameters: f_regression stats test, give me 2 features
# f_regression: univariate F-test of the linear relationship between each feature and the target
kbest = SelectKBest(f_regression, k=2)

# FIT the thing
kbest.fit(xtrain, ytrain)

In [7]:
# statistical f-value / feature's scores:
kbest.scores_

array([101.53446278,  55.49449347,  16.46791013])

In [8]:

# p value: 
kbest.pvalues_

array([2.10685424e-18, 7.93436360e-12, 8.08264694e-05])

In [9]:
kbest.get_support()

array([ True,  True, False])

In [10]:
# select the top 2 features for predicting tip amount
# (feature_names_in_ lists every input column, so read the selected names instead)
top = kbest.get_feature_names_out()
top

array(['total_bill', 'size'], dtype=object)

### Recursive Feature Elimination

**Use recursive feature elimination to select the top 2 features for tip amount. What are they?**

- Progressively eliminates features based on their importance to the model.
- Requires a model with either a `.coef_` or a `.feature_importances_` attribute.
- After fitting: `.ranking_`, `.get_support()`, and `.transform()`.
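
The elimination loop can be sketched by hand. The snippet below is a simplified illustration on synthetic data (not scikit-learn's internal code): it refits a linear model, drops the feature with the smallest absolute coefficient, and repeats until two features remain, then compares the result with `RFE`.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: only columns 0 and 2 influence the target
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 4.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# manual elimination: refit, drop the weakest coefficient, repeat
remaining = list(range(X.shape[1]))
while len(remaining) > 2:
    model = LinearRegression().fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_)))
    remaining.pop(weakest)

# the same selection with scikit-learn's RFE (one feature dropped per step)
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_selected = np.flatnonzero(rfe_demo.support_).tolist()

print(remaining)
print(rfe_selected)
```

Both approaches keep columns 0 and 2 on this data. Note that the ranking comes from raw coefficient sizes, so features measured on different scales are not directly comparable unless they are scaled first.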

In [11]:
# make a model object to use in the RFE process.
# RFE refits this model repeatedly and uses its coefficients to rank features,
# dropping the weakest one each round until the requested number remains
linear_model = LinearRegression()

# MAKE the thing
rfe = RFE(linear_model, n_features_to_select=2)

# FIT the thing
rfe.fit(xtrain, ytrain)

In [12]:
# Get feature ranking
# Selected features are assigned a rank 1

rfe.ranking_

array([2, 1, 1])

In [13]:
# Dataframe of rankings
pd.DataFrame(
    {'rfe_ranking': rfe.ranking_},
    index = xtrain.columns
)

| | rfe_ranking |
|---|---|
| total_bill | 2 |
| size | 1 |
| price_per_person | 1 |


In [14]:
# get the two best predictors
rfe.get_support()

# or 

rfe.support_

array([False,  True,  True])

In [15]:
# view top feature names
xtrain.columns[rfe.support_]

Index(['size', 'price_per_person'], dtype='object')

In [16]:
# transform our selected features into a dataframe
X_train_RFEtransformed = pd.DataFrame(
    rfe.transform(xtrain),
    index=xtrain.index,
    columns = xtrain.columns[rfe.support_]
)
X_train_RFEtransformed.head()

| | size | price_per_person |
|---|---|---|
| 198 | 4.0 | 10.7775 |
| 178 | 2.0 | 7.24 |
| 146 | 2.0 | 4.175 |
| 183 | 3.0 | 15.116667 |
| 215 | 3.0 | 9.39 |


**Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?**

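- SelectKBest uses univariate statistical tests: each feature is scored on its own against the target, so correlated features can all score well together.
- RFE uses a model-based approach: with `LinearRegression` it ranks features by the size of their coefficients, which depend on the other features in the model and on each feature's units.
- Yes, the answers can change with the number of features. The two methods may agree for some values of k and disagree for others, and when k equals the total number of features both return every feature.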
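
One way to see the difference is to change the units of a feature. The sketch below uses a made-up dataset (names and coefficients are assumptions) where `big_c` is strongly related to the target but measured on a scale of thousands, so its raw coefficient is tiny. SelectKBest should be unaffected by standardizing, while RFE with an unscaled linear model should drop `big_c`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic data: big_c has the strongest effect but the smallest coefficient
rng = np.random.default_rng(3)
n = 1000
X_raw = pd.DataFrame(
    {
        "small_a": rng.normal(size=n),
        "small_b": rng.normal(size=n),
        "big_c": rng.normal(scale=1000.0, size=n),
    }
)
y_syn = (
    1.0 * X_raw["small_a"]
    + 0.3 * X_raw["small_b"]
    + 0.002 * X_raw["big_c"]
    + rng.normal(scale=0.5, size=n)
)

# the same features, standardized to mean 0 and standard deviation 1
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_raw), columns=X_raw.columns
)


def kbest_names(X, y, k):
    # names of the k features with the highest univariate F-scores
    mask = SelectKBest(f_regression, k=k).fit(X, y).get_support()
    return list(X.columns[mask])


def rfe_names(X, y, k):
    # names of the k features that survive recursive elimination
    mask = RFE(LinearRegression(), n_features_to_select=k).fit(X, y).support_
    return list(X.columns[mask])


print("kbest raw:   ", kbest_names(X_raw, y_syn, 2))
print("kbest scaled:", kbest_names(X_scaled, y_syn, 2))
print("rfe raw:     ", rfe_names(X_raw, y_syn, 2))
print("rfe scaled:  ", rfe_names(X_scaled, y_syn, 2))
```

On this data SelectKBest picks `small_a` and `big_c` either way, while RFE picks `small_a` and `small_b` on the raw features and only agrees with SelectKBest after scaling.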

**Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

In [17]:
def get_selectBest(xtrain, ytrain, k):
    # f_regression: univariate F-test scoring each feature against the target
    # k: number of top-scoring features to keep
    kbest = SelectKBest(f_regression, k=k)

    # FIT the thing
    kbest.fit(xtrain, ytrain)
    
    # names of the k selected features, read through the support mask
    # (feature_names_in_[:k] would only return the first k input columns)
    top_k_features = kbest.get_feature_names_out()
    
    return top_k_features

In [18]:
get_selectBest(xtrain,ytrain,2)

array(['total_bill', 'size'], dtype=object)

**Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

In [19]:
def get_recursive_feature_elimination(xtrain, ytrain, n_features):
    # make a model object to use in the RFE process.
    # RFE refits this model repeatedly and uses its coefficients to rank features,
    # dropping the weakest one each round until n_features remain
    linear_model = LinearRegression()

    # MAKE the thing
    rfe = RFE(linear_model, n_features_to_select=n_features)

    # FIT the thing
    rfe.fit(xtrain, ytrain)
    
    # view top feature names
    top_model_features = xtrain.columns[rfe.support_]

    return top_model_features


In [20]:
get_recursive_feature_elimination(xtrain,ytrain, 2)

Index(['size', 'price_per_person'], dtype='object')

**Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).**

In [23]:
swiss = pydataset.data("swiss")
swiss.head()

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
| Franches-Mnt | 92.5 | 39.7 | 5 | 5 | 93.4 | 20.2 |
| Moutier | 85.8 | 36.5 | 12 | 7 | 33.77 | 20.3 |
| Neuveville | 76.9 | 43.5 | 17 | 15 | 5.16 | 20.6 |


**split data**

In [25]:
train, validate, test = prepare_.split_data_(swiss, random_state=95)
train.shape, validate.shape, test.shape

((27, 6), (10, 6), (10, 6))

In [30]:
# separate features from target
# note: these are taken from the full swiss dataframe, not the train split above
xtrain = swiss.select_dtypes("number").drop(columns="Fertility")
ytrain = swiss.Fertility

**SelectKBest**

In [31]:
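# read the selected names from the fitted selector's support mask
kbest = SelectKBest(f_regression, k=3).fit(xtrain, ytrain)
xtrain.columns[kbest.get_support()]

Index(['Examination', 'Education', 'Catholic'], dtype='object')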

**Recursive Feature Elimination**

In [32]:
get_recursive_feature_elimination(xtrain,ytrain, 3)

Index(['Examination', 'Education', 'Infant.Mortality'], dtype='object')