# Feature Engineering Exercises

In [1]:
import pandas as pd
import numpy as np
import wrangle
import warnings


warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression, RFE

1. Load the tips dataset.
- Create a column named `price_per_person`. This should be the total bill divided by the party size.
- Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?
- Use select k best to select the top 2 features for predicting tip amount. What are they?
- Use recursive feature elimination to select the top 2 features for tip amount. What are they?
- Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

In [2]:
from pydataset import data

tips_df = data('tips')
tips_df.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3


In [3]:
tips_df['price_per_person'] = tips_df['total_bill'] / tips_df['size']
tips_df.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,7.003333


In [4]:
pd.isna(tips_df).sum()

total_bill          0
tip                 0
sex                 0
smoker              0
day                 0
time                0
size                0
price_per_person    0
dtype: int64

Before running any selection method, my guess is that the most important features for predicting the tip amount are, in this order: total_bill, day, and time.

In [5]:
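# keep only the numeric predictors; the categorical columns would need encoding before f_regression or RFE could use them
X_tips = tips_df.drop(columns=['tip', 'sex', 'smoker', 'day', 'time'])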
y_tips = tips_df['tip']

In [6]:
# parameters: f_regression stats test, give me the top 2 features
f_selector = SelectKBest(f_regression, k=2)

# score each X against y and keep the 2 with the highest F-scores
f_selector.fit(X_tips, y_tips)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_tips.iloc[:,feature_mask].columns.tolist()

In [7]:
f_feature

['total_bill', 'size']

Using select k best, the top two features for predicting tip amount were total_bill and size (the party size).
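To see why a feature makes the cut, it can help to look at the scores behind the mask. The sketch below uses made-up data rather than the tips dataset; the column names `strong`, `weak`, and `noise` and the coefficients are assumptions chosen for illustration. It reads the `scores_` attribute of a fitted `SelectKBest` to rank every feature by its univariate F-score.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical, not the tips dataset): two informative
# columns and one column of pure noise.
rng = np.random.default_rng(0)
n = 500
demo_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = 3.0 * demo_X["strong"] + 1.5 * demo_X["weak"] + rng.normal(scale=0.5, size=n)

# Fit the selector, then pair each F-score with its column name.
selector = SelectKBest(f_regression, k=2).fit(demo_X, demo_y)
scores = pd.Series(selector.scores_, index=demo_X.columns).sort_values(ascending=False)

# Names of the k columns that were kept, in their original column order.
selected = demo_X.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```

Each score is computed from one feature at a time, so a feature's rank here says nothing about whether it is redundant with another selected feature.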

In [8]:
lm = LinearRegression()

# create the rfe object, indicating the ML object (lm) and the number of features I want to end up with. 
rfe = RFE(lm, n_features_to_select=2)

# fit the data using RFE
rfe.fit(X_tips, y_tips)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_tips.iloc[:,feature_mask].columns.tolist()

rfe_feature

['total_bill', 'price_per_person']

Using recursive feature elimination, the top 2 features selected were total_bill and price_per_person.

- I believe select k best and recursive feature elimination disagree on the second feature because they rank features in different ways. Select k best scores each feature on its own with a univariate F-test (f_regression), which is driven by how strongly that single feature correlates with the target, and keeps the k highest scores. Recursive feature elimination fits the linear model on all of the features together, drops the feature whose coefficient has the smallest magnitude, and refits until only the requested number remain. So select k best chose 'size' because, taken alone, it has the second-strongest relationship with tip. RFE likely dropped it because its coefficient was the smallest once total_bill and price_per_person were in the same model. The features were not scaled here, so the coefficient sizes also depend on each feature's units.

- Yes, the selection changed when I raised the number of features for RFE above 2. X_tips has only three columns, so asking for 3 returns all of them, and both methods then agree.
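The difference between the two methods can be reproduced on synthetic data. This is a minimal sketch under stated assumptions, not a re-run of the tips analysis: the columns `large_scale`, `medium`, and `small`, the helper names `kbest_names` and `rfe_names`, and the coefficients are all made up. One feature is measured in large units, so it correlates strongly with the target but carries a tiny raw coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): "large_scale" explains most of the target
# but is measured in big units, so its raw coefficient is small (0.05).
rng = np.random.default_rng(42)
n = 5000
demo_X = pd.DataFrame({
    "large_scale": rng.normal(scale=100.0, size=n),
    "medium": rng.normal(size=n),
    "small": rng.normal(size=n),
})
demo_y = (0.05 * demo_X["large_scale"] + 1.0 * demo_X["medium"]
          + 0.3 * demo_X["small"] + rng.normal(scale=0.5, size=n))


def kbest_names(x, y, k):
    """Top k columns by univariate F-score."""
    selector = SelectKBest(f_regression, k=k).fit(x, y)
    return x.columns[selector.get_support()].tolist()


def rfe_names(x, y, k):
    """Columns left after RFE with a linear model."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(x, y)
    return x.columns[selector.support_].tolist()


kbest_pick = kbest_names(demo_X, demo_y, 2)
rfe_pick = rfe_names(demo_X, demo_y, 2)

# Standardize the columns so coefficient sizes are comparable, then rerun RFE.
scaled_X = pd.DataFrame(StandardScaler().fit_transform(demo_X), columns=demo_X.columns)
rfe_scaled_pick = rfe_names(scaled_X, demo_y, 2)

print("select k best:", kbest_pick)
print("RFE, raw units:", rfe_pick)
print("RFE, scaled:", rfe_scaled_pick)
```

On this data, select k best keeps `large_scale` while RFE on the raw columns drops it first, because RFE compares raw coefficient magnitudes. After standardizing, the two methods pick the same pair, which suggests scaling is worth trying before reading much into an RFE ranking.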

2. Write a function named `select_kbest` that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the `SelectKBest` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

In [11]:
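def select_kbest(x, y, k):
    # score each predictor against the target with a univariate F-test (k must be passed by keyword)
    f_selector = SelectKBest(f_regression, k=k)
    
    # fit on the arguments passed in, not on the global X_tips / y_tips
    f_selector.fit(x, y)
    
    # boolean mask of whether each column was selected
    feature_mask = f_selector.get_support()
    
    f_feature = x.iloc[:,feature_mask].columns.tolist()
                             
    return f_feature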



In [12]:
select_kbest(X_tips, y_tips, 2)

['total_bill', 'size']

- Yes, the select_kbest function had the same result as the manual method.

3. Write a function named `rfe` that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the `RFE` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

In [13]:
def rfe(x, y, k):
    lm = LinearRegression()

    # create the RFE object, indicating the ML object (lm) and the number of features to end up with (k)
    selector = RFE(lm, n_features_to_select=k)

    # fit on the arguments passed in, not on the global X_tips / y_tips
    selector.fit(x, y)

    # get the mask of the columns selected
    feature_mask = selector.support_

    # get list of the column names
    rfe_feature = x.iloc[:,feature_mask].columns.tolist()

    return rfe_feature

In [14]:
rfe(X_tips, y_tips, 2)

['total_bill', 'price_per_person']

- Yes, the `rfe` function returned the same result as the manual method.
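A fitted `RFE` object also records the order in which features were eliminated, through its `ranking_` attribute. The sketch below uses synthetic data with hypothetical column names (`strong`, `moderate`, `faint`), all on the same scale, and asks RFE to keep a single feature so that every column gets a distinct rank.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): three same-scale columns with
# decreasing influence on the target.
rng = np.random.default_rng(1)
n = 5000
demo_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "moderate": rng.normal(size=n),
    "faint": rng.normal(size=n),
})
demo_y = (2.0 * demo_X["strong"] + 1.0 * demo_X["moderate"]
          + 0.2 * demo_X["faint"] + rng.normal(scale=0.5, size=n))

# Keep one feature so the full elimination order is visible.
selector = RFE(LinearRegression(), n_features_to_select=1).fit(demo_X, demo_y)

# Rank 1 marks the kept feature; larger ranks were eliminated earlier.
ranking = pd.Series(selector.ranking_, index=demo_X.columns).sort_values()

print(ranking)
```

Returning this ranking alongside the selected names would be one way to extend the `rfe` helper, if seeing the elimination order is useful.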

4. Load the `swiss` dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [15]:
swiss_df = data('swiss')
swiss_df.head(3)

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2


In [17]:
# note: the names X_tips / y_tips are reused here, but they now hold the swiss data
X_tips = swiss_df.drop(columns=['Fertility'])
y_tips = swiss_df['Fertility']

In [20]:
select_kbest(X_tips, y_tips, 3)

['Examination', 'Education', 'Catholic']

In [24]:
# rfe helper written so that it uses its own arguments (x, y, k) rather than globals or a fixed feature count
def rfe(x, y, k):
    lm = LinearRegression()

    # create the RFE object, indicating the ML object (lm) and the number of features to end up with (k)
    selector = RFE(lm, n_features_to_select=k)

    # fit the data using RFE
    selector.fit(x, y)

    # get the mask of the columns selected
    feature_mask = selector.support_

    # get list of the column names
    rfe_feature = x.iloc[:,feature_mask].columns.tolist()

    return rfe_feature

In [25]:
rfe(X_tips, y_tips, 3)

['Examination', 'Education', 'Infant.Mortality']