# Exercises
Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

1. Load the tips dataset.

    - Create a column named tip_percentage. This should be the tip amount divided by the total bill.
    - Create a column named price_per_person. This should be the total bill divided by the party size.
    - Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage? 
        - I think tip percentage might be a good predictor, maybe not on its own in a linear regression, but combined with total bill it could be good.
        - But tip percentage is derived from the tip itself, so using it to predict the tip amount leaks the target; I don't think it is a fair predictor.
        - Size of the party might also be a good predictor.
    - Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?
        - Select k best: 'total_bill', 'size'
        - Recursive feature elimination: 'total_bill', 'tip_percentage'
    - Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?
        - Select k best: 'total_bill', 'tip'
        - Recursive feature elimination: 'tip', 'size'
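    - Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?
        - Because select k best scores each feature on its own against the target, while recursive feature elimination fits a model on all of the features together, so it takes into account how the variables work with each other. The answers can change with the number of features selected; when every feature is selected, the two methods necessarily agree.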
    
2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).


In [74]:
import pandas as pd
import numpy as np
import math
#import matplotlib.pyplot as plt
#import seaborn as sns

from pydataset import data


from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectKBest, f_regression, RFE


# set seaborn defaults
#sns.set_palette('plasma')

In [2]:
# load the tips dataset
df = data('tips')

In [23]:
# create new column with tip percentage
df['tip_percentage'] = round((df.tip/ df.total_bill * 100), 2)
df.head(3)

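| | total_bill | tip | sex | smoker | day | time | size | tip_percentage |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.94 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.05 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 16.66 |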


In [24]:
# create new column price per person
df['price_per_person'] = round(df['total_bill'] / df['size'], 2)

In [25]:
df.head(3)

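| | total_bill | tip | sex | smoker | day | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.94 | 8.49 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.05 | 3.45 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 16.66 | 7.0 |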


In [54]:
# remember this: a handy way to drop the target and keep only the numeric columns

df.drop(columns='tip').select_dtypes(include=['float64', 'int64']).head()


| | total_bill | size | tip_percentage | price_per_person |
|---|---|---|---|---|
| 1 | 16.99 | 2 | 5.94 | 8.49 |
| 2 | 10.34 | 3 | 16.05 | 3.45 |
| 3 | 21.01 | 3 | 16.66 | 7.0 |
| 4 | 23.68 | 2 | 13.98 | 11.84 |
| 5 | 24.59 | 4 | 14.68 | 6.15 |


In [79]:
# assign X and y
X_df = df.drop(columns='tip').select_dtypes(include=['float64', 'int64'])
y_df = df['tip']
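
As a side note on the column selection above, `select_dtypes` also accepts the generic `'number'` selector, which avoids listing `float64` and `int64` by hand. The sketch below uses a small hypothetical frame (`toy`) as a stand-in for the tips data, so the values are illustrative only.

```python
import pandas as pd

# hypothetical toy frame standing in for the tips data (not the real dataset)
toy = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
    'size': [2, 3, 3],
})

# drop the target first, then keep every numeric column regardless of its exact dtype
numeric_predictors = toy.drop(columns='tip').select_dtypes(include='number')

print(list(numeric_predictors.columns))
```

Dropping the target before selecting dtypes keeps it from sneaking into the predictors.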

In [80]:
# here's where we do the Select K Best, using score_func=f_regression
# note: k=3 keeps three features here; the exercise asks for the top 2 (k=2)
f_selector = SelectKBest(score_func=f_regression, k=3)
f_selector.fit(X_df, y_df)

SelectKBest(k=3, score_func=<function f_regression at 0x7fa5588f69d0>)

In [67]:
besties = f_selector.get_support()
besties

array([ True,  True, False,  True])

In [68]:
# these are the best features according to Select K Best
X_df.columns[besties]

Index(['total_bill', 'size', 'price_per_person'], dtype='object')
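
`get_support()` only says which columns made the cut. A fitted `SelectKBest` also exposes `scores_` and `pvalues_`, which show how far apart the features were. Here is a minimal sketch on synthetic data; the column names and coefficients are made up for illustration and are not from the tips dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic data (an assumption for illustration): y depends strongly on one
# column, moderately on another, and not at all on the third
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'strong': rng.normal(size=n),
    'moderate': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y = 3 * X['strong'] + 1.5 * X['moderate'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# one F statistic and one p-value per column, each computed independently of the others
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
pvalues = pd.Series(selector.pvalues_, index=X.columns)
selected = list(X.columns[selector.get_support()])

print(scores)
print(selected)
```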

In [81]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_df, y_df)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [82]:
rfe.support_

array([ True, False,  True, False])

In [83]:
X_df.columns[rfe.support_]

Index(['total_bill', 'tip_percentage'], dtype='object')

<hr style="border-top: 10px groove darkmagenta; margin-top: 1px; margin-bottom: 1px"></hr>

In [84]:
X_df = df.drop(columns='tip_percentage').select_dtypes(include=['float64', 'int64'])
y_df = df['tip_percentage']

In [71]:
f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X_df, y_df)

SelectKBest(k=2, score_func=<function f_regression at 0x7fa5588f69d0>)

In [72]:
besties = f_selector.get_support()
besties

array([ True,  True, False, False])

In [87]:
X_df.columns[besties]

Index(['total_bill', 'tip'], dtype='object')

In [90]:
list(X_df.columns[besties])

['total_bill', 'tip']

In [85]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_df, y_df)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [86]:
X_df.columns[rfe.support_]

Index(['tip', 'size'], dtype='object')
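
One likely reason the two selectors disagree: `f_regression` scores each feature by its own relationship with the target, while `RFE` with `LinearRegression` drops the feature with the smallest coefficient, and coefficient size depends on the units of each column. The sketch below illustrates this on synthetic data (the column names `big_units` and `small_units` are hypothetical); it is not a claim about what drives the tips results specifically.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic data (an assumption for illustration): big_units explains most of y
# but is measured on a large scale, so its raw coefficient is tiny
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'big_units': rng.normal(scale=1000, size=n),
    'small_units': rng.normal(scale=1, size=n),
})
y = 0.01 * X['big_units'] + 1.0 * X['small_units'] + rng.normal(scale=1, size=n)

# univariate F-test: picks the column most strongly related to y
kbest = SelectKBest(score_func=f_regression, k=1).fit(X, y)
kbest_pick = list(X.columns[kbest.get_support()])

# RFE on the raw columns: keeps the column with the larger coefficient
rfe_raw = RFE(estimator=LinearRegression(), n_features_to_select=1).fit(X, y)
rfe_raw_pick = list(X.columns[rfe_raw.support_])

# RFE after standardizing: coefficients are now comparable across columns
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
rfe_scaled = RFE(estimator=LinearRegression(), n_features_to_select=1).fit(X_scaled, y)
rfe_scaled_pick = list(X.columns[rfe_scaled.support_])

print(kbest_pick, rfe_raw_pick, rfe_scaled_pick)
```

In this sketch the raw-scale RFE keeps `small_units` while select k best and the scaled RFE keep `big_units`, which suggests scaling the predictors before running RFE with a linear model.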

<hr style="border-top: 10px groove darkmagenta; margin-top: 1px; margin-bottom: 1px"></hr>
2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [91]:
def select_kbest(X, y, k, score_func=f_regression):
    '''
    takes in the predictors (X), the target (y), and the number of features to select (k) 
    and returns the names (in a list) of the top k selected features based on the SelectKBest class
    Optional arg: score_func. Default is f_regression. other options ex: f_classif 
    '''
    # create selector
    f_selector = SelectKBest(score_func=score_func, k=k)
    
    #fit to X and y
    f_selector.fit(X, y)
    
    # return the list of the column names that are the top k selected features
    return list(X.columns[f_selector.get_support()])

In [92]:
# little test using X and y from before
my_cols = select_kbest(X_df, y_df, 2)

In [93]:
my_cols

['total_bill', 'tip']

<hr style="border-top: 10px groove darkmagenta; margin-top: 1px; margin-bottom: 1px"></hr>
3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [126]:
def rfe(X, y, n, estimator=None):
    '''
    takes in the predictors (X), the target (y), and the number of features to select (n) 
    and returns the names (in a list) of the top n selected features based on the Recursive Feature Elimination class
    Optional arg: estimator. Default is LinearRegression()
    '''
    # fall back to a fresh LinearRegression when no estimator is passed in
    if estimator is None:
        estimator = LinearRegression()
    
    # set up the selector with the estimator and n features
    # (named selector so it does not shadow the function name)
    selector = RFE(estimator=estimator, n_features_to_select=n)
    
    # fit to X and y
    selector.fit(X, y)
    
    # return the list of the selected column names
    return list(X.columns[selector.support_])


In [127]:
# test the function
my_cols = rfe(X_df, y_df, 2)
my_cols

['tip', 'size']
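
The function above returns only the surviving columns. A fitted `RFE` object also has a `ranking_` attribute that shows the order in which the other features were eliminated (selected features get rank 1). A minimal sketch on synthetic data, with made-up column names and coefficients:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data (an assumption for illustration): three same-scale columns
# with true coefficients 5, 2, and 0
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    'coef_5': rng.normal(size=n),
    'coef_2': rng.normal(size=n),
    'coef_0': rng.normal(size=n),
})
y = 5 * X['coef_5'] + 2 * X['coef_2'] + rng.normal(scale=1, size=n)

selector = RFE(estimator=LinearRegression(), n_features_to_select=1).fit(X, y)

# rank 1 = kept; higher ranks were eliminated earlier
ranking = pd.Series(selector.ranking_, index=X.columns).sort_values()

print(ranking)
```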

<hr style="border-top: 10px groove darkmagenta; margin-top: 1px; margin-bottom: 1px"></hr>
4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [99]:
swiss_df = data('swiss')

In [100]:
swiss_df.head()

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
| Franches-Mnt | 92.5 | 39.7 | 5 | 5 | 93.4 | 20.2 |
| Moutier | 85.8 | 36.5 | 12 | 7 | 33.77 | 20.3 |
| Neuveville | 76.9 | 43.5 | 17 | 15 | 5.16 | 20.6 |


In [101]:
swiss_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [112]:
# set up X and y
X_swiss = swiss_df.drop(columns='Fertility')
y_swiss = swiss_df.Fertility

In [109]:
# use function from above to select 3 best using Select K Best
swiss_k_best = select_kbest(X_swiss, y_swiss, 3)
swiss_k_best

['Examination', 'Education', 'Catholic']

In [128]:
# use function from above to select 3 best using RFE
swiss_rfe_best = rfe(X_swiss, y_swiss, 3)
swiss_rfe_best

['Examination', 'Education', 'Infant.Mortality']
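
To see at a glance where the two methods agree, the two result lists can be compared with set operations. The lists below are copied by hand from the outputs above rather than recomputed, so this is just a small sketch of the comparison step.

```python
# results copied from the select_kbest and rfe outputs above
swiss_k_best = ['Examination', 'Education', 'Catholic']
swiss_rfe_best = ['Examination', 'Education', 'Infant.Mortality']

# features both methods picked, and features unique to each method
picked_by_both = sorted(set(swiss_k_best) & set(swiss_rfe_best))
only_k_best = sorted(set(swiss_k_best) - set(swiss_rfe_best))
only_rfe = sorted(set(swiss_rfe_best) - set(swiss_k_best))

print(picked_by_both)
print(only_k_best)
print(only_rfe)
```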