# Exercises

Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

1. Load the tips dataset.  
    a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.  
    b. Create a column named price_per_person. This should be the total bill divided by the party size.  
    c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?  
    d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features.   
    What are they?  
    e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features.   
    What are they?  
    f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?  
    
2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [2]:
# 1. load tips dataset from pydataset
df = data('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [4]:
# 1.a and 1.b
df['tip_percentage'] = df.tip/df.total_bill
df['price_per_person'] = df.total_bill/df['size']
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,6.1475


In [5]:
# 1.c 
# personally I tip based on percentage of total bill, so I think that feature will be most important

In [6]:
# it makes sense to convert object type columns to numeric values at this point
# create a mask to identify the object columns
mask = np.array(df.dtypes == 'object')
# create a df using the mask
objdf = df.iloc[:, mask]
# get dummies
dummy_df = pd.get_dummies(objdf, dummy_na=False, drop_first=True)
# put the dummies with the original
df = pd.concat([df, dummy_df], axis=1)
# drop the columns from the original we now have dummies for
df.drop(columns=objdf.columns, inplace=True)
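
As a small illustration of the encoding step above, the sketch below runs `pd.get_dummies` with `drop_first=True` on a made-up three-row frame (the values are invented for the example, not taken from tips). The first level of each categorical column is dropped, which is why the tips data ends up with a sex_Male column but no sex_Female column.

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "sex": ["Female", "Male", "Male"],
    "day": ["Fri", "Sat", "Sun"],
})

# encode the listed columns; drop_first=True drops the first level of each
encoded = pd.get_dummies(toy, columns=["sex", "day"], drop_first=True)
print(encoded.columns.tolist())
```

Passing `columns=` lets `get_dummies` keep the numeric columns and drop the encoded originals in one call, so no separate concat and drop is needed.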

In [7]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   size              244 non-null    int64  
 3   tip_percentage    244 non-null    float64
 4   price_per_person  244 non-null    float64
 5   sex_Male          244 non-null    uint8  
 6   smoker_Yes        244 non-null    uint8  
 7   day_Sat           244 non-null    uint8  
 8   day_Sun           244 non-null    uint8  
 9   day_Thur          244 non-null    uint8  
 10  time_Lunch        244 non-null    uint8  
dtypes: float64(4), int64(1), uint8(6)
memory usage: 12.9 KB


In [8]:
# now split the data into train, validate, test
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
train.shape, validate.shape, test.shape

((136, 11), (59, 11), (49, 11))
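
The shapes above follow from the two-step split. Below is a minimal sketch of the arithmetic, assuming `train_test_split` rounds a fractional `test_size` up to a whole number of rows (the helper name `split_sizes` is made up for this example):

```python
import math

def split_sizes(n_rows, test_size, validate_size):
    """Row counts for a train/validate/test split done in two steps."""
    # first split: hold out the test rows (fraction rounded up)
    n_test = math.ceil(n_rows * test_size)
    n_train_validate = n_rows - n_test
    # second split: hold out validate rows from what is left
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test

print(split_sizes(244, 0.2, 0.3))
```

For the 244 rows of tips this works out to roughly 56% train, 24% validate, and 20% test.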

In [9]:
# 1.d create X and y datasets; only tip (the target) is dropped, so tip_percentage, which is derived from tip, stays in X
X_train = train.drop(columns=['tip'])
X_validate = validate.drop(columns=['tip'])
X_test = test.drop(columns=['tip'])

y_train = train[['tip']]
y_validate = validate[['tip']]
y_test = test[['tip']]



In [10]:
# skipping explore stage

In [11]:
# scaling data, not sure MinMaxScaler is the best one to use here, but proceeding with this one to save time
scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)
# note this returns X_train_scaled as an array

In [12]:
# convert the scaled arrays back to dataframes, keeping the original column names and index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_validate.columns, index=X_validate.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
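
To make the scaling step concrete: `MinMaxScaler` learns each column's minimum and maximum from the data it is fit on and maps values to (x - min) / (max - min). Because it is fit on train only, validate or test values outside the train range land outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up single-column data for illustration only
train_col = np.array([[10.0], [20.0], [30.0]])
validate_col = np.array([[15.0], [40.0]])

scaler = MinMaxScaler().fit(train_col)   # learns min=10 and max=30 from train
train_scaled = scaler.transform(train_col)
validate_scaled = scaler.transform(validate_col)

print(train_scaled.ravel())      # train values fall between 0 and 1
print(validate_scaled.ravel())   # 40 is above the train max, so it scales past 1
```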



In [13]:
# 1.d
# initialize the f selector object, defines the scoring method
f_selector = SelectKBest(f_regression, k=2)
# f_regression is type of test to use, k is the top # of features allowed

In [14]:
# fit the object to X and y train_scaled
# this will score, rank, and ID the top k features
f_selector.fit(X_train_scaled, y_train.tip)
# the .tip avoids the warning that y_train is a dataframe; y_train.tip is a series

SelectKBest(k=2, score_func=<function f_regression at 0x7fc2a1956950>)

In [15]:
# Transform to reduce to the best k features
X_train_reduced = f_selector.transform(X_train_scaled)
print(X_train.shape)
print(X_train_reduced.shape)

(136, 10)
(136, 2)


In [16]:
# 1.d KBest selects total_bill and size
f_support = f_selector.get_support()
# create df with just selected features
X_reduced_scaled = X_train_scaled.iloc[:,f_support]
# this is now ready for modeling
X_reduced_scaled.head()

Unnamed: 0,total_bill,size
19,0.307114,0.4
173,0.092355,0.2
119,0.206805,0.2
29,0.411622,0.2
238,0.657534,0.2


In [17]:
# 1.d start RFE
# initialize the linear regression object
lm = LinearRegression()
# initialize the RFE object; n_features_to_select=2 is the number of features to return
rfe = RFE(lm, n_features_to_select=2)
X_rfe = rfe.fit_transform(X_train_scaled, y_train.tip)
# passing y_train.tip (a series) rather than y_train avoids the pink warning
# save the X_rfe for later, to feed to a model

In [18]:
rfe_mask = rfe.support_
X_reduced_scaled_rfe = X_train_scaled.iloc[:,rfe_mask]
X_reduced_scaled_rfe

Unnamed: 0,total_bill,tip_percentage
19,0.307114,0.252863
173,0.092355,1.0
119,0.206805,0.161808
29,0.411622,0.240873
238,0.657534,0.0
208,0.787892,0.061984
184,0.444101,0.362968
61,0.380468,0.181661
42,0.317941,0.162793
161,0.407203,0.188456


In [19]:
# 1.e create X and y datasets; only tip_percentage (the target) is dropped, so tip, which is used to directly derive tip_percentage, stays in X
TPX_train = train.drop(columns=['tip_percentage'])
TPX_validate = validate.drop(columns=['tip_percentage'])
TPX_test = test.drop(columns=['tip_percentage'])

TPy_train = train[['tip_percentage']]
TPy_validate = validate[['tip_percentage']]
TPy_test = test[['tip_percentage']]



In [20]:
# scaling data, not sure MinMaxScaler is the best one to use here, but proceeding with this one to save time
scaler = MinMaxScaler(copy=True).fit(TPX_train)

TPX_train_scaled = scaler.transform(TPX_train)
TPX_validate_scaled = scaler.transform(TPX_validate)
TPX_test_scaled = scaler.transform(TPX_test)
# note this returns TPX_train_scaled as an array

In [21]:
# convert the scaled arrays back to dataframes, keeping the original column names and index
TPX_train_scaled = pd.DataFrame(TPX_train_scaled, columns=TPX_train.columns, index=TPX_train.index)
TPX_validate_scaled = pd.DataFrame(TPX_validate_scaled, columns=TPX_validate.columns, index=TPX_validate.index)
TPX_test_scaled = pd.DataFrame(TPX_test_scaled, columns=TPX_test.columns, index=TPX_test.index)




In [22]:
# 1.e
# initialize the f selector object, defines the scoring method
TPf_selector = SelectKBest(f_regression, k=2)
# f_regression is type of test to use, k is the top # of features allowed

In [23]:
# fit the object to X and y train_scaled
# this will score, rank, and ID the top k features
TPf_selector.fit(TPX_train_scaled, TPy_train.tip_percentage)
# the .tip_percentage avoids the warning that TPy_train is a dataframe; TPy_train.tip_percentage is a series

SelectKBest(k=2, score_func=<function f_regression at 0x7fc2a1956950>)

In [24]:
# Transform to reduce to the best k features
TPX_train_reduced = TPf_selector.transform(TPX_train_scaled)
print(TPX_train.shape)
print(TPX_train_reduced.shape)

(136, 10)
(136, 2)


In [25]:
# 1.e KBest selects tip and price_per_person
TPf_support = TPf_selector.get_support()
# create df with just selected features
TPX_reduced_scaled = TPX_train_scaled.iloc[:,TPf_support]
# this is now ready for modeling
TPX_reduced_scaled.head()

Unnamed: 0,tip,price_per_person
19,0.3125,0.150344
173,0.51875,0.032258
119,0.1,0.182796
29,0.4125,0.452194
238,0.02125,0.775647


In [26]:
# 1.e start RFE
# initialize the linear regression object
TPlm = LinearRegression()
# initialize the RFE object; n_features_to_select=2 is the number of features to return
TPrfe = RFE(TPlm, n_features_to_select=2)
# fit the new TPrfe object (not the rfe object from 1.d)
TPX_rfe = TPrfe.fit_transform(TPX_train_scaled, TPy_train.tip_percentage)
# passing TPy_train.tip_percentage (a series) avoids the pink warning
# save the TPX_rfe for later, to feed to a model

In [27]:
TPrfe_mask = TPrfe.support_
TPX_reduced_scaled_rfe = TPX_train_scaled.iloc[:,TPrfe_mask]
TPX_reduced_scaled_rfe

Unnamed: 0,total_bill,tip
19,0.307114,0.3125
173,0.092355,0.51875
119,0.206805,0.1
29,0.411622,0.4125
238,0.657534,0.02125
208,0.787892,0.25
184,0.444101,0.6875
61,0.380468,0.27625
42,0.317941,0.1925
161,0.407203,0.3125


1.f 
The algorithms work differently behind the scenes:

Select K Best is a filter method: it scores each feature individually against the target and keeps the k features with the highest scores. Relevancy is determined by the test statistic for the chosen function or test (Chi-squared, F-regression, etc.). For regression, we use the f-regression test to score the individual effect of each of the features (aka regressors). Because each feature is scored on its own, SelectKBest does not account for features that are highly correlated with each other, so it can keep two redundant features.   

RFE is a wrapper method: it fits a model on all of the features, removes the least important one (for linear regression in scikit-learn, the one with the smallest coefficient magnitude), and repeats on the features that remain until only the requested number is left. The RFE method takes the machine learning algorithm to be used and the number of required features as input. It returns the ranking of all the variables, 1 being most important, along with its support: a list of boolean values, True indicating selected features and False indicating eliminated features.

Because SelectKBest judges features one at a time and RFE judges them together inside a model, the two can disagree, as in 1.d, where they agree on total_bill but differ on the second feature (size vs. tip_percentage). As the number of features selected approaches the total number of features, the two selections necessarily overlap more.
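
The difference between the two approaches can be sketched on synthetic data. Everything below is made up for illustration (the column names x1, x2, x3 and the coefficients are assumptions, not part of the exercise): x2 is a noisy near-copy of x1, and x3 is an independent but weaker signal.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # near-copy of x1 (redundant)
x3 = rng.normal(size=n)                   # independent, weaker signal
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 2 * x1 + x3 + rng.normal(scale=0.1, size=n)

# filter method: each feature is scored on its own
kbest = SelectKBest(f_regression, k=2).fit(X, y)
kbest_features = X.columns[kbest.get_support()].tolist()

# wrapper method: features are judged together inside a linear model
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_features = X.columns[rfe.support_].tolist()

print(kbest_features)
print(rfe_features)
print(dict(zip(X.columns, rfe.ranking_)))
```

On this synthetic data, SelectKBest should keep both x1 and its near-copy x2, since each correlates strongly with y on its own, while RFE should drop x2 because it adds almost nothing once x1 is in the model.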

In [31]:
# 2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number 
# of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. 
# Test your function with the tips dataset. You should see the same results as when you did the process manually.

def select_kbest(features, target, min_num_features):
    # initialize the f selector object, defines the scoring method
    f_selector = SelectKBest(f_regression, k=min_num_features)
    # fit the object to the features and target passed in
    f_selector.fit(features, target)
    # boolean mask of the selected features
    f_support = f_selector.get_support()
    # subset the features passed in (not the global X_train_scaled) to the selected columns
    X_reduced_scaled = features.iloc[:,f_support]
    return X_reduced_scaled.columns.tolist()

In [35]:
Kbest2 = select_kbest(X_train_scaled, y_train, 2)
Kbest2

  y = column_or_1d(y, warn=True)


['total_bill', 'size']

In [37]:
# 3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. 
# It should return the top k features based on the RFE class. 
# Test your function with the tips dataset. You should see the same results as when you did the process manually.

def select_rfe(features, target, min_num_features):
    # initialize the linear regression object
    lm = LinearRegression()
    # initialize the RFE object
    rfe = RFE(lm, n_features_to_select=min_num_features)
    X_rfe = rfe.fit_transform(features, target)
    rfe_mask = rfe.support_
    X_reduced_scaled_rfe = features.iloc[:,rfe_mask]
    return X_reduced_scaled_rfe.columns.tolist()

In [38]:
rfebest2 = select_rfe(X_train_scaled, y_train, 2)
rfebest2


  y = column_or_1d(y, warn=True)


['total_bill', 'tip_percentage']

In [40]:
# 4. Load the swiss dataset and use all the other features to predict Fertility. 
# Find the top 3 features using both select k best and recursive feature elimination
# (use the functions you just built to help you out).

swiss = data('swiss')
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [41]:
swiss.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [43]:
# all data is numeric, will apply MinMaxScaler to scale 
# split data before scaling
train_validate, test = train_test_split(swiss, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)

In [44]:
X_train = train.drop(columns=['Fertility'])
X_validate = validate.drop(columns=['Fertility'])
X_test = test.drop(columns=['Fertility'])

y_train = train[['Fertility']]
y_validate = validate[['Fertility']]
y_test = test[['Fertility']]

In [45]:
scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

In [46]:
# convert the scaled arrays back to dataframes, keeping the original column names and index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_validate.columns, index=X_validate.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

In [48]:
Kbestswiss = select_kbest(X_train_scaled, y_train, 3)
Kbestswiss

  y = column_or_1d(y, warn=True)


['Examination', 'Catholic', 'Infant.Mortality']

In [49]:
rfebestswiss = select_rfe(X_train_scaled, y_train, 3)
rfebestswiss

  y = column_or_1d(y, warn=True)


['Agriculture', 'Examination', 'Infant.Mortality']
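
To see at a glance where the two methods agree on the swiss data, the two lists can be compared as sets. A minimal sketch, with the lists copied from the two outputs above (the variable names are made up for this example):

```python
# feature lists copied from the two outputs above
kbest_swiss = ["Examination", "Catholic", "Infant.Mortality"]
rfe_swiss = ["Agriculture", "Examination", "Infant.Mortality"]

shared = sorted(set(kbest_swiss) & set(rfe_swiss))
only_kbest = sorted(set(kbest_swiss) - set(rfe_swiss))
only_rfe = sorted(set(rfe_swiss) - set(kbest_swiss))

print("both:", shared)
print("SelectKBest only:", only_kbest)
print("RFE only:", only_rfe)
```

Both methods keep Examination and Infant.Mortality; they differ only on the third feature (Catholic for SelectKBest, Agriculture for RFE).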