# Exercises
Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

1. Load the tips dataset.

    - Create a column named price_per_person. This should be the total bill divided by the party size.
    - Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?
    - Use select k best to select the top 2 features for predicting tip amount. What are they?
    - Use recursive feature elimination to select the top 2 features for tip amount. What are they?
    - Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [23]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import math

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from pydataset import data
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the tips dataset
df = data('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


a. Create a column named price_per_person. This should be the total bill divided by party size.

In [3]:
# bracket access is required for 'size': df.size is the DataFrame's element count
df['price_per_person'] = (df.total_bill / df['size']).round(2)

In [4]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,8.49
2,10.34,1.66,Male,No,Sun,Dinner,3,3.45
3,21.01,3.5,Male,No,Sun,Dinner,3,7.0
4,23.68,3.31,Male,No,Sun,Dinner,2,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,6.15


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
dtypes: float64(3), int64(1), object(4)
memory usage: 17.2+ KB


In [6]:
df.day.value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

In [7]:
# encode object features
df['sex_encoded'] = df.sex.map({'Female':1, 'Male':0})
df['smoker_encoded'] = df.smoker.map({'Yes':1, 'No':0})
df['day_encoded'] = df.day.map({'Thur':4, 'Fri':5, 'Sat':6, 'Sun':7})
df['time_encoded'] = df.time.map({'Lunch':1, 'Dinner':0})
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,sex_encoded,smoker_encoded,day_encoded,time_encoded
1,16.99,1.01,Female,No,Sun,Dinner,2,8.49,1,0,7,0
2,10.34,1.66,Male,No,Sun,Dinner,3,3.45,0,0,7,0
3,21.01,3.5,Male,No,Sun,Dinner,3,7.0,0,0,7,0
4,23.68,3.31,Male,No,Sun,Dinner,2,11.84,0,0,7,0
5,24.59,3.61,Female,No,Sun,Dinner,4,6.15,1,0,7,0
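
The day mapping above turns the weekday into the integers 4 through 7, which tells a linear model that Sun is "more" than Thur. One common alternative is one-hot encoding. The sketch below is only an illustration on a tiny hypothetical sample (the `sample` frame and its values are made up, not rows taken from this notebook), using `pd.get_dummies`.

```python
import pandas as pd

# Tiny hypothetical sample that reuses the tips column names.
sample = pd.DataFrame({
    "total_bill": [16.99, 27.20, 20.65, 28.97],
    "day": ["Sun", "Thur", "Sat", "Fri"],
})

# One indicator column per day; drop_first removes the first category
# (alphabetically 'Fri') so the remaining columns are not redundant.
encoded = pd.get_dummies(sample, columns=["day"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in every day column then stands for the dropped category.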


In [8]:
# drop the original object columns now that they are encoded
df = df.drop(columns=['sex', 'smoker', 'day', 'time'])

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   size              244 non-null    int64  
 3   price_per_person  244 non-null    float64
 4   sex_encoded       244 non-null    int64  
 5   smoker_encoded    244 non-null    int64  
 6   day_encoded       244 non-null    int64  
 7   time_encoded      244 non-null    int64  
dtypes: float64(3), int64(5)
memory usage: 17.2 KB


In [16]:
# split tips dataframe
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
train.shape, validate.shape, test.shape

((136, 8), (59, 8), (49, 8))
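
The shapes above follow from how `train_test_split` sizes a fractional `test_size`: it rounds the test portion up and gives the rest to train. A minimal sketch of that arithmetic, assuming the same 20% / 30% fractions used in this notebook (the helper name `expected_split_sizes` is hypothetical):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split


def expected_split_sizes(n_rows, test_size=0.2, validate_size=0.3):
    """Row counts for a two-stage train/validate/test split."""
    n_test = math.ceil(n_rows * test_size)            # test portion rounds up
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test


# Check the arithmetic against an actual split of 244 placeholder rows.
rows = np.arange(244)
train_validate, test = train_test_split(rows, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
actual = (len(train), len(validate), len(test))

print(expected_split_sizes(244), actual)
```

So the train set ends up with roughly 56% of the rows (0.8 × 0.7), not 70%.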

In [17]:
# split train into X, y 
X_train = train.drop(columns=['tip'])
y_train = train['tip']

X_validate = validate.drop(columns=['tip'])
y_validate = validate['tip']

X_test = test.drop(columns=['tip'])
y_test = test['tip']

X_train.head()

Unnamed: 0,total_bill,size,price_per_person,sex_encoded,smoker_encoded,day_encoded,time_encoded
19,16.97,3,5.66,1,0,7,0
173,7.25,2,3.62,0,1,7,0
119,12.43,2,6.22,1,0,4,1
29,21.7,2,10.85,0,0,6,0
238,32.83,2,16.42,0,1,6,0


b. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?
- I believe total_bill will be the most important feature, since a tip is usually figured as a percentage of the bill.

c. Use select k best to select the top 2 features for predicting tip amount. What are they?

In [18]:
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f89f9f898b0>)

In [19]:
kbest_results = pd.DataFrame(dict(p=kbest.pvalues_, f=kbest.scores_), index=X_train.columns)
kbest_results

Unnamed: 0,p,f
total_bill,7.18647e-20,115.984909
size,1.341642e-12,61.259089
price_per_person,0.001306594,10.783502
sex_encoded,0.2844794,1.154792
smoker_encoded,0.5579978,0.344909
day_encoded,0.1045855,2.670276
time_encoded,0.1821449,1.798647


In [20]:
X_train.columns[kbest.get_support()]

Index(['total_bill', 'size'], dtype='object')
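
The f and p columns in the table above come from `f_regression`, which scores each feature on its own, ignoring the others. As a rough sketch on synthetic data (not the tips data; the array names are illustrative), the same numbers can be reproduced from the Pearson correlation r between each feature and the target, using F = r² / (1 − r²) × (n − 2):

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data: three features, only the first two drive y.
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

f_sklearn, p_sklearn = f_regression(X, y)

# Manual version: one Pearson correlation per feature, then the F statistic.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
dof = n - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

print(np.allclose(f_sklearn, f_manual), np.allclose(p_sklearn, p_manual))
```

Because each score depends only on one feature's correlation with the target, two strongly correlated features (such as total_bill and size) can both rank highly even if they carry overlapping information.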

In [21]:
X_train_transformed = pd.DataFrame(
    kbest.transform(X_train),
    index=X_train.index,
    columns=X_train.columns[kbest.get_support()]
)
X_train_transformed.head()

Unnamed: 0,total_bill,size
19,16.97,3.0
173,7.25,2.0
119,12.43,2.0
29,21.7,2.0
238,32.83,2.0


d. Use recursive feature elimination to select the top 2 features for tip amount. What are they?

In [24]:
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X_train, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [25]:
pd.DataFrame({'rfe_ranking': rfe.ranking_}, index=X_train.columns)

Unnamed: 0,rfe_ranking
total_bill,1
size,3
price_per_person,2
sex_encoded,1
smoker_encoded,6
day_encoded,5
time_encoded,4


In [26]:
X_train.columns[rfe.get_support()]

Index(['total_bill', 'sex_encoded'], dtype='object')

In [27]:
X_train_transformed = pd.DataFrame(
    rfe.transform(X_train),
    index=X_train.index,
    columns=X_train.columns[rfe.support_]
)
X_train_transformed.head()

Unnamed: 0,total_bill,sex_encoded
19,16.97,1.0
173,7.25,0.0
119,12.43,1.0
29,21.7,0.0
238,32.83,0.0


e. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?
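
One plausible explanation: select k best scores each feature by itself with a univariate F-test, while RFE fits the model on all remaining features and repeatedly drops the one with the smallest coefficient magnitude. With unscaled inputs, a 0/1 column can carry a larger coefficient than a wide-ranging column, which may be why sex_encoded survived RFE above. The two methods must agree when k equals the total number of features, and can disagree for any smaller k. The sketch below uses synthetic data (not the tips dataset); the column names `bill` and `flag` and the helpers `top_kbest` and `top_rfe` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; names and coefficients are made up for illustration.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "bill": rng.normal(20, 10, n),   # wide range, small per-unit effect
    "flag": rng.integers(0, 2, n),   # 0/1 indicator, large per-unit effect
})
y = 0.1 * X["bill"] + 1.0 * X["flag"] + rng.normal(0, 0.3, n)


def top_kbest(X, y, k):
    selector = SelectKBest(f_regression, k=k).fit(X, y)
    return X.columns[selector.get_support()].tolist()


def top_rfe(X, y, k):
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    return X.columns[selector.get_support()].tolist()


kbest_pick = top_kbest(X, y, 1)   # strongest univariate relationship
rfe_pick = top_rfe(X, y, 1)       # largest coefficient on raw units

# Standardizing puts the coefficients on a comparable scale before RFE.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X), columns=X.columns, index=X.index
)
rfe_scaled_pick = top_rfe(X_scaled, y, 1)

print(kbest_pick, rfe_pick, rfe_scaled_pick)
```

On this synthetic data, select k best and scaled RFE keep `bill`, while RFE on the raw columns keeps `flag`, so scaling the features before RFE is worth trying when the two methods disagree.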

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [29]:
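def select_kbest(X, y, k):
    """Return the names of the top k features ranked by f_regression."""
    kbest = SelectKBest(f_regression, k=k)
    kbest.fit(X, y)
    # get_support() is a boolean mask over the columns of X
    return X.columns[kbest.get_support()]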

In [30]:
# Test your function with the tips dataset. You should see the same results as when you did the process manually.
select_kbest(X_train, y_train, 2)

Index(['total_bill', 'size'], dtype='object')

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [31]:
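def select_rfe(X, y, k):
    """Return the names of the top k features chosen by RFE with a linear model."""
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=k)
    rfe.fit(X, y)
    # get_support() is a boolean mask over the columns of X
    return X.columns[rfe.get_support()]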

In [32]:
select_rfe(X_train, y_train, 2)

Index(['total_bill', 'sex_encoded'], dtype='object')

4. Load the swiss dataset and use all the other features to predict Fertility. 
Find the top 3 features using both select k best and recursive feature elimination 
(use the functions you just built to help you out).

In [33]:
df = data('swiss')
df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [34]:
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
train.shape, validate.shape, test.shape

((25, 6), (12, 6), (10, 6))

In [36]:
# split train into X, y 
X_train = train.drop(columns=['Fertility'])
y_train = train['Fertility']

X_validate = validate.drop(columns=['Fertility'])
y_validate = validate['Fertility']

X_test = test.drop(columns=['Fertility'])
y_test = test['Fertility']

X_train.head()

Unnamed: 0,Agriculture,Examination,Education,Catholic,Infant.Mortality
Rolle,60.8,16,10,7.72,16.3
Lavaux,73.0,19,9,2.84,20.0
Nyone,50.9,22,12,15.14,16.7
Conthey,85.9,3,2,99.71,15.1
Yverdon,49.5,15,8,6.1,22.5


Find the top 3 features using both select k best and recursive feature elimination 
(use the functions you just built to help you out).

In [40]:
select_kbest(X_train, y_train, 3)

Index(['Examination', 'Catholic', 'Infant.Mortality'], dtype='object')

In [39]:
select_rfe(X_train, y_train, 3)

Index(['Agriculture', 'Examination', 'Infant.Mortality'], dtype='object')