# Feature Engineering Exercises

Do your work for this exercise in a Jupyter notebook named feature_engineering within the regression-exercises repo. Add, commit, and push your work.

1. Load the tips dataset.
    * a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.
    * b. Create a column named price_per_person. This should be the total bill divided by the party size.
    * c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?
    * d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?
    * e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?
    * f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from pydataset import data
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')

## 1. Load the tips dataset.

In [2]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [4]:
df.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

### a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [5]:
df['tip_percentage'] = round(df.tip / df.total_bill, 3)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059
1,10.34,1.66,Male,No,Sun,Dinner,3,0.161
2,21.01,3.5,Male,No,Sun,Dinner,3,0.167
3,23.68,3.31,Male,No,Sun,Dinner,2,0.14
4,24.59,3.61,Female,No,Sun,Dinner,4,0.147


### b. Create a column named price_per_person. This should be the total bill divided by the party size.

In [6]:
df['price_per_person'] = round(df.total_bill / df['size'], 2)
df.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059,8.49
1,10.34,1.66,Male,No,Sun,Dinner,3,0.161,3.45
2,21.01,3.5,Male,No,Sun,Dinner,3,0.167,7.0


### c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

> I think the most useful features for predicting tip could be total_bill, day, and size.
> 
> tip_percentage is computed directly from tip, so using it to predict tip would leak the target. We won't use tip_percentage as a predictor.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   total_bill        244 non-null    float64 
 1   tip               244 non-null    float64 
 2   sex               244 non-null    category
 3   smoker            244 non-null    category
 4   day               244 non-null    category
 5   time              244 non-null    category
 6   size              244 non-null    int64   
 7   tip_percentage    244 non-null    float64 
 8   price_per_person  244 non-null    float64 
dtypes: category(4), float64(4), int64(1)
memory usage: 11.1 KB


> First I'll convert the categorical columns into numerical features.

In [8]:
for col in df[['sex', 'smoker', 'day', 'time']]:
    print(df[col].value_counts())
    print("\n")

Male      157
Female     87
Name: sex, dtype: int64


No     151
Yes     93
Name: smoker, dtype: int64


Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64


Dinner    176
Lunch      68
Name: time, dtype: int64




In [9]:
# encode variables
df['is_male'] = df.sex.map({'Male': 1, 'Female': 0})
df.smoker = df.smoker.map({'Yes': 1, 'No': 0})
# encode and scale days
#df.day = df.day.map({'Thur': -1, 'Fri': -0.333, 'Sat': .333, 'Sun': 1})
df.day = df.day.map({'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3})
df['dinner'] = df.time.map({'Dinner': 1, 'Lunch': 0})

df.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person,is_male,dinner
0,16.99,1.01,Female,0,3,Dinner,2,0.059,8.49,0,1
1,10.34,1.66,Male,0,3,Dinner,3,0.161,3.45,1,1
2,21.01,3.5,Male,0,3,Dinner,3,0.167,7.0,1,1


In [10]:
# drop unneeded features
df = df.drop(columns=['sex', 'tip_percentage', 'time'])
df.head()

Unnamed: 0,total_bill,tip,smoker,day,size,price_per_person,is_male,dinner
0,16.99,1.01,0,3,2,8.49,0,1
1,10.34,1.66,0,3,3,3.45,1,1
2,21.01,3.5,0,3,3,7.0,1,1
3,23.68,3.31,0,3,2,11.84,1,1
4,24.59,3.61,0,3,4,6.15,0,1


In [11]:
df.dtypes

total_bill           float64
tip                  float64
smoker              category
day                 category
size                   int64
price_per_person     float64
is_male             category
dinner              category
dtype: object

> Variables are ready to convert to ints and floats.

In [12]:
df.smoker = df.smoker.astype('int64')
df.day = df.day.astype('float64')
df.is_male = df.is_male.astype('int64')
df.dinner = df.dinner.astype('int64')

df.dtypes

total_bill          float64
tip                 float64
smoker                int64
day                 float64
size                  int64
price_per_person    float64
is_male               int64
dinner                int64
dtype: object

> Now we have a numeric dataframe to work with, but we need to split the data before continuing.

In [13]:
train_validate, test = train_test_split(df, test_size=.2, random_state=666)
train, validate = train_test_split(train_validate, test_size=.3, random_state=666)

print('train:', train.shape, '|', 'validate:', validate.shape, '|', 'test:', test.shape)

train: (136, 8) | validate: (59, 8) | test: (49, 8)
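
> The same two-step split is repeated later for tip_percentage, so it could be wrapped in a helper. This is a minimal sketch: `split_data` and the toy frame are hypothetical names for illustration, and the proportions and seed are copied from the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, seed=666):
    """Split df into train, validate, and test using the same ratios as above."""
    # hold out 20% of the rows for test
    train_validate, test = train_test_split(df, test_size=.2, random_state=seed)
    # hold out 30% of what is left for validate
    train, validate = train_test_split(train_validate, test_size=.3, random_state=seed)
    return train, validate, test


# toy frame with the same number of rows as the tips data (244)
toy = pd.DataFrame({'a': range(244), 'b': range(244)})
train_demo, validate_demo, test_demo = split_data(toy)
print('train:', train_demo.shape, '|', 'validate:', validate_demo.shape, '|', 'test:', test_demo.shape)
```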


> Next, split into X & y dataframes.

In [14]:
# x df's are all cols except tip
X_train = train.drop(columns=['tip'])
X_validate = validate.drop(columns=['tip'])
X_test = test.drop(columns=['tip'])

# y df's are just tip
y_train = train[['tip']]
y_validate = validate[['tip']]
y_test = test[['tip']]

> Scale the data

In [15]:
scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

In [16]:
# rebuild DataFrames so the scaled arrays keep the original column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)

X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_validate.columns, index=X_validate.index)

X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
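
> Scaling and re-wrapping could also be done in one step. A minimal sketch, assuming the scaler should be fit on the training split only; `scale_frames` and the demo frames are hypothetical and not part of the exercise.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_frames(X_train, *others):
    """Fit MinMaxScaler on X_train only; return scaled DataFrames with original labels."""
    scaler = MinMaxScaler().fit(X_train)
    return [
        pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)
        for X in (X_train, *others)
    ]


# tiny made-up frames, just to show that column names and index survive scaling
demo_train = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0], 'size': [1, 2, 5]}, index=[7, 3, 9])
demo_test = pd.DataFrame({'total_bill': [25.0], 'size': [3]}, index=[4])

demo_train_scaled, demo_test_scaled = scale_frames(demo_train, demo_test)
print(demo_train_scaled)
print(demo_test_scaled)
```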

> Data is ready for feature selections.

### d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?

> ### KBest feature selection:

<img align="left" width="100" height="100" src="algorithm_icon.png">

In [18]:
# initialize f_selector object
f_selector = SelectKBest(f_regression, k=2)
# fit
f_selector = f_selector.fit(X_train_scaled, y_train.tip)

In [19]:
# transform data to reduce to K best features
X_train_reduced = f_selector.transform(X_train_scaled)

print(X_train.shape)
print(X_train_reduced.shape)

(136, 7)
(136, 2)


In [20]:
# mask to get selected features
f_support = f_selector.get_support()
print(f_support)

[ True False False  True False False False]


In [21]:
# create dataframe with just the selected features
X_reduced_scaled = X_train_scaled.iloc[:,f_support]
X_reduced_scaled.head(3)

Unnamed: 0,total_bill,size
187,0.54838,0.8
45,0.278296,0.2
217,0.129605,0.2


In [22]:
# features selected using kbest
X_reduced_scaled.columns.tolist()

['total_bill', 'size']
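
> To see how close the ranking was, the fitted selector also exposes `scores_` and `pvalues_`. A minimal sketch on synthetic data (the column names `strong`, `weak`, and `noise` are made up here), not the tips results:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
# the target depends heavily on 'strong', slightly on 'weak', and not at all on 'noise'
y_demo = 3 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# one row per feature: univariate F score, p-value, and whether it was kept
scores = pd.DataFrame({
    'f_score': selector.scores_,
    'p_value': selector.pvalues_,
    'selected': selector.get_support(),
}, index=X_demo.columns).sort_values('f_score', ascending=False)
print(scores)
```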

> ### RFE feature selection:

<img align="left" width="100" height="100" src="algorithm_icon.png">

In [23]:
# init lm object and set hyperparameters
lm = LinearRegression()
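# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=2)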

In [24]:
# fit and transform
X_rfe = rfe.fit_transform(X_train_scaled, y_train.tip)

In [25]:
# save the mask for selected features
mask = rfe.support_

In [26]:
# apply mask
X_reduced_scaled_rfe = X_train_scaled.iloc[:,mask]
X_reduced_scaled_rfe.head(3)

In [27]:
# features selected using rfe
X_reduced_scaled_rfe.columns.tolist()

['total_bill', 'price_per_person']
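
> RFE also records the order in which features were dropped in its `ranking_` attribute (1 means kept). A minimal sketch on synthetic data with made-up columns `a` through `d`, not the tips results:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
X_demo = pd.DataFrame(rng.uniform(size=(n, 4)), columns=['a', 'b', 'c', 'd'])
# only 'a' and 'b' drive the target; 'c' and 'd' are unrelated
y_demo = 5 * X_demo['a'] + 2 * X_demo['b'] + rng.normal(scale=0.1, size=n)

rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# rank 1 = selected; higher ranks were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```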

### e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?

> Re-prepare and split the data

In [29]:
df = sns.load_dataset('tips')
df['tip_percentage'] = round(df.tip / df.total_bill, 3)
df['price_per_person'] = round(df.total_bill / df['size'], 2)
# encode variables
df['is_male'] = df.sex.map({'Male': 1, 'Female': 0})
df.smoker = df.smoker.map({'Yes': 1, 'No': 0})
# encode and scale days
#df.day = df.day.map({'Thur': -1, 'Fri': -0.333, 'Sat': .333, 'Sun': 1})
df.day = df.day.map({'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3})
df['dinner'] = df.time.map({'Dinner': 1, 'Lunch': 0})
df.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person,is_male,dinner
0,16.99,1.01,Female,0,3,Dinner,2,0.059,8.49,0,1
1,10.34,1.66,Male,0,3,Dinner,3,0.161,3.45,1,1
2,21.01,3.5,Male,0,3,Dinner,3,0.167,7.0,1,1


In [30]:
df.smoker = df.smoker.astype('int64')
df.day = df.day.astype('float64')
df.is_male = df.is_male.astype('int64')
df.dinner = df.dinner.astype('int64')

df.dtypes

total_bill           float64
tip                  float64
sex                 category
smoker                 int64
day                  float64
time                category
size                   int64
tip_percentage       float64
price_per_person     float64
is_male                int64
dinner                 int64
dtype: object

In [31]:
# drop unneeded features
df = df.drop(columns=['sex', 'time', 'tip'])
df.head(3)

Unnamed: 0,total_bill,smoker,day,size,tip_percentage,price_per_person,is_male,dinner
0,16.99,0,3.0,2,0.059,8.49,0,1
1,10.34,0,3.0,3,0.161,3.45,1,1
2,21.01,0,3.0,3,0.167,7.0,1,1


> Split the data

In [32]:
train_validate, test = train_test_split(df, test_size=.2, random_state=666)
train, validate = train_test_split(train_validate, test_size=.3, random_state=666)

print('train:', train.shape, '|', 'validate:', validate.shape, '|', 'test:', test.shape)

train: (136, 8) | validate: (59, 8) | test: (49, 8)


In [33]:
# x df's are all cols except tip_percentage
X_train = train.drop(columns=['tip_percentage'])
X_validate = validate.drop(columns=['tip_percentage'])
X_test = test.drop(columns=['tip_percentage'])

# y df's are just tip_percentage
y_train = train[['tip_percentage']]
y_validate = validate[['tip_percentage']]
y_test = test[['tip_percentage']]

> Scale the data

In [34]:
scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

In [35]:
# rebuild DataFrames so the scaled arrays keep the original column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)

X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_validate.columns, index=X_validate.index)

X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

In [36]:
X_train_scaled.head(3)

Unnamed: 0,total_bill,smoker,day,size,price_per_person,is_male,dinner
187,0.54838,1.0,1.0,0.8,0.222299,1.0,1.0
45,0.278296,0.0,1.0,0.2,0.433518,1.0,1.0
217,0.129605,1.0,0.666667,0.2,0.202216,1.0,1.0


> ### KBest feature selection:

<img align="left" width="100" height="100" src="algorithm_icon.png">

In [37]:
# initialize f_selector object
f_selector = SelectKBest(f_regression, k=2)
# fit
f_selector = f_selector.fit(X_train, y_train.tip_percentage)

In [38]:
# transform data to reduce to K best features
X_train_reduced = f_selector.transform(X_train)

print(X_train.shape)
print(X_train_reduced.shape)

(136, 7)
(136, 2)


In [39]:
# mask to get selected features
f_support = f_selector.get_support()
print(f_support)

[ True False False False  True False False]


In [40]:
# create dataframe with just the selected features
X_reduced = X_train.iloc[:,f_support]
X_reduced.head(3)

Unnamed: 0,total_bill,price_per_person
187,30.46,6.09
45,18.29,9.14
217,11.59,5.8


In [41]:
# features selected using kbest
X_reduced.columns.tolist()

['total_bill', 'price_per_person']
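
> The cells above fit SelectKBest on the unscaled X_train, while part d used X_train_scaled. That should not change which features are picked: f_regression is built on the correlation between each feature and the target, and min-max scaling is a linear rescaling that leaves correlation unchanged. A minimal check on synthetic data (the arrays here are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# three features on very different scales
X_raw = rng.uniform(low=[3, 1, 0], high=[50, 6, 1], size=(100, 3))
y_demo = 0.15 * X_raw[:, 0] + rng.normal(scale=1.0, size=100)

X_minmax = MinMaxScaler().fit_transform(X_raw)

# F scores before and after scaling
f_raw, _ = f_regression(X_raw, y_demo)
f_minmax, _ = f_regression(X_minmax, y_demo)
print(np.allclose(f_raw, f_minmax))
```

> RFE is different: it compares coefficient sizes, which do depend on the scale of each feature, so scaled data matters there.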

> ### RFE feature selection:

<img align="left" width="100" height="100" src="algorithm_icon.png">

In [46]:
# init lm object and set hyperparameters
lm = LinearRegression()
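# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(lm, n_features_to_select=2)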

In [47]:
# fit and transform
X_rfe = rfe.fit_transform(X_train_scaled, y_train.tip_percentage)

In [48]:
# save the mask for selected features
mask = rfe.support_

In [49]:
# apply mask
X_reduced_scaled_rfe = X_train_scaled.iloc[:,mask]
X_reduced_scaled_rfe.head(3)

Unnamed: 0,size,price_per_person
187,0.8,0.222299
45,0.2,0.433518
217,0.2,0.202216


In [50]:
# features selected using rfe
X_reduced_scaled_rfe.columns.tolist()

['size', 'price_per_person']

### f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

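> The two methods rank features in different ways. Select k best scores each feature on its own with a univariate test (here f_regression), so it ignores how the features relate to each other. RFE fits a model on all the features, drops the one with the weakest coefficient, and repeats, so a feature's rank depends on which other features are still in the model. Correlated features such as total_bill, size, and price_per_person are where the two tend to disagree, and the selected sets can shift as the number of features to select changes.
>
> I would probably use RFE to select features unless the number of features is really high, because it refits the model at every step and would take a long time to run.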
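
> A small synthetic example of how the two can disagree. Everything here is made up for illustration: `x2` is pure nuisance and uncorrelated with the target on its own, but it becomes useful alongside `x1` because it cancels the nuisance part of `x1`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)  # the signal
e = rng.normal(size=n)  # nuisance shared by x1 and x2

X_demo = pd.DataFrame({
    'x1': z + e,                              # signal plus nuisance
    'x2': e,                                  # nuisance only
    'x3': z + rng.normal(scale=2.0, size=n),  # noisy copy of the signal
})
y_demo = z + rng.normal(scale=0.1, size=n)

kbest_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

kbest_cols = X_demo.columns[kbest_demo.get_support()].tolist()
rfe_cols = X_demo.columns[rfe_demo.support_].tolist()
print('SelectKBest:', kbest_cols)
print('RFE:        ', rfe_cols)
```

> By construction, select k best should favor x1 and x3 (the two features correlated with the target), while RFE should keep x1 and x2, since together they reproduce the signal almost exactly.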

## 2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class.
### Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [53]:
def select_kbest(X, y, k):
    '''
    Takes in the predictors (X), the target (y), and the number of features to select (k)
    and returns the names of the top k selected features based on the SelectKBest class.
    '''
    # note: this is designed to take in data that is already scaled;
    # X must be a DataFrame so the feature names can be read from its columns

    # initialize and fit the selector (k must be passed by keyword in recent scikit-learn)
    f_selector = SelectKBest(f_regression, k=k).fit(X, y)
    # boolean mask marking the selected features
    f_support = f_selector.get_support()
    # names of the features selected using kbest
    return X.columns[f_support].tolist()

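In [None]:
# test with the tips data prepared above (target: tip_percentage, top 2 features)
select_kbest(X_train_scaled, y_train.tip_percentage, 2)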
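
> Exercise 3 is not worked in this notebook. A minimal sketch of what the `rfe` function could look like, assuming X is a DataFrame and using LinearRegression as the estimator, as in the manual RFE cells earlier. The demo data is synthetic, with made-up columns `a` through `d`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rfe(X, y, k):
    """Return the names of the k features kept by RFE with a LinearRegression estimator."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    # support_ is a boolean mask over the columns of X
    return X.columns[selector.support_].tolist()


# synthetic check: only 'a' and 'c' drive the target
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(rng.uniform(size=(150, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 4 * X_demo['a'] - 3 * X_demo['c'] + rng.normal(scale=0.1, size=150)
print(rfe(X_demo, y_demo, 2))
```

> On the tips data the call would take the same arguments as select_kbest, for example the scaled training predictors, the target column, and 2.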