## Feature Engineering
### Corey Solitaire
#### 8/01/2020

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import warnings
warnings.filterwarnings("ignore")

from evaluate import linear_model

# This is the code for the Linear Model
from statsmodels.formula.api import ols

In [2]:
df = sns.load_dataset("tips")
train_validate, test = train_test_split(df, test_size = .2, random_state = 123)
train, validate = train_test_split(train_validate, test_size = .3, random_state = 123)
train.shape, validate.shape, test.shape

((136, 7), (59, 7), (49, 7))

In [3]:
train.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
18,16.97,3.5,Female,No,Sun,Dinner,3
172,7.25,5.15,Male,Yes,Sun,Dinner,2
118,12.43,1.8,Female,No,Thur,Lunch,2
28,21.7,4.3,Male,No,Sat,Dinner,2
237,32.83,1.17,Male,Yes,Sat,Dinner,2


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136 entries, 18 to 166
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  136 non-null    float64 
 1   tip         136 non-null    float64 
 2   sex         136 non-null    category
 3   smoker      136 non-null    category
 4   day         136 non-null    category
 5   time        136 non-null    category
 6   size        136 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 5.2 KB


In [5]:
train.describe()

Unnamed: 0,total_bill,tip,size
count,136.0,136.0,136.0
mean,18.790515,2.946985,2.544118
std,8.779733,1.456611,0.987834
min,3.07,1.0,1.0
25%,12.645,2.0,2.0
50%,16.71,2.68,2.0
75%,22.7525,3.5,3.0
max,48.33,9.0,6.0


# Exercises

****

## 1. Load the tips dataset.

- Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [6]:
# .assign returns a new DataFrame with the extra column and leaves train unchanged
#train.assign(tip_percentage=train.tip / train.total_bill)

train['tip_percentage'] = round((train.tip / train.total_bill),2)
train

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
18,16.97,3.50,Female,No,Sun,Dinner,3,0.21
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.71
118,12.43,1.80,Female,No,Thur,Lunch,2,0.14
28,21.70,4.30,Male,No,Sat,Dinner,2,0.20
237,32.83,1.17,Male,Yes,Sat,Dinner,2,0.04
...,...,...,...,...,...,...,...,...
233,10.77,1.47,Male,No,Sat,Dinner,2,0.14
6,8.77,2.00,Male,No,Sun,Dinner,2,0.23
7,26.88,3.12,Male,No,Sun,Dinner,4,0.12
115,17.31,3.50,Female,No,Sun,Dinner,2,0.20


- Create a column named price_per_person. This should be the total bill divided by the party size.

In [7]:
train['size'] = train['size'].astype('float')
train.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage
18,16.97,3.5,Female,No,Sun,Dinner,3.0,0.21
172,7.25,5.15,Male,Yes,Sun,Dinner,2.0,0.71
118,12.43,1.8,Female,No,Thur,Lunch,2.0,0.14
28,21.7,4.3,Male,No,Sat,Dinner,2.0,0.2
237,32.83,1.17,Male,Yes,Sat,Dinner,2.0,0.04


In [10]:
# Use bracket access: train.size is the DataFrame attribute (total element count), not the column
train['price_per_person'] = train.total_bill / train['size']
train.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
18,16.97,3.5,Female,No,Sun,Dinner,3.0,0.21,5.656667
172,7.25,5.15,Male,Yes,Sun,Dinner,2.0,0.71,3.625
118,12.43,1.8,Female,No,Thur,Lunch,2.0,0.14,6.215
28,21.7,4.3,Male,No,Sat,Dinner,2.0,0.2,10.85
237,32.83,1.17,Male,Yes,Sat,Dinner,2.0,0.04,16.415
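
One pitfall with a column named `size`: attribute access (`df.size`) returns the DataFrame's element count, while bracket access (`df["size"]`) returns the column. The sketch below shows the difference on a two-row toy frame; `toy` and its values are illustrative, taken from the first two rows shown above.

```python
import pandas as pd

# Toy frame with a column named "size" (illustrative values only).
toy = pd.DataFrame({"total_bill": [16.97, 7.25], "size": [3, 2]})

# DataFrame attribute: number of rows times number of columns.
element_count = toy.size

# Bracket access returns the party-size column itself.
party_size = toy["size"]

# Dividing by the column gives a per-row price per person.
toy["price_per_person"] = toy["total_bill"] / toy["size"]

print(element_count)
print(toy)
```

Dividing by `toy.size` instead would divide every bill by the same scalar, which produces a valid-looking but meaningless column.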


- Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

- Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?

- Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?

- Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

<div class="alert alert-block alert-success"></div>
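
A minimal sketch of the manual process for the tip-amount question, assuming `f_regression` as the scoring function and `LinearRegression` as the RFE estimator (neither is specified in the exercise). To stay runnable offline it uses a synthetic stand-in with the same numeric column names as tips, so the selected features here say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the numeric tips columns (assumption: NOT the real data).
rng = np.random.default_rng(123)
n = 200
size = rng.integers(1, 7, n).astype(float)
total_bill = 8 * size + rng.uniform(2, 10, n)
tip = 0.15 * total_bill + rng.normal(0, 0.5, n)

demo = pd.DataFrame({"total_bill": total_bill, "tip": tip, "size": size})
demo["tip_percentage"] = demo["tip"] / demo["total_bill"]
demo["price_per_person"] = demo["total_bill"] / demo["size"]

# Predict tip from every other numeric column.
X = demo.drop(columns="tip")
y = demo["tip"]

# SelectKBest scores each feature on its own with a univariate F-test.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kbest_features = X.columns[kbest.get_support()].tolist()

# RFE refits the model repeatedly, dropping the weakest coefficient each round.
rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_features = X.columns[rfe_selector.support_].tolist()

print("SelectKBest:", kbest_features)
print("RFE:", rfe_features)
```

Because SelectKBest judges each feature in isolation while RFE ranks features by their coefficients in a joint model, the two can disagree, and RFE's ranking is sensitive to feature scale unless the predictors are scaled first.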

****

## 2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

<div class="alert alert-block alert-success"></div>
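
One possible shape for `select_kbest`, sketched under the assumption that `X` is a DataFrame of numeric predictors and that `f_regression` is an acceptable scoring function. The demonstration data (`a`, `b`, `c`) is synthetic and hypothetical, built so that only `a` and `b` drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_kbest(X, y, k):
    """Return the names of the k features with the highest F-scores."""
    selector = SelectKBest(score_func=f_regression, k=k)
    selector.fit(X, y)
    # get_support() is a boolean mask aligned with X's columns.
    return X.columns[selector.get_support()].tolist()


# Synthetic check: y depends on a and b, while c is pure noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 200)

top_two = select_kbest(X_demo, y_demo, 2)
print(top_two)
```

On the tips data, the call would look like `select_kbest(train[numeric_cols], train.tip, 2)`, where `numeric_cols` is a hypothetical list of the numeric predictor names.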

****

## 3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

<div class="alert alert-block alert-success"></div>
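
A comparable sketch for `rfe`, assuming `LinearRegression` as the underlying estimator (the exercise does not name one) and a DataFrame of numeric predictors. The demonstration data is synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rfe(X, y, k):
    """Return the names of the k features kept by recursive feature elimination."""
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
    selector.fit(X, y)
    # support_ is a boolean mask of the features that survived elimination.
    return X.columns[selector.support_].tolist()


# Synthetic check: y depends on a and b, while c is pure noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + rng.normal(0, 0.1, 200)

kept = rfe(X_demo, y_demo, 2)
print(kept)
```

The function name shadows nothing here because the class is imported as `RFE`; swapping in a different estimator only requires that it expose `coef_` or `feature_importances_` after fitting.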

****

## 4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

<div class="alert alert-block alert-success"></div>
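
A hedged sketch of how both selectors could be run side by side on any DataFrame with a named target column. `compare_selectors` is a hypothetical helper, and the demonstration uses synthetic columns `x1` to `x5` rather than the swiss data; loading swiss through `pydataset` is shown only as a comment and assumes that package is installed.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression


def compare_selectors(df, target, k):
    """Run SelectKBest and RFE on every column except target; return both picks."""
    X = df.drop(columns=target)
    y = df[target]
    kbest = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return {
        "select_kbest": X.columns[kbest.get_support()].tolist(),
        "rfe": X.columns[rfe_sel.support_].tolist(),
    }


# Synthetic stand-in (NOT the swiss data): the target depends on x1, x2, x3 only.
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(500, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
demo["target"] = 4 * demo["x1"] - 3 * demo["x2"] + 2 * demo["x3"] + rng.normal(0, 0.1, 500)

result = compare_selectors(demo, "target", 3)
print(result)

# With the real data (assumption: pydataset is installed):
# from pydataset import data
# swiss = data("swiss")
# compare_selectors(swiss, "Fertility", 3)
```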