# tips dataset
1. Load the tips dataset.
    * Create a column named tip_percentage. This should be the tip amount divided by the total bill.

    * Create a column named price_per_person. This should be the total bill divided by the party size.

    * Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

    * Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?

    * Use select k best and recursive feature elimination to select the top 2 features for predicting tip percentage. What are they?

    * Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

## #1

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from pydataset import data

In [2]:
df = data('tips')
df.head(3), len(df)

(   total_bill   tip     sex smoker  day    time  size
 1       16.99  1.01  Female     No  Sun  Dinner     2
 2       10.34  1.66    Male     No  Sun  Dinner     3
 3       21.01  3.50    Male     No  Sun  Dinner     3,
 244)

Create a column named tip_percentage. This should be the tip amount divided by the total bill.

Create a column named price_per_person. This should be the total bill divided by the party size.

In [3]:
### New Columns ###
df['tip_percentage'] = df['tip'] / df['total_bill']
df['price_per_person'] = df['total_bill'] / df['size']
df.head(3)

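| | total_bill | tip | sex | smoker | day | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 8.495 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 3.446667 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0.166587 | 7.003333 |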


Before using any of the methods discussed in the lesson, which features do you think would be most important for:
- predicting the tip amount?
    * total_bill, size, time
- predicting the tip percentage?
    * total_bill, tip, time

Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?

In [4]:
df['smoker'].unique()

array(['No', 'Yes'], dtype=object)

In [5]:
df['time'].unique()

array(['Dinner', 'Lunch'], dtype=object)

In [6]:
df['day'].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [7]:
df['sex'] = df['sex'].map({'Male':0, 'Female':1})
df['smoker'] = df['smoker'].map({'No':0, 'Yes':1})
# `day` is not encoded here and is dropped from X below; with more time, one-hot encode it
# rather than mapping it to ordinal codes such as {'Thur':0, 'Fri':1, 'Sat':2, 'Sun':3}.
df['time'] = df['time'].map({'Lunch':0, 'Dinner':1})
df.head(3)

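| | total_bill | tip | sex | smoker | day | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | 1 | 0 | Sun | 1 | 2 | 0.059447 | 8.495 |
| 2 | 10.34 | 1.66 | 0 | 0 | Sun | 1 | 3 | 0.160542 | 3.446667 |
| 3 | 21.01 | 3.5 | 0 | 0 | Sun | 1 | 3 | 0.166587 | 7.003333 |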
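
The commented-out line in cell 7 defers one-hot encoding of `day`. A minimal sketch of that step, assuming `pd.get_dummies` is an acceptable encoder here; the `toy` frame and its bill values are made up for illustration, and only the four day labels come from the notebook:

```python
import pandas as pd

# Hypothetical stand-in for the tips frame (bill values are invented);
# the day labels match the output of df['day'].unique() above.
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'day': ['Sun', 'Sat', 'Thur', 'Fri'],
})

# One 0/1 indicator column per day. drop_first=True drops the first level
# (alphabetically 'Fri') so the indicators are not perfectly collinear
# in a linear model with an intercept.
encoded = pd.get_dummies(toy, columns=['day'], drop_first=True, dtype=int)
print(encoded)
```

With the indicator columns in place, `day` would no longer need to be dropped before scaling and feature selection.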


### tip target

In [8]:
train, test = train_test_split(df, test_size=0.2, random_state=123)
X_train, y_train = train.drop(columns=['tip', 'day']), train.tip
X_test, y_test = test.drop(columns=['tip', 'day']), test.tip

In [9]:
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

X_train_scaled.head(3)

| | total_bill | sex | smoker | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|
| 0 | 2.227511 | -0.748331 | -0.799159 | 0.595119 | 1.512853 | 0.450044 | 0.677655 |
| 1 | -0.440469 | -0.748331 | 1.251315 | 0.595119 | -0.57939 | -1.033309 | -0.005989 |
| 2 | -0.769891 | 1.336306 | 1.251315 | 0.595119 | -0.57939 | 0.181202 | -0.504267 |


#### K-Best: target = 'tip'

In [10]:
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_train_scaled, y_train)
list(X_train.columns[kbest.get_support()])

['total_bill', 'size']
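
`get_support()` only reports which columns were kept. To see why, the fitted selector also exposes `scores_` and `pvalues_`. The sketch below uses synthetic data (the column names `strong`, `weak`, and `noise` are made up) rather than the tips frame, so it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic predictors: two that drive the target and one that does not.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'strong': rng.normal(size=200),
    'weak': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3 * X_demo['strong'] + 1.5 * X_demo['weak'] + rng.normal(scale=0.5, size=200)

kbest_demo = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

# One row per feature: the univariate F statistic and its p-value.
scores = pd.DataFrame(
    {'f_score': kbest_demo.scores_, 'p_value': kbest_demo.pvalues_},
    index=X_demo.columns,
).sort_values('f_score', ascending=False)
print(scores)
print(list(X_demo.columns[kbest_demo.get_support()]))
```

Running the same two lines against `kbest` from cell 10 would show how far `total_bill` and `size` sit above the remaining columns.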

#### RFE: target = 'tip'

In [11]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)
list(X_train.columns[rfe.get_support()])

['total_bill', 'tip_percentage']
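
RFE also records the order in which features were eliminated in its `ranking_` attribute: selected features get rank 1, and higher ranks were dropped earlier. A small sketch on synthetic data (the columns `a` to `d` and their coefficients are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic predictors with clearly separated true coefficients (4, 2, 1, 0).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 4 * X_demo['a'] + 2 * X_demo['b'] + 1 * X_demo['c'] + rng.normal(scale=0.1, size=200)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

# Rank 1 = kept; the largest rank was eliminated first.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Applied to the fitted `rfe` from cell 11, the same `pd.Series(rfe.ranking_, index=X_train.columns)` call would list the tips features in elimination order.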

### tip_percentage target

In [12]:
X_train, y_train = train.drop(columns=['tip_percentage', 'day']), train.tip_percentage
X_test, y_test = test.drop(columns=['tip_percentage', 'day']), test.tip_percentage

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

X_train_scaled.head(3)

| | total_bill | tip | sex | smoker | time | size | price_per_person |
|---|---|---|---|---|---|---|---|
| 0 | 2.227511 | 3.101608 | -0.748331 | -0.799159 | 0.595119 | 1.512853 | 0.677655 |
| 1 | -0.440469 | -1.035358 | -0.748331 | 1.251315 | 0.595119 | -0.57939 | -0.005989 |
| 2 | -0.769891 | -0.53865 | 1.336306 | 1.251315 | 0.595119 | -0.57939 | -0.504267 |


#### K-Best: target = 'tip_percentage'

In [13]:
kbest = SelectKBest(f_regression, k=2)
kbest.fit(X_train_scaled, y_train)
list(X_train.columns[kbest.get_support()])

['tip', 'price_per_person']

#### RFE: target = 'tip_percentage'

In [14]:
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)
list(X_train.columns[rfe.get_support()])

['total_bill', 'tip']

Why do you think select k best and recursive feature elimination might give different answers for the top features? 
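- K-Best scores each feature individually against the target (here with the univariate `f_regression` test), so two redundant features can both rank highly. RFE fits the model on all remaining features at once and repeatedly drops the one with the weakest coefficient, so it accounts for how features behave in combination.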

Does this change as you change the number of features you are selecting?
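- With only one feature to choose from, both methods would return the same answer, but there would be nothing left to select. In general the two selections can differ for any k smaller than the number of features, and they coincide when k equals it.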
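
The two answers above can be illustrated with a small synthetic example. The data here is invented: `x1_copy` is a noisy duplicate of `x1`, and the target depends only on `x1` and `x3`. In this construction K-Best is expected to keep the two correlated columns, while RFE is expected to drop the redundant copy and keep `x3`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
X_demo = pd.DataFrame({
    'x1': x1,
    'x1_copy': x1 + rng.normal(scale=0.3, size=n),  # redundant with x1
    'x3': rng.normal(size=n),                       # independent, weaker signal
})
y_demo = X_demo['x1'] + 0.6 * X_demo['x3'] + rng.normal(scale=0.1, size=n)

def top_k(selector):
    """Fit a selector on the demo data and return the kept column names."""
    return list(X_demo.columns[selector.fit(X_demo, y_demo).get_support()])

# k=2: univariate scoring vs. model-based elimination
kbest_pick = top_k(SelectKBest(f_regression, k=2))
rfe_pick = top_k(RFE(estimator=LinearRegression(), n_features_to_select=2))
print(kbest_pick, rfe_pick)

# k equal to the number of features: both methods keep every column
all_kbest = top_k(SelectKBest(f_regression, k=3))
all_rfe = top_k(RFE(estimator=LinearRegression(), n_features_to_select=3))
print(all_kbest == all_rfe)
```

The disagreement comes from `x1_copy`: it correlates strongly with the target on its own, but adds almost nothing once `x1` is already in the model.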

## #2
Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [15]:
def select_kbest(X, y, k):
    """ Returns the names of top k-selected features using SelectKBest """
    # Build, fit kbest
    kbest = SelectKBest(f_regression, k=k)
    kbest.fit(X, y)
    # Put top k selected feature names into a list
    return_list = list(X.columns[kbest.get_support()])
    
    # Return the feature name list
    return return_list

In [16]:
select_kbest(X_train_scaled, y_train, k=2)

['tip', 'price_per_person']

## #3
Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [17]:
def rfe(X, y, k):
    """ Returns the names of top k-selected features using RFE """
    # Build and fit the RFE selector (local name avoids shadowing this function)
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
    selector.fit(X, y)
    # Put top k selected feature names into a list
    return_list = list(X.columns[selector.get_support()])
    
    # Return the feature name list
    return return_list

In [18]:
rfe(X_train_scaled, y_train, k=2)

['total_bill', 'tip']

## #4
Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [19]:
df = data('swiss')
df.head(3)

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
| Franches-Mnt | 92.5 | 39.7 | 5 | 5 | 93.4 | 20.2 |


In [20]:
select_kbest(df.drop(columns='Fertility'), df.Fertility, k=3)

['Examination', 'Education', 'Catholic']

In [21]:
rfe(df.drop(columns='Fertility'), df.Fertility, k=3)

['Examination', 'Education', 'Infant.Mortality']
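
When the two methods disagree, as they do on the third swiss feature, it can help to line the selections up in one table. The helper below is a hypothetical convenience, not part of the exercise, and is demonstrated on synthetic data (columns `f0` to `f4` are made up) so it runs without `pydataset`:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

def compare_selectors(X, y, k):
    """Return a boolean table: one row per feature, one column per method."""
    kbest_sel = SelectKBest(f_regression, k=k).fit(X, y)
    rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return pd.DataFrame(
        {'select_kbest': kbest_sel.get_support(), 'rfe': rfe_sel.get_support()},
        index=X.columns,
    )

# Synthetic stand-in for a small regression dataset.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f'f{i}' for i in range(5)])
y_demo = 2 * X_demo['f0'] - X_demo['f1'] + 0.5 * X_demo['f2'] + rng.normal(scale=0.2, size=100)

table = compare_selectors(X_demo, y_demo, k=3)
print(table)
```

Calling `compare_selectors(df.drop(columns='Fertility'), df.Fertility, k=3)` on the swiss frame would put the two lists from cells 20 and 21 side by side.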