**Load the tips dataset.**

**Create a column named tip_percentage. This should be the tip amount divided by the total bill.**

**Create a column named price_per_person. This should be the total bill divided by the party size.**

**Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?**

**Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?**

**Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?**

**Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

**Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

**Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).**

In [1]:
#Load the tips dataset.

import pydataset

df = pydataset.data('tips')

df.head()

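| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |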


In [2]:
df.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [3]:
df.dtypes

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

In [4]:
df['size'] = df['size'].astype(float)

In [5]:
df.dtypes

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size          float64
dtype: object

**Create a column named tip_percentage. This should be the tip amount divided by the total bill.**

In [6]:
df['tip_percentage'] = round(df.tip / df.total_bill,2)

**Create a column named price_per_person. This should be the total bill divided by the party size.**

In [7]:
df['price_per_person'] = round(df.total_bill / df['size'],2)

In [8]:
df.head()

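| | total_bill | tip | sex | smoker | day | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2.0 | 0.06 | 8.49 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3.0 | 0.16 | 3.45 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3.0 | 0.17 | 7.0 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2.0 | 0.14 | 11.84 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4.0 | 0.15 | 6.15 |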


**Use select k best to select the top 2 features. What are they?**

In [9]:
df = df[['total_bill','tip','size','tip_percentage','price_per_person']]

In [10]:
import sklearn.preprocessing
import pandas as pd

scaler = sklearn.preprocessing.MinMaxScaler()
# No train/test split is made in this exercise, so the scaler is fit on the full dataframe.
# With a split, call .fit on the training data only, then use .transform on every split.
scaler.fit(df)

df = pd.DataFrame(scaler.transform(df), columns=df.columns.values).set_index([df.index.values])
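
The scaling cell above fits the scaler on the whole dataframe. A minimal sketch of the fit-on-train pattern that its comment refers to is shown here, using made-up data rather than the tips dataset; `demo`, `train_scaled`, and `test_scaled` are illustrative names and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data; the column names only mirror two of the tips features.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "total_bill": rng.uniform(3, 50, size=100),
    "size": rng.integers(1, 7, size=100).astype(float),
})

train, test = train_test_split(demo, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
scaler.fit(train)  # learn each column's min and max from the training rows only

# Apply the same learned scaling to every split, keeping column names and index
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.describe().loc[["min", "max"]])
print(test_scaled.describe().loc[["min", "max"]])
```

Because the minimum and maximum come from the training rows, scaled test values can fall slightly outside the 0 to 1 range. That is expected and is not a sign of a bug.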

In [11]:
df

| | total_bill | tip | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|
| 1 | 0.291579 | 0.001111 | 0.2 | 0.029851 | 0.322599 |
| 2 | 0.152283 | 0.073333 | 0.4 | 0.179104 | 0.032777 |
| 3 | 0.375786 | 0.277778 | 0.4 | 0.194030 | 0.236918 |
| 4 | 0.431713 | 0.256667 | 0.2 | 0.149254 | 0.515239 |
| 5 | 0.450775 | 0.290000 | 0.6 | 0.164179 | 0.188039 |
| ... | ... | ... | ... | ... | ... |
| 240 | 0.543779 | 0.546667 | 0.4 | 0.238806 | 0.391029 |
| 241 | 0.505027 | 0.111111 | 0.2 | 0.044776 | 0.615871 |
| 242 | 0.410557 | 0.111111 | 0.2 | 0.074627 | 0.486486 |
| 243 | 0.308965 | 0.083333 | 0.2 | 0.089552 | 0.346751 |


In [12]:
X = df.drop(columns='tip')
y = df.tip

In [13]:
from sklearn.feature_selection import SelectKBest, f_regression

f_selector = SelectKBest(f_regression, k=2)

f_selector.fit(X, y)

X_reduced = f_selector.transform(X)

print(X.shape)
print(X_reduced.shape)

(244, 4)
(244, 2)


In [14]:
f_support = f_selector.get_support()
f_support

array([ True,  True, False, False])

In [15]:
f_feature = X.loc[:,f_support].columns.tolist()
f_feature

['total_bill', 'size']
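
To see why SelectKBest picks the columns it does, it can help to look at the per-feature scores and not only the boolean mask. The sketch below uses made-up data (`demo_X` and `demo_y` are hypothetical names) and reads the fitted selector's `scores_` attribute. It also rebuilds the same F statistic from the Pearson correlation, under the assumption that `f_regression` is run with its default centering.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
n = 200

# Hypothetical features: 'strong' drives the target, 'weak' barely does, 'noise' not at all.
demo_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = 3 * demo_X["strong"] + 0.5 * demo_X["weak"] + rng.normal(size=n)

selector = SelectKBest(f_regression, k=2).fit(demo_X, demo_y)

# One F statistic per column, each computed without looking at the other columns
scores = pd.Series(selector.scores_, index=demo_X.columns).sort_values(ascending=False)

# The same statistic rebuilt from the correlation r: F = r^2 / (1 - r^2) * (n - 2)
r = demo_X.corrwith(demo_y)
manual = (r**2 / (1 - r**2) * (n - 2)).sort_values(ascending=False)

print(scores)
print(manual)
```

Since each score depends only on one feature's correlation with the target, two nearly redundant features can both rank highly under this method.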

**Use recursive feature elimination to select the top 2 features. What are they?**

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

In [17]:
lm = LinearRegression()

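# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(lm, n_features_to_select=2)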

X_rfe = rfe.fit_transform(X,y)  

lm.fit(X_rfe,y)

mask = rfe.support_

rfe_features = X.loc[:,mask].columns.tolist()

print(rfe_features)

['total_bill', 'tip_percentage']
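
The two methods disagree on the second feature, and a likely reason is that they measure different things. SelectKBest scores each feature on its own, while RFE fits a model on all features together and repeatedly drops the one with the smallest coefficient magnitude. The sketch below, on made-up data with hypothetical names, shows how to inspect the RFE elimination order through `ranking_` and the coefficients of the final model through `estimator_`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 200
base = rng.normal(size=n)

# Hypothetical features: 'a_copy' is nearly redundant with 'a'; 'b' adds independent signal.
demo_X = pd.DataFrame({
    "a": base,
    "a_copy": base + rng.normal(scale=0.1, size=n),
    "b": rng.normal(size=n),
})
demo_y = 2 * demo_X["a"] + demo_X["b"] + rng.normal(scale=0.5, size=n)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(demo_X, demo_y)

# ranking_ is 1 for the kept features; larger numbers were eliminated earlier
ranking = pd.Series(selector.ranking_, index=demo_X.columns).sort_values()

# Coefficients of the final model, which is fit on the kept columns only
final_coefs = pd.Series(selector.estimator_.coef_, index=demo_X.columns[selector.support_])

print(ranking)
print(final_coefs)
```

Because the model sees all features at once, a feature that mostly repeats another one can be dropped by RFE even when its univariate score is high. The two answers may also converge or diverge as `k` changes.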


**Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

In [18]:
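def select_k_best(X,y,k):
    """Return the names of the top k features in X according to SelectKBest with f_regression."""
    f_selector = SelectKBest(f_regression, k=k)

    f_selector.fit(X, y)

    # Boolean mask over the columns of X; True marks a selected feature
    f_support = f_selector.get_support()

    f_feature = X.loc[:,f_support].columns.tolist()

    return f_feature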

In [19]:
X = df.drop(columns='tip')
y = df.tip
k = 2

select_k_best(X,y,k)

['total_bill', 'size']

**Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.**

In [20]:
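def rfe(X,y,k):
    """Return the names of the top k features in X according to RFE with a linear regression."""
    lm = LinearRegression()

    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    selector = RFE(lm, n_features_to_select=k)

    selector.fit(X, y)

    # Boolean mask over the columns of X; True marks a selected feature
    mask = selector.support_

    rfe_features = X.loc[:,mask].columns.tolist()

    return rfe_features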

In [21]:
rfe(X,y,k)

['total_bill', 'tip_percentage']

**Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).**

In [22]:
df = pydataset.data('swiss')

In [23]:
df.head()

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
| Franches-Mnt | 92.5 | 39.7 | 5 | 5 | 93.4 | 20.2 |
| Moutier | 85.8 | 36.5 | 12 | 7 | 33.77 | 20.3 |
| Neuveville | 76.9 | 43.5 | 17 | 15 | 5.16 | 20.6 |


In [24]:
X = df.drop(columns='Fertility')
y = df.Fertility
k = 2  # the prompt asks for the top 3 features; the outputs below were produced with k = 2

In [25]:
select_k_best(X,y,k)

['Examination', 'Education']

In [26]:
rfe(X,y,k)

['Education', 'Infant.Mortality']
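
Unlike the tips features, the swiss features are passed to RFE without scaling. RFE with a linear regression ranks features by coefficient magnitude, and that magnitude depends on each column's units, so scaling may change which features survive. The sketch below illustrates this on made-up data. `rfe_scaled` is a hypothetical helper, not part of the notebook, and the choice of min-max scaling simply mirrors the tips section.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler


def rfe_scaled(X, y, k):
    """Hypothetical helper: min-max scale X, then return the k features kept by RFE."""
    X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns, index=X.index)
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X_scaled, y)
    return X.columns[selector.support_].tolist()


# Made-up data: 'large_units' carries the most signal but is recorded in very large units,
# so its raw regression coefficient is tiny.
rng = np.random.default_rng(1)
n = 300
demo_X = pd.DataFrame({
    "small_units": rng.normal(size=n),
    "large_units": rng.normal(size=n) * 1e5,
    "weak": rng.normal(size=n),
})
demo_y = (demo_X["small_units"] + 1e-4 * demo_X["large_units"]
          + 0.2 * demo_X["weak"] + rng.normal(size=n))

# RFE on the raw columns compares coefficients that are in different units
unscaled = RFE(LinearRegression(), n_features_to_select=2).fit(demo_X, demo_y)

print(demo_X.columns[unscaled.support_].tolist())
print(rfe_scaled(demo_X, demo_y, 2))
```

On data like this, the unscaled run tends to discard `large_units` because its raw coefficient is close to zero, while the scaled run keeps it. Whether scaling changes the swiss result is not shown here and would need to be checked by running both versions.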