Load the tips dataset.

Create a column named price_per_person. This should be the total bill divided by the party size.

Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

Use select k best to select the top 2 features for predicting tip amount. What are they?

Use recursive feature elimination to select the top 2 features for tip amount. What are they?

Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pydataset import data
import wrangle
import prepare
import math

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = data('tips')

In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
df['tip_percentage'] = round((df['tip'] / df['total_bill'])*100 , 2)

In [5]:
df['price_per_person'] = df['total_bill'] / df['size']

In [6]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_percentage,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2,5.94,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3,16.05,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3,16.66,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2,13.98,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4,14.68,6.1475


In [7]:
#total_bill and size would likely be the most important features for predicting the tip amount
#split data into train, validate and test
train, validate, test = wrangle.split_data(df)

train -> (136, 9)
validate -> (59, 9)
test -> (49, 9)
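
The `wrangle.split_data` helper is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming two successive `train_test_split` calls with test sizes of 0.2 and 0.3 (proportions that reproduce the 136/59/49 row counts printed above). The `seed` parameter and its default value are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, seed=123):
    """Split df into train, validate and test frames (hypothetical sketch)."""
    # hold out 20% of the rows as the test set
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # split the remaining rows 70/30 into train and validate
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    print(f'train -> {train.shape}')
    print(f'validate -> {validate.shape}')
    print(f'test -> {test.shape}')
    return train, validate, test


# stand-in frame with the same number of rows as the tips dataset (244)
demo = pd.DataFrame({'total_bill': range(244), 'tip': range(244)})
demo_train, demo_validate, demo_test = split_data(demo)
```

Fixing `random_state` keeps the split reproducible, so feature-selection results do not change from run to run.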


In [8]:
#split the target and the features
X_train = train.drop(columns = ['tip'])
y_train = train['tip']

In [9]:
X_validate = validate.drop(columns = ['tip'])
X_test = test.drop(columns = ['tip'])

In [10]:
#get all numeric columns
cols = X_train.select_dtypes(exclude='object').columns.to_list()

In [11]:
#scale the columns
X_train_scaled , X_validate_scaled , X_test_scaled = prepare.scaled_mimmax(cols, X_train , X_validate, X_test)
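
`prepare.scaled_mimmax` is also defined outside this notebook. The sketch below shows one way it could work, assuming it fits a `MinMaxScaler` on the training columns only and returns just the scaled numeric columns with the original index (which matches the output displayed in the next cell). The function name `scale_minmax_sketch` and the toy frames are illustrative, not the author's code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_minmax_sketch(cols, X_train, X_validate, X_test):
    """Min-max scale the given columns, learning min and max from train only."""
    scaler = MinMaxScaler()
    scaler.fit(X_train[cols])

    def to_frame(X):
        # keep the original index so rows still line up with the target
        return pd.DataFrame(scaler.transform(X[cols]), columns=cols, index=X.index)

    return to_frame(X_train), to_frame(X_validate), to_frame(X_test)


# toy data: two numeric columns and one categorical column that is left out
toy_train = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0], 'size': [1, 2, 3],
                          'day': ['Sun', 'Sat', 'Sun']})
toy_validate = pd.DataFrame({'total_bill': [40.0], 'size': [2], 'day': ['Sat']}, index=[3])
toy_test = pd.DataFrame({'total_bill': [15.0], 'size': [3], 'day': ['Sun']}, index=[4])

toy_cols = ['total_bill', 'size']
train_s, validate_s, test_s = scale_minmax_sketch(toy_cols, toy_train, toy_validate, toy_test)
print(train_s)
print(validate_s)
```

Because the scaler only sees the training data, validate and test values can fall outside the 0 to 1 range, as the toy validate row does here.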

In [12]:
X_train_scaled.head()

Unnamed: 0,total_bill,size,tip_percentage,price_per_person
19,0.307114,0.4,0.252853,0.150344
173,0.092355,0.2,1.0,0.032258
119,0.206805,0.2,0.16185,0.182796
29,0.411622,0.2,0.240996,0.452194
238,0.657534,0.2,0.0,0.775647


In [13]:
f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X_train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f84430b88b0>)

In [14]:
#get the top 2 features
mask = f_selector.get_support()
X_train_scaled.columns[mask]

Index(['total_bill', 'size'], dtype='object')
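
The mask only says which features were kept. To see how close the decision was, the fitted selector also exposes `scores_` and `pvalues_`. A small sketch on synthetic data follows; the column names and values are made up for illustration and are not the tips results.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'unrelated': rng.normal(size=n),
})
# the target depends heavily on 'strong', slightly on 'weak', not on 'unrelated'
y_demo = 3 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# one row per feature, highest univariate F-score first
score_table = pd.DataFrame(
    {'f_score': selector.scores_, 'p_value': selector.pvalues_},
    index=X_demo.columns,
).sort_values('f_score', ascending=False)
print(score_table)
```

The same table built from `f_selector` and `X_train_scaled.columns` would show the F-score behind each tips feature.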

In [15]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [16]:
rfe.support_

array([ True, False,  True, False])

In [17]:
#get the top 2 features

X_train_scaled.columns[rfe.support_]

Index(['total_bill', 'tip_percentage'], dtype='object')

In [18]:
pd.Series(dict(zip(X_train_scaled.columns, rfe.ranking_))).sort_values()

total_bill          1
tip_percentage      1
size                2
price_per_person    3
dtype: int64

In [19]:
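#the top 2 features for SelectKBest are: 'total_bill', 'size'
#the top 2 features for Recursive Feature Elimination (RFE) are: 'total_bill', 'tip_percentage'
#note: tip_percentage is computed from tip (the target), so it leaks target information into the predictors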
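
One likely reason the two methods disagree: `SelectKBest` with `f_regression` scores each feature on its own, while `RFE` ranks features by the coefficients of a model fit on all of them together. The sketch below builds a synthetic case (all names and data are invented for illustration) in which a feature that is useless alone becomes valuable once combined with another.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
noise = rng.normal(size=n)

X_demo = pd.DataFrame({
    'signal_plus_noise': signal + noise,             # informative but contaminated
    'noise_only': noise,                             # uncorrelated with the target alone
    'weak_signal': signal + 2 * rng.normal(size=n),  # weakly informative alone
})
# the target equals signal_plus_noise minus noise_only
y_demo = pd.Series(signal, name='target')

# univariate: each feature is scored against the target separately
kbest = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
kbest_features = list(X_demo.columns[kbest.get_support()])

# multivariate: the feature with the smallest coefficient is dropped
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_features = list(X_demo.columns[rfe_demo.support_])

print('SelectKBest:', kbest_features)
print('RFE:', rfe_features)
```

Here `SelectKBest` should keep `weak_signal` because it correlates with the target on its own, while `RFE` should keep `noise_only` because subtracting it from `signal_plus_noise` recovers the target.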

In [20]:
#split the target and the features
X_train = train.drop(columns = ['tip_percentage'])
y_train = train['tip_percentage']

In [21]:
X_validate = validate.drop(columns = ['tip_percentage'])
X_test = test.drop(columns = ['tip_percentage'])

In [22]:
cols = X_train.select_dtypes(exclude='object').columns.to_list()
cols

['total_bill', 'tip', 'size', 'price_per_person']

In [23]:
#scale the columns
X_train_scaled , X_validate_scaled , X_test_scaled = prepare.scaled_mimmax(cols, X_train , X_validate, X_test)

In [24]:
X_train_scaled.head()

Unnamed: 0,total_bill,tip,size,price_per_person
19,0.307114,0.3125,0.4,0.150344
173,0.092355,0.51875,0.2,0.032258
119,0.206805,0.1,0.2,0.182796
29,0.411622,0.4125,0.2,0.452194
238,0.657534,0.02125,0.2,0.775647


In [25]:
#select kbest
f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X_train_scaled, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f84430b88b0>)

In [26]:
mask = f_selector.get_support()
X_train_scaled.columns[mask]

Index(['tip', 'price_per_person'], dtype='object')

In [27]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X_train_scaled, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [28]:
rfe.support_

array([ True,  True, False, False])

In [29]:
#let's see the ranks 
pd.Series(dict(zip(X_train_scaled.columns, rfe.ranking_))).sort_values()

total_bill          1
tip                 1
size                2
price_per_person    3
dtype: int64

With tip_percentage as the target, the top 2 features for SelectKBest are 'tip' and 'price_per_person'; the top 2 features for Recursive Feature Elimination (RFE) are 'total_bill' and 'tip'.

In [30]:
f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X_train_scaled, y_train)
mask = f_selector.get_support()
X_train_scaled.columns[mask]

Index(['tip', 'price_per_person'], dtype='object')

In [31]:
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select= 2)
rfe.fit(X_train_scaled, y_train)
rfe.support_
X_train_scaled.columns[rfe.support_]

Index(['total_bill', 'tip'], dtype='object')

In [32]:
def select_kbest(X_df, y_df, n_features):
    '''
    Takes in the predictors, the target, and the number of features to select (k),
    and returns the names of the top k selected features based on the SelectKBest class.
    
    X_df : the predictors
    y_df : the target
    n_features : the number of features to select (k)
    Example
    select_kbest(X_train_scaled, y_train, 2)
    '''
    
    f_selector = SelectKBest(score_func=f_regression, k=n_features)
    f_selector.fit(X_df, y_df)
    #boolean mask marking the k highest-scoring columns
    mask = f_selector.get_support()
    top = list(X_df.columns[mask])
    
    print(f'The top {n_features} selected feautures based on the SelectKBest class are: {top}')
    #return the names so the caller can reuse them (print alone returns None)
    return top

In [33]:
select_kbest(X_train_scaled,y_train, 2)

The top 2 selected feautures based on the SelectKBest class are: ['tip', 'price_per_person']


In [34]:
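def select_rfe(X_df, y_df, n_features):
    '''
    Takes in the predictors, the target, and the number of features to select,
    and returns the names of the top features based on the RFE class
    with a LinearRegression estimator.
    '''
    lm = LinearRegression()
    rfe = RFE(estimator=lm, n_features_to_select=n_features)
    rfe.fit(X_df, y_df)
    #rfe.support_ is a boolean mask of the features that survived elimination
    top = list(X_df.columns[rfe.support_])
    print(f'The top {n_features} selected feautures based on the the RFE class class are: {top}')
    #return the names so the caller can reuse them (print alone returns None)
    return top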

In [35]:
select_rfe(X_train_scaled,y_train,2)

The top 2 selected feautures based on the the RFE class class are: ['total_bill', 'tip']


In [36]:
swiss_df = data('swiss')

In [37]:
data('swiss', show_doc =True)

swiss

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Swiss Fertility and Socioeconomic Indicators (1888) Data

### Description

Standardized fertility measure and socio-economic indicators for each of 47
French-speaking provinces of Switzerland at about 1888.

### Usage

    data(swiss)

### Format

A data frame with 47 observations on 6 variables, each of which is in percent,
i.e., in [0,100].

[,1] Fertility: Ig, "common standardized fertility measure"
[,2] Agriculture
[,3] Examination
[,4] Education
[,5] Catholic
[,6] Infant.Mortality: live births who live less than 1 year.

All variables but 'Fert' give proportions of the population.

### Source

Project "16P5", pages 549-551 in

Mosteller, F. and Tukey, J. W. (1977) “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading Mass.

indicating their source as "Data used by permission of Franice van de Walle.
Office of Population Research, Princeton Univer

In [38]:
swiss_df.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


In [39]:
#split data into train, validate and test
train, validate, test = wrangle.split_data(swiss_df)

train -> (25, 6)
validate -> (12, 6)
test -> (10, 6)


In [40]:
#split X, y
def split_Xy(train, validate, test, target):
    '''
    This function takes in three dataframes (train, validate, test) and a target, and splits each of the 3 samples
    into a dataframe with the independent variables and a series with the dependent, or target, variable.
    The function returns 3 dataframes and 3 series:
    X_train (df) & y_train (series), X_validate & y_validate, X_test & y_test.
    '''
    
    #split train
    X_train = train.drop(columns=[target])
    y_train = train[target]
    #split validate
    X_validate = validate.drop(columns=[target])
    y_validate = validate[target]
    #split test
    X_test = test.drop(columns=[target])
    y_test = test[target]
    return X_train, y_train, X_validate, y_validate, X_test, y_test
    

In [41]:
#split Xy using my function
X_train, y_train, X_validate, y_validate, X_test, y_test = wrangle.split_Xy(train, validate, test, 'Fertility')

X_train -> (25, 5)               y_train->(25,)
X_validate -> (12, 5)         y_validate->(12,) 
X_test -> (10, 5)                  y_test>(10,)


In [42]:
columns = list(X_train.select_dtypes(exclude='object').columns)
columns

['Agriculture', 'Examination', 'Education', 'Catholic', 'Infant.Mortality']

In [43]:
#scale the columns
X_train_scaled_df, validate_scaled_df, test_scaled_df = prepare.scaled_mimmax(columns, X_train, X_validate, X_test)

In [49]:
#kbest (note: the exercise asks for the top 3 features; this run uses k=4)
select_kbest(X_train_scaled_df, y_train, 4)

The top 4 selected feautures based on the SelectKBest class are: ['Examination', 'Education', 'Catholic', 'Infant.Mortality']


In [50]:
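#rfe (note: n_features=5 keeps all 5 predictors, so nothing is eliminated; the exercise asks for the top 3)
select_rfe(X_train_scaled_df, y_train, 5)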

The top 5 selected feautures based on the the RFE class class are: ['Agriculture', 'Examination', 'Education', 'Catholic', 'Infant.Mortality']
