## Feature Engineering Regression Exercises

In [1]:
# imports
import numpy as np
import pandas as pd
from pydataset import data
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE

1. Load the tips dataset.

In [2]:
# load in the data
tips = data('tips')

In [3]:
# look at the docs
data('tips', show_doc=True)

tips

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Tipping data

### Description

One waiter recorded information about each tip he received over a period of a
few months working in one restaurant. He collected several variables:

### Usage

    data(tips)

### Format

A data frame with 244 rows and 7 variables

### Details

  * tip in dollars, 

  * bill in dollars, 

  * sex of the bill payer, 

  * whether there were smokers in the party, 

  * day of the week, 

  * time of day, 

  * size of the party. 

In all he recorded 244 tips. The data was reported in a collection of case
studies for business statistics (Bryant & Smith 1995).

### References

Bryant, P. G. and Smith, M (1995) _Practical Data Analysis: Case Studies in
Business Statistics_. Homewood, IL: Richard D. Irwin Publishing:




In [4]:
# what does it look like?
tips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


1a. Create a column named tip_percentage. This should be the tip amount divided by the total bill.

In [5]:
tips['tip_percentage'] = (tips['tip'] / tips['total_bill']) * 100

1b. Create a column named price_per_person. This should be the total bill divided by the party size.

In [6]:
tips['price_per_person'] = tips['total_bill'] / tips['size']

In [7]:
# now what does it look like?
tips.head()

   total_bill   tip     sex  smoker  day    time  size  tip_percentage  price_per_person
1       16.99  1.01  Female      No  Sun  Dinner     2        5.944673          8.495000
2       10.34  1.66    Male      No  Sun  Dinner     3       16.054159          3.446667
3       21.01  3.50    Male      No  Sun  Dinner     3       16.658734          7.003333
4       23.68  3.31    Male      No  Sun  Dinner     2       13.978041         11.840000
5       24.59  3.61  Female      No  Sun  Dinner     4       14.680765          6.147500


1c. Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

- total_bill
- tip_percentage
- price_per_person

In [8]:
# let's define our X and y

variables = ['total_bill', 'tip_percentage', 'price_per_person', 'size']
X = tips[variables]
y = tips['tip']

1d. Use all the other numeric features to predict tip amount. Use select k best and recursive feature elimination to select the top 2 features. What are they?

In [30]:
# setting up my select k best

f_selector = SelectKBest(score_func=f_regression, k=2)
f_selector.fit(X, y)

SelectKBest(k=2, score_func=<function f_regression at 0x7fac03230940>)

In [31]:
# so which columns win?
mask = f_selector.get_support()
X.columns[mask]

Index(['total_bill', 'size'], dtype='object')
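
To see how close the race was, a fitted selector also exposes the per-feature F statistics and p-values through `scores_` and `pvalues_`. The sketch below runs on a small synthetic frame (the column names `strong`, `weak`, `noise` and the coefficients are made up for illustration) rather than the tips data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic stand-in data; names and coefficients are illustrative only
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'strong': rng.normal(size=200),
    'weak': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
demo_y = 3 * demo_X['strong'] + 0.5 * demo_X['weak'] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(demo_X, demo_y)

# one row per feature: F statistic, p-value, and whether it was kept
score_table = pd.DataFrame({
    'f_score': selector.scores_,
    'p_value': selector.pvalues_,
    'selected': selector.get_support(),
}, index=demo_X.columns).sort_values('f_score', ascending=False)
print(score_table)
```

The same table could be built from `f_selector` and `X.columns` in the cells above.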

In [11]:
# now fitting the model to top 2 features
X_kbest = f_selector.transform(X)

model = LinearRegression().fit(X_kbest, y)

1e. Use all the other numeric features to predict tip percentage. Use select k best and recursive feature elimination to select the top 2 features. What are they?

In [12]:
# note: y is still tips['tip'] here, so these cells are the RFE half of 1d
# build an LR model and see coefficient of each variable
model = LinearRegression().fit(X, y)
model.coef_

array([0.10759403, 0.14654319, 0.06453116, 0.27837548])

In [22]:
# use the RFE function to select 2 best features
lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)
rfe.fit(X, y)

RFE(estimator=LinearRegression(), n_features_to_select=2)

In [23]:
# which 2 features?
rfe.support_

array([ True,  True, False, False])

In [24]:
# slightly cleaner look...
X.columns[rfe.support_]

Index(['total_bill', 'tip_percentage'], dtype='object')

In [25]:
# rank every feature: 1 means RFE kept it, higher numbers were eliminated earlier
pd.Series(dict(zip(X.columns, rfe.ranking_))).sort_values()

total_bill          1
tip_percentage      1
size                2
price_per_person    3
dtype: int64
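
Exercise 1e asks for tip percentage as the target, whereas the cells here keep `y = tips['tip']`. A minimal sketch of how the same two selectors could be pointed at a different target follows. `top_features` is a hypothetical helper, and the `fake` frame is made-up data shaped like the tips columns so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression


def top_features(df, target, k=2):
    """Return (SelectKBest names, RFE names) using every numeric column except target."""
    X = df.select_dtypes(include='number').drop(columns=target)
    y = df[target]
    kbest = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return list(X.columns[kbest.get_support()]), list(X.columns[rfe_sel.support_])


# made-up rows shaped like the tips frame, only so the sketch is self-contained
rng = np.random.default_rng(1)
fake = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, size=100),
    'size': rng.integers(1, 7, size=100),
})
fake['tip'] = 0.15 * fake['total_bill'] + rng.uniform(0, 1, size=100)
fake['tip_percentage'] = fake['tip'] / fake['total_bill'] * 100
fake['price_per_person'] = fake['total_bill'] / fake['size']

kbest_cols, rfe_cols = top_features(fake, target='tip_percentage', k=2)
print(kbest_cols, rfe_cols)
```

With the real data the call would be `top_features(tips, target='tip_percentage')`; the non-numeric columns are dropped by `select_dtypes`.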

1f. Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?
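
One plausible reason for the disagreement in 1f: SelectKBest with `f_regression` scores each feature on its own against the target, while RFE fits all features together and repeatedly drops the one with the smallest coefficient. A feature that merely duplicates another can therefore score well on its own yet be eliminated by RFE. The toy data below is synthetic (the names `x1`, `x1_copy`, `x3` and the coefficients are invented for illustration) and is only a sketch of that effect, not an analysis of the tips data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
toy_X = pd.DataFrame({
    'x1': x1,
    'x1_copy': x1 + rng.normal(scale=0.5, size=n),  # near-duplicate of x1
    'x3': rng.normal(size=n),                        # independent, weaker signal
})
# the target depends on x1 and x3 only; x1_copy adds nothing beyond x1
toy_y = 3 * toy_X['x1'] + toy_X['x3'] + rng.normal(scale=0.5, size=n)

kbest = SelectKBest(score_func=f_regression, k=2).fit(toy_X, toy_y)
rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(toy_X, toy_y)

kbest_cols = list(toy_X.columns[kbest.get_support()])  # univariate view
rfe_cols = list(toy_X.columns[rfe_sel.support_])       # multivariate view
print(kbest_cols)
print(rfe_cols)
```

On this toy data SelectKBest keeps the two correlated columns while RFE keeps `x1` and `x3`. The two lists necessarily agree once k equals the number of columns, which suggests the answer can depend on k.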

2. Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [49]:
def select_kbest(X, y, k):
    '''
    This function takes in the predictors (X), the target (y), and the number 
    of features to select (k) and returns the names of the top k selected 
    features based on the SelectKBest class. It requires the X, y, and k be 
    predefined.
    '''
    # setting up my select k best
    f_selector = SelectKBest(score_func=f_regression, k=k)
    # fitting to X and y
    f_selector.fit(X, y)
    # putting a mask on to see which columns are selected 
    mask = f_selector.get_support()
    
    return X.columns[mask]

In [50]:
# X and y are still the tips predictors and target defined in 1c
select_kbest(X, y, 2)

Index(['total_bill', 'size'], dtype='object')

3. Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [51]:
def rfe(X, y, k):
    '''
    This function takes in the predictors (X), the target (y), and the number
    of features to select (k) and returns the names of the top k features
    based on the RFE class with a LinearRegression estimator.
    '''
    # RFE takes an unfitted estimator and refits it as features are eliminated
    lm = LinearRegression()
    rec = RFE(estimator=lm, n_features_to_select=k)
    # fit the selector to X and y
    rec.fit(X, y)

    return X.columns[rec.support_]

In [52]:
# X and y are still the tips predictors and target defined in 1c
rfe(X, y, 2)

Index(['total_bill', 'tip_percentage'], dtype='object')
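
Because RFE with `LinearRegression` ranks features by coefficient size, its picks can depend on the units each column is measured in. The sketch below illustrates this on synthetic data (the column names, scales, and the helper `rfe_top` are assumptions made for the example) by running RFE before and after `StandardScaler`. It is one possible way to check for the effect, not a required step of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 300
units_X = pd.DataFrame({
    'big_units': rng.normal(scale=1000, size=n),  # large numbers, tiny coefficient
    'small_units': rng.normal(scale=1, size=n),   # small numbers, larger coefficient
})
# big_units explains most of the variance despite its small coefficient
units_y = 0.005 * units_X['big_units'] + units_X['small_units'] + rng.normal(scale=0.5, size=n)


def rfe_top(X, y, k):
    """Names of the k features RFE keeps with a plain linear regression."""
    selector = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return list(X.columns[selector.support_])


# same data, rescaled to zero mean and unit variance
scaled_X = pd.DataFrame(
    StandardScaler().fit_transform(units_X),
    columns=units_X.columns,
    index=units_X.index,
)

raw_pick = rfe_top(units_X, units_y, 1)
scaled_pick = rfe_top(scaled_X, units_y, 1)
print(raw_pick, scaled_pick)
```

On the raw columns RFE keeps `small_units`; after scaling it keeps `big_units`, the column that actually explains more of the target here.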

4. Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [53]:
# load the data
swiss = data('swiss')

In [54]:
# let's see the docs
data('swiss', show_doc=True)

swiss

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Swiss Fertility and Socioeconomic Indicators (1888) Data

### Description

Standardized fertility measure and socio-economic indicators for each of 47
French-speaking provinces of Switzerland at about 1888.

### Usage

    data(swiss)

### Format

A data frame with 47 observations on 6 variables, each of which is in percent,
i.e., in [0,100].

  * [,1] Fertility: Ig, "common standardized fertility measure"

  * [,2] Agriculture

  * [,3] Examination

  * [,4] Education

  * [,5] Catholic

  * [,6] Infant.Mortality: live births who live less than 1 year

All variables but 'Fertility' give proportions of the population.

### Source

Project "16P5", pages 549-551 in

Mosteller, F. and Tukey, J. W. (1977) “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading Mass.

indicating their source as "Data used by permission of Franice van de Walle.
Office of Population Research, Princeton Univer

In [55]:
# what's it look like?
swiss.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47 entries, Courtelary to Rive Gauche
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Fertility         47 non-null     float64
 1   Agriculture       47 non-null     float64
 2   Examination       47 non-null     int64  
 3   Education         47 non-null     int64  
 4   Catholic          47 non-null     float64
 5   Infant.Mortality  47 non-null     float64
dtypes: float64(4), int64(2)
memory usage: 2.6+ KB


In [56]:
swiss.head()

              Fertility  Agriculture  Examination  Education  Catholic  Infant.Mortality
Courtelary         80.2         17.0           15         12      9.96              22.2
Delemont           83.1         45.1            6          9     84.84              22.2
Franches-Mnt       92.5         39.7            5          5     93.40              20.2
Moutier            85.8         36.5           12          7     33.77              20.3
Neuveville         76.9         43.5           17         15      5.16              20.6


In [57]:
X = swiss.drop(columns='Fertility')
y = swiss['Fertility']

In [58]:
# using my selectkbest
select_kbest(X, y, 3)

Index(['Examination', 'Education', 'Catholic'], dtype='object')

In [59]:
# using my rfe
rfe(X, y, 3)

Index(['Examination', 'Education', 'Infant.Mortality'], dtype='object')
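
When the two methods disagree on a feature, as they do here, it can help to view both results in one table. The helper below, `compare_selectors`, is a hypothetical convenience function rather than part of the exercise. It is shown on a synthetic frame (invented column names `a` to `e`) so it runs without pydataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression


def compare_selectors(X, y, k):
    """One row per feature: F-score, SelectKBest pick, RFE rank, RFE pick."""
    kbest = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    return pd.DataFrame({
        'f_score': kbest.scores_,
        'kbest_pick': kbest.get_support(),
        'rfe_rank': rfe_sel.ranking_,   # 1 means kept by RFE
        'rfe_pick': rfe_sel.support_,
    }, index=X.columns).sort_values('f_score', ascending=False)


# synthetic frame; only columns a and b drive the target
rng = np.random.default_rng(3)
sim_X = pd.DataFrame(rng.normal(size=(60, 5)), columns=list('abcde'))
sim_y = 2 * sim_X['a'] - sim_X['b'] + rng.normal(scale=0.3, size=60)

table = compare_selectors(sim_X, sim_y, 3)
print(table)
```

For the swiss data the equivalent call would be `compare_selectors(X, y, 3)` with the `X` and `y` defined above.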