# Question 1:
Load the `tips` dataset
- Create a column named `price_per_person`. This should be the total bill divided by the party size.
- Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?
- Use Select K Best to select the top 2 features for predicting tip amount. What are they?
- Use Recursive Feature Elimination to select the top 2 features for tip amount. What are they?
- Why do you think Select K Best and Recursive Feature Elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

In [1]:
#imports
from pydataset import data
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
#aquire the data
df = data('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


#### Create a column named `price_per_person`. This should be the total bill divided by the party size.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [4]:
df['size'] = df['size'].astype(float)

In [5]:
# create the new column
df['price_per_person'] = (df['total_bill']/df['size'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
1,16.99,1.01,Female,No,Sun,Dinner,2.0,8.495
2,10.34,1.66,Male,No,Sun,Dinner,3.0,3.446667
3,21.01,3.5,Male,No,Sun,Dinner,3.0,7.003333
4,23.68,3.31,Male,No,Sun,Dinner,2.0,11.84
5,24.59,3.61,Female,No,Sun,Dinner,4.0,6.1475


In [7]:
# need to split my data: 
from sklearn.model_selection import train_test_split
def split_tips_data(df, stratify = None, seed = 1234):
    '''
    This funciton will split the tips data into train, validate and test,
    It will split the data into 20% and 80% into test and train
    It will split the data into 30% and 70% into validate and train
    
    '''
    
    train_validate, test = train_test_split(df, test_size=.2, random_state= seed)
    train, validate = train_test_split(train_validate, test_size=.3, random_state= seed)
    
    return train, validate, test

In [8]:
train, validate, test = split_tips_data(df, stratify = 'tip')
train.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
216,12.9,1.1,Female,Yes,Sat,Dinner,2.0,6.45
242,22.67,2.0,Male,Yes,Sat,Dinner,2.0,11.335
108,25.21,4.29,Male,Yes,Sat,Dinner,2.0,12.605
26,17.81,2.34,Male,No,Sat,Dinner,4.0,4.4525
50,18.04,3.0,Male,No,Sun,Dinner,2.0,9.02


In [9]:
# need to create dummies and remove columns
def get_dummies(train, validate, test):
    '''
    This will take in train, validate, and test and create dummy columns
    '''
    col_list = ['sex','smoker','day','time']
    
    # train data set
    dummy_train = pd.get_dummies(train[col_list], dummy_na = False)
    train = pd.concat([train, dummy_train], axis = 1)
    train = train.drop(columns = col_list)
    
    # validate data set
    dummy_validate = pd.get_dummies(validate[col_list], dummy_na = False)
    validate = pd.concat([validate, dummy_validate], axis = 1)
    validate = validate.drop(columns = col_list)
    
    # test data set
    dummy_test = pd.get_dummies(test[col_list], dummy_na = False)
    test = pd.concat([test, dummy_test], axis = 1)
    test = test.drop(columns = col_list)
    
    return train, validate, test

In [10]:
train, validate, test = get_dummies(train, validate, test)
train.head()

Unnamed: 0,total_bill,tip,size,price_per_person,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
216,12.9,1.1,2.0,6.45,1,0,0,1,0,1,0,0,1,0
242,22.67,2.0,2.0,11.335,0,1,0,1,0,1,0,0,1,0
108,25.21,4.29,2.0,12.605,0,1,0,1,0,1,0,0,1,0
26,17.81,2.34,4.0,4.4525,0,1,1,0,0,1,0,0,1,0
50,18.04,3.0,2.0,9.02,0,1,1,0,0,0,1,0,1,0


#### Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

In [6]:
# I think it will either be the size or the total_bill

#### Use Select K Best to select the top 2 features for predicting tip amount. What are they?

In [11]:
# split into X_train and y_train
X_train = train.drop(columns = 'tip')

y_train = train.tip

X_validate = validate.drop(columns = 'tip')
y_validate = validate.tip

X_test = test.drop(columns = 'tip')
y_test = test.tip

In [12]:
# import for k best
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector

In [13]:
# Make the thing: (k is how many features you want it to look at, the question asks for the top 2)
kbest = SelectKBest(f_regression, k = 2)

#fit the thing
kbest.fit(X_train, y_train)

In [14]:
# statiscal f-value / featurer's scores:
kbest.scores_

array([135.05091323,  51.59436615,  19.97233773,   1.65024984,
         1.65024984,   1.59545395,   1.59545395,   0.68231437,
         0.1362716 ,   2.54338441,   0.58983901,   0.90735416,
         0.90735416])

In [15]:
# p value: 
kbest.pvalues_

array([5.02625208e-22, 4.26386246e-11, 1.65634550e-05, 2.01141239e-01,
       2.01141239e-01, 2.08742036e-01, 2.08742036e-01, 4.10259201e-01,
       7.12598704e-01, 1.13113065e-01, 4.43832678e-01, 3.42532478e-01,
       3.42532478e-01])

In [16]:
# get the names of the features its looking at: 
kbest.feature_names_in_

array(['total_bill', 'size', 'price_per_person', 'sex_Female', 'sex_Male',
       'smoker_No', 'smoker_Yes', 'day_Fri', 'day_Sat', 'day_Sun',
       'day_Thur', 'time_Dinner', 'time_Lunch'], dtype=object)

In [17]:
kbest_results = pd.DataFrame(
                dict(p=kbest.pvalues_, f=kbest.scores_),
                                        index = X_train.columns)

In [18]:
kbest_results

Unnamed: 0,p,f
total_bill,5.026252e-22,135.050913
size,4.263862e-11,51.594366
price_per_person,1.656346e-05,19.972338
sex_Female,0.2011412,1.65025
sex_Male,0.2011412,1.65025
smoker_No,0.208742,1.595454
smoker_Yes,0.208742,1.595454
day_Fri,0.4102592,0.682314
day_Sat,0.7125987,0.136272
day_Sun,0.1131131,2.543384


<div class="alert alert-success" role="alert">
    Takeaways: <br>
        - The top two features to look at are total_bill and size

#### Use Recursive Feature Elimination to select the top 2 features for tip amount. What are they?

In [20]:
# imports: 
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [21]:
# make it: 
rfe = RFE(model, n_features_to_select= 2)

# Fit the thing:
rfe.fit(X_train, y_train)

In [23]:
# get the feature ranking:
rfe.ranking_

array([ 8,  3, 12, 11,  9, 10,  7,  5,  1,  4,  6,  2,  1])

In [24]:
# make a dataframe of the rankings for better understanding
pd.DataFrame(
{
    'rfe_ranking':rfe.ranking_
}, index = X_train.columns)

Unnamed: 0,rfe_ranking
total_bill,8
size,3
price_per_person,12
sex_Female,11
sex_Male,9
smoker_No,10
smoker_Yes,7
day_Fri,5
day_Sat,1
day_Sun,4


<div class="alert alert-success" role="alert">
    Takeaways: <br>
    - The best features to use are total_bill and size


#### Why do you think Select K Best and Recursive Feature Elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

- My tests ended up providing the same answer for the best features.
- SelectKBest: This technique selects the top K features based on a specific univariate statistical test, such as chi-squared, ANOVA F-test, or mutual information. It evaluates each feature independently and ranks them according to their individual scores. SelectKBest is model-agnostic and doesn't consider how the features interact with each other within a specific model. It evaluates features independently based on a statistical test.

- RFE: RFE, on the other hand, is a recursive technique that starts with all features and iteratively removes the least important ones based on a model's performance (e.g., using cross-validated accuracy or other performance metrics). It considers the interaction between features and their combined importance. RFE depends on a specific model. The choice of the model can influence which features are considered important. For example, different models may have different feature rankings, leading to varying results.

# Question 2: 
Write a function named `select_kbest` that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [28]:
# create the function: 
def select_kbest(x, y, k_value):
    '''
    This funciton will take in the predictors (x), the target(y), and the numebr of features to select and 
    returns the top k selectef features based on the SelectKbest class. 
    '''
    

    
    # make the thing: 
    kbest = SelectKBest(f_regression, k = k_value)
    
    #fit the thing
    kbest.fit(x, y)
    
    # return a dataframe with the names
    kbest_results = pd.DataFrame(
                dict(p=kbest.pvalues_, f=kbest.scores_),
                                        index = x.columns)
    
    # returns the dataframe with the top kbest features
    return kbest_results.head(k_value)

In [29]:
select_kbest(X_train, y_train, 2)

Unnamed: 0,p,f
total_bill,5.026252e-22,135.050913
size,4.263862e-11,51.594366


# Question 3: 
Write a function named `rfe` that takes in the predictors, the target, and the number of features to select. It should return the top n features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [34]:
# create a function: 

def rfe(x, y, k_value):
    '''
    This function will take in predicotrs(x), the target(y), and the number of features to select k_value
    It will return the top n feautres based on the RFE class.
    '''
    
    model = LinearRegression()
    
    rfe = RFE(model, n_features_to_select=k_value)
    
    rfe.fit(x, y)
    
    
    df = pd.DataFrame(
{
    'rfe_ranking':rfe.ranking_
}, index = X_train.columns)
    
    return df

In [35]:
rfe(X_train, y_train, 2)

Unnamed: 0,rfe_ranking
total_bill,8
size,3
price_per_person,12
sex_Female,11
sex_Male,9
smoker_No,10
smoker_Yes,7
day_Fri,5
day_Sat,1
day_Sun,4


In [36]:
def rfe_2(X, y, k=2):
    rfe = RFE(LinearRegression(), n_features_to_select = k)
    rfe.fit(X, y)
    feature_mask = rfe.support_
    
    return X.iloc[:,feature_mask].columns.tolist()

In [37]:
rfe_2(X_train, y_train)

['day_Sat', 'time_Lunch']

# Question 4:
Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both Select K Best and Recursive Feature Elimination (use the functions you just built to help you out).

In [38]:
swiss = data('swiss')

In [46]:
X_train = swiss.drop(columns = 'Fertility')
y_train = swiss.Fertility

In [47]:
select_kbest(X_train, y_train, 3)

Unnamed: 0,p,f
Agriculture,0.0149172,6.408884
Examination,9.450437e-07,32.208745
Education,3.658617e-07,35.445582


In [48]:
rfe_2 (X_train, y_train)

['Education', 'Infant.Mortality']

In [50]:
rfe(X_train, y_train, 3)

Unnamed: 0,rfe_ranking
Agriculture,2
Examination,1
Education,1
Catholic,3
Infant.Mortality,1
