# Feature Selection Techniques tutorial 

This is an introductory tutorial on some well-known feature selection techniques. Many of the techniques are taken from
'Feature Engineering Made Easy' by Sinan Ozdemir and Divya Susarla. You can use this tutorial as a companion while reading the book, and feel free to play with the code. I use the well-known 'Titanic' dataset because it is easy to understand; the goal is to predict who survives (the 'Survived' column). Each feature selection technique comes with a brief description and its pros and cons. 

# Table of contents:
0. [Data Preparation](#0)
1. [What is feature selection](#1)
2. [Measuring the effect of feature selection on machine learning performance](#2)
3. [Feature Selection Techniques](#3)
     - [Univariate Statistics](#4)
          * [Selecting features using Pearson's Correlation](#4.1)   
          * [Hypothesis Testing using SelectKBest](#4.2) 
     - [Model-Based Feature Selection](#5)
          * [Decision Trees](#5.1)
          * [Logistic Regression](#5.2)
          * [LinearSVM](#5.3)
4. [References](#6)

In [1]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from copy import deepcopy
from sklearn.feature_selection import f_classif
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.svm import LinearSVC

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Reading data from csv files:
train= pd.read_csv('../input/titanic/train.csv',nrows=100000)
test = pd.read_csv('../input/titanic/test.csv',nrows=100000)

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv


# Data Preparation <a id="0"></a>

Some columns are removed from this dataset. Because the tutorial focuses on implementing feature selection techniques, the 'Cabin' and 'Ticket' features are dropped for now and can be dealt with in another version. 

In [3]:
train= train.drop(['PassengerId','Name','Ticket','Cabin'],axis=1)
print(train.dtypes)
train

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.2500,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.9250,S
3,1,1,female,35.0,1,0,53.1000,S
4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S
887,1,1,female,19.0,0,0,30.0000,S
888,0,3,female,,1,2,23.4500,S
889,1,1,male,26.0,0,0,30.0000,C


Machine learning models need numeric input, so we convert the text labels into numeric category codes.

In [4]:
train["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [5]:
train["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [6]:
cleanup_nums = { "Embarked": {"S": 0, "C": 1, "Q": 2 },"Sex":     {"male": 0, "female": 1}}


In [7]:
train.replace(cleanup_nums, inplace=True)
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.25,0.0
1,1,1,1,38.0,1,0,71.2833,1.0
2,1,3,1,26.0,0,0,7.925,0.0
3,1,1,1,35.0,1,0,53.1,0.0
4,0,3,0,35.0,0,0,8.05,0.0
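
The integer codes above give `Embarked` an order (S < C < Q) that the ports do not really have. As a minimal sketch, the same mapping can be written with `Series.map`, and one-hot encoding via `pd.get_dummies` is a possible alternative that avoids the implied order. The small frame below is made up for illustration and is not taken from the dataset:

```python
import pandas as pd

# Toy frame with the two text columns from the tutorial (values are made up).
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Integer codes, mirroring the mapping used in the cell above.
coded = toy.copy()
coded["Sex"] = coded["Sex"].map({"male": 0, "female": 1})
coded["Embarked"] = coded["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# One-hot alternative: one indicator column per port, with no implied order.
one_hot = pd.get_dummies(toy, columns=["Embarked"])

print(coded)
print(one_hot.columns.tolist())
```

Tree-based models are usually tolerant of integer codes, while linear models read them as magnitudes, so the choice of encoding can matter more for the linear selectors used later.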


Rows with null values will be removed to avoid errors while using the dataset in the machine learning models.

In [8]:
train=train.dropna()
train

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.2500,0.0
1,1,1,1,38.0,1,0,71.2833,1.0
2,1,3,1,26.0,0,0,7.9250,0.0
3,1,1,1,35.0,1,0,53.1000,0.0
4,0,3,0,35.0,0,0,8.0500,0.0
...,...,...,...,...,...,...,...,...
885,0,3,1,39.0,0,5,29.1250,2.0
886,0,2,0,27.0,0,0,13.0000,0.0
887,1,1,1,19.0,0,0,30.0000,0.0
889,1,1,0,26.0,0,0,30.0000,1.0


# What is feature selection<a id="1"></a>

Feature selection is a part of feature engineering. It aims to exclude features that are not important, keeping only the features that contribute most to the model's predictions. 
***
# **Advantages of feature selection:**
1. It can result in a better-performing model.
2. It creates a model that is easier to understand. 
3. It results in a model that runs faster. 
4. It reduces the chance of overfitting. 

# Measuring the effect of feature selection on machine learning performance<a id="2"></a>
We need a function to evaluate the performance of each feature selection technique we use in this tutorial. To optimise the performance of our machine learning model, we tune its hyperparameters with a grid search and report the best mean cross-validated accuracy. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html to read about GridSearchCV.
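
As a rough sketch of what a grid search returns, the example below runs `GridSearchCV` on synthetic data. The names `X_demo` and `y_demo` and the use of `make_classification` are assumptions for illustration only; they stand in for the Titanic features so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [1, 3, 5]},
                    cv=5)
grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best candidate.
print(grid.best_params_, round(grid.best_score_, 3))
# cv_results_ holds one entry per candidate, including timing information.
print(grid.cv_results_["mean_fit_time"].shape)
```

Because `best_score_` is averaged over the folds, it is a fairer estimate than accuracy measured on the same rows the model was fitted on.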


In [9]:

def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,params,error_score=0., verbose=0, n_jobs=2)
    grid.fit(X, y) # fit the model and parameters
    s= "Best Accuracy: {}".format(grid.best_score_)+ '\n'+\
       "Best Parameters: {}".format(grid.best_params_) +'\n'+\
       "Average Time to Fit (s):{}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)) +'\n'+\
       "Average Time to Score (s):{}".format(round(grid.cv_results_['mean_score_time'].mean(), 3))
    return s


In [10]:
# Separate the feature columns from the target column so the models can make predictions. 
X = train.drop('Survived', axis=1)
# create our response variable
y = train['Survived']
# hyperparameter grid for the machine learning model 
tree_params = {'max_depth':[None,1, 3, 5,7]}
# a decision tree is the classifier that we will use. 
d_tree = DecisionTreeClassifier()

print('Results for training set:','\n',get_best_model_and_accuracy(d_tree,tree_params,X, y))



Results for training set: 
 Best Accuracy: 0.7992416034669556
Best Parameters: {'max_depth': 5}
Average Time to Fit (s):0.005
Average Time to Score (s):0.003


# Feature Selection Techniques <a id="3"></a>
We will explore three main general techniques and illustrate each one by implementing code on the given dataset. 

# 1- Univariate Statistics <a id="4"></a>

Univariate statistics involves checking for statistically significant relationship between each feature and the target. We will focus on two techniques under this heading: Pearson's Correlation and Hypothesis Testing. 


# Selecting features using Pearson's Correlation <a id="4.1"></a>
Pearson correlation is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1.
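
To make the definition concrete, here is a small sketch that computes Pearson's r by hand as the covariance divided by the product of the standard deviations. The helper name `pearson_r` and the toy vectors are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly linear, decreasing
print(pearson_r(x, [1, 4, 9, 16, 25]))  # monotonic but not linear
```

The third case shows the limitation of the measure: the relationship is exact, yet r stays below 1 because it is not a straight line.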

In [11]:
train.corr()
# here is a table that shows the correlation between each feature with the rest of the features. 

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
Survived,1.0,-0.356462,0.536762,-0.082446,-0.015523,0.095265,0.2661,0.108517
Pclass,-0.356462,1.0,-0.150826,-0.365902,0.065187,0.023666,-0.552893,-0.108502
Sex,0.536762,-0.150826,1.0,-0.099037,0.106296,0.249543,0.182457,0.097129
Age,-0.082446,-0.365902,-0.099037,1.0,-0.307351,-0.187896,0.093143,0.012186
SibSp,-0.015523,0.065187,0.106296,-0.307351,1.0,0.383338,0.13986,0.004021
Parch,0.095265,0.023666,0.249543,-0.187896,0.383338,1.0,0.206624,-0.014082
Fare,0.2661,-0.552893,0.182457,0.093143,0.13986,0.206624,1.0,0.176859
Embarked,0.108517,-0.108502,0.097129,0.012186,0.004021,-0.014082,0.176859,1.0


In [12]:
train.corr()['Survived'] 
# we home in on the correlation of each feature with the target, since that is what we are investigating. 

Survived    1.000000
Pclass     -0.356462
Sex         0.536762
Age        -0.082446
SibSp      -0.015523
Parch       0.095265
Fare        0.266100
Embarked    0.108517
Name: Survived, dtype: float64

In [13]:
all_corr=train.corr()['Survived'].abs() >= .2
highly_correlated =all_corr[all_corr==True].index
# Only the features whose absolute correlation with the target is 0.2 or above will be selected. 
highly_correlated= highly_correlated.tolist() 
highly_correlated.remove('Survived')
# all_corr also contains the target column 'Survived', so subtract 1 to count features only.
print('The number of features removed out of', all_corr.size - 1, 'is', all_corr.size - 1 - len(highly_correlated),', leaving',len(highly_correlated),'selected features.')
highly_correlated

The number of features removed out of 7 is 4 , leaving 3 selected features.


['Pclass', 'Sex', 'Fare']

Only the selected features in the list 'highly_correlated' will be used to make predictions for the 'Survived' column.

In [14]:
# Only the columns in the highly_correlated list (absolute correlation of 0.2 or above)
# will be used to make predictions for the 'Survived' column.
X_subsetted = X[highly_correlated]
print(get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y))


Best Accuracy: 0.7991431104107161
Best Parameters: {'max_depth': 3}
Average Time to Fit (s):0.005
Average Time to Score (s):0.003


# Hypothesis testing using SelectKBest <a id="4.2"></a>
This method calculates a p-value to decide whether a hypothesis can be rejected.<br/> P-value stands for 'probability value'; it is the probability of seeing a result at least as extreme as the observed one if the null hypothesis (no relationship between the feature and the target) were true. <br/>
SelectKBest scores the features against the target variable using a function (in this case f_classif, but others are available) and then keeps the k most significant features, i.e. the features that have the highest scores and therefore the lowest p-values. 

f_classif performs a statistical test known as ANOVA. <br/>
ANOVA: Analysis of variance is a collection of statistical models and their associated estimation procedures used to analyze the differences among group means in a sample.
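
As a sketch of what the ANOVA test measures, the example below computes the one-way F statistic by hand (between-class variation divided by within-class variation) and compares it with `f_classif`. The helper `anova_f` and the synthetic columns are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def anova_f(feature, target):
    """One-way ANOVA F statistic for a single feature grouped by class label."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target)
    classes = np.unique(target)
    grand_mean = feature.mean()
    groups = [feature[target == c] for c in classes]
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
informative = rng.normal(loc=y_demo * 2.0, scale=1.0)  # mean shifts with the class
noise = rng.normal(size=100)                            # unrelated to the class
X_demo = np.column_stack([informative, noise])

f_scores, p_values = f_classif(X_demo, y_demo)
manual = [anova_f(X_demo[:, j], y_demo) for j in range(2)]
print("manual F :", np.round(manual, 3))
print("f_classif:", np.round(f_scores, 3))
print("p-values :", p_values)
```

A large F score goes with a small p-value, which is why keeping the highest-scoring features is the same as keeping the features with the lowest p-values.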

In [15]:
# keep only the best 4 features according to p-values of ANOVA test
k_best = SelectKBest(f_classif, k=4)

In [16]:
# fit the selector to the data and then transform it.
k_best.fit_transform(X, y)
k_best

SelectKBest(k=4)

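fit_transform fits the selector and returns the reduced feature matrix in a single call. The same method is documented for StandardScaler here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html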

In [17]:
# get the p values of columns
k_best.pvalues_
# make a dataframe of features and p-values
# sort that dataframe by p-value
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
# show the top 4 features
p_values.head(4)

Unnamed: 0,column,p_value
1,Sex,2.242852e-54
0,Pclass,9.30362e-23
5,Fare,5.256796e-13
6,Embarked,0.003742742


In [18]:
# features with a low p value
p_values[p_values['p_value'] < .01]

Unnamed: 0,column,p_value
1,Sex,2.242852e-54
0,Pclass,9.30362e-23
5,Fare,5.256796e-13
6,Embarked,0.003742742


In [19]:
k_best = SelectKBest(f_classif)
# setting the parameters
tree_pipe_params = {'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
select_k_pipe = Pipeline([('k_best', k_best),('classifier', d_tree)])
select_k_best_pipe_params = deepcopy(tree_pipe_params)
select_k_best_pipe_params.update({'k_best__k':[1,2,3,4,5,6] + ['all']})
print(select_k_best_pipe_params,'\n')

print('Results for training set:','\n',get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y),'\n')

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'k_best__k': [1, 2, 3, 4, 5, 6, 'all']} 

Results for training set: 
 Best Accuracy: 0.8034374076627598
Best Parameters: {'classifier__max_depth': 5, 'k_best__k': 6}
Average Time to Fit (s):0.008
Average Time to Score (s):0.003 



See https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline to read about the Pipeline class.
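
The following is a minimal sketch of how a pipeline combines selection and classification during a grid search, using synthetic data (`X_demo`, `y_demo` are stand-ins, not the Titanic data). It shows the `<step name>__<parameter name>` convention used in the parameter grids of this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

pipe = Pipeline([("k_best", SelectKBest(f_classif)),
                 ("classifier", DecisionTreeClassifier(random_state=0))])

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {"k_best__k": [2, 4, "all"], "classifier__max_depth": [3, 5]}

# Within each CV fold, SelectKBest is fitted on the training part only,
# so the held-out part never influences which features are kept.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
# Boolean mask of the features kept by the refitted best pipeline.
print(search.best_estimator_.named_steps["k_best"].get_support())
```

Keeping the selector inside the pipeline is a design choice that guards against leaking information from the validation folds into the feature selection step.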

# Pros and Cons of Pearson's Correlation and Hypothesis Testing: 
**Correlation coefficients like Pearson's: <br />**
**Pros:** <br/>
1- It gives an idea of how strongly the variables are related to each other.<br/>
**Cons:** <br/>
1- It assumes a linear relationship between the variables, which is not always the case.<br/>
2- It does not determine causation between any two variables.
<br />
<br />
**Hypothesis Testing:** <br />
**Pros:** <br/>
1- It aids in reaching a conclusion by examining a sample. <br />
**Cons:** <br/>
1- Results of significance tests are based on probabilities and as such cannot be expressed with full certainty. <br />
2- Statistical inferences based on significance tests cannot be taken as entirely conclusive evidence about the truth of the hypothesis.
***
It is worth noting that correlation and p-value are among the most common statistical measures for identifying relationships between variables. 

**Difference between correlation and P value:**<br />

Correlation is used to measure *how strong a relationship is between variables*. <br />
Famous types of correlations: <br />
1- Pearson <br />
2- Kendall <br />
3- Spearman <br />

On the other hand, p-value measures *how well your data rejects the null hypothesis*, which claims that the two compared variables have no relationship. 

Both correlation and p-value describe the relationship between variables, but neither establishes causation on its own: correlation does not imply causation, and a small p-value only indicates that the observed association would be unlikely if the null hypothesis were true. The following article discusses how hypothesis testing can be used within a causal framework: https://lumina.com/causal-hypothesis-testing/#:~:text=The%20framework%20defines%20a%20p,is%20used%20in%20classic%20statistics.
***
**Bottom line:** <br />
If you want to measure the strength of the linear relationship between two continuous variables, use correlation techniques like Pearson's. 

If you want to draw conclusions about the population using sample data, go for hypothesis testing. A hypothesis test helps us make a decision about the population that is supported by sample data. 


# 2- Model-Based Selection <a id="5"></a>
Model selection is the process of choosing between different machine learning approaches or choosing between different hyperparameters or sets of features for the same machine learning approach. <br/>
The two main families of machine learning models that we will use in this section for feature selection are tree-based models and linear models. Both have a notion of feature ranking that is useful when subsetting feature sets.<br/>  In this dataset, we will use a decision tree classifier as the tree-based model, and logistic regression and a linear SVM as the linear models. 


In [20]:
# by default, 75% goes to the training set while 25% goes to the test set. 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print('train_test_split is used to split the dataset into two pieces, so that the model can be trained and tested on different data. This is a better method for evaluating the model performance rather than testing it on the training data only. ')

train_test_split is used to split the dataset into two pieces, so that the model can be trained and tested on different data. This is a better method for evaluating the model performance rather than testing it on the training data only. 


Check this to have a better understanding of train_test_split https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split
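
A small sketch of the default split proportions, on made-up arrays (`X_demo` and `y_demo` are illustrative only). It also shows the optional `stratify` argument, which the tutorial's own cell does not use, as one way to keep the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # 100 rows, 2 made-up features
y_demo = np.array([0] * 60 + [1] * 40)    # imbalanced toy labels

# Default split: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# stratify=y keeps the class ratio the same in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)
print(y_tr_s.mean(), y_te_s.mean())
```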


# i) Decision Trees <a id="5.1"></a>
A decision tree is a decision support tool that uses a tree-like model of decisions. Decision Trees provide an effective means for decision making because they consider all possible branches / scenarios as well as their outcome. 

In [21]:
tree = DecisionTreeClassifier()
tree.fit(X, y)
importances = pd.DataFrame({'importance': tree.feature_importances_,'feature':X.columns}).sort_values('importance', ascending=False)
importances.head()

Unnamed: 0,importance,feature
1,0.29736,Sex
2,0.267482,Age
5,0.188768,Fare
0,0.143748,Pclass
3,0.058308,SibSp


SelectFromModel is a scikit-learn meta-transformer that keeps the most important features according to a machine learning model's internal metric for feature importance (feature_importances_ or coef_). It is similar to SelectKBest in that it keeps the most important features. However, it measures the importance of a feature with the model's internal metric rather than a p-value, and it selects by an importance threshold rather than by a fixed number k. You can read more about SelectFromModel here https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html?highlight=selectfrom%20mode#sklearn.feature_selection.SelectFromModel
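
To illustrate how the threshold works, the sketch below fits `SelectFromModel` with a numeric threshold and with the string thresholds `"mean"` and `"median"` on synthetic data (the data and variable names are assumptions for illustration). A feature is kept when its importance is at least the resolved threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)

for threshold in [0.05, "mean", "median"]:
    selector = SelectFromModel(DecisionTreeClassifier(random_state=0),
                               threshold=threshold)
    selector.fit(X_demo, y_demo)
    importances = selector.estimator_.feature_importances_
    # threshold_ is the numeric value the selector resolved the setting to.
    print(threshold, "->", round(float(selector.threshold_), 4),
          "keeps", int(selector.get_support().sum()), "of", X_demo.shape[1])
```

Unlike `SelectKBest`, the number of kept features is not fixed in advance; it depends on how the importances are distributed.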

In [22]:

select_from_model = SelectFromModel(DecisionTreeClassifier(),threshold=.05)
selected_X = select_from_model.fit_transform(X, y)
# the second value of the shape is the number of features selected by the model. 
selected_X.shape

(712, 5)

In [23]:
# create a SelectFromModel that is tuned by a DecisionTreeClassifier
select = SelectFromModel(DecisionTreeClassifier())
select_from_pipe = Pipeline([('select', select),('classifier', d_tree)])
select_from_pipe_params = deepcopy(tree_pipe_params)
select_from_pipe_params.update({'select__threshold': [.01, .05, .1, .2, .25, .3, .4, .5, .6, "mean","median", "2.*mean"],'select__estimator__max_depth': [None, 1, 3, 5, 7]})
print(select_from_pipe_params,'\n')
print('Results for training set:','\n', get_best_model_and_accuracy(select_from_pipe,select_from_pipe_params,X_train, y_train), '\n')
print('Results for test set:','\n',get_best_model_and_accuracy(select_from_pipe,select_from_pipe_params,X_test, y_test))

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'select__threshold': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7]} 

Results for training set: 
 Best Accuracy: 0.8202080761770411
Best Parameters: {'classifier__max_depth': 3, 'select__estimator__max_depth': None, 'select__threshold': 0.01}
Average Time to Fit (s):0.009
Average Time to Score (s):0.003 

Results for test set: 
 Best Accuracy: 0.7696825396825396
Best Parameters: {'classifier__max_depth': None, 'select__estimator__max_depth': 1, 'select__threshold': 0.01}
Average Time to Fit (s):0.009
Average Time to Score (s):0.003


In [24]:
select_from_pipe.set_params(**{'select__threshold': 0.01,'select__estimator__max_depth': None,'classifier__max_depth': 3})
# fit our pipeline to our data
select_from_pipe.steps[0][1].fit(X, y)
# this presents the names of the features that are selected by the model.
X.columns[select_from_pipe.steps[0][1].get_support()]

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')

We could continue by trying several other tree-based models, such as RandomForestClassifier or ExtraTreesClassifier.

# Decision trees pros and cons: 

**Pros:** <br/>
1- Compared to other algorithms, decision trees require **less pre-processing**. Decision trees are generally more robust to outliers and missing values than regression techniques. <br/>
2- **No normalization** required. <br/>
3- **No scaling** required. <br/>
4- **Not necessary to deal with missing values**, depending on the implementation. Missing values do not have to affect the process of building a decision tree, because the tree works by segmenting the population and can treat missing values as a class of their own.<br/>
5- Decision trees are **easy to understand**. Decision trees require no complex formulas. They consider all of the decision alternatives for quick comparisons in a format that is comprehensible.<br/>

**Cons:**<br/>
1- Building a decision tree can require **more memory and time**.<br/>
2- A small change in the data can result in a large change in the tree structure, so the tree model is **highly sensitive to the training data**. 
<br/>
3- **Higher space and time complexity** for a decision tree. 
***
As a side note, a single decision tree is often a weak learner, so a random forest is usually used instead for better prediction. 

# ii) Logistic Regression <a id="5.2"></a>
A linear regression model predicts the target as a weighted sum of the feature inputs. 
Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.<br/>
Linear models work by assigning each feature a coefficient that tells how much the response changes when the feature changes. <br/>
You can read more about the technical details here https://christophm.github.io/interpretable-ml-book/limo.html#limo.

# When do we use logistic Regression? 
**You should think about using logistic regression when your Y variable takes on only two values (e.g when you are facing a classification problem)**

# When is using any type of regression not suitable?
The two most common kinds of issues are (1) when your data contain major violations of regression assumptions and (2) when you don't have enough data (or of the right kinds).

Core assumptions behind regression include

- That there is in fact a relationship between the outcome variable and the predictor variables.

- That observations are independent.

- That the residuals are **normally distributed** and **independent** of the values of variables in the model.

- That each predictor variable is not a linear combination of any others and is not extremely correlated with any others.

- Additional assumptions depend on the nature of your dependent variable; for example whether it is measured on a continuous scale or is categorical yes/no etc. The form of regression you use (linear, logistic, etc.) must match the type of data.

Not having enough data means having very few cases at all or having large amounts of missing values for the variables you want to analyze. If you don't have enough observations, your model either will not be able to run or else the estimates could be so imprecise (with large standard errors) that they aren't useful.
***
source: https://www.answers.com/Q/When_regression_is_not_applicable?#slide=1

In linear models, regularization is a method for imposing additional constraints on a
learning model, where the goal is to prevent overfitting and improve how well the model
generalizes to new data. l1 and l2 are regularization methods.
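
The sketch below compares the two penalties on synthetic data, as an illustration of why l1 is useful for feature selection: it tends to push some coefficients to exactly zero, while l2 only shrinks them. It assumes the `liblinear` solver, since the default `lbfgs` solver does not support the l1 penalty; the data, the value of `C` and the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)  # put coefficients on one scale

coefs = {}
for penalty in ["l1", "l2"]:
    # liblinear supports both penalties; a small C means strong regularization.
    model = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear")
    model.fit(X_demo, y_demo)
    coefs[penalty] = model.coef_.ravel()
    print(penalty, np.round(coefs[penalty], 3))

n_zero_l1 = int((coefs["l1"] == 0).sum())
n_zero_l2 = int((coefs["l2"] == 0).sum())
print("exact zeros -> l1:", n_zero_l1, " l2:", n_zero_l2)
```

Features whose coefficient is zero (or below the chosen threshold) are the ones a `SelectFromModel` selector built on this model would drop.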

In [25]:
logistic_selector = SelectFromModel(LogisticRegression(max_iter = 1500))
regularization_pipe = Pipeline([('select', logistic_selector),('classifier', tree)])
regularization_pipe_params = deepcopy(tree_pipe_params)
regularization_pipe_params.update({'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],'select__estimator__penalty': ['l1', 'l2'],})
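# l1 and l2 are regularization methods.
# Note: the default lbfgs solver only supports l2, so the l1 candidates fail to fit and are
# scored 0 through error_score=0. in the helper; pass solver='liblinear' to evaluate l1 as well.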

print(regularization_pipe_params) 
print('Results for training set:','\n',get_best_model_and_accuracy(regularization_pipe,regularization_pipe_params,X_train, y_train),'\n')
print('Results for test set:','\n',get_best_model_and_accuracy(regularization_pipe,regularization_pipe_params,X_test, y_test))

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2']}
Results for training set: 
 Best Accuracy: 0.7996825956621407
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__penalty': 'l2', 'select__threshold': 0.01}
Average Time to Fit (s):0.027
Average Time to Score (s):0.002 

Results for test set: 
 Best Accuracy: 0.7752380952380953
Best Parameters: {'classifier__max_depth': 3, 'select__estimator__penalty': 'l2', 'select__threshold': 0.05}
Average Time to Fit (s):0.02
Average Time to Score (s):0.001


In [26]:
regularization_pipe.set_params(**{'select__threshold': 0.01,'classifier__max_depth': 5,'select__estimator__penalty': 'l2'})
# fit our pipeline to our data
regularization_pipe.steps[0][1].fit(X, y)
# list the columns  selected by calling the get_support() method from SelectFromModel
X.columns[regularization_pipe.steps[0][1].get_support()]

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

# Pros and Cons of Logistic Regression:
**Pros:** <br/>
1- Logistic regression can be a good (and effective) choice, provided that your dataset is fit for it. <br/>
2- Logistic regression is less prone to over-fitting, but it can overfit in high-dimensional datasets. You should consider regularization (L1 and L2) to avoid over-fitting in these scenarios. Regularization is a method for imposing additional constraints on a learning model.<br/>
3- In addition to indicating how relevant a predictor is through the size of its coefficient, logistic regression also shows the direction of the predictor's association (positive or negative). 
<br/><br/>
**Cons:** <br/>
1- It can only be used to predict discrete outcomes (classes).<br/>
2- It should not be used when the number of observations is less than the number of features. <br/>
3- It assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which often does not hold for real-world data. 


# iii) LinearSVC <a id="5.3"></a>

LinearSVC is a linear model that separates the classes in Euclidean space with a linear decision boundary (a hyperplane). The underlying formulation is binary; scikit-learn's LinearSVC handles more than two classes with a one-vs-rest scheme. 

In [27]:
svc_selector = SelectFromModel(LinearSVC(max_iter=100000,dual=False))
svc_pipe = Pipeline([('select', svc_selector),('classifier', tree)])
svc_pipe_params = deepcopy(tree_pipe_params)
svc_pipe_params.update({'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],'select__estimator__penalty': ['l1', 'l2'],'select__estimator__loss': ['squared_hinge', 'hinge'],'select__estimator__dual': [True, False]})
# Note: LinearSVC does not support every penalty/loss/dual combination in this grid
# (e.g. penalty='l1' with loss='hinge'); those candidates fail to fit and are scored 0 through error_score=0.
print(svc_pipe_params,'\n')
print('Results for training set:','\n',get_best_model_and_accuracy(svc_pipe,svc_pipe_params,X_train, y_train),'\n')
print('Results for test set:','\n',get_best_model_and_accuracy(svc_pipe,svc_pipe_params,X_test, y_test))

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'select__threshold': [0.01, 0.05, 0.1, 'mean', 'median', '2.*mean'], 'select__estimator__penalty': ['l1', 'l2'], 'select__estimator__loss': ['squared_hinge', 'hinge'], 'select__estimator__dual': [True, False]} 

Results for training set: 
 Best Accuracy: 0.7996825956621407
Best Parameters: {'classifier__max_depth': 5, 'select__estimator__dual': False, 'select__estimator__loss': 'squared_hinge', 'select__estimator__penalty': 'l1', 'select__threshold': 0.01}
Average Time to Fit (s):0.328
Average Time to Score (s):0.001 

Results for test set: 
 Best Accuracy: 0.7696825396825396
Best Parameters: {'classifier__max_depth': None, 'select__estimator__dual': True, 'select__estimator__loss': 'hinge', 'select__estimator__penalty': 'l2', 'select__threshold': '2.*mean'}
Average Time to Fit (s):0.109
Average Time to Score (s):0.001
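
The grid above mixes `penalty`, `loss` and `dual` freely, but LinearSVC only implements some of these combinations. The sketch below, on synthetic stand-in data, tries all eight and records which ones raise a `ValueError`. In the grid search those candidates do not stop the run, because the helper passes `error_score=0.` and the failed fits are simply scored 0.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

supported, unsupported = [], []
for penalty, loss, dual in product(["l1", "l2"],
                                   ["squared_hinge", "hinge"],
                                   [True, False]):
    combo = (penalty, loss, dual)
    try:
        LinearSVC(penalty=penalty, loss=loss, dual=dual,
                  max_iter=10000).fit(X_demo, y_demo)
        supported.append(combo)
    except ValueError:
        # Raised for combinations the underlying solver does not implement.
        unsupported.append(combo)

print("supported  :", supported)
print("unsupported:", unsupported)
```

Restricting the grid to the supported combinations avoids spending search time on candidates that can never win.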




In [28]:
svc_pipe.set_params(**{'select__estimator__loss': 'squared_hinge','select__threshold': 0.01,'select__estimator__penalty': 'l1','classifier__max_depth': 5,'select__estimator__dual': False})
# fit our pipeline to our data
svc_pipe.steps[0][1].fit(X, y)
# list the columns that the SVC selected by calling the get_support() method from SelectFromModel
X.columns[svc_pipe.steps[0][1].get_support()]

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked'], dtype='object')

The SVM module (SVC, NuSVC, etc) is a **wrapper** around the libsvm library and supports different kernels while LinearSVC is based on liblinear and only supports a linear kernel. 

Note: <br/>
We have come across the term 'wrapper' several times in this tutorial. There are three classes of feature selection methods: filter, wrapper, and embedded methods. Wrapper methods evaluate the importance of features based on classifier performance. Filter methods evaluate the features via univariate statistics instead of cross-validation performance. Finally, embedded methods are quite similar to wrapper methods; the difference is that an intrinsic model-building metric is used during learning. Please refer to https://sebastianraschka.com/faq/docs/feature_sele_categories.html to learn more.

# Pros and cons of SVC: 
**Pros:** <br/>
1- Effective in high dimensional spaces. <br/>
2- Memory efficient because it uses only a subset of the training points in the decision function (the support vectors).<br/>

**Cons:**<br/>
1- Similar to logistic regression, SVC should not be used when the number of observations is less than the number of features, because the model is then prone to overfitting. <br/>
2- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.


#  Now, having gone through the three model-based selection techniques, which one do you think is the most efficient in this dataset? 

# Tips on when to use each technique: 
If your features are mostly categorical, you should start by trying a SelectKBest with a Chi2 ranker or a tree-based model selector (like a decision tree). <br/>
If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield better results.<br/>
If you are solving a binary classification problem, using a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients to optimize for binary classification tasks.

# References: <a id="6"></a>
* Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
* Feature Engineering Made Easy by Sinan Ozdemir and Divya Susarla
* https://christophm.github.io/interpretable-ml-book/limo.html#limo
* https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
* https://dataschool.com/fundamentals-of-analysis/correlation-and-p-value/#:~:text=The%20two%20most%20commonly%20used,an%20experiment%20is%20statistically%20significant.
* https://lumina.com/causal-hypothesis-testing/#:~:text=The%20framework%20defines%20a%20p,is%20used%20in%20classic%20statistics.
* https://www.coursehero.com/file/p4jtjlo/The-disadvantages-of-the-Pearson-r-correlation-method-are-It-assumes-that-there/#:~:text=The%20disadvantages%20of%20the%20Pearson%20r%20correlation%20method%20are%3B%E2%9D%96,does%20not%20necessarily%20mean%20very
* https://www.wisdomjobs.com/e-university/research-methodology-tutorial-355/limitations-of-the-tests-of-hypotheses-11539.html
* And, of course, Quora and Stack Overflow. 