<center>Applied Machine Learning</center>

***

<center>Lecture 9</center>

***

<center>Feature Selection <br> + <br>Automate ML & Parameters Tuning</center>

***

<center>8 April 2021</center>
<center>Rahman Peimankar</center>

# Feature Selection For Machine Learning Problems

* The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.
* Irrelevant or partially relevant features can negatively impact model performance.
* In this lecture, you will discover automatic feature selection techniques for preparing your machine learning data in Python with scikit-learn.

# Feature Selection

* Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
* Irrelevant features can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression.

# Benefits of Feature Selection

* **Reduces Overfitting**: Less redundant data means less opportunity to make decisions based on noise.

* **Improves Accuracy**: Less misleading data means modeling accuracy improves.

* **Reduces Training Time**: Less data means that algorithms train faster.

Learn more about feature selection with scikit-learn:<br>
https://scikit-learn.org/stable/modules/feature_selection.html

# Different Feature Selection Methods

1. Univariate Selection
2. Recursive Feature Elimination
3. Principal Component Analysis
4. Feature Importance

# 1. Univariate Selection

* Statistical tests can be used to select those features that have the strongest relationship with the output variable.
* The scikit-learn library provides the **SelectKBest** class that can be used with a suite of different statistical tests to select a specific number of features.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
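
As a minimal sketch of what "a suite of different statistical tests" means in practice, the same `SelectKBest` selector can be given different scoring functions. The synthetic dataset and the names `X_demo`, `y_demo` and `X_reduced` are assumptions made for illustration only; they are not part of the lecture data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic data (assumption for illustration): 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Same selector class, two different univariate scoring functions
for score_func in (f_classif, mutual_info_classif):
    selector = SelectKBest(score_func=score_func, k=3)
    X_reduced = selector.fit_transform(X_demo, y_demo)
    # Indices of the k columns kept, and the shape of the reduced matrix
    print(score_func.__name__, selector.get_support(indices=True), X_reduced.shape)
```

Different scoring functions may keep different columns, so it is usually worth comparing a few of them on your own data.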

# Using chi-squared ($\chi^2$) Univariate Statistical Test for Feature Selection

In [1]:
from pandas import read_csv
from numpy import set_printoptions
set_printoptions(precision=3)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
y = array[:,8]

In [3]:
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

In [4]:
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
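
The raw scores and the transformed array above do not show which columns were kept. A small sketch of how the selection could be mapped back to column names with `get_support()` follows. The DataFrame `demo` and its column names are hypothetical stand-ins for illustration, built from random non-negative counts because `chi2` requires non-negative features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)

# Hypothetical data: two columns depend on the label, two are pure noise
demo = pd.DataFrame({
    "signal_a": rng.poisson(5 + 10 * y_demo),
    "signal_b": rng.poisson(20 + 15 * y_demo),
    "noise_a": rng.poisson(10, size=n),
    "noise_b": rng.poisson(30, size=n),
})

selector = SelectKBest(score_func=chi2, k=2).fit(demo, y_demo)

# Boolean mask over the input columns -> names of the selected features
selected = demo.columns[selector.get_support()]
print(list(selected))
```

Assuming `diabetes.csv` has a header row, the same idea should apply to the lecture data through `df.columns[:8][fit.get_support()]`.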


# 2. Recursive Feature Elimination (RFE)

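* RFE recursively removes features and rebuilds the model on the remaining features.
* In scikit-learn, RFE ranks the features at each step by the model's coefficients or feature importances and drops the weakest ones, which identifies the features that best predict the target label.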
* Learn more about RFE class in scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
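
The elimination loop described above can be sketched by hand to make the idea concrete. This is a simplified illustration, not scikit-learn's actual implementation. It assumes a binary logistic regression whose absolute coefficients serve as feature importance, and it uses synthetic, standardized data (`X_demo`, `y_demo`) so that coefficients are comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)  # put features on the same scale

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_keep = 3

while len(remaining) > n_keep:
    # Refit the model on the features that are left
    model = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    # The feature with the smallest absolute coefficient is considered weakest
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    del remaining[weakest]

print("Remaining feature indices:", remaining)
```

The `RFE` class wraps this kind of loop and additionally records the elimination order in `ranking_`.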

# Using RFE with Logistic Regression

In [5]:
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [6]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
y = array[:,8]

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
# feature extraction
model = LogisticRegression()
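# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)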
fit = rfe.fit(X, y)
print("Num Features: {}".format(fit.n_features_))
print("Selected Features: {}".format(fit.support_))
print("Feature Ranking: {}".format(fit.ranking_))

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]
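
The example above fixes the number of features at 3. One possible variation, shown here only as a sketch, is scikit-learn's `RFECV`, which chooses the number of features by cross-validated accuracy. The synthetic data, the 5-fold setting and the up-front scaling are assumptions for illustration; on real data the scaling would normally be fitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Eliminate one feature per step and score each subset size with 5-fold CV
rfecv = RFECV(LogisticRegression(), step=1, cv=5, scoring="accuracy")
rfecv.fit(X_demo, y_demo)

print("Num Features: {}".format(rfecv.n_features_))
print("Selected Features: {}".format(rfecv.support_))
print("Feature Ranking: {}".format(rfecv.ranking_))
```

The fitted object exposes the same `support_` and `ranking_` attributes as `RFE`, so the rest of the workflow stays unchanged.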


# 3. Principal Component Analysis