<center>Applied Machine Learning</center>

***

<center>Lecture 9</center>

***

<center>Feature Selection <br> + <br>Automate ML & Parameters Tuning</center>

***

<center>8 April 2021</center>
<center>Rahman Peimankar</center>

# Feature Selection For Machine Learning Problems

* The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.
* Irrelevant or partially relevant features can negatively impact model performance.
* You will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.

# Feature Selection

* Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
* Irrelevant features can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression.

# Benefits of Feature Selection

* **Reduces Overfitting**: Less redundant data means less opportunity to make decisions based on noise.

* **Improves Accuracy**: Less misleading data means modeling accuracy improves.

* **Reduces Training Time**: Less data means that algorithms train faster.

Learn more about feature selection with scikit-learn:<br>
https://scikit-learn.org/stable/modules/feature_selection.html

# Different Feature Selection Methods

1. Univariate Selection
2. Recursive Feature Elimination
3. Principal Component Analysis
4. Feature Importance

# 1. Univariate Selection

* Statistical tests can be used to select those features that have the strongest relationship with the output variable.
* The scikit-learn library provides the **SelectKBest** class that can be used with a suite of different statistical tests to select a specific number of features.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
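To make the idea of a univariate test concrete, the sketch below computes a chi-squared score per feature in plain Python. It assumes non-negative feature values and compares the per-class sum of each feature with the sum expected if the feature were independent of the class. The helper `chi2_scores` and the toy data are hypothetical illustrations, not scikit-learn code, although the calculation is intended to mirror what `chi2` is understood to do.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature (assumes non-negative, non-constant-zero features)."""
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        # total "mass" of feature j over all samples
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_prob = sum(1 for label in y if label == c) / n_samples
            # observed: sum of feature j within class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # expected: share of the total if feature and class were independent
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# toy data: feature 0 differs strongly between classes, feature 1 does not
X_toy = [[3, 1], [4, 1], [0, 1], [1, 1]]
y_toy = [0, 0, 1, 1]

scores = chi2_scores(X_toy, y_toy)
print(scores)  # [4.5, 0.0]
```

A higher score means a stronger dependence between the feature and the class, which is why selecting the k highest scores keeps feature 0 and drops feature 1 here.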

# Using chi-squared ($\chi^2$) Univariate Statistical Test for Feature Selection

In [1]:
from pandas import read_csv
from numpy import set_printoptions
set_printoptions(precision=3)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
y = array[:,8]

In [3]:
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

In [4]:
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]


# 2. Recursive Feature Elimination (RFE)

* RFE recursively removes features and then builds a model on the remaining features.
* It uses the fitted model's coefficients or feature importances to rank the features and to identify which ones contribute most to predicting the target label.
* Learn more about RFE class in scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
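The elimination loop described above can be written out by hand to see what happens at each step. This is a minimal sketch under stated assumptions: synthetic data from `make_classification`, a logistic regression as the base model, and "weakest feature" taken to mean the smallest absolute coefficient. The variable names are illustrative, and the `RFE` class remains the tool to use in practice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in the model
eliminated = []                           # column indices in order of removal

while len(remaining) > 3:
    # refit the model on the features that are still in play
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[:, remaining], y_demo)
    # drop the feature with the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    eliminated.append(remaining.pop(weakest))

print("Kept features:", remaining)
print("Eliminated (in order):", eliminated)
```

Because the model is refit after every removal, a feature's rank depends on which other features are still present, which is what distinguishes RFE from a single univariate ranking.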

### Using RFE with Logistic Regression

In [51]:
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [52]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
# feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: {}".format(fit.n_features_))
print("Selected Features: {}".format(fit.support_))
print("Feature Ranking: {}".format(fit.ranking_))

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]


# 3. Principal Component Analysis (PCA)

* PCA uses linear algebra to transform the data into a compressed form (it is a dimensionality reduction technique).
* You can choose the number of principal components or the number of dimensions to be reduced.
* Learn more about PCA class in scikit-learn here:<br>
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
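Unlike the selection methods above, PCA does not keep a subset of the original columns; it builds new features as linear combinations of them. The sketch below illustrates this on synthetic data (an assumption for illustration) in which two of the three columns are almost perfectly correlated, so a single component is expected to capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data (assumption): column 1 is about 2x column 0, column 2 is small noise
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_demo = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    0.01 * rng.normal(size=200),
])

# reduce 3 original columns to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape)                # rows are preserved, columns become components
print(pca.explained_variance_ratio_)  # share of total variance per component
```

The explained variance ratio is a practical guide for choosing `n_components`: components that add very little variance can usually be dropped.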

### Implementation of PCA  

In [17]:
from pandas import read_csv
from sklearn.decomposition import PCA

# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [56]:
# feature extraction
pca = PCA(n_components=2)
fit = pca.fit(X)

In [57]:
# summarize components
print("Explained Variance: {}".format(fit.explained_variance_ratio_))
print(fit.components_)

Explained Variance: [0.889 0.062]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]]


# 4. Feature Importance

* Ensembles of decision trees, such as Random Forest and Extra Trees, can be used to estimate the importance of features.
* Learn more about **ExtraTreesClassifier** class in scikit-learn here:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html<br>

"*This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.*"


### Implementation of ExtraTreesClassifier (Feature Importance)  

In [47]:
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [55]:
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
print(df.columns)

[0.111 0.23  0.097 0.082 0.075 0.147 0.117 0.14 ]
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
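The importances printed above are easier to interpret when each value is paired with its feature name and sorted. The sketch below does this on synthetic data in which only the first column determines the label; the data, the feature names, and the `random_state` are assumptions made so that the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# synthetic data (assumption): the label depends only on the first column
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
names = ["signal", "noise_1", "noise_2", "noise_3"]  # hypothetical feature names

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# pair each importance with its feature name and sort from most to least important
ranked = sorted(zip(names, model.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("{:10s} {:.3f}".format(name, importance))
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction across the ensemble.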


# Automate Machine Learning Workflows

### Think-Peers

* Can you list standard steps in a machine learning project?

(Breakout rooms!!)

1. Standard steps in a machine learning project can be automated.
2. In scikit-learn *pipelines* can define and automate these steps.

In this section you will learn:
* how to use *pipelines*.
* how to apply a data preparation and modeling pipeline.
* how to apply a feature extraction and modeling pipeline.


# 1. Automating Machine Learning Workflows

* The standard steps in applied machine learning help to overcome common problems, such as data leakage in the evaluation procedure.
* scikit-learn provides a Pipeline utility to help automate machine learning workflows.
* Pipelines work by allowing for a linear sequence of data transforms to be chained together.
* Learn more about **Pipelines** in scikit-learn here:<br>
http://scikit-learn.org/stable/modules/pipeline.html <br>

The goal is to ensure that all of the steps in the pipeline are constrained to the data available
for the evaluation.

# 2. Data Preparation and Modeling Pipeline

* A common mistake in applied machine learning is data leakage: information from the test dataset leaks into the training process.
* To avoid this, you need a strong separation of training and test sets.
* Most data leakage happens in the data preparation step.
* For example, normalizing or standardizing your entire dataset before splitting it is not a correct preparation (preprocessing) method, because the training data would then be influenced by the data in the test set.

Pipelines help prevent data leakage in the evaluation process by ensuring that data preparation such as standardization is constrained to each fold of your cross-validation procedure.
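The leakage argument can be checked directly. In the sketch below, a pipeline is fitted on the training split only, and the statistics learned by the scaler are compared with the training data and with the full dataset. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption): 3 features centred around 5
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7
)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lr", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the scaler sees the training rows only

scaler = pipe.named_steps["standardize"]
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))
uses_full_stats = np.allclose(scaler.mean_, X_demo.mean(axis=0))
print("Scaler mean equals training mean:", uses_train_stats)
print("Scaler mean equals full-data mean:", uses_full_stats)

# at prediction time the test rows are scaled with the training statistics
print("Test accuracy:", pipe.score(X_test, y_test))
```

When the same pipeline object is passed to `cross_val_score`, this fit-on-training-only behaviour is repeated inside every fold.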

# Implementation of Data Preparation Pipeline 

The pipeline below consists of two steps:

1. Standardize the data.
2. Learn a Logistic Regression model.

In [72]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [73]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [74]:
# create pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lr', LogisticRegression()))
model = Pipeline(estimators)

In [91]:
# evaluate pipeline
kfold = KFold(n_splits=10)  # folds are not shuffled; random_state applies only when shuffle=True
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7773410799726589


# 3. Feature Extraction and Modeling Pipeline

* Feature extraction is another procedure that is susceptible to data leakage.
* Like data preparation, feature extraction procedures must be restricted to the data in your training dataset.
* scikit-learn's pipeline module provides a handy tool called **FeatureUnion**, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained.
* Importantly, all the feature extraction and the feature union occurs within each fold of the cross validation procedure.
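A quick way to see what **FeatureUnion** produces is to look at the shape of its output: the columns from each transformer are placed side by side. The sketch below uses synthetic data with 8 input features (an assumption for illustration) and the same two transformers as the example that follows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# synthetic data (assumption): 100 rows, 8 input features, binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),        # 3 new component columns
    ("select_best", SelectKBest(k=6)),   # 6 of the original columns
])

# y is needed because SelectKBest scores features against the label
combined = union.fit_transform(X_demo, y_demo)
print(combined.shape)  # (100, 9)
```

The downstream model therefore trains on 9 columns, 3 from PCA and 6 from the univariate selection, and some information may be represented twice.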


# Implementation of Feature Extraction Pipeline

In this example, we implement the pipeline with four steps:
* Feature Extraction with Principal Component Analysis (3 features).
* Feature Extraction with Statistical Selection (6 features).
* Feature Union.
* Learn a Logistic Regression Model.

In [82]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest

In [83]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [84]:
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)

In [85]:
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression()))
model = Pipeline(estimators)

In [90]:
# evaluate pipeline
kfold = KFold(n_splits=10)  # folds are not shuffled; random_state applies only when shuffle=True
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7773410799726589


# Summary

We learned how to use Pipelines in two important use cases:

1. Data preparation and modeling constrained to each fold of the cross validation procedure.
2. Feature extraction and feature union constrained to each fold of the cross validation procedure.

# Improve Performance with Algorithm Tuning

* Machine learning models are parameterized so that their behavior can be tuned for a given problem.
* Models can have many parameters and finding the best combination of parameters can be treated as a search problem.

In this lesson, you will see:

1. The importance of algorithm parameter tuning to improve algorithm performance.
2. How to use a grid search algorithm tuning strategy.
3. How to use a random search algorithm tuning strategy.

# Machine Learning Algorithm Parameters

* Algorithm tuning is a final step in the process of applied machine learning before finalizing your model.
* It is sometimes called hyperparameter optimization. 
* Since this is an optimization problem, we need to use search based methods/strategies to find robust and good (set of) parameters.

scikit-learn provides two simple methods for algorithm parameter tuning:

1. Grid Search Parameter Tuning.
2. Random Search Parameter Tuning.

# 1. Grid Search Parameter Tuning

* Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
* You can perform a grid search using the **GridSearchCV** class: <br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
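The phrase "each combination of algorithm parameters" can be made concrete by enumerating a grid by hand. The sketch below uses only the standard library; the second parameter, `fit_intercept`, is added purely as an assumption to show how the number of combinations multiplies.

```python
from itertools import product

# hypothetical grid: 3 alpha values x 2 fit_intercept settings
param_grid = {
    "alpha": [1, 0.1, 0.01],
    "fit_intercept": [True, False],
}

names = list(param_grid)
# Cartesian product of all value lists, one dict per candidate model
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for params in combinations:
    print(params)
print(len(combinations))  # 6
```

With 5-fold cross-validation, each of these 6 candidates would be fitted 5 times, so the cost of a grid search grows quickly as parameters and values are added.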

# Implementation of GridSearchCV

The example below evaluates different alpha values for the Ridge Regression algorithm on the standard diabetes dataset.

In [99]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
import numpy as np

In [100]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [102]:
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.27610844129292433
1.0


# 2. Random Search Parameter Tuning

* Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations.
* A model is constructed and evaluated for each combination of parameters chosen.
* You can perform a random search for algorithm parameters using the **RandomizedSearchCV** class:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
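The sampling loop behind a random search can be sketched with the standard library alone. Here `toy_score` is a hypothetical stand-in for a cross-validated model score (an assumption; a real search would fit and evaluate a model for each candidate), chosen so that the best value is known to be near 0.3.

```python
import random


def toy_score(alpha):
    """Hypothetical stand-in for a cross-validated score; it peaks at alpha = 0.3."""
    return -(alpha - 0.3) ** 2


rng = random.Random(7)  # fixed seed so the run is repeatable
n_iter = 100

# draw n_iter candidate values uniformly from [0, 1]
candidates = [rng.uniform(0.0, 1.0) for _ in range(n_iter)]

# evaluate every candidate and keep the best one
best_alpha = max(candidates, key=toy_score)
print("Best alpha found:", best_alpha)
print("Score:", toy_score(best_alpha))
```

Unlike a grid, the number of evaluations is fixed by `n_iter` regardless of how many parameters are searched, which is why random search scales better to large parameter spaces.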

# Implementation of RandomizedSearchCV

* The example below evaluates different random alpha values between 0 and 1 for the Ridge Regression algorithm on the standard diabetes dataset.
* A total of 100 iterations are performed with alpha values drawn uniformly at random from the range 0 to 1 (the range searched here; alpha itself can be any non-negative value).

In [103]:
import numpy as np
from pandas import read_csv
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

In [104]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]

In [105]:
param_grid = {'alpha': uniform()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100, random_state=7)
rsearch.fit(X, Y)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

0.27610755734028547
0.9779895119966027


# Summary

* Algorithm parameter tuning is an important step for improving algorithm performance right before presenting results or preparing a system for production.
* In this chapter you discovered algorithm parameter tuning and two methods that you can use right now in Python and scikit-learn to improve your algorithm results:
    1. Grid Search Parameter Tuning
    2. Random Search Parameter Tuning

# Save and Load Machine Learning Models

Finding an accurate machine learning model is not the end of the project.

* Your machine learning model can be saved for future use. Afterwards, it can be loaded when needed.
* This allows you to save your model to file and load it later in order to make predictions.

After completing this lesson you will know:
   1. The importance of serializing models for reuse.
   2. How to use **pickle** to serialize and deserialize machine learning models.
   3. How to use **Joblib** to serialize and deserialize machine learning models. 

# 1. Finalize Your Model with pickle

* Pickle is the standard way of serializing objects in Python.
* You can use the pickle operation to serialize your machine learning algorithms and save the serialized format to a file.
https://docs.python.org/2/library/pickle.html
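The save and load round trip can be tried without training anything. In the sketch below a plain dictionary stands in for a fitted model (an assumption for illustration), and the file is written to a temporary directory using `with` blocks so that the file handles are always closed.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for a fitted model
model_like = {"coef": [0.5, -1.2, 3.0], "intercept": 0.1}

path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")

# serialize to disk
with open(path, "wb") as f:
    pickle.dump(model_like, f)

# ... some time later: deserialize from disk
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_like)  # True
```

Note that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.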

# Implementation of Train and Save a Model Using Pickle

In [107]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pickle import dump
from pickle import load

In [108]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

In [109]:
# fit the model on the training set (67% of the data; 33% is held out for testing)
model = LogisticRegression()
model.fit(X_train, Y_train)

LogisticRegression()

In [110]:
# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    dump(model, f)

# Some Time Later!

In [111]:
# load the model from disk
filename = 'finalized_model.sav'
with open(filename, 'rb') as f:
    loaded_model = load(f)
result = loaded_model.score(X_test, Y_test)
print(result)

0.7874015748031497


# 2. Finalize Your Model with Joblib

* The Joblib library is part of the SciPy ecosystem and provides utilities for pipelining Python jobs.
https://pypi.org/project/joblib/
* It provides utilities for saving and loading Python objects that make use of NumPy data structures, efficiently.
* This can be useful for machine learning models that store many parameters or large NumPy arrays.
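The sketch below shows a Joblib round trip on an object that holds NumPy arrays. The dictionary of weights is a hypothetical stand-in for a fitted model, and `compress=3` is an arbitrary compression level chosen for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# hypothetical stand-in for a model whose parameters are NumPy arrays
weights = {
    "coef": np.arange(6, dtype=float).reshape(2, 3),
    "intercept": np.zeros(2),
}

path = os.path.join(tempfile.mkdtemp(), "weights.joblib")

# joblib accepts a file path directly; compress trades speed for file size
dump(weights, path, compress=3)

restored = load(path)
print(np.array_equal(restored["coef"], weights["coef"]))
```

Passing a path rather than an open file object leaves file handling to Joblib, which avoids leaving a handle open by accident.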

# Implementation of Train and Save a Model Using Joblib

In [114]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from joblib import dump
from joblib import load

In [115]:
# load data
filename = 'diabetes.csv'
df = read_csv(filename)
array = df.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

In [116]:
# fit the model on the training set (67% of the data; 33% is held out for testing)
model = LogisticRegression()
model.fit(X_train, Y_train)

LogisticRegression()

In [117]:
# save the model to disk
filename = 'finalized_model.sav'
dump(model, filename)

# Some Time Later!

In [123]:
# load the model from disk
filename = 'finalized_model.sav'
loaded_model = load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)

0.7874015748031497


<font size="25"><center>Thank you!</center></font>