## Feature Selection and Dimensionality Reduction (Without Categorical Data in Features)
### Dr. Robert G. de Luna, PECE

## Feature Selection


**Feature Selection** is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

### Three benefits of performing Feature Selection before modeling your data are:

**1. Reduces Overfitting:** Less redundant data means less opportunity to make decisions based on noise.

**2. Improves Accuracy:** Less misleading data means modeling accuracy improves.

**3. Reduces Training Time:** Less data means that algorithms train faster.
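
The effect of irrelevant features can be illustrated with a small sketch. It does not use the diabetes dataset: the data is synthetic, and the names y_demo, X_informative, and X_with_noise are made up for this illustration. A K Nearest Neighbors model is scored once on two informative features and once after 100 pure-noise features are added.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration, not the diabetes dataset)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 200)

# Two informative features whose means differ by 3 between the classes
X_informative = rng.normal(size=(400, 2)) + 3 * y_demo[:, None]

# One hundred irrelevant features made of pure noise
X_noise = rng.normal(scale=3.0, size=(400, 100))
X_with_noise = np.hstack([X_informative, X_noise])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

acc_informative = cross_val_score(knn, X_informative, y_demo, cv=cv, scoring='accuracy').mean()
acc_with_noise = cross_val_score(knn, X_with_noise, y_demo, cv=cv, scoring='accuracy').mean()

print('Accuracy with informative features only: %.3f' % acc_informative)
print('Accuracy with 100 noise features added: %.3f' % acc_with_noise)
```

Distance-based models are used here because noise columns distort distances directly; other models may degrade less.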

TO CHECK THE VERSION OF LIBRARIES

In [None]:
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

TO IMPORT LIBRARIES

In [None]:
import numpy as np
import matplotlib.pyplot as plot
import pandas as pd

# To allow plots to appear within the notebook
%matplotlib inline

TO LOAD THE DATASET

In [None]:
dataset = pd.read_csv('diabetes.csv')
dataset

TO DETERMINE THE DIMENSIONS OF THE DATASET

In [None]:
print(dataset.shape)

TO PEEK AT THE DATA

In [None]:
print(dataset.head(20))

TO SEE THE STATISTICAL SUMMARY

In [None]:
print(dataset.describe())

TO SEE THE CLASS DISTRIBUTION

In [None]:
print(dataset.groupby('Outcome').size())

TO SHOW THE UNIVARIATE PLOT (BOX and WHISKER PLOTS)

In [None]:
dataset.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plot.show()

TO SHOW THE HISTOGRAM FOR THE DISTRIBUTION

In [None]:
dataset.hist()
plot.show()

FOR THE MULTIVARIATE PLOT

In [None]:
# For the Scatter Plot Matrix
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
plot.show()

TO CREATE THE MATRIX OF INDEPENDENT VARIABLE, X

In [None]:
X = dataset.iloc[:, 0:8].values
X

TO CREATE THE MATRIX OF DEPENDENT VARIABLE, Y

In [None]:
Y = dataset.iloc[:,8].values
Y

## To Create Machine Learning Models with K-Fold Cross Validation

#### 1. USING LOGISTIC REGRESSION

In [None]:
# To Import the Logistic Regression Model
from sklearn.linear_model import LogisticRegression

# To Instantiate the Model (Default Parameters Except max_iter, Raised So the Solver Can Converge)
logistic_regression = LogisticRegression(max_iter=100000, random_state=0)

In [None]:
# To Apply K-fold Cross Validation for the Logistic Regression Model Performance
from sklearn.model_selection import StratifiedKFold
k_Fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=logistic_regression, X=X, y=Y, cv=k_Fold, scoring='accuracy')
accuracies_average = accuracies.mean()
accuracies_deviation = accuracies.std()
print("ACCURACIES IN K-FOLDS:")
print(accuracies)
print('')
print("AVERAGE ACCURACY OF K-FOLDS:")
print(accuracies_average)
print('')
print("ACCURACY DEVIATION OF K-FOLDS:")
print(accuracies_deviation)
print('')
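
The stratified splitting used above can be inspected on its own. The sketch below uses toy labels (y_toy and X_toy are made-up names, not part of the diabetes data) to show that StratifiedKFold keeps the class proportions the same in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 70 negatives and 30 positives
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((100, 1))  # the feature values do not affect how the folds are built

splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_toy[test_index].sum())
    for _, test_index in splitter.split(X_toy, y_toy)
]
print('Positives in each test fold:', positives_per_fold)
```

With 30 positives spread over 10 folds of 10 samples, every test fold holds 3 positives, which matches the 30% share in the full label vector.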

#### 2. USING K NEAREST NEIGHBORS WITH K = 5

In [None]:
# To Import the K Nearest Neighbors Model
from sklearn.neighbors import KNeighborsClassifier

# To Instantiate the Model (Using the Default Parameters)
k_nearest_neighbors = KNeighborsClassifier(n_neighbors=5)

In [None]:
# To Apply K-fold Cross Validation for the K Nearest Neighbors Model Performance
from sklearn.model_selection import StratifiedKFold
k_Fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=k_nearest_neighbors, X=X, y=Y, cv=k_Fold, scoring='accuracy')
accuracies_average = accuracies.mean()
accuracies_deviation = accuracies.std()
print("ACCURACIES IN K-FOLDS:")
print(accuracies)
print('')
print("AVERAGE ACCURACY OF K-FOLDS:")
print(accuracies_average)
print('')
print("ACCURACY DEVIATION OF K-FOLDS:")
print(accuracies_deviation)
print('')

#### 3. USING SUPPORT VECTOR MACHINE

In [None]:
# To Import the Support Vector Machine Model
from sklearn.svm import SVC

# To Instantiate the Model (Using Mostly Default Parameters; 'rbf' Is Already the Default Kernel)
support_vector_machine = SVC(kernel='rbf', random_state=0)

In [None]:
# To Apply K-fold Cross Validation for the Support Vector Machine Model Performance
from sklearn.model_selection import StratifiedKFold
k_Fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=support_vector_machine, X=X, y=Y, cv=k_Fold, scoring='accuracy')
accuracies_average = accuracies.mean()
accuracies_deviation = accuracies.std()
print("ACCURACIES IN K-FOLDS:")
print(accuracies)
print('')
print("AVERAGE ACCURACY OF K-FOLDS:")
print(accuracies_average)
print('')
print("ACCURACY DEVIATION OF K-FOLDS:")
print(accuracies_deviation)
print('')
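
The three evaluation cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do it; the function name summarize_cv and the synthetic stand-in data (X_demo, y_demo) are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def summarize_cv(model, X_data, y_data, n_splits=10, seed=0):
    """Return the mean and standard deviation of stratified k-fold accuracy."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(estimator=model, X=X_data, y=y_data, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()


# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

models = {
    'Logistic Regression': LogisticRegression(max_iter=100000, random_state=0),
    'K Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Support Vector Machine': SVC(kernel='rbf', random_state=0),
}

results = {name: summarize_cv(model, X_demo, y_demo) for name, model in models.items()}
for name, (mean_accuracy, deviation) in results.items():
    print('%s: %.3f (+/- %.3f)' % (name, mean_accuracy, deviation))
```

To use it on the diabetes data, pass the X and Y arrays created earlier in place of X_demo and y_demo.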

## To Perform Different Feature Selection Methods

#### A. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the Chi-squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

In [None]:
# To Import the Class of SelectKBest and chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# For the List of Features
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
print('List of Features:')
print(features)
print('')

# To Perform Feature Selection with SelectKBest
selection_method_skb = SelectKBest(score_func=chi2, k=4)
selection_fit_skb = selection_method_skb.fit(X, Y)

# To Show the Results of Feature Selection
selection_scores = selection_fit_skb.scores_
print('Selection Scores: %s' % selection_scores)
print('')

print('Features and the Selection Scores')
print(list(zip(features, selection_scores)))

#print('Summary of the Selected Features')
#selected_features_skb = selection_fit_skb.transform(X)
#print(selected_features_skb)
#print('')

feat_importances = pd.Series(selection_scores, index=features)
feat_importances = feat_importances.nlargest(8)
feat_importances.plot(kind='barh')

NOTE: You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores):
   
Insulin, Glucose, Age, and BMI
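
To make the chi-squared score less of a black box, the sketch below computes it by hand and compares it with the scores from SelectKBest. It then uses get_support to map the selection back to feature names. The count data is synthetic, and the names demo_features, X_demo, and y_demo are made up for this illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic count data (an assumption for illustration, not the diabetes dataset)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
demo_features = ['signal_a', 'noise_a', 'signal_b', 'noise_b']
X_demo = np.column_stack([
    rng.poisson(5 + 20 * y_demo),   # strongly tied to the class
    rng.poisson(5, size=200),       # unrelated to the class
    rng.poisson(10 + 10 * y_demo),  # moderately tied to the class
    rng.poisson(5, size=200),       # unrelated to the class
])

# Chi-squared by hand: compare each feature's per-class total with the total
# expected if the feature were independent of the class
observed = np.array([X_demo[y_demo == label].sum(axis=0) for label in (0, 1)])
class_share = np.array([(y_demo == label).mean() for label in (0, 1)])
expected = np.outer(class_share, X_demo.sum(axis=0))
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
print('Manual scores: ', manual_scores)
print('sklearn scores:', selector.scores_)

# get_support() returns a boolean mask that maps the selection back to names
selected_names = [name for name, keep in zip(demo_features, selector.get_support()) if keep]
print('Selected features:', selected_names)
```

Because the score is built from feature totals, features measured on larger numeric scales tend to receive larger scores, which is worth keeping in mind when reading the scores above.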

#### B. Recursive Feature Elimination

The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the fitted model's feature weights (its "coef_" or "feature_importances_" attribute) to identify which attributes contribute the least to predicting the target attribute, and removes them step by step.

The example below uses RFE with the logistic regression algorithm to select the top 4 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

In [None]:
# To Import the Class of RFE
from sklearn.feature_selection import RFE

# For the List of Features
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
print('List of Features:')
print(features)
print('')

# To Perform Feature Selection with RFE using Logistic Regression as the Model
selection_method_rfe = RFE(estimator=logistic_regression, n_features_to_select=4)
# Note: RFE needs an estimator that exposes "coef_" or "feature_importances_" after fitting; Logistic Regression does, while KNN and the RBF-kernel SVM used above do not
selection_fit_rfe = selection_method_rfe.fit(X, Y)

# To Show the Results of Feature Selection
number_features = selection_fit_rfe.n_features_
selected_features = selection_fit_rfe.support_
features_ranking = selection_fit_rfe.ranking_

print("Number of Features: %s" % number_features)
print("Selected Features: %s" % selected_features)
print("Feature's Ranking: %s" % features_ranking)
print('')

print('Features, Selected Features, and the Ranking Score:')
print(list(zip(features, selected_features, features_ranking)))

# To Plot the Ranking (Rank 1 Marks a Selected Feature; Larger Ranks Were Eliminated Earlier)
feature_ranks = pd.Series(features_ranking, index=features)
feature_ranks = feature_ranks.nlargest(8)
feature_ranks.plot(kind='barh')

NOTE: You can see that using the logistic regression model as estimator, the 4 attributes chosen are:

Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction
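
The elimination loop behind RFE can be sketched by hand. In scikit-learn, RFE refits the estimator on the remaining features and drops the one with the smallest weight until the requested number is left. The sketch below reproduces that loop on synthetic data, where only columns 0 and 3 drive the label, and compares it with RFE itself. The data and the names X_demo, y_demo, and remaining are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): only columns 0 and 3 drive the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = (3 * X_demo[:, 0] - 3 * X_demo[:, 3] + rng.normal(size=300) > 0).astype(int)

# Manual elimination loop: refit, then drop the weakest remaining feature
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 2:
    estimator = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(estimator.coef_[0])))
    del remaining[weakest]

# The same selection through scikit-learn's RFE
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_demo, y_demo)
rfe_selected = np.flatnonzero(rfe.support_).tolist()

print('Manual loop keeps columns:', remaining)
print('RFE keeps columns:        ', rfe_selected)
```

Since the ranking rests on coefficient size, features on very different numeric scales are not directly comparable; standardizing the inputs first is a common precaution.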

#### C. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the example below, we use PCA and select 4 principal components.

In [None]:
# To Import the Class of PCA
from sklearn.decomposition import PCA

# For the List of Features
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
print('List of Features:')
print(features)
print('')

# To Perform Dimensionality Reduction with PCA
selection_method_pca = PCA(n_components=4)
selection_fit_pca = selection_method_pca.fit(X)

# To Summarize the Principal Components
explained_variance = selection_fit_pca.explained_variance_ratio_
print("Explained Variance: %s" % explained_variance)
print('')

print("Principal Components (Rows Are Components, Columns Are Original Features):")
components = selection_fit_pca.components_
print(components)

You can see that the 4 principal components, each a linear combination of the original features, bear little resemblance to the source data.
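
The cell above only fits PCA and prints the component vectors. A minimal sketch of the reduction step itself follows, using synthetic data (X_demo is a made-up stand-in with 8 columns, not the diabetes data). It shows the reduced shape, the cumulative explained variance, and that the transform is a centered projection onto the principal axes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 8 columns with very different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8)) * np.array([50, 20, 10, 5, 2, 1, 0.5, 0.1])

pca = PCA(n_components=4).fit(X_demo)
X_reduced = pca.transform(X_demo)
print('Shape before and after:', X_demo.shape, X_reduced.shape)
print('Cumulative explained variance:', np.cumsum(pca.explained_variance_ratio_))

# Each new column is the centered data projected onto one principal axis
manual_projection = (X_demo - pca.mean_) @ pca.components_.T
print('Matches transform():', np.allclose(X_reduced, manual_projection))
```

Columns with a large spread dominate the first components, so features on different scales are often standardized before PCA.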

#### D. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the diabetes dataset.

In [None]:
# To Import Class ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier

# For the List of Features
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
print('List of Features:')
print(features)
print('')

# To Perform Feature Importance with Extra Trees Classifier
model = ExtraTreesClassifier(random_state=0)
model.fit(X, Y)

# To Show the Results of Feature Importance
importance = model.feature_importances_
print('Importance Score: %s' % importance)
print('')

print('Features and the Importance Score:')
print(list(zip(features, importance)))

feat_importances = pd.Series(importance, index=features)
feat_importances = feat_importances.nlargest(8)
feat_importances.plot(kind='barh')

You can see that we are given an importance score for each attribute where the larger scores are the more important attributes. The scores suggest the importance of:

Glucose, Age, BMI, and DiabetesPedigreeFunction
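
As a rough illustration of how these scores behave, the sketch below fits an ExtraTreesClassifier on synthetic data in which only two columns determine the label. The data and the names demo_features, X_demo, y_demo, and forest are assumptions for illustration. The scores come from the feature_importances_ attribute and are normalized to sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption for illustration): only f1 and f4 determine the label
rng = np.random.default_rng(0)
demo_features = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 1] + X_demo[:, 4] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Rank the features from most to least important
ranked = pd.Series(forest.feature_importances_, index=demo_features).sort_values(ascending=False)
print(ranked)
print('Sum of importances: %.3f' % ranked.sum())

top_two = sorted(ranked.index[:2])
print('Top two features:', top_two)
```

Because the scores are relative shares, they show which features matter more than others within one fitted model, not how accurate that model is.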

## Correlation Matrix with Heatmap is Applicable Only to Determine Correlation Between Features

Correlation states how the features are related to each other or to the target variable.

Correlation can be positive (an increase in the value of a feature increases the value of the target variable) or negative (an increase in the value of a feature decreases the value of the target variable).

A heatmap makes it easy to identify which features are most related to the target variable. We will plot a heatmap of the correlated features using the seaborn library.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
# For the List of Features
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
print('List of Features:')
print(features)
print('')

# To Compute the Correlation Between Each Pair of Columns in the Dataset
correlation_matrix = dataset.corr()
plot.figure(figsize=(20,20))

# To Plot the Heatmap
g = sns.heatmap(correlation_matrix, annot=True, cmap="RdYlGn")

You can see that all features have a low correlation with the outcome. That is, the correlation matrix is not well suited to the classification task, since the outputs in classification are discrete values.
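
The same correlation matrix can also be read numerically instead of visually. The sketch below ranks features by their absolute correlation with the target and lists feature pairs that are nearly duplicates of each other. The DataFrame is synthetic, and the column names and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the diabetes dataset)
rng = np.random.default_rng(0)
n_rows = 300
base = rng.normal(size=n_rows)
frame = pd.DataFrame({
    'feature_a': base,
    'feature_b': base + rng.normal(scale=0.1, size=n_rows),  # near-duplicate of feature_a
    'feature_c': rng.normal(size=n_rows),
})
frame['Outcome'] = (frame['feature_c'] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)

correlations = frame.corr()

# Absolute correlation of every feature with the target, strongest first
target_correlation = correlations['Outcome'].drop('Outcome').abs().sort_values(ascending=False)
print(target_correlation)

# Feature pairs whose absolute correlation exceeds a hypothetical threshold
threshold = 0.9
feature_names = [name for name in frame.columns if name != 'Outcome']
redundant_pairs = [
    (first, second)
    for i, first in enumerate(feature_names)
    for second in feature_names[i + 1:]
    if abs(correlations.loc[first, second]) > threshold
]
print('Highly correlated feature pairs:', redundant_pairs)
```

On the diabetes data, the same two steps can be applied to dataset in place of frame; a pair flagged as highly correlated suggests that one of the two features may be redundant.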

###### Dr. Robert G. de Luna, PECE
rgdeluna@pup.edu.ph