**Feature selection & Hyperparameter Optimization:**


This session presents [feature selection](https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection), cross-validation, and hyperparameter optimization. The datasets are available here: [nasa.csv](https://drive.google.com/file/d/1JPUOwwIuFbyIWPNyMCWj78S-CIuqb75d/view?usp=sharing) and [life expectancy](https://drive.google.com/file/d/1JbIYKkp6N4C16ixcEe7zDQ536ksYr7Wr/view?usp=sharing).

Import Libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Read the Data:

In [None]:
Forecasting_data = pd.read_csv("/content/life_dexpectency.csv") # life_expectancy dataset

print(Forecasting_data.columns)
Forecasting_data.head(3)

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', 'BMI', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', 'HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years',
       ' thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')


Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9


In [None]:
Forecasting_data.info()
Forecasting_data.dropna(inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1990 entries, 0 to 1989
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          1990 non-null   object 
 1   Year                             1990 non-null   int64  
 2   Status                           1990 non-null   object 
 3   Life expectancy                  1983 non-null   float64
 4   Adult Mortality                  1983 non-null   float64
 5   infant deaths                    1990 non-null   int64  
 6   Alcohol                          1868 non-null   float64
 7   percentage expenditure           1990 non-null   float64
 8   Hepatitis B                      1731 non-null   float64
 9   Measles                          1990 non-null   int64  
 10  BMI                              1985 non-null   float64
 11  under-five deaths                1990 non-null   int64  
 12  Polio               

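Split the data into train and test subsets. Note: time-series data should not be randomly subsampled, because a random split lets the model train on observations that are later in time than those it is tested on.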

In [None]:
X = Forecasting_data.iloc[:, 4:]
y = Forecasting_data.iloc[:, 3]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)

X_train.shape

(1184, 18)
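The note about time-series data can be illustrated with a chronological split. The following is only a sketch on a hypothetical toy frame; the column names and the `cutoff_year` value are assumptions chosen for illustration, not part of the notebook's dataset handling.

```python
import pandas as pd

# Hypothetical toy frame standing in for a country-year panel
toy = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2013, 2014, 2015, 2012, 2015],
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 3.5, 6.5],
    "target": [60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 62.5, 65.5],
})

cutoff_year = 2013  # assumed cutoff: train on the past, test on the future

# Every training row precedes every test row in time
train = toy[toy["Year"] <= cutoff_year]
test = toy[toy["Year"] > cutoff_year]

print(train.shape, test.shape)
```

With this kind of split the model is never evaluated on years it has effectively already seen.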

Train linear regression model:

In [None]:
# create linear regression object
linear_reg = linear_model.LinearRegression(fit_intercept=True)
# fit the linear regression model to your training data
linear_reg.fit(X_train, y_train)
print(f"intercept:{linear_reg.intercept_}\n")
print(f"Estimated coefficients:{linear_reg.coef_}")

intercept:53.294539125871935

Estimated coefficients:[-1.78858020e-02  9.44143574e-02 -5.59347397e-02  4.95675263e-04
 -4.97152971e-03 -1.52782559e-05  2.23746537e-02 -6.94784460e-02
  3.28862607e-03  6.85316819e-02  1.77698109e-02 -4.26957277e-01
 -1.06982534e-05  8.69117744e-11 -3.79821240e-03 -6.51336795e-02
  1.11031486e+01  8.72767720e-01]


Linear Regression Equation:   $\hat{y} = \beta_{0} + \beta_{1} X_{1} + \dots + \beta_{k} X_{k}$
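To connect the equation to the fitted attributes, the sketch below applies it by hand using `intercept_` and `coef_` and compares the result with `predict`. The data is synthetic and noise-free, and names such as `X_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): two features and a known linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]

model = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# Apply the regression equation by hand: intercept plus weighted sum of features
manual_pred = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual_pred, model.predict(X_demo)))
```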

Test the prediction performance:

In [None]:
y_pred = linear_reg.predict(X_test)

MSE = metrics.mean_squared_error(y_test, y_pred)
print(f"MSE: {MSE}")
MAE= metrics.mean_absolute_error(y_test, y_pred)
print(f"MAE: {MAE}")
MAPE = metrics.mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE: {MAPE}")

MSE: 13.175918026295063
MAE: 2.7992174550629
MAPE: 0.04283894398151048
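For reference, the three metrics can be written out directly with NumPy. This is a minimal sketch on made-up numbers. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the value of about 0.043 printed above corresponds to roughly 4.3%.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([50.0, 60.0, 80.0])
y_hat = np.array([55.0, 57.0, 80.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)                # mean squared error
mae = np.mean(np.abs(errors))             # mean absolute error
mape = np.mean(np.abs(errors / y_true))   # mean absolute percentage error, as a fraction

print(mse, mae, mape)
```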


## Feature Selection

scikit-learn `feature_selection` module [documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)

In [None]:
from sklearn.feature_selection import f_regression # univariate F-test: one F-statistic and p-value per feature

f_statistic, p_values = f_regression(X_train,y_train)
print(p_values)

[1.86633890e-191 1.57319890e-018 3.15818733e-051 2.98277941e-048
 4.34514986e-010 2.61547353e-001 6.09906870e-085 1.69915781e-023
 3.25860277e-029 9.72463475e-007 3.39026811e-035 6.90183508e-113
 7.95954416e-055 6.17791665e-001 8.24214009e-073 5.61795254e-074
 1.61587644e-205 2.75270306e-192]
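A raw array of p-values is hard to read without the feature names. One possible way to pair them, sketched here on synthetic data with hypothetical column names (`signal`, `noise`) and an assumed 0.05 threshold:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data: one informative column and one pure-noise column
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
target = 3.0 * demo["signal"] + rng.normal(scale=0.5, size=n)

_, p_vals = f_regression(demo, target)

# Label each p-value with its column name and sort from most to least significant
p_series = pd.Series(p_vals, index=demo.columns).sort_values()
significant = p_series[p_series < 0.05]  # assumed significance threshold

print(p_series)
```

The same pattern would apply to the notebook's data by using `X_train.columns` as the index.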


See the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) documentation for other score functions.

In [None]:
from sklearn.feature_selection import SelectKBest

Selected = SelectKBest(f_regression, k=8).fit(X_train, y_train)  # Select features based on the k highest scores.
print(f"selected features are: {Selected.get_feature_names_out()}")
X_selected = Selected.transform(X_train) # reduces X_train to the selected features.
X_selected.shape

selected features are: ['Adult Mortality' 'BMI' 'HIV/AIDS' 'GDP' ' thinness  1-19 years'
 ' thinness 5-9 years' 'Income composition of resources' 'Schooling']


(1184, 8)

In [None]:
X_test_new = Selected.transform(X_test)
print(X_test_new[:2])
X_test.head(2)

[[1.83000000e+02 1.92000000e+01 1.00000000e-01 2.16681584e+02
  1.53000000e+01 1.54000000e+01 7.35000000e-01 1.34000000e+01]
 [3.66000000e+02 2.84000000e+01 3.70000000e+00 1.44114170e+03
  5.70000000e+00 5.70000000e+00 5.07000000e-01 1.04000000e+01]]


Unnamed: 0,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
1695,183.0,4,2.05,13.390922,97.0,21,19.2,4,97.0,3.37,97.0,0.1,216.681584,19968.0,15.3,15.4,0.735,13.4
355,366.0,47,0.01,61.392636,87.0,831,28.4,71,86.0,4.1,87.0,3.7,1441.1417,2223994.0,5.7,5.7,0.507,10.4


Evaluate the new model's performance on the test set:

In [None]:
linear_reg = linear_model.LinearRegression(fit_intercept=True)
# fit the linear regression model to your training data
linear_reg.fit(X_selected, y_train)
y_pred = linear_reg.predict(X_test_new)

# using sklearn metrics
MSE = metrics.mean_squared_error(y_test, y_pred)
print(f"MSE_SelectKBest: {MSE}")
MAE = metrics.mean_absolute_error(y_test, y_pred)
print(f"MAE_SelectKBest: {MAE}")
MAPE = metrics.mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE_SelectKBest: {MAPE}")

MSE_SelectKBest: 15.468017319518415
MAE_SelectKBest: 2.9559667847514493
MAPE_SelectKBest: 0.04580735749315981
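The value `k=8` above is fixed by hand. Since `k` is itself a hyperparameter, one option is to tune it by cross-validation inside a pipeline. The sketch below uses synthetic data from `make_regression`; the step names (`select`, `model`) and the candidate grid are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression problem: 10 features, 4 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# Feature selection is refit inside every CV fold, so no information leaks
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

k_grid = {"select__k": [2, 4, 6, 8, 10]}  # assumed candidate values
search = GridSearchCV(pipe, param_grid=k_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
```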


### SequentialFeatureSelector
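Forward or backward selection: features are added (forward) or removed (backward) one at a time, keeping at each step the change that gives the best cross-validated score.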

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

Selected_2 = SequentialFeatureSelector(estimator = linear_model.LinearRegression(fit_intercept=True), # or knn, svr, etc.
                                       n_features_to_select='auto', tol=0.05, direction='forward',
                                       scoring='r2', cv=5)
X_selected = Selected_2.fit_transform(X_train, y_train)
print(f"selected features are: {Selected_2.get_feature_names_out()}")

X_selected.shape

selected features are: ['Adult Mortality' 'HIV/AIDS' 'Income composition of resources']


(1184, 3)

In [None]:
linear_reg.fit(X_selected, y_train)
X_test_new = Selected_2.transform(X_test)
y_pred = linear_reg.predict(X_test_new)

# Performance metrics
MSE = metrics.mean_squared_error(y_test, y_pred)
MAE = metrics.mean_absolute_error(y_test, y_pred)
MAPE = metrics.mean_absolute_percentage_error(y_test, y_pred)
print(f"MSE_SFS: {MSE}\n MAE_SFS: {MAE}\n MAPE_SFS: {MAPE}")


MSE_SFS: 20.24622215635919
 MAE_SFS: 3.2011462624641185
 MAPE_SFS: 0.04980794117981581
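Only the forward direction is run above. A backward run looks almost identical; the following is a sketch on synthetic data in which only columns 0 and 3 drive the target, with a fixed `n_features_to_select=2` chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 5 features, only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Start from all features and drop the least useful one at each step
backward = SequentialFeatureSelector(
    estimator=LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    scoring="r2",
    cv=5,
)
X_reduced = backward.fit_transform(X_demo, y_demo)

print(backward.get_support())  # boolean mask of the kept columns
```

Backward selection fits more models when many features are kept out, so forward selection is often cheaper when only a few features are expected to matter.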


### Regularization, Lasso

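Lasso is a linear model trained with an L1 penalty as regularizer. The penalty can shrink some coefficients exactly to zero, so the fitted model also acts as a feature selector. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).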

In [None]:
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=1.2, fit_intercept=True, tol=0.0001, random_state=123, max_iter=1500)
lasso_model.fit(X_train, y_train)
y_pred = lasso_model.predict(X_test)

# Performance metrics
MSE = metrics.mean_squared_error(y_test, y_pred)
MAE = metrics.mean_absolute_error(y_test, y_pred)
MAPE = metrics.mean_absolute_percentage_error(y_test, y_pred)
print(f"MSE_Lasso: {MSE}\n MAE_Lasso: {MAE}\n MAPE_Lasso: {MAPE}")

print(lasso_model.coef_)

MSE_Lasso: 15.453609600801201
 MAE_Lasso: 3.0071221235936534
 MAPE_Lasso: 0.046216751832338206
[-2.31285973e-02  5.99964350e-02  0.00000000e+00  2.53594871e-04
 -3.92315289e-03  8.89313421e-06  4.82041317e-02 -4.69250701e-02
  5.09100567e-03  0.00000000e+00  2.53532539e-02 -3.80059011e-01
  5.64003707e-05  5.79532111e-09 -2.43369555e-02 -5.83563858e-03
  0.00000000e+00  1.00526063e+00]
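The zeros in the coefficient vector above mark features that Lasso has dropped. A small sketch of reading off the kept features, using synthetic data in which all columns share the same scale and only column 0 matters (the `alpha` value is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 6 standard-normal features, only column 0 drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Indices of features whose coefficient survived the L1 penalty
kept = np.flatnonzero(lasso_demo.coef_)

print(kept, lasso_demo.coef_)
```

Because the penalty acts on coefficient size, features on very different scales are penalized unevenly; standardizing them first is a common precaution.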


### Decision Tree Feature Importance

The feature importances from decision trees can also be used as a feature selection step before fitting any estimator. Please refer to the decision tree notebook for more details.
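As a rough sketch of the idea (not taken from the decision tree notebook), scikit-learn's `SelectFromModel` can wrap a tree and keep the features whose importance exceeds a threshold. The data is synthetic, and `max_depth=3` and `threshold="mean"` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 4 features, only column 1 influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 10.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Keep features whose importance is above the mean importance
selector = SelectFromModel(
    DecisionTreeRegressor(max_depth=3, random_state=0),
    threshold="mean",
)
X_reduced = selector.fit_transform(X_demo, y_demo)

print(selector.estimator_.feature_importances_)
print(selector.get_support())
```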

## Cross-validation: Logistic Regression

### Read the Data:

In [None]:
# Nasa Dataset
classification_data = pd.read_csv("/content/nasa.csv")
classification_data.head(3)

Unnamed: 0,Name,Eccentricity,Inclination,Hazardous,Neo Reference ID,Absolute Magnitude,Est Dia in Feet(min),Est Dia in Feet(max),Close Approach Date,Epoch Date Close Approach,...,Jupiter Tisserand Invariant,Semi Major Axis,Asc Node Longitude,Orbital Period,Perihelion Distance,Perihelion Arg,Aphelion Dist,Perihelion Time,Mean Anomaly,Mean Motion
0,3723955,0.351674,28.412996,0,3723955,21.3,479.22562,1071.581063,12784,789000000000.0,...,5.457,1.107776,136.717242,425.869294,0.7182,313.091975,1.497352,2457794.969,173.741112,0.84533
1,2446862,0.348248,4.237961,1,2446862,20.3,759.521423,1698.341531,12791,790000000000.0,...,4.557,1.458824,259.475979,643.580228,0.950791,248.415038,1.966857,2458120.468,292.893654,0.559371
2,3671135,0.563441,17.927751,0,3671135,19.6,1048.43142,2344.363926,12798,790000000000.0,...,4.724,1.323532,178.971951,556.160556,0.5778,198.145969,2.069265,2458009.403,354.237368,0.647295


### Randomly split the data into train and test subsets

In [None]:
from sklearn.model_selection import train_test_split

X_c = classification_data.drop(["Name", "Hazardous"], axis = 1)
y_c = classification_data["Hazardous"]

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size = 0.2, random_state = 123)

print(f"Shape of X_train: {X_train_c.shape}")

Shape of X_train: (369, 25)


The scoring options accepted by scikit-learn's cross-validation functions are documented [here](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).

In [None]:
from sklearn.model_selection import cross_val_score

# create classifier object
clf = linear_model.LogisticRegression()
# fit classifier with 5-fold cross-validation
acc_scores = cross_val_score(clf, X_train_c, y_train_c, cv=5, scoring="accuracy")
print(f"accuracies:{acc_scores}")
print("%0.3f accuracy with a standard deviation of %0.3f" % (acc_scores.mean(), acc_scores.std()))

accuracies:[0.54054054 0.54054054 0.54054054 0.54054054 0.54794521]
0.542 accuracy with a standard deviation of 0.003
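The fold accuracies above are nearly identical, which may mean the classifier is predicting a single class. Unscaled features are one possible cause for logistic regression. The sketch below, on synthetic data from `make_classification`, shows how scaling could be placed inside the cross-validation loop with a pipeline; the step names and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fit on the training part of each fold only
scaled_clf = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(scaled_clf, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"{scores.mean():.3f} accuracy with a standard deviation of {scores.std():.3f}")
```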


## Hyperparameter Tuning

*   Random Search
*   Grid Search
*   Bayesian Optimization

Random Search:

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Set the number of random iterations
n_iter = 10

# define the hyperparameter domain
params = {
    'n_neighbors': range(1, 10),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
}

knn = KNeighborsClassifier()
# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=knn, param_distributions=params, n_iter=n_iter, cv=5, scoring='accuracy')

random_search.fit(X_train_c, y_train_c)


In [None]:
# Access the best hyperparameters and score:
best_params = random_search.best_params_
best_score = random_search.best_score_

print("Best hyperparameters:", best_params)
print("Best score:", best_score)


Best hyperparameters: {'weights': 'uniform', 'n_neighbors': 4, 'metric': 'manhattan'}
Best score: 0.6177341725286931
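The best score above is a cross-validated score on the training data. A sketch of two possible follow-ups, on synthetic data: sampling `n_neighbors` from a `scipy.stats.randint` distribution instead of a fixed range, fixing `random_state` for reproducibility, and scoring the refitted best model on a held-out test set.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

param_dist = {
    "n_neighbors": randint(1, 10),      # integers 1..9, upper bound excluded
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=param_dist,
                            n_iter=10, cv=5, scoring="accuracy", random_state=0)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training set with the best parameters
test_accuracy = search.best_estimator_.score(X_te, y_te)

print(search.best_params_, test_accuracy)
```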


Grid Search:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1],
    'kernel': ['rbf'],
}

svc = SVC()
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator= svc, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train_c, y_train_c)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [None]:
# Access the best hyperparameters and score:
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best hyperparameters:", best_params)
print("Best score:", best_score)


Best hyperparameters: {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
Best score: 0.5420214735283229
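Beyond the single best score, `cv_results_` holds the score of every candidate in the grid. The sketch below, on synthetic data, also places a `StandardScaler` in front of the SVC, since RBF kernels tend to be sensitive to feature scale; the pipeline step names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# One row per candidate (3 x 2 = 6), best candidate first
columns = ["param_svc__C", "param_svc__gamma", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(results)
```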


Bayesian Optimization:

hyperopt [GitHub](https://github.com/hyperopt/hyperopt)

* Bayesian optimization requires defining the objective function specific to your model and metric.

* Hyperopt offers several search algorithms, such as `tpe.suggest` (Tree-structured Parzen Estimator) and `rand.suggest` (random search); which one works best depends on the problem.

In [None]:
!pip install hyperopt
from hyperopt import hp



In [None]:
train_ratio = 0.65
validation_ratio = 0.2
test_ratio = 0.15

# train is now 65% of the entire data set
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=1 - train_ratio)

# test is now 15% of the initial data set
# validation is now 20% of the initial data set
X_val_c, X_test_c, y_val_c, y_test_c = train_test_split(X_test_c, y_test_c, test_size=test_ratio/(test_ratio + validation_ratio))

print(f"Shape of partitions:\n X_train: {X_train_c.shape}, \n X_val: {X_val_c.shape},\n X_test: {X_test_c.shape}")

Shape of partitions:
 X_train: (300, 25), 
 X_val: (92, 25),
 X_test: (70, 25)
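The partition sizes above can be checked by hand. This sketch assumes the total of 462 rows (inferred from the printed shapes, 300 + 92 + 70) and assumes that the held-out count in each split is rounded up.

```python
import math

n_total = 462  # inferred from the printed shapes: 300 + 92 + 70
train_ratio, validation_ratio, test_ratio = 0.65, 0.20, 0.15

# First split: hold out everything that is not training data
n_holdout = math.ceil((1 - train_ratio) * n_total)
n_train = n_total - n_holdout

# Second split: the test share is expressed relative to the held-out part
relative_test_size = test_ratio / (test_ratio + validation_ratio)
n_test = math.ceil(relative_test_size * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

The second `test_size` is about 0.43 rather than 0.15 because it is a fraction of the 35% that was held out, not of the full dataset.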


In [None]:
# For a KNN model
search_space = {
    'n_neighbors': hp.choice('n_neighbors', range(1, 20)),
    'weights': hp.choice('weights', ['uniform', 'distance']),
    'metric': hp.choice('metric', ['euclidean', 'manhattan', 'minkowski']),
}


def objective_function(params):
    # Train the model with the provided hyperparameters

    clf = KNeighborsClassifier(n_neighbors=params['n_neighbors'],
                               weights=params['weights'],
                               metric=params['metric'])

    clf.fit(X_train_c, y_train_c)
    y_pred = clf.predict(X_val_c)

    # Evaluate model performance
    score = metrics.accuracy_score(y_val_c, y_pred)

    # Negate the accuracy because we want to minimize the objective
    return -score

In [None]:
# Import the minimizer and the TPE search algorithm
from hyperopt import fmin, tpe

# Set the number of iterations (trials)
max_evals = 50

# Run the search; fmin returns the best assignment found.
# For hp.choice parameters these are indices into the option lists, not the values themselves.
optimizer = fmin(fn=objective_function, space=search_space, algo= tpe.suggest, max_evals=max_evals)
best_params = optimizer
print("Best hyperparameters:", best_params)

100%|██████████| 50/50 [00:00<00:00, 81.39trial/s, best loss: -0.717391304347826]
Best hyperparameters: {'metric': 1, 'n_neighbors': 4, 'weights': 0}
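The values printed above are indices into the `hp.choice` option lists rather than the hyperparameter values themselves. Hyperopt provides `space_eval(search_space, best_params)` to map them back. The plain-Python sketch below does the same lookup by hand for the printed result, with the option lists copied from the search space defined earlier.

```python
# Option lists, in the same order as in the hp.choice definitions
options = {
    "n_neighbors": list(range(1, 20)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
}

# Indices as printed by fmin in the run above
best_indices = {"metric": 1, "n_neighbors": 4, "weights": 0}

# Look up each index in its option list to recover the actual value
decoded = {name: options[name][idx] for name, idx in best_indices.items()}

print(decoded)
```

In particular, index 4 for `n_neighbors` corresponds to 5 neighbors, because the option list starts at 1.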
