# Feature Selection

# 1. Introduction

Feature selection is a crucial step in data science that involves choosing the most relevant and informative features from a dataset to improve the performance of a machine learning model and reduce overfitting. There are various methods for feature selection, each with its advantages and use cases. Here are some common methods:

### 1. Univariate Feature Selection:
- Univariate feature selection methods score each feature independently against the target variable and rank the features by that score. They apply to both classification and regression, provided the scoring function matches the task. Common scoring functions include chi-squared, the ANOVA F-value, and mutual information.
- SelectKBest: This method selects the top K features with the highest scores from a given statistical test. For instance, the chi-squared test (`chi2`) and the ANOVA F-test (`f_classif`) suit categorical target variables (chi-squared also requires non-negative feature values), while the regression F-test (`f_regression`) suits numerical target variables.
- SelectPercentile: This method selects the top features based on a user-defined percentile of the highest-scoring features. It is useful when you want to keep a specific percentage of the most relevant features.

### 2. Recursive Feature Elimination (RFE):
- RFE is a recursive method that starts with all features and iteratively removes the least important ones. At each step it trains the model, ranks the features by the model's coefficients or feature importances, and drops the lowest-ranked, until the desired number of features is reached. RFE is applicable to both classification and regression problems and is commonly used with linear models and tree-based models like Random Forest.
- It is important to note that RFE relies on the underlying model's own importance estimates rather than on a held-out performance score, so its results depend on the choice of model. The cross-validated variant (`RFECV` in scikit-learn) instead chooses the number of features from validation performance.
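When the right number of features is not known in advance, a cross-validated variant of RFE can pick it. The following is a minimal sketch using scikit-learn's `RFECV`, assuming synthetic data from `make_regression` and a plain `LinearRegression` estimator; the names `X_demo`, `y_demo`, and `rfecv` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 10 features, 4 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# RFECV runs recursive elimination inside cross-validation and keeps the
# feature count with the best mean validation score.
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring="r2")
rfecv.fit(X_demo, y_demo)

print("Number of features kept:", rfecv.n_features_)
print("Selected mask:", rfecv.support_)
```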

### 3. Lasso Regression (L1 Regularization):
- Lasso regression introduces an L1 penalty term to the loss function, which results in some feature coefficients becoming exactly zero. This sparsity-inducing property of Lasso allows it to perform feature selection by effectively eliminating less important features. Lasso itself is a linear regression model; for classification, the same L1 penalty can be applied to logistic regression.
- The regularization strength (alpha) determines the degree of sparsity, and cross-validation can be used to find the optimal alpha value.
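As a rough illustration of choosing alpha by cross-validation, the sketch below uses `LassoCV` behind a `StandardScaler`. The synthetic data and the variable names are assumptions for the example, and scaling is included because the L1 penalty is sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 4 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# Standardize, then let LassoCV search a grid of alpha values with 5-fold CV.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X_demo, y_demo)

lasso_cv = model.named_steps["lassocv"]
print("Chosen alpha:", lasso_cv.alpha_)
print("Non-zero coefficients:", int((lasso_cv.coef_ != 0).sum()))
```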

### 4. Random Forest Feature Importance:
- Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. Feature importance is calculated based on the average impurity reduction (or information gain) from each feature across all trees. Features with higher importance scores are considered more relevant to the target variable.
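- Random Forest captures non-linear relationships and feature interactions without requiring feature scaling, which makes its importance scores a convenient first look at which features drive the model's predictions. Impurity-based importances are computed on the training data and tend to favor features with many distinct values, so they are best read as a rough guide.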

### 5. Recursive Feature Addition (RFA):
- RFA is the opposite of RFE. It starts with an empty set of features and iteratively adds the most important feature based on model performance. This method can be useful when the goal is to identify a minimal set of features that achieve satisfactory model performance.
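In scikit-learn, forward addition of this kind is available as `SequentialFeatureSelector` with `direction="forward"`. The sketch below assumes synthetic data and a linear model purely for illustration; it adds one feature at a time, each time keeping the feature that gives the best cross-validated score.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Start from an empty set and greedily add features until 3 are selected.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X_demo, y_demo)

forward_selected = sfs.get_support(indices=True)
print("Indices of selected features:", forward_selected)
```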

### 6. Tree-based Methods:
- Decision trees and tree-based ensembles, such as Gradient Boosting Machines, provide feature importances during the tree-building process. Tree-based models are capable of handling various data types (e.g., numerical and categorical) and can capture complex relationships between features and the target variable.
- Feature importances from tree-based methods can be used to select the most informative features or gain insights into the underlying data patterns.

### 7. Principal Component Analysis (PCA):
- PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components, ordered by the amount of variance they capture. Keeping only the top principal components reduces the dimensionality of the data. Strictly speaking, this is feature extraction rather than feature selection: each component is a linear combination of all original features, so no original feature is discarded.
- PCA is particularly useful when dealing with high-dimensional datasets and can aid in visualization and computation efficiency.
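The sketch below shows what keeping the top components looks like in practice. It assumes a small random matrix with two nearly redundant columns (made up for the example) and standardizes the data first, since PCA is driven by variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random data (an assumption for illustration) with two nearly redundant columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Standardize so that no column dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
# Each row of components_ holds the weights of all 5 original features.
print("Component weights shape:", pca.components_.shape)
```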

### 8. Regularization-Based Methods:
- Regularization techniques, such as L1 (Lasso) and L2 (Ridge) regularization, can be applied to linear models to shrink or eliminate coefficients. L1 regularization results in sparse models by setting some coefficients to zero, while L2 regularization penalizes large coefficients without eliminating them entirely.
- Regularization allows models to be more robust to multicollinearity and reduces the risk of overfitting when dealing with high-dimensional datasets.
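To make the difference between the two penalties concrete, the following sketch fits `Lasso` and `Ridge` on the same standardized synthetic data and counts coefficients that are exactly zero. The data, the alpha values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=1.0).fit(X_scaled, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_scaled, y_demo)

# L1 can set coefficients exactly to zero; L2 only shrinks them.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Zero coefficients (Lasso):", n_zero_lasso)
print("Zero coefficients (Ridge):", n_zero_ridge)
```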

### 9. Feature Importance from Gradient Boosting Machines (GBM):
- Gradient Boosting Machines (GBM) are a powerful ensemble method that builds decision trees sequentially, each tree attempting to correct the errors of its predecessors. Similar to Random Forest, GBM provides feature importances, which can be used for feature selection.
- GBM's ability to handle both numerical and categorical features makes it applicable to a wide range of problems, and its feature importances can be leveraged to identify relevant features and improve model interpretability.

### 10. Correlation-based Feature Selection:
- Correlation-based feature selection aims to identify and remove features that have high correlation with one another. Highly correlated features can carry redundant information, leading to potential overfitting and model instability.
- Pearson correlation is unaffected by rescaling a feature, so normalization is not required for it. Handling missing values and outliers matters more, since both can distort correlation estimates.
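A simple correlation filter can be sketched with pandas as follows. The DataFrame, the column names, and the 0.9 cutoff are all assumptions for illustration; one column of each highly correlated pair is dropped.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): "a_copy" is nearly a rescaled "a".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df_demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 3.0 + 0.01 * rng.normal(size=500),
    "b": rng.normal(size=500),
})

corr = df_demo.corr().abs()
# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df_demo.drop(columns=to_drop)
print("Dropped:", to_drop)
```

The rescaled copy is caught even though its scale differs by a factor of three.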

# 2. Load the California housing dataset

### 1. Introduction to the California Housing Dataset:
The California housing dataset is a widely-used dataset in machine learning and statistics. It was derived from the 1990 U.S. census and includes housing data from various districts in California. The dataset is often used for regression tasks, where the goal is to predict the median house value for each district based on several features.

Here is a more detailed description of the features in the California housing dataset:

1. MedInc: Median income of the households in the district. This feature represents the key predictor of the median house value. Districts with higher median incomes tend to have higher median house values.

2. HouseAge: Median age of the houses in the district. This feature indicates the age of the housing structures in the district. Older houses might have lower values compared to newer ones.

3. AveRooms: Average number of rooms in the houses in the district. This feature provides information about the average size of houses in the district. Larger houses may have higher median values.

4. AveBedrms: Average number of bedrooms in the houses in the district. This feature reflects the average size of households. Districts with more bedrooms may have higher median house values.

5. Population: Total population of the district. This feature gives an idea of the population density, which can be related to the housing demand and thus affect housing prices.

6. AveOccup: Average household occupancy. This feature represents the average number of people living in a household. Higher occupancy might lead to higher demand for housing.

7. Latitude: Latitude of the district's location. Geographical location can play a role in determining housing prices.

8. Longitude: Longitude of the district's location. Similar to latitude, geographical location can influence housing values.

9. MedHouseVal: Median house value for California districts (the target variable), expressed in units of $100,000. This is the value we want to predict in a regression task.

### 2. Import necessary libraries

In [41]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectKBest, f_regression, SelectPercentile, mutual_info_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, VarianceThreshold, SelectFromModel

- We import the required libraries: `numpy` for numerical computations, `pandas` for data manipulation, and various feature selection methods and models from scikit-learn.

### 3. Load the California housing dataset

In [42]:
# Load the California housing dataset
california = fetch_california_housing()
X = california.data
y = california.target
feature_names = california.feature_names

- We load the California housing dataset using `fetch_california_housing()`.
- We extract the input features (`X`), target variable (`y`), and feature names (`feature_names`) from the dataset.


Now, let's move on to each feature selection method and explain the code for each part:

# 3. Univariate Feature Selection using SelectKBest with F-regression scoring

In [None]:
k_best = SelectKBest(score_func=f_regression, k=3)
X_k_best = k_best.fit_transform(X, y)
selected_feature_names_k_best = [feature_names[i] for i in k_best.get_support(indices=True)]

- We create a `SelectKBest` object with `score_func=f_regression` to perform univariate feature selection using F-regression scoring.
- We set `k=3` to select the top 3 features with the highest F-values.
- `k_best.fit_transform(X, y)` fits the feature selection model to the data and transforms the data to include only the selected features.
- `k_best.get_support(indices=True)` returns the indices of the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_k_best`.
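After fitting, a `SelectKBest` object also exposes the per-feature scores and p-values it ranked on, which helps explain why particular features were chosen. The sketch below assumes synthetic data and made-up feature names (`f0`, `f1`, ...) rather than the housing dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 6 features, 2 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Collect F-scores, p-values, and the selection mask in one table.
score_table = pd.DataFrame(
    {
        "F": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    },
    index=demo_names,
).sort_values("F", ascending=False)
print(score_table)
```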

In [53]:
df_k_best = pd.DataFrame(data=X_k_best, columns=selected_feature_names_k_best)
df_k_best.head()

|   | MedInc | AveRooms | Latitude |
|---|---|---|---|
| 0 | 8.3252 | 6.984127 | 37.88 |
| 1 | 8.3014 | 6.238137 | 37.86 |
| 2 | 7.2574 | 8.288136 | 37.85 |
| 3 | 5.6431 | 5.817352 | 37.85 |
| 4 | 3.8462 | 6.281853 | 37.85 |


# 4. Univariate Feature Selection using SelectPercentile with Mutual Information scoring

In [43]:
percentile_selector = SelectPercentile(score_func=mutual_info_regression, percentile=50)
X_percentile = percentile_selector.fit_transform(X, y)
selected_feature_names_percentile = [feature_names[i] for i in percentile_selector.get_support(indices=True)]

- We create a `SelectPercentile` object with `score_func=mutual_info_regression` to perform univariate feature selection using mutual information scoring.
- We set `percentile=50` to select the top 50% of features with the highest mutual information.
- `percentile_selector.fit_transform(X, y)` fits the feature selection model to the data and transforms the data to include only the selected features.
- `percentile_selector.get_support(indices=True)` returns the indices of the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_percentile`.
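Mutual information estimates in scikit-learn involve a small amount of randomness, so the selected set can vary between runs when scores are close. One way to make the step repeatable, sketched here on synthetic data, is to fix `random_state` through `functools.partial`; the variable names are illustrative.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Synthetic data (an assumption for illustration): 6 features, 2 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=0
)

# Bind a fixed seed so repeated fits give identical scores.
seeded_mi = partial(mutual_info_regression, random_state=0)

sel_a = SelectPercentile(score_func=seeded_mi, percentile=50).fit(X_demo, y_demo)
sel_b = SelectPercentile(score_func=seeded_mi, percentile=50).fit(X_demo, y_demo)

same_scores = np.allclose(sel_a.scores_, sel_b.scores_)
print("Scores identical across fits:", same_scores)
print("Selected indices:", sel_a.get_support(indices=True))
```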

In [55]:
df_percentile = pd.DataFrame(data=X_percentile, columns=selected_feature_names_percentile)
df_percentile.head()

|   | MedInc | AveRooms | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 8.3252 | 6.984127 | 37.88 | -122.23 |
| 1 | 8.3014 | 6.238137 | 37.86 | -122.22 |
| 2 | 7.2574 | 8.288136 | 37.85 | -122.24 |
| 3 | 5.6431 | 5.817352 | 37.85 | -122.25 |
| 4 | 3.8462 | 6.281853 | 37.85 | -122.25 |


# 5. Lasso Regression (L1 Regularization) for Feature Selection

In [44]:
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
lasso_selected_features = np.where(lasso.coef_ != 0)[0]
X_lasso = X[:, lasso_selected_features]
selected_feature_names_lasso = [feature_names[i] for i in lasso_selected_features]

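- We create a `Lasso` object with `alpha=0.1` to perform feature selection based on L1 regularization. The features are not standardized here, so the penalty affects each coefficient differently depending on the feature's scale. With this setting, `AveRooms` and `AveBedrms` receive zero coefficients and are dropped.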
- `lasso.fit(X, y)` fits the Lasso regression model to the data.
- `np.where(lasso.coef_ != 0)[0]` returns the indices of non-zero coefficients, which correspond to the selected features.
- `X[:, lasso_selected_features]` selects the data with the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_lasso`.

In [57]:
df_lasso = pd.DataFrame(data=X_lasso, columns=selected_feature_names_lasso)
df_lasso.head()

|   | MedInc | HouseAge | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 565.0 | 2.181467 | 37.85 | -122.25 |


# 6. Random Forest Feature Importance

In [45]:
forest = RandomForestRegressor(random_state=42)
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
X_forest_importance = X[:, indices[:3]]
selected_feature_names_forest = [feature_names[i] for i in indices[:3]]

- We create a `RandomForestRegressor` object to determine feature importance using Random Forest.
- `forest.fit(X, y)` fits the Random Forest model to the data.
- `forest.feature_importances_` stores the feature importances calculated by the model.
- `np.argsort(importances)[::-1]` sorts the importances in descending order and returns the indices of the sorted features.
- `X[:, indices[:3]]` selects the data with the top 3 most important features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_forest`.

In [59]:
df_forest = pd.DataFrame(data=X_forest_importance, columns=selected_feature_names_forest)
df_forest.head()

|   | MedInc | AveOccup | Latitude |
|---|---|---|---|
| 0 | 8.3252 | 2.555556 | 37.88 |
| 1 | 8.3014 | 2.109842 | 37.86 |
| 2 | 7.2574 | 2.80226 | 37.85 |
| 3 | 5.6431 | 2.547945 | 37.85 |
| 4 | 3.8462 | 2.181467 | 37.85 |


# 7. Recursive Feature Elimination (RFE) with a Random Forest model

In [60]:
rfe_selector = RFE(estimator=forest, n_features_to_select=3, step=1)
X_rfe = rfe_selector.fit_transform(X, y)
selected_feature_names_rfe = [feature_names[i] for i in rfe_selector.get_support(indices=True)]

- We create an `RFE` object with `estimator=forest` to perform recursive feature elimination using the Random Forest model.
- We set `n_features_to_select=3` to select the top 3 features recursively.
- `rfe_selector.fit_transform(X, y)` fits the RFE model to the data and transforms the data to include only the selected features.
- `rfe_selector.get_support(indices=True)` returns the indices of the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_rfe`.
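Besides the final mask, a fitted `RFE` object records the order in which features were eliminated in its `ranking_` attribute. The sketch below assumes synthetic data and a linear model (chosen only to keep the example fast) to show how to read it.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 6 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
print("Ranking:", rfe_demo.ranking_)
print("Selected mask:", rfe_demo.support_)
```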

In [62]:
df_rfe = pd.DataFrame(data=X_rfe, columns=selected_feature_names_rfe)
df_rfe.head()

|   | MedInc | AveOccup | Longitude |
|---|---|---|---|
| 0 | 8.3252 | 2.555556 | -122.23 |
| 1 | 8.3014 | 2.109842 | -122.22 |
| 2 | 7.2574 | 2.80226 | -122.24 |
| 3 | 5.6431 | 2.547945 | -122.25 |
| 4 | 3.8462 | 2.181467 | -122.25 |


# 8. Variance Thresholding

In [47]:
variance_selector = VarianceThreshold(threshold=0.1)
X_variance = variance_selector.fit_transform(X)
selected_feature_names_variance = [feature_names[i] for i in variance_selector.get_support(indices=True)]


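- We create a `VarianceThreshold` object with `threshold=0.1` to remove features with low variance. Variance depends on a feature's units, and on this unscaled data every feature exceeds the threshold, so all eight features are kept.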
- `variance_selector.fit_transform(X)` fits the feature selection model to the data and transforms the data to include only the selected features.
- `variance_selector.get_support(indices=True)` returns the indices of the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_variance`.
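Because variance is measured in the feature's own units, the same threshold can keep or drop a feature depending on how it is expressed. The following sketch, using a made-up measurement stored once in metres and once in centimetres, illustrates this.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (an assumption for illustration): one quantity in two units.
rng = np.random.default_rng(0)
metres = rng.normal(loc=0.0, scale=0.2, size=(1000, 1))  # variance near 0.04
X_demo = np.hstack([metres, metres * 100.0])  # second column is in centimetres

vt = VarianceThreshold(threshold=0.1).fit(X_demo)

print("Variances:", vt.variances_)
# The metres column falls below the threshold; the centimetres column does not.
print("Kept mask:", vt.get_support())
```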

In [63]:
df_variance = pd.DataFrame(data=X_variance, columns=selected_feature_names_variance)
df_variance.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


# 9. SelectFromModel with Random Forest feature importances

In [48]:
sfm = SelectFromModel(estimator=forest, threshold=0.1)
X_sfm = sfm.fit_transform(X, y)
selected_feature_names_sfm = [feature_names[i] for i in sfm.get_support(indices=True)]

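- We create a `SelectFromModel` object with `estimator=forest`, so selection is driven by the Random Forest's feature importances. No L1 penalty is involved, even though the printed label for this method later in the notebook says "L1-based".
- We set `threshold=0.1` to keep features whose importance is greater than or equal to the threshold.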
- `sfm.fit_transform(X, y)` fits the feature selection model to the data and transforms the data to include only the selected features.
- `sfm.get_support(indices=True)` returns the indices of the selected features.
- We use list comprehension to extract the names of the selected features from `feature_names`, and store them in `selected_feature_names_sfm`.
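For comparison, `SelectFromModel` can also be driven by an L1-penalized linear model, in which case features with (near-)zero coefficients are dropped. This is a minimal sketch on synthetic data; the alpha value and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)

# With an L1-penalized estimator, the default threshold keeps features
# whose absolute coefficient is not effectively zero.
sfm_l1 = SelectFromModel(estimator=Lasso(alpha=1.0)).fit(X_scaled, y_demo)

l1_mask = sfm_l1.get_support()
print("Selected mask:", l1_mask)
print("Lasso coefficients:", sfm_l1.estimator_.coef_)
```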

In [64]:
df_sfm = pd.DataFrame(data=X_sfm, columns=selected_feature_names_sfm)
df_sfm.head()

|   | MedInc | AveOccup |
|---|---|---|
| 0 | 8.3252 | 2.555556 |
| 1 | 8.3014 | 2.109842 |
| 2 | 7.2574 | 2.80226 |
| 3 | 5.6431 | 2.547945 |
| 4 | 3.8462 | 2.181467 |


# 10. Printing the Names of the Selected Features for Each Method

In [49]:
print("Selected features using SelectKBest (F-regression scoring):")
print(selected_feature_names_k_best)

print("\nSelected features using SelectPercentile (Mutual Information scoring):")
print(selected_feature_names_percentile)

print("\nSelected features using Lasso Regression (L1 Regularization):")
print(selected_feature_names_lasso)

print("\nSelected features using Random Forest Feature Importance:")
print(selected_feature_names_forest)

print("\nSelected features using Recursive Feature Elimination (RFE) with Random Forest:")
print(selected_feature_names_rfe)

print("\nSelected features using Variance Thresholding:")
print(selected_feature_names_variance)

print("\nSelected features using SelectFromModel with L1-based feature selection:")
print(selected_feature_names_sfm)


Selected features using SelectKBest (F-regression scoring):
['MedInc', 'AveRooms', 'Latitude']

Selected features using SelectPercentile (Mutual Information scoring):
['MedInc', 'AveRooms', 'Latitude', 'Longitude']

Selected features using Lasso Regression (L1 Regularization):
['MedInc', 'HouseAge', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Selected features using Random Forest Feature Importance:
['MedInc', 'AveOccup', 'Latitude']

Selected features using Recursive Feature Elimination (RFE) with Random Forest:
['MedInc', 'AveOccup', 'Longitude']

Selected features using Variance Thresholding:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Selected features using SelectFromModel with L1-based feature selection:
['MedInc', 'AveOccup']


# 11. Methods comparison

In [74]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def decision_tree_func(df, y=california.target):
    # Separate features and target
    X = df  # Features

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Create a decision tree regressor
    tree = DecisionTreeRegressor()

    # Fit the model on the training data
    tree.fit(X_train, y_train)

    # Predict on the test data
    y_pred = tree.predict(X_test)

    # Calculate the mean squared error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Calculate the mean absolute error (MAE)
    mae = mean_absolute_error(y_test, y_pred)

    # Calculate the R-squared (coefficient of determination)
    r2 = r2_score(y_test, y_pred)

    # Print the MSE, MAE, and R-squared
    print("Mean Squared Error:", mse)
    print("Mean Absolute Error:", mae)
    print("R-squared:", r2)

    return r2
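Because the selectors earlier in this notebook were fitted on all rows before the train/test split, the test scores may be somewhat optimistic. One common way to avoid this is to place the selector and the model in a single pipeline and cross-validate the whole thing, so that selection is refit on each training fold. The sketch below assumes synthetic data and illustrative names rather than the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 8 features, 3 informative.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# The selector is refit inside each fold, so validation rows never
# influence which features are chosen.
pipe = make_pipeline(
    SelectKBest(score_func=f_regression, k=3),
    DecisionTreeRegressor(random_state=0),
)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print("Mean R-squared:", cv_scores.mean())
print("Std of R-squared:", cv_scores.std())
```

Fixing `random_state` on the tree also makes the comparison repeatable from run to run.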

In [102]:
# List of feature selection methods
feature_selection_methods = [
    "SelectKBest (F-regression scoring)",
    "SelectPercentile (Mutual Information scoring)",
    "Lasso Regression (L1 Regularization)",
    "Random Forest Feature Importance",
    "Recursive Feature Elimination (RFE) with Random Forest",
    "Variance Thresholding",
    "SelectFromModel with L1-based feature selection"
]

In [103]:
df_list =  [df_k_best, df_percentile, df_lasso, df_forest, df_rfe, df_variance, df_sfm]

In [109]:
accuracy_dict = {}

for i in range(len(df_list)):
    print(feature_selection_methods[i] + ":")
    
    accuracy_dict.update({feature_selection_methods[i]:decision_tree_func(df_list[i])})
    print()
    

SelectKBest (F-regression scoring):
Mean Squared Error: 0.970543324312088
Mean Absolute Error: 0.6727642999031007
R-squared: 0.25569256836322907

SelectPercentile (Mutual Information scoring):
Mean Squared Error: 0.4101511023006056
Mean Absolute Error: 0.4078502156007752
R-squared: 0.6854560678651489

Lasso Regression (L1 Regularization):
Mean Squared Error: 0.467098898478125
Mean Absolute Error: 0.4406198570736434
R-squared: 0.6417829346329904

Random Forest Feature Importance:
Mean Squared Error: 0.7631435721311046
Mean Absolute Error: 0.5934224127906976
R-squared: 0.41474695882781343

Recursive Feature Elimination (RFE) with Random Forest:
Mean Squared Error: 0.7399170514728439
Mean Absolute Error: 0.5773742708333334
R-squared: 0.4325593238237416

Variance Thresholding:
Mean Squared Error: 0.5317524887394863
Mean Absolute Error: 0.4692128536821705
R-squared: 0.5922002457327926

SelectFromModel with L1-based feature selection:
Mean Squared Error: 1.0750386456495882
Mean Absolute Erro

In [110]:
accuracy_dict

{'SelectKBest (F-regression scoring)': 0.25569256836322907,
 'SelectPercentile (Mutual Information scoring)': 0.6854560678651489,
 'Lasso Regression (L1 Regularization)': 0.6417829346329904,
 'Random Forest Feature Importance': 0.41474695882781343,
 'Recursive Feature Elimination (RFE) with Random Forest': 0.4325593238237416,
 'Variance Thresholding': 0.5922002457327926,
 'SelectFromModel with L1-based feature selection': 0.17555534800997885}

In [111]:
# The stored values are R-squared scores, not classification accuracy
finalleaderboard = pd.DataFrame.from_dict(accuracy_dict, orient='index', columns=['R-squared'])
finalleaderboard = finalleaderboard.sort_values('R-squared', ascending=False)
finalleaderboard

|   | R-squared |
|---|---|
| SelectPercentile (Mutual Information scoring) | 0.685456 |
| Lasso Regression (L1 Regularization) | 0.641783 |
| Variance Thresholding | 0.5922 |
| Recursive Feature Elimination (RFE) with Random Forest | 0.432559 |
| Random Forest Feature Importance | 0.414747 |
| SelectKBest (F-regression scoring) | 0.255693 |
| SelectFromModel with L1-based feature selection | 0.175555 |


In [113]:
df_percentile.columns

Index(['MedInc', 'AveRooms', 'Latitude', 'Longitude'], dtype='object')