# Feature Selection
In this notebook, we'll revisit the topic of brain structure volumes in preterm babies. Specifically, we'll explore how feature selection can be a powerful tool to prevent overfitting, enhance model performance, and aid in interpreting features.

Feature selection can often make the difference between a model that performs well and one that doesn't. By selecting the right features, we can enhance our model's ability to understand and learn from our data.

Let's get started and see feature selection in action!

In [None]:
###################################
## RUN THIS
###################################
# this code is to suppress warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
####################################

## Brain structure volumes

Let's get back to working with our dataset of 86 brain structure volumes from 164 preterm babies. If you recall, our goal was to predict the gestational age (GA) from the volumes. You might remember that using Multivariate Linear Regression resulted in overfitting of the data, and we used Lasso and Ridge penalties to combat this.

### Load data

The code below will help you get started. It takes care of loading the data, creating the feature matrix and the target vector, and performing feature scaling. Remember, feature scaling is an essential step for the penalised models we use here: it puts all features on a comparable scale, so that the Ridge and Lasso penalties treat every coefficient in the same way instead of depending on the units of each feature. Let's continue!
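
If you'd like to see what the scaling step does, here is a minimal sketch on a made-up toy matrix (the values in `X_toy` are an assumption for illustration and are not part of our dataset). It compares `StandardScaler` with the same calculation written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (assumed values): two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# StandardScaler subtracts each column's mean and divides by its standard deviation
X_scaled = StandardScaler().fit_transform(X_toy)

# The same computation written out by hand
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print('Column means after scaling:', X_scaled.mean(axis=0))
print('Column standard deviations after scaling:', X_scaled.std(axis=0))
print('Matches manual calculation:', np.allclose(X_scaled, X_manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates simply because of its units.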

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# read spreadsheet using pandas
data = pd.read_csv("datasets/GA-structure-volumes-preterm.csv",header=None)
# convert from 'DataFrame' to numpy array
structure_volumes = data.to_numpy()
# Features
X = structure_volumes[:,1:]
# Targets
y = structure_volumes[:,0]
# checking the size of the feature and target arrays
# note they must agree in the first dimension
print('Features shape: {}; Targets shape: {}'.format(X.shape,y.shape))
# we have 86 features and 164 samples

# Scale features
X = StandardScaler().fit_transform(X)
print('Performed feature scaling.')

It's often incredibly insightful to identify which features our models consider most predictive. It helps us understand the underlying patterns in our data better and can guide future data collection or feature engineering efforts.

The code below will assist in achieving this objective. It reads in the names of the brain structures and stores them in a pandas `DataFrame` object named `structure_names`. So, not only will we know how many features are considered important, but we'll also know exactly which ones they are! Let's move on.

In [None]:
# read file with structure names
structure_names = pd.read_csv('datasets/labels', header = None, sep='\t')
structure_names[1]

### Multivariate linear regression

As you'll remember, Multivariate Linear Regression tended to overfit the data when applied to our problem. Overfitting happens when a model is too complex and captures the noise in the data rather than the underlying pattern. This results in great performance on the training data but poor generalization to unseen data.
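
To make the gap between training and unseen data concrete, here is a small sketch on synthetic data (the data, the sizes and the names `X_toy` and `y_toy` are assumptions chosen for illustration). With many features and few samples, linear regression fits the training set far better than it predicts held-out folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed): 60 samples, 40 features, only the first one matters
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 40))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(size=60)

# RMSE on the same data the model was trained on
fitted = LinearRegression().fit(X_toy, y_toy)
train_rmse = np.sqrt(mean_squared_error(y_toy, fitted.predict(X_toy)))

# RMSE estimated on held-out folds
cv_scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                            scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores.mean())

print('Training RMSE:', round(train_rmse, 2))
print('Cross-validated RMSE:', round(cv_rmse, 2))
```

A training error much lower than the cross-validated error is the typical signature of overfitting.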

Now, let's take a look at the performance of this linear regression model to set a baseline. As we progress, we'll be able to compare this with our models after applying feature selection, to clearly see any improvements made. Ready to continue? Let's go!

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
scores = cross_val_score(model, X, y, scoring = 'neg_mean_squared_error')
print('Linear regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

Alright, ready for a little flashback? We discovered earlier that the magic alpha number for Ridge regression was about 45. Using this setting helped us dodge a lot of that pesky overfitting.

### Ridge

Let's rerun this Ridge regression model. Remember, Ridge regression adds a penalty proportional to the sum of the squared coefficients to the loss function, which helps prevent overfitting by constraining the model. This setting significantly reduced overfitting, so keep the performance of Ridge regression in mind as a baseline for good performance when we compare it with later results.
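
As a quick illustration of how the penalty constrains the model, the sketch below fits plain linear regression and Ridge to the same synthetic data (the data are an assumption for illustration; `alpha=45` simply mirrors the value used in this notebook) and compares the size of the coefficient vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumed): 50 samples, 20 features, two of them informative
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))
y_toy = X_toy[:, 0] - X_toy[:, 1] + rng.normal(size=50)

ols = LinearRegression().fit(X_toy, y_toy)
ridge = Ridge(alpha=45).fit(X_toy, y_toy)

# Euclidean length of each coefficient vector
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print('Coefficient norm without penalty:', round(ols_norm, 2))
print('Coefficient norm with Ridge penalty:', round(ridge_norm, 2))
```

The Ridge coefficients are pulled towards zero, which is what limits the model's ability to fit noise.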

In [None]:
from sklearn.linear_model import Ridge

model = Ridge(alpha = 45)
scores = cross_val_score(model, X, y, scoring = 'neg_mean_squared_error')
print('Ridge regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

Fantastic! We will now explore different feature selection techniques in Scikit-learn.

Feature selection is like a filter that sifts out all the redundant or less meaningful stuff from our data, letting the truly valuable features shine through. It's a key step to avoid overfitting, reduce complexity, and improve our model's performance.

Alright, are you ready? Let's unravel the potential of different feature selection techniques together!

## Univariate feature selection

### Pearson's correlation coefficient

Pearson's correlation coefficient is a great way to understand the linear relationship between each feature and the target variable. The `pearsonr` function from the `scipy.stats` module calculates it for us.

The correlation coefficient ranges from -1 to 1. A high absolute value (close to -1 or 1) means there's a strong linear relationship. This could be either a positive relationship (as one value goes up, so does the other) or a negative relationship (as one value goes up, the other goes down).
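
Here is a small sketch of these three situations on synthetic data (the variables below are made up for illustration): a strong positive relationship, a strong negative relationship, and no relationship at all.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic variables (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = 0.1 * rng.normal(size=200)

# pearsonr returns the coefficient first and the p-value second
r_pos = pearsonr(x, 2 * x + noise)[0]           # y goes up as x goes up
r_neg = pearsonr(x, -2 * x + noise)[0]          # y goes down as x goes up
r_none = pearsonr(x, rng.normal(size=200))[0]   # unrelated variable

print('Positive relationship:', round(r_pos, 2))
print('Negative relationship:', round(r_neg, 2))
print('No relationship:', round(r_none, 2))
```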

Keep in mind that while many brain volumes have high correlation with gestational age (GA), not all of them do. This is where feature selection can be particularly handy, helping us focus on those features that contribute the most to predicting GA.

So, ready to explore more about these relationships? Let's keep going!

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import pearsonr

n = X.shape[1]
cc = np.zeros(n)
for i in range(n):
    cc[i]=pearsonr(X[:,i],y)[0]

plt.figure(figsize = [16,4])
plt.bar(np.arange(n),cc)
plt.title("Pearson's correlation coefficient", fontsize = 18)
plt.xlabel('Feature', fontsize = 16)
plt.ylabel('Correlation coefficient', fontsize = 16)
plt.axis([-1,86,0,1])

### F-score

The F-score is another useful statistic when it comes to feature selection.

Just to clarify, Scikit-learn works with F-values rather than correlation coefficients. Don't worry: for a single feature the F-value is a monotonically increasing function of the squared Pearson's correlation coefficient, so both statistics rank the features in the same order.

The cool part about F-values is that they can be calculated directly using the `f_regression` function from `sklearn.feature_selection`.
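
The link between the two statistics can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and assumes the default settings of `f_regression`, for which the F-value of a single feature should equal $r^2 / (1 - r^2) \cdot (n - 2)$, where $r$ is Pearson's correlation coefficient and $n$ is the number of samples.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

# Synthetic data (assumed): 100 samples, 5 features
rng = np.random.default_rng(0)
n_samples = 100
X_toy = rng.normal(size=(n_samples, 5))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(size=n_samples)

# Pearson's r for every feature
r = np.array([pearsonr(X_toy[:, i], y_toy)[0] for i in range(X_toy.shape[1])])

# F-value computed by hand from r, and by scikit-learn
f_manual = r**2 / (1 - r**2) * (n_samples - 2)
f_sklearn = f_regression(X_toy, y_toy)[0]

print('By hand:     ', np.round(f_manual, 2))
print('f_regression:', np.round(f_sklearn, 2))
```

Because the F-value depends on $r^2$, it ignores the sign of the correlation: strongly negative and strongly positive features both score highly.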

**Activity 1.1:** Your task now is to complete the code below to create a `bar` plot of the F-scores. This plot will give you a good idea of how the F-values of your features are distributed. Ready to take it on? You've got this!

In [None]:
from sklearn.feature_selection import f_regression

f_score = f_regression(X,y)[0]

# plot f-scores
plt.figure(figsize = [16,4])
plt.bar(np.arange(n),None)
plt.title('F-value', fontsize = 18)
plt.xlabel('Feature', fontsize = 16)
plt.ylabel('F-value', fontsize = 16)

**Activity 1.2:** Your task now is to plot the relationship between Pearson's correlation coefficient and the F-score using `plot`. By creating this plot, we can better understand the relationship between these two metrics. Ready to uncover their relationship? Let's jump right in!

In [None]:
# plot relationship
plt.plot(None,None,'*')
plt.xlabel("Pearson's correlation coefficient", fontsize = 16)
plt.ylabel('F-value', fontsize = 16)

### Selecting features based on F-value
<img src="pictures/brain.png" width = "250" style="float: right;">

Great, let's start refining our model by selecting the most impactful features.

### Selecting Top Scoring Features
Scikit-learn has some handy tools to make this process easier, specifically `SelectKBest` and `SelectPercentile`. To start, we're going to use `SelectKBest` to pick out the top 4 features.

Now, you'll notice some code below. It's transforming our original feature matrix `X` into a new one, `X_selected`, which will only include our top 4 selected features. Neat, huh?

**Activity 1.3:** Now, let's make sure everything's gone to plan. Check the following:

- Size of the new matrix - does it match what you expect, considering we're selecting 4 features?

- Indices of the features that have been selected - these will tell us which features from the original matrix have been chosen.

- Names of the selected features - because it's always nice to know who made the cut!
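
If you're curious what `SelectKBest` does under the hood, the sketch below (synthetic data, assumed for illustration) suggests it amounts to scoring every feature and keeping the `k` features with the highest scores.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumed): only features 1 and 4 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))
y_toy = 3 * X_toy[:, 1] - 2 * X_toy[:, 4] + rng.normal(size=100)

# Let SelectKBest pick the two best features
toy_selector = SelectKBest(f_regression, k=2).fit(X_toy, y_toy)
selected = toy_selector.get_support(indices=True)

# Pick the two highest scores by hand
top_k = np.sort(np.argsort(toy_selector.scores_)[-2:])

print('SelectKBest indices:', selected)
print('Top scores by hand: ', top_k)
print('Shape after transform:', toy_selector.transform(X_toy).shape)
```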

In [None]:
from sklearn.feature_selection import SelectKBest

# define feature selection model
k=4
selector = SelectKBest(f_regression, k = k)

# select features
X_selected = selector.fit_transform(X,y)

# Shape of the matrix
print('Shape of the new matrix: ', X_selected.shape)

# Indices of the selected features
ind = np.where(selector.get_support())[0]
print('Indices: ', ind)

# Print the names of the selected structures
print('\n')
for i in range(k):
    print(structure_names.loc[ind[i],1])

### Univariate feature selection for improved prediction

Perfect, now let's see how these selected features fare when we use them for our multivariate linear regression.

**Activity 1.4:** Here's your next mission - apply multivariate linear regression to our carefully selected features and let's see if our performance improves.

Remember, the goal of feature selection is to enhance our model's performance by using only the most relevant features. By reducing the 'noise' from less important features, we're hoping for a more effective model.

In [None]:
# Select and fit linear regression model to selected features
model = None
model.fit(None,y)

# Calculate and print RMSE
scores = cross_val_score(model, X_selected, y, scoring = 'neg_mean_squared_error')
print('Linear regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

Alright, it seems like we've managed to reduce overfitting, which is fantastic! However, our model's performance still doesn't match that of the Lasso or Ridge methods.

### Tweaking the Number of Selected Features

**Activity 1.5:** But don't worry, we've got one more trick up our sleeve. It's time to experiment with the number of selected features. Adjust this number to find out what delivers the best performance.

Keep in mind that the right balance might not always be the maximum number of features – sometimes, less is more!

Once you've found that sweet spot, make a note of it. This is now the benchmark for the best performance we can get from univariate feature selection.

Ready to find that perfect number? Let's get experimenting!

## Exercise 1

Select 4 top scoring features using mutual information. Do you obtain the same or different features as for correlation coefficient?

### Mutual Information for Feature Selection
Mutual information can provide a deeper understanding of the relationship between each feature and the target, as it captures any kind of statistical dependency, not just linear ones.
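
To see why this matters, consider a made-up example (synthetic data, assumed for illustration) where the target depends on the square of a feature. The relationship is strong but not linear, so Pearson's correlation is close to zero while the mutual information estimate is clearly positive.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (assumed): y depends on x0 squared, x1 is irrelevant
rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=500)
x1 = rng.uniform(-1, 1, size=500)
X_toy = np.column_stack([x0, x1])
y_toy = x0**2 + 0.05 * rng.normal(size=500)

# Linear dependency measure
r = pearsonr(x0, y_toy)[0]

# General dependency measure (one estimate per feature)
mi = mutual_info_regression(X_toy, y_toy, random_state=0)

print('Pearson r for x0:', round(r, 2))
print('Mutual information for x0 and x1:', np.round(mi, 2))
```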

Now, let's select the 4 top scoring features using mutual information and see how it compares to our previous method.

**Activity:** Your task is to run the mutual information feature selection and compare the selected features with those chosen based on the correlation coefficient.

Remember, the same features may not always be chosen by different selection methods. That's what makes this so interesting!

Ready to uncover the mutual information in our dataset? Let's do it!

In [None]:
from sklearn.feature_selection import mutual_info_regression

# set number of features to select
k=4

# Create feature selector
selector = None

# select features
X_selected = None

# Indices of the selected features
ind = None

# Print the names of the selected structures
print('\n')
for i in range(k):
    print(structure_names.loc[ind[i],1])

## Model based feature selection

### Lasso

Now we move to another type of feature selection, where a fitted model guides our decisions. We start with the `Lasso` model. We previously found that setting `alpha=0.16` gave the best Lasso model for our dataset. The code below creates this model, calculates its performance and prints its non-zero coefficients.

**Activity:** Your task here is to run the provided code and observe the model's performance. Note the number of non-zero coefficients: these correspond to the features Lasso keeps, while every feature whose coefficient was shrunk to exactly zero is treated as irrelevant. This sparsity is what makes Lasso such a handy tool for feature selection!
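
Here is a minimal sketch of this behaviour on synthetic data (the data and the value `alpha=0.5` are assumptions for illustration and are unrelated to our tuned model). Only two of the ten features influence the target, and Lasso sets most of the remaining coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only features 0 and 3 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 10))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 3] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Indices of the coefficients that are not exactly zero
selected = np.flatnonzero(lasso.coef_)

print('Coefficients:', np.round(lasso.coef_, 2))
print('Indices of non-zero coefficients:', selected)
```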

Ready to let Lasso guide us towards the most important features? Let's get started!

In [None]:
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.16)
scores = cross_val_score(model, X, y, scoring = 'neg_mean_squared_error')
print('Lasso regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

model.fit(X,y)
print('\n Non-zero coefficients')
print(model.sparse_coef_)
print('\n There are {} non-zero coefficients.'.format(model.sparse_coef_.count_nonzero()))

Fantastic! Now, we're going to dig a little deeper into the results from our Lasso model.

### Digging into Lasso's Decisions
The following code snippet will help us identify exactly which features Lasso deemed important. We'll find out the indices of non-zero Lasso coefficients, which correspond to the selected features.

After this, we'll also pull out the names of the selected structures for a clear view of what Lasso suggests we focus on.

**Activity:** Just run the following code and let's uncover the features our Lasso model has selected!

In [None]:
# indices of non-zero elements
ind = model.sparse_coef_.nonzero()[1]
print('Indices of non-zero elements: ', ind)
print('\n')

# print names of selected structures
print('Selected structures: \n')
for i in range(ind.size):
    print(structure_names.loc[ind[i],1])

## **Exercise 2:** Combining LassoCV and Linear Regression
In this exercise, we'll combine the strengths of `LassoCV` and `LinearRegression`. We'll use the `LassoCV` model for feature selection, and then we'll use `LinearRegression` to make predictions using the selected features.

Here's a step-by-step breakdown of your tasks:

- Implement feature selection using the SelectFromModel selector, and choose the LassoCV model.
- With the selected features, calculate the performance of a `LinearRegression` model.
- Finally, experiment with different thresholds for Lasso coefficients to see which value gives us the best performance.
- Remember, you're aiming for a good balance between the number of features selected and the model's performance. Too few features and the model may not perform well; too many and you risk overfitting.
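
Before you start, it may help to see what a threshold does in principle. The sketch below is a plain NumPy illustration with a hypothetical coefficient vector (the values are made up and do not come from our data). It assumes that a feature is kept when the absolute value of its coefficient is at or above the threshold, which to my understanding is how `SelectFromModel` treats linear models.

```python
import numpy as np

# Hypothetical coefficients, standing in for a fitted Lasso model's coef_
coef = np.array([0.0, 1.2, -0.03, 0.0, -0.6, 0.08])

# Number of features kept for each candidate threshold
n_kept = {}
for threshold in [0.01, 0.05, 0.5]:
    keep = np.abs(coef) >= threshold
    n_kept[threshold] = int(keep.sum())
    print('Threshold {}: kept features {}'.format(threshold, np.flatnonzero(keep)))
```

Raising the threshold keeps fewer features, so it directly controls the trade-off described in the last point above.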

Ready to fine-tune your feature selection skills? Let's go!

In [None]:
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

# Create selector with LassoCV model
selector = None

# Perform feature transformation
X_selected = None

# Create and fit linear regression model to selected features
model = None
model.fit(None,y)

# Calculate and print RMSE
scores = cross_val_score(model, None, y, scoring = 'neg_mean_squared_error')
print('Linear regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

# List the number and names of the selected features
ind = None
print('\nSelected {} features: '.format(ind.size),)
for i in range(ind.size):
    print(structure_names.loc[ind[i],1])

When we employ feature selection using `LassoCV` with an optimised threshold, the performance of the Linear Regression model reaches levels similar to the optimised `Ridge` regression. That's quite impressive!

This outperforms the results we got when using univariate feature selection. So what have we learned? Well, this essentially shows us that using model-based feature selection methods like `LassoCV` can help us achieve more accurate results by intelligently deciding which features contribute the most to our predictions.

Keep going! You're doing wonderfully, and your understanding of these techniques is getting better with each step. Let's move on to the next part.

### Random forest

So let's roll up our sleeves and get our hands dirty with some Random Forest modeling. Run the following cell to train the Random Forest regressor and assess its performance on our data. Don't worry if you don't grasp all the details right now; just try to understand the big picture!

**Note:**
One of the great things about Random Forests is that they are highly resilient to overfitting, which can be a common issue with other models. This is largely due to their ensemble nature—by aggregating the results of many different trees, they're able to maintain robust performance even when some individual trees may overfit the data.

On top of this, Random Forests also have the ability to model complex, non-linear relationships, giving them a leg up over linear regression models in certain scenarios.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Select and fit the model
model = RandomForestRegressor(n_estimators=20)

# Calculate CV RMSE
scores = cross_val_score(model, X, y, scoring = 'neg_mean_squared_error')
print('Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

**Activity 2.1:** Feature importances can be accessed as `model.feature_importances_`. Plot the feature importances using a `bar` plot.


---
We're going to dive into one of the coolest features of Random Forest models, the **feature importances**. These importances provide an easy-to-understand ranking of which features the model considers most useful in making its predictions.

To do this, we're going to call upon `model.feature_importances_`. This will return an array where each number represents the importance of a feature. Higher numbers mean the feature is more important to the model's decision-making process.
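
As a quick sanity check of what this array looks like, here is a sketch on synthetic data (assumed for illustration) in which only the first feature drives the target. The importances are normalised so that they add up to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed): only feature 0 influences the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)
importances = forest.feature_importances_

print('Importances:', np.round(importances, 3))
print('Sum of importances:', round(importances.sum(), 3))
print('Most important feature:', importances.argmax())
```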

And what better way to visualize this data than with a `bar` plot? The length of each bar will show us the relative importance of each feature in a very intuitive manner.

Go ahead and run the following cell to complete this activity. As you look over the results, consider what insights you can gain from this visualization. Are there any surprises? Any features that are more or less important than you expected?

In [None]:
# fit the model
model.fit(X,y)

# plot feature importances
plt.figure(figsize = [16,4])
n = X.shape[1]
plt.bar(np.arange(n),None)
plt.title('Feature importances', fontsize = 18)
plt.xlabel('Features', fontsize = 16)
plt.ylabel('importances', fontsize = 16)

**Activity 2.2:** Use selector `SelectFromModel` to select the features from `RandomForestRegressor(n_estimators=20)`. Choose threshold 0.05 and print the names of the selected features. Are they consistent with the ones selected by Lasso or Correlation Coefficient?


---


Alright, now we're going to take our exploration of the Random Forest model a step further by selecting specific features using the `SelectFromModel` selector from `sklearn.feature_selection`, applied to `RandomForestRegressor(n_estimators=20)`.

This is a powerful method for feature selection that uses the importance scores of a fitted model (its coefficients or feature importances) to choose the most important features. We'll use it with our Random Forest model to identify the features that have the most impact on our predictions.

To do this, we will set a **threshold of 0.05**. This means that any feature with an importance score less than this number will not be selected. Once we've selected these features, we'll print their names to see which ones made the cut.

Then, we'll compare the features selected by the Random Forest model to those selected by the `Lasso` model and Correlation Coefficient. This will give us a broader understanding of which features are consistently identified as significant across different models.

Let's get to it! Run the cell below to execute this activity.

In [None]:
# Create selector with RandomForestRegressor model
selector = None

# Perform feature transformation
X_selected = selector.fit_transform(X, y)

# List the number and names of the selected features
ind = selector.get_support(indices=True)
print('Selected {} features: '.format(ind.size),)
for i in range(ind.size):
    print(structure_names.loc[ind[i],1])

## Recursive feature elimination

Scikit-learn offers the classes `RFE` and `RFECV` to perform recursive feature elimination. Any model that exposes coefficients or feature importances can be used for the ranking, and this time we will choose `Ridge` regression.

Let's first find 6 best features using `RFE` with `Ridge`. Run the code below to fit `RFE` model and print the names of the selected features.


---


Let's continue our journey in feature selection with a technique called Recursive Feature Elimination (RFE). This method works by fitting a model and removing the weakest feature (or features) until the specified number of features is reached.

What makes RFE powerful is that it takes advantage of the model to identify which features (or combinations of features) contribute the most to predicting the target variable.
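
The elimination loop can be sketched by hand. The example below uses synthetic data and `Ridge(alpha=1.0)` (both assumptions for illustration): it repeatedly refits the model, drops the feature with the smallest absolute coefficient, and then compares the result with scikit-learn's `RFE` using its default of removing one feature per step.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Synthetic data (assumed): only features 2 and 5 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))
y_toy = 3 * X_toy[:, 2] - 2 * X_toy[:, 5] + rng.normal(size=100)

n_keep = 2

# Manual elimination: refit, then drop the weakest remaining feature
remaining = list(range(X_toy.shape[1]))
while len(remaining) > n_keep:
    ridge = Ridge(alpha=1.0).fit(X_toy[:, remaining], y_toy)
    weakest = int(np.argmin(np.abs(ridge.coef_)))
    del remaining[weakest]

# The same procedure using scikit-learn
rfe = RFE(Ridge(alpha=1.0), n_features_to_select=n_keep).fit(X_toy, y_toy)
rfe_selected = rfe.get_support(indices=True)

print('Manual elimination:', remaining)
print('RFE:               ', rfe_selected)
```

Note that the model is refitted after every removal, so a feature's rank can change as other features disappear. This is what distinguishes RFE from a single pass of `SelectFromModel`.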

For this particular task, we will use the `RFE` class from `sklearn.feature_selection` along with the `Ridge` regression model. We will aim to find the top 6 features. Once we've selected these features, we'll print out their names to see which ones were selected.

Excited to find out which features are most important according to `RFE`? Run the cell below and let's see the result!

In [None]:
from sklearn.feature_selection import RFE

k=6

# create ranking model
model = Ridge(alpha=45)

# create selector
selector = RFE(model, n_features_to_select=k)

# fit selector
selector.fit(X,y)

# Print the indices of the selected features
ind = np.where(selector.get_support())[0]
print('Indices: ', ind)

# Print the names of the selected structures
print('\n')
for i in range(k):
    print(structure_names.loc[ind[i],1])

**Activity 3.1:** Transform the features, fit linear regression and calculate CV RMSE to see whether we reduced overfitting.


---

Now we're diving into Activity 3.1 - the plot thickens! This part of our journey involves transforming the features, strapping on a linear regression model and calculating the Cross-Validation Root Mean Squared Error (CV RMSE). Exciting, isn't it?

In [None]:
# Select features
X_selected = None

# Linear regression
model = None
scores = cross_val_score(model, X_selected, y,scoring = 'neg_mean_squared_error')
print('Linear regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

## Exercise 3
Let's now use `RFECV`, which can also automatically select the optimal number of features using cross-validation. Write the code to:
* Fit the `RFECV` feature selection with `Ridge(alpha=45)` ranking model
* Transform the features and fit the `Ridge(alpha = 45)` model to the selected features
* Calculate the CV RMSE
* Print indices of selected features
* Print number of selected features


---


Time for Exercise 3! We're stepping up our game now. We'll be using `RFECV`, a really cool method that not only performs feature elimination but also picks the optimal number of features using cross-validation. Neat, huh?

Alright, let's break this down into bite-sized tasks:

1. First up, we'll fit `RFECV` with our `Ridge(alpha=45)` model. It's like pairing a dynamic duo ready to rank our features.

2. Then, we'll transform the features and fit them back to the `Ridge(alpha = 45)` model. It's like giving our features a new look and seeing how they perform in the Ridge model's spotlight.

3. Up next, we'll calculate the Cross-Validation Root Mean Squared Error (CV RMSE). It's like our trusty measuring tape to see how well our model is doing.

4. After that, let's print out the indices of our selected features - like shining a spotlight on our all-star features!

5. Finally, we'll print out the number of features that were selected. It's like doing a headcount of our all-star features.

I bet you're as excited as I am to see what we discover. So, let's dive in and get our hands dirty with some coding! You've got this!


In [None]:
from sklearn.feature_selection import RFECV

# create ranking model
model = None

# Create selector
selector = None

# Fit the selector and transform the features
X_selected = None

# Calculate performance of Ridge with selected features
scores = None
print('Ridge regression: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

# Print indices of the selected features
ind = None
print('Indices: ', ind)

# Print number of selected features
print('Number of selected features: ', ind.size)

### Recursive feature elimination using Random Forest

**Activity 3.2:** Perform recursive feature elimination using `RFECV` and `RandomForestRegressor(n_estimators=20)`. Be patient, this process might take time.


---

We're now moving onto Activity 3.2, a more adventurous task. Here we'll perform recursive feature elimination but this time, we're bringing in the big guns - `RFECV` and `RandomForestRegressor(n_estimators=20)`. This is like assembling a superhero team for feature selection!

Now, a heads-up - the process might take a bit more time than usual. You know how it goes, the Random Forest algorithm can be quite a powerhouse and takes its time to churn through the data. But remember, all good things come to those who wait!

So, while the code runs, grab yourself a cup of coffee, or perhaps plan out your next coding adventure. Trust me, the insights you'll gain will be well worth the wait.

Happy coding and patience, my friend!


In [None]:
from sklearn.feature_selection import RFECV

model = RandomForestRegressor(n_estimators=20)
selector = None
selector.fit(X,y)

# Print selected features
ind = np.where(selector.get_support())[0]
print('Indices: ', ind)

print('Number of selected features: ', ind.size)

**Activity 3.3:** Transform the features (no need to fit the feature selector again) and fit the `RandomForestRegressor(n_estimators=20)` to see whether CV RMSE improved.


---

Ready for Activity 3.3? We're going to keep the momentum going with some exciting transformations. This time, we're transforming the features and using our sturdy `RandomForestRegressor(n_estimators=20)`.

We're not fitting the feature selector again. Nope, no need for that. It's already had its run and it's done a great job for us. Now we're moving on to see what the transformed features can do.

So, the spotlight is on the Random Forest Regressor now. We're going to fit our model and then check the Cross-Validation Root Mean Squared Error (CV RMSE). I'm sure you're eager to see if the CV RMSE has improved. Fingers crossed!

Let's dive in and continue this machine learning journey. Remember, this is about learning and having fun. And you're doing an amazing job at both. Let's rock it!

In [None]:
# Select features
X_selected = None

# Random Forest with reduced features
model = None
scores = cross_val_score(model, X_selected, y, scoring = 'neg_mean_squared_error')
print('Random forest: Cross-validated RMSE is ', round(np.sqrt(-scores.mean()),2))

# Conclusion

We have seen that feature selection can prevent overfitting and improve the performance of the model. We have also seen that Random Forest is very resilient against overfitting and does not particularly benefit from feature selection in our example. On the contrary, it is a very good tool for selecting features for other methods.

We have also seen that the selected features varied a lot depending on the selection method. We therefore need to be careful when interpreting the selected features.


---


As we've seen, feature selection is like a superpower that helps us keep overfitting at bay and improves our model's performance. It's an integral part of machine learning that makes all the difference in the world.

Interestingly, we've discovered that the Random Forest is a tough little cookie when it comes to overfitting. It's almost as if it's built a fort around itself and does not particularly need feature selection to thrive. But guess what? It still proves to be an awesome tool for selecting features for other methods. Versatility at its best, don't you think?

Now, it's important to mention that the features selected can be quite the chameleons, changing their colors based on the selection method. It's a good reminder for us to be cautious when interpreting the selected features.

Remember, there's no one-size-fits-all in machine learning. It's always a blend of different techniques, a bit of trial and error, and lots of learning. That's what makes it fun and fascinating!

Keep up the great work!