<a href="https://colab.research.google.com/github/MaralAminpour/IVM_supplementary_materials/blob/main/feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part III: Feature Selection

Feature selection is an essential step in the machine learning pipeline. It refers to the process of selecting **a subset of the most relevant features (or variables) from the original set**, which can improve the performance of the model, speed up the training process, and help prevent overfitting.

## Why do we select certain features over others?

Ever wonder why we select certain features over others? Well, there are some pretty cool reasons:

1. **To Avoid Overfitting**: Think of it as **decluttering**. By removing extra stuff (or redundant data), our model doesn't get sidetracked by unnecessary **noise**.
2. **Boost Accuracy**: It's like focusing on the main story without the side plots. By cutting out misleading data, our model can be more **on-point and accurate**.
3. **Speed Things Up**: Everyone loves a quick result, right? With fewer features, our model can sprint through the training process, making everything snappier.


## Application: Prediction of age at scan

Let's focus on predicting age based on brain scans. Remember when we talked about predicting age from those 86 brain structure volumes? Using just a straightforward multivariate linear regression might make our model a bit too eager and overfit the training data. The clue? A big difference in how the model does on the full training set versus when tested with cross-validation.

Now, throughout our journey, we've come across several cool techniques to prevent this overfitting. Do any of them ring a bell?

- Keeping our model's enthusiasm in check with **regularisation** (like Ridge or Lasso).
- Reducing the number of features with methods like **PCA, ICA, or Laplacian Eigenmap**.
- Bringing together multiple models in **ensemble learning** (like our buddy, the Random Forest).
- And of course, being picky with our data using **feature selection**!

Quick recap: We aim to predict the age at scan using volumes of **86 brain structures** in preterm babies. Just using multivariate linear regression? **Oops, we overfit!**

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/select1.png' width=500px >

So, what's in our toolkit to prevent overfitting? Regularisation, dimensionality reduction, ensemble learning, and feature selection. Got it? Onward!

## Simulated features

Imagine we've got five simulated features, each sprinkled with varying amounts of noise and non-linearity. Picture it like this:

- The x-axis is showing the feature value.
- The y-axis? That’s our target value.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/select2.png' width=700px >

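Now, our goal is to pick the best and most informative features. To help visualize this, we've simulated five features (labelled 0 to 4 in the rest of this section, so the first feature below is 'feature 0' and the fifth is 'feature 4'):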

1. The first feature has a straight-line relationship with the targets, but it's kinda like static on a TV – a bit noisy.
2. The second one? Still a straight-line relationship, but clearer than a sunny day – much less noise.
3. Our third feature's relationship with the target is more like a wavy line, a bit curvy but still pretty clear.
4. The fourth one is like a roller coaster, super curvy and unpredictable. It’s not a straight shot to our target, but it’s clear without much noise.
5. And the fifth? Well, that's just like radio static – all noise and no clear connection to our targets.
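
The five features described above could be simulated along the following lines. This is only a sketch: the sample size, noise levels, and the specific functions (`y**2`, `cos`) are assumptions made for illustration, not the values used to produce the figure.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n = 500                          # number of samples (assumed)
y = rng.uniform(0, 1, n)         # target values

# Each feature is a function of the target plus Gaussian noise (all choices assumed).
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # feature 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # feature 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # feature 2: mildly curved, little noise
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # feature 3: strongly non-linear, little noise
    rng.normal(0, 1, n),                             # feature 4: pure noise, unrelated to y
])
print(X.shape)
```

Because feature 3 oscillates, the same feature value corresponds to several different target values, which is what "non-unique" refers to later on.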

## Feature importances

Feature importances help us understand which features really make a difference when making predictions.

**Univariate Feature Importances**:

- Think of this as looking at **one feature at a time**.

- We're trying to see **how good** each feature is at **predicting the target values/labels.**

- **Correlation or mutual information**: It’s kinda like finding out which ingredients make a dish taste amazing! Scores such as the correlation or the mutual information between each feature and the target give us a feel for this.

**Model-Based Feature Importances**:

- Imagine building a LEGO castle. Each brick represents a feature, and some bricks are more crucial than others.

- If we’re talking linear models (like regressors or classifiers), we create a fancy formula that looks like this:

  $$\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n$$

  Here, the weights (those 'w' values) tell us the importance of each feature.

- **Quick tip:** Make sure to scale your features before fitting; it’s like making sure all the LEGO bricks are the same size.

- If you’re using scikit-learn, you can peek at these weights with `model.coef_`.

**Tree-Based Methods**:

- Trees are cool! They **split data based on features** and see which ones **clear up confusion the best.**

- If you’ve got a bunch of trees (like in a forest), you **average out the results from all trees.**

- Again, if you're using scikit-learn, check out `model.feature_importances_` to see the star players.
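
Here is a small sketch of where these numbers live in scikit-learn, and of why scaling matters for linear weights. The data are made up for this example (three random columns, one of them on a much larger scale), and the choice of `DecisionTreeRegressor` with `max_depth=4` is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
X[:, 1] *= 100   # column 1 lives on a much larger scale than the others
# Column 1 drives most of the variation in y; column 2 is irrelevant.
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, n)

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

print("coef_ without scaling:", raw.coef_.round(2))
print("coef_ with scaling:   ", scaled.coef_.round(2))
print("tree importances:     ", tree.feature_importances_.round(2))
```

Without scaling, column 1 gets a tiny weight simply because its values are large; after scaling, its weight reflects how much it actually moves the prediction. The tree importances do not depend on the feature scale.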


## Feature importance in Decision Trees

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/select3.png' width=500px >

Decision trees are a standout in the machine learning landscape due to their **intrinsic interpretability**. They provide a visual representation where each decision made by the tree can be **seen and understood**. Each path from the tree's root to a leaf signifies a **specific rule**. For instance, an internal node in the tree making a decision based on the feature "Age < 25" is straightforward to comprehend.

**Feature importance** is another area where decision trees shine. While the tree itself is **transparent**, its underlying mechanism offers valuable insights that can be used to **explain even more complex models**, like **random forests** or **gradient boosting machines**. In decision trees, the **importance** of a feature is often computed based on the **total reduction of a criterion** brought about by that feature, such as the **Gini impurity or entropy** for classification trees, or the variance (mean squared error) for regression trees. A higher value indicates that the feature plays a more pivotal role in making decisions within the tree.

On the other hand, in linear regression models, **coefficients give us a direct measure of a feature's importance in a linear context**. Each coefficient indicates the change in the output for a unit change in that feature, assuming other features remain constant. However, the **simplicity of these coefficients is also their limitation**. They provide insights into linear relationships but **can't capture non-linear interactions between features**.

**That's where decision trees fill the gap**. They, and by extension, **ensemble methods like random forests**, are adept at capturing non-linear relationships. The feature importance metrics from trees shed light on which features are most critical in decision-making, even in intricate, non-linear scenarios. By **juxtaposing the coefficients from linear models with the feature importance derived from decision trees**, one can obtain a nuanced understanding of the dataset. Linear models provide insights into the linear importance of each feature, while decision trees offer a broader perspective, encapsulating both linear and non-linear interactions.

In summary, decision trees are not only easy to interpret on their own, but their ability to rank features based on importance makes them a robust tool for explaining a variety of machine learning models, linear or non-linear.

## Univariate feature selection: Pearson’s correlation coefficient

1. **What's the gist?** We want to pick out the most relevant features for our data analysis. One way to do this is by looking at the Pearson’s correlation coefficient.
2. **How do we use it?** Think of this coefficient as a measure of how related two sets of data are. If a feature has a coefficient close to +1 or -1, it means it's strongly correlated with our target outcome.
3. **But wait, there's noise!** Sometimes, features might not show a clear linear relationship because of noise or non-linear patterns. For instance, a very wiggly feature or one with lots of random spikes might not have a strong correlation.
4. **So how do we pick the best features?** With univariate feature selection, we rank the features based on their importance and pick the top ones. Tools like `SelectKBest` from `scikit-learn` can help with this.
5. **A note on scikit-learn:** Rather than reporting the Pearson’s coefficient itself, we can use a nifty function called `f_regression`. It returns an F-value (a test statistic) together with a p-value, and the p-value tells us how likely it is that a correlation this strong would show up just by chance. The good news? The F-value is a monotonic function of the squared Pearson’s coefficient, $F = \frac{r^2}{1 - r^2}(n - 2)$ for $n$ samples, so ranking features by F-value is the same as ranking them by $|r|$.
6. **Using SelectKBest:** This tool needs two things from us - a way to score features (like our `f_regression` function) and the number of top features we want.
7. **What did we find?** For our data, the top three features (0, 1, and 2) have a straightforward, linear relationship with our target values.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/pearson_select.png' width=700px >

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/peardon2.png' width=500px >

Let's focus on univariate feature selection. This method allows us to assess the importance of individual features based on Pearson's correlation coefficient with the target values.

Notably, features that exhibit a linear or near-linear relationship with the target will have a **high correlation** value. However, this correlation is diminished by factors like non-linearity and noise. This is evident in features like the highly **non-linear 'feature 3'** and the **noisy 'feature 4'**, both of which show almost **negligible correlation**, making them **unlikely choices for selection**.

In the univariate selection process, features are ranked based on their significance, and the best-performing ones are chosen. As an example, we can use the `SelectKBest` class from scikit-learn to **select the top 3 features**. Rather than scoring features with Pearson’s correlation coefficient directly, we use the `f_regression` function, which yields an F-value and a p-value for each feature. **The F-value is the test statistic for the hypothesis that the regression coefficient is zero**: the larger it is, the stronger the evidence that the coefficient differs from zero. The F-value is a monotonic function of the squared Pearson’s correlation coefficient, so both give the same ranking. `SelectKBest` requires two main inputs: a scoring function (in this context, `f_regression`) and the number of features we wish to select.

Upon examination, we find that the leading three features, labeled as 0, 1, and 2, all exhibit a linear correlation with the target values.
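
The selection described above can be sketched as follows. The simulated data are an assumption (five toy features built for this example, not the ones in the figures); `SelectKBest` and `f_regression` are the scikit-learn tools named in the text. The sketch also checks the link between the F-value and Pearson's $r$, namely $F = \frac{r^2}{1 - r^2}(n - 2)$.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])

# Pearson's r for each feature, and the F-values scikit-learn computes.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
F, p_values = f_regression(X, y)
F_from_r = r**2 / (1 - r**2) * (n - 2)   # F-value rebuilt from r

# Keep the three features with the highest F-value.
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
selected = selector.get_support(indices=True)

print("Pearson r:", r.round(2))
print("selected features:", selected)
```

On this toy data the oscillating feature 3 and the pure-noise feature 4 have correlations close to zero, so they are the ones left out.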

## Univariate feature selection: Mutual Information

Ready to find out which features are the best buddies with your target?

1. **What's Mutual Information?** Think of it like a friendship meter. It tells us how much two variables, in our case, feature values and target values, have in common. The more they "know" about each other, the higher the mutual information!
2. **How to pick the best buddies?** We're on the lookout for features that share a lot of secrets (information) with our target. But sometimes, the chatter (noise) can get in the way and lessen their bond.
3. **Unique and Wiggly Relations? No Problem!** The cool thing about mutual information is that it's unfazed by whether the relationship is straight or all over the place. In fact, a super wiggly feature can sometimes share more info with the target than a straight-line feature.
4. **How to Measure Friendship in Code?** `Scikit-learn` is our matchmaker here! It has a tool called `mutual_info_regression` that measures how tight-knit our features and targets are. If we were to pick the top three BFFs using `SelectKBest`, it'd likely be features 1, 2, and 3!
5. **But wait, what about Classification?** Ah, for that, `scikit-learn` has different measuring tapes. The top ones are:
   - `chi2` (chi squared)
   - `f_classif`
   - `mutual_info_classif`


If we apply mutual information, the scenario shifts. Mutual information gauges the shared information between two variables, in this context, between feature values and target values. It becomes evident that while noise diminishes importance, factors like non-linearity or non-uniqueness don't have the same effect.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/mutual_info.png' width=700px >


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/mutual2.png' width=500px >


In fact, the highly non-linear and non-unique 'feature 3' exhibits a stronger mutual information with the targets compared to the nearly linear 'feature 2'.

Scikit-learn provides the `mutual_info_regression` function to compute the mutual information of each feature in relation to the targets. Utilizing `SelectKBest` to cherry-pick the top three features, we'd select features 1, 2, and 3.

It's worth noting that the examples discussed are geared towards regression. For classification tasks, scikit-learn presents scoring functions like chi-squared, `f_classif`, and `mutual_info_classif`.
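
A minimal sketch of mutual-information-based selection is shown below. The toy data are an assumption made for this example. `mutual_info_regression` uses a nearest-neighbour estimate with a small random jitter, so `random_state` is fixed (via `functools.partial`) to keep the scores reproducible.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])

# Estimated mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)

# Use the same score inside SelectKBest to keep the top three features.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=3).fit(X, y)
selected = selector.get_support(indices=True)

print("mutual information:", mi.round(2))
print("selected features:", selected)
```

Here the noisy linear feature 0 loses out, while the strongly non-linear feature 3 is kept because it has little noise.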

## Model based feature selection: Lasso

**What's the Lasso Way?**

1. Lasso is like a talent scout. It zeroes in on the most impactful features by giving them the highest "coefficients" or weights.

2. There's some magic behind the scenes! When Lasso uses the L1 norm penalty, it makes many feature weights zero, resulting in a simpler model. This is called "sparsity."

**Tuning with LassoCV:**

Scikit-learn has a cool class called `LassoCV`. It's like a radio that automatically tunes to the best station! In our case, it uses cross-validation to find the ideal setting for the regularisation hyperparameter 𝜆 (called `alpha` in scikit-learn). And for our little example here, it settled on 0.003.

**Picking the Stars with SelectFromModel:**

Sklearn has this nifty tool called `SelectFromModel`. Think of it as a director casting the leading roles in a movie. It selects features based on how impactful (or high) their coefficients are. And yep, for our movie, features 1 and 2 got the leading roles! They're pretty straightforward and don't come with a lot of drama (noise).

Shifting our focus to model-based feature selection, we already know that the incorporation of the **L1 norm** during training induces model **sparsity**, which in turn **facilitates direct feature selection**. Given that we are dealing with a regression task, we'll employ the Lasso model to ascertain feature importance. In this demonstration, we're using the `LassoCV` object since it autonomously tunes the **alpha hyperparameter**, which, for this instance, settled at 0.003.

For feature selection in this context, we turn to the scikit-learn `SelectFromModel` object. It takes as input a machine learning model that exposes either the `coef_` or the `feature_importances_` attribute once fitted. Additionally, we have the liberty to set a threshold on the coefficients for feature selection. As observed, features 1 and 2 were chosen, both of which exhibit either linear traits or near-linear attributes and are minimally affected by noise.
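
The Lasso-based selection can be sketched like this. The toy data, the 5-fold setting, and the use of `StandardScaler` are assumptions; the tuned `alpha` and the selected features will differ from the values quoted in the text because the data are different.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])

# Scale first so that the coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV tunes alpha by cross-validation; SelectFromModel keeps the
# features whose coefficient magnitude exceeds the threshold (for Lasso
# the default threshold is a tiny value, so non-zero weights are kept).
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X_scaled, y)
lasso = selector.estimator_
selected = selector.get_support(indices=True)

print("tuned alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_.round(3))
print("selected features:", selected)
```

Passing an explicit `threshold` to `SelectFromModel` gives a stricter selection when many small but non-zero coefficients survive.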

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/lasso.png' width=700px >


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/lasso2.png' width=500px >


## Model based feature selection: Random Forest

Let's venture into the world of "Random Forest" for feature selection! With Random Forest guiding us, we're equipped to uncover useful features that linear methods can miss.

**The Random Forest Way:**

1. Random Forest, like a wise old sage, picks features based on their "importance." How do they get this wisdom? It's by calculating how much each feature clears up the uncertainty (or decreases the impurity) in a decision tree.
2. Picture a bunch of trees, each with their own opinions. Random Forest's "feature importance" is the average clarity (or decrease in impurity) each feature brings across all these trees.

**Selecting the Shining Stars:**

Using `SelectFromModel`, we can handpick those standout features. Just like before, but this time, it's based on `feature_importances_`. And for this act, features 1, 2, and 3 stole the show!

**A Special Note on Feature 3:**

Isn't it fascinating? Even though Feature 3 dances to its own non-linear tune and doesn't have a unique bond with the target, Random Forest still recognizes its worth! It's like appreciating a unique dancer in a troupe. Teaming up with Feature 1, they create a magical performance.


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/random.png' width=700px >

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/random_forest2.png' width=500px >

We can use the Random Forest model to help choose important features. The Random Forest model gave a high score to feature 3, even though its relationship with the target is non-linear and non-unique. This shows that the Random Forest model can handle tricky features well. Even if a feature cannot pin down the target on its own, it can still be useful when used with others, like feature 1.

Using `SelectFromModel`, we pick the features based on their importance scores. This time, we pick features 1, 2, and 3.
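
A minimal sketch of importance-based selection with a Random Forest follows. The toy data and the forest settings (`n_estimators=100`) are assumptions, so the importances and the selected set on this data need not match the ones reported in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])

forest = RandomForestRegressor(n_estimators=100, random_state=0)

# For models without an L1 penalty, SelectFromModel's default threshold
# is the mean importance: features above the mean are kept.
selector = SelectFromModel(forest).fit(X, y)
importances = selector.estimator_.feature_importances_
selected = selector.get_support(indices=True)

print("importances:", importances.round(3))
print("selected features:", selected)
```

The importances are normalised to sum to one, so the mean threshold here is 1/5. A lower `threshold` can be passed to keep features that help only in combination with others.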


## Recursive Feature Elimination

Peeling the onion layer by layer, that's what "Recursive Feature Elimination" (RFE) is all about!

**Deep Dive into RFE:**

1. RFE doesn't just pick features; it *ranks* them. How? By starting with all features and taking them away, one by one, from least important to most.
2. Here's the fun part: This isn't a one-time deal. With each feature removed, the model retrains and recalculates the importances. Think of it as a reality show, where contestants are voted off one by one, based on their performance in each episode.

**Steps in RFE's Dance:**

1. First, we fit a ranking model. For our case, we're using Ridge regression.
2. After that, we find the least important feature and bid it goodbye.
3. Rinse and repeat! We keep training the model with the remaining features and keep eliminating the weakest links.

By the end, we get a ranking of all features from the star performers (last ones standing) to the early departures (first ones out).

**Special Tools for RFE:**

If you're a fan of automation, you'd love `RFECV` from sklearn. It's RFE with a sprinkle of cross-validation magic. Instead of manually picking the number of features, `RFECV` finds the optimal number that maximizes model performance.

For our Ridge regression escapade, features 1 and 2 came out on top. Bravo!

Another way to determine important features is by using recursive feature elimination. In this method, we use a machine learning model to rank the features based on their importance. The least important feature is then removed and the model is refitted on the remaining features, so the importances are recalculated. This process continues, with one feature being removed at a time, until we have a ranking of features from most to least important. By doing this, we can decide on the top features we want to keep. If we use tools like `RFECV` from sklearn, it can even automatically choose the best number of features for us. When we tried this with the Ridge regression model, it selected features 1 and 2 as the most important.
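
The procedure can be sketched with `RFECV` and a Ridge model as below. The toy data, `alpha=1.0`, and the 5-fold cross-validation are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])
X_scaled = StandardScaler().fit_transform(X)

# Remove one feature per step (the one with the smallest squared weight),
# refit, and let cross-validation pick how many features to keep.
rfecv = RFECV(estimator=Ridge(alpha=1.0), step=1, cv=5).fit(X_scaled, y)

print("number of features kept:", rfecv.n_features_)
print("kept features:", np.flatnonzero(rfecv.support_))
print("ranking (1 = kept):", rfecv.ranking_)
```

In `ranking_`, every kept feature has rank 1 and larger numbers mark features that were eliminated earlier.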

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/recursive.png' width=700px >


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/recursive2.png' width=500px >





## Feature selection results - summary

Time for a feature selection showdown summary!

**The Stars of the Show:**

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/1.png' width=200px >


- Features 1 and 2 are the headliners, chosen by all the methods. They have a linear or near-linear relationship with the target, which makes them easy to work with, and they keep the noise level down.

**Feature 3, the Underdog:**

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/2.png' width=100px >

- Feature 3 may not have a unique relationship with the target, but it's a versatile player. Mutual information and Random Forest recognized its value. It brings a dose of non-linearity to the party and keeps the noise to a minimum.
- In fact, it even outshone Feature 2 in the eyes of Random Forest and Mutual Information!

**The Noise Maker, Feature 4:**

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/3.png' width=100px >


- Last but not least, Feature 4 got the boot every time. It's like that one noisy neighbor that no one wants to invite to the party. Just full of noise, no real connection to the target.

So, there you have it! Our feature selection methods have spoken, and the winners are Features 1, 2, and 3. They bring the harmony, while Feature 4 got left out in the noise.

## Application: Prediction of age at scan

Let's circle back to our brainy task of predicting age at scan for preterm neonates using volumes of 86 brain structures.

**The Overfitting Conundrum:**

You see, when we initially tried to tackle this with multivariate linear regression, it was like trying to fit a square peg into a round hole - overfitting was lurking around the corner.

**Enter Feature Selection, Our Hero!**

But hold on, can feature selection actually save the day and prevent overfitting?

**Drumroll, Please... Conclusion Time!**

The answer is a resounding YES! Feature selection turned out to be our trusty sidekick in this adventure. It helped us avoid overfitting and led to some impressive results.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/application.png' width=700px >


We tested three feature selection methods:

1. Univariate feature selection using correlation (picking 2 features).
2. Model-based feature selection using Lasso (choosing 6 features).
3. Recursive Feature Elimination (RFE) with a linear regression model (opting for 4 features).

And the winner of this feature selection showdown is... Lasso! It performed the best, closely followed by RFE. These methods worked their magic and gave us results akin to using Ridge Regression, a pretty effective model.

So there you have it! Feature selection not only prevents overfitting but also boosts our performance. It's like finding the perfect-sized puzzle piece for our prediction task.
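
The brain-volume data are not included here, so the sketch below uses a synthetic stand-in from `make_regression` with 86 features, of which only a few are informative (the sample size, noise level, and number of informative features are all assumptions). It compares plain linear regression with a pipeline that first selects features using Lasso. Putting the selection step inside the pipeline means it is refitted on each training fold, so the cross-validated score is not contaminated by the test fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumed): 120 samples, 86 features, 5 informative.
X, y = make_regression(n_samples=120, n_features=86, n_informative=5,
                       noise=30.0, random_state=0)

# All 86 features versus Lasso-selected features, both followed by
# ordinary linear regression.
full = make_pipeline(StandardScaler(), LinearRegression())
with_selection = make_pipeline(
    StandardScaler(), SelectFromModel(LassoCV(cv=5)), LinearRegression()
)

train_r2_full = full.fit(X, y).score(X, y)                   # R2 on the training set
cv_r2_full = cross_val_score(full, X, y, cv=5).mean()        # cross-validated R2
cv_r2_sel = cross_val_score(with_selection, X, y, cv=5).mean()

print(f"all features:   train R2 = {train_r2_full:.2f}, CV R2 = {cv_r2_full:.2f}")
print(f"with selection: CV R2 = {cv_r2_sel:.2f}")
```

A large gap between the training score and the cross-validated score for the full model is the overfitting signature described earlier; the pipeline with selection should narrow it on data like this.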

## Interpretation of feature importances

**Understanding Feature Importances:**

In the realm of machine learning, feature importances provide us with valuable insights into which features carry the most weight when it comes to making predictions. These insights can be enlightening and help us better understand the problem we're trying to solve.

**The Challenge of Correlated Features:**

However, there's a challenge we often face. Imagine you have a group of features that are closely related or correlated. It's like having multiple teammates who are equally skilled. Sometimes, when we use feature selection methods, they might end up choosing just one of these correlated features to represent the entire group. This can be a problem because some of those highly predictive features might not get the recognition they deserve, leading to lower importances for them.

**Comparing Feature Selection Methods:**

Let's take a look at an example where we used three different feature selection methods. Each of these methods selected its own set of top three features. This variety in selection can make it tricky to interpret which features are truly the most important.

So, in a nutshell, feature importances are like guiding lights in the world of data, showing us which features matter most. But when features are closely connected, we might end up highlighting just one, and that can sometimes overshadow other valuable features. It's all part of the data exploration journey!

## Feature interpretation

The instability we encounter here makes it tricky to confidently interpret the significance of individual features. However, if our primary aim is to reduce dimensionality rather than dissect individual features, this instability might not be a big concern.

There are some clever solutions to address this issue:

1. **Stability Selection:** This involves running the feature selection method on various subsets of the data and then calculating statistics to determine how often each feature is selected.

2. **Feature Merging:** Another approach is to identify pairs or clusters of highly correlated features and make a decision. You can either drop the less predictive ones or merge them into a single representative feature.

These strategies help us navigate the instability and make the most out of our feature selection process.
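
As a rough sketch of the stability selection idea, the loop below refits a Lasso on random halves of a toy dataset and counts how often each feature gets a non-zero weight. The toy data, the fixed `alpha=0.01`, the 50 rounds, and the half-sized subsamples are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data (assumed): features 0-4 as functions of the target plus noise.
rng = np.random.default_rng(0)
n = 500
y = rng.uniform(0, 1, n)
X = np.column_stack([
    y + rng.normal(0, 0.30, n),                      # 0: linear, noisy
    y + rng.normal(0, 0.05, n),                      # 1: linear, little noise
    y**2 + rng.normal(0, 0.05, n),                   # 2: mildly curved
    np.cos(4 * np.pi * y) + rng.normal(0, 0.05, n),  # 3: strongly non-linear
    rng.normal(0, 1, n),                             # 4: pure noise
])
X_scaled = StandardScaler().fit_transform(X)

n_rounds = 50
counts = np.zeros(X.shape[1])
for _ in range(n_rounds):
    idx = rng.choice(n, size=n // 2, replace=False)   # random half of the samples
    model = Lasso(alpha=0.01).fit(X_scaled[idx], y[idx])
    counts += model.coef_ != 0                        # which features were selected

frequency = counts / n_rounds   # fraction of rounds in which each feature was selected
print("selection frequency:", frequency.round(2))
```

Features with a frequency close to 1 are selected regardless of which samples are drawn, while features that flip in and out have intermediate values and deserve more cautious interpretation.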

## Comic time

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/comic2.jpg' width=500px >


## Manifold learning

Manifold learning is a technique used in machine learning and data analysis to uncover the underlying structure or geometry of complex, high-dimensional data. In simpler terms, it helps us make sense of data that doesn't adhere to a straightforward, linear pattern.

When we say "manifold," we're referring to a mathematical concept that represents a lower-dimensional, continuous surface within a higher-dimensional space. Manifold learning algorithms aim to find this lower-dimensional representation of the data, where the relationships between data points become more apparent and understandable.

The primary goal of manifold learning is dimensionality reduction while preserving essential information. It's particularly useful when dealing with data that exhibits non-linear relationships or when you want to visualize high-dimensional data in a more manageable form.

Some commonly used manifold learning techniques, in addition to Laplacian Eigenmap mentioned earlier, include Isomap, t-SNE (t-Distributed Stochastic Neighbor Embedding), and LLE (Locally Linear Embedding). These methods help uncover the hidden structure in data, making it easier to analyze and interpret complex datasets.

## Laplacian Eigenmap

In the context of manifold learning, Laplacian Eigenmap is a dimensionality reduction technique used to uncover the underlying structure of complex, non-linear datasets. It's a method that helps transform high-dimensional data into a lower-dimensional space while preserving the essential relationships between data points.

Here's a more detailed explanation:

1. **Graph Representation:** Laplacian Eigenmap starts by representing the data as a graph, where data points are nodes, and edges are drawn between points that are similar to each other in some way. This similarity can be based on distance, correlation, or other measures.

2. **Graph Laplacian:** Next, it computes a mathematical construct known as the "Graph Laplacian." This construct captures how connected or isolated each data point is within the graph. It's a matrix that encodes the relationships between data points.

3. **Eigenvalues and Eigenvectors:** Laplacian Eigenmap then calculates the eigenvalues and corresponding eigenvectors of this Graph Laplacian. These eigenvalues and eigenvectors provide insight into the hidden structure of the data.

4. **Dimension Reduction:** The eigenvalues are ordered from smallest to largest, and the eigenvectors corresponding to the smallest non-zero eigenvalues are used as the coordinates of a new, lower-dimensional embedded space. This new space represents the data in a way that highlights its non-linear structure.

5. **Clustering or Visualization:** Finally, this lower-dimensional space can be used for various purposes, such as clustering similar data points together or visualizing the data in a more interpretable form.

Laplacian Eigenmap is particularly effective when dealing with data that doesn't follow a linear pattern, as it aims to capture the non-linear relationships between data points. It's a valuable tool in manifold learning for understanding and making sense of complex, high-dimensional datasets.
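
Steps 1 to 4 above can be sketched by hand with NumPy and SciPy. The dataset (a swiss roll), the 10-nearest-neighbour graph with binary weights, and the two-dimensional target space are assumptions chosen for illustration, and the sketch assumes the neighbour graph is connected.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

# Example data (assumed): a 3-D swiss roll with 300 points.
X, _ = make_swiss_roll(n_samples=300, random_state=0)

# 1. Graph representation: connect each point to its 10 nearest neighbours,
#    then symmetrise so that the graph is undirected.
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = (0.5 * (W + W.T)).toarray()

# 2. Graph Laplacian: L = D - W, where D is the diagonal degree matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# 3. Eigen-decomposition of the generalised problem L v = lambda D v.
#    eigh returns the eigenvalues in ascending order.
eigvals, eigvecs = eigh(L, D)

# 4. Embedding: skip the first eigenvector (eigenvalue 0, constant on a
#    connected graph) and keep the next two as the new coordinates.
embedding = eigvecs[:, 1:3]
print(embedding.shape)
```

In practice, scikit-learn's `SpectralEmbedding` provides an implementation of Laplacian Eigenmaps, so writing the eigen-decomposition by hand is mainly useful for seeing what each step does.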