# HW04: Problem 3: Feature Selection

## Description

In this problem we will work with the diabetes dataset from sklearn. This dataset is for a regression problem where 10 features are used to predict the progression of diabetes. The dataset is described in more detail [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). Your task here is to use multiple feature selection techniques to try to interpret the strength of the features in the dataset. You will need to use the following techniques:

- Pearson correlation coefficient using r_regression from sklearn (univariate feature selection)
- Mutual information using mutual_info_regression from sklearn (univariate feature selection)
- Random forest feature importance using RandomForestRegressor from sklearn (multivariate feature selection)
- Recursive feature elimination using sklearn.feature_selection.RFE with a Support Vector Regressor SVR (multivariate feature selection)

For each method you will need to plot the feature importance as a bar graph. The importance goes by a different name in each algorithm: in r_regression it is the returned r value, in mutual_info_regression it is the returned mutual information, in the random forest it is the feature_importances_ attribute, and in RFE it is the ranking_ attribute (where rank 1 marks the selected features). The bar graph will be sorted from most important feature to least important feature, with the y value being the importance of that feature and the x value being the rank, labeled with the feature name.
You will also need to print out the top 5 features for each method. Use the code below to load the data and split it into training and testing sets, and use only the training set for all of the feature selection methods.

* Are there 3 features that are selected in the top 5 by all 4 methods? 
* If so, what are they? 
* If not, what are the 3 features that are selected by the most methods? 
* How would it be possible that univariate methods might select different features than multivariate methods? 
* How does dependence between features affect the feature selection methods?

For good habits, make sure you split your data into training and testing sets. You may not even use the testing data, but whenever you do any analysis such as feature selection, remember that you must not use the testing data. You should also use the same random seed for all of your feature selection methods so that you can compare the results.
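As a minimal sketch of that habit, the snippet below splits the data once and passes one seed to a method that has a random component. The name `SEED` and its value are assumptions made for illustration; `mutual_info_regression` accepts a `random_state` argument, so two calls with the same seed should agree.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import train_test_split

SEED = 1  # hypothetical shared seed; the value itself is arbitrary

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

# Feature scores are computed from the training rows only;
# X_test and y_test are never passed to a selection method.
mi_a = mutual_info_regression(X_train, y_train, random_state=SEED)
mi_b = mutual_info_regression(X_train, y_train, random_state=SEED)

# With a fixed seed the two runs give the same scores
same = np.allclose(mi_a, mi_b)
print(X_train.shape, same)
```

The same `SEED` could also be passed as `random_state` to `RandomForestRegressor`, which would make the random forest ranking repeatable between runs.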

## Hints

To sort features, use an "arg" sort (np.argsort). This returns the indices that would sort the array, and you can use these indices to sort the feature names.

This kind of code will be useful for plotting the bar graph:

```python
r_inds = np.argsort(np.abs(r_importance))[::-1]
fig, ax = plt.subplots()
rank = np.arange(len(data.feature_names))
ax.bar(rank, r_importance[r_inds])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(data.feature_names)[r_inds])
```
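The same idea can be wrapped in a small helper that returns the top-k names with their signed scores. This is only a sketch: the function name `rank_features` and the toy scores are made up for illustration and are not part of the assignment.

```python
import numpy as np


def rank_features(importance, feature_names, k=5):
    """Return the k (name, score) pairs with the largest absolute score."""
    importance = np.asarray(importance, dtype=float)
    names = np.asarray(feature_names)
    # argsort is ascending, so reverse it to put the largest |score| first
    order = np.argsort(np.abs(importance))[::-1]
    top = order[:k]
    return [(str(n), float(s)) for n, s in zip(names[top], importance[top])]


# Made-up scores, for illustration only
scores = [0.1, -0.7, 0.4]
names = ["a", "b", "c"]
top2 = rank_features(scores, names, k=2)
print(top2)
```

Sorting on the absolute value while returning the signed score keeps strongly negative features near the top without hiding the direction of the relationship.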

In [104]:
# Some imports you will need
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from numpy import corrcoef
import seaborn as sns
from sklearn.feature_selection import r_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

## Loading and preparing the data

In [85]:
# Load the diabetes data set as X, y
X, y = load_diabetes(return_X_y=True)
# Load the diabetes data set as data to read the description
data = load_diabetes()

In [86]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [87]:
# Print out the DESCR attribute to inspect the variables
data.DESCR

'.. _diabetes_dataset:\n\nDiabetes dataset\n----------------\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\n**Data Set Characteristics:**\n\n:Number of Instances: 442\n\n:Number of Attributes: First 10 columns are numeric predictive values\n\n:Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n:Attribute Information:\n    - age     age in years\n    - sex\n    - bmi     body mass index\n    - bp      average blood pressure\n    - s1      tc, total serum cholesterol\n    - s2      ldl, low-density lipoproteins\n    - s3      hdl, high-density lipoproteins\n    - s4      tch, total cholesterol / HDL\n    - s5      ltg, possibly log of serum triglycerides level\n    - s6      glu, blood sugar level\n\nNote: Each of thes

In [88]:
# Print the array of feature names
data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [94]:
# Split the Data into train/testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.25)

In [95]:
# Check the shapes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(331, 10)
(331,)
(111, 10)
(111,)


In [89]:
# convert into DF to plot it
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target

In [9]:
df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [7]:
# Do a pair plot 
sns.pairplot(df)

Q: What does the EDA tell you about the data?

A: The first thing that stands out is that features 's1' (total serum cholesterol) and 's2' (low-density lipoproteins) have a strong positive correlation. Additionally, feature 's4' (total cholesterol / HDL) shows an unusual clustering pattern and is negatively correlated with 's3' (high-density lipoproteins), which is plausible given that HDL is the denominator of 's4'. The remaining feature pairs show much weaker relationships in the pair plot.

## Univariate feature selection with r_regression

In [None]:
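# Use r_regression to get the feature importance, sort by the absolute value
# Computed here on the full data set (X, y); swap in X_train, y_train to keep
# the test rows out of feature selection
reg = r_regression(X, y)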
r_inds = np.argsort(np.abs(reg))[::-1]

# but show the signed value on y and label on x by variable name
# Should be a bar graph
fig, ax = plt.subplots()
rank = np.arange(len(data.feature_names))
ax.bar(rank, reg[r_inds])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(data.feature_names)[r_inds])

In [53]:
# Print the top 5 features according to r_regression
for i in range(5):
    feature_index = r_inds[i]
    feature_name = data.feature_names[feature_index]
    feature_importance = reg[feature_index]
    print(f"{feature_name}: {feature_importance}")

bmi: 0.5864501344746885
s5: 0.5658825924427444
bp: 0.44148175856257105
s4: 0.4304528847447733
s3: -0.39478925067091875
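For reference, the score that r_regression reports for each feature is Pearson's r between that feature and the target. The sketch below computes it by hand on synthetic data and checks it against `np.corrcoef`; the helper name `pearson_r` and the generated data are assumptions made for illustration, not the diabetes data.

```python
import numpy as np


def pearson_r(X, y):
    """Pearson correlation between each column of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    yc = y - y.mean()        # center the target
    # covariance term divided by the product of the two norms
    return (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())


# Synthetic data: feature 0 helps, feature 1 hurts, feature 2 is noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200)

r_demo = pearson_r(X_demo, y_demo)
r_check = np.array(
    [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])]
)
print(r_demo)
```

Because r is signed, a feature such as 's3' can rank in the top 5 by absolute value while still showing a negative bar in the plot.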


## Univariate feature selection with mutual information using mutual_info_regression

In [None]:
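# Use mutual_info_regression to get the feature importance, sort by the absolute value
# Computed here on the full data set (X, y); swap in X_train, y_train to keep
# the test rows out of feature selection
mut = mutual_info_regression(X,y)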
m_inds = np.argsort(abs(mut))[::-1]

# but show the signed value on y and label on x by variable name
# Should be a bar graph
fig, ax = plt.subplots()
rank = np.arange(len(data.feature_names))
ax.bar(rank, mut[m_inds])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(data.feature_names)[m_inds])

In [55]:
# What are the top 5 features according to mutual_info_regression?
for i in range(5):
    feature_index = m_inds[i]
    feature_name = data.feature_names[feature_index]
    feature_importance = mut[feature_index]
    print(f"{feature_name}: {feature_importance}")

bmi: 0.1738929602380801
s5: 0.14720030221034452
s6: 0.09382337018867837
s4: 0.08886620231126585
s3: 0.06812562637580832


## Multivariate feature selection with Random Forest feature_importances_

In [None]:
# Use random forest feature_importances_ to get the feature importance, sort by the absolute value
model = RandomForestRegressor()
model.fit(X_train, y_train)
f_imp = model.feature_importances_
f_imp_sort = np.argsort(abs(f_imp))[::-1]

# but show the signed value on y and label on x by variable name
# Should be a bar graph
fig, ax = plt.subplots()
rank = np.arange(len(data.feature_names))
ax.bar(rank, f_imp[f_imp_sort])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(data.feature_names)[f_imp_sort])

In [99]:
# What are the top 5 features according to random forest feature_importances_?
for i in range(5):
    feature_index = f_imp_sort[i]
    feature_name = data.feature_names[feature_index]
    feature_importance = f_imp[feature_index]
    print(f"{feature_name}: {feature_importance}")

bmi: 0.38156187079306575
s5: 0.21876907497850503
bp: 0.10642955527230183
s6: 0.06376345502233687
age: 0.05665432119508048
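The importances above are all positive fractions, which is how impurity-based random forest importances behave: they are non-negative and normalized to sum to 1. The sketch below illustrates this on synthetic data and also shows that fixing `random_state` makes the result repeatable. The data from `make_regression` and the chosen settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with 2 informative features out of 6
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=2, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)
importances = forest.feature_importances_
total = importances.sum()

# Refitting with the same seed reproduces the same importances
forest_again = RandomForestRegressor(n_estimators=50, random_state=0)
forest_again.fit(X_demo, y_demo)
repeatable = np.allclose(importances, forest_again.feature_importances_)
print(total, repeatable)
```

Without a seed, features with similar importance (such as the lower entries in the list above) may swap places from run to run.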


## Multivariate feature selection with recursive feature elimination (RFE) using a support vector regressor

In [None]:
# Use recursive feature elimination (RFE) with a support vector regressor
svr = SVR(kernel='linear')
rfe = RFE(estimator = svr, n_features_to_select=5)
rfe.fit(X_train, y_train)

In [None]:
# ranking_ assigns 1 to every selected feature and larger numbers to features
# that were eliminated earlier, so sort ascending to put the most important first
rfe_rank = rfe.ranking_
rfe_rank_sort = np.argsort(rfe_rank, kind="stable")

# show the rank on y and label on x by variable name
# Should be a bar graph
fig, ax = plt.subplots()
rank = np.arange(len(data.feature_names))
ax.bar(rank, rfe_rank[rfe_rank_sort])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(data.feature_names)[rfe_rank_sort])

In [114]:
# What are the top 5 features according to RFE with SVR?
for i in range(5):
    feature_index = rfe_rank_sort[i]
    feature_name = data.feature_names[feature_index]
    feature_importance = rfe_rank[feature_index]
    print(f"{feature_name}: {feature_importance}")

bmi: 1
bp: 1
s3: 1
s4: 1
s5: 1
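When reading RFE output, keep in mind that `ranking_` is a rank rather than a score: every selected feature gets rank 1, and larger numbers mark features that were eliminated earlier. The sketch below shows this on synthetic data, alongside the boolean `support_` mask; the data and the choice of 3 features to keep are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic regression problem with 8 features
X_demo, y_demo = make_regression(
    n_samples=150, n_features=8, n_informative=3, noise=5.0, random_state=0
)

selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=3)
selector.fit(X_demo, y_demo)

ranking = selector.ranking_  # 1 = selected, larger = eliminated earlier
# Ascending sort puts the selected (rank 1) features first
order = np.argsort(ranking, kind="stable")
selected = np.flatnonzero(selector.support_)  # indices of the kept features
print(ranking, order[:3], selected)
```

Sorting `ranking_` in descending order would list the first features to be eliminated, which is the opposite of the most important ones.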


## Conclusions

Q1: Are there 3 features that are selected in the top 5 by all 4 methods?

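A1: No. Only two features, 'bmi' and 's5', appear in the top 5 of all four methods. (For RFE, the top 5 are the features with ranking_ equal to 1: 'bmi', 'bp', 's3', 's4', and 's5'. Larger ranking_ values mark features that were eliminated.)

Q2: If so, what are they? / If not, what are the 3 features that are selected by the most methods? 

A2: 'bmi' and 's5' were selected by all four methods. Third place is a three-way tie: 'bp' (r_regression, random forest, RFE), 's3' (r_regression, mutual information, RFE), and 's4' (r_regression, mutual information, RFE) were each selected by three methods.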
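One way to answer this question without counting by eye is to tally how often each feature appears across the four top-5 lists. The sketch below uses the lists from the recorded outputs above; the RFE entry assumes the top 5 are the features with `ranking_` equal to 1 in that run, and the dictionary keys are just labels chosen for illustration.

```python
from collections import Counter

top5 = {
    "r_regression": ["bmi", "s5", "bp", "s4", "s3"],
    "mutual_info": ["bmi", "s5", "s6", "s4", "s3"],
    "random_forest": ["bmi", "s5", "bp", "s6", "age"],
    # Assumption: features with ranking_ == 1 in the recorded RFE run
    "rfe_svr": ["bmi", "bp", "s3", "s4", "s5"],
}

# Count in how many top-5 lists each feature appears
counts = Counter(name for names in top5.values() for name in names)

# Features that every method placed in its top 5
in_all = sorted(n for n, c in counts.items() if c == len(top5))
print(in_all)
print(counts.most_common())
```

Since the mutual information and random forest runs were not seeded, the lower entries of those two lists may differ on a rerun, which could change the counts for features near the cutoff.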

Q3: How would it be possible that univariate methods might select different features than multivariate methods?

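A3: Univariate methods score each feature against the target on its own, so they cannot see how features relate to each other. A feature that is redundant with a stronger one can still score highly, and a feature that is only useful in combination with others can score poorly. Multivariate methods evaluate the features jointly, so they can rank these cases differently.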

Q4: How does dependence between features affect the feature selection methods?

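A4: If features depend on each other, univariate feature selection methods cannot take that into account: they may keep several correlated features that carry the same information, or miss a feature whose value only shows up jointly with others. In this case, it is probably best to use a multivariate feature selection method, which can take these relationships into account. Dependence still affects multivariate methods, though: a random forest tends to split importance among correlated features, and the coefficients of the linear SVR used inside RFE can become unstable, so the ranking within a correlated group (such as 's1' and 's2' here) should be read with caution.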
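A small synthetic example can make this concrete. In the sketch below, two features are almost identical and the target is their difference, so each feature looks nearly useless on its own while the pair predicts the target almost perfectly. All names and data here are made up for illustration and have nothing to do with the diabetes data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)             # shared signal
e = rng.normal(scale=0.1, size=n)  # small independent component

x1 = z + e   # two strongly dependent features
x2 = z
y = x1 - x2  # the target only depends on their difference

# Univariate view: each feature alone is only weakly correlated with y
r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]

# Multivariate view: a joint least-squares fit on both features
A = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
r_squared = 1.0 - resid.var() / y.var()

print(r1, r2, r_squared)
```

A univariate filter would likely discard both features here, while any method that looks at them together would keep both.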