# Feature and Model Selection in Multiple Linear Regression

Suppose we have a dataset of the form $(X, y)$, where $X$ is a large collection of $p$ features $X_1, X_2, \cdots, X_p$. Consider the task of predicting $y$ using $X$. In practical scenarios, not all of these features contribute to the model. So the question is: how do we select the subset that is most significant for our purpose?

Formally, assume we want to check whether a particular subset of $q$ features $X_{p-q+1}, X_{p-q+2}, \cdots, X_p$ contributes to explaining the variability in the target variable $y$. We can fit the multiple regression and test whether the coefficients of this subset are all zero:

$$
H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0
$$

To test this hypothesis, we fit a reduced model that excludes these $q$ features (i.e., uses only $X_1, \cdots, X_{p-q}$) and compute the F-statistic, which compares the fit of the reduced model to that of the full model (which includes all features). If the F-statistic indicates a significant difference, the removed predictors contribute meaningfully to the model, and we can reject the null hypothesis.
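As a hedged illustration of this comparison, one common form of the statistic is

$$
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)},
$$

where $\mathrm{RSS}_0$ and $\mathrm{RSS}$ are the residual sums of squares of the reduced and full models, and $n$ is the number of observations (this notation is an assumption, it is not defined in the text above). The sketch below computes it with NumPy on synthetic data. The helper names `rss` and `partial_f_statistic` are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def partial_f_statistic(X, y, q):
    """F-statistic for H0: the last q columns of X have zero coefficients."""
    n, p = X.shape
    rss_full = rss(X, y)                   # full model: all p features
    rss_reduced = rss(X[:, : p - q], y)    # reduced model: first p - q features
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))

# Synthetic data (illustrative only): y depends on the first two of four features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Test the two irrelevant features (the last two columns).
f_irrelevant = partial_f_statistic(X, y, q=2)
# Reverse the column order so the two relevant features are tested instead.
f_relevant = partial_f_statistic(X[:, ::-1], y, q=2)

print(f"F for the irrelevant pair: {f_irrelevant:.2f}")
print(f"F for the relevant pair:   {f_relevant:.2f}")
```

A large F-statistic for the relevant pair and a small one for the irrelevant pair is the pattern we would expect. To turn the statistic into a p-value, it would be compared against an F distribution with $q$ and $n - p - 1$ degrees of freedom.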

We can use this idea to search for the most important subset. The basic approach is to consider all subsets of features, fit a regression model for each one of them, and choose the best one based on criteria like training error (e.g., residual sum of squares) and model complexity (number of predictors). This is called __all subsets__ or __best subsets regression__. The drawback is that the number of subsets, $2^p$ (or $2^p - 1$ if we exclude the empty subset), grows exponentially with $p$. For example, if $p = 40$, there are over a trillion subsets ($2^{40} \approx 1.1 \times 10^{12}$).
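To make the growth concrete, here is a small sketch that enumerates the subsets for a toy feature list and prints the count for $p = 40$. The feature names and the helper `all_subsets` are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of the given feature names."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)

features = ["X1", "X2", "X3"]
subsets = list(all_subsets(features))
print(subsets)
print(len(subsets))  # 2**3 - 1 non-empty subsets

# Number of subsets (including the empty one) when p = 40
print(2 ** 40)
```

Enumerating subsets this way is feasible for a handful of features, but each additional feature doubles the number of models to fit.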


We discuss two ways to overcome this issue.


### Forward Selection

- Begin with the null model, that is, a model that contains an intercept but no predictors.

- Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS.

- Add to that model the variable that results in the lowest RSS for the new two-variable model.

- Continue until some stopping rule is satisfied. For example, stop when all remaining variables have a p-value above a threshold.
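The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the helper names `rss` and `forward_selection` are hypothetical, and the stopping rule used here (stop after `k` features) is an assumption swapped in for the p-value rule to keep the example short.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_selection(X, y, k):
    """Greedily add the feature that gives the lowest RSS until k are chosen."""
    selected = []                            # start from the null model
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        # Try each remaining feature alongside the ones already selected.
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (illustrative only): only columns 2 and 5 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 5] + rng.normal(size=300)

selected = forward_selection(X, y, k=2)
print("Selected columns:", selected)
```

Selecting `k` features this way fits $p + (p-1) + \cdots + (p-k+1)$ models, far fewer than the $2^p$ required by best subsets regression.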

### Backward Selection

- Start with all variables in the model.

- Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.

- The new $(p - 1)$-variable model is fit, and the variable with the largest p-value is removed.

- This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
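A minimal sketch of this procedure follows, again on synthetic data. Within a single fitted model, the variable with the largest p-value is the one with the smallest absolute t-statistic, so the sketch ranks variables by $|t|$ to avoid an extra dependency. The helper names and the threshold $|t| \ge 2$ (roughly a p-value below 0.05) are assumptions made for illustration.

```python
import numpy as np

def t_statistics(X, y):
    """t-statistics of the slope coefficients in an OLS fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p - 1)         # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)        # covariance of the estimates
    return beta[1:] / np.sqrt(np.diag(cov)[1:])  # skip the intercept

def backward_selection(X, y, t_threshold=2.0):
    """Drop the least significant feature until every |t| reaches the threshold."""
    kept = list(range(X.shape[1]))               # start with all variables
    while kept:
        t = np.abs(t_statistics(X[:, kept], y))
        worst = int(np.argmin(t))                # smallest |t| = largest p-value
        if t[worst] >= t_threshold:
            break                                # stopping rule satisfied
        kept.pop(worst)
    return kept

# Synthetic data (illustrative only): only columns 2 and 5 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 5] + rng.normal(size=300)

kept = backward_selection(X, y)
print("Kept columns:", kept)
```

Note that backward selection needs more observations than features so that the full model can be fit, whereas forward selection can start even when $p$ is large.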

__Exercise__ Let's look at the credit data set and use these techniques to select the most important features.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


In [None]:
# Load the dataset
url = "https://raw.githubusercontent.com/hardikkamboj/An-Introduction-to-Statistical-Learning/refs/heads/master/data/Credit.csv"
data = pd.read_csv(url)
data.head()

Unnamed: 0.1,Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331


We don't need the unnamed column. So let's remove it.

In [3]:
# Remove the unnamed index column
data = data.drop(columns=["Unnamed: 0"])

How should we handle the categorical data? Some options: remove them or use one-hot encoding.

In [None]:
# Convert categorical variables to dummy variables (one-hot encoding)
data = pd.get_dummies(data, drop_first=True)  # drop_first avoids redundant dummy columns


In [None]:
# Define predictors (features) and target (Balance)
X, y = data.drop(columns=["Balance"]), data["Balance"]


In [None]:
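# Split the dataset into training and test sets (test_size and random_state are assumed values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)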


In [None]:
# Define the linear regression model
model = LinearRegression()

Okay, now we use Sequential Feature Selection (SFS) for simplicity, which is a family of feature selectors. Here we use it with forward selection.

In [None]:
# Initialize the Sequential Feature Selector with forward selection
sfs = SFS(model,
          k_features=5,  # Number of features to select
          forward=True,  # Forward selection
          floating=False,  # Non-floating (standard forward selection)
          scoring='r2',  # R² as the performance metric
          cv=5)  # 5-fold cross-validation

# Fit SFS on the training data
sfs = sfs.fit(X_train, y_train)

In [None]:
# Get the selected features
selected_features = list(sfs.k_feature_names_)
print("Selected Features:", selected_features)

Selected Features: ['Income', 'Limit', 'Rating', 'Cards', 'Student_Yes']


In [None]:
# Fit the model using only the selected features
X_train_selected = X_train[selected_features]
model.fit(X_train_selected, y_train)

# Evaluate on the test set
X_test_selected = X_test[selected_features]
r2_score = model.score(X_test_selected, y_test)
print(f"R^2 score with selected features: {r2_score}")

R^2 score with selected features: 0.9510575511309474


Does the result make sense to you?

Okay, now use Sequential Feature Selection (SFS) with backward selection.