## $\color{red}{\text{Lecture Overview}}$
1. **Feature selection**
2. **Feature selection techniques in regression**
3. **Model diagnostics**

## $\color{red}{\text{Feature Selection}}$

1. The process of selecting a subset of relevant variables for modeling
2. Feature selection techniques operate under the principle of **parsimony**
  - The simpler the model, the better - use fewer variables
  - Fewer variables mean decreased computational time

## $\color{red}{\text{Import Required Packages}}$

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## $\color{red}{\text{Import Data}}$

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/DS4510/Data
housingData= pd.read_excel('housingData.xlsx', sheet_name='housingData')

/content/drive/MyDrive/DS4510/Data


## $\color{red}{\text{Analytic Task}}$
1. Using the housing data, build a **multiple linear regression model** to predict **price**
2. Use feature selection techniques
  - Forward regression
  - Backward regression
  - Stepwise regression
3. Model diagnostics to select the best model

## $\color{red}{\text{Data Preparation}}$
1. Exclude variables that are not used in the analysis
2. Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

current_year = pd.Timestamp.now().year
housingData['building_age'] = current_year - housingData['yr_built']

#Drop the raw year columns, identifiers, date, and zipcode
drop_cols = ['yr_built', 'yr_renovated', 'id', 'date', 'zipcode']
new_housingData = housingData.drop(columns=drop_cols)

#Identifying the dependent and independent variables
dep_var = new_housingData['price']
indep_var = new_housingData.drop(columns='price')

#Partitioning data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(indep_var, dep_var, test_size=0.2, random_state=25)

## $\color{red}{\text{Model Building}}$

### $\color{blue}{\text{Forward Selection}}$
- Starts with no predictors, only an intercept
- **Step 1**: Compute p-values for each predictor in a univariate regression (one at a time).
- **Step 2**: Add the predictor with the **lowest p-value**
- **Step 3**: Refit the model and test each remaining predictor again, given the predictors already included
- Repeat until no remaining predictor significantly improves the model
- Efficient at finding relevant predictors, but because it never removes a variable once added, it can retain predictors that become redundant after later additions

- Example of selection order:
Start → Add X3 (lowest p-value) → Add X1 → Stop when no remaining predictor has p < 0.05
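
The p-value procedure described above can be sketched directly with NumPy. This is only an illustrative sketch on synthetic data, not the notebook's method: the helper names `ols_pvalues` and `forward_select`, the 0.05 threshold, and the synthetic predictors are all assumptions, and the p-values use a normal approximation to the t distribution. Note that the mlxtend code used in this notebook scores candidate features by cross-validated R² rather than by p-values.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for each slope of an OLS fit with intercept.

    Assumption: normal approximation to the t distribution, which is
    reasonable only when n is much larger than the number of predictors.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])       # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    z = beta / se
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return pvals[1:]                                # drop the intercept

def forward_select(X, y, names, alpha=0.05):
    """Add the lowest-p-value predictor until none is below alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {j: ols_pvalues(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                                   # nothing significant left
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected]

# Hypothetical data: only X1 and X3 actually drive y
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)
names = ["X1", "X2", "X3", "X4"]

chosen = forward_select(X, y, names)
print("Selected (in order of entry):", chosen)
```

The returned list preserves the order in which predictors entered, so the strongest predictor appears first.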

In [None]:
#Forward selection starts from the intercept-only model (y = b)
#and adds one variable at a time (y = b + m1*x1 + ...)

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

#Regression model
model = LinearRegression()

#Note: with scoring='r2' and cv=5, candidates are ranked by cross-validated R^2, not by p-values
forward_sfs = SFS(model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)
forward_sfs = forward_sfs.fit(X_train, y_train)

#Get the best variables from forward regression
forward_sfs.k_feature_names_

#Function for forward regression
def forward_regression(reg_model, indep_var, dep_var):
  sfs = SFS(reg_model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)
  forward_reg = sfs.fit(indep_var, dep_var)

  best_features = list(forward_reg.k_feature_names_)
  print('Best features:', best_features)
  return forward_reg  #fitted selector; feature names are in .k_feature_names_

forward_regression(model, X_train, y_train)


Best features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'waterfront', 'view', 'condition', 'grade', 'sqft_basement', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'building_age']


### $\color{blue}{\text{Backward Elimination}}$

- Start with all predictors included in the model
- **Step 1**: Compute p-values for all predictors in the full model
- **Step 2**: Remove the predictor with the highest p-value
- **Step 3**: Refit model and recalculate p-values.
- Repeat until all remaining predictors are significant (p < threshold).

- Example Removal Order:

Start with X1, X2, X3, X4 → Remove X2 (highest p) → Remove X4 → Stop when all p-values are significant.
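
The removal loop above can be sketched in the same style. This is a minimal sketch on synthetic data, assuming a 0.05 threshold and a normal approximation for the p-values; the helper names `ols_pvalues` and `backward_eliminate` are hypothetical and are not part of the notebook or of any library.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided p-values for each slope of an OLS fit with intercept.

    Assumption: normal approximation to the t distribution (large n).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])       # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    z = beta / se
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return pvals[1:]                                # drop the intercept

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the highest-p-value predictor until all are below alpha."""
    kept = list(range(X.shape[1]))                  # start with the full model
    while kept:
        pvals = ols_pvalues(X[:, kept], y)          # refit on current predictors
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break                                   # everything left is significant
        kept.pop(worst)                             # remove the weakest predictor
    return [names[j] for j in kept]

# Hypothetical data: only X1 and X3 actually drive y
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)
names = ["X1", "X2", "X3", "X4"]

kept_names = backward_eliminate(X, y, names)
print("Remaining predictors:", kept_names)
```

Unlike forward selection, every fit here includes all surviving predictors, so each p-value is already adjusted for the others.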

In [None]:
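#Function for backward regression: start from all variables and remove one at a time
def backward_regression(reg_model, indep_var, dep_var):
  sfs = SFS(reg_model, k_features='best', forward=False, floating=False, scoring='r2', cv=5)
  backward_reg = sfs.fit(indep_var, dep_var)

  best_features = list(backward_reg.k_feature_names_)
  print('Best features:', best_features)
  return backward_reg  #fitted selector; feature names are in .k_feature_names_

#Function combining the two searches
#Note: this keeps the variables chosen by BOTH searches, which approximates
#stepwise selection rather than running a true add/remove procedure
def stepwise_regression(reg_model, indep_var, dep_var):

  #Forward and backward regression models
  forward_vars = forward_regression(reg_model, indep_var, dep_var).k_feature_names_
  backward_vars = backward_regression(reg_model, indep_var, dep_var).k_feature_names_

  #Get the features common to both models
  stepwise_vars = set(forward_vars).intersection(backward_vars)
  print('Stepwise variables:', stepwise_vars)
  return stepwise_vars

#Variables selected by both searches (used in the diagnostics below)
stepwise_vars = stepwise_regression(model, X_train, y_train)
stepwise_vars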

### $\color{blue}{\text{Stepwise Regression}}$

1. Starts with no predictors, **like forward selection**
2. **Step 1**: Add the predictor with the lowest p-value
3. **Step 2**: After adding a new predictor, check all included variables and remove any with a high p-value
4. **Step 3**: Continue adding/removing predictors until no further changes improve the model
5. Balances simplicity and accuracy, allowing variables to be removed if they become insignificant after adding others.
6. More flexible than forward or backward methods but prone to overfitting if too many variables are considered.
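
The add-then-check loop in the steps above could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names, the entry threshold `alpha_in=0.05`, the removal threshold `alpha_out=0.10`, the normal approximation for p-values, and the synthetic data, in which `X4` is a noisy proxy for `X1 + X3` that may enter early and be dropped once the real predictors are in the model.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Two-sided slope p-values (normal approximation, assumes large n)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    z = beta / se
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return pvals[1:]                                # drop the intercept

def stepwise_select(X, y, names, alpha_in=0.05, alpha_out=0.10):
    """Alternate a forward (add) step with a backward (remove) check."""
    selected = []
    for _ in range(10 * X.shape[1]):                # guard against cycling
        changed = False
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if remaining:                               # forward step
            pvals = {j: ols_pvalues(X[:, selected + [j]], y)[-1] for j in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                selected.append(best)
                changed = True
        if selected:                                # backward check
            pvals = ols_pvalues(X[:, selected], y)
            worst = int(np.argmax(pvals))
            if pvals[worst] > alpha_out:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return [names[j] for j in selected]

# Hypothetical data: y depends on X1 and X3; X4 is a noisy proxy for X1 + X3
rng = np.random.default_rng(1)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x1 + x3 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])
y = 2.0 * x1 + 2.0 * x3 + rng.normal(size=n)
names = ["X1", "X2", "X3", "X4"]

chosen = stepwise_select(X, y, names)
print("Stepwise selection:", chosen)
```

Setting the removal threshold above the entry threshold means a variable that has just been added cannot be removed in the same pass, which helps the loop settle.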


## $\color{red}{\text{Model Diagnostics}}$

In [None]:
from scipy import stats
#Get data containing important variables
selected_cols = sorted(stepwise_vars)
new_trainx = X_train[selected_cols]
new_testx = X_test[selected_cols]

#Use the important variables from stepwise regression to fit the model
step_model = model.fit(new_trainx, y_train)
#Get predictions from the model
test_pred = step_model.predict(new_testx)

#Residuals = observed - predicted
test_residuals = y_test - test_pred

#Shapiro-Wilk test for normality of the residuals
#(H0: residuals are normally distributed; a small p-value rejects normality)
test_stat, test_pval = stats.shapiro(test_residuals)
print('Test statistic:', test_stat)
print('p-value:', test_pval)
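
To see how the Shapiro-Wilk output is read, the sketch below runs the same `stats.shapiro` call on two hypothetical residual samples, one drawn from a normal distribution and one deliberately skewed. The samples and the 0.05 cutoff are assumptions for illustration only. The SciPy documentation also notes that for samples larger than 5000 the p-value may not be accurate, which is worth keeping in mind for large test sets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical residuals: one roughly normal, one right-skewed (centered exponential)
normal_resid = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_resid = rng.exponential(scale=1.0, size=500) - 1.0

for label, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    stat, pval = stats.shapiro(resid)
    # H0: the sample comes from a normal distribution
    verdict = "normality rejected" if pval < 0.05 else "no evidence against normality"
    print(f"{label}: W = {stat:.3f}, p = {pval:.3g} -> {verdict}")
```

A W statistic close to 1 with a large p-value is consistent with normal residuals, while a small p-value indicates a departure from normality.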

In [None]:
#Residuals with fitted plot
plt.figure(figsize=(10,6))
plt.scatter(test_pred, test_residuals, color='red', label='Test Residuals')
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()
