
# Exercise 1

Use each type of feature selection method (filter, wrapper, and embedded) to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [69]:
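from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

In [70]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)

# Assumption: the cell that created X_train/X_test is not shown in the source,
# so an 80/20 split with the seed used later in this notebook is assumed here.
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)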

1. Use any filter method to select the best features

In [71]:
from sklearn.feature_selection import SelectKBest, f_regression

# Number of features to keep
threshold = 4

# X_train and y_train are assumed to come from an earlier train/test split.
# Score each feature with a univariate F-test against the target and keep the top `threshold`.
skb = SelectKBest(score_func=f_regression, k=threshold)
sel_skb = skb.fit(X_train, y_train)
sel_skb_index = sel_skb.get_support()  # boolean mask over the columns of X_train
df_housing_skb = X_train.iloc[:, sel_skb_index]

print('p_values:', sel_skb.pvalues_)
print('Selected features:', df_housing_skb.columns)

p_values: [0.00000000e+00 1.01919784e-40 2.47317844e-93 4.04532102e-11
 8.21491618e-04 4.64088908e-03 3.82179752e-76 2.54505126e-09]
Selected features: Index(['MedInc', 'HouseAge', 'AveRooms', 'Latitude'], dtype='object')
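
The raw `p_values` array above is hard to read because it is not labelled with column names. A minimal sketch of one way to pair each score with its feature is shown below. It runs on synthetic stand-in data (the frame `X_demo`, target `y_demo`, and column names `a` to `d` are hypothetical, not the housing data), so the numbers it prints say nothing about the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for X_train / y_train (hypothetical): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] + 2.0 * X_demo["b"] + rng.normal(scale=0.5, size=500)

skb_demo = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Label every F-score and p-value with its column name, highest F-score first.
filter_report = pd.DataFrame(
    {"f_score": skb_demo.scores_, "p_value": skb_demo.pvalues_},
    index=X_demo.columns,
).sort_values("f_score", ascending=False)

selected_demo = list(X_demo.columns[skb_demo.get_support()])
print(filter_report)
print("Selected features:", selected_demo)
```

The same pattern should carry over to the notebook by swapping in `X_train`, `y_train`, and `k=threshold`.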


2. Use any wrapper method to select the best features

In [72]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Shallow random forest used to rank features at each elimination round
model_rf = RandomForestRegressor(n_estimators=500, random_state=42, max_depth=3)

# Drop the least important feature one at a time (step=1) until `threshold` remain
threshold = 4
selector = RFE(estimator=model_rf, n_features_to_select=threshold, step=1)

selector = selector.fit(X_train, y_train)
selector_ind = selector.get_support()  # boolean mask of the surviving columns

df_housing_rfe = X_train.iloc[:, selector_ind]
selected_features = df_housing_rfe.columns
print("Selected features using RFE:", list(selected_features))



Selected features using RFE: ['MedInc', 'HouseAge', 'AveRooms', 'AveOccup']
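
Besides the final mask, a fitted `RFE` object also records the order in which features were eliminated in its `ranking_` attribute (rank 1 means selected; larger ranks were dropped earlier). The sketch below illustrates this on synthetic stand-in data; `X_demo`, `y_demo`, the column names, and the smaller forest (`n_estimators=50`) are assumptions made to keep the example fast, not settings from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in data (hypothetical): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] + 2.0 * X_demo["b"] + rng.normal(scale=0.5, size=500)

rfe_demo = RFE(
    estimator=RandomForestRegressor(n_estimators=50, max_depth=3, random_state=42),
    n_features_to_select=2,
    step=1,
).fit(X_demo, y_demo)

# Rank 1 = kept; rank 2 = last feature eliminated; rank 3 = first feature eliminated.
rfe_ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
selected_demo = list(X_demo.columns[rfe_demo.support_])
print(rfe_ranking)
print("Selected features:", selected_demo)
```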


3. Use any embedded method to select the best features

In [73]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

# Fit the forest once; its impurity-based importances drive the selection
model_rf = RandomForestRegressor(n_estimators=500, random_state=42, max_depth=3)
model_rf.fit(X_train, y_train)

# Keep only the features whose importance is at least the mean importance
sel_sfm = SelectFromModel(model_rf, prefit=True, threshold="mean")
sel_sfm_index = sel_sfm.get_support()

df_housing_sfm = X_train.iloc[:, sel_sfm_index]
selected_features = df_housing_sfm.columns
print("Selected features using SelectFromModel:", list(selected_features))


Selected features using SelectFromModel: ['MedInc', 'AveOccup']
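
With `threshold="mean"`, `SelectFromModel` keeps a feature only if its importance is at least the average importance across all features, which is why this method can return fewer features than the fixed `k` used by the other two. The sketch below reproduces that rule by hand on synthetic stand-in data. The data, the column names, and the choice to let `SelectFromModel` fit the estimator itself (instead of `prefit=True`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] + 2.0 * X_demo["b"] + rng.normal(scale=0.5, size=500)

sfm_demo = SelectFromModel(
    RandomForestRegressor(n_estimators=50, max_depth=3, random_state=42),
    threshold="mean",
).fit(X_demo, y_demo)

# Recompute the selection rule manually: importance >= mean importance.
raw_importances = sfm_demo.estimator_.feature_importances_
cutoff = raw_importances.mean()
manual_mask = raw_importances >= cutoff

importances = pd.Series(raw_importances, index=X_demo.columns)
selected_demo = list(X_demo.columns[sfm_demo.get_support()])
print(importances.sort_values(ascending=False))
print("Mean importance (cutoff):", cutoff)
print("Selected features:", selected_demo)
```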


# Model Comparison

In [74]:
from sklearn.model_selection import train_test_split

# Split each reduced training set again; the shared seed keeps the same rows in every split
X_train_skb, X_test_skb, y_train_skb, y_test_skb = train_test_split(df_housing_skb, y_train, test_size=0.2, random_state=42)
X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe = train_test_split(df_housing_rfe, y_train, test_size=0.2, random_state=42)
X_train_sfm, X_test_sfm, y_train_sfm, y_test_sfm = train_test_split(df_housing_sfm, y_train, test_size=0.2, random_state=42)


In [75]:
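from sklearn.metrics import mean_squared_error

# model_default, model_skb, model_rfe and model_sfm are assumed to have been
# fit in a cell that is not shown here.
default_preds = model_default.predict(X_test)
skb_preds = model_skb.predict(X_test_skb)
rfe_preds = model_rfe.predict(X_test_rfe)
sfm_preds = model_sfm.predict(X_test_sfm)

# RMSE as the square root of the MSE; this avoids the `squared=False` argument,
# which is deprecated in newer scikit-learn releases.
default_rmse = mean_squared_error(y_test, default_preds) ** 0.5
skb_rmse = mean_squared_error(y_test_skb, skb_preds) ** 0.5
rfe_rmse = mean_squared_error(y_test_rfe, rfe_preds) ** 0.5
sfm_rmse = mean_squared_error(y_test_sfm, sfm_preds) ** 0.5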



In [76]:
print(f'Default RMSE: {default_rmse}')
print(f'Filter RMSE: {skb_rmse}')
print(f'Wrapper RMSE: {rfe_rmse}')
print(f'Embedded RMSE: {sfm_rmse}')

Default RMSE: 0.7750447841432082
Filter RMSE: 0.8123176447175391
Wrapper RMSE: 0.7760559025245356
Embedded RMSE: 0.7841069967542479
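
In the cells above, the default model is scored on `X_test` while the three reduced models are scored on a second split of the training data, so the four RMSE values are not computed on the same rows. One way to make such a comparison like-for-like is to split once and reuse that split for every feature subset. The sketch below shows the idea on synthetic stand-in data; the helper `rmse_for_columns`, the data, the subsets, and the choice of a small random forest are all hypothetical, since the notebook does not show how its models were fit.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): only "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] + 2.0 * X_demo["b"] + rng.normal(scale=0.5, size=500)

# One split, reused for every feature subset.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)


def rmse_for_columns(columns):
    """Fit on the training rows and score on the held-out rows, using only `columns`."""
    model = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=42)
    model.fit(X_tr[columns], y_tr)
    preds = model.predict(X_te[columns])
    return float(mean_squared_error(y_te, preds) ** 0.5)


# Hypothetical subsets standing in for the filter, wrapper, and embedded selections.
subsets = {
    "all features": ["a", "b", "c", "d"],
    "informative only": ["a", "b"],
    "noise only": ["c", "d"],
}
rmse_by_subset = {name: rmse_for_columns(cols) for name, cols in subsets.items()}
for name, score in rmse_by_subset.items():
    print(f"{name} RMSE: {score:.3f}")
```

Because every subset is evaluated on the same held-out rows with the same model settings, differences in RMSE can be attributed to the features alone.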
