<a href="https://colab.research.google.com/github/Saraldedv/CCADMACL_EXERCISES_COM222ML/blob/main/SARALDE_Exercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1

Use all three feature selection methods (filter, wrapper, and embedded) to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectFromModel

import pandas as pd

In [2]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)

In [6]:
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1. Use any filter method to select the best features

In [7]:
# Number of features to keep; reused as n_features_to_select for RFE
threshold = 4
skb = SelectKBest(score_func=f_regression, k=threshold)
skb.fit(X_train, y_train)
skb_index = skb.get_support()
df_housing_skb = X_train.iloc[:, skb_index]
print('p_values:', skb.pvalues_)
print('Selected features using SelectKBest:', df_housing_skb.columns)

p_values: [0.00000000e+00 1.01919784e-40 2.47317844e-93 4.04532102e-11
 8.21491618e-04 4.64088908e-03 3.82179752e-76 2.54505126e-09]
Selected features using SelectKBest: Index(['MedInc', 'HouseAge', 'AveRooms', 'Latitude'], dtype='object')
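The raw `p_values` array is hard to read without feature names next to it. A minimal sketch of pairing each score with its column name follows. It assumes a small synthetic frame from `make_regression` so the block runs without downloading the housing data; the column names `f0` to `f5` and the variables `X_demo`, `skb_demo`, and `score_table` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the housing data (assumption: 6 features, 3 informative).
X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

skb_demo = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)

# One row per feature, highest F-statistic first.
score_table = pd.DataFrame(
    {
        "f_score": skb_demo.scores_,
        "p_value": skb_demo.pvalues_,
        "selected": skb_demo.get_support(),
    },
    index=X_demo.columns,
).sort_values("f_score", ascending=False)

print(score_table)
```

Applied to the notebook's own objects, the same table could be built from `skb.scores_`, `skb.pvalues_`, `skb.get_support()`, and `X_train.columns`.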


2. Use any wrapper method to select the best features

In [8]:
model_rf = RandomForestRegressor(n_estimators=500, random_state=42, max_depth=3)
selector_rfe = RFE(estimator=model_rf, n_features_to_select=threshold, step=1)
selector_rfe.fit(X_train, y_train)
rfe_index = selector_rfe.get_support()
df_housing_rfe = X_train.iloc[:, rfe_index]
print("Selected features using RFE:", df_housing_rfe.columns)

Selected features using RFE: Index(['MedInc', 'HouseAge', 'AveRooms', 'AveOccup'], dtype='object')
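RFE also records the order in which features were eliminated, which `get_support()` alone does not show. The sketch below inspects the `ranking_` attribute. It assumes synthetic data and a deliberately small forest so it runs quickly; `X_demo`, `rfe_demo`, and `rfe_ranking` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the housing data (assumption: 6 features, 3 informative).
X_arr, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# A small forest keeps the sketch fast; the notebook uses 500 trees.
small_forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
rfe_demo = RFE(estimator=small_forest, n_features_to_select=3, step=1)
rfe_demo.fit(X_demo, y_demo)

# ranking_ is 1 for every kept feature; larger values were dropped earlier.
rfe_ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(rfe_ranking)
```

In the notebook, `pd.Series(selector_rfe.ranking_, index=X_train.columns)` would give the equivalent view for the housing features.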


3. Use any embedded method to select the best features

In [9]:
model_rf.fit(X_train, y_train)
sfm = SelectFromModel(model_rf, prefit=True, threshold="mean")
sfm_index = sfm.get_support()
df_housing_sfm = X_train.iloc[:, sfm_index]
print("Selected features using SelectFromModel:", df_housing_sfm.columns)

Selected features using SelectFromModel: Index(['MedInc', 'AveOccup'], dtype='object')
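With `threshold="mean"`, `SelectFromModel` keeps the features whose importance is at least the mean of all importances, which is why it can return fewer features than the other two methods. The sketch below reproduces that rule by hand and checks it against `get_support()`. It assumes synthetic data and a small forest; `forest_demo`, `sfm_demo`, and `manual_mask` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the housing data (assumption: 6 features, 2 informative).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2,
                                 noise=10.0, random_state=0)

forest_demo = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
forest_demo.fit(X_demo, y_demo)

# prefit=True reuses the already fitted forest, as in the notebook cell.
sfm_demo = SelectFromModel(forest_demo, prefit=True, threshold="mean")

# Manual version of the rule: keep importances at or above their mean.
importances = forest_demo.feature_importances_
manual_mask = importances >= importances.mean()

print("importances:", np.round(importances, 3))
print("mean threshold:", round(float(importances.mean()), 3))
print("masks agree:", np.array_equal(manual_mask, sfm_demo.get_support()))
```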


# Train-test Split

In [10]:
# Note: these splits are drawn from X_train, so the selected-feature models are
# trained and scored on subsets of the training data rather than on X_test.
X_train_skb, X_test_skb, y_train_skb, y_test_skb = train_test_split(df_housing_skb, y_train, test_size=0.2, random_state=42)
X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe = train_test_split(df_housing_rfe, y_train, test_size=0.2, random_state=42)
X_train_sfm, X_test_sfm, y_train_sfm, y_test_sfm = train_test_split(df_housing_sfm, y_train, test_size=0.2, random_state=42)

# Model Training

In [11]:
model_default = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)
model_skb = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)
model_rfe = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)
model_sfm = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)


model_default.fit(X_train, y_train)
model_skb.fit(X_train_skb, y_train_skb)
model_rfe.fit(X_train_rfe, y_train_rfe)
model_sfm.fit(X_train_sfm, y_train_sfm)

In [12]:
default_preds = model_default.predict(X_test)
skb_preds = model_skb.predict(X_test_skb)
rfe_preds = model_rfe.predict(X_test_rfe)
sfm_preds = model_sfm.predict(X_test_sfm)


# RMSE is the square root of the MSE; taking the root explicitly avoids the
# `squared` argument, which newer scikit-learn releases deprecate.
default_rmse = mean_squared_error(y_test, default_preds) ** 0.5
skb_rmse = mean_squared_error(y_test_skb, skb_preds) ** 0.5
rfe_rmse = mean_squared_error(y_test_rfe, rfe_preds) ** 0.5
sfm_rmse = mean_squared_error(y_test_sfm, sfm_preds) ** 0.5



In [13]:
# Print results

print(f'Default RMSE: {default_rmse}')
print(f'Filter RMSE: {skb_rmse}')
print(f'Wrapper RMSE: {rfe_rmse}')
print(f'Embedded RMSE: {sfm_rmse}')

Default RMSE: 0.7750447841432082
Filter RMSE: 0.8123176447175391
Wrapper RMSE: 0.7760559025245356
Embedded RMSE: 0.7841069967542479
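
In the cells above, the default model is scored on `X_test`, while the three feature-selected models are scored on a 20% hold-out drawn from `X_train`, so the four RMSE values are computed on different rows. One possible way to score every feature subset on the same test rows is sketched below. It assumes synthetic data from `make_regression` and a hypothetical `feature_sets` dictionary so that it runs without the housing download. No output is shown because the numbers would not correspond to the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: 6 features, 3 informative).
X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# A single split, reused for every model so the scores are comparable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Hypothetical subsets; in the notebook these would be each selector's columns.
feature_sets = {
    "Default": list(X_demo.columns),
    "Subset A": ["f0", "f1", "f2"],
    "Subset B": ["f3", "f4"],
}

rmse_by_set = {}
for name, cols in feature_sets.items():
    model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
    model.fit(X_tr[cols], y_tr)           # train on the shared training rows
    preds = model.predict(X_te[cols])     # score on the shared test rows
    rmse_by_set[name] = mean_squared_error(y_te, preds) ** 0.5

for name, rmse in rmse_by_set.items():
    print(f"{name} RMSE: {rmse:.3f}")
```

In the notebook, the subsets would be `df_housing_skb.columns`, `df_housing_rfe.columns`, and `df_housing_sfm.columns`, applied to the original `X_train` and `X_test`.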
