<a href="https://colab.research.google.com/github/Gkcoli/CCADMACL_EXERCISES_COM222-ML/blob/main/CCADMACL_EXERCISE_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1

Apply a filter, a wrapper, and an embedded feature selection method to find the best features, and compare each against a baseline that uses all features.

## Dataset Information

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

## Features

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [51]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [52]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)

0. Default (baseline using all features)

In [53]:
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)


In [54]:
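# Baseline: linear regression on all eight features
default_model = LinearRegression()
default_model.fit(X_train, y_train)
y_pred_default = default_model.predict(X_test)
# RMSE is in the target's units (hundreds of thousands of dollars)
default_rmse = np.sqrt(mean_squared_error(y_test, y_pred_default))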

1. Use any filter method to select the best features (f_classif)

In [55]:
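k = 5  # Number of features to select
# NOTE: f_classif is an ANOVA F-test meant for categorical targets; for this
# continuous target, f_regression is the appropriate scorer. f_classif is kept
# here because the recorded outputs in this notebook were produced with it.
filter_selector = SelectKBest(score_func=f_classif, k=k)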
X_train_filter = filter_selector.fit_transform(X_train, y_train)
X_test_filter = filter_selector.transform(X_test)

selected_features_filter = X_train.columns[filter_selector.get_support()]
filter_model = LinearRegression()
filter_model.fit(X_train_filter, y_train)
y_pred_filter = filter_model.predict(X_test_filter)
filter_rmse = np.sqrt(mean_squared_error(y_test, y_pred_filter))
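
Because the target is continuous, a regression scorer is the more natural fit for a filter method. The sketch below shows `SelectKBest` with `f_regression` on synthetic data from `make_regression`. The synthetic data is an assumption, chosen so the example runs without downloading anything, and the column names `f0` to `f7` are hypothetical. It is an illustration of the API, not a rerun of the cell above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the housing data: 8 features, 3 of them informative.
X_arr, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# f_regression scores each feature with a univariate F-test against a
# continuous target; SelectKBest keeps the k highest-scoring features.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

selected_demo = list(X_demo.columns[selector.get_support()])
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)

print("Selected:", selected_demo)
print(scores)
```

Swapping `f_classif` for `f_regression` in the housing cell would likely change the selected features and the filter RMSE, so the recorded outputs would need to be regenerated.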

2. Use any wrapper method to select the best features

In [56]:
rfe_model = LinearRegression()
rfe_selector = RFE(estimator=rfe_model, n_features_to_select=5)
rfe_selector.fit(X_train, y_train)

selected_features_rfe = X_train.columns[rfe_selector.support_]
X_train_rfe = rfe_selector.transform(X_train)
X_test_rfe = rfe_selector.transform(X_test)
# RFE fits clones of rfe_model, so refit the original on the selected columns
rfe_model.fit(X_train_rfe, y_train)

y_pred_rfe = rfe_model.predict(X_test_rfe)
rfe_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rfe))
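
With a linear estimator, RFE ranks features by coefficient magnitude, and coefficient magnitude depends on each feature's scale. A minimal sketch of one way to account for that is to standardize the features before running RFE and then inspect `ranking_`, where rank 1 marks the kept features. The synthetic data and the column names `f0` to `f7` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns f0..f7).
X_arr, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=5.0, random_state=0
)
cols = [f"f{i}" for i in range(8)]

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=cols)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe_demo.fit(X_scaled, y_demo)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=cols).sort_values()
kept = list(ranking[ranking == 1].index)

print(ranking)
print("Kept:", kept)
```

In a train/test setting the scaler would be fit on the training split only and then applied to the test split.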

3. Use any embedded method to select the best features

In [61]:
rf_model = RandomForestRegressor(n_estimators=500, random_state=42)
rf_model.fit(X_train, y_train)

sfm_selector = SelectFromModel(rf_model, prefit=True, threshold="mean")
X_train_sfm = sfm_selector.transform(X_train)
X_test_sfm = sfm_selector.transform(X_test)

selected_features_sfm = X_train.columns[sfm_selector.get_support()]
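# NOTE: these predictions come from the random forest trained on all eight
# features; X_train_sfm / X_test_sfm are not used, so this RMSE measures the
# full forest rather than a model restricted to the selected features.
y_val = rf_model.predict(X_test)

sfm_rmse = np.sqrt(mean_squared_error(y_test, y_val))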
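
To compare the embedded method on the same footing as the filter and wrapper methods, one option is to refit a `LinearRegression` on only the columns that `SelectFromModel` keeps. The sketch below does this on synthetic data, which is an assumption so that it runs offline. It also lets the selector fit its own forest instead of using `prefit=True`, a variation on the cell above rather than a copy of it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The selector fits the forest and keeps features whose importance is at
# least the mean importance.
sfm_demo = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=42), threshold="mean"
)
sfm_demo.fit(X_tr, y_tr)
X_tr_sel = sfm_demo.transform(X_tr)
X_te_sel = sfm_demo.transform(X_te)

# Refit a linear model on the selected columns only, so that the RMSE reflects
# the selected features and uses the same model family as the other methods.
linear_demo = LinearRegression().fit(X_tr_sel, y_tr)
rmse_selected = np.sqrt(mean_squared_error(y_te, linear_demo.predict(X_te_sel)))

# For contrast: the forest itself, predicting from all features.
rmse_forest_all = np.sqrt(
    mean_squared_error(y_te, sfm_demo.estimator_.predict(X_te))
)

print(f"Linear model on selected features: {rmse_selected:.3f}")
print(f"Random forest on all features:     {rmse_forest_all:.3f}")
```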



Top Features

In [62]:
# Print top features from each method
print("\nSelected Features:")
print(f"Filter Method (F_classif): {list(selected_features_filter)}")
print(f"Wrapper Method (RFE): {list(selected_features_rfe)}")
print(f"Embedded Method (SelectFromModel): {list(selected_features_sfm)}")


Selected Features:
Filter Method (F_classif): ['MedInc', 'HouseAge', 'Population', 'Latitude', 'Longitude']
Wrapper Method (RFE): ['MedInc', 'AveRooms', 'AveBedrms', 'Latitude', 'Longitude']
Embedded Method (SelectFromModel): ['MedInc', 'AveOccup']


RMSE results

In [63]:
print(f"Default RMSE: {default_rmse}")
print(f"Filter RMSE: {filter_rmse}")
print(f"Wrapper RMSE: {rfe_rmse}")
print(f"Embedded RMSE: {sfm_rmse}")

Default RMSE: 0.7455813830127764
Filter RMSE: 0.7408843864985539
Wrapper RMSE: 0.7528409640011294
Embedded RMSE: 0.5021907705243096
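
Since the target is expressed in units of $100,000, each RMSE can be read as a typical prediction error in dollars by multiplying by 100,000. The short sketch below does that conversion for the values printed above. The dictionary name `rmse_by_method` is hypothetical, and the "Embedded" figure is the one produced by the random forest's predictions.

```python
# RMSE values copied from the recorded output above (units: $100,000).
rmse_by_method = {
    "Default": 0.7455813830127764,
    "Filter": 0.7408843864985539,
    "Wrapper": 0.7528409640011294,
    "Embedded": 0.5021907705243096,
}

# Convert to dollars and list the methods from lowest to highest error.
rmse_dollars = {name: rmse * 100_000 for name, rmse in rmse_by_method.items()}
ordered = sorted(rmse_dollars, key=rmse_dollars.get)

for name in ordered:
    print(f"{name:<9} RMSE = {rmse_by_method[name]:.4f}  (about ${rmse_dollars[name]:,.0f})")
```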
