<a href="https://colab.research.google.com/github/robitussin/CCADMACL_EXERCISES/blob/main/Exercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1

Use filter, wrapper, and embedded feature selection methods to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in units of $100,000.

In [54]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

In [71]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)
df_features = housing.data
df_target = housing.target

1. Use any filter method to select the best features

In [84]:
from sklearn.feature_selection import VarianceThreshold

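# Keep only the features whose variance exceeds 0.5.
# The features are not scaled here, so the threshold applies to raw variances.
selector = VarianceThreshold(threshold=0.5)
sel = selector.fit(df_features)
sel_index = sel.get_support()  # boolean mask of the retained columns
df_housing_norm_vt = df_features.iloc[:, sel_index]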

df_housing_norm_vt.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup', 'Latitude',
       'Longitude'],
      dtype='object')
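`VarianceThreshold` compares each column's raw variance against the threshold, so the result depends on the units of each feature. `AveBedrms` was dropped above because its variance on the original scale does not exceed 0.5. The following is a minimal sketch of how to inspect that decision; it uses a hypothetical toy DataFrame (`toy`, with made-up column names) rather than the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: three columns on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "wide": rng.normal(0.0, 10.0, size=200),    # variance near 100
    "medium": rng.normal(0.0, 1.0, size=200),   # variance near 1
    "narrow": rng.normal(0.0, 0.1, size=200),   # variance near 0.01
})

selector = VarianceThreshold(threshold=0.5).fit(toy)

# variances_ holds the per-column variance that was compared with the threshold.
report = pd.DataFrame(
    {"variance": selector.variances_, "kept": selector.get_support()},
    index=toy.columns,
)
print(report)

kept_columns = list(toy.columns[selector.get_support()])
```

Running the same `variances_` inspection on `df_features` would show how far each housing feature sits from the 0.5 cut-off.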

2. Use any wrapper method to select the best features

In [51]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

n_features = 5  # the number of most relevant features to keep
model_rf = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)
# Recursively drop the least important feature (step=1) until n_features remain
selector = RFE(model_rf, n_features_to_select=n_features, step=1)

selector = selector.fit(df_features, df_target.values.ravel())
selector_ind = selector.get_support()  # boolean mask of the retained columns
df_housing_rfe = df_features.iloc[:, selector_ind]

df_housing_rfe.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveOccup', 'Latitude'], dtype='object')
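Besides the final mask, a fitted `RFE` object records the order in which features were eliminated. The sketch below illustrates this on hypothetical synthetic data from `make_regression` (the column names `f0` to `f5` are made up), with a smaller forest to keep it quick.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical synthetic data: 6 features, only 3 of which drive the target.
X_arr, y = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=0.1, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

n_keep = 3
rfe = RFE(
    RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0),
    n_features_to_select=n_keep,
    step=1,
).fit(X, y)

# ranking_ is 1 for every selected feature; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)

selected = list(X.columns[rfe.support_])
```

Applied to the housing data, `ranking_` would show which three features were removed and in what order.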

3. Use any embedded method to select the best features

In [85]:
from sklearn.feature_selection import SelectFromModel

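# Fit the forest once; its feature importances drive the selection
model_rf = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)
model_rf.fit(df_features, df_target.values.ravel())

# With no explicit threshold, features are kept when their importance
# is at least the mean importance across all features
sel_sfm = SelectFromModel(model_rf, prefit=True)
sel_sfm_index = sel_sfm.get_support()  # boolean mask of the retained columns
df_housing_sfm = df_features.iloc[:, sel_sfm_index]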

df_housing_sfm.columns

Index(['MedInc', 'AveOccup'], dtype='object')
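Only two features survive because `SelectFromModel`, when given a random forest and no `threshold`, uses the mean feature importance as the cut-off. A few dominant features can push that mean above everything else. The sketch below, on hypothetical synthetic data, shows the default cut-off and how a looser one (the string `"0.5*mean"`, chosen here only for illustration) keeps at least as many features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Hypothetical synthetic data: 6 features, 2 of which drive the target.
X, y = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=0.1, random_state=0
)

def make_forest():
    # Same settings each time so both selectors see identical importances.
    return RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)

# Default threshold: mean of the feature importances.
sfm = SelectFromModel(make_forest()).fit(X, y)
importances = sfm.estimator_.feature_importances_
print("importances:", np.round(importances, 3))
print("threshold  :", round(float(sfm.threshold_), 3))
print("kept mask  :", sfm.get_support())

# Looser cut-off: half of the mean importance.
sfm_loose = SelectFromModel(make_forest(), threshold="0.5*mean").fit(X, y)
print("loose mask :", sfm_loose.get_support())
```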

# Model Training

In [89]:
from sklearn.model_selection import train_test_split

# Baseline: all eight features
x = df.drop(["MedHouseVal"], axis=1)
y = df["MedHouseVal"]

# Filter method (VarianceThreshold) features
x_fm = df_housing_norm_vt
y_fm = df["MedHouseVal"]

# Wrapper method (RFE) features
x_wm = df_housing_rfe
y_wm = df["MedHouseVal"]

# Embedded method (SelectFromModel) features
x_em = df_housing_sfm
y_em = df["MedHouseVal"]

# Same random_state and test_size, so every split contains the same rows
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.3)
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_fm, y_fm, random_state=1, test_size=0.3)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_wm, y_wm, random_state=1, test_size=0.3)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_em, y_em, random_state=1, test_size=0.3)
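The selectors above were fitted on the full dataset before the split, so the test rows influenced which features were kept. One possible alternative, sketched here on hypothetical synthetic data, is to put the selector and the regressor in a scikit-learn `Pipeline` so that selection is learned from the training rows only. The step names `"select"` and `"model"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical synthetic data standing in for the housing features and target.
X, y = make_regression(
    n_samples=400, n_features=8, n_informative=4, noise=0.5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

pipe = Pipeline([
    # Embedded selection, fitted on the training rows only.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
    )),
    # Final regressor, trained on the selected columns.
    ("model", RandomForestRegressor(n_estimators=50, random_state=1)),
])
pipe.fit(X_tr, y_tr)

rmse = float(np.sqrt(mean_squared_error(y_te, pipe.predict(X_te))))
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(f"Selected {n_selected} features, test RMSE: {rmse:.4f}")
```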

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

rfr = RandomForestRegressor(n_estimators=50,random_state=1)

rfr.fit(x_train,y_train)
y_pred = rfr.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"All Features RMSE: {rmse}")

rfr.fit(x_train1,y_train1)
y_pred1 = rfr.predict(x_test1)
rmse1 = np.sqrt(mean_squared_error(y_test1, y_pred1))
print(f"Filter Method RMSE: {rmse1}")

rfr.fit(x_train2,y_train2)
y_pred2 = rfr.predict(x_test2)
rmse2 = np.sqrt(mean_squared_error(y_test2, y_pred2))
print(f"Wrapper Method RMSE: {rmse2}")

rfr.fit(x_train3,y_train3)
y_pred3 = rfr.predict(x_test3)
rmse3 = np.sqrt(mean_squared_error(y_test3, y_pred3))
print(f"Embedded Method RMSE: {rmse3}")

All Features RMSE: 0.5156447470906563
Filter Method RMSE: 0.5093031148401722
Wrapper Method RMSE: 0.6130468216560241
Embedded Method RMSE: 0.7921066437138734
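Because the target is in units of $100,000, an RMSE of about 0.51 corresponds to an error of roughly $51,000. The four near-identical fit and score blocks could also be folded into one helper. The sketch below is one way to do that; `rmse_for_subset` is a hypothetical helper name, and the data and column names are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_for_subset(X, y, columns, random_state=1):
    """Split, fit a fresh forest on the given columns, and return the test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[columns], y, test_size=0.3, random_state=random_state
    )
    model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# Hypothetical synthetic data with made-up column names.
X_arr, y = make_regression(
    n_samples=300, n_features=5, n_informative=3, noise=0.5, random_state=0
)
X = pd.DataFrame(X_arr, columns=list("abcde"))

subsets = {"All Features": list(X.columns), "First Two": ["a", "b"]}
results = {name: rmse_for_subset(X, y, cols) for name, cols in subsets.items()}
for name, value in results.items():
    print(f"{name} RMSE: {value:.4f}")
```

With the notebook's data, `subsets` would map each method name to the columns of `df_housing_norm_vt`, `df_housing_rfe`, and `df_housing_sfm`.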
