# Feature selection

In this part, we run statistical tests to decide which features to use for model training. Several feature selection methods are applied, including a filter-based method and Recursive Feature Elimination (RFE).
Later on, during model training, LASSO will be applied to refine the feature choice and reduce the computational cost.
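
As a rough illustration of why LASSO helps with feature choice, the sketch below fits an L1-penalized regression on synthetic data and shows that coefficients of uninformative features are shrunk to exactly zero. The data, the names `X_demo` and `y_demo`, and the value `alpha=0.2` are assumptions made for this example only; they are not the settings used in the model-training part.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): 6 standardized features, only the first two drive y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 6))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=1000)

# alpha controls the strength of the L1 penalty; 0.2 is an arbitrary illustrative value
lasso = Lasso(alpha=0.2)
lasso.fit(X_demo, y_demo)

# Features whose coefficient was shrunk exactly to zero are effectively dropped
kept = np.flatnonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 3))
print("Indices of features kept by LASSO:", kept)
```

On real data the features would typically be scaled first, since the L1 penalty treats all coefficients on the same scale.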

The correlation matrix showed that garageSpaces and parkingSpaces are almost perfectly correlated, so we remove parkingSpaces and keep only garageSpaces.

In [70]:
df_map = df_map.drop(columns='parkingSpaces')
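
The redundancy check behind this drop can be sketched as follows: scan the upper triangle of the absolute correlation matrix and list column pairs above a cutoff. The toy values, the `toy` frame, and the `cutoff` of 0.9 are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two nearly identical columns and one unrelated column
toy = pd.DataFrame({
    "garageSpaces": [0, 2, 0, 1, 3, 2],
    "parkingSpaces": [0, 2, 0, 1, 3, 3],
    "numOfStories": [1, 2, 1, 2, 1, 2],
})

corr = toy.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

cutoff = 0.9  # assumed cutoff for "redundant"
redundant_pairs = [
    (row, col, round(upper.loc[row, col], 3))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > cutoff
]
print(redundant_pairs)
```

For each reported pair, one of the two columns can be dropped without losing much information.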

In [75]:
# Calculate correlations of candidate features with the target "latestPrice"
target_corr = corr_matrix['latestPrice'].drop('latestPrice')

# Set the threshold on the absolute correlation
threshold = 0.15
selected_features = target_corr[target_corr.abs() >= threshold]

print(f"Features with |correlation| >= {threshold}:")
print(selected_features)

Features with |correlation| >= 0.15:
longitude                  -0.176157
garageSpaces                0.163030
parkingSpaces               0.162937
numOfPhotos                 0.156259
livingAreaSqFt              0.464788
numOfPrimarySchools        -0.168356
numOfElementarySchools      0.156716
numOfHighSchools           -0.195339
avgSchoolRating             0.291039
MedianStudentsPerTeacher    0.207284
numOfBathrooms              0.543477
numOfBedrooms               0.341738
numOfStories                0.212310
Income                      0.268169
population                 -0.213534
Name: latestPrice, dtype: float64


For numerical variables, we keep those whose absolute correlation with the sale price reaches the 0.15 threshold: longitude, garageSpaces, numOfPhotos, living area in square feet, the number of schools (primary, elementary, high school), school rating, median students per teacher, number of bathrooms, number of bedrooms, number of stories, income, and population. We also keep latitude and two macroeconomic variables: the median list price per square foot and the mortgage rate.

In [64]:
from scipy.stats import chi2_contingency

# Create a binary target: "High" if latestPrice is greater than or equal to the median, otherwise "Low"
median_price = df_map["latestPrice"].median()
df_map["price_category"] = df_map["latestPrice"].apply(lambda x: "High" if x >= median_price else "Low")

bool_cols = df_map.select_dtypes(include="bool").columns
chi2_results = {}

for col in bool_cols:
    contingency_table = pd.crosstab(df_map[col], df_map["price_category"])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    
    chi2_results[col] = {"chi2_stat": chi2, "p_value": p, "dof": dof}

for col, result in chi2_results.items():
    print(f"Column: {col} | Chi2: {result['chi2_stat']:.4f}, p-value: {result['p_value']:.4f}, dof: {result['dof']}")


Column: hasAssociation | Chi2: 42.9912, p-value: 0.0000, dof: 1
Column: hasCooling | Chi2: 14.5450, p-value: 0.0001, dof: 1
Column: hasGarage | Chi2: 151.2900, p-value: 0.0000, dof: 1
Column: hasHeating | Chi2: 9.9736, p-value: 0.0016, dof: 1
Column: hasSpa | Chi2: 283.7785, p-value: 0.0000, dof: 1
Column: hasView | Chi2: 234.5709, p-value: 0.0000, dof: 1


All boolean features are significantly associated with the price category (p < 0.01 in every test), so we keep all of them.
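
To make the test statistic concrete, here is a minimal worked example on a hypothetical 2x2 table. It compares SciPy's result with the textbook formula, where the expected count of each cell is row total times column total divided by the grand total. The counts are invented for illustration. Note that `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables (as in the cell above); the sketch passes `correction=False` so that the statistic matches the manual formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts: rows = feature False/True, columns = price High/Low
observed = np.array([[30, 70],
                     [60, 40]])

# correction=False disables Yates' continuity correction so the statistic
# matches the textbook formula sum((O - E)^2 / E)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Expected counts under independence: row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / observed.sum()
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2 = {chi2:.4f} (manual {manual_chi2:.4f}), p = {p:.4g}, dof = {dof}")
```

With large samples even small differences become statistically significant, so the size of the chi-square statistic is worth reading alongside the p-value.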

In [85]:
df_training = df_map[[
    'longitude', 'latitude','garageSpaces','numOfPhotos', 'numOfElementarySchools', 'numOfHighSchools', 'numOfPrimarySchools',
    'livingAreaSqFt','avgSchoolRating','MedianStudentsPerTeacher','numOfBathrooms','numOfBedrooms', 'numOfStories', 'Income',
    'population','MORTGAGE30US', 'MEDLISPRIPERSQUFEE12420', 'hasAssociation', 'hasCooling', 'hasGarage', 'hasHeating', 'hasSpa', 'hasView', 'latestPrice', 'latest_saledate']]

In [86]:
df_training.head()

| | longitude | latitude | garageSpaces | numOfPhotos | numOfElementarySchools | numOfHighSchools | numOfPrimarySchools | livingAreaSqFt | avgSchoolRating | MedianStudentsPerTeacher | ... | MORTGAGE30US | MEDLISPRIPERSQUFEE12420 | hasAssociation | hasCooling | hasGarage | hasHeating | hasSpa | hasView | latestPrice | latest_saledate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -97.74147 | 30.19743 | 0 | 27 | 0 | 1 | 1 | 1592.0 | 3.0 | 12 | ... | 4.04 | 163 | False | True | False | True | False | False | 209900.0 | 2018-01-22 |
| 1 | -97.782372 | 30.157705 | 2 | 1 | 0 | 1 | 1 | 2217.0 | 5.333333 | 14 | ... | 4.04 | 163 | True | True | True | True | False | False | 260000.0 | 2018-01-22 |
| 2 | -97.94043 | 30.163763 | 0 | 21 | 0 | 0 | 0 | 2582.0 | 8.0 | 15 | ... | 4.04 | 163 | True | True | False | True | False | False | 440000.0 | 2018-01-22 |
| 3 | -97.802513 | 30.443314 | 0 | 1 | 1 | 1 | 0 | 2942.0 | 8.666667 | 16 | ... | 4.04 | 163 | True | True | False | True | False | False | 679900.0 | 2018-01-22 |
| 4 | -97.704697 | 30.265963 | 1 | 19 | 0 | 1 | 1 | 2297.0 | 5.333333 | 14 | ... | 4.04 | 163 | False | True | True | True | False | False | 599990.0 | 2018-01-23 |


In [87]:
df_training.to_pickle('filtered_features1.pkl')

***Further refine the feature selection using Recursive Feature Elimination with a Random Forest Regressor as the estimator***

In [81]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

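# Separate features (X) and target (y); latest_saledate is a date column,
# not a numeric predictor, so it is excluded from X as well
X = df_training.drop(columns=["latestPrice", "latest_saledate"])
y = df_training["latestPrice"]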

# Initialize the Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Set the number of features to select (adjust as needed, e.g., top 10)
n_features_to_select = 10

# Initialize RFE with the RandomForestRegressor
rfe = RFE(estimator=rf, n_features_to_select=n_features_to_select)

# Fit RFE on the data
rfe.fit(X, y)

# Get the boolean mask of selected features
selected_mask = rfe.support_

# Retrieve the names of the selected features
selected_features = X.columns[selected_mask]
print("Selected Features by RFE:")
print(selected_features)

Selected Features by RFE:
Index(['longitude', 'latitude', 'numOfPhotos', 'livingAreaSqFt',
       'avgSchoolRating', 'numOfBathrooms', 'Income', 'MORTGAGE30US',
       'MEDLISPRIPERSQUFEE12420', 'hasAssociation'],
      dtype='object')
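
Besides the boolean mask in `support_`, a fitted `RFE` object exposes `ranking_`, which records the order of elimination: selected features get rank 1 and the feature removed first gets the highest rank. The sketch below shows this on synthetic data; the frame `X_demo`, the target `y_demo`, and the choice of two features are assumptions for illustration, not the settings used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (hypothetical): only f0 and f1 carry signal
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 4.0 * X_demo["f0"] + 2.0 * X_demo["f1"] + rng.normal(scale=0.3, size=300)

# step=1 removes the single least important feature at each iteration
rfe_demo = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    n_features_to_select=2,
    step=1,
)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 = selected; larger ranks were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Inspecting the ranking is one way to judge whether the chosen number of features cuts off variables that were eliminated only at the very end.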


In [88]:
df_model = df_training[['longitude', 'latitude', 'numOfPhotos', 'livingAreaSqFt',
                        'avgSchoolRating', 'numOfBathrooms', 'Income', 'MORTGAGE30US',
                        'MEDLISPRIPERSQUFEE12420', 'hasAssociation', 'latestPrice','latest_saledate']]

In [89]:
df_model.to_pickle('filtered_features_RFE.pkl')