## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
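Before comparing the models listed above, it can help to record a naive reference point that any real model should beat. The sketch below is an assumption, not part of the original workflow: it uses scikit-learn's `DummyRegressor` on synthetic data (`X_demo` and `y_demo` are made up for illustration) to show what a mean-only baseline scores.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (hypothetical; the real notebook would use X_train / Y_train)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 300_000 + 50_000 * X_demo[:, 0] + rng.normal(scale=20_000, size=200)

# A baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_rmse = np.sqrt(mean_squared_error(y_demo, baseline_pred))
baseline_r2 = r2_score(y_demo, baseline_pred)
print(f"Baseline RMSE: {baseline_rmse:.2f}, R^2: {baseline_r2:.2f}")
```

On the data it was fit on, a mean-only baseline has an R^2 of zero and an RMSE equal to the standard deviation of the target, which gives a floor for judging the other models.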

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from functions_variables import evaluate_model
import joblib



In [2]:
# Load the datasets
X_train = pd.read_csv('../data/processed/X_train.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
Y_train = pd.read_csv('../data/processed/Y_train.csv').squeeze()
Y_test = pd.read_csv('../data/processed/Y_test.csv').squeeze()


In [3]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", Y_train.shape)
print("y_test shape:", Y_test.shape)

X_train shape: (4511, 22)
X_test shape: (1128, 22)
y_train shape: (4511,)
y_test shape: (1128,)


In [4]:
# Initialize scaler for features
scaler_features = StandardScaler()

# List of columns to scale
columns_to_scale = [
    'description_baths', 'description_beds', 'description_garage',
    'description_sqft', 'description_stories', 'description_year_built',
    'year_sold', 'year_listed', 'days_on_market', 'city_frequency', 'state_frequency'
]

# Scale features for train and test sets
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[columns_to_scale] = scaler_features.fit_transform(X_train[columns_to_scale])
X_test_scaled[columns_to_scale] = scaler_features.transform(X_test[columns_to_scale])

# Debug: Verify scaled features
print("Scaled X_train sample:\n", X_train_scaled[columns_to_scale].head())
print("Scaled X_test sample:\n", X_test_scaled[columns_to_scale].head())

# Scale target variable
scaler_target = StandardScaler()
y_train_scaled = scaler_target.fit_transform(Y_train.values.reshape(-1, 1)).flatten()
y_test_scaled = scaler_target.transform(Y_test.values.reshape(-1, 1)).flatten()

# Debug: Verify scaled target
print("y_train_scaled sample:", y_train_scaled[:5])
print("y_test_scaled sample:", y_test_scaled[:5])


Scaled X_train sample:
    description_baths  description_beds  description_garage  description_sqft  \
0           1.712950          0.589797            0.722632          0.310417   
1           0.685544         -0.218642           -0.120782         -0.459942   
2           0.685544          0.589797            0.722632          0.739957   
3           0.685544          0.589797            1.566046          0.703170   
4          -1.369266         -0.218642           -0.964195         -0.569220   

   description_stories  description_year_built  year_sold  year_listed  \
0             2.223663                0.876046   0.734568    -0.089851   
1             0.657307                1.494672   0.734568    -0.089851   
2             0.657307                1.072881   0.734568    -0.089851   
3            -0.909049                0.763568   0.734568    -0.089851   
4            -0.909049               -2.160848   0.734568    -0.089851   

   days_on_market  city_frequency  state_frequency

In [5]:


models = {
    'Linear Regression': LinearRegression(),
    'SVR': SVR(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

results = {}

for model_name, model in models.items():
    results[model_name] = evaluate_model(model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display results
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")

LinearRegression:
  Train RMSE: $218347.61, Test RMSE: $226088.63
  Train MAE: $139473.13, Test MAE: $145219.73
  Train R^2: 0.43, Test R^2: 0.47
SVR:
  Train RMSE: $129658.54, Test RMSE: $152767.27
  Train MAE: $58267.05, Test MAE: $71677.26
  Train R^2: 0.80, Test R^2: 0.76
RandomForestRegressor:
  Train RMSE: $16687.72, Test RMSE: $32964.70
  Train MAE: $5060.30, Test MAE: $11235.00
  Train R^2: 1.00, Test R^2: 0.99
XGBRegressor:
  Train RMSE: $12189.43, Test RMSE: $31491.04
  Train MAE: $8020.11, Test MAE: $12429.57
  Train R^2: 1.00, Test R^2: 0.99

Model: Linear Regression
  Train RMSE: 218347.61
  Test RMSE: 226088.63
  Train MAE: 139473.13
  Test MAE: 145219.73
  Train R^2: 0.43
  Test R^2: 0.47

Model: SVR
  Train RMSE: 129658.54
  Test RMSE: 152767.27
  Train MAE: 58267.05
  Test MAE: 71677.26
  Train R^2: 0.80
  Test R^2: 0.76

Model: Random Forest
  Train RMSE: 16687.72
  Test RMSE: 32964.70
  Train MAE: 5060.30
  Test MAE: 11235.00
  Train R^2: 1.00
  Test R^2: 0.99

Model
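
The `evaluate_model` helper is imported from `functions_variables`, and its source is not shown in this notebook. Judging only from the call signature and the metric names in the output, a helper of this kind might look roughly like the sketch below. The name `evaluate_model_sketch`, its internals, and the synthetic demo data are all assumptions for illustration; the real function may differ (for example, it appears to print its own summary as well).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


def evaluate_model_sketch(model, X_train, X_test, y_train_scaled, y_test_scaled, scaler_target):
    """Hypothetical stand-in: fit on the scaled target, report metrics in original units."""
    model.fit(X_train, y_train_scaled)

    def to_original_units(values):
        # StandardScaler expects 2D input, so reshape before undoing the scaling
        return scaler_target.inverse_transform(np.asarray(values).reshape(-1, 1)).ravel()

    # (true, predicted) pairs per split, both converted back to dollars
    pairs = {
        "Train": (to_original_units(y_train_scaled), to_original_units(model.predict(X_train))),
        "Test": (to_original_units(y_test_scaled), to_original_units(model.predict(X_test))),
    }
    metric_fns = {
        "RMSE": lambda t, p: np.sqrt(mean_squared_error(t, p)),
        "MAE": mean_absolute_error,
        "R^2": r2_score,
    }
    return {
        f"{split} {name}": float(fn(*pairs[split]))
        for name, fn in metric_fns.items()
        for split in ("Train", "Test")
    }


# Tiny synthetic check (made-up, noise-free data; not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
y_demo = 250_000 + 40_000 * X_demo[:, 0] - 15_000 * X_demo[:, 1]
scaler_demo = StandardScaler()
y_demo_scaled = scaler_demo.fit_transform(y_demo.reshape(-1, 1)).ravel()

demo_metrics = evaluate_model_sketch(
    LinearRegression(), X_demo[:100], X_demo[100:],
    y_demo_scaled[:100], y_demo_scaled[100:], scaler_demo,
)
print(demo_metrics)
```

Inverting the target scaling before scoring is what keeps RMSE and MAE in dollars even though the models are trained on a standardized target.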

In [6]:
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)
print("y_train_scaled shape:", y_train_scaled.shape)
print("y_test_scaled shape:", y_test_scaled.shape)

X_train_scaled shape: (4511, 22)
X_test_scaled shape: (1128, 22)
y_train_scaled shape: (4511,)
y_test_scaled shape: (1128,)


In [7]:
# Create a DataFrame to store results
# (a separate name keeps the `results` dict intact if this cell is re-run)
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df)

                      Train RMSE      Test RMSE      Train MAE       Test MAE  \
Linear Regression  218347.610879  226088.633093  139473.132691  145219.734116   
SVR                129658.539475  152767.268657   58267.049321   71677.264279   
Random Forest       16687.718803   32964.704225    5060.303272   11235.000434   
XGBoost             12189.433116   31491.038541    8020.111574   12429.567944   

                   Train R^2  Test R^2  
Linear Regression   0.428063  0.466060  
SVR                 0.798324  0.756222  
Random Forest       0.996659  0.988649  
XGBoost             0.998218  0.989641  


Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars; can we actually relate that to an amount of error?
- Try root mean squared error (RMSE), which expresses error in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
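To make the questions above concrete, the following sketch computes RMSE, MAE, R^2, and adjusted R^2 with plain NumPy. The prices and predictions are made up for illustration; only the last line reuses numbers from this notebook (the XGBoost test R^2, the test set size, and the feature count). Adjusted R^2 uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)


def adjusted_r2(r2_value, n_samples, n_features):
    # Penalizes R^2 for the number of features used
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up sale prices and predictions, in dollars; every prediction is off by $10,000
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)
y_pred = np.array([210_000, 240_000, 310_000, 340_000, 410_000], dtype=float)

# Same predictions, except the last house is now missed by $200,000
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 600_000

print("No outlier   RMSE:", rmse(y_true, y_pred), " MAE:", mae(y_true, y_pred))
print("With outlier RMSE:", rmse(y_true, y_pred_outlier), " MAE:", mae(y_true, y_pred_outlier))
print("R^2 without outlier:", r2(y_true, y_pred))

# Adjusted R^2 for the XGBoost test score reported above (1128 rows, 22 features)
xgb_adjusted = adjusted_r2(0.989641, n_samples=1128, n_features=22)
print("Adjusted R^2:", xgb_adjusted)
```

Because RMSE squares each error before averaging, the single large miss raises RMSE far more than MAE. That is the sense in which RMSE is more sensitive to outliers.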

In [8]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
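As a starting point, two of the approaches named above (Lasso and RFE) can be sketched with scikit-learn as follows. The data is synthetic, generated with `make_regression`. The `alpha` value and the choice of three features to keep are arbitrary assumptions for illustration, not recommendations for the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which carry signal (made up for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=42
)

# Lasso: features whose coefficients are shrunk to (near) zero are dropped
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_demo, y_demo)
lasso_mask = lasso_selector.get_support()

# RFE: repeatedly refit and discard the weakest feature until 3 remain
rfe_selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)
rfe_mask = rfe_selector.get_support()

print("Lasso keeps feature indices:", np.flatnonzero(lasso_mask))
print("RFE keeps feature indices:  ", np.flatnonzero(rfe_mask))

# Reduced feature matrix that the models could be refit on
X_reduced = rfe_selector.transform(X_demo)
print("Reduced shape:", X_reduced.shape)
```

In the real notebook the selector would be fit on the training set only and then applied to the test set with `transform`, so that the test data does not influence which features are kept.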



In [9]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)