## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from functions_variables import evaluate_model

In [2]:
# Storage path for processed data
processed_data_path = '../data/processed/'
# List of dataset names
set_names = ['X_train', 'X_test', 'y_train', 'y_test']
# Initialize dictionary to store datasets
datasets = {}
# Load the datasets
for i in set_names:
    datasets[i] = pd.read_csv(f'{processed_data_path + i}.csv')
    print(f'{i} shape:', datasets[i].shape)

X_train shape: (4511, 22)
X_test shape: (1128, 22)
y_train shape: (4511, 1)
y_test shape: (1128, 1)


In [3]:
# Initialize scaler for features
scaler_features = StandardScaler()
# Initialize scaler for target
scaler_target = StandardScaler()

In [4]:
# Initialize dictionary to store scaled datasets
ds_scaled = {
    # Scale X_train and X_test
    'X_train': scaler_features.fit_transform(datasets['X_train']),
    'X_test': scaler_features.transform(datasets['X_test']),
    # Fit scaler on y_train and scale both y_train and y_test
    'y_train': scaler_target.fit_transform(
        datasets['y_train'].values.reshape(-1, 1)
    ).flatten(),
    'y_test': scaler_target.transform(
        datasets['y_test'].values.reshape(-1, 1))
    .flatten()
}

In [5]:
#Debug: Verify scaling
for i in set_names:
    print(f'{i}_scaled shape:', ds_scaled[i][:5])

X_train_scaled shape: [[-0.08985057  0.7345682  -0.13421245 -0.17248103 -0.06180149  0.87604566
   2.23561988  1.72047681  0.59030787  0.72263244  0.31041699  0.84261948
   0.93619224  1.04327653  1.03819281  1.32500965 -0.57129204  1.8749708
  -0.51409505 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.82333722  0.53014859  0.17651155  1.4946721
   0.6676968   0.6892423  -0.21809492 -0.12078152 -0.45994152 -1.18677532
   0.93619224 -0.95851864 -0.96321222 -0.75471148  1.75041823 -0.53334164
  -0.51409505 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.42437025 -0.37323235 -1.09449133  1.07288134
   0.6676968   0.6892423   0.59030787  0.72263244  0.73995678 -1.18677532
   0.93619224  1.04327653  1.03819281  1.32500965  1.75041823 -0.53334164
   1.94516558 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.2067519  -2.48112122  0.93116951  0.76356812
  -0.90022629  0.6892423   0.59030787  1.5660464   0.70316999  0.84261948
   0.93619224

In [6]:
# Initialize models
lr_model = LinearRegression()
svr_model = SVR()
rf_model = RandomForestRegressor(random_state=42)
xgb_model = XGBRegressor(random_state=42)

In [7]:
# Initialize dictionary to store results
results = {}
# List of model names
model_names = ['Linear Regression', 'SVR', 'Random Forest', 'XGBoost']

In [8]:
# Evaluate models
for m in model_names:
    results[m] = evaluate_model(lr_model, datasets, ds_scaled, scaler_target)

LinearRegression:
  Train RMSE: $218101.14, Test RMSE: $225950.79
  Train MAE: $139426.01, Test MAE: $144890.05
  Train R^2: 0.43, Test R^2: 0.47
LinearRegression:
  Train RMSE: $218101.14, Test RMSE: $225950.79
  Train MAE: $139426.01, Test MAE: $144890.05
  Train R^2: 0.43, Test R^2: 0.47
LinearRegression:
  Train RMSE: $218101.14, Test RMSE: $225950.79
  Train MAE: $139426.01, Test MAE: $144890.05
  Train R^2: 0.43, Test R^2: 0.47
LinearRegression:
  Train RMSE: $218101.14, Test RMSE: $225950.79
  Train MAE: $139426.01, Test MAE: $144890.05
  Train R^2: 0.43, Test R^2: 0.47


In [9]:
# Display results
for model, metrics in results.items():
    print(f"\nModel: {model}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")


Model: Linear Regression
  Train RMSE: 218101.14
  Test RMSE: 225950.79
  Train MAE: 139426.01
  Test MAE: 144890.05
  Train R^2: 0.43
  Test R^2: 0.47

Model: SVR
  Train RMSE: 218101.14
  Test RMSE: 225950.79
  Train MAE: 139426.01
  Test MAE: 144890.05
  Train R^2: 0.43
  Test R^2: 0.47

Model: Random Forest
  Train RMSE: 218101.14
  Test RMSE: 225950.79
  Train MAE: 139426.01
  Test MAE: 144890.05
  Train R^2: 0.43
  Test R^2: 0.47

Model: XGBoost
  Train RMSE: 218101.14
  Test RMSE: 225950.79
  Train MAE: 139426.01
  Test MAE: 144890.05
  Train R^2: 0.43
  Test R^2: 0.47


In [10]:
# Create a DataFrame to store results
results = pd.DataFrame.from_dict(results, orient='index')

In [11]:
# Display results
print(results)

                      Train RMSE      Test RMSE      Train MAE       Test MAE  \
Linear Regression  218101.136298  225950.787519  139426.014972  144890.054491   
SVR                218101.136298  225950.787519  139426.014972  144890.054491   
Random Forest      218101.136298  225950.787519  139426.014972  144890.054491   
XGBoost            218101.136298  225950.787519  139426.014972  144890.054491   

                   Train R^2  Test R^2  
Linear Regression   0.429354  0.466711  
SVR                 0.429354  0.466711  
Random Forest       0.429354  0.466711  
XGBoost             0.429354  0.466711  


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [12]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [13]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)