## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import joblib



In [2]:
# Load the datasets
X_train = pd.read_csv('../data/processed/X_train.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
y_train = pd.read_csv('../data/processed/y_train.csv').squeeze()
y_test = pd.read_csv('../data/processed/y_test.csv').squeeze()


In [3]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (4511, 22)
X_test shape: (1128, 22)
y_train shape: (4511,)
y_test shape: (1128,)


In [4]:


# Initialize scaler for features
scaler_features = StandardScaler()

# Scale X_train and X_test
X_train_scaled = scaler_features.fit_transform(X_train)
X_test_scaled = scaler_features.transform(X_test)

# Debug: Verify scaling
print("X_train_scaled sample:", X_train_scaled[:5])
print("X_test_scaled sample:", X_test_scaled[:5])

X_train_scaled sample: [[-0.08985057  0.7345682  -0.13421245 -0.17248103 -0.06180149  0.87604566
   2.23632519  1.72131451  0.58999561  0.72263244  0.31041699  0.84261948
   0.93619224  1.04327653  1.03819281  1.32500965 -0.57129204  1.8749708
  -0.51409505 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.82333722  0.53014859  0.17651155  1.4946721
   0.6674196   0.69108368 -0.21847043 -0.12078152 -0.45994152 -1.18677532
   0.93619224 -0.95851864 -0.96321222 -0.75471148  1.75041823 -0.53334164
  -0.51409505 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.42437025 -0.37323235 -1.09449133  1.07288134
   0.6674196   0.69108368  0.58999561  0.72263244  0.73995678 -1.18677532
   0.93619224  1.04327653  1.03819281  1.32500965  1.75041823 -0.53334164
   1.94516558 -0.47765194 -0.44513006 -0.45119313]
 [-0.08985057  0.7345682  -0.2067519  -2.48112122  0.93116951  0.76356812
  -0.90148599  0.69108368  0.58999561  1.5660464   0.70316999  0.84261948
   0.9361922

In [5]:
# Initialize scaler for target
scaler_target = StandardScaler()

# Fit scaler on y_train and scale both y_train and y_test
y_train_scaled = scaler_target.fit_transform(y_train.values.reshape(-1, 1)).flatten()
y_test_scaled = scaler_target.transform(y_test.values.reshape(-1, 1)).flatten()

# Debug: Verify scaling
print("y_train_scaled sample:", y_train_scaled[:5])
print("y_test_scaled sample:", y_test_scaled[:5])

y_train_scaled sample: [ 0.4292904   0.07946856 -0.06253833 -0.15259148 -0.73100979]
y_test_scaled sample: [-0.62710231  5.2263524  -0.02097534  0.46392623 -0.78296353]


In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score
from functions_variables import evaluate_model

# Initialize models
lr_model = LinearRegression()
svr_model = SVR()
rf_model = RandomForestRegressor(random_state=42)
xgb_model = XGBRegressor(random_state=42)


# Evaluate models
results = {}
results['Linear Regression'] = evaluate_model(lr_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
results['SVR'] = evaluate_model(svr_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
results['Random Forest'] = evaluate_model(rf_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
results['XGBoost'] = evaluate_model(xgb_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display results
for model, metrics in results.items():
    print(f"\nModel: {model}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")

LinearRegression:
  Train RMSE: $218025.10, Test RMSE: $225732.64
  Train MAE: $139356.19, Test MAE: $144963.06
  Train R^2: 0.43, Test R^2: 0.47
SVR:
  Train RMSE: $107672.00, Test RMSE: $126313.37
  Train MAE: $49811.57, Test MAE: $60180.12
  Train R^2: 0.86, Test R^2: 0.83
RandomForestRegressor:
  Train RMSE: $16008.09, Test RMSE: $36654.86
  Train MAE: $4856.74, Test MAE: $11692.58
  Train R^2: 1.00, Test R^2: 0.99
XGBRegressor:
  Train RMSE: $11071.82, Test RMSE: $32057.07
  Train MAE: $7502.54, Test MAE: $12078.16
  Train R^2: 1.00, Test R^2: 0.99

Model: Linear Regression
  Train RMSE: 218025.10
  Test RMSE: 225732.64
  Train MAE: 139356.19
  Test MAE: 144963.06
  Train R^2: 0.43
  Test R^2: 0.47

Model: SVR
  Train RMSE: 107672.00
  Test RMSE: 126313.37
  Train MAE: 49811.57
  Test MAE: 60180.12
  Train R^2: 0.86
  Test R^2: 0.83

Model: Random Forest
  Train RMSE: 16008.09
  Test RMSE: 36654.86
  Train MAE: 4856.74
  Test MAE: 11692.58
  Train R^2: 1.00
  Test R^2: 0.99

Model

In [None]:
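# Create a DataFrame to store results (one row per model, one column per metric).
# A separate name keeps the original `results` dict intact, so the cell can be re-run.
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df)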
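
The `evaluate_model` helper imported from `functions_variables` is not shown in this notebook. As a rough guide to what such a helper might do, here is a minimal sketch, assuming it fits the model on the scaled target, inverts the scaling with `scaler_target`, and reports metrics in dollars. The name `evaluate_model_sketch` and the synthetic demo data are hypothetical and are not the project's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


def evaluate_model_sketch(model, X_train, X_test, y_train_scaled, y_test_scaled, scaler_target):
    """Hypothetical stand-in for evaluate_model: fit, predict, score in original units."""
    model.fit(X_train, y_train_scaled)

    def to_dollars(values):
        # Undo the target scaling so errors are reported in the original units.
        return scaler_target.inverse_transform(np.asarray(values).reshape(-1, 1)).ravel()

    splits = {"Train": (X_train, y_train_scaled), "Test": (X_test, y_test_scaled)}
    preds = {name: (to_dollars(y), to_dollars(model.predict(X))) for name, (X, y) in splits.items()}

    scorers = {
        "RMSE": lambda y, p: np.sqrt(mean_squared_error(y, p)),
        "MAE": mean_absolute_error,
        "R^2": r2_score,
    }
    return {
        f"{split} {metric}": float(score(*preds[split]))
        for metric, score in scorers.items()
        for split in preds
    }


# Toy data (an assumption for illustration only, not the housing dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 300_000 + 50_000 * X_demo[:, 0] + rng.normal(scale=10_000, size=200)

# Fit the target scaler on the training portion only, as in the cells above.
scaler_demo = StandardScaler().fit(y_demo[:150].reshape(-1, 1))
y_demo_scaled = scaler_demo.transform(y_demo.reshape(-1, 1)).ravel()

demo_metrics = evaluate_model_sketch(
    LinearRegression(),
    X_demo[:150], X_demo[150:],
    y_demo_scaled[:150], y_demo_scaled[150:],
    scaler_demo,
)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.2f}")
```

Returning a flat dict of floats keeps the helper compatible with both the display loop and `pd.DataFrame.from_dict(..., orient='index')`.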

Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units (dollars squared). Can we actually interpret the size of that error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
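
To make these questions concrete, the sketch below compares RMSE and MAE on two made-up sets of predictions with the same total absolute error, and computes adjusted R^2 from a plain R^2. The prices are invented for illustration, and `adjusted_r2` is a hypothetical helper (scikit-learn does not provide one); the sample and feature counts are taken from the test-set shape printed earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Invented sale prices, for illustration only.
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)

# Case A: every prediction is off by $20k.
y_pred_even = y_true + 20_000

# Case B: four predictions are exact, one is off by $100k.
y_pred_outlier = y_true.copy()
y_pred_outlier[-1] += 100_000

comparison = {}
for label, y_pred in [("even errors", y_pred_even), ("one large error", y_pred_outlier)]:
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    comparison[label] = {"RMSE": rmse, "MAE": mae}
    print(f"{label}: RMSE={rmse:,.2f}  MAE={mae:,.2f}")

# Adjusted R^2 for a test R^2 of 0.83 with 1128 samples and 22 features.
adj = adjusted_r2(0.83, n_samples=1128, n_features=22)
print(f"adjusted R^2: {adj:.4f}")
```

Both cases have the same MAE, but RMSE is more than twice as large when the error is concentrated in one prediction, because squaring weights large residuals more heavily. With over a thousand samples and only 22 features, adjusted R^2 stays very close to R^2.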

In [None]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
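
As a starting point, here is a minimal sketch of two of the approaches named above, Lasso-based selection and RFE, using scikit-learn. It runs on synthetic data from `make_regression` rather than the housing data, and the Lasso `alpha` and the number of features kept by RFE are arbitrary assumptions that would need to be chosen for the real problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: 22 columns, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=22, n_informative=5, noise=10.0, random_state=42
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Lasso: the L1 penalty shrinks some coefficients to exactly zero;
# SelectFromModel keeps the features whose coefficients survive.
# alpha=1.0 is an arbitrary choice for this sketch.
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_demo_scaled, y_demo)
lasso_mask = lasso_selector.get_support()
print("Lasso keeps", int(lasso_mask.sum()), "features:", np.flatnonzero(lasso_mask))

# RFE: repeatedly fit the estimator and drop the weakest feature
# until the requested number remains (5 here, also an arbitrary choice).
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_demo_scaled, y_demo)
rfe_mask = rfe_selector.get_support()
print("RFE keeps", int(rfe_mask.sum()), "features:", np.flatnonzero(rfe_mask))

# Reduced feature matrix that models can be refit on.
X_reduced = rfe_selector.transform(X_demo_scaled)
print("Reduced shape:", X_reduced.shape)
```

On the real data, the selector would be fit on `X_train_scaled` only and then applied to `X_test_scaled` with `transform`, so the test set plays no part in choosing features.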



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)