## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import joblib
from functions_variables import evaluate_model


In [2]:
# Load the datasets (targets are converted to NumPy arrays of shape (n, 1))
DATA_DIR = '/home/t0si/LHL-Midterm-Project/notebooks/processed'
X_train = pd.read_csv(f'{DATA_DIR}/X_train.csv')
X_test = pd.read_csv(f'{DATA_DIR}/X_test.csv')
y_train = pd.read_csv(f'{DATA_DIR}/y_train.csv').values
y_test = pd.read_csv(f'{DATA_DIR}/y_test.csv').values

# Load the fitted scaler for the target variable
scaler_target = joblib.load('scaler_target.pkl')

In [None]:
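# Fit the scaler on the training target only, then apply it to the test target.
# Note: this replaces the scaler loaded in the previous cell and overwrites
# 'scaler_target.pkl', so run it only if y_train is still in its original units.
# StandardScaler and joblib are already imported in the first cell.
scaler_target = StandardScaler()
y_train_scaled = scaler_target.fit_transform(y_train.reshape(-1, 1))
y_test_scaled = scaler_target.transform(y_test.reshape(-1, 1))

# Save the fitted scaler for reuse
joblib.dump(scaler_target, 'scaler_target.pkl')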

['scaler_target.pkl']

In [4]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Initialize models
lr_model = LinearRegression()
svr_model = SVR()
rf_model = RandomForestRegressor(random_state=42)
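xgb_model = XGBRegressor(tree_method='hist', random_state=42)  # Histogram-based tree method (CPU unless a GPU device is configured)

# Fit XGBoost on the flattened (1-D) training target
xgb_model.fit(X_train, y_train.ravel())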

# Initialize results dictionary
results = {}

# Evaluate models
results['Linear Regression'] = evaluate_model(lr_model, X_train, X_test, y_train, y_test, scaler_target)
results['SVR'] = evaluate_model(svr_model, X_train, X_test, y_train, y_test, scaler_target)
results['Random Forest'] = evaluate_model(rf_model, X_train, X_test, y_train, y_test, scaler_target)
results['XGBoost'] = evaluate_model(xgb_model, X_train, X_test, y_train, y_test, scaler_target)

# Display results
print("Evaluation Results:")
print(results)



LinearRegression:
  Train RMSE: $0.75, Test RMSE: $0.78
  Train MAE: $0.48, Test MAE: $0.50
  Train R^2: 0.43, Test R^2: 0.47




SVR:
  Train RMSE: $0.45, Test RMSE: $0.53
  Train MAE: $0.20, Test MAE: $0.25
  Train R^2: 0.80, Test R^2: 0.76
RandomForestRegressor:
  Train RMSE: $0.06, Test RMSE: $0.11
  Train MAE: $0.02, Test MAE: $0.04
  Train R^2: 1.00, Test R^2: 0.99
XGBRegressor:
  Train RMSE: $0.04, Test RMSE: $0.11
  Train MAE: $0.03, Test MAE: $0.04
  Train R^2: 1.00, Test R^2: 0.99
Evaluation Results:
{'Linear Regression': {'Train RMSE': 0.7548294138434759, 'Test RMSE': 0.7815421807186136, 'Train MAE': 0.4827835786117934, 'Test MAE': 0.5025264577553163, 'Train R^2': 0.4302325559967145, 'Test R^2': 0.46815087548842516}, 'SVR': {'Train RMSE': 0.4488613347061851, 'Test RMSE': 0.53022693272083, 'Train MAE': 0.203190416304828, 'Test MAE': 0.25032574643785677, 'Train R^2': 0.798523502205782, 'Test R^2': 0.755202391015127}, 'Random Forest': {'Train RMSE': 0.05681536866568538, 'Test RMSE': 0.11244181091723067, 'Train MAE': 0.017274949424812943, 'Test MAE': 0.03805001119019502, 'Train R^2': 0.9967720138833822, 'T



In [5]:
# Create a DataFrame to store results
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df)

                   Train RMSE  Test RMSE  Train MAE  Test MAE  Train R^2  Test R^2
Linear Regression    0.754829   0.781542   0.482784  0.502526   0.430233  0.468151
SVR                  0.448861   0.530227   0.203190  0.250326   0.798524  0.755202
Random Forest        0.056815   0.112442   0.017275  0.038050   0.996772  0.988991
XGBoost              0.040827   0.107345   0.027558  0.040714   0.998333  0.989967
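
The `evaluate_model` helper comes from `functions_variables` and is not shown in this notebook. As a rough illustration only, the sketch below shows one way a helper with the same arguments could fit a model and report RMSE, MAE, and R^2 after undoing the target scaling. The name `evaluate_model_sketch`, its internals, and the synthetic data are all assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


def evaluate_model_sketch(model, X_train, X_test, y_train, y_test, scaler_target):
    """Hypothetical stand-in for evaluate_model: fit, predict, score in original units."""
    model.fit(X_train, np.ravel(y_train))
    scores = {}
    for split, X, y in (("Train", X_train, y_train), ("Test", X_test, y_test)):
        # Undo the target scaling so RMSE and MAE come out in the target's own units.
        y_true = scaler_target.inverse_transform(np.reshape(y, (-1, 1))).ravel()
        y_pred = scaler_target.inverse_transform(model.predict(X).reshape(-1, 1)).ravel()
        scores[f"{split} RMSE"] = float(np.sqrt(mean_squared_error(y_true, y_pred)))
        scores[f"{split} MAE"] = float(mean_absolute_error(y_true, y_pred))
        scores[f"{split} R^2"] = float(r2_score(y_true, y_pred))
    return scores


# Synthetic data (illustrative only): a linear target in dollar-like units.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_dollars = 250_000 + 40_000 * X[:, 0] - 15_000 * X[:, 1] + rng.normal(scale=5_000, size=200)

X_train, X_test = X[:150], X[150:]
# Fit the target scaler on the training split only, as in the notebook.
scaler = StandardScaler().fit(y_dollars[:150].reshape(-1, 1))
y_train = scaler.transform(y_dollars[:150].reshape(-1, 1))
y_test = scaler.transform(y_dollars[150:].reshape(-1, 1))

scores = evaluate_model_sketch(LinearRegression(), X_train, X_test, y_train, y_test, scaler)
print(scores)
```

Because the inverse of a `StandardScaler` is an affine map, RMSE and MAE computed on scaled values differ from their original-unit counterparts by the factor `scaler_target.scale_[0]`, while R^2 is unchanged. That makes it worth checking which units a reported error is in before labeling it in dollars.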


Consider which metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units. Can you actually relate it to the size of the error?
- Try root mean squared error (RMSE) so that the error is in the original units (dollars).
- How do outliers affect RMSE?
- Is mean absolute error (MAE) a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
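
The questions above can be explored with a small sketch that computes RMSE, MAE, R^2, and adjusted R^2 directly from their definitions. The function name `regression_metrics` and the toy numbers are assumptions made for illustration. They are not taken from this project's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    n = y_true.size

    rmse = np.sqrt(np.mean(residuals ** 2))       # squares errors, so large misses dominate
    mae = np.mean(np.abs(residuals))              # every unit of error counts equally
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"RMSE": rmse, "MAE": mae, "R^2": r2, "Adjusted R^2": adj_r2}


# Toy data (illustrative only): every prediction is off by 1 ...
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
y_pred_even = y_true + 1.0
# ... versus predictions that are exact except for a single miss of 6.
y_pred_outlier = y_true.copy()
y_pred_outlier[-1] += 6.0

even = regression_metrics(y_true, y_pred_even, n_features=2)
outlier = regression_metrics(y_true, y_pred_outlier, n_features=2)
print(even)
print(outlier)
```

Both toy prediction sets have the same MAE (1.0), but the single large miss raises RMSE from 1.0 to about 2.45. This is the sense in which RMSE is more sensitive to outliers than MAE.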

In [None]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
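
As a starting point, the sketch below shows two of the approaches named above, Lasso-based selection and recursive feature elimination (RFE), using scikit-learn. The synthetic data, the choice of `LinearRegression` as the RFE estimator, and the target of three features are assumptions for illustration. In this notebook the same pattern would be applied to `X_train` and `y_train`.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic data (illustrative only): 10 features, only the first 3 drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Lasso: the L1 penalty shrinks uninformative coefficients toward zero;
# SelectFromModel keeps the features whose coefficients stay non-negligible.
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)
lasso_features = np.flatnonzero(lasso_selector.get_support())

# RFE: repeatedly refit the estimator and drop the weakest feature until 3 remain.
rfe_selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
rfe_features = np.flatnonzero(rfe_selector.get_support())

# Reduced feature matrix to refit the models on.
X_reduced = rfe_selector.transform(X)

print("Lasso kept feature indices:", lasso_features)
print("RFE kept feature indices:", rfe_features)
print("Reduced shape:", X_reduced.shape)
```

Fit any selector on the training data only, then apply its `transform` to the test data, so that the test set does not influence which features are kept.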



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)