## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
# Import libraries, models, and metrics
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Paths to the training and test data files
X_train_path = 'D:/Documents/GitHub/DS_Midterm_Project_CB/processed/X_train.csv'
X_test_path = 'D:/Documents/GitHub/DS_Midterm_Project_CB/processed/X_test.csv'
y_train_path = 'D:/Documents/GitHub/DS_Midterm_Project_CB/processed/y_train.csv'
y_test_path = 'D:/Documents/GitHub/DS_Midterm_Project_CB/processed/y_test.csv'

# Load the data (the first column of each y file is taken as the target Series)
X_train = pd.read_csv(X_train_path)
X_test = pd.read_csv(X_test_path)
y_train = pd.read_csv(y_train_path).iloc[:, 0]
y_test = pd.read_csv(y_test_path).iloc[:, 0]


In [2]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Machines': SVR(),
    'Random Forest': RandomForestRegressor(),
    'XGBoost': XGBRegressor()
}

# Train models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} model trained.")

Linear Regression model trained.
Support Vector Machines model trained.
Random Forest model trained.
XGBoost model trained.


Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units. Can we actually relate it to the size of the error?
- Try root mean squared error, which puts the error back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
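
To make the questions above concrete, here is a small sketch that compares MAE and RMSE on a handful of made-up prices and adds a helper for adjusted R^2. The price values and the helper names (`rmse`, `mae`, `adjusted_r2`) are illustrative assumptions, not part of the project data or code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical prices: four predictions miss by $10,000, one misses by $200,000
y_true = np.array([300_000, 450_000, 250_000, 500_000, 1_000_000])
y_pred = np.array([310_000, 440_000, 260_000, 490_000, 800_000])

print(f"MAE:  {mae(y_true, y_pred):,.0f}")
print(f"RMSE: {rmse(y_true, y_pred):,.0f}")
```

Because RMSE squares each error before averaging, the single large miss pulls it well above MAE, so a large gap between the two suggests that a few big errors dominate. Adjusted R^2 penalizes R^2 for the number of features, which matters more when the feature count is large relative to the number of rows.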

In [3]:
# gather evaluation metrics and compare results

# Define a function to evaluate models
def evaluate_model(model, X_test, y_test):
    """Return MSE, RMSE, MAE, and R^2 for a fitted model on the test set."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, rmse, mae, r2

# Evaluate each model on the test set

# Initialize dictionary to store results
results = {}
for name, model in models.items():
    mse, rmse, mae, r2 = evaluate_model(model, X_test, y_test)
    results[name] = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R^2': r2
    }

# Print results
for name, metrics in results.items():
    print(f"Model: {name}")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.4f}")
    print()

Model: Linear Regression
  MSE: 95380048311.5077
  RMSE: 308836.6046
  MAE: 167770.9930
  R^2: 0.4776

Model: Support Vector Machines
  MSE: 190017618079.8647
  RMSE: 435910.1032
  MAE: 214474.7731
  R^2: -0.0407

Model: Random Forest
  MSE: 2373125381.9262
  RMSE: 48714.7348
  MAE: 13311.2005
  R^2: 0.9870

Model: XGBoost
  MSE: 1254897700.6995
  RMSE: 35424.5353
  MAE: 12321.7226
  R^2: 0.9931
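
The per-model printout above can be easier to compare side by side. As one option, the sketch below collects the same numbers into a pandas DataFrame sorted by RMSE. The values are copied from the output above; in the notebook itself, the existing `results` dictionary could be passed in directly. The variable name `comparison` is an assumption.

```python
import pandas as pd

# Test-set metrics copied from the printed output above
results = {
    'Linear Regression': {'RMSE': 308836.6046, 'MAE': 167770.9930, 'R^2': 0.4776},
    'Support Vector Machines': {'RMSE': 435910.1032, 'MAE': 214474.7731, 'R^2': -0.0407},
    'Random Forest': {'RMSE': 48714.7348, 'MAE': 13311.2005, 'R^2': 0.9870},
    'XGBoost': {'RMSE': 35424.5353, 'MAE': 12321.7226, 'R^2': 0.9931},
}

# One row per model, best (lowest) RMSE first
comparison = pd.DataFrame(results).T.sort_values('RMSE')
print(comparison.round(4))
```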



Model: Linear Regression - MSE and RMSE are very high (RMSE of about $309,000), and the model
explains only about 48% of the variance in the test data. Not a good candidate.

Model: Support Vector Machines - R^2 is negative (-0.04), meaning the model does worse than always
predicting the mean of the test target. It also has the highest MSE, RMSE, and MAE of the four
models, so it underperforms linear regression. Not a good candidate.

Model: Random Forest - R^2 is very high: the model explains 98.7% of the variance in the test data,
with far lower MSE and RMSE (RMSE of about $48,700) than the two models above. A good candidate
for prediction.

Model: XGBoost - The best R^2 and the lowest MSE, RMSE, and MAE of the four models. It explains
99.3% of the variance in the test data (RMSE of about $35,400), making it the best candidate for
prediction.
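
One plausible explanation for the negative R^2 of the support vector model, not verified on this dataset, is scaling: `SVR` with its default settings is sensitive to the scale of both the features and the target, and a target in dollars is very large relative to the default `C=1.0` and `epsilon=0.1`. The sketch below illustrates the effect on synthetic stand-in data (an assumption, not the project data) by wrapping `SVR` in a pipeline that standardizes the features and the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): features on very different scales
# and a target in a dollar-like range
X = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 10_000.0])
y = (500_000 + 80_000 * X[:, 0] + 600 * X[:, 1] + 5 * X[:, 2]
     + rng.normal(scale=20_000, size=400))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default SVR fitted on the raw features and raw target
raw_svr = SVR().fit(X_train, y_train)

# Same SVR, but features and target are standardized before fitting;
# predictions are transformed back to the original units automatically
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(X_train, y_train)

r2_raw = r2_score(y_test, raw_svr.predict(X_test))
r2_scaled = r2_score(y_test, scaled_svr.predict(X_test))
print(f"R^2, raw SVR:    {r2_raw:.4f}")
print(f"R^2, scaled SVR: {r2_scaled:.4f}")
```

If the same pattern holds on the real data, the support vector model may deserve a second look with scaling in place before it is ruled out.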


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
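
As a starting point, the steps above could be sketched with Lasso-based selection via scikit-learn's `SelectFromModel`. This is a minimal sketch on synthetic stand-in data, not the project data; `alpha=5.0`, the random forest settings, and the helper name `holdout_rmse` are illustrative assumptions that would need adjusting for the real features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 30 features, only 5 carry signal
X, y = make_regression(n_samples=400, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Lasso penalizes coefficient size, so standardize features first
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Keep only the features whose Lasso coefficient is not shrunk to zero
selector = SelectFromModel(Lasso(alpha=5.0, max_iter=10_000))
selector.fit(X_train_s, y_train)
mask = selector.get_support()
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)
print(f"Kept {mask.sum()} of {mask.size} features")

def holdout_rmse(X_tr, X_te):
    """Fit a random forest on X_tr and return its RMSE on the held-out set."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_te))))

rmse_full = holdout_rmse(X_train_s, X_test_s)
rmse_reduced = holdout_rmse(X_train_sel, X_test_sel)
print(f"RMSE, all features:      {rmse_full:.2f}")
print(f"RMSE, selected features: {rmse_reduced:.2f}")
```

The selector is fitted on the training split only, so the test set plays no part in choosing features. Comparing the two RMSE values on the real data is what answers whether feature selection belongs in the final pipeline.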



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)