## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from functions_variables import evaluate_model

In [None]:
# Storage path for processed data
processed_data_path = '../data/processed/'
# List of dataset names
set_names = ['X_train', 'X_test', 'y_train', 'y_test']
# Initialize dictionary to store datasets
datasets = {}
# Load the datasets
for i in set_names:
    datasets[i] = pd.read_csv(f'{processed_data_path}{i}.csv')
    print(f'{i} shape:', datasets[i].shape)

In [None]:
# Initialize scaler for features
scaler_features = StandardScaler()
# Initialize scaler for target
scaler_target = StandardScaler()

In [None]:
# Initialize dictionary to store scaled datasets
ds_scaled = {
    # Scale X_train and X_test
    'X_train': scaler_features.fit_transform(datasets['X_train']),
    'X_test': scaler_features.transform(datasets['X_test']),
    # Fit scaler on y_train and scale both y_train and y_test
    'y_train': scaler_target.fit_transform(
        datasets['y_train'].values.reshape(-1, 1)
    ).flatten(),
    'y_test': scaler_target.transform(
        datasets['y_test'].values.reshape(-1, 1)
    ).flatten()
}

In [None]:
# Debug: verify scaling by printing the first five values of each scaled set
for i in set_names:
    print(f'{i}_scaled head:', ds_scaled[i][:5])

In [None]:
# Initialize models
lr_model = LinearRegression()
svr_model = SVR()
rf_model = RandomForestRegressor(random_state=42)
xgb_model = XGBRegressor(random_state=42)

In [None]:
# Initialize dictionary to store results
results = {}
# List of model names
model_names = ['Linear Regression', 'SVR', 'Random Forest', 'XGBoost']

In [None]:
# Evaluate each model under its own name
models = [lr_model, svr_model, rf_model, xgb_model]
for name, model in zip(model_names, models):
    results[name] = evaluate_model(model, datasets, ds_scaled, scaler_target)

In [None]:
# Display results
for model, metrics in results.items():
    print(f"\nModel: {model}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")

In [None]:
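# Create a DataFrame from the results (one row per model);
# a separate name keeps the `results` dict intact if earlier cells are re-run
results_df = pd.DataFrame.from_dict(results, orient='index')

In [None]:
# Display results
print(results_df)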

Consider which metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units (dollars squared). Can we actually relate that to the size of a typical error?
- Try root mean squared error (RMSE), which puts the error back in the original units (dollars).
- How does RMSE treat outliers?
- Is mean absolute error (MAE) a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
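
To make the questions above concrete, here is a minimal sketch that computes RMSE, MAE, R^2 and adjusted R^2 by hand. The helper name `regression_metrics`, the `n_features` argument and the price lists are all hypothetical, made up for illustration and not taken from the project data or from `functions_variables`. It uses only the standard library so the formulas stay visible; in the notebook itself, `sklearn.metrics` would be the more natural choice.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions.

    Hypothetical helper for illustration only.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e ** 2 for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    rmse = math.sqrt(ss_res / n)                    # same units as the target
    mae = sum(abs(e) for e in errors) / n           # same units as the target
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}


# Made-up prices in dollars (assumption, not project data).
y_true = [200_000, 250_000, 300_000, 350_000, 400_000]
# Every prediction is off by 10,000.
y_even_errors = [210_000, 240_000, 310_000, 340_000, 410_000]
# Four perfect predictions and a single 50,000 miss.
y_one_outlier = [200_000, 250_000, 300_000, 350_000, 450_000]

even = regression_metrics(y_true, y_even_errors, n_features=2)
outlier = regression_metrics(y_true, y_one_outlier, n_features=2)

for label, metrics in [("even errors", even), ("one outlier", outlier)]:
    print(label)
    for name, value in metrics.items():
        print(f"  {name}: {value:.2f}")
```

Both sets of predictions have the same MAE, but the set with a single large miss has a noticeably higher RMSE, because squaring the errors gives large misses more weight. That difference is worth keeping in mind when deciding which metric better reflects success for this problem.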

In [None]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features.
- Refit your models with this reduced dimensionality. How does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain your reasoning.

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
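
As one possible starting point, the sketch below runs recursive feature elimination (RFE) with scikit-learn and compares test R^2 before and after. It is an illustration under stated assumptions, not the project's pipeline: the data comes from `make_regression` as a synthetic stand-in, and the choice of 5 features to keep is arbitrary. With the real data you would substitute the processed `X_train`/`X_test` and your chosen metrics.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 20 features, only 5 carry signal.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: fit on the full feature set.
full_model = LinearRegression().fit(X_train, y_train)
r2_full = r2_score(y_test, full_model.predict(X_test))

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain. Fit on training data only.
selector = RFE(LinearRegression(), n_features_to_select=5)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Refit on the reduced feature set and compare on the same metric.
reduced_model = LinearRegression().fit(X_train_sel, y_train)
r2_reduced = r2_score(y_test, reduced_model.predict(X_test_sel))

print(f"Full    ({X_train.shape[1]} features): R^2 = {r2_full:.3f}")
print(f"Reduced ({X_train_sel.shape[1]} features): R^2 = {r2_reduced:.3f}")
print("Kept feature indices:", list(selector.get_support(indices=True)))
```

Fitting the selector on the training split only, then applying the same `transform` to the test split, keeps test information out of the selection step. If the reduced model scores about the same as the full one, that supports keeping feature selection in the final pipeline.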



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)