## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
from functions_variables import *


In [2]:
# Load the datasets
X_train = pd.read_csv('../data/processed/X_train.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
Y_train = pd.read_csv('../data/processed/Y_train.csv').squeeze()
Y_test = pd.read_csv('../data/processed/Y_test.csv').squeeze()


In [3]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", Y_train.shape)
print("y_test shape:", Y_test.shape)

X_train shape: (4511, 22)
X_test shape: (1128, 22)
y_train shape: (4511,)
y_test shape: (1128,)


In [4]:
scaler_features = StandardScaler()

columns_to_scale = [
    'description_baths', 'description_beds', 'description_garage',
    'description_sqft', 'description_stories', 'description_year_built',
    'year_sold', 'year_listed', 'days_on_market', 'city_frequency', 'state_frequency'
]

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[columns_to_scale] = scaler_features.fit_transform(X_train[columns_to_scale])
X_test_scaled[columns_to_scale] = scaler_features.transform(X_test[columns_to_scale])

scaler_target = StandardScaler()
y_train_scaled = scaler_target.fit_transform(Y_train.values.reshape(-1, 1)).flatten()
y_test_scaled = scaler_target.transform(Y_test.values.reshape(-1, 1)).flatten()



In [5]:
processed_data_path = '../data/processed/'
X_train_scaled.to_csv(processed_data_path + 'X_train_scaled.csv', index=False)
X_test_scaled.to_csv(processed_data_path + 'X_test_scaled.csv', index=False)

# Save y_train_scaled and y_test_scaled
pd.DataFrame(y_train_scaled, columns=["y_train_scaled"]).to_csv(processed_data_path + 'y_train_scaled.csv', index=False)
pd.DataFrame(y_test_scaled, columns=["y_test_scaled"]).to_csv(processed_data_path + 'y_test_scaled.csv', index=False)

In [6]:
# Loop through the models
models = {
    'Linear Regression': LinearRegression(),
    'SVR': SVR(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

results = {}

for model_name, model in models.items():
    results[model_name] = evaluate_model(model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display results
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")

LinearRegression:
  Train RMSE: $218190.02, Test RMSE: $226027.37
  Train MAE: $139374.83, Test MAE: $144462.21
  Train R^2: 0.43, Test R^2: 0.47
SVR:
  Train RMSE: $129111.68, Test RMSE: $153045.84
  Train MAE: $58488.81, Test MAE: $72049.81
  Train R^2: 0.80, Test R^2: 0.76
RandomForestRegressor:
  Train RMSE: $16528.21, Test RMSE: $34287.19
  Train MAE: $5033.59, Test MAE: $11419.90
  Train R^2: 1.00, Test R^2: 0.99
XGBRegressor:
  Train RMSE: $11964.43, Test RMSE: $27303.23
  Train MAE: $8008.68, Test MAE: $11758.70
  Train R^2: 1.00, Test R^2: 0.99

Model: Linear Regression
  Train RMSE: 218190.02
  Test RMSE: 226027.37
  Train MAE: 139374.83
  Test MAE: 144462.21
  Train R^2: 0.43
  Test R^2: 0.47

Model: SVR
  Train RMSE: 129111.68
  Test RMSE: 153045.84
  Train MAE: 58488.81
  Test MAE: 72049.81
  Train R^2: 0.80
  Test R^2: 0.76

Model: Random Forest
  Train RMSE: 16528.21
  Test RMSE: 34287.19
  Train MAE: 5033.59
  Test MAE: 11419.90
  Train R^2: 1.00
  Test R^2: 0.99

Model
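
`evaluate_model` is imported from `functions_variables`, so its body is not visible in this notebook. The sketch below shows one way such a helper could work: fit on the scaled target, convert predictions back to dollars with the target scaler, then compute RMSE, MAE and \( R^2 \) for both splits. The name `evaluate_model_sketch` and everything inside it are assumptions for illustration, not the actual implementation.

```python
import numpy as np


def evaluate_model_sketch(model, X_train, X_test, y_train_scaled, y_test_scaled, scaler_target):
    """Hypothetical stand-in for evaluate_model: metrics in original target units."""
    model.fit(X_train, y_train_scaled)

    def to_original(values):
        # StandardScaler-style objects expect a 2-D column, so reshape and flatten back.
        column = np.asarray(values, dtype=float).reshape(-1, 1)
        return np.asarray(scaler_target.inverse_transform(column)).ravel()

    per_split = {}
    splits = (("Train", X_train, y_train_scaled), ("Test", X_test, y_test_scaled))
    for split, X, y_scaled in splits:
        y_true = to_original(y_scaled)
        y_pred = to_original(model.predict(X))
        residuals = y_true - y_pred
        ss_res = float(np.sum(residuals ** 2))
        ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # assumes y is not constant
        per_split[split] = {
            "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
            "MAE": float(np.mean(np.abs(residuals))),
            "R^2": 1.0 - ss_res / ss_tot,
        }

    # Same key order as the printed results: Train/Test RMSE, then MAE, then R^2.
    return {
        f"{split} {name}": per_split[split][name]
        for name in ("RMSE", "MAE", "R^2")
        for split in ("Train", "Test")
    }
```

Because the errors are computed after `inverse_transform`, RMSE and MAE come out in dollars even though the model itself was trained on the standardized target.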

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

### **Summary of Metrics**

1. **What does RMSE do to outliers?**
   - RMSE penalizes larger errors more heavily due to squaring, making it sensitive to outliers. It’s useful when minimizing large deviations is critical.

2. **Is MAE a good metric for this problem?**
   - MAE is a good metric as it reflects the average prediction error in dollars and is less sensitive to outliers. However, it might underemphasize large errors.

3. **What about \( R^2 \) and Adjusted \( R^2 \)?**
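   - \( R^2 \): Measures the share of variance in price explained by the model. For least-squares models it never decreases on the training data when features are added, so on its own it can reward overfitting.
   - Adjusted \( R^2 \): Penalizes feature count, making it a fairer basis for comparing models with different numbers of features.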

4. **Reasons for Choosing Metrics**:
   - **RMSE**: Penalizes large errors more heavily, so it flags models that miss badly on a few homes.
   - **MAE**: Interpretable in dollars, ideal for understanding average error.
   - **\( R^2 \)**: Supplements error metrics by explaining variance in house prices.

Together, these metrics provide a balance between error magnitude, sensitivity to outliers, and variance explanation.
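
To make the difference between the metrics concrete, here is a minimal sketch in plain Python. The prices and predictions are hypothetical; the only numbers taken from this notebook are the test-set size (1128 rows, 22 features) and the Linear Regression test \( R^2 \) of 0.47 used in the adjusted \( R^2 \) call.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n_samples, n_features):
    # Shrinks R^2 as the number of features grows relative to the sample size.
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical sale prices in dollars.
actual = [200_000, 300_000, 400_000, 500_000]
close = [210_000, 290_000, 410_000, 490_000]     # every prediction off by 10k
one_miss = [200_000, 300_000, 400_000, 540_000]  # three exact, one off by 40k

print("all close:", mae(actual, close), rmse(actual, close))
print("one miss: ", mae(actual, one_miss), rmse(actual, one_miss))
print("adjusted R^2:", adjusted_r2(0.47, n_samples=1128, n_features=22))
```

Both prediction sets have the same MAE (10,000), but the single large miss doubles the RMSE (20,000 versus 10,000), which is the outlier sensitivity described above. With 1128 rows and 22 features the adjustment is small: an \( R^2 \) of 0.47 becomes roughly 0.46.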



In [7]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
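
As a rough illustration of what forward selection does, the sketch below greedily adds whichever remaining feature improves a score the most until `k` features are chosen. The `toy_score` function and its weights are hypothetical stand-ins; a real selector such as scikit-learn's `SequentialFeatureSelector` scores each candidate subset with cross-validation instead.

```python
from typing import Callable, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    k: int,
) -> List[str]:
    """Greedy forward selection: grow the subset one best feature at a time."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the one with the highest score.
        best = max(remaining, key=lambda feature: score(selected + [feature]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical usefulness weights, for illustration only.
weights = {
    "description_sqft": 0.30,
    "description_baths": 0.20,
    "description_beds": 0.05,
    "description_garage": 0.02,
}


def toy_score(subset: List[str]) -> float:
    return sum(weights[name] for name in subset)


print(forward_select(list(weights), toy_score, k=2))
```

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.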



### **Feature Selection Analysis**

1. **Perform feature selection to get a reduced subset of your original features**:
   - Feature selection was performed using **forward and backward sequential selection**, **Recursive Feature Elimination (RFE)** (which ranks features by coefficients or by Random Forest and XGBoost feature importances), and **Lasso**.
   - Key features identified included:
     - `description_baths`, `description_sqft`, `fireplace`, `view`, `community_security_features`, `state_frequency`, `days_on_market`.

2. **Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?**
   - After applying feature selection:
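     - **Linear Regression** got slightly worse with five features: test RMSE rose from about 226k dollars (all 22 features) to about 236k with forward or backward selection and about 240k with RFE, and test \( R^2 \) fell from 0.47 to 0.42 and 0.40. Lasso with `alpha=0.001` on the full feature set matched the baseline (test RMSE about 226k).
     - Non-linear models reacted differently to RFE. **Random Forest** improved (test RMSE from about 34k to about 28k dollars, test \( R^2 \) unchanged at 0.99), while **XGBoost** degraded (test RMSE from about 27k to about 55k, test \( R^2 \) from 0.99 to 0.97). Tree ensembles tolerate irrelevant features, but cutting to five features is not free for every model.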

3. **Based on this, should you include feature selection in your final pipeline? Explain.**
   - For **simpler linear models**, feature selection cuts the model from 22 features to 5 at a cost of roughly 10,000 to 14,000 dollars in test RMSE, so it is only worthwhile if a smaller, more interpretable model matters more than accuracy.
   - For **non-linear models (Random Forest, XGBoost)**, feature selection is less critical:
     - Tree-based models pick the most informative feature at each split, so irrelevant features are rarely used.
   - **Final Decision**:
     - Feature selection can simplify the pipeline and reduce computational cost, but it is probably not worth including in the final pipeline: our **best models (Random Forest, XGBoost)** already achieve excellent performance with the full feature set, and reducing XGBoost to five features with RFE roughly doubled its test RMSE.


In [8]:
# Feature selection
base_model = LinearRegression()

forward_selector = SequentialFeatureSelector(
    estimator=base_model,
    n_features_to_select=5,  
    direction='forward',  
    scoring='neg_root_mean_squared_error',
    cv=5,
    n_jobs=-1
)

forward_selector.fit(X_train_scaled, y_train_scaled)

selected_features_forward = X_train_scaled.columns[forward_selector.get_support()]
print("Selected Features (Forward Selection):", selected_features_forward)

X_train_forward = X_train_scaled[selected_features_forward]
X_test_forward = X_test_scaled[selected_features_forward]

Selected Features (Forward Selection): Index(['description_baths', 'description_sqft', 'community_security_features',
       'view', 'city_view'],
      dtype='object')


In [9]:
# Evaluate Linear Regression on forward-selected features
lr_model = LinearRegression()
results_forward = evaluate_model(lr_model, X_train_forward, X_test_forward, y_train_scaled, y_test_scaled, scaler_target)
print("\nResults with Forward-Selected Features:")
for metric, value in results_forward.items():
    print(f"{metric}: {value:.2f}")

LinearRegression:
  Train RMSE: $227418.64, Test RMSE: $236337.62
  Train MAE: $143736.68, Test MAE: $149732.18
  Train R^2: 0.38, Test R^2: 0.42

Results with Forward-Selected Features:
Train RMSE: 227418.64
Test RMSE: 236337.62
Train MAE: 143736.68
Test MAE: 149732.18
Train R^2: 0.38
Test R^2: 0.42


In [10]:
# Perform backward feature selection
backward_selector = SequentialFeatureSelector(
    estimator=base_model,
    n_features_to_select=5,  
    direction='backward',  
    scoring='neg_root_mean_squared_error',
    cv=5,
    n_jobs=-1
)

# Fit the selector on the training data
backward_selector.fit(X_train_scaled, y_train_scaled)

# Get the selected features
selected_features_backward = X_train_scaled.columns[backward_selector.get_support()]
print("Selected Features (Backward Selection):", selected_features_backward)

# Create new datasets with selected features
X_train_backward = X_train_scaled[selected_features_backward]
X_test_backward = X_test_scaled[selected_features_backward]

Selected Features (Backward Selection): Index(['description_baths', 'description_sqft', 'community_security_features',
       'view', 'city_view'],
      dtype='object')


In [11]:
# Evaluate Linear Regression on backward-selected features
results_backward = evaluate_model(lr_model, X_train_backward, X_test_backward, y_train_scaled, y_test_scaled, scaler_target)
print("\nResults with Backward-Selected Features:")
for metric, value in results_backward.items():
    print(f"{metric}: {value:.2f}")

LinearRegression:
  Train RMSE: $227418.64, Test RMSE: $236337.62
  Train MAE: $143736.68, Test MAE: $149732.18
  Train R^2: 0.38, Test R^2: 0.42

Results with Backward-Selected Features:
Train RMSE: 227418.64
Test RMSE: 236337.62
Train MAE: 143736.68
Test MAE: 149732.18
Train R^2: 0.38
Test R^2: 0.42


In [12]:
from sklearn.feature_selection import RFE

# Update SVR to use linear kernel
models = {
    'Linear Regression': LinearRegression(),
    'SVR (Linear Kernel)': SVR(kernel='linear'),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
}

results = {}

# Loop through each model
for model_name, model in models.items():
    print(f"Evaluating {model_name} with RFE...")
    
    try:
        rfe = RFE(estimator=model, n_features_to_select=5)  
        X_train_rfe = rfe.fit_transform(X_train_scaled, y_train_scaled)
        X_test_rfe = rfe.transform(X_test_scaled)
        
        results[model_name] = evaluate_model(model, X_train_rfe, X_test_rfe, y_train_scaled, y_test_scaled, scaler_target)
    except ValueError as e:
        print(f"Skipping RFE for {model_name}: {e}")
        continue

# Display results
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    for metric_name, metric_value in metrics.items():
        print(f"  {metric_name}: {metric_value:.2f}")

Evaluating Linear Regression with RFE...
LinearRegression:
  Train RMSE: $228864.27, Test RMSE: $239980.37
  Train MAE: $142403.77, Test MAE: $151402.69
  Train R^2: 0.37, Test R^2: 0.40
Evaluating SVR (Linear Kernel) with RFE...
SVR:
  Train RMSE: $234709.85, Test RMSE: $247980.11
  Train MAE: $136816.03, Test MAE: $146743.19
  Train R^2: 0.34, Test R^2: 0.36
Evaluating Random Forest with RFE...
RandomForestRegressor:
  Train RMSE: $13885.41, Test RMSE: $28044.96
  Train MAE: $4359.68, Test MAE: $9821.09
  Train R^2: 1.00, Test R^2: 0.99
Evaluating XGBoost with RFE...
XGBRegressor:
  Train RMSE: $34858.94, Test RMSE: $54804.55
  Train MAE: $21503.12, Test MAE: $28557.56
  Train R^2: 0.99, Test R^2: 0.97

Model: Linear Regression
  Train RMSE: 228864.27
  Test RMSE: 239980.37
  Train MAE: 142403.77
  Test MAE: 151402.69
  Train R^2: 0.37
  Test R^2: 0.40

Model: SVR (Linear Kernel)
  Train RMSE: 234709.85
  Test RMSE: 247980.11
  Train MAE: 136816.03
  Test MAE: 146743.19
  Train R^2: 
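
To compare the RFE runs with the baseline at a glance, the sketch below computes the relative change in test RMSE from the values printed above. SVR is left out because the baseline used the default RBF kernel while the RFE run used a linear kernel, so the two numbers are not directly comparable. The helper name `percent_change` is introduced here for illustration.

```python
# Test RMSE in dollars, copied from the printed results above.
baseline_test_rmse = {
    "Linear Regression": 226027.37,
    "Random Forest": 34287.19,
    "XGBoost": 27303.23,
}
rfe_test_rmse = {
    "Linear Regression": 239980.37,
    "Random Forest": 28044.96,
    "XGBoost": 54804.55,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


changes = {
    name: percent_change(baseline_test_rmse[name], rfe_test_rmse[name])
    for name in baseline_test_rmse
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}% test RMSE with 5 RFE features")
```

A positive value means the error grew after reducing to five features. Linear Regression gets about 6% worse, Random Forest about 18% better, and XGBoost's test RMSE roughly doubles.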

In [13]:
# Define the Lasso Regression model
lasso_model = Lasso()

lasso_param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10] 
}

# Set up GridSearchCV
lasso_grid_search = GridSearchCV(
    estimator=lasso_model,
    param_grid=lasso_param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

lasso_grid_search.fit(X_train_scaled, y_train_scaled)

# Display the best hyperparameters and RMSE
print("Best Parameters (Lasso):", lasso_grid_search.best_params_)
print("Best RMSE (Lasso):", -lasso_grid_search.best_score_)

# Evaluate the model on the train and test sets
best_lasso_model = lasso_grid_search.best_estimator_
results_lasso = evaluate_model(best_lasso_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_lasso.items():
    print(f"{metric}: {value:.2f}")

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Parameters (Lasso): {'alpha': 0.001}
Best RMSE (Lasso): 0.760680448295455
Lasso:
  Train RMSE: $218209.82, Test RMSE: $226038.46
  Train MAE: $139110.00, Test MAE: $144218.92
  Train R^2: 0.43, Test R^2: 0.47
Train RMSE: 218209.82
Test RMSE: 226038.46
Train MAE: 139110.00
Test MAE: 144218.92
Train R^2: 0.43
Test R^2: 0.47
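
Note that `Best RMSE (Lasso)` (about 0.76) is in standardized target units, because `GridSearchCV` scored against `y_train_scaled`, whereas the other metrics are printed in dollars. The sketch below uses hypothetical prices to show that an RMSE measured on a standardized target converts back to dollars by multiplying by the target's standard deviation. In this notebook that factor would be `scaler_target.scale_[0]`, whose value is not printed here.

```python
import math
import statistics


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical prices and predictions in dollars.
prices = [200_000.0, 300_000.0, 400_000.0, 700_000.0]
predictions = [250_000.0, 280_000.0, 430_000.0, 640_000.0]

mean = statistics.fmean(prices)
std = statistics.pstdev(prices)  # population std, the convention StandardScaler uses


def standardize(values):
    return [(value - mean) / std for value in values]


rmse_dollars = rmse(prices, predictions)
rmse_scaled = rmse(standardize(prices), standardize(predictions))

print(f"scaled RMSE:              {rmse_scaled:.4f}")
print(f"scaled RMSE x target std: {rmse_scaled * std:,.2f}")
print(f"RMSE computed in dollars: {rmse_dollars:,.2f}")
```

The last two printed values agree, since subtracting the mean cancels in the residuals and dividing by the standard deviation rescales every error by the same factor.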


In [14]:
# Lasso (alpha=0.001, all features) matched the Linear Regression baseline
# (test RMSE about $226k, test R^2 0.47), so L1 regularization at this strength
# did not change performance.