## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os

# Load preprocessed data
file_path = os.path.join('..', 'data', 'processed', 'preprocessed_data.csv')
data = pd.read_csv(file_path)

# Define target and features
target_column = 'list_price'  # or another target column if different
y = data[target_column]
X = data.drop(columns=[target_column])

# Drop the 'city', 'state', 'type', 'sold_date', 'tags', and 'filtered_tags' columns from the features
X = X.drop(columns=['city', 'state', 'type', 'sold_date', 'tags', 'filtered_tags'])

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print shapes to verify
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


X_train shape: (1968, 23)
X_test shape: (492, 23)
y_train shape: (1968,)
y_test shape: (492,)


Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared dollars. Can you actually relate to that amount of error?
- Try root mean squared error (RMSE) so that the error is back in the original units (dollars).
- How does RMSE treat outliers compared with mean absolute error?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
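The metrics listed above can be compared side by side on a small example. The following is a minimal sketch, not part of the original notebook: the helper name `regression_metrics` and the toy prices are made up for illustration, and the adjusted R^2 line uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2, and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    # RMSE is the square root of MSE, so it is back in the target's units.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    # Adjusted R^2 penalizes extra features; it requires n > n_features + 1.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}


# Toy prices (hypothetical): four predictions miss by 10,000, one by 100,000.
y_true = [200_000, 250_000, 300_000, 350_000, 400_000]
y_pred = [210_000, 240_000, 310_000, 340_000, 500_000]
metrics = regression_metrics(y_true, y_pred, n_features=2)
for name, value in metrics.items():
    print(f"{name}: {value:,.3f}")
```

In this toy case the single large miss pulls RMSE well above MAE, because squaring weights large errors more heavily. That difference is worth keeping in mind when deciding how much outliers should count in the evaluation.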

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [24]:
# Impute missing values in features (X)
X_imputer = SimpleImputer(strategy='mean')  # or strategy='median'
X_train_imputed = X_imputer.fit_transform(X_train)
X_test_imputed = X_imputer.transform(X_test)

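# Impute missing values in target (y)
# Caveat: imputed targets distort both training and evaluation; dropping rows
# with a missing target before the split is usually the safer choice.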
y_imputer = SimpleImputer(strategy='mean')  # or strategy='median'
y_train_imputed = y_imputer.fit_transform(y_train.values.reshape(-1, 1)).ravel()  # Flatten to 1D array
y_test_imputed = y_imputer.transform(y_test.values.reshape(-1, 1)).ravel()

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Machine': SVR(),
    'Random Forest': RandomForestRegressor(),
    'XGBoost': xgb.XGBRegressor()
}

# Train and evaluate each model
results = {}

for model_name, model in models.items():
    model.fit(X_train_imputed, y_train_imputed)
    y_pred = model.predict(X_test_imputed)
    mae = mean_absolute_error(y_test_imputed, y_pred)  # You can also use other metrics like RMSE or R2
    results[model_name] = mae

# Display results
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Mean Absolute Error'])
print(results_df)

                    Model  Mean Absolute Error
0       Linear Regression         12589.039777
1  Support Vector Machine         77841.204215
2           Random Forest          9256.975018
3                 XGBoost         10601.229121


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
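One way to approach the steps above is sketched here, under stated assumptions. It uses synthetic data from `make_regression` as a stand-in for the housing features, and the helper `holdout_mae`, the choice of six features for RFE, and the random forest used for refitting are all illustrative choices rather than part of the original notebook. Lasso-based selection keeps the features with non-zero coefficients, while RFE repeatedly drops the weakest feature until the requested number remains.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: 23 columns, 6 of them informative.
X, y = make_regression(n_samples=500, n_features=23, n_informative=6,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Lasso is sensitive to feature scale, so standardize with training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Option 1: keep the features whose cross-validated Lasso coefficient is non-zero.
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=42))
lasso_selector.fit(X_train_s, y_train)
lasso_mask = lasso_selector.get_support()

# Option 2: recursive feature elimination down to a fixed number of features.
rfe = RFE(LinearRegression(), n_features_to_select=6)
rfe.fit(X_train_s, y_train)
rfe_mask = rfe.get_support()


def holdout_mae(mask):
    """Refit a random forest on the selected columns and score it on the test split."""
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_s[:, mask], y_train)
    return mean_absolute_error(y_test, model.predict(X_test_s[:, mask]))


full_mask = np.ones(X.shape[1], dtype=bool)
for name, mask in [("All features", full_mask), ("Lasso", lasso_mask), ("RFE", rfe_mask)]:
    print(f"{name}: {int(mask.sum())} features, MAE = {holdout_mae(mask):.2f}")
```

With the real data, `X_train_imputed` and `y_train_imputed` would take the place of the synthetic arrays, and the comparison should use whichever metrics were chosen earlier in the notebook.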



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)