## Hyperparam Tuning

Now that we know which models are performing better, it's time to perform cross validation and tune hyperparameters.
- Do a google search for hyperparameter ranges for each type of model.

GridSearch/RandomSearch are a great methods for checking off both of these tasks.

There is a fairly significant issue with this approach for this particular problem (described below). But in the interest of creating a basic functional pipeline, you can just use the default Sklearn methods for now.

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it, if you have time!**

BUT we have a problem - if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the training data, we can't cross validate 
- The rows in each validation fold were part of the original calculation of the mean for that city - that means we're leaking information!
- While sklearn's built in functions are extremely useful, sometimes it is necessary to do things ourselves

You need to create two functions to replicate what Gridsearch does under the hood. This is a challenging, real world data problem! To help you out, we've created some psuedocode and docstrings to get you started. 

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold

This psuedocode may help you fill in the function:

```python
kfold = KFold() # fit sklearn k folds on X_train
train_folds = []
val_folds = []
for training_index, val_index in kfold.split(X_train):
    train_fold, val_fold = #.iloc loop variables on X_train

    # recompute training city means like you did in notebook 1 
    # merge to validation fold
        
    train_folds.append(train_fold)
    val_folds.append(val_fold)

    return train_folds, val_folds
```


**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values
- For each set of hyperparameter values, fit your chosen model on each set of training folds, and take the average of your chosen scoring metric. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values
- Your function should output the hyperparameter values corresponding the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.


This psuedocode may help you fill in the function:

```python
hyperparams = # Generate hyperparam options with itertools
hyperparam-scores = []
for hyperparam-combo in hyperparams:

    scores = []

    for folds in allmyfolds:
        # score fold the fold with the model/ hyperparams
        scores.append(score-fold)
        
    score = scores.mean()
    hyperparam-scores.append(score)
# After loop, find max of hyperparam-scores. Best params are at same index in `hyperparams` loop iteratble
```

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean 

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

## Hyperparam Tuning

In [1]:
from functions_variables import *


In [2]:

# Load scaled data
X_train_scaled = pd.read_csv("../data/processed/X_train_scaled.csv")
X_test_scaled = pd.read_csv("../data/processed/X_test_scaled.csv")

# Load scaled target variables
y_train_scaled = pd.read_csv("../data/processed/y_train_scaled.csv").values.flatten()  
y_test_scaled = pd.read_csv("../data/processed/y_test_scaled.csv").values.flatten()  

# Load the feature scaler
scaler_features = joblib.load("scaler_features.pkl")

# Load the target scaler
scaler_target = joblib.load("scaler_target.pkl")



In [3]:
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)
print("y_train_scaled shape:", y_train_scaled.shape)
print("y_test_scaled shape:", y_test_scaled.shape)

X_train_scaled shape: (4511, 22)
X_test_scaled shape: (1128, 22)
y_train_scaled shape: (4511,)
y_test_scaled shape: (1128,)


In [4]:
# Linear Regression model
lr_model = LinearRegression()

lr_param_grid = {}

# Set up GridSearchCV
lr_grid_search = GridSearchCV(
    estimator=lr_model,
    param_grid=lr_param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit the model
lr_grid_search.fit(X_train_scaled, y_train_scaled)

print("Best Parameters (Linear Regression):", lr_grid_search.best_params_)
print("Best RMSE (Linear Regression):", -lr_grid_search.best_score_)

# Evaluate the model on the train and test sets
best_lr_model = lr_grid_search.best_estimator_
results_lr = evaluate_model(best_lr_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_lr.items():
    print(f"{metric}: {value:.2f}")

Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best Parameters (Linear Regression): {}
Best RMSE (Linear Regression): 0.7605343929482624
LinearRegression:
  Train RMSE: $218105.78, Test RMSE: $224965.86
  Train MAE: $139397.52, Test MAE: $144474.05
  Train R^2: 0.43, Test R^2: 0.47
Train RMSE: 218105.78
Test RMSE: 224965.86
Train MAE: 139397.52
Test MAE: 144474.05
Train R^2: 0.43
Test R^2: 0.47


In [5]:
#Ridge Regression model
ridge_model = Ridge()

# Define the hyperparameter grid
ridge_param_grid = {
    'alpha': [0.1, 1, 10, 100]  # Regularization strength
}

# Set up GridSearchCV
ridge_grid_search = GridSearchCV(
    estimator=ridge_model,
    param_grid=ridge_param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit the model
ridge_grid_search.fit(X_train_scaled, y_train_scaled)

# Best parameters and performance
print("Best Parameters (Ridge):", ridge_grid_search.best_params_)
print("Best RMSE (Ridge):", -ridge_grid_search.best_score_)

# Evaluate the model on the train and test sets
best_ridge_model = ridge_grid_search.best_estimator_
results_ridge = evaluate_model(best_ridge_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_ridge.items():
    print(f"{metric}: {value:.2f}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best Parameters (Ridge): {'alpha': 1}
Best RMSE (Ridge): 0.7605293796400723
Ridge:
  Train RMSE: $218105.89, Test RMSE: $224962.01
  Train MAE: $139386.57, Test MAE: $144454.04
  Train R^2: 0.43, Test R^2: 0.47
Train RMSE: 218105.89
Test RMSE: 224962.01
Train MAE: 139386.57
Test MAE: 144454.04
Train R^2: 0.43
Test R^2: 0.47


In [6]:
# Define the SVR model
svr_model = SVR()

# Define the hyperparameter grid
svr_param_grid = {
    'kernel': ['rbf'],  
    'C': [0.1, 1, 10],  
    'gamma': ['scale', 'auto']  
}

# Set up GridSearchCV
svr_grid_search = GridSearchCV(
    estimator=svr_model,
    param_grid=svr_param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit the model
svr_grid_search.fit(X_train_scaled, y_train_scaled)

# Best parameters and performance
print("Best Parameters (SVR):", svr_grid_search.best_params_)
print("Best RMSE (SVR):", -svr_grid_search.best_score_)

# Evaluate the model on the train and test sets
best_svr_model = svr_grid_search.best_estimator_
results_svr = evaluate_model(best_svr_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_svr.items():
    print(f"{metric}: {value:.2f}")

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best Parameters (SVR): {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best RMSE (SVR): 0.28412505156944706
SVR:
  Train RMSE: $54612.45, Test RMSE: $75196.92
  Train MAE: $31221.08, Test MAE: $37356.25
  Train R^2: 0.96, Test R^2: 0.94
Train RMSE: 54612.45
Test RMSE: 75196.92
Train MAE: 31221.08
Test MAE: 37356.25
Train R^2: 0.96
Test R^2: 0.94


In [7]:
# Gridsearch for Random Forest

rf_grid_model = RandomForestRegressor(random_state=42)

# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [None, 'sqrt', 'log2']
}

# Set up GridSearchCV for Random Forest
best_rf_grid_model = GridSearchCV(
    estimator=rf_grid_model,
    param_grid=rf_param_grid,
    scoring='neg_root_mean_squared_error',
    cv=5,
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV for Random Forest
print("Fitting GridSearchCV for Random Forest...")
best_rf_grid_model.fit(X_train_scaled, y_train_scaled)

# Output best parameters and performance
print("\nBest Parameters (Random Forest):", best_rf_grid_model.best_params_)
print("Best RMSE (Random Forest):", -best_rf_grid_model.best_score_)


Fitting GridSearchCV for Random Forest...
Fitting 5 folds for each of 108 candidates, totalling 540 fits

Best Parameters (Random Forest): {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best RMSE (Random Forest): 0.16489315812457042


In [8]:
best_rf_grid_model = RandomForestRegressor(
    n_estimators=300,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    max_depth=None,
    random_state=42
)

# Fit the Random Forest model
best_rf_grid_model.fit(X_train_scaled, y_train_scaled)

# Evaluate the Random Forest model
rf_results = evaluate_model(best_rf_grid_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
print("\nRandom Forest Results on Train/Test Set:")
for metric, value in rf_results.items():
    print(f"{metric}: {value:.2f}")

RandomForestRegressor:
  Train RMSE: $13682.07, Test RMSE: $29029.00
  Train MAE: $4512.14, Test MAE: $9801.14
  Train R^2: 1.00, Test R^2: 0.99

Random Forest Results on Train/Test Set:
Train RMSE: 13682.07
Test RMSE: 29029.00
Train MAE: 4512.14
Test MAE: 9801.14
Train R^2: 1.00
Test R^2: 0.99


In [9]:
# RamdonizedSearch for Random Forest
rf_model_random = RandomForestRegressor(random_state=42)

# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

# Set up RandomizedSearchCV for Random Forest
best_rf_random_model = RandomizedSearchCV(
    estimator=rf_model_random,
    param_distributions=rf_param_grid,
    n_iter=50,  # Number of combinations to try
    scoring='neg_root_mean_squared_error',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit RandomizedSearchCV for Random Forest
best_rf_random_model.fit(X_train_scaled, y_train_scaled)

# Output best parameters and performance
print("Best Parameters (Random Forest):", best_rf_random_model.best_params_)
print("Best RMSE (Random Forest):", -best_rf_random_model.best_score_)


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters (Random Forest): {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None}
Best RMSE (Random Forest): 0.16358682300912028


In [10]:
best_rf_random_model = RandomForestRegressor(
    n_estimators=500,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='log2',
    max_depth=None,
    random_state=42
)

# Fit the Random Forest model
best_rf_grid_model.fit(X_train_scaled, y_train_scaled)

# Evaluate the Random Forest model
rf_results = evaluate_model(best_rf_grid_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
print("\nRandom Forest Results on Train/Test Set:")
for metric, value in rf_results.items():
    print(f"{metric}: {value:.2f}")

RandomForestRegressor:
  Train RMSE: $13682.07, Test RMSE: $29029.00
  Train MAE: $4512.14, Test MAE: $9801.14
  Train R^2: 1.00, Test R^2: 0.99

Random Forest Results on Train/Test Set:
Train RMSE: 13682.07
Test RMSE: 29029.00
Train MAE: 4512.14
Test MAE: 9801.14
Train R^2: 1.00
Test R^2: 0.99


In [11]:
#GridSearch for XGBoost

xgb_grid_model = XGBRegressor(
    random_state=42,
    objective='reg:squarederror'
)

# Define the hyperparameter grid
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Set up GridSearchCV
best_xgb_grid_model = GridSearchCV(
    estimator=xgb_grid_model,
    param_grid=xgb_param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='neg_root_mean_squared_error',  # Minimize RMSE
    verbose=1,
    n_jobs=-1
)

# Fit the grid search
best_xgb_grid_model.fit(X_train_scaled, y_train_scaled)

# Best parameters and performance
print("Best Parameters (XGBoost):", best_xgb_grid_model.best_params_)
print("Best RMSE (XGBoost):", -best_xgb_grid_model.best_score_)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters (XGBoost): {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 300, 'subsample': 0.8}
Best RMSE (XGBoost): 0.11118157332074514


In [12]:
best_xgb_grid_model = XGBRegressor(
    subsample=0.8,
    n_estimators=300,
    max_depth=7,
    learning_rate=0.1,
    gamma=0,
    colsample_bytree=0.8,
    random_state=42,
    objective='reg:squarederror'
)

best_xgb_grid_model.fit(X_train_scaled, y_train_scaled)

# Evaluate the model on the train and test sets
results_xgb = evaluate_model(best_xgb_grid_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_xgb.items():
    print(f"{metric}: {value:.2f}")

XGBRegressor:
  Train RMSE: $4120.04, Test RMSE: $22078.77
  Train MAE: $2600.69, Test MAE: $5328.10
  Train R^2: 1.00, Test R^2: 0.99
Train RMSE: 4120.04
Test RMSE: 22078.77
Train MAE: 2600.69
Test MAE: 5328.10
Train R^2: 1.00
Test R^2: 0.99


In [13]:
#RandomizedSearchCV for XGBoost

xgb_random_model = XGBRegressor(random_state=42, objective='reg:squarederror')

# Define the parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5],
}

# Set up RandomizedSearchCV for XGBoost
best_xgb_random_model = RandomizedSearchCV(
    estimator=xgb_random_model,
    param_distributions=xgb_param_grid,
    n_iter=50,  # Number of combinations to try
    scoring='neg_root_mean_squared_error',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit RandomizedSearchCV for XGBoost
best_xgb_random_model.fit(X_train_scaled, y_train_scaled)

# Output best parameters and performance
print("Best Parameters (XGBoost):", best_xgb_random_model.best_params_)
print("Best RMSE (XGBoost):", -best_xgb_random_model.best_score_)


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters (XGBoost): {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 7, 'learning_rate': 0.2, 'gamma': 0, 'colsample_bytree': 0.6}
Best RMSE (XGBoost): 0.10630633999453828


In [14]:
# Train the best XGBoost model

best_xgb_random_model = XGBRegressor(
    subsample=1.0,
    n_estimators=200,
    max_depth=7,
    learning_rate=0.2,
    gamma=0,
    colsample_bytree=0.6,
    random_state=42,
    objective='reg:squarederror'
)

best_xgb_random_model.fit(X_train_scaled, y_train_scaled)

# Evaluate the model on the train and test sets
results_xgb = evaluate_model(best_xgb_random_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)

# Display evaluation results
for metric, value in results_xgb.items():
    print(f"{metric}: {value:.2f}")

XGBRegressor:
  Train RMSE: $3444.88, Test RMSE: $24228.19
  Train MAE: $2079.72, Test MAE: $4794.03
  Train R^2: 1.00, Test R^2: 0.99
Train RMSE: 3444.88
Test RMSE: 24228.19
Train MAE: 2079.72
Test MAE: 4794.03
Train R^2: 1.00
Test R^2: 0.99


In [15]:
# Stacking Regressor for Random Forest and XGBoost
from sklearn.ensemble import StackingRegressor
base_models = [
    ('Random Forest', best_rf_random_model),
    ('XGBoost', best_xgb_random_model)
]

# Define the meta-model
meta_model = LinearRegression()

# Create the Stacking Regressor
stacked_model = StackingRegressor(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    n_jobs=-1
)

# Train the Stacking Regressor
stacked_model.fit(X_train_scaled, y_train_scaled)

# Evaluate the Stacking Regressor
stacking_results = evaluate_model(stacked_model, X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled, scaler_target)
print("\nStacking Model Results on Test Set:")
for metric, value in stacking_results.items():
    print(f"{metric}: {value:.2f}")

StackingRegressor:
  Train RMSE: $3486.38, Test RMSE: $24254.97
  Train MAE: $2214.41, Test MAE: $4858.01
  Train R^2: 1.00, Test R^2: 0.99

Stacking Model Results on Test Set:
Train RMSE: 3486.38
Test RMSE: 24254.97
Train MAE: 2214.41
Test MAE: 4858.01
Train R^2: 1.00
Test R^2: 0.99


In [16]:
# cross-validation
cv_scores = cross_val_score(
    estimator=stacked_model,
    X=X_train_scaled,
    y=y_train_scaled,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1
)

cv_rmse = -cv_scores.mean()
print(f"Cross-Validation RMSE: {cv_rmse:.2f}")

Cross-Validation RMSE: 0.11


In [17]:
# Cross Validation
models = {
    "Linear Regression": lr_model,
    "Ridge Regression": ridge_model,
    "Polynomial Regression (Linear)": lr_model,
    "Polynomial Regression (Ridge)": ridge_model,
    "SVR": SVR(kernel='rbf', C=1.0, gamma='scale'),  
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42), 
    "XGBoost": XGBRegressor(n_estimators=300, max_depth=7, learning_rate=0.1, random_state=42)
}

cv_results = perform_cross_validation(
    models=models,
    X=X_train_scaled,   
    y=y_train_scaled,   
    cv=5                
)

for model_name, result in cv_results.items():
    print(f"\nModel: {model_name}")
    print(f"Per-Fold RMSE: {result['Per-Fold RMSE']}")
    print(f"Mean RMSE: {result['Mean RMSE']:.2f}")

Performing cross-validation for Linear Regression...
Linear Regression Mean RMSE: 0.76
Performing cross-validation for Ridge Regression...
Ridge Regression Mean RMSE: 0.76
Performing cross-validation for Polynomial Regression (Linear)...
Polynomial Regression (Linear) Mean RMSE: 0.76
Performing cross-validation for Polynomial Regression (Ridge)...
Polynomial Regression (Ridge) Mean RMSE: 0.76
Performing cross-validation for SVR...
SVR Mean RMSE: 0.53
Performing cross-validation for Random Forest...
Random Forest Mean RMSE: 0.18
Performing cross-validation for XGBoost...
XGBoost Mean RMSE: 0.12

Model: Linear Regression
Per-Fold RMSE: [0.80507954 0.78157251 0.7834044  0.69273471 0.73988079]
Mean RMSE: 0.76

Model: Ridge Regression
Per-Fold RMSE: [0.80502612 0.78162051 0.78342169 0.69273827 0.7398403 ]
Mean RMSE: 0.76

Model: Polynomial Regression (Linear)
Per-Fold RMSE: [0.80507954 0.78157251 0.7834044  0.69273471 0.73988079]
Mean RMSE: 0.76

Model: Polynomial Regression (Ridge)
Per-Fol

We want to make sure that we save our models.  In the old days, one just simply pickled (serialized) the model.  Now, however, certain model types have their own save format.  If the model is from sklearn, it can be pickled, if it's xgboost, for example, the newest format to save it in is JSON, but it can also be pickled.  It's a good idea to stay with the most current methods. 
- you may want to create a new `models/` subdirectory in your repo to stay organized

In [18]:
from joblib import dump

# Save the trained stacking model to a file
dump(stacked_model, 'stacked_model.pkl')



['stacked_model.pkl']

## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it if you have time!**

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
- Beware that a pipeline can only handle functions with fit and transform methods.
- Classes can be used to get around this, but now sklearn has a wrapper for user defined functions.
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 

In [19]:
# Build pipeline here
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from joblib import load
import pandas as pd
import json

# Custom Transformer for Feature Selection
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.features]

# Custom Transformer for JSON to DataFrame conversion
class JSONLoader(BaseEstimator, TransformerMixin):
    def __init__(self, feature_columns):
        self.feature_columns = feature_columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, json_file_path):
        with open(json_file_path, 'r') as file:
            data = json.load(file)
        
        # Flatten the JSON and convert to DataFrame
        results = data.get('data', {}).get('results', [])
        df = pd.json_normalize(results)
        
        # Select only the necessary columns
        df = df[self.feature_columns]
        return df

In [24]:
# Features to be used for prediction (based on feature selection)
selected_features = [
    'description_baths', 'description_beds', 'description_sqft', 
    'community_security_features', 'fireplace', 'view'
]

# Load the saved model
stacked_model_path = 'stacked_model.pkl'
stacked_model = load(stacked_model_path)

# Create the pipeline
prediction_pipeline = Pipeline([
    ('json_loader', JSONLoader(feature_columns=selected_features)),
    ('scaler', StandardScaler()),  # Ensure consistent scaling
        ('model', stacked_model)
    ])

In [21]:
# Path to the new JSON data
new_json_data_path = '/home/t0si/Documents/LHL-Midterm-Project/data/housing/MO_JeffersonCity_3.json'

# Load the JSON data
with open(new_json_data_path, 'r') as file:
    data = json.load(file)

# Flatten the JSON and convert to DataFrame
results = data.get('data', {}).get('results', [])
df = pd.json_normalize(results)

# Extract tags and process features if necessary
if 'tags' in df.columns:
    df['tags'] = df['tags'].apply(lambda x: x if isinstance(x, list) else [])
    features_to_extract = [
        'community.securityFeatures',
        'description.fireplace',
        'description.view'
    ]
    for feature in features_to_extract:
        df[feature] = df['tags'].apply(lambda x: 1 if feature in x else 0)
    df = df.drop(columns=['tags'], errors='ignore')

# Selected features for the model
selected_features = [
    'description.baths', 'description.beds', 'description.sqft', 
    'community.securityFeatures', 'description.fireplace', 'description.view'
]

# Validate features
missing_features = [feature for feature in selected_features if feature not in df.columns]
if missing_features:
    print(f"Error: Missing features in the DataFrame: {missing_features}")
else:
    # Prepare the feature matrix
    feature_matrix = df[selected_features].values

    # Load the pipeline
    prediction_pipeline = load('prediction_pipeline.pkl')

    # Check pipeline steps
    print("Pipeline Steps:", prediction_pipeline.steps)

    # Remove json_loader step if present
    if 'json_loader' in dict(prediction_pipeline.steps):
        prediction_pipeline.steps.pop(0)

    # Predict with the pipeline
    new_data_predictions = prediction_pipeline.predict(feature_matrix)
    print("Predictions on new data:")
    print(new_data_predictions)

Pipeline Steps: [('json_loader', JSONLoader(feature_columns=['description_baths', 'description_beds',
                            'description_sqft', 'community_security_features',
                            'fireplace', 'view'])), ('scaler', StandardScaler()), ('model', StackingRegressor(cv=5,
                  estimators=[('Random Forest',
                               RandomForestRegressor(max_features='log2',
                                                     n_estimators=500,
                                                     random_state=42)),
                              ('XGBoost',
                               XGBRegressor(base_score=None, booster=None,
                                            callbacks=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=0.6, device=None,
                                            ear

NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Pipelines come from sklearn.  When a pipeline is pickled, all of the information in the pipeline is stored with it.  For example, if we were deploying a model, and we had fit a scaler on the training data, we would want the same, already fitted scaling object to transform the new data with.  This is all stored when the pipeline is pickled.
- save your final pipeline in your `models/` folder

In [None]:
# save your pipeline here# Save the pipeline
dump(prediction_pipeline, 'prediction_pipeline.pkl')
print("Prediction pipeline saved as 'prediction_pipeline.pkl'")