# **Sales Price Prediction**

## Objectives

* Fit and evaluate a regression model to predict house sale prices.

## Inputs

* outputs/datasets/cleaned/clean_house_price_records.csv

## Outputs

* Train set 
* Test set
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance

## Additional Comments

* This file and its contents were inspired by and adapted from the Churnometer Walkthrough Project 2 and other lessons from Code Institute.  

---

### Change working directory

We need to change the working directory from its current folder to its parent folder.

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
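To make the effect of `os.path.dirname()` concrete, here is a minimal sketch using a hypothetical folder layout. The folder names are assumptions for illustration only, not the project's actual paths.

```python
import os

# Hypothetical notebook location, for illustration only
example_dir = os.path.join("workspace", "house-prices", "jupyter_notebooks")

# os.path.dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Passing such a parent path to `os.chdir()` is what moves the working directory up one level.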

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

### Load Data

Load dataset:

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/cleaned/clean_house_price_records.csv"))
df.head()

### ML Pipeline with all data

- ML pipeline for Data Cleaning and Feature Engineering

In [None]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine import transformation as vt
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
import numpy as np
from sklearn.impute import SimpleImputer
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer 
import pandas as pd

selection_method = "cardinality"
corr_method = "spearman"

def PipelineOptimization(model):
    pipeline_base = Pipeline([
        ("NumericMissingValueImputer", MeanMedianImputer(imputation_method='median',
                                                         variables=['1stFlrSF', 'LotArea', 'GrLivArea', 
                                                                    'MasVnrArea', 'OpenPorchSF'])),
        
        ("CategoricalMissingValueImputer", CategoricalImputer(imputation_method='frequent',
                                                              variables=['BsmtExposure', 'BsmtFinType1', 
                                                                         'GarageFinish', 'KitchenQual'])),
        
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['BsmtExposure', 'BsmtFinType1', 
                                                                'GarageFinish', 'KitchenQual'])),
        
        ("NumericLogTransform", vt.LogTransformer(variables=['1stFlrSF', 'LotArea', 'GrLivArea'])),
        
        ("NumericPowerTransform", vt.PowerTransformer(variables=['MasVnrArea'])),
        
        ("NumericYeoJohnsonTransform", vt.YeoJohnsonTransformer(variables=['OpenPorchSF'])),
        
        ("feat_scaling", StandardScaler()),

        ("FinalImputer", SimpleImputer(strategy="mean")),

        ("feat_selection", SelectFromModel(model)),
        
        ("model", model),
        
    ])

    return pipeline_base

Custom Class for Hyperparameter Optimisation

In [None]:
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches

### Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)


print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

### Grid Search CV - Sklearn

#### Use standard hyperparameters to find most suitable algorithm

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

Quick GridSearch CV - Regressor

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

- Top Model Performance: ExtraTreesRegressor achieved the highest average R2 score at 0.82, surpassing the client’s requirement of at least 0.75 for model selection.

- Variability Across Models: While ExtraTreesRegressor and LinearRegression performed well with mean scores above 0.80, other models like RandomForestRegressor and XGBRegressor showed lower average R2 scores, particularly XGBRegressor, with an average score of 0.65.

- Standard Deviation Insights: The lower standard deviation in models like LinearRegression suggests consistent performance, while models such as GradientBoostingRegressor showed greater variability, indicating more fluctuation in R2 scores across different test sets.
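The summary columns discussed above are simple statistics over the per-fold test scores. The sketch below reproduces that calculation on made-up fold scores; the numbers are assumptions for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up R2 scores for one model across 5 folds (not results from this notebook)
fold_scores = np.array([0.70, 0.75, 0.80, 0.85, 0.90])

summary = {
    "min_score": fold_scores.min(),
    "max_score": fold_scores.max(),
    "mean_score": fold_scores.mean(),
    "std_score": fold_scores.std(),  # population std (ddof=0), the np.std default
}
print(summary)
```

A small `std_score` relative to `mean_score` suggests the model behaves similarly on every fold, while a large one suggests the score depends on which rows end up in the validation fold.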

#### Extensive search on the most suitable model

In this code, we define a model search for optimizing hyperparameters specifically for the `ExtraTreesRegressor` model. We set up a range of values for each key hyperparameter:

- `n_estimators`: Number of trees in the forest.
- `max_depth`: Maximum depth of each tree.
- `min_samples_split`: Minimum number of samples required to split an internal node.
- `min_samples_leaf`: Minimum number of samples required to be at a leaf node.
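Because a grid search evaluates every combination of values, the cost grows multiplicatively with each hyperparameter. The following rough sketch counts the work for the value ranges used in this section, with `itertools.product` standing in for the enumeration that `GridSearchCV` performs (the final refit on the full training set is not counted).

```python
from itertools import product

# Same value ranges as the grid used for ExtraTreesRegressor in this section
param_grid = {
    "model__n_estimators": [100, 300, 500],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Every combination of values is one candidate configuration
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Adding a fourth value to any one of these lists would raise the candidate count from 81 to 108, which is worth keeping in mind before widening the grid.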

In [None]:
models_search = {
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
}

params_search = {
    "ExtraTreesRegressor": {
        'model__n_estimators': [100, 300, 500],
        'model__max_depth': [10, 20, None],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4],
    }
}

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary
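The `model__` prefix in the grid keys is how scikit-learn routes a parameter to a named pipeline step: `<step name>__<parameter>`. A minimal sketch on a toy two-step pipeline (only the step name `model` mirrors this notebook; the rest is an assumption for illustration):

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy two-step pipeline; only the step name "model" mirrors the notebook
toy_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", ExtraTreesRegressor(random_state=0)),
])

# "<step name>__<parameter>" targets a parameter of that step
toy_pipeline.set_params(model__n_estimators=300, model__max_depth=20)
print(toy_pipeline.named_steps["model"])
```

`GridSearchCV` applies candidate values in the same way, which is why the keys in `params_search` must match the step name used in `PipelineOptimization`.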


Defining the best model:

In [None]:
best_model = grid_search_summary.iloc[0, 0]
best_model

In [None]:
grid_search_pipelines[best_model].best_params_

Defining the best regressor:

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

### Assess feature importance

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

data_cleaning_feat_eng_steps = 6 
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support(
)].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
    'Importance': best_regressor_pipeline['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on: \n{best_features}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

### Evaluate on Train and Test Sets

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)


def regression_evaluation(X, y, pipeline):
    prediction = pipeline.predict(X)
    print('R2 Score:', r2_score(y, prediction).round(3))
    print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))
    print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))
    print('Root Mean Squared Error:', np.sqrt(
        mean_squared_error(y, prediction)).round(3))
    print("\n")


def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()

Evaluate Performance

In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)
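For reference, the metrics printed above can be computed by hand. The sketch below does so on toy values (assumed for illustration, not taken from the house price data), using R2 = 1 - SS_res / SS_tot.

```python
import numpy as np

# Toy values for illustration only (not from the house price data)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
print(f"R2={r2:.3f}, MAE={mae:.1f}, MSE={mse:.1f}, RMSE={rmse:.1f}")
```

MAE and RMSE are in the same units as the target (here, sale price), whereas MSE is in squared units and R2 is unitless.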

### Regressor with Principal Component Analysis

Prepare the cleaned and scaled data so we can explore potential values for the PCA `n_components` parameter.

In [None]:
pipeline = PipelineOptimization(model=LinearRegression())
pipeline_pca = Pipeline(pipeline.steps[:8])
df_pca = pipeline_pca.fit_transform(df.drop(['SalePrice'],axis=1))

print(df_pca.shape,'\n', type(df_pca))

Apply PCA separately to the scaled data

In [None]:
import numpy as np
from sklearn.decomposition import PCA

n_components = 17


def pca_components_analysis(df_pca, n_components):
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca)  # array with transformed PCA

    ComponentsList = ["Component " + str(number)
                      for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data=np.round(100 * pca.explained_variance_ratio_, 3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum(
    )

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum(
    )

    print(
        f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
    plt.figure(figsize=(12, 5))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()


pca_components_analysis(df_pca=df_pca, n_components=n_components)

In [None]:
n_components = 9
pca_components_analysis(df_pca=df_pca, n_components=n_components)
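One common way to turn the explained-variance curve into a number of components is to pick the smallest count whose cumulative ratio reaches a chosen threshold. The sketch below illustrates this on synthetic data; the 0.95 threshold and the data itself are assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 10 features driven by 3 underlying factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA().fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # assumed cut-off, for illustration only
n_selected = int(np.argmax(cumulative >= threshold)) + 1
print(n_selected, np.round(cumulative, 3))
```

The same rule could be applied to `df_pca`, but the threshold is a judgment call that trades dimensionality against retained variance.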

### Rewrite ML Pipeline for Modelling

Analysis of Numerical Variables: Skewness and Correlation

In [None]:
from scipy.stats import skew
data_path = "outputs/datasets/cleaned/clean_house_price_records.csv"
df = pd.read_csv(data_path)

numerical_columns = ['1stFlrSF', 'LotArea', 'GrLivArea', 'GarageArea', 'MasVnrArea']

df = df[[col for col in numerical_columns if col in df.columns]]

for col in df.columns:
    col_skew = skew(df[col].dropna())
    print(f"{col}: Skewness = {col_skew:.2f}")
    
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f'Distribution of {col} (Skewness = {col_skew:.2f})')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

correlation_matrix = df.corr(method='spearman')
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, vmin=-1, vmax=1)
plt.title("Spearman Correlation Matrix for Numerical Variables")
plt.show()

print(correlation_matrix)
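Skewness matters here because the pipeline applies log and power transforms to several area variables. As a rough illustration of why, the sketch below measures skewness before and after a natural-log transform on synthetic right-skewed data (log-normal values standing in for an area-like variable, which is an assumption for illustration).

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed data (log-normal), standing in for an area-like variable
rng = np.random.default_rng(0)
values = rng.lognormal(mean=7.0, sigma=1.0, size=5000)

skew_before = skew(values)
skew_after = skew(np.log(values))  # natural log; valid only for strictly positive values
print(f"before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

A skewness close to zero after the transform indicates a more symmetric distribution. Real variables that contain zeros cannot be log-transformed directly, which is one reason to use a different transform for them.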

Pipeline(steps=[('NumericMissingValueImputer',
                 MeanMedianImputer(variables=['1stFlrSF', 'LotArea',
                                              'GrLivArea', 'MasVnrArea',
                                              'OpenPorchSF'])),
                ('CategoricalMissingValueImputer',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['BsmtExposure', 'BsmtFinType1',
                                               'GarageFinish',
                                               'KitchenQual'])),
                ('OrdinalCategoricalEncoder',
                 OrdinalEncoder(encoding_meth...
                 PowerTransformer(variables=['MasVnrArea'])),
                ('NumericYeoJohnsonTransform',
                 YeoJohnsonTransformer(variables=['OpenPorchSF'])),
                ('feat_scaling', StandardScaler()),
                ('FinalImputer', SimpleImputer()),
                ('feat_selection',
                 SelectFromModel(estimator=ExtraTreesRegressor(random_state=0))),
                ('model',
                 ExtraTreesRegressor(max_depth=20, min_samples_leaf=2,
                                     min_samples_split=10, n_estimators=300,
                                     random_state=0))])

In [None]:
from feature_engine.transformation import LogTransformer, PowerTransformer, YeoJohnsonTransformer

def PipelineOptimization(model, n_components=7):
    pipeline_base = Pipeline([
        ("NumericMissingValueImputer", MeanMedianImputer(variables=['1stFlrSF', 'LotArea', 'GrLivArea', 'MasVnrArea', 'OpenPorchSF'])),
        
        ("CategoricalMissingValueImputer", CategoricalImputer(imputation_method='frequent', 
                                                              variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),
        
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary', 
                                                     variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),
        
        ("NumericLogTransform", LogTransformer(variables=['1stFlrSF', 'LotArea', 'GrLivArea'])),
        
        ("NumericPowerTransform", PowerTransformer(variables=['GarageArea', 'MasVnrArea'])),
        
        ("NumericYeoJohnsonTransform", YeoJohnsonTransformer(variables=['OpenPorchSF'])),
        
        ("feat_scaling", StandardScaler()),
        
        ("PCA", PCA(n_components=n_components, random_state=0)),
        
        ("FinalImputer", SimpleImputer()),
        
        ("feat_selection", SelectFromModel(estimator=ExtraTreesRegressor(random_state=0))),
        
        ("model", model),
    ])
    
    return pipeline_base

### Grid Search CV – Sklearn

In [None]:
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

Quick optimisation search

In [None]:
from feature_engine.transformation import LogTransformer, PowerTransformer, YeoJohnsonTransformer

quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results:

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

#### Extensive search on the most suitable model to find the best hyperparameter configuration

Extensive GridSearch CV

In [None]:
models_search = {
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
}

params_search = {
    "ExtraTreesRegressor": {
        'model__n_estimators': [100, 300, 500, 700],
        'model__max_depth': [10, 20, None],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4],
    }
}

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best regressor

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

### Assess feature importance

In [None]:
original_features = X_train.columns.to_list()

important_features = ['Feature_17', 'Feature_10', 'Feature_18', 'Feature_19', 'Feature_20']

important_features_mapped = {feature: original_features[int(feature.split('_')[1])] for feature in important_features}

print("Mapping of important features to original variable names:")
for placeholder, original_name in important_features_mapped.items():
    print(f"{placeholder} corresponds to original feature '{original_name}'")



In [None]:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create a pipeline that excludes PCA and feature selection to retain all features
pipeline_full_features = Pipeline([
    (name, step) for name, step in best_regressor_pipeline.steps[:-1]
    if name not in ['PCA', 'feat_selection']
])

# Transform the data up to the last step without PCA and selection
X_train_full_features = pipeline_full_features.fit_transform(X_train, y_train)

# Fit ExtraTreesRegressor on the transformed data without dimensionality reduction
model_no_reduction = ExtraTreesRegressor(random_state=0)
model_no_reduction.fit(X_train_full_features, y_train)

# Mapping original feature names based on the order in X_train
original_feature_names = X_train.columns.to_list()

# Replace placeholder names with original feature names
df_feature_importance = pd.DataFrame({
    'Feature': original_feature_names,  # Use original feature names here
    'Importance': model_no_reduction.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Select and display the top 5 important features
top_n_features = 5
df_top_features = df_feature_importance.head(top_n_features)

print(f"* The top {top_n_features} most important features in the model are:\n{df_top_features['Feature'].tolist()}")
plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=df_top_features)
plt.title(f"Top {top_n_features} Feature Importance for ExtraTreesRegressor")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()


#### Evaluate on Train and Test Sets

In [None]:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model_performance(X_train, y_train, X_test, y_test, pipeline):
    print("Evaluation Summary:\n")
    print("* Training Data Evaluation")
    assess_performance(X_train, y_train, pipeline)
    print("* Testing Data Evaluation")
    assess_performance(X_test, y_test, pipeline)

def assess_performance(X, y, pipeline):
    predictions = pipeline.predict(X)
    print(f'R2 Score: {r2_score(y, predictions):.3f}')
    print(f'Mean Absolute Error (MAE): {mean_absolute_error(y, predictions):.3f}')
    print(f'Mean Squared Error (MSE): {mean_squared_error(y, predictions):.3f}')
    print(f'Root Mean Squared Error (RMSE): {np.sqrt(mean_squared_error(y, predictions)):.3f}')
    print("\n")

def plot_model_evaluation(X_train, y_train, X_test, y_test, pipeline, alpha=0.5):
    train_preds = pipeline.predict(X_train)
    test_preds = pipeline.predict(X_test)
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    
    sns.scatterplot(x=y_train, y=train_preds, alpha=alpha, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual Values")
    axes[0].set_ylabel("Predicted Values")
    axes[0].set_title("Train Data Performance")
    
    sns.scatterplot(x=y_test, y=test_preds, alpha=alpha, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual Values")
    axes[1].set_ylabel("Predicted Values")
    axes[1].set_title("Test Data Performance")
    
    plt.tight_layout()
    plt.show()


Performance evaluation

In [None]:
evaluate_model_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
plot_model_evaluation(X_train, y_train, X_test, y_test, best_regressor_pipeline)

#### Create Output Directory for Pipeline Results

This code snippet creates a versioned directory structure to store output files, such as models, datasets, and visualizations generated by the machine learning pipeline.

In [None]:
import os

version = "v1"  
output_dir = f"outputs/ml_pipeline/predict_saleprice/{version}"

try:
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory created at: {output_dir}")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
print("Columns used for training:")
print(best_regressor_pipeline.feature_names_in_)
print(f"Number of columns: {len(best_regressor_pipeline.feature_names_in_)}")


### Save Training and Test Datasets

This code saves the training and test datasets (`X_train`, `y_train`, `X_test`, `y_test`) as CSV files in the designated output directory. Storing these files allows for easy access to the train-test split in future analysis without recreating it each time.


In [None]:
# Save Train data
X_train.to_csv(f"{output_dir}/X_train.csv", index=False)
y_train.to_csv(f"{output_dir}/y_train.csv", index=False)

# Save Test data
X_test.to_csv(f"{output_dir}/X_test.csv", index=False)
y_test.to_csv(f"{output_dir}/y_test.csv", index=False)

print("Training and test datasets saved successfully.")
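When the saved files are read back later, note that a target saved with `to_csv` comes back as a one-column DataFrame rather than a Series. A minimal round-trip sketch with toy values and a temporary folder (both assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy target with a name, standing in for y_train
y_demo = pd.Series([100000, 150000, 200000], name="SalePrice")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.csv")
    y_demo.to_csv(path, index=False)

    # read_csv returns a one-column DataFrame; squeeze turns it back into a Series
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
```

Because the files are written with `index=False`, the original row index is not preserved; rows keep their order but are renumbered from zero on reload.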

### Save Best Model Pipeline

This code saves the trained model pipeline (`best_regressor_pipeline`) as a `.pkl` file in the output directory. Storing the pipeline allows for future use of the trained model for predictions without the need for retraining. This step is essential for model deployment and reproducibility.

In [None]:
best_regressor_pipeline

In [None]:
import joblib

model_filename = f"{output_dir}/best_regressor_pipeline.pkl"
joblib.dump(value=best_regressor_pipeline, filename=model_filename)

print("Best model pipeline saved successfully as:", model_filename)
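Loading the pipeline back is the mirror image of saving it. The sketch below uses a small stand-in pipeline fitted on synthetic data and a temporary folder (all assumptions for illustration) to check that a reloaded pipeline reproduces the original predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on synthetic data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(value=demo_pipeline, filename=path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same = np.allclose(demo_pipeline.predict(X_demo), loaded_pipeline.predict(X_demo))
print("Predictions match after reload:", same)
```

Pickled pipelines are generally only safe to load in an environment with the same library versions used to create them, and only from trusted sources.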

### Save and Visualize Feature Importance

This code saves the feature importance values from the trained model into a CSV file and generates a bar plot displaying each feature's importance. The plot is saved as a PNG image in the output directory. This step helps to understand which features have the most influence on the model's predictions.

In [None]:
feature_importance_filename = f"{output_dir}/feature_importance.csv"
df_feature_importance.to_csv(feature_importance_filename, index=False)
print("Feature importance saved successfully as:", feature_importance_filename)

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(x="Importance", y="Feature", data=df_feature_importance.sort_values(by="Importance", ascending=False))
plt.title("Feature Importance for Best Model")
plt.xlabel("Importance")
plt.ylabel("Feature")

feature_importance_plot_filename = f"{output_dir}/feature_importance.png"
plt.savefig(feature_importance_plot_filename, bbox_inches='tight')
print("Feature importance plot saved successfully as:", feature_importance_plot_filename)


### Visualize and Save Model Performance

This code creates scatter plots comparing actual and predicted sale prices for both the training and test sets. A red line represents the ideal fit where predictions equal actual values, providing a visual benchmark for model accuracy. The plots are saved as a PNG image in the output directory, offering a quick overview of the model's predictive performance.


In [None]:
pred_train = best_regressor_pipeline.predict(X_train)
pred_test = best_regressor_pipeline.predict(X_test)

import matplotlib.pyplot as plt
import seaborn as sns

alpha_scatter = 0.5
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
axes[0].set_xlabel("Actual Sale Price")
axes[0].set_ylabel("Predicted Sale Price")
axes[0].set_title("Train Set Performance")

sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
axes[1].set_xlabel("Actual Sale Price")
axes[1].set_ylabel("Predicted Sale Price")
axes[1].set_title("Test Set Performance")

performance_plot_filename = f"{output_dir}/regression_evaluation_plots.png"
plt.savefig(performance_plot_filename, bbox_inches='tight')
print("Model performance plot saved successfully as:", performance_plot_filename)


### Conclusion

The model developed for predicting house sale prices has successfully met and exceeded the business requirement of achieving an R2 score of 0.75 or higher on test data. Key findings and outcomes include:

1. **Model Performance**:
   - **Train Set**: The model achieved an R2 score of 0.849, indicating a strong fit to the training data.
   - **Test Set**: The model achieved an R2 score of 0.804, meeting the business requirement and demonstrating good generalization to new data.

2. **Feature Importance**:
   - The model's feature importance analysis highlighted several key variables that significantly impact sale price predictions, including `1stFlrSF`, `GarageArea`, and `GrLivArea`. This insight is valuable for understanding which factors are most influential in determining house prices.

3. **Residual Analysis**:
   - Visualization of actual vs. predicted sale prices on both the training and test sets shows a generally strong alignment along the ideal line, especially for properties within a certain price range. However, some variance is observed, suggesting possible improvements in capturing extreme values.

4. **Model Deployment Readiness**:
   - The entire pipeline, including data processing and model training, has been saved as reusable files, ensuring that the model can be loaded and used for predictions without retraining. This makes the model deployment-ready and easy to integrate into a larger application.

Overall, the model meets the business requirements and offers a reliable tool for predicting house sale prices. Future work could focus on further feature engineering and hyperparameter tuning to enhance performance for higher-priced properties, but the current model is well-suited for production use.