# **Notebook 5: Model Training and Evaluation**

## Objectives

The primary objective of this notebook is to develop, evaluate, and select the best-performing machine learning model(s) to predict house sale prices in Ames, Iowa. This involves training models on the processed datasets, fine-tuning their hyperparameters, and assessing their performance against defined metrics.

## Inputs

* **Training Dataset with Target (`train_with_target.csv`):** Includes processed features and the log-transformed target variable (`LogSalePrice`) for model training.
* **Testing Dataset with Target (`test_with_target.csv`):** Includes processed features and the log-transformed target variable (`LogSalePrice`) for model evaluation.
* **Key Feature Correlations (`key_drivers_correlation.csv`):** Identifies the features with the strongest influence on the target variable.

## Outputs

* **Trained Models:** Serialized versions of trained models saved for deployment.
* **Evaluation Metrics:** R², Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) for model comparison.
* **Feature Importances:** Insights into which features contributed most to model predictions.
* **Model Performance Report:** Comprehensive report summarizing model performance and insights.

## Additional Comments

* The notebook is structured to ensure modularity, allowing for easy updates or experimentation with different models or preprocessing steps.
* Exploratory insights gained in the previous notebook will inform feature selection and preprocessing strategies.
* Advanced modeling techniques, such as hyperparameter optimization and ensemble learning, will be considered for improving accuracy.


---

## Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5'

---

## Data Preparation

### Load Processed Data

**Objective:** Load the preprocessed datasets required for model training and evaluation.

**Inputs:**
- `train_with_target.csv` and `test_with_target.csv` files containing the processed training and testing datasets with the target variable included.

In [7]:
import pandas as pd

# Define file paths
train_data_path = "outputs/datasets/processed/with_target/train_with_target.csv"
test_data_path = "outputs/datasets/processed/with_target/test_with_target.csv"

# Load the datasets
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Separate features and target
x_train = train_data.drop(columns=["LogSalePrice", "SalePriceQuartile"], errors="ignore")
y_train = train_data["LogSalePrice"]

x_test = test_data.drop(columns=["LogSalePrice", "SalePriceQuartile"], errors="ignore")
y_test = test_data["LogSalePrice"]

# Display dataset shapes
print(f"x_train data shape: {x_train.shape}")
print(f"x_test data shape: {x_test.shape}")
print(f"y_train data shape: {y_train.shape}")
print(f"y_test data shape: {y_test.shape}")

# Preview the datasets
print("\nPreview of x_train:")
display(x_train.head())

print("\nPreview of x_test:")
display(x_test.head())

print("\nPreview of y_train:")
display(y_train.head())

print("\nPreview of y_test:")
display(y_test.head())

x_train data shape: (1168, 21)
x_test data shape: (292, 21)
y_train data shape: (1168,)
y_test data shape: (292,)

Preview of x_train:


Unnamed: 0,num__LotFrontage,num__LotArea,num__OpenPorchSF,num__MasVnrArea,num__BsmtFinSF1,num__GrLivArea,num__1stFlrSF,num__YearBuilt,num__YearRemodAdd,num__BedroomAbvGr,...,num__BsmtUnfSF,num__GarageArea,num__GarageYrBlt,num__OverallCond,num__OverallQual,num__Age,num__LivingLotRatio,num__FinishedBsmtRatio,num__OverallScore,cat__HasPorch_1
0,0.14414,-0.161873,-1.096169,-0.827815,0.865283,-0.292584,0.526873,0.455469,1.346063,-0.288836,...,-0.400282,-0.863837,0.192392,0.372217,-0.820445,-0.455469,-0.116096,0.887733,-0.437833,0.0
1,-0.392921,-0.304082,0.617419,-0.827815,-1.416429,0.250597,-1.040595,-0.718609,-0.439214,-0.288836,...,0.51192,-0.456264,0.272225,1.268609,-0.088934,0.718609,0.455054,-1.415946,0.85819,1.0
2,0.006402,-0.071879,-1.096169,-0.827815,-1.416429,-1.816242,-1.052445,1.988293,1.683818,0.64568,...,0.505196,-2.257169,-4.14741,1.268609,-0.820445,-1.988293,-1.409123,-1.415946,0.102176,0.0
3,-0.340186,-0.477855,-1.096169,1.276291,0.704206,0.609851,-0.394093,1.107734,1.683818,-0.288836,...,-0.915776,-1.119755,0.152476,1.268609,-0.820445,-1.107734,0.918129,0.640194,0.102176,0.0
4,-0.911425,-1.22528,-1.096169,-0.827815,0.384534,0.474436,-0.252776,1.531707,1.683818,-0.288836,...,0.532091,-0.797488,0.119212,0.372217,-0.820445,-1.531707,1.593562,0.340697,-0.437833,0.0



Preview of x_test:


Unnamed: 0,num__LotFrontage,num__LotArea,num__OpenPorchSF,num__MasVnrArea,num__BsmtFinSF1,num__GrLivArea,num__1stFlrSF,num__YearBuilt,num__YearRemodAdd,num__BedroomAbvGr,...,num__BsmtUnfSF,num__GarageArea,num__GarageYrBlt,num__OverallCond,num__OverallQual,Age,LivingLotRatio,FinishedBsmtRatio,OverallScore,HasPorch
0,0.14414,-0.15846,-1.096169,-0.827815,0.755219,-0.922794,-0.126358,0.227176,-0.87347,-2.157869,...,-0.391317,-1.006014,0.205698,2.165,-0.088934,1986.358491,0.798551,0.515713,35.037736,0.566038
1,1.204764,0.61254,0.517257,1.413568,0.90291,1.808434,0.944129,-0.783836,-0.487465,-2.157869,...,-0.312872,1.117159,0.274443,-0.524174,1.374088,1986.358491,0.798551,0.515713,35.037736,0.566038
2,-0.556568,-0.029579,-1.096169,-0.827815,-1.416429,-1.038836,-0.246639,1.401254,1.683818,-1.223352,...,0.980347,-0.551048,0.125865,0.372217,-0.820445,1986.358491,0.798551,0.515713,35.037736,0.566038
3,-0.911425,-1.22528,0.389147,-0.827815,0.585846,0.425488,-0.321073,0.748988,1.683818,-2.157869,...,0.077111,-0.266695,0.176869,1.268609,-0.088934,1986.358491,0.798551,0.515713,35.037736,0.566038
4,0.900684,0.717202,-1.096169,0.793095,0.899659,0.343995,1.186707,-1.207808,-1.114724,-1.223352,...,0.061422,2.065003,0.305489,-0.524174,2.105599,1986.358491,0.798551,0.515713,35.037736,0.566038



Preview of y_train:


0    11.884496
1    12.089544
2    11.350418
3    12.072547
4    11.751950
Name: LogSalePrice, dtype: float64


Preview of y_test:


0    11.947949
1    12.691580
2    11.652687
3    11.976659
4    12.661914
Name: LogSalePrice, dtype: float64
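
The previews above show that the last five columns of `x_test` are named `Age`, `LivingLotRatio`, `FinishedBsmtRatio`, `OverallScore`, and `HasPorch`, whereas `x_train` names them `num__Age`, `num__LivingLotRatio`, `num__FinishedBsmtRatio`, `num__OverallScore`, and `cat__HasPorch_1`. Those `x_test` columns also hold the same value in every displayed row. A check along the following lines could confirm whether the two feature sets really disagree before any model is trained. It is a minimal sketch: `compare_feature_schemas` is a hypothetical helper, not part of the original notebook, and the demo frames contain made-up values.

```python
import pandas as pd


def compare_feature_schemas(x_train: pd.DataFrame, x_test: pd.DataFrame) -> dict:
    """Report column-name mismatches and constant columns between two feature frames."""
    train_cols = set(x_train.columns)
    test_cols = set(x_test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
        # A column with a single unique value carries no signal for a model.
        "constant_in_test": sorted(
            col for col in x_test.columns if x_test[col].nunique(dropna=False) <= 1
        ),
    }


# Tiny illustrative frames (made-up values, not the Ames data).
demo_train = pd.DataFrame(
    {"num__Age": [-0.46, 0.72, -1.99], "num__LotArea": [-0.16, -0.30, -0.07]}
)
demo_test = pd.DataFrame(
    {"Age": [1986.36, 1986.36, 1986.36], "num__LotArea": [-0.16, 0.61, -0.03]}
)

schema_report = compare_feature_schemas(demo_train, demo_test)
print(schema_report)
```

In the notebook, calling `compare_feature_schemas(x_train, x_test)` on the loaded frames would list any columns that need to be renamed or re-transformed upstream.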

### Feature and Target Separation

### Data Splitting (if applicable)

---

## Model Selection

### Overview of Models

### Baseline Model
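
One possible baseline, sketched here as an assumption rather than the notebook's chosen approach, is a model that always predicts the mean of the training target. Any trained model should beat its error. The helper name `baseline_rmse` and the demo values are hypothetical; `DummyRegressor` is scikit-learn's built-in constant predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error


def baseline_rmse(y_train, y_test) -> float:
    """RMSE of a model that always predicts the training-set mean."""
    # DummyRegressor ignores the features, so a placeholder column is enough.
    placeholder_train = np.zeros((len(y_train), 1))
    placeholder_test = np.zeros((len(y_test), 1))
    baseline = DummyRegressor(strategy="mean").fit(placeholder_train, y_train)
    predictions = baseline.predict(placeholder_test)
    return float(np.sqrt(mean_squared_error(y_test, predictions)))


# Made-up log-price values for illustration only.
demo_rmse = baseline_rmse(np.array([11.0, 12.0, 13.0]), np.array([12.0, 14.0]))
print(f"Baseline RMSE (log scale): {demo_rmse:.4f}")
```

With the notebook's data this would be called as `baseline_rmse(y_train, y_test)`, giving a reference error on the `LogSalePrice` scale.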

---

## Model Training and Hyperparameter Tuning

### Train Multiple Models

### Hyperparameter Tuning

---

## Model Evaluation

### Evaluation Metrics
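
The four metrics named in the Outputs section (R², MAE, MSE, RMSE) can be computed together with scikit-learn. The sketch below is illustrative: `regression_metrics` is a hypothetical helper and the demo arrays are made-up values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred) -> dict:
    """Return R2, MAE, MSE, and RMSE for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": float(r2_score(y_true, y_pred)),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),  # same units as the target
    }


# Made-up values on the log-price scale, for illustration only.
demo_true = np.array([11.0, 12.0, 13.0])
demo_pred = np.array([11.5, 12.0, 12.5])
demo_metrics = regression_metrics(demo_true, demo_pred)
print(demo_metrics)
```

Because the target is `LogSalePrice`, these numbers are on the log scale. If the target was produced with a plain natural log (an assumption; the transformation is defined in an earlier notebook), dollar-scale errors could be obtained with `regression_metrics(np.exp(y_test), np.exp(predictions))`. If `np.log1p` was used instead, `np.expm1` would be the matching inverse.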

### Cross-Validation Results
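
A k-fold cross-validation run on the training set might look like the sketch below. It is an assumption about how this section could be filled in: `cross_validated_rmse` is a hypothetical helper, `LinearRegression` is only a placeholder estimator, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cross_validated_rmse(model, features, target, n_splits=5, seed=0):
    """Return the per-fold RMSE of `model` under shuffled k-fold cross-validation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn maximizes scores, so RMSE is reported negated; flip the sign back.
    scores = cross_val_score(
        model, features, target, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores


# Synthetic data for illustration: a linear signal plus small noise.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(100, 3))
demo_target = demo_features @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=100)

cv_rmse = cross_validated_rmse(LinearRegression(), demo_features, demo_target)
print(f"Mean CV RMSE: {cv_rmse.mean():.4f} (std {cv_rmse.std():.4f})")
```

In the notebook, the same helper could be called with `x_train` and `y_train` for each candidate model, keeping the test set untouched until the final comparison.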

### Test Set Results

### Residual Analysis

### Error Analysis

---

## Feature Importance and Insights

### Feature Importance

### Key Takeaways

---

## Exploratory Model Analysis

### Experimentation with Additional Models

### Comparative Analysis

### Model Interpretability

---

## Conclusion and Recommendations

---

## Save Outputs

---

## Future Improvements