# **Modeling and Evaluation**

## Objectives

* Fit and evaluate a regression model to predict sale prices of inherited houses

## Inputs

* outputs/datasets/processed/house_prices_records_train_processed.csv
* outputs/datasets/processed/house_prices_records_test_processed.csv

## Outputs

* Modeling pipeline to predict house prices

## Additional Comments

* 

---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory first.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\-MY STUDY-\\Coding\\projects\\project-5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\-MY STUDY-\\Coding\\projects\\project-5'

# Load Data

In [5]:
import pandas as pd

# Load the processed training and testing datasets
train_df_encoded = pd.read_csv("outputs/datasets/processed/house_prices_records_train_processed.csv")
test_df_encoded = pd.read_csv("outputs/datasets/processed/house_prices_records_test_processed.csv")

print("Processed training set loaded:", train_df_encoded.shape)
print("Processed test set loaded:", test_df_encoded.shape)


Processed training set loaded: (1168, 37)
Processed test set loaded: (292, 37)


---

# ML pipeline

* These are the relevant features based on the correlation analysis and feature engineering

In [6]:
# Select the relevant features based on the correlation analysis and feature engineering
features = ['OverallQual', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 
            'YearBuilt', 'YearRemodAdd', 'KitchenQual_Ex', 'MasVnrArea', 'GarageYrBlt', 
            'OverallQual_GrLivArea', 'TotalBsmtSF_1stFlrSF']

* Relevant features and target variable for training and testing sets are:

In [7]:
X_train = train_df_encoded[features]
y_train = train_df_encoded['SalePrice']
X_test = test_df_encoded[features]
y_test = test_df_encoded['SalePrice']

* Let's identify and retain the most important features.

In [17]:
from sklearn.ensemble import RandomForestRegressor

# Train a RandomForestRegressor to determine feature importance
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Feature Importances:\n", feature_importance_df)

Feature Importances:
                   Feature  Importance
10  OverallQual_GrLivArea    0.703645
0             OverallQual    0.051688
3             TotalBsmtSF    0.045964
5               YearBuilt    0.044243
11   TotalBsmtSF_1stFlrSF    0.044201
2              GarageArea    0.027302
6            YearRemodAdd    0.026110
4                1stFlrSF    0.017951
1               GrLivArea    0.013824
9             GarageYrBlt    0.013020
8              MasVnrArea    0.008476
7          KitchenQual_Ex    0.003575


* To keep the model simple and efficient, we keep only the top 5 features:

In [18]:
# Select the top 5 features
top_5_features = feature_importance_df['Feature'].head(5).tolist()

# Update the training and testing sets with the top 5 features
X_train_top_5 = X_train[top_5_features]
X_test_top_5 = X_test[top_5_features]
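
As a quick check on this choice, the importances printed above can be summed to see how much of the total the top 5 features carry. The sketch below re-enters the printed values by hand (rounded as shown in the output), so it runs without the dataset; in the notebook itself, `feature_importance_df` could be used directly. The names `printed_importances`, `importance_series`, `cumulative_share` and `top_5_share` are illustrative only.

```python
import pandas as pd

# Importances copied by hand from the printed output above (rounded to 6 decimals)
printed_importances = {
    "OverallQual_GrLivArea": 0.703645,
    "OverallQual": 0.051688,
    "TotalBsmtSF": 0.045964,
    "YearBuilt": 0.044243,
    "TotalBsmtSF_1stFlrSF": 0.044201,
    "GarageArea": 0.027302,
    "YearRemodAdd": 0.026110,
    "1stFlrSF": 0.017951,
    "GrLivArea": 0.013824,
    "GarageYrBlt": 0.013020,
    "MasVnrArea": 0.008476,
    "KitchenQual_Ex": 0.003575,
}

# Sort from most to least important, then accumulate
importance_series = pd.Series(printed_importances).sort_values(ascending=False)
cumulative_share = importance_series.cumsum()

# Share of total importance carried by the five highest-ranked features
top_5_share = cumulative_share.iloc[4]

print(cumulative_share.round(3))
print(f"Share of total importance held by the top 5 features: {top_5_share:.1%}")
```

On these numbers the top 5 features hold about 89% of the total importance, with `OverallQual_GrLivArea` alone accounting for roughly 70%.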

* Define the pipeline

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import mean_squared_error, r2_score


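def pipeline_reg():
    """Build the regression pipeline: scale, select features, then fit a random forest."""
    pipeline = Pipeline([
        # Standardize features to zero mean and unit variance
        ("feat_scaling", StandardScaler()),
        # Keep only features whose importance meets SelectFromModel's default threshold
        ("feat_selection", SelectFromModel(RandomForestRegressor(random_state=42))),
        # Final regressor, trained on the selected features
        ("model", RandomForestRegressor(random_state=42)),
    ])
    return pipeline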

* Create and train the pipeline

In [23]:
pipeline = pipeline_reg()
pipeline.fit(X_train_top_5, y_train)
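
Because the pipeline contains a `SelectFromModel` step, the final model may be trained on fewer columns than it is given. With its default settings and a tree-based estimator, `SelectFromModel` keeps only the features whose importance is at least the mean importance. The sketch below shows how to read back the retained columns from a fitted pipeline. It uses synthetic data (`X_demo`, `y_demo`, and the column names `f0` to `f4` are made up for illustration) in which only `f0` drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: five columns, but only "f0" influences the target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
y_demo = 10.0 * X_demo["f0"] + rng.normal(scale=0.1, size=300)

# Same three steps as pipeline_reg() in this notebook
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("model", RandomForestRegressor(random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask over the input columns: True where the feature was kept
mask = demo_pipeline.named_steps["feat_selection"].get_support()
kept_features = X_demo.columns[mask].tolist()

print("Features passed to the final model:", kept_features)
```

In the notebook, the same `get_support()` call on the fitted `pipeline`, combined with `X_train_top_5.columns`, would show which of the top 5 features the final model actually uses. Since `OverallQual_GrLivArea` dominates the importances, it is plausible that only a few features survive this step, though that has not been verified here.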

With the processed datasets loaded and the pipeline trained, we can now evaluate the model. The goal is to predict house prices from the selected features and assess how well the model performs.

#### Step 1: Making Predictions

First, we use the trained pipeline to make predictions on both the training and testing sets. The features from each set are passed to the pipeline, which returns predicted values for the target variable, SalePrice.

In [30]:
# Make predictions on the training set
y_train_pred = pipeline.predict(X_train_top_5)

# Make predictions on the testing set
y_test_pred = pipeline.predict(X_test_top_5)

#### Step 2: Evaluating the Model

Next, we evaluate the performance of our model using two key metrics: Mean Squared Error (MSE) and R² Score.

* Mean Squared Error (MSE) measures the average squared difference between the actual and predicted values. Lower MSE values indicate better model performance, as they signify that the predicted values are closer to the actual values.

* R² Score measures the proportion of variance in the target variable that the model explains. The best possible value is 1, and a model that always predicts the mean of the target scores 0; the score can be negative when a model performs worse than that baseline. Higher values indicate better performance.
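
To make both definitions concrete, here is a small worked example on toy numbers (not the housing data). It computes MSE and R² by hand and shows that R² drops below 0 when predictions are worse than simply predicting the mean. The helper names `mse_manual` and `r2_manual` are illustrative; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_good = np.array([110.0, 190.0, 310.0, 390.0])  # each prediction is off by 10
y_bad = np.array([400.0, 300.0, 200.0, 100.0])   # order reversed, worse than the mean


def mse_manual(y, y_hat):
    """Average of the squared residuals."""
    return float(np.mean((y - y_hat) ** 2))


def r2_manual(y, y_hat):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


mse_good = mse_manual(y_true, y_good)  # 100.0
r2_good = r2_manual(y_true, y_good)    # 0.992
r2_bad = r2_manual(y_true, y_bad)      # -3.0

print(f"Good predictions: MSE={mse_good}, R2={r2_good:.3f}")
print(f"Bad predictions:  R2={r2_bad:.1f}")
```

The hand-written versions agree with scikit-learn's functions on these inputs, which is a simple way to confirm what each metric measures.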

In [31]:
# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

# Evaluate the model on the testing set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

#### Step 3: Displaying the Results

Finally, we print the evaluation metrics to the console to summarize the model's performance.

In [32]:
# Display the results
print(f"Training Set - Mean Squared Error: {train_mse}")
print(f"Training Set - R^2 Score: {train_r2}")
print(f"Testing Set - Mean Squared Error: {test_mse}")
print(f"Testing Set - R^2 Score: {test_r2}")

Training Set - Mean Squared Error: 411496843.6935477
Training Set - R^2 Score: 0.9310095786785221
Testing Set - Mean Squared Error: 2299035504.565193
Testing Set - R^2 Score: 0.7002688748216487


---

## Interpretation of Results

#### Training Set Results

- **Mean Squared Error (MSE)**: 411,496,843.69
  - This value is the average squared difference between the actual and predicted house prices on the training set. Because MSE is expressed in squared price units, it is hard to judge in isolation; it is most informative when compared with the testing set value.
- **R² Score**: 0.9310
  - The R² Score of 0.9310 indicates that the model explains approximately 93.10% of the variance in the house prices on the training set. This high R² Score suggests that the model fits the training data very well.

#### Testing Set Results

- **Mean Squared Error (MSE)**: 2,299,035,504.57
  - This value is the average squared difference between the actual and predicted house prices on the testing set. It is roughly 5.6 times the training MSE, which shows that the model's predictions are noticeably less accurate on unseen data.
- **R² Score**: 0.7003
  - The R² Score of 0.7003 indicates that the model explains approximately 70.03% of the variance in the house prices on the testing set. While this is a reasonably good score, it is lower than the R² Score on the training set.
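
MSE is expressed in squared price units, which makes the raw numbers hard to read. Taking the square root gives the root mean squared error (RMSE), which is in the same units as SalePrice. RMSE is not computed in the notebook; the sketch below derives it from the two MSE values reported above, with `train_rmse`, `test_rmse` and `mse_ratio` as illustrative names.

```python
import math

# MSE values copied from the evaluation output above
train_mse = 411496843.6935477
test_mse = 2299035504.565193

# RMSE is the square root of MSE, in the same units as the target
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# How many times larger the test error is than the training error
mse_ratio = test_mse / train_mse

print(f"Training RMSE: {train_rmse:,.0f}")
print(f"Testing RMSE:  {test_rmse:,.0f}")
print(f"Test MSE / train MSE: {mse_ratio:.2f}")
```

This gives an RMSE of roughly 20,285 on the training set and 47,948 on the testing set, so a typical error on unseen houses is more than twice the typical error on the training data.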

### Conclusion

The model fits the training set well (R² of 0.93), but performance drops noticeably on the testing set (R² of 0.70, with an MSE about 5.6 times higher). This gap suggests that the model is overfitting: it performs well on the data it was trained on but does not generalize as well to unseen data.
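
One common way to get a steadier estimate of how a model generalizes is k-fold cross-validation, which scores the model on several held-out folds instead of a single split. This step is not part of the notebook; the following is a minimal sketch on synthetic data from `make_regression`, with `X_demo`, `y_demo`, `demo_pipeline` and the reduced `n_estimators=50` chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Simplified pipeline: scaling followed by a random forest
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# 5-fold cross-validation: the pipeline is refit on 4 folds and scored on the 5th, five times
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="r2")

print("R2 per fold:", cv_scores.round(3))
print("Mean R2:", round(cv_scores.mean(), 3))
```

In the notebook, `X_train_top_5` and `y_train` would take the place of the synthetic data. A mean cross-validated R² well below the training R² would point in the same direction as the train/test gap described above.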


---

# Save ML model

In [33]:
import os
import joblib

# Create the models folder; report (rather than fail) if it already exists
try:
    os.makedirs(name='outputs/models')
except FileExistsError as e:
    print(e)

# Save the initial model with top 5 features
initial_model_path = "outputs/models/house_price_model_v1_top_5.pkl"
joblib.dump(pipeline, initial_model_path)

print(f"Initial model with top 5 features saved to: {initial_model_path}")


[WinError 183] Cannot create a file when that file already exists: 'outputs/models'
Initial model with top 5 features saved to: outputs/models/house_price_model_v1_top_5.pkl
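
A saved model is only useful if it can be loaded back and produces the same predictions. The sketch below shows the round trip with `joblib.dump` and `joblib.load` on a small stand-in model written to a temporary folder, so it does not touch `outputs/models`. The data, the model settings and the file name `demo_model.pkl` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset and model standing in for the trained pipeline
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
demo_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary folder, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, demo_path)
    reloaded_model = joblib.load(demo_path)

# The reloaded model should reproduce the original predictions
predictions_match = np.allclose(demo_model.predict(X_demo), reloaded_model.predict(X_demo))
print("Predictions match after reload:", predictions_match)
```

For the model saved in this notebook, the equivalent would be `joblib.load(initial_model_path)` followed by `predict` on data containing the same top 5 feature columns.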
