# Detailed Report on Steps, Strategies, and Rationale Behind Code Implementation for Housing Price Prediction


## Introduction

The objective of this project is to build a predictive model for housing prices based on various features of properties. 
The approach follows a structured methodology covering data preprocessing, feature engineering, model training, evaluation, and visualization. 
This report will break down each component of the process, explaining the rationale behind each strategy and the steps taken to ensure robust model performance.



## 1. Feature Engineering (feature_engineering.py)

**Rationale**:  
Feature engineering is crucial to enhance the predictive power of the model. This step improves the dataset by creating new features, handling missing data, and mitigating the impact of skewed data distributions.

**Steps**:

- **Handling Missing Data**:
  - Numerical features with missing values are imputed using the median.
  - Categorical features with missing values are filled with the mode (most frequent value).  
  **Why**: Imputation ensures the dataset remains complete and usable for model training, preventing loss of valuable data due to missing values.

- **Adding New Features**:
  - **Total Area**: Combines `Living_Area` and `Surface_area_plot_of_land`.
  - **Total Amenities**: Counts the presence of amenities like a kitchen, terrace, swimming pool.
  - **Average Room Size**: Calculated as `Living_Area / Number_of_Rooms`.
  - **Amenities Ratio**: Ratio of amenities to rooms.  
  **Why**: These features express size and amenity information in forms the raw columns do not expose directly, such as area per room or amenities per room.

- **Log Transformation**:  
  - Applied to `Living_Area`, `Total_Area`, and `total_income`.  
  **Why**: Log transformations help reduce skewness, making the model better able to capture relationships between variables.

- **Clustering Regions**:
  - **Region Clustering**: Uses KMeans clustering on Locality and Municipality.  
  **Why**: Geographical features play a critical role in housing prices, and clustering captures complex relationships between location and price.

- **Interaction Features**:
  - **Airport-Brussels Interaction**: Interaction between `Distance_to_Nearest_Airport` and `Distance_to_Brussels`.
  - **Density-Unemployment Ratio**: Ratio of population density to the unemployment rate.  
  **Why**: Interaction features help the model understand relationships between variables that could influence the target variable in non-linear ways.

- **Handling Outliers**:
  - Outliers are capped at the 5th and 95th percentiles of each numerical feature.  
  **Why**: Outliers can distort model predictions, and capping ensures the model focuses on more typical data trends.
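
The imputation, derived-feature, log-transform, and interaction steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's `feature_engineering.py`: the function name `engineer_features`, the list of amenity columns (assumed to be 0/1 flags), and the use of a product for the airport-Brussels interaction are assumptions. Column names are taken from the feature importance table in this report.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the feature engineering steps."""
    df = df.copy()

    # Impute: median for numeric columns, mode for all other columns.
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Derived features. The amenity list is an assumption (0/1 flags).
    amenities = ["Fully_Equipped_Kitchen", "Terrace", "Swimming_Pool"]
    df["Total_Area"] = df["Living_Area"] + df["Surface_area_plot_of_land"]
    df["Total_Amenities"] = df[amenities].sum(axis=1)
    # Treat zero rooms as missing to avoid division by zero.
    rooms = df["Number_of_Rooms"].replace(0, np.nan)
    df["Average_Room_Size"] = df["Living_Area"] / rooms
    df["Amenities_Ratio"] = df["Total_Amenities"] / rooms

    # log1p keeps zero-valued inputs finite.
    for col in ["Living_Area", "Total_Area", "total_income"]:
        df[f"{col}_log"] = np.log1p(df[col])

    # Interaction features. The product form is an assumption.
    df["Airport_Brussels_Interaction"] = (
        df["Distance_to_Nearest_Airport"] * df["Distance_to_Brussels"]
    )
    df["Density_Unemployment_Ratio"] = (
        df["Population Density"] / df["Unemployment Rate (%)"]
    )
    return df


sample = pd.DataFrame({
    "Living_Area": [100.0, np.nan, 200.0],
    "Surface_area_plot_of_land": [300.0, 500.0, 0.0],
    "Number_of_Rooms": [2, 3, 4],
    "Fully_Equipped_Kitchen": [1, 0, 1],
    "Terrace": [1, 1, 0],
    "Swimming_Pool": [0, 0, 0],
    "total_income": [30000.0, 28000.0, 35000.0],
    "Distance_to_Nearest_Airport": [10.0, 20.0, 30.0],
    "Distance_to_Brussels": [5.0, 50.0, 100.0],
    "Population Density": [1000.0, 400.0, 250.0],
    "Unemployment Rate (%)": [5.0, 8.0, 10.0],
    "Province": ["Antwerp", None, "Antwerp"],
})
features = engineer_features(sample)
```

In the sample, the missing `Living_Area` is filled with the median of the observed values (150.0) and the missing `Province` with the most frequent value.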

**Strategy**:  
All these steps aim to create a structured and informative dataset, prepared for modeling.
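
The percentile capping described under Handling Outliers can be illustrated with `DataFrame.clip`. The helper name `cap_outliers` is hypothetical, and applying the cap to every numeric column is an assumption about how the project selects columns.

```python
import pandas as pd


def cap_outliers(df: pd.DataFrame, lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Clip each numeric column to its [lower, upper] quantile range."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    low = df[num_cols].quantile(lower)
    high = df[num_cols].quantile(upper)
    # axis=1 aligns the per-column bounds with the column labels.
    df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)
    return df


areas = pd.DataFrame({"Living_Area": [float(v) for v in range(1, 101)]})
capped = cap_outliers(areas)
```

Capping keeps every row, unlike outlier removal, so the dataset size is unchanged and only the extreme values move.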



## 2. Data Cleaning (data_cleaning.py)

**Rationale**:  
Data cleaning ensures that the dataset is free from issues such as missing values, outliers, and irrelevant data, making it more suitable for modeling.

**Steps**:

- **Handling Missing Data**:
  - Categorical columns with missing values are filled with 'Unknown'.  
  **Why**: Treating missing categorical values as a distinct category prevents the loss of valuable data.

- **Outlier Detection**:
  - Uses Isolation Forest to detect and remove outliers in numerical features.  
  **Why**: Outliers can distort model predictions, and Isolation Forest is an efficient method for detecting outliers in high-dimensional datasets.

- **Log Transformation**:
  - The target variable (`Price`) undergoes a logarithmic transformation (`log1p`).  
  **Why**: This transformation helps stabilize variance and reduce the impact of extreme values, improving the model's performance.
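
A small sketch of the 'Unknown' fill and the `log1p` target transform follows. The sample data and the `Price_log` column name are made up for illustration. The report does not state how predictions are mapped back to price units; `np.expm1` is the exact inverse of `np.log1p` and is shown here as the natural choice.

```python
import numpy as np
import pandas as pd

listings = pd.DataFrame({
    "State_of_the_Building": ["Good", None, "To renovate"],
    "Price": [250_000.0, 399_000.0, 1_200_000.0],
})

# Missing categories become their own level instead of dropping the row.
cat_cols = listings.select_dtypes(exclude="number").columns
listings[cat_cols] = listings[cat_cols].fillna("Unknown")

# Train on log1p(Price); expm1 inverts it so errors can be reported in price units.
listings["Price_log"] = np.log1p(listings["Price"])
recovered = np.expm1(listings["Price_log"])
```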

**Strategy**:  
The data cleaning process prepares the dataset by addressing missing values, removing outliers, and transforming the data to be suitable for model training.
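
The Isolation Forest step can be sketched with scikit-learn as below. The helper name `drop_outliers`, the `contamination` value, and the synthetic data are assumptions; the report does not give the settings used in `data_cleaning.py`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def drop_outliers(df: pd.DataFrame, contamination: float = 0.05, seed: int = 42) -> pd.DataFrame:
    """Remove rows that Isolation Forest flags as outliers on the numeric columns."""
    # Numeric columns are assumed to be free of missing values at this point.
    num_cols = df.select_dtypes(include="number").columns
    forest = IsolationForest(contamination=contamination, random_state=seed)
    # fit_predict returns 1 for inliers and -1 for outliers.
    labels = forest.fit_predict(df[num_cols])
    return df.loc[labels == 1].reset_index(drop=True)


rng = np.random.default_rng(0)
area = rng.normal(150, 20, 200)
rooms = rng.integers(2, 6, 200)
area[0], rooms[0] = 5000.0, 40  # one obviously anomalous listing
demo = pd.DataFrame({"Living_Area": area, "Number_of_Rooms": rooms})
cleaned = drop_outliers(demo)
```

The `contamination` parameter fixes the share of rows treated as outliers, so it directly controls how much data is discarded.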



## 3. Model Training (model_training.py)

**Rationale**:  
The model training step focuses on generating a predictive model that can accurately predict house prices based on the engineered features.

**Steps**:

- **Data Splitting**:  
  The dataset is split into training and testing sets using an 80-20 ratio.  
  **Why**: Training on one portion of the data and evaluating on a separate, unseen portion gives an honest estimate of generalization and makes overfitting visible.

- **Model Selection (CatBoostRegressor)**:  
  The model used is **CatBoostRegressor**, chosen for its native handling of categorical features and its strong performance on tabular data.  

- **Hyperparameter Tuning**:  
  Key parameters like `learning_rate`, `iterations`, and `depth` are tuned to optimize model performance.  
  **Why**: These parameters control the step size, the number of trees, and the depth of each tree, so tuning them balances underfitting against overfitting.

- **Model Evaluation**:  
  Performance is assessed using multiple metrics:  
  - **RMSE (Root Mean Squared Error)**  
  - **MAE (Mean Absolute Error)**  
  - **R² (Coefficient of Determination)**  
  - **MAPE (Mean Absolute Percentage Error)**  
  - **sMAPE (Symmetric Mean Absolute Percentage Error)**  
  **Why**: These metrics help evaluate the model's performance, revealing how well the model generalizes.




## 4. Plotting (plotting.py)

**Rationale**:  
Visualization helps interpret the model's performance, feature importance, and decision-making process. It aids in understanding the model's predictions and explaining them to stakeholders.

**Steps**:

- **Residuals Plot**:  
  Visualizes the residuals (differences between predictions and actual values) against predicted values.  
  **Why**: Helps assess whether the model is making systematic errors or if there are patterns to address.

- **Feature Importance Plot**:  
  Visualizes the importance of each feature based on the model's interpretation.  
  **Why**: Helps identify which features most influence predictions, guiding further feature engineering.

- **SHAP Values**:  
  **SHAP (SHapley Additive exPlanations)** values show how each feature contributes to individual predictions.  
  **Why**: Enhances interpretability, providing insight into model decisions, especially with complex algorithms like CatBoost.

- **SHAP Analysis**:
  - **SHAP Summary Plot**: Displays the contribution of each feature to the model's predictions.
  - **SHAP Bar Plot**: Quantifies the average effect of each feature on predictions.
  - **Visualizing a Single Prediction**: Shows how individual features impact a specific prediction.
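
A residuals plot of the kind described above can be produced with matplotlib. This is a sketch, not the project's `plotting.py`: the function name `plot_residuals` is hypothetical, and the sign convention (actual minus predicted) is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_residuals(y_true, y_pred):
    """Scatter residuals (actual minus predicted) against predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(y_pred, residuals, alpha=0.5)
    # A horizontal line at zero makes systematic over/under-prediction easy to see.
    ax.axhline(0.0, color="red", linestyle="--")
    ax.set_xlabel("Predicted price")
    ax.set_ylabel("Residual (actual - predicted)")
    ax.set_title("Residuals vs. predicted values")
    return fig, ax


fig, ax = plot_residuals(
    [397_999.99, 208_999.99, 139_999.99],
    [401_043.60, 230_399.16, 137_216.23],
)
```

A funnel shape or a trend in this plot would indicate errors that grow or drift with the predicted price.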



## 5. Main Script (main.py)

**Rationale**:  
The main script orchestrates the entire workflow, from data loading to model training and evaluation, ensuring the process is streamlined and efficient.

**Steps**:

- **Data Loading**: Loads the dataset from a CSV file.
- **Feature Engineering**: The dataset undergoes feature engineering to enhance its predictive power.
- **Data Cleaning**: The data is cleaned by filling missing categorical values and removing outliers.
- **Model Training and Evaluation**: The model is trained, and its performance is evaluated.
- **Visualization**: The results are visualized to provide insight into the model's performance.
- **Model and Predictions Saving**: The trained model is saved for future use, and predictions are stored for analysis.

**Strategy**:  
The main script ensures a smooth and cohesive workflow, integrating all steps from data preprocessing to model evaluation, and outputs relevant results for further analysis.



## Model Evaluation and Execution Time Summary

**Model Performance Results**:

| Metric | Training  | Test      |
|--------|-----------|-----------|
| RMSE   | 60,953.93 | 66,504.85 |
| MAE    | 44,866.97 | 48,726.73 |
| R²     | 0.784     | 0.748     |
| MAPE   | 14.37%    | 15.78%    |
| sMAPE  | 14.01%    | 15.23%    |

**Interpretation of Metrics**:

- The R² values (0.784 for training and 0.748 for testing) indicate a reasonably good fit: the model explains about 78% of the variance in housing prices on the training set and about 75% on the test set.
  
- The RMSE values (60,953.93 on the training data and 66,504.85 on the test data) measure the typical size of a prediction error while weighting large errors more heavily than MAE does. They are reasonable but could be improved.
  
- MAE is the average absolute error between predicted and actual values (44,866.97 on training, 48,726.73 on test); the model performs slightly better on the training data.

- MAPE and sMAPE express the error as a percentage. The slightly higher error on the test set (15.78% vs. 14.37% MAPE) points to mild overfitting and room for improvement.
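
For reference, the five metrics can be computed as in the sketch below. The function name `regression_metrics` is hypothetical, and sMAPE has several definitions in the literature; the one used here (absolute error divided by the mean of the absolute actual and predicted values) is an assumption about which variant the project uses.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return RMSE, MAE, R², MAPE (%), and sMAPE (%) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1.0 - ss_res / ss_tot),
        "mape": float(100.0 * np.mean(np.abs(err) / np.abs(y_true))),
        "smape": float(
            100.0 * np.mean(2.0 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred)))
        ),
    }


metrics = regression_metrics([100.0, 200.0], [110.0, 180.0])
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which matches the reported figures.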

**Residuals Statistics**:

- Mean of Residuals: 4,367.32
- Standard Deviation of Residuals: 66,361.30
- Max Residual: 363,595.28
- Min Residual: -246,333.91

**Interpretation**:

- The mean residual (4,367.32) is small compared with the standard deviation (66,361.30), so the errors show little systematic bias but substantial spread.
- The maximum and minimum residuals are extreme, which shows that a few properties are predicted very poorly. These cases may be skewing the aggregate metrics, and handling them better could improve performance.
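
The residual statistics can be cross-checked against the reported RMSE. For any set of residuals, RMSE² equals the squared mean plus the population variance. The sketch below applies that identity to the figures above. That the residual statistics were computed on the test set is an inference from the result, not something the report states.

```python
import math

mean_residual = 4_367.32
std_residual = 66_361.30

# RMSE^2 = mean^2 + variance, so RMSE = hypot(mean, std).
implied_rmse = math.hypot(mean_residual, std_residual)
print(f"{implied_rmse:,.2f}")  # about 66,504.85, the reported test RMSE
```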

**Feature Importance**:

| Feature                            | Importance   |
|------------------------------------|--------------|
| Province                           | 16.40%       |
| Living_Area                        | 11.88%       |
| Living_Area_log                    | 11.46%       |
| State_of_the_Building              | 9.74%        |
| Locality                           | 8.66%        |
| Municipality                       | 7.60%        |
| Airport_Brussels_Interaction       | 3.68%        |
| Distance_to_Brussels               | 3.24%        |
| Total_Amenities                    | 2.83%        |
| Subtype_of_Property                | 2.70%        |
| Distance_to_Nearest_Airport        | 2.60%        |
| Surface_area_plot_of_land          | 2.45%        |
| Number_of_Rooms                    | 2.18%        |
| Total_Area                         | 1.72%        |
| Total_Area_log                     | 1.58%        |
| Fully_Equipped_Kitchen             | 1.31%        |
| total_income_log                   | 1.30%        |
| Lift                               | 1.15%        |
| Average_Room_Size                  | 1.14%        |
| total_income                       | 1.06%        |
| Number_of_Facades                  | 0.86%        |
| Region_Cluster                     | 0.79%        |
| Population Density                 | 0.69%        |
| Amenities_Ratio                    | 0.59%        |
| Density_Unemployment_Ratio         | 0.56%        |
| Terrace                            | 0.50%        |
| Type_of_Property                   | 0.43%        |
| Garden                             | 0.40%        |
| Unemployment Rate (%)              | 0.27%        |
| Employment Rate (%)                | 0.24%        |
| Swimming_Pool                      | 0.00%        |

**Interpretation**:

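- Province, Living_Area, and Living_Area_log are the most important features, confirming that property size and location are strong predictors of housing prices. Because Living_Area_log is a monotonic transform of Living_Area, the two carry the same information for a tree-based model, and their importances (23.34% combined) are best read together.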
- Swimming_Pool has zero importance, suggesting it has little to no effect on the price prediction, potentially due to a lack of variation in the dataset or being less relevant for the target variable.
  
**SHAP Analysis**:
- SHAP Summary Plot: Provides a high-level overview of the importance and influence of each feature on the model’s predictions.
- SHAP Bar Plot: Quantifies the contribution of each feature towards the predictions, with more significant features showing up with greater magnitude.
- Visualizing a Single Prediction: Offers insights into how individual feature values contribute to the prediction for a specific instance.
  
**Interpretation**:
- SHAP plots enhance interpretability, showing how each feature contributes to a model’s decision, which is especially useful in understanding how certain features (like Living_Area or Province) affect housing price predictions.

**Model and Predictions Saving**:
- Model Saved: catboost_model_with_tuning.cbm
- Predictions Saved: predictions.csv
  
**Interpretation**:
- The trained model and its predictions have been saved for future use or further evaluation.

**Predicted Prices Analysis**:
- The predicted prices from the model are compared against the actual prices below:

| Actual Prices | Predicted Prices |
|---------------|------------------|
| 397,999.99    | 401,043.60       |
| 208,999.99    | 230,399.16       |
| 139,999.99    | 137,216.23       |
| 259,999.99    | 218,720.45       |
| 405,000.00    | 402,908.83       |
| 219,999.99    | 217,612.30       |
**Analysis of Predicted Prices**:
- **Accuracy and Variance**: The predicted prices closely follow the actual prices for several cases, such as the properties priced at 397,999.99 and 405,000.00, where the error is under 1%. Discrepancies are notable elsewhere: a property priced at 509,999.99 (not among the sample rows shown above) was predicted at 326,463.36, an underestimate of about 36%.

- **Outliers**: Cases where predictions deviate this much, such as the 509,999.99 property, suggest that certain property characteristics are not captured effectively by the model.

- **General Trends**: Predictions align reasonably well with actual prices overall. The model captures pricing trends but may need refinement for edge cases and high-priced properties.
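
The per-row percentage error for the sample rows can be computed directly. The sketch below uses only the six actual/predicted pairs listed in the table; the variable names are illustrative.

```python
# (actual, predicted) pairs copied from the table above.
rows = [
    (397_999.99, 401_043.60),
    (208_999.99, 230_399.16),
    (139_999.99, 137_216.23),
    (259_999.99, 218_720.45),
    (405_000.00, 402_908.83),
    (219_999.99, 217_612.30),
]

# Absolute percentage error per row.
ape = [100.0 * abs(predicted - actual) / actual for actual, predicted in rows]

for (actual, predicted), error in zip(rows, ape):
    print(f"{actual:>12,.2f}  {predicted:>12,.2f}  {error:5.2f}%")

worst_actual, worst_predicted = rows[ape.index(max(ape))]
```

In this sample the largest percentage error belongs to the 259,999.99 property, while four of the six rows are within about 2% of the actual price.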


**Execution Time**:
- Total Execution Time: 207.12 seconds (~3.45 minutes)
  
**Interpretation**:
- The model training and inference process is relatively efficient, with the entire execution taking less than 4 minutes, which is acceptable for model deployment or further experimentation.



## Conclusion

- The model performs reasonably well, with a test R² of 0.748 and a test MAPE of 15.78%. There is room for improvement, particularly in handling outliers and in feature selection and engineering.
- The feature importance analysis suggests that location and property size are the most important factors in predicting housing prices.
- Future work should focus on:
  - **Outlier handling**: Addressing the large residuals.
  - **Hyperparameter tuning**: Further fine-tuning the CatBoost model for better accuracy.
  - **Cross-validation**: Implementing k-fold cross-validation to get a better sense of model generalization.
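
As a starting point for the cross-validation item, k-fold evaluation with scikit-learn could look like the sketch below. The data is synthetic and `LinearRegression` is only a stand-in for the project's model; the choice of five folds is also an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: price grows linearly with living area, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(40, 300, size=(200, 1))
y = 2_000 * X[:, 0] + rng.normal(0, 10_000, 200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so RMSE is returned negated.
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(f"RMSE: {rmse_per_fold.mean():,.0f} +/- {rmse_per_fold.std():,.0f}")
```

The spread across folds indicates how much a single 80-20 split could over- or understate the model's error.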
