### **Analysis of the Architectural Budget Prediction Model (MATLAB)**

This document provides a technical breakdown of the process implemented in `main_script.m` and its helper functions, and of the results it produces. The analysis covers the data processing pipeline, the training of the Artificial Neural Network (ANN), and an interpretation of the key visualizations generated.

### **Part 1: The MATLAB Code - Process and Significance**

The project is divided into three files: one main script and two helper functions, each with a specific role in the overall workflow.

#### **1. `main_script.m`**
*   **Process:** This is the primary script that orchestrates the entire machine learning pipeline. Its process is as follows:
    1.  **Setup:** Clears the MATLAB workspace, command window, and any open figures to ensure a clean run.
    2.  **Data Loading & Merging:** Reads the architectural quantity and unit cost CSV files. It then calls the `clean_table` helper function to preprocess each table. Using the `extract_budget` function, it creates the `Budget` column and merges the two tables into a single, comprehensive dataset.
    3.  **Feature Engineering:** It programmatically creates granular cost features (e.g., `plaster_Est_Cost`) by multiplying quantities with their corresponding unit costs. It also extracts contextual features like `Num_Storeys` and `Num_Classrooms` from project names.
    4.  **Data Preparation:** The script shuffles and splits the data into training (80%) and testing (20%) sets. It then applies `zscore` (Standardization) to the input features and performs min-max scaling on the log-transformed target variable.
    5.  **ANN Training:** It defines a multi-layer neural network using MATLAB's Deep Learning Toolbox, specifying layers, activation functions (ReLU), and dropout for regularization. It then trains this network for 200 epochs using the `adam` optimizer.
    6.  **Evaluation & Visualization:** It uses the trained network to make predictions on the unseen test data, inverse-transforms these predictions back to their original currency scale, and calculates key performance metrics (R², MAE, RMSE). It then generates and saves all the analytical plots.
    7.  **Saving Assets:** Finally, it saves the trained network object (`net`) and the scaling parameters (`scalers_granular.mat`) for future use.
*   **Significance:** This script is the "control center" of the project. It executes the complete end-to-end workflow from raw data to a fully evaluated, saved model. The inclusion of visualization and asset-saving steps makes the process reproducible and prepares the model for potential deployment.
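
The feature engineering and splitting steps described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the table values, the `plaster_Qty` and `plaster_Unit_Cost` column names, and the fixed random seed are assumptions made for the example. Only `plaster_Est_Cost`, `Num_Storeys`, and `Num_Classrooms` are names taken from the description.

```matlab
% Hypothetical merged table; values and the *_Qty / *_Unit_Cost names are illustrative.
Project_Name      = {'2 sty 4 cl'; '1 sty 3 cl'; '3 sty 9 cl'; '2 sty 6 cl'; '4 sty 12 cl'};
plaster_Qty       = [850; 400; 1900; 1200; 2600];
plaster_Unit_Cost = [320; 310; 335; 325; 340];
T = table(Project_Name, plaster_Qty, plaster_Unit_Cost);

% Granular cost feature: quantity multiplied by its unit cost.
T.plaster_Est_Cost = T.plaster_Qty .* T.plaster_Unit_Cost;

% Contextual features parsed from the standardized project name.
% Assumes every name matches the pattern; unmatched names would need handling.
tok = regexp(T.Project_Name, '(\d+)\s*sty\s*(\d+)\s*cl', 'tokens', 'once');
T.Num_Storeys    = cellfun(@(t) str2double(t{1}), tok);
T.Num_Classrooms = cellfun(@(t) str2double(t{2}), tok);

% Shuffle the rows, then split 80/20 (the seed is an assumption for repeatability).
rng(42);
n      = height(T);
idx    = randperm(n);
nTrain = round(0.8 * n);
trainTbl = T(idx(1:nTrain), :);
testTbl  = T(idx(nTrain+1:end), :);
```

Fixing the seed before `randperm` is one way to make the split repeatable between runs; whether the original script does so is not stated.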

#### **2. `clean_table.m`**
*   **Process:** This is a dedicated helper function for data cleaning. It performs several key operations:
    1.  Renames the first column to `Project_Name`.
    2.  **Crucially, it standardizes project names** by converting formats like "2x4" into a consistent "2 sty 4 cl" format using regular expressions.
    3.  It creates a reliable `Join_Key` for accurately merging the quantity and unit cost tables.
    4.  It cleans the feature columns by removing non-numeric characters (like commas) and converting them to numeric types.
    5.  It imputes any missing numeric values with the median of their respective columns.
*   **Significance:** This function encapsulates all the critical data cleaning tasks. Its primary importance lies in creating a clean, consistent, and machine-readable dataset, which is the foundation for any successful machine learning model. The project name standardization is a key upgrade that ensures a more robust and accurate merge.
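
The cleaning operations listed above might look something like the sketch below. The sample names and quantities are invented, and the way the join key is built here (lower-cased, trimmed, whitespace collapsed) is an assumption, since the exact definition of `Join_Key` is not given.

```matlab
% Hypothetical raw inputs as they might arrive from the CSV files.
names  = {'School Bldg 2x4'; 'School Bldg 1 x 3'; 'School Bldg 3X9'};
rawQty = {'1,250.50'; ''; '2,000'};

% Standardize "2x4"-style names to the "2 sty 4 cl" format.
names = regexprep(names, '(\d+)\s*[xX]\s*(\d+)', '$1 sty $2 cl');

% Join key (assumed definition): collapse whitespace, trim, lower-case.
joinKey = lower(strtrim(regexprep(names, '\s+', ' ')));

% Remove non-numeric characters such as commas, then convert to double.
% Empty or unparseable cells become NaN.
qty = str2double(regexprep(rawQty, '[^0-9.\-]', ''));

% Impute missing values with the column median.
qty(isnan(qty)) = median(qty, 'omitnan');
```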

#### **3. `extract_budget.m`**
*   **Process:** This is a specialized helper function that takes a text cell as input and uses a regular expression to find and extract the numerical budget value. It is designed to be robust, ignoring year numbers by looking for numbers with more than four digits and handling commas.
*   **Significance:** This function is a small but vital component for data extraction. It successfully isolates the target variable (the `Budget`) from a semi-structured text field, enabling the supervised learning task.
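
One possible way to implement this kind of extraction is shown below. It is a sketch under assumptions: the sample text and the exact regular expression are illustrative, and the real `extract_budget.m` may use a different pattern.

```matlab
% Hypothetical text cell containing a year and a budget figure.
txt = 'Construction of 2 sty 4 cl school building, CY 2023 - 12,500,000.00';

% Collect every run of digits and commas, with an optional decimal part.
candidates = regexp(txt, '\d[\d,]*(?:\.\d+)?', 'match');

budget = NaN;
for k = 1:numel(candidates)
    cleaned = strrep(candidates{k}, ',', '');            % drop thousands separators
    intPart = regexp(cleaned, '^\d+', 'match', 'once');  % digits before the decimal point
    if numel(intPart) > 4                                % skips years such as 2023
        budget = str2double(cleaned);
        break
    end
end
```

In this sketch the first number with more than four integer digits is taken as the budget. A year written directly against the budget with no space (for example `2023,12,500,000`) would be merged into one candidate, so the approach relies on the two being separated in the source text.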

---

### **Part 2: Analysis of Visualizations and Model Performance**

#### **1. Effect of Log-Transformation on Target Variable**
*   **Technical Explanation:** The left histogram shows the distribution of the original project budgets. The right histogram shows the same data after applying a log-transformation (`log(1 + Budget)`).
*   **Interpretation and Insights:**
    *   The "Original Budget" plot clearly shows a **right-skewed distribution**. Most projects are clustered in the lower-budget range (under ₱20 million), with a long tail of high-cost outliers. This skewness can negatively impact a model's performance.
    *   The "Log-Transformed Budget" plot on the right is much more **symmetric and bell-shaped**, resembling a normal distribution. This transformation compresses the range of the target variable, making the patterns easier for the neural network to learn and reducing the influence of extreme outliers. This plot visually justifies the use of the log-transformation in the pipeline.
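
The target transformation and its inverse can be illustrated with a short sketch. The budget values are invented, and the min-max parameters are fitted on the whole vector purely for brevity; in the pipeline they would more plausibly come from the training targets and be stored for later inverse-transformation.

```matlab
% Hypothetical right-skewed budgets in pesos.
budget = [3e6 4e6 5e6 6e6 8e6 12e6 20e6 60e6]';

% Sample skewness (population form), written out to avoid toolbox dependencies.
skew = @(x) mean((x - mean(x)).^3) / std(x, 1)^3;

% Log-transform: log(1 + Budget).
yLog = log1p(budget);
fprintf('Skewness raw: %.2f, log-transformed: %.2f\n', skew(budget), skew(yLog));

% Min-max scale the log-transformed target to [0, 1].
yMin = min(yLog);
yMax = max(yLog);
yScaled = (yLog - yMin) / (yMax - yMin);

% Inverse path used at prediction time: undo min-max, then undo the log.
budgetBack = expm1(yScaled * (yMax - yMin) + yMin);
```

`expm1` is the exact inverse of `log1p`, so the round trip recovers the original peso values up to floating-point error.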

#### **2. Architectural Feature Scales Before Standardization**
*   **Technical Explanation:** This box plot displays the distribution and range of each raw input feature before scaling. The x-axis is on a logarithmic scale to properly display the wide variation in values.
*   **Interpretation and Insights:**
    *   This visualization powerfully demonstrates the problem of **varying scales**. Features like `plaster_Est_Cost` have values in the millions, while `Num_Storeys` is in the single digits.
    *   Without standardization, the features with larger numerical values would dominate the gradient updates, so the network would effectively ignore the predictive information in smaller-scale but important features. This plot provides a clear rationale for using `zscore` (Standardization) to put all features on a comparable scale.
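
A minimal sketch of the standardization step follows. The feature values are invented, and the choice to compute the mean and standard deviation on the training rows only, then reuse them for the test rows, is an assumption about good practice rather than something the description confirms. The arithmetic matches what `zscore` computes column-wise.

```matlab
% Columns: [plaster_Est_Cost, Num_Storeys]; hypothetical values.
Xtrain = [1.2e6 1; 2.5e6 2; 4.0e6 2; 6.3e6 3];
Xtest  = [3.1e6 2];

% Fit the scaling parameters on the training set only.
mu    = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
sigma(sigma == 0) = 1;   % guard against constant columns

% Apply the same parameters to both sets (implicit expansion, R2016b or later).
XtrainZ = (Xtrain - mu) ./ sigma;
XtestZ  = (Xtest  - mu) ./ sigma;
```

After scaling, both columns have zero mean and unit standard deviation on the training set, regardless of whether the raw values were in the millions or in single digits.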

#### **3. Correlation Matrix of Granular Architectural Features**
*   **Technical Explanation:** This heatmap shows the pairwise Pearson correlation coefficients among the engineered architectural features and the final `Budget`. Bright yellow indicates a strong positive correlation (+1), while dark blue indicates a weak or negative correlation.
*   **Interpretation and Insights:**
    *   **Strong Predictors:** `plaster_Est_Cost` (**0.82**) and `Painting_masonry_Est_Cost` (**0.85**) show the strongest positive correlations with the final `Budget`. This is a powerful validation of the feature engineering approach, confirming that these individual cost components are excellent indicators of the total project cost.
    *   **Moderately Predictive Contextual Features:** `Num_Storeys` (**0.50**) and `Num_Classrooms` (**0.56**) are moderately correlated with the budget, which is logical as they are direct measures of project size.
    *   **High Multicollinearity:** There is a high correlation between `plaster_Est_Cost` and `Painting_masonry_Est_Cost` (**0.81**). This is expected, as larger walls require both more plaster and more paint. While this can destabilize coefficient estimates in linear models, neural networks are generally less sensitive to correlated inputs, although the two features carry largely redundant information.
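
A correlation matrix of this kind can be computed with base MATLAB's `corrcoef`. The data below are invented for illustration and will not reproduce the coefficients reported above.

```matlab
% Hypothetical feature columns and target.
plaster_Est_Cost          = [1.2 2.5 4.0 6.3 7.1 9.8]' * 1e6;
Painting_masonry_Est_Cost = [0.4 0.9 1.1 2.2 2.0 3.5]' * 1e6;
Num_Storeys               = [1 2 2 3 2 4]';
Budget                    = [5 9 14 22 21 35]' * 1e6;

labels = {'plaster_Est_Cost', 'Painting_masonry_Est_Cost', 'Num_Storeys', 'Budget'};

% Pairwise Pearson correlations; each column of the input is one variable.
R = corrcoef([plaster_Est_Cost, Painting_masonry_Est_Cost, Num_Storeys, Budget]);

% Correlation of each feature with Budget (last column, excluding Budget itself).
budgetCorr = R(1:end-1, end);

% To draw the matrix as a heatmap:
% heatmap(labels, labels, R);
```

The off-diagonal entries among the feature rows are what reveal multicollinearity, while the last column ranks the features by linear association with the target.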

#### **4. Architectural Model Training Loss Over Epochs**
*   **Technical Explanation:** This plot tracks the model's Mean Squared Error (MSE) on the training data across all training iterations (over 200 epochs).
*   **Interpretation and Insights:**
    *   **Rapid Learning:** The loss curve shows a steep drop in the initial iterations (roughly 0-200), indicating the model is quickly learning the primary patterns in the data.
    *   **Convergence:** After the initial drop, the curve gradually flattens and stabilizes at a very low error value (approaching 0 on the scaled target). This indicates that training has converged and that further epochs would yield little additional reduction in training loss. Because the curve tracks training data only, it does not by itself rule out overfitting; generalization is assessed on the test set. The noisy but downward trend is typical of training with a mini-batch approach.
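
For reference, a training setup of the kind described (ReLU layers, dropout, `adam`, 200 epochs) could be sketched with the Deep Learning Toolbox as below. The layer widths, dropout rate, mini-batch size, and synthetic data are all assumptions; the actual architecture in `main_script.m` is not specified here. The example requires the Deep Learning Toolbox (R2020b or later for `featureInputLayer`).

```matlab
% Synthetic stand-in data: 40 projects, 6 standardized features,
% and a target in [0, 1] playing the role of the scaled log-budget.
rng(0);
numFeatures = 6;
N = 40;
X = randn(N, numFeatures);
y = rand(N, 1);

% Hypothetical architecture: two hidden layers with ReLU and one dropout layer.
layers = [
    featureInputLayer(numFeatures)
    fullyConnectedLayer(64)
    reluLayer
    dropoutLayer(0.2)
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(1)
    regressionLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 200, ...
    'MiniBatchSize', 8, ...
    'Shuffle', 'every-epoch', ...
    'Verbose', false, ...
    'Plots', 'none');

% info.TrainingLoss holds one loss value per iteration (mini-batch update).
[net, info] = trainNetwork(X, y, layers, options);
lossPerIteration = info.TrainingLoss;

pred = predict(net, X);
```

With 40 samples and mini-batches of 8, each epoch contributes 5 iterations, which is why a loss curve plotted per iteration spans many more points than the number of epochs. Note also that `regressionLayer` reports a half-mean-squared-error loss, so values on such a curve may differ from a plain MSE by a constant factor.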

#### **5. Actual vs. Predicted Project Budget (Architectural Model)**
*   **Technical Explanation:** This scatter plot compares the model's final predictions on the unseen test data (Y-axis) against the true project budgets (X-axis). The red dashed line represents a perfect prediction.
*   **Interpretation and Insights:**
    *   **High Accuracy:** The data points are tightly clustered around the "Perfect Fit" line, which is a strong visual indicator of the model's high accuracy. This is quantitatively supported by the **R-squared (R²) value of 0.9042**, meaning the model explains about **90.4%** of the variance in the test-set project budgets.
    *   **Low Prediction Error:** The **Mean Absolute Error (MAE) of ₱2,866,371.94** shows that, on average, the model's predictions are off by about ₱2.87 million. The **Root Mean Squared Error (RMSE) of ₱4,380,099.82** is roughly 1.5 times the MAE, which suggests that a small number of large misses contribute disproportionately to the error. Both figures should be read against the multi-million peso scale of the projects.
    *   **Unbiased Predictions:** The points are evenly distributed around the red line, showing no systematic tendency to over- or under-predict. This indicates the model is well-calibrated and reliable.
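
The three metrics can be computed from the inverse-transformed predictions with a few lines of base MATLAB. The values below are invented to keep the arithmetic easy to follow and are unrelated to the reported results.

```matlab
% Hypothetical actual and predicted budgets in pesos (original currency scale).
yTrue = [10 20 30 40]' * 1e6;
yPred = [12 18 33 39]' * 1e6;

err = yTrue - yPred;

% Mean Absolute Error: average size of the miss.
mae = mean(abs(err));

% Root Mean Squared Error: penalizes large misses more heavily.
rmse = sqrt(mean(err.^2));

% R-squared: 1 minus residual sum of squares over total sum of squares.
r2 = 1 - sum(err.^2) / sum((yTrue - mean(yTrue)).^2);

fprintf('R^2 = %.4f, MAE = %.2f, RMSE = %.2f\n', r2, mae, rmse);
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and the gap between the two widens when a few predictions miss by a large amount.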

#### **6. Residuals Plot (Architectural Model)**
*   **Technical Explanation:** This plot shows the prediction error (Residual = Actual - Predicted) on the Y-axis against the predicted budget on the X-axis. A good model should have its residuals randomly scattered around the horizontal line at y=0.
*   **Interpretation and Insights:**
    *   **No Obvious Bias:** The residuals are mostly scattered randomly around the zero line with no clear curve or pattern. This is a good sign, indicating that the model's errors are not systematic.
    *   **Potential for Minor Heteroscedasticity:** There's a slight tendency for the variance of the errors to grow as the predicted budget gets larger (the points spread out more towards the right). This is common in financial modeling and indicates the model is somewhat less precise for very high-cost projects. However, only a few large outliers are present, which suggests this is not a major issue. Overall, the plot supports the model's validity.
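
A simple numerical companion to the residuals plot is sketched below. The data are invented, and the correlation between predicted value and absolute residual is used here only as a rough, informal indicator of heteroscedasticity, not as a formal statistical test.

```matlab
% Hypothetical predictions and residuals in pesos.
pred = [5 8 10 15 20 30 45 60]' * 1e6;
res  = [0.2 -0.3 0.4 -0.5 1.0 -2.0 3.5 -5.0]' * 1e6;
actual = pred + res;

% Residual = Actual - Predicted.
residuals = actual - pred;

% A mean residual near zero (relative to the budget scale) suggests no overall bias.
meanResidual = mean(residuals);

% If the error spread grows with the prediction, |residual| correlates
% positively with the predicted value.
C = corrcoef(pred, abs(residuals));
rho = C(1, 2);

fprintf('Mean residual: %.0f, corr(pred, |residual|): %.2f\n', meanResidual, rho);

% To reproduce the plot itself:
% scatter(pred, residuals); yline(0, '--');
```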