### **Analysis of the Architectural Budget Prediction Model**

This document provides a technical breakdown of the process and results from the `Arch_python_model_updated.ipynb` notebook. The analysis covers the data processing pipeline, the training of the Artificial Neural Network (ANN), and an interpretation of the key visualizations generated.

### **Part 1: The Code - Process and Significance**

The notebook is structured into a logical sequence of data preparation, model training, and evaluation. Each section plays a critical role in developing a reliable prediction model.

#### **Sections 1 & 2: Data Loading, Cleaning, and Merging**
*   **Process:** The script begins by loading two separate CSV files: `Thesis Data - Architectural Quantity Cost.csv` and `Thesis Data - Achitectural Unit Cost.csv`. It then performs several cleaning steps:
    1.  Renames columns for clarity.
    2.  Converts columns containing numerical data (like quantities and costs) from text to numeric types, handling potential formatting issues like commas.
    3.  Imputes (fills in) any missing numerical values using the median of that column. This is a robust way to handle missing data without being skewed by outliers.
    4.  Extracts the numerical `Budget` from a text column using regular expressions.
    5.  Merges the two dataframes into a single `df_merged` based on the common `Project_Name`.
*   **Significance:** This is the foundational step. By cleaning and merging the data, we create a single, unified dataset that contains both the quantities of materials for a project and their corresponding unit costs. This allows for the creation of meaningful cost-based features, which are essential for predicting the final budget.
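
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline CSV data, the raw header names (`Project Name`, `Plaster Qty`, `Budget Text`, `Plaster Unit Cost`), and the `to_numeric` helper are all hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files (the real headers may differ).
quantity_csv = io.StringIO(
    'Project Name,Plaster Qty,Budget Text\n'
    'Proj A,"1,200.5","PHP 12,500,000.00"\n'
    'Proj B,,"PHP 8,000,000.00"\n'
    'Proj C,800,"PHP 5,250,000.00"\n'
)
unit_cost_csv = io.StringIO(
    'Project Name,Plaster Unit Cost\n'
    'Proj A,350.00\n'
    'Proj B,410\n'
    'Proj C,\n'
)

def to_numeric(series: pd.Series) -> pd.Series:
    """Remove thousands separators and coerce to float; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# Step 1: rename columns for clarity.
df_qty = pd.read_csv(quantity_csv).rename(
    columns={"Project Name": "Project_Name", "Plaster Qty": "plaster_Qty"}
)
df_unit_cost = pd.read_csv(unit_cost_csv).rename(
    columns={"Project Name": "Project_Name", "Plaster Unit Cost": "plaster_Unit_Cost"}
)

# Steps 2 and 3: convert text to numbers, then fill gaps with the column median.
for frame, column in [(df_qty, "plaster_Qty"), (df_unit_cost, "plaster_Unit_Cost")]:
    frame[column] = to_numeric(frame[column])
    frame[column] = frame[column].fillna(frame[column].median())

# Step 4: pull the numeric budget out of the text column with a regular expression.
budget_text = df_qty["Budget Text"].str.extract(r"([\d,]+(?:\.\d+)?)", expand=False)
df_qty["Budget"] = to_numeric(budget_text)
df_qty = df_qty.drop(columns="Budget Text")

# Step 5: merge on the shared project name.
df_merged = df_qty.merge(df_unit_cost, on="Project_Name", how="inner")
print(df_merged)
```

An inner join is assumed here, so a project missing from either file would be dropped from `df_merged`.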

#### **Section 3: Granular Feature Engineering & Visualization**
*   **Process:** This section transforms the raw data into more predictive features:
    1.  **Granular Cost Calculation:** It calculates the estimated cost for each individual architectural component (e.g., `plaster_Est_Cost`) by multiplying its quantity by its median unit cost from the `df_unit_cost` dataset.
    2.  **Contextual Features:** It extracts the `Num_Storeys` and `Num_Classrooms` from the project names to provide the model with context about the project's scale.
    3.  **Target Transformation:** The target variable, `Budget`, is log-transformed (`np.log1p`) to reduce the right skew of its distribution.
    4.  **Visualization Generation & Saving:** It generates and saves key plots that are analyzed in Part 2 of this document.
*   **Significance:** This is arguably the most important section for model performance. Instead of forcing the model to learn the complex relationship between raw quantities and budget, we provide it with pre-calculated, highly relevant features (the estimated costs). Log-transforming the target variable prevents the model from being disproportionately influenced by a few extremely high-budget projects, leading to a more stable and accurate model.
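
A minimal sketch of the three feature-engineering steps, using a toy dataframe. The project-name format (`2STY4CL ...`), the regular expression, and the `Budget_log` column name are assumptions made for illustration; the notebook's real naming pattern may differ.

```python
import re
import numpy as np
import pandas as pd

# Toy data; the "<n>STY<m>CL" naming convention is a hypothetical example.
df_merged = pd.DataFrame({
    "Project_Name": [
        "2STY4CL School Building",
        "1STY3CL School Building",
        "4STY20CL School Building",
    ],
    "plaster_Qty": [1200.0, 700.0, 5200.0],
    "Budget": [12_500_000.0, 5_250_000.0, 48_000_000.0],
})
df_unit_cost = pd.DataFrame({"plaster_Unit_Cost": [350.0, 410.0, 380.0]})

# Granular cost: quantity multiplied by the median unit cost across projects.
median_unit_cost = df_unit_cost["plaster_Unit_Cost"].median()
df_merged["plaster_Est_Cost"] = df_merged["plaster_Qty"] * median_unit_cost

# Contextual features: storeys and classrooms parsed from the project name.
scale = df_merged["Project_Name"].str.extract(
    r"(\d+)\s*STY\s*(\d+)\s*CL", flags=re.IGNORECASE
)
df_merged["Num_Storeys"] = scale[0].astype(int)
df_merged["Num_Classrooms"] = scale[1].astype(int)

# Target transformation: log1p compresses the long right tail of budgets.
df_merged["Budget_log"] = np.log1p(df_merged["Budget"])
print(df_merged)
```

Using `np.log1p` rather than `np.log` keeps the transform defined at zero, and `np.expm1` inverts it exactly when predictions are converted back to pesos.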

#### **Sections 4 & 5: ANN Data Preparation and Training**
*   **Process:**
    1.  The final set of input features (X) and the target variable (y) are defined.
    2.  The data is split into a training set (80%) and a testing set (20%).
    3.  The input features (X) are standardized using `StandardScaler`, which rescales them to have a mean of 0 and a standard deviation of 1.
    4.  The log-transformed target variable (y) is scaled to the range [0, 1] using `MinMaxScaler`.
    5.  All data is converted into PyTorch Tensors, the required format for the neural network.
    6.  An ANN architecture with three hidden layers and dropout for regularization is defined and trained for 200 epochs.
*   **Significance:** This section ensures the data is in the optimal format for the ANN. **Splitting** the data is crucial to evaluate the model's ability to generalize to new, unseen data. **Standardizing inputs** ensures that features with large numerical ranges (like costs) do not dominate the learning process. **Min-Max scaling the target** is necessary because the model's final sigmoid activation function outputs values between 0 and 1.
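
The preparation and training steps might look roughly like the sketch below. Only the overall shape comes from the description above (80/20 split, `StandardScaler` on X, `MinMaxScaler` on y, three hidden layers with dropout, a sigmoid output, MSE loss, 200 epochs). The synthetic data, the `BudgetANN` class name, the layer widths, the dropout rate, the Adam optimizer, the learning rate, full-batch updates, and fitting the scalers on the training split only are all assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data: 100 projects, 5 cost-like features (assumption).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=10, sigma=1, size=(100, 5))
y_log = np.log1p(X.sum(axis=1) * rng.uniform(0.9, 1.1, size=100)).reshape(-1, 1)

# 80/20 split, then scalers fitted on the training portion only (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)
x_scaler = StandardScaler().fit(X_train)
y_scaler = MinMaxScaler().fit(y_train)

X_train_t = torch.tensor(x_scaler.transform(X_train), dtype=torch.float32)
y_train_t = torch.tensor(y_scaler.transform(y_train), dtype=torch.float32)

class BudgetANN(nn.Module):
    """Three hidden layers with dropout; sigmoid output matches the [0, 1] target."""

    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

torch.manual_seed(0)
model = BudgetANN(X_train_t.shape[1])
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Full-batch training for 200 epochs, recording the loss at each epoch.
losses = []
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Before evaluating, `model.eval()` should be called so that dropout is disabled during prediction.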

#### **Sections 6 & 7: Model Evaluation and Asset Saving**
*   **Process:** After training, the model's performance is evaluated on the unseen test data. Predictions are made and then inverse-transformed back to their original PHP currency scale. Performance is measured with R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Finally, the trained model and the scalers are saved to disk.
*   **Significance:** This is the core validation step. Evaluating the model on data it has never seen before gives an honest measure of its predictive power. The final metrics (R² of 0.9042, MAE of ~₱2.87M) quantify the model's success. Saving the model (`ann_granular_model.pth`) and scalers is crucial for deploying the model in a real-world application.
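
To make the evaluation step concrete, here is a small NumPy sketch of the inverse transform and the three metrics. The helper names `inverse_transform` and `regression_metrics` and the toy budgets are hypothetical; the notebook presumably relies on the fitted scaler's `inverse_transform` method and library metric functions instead.

```python
import numpy as np

def inverse_transform(y_scaled, log_min, log_max):
    """Undo Min-Max scaling, then undo log1p, returning values in PHP."""
    y_log = np.asarray(y_scaled, dtype=float) * (log_max - log_min) + log_min
    return np.expm1(y_log)

def regression_metrics(actual, predicted):
    """Return R-squared, MAE, and RMSE computed on the original currency scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

# Toy example: three actual budgets and three predictions, in PHP.
actual = np.array([10e6, 20e6, 30e6])
predicted = np.array([12e6, 19e6, 27e6])
metrics = regression_metrics(actual, predicted)
print(metrics)

# Round trip: scaling the log-budgets and inverting recovers the originals.
y_log = np.log1p(actual)
y_scaled = (y_log - y_log.min()) / (y_log.max() - y_log.min())
recovered = inverse_transform(y_scaled, y_log.min(), y_log.max())
```

Computing the metrics after the inverse transform matters: errors measured on the scaled log target are not directly interpretable in pesos.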

---

### **Part 2: Analysis of Visualizations and Model Performance**

#### **1. Effect of Log-Transformation on Target Variable**
*   **Technical Explanation:** The left histogram displays the distribution of the original project budgets, while the right histogram shows the same data after applying a log-transformation.
*   **Interpretation and Insights:**
    *   The original budget distribution on the left is heavily **right-skewed**. A large number of projects are clustered at lower budget values (under ₱20 million), with a long tail of fewer, high-cost projects. This skew matters because, under a squared-error loss, the few very large budgets in the tail would contribute disproportionately to the training signal.
    *   The log-transformed distribution on the right is much more **symmetric and bell-shaped**, resembling a normal distribution. This transformation stabilizes the target variable, making the underlying patterns easier for the neural network to learn and reducing the influence of extreme outliers. This plot visually justifies why log-transformation is a critical preprocessing step.
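
The visual impression can also be checked numerically by comparing the skewness before and after the transform. The sketch below uses synthetic lognormal budgets as a stand-in for the real data, and `sample_skewness` is a hypothetical helper.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: near 0 for symmetric data, positive for a right tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic right-skewed budgets (assumption: roughly lognormal, in PHP).
rng = np.random.default_rng(0)
budgets = rng.lognormal(mean=16.0, sigma=0.8, size=500)

raw_skew = sample_skewness(budgets)
log_skew = sample_skewness(np.log1p(budgets))
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness close to zero after the transform is what the right-hand histogram shows visually.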

#### **2. Architectural Feature Scales Before Standardization**
*   **Technical Explanation:** This box plot displays the distribution and range of each raw input feature before scaling. The x-axis is on a logarithmic scale to accommodate the vast differences in magnitude between features.
*   **Interpretation and Insights:**
    *   The plot dramatically illustrates the problem of **varying scales**. Features like `plaster_Est_Cost` have values in the millions, whereas `Num_Storeys` is in single digits.
    *   Without standardization, the features with larger values would completely dominate the model's learning process, effectively ignoring the predictive information in smaller-scale but important features like `Num_Storeys` and `Num_Classrooms`. This visualization provides a clear justification for using `StandardScaler` to put all features on a comparable scale.

#### **3. Correlation Matrix of Granular Architectural Features**
*   **Technical Explanation:** This heatmap shows the Pearson correlation coefficient between the engineered architectural features and the final `Budget`. Bright yellow indicates a strong positive correlation (close to +1).
*   **Interpretation and Insights:**
    *   **Strong Positive Correlation with Budget:** Nearly all engineered cost features show a strong positive correlation with the final `Budget`. `CHB_150mm_Est_Cost` (**0.85**), `plaster_Est_Cost` (**0.79**), and `Painting_masonry_Est_Cost` (**0.74**) are the top predictors. This is a powerful validation of the feature engineering approach.
    *   **Predictive Contextual Features:** `Num_Storeys` (**0.75**) is strongly correlated with the budget and `Num_Classrooms` (**0.52**) moderately so, which is logical as both are direct measures of project size.
    *   **High Multicollinearity:** Some input features are strongly correlated with each other, such as `plaster_Est_Cost` and `Painting_masonry_Est_Cost` (**0.80**). This mainly affects linear models, where it makes coefficient estimates unstable; a neural network used purely for prediction is less sensitive to it, although the correlated features carry partly redundant information.
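
A correlation matrix like this one can be scanned programmatically for collinear feature pairs. The sketch below is illustrative only: the data are synthetic, and the `highly_correlated_pairs` helper and its 0.8 threshold are assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features: two costs driven by the same wall area, plus storeys.
rng = np.random.default_rng(0)
n = 200
wall_area = rng.uniform(500, 5000, size=n)
features = pd.DataFrame({
    "plaster_Est_Cost": wall_area * 380 + rng.normal(0, 50_000, n),
    "Painting_masonry_Est_Cost": wall_area * 150 + rng.normal(0, 40_000, n),
    "Num_Storeys": rng.integers(1, 5, size=n),
})

corr = features.corr(method="pearson")

def highly_correlated_pairs(corr: pd.DataFrame, threshold: float = 0.8):
    """List each feature pair once if its absolute correlation meets the threshold."""
    pairs = []
    columns = list(corr.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = float(corr.loc[first, second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs

pairs = highly_correlated_pairs(corr)
for first, second, r in pairs:
    print(f"{first} vs {second}: r = {r:.2f}")
```

Pairs flagged this way are candidates for merging or dropping if a simpler, more interpretable model is ever needed.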

#### **4. Architectural Model Training Loss Over Epochs**
*   **Technical Explanation:** This plot tracks the Mean Squared Error (MSE) loss on the training data as the model trains over 200 epochs.
*   **Interpretation and Insights:**
    *   **Rapid Learning and Convergence:** The loss drops sharply within the first 25 epochs, showing that the model quickly learns the primary patterns. Afterward, the curve flattens out, indicating that training has converged to a stable solution.
    *   **Stable Training:** The curve is relatively smooth, though with some minor spikes (e.g., around epoch 125), which is normal during training. The overall downward trend and low final loss value (around 0.003) indicate a successful training process.

#### **5. Actual vs. Predicted Project Budget (Architectural Model)**
*   **Technical Explanation:** This scatter plot compares the model's budget predictions (Y-axis) against the actual budgets (X-axis) for the unseen test data. The red dashed line represents a perfect prediction.
*   **Interpretation and Insights:**
    *   **Excellent Accuracy:** The data points are tightly clustered around the "Perfect Fit" line, which is consistent with the **R-squared (R²) value of 0.9042**. In other words, the model explains about **90.4%** of the variance in the test-set project budgets.
    *   **Low Prediction Error:** The **Mean Absolute Error (MAE) of ₱2,866,371.94** shows that, on average, the model's predictions are off by about ₱2.87 million. The **Root Mean Squared Error (RMSE) of ₱4,380,099.82**, which penalizes larger errors more, is still very reasonable given the multi-million peso scale of the projects.
    *   **Unbiased Predictions:** The points are evenly distributed around the red line, showing no systematic tendency to over- or under-predict. This indicates the model is well-calibrated.

#### **6. Residuals Plot (Architectural Model)**
*   **Technical Explanation:** This plot shows the prediction error (Residual = Actual - Predicted) on the Y-axis against the predicted budget on the X-axis. A good model should have its residuals randomly scattered around the horizontal line at y=0.
*   **Interpretation and Insights:**
    *   **No Obvious Bias:** The residuals are mostly scattered randomly around the zero line with no clear curve or U-shape. This is a good sign, indicating that the model's errors are not systematic.
    *   **Potential for Minor Heteroscedasticity:** There's a slight tendency for the errors to increase in variance as the predicted budget gets larger (the points spread out more towards the right). This is common in financial modeling and indicates the model is slightly less certain about very high-cost projects. However, the presence of only a few large outliers suggests this is not a major issue. Overall, the plot supports the model's validity.
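
One simple way to quantify the widening spread is to compare the residual standard deviation for the lower and upper halves of the predicted budgets. This is a rough sketch on synthetic data, not a formal test; `residual_spread_by_half` is a hypothetical helper, and the residuals are simulated with noise proportional to the prediction.

```python
import numpy as np

def residual_spread_by_half(predicted, residuals):
    """Standard deviation of residuals below and above the median prediction."""
    predicted = np.asarray(predicted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    cutoff = np.median(predicted)
    low = residuals[predicted <= cutoff]
    high = residuals[predicted > cutoff]
    return low.std(ddof=1), high.std(ddof=1)

# Synthetic example: error scale grows with the predicted budget (assumption).
rng = np.random.default_rng(0)
predicted = rng.uniform(5e6, 60e6, size=400)
residuals = rng.normal(0.0, 0.08 * predicted)

low_sd, high_sd = residual_spread_by_half(predicted, residuals)
print(f"residual SD, lower half: {low_sd:,.0f}; upper half: {high_sd:,.0f}")
```

A clearly larger spread in the upper half would support the heteroscedasticity reading; similar values would suggest the visual widening is driven by a few outliers.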