### **Analysis of the Structural Budget Prediction Model**

This document provides a technical breakdown of the process and results from the `struc_training.ipynb` notebook. The analysis covers the data processing pipeline, the training of the Artificial Neural Network (ANN), and an interpretation of the key visualizations generated.

### **Part 1: The Code - Process and Significance**

The notebook is structured into a logical sequence of data preparation, model training, and evaluation. Each section plays a critical role in developing a reliable prediction model.

#### **Sections 1 & 2: Data Loading, Cleaning, and Merging**
*   **Process:** The script begins by loading two separate CSV files: `Stuctural Quantity Cost.csv` and `Stuctural Unit Cost.csv`. It then performs several cleaning steps:
    1.  Renames columns for clarity.
    2.  Converts columns containing numerical data (like quantities and costs) from text to numeric types, handling potential formatting issues like commas.
    3.  Imputes (fills in) any missing numerical values using the median of that column. This is a robust way to handle missing data without being skewed by outliers.
    4.  Extracts the numerical `Budget` from a text column using regular expressions.
    5.  Merges the two dataframes into a single `df_merged` based on the common `Project_Name`.
*   **Significance:** This is the foundational step. By cleaning and merging the data, we create a single, unified dataset that contains both the quantities of materials for a project and their corresponding unit costs. This allows for the creation of meaningful cost-based features, which are essential for predicting the final budget.
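
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the two small DataFrames stand in for the CSV files, and every column name other than `Project_Name` and `Budget` (for example `Formworks_Qty` and `Budget_Text`) is a hypothetical placeholder.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; all column names except
# Project_Name and Budget are assumptions made for this sketch.
df_quantity = pd.DataFrame({
    "Project_Name": ["School A", "School B", "School C"],
    "Formworks_Qty": ["1,200.5", "640", None],
    "Budget_Text": ["PHP 12,500,000.00", "PHP 4,200,000.00", "PHP 31,000,000.00"],
})
df_unit_cost = pd.DataFrame({
    "Project_Name": ["School A", "School B", "School C"],
    "Formworks_Unit_Cost": ["850.00", "1,010.00", "900.00"],
})


def to_numeric(series: pd.Series) -> pd.Series:
    """Drop thousands separators, then convert text to floats (unparseable -> NaN)."""
    return pd.to_numeric(series.str.replace(",", "", regex=False), errors="coerce")


# Step 2: convert text columns to numeric types.
df_quantity["Formworks_Qty"] = to_numeric(df_quantity["Formworks_Qty"])
df_unit_cost["Formworks_Unit_Cost"] = to_numeric(df_unit_cost["Formworks_Unit_Cost"])

# Step 3: fill missing values with the column median.
median_qty = df_quantity["Formworks_Qty"].median()
df_quantity["Formworks_Qty"] = df_quantity["Formworks_Qty"].fillna(median_qty)

# Step 4: pull the numeric budget out of the text column with a regular expression.
budget_text = df_quantity["Budget_Text"].str.extract(r"([\d,]+(?:\.\d+)?)", expand=False)
df_quantity["Budget"] = to_numeric(budget_text)

# Step 5: merge the two tables on the shared project name.
df_merged = df_quantity.merge(df_unit_cost, on="Project_Name", how="inner")
print(df_merged[["Project_Name", "Formworks_Qty", "Formworks_Unit_Cost", "Budget"]])
```

An inner merge keeps only projects present in both files; whether the notebook uses an inner or another join type is not stated, so that choice is an assumption here.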

#### **Section 3: Granular Feature Engineering & Analysis**
*   **Process:** This section transforms the raw data into more predictive features:
    1.  **Granular Cost Calculation:** It calculates the estimated cost for each individual structural component (e.g., `Floor_Finish_Est_Cost`) by multiplying its quantity by its median unit cost from the `df_unit_cost` dataset.
    2.  **Feature Combination:** It identifies that the individual estimated costs for concrete columns, slabs, and beams are highly correlated and combines them into a single, more robust feature: `Total_Concrete_Est_Cost`.
    3.  **Contextual Features:** It extracts the `Num_Storeys` and `Num_Classrooms` from the project names to provide the model with context about the project's scale.
    4.  **Target Transformation:** The target variable, `Budget`, is log-transformed (`np.log1p`) to normalize its distribution.
    5.  **Visualization Generation:** It generates and saves five key plots that are analyzed in Part 2 of this document.
*   **Significance:** This is arguably the most important section for model performance. Instead of forcing the model to learn the complex relationship between raw quantities and budget, we provide it with pre-calculated, highly relevant features (the estimated costs). Combining correlated features helps reduce multicollinearity, which can stabilize the model. Log-transforming the target variable prevents the model from being disproportionately influenced by a few extremely high-budget projects.
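
As a rough sketch of these feature-engineering steps, the example below uses made-up data. The quantity and unit-cost column names, the `2STY4CL` naming pattern, and the use of a plain sum to combine the concrete costs are all assumptions for illustration; only `Floor_Finish_Est_Cost`, `Total_Concrete_Est_Cost`, `Num_Storeys`, `Num_Classrooms`, and `np.log1p` come from the description above.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical merged data; the project-name pattern is an assumption.
df_merged = pd.DataFrame({
    "Project_Name": ["2STY4CL School A", "1STY2CL School B", "3STY9CL School C"],
    "Floor_Finish_Qty": [400.0, 150.0, 900.0],
    "Column_Qty": [30.0, 10.0, 80.0],
    "Slab_Qty": [60.0, 25.0, 150.0],
    "Beam_Qty": [45.0, 15.0, 110.0],
    "Budget": [12_500_000.0, 4_200_000.0, 31_000_000.0],
})
# Hypothetical unit costs, one column per component.
df_unit_cost = pd.DataFrame({
    "Floor_Finish": [700.0, 750.0, 800.0],
    "Column": [5000.0, 5200.0, 5600.0],
    "Slab": [4800.0, 5000.0, 5100.0],
    "Beam": [5100.0, 5300.0, 5400.0],
})

# 1. Estimated cost = quantity x median unit cost of that component.
for component in ["Floor_Finish", "Column", "Slab", "Beam"]:
    median_unit_cost = df_unit_cost[component].median()
    df_merged[f"{component}_Est_Cost"] = df_merged[f"{component}_Qty"] * median_unit_cost

# 2. Combine the correlated concrete costs into one feature (assumed to be a sum).
concrete_parts = ["Column_Est_Cost", "Slab_Est_Cost", "Beam_Est_Cost"]
df_merged["Total_Concrete_Est_Cost"] = df_merged[concrete_parts].sum(axis=1)
df_merged = df_merged.drop(columns=concrete_parts)


# 3. Contextual features parsed from the project name.
def extract_count(names: pd.Series, suffix: str) -> pd.Series:
    """Return the integer that directly precedes `suffix` in each name."""
    pattern = rf"(\d+)\s*{suffix}"
    return pd.to_numeric(names.str.extract(pattern, flags=re.IGNORECASE, expand=False))


df_merged["Num_Storeys"] = extract_count(df_merged["Project_Name"], "STY")
df_merged["Num_Classrooms"] = extract_count(df_merged["Project_Name"], "CL")

# 4. Log-transform the target; np.expm1 reverses it later.
df_merged["Log_Budget"] = np.log1p(df_merged["Budget"])
print(df_merged.drop(columns=["Project_Name"]).round(2))
```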

#### **Section 4: Data Preparation for the ANN Model**
*   **Process:**
    1.  The final set of input features (X) and the target variable (y) are defined.
    2.  The data is split into a training set (80%) and a testing set (20%).
    3.  The input features (X) are standardized using `StandardScaler`, which rescales them to have a mean of 0 and a standard deviation of 1.
    4.  The log-transformed target variable (y) is scaled to the range [0, 1] using `MinMaxScaler`.
    5.  All data is converted into PyTorch Tensors, the required format for the neural network.
*   **Significance:** This section ensures the data is in the optimal format for the ANN. **Splitting** the data is crucial to evaluate the model's ability to generalize to new, unseen data. **Standardizing inputs** ensures that features with large numerical ranges (like costs) do not dominate the learning process over features with small ranges (like `Num_Storeys`). **Min-Max scaling the target** is necessary because the model's final sigmoid activation function outputs values between 0 and 1.
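
The preparation steps might look roughly like the sketch below, which uses synthetic data in place of the real features. The feature count, the random seed, and the choice to fit both scalers on the training split only are assumptions; the notebook may order these steps differently.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-ins: 50 projects, 4 cost-like features, log-transformed budgets.
rng = np.random.default_rng(42)
X = rng.uniform(1e5, 5e6, size=(50, 4))
y = np.log1p(rng.uniform(2e6, 4e7, size=50))

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize inputs: statistics come from the training split (assumption).
x_scaler = StandardScaler()
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)

# Scale the log-target to [0, 1] so it matches a sigmoid output range.
y_scaler = MinMaxScaler()
y_train_s = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test_s = y_scaler.transform(y_test.reshape(-1, 1))

# Convert everything to float32 tensors for PyTorch.
X_train_t = torch.tensor(X_train_s, dtype=torch.float32)
X_test_t = torch.tensor(X_test_s, dtype=torch.float32)
y_train_t = torch.tensor(y_train_s, dtype=torch.float32)
y_test_t = torch.tensor(y_test_s, dtype=torch.float32)

print(X_train_t.shape, X_test_t.shape, y_train_t.shape, y_test_t.shape)
```

One consequence of this design is worth keeping in mind: a test project whose log-budget exceeds the training maximum scales to a value above 1, which a sigmoid output cannot reach.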

#### **Sections 5 & 6: ANN Training and Evaluation**
*   **Process:**
    1.  An ANN architecture with three hidden layers and dropout for regularization is defined. ReLU is used as the activation function in hidden layers, and Sigmoid is used in the output layer.
    2.  The model is trained for 200 epochs using the Adam optimizer and Mean Squared Error (MSE) as the loss function.
    3.  After training, the model makes predictions on the unseen test data.
    4.  These predictions, which are in scaled, log-transformed form, are inverse-transformed back to their original PHP currency scale.
    5.  Finally, performance is measured using R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
*   **Significance:** This is the core machine learning phase. The ANN learns the complex, non-linear relationships between the structural cost features and the final project budget. The evaluation on the test set provides an unbiased assessment of the model's real-world predictive power. The final metrics (R² of 0.7759, MAE of ~₱3.37M) quantify how successful the model is.
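
A network of the kind described could be sketched in PyTorch as follows. The class name `BudgetANN`, the layer widths, the dropout rate, the learning rate, and the random training data are all assumptions, and full-batch updates are used here for brevity even though the notebook may train in mini-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)


class BudgetANN(nn.Module):
    """Three hidden layers (ReLU + dropout) followed by a sigmoid output."""

    def __init__(self, n_features: int, hidden=(64, 32, 16), dropout: float = 0.2):
        super().__init__()
        layers = []
        in_dim = n_features
        for width in hidden:
            layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = width
        # Sigmoid keeps predictions in (0, 1), matching the min-max scaled target.
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Random placeholders for standardized features and scaled targets.
X_train = torch.randn(40, 6)
y_train = torch.rand(40, 1)
X_test = torch.randn(10, 6)

model = BudgetANN(n_features=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

losses = []
model.train()  # enables dropout
for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

model.eval()  # disables dropout for inference
with torch.no_grad():
    preds = model(X_test)

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Switching to `model.eval()` before predicting matters because dropout would otherwise randomly zero activations at inference time.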

---

### **Part 2: Analysis of Visualizations and Model Performance**

#### **1. Effect of Log-Transformation on Target Variable**

*   **Technical Explanation:** The left histogram shows the distribution of the original project budgets. The right histogram shows the distribution after applying a log-transformation (`log(1 + Budget)`).
*   **Interpretation and Insights:**
    *   The original budget distribution is heavily **right-skewed**. There is a large concentration of projects in the lower-budget range (under ₱20 million) and a long tail of fewer, but much more expensive, projects. This skewness can cause a model to be biased towards predicting lower values and perform poorly on high-value projects.
    *   The log-transformed distribution on the right is much more **symmetric and bell-shaped**, resembling a normal distribution. This transformation stabilizes the target variable, making the underlying patterns easier for the neural network to learn and reducing the influence of extreme outliers. This plot visually confirms why log-transformation (Section 1.1 of the report) is a critical preprocessing step.
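
The effect can be reproduced on synthetic data. The sketch below draws right-skewed "budgets" from a log-normal distribution, which is an assumption chosen only because it mimics the shape described, and compares sample skewness before and after `np.log1p`.

```python
import numpy as np

# Synthetic right-skewed budgets (log-normal is an illustrative assumption).
rng = np.random.default_rng(0)
budgets = rng.lognormal(mean=16.0, sigma=1.0, size=500)


def skewness(values: np.ndarray) -> float:
    """Sample skewness: third central moment over the variance to the power 1.5."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


raw_skew = skewness(budgets)
log_skew = skewness(np.log1p(budgets))

print(f"skewness of raw budgets:   {raw_skew:.2f}")
print(f"skewness of log1p budgets: {log_skew:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, while a large positive value indicates a long right tail.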

#### **2. Feature Scales Before Standardization**

*   **Technical Explanation:** This box plot displays the distribution and range of each raw input feature before scaling. The x-axis is on a logarithmic scale to accommodate the vast differences in magnitude between features.
*   **Interpretation and Insights:**
    *   The plot dramatically illustrates the problem of **varying scales**. Features like `Grade_60_Steel_Est_Cost` and `Formworks_Est_Cost` have values in the millions, whereas `Num_Storeys` and `Num_Classrooms` are in single or double digits.
    *   Without standardization, the features with larger values would completely dominate the model's learning process. The network's weights would primarily adjust to minimize the error from these large-scale features, effectively ignoring the predictive information in smaller-scale features like `Num_Storeys`.
    *   This visualization provides a clear justification for using `StandardScaler` (Section 1.3 of the report) to ensure all features are on a comparable scale, allowing the model to learn their true importance.

#### **3. Correlation Matrix of Final Engineered Features**

*   **Technical Explanation:** This heatmap shows the Pearson correlation coefficient between all the final input features and the `Budget`. Bright yellow indicates a strong positive correlation (+1), while dark purple indicates a strong negative correlation (-1).
*   **Interpretation and Insights:**
    *   **Strong Predictors:** `Grade_60_Steel_Est_Cost` (0.73), `Floor_Finish_Est_Cost` (0.67), and `Formworks_Est_Cost` (0.66) are the strongest individual predictors of the final `Budget`. This is logical, as steel, flooring, and formworks are major cost drivers in structural work.
    *   **Weak Predictor:** The `Total_Concrete_Est_Cost` feature has a very weak negative correlation (-0.02) with the `Budget`. This is a significant finding and suggests an underlying issue with the concrete data or its calculation, as one would expect this to be a strong positive predictor. This feature may be adding noise rather than signal to the model.
    *   **Multicollinearity:** `Num_Storeys` and `Num_Classrooms` have a strong positive correlation (0.75), which is expected because both describe project scale. The model appears to tolerate this, but the two features carry largely overlapping information.
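
A correlation matrix like this one can be computed with `DataFrame.corr`, which uses the Pearson coefficient by default. The data below is synthetic and the relationships built into it are assumptions for illustration; it does not reproduce the values reported above.

```python
import numpy as np
import pandas as pd

# Synthetic features: steel cost drives the budget, classrooms scale with storeys.
rng = np.random.default_rng(1)
n = 200
steel = rng.uniform(1e6, 8e6, n)
storeys = rng.integers(1, 5, n)
classrooms = storeys * rng.integers(2, 5, n)
budget = 2.5 * steel + rng.normal(0, 2e6, n)

features = pd.DataFrame({
    "Grade_60_Steel_Est_Cost": steel,
    "Num_Storeys": storeys,
    "Num_Classrooms": classrooms,
    "Budget": budget,
})

# Pairwise Pearson correlations between all columns.
corr = features.corr()

# Rank the input features by their correlation with the target.
budget_corr = corr["Budget"].drop("Budget").sort_values(ascending=False)
print(budget_corr.round(2))
print(f"Storeys vs classrooms: {corr.loc['Num_Storeys', 'Num_Classrooms']:.2f}")
```

Pearson correlation only captures linear association, so a feature with a near-zero coefficient can still be useful to a non-linear model, or can point to a data problem worth inspecting.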

#### **4. Structural Model Training Loss Over Epochs**

*   **Technical Explanation:** This plot tracks the Mean Squared Error (MSE) loss on the training data across 200 training epochs.
*   **Interpretation and Insights:**
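    *   **Successful Convergence:** The loss decreases sharply in the first ~25 epochs, indicating rapid learning. Afterward, it continues to decrease more slowly and begins to plateau, showing that optimization has largely converged on the training data. Because the plot tracks training loss only, it does not by itself show how well the model generalizes; that is assessed on the test set.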
    *   **Stable Training:** The training curve is relatively smooth, without large, erratic spikes. This suggests that the learning rate and batch size were appropriate, leading to a stable optimization process. The final loss value is low (around 0.003-0.004), confirming that the model has effectively learned the patterns in the training data.

#### **5. Actual vs. Predicted Project Budget (Structural Model)**

*   **Technical Explanation:** This scatter plot compares the model's predicted budget values (Y-axis) against the actual project budgets (X-axis) for the unseen test data. The red dashed line represents a perfect prediction.
*   **Interpretation and Insights:**
    *   **Good Predictive Performance:** The data points generally cluster around the "Perfect Fit" line, visually confirming the strong **R-squared value of 0.7759**. This means the model can explain approximately 77.6% of the variability in project budgets, which is a solid result.
    *   **Error Magnitude:** While the model is good, it is not perfect. The vertical distance of points from the red line represents the prediction error. The **Mean Absolute Error (MAE) of ₱3,366,374.26** indicates that, on average, a prediction is off by about ₱3.37 million. The **Root Mean Squared Error (RMSE) of ₱6,136,331.24**, which penalizes larger errors more heavily, suggests there are some predictions with larger deviations.
    *   **Potential for Improvement:** The plot shows some variance, particularly for projects with budgets over ₱20 million. The weak predictive power of the `Total_Concrete_Est_Cost` feature, as seen in the correlation matrix, is a likely contributor to this error. Improving that feature would likely tighten the cluster of points around the line and improve all performance metrics.
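
For reference, the inverse transformation and the three metrics discussed above can be written out directly. The sketch below uses NumPy with hand-picked numbers; the budgets, the scaled predictions, and the min/max bounds are hypothetical, and the helper names `to_scaled` and `to_php` are introduced only for this example.

```python
import numpy as np

# Hypothetical bounds of the log-transformed training budgets.
log_min = np.log1p(2_000_000.0)
log_max = np.log1p(40_000_000.0)


def to_scaled(budget_php: np.ndarray) -> np.ndarray:
    """Forward transform: log1p, then min-max scale to [0, 1]."""
    return (np.log1p(budget_php) - log_min) / (log_max - log_min)


def to_php(scaled: np.ndarray) -> np.ndarray:
    """Inverse transform: undo min-max scaling, then undo log1p."""
    return np.expm1(scaled * (log_max - log_min) + log_min)


# Hypothetical actual budgets and model outputs in the scaled space.
actual = np.array([5_000_000.0, 12_000_000.0, 25_000_000.0, 8_000_000.0])
pred_scaled = np.array([0.30, 0.62, 0.80, 0.45])
predicted = to_php(pred_scaled)

errors = actual - predicted
mae = np.mean(np.abs(errors))            # average absolute error in PHP
rmse = np.sqrt(np.mean(errors**2))       # penalizes large errors more heavily
r2 = 1 - np.sum(errors**2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R^2:  {r2:.4f}")
```

RMSE is never smaller than MAE, and a wide gap between the two, as in the reported results, points to a few large individual errors rather than uniformly moderate ones.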