### **Analysis of the MEPFS Budget Prediction Model**

This document provides a technical breakdown of the process and results from the `MEMFS_training.ipynb` notebook. The analysis covers the data processing pipeline, the training of the Artificial Neural Network (ANN), and an interpretation of the key visualizations generated.

### **Part 1: The Code - Process and Significance**

The notebook is structured into a logical sequence of data preparation, model training, and evaluation. Each section plays a critical role in developing a reliable prediction model.

#### **Sections 1 & 2: Data Loading, Cleaning, and Merging**
*   **Process:** The script begins by loading two separate CSV files: `MEPFS Quantity Cost.csv` and `MEPFS Unit Cost.csv`. It then performs several cleaning steps:
    1.  Renames columns for clarity and standardizes project names (e.g., converting "1x5" to "1 STY 5 CLS") to ensure accurate merging.
    2.  Converts columns containing numerical data (like quantities and costs) from text to numeric types, handling potential formatting issues like commas and hyphens.
    3.  Imputes (fills in) any missing numerical values using the median of that column. The median is robust to outliers, so a few extreme values do not distort the imputed figure the way they would with the mean.
    4.  Extracts the numerical `Budget` from a text column using regular expressions.
    5.  Merges the two dataframes into a single `df_merged` based on the common `Project_Name`.
*   **Significance:** This is the foundational step. By cleaning and merging the data, we create a single, unified dataset that contains both the quantities of materials for a project and their corresponding unit costs. This allows for the creation of meaningful cost-based features, which are essential for predicting the final budget.
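
The cleaning and merging steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names `Wires_Qty`, `Wires_Unit_Cost`, and `Budget_Text`, the helper functions, and the exact regular expressions are all assumptions made for the example.

```python
import re

import pandas as pd


def standardize_name(name: str) -> str:
    """Turn a short name like '1x5' into '1 STY 5 CLS'; leave other names as-is."""
    match = re.fullmatch(r"\s*(\d+)\s*[xX]\s*(\d+)\s*", str(name))
    return f"{match.group(1)} STY {match.group(2)} CLS" if match else str(name).strip()


def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Strip thousands separators; anything unparseable (e.g. '-') becomes NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


def extract_budget(text: str) -> float:
    """Pull the first number out of a string such as 'PHP 12,500,000.00'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    return float(match.group().replace(",", "")) if match else float("nan")


# Toy stand-ins for the two CSV files (hypothetical columns and values).
quantities = pd.DataFrame({
    "Project_Name": ["1x5", "2x10", "2x8"],
    "Wires_Qty": ["1,200", "-", "2,000"],
    "Budget_Text": ["PHP 12,500,000.00", "PHP 30,000,000.00", "PHP 25,750,000.50"],
})
unit_costs = pd.DataFrame({
    "Project_Name": ["1 STY 5 CLS", "2 STY 10 CLS", "2 STY 8 CLS"],
    "Wires_Unit_Cost": ["45.50", "47.00", "46.00"],
})

# Standardize names, convert text to numbers, impute with the column median.
quantities["Project_Name"] = quantities["Project_Name"].map(standardize_name)
quantities["Wires_Qty"] = to_numeric_clean(quantities["Wires_Qty"])
quantities["Wires_Qty"] = quantities["Wires_Qty"].fillna(quantities["Wires_Qty"].median())
quantities["Budget"] = quantities["Budget_Text"].map(extract_budget)
unit_costs["Wires_Unit_Cost"] = to_numeric_clean(unit_costs["Wires_Unit_Cost"])

# Merge on the shared project name.
df_merged = quantities.merge(unit_costs, on="Project_Name", how="inner")
print(df_merged)
```

An inner join is used here so that only projects present in both tables survive; whether the notebook uses an inner or another join type is not stated in this document.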

#### **Section 3: Granular Feature Engineering & Visualization**
*   **Process:** This section transforms the raw data into more predictive features:
    1.  **Granular Cost Calculation:** It calculates the estimated cost for each individual MEPFS component (e.g., `Fire_alarm_system_Total_Cost`) by multiplying its quantity by its median unit cost.
    2.  **Contextual Features:** It extracts the `Num_Storeys` and `Num_Classrooms` from the project names to provide the model with context about the project's scale.
    3.  **Target Transformation:** The target variable, `Budget`, is log-transformed (`np.log1p`) to normalize its distribution.
    4.  **Visualization Generation:** It generates and saves key plots that are analyzed in Part 2 of this document.
*   **Significance:** This is arguably the most important section for model performance. Instead of forcing the model to learn the complex relationship between raw quantities and budget, we provide it with pre-calculated, highly relevant features (the estimated costs). Log-transforming the target variable prevents the model from being disproportionately influenced by a few extremely high-budget projects, leading to a more stable and accurate model.
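
As a rough sketch of the three feature-engineering steps, the example below works on a toy dataframe. The `_Qty` and `_Unit_Cost` column names, the `Budget_Log` column, and the regular expression are assumptions for illustration; the median unit cost is assumed here to be taken across projects.

```python
import numpy as np
import pandas as pd

# Toy merged dataframe (hypothetical values).
df = pd.DataFrame({
    "Project_Name": ["1 STY 5 CLS", "2 STY 10 CLS", "2 STY 8 CLS"],
    "Fire_alarm_system_Qty": [10.0, 24.0, 18.0],
    "Fire_alarm_system_Unit_Cost": [1500.0, 1700.0, 1600.0],
    "Budget": [12_500_000.0, 30_000_000.0, 25_750_000.0],
})

# 1. Granular cost: quantity multiplied by the median unit cost of the component.
median_unit_cost = df["Fire_alarm_system_Unit_Cost"].median()
df["Fire_alarm_system_Total_Cost"] = df["Fire_alarm_system_Qty"] * median_unit_cost

# 2. Contextual features: parse storeys and classrooms out of the project name.
scale = df["Project_Name"].str.extract(r"(\d+)\s*STY\s*(\d+)\s*CLS").astype(int)
df["Num_Storeys"] = scale[0]
df["Num_Classrooms"] = scale[1]

# 3. Target transformation: log1p compresses the long right tail of the budget.
df["Budget_Log"] = np.log1p(df["Budget"])

print(df[["Fire_alarm_system_Total_Cost", "Num_Storeys", "Num_Classrooms", "Budget_Log"]])
```

Because `np.expm1` is the exact inverse of `np.log1p`, predictions made on the log scale can later be mapped back to currency values.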

#### **Sections 4 & 5: ANN Data Preparation and Training**
*   **Process:**
    1.  The final set of input features (X) and the target variable (y) are defined.
    2.  The data is split into a training set (80%) and a testing set (20%).
    3.  The input features (X) are standardized using `StandardScaler`, which rescales them to have a mean of 0 and a standard deviation of 1.
    4.  The log-transformed target variable (y) is scaled to the range [0, 1] using `MinMaxScaler`.
    5.  All data is converted into PyTorch Tensors, the required format for the neural network.
    6.  An ANN architecture with three hidden layers and dropout for regularization is defined and trained for 200 epochs.
*   **Significance:** This section ensures the data is in the optimal format for the ANN. **Splitting** the data is crucial to evaluate the model's ability to generalize to new, unseen data. **Standardizing inputs** ensures that features with large numerical ranges do not dominate the learning process. **Min-Max scaling the target** is necessary because the model's final sigmoid activation function outputs values between 0 and 1.
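
The preparation and training steps might look roughly like the sketch below, which runs on synthetic data. The document confirms three hidden layers, dropout, a sigmoid output, MSE loss, and 200 epochs; the layer widths, dropout rate, Adam optimizer, learning rate, and full-batch training loop are assumptions chosen only to make the example concrete. Fitting the scalers on the training split alone is also an assumption, though it is the usual way to avoid leaking test information.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-ins for the engineered features and the log-transformed budget.
X = rng.uniform(1.0, 1e6, size=(50, 4))
y_log = np.log1p(X.sum(axis=1, keepdims=True))

# 80/20 split, then fit the scalers on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42
)
x_scaler = StandardScaler().fit(X_train)   # mean 0, standard deviation 1
y_scaler = MinMaxScaler().fit(y_train)     # target mapped into [0, 1]

X_train_t = torch.tensor(x_scaler.transform(X_train), dtype=torch.float32)
y_train_t = torch.tensor(y_scaler.transform(y_train), dtype=torch.float32)
X_test_t = torch.tensor(x_scaler.transform(X_test), dtype=torch.float32)

# Three hidden layers with dropout; sigmoid output matches the [0, 1] target.
model = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Full-batch training for 200 epochs, recording the loss for a later plot.
losses = []
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Switch to eval mode so dropout is disabled when predicting.
model.eval()
with torch.no_grad():
    preds_scaled = model(X_test_t)
print(preds_scaled.shape, losses[0], losses[-1])
```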

#### **Sections 6 & 7: Model Evaluation and Asset Saving**
*   **Process:**
    1.  The trained model's performance is evaluated on the unseen test data.
    2.  Predictions are made and then inverse-transformed back to their original PHP currency scale.
    3.  Performance is measured using R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
    4.  Visualizations comparing actual vs. predicted values and a residuals plot are generated and saved.
    5.  The trained model and the scalers are saved to disk.
*   **Significance:** This is the core validation step. Evaluating the model on data it has never seen before gives an honest measure of its predictive power. The final metrics (R² of 0.7845, MAE of ~₱4.54M) quantify the model's success. Saving the model (`ann_mepfs_model.pth`) and scalers is crucial for deploying the model in a real-world application.
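
To make the inverse transformation and the three metrics concrete, here is a small NumPy-only sketch. The helper names and the toy budgets are hypothetical; the notebook presumably relies on the saved scaler's own inverse transform, which this example reproduces by hand for a [0, 1] Min-Max range.

```python
import numpy as np


def inverse_to_php(y_scaled, log_min, log_max):
    """Undo Min-Max scaling to [0, 1], then undo log1p, giving values in PHP."""
    y_log = np.asarray(y_scaled, dtype=float) * (log_max - log_min) + log_min
    return np.expm1(y_log)


def regression_metrics(actual, predicted):
    """Return R-squared, MAE, and RMSE for two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }


# Toy budgets in PHP (hypothetical values).
actual = np.array([10e6, 20e6, 30e6, 40e6])
predicted = np.array([12e6, 18e6, 33e6, 30e6])

# Round trip: scale the log-budgets to [0, 1], then map them back to PHP.
log_actual = np.log1p(actual)
log_min, log_max = log_actual.min(), log_actual.max()
scaled = (log_actual - log_min) / (log_max - log_min)
recovered = inverse_to_php(scaled, log_min, log_max)

metrics = regression_metrics(actual, predicted)
print(recovered)
print(metrics)
```

In this toy case a single ₱10 million miss pulls the RMSE well above the MAE, which is the same effect the report's metrics show on a larger scale.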

---

### **Part 2: Analysis of Visualizations and Model Performance**

#### **1. Effect of Log-Transformation on Target Variable**

*   **Technical Explanation:** The left histogram shows the distribution of the original project budgets. The right histogram shows the distribution after applying a log-transformation.
*   **Interpretation and Insights:**
    *   The original budget distribution is heavily **right-skewed**, with most projects clustered at lower budget values and a long tail extending towards high-budget projects. This skewness can bias a model's learning.
    *   The log-transformed distribution on the right is much more **symmetric and bell-shaped**, resembling a normal distribution. This transformation stabilizes the target variable, making the underlying patterns easier for the neural network to learn and reducing the influence of extreme outliers. This plot visually confirms why log-transformation is a critical preprocessing step.
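
One way to put a number on the visual impression is to compare sample skewness before and after the transformation. The sketch below uses synthetic log-normally distributed budgets, not the project data, and the `skewness` helper is defined here purely for illustration.

```python
import numpy as np


def skewness(values):
    """Sample skewness: the mean of cubed z-scores (about 0 when symmetric)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Synthetic right-skewed "budgets" (assumed log-normal for the example).
rng = np.random.default_rng(0)
budgets = rng.lognormal(mean=16.5, sigma=0.8, size=500)

raw_skew = skewness(budgets)
log_skew = skewness(np.log1p(budgets))
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A clearly positive value before the transform and a value near zero afterwards corresponds to the change from a right-skewed to a roughly symmetric histogram.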

#### **2. MEPFS Feature Scales Before Standardization**

*   **Technical Explanation:** This box plot displays the distribution and range of each raw input feature before scaling. The x-axis is on a logarithmic scale to accommodate the vast differences in magnitude.
*   **Interpretation and Insights:**
    *   The plot dramatically illustrates the problem of **varying scales**. Features like `Wires_Total_Cost` have values in the millions, while `Num_Storeys` is in the single digits.
    *   Without standardization, features with larger values would dominate the model's learning process, effectively ignoring the predictive information in smaller-scale features. This visualization provides a clear justification for using `StandardScaler` to put all features on a comparable scale.

#### **3. Correlation Matrix of MEPFS Features vs. Budget**

*   **Technical Explanation:** This heatmap shows the Pearson correlation coefficient between the engineered MEPFS features and the final `Budget`. Brighter yellow indicates a strong positive correlation (+1).
*   **Interpretation and Insights:**
    *   **Strong Predictors:** `Plumbing_fixtures_Total_Cost` (0.67), `Conduits_Total_Cost` (0.66), and `Lighting_fixtures_Total_Cost` (0.64) show the strongest positive correlations with the final `Budget`. This is logical, as these are major cost components in MEPFS works.
    *   **Weakest Predictors:** `Fire_alarm_system_Total_Cost` has a very weak correlation (**0.08**). This suggests that, on its own, the cost of the fire alarm system has a poor linear relationship with the total budget in this dataset. This could be due to inconsistent data or because its cost doesn't scale predictably with project size.
    *   **Inter-Feature Correlation:** There is significant positive correlation between related items, such as `Lighting_fixtures_Total_Cost` and `Conduits_Total_Cost` (0.74), which makes sense as more lights require more conduits. Multicollinearity of this kind mainly complicates the interpretation of individual feature effects; it is generally less harmful to a neural network's predictive accuracy.
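
A correlation matrix of this kind can be computed with `DataFrame.corr`. The sketch below uses synthetic data with made-up relationships between the columns, so its coefficients will not match the values quoted above; only the column names are borrowed from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic features: conduits track lighting, fire alarm cost is unrelated.
lighting = rng.uniform(1e5, 1e6, n)
conduits = 0.5 * lighting + rng.normal(0, 1e5, n)
fire_alarm = rng.uniform(1e4, 5e4, n)
budget = 10 * lighting + 8 * conduits + rng.normal(0, 2e6, n)

df = pd.DataFrame({
    "Lighting_fixtures_Total_Cost": lighting,
    "Conduits_Total_Cost": conduits,
    "Fire_alarm_system_Total_Cost": fire_alarm,
    "Budget": budget,
})

# Pearson correlation between every pair of columns.
corr = df.corr(method="pearson")

# Rank the features by their correlation with the budget.
budget_corr = corr["Budget"].drop("Budget").sort_values(ascending=False)
print(budget_corr)
```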

#### **4. MEPFS Model Training Loss Over Epochs**
*   **Technical Explanation:** This line graph tracks the Mean Squared Error (MSE) loss on the training data across 200 training epochs.
*   **Interpretation and Insights:**
    *   **Effective Learning:** The loss curve shows a steep drop in the initial epochs (0-50), indicating the model is quickly learning the primary patterns.
    *   **Stable Convergence:** After the initial drop, the curve gradually flattens and stabilizes, suggesting the optimizer has converged. The smooth shape of the curve indicates a stable training process. Because the plot tracks training loss only, it cannot by itself rule out overfitting; the test-set metrics reported below serve that purpose.

#### **5. Actual vs. Predicted Project Budget (MEPFS Model)**
![Actual vs. Predicted Project Budget (MEPFS Model)](https://i.imgur.com/G5g2m8Q.png)
*   **Technical Explanation:** This scatter plot compares the model's final predictions on the unseen test data (Y-axis) against the true project budgets (X-axis). The red dashed line indicates a perfect prediction.
*   **Interpretation and Insights:**
    *   **Good Predictive Performance:** The data points generally follow the trend of the "Perfect Fit" line, visually confirming a strong model. This is supported by the **R-squared (R²) value of 0.7845**, meaning the model explains approximately 78.5% of the variance in project budgets.
    *   **Error Analysis:** While the model is strong, there are some noticeable prediction errors. The **Mean Absolute Error (MAE) of ₱4,540,539.50** means that, on average, a prediction deviates from the actual budget by about ₱4.54 million. The higher **Root Mean Squared Error (RMSE) of ₱7,088,214.33** suggests the presence of a few predictions with larger errors, which are being heavily penalized. One such outlier is visible in the top right, where a high-budget project was underestimated.

#### **6. Residuals Plot (MEPFS Model)**
![Residuals Plot (MEPFS Model)](https://i.imgur.com/zH3v8o3.png)
*   **Technical Explanation:** This plot shows the prediction error (Residual = Actual - Predicted) on the Y-axis against the predicted budget on the X-axis. A good model should have its residuals randomly scattered around the horizontal line at y=0.
*   **Interpretation and Insights:**
    *   **No Obvious Bias:** For the most part, the residuals are scattered randomly around the zero line, with no clear curve or pattern. This is a good sign, indicating that the model's errors are not systematic.
    *   **Potential Heteroscedasticity:** There is a slight tendency for the errors to become larger as the predicted budget increases (the points spread out more towards the right). This is common in budget prediction and suggests the model is less certain about very high-cost projects, which are also less frequent in the dataset. The presence of a few large outliers (e.g., the residual near +₱20 million) is consistent with this. Despite this, the overall random nature of the plot supports the model's validity.
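
A simple way to check the spreading pattern numerically is to compare the residual standard deviation in the lower and upper halves of the predicted range. The sketch below is an informal check on synthetic data whose noise grows with the prediction; the helper name and the data are assumptions, and it is not a formal statistical test.

```python
import numpy as np


def residual_spread_by_half(actual, predicted):
    """Return the residual std for the lower and upper halves of the predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted          # same sign convention as the plot
    order = np.argsort(predicted)           # sort points by predicted budget
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return float(low.std()), float(high.std())


# Synthetic example: the error scale is proportional to the predicted budget.
rng = np.random.default_rng(2)
predicted = rng.uniform(5e6, 50e6, 200)
actual = predicted + rng.normal(0, 0.15 * predicted)

low_std, high_std = residual_spread_by_half(actual, predicted)
print(f"low-half std: {low_std:,.0f}  high-half std: {high_std:,.0f}")
```

A noticeably larger spread in the upper half would support the heteroscedasticity reading of the plot.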