<img src="https://upload.wikimedia.org/wikipedia/commons/0/06/Imperial_College_London_new_logo.png" width="350">

### **Course:** CIVE70111 Machine Learning  
### **Task 1 — Problem Formulation**  
**Date:** 09/12/2025  

---

# **1. Task (T)**

---

## **1.1 Task 3 — Power Forecasting (AC/DC Regression)**

**Objective:**  
Predict real-time inverter output power (AC_CLEAN and DC_CLEAN).

**Target variables:**  
- AC_CLEAN  
- DC_CLEAN  

**ML problem type:**  
Supervised regression.

**Prediction horizon:**  
Real-time prediction using only current environmental and operational conditions.

---

## **1.2 Task 4 — Operating-Condition Classification (Optimal vs. Suboptimal)**

**Objective:**  
Classify each inverter sample as **Optimal (0)** or **Suboptimal (1)**.

**Target variable:**  
Operating_Condition.

**ML problem type:**  
Binary classification.

**Prediction horizon:**  
Current-state classification using real-time power and weather data.

---

## **1.3 Temporal Forecasting (Sequence Prediction using LSTM)**

**Objective:**  
Forecast DC and AC power **1 hour ahead** using past power observations.

**Target variables:**  
- DC_1hr  
- AC_1hr  

**ML problem type:**  
Long Short-Term Memory (LSTM) sequence regression.

**Prediction horizon:**  
Lookback window → 1-hour-ahead forecast.
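
The lookback-to-forecast mapping above can be sketched as a sliding-window transform. At the 15-minute sampling interval, 1 hour ahead corresponds to 4 steps. The helper name `make_windows`, the lookback of 8 steps, and the toy series are assumptions for illustration only, not values taken from the project.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Return (X, y): X holds `lookback` past values per sample,
    y is the value `horizon` steps after the last value in each window."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    X = np.stack([series[i : i + lookback] for i in range(n_samples)])
    first_target = lookback + horizon - 1
    y = series[first_target : first_target + n_samples]
    # Add a trailing feature axis: (samples, lookback, 1), the usual LSTM input layout.
    return X[..., np.newaxis], y

STEPS_PER_HOUR = 4  # 15-minute sampling -> 4 steps per hour
LOOKBACK = 8        # assumption: 2 hours of history

dc_power = np.arange(20, dtype=float)  # toy stand-in for one inverter's DC series
X, y = make_windows(dc_power, LOOKBACK, STEPS_PER_HOUR)
print(X.shape, y.shape)
```

Windows would normally be built per inverter, so that no window mixes readings from different devices.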

---

# **2. Performance (P)**

---

## **2.1 Regression Models (AC/DC Forecasting)**

### **Metrics Used**
- **RMSE (Root Mean Squared Error):** Penalises large deviations that significantly impact forecasting accuracy.  
- **MAE (Mean Absolute Error):** More interpretable; reflects typical deviation from actual power output.  
- **R² (coefficient of determination):** The proportion of variance in the actual values that the model explains.  
- **Success criterion:** Lower RMSE and MAE and a higher R² indicate better model performance.

### **Justification**
- Regression accuracy depends on how close predictions are to true values.  
- RMSE and MAE directly measure deviation between predicted and actual values, clearly reflecting model performance.  
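
As a minimal sketch, the three regression metrics can be computed directly from their definitions. The function name `regression_metrics` and the toy values are illustrative assumptions; `sklearn.metrics` offers equivalent functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot  # not defined when y_true is constant
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Toy values, not project data.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MAE are in the same units as the power signal, while R² is unitless, which makes it easier to compare across the AC and DC targets.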

### **Data Characteristics**
- Weather and power values vary throughout the day and follow daily patterns.  
- TOTAL_YIELD increases monotonically because it reflects lifetime accumulated energy.  
- Variables differ in scale (e.g., TOTAL_YIELD ~1e6 vs. AC/DC output ~1e3).

---

## **2.2 Classification Models (Operating Condition)**

### **Metrics Used**
- **F1 score:** Harmonic mean of precision and recall, so it balances false positives and false negatives; suitable for imbalanced datasets.  
- **Confusion matrix:** Visualises model prediction tendencies.  
- **Success criterion:** Higher F1 score indicates a better model.

### **Data Characteristics & Metric Justification**
- Over 70% of samples are labelled as Suboptimal → **class imbalance**.  
- A model predicting only Suboptimal would reach over 70% accuracy while never identifying an Optimal sample.  
- F1 score combines precision and recall, so both false positives and false negatives lower it; reported per class or macro-averaged, it gives a fairer evaluation than accuracy.  
- Many classifiers (e.g., distance-based or gradient-trained models) are sensitive to feature scale → use **MinMaxScaler** or **StandardScaler**.
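
The imbalance argument can be checked on a toy example. The 70/30 label split below is an illustrative assumption (the text only states "over 70%"), and `f1_for_class` is a hypothetical helper written from the F1 definition. Note that F1 computed with Suboptimal as the positive class stays high for a majority-class baseline; the per-class and macro-averaged values are what expose it. Which averaging the evaluation should use is not stated above, so the macro average here is an assumption.

```python
import numpy as np

def f1_for_class(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN), treating `positive` as the positive label."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy labels: 70 Suboptimal (1) and 30 Optimal (0).
y_true = np.array([1] * 70 + [0] * 30)
# Baseline that always predicts Suboptimal.
y_pred = np.ones_like(y_true)

accuracy = float(np.mean(y_true == y_pred))
f1_suboptimal = f1_for_class(y_true, y_pred, positive=1)
f1_optimal = f1_for_class(y_true, y_pred, positive=0)
macro_f1 = (f1_suboptimal + f1_optimal) / 2
print(accuracy, f1_suboptimal, f1_optimal, macro_f1)
```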

---

## **2.3 Temporal LSTM Forecasting**

### **Metrics Used**
- RMSE  
- MAE  
- **Success criterion:** Lower values indicate more accurate forecasts.

### **Justification**
- Both metrics directly measure the deviation between predicted and actual power output.  
- The rationale is the same as for the regression models in Section 2.1.

---

# **3. Experience (E)**

---

## **3.1 Data Description**

### **Power Data**
- DC_POWER / AC_POWER  
- DAILY_YIELD (daily cumulative energy)  
- TOTAL_YIELD (lifetime cumulative energy)  
- Operating_Condition (Optimal/Suboptimal)

### **Weather Data**
- IRRADIATION  
- AMBIENT_TEMPERATURE  
- MODULE_TEMPERATURE  

### **Most Relevant Features**
- AC and DC output strongly depend on available irradiation.  
- Ambient temperature correlates with irradiation and can influence power output.

---

## **3.2 Temporal Structure**
- Data recorded at **15-minute intervals**.  
- 34 days of power data per inverter.  
- Some 15-minute windows contain multiple Operating_Condition readings while the other variables remain constant.  
- AC, DC, and IRRADIATION drop to zero at night.  
- DAILY_YIELD resets every day and must be handled carefully.
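
One reason the DAILY_YIELD reset needs care is that a naive difference across midnight produces a large negative jump. The sketch below shows one possible handling, differencing within each calendar day. The column name `DATE_TIME`, the helper `interval_energy`, and the toy values are assumptions, as is treating the first reading of each day as a zero increment.

```python
import pandas as pd

def interval_energy(df, time_col="DATE_TIME", yield_col="DAILY_YIELD"):
    """Per-interval energy from a cumulative counter that resets daily."""
    day = df[time_col].dt.normalize()
    # Difference within each calendar day so the reset never spans two days.
    increments = df.groupby(day)[yield_col].diff()
    # First reading of each day has no predecessor; assume a zero increment.
    return increments.fillna(0.0)

# Toy data for a single inverter over two days.
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-01-01 12:00", "2020-01-01 12:15", "2020-01-01 12:30",
        "2020-01-02 12:00", "2020-01-02 12:15", "2020-01-02 12:30",
    ]),
    "DAILY_YIELD": [0.0, 5.0, 12.0, 0.0, 4.0, 9.0],
})

naive = df["DAILY_YIELD"].diff()  # shows the negative jump at the reset
energy = interval_energy(df)
print(naive.tolist())
print(energy.tolist())
```

With several inverters in one table, the grouping would also need to include the inverter identifier.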

---

## **3.3 Data Quality Considerations**
- Missing sensor readings:  
  - <10% missing → drop the affected samples  
  - >10% missing → linear interpolation  
- Weather and generation timestamps aligned before merging.  
- Irradiation and DAILY_YIELD sometimes show plateaus or gaps that cannot be corrected without additional data.  
- Some inverters have fewer observations.  
- Classification dataset suffers from class imbalance.
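
The missing-reading rule above could be sketched as follows. The helper name `handle_missing` and the toy frames are illustrative assumptions. The rule does not say what happens at exactly 10%, so this sketch assumes interpolation in that case.

```python
import numpy as np
import pandas as pd

def handle_missing(df, column, threshold=0.10):
    """Drop rows when few values are missing, otherwise interpolate linearly."""
    missing_fraction = df[column].isna().mean()
    if missing_fraction < threshold:
        return df.dropna(subset=[column])
    out = df.copy()
    # Linear interpolation treats rows as equally spaced (15-minute sampling).
    out[column] = out[column].interpolate(method="linear")
    return out

# Toy frame with 1 of 20 readings missing (5%) -> rows are dropped.
few_missing = pd.DataFrame({"IRRADIATION": [0.5] * 19 + [np.nan]})
# Toy frame with 3 of 10 readings missing (30%) -> values are interpolated.
many_missing = pd.DataFrame(
    {"IRRADIATION": [0.0, 0.1, np.nan, 0.3, np.nan, 0.5, np.nan, 0.7, 0.8, 0.9]}
)

cleaned_few = handle_missing(few_missing, "IRRADIATION")
cleaned_many = handle_missing(many_missing, "IRRADIATION")
print(len(cleaned_few), cleaned_many["IRRADIATION"].round(2).tolist())
```

Interpolation in this form fills gaps between known readings only; a gap at the very start of a series would remain missing.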

---

## **3.4 Train–Validation–Test Splitting Strategy**

### **Regression, Classification, and LSTM Forecasting**
- Use a **time-aware split** to avoid leakage:  
  - Train: days 1–25  
  - Validation: days 26–27  
  - Test: days 28–34  
- Alternatively use a **70% / 15% / 15%** chronological split or a sliding window.
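
A day-based chronological split might look like the sketch below. It reads the boundaries as 25 training days, 2 validation days, and 7 test days, which together cover the 34 days of data. The column name `DATE_TIME`, the helper `split_by_day`, and the synthetic timestamps are assumptions for illustration.

```python
import pandas as pd

def split_by_day(df, time_col="DATE_TIME", train_days=25, val_days=2):
    """Split chronologically by calendar day, counted from the first timestamp."""
    first_day = df[time_col].min().normalize()
    day_index = (df[time_col].dt.normalize() - first_day).dt.days
    train = df[day_index < train_days]
    val = df[(day_index >= train_days) & (day_index < train_days + val_days)]
    test = df[day_index >= train_days + val_days]
    return train, val, test

# Synthetic 34 days at 15-minute resolution (96 samples per day); start date is arbitrary.
timestamps = pd.date_range("2020-01-01", periods=34 * 96, freq="15min")
data = pd.DataFrame({"DATE_TIME": timestamps, "DC_POWER": 0.0})

train, val, test = split_by_day(data)
print(len(train), len(val), len(test))
```

Because every row in the training set precedes every row in the validation and test sets, no future observation can influence model fitting.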

### **Rationale**
- Solar PV data is temporally correlated.  
- Random shuffling places future samples in the training set, which leaks information and artificially inflates validation and test performance.  
