# Total Production Units for Self-Consumption

### Master in Data Science and Engineering - FEUP

#### Group 4
202107955 - Beatriz Iara Nunes Silva
\
202206252 - Inês Clotilde da Costa Neves
\
202502527 - Kirill Savin
\
202502528 - Mariana Rocha Cristino
\
202202895 - Patrícia Crespo da Silva

<div id="top"></div>

# Table of Contents

- [Introduction](#introduction)
- [Methodology of Statistical Research](#methodology-of-statistical-research)
- [Phase 1: Study Design and Data Collection](#phase-1-study-design-and-data-collection)
- [Phase 2: Research Questions](#phase-2-research-questions)
    - [**General Research Question**](#general-research-question)
    - [**Specific Research Questions**](#specific-research-questions)
- [Phase 3: Data Preparation and Exploration](#phase-3-data-preparation-and-exploration)
  - [3.1 Data Preparation](#31-data-preparation)
    - [3.1.1 Overview and Initial Import](#311-overview-and-initial-import)
    - [**3.1.2 Structure and Content Analysis**](#312-structure-and-content-analysis)
    - [**3.1.3 Variable Translation and Harmonization**](#313-variable-translation-and-harmonization)
    - [**3.1.4 Data Selection and Filtering**](#314-data-selection-and-filtering)
    - [**3.1.5 Missing Values and Duplicate Handling**](#315-missing-values-and-duplicate-handling)
    - [**3.1.6 Derived and Enriched Variables**](#316-derived-and-enriched-variables)
    - [**3.1.7 Data Type Verification and Final Structure**](#317-data-type-verification-and-final-structure)
    - [**3.1.8 Descriptive Summary and Initial Observations**](#318-descriptive-summary-and-initial-observations)
    - [**3.1.9 Conclusion of Data Preparation**](#319-conclusion-of-data-preparation)
  - [Libraries](#libraries)
  - [3.1: Prepare Data](#31-prepare-data)
    - [Data Selection](#data-selection)
    - [Data Inspection and Cleaning](#data-inspection-and-cleaning)
    - [Derived Metrics](#derived-metrics)
  - [3.2: Exploratory Data Analysis](#32-exploratory-data-analysis)
    - [**3.2.1 Correlation Analysis**](#321-correlation-analysis)
    - [**3.2.2 Distribution of Total Installed Power**](#322-distribution-of-total-installed-power)
    - [**3.2.3 Total Installed Power by Voltage Level and Season**](#323-total-installed-power-by-voltage-level-and-season)
    - [**3.2.4 Lineplot by Voltage Level: Total Installed Power per Voltage Level per Quarter**](#324-lineplot-by-voltage-level-total-installed-power-per-voltage-level-per-quarter)
    - [**3.2.5 Distribution of Total Installed Power by Voltage Level (Violin/Box Plots)**](#325-distribution-of-total-installed-power-by-voltage-level-violinbox-plots)
    - [**3.2.6 District-Level Installed Power Over Time**](#326-district-level-installed-power-over-time)
    - [**3.2.7 Quarterly Percentage Change per District**](#327-quarterly-percentage-change-per-district)
    - [**3.2.8 Distribution of Number of Installations by Voltage Level**](#328-distribution-of-number-of-installations-by-voltage-level)
    - [**3.2.9 District-level Installations vs Total Installed Power Over Time**](#329-district-level-installations-vs-total-installed-power-over-time)
    - [**3.2.10 Average Power per Installation vs Number of Installations**](#3210-average-power-per-installation-vs-number-of-installations)
    - [**3.2.11 Relationship Between Total Installed Power and Number of Installations by Voltage Level**](#3211-relationship-between-total-installed-power-and-number-of-installations-by-voltage-level)
    - [**3.2.12 Number of Installations by District and Season (Top 5 Districts)**](#3212-number-of-installations-by-district-and-season-top-5-districts)
    - [**3.2.13 Number of Installations per District Over Time**](#3213-number-of-installations-per-district-over-time)
    - [**3.2.14 Proportion of Voltage Level within Each Installed Power Range**](#3214-proportion-of-voltage-level-within-each-installed-power-range)
    - [**3.2.15 Industrialization Analysis by District**](#3215-industrialization-analysis-by-district)
    - [**3.2.16 Evolution of Installed Capacity by UPAC Type**](#3216-evolution-of-installed-capacity-by-upac-type)
    - [**3.2.17 Technology Type Analysis**](#3217-technology-type-analysis)
    - [**3.2.18 Conclusion of Exploratory Data Analysis**](#3218-conclusion-of-exploratory-data-analysis)
- [Phase 4: Inferences](#phase-4-inferences)
  - [RQ1: Compare the difference of the total installed capacity between 2023 and 2024 on the selected districts.](#rq1-compare-the-difference-of-the-total-installed-capacity-between-2023-and-2024-on-the-selected-districts)
  - [RQ2: Compare the evolution of installed capacity between 2023 and 2024 across residential and industrial UPACs to assess differences in growth patterns.](#rq2-compare-the-evolution-of-installed-capacity-between-2023-and-2024-across-residential-and-industrial-upacs-to-assess-differences-in-growth-patterns)
  - [RQ3:  Compare the total installed capacity for self-consumption across different power scales (installed capacity ranges) and seasons (winter vs. summer) in selected Portuguese districts during 2023 and 2024.](#rq3--compare-the-total-installed-capacity-for-self-consumption-across-different-power-scales-installed-capacity-ranges-and-seasons-winter-vs-summer-in-selected-portuguese-districts-during-2023-and-2024)
- [Phase 5: Formulate Conclusions](#phase-5-formulate-conclusions)
    - [**Overall Interpretation**](#overall-interpretation)
- [Phase 6: Look Back and Ahead](#phase-6-look-back-and-ahead)
    - [**Look Back — Reflections and Project Challenges**](#look-back--reflections-and-project-challenges)
    - [**Look Ahead — Future Directions and Opportunities**](#look-ahead--future-directions-and-opportunities)

---
# Introduction

Self-consumption energy systems (Total Production Units for Self-Consumption – UPACs) have become an essential part of Portugal’s ongoing energy transition, mirroring the European commitment to decentralized and renewable generation. Encouraged by supportive regulation and declining photovoltaic costs, these systems allow consumers to produce, store, and use electricity locally. In doing so, they reduce dependence on the national grid and contribute to the achievement of national decarbonization objectives. Their rapid expansion in recent years demonstrates their strategic role across both residential and industrial sectors.

However, this growth has not been uniform across the country. The installed capacity of UPACs varies considerably between districts, voltage levels, and installation sizes. Residential units, generally linked to low-voltage networks, are widespread but small in capacity, while industrial systems, connected at medium or high voltage, are fewer in number but represent a substantial share of total installed power. These differences reflect variations in regional economic activity, the availability of space, and the rate of technology adoption, highlighting the need for region-specific, data-driven analyses to inform energy policy and planning.

The present study is based on a comprehensive dataset provided by e-Redes, which includes all registered UPACs for the years 2023 and 2024. The analysis integrates descriptive statistics, exploratory data analysis, and inferential techniques to investigate three central research questions: the variation of average installed capacity per UPAC across districts; temporal changes in residential and industrial installations; and seasonal and scale-related patterns in total installed capacity. This approach allows for a thorough assessment of both spatial and temporal dynamics in the deployment of self-consumption systems across Portugal.

To ensure statistical robustness, the study employs a suite of inferential methods, including t-tests, ANOVA, Levene’s test for homogeneity of variance, and post-hoc pairwise comparisons. These provide evidence for the significance of observed differences across sectors, regions, and years. By integrating technical, temporal, and regional dimensions, the research offers a rigorous characterization of Portugal’s self-consumption landscape, revealing the dual (residential and industrial) structure of the sector.

<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

---
# Methodology of Statistical Research

The present study follows the **six-step statistical investigation method**, commonly applied in quantitative research:

1. **Ask a research question** (carried out second in this study)
2. **Design a study and collect data** (carried out first in this study)
3. **Explore the data**
4. **Draw inference**
5. **Formulate conclusions**
6. **Look back and ahead**

Although this framework traditionally begins with the formulation of a **research question**, we **reversed the first two steps**.
Because this project originated from a **data-driven perspective**, the **e-Redes dataset** was first explored to understand its variables, structure, and analytical potential.
Only then were the **research questions (RQs)** refined to ensure they were **relevant, measurable, and supported by the available data**.

After defining the questions, the methodology proceeds through:

* **Data exploration and cleaning**, to ensure consistency and accuracy;
* **Descriptive analysis**, to summarize and visualize the main patterns;
* **Inferential testing** (e.g., ANOVA, Levene’s test), to assess statistical differences across groups;
* **Interpretation and synthesis**, to draw evidence-based conclusions;
* And finally, **reflection**, evaluating methodological limitations and identifying opportunities for future research.
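
As a rough illustration of the inferential testing step listed above, the sketch below runs Levene's test and a one-way ANOVA with `scipy.stats`. The three groups are synthetic samples generated only for this example; they are assumptions and are not drawn from the e-Redes dataset.

```python
import numpy as np
from scipy import stats

# Synthetic installed-power samples (kW) for three hypothetical groups.
# These numbers are illustrative and are NOT taken from the e-Redes dataset.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100, scale=15, size=40)
group_b = rng.normal(loc=110, scale=15, size=40)
group_c = rng.normal(loc=130, scale=15, size=40)

# Levene's test: H0 = the groups have equal variances.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c, center="median")

# One-way ANOVA: H0 = the groups have equal means.
anova_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

print(f"Levene: W={levene_stat:.3f}, p={levene_p:.4f}")
print(f"ANOVA:  F={anova_stat:.3f}, p={anova_p:.4g}")
```

Levene's test is typically checked first, since a classical ANOVA assumes roughly equal variances across groups.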



<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

---
# Phase 1: Study Design and Data Collection

The dataset used in this research corresponds to the **Total Production Units for Self-Consumption (UPACs)** in Portugal, provided by **e-Redes – Distribuição de Eletricidade, S.A.**, the national electricity distribution operator responsible for managing and monitoring low-, medium-, and high-voltage networks across the country.

This dataset contains detailed information on self-consumption energy production units, including **installed capacity (kW)**, **voltage level**, **district**, **municipality**, and **temporal information** such as the quarter and year of registration. These variables allow a multidimensional exploration of how regional, technical, and temporal factors influence the deployment of renewable self-consumption systems in Portugal.

The data was obtained directly from the e-Redes open data platform:

**Dataset link:** [Total Production Units for Self-Consumption (e-Redes)](https://e-redes.opendatasoft.com/explore/dataset/8-unidades-de-producao-para-autoconsumo/information/)

The dataset covers the years **2023 and 2024**, enabling a comparative analysis of the evolution of self-consumption in Portugal during this recent two-year period.



<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

---
# Phase 2: Research Questions

The central aim of this study is to understand how **seasonal, regional, and power-level factors** influence **self-consumption energy production patterns** in Portugal between **2023 and 2024**.
The research adopts a structured hierarchy of one **general** and three **specific research questions (RQs)**, each addressing a distinct analytical dimension of the dataset.

### **General Research Question**

**RQ:** How do seasonal (winter vs. summer), regional, and power-level factors shape self-consumption energy production patterns in Portugal between 2023 and 2024?

### **Specific Research Questions**

**RQ1:** Compare the difference of the total installed capacity between 2023 and 2024 on the selected districts.
This question focuses on the **distribution of installed capacities**, evaluating how the average UPAC size varies geographically and over time.

**RQ2:** Compare the evolution of installed capacity between 2023 and 2024 across residential and industrial UPACs to assess differences in growth patterns.
This question explores the **growth trajectory** of self-consumption by UPAC type, investigating whether **residential or industrial systems** expanded more rapidly and whether this evolution differs across regions.

**RQ3:** Compare the total installed capacity for self-consumption across different power scales (installed capacity ranges) and seasons (winter vs. summer) in selected Portuguese districts during 2023 and 2024.
This question integrates **seasonality and technical scale**, examining how environmental and system-level characteristics interact to shape production capacity across Portugal.

Together, these research questions provide a **comprehensive framework** for analyzing Portugal’s recent self-consumption landscape — bridging **temporal evolution**, **regional heterogeneity**, and **technical diversity** within the renewable energy transition.



<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

---
# Phase 3: Data Preparation and Exploration

## 3.1 Data Preparation

### 3.1.1 Overview and Initial Import

The dataset used in this research was imported from the e-Redes open data platform. It contains detailed records of **Production Units for Self-Consumption (UPAC)** across Portugal, with information on geographic location, technology type, voltage level, installed capacity, number of installations, and temporal identification by quarter.

After importing the data, the dataset consisted of 121,294 observations and 16 variables. Each record represents a unique aggregation of installations for a specific parish, quarter, and voltage level.

To enhance clarity, two variables originally in Portuguese, `relacao_instalacoes_por_cpe` and `relacao_potencia_por_cpe`, were renamed to `installations_per_cpe_ratio` and `power_per_cpe_ratio`, respectively. This step ensured full linguistic consistency in English for subsequent analysis and documentation.

### **3.1.2 Structure and Content Analysis**

An initial inspection revealed that the dataset included eight categorical variables and eight numerical variables.

Categorical attributes were inspected to assess the diversity of categories and potential coding issues. For example:

* `Quarter` contained **11 unique values**, ranging from **2022T4 to 2025T2**, indicating quarterly data coverage and allowing temporal segmentation.
* `District` included **18 Portuguese administrative districts**, ensuring national representativeness.
* `Municipality` and `Parish` contained **278 and 2872 unique identifiers**, respectively, reflecting the high granularity of spatial data.
* `Technology Type` initially included **Portuguese labels** (e.g., *“Fotovoltaica”*, *“Eólica”*), requiring translation.
* `Voltage level` contained **four categories (AT, MT, BTE, BTN)** plus missing values (`NaN`), corresponding to the different grid connection levels.

These exploratory steps confirmed the internal consistency of the dataset, but also identified the need for translation, filtering, and standardization.

### **3.1.3 Variable Translation and Harmonization**

Because the dataset contained Portuguese terminology, a translation dictionary was applied to ensure interpretability in English-language analysis and reporting. The `Technology Type` variable was harmonized using the following mapping:

| Original (PT)           | Translated (EN)            |
| ----------------------- | -------------------------- |
| Solar                   | Solar                      |
| Fotovoltaica            | Photovoltaic               |
| Eólica                  | Wind                       |
| Hídrica                 | Hydro                      |
| Biogás                  | Biogas                     |
| Biomassa                | Biomass                    |
| Cogeração não renovável | Non-renewable Cogeneration |
| Não Atribuído           | Not Assigned               |

### **3.1.4 Data Selection and Filtering**

Given the research scope, the analysis focused exclusively on **the most recent and complete years (2023 and 2024)**.
Records from 2022 and 2025 were excluded to maintain temporal comparability and avoid partial-year biases, since those periods do not contain all four quarters of data.

After filtering and the cleaning steps described in Section 3.1.5, the dataset retained 88,511 valid observations, distributed across eight quarters (`2023T1–2024T4`).
This filtering step also ensured that all analyses reflect recent and stable regulatory conditions under the Portuguese self-consumption framework.
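
The completeness argument above can be checked with a short sketch like the following. It assumes quarter labels in the `YYYYTn` format described in Section 3.1.2; the list of labels is written out by hand for illustration rather than read from the dataset.

```python
import pandas as pd

# Toy quarter labels in the dataset's "YYYYTn" format (illustrative only).
quarters = pd.Series(
    ["2022T4", "2023T1", "2023T2", "2023T3", "2023T4",
     "2024T1", "2024T2", "2024T3", "2024T4", "2025T1", "2025T2"],
    name="Quarter",
)

# Split each label into a year and a quarter number.
parts = quarters.str.extract(r"^(?P<year>\d{4})T(?P<q>[1-4])$")

# Count distinct quarters per year and keep only the years with all four.
quarters_per_year = parts.groupby("year")["q"].nunique()
complete_years = quarters_per_year[quarters_per_year == 4].index.tolist()

print(quarters_per_year.to_dict())
print(complete_years)
```

On these labels, only 2023 and 2024 have four quarters, which is the reason 2022 and 2025 are excluded.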

Additionally, a subset of columns was retained to streamline analysis, focusing on those directly relevant to the research questions:

* Temporal: `Quarter`
* Spatial: `District`, `Municipality`, `Parish`
* Technical: `Technology Type`, `Voltage level`, `Installed power range (kW)`
* Quantitative: `Number of installations`, `Total installed power (kW)`, `CPEs (#)`
* Derived ratios: `installations_per_cpe_ratio`, `power_per_cpe_ratio`

This reduced the dataset to **12 key analytical variables** while still preserving interpretive richness.

### **3.1.5 Missing Values and Duplicate Handling**

A systematic inspection of missing data revealed **only two incomplete records**, corresponding to null values in non-critical fields. These were removed given their negligible proportion (<0.01%), ensuring a clean analytical base.

Similarly, duplicate records were detected and subsequently dropped. The number of duplicates was minimal, confirming the **internal integrity of e-Redes data**.
After these cleaning steps, the dataset contained **88,511 unique, non-null entries**, providing robust coverage for statistical inference.

### **3.1.6 Derived and Enriched Variables**

To enhance analytical depth, several **derived variables** were constructed from the original dataset.

#### **(a) Voltage Level Proportions per District**

Following the classification provided by the *Manual de Ligações à Rede* (e-Redes), voltage levels were grouped into functional categories reflecting user type:

| Voltage level                   | Connection criteria | Typical users                          |
| ------------------------------- | ------------------- | -------------------------------------- |
| **AT (Alta Tensão)**            | >10 MVA, 60 kV      | Large industries / heavy manufacturing |
| **MT (Média Tensão)**           | 10–30 kV, >200 kVA  | Industrial and commercial sectors      |
| **BTE (Baixa Tensão Especial)** | >41.4 kVA           | Small to medium businesses             |
| **BTN (Baixa Tensão Normal)**   | <41.4 kVA           | Residential and small commercial users |

To characterize the industrial vs. residential profile of each district, the percentage of installations by voltage level was computed:

$$
\text{District Voltage Share (\%)} = \frac{\text{Installations by Voltage Level}}{\text{Total Installations in District}} \times 100
$$

These metrics provide valuable proxies for **district-level industrialization**, with higher AT/MT shares indicating greater industrial or commercial energy use intensity.
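
A minimal worked example of the share formula is sketched below. The districts "A" and "B" and their installation counts are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy installation counts for two hypothetical districts (not real e-Redes figures).
toy = pd.DataFrame({
    "District": ["A", "A", "A", "B", "B"],
    "Voltage level": ["BTN", "BTE", "MT", "BTN", "MT"],
    "Number of installations": [90, 6, 4, 45, 5],
})

# Sum installations per district and voltage level, one column per level.
counts = toy.pivot_table(
    index="District",
    columns="Voltage level",
    values="Number of installations",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its district total and scale to percent.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(2))
```

Each row of `shares` sums to 100, so a higher MT or AT share necessarily comes at the expense of the low-voltage shares.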

#### **(b) Seasonal Classification**

A derived variable, `Season`, was added by mapping each quarter to a seasonal category:

* **Winter:** Q1 (T1) and Q4 (T4)
* **Summer:** Q2 (T2) and Q3 (T3)

This transformation allows for seasonal comparisons, helping to capture the temporal dynamics of energy production, especially in solar-dominated systems.

### **3.1.7 Data Type Verification and Final Structure**

Following cleaning and transformation, the dataset included the following key data types:

| Variable Type            | Example Variables                                            |
| ------------------------ | ------------------------------------------------------------ |
| **Categorical (object)** | `District`, `Quarter`, `Voltage level`, `Season`             |
| **Continuous (float)**   | `Total installed power (kW)`, `power_per_cpe_ratio`          |
| **Count (int)**          | `Number of installations`, `CPEs (#)`                        |
| **Derived ratios (%)**   | `District_High_Voltage_AT(%)`, `District_Low_Voltage_BTN(%)` |

The dataset’s structure now allows for both descriptive and inferential statistical analysis.

### **3.1.8 Descriptive Summary and Initial Observations**

A statistical summary of numerical variables reveals highly **skewed distributions**, typical of energy datasets dominated by small residential systems and a few large industrial installations.

| Variable                      | Mean   | Std    | Min   | Max    | Interpretation                                                       |
| ----------------------------- | ------ | ------ | ----- | ------ | -------------------------------------------------------------------- |
| `Number of installations`     | 17.48  | 58.12  | 1     | 2118   | Most parishes have few UPACs; some outliers with dense concentration |
| `Total installed power (kW)`  | 122.14 | 383.76 | 0     | 19,600 | Wide range reflecting residential vs. industrial scales              |
| `installations_per_cpe_ratio` | 0.0008 | 0.0023 | –     | 0.052  | Indicates that even in active areas, penetration remains low         |
| `power_per_cpe_ratio`         | 0.0059 | 0.0311 | –     | 3.20   | Few parishes have high self-consumption penetration                  |
| `District_Low_Voltage_BTN(%)` | 95.68% | 1.29   | 89.99 | 98.29  | Confirms overwhelming residential dominance in UPAC deployment       |
| `District_High_Voltage_AT(%)` | 0.017% | 0.017  | 0.00  | 0.075  | Industrial-scale installations remain rare                           |

These statistics illustrate the residential bias of self-consumption in Portugal, with most installations connected at the **low-voltage level (BTN)**, while **medium- and high-voltage (MT/AT)** industrial systems represent only a marginal proportion.
Nonetheless, their contribution in installed power (kW) is disproportionately large, a pattern that motivates the inferential analyses presented in later sections.

### **3.1.9 Conclusion of Data Preparation**

The data preparation phase established a clean and analytically coherent foundation for this research. The original dataset from e-Redes, containing over 121,000 observations, was systematically examined for inconsistencies, missing values, and duplicates. Only two incomplete entries and a minimal number of duplicates were removed, resulting in 88,511 unique and valid records suitable for analysis.

Key transformations included the translation and harmonization of Portuguese-language variables into English, including the “Technology Type” field and the renaming of *relacao_instalacoes_por_cpe* and *relacao_potencia_por_cpe* to `installations_per_cpe_ratio` and `power_per_cpe_ratio`. These steps ensured terminological consistency and improved interpretability for subsequent analyses.

To align the dataset with the research objectives, temporal and spatial filtering retained only complete periods (2023–2024) and relevant districts, providing a representative and comparable sample of self-consumption installations. Furthermore, derived variables were created to enrich the dataset: voltage levels were categorized by functional meaning to distinguish residential, commercial, and industrial installations, and a seasonal variable (Winter vs. Summer) was added to support temporal comparisons.

The final dataset consists of the 12 retained variables plus five derived indicators (four district-level voltage shares and `Season`), covering temporal, spatial, and technical dimensions and offering a structured framework for descriptive and inferential analysis. Overall, the data preparation stage successfully converted a raw, multilingual dataset into a clean, standardized, and enriched analytical resource, ensuring accuracy, comparability, and interpretability for the subsequent phases of the study.

---

## Libraries

In [None]:
import geopandas as gpd
import json
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize
import matplotlib.colors as mcolors
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from matplotlib import patheffects
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels
import scikit_posthocs as sp

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


## 3.1: Prepare Data

Reading the dataset:

In [None]:
df_origin = pd.read_csv('../Data/UPAC_Total_Production.csv', sep=';', decimal='.')
df_origin.head(10)

Rename the last two columns to English

In [None]:
df_origin = df_origin.rename(columns={
    "relacao_instalacoes_por_cpe": "installations_per_cpe_ratio",
    "relacao_potencia_por_cpe": "power_per_cpe_ratio"
})
print(df_origin.columns)

Dataset info

In [None]:
print("\nDataset info:")
print(df_origin.info())

Non-Numeric Columns

In [None]:
non_numeric = df_origin.select_dtypes(exclude=['number'])


for col in non_numeric.columns:
    # Skip high-cardinality columns whose unique values are too numerous to print
    if col in ('Municipality', 'Parish', 'DistrictMunicipalityParishCode'):
        continue
    unique_vals = df_origin[col].unique().tolist()
    print(f"Column: {col} — {len(unique_vals)} unique values")
    print(unique_vals)
    print("-" * 40)

Translation Dictionary Technology Type

In [None]:
tech_type_translation = {
    'Solar': 'Solar',
    'Não Atribuído': 'Not Assigned',
    'Eólica': 'Wind',
    'Biogás': 'Biogas',
    'Cogeração não renovável': 'Non-renewable Cogeneration',
    'Hídrica': 'Hydro',
    'Biomassa': 'Biomass',
    'Fotovoltaica': 'Photovoltaic'
}

df_origin['Technology Type'] = df_origin['Technology Type'].map(tech_type_translation).fillna(df_origin['Technology Type'])
print(df_origin['Technology Type'].unique())


### Data Selection

Removing unnecessary columns

In [None]:
cols_to_keep = [
    "Quarter",
    "District",
    "Municipality",
    "Parish",
    "Technology Type",
    "Voltage level",
    "Installed power range (kW)",
    "Number of installations",
    "Total installed power (kW)",
    "CPEs (#)",
    "installations_per_cpe_ratio",
    "power_per_cpe_ratio"
]

df_filtered = df_origin[cols_to_keep]
df_filtered.head(10)

Selecting years of interest

In [None]:
df_filtered = df_filtered[df_filtered['Quarter'].str.startswith(('2023', '2024'))].copy()
print(df_filtered['Quarter'].value_counts())

### Data Inspection and Cleaning

Checking for missing values

In [None]:
missing_df = pd.DataFrame({
    'Missing Values': df_filtered.isnull().sum(),
    'Percentage': (df_filtered.isnull().sum() / len(df_filtered)) * 100
})
print("\nMissing Values summary:")
display(missing_df[missing_df['Missing Values'] > 0])

In [None]:
print("Before dropping NA:", df_filtered.shape)
df_filtered = df_filtered.dropna()
print("After dropping NA:", df_filtered.shape)

Check for duplicate rows

In [None]:
duplicates = df_filtered[df_filtered.duplicated()]
print(duplicates)
print("Number of duplicates:", df_filtered.duplicated().sum())

In [None]:
print("Before dropping duplicates:", df_filtered.shape)
df_filtered = df_filtered.drop_duplicates()
print("After dropping duplicates:", df_filtered.shape)

### Derived Metrics

**Percentage of Installations by Voltage Level per District**

According to e-Redes ([Manual de Ligações à Rede](https://provedordocliente.e-redes.pt/Files/PDF/Manual-de-Ligacoes-a-Rede.pdf)):

*High Voltage (AT)*
- Companies with capacities >10 MVA, supplied at 60 kV.
- Clear proxy for heavy industry or large commercial facilities.
- Districts with a higher proportion of AT installations → more industrialized.

*Medium Voltage (MT)*
- Companies with capacities >200 kVA, voltages of 10–30 kV.
- Also indicates industrial activity or large commercial companies.
- Complements AT; districts with a higher proportion of MT → more industrialized areas, but less intense than AT.

*Low Voltage Normal (BTN) and Low Voltage Special (BTE)*
- **BTN** → residences, small shops/offices.
- **BTE** → small/medium companies (>41.4 kVA).
- Districts dominated by BTN → mainly residential areas.
- BTE is mixed, can indicate areas with small industries or commerce, but less significant than MT/AT.


In [None]:
# Make a copy
df = df_filtered.copy()

# Group by Quarter, District, and Voltage level, summing the Number of installations
grouped = (
    df.groupby(['Quarter', 'District', 'Voltage level'], as_index=False)['Number of installations'].sum()
)

# Pivot table to have Voltage levels as columns
pivot = grouped.pivot_table(
    index=['Quarter', 'District'],
    columns='Voltage level',
    values='Number of installations',
    aggfunc='sum',  # explicit; the default is 'mean' (same result here: one value per cell)
    fill_value=0
).reset_index()

# Ensure all expected voltage columns exist
for col in ['AT', 'MT', 'BTN', 'BTE']:
    if col not in pivot.columns:
        pivot[col] = 0

# Calculate total installations per row
pivot['Total'] = pivot[['AT','MT','BTN','BTE']].sum(axis=1)

# Calculate percentage per voltage level
pivot["District_High_Voltage_AT(%)"] = pivot['AT'] / pivot['Total'] * 100
pivot["District_Medium_Voltage_MT(%)"] = pivot['MT'] / pivot['Total'] * 100
pivot["District_Low_Voltage_BTN(%)"] = pivot['BTN'] / pivot['Total'] * 100
pivot["District_Low_Voltage_BTE(%)"] = pivot['BTE'] / pivot['Total'] * 100

# Select only the relevant columns
df_result = pivot[['Quarter', 'District',
                   'District_High_Voltage_AT(%)',
                   'District_Medium_Voltage_MT(%)',
                   'District_Low_Voltage_BTN(%)',
                   'District_Low_Voltage_BTE(%)']]

df_result


In [None]:
df_final = df_filtered.merge(df_result, on=['Quarter', 'District'], how='left')
df_final.head(10)
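
Because the district-level shares are merged back onto parish-level rows, it can be worth confirming that the merge neither duplicates rows nor leaves rows unmatched. The sketch below uses small stand-in frames (`left`, `right`) with invented values instead of `df_filtered` and `df_result`.

```python
import pandas as pd

# Toy stand-ins for df_filtered and df_result (values are illustrative).
left = pd.DataFrame({
    "Quarter": ["2023T1", "2023T1", "2023T2"],
    "District": ["A", "A", "B"],
    "Number of installations": [3, 5, 2],
})
right = pd.DataFrame({
    "Quarter": ["2023T1", "2023T2"],
    "District": ["A", "B"],
    "District_Low_Voltage_BTN(%)": [95.0, 97.5],
})

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand keys
# are not unique; indicator=True adds a "_merge" column showing match status.
merged = left.merge(
    right, on=["Quarter", "District"], how="left",
    validate="many_to_one", indicator=True,
)

assert len(merged) == len(left)            # no rows were duplicated
assert (merged["_merge"] == "both").all()  # every row found its district shares
```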

**Map Quarters to Seasons**

In [None]:
# Function to convert Quarter to Season
def quarter_to_season(quarter):
    if quarter.endswith('T1') or quarter.endswith('T4'):
        return 'Winter'
    elif quarter.endswith('T2') or quarter.endswith('T3'):
        return 'Summer'
    else:
        return 'Unknown'

# Apply the function to create a new Season column
df_final['Season'] = df_final['Quarter'].apply(quarter_to_season)

# Display first 10 rows
df_final.head(10)


Check final data types

In [None]:
df_final.dtypes

**Summary of numeric variables**

In [None]:
df = df_final.copy()
df.describe().T

**Descriptive Analysis**

The dataset includes 88,511 records related to energy installations.

* Number of installations shows a strong right-skew (mean = 17.48, median = 2), indicating that most areas have few installations, while a few have very high counts (max = 2,118).

* Total installed power (kW) follows a similar pattern (mean = 122.14 kW, median = 30 kW, max = 19,600 kW), suggesting a large variability and the presence of high-capacity outliers.

* CPEs (#) (supply points) also exhibits wide dispersion (mean = 45,230; std = 60,333), with a large range from 1,260 to 399,456, showing substantial heterogeneity across districts or installations.

* Installations per CPE ratio (mean = 0.00082) and power per CPE ratio (mean = 0.00589) are both very low, confirming that self-consumption installations represent a small share of total CPEs. Their high standard deviations relative to the mean indicate uneven distribution across units.

* The voltage distribution shows that installations are predominantly concentrated in Low Voltage Normal (BTN) (mean = 95.68%). Medium Voltage (MT) represents a small share (2.50%), Low Voltage Special (BTE) a slightly smaller one (1.81%), and High Voltage (AT) a negligible one (about 0.017%, as reported in the summary table of Section 3.1.8). This confirms a strong predominance of residential-scale connections over industrial or large commercial ones.
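
The mean-versus-median comparison used above to diagnose right skew can be sketched on a small invented series. The values below are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy right-skewed counts: many small values and one large outlier (illustrative only).
installations = pd.Series([1, 1, 2, 2, 2, 3, 4, 5, 8, 150])

summary = {
    "mean": installations.mean(),
    "median": installations.median(),
    "skewness": installations.skew(),  # sample skewness; > 0 means a longer right tail
}
print(summary)
```

A mean far above the median, together with positive skewness, is the same signature reported for `Number of installations` and `Total installed power (kW)`.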

<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

## 3.2: Exploratory Data Analysis

An additional geospatial dataset with the district boundaries was integrated to support the visualization of information on maps.

In [None]:
map = gpd.read_file("../Data/Portugal_Map.gpkg", layer='cont_distritos')

print(map.columns)
print(map.head())

map = map[['distrito', 'geometry']]

### **3.2.1 Correlation Analysis**

A correlation matrix was computed for all numerical variables to identify potential linear relationships. The analysis revealed several noteworthy patterns:

* **Strong negative correlations** exist between low-voltage residential installations (`District_Low_Voltage_BTN(%)`) and both medium-voltage (`MT`) and low-voltage special (`BTE`) installations (-0.87 and -0.83, respectively). This indicates that districts dominated by residential connections tend to have proportionally fewer industrial or small commercial installations.

* **Moderate positive correlations** were observed between the number of installations and the derived ratios (`installations_per_cpe_ratio` at 0.60, `power_per_cpe_ratio` at 0.69) as well as with total installed power. This is consistent with the intuition that districts or parishes with more installations also tend to have higher total installed capacity and higher per-CPE ratios.

* Most other correlations were weak (between -0.11 and 0.19), suggesting minimal linear dependence among variables such as seasonal variation or parish-level identifiers.

In [None]:
# Select only numeric columns
num_cols = df.select_dtypes(include=['number']).columns.tolist()

# Correlation matrix
corr_matrix = df[num_cols].corr(method='pearson')

# Visualization with heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Pearson Correlation between Numeric Variables')
plt.show()

**Conclusion:** These correlations provide early evidence that **voltage level composition is a key factor distinguishing residential-dominated districts from industrial ones**, and they justify stratifying later analyses by voltage level.

---

### **3.2.2 Distribution of Total Installed Power**

The distribution of *'Total Installed Power (kW)'* is highly skewed, with a long tail corresponding to a small number of very large installations, mostly at the high- and medium-voltage levels (`AT` and `MT`). A *log10 transformation* was applied to better visualize the patterns:

* After transformation, the majority of installations cluster at lower values, reflecting widespread small-scale residential systems.

* Boxplots confirm that a few extreme high-power installations (up to ~19,600 kW) exist, but the **bulk of the data remains concentrated under 100 kW**, highlighting the dominance of residential self-consumption.

* The log transformation facilitates comparison across districts and voltage levels by compressing extreme outliers while preserving relative differences.

In [None]:
# Transform Total Installed Power using log10
df['LogPower'] = np.log10(df['Total installed power (kW)'] + 1e-3)  # add small epsilon to avoid log(0)

The log transformation compresses the extreme high values, bringing them closer to the rest of the data, while preserving the relative differences. This makes the histogram and boxplot more interpretable, especially for a dataset with very skewed values and large outliers.
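
As a small worked example of this compression, the sketch below applies the same `log10(x + 1e-3)` transform to three illustrative values: a zero, the median (30 kW) and the maximum (19,600 kW) quoted in the descriptive analysis. The array itself is invented for illustration.

```python
import numpy as np

# Illustrative power values in kW: a zero, the quoted median and the quoted maximum.
power_kw = np.array([0.0, 30.0, 19_600.0])

eps = 1e-3  # same offset as in the cell above; it keeps log10(0) finite
log_power = np.log10(power_kw + eps)

# Inverse transform used for the tick labels: 10**x recovers the original
# scale up to the small offset.
recovered = 10 ** log_power - eps
print(np.round(log_power, 2))
```

One consequence of the offset is that a zero maps to -3 on the log axis, far to the left of the other values, and a tick at -3 formatted with `f"{10**tick:.0f}"` is displayed as `0`.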

In [None]:
sns.set(style='whitegrid')

# Determine min/max for axis scaling (optional, for consistency)
x_min = df['LogPower'].min()
x_max = df['LogPower'].max()

# Create figure
fig = plt.figure(figsize=(10, 6))
gs = fig.add_gridspec(2, 1, height_ratios=[4, 1], hspace=0.30)

# --- HISTOGRAM ---
ax0 = plt.subplot(gs[0])
sns.histplot(
    data=df,
    x='LogPower',
    bins=100,
    color='#2159B4',
    kde=True,
    ax=ax0
)
ax0.set_title('Distribution of Total Installed Power (UPAC)', fontsize=14, pad=15)
ax0.set_xlabel('')
ax0.set_ylabel('Frequency')

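# Match x-axis to the boxplot and convert ticks back to original scale for readability
ax0.set_xlim(x_min, x_max)
ticks = ax0.get_xticks()
ax0.set_xticks(ticks)  # fix tick positions before relabelling them
ax0.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
ax0.grid(alpha=0.3)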

# --- BOXPLOT ---
ax1 = plt.subplot(gs[1])
sns.boxplot(
    data=df,
    x='LogPower',
    color='#2159B4',
    ax=ax1
)
ax1.set_xlabel('Total Installed Power (kW)')
ax1.set_yticks([])

# Match x-axis to histogram and convert ticks back to original scale
ax1.set_xlim(x_min, x_max)
ticks = ax1.get_xticks()
ax1.set_xticks(ticks)  # fix tick positions before relabelling them
ax1.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
ax1.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

This transformation reduces the effect of extreme outliers and makes the distribution more symmetric, allowing us to clearly see patterns in the majority of data points.

**Conclusion:** Self-consumption installations in Portugal are predominantly small-scale residential systems, while a minority of industrial and commercial units contribute disproportionately to total installed power.

---

### **3.2.3 Total Installed Power by Voltage Level and Season**

Total installed power was further examined by **voltage level (AT, MT, BTN, BTE)** and **season (Winter vs. Summer)**:

* **High Voltage (AT):** Mean power ranged between ~1,500–1,600 kW, with maximum values reaching nearly 20 MW. These figures correspond to large industrial installations, consistent across Winter and Summer, with only minor seasonal fluctuations.

* **Medium Voltage (MT):** Mean power was moderately high (~360–380 kW), also reflecting industrial or large commercial installations. Seasonal differences were minimal.

* **Low Voltage (BTN and BTE):** Mean power was substantially lower (~55–64 kW), reflecting small residential or commercial installations. Distribution remained concentrated and consistent between seasons.

* **Seasonal effects:** For most voltage levels, particularly BTN and BTE, the seasonal variation in total installed power is negligible, suggesting that **self-consumption patterns are stable year-round at the residential level**. Minor shifts in AT and MT may reflect industrial operational cycles or seasonal commissioning of large installations.

In [None]:
# Descriptive statistics by Voltage Level and Season (original scale)
desc_stats = df.groupby(['Season', 'Voltage level'])['Total installed power (kW)'].describe()
print(desc_stats)

Key observations:

- **AT** (High Voltage) shows the highest mean power (~1,500–1,600 kW) and extreme maximum values (~19,600 kW), reflecting large installations.

- **MT** (Medium Voltage) also has high mean values (~360–380 kW), but lower than AT.

- **BTN** and **BTE** (Low Voltage) have much lower mean installed power (~55–64 kW).

**Crucially**, differences between Winter and Summer seem minimal for most voltage levels, except for slight variations in AT and MT due to seasonal installations or operational patterns.


We visualized the distributions using log-transformed histograms and boxplots, which confirm these patterns:

- Heavy skew and long tails for AT and MT.

- Concentrated low values for BTN and BTE.

- Seasonal shifts seem minor except in high-voltage levels.
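
One way to put a number on "minor seasonal shifts" is a Summer/Winter ratio of mean power per voltage level. The sketch below uses invented records with the notebook's column names; it is an illustration of the idea, not a result from the dataset.

```python
import pandas as pd

# Toy records; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "Season": ["Winter", "Summer", "Winter", "Summer", "Winter", "Summer"],
    "Voltage level": ["AT", "AT", "BTN", "BTN", "BTN", "BTN"],
    "Total installed power (kW)": [1500.0, 1600.0, 50.0, 60.0, 70.0, 60.0],
})

# Mean power per voltage level, one column per season.
seasonal_mean = toy.pivot_table(
    index="Voltage level",
    columns="Season",
    values="Total installed power (kW)",
    aggfunc="mean",
)

# A ratio close to 1 means little difference between seasons.
seasonal_mean["Summer/Winter"] = seasonal_mean["Summer"] / seasonal_mean["Winter"]
print(seasonal_mean)
```

Because the underlying distributions are heavily skewed, the same ratio computed with `aggfunc="median"` may be a useful cross-check.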

In [None]:
sns.set(style='whitegrid')

# Derive the year from the quarter label (first four characters, e.g. '2023T3' -> '2023')
df['Year'] = df['Quarter'].str[:4]

# Color palette for voltage levels
palette = {'AT': '#1f77b4', 'MT': '#ff7f0e', 'BTN': '#2ca02c', 'BTE': '#d62728'}

# Order of categories
voltage_levels = ['AT', 'MT', 'BTN', 'BTE']
seasons = ['Winter', 'Summer']

# Global scale (log)
x_min = df['LogPower'].min()
x_max = df['LogPower'].max()

# Figure setup: 4 columns (voltage) × 2 rows (seasons)
fig = plt.figure(figsize=(4 * len(voltage_levels), 10))
outer_gs = fig.add_gridspec(2, len(voltage_levels), hspace=0.4, wspace=0.25)

# Loop through each Season (row) and Voltage level (column)
for row, season in enumerate(seasons):
    subset_season = df[df['Season'] == season]

    for col, voltage in enumerate(voltage_levels):
        subset = subset_season[subset_season['Voltage level'] == voltage]

        # Inner gridspec for histogram + boxplot (stacked vertically)
        inner_gs = outer_gs[row, col].subgridspec(2, 1, height_ratios=[4, 1], hspace=0.25)

        # --- HISTOGRAM ---
        ax_hist = fig.add_subplot(inner_gs[0])
        sns.histplot(
            data=subset,
            x='LogPower',
            bins=50,
            color=palette[voltage],
            kde=False,
            ax=ax_hist
        )
        if row == 0:
            ax_hist.set_title(voltage, fontsize=12)
        if col == 0:
            ax_hist.set_ylabel(f'{season}\nCount')
        else:
            ax_hist.set_ylabel('')
        ax_hist.set_xlabel('')
        ax_hist.set_xlim(x_min, x_max)
        ticks = ax_hist.get_xticks()
        ax_hist.set_xticks(ticks)
        ax_hist.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
        ax_hist.grid(alpha=0.3)

        # --- BOX PLOT ---
        ax_box = fig.add_subplot(inner_gs[1])
        sns.boxplot(
            data=subset,
            x='LogPower',
            color=palette[voltage],
            ax=ax_box
        )
        ax_box.set_xlabel('Total Installed Power (kW)')
        ax_box.set_yticks([])
        ax_box.set_xlim(x_min, x_max)
        ticks = ax_box.get_xticks()
        ax_box.set_xticks(ticks)
        ax_box.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
        ax_box.grid(alpha=0.3, axis='x')

# Global title
plt.suptitle('Distribution of Total Installed Power by Voltage Level and Season (log scale)', fontsize=16, y=0.98)
plt.tight_layout()
plt.show()


**Conclusion:** Voltage level is a **strong determinant of installed power**, with residential installations clustered at low values and industrial/commercial installations dominating high-power observations. Seasonal variation appears **secondary**, impacting only the largest installations minimally.

---

### **3.2.4 Lineplot by Voltage Level: Total Installed Power per Voltage Level per Quarter**

A line plot was created to track **total installed power per voltage level across quarters**. Key observations include:

* **AT (High Voltage)** exhibits a small incremental growth from 2023T3 to 2023T4, followed by a slight upward trend over subsequent quarters. Installations are few, reflecting the limited number of very large industrial systems.

* **BTE (Low Voltage Special)** shows a gradual but slightly higher growth than AT, indicating moderate expansion in small commercial installations.

* **BTN (Low Voltage Normal)** shows an almost linear and consistent increase over time and, together with MT, accounts for most of the total installed power. Combined with its dominance in the number of installations, this confirms the weight of residential self-consumption.

* **MT (Medium Voltage)** also displays a nearly linear and substantial growth, with the **largest total installed power over time**, reflecting industrial and large commercial installations driving capacity growth.

In [None]:
# Total installed power per quarter and voltage level
df_time = df.groupby(['Quarter', 'Voltage level'], as_index=False)['Total installed power (kW)'].sum()
# df_time is already aggregated per (Quarter, Voltage level), so no second groupby is needed
df_voltage = df_time.copy()

fig9 = px.line(
    df_voltage,
    x='Quarter',
    y='Total installed power (kW)',
    color='Voltage level',
    facet_col='Voltage level',
    title='Total Installed Power per Voltage Level per Quarter',
    markers=True
)

fig9.update_layout(template='plotly_white', title_x=0.5, showlegend=False)
fig9.show()


**Conclusion:** While AT and BTE contribute less in absolute power, BTN and MT dominate total installed power and show clear, steady growth. This reinforces the **dual pattern of residential prevalence (BTN) versus industrial dominance in power magnitude (MT)**.
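
The relative weight of each voltage level can be made explicit by expressing its total as a share of the quarter's overall installed power. The following is a minimal sketch with invented totals; the column names follow the notebook, and the numbers are placeholders only.

```python
import pandas as pd

# Invented totals per quarter and voltage level.
totals = pd.DataFrame({
    "Quarter": ["2023T3"] * 4 + ["2023T4"] * 4,
    "Voltage level": ["AT", "BTE", "BTN", "MT"] * 2,
    "Total installed power (kW)": [50.0, 100.0, 400.0, 450.0, 55.0, 115.0, 430.0, 500.0],
})

# Share of each voltage level in the quarter's total installed power.
quarter_total = totals.groupby("Quarter")["Total installed power (kW)"].transform("sum")
totals["Share (%)"] = totals["Total installed power (kW)"] / quarter_total * 100

print(totals.pivot(index="Voltage level", columns="Quarter", values="Share (%)").round(1))
```

Applied to the aggregated frame used for the line plot, the same two lines would show whether the BTN and MT shares stay stable as all four levels grow.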

---

### **3.2.5 Distribution of Total Installed Power by Voltage Level (Violin/Box Plots)**

Boxplots and violin-like distributions provide insight into the **spread of installed power** per voltage level:

* **BTN:** Concentration around 0–1,000 kW, with outliers up to 3,500 kW. This indicates most residential installations are small, but a few residential units are substantially larger.

* **MT:** Concentrated between 0–4,000 kW, with a maximum near 12,000 kW. Medium-voltage installations are mostly industrial and contribute disproportionately to total power.

* **BTE:** Concentrated between 0–400 kW, with maximum near 1,200 kW. Small commercial or mixed-use installations dominate this category.

* **AT:** Sparse overall, ranging 0–2,000 kW, with extreme outliers up to 20,000 kW. This reflects very few but extremely large industrial installations.

In [None]:
voltage_levels = df['Voltage level'].unique()

fig2 = make_subplots(
    rows=1, cols=len(voltage_levels),
    shared_yaxes=False,
    subplot_titles=voltage_levels
)

for i, voltage in enumerate(voltage_levels, start=1):
    df_voltage = df[df['Voltage level'] == voltage]
    fig2.add_trace(
        go.Box(
            y=df_voltage['Total installed power (kW)'],
            name=voltage,
            boxpoints='all',
            marker_color=px.colors.qualitative.Vivid[i % len(px.colors.qualitative.Vivid)],
            line=dict(width=0),
            fillcolor='rgba(0,0,0,0)'
        ),
        row=1, col=i
    )

fig2.update_layout(
    template='plotly_white',
    title='Total Installed Power by Voltage Level',
    title_x=0.5,
    showlegend=False,
    height=500,
    width=300*len(voltage_levels)
)
fig2.show()

**Conclusion:** The distribution is **highly skewed across all voltage levels**, with BTN being numerically dominant, while MT and AT dominate in terms of power contribution. Outliers are important in high-voltage categories, highlighting the influence of a few large installations.

---

### **3.2.6 District-Level Installed Power Over Time**

A temporal bar chart of **total installed power per district per quarter** shows:

* Gradual growth across all districts, with districts initially higher in installed power continuing to grow faster, while districts with lower initial values grow at a slower pace.

* No district shows sudden, anomalous drops, suggesting **steady adoption of self-consumption systems nationwide**.


In [None]:
agg_df = df.groupby(['Quarter', 'District', 'Technology Type', 'Voltage level', 'Installed power range (kW)'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

agg_district_df = agg_df.groupby(['Quarter', 'District'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

y_max = agg_district_df['Total installed power (kW)'].max() * 1.1

fig = px.bar(
    agg_district_df,
    x='District',
    y='Total installed power (kW)',
    color='District',
    animation_frame='Quarter',
    title='Total Installed Power per District Over Time'
)

fig.update_yaxes(range=[0, y_max])
fig.show()


**Conclusion:** District-level evolution points to a **cumulative advantage effect**: districts with more established installations continue to lead growth in absolute terms, while districts with a smaller installed base grow more slowly. This highlights regional disparities in self-consumption adoption.

---

### **3.2.7 Quarterly Percentage Change per District**

A choropleth map displaying **quarterly percentage change of total installed power per district** highlights:

* Some districts experience **substantial growth between consecutive quarters**; for example, Beja shows a marked increase from 2023T3 to 2023T4.

* Growth patterns are heterogeneous: districts with smaller absolute installed power may experience higher relative increases, whereas industrially dense districts grow steadily but proportionally less.

In [None]:
map_gdf = map.to_crs(epsg=4326)

quarters = sorted(df['Quarter'].unique())

all_changes = []
for i in range(len(quarters) - 1):
    start_q, end_q = quarters[i], quarters[i + 1]
    df_start = df[df['Quarter'] == start_q].groupby('District')['Total installed power (kW)'].sum()
    df_end   = df[df['Quarter'] == end_q].groupby('District')['Total installed power (kW)'].sum()
    df_growth = ((df_end - df_start) / df_start * 100).reset_index()
    df_growth.columns = ['District', 'Perc_Change']
    df_growth['Quarter Change'] = f"{start_q} → {end_q}"
    all_changes.append(df_growth)

growth_df = pd.concat(all_changes, ignore_index=True)

map_merged = map_gdf.merge(growth_df, left_on='distrito', right_on='District', how='left')

geojson = json.loads(map_gdf.to_json())

fig = px.choropleth(
    map_merged,
    geojson=geojson,
    locations='District',
    featureidkey='properties.distrito',
    color='Perc_Change',
    animation_frame='Quarter Change',
    hover_name='District',
    hover_data={'Perc_Change': ':.2f'},
    color_continuous_scale='Viridis',
    title='Quarterly Percentage Change of Total Installed Power per District'
)

fig.update_geos(
    fitbounds="locations",
    visible=False,
    projection_type="mercator"
)
fig.update_layout(
    width=950,
    height=750,
    margin=dict(l=0, r=0, t=70, b=0),
    coloraxis_colorbar=dict(title='Percentage Change (%)')
)

fig.show()


**Conclusion:** **Quarterly dynamics vary across regions**, with both absolute and relative growth providing complementary insights. Rapid percentage changes in smaller districts suggest emerging self-consumption markets.
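
The quarter-to-quarter percentage change, (end - start) / start × 100, can also be obtained without an explicit loop. The sketch below is one possible vectorized formulation on invented district totals; it assumes the quarter labels sort chronologically as strings (as `'2023T3' < '2023T4'` does).

```python
import pandas as pd

# Invented district totals per quarter.
toy = pd.DataFrame({
    "Quarter": ["2023T3", "2023T3", "2023T4", "2023T4"],
    "District": ["Beja", "Porto", "Beja", "Porto"],
    "Total installed power (kW)": [100.0, 1000.0, 150.0, 1050.0],
})

# Rows = quarters (sorted), columns = districts.
wide = toy.pivot_table(
    index="Quarter",
    columns="District",
    values="Total installed power (kW)",
    aggfunc="sum",
).sort_index()

# Percentage change between consecutive quarters; the first row has no predecessor.
pct_change = wide.pct_change(fill_method=None).iloc[1:] * 100
print(pct_change)
```

In this toy case both districts gain 50 kW, yet the district with the smaller base shows the larger relative change, which is the pattern described above for smaller districts.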

---

### **3.2.8 Distribution of Number of Installations by Voltage Level**

To better understand how the *'Number of Installations'* varies across different Voltage Levels, we first computed descriptive statistics:


In [None]:
# Descriptive statistics for 'Number of Installations' by Voltage Level
desc_stats_installations = df.groupby('Voltage level')['Number of installations'].describe()
print(desc_stats_installations)

- **AT** (High Voltage): Very few installations, mostly 1 per unit, with a maximum of 2.

- **MT** (Medium Voltage): Low number of installations, median = 1, maximum = 58.

- **BTN** (Low Voltage Normal): Largest number of installations, highly skewed distribution with some extreme outliers (max = 2,118).

- **BTE** (Low Voltage Special): Mostly small numbers of installations, median = 1, maximum = 19.

These statistics highlight a highly skewed distribution, especially for BTN, where a few units account for a disproportionately large number of installations.
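
A simple concentration measure makes the "few units, many installations" statement concrete: the share of all installations held by the top 1% of records. The counts below are invented, and the 1% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented record-level counts for one voltage level.
counts = pd.Series([1] * 90 + [10] * 9 + [2118])

# Share of all installations held by the top 1% of records.
n_top = max(1, int(len(counts) * 0.01))
top_share = counts.nlargest(n_top).sum() / counts.sum()
print(f"Top {n_top} record(s) hold {top_share:.1%} of installations")
```

The same computation could be run per voltage level with `df.groupby('Voltage level')['Number of installations']`, assuming the notebook's column names.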

In [None]:
sns.set(style='whitegrid')

# Log-transform Number of Installations (add small epsilon to avoid log(0))
df['LogInstallations'] = np.log10(df['Number of installations'] + 1e-3)

# Color palette for Voltage levels
palette = {'AT': '#1f77b4', 'MT': '#ff7f0e', 'BTN': '#2ca02c', 'BTE': '#d62728'}
voltage_levels = ['AT', 'MT', 'BTN', 'BTE']
n_cols = len(voltage_levels)

# Determine global min/max for consistent axis scaling
x_min = df['LogInstallations'].min()
x_max = df['LogInstallations'].max()

# Create figure with gridspec
fig = plt.figure(figsize=(4 * n_cols, 6))
gs = fig.add_gridspec(2, n_cols, height_ratios=[4, 1], hspace=0.3, wspace=0.25)

for i, voltage in enumerate(voltage_levels):
    subset = df[df['Voltage level'] == voltage]

    # --- HISTOGRAM ---
    ax_hist = fig.add_subplot(gs[0, i])
    sns.histplot(
        data=subset,
        x='LogInstallations',
        bins=50,
        color=palette[voltage],
        kde=False,
        ax=ax_hist
    )
    ax_hist.set_title(voltage, fontsize=12)
    ax_hist.set_xlabel('')
    ax_hist.set_ylabel('Count' if i == 0 else '')
    ax_hist.set_xlim(x_min, x_max)

    # Convert log-ticks back to original scale
    ticks = ax_hist.get_xticks()
    ax_hist.set_xticks(ticks)
    ax_hist.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
    ax_hist.grid(alpha=0.3, which='both')

    # --- BOX PLOT ---
    ax_box = fig.add_subplot(gs[1, i])
    sns.boxplot(
        data=subset,
        x='LogInstallations',
        color=palette[voltage],
        ax=ax_box
    )
    ax_box.set_xlabel('Number of Installations')
    ax_box.set_yticks([])
    ax_box.set_xlim(x_min, x_max)

    ticks = ax_box.get_xticks()
    ax_box.set_xticks(ticks)
    ax_box.set_xticklabels([f"{10**tick:.0f}" for tick in ticks])
    ax_box.grid(alpha=0.3, axis='x')

plt.suptitle('Distribution of Number of Installations by Voltage Level (log scale)', fontsize=15, y=1.02)
plt.tight_layout()
plt.show()


**Conclusion:** The **number of installations and installed power are not proportional**: residential installations dominate in count (BTN), but industrial installations (MT and AT) contribute disproportionately to total installed power. This reinforces the importance of analyzing **both metrics** for a complete picture of self-consumption patterns.

---

### **3.2.9 District-level Installations vs Total Installed Power Over Time**

A scatter plot of **number of installations vs total installed power per district across quarters** shows:

* **Increasing trends over time** in both the number of installations and installed power.
* Districts like **Porto, Braga, and Lisboa** consistently show the **highest number of installations**.
* The trend confirms that districts with larger initial adoption continue to lead growth, consistent with prior temporal analyses.

In [None]:
agg_df = df.groupby(['Quarter', 'District', 'Technology Type', 'Voltage level', 'Installed power range (kW)'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

agg_district_df = agg_df.groupby(['Quarter', 'District'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

# Get global axis limits
x_min = agg_district_df['Number of installations'].min() * 0.9
x_max = agg_district_df['Number of installations'].max() * 1.1
y_min = agg_district_df['Total installed power (kW)'].min() * 0.9
y_max = agg_district_df['Total installed power (kW)'].max() * 1.1

fig = px.scatter(
    agg_district_df,
    x='Number of installations',
    y='Total installed power (kW)',
    color='District',
    size='Total installed power (kW)',
    hover_name='District',
    animation_frame='Quarter',
    title='District-level Installations vs Power Over Time'
)

# Fix axes to always show all bubbles
fig.update_xaxes(range=[x_min, x_max])
fig.update_yaxes(range=[y_min, y_max])

fig.show()


**Conclusion:** There is a **positive association between the number of installations and total installed power**, with larger urban districts driving the majority of installations.

---

### **3.2.10 Average Power per Installation vs Number of Installations**

This analysis compares the **average power per installation** with the **number of installations**:

* Districts with **high numbers of installations** (Setúbal, Faro, Lisboa, Porto) also maintain moderate to high average power per installation.
* Over time, the average power per installation remains relatively stable, while the number of installations increases.
* The heatmap, which averages the number of installations over the records of each district and quarter, shows Setúbal with the **highest average number of installations per record** in every quarter, followed by Faro, Lisboa, and Porto.

In [None]:
# Prepare data
agg_df = df.groupby(['Quarter', 'District'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})
agg_df['Average power per installation'] = (
    agg_df['Total installed power (kW)'] / agg_df['Number of installations']
)

x_min = agg_df['Number of installations'].min() * 0.9
x_max = agg_df['Number of installations'].max() * 1.1
y_min = agg_df['Average power per installation'].min() * 0.9
y_max = agg_df['Average power per installation'].max() * 1.1

fig_scatter = px.scatter(
    agg_df,
    x='Number of installations',
    y='Average power per installation',
    color='District',
    size='Number of installations',
    hover_name='District',
    animation_frame='Quarter',
    title='Average Power per Installation vs Number of Installations',
    size_max=40
)
fig_scatter.update_xaxes(range=[x_min, x_max])
fig_scatter.update_yaxes(range=[y_min, y_max])
fig_scatter.update_layout(
    height=500,
    margin=dict(l=50, r=50, t=80, b=50)
)

In [None]:
# Prepare heatmap
average_production = df.groupby(['Quarter', 'District'])['Number of installations'].mean().reset_index()
pivot = average_production.pivot(index='District', columns='Quarter', values='Number of installations')

fig_heatmap = go.Figure(data=go.Heatmap(
    z=pivot.values,
    x=pivot.columns,
    y=pivot.index,
    colorscale='YlGnBu',
    text=pivot.values,
    texttemplate="%{text:.2f}",
    colorbar=dict(title='Average Number of Installations'),
    xgap=1,
    ygap=1
))
fig_heatmap.update_layout(
    title='Average Number of Installations per District Over Time',
    xaxis_title='Quarter',
    yaxis_title='District',
    yaxis=dict(autorange='reversed'),
    height=500,
    margin=dict(l=80, r=40, t=60, b=50)
)

**Conclusion:** High installation density does not necessarily reduce average power per installation; large residential or commercial uptake sustains substantial installed capacity.

---

### **3.2.11 Relationship Between Total Installed Power and Number of Installations by Voltage Level**

We start by exploring the overall relationship between Total Installed Power and Number of Installations.

When we plot Log(Number of Installations) against Log(Total Installed Power) without distinguishing voltage levels, the scatter plot shows a general positive trend but also suggests the presence of distinct clusters. These clusters appear as roughly vertical “stripes” along the x-axis, indicating that different Voltage Levels may be driving separate groups of points.

This observation motivates a second visualization in which points are colored according to Voltage Level, allowing us to confirm whether the clusters correspond to different voltage categories and to better understand the relationship within each group.

In [None]:
sns.set(style='whitegrid')
plt.figure(figsize=(8,6))

palette = {'AT': '#1f77b4', 'MT': '#ff7f0e', 'BTN': '#2ca02c', 'BTE': '#d62728'}

# Scatter plot
scatter = sns.scatterplot(
    x='LogPower',
    y='LogInstallations',
    hue='Voltage level',
    data=df,
    palette=palette,
    alpha=0.6
)

# Regression lines and coefficients
for voltage, color in palette.items():
    subset = df[df['Voltage level'] == voltage]
    X = sm.add_constant(subset['LogPower'])
    y = subset['LogInstallations']
    model = sm.OLS(y, X).fit()
    a, b = model.params
    print(f"{voltage} regression: Intercept = {a:.3f}, Slope = {b:.3f}")

    # Plot regression line (without label)
    x_vals = subset['LogPower']
    y_vals = a + b * x_vals
    plt.plot(x_vals, y_vals, color=color)

plt.xlabel('Log(Total Installed Power)')
plt.ylabel('Log(Number of Installations)')
plt.title('Scatter Plot with Regression Lines by Voltage Level')

# Move legend outside
plt.legend(title='Voltage Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This visualization confirms that the apparent clusters in the overall plot are indeed due to different Voltage Levels.

The regression coefficients highlight distinct patterns:

- AT (High Voltage) installations show almost no increase in the number of installations with increasing power, indicating very few, large installations;

- MT (Medium Voltage) and BTE (Low Voltage Special) exhibit moderate positive relationships;

- BTN (Low Voltage Normal) shows a strong positive relationship, meaning that as total installed power increases, the number of installations grows significantly.

**Conclusion:** These results indicate that voltage level is a key factor in the relationship between total installed power and the number of installations; analyzing the data without accounting for voltage would mask these differences.
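
Because both axes are on a log10 scale, the fitted slope can be read as an elasticity: a slope of b means that a 1% increase in installed power is associated with roughly a b% increase in the number of installations. The sketch below illustrates this on synthetic data that follow an exact power law; it uses `numpy.polyfit` instead of the `statsmodels` OLS fit from the cell above, purely to keep the example small.

```python
import numpy as np

# Synthetic data following installations = 0.5 * power**0.8 exactly.
power = np.array([1.0, 10.0, 100.0, 1000.0])
installations = 0.5 * power ** 0.8

# Fit log10(installations) = a + b * log10(power); polyfit returns [b, a].
b, a = np.polyfit(np.log10(power), np.log10(installations), deg=1)
print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
```

The fit recovers the exponent (0.8) as the slope and log10 of the multiplier (0.5) as the intercept, which is how the per-voltage coefficients printed above can be interpreted.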

---

### **3.2.12 Number of Installations by District and Season (Top 5 Districts)** 

- TODO: also mirror the percentage differences as bars (example in vdp)

Stacked bar plots for 2023 and 2024:

* **Top districts:** Setúbal, Porto, Lisboa, Braga, and Aveiro.
* Seasonal comparison shows similar distributions between **Summer and Winter**, with a slight overall increase from 2023 to 2024.

In [None]:
sns.set_theme(style='whitegrid')
# Function to prepare stacked bar data by Season
def prepare_data_season(df, year):
    df_year = df[df['Quarter'].str.startswith(str(year))]
    # Top 5 districts with most installations
    top_districts = df_year.groupby('District')['Number of installations'].sum().sort_values(ascending=False).head(5).index

    # Prepare data for stacked bar
    stack_data = df_year[df_year['District'].isin(top_districts)]
    stack_data = stack_data.groupby(['District', 'Season'])['Number of installations'].sum().unstack(fill_value=0)

    seasons_order = ['Summer', 'Winter']  # fixed column order for the stacked bars
    stack_data = stack_data.reindex(columns=seasons_order, fill_value=0)

    return stack_data, seasons_order

colors = ["#f15e47", "#5e83f4"]

stack_2023, seasons_2023 = prepare_data_season(df_final, 2023)
stack_2024, seasons_2024 = prepare_data_season(df_final, 2024)

fig, axes = plt.subplots(ncols=2, figsize=(16, 6), sharey=True)

for ax, stack_data, year, seasons_order in zip(axes, [stack_2023, stack_2024], [2023, 2024], [seasons_2023, seasons_2024]):
    left = pd.Series([0]*len(stack_data), index=stack_data.index)
    for i, season in enumerate(seasons_order):
        ax.barh(stack_data.index, stack_data[season], left=left, color=colors[i], label=season, height=0.5)

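        # Label each segment at its midpoint; `left` is indexed by district name,
        # so positional access must go through .iloc
        for idx, value in enumerate(stack_data[season]):
            if value > 0:
                ax.text(left.iloc[idx] + value/2, idx, str(value), va='center', ha='center', color='white', fontsize=9)
        left += stack_data[season]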

    ax.set_xlabel('Number of Installations')
    ax.set_title(f'Number of Installations by District (Top 5) - {year}', fontsize=14)
    ax.grid(alpha=0.3, axis='x')

axes[1].legend(title='Season')
plt.tight_layout()
plt.show()


**Conclusion:** Seasonal variations are minimal; **annual growth drives most changes**, particularly in top districts.

---

### **3.2.13 Number of Installations per District Over Time**

Animated bar charts confirm:

* **Increasing trend** across all districts over time, corroborating other temporal visualizations.
* Districts with higher initial installation counts maintain their lead, highlighting spatial adoption disparities.

In [None]:
agg_df = df.groupby(['Quarter', 'District', 'Technology Type', 'Voltage level', 'Installed power range (kW)'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

agg_district_df = agg_df.groupby(['Quarter', 'District'], as_index=False).agg({
    'Number of installations': 'sum',
    'Total installed power (kW)': 'sum'
})

agg_district_df = agg_district_df.sort_values(['Quarter', 'Number of installations'], ascending=[True, False])

y_max = agg_district_df['Number of installations'].max() * 1.1

districts = agg_district_df['District'].unique()

# Generate viridis colors for each district
cmap = plt.get_cmap('viridis')
viridis_colors = [f"rgb{tuple(int(255*x) for x in cmap(i / (len(districts)-1))[:3])}" for i in range(len(districts))]

color_discrete_map = {district: viridis_colors[i] for i, district in enumerate(districts)}

fig = px.bar(
    agg_district_df,
    x='District',
    y='Number of installations',
    color='District',
    animation_frame='Quarter',
    title='Number of Installations per District Over Time',
    color_discrete_map=color_discrete_map
)

fig.update_yaxes(range=[0, y_max])
fig.show()


---

### **3.2.14 Proportion of Voltage Level within Each Installed Power Range**

Stacked bar plots of voltage-level proportion per installed power range show:

* **>1,000 kW:** dominated by AT and MT (industrial scale).
* **]0–4] kW:** almost exclusively BTN (residential).
* **]4–20.7] kW:** mostly BTN, moderate BTE, little MT.
* **[20.7–30] kW:** mix of BTN, BTE, and some MT.
* **[30–1,000] kW:** dominated by MT, moderate BTE, little BTN, almost no AT.

In [None]:
cross_tab = pd.crosstab(df['Installed power range (kW)'], df['Voltage level'])

prop_table = (cross_tab.T / cross_tab.T.sum()).T

ax = prop_table.plot(
    kind='bar',
    stacked=True,
    figsize=(10,6),
    colormap='tab10'
)

plt.title('Proportion of Voltage Level within each Installed Power Range', fontsize=14)
plt.ylabel('Proportion')
plt.xlabel('Installed Power Range (kW)')
plt.grid(alpha=0.3, axis='y')

plt.legend(
    title='Voltage Level',
    bbox_to_anchor=(1.05, 0.5),
    loc='center left',
    borderaxespad=0.
)

plt.tight_layout()
plt.show()


**Conclusion:** Voltage level effectively **stratifies installations by scale**, distinguishing residential, small commercial, and industrial capacities.
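
The row-wise proportions behind the stacked bars can also be obtained directly from `pd.crosstab` with `normalize='index'`. The sketch below uses a small made-up table (the rows and counts are illustrative and not taken from the dataset) to show that each power range then sums to 1.

```python
import pandas as pd

# Hypothetical toy records; the column names mimic the notebook's,
# but the values are invented for illustration only.
toy = pd.DataFrame({
    "Installed power range (kW)": ["]0-4]", "]0-4]", "]0-4]", ">1000", ">1000"],
    "Voltage level": ["BTN", "BTN", "BTE", "MT", "AT"],
})

# normalize="index" divides each row by its row total, so every
# power range is expressed as proportions of voltage levels.
prop_table = pd.crosstab(
    toy["Installed power range (kW)"],
    toy["Voltage level"],
    normalize="index",
)

print(prop_table.round(2))
```

Normalizing by row answers "which voltage levels make up this power range"; normalizing by column would instead answer "which power ranges make up this voltage level".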

---

### **3.2.15 Industrialization Analysis by District**

In order to assess the level of industrialization across Portuguese districts, we combined geographical data with electricity installation data, using the proportion of installations by voltage level as a proxy for industrial activity.

High-voltage (AT) and medium-voltage (MT) installations typically correspond to larger companies and industrial facilities, while low-voltage installations (BTE/BTN) mostly represent residential or small commercial areas. By weighting these percentages, we can estimate a district-level industrialization index.


To explore the spatial distribution of installations across Portugal, we created choropleth maps showing the average percentage of installations per voltage level in each district. We visualized four categories: High Voltage (AT), Medium Voltage (MT), Low Voltage Normal (BTN), and Low Voltage Special (BTE).

These maps reveal the predominant type of installations in each district:

- Districts with higher AT percentages typically host heavy industrial or large commercial facilities.

- Districts dominated by MT installations indicate medium-to-large industrial activity.

- High percentages of BTN installations correspond mainly to residential areas and small businesses.

- BTE shows the presence of small to medium companies that exceed typical residential consumption.

Overall, these maps allow a quick visual assessment of which districts are more industrialized versus predominantly residential, highlighting regional differences in electricity consumption profiles.

In [None]:
percent_vars = [
    'District_High_Voltage_AT(%)',
    'District_Medium_Voltage_MT(%)',
    'District_Low_Voltage_BTN(%)',
    'District_Low_Voltage_BTE(%)'
]

df_district_pct = df.groupby('District')[percent_vars].mean()

gdf_plot = map.merge(df_district_pct, left_on='distrito', right_on='District', how='left')
fig, axes = plt.subplots(1, 4, figsize=(24, 8))

cmap = 'OrRd'

for ax, var in zip(axes, percent_vars):
    gdf_plot.plot(
        column=var,
        cmap=cmap,
        linewidth=0.8,
        ax=ax,
        edgecolor='0.8',
        legend=True
    )
    ax.set_title(var.replace('District_', '').replace('_', ' '), fontsize=14)
    ax.axis('off')

plt.tight_layout()
plt.show()

To better highlight the more industrialized districts, we created a weighted Industrialization Index combining the percentages of AT, MT, and BTE installations (weights: AT=3, MT=2, BTE=1). This index emphasizes areas with a higher concentration of industrial or large commercial activity. The corresponding choropleth map allows a clear comparison across districts, making it easier to identify the top and bottom regions in terms of industrial activity.

In [None]:
sns.set(style='whitegrid')
percent_vars = [
    'District_High_Voltage_AT(%)',
    'District_Medium_Voltage_MT(%)',
    'District_Low_Voltage_BTE(%)'
]


df_district = df.groupby('District')[percent_vars].mean()

# weights: AT=3, MT=2, BTE=1
df_district['Industrialization_Index'] = (
    3 * df_district['District_High_Voltage_AT(%)'] +
    2 * df_district['District_Medium_Voltage_MT(%)'] +
    1 * df_district['District_Low_Voltage_BTE(%)']
)

df_district['Industrialization_Index_norm'] = 100 * df_district['Industrialization_Index'] / df_district['Industrialization_Index'].max()

gdf_plot = map.merge(df_district, left_on='distrito', right_on='District', how='left')
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_plot.plot(
    column='Industrialization_Index_norm',
    cmap='OrRd',
    linewidth=0.8,
    ax=ax,
    edgecolor='0.8',
    legend=True,
    legend_kwds={'label': "Industrialization Index (%)", 'shrink': 0.6}
)

ax.set_title('Industrialization by District (Weighted Voltage Levels)', fontsize=16)
ax.axis('off')

plt.show()


**Conclusion:** The **Industrialization Index** allows a clear separation of industrial vs residential districts, enabling targeted policy or investment decisions.
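
As a worked illustration of how the index behaves, the sketch below applies the same weights (AT=3, MT=2, BTE=1) to three hypothetical districts. The district names, the shortened column names, and the percentages are invented for the example.

```python
import pandas as pd

# Hypothetical district-level percentages (invented for illustration).
toy = pd.DataFrame(
    {
        "AT(%)": [0.2, 0.0, 0.1],
        "MT(%)": [5.0, 1.0, 3.0],
        "BTE(%)": [4.0, 2.0, 3.0],
    },
    index=["District A", "District B", "District C"],
)

# Same weights as in the text: AT=3, MT=2, BTE=1.
weights = {"AT(%)": 3, "MT(%)": 2, "BTE(%)": 1}
toy["index_raw"] = sum(w * toy[col] for col, w in weights.items())

# Rescale so that the highest-scoring district equals 100.
toy["index_norm"] = 100 * toy["index_raw"] / toy["index_raw"].max()

print(toy[["index_raw", "index_norm"]].round(1))
```

Because the normalization is relative to the maximum, a score of 50 means half the weighted share of the leading district, not that 50% of installations are industrial.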

---

### **3.2.16 Evolution of Installed Capacity by UPAC Type**

We then compared the total installed power across the Top 5 most industrialized and Top 5 most residential districts, segmented by UPAC type (Industrial vs Residential) for 2023 and 2024.

Stacked bar plots of **Industrial vs Residential UPACs** in top districts show:

* Industrial districts (Beja, Portalegre, Aveiro, Leiria, Évora) are dominated by MT and AT installations, with high total power.
* Residential districts (Setúbal, Viana do Castelo, Vila Real, Faro, Coimbra) are dominated by BTN/BTE installations, reflecting high counts but moderate power per installation.
* Annual growth is evident, particularly for residential districts between 2023 and 2024.

In [None]:
# Variables of interest
residential_vars = [
    'District_Low_Voltage_BTN(%)',
    'District_Low_Voltage_BTE(%)'
]

# Compute the mean per district
df_district_res = df.groupby('District')[residential_vars].mean().reset_index()
df_district_res['District'] = df_district_res['District'].str.title()

# Weighted residential index
df_district_res['Residential_Index'] = (
    2 * df_district_res['District_Low_Voltage_BTN(%)'] +
    0.5 * df_district_res['District_Low_Voltage_BTE(%)']
)

# Rescale so that the highest-scoring district equals 100
df_district_res['Residential_Index_norm'] = 100 * df_district_res['Residential_Index'] / df_district_res['Residential_Index'].max()

# Top 5 most residential districts
top5_res_high = df_district_res.sort_values(by='Residential_Index_norm', ascending=False).head(5)
print("Top 5 Most Residential Districts:")
print(top5_res_high[['District', 'Residential_Index_norm']])

# Top 5 least residential districts
top5_res_low = df_district_res.sort_values(by='Residential_Index_norm', ascending=True).head(5)
print("\nTop 5 Least Residential Districts:")
print(top5_res_low[['District', 'Residential_Index_norm']])


In [None]:
sns.set(style='whitegrid')
top5_industrial = ['Beja', 'Portalegre', 'Aveiro', 'Leiria', 'Évora']
top5_residential = ['Setúbal', 'Viana Do Castelo', 'Vila Real', 'Faro', 'Coimbra']

def assign_upac(row):
    if row['Voltage level'] in ['BTN', 'BTE']:
        return 'Residential'
    else:
        return 'Industrial'

df['UPAC_Type'] = df.apply(assign_upac, axis=1)

df_top_industrial = df[df['District'].str.title().isin(top5_industrial)]
df_top_residential = df[df['District'].str.title().isin(top5_residential)]

agg_industrial = df_top_industrial.groupby(['Year','District','UPAC_Type'], as_index=False)['Total installed power (kW)'].sum()
agg_residential = df_top_residential.groupby(['Year','District','UPAC_Type'], as_index=False)['Total installed power (kW)'].sum()

def plot_stacked_bar(df_agg, title):
    fig, axes = plt.subplots(1, len(df_agg['District'].unique()), figsize=(20,6), sharey=True)

    if len(df_agg['District'].unique()) == 1:
        axes = [axes]

    for ax, district in zip(axes, df_agg['District'].unique()):
        data = df_agg[df_agg['District'] == district].pivot(index='Year', columns='UPAC_Type', values='Total installed power (kW)').fillna(0)
        data.plot(kind='bar', stacked=True, ax=ax, color=['#64b864','#f15e47'])
        ax.set_title(district)
        ax.set_ylabel('Total Installed Power (kW)')
        ax.set_xlabel('Year')
        ax.legend(title='UPAC Type')
        ax.grid(alpha=0.3)

    fig.suptitle(title, fontsize=16)
    plt.tight_layout()
    plt.show()

plot_stacked_bar(agg_industrial, 'Top 5 Most Industrialized Districts')
plot_stacked_bar(agg_residential, 'Top 5 Most Residential Districts')


**Conclusion:** **UPAC type drives installed capacity patterns**—industrial areas concentrate large installations, while residential areas dominate in numbers.

---

### **3.2.17 Technology Type Analysis**

**Stacked bar plots of installation counts and total installed power by technology type** reveal:

* **Solar and Not Assigned** dominate in number of installations, mostly at BTN.
* **Wind** appears mostly at BTN, biogas at MT, photovoltaic only at AT.
* Temporal evolution shows **rapid growth of Solar**. Wind increases in 2024T2–T4, Biogas remains steady, Biomass appears from 2023T3, Hydro is present in 2023, and Photovoltaic appears in 2024. Not Assigned peaks in 2023T3, coinciding with a temporary decrease in Solar.

In [None]:
dominant_techs = ['Solar', 'Not Assigned']
df_dominant = df[df['Technology Type'].isin(dominant_techs)].copy()
df_others = df[~df['Technology Type'].isin(dominant_techs)].copy()

color_map = px.colors.qualitative.Vivid
color_dict = {tech: color_map[i % len(color_map)] for i, tech in enumerate(df['Technology Type'].unique())}

tech_counts = df.groupby(['Voltage level','Technology Type']).size().reset_index(name='Count')
dominant_df = tech_counts[tech_counts['Technology Type'].isin(dominant_techs)].copy()
others_df = tech_counts[~tech_counts['Technology Type'].isin(dominant_techs)].copy()

num_levels = df['Voltage level'].nunique()
fig_height = max(450, num_levels*60)

fig6 = make_subplots(
    rows=1, cols=2,
    shared_yaxes=True,
    column_widths=[0.6, 0.4],
    horizontal_spacing=0.05,
    subplot_titles=("Solar & Not Assigned", "Other Technologies")
)

for tech in dominant_techs:
    data = dominant_df[dominant_df['Technology Type'] == tech]
    fig6.add_trace(
        go.Bar(
            y=data['Voltage level'],
            x=data['Count'],
            name=tech,
            text=data['Count'],
            textposition='inside',
            orientation='h',
            marker=dict(color=color_dict[tech], line=dict(width=0.8, color='rgba(0,0,0,0.2)')),
            customdata=data['Technology Type'],
            hovertemplate="<b>Technology:</b> %{customdata}<br><b>Voltage Level:</b> %{y}<br><b>Installations:</b> %{x}<extra></extra>"
        ),
        row=1, col=1
    )

for tech in sorted(others_df['Technology Type'].unique()):
    data = others_df[others_df['Technology Type'] == tech]
    fig6.add_trace(
        go.Bar(
            y=data['Voltage level'],
            x=data['Count'],
            name=tech,
            text=data['Count'],
            textposition='inside',
            orientation='h',
            marker=dict(color=color_dict[tech], line=dict(width=0.8, color='rgba(0,0,0,0.2)')),
            customdata=data['Technology Type'],
            hovertemplate="<b>Technology:</b> %{customdata}<br><b>Voltage Level:</b> %{y}<br><b>Installations:</b> %{x}<extra></extra>"
        ),
        row=1, col=2
    )

fig6.update_layout(
    template='plotly_white',
    title="<b>Installations by Voltage Level and Technology Type</b>",
    barmode='stack',
    width=1300,
    height=fig_height,
    legend_title="<b>Technology Type</b>",
    margin=dict(l=120, r=60, t=90, b=60)
)
fig6.update_xaxes(title_text='Number of Installations', row=1, col=1)
fig6.update_xaxes(title_text='Number of Installations', row=1, col=2)
fig6.update_yaxes(title_text='Voltage Level', row=1, col=1)
fig6.update_yaxes(showticklabels=False, row=1, col=2)

fig6.show()


**Total Installed Power by Technology Type Over Time**

In [None]:
dominant_techs = ['Solar', 'Not Assigned']
df_time = df.groupby(['Quarter', 'Technology Type'], as_index=False)['Total installed power (kW)'].sum()
df_dom = df_time[df_time['Technology Type'].isin(dominant_techs)]
df_others = df_time[~df_time['Technology Type'].isin(dominant_techs)]


fig8 = make_subplots(
    rows=1, cols=2,
    subplot_titles=['Solar & Not Assigned', 'Other Technologies'],
    shared_yaxes=False
)

for tech in dominant_techs:
    data = df_dom[df_dom['Technology Type'] == tech]
    fig8.add_trace(
        go.Scatter(
            x=data['Quarter'],
            y=data['Total installed power (kW)'],
            mode='lines+markers',
            name=tech
        ),
        row=1, col=1
    )

for tech in df_others['Technology Type'].unique():
    data = df_others[df_others['Technology Type'] == tech]
    fig8.add_trace(
        go.Scatter(
            x=data['Quarter'],
            y=data['Total installed power (kW)'],
            mode='lines+markers',
            name=tech
        ),
        row=1, col=2
    )

fig8.update_layout(
    template='plotly_white',
    title='Total Installed Power by Technology Type Over Time',
    title_x=0.5,
    height=600,
    width=1200
)
fig8.show()

**Conclusion:** Solar dominates both in counts and capacity, confirming the **residential solar adoption trend**, while industrial technologies are concentrated in MT and AT installations.

---

### **3.2.18 Conclusion of Exploratory Data Analysis**

The exploratory analysis of Portuguese self-consumption (UPAC) data highlighted consistent patterns across voltage levels, districts, and time. Correlation analysis revealed a strong negative relationship between low-voltage residential installations (BTN) and both medium-voltage (MT) and low-voltage special (BTE) installations, suggesting that residential-dominated districts host proportionally fewer industrial or small commercial systems. Moderate positive correlations between the number of installations and total installed power indicate that districts with more installations generally have higher capacity, emphasizing voltage-level composition as a key differentiator between residential and industrial areas.

Power distribution is highly skewed: a few large industrial installations dominate high-power ranges, while most residential systems remain below 100 kW. Logarithmic transformations facilitated visualization of these disparities, confirming that residential installations dominate numerically, whereas industrial and commercial systems contribute disproportionately to total installed capacity.

Analysis by voltage level and season reinforced these patterns. High-voltage (AT) installations are few but extremely large, medium-voltage (MT) systems reflect industrial and commercial users, and low-voltage (BTN/BTE) installations are concentrated at smaller capacities. Seasonal variations are minimal, indicating stable residential self-consumption throughout the year, with voltage level being the primary determinant of installed power.

Temporal trends from 2023 to 2024 reveal steady growth. Residential systems drive increases in installation counts, while industrial units remain the largest contributors to total power. District-level dynamics show that urban areas like Porto, Braga, Lisboa, and Setúbal consistently lead in numbers, with average power per installation remaining stable despite rising counts. The cumulative advantage effect is evident, as districts with higher initial installation levels continue to expand faster than less developed ones.

Spatial analysis confirmed regional heterogeneity. Vila Real, Faro, and Setúbal are residential-dominated, while Évora, Aveiro, Beja, and Portalegre exhibit industrial profiles. Voltage levels stratify installations by scale: residential ranges (≤4 kW) are mainly BTN, small commercial ranges (4–30 kW) are BTN/BTE, and industrial scales (>30 kW) are dominated by MT and AT. A composite Industrialization Index validated this division, showing that industrial districts host fewer but larger installations, whereas residential areas have many small systems.

Technology-wise, Solar and “Not Assigned” categories dominate in number, primarily at BTN, reinforcing the residential character of self-consumption. MT-level installations are associated with Wind and Biogas, while AT-level industrial setups include Photovoltaic and Biomass. Solar adoption has grown rapidly, particularly in 2024, serving as the main driver of self-consumption expansion.

In conclusion, voltage level and UPAC type emerge as the primary determinants of both installation count and total capacity. The dual structure of self-consumption in Portugal, with numerous small residential installations alongside concentrated industrial capacity, is reinforced by clear spatial, temporal, and technological patterns. These findings provide an empirical foundation for further modeling, forecasting, and policy-oriented analysis.

---


<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

# Phase 4: Inferences

In this section, inferential analyses were conducted to statistically test the relationships and differences identified during the exploratory phase. By applying appropriate inferential methods, the goal was to validate the observed trends and address the specific research questions, ensuring the robustness of the conclusions regarding self-consumption energy production and installation patterns in Portugal between 2023 and 2024.

**Population Information**

The population considered in this analysis consists of all self-consumption electricity generation installations registered in Portugal during the years 2023 and 2024. Each record in the population includes the total installed capacity (in kilowatts), the corresponding power scale range (categorical variable indicating system size), the season of observation (winter or summer), and the district where the installation is located.

**Samples Information**

A subset of four districts — **Aveiro**, **Évora**, **Faro** and **Vila Real** — was selected to represent Portugal’s diverse geographic, climatic, and socio-economic contexts, ensuring coverage of both northern and southern, as well as coastal and inland, regions.

## RQ1: Compare the difference of the total installed capacity between 2023 and 2024 on the selected districts.


This research question examines how the total installed capacity for self-consumption evolved between 2023 and 2024 across all UPAC installations in four selected districts: Aveiro, Évora, Vila Real, and Faro. The analysis focuses on assessing whether the total installed capacity differs significantly between the two years.

**Variable Definitions:**

- $\mu_{2023}$: mean installed capacity for UPACs across the selected districts in 2023.

- $\mu_{2024}$: mean installed capacity for UPACs across the selected districts in 2024.

**Hypotheses**

For this analysis, the following null and alternative hypotheses were defined to assess the difference in total installed capacity between the two years:

* *Effect of Year (2023 vs 2024):*

$$
H_0^{\text{Year}}: \mu_{2023} = \mu_{2024} \\
H_1^{\text{Year}}: \mu_{2023} \neq \mu_{2024}
$$


**Statistical Test and Assumptions**

To test this hypothesis, a **Welch’s t-test for independent samples** was applied to compare the total installed capacity between 2023 and 2024. The following assumptions and procedures were considered:

* **Samples Drawn from the Population**

  * Stratified random samples covering at most 10% of each year’s population were drawn to support the robustness of the results.

* **Sufficiently Large Sample Size**

  * Each year's sample contained a sufficiently large number of installations, allowing the Central Limit Theorem (CLT) to apply and ensuring approximate normality of the sample mean.

* **Independence of Observations**

  * Each UPAC installation is treated as an independent observation across years.

* **Normality of Residuals**

  * The log-transformation of total installed capacity was applied to reduce skewness and improve normality.

* **Homogeneity of Variances**

  * Levene’s test was conducted to assess equality of variances between 2023 and 2024. Although it did not detect a significant difference (p ≈ 0.15), Welch’s t-test was used because it does not assume equal variances and remains valid under heteroscedasticity.

* **Robustness Assessment**

  * Stratified random sampling was repeated to verify the stability of the results across different random samples. The p-values from multiple iterations were evaluated to ensure the robustness of statistical significance.





The dataset was filtered to include only the four selected districts: Aveiro, Évora, Vila Real, and Faro. A log-transformation was applied to the total installed power to reduce skewness and improve normality.

A histogram with kernel density estimation (KDE) was plotted to compare the distribution of log-transformed total installed power between 2023 and 2024. This visualization provides a preliminary overview of potential differences in installed capacity across years.

In [None]:
df=df_final
districts = ['Aveiro','Évora','Vila Real','Faro']
df_sample = df[df['District'].isin(districts)].copy()
df_sample['Year'] = df_sample['Quarter'].str[:4].astype(int)
df_sample['log_Total_power'] = np.log(df_sample['Total installed power (kW)'])

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(
    data=df_sample,
    x='log_Total_power',
    hue='Year',
    bins=20,
    kde=True,
    alpha=0.5,
    palette='viridis'  # Try also 'mako' or 'crest'
)
plt.title("Distribution of log(Total Installed Power) by Year")
plt.xlabel("log(Total Installed Power (kW))")
plt.ylabel("Frequency")
plt.show()


The log-transformed total installed power is approximately normally distributed in both 2023 and 2024, with a slight right skew. The 2024 distribution is slightly shifted to the right, with a higher peak than 2023, indicating a higher average installed capacity.
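
One way to back this visual impression with a number is to compare the sample skewness before and after the transformation. The sketch below does so on synthetic log-normal data, which is an assumption standing in for the installed-power column; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "installed power" values (log-normal by construction).
# These are NOT the notebook's data.
rng = np.random.default_rng(0)
power_kw = rng.lognormal(mean=2.0, sigma=1.0, size=5_000)

skew_raw = stats.skew(power_kw)          # strongly positive for the raw values
skew_log = stats.skew(np.log(power_kw))  # close to 0 after the log transform

print(f"skewness raw: {skew_raw:.2f}, skewness log: {skew_log:.2f}")
```

A skewness near zero after the transformation is consistent with a roughly symmetric distribution, although it does not by itself establish normality.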

After confirming that the data are approximately normally distributed, we applied Levene's test to assess whether the variances for 2023 and 2024 are equal.

In [None]:
# From here on, the log-transformed column is referred to as 'Total_power'
df_sample = df_sample.rename(columns={"log_Total_power": "Total_power"})

# --- Levene Test for equal variances ---
stat_levene, p_levene = stats.levene(
    df_sample[df_sample['Year']==2023]['Total_power'],
    df_sample[df_sample['Year']==2024]['Total_power']
)
print(f"\nLevene Test: W={stat_levene:.3f}, p={p_levene:.3f}")

Levene's test shows no significant difference in variances between 2023 and 2024 (p ≈ 0.15), indicating that the assumption of equal variances is reasonable.

We then drew a stratified sample with the same number of installations for each year. The per-year sample size is the smaller of 10% of the filtered dataset and the size of the smallest year group.

In [None]:
# --- Stratified sample (same size per group, <=10% of the population) ---
group_sizes = df_sample.groupby(['Year']).size()
sample_size = min(int(len(df_sample)*0.1), group_sizes.min())

df_sample = (
    df_sample
    .groupby(['Year'], group_keys=False)
    .apply(lambda x: x.sample(n=sample_size, random_state=27))
)

# --- Print the size of each group ---
group_counts = df_sample.groupby(['Year']).size()
print("Size of each group after stratified sampling:\n")
print(group_counts)

We then applied an independent two-sample Welch’s t-test to compare the mean log-transformed total power between 2023 and 2024.

In [None]:
# Independent two-sample t-test
# The log-transformed column was renamed to 'Total_power' in the Levene cell above
t_stat, p_val = stats.ttest_ind(
    df_sample[df_sample['Year'] == 2023]['Total_power'],
    df_sample[df_sample['Year'] == 2024]['Total_power'],
    equal_var=False  # Welch's t-test (more robust when variances differ)
)

print(f"T-test 2023 vs 2024: t = {t_stat}, p = {p_val}")

The Welch’s t-test indicates that the mean log Total Power in 2024 is significantly higher than in 2023 (t ≈ -8.48, p < 0.05).
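
For reference, the statistic returned by `ttest_ind(..., equal_var=False)` has the standard Welch form below, written with the 2023 sample first, as in the call above. Here $\bar{x}$, $s^2$, and $n$ denote the sample mean, sample variance, and sample size of the log-transformed power in each year, and $\nu$ is the approximate degrees of freedom; this notation is introduced here for illustration.

$$
t = \frac{\bar{x}_{2023} - \bar{x}_{2024}}{\sqrt{\dfrac{s_{2023}^2}{n_{2023}} + \dfrac{s_{2024}^2}{n_{2024}}}}, \qquad
\nu \approx \frac{\left(\dfrac{s_{2023}^2}{n_{2023}} + \dfrac{s_{2024}^2}{n_{2024}}\right)^2}{\dfrac{\left(s_{2023}^2/n_{2023}\right)^2}{n_{2023}-1} + \dfrac{\left(s_{2024}^2/n_{2024}\right)^2}{n_{2024}-1}}
$$

Because the 2023 mean enters first, a negative $t$ corresponds to a higher mean in 2024.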

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(
    data=df_sample,
    x='Year',
    y='Total_power',
    ci='sd',
    hue='Year',
    dodge=False,
    palette='viridis'
)
plt.title("Mean log(Total Installed Power) by Year (error bars: SD)")
plt.xlabel("Year")
plt.ylabel("log(Total Installed Power (kW))")
plt.legend([], [], frameon=False)
plt.show()


**Multiple Random Sampling**

To assess the robustness of the results, 20 stratified samples were drawn from the population dataset, and Welch’s t-test was applied in each iteration to compare the mean log Total Power between 2023 and 2024.

In [None]:
# --- Parameters ---
n_iterations = 20
frac = 0.1
results = []

# --- Filter data ---
districts = ['Aveiro','Évora','Vila Real','Faro']
df_sample = df[df['District'].isin(districts)].copy()
df_sample['log_Total_power'] = np.log(df_sample['Total installed power (kW)'])
df_sample['Year'] = df_sample['Quarter'].str[:4].astype(int)
df_sample = df_sample[df_sample['Year'].isin([2023, 2024])]

# --- Function for stratified sampling (by year) ---
def stratified_sample(df, frac=0.1, random_state=None):
    group_sizes = df.groupby('Year').size()
    sample_size = max(1, min(int(len(df)*frac), group_sizes.min()))
    return (
        df.groupby('Year', group_keys=False)
          .apply(lambda x: x.sample(n=sample_size, random_state=random_state))
          .reset_index(drop=True)
    )

# --- Multi-random sampling loop ---
for i in range(1, n_iterations + 1):
    df_iter = stratified_sample(df_sample, frac=frac, random_state=i)

    # Levene test
    stat_levene, p_levene = stats.levene(
        df_iter[df_iter['Year']==2023]['log_Total_power'],
        df_iter[df_iter['Year']==2024]['log_Total_power']
    )

    # t-test (Welch)
    t_stat, p_val = stats.ttest_ind(
        df_iter[df_iter['Year']==2023]['log_Total_power'],
        df_iter[df_iter['Year']==2024]['log_Total_power'],
        equal_var=False
    )

    results.append({
        'Iteration': i,
        'Levene_W': stat_levene,
        'Levene_p': p_levene,
        'T_stat': t_stat,
        'p_ttest': p_val
    })

# --- Results DataFrame ---
df_results = pd.DataFrame(results)



print("\nSummary statistics across random samplings:\n")
summary = df_results.agg(['mean','std'])
print(summary)

Summary statistics across the 20 random stratified samples show consistent results. Levene’s test does not, on average, indicate unequal variances (mean p ≈ 0.24), while Welch’s t-test shows a highly significant difference between 2023 and 2024 (mean t ≈ -8.42, mean p ≈ 1.35×10⁻¹³), supporting the robustness of the observed increase in log Total Power.
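
A mean p-value can be dominated by a handful of iterations, so a complementary summary is the share of iterations that fall below the significance threshold. The sketch below illustrates the idea on synthetic data; the two populations, the effect size, and the helper name `share_significant` are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic log-power "populations" for two years (assumed shift: +1.0 in 2024).
pop_2023 = rng.normal(loc=2.0, scale=1.0, size=2_000)
pop_2024 = rng.normal(loc=3.0, scale=1.0, size=2_000)


def share_significant(a, b, n_iterations=20, n=100, alpha=0.05):
    """Repeat Welch's t-test on random subsamples.

    Returns the array of p-values and the share of them below alpha.
    """
    p_values = []
    for seed in range(n_iterations):
        sub_rng = np.random.default_rng(seed)
        sample_a = sub_rng.choice(a, size=n, replace=False)
        sample_b = sub_rng.choice(b, size=n, replace=False)
        _, p = stats.ttest_ind(sample_a, sample_b, equal_var=False)
        p_values.append(p)
    p_values = np.array(p_values)
    return p_values, float(np.mean(p_values < alpha))


p_values, share = share_significant(pop_2023, pop_2024)
print(f"share of iterations with p < 0.05: {share:.2f}")
print(f"largest p-value: {p_values.max():.2e}")
```

Reporting the share together with the largest p-value makes it explicit whether any single resample failed to reach significance.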

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(df_results['Iteration'], df_results['p_ttest'], marker='o', linestyle='-')
plt.axhline(0.05, color='red', linestyle='--', label='Significance threshold (0.05)')
plt.yscale('log')  # logarithmic y-axis
plt.title("Welch t-test p-values across Random Samplings (2023 vs 2024)")
plt.xlabel("Iteration")
plt.ylabel("p-value (log scale)")
plt.legend()
plt.tight_layout()
plt.show()


### Summary of Findings (RQ1)

1. **Year Effect:** Log-transformed Total Power increased significantly from 2023 to 2024. Welch’s t-tests across 20 random stratified samples consistently show a highly significant difference (mean t ≈ -8.42, mean p ≈ 1.35×10⁻¹³), indicating robust growth in installed capacity in the pooled data for the four selected districts.

2. **Variance Check:** Levene’s test results across samples (mean p ≈ 0.24) suggest that variances between years are reasonably similar, supporting the validity of the t-test comparisons despite some moderate variability.

3. **District Effect:** The analysis focused only on the selected districts (Aveiro, Évora, Vila Real, Faro). Within this subset, stratified sampling by year was used to compare mean log Total Power between 2023 and 2024. While significant differences between years were observed, this analysis does not yet indicate whether these differences apply to each district individually. This question is addressed in the subsequent analysis.

4. **Robustness:** Multiple iterations of random stratified sampling demonstrate that the observed increase from 2023 to 2024 is stable and not dependent on a particular subset of data.

**Conclusion:** The mean log-transformed installed capacity increased significantly from 2023 to 2024 in the pooled data for the four selected districts. The results are robust to sampling variability, and variance differences between years do not affect this conclusion.
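
Since the test was run on log-transformed power, the difference in means is easiest to read back on the original scale as a ratio. As a worked identity (the symbols $\bar{y}$ and $\mathrm{GM}$ are introduced here for illustration), let $\bar{y}_{\text{year}}$ be the mean of $\log(\text{power})$ in a given year:

$$
\exp\left(\bar{y}_{2024} - \bar{y}_{2023}\right) = \frac{\mathrm{GM}_{2024}}{\mathrm{GM}_{2023}}
$$

where $\mathrm{GM}$ is the geometric mean of the installed power in kW. A significant difference in mean log power therefore indicates a shift in the geometric mean, which for strongly skewed data can differ considerably from the arithmetic mean.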

We have established that installed capacity increased from 2023 to 2024 in the selected districts, but does this increase hold consistently across all districts and installation types (Residential vs Industrial)?





## RQ2: Compare the evolution of installed capacity between 2023 and 2024 across residential and industrial UPACs to assess differences in growth patterns.


After establishing that installed capacity increased from 2023 to 2024 in the selected districts, this research question evaluates how this growth differs between residential and industrial UPACs. The analysis focuses on four selected districts—Aveiro, Évora, Vila Real, and Faro—and examines differences across sectors, potential sector–district interactions, and whether the observed yearly growth is consistent across types of installations and regions.

**Variable Definitions:**
- $\mu_{R}$: mean installed capacity for Residential UPACs across selected districts and years.
- $\mu_{I}$: mean installed capacity for Industrial UPACs across selected districts and years.


**Hypotheses**

For this analysis, the following null and alternative hypotheses were defined to assess differences in installed capacity across sectors, districts, and their interaction.


- *Effect of Type (Residential vs Industrial):*

$$
H_0^{\text{Type}}: \mu_R = \mu_I \\
H_1^{\text{Type}}: \mu_R \neq \mu_I
$$

- *Effect of District (Aveiro, Évora, Vila Real, Faro):*

$$
H_0^{\text{District}}: \mu_{\text{Aveiro}} = \mu_{\text{Évora}} = \mu_{\text{Vila Real}} = \mu_{\text{Faro}} \\
H_1^{\text{District}}: \text{At least one district mean differs}
$$

- *Effect of Year (2023 vs 2024):*

$$
H_0^{\text{Year}}: \mu_{2023} = \mu_{2024} \\
H_1^{\text{Year}}: \mu_{2023} \neq \mu_{2024}
$$

- *Interaction Effects:*

  - Type × District:
  $$
  H_0^{\text{Type × District}}: \text{No interaction between Type and District} \\
  H_1^{\text{Type × District}}: \text{Interaction exists}
  $$

  - Type × Year:
  $$
  H_0^{\text{Type × Year}}: \text{No interaction between Type and Year} \\
  H_1^{\text{Type × Year}}: \text{Interaction exists}
  $$

  - District × Year:
  $$
  H_0^{\text{District × Year}}: \text{No interaction between District and Year} \\
  H_1^{\text{District × Year}}: \text{Interaction exists}
  $$

  - Type × District × Year:
  $$
  H_0^{\text{3-way}}: \text{No three-way interaction} \\
  H_1^{\text{3-way}}: \text{Three-way interaction exists}
  $$


**Statistical Test and Assumptions**

To test these hypotheses, a three-way robust ANOVA was applied, with factors Type (Residential vs Industrial), District, and Year. The following assumptions were considered:

* **Samples Drawn from the Population**

  * Stratified random samples covering at most 10% of each year’s population were drawn to support the robustness of the results.

* **Sufficiently Large Sample Size**

  * For each combination of District, Type, and Year, the stratified random samples included a sufficiently large number of installations. This is sufficient to invoke the Central Limit Theorem (CLT), ensuring approximate normality of sample means.

* **Independence of Observations**

  * Each UPAC installation is a distinct unit, so observations are independent across groups (District × Type × Year).

* **Normality of Residuals**

  * Residuals of installed capacity are assumed to be approximately normally distributed within each group, supported by stratified random sampling and sufficiently large sample sizes.

* **Homogeneity of Variances**

  * Levene’s test indicated unequal variances across groups. Therefore, a robust ANOVA with HC3 covariance was applied to account for heteroscedasticity.




In [None]:
df=df_final

The dataset was first filtered to include only the districts of Aveiro, Évora, Vila Real, and Faro.

The voltage levels were then mapped to define the Type of UPAC as either "Residential" or "Industrial," and entries outside these two categories were excluded.

Finally, the year and quarter were extracted from the Quarter column to facilitate temporal analysis.

In [None]:
# --- Filter districts ---
districts = ['Aveiro','Évora','Vila Real','Faro']
df_sample = df[df['District'].isin(districts)].copy()

# --- Map Voltage to Type ---
df_sample['Type'] = df_sample['Voltage level'].map({
    'AT':'Industrial',
    'MT':'Industrial',
    'BTN':'Residential',
    'BTE':'Residential'
})
df_sample = df_sample[df_sample['Type'].isin(['Residential','Industrial'])]

# --- Extract Year and Quarter ---
df_sample['Year'] = df_sample['Quarter'].str[:4].astype(int)
df_sample['Quarter_num'] = df_sample['Quarter'].str[5:]

To assess the distribution of total installed power across groups, log-transformed histograms were plotted for each District, Type, and Year combination.

In [None]:
df_sample = df_sample.copy()
df_sample['log_Total_power'] = np.log(df_sample['Total installed power (kW)'])

# --- Histogram of log(Total_power) by District, Type, and Year ---
g = sns.FacetGrid(df_sample, row='District', col='Type', hue='Year',
                  margin_titles=True, sharex=False, sharey=False)
g.map(sns.histplot, 'log_Total_power', bins=10, kde=True, alpha=0.6)
g.add_legend()
plt.subplots_adjust(top=0.9)
g.fig.suptitle("Distribution of log(Total Installed Power) by District, Type, and Year")
plt.show()


The histograms show that, after the log transformation, most group distributions are approximately bell-shaped. This supports the normality assumption and suggests that the variance is more stable across districts and UPAC types.


A stratified random sample was then taken from the filtered dataset so that each group defined by District, Type, and Year had the same sample size, with the per-group size capped both at 10% of the filtered population and at the size of the smallest group. This balanced design gives every group equal weight in the comparison. Provided each group is sufficiently large, the Central Limit Theorem then justifies treating the group means as approximately normal, which is necessary for the subsequent ANOVA analysis.

In [None]:
# --- Stratified sample (same size per group, <= 10% of the population) ---
group_sizes = df_sample.groupby(['District','Type','Year']).size()
sample_size = min(int(len(df_sample)*0.1), group_sizes.min())

df_sample = (
    df_sample
    .groupby(['District','Type','Year'], group_keys=False)
    .apply(lambda x: x.sample(n=sample_size, random_state=27))
)

# --- Print the size of each group ---
group_counts = df_sample.groupby(['District','Type','Year']).size()
print("Size of each group after stratified sampling:\n")
print(group_counts)
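
The sample-size rule used above (the smaller of a 10% cap and the smallest group) can be illustrated without pandas. The following is a minimal sketch on hypothetical toy records; the function name `balanced_sample` and the data are assumptions made for illustration and are not part of the analysis.

```python
import random
from collections import defaultdict


def balanced_sample(rows, key, frac=0.1, seed=27):
    """Draw the same number of rows from every stratum.

    The per-stratum size is the smaller of (a) frac * total rows and
    (b) the size of the smallest stratum.
    """
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    n = min(int(len(rows) * frac), min(len(g) for g in strata.values()))
    rng = random.Random(seed)
    return {k: rng.sample(g, n) for k, g in strata.items()}


# Hypothetical toy records: (District, Type, Year)
rows = (
    [("Aveiro", "Residential", 2023)] * 40
    + [("Aveiro", "Industrial", 2023)] * 5
    + [("Faro", "Residential", 2023)] * 30
    + [("Faro", "Industrial", 2023)] * 8
)
sample = balanced_sample(rows, key=lambda r: r)
sizes = {k: len(v) for k, v in sample.items()}
print(sizes)
```

With 83 toy rows the 10% cap is 8, but the smallest stratum has 5 rows, so every stratum contributes 5 rows. This is why the smallest group, rather than the 10% cap, often determines the balanced sample size.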


**Levene Test**

Before performing the ANOVA, Levene’s test was applied to assess the assumption of homogeneity of variances between Residential and Industrial UPACs. This test evaluates whether the variance of Total_power is similar across the two sectors, which is a key requirement for valid ANOVA results.

In [None]:
# --- Sort data ---
df_sample.sort_values(['District','Type','Municipality','Quarter'], inplace=True)

# --- Rename column for convenience ---
df_sample = df_sample.rename(columns={"log_Total_power": "Total_power"})

# --- Levene Test for equal variances ---
stat_levene, p_levene = stats.levene(
    df_sample[df_sample['Type']=='Residential']['Total_power'],
    df_sample[df_sample['Type']=='Industrial']['Total_power']
)
print(f"\nLevene Test: W={stat_levene:.3f}, p={p_levene:.3e}")

Levene’s test for homogeneity of variances yielded a statistic of **W = 90.032** with a **p-value < 0.05** (specifically, p = 4.45 × $10^{-21}$), indicating that the assumption of equal variances between Residential and Industrial UPACs is **not satisfied**.
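
To make the W statistic concrete, here is a dependency-free sketch of the median-centred form of Levene's test (the Brown-Forsythe variant, which is the default centring in `scipy.stats.levene`). The helper name `levene_w` and the toy groups are assumptions for illustration; the sketch computes only the statistic, not the p-value.

```python
from statistics import mean, median


def levene_w(*groups):
    """Median-centred Levene (Brown-Forsythe) W statistic."""
    # Absolute deviation of each observation from its own group median
    medians = [median(g) for g in groups]
    z = [[abs(x - med) for x in g] for g, med in zip(groups, medians)]

    k = len(z)
    n_total = sum(len(g) for g in z)
    z_means = [mean(g) for g in z]
    z_grand = sum(sum(g) for g in z) / n_total

    # Between-group and within-group variation of the deviations
    between = sum(len(g) * (m - z_grand) ** 2 for g, m in zip(z, z_means))
    within = sum((v - m) ** 2 for g, m in zip(z, z_means) for v in g)
    return (n_total - k) / (k - 1) * between / within


# Two hypothetical groups with the same shape but different spread
w = levene_w([1, 2, 3], [2, 4, 6])
print(round(w, 3))
```

In effect, W is a one-way ANOVA F statistic computed on the absolute deviations, so a large W signals that the typical deviation differs between groups.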

Given the results of Levene’s test, which indicated unequal variances, a robust three-way ANOVA was applied.

This was implemented using a heteroscedasticity-consistent covariance estimator (HC3) in the OLS model, which adjusts the standard errors and F-tests to account for variance heterogeneity while keeping the same factorial structure.
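
For reference, the standard definition of the HC3 estimator is given below. The notation is introduced here for illustration: $X$ is the design matrix of the factorial model, $\hat{e}_i$ are the OLS residuals, and $h_{ii}$ are the diagonal elements of the hat matrix $H = X (X^\top X)^{-1} X^\top$.

$$
\widehat{\operatorname{Var}}_{\text{HC3}}(\hat{\beta}) = (X^\top X)^{-1} X^\top \operatorname{diag}\!\left( \frac{\hat{e}_i^{2}}{(1 - h_{ii})^{2}} \right) X \, (X^\top X)^{-1}
$$

Dividing each squared residual by $(1 - h_{ii})^2$ inflates the contribution of high-leverage observations, which makes HC3 more conservative than the uncorrected sandwich estimator in small or unbalanced groups.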


In [None]:
# --- Three-way robust ANOVA including Year ---
model = ols('Total_power ~ C(Type) * C(District) * C(Year)', data=df_sample).fit(cov_type='HC3')
anova_table_robust = sm.stats.anova_lm(model, typ=2)

pd.reset_option('display.float_format')
print("\nThree-way Robust ANOVA (Type x District x Year) results:\n")
anova_table_robust


The robust three-way ANOVA results indicate that **Type** has a highly significant effect on installed capacity (F = 419.59, p < 0.05), confirming that Industrial and Residential UPACs differ substantially in their mean installed power.

**District** also shows a highly significant effect (F = 73.42, p < 0.05), indicating notable differences in installed capacity across the four regions.

Similarly, **Year** has a significant effect (F = 52.11, p < 0.05), reflecting overall growth in installed capacity between 2023 and 2024.

The interaction terms show mixed results:

- **Type × District** (F = 2.80, p < 0.05): the difference between Residential and Industrial UPACs varies depending on the district.
- **Type × Year** (F = 16.21, p < 0.05): the sector difference changes slightly between years.
- **District × Year** (F = 0.68, p = 0.56): the change in installed capacity from 2023 to 2024 does not differ significantly across districts.
- **Type × District × Year** (F = 0.8, p = 0.47): the effect of Year on installed capacity does not depend on the combination of UPAC Type and District.

These results reinforce the importance of examining **sector and district specific patterns over time**, rather than relying solely on overall averages, and justify the use of a robust approach given the unequal variances observed in Levene’s test.
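
A two-way interaction such as Type × Year can be read as a "difference of differences" between cell means. The sketch below uses hypothetical mean log-power values (not the study's estimates) and an illustrative helper, `interaction_contrast`, to show the calculation.

```python
# Hypothetical mean log-power per (Type, Year) cell, for illustration only
cell_means = {
    ("Industrial", 2023): 5.0,
    ("Industrial", 2024): 5.5,
    ("Residential", 2023): 2.0,
    ("Residential", 2024): 2.1,
}


def interaction_contrast(means, a1, a2, b1, b2):
    """Change of a1 from b1 to b2, minus the same change for a2."""
    change_a1 = means[(a1, b2)] - means[(a1, b1)]
    change_a2 = means[(a2, b2)] - means[(a2, b1)]
    return change_a1 - change_a2


contrast = interaction_contrast(
    cell_means, "Industrial", "Residential", 2023, 2024
)
print(round(contrast, 2))
```

A contrast of zero would mean both sectors changed by the same amount between years (no interaction). A non-zero contrast, as in this toy example, is what a significant Type × Year term detects.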


In [None]:
# --- 1. All sectors together (Residential vs Industrial) ---
plt.figure(figsize=(10,6))
sns.barplot(
    data=df_sample,
    x='District',
    y='Total_power',
    hue='Type',        # Color by sector
    ci='sd',
    palette={'Residential':'#1f78b4', 'Industrial':'#e31a1c'}  # blue and red
)
plt.title("Mean log(Total Installed Power) by District and Sector")
plt.ylabel("log(Total Installed Power, kW)")  # Total_power holds log values
plt.xlabel("District")
plt.legend(title="Sector")
plt.show()

# --- 2. Residential 2023 vs 2024 ---
plt.figure(figsize=(10,6))
sns.barplot(
    data=df_sample[df_sample['Type']=='Residential'],
    x='District',
    y='Total_power',
    hue='Year',
    ci='sd',
    palette={2023:'#a6cee3', 2024:'#1f78b4'}  # light/dark blue
)
plt.title("Residential Mean log(Installed Power) by District and Year")
plt.ylabel("log(Total Installed Power, kW)")  # Total_power holds log values
plt.xlabel("District")
plt.legend(title="Year")
plt.show()

# --- 3. Industrial 2023 vs 2024 ---
plt.figure(figsize=(10,6))
sns.barplot(
    data=df_sample[df_sample['Type']=='Industrial'],
    x='District',
    y='Total_power',
    hue='Year',
    ci='sd',
    palette={2023:'#fdbf6f', 2024:'#e31a1c'}  # light/dark orange
)
plt.title("Industrial Mean log(Installed Power) by District and Year")
plt.ylabel("log(Total Installed Power, kW)")  # Total_power holds log values
plt.xlabel("District")
plt.legend(title="Year")
plt.show()


Bar plots were created to visualize the mean log-transformed installed power by district, sector, and year. The first plot compares the mean between Residential and Industrial UPACs across districts. The second and third plots show Residential and Industrial separately, highlighting differences between 2023 and 2024. Error bars represent one standard deviation, illustrating both sector-specific differences and year-to-year growth, complementing the interaction effects observed in the ANOVA.

**Multiple Random Sampling**

To verify the robustness of the results, multiple random samplings were performed. Specifically, 20 stratified samples were drawn from the filtered dataset, each with equal group sizes across District × Type × Year. For each iteration, Levene’s test and the three-way robust ANOVA were conducted.

In [None]:
# --- Parameters ---
n_iterations = 20  # number of random samplings
results = []

# --- Filter districts ---
districts = ['Aveiro','Évora','Vila Real','Faro']
df_sample = df[df['District'].isin(districts)].copy()
df_sample['log_Total_power'] = np.log(df_sample['Total installed power (kW)'])

# --- Map Voltage to Type ---
df_sample['Type'] = df_sample['Voltage level'].map({
    'AT':'Industrial',
    'MT':'Industrial',
    'BTN':'Residential',
    'BTE':'Residential'
})
df_sample = df_sample[df_sample['Type'].isin(['Residential','Industrial'])]
# --- Rename column for convenience ---
df_sample = df_sample.rename(columns={"log_Total_power": "Total_power"})

# --- Extract Year from Quarter ---
df_sample['Year'] = df_sample['Quarter'].str[:4].astype(int)

# --- Stratified random sampling function (by District, Type, Year) ---
def stratified_sample(df, frac=0.1, random_state=None):
    group_sizes = df.groupby(['District','Type','Year']).size()
    sample_size = max(1, min(int(len(df)*frac), group_sizes.min()))
    df_stratified = (
        df
        .groupby(['District','Type','Year'], group_keys=False)
        .apply(lambda x: x.sample(n=sample_size, random_state=random_state))
    )
    return df_stratified.reset_index(drop=True)

# --- Loop over multiple random samplings ---
for i in range(1, n_iterations + 1):
    df_iter = stratified_sample(df_sample, frac=0.1, random_state=i)

    # Levene Test for equal variances
    stat_levene, p_levene = stats.levene(
        df_iter[df_iter['Type']=='Residential']['Total_power'],
        df_iter[df_iter['Type']=='Industrial']['Total_power']
    )

    # Three-way robust ANOVA (HC3)
    model = ols('Total_power ~ C(Type) * C(District) * C(Year)', data=df_iter).fit(cov_type='HC3')
    anova_iter = sm.stats.anova_lm(model, typ=2)

    # Store results
    results.append({
        'Iteration': i,
        'Levene_W': stat_levene,
        'Levene_p': p_levene,
        'F_Type': anova_iter.loc['C(Type)','F'],
        'p_Type': anova_iter.loc['C(Type)','PR(>F)'],
        'F_District': anova_iter.loc['C(District)','F'],
        'p_District': anova_iter.loc['C(District)','PR(>F)'],
        'F_Year': anova_iter.loc['C(Year)','F'],
        'p_Year': anova_iter.loc['C(Year)','PR(>F)'],
        'F_Type_District': anova_iter.loc['C(Type):C(District)','F'],
        'p_Type_District': anova_iter.loc['C(Type):C(District)','PR(>F)'],
        'F_Type_Year': anova_iter.loc['C(Type):C(Year)','F'],
        'p_Type_Year': anova_iter.loc['C(Type):C(Year)','PR(>F)'],
        'F_District_Year': anova_iter.loc['C(District):C(Year)','F'],
        'p_District_Year': anova_iter.loc['C(District):C(Year)','PR(>F)'],
        'F_Type_District_Year': anova_iter.loc['C(Type):C(District):C(Year)','F'],
        'p_Type_District_Year': anova_iter.loc['C(Type):C(District):C(Year)','PR(>F)']
    })

# --- Final DataFrame of results ---
df_results = pd.DataFrame(results)
pd.reset_option('display.float_format')

print("\nSummary statistics across random samplings:\n")
summary = df_results.agg(['mean','std'])
print(summary)

# --- Optional: check size of each group in last iteration ---
group_counts = df_iter.groupby(['District','Type','Year']).size()
print("\nSize of each group in last iteration:\n")
print(group_counts)


The summary statistics across 20 random stratified samplings confirm the robustness of the findings. 

The robust ANOVA results are stable. The main effects of Type (F ≈ 428.50, p < 0.05), District (F ≈ 86.50, p < 0.05), and Year (F ≈ 61.46, p < 0.05) are all highly significant, and the interaction between Type and District is significant (F ≈ 5.93, p < 0.05). Interactions involving Year show mixed results: Type × Year is significant (F ≈ 15.33, p < 0.05), while District × Year and Type × District × Year are not significant, indicating that the main patterns across sectors and districts are largely consistent between 2023 and 2024.

These results demonstrate that the effects of sector, district, and their interactions on installed capacity are robust across different random samples, even under heteroscedasticity.
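
Alongside the mean and standard deviation of the F and p values, a simple way to summarize stability is the share of iterations in which each effect is significant. The sketch below is an illustration under assumptions: the p-values are hypothetical and `share_significant` is a helper introduced here, not part of the original notebook.

```python
def share_significant(p_values, alpha=0.05):
    """Fraction of iterations in which the effect is significant at alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values from five resampling iterations (not the study's results)
p_by_effect = {
    "Type": [1e-40, 3e-38, 2e-41, 5e-39, 1e-37],
    "District x Year": [0.56, 0.71, 0.03, 0.48, 0.62],
}

shares = {effect: share_significant(ps) for effect, ps in p_by_effect.items()}
print(shares)
```

A share close to 1 indicates an effect that is detected in nearly every sample, while a share near the significance level itself is what would be expected by chance for a true null effect.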

To illustrate the stability of the ANOVA results across the 20 random stratified samplings, a heatmap of p-values was created. Each row represents a different iteration, and each column corresponds to a factor or interaction, including Type, District, Year, and their combinations.

In [None]:
# --- Transform data to long format including Year ---
df_pval_long = df_results.melt(
    id_vars=['Iteration'],
    value_vars=[
        'p_Type', 'p_District', 'p_Year',
        'p_Type_District', 'p_Type_Year', 'p_District_Year', 'p_Type_District_Year'
    ],
    var_name='Factor',
    value_name='p_value'
)

# Rename factors for visualization
df_pval_long['Factor'] = df_pval_long['Factor'].replace({
    'p_Type':'Type',
    'p_District':'District',
    'p_Year':'Year',
    'p_Type_District':'Type x District',
    'p_Type_Year':'Type x Year',
    'p_District_Year':'District x Year',
    'p_Type_District_Year':'Type x District x Year'
})

# Pivot data for heatmap
heatmap_data = df_pval_long.pivot(index='Iteration', columns='Factor', values='p_value')

# --- Plot ---
plt.figure(figsize=(12,6))
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt=".2e",   # compact scientific notation
    cmap='coolwarm',  # red = higher p-value
    cbar_kws={'label':'p-value'},
    linewidths=0.5
)

plt.title("ANOVA p-values across Iterations (including Year)", fontsize=14)
plt.ylabel("Iteration", fontsize=12)
plt.xlabel("Factor", fontsize=12)
plt.yticks(rotation=0)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


The heatmap shows that p-values remain very low across iterations for the main effects of sector, district, and year, in line with the summary statistics, while District × Year and Type × District × Year remain non-significant on average (mean p ≈ 0.62 and 0.54). The visualization also allows for a quick assessment of any variability between iterations.

**Post-hoc Dunn test**

In [None]:
# Total_power already holds the log-transformed values (renamed above),
# so reuse it directly instead of applying a second log transform
df_sample['log_Total_power'] = df_sample['Total_power']

# Post-hoc Dunn test (District)
print("\n=== Post-hoc Dunn test (District) ===")
posthoc_district = sp.posthoc_dunn(
    df_sample,
    val_col='log_Total_power',
    group_col='District',
    p_adjust='bonferroni'
)
display(posthoc_district)

# Post-hoc Dunn test (Year)
print("\n=== Post-hoc Dunn test (Year) ===")
posthoc_year = sp.posthoc_dunn(
    df_sample,
    val_col='log_Total_power',
    group_col='Year',
    p_adjust='bonferroni'
)
display(posthoc_year)

# Post-hoc Dunn test (Type × District interaction)
print("\n=== Post-hoc Dunn test (Type × District interaction) ===")
df_sample['Type_District'] = df_sample['Type'] + "_" + df_sample['District']
posthoc_type_district = sp.posthoc_dunn(
    df_sample,
    val_col='log_Total_power',
    group_col='Type_District',
    p_adjust='bonferroni'
)
display(posthoc_type_district)

To further explore the significant main and interaction effects identified in the ANOVA, post-hoc Dunn tests with Bonferroni correction were conducted.

The district-level comparison revealed marked pairwise differences in installed capacity: Évora consistently differed from Aveiro, Faro, and Vila Real (p < 0.001), while Aveiro and Faro showed no significant difference between them. This pattern reinforces the geographic heterogeneity observed in the ANOVA, suggesting that Évora’s higher capacity levels may reflect stronger local investment dynamics or favorable site conditions.

Across years, a highly significant contrast was found between 2023 and 2024 (p < 0.001), confirming the growth trend already identified in the ANOVA and highlighting a general expansion of self-consumption installations over time.

The Type × District interaction also showed numerous significant contrasts (p < 0.001), indicating that the magnitude of sectoral differences (Residential vs. Industrial) is district-dependent. In particular, industrial installations in Aveiro and Faro exhibited similar capacity profiles, while residential systems in Vila Real and Évora showed the lowest mean values. This heterogeneity points to region-specific development patterns, likely linked to differences in industrial density, grid capacity, or policy incentives.

Taken together, the post-hoc results reinforce the ANOVA conclusions: industrial installations dominate in power capacity, but the distribution of both sectors varies markedly across regions, and overall growth from 2023 to 2024 reflects a sustained national expansion of self-consumption.
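
The Bonferroni correction used in the Dunn tests above multiplies each raw p-value by the number of pairwise comparisons and caps the result at 1. The following sketch shows the arithmetic on hypothetical raw p-values; the helper `bonferroni` is introduced here for illustration and is not the library's internal code.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Four districts give 4 * 3 / 2 = 6 pairwise comparisons.
# The raw p-values below are hypothetical, for illustration only.
raw = [0.0001, 0.004, 0.02, 0.03, 0.2, 0.5]
adjusted = bonferroni(raw)
print([round(p, 4) for p in adjusted])
```

In this toy case, two raw p-values that were below 0.05 (0.02 and 0.03) are no longer significant after adjustment. This illustrates why the correction becomes stricter as the number of compared groups grows, as in the Type × District comparison with eight groups.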

### Summary of Findings (RQ2)

1. **Sector Effect:** Residential and Industrial UPACs show highly significant differences in installed capacity across all districts (robust ANOVA, mean F ≈ 428.5, p < 0.05). Industrial installations consistently have higher mean power.

2. **District Effect:** Installed capacity varies significantly between districts (mean F ≈ 86.5, p < 0.05), confirming regional patterns in self-consumption adoption.

3. **Year Effect:** There is a significant growth in installed capacity from 2023 to 2024 (mean F ≈ 61.5, p < 0.05), indicating increasing adoption across both sectors.

4. **Interactions:**

   * **Type × District** is significant (mean F ≈ 5.93, p < 0.05), meaning sector differences vary across districts.
   * **Type × Year** is significant (mean F ≈ 15.33, p < 0.05), showing that growth patterns differ slightly between sectors.
   * **District × Year** (mean F ≈ 0.674, p ≈ 0.617) and **Type × District × Year** (mean F ≈ 0.806, p ≈ 0.543) are not significant, suggesting that yearly growth is generally consistent across districts and combinations of sector and district.

5. **Robustness:** Multiple random stratified samplings confirm the stability of these findings, even under heteroscedasticity observed in Levene’s tests.

**Conclusion:** Residential and Industrial UPACs differ significantly in installed capacity, district-level variations exist, and overall capacity increased from 2023 to 2024. The sector gap varies by district and by year, but yearly growth is consistent across districts, and the results are robust across different random samples.

We have already examined differences between years, sectors, and districts, as well as their interactions. But what about seasonal effects? Does installed capacity vary across different quarters within the year?



## RQ3: Compare the total installed capacity for self-consumption across different power scales (installed capacity ranges) and seasons (winter vs. summer) in selected Portuguese districts during 2023 and 2024.

<div id="research"></div>
<strong><a href="#top">Back to top</a></strong>

After establishing differences in installed capacity between years, sectors, and districts in the selected regions, this research question investigates how installed capacity varies across seasons. The analysis focuses on four districts (Aveiro, Évora, Vila Real, and Faro) and examines differences between winter and summer and across power scales, as well as potential interactions with districts and years.

Data were included only when complete information was available for **power scale**, **season**, and **total installed capacity** for both years.


**Variable Definitions:**

- $\mu_{Y}$: mean log total power for each Year.

- $\mu_{D}$: mean log total power for each District.

- $\mu_{S}$: mean log total power for each Season.

- $\mu_{P}$: mean log total power for each Installed power range (kW).




**Hypotheses:**
For this analysis, the following null and alternative hypotheses were defined to assess differences in log total power across Years, Districts, Seasons, Installed power ranges, and their interactions.

- *Effect of Year (2023 vs 2024):*

$$
H_0^{\text{Year}}: \mu_{2023} = \mu_{2024} \\
H_1^{\text{Year}}: \mu_{2023} \neq \mu_{2024}
$$


- *Effect of District (Aveiro, Évora, Vila Real, Faro):*

$$
H_0^{\text{District}}: \mu_{\text{Aveiro}} = \mu_{\text{Évora}} = \mu_{\text{Vila Real}} = \mu_{\text{Faro}} \\
H_1^{\text{District}}: \text{At least one district mean differs}
$$


- *Effect of Season (Winter vs Summer):*

$$
H_0^{\text{Season}}: \mu_{\text{Winter}} = \mu_{\text{Summer}} \\
H_1^{\text{Season}}: \mu_{\text{Winter}} \neq \mu_{\text{Summer}}
$$


- *Effect of Installed power range (kW):*

$$
H_0^{\text{Power}}: \mu_{(0,4]} = \mu_{(4,20.7]} = \mu_{(20.7,30]} = \mu_{(30, 1000]} \\
H_1^{\text{Power}}: \text{At least one power range mean differs}
$$


- *Interaction Effects:*

  - Year × District:
  $$
  H_0^{\text{Year × District}}: \text{No interaction between Year and District} \\
  H_1^{\text{Year × District}}: \text{Interaction exists}
  $$

  - Year × Season:
  $$
  H_0^{\text{Year × Season}}: \text{No interaction between Year and Season} \\
  H_1^{\text{Year × Season}}: \text{Interaction exists}
  $$

  - District × Season:
  $$
  H_0^{\text{District × Season}}: \text{No interaction between District and Season} \\
  H_1^{\text{District × Season}}: \text{Interaction exists}
  $$

  - Year × Installed power range:
  $$
  H_0^{\text{Year × Power}}: \text{No interaction between Year and Power range} \\
  H_1^{\text{Year × Power}}: \text{Interaction exists}
  $$

  - District × Installed power range:
  $$
  H_0^{\text{District × Power}}: \text{No interaction between District and Power range} \\
  H_1^{\text{District × Power}}: \text{Interaction exists}
  $$

  - Season × Installed power range:
  $$
  H_0^{\text{Season × Power}}: \text{No interaction between Season and Power range} \\
  H_1^{\text{Season × Power}}: \text{Interaction exists}
  $$

- *Three-way Interaction Effects:*

  - Year × District × Season:
  $$
  H_0^{\text{3-way}}: \text{No three-way interaction} \\
  H_1^{\text{3-way}}: \text{Three-way interaction exists}
  $$

  - Year × District × Installed power range:
  $$
  H_0^{\text{3-way}}: \text{No three-way interaction} \\
  H_1^{\text{3-way}}: \text{Three-way interaction exists}
  $$

  - Year × Season × Installed power range:
  $$
  H_0^{\text{3-way}}: \text{No three-way interaction} \\
  H_1^{\text{3-way}}: \text{Three-way interaction exists}
  $$

  - District × Season × Installed power range:
  $$
  H_0^{\text{3-way}}: \text{No three-way interaction} \\
  H_1^{\text{3-way}}: \text{Three-way interaction exists}
  $$

- *Four-way Interaction Effect:*

$$
H_0^{\text{4-way}}: \text{No four-way interaction among Year, District, Season, and Power range} \\
H_1^{\text{4-way}}: \text{Four-way interaction exists}
$$
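
The hypotheses listed above cover every term of a full factorial design with four factors: 4 main effects, 6 two-way, 4 three-way, and 1 four-way interaction. As a quick check, the terms can be enumerated programmatically; the factor labels below are shorthand chosen for this sketch.

```python
from itertools import combinations

factors = ["Year", "District", "Season", "Power range"]

# Every non-empty subset of the factors is one term of the full factorial model
effects = [
    " x ".join(combo)
    for order in range(1, len(factors) + 1)
    for combo in combinations(factors, order)
]

# Count the terms of each order (1 = main effect, 2 = two-way, ...)
by_order = {
    order: sum(e.count(" x ") == order - 1 for e in effects)
    for order in range(1, len(factors) + 1)
}
print(len(effects), by_order)
for effect in effects:
    print(effect)
```

The total of 15 terms matches the number of rows (excluding the residual) expected in the four-way ANOVA table.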

In [None]:
# Select relevant columns
cols = ['District', 'Installed power range (kW)', 'Total installed power (kW)', 'Season', 'Quarter']
df = df[cols].dropna(subset=['District', 'Installed power range (kW)', 'Total installed power (kW)', 'Season'])

# Extract Year from Quarter
df['Year'] = df['Quarter'].astype(str).str.extract(r'(20\d{2})').astype(int)

# Filter for specific districts
df = df[df['District'].isin(['Aveiro', 'Évora', 'Faro', 'Vila Real'])]

# Normalize Season labels
df['Season'] = (
    df['Season']
    .astype(str)
    .str.strip()
    .str.lower()
    .map({'winter': 'winter', 'verão': 'summer', 'summer': 'summer', 'inverno': 'winter'})
    .fillna(df['Season'])
)

# Exclude [>1000 kW] scale and standardize labels
valid_scales = [']0, 4]', ']4, 20.7]', ']20.7, 30]', ']30, 1000]']
df['Installed power range (kW)'] = (
    df['Installed power range (kW)']
    .astype(str)
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True)
)
df = df[df['Installed power range (kW)'].isin(valid_scales)]

# Define categorical types with order
df['Installed power range (kW)'] = pd.Categorical(df['Installed power range (kW)'], categories=valid_scales, ordered=True)
df['District'] = pd.Categorical(df['District'], categories=['Aveiro', 'Évora', 'Faro', 'Vila Real'], ordered=False)
df['Season'] = pd.Categorical(df['Season'], categories=['winter', 'summer'], ordered=False)
df['Year'] = pd.Categorical(df['Year'], categories=sorted(df['Year'].unique()), ordered=True)

# Check number of observations per group
counts = (
    df.groupby(['Year', 'District', 'Installed power range (kW)', 'Season'], dropna=False)
      .size()
      .reset_index(name='n')
      .sort_values(['Year', 'District', 'Installed power range (kW)', 'Season'])
)
print(counts.head(20))
print("\nTotal rows after filters:", len(df))

#### **Statistical Test and Assumptions**

To test these hypotheses, a four-way ANOVA was applied with factors Power Scale (Installed Power Range), Season (Winter vs. Summer), District, and Year. This factorial model evaluates both the main effects and interactions between structural (scale), temporal (year), seasonal, and regional factors. Before conducting the analysis, the following assumptions were verified:

* **Samples Drawn from the Population**

A subset of four districts — Aveiro, Évora, Faro, and Vila Real — was selected to represent diverse geographic and climatic regions of Portugal, ensuring coverage of both coastal and inland areas. Data were included only when complete information was available for power scale, season, and total installed capacity for both years (2023–2024). This approach ensures representativeness of the population of UPAC self-consumption installations across regional and climatic contexts.

* **Sufficiently Large Sample Size**

For each combination of District, Power Scale, Season, and Year, stratified random samples were drawn, representing up to 10% of each subgroup. These sample sizes are large enough for the Central Limit Theorem (CLT) to apply, ensuring that the distribution of sample means approximates normality even if the population data are not perfectly normal.

* **Independence of Observations**

Each UPAC installation represents an independent observation — a distinct self-consumption unit with its own installed capacity. Therefore, observations are independent both within and across groups (District × Power Scale × Season × Year).

* **Normality of Residuals**

The ANOVA assumes that residuals of installed capacity are approximately normally distributed within each group. The use of the log-transformed installed capacity (`log_total_power`, computed as the natural log of capacity + 1) and stratified sampling support this assumption by reducing skewness and stabilizing variances.

* **Homogeneity of Variances**

Levene’s test revealed significant differences in group variances (p < 0.001), indicating that the assumption of homoscedasticity required for standard ANOVA was not met. To address this, a four-way ANOVA with heteroskedasticity-consistent standard errors (HC3) was applied. This robust approach adjusts for unequal variances across groups, providing reliable F- and p-values even under heteroscedastic conditions.

In [None]:
# Create log-transformed column
df['log_total_power'] = np.log1p(df['Total installed power (kW)'])

fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=False)

# Plot 1: Histogram of log-transformed total power
sns.histplot(
    data=df,
    x='log_total_power',
    hue='Installed power range (kW)',
    kde=True,
    element='step',
    ax=axes[0],
    palette='viridis'
)
axes[0].set_title('Histogram (Log-Transformed Total Installed Power)', fontsize=12)
axes[0].set_xlabel('ln(Total Installed Power + 1)', fontsize=10)  # np.log1p is a natural log
axes[0].set_ylabel('Count')

# Plot 2: Boxplot by Installed Power Range
sns.boxplot(
    data=df,
    x='Installed power range (kW)',
    y='log_total_power',
    ax=axes[1],
    palette='viridis'
)
axes[1].set_title('Boxplot by Power Scale (Log-Transformed)', fontsize=12)
axes[1].set_xlabel('Installed Power Range (kW)', fontsize=10)
axes[1].set_ylabel('ln(Total Installed Power + 1)', fontsize=10)  # np.log1p is a natural log
axes[1].tick_params(axis='x', rotation=30)

plt.tight_layout()
plt.show()

Visual inspection of the histogram and boxplot of the log-transformed data revealed that the distributions became approximately symmetric across all power scale groups. This transformation reduced the influence of extreme values and stabilized the variance, allowing the assumptions of ANOVA to be reasonably met. 
\
Furthermore, the boxplot shows that the spread of the boxes is broadly similar across groups, which is suggestive of comparable variances. This visual impression is checked formally with Levene’s test before fitting the four-way ANOVA on the log-transformed data to assess the effects of power scale, season, district, and year.
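
The effect of the log transformation on skewness can be illustrated with a small, dependency-free example. The capacities below are hypothetical values chosen to be right-skewed (they are not taken from the dataset), and `skewness` is a helper defined here that computes the population skewness.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed capacities in kW (not taken from the dataset)
power_kw = [1, 2, 3, 4, 5, 10, 50, 200, 1000]
log_power = [math.log1p(p) for p in power_kw]  # same transform as np.log1p

skew_raw = skewness(power_kw)
skew_log = skewness(log_power)
print(round(skew_raw, 2), round(skew_log, 2))
```

On these toy values the skewness drops markedly after the transformation, which is the behaviour the histogram and boxplot are meant to confirm on the real data.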



**Generate Samples**

In [None]:
# === Generate stratified random samples (<10%) ===

print(f"Population size: {df.shape[0]}")

# Backup of the filtered dataset
df_rs = df.copy() # to use later on Random Sampling

# Define grouping variables for stratified sampling
group_cols = ['District', 'Installed power range (kW)', 'Season', 'Year']

# Compute group sizes
group_sizes = df.groupby(group_cols, observed=True).size()
print(f"Number of groups: {len(group_sizes)}")

# --- Define sample size logic ---
# Minimum group size across strata
min_group_size = group_sizes.min()
# Maximum allowed sample size (10% of population * 0.9 margin)
max_sample_size = round(df.shape[0] * 0.1 * 0.9)

# Choose the sample size per stratum (balance safety + representativeness)
sample_size = min(max_sample_size, min_group_size)

print(f"Sample size per group: {sample_size} (min group = {min_group_size}, 10% cap = {max_sample_size})")

# --- Stratified random sampling ---
df_sample = (
    df.groupby(group_cols, observed=True, group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), sample_size), random_state=42))
      .reset_index(drop=True)
)

# --- Show summary ---
print(f"\nNew total sample size: {len(df_sample)} rows")
group_counts = df_sample.groupby(group_cols, observed=True).size()
print("\nSample size per group:\n")
print(group_counts)

# Display sample for inspection
display(df_sample)

**Levene Test**

In [None]:

from scipy.stats import levene

# Use the sample
d = df_sample.copy()

# Ensure the dependent variable exists (natural log of power + 1, matching np.log1p above)
if 'log_total_power' not in d.columns:
    pcol = 'Total installed power (kW)' if 'Total installed power (kW)' in d.columns else 'Total_power'
    d['log_total_power'] = np.log1p(d[pcol].astype(float))

factors = ['Installed power range (kW)', 'Season', 'District', 'Year']
rows = []
for f in factors:
    groups = [g['log_total_power'].dropna().values for _, g in d.groupby(f, observed=True)]
    groups = [g for g in groups if len(g) >= 2]
    if len(groups) > 1:
        W, p = levene(*groups, center='median')
        rows.append([f, len(groups), W, p, 'Yes' if p >= 0.05 else 'No (heteroscedastic)'])
    else:
        rows.append([f, len(groups), np.nan, np.nan, 'Not enough groups'])

out = pd.DataFrame(rows, columns=['Factor','k groups','Levene W','p-value','Equal Variances?'])
print(out.to_string(index=False, float_format='{:,.4g}'.format))


Levene’s test indicated significant heterogeneity of variances across groups (p < 0.001), suggesting that the assumption of homoscedasticity required for a standard ANOVA was violated. Therefore, a four-way ANOVA with heteroskedasticity-consistent standard errors (HC3) was applied. The HC3 estimator adjusts the covariance matrix of the model to account for unequal variances among groups, providing robust F- and p-values even under heteroscedastic conditions. In practice, this approach plays a role analogous to a Welch-type correction in multifactor ANOVA designs, so that the results remain reliable despite variance inequality.

**Four-way robust ANOVA (HC3)**

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# Four-way robust ANOVA (HC3)
model = ols(
    'log_total_power ~ C(Year) * C(District) * C(Season) * C(Q("Installed power range (kW)"))',
    data=df_sample
).fit(cov_type='HC3')

anova_table_robust = sm.stats.anova_lm(model, typ=2)

# Format floating numbers
pd.reset_option('display.float_format')
pd.options.display.float_format = '{:.6f}'.format

print("\n=== Four-way Robust ANOVA (Year × District × Season × Scale) results ===\n")
display(anova_table_robust)

# Interpretation of p-values
print("\n--- Summary of Effects ---")
for factor, pval in anova_table_robust["PR(>F)"].items():
    if pval < 0.05:
        print(f" {factor}: Significant (p = {pval:.3e})")
    else:
        print(f" {factor}: Not significant (p = {pval:.3e})")

A four-way robust ANOVA (HC3) was conducted to examine the effects of Year, District, Season, and Installed Power Range (Scale) — and their interactions — on the log-transformed total installed power.

Because Levene’s test indicated heteroscedasticity, the HC3 correction was applied so that the F- and p-values remain robust to unequal variances.
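For readers unfamiliar with HC3, the following is a minimal NumPy sketch of the estimator on a toy regression. It assumes the standard sandwich form, in which each squared residual is inflated by 1 / (1 − h_ii)² with h_ii the leverage of observation i. The function name `hc3_covariance` and the simulated data are hypothetical; in the analysis itself the covariance is produced by statsmodels through `fit(cov_type='HC3')`.

```python
import numpy as np

def hc3_covariance(X, y):
    """HC3 covariance of OLS coefficients for design matrix X and response y."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
    leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    # HC3 weights: squared residuals inflated by 1 / (1 - h_ii)^2
    weights = (resid / (1.0 - leverage)) ** 2
    meat = X.T @ (X * weights[:, None])
    # Sandwich: (X'X)^-1 [X' diag(w) X] (X'X)^-1
    return XtX_inv @ meat @ XtX_inv

# Toy data (hypothetical): the noise grows with x, so variances are unequal
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2 * x)
X = np.column_stack([np.ones_like(x), x])

cov_hc3 = hc3_covariance(X, y)
robust_se = np.sqrt(np.diag(cov_hc3))   # robust standard errors
```

The robust standard errors feed the Wald-type F-tests reported in the ANOVA table, which is why the test statistics can differ from those of a classical ANOVA on the same data.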

- Results showed significant main effects for Year (p < 0.001), District (p < 0.001), and Power Scale (p < 0.001), indicating that installed capacity differs substantially across years, regions, and installation scales.

- In contrast, Season was not significant (p = 0.439), suggesting that seasonal variation (summer vs. winter) does not meaningfully affect the total installed capacity.

Regarding interactions:

* Year × Scale (p < 0.001) and District × Scale (p < 0.001) interactions were significant, showing that the influence of scale varies by both year and region.

* The three-way interaction Year × District × Scale was also significant (p = 0.024), suggesting that growth patterns by scale differ across districts and years.

* All other interactions, including those involving Season, were not significant (p > 0.05).

* The four-way interaction was also not significant (p = 0.347), indicating that seasonal effects remain consistent across years, districts, and scales.

Since the four-way ANOVA revealed significant main and interaction effects (for Year, District, Power Scale, and their combinations), post-hoc pairwise comparisons were performed to identify which specific groups differed. Post-hoc tests were applied only to factors or interactions with significant F-values; for interactions, power scales were compared within each level of the other factor(s). Dunn’s test with Bonferroni correction was used: it is a rank-based, non-parametric procedure that does not assume normality and accommodates unequal group sizes, which makes it a cautious choice given the heteroscedasticity detected by Levene’s test. The Bonferroni adjustment controls the family-wise error rate across the multiple pairwise comparisons.
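As a brief illustration of the Bonferroni adjustment, the sketch below multiplies each raw p-value by the number of comparisons and caps the result at 1. The raw p-values are hypothetical and the helper `bonferroni` is not part of the analysis code; `p_adjust='bonferroni'` in `scikit_posthocs` is assumed to apply the same rule.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for the 6 pairwise comparisons among 4 groups
raw = [0.0004, 0.012, 0.030, 0.200, 0.0001, 0.650]
adjusted = bonferroni(raw)

n_raw_sig = sum(p < 0.05 for p in raw)        # significant before adjustment
n_adj_sig = sum(p < 0.05 for p in adjusted)   # significant after adjustment
```

In this toy case, four comparisons fall below 0.05 before adjustment and only two remain afterwards, which shows how conservative the correction is when many pairs are tested.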

**Year** - The ANOVA revealed a significant main effect of Year (p < 0.001), indicating that installed capacity differed between 2023 and 2024. Mean log-transformed installed capacity increased from 𝑥̄₍₂₀₂₃₎ = 1.52 to 𝑥̄₍₂₀₂₄₎ = 1.71, a difference of 0.19 log units. Assuming a natural-log transform, this corresponds to an increase of roughly 20% in the typical (geometric-mean) installed power per record, rather than in the arithmetic total. This confirms that self-consumption capacity rose from 2023 to 2024.
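The back-transformation behind the percentage figure can be checked with a few lines of arithmetic. The sketch assumes the two reported means are on a plain log scale and shows the result for both a natural-log and a base-10 transform, since the percentage depends on the base; if the transform was `log1p`, the figures are approximate.

```python
import math

mean_log_2023 = 1.52
mean_log_2024 = 1.71
diff = mean_log_2024 - mean_log_2023    # 0.19 log units

# Ratio of geometric means under two possible log bases
ratio_ln = math.exp(diff)               # if the transform was the natural log
ratio_log10 = 10 ** diff                # if the transform was log10

pct_ln = (ratio_ln - 1) * 100
pct_log10 = (ratio_log10 - 1) * 100
print(f"natural log: +{pct_ln:.1f}%   log10: +{pct_log10:.1f}%")
```

Under the natural-log assumption the increase is about 21%, in line with the approximate 20% quoted above; under a base-10 transform the same difference would imply a much larger increase.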

**Post-hoc Dunn test**

In [None]:
import scikit_posthocs as sp

# Post-hoc Dunn test with Bonferroni correction

# districts
posthoc_district = sp.posthoc_dunn(
    df_sample,
    val_col='log_total_power',
    group_col='District',
    p_adjust='bonferroni'
)

print("\n=== Post-hoc Dunn test (Districts) ===")
display(posthoc_district)

# power scales
posthoc_scale = sp.posthoc_dunn(
    df_sample,
    val_col='log_total_power',
    group_col='Installed power range (kW)',
    p_adjust='bonferroni'
)

print("\n=== Post-hoc Dunn test (Power Scales) ===")
display(posthoc_scale)

# Year × Scale Interaction
for year, subdf in df_sample.groupby('Year'):
    print(f"\n=== Post-hoc Dunn test for Power Scales in {year} ===")
    posthoc_year_scale = sp.posthoc_dunn(
        subdf,
        val_col='log_total_power',
        group_col='Installed power range (kW)',
        p_adjust='bonferroni'
    )
    display(posthoc_year_scale)

# District × Scale Interaction
for district, subdf in df_sample.groupby('District'):
    print(f"\n=== Post-hoc Dunn test for Power Scales in {district} ===")
    posthoc_district_scale = sp.posthoc_dunn(
        subdf,
        val_col='log_total_power',
        group_col='Installed power range (kW)',
        p_adjust='bonferroni'
    )
    display(posthoc_district_scale)

# Year × District × Scale Interaction
for (year, district), subdf in df_sample.groupby(['Year', 'District']):
    print(f"\n=== Post-hoc Dunn test for Power Scales in {district} ({year}) ===")
    posthoc_combo = sp.posthoc_dunn(
        subdf,
        val_col='log_total_power',
        group_col='Installed power range (kW)',
        p_adjust='bonferroni'
    )
    display(posthoc_combo)

**Main Effects:**

* District:
\
The post-hoc Dunn test for districts showed significant pairwise differences between Évora and each of Aveiro and Faro (p < 0.001), and between Évora and Vila Real (p ≈ 1e-5). Évora’s installed capacity therefore differs markedly from that of the other three districts, while Aveiro and Faro show similar capacity levels. This pattern suggests geographic heterogeneity, possibly related to local investment intensity or renewable resource potential.

* Power Scale (Installed Power Range):
\
The power scale effect was also highly significant (p < 0.001). Post-hoc comparisons revealed that nearly all power ranges differ significantly from one another (p < 0.001), particularly between small-scale (]0,4] kW) and large-scale (]30,1000] kW) categories. This confirms that installed capacity increases systematically with power scale, as expected.
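Reading the significant pairs off a Dunn p-value matrix can be automated. The sketch below is one way to do it, assuming the matrix is a square, symmetric pandas DataFrame such as the one returned by `sp.posthoc_dunn`; the helper `significant_pairs`, the group labels and the p-values are hypothetical.

```python
from itertools import combinations
import pandas as pd

def significant_pairs(p_matrix, alpha=0.05):
    """Return (group_a, group_b, p) for each pair with adjusted p < alpha."""
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix)
    for a, b in combinations(p_matrix.index, 2):
        p = p_matrix.loc[a, b]
        if p < alpha:
            pairs.append((a, b, float(p)))
    # Smallest p-values first
    return sorted(pairs, key=lambda t: t[2])

# Hypothetical adjusted p-values for three groups
labels = ['A', 'B', 'C']
toy = pd.DataFrame(
    [[1.0, 0.80, 0.0002],
     [0.80, 1.0, 0.0004],
     [0.0002, 0.0004, 1.0]],
    index=labels, columns=labels,
)
pairs = significant_pairs(toy)
```

Applied to `posthoc_district` or `posthoc_scale`, a helper of this kind would yield a short list of differing pairs instead of a full matrix to inspect by eye.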

**Two-Way Interactions:**

* Year × Power Scale:
\
A significant interaction (p ≈ 5.5e-13) shows that the relationship between installed capacity and power scale changed from 2023 to 2024. Post-hoc results indicate that differences between power scales became stronger in 2024, with statistically significant contrasts between almost all ranges (especially between ]0,4] and higher categories). This suggests that capacity growth in 2024 was concentrated in larger-scale UPACs.

* District × Power Scale:
\
This interaction was also significant (p < 0.001), implying that the influence of power scale on installed capacity varies by district.

    * In Aveiro, small installations (]0,4]) differed significantly from all higher categories (p < 0.001).

    * In Évora, strong contrasts emerged between the small scales (]0,4] and ]4,20.7]) and the medium-to-large scales (]20.7,30] and above) (p < 0.001).

    * In Faro, significant differences were mainly driven by the largest installations (>30 kW), which showed higher capacity levels than smaller ones.

    * In Vila Real, differences were pronounced between the smallest scales and all higher categories (p < 0.001).
\
This pattern reveals that the scaling structure of installations is district-specific, reflecting regional differences in available space, economic activity, and network conditions.

**Three-Way Interaction (Year × District × Power Scale):**

Although weaker (p = 0.024), this interaction suggests that the evolution of installed capacity across power scales varied slightly among districts between 2023 and 2024. For instance, Aveiro and Évora exhibited increased differentiation between scales in 2024, while Faro showed more uniformity across ranges. Vila Real also displayed a clearer separation between small and large installations in 2024. These variations point to region-dependent growth patterns in the self-consumption sector.

**Multiple Random Sampling**

The summary statistics across 20 random stratified samplings confirm the robustness of the findings.

The robust four-way ANOVA (HC3) results are stable. The main effects of Year (F ≈ 67.23, p < 0.05), District (F ≈ 84.83, p < 0.05), and Installed Power Range (F ≈ 692.53, p < 0.05) are all highly significant, indicating strong temporal growth, regional variation, and scale effects in installed capacity. Season is not significant (F ≈ 1.27, p ≈ 0.36), showing that capacity remains stable between winter and summer. Interactions involving Scale are meaningful: Year × Installed Power Range and District × Installed Power Range show significant effects, suggesting that growth and scaling patterns differ by year and region. All other interactions, including those with Season, are not significant, indicating that seasonal influence on installed capacity is minimal.

These results demonstrate that the effects of year, district, and scale on installed capacity are robust across different random samples, even under heteroscedasticity.

To illustrate this stability, a heatmap of p-values was created across the 20 random samplings. Each row represents an iteration, and each column corresponds to a factor or interaction, including Year, District, Season, and Installed Power Range.

In [None]:
# --- Parameters ---
n_iterations = 20  # number of random samplings
results = []

# Dependent variable and grouping factors used for Levene’s test in each iteration
dependent_var = 'log_total_power'
factors = ['Installed power range (kW)', 'Season', 'District', 'Year']


group_sizes = df_rs.groupby(['District', 'Installed power range (kW)', 'Season', 'Year']).size()
sample_size = min(round(df.shape[0]*0.1*0.9), group_sizes.min())

# --- Loop over multiple random samplings ---
for i in range(1, n_iterations + 1):

    df_iter = (
        df_rs
        .groupby(['District', 'Installed power range (kW)', 'Season', 'Year'], group_keys=False)
        .apply(lambda x: x.sample(n=sample_size, random_state=42+i*2))
    )

    # === Levene’s Test for Homogeneity of Variances ===
    levene_results = []
    for factor in factors:
        groups = [g[dependent_var].dropna() for _, g in df_iter.groupby(factor, observed=True)]
        if len(groups) > 1:
            stat, p = levene(*groups, center='median')
            levene_results.append({
                'Factor': factor,
                'Levene_Statistic': stat,
                'Levene_pvalue': p,
                'Equal_Variances': 'Yes' if p >= 0.05 else 'No (heteroscedastic)'
            })

    # Create a DataFrame with Levene results
    df_levene = pd.DataFrame(levene_results)

    # --- Attach Levene’s test summary to df_iter ---
    # Option 1: same values repeated for all rows (so can merge easily later)
    for _, row in df_levene.iterrows():
        df_iter[f"Levene_{row['Factor']}_stat"] = row['Levene_Statistic']
        df_iter[f"Levene_{row['Factor']}_p"] = row['Levene_pvalue']
        df_iter[f"Levene_{row['Factor']}_equal_var"] = row['Equal_Variances']


    # --- Four-way robust ANOVA (HC3) including Year ---
    model_it = ols(
        'log_total_power ~ C(Year) * C(District) * C(Season) * C(Q("Installed power range (kW)"))',
        data=df_iter).fit(cov_type='HC3')

    anova_it = sm.stats.anova_lm(model_it, typ=2)

    # Store results
    results.append({

    'Iteration': i,


    'F_Year': anova_it.loc['C(Year)', 'F'],
    'p_Year': anova_it.loc['C(Year)', 'PR(>F)'],

    'F_District': anova_it.loc['C(District)', 'F'],
    'p_District': anova_it.loc['C(District)', 'PR(>F)'],

    'F_Season': anova_it.loc['C(Season)', 'F'],
    'p_Season': anova_it.loc['C(Season)', 'PR(>F)'],

    'F_InstalledPowerRange': anova_it.loc['C(Q("Installed power range (kW)"))', 'F'],
    'p_InstalledPowerRange': anova_it.loc['C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_Year_District': anova_it.loc['C(Year):C(District)', 'F'],
    'p_Year_District': anova_it.loc['C(Year):C(District)', 'PR(>F)'],

    'F_Year_Season': anova_it.loc['C(Year):C(Season)', 'F'],
    'p_Year_Season': anova_it.loc['C(Year):C(Season)', 'PR(>F)'],

    'F_District_Season': anova_it.loc['C(District):C(Season)', 'F'],
    'p_District_Season': anova_it.loc['C(District):C(Season)', 'PR(>F)'],

    'F_Year_InstalledPowerRange': anova_it.loc['C(Year):C(Q("Installed power range (kW)"))', 'F'],
    'p_Year_InstalledPowerRange': anova_it.loc['C(Year):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_District_InstalledPowerRange': anova_it.loc['C(District):C(Q("Installed power range (kW)"))', 'F'],
    'p_District_InstalledPowerRange': anova_it.loc['C(District):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_Season_InstalledPowerRange': anova_it.loc['C(Season):C(Q("Installed power range (kW)"))', 'F'],
    'p_Season_InstalledPowerRange': anova_it.loc['C(Season):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_Year_District_Season': anova_it.loc['C(Year):C(District):C(Season)', 'F'],
    'p_Year_District_Season': anova_it.loc['C(Year):C(District):C(Season)', 'PR(>F)'],

    'F_Year_District_InstalledPowerRange': anova_it.loc['C(Year):C(District):C(Q("Installed power range (kW)"))', 'F'],
    'p_Year_District_InstalledPowerRange': anova_it.loc['C(Year):C(District):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_Year_Season_InstalledPowerRange': anova_it.loc['C(Year):C(Season):C(Q("Installed power range (kW)"))', 'F'],
    'p_Year_Season_InstalledPowerRange': anova_it.loc['C(Year):C(Season):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_District_Season_InstalledPowerRange': anova_it.loc['C(District):C(Season):C(Q("Installed power range (kW)"))', 'F'],
    'p_District_Season_InstalledPowerRange': anova_it.loc['C(District):C(Season):C(Q("Installed power range (kW)"))', 'PR(>F)'],

    'F_Year_District_Season_InstalledPowerRange': anova_it.loc['C(Year):C(District):C(Season):C(Q("Installed power range (kW)"))', 'F'],
    'p_Year_District_Season_InstalledPowerRange': anova_it.loc['C(Year):C(District):C(Season):C(Q("Installed power range (kW)"))', 'PR(>F)']
    })


# --- Final DataFrame of results ---
df_results = pd.DataFrame(results)
pd.reset_option('display.float_format')

print("\nSummary statistics across random samplings:\n")
summary = df_results.agg(['mean','std'])
print(summary)

In [None]:
# --- Transform data to long format including Year, Season, District, InstalledPowerRange ---
df_pval_long = df_results.melt(
    id_vars=['Iteration'],
    value_vars=[
        'p_Year',
        'p_District',
        'p_Season',
        'p_InstalledPowerRange',
        'p_Year_District',
        'p_Year_Season',
        'p_District_Season',
        'p_Year_InstalledPowerRange',
        'p_District_InstalledPowerRange',
        'p_Season_InstalledPowerRange',
        'p_Year_District_Season',
        'p_Year_District_InstalledPowerRange',
        'p_Year_Season_InstalledPowerRange',
        'p_District_Season_InstalledPowerRange',
        'p_Year_District_Season_InstalledPowerRange'
    ],
    var_name='Factor',
    value_name='p_value'
)

# --- Rename factors for readability ---
df_pval_long['Factor'] = df_pval_long['Factor'].replace({
    'p_Year': 'Year',
    'p_District': 'District',
    'p_Season': 'Season',
    'p_InstalledPowerRange': 'Installed power range (kW)',

    'p_Year_District': 'Year × District',
    'p_Year_Season': 'Year × Season',
    'p_District_Season': 'District × Season',

    'p_Year_InstalledPowerRange': 'Year × Installed Power Range',
    'p_District_InstalledPowerRange': 'District × Installed Power Range',
    'p_Season_InstalledPowerRange': 'Season × Installed Power Range',

    'p_Year_District_Season': 'Year × District × Season',
    'p_Year_District_InstalledPowerRange': 'Year × District × Installed Power Range',
    'p_Year_Season_InstalledPowerRange': 'Year × Season × Installed Power Range',
    'p_District_Season_InstalledPowerRange': 'District × Season × Installed Power Range',

    'p_Year_District_Season_InstalledPowerRange': 'Year × District × Season × Installed Power Range'
})

# --- Pivot for heatmap ---
heatmap_data = df_pval_long.pivot(index='Iteration', columns='Factor', values='p_value')

# --- Plot ---
plt.figure(figsize=(20, 9))
sns.heatmap(
    heatmap_data,
    annot=True,         # annotate each cell with its p-value
    fmt=".2e",
    cmap='coolwarm',   # blue = low p, red = high p
    cbar_kws={'label': 'p-value'},
    linewidths=0.5
)

plt.title("ANOVA p-values across Iterations (4-way model)", fontsize=14)
plt.ylabel("Iteration", fontsize=12)
plt.xlabel("Factor", fontsize=12)
plt.yticks(rotation=0)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


The heatmap shows that p-values remain very low across all iterations for Year, District, and Installed Power Range, confirming that their effects on installed capacity are highly significant and stable across random samples. In contrast, Season and its interactions show consistently high p-values, indicating no meaningful seasonal influence. Overall, the visualization highlights the robustness of the ANOVA results and the consistency of these effects under heteroscedasticity.
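A compact numerical complement to the heatmap is the share of iterations in which each term falls below the significance threshold. The sketch below assumes a list of dictionaries shaped like the `results` list built in the sampling loop, with keys prefixed by `p_`; the helper `share_significant` and the toy values are hypothetical.

```python
def share_significant(results, alpha=0.05):
    """Fraction of iterations with p < alpha for every 'p_*' key."""
    p_keys = [k for k in results[0] if k.startswith('p_')]
    n = len(results)
    return {k: sum(r[k] < alpha for r in results) / n for k in p_keys}

# Hypothetical stand-in for the `results` list (four iterations, two terms)
toy_results = [
    {'Iteration': 1, 'p_Year': 1e-15, 'p_Season': 0.41},
    {'Iteration': 2, 'p_Year': 3e-12, 'p_Season': 0.03},
    {'Iteration': 3, 'p_Year': 2e-16, 'p_Season': 0.55},
    {'Iteration': 4, 'p_Year': 8e-14, 'p_Season': 0.29},
]
shares = share_significant(toy_results)
```

A share near 1 indicates an effect that is significant in almost every resample, while a share near the nominal 0.05 is what chance alone would produce for a term with no effect.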

#### **Summary of Findings (RQ3)**


1. **Seasonal Effect:** No significant effect (p ≈ 0.44 in the main sample; mean p ≈ 0.36 across the 20 resamples). Installed capacity remains stable between winter and summer, showing minimal seasonal influence.

2. **District Effect:** Strongly significant (mean p ≈ 2.0e-56). Capacity differs across districts, confirming clear regional disparities in UPAC adoption.

3. **Year Effect:** Highly significant (mean p ≈ 1.4e-16). Installed capacity increased by about 20% from 2023 to 2024, confirming sustained growth.

4. **Power Scale Effect:** Strongly significant (p < 0.001; shown as 0.00 at display precision). Larger UPACs show higher installed power, confirming a consistent scaling pattern.

5. **Interactions:**

    * Year × Power Scale: Significant (p ≈ 5.5e-13). Capacity growth in 2024 concentrated in larger installations.

    * District × Power Scale: Significant (p ≈ 1.1e-29). Scale effects vary by district, showing regional scaling differences.

    * Year × District × Power Scale: Weakly significant (p ≈ 0.024). Growth by scale differs slightly between districts: stronger in Aveiro and Évora, more uniform in Faro, clearer separation in Vila Real.

    * Interactions involving Season: Not significant (p > 0.05). Seasonal patterns remain consistent across years, districts, and scales.

    * Four-way Interaction: Not significant (p ≈ 0.35). No combined seasonal influence detected.

6. **Robustness:** Results are stable across 20 random stratified samples. Significant effects (Year, District, Scale) persist, while seasonal and higher-order terms remain non-significant. The HC3 correction ensured reliability under heteroscedasticity (Levene’s p < 0.001).

**Conclusion:** Season has no measurable impact on installed UPAC capacity. Growth is driven by Year, District, and Scale, not by seasonal variation. Patterns are consistent and robust.

Together with RQ1 and RQ2, these results confirm that total installed power increased from 2023 to 2024, mainly due to scale and regional factors, not seasonal effects.

---
# Phase 5: Formulate Conclusions

The analysis of self-consumption Production Units (UPACs) in Portugal for 2023–2024 provides an integrated view of how region, sector, and power level shape the national self-consumption landscape. Building on the outcomes of the data preparation and exploratory analyses (Phases 3.1–3.2), the dataset’s high granularity and verified integrity enabled consistent modeling across spatial, temporal, and voltage-level dimensions. The results across the three research questions converge on several coherent conclusions.

#### **RQ1:**

The comparison between 2023 and 2024 reveals a clear upward trend in total installed capacity for self-consumption across the selected districts. The results consistently indicate that installations expanded in scale and number, reflecting accelerated adoption and investment in decentralized generation. This growth highlights the maturation of the UPAC sector and points to increasing regional engagement in self-consumption initiatives.

The observed increase is stable across repeated analyses and not driven by sampling variation, confirming that the year-on-year growth represents a genuine structural change rather than a statistical anomaly. These outcomes suggest that 2024 marked a consolidation phase for UPAC development, laying the groundwork for continued expansion in subsequent years.

In summary, RQ1 demonstrates that self-consumption capacity grew significantly between 2023 and 2024, confirming the strong temporal dynamic underlying Portugal’s distributed energy transition. This finding provides a baseline for examining how regional and sectoral factors shaped the pattern and intensity of this growth in the following research questions.

#### **RQ2:**

Regional and power-level factors are the primary drivers of self-consumption patterns in Portugal, while seasonal effects are minimal. Industrial UPACs dominate installed capacity, but their distribution varies markedly across districts, with Évora, Aveiro, and Faro exhibiting distinct adoption patterns. Residential systems, though more numerous, contribute smaller capacities and also show clear regional disparities.

Growth from 2023 to 2024 is consistent across sectors and districts, reflecting nationwide expansion. The robustness of these patterns, confirmed through multiple random stratified samplings, indicates that sectoral and regional differences are stable and not due to sampling variability.

In summary, RQ2 demonstrates that self-consumption patterns are strongly shaped by the interaction of sector type and regional context: industrial capacity is concentrated in specific districts, residential capacity is more widely distributed, and seasonal variation has negligible impact. These results directly illustrate how power level and geography jointly structure Portugal’s self-consumption landscape.

#### **RQ3:**

The analysis reveals that seasonal variation has no significant influence on installed capacity, confirming that self-consumption operates as a stable component of decentralized generation throughout the year. In contrast, strong effects were identified for Year, District, and Installed Power Range, demonstrating that temporal growth, regional disparities, and system scale remain the dominant forces shaping the UPAC landscape.

Significant interactions between Year × Power Scale and District × Power Scale indicate that capacity growth from 2023 to 2024 was driven primarily by larger installations and that the distribution of these scales varies across regions. The weaker three-way interaction (Year × District × Power Scale) further suggests that this scaling dynamic evolved differently among districts, reflecting diverse regional investment trajectories and resource conditions.

In summary, RQ3 shows that Portugal’s self-consumption sector expanded unevenly across regions and installation sizes but remained unaffected by seasonal factors. Growth is primarily structural, linked to geographic and scale dimensions, rather than cyclical. These findings reinforce the temporal and regional patterns identified in RQ1 and RQ2, confirming that the expansion of self-consumption between 2023 and 2024 is both robust and systematically driven by district-level and scale-dependent factors.

### **Overall Interpretation**

Integrating results from data preparation, exploration, and inferential analysis, the Portuguese self-consumption sector exhibits a bifurcated structure: a broad residential base of small BTN systems and a narrow but powerful industrial segment connected via MT and AT levels. Voltage level consistently emerges as a decisive factor shaping both installation count and capacity magnitude. Regional disparities remain evident, mirroring socio-economic and infrastructural contrasts. The minimal seasonal influence confirms that self-consumption, predominantly solar, is a stable, continuous contributor to decentralized generation.

In summary, the 2023–2024 period represents a consolidation phase for self-consumption in Portugal. Household adoption continues to expand, industrial investment is accelerating, and spatial diversity persists as a defining characteristic. By tracing these outcomes back to the structured data preparation and exploratory phases, the analysis provides a comprehensive and empirically grounded picture of Portugal’s evolving, decentralized energy landscape.


---

# Phase 6: Look Back and Ahead


### **Look Back — Reflections and Project Challenges**

Throughout the project, several practical and analytical challenges contributed to shaping a stronger methodological approach. During data preparation, the main focus was on ensuring data consistency and clarity. This included translating column names from Portuguese to English for uniform documentation, checking for missing values, and removing duplicate records to guarantee data integrity. Managing a large dataset of over 120,000 observations also required careful filtering to focus on the 2023–2024 period and retain only the most relevant variables. Another key challenge was addressing the strong skewness of installed power distributions, which led to the application of log-transformations and robust statistical methods such as Welch’s t-tests and HC3-corrected ANOVA. Integrating multiple analytical dimensions (temporal, spatial, and scale-related) demanded consistent validation to ensure coherent results. Overall, these challenges strengthened the project’s analytical quality and provided valuable experience in handling large-scale, heterogeneous energy datasets.

### **Look Ahead — Future Directions and Opportunities**

The findings of this research provide a foundation for expanding the analysis of self-consumption in Portugal. Building on the 2023–2024 results, future work could develop in three complementary directions:


**Temporal Extension and Forecasting**

Extending the dataset beyond 2024 would enable modeling of medium-term growth trends and potential market saturation. Including future policy measures, such as storage incentives, could help assess regulatory effectiveness and forecast adoption trajectories.


**Spatial and Socio-economic Integration**


Combining UPAC data with demographic, income, and land-use indicators could uncover the socio-economic drivers of renewable adoption. Such integration would support evidence-based incentives that address spatial disparities and encourage balanced regional expansion.


**Technical and Environmental Assessment**

Merging capacity data with actual generation outputs and meteorological information would allow evaluation of system performance and carbon-offset potential, linking deployment data to tangible environmental benefits.

From a policy perspective, the findings highlight the importance of continued support for residential-scale adoption, alongside enhanced facilitation of industrial-scale integration. Reducing regional disparities, particularly in interior districts, would further promote territorial equity and accelerate Portugal’s shift toward a decentralized, renewable-based energy system.
