# Total Production Units for Self-Consumption

Master in Data Science and Engineering - FEUP

**Group 4**

Beatriz Iara Nunes Silva

Inês Clotilde da Costa Neves

Mariana Rocha Cristino

Patrícia Crespo da Silva

## Methodology of Statistical Research

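Based on the given dataset, the six-step statistical investigation method is applied. Because the dataset was provided in advance, the first two steps are swapped; the numbers in parentheses give the order followed in this report: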

1. **(2) Ask a research question**
2. **(1) Design a study and collect data**
3. **Explore the data**
4. **Draw inference**
5. **Formulate conclusions**
6. **Look back and ahead**

---


## Phase 1: Study Design and Data Collection

The study starts from a proposed dataset, Total Production Units for Self-Consumption, for which the research questions are then posed. The data was collected and provided by **E-REDES – Distribuição de Eletricidade, S.A.**, the Portuguese electricity distribution company responsible for managing and monitoring electricity networks across Portugal.

**Dataset link:** [Total Production Units for Self-Consumption (e-Redes)](https://e-redes.opendatasoft.com/explore/dataset/8-unidades-de-producao-para-autoconsumo/information/)

---

## Phase 2: Research questions

### General Research Question

RQ: Compare how seasonal (winter vs summer), regional, and technical factors shape self-consumption energy production patterns in Portugal between 2023 and 2024.

### Specific Research Questions

• RQ1: Compare the average installed capacity per UPAC across different power levels and municipalities in 2023 and 2024.

• RQ2: Compare the evolution of installed capacity between 2023 and 2024 across residential and industrial UPACs to assess differences in growth patterns.

• RQ3: Compare the total installed capacity for self-consumption across different power scales (installed capacity ranges) and seasons (winter vs. summer) in selected Portuguese districts during 2023 and 2024.
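
RQ3 contrasts winter and summer, while the dataset is indexed by quarter labels such as `2023T1`. The sketch below shows one possible way to derive a season from such a label. It assumes that the first quarter stands in for winter and the third for summer; that choice, the `SEASON_BY_QUARTER` mapping and the `quarter_to_season` helper are illustrative assumptions, not something defined by the dataset.

```python
import pandas as pd

# Assumption: T1 (Jan-Mar) represents winter and T3 (Jul-Sep) represents summer.
# T2 and T4 are treated as transition quarters and left unmapped.
SEASON_BY_QUARTER = {"T1": "winter", "T3": "summer"}


def quarter_to_season(quarter_label):
    """Map a label such as '2023T1' to 'winter' or 'summer'; other quarters give None."""
    return SEASON_BY_QUARTER.get(quarter_label[-2:])


# Toy data using the same label pattern as the 'Quarter' column
sample = pd.DataFrame({"Quarter": ["2023T1", "2023T2", "2023T3", "2024T4"]})
sample["Season"] = sample["Quarter"].map(quarter_to_season)

# Keep only the rows that belong to one of the two seasons being compared
seasonal = sample.dropna(subset=["Season"])
print(seasonal)
```

A different season definition would only require changing the mapping dictionary.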

---


### Imports

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


## Phase 3: Exploratory Data Analysis

Read the CSV

In [2]:
df = pd.read_csv('../Data/UPAC_Total_Production.csv', sep=';', decimal='.')

### 3.1. Initial Data Overview

First rows of the dataset:

In [3]:
print("First rows of the dataset:")
display(df.head())

First rows of the dataset:


|   | Quarter | District | Municipality | Parish | Zip Code | Technology Type | Voltage level | Installed power range (kW) | Number of installations | Total installed power (kW) | DistrictCode | Municipality Code | DistrictMunicipalityParishCode | CPEs (#) | relacao_instalacoes_por_cpe | relacao_potencia_por_cpe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023T1 | Coimbra | Condeixa-a-Nova | Furadouro | 3150 | Solar | BTN | ]0, 4] | 2 | 3.0 | 6 | 604 | 60407 | 9537.0 | 0.00021 | 0.000315 |
| 1 | 2023T1 | Coimbra | Condeixa-a-Nova | Zambujal | 3150 | Solar | BTN | ]0, 4] | 2 | 4.32 | 6 | 604 | 60410 | 9537.0 | 0.00021 | 0.000453 |
| 2 | 2023T1 | Coimbra | Condeixa-a-Nova | Condeixa-a-Velha e Condeixa-a-Nova | 3150 | Não Atribuído | BTN | ]0, 4] | 1 | 1.05 | 6 | 604 | 60411 | 9537.0 | 0.000105 | 0.00011 |
| 3 | 2023T1 | Coimbra | Condeixa-a-Nova | Vila Seca e Bem da Fé | 3150 | Solar | BTN | ]0, 4] | 17 | 28.14 | 6 | 604 | 60413 | 9537.0 | 0.001783 | 0.002951 |
| 4 | 2023T1 | Coimbra | Figueira da Foz | São Pedro | 3090 | Não Atribuído | BTN | ]0, 4] | 2 | 3.28 | 6 | 605 | 60514 | 50436.0 | 4e-05 | 6.5e-05 |


Dataset info:

In [4]:
print("\nDataset info:")
print(df.info())


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121294 entries, 0 to 121293
Data columns (total 16 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Quarter                         121294 non-null  object 
 1   District                        121294 non-null  object 
 2   Municipality                    121294 non-null  object 
 3   Parish                          121294 non-null  object 
 4   Zip Code                        121294 non-null  int64  
 5   Technology Type                 121283 non-null  object 
 6   Voltage level                   121292 non-null  object 
 7   Installed power range (kW)      121294 non-null  object 
 8   Number of installations         121294 non-null  int64  
 9   Total installed power (kW)      121294 non-null  float64
 10  DistrictCode                    121294 non-null  int64  
 11  Municipality Code               121294 non-null  int64  
 12  D

In [5]:
# Keep only the years 2023 and 2024
df = df[df['Quarter'].str.startswith(('2023', '2024'))].copy()
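
The prefix filter on `Quarter` works because labels follow the pattern seen in the first rows (`2023T1`). As a hedged alternative, the label can be split into numeric year and quarter columns, which may make later year-over-year comparisons easier. The column names `Year` and `QuarterNumber` and the toy data are assumptions made for this sketch.

```python
import pandas as pd

# Toy data following the 'YYYYTn' label pattern of the 'Quarter' column
sample = pd.DataFrame({"Quarter": ["2023T1", "2023T4", "2024T2", "2022T3"]})

# Named groups become the column names of the extracted frame
parts = sample["Quarter"].str.extract(r"^(?P<Year>\d{4})T(?P<QuarterNumber>[1-4])$")
sample["Year"] = parts["Year"].astype(int)
sample["QuarterNumber"] = parts["QuarterNumber"].astype(int)

# Equivalent to the prefix filter, but expressed on the numeric year
filtered = sample[sample["Year"].isin([2023, 2024])].copy()
print(filtered)
```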


Missing Values summary:

In [6]:
missing_df = pd.DataFrame({
    'Missing Values': df.isnull().sum(),
    'Percentage': (df.isnull().sum() / len(df)) * 100
})
print("\nMissing Values summary:")
display(missing_df[missing_df['Missing Values'] > 0])


Missing Values summary:


|   | Missing Values | Percentage |
|---|---|---|
| Voltage level | 2 | 0.002259 |
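
Only two records lack a voltage level, so either of the two common treatments would barely change the totals. The sketch below illustrates both on toy data: dropping the incomplete rows, or keeping them under an explicit label. The label `"Unknown"` and the sample values (including the voltage levels other than `BTN`) are assumptions for illustration.

```python
import pandas as pd

# Toy data with one missing voltage level
sample = pd.DataFrame({
    "Voltage level": ["BTN", None, "BTE", "MT"],
    "Total installed power (kW)": [3.0, 4.32, 50.0, 500.0],
})

# Option A: drop the rows that have no voltage level
dropped = sample.dropna(subset=["Voltage level"])

# Option B: keep every row and make the gap visible as its own category
labelled = sample.copy()
labelled["Voltage level"] = labelled["Voltage level"].fillna("Unknown")

print(len(dropped), labelled["Voltage level"].tolist())
```

Option B keeps the installed power of those rows in district-level totals, which may matter if voltage level is not the variable under study.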


Summary statistics:

In [7]:
print("\nSummary statistics:")
display(df.describe())


Summary statistics:


|   | Zip Code | Number of installations | Total installed power (kW) | DistrictCode | Municipality Code | CPEs (#) | relacao_instalacoes_por_cpe | relacao_potencia_por_cpe |
|---|---|---|---|---|---|---|---|---|
| count | 88545.0 | 88545.0 | 88545.0 | 88545.0 | 88545.0 | 88545.0 | 88545.0 | 88545.0 |
| mean | 4510.469682 | 17.478085 | 122.100283 | 9.463109 | 954.837732 | 45250.293229 | 0.000824 | 0.005884 |
| std | 1662.939788 | 58.107265 | 383.688157 | 5.343929 | 535.65143 | 60362.537318 | 0.002352 | 0.031134 |
| min | 1000.0 | 1.0 | 0.0 | 1.0 | 101.0 | 1260.0 | 3e-06 | 0.0 |
| 25% | 3105.0 | 1.0 | 14.04 | 4.0 | 406.0 | 10911.0 | 3.6e-05 | 0.000469 |
| 50% | 4600.0 | 2.0 | 30.0 | 10.0 | 1012.0 | 27430.0 | 0.000116 | 0.001443 |
| 75% | 5160.0 | 8.0 | 82.79 | 14.0 | 1401.0 | 57414.0 | 0.000492 | 0.00407 |
| max | 8970.0 | 2118.0 | 19600.0 | 18.0 | 1824.0 | 399456.0 | 0.052016 | 3.203661 |


Number of installations by District:

In [8]:
print("\nNumber of installations by District:")
print(df.groupby('District')['Number of installations'].sum())


Number of installations by District:
District
Aveiro              128364
Beja                 24050
Braga               192514
Bragança             22625
Castelo Branco       35890
Coimbra              82187
Faro                 83619
Guarda               24379
Leiria              102699
Lisboa              182122
Portalegre           17252
Porto               217210
Santarém             98323
Setúbal             149716
Viana do Castelo     43599
Vila Real            38925
Viseu                75512
Évora                28611
Name: Number of installations, dtype: int64


Total installed power (kW) by District:

In [9]:
print("\nTotal installed power (kW) by District:")
print(df.groupby('District')['Total installed power (kW)'].sum())


Total installed power (kW) by District:
District
Aveiro              1338245.52
Beja                 303260.40
Braga               1279839.54
Bragança             106448.67
Castelo Branco       286428.14
Coimbra              555764.84
Faro                 500667.92
Guarda               143244.38
Leiria               951815.03
Lisboa              1237281.72
Portalegre           125099.26
Porto               1570484.48
Santarém             688125.72
Setúbal              629307.56
Viana do Castelo     247010.65
Vila Real            147143.12
Viseu                476273.39
Évora                224929.23
Name: Total installed power (kW), dtype: float64
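
RQ1 asks about the average installed capacity per UPAC, which can be approximated by dividing total installed power by the number of installations. The sketch below does this for three districts, using the totals printed in the two outputs above. Note that these totals are summed over every quarter of 2023 and 2024, so the ratio is an average over the whole period; the column name `kW per installation` is an assumption of this sketch.

```python
import pandas as pd

# District totals copied from the two groupby outputs (all 2023-2024 quarters summed)
totals = pd.DataFrame(
    {
        "Number of installations": [128364, 24050, 217210],
        "Total installed power (kW)": [1338245.52, 303260.40, 1570484.48],
    },
    index=pd.Index(["Aveiro", "Beja", "Porto"], name="District"),
)

# Average installed capacity per installation in each district
totals["kW per installation"] = (
    totals["Total installed power (kW)"] / totals["Number of installations"]
)

# Districts ordered from largest to smallest average capacity
ranking = totals["kW per installation"].sort_values(ascending=False)
print(ranking.round(2))
```

Among these three districts, the one with the most installations (Porto) has the smallest average capacity, which suggests that counts and installed power should be analysed separately.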


Number of installations by Technology Type:

In [10]:
print("\nNumber of installations by Technology Type:")
print(df.groupby('Technology Type')['Number of installations'].sum())


Number of installations by Technology Type:
Technology Type
Biogás                          31
Biomassa                         5
Cogeração não renovável          8
Eólica                         100
Fotovoltaica                     4
Hídrica                         10
Não Atribuído                28406
Solar                      1519033
Name: Number of installations, dtype: int64
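
The output lists both `Fotovoltaica` and `Solar`, which may describe the same technology under two labels. If that reading is correct (an assumption that should be checked against the dataset documentation), the labels could be merged before any comparison by technology. The `LABEL_MAP` dictionary below is hypothetical; the counts are copied from the output above.

```python
import pandas as pd

# Subset of the counts printed above
counts = pd.Series(
    {"Eólica": 100, "Fotovoltaica": 4, "Não Atribuído": 28406, "Solar": 1519033},
    name="Number of installations",
)

# Hypothetical mapping: treat 'Fotovoltaica' as another spelling of 'Solar'
LABEL_MAP = {"Fotovoltaica": "Solar"}

# Relabel the index, then add up entries that now share a label
merged = counts.rename(index=lambda label: LABEL_MAP.get(label, label)).groupby(level=0).sum()
print(merged)
```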


### 3.2 Data Visualization

1) Histogram of Total Installed Power (kW)
2) Violin Plot of Total Installed Power by Voltage Level
3) Boxplot of Total Installed Power by Technology Type
4) Countplot of Installed Power Ranges
5) Barplot of Number of Installations per District
6) Stacked Barplot of Technology Type Counts
(in progress)

#### 3.2.1 Histogram of Total Installed Power (kW)

In [11]:
def plot_power_histogram(data, title):
    """Show a 50-bin histogram of total installed power with a marginal box plot."""
    fig = px.histogram(
        data,
        x='Total installed power (kW)',
        nbins=50,
        color_discrete_sequence=['#0083B8'],
        marginal='box',
        title=title,
    )
    fig.update_layout(
        template='plotly_white',
        xaxis_title='Total Installed Power (kW)',
        yaxis_title='Count',
        title_x=0.5,
        bargap=0.05,
    )
    fig.show()
    return fig


power = df['Total installed power (kW)']

# Three separate bands, so the long right tail does not flatten the bulk of the data
# Histogram for Total Installed Power <= 500 kW
fig_low = plot_power_histogram(
    df[power <= 500],
    'Distribution of Total Installed Power (≤500 kW)',
)

# Histogram for Total Installed Power > 500 kW and <= 1000 kW
fig_medium = plot_power_histogram(
    df[(power > 500) & (power <= 1000)],
    'Distribution of Total Installed Power (500 < Power ≤ 1000 kW)',
)

# Histogram for Total Installed Power > 1000 kW
fig_high = plot_power_histogram(
    df[power > 1000],
    'Distribution of Total Installed Power (>1000 kW)',
)
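
The three histograms split the data at 500 kW and 1000 kW. To see how many records fall in each band before plotting, the same cut points can be applied with `pd.cut`. This is a minimal sketch on toy values; the band labels are assumptions chosen for this example.

```python
import pandas as pd

# Toy values, including the boundary cases 500 and 1000
power = pd.Series(
    [0.0, 3.0, 30.0, 500.0, 500.01, 1000.0, 19600.0],
    name="Total installed power (kW)",
)

# Right-closed bins reproduce the histogram filters: <=500, (500, 1000], >1000
bands = pd.cut(
    power,
    bins=[float("-inf"), 500, 1000, float("inf")],
    labels=["<=500 kW", "500-1000 kW", ">1000 kW"],
)

# sort=False keeps the bands in their natural order instead of ordering by count
band_counts = bands.value_counts(sort=False)
print(band_counts)
```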

#### 3.2.2 Violin Plot of Total Installed Power by Voltage Level

In [12]:
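# Drop missing values so that no empty NaN panel is created,
# and convert to a plain list for make_subplots
voltage_levels = df['Voltage level'].dropna().unique().tolist()

fig = make_subplots(
    rows=1, cols=len(voltage_levels),
    shared_yaxes=False,
    subplot_titles=voltage_levels
)

# Note: these are box traces with the box hidden (zero line width, transparent fill),
# so only the individual points are drawn rather than a violin shape
for i, voltage in enumerate(voltage_levels, start=1):
    df_voltage = df[df['Voltage level'] == voltage]
    fig.add_trace(
        go.Box(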
            y=df_voltage['Total installed power (kW)'],
            name=voltage,
            boxpoints='all',
            marker_color=px.colors.qualitative.Vivid[i % len(px.colors.qualitative.Vivid)],
            line=dict(width=0),
            fillcolor='rgba(0,0,0,0)'
        ),
        row=1, col=i
    )

fig.update_layout(
    template='plotly_white',
    title='Total Installed Power by Voltage Level (Points Only)',
    title_x=0.5,
    showlegend=False,
    height=500,
    width=300*len(voltage_levels)
)

fig.show()

#### 3.2.3 Boxplot of Total Installed Power by Technology Type

In [13]:
fig3 = px.box(
    df,
    x='Technology Type',
    y='Total installed power (kW)',
    color='Technology Type',
    color_discrete_sequence=px.colors.qualitative.Safe,
    title='Boxplot: Total Installed Power by Technology Type',
)
fig3.update_layout(
    template='plotly_white',
    xaxis_title='Technology Type',
    yaxis_title='Total Installed Power (kW)',
    title_x=0.5,
)
fig3.show()

#### 3.2.4 Countplot of Installed Power Ranges

In [14]:
power_range_counts = df['Installed power range (kW)'].value_counts().reset_index()
power_range_counts.columns = ['Installed power range (kW)', 'Count']

fig4 = px.bar(
    power_range_counts,
    x='Installed power range (kW)',
    y='Count',
    color='Installed power range (kW)',
    title='Countplot of Installed Power Ranges',
    color_discrete_sequence=px.colors.qualitative.Bold
)
fig4.update_layout(
    template='plotly_white',
    xaxis_title='Installed Power Range (kW)',
    yaxis_title='Number of Records',
    title_x=0.5
)
fig4.show()
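
`value_counts()` orders the power ranges by frequency, so the bars do not necessarily follow the capacity scale. One way to restore that order is to sort the interval labels by their lower bound. The sketch below assumes that every label follows the `]0, 4]` pattern seen in the first rows; the other two labels and the `lower_bound` helper are hypothetical.

```python
import re


def lower_bound(range_label):
    """Return the lower bound of a label such as ']0, 4]' as a float.

    Labels that do not match the expected pattern are sorted last.
    """
    match = re.match(r"\s*[\]\[(]\s*([-+]?\d+(?:\.\d+)?)", range_label)
    if match is None:
        return float("inf")
    return float(match.group(1))


# ']0, 4]' comes from the dataset; the other labels are made up for this example
labels = ["]20, 100]", "]0, 4]", "]4, 20]"]
ordered = sorted(labels, key=lower_bound)
print(ordered)
```

The resulting list could be passed to `px.bar` through `category_orders={'Installed power range (kW)': ordered}` so that the bars follow the capacity scale.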

#### 3.2.5 Barplot of Number of Installations per District

In [15]:
installs_per_district = (
    df.groupby('District', as_index=False)['Number of installations'].sum()
    .sort_values(by='Number of installations', ascending=False)
)

fig5 = px.bar(
    installs_per_district,
    x='District',
    y='Number of installations',
    color='District',
    color_discrete_sequence=px.colors.qualitative.Safe,
    title='Number of Installations per District',
)
fig5.update_layout(
    template='plotly_white',
    xaxis_title='District',
    yaxis_title='Total Number of Installations',
    title_x=0.5,
    showlegend=False,
)
fig5.show()

#### 3.2.6 Stacked Barplot of Technology Type Counts per Voltage Level

In [16]:
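# Count dataset records (rows) per voltage level and technology type;
# a single record can cover several installations
tech_counts = (
    df.groupby(['Voltage level', 'Technology Type'])
    .size()
    .reset_index(name='Count')
)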

# Separate dominant techs
dominant_techs = ['Solar', 'Não Atribuído']
dominant_df = tech_counts[tech_counts['Technology Type'].isin(dominant_techs)].copy()
others_df = tech_counts[~tech_counts['Technology Type'].isin(dominant_techs)].copy()

num_levels = df['Voltage level'].nunique()
fig_height = max(400, num_levels * 60)

fig6 = make_subplots(
    rows=1, cols=2,
    shared_yaxes=True,
    column_widths=[0.60, 0.40],
    horizontal_spacing=0.02,
    subplot_titles=("Solar & Não Atribuído", "Other Technologies")
)

for tech in dominant_techs:
    data = dominant_df[dominant_df['Technology Type']==tech]
    fig6.add_trace(
        go.Bar(
            y=data['Voltage level'],
            x=data['Count'],
            name=tech,
            text=data['Count'],
            orientation='h',
        ),
        row=1, col=1
    )

for tech in others_df['Technology Type'].unique():
    data = others_df[others_df['Technology Type']==tech]
    fig6.add_trace(
        go.Bar(
            y=data['Voltage level'],
            x=data['Count'],
            name=tech,
            text=data['Count'],
            orientation='h',
        ),
        row=1, col=2
    )

fig6.update_layout(
    template='plotly_white',
    title="Records by Voltage Level and Technology Type",
    title_x=0.5,
    barmode='stack',
    showlegend=True,
    width=1200,
    height=fig_height,
    margin=dict(l=100, r=50, t=100, b=50),
)
fig6.update_traces(textposition='inside')
fig6.show()
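
The chart above is built with `.size()`, which counts rows of the dataset. Since each row carries its own `Number of installations`, the number of installations per group is obtained by summing that column instead. The sketch below contrasts the two on toy data; the output column names `Records` and `Installations` are assumptions of this example.

```python
import pandas as pd

# Toy data: three BTN records and one MT record
sample = pd.DataFrame({
    "Voltage level": ["BTN", "BTN", "BTN", "MT"],
    "Technology Type": ["Solar", "Solar", "Eólica", "Solar"],
    "Number of installations": [2, 17, 1, 3],
})

keys = ["Voltage level", "Technology Type"]

# 'size' counts rows per group, 'sum' adds up the installations those rows represent
summary = (
    sample.groupby(keys)
    .agg(
        Records=("Number of installations", "size"),
        Installations=("Number of installations", "sum"),
    )
    .reset_index()
)
print(summary)
```

In this toy example the BTN/Solar group has 2 records but 19 installations, which shows how far the two measures can diverge.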


## <a id="phase4"></a>Phase 4: Draw inference
## <a id="phase5"></a>Phase 5: Formulate conclusions
## <a id="phase6"></a>Phase 6: Look back and ahead
