# Session 18: Pandas practice

Power market data from Spain.

## Data description

- **datetime_utc**: The date and time in UTC format.
- **spot_price**: The spot price of electricity at the given datetime in euros per megawatt-hour (€/MWh).
- **gen_ccgt**: Generation from Combined Cycle Gas Turbine (CCGT) plants in megawatts (MW).
- **gen_coal**: Generation from coal-fired power plants in megawatts (MW).
- **gen_hydro**: Generation from hydroelectric power plants in megawatts (MW).
- **gen_nuclear**: Generation from nuclear power plants in megawatts (MW).
- **gen_solar_pv**: Generation from solar photovoltaic (PV) power plants in megawatts (MW).
- **gen_solar_th**: Generation from solar thermal power plants in megawatts (MW).
- **gen_total**: Total electricity generation from all sources in megawatts (MW).
- **gen_wind**: Generation from wind power plants in megawatts (MW).
- **demand_total**: Total electricity demand in megawatts (MW).
- **year**: The year of the datetime.
- **month**: The month of the datetime.
- **day**: The day of the datetime.
- **hour**: The hour of the datetime.
- **weekday**: The day of the week (0 = Monday, 1 = Tuesday, ..., 6 = Sunday).
- **is_weekend**: A binary indicator of whether the date is a weekend (1 = Yes, 0 = No).
- **is_holiday**: A binary indicator of whether the date is a holiday (True = Yes, False = No).

In [44]:
import pandas as pd

energy = pd.read_csv('../data/df_final.csv')

energy.head()

| | datetime_utc | spot_price | gen_ccgt | gen_coal | gen_hydro | gen_nuclear | gen_solar_pv | gen_solar_th | gen_total | gen_wind | demand_total | year | month | day | hour | weekday | is_weekend | is_holiday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-10-31 23:00:00+00:00 | 16.75 | NaN | 292.0 | 5008.4 | 4304.5 | 51.4 | 194.1 | 23136.4 | 9951.7 | 21757.3 | 2023 | 10 | 31 | 23 | 1 | 0 | False |
| 1 | 2023-11-01 00:00:00+00:00 | 12.52 | NaN | 291.0 | 3989.8 | 4305.5 | 51.4 | 194.1 | 21856.6 | 10426.5 | 20661.0 | 2023 | 11 | 1 | 0 | 2 | 0 | True |
| 2 | 2023-11-01 01:00:00+00:00 | 4.99 | NaN | 180.0 | 3709.6 | 4306.5 | 51.2 | 194.1 | 21339.8 | 10832.7 | 19951.5 | 2023 | 11 | 1 | 1 | 2 | 0 | True |
| 3 | 2023-11-01 02:00:00+00:00 | 4.3 | NaN | NaN | 3433.1 | 4307.5 | 51.2 | 194.1 | 20789.1 | 10736.3 | 19426.9 | 2023 | 11 | 1 | 2 | 2 | 0 | True |
| 4 | 2023-11-01 03:00:00+00:00 | 4.3 | NaN | NaN | 3286.3 | 4307.5 | 51.2 | 194.1 | 20574.3 | 10669.2 | 19100.8 | 2023 | 11 | 1 | 3 | 2 | 0 | True |


## Exercise 1

What's the initial and final datetime in the dataset?

In [45]:
min_dt = energy['datetime_utc'].min()
max_dt = energy['datetime_utc'].max()

print(min_dt, max_dt)

2022-12-31 23:00:00+00:00 2024-11-29 23:00:00+00:00


In [46]:
energy['datetime_utc'].agg(['min', 'max'])

min    2022-12-31 23:00:00+00:00
max    2024-11-29 23:00:00+00:00
Name: datetime_utc, dtype: object
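
The `dtype: object` above shows that `datetime_utc` is still stored as text. Taking `min` and `max` of the strings works here only because ISO 8601 timestamps sort alphabetically in chronological order. A minimal sketch of parsing the column first, using a small made-up frame called `sample` (an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the `datetime_utc` column (unsorted on purpose)
sample = pd.DataFrame({
    "datetime_utc": [
        "2023-10-31 23:00:00+00:00",
        "2022-12-31 23:00:00+00:00",
        "2024-11-29 23:00:00+00:00",
    ]
})

# Parse the strings into timezone-aware timestamps
sample["datetime_utc"] = pd.to_datetime(sample["datetime_utc"], utc=True)

first_dt = sample["datetime_utc"].min()
last_dt = sample["datetime_utc"].max()

print(first_dt, last_dt)
print(last_dt - first_dt)  # a Timedelta, only available with real timestamps
```

With real timestamps the result no longer depends on the string format, and date arithmetic such as the time span of the dataset becomes available.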

## Exercise 2

What's the average spot price in the dataset? And the median?

In [47]:
energy['spot_price'].agg(['mean', 'median'])

mean      73.292958
median    80.100000
Name: spot_price, dtype: float64

## Exercise 2.5

What's the cause of NaN values in the dataset?

In [48]:
energy.isna().mean()

datetime_utc    0.000000
spot_price      0.000000
gen_ccgt        0.477692
gen_coal        0.845379
gen_hydro       0.000000
gen_nuclear     0.000000
gen_solar_pv    0.000000
gen_solar_th    0.015425
gen_total       0.000000
gen_wind        0.000000
demand_total    0.000000
year            0.000000
month           0.000000
day             0.000000
hour            0.000000
weekday         0.000000
is_weekend      0.000000
is_holiday      0.000000
dtype: float64
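
The shares above show where values are missing, but not why. One hypothesis (an assumption, not something the data description confirms) is that NaN marks hours in which a technology reported no generation. A quick way to probe this is to check whether the columns with NaNs ever contain an explicit zero, and how the missing share changes over time. The frame `sample` below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: `gen_coal` has gaps and never reports an explicit zero
sample = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "gen_coal": [292.0, 180.0, np.nan, np.nan],
    "gen_hydro": [5008.4, 0.0, 3709.6, 3433.1],
})

gen_cols = ["gen_coal", "gen_hydro"]

# Per column: share of NaNs, number of explicit zeros, smallest reported value
summary = pd.DataFrame({
    "nan_share": sample[gen_cols].isna().mean(),
    "zero_count": (sample[gen_cols] == 0).sum(),
    "min_value": sample[gen_cols].min(),
})
print(summary)

# Share of missing values per year, to see whether the gaps change over time
nan_by_year = sample["gen_coal"].isna().groupby(sample["year"]).mean()
print(nan_by_year)
```

If a column with many NaNs never contains a zero, that is consistent with (but does not prove) NaN standing in for zero output.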

## Exercise 3

What's the yearly evolution of the average spot price? And the monthly evolution?

In [49]:
energy.groupby("year")['spot_price'].mean()

year
2022     0.000000
2023    87.059305
2024    58.303823
Name: spot_price, dtype: float64

In [50]:
# zero-pad the month so that the string keys sort chronologically
energy["year_month"] = energy['year'].astype(str) + "_" + energy['month'].astype(str).str.zfill(2)

energy.groupby("year_month")['spot_price'].mean()

year_month
2022_12      0.000000
2023_01     67.899667
2023_02    133.520586
2023_03     91.919390
2023_04     74.160818
2023_05     73.733051
2023_06     92.965423
2023_07     90.525160
2023_08     95.642455
2023_09    103.339110
2023_10     89.299320
2023_11     62.393788
2023_12     73.440680
2024_01     73.769667
2024_02     41.117385
2024_03     20.871704
2024_04     12.486341
2024_05     31.207018
2024_06     56.273056
2024_07     71.397878
2024_08     90.301262
2024_09     72.022023
2024_10     67.419320
2024_11    103.826509
Name: spot_price, dtype: float64
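
As an alternative to a string key, the month can be represented as a pandas `Period`, which sorts chronologically regardless of how it is printed. A minimal sketch on a made-up frame `sample` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "datetime_utc": [
        "2023-10-05 10:00:00+00:00",
        "2023-02-01 10:00:00+00:00",
        "2023-02-02 10:00:00+00:00",
        "2024-01-15 10:00:00+00:00",
    ],
    "spot_price": [90.0, 130.0, 140.0, 70.0],
})

# Parse to UTC timestamps, drop the timezone, and keep only year and month
timestamps = pd.to_datetime(sample["datetime_utc"], utc=True)
sample["period"] = timestamps.dt.tz_localize(None).dt.to_period("M")

monthly = sample.groupby("period")["spot_price"].mean()
print(monthly)
```

The timezone is removed before calling `to_period` because periods do not carry timezone information.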

In [51]:
energy.groupby(['month', 'year'])[['spot_price']].mean()

| month | year | spot_price |
|---|---|---|
| 1 | 2023 | 67.899667 |
| 1 | 2024 | 73.769667 |
| 2 | 2023 | 133.520586 |
| 2 | 2024 | 41.117385 |
| 3 | 2023 | 91.91939 |
| 3 | 2024 | 20.871704 |
| 4 | 2023 | 74.160818 |
| 4 | 2024 | 12.486341 |
| 5 | 2023 | 73.733051 |
| 5 | 2024 | 31.207018 |


## Exercise 4

Calculate the gap between the `demand_total` and the `gen_total` for each row. What's the average gap?

In [59]:
# keeping the sign of the gap
energy['gap'] = energy['demand_total'] - energy['gen_total']

# taking the absolute value of the gap
energy['gap_abs'] = energy['gap'].abs()

In [60]:
gap_mean = energy['gap'].mean()
gap_abs_mean = energy['gap_abs'].mean()

print(gap_mean, gap_abs_mean)

-2089.132866273353 2852.8718043264503


In [68]:
import plotly.graph_objects as go

# histogram of the gap and gap_abs in the same plot
# use opacity to make the overlaid bars transparent

fig = go.Figure()

fig.add_trace(go.Histogram(x=energy['gap'], name='gap', opacity=0.75))
fig.add_trace(go.Histogram(x=energy['gap_abs'], name='gap_abs', opacity=0.75))

fig.update_layout(barmode='overlay')

## Exercise 5

What's the correlation between the spot price and the total generation? And the demand? And the gap?

In [58]:
df_corr = energy[['spot_price', 'gen_total', 'demand_total', 'gap']].corr()

corr_price_generation = df_corr.loc['spot_price', 'gen_total']
corr_price_demand = df_corr.loc['spot_price', 'demand_total']
corr_price_gap = df_corr.loc['spot_price', 'gap']

print(corr_price_generation, corr_price_demand, corr_price_gap)

-0.19883233445781393 0.19075136327560063 0.5966653765779915


In [None]:
# remove `datetime_utc` column because it is not numerical and cannot be used in correlation
energy[[col for col in energy.columns if col != 'datetime_utc']].corr()

| | spot_price | gen_ccgt | gen_coal | gen_hydro | gen_nuclear | gen_solar_pv | gen_solar_th | gen_total | gen_wind | demand_total | year | month | day | hour | weekday | is_weekend | is_holiday | year_month | gap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spot_price | 1.0 | 0.525182 | 0.472086 | 0.080234 | 0.461947 | -0.339798 | -0.223849 | -0.198832 | -0.345517 | 0.190751 | -0.321035 | 0.149222 | 0.009978 | 0.1213045 | -0.177696 | -0.205839 | -0.047603 | 0.06985002 | 0.596665 |
| gen_ccgt | 0.525182 | 1.0 | 0.283612 | -0.049059 | 0.176534 | -0.185872 | -0.116084 | 0.030451 | -0.373806 | 0.177118 | -0.284001 | 0.035727 | -0.02321 | 0.0342839 | -0.127502 | -0.204772 | -0.043282 | -0.09963389 | 0.222791 |
| gen_coal | 0.472086 | 0.283612 | 1.0 | 0.236793 | 0.115786 | 0.187118 | -0.109379 | 0.35523 | -0.209502 | 0.491688 | -0.204944 | -0.104941 | -0.051996 | 0.171189 | -0.178943 | -0.18611 | -0.049482 | 0.1108229 | 0.14318 |
| gen_hydro | 0.080234 | -0.049059 | 0.236793 | 1.0 | -0.180258 | -0.413091 | -0.390626 | -0.077831 | 0.069065 | 0.151421 | 0.277962 | -0.220288 | 0.004131 | 0.2018657 | -0.103883 | -0.131973 | -0.006745 | 0.1179248 | 0.334131 |
| gen_nuclear | 0.461947 | 0.176534 | 0.115786 | -0.180258 | 1.0 | -0.091144 | -0.050308 | 0.119518 | -0.086043 | 0.251486 | -0.271783 | 0.02278 | 0.021734 | 0.01177578 | -0.045536 | -0.065998 | -0.006222 | -0.07000052 | 0.121113 |
| gen_solar_pv | -0.339798 | -0.185872 | 0.187118 | -0.413091 | -0.091144 | 1.0 | 0.787957 | 0.680565 | -0.292676 | 0.440687 | 0.089418 | 0.031182 | -0.001919 | 0.0264038 | -0.017689 | -0.019719 | -0.025482 | -0.1228406 | -0.609322 |
| gen_solar_th | -0.223849 | -0.116084 | -0.109379 | -0.390626 | -0.050308 | 0.787957 | 1.0 | 0.536712 | -0.312395 | 0.35597 | 0.064251 | 0.048379 | -0.008324 | 0.1944609 | -0.030365 | -0.02219 | -0.039189 | -0.239639 | -0.470483 |
| gen_total | -0.198832 | 0.030451 | 0.35523 | -0.077831 | 0.119518 | 0.680565 | 0.536712 | 1.0 | 0.254945 | 0.82229 | 0.00212 | -0.124684 | 0.005271 | 0.2937035 | -0.192855 | -0.259424 | -0.08196 | -0.1149596 | -0.666328 |
| gen_wind | -0.345517 | -0.373806 | -0.209502 | 0.069065 | -0.086043 | -0.292676 | -0.312395 | 0.254945 | 1.0 | 0.061869 | -0.011355 | -0.105972 | -0.00131 | 0.1056454 | -0.031733 | -0.061319 | -0.014261 | 0.1121411 | -0.363501 |
| demand_total | 0.190751 | 0.177118 | 0.491688 | 0.151421 | 0.251486 | 0.440687 | 0.35597 | 0.82229 | 0.061869 | 1.0 | 0.005682 | -0.067195 | 0.021684 | 0.3680401 | -0.325019 | -0.4174 | -0.144117 | -0.07505944 | -0.123584 |
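
Instead of listing the columns to exclude by hand, `DataFrame.corr` accepts `numeric_only=True` (available in pandas 1.5 and later), which skips every non-numeric column at once. A minimal sketch on a made-up frame `sample`, which is an assumption for illustration:

```python
import pandas as pd

# Hypothetical sample mixing text and numeric columns
sample = pd.DataFrame({
    "datetime_utc": [
        "2023-11-01 00:00:00+00:00",
        "2023-11-01 01:00:00+00:00",
        "2023-11-01 02:00:00+00:00",
        "2023-11-01 03:00:00+00:00",
    ],
    "spot_price": [12.52, 4.99, 4.30, 4.30],
    "demand_total": [20661.0, 19951.5, 19426.9, 19100.8],
    "season": ["fall", "fall", "fall", "fall"],
})

# Text columns such as `datetime_utc` and `season` are left out automatically
corr_matrix = sample.corr(numeric_only=True)
print(corr_matrix)
```

This keeps working when further text columns are added to the frame later on.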


## Exercise 6

On average, in Spain, is the spot price higher during the weekends?

In [55]:
energy.groupby('is_weekend')[['spot_price']].mean()

| is_weekend | spot_price |
|---|---|
| 0 | 79.102809 |
| 1 | 58.728261 |


No: the average spot price is lower on weekends (58.73 €/MWh) than on weekdays (79.10 €/MWh), so electricity is cheaper on weekends.

## Exercise 7

Knowing that the average nuclear power plant in Spain has a capacity of about 1000 MW, how many nuclear power plants are there in Spain?

In [56]:
energy['gen_nuclear'].max() / 1000

7.1395

The maximum nuclear generation in the dataset (7139.5) divided by 1000 MW per plant suggests roughly 7 nuclear power plants.

## Exercise 8

When is the demand peaking? In summer or in winter?

In [69]:
def season(month):
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

energy['season'] = energy['month'].apply(season)

energy.groupby('season')[['demand_total']].mean()

| season | demand_total |
|---|---|
| fall | 25050.017475 |
| spring | 24330.280346 |
| summer | 26592.360542 |
| winter | 27035.341796 |


Average demand is highest in winter (about 27,035), closely followed by summer (about 26,592).
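
The table above compares seasonal averages. Since the question asks about peaks, the seasonal maximum can be reported next to the mean. A minimal sketch with made-up numbers (the frame `sample` is hypothetical and chosen so that the two measures disagree):

```python
import pandas as pd

# Hypothetical sample: winter has the higher mean, summer the higher peak
sample = pd.DataFrame({
    "season": ["winter", "winter", "summer", "summer"],
    "demand_total": [27000.0, 30000.0, 24000.0, 32000.0],
})

# Mean and maximum demand per season, side by side
by_season = sample.groupby("season")["demand_total"].agg(["mean", "max"])
print(by_season)

season_highest_mean = by_season["mean"].idxmax()
season_highest_peak = by_season["max"].idxmax()
print(season_highest_mean, season_highest_peak)
```

The season with the highest average is not necessarily the one with the highest single-hour peak, so it is worth checking both on the real data.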

## Exercise 9

Calculate, for each date, the difference between the maximum and the minimum demand. What's the average difference? Which month has the highest difference?

In [72]:
energy['date'] = pd.to_datetime(energy['datetime_utc']).dt.date

# now that we have the date, let's create the columns
# min_day_demand: the minimum demand of the date
# max_day_demand: the maximum demand of the date

energy['min_day_demand'] = energy.groupby('date')['demand_total'].transform('min')
energy['max_day_demand'] = energy.groupby('date')['demand_total'].transform('max')

# now let's calculate the difference between the max and min demand
energy['diff_demand'] = energy['max_day_demand'] - energy['min_day_demand']

energy.head()

| | datetime_utc | spot_price | gen_ccgt | gen_coal | gen_hydro | gen_nuclear | gen_solar_pv | gen_solar_th | gen_total | gen_wind | ... | is_weekend | is_holiday | year_month | gap | date | gap_abs | season | min_day_demand | max_day_demand | diff_demand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-10-31 23:00:00+00:00 | 16.75 | NaN | 292.0 | 5008.4 | 4304.5 | 51.4 | 194.1 | 23136.4 | 9951.7 | ... | 0 | False | 2023_10 | -1379.1 | 2023-10-31 | 1379.1 | fall | 21757.3 | 21757.3 | 0.0 |
| 1 | 2023-11-01 00:00:00+00:00 | 12.52 | NaN | 291.0 | 3989.8 | 4305.5 | 51.4 | 194.1 | 21856.6 | 10426.5 | ... | 0 | True | 2023_11 | -1195.6 | 2023-11-01 | 1195.6 | fall | 18891.6 | 25426.0 | 6534.4 |
| 2 | 2023-11-01 01:00:00+00:00 | 4.99 | NaN | 180.0 | 3709.6 | 4306.5 | 51.2 | 194.1 | 21339.8 | 10832.7 | ... | 0 | True | 2023_11 | -1388.3 | 2023-11-01 | 1388.3 | fall | 18891.6 | 25426.0 | 6534.4 |
| 3 | 2023-11-01 02:00:00+00:00 | 4.3 | NaN | NaN | 3433.1 | 4307.5 | 51.2 | 194.1 | 20789.1 | 10736.3 | ... | 0 | True | 2023_11 | -1362.2 | 2023-11-01 | 1362.2 | fall | 18891.6 | 25426.0 | 6534.4 |
| 4 | 2023-11-01 03:00:00+00:00 | 4.3 | NaN | NaN | 3286.3 | 4307.5 | 51.2 | 194.1 | 20574.3 | 10669.2 | ... | 0 | True | 2023_11 | -1473.5 | 2023-11-01 | 1473.5 | fall | 18891.6 | 25426.0 | 6534.4 |


In [73]:

# average difference between the max and min demand

avg_diff_demand = energy['diff_demand'].mean()
print(avg_diff_demand)

9032.163003933138


In [74]:
# diff per month

energy.groupby('month')['diff_demand'].mean()

month
1     10701.248821
2     10372.413313
3      9180.741234
4      8001.335222
5      7581.251734
6      8204.558393
7      9590.524411
8      9613.077254
9      8389.939527
10     8344.954785
11     9049.996841
12     9736.281440
Name: diff_demand, dtype: float64

The largest average daily difference between the maximum and the minimum demand occurs in January.
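
The averages above are taken over hourly rows, so each date counts once per recorded hour, and dates with a single record (such as 2023-10-31 in the preview, where `diff_demand` is 0.0) pull the mean down. A sketch of the same calculation with one row per date, keeping only complete days; the synthetic frame `sample` and the 24-records-per-day rule are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sample: one trailing hour of 31 Oct, then two full days
timestamps = pd.date_range("2023-10-31 23:00", periods=49, freq="h", tz="UTC")
sample = pd.DataFrame({
    "datetime_utc": timestamps,
    "demand_total": 20000.0 + 100.0 * np.arange(49),
})
sample["date"] = sample["datetime_utc"].dt.date

# One row per date: daily minimum, maximum and number of hourly records
daily = sample.groupby("date")["demand_total"].agg(["min", "max", "count"])
daily["diff_demand"] = daily["max"] - daily["min"]

# Keep only complete days (assumed here to have 24 hourly records)
complete_days = daily[daily["count"] == 24]

print(daily["diff_demand"].mean())          # includes the one-record day
print(complete_days["diff_demand"].mean())  # complete days only
```

Working on the per-date table also makes it easy to group the daily differences by month afterwards.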

## Exercise 10

Does the spot price correlate with the demand difference?

In [75]:
energy[['spot_price', 'diff_demand']].corr()

| | spot_price | diff_demand |
|---|---|---|
| spot_price | 1.0 | 0.240926 |
| diff_demand | 0.240926 | 1.0 |


The correlation is positive but weak (about 0.24).

## Exercise 11

Which month has had the day with the highest spot price?

In [77]:
index_of_max_price = energy['spot_price'].idxmax()

energy.loc[index_of_max_price, 'month']

1

January had the day with the highest spot price. Note that `idxmax` returns only the first row where the maximum occurs.

In [79]:
# double check, brute force:

max_price = energy['spot_price'].max()

energy[energy['spot_price'] == max_price]['month']

2709     1
2853     1
8934    10
Name: month, dtype: int64

The same maximum price occurred three times: twice in January and once in October.
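
Ties like this can also be picked up in a single step with `Series.nlargest` and `keep="all"`, which returns every row that shares the top value. A minimal sketch; the frame `sample` and its prices are made up for illustration:

```python
import pandas as pd

# Hypothetical sample in which the maximum price appears three times
sample = pd.DataFrame({
    "month": [1, 1, 10, 3],
    "spot_price": [240.0, 240.0, 240.0, 95.0],
})

# keep="all" returns every row tied for the top value, not just the first one
top_prices = sample["spot_price"].nlargest(1, keep="all")

# Count how many of the tied rows fall in each month
months_at_max = sample.loc[top_prices.index, "month"].value_counts()
print(months_at_max)
```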

## Exercise 12

Are we using coal and gas power plants to cover the demand peaks when the renewable sources are not enough?

In [82]:
# defining 'are not enough'
# I'll assume that 'are not enough' means that the demand is greater than the generation from renewable sources

energy['renewable_gen'] = energy['gen_wind'] + energy['gen_solar_th'] + energy['gen_solar_pv'] + energy['gen_hydro']
energy['renewable_gap'] = energy['demand_total'] - energy['renewable_gen']

# let's study if the generation with coal and gas is correlated with the renewable gap

# first calculate all the thermal generation
energy['thermal_gen'] = energy['gen_ccgt'] + energy['gen_coal']

# now calculate the correlation
energy[['renewable_gap', 'thermal_gen']].corr()

| | renewable_gap | thermal_gen |
|---|---|---|
| renewable_gap | 1.0 | 0.834513 |
| thermal_gen | 0.834513 | 1.0 |


The correlation is strong (about 0.83): whenever renewable generation falls short of demand, coal and gas plants appear to cover much of the remaining gap.
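
One caveat: `gen_ccgt` and `gen_coal` are missing in about 48% and 85% of the rows (see Exercise 2.5), and `+` returns NaN whenever either operand is missing, so the correlation above only uses rows where both values are present. A sketch of the difference between the plain sum and a sum that treats a missing value as zero; the frame `sample` is made up, and treating NaN as zero generation is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both thermal columns
sample = pd.DataFrame({
    "gen_ccgt": [1000.0, np.nan, 1500.0],
    "gen_coal": [292.0, 180.0, np.nan],
})

# `+` returns NaN whenever either operand is missing
strict_sum = sample["gen_ccgt"] + sample["gen_coal"]

# Assumption: a missing value means no generation, so it is counted as 0
filled_sum = sample["gen_ccgt"].add(sample["gen_coal"], fill_value=0)

print(strict_sum.tolist())
print(filled_sum.tolist())
```

Re-running the correlation with the filled sum would show whether the conclusion holds across the whole dataset and not only in the hours where both technologies report a value.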