# Final exam

Rules:
- The time limit is 80 minutes.
- Every solution must be written in Python and/or pandas, and the result must be printed or returned.
- 5 minutes before the end, a warning will be given to submit what you have so far. No submissions will be accepted after the end of the exam.
- You can resubmit as many times as you want.
- The exam is open-book, open-notes, open-internet, open-everything, except for discussing with another person by any means, sharing solutions, or using AI in any form.
    - Any student who uses AI to solve the questions will fail immediately and be reported to the university.
- If you have a question, please raise your hand. I will come to you.
- For every exercise the solution is provided, but you need to write the code to solve it.
- If you can't get a solution, you can copy the solution provided, but you will get only 25% of the points.

Submission:
- Before the end of the exam, submit the exam using the link provided in the mid-term exam announcement.
- Submit only the completed notebook (the `.ipynb` file); no other files are needed.

## Good luck!
And remember: if you get stuck on a question, move on to the next one and come back to it later if you have time, or use the provided solution to answer it, with the 75% penalty.

## Exam theme

You are the manager of a utility company that sells electricity to the population. You have to answer some questions about the market.

The dataset (`df_final.csv`) contains the following columns:

- **datetime_utc**: The date and time in UTC format.
- **spot_price**: The spot price of electricity at the given datetime in euros per megawatt-hour (€/MWh).
- **gen_ccgt**: Generation from Combined Cycle Gas Turbine (CCGT) plants in megawatt-hours (MWh).
- **gen_coal**: Generation from coal-fired power plants in megawatt-hours (MWh).
- **gen_hydro**: Generation from hydroelectric power plants in megawatt-hours (MWh).
- **gen_nuclear**: Generation from nuclear power plants in megawatt-hours (MWh).
- **gen_solar_pv**: Generation from solar photovoltaic (PV) power plants in megawatt-hours (MWh).
- **gen_solar_th**: Generation from solar thermal power plants in megawatt-hours (MWh).
- **gen_total**: Total electricity generation from all sources in megawatt-hours (MWh).
- **gen_wind**: Generation from wind power plants in megawatt-hours (MWh).
- **demand_total**: Total electricity demand in megawatt-hours (MWh).
- **year**: The year of the datetime.
- **month**: The month of the datetime.
- **day**: The day of the datetime.
- **hour**: The hour of the datetime.
- **weekday**: The day of the week (0 = Monday, 1 = Tuesday, ..., 6 = Sunday).
- **is_weekend**: A binary indicator of whether the date is a weekend (1 = Yes, 0 = No).
- **is_holiday**: A binary indicator of whether the date is a holiday (True = Yes, False = No).

Each row in the dataset represents a specific hour of the day, and the corresponding values for each column.

## Read the file (1 point)

Read the data into a pandas DataFrame called `energy` and show the first 5 rows.

In [118]:
import pandas as pd
import numpy as np

energy = pd.read_csv('df_final (1).csv')

energy.head(5)

Unnamed: 0,datetime_utc,date,spot_price,gen_ccgt,gen_coal,gen_hydro,gen_nuclear,gen_solar_pv,gen_solar_th,gen_total,gen_wind,demand_total,year,month,day,hour,weekday,is_weekend,is_holiday
0,2023-10-31 23:00:00+00:00,2023-10-31,16.75,,292.0,5008.4,4304.5,51.4,194.1,23136.4,9951.7,21757.3,2023,10,31,23,1,0,False
1,2023-11-01 00:00:00+00:00,2023-11-01,12.52,,291.0,3989.8,4305.5,51.4,194.1,21856.6,10426.5,20661.0,2023,11,1,0,2,0,True
2,2023-11-01 01:00:00+00:00,2023-11-01,4.99,,180.0,3709.6,4306.5,51.2,194.1,21339.8,10832.7,19951.5,2023,11,1,1,2,0,True
3,2023-11-01 02:00:00+00:00,2023-11-01,4.3,,,3433.1,4307.5,51.2,194.1,20789.1,10736.3,19426.9,2023,11,1,2,2,0,True
4,2023-11-01 03:00:00+00:00,2023-11-01,4.3,,,3286.3,4307.5,51.2,194.1,20574.3,10669.2,19100.8,2023,11,1,3,2,0,True


## Data prep (1 point)

As we saw in class, the NaN values represent missing data. In this dataset, the missing value means that the generation or demand was zero. Replace the NaN values in the dataset with zeros.

**Hint:** Use the `fillna` method on the DataFrame, and use the `inplace=True` parameter to save the changes in the DataFrame in memory.

In [119]:
energy.fillna(0, inplace = True)

energy.isna().mean()

datetime_utc    0.0
date            0.0
spot_price      0.0
gen_ccgt        0.0
gen_coal        0.0
gen_hydro       0.0
gen_nuclear     0.0
gen_solar_pv    0.0
gen_solar_th    0.0
gen_total       0.0
gen_wind        0.0
demand_total    0.0
year            0.0
month           0.0
day             0.0
hour            0.0
weekday         0.0
is_weekend      0.0
is_holiday      0.0
dtype: float64

## Warming up (0.25 points each)

* Do Mondays in September have a higher average spot price than Mondays in October?
* Is Spain a windy country during springtime? Consider it 'windy' when the wind generation is higher than the yearly average.
* Does Spain use more nuclear energy or wind energy during the weekends?
* Is summer a good season for solar energy in Spain?

In [120]:
energy.head(3)

Unnamed: 0,datetime_utc,date,spot_price,gen_ccgt,gen_coal,gen_hydro,gen_nuclear,gen_solar_pv,gen_solar_th,gen_total,gen_wind,demand_total,year,month,day,hour,weekday,is_weekend,is_holiday
0,2023-10-31 23:00:00+00:00,2023-10-31,16.75,0.0,292.0,5008.4,4304.5,51.4,194.1,23136.4,9951.7,21757.3,2023,10,31,23,1,0,False
1,2023-11-01 00:00:00+00:00,2023-11-01,12.52,0.0,291.0,3989.8,4305.5,51.4,194.1,21856.6,10426.5,20661.0,2023,11,1,0,2,0,True
2,2023-11-01 01:00:00+00:00,2023-11-01,4.99,0.0,180.0,3709.6,4306.5,51.2,194.1,21339.8,10832.7,19951.5,2023,11,1,1,2,0,True


In [121]:
mon_sep = energy[(energy['weekday'] == 0) & (energy['month'] == 9)]['spot_price'].mean()
mon_oct = energy[(energy['weekday'] == 0) & (energy['month'] == 10)]['spot_price'].mean()

print(f'Mondays in september have a higher avg spot price: {mon_sep > mon_oct}')
print(f'Sep = {mon_sep} and Oct = {mon_oct}')

Mondays in september have a higher avg spot price: False
Sep = 94.06180412371135 and Oct = 96.1101388888889


In [122]:
energy['season'] = np.where(energy['month'].isin([6, 7, 8]), 'summer', 
                            np.where(energy['month'].isin([12, 1, 2]), 'winter', 
                                     np.where(energy['month'].isin([3, 4, 5]), 'spring', 'fall')))

wind = energy.groupby('season')['gen_wind'].mean()

wind['spring']

energy['yearly_wind'] = energy.groupby('year')['gen_wind'].transform('mean')

energy['isgenmore'] = energy['gen_wind'] > energy['yearly_wind']

result = energy.groupby('season')['isgenmore'].size()

print(f'Wind generation is above the yearly average the most in {result.idxmax()}')

result

Wind generation is above the yearly average the most in spring


season
fall      4229
spring    4279
summer    4278
winter    3486
Name: isgenmore, dtype: int64
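
Note that `.size()` returns the number of rows in each group regardless of the values in `isgenmore`, so the table above is the count of hourly records per season. A minimal sketch on made-up numbers of how `.sum()` and `.mean()` count the flagged hours instead; the `toy` frame, the `above_avg` column, and all values are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Toy data (hypothetical values, not from df_final.csv).
toy = pd.DataFrame({
    "season": ["spring", "spring", "spring", "summer", "summer", "summer"],
    "gen_wind": [900.0, 800.0, 100.0, 200.0, 700.0, 300.0],
})

# Flag the hours whose wind generation exceeds the overall mean (500.0 here).
toy["above_avg"] = toy["gen_wind"] > toy["gen_wind"].mean()

by_season = toy.groupby("season")["above_avg"]
rows_per_season = by_season.size()  # number of rows, ignores the flag
windy_hours = by_season.sum()       # number of True values
windy_share = by_season.mean()      # fraction of True values

print(rows_per_season.to_dict())
print(windy_hours.to_dict())
print(windy_share.round(2).to_dict())
```

Comparing the share of flagged hours (`.mean()`) rather than the raw count is one way to avoid favouring seasons that simply have more rows.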

In [123]:
weekend_nuclear = energy[energy['is_weekend'] == 1]['gen_nuclear'].mean()
weekend_wind = energy[energy['is_weekend'] == 1]['gen_wind'].mean()

print(weekend_wind, weekend_nuclear)

7436.872090517241 5783.706077586206


In [124]:
solar = energy.groupby('season')['gen_solar_pv'].mean()

print(f'On average the most solar output is summer')

print(solar)

On average the most solar output is summer
season
fall      4395.996524
spring    5420.971278
summer    6893.731019
winter    3018.818789
Name: gen_solar_pv, dtype: float64


## Battery storage (1 point)

As a manager, you are considering investing in battery storage to store electricity when the price is low and sell it when the price is high.

The batteries you can buy can only store electricity for 1 hour, and release it for 1 hour. You can buy as many batteries as you want.

Taking into account the spot prices of electricity, which hour (on average) is the best to buy electricity to store in the batteries? Which hour is the best to sell the electricity stored in the batteries?

**Hint:** Calculate the average spot price for each hour of the day, and check the hours at which the average spot price is the lowest and the highest.

**The solution is:**
* Buy/store at hour 13
* Sell at hour 19

In [125]:
energy.head(3)

Unnamed: 0,datetime_utc,date,spot_price,gen_ccgt,gen_coal,gen_hydro,gen_nuclear,gen_solar_pv,gen_solar_th,gen_total,gen_wind,demand_total,year,month,day,hour,weekday,is_weekend,is_holiday,season,yearly_wind,isgenmore
0,2023-10-31 23:00:00+00:00,2023-10-31,16.75,0.0,292.0,5008.4,4304.5,51.4,194.1,23136.4,9951.7,21757.3,2023,10,31,23,1,0,False,fall,7893.359406,True
1,2023-11-01 00:00:00+00:00,2023-11-01,12.52,0.0,291.0,3989.8,4305.5,51.4,194.1,21856.6,10426.5,20661.0,2023,11,1,0,2,0,True,fall,7893.359406,True
2,2023-11-01 01:00:00+00:00,2023-11-01,4.99,0.0,180.0,3709.6,4306.5,51.2,194.1,21339.8,10832.7,19951.5,2023,11,1,1,2,0,True,fall,7893.359406,True


In [126]:
sell = energy.groupby('hour')['spot_price'].mean().idxmax()
buy = energy.groupby('hour')['spot_price'].mean().idxmin()

print(f'Buy on hour {buy}, sell on hour {sell}')

Buy on hour 13, sell on hour 19


Given these results, the best hour to store electricity is hour 13 and the best hour to sell the electricity is hour 19.

## The renewable energy (1 point, open ended)

We can consider that the renewable energy is the sum of the generation from hydro, solar, and wind sources.

Calculate the percentage of renewable energy in the total generation for each hour, using a column called `renewable_percentage`.

Is this percentage related to the spot price of electricity? Use the correlation between the percentage of renewable energy and the spot price to answer this question.

In [127]:

pd.set_option('display.max_columns', None)

energy.head(1)

Unnamed: 0,datetime_utc,date,spot_price,gen_ccgt,gen_coal,gen_hydro,gen_nuclear,gen_solar_pv,gen_solar_th,gen_total,gen_wind,demand_total,year,month,day,hour,weekday,is_weekend,is_holiday,season,yearly_wind,isgenmore
0,2023-10-31 23:00:00+00:00,2023-10-31,16.75,0.0,292.0,5008.4,4304.5,51.4,194.1,23136.4,9951.7,21757.3,2023,10,31,23,1,0,False,fall,7893.359406,True


In [128]:
energy['renewable_energy'] = energy['gen_hydro'] + energy['gen_solar_pv'] + energy['gen_solar_th'] + energy['gen_wind']
energy['renewable_percentage'] = energy['renewable_energy'] / energy['gen_total']

energy[['renewable_percentage', 'spot_price']].corr()

Unnamed: 0,renewable_percentage,spot_price
renewable_percentage,1.0,-0.733877
spot_price,-0.733877,1.0


## Fossil fuels (1 point, open ended)

Calculate the percentage of fossil fuels in the total generation for each hour, using a column called `fossil_percentage`.

We can consider that the fossil fuels are the sum of the generation from coal and CCGT sources.

Is this percentage related to the spot price of electricity? Use the correlation between the percentage of fossil fuels and the spot price to answer this question.
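
One possible approach, sketched on made-up numbers because no answer cell is recorded for this question. The `toy` frame, the helper column `fossil_energy`, and every value are assumptions for illustration; only the column names `gen_ccgt`, `gen_coal`, `gen_total`, `spot_price`, and `fossil_percentage` come from the exam text:

```python
import pandas as pd

# Toy data (hypothetical values, not from df_final.csv).
toy = pd.DataFrame({
    "gen_ccgt": [100.0, 300.0, 500.0, 700.0],
    "gen_coal": [0.0, 100.0, 100.0, 100.0],
    "gen_total": [1000.0, 1000.0, 1000.0, 1000.0],
    "spot_price": [10.0, 40.0, 70.0, 90.0],
})

# Fossil generation is coal plus CCGT, as defined in the exercise.
toy["fossil_energy"] = toy["gen_ccgt"] + toy["gen_coal"]
toy["fossil_percentage"] = toy["fossil_energy"] / toy["gen_total"] * 100

# Pearson correlation between the fossil share and the price.
fossil_corr = toy["fossil_percentage"].corr(toy["spot_price"])
print(toy["fossil_percentage"].round(1).tolist())
print(round(fossil_corr, 3))
```

On the real data the same three lines would be applied to `energy`; the sign and size of the resulting coefficient cannot be inferred from this toy example.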

## The thermal gap (1 point)

The thermal gap is the difference between the total demand and the renewable power generation. Calculate the thermal gap for each hour of the day.

Save it in a column called `thermal_gap`, and also calculate the thermal gap as a percentage of the total demand, saving it in a column called `thermal_gap_percentage`.

Also, calculate the average thermal gap for each month of the year.

**The solution is:**

```python
1      9178.528641
2      9930.591831
3      6201.406861
4      5069.532281
5      5817.860541
6      8562.806098
7     10792.112760
8     10641.932663
9     10356.286442
10     8380.627462
11     7744.247308
12     8535.662327
```

In [129]:
energy['thermal_gap'] = energy['demand_total'] - energy['renewable_energy']
energy['thermal_gap_percentage'] = energy['thermal_gap'] / energy['demand_total'] * 100

energy.groupby('month')['thermal_gap'].mean()

month
1      9178.528641
2      9930.591831
3      6201.406861
4      5069.532281
5      5817.860541
6      8562.806098
7     10792.112760
8     10641.932663
9     10356.286442
10     8380.627462
11     7744.247308
12     8535.662327
Name: thermal_gap, dtype: float64

## More thermal gap (1 point, open ended)

Is the thermal gap related to the spot price of electricity? Use the correlation between the thermal gap and the spot price to answer this question.

Is it a stronger predictor of price than the percentage of renewable energy?

In [130]:
corr = energy[['thermal_gap', 'spot_price']].corr()
print(corr)

corr2 = corr.loc['thermal_gap', 'spot_price']

print(f'It is strongly correlated at {corr2}')


             thermal_gap  spot_price
thermal_gap     1.000000    0.824972
spot_price      0.824972    1.000000
It is strongly correlated at 0.8249719458412461


## Understanding CCGT plants (1 point)

The CCGT plants burn natural gas to generate electricity.

This power generation comes with some losses, and the efficiency of the CCGT plants is around 55%. This means that for every 100 MWh of natural gas burned, only 55 MWh are converted into electricity, which is what is measured in the `gen_ccgt` column.

Calculate the total amount of natural gas burned in the CCGT plants in the dataset in MWh.

Save this information under a column called `natural_gas_burned` in the `energy` DataFrame.

What was the total amount of natural gas burned in the CCGT plants in 2023? Save this information in a variable called `natural_gas_burned_2023`.

**The solution is `30001834.36`**

In [131]:
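# Gas burned over the whole dataset: electricity output divided by the 55% efficiency
total_ccgt_mwh = energy['gen_ccgt'].sum()
total_gas_burned = total_ccgt_mwh / 0.55

# Gas burned in each hour
energy['natural_gas_burned'] = energy['gen_ccgt'] / 0.55

# Gas burned in 2023, saved in the variable the exercise asks for
natural_gas_burned_2023 = energy[energy['year'] == 2023]['natural_gas_burned'].sum()
natural_gas_burned_2023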

30001834.363636363

## Interlude (1 point)

Create a new column called `year_month` that contains the year and month of the datetime in the format `YYYY-MM`. The month should be a two-digit number and the year a four-digit number. They should be separated by a hyphen, and the month should have a leading zero if it is a single-digit number.

Examples:
* For year 2023 and month 1, the value should be `2023-01`.
* For year 2023 and month 12, the value should be `2023-12`.

**Hint**: you can use the `np.where` function to add the leading zero to the month, and then you can concatenate the year and month into a single string value.

In [135]:
energy['month'] = np.where(energy['month'].isin([1, 2, 3, 4, 5, 6, 7, 8, 9]), '0' + energy['month'].astype(str), energy['month'])
energy['month'].value_counts()


energy["year_month"] = energy['year'].astype(str) + "-" + energy['month'].astype(str)
energy['year_month'].value_counts()

year_month
2024-03    722
2023-10    721
2024-05    721
2024-10    721
2023-07    721
2024-01    721
2023-12    721
2023-05    721
2024-08    721
2023-08    721
2023-01    721
2023-03    721
2024-07    721
2024-04    697
2024-09    697
2023-11    697
2023-09    697
2023-04    697
2023-06    697
2024-06    697
2024-11    696
2024-02    673
2023-02    649
2022-12      1
Name: count, dtype: int64
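
The cell above overwrites the numeric `month` column with strings, which changes how later `groupby('month')` results are labelled. As an alternative sketch (not the approach suggested in the hint), `str.zfill` can pad the month while leaving `month` untouched; the `toy` frame and its values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values, not from df_final.csv).
toy = pd.DataFrame({"year": [2023, 2023, 2024], "month": [1, 12, 7]})

# Pad the month to two digits and join with a hyphen; `month` stays numeric.
toy["year_month"] = (
    toy["year"].astype(str) + "-" + toy["month"].astype(str).str.zfill(2)
)

print(toy["year_month"].tolist())  # → ['2023-01', '2023-12', '2024-07']
```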

## Calculating metrics (0.25 points each)

The board of directors asked you to calculate some metrics to understand the market better.

* Using `year_month` from the previous exercise, calculate the average spot price per `year_month`, and save the results in a DataFrame called `monthly_spot_price`.
* Using `agg`, calculate, for each month, the maximum and minimum `spot_price`, `gen_total`, and `demand_total`, saving the results in a DataFrame called `monthly_max_min`.
* Calculate the average spot price of electricity for each day of the week, saving the result in a DataFrame called `daily_spot_price`.
* Calculate the average demand for each month, saving the result in a DataFrame called `monthly_demand`.

In [139]:
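# Average spot price per year_month, as a DataFrame
monthly_spot_price = energy.groupby('year_month')['spot_price'].mean().to_frame()

# Maximum and minimum of each metric per year_month
monthly_max_min = energy.groupby('year_month')[['spot_price', 'gen_total', 'demand_total']].agg(['max', 'min'])

# Average spot price per day of the week (0 = Monday, ..., 6 = Sunday)
daily_spot_price = energy.groupby('weekday')['spot_price'].mean().to_frame()

# Average demand per month
monthly_demand = energy.groupby('month')['demand_total'].mean().to_frame()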

## Understanding the wind power (1 point)

Wind power generators extract the kinetic energy of the wind to generate electricity, so wind power generation is related to wind speed. The generators in turn reduce the wind speed, because some of the wind's kinetic energy is converted into electricity.

The maximum amount of energy that can be extracted from the wind is related to the Betz limit, which is around 59.3%. This means that the maximum efficiency of a wind power generator is around 59.3%.

Use this Betz limit to calculate the original energy of the wind before it was reduced by the wind power generators. Save this information in a column called `wind_original` in the `energy` DataFrame.

Use `map` or `apply` to calculate this value.

In [115]:
energy['wind_original'] = energy['gen_wind'].map(lambda wind: wind / 0.593)

energy[['gen_wind', 'wind_original']]

Unnamed: 0,gen_wind,wind_original
0,9951.7,16781.956155
1,10426.5,17582.630691
2,10832.7,18267.622260
3,10736.3,18105.059022
4,10669.2,17991.905565
...,...,...
16267,8957.8,15105.902192
16268,8627.9,14549.578415
16269,7945.0,13397.976391
16270,7279.6,12275.885329


## More questions from the board (0.25 point each)

* On average, what's the cheapest month to buy electricity? Save the result in a variable called `cheapest_month`.
* On average, what's the most expensive month to buy electricity? Save the result in a variable called `most_expensive_month`.
* What is the average nuclear power generation in the cheapest month? Save the result in a variable called `average_nuclear_cheapest`.
* What is the average nuclear power generation in the most expensive month? Save the result in a variable called `average_nuclear_expensive`.

**Hint:** After using idxmin or idxmax, you can extract the index with `.values[0]` to get only the number, not the whole series.

In [143]:
most_expensive_month = energy.groupby('month')['spot_price'].mean().idxmax()
most_expensive_month

'08'

In [142]:
cheapest_month = energy.groupby('month')['spot_price'].mean().idxmin()
cheapest_month

'04'
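
The cells above cover the first two bullets only. A minimal sketch of how all four values could be derived together, run on made-up numbers; the `toy` frame, the intermediate names `avg_price` and `avg_nuclear`, and all values are hypothetical, while the four result variables use the names requested in the exercise:

```python
import pandas as pd

# Toy data (hypothetical values, not from df_final.csv).
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 3, 3],
    "spot_price": [50.0, 70.0, 20.0, 40.0, 90.0, 110.0],
    "gen_nuclear": [7000.0, 7100.0, 6000.0, 6200.0, 6900.0, 7100.0],
})

# Average price per month, then the labels of the lowest and highest averages.
avg_price = toy.groupby("month")["spot_price"].mean()
cheapest_month = avg_price.idxmin()
most_expensive_month = avg_price.idxmax()

# Average nuclear generation per month, looked up for those two months.
avg_nuclear = toy.groupby("month")["gen_nuclear"].mean()
average_nuclear_cheapest = avg_nuclear.loc[cheapest_month]
average_nuclear_expensive = avg_nuclear.loc[most_expensive_month]

print(cheapest_month, most_expensive_month)
print(average_nuclear_cheapest, average_nuclear_expensive)
```

When `idxmin` or `idxmax` is called on a Series, as here, it already returns a single label, so `.values[0]` is only needed when the result is itself a Series.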

## Football and energy demand in Spain. (1 point, open ended)

The Euro 2024 final match happened on Sunday 14th of July 2024 at 7PM UTC. The match was between Spain and England and, of course, Spain won.

Was there any indicator in the data that the match was happening?
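
No answer is recorded for this question. One way to look for a signal, sketched here under stated assumptions, is to compare demand at the kick-off hour with the same hour on the other Sundays of July 2024. Every demand value below is invented for illustration and says nothing about what the real data shows; the names `toy`, `match_time`, `baseline`, and `deviation_pct` are hypothetical:

```python
import pandas as pd

# Hypothetical demand at 19:00 UTC on the four Sundays of July 2024.
toy = pd.DataFrame({
    "datetime_utc": pd.to_datetime(
        ["2024-07-07 19:00", "2024-07-14 19:00",
         "2024-07-21 19:00", "2024-07-28 19:00"],
        utc=True,
    ),
    "demand_total": [26000.0, 24500.0, 26200.0, 25800.0],
})

# Kick-off time given in the exercise.
match_time = pd.Timestamp("2024-07-14 19:00", tz="UTC")
is_match = toy["datetime_utc"] == match_time

# Demand during the match versus the average of the comparable hours.
match_demand = toy.loc[is_match, "demand_total"].iloc[0]
baseline = toy.loc[~is_match, "demand_total"].mean()
deviation_pct = (match_demand - baseline) / baseline * 100

print(round(deviation_pct, 2))
```

On the real dataset, `datetime_utc` would first need to be parsed with `pd.to_datetime(..., utc=True)`, and the hours after kick-off could be checked in the same way.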

## Stretching and cooling down (1 point)

* Using the average price per hour, how much lower as a percentage is the minimum price compared to the maximum price?
* Using the average demand per hour, how much lower as a percentage is the minimum demand compared to the maximum demand?
* Are these two percentages related?
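
No answer is recorded for this exercise either. The sketch below shows one reading of "how much lower as a percentage", namely the gap between the maximum and the minimum divided by the maximum. The helper `pct_below_max`, the `toy` frame, and all values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values, not from df_final.csv).
toy = pd.DataFrame({
    "hour": [0, 0, 13, 13, 19, 19],
    "spot_price": [60.0, 80.0, 30.0, 50.0, 90.0, 110.0],
    "demand_total": [20000.0, 22000.0, 26000.0, 28000.0, 29000.0, 31000.0],
})


def pct_below_max(series):
    """Return how far the minimum sits below the maximum, in percent of the maximum."""
    return (series.max() - series.min()) / series.max() * 100


# Average price and demand for each hour of the day.
hourly = toy.groupby("hour")[["spot_price", "demand_total"]].mean()

price_drop = pct_below_max(hourly["spot_price"])
demand_drop = pct_below_max(hourly["demand_total"])

# One possible way to probe the third bullet: correlate the two hourly profiles.
profile_corr = hourly["spot_price"].corr(hourly["demand_total"])

print(round(price_drop, 1), round(demand_drop, 1), round(profile_corr, 3))
```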