# Analysis of the 5-Minute PV Data, Part 2

The current notebook contains a series of graphs that will answer the following questions:

I) General Trends
- How does power generated vary over time?
- How many PV systems were operational per year?
- How does average energy generation change by year and month?
- What is the average energy generation per year by district?

II) Trends in 2019
- What is the average energy generation by month in 2019?
- How does the average power generation change in the UK (by month in 2019)?
- How does average energy generation change by hour in 2019?
- How does power generation change by Season and Hour in 2019?




In [None]:
!pip install --upgrade vaex
!pip install dash
!pip install rtree
!pip install pygeos
!pip install -q geopandas folium matplotlib mapclassify

In [None]:
# Have to restart runtime for this to work
import vaex as vx
import pyarrow.parquet as pq
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import geopandas as gpd
import pygeos
import rtree
import folium
from branca.utilities import split_six

# Preprocessing Data

### The Data

The following datasets were used:

- **5min parquet**: Time series of PV solar generation. Available at https://huggingface.co/datasets/openclimatefix/uk_pv/tree/main. For more information about the data, see https://huggingface.co/datasets/openclimatefix/uk_pv
- **metadata**: Metadata for the individual PV systems. Available at https://huggingface.co/datasets/openclimatefix/uk_pv/tree/main. For more information, see https://huggingface.co/datasets/openclimatefix/uk_pv
- **energy consumption**: Sub-national total electricity consumption (GWh) by region in 2019. Original data source available at https://www.gov.uk/government/statistical-data-sets/regional-and-local-authority-electricity-consumption-statistics

  - Additional notes:
  
     - Regions include: North East, North West, Yorkshire and The Humber, East Midlands, West Midlands, East, London, Inner London, Outer London, South East, South West, Wales, Scotland

     - Electricity consumption is as reported by all local authorities (Scotland includes unallocated authorities; it is unclear whether other regions do the same) and measured by 'all_meters'.
- **UK districts**: GeoJSON representation of UK districts. Available at http://geoportal1-ons.opendata.arcgis.com/datasets/01fd6b2d7600446d8af768005992f76a_4.geojson



In [None]:
# Load datasets
min5 = vx.open('/content/5min.parquet')  # 5-minute generation data
metadata = pd.read_csv('metadata.csv')  # PV system metadata
energy_consumption = pd.read_csv('/content/energy_consumption.csv')  # regional electricity consumption

# Load data defining the boundaries of UK districts
url = 'http://geoportal1-ons.opendata.arcgis.com/datasets/01fd6b2d7600446d8af768005992f76a_4.geojson'
districts = gpd.read_file(url)

In [None]:
# Clean energy consumption: remove spaces so region names match the districts table
energy_consumption['region'] = energy_consumption['region'].str.replace(' ', '', regex=False)

# Clean districts df: keep the relevant columns (copy to avoid SettingWithCopyWarning)
districts = districts[['nuts118nm', 'long', 'lat', 'geometry']].copy()
# Strip 'England', parentheses, 'of' and spaces from the region names.
# regex=False treats '(' and ')' as literal characters rather than regex syntax.
for token in ['England', '(', ')', 'of', ' ']:
    districts['nuts118nm'] = districts['nuts118nm'].str.replace(token, '', regex=False)
districts = districts.rename(columns={"nuts118nm": "region",
                                      "long": "district_long",
                                      "lat": "district_lat"})

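The region labels have to match exactly in both tables for the later joins to work. A minimal sketch of the same normalisation on made-up labels (the sample strings and the helper name `normalise_region` are illustrative assumptions, not values read from the datasets):

```python
import pandas as pd

def normalise_region(names: pd.Series) -> pd.Series:
    """Strip 'England', parentheses, 'of' and spaces so labels from both tables match."""
    out = names
    for token in ["England", "(", ")", "of", " "]:
        # regex=False treats '(' and ')' as literal characters
        out = out.str.replace(token, "", regex=False)
    return out

# Hypothetical labels in the style of the two sources
district_names = pd.Series(["North East (England)", "East of England", "Wales"])
consumption_names = pd.Series(["North East", "East", "Wales"])

clean_districts = normalise_region(district_names)
clean_consumption = consumption_names.str.replace(" ", "", regex=False)

print(clean_districts.tolist())
print(clean_districts.equals(clean_consumption))
```
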
In [None]:
# Decompose the timestamp into its date and time components
min5['day'], min5['month'], min5['year'] = min5.timestamp.dt.day, min5.timestamp.dt.month, min5.timestamp.dt.year
min5['hour'], min5['minute'], min5['second'] = min5.timestamp.dt.hour, min5.timestamp.dt.minute, min5.timestamp.dt.second
min5.head(2)

In [None]:
# Unique system ids present in the 5-minute data
ss_id = min5.ss_id.unique()
# Create a new dataframe with the unique ss_id values and attach the metadata
ssid_df = pd.DataFrame(ss_id, columns=['ss_id'])
min5_meta = ssid_df.merge(metadata, on='ss_id', how='left')  # left join: one row per system
min5_meta = min5_meta.dropna()  # drop systems with any missing metadata field
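
Before dropping rows, it can be useful to see how many systems have no metadata at all. A small sketch on toy tables (all values are made up; only the column names `ss_id`, `latitude_rounded` and `longitude_rounded` come from this notebook):

```python
import pandas as pd

# Toy stand-ins for the real tables
ssid_df = pd.DataFrame({"ss_id": [1, 2, 3]})
metadata = pd.DataFrame({
    "ss_id": [1, 3],
    "latitude_rounded": [51.5, 53.4],
    "longitude_rounded": [-0.1, -2.2],
})

# indicator=True adds a '_merge' column recording where each row was found
merged = ssid_df.merge(metadata, on="ss_id", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "ss_id"].tolist()
print(f"{len(missing)} of {len(merged)} systems have no metadata: {missing}")

# Keep only the systems that were found in the metadata
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```
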

In [None]:
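# New dataframe for plotting geographical data
min5_meta_geo = gpd.GeoDataFrame(
    min5_meta,
    geometry=gpd.points_from_xy(min5_meta.longitude_rounded, min5_meta.latitude_rounded),
    crs=districts.crs)  # assumes the rounded coordinates use the same CRS as the district boundaries
# `predicate` replaces the older `op` argument, which recent geopandas releases no longer accept
pv_geo = gpd.sjoin(min5_meta_geo, districts, predicate='within')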


# I) General Trends

- How does power generated vary over time?

Average power generation decreased from 2018 to 2019, was relatively constant from 2019 to 2020, and increased in 2021.

In [None]:
plot1 = min5.groupby(['ss_id','year']).agg({'generation_wh': 'mean'}).to_pandas_df().sort_values('year', ascending=False)

plt.style.use('ggplot')
palette={2018:'#36454F',2019:'#36454F',2020:'#36454F',2021:'#fada5e' }
sns.barplot(x='year',y='generation_wh', data=plot1,hue='year', palette=palette)

# titles
plt.xlabel('Year') 
plt.ylabel('Average Power Generated (Wh)')   
plt.title("Average Power Generated (Wh) by Year")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
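
The bars above average the per-system means, so every system counts equally regardless of how many readings it reported. The sketch below, on made-up readings, shows that this can differ from the mean over all readings pooled together:

```python
import pandas as pd

# Made-up readings: system 1 reports three times, system 2 only once
readings = pd.DataFrame({
    "ss_id": [1, 1, 1, 2],
    "year": [2019, 2019, 2019, 2019],
    "generation_wh": [10.0, 20.0, 30.0, 100.0],
})

# Mean per system first, then mean across systems (what the bar plot shows)
per_system = readings.groupby(["ss_id", "year"], as_index=False)["generation_wh"].mean()
mean_of_means = per_system.groupby("year")["generation_wh"].mean()

# Mean over all readings pooled together
pooled_mean = readings.groupby("year")["generation_wh"].mean()

print(mean_of_means.loc[2019], pooled_mean.loc[2019])
```
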

- How many pv systems were operational per-year?

The number of PV systems operational per year has decreased since 2019.

In [None]:
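# Number of distinct PV systems that reported data in each year.
# Aggregating ss_id with 'count' directly would count readings (rows), not systems,
# so first reduce to one row per (ss_id, year) and then count the unique ids.
plt2 = (min5.groupby(['ss_id', 'year']).agg({'generation_wh': 'count'}).to_pandas_df()
        .groupby('year')['ss_id'].nunique().reset_index())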

plt.style.use('ggplot')
palette={2018:'#36454F',2019:'#36454F',2020:'#36454F',2021:'#fada5e' }

sns.barplot(data=plt2, x='year', y='ss_id',hue='year', palette=palette)

# titles
plt.xlabel('Year') 
plt.ylabel('Count of PV Systems')   
plt.title("Count of PV Systems by Year")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
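
Counting rows and counting distinct systems answer different questions. A short sketch with made-up ids shows how the two can even move in opposite directions:

```python
import pandas as pd

# Made-up readings: 2019 has more rows, 2020 has more distinct systems
readings = pd.DataFrame({
    "ss_id": [1, 1, 1, 2, 2, 3, 4],
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020],
})

rows_per_year = readings.groupby("year")["ss_id"].count()       # number of readings
systems_per_year = readings.groupby("year")["ss_id"].nunique()  # number of distinct systems

print(rows_per_year.to_dict())
print(systems_per_year.to_dict())
```
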

- How does average energy generation change by month?

Power generation tends to be higher from April to September than during the rest of the year.


In [None]:
# Preparing data
# Group by ss_id, year and month, and compute the average generation
by_year_month = min5.groupby(['ss_id','year','month']).agg({'generation_wh': 'mean'}).to_pandas_df() 
# Combine the table above with the system metadata
combined_month = by_year_month.merge(metadata, on='ss_id', how='left')
# Create a date column for the Plotly animation filter (zero-padded month so the labels sort chronologically)
combined_month["date"] = combined_month['year'].astype(str) + "-" + combined_month["month"].astype(str).str.zfill(2)
combined_month = combined_month.sort_values("date")
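
Year-month labels stored as text only sort chronologically when the month is zero-padded. This matters for the animation because Plotly Express typically orders frames by the order in which labels appear in the data. A small sketch on made-up rows:

```python
import pandas as pd

frames = pd.DataFrame({"year": [2019, 2019, 2019], "month": [10, 2, 1]})

# Unpadded labels sort as text, so October lands before February
unpadded = frames["year"].astype(str) + "-" + frames["month"].astype(str)

# Zero-padding the month makes text order match calendar order
padded = frames["year"].astype(str) + "-" + frames["month"].astype(str).str.zfill(2)

print(sorted(unpadded))
print(sorted(padded))
```
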

In [None]:
# Create a graph showing power generation in the UK by year and month
import plotly.express as px

fig = px.scatter_geo(combined_month, lat="latitude_rounded", lon="longitude_rounded", color="generation_wh",
                     hover_name="ss_id", size="generation_wh",
                     animation_frame="date",
                     projection="natural earth",
                     scope ='europe',
                     color_continuous_scale=["blue", "purple", "red"],
                     center=dict(lat=combined_month.latitude_rounded.mean(), lon=combined_month.longitude_rounded.mean()))
fig.update_traces(marker=dict(size=10))
fig.update_layout(autosize=True, width=1000, height=600, showlegend=True,
                  geo=dict(projection_scale=6))
fig.show()

In [None]:
# More detailed data of how power generation varies by month
by_month = min5.groupby(['month']).agg({'generation_wh': 'mean'}).to_pandas_df() 
by_month
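
The column `generation_wh` is an energy (Wh), while several plots are labelled as power, so it can help to convert between the two. A minimal sketch, assuming each row holds the energy generated within a single 5-minute interval (this is an assumption about the dataset and should be checked against the dataset card); the helper name `wh_to_mean_watts` is hypothetical:

```python
def wh_to_mean_watts(energy_wh: float, interval_minutes: float = 5.0) -> float:
    """Mean power in W over an interval, given the energy in Wh produced during it."""
    # power = energy / time, with the interval converted from minutes to hours
    return energy_wh * 60.0 / interval_minutes

# Example: 25 Wh generated over an assumed 5-minute interval
mean_power_w = wh_to_mean_watts(25.0)
print(mean_power_w)
```
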

- What is the average energy generation per year by district?

The 'East' and 'East Midlands' regions generate the most power, followed by the 'South East' and 'South West' regions. This holds for every year.

The 'North West' region generates the least power. This also holds for every year.

In [None]:
# Preparing data
# Consider averages by region
avg_region = min5.groupby(['ss_id','year','month','hour']).agg({'generation_wh': 'mean'}).to_pandas_df().set_index('ss_id')

# select only relevant columns
ssid_region= pv_geo[['ss_id','region',]].set_index('ss_id')

# join
region_date_gen = avg_region.join(ssid_region, on='ss_id', how='left').reset_index()
region_date_gen.head(2)

In [None]:
region_date_gen_year = region_date_gen.groupby(['region','year']).agg({'generation_wh': 'mean'}).reset_index()

plt.figure(figsize=(40,20))
ax = sns.barplot(data=region_date_gen_year, 
                 x='year',
                 y='generation_wh',
                 hue='region',
                 palette=sns.color_palette('icefire'))

plt.xlabel('Year') 
plt.ylabel('Average Power Generated (Wh)')   
plt.title("Average Power Generated (Wh) by Year and District")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
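
Whether the same regions rank highest and lowest in every year can also be checked in a table rather than read off the chart. A sketch on made-up regional means shaped like `region_date_gen_year` (the numbers are illustrative only):

```python
import pandas as pd

# Made-up regional means with the same columns as region_date_gen_year
region_year = pd.DataFrame({
    "region": ["East", "NorthWest", "SouthEast"] * 2,
    "year": [2019] * 3 + [2020] * 3,
    "generation_wh": [30.0, 12.0, 25.0, 28.0, 11.0, 26.0],
})

# One row per year, one column per region
table = region_year.pivot(index="year", columns="region", values="generation_wh")

top_region = table.idxmax(axis=1)     # highest-generating region in each year
bottom_region = table.idxmin(axis=1)  # lowest-generating region in each year

print(top_region.to_dict())
print(bottom_region.to_dict())
```
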

# II) Trends for 2019

- What is the average energy generation by month for 2019?

Again, the majority of power is generated from April to September.

In [None]:
# Selecting only 2019
region_date_gen_2019 = region_date_gen[region_date_gen['year']==2019]
region_date_gen_2019.head(2)

In [None]:
plt.figure(figsize=(40,20))
ax = sns.barplot(data=region_date_gen_2019, x='month', y='generation_wh',
                 dodge=False,
                 hue='month',palette=sns.color_palette('icefire'))

plt.xlabel('Month') 
plt.ylabel('Average Power Generated (Wh)')   
plt.title("Average Power Generated (Wh) by Month (2019)")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

- How does the average power generation change in the UK (by month in 2019)?


In [None]:
# ---- CREATING DATAFRAME TO USE IN THE ANALYSIS
# Preparing data
# Index the metadata by ss_id so it can be joined on that key
metadata = metadata.set_index('ss_id')

# Combine the 2019 table above with the system metadata
region_date_gen_loc_2019 = region_date_gen_2019.join(metadata, on='ss_id', how='left')
region_date_gen_loc_2019.head(2)

In [None]:
#Creating a graph that will show Average power generation in the UK by month in 2019

fig = px.scatter_geo(region_date_gen_loc_2019, lat="latitude_rounded", lon="longitude_rounded", color="generation_wh",
                     hover_name="ss_id", size="generation_wh",
                     animation_frame="month",
                     projection="natural earth",
                     scope ='europe',
                     color_continuous_scale=["blue", "purple", "red"],
                     center=dict(lat=region_date_gen_loc_2019.latitude_rounded.mean(), lon=region_date_gen_loc_2019.longitude_rounded.mean()))

fig.update_traces(marker=dict(size=10))
fig.update_layout(autosize=True,height=600,geo=dict(projection_scale=6))
fig.show()

- How does average energy generation change by hour for 2019?

On average, the majority of power is generated between 9:00 and 14:00.

In [None]:
# Power generated per hour by region (2019)
power_per_hour = region_date_gen_2019.groupby(['region','hour']).agg({'generation_wh': 'mean'}).reset_index()

plt.figure(figsize=(40,20))
ax = sns.barplot(data=power_per_hour, x='hour', y='generation_wh',hue='region', palette=sns.color_palette('icefire'))

plt.xlabel('Hour') 
plt.ylabel('Average Power Generated (Wh)')   
plt.title("Average Power Generated (Wh) by Hour and District")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
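
The 9:00 to 14:00 observation can be expressed as a share of total generation. A sketch on made-up hourly means; treating the window as hours 9 through 14 inclusive is an assumption about how the range is meant:

```python
import pandas as pd

# Made-up hourly means with the same columns as power_per_hour (region omitted)
hourly = pd.DataFrame({
    "hour": [6, 8, 9, 11, 13, 14, 17, 19],
    "generation_wh": [5.0, 20.0, 40.0, 60.0, 50.0, 35.0, 10.0, 0.0],
})

# between() is inclusive on both ends, so this selects hours 9 through 14
in_window = hourly["hour"].between(9, 14)
share = hourly.loc[in_window, "generation_wh"].sum() / hourly["generation_wh"].sum()

print(f"{share:.0%} of generation falls in hours 9 through 14")
```
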

- How does power generation change by Season and Hour?

Power generation increases from 9:00 to 14:00. Barely any power is generated from 0:00 to 4:00 and from 20:00 to 21:00. This trend holds regardless of season.

More power is generated during the spring and summer.

The least power is generated during the winter.
 



In [None]:
# Creating a column for season
season_dict = {1: 'Winter',
               2: 'Winter',
               3: 'Spring', 
               4: 'Spring',
               5: 'Spring',
               6: 'Summer',
               7: 'Summer',
               8: 'Summer',
               9: 'Fall',
               10: 'Fall',
               11: 'Fall',
               12: 'Winter'}
region_date_gen_loc_2019['Season'] = region_date_gen_loc_2019['month'].map(season_dict)

# Grouping by season and hour and measuring the average power generated
per_hour_season = region_date_gen_loc_2019.groupby(['Season','hour']).agg({'generation_wh': 'mean'}).reset_index()

# Plotting
plt.figure(figsize=(40,20))
ax = sns.barplot(data=per_hour_season, x='hour', y='generation_wh',hue='Season', palette=sns.color_palette('Paired'))

plt.xlabel('Hour') 
plt.ylabel('Average Power Generated (Wh)')   
plt.title("Average Power Generated (Wh) by Hour and Season")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
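
Season labels sort alphabetically by default, which puts 'Fall' first in grouped output and legends. One option is an ordered categorical; the sketch below uses made-up months, and the chosen order (Winter first) is an assumption:

```python
import pandas as pd

# Same month-to-season mapping as above, built from season -> months
season_months = {"Winter": (12, 1, 2), "Spring": (3, 4, 5),
                 "Summer": (6, 7, 8), "Fall": (9, 10, 11)}
season_by_month = {m: season for season, months in season_months.items() for m in months}

months = pd.Series([10, 1, 7, 4, 12], name="month")
order = ["Winter", "Spring", "Summer", "Fall"]

# Ordered categorical: sorting and grouping follow `order`, not the alphabet
seasons = pd.Series(pd.Categorical(months.map(season_by_month), categories=order, ordered=True))

print(seasons.sort_values().tolist())
```
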