Project Ideas


suggested project: analyse wind speed around the country with a view to a wind farm


## Project Plan

__Research wind farms in Ireland__

- where are they usually located?
- what wind conditions do they need? any other site considerations? Rural locations.
- how much electricity do they generate? summer vs winter?
- cost?
- lifespan?
- advantages
- disadvantages
- anything else?

__Project questions__

What's the relationship between wind speed and power generated? Does the wind direction affect power generation? 

Is there a trend in wind speed? Is Ireland getting windier? Variations across the year? Time of day?

Is the technology in wind turbines improving? Is more electricity being generated for the same wind speed?

Does rain/temperature/anything affect the output? 

What happens during a storm? Does amount of wind generated electricity decrease/increase? 

Predict power output for wind farms in Ireland for the next week. Tricky

As I have weather information, could solar power fill the gaps when wind speeds are low? Probably too big a task for this project. 


__Find data__

Weather data from Met Éireann historical data. Data can be selected by site. Perhaps initially analyse data for a number of weather stations near a wind farm and also weather stations not near a wind farm. From the data can I see why that site was selected?



Is there much variation in wind across the country? Eirgrid data for entire country. 

## Introduction

Background information

https://windenergyireland.com/about-wind/the-basics/facts-stats

## The Data

About the data set. 

## Organising and Cleaning the Data


Would be convenient to have all the data in one large data set. Need to research working with large data sets. More difficult to load than smaller data sets.

Clean data

In [9]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import re

## Wind electricity data

https://www.smartgriddashboard.com/#all

[Eirgrid real-time system information](https://www.eirgrid.ie/grid/real-time-system-information). On the Eirgrid website it is only possible to view information for one day at a time and up to one month ago. Despite extensive searching I couldn't find an official source of Eirgrid historical data. I did find a [GitHub repository by Daniel Parke](https://github.com/Daniel-Parke/EirGrid_Data_Download/tree/main), who has written a very helpful Python file to download all the historical data. His GitHub repository contains raw csv files for the actual amount of electricity generated, the actual demand and the actual amount of electricity produced by wind for every year from 2014, for all Ireland, Northern Ireland and the Republic of Ireland. I will need to run his program to get the most up to date data for 2024.

As my weather data will be only for the Republic of Ireland, I am only interested in the csv files for the actual amount of electricity produced by wind for the Republic of Ireland. Each csv file, containing one year's worth of information, was downloaded from the GitHub repository. After reading the data into pandas the next task will be to merge the files vertically using pd.concat(). Before I started on the onerous task of loading and concatenating each file separately, I came across a blog post on how to [read multiple csv files into pandas](https://saturncloud.io/blog/how-to-read-multiple-csv-files-into-python-pandas-dataframe). 

The solution to reading multiple files into pandas uses the glob module. Glob is a built-in module used to retrieve files/pathnames matching a specified pattern. It uses * wildcards to make path retrieval simpler and more convenient ([GeeksforGeeks](https://www.geeksforgeeks.org/how-to-use-glob-function-to-find-files-recursively-in-python/)). [Real Python](https://realpython.com/get-all-files-in-directory-python/#conditional-listing-using-glob) states that glob.glob() returns a list of filenames that match a pattern, which in this case are csv files. 

```python
# Search for all csv files in the current working directory
import glob
glob.glob('*.csv')
```
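The glob-then-concatenate idea described above can be sketched roughly as follows. This is only an illustration: the helper name `read_csv_folder` is made up for the example, and it assumes the csv files have no header row. It collects the dataframes in a list and calls pd.concat() once at the end, which avoids copying a growing dataframe on every pass through a loop.

```python
import glob
import os

import pandas as pd


def read_csv_folder(folder, column_names):
    """Read every csv file in `folder` and stack them into one dataframe.

    Hypothetical helper for illustration; assumes the files have no header row.
    """
    # sorted() gives a reproducible file order (glob makes no ordering guarantee)
    csv_files = sorted(glob.glob(os.path.join(folder, '*.csv')))

    # One dataframe per file, all with the same column names
    frames = [pd.read_csv(f, header=None, names=column_names) for f in csv_files]

    # Concatenate once at the end rather than inside the loop
    return pd.concat(frames, ignore_index=True)
```

Note that pd.concat() raises a ValueError if the folder contains no csv files, so the path is worth checking first.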

Eirgrid have data on actual wind generation and the forecast wind generation. Could getting forecast information be of interest? Might help with machine learning. Github repository only contains actual data not forecast data. 

set up a scheduled task to download the data at midnight?


In [None]:
## This was the first step to create a single dataframe from multiple csv files. 
## 
## Load the wind electricity files
#
## Find all csv files in the data/electricity directory
#csv_files = glob.glob('data/electricity/*.csv')
#
## Create an empty dataframe to store the combined data
#electricity_df = pd.DataFrame()
#
## Loop through each CSV file and append its contents to the combined dataframe
#for csv_file in csv_files:
#    df = pd.read_csv(csv_file, 
#                     header = None, 
#                     names = ['date', 'wind_actual', 'location', 'wind_value'], 
#                     index_col= 'date',
#                     parse_dates= ['date'],
#                     usecols= ['date', 'wind_value'])
#    electricity_df = pd.concat([electricity_df, df])
#
#electricity_df.head()

```python
electricity_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 361152 entries, 2021-01-01 00:00:00 to 2020-01-01 21:45:00
Data columns (total 1 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   wind_value  361004 non-null  float64
dtypes: float64(1)
memory usage: 5.5 MB
```

In [None]:
electricity_df.info()

In [None]:
electricity_df.isna().sum()

There are a lot of duplicated rows. Each csv file also contains data for 1 January of the following year. 

In [None]:
#electricity_df.index.duplicated().sum()

In [None]:
# https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
#
#electricity_df = electricity_df[~electricity_df.index.duplicated(keep= 'first')]
#electricity_df.head()

In [None]:
#electricity_df = electricity_df.sort_index()

In [None]:
#electricity_df.shape

In [None]:
#electricity_df.to_csv('data/electricity/electricity_data.csv')

In [None]:
hourly_electricity_df = electricity_df.resample('h').mean()
hourly_electricity_df.head()

In [None]:
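# Save to the merged_data folder (it must already exist) so that the glob search
# of data/electricity/*.csv does not pick this file up as raw data
hourly_electricity_df.to_csv('data/electricity/merged_data/hourly_electricity.csv')

In [None]:
# Read the dates back in as a DatetimeIndex
hourly_electricity_df = pd.read_csv('data/electricity/merged_data/hourly_electricity.csv',
                                    index_col= 'date',
                                    parse_dates= ['date'])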
hourly_electricity_df.head()

## Weather Data

The weather data was downloaded from Met Éireann's historical data site for a range of weather stations. Most of the stations were selected for their proximity to a wind farm. A number were selected for the size of their data sets, to see if Ireland is getting windier. The oldest weather stations with data are Dublin Airport and Valentia, which have data from 1 Jan 1944. 

Met Éireann weather data is recorded hourly. The electricity data is recorded every 15 minutes, so it was resampled to 1 hour and saved to hourly_electricity.csv. Should I read in this single csv file? 
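As a rough sketch of the resampling step, the example below averages 15-minute values into hourly values and then lines them up with hourly weather readings. All of the numbers are synthetic and the variable names are made up for the illustration; it assumes both dataframes have a DatetimeIndex.

```python
import pandas as pd

# Synthetic 15-minute generation values covering two hours
quarter_hourly = pd.DataFrame(
    {'wind_value': [100.0, 110.0, 120.0, 130.0, 200.0, 210.0, 220.0, 230.0]},
    index=pd.date_range('2014-01-01 00:00', periods=8, freq='15min'),
)

# Synthetic hourly wind speeds for the same two hours
hourly_weather = pd.DataFrame(
    {'wdsp': [10.0, 12.0]},
    index=pd.date_range('2014-01-01 00:00', periods=2, freq='h'),
)

# Average the four 15-minute readings in each hour
hourly_electricity = quarter_hourly.resample('h').mean()

# Align on the shared hourly index; 'inner' keeps only hours present in both
combined = hourly_electricity.join(hourly_weather, how='inner')
print(combined)
```

Taking the mean keeps the hourly value in the same units as the 15-minute values. Whether the mean or the sum is more appropriate depends on what the electricity figures represent.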


__What I'm aiming to do__

Look at wind speeds for the entire country. Is there much variability? Electricity generation values are for the entire country. 

- Read all the csv files in the weather directory into pandas.
    - Problem: the csv files have different numbers of rows to skip. Write a function to find the unnecessary rows in each csv file.
    - The dataframe name would ideally be the location. This is proving very difficult. Have written a function to extract the location from the file name.
- Refine the dataframes to the years 2014 onwards. Write a function.
- Merge the dataframes.
    - Can all the merging be done in one step? Write a function.
- Analyse the data.
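On the question of merging in one step, one possible approach is to give each station's columns a suffix and pass all the dataframes to pd.concat() with axis=1. The sketch below uses a few values taken from the Gurteen and Mace Head outputs later in this notebook. The dictionary `station_dfs` is a made-up stand-in for however the loaded dataframes end up being stored.

```python
import pandas as pd

index = pd.date_range('2014-01-01', periods=3, freq='h', name='date')

# Tiny stand-ins for the per-station dataframes
station_dfs = {
    'gurteen': pd.DataFrame({'wdsp': [10.0, 9.0, 9.0], 'temp': [4.0, 3.8, 3.6]}, index=index),
    'macehead': pd.DataFrame({'wdsp': [21.0, 21.0, 21.0], 'temp': [8.2, 7.8, 8.4]}, index=index),
}

# Suffix each column with its station name so the column names stay unique
renamed = [df.add_suffix(f'_{name}') for name, df in station_dfs.items()]

# Join every station on the date index at once; 'inner' keeps dates common to all
all_stations = pd.concat(renamed, axis=1, join='inner')

print(all_stations.columns.tolist())
```

This scales to any number of stations without chaining merge() calls, at the cost of one wide dataframe.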


### Is there much variation in wind speed across the country? 

Electricity generated is given for ROI not broken down by wind farm.

Electricity data from 2014, so to compare wind speed and amount of electricity generated by wind only need weather data from 2014. 


Write a function to read in the weather data. The basic elements are the same for every file: skiprows, na_values, usecols, parse_dates.

What columns are needed? 
date, rain, temp, msl, wdsp, wddir.

sun, clht and clamt are not recorded for all weather stations, so they are not of interest. 

Remove the explanatory rows at the top of each csv file. 

In [48]:
def skip_rows(csv_file):
    '''A function to find the number of metadata rows to skip in a weather csv file.'''

    with open(csv_file, 'r') as file:
        # The data starts at the header row, which begins with 'date,'
        for i, line in enumerate(file):
            if line.lower().startswith('date,'):
                return i

    # Fail clearly instead of returning an undefined value
    raise ValueError(f'No header row starting with "date," found in {csv_file}')

In [11]:
data_start_idx = skip_rows('data/weather/hly275MaceHead.csv')

In [12]:
data_start_idx

17

In [13]:
skip_rows('data/weather/hly275MaceHead.csv')

17

In [14]:
skip_rows('data/weather/hly518ShannonAirport.csv')

23
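Since the number of rows to skip differs between stations (17 and 23 above), the detected header row could be passed straight to pd.read_csv() instead of being typed in by hand. The sketch below is an illustration under stated assumptions: the file contents are made up, `find_header_row` is a hypothetical in-memory variant of `skip_rows`, and passing a callable to usecols is one way to cope with columns that are not present for every station.

```python
import io

import pandas as pd

# Made-up file contents: a few explanatory lines, then the header and data
raw_text = """Station Name: EXAMPLE
Station Height: 10 M
date: -  Date and Time
wdsp: -  Mean Wind Speed
date,ind,rain,temp,wdsp,sun
01-Jan-2014 00:00,0,0.2,8.2,21,0.0
01-Jan-2014 01:00,0,0.2,7.8,21,0.0
"""


def find_header_row(lines):
    """Return the index of the first line that starts with 'date,'."""
    for i, line in enumerate(lines):
        if line.lower().startswith('date,'):
            return i
    raise ValueError('No header row starting with "date," was found')


wanted = ['date', 'rain', 'temp', 'msl', 'wdsp', 'wddir']
header_row = find_header_row(raw_text.splitlines())

df = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=header_row,
    # A callable keeps only the wanted columns that actually exist in this file
    usecols=lambda col: col in wanted,
    index_col='date',
    parse_dates=['date'],
    date_format='%d-%b-%Y %H:%M',
)
print(df)
```

Note that the explanatory line beginning `date:` is not mistaken for the header, because the check looks for `date,` with a comma.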

In [15]:
def extract_location(file_name):
    '''A function to extract the location from the file name'''

    # The dot is escaped so that it only matches a literal '.' before 'csv'
    pattern = r'hly\d{3,4}([A-Z][a-z]+[A-Z]?[a-z]+)\.csv$'

    match = re.search(pattern, file_name)

    if match:
        return match.group(1).lower()
    else:
        raise ValueError('File name does not match the expected pattern')


In [47]:
# Extract the location from the csv files in weather
csv_files = glob.glob('data/weather/*.csv')

location = []

for file in csv_files:
    name = extract_location(file)
    location.append(name)
print(location)

['johnstown', 'mullingar', 'athenry', 'gurteen', 'macehead', 'finner']


In [39]:
# Extract the location from the csv files in weather
csv_files = glob.glob('data/weather/*.csv')

location_dict = {}

for file in csv_files:
    name = extract_location(file)
    location_dict[name + '_df'] = file
print(location_dict)


{'johnstown_df': 'data/weather/hly1775Johnstown.csv', 'mullingar_df': 'data/weather/hly875Mullingar.csv', 'athenry_df': 'data/weather/hly1875Athenry.csv', 'gurteen_df': 'data/weather/hly1475Gurteen.csv', 'macehead_df': 'data/weather/hly275MaceHead.csv', 'finner_df': 'data/weather/hly2075Finner.csv'}


In [81]:
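def load_weather(file_name, n_skip_rows):
    '''A function to read in a weather csv file and keep the years 2014 to 2024.'''
       
    file_path = 'data/weather/'

    # n_skip_rows is the number of explanatory rows above the header row.
    # The parameter is not called skip_rows, to avoid hiding the skip_rows function.
    weather_df = pd.read_csv(file_path + file_name,
                    skiprows = n_skip_rows,
                    usecols= ['date', 'rain', 'temp', 'msl', 'wdsp', 'wddir'],
                    na_values = ' ',
                    index_col= 'date', 
                    parse_dates= ['date'], 
                    date_format = "%d-%b-%Y %H:%M"
                    )
    
    # Use .loc to make the row selection by date explicit
    weather_df = weather_df.loc['2014': '2024']

    return weather_df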

In [82]:
gurteen = load_weather('hly1475Gurteen.csv', 17)
gurteen.head()

| date | rain | temp | msl | wdsp | wddir |
|---|---|---|---|---|---|
| 2014-01-01 00:00:00 | 0.0 | 4.0 | 989.6 | 10.0 | 180.0 |
| 2014-01-01 01:00:00 | 0.0 | 3.8 | 989.3 | 9.0 | 180.0 |
| 2014-01-01 02:00:00 | 0.0 | 3.6 | 988.8 | 9.0 | 150.0 |
| 2014-01-01 03:00:00 | 0.0 | 2.9 | 988.0 | 11.0 | 170.0 |
| 2014-01-01 04:00:00 | 0.0 | 4.0 | 986.5 | 10.0 | 150.0 |


In [64]:
macehead = load_weather('hly275MaceHead.csv', 17)
macehead.head()

| date | rain | temp | msl | wdsp | wddir |
|---|---|---|---|---|---|
| 2014-01-01 00:00:00 | 0.2 | 8.2 | 985.7 | 21.0 | 200.0 |
| 2014-01-01 01:00:00 | 0.2 | 7.8 | 985.4 | 21.0 | 200.0 |
| 2014-01-01 02:00:00 | 0.0 | 8.4 | 984.6 | 21.0 | 190.0 |
| 2014-01-01 03:00:00 | 0.5 | 8.3 | 983.6 | 18.0 | 190.0 |
| 2014-01-01 04:00:00 | 0.0 | 7.2 | 982.4 | 19.0 | 150.0 |


In [66]:
mullingar = load_weather('hly875Mullingar.csv', 17)

In [69]:
johnstown = load_weather('hly1775Johnstown.csv', 17)

In [71]:
athenry = load_weather('hly1875Athenry.csv', 17)

In [72]:
finner = load_weather('hly2075Finner.csv', 17)

Refine the dataframes to data from 2014 onwards.

In [52]:
def select_years(df):
    '''A function to keep only the rows from 2014 to 2024.'''
    df = df.loc['2014': '2024']
    return df

In [59]:
macehead = select_years(macehead)

In [60]:
gurteen = select_years(gurteen)

Practice merging dataframes

In [61]:
gurteen_macehead = gurteen.merge(macehead, on= 'date', suffixes= ['_gur', '_mace'])
gurteen_macehead.head()

| date | rain_gur | temp_gur | msl_gur | wdsp_gur | wddir_gur | rain_mace | temp_mace | msl_mace | wdsp_mace | wddir_mace |
|---|---|---|---|---|---|---|---|---|---|---|
| 2014-01-01 00:00:00 | 0.0 | 4.0 | 989.6 | 10.0 | 180.0 | 0.2 | 8.2 | 985.7 | 21.0 | 200.0 |
| 2014-01-01 01:00:00 | 0.0 | 3.8 | 989.3 | 9.0 | 180.0 | 0.2 | 7.8 | 985.4 | 21.0 | 200.0 |
| 2014-01-01 02:00:00 | 0.0 | 3.6 | 988.8 | 9.0 | 150.0 | 0.0 | 8.4 | 984.6 | 21.0 | 190.0 |
| 2014-01-01 03:00:00 | 0.0 | 2.9 | 988.0 | 11.0 | 170.0 | 0.5 | 8.3 | 983.6 | 18.0 | 190.0 |
| 2014-01-01 04:00:00 | 0.0 | 4.0 | 986.5 | 10.0 | 150.0 | 0.0 | 7.2 | 982.4 | 19.0 | 150.0 |


In [None]:
location = extract_location('hly275MaceHead.csv')
print(location)

In [None]:
location = extract_location('hly1475Gurteen.csv')
print(location)

In [None]:
gurteen = load_weather('hly1475Gurteen.csv', 17)

Use a regex on the file name.

In [None]:
# Want the dataframe to have the station name in its name
#pattern = r'hly\d{3,4}([A-Z][a-z]+)\.csv'
#my_string = file_name
#match = re.search(pattern, my_string)
#name = match.group(1).lower()
#name_df = f"{name}_df"



In [None]:
# Want to use the file name to name the dataframe.
# Creating variable names from strings is awkward, so store each dataframe
# in a dictionary keyed by the name extracted from the csv file instead.
weather_dfs = {}

for file_name, n_skip in [('hly875Mullingar.csv', 17), ('hly2375Belmullet.csv', 23)]:
    name_df = extract_location(file_name) + '_df'
    weather_dfs[name_df] = load_weather(file_name, n_skip)

In [None]:
type(name_df)

In [None]:
weather_dfs['mullingar_df'].head()

In [None]:
weather_dfs['belmullet_df'].head()

In [None]:
belmullet_df = pd.read_csv('data/weather/hly2375Belmullet.csv', 
                           skiprows = 23,
                           usecols= ['date', 'rain', 'temp', 'msl', 'wdsp', 'wddir'],
                           na_values = ' ',
                           index_col= 'date', 
                           parse_dates= ['date'], 
                           date_format = "%d-%b-%Y %H:%M")

belmullet_df.head()


In [None]:
# Select rows from 2014

belmullet_df = belmullet_df.loc['2014' : '2024']

belmullet_df.head()

Merge weather dataframes



### Is Ireland getting windier?

Use Dublin Airport data, recorded from 1944. Valentia has also been recorded since then. Do Dublin first.

In [None]:
dublin_df = pd.read_csv('data/weather/hly532DublinAirport.csv', 
                        skiprows = 23, 
                        na_values = ' ',
                        index_col= 'date', 
                        parse_dates= ['date'], 
                        date_format = "%d-%b-%Y %H:%M")

dublin_df.head()

In [None]:
dublin_df.info()

In [None]:
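# Only wind speed is needed here, so only drop rows where wdsp is missing.
# A plain dropna() would also drop rows with a gap in any other column.
dublin_df = dublin_df.dropna(subset= ['wdsp'])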

In [None]:
dublin_df.info()

In [None]:
dublin_monthly = dublin_df.resample('ME')

In [None]:
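fig, ax = plt.subplots(figsize = (15, 5))

dublin_yearly = dublin_df.resample('YE')

# A pandas Series has no scatterplot method, so use seaborn to plot the yearly means
yearly_wdsp = dublin_yearly['wdsp'].mean()
sns.scatterplot(x= yearly_wdsp.index, y= yearly_wdsp.values, ax= ax)

plt.show()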

In [None]:
print(dublin_yearly['wdsp'].mean())
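One simple way to put a number on a trend in the yearly means is to fit a straight line with np.polyfit(). The sketch below uses synthetic annual means that fall by a known amount each year, purely to show the mechanics. With the real data the input would be something like the yearly mean wdsp against `index.year`. A least-squares line is only a first look, as it takes no account of changes in instruments or station surroundings over the decades.

```python
import numpy as np
import pandas as pd

# Synthetic annual mean wind speeds that fall by 0.02 per year (illustrative only)
years = np.arange(1944, 2024)
annual_mean = pd.Series(12.0 - 0.02 * (years - 1944), index=years)

# Fit a straight line: the slope is the average change in wind speed per year
slope, intercept = np.polyfit(years, annual_mean.values, deg=1)

print(f'Trend: {slope:.3f} per year')
```

A slope close to zero relative to the year-to-year scatter would suggest no clear trend.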

In [None]:
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 11, 9

decomposition = sm.tsa.seasonal_decompose(dublin_monthly['wdsp'].mean(), model= 'additive', period = 12)
fig = decomposition.plot()


In [None]:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose, STL , MSTL
#from statsforecast import StatsForecast
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
valentia_df = pd.read_csv('data/weather/hly2275Valentia.csv', 
                        skiprows = 23, 
                        na_values = ' ',
                        index_col= 'date', 
                        parse_dates= ['date'], 
                        date_format = "%d-%b-%Y %H:%M")

valentia_df.head()

In [None]:
fig, ax = plt.subplots(figsize = (15, 5))

valentia_df['wdsp'].resample('YE').mean().plot()

plt.show()

In [None]:
valentia_monthly = valentia_df.resample('ME')

In [None]:
rcParams['figure.figsize'] = 11, 9

decomposition = sm.tsa.seasonal_decompose(valentia_monthly['wdsp'].mean(), model= 'additive', period = 12)
fig = decomposition.plot()


## Exploratory Data Analysis

In [None]:
# Very quick plot of electricity generation by year

mean_wind_elect_year = electricity_df.resample('YE').mean()

mean_wind_elect_year.plot()
plt.show()

## Discussion of the Analysis

nice plots

## Machine Learning

some predictive analysis.

## Conclusion



## References


http://www.iwea.ie/technicalfaqs


### Data Sets

[GitHub Daniel Parke](https://github.com/Daniel-Parke/EirGrid_Data_Download/tree/main)


__Problems that arose__

[Git LFS (large file storage)](https://git-lfs.com/). Some of the weather data files were larger than GitHub's recommended maximum file size of 50.00 MB, so I installed and used Git LFS.
