# Cleaning Wind Energy Data

## Wind Energy Data

The CSV files of wind energy data for 2014 to 2023 were downloaded from [Daniel Parke's GitHub repository](https://github.com/Daniel-Parke/EirGrid_Data_Download/tree/main). The data for 2024 was obtained by running the script download_electricity.py, which was modified slightly from [eirgrid_downloader.py](https://github.com/Daniel-Parke/EirGrid_Data_Download/blob/main/eirgrid_downloader.py) to return the information required for this analysis.

A blog post entitled [How to Read Multiple CSV Files in Python Pandas Dataframe](https://saturncloud.io/blog/how-to-read-multiple-csv-files-into-python-pandas-dataframe) provided a straightforward solution for reading the CSV files and merging them in Pandas. The solution uses the glob module and assumes that all the CSV files have the same structure. [GeeksforGeeks.org](https://www.geeksforgeeks.org/how-to-use-glob-function-to-find-files-recursively-in-python/) states that glob is a built-in module used to retrieve files/pathnames matching a specified pattern. It uses wildcards, such as * and ?, to make path retrieval simpler and more convenient. Glob wildcards look similar to regular expressions but they can have different meanings. The glob wildcard * used here matches any sequence of characters.

[Real Python](https://realpython.com/get-all-files-in-directory-python/#conditional-listing-using-glob) states that glob.glob() returns a list of filenames that match a pattern, which in this case are CSV files. The CSV files found by glob in the data/electricity directory are read into Pandas and merged into a single dataframe using pd.concat().

```python
# Search for all CSV files in the current working directory
import glob
glob.glob('*.csv')
```
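The difference between the * and ? wildcards can be illustrated with a small sketch. The file names below are hypothetical and are created in a temporary directory purely for demonstration; they are not the files used in this analysis.

```python
import glob
import os
import tempfile

# Build a throwaway directory containing hypothetical file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['wind_14.csv', 'wind_15.csv', 'wind_2014.csv', 'notes.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # * matches any run of characters, so every CSV file is returned.
    all_csv = sorted(os.path.basename(path)
                     for path in glob.glob(os.path.join(tmp_dir, '*.csv')))

    # ? matches exactly one character, so only two-character suffixes match.
    two_digit = sorted(os.path.basename(path)
                       for path in glob.glob(os.path.join(tmp_dir, 'wind_??.csv')))

print(all_csv)
print(two_digit)
```

The results are sorted because glob.glob() does not guarantee the order of the returned paths.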

In [1]:
# Import modules
import pandas as pd
import glob

In [2]:
# Find all CSV files in the data/electricity directory
csv_files = glob.glob('data/wind_electricity/*.csv')

In [3]:
# Create an empty dataframe to store the combined data
electricity_df = pd.DataFrame()

# Loop through each CSV file found by glob and append contents to electricity_df
for csv_file in csv_files:
    df = pd.read_csv(csv_file, 
                     header = None, 
                     names = ['date', 'wind_actual', 'location', 'wind_energy'], 
                     index_col= 'date',
                     parse_dates= ['date'],
                     usecols= ['date', 'wind_energy'])
    
    # Concatenate df to electricity_df
    electricity_df = pd.concat([electricity_df, df])

# Sort electricity_df by index once, after all files have been added
electricity_df.sort_index(inplace= True)

electricity_df.head()

| date | wind_energy |
|---|---|
| 2014-01-01 00:00:00 | 1020.0 |
| 2014-01-01 00:15:00 | 995.0 |
| 2014-01-01 00:30:00 | 933.0 |
| 2014-01-01 00:45:00 | 959.0 |
| 2014-01-01 01:00:00 | 921.0 |
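An alternative pattern, sketched here as a suggestion rather than the approach used in this notebook, is to collect the dataframes in a list and call pd.concat() once, which avoids copying the growing dataframe on every loop iteration. The two sample files and their contents are hypothetical stand-ins written to a temporary directory; the placeholder values in the unused columns are assumptions.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two small hypothetical files with four columns and no header row.
    samples = {
        'sample_b.csv': '2014-01-01 00:30:00,x,ROI,933\n2014-01-01 00:45:00,x,ROI,959\n',
        'sample_a.csv': '2014-01-01 00:00:00,x,ROI,1020\n2014-01-01 00:15:00,x,ROI,995\n',
    }
    for name, text in samples.items():
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write(text)

    # Read every file into its own dataframe and keep them in a list.
    frames = [
        pd.read_csv(path,
                    header=None,
                    names=['date', 'wind_actual', 'location', 'wind_energy'],
                    index_col='date',
                    parse_dates=['date'],
                    usecols=['date', 'wind_energy'])
        for path in glob.glob(os.path.join(tmp_dir, '*.csv'))
    ]

# Concatenate once and sort once, after all files have been read.
combined_df = pd.concat(frames).sort_index()
print(combined_df)
```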


In [4]:
# Rename columns to include units
electricity_df.rename({'wind_energy' : 'Wind Energy (MW)'}, axis = 'columns', inplace= True)

In [5]:
# Shape of the dataframe
electricity_df.shape

(397344, 1)

In [6]:
electricity_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 397344 entries, 2014-01-01 00:00:00 to 2025-01-01 21:45:00
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Wind Energy (MW)  397177 non-null  float64
dtypes: float64(1)
memory usage: 6.1 MB


In [7]:
# Count the rows with missing information
electricity_df.isna().sum()

Wind Energy (MW)    167
dtype: int64

The data set has almost 400,000 rows, with the electricity generated from wind recorded every 15 minutes. Surprisingly few rows have missing data: only 167.

https://saturncloud.io/blog/how-to-find-all-rows-with-nan-values-in-python-pandas/

https://stackoverflow.com/questions/43424199/display-rows-with-one-or-more-nan-values-in-pandas-dataframe

In [8]:
# View the rows with missing data
nan_rows = electricity_df[electricity_df.isna().any(axis= 1)]
nan_rows

| date | Wind Energy (MW) |
|---|---|
| 2014-03-30 01:00:00 | NaN |
| 2014-03-30 01:15:00 | NaN |
| 2014-03-30 01:30:00 | NaN |
| 2014-03-30 01:45:00 | NaN |
| 2015-03-29 01:00:00 | NaN |
| ... | ... |
| 2025-01-01 20:45:00 | NaN |
| 2025-01-01 21:00:00 | NaN |
| 2025-01-01 21:15:00 | NaN |
| 2025-01-01 21:30:00 | NaN |
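The first gaps listed above fall between 01:00 and 01:45 on 30 March 2014 and 29 March 2015. Both dates are the last Sunday in March, when clocks in Ireland move forward by an hour, which suggests (but does not prove) that some gaps come from the clock change. Counting the missing rows per day is one way to look for such a pattern. The sketch below uses a small hypothetical series rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute series with a one-hour gap, for illustration only.
index = pd.date_range('2014-03-30 00:00', periods=12, freq='15min')
sample_df = pd.DataFrame({'Wind Energy (MW)': np.arange(12, dtype=float)},
                         index=index)
sample_df.loc['2014-03-30 01:00':'2014-03-30 01:45', 'Wind Energy (MW)'] = np.nan

# Select the rows with missing values, then count them per calendar day.
missing = sample_df[sample_df['Wind Energy (MW)'].isna()]
missing_per_day = missing.groupby(missing.index.date).size()
print(missing_per_day)
```

Applied to nan_rows, the same groupby would show whether the 167 missing rows cluster on a handful of dates.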


The missing data will be filled using the interpolate() function. [GeeksforGeeks.org](https://www.geeksforgeeks.org/interpolation-in-python/) describes interpolation as a method for generating points between known values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html

Why interpolate and not ffill?
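One way to explore this question is to compare the two methods on a small gap. The sketch below uses made-up values, not the real data: linear interpolation moves gradually between the known values on either side of the gap, whereas ffill() repeats the last known value until the next reading.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a four-step gap between two known readings.
index = pd.date_range('2014-03-30 00:45', periods=6, freq='15min')
sample = pd.Series([100.0, np.nan, np.nan, np.nan, np.nan, 200.0], index=index)

# Linear interpolation spreads the change evenly across the gap.
linear = sample.interpolate(method='linear')

# Forward fill holds the last known value, then jumps at the next reading.
filled = sample.ffill()

comparison = pd.DataFrame({'original': sample,
                           'interpolate': linear,
                           'ffill': filled})
print(comparison)
```

For a quantity that changes continuously, a gradual transition may be the more plausible estimate, although which method suits this data set better remains a judgment call.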

In [9]:
# Interpolate the missing rows
electricity_df.interpolate(method= 'linear', inplace= True)

In [10]:
f'There are {electricity_df.index.duplicated().sum()} duplicated rows.'

'There are 11528 duplicated rows.'

All the downloaded CSV files contain data for 1 January of the following year, e.g. [ROI_windactual_14_Eirgrid.csv](data\electricity\ROI_windactual_14_Eirgrid.csv) contains the data for 2014 and for 1 January 2015. When the data for 2015 is merged into the dataframe, it also contains data for 1 January 2015, which produces the duplicated rows.

https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices

In [11]:
# Remove any duplicated rows
electricity_df = electricity_df[~electricity_df.index.duplicated(keep= 'first')]
electricity_df.head()

| date | Wind Energy (MW) |
|---|---|
| 2014-01-01 00:00:00 | 1020.0 |
| 2014-01-01 00:15:00 | 995.0 |
| 2014-01-01 00:30:00 | 933.0 |
| 2014-01-01 00:45:00 | 959.0 |
| 2014-01-01 01:00:00 | 921.0 |


In [12]:
# Check that the duplicated rows have been removed.
f'There are {electricity_df.index.duplicated().sum()} duplicated rows.'

'There are 0 duplicated rows.'
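The way overlapping yearly files create duplicated index values, and how index.duplicated() removes them, can be reproduced on a toy example. The two dataframes below are hypothetical and contain made-up values; they only mimic one file ending on the same timestamp that the next file starts on.

```python
import pandas as pd

# Hypothetical end of one yearly file, spilling into 1 January of the next year.
year_2014 = pd.DataFrame(
    {'Wind Energy (MW)': [1.0, 2.0, 3.0]},
    index=pd.to_datetime(['2014-12-31 23:30', '2014-12-31 23:45', '2015-01-01 00:00']),
)

# Hypothetical start of the following yearly file, repeating the same timestamp.
year_2015 = pd.DataFrame(
    {'Wind Energy (MW)': [3.0, 4.0]},
    index=pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:15']),
)

merged = pd.concat([year_2014, year_2015]).sort_index()

# duplicated() flags every repeat of an index value after its first occurrence.
n_duplicates = merged.index.duplicated().sum()

# Keep only the first occurrence of each timestamp.
deduplicated = merged[~merged.index.duplicated(keep='first')]
print(n_duplicates)
print(deduplicated)
```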

In [13]:
# Write to csv file
electricity_df.to_csv('data/electricity/clean_data/electricity_data.csv')

The electricity data is recorded every 15 minutes. However, the weather data is recorded hourly, so the electricity data will be resampled to hourly means.

In [14]:
# Resample the 15-minute data to hourly means
hourly_electricity_df = electricity_df.resample('h').mean()
hourly_electricity_df.head()

| date | Wind Energy (MW) |
|---|---|
| 2014-01-01 00:00:00 | 976.75 |
| 2014-01-01 01:00:00 | 914.25 |
| 2014-01-01 02:00:00 | 938.5 |
| 2014-01-01 03:00:00 | 911.25 |
| 2014-01-01 04:00:00 | 915.0 |
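As a quick sanity check, the first hourly value can be reproduced by hand from the four 15-minute readings shown in the earlier head() output: (1020 + 995 + 933 + 959) / 4 = 976.75. The sketch below rebuilds just that first hour as a standalone series; it does not read the real files.

```python
import pandas as pd

# The four 15-minute readings for 2014-01-01 00:00 to 00:45, copied from above.
index = pd.date_range('2014-01-01 00:00', periods=4, freq='15min')
first_hour = pd.Series([1020.0, 995.0, 933.0, 959.0],
                       index=index,
                       name='Wind Energy (MW)')

# resample('h') groups the readings into hourly bins labelled by the bin start.
hourly = first_hour.resample('h').mean()
print(hourly)
```

Each hourly row is therefore labelled with the start of the hour and averages the readings from that time up to, but not including, the next hour.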


In [15]:
# Save the hourly data to a csv file
hourly_electricity_df.to_csv('data/electricity/clean_data/hourly_electricity.csv')