# Intro to the OpenOA `PlantData` and QA Methods

In this example we use the ENGIE open data set for the La Haute Borne wind power plant to demonstrate how the quality assurance (QA) methods in OpenOA help get this data ready for use with the `PlantData` class. This notebook walks through the creation of the `project_ENGIE` module, especially the `prepare()` method, which returns either the cleaned data or a `PlantData` object.


## Using ENGIE's open data set

ENGIE provides access to the data of its 'La Haute Borne' wind farm through https://opendata-renewables.engie.com and through an API. The data can be used to create additional turbine objects and gives users the opportunity to work with further real-world data. 

The series of notebooks in the 'examples' folder uses SCADA data downloaded from https://opendata-renewables.engie.com, saved in the 'examples/data' folder. Additional plant level meter, availability, and curtailment data were synthesized based on the SCADA data.

## Imports

In [2]:
import numpy as np
import pandas as pd
from openoa import PlantData
from openoa.utils import qa

import project_ENGIE

## Step 1: Load the SCADA data

First we need to unzip the data and read the SCADA data into a pandas `DataFrame` so that we can inspect it before we start working with it.

In [5]:
data_path = "data/la_haute_borne"
project_ENGIE.extract_data(data_path)

scada_df = pd.read_csv(f"{data_path}/la-haute-borne-data-2014-2015.csv")

scada_df.head(10)

Unnamed: 0,Wind_turbine_name,Date_time,Ba_avg,P_avg,Ws_avg,Va_avg,Ot_avg,Ya_avg,Wa_avg
0,R80736,2014-01-01T01:00:00+01:00,-1.0,642.78003,7.12,0.66,4.69,181.34,182.00999
1,R80721,2014-01-01T01:00:00+01:00,-1.01,441.06,6.39,-2.48,4.94,179.82001,177.36
2,R80790,2014-01-01T01:00:00+01:00,-0.96,658.53003,7.11,1.07,4.55,172.39,173.50999
3,R80711,2014-01-01T01:00:00+01:00,-0.93,514.23999,6.87,6.95,4.3,172.77,179.72
4,R80790,2014-01-01T01:10:00+01:00,-0.96,640.23999,7.01,-1.9,4.68,172.39,170.46001
5,R80736,2014-01-01T01:10:00+01:00,-1.0,511.59,6.69,-3.34,4.7,181.34,178.02
6,R80711,2014-01-01T01:10:00+01:00,-0.93,692.33002,7.68,4.72,4.38,172.77,177.49001
7,R80721,2014-01-01T01:10:00+01:00,-1.01,457.76001,6.48,-4.93,5.02,179.82001,174.91
8,R80711,2014-01-01T01:20:00+01:00,-0.93,580.12,7.35,6.84,4.2,172.77,179.59
9,R80721,2014-01-01T01:20:00+01:00,-1.01,396.26001,6.16,-1.94,4.88,179.82001,177.85001


The timestamps in the `Date_time` column show that timezone information is encoded and that the data have a 10-minute frequency (or "10T" according to the pandas offset aliases: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases).
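
As a quick sanity check on the 10-minute frequency, the spacing between consecutive timestamps of a single turbine can be computed directly. The following is a minimal sketch using only pandas, with three timestamps copied from the R80711 rows above; the variable names are illustrative and are not part of OpenOA.

```python
import pandas as pd

# Three consecutive timestamps for turbine R80711, copied from the table above
raw = pd.Series(
    [
        "2014-01-01T01:00:00+01:00",
        "2014-01-01T01:10:00+01:00",
        "2014-01-01T01:20:00+01:00",
    ]
)

# Parse to UTC so that the encoded +01:00 offset is honored
parsed = pd.to_datetime(raw, utc=True)

# The most common spacing between consecutive timestamps
step = parsed.diff().dropna().value_counts().idxmax()
print(step)
```

On the full data set the same check would need to be run per turbine, since rows from different turbines share timestamps. Recent pandas releases deprecate the "T" alias in favor of "min", so "10min" is the safer spelling when a frequency string has to be passed to pandas directly.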

To demonstrate the breadth of data that the QA methods are intended to handle, this demonstration steps through the data in its current format and in an alternative format where the timezone information has been stripped out.

In [12]:
scada_df_tz = scada_df.loc[:, :].copy()  # timezone aware
scada_df_no_tz = scada_df.loc[:, :].copy()  # timezone unaware

# Remove the timezone information from the timezone unaware example dataframe
scada_df_no_tz['Date_time'] = [f"{el[0:10]} {el[11:19]}" for el in scada_df_no_tz["Date_time"]]

# Show the resulting change
scada_df_no_tz.head()

Unnamed: 0,Wind_turbine_name,Date_time,Ba_avg,P_avg,Ws_avg,Va_avg,Ot_avg,Ya_avg,Wa_avg
0,R80736,2014-01-01 01:00:00,-1.0,642.78003,7.12,0.66,4.69,181.34,182.00999
1,R80721,2014-01-01 01:00:00,-1.01,441.06,6.39,-2.48,4.94,179.82001,177.36
2,R80790,2014-01-01 01:00:00,-0.96,658.53003,7.11,1.07,4.55,172.39,173.50999
3,R80711,2014-01-01 01:00:00,-0.93,514.23999,6.87,6.95,4.3,172.77,179.72
4,R80790,2014-01-01 01:10:00,-0.96,640.23999,7.01,-1.9,4.68,172.39,170.46001
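
The string slicing used above simply drops the encoded offset and keeps the wall-clock time. For comparison, the same result can be sketched with pandas datetime methods, assuming the offsets in the file are those of the `Europe/Paris` timezone. The two sample timestamps below (one in winter, one in summer) are made up for illustration.

```python
import pandas as pd

# Illustrative timestamps: one with the winter offset, one with the summer offset
raw = pd.Series(["2014-01-01T01:00:00+01:00", "2014-07-01T01:00:00+02:00"])

# Approach used in the notebook: slice the date and time parts out of the string
sliced = [f"{el[0:10]} {el[11:19]}" for el in raw]

# pandas alternative: parse as UTC, convert to local time, then drop the timezone
aware = pd.to_datetime(raw, utc=True).dt.tz_convert("Europe/Paris")
naive = aware.dt.tz_localize(None)
as_text = naive.dt.strftime("%Y-%m-%d %H:%M:%S")

print(sliced)
print(as_text.tolist())
```

Parsing with `utc=True` first avoids mixing different fixed offsets in one column, which pandas does not handle as a single datetime dtype.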


Below, we can see the data types for each of the columns. Note that the timestamps are not yet encoded as datetime objects; at this point pandas treats them as generic `object` (string) data.

In [13]:
scada_df_tz.dtypes

Wind_turbine_name     object
Date_time             object
Ba_avg               float64
P_avg                float64
Ws_avg               float64
Va_avg               float64
Ot_avg               float64
Ya_avg               float64
Wa_avg               float64
dtype: object

In [14]:
scada_df_no_tz.dtypes

Wind_turbine_name     object
Date_time             object
Ba_avg               float64
P_avg                float64
Ws_avg               float64
Va_avg               float64
Ot_avg               float64
Ya_avg               float64
Wa_avg               float64
dtype: object

## Step 2: Convert the timestamps to proper timestamp data objects

Using the `qa.convert_datetime_column()` method, we can convert the timestamp data and insert the UTC-encoded data as an index for both the timezone-aware and the timezone-unaware data sets.

Under the hood, this method performs a few helpful steps to create the resulting data set:
1) Converts the column "Date_time" to a datetime object
2) Creates the new datetime columns: "Date_time_localized" and "Date_time_utc" for the localized and UTC-encoded datetime objects
3) Sets the UTC timestamp as the index
4) Creates the column "utc_offset" containing the difference between the UTC timestamp and the localized timestamp that will be used to determine if the timestamp is in DST or not.
5) Creates the column "is_dst" indicating if the timestamps are in DST (`True`), or not (`False`) that will be used later when trying to assess time gaps and duplications in the data

Notice in the resulting data that the "Date_time" column becomes a localized timestamp in the timezone-aware example, but is kept as a non-localized timestamp in the timezone-unaware example.
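
The five steps listed above can be approximated in plain pandas. This is a rough sketch for timezone-unaware input, not OpenOA's actual implementation: the two-row sample data are made up, and the way `is_dst` is derived here (from the timezone's own DST rule) is an assumption.

```python
import pandas as pd

# Made-up, timezone-unaware sample: one winter and one summer timestamp
df = pd.DataFrame(
    {
        "Date_time": ["2014-01-01 01:00:00", "2014-07-01 01:00:00"],
        "P_avg": [642.78, 511.59],
    }
)

# 1) Convert the column to datetime objects
df["Date_time"] = pd.to_datetime(df["Date_time"])

# 2) Create the localized and UTC-encoded columns
df["Date_time_localized"] = df["Date_time"].dt.tz_localize("Europe/Paris")
df["Date_time_utc"] = df["Date_time_localized"].dt.tz_convert("UTC")

# 3) Set the UTC timestamp as the index, keeping the column as well
df = df.set_index("Date_time_utc", drop=False)

# 4) UTC offset: local wall-clock time minus UTC wall-clock time
df["utc_offset"] = (
    df["Date_time_localized"].dt.tz_localize(None)
    - df["Date_time_utc"].dt.tz_localize(None)
)

# 5) DST flag, taken from the timezone's DST rule (assumed approach)
df["is_dst"] = df["Date_time_localized"].map(lambda ts: bool(ts.dst()))

print(df[["Date_time_localized", "utc_offset", "is_dst"]])
```

Real data that crosses a DST transition can contain local times that are ambiguous or do not exist, in which case `tz_localize()` raises unless its `ambiguous` and `nonexistent` arguments are set.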



Below is what the updated DataFrame looks like after being read in and manipulated for the initial setup. Notice that there is now a UTC offset column, which directly translates to the `True`/`False` value in the `is_dst` column indicating whether a particular timestamp is in Daylight Saving Time (if it is used at all in the time zone).

In the output below, the `Date_time_utc` column should always remain in UTC and the `Date_time_localized` column should always remain in the local time. Conveniently, pandas provides two methods, `tz_convert()` and `tz_localize()`, to switch between timezones; when called on a DataFrame they operate on its index. It is worth noting that the local time could also be UTC, in which case the two columns would be redundant.
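
To illustrate how `tz_convert()` and `tz_localize()` act on a DataFrame's index, here is a small self-contained sketch. The three-row frame and its power values are illustrative only.

```python
import pandas as pd

# Small UTC-indexed frame with illustrative power values
idx = pd.date_range("2014-01-01 00:00", periods=3, freq="10min", tz="UTC")
df = pd.DataFrame({"P_avg": [642.78, 640.24, 580.12]}, index=idx)

# tz_convert: same instants in time, displayed in another timezone
local = df.tz_convert("Europe/Paris")
back_to_utc = local.tz_convert("UTC")

# tz_localize(None): drop the timezone but keep the local wall-clock time
naive = local.tz_localize(None)

# tz_localize(tz): attach a timezone to timezone-unaware timestamps
relocalized = naive.tz_localize("Europe/Paris")

print(local.index[0])
print(naive.index[0])
```

`tz_convert()` requires an index that is already timezone-aware, whereas `tz_localize()` with a timezone name requires a timezone-unaware one.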

The localized time, even when the passed data is unaware, is adjusted using the `local_tz` keyword argument to help normalize the time strings, from which a UTC-based timestamp is created (even when local is also UTC). By calculating the UTC time from the local time, we are able to ascertain DST shifts in the data, and better assess any anomalies that may exist.

However, there may be cases where the timezone is not encoded (as in the timezone-unaware example here) or not known at all. In the former case, we can use the `local_tz` keyword argument seen in the code below; the latter is much more difficult, and the default value of UTC may not be accurate. In this latter case it is useful to try multiple timezones, such as that of the operating/owner company's headquarters or, often, of the wind farm's location, to find a best fit. For a US-based wind farm, the subclass `WindToolKitQualityControlDiagnosticSuite` can be used to help better match a timezone to the data provided.
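
One simple way to screen candidate timezones is to count how many timestamps could not exist in each one because they fall inside that timezone's spring-forward gap. The sketch below is only a heuristic under stated assumptions: `count_nonexistent` is a hypothetical helper, not an OpenOA function, and the timestamps are synthetic local Paris times. A non-zero count argues against a candidate, but a zero count does not confirm it.

```python
import numpy as np
import pandas as pd


def count_nonexistent(naive: pd.Series, tz: str) -> int:
    """Count wall-clock times that fall in a spring-forward gap of ``tz``."""
    localized = naive.dt.tz_localize(
        tz,
        nonexistent="NaT",  # flag times skipped by the DST shift
        ambiguous=np.ones(len(naive), dtype=bool),  # treat repeated times as DST
    )
    return int(localized.isna().sum())


# Synthetic 10-minute data recorded in Paris local time, built from UTC so that
# the hour skipped on 2014-03-30 is absent, as it would be in real local data
utc = pd.date_range("2014-03-08 00:00", "2014-03-31 00:00", freq="10min", tz="UTC")
naive = pd.Series(utc.tz_convert("Europe/Paris").tz_localize(None))

for candidate in ["Europe/Paris", "America/Denver", "UTC"]:
    print(candidate, count_nonexistent(naive, candidate))
```

Here the `America/Denver` candidate yields a non-zero count because the data contain local times on 2014-03-09 that the US DST shift skips, while `Europe/Paris` and `UTC` both yield zero. Separating those two would need a further check, such as looking for the one-hour gap in the timezone-unaware timestamps.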

In [16]:
scada_df_tz = qa.convert_datetime_column(
    df=scada_df_tz,
    time_col="Date_time",
    local_tz="Europe/Paris",
    tz_aware=True  # Indicate that we can use encoded data to convert between timezones
)
scada_df_tz.head()

Date_time_utc,Wind_turbine_name,Date_time,Ba_avg,P_avg,Ws_avg,Va_avg,Ot_avg,Ya_avg,Wa_avg,Date_time_localized,Date_time_utc,utc_offset,is_dst
2014-01-01 00:00:00+00:00,R80736,2014-01-01 01:00:00+01:00,-1.0,642.78003,7.12,0.66,4.69,181.34,182.00999,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80721,2014-01-01 01:00:00+01:00,-1.01,441.06,6.39,-2.48,4.94,179.82001,177.36,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80790,2014-01-01 01:00:00+01:00,-0.96,658.53003,7.11,1.07,4.55,172.39,173.50999,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80711,2014-01-01 01:00:00+01:00,-0.93,514.23999,6.87,6.95,4.3,172.77,179.72,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:10:00+00:00,R80790,2014-01-01 01:10:00+01:00,-0.96,640.23999,7.01,-1.9,4.68,172.39,170.46001,2014-01-01 01:10:00+01:00,2014-01-01 00:10:00+00:00,0 days 01:00:00,False


In [18]:
print(scada_df_tz.index.dtype)
scada_df_tz.dtypes

datetime64[ns, UTC]


Wind_turbine_name                            object
Date_time              datetime64[ns, Europe/Paris]
Ba_avg                                      float64
P_avg                                       float64
Ws_avg                                      float64
Va_avg                                      float64
Ot_avg                                      float64
Ya_avg                                      float64
Wa_avg                                      float64
Date_time_localized    datetime64[ns, Europe/Paris]
Date_time_utc                   datetime64[ns, UTC]
utc_offset                          timedelta64[ns]
is_dst                                         bool
dtype: object

In [20]:
scada_df_no_tz = qa.convert_datetime_column(
    df=scada_df_no_tz,
    time_col="Date_time",
    local_tz="Europe/Paris",
    tz_aware=False  # Indicates that we're going to need to make inferences about encoding the timezones
)
scada_df_no_tz.head()

Date_time_utc,Wind_turbine_name,Date_time,Ba_avg,P_avg,Ws_avg,Va_avg,Ot_avg,Ya_avg,Wa_avg,Date_time_localized,Date_time_utc,utc_offset,is_dst
2014-01-01 00:00:00+00:00,R80736,2014-01-01 01:00:00,-1.0,642.78003,7.12,0.66,4.69,181.34,182.00999,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80721,2014-01-01 01:00:00,-1.01,441.06,6.39,-2.48,4.94,179.82001,177.36,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80790,2014-01-01 01:00:00,-0.96,658.53003,7.11,1.07,4.55,172.39,173.50999,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:00:00+00:00,R80711,2014-01-01 01:00:00,-0.93,514.23999,6.87,6.95,4.3,172.77,179.72,2014-01-01 01:00:00+01:00,2014-01-01 00:00:00+00:00,0 days 01:00:00,False
2014-01-01 00:10:00+00:00,R80790,2014-01-01 01:10:00,-0.96,640.23999,7.01,-1.9,4.68,172.39,170.46001,2014-01-01 01:10:00+01:00,2014-01-01 00:10:00+00:00,0 days 01:00:00,False


In [21]:
print(scada_df_no_tz.index.dtype)
scada_df_no_tz.dtypes

datetime64[ns, UTC]


Wind_turbine_name                            object
Date_time                            datetime64[ns]
Ba_avg                                      float64
P_avg                                       float64
Ws_avg                                      float64
Va_avg                                      float64
Ot_avg                                      float64
Ya_avg                                      float64
Wa_avg                                      float64
Date_time_localized    datetime64[ns, Europe/Paris]
Date_time_utc                   datetime64[ns, UTC]
utc_offset                          timedelta64[ns]
is_dst                                         bool
dtype: object

## Step 3: Dive into the data