## Quality Check Diagnostic Work

This notebook illustrates some quality control steps that should be considered when analyzing a new dataset. In this example we'll use the `WindToolKitQualityControlDiagnosticSuite` class to automate some of the QC analysis for SCADA data.

The `WindToolKitQualityControlDiagnosticSuite` is a subclass of `QualityControlDiagnosticSuite` that adds methods for using the NREL WindToolKit database on top of all the base QC methods.

### Step 1: Load in Data

We can either load the data ourselves and pass it in, or pass in a full file path and have the QC class import the data file. We'll import the data first to give a glimpse into what the data look like.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd

from operational_analysis.methods.quality_check_automation import WindToolKitQualityControlDiagnosticSuite as QC

In [3]:
scada_df = pd.read_csv('./data/la_haute_borne/la-haute-borne-data-2014-2015.csv')

In [4]:
scada_df.head()

| Unnamed: 0 | Wind_turbine_name | Date_time | Ba_avg | P_avg | Ws_avg | Va_avg | Ot_avg | Ya_avg | Wa_avg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | R80736 | 2014-01-01T01:00:00+01:00 | -1.0 | 642.78003 | 7.12 | 0.66 | 4.69 | 181.34 | 182.00999 |
| 1 | R80721 | 2014-01-01T01:00:00+01:00 | -1.01 | 441.06 | 6.39 | -2.48 | 4.94 | 179.82001 | 177.36 |
| 2 | R80790 | 2014-01-01T01:00:00+01:00 | -0.96 | 658.53003 | 7.11 | 1.07 | 4.55 | 172.39 | 173.50999 |
| 3 | R80711 | 2014-01-01T01:00:00+01:00 | -0.93 | 514.23999 | 6.87 | 6.95 | 4.3 | 172.77 | 179.72 |
| 4 | R80790 | 2014-01-01T01:10:00+01:00 | -0.96 | 640.23999 | 7.01 | -1.9 | 4.68 | 172.39 | 170.46001 |


#### Convert Date_time to a datetime object

In [5]:
# # To illustrate timezone QC functions, we'll remove the timezone information
# date = [s[0:10] for s in scada_df['Date_time']]
# time = [s[11:19] for s in scada_df['Date_time']]
# datetime = [date[s] + ' ' + time[s] for s in np.arange(len(date))]
# scada_df['datetime'] = pd.to_datetime(datetime, format = "%Y-%m-%d %H:%M:%S")

# scada_df.set_index('datetime', inplace = True, drop = False)

In [6]:
# scada_df.dtypes
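
The commented-out cell above strips the UTC offset from each string with list comprehensions. As a minimal sketch of the same idea in vectorized pandas, assuming the `Date_time` strings always start with a 19-character `YYYY-MM-DDTHH:MM:SS` stamp (the `sample` frame is a hypothetical stand-in for `scada_df`):

```python
import pandas as pd

# Hypothetical stand-in for scada_df, using timestamps in the format shown above
sample = pd.DataFrame(
    {"Date_time": ["2014-01-01T01:00:00+01:00", "2014-01-01T01:10:00+01:00"]}
)

# Keep only the first 19 characters, which drops the "+01:00" offset
# and leaves a timezone-naive local timestamp
sample["datetime"] = pd.to_datetime(
    sample["Date_time"].str[:19], format="%Y-%m-%dT%H:%M:%S"
)

# Index by the new column while keeping it available as a regular column
sample = sample.set_index("datetime", drop=False)
print(sample.dtypes)
```

Dropping the offset is only useful for illustrating the timezone checks; for real analysis the offset carries information that is usually worth keeping.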

### Step 2: Initializing QC and Performing the Run Method

Now that we have our dataset with the necessary columns and datatypes, we are ready to perform our quality check diagnostic. This analysis will not make the adjustments for us, but it will allow us to quickly flag some key irregularities that we need to manage before proceeding.

To start, let's initialize a QC object, `qc`, and call its `run` method.

In [7]:
qc = QC(
    data=scada_df,
    ws_field='Ws_avg',
    power_field='P_avg',
    time_field='Date_time',
    id_field='Wind_turbine_name',
    freq='10T',
    lat_lon=(48.45, 5.586),
    tz="Europe/Paris"
)

INFO:operational_analysis.methods.quality_check_automation:Initializing QC_Automation Object


In [8]:
qc.run()

INFO:operational_analysis.methods.quality_check_automation:Identifying Time Duplications
INFO:operational_analysis.methods.quality_check_automation:Identifying Time Gaps
INFO:operational_analysis.methods.quality_check_automation:Evaluating timezone deviation from UTC
INFO:root:GET: https://developer.nrel.gov/api/hsds/ [/nrel/wtk-us.h5]
INFO:root:status: 200
INFO:root:got domain json: {'root': 'g-d146c6be-85f3-11e7-bf89-0242ac110008', 'class': 'domain', 'owner': 'nrel_admin', 'created': 1503266789.3004835, 'limits': {'min_chunk_size': 1048576, 'max_chunk_size': 4194304, 'max_request_size': 104857600, 'max_chunks_per_request': 1000}, 'compressors': ['blosclz', 'lz4', 'lz4hc', 'snappy', 'gzip', 'zstd'], 'version': '0.7.0beta', 'lastModified': 1503266789.3004835, 'domain_objs': {'g-d146c6be-85f3-11e7-bf89-0242ac110008': {'id': 'g-d146c6be-85f3-11e7-bf89-0242ac110008', 'root': 'g-d146c6be-85f3-11e7-bf89-0242ac110008', 'created': 1503266789.2974205, 'lastModified': 1503267057.1529546, 'linkC

y,x indices for project: 		 (3343, 4213)
Coordinates of project: 	 (48.45, 5.586)


ValueError: Index (3343) out of range (0-1601)

The run stops at this point. A likely cause is that the `wtk-us.h5` domain queried above covers the United States, while the project coordinates (48.45, 5.586) lie in France, so the computed grid index falls outside the dataset. The remaining cells, marked `In [None]`, appear not to have been executed in this run.

### Step 3: Deep Dive with QC Diagnostic Results

Let's take a deeper look at the results of our QC diagnostic. 

#### Perform a general scan of the distributions for each numeric variable

In [None]:
qc.column_histograms()

#### Check ranges of each variable

In [None]:
qc._max_min

These values look fairly reasonable and consistent. 
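
For readers who want to reproduce a range check without the QC class, here is a minimal sketch in plain pandas. It is not the implementation behind `qc._max_min`; the `sample` frame is a hypothetical subset built from the rows shown by `scada_df.head()`.

```python
import pandas as pd

# Hypothetical sample built from the first rows shown by scada_df.head()
sample = pd.DataFrame(
    {
        "Wind_turbine_name": ["R80736", "R80721", "R80790"],
        "P_avg": [642.78003, 441.06, 658.53003],
        "Ws_avg": [7.12, 6.39, 7.11],
        "Ot_avg": [4.69, 4.94, 4.55],
    }
)

# Keep numeric columns only, then compute the min and max of each one
ranges = sample.select_dtypes(include="number").agg(["min", "max"]).T
print(ranges)
```

Each row of `ranges` is one variable, which makes physically implausible values (for example, negative wind speeds) easy to spot.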

#### Identify any timestamp duplications and timestamp gaps

Duplicated timestamps in October and gaps in March would suggest daylight saving time (DST) shifts.

In [None]:
qc._time_duplications 

In [None]:
qc._time_gaps

Based on these results, there does appear to be a DST shift in spring, but the duplicated timestamps we would expect in the fall do not appear.
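
The idea behind these two checks can be sketched in plain pandas. This is an illustration under stated assumptions, not the QC class's own code: the helper `find_duplicates_and_gaps` is hypothetical, it handles a single turbine's timestamps, and it assumes a regular 10-minute interval.

```python
import pandas as pd


def find_duplicates_and_gaps(timestamps, freq="10min"):
    """Return (duplicated timestamps, missing timestamps) for one regular series."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)

    # Every repeat after the first occurrence counts as a duplicate
    duplicates = ts[ts.duplicated()]

    # Build the complete expected range and keep the stamps that never occur
    expected = pd.date_range(ts.min(), ts.max(), freq=freq)
    gaps = expected[~expected.isin(ts)]
    return duplicates, gaps


# Made-up local timestamps around a spring DST shift: 02:00 to 02:50 is missing
# and 03:00 appears twice
stamps = [
    "2014-03-30 01:40:00",
    "2014-03-30 01:50:00",
    "2014-03-30 03:00:00",
    "2014-03-30 03:00:00",
    "2014-03-30 03:10:00",
]
duplicates, gaps = find_duplicates_and_gaps(stamps)
print(duplicates)
print(gaps)
```

With several turbines in one file, the same timestamp legitimately appears once per turbine, so a check like this would need to run per turbine ID.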

#### Check the DST plot to look in more detail

In [None]:
qc.daylight_savings_plot()

So we do in fact have a gap in the spring data when DST kicks in (as well as duplicated data, for some reason), but no duplicated data in the fall.

The final question regarding datetime is whether the timestamps are in UTC or local time. Given the daylight saving gap, local time is likely. This is further confirmed by the raw datetime strings in the SCADA file, which carry a UTC offset of either +1 h or +2 h. So we are operating in local time. Therefore, the project import script for La Haute Borne should shift the timestamps back to put them into UTC.
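
As a rough illustration of that reasoning, the sketch below counts the UTC offsets found in the raw strings and then converts them to UTC. It assumes every string ends in a six-character offset such as `+01:00`; the `raw` series is made-up sample data in the file's format, and this is not the project import script itself.

```python
import pandas as pd

# Made-up sample strings in the same format as the Date_time column
raw = pd.Series(
    [
        "2014-01-01T01:00:00+01:00",
        "2014-07-01T01:00:00+02:00",
        "2014-07-01T01:10:00+02:00",
    ]
)

# The last six characters hold the offset; two distinct values (+01:00 in
# winter, +02:00 in summer) indicate local time with DST
offset_counts = raw.str[-6:].value_counts()
print(offset_counts)

# Parsing with utc=True applies each offset and returns UTC timestamps
utc_times = pd.to_datetime(raw, utc=True)
print(utc_times)
```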

### Inspect the turbine power curves

Now that we have gathered some useful information about our timeseries, one last check we may want to make is to inspect each turbine's profile. We can look at each turbine's power curve and perform an initial scan for irregularities.

In [None]:
qc.plot_by_id('Ws_avg', 'P_avg')

Overall, these power curves look fairly typical, with some downtime, derating, and what look like a few erroneous data points.
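
To follow up on a visual scan like this, one option is to flag points where the wind is strong enough to produce power but the turbine produces none. The sketch below is only an illustration: the helper `flag_possible_downtime` and its 5 m/s threshold are hypothetical choices, not part of the QC class, and the sample values are made up.

```python
import pandas as pd


def flag_possible_downtime(df, ws_col="Ws_avg", power_col="P_avg", ws_threshold=5.0):
    """Mark rows with wind at or above the threshold but zero or negative power."""
    # ws_threshold is an assumed value for illustration, not a turbine specification
    return (df[ws_col] >= ws_threshold) & (df[power_col] <= 0.0)


# Made-up readings: the second and fourth rows show wind without production
sample = pd.DataFrame(
    {
        "Ws_avg": [7.12, 8.0, 2.0, 6.5],
        "P_avg": [642.78, 0.0, -1.5, -3.0],
    }
)
sample["possible_downtime"] = flag_possible_downtime(sample)
print(sample)
```

Flagged rows are candidates for review rather than confirmed errors, since curtailment or maintenance can produce the same pattern.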

### Step 4: Performing adjustments on our data

Recall that this notebook is only for diagnostic QC of plant data and does not actually change the data in the project import script. Any issues identified here should be incorporated into the project import script.

Note that the necessary corrections have already been applied to the project import script for this data.