# BluWave Challenge

The goal of this challenge is to determine whether one or more of the outputs from 4 wind turbines are stationary.

In [1]:
# import modules

import pandas as pd
import numpy as np
import os
import sys
import random
import copy
import cPickle as pickle
import datetime

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import colorlover as cl
import matplotlib.pyplot as plt 

init_notebook_mode(connected=True)

%reload_ext autoreload
%autoreload 2

pd.options.display.float_format = '{:,.4f}'.format

In [30]:
# Load data

data = pd.read_excel("Data.xlsx", sheet_name=None).values()[0]

In [58]:
pw_cols = ['PW1', 'PW2', 'PW3', 'PW4']

## Explore/Prepare Data Set

First we take a look at the size, column types and the first couple rows of the dataframe, shown below.  The wind turbine outputs are the columns PW1, PW2, PW3 and PW4. There is a TimeStamp column which contains the epoch timestamps (number of seconds since Jan 1, 1970) for each sample. 

In [20]:
data.shape

(106560, 15)

In [103]:
data.dtypes

TimeStamp              float64
Date                    object
Time                    object
Wind Speed             float64
Wind Dir               float64
Temp                   float64
Unnamed: 6             float64
Unnamed: 7              object
Unnamed: 8              object
PW1                     object
PW2                     object
PW3                     object
PW4                     object
Unnamed: 13             object
 Wind Farm MW Value    float64
dtype: object

In [8]:
data.head()

Unnamed: 0,TimeStamp,Date,Time,Wind Speed,Wind Dir,Temp,Unnamed: 6,Unnamed: 7,Unnamed: 8,PW1,PW2,PW3,PW4,Unnamed: 13,Wind Farm MW Value
0,1501545600.0,"""Aug 01/17""",00:00:00,8.6741,211.8667,20.0,,,,1106.214,1071.4309,1165.2189,1079.5228,,4407.5907
1,1501545900.0,"""Aug 01/17""",00:05:00,8.288,213.7232,20.0,,,,1023.2526,920.8148,1126.9757,955.0125,,3979.1336
2,1501546200.0,"""Aug 01/17""",00:10:00,8.5543,216.4733,20.0,,,,1066.6329,1173.6629,1116.1848,950.9738,,4312.0781
3,1501546500.0,"""Aug 01/17""",00:15:00,8.5694,218.5832,20.0,,,,1067.4646,1043.5497,1171.3774,1153.1642,,4359.9587
4,1501546800.0,"""Aug 01/17""",00:20:00,8.7656,218.0367,20.0,,,,1092.3378,1058.1656,1197.3882,1243.6106,,4602.9744


In terms of time ranges, the samples start and end at the following dates:

In [4]:
'start: ' + data['Date'].iloc[0] + ' end: ' + data['Date'].iloc[-1]

u'start:  "Aug 01/17" end:  "Aug 05/18"'

So there's roughly a year's worth of data. In addition the time interval between each pair of successive points appears to be 5 minutes.
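The 5-minute spacing can be checked rather than eyeballed. The snippet below is a minimal sketch that uses only the five `TimeStamp` values shown in the `data.head()` output; the names `timestamps`, `steps` and `step_counts` are introduced here for illustration. On the full frame the same idea would apply to `data['TimeStamp']`.

```python
import pandas as pd

# The five epoch timestamps from the data.head() output above
timestamps = pd.Series([1501545600.0, 1501545900.0, 1501546200.0,
                        1501546500.0, 1501546800.0])

# Differences between successive samples, in seconds
steps = timestamps.diff().dropna()

# Tally how often each step size occurs; a single key of 300.0 would mean
# every sample follows the previous one by exactly 5 minutes
step_counts = steps.value_counts().to_dict()
print(step_counts)
```

Any additional key in the tally would point to a gap or a duplicated sample.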

It's strange, however, that the PW columns are object types, since they appear to hold floats. The unique values for these columns are shown below, followed by the number of null values in each column.

In [110]:
for w in pw_cols:
    print np.unique(data[w])

[-47.165427 -40.573715 -40.265819 ..., 3008.55566 3011.689958 u' ']
[-51.420665 -48.12117 -46.364245 ..., 3007.130397 3009.956445 u' ']
[-46.234508 -45.469031 -44.202472 ..., 3008.425407 3008.767463 u' ']
[-55.90785 -49.78317 -49.744794 ..., 3007.361389 3007.539973 u' ']


In [46]:
data.isnull().sum()

TimeStamp                 11
Date                       0
Time                       0
Wind Speed                11
Wind Dir                  11
Temp                      11
Unnamed: 6             96205
Unnamed: 7             96205
Unnamed: 8             96205
PW1                       11
PW2                       11
PW3                       11
PW4                       11
Unnamed: 13            96205
 Wind Farm MW Value       11
dtype: int64

There are strings mixed in with the floats and each turbine has 11 points of missing data. Combining NaN and string values, we get the following missing values for each turbine:

In [111]:
data.isnull().sum() + (data == ' ').sum()

TimeStamp                  11
Date                        0
Time                        0
Wind Speed                 11
Wind Dir                   11
Temp                       11
Unnamed: 6              96205
Unnamed: 7             106560
Unnamed: 8             106560
PW1                      1778
PW2                      1830
PW3                      1143
PW4                      1133
Unnamed: 13            106560
 Wind Farm MW Value        11
dtype: int64

These missing values, between roughly 1,100 and 1,800 per turbine, represent about 1.1% to 1.7% of the data. This isn't a huge amount, but it could become a recurring issue.
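As a quick check of the arithmetic, the share of missing samples per turbine can be computed directly from the counts in the output above and the row count from `data.shape`. The dictionary names here are illustrative only.

```python
# NaN plus blank-string counts per turbine, copied from the output above
missing_counts = {'PW1': 1778, 'PW2': 1830, 'PW3': 1143, 'PW4': 1133}
n_rows = 106560  # number of rows, from data.shape

# Percentage of samples missing for each turbine
missing_pct = {w: 100.0 * c / n_rows for w, c in missing_counts.items()}

for w in sorted(missing_pct):
    print('%s: %.2f%%' % (w, missing_pct[w]))
```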

In [120]:
n = ((data == ' ') | (data.isnull()))
len(list(data.loc[n['PW1'], ['Date', 'Time']].groupby('Date')))

32

In [74]:
miss_data = data.loc[data.loc[:, pw_cols].isnull().any(1),:]
miss_data_inds = miss_data.index
miss_data

Unnamed: 0,TimeStamp,Date,Time,Wind Speed,Wind Dir,Temp,Unnamed: 6,Unnamed: 7,Unnamed: 8,PW1,PW2,PW3,PW4,Unnamed: 13,Wind Farm MW Value
96194,,"""Jul 01/18""",00:05:00,,,,,,,,,,,,
96195,,"""Jul 01/18""",00:10:00,,,,,,,,,,,,
96196,,"""Jul 01/18""",00:15:00,,,,,,,,,,,,
96197,,"""Jul 01/18""",00:20:00,,,,,,,,,,,,
96198,,"""Jul 01/18""",00:25:00,,,,,,,,,,,,
96199,,"""Jul 01/18""",00:30:00,,,,,,,,,,,,
96200,,"""Jul 01/18""",00:35:00,,,,,,,,,,,,
96201,,"""Jul 01/18""",00:40:00,,,,,,,,,,,,
96202,,"""Jul 01/18""",00:45:00,,,,,,,,,,,,
96203,,"""Jul 01/18""",00:50:00,,,,,,,,,,,,


We can see that these missing values are consecutive: they occurred on July 1st, over roughly an hour, for all wind turbines. It appears to be some kind of blackout in the data collector, perhaps because the collectors were powered down for yearly maintenance. We could assume that these values are missing at random, i.e. their missingness is unrelated to the parameters of interest.
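Whether a set of missing rows forms one unbroken block can also be tested programmatically. The sketch below uses the row labels visible in the `miss_data` output; `is_single_block` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import numpy as np

def is_single_block(labels):
    """Return True if the integer labels form one run with no gaps."""
    labels = np.sort(np.asarray(labels))
    return bool(np.all(np.diff(labels) == 1))

# Row labels visible in the miss_data output above
visible_missing = np.arange(96194, 96204)

print(is_single_block(visible_missing))     # one contiguous run
print(is_single_block([10, 11, 12, 40]))    # a run plus a stray row
```

On the full frame, `miss_data_inds` could be passed in place of `visible_missing`.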

We'll need to either remove these values completely or impute them. Given that these values represent only 0.01% of the data and occurred over a single hour, it's unlikely that the method we choose will significantly affect any tests for stationarity. However, this could become a recurring issue we need to deal with.

There are a couple basic imputing methods specific to time series we could consider:

1. Last observation carried forward (LOCF)
2. Next observation carried backward (NOCB)
3. Linear interpolation
4. Spline interpolation
5. Moving average
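To make the options concrete, here is a minimal sketch of four of them on a toy series with made-up values. The window size of 5 is an arbitrary choice. Spline interpolation is left out because pandas delegates it to SciPy (via `interpolate(method='spline', order=...)`), which this sketch does not assume is installed.

```python
import numpy as np
import pandas as pd

# Toy series with a three-sample gap; the values are made up for illustration
s = pd.Series([10.0, 12.0, np.nan, np.nan, np.nan, 20.0, 18.0])

locf = s.ffill()                          # last observation carried forward
nocb = s.bfill()                          # next observation carried backward
linear = s.interpolate(method='linear')   # straight line across the gap

# Moving average: fill each missing point with the mean of the observed
# values inside a centred window, ignoring the NaNs in that window
window_mean = s.rolling(window=5, center=True, min_periods=1).mean()
moving_avg = s.fillna(window_mean)

print(pd.DataFrame({'raw': s, 'locf': locf, 'nocb': nocb,
                    'linear': linear, 'moving_avg': moving_avg}))
```

Note that a centred window only helps if it is wider than the gap; otherwise the points in the middle of the gap have no observed neighbours and stay missing.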

We'll first plot the data from the turbines before and after the missing chunk. We use the 12 hours before and after the gap, since weather trends should be visible over that time frame.

In [61]:
st_missing = 96194
end_missing = 96204

# Get indices of points 12h before gap and 12h after
inds = range(st_missing - (12*12), st_missing + 1) + range(end_missing, end_missing + (12*12))

iplot(go.Figure(data = [go.Scatter(x = data['TimeStamp'].iloc[inds], y = data[w].iloc[inds], 
                                   mode = 'lines', name = w) for w in pw_cols]))

The gap appears to occur during a period of high, variable winds, so using LOCF or NOCB would create a 'jump' in the data. Linear interpolation wouldn't capture the variability in the data during that period, while spline interpolation would capture too much noise. A moving average could help to impute the general trend.

In [72]:
rolldata = data.loc[:, pw_cols].iloc[inds, :].rolling(window = 10).mean()
iplot(go.Figure(data = [go.Scatter(x = range(rolldata.shape[0]), y = rolldata[w], 
                                   mode = 'lines', name = w) for w in pw_cols]))

In [62]:
44.0 * 100.0 / (data.shape[0] * 4)

0.010322822822822823

Now we'll plot the output against time for each wind turbine using plotly. Given that there are so many points, we'll pick the first point of every day, just to get a sense of the data. 

In [21]:
pt_interval = int(3600.0 * 24.0 / (5.0 * 60))  # 288 five-minute samples per day
for w in pw_cols:
    # Use the same stride for x and y so the two stay aligned
    iplot(go.Figure(data = [go.Scatter(x = data['TimeStamp'].iloc[::pt_interval], 
                                   y = data[w].iloc[::pt_interval], mode = 'lines')],
               layout = go.Layout(yaxis = dict(range = [0, 4000]), title = w)))

At first glance, there doesn't appear to be a trend in these time series.

We can also plot summary statistics over time for each time series.  Here, we plot the mean and variance for each day.

In [100]:
clean_data = data.drop(miss_data_inds).reset_index()
pd.to_numeric(clean_data.loc[:, pw_cols]).groupby(clean_data['Date']).mean()

TypeError: arg must be a list, tuple, 1-d array, or Series
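The error arises because `pd.to_numeric` accepts a single column (a list, 1-d array or Series), not a whole DataFrame. One way around it, sketched here on a small made-up frame that mimics the blank-string entries, is to apply the conversion column by column with `errors='coerce'` so that unparseable entries become NaN. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame mimicking the layout above: object columns with ' ' mixed in
toy = pd.DataFrame({
    'Date': ['Aug 01/17', 'Aug 01/17', 'Aug 02/17', 'Aug 02/17'],
    'PW1': [1106.2, ' ', 1066.6, 1067.4],
    'PW2': [1071.4, 920.8, ' ', 1043.5],
})
cols = ['PW1', 'PW2']

# Convert one column at a time; errors='coerce' turns ' ' into NaN
numeric = toy[cols].apply(pd.to_numeric, errors='coerce')

# Daily mean and variance; NaN entries are skipped by default
daily = numeric.groupby(toy['Date']).agg(['mean', 'var'])
print(daily)
```

Grouping on the raw `Date` string sorts the days alphabetically rather than chronologically, so parsing the dates first may be preferable for plotting over time.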

In [108]:
clean_data.loc[clean_data['PW1'] == ' ', :]

Unnamed: 0,index,TimeStamp,Date,Time,Wind Speed,Wind Dir,Temp,Unnamed: 6,Unnamed: 7,Unnamed: 8,PW1,PW2,PW3,PW4,Unnamed: 13,Wind Farm MW Value
1612,1612,1502029200.0000,"""Aug 06/17""",14:20:00,9.3233,158.0000,21.0000,,,,,1329.4832,1574.6273,1379.9878,,4222.1501
1613,1613,1502029500.0000,"""Aug 06/17""",14:25:00,8.6513,158.0000,21.0000,,,,,950.5430,1166.8148,1708.4798,,3708.1795
1614,1614,1502029800.0000,"""Aug 06/17""",14:30:00,8.7041,158.0000,21.0000,,,,,875.0353,1471.0648,1372.7089,,3688.3795
1615,1615,1502030100.0000,"""Aug 06/17""",14:35:00,9.0740,158.0000,21.0000,,,,,923.7121,1464.8160,1659.3166,,3943.1179
1616,1616,1502030400.0000,"""Aug 06/17""",14:40:00,8.8560,158.0000,21.0000,,,,,1296.3109,1223.3349,1527.4078,,4006.2336
1617,1617,1502030700.0000,"""Aug 06/17""",14:45:00,9.7810,158.0000,21.0000,,,,,1275.8300,1533.2222,1935.9453,,4586.2631
1618,1618,1502031000.0000,"""Aug 06/17""",14:50:00,9.6160,158.0000,21.0000,,,,,1397.1342,1247.7170,1833.4828,,4445.2696
1619,1619,1502031300.0000,"""Aug 06/17""",14:55:00,9.1280,158.0000,21.0000,,,,,1594.5655,1220.8153,1278.5715,,4015.7100
1620,1620,1502031600.0000,"""Aug 06/17""",15:00:00,9.3431,158.0000,21.0000,,,,,1163.0402,1712.1366,1455.7527,,4278.8583
1621,1621,1502031900.0000,"""Aug 06/17""",15:05:00,9.0097,158.0000,21.0000,,,,,1037.5529,1372.5808,1469.8112,,3860.3102


In [44]:
day_means = data.groupby('Date').mean()
day_means

Date,TimeStamp,Wind Speed,Wind Dir,Temp,Unnamed: 6,Wind Farm MW Value
"""Apr 01/18""",1522583850.0000,9.4738,208.1105,4.2292,,6683.9452
"""Apr 02/18""",1522670250.0000,5.9605,222.7329,1.3277,,2007.4516
"""Apr 03/18""",1522756650.0000,4.2754,180.0378,2.7087,,770.6178
"""Apr 04/18""",1522843050.0000,6.7455,130.4163,3.5698,,2631.5541
"""Apr 05/18""",1522929450.0000,11.9473,250.7715,1.0854,,9155.8805
"""Apr 06/18""",1523015850.0000,10.5332,243.7872,-1.1614,,8207.8424
"""Apr 07/18""",1523102250.0000,5.6646,161.5783,1.9786,,1666.7223
"""Apr 08/18""",1523188650.0000,5.0219,141.0974,0.4599,,1682.5584
"""Apr 09/18""",1523275050.0000,12.0483,326.2896,-0.0067,,1293.4475
"""Apr 10/18""",1523361450.0000,6.4960,241.1790,1.6558,,2372.3685


In [22]:
date_time = data.apply(lambda x: x['Date'].split('"') + x['Time'], axis = 0)
#datetime.datetime.strptime(data['Date'].iloc[0].split('"')[1], "%b %d/%y")

KeyError: ('Date', u'occurred at index TimeStamp')

In [16]:
data['Date'].iloc[0].split('"')

[u' ', u'Aug 01/17', u'']

In [18]:
data['Time'].iloc[0]

u' 00:00:00'
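Given the formats shown above (a date wrapped in double quotes with a leading space, and a time with a leading space), the two strings can be combined into a single `datetime`. The following is a minimal sketch; `parse_stamp` is a hypothetical helper, and the sample strings are the first-row values shown above.

```python
import datetime

# First-row values in the format shown above
raw_date = ' "Aug 01/17"'
raw_time = ' 00:00:00'

def parse_stamp(date_str, time_str):
    """Combine the quoted date and the time string into one datetime."""
    day = date_str.split('"')[1]   # the text between the double quotes
    return datetime.datetime.strptime(day + ' ' + time_str.strip(),
                                      '%b %d/%y %H:%M:%S')

stamp = parse_stamp(raw_date, raw_time)
print(stamp)
```

To apply this to the whole frame, the function needs to receive rows rather than columns, e.g. `data.apply(lambda row: parse_stamp(row['Date'], row['Time']), axis=1)`. With `axis=0` each column is passed in turn, which is why looking up `'Date'` raised the `KeyError` above.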