# 3W dataset's General Presentation

This is a general presentation of the 3W dataset. To the best of its authors' knowledge, it is the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark for developing machine learning techniques that must cope with the inherent difficulties of actual data.

For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)).

# 1. Introduction

This Jupyter Notebook presents the 3W dataset in a general way. For this, some tables, graphs, and statistics are presented.

# 2. Imports and Configurations

In [294]:
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

sys.path.append(os.path.join('..','..'))
import toolkit as tk

from itertools import (takewhile,repeat)
import bisect
import sklearn
import sklearn.model_selection
import sklearn.preprocessing  # StandardScaler and OneHotEncoder are used by the data generator

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# 3. Instances' Structure

Below, all instances of the 3W dataset are loaded, grouped by knowledge source (real, simulated and hand-drawn), and one real instance is partially displayed.

In [4]:
def rawincount(filename):
    '''Count the lines of a file by scanning it in 1 MiB chunks.

    https://stackoverflow.com/questions/845058/how-to-get-line-count-of-a-large-file-cheaply-in-python'''
    # The context manager closes the file once all chunks have been counted
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

In [93]:
real_instances, simulated_instances, drawn_instances = tk.get_all_labels_and_files()

real_instances = pd.DataFrame(real_instances, columns=['label', 'path'])
real_instances['nlines'] = real_instances['path'].apply(rawincount)

simulated_instances = pd.DataFrame(simulated_instances, columns=['label', 'path'])
simulated_instances['nlines'] = simulated_instances['path'].apply(rawincount)

drawn_instances = pd.DataFrame(drawn_instances, columns=['label', 'path'])
drawn_instances['nlines'] = drawn_instances['path'].apply(rawincount)

In [316]:
train_df, val_df = sklearn.model_selection.train_test_split(real_instances, test_size=0.2, 
                                                random_state=200560, shuffle=True, 
                                                stratify=real_instances['label'])
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

In [355]:
df = pd.read_csv('../../dataset/4/WELL-00002_20131209050000.csv', index_col="timestamp", parse_dates=["timestamp"])
flist0 = ['P-PDG', 'P-TPT', 'T-TPT', 'P-MON-CKP', 'T-JUS-CKP', 'P-JUS-CKGL', 'T-JUS-CKGL', 'QGL', 'class']
df

timestamp,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,P-JUS-CKGL,T-JUS-CKGL,QGL,class
2013-12-09 05:00:00,0.0,16846860.0,118.0778,7802836.0,173.0961,4106967.0,,0.0,4
2013-12-09 05:00:01,0.0,16846970.0,118.0775,7796857.0,173.0961,4107032.0,,0.0,4
2013-12-09 05:00:02,0.0,16847080.0,118.0772,7790877.0,173.0961,4107097.0,,0.0,4
2013-12-09 05:00:03,0.0,16847190.0,118.0769,7784897.0,173.0961,4107162.0,,0.0,4
2013-12-09 05:00:04,0.0,16847310.0,118.0765,7778917.0,173.0961,4107228.0,,0.0,4
...,...,...,...,...,...,...,...,...,...
2013-12-09 06:59:45,0.0,16820820.0,118.0437,7568130.0,173.0961,4524903.0,,0.0,4
2013-12-09 06:59:46,0.0,16820980.0,118.0437,7571650.0,173.0961,4524937.0,,0.0,4
2013-12-09 06:59:47,0.0,16821150.0,118.0437,7575171.0,173.0961,4524969.0,,0.0,4
2013-12-09 06:59:48,0.0,16821320.0,118.0437,7578691.0,173.0961,4525003.0,,0.0,4


Each instance is stored in a CSV file and loaded into a pandas DataFrame. Each observation is stored as a line in the CSV file and loaded as a row of the DataFrame. The first line of each CSV file is a header with the column identifiers. The columns store the following information:

* **timestamp**: observation timestamps, loaded into the pandas DataFrame as its index;
* **P-PDG**: pressure variable at the Permanent Downhole Gauge (PDG);
* **P-TPT**: pressure variable at the Temperature and Pressure Transducer (TPT);
* **T-TPT**: temperature variable at the Temperature and Pressure Transducer (TPT);
* **P-MON-CKP**: pressure variable upstream of the production choke (CKP);
* **T-JUS-CKP**: temperature variable downstream of the production choke (CKP);
* **P-JUS-CKGL**: pressure variable downstream of the gas lift choke (CKGL);
* **T-JUS-CKGL**: temperature variable downstream of the gas lift choke (CKGL);
* **QGL**: gas lift flow rate;
* **class**: observation labels associated with three types of periods (normal, fault transient, and faulty steady state).
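As a small illustration of this layout, the sketch below parses a three-line excerpt in the same format and reports the fraction of missing values per column. The rows are copied from the instance displayed above, except that the last `class` value is blanked on purpose to show a missing label; `csv_text`, `sample` and `missing_fraction` are names introduced here for illustration only.

```python
import io

import pandas as pd

# Three observations in the 3W CSV layout. The final class value is
# deliberately left empty to illustrate a missing label.
csv_text = """timestamp,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,P-JUS-CKGL,T-JUS-CKGL,QGL,class
2013-12-09 05:00:00,0.0,16846860.0,118.0778,7802836.0,173.0961,4106967.0,,0.0,4
2013-12-09 05:00:01,0.0,16846970.0,118.0775,7796857.0,173.0961,4107032.0,,0.0,4
2013-12-09 05:00:02,0.0,16847080.0,118.0772,7790877.0,173.0961,4107097.0,,0.0,
"""

# Same read_csv arguments as the notebook: timestamps become the index
sample = pd.read_csv(io.StringIO(csv_text), index_col="timestamp",
                     parse_dates=["timestamp"])

# Fraction of missing observations per column
missing_fraction = sample.isna().mean()
print(missing_fraction)
```

In this excerpt `T-JUS-CKGL` is entirely missing and `class` is missing in one of three rows, which is the kind of gap the preprocessing has to handle.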

Other information is also loaded into each pandas DataFrame:

* **label**: instance label (event type);
* **well**: well name. Hand-drawn and simulated instances have fixed names. Real instances have names masked with incremental id;
* **id**: instance identifier. Hand-drawn and simulated instances have incremental id. Each real instance has an id generated from its first timestamp.
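The relation between the per-observation `class` codes and the instance `label` can be sketched as follows. This assumes the usual 3W convention that a fault transient is coded as the event label plus 100, which is consistent with the category list 0 to 8 and 101 to 108 used later in this notebook. The helper `decode_class` is hypothetical and not part of the toolkit.

```python
def decode_class(code):
    """Split an observation class code into (event label, period name).

    Assumed convention: 0 is normal operation, 1..8 is the faulty steady
    state of that event, and 101..108 is the fault transient of event
    code - 100.
    """
    if code == 0:
        return 0, "normal"
    if 1 <= code <= 8:
        return code, "faulty steady state"
    if 101 <= code <= 108:
        return code - 100, "fault transient"
    raise ValueError(f"unexpected class code: {code}")


print(decode_class(104))  # transient period of event 4
print(decode_class(4))    # steady state of event 4
```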

More information about these variables can be obtained from the following publicly available documents:

* ***Option in Portuguese***: R.E.V. Vargas. Base de dados e benchmarks para prognóstico de anomalias em sistemas de elevação de petróleo. Universidade Federal do Espírito Santo. Doctoral thesis. 2019. https://github.com/ricardovvargas/3w_dataset/raw/master/docs/doctoral_thesis_ricardo_vargas.pdf.
* ***Option in English***: B.G. Carvalho. Evaluating machine learning techniques for detection of flow instability events in offshore oil wells. Universidade Federal do Espírito Santo. Master's degree dissertation. 2021. https://github.com/ricardovvargas/3w_dataset/raw/master/docs/master_degree_dissertation_bruno_carvalho.pdf.

# 4. Preprocessing

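The cell below preprocesses a single real instance. Missing `class` values are forward-filled, variables with 20% or more missing values are set aside, and the remaining observations are aggregated per minute: mean and standard deviation for each variable, and the mode for `class`. An incomplete final minute is dropped.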

In [356]:
df = pd.read_csv('../../dataset/4/WELL-00002_20131209050000.csv', index_col="timestamp", parse_dates=["timestamp"])

if np.any(df['class'].isna()):
    df['class'] = df['class'].ffill()  # forward-fill; fillna(method=...) is deprecated in recent pandas
df['class'] = df['class'].astype('int')


flist0 = ['P-PDG', 'P-TPT', 'T-TPT', 'P-MON-CKP', 'T-JUS-CKP', 'P-JUS-CKGL', 'T-JUS-CKGL', 'QGL', 'class']
flist = []
for f in flist0:
    if np.sum(df[f].isna()) < len(df.index) * 0.2:
        flist.append(f)

fdict=dict()
for f in flist:
    fdict[f] = ['mean','std']

def mode(series):
    return pd.Series.mode(series)[0]
    
fdict['class'] = [mode]
    
df['minute'] = (df.index-df.index[0])//np.timedelta64(1,'m')

#ds = df.groupby('minute').agg(fdict).dropna().iloc[:-1]

ds = df.groupby('minute').agg(fdict)

if df.index[-1].second != 59:
    ds = ds.iloc[:-1]

print(len(df.index), len(ds.index))

7190 119
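The per-minute aggregation can be checked on a small synthetic series. The sketch below is not taken from the dataset: it builds 150 one-second observations (two full minutes plus 30 seconds) with a made-up `P-TPT` ramp and a label that switches from 0 to 104, then applies the same `groupby('minute')` logic.

```python
import numpy as np
import pandas as pd

# Synthetic one-second series: 2 full minutes plus an incomplete third one
start = pd.Timestamp("2013-12-09 05:00:00")
index = start + pd.to_timedelta(np.arange(150), unit="s")
toy = pd.DataFrame({"P-TPT": np.arange(150, dtype=float),
                    "class": [0] * 70 + [104] * 80}, index=index)


def mode(series):
    # Most frequent value within the minute
    return series.mode()[0]


# Integer minute offset from the first observation
toy["minute"] = (toy.index - toy.index[0]) // np.timedelta64(1, "m")
per_minute = toy.groupby("minute").agg({"P-TPT": ["mean", "std"],
                                        "class": [mode]})

# Drop the trailing minute when it does not end on second 59
if toy.index[-1].second != 59:
    per_minute = per_minute.iloc[:-1]

print(per_minute)
```

Only the two complete minutes survive, and the second one is labelled 104 because that code covers 50 of its 60 seconds.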


In [233]:
sum(df['class'].isna())

297

In [10]:
for f in flist0:
    if f not in flist:
        ds[f, 'mean'] = np.nan  # np.NaN was removed in NumPy 2.0
        ds[f, 'std'] = np.nan
ds = ds[flist0]

In [33]:
def plot_ds(ds, flist):
    fig, axs = plt.subplots(nrows=len(flist), figsize=(12, 12), sharex=True)

    for i, vs in enumerate(flist[:-1]):
        axs[i].plot(ds.index, ds[(vs, 'mean')])
        axs[i].fill_between(ds.index, ds[(vs, 'mean')]-1.96*ds[(vs, 'std')], 
                        ds[(vs, 'mean')]+1.96*ds[(vs, 'std')], 
                        alpha=0.2)
        axs[i].set_ylabel(vs)
    axs[i+1].plot(ds.index, ds[flist[-1]])
    axs[i+1].set_ylabel(flist[-1])
    axs[i+1].set_xlabel('minute')

    plt.show()

In [139]:
seq_length = 15
ts = tf.keras.utils.timeseries_dataset_from_array(
    ds.drop('class', axis=1),
    pd.concat([ds['class'].iloc[seq_length-1:], ds['class'].iloc[:seq_length-1]]).reset_index(drop=True),
    sequence_length=seq_length,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=32,
    shuffle=False,
    seed=None,
    start_index=None,
    end_index=None
)

In [128]:
ts = tf.keras.utils.timeseries_dataset_from_array(
    np.arange(100).reshape((50, 2)),
    -np.arange(14,65),
    sequence_length=15,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=32,
    shuffle=False,
    seed=None,
    start_index=None,
    end_index=None
)
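To make the target alignment in the two cells above explicit, here is a NumPy-only sketch of the windowing. It relies on the documented behaviour of `timeseries_dataset_from_array` that `targets[i]` is paired with the window starting at index `i`; rotating the labels by `seq_length - 1` therefore pairs each window with the label of its last time step. The arrays `features` and `labels` are made up for illustration.

```python
import numpy as np

seq_length = 3
features = np.arange(12).reshape(6, 2)      # 6 time steps, 2 features
labels = np.array([0, 0, 0, 104, 104, 4])   # one label per time step

# Window i covers rows i .. i + seq_length - 1
n_windows = len(features) - seq_length + 1
windows = np.stack([features[i:i + seq_length] for i in range(n_windows)])

# Rotate the labels left by seq_length - 1, as the notebook does before
# passing them as targets
rotated = np.concatenate([labels[seq_length - 1:], labels[:seq_length - 1]])
window_targets = rotated[:n_windows]

for i in range(n_windows):
    # Each window is paired with the label of its final row
    assert window_targets[i] == labels[i + seq_length - 1]

print(windows.shape, window_targets)
```

The wrapped-around labels at the end of `rotated` are never consumed, because only the first `n_windows` targets have a matching window.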

In [408]:
class CustomDataGen(tf.keras.utils.Sequence):
    '''https://medium.com/analytics-vidhya/write-your-own-custom-data-generator-for-tensorflow-keras-1252b64e41c3'''
    
    def __init__(self, df, X_col, y_col, categories,
                 batch_size,
                 seq_length=15):
        
        self.df = df.copy()
        self.X_col = X_col
        self.y_col = y_col
        self.categories = categories
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.file = None
        self.ts = None
        
        self.n = self.__calc_n()
    
    def __calc_n(self):
        self.df['nbatches'] = np.int32(np.ceil(((self.df['nlines'] // 60)-self.seq_length)/self.batch_size))
        self.df['ibatch'] = self.df['nbatches'].cumsum() - 1
        return int(self.df['nbatches'].sum())
    
    def plot(self, ifile):
        
        ds = self.__get_ds(self.df['path'][ifile], Norm=False)
        
        # One axis per input variable plus one for the class (no dependence on a global flist)
        fig, axs = plt.subplots(nrows=len(self.X_col)+1, figsize=(10, 12), sharex=True)
        
        fig.suptitle(self.df['path'][ifile])

        for i, vs in enumerate(self.X_col):
            axs[i].plot(ds.index, ds[(vs, 'mean')])
            axs[i].fill_between(ds.index, ds[(vs, 'mean')]-1.96*ds[(vs, 'std')], 
                            ds[(vs, 'mean')]+1.96*ds[(vs, 'std')], 
                            alpha=0.2)
            axs[i].set_ylabel(vs)
            axs[i].grid()
        
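        # Sort by class so the categorical y axis is ordered; avoid shadowing the builtin id
        modes = ds[(self.y_col, 'mode')].to_numpy()
        order = np.argsort(modes)
        
        axs[i+1].scatter([ds.index[k] for k in order], [str(modes[k]) for k in order], marker='.')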
        
        axs[i+1].set_ylabel(self.y_col)
        
        axs[i+1].set_xlabel('minute')

        plt.show()    
    
    def on_epoch_end(self):
            pass
    
    def __get_ds(self, p, Norm=True):
    
        dfo = pd.read_csv(p, index_col="timestamp", parse_dates=["timestamp"])

        if np.any(dfo[self.y_col].isna()):
            dfo[self.y_col] = dfo[self.y_col].ffill()
        dfo[self.y_col] = dfo[self.y_col].astype('int')
        
        flist = []
        flist0 = []
        for f in self.X_col:
            nas = np.sum(dfo[f].isna())
            if nas > 0:
                if nas < len(dfo.index) * 0.2:
                    dfo[f] = dfo[f].ffill()
                    flist.append(f)
                else:
                    flist0.append(f)
            else:
                flist.append(f)

        fdict=dict()
        for f in flist:
            fdict[f] = ['mean','std']

        def mode(series):
            return pd.Series.mode(series)[0]

        fdict[self.y_col] = [mode]

        dfo['minute'] = (dfo.index-dfo.index[0])//np.timedelta64(1,'m')

        ds = dfo.groupby('minute').agg(fdict)

        if dfo.index[-1].second != 59:
            ds = ds.iloc[:-1]

        for f in flist0:
            ds[f, 'mean'] = np.nan
            ds[f, 'std'] = np.nan
        
        if Norm:
            ds = self.__Norm(ds)
        
        return ds[self.X_col + [self.y_col]]
        
    def __Norm(self, ds, nas_v=0):
        dn = ds.fillna(value=nas_v)
        sc = sklearn.preprocessing.StandardScaler()
        dn = pd.DataFrame(sc.fit_transform(dn), 
                                           index=dn.index, 
                                           columns=dn.columns)
        dn[(self.y_col, 'mode')] = ds[(self.y_col, 'mode')]
        return dn
    
    def __get_dt(self, p):

        ds = self.__get_ds(p)
        
        self.ts = tf.keras.utils.timeseries_dataset_from_array(
            ds.drop(self.y_col, axis=1, level=0),
            pd.concat([ds[self.y_col].iloc[self.seq_length-1:], ds[self.y_col].iloc[:self.seq_length-1]]).reset_index(drop=True),
            sequence_length=self.seq_length,
            sequence_stride=1,
            sampling_rate=1,
            batch_size=self.batch_size,
            shuffle=False,
            seed=None,
            start_index=None,
            end_index=None
        )        
        
        self.ts = list(self.ts)
        # Remember which file is cached so __get_data can skip reloading it
        self.file = p

        return
    
    def __get_output(self, y):
        
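        # The sparse= argument was renamed in newer scikit-learn releases;
        # converting the default sparse result works across versions.
        ohe = sklearn.preprocessing.OneHotEncoder(categories=self.categories)
        
        return ohe.fit_transform(np.asarray(y)).toarray()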
    
    def __get_data(self, i, j, p):
        # Generates data containing batch_size samples

        if p != self.file:
            self.__get_dt(p)

        return self.ts[j]
    
    def __getitem__(self, index):
        
        i = bisect.bisect_left(self.df.ibatch, index)
        if i > 0:
            j = index - self.df.ibatch[i-1] - 1
        else:
            j = index
        
        p = self.df.path[i]       
        
        #print(index, i, j, p)
        
        X, y = self.__get_data(i, j, p)        
        
        return X, self.__get_output(y)
    
    def __len__(self):
        return self.n
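The normalisation done in `__Norm` can be read as a plain z-score after replacing missing values with 0. The NumPy sketch below reproduces that arithmetic on a made-up array with one fully missing column; it is a stand-in for `StandardScaler`, not the class's actual code. Note that, as written in the generator, the scaler is fit separately on each instance.

```python
import numpy as np

# Made-up per-minute values: one regular column, one entirely missing
values = np.array([[1.0, np.nan],
                   [2.0, np.nan],
                   [3.0, np.nan]])

filled = np.where(np.isnan(values), 0.0, values)  # mirrors fillna(value=0)
mean = filled.mean(axis=0)
std = filled.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0           # constant columns are left unscaled
z = (filled - mean) / std

print(z)
```

A variable that is missing for the whole instance therefore reaches the model as a column of zeros.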

In [409]:
flist0 = ['P-PDG', 'P-TPT', 'T-TPT', 'P-MON-CKP', 'T-JUS-CKP', 'P-JUS-CKGL', 'T-JUS-CKGL', 'QGL']
categories=[[0,1,2,3,4,5,6,7,8,101,102,103,104,105,106,107,108]]
train = CustomDataGen(train_df, flist0, 'class', categories, 32, 15)
val = CustomDataGen(val_df, flist0, 'class', categories, 32, 15)
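For reference, the one-hot layout implied by `categories` can be sketched with NumPy alone. The function `one_hot` below is a hypothetical stand-in for the `OneHotEncoder` call inside the generator: with a fixed, sorted category list, each class code maps to the column at its position in that list.

```python
import numpy as np

categories = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8,
                       101, 102, 103, 104, 105, 106, 107, 108])


def one_hot(codes, categories):
    """Encode class codes as rows with a single 1 in the matching column."""
    codes = np.asarray(codes).ravel()
    # Position of each code in the sorted category list
    columns = np.clip(np.searchsorted(categories, codes), 0, len(categories) - 1)
    if not np.array_equal(categories[columns], codes):
        raise ValueError("class code not in categories")
    encoded = np.zeros((len(codes), len(categories)))
    encoded[np.arange(len(codes)), columns] = 1.0
    return encoded


y = one_hot([[0], [104], [4]], categories)
print(y.shape, y.argmax(axis=1))
```

Fixing the categories up front keeps the output width at 17 for every batch, even when a batch contains only one or two distinct classes.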

In [410]:
train.df

,label,path,nlines,nbatches,ibatch
0,0,..\..\dataset\0\WELL-00002_20170215180118.csv,17923,9,8
1,0,..\..\dataset\0\WELL-00002_20170621030054.csv,17947,9,17
2,4,..\..\dataset\4\WELL-00010_20180425040224.csv,7057,4,21
3,0,..\..\dataset\0\WELL-00001_20170219120021.csv,17980,9,30
4,4,..\..\dataset\4\WELL-00002_20131215000010.csv,7191,4,34
...,...,...,...,...,...
812,4,..\..\dataset\4\WELL-00010_20180425120029.csv,7172,4,5677
813,0,..\..\dataset\0\WELL-00001_20170219170053.csv,17948,9,5686
814,0,..\..\dataset\0\WELL-00005_20170401170000.csv,17957,9,5695
815,5,..\..\dataset\5\WELL-00015_20171013140047.csv,7755,4,5699
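The `nbatches` and `ibatch` columns are what `__getitem__` uses to turn a global batch index into a file and a batch within that file. The sketch below replays that arithmetic in plain Python for the `nlines` values of the first three rows of the table; `locate` is a name introduced here for illustration.

```python
import bisect
import math

nlines = [17923, 17947, 7057]   # first three rows of the table above
seq_length, batch_size = 15, 32

# Batches per file: minutes available minus the window length, in batches
nbatches = [math.ceil((n // 60 - seq_length) / batch_size) for n in nlines]

# Global index of the last batch of each file (cumulative sum minus 1)
ibatch = []
total = 0
for nb in nbatches:
    total += nb
    ibatch.append(total - 1)


def locate(index):
    """Map a global batch index to (file position, batch within file)."""
    i = bisect.bisect_left(ibatch, index)
    j = index if i == 0 else index - ibatch[i - 1] - 1
    return i, j


print(nbatches, ibatch)       # [9, 9, 4] [8, 17, 21]
print(locate(8), locate(9))   # (0, 8) (1, 0)
```

Because `ibatch` is sorted, the lookup is a binary search, so finding the file for a batch stays cheap even with hundreds of instances.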


In [None]:
#train.__get_dt(real_instances[real_instances.label==2].iloc[1].path)
# Smoke test: fetch every batch once to check that all files load
for index in range(len(train)):
    train[index]

In [415]:
MAX_EPOCHS = 20

def compile_and_fit(model, train, val, patience=2, lr=0.001):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')

    model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  metrics=[tf.keras.metrics.CategoricalAccuracy()])

    history = model.fit(train, epochs=MAX_EPOCHS,
                        validation_data=val,
                        callbacks=[early_stopping])
    return history
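The effect of `patience` in the early-stopping callback can be illustrated with a simplified pure-Python model of the rule: training stops once the validation loss has failed to improve for `patience` consecutive epochs. This is only a sketch of the idea, not the Keras implementation, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Best loss is reached at epoch 1; epochs 2 and 3 do not improve on it
print(stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7]))  # 3
```

In this made-up trace the run stops before reaching the lower loss at epoch 4, which is the usual trade-off of a small `patience`.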

In [416]:
linear_model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(15, 16)),
    tf.keras.layers.Flatten(),
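    # softmax gives a probability distribution over the 17 one-hot classes,
    # which is what CategoricalCrossentropy expects
    tf.keras.layers.Dense(units=17, activation='softmax')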
])
linear_model.summary()

Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_10 (Flatten)        (None, 240)               0         
                                                                 
 dense_13 (Dense)            (None, 17)                4097      
                                                                 
Total params: 4,097
Trainable params: 4,097
Non-trainable params: 0
_________________________________________________________________
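The shapes and the parameter count in the summary follow directly from the generator settings, as this short arithmetic check shows. The variable names are introduced here for illustration.

```python
seq_length = 15    # minutes per window
n_variables = 8    # sensor variables in flist0
n_stats = 2        # mean and std per variable
n_classes = 17     # entries in the categories list

n_features = n_variables * n_stats                 # 16 inputs per time step
flattened = seq_length * n_features                # 240 after Flatten
dense_params = flattened * n_classes + n_classes   # weights plus biases

print(n_features, flattened, dense_params)  # 16 240 4097
```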


In [417]:
val_performance = {}  # keyed by model name

history = compile_and_fit(linear_model, train, val)

val_performance['Linear'] = linear_model.evaluate(val)
#performance['Linear'] = linear.evaluate(single_step_window.test, verbose=0)

Epoch 1/20
Epoch 2/20

KeyboardInterrupt: 