# 3W dataset's General Presentation

This is a general presentation of the 3W dataset. To the best of its authors' knowledge, it is the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark for developing machine learning techniques that must cope with the inherent difficulties of actual data.

For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)).

# 1. Introduction

This Jupyter Notebook gives an overview of the 3W dataset through tables, graphs, and statistics.

# 2. Imports and Configurations

In [1]:
#pip install toolkit
#pip install alive-progress
#pip install natsort

In [2]:
import sys
import os

sys.path.append(os.path.join('..','..'))
import toolkit as tk

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# 3. Instances' Structure

Below, all instances of the 3W dataset are loaded, and the first one from each knowledge source (real, simulated, and hand-drawn) is partially displayed.

In [3]:
real_instances, simulated_instances, drawn_instances = tk.get_all_labels_and_files()
tk.load_instance(real_instances[0])

timestamp,label,well,id,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,P-JUS-CKGL,T-JUS-CKGL,QGL,class
2017-06-25 22:01:27,0,WELL-00002,20170625220127,0.0,8698015.0,117.6015,2142158.0,75.63453,2310426.0,,0.0,0
2017-06-25 22:01:28,0,WELL-00002,20170625220127,0.0,8698015.0,117.6014,2172395.0,75.65491,2310427.0,,0.0,0
2017-06-25 22:01:29,0,WELL-00002,20170625220127,0.0,8698015.0,117.6013,2202631.0,75.67529,2310427.0,,0.0,0
2017-06-25 22:01:30,0,WELL-00002,20170625220127,0.0,8698015.0,117.6012,2180472.0,75.69567,2310427.0,,0.0,0
2017-06-25 22:01:31,0,WELL-00002,20170625220127,0.0,8698015.0,117.6011,2158313.0,75.71606,2310428.0,,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2017-06-26 02:59:56,0,WELL-00002,20170625220127,0.0,8573995.0,117.7521,1354130.0,76.21077,2319469.0,,0.0,0
2017-06-26 02:59:57,0,WELL-00002,20170625220127,0.0,8574327.0,117.7520,1338698.0,76.24621,2319470.0,,0.0,0
2017-06-26 02:59:58,0,WELL-00002,20170625220127,0.0,8574660.0,117.7519,1323267.0,76.28165,2319470.0,,0.0,0
2017-06-26 02:59:59,0,WELL-00002,20170625220127,0.0,8574992.0,117.7518,1307835.0,76.31710,2319471.0,,0.0,0


In [4]:
tk.load_instance(simulated_instances[0])

timestamp,label,well,id,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,P-JUS-CKGL,T-JUS-CKGL,QGL,class
2018-04-02 08:37:37,6,SIMULATED,00035,22877860.0,15699480.0,,2857058.0,86.05852,,,,0
2018-04-02 08:37:38,6,SIMULATED,00035,22877850.0,15699630.0,,2857058.0,86.05852,,,,0
2018-04-02 08:37:39,6,SIMULATED,00035,22877830.0,15699590.0,,2857058.0,86.05851,,,,0
2018-04-02 08:37:40,6,SIMULATED,00035,22877810.0,15699540.0,,2857058.0,86.05851,,,,0
2018-04-02 08:37:41,6,SIMULATED,00035,22877810.0,15699560.0,,2857058.0,86.05851,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2018-04-02 16:07:31,6,SIMULATED,00035,25737120.0,18756440.0,,8092972.0,82.71154,,,,6
2018-04-02 16:07:32,6,SIMULATED,00035,25737120.0,18756440.0,,8092972.0,82.71153,,,,6
2018-04-02 16:07:33,6,SIMULATED,00035,25737120.0,18756440.0,,8092972.0,82.71153,,,,6
2018-04-02 16:07:34,6,SIMULATED,00035,25737120.0,18756440.0,,8092972.0,82.71152,,,,6


In [5]:
tk.load_instance(drawn_instances[0])

timestamp,label,well,id,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,P-JUS-CKGL,T-JUS-CKGL,QGL,class
2018-08-20 11:25:44,7,DRAWN,00009,235.0643,87.05141,109.9409,21.04113,59.85347,,,,0
2018-08-20 11:25:45,7,DRAWN,00009,235.0643,87.05141,109.9409,21.04113,59.85347,,,,0
2018-08-20 11:25:46,7,DRAWN,00009,235.0644,87.05141,109.9409,21.04113,59.85347,,,,0
2018-08-20 11:25:47,7,DRAWN,00009,235.0644,87.05141,109.9409,21.04113,59.85347,,,,0
2018-08-20 11:25:48,7,DRAWN,00009,235.0644,87.05141,109.9409,21.04113,59.85347,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2018-08-20 23:25:40,7,DRAWN,00009,235.9267,88.00000,109.9370,21.99229,59.82134,,,,107
2018-08-20 23:25:41,7,DRAWN,00009,235.9267,88.00000,109.9370,21.99229,59.82134,,,,107
2018-08-20 23:25:42,7,DRAWN,00009,235.9267,88.00000,109.9370,21.99229,59.82134,,,,107
2018-08-20 23:25:43,7,DRAWN,00009,235.9267,88.00000,109.9370,21.99229,59.82134,,,,107


Each instance is stored in a CSV file and loaded into a pandas DataFrame. Each observation occupies one line of the CSV file and one row of the DataFrame. The first line of each CSV file is a header with the column identifiers. The columns store the following information:

* **timestamp**: observation timestamps, loaded into the pandas DataFrame as its index;
* **P-PDG**: pressure variable at the Permanent Downhole Gauge (PDG);
* **P-TPT**: pressure variable at the Temperature and Pressure Transducer (TPT);
* **T-TPT**: temperature variable at the Temperature and Pressure Transducer (TPT);
* **P-MON-CKP**: pressure variable upstream of the production choke (CKP);
* **T-JUS-CKP**: temperature variable downstream of the production choke (CKP);
* **P-JUS-CKGL**: pressure variable downstream of the gas lift choke (CKGL);
* **T-JUS-CKGL**: temperature variable downstream of the gas lift choke (CKGL);
* **QGL**: gas lift flow rate;
* **class**: observation labels associated with three types of periods (normal, fault transient, and faulty steady state).
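
The three period types can be told apart from the numeric value in the `class` column. The sketch below assumes the coding convention described in the dataset's paper: 0 marks a normal period, the event label itself (1 to 8) marks the faulty steady state, and the label plus 100 marks the fault transient. This is consistent with the values 0, 6, and 107 visible in the excerpts above. The function name `describe_period` is hypothetical and not part of the toolkit.

```python
def describe_period(class_code):
    """Map a value of the `class` column to a period type.

    Assumed coding (not confirmed by this notebook):
    0 = normal, 1..8 = faulty steady state,
    101..108 = fault transient (event label + 100).
    """
    code = int(class_code)
    if code == 0:
        return "normal"
    if 1 <= code <= 8:
        return f"faulty steady state of event {code}"
    if 101 <= code <= 108:
        return f"fault transient of event {code - 100}"
    raise ValueError(f"unexpected class code: {class_code!r}")


# Class values that appear in the excerpts shown earlier.
for value in (0, 6, 107):
    print(value, "->", describe_period(value))
```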

Other information is also loaded into each pandas DataFrame:

* **label**: instance label (event type);
* **well**: well name. Hand-drawn and simulated instances have fixed names. Real instances have their names masked with an incremental id;
* **id**: instance identifier. Hand-drawn and simulated instances have an incremental id. Each real instance has an id generated from its first timestamp.
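
For the real instance shown earlier, the id (20170625220127) is the first timestamp (2017-06-25 22:01:27) with its separators removed. A minimal sketch of that derivation follows; it is inferred from this single example rather than taken from the toolkit, and `instance_id_from_timestamp` is a hypothetical helper name.

```python
from datetime import datetime


def instance_id_from_timestamp(first_timestamp):
    """Build an id string from a 'YYYY-MM-DD HH:MM:SS' timestamp (assumed format)."""
    ts = datetime.strptime(first_timestamp, "%Y-%m-%d %H:%M:%S")
    return ts.strftime("%Y%m%d%H%M%S")


# First timestamp of the real instance displayed above.
print(instance_id_from_timestamp("2017-06-25 22:01:27"))
```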

More information about these variables can be obtained from the following publicly available documents:

* ***Option in Portuguese***: R.E.V. Vargas. Base de dados e benchmarks para prognóstico de anomalias em sistemas de elevação de petróleo. Universidade Federal do Espírito Santo. Doctoral thesis. 2019. https://github.com/petrobras/3W/raw/master/docs/doctoral_thesis_ricardo_vargas.pdf.
* ***Option in English***: B.G. Carvalho. Evaluating machine learning techniques for detection of flow instability events in offshore oil wells. Universidade Federal do Espírito Santo. Master's degree dissertation. 2021. https://github.com/petrobras/3W/raw/master/docs/master_degree_dissertation_bruno_carvalho.pdf.

# 4. Table of Instances

The following table shows the number of instances that make up the 3W dataset, broken down by knowledge source (real, simulated, and hand-drawn instances) and by instance label.

In [6]:
toi = tk.create_table_of_instances(real_instances, simulated_instances, drawn_instances)
toi

INSTANCE LABEL,REAL,SIMULATED,HAND-DRAWN,TOTAL
0 - Normal Operation,597,0,0,597
1 - Abrupt Increase of BSW,5,114,10,129
2 - Spurious Closure of DHSV,22,16,0,38
3 - Severe Slugging,32,74,0,106
4 - Flow Instability,344,0,0,344
5 - Rapid Productivity Loss,12,439,0,451
6 - Quick Restriction in PCK,6,215,0,221
7 - Scaling in PCK,4,0,10,14
8 - Hydrate in Production Line,0,81,0,81
TOTAL,1022,939,20,1981
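
A table like this is a cross-tabulation of instance labels against knowledge sources, with row and column totals added. The sketch below shows the idea using only the standard library. The `(source, label)` pairs are made-up placeholders, not 3W data, and `tabulate_instances` is a hypothetical name rather than the toolkit's function.

```python
from collections import Counter

SOURCES = ("REAL", "SIMULATED", "HAND-DRAWN")


def tabulate_instances(pairs):
    """Count instances per (label, source) and add row and column totals."""
    counts = Counter((label, source) for source, label in pairs)
    labels = sorted({label for _, label in pairs})
    table = {}
    for label in labels:
        # Counter returns 0 for combinations that never occur.
        row = {source: counts[(label, source)] for source in SOURCES}
        row["TOTAL"] = sum(row.values())
        table[label] = row
    columns = SOURCES + ("TOTAL",)
    table["TOTAL"] = {
        col: sum(table[label][col] for label in labels) for col in columns
    }
    return table


# Placeholder input: one (source, label) pair per instance.
pairs = [
    ("REAL", 0),
    ("REAL", 0),
    ("REAL", 7),
    ("SIMULATED", 1),
    ("HAND-DRAWN", 7),
]
for label, row in tabulate_instances(pairs).items():
    print(label, row)
```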


# 5. Rare Undesirable Events

Considering only **real instances** and a **threshold of 1%**, the 3W dataset has the following number of instances of rare undesirable events.

In [7]:
threshold = 0.01
rue = tk.filter_rare_undesirable_events(toi, threshold)
rue

INSTANCE LABEL,REAL,SIMULATED,HAND-DRAWN,TOTAL
1 - Abrupt Increase of BSW,5,114,10,129
6 - Quick Restriction in PCK,6,215,0,221
7 - Scaling in PCK,4,0,10,14
8 - Hydrate in Production Line,0,81,0,81
TOTAL,15,410,20,445
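
The outputs in this section are consistent with a simple rule: an event type counts as rare when its number of instances, summed over the sources being considered, is below the threshold times the total number of instances from those same sources. The sketch below applies that rule to the counts from the table of instances. It is an inference from the printed results, not the toolkit's actual implementation, and `rare_event_labels` is a hypothetical name.

```python
# Instance counts per label as (real, simulated, hand-drawn),
# copied from the table of instances above.
COUNTS = {
    0: (597, 0, 0),
    1: (5, 114, 10),
    2: (22, 16, 0),
    3: (32, 74, 0),
    4: (344, 0, 0),
    5: (12, 439, 0),
    6: (6, 215, 0),
    7: (4, 0, 10),
    8: (0, 81, 0),
}


def rare_event_labels(counts, threshold, simulated=False, drawn=False):
    """Return labels whose considered instance count is below threshold * total."""
    use = (True, simulated, drawn)  # real instances are always considered
    considered = {
        label: sum(n for n, flag in zip(row, use) if flag)
        for label, row in counts.items()
    }
    limit = threshold * sum(considered.values())
    return [label for label, n in considered.items() if n < limit]


print(rare_event_labels(COUNTS, 0.01))                              # [1, 6, 7, 8]
print(rare_event_labels(COUNTS, 0.01, simulated=True))              # [7]
print(rare_event_labels(COUNTS, 0.01, simulated=True, drawn=True))  # [7]
```

With real instances only, the limit is 1% of 1022, so labels 1, 6, 7, and 8 qualify. Once simulated instances are counted as well, only label 7 stays below the limit.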


If **simulated instances** are also considered, the number of instances in the 3W dataset becomes the one listed below.

In [8]:
rue = tk.filter_rare_undesirable_events(toi, threshold, simulated=True)
rue

INSTANCE LABEL,REAL,SIMULATED,HAND-DRAWN,TOTAL
7 - Scaling in PCK,4,0,10,14
TOTAL,4,0,10,14


After also considering the **hand-drawn instances**, we get the final number of instances in the 3W dataset.

In [9]:
rue = tk.filter_rare_undesirable_events(toi, threshold, simulated=True, drawn=True)
rue

INSTANCE LABEL,REAL,SIMULATED,HAND-DRAWN,TOTAL
7 - Scaling in PCK,4,0,10,14
TOTAL,4,0,10,14


# 6. Scatter Map of Real Instances

A scatter map with all the **real instances** is shown below. The oldest one occurred in mid-2012 and the most recent one in mid-2018. In addition to the total number of wells considered, this map gives an overview of how the instances are distributed over time and across wells.

In [None]:
tk.create_and_plot_scatter_map(real_instances)

# 7. Some Statistics

The fundamental aspects of the 3W dataset that relate to the inherent difficulties of actual data are presented next.

In [None]:
stats = tk.calc_stats_instances(real_instances, simulated_instances, drawn_instances)
stats

In [12]:
# Create the CSV file and write the first real instance, including the header row.
# Note: index=False leaves the timestamp index out of the file.
df = tk.load_instance(real_instances[0])
df.to_csv("3Wdataset.csv", index=False)

# Append the remaining real instances without repeating the header.
for instance in real_instances[1:]:
    df = tk.load_instance(instance)
    df.to_csv("3Wdataset.csv", mode="a", index=False, header=False)

In [13]:
# Append all simulated instances to the file without repeating the header.
for instance in simulated_instances:
    df = tk.load_instance(instance)
    df.to_csv("3Wdataset.csv", mode="a", index=False, header=False)

In [14]:
# Append all hand-drawn instances to the file without repeating the header.
for instance in drawn_instances:
    df = tk.load_instance(instance)
    df.to_csv("3Wdataset.csv", mode="a", index=False, header=False)