# Initial Analysis

## Business Understanding
The dataset is described in [this paper](../../A%20realistic%20and%20public%20dataset%20with%20rare%20u%20-%20Ricardo%20Emanuel%20Vaz%20Vargas.pdf).

## Data Understanding

### Imports and Configurations

In [6]:
import sys
import os

sys.path.append(os.path.join('..','..'))
import toolkit as tk

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

### Glossary
* Instance: represents a label associated with an undesirable event;
* timestamp: observation timestamps, loaded into the pandas DataFrame as its index;
* P-PDG: pressure variable at the Permanent Downhole Gauge (PDG);
* P-TPT: pressure variable at the Temperature and Pressure Transducer (TPT);
* T-TPT: temperature variable at the Temperature and Pressure Transducer (TPT);
* P-MON-CKP: pressure variable upstream of the production choke (CKP);
* T-JUS-CKP: temperature variable downstream of the production choke (CKP);
* P-JUS-CKGL: pressure variable downstream of the gas lift choke (CKGL);
* T-JUS-CKGL: temperature variable downstream of the gas lift choke (CKGL);
* QGL: gas lift flow rate;
* class: observation labels associated with three types of periods (normal, fault transient, and faulty steady state).
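
As a rough illustration of the layout the glossary describes, the sketch below builds a small DataFrame with a `timestamp` index, the eight sensor columns, and a `class` column. All values are synthetic placeholders rather than 3W observations, and the one-second sampling interval and the constant `class` value are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Column names taken from the glossary above.
VARIABLES = [
    "P-PDG", "P-TPT", "T-TPT", "P-MON-CKP",
    "T-JUS-CKP", "P-JUS-CKGL", "T-JUS-CKGL", "QGL",
]

# Assumption: one observation per second; the start date is arbitrary.
index = pd.date_range("2020-01-01", periods=5, freq="s", name="timestamp")

# Synthetic placeholder values, NOT real 3W measurements.
rng = np.random.default_rng(0)
example = pd.DataFrame(
    rng.normal(size=(len(index), len(VARIABLES))),
    index=index,
    columns=VARIABLES,
)

# Placeholder label: one value per observation, constant here for simplicity.
example["class"] = 0

print(example.head())
```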

In [17]:
# Using 3W Toolkit
# Real instances: recorded in actual Petrobras wells during oil production
# Simulated instances: obtained with a standard flow simulator (OLGA)
# Hand-drawn instances: drawn by Petrobras experts

real_instances, simulated_instances, drawn_instances = tk.get_all_labels_and_files()

# Number of instances in the 3W dataset, by knowledge source and by instance label.
toi = tk.create_table_of_instances(real_instances, simulated_instances, drawn_instances)
toi

| INSTANCE LABEL / SOURCE | REAL | SIMULATED | HAND-DRAWN | TOTAL |
|---|---|---|---|---|
| 0 - Normal Operation | 597 | 0 | 0 | 597 |
| 1 - Abrupt Increase of BSW | 5 | 114 | 10 | 129 |
| 2 - Spurious Closure of DHSV | 22 | 16 | 0 | 38 |
| 3 - Severe Slugging | 32 | 74 | 0 | 106 |
| 4 - Flow Instability | 344 | 0 | 0 | 344 |
| 5 - Rapid Productivity Loss | 12 | 439 | 0 | 451 |
| 6 - Quick Restriction in PCK | 6 | 215 | 0 | 221 |
| 7 - Scaling in PCK | 4 | 0 | 10 | 14 |
| 8 - Hydrate in Production Line | 0 | 81 | 0 | 81 |
| TOTAL | 1022 | 939 | 20 | 1981 |
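
To make the class and source imbalance in this table easier to inspect, the sketch below re-enters the counts by hand into a pandas DataFrame, recomputes the totals, and derives each source's share of the dataset. It does not use the toolkit; the variable names (`counts`, `source_share`, `no_real`) are introduced here only for illustration.

```python
import pandas as pd

labels = [
    "0 - Normal Operation",
    "1 - Abrupt Increase of BSW",
    "2 - Spurious Closure of DHSV",
    "3 - Severe Slugging",
    "4 - Flow Instability",
    "5 - Rapid Productivity Loss",
    "6 - Quick Restriction in PCK",
    "7 - Scaling in PCK",
    "8 - Hydrate in Production Line",
]
sources = ["REAL", "SIMULATED", "HAND-DRAWN"]

# Counts copied from the table of instances above.
counts = pd.DataFrame(
    {
        "REAL": [597, 5, 22, 32, 344, 12, 6, 4, 0],
        "SIMULATED": [0, 114, 16, 74, 0, 439, 215, 0, 81],
        "HAND-DRAWN": [0, 10, 0, 0, 0, 0, 0, 10, 0],
    },
    index=labels,
)
counts["TOTAL"] = counts[sources].sum(axis=1)

# Percentage of all instances contributed by each knowledge source.
source_share = (counts[sources].sum() / counts["TOTAL"].sum() * 100).round(1)

# Labels that have no real instance at all.
no_real = counts.index[counts["REAL"] == 0].tolist()

print(source_share)
print(no_real)
```

Labels with few or no real instances would have to rely on simulated or hand-drawn data, which is worth keeping in mind when splitting data for modeling.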


In [46]:
# tk.calc_stats_instances(real_instances, simulated_instances, drawn_instances)

In [37]:
# tk.create_and_plot_scatter_map(real_instances)

## Data Preparation

In [45]:
df = tk.load_instance(real_instances[4])
df.describe()

| | label | P-PDG | P-TPT | T-TPT | P-MON-CKP | T-JUS-CKP | P-JUS-CKGL | T-JUS-CKGL | QGL | class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 17921.0 | 17921.0 | 17921.0 | 17921.0 | 17921.0 | 17921.0 | 17921.0 | 0.0 | 17921.0 | 17921.0 |
| mean | 0.0 | 0.0 | 8499704.0 | 117.355572 | 1722819.0 | 74.632529 | 2326468.0 | NaN | 0.0 | 0.0 |
| std | 0.0 | 0.0 | 230380.5 | 0.287969 | 212108.3 | 1.326988 | 3607.33 | NaN | 0.0 | 0.0 |
| min | 0.0 | 0.0 | 7999780.0 | 116.4031 | 1189927.0 | 71.965 | 2320173.0 | NaN | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 8335331.0 | 117.1726 | 1559202.0 | 73.51748 | 2323334.0 | NaN | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 8599737.0 | 117.3406 | 1704526.0 | 74.68918 | 2326494.0 | NaN | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 8692362.0 | 117.5749 | 1870320.0 | 75.8548 | 2329655.0 | NaN | 0.0 | 0.0 |
| max | 0.0 | 0.0 | 8777813.0 | 118.0437 | 2424703.0 | 77.33486 | 2332113.0 | NaN | 0.0 | 0.0 |
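
In this summary, `T-JUS-CKGL` has a count of 0 (every value is missing), while `P-PDG` and `QGL` have zero standard deviation. One possible preparation step is to flag such columns before modeling. The sketch below is a minimal example of that check on a toy frame; the helper `summarize_unusable_columns` and the toy values are hypothetical and are not part of the toolkit.

```python
import numpy as np
import pandas as pd


def summarize_unusable_columns(frame):
    """Return (all_missing, constant) lists of column names.

    all_missing: columns in which every value is NaN.
    constant: columns with at most one distinct non-missing value.
    """
    all_missing = [c for c in frame.columns if frame[c].isna().all()]
    constant = [
        c
        for c in frame.columns
        if c not in all_missing and frame[c].nunique(dropna=True) <= 1
    ]
    return all_missing, constant


# Toy values that mimic the pattern in the describe() output above.
toy = pd.DataFrame(
    {
        "P-TPT": [8.4e6, 8.5e6, 8.6e6],
        "T-JUS-CKGL": [np.nan, np.nan, np.nan],
        "QGL": [0.0, 0.0, 0.0],
    }
)

all_missing, constant = summarize_unusable_columns(toy)
print("all missing:", all_missing)
print("constant:", constant)
```

Whether a flagged column should be dropped, imputed, or kept is a modeling decision. A constant `label` or `class` column, for example, is expected within a single instance.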


## Modeling

## Evaluation