# Experiment with data from all wells

This notebook covers data acquisition, preprocessing, modeling, and validation.
The goal is to obtain an anomaly classifier trained on data from all wells, including every source type: real, simulated, and hand-drawn events.

## Data acquisition

Setting up the environment:

In [1]:
# Environment configuration
import raw_data_manager.raw_data_acquisition as rda
import raw_data_manager.raw_data_inspector as rdi
import raw_data_manager.raw_data_splitter as rds
from data_exploration.metric_acquisition import MetricAcquisition
from data_preparation.transformation_manager import TransformationManager
from constants import utils, config
import pathlib

# Set default logging level.
from absl import logging
logging.set_verbosity(logging.DEBUG)

2023-09-11 14:36:53.347187: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-11 14:36:53.350606: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-11 14:36:53.420097: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-11 14:36:53.421620: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Download the 3W dataset (if it is not available locally) and generate the metadata table.

In [2]:
# Acquire data (of entire 3W dataset)
rda.acquire_dataset_if_needed()
latest_converted_data_path, latest_converted_data_version = (
    rda.get_latest_local_converted_data_version(config.DIR_PROJECT_DATA)
)

# Helper to overview metadata (of entire 3W dataset)
inspector_all_data = rdi.RawDataInspector(
    dataset_dir=latest_converted_data_path,
    cache_file_path=config.PATH_DATA_INSPECTOR_CACHE,
    use_cached=True
)
metadata_all_data = inspector_all_data.get_metadata_table()
metadata_all_data

INFO:absl:Directory with the biggest version: /home/ubuntu/lemi_3w/data/dataset_converted_v10101
INFO:absl:Version: 10101
INFO:absl:Latest local version is 10101
INFO:absl:Going to fetch config file from $https://raw.githubusercontent.com/petrobras/3W/main/dataset/dataset.ini
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2786  100  2786    0     0   6517      0 --:--:-- --:--:-- --:--:--  6524
INFO:absl:Latest online version is 10101
INFO:absl:Found existing converted data with dataset version of 10101
INFO:absl:Directory with the biggest version: /home/ubuntu/lemi_3w/data/dataset_converted_v10101
INFO:absl:Version: 10101


hash_id,class_type,source,well_id,path,timestamp,file_size,num_timesteps
74203bb,NORMAL,REAL,1.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-05-24 03:00:00,491415,17885
9fbd6f9,NORMAL,REAL,2.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-08-09 06:00:00,520154,17933
28804c5,NORMAL,REAL,6.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-05-08 09:00:31,349162,17970
42afe91,NORMAL,REAL,8.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-07-01 14:01:35,251880,17799
fa71d94,NORMAL,REAL,6.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-08-23 19:00:00,279737,17949
...,...,...,...,...,...,...,...
ea66cf6,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2315903,61999
34f032a,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2259539,61999
876a969,SEVERE_SLUGGING,REAL,14.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-09-25 06:00:42,1005717,17959
deac7ec,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2045137,61999


Split the data into training and test sets using a stratified split.

In [3]:
# Split the data from all wells into train and test datasets (20% held out for test)
splitter = rds.RawDataSplitter(metadata_all_data, latest_converted_data_version)
split_train_dir, split_test_dir = splitter.stratefy_split_of_data(
    data_dir=config.DIR_PROJECT_DATA, 
    test_size=0.20,
)

# generates metadata tables for split data
train_metadata = rdi.RawDataInspector(
    dataset_dir=split_train_dir,
    cache_file_path=config.DIR_PROJECT_CACHE / "train_metadata_all_data.parquet",
    use_cached=True
)
test_metadata = rdi.RawDataInspector(
    dataset_dir=split_test_dir,
    cache_file_path=config.DIR_PROJECT_CACHE / "test_metadata_all_data.parquet",
    use_cached=True
)

DEBUG:absl:size of train data: 1582 --- size of test data: 396
DEBUG:absl:train path /home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train --- test path /home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_test


INFO:absl:Processing 475 events of class type 0.
INFO:absl:Processing 275 events of class type 4.
INFO:absl:Processing 360 events of class type 5.
INFO:absl:Processing 103 events of class type 1.
INFO:absl:Processing 12 events of class type 7.
INFO:absl:Processing 177 events of class type 6.
INFO:absl:Processing 65 events of class type 8.
INFO:absl:Processing 31 events of class type 2.
INFO:absl:Processing 84 events of class type 3.

INFO:absl:Found 9. The first one is [EventMetadata(hash_id='74203bbab06b84bf2c0303b641133f990baa98a3bc295b5a8eb08a3bbab91de9', class_type='NORMAL', source='REAL', well_id=1, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train/0/WELL-00001_20170524030000.parquet', timestamp=Timestamp('2017-05-24 03:00:00'), file_size=491415, num_timesteps=17885), EventMetadata(hash_id='9fbd6f90b945864b7219a36f6ebc898128410ad30a5ffedb63622720f766ba7e', class_type='NORMAL', source='REAL', well_id=2, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train/0/WELL-00002_20170809060000.parquet', timestamp=Timestamp('2017-08-09 06:00:00'), file_size=520154, num_timesteps=17933), EventMetadata(hash_id='28804c50819c88a932a18434f553a2582ff07e28bf638fd29426e189113f430d', class_type='NORMAL', source='REAL', well_id=6, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train/0/

INFO:absl:Processing 69 events of class type 4.
INFO:absl:Processing 90 events of class type 5.
INFO:absl:Processing 26 events of class type 1.
INFO:absl:Processing 3 events of class type 7.
INFO:absl:Processing 44 events of class type 6.
INFO:absl:Processing 16 events of class type 8.
INFO:absl:Processing 7 events of class type 2.
INFO:absl:Processing 22 events of class type 3.

INFO:absl:Found 9. The first one is [EventMetadata(hash_id='42afe910143a182e986ae8692f7dfdc5ec7f5874ce368c4524b5ce67782e74f6', class_type='NORMAL', source='REAL', well_id=8, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_test/0/WELL-00008_20170701140135.parquet', timestamp=Timestamp('2017-07-01 14:01:35'), file_size=251880, num_timesteps=17799), EventMetadata(hash_id='4f339b10715248cb64ced140489482fbd06d7df82b5cb0f789b668f7cc04cc12', class_type='NORMAL', source='REAL', well_id=8, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_test/0/WELL-00008_20170914020321.parquet', timestamp=Timestamp('2017-09-14 02:03:21'), file_size=240749, num_timesteps=17704), EventMetadata(hash_id='982b4a6a524b424a59f775ccaf0e7e05e86124805832f4865896e2f76403b937', class_type='NORMAL', source='REAL', well_id=1, path='/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_test/0/WEL
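
To make the idea of a stratified split concrete, the sketch below splits a list of (event id, class) pairs so that each class contributes roughly the same fraction to the test set. It is a minimal illustration in plain Python, not the implementation of `RawDataSplitter`; the function name `stratified_split`, the per-class rounding rule, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(items, test_size=0.20, seed=0):
    """Split (item_id, class_label) pairs, keeping class proportions.

    Hypothetical helper for illustration only; the project's own splitter
    may allocate items to each split differently.
    """
    by_class = defaultdict(list)
    for item_id, label in items:
        by_class[label].append(item_id)

    rng = random.Random(seed)
    train, test = [], []
    for label, ids in sorted(by_class.items()):
        ids = list(ids)
        rng.shuffle(ids)
        # Assumption: round the per-class test count to the nearest integer.
        n_test = round(len(ids) * test_size)
        test.extend((i, label) for i in ids[:n_test])
        train.extend((i, label) for i in ids[n_test:])
    return train, test


# Toy input: 50 events of class 0 and 10 events of class 7.
events = [(f"a{i}", 0) for i in range(50)] + [(f"b{i}", 7) for i in range(10)]
train, test = stratified_split(events, test_size=0.20)
print(len(train), len(test))
```

Splitting per class, rather than over the whole list at once, is what keeps rare classes (such as class type 7 in the logs above) represented in both splits.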

Anomaly counts by source type for the training set (`soma` is the row total).

In [4]:
rdi.RawDataInspector.generate_table_by_anomaly_source(train_metadata.get_metadata_table())

anomaly,real_count,simul_count,drawn_count,soma
NORMAL,475,0,0,475
ABRUPT_INCREASE_BSW,4,91,8,103
SPURIOUS_CLOSURE_DHSV,18,13,0,31
SEVERE_SLUGGING,25,59,0,84
FLOW_INSTABILITY,275,0,0,275
RAPID_PRODUCTIVITY_LOSS,9,351,0,360
QUICK_RESTRICTION_PCK,5,172,0,177
SCALING_IN_PCK,4,0,8,12
HYDRATE_IN_PRODUCTION_LINE,0,65,0,65
Total,815,751,16,1582


Anomaly counts by source type for the test set (`soma` is the row total).

In [5]:
rdi.RawDataInspector.generate_table_by_anomaly_source(test_metadata.get_metadata_table())

anomaly,real_count,simul_count,drawn_count,soma
NORMAL,119,0,0,119
ABRUPT_INCREASE_BSW,1,23,2,26
SPURIOUS_CLOSURE_DHSV,4,3,0,7
SEVERE_SLUGGING,7,15,0,22
FLOW_INSTABILITY,69,0,0,69
RAPID_PRODUCTIVITY_LOSS,2,88,0,90
QUICK_RESTRICTION_PCK,1,43,0,44
SCALING_IN_PCK,1,0,2,3
HYDRATE_IN_PRODUCTION_LINE,0,16,0,16
Total,204,188,4,396


## Data processing

Load the per-variable mean and standard deviation used for the data transformation. These were computed beforehand and cached.

In [6]:
# Get cached metrics---standard deviation and average---to be used for data transformation
mean_and_std_metric_cache_file_name = config.CACHE_NAME_TRAIN_MEAN_STD_DEV
mean_and_std_metric_table = MetricAcquisition.get_mean_and_std_metric_from_cache(
    mean_and_std_metric_cache_file_name
)
mean_metric_list = mean_and_std_metric_table['mean_of_means']
std_metric_list = mean_and_std_metric_table['mean_of_stds']

mean_and_std_metric_table

variable,mean_of_means,mean_of_stds
P-PDG,16507490.0,12015520.0
P-TPT,15175840.0,3687992.0
T-TPT,106.1237,16.29953
P-MON-CKP,4729793.0,3068575.0
T-JUS-CKP,78.43213,18.30099
P-JUS-CKGL,350111700.0,245058900.0
QGL,0.2603752,0.1638849
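
The cached metrics are presumably used to put variables with very different magnitudes (pressures around 1e7, flow rates below 1) on a comparable scale. The sketch below assumes plain z-score standardization, which is a common choice but is not confirmed by this notebook; `standardize` is a hypothetical helper and the sample reading is made up.

```python
# Mean and standard deviation copied from the table above.
METRICS = {
    "P-TPT": (15175840.0, 3687992.0),
    "T-TPT": (106.1237, 16.29953),
}


def standardize(value, mean, std):
    """Return the z-score of value (assumed transformation, for illustration)."""
    return (value - mean) / std


# Made-up reading placed exactly one standard deviation above the mean.
mean, std = METRICS["T-TPT"]
sample = mean + std
z = standardize(sample, mean, std)
print(z)
```

After this kind of scaling, a pressure reading and a temperature reading that are each one standard deviation above their means map to about the same value, which tends to make training less sensitive to the raw units.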


In [None]:
# Name the transformed output folder after the train split directory
train_tranformed_folder_name = split_train_dir.name

train_transformation_manager = TransformationManager(
    train_metadata.get_metadata_table(),
    output_folder_base_name=train_tranformed_folder_name
)

# Transformation parameters: sampling interval and window length (in timesteps)
transformation_param_sample_interval_seconds = 60
transformation_param_num_timesteps_for_window = 20

# Apply the transformations to every training event, using the cached mean/std metrics
train_transformation_manager.apply_transformations_to_table(
    output_parent_dir=config.DIR_PROJECT_DATA,
    sample_interval_seconds=transformation_param_sample_interval_seconds,
    num_timesteps_for_window=transformation_param_num_timesteps_for_window,
    avg_variable_mean=mean_metric_list,
    avg_variable_std_dev=std_metric_list,
)
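
The two parameters above suggest that each event is resampled to one reading every 60 seconds and then cut into windows of 20 timesteps. The sketch below illustrates that idea in plain Python. It is not the `TransformationManager` implementation: the helper names, the 1-second raw sampling interval, simple decimation, and non-overlapping windows are all assumptions.

```python
def downsample(series, raw_interval_seconds, sample_interval_seconds):
    """Keep one sample every sample_interval_seconds (simple decimation)."""
    step = sample_interval_seconds // raw_interval_seconds
    return series[::step]


def make_windows(series, num_timesteps_for_window):
    """Cut the series into consecutive, non-overlapping windows."""
    n = len(series) // num_timesteps_for_window
    return [
        series[i * num_timesteps_for_window:(i + 1) * num_timesteps_for_window]
        for i in range(n)
    ]


# Assumption: raw events are sampled once per second.
raw = list(range(17885))  # same length as the first event in the metadata table
sampled = downsample(raw, raw_interval_seconds=1, sample_interval_seconds=60)
windows = make_windows(sampled, num_timesteps_for_window=20)
print(len(sampled), len(windows), len(windows[0]))
```

Under these assumptions each window holds 20 readings taken 60 seconds apart, so one training example summarizes about 20 minutes of operation.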

## Modeling

In [None]:
# Get transformed file paths.
# Note: despite the "single_well" / "test" wording in the names below, these
# objects refer to the transformed training split built from all wells.
tranformed_train_single_well_dataset_dir = (
    config.DIR_PROJECT_DATA / 
    (TransformationManager.TRANSFORMATION_NAME_PREFIX + train_tranformed_folder_name))

inspector_test_single_well_transformed = rdi.RawDataInspector(
    dataset_dir=tranformed_train_single_well_dataset_dir,
    cache_file_path=config.DIR_PROJECT_CACHE / "single_well_transformed.parquet",
    use_cached=True
)
metadata_train_single_well_transformed = inspector_test_single_well_transformed.get_metadata_table()
transformed_train_data_single_well_file_path_list = metadata_train_single_well_transformed["path"]


def data_generator_loop(file_path_list):
    """Yield one (X, y) pair per file, cycling over the list indefinitely."""
    while True:
        for file_path in file_path_list:
            X, y = TransformationManager.retrieve_pair_array(pathlib.Path(file_path))
            yield X, y

def data_generator_non_loop(file_path_list):
    """Yield one (X, y) pair per file, stopping after a single pass."""
    for file_path in file_path_list:
        X, y = TransformationManager.retrieve_pair_array(pathlib.Path(file_path))
        yield X, y
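
The difference between the two generators matters when they are consumed: the looping one never raises `StopIteration`, so the consumer has to decide how many items to take (with Keras `fit`, this is typically done through `steps_per_epoch`), while the single-pass one ends after the last file. The sketch below shows this with a hypothetical `fake_loader` standing in for `TransformationManager.retrieve_pair_array`; the file names are made up.

```python
import itertools


def fake_loader(file_path):
    """Hypothetical stand-in for TransformationManager.retrieve_pair_array."""
    return f"X({file_path})", f"y({file_path})"


def loop_forever(file_path_list):
    """Cycle over the files indefinitely, one (X, y) pair per file."""
    while True:
        for file_path in file_path_list:
            yield fake_loader(file_path)


def single_pass(file_path_list):
    """Visit each file exactly once, then stop."""
    for file_path in file_path_list:
        yield fake_loader(file_path)


paths = ["a.parquet", "b.parquet", "c.parquet"]

# The looping generator is endless, so take a bounded number of items from it.
first_five = list(itertools.islice(loop_forever(paths), 5))
# The single-pass generator can be exhausted safely.
one_pass = list(single_pass(paths))
print(len(first_five), len(one_pass))
```

A looping generator suits multi-epoch training, whereas a single-pass generator suits evaluation or any step that should see each file once.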

In [None]:
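from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from constants import module_constants

# Read one (X, y) pair to infer the number of features; X is expected to have
# shape (num_windows, num_timesteps, num_features).
X, _ = next(data_generator_non_loop(transformed_train_data_single_well_file_path_list))
num_features = X.shape[2]
num_outputs = module_constants.num_class_types

# LSTM encoder followed by a dense classification head with softmax output
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(transformation_param_num_timesteps_for_window, num_features)))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_outputs, activation='softmax'))

# Categorical cross-entropy expects one-hot encoded labels
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()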
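
As a sanity check on what `model.summary()` should report, the parameter count of this architecture can be worked out by hand. The sketch below assumes 7 input features (one per variable in the mean/std table) and 9 output classes (class types 0 to 8 appear in the logs); the real values come from `X.shape[2]` and `module_constants.num_class_types`, so the total may differ.

```python
def lstm_params(units, num_features):
    """Trainable parameters of an LSTM layer with biases (4 gates)."""
    return 4 * (units * (units + num_features) + units)


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


# Assumptions for illustration: 7 input features and 9 output classes.
num_features = 7
num_outputs = 9

total = (
    lstm_params(50, num_features)       # LSTM(50)
    + dense_params(50, 100)             # Dense(100); Dropout adds no parameters
    + dense_params(100, num_outputs)    # Dense(num_outputs)
)
print(total)  # 17609 under the assumptions above
```

Most of the parameters sit in the LSTM layer (11600 of 17609 here), so changing the number of LSTM units has the largest effect on model size.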