# Experiment with data from all wells

This notebook carries out data acquisition, preprocessing, modeling, and validation.
The goal is to obtain an anomaly classifier for data from all wells (all data types are included, including simulated and hand-drawn sources).

## Data acquisition

Setting up the environment:

In [1]:
# Environment configuration
import raw_data_manager.raw_data_acquisition as rda
import raw_data_manager.raw_data_inspector as rdi
import raw_data_manager.raw_data_splitter as rds
from data_exploration.metric_acquisition import MetricAcquisition
from data_preparation.transformation_manager import TransformationManager
from constants import utils, config
import pathlib

# Set default logging level.
from absl import logging
logging.set_verbosity(logging.DEBUG)

2023-09-11 18:07:37.102043: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-11 18:07:37.105291: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-11 18:07:37.183559: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-11 18:07:37.185507: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Download the 3W dataset (if not available locally) and generate the metadata table.

In [2]:
## Acquire data (of entire 3W dataset)
rda.acquire_dataset_if_needed()
latest_converted_data_path, latest_converted_data_version = (
    rda.get_latest_local_converted_data_version(config.DIR_PROJECT_DATA)
)

# Helper to overview metadata (of entire 3W dataset)
inspector_all_data = rdi.RawDataInspector(
    latest_converted_data_path,
    config.PATH_DATA_INSPECTOR_CACHE,
    True
)
metadata_all_data = inspector_all_data.get_metadata_table()
metadata_all_data

INFO:absl:Directory with the biggest version: /home/ubuntu/lemi_3w/data/dataset_converted_v10101
INFO:absl:Version: 10101
INFO:absl:Latest local version is 10101
INFO:absl:Going to fetch config file from $https://raw.githubusercontent.com/petrobras/3W/main/dataset/dataset.ini
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2786  100  2786    0     0   3750      0 --:--:-- --:--:-- --:--:--  3749
INFO:absl:Latest online version is 10101
INFO:absl:Found existing converted data with dataset version of 10101
INFO:absl:Directory with the biggest version: /home/ubuntu/lemi_3w/data/dataset_converted_v10101
INFO:absl:Version: 10101


hash_id,class_type,source,well_id,path,timestamp,file_size,num_timesteps
74203bb,NORMAL,REAL,1.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-05-24 03:00:00,491415,17885
9fbd6f9,NORMAL,REAL,2.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-08-09 06:00:00,520154,17933
28804c5,NORMAL,REAL,6.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-05-08 09:00:31,349162,17970
42afe91,NORMAL,REAL,8.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-07-01 14:01:35,251880,17799
fa71d94,NORMAL,REAL,6.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-08-23 19:00:00,279737,17949
...,...,...,...,...,...,...,...
ea66cf6,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2315903,61999
34f032a,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2259539,61999
876a969,SEVERE_SLUGGING,REAL,14.0,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,2017-09-25 06:00:42,1005717,17959
deac7ec,SEVERE_SLUGGING,SIMULATED,,/home/ubuntu/lemi_3w/data/dataset_converted_v1...,NaT,2045137,61999


Split the data (stratified by class) into training and test sets.

In [3]:
# Split data from all wells into train and test datasets
splitter = rds.RawDataSplitter(metadata_all_data, latest_converted_data_version)
split_train_dir, split_test_dir = splitter.stratefy_split_of_data(
    data_dir=config.DIR_PROJECT_DATA, 
    test_size=0.20,
)

# generates metadata tables for split data
train_metadata = rdi.RawDataInspector(
    dataset_dir=split_train_dir,
    cache_file_path=config.DIR_PROJECT_CACHE / "train_metadata_all_data.parquet",
    use_cached=True
)
test_metadata = rdi.RawDataInspector(
    dataset_dir=split_test_dir,
    cache_file_path=config.DIR_PROJECT_CACHE / "test_metadata_all_data.parquet",
    use_cached=True
)

DEBUG:absl:size of train data: 1582 --- size of test data: 396
DEBUG:absl:train path /home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train --- test path /home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_test


DONE:   0%|          | 0/1582 [00:00<?, ?it/s]

DONE:   0%|          | 0/396 [00:00<?, ?it/s]
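
Stratifying by class keeps each anomaly type at roughly the same share in both subsets. The sketch below illustrates the idea in plain Python. `stratified_split` is a hypothetical helper written for this note, not the implementation behind `RawDataSplitter.stratefy_split_of_data`, and the toy file names and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_size=0.20, seed=0):
    """Split items into train/test, taking the same fraction from every label.

    Hypothetical helper for illustration only.
    """
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Same fraction from every class, rounded to a whole number of files.
        n_test = round(len(group) * test_size)
        test.extend((item, label) for item in group[:n_test])
        train.extend((item, label) for item in group[n_test:])
    return train, test


# Toy metadata: 10 NORMAL files and 5 FLOW_INSTABILITY files.
files = [f"file_{i:02d}" for i in range(15)]
classes = ["NORMAL"] * 10 + ["FLOW_INSTABILITY"] * 5
train_split, test_split = stratified_split(files, classes, test_size=0.20)
print(len(train_split), len(test_split))
```

With a plain random split, a rare class could end up entirely in one subset; splitting inside each class group avoids that.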

Table of anomalies by source type: training set.

In [4]:
rdi.RawDataInspector.generate_table_by_anomaly_source(train_metadata.get_metadata_table())

anomaly,real_count,simul_count,drawn_count,soma
NORMAL,475,0,0,475
ABRUPT_INCREASE_BSW,4,91,8,103
SPURIOUS_CLOSURE_DHSV,18,13,0,31
SEVERE_SLUGGING,25,59,0,84
FLOW_INSTABILITY,275,0,0,275
RAPID_PRODUCTIVITY_LOSS,9,351,0,360
QUICK_RESTRICTION_PCK,5,172,0,177
SCALING_IN_PCK,4,0,8,12
HYDRATE_IN_PRODUCTION_LINE,0,65,0,65
Total,815,751,16,1582


Table of anomalies by source type: test set.

In [5]:
rdi.RawDataInspector.generate_table_by_anomaly_source(test_metadata.get_metadata_table())

anomaly,real_count,simul_count,drawn_count,soma
NORMAL,119,0,0,119
ABRUPT_INCREASE_BSW,1,23,2,26
SPURIOUS_CLOSURE_DHSV,4,3,0,7
SEVERE_SLUGGING,7,15,0,22
FLOW_INSTABILITY,69,0,0,69
RAPID_PRODUCTIVITY_LOSS,2,88,0,90
QUICK_RESTRICTION_PCK,1,43,0,44
SCALING_IN_PCK,1,0,2,3
HYDRATE_IN_PRODUCTION_LINE,0,16,0,16
Total,204,188,4,396
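
A quick way to confirm that the split is stratified is to compare each class's share of the training set with its share of the test set. The sketch below does this with the per-class totals copied from the `soma` column of the two tables; the helper name `class_shares` is an assumption introduced for illustration.

```python
# Per-class totals copied from the "soma" column of the train and test tables.
train_counts = {
    "NORMAL": 475,
    "ABRUPT_INCREASE_BSW": 103,
    "SPURIOUS_CLOSURE_DHSV": 31,
    "SEVERE_SLUGGING": 84,
    "FLOW_INSTABILITY": 275,
    "RAPID_PRODUCTIVITY_LOSS": 360,
    "QUICK_RESTRICTION_PCK": 177,
    "SCALING_IN_PCK": 12,
    "HYDRATE_IN_PRODUCTION_LINE": 65,
}
test_counts = {
    "NORMAL": 119,
    "ABRUPT_INCREASE_BSW": 26,
    "SPURIOUS_CLOSURE_DHSV": 7,
    "SEVERE_SLUGGING": 22,
    "FLOW_INSTABILITY": 69,
    "RAPID_PRODUCTIVITY_LOSS": 90,
    "QUICK_RESTRICTION_PCK": 44,
    "SCALING_IN_PCK": 3,
    "HYDRATE_IN_PRODUCTION_LINE": 16,
}


def class_shares(counts):
    """Return each class's fraction of the total number of files."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)
for name in train_counts:
    print(f"{name:28s} train={train_shares[name]:.3f} test={test_shares[name]:.3f}")

# Largest absolute difference between the train and test shares of any class.
max_gap = max(abs(train_shares[c] - test_shares[c]) for c in train_counts)
print(f"largest gap: {max_gap:.4f}")
```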


## Data processing

Retrieve the mean and standard deviation of each variable. These values were computed beforehand.

In [6]:
# Get cached metrics---standard deviation and average---to be used for data transformation
mean_and_std_metric_cache_file_name = config.CACHE_NAME_TRAIN_MEAN_STD_DEV
mean_and_std_metric_table = MetricAcquisition.get_mean_and_std_metric_from_cache(
    mean_and_std_metric_cache_file_name
)
mean_metric_list = mean_and_std_metric_table['mean_of_means']
std_metric_list = mean_and_std_metric_table['mean_of_stds']

mean_and_std_metric_table

Unnamed: 0,mean_of_means,mean_of_stds
P-PDG,16507490.0,12015520.0
P-TPT,15175840.0,3687992.0
T-TPT,106.1237,16.29953
P-MON-CKP,4729793.0,3068575.0
T-JUS-CKP,78.43213,18.30099
P-JUS-CKGL,350111700.0,245058900.0
QGL,0.2603752,0.1638849
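
These statistics are passed to the transformation step as `avg_variable_mean` and `avg_variable_std_dev`. A minimal sketch of how such values are typically used, assuming the transformation applies standard z-score scaling (the notebook does not show the internals of `TransformationManager`, so this is an assumption), with two rows copied from the table:

```python
# Two rows copied from the cached metrics table (mean_of_means, mean_of_stds).
MEAN = {"T-TPT": 106.1237, "QGL": 0.2603752}
STD = {"T-TPT": 16.29953, "QGL": 0.1638849}


def standardize(value, variable):
    """Return (value - mean) / std for the given variable (assumed z-score scaling)."""
    return (value - MEAN[variable]) / STD[variable]


def destandardize(z, variable):
    """Invert standardize(), mapping a z-score back to the original unit."""
    return z * STD[variable] + MEAN[variable]


# A reading equal to the mean maps to 0; one standard deviation above maps to 1.
z_at_mean = standardize(MEAN["T-TPT"], "T-TPT")
z_one_std = standardize(MEAN["T-TPT"] + STD["T-TPT"], "T-TPT")
print(z_at_mean, z_one_std)
```

Scaling matters here because the variables differ by many orders of magnitude (pressures around 1e7, `QGL` below 1), which would otherwise let the pressure channels dominate training.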


Apply the data transformation.

In [7]:
train_tranformed_folder_name = split_train_dir.name

train_transformation_manager = TransformationManager(
    train_metadata.get_metadata_table(), 
    output_folder_base_name=train_tranformed_folder_name
)

transformation_param_sample_interval_seconds=60
transformation_param_num_timesteps_for_window=20

train_transformation_manager.apply_transformations_to_table(
    output_parent_dir=config.DIR_PROJECT_DATA,
    sample_interval_seconds=transformation_param_sample_interval_seconds,
    num_timesteps_for_window=transformation_param_num_timesteps_for_window,
    avg_variable_mean=mean_metric_list,
    avg_variable_std_dev=std_metric_list,
)

DEBUG:absl:TransformationManager initialized with 1582 items.
            Folder name is dataset_converted_v10101_split-20_source-all_class-all_well-all_train.


DONE:   0%|          | 0/1582 [00:00<?, ?it/s]

ERROR:   0%|          | 0/1582 [00:00<?, ?it/s]

ValueError: Exception while to split_sequences_into_windows. Path is: /home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train/5/SIMULATED_00046.parquet
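
The transformation call above uses `sample_interval_seconds=60` and `num_timesteps_for_window=20`. The sketch below shows one common way to downsample a series and cut it into fixed-length windows. It assumes the raw series has one row per second and uses a stride of one row between windows; both are assumptions, and `downsample` and `split_into_windows` are hypothetical helpers, not the project's `split_sequences_into_windows`.

```python
def downsample(rows, step):
    """Keep every `step`-th row (assumes rows are evenly spaced in time)."""
    return rows[::step]


def split_into_windows(rows, window_size, stride=1):
    """Return overlapping windows of `window_size` consecutive rows.

    Returns an empty list when the series is shorter than one window.
    """
    last_start = len(rows) - window_size
    return [rows[i:i + window_size] for i in range(0, last_start + 1, stride)]


# Toy series: 3000 rows, standing in for 50 minutes sampled once per second.
raw_rows = list(range(3000))

sampled = downsample(raw_rows, step=60)               # one row per minute
windows = split_into_windows(sampled, window_size=20)  # 20 timesteps per window
print(len(sampled), len(windows))
```

Each window then becomes one training sample of shape (timesteps, features) for the LSTM.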

In [12]:
path = "/home/ubuntu/lemi_3w/data/dataset_converted_v10101_split-20_source-all_class-all_well-all_train/5/SIMULATED_00046.parquet"

utils.get_event(path)["class"].isna().sum()

#sum(np.isnan(data))

0

## Modeling

Get the list of files to be used for training.

In [None]:
# Get transformed files paths
train_tranformed_dataset_dir = (
    config.DIR_PROJECT_DATA / 
    (TransformationManager.TRANSFORMATION_NAME_PREFIX + train_tranformed_folder_name))

train_transformed_inspector = rdi.RawDataInspector(
    train_tranformed_dataset_dir,
    config.DIR_PROJECT_CACHE / "inspector_transformed_all_data_train.parquet",
    True
)
train_transformed_metadata = train_transformed_inspector.get_metadata_table()
train_transformed_data_file_path_list = train_transformed_metadata["path"]


def data_generator_loop(file_path_list):
    """Yield one (X, y) batch per file, cycling over the list forever (for training)."""
    while True:
        for file_path in file_path_list:
            X, y = TransformationManager.retrieve_pair_array(pathlib.Path(file_path))
            yield X, y

def data_generator_non_loop(file_path_list):
    """Yield one (X, y) batch per file in a single pass (for evaluation and prediction)."""
    
    for file_path in file_path_list:
        X, y = TransformationManager.retrieve_pair_array(pathlib.Path(file_path))
        yield X, y
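
The two generators differ only in whether they stop. A looping generator never raises `StopIteration`, so the consumer has to be told how many batches make up one epoch, while a finite generator ends after one pass over the files. The sketch below illustrates this with placeholder strings instead of real `(X, y)` arrays; the function names and file names are assumptions.

```python
import itertools


def finite_batches(paths):
    """Yield one placeholder batch per path, then stop."""
    for path in paths:
        yield f"batch from {path}"


def looping_batches(paths):
    """Yield placeholder batches forever, restarting the list after each pass."""
    while True:
        for path in paths:
            yield f"batch from {path}"


paths = ["a.parquet", "b.parquet", "c.parquet"]

# A finite generator is exhausted after one pass over the files.
one_pass = list(finite_batches(paths))

# A looping generator must be cut off explicitly: here, two epochs of
# len(paths) steps each, mirroring steps_per_epoch * epochs in training.
num_epochs = 2
steps_per_epoch = len(paths)
consumed = list(itertools.islice(looping_batches(paths), num_epochs * steps_per_epoch))
print(len(one_pass), len(consumed))
```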

Example of an X, y pair.

In [None]:
example_transformed_file_path = train_transformed_data_file_path_list[0]
X_transformed_ex, y_transformed_ex = TransformationManager.retrieve_pair_array(pathlib.Path(example_transformed_file_path))

X_transformed_ex[0], y_transformed_ex[0]

Building the model structure.

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from constants import module_constants

num_features = X_transformed_ex.shape[2]
num_outputs = module_constants.num_class_types

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(transformation_param_num_timesteps_for_window, num_features)))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_outputs, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
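
The final layer uses `softmax` and the model is compiled with `categorical_crossentropy`, so each output is a probability distribution over the classes and the loss penalizes low probability on the true class. The following dependency-free sketch works through both formulas for a three-class toy case; the logits are made up, and the `eps` guard is an assumption added to avoid `log(0)`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Return -sum(t * log(p)) for a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
probs = softmax(logits)

# The loss is small when the true class has the highest probability...
loss_right = categorical_crossentropy([1, 0, 0], probs)
# ...and large when the true class received little probability.
loss_wrong = categorical_crossentropy([0, 0, 1], probs)
print(probs, loss_right, loss_wrong)
```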

Training the model.

In [None]:
num_epochs = 15
# Number of times the generator must be called to complete one full pass over the training dataset
steps_per_epoch = len(train_transformed_data_file_path_list)


train_data_gen = data_generator_loop(train_transformed_data_file_path_list)


# Train the model
model.fit(
    train_data_gen, 
    steps_per_epoch=steps_per_epoch,
    epochs=num_epochs, 
    verbose=1
)

Check that it works with one example.

In [None]:
import numpy as np
# Sanity check on a file from the training set (not a held-out evaluation)
test_file = train_transformed_data_file_path_list[-1]
print(test_file)

Xhat, yhat = TransformationManager.retrieve_pair_array(pathlib.Path(test_file))
print(f"True value: {yhat[0]}")

Xhat0 = Xhat[0].reshape(1, transformation_param_num_timesteps_for_window, num_features)
print(f"Predicted value: {model.predict(Xhat0)}")
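
The true value is a one-hot vector and the prediction is a vector of class probabilities, so comparing them by eye is awkward. The sketch below decodes both to a class index and name. The index-to-name mapping is an assumption: it follows the order of the rows in the anomaly tables, and the notebook does not confirm that the model's outputs use the same order. The example vectors are made up.

```python
# Assumed mapping from output index to class name (order of the anomaly tables).
CLASS_NAMES = [
    "NORMAL",
    "ABRUPT_INCREASE_BSW",
    "SPURIOUS_CLOSURE_DHSV",
    "SEVERE_SLUGGING",
    "FLOW_INSTABILITY",
    "RAPID_PRODUCTIVITY_LOSS",
    "QUICK_RESTRICTION_PCK",
    "SCALING_IN_PCK",
    "HYDRATE_IN_PRODUCTION_LINE",
]


def decode(vector):
    """Return (index, name) of the largest entry in a one-hot or probability vector."""
    index = max(range(len(vector)), key=lambda i: vector[i])
    return index, CLASS_NAMES[index]


# Made-up example vectors, not output from the trained model.
true_one_hot = [0, 0, 0, 0, 0, 1, 0, 0, 0]
predicted_probs = [0.02, 0.01, 0.01, 0.03, 0.05, 0.80, 0.05, 0.02, 0.01]

true_class = decode(true_one_hot)
predicted_class = decode(predicted_probs)
print(true_class, predicted_class, true_class == predicted_class)
```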

## Validation
Here we take the test set, transform it, and then use it to validate the performance of the model.

Get the list of files used for validation.

In [None]:
test_tranformed_folder_name = split_test_dir.name

test_transformation_manager = TransformationManager(
    test_metadata.get_metadata_table(), 
    output_folder_base_name=test_tranformed_folder_name
)

test_transformation_manager.apply_transformations_to_table(
    output_parent_dir=config.DIR_PROJECT_DATA,
    sample_interval_seconds=transformation_param_sample_interval_seconds,
    num_timesteps_for_window=transformation_param_num_timesteps_for_window,
    avg_variable_mean=mean_metric_list,
    avg_variable_std_dev=std_metric_list,
)

# Get transformed files paths
test_tranformed_dataset_dir = (
    config.DIR_PROJECT_DATA / 
    (TransformationManager.TRANSFORMATION_NAME_PREFIX + test_tranformed_folder_name)
)

test_transformed_inspector = rdi.RawDataInspector(
    test_tranformed_dataset_dir,
    config.DIR_PROJECT_CACHE / "inspector_transformed_all_data_test.parquet",
    True
)
test_transformed_metadata = test_transformed_inspector.get_metadata_table()
test_transformed_file_path_list = test_transformed_metadata["path"]
test_transformed_file_path_list

In [None]:
test_data_gen = data_generator_non_loop(test_transformed_file_path_list)
num_steps = len(test_transformed_file_path_list)

model.evaluate(
    test_data_gen,
    steps=num_steps,  # one step per transformed test file
    verbose=1,
)

In [None]:
test_data_gen = data_generator_non_loop(test_transformed_file_path_list)

y_test_predictions = model.predict(
    test_data_gen,
)

(
    f"Number of predictions: {len(y_test_predictions)}", 
    f"Shape of y array: {y_test_predictions.shape}", 
    y_test_predictions[0]
)

In [None]:
test_data_gen = data_generator_non_loop(test_transformed_file_path_list)
y_test_labels = []

for X, y in test_data_gen:
    y_test_labels.append(y)

y_test_labels = np.concatenate(y_test_labels, axis=0)

(
    f"Number of labels: {len(y_test_labels)}", 
    f"Shape of y array: {y_test_labels.shape}", 
    y_test_labels[0],
)

In [None]:
from tensorflow.math import confusion_matrix

y_test_labels_1d = np.argmax(y_test_labels, axis=1)
y_test_predictions_1d = np.argmax(y_test_predictions, axis=1)

confusion_matrix(
    y_test_labels_1d,
    y_test_predictions_1d,
    num_classes=num_outputs,
)
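
To make the confusion matrix easier to interpret, the sketch below builds one by hand and derives overall accuracy and per-class recall from it. It follows the same convention as `tf.math.confusion_matrix` (rows are true labels, columns are predictions). The helper names and the toy label lists are assumptions for illustration, not results from this model.

```python
def build_confusion_matrix(true_ids, pred_ids, num_classes):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in zip(true_ids, pred_ids):
        matrix[true_id][pred_id] += 1
    return matrix


def per_class_recall(matrix):
    """Return, for each class, the fraction of its true samples predicted correctly."""
    recalls = []
    for class_id, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[class_id] / total if total else float("nan"))
    return recalls


# Toy labels for three classes (made up).
true_ids = [0, 0, 0, 1, 1, 2]
pred_ids = [0, 0, 1, 1, 1, 0]

cm = build_confusion_matrix(true_ids, pred_ids, num_classes=3)
accuracy = sum(cm[i][i] for i in range(3)) / len(true_ids)
recalls = per_class_recall(cm)
print(cm, accuracy, recalls)
```

Per-class recall is worth checking alongside accuracy because the classes are imbalanced: a model can score well overall while missing most samples of a rare class such as `SCALING_IN_PCK`.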