Algorithms to be tested that use segmentation-based techniques.

In [None]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
import pandas as pd
import numpy as np
import random

from library import lib_aws, esp, preprocess, visualization, metrics

import matplotlib.pyplot as plt
import matplotlib as mpl

In [None]:
# Options
pd.set_option('display.max_rows', 500)
mpl.rcParams['figure.figsize'] = (25,5)
mpl.rcParams['axes.grid'] = False
plt.style.use('dark_background')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Import the Data
- Import failure_df
    - Failures not to use: `['Unknown', 'Stuck Pump', 'Low Production']`
- Import main data
- Preprocess the data
    - Resample
    - Scale

In [None]:
# Import failures
path = r's3://et-oasis/failure-esp/Oasis ESP Failure Analysis.xlsx'
failure_df = pd.read_excel(path)

# basic cleaning 
failure_df['WELL_NAME'] = failure_df['WELL_NAME'].apply(preprocess.node_clean)  # clean well names
failure_df['Reason For Pull'] = failure_df['Reason For Pull'].fillna('Unknown')  # fill in nan values
failure_df = failure_df[['WELL_NAME', 'Install Date', 'Start Date', 'Fail Date', 'Run Life (Days)', 'Reason For Pull']]  # change columns if need be
failure_df.rename(columns={'WELL_NAME': 'NodeID'}, inplace=True)  # modify col name
failure_df['Reason For Pull'] = failure_df['Reason For Pull'].map({'Grounded downhole':'Grounded Downhole'}).fillna(failure_df['Reason For Pull'])  # clean up typo

# Failures to skip
fail_drop = ['Unknown', 'Stuck Pump', 'Low Production']
failure_df = failure_df[~failure_df['Reason For Pull'].isin(fail_drop)]
failure_df.reset_index(inplace=True, drop=True)

fail_wells = list(failure_df.NodeID.unique())  # List of wells that are present in failure df
failure_df.sample(5)

In [None]:
%%time
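# Import data
# Build the IN list explicitly: formatting a one-element tuple would leave a
# trailing comma ("('well',)") and break the query; single quotes are escaped
well_list = ', '.join("'{}'".format(str(w).replace("'", "''")) for w in fail_wells)
query = """
select * from m_esp_data
where "NodeID" in ({})
""".format(well_list)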

with lib_aws.PostgresRDS(db='esp-data') as engine:
    full_data = pd.read_sql(query, engine, parse_dates=['Date'])

wells_in_data = full_data.NodeID.unique()
full_data.sample(5)

In [None]:
# Quick overview
print('List of wells in failures but not in main data\n',
      *(set(fail_wells) - set(wells_in_data)), sep='\n')



# get avg sampling rate for each well
data_info = (full_data.groupby('NodeID')
                      .agg({'Date': lambda x: np.mean(x.diff()/pd.Timedelta('1 min')),
                            'MotorCurrent': 'count'})
                      .rename(columns={'Date': 'Avg Sampling in Min', 'MotorCurrent': 'DataPoints'})
                      .round(2))

print('\n------\nAvg Sampling in the entire dataset {:.2f}'.format(data_info['Avg Sampling in Min'].mean()))
print('\n\nNote: Check data_info dataframe for well specific sampling')

## Preprocessing
- Resample the data: 1 hr
- Transfer Labels
- Visualize

In [None]:
%%time
# Prepare the data for resampling
# TODO: Check which columns to drop and use for analysis
data_resampled = full_data.copy()  # work on a copy so full_data stays untouched
data_resampled.drop(columns=['OutputAmps', 'OutputVolts', 'YVib'], inplace=True)  # drop unused columns
data_resampled.set_index(['NodeID', 'Date'], inplace=True)  # move the keys into the index so dropna ignores them
data_resampled.dropna(how='all', inplace=True)  # drop rows where every remaining column is NaN
data_resampled.reset_index(inplace=True)

# Resample each well to hourly means
data_resampled = data_resampled.groupby('NodeID').resample('1H', on='Date').mean()
data_resampled.reset_index(inplace=True)
data_resampled.tail(5)

**Transferring Labels**

- Pick a `forecasting_delta` for each failure. For now we use `15 days`; change this if need be.
- Quick steps for how it is done:
    - `Start Date` to (`Fail Date` - `forecasting_delta`) --> Label as `Normal`
    - (`Fail Date` - `forecasting_delta`) to `Fail Date` --> Label as `Reason to Pull`
    - `Fail Date` to (`Fail Date` + `1 day`)  -->  Label as `Actual + Reason to Pull`
    - Label everything else as `Drop`
- This gives us a labeled dataset on which we can run our analysis and implement some splitting strategies
- Use the library function: `library.esp.label_esp_data()`
- Check the docstring for more info
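
The windowing rules above can be sketched for a single run as follows. This is only an illustration, not the implementation of `library.esp.label_esp_data()`: the helper `label_run`, the inclusive/exclusive window boundaries, and the `"Actual " + reason` label format are all assumptions.

```python
import pandas as pd


def label_run(dates, start, fail, reason, forecasting_delta="15 days"):
    """Hypothetical helper: label the timestamps of one ESP run."""
    delta = pd.Timedelta(forecasting_delta)
    one_day = pd.Timedelta("1 day")

    # Everything outside the known run defaults to Drop
    labels = pd.Series("Drop", index=dates.index, dtype=object)

    # Start Date up to (Fail Date - delta): normal operation
    labels[(dates >= start) & (dates < fail - delta)] = "Normal"
    # Forecasting window right before the failure
    labels[(dates >= fail - delta) & (dates < fail)] = reason
    # The failure itself, up to one day after Fail Date (assumed inclusive)
    labels[(dates >= fail) & (dates <= fail + one_day)] = "Actual " + reason
    return labels


# Made-up daily timestamps and failure dates, for illustration only
dates = pd.Series(pd.date_range("2020-01-01", "2020-02-05", freq="D"))
labels = label_run(dates,
                   start=pd.Timestamp("2020-01-03"),
                   fail=pd.Timestamp("2020-02-01"),
                   reason="Grounded Downhole")
print(labels.value_counts())
```

Rows before `Start Date` and more than one day after `Fail Date` stay as `Drop`, which is what the final cleanup step later removes.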

In [None]:
%%time
# Transferring Labels
esp.label_esp_data(data_resampled, 
                   failure_df, forecasting_delta='15 days',
                   verbose=0)  # Change to 1 to see the well specific code

In [None]:
data_resampled.Label.value_counts()

In [None]:
# Plot specific wells
visualization.plot_features(df=data_resampled,
                            well_name='Kaitlin Federal 5693 41-28B',
                            fail_col='Label',
                            zero_label='Normal',
                            feature_cols=['MotorCurrent', 'PIP', 'PDP', 'MotorTemperature'],
                            mov_avg=None)

**Final Cleanup**
- Drop rows with these labels:
    - `Drop`: We don't have info for these rows
    - `Actual Label`: We are building a forecasting model, so we don't need these labels


In [None]:
data = data_resampled.copy()
data = data[~data['Label'].str.contains('Drop')]  # drop rows labeled Drop
data = data[~data['Label'].str.contains('Actual')]  # drop rows labeled as actual failures
data.reset_index(inplace=True, drop=True)
data.Label.value_counts()

## Normalizing the data

Library Class: `library.preprocess.Normalization`

An important decision is how we normalize the data:

**Technique 1: Normalize the Entire Dataset**
- Fit the scaler on the entire dataset. (This includes all wells and all datapoints.)
- Once the scaler is trained, save it and reuse it whenever needed (for example, in production).
- For Dev: `library.preprocess.Normalization.full_scaling()`

**Technique 2: Well Specific Normalization**
- Build a custom scaling function that works on a well-specific basis.
- This saves the max/min values of each KPI for each well.
- While scaling, the function pulls the correct max/min to scale the data or build a scaler.
- In production, use the scaler that belongs to each group.
- For Dev: `library.preprocess.Normalization.well_specific()`
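
A minimal sketch of well-specific min-max scaling with plain pandas is shown below. It is not the code behind `library.preprocess.Normalization.well_specific()`; the helper `well_minmax_scale` and the demo values are hypothetical and only illustrate the idea of keeping one max/min pair per well and KPI.

```python
import pandas as pd


def well_minmax_scale(df, cols_norm, group_col="NodeID"):
    """Hypothetical sketch: min-max scale each column within each well."""
    grouped = df.groupby(group_col)[cols_norm]
    col_min = grouped.transform("min")  # per-well minimum, aligned to df rows
    col_max = grouped.transform("max")  # per-well maximum, aligned to df rows
    span = (col_max - col_min).replace(0, 1)  # avoid dividing by zero on constant columns

    scaled = df.copy()
    scaled[cols_norm] = (df[cols_norm] - col_min) / span

    # Per-well min/max table that could be stored and reused later
    limits = grouped.agg(["min", "max"])
    return scaled, limits


# Made-up values for two wells
demo = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B", "B"],
    "MotorCurrent": [10.0, 20.0, 30.0, 100.0, 300.0],
})
scaled, limits = well_minmax_scale(demo, ["MotorCurrent"])
print(scaled)
print(limits)
```

Each well ends up on its own 0 to 1 range, so wells that run at very different current levels become comparable, at the cost of having to look up the right limits for every well at prediction time.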

**Technique 3: Using Scaling in a Pipeline**
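
One way this could look is a scikit-learn `Pipeline` that fits the scaler on the training rows only and reuses it at prediction time. The sketch below runs on synthetic data; the choice of `MinMaxScaler` and `DecisionTreeClassifier`, and all variable names, are assumptions rather than anything this notebook has settled on.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor features (three columns on different scales)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [10.0, 100.0, 1.0]
y = (X[:, 0] > 0).astype(int)  # toy target driven by the first column

pipe = Pipeline([
    ("scale", MinMaxScaler()),  # fitted on training data only
    ("model", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

pipe.fit(X[:150], y[:150])               # scaler and model are fitted together
accuracy = pipe.score(X[150:], y[150:])  # test rows are scaled with the training min/max
print(accuracy)
```

Keeping the scaler inside the pipeline means the same fitted object can be saved and loaded as one unit, and the test data never influences the scaling limits.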


In [None]:
columns_to_normalize = ['MotorCurrent', 'Frequency', 'PIP', 'PDP', 'TubingPressure', 'CasingPressure', 'PIT', 'MotorTemperature', 'XVib']  # columns which will be normalized
columns_to_keep = ['NodeID', 'Date', 'Label']  # additional columns we need in the dataset

# Well Specific Scaler
data_scaled_well = preprocess.Normalization.well_specific(dataset=data,
                                                         cols_norm=columns_to_normalize,
                                                         cols_keep=columns_to_keep)

data_scaled_well.sample(5)

In [None]:
# Full-dataset scaling
data_scaled_full, trained_scaler = preprocess.Normalization.full_scaling(dataset=data,
                                                                         cols_norm=columns_to_normalize,
                                                                         cols_keep=columns_to_keep)

data_scaled_full.sample(5)

# Testing Algorithms

We will use tree-based algorithms on this dataset.

Shuffling:
- Method 1: Sklearn Train-Test-Split
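
A shuffled, stratified split with scikit-learn's `train_test_split` could look like the sketch below. The demo frame, the 25% test size, and the use of `stratify` are assumptions for illustration; the real split would run on the scaled dataset built above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced demo data: 16 Normal rows and 4 failure rows
demo = pd.DataFrame({
    "MotorCurrent": range(20),
    "Label": ["Normal"] * 16 + ["Grounded Downhole"] * 4,
})
X = demo[["MotorCurrent"]]
y = demo["Label"]

# stratify=y keeps the Normal/failure ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=42)

print(y_train.value_counts())
print(y_test.value_counts())
```

Note that a random shuffle puts neighboring hourly rows of the same well into both splits, so scores from this method may look better than they would on an unseen well.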
