# Notebook Info

This notebook is for developing a window-based forecasting model. The following database, schemas, and tables are used:

```
Main Database: oasis-prod

For Failures:
schema: analysis
table: failure_info

For Features:
schema: xspoc
table: xdiag

Columns used as Features:
- PPRL
- MPRL
- FluidLoadonPump
- PumpIntakePressure

```

**Notes**

The following conditions and assumptions are made:

- **NaN values are not handled**. They are dropped for training and predictions

- An equivalent window is used for predictions.
    - The window used is 15 days
    - This, however, can be optimized
    - *A last step could be to try specific windows for specific failures*

- **Failure data points are not used** for training and predictions
    - One reason is that we want to capture the trends leading up to a failure, not the failure itself
    - When a failure occurs the well is often shut down and no values are recorded. Dropping these points avoids having to impute the data
    - At first glance this should not affect the algorithm.
    - We can try including the failure data to see whether the results improve.

- Multi-Class Classification is performed (Not MultiLabel)

- Training Data
    - We use a rolling mean as a feature
    - Train test split is performed using a cut off threshold based on time.
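
As a rough illustration of the labelling assumptions above, the sketch below marks the 15 days before a failure start as the prediction window and drops the rows recorded during the failure itself. The dates, the single `Pump` failure, and the toy frame are made up for the example; it is not the notebook's `create_prediction_zones()` implementation, only a simplified picture of the idea.

```python
import pandas as pd

# Toy daily readings for one hypothetical well (dates are illustrative only)
toy = pd.DataFrame({"Date": pd.date_range("2020-01-01", periods=40, freq="D")})
toy["Label"] = "Normal"

fail_start = pd.Timestamp("2020-01-31")  # hypothetical failure start
fail_end = pd.Timestamp("2020-02-05")    # hypothetical failure end
window = pd.Timedelta("15 days")         # same window length as in the notes

# Rows in the 15 days leading up to the failure get the failure class
in_window = (toy.Date >= fail_start - window) & (toy.Date < fail_start)
toy.loc[in_window, "Label"] = "Pump"

# Rows recorded during the failure itself are dropped, not labelled
during_failure = (toy.Date >= fail_start) & (toy.Date <= fail_end)
toy = toy[~during_failure].reset_index(drop=True)

label_counts = toy.Label.value_counts().to_dict()
print(label_counts)
```

Only the window rows carry the failure class, so the model is trained on the lead-up to a failure rather than on the shutdown period.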


In [3]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import joblib
from library import lib_aws, lib_cleaning
from library.lib_metrics import MultiClassMetrics

# Importing Data

The following steps are performed:
- Import failures (analysis.failure_info)
- Import features/data (xspoc.xdiag)
- Merge the info
- Clean and modify the data depending on how we want the input features

**Note: Use `Feature Analysis.ipynb` to analyse the data and failures**

## Importing Failures

In [5]:
%%time
# Querying the entire failure info
query_failures = """
SELECT *
FROM
    analysis.failure_info
ORDER BY "NodeID";
"""

with lib_aws.PostgresRDS(db='oasis-prod', verbose=1) as engine:
    failures = pd.read_sql(query_failures, engine, parse_dates=['Failure Start Date', 'Failure End Date'])

well_list = failures.NodeID.unique().tolist()  # List of wells to use for querying the features and model training
failures.head()

Connected to oasis-prod DataBase
Connection Closed
Wall time: 6.05 s


Unnamed: 0,NodeID,Failure Start Date,Failure End Date,Failures
0,Aagvik 1-35H,2019-10-29 10:43:24,2020-01-05 11:27:10,Tubing - Body
1,Acklins 6092 12-18H,2019-12-22 13:03:58,2020-01-05 04:47:00,Polish Rod
2,Alder 6092 43-8H,2020-03-05 13:28:53,2020-03-14 09:01:01,Pump - Stuck Pump
3,Alder 6092 43-8H,2019-12-20 06:53:56,2020-01-08 08:56:36,Pump - Traveling Valve
4,Andersmadson 5201 42-24 3B,2019-08-12 14:20:29,2019-08-16 18:00:57,Pump - Plunger


In [14]:
# # Manually importing data from s3
# file_path = "s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020_ver6.xlsx"

# failures = pd.read_excel(file_path)
# # Cleaning WELL NAMES
# failures['Well'] = (failures['Well'].str.replace("#", "")  # remove #
#                                  .str.replace('\s+', ' ', regex=True)  # remove multiple spaces if present
#                                  .str.strip()  # Remove trailing whitespaces
#                                  .str.lower()  # lower all character
#                                  .str.title()  # Uppercase first letter of each word
#                                  .map(lambda x: x[0:-2] + x[-2:].upper()))

# failures.rename(columns={'Well': 'NodeID',
#                         'LAST OIL': 'Failure Start Date',
#                         'FAILURE END DATE': 'Failure End Date',
#                         'Components': 'Failures'}, inplace=True)
# failures = failures[['NodeID','Failure Start Date', 'Failure End Date', 'Failures']]
# well_list = failures.NodeID.unique().tolist()  # List of wells to use for querying the features and model training
# failures.head()

Unnamed: 0,NodeID,Failure Start Date,Failure End Date,Failures
0,Ava 5693 43-35T,2019-08-14 12:02:16,2019-08-28 11:19:14,BHA - Seat Nipple
1,Bouvardia Federal 2658 12-12H,2019-08-02 09:57:11,2019-08-24 08:31:02,BHA - Seat Nipple
2,Johnson 29-30H,2019-07-08 12:36:03,2019-07-16 21:27:22,BHA - Seat Nipple
3,Didrick 4X-27H,2020-01-30 11:53:33,2020-02-15 10:16:06,BHA - TAC
4,Langved 5393 43-10 9T2,2019-10-10 08:44:27,2019-11-11 11:58:17,BHA - TAC


In [15]:
# # Adding the data. Need to have write permissions
# lib_aws.AddData.add_data(df=failures, 
#                          db='oasis-prod',
#                          schema='analysis',
#                          table='failure_info',
#                          merge_type='replace', 
#                          index_col='NodeID')

Data replaceed on Table failure_info in time 25.82s


In [6]:
# Reducing Failure classes
failures.loc[failures.Failures.str.contains('Tubing'), 'Failures'] = 'Tubing'
failures.loc[failures.Failures.str.contains('Pump'), 'Failures'] = 'Pump'
failures.loc[failures.Failures.str.contains('Rod'), 'Failures'] = 'Rod'
failures.loc[failures.Failures.str.contains('BHA'), 'Failures'] = 'BHA'

failures.sample(5)

Unnamed: 0,NodeID,Failure Start Date,Failure End Date,Failures
313,Yeiser 5603 42-33H,2020-06-13 14:05:13,2020-06-26 10:43:41,Tubing
36,Cade 12-19HA,2019-07-17 16:32:23,2019-07-28 08:06:59,Rod
148,Jase 5892 21-30T,2019-09-14 14:52:05,2019-09-28 17:04:47,Pump
196,Logan 5601 42-35H,2019-09-07 07:17:33,2019-09-19 11:42:51,Pump
118,Harbour 5501 13-4H,2020-03-19 00:00:00,2020-03-28 09:41:04,Rod


## Importing Data

Import the data depending on how the model is to be trained and the use case:

- For Training: Only wells which have been labelled
- For Historical Predictions: The entire dataset
- For Real Time Predictions: Will depend on how the data pipeline has been set up
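
The training query in the next cell builds its `IN (...)` clause with `str.format(tuple(well_list))`. That works for several wells, but a one-element tuple is rendered by Python with a trailing comma, which is not valid SQL. A minimal sketch of a placeholder-based alternative is shown here against an in-memory SQLite table so that it runs anywhere. The helper name `build_in_query` and the toy rows are hypothetical, and the `?` placeholder style is SQLite's; a PostgreSQL driver will generally use a different placeholder syntax.

```python
import sqlite3

def build_in_query(n_values):
    """Return a SELECT with one '?' placeholder per value in the IN list."""
    if n_values < 1:
        raise ValueError("need at least one well name")
    placeholders = ", ".join("?" for _ in range(n_values))
    return (
        'SELECT "NodeID", "PPRL" FROM xdiag '
        f'WHERE "NodeID" IN ({placeholders}) '
        'ORDER BY "NodeID"'
    )

# In-memory stand-in for the real table (hypothetical rows)
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE xdiag ("NodeID" TEXT, "PPRL" REAL)')
conn.executemany(
    "INSERT INTO xdiag VALUES (?, ?)",
    [("Well A", 1.0), ("Well B", 2.0), ("Well C", 3.0)],
)

wells = ["Well B"]  # a single well is the case that breaks str.format(tuple(...))
rows = conn.execute(build_in_query(len(wells)), wells).fetchall()
print(rows)
conn.close()
```

Passing the well names as parameters also avoids quoting problems with names that contain apostrophes.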


In [7]:
%%time
# for well_list use only those wells which have been labeled
data_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
WHERE "NodeID" in {}
ORDER BY "NodeID","Date"
""".format(tuple(well_list))

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data = pd.read_sql(data_query, engine, parse_dates=['Date'])
    
well_list_features = data.NodeID.unique()
data.head()

Wall time: 1min 10s


Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


In [8]:
# Use only those wells in failure_info, present in data
failure_info = failures[failures.NodeID.isin(well_list_features)]
failure_info.reset_index(inplace=True, drop=True)

# info
display(data.head())
print("Failure info in these in these wells")
display(failure_info)
print("Failure Label Distribution")
display(failure_info.Failures.value_counts())

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


Failure info in these in these wells


Unnamed: 0,NodeID,Failure Start Date,Failure End Date,Failures
0,Aagvik 1-35H,2019-10-29 10:43:24,2020-01-05 11:27:10,Tubing
1,Anderson 7-18H,2019-10-03 08:54:20,2019-10-26 09:00:34,Rod
2,Andre 5501 14-5 3B,2020-03-06 07:19:20,2020-03-17 09:47:20,Pump
3,Andrea 5502 44-7T,2019-09-23 15:22:28,2019-11-10 09:54:56,Pump
4,Anvers Federal 5602 13-18H,2020-04-01 14:11:53,2020-05-07 06:20:47,Tubing
...,...,...,...,...
237,White 6-7H,2020-01-14 08:20:02,2020-02-10 16:50:47,Tubing
238,Wilson Federal 1X-20H,2019-12-19 14:55:38,2020-12-31 13:57:49,Tubing
239,Woll 6093 12-1T,2020-02-10 10:49:37,2020-02-23 13:14:07,Tubing
240,Yeiser 5603 42-33H,2020-06-13 14:05:13,2020-06-26 10:43:41,Tubing


Failure Label Distribution


Pump      96
Rod       71
Tubing    70
BHA        5
Name: Failures, dtype: int64

## Merging Info

In [9]:
"""
Before analysing the data we need to merge the information.
Transferring info from failures to data (a copy of the features).
Uses a for loop, which may not be very efficient.
"""

def fill_null(df, chk_col='PPRL', well_col='NodeID', time_col='Date'):
    """
    This function will fill in Null Values on those dates where no datapoints are present
    Helps Show failures where no data was present
    Will have to take this into account when running analysis 
    """
    data_temp = df.copy()
    # Set time col as index if it is not
    if time_col in data_temp.columns:
        data_temp.set_index(time_col, inplace=True)
    
    data_gp = data_temp.groupby(well_col).resample('1D').max()  # Groupby wellname and resample to Day freq
    data_gp.drop(columns=[well_col], inplace=True)  # Drop these columns as they are present in the index
    data_gp.reset_index(inplace=True)  # Get well_col and time_col back from the index
    data_null = data_gp[data_gp.loc[:, chk_col].isnull()]  # Get all null values, which need to be added to the main data file
    data_null.reset_index(inplace=True, drop=True)
    data_temp.reset_index(inplace=True)  # get timestamp back in the column for concating
    data_full = pd.concat([data_temp, data_null], axis=0, ignore_index=True)  # concat null and og files
    data_full.sort_values(by=[well_col, time_col], inplace=True)
    data_full.drop_duplicates(subset=[well_col, time_col], inplace=True)
    data_full.reset_index(drop=True, inplace=True)
    
    return data_full

# TODO: transfer_cols only works for multiple columns, get it to work with 1 column
def failure_merge(df, failure_df, transfer_cols):
    """
    Merges the failure info into the feature data
    :param df: dataframe to which info is being transferred. (Should have columns "NodeID" and "Date")
    :param failure_df: Failure info data (Should have columns "NodeID", "Failure Start Date" and "Failure End Date")
    :param transfer_cols: List of columns which need to be transferred
    """
    merged = df.copy()  
    for col in transfer_cols:
        merged[col] = 'Normal'  # for now putting everything as normal (even NAN's)
        
    for i in failure_df.index:
        well = failure_df.loc[i, 'NodeID']
        t_start = failure_df.loc[i, 'Failure Start Date']
        t_end = failure_df.loc[i, 'Failure End Date'] + pd.Timedelta('1 day')  # As we have day based frequency (the times in a day are considered as 00:00:00)
        bool_ = (merged.NodeID == well) & (merged.Date >= t_start) & (merged.Date <= t_end)  # Boolean mask for main data
        merged.loc[bool_, transfer_cols] = failure_df.loc[i, transfer_cols].values
        
    return merged

In [10]:
%%time
# Filling NAN's where data was missing
# Wont be useful while training, but useful for visualization
data = fill_null(data)  

# Transfer the failure columns from failure_info to data
transfer_col = ['Failures', 'Failure Start Date']
data = failure_merge(data, failure_info, transfer_col)
data.drop(columns='Failure Start Date', inplace=True)

data.head()

Wall time: 25.7 s


Unnamed: 0,Date,NodeID,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Failures
0,2019-06-21 15:58:34,Aagvik 1-35H,27639.0,16811.0,3280.0,,Normal
1,2019-06-21 16:25:36,Aagvik 1-35H,27457.0,16752.0,3241.0,,Normal
2,2019-06-21 18:25:16,Aagvik 1-35H,27448.0,16594.0,3330.0,,Normal
3,2019-06-21 18:28:10,Aagvik 1-35H,27424.0,16595.0,3327.0,,Normal
4,2019-06-21 20:25:01,Aagvik 1-35H,27662.0,16711.0,3341.0,,Normal
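
The docstring of the merge cell notes that the row-by-row loop in `failure_merge()` may not be efficient. One possible vectorized alternative, sketched here on toy data, joins readings and failures on `NodeID` and then keeps the pairs whose date falls inside the failure interval. The frames and values are made up. The intermediate join holds one row per reading and failure of the same well, so it trades memory for speed.

```python
import pandas as pd

# Toy inputs with the same column names as the notebook (values are made up)
readings = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-20",
                            "2020-01-03", "2020-01-04"]),
})
fails = pd.DataFrame({
    "NodeID": ["A"],
    "Failure Start Date": pd.to_datetime(["2020-01-04"]),
    "Failure End Date": pd.to_datetime(["2020-01-10"]),
    "Failures": ["Pump"],
})

# Pair every reading with every failure of the same well, keeping the row id
pairs = readings.reset_index().merge(fails, on="NodeID", how="inner")
inside = (pairs["Date"] >= pairs["Failure Start Date"]) & (
    pairs["Date"] <= pairs["Failure End Date"] + pd.Timedelta("1 day")
)
# If failures overlap, keep the last match per reading (as the loop would)
hit = pairs[inside].drop_duplicates("index", keep="last")

# Default everything to 'Normal', then overwrite the rows inside a failure
labelled = readings.copy()
labelled["Failures"] = "Normal"
labelled.loc[hit["index"].to_numpy(), "Failures"] = hit["Failures"].to_numpy()
print(labelled)
```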


## Feature Engg

Depending on which features and labels we want to use for our model, we can use the following functions:

`get_agg()`:

    - For now only gives us moving averages
    - Can modify it to give other aggregate functions like standard deviation
    
`create_prediction_zones()`:
    
    - Will create new classes depending on what windows we choose for failures
  
**Note: Both these functions return separate dataframes/series, which have to be merged accordingly**
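
As a sketch of how the aggregation could be extended beyond the moving average, the example below computes a time-based rolling mean and standard deviation per well with `groupby(...).rolling(...)`. The toy values and the short `2D` window are chosen only so the result is easy to check by hand; this is an assumption about one possible approach, not a drop-in replacement for `get_agg()`.

```python
import pandas as pd

# Toy readings for two hypothetical wells (values are made up)
toy = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05",
                            "2020-01-01", "2020-01-02"]),
    "PPRL": [10.0, 20.0, 30.0, 100.0, 300.0],
})

freq = "2D"  # short window so the toy output is easy to verify

# Time-based rolling window per well; rows must be sorted by date within a well
roller = (
    toy.sort_values(["NodeID", "Date"])
       .set_index("Date")
       .groupby("NodeID")["PPRL"]
       .rolling(freq)
)

rolled = pd.concat(
    [
        roller.mean().rename(freq + "_PPRL_mean"),
        roller.std().rename(freq + "_PPRL_std"),
    ],
    axis=1,
).reset_index()
print(rolled)
```

A window that contains a single reading has no defined sample standard deviation, so those rows come out as NaN and would be dropped along with the other NaN rows.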


In [11]:
"""
Helper Functions
"""

def get_agg(df, freq, time_col='Date', well_col = 'NodeID'):
    
    frames = []
    
    for well in df[well_col].unique():
        temp_df = df[df[well_col] == well].copy()
        temp_df.set_index(time_col, inplace=True)
        temp_df = temp_df.rolling(freq).mean()
        temp_df = temp_df.add_prefix(freq+'_')
        temp_df[well_col] = well
        temp_df.reset_index(inplace=True)
        frames.append(temp_df)
        
    rolled_df = pd.concat(frames)
    rolled_df.reset_index(inplace=True, drop=True)
    
    return rolled_df


def create_prediction_zones(df, fail_col, prediction_zone_dict):
    """
    Depending on the prediction_zone_dict will create predictions zones for failures 
    in the Failure column.
    :param df: The dataframe to extract it from
    :param fail_col: Failure column to use from the dataframe
    :param prediction_zone_dict: A dict with timedeltas for each type of Failure in fail_col
    :return Will return a Series or an Array of these Prediction Zones
    """
    
    test_data = df[['NodeID', 'Date', fail_col]].copy()
    fail_zones = test_data[fail_col].copy()  # initialize fail_zones as an explicit copy of the fail col
    
    # Getting start of predictions from fail col
    fail_dates = test_data[test_data[fail_col] != 'Normal']  # everything other than normal is considered as a prediction
    fail_start = fail_dates[fail_dates.Date.diff().abs().fillna(pd.Timedelta('10D')) > pd.Timedelta('1d 12H')]
    fail_start.reset_index(inplace=True, drop=True)
    
    # Adding zones by iterating over each prediction start date
    for i in fail_start.index:
        temp_well = fail_start.loc[i, 'NodeID']  # well name
        zone_end_date = fail_start.loc[i, 'Date']  # prediction start date
        fail = fail_start.loc[i, fail_col]  # actual prediction class
        zone_delta = pd.Timedelta(prediction_zone_dict[fail])  # delta to subtract from the dictionary
        zone_start_date = zone_end_date - zone_delta

        bool_ = (test_data.NodeID == temp_well) & (test_data.Date < zone_end_date) & (test_data.Date >= zone_start_date)
        fail_zones[bool_] = 'fz_' + fail
        
    return fail_zones

Say we want to use rolling averages over a 15 day window for our features and a constant 15 day window for our failures. The next few cells show how this is done.

In [12]:
%%time
# 15 day rolling averages
avg_data = get_agg(df=data, freq='15D')  

# Merge it with the original data 
full_data = data.set_index(['NodeID', 'Date']).merge(avg_data.set_index(['NodeID', 'Date']), 
                                                     left_index=True,
                                                     right_index=True).reset_index()


full_data.head()

Wall time: 10.1 s


Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Failures,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,,Normal,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,,Normal,27548.0,16781.5,3260.5,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,,Normal,27514.666667,16719.0,3283.666667,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,,Normal,27492.0,16688.0,3294.5,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,,Normal,27526.0,16692.6,3303.8,


In [13]:
# Drop Columns that we dont need
# Can play around with features here
# test for different configs and moving windows
cols_drop = [
    'PPRL',
    'MPRL',
    'PumpIntakePressure',
    'FluidLoadonPump',
]

full_data.drop(columns=cols_drop, inplace=True)

full_data.head()

Unnamed: 0,NodeID,Date,Failures,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,Normal,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,Normal,27548.0,16781.5,3260.5,
2,Aagvik 1-35H,2019-06-21 18:25:16,Normal,27514.666667,16719.0,3283.666667,
3,Aagvik 1-35H,2019-06-21 18:28:10,Normal,27492.0,16688.0,3294.5,
4,Aagvik 1-35H,2019-06-21 20:25:01,Normal,27526.0,16692.6,3303.8,


In [14]:
full_data.isnull().sum(axis=0)/len(full_data) * 100

NodeID                    0.000000
Date                      0.000000
Failures                  0.000000
15D_PPRL                  2.652931
15D_MPRL                  2.652931
15D_FluidLoadonPump       2.652931
15D_PumpIntakePressure    3.206547
dtype: float64

In [15]:
# pred zone dict
# manual
# pred_zone_dict = {
#     'PUMP': '15 days',
#     'ROD': '15 days',
#     'TUBING': '15 days',
#     'BHA': '15 days'
# }

# automated all failures will have the same failure window
fail_window = '15 days'
fail_labels = full_data.Failures.unique().tolist()
fail_labels.remove('Normal')
pred_zone_dict = {x: fail_window for x in fail_labels}

In [16]:
# Create pred windows
# Note: The output of the function will be a pandas Series
full_data['Label'] = create_prediction_zones(df=full_data, 
                                             fail_col='Failures', 
                                             prediction_zone_dict=pred_zone_dict)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Some assumptions we are going to make:
- Drop NaN values
- Remove the actual failures as classes and only use windows

In [17]:
full_data.dropna(inplace=True)

class_drop = fail_labels  # raw failure classes, i.e. the rows recorded during an actual failure
full_data = full_data[~full_data.Label.isin(class_drop)]
full_data.Label = full_data.Label.str.replace('fz_', '').str.strip()  # keeping the name of the zone classes the same as the original one
full_data.reset_index(inplace=True, drop=True)

In [19]:
full_data.head()

Unnamed: 0,NodeID,Date,Failures,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure,Label
0,Aagvik 1-35H,2019-08-13 11:19:56,Normal,28599.0,16903.666667,6386.666667,608.0,Normal
1,Aagvik 1-35H,2019-08-14 00:00:00,Normal,28599.0,16903.666667,6386.666667,608.0,Normal
2,Aagvik 1-35H,2019-08-15 06:04:48,Normal,31613.0,17062.5,9915.0,611.0,Normal
3,Aagvik 1-35H,2019-08-15 07:53:36,Normal,31615.333333,17012.666667,10011.0,564.666667,Normal
4,Aagvik 1-35H,2019-08-15 10:02:31,Normal,31594.5,17021.25,9975.25,576.5,Normal


In [None]:
"""
Setting up data for algos
"""


    

In [20]:
X = full_data.drop(columns=['NodeID', 'Date', 'Failures', 'Label'])
Y = full_data.Label

print("Features")
display(X.head())

print("Classes Being Predicted")
display(Y.value_counts())

Features


Unnamed: 0,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure
0,28599.0,16903.666667,6386.666667,608.0
1,28599.0,16903.666667,6386.666667,608.0
2,31613.0,17062.5,9915.0,611.0
3,31615.333333,17012.666667,10011.0,564.666667
4,31594.5,17021.25,9975.25,576.5


Classes Being Predicted


Normal    790865
Pump       11856
Rod         9738
Tubing      9019
BHA          585
Name: Label, dtype: int64

# Testing Algos

In [21]:
# Imports
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [22]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier model
    :return: RFC Model
    """
    scaler = StandardScaler() # Define Scaler
 
    # RFC Params
    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'auto',  # same as 'sqrt' for classifiers; 'auto' was removed in scikit-learn 1.3, use 'sqrt' there
        'max_depth': None,
    }
    
    # RFC Classifier
    rfc = RandomForestClassifier(**rfc_params)
    
    # Chain the scaler and the classifier into one pipeline
    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [23]:
# define the model
rfc_model = build_rfc_model()

# Get some metrics
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

# cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
# print("CV Metrics")
# display(cv_rfc)

kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

Weighted Metrics
Precision : 99.67
Recall: 99.67
F-score: 99.67

Macro Metrics
Precision : 98.21
Recall: 94.12
F-score: 96.10

Classification Report
              precision    recall  f1-score   support

         BHA       0.95      0.93      0.94       175
      Normal       1.00      1.00      1.00    237260
        Pump       0.99      0.92      0.95      3557
         Rod       0.99      0.94      0.96      2921
      Tubing       0.99      0.93      0.96      2706

    accuracy                           1.00    246619
   macro avg       0.98      0.94      0.96    246619
weighted avg       1.00      1.00      1.00    246619

Kfold Metrics


Unnamed: 0,Precision_wt,Recall_wt,F-score_wt,Precision_macro,Recall_macro,F-score_macro
0,99.690002,99.690002,99.68,98.830002,94.07,96.380005
1,99.709999,99.709999,99.709999,99.029999,94.129997,96.480003
2,99.690002,99.699997,99.690002,99.18,94.709999,96.879997
3,99.690002,99.690002,99.690002,99.010002,94.07,96.459999
4,99.720001,99.730003,99.720001,99.169998,95.850006,97.459999
Mean,99.700002,99.704001,99.698001,99.044,94.566,96.732001
STD,0.012648,0.014967,0.014697,0.127686,0.68579,0.403305


In [24]:
"""
Model 2
Extra Trees Classifier
"""
def build_et_model():

    scaler = StandardScaler()

    et_params ={ 
        'n_estimators': 100,  # Default
        'class_weight': 'balanced',
        'criterion': 'gini',  # Default
        'max_features': 'auto',  # Default at the time; same as 'sqrt' for classifiers ('auto' was removed in scikit-learn 1.3)
        'verbose': 0
    }

    et = ExtraTreesClassifier(**et_params)

    model = Pipeline([
        ('scaler', scaler),
        ('et', et)
    ])

    return model

In [25]:
# define the model
et_model = build_et_model()

# Get some metrics
MultiClassMetrics.baseline_metrics(X, Y, et_model)

kf_et = MultiClassMetrics.kfold_validation(X, Y, et_model)
print("Kfold Metrics")
display(kf_et)

Weighted Metrics
Precision : 99.80
Recall: 99.80
F-score: 99.80

Macro Metrics
Precision : 99.09
Recall: 96.81
F-score: 97.93

Classification Report
              precision    recall  f1-score   support

         BHA       0.99      0.97      0.98       175
      Normal       1.00      1.00      1.00    237260
        Pump       0.99      0.95      0.97      3557
         Rod       0.99      0.96      0.98      2921
      Tubing       0.99      0.96      0.97      2706

    accuracy                           1.00    246619
   macro avg       0.99      0.97      0.98    246619
weighted avg       1.00      1.00      1.00    246619

Kfold Metrics


Unnamed: 0,Precision_wt,Recall_wt,F-score_wt,Precision_macro,Recall_macro,F-score_macro
0,99.82,99.82,99.82,99.360001,96.350006,97.82
1,99.82,99.830002,99.82,99.419998,97.43,98.409996
2,99.830002,99.830002,99.830002,99.479996,97.659996,98.559998
3,99.830002,99.830002,99.830002,99.559998,96.270004,97.869995
4,99.830002,99.830002,99.830002,99.18,96.979996,98.059998
Mean,99.826001,99.828001,99.826001,99.399998,96.938,98.143997
STD,0.0049,0.004001,0.0049,0.128373,0.558043,0.293571


In [23]:
# cv_rfc = MultiClassMetrics.cv_validation(X, Y, et_model)
# print("CV Metrics")
# display(cv_rfc)

CV Metrics


Unnamed: 0,F-Score_wt,Precision_wt,Recall_wt,F-Score_macro,Precision_macro,Recall_macro
0,92.99,92.47,94.31,14.41,43.85,16.95
1,92.07,92.44,92.19,23.21,38.2,26.22
2,92.85,92.92,93.99,20.07,49.72,22.03
Mean,92.636667,92.61,93.496667,19.23,43.923333,21.733333
STD,0.40475,0.219545,0.933143,3.641355,4.703306,3.790271


In [29]:
"""
Model 3
Gradient Boosted Classifier
"""

def build_gbc_model():
    # Define the model pipeline
    scaler = StandardScaler()

    baseline_param = {
        'n_estimators': 10,  # scikit-learn's default is 100; kept small here
        'loss': 'deviance',  # renamed to 'log_loss' in scikit-learn 1.1
        'learning_rate': 0.1,  # Can tune this
        'criterion': 'friedman_mse',
        'verbose': 1
    }

    gbc = GradientBoostingClassifier(**baseline_param)

    model = Pipeline([
        ('scaler', scaler),
        ('gbc', gbc)
    ])

    return model

In [31]:
# # define the model
# gbc_model = build_gbc_model()

# # Get some metrics
# MultiClassMetrics.baseline_metrics(X, Y, gbc_model)

# kf_gbc = MultiClassMetrics.kfold_validation(X, Y, gbc_model)
# print("Kfold Metrics")
# display(kf_gbc)

### Custom Time series split

In [38]:
from sklearn.metrics import jaccard_score, precision_recall_fscore_support, f1_score, recall_score, precision_score, \
    hamming_loss, classification_report

In [28]:
def time_split(dataset, pct, date_col='Date', group_col='NodeID'):
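    """
    Splits the dataset into train and test taking time into consideration.
    For each group, the first `pct` fraction of the rows (in date order) goes to train and the rest to test.
    """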
    train = []
    test = []
    
    # iterate over each group
    for gp in dataset[group_col].unique():
        temp = dataset[dataset[group_col] == gp].copy() # get gp data
        temp.reset_index(inplace=True, drop=True)
        temp.sort_values(by=[date_col], inplace=True)  # sort by date
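        n_train = int(pct * len(temp))  # number of leading rows (in date order) for train
        tr, te = temp.iloc[:n_train], temp.iloc[n_train:]  # positional split; avoids calling np.split on a DataFrame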
        train.append(tr)
        test.append(te)
    
    # combine all groups
    train = pd.concat(train) 
    test = pd.concat(test)
    return train, test


In [34]:
train_data, test_data = time_split(full_data, 0.6)

print(f'Length of Train samples:\t{len(train_data)}\nLength of Test samples:\t{len(test_data)}')
print(f'\nLabels Split in train and test')
_ = pd.concat([train_data.Failures.value_counts(), test_data.Failures.value_counts()], axis=1, keys=['Train', 'Test']).fillna(0)
display(_)

Length of Train samples:	493156
Length of Validation samples:	328907

Labels Split in train and valid


Unnamed: 0,Train,Valid
Normal,491656,328679.0
Pump,534,61.0
Tubing,487,126.0
Rod,465,41.0
BHA,14,0.0
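
`time_split()` cuts each well at a fixed fraction of its own rows, so the boundary date differs from well to well. If a single calendar cut-off is wanted instead, a minimal sketch could look like the following. The helper name `cutoff_split`, the cut-off date, and the toy rows are hypothetical.

```python
import pandas as pd

def cutoff_split(dataset, cutoff, date_col="Date"):
    """Rows strictly before `cutoff` go to train, the rest to test."""
    cutoff = pd.Timestamp(cutoff)
    is_train = dataset[date_col] < cutoff
    return dataset[is_train].copy(), dataset[~is_train].copy()

# Toy rows for two hypothetical wells
toy = pd.DataFrame({
    "NodeID": ["A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-03-01", "2019-12-15",
                            "2020-02-10", "2020-04-01"]),
})

train, test = cutoff_split(toy, "2020-02-01")
print(len(train), len(test))
```

With a global cut-off, every test row is later than every train row, regardless of the well it belongs to.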


In [35]:
"""
getting features and labels
"""
x_train = train_data.drop(columns=['NodeID', 'Date', 'Failures', 'Label'])
y_train = train_data.Label

x_test = test_data.drop(columns=['NodeID', 'Date', 'Failures', 'Label'])
y_test = test_data.Label

In [37]:
print("Train Features")
display(x_train.head())
print("Train Classes")
display(y_train.value_counts())

print("Test Features")
display(x_test.head())
print("Test Classes")
display(y_test.value_counts())

Features


Unnamed: 0,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure
0,28599.0,16903.666667,6386.666667,608.0
1,28599.0,16903.666667,6386.666667,608.0
2,31613.0,17062.5,9915.0,611.0
3,31615.333333,17012.666667,10011.0,564.666667
4,31594.5,17021.25,9975.25,576.5


Classes Being Predicted


Normal    466639
Pump       10650
Rod         8516
Tubing      6766
BHA          585
Name: Label, dtype: int64

Features


Unnamed: 0,15D_PPRL,15D_MPRL,15D_FluidLoadonPump,15D_PumpIntakePressure
1628,31774.180328,15546.52459,7556.295082,585.372881
1629,31787.758065,15537.983871,7550.709677,587.7
1630,31792.253968,15525.809524,7564.333333,581.786885
1631,31791.546875,15528.71875,7560.8125,583.225806
1632,31802.892308,15521.046154,7557.492308,584.587302


Classes Being Predicted


Normal    324226
Tubing      2253
Rod         1222
Pump        1206
Name: Label, dtype: int64

In [41]:
# define the model: random forest pipeline, evaluated on the time-based split
rfc_ts_model = build_rfc_model()

rfc_ts_model.fit(x_train, y_train)

y_pred = rfc_ts_model.predict(x_test)

sc = np.array(precision_recall_fscore_support(y_test, y_pred, average='weighted'), dtype=np.float32).round(
    4) * 100
print('Weighted Metrics')
print('Precision : {:.2f}\nRecall: {:.2f}\nF-score: {:.2f}'.format(sc[0], sc[1], sc[2]))

print('\nMacro Metrics')
sc_macro = np.array(precision_recall_fscore_support(y_test, y_pred, average='macro'), dtype=np.float32).round(
    4) * 100
print('Precision : {:.2f}\nRecall: {:.2f}\nF-score: {:.2f}'.format(sc_macro[0], sc_macro[1], sc_macro[2]))

print("\nClassification Report")
print(classification_report(y_test, y_pred))



Weighted Metrics
Precision : 97.27
Recall: 97.49
F-score: 97.38

Macro Metrics
Precision : 22.45
Recall: 21.81
F-score: 22.06

Classification Report
              precision    recall  f1-score   support

         BHA       0.00      0.00      0.00         0
      Normal       0.99      0.99      0.99    324226
        Pump       0.00      0.00      0.00      1206
         Rod       0.10      0.06      0.08      1222
      Tubing       0.04      0.04      0.04      2253

    accuracy                           0.97    328907
   macro avg       0.22      0.22      0.22    328907
weighted avg       0.97      0.97      0.97    328907
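
The gap between the weighted scores (about 97) and the macro scores (about 22) follows from how the two averages are formed. The sketch below recomputes both for precision from the per-class values in the classification report above. Because the report rounds to two decimals, the results only approximate the printed 97.27 and 22.45.

```python
# Per-class precision and support, copied from the classification report above
precision = {"BHA": 0.00, "Normal": 0.99, "Pump": 0.00, "Rod": 0.10, "Tubing": 0.04}
support = {"BHA": 0, "Normal": 324226, "Pump": 1206, "Rod": 1222, "Tubing": 2253}

# Macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro precision:    {macro:.3f}")
print(f"weighted precision: {weighted:.3f}")
```

The `Normal` class makes up almost all of the test rows, so it dominates the weighted average, while the macro average exposes that the failure classes are barely detected on the time-based split.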



# Training and Saving the model

Using the `lib_aws.S3` class, we can serialize and save a trained model.

In [32]:
model_train = build_et_model() # Define a model

In [33]:
%%time
model_train.fit(X, Y)  # fit the model

Wall time: 25.4 s


Pipeline(steps=[('scaler', StandardScaler()),
                ('et', ExtraTreesClassifier(class_weight='balanced'))])

In [34]:
%%time
# saving the trained model in s3
s3 = lib_aws.S3(bucket='et-oasis')
model_name = 'production-models/rod-pumps/test_et_model.pkl'
s3.save_model(obj=model_train, name=model_name)

# Saving the model locally
# Save the model as a pickle in a file 
# joblib.dump(model_train, 'model_rfc.pkl') 

Model Updated successfully
Wall time: 41min


# Making Predictions

When making predictions, build the features the same way as for training (same columns and the same rolling window).
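
One way to catch a mismatch between training and prediction features early is a small guard that compares column names before calling `predict`. The helper `check_feature_columns` below is hypothetical and not part of the project library. The training column names are the ones shown for `X` earlier in the notebook.

```python
def check_feature_columns(expected, actual):
    """Raise if the prediction features differ from the training features."""
    expected, actual = list(expected), list(actual)
    if expected != actual:
        missing = [c for c in expected if c not in actual]
        extra = [c for c in actual if c not in expected]
        raise ValueError(
            f"feature mismatch: missing={missing}, unexpected={extra}"
        )

train_cols = ["15D_PPRL", "15D_MPRL", "15D_FluidLoadonPump", "15D_PumpIntakePressure"]

check_feature_columns(train_cols, train_cols)  # same columns, same order: passes

try:
    # Features built with a different rolling window carry a different prefix
    check_feature_columns(train_cols, [c.replace("15D", "7D") for c in train_cols])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

In a notebook this could be called as `check_feature_columns(X.columns, x_pred.columns)` just before the prediction step.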

In [62]:
%%time
# To mimic actual working in production query the entire data again
# Find moving averages
# drop nan values 
# make predictions
full_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
ORDER BY "NodeID","Date"
"""

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data = pd.read_sql(full_query, engine, parse_dates=['Date'])
    
data.head()

Wall time: 7min 19s


Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


In [63]:
%%time
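# Feature Engg
# Will find moving averages
# Drop NAN's
# 7 day rolling averages
# NOTE: the training cells above use a 15 day window (freq='15D'); the window here should match the one used for training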
avg_data = get_agg(df=data, freq='7D')  

# Merge it with the original data 
# and use only those columns which will be of use
# While working with large datasets, try to minimize the number of dataframe copies you create
# May not even have to merge it

full_data = data.set_index(['NodeID', 'Date']).merge(avg_data.set_index(['NodeID', 'Date']), 
                                                     left_index=True,
                                                     right_index=True).reset_index()

# Drop Columns that we dont need
cols_drop = [
    'PPRL',
    'MPRL',
    'PumpIntakePressure',
    'FluidLoadonPump',
]

full_data.drop(columns=cols_drop, inplace=True)
full_data.dropna(inplace=True)
full_data.reset_index(inplace=True, drop=True)
full_data.head()

Wall time: 2min 49s


Unnamed: 0,NodeID,Date,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure
0,Aagvik 1-35H,2019-08-13 11:19:56,31488.0,17075.0,9968.0,608.0
1,Aagvik 1-35H,2019-08-15 06:04:48,31613.0,17062.5,9915.0,611.0
2,Aagvik 1-35H,2019-08-15 07:53:36,31615.333333,17012.666667,10011.0,564.666667
3,Aagvik 1-35H,2019-08-15 10:02:31,31594.5,17021.25,9975.25,576.5
4,Aagvik 1-35H,2019-08-15 12:14:05,31621.4,16997.8,9907.0,603.0


In [60]:
%%time
# #load the model
# s3 = lib_aws.S3(bucket='et-oasis')
# model_name = '/algo/failure-forecasting/rfc_model_v2.pkl'
# imported_model = s3.import_model(model_name)

# Load model Locally
# imported_model = joblib.load('model_rfc.pkl') 

# If training done in the same notebook instance use that itself
imported_model = model_train

Wall time: 0 ns


In [78]:
%%time
x_pred = full_data.drop(columns=['NodeID', 'Date'])
predictions = imported_model.predict(x_pred)  # Get Label Predictions
probabilities = imported_model.predict_proba(x_pred)  # Get Probabilities

Wall time: 1min 36s


In [79]:
%%time
# Creating final pred_df
classes_predicted = imported_model[1].classes_  # Trained rfc part for making predictions
pred_df = pd.concat([
    full_data[["NodeID", "Date"]], # NodeID and Date for indexing
    pd.DataFrame(np.round(probabilities * 100, 2), columns=classes_predicted),  # probabilities for each class
    pd.DataFrame(predictions, columns=['Prediction'])  # Actual Predictions
], axis=1)

Wall time: 1.22 s


In [87]:
rem_arr = np.array(["NodeID", "Date", "Normal", "Prediction"])
prob_cols = np.setdiff1d(pred_df.columns, rem_arr)
pred_df['FailureProb'] = pred_df[prob_cols].max(axis=1)

# Dropping class based probabilities
pred_df.drop(columns = prob_cols, inplace=True)

In [88]:
pred_df.head()

Unnamed: 0,NodeID,Date,Normal,Prediction,FailureProb
0,Aagvik 1-35H,2019-08-13 11:19:56,100.0,Normal,0.0
1,Aagvik 1-35H,2019-08-15 06:04:48,100.0,Normal,0.0
2,Aagvik 1-35H,2019-08-15 07:53:36,100.0,Normal,0.0
3,Aagvik 1-35H,2019-08-15 10:02:31,100.0,Normal,0.0
4,Aagvik 1-35H,2019-08-15 12:14:05,100.0,Normal,0.0
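
To make the `FailureProb` step above concrete, here is a minimal sketch on made-up probabilities. The class names and values are assumptions for illustration only; the `TopFailure` column is an extra that the notebook does not compute.

```python
import numpy as np
import pandas as pd

# Hypothetical class order and predict_proba output for two rows
classes = np.array(["BHA", "Normal", "PUMP", "ROD"])
proba = np.array([
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.30, 0.50, 0.10],
])

prob_df = pd.DataFrame(np.round(proba * 100, 2), columns=classes)

# Every column except 'Normal' is a failure class
failure_cols = [c for c in prob_df.columns if c != "Normal"]
prob_df["FailureProb"] = prob_df[failure_cols].max(axis=1)   # largest single failure probability
prob_df["TopFailure"] = prob_df[failure_cols].idxmax(axis=1)  # which failure class it belongs to
print(prob_df)
```

Note that this definition is the largest single failure-class probability, not `100 - Normal`. In the second toy row the two differ (50 versus 70).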


In [75]:
# # Custom mapping for now only; modify the classes during training instead
# # This has already been taken care of, so this cell does not need to be run
# class_mapping = {
#     'fz_BHA': 'BHA',
#     'fz_PUMP': 'PUMP',
#     'fz_ROD': 'ROD',
#     'fz_TUBING': 'TUBING'
# }
# pred_df.Prediction = pred_df.Prediction.map(class_mapping).fillna(pred_df.Prediction)  # mapping Predictions to actual class values
# pred_df.rename(columns=class_mapping, inplace=True)  # mapping column names to actual class values


In [89]:
%%time
# Filling nulls with str columns as well
pred_df.set_index(['NodeID', 'Date'], inplace=True)  # set index

# Separate string columns and numeric columns
pred_str = pred_df.select_dtypes(include='object')
pred_num = pred_df.select_dtypes(exclude='object')

# fill nulls in num columns
pred_num.reset_index(inplace=True)
pred_num = fill_null(pred_num, chk_col='Normal', well_col='NodeID', time_col='Date')
pred_num.set_index(['NodeID', 'Date'], inplace=True)

# merge for final df
pred_df = pd.concat([pred_num, pred_str], axis=1)
pred_df.reset_index(inplace=True)

Wall time: 1min 23s


In [67]:
# pred_df.columns = pred_df.columns.str.replace('"', 'in')

In [90]:
pred_df.head()

Unnamed: 0,NodeID,Date,Normal,FailureProb,Prediction
0,Aagvik 1-35H,2019-08-13 11:19:56,100.0,0.0,Normal
1,Aagvik 1-35H,2019-08-14 00:00:00,,,
2,Aagvik 1-35H,2019-08-15 06:04:48,100.0,0.0,Normal
3,Aagvik 1-35H,2019-08-15 07:53:36,100.0,0.0,Normal
4,Aagvik 1-35H,2019-08-15 10:02:31,100.0,0.0,Normal
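
`fill_null` is defined elsewhere in the notebook. Judging only from the output above, where an all-NaN row appears at midnight on a day without readings, its effect might be approximated as follows. This is a sketch under that assumption; the helper name `add_missing_days` and the toy data are hypothetical.

```python
import pandas as pd

def add_missing_days(df, well_col="NodeID", time_col="Date"):
    """Hypothetical stand-in for fill_null: add a NaN row at midnight for each day without readings."""
    frames = []
    for well, grp in df.groupby(well_col):
        days_seen = pd.DatetimeIndex(grp[time_col].dt.normalize().unique())
        all_days = pd.date_range(days_seen.min(), days_seen.max(), freq="D")
        missing = all_days.difference(days_seen)
        frames.append(grp)
        if len(missing) > 0:
            # Only the key columns are set; every other column becomes NaN on concat
            frames.append(pd.DataFrame({well_col: well, time_col: missing}))
    out = pd.concat(frames, ignore_index=True)
    return out.sort_values([well_col, time_col]).reset_index(drop=True)

toy = pd.DataFrame({
    "NodeID": ["Well A", "Well A"],
    "Date": pd.to_datetime(["2019-08-13 11:19:56", "2019-08-15 06:04:48"]),
    "Normal": [100.0, 100.0],
})
filled = add_missing_days(toy)
print(filled)
```

Explicit NaN rows make gaps visible when the predictions are plotted, rather than having a line drawn straight across days with no data.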


In [91]:
# Adding data to DB
# Replaces the full predictions table
# Only works if INSERT/UPDATE access to the database has been granted
lib_aws.AddData.add_data(df=pred_df, 
                         db='oasis-prod', 
                         schema='xspoc',
                         table='sample_predictions', 
                         merge_type='replace', 
                         card_col=None, 
                         index_col='Date')

Data replaceed on Table sample_predictions in time 92.55s


In [92]:
# Setting up indexes
# Update index on pred table in database
with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX sample_predictions_idx ON xspoc.sample_predictions ("NodeID", "Date");""")
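
PostgreSQL refuses to create a unique index if the table already holds duplicate key pairs, so it can help to check the DataFrame before uploading. A small sketch on toy data (the frame below is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "NodeID": ["A", "A", "A"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "FailureProb": [0.0, 1.5, 0.0],
})

key = ["NodeID", "Date"]

# Rows that would violate a unique index on (NodeID, Date)
dupes = toy[toy.duplicated(subset=key, keep=False)]

# One possible policy: keep the last row per key
deduped = toy.drop_duplicates(subset=key, keep="last")
print(len(dupes), len(deduped))
```

Which duplicate to keep is a judgment call; `keep="last"` is only one option.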


## Some Additional Functions

Helpers for adding some specific tables to the database:

**Add a failure table with failures flagged as 0/1 (only used for visualization)**

**Fill in NaN values in the data**
- After querying the whole dataset
- Find the data points (indexed by Date and NodeID) where no data is present


In [72]:
# Check Failure Info
data_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL"
FROM xspoc.xdiag
WHERE "NodeID" in {}
ORDER BY "NodeID","Date"
""".format(tuple(well_list))

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    only_fails = pd.read_sql(data_query, engine, parse_dates=['Date'])
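
Formatting a Python tuple into the `IN` clause breaks when `well_list` holds a single well, and it leaves the quoting to string formatting. The sketch below shows the problem and one alternative using bound parameters. It uses an in-memory SQLite table as a stand-in for `xspoc.xdiag`, so the `?` placeholder style is SQLite's; PostgreSQL drivers such as psycopg2 use `%s` instead.

```python
import sqlite3

well_list = ["Aagvik 1-35H"]  # a single well is enough to show the problem

# A one-element tuple is rendered with a trailing comma, which is not valid SQL
formatted = 'SELECT * FROM xdiag WHERE "NodeID" IN {}'.format(tuple(well_list))
print(formatted)

# One placeholder per value keeps the SQL valid and lets the driver do the quoting
placeholders = ", ".join("?" for _ in well_list)
query = (
    'SELECT "NodeID", "PPRL" FROM xdiag '
    f'WHERE "NodeID" IN ({placeholders}) ORDER BY "NodeID"'
)

# In-memory stand-in table with toy rows
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE xdiag ("NodeID" TEXT, "PPRL" REAL)')
conn.executemany(
    "INSERT INTO xdiag VALUES (?, ?)",
    [("Aagvik 1-35H", 27639.0), ("Other Well", 1.0)],
)
rows = conn.execute(query, well_list).fetchall()
conn.close()
print(rows)
```

`pd.read_sql` accepts a `params` argument, so the same idea should carry over to the query above, with the placeholder syntax depending on the database driver in use.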

In [73]:
%%time
only_fails = fill_null(only_fails)  # Filling in NaNs where data was missing

# Transfer the failure columns from failure_info to only_fails
transfer_col = ['Components', 'Failure Start Date']
only_fails = failure_merge(only_fails, failure_info, transfer_col)
only_fails.drop(columns='Failure Start Date', inplace=True)
only_fails.drop(columns = 'PPRL', inplace=True)
only_fails.head()

Wall time: 24 s


Unnamed: 0,Date,NodeID,Components
0,2019-06-21 15:58:34,Aagvik 1-35H,Normal
1,2019-06-21 16:25:36,Aagvik 1-35H,Normal
2,2019-06-21 18:25:16,Aagvik 1-35H,Normal
3,2019-06-21 18:28:10,Aagvik 1-35H,Normal
4,2019-06-21 20:25:01,Aagvik 1-35H,Normal


In [74]:
bool_ = only_fails.Components != 'Normal'
only_fails.loc[bool_, 'BinaryFails'] = 1

In [75]:
only_fails

Unnamed: 0,Date,NodeID,Components,BinaryFails
0,2019-06-21 15:58:34,Aagvik 1-35H,Normal,
1,2019-06-21 16:25:36,Aagvik 1-35H,Normal,
2,2019-06-21 18:25:16,Aagvik 1-35H,Normal,
3,2019-06-21 18:28:10,Aagvik 1-35H,Normal,
4,2019-06-21 20:25:01,Aagvik 1-35H,Normal,
...,...,...,...,...
738435,2020-08-15 10:37:07,Zdenek 6093 42-24H,Normal,
738436,2020-08-15 11:43:01,Zdenek 6093 42-24H,Normal,
738437,2020-08-15 13:26:35,Zdenek 6093 42-24H,Normal,
738438,2020-08-15 14:32:27,Zdenek 6093 42-24H,Normal,
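
As the output shows, the cell above sets `BinaryFails` to 1 on failure rows and leaves every other row as NaN. If a strict 0/1 column is wanted, one option is to build it directly from the comparison. A sketch on toy data (the component names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"Components": ["Normal", "PUMP", "Normal", "ROD"]})

# True/False from the comparison becomes 1/0, so no NaNs are left behind
toy["BinaryFails"] = (toy["Components"] != "Normal").astype(int)
print(toy)
```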


In [76]:
# Adding data to DB
# Replaces the full only_fails table
lib_aws.AddData.add_data(df=only_fails, 
                         db='oasis-prod', 
                         schema='xspoc',
                         table='only_fails', 
                         merge_type='replace', 
                         card_col=None, 
                         index_col='Date')

Data replaceed on Table only_fails in time 62.18s


In [77]:
# Setting up indexes
# Create a unique index on the only_fails table in the database
with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX idx_only_fails_node_date ON xspoc.only_fails ("NodeID", "Date");""")


# Older Build

Older code, kept for reference.

## Importing Labeled Data

Labeled data is stored in the database `oasis-dev` in the table `clean.xspoc`

In [None]:
# Set up the query
failure_wells = ['Cade 12-19HA', 'Cook 12-13 6B', 'Helling Trust 43-22 16T3',
                'Helling Trust 44-22 5B', 'Johnsrud 5198 14-18 13T',
                'Johnsrud 5198 14-18 15TX', 'Rolfson N 5198 12-17 5T',
                'Rolfson N 5198 12-17 7T', 'Rolfson S 5198 11-29 2TX',
                'Rolfson S 5198 11-29 4T', 'Rolfson S 5198 12-29 8T',
                'Rolfson S 5198 14-29 11T', 'Stenehjem 14X-9HA']

query = """
SELECT 
    "NodeID",
    "Date",
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
WHERE
    "NodeID" in {}
ORDER BY
    "NodeID", "Date";
""".format(tuple(failure_wells))

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
data = data[data.Date < t]
data.reset_index(inplace=True, drop=True)

In [None]:
data.groupby(['NodeID']).agg({"Date": [min, max, "count"]})

In [None]:
# Modifying data for intern project

data.rename(columns={"NodeID": "WellName", "FailureBin":"BinaryLabel", "FailureLabel":"MultiLabel"}, inplace=True)
wells = data.WellName.unique()

In [None]:
new_wells = ["Well " + i for i in list('ABCDEFGHIJKLM')]
well_map = dict(zip(wells, new_wells))

data.WellName = data.WellName.map(well_map)

In [None]:
data.head()

In [None]:
data.MultiLabel.value_counts()

In [None]:
data.set_index("Date").to_csv("sample_data.csv")

In [None]:
"""
Generate Windows
"""

def window_func(df, window):
    """
    Generate MultiLabel windows
    'Normal' = Does not fail
    'Label' = Actual failure, or fails within the next `window`
    -1 = Final window of the data (filtered out later)
    :param df: DataFrame with a single well, the Timestamp col should be the index
    :param window: Window value (anything pd.Timedelta accepts, e.g. '3 days')
    :return: The same DataFrame with a 'WinLabel' column added
    """
    
    df['WinLabel'] = 'Normal'  # Initialize every row as 'Normal'
    
    # Flag the final window with -1 so it can be dropped later;
    # a failure may follow after the data ends, so 'Normal' there could be a false negative
    mask_ = df.index >= (df.index.max() - pd.Timedelta(window))  
    df.loc[mask_, 'WinLabel'] = -1
    
    # Iterate over all the failure labels
    for code in df.loc[df.FailureBin == 1, 'FailureLabel'].unique():
        
        # Dates where that code occurs
        code_dates = df[df.FailureLabel == code].index

        # Iterate over these dates
        for c, t in enumerate(code_dates):
            # Rows inside the window leading up to this failure
            bool_ = (df.index < t) & (df.index >= (t - pd.Timedelta(window)))
            if c > 0:
                # Do not reach back past the previous occurrence of this code
                bool_ &= df.index > code_dates[c - 1]
            df.loc[bool_, 'WinLabel'] = code

        # The failure rows themselves carry the failure label
        df.loc[df.FailureLabel == code, 'WinLabel'] = code
    
    return df


"""
Function for Moving AVGs
"""

def get_ma(df, cols, freq):
    """
    Rolling Values
    :param df: DataFrame
    :param cols: Columns which are being Rolled
    :param freq: Rolling Window( example: 7D)
    :return: DataFrame with Rolled Values
    """
    for i in cols:
        col_name_1 = i + '_MA'
        df[col_name_1] = df[i].rolling(freq).mean()
    return df
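
To see what the window labeling produces, here is a simplified, self-contained sketch on toy data. It is not `window_func` itself: it skips the `-1` marking of the final window and the check against the previous occurrence of the same code, and the helper name `label_lookahead` is hypothetical.

```python
import pandas as pd

def label_lookahead(df, window, label_col="FailureLabel", flag_col="FailureBin"):
    """Simplified sketch: tag rows that fall within `window` before each failure timestamp."""
    out = df.copy()
    out["WinLabel"] = "Normal"
    failures = out[out[flag_col] == 1]
    for ts, code in failures[label_col].items():
        in_window = (out.index < ts) & (out.index >= ts - pd.Timedelta(window))
        # Only overwrite rows that are still 'Normal', so earlier failure rows keep their label
        out.loc[(out["WinLabel"] == "Normal") & in_window, "WinLabel"] = code
        # The failure row itself carries the failure label
        out.loc[ts, "WinLabel"] = code
    return out

# One toy well with daily readings and a single failure on 2020-01-05
idx = pd.date_range("2020-01-01", periods=6, freq="D")
toy = pd.DataFrame({
    "FailureBin": [0, 0, 0, 0, 1, 0],
    "FailureLabel": ["Normal", "Normal", "Normal", "Normal", "PUMP", "Normal"],
}, index=idx)

labeled = label_lookahead(toy, "2 days")
print(labeled["WinLabel"].tolist())
```

With a 2-day window, the two rows before the failure and the failure row itself are labeled, which is the "trend pointing towards the failure" the model is meant to learn.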


In [None]:
rol_cols = [
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy, so the slice can be modified safely
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
"""
Plotting
"""
col = 'NetProd_MA'
well = 'Helling Trust 43-22 16T3'

well_df = train_data[train_data.NodeID == well]
# well_df.loc[well_df.FailureBin == 1, [col, 'WinLabel']] = np.nan  # Nan where Failures are present
fig, ax = plt.subplots(figsize=(25,8))

ax.plot(well_df.Date, well_df[col], label=col)
bool_ = (well_df.WinLabel != 'Normal')

ax.scatter(well_df.loc[bool_, "Date"], well_df.loc[bool_, col], c='r', label='Failure')

ax.set_xlabel("Date")
ax.set_ylabel("KPI")
ax.legend(loc='best')
plt.show()

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

## Algo Test

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier model
    :return: RFC Model
    """
    scaler = StandardScaler()

    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'auto',
        'max_depth': None,
    }

    rfc = RandomForestClassifier(**rfc_params)

    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [None]:
rfc_model = build_rfc_model()
rfc_model

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
print("CV Metrics")
display(cv_rfc)

kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

In [40]:
"""
Model 2
Gradient Boosted Classifier with Oversampling
"""

def build_gbc_model(y):
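    # Building Smote dict: oversample each failure class to a third of the largest class
    # value_counts() is sorted in descending order, so .iloc[0] is the largest class count
    max_count = int(y.value_counts().iloc[0] / 3)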
    class_list = list(y.value_counts().index)
    class_list.remove('Normal')
    smote_dict = {key: max_count for key in class_list}
    print(smote_dict)

    # Define the model pipeline
    scaler = StandardScaler()
    smote = SMOTE(sampling_strategy=smote_dict, random_state=42)
    baseline_param = {
        'n_estimators': 4,
        'max_depth': 8,
        'learning_rate': 0.1,
        'loss': 'deviance',
        'min_samples_split': 2,
        'verbose': 0
    }

    gbc = GradientBoostingClassifier(**baseline_param)

    model = Pipeline([
        ('scaler', scaler),
        ('smote', smote),
        ('gbc', gbc)
    ])

    return model


In [41]:
gbc_model = build_gbc_model(Y)
gbc_model

{'fz_PUMP': 1026909, 'fz_ROD': 1026909, 'fz_TUBING': 1026909, 'fz_BHA': 1026909}


Pipeline(steps=[('scaler', StandardScaler()),
                ('smote',
                 SMOTE(random_state=42,
                       sampling_strategy={'fz_BHA': 1026909, 'fz_PUMP': 1026909,
                                          'fz_ROD': 1026909,
                                          'fz_TUBING': 1026909})),
                ('gbc',
                 GradientBoostingClassifier(max_depth=8, n_estimators=4))])
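
The sampling dictionary printed above can be reproduced on a small example. The sketch below builds it from label counts with pandas alone. The helper name `build_sampling_strategy` is hypothetical, and the guard against targets smaller than the existing class size is an addition, since over-sampling can only add samples.

```python
import pandas as pd

def build_sampling_strategy(y, majority="Normal", divisor=3):
    """Target count per minority class: the majority count divided by `divisor`."""
    counts = y.value_counts()
    target = int(counts[majority] // divisor)
    # Never ask for fewer samples than a class already has
    return {cls: max(target, int(n)) for cls, n in counts.items() if cls != majority}

# Toy labels: 9 Normal, 2 fz_PUMP, 1 fz_ROD
y = pd.Series(["Normal"] * 9 + ["fz_PUMP"] * 2 + ["fz_ROD"] * 1)
strategy = build_sampling_strategy(y)
print(strategy)
```

The resulting dictionary has the same shape as the `sampling_strategy` argument passed to `SMOTE` in the pipeline above.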

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, gbc_model)

cv_gbc = MultiClassMetrics.cv_validation(X, Y, gbc_model)
print("CV Metrics")
display(cv_gbc)

kf_gbc = MultiClassMetrics.kfold_validation(X, Y, gbc_model)
print("Kfold Metrics")
display(kf_gbc)

## Making Predictions on the entire dataset

- Done to show quick results in the dashboard
- All wells are used in the training set
- The same values are used for the predictions as well
- When plotted, the results will look the way we expect our results to look
- Take the predictions with a big grain of salt, since the model is predicting on the data it was trained on


### Prediction Table

The results are added to a prediction table in the `oasis-dev` database.

The following columns will be present in the `clean.win_predictons` table:
- NodeID
- Date
- FailureProb
- Prob1 
- Prob2
- Prob3

We use a basic RFC model, with the following 7-day moving averages as features:
- PPRL_MA
- MPRL_MA
- FluidLoadonPump_MA
- PumpIntakePressure_MA

This combination gave the best results in the tests.

In [None]:
# Import the entire dataset with the columns we need for making predictions, plus the failure info

query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
ORDER BY
    "NodeID", "Date"
"""

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
# Generating Features
rol_cols = [
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy, so the slice can be modified safely
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd_MA','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

In [None]:
# quick test
rfc_model = build_rfc_model()
display(rfc_model)

MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

In [None]:
# Fit the whole df 
rfc_model = build_rfc_model()
rfc_model.fit(X, Y)

In [None]:
"""
Predictions
"""
print("Classes Predicted {}".format(rfc_model.classes_))
y_hat = rfc_model.predict(X.to_numpy())  # Get label predictions
y_prob = rfc_model.predict_proba(X.to_numpy())  # Get class probabilities

In [None]:
ind = final_train.index
data_pred = final_train[["NodeID", "Date"]].copy()  # copy to avoid SettingWithCopyWarning
data_pred.loc[ind, 'PredClass'] = y_hat 

pred_classes = rfc_model.classes_

# One probability column (in percent) per predicted class
for i, cls in enumerate(pred_classes):
    col = 'Prob ' + str(cls)
    data_pred.loc[ind, col] = y_prob[:, i] * 100
data_pred = data_pred.round(3)
data_pred['FailureProb'] = 100 - data_pred['Prob Normal']
data_pred.drop(columns='Prob Normal', inplace=True)

In [None]:
data_pred.head()

In [None]:
"""
Adding Prob data to DB
"""

# Replaces the full predictions table
lib_aws.AddData.add_data(df=data_pred, db='oasis-dev', table='xpred', schema='clean',
                         merge_type='replace', card_col=None, index_col='Date')

# Update index on pred table in database
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX xpred_idx ON clean.xpred ("NodeID", "Date");""")
