# Notebook Info

This notebook is for the development of a window forecasting model. The following tables and schemas are used:

```
Main Database: oasis-prod

For Failures:
schema: analysis
table: failure_info

For Features:
schema: xspoc
table: xdiag

Columns used as Features:
- PPRL
- MPRL
- FluidLoadonPump
- PumpIntakePressure

```

**Notes**

The following conditions and assumptions apply:

- **NaN values are not handled**. They are dropped for training and predictions
- A failure-specific variable window is used for predictions?
- **Failure data points are not used** for training and predictions
    - One reason is that we want to see the trends leading up to the failures, not the failures themselves
    - Often, when a failure occurs the well is shut down and no values are present. Dropping these points means we do not have to worry about imputing the data
    - At first glance this may not impact the algorithm
    - Further discussion is needed
- Multi-class classification is performed (not multi-label)
- Training data
    - Do we train on the entire dataset or only on wells where failures have occurred?


In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import joblib
from library import lib_aws, lib_cleaning
from library.lib_metrics import MultiClassMetrics

# Importing Data

The following steps are performed:
- Import failures (analysis.failure_info)
- Import features/data (xspoc.xdiag)
- Merge the info
- Clean and modify the data depending on how we want the input features

**Note: Use `Feature Analysis.ipynb` to analyse the data and failures**

## Importing Failures

In [3]:
%%time
# Not needed if the failure data is imported manually (see the S3 import below)
# Querying the entire failure info
query_failures = """
SELECT *
FROM
    analysis.failure_info
ORDER BY "NodeID";
"""

with lib_aws.PostgresRDS(db='oasis-prod', verbose=1) as engine:
    failures = pd.read_sql(query_failures, engine, parse_dates=['Last Oil', 'Finish Date'])
    
failures.head()

Connected to oasis-prod DataBase
Connection Closed
Wall time: 16.3 s


Unnamed: 0,NodeID,Formation,Last Oil,Start Date,Finish Date,Run Time,Job Type,Job Bucket,Components,Primary Symptom,Secondary Symptom
0,Aagvik 1-35H,MIDDLE BAKKEN,2019-11-27,2019-12-02,2019-12-06,203.0,TUBING LEAK,TUBING,Tubing - Body,Mechanically Induced Damage,Solids in Pump
1,Aagvik 5298 41-35 2TX,THREE FORKS,2019-05-29,2019-06-04,2019-06-25,80.0,GAS LIFT,PUMP,Gas Lift - Valve - Bellows,Low Production,Blank
2,Acadia 31-25H,THREE FORKS,2018-04-11,2018-05-05,2018-05-11,266.0,TUBING LEAK,TUBING,Tubing - Collar,Corrosion,Sand
3,Acadia 31-25H,THREE FORKS,2019-03-30,2019-04-10,2019-04-16,323.0,"1-1/4"" PUMP",PUMP,Pump - Plunger,Corrosion,Mechanically Induced Damage
4,Acklins 6092 12-18H,MIDDLE BAKKEN,2019-12-24,2020-01-02,2020-01-03,853.0,POLISH ROD BREAK,ROD,Polish Rod,Mechanically Induced Damage,


In [5]:
# %%time
# failures.to_csv('failure_info.csv', index=False)

Wall time: 66 ms


In [44]:
# Manually importing data from s3
file_path = "s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020_ver6.xlsx"

failures = pd.read_excel(file_path)
# Cleaning well names
failures['Well'] = (failures['Well'].str.replace("#", "", regex=False)  # remove #
                                 .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                 .str.strip()  # remove leading and trailing whitespace
                                 .str.lower()  # lowercase all characters
                                 .str.title()  # uppercase first letter of each word
                                 .map(lambda x: x[0:-2] + x[-2:].upper()))  # uppercase the last two characters (e.g. 'Tx' -> 'TX')

failures.rename(columns={'Well': 'NodeID',
                        'LAST OIL': 'Failure Start Date',
                        'FAILURE END DATE': 'Failure End Date'}, inplace=True)
well_list = failures.NodeID.unique().tolist()  # List of wells to use for querying the features and model training
failures.head()

Unnamed: 0,NodeID,Failure Start Date,LOE START DATE,LOE FINISH DATE,Failure End Date,Components,Job Type,Job Bucket,Primary Symptom,Secondary Symptom,Root Cause
0,Ava 5693 43-35T,2019-08-14 12:02:16,2019-08-20,2019-08-26,2019-08-28 11:19:14,BHA - Seat Nipple,TUBING LEAK,TUBING,Erosion - Fluid,Unknown,"Erosion Abrasion - Fluids, Solids"
1,Bouvardia Federal 2658 12-12H,2019-08-02 09:57:11,2019-08-13,2019-08-19,2019-08-24 08:31:02,BHA - Seat Nipple,"1-1/2"" PUMP",PUMP,Pump Unseating,,"Erosion Abrasion - Fluids, Solids"
2,Johnson 29-30H,2019-07-08 12:36:03,2019-07-12,2019-07-16,2019-07-16 21:27:22,BHA - Seat Nipple,TUBING LEAK,BHA,Erosion - Fluid,,Undetermined
3,Didrick 4X-27H,2020-01-30 11:53:33,2020-02-10,2020-02-14,2020-02-15 10:16:06,BHA - TAC,TUBING LEAK,BHA,Corrosion,,Improper Chemical Usage
4,Langved 5393 43-10 9T2,2019-10-10 08:44:27,2019-10-15,2019-10-17,2019-11-11 11:58:17,BHA - TAC,TUBING LEAK,TUBING,Compression,Solids in Pump,BHA Needs Improvement


## Importing Data

Import the data depending on how the model is to be trained and the use case:

- For Training: Only wells which have been labelled
- For Historical Predictions: The entire dataset
- For Real Time Predictions: Will depend on how the data pipeline has been set up


In [45]:
%%time

# for well_list use only those wells which have been labeled
data_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
WHERE "NodeID" in {}
ORDER BY "NodeID","Date"
""".format(tuple(well_list))

# full_query = """
# SELECT
#     "NodeID",
#     "Date",
#     "PPRL",
#     "MPRL",
#     "FluidLoadonPump",
#     "PumpIntakePressure"
# FROM xspoc.xdiag
# ORDER BY "NodeID","Date"
# """

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data = pd.read_sql(data_query, engine, parse_dates=['Date'])

Wall time: 1min 52s
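The query above interpolates `tuple(well_list)` directly into the SQL text. That fails for a single well (Python renders a one-element tuple as `('name',)`, which is not valid SQL) and for well names that contain a quote. A minimal sketch of one alternative, assuming the underlying driver accepts pyformat-style named placeholders (as psycopg2 does); `build_in_clause` is a hypothetical helper and not part of `lib_aws`:

```python
def build_in_clause(values, prefix="w"):
    """Return (placeholder_sql, params) for a SQL IN (...) list.

    Hypothetical helper: emits pyformat placeholders such as %(w0)s,
    which assumes a driver using that paramstyle (e.g. psycopg2).
    """
    values = list(values)
    if not values:
        raise ValueError("IN clause needs at least one value")
    # One named parameter per value: w0, w1, ...
    params = {f"{prefix}{i}": v for i, v in enumerate(values)}
    placeholders = ", ".join(f"%({name})s" for name in params)
    return placeholders, params


# Two well names taken from the failure table shown earlier
placeholders, params = build_in_clause(["Aagvik 1-35H", "Acadia 31-25H"])

data_query = f"""
SELECT "NodeID", "Date", "PPRL"
FROM xspoc.xdiag
WHERE "NodeID" IN ({placeholders})
ORDER BY "NodeID", "Date"
"""
print(data_query)
print(params)
```

The `params` dict could then be passed along as `pd.read_sql(data_query, engine, params=params)`, assuming the connection behind `lib_aws.PostgresRDS` forwards parameters to the driver unchanged.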


In [46]:
well_list_features = data.NodeID.unique()
well_list_features.shape
np.shape(well_list)

(269,)

In [47]:
# Failure info only from wells present in the data
failure_info = failures[failures.NodeID.isin(well_list_features)].reset_index(drop=True)

# info
display(data.head())
print("Failure info in these in these wells")
display(failure_info)
print("Failure Label Distribution")
display(failure_info.Components.value_counts())

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


Failure info in these in these wells


Unnamed: 0,NodeID,Failure Start Date,LOE START DATE,LOE FINISH DATE,Failure End Date,Components,Job Type,Job Bucket,Primary Symptom,Secondary Symptom,Root Cause
0,Bouvardia Federal 2658 12-12H,2019-08-02 09:57:11,2019-08-13,2019-08-19,2019-08-24 08:31:02,BHA - Seat Nipple,"1-1/2"" PUMP",PUMP,Pump Unseating,,"Erosion Abrasion - Fluids, Solids"
1,Johnson 29-30H,2019-07-08 12:36:03,2019-07-12,2019-07-16,2019-07-16 21:27:22,BHA - Seat Nipple,TUBING LEAK,BHA,Erosion - Fluid,,Undetermined
2,Didrick 4X-27H,2020-01-30 11:53:33,2020-02-10,2020-02-14,2020-02-15 10:16:06,BHA - TAC,TUBING LEAK,BHA,Corrosion,,Improper Chemical Usage
3,Johnsrud 5198 14-18 15TX,2019-08-29 04:00:28,2019-09-05,2019-09-11,2019-09-14 07:01:19,BHA - TAC,BHA - TAC,TUBING,Mechanically Induced Damage,Compression,System Design Needs Improvement
4,Lundeen 4-26H,2019-07-29 12:32:23,2019-08-04,2019-08-12,2019-08-13 09:11:02,BHA - TAC,TUBING LEAK,BHA,Mechanically Induced Damage,Corrosion,Rod Design Needs Improvement
...,...,...,...,...,...,...,...,...,...,...,...
237,Rolfson N 5198 12-17 5T,2019-10-24 13:11:57,2019-10-28,2019-11-04,2019-11-07 07:45:00,Tubing Leak,TUBING LEAK,TUBING,,Mechanically Induced Damage,Tubing Leak - No Hole Found
238,Rj Titus 6093 42-20H,2019-09-13 13:33:45,2019-09-25,2019-09-27,2019-09-29 08:38:56,Tubing Leak,TUBING LEAK,TUBING,Compression,Corrosion,Tubing Leak - No Hole Found
239,Hendricks 5602 43-36 2T,2019-08-11 13:49:43,2019-08-19,2019-08-22,2019-08-26 08:59:47,Tubing Leak,TUBING LEAK,TUBING,Corrosion,,Tubing Leak - No Hole Found
240,Crawford 5493 44-7T,2019-07-28 12:24:42,2019-07-30,2019-08-01,2019-08-01 15:14:46,Tubing Leak,TUBING LEAK,TUBING,Corrosion,Flumping,Tubing Leak - No Hole Found


Failure Label Distribution


Tubing - Body                48
Pump - Plunger               43
Rod - Main Body              25
Pump - Stuck Pump            22
Polish Rod                   19
Tubing Leak                  17
Rod - Pin                    12
Pump - Junked                 9
Pump - Barrel                 8
Rod - 6" Critical Section     8
Rod - Coupling                7
Pump - No-Tap                 6
Tubing - Collar               5
Pump - On - Off Tool          4
BHA - TAC                     3
BHA - Seat Nipple             2
Pump - Standing Valve         2
Pump - Traveling Valve        1
Pump - Valve Rod              1
Name: Components, dtype: int64

## Merging Info

In [7]:
"""
Before analysing the data we need to merge the information.
Transferring info from failures to data (copy of features).
Uses a for loop, which may not be very efficient.
"""

def fill_null(df, chk_col='PPRL', well_col='NodeID', time_col='Date'):
    """
    Adds rows of null values for the days on which a well has no data points.
    Helps show failures where no data was present.
    This has to be taken into account when running analysis.
    """
    data_temp = df.copy()
    # Set time col as index if it is not
    if time_col in data_temp.columns:
        data_temp.set_index(time_col, inplace=True)
    
    data_gp = data_temp.groupby(well_col).resample('1D').max()  # Groupby wellname and resample to Day freq
    data_gp.drop(columns=[well_col], inplace=True)  # Drop these columns as they are present in the index
    data_gp.reset_index(inplace=True)  # Get well_col and time_col back from the index
    data_null = data_gp[data_gp.loc[:, chk_col].isnull()]  # Get all null values, which need to be added to the main data file
    data_null.reset_index(inplace=True, drop=True)
    data_temp.reset_index(inplace=True)  # get timestamp back in the column for concating
    data_full = pd.concat([data_temp, data_null], axis=0, ignore_index=True)  # concat null and og files
    data_full.sort_values(by=[well_col, time_col], inplace=True)
    data_full.drop_duplicates(subset=[well_col, time_col], inplace=True)
    data_full.reset_index(drop=True, inplace=True)
    
    return data_full

# TODO: transfer_cols only works for multiple columns, get it to work with 1 column
def failure_merge(df, failure_df, transfer_cols):
    """
    Merges the failure info into the data
    :param df: dataframe to which info is being transferred (should have columns "NodeID" and "Date")
    :param failure_df: failure info data (should have columns "NodeID", "Failure Start Date" and "Failure End Date")
    :param transfer_cols: list of columns which need to be transferred
    """
    merged = df.copy()  
    for col in transfer_cols:
        merged[col] = 'Normal'  # for now putting everything as normal (even NAN's)
        
    for i in failure_df.index:
        well = failure_df.loc[i, 'NodeID']
        t_start = failure_df.loc[i, 'Failure Start Date']
        t_end = failure_df.loc[i, 'Failure End Date'] + pd.Timedelta('1 day')  # As we have day based frequency (the times in a day are considered as 00:00:00)
        bool_ = (merged.NodeID == well) & (merged.Date >= t_start) & (merged.Date <= t_end)  # Boolean mask for main data
        merged.loc[bool_, transfer_cols] = failure_df.loc[i, transfer_cols].values
        
    return merged
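The core idea of `fill_null()` can be seen on a toy frame: resampling each well to a daily frequency exposes the days with no readings, and those are the rows that get appended to the data. A minimal sketch with made-up values (only the column names `NodeID`, `Date` and `PPRL` come from the notebook):

```python
import pandas as pd

# One well with readings on Jan 1, 2 and 4; Jan 3 has no data
toy = pd.DataFrame({
    "NodeID": ["A", "A", "A"],
    "Date": pd.to_datetime(["2020-01-01 08:00", "2020-01-02 09:00", "2020-01-04 10:00"]),
    "PPRL": [1.0, 2.0, 3.0],
})

# Daily max per well; days without readings come out as NaN
daily = toy.set_index("Date").groupby("NodeID")["PPRL"].resample("1D").max()

# These NaN rows are the ones fill_null() concatenates back onto the data
missing_days = daily[daily.isnull()].reset_index()
print(missing_days)
```

Note that the inserted rows carry a midnight timestamp, which is why rows such as `2019-08-14 00:00:00` appear among the irregular reading times after filling.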

In [48]:
%%time
data = fill_null(data)  # Filling in NaNs where data was missing

# Transfer 'Components' from failure_info to data
transfer_col = ['Components', 'Failure Start Date']
data = failure_merge(data, failure_info, transfer_col)
data.drop(columns='Failure Start Date', inplace=True)

data.head()

Wall time: 23.3 s


Unnamed: 0,Date,NodeID,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Components
0,2019-06-21 15:58:34,Aagvik 1-35H,27639.0,16811.0,3280.0,,Normal
1,2019-06-21 16:25:36,Aagvik 1-35H,27457.0,16752.0,3241.0,,Normal
2,2019-06-21 18:25:16,Aagvik 1-35H,27448.0,16594.0,3330.0,,Normal
3,2019-06-21 18:28:10,Aagvik 1-35H,27424.0,16595.0,3327.0,,Normal
4,2019-06-21 20:25:01,Aagvik 1-35H,27662.0,16711.0,3341.0,,Normal


## Feature Engineering

Depending on which features and labels we want to use for our model, we can use the following functions.

`get_agg()`:

    - For now only gives us moving averages
    - Can be modified to give other aggregate functions
    
`create_prediction_zones()`:
    
    - Creates new classes depending on the windows we choose for failures
  
**Note: Both these functions return separate dataframes/series, which have to be merged accordingly**
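The windowing idea behind `create_prediction_zones()` can be illustrated on a single toy well: the rows in a fixed window just before a failure starts are relabelled with an `fz_` prefix, while the failure rows themselves keep their label. A minimal sketch with made-up data and an illustrative 3-day window (not the setting used in this notebook):

```python
import pandas as pd

# Toy single-well frame: 7 normal days followed by a 3-day failure
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Components": ["Normal"] * 7 + ["Polish Rod"] * 3,
})

window = pd.Timedelta("3 days")  # illustrative window length
is_failure = toy["Components"] != "Normal"
fail_start = toy.loc[is_failure, "Date"].min()
fail_label = toy.loc[is_failure, "Components"].iloc[0]

# Rows in [fail_start - window, fail_start) become the prediction zone
in_zone = (toy["Date"] >= fail_start - window) & (toy["Date"] < fail_start)
toy["Label"] = toy["Components"].where(~in_zone, "fz_" + fail_label)
print(toy)
```

Here the failure starts on 2020-01-08, so 2020-01-05 through 2020-01-07 receive the label `fz_Polish Rod`.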


In [10]:
"""
Helper Functions
"""

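def get_agg(df, freq, time_col='Date', well_col='NodeID'):
    """
    Rolling mean of the numeric columns, computed per well over a time window
    :param df: dataframe with time_col, well_col and numeric feature columns
    :param freq: rolling window as an offset string, e.g. '7D'
    :return: dataframe with time_col, well_col and the prefixed rolling means
    """
    frames = []
    
    for well in df[well_col].unique():
        temp_df = df[df[well_col] == well].copy()
        temp_df.set_index(time_col, inplace=True)
        # Keep numeric columns only: older pandas silently dropped the others,
        # newer versions may raise on non-numeric columns
        temp_df = temp_df.select_dtypes(include='number').rolling(freq).mean()
        temp_df = temp_df.add_prefix(freq + '_')
        temp_df[well_col] = well
        temp_df.reset_index(inplace=True)
        frames.append(temp_df)
        
    rolled_df = pd.concat(frames)
    rolled_df.reset_index(inplace=True, drop=True)
    
    return rolled_df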


def create_prediction_zones(df, fail_col, prediction_zone_dict):
    """
    Depending on the prediction_zone_dict will create predictions zones for failures 
    in the Failure column.
    :param df: The dataframe to extract it from
    :param fail_col: Failure column to use from the dataframe
    :param prediction_zone_dict: A dict with timedeltas for each type of Failure in fail_col
    :return Will return a Series or an Array of these Prediction Zones
    """
    
    test_data = df[['NodeID', 'Date', fail_col]].copy()
    fail_zones = test_data[fail_col].copy()  # fail_zones is initialized as a copy of the fail col
    
    # Getting start of predictions from fail col
    fail_dates = test_data[test_data[fail_col] != 'Normal']  # everything other than normal is considered a failure
    # A new failure starts where the gap to the previous failure row exceeds 36 hours
    # (the diff runs across wells, so df is expected to be sorted by NodeID and Date)
    gap = fail_dates.Date.diff().abs().fillna(pd.Timedelta('10D'))
    fail_start = fail_dates[gap > pd.Timedelta('36h')].reset_index(drop=True)
    
    # Adding zones by iterating over each prediction start date
    for i in fail_start.index:
        temp_well = fail_start.loc[i, 'NodeID']  # well name
        zone_end_date = fail_start.loc[i, 'Date']  # failure start date, i.e. the end of the prediction zone
        fail = fail_start.loc[i, fail_col]  # failure class
        zone_delta = pd.Timedelta(prediction_zone_dict[fail])  # window length for this failure class
        zone_start_date = zone_end_date - zone_delta

        bool_ = (test_data.NodeID == temp_well) & (test_data.Date < zone_end_date) & (test_data.Date >= zone_start_date)
        fail_zones.loc[bool_] = 'fz_' + fail
        
    return fail_zones

Say we want to use rolling averages over a 7-day window for our features and a constant 15-day window for our failures. The next few cells show how this is done.
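A time-based rolling window such as `'7D'` works on irregular timestamps: for each row it averages all rows of the same well whose timestamp lies in the 7 days up to and including that row. A minimal sketch on made-up data, using a grouped rolling mean rather than the per-well loop in `get_agg()`:

```python
import pandas as pd

toy = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-10", "2020-01-01", "2020-01-02"]
    ),
    "PPRL": [10.0, 20.0, 30.0, 100.0, 200.0],
})

# Per-well 7-day rolling mean; the frame is already sorted by NodeID and Date
rolled = (
    toy.set_index("Date")
       .groupby("NodeID")["PPRL"]
       .rolling("7D")
       .mean()
       .rename("7D_PPRL")
       .reset_index()
)
print(rolled)
```

For well A the three rows average to 10, 15 and 30: the reading on 2020-01-10 stands alone because 2020-01-03 lies exactly 7 days earlier and the window is open on its left edge by default.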

In [49]:
%%time
# 7 day rolling averages
avg_data = get_agg(df=data, freq='7D')  

# Merge it with the original data 
# and use only those columns which will be of use
# While working with large datasets, try to minimize the copies of dataframes you create
# May not even have to merge it

full_data = data.set_index(['NodeID', 'Date']).merge(avg_data.set_index(['NodeID', 'Date']), 
                                                     left_index=True,
                                                     right_index=True).reset_index()

# Drop columns that we don't need
cols_drop = [
    'PPRL',
    'MPRL',
    'PumpIntakePressure',
    'FluidLoadonPump',
]

full_data.drop(columns=cols_drop, inplace=True)

full_data.head()

Wall time: 9.71 s


Unnamed: 0,NodeID,Date,Components,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,Normal,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,Normal,27548.0,16781.5,3260.5,
2,Aagvik 1-35H,2019-06-21 18:25:16,Normal,27514.666667,16719.0,3283.666667,
3,Aagvik 1-35H,2019-06-21 18:28:10,Normal,27492.0,16688.0,3294.5,
4,Aagvik 1-35H,2019-06-21 20:25:01,Normal,27526.0,16692.6,3303.8,


In [50]:
full_data.isnull().sum(axis=0)/len(full_data) * 100

NodeID                   0.000000
Date                     0.000000
Components               0.000000
7D_PPRL                  3.440361
7D_MPRL                  3.440361
7D_FluidLoadonPump       3.440361
7D_PumpIntakePressure    4.220519
dtype: float64

In [51]:
# pred zone dict
# manual
# pred_zone_dict = {
#     'PUMP': '15 days',
#     'ROD': '15 days',
#     'TUBING': '15 days',
#     'BHA': '15 days'
# }

# automated all failures will have the same failure window
fail_window = '15 days'
fail_labels = full_data.Components.unique().tolist()
fail_labels.remove('Normal')
pred_zone_dict = {x: fail_window for x in fail_labels}

In [53]:
# Create pred windows
# Note: The output of the function will be a pandas Series
full_data['Label'] = create_prediction_zones(df=full_data, 
                                             fail_col='Components', 
                                             prediction_zone_dict=pred_zone_dict)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Some assumptions we are going to make:
- Drop NaN values
- Remove the actual failures as classes and only use windows

In [54]:
full_data.dropna(inplace=True)

class_drop = fail_labels  # just for now, don't need to use it
full_data = full_data[~full_data.Label.isin(class_drop)].copy()  # copy so the assignment below does not hit a slice
full_data['Label'] = full_data['Label'].str.replace('fz_', '', regex=False).str.strip()
full_data.reset_index(inplace=True, drop=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [55]:
full_data.head()

Unnamed: 0,NodeID,Date,Components,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure,Label
0,Aagvik 1-35H,2019-08-13 11:19:56,Normal,31488.0,17075.0,9968.0,608.0,Normal
1,Aagvik 1-35H,2019-08-14 00:00:00,Normal,31488.0,17075.0,9968.0,608.0,Normal
2,Aagvik 1-35H,2019-08-15 06:04:48,Normal,31613.0,17062.5,9915.0,611.0,Normal
3,Aagvik 1-35H,2019-08-15 07:53:36,Normal,31615.333333,17012.666667,10011.0,564.666667,Normal
4,Aagvik 1-35H,2019-08-15 10:02:31,Normal,31594.5,17021.25,9975.25,576.5,Normal


In [56]:
X = full_data.drop(columns=['NodeID', 'Date', 'Components', 'Label'])
Y = full_data.Label

print("Features")
display(X.head())

print("Classes Being Predicted")
display(Y.value_counts())

Features


Unnamed: 0,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure
0,31488.0,17075.0,9968.0,608.0
1,31488.0,17075.0,9968.0,608.0
2,31613.0,17062.5,9915.0,611.0
3,31615.333333,17012.666667,10011.0,564.666667
4,31594.5,17021.25,9975.25,576.5


Classes Being Predicted


Normal                       669426
Tubing - Body                  6469
Pump - Plunger                 5145
Rod - Main Body                3293
Polish Rod                     2551
Pump - Stuck Pump              2394
Tubing Leak                    1930
Rod - Pin                      1605
Rod - 6" Critical Section      1146
Pump - Barrel                  1049
Rod - Coupling                 1016
Pump - Junked                   942
Pump - No-Tap                   776
Pump - On - Off Tool            534
Tubing - Collar                 423
BHA - TAC                       294
BHA - Seat Nipple               286
Pump - Standing Valve           246
Pump - Traveling Valve          171
Pump - Valve Rod                 79
Name: Label, dtype: int64
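The counts above are heavily skewed towards `Normal`. The `class_weight='balanced'` setting used for the random forest in the next section weights each class by `n_samples / (n_classes * class_count)`, which is the formula given in the scikit-learn documentation. A minimal pure-Python sketch on made-up labels, to show the effect:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count), per class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 8 normal rows and 2 failure-window rows
toy_labels = ["Normal"] * 8 + ["Polish Rod"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class ends up with four times the weight of the common one, mirroring the 8:2 imbalance.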

# Testing Algos

In [21]:
# Imports
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [22]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier model
    :return: Pipeline with a scaler followed by the RFC
    """
    scaler = StandardScaler()  # Define scaler
 
    # RFC params
    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'sqrt',  # same as the former 'auto' for classifiers; 'auto' was removed in scikit-learn 1.3
        'max_depth': None,
    }
    
    # RFC classifier
    rfc = RandomForestClassifier(**rfc_params)
    
    # Scale the features, then classify
    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [23]:
rfc_model = build_rfc_model()
rfc_model

Pipeline(steps=[('scaler', StandardScaler()),
                ('rfc', RandomForestClassifier(class_weight='balanced'))])

In [78]:
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

# cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
# print("CV Metrics")
# display(cv_rfc)

Weighted Metrics
Precision : 99.35
Recall: 99.36
F-score: 99.34

Macro Metrics
Precision : 97.01
Recall: 87.55
F-score: 91.89

Classification Report
                           precision    recall  f1-score   support

                   Normal       0.99      1.00      1.00    123118
               Polish Rod       0.95      0.85      0.90       251
         Polish Rod - Box       1.00      0.96      0.98        47
            Pump - Barrel       0.93      0.83      0.88       138
  Pump - Barrel Extension       1.00      0.73      0.84        33
            Pump - Junked       0.97      0.86      0.91       147
            Pump - No-Tap       0.92      0.75      0.83       129
     Pump - On - Off Tool       1.00      0.78      0.87        40
           Pump - Plunger       0.97      0.90      0.93       753
      Pump - Plunger Cage       1.00      0.97      0.99        72
    Pump - Standing Valve       0.97      0.93      0.95        74
        Pump - Stuck Pump       0.95      0.88

In [79]:
kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

Kfold Metrics


Unnamed: 0,Precision_wt,Recall_wt,F-score_wt,Precision_macro,Recall_macro,F-score_macro
0,99.409996,99.419998,99.400002,96.419998,89.440002,92.699997
1,99.409996,99.409996,99.400002,97.25,88.630005,92.629997
2,99.459999,99.459999,99.449997,96.75,89.900002,93.07
3,99.440002,99.449997,99.440002,96.860001,88.410004,92.32
4,99.43,99.440002,99.419998,97.080002,87.040001,91.689995
Mean,99.429999,99.435999,99.422,96.872,88.684003,92.481998
STD,0.018975,0.018548,0.020395,0.284774,0.982947,0.462317
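The gap between the weighted and macro scores above follows from how they are averaged: weighted metrics are dominated by the large `Normal` class, while macro metrics give every class the same say. A minimal pure-Python sketch for recall on made-up labels (`recall_summary` is a hypothetical helper, not part of `MultiClassMetrics`):

```python
from collections import Counter


def recall_summary(y_true, y_pred):
    """Per-class recall plus its macro and support-weighted averages."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recall = {cls: hits[cls] / n for cls, n in support.items()}
    macro = sum(recall.values()) / len(recall)
    weighted = sum(recall[cls] * n for cls, n in support.items()) / len(y_true)
    return recall, macro, weighted


# Toy case: every Normal row is right, one of two failure rows is missed
y_true = ["Normal"] * 8 + ["Polish Rod"] * 2
y_pred = ["Normal"] * 9 + ["Polish Rod"]
recall, macro, weighted = recall_summary(y_true, y_pred)
print(recall, macro, weighted)
```

In this toy case the weighted recall is 0.9 while the macro recall is 0.75, because the missed failure row counts for half of its class but only a tenth of all rows.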


# Training and Saving the model

Using the `lib_aws.S3` class, we can serialize and save a trained model.

In [57]:
model_train = build_rfc_model() # Define a model

In [58]:
%%time
model_train.fit(X, Y)  # fit the model

Wall time: 3min 31s


Pipeline(steps=[('scaler', StandardScaler()),
                ('rfc', RandomForestClassifier(class_weight='balanced'))])

In [59]:
%%time
# # saving the trained model in s3
# s3 = lib_aws.S3(bucket='et-oasis')
# model_name = 'algo/failure-forecasting/rfc_model_v2.pkl'
# s3.save_model(obj=model_train, name=model_name)

# Saving the model locally
# Save the model as a pickle in a file 
# joblib.dump(model_train, 'model_rfc.pkl') 

Wall time: 221 ms


['model_rfc.pkl']

# Making Predictions

While making predictions modify the features accordingly.

In [62]:
%%time
# To mimic actual working in production query the entire data again
# Find moving averages
# drop nan values 
# make predictions
full_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
ORDER BY "NodeID","Date"
"""

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data = pd.read_sql(full_query, engine, parse_dates=['Date'])
    
data.head()

Wall time: 7min 19s


Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


In [63]:
%%time
# Feature Engg
# Will find moving averages
# Drop NAN's
# 7 day rolling averages
avg_data = get_agg(df=data, freq='7D')  

# Merge it with the original data 
# and use only those columns which will be of use
# While working with large datasets, try to minimize the copies of dataframes you create
# May not even have to merge it

full_data = data.set_index(['NodeID', 'Date']).merge(avg_data.set_index(['NodeID', 'Date']), 
                                                     left_index=True,
                                                     right_index=True).reset_index()

# Drop columns that we don't need
cols_drop = [
    'PPRL',
    'MPRL',
    'PumpIntakePressure',
    'FluidLoadonPump',
]

full_data.drop(columns=cols_drop, inplace=True)
full_data.dropna(inplace=True)
full_data.reset_index(inplace=True, drop=True)
full_data.head()

Wall time: 2min 49s


Unnamed: 0,NodeID,Date,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure
0,Aagvik 1-35H,2019-08-13 11:19:56,31488.0,17075.0,9968.0,608.0
1,Aagvik 1-35H,2019-08-15 06:04:48,31613.0,17062.5,9915.0,611.0
2,Aagvik 1-35H,2019-08-15 07:53:36,31615.333333,17012.666667,10011.0,564.666667
3,Aagvik 1-35H,2019-08-15 10:02:31,31594.5,17021.25,9975.25,576.5
4,Aagvik 1-35H,2019-08-15 12:14:05,31621.4,16997.8,9907.0,603.0


In [60]:
%%time
# #load the model
# s3 = lib_aws.S3(bucket='et-oasis')
# model_name = 'algo/failure-forecasting/rfc_model_v2.pkl'  # same key as used when saving
# imported_model = s3.import_model(model_name)

# Load model Locally
# imported_model = joblib.load('model_rfc.pkl') 

# If training done in the same notebook instance use that itself
imported_model = model_train

Wall time: 0 ns


In [78]:
%%time
x_pred = full_data.drop(columns=['NodeID', 'Date'])
predictions = imported_model.predict(x_pred)  # Get Label Predictions
probabilities = imported_model.predict_proba(x_pred)  # Get Probabilities

Wall time: 1min 36s


In [79]:
%%time
# Creating final pred_df
classes_predicted = imported_model[1].classes_  # Trained rfc part for making predictions
pred_df = pd.concat([
    full_data[["NodeID", "Date"]], # NodeID and Date for indexing
    pd.DataFrame(np.round(probabilities * 100, 2), columns=classes_predicted),  # probabilities for each class
    pd.DataFrame(predictions, columns=['Prediction'])  # Actual Predictions
], axis=1)

Wall time: 1.22 s


In [87]:
rem_arr = np.array(["NodeID", "Date", "Normal", "Prediction"])
prob_cols = np.setdiff1d(pred_df.columns, rem_arr)
pred_df['FailureProb'] = pred_df[prob_cols].max(axis=1)

# Dropping class based probabilities
pred_df.drop(columns = prob_cols, inplace=True)

In [88]:
pred_df.head()

Unnamed: 0,NodeID,Date,Normal,Prediction,FailureProb
0,Aagvik 1-35H,2019-08-13 11:19:56,100.0,Normal,0.0
1,Aagvik 1-35H,2019-08-15 06:04:48,100.0,Normal,0.0
2,Aagvik 1-35H,2019-08-15 07:53:36,100.0,Normal,0.0
3,Aagvik 1-35H,2019-08-15 10:02:31,100.0,Normal,0.0
4,Aagvik 1-35H,2019-08-15 12:14:05,100.0,Normal,0.0
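`FailureProb` is the largest probability among the failure classes, not the total probability of any failure. A minimal sketch on made-up probabilities shows the difference between the two (the class names are taken from the label list above):

```python
import pandas as pd

# Made-up class probabilities in percent, one row per reading
probs = pd.DataFrame({
    "Normal": [90.0, 20.0],
    "Polish Rod": [7.0, 50.0],
    "Tubing - Body": [3.0, 30.0],
})

failure_cols = [c for c in probs.columns if c != "Normal"]

# As in the notebook: probability of the most likely failure class
max_failure_prob = probs[failure_cols].max(axis=1)

# Alternative reading: probability that the row is not Normal at all
any_failure_prob = 100.0 - probs["Normal"]

print(max_failure_prob.tolist(), any_failure_prob.tolist())
```

Which of the two is more useful depends on whether the downstream consumer cares about a specific component or about any failure; the notebook stores the former.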


In [75]:
# # Custom mapping only for now, modify the classes while training
# # This has been taken care of, dont have to run this
# class_mapping = {
#     'fz_BHA': 'BHA',
#     'fz_PUMP': 'PUMP',
#     'fz_ROD': 'ROD',
#     'fz_TUBING': 'TUBING'
# }
# pred_df.Prediction = pred_df.Prediction.map(class_mapping).fillna(pred_df.Prediction)  # mapping Predictions to actual class values
# pred_df.rename(columns=class_mapping, inplace=True)  # mapping column names to actual class values


In [89]:
%%time
# Filling nulls with str columns as well
pred_df.set_index(['NodeID', 'Date'], inplace=True)  # set index

# separate string columns and numeric columns
pred_str = pred_df.select_dtypes(include='object')
pred_num = pred_df.select_dtypes(exclude='object')

# fill nulls in num columns
pred_num.reset_index(inplace=True)
pred_num = fill_null(pred_num, chk_col='Normal', well_col='NodeID', time_col='Date')
pred_num.set_index(['NodeID', 'Date'], inplace=True)

# merge for final df
pred_df = pd.concat([pred_num, pred_str], axis=1)
pred_df.reset_index(inplace=True)

Wall time: 1min 23s


In [67]:
# pred_df.columns = pred_df.columns.str.replace('"', 'in')

In [90]:
pred_df.head()

Unnamed: 0,NodeID,Date,Normal,FailureProb,Prediction
0,Aagvik 1-35H,2019-08-13 11:19:56,100.0,0.0,Normal
1,Aagvik 1-35H,2019-08-14 00:00:00,,,
2,Aagvik 1-35H,2019-08-15 06:04:48,100.0,0.0,Normal
3,Aagvik 1-35H,2019-08-15 07:53:36,100.0,0.0,Normal
4,Aagvik 1-35H,2019-08-15 10:02:31,100.0,0.0,Normal


In [91]:
# Adding data to DB
# Replace the full bounds df
# Will only work if INSERT/UPDATE access is provided to the database
lib_aws.AddData.add_data(df=pred_df, 
                         db='oasis-prod', 
                         schema='xspoc',
                         table='sample_predictions', 
                         merge_type='replace', 
                         card_col=None, 
                         index_col='Date')

Data replaceed on Table sample_predictions in time 92.55s


In [92]:
# Setting up indexes
# Update index on pred table in database
with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX sample_predictions_idx ON xspoc.sample_predictions ("NodeID", "Date");""")


## Some Additional Functions

These help with adding some specific tables to the database:

**Add a failure db with failures as 0,1 (Only used for visualization)**

**Fill in NaN values in the data**
- After querying the whole dataset
- Find those datapoints (Date, NodeID as index) where we don't have data present


In [72]:
# Check Failure Info
data_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL"
FROM xspoc.xdiag
WHERE "NodeID" in {}
ORDER BY "NodeID","Date"
""".format(tuple(well_list))

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    only_fails = pd.read_sql(data_query, engine, parse_dates=['Date'])

In [73]:
%%time
only_fails = fill_null(only_fails)  # Filling in NaNs where data was missing

# Transfer 'Components' from failure_info to only_fails
transfer_col = ['Components', 'Failure Start Date']
only_fails = failure_merge(only_fails, failure_info, transfer_col)
only_fails.drop(columns='Failure Start Date', inplace=True)
only_fails.drop(columns = 'PPRL', inplace=True)
only_fails.head()

Wall time: 24 s


Unnamed: 0,Date,NodeID,Components
0,2019-06-21 15:58:34,Aagvik 1-35H,Normal
1,2019-06-21 16:25:36,Aagvik 1-35H,Normal
2,2019-06-21 18:25:16,Aagvik 1-35H,Normal
3,2019-06-21 18:28:10,Aagvik 1-35H,Normal
4,2019-06-21 20:25:01,Aagvik 1-35H,Normal


In [74]:
bool_ = only_fails.Components != 'Normal'
only_fails.loc[bool_, 'BinaryFails'] = 1

In [75]:
only_fails

Unnamed: 0,Date,NodeID,Components,BinaryFails
0,2019-06-21 15:58:34,Aagvik 1-35H,Normal,
1,2019-06-21 16:25:36,Aagvik 1-35H,Normal,
2,2019-06-21 18:25:16,Aagvik 1-35H,Normal,
3,2019-06-21 18:28:10,Aagvik 1-35H,Normal,
4,2019-06-21 20:25:01,Aagvik 1-35H,Normal,
...,...,...,...,...
738435,2020-08-15 10:37:07,Zdenek 6093 42-24H,Normal,
738436,2020-08-15 11:43:01,Zdenek 6093 42-24H,Normal,
738437,2020-08-15 13:26:35,Zdenek 6093 42-24H,Normal,
738438,2020-08-15 14:32:27,Zdenek 6093 42-24H,Normal,


In [76]:
# Adding data to DB
# Replace the full bounds df
lib_aws.AddData.add_data(df=only_fails, 
                         db='oasis-prod', 
                         schema='xspoc',
                         table='only_fails', 
                         merge_type='replace', 
                         card_col=None, 
                         index_col='Date')

Data replaceed on Table only_fails in time 62.18s


In [77]:
# Setting up indexes
# Create a unique index on the only_fails table in the database
with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX idx_only_fails_node_date ON xspoc.only_fails ("NodeID", "Date");""")


# Older Build

Older code, kept for reference.

## Importing Labeled Data

Labeled data is stored in the database `oasis-dev` in the table `clean.xspoc`

In [None]:
# Set up the query
failure_wells = ['Cade 12-19HA', 'Cook 12-13 6B', 'Helling Trust 43-22 16T3',
                'Helling Trust 44-22 5B', 'Johnsrud 5198 14-18 13T',
                'Johnsrud 5198 14-18 15TX', 'Rolfson N 5198 12-17 5T',
                'Rolfson N 5198 12-17 7T', 'Rolfson S 5198 11-29 2TX',
                'Rolfson S 5198 11-29 4T', 'Rolfson S 5198 12-29 8T',
                'Rolfson S 5198 14-29 11T', 'Stenehjem 14X-9HA']

query = """
SELECT 
    "NodeID",
    "Date",
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
WHERE
    "NodeID" in {}
ORDER BY
    "NodeID", "Date";
""".format(tuple(failure_wells))

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
# NOTE: `t` (the cutoff timestamp) is not defined in this section; set it before running this cell
data = data[data.Date < t]
data.reset_index(inplace=True, drop=True)

In [None]:
data.groupby(['NodeID']).agg({"Date": ["min", "max", "count"]})

In [None]:
# Modifying data for intern project

data.rename(columns={"NodeID": "WellName", "FailureBin":"BinaryLabel", "FailureLabel":"MultiLabel"}, inplace=True)
wells = data.WellName.unique()

In [None]:
new_wells = ["Well " + i for i in list('ABCDEFGHIJKLM')]
well_map = dict(zip(wells, new_wells))

data.WellName = data.WellName.map(well_map)

In [None]:
data.head()

In [None]:
data.MultiLabel.value_counts()

In [None]:
data.set_index("Date").to_csv("sample_data.csv")

In [None]:
"""
Generate Windows
"""

def window_func(df, window):
    """
    Generate multi-class window labels
    'Normal' = Does not fail
    'Label' = Actual Failure or Fails in the next n window
    -1 = Final window of the well (outcome unknown, filtered out later)
    :param df: DataFrame with a single well, the Timestamp col should be the index
    :param window: Window Value (example: '3 days')
    :return: DataFrame with the 'WinLabel' column added
    """
    
    df['WinLabel'] = 'Normal'  # Initialize every row as 'Normal'
    
    mask_ = df.index >= (df.index.max() - pd.Timedelta(window))  
    df.loc[mask_, 'WinLabel'] = -1  # Mark the final window so it can be dropped later
    
    # Iterate over all the labels
    for code in df.loc[df.FailureBin == 1, 'FailureLabel'].unique():
        
        # dates where that code occurs
        code_dates = df[df.FailureLabel == code].index

        # iterate over these dates, labelling the window before each occurrence
        for c, t in enumerate(code_dates):
            bool_ = (df.index < t) & (df.index >= (t - pd.Timedelta(window)))
            if c > 0:
                # do not reach back past the previous occurrence of this code
                bool_ = bool_ & (df.index > code_dates[c - 1])
            df.loc[bool_, 'WinLabel'] = code

        df.loc[df.FailureLabel == code, 'WinLabel'] = code
    
    return df


"""
Function for Moving AVGs
"""

def get_ma(df, cols, freq):
    """
    Rolling Values
    :param df: DataFrame
    :param cols: Columns which are being Rolled
    :param freq: Rolling Window( example: 7D)
    :return: DataFrame with Rolled Values
    """
    for i in cols:
        col_name_1 = i + '_MA'
        df[col_name_1] = df[i].rolling(freq).mean()
    return df
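
`get_ma` relies on pandas time-based rolling, which needs a sorted `DatetimeIndex` and averages whatever rows fall inside the trailing window rather than a fixed number of rows. A small sketch with made-up values, showing how an irregular gap affects the result:

```python
import pandas as pd

# Made-up readings with a 9-day gap before the last point
demo = pd.DataFrame(
    {"PPRL": [10.0, 20.0, 60.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-12"]),
)

# Trailing 7-day window: each row averages only rows from the previous 7 days
demo["PPRL_MA"] = demo["PPRL"].rolling("7D").mean()
print(demo)
```

For offset windows such as `'7D'`, `min_periods` defaults to 1, so a window holding a single reading still yields a value. Right after a long shutdown the moving average is therefore just the raw reading.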


In [None]:
rol_cols = [
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy so the in-place edits below do not act on a slice of `data`
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
"""
Plotting
"""
col = 'NetProd_MA'
well = 'Helling Trust 43-22 16T3'

well_df = train_data[train_data.NodeID == well]
# well_df.loc[well_df.FailureBin == 1, [col, 'WinLabel']] = np.nan  # Nan where Failures are present
fig, ax = plt.subplots(figsize=(25,8))

ax.plot(well_df.Date, well_df[col], label=col)
bool_ = (well_df.WinLabel != 'Normal')

ax.scatter(well_df.loc[bool_, "Date"], well_df.loc[bool_, col], c='r', label='Failure')

ax.set_xlabel("Date")
ax.set_ylabel("KPI")
ax.legend(loc='best')
plt.show()

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

## Algo Test

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier Model
    :return: RFC Model
    """
    scaler = StandardScaler()

    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'sqrt',  # same as the old 'auto' for classifiers; 'auto' was removed in newer scikit-learn
        'max_depth': None,
    }

    rfc = RandomForestClassifier(**rfc_params)

    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [None]:
rfc_model = build_rfc_model()
rfc_model

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
print("CV Metrics")
display(cv_rfc)

kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

In [40]:
"""
Model 2
Gradient Boosted Classifier with Oversampling
"""

def build_gbc_model(y):
    # Building Smote dict
    max_count = int(y.value_counts().iloc[0] / 3)  # one third of the most frequent class (positional lookup)
    class_list = list(y.value_counts().index)
    class_list.remove('Normal')
    smote_dict = {key: max_count for key in class_list}
    print(smote_dict)

    # Define the model pipeline
    scaler = StandardScaler()
    smote = SMOTE(sampling_strategy=smote_dict, random_state=42)
    baseline_param = {
        'n_estimators': 4,
        'max_depth': 8,
        'learning_rate': 0.1,
        # 'loss' is left at its default (log-loss; named 'deviance' before scikit-learn 1.1)
        'min_samples_split': 2,
        'verbose': 0
    }

    gbc = GradientBoostingClassifier(**baseline_param)

    model = Pipeline([
        ('scaler', scaler),
        ('smote', smote),
        ('gbc', gbc)
    ])

    return model


In [41]:
gbc_model = build_gbc_model(Y)
gbc_model

{'fz_PUMP': 1026909, 'fz_ROD': 1026909, 'fz_TUBING': 1026909, 'fz_BHA': 1026909}


Pipeline(steps=[('scaler', StandardScaler()),
                ('smote',
                 SMOTE(random_state=42,
                       sampling_strategy={'fz_BHA': 1026909, 'fz_PUMP': 1026909,
                                          'fz_ROD': 1026909,
                                          'fz_TUBING': 1026909})),
                ('gbc',
                 GradientBoostingClassifier(max_depth=8, n_estimators=4))])
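
The `sampling_strategy` dictionary printed above asks SMOTE to oversample every failure class up to one third of the majority-class count. A small sketch of that calculation on a made-up label series (label names reused from the output above, counts invented):

```python
import pandas as pd

# Invented label distribution: 9 'Normal', 2 'fz_PUMP', 1 'fz_ROD'
y_demo = pd.Series(["Normal"] * 9 + ["fz_PUMP"] * 2 + ["fz_ROD"])

counts = y_demo.value_counts()        # sorted, most frequent class first
target = int(counts.iloc[0] / 3)      # one third of the majority class
smote_dict = {label: target for label in counts.index if label != "Normal"}
print(smote_dict)
```

Since SMOTE only adds samples, this approach presumably works only while each target is at least as large as the class's current count, so a failure class that already exceeds one third of `Normal` would need separate handling.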

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, gbc_model)

cv_gbc = MultiClassMetrics.cv_validation(X, Y, gbc_model)
print("CV Metrics")
display(cv_gbc)

kf_gbc = MultiClassMetrics.kfold_validation(X, Y, gbc_model)
print("Kfold Metrics")
display(kf_gbc)

## Making Predictions on the entire dataset

- Done to show quick results in the dashboard
- All wells are used in the training set
- The same data points are then used for predictions
- When plotted, the results therefore show roughly how we expect the final output to look
- Take the predictions with a big grain of salt, since the model is scored on data it was trained on


### Prediction Table

The results are added to a prediction table in the `oasis-dev` database.

Following Columns will be present in the `clean.win_predictons` table:
- NodeID
- Date
- FailureProb
- Prob1 
- Prob2
- Prob3

We use a basic RFC model, with the following 7-day moving averages as features:
- PPRL_MA
- MPRL_MA
- FluidLoadonPump_MA
- PumpIntakePressure_MA

This combination gave the best results in the tests.
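
A sketch of how such a table can be assembled from a classifier's `predict_proba` output, assuming one probability column per class in the order of `classes_`. The probability array, class names and well name below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical class order and probabilities (each row sums to 1)
classes = np.array(["Normal", "fz_PUMP", "fz_ROD"])
y_prob = np.array([[0.90, 0.07, 0.03],
                   [0.20, 0.50, 0.30]])

pred = pd.DataFrame({
    "NodeID": ["Well A", "Well A"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

# One percentage column per class, then collapse 'Normal' into FailureProb
for i, cls in enumerate(classes):
    pred["Prob " + str(cls)] = y_prob[:, i] * 100
pred = pred.round(3)
pred["FailureProb"] = 100 - pred["Prob Normal"]
pred = pred.drop(columns="Prob Normal")
print(pred)
```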

In [None]:
# import the entire dataset with the columns we need for making predictions and the failure info

query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
ORDER BY
    "NodeID", "Date"
"""

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
# Generating Features
rol_cols = [
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy so the in-place edits below do not act on a slice of `data`
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd_MA','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

In [None]:
# quick test
rfc_model = build_rfc_model()
display(rfc_model)

MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

In [None]:
# Fit the whole df 
rfc_model = build_rfc_model()
rfc_model.fit(X, Y)

In [None]:
"""
Predictions
"""
print("Classes Predicted {}".format(rfc_model.classes_))
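# Pass the DataFrame itself so the input matches what the model was fitted on
y_hat = rfc_model.predict(X)         # Predicted class per row
y_prob = rfc_model.predict_proba(X)  # Class probabilities, columns ordered as rfc_model.classes_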

In [None]:
data_pred = final_train[["NodeID", "Date"]].copy()  # copy to avoid SettingWithCopyWarning
data_pred['PredClass'] = y_hat

pred_classes = rfc_model.classes_

# One probability column (in percent) per predicted class
for i, cls in enumerate(pred_classes):
    data_pred['Prob ' + str(cls)] = y_prob[:, i] * 100
data_pred = data_pred.round(3)
data_pred['FailureProb'] = 100 - data_pred['Prob Normal']  # probability of any failure class
data_pred.drop(columns='Prob Normal', inplace=True)

In [None]:
data_pred.head()

In [None]:
"""
Adding Prob data to DB
"""

# Replace the full prediction table
lib_aws.AddData.add_data(df=data_pred, db='oasis-dev', table='xpred', schema='clean',
                         merge_type='replace', card_col=None, index_col='Date')

# Update index on pred table in database
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX xpred_idx ON clean.xpred ("NodeID", "Date");""")
