# Notebook Info

This notebook is for the development of a window forecasting model. The following databases, schemas, and tables are used:

```
Main Database: oasis-prod

For Failures:
schema: analysis
table: failure_info

For Features:
schema: xspoc
table: xdiag

Columns used as Features:
- PPRL
- MPRL
- FluidLoadonPump
- PumpIntakePressure

```

**Notes**

The following conditions and assumptions apply:

- **NaN values are not handled**. They are dropped for training and predictions.
- Open question: should a failure-specific, variable window be used for predictions?
- **Failure data points are not used** for training and predictions.
    - One reason is that we want to see the trends pointing towards the failures, not the actual failures.
    - When a failure occurs, the well is often shut down and no values are present. Dropping these points avoids having to impute the data.
    - At first glance this should not impact the algorithm.
    - Further discussion is needed.
- Multi-class classification is performed (not multi-label).
- Training data
    - Open question: do we train on the entire dataset or only where failures have occurred?


In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from library import lib_aws, lib_cleaning
from library.lib_metrics import MultiClassMetrics

# Importing Data

The following steps are performed:
- Import failures (analysis.failure_info)
- Import features/data (xspoc.xdiag)
- Merge the info
- Clean and modify the data depending on how we want the input features

**Note: Use `Feature Analysis.ipynb` to analyse the data and failures**

## Importing Failures

In [3]:
%%time
# Querying the entire failure info
query_failures = """
SELECT 
    "NodeID",
    "Last Oil",
    "Finish Date",
    "Job Bucket",
    "Job Type"
FROM
    analysis.failure_info
ORDER BY "NodeID";
"""

with lib_aws.PostgresRDS(db='oasis-prod', verbose=1) as engine:
    failures = pd.read_sql(query_failures, engine, parse_dates=['Last Oil', 'Finish Date'])
    
failures.head()

Connected to oasis-prod DataBase
Connection Closed
Wall time: 14.6 s


Unnamed: 0,NodeID,Last Oil,Finish Date,Job Bucket,Job Type
0,Aagvik 1-35H,2019-11-27,2019-12-06,TUBING,TUBING LEAK
1,Aagvik 5298 41-35 2TX,2019-05-29,2019-06-25,PUMP,GAS LIFT
2,Acadia 31-25H,2018-04-11,2018-05-11,TUBING,TUBING LEAK
3,Acadia 31-25H,2019-03-30,2019-04-16,PUMP,"1-1/4"" PUMP"
4,Acklins 6092 12-18H,2019-12-24,2020-01-03,ROD,POLISH ROD BREAK


## Importing Data

- We need to import the entire dataset.
- While testing, choose a subset of the wells.
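
A minimal sketch of selecting a subset of wells with bound parameters instead of string formatting. It uses Python's built-in `sqlite3` with an in-memory table as a stand-in for `xspoc.xdiag` (an assumption made so the example runs anywhere). The PPRL values are made up, and the placeholder syntax of the production Postgres driver may differ.

```python
import sqlite3

# Hypothetical test subset of wells
well_subset = ['Anderson 7-18H', 'Emma 13-7H']

# In-memory stand-in for xspoc.xdiag (illustrative rows only)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE xdiag ("NodeID" TEXT, "PPRL" REAL)')
conn.executemany(
    'INSERT INTO xdiag VALUES (?, ?)',
    [('Anderson 7-18H', 27639.0), ('Emma 13-7H', 31488.0), ('Susie 15-22H', 29000.0)],
)

# One placeholder per well, so well names are never pasted into the SQL text
placeholders = ', '.join('?' for _ in well_subset)
query = (
    'SELECT "NodeID", "PPRL" FROM xdiag '
    f'WHERE "NodeID" IN ({placeholders}) ORDER BY "NodeID"'
)
rows = conn.execute(query, well_subset).fetchall()
conn.close()

print(rows)
```

Binding the values this way also sidesteps the single-element case, where `tuple(['one well'])` formats with a trailing comma that SQL rejects.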

In [5]:
%%time
# List of wells for testing
well_list = [
    'Anderson 7-18H',
    'Andre 5501 14-5 3B',
    'Autumn Wind State 5601 14-16B',
    'Berwick 4-2HE',
    'Carl Federal 2658 43-23H',
    'Carson Federal 2658 13-17H',
    'Cook 5300 12-13 6B',
    'Dixon 5602 44-34H',
    'Emma 13-7H',
    'Forland 28-33H',
    'Hanson 33-28H',
    'Inez 6093 43-19H',
    'Johnsrud 5198 12-18 10T',
    'Mae 5603 43-19H',
    'Susie 15-22H'
]


data_query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
-- WHERE "NodeID" in {}
ORDER BY "NodeID","Date"
""".format(tuple(well_list))

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data = pd.read_sql(data_query, engine, parse_dates=['Date'])

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


Failure info in these wells


Unnamed: 0,NodeID,Last Oil,Finish Date,Job Bucket,Job Type
0,Anderson 7-18H,2019-10-08,2019-10-16,ROD,POLISH ROD BREAK
1,Andre 5501 14-5 3B,2018-04-30,2018-05-12,ROD,POLISH ROD BREAK
2,Andre 5501 14-5 3B,2018-05-22,2018-05-26,ROD,"1"" ROD SECTION"
3,Andre 5501 14-5 3B,2020-03-06,2020-03-13,PUMP,"1-3/4"" PUMP"
4,Autumn Wind State 5601 14-16B,2020-02-03,2020-02-10,TUBING,TUBING LEAK
5,Berwick 4-2HE,2019-10-31,2019-11-11,PUMP,"2"" PUMP"
6,Carl Federal 2658 43-23H,2019-01-27,2019-02-14,ROD,POLISH ROD BREAK
7,Carl Federal 2658 43-23H,2019-06-04,2019-07-02,ROD,POLISH ROD BREAK
8,Carl Federal 2658 43-23H,2020-02-03,2020-02-07,ROD,POLISH ROD BREAK
9,Carl Federal 2658 43-23H,2019-07-26,2019-08-13,ROD,"1"" ROD SECTION"


Wall time: 2min 50s


In [11]:
well_list = data.NodeID.unique()

869

In [12]:
# Failure info only from wells present in the data
failure_info = failures[failures.NodeID.isin(well_list)]
failure_info.reset_index(inplace=True, drop=True)

# info
display(data.head())
print("Failure info in these wells")
display(failure_info)

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


Failure info in these wells


Unnamed: 0,NodeID,Last Oil,Finish Date,Job Bucket,Job Type
0,Aagvik 1-35H,2019-11-27,2019-12-06,TUBING,TUBING LEAK
1,Acadia 31-25H,2018-04-11,2018-05-11,TUBING,TUBING LEAK
2,Acadia 31-25H,2019-03-30,2019-04-16,PUMP,"1-1/4"" PUMP"
3,Aerabelle 5502 43-7T,2018-10-10,2018-10-24,PUMP,"1-1/2"" PUMP"
4,Aerabelle 5502 43-7T,2018-10-10,2019-08-15,ROD,"3/4"" ROD SECTION"
...,...,...,...,...,...
782,Yeiser 5603 42-33H,2020-06-13,2020-06-25,TUBING,TUBING LEAK
783,Yeiser 5603 42-33H,2019-06-12,2019-06-29,PUMP,"1-1/2"" PUMP"
784,Zdenek 6093 42-24H,2018-09-18,2018-09-29,ROD,POLISH ROD BREAK
785,Zdenek 6093 42-24H,2020-04-09,2020-05-08,PUMP,"1-3/4"" PUMP"


## Merging Info

In [13]:
"""
Before analysing the data we need to merge the information
Transferring info from failures to data (copy of features)
Using a for loop -- may not be very efficient
"""

def fill_null(df, chk_col='PPRL', well_col='NodeID', time_col='Date'):
    """
    This function will fill in Null Values on those dates where no datapoints are present
    Helps Show failures where no data was present
    Will have to take this into account when running analysis 
    """
    data_temp = df.copy()
    # Set time col as index if it is not
    if time_col in data_temp.columns:
        data_temp.set_index(time_col, inplace=True)
    
    data_gp = data_temp.groupby(well_col).resample('1D').max()  # Groupby wellname and resample to Day freq
    data_gp.drop(columns=[well_col], inplace=True)  # Drop these columns as they are present in the index
    data_gp.reset_index(inplace=True)  # Get Back WellCol from
    data_null = data_gp[data_gp.loc[:, chk_col].isnull()]  # Get all null values, which need to be added to the main data file
    data_null.reset_index(inplace=True, drop=True)
    data_temp.reset_index(inplace=True)  # get timestamp back in the column for concating
    data_full = pd.concat([data_temp, data_null], axis=0, ignore_index=True)  # concat null and og files
    data_full.sort_values(by=[well_col, time_col], inplace=True)
    data_full.drop_duplicates(subset=[well_col, time_col], inplace=True)
    data_full.reset_index(drop=True, inplace=True)
    
    return data_full

def failure_merge(df, failure_df, transfer_cols):
    """
    Merges the failure info into the data
    :param df: dataframe to which info is being transferred (should have columns "NodeID" and "Date")
    :param failure_df: failure info data (should have columns "NodeID", "Last Oil" and "Finish Date")
    :param transfer_cols: columns which need to be transferred
    """
    merged = df.copy()  
    for col in transfer_cols:
        merged[col] = 'Normal'  # for now putting everything as normal (even NAN's)
        
    for i in failure_df.index:
        well = failure_df.loc[i, 'NodeID']
        t_start = failure_df.loc[i, 'Last Oil']
        t_end = failure_df.loc[i, 'Finish Date'] + pd.Timedelta('1 day')  # As we have day based frequency (the times in a day are considered as 00:00:00)
        bool_ = (merged.NodeID == well) & (merged.Date >= t_start) & (merged.Date <= t_end)  # Boolean mask for main data
        merged.loc[bool_, transfer_cols] = failure_df.loc[i, transfer_cols].values
        
    return merged
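
The row-by-row loop in `failure_merge()` could be replaced by a join. The sketch below is one possible vectorized variant, not the method used in this notebook: `failure_merge_vectorized` is a hypothetical name, and it assumes the merge keeps the row order of `failure_df` within each well so that, as in the loop, the last matching failure wins.

```python
import pandas as pd

def failure_merge_vectorized(df, failure_df, transfer_cols):
    """Hypothetical join-based variant of failure_merge()."""
    merged = df.reset_index(drop=True).copy()
    for col in transfer_cols:
        merged[col] = 'Normal'

    # Pair every data row with every failure of the same well
    pairs = merged[['NodeID', 'Date']].reset_index().merge(
        failure_df, on='NodeID', how='inner')

    # Keep pairs whose timestamp lies inside the failure interval
    t_end = pairs['Finish Date'] + pd.Timedelta('1 day')
    hits = pairs[(pairs['Date'] >= pairs['Last Oil']) & (pairs['Date'] <= t_end)]

    # If intervals overlap, keep the last failure (assumed to mirror the loop)
    hits = hits.drop_duplicates(subset='index', keep='last')
    merged.loc[hits['index'], transfer_cols] = hits[transfer_cols].to_numpy()
    return merged

# Toy data, made up for illustration
toy_data = pd.DataFrame({
    'NodeID': ['A', 'A', 'A', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-01-20', '2020-01-05']),
    'PPRL': [1.0, 2.0, 3.0, 4.0],
})
toy_failures = pd.DataFrame({
    'NodeID': ['A'],
    'Last Oil': pd.to_datetime(['2020-01-04']),
    'Finish Date': pd.to_datetime(['2020-01-06']),
    'Job Bucket': ['ROD'],
    'Job Type': ['POLISH ROD BREAK'],
})
toy_merged = failure_merge_vectorized(toy_data, toy_failures, ['Job Bucket', 'Job Type'])
print(toy_merged)
```

The intermediate `pairs` frame holds one row per (data row, failure) combination of a well, so memory use grows with the number of failures per well.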

In [14]:
%%time
data = fill_null(data)  # Filling in NaNs where data was missing

# Transfer 'Job Bucket' and 'Job Type' from failure_info to data
transfer_col = ['Job Bucket', 'Job Type']
data = failure_merge(data, failure_info, transfer_col)

data.head()

Wall time: 3min 54s


Unnamed: 0,Date,NodeID,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Job Bucket,Job Type
0,2019-06-21 15:58:34,Aagvik 1-35H,27639.0,16811.0,3280.0,,Normal,Normal
1,2019-06-21 16:25:36,Aagvik 1-35H,27457.0,16752.0,3241.0,,Normal,Normal
2,2019-06-21 18:25:16,Aagvik 1-35H,27448.0,16594.0,3330.0,,Normal,Normal
3,2019-06-21 18:28:10,Aagvik 1-35H,27424.0,16595.0,3327.0,,Normal,Normal
4,2019-06-21 20:25:01,Aagvik 1-35H,27662.0,16711.0,3341.0,,Normal,Normal


## Feature Engineering

Depending on which features and labels we want to use for our model, we can use the following functions.

`get_agg()`:

- For now, only gives us moving averages
- Can be modified to give other aggregate functions

`create_prediction_zones()`:

- Creates new classes depending on the windows we choose for failures

**Note: Both these functions return separate dataframes/series, which have to be merged accordingly**
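
To make the idea of a prediction zone concrete, here is a toy sketch with made-up dates and a single well. It labels the 3 days before the first failure row as `fz_ROD`; the window length and the data are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Ten daily rows for one well; the last three rows fall inside a ROD failure
dates = pd.date_range('2020-01-01', periods=10, freq='D')
toy = pd.DataFrame({'Date': dates, 'Job Bucket': ['Normal'] * 7 + ['ROD'] * 3})

# The zone ends where the failure starts and reaches back by the chosen window
failure_start = toy.loc[toy['Job Bucket'] != 'Normal', 'Date'].min()
window = pd.Timedelta('3 days')
in_zone = (toy['Date'] >= failure_start - window) & (toy['Date'] < failure_start)

toy['Label'] = toy['Job Bucket']
toy.loc[in_zone, 'Label'] = 'fz_ROD'
print(toy)
```

The rows labelled `fz_ROD` are what the model is asked to recognise. The `ROD` rows themselves are removed before training.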


In [15]:
"""
Helper Functions
"""

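def get_agg(df, freq, time_col='Date', well_col = 'NodeID'):
    """
    Rolling mean of the numeric columns, computed separately for each well
    :param df: DataFrame with a timestamp column and a well column
    :param freq: Rolling window (example: '7D')
    :return: DataFrame with the rolled columns prefixed by freq, plus the well and time columns
    """
    frames = []
    
    for well in df[well_col].unique():
        temp_df = df[df[well_col] == well].copy()
        temp_df.set_index(time_col, inplace=True)
        # Roll only numeric columns; text columns (failure labels) cannot be averaged
        temp_df = temp_df.select_dtypes('number').rolling(freq).mean()
        temp_df = temp_df.add_prefix(freq+'_')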
        temp_df[well_col] = well
        temp_df.reset_index(inplace=True)
        frames.append(temp_df)
        
    rolled_df = pd.concat(frames)
    rolled_df.reset_index(inplace=True, drop=True)
    
    return rolled_df


def create_prediction_zones(df, fail_col, prediction_zone_dict):
    """
    Depending on the prediction_zone_dict will create predictions zones for failures 
    in the Failure column.
    :param df: The dataframe to extract it from
    :param fail_col: Failure column to use from the dataframe
    :param prediction_zone_dict: A dict with timedeltas for each type of Failure in fail_col
    :return Will return a Series or an Array of these Prediction Zones
    """
    
    test_data = df[['NodeID', 'Date', fail_col]].copy()
    fail_zones = test_data[fail_col].copy()  # explicit copy of the fail col, so assignments below do not hit a slice
    
    # Getting start of predictions from fail col
    fail_dates = test_data[test_data[fail_col] != 'Normal']  # everything other than 'Normal' is considered a failure
    fail_start = fail_dates[fail_dates.Date.diff().abs().fillna(pd.Timedelta('10 days')) > pd.Timedelta('1 days 12 hours')]
    fail_start.reset_index(inplace=True, drop=True)
    
    # Adding zones by iterating over each prediction start date
    for i in fail_start.index:
        temp_well = fail_start.loc[i, 'NodeID']  # well name
        zone_end_date = fail_start.loc[i, 'Date']  # prediction start date
        fail = fail_start.loc[i, fail_col]  # actual prediction class
        zone_delta = pd.Timedelta(prediction_zone_dict[fail])  # delta to subtract from the dictionary
        zone_start_date = zone_end_date - zone_delta

        bool_ = (test_data.NodeID == temp_well) & (test_data.Date < zone_end_date) & (test_data.Date >= zone_start_date)
        fail_zones[bool_] = 'fz_' + fail
        
    return fail_zones
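
The per-well loop in `get_agg()` can also be written with a grouped rolling window. The sketch below is an alternative under assumptions: `get_agg_grouped` and its `cols` argument are hypothetical names, and it assumes a pandas version that supports time-based windows on `groupby(...).rolling(...)`.

```python
import pandas as pd

def get_agg_grouped(df, freq, cols, time_col='Date', well_col='NodeID'):
    """Hypothetical variant of get_agg() that rolls all wells in one call."""
    # Each well must be sorted by time for a time-based window
    ordered = df.sort_values([well_col, time_col]).set_index(time_col)
    rolled = ordered.groupby(well_col)[cols].rolling(freq).mean()
    # The result is indexed by (well, time); bring both back as columns
    return rolled.add_prefix(freq + '_').reset_index()

# Toy data, made up for illustration
toy_wells = pd.DataFrame({
    'NodeID': ['A', 'A', 'A', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-20', '2020-01-01']),
    'PPRL': [10.0, 20.0, 30.0, 100.0],
})
toy_rolled = get_agg_grouped(toy_wells, '7D', ['PPRL'])
print(toy_rolled)
```

Passing the feature columns explicitly keeps text columns such as the failure labels out of the average.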

Say we want to use 7-day rolling averages as our features and a fixed window before each failure as our labels (10 to 15 days, depending on the failure type). The next few cells show how this is done.

In [21]:
#  # 7 day rolling averages
# avg_data = get_agg(df=data, freq='7D')  

# # Merge it with the original data 
# # and use only those columns which will be of use
# # While working with large datasets, try optimizing the copies of dataframes you create
# # May not even have to merge it

# full_data = data.set_index(['NodeID', 'Date']).merge(avg_data.set_index(['NodeID', 'Date']), 
#                                                      left_index=True,
#                                                      right_index=True).reset_index()

# # Drop columns that we don't need
# cols_drop = [
#     'PPRL',
#     'MPRL',
#     'PumpIntakePressure',
#     'FluidLoadonPump',
#     'Job Type'
# ]

# full_data.drop(columns=cols_drop, inplace=True)

# Create pred windows
# Note: The output of the function will be a pandas Series
pred_zone_dict = {
    'PUMP': '15 days',
    'ROD': '15 days',
    'TUBING': '15 days',
    'BHA': '10 days'
}

full_data['Label'] = create_prediction_zones(df=full_data, 
                                             fail_col='Job Bucket', 
                                             prediction_zone_dict=pred_zone_dict)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Some assumptions we are going to make:
- Drop NaN values
- Remove the actual failures as classes and only use the windows

In [22]:
full_data.dropna(inplace=True)

class_drop = ['PUMP', 'ROD', 'TUBING', 'BHA']
full_data = full_data[~full_data.Label.isin(class_drop)]

In [30]:
X = full_data.drop(columns=['NodeID', 'Date', 'Job Bucket', 'Label'])
Y = full_data.Label

print("Features")
display(X.head())

print("Classes Being Predicted")
display(Y.value_counts())

Features


Unnamed: 0,7D_PPRL,7D_MPRL,7D_FluidLoadonPump,7D_PumpIntakePressure
128,31488.0,17075.0,9968.0,608.0
129,31488.0,17075.0,9968.0,608.0
130,31613.0,17062.5,9915.0,611.0
131,31615.333333,17012.666667,10011.0,564.666667
132,31594.5,17021.25,9975.25,576.5


Classes Being Predicted


Normal       3080729
fz_PUMP        12220
fz_ROD         11512
fz_TUBING       9485
fz_BHA           340
Name: Label, dtype: int64
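
The class counts above are heavily imbalanced. As a rough sketch, the shares and the weights implied by `class_weight='balanced'` can be computed from those counts. The formula `n_samples / (n_classes * class_count)` is the one scikit-learn documents for the balanced mode; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({
    'Normal': 3080729,
    'fz_PUMP': 12220,
    'fz_ROD': 11512,
    'fz_TUBING': 9485,
    'fz_BHA': 340,
})

# Share of each class in the labelled data
shares = counts / counts.sum()

# Weight per class under the 'balanced' heuristic
balanced_weights = counts.sum() / (len(counts) * counts)

print(shares.round(5))
print(balanced_weights.round(2))
```

The rarer a class, the larger its weight, which is why `fz_BHA` ends up weighted far above `Normal`.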

# Testing Algos

In [31]:
# Imports
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [32]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier model
    :return: RFC Model
    """
    scaler = StandardScaler()

    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
        'max_depth': None,
    }

    rfc = RandomForestClassifier(**rfc_params)

    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [33]:
rfc_model = build_rfc_model()
rfc_model

Pipeline(steps=[('scaler', StandardScaler()),
                ('rfc', RandomForestClassifier(class_weight='balanced'))])

In [34]:
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
print("CV Metrics")
display(cv_rfc)

kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

Weighted Metrics
Precision : 99.78
Recall: 99.78
F-score: 99.77

Macro Metrics
Precision : 98.33
Recall: 82.53
F-score: 89.40

Classification Report
              precision    recall  f1-score   support

      Normal       1.00      1.00      1.00    924219
      fz_BHA       0.99      0.68      0.80       102
     fz_PUMP       0.98      0.79      0.88      3666
      fz_ROD       0.98      0.86      0.92      3454
   fz_TUBING       0.98      0.79      0.88      2845

    accuracy                           1.00    934286
   macro avg       0.98      0.83      0.89    934286
weighted avg       1.00      1.00      1.00    934286





CV Metrics


Unnamed: 0,F-Score_wt,Precision_wt,Recall_wt,F-Score_macro,Precision_macro,Recall_macro
0,98.27,97.86,98.68,19.89,20.05,19.96
1,98.24,98.01,98.59,22.3,32.75,21.33
2,98.29,97.89,98.7,20.45,21.48,20.31
Mean,98.266667,97.92,98.656667,20.88,24.76,20.533333
STD,0.020548,0.064807,0.047842,1.02979,5.679865,0.581167


Kfold Metrics


Unnamed: 0,Precision_wt,Recall_wt,F-score_wt,Precision_macro,Recall_macro,F-score_macro
0,99.809998,99.809998,99.809998,98.479996,85.099998,91.07
1,99.809998,99.809998,99.800003,98.699997,84.720001,90.879997
2,99.800003,99.809998,99.800003,98.220001,84.949997,90.879997
3,99.809998,99.809998,99.809998,98.68,86.529999,92.059998
4,99.800003,99.809998,99.800003,98.489998,85.909996,91.589996
Mean,99.806,99.809998,99.804001,98.513998,85.441998,91.295998
STD,0.004896,0.0,0.004896,0.173389,0.675733,0.462021
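
The gap between the weighted and macro scores above follows from how the two averages treat class size. The sketch below recomputes both recall averages from the rounded per-class values in the classification report, so the results are approximate and differ slightly from the reported 82.53 and 99.78.

```python
# Per-class recall and support, copied (rounded) from the classification report
recall = {'Normal': 1.00, 'fz_BHA': 0.68, 'fz_PUMP': 0.79, 'fz_ROD': 0.86, 'fz_TUBING': 0.79}
support = {'Normal': 924219, 'fz_BHA': 102, 'fz_PUMP': 3666, 'fz_ROD': 3454, 'fz_TUBING': 2845}

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its number of rows
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall    : {macro_recall:.3f}")
print(f"weighted recall : {weighted_recall:.4f}")
```

Because `Normal` makes up almost all rows, the weighted score mostly reflects that one class. The macro score is the more informative number for the failure windows.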


# Older Build

Use for reference

## Importing Labeled Data

Labeled data is stored in the database `oasis-dev` in the table `clean.xspoc`

In [None]:
# Set up the query
failure_wells = ['Cade 12-19HA', 'Cook 12-13 6B', 'Helling Trust 43-22 16T3',
                'Helling Trust 44-22 5B', 'Johnsrud 5198 14-18 13T',
                'Johnsrud 5198 14-18 15TX', 'Rolfson N 5198 12-17 5T',
                'Rolfson N 5198 12-17 7T', 'Rolfson S 5198 11-29 2TX',
                'Rolfson S 5198 11-29 4T', 'Rolfson S 5198 12-29 8T',
                'Rolfson S 5198 14-29 11T', 'Stenehjem 14X-9HA']

query = """
SELECT 
    "NodeID",
    "Date",
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
WHERE
    "NodeID" in {}
ORDER BY
    "NodeID", "Date";
""".format(tuple(failure_wells))

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
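# NOTE: `t` (the cutoff timestamp) is not defined anywhere in this notebook; set it before running this cell
data = data[data.Date < t]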
data.reset_index(inplace=True, drop=True)

In [None]:
data.groupby(['NodeID']).agg({"Date": ["min", "max", "count"]})

In [None]:
# Modifying data for intern project

data.rename(columns={"NodeID": "WellName", "FailureBin":"BinaryLabel", "FailureLabel":"MultiLabel"}, inplace=True)
wells = data.WellName.unique()

In [None]:
new_wells = ["Well " + i for i in list('ABCDEFGHIJKLM')]
well_map = dict(zip(wells, new_wells))

data.WellName = data.WellName.map(well_map)

In [None]:
data.head()

In [None]:
data.MultiLabel.value_counts()

In [None]:
data.set_index("Date").to_csv("sample_data.csv")

In [None]:
"""
Generate Windows
"""

def window_func(df, window):
    """
    Generate MultiLabel windows
    'Normal' = Does not fail
    'Label' = Actual failure or fails within the next `window`
    -1 = Row lies in the final `window` of the data (filtered out later)
    :param df: DataFrame with a single well, the Timestamp col should be the index
    :param window: Window value (example: '3 days')
    """
    window = pd.Timedelta(window)

    df['WinLabel'] = 'Normal'  # Initialize every row as 'Normal'

    mask_ = df.index >= (df.index.max() - window)
    df.loc[mask_, 'WinLabel'] = -1  # Flag the final window so it can be eliminated later

    # Iterate over all the labels
    for code in df.loc[df.FailureBin == 1, 'FailureLabel'].unique():

        # dates where that code occurs
        code_dates = df[df.FailureLabel == code].index

        # iterate over these dates
        for c, t in enumerate(code_dates):
            # rows inside the window that ends just before this failure date
            bool_ = (df.index < t) & (df.index >= (t - window))
            if c > 0:
                # do not reach back past the previous occurrence of the same code
                bool_ &= df.index > code_dates[c - 1]
            df.loc[bool_, 'WinLabel'] = code

        df.loc[df.FailureLabel == code, 'WinLabel'] = code

    return df


"""
Function for Moving AVGs
"""

def get_ma(df, cols, freq):
    """
    Rolling Values
    :param df: DataFrame
    :param cols: Columns which are being Rolled
    :param freq: Rolling Window( example: 7D)
    :return: DataFrame with Rolled Values
    """
    for i in cols:
        col_name_1 = i + '_MA'
        df[col_name_1] = df[i].rolling(freq).mean()
    return df


In [None]:
rol_cols = [
    "cardPPRL",
    "cardMPRL",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy so the in-place edits below do not touch a slice of data
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
"""
Plotting
"""
col = 'NetProd_MA'
well = 'Helling Trust 43-22 16T3'

well_df = train_data[train_data.NodeID == well]
# well_df.loc[well_df.FailureBin == 1, [col, 'WinLabel']] = np.nan  # Nan where Failures are present
fig, ax = plt.subplots(figsize=(25,8))

ax.plot(well_df.Date, well_df[col], label=col)
bool_ = (well_df.WinLabel != 'Normal')

ax.scatter(well_df.loc[bool_, "Date"], well_df.loc[bool_, col], c='r', label='Failure')

ax.set_xlabel("Date")
ax.set_ylabel("KPI")
ax.legend(loc='best')
plt.show()

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

## Algo Test

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
"""
Model 1
Random Forest classifier
"""

def build_rfc_model():
    """
    Define a Random Forest Classifier model
    :return: RFC Model
    """
    scaler = StandardScaler()

    rfc_params = {
        'n_estimators': 100,
        'min_samples_split': 2,
        'min_samples_leaf': 1,
        'class_weight': 'balanced',
        'verbose': 0,
        'max_features': 'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
        'max_depth': None,
    }

    rfc = RandomForestClassifier(**rfc_params)

    model = Pipeline([
        ('scaler', scaler),
        ('rfc', rfc)
    ])

    return model

In [None]:
rfc_model = build_rfc_model()
rfc_model

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

cv_rfc = MultiClassMetrics.cv_validation(X, Y, rfc_model)
print("CV Metrics")
display(cv_rfc)

kf_rfc = MultiClassMetrics.kfold_validation(X, Y, rfc_model)
print("Kfold Metrics")
display(kf_rfc)

In [None]:
"""
Model 2
Gradient Boosted Classifier with Oversampling
"""

def build_gbc_model(y):
    # Building Smote dict
    max_count = int(y.value_counts().iloc[0] / 3)  # one third of the largest class count
    class_list = list(y.value_counts().index)
    class_list.remove('Normal')
    smote_dict = {key: max_count for key in class_list}
    print(smote_dict)

    # Define the model pipeline
    scaler = StandardScaler()
    smote = SMOTE(sampling_strategy=smote_dict, random_state=42)
    baseline_param = {
        'n_estimators': 4,
        'max_depth': 8,
        'learning_rate': 0.1,
        'loss': 'log_loss',  # named 'deviance' before scikit-learn 1.1
        'min_samples_split': 2,
        'verbose': 0
    }

    gbc = GradientBoostingClassifier(**baseline_param)

    model = Pipeline([
        ('scaler', scaler),
        ('smote', smote),
        ('gbc', gbc)
    ])

    return model
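
A small sketch of how the SMOTE target dictionary in `build_gbc_model()` could be built with a guard. The helper name `build_smote_targets` and the toy labels are hypothetical. The guard rests on the assumption that imbalanced-learn's over-samplers reject a target that is lower than a class's current count, which can happen when a minority class already exceeds one third of the majority.

```python
import pandas as pd

def build_smote_targets(y, majority='Normal', divisor=3):
    """Target count per minority class: majority count / divisor, never below the current count."""
    counts = y.value_counts()
    target = int(counts[majority] / divisor)
    return {
        label: max(target, int(n))
        for label, n in counts.items()
        if label != majority
    }

# Toy labels, made up for illustration
y_toy = pd.Series(['Normal'] * 9 + ['PUMP'] * 2 + ['ROD'] * 4)
targets = build_smote_targets(y_toy)
print(targets)
```

Here `PUMP` is raised from 2 to 3 samples, while `ROD` keeps its 4 samples because it is already above the target.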


In [None]:
gbc_model = build_gbc_model(Y)
gbc_model

In [None]:
MultiClassMetrics.baseline_metrics(X, Y, gbc_model)

cv_gbc = MultiClassMetrics.cv_validation(X, Y, gbc_model)
print("CV Metrics")
display(cv_gbc)

kf_gbc = MultiClassMetrics.kfold_validation(X, Y, gbc_model)
print("Kfold Metrics")
display(kf_gbc)

## Making Predictions on the entire dataset

- This was done to show quick results in the dashboard.
- All wells are used in the training set.
- The same values are used for the predictions as well.
- When plotted, the results show how we expect our results to look.
- Take the predictions with a big grain of salt.
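
Since the model here predicts on the same rows it was trained on, one way to get a more honest estimate is to hold out whole wells. The sketch below is an illustration under assumptions, not part of this notebook's pipeline: `split_by_well`, the 30% default, and the toy frame are all hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_well(df, well_col='NodeID', test_fraction=0.3, seed=42):
    """Put every row of a well either in the train set or in the test set."""
    wells = df[well_col].unique().tolist()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(wells)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    test_wells = set(shuffled[:n_test])
    is_test = df[well_col].isin(test_wells)
    return df[~is_test], df[is_test]

# Toy data, made up for illustration
toy_features = pd.DataFrame({
    'NodeID': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    '7D_PPRL': np.arange(8.0),
})
train_df, test_df = split_by_well(toy_features, test_fraction=0.25)
print(sorted(train_df['NodeID'].unique()), sorted(test_df['NodeID'].unique()))
```

With rolling averages as features, neighbouring rows of the same well are very similar, so a row-wise split can leak information that a well-wise split does not.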


### Prediction Table

The results are added to a prediction table in the `oasis-dev` database.

The following columns will be present in the `clean.win_predictons` table:
- NodeID
- Date
- FailureProb
- Prob1 
- Prob2
- Prob3

We will use a basic RFC model. For features, we use the following 7-day moving averages:
- PPRL_MA
- MPRL_MA
- FluidLoadonPump_MA
- PumpIntakePressure_MA

This combination gave the best results in the tests.
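
As a quick illustration of how the probability columns relate to each other, the sketch below builds a small table from a made-up `predict_proba`-style array. The class names and probabilities are placeholders; only the `FailureProb = 100 - Prob Normal` relation is taken from this notebook.

```python
import numpy as np
import pandas as pd

# Placeholder classes and probabilities (each row sums to 1)
classes = np.array(['Normal', 'fz_PUMP', 'fz_ROD'])
y_prob_toy = np.array([
    [0.90, 0.06, 0.04],
    [0.20, 0.70, 0.10],
])

# One percentage column per class
prob_table = pd.DataFrame(
    y_prob_toy * 100,
    columns=['Prob ' + c for c in classes],
).round(3)

# Overall failure probability is everything that is not 'Normal'
prob_table['FailureProb'] = 100 - prob_table['Prob Normal']
prob_table = prob_table.drop(columns='Prob Normal')
print(prob_table)
```

The per-class columns sum to `FailureProb`, so the table can show both the overall risk and the most likely failure type.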

In [None]:
# Import the entire dataset with the columns we need for making predictions and the failure info

query = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "FailureBin",
    "FailureLabel"
FROM
    clean.xspoc
ORDER BY
    "NodeID", "Date"
"""

In [None]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine, parse_dates=['Date'])
    
data.head()

In [None]:
# Generating Features
rol_cols = [
    "PPRL",
    "MPRL",
    "NetProd",
    "FluidLoadonPump",
    "PumpIntakePressure"
]
frames = []

for well in data.NodeID.unique():
    print("Well: {}".format(well))
    
    tempdf = data[data.NodeID == well].copy()  # copy so the in-place edits below do not touch a slice of data
    tempdf.set_index("Date", inplace=True)
    
    tempdf = window_func(tempdf, '3 days')
    tempdf = get_ma(tempdf, rol_cols, '7D')
    tempdf.reset_index(inplace=True)
    frames.append(tempdf)

In [None]:
train_data = pd.concat(frames)  # creating a train df
train_data = train_data[train_data.WinLabel != -1]
train_data.sort_values(by=['NodeID', 'Date'], inplace=True)

print("Null Value Distribution")
display(train_data.isnull().sum(axis=0))

print("Wells")
display(train_data.NodeID.value_counts())

print("Labels")
display(train_data.WinLabel.value_counts())

In [None]:
# # Dropping Failure Data Point
# train_data[train_data.FailureBin == 1]

feature_cols = ['PPRL_MA', 'MPRL_MA', 'NetProd_MA','FluidLoadonPump_MA', 'PumpIntakePressure_MA']
add_cols=feature_cols + ['NodeID', 'Date', 'WinLabel']
final_train = train_data[add_cols].dropna()
final_train.reset_index(drop=True, inplace=True)

# Features
X = final_train[feature_cols]
Y = final_train.WinLabel

print("Feature df")
display(X.head())

print("Labels Being Predicted")
display(Y.value_counts())

In [None]:
# quick test
rfc_model = build_rfc_model()
display(rfc_model)

MultiClassMetrics.baseline_metrics(X, Y, rfc_model)

In [None]:
# Fit the whole df 
rfc_model = build_rfc_model()
rfc_model.fit(X, Y)

In [None]:
"""
Predictions
"""
print("Classes Predicted {}".format(rfc_model.classes_))
y_hat = rfc_model.predict(X)  # Get predictions (same DataFrame the model was fitted on)
y_prob = rfc_model.predict_proba(X)  # Get class probabilities

In [None]:
data_pred = final_train[["NodeID", "Date"]].copy()  # copy so the assignments below do not hit a slice
data_pred['PredClass'] = y_hat

pred_classes = rfc_model.classes_

# One probability column (in percent) per predicted class
for i, pred_class in enumerate(pred_classes):
    col = 'Prob ' + str(pred_class)
    data_pred[col] = y_prob[:, i] * 100
data_pred = data_pred.round(3)
data_pred['FailureProb'] = 100 - data_pred['Prob Normal']
data_pred.drop(columns='Prob Normal', inplace=True)

In [None]:
data_pred.head()

In [None]:
"""
Adding Prob data to DF
"""

# Replace the full bounds df
lib_aws.AddData.add_data(df=data_pred, db='oasis-dev', table='xpred', schema='clean',
                         merge_type='replace', card_col=None, index_col='Date')

# Update index on pred table in database
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX xpred_idx ON clean.xpred ("NodeID", "Date");""")
