# Notebook Info

Using the data tables available to us, we try to identify the features that matter most for forecasting
failures.

For now, the features are pulled from the `xdiagresults` table and the failure records from the `failure_info` table.

Database Details:
```
database = 'oasis-dev'
table1 = 'xspoc.xdiagresults'  # For features
table2 = 'clean.failure_info'  # For failures
```

# Imports

In [1]:
import os
import sys

# Add the parent directory to sys.path so the local `library` package can be imported
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import warnings

import pandas as pd
from library import lib_aws

pd.set_option('display.max_rows', 500)  # show up to 500 rows when displaying a DataFrame
warnings.filterwarnings('ignore')  # silence all warnings in this notebook

# Data Import

- Features imported from `xspoc.xdiagresults`
- Failures imported from `clean.failure_info`


In [3]:
%%time
# Querying the features and the failures

query_features = """
SELECT 
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM
    xspoc.xdiagresults
ORDER BY "NodeID", "Date";
"""

query_failures = """
SELECT 
    "NodeID",
    "Last Oil",
    "Start Date",
    "Finish Date",
    "Job Type",
    "Job Bucket",
    "Primary Symptom",
    "Secondary Symptom"
FROM
    clean.failure_info
ORDER BY "NodeID";
"""

with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    features = pd.read_sql(query_features, engine, parse_dates=['Date'])
    failures = pd.read_sql(query_failures, engine, parse_dates=['Last Oil', 'Start Date', 'Finish Date'])

Connected to oasis-dev DataBase
Connection Closed
Wall time: 21.4 s


In [4]:
# Cleaning NodeID in features
features['NodeID'] = (features['NodeID'].str.replace("#", "", regex=False)  # remove '#'
                                     .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace into a single space
                                     .str.strip()  # remove leading and trailing whitespace
                                     .str.lower()  # lowercase all characters
                                     .str.title()  # uppercase the first letter of each word
                                     .map(lambda x: x[0:-2] + x[-2:].upper()))  # last 2 characters should always be upper case
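
The cleaning chain above is easier to check on a few strings when it is written as a plain function. The sketch below mirrors the same steps for a single value. `normalize_node_id` is a hypothetical helper name, and the sample inputs are made-up raw spellings of two well names that appear in this notebook.

```python
import re


def normalize_node_id(raw: str) -> str:
    """Normalize one well name with the same steps as the pandas chain."""
    name = raw.replace("#", "")           # remove '#'
    name = re.sub(r"\s+", " ", name)      # collapse repeated whitespace
    name = name.strip().lower().title()   # trim, then capitalize each word
    return name[:-2] + name[-2:].upper()  # last 2 characters always upper case


# Hypothetical raw spellings, for illustration only
samples = ["  bonner  #9-12h ", "AAGVIK 5298 41-35 2tx"]
cleaned = [normalize_node_id(s) for s in samples]
print(cleaned)
```

If the `NodeID` values in `failures` turn out to need the same treatment (the notebook does not show that step), the function could be applied with `failures['NodeID'].map(normalize_node_id)`.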

In [6]:
features.head()

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Bonner 9-12H,2019-07-12 09:43:39,,,,
1,Bonner 9-12H,2019-07-12 12:20:58,33965.0,17623.0,8822.0,880.0
2,Bonner 9-12H,2019-07-12 12:25:28,,,,
3,Bonner 9-12H,2019-07-14 09:06:04,31905.0,18588.0,9895.0,438.0
4,Bonner 9-12H,2019-07-14 10:59:27,31680.0,19194.0,7735.0,1333.0


In [30]:
failures.head()

Unnamed: 0,NodeID,Last Oil,Start Date,Finish Date,Job Type,Job Bucket,Primary Symptom,Secondary Symptom
0,Aagvik 1-35H,2019-11-27,2019-12-02,2019-12-06,TUBING LEAK,TUBING,Mechanically Induced Damage,Solids in Pump
1,Aagvik 5298 41-35 2TX,2019-05-29,2019-06-04,2019-06-25,GAS LIFT,PUMP,Low Production,Blank
2,Acadia 31-25H,2019-03-30,2019-04-10,2019-04-16,"1-1/4"" PUMP",PUMP,Corrosion,Mechanically Induced Damage
3,Acadia 31-25H,2018-04-11,2018-05-05,2018-05-11,TUBING LEAK,TUBING,Corrosion,Sand
4,Acklins 6092 12-18H,2019-12-24,2020-01-02,2020-01-03,POLISH ROD BREAK,ROD,Mechanically Induced Damage,


In [32]:
data.head()

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Job Type,Job Bucket,Primary Symptom,Secondary Symptom
0,Bonner 9-12H,2019-07-12 09:43:39,,,,,Normal,Normal,Normal,Normal
1,Bonner 9-12H,2019-07-12 12:20:58,33965.0,17623.0,8822.0,880.0,Normal,Normal,Normal,Normal
2,Bonner 9-12H,2019-07-12 12:25:28,,,,,Normal,Normal,Normal,Normal
3,Bonner 9-12H,2019-07-14 09:06:04,31905.0,18588.0,9895.0,438.0,Normal,Normal,Normal,Normal
4,Bonner 9-12H,2019-07-14 10:59:27,31680.0,19194.0,7735.0,1333.0,Normal,Normal,Normal,Normal


In [22]:
"""
Merging the data
Transferring info from failures to data (a copy of features)
Uses a for loop, which may not be very efficient
"""
data = features.copy()

# columns we need to transfer
transfer_cols = ['Job Type', 'Job Bucket', 'Primary Symptom', 'Secondary Symptom']

for col in transfer_cols:
    data[col] = 'Normal'

for i in failures.index:
    well = failures.loc[i, 'NodeID']
    t_start = failures.loc[i, 'Start Date']
    t_end = failures.loc[i, 'Finish Date']

    # Boolean mask: rows of this well recorded inside the failure window
    mask = (data.NodeID == well) & (data.Date >= t_start) & (data.Date <= t_end)
    if mask.any():
        print(i)
        print(data[mask])

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,Job Type,Job Bucket,Primary Symptom,Secondary Symptom
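
The loop above only prints the rows that fall inside each failure window. The step that copies the failure labels onto those rows is not shown. Below is a minimal sketch of one way that transfer could be done without a Python loop, assuming the column names used in this notebook. `label_failures` is a hypothetical helper, the toy frames are made up for illustration, and keeping the first matching failure when windows overlap is a choice made here, not something the notebook specifies.

```python
import pandas as pd


def label_failures(features, failures, transfer_cols):
    """Copy failure labels onto readings taken between Start Date and Finish Date."""
    data = features.copy()
    for col in transfer_cols:
        data[col] = "Normal"

    # Pair every reading with every failure of the same well,
    # carrying the reading's row label along in the 'index' column.
    pairs = data[["NodeID", "Date"]].reset_index().merge(
        failures[["NodeID", "Start Date", "Finish Date"] + transfer_cols],
        on="NodeID",
        how="inner",
    )
    in_window = pairs["Date"].between(pairs["Start Date"], pairs["Finish Date"])

    # Keep the first matching failure per reading (assumption for overlapping windows)
    hits = pairs[in_window].drop_duplicates("index").set_index("index")
    if not hits.empty:
        data.loc[hits.index, transfer_cols] = hits[transfer_cols].to_numpy()
    return data


# Toy data, for illustration only
toy_features = pd.DataFrame({
    "NodeID": ["A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-05"]),
})
toy_failures = pd.DataFrame({
    "NodeID": ["A"],
    "Start Date": pd.to_datetime(["2020-01-04"]),
    "Finish Date": pd.to_datetime(["2020-01-06"]),
    "Job Type": ["TUBING LEAK"],
    "Job Bucket": ["TUBING"],
})
labeled = label_failures(toy_features, toy_failures, ["Job Type", "Job Bucket"])
print(labeled)
```

The merge builds one row per (reading, failure) pair for each well, so it trades memory for speed. On a large `features` table it may be worth processing one well at a time.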
