# Notebook Info

Using the available data tables, this notebook tries to identify the features that matter most for
forecasting failures.

For now, the feature data is pulled from the `xdiag` table and the failure records are imported from the `failure_info` table.

Database Details:
```
# Data
database = 'oasis-prod'
schema = 'xspoc'
table = 'xdiag'

# Failure
database = 'oasis-prod'
schema = 'analysis'
table = 'failure_info'  
```

Note: The tables, especially `xdiag`, hold data from around 900 wells, so querying an entire table may take a while. Consider working on a single well or a small group of wells for the analysis.

# Imports

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
from library import lib_aws

pd.set_option('display.max_rows', 500)
import warnings
warnings.filterwarnings('ignore')

# Initial Analysis

A quick check of the timestamps and of how the data is spread out in both tables.

In [4]:
%%time
query_initial = """
SELECT
    "NodeID",
    min("Date") as min_date,
    max("Date") as max_date
FROM xspoc.xdiag
GROUP BY "NodeID"
ORDER BY "NodeID"
"""

# querying the entire failure info
query_failures = """
SELECT 
    "NodeID",
    "Last Oil",
    "Start Date",
    "Finish Date",
    "Job Type",
    "Job Bucket",
    "Primary Symptom",
    "Secondary Symptom"
FROM
    analysis.failure_info
ORDER BY "NodeID";
"""

with lib_aws.PostgresRDS(db='oasis-prod', verbose=1) as engine:
    data_info = pd.read_sql(query_initial, engine, parse_dates=['min_date', 'max_date'])
    failures = pd.read_sql(query_failures, engine, parse_dates=['Last Oil', 'Start Date', 'Finish Date'])

Connected to oasis-prod DataBase
Connection Closed
Wall time: 59.4 s


In [25]:
data_wells = set(data_info.NodeID)
fail_wells = set(failures.NodeID.unique())

print("Wells with Failure:")
display(data_wells & fail_wells) # wells with failure

# print("Wells without Failure (at least in the failure info being used):")
# display(data_wells - fail_wells)

Wells with Failure:


{'Aagvik 1-35H',
 'Acadia 31-25H',
 'Aerabelle 5502 43-7T',
 'Amazing Grace Federal 11-2H',
 'Anderson 7-18H',
 'Andre 5501 13-4H',
 'Andre 5501 14-5 3B',
 'Andre 5601 42-33 2B',
 'Andre Shepherd 5501 14-7 1T',
 'Andre Shepherd 5501 14-7 2T',
 'Andre Shepherd 5501 21-5 3T',
 'Andre Shepherd 5501 21-5 4T',
 'Andre Shepherd 5501 21-5 5T',
 'Andre Shepherd 5501 31-8 7T',
 'Andrea 5502 44-7T',
 'Annie 12X-18HA',
 'Annie 12X-18HB',
 'Annie 5502 43-7B',
 'Anvers Federal 5602 13-18H',
 'Arlyss 5601 14-26T',
 'Arnold 16X-12H',
 'Arnstad 3-10H',
 'Ashley 13X-9H',
 'Aubrey 5304 41-22H',
 'Autumn Wind State 5601 14-16B',
 'B & Rt 2958 13-25H',
 'Baffin 5601 12-18H',
 'Barenthsen 11-20H',
 'Barnes 5892 20-30B',
 'Behan 2-29H',
 'Berkner Federal 5602 43-11H',
 'Berquist 33-28H',
 'Berquist 34-27H',
 'Berwick 4-2HE',
 'Berwick 4-2HN',
 'Betsy Federal 2758 24-29H',
 'Beulah Irene Federal 19-18H',
 'Bobby 5602 43-35H',
 'Bonita 5992 42-22 2B',
 'Bonita 5992 42-22H',
 'Bouvardia Federal 2658 12-12H',
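
With `data_info` and `failures` both loaded, one possible follow-up is to check whether each failure starts inside the date range for which the well has `xdiag` data. The sketch below shows the idea on small made-up frames. The column names mirror the queries above, while `coverage_check` and the toy values are hypothetical and only there for illustration.

```python
import pandas as pd

def coverage_check(data_info, failures):
    """Flag failures whose 'Start Date' lies inside the well's xdiag date range."""
    merged = failures.merge(data_info, on="NodeID", how="left")
    # Wells missing from data_info get NaT bounds, so the comparison is False
    merged["covered"] = (
        (merged["Start Date"] >= merged["min_date"])
        & (merged["Start Date"] <= merged["max_date"])
    )
    return merged

# Toy frames (made-up values, illustration only)
toy_info = pd.DataFrame({
    "NodeID": ["Well A", "Well B"],
    "min_date": pd.to_datetime(["2019-01-01", "2019-06-01"]),
    "max_date": pd.to_datetime(["2019-12-31", "2019-12-31"]),
})
toy_failures = pd.DataFrame({
    "NodeID": ["Well A", "Well B", "Well C"],
    "Start Date": pd.to_datetime(["2019-03-10", "2019-02-01", "2019-05-05"]),
})

checked = coverage_check(toy_info, toy_failures)
print(checked[["NodeID", "Start Date", "covered"]])
```

Failures flagged as not covered have no feature rows around the failure, so they cannot contribute labelled examples.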
 

# Data Import

- Features imported from `xspoc.xdiag`

The following features (columns) are used for the initial analysis:
```
"Date",
"PPRL",
"MPRL",
"FluidLoadonPump",
"PumpIntakePressure"
```


## Well Specific

In [39]:
well_name = 'Aagvik 1-35H'  # choose from wells which have failure

query_well = """
SELECT 
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM
    xspoc.xdiag
WHERE "NodeID" = '{}'
ORDER BY "NodeID", "Date";
""".format(well_name)

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data_well = pd.read_sql(query_well, engine, parse_dates=['Date'])
 
# Just failures for that well
failure_well = failures[failures.NodeID == well_name].reset_index(drop=True)

display(data_well.head())
print("Failure Info")
display(failure_well)

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,


Failure Info


Unnamed: 0,NodeID,Last Oil,Start Date,Finish Date,Job Type,Job Bucket,Primary Symptom,Secondary Symptom
0,Aagvik 1-35H,2019-11-27,2019-12-02,2019-12-06,TUBING LEAK,TUBING,Mechanically Induced Damage,Solids in Pump


## Group of Wells

In [74]:
%%time
well_list = [
    'Andre 5501 14-5 3B',
    'Dixon 5602 44-34H',
    'Forland 28-33H'
]

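# Quote each well name for the IN clause (tuple(well_list) would leave a
# trailing comma, and therefore invalid SQL, when the list holds a single well)
in_list = ", ".join("'{}'".format(w.replace("'", "''")) for w in well_list)

query_list = """
SELECT
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM xspoc.xdiag
WHERE "NodeID" IN ({})
ORDER BY "NodeID", "Date"
""".format(in_list)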

with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data_list = pd.read_sql(query_list, engine, parse_dates=['Date'])

failure_list = failures[failures.NodeID.isin(well_list)].reset_index(drop=True)

display(data_list.head())
print("Failure info for these wells")
display(failure_list)

Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Andre 5501 14-5 3B,2019-05-28 00:32:02,33010.0,15386.0,8997.0,549.0
1,Andre 5501 14-5 3B,2019-05-28 02:11:23,30272.0,17090.0,8643.0,978.0
2,Andre 5501 14-5 3B,2019-05-28 04:20:24,33434.0,15386.0,9606.0,578.0
3,Andre 5501 14-5 3B,2019-05-28 07:07:08,33168.0,16017.0,9745.0,521.0
4,Andre 5501 14-5 3B,2019-05-28 08:49:42,33046.0,15429.0,9069.0,802.0


Failure info for these wells


Unnamed: 0,NodeID,Last Oil,Start Date,Finish Date,Job Type,Job Bucket,Primary Symptom,Secondary Symptom
0,Andre 5501 14-5 3B,2020-03-06,2020-03-10,2020-03-13,"1-3/4"" PUMP",PUMP,Corrosion,Abrasion - Foreign Debris
1,Andre 5501 14-5 3B,2018-05-22,2018-05-24,2018-05-26,"1"" ROD SECTION",ROD,Dropped (X) Amount of Times,Tagging
2,Andre 5501 14-5 3B,2018-04-30,2018-05-09,2018-05-12,POLISH ROD BREAK,ROD,Spray Metal Overload,Blank
3,Dixon 5602 44-34H,2019-09-06,2019-09-19,2019-09-19,POLISH ROD BREAK,ROD,,Unknown
4,Dixon 5602 44-34H,2017-12-27,2018-01-18,2018-01-20,"1-1/2"" PUMP",PUMP,Corrosion,Blank
5,Dixon 5602 44-34H,2019-01-21,2019-02-25,2019-03-01,"1-3/4"" PUMP",PUMP,Sand,Blank
6,Forland 28-33H,2019-07-15,2019-07-19,2019-07-24,"1-3/4"" PUMP",PUMP,Mechanically Induced Damage,Compression
7,Forland 28-33H,2018-01-04,2018-01-06,2018-01-06,2-3/4 PUMP,PUMP,Solids in Pump,Blank
8,Forland 28-33H,2018-06-12,2018-07-01,2018-07-14,2-3/4 PUMP,PUMP,Compression,Solids in Pump
9,Forland 28-33H,2018-12-17,2018-12-21,2018-12-24,TUBING LEAK,TUBING,Unknown,Blank


Wall time: 8.64 s


## Entire Feature Data

Running the next query imports the entire dataset from `xspoc.xdiag`. It has about 3.2 million rows (3,228,303), and the query took about 14 minutes to run.

In [32]:
%%time
# Querying the features
query_full = """
SELECT 
    "NodeID",
    "Date",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure"
FROM
    xspoc.xdiag
ORDER BY "NodeID", "Date";
"""



with lib_aws.PostgresRDS(db='oasis-prod') as engine:
    data_full = pd.read_sql(query_full, engine, parse_dates=['Date'])
    
data_full.head()

Wall time: 14min 9s


Unnamed: 0,NodeID,Date,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure
0,Aagvik 1-35H,2019-06-21 15:58:34,27639.0,16811.0,3280.0,
1,Aagvik 1-35H,2019-06-21 16:25:36,27457.0,16752.0,3241.0,
2,Aagvik 1-35H,2019-06-21 18:25:16,27448.0,16594.0,3330.0,
3,Aagvik 1-35H,2019-06-21 18:28:10,27424.0,16595.0,3327.0,
4,Aagvik 1-35H,2019-06-21 20:25:01,27662.0,16711.0,3341.0,
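
Because the full pull is slow, it may help to read the result in chunks rather than in one call. `pd.read_sql` accepts a `chunksize` argument and then returns an iterator of dataframes. The sketch below demonstrates the pattern against an in-memory SQLite table with made-up rows, because the engine returned by `lib_aws.PostgresRDS` is not shown here. Whether chunking actually lowers memory use or run time against the real database depends on the driver, so treat this as something to try, not a guaranteed speed-up.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for xspoc.xdiag (made-up rows, illustration only)
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "NodeID": ["Well A", "Well A", "Well A", "Well B", "Well B"],
    "Date": [
        "2019-06-21 15:58:34",
        "2019-06-21 16:25:36",
        "2019-06-21 18:25:16",
        "2019-05-28 00:32:02",
        "2019-05-28 02:11:23",
    ],
    "PPRL": [27639.0, 27457.0, 27448.0, 33010.0, 30272.0],
})
toy.to_sql("xdiag", conn, index=False)

query = 'SELECT "NodeID", "Date", "PPRL" FROM xdiag ORDER BY "NodeID", "Date"'

# With chunksize set, read_sql yields dataframes of at most that many rows
chunks = []
for chunk in pd.read_sql(query, conn, parse_dates=["Date"], chunksize=2):
    chunks.append(chunk)  # per-chunk filtering or aggregation could go here

data_chunked = pd.concat(chunks, ignore_index=True)
conn.close()
print(len(chunks), data_chunked.shape)
```

Processing each chunk before appending it (for example dropping rows that are not needed) keeps the final dataframe smaller than a single full read would.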


## Combining

Note: The original failure info can be used as-is. However, for efficiency we keep only the wells that are present in the feature dataframe (`data_well`, `data_list` or `data_full`).
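
A minimal sketch of that restriction, using `isin` on the well names. The toy frames are made up, and `restrict_failures` is a hypothetical helper name that is not defined elsewhere in this notebook.

```python
import pandas as pd

def restrict_failures(failure_df, feature_df):
    """Keep only failure rows whose NodeID appears in the feature dataframe."""
    wells = feature_df["NodeID"].unique()
    return failure_df[failure_df["NodeID"].isin(wells)].reset_index(drop=True)

# Toy frames (made-up values, illustration only)
toy_features = pd.DataFrame({"NodeID": ["Well A", "Well A", "Well B"]})
toy_failures = pd.DataFrame({
    "NodeID": ["Well A", "Well C", "Well B", "Well C"],
    "Job Bucket": ["PUMP", "ROD", "TUBING", "PUMP"],
})

failures_small = restrict_failures(toy_failures, toy_features)
print(failures_small)
```

The smaller failure frame means fewer iterations in any per-failure loop that follows.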

In [73]:
## TODO: Test out this function
"""
Before analysing the data we need to merge the information,
transferring info from failures to data (a copy of the features).
This uses a for loop, so it may not be very efficient.
"""

def failure_merge(df, failure_df, transfer_cols):
    """
    Merges the failure info into the feature data.
    :param df: dataframe the info is transferred to (should have columns "NodeID" and "Date")
    :param failure_df: failure info data (should have columns "NodeID", "Start Date" and "Finish Date")
    :param transfer_cols: list of failure_df columns to transfer
    :return: copy of df with transfer_cols added; rows outside every failure window are 'Normal'
    """
    
    merged = df.copy()
    
    for col in transfer_cols:
        merged[col] = 'Normal'  # for now putting everything as normal (even NAN's)
        
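    for i in failure_df.index:
        well = failure_df.loc[i, 'NodeID']
        t_start = failure_df.loc[i, 'Start Date']
        t_end = failure_df.loc[i, 'Finish Date']

        # Boolean mask: rows of this well inside the failure window (both ends inclusive).
        # If 'Finish Date' carries no time of day, readings later on that day fall outside the window.
        bool_ = (merged.NodeID == well) & (merged.Date >= t_start) & (merged.Date <= t_end)
        # Broadcast this failure's values to every masked row (a later failure overwrites an earlier one)
        merged.loc[bool_, transfer_cols] = failure_df.loc[i, transfer_cols].values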
        
    return merged
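
Since the loop is flagged above as a possible bottleneck, here is one vectorised alternative to consider: join the features to the failures on `NodeID`, keep the pairs whose `Date` falls inside a failure window, and write those labels back. This is a sketch under assumptions, not a tested drop-in replacement. `failure_merge_vectorized` is a hypothetical name, the feature index is assumed to be unique, and when failure windows overlap the first matching failure row wins, whereas the loop lets the last one win. The intermediate join can also become large for wells with many failures.

```python
import pandas as pd

def failure_merge_vectorized(df, failure_df, transfer_cols):
    """Label feature rows inside a failure window; all other rows stay 'Normal'."""
    cols = list(transfer_cols)
    merged = df.copy()
    for col in cols:
        merged[col] = "Normal"

    # Pair every feature row with every failure of the same well
    left = df[["NodeID", "Date"]].copy()
    left["_row"] = df.index  # remember the original row label
    right = failure_df[["NodeID", "Start Date", "Finish Date"] + cols]
    pairs = left.merge(right, on="NodeID", how="inner")

    # Keep pairs where the reading lies inside the window (both ends inclusive)
    in_window = (pairs["Date"] >= pairs["Start Date"]) & (pairs["Date"] <= pairs["Finish Date"])
    inside = pairs[in_window].drop_duplicates("_row")  # first matching failure wins

    if not inside.empty:
        rows = inside["_row"].to_numpy()
        for col in cols:
            merged.loc[rows, col] = inside[col].to_numpy()
    return merged

# Toy frames (made-up values, illustration only)
toy_data = pd.DataFrame({
    "NodeID": ["Well A", "Well A", "Well A", "Well B"],
    "Date": pd.to_datetime(["2019-11-30", "2019-12-03", "2019-12-10", "2019-12-03"]),
})
toy_failures = pd.DataFrame({
    "NodeID": ["Well A"],
    "Start Date": pd.to_datetime(["2019-12-02"]),
    "Finish Date": pd.to_datetime(["2019-12-06"]),
    "Job Bucket": ["TUBING"],
})

labelled = failure_merge_vectorized(toy_data, toy_failures, ["Job Bucket"])
print(labelled)
```

Running both versions on the same small well group and comparing the outputs would be a reasonable way to address the TODO before trusting either on the full dataset.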