# LEGACY CODE

The current analysis uses the following data sources:
- `xspoc.xdiagresults` --> From oasis-dev db
- `xspoc.card` --> From oasis-dev db
- `Enfinite Pilot Well Failure Summary.xlsx`  --> From s3

## Dataset Used

`xspoc.card` and `xspoc.xdiagresults` are joined with a `FULL OUTER JOIN` and stored as a view called `xspoc.merged`.
This view is the dataset used throughout the notebook. To add more columns, talk to the db admin.

This view includes the following columns:

- NodeID
- Date
- XNodeID
- XDate
- SPM
- StrokeLength
- Runtime
- Fillage
- FillBasePct
- cardPPRL
- cardMPRL
- TubingPressure
- CasingPressure
- GrossProd
- PPRL
- MPRL
- FluidLoadonPump
- PumpIntakePressure

Note: The `XNodeID` and `XDate` columns come from `xspoc.xdiagresults`. They help us find data that is present in xdiagresults but missing from card.


### Handling values present in xdiag but not in card (merging missing values)
**LOGIC**
- Break into 2 df's: df1 with a NodeID and df2 with a NaN NodeID
- Drop XNodeID and XDate from df1
- In df2, drop NodeID and Date, then rename XNodeID and XDate to NodeID and Date
- Concat them to get the final clean values
- Missing timestamps can be filled with NaN rows using the `lib_cleaning.fill_null()` method

*NOTE: This can be done with boolean masks as well, however tests were giving some errors*
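
As an alternative to the split-and-concat steps listed above, the same result can probably be reached by coalescing the key columns with `fillna`. This is only a sketch on a made-up two-row frame (the wells `A` and `B` and their values are illustrative, not real data), and it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for xspoc.merged: row 0 exists in both tables, row 1 only in xdiagresults
merged = pd.DataFrame({
    "NodeID": ["A", None],
    "Date": pd.to_datetime(["2020-01-01", None]),
    "XNodeID": ["A", "B"],
    "XDate": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "SPM": [2.0, np.nan],
    "PPRL": [100.0, 200.0],
})

# Take the card key where it exists, otherwise fall back to the xdiag key
merged["NodeID"] = merged["NodeID"].fillna(merged["XNodeID"])
merged["Date"] = merged["Date"].fillna(merged["XDate"])

# The xdiag key columns are now redundant
clean = (merged.drop(columns=["XNodeID", "XDate"])
               .sort_values(["NodeID", "Date"])
               .reset_index(drop=True))
print(clean)
```

This avoids creating two intermediate frames, at the cost of modifying the key columns in place.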

## Failure Data

This Notebook has 2 ways of pulling the failure info.
- Directly from a s3 bucket
- From a local excel file

Both are shown. After cleaning, the `failure_data` df should have the following columns:
- NodeID
- StartDate
- EndDate
- FailureInfo

## Combining

- If resampling needs to be done, do it before combining data with failure_data
- Combined data is stored in the schema `clean` and table `xspoc`

**Note: This notebook uses the lib_aws.AddData class to save the data. Only admin has write privileges.**



In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
from library import lib_aws, lib_cleaning
import s3fs  # To handle s3 urls

pd.set_option('display.max_rows', 500)
import warnings
warnings.filterwarnings('ignore')

### Merged Data

In [5]:
%%time
"""
Query the merged data from xspoc.merged
"""
query = """SELECT * FROM xspoc.merged ORDER BY "NodeID", "Date"; """

with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    merged_data = pd.read_sql(query, engine, parse_dates=['Date'])
    
merged_data.head()

Connected to oasis-dev DataBase
Connection Closed
Wall time: 33.3 s


Unnamed: 0,NodeID,Date,XNodeID,XDate,SPM,StrokeLength,Runtime,Fillage,FillBasePct,cardPPRL,...,GrossProd,NetProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,MinEnergyKWH,MinTorqueKWH,MinEnergyGBLoadPct,MinTorqueGBLoadPct
0,Bonner 9-12H,2019-01-22 13:30:13,,NaT,2.0,306.0,0.0,98.1,45.0,33660.0,...,,,,,,,,,,
1,Bonner 9-12H,2019-03-19 00:51:53,,NaT,2.0,306.0,24.0,99.0,45.0,34112.0,...,,,,,,,,,,
2,Bonner 9-12H,2019-03-19 02:41:29,,NaT,2.0,306.0,24.0,91.0,45.0,34276.0,...,,,,,,,,,,
3,Bonner 9-12H,2019-03-19 04:57:34,,NaT,2.0,306.0,24.0,99.8,45.0,34016.0,...,,,,,,,,,,
4,Bonner 9-12H,2019-03-19 06:41:44,,NaT,1.2,306.0,24.0,100.0,45.0,33299.0,...,,,,,,,,,,


In [6]:
print("Before Handling Nulls and non Matching data: Size is {}".format(merged_data.shape[0]))
display(merged_data.isnull().sum(axis=0))

print("\nAround 1217 data points are missing in card data, but present in xdiag")
print("Happens in the following wells:")
print(merged_data[merged_data.NodeID.isnull()]['XNodeID'].unique())

Before Handling Nulls and non Matching data: Size is 128186


NodeID                 1217
Date                   1217
XNodeID               22933
XDate                 22933
SPM                    1217
StrokeLength           1217
Runtime                1217
Fillage                1217
FillBasePct            1217
cardPPRL               1217
cardMPRL               1217
FillagePct            22933
TubingPressure        22933
CasingPressure        22933
GrossProd             22933
NetProd               27879
PPRL                  36730
MPRL                  36730
FluidLoadonPump       36730
PumpIntakePressure    41676
MinEnergyKWH          41950
MinTorqueKWH          36730
MinEnergyGBLoadPct    28153
MinTorqueGBLoadPct    22933
dtype: int64


Around 1217 data points are missing in card data, but present in xdiag
Happens in the following wells:
['Spratley 5494 14-13 15T' 'Stenehjem 14X-9HA']


In [7]:
# Merging Missing values
# Rows present in card (NodeID is set): the xdiag key columns are redundant
df1 = merged_data[merged_data.NodeID.notnull()].drop(columns=['XNodeID', 'XDate'])
display(df1.head())

# Rows present only in xdiagresults: promote XNodeID/XDate to NodeID/Date
df2 = merged_data[merged_data.NodeID.isnull()].drop(columns=['NodeID', 'Date'])
df2 = df2.rename(columns={'XNodeID': 'NodeID', 'XDate': 'Date'})
display(df2.head())

data = pd.concat([df1, df2], axis=0)
data.sort_values(by=['NodeID', 'Date'], inplace=True)
display(data.head())

del df1
del df2


Unnamed: 0,NodeID,Date,SPM,StrokeLength,Runtime,Fillage,FillBasePct,cardPPRL,cardMPRL,FillagePct,...,GrossProd,NetProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,MinEnergyKWH,MinTorqueKWH,MinEnergyGBLoadPct,MinTorqueGBLoadPct
0,Bonner 9-12H,2019-01-22 13:30:13,2.0,306.0,0.0,98.1,45.0,33660.0,18996.0,,...,,,,,,,,,,
1,Bonner 9-12H,2019-03-19 00:51:53,2.0,306.0,24.0,99.0,45.0,34112.0,18774.0,,...,,,,,,,,,,
2,Bonner 9-12H,2019-03-19 02:41:29,2.0,306.0,24.0,91.0,45.0,34276.0,19042.0,,...,,,,,,,,,,
3,Bonner 9-12H,2019-03-19 04:57:34,2.0,306.0,24.0,99.8,45.0,34016.0,18788.0,,...,,,,,,,,,,
4,Bonner 9-12H,2019-03-19 06:41:44,1.2,306.0,24.0,100.0,45.0,33299.0,19589.0,,...,,,,,,,,,,


Unnamed: 0,NodeID,Date,SPM,StrokeLength,Runtime,Fillage,FillBasePct,cardPPRL,cardMPRL,FillagePct,...,GrossProd,NetProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,MinEnergyKWH,MinTorqueKWH,MinEnergyGBLoadPct,MinTorqueGBLoadPct
126969,Spratley 5494 14-13 15T,2019-11-28 13:29:55,,,,,,,,0.0,...,0.0,0.0,,,,,,,0.0,0.0
126970,Spratley 5494 14-13 15T,2019-12-21 08:00:04,,,,,,,,100.0,...,88.0,431.0,34008.0,20157.0,6747.0,1739.0,54.0,1254.0,73.0,36.0
126971,Spratley 5494 14-13 15T,2020-02-06 03:07:57,,,,,,,,100.0,...,38.0,446.0,34147.0,19524.0,6587.0,1789.0,75.0,1287.0,77.0,38.0
126972,Spratley 5494 14-13 15T,2019-12-20 19:58:36,,,,,,,,77.0,...,88.0,356.0,32986.0,20152.0,7384.0,1536.0,33.0,1107.0,68.0,34.0
126973,Spratley 5494 14-13 15T,2019-09-23 18:00:26,,,,,,,,0.0,...,0.0,0.0,,,,,,,0.0,0.0


Unnamed: 0,NodeID,Date,SPM,StrokeLength,Runtime,Fillage,FillBasePct,cardPPRL,cardMPRL,FillagePct,...,GrossProd,NetProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,MinEnergyKWH,MinTorqueKWH,MinEnergyGBLoadPct,MinTorqueGBLoadPct
0,Bonner 9-12H,2019-01-22 13:30:13,2.0,306.0,0.0,98.1,45.0,33660.0,18996.0,,...,,,,,,,,,,
1,Bonner 9-12H,2019-03-19 00:51:53,2.0,306.0,24.0,99.0,45.0,34112.0,18774.0,,...,,,,,,,,,,
2,Bonner 9-12H,2019-03-19 02:41:29,2.0,306.0,24.0,91.0,45.0,34276.0,19042.0,,...,,,,,,,,,,
3,Bonner 9-12H,2019-03-19 04:57:34,2.0,306.0,24.0,99.8,45.0,34016.0,18788.0,,...,,,,,,,,,,
4,Bonner 9-12H,2019-03-19 06:41:44,1.2,306.0,24.0,100.0,45.0,33299.0,19589.0,,...,,,,,,,,,,


In [8]:
"""
Filling up Null Values where data is missing for a specific freq
Use the lib_cleaning.fill_null() function with a 1D freq
"""

print("Before Filling with nulls: Size is {}".format(data.shape[0]))
display(data.isnull().sum(axis=0))

data = lib_cleaning.fill_null(data, freq='1D', test_col='SPM')

print("After Filling with nulls: Size is {}".format(data.shape[0]))
display(data.isnull().sum(axis=0))

Before Filling with nulls: Size is 128186


NodeID                    0
Date                      0
SPM                    1217
StrokeLength           1217
Runtime                1217
Fillage                1217
FillBasePct            1217
cardPPRL               1217
cardMPRL               1217
FillagePct            22933
TubingPressure        22933
CasingPressure        22933
GrossProd             22933
NetProd               27879
PPRL                  36730
MPRL                  36730
FluidLoadonPump       36730
PumpIntakePressure    41676
MinEnergyKWH          41950
MinTorqueKWH          36730
MinEnergyGBLoadPct    28153
MinTorqueGBLoadPct    22933
dtype: int64

After Filling with nulls: Size is 177746


Date                      0
NodeID                    0
SPM                   50777
StrokeLength          50777
Runtime               50777
Fillage               50777
FillBasePct           50777
cardPPRL              50777
cardMPRL              50777
FillagePct            72305
TubingPressure        72305
CasingPressure        72305
GrossProd             72305
NetProd               77254
PPRL                  86191
MPRL                  86191
FluidLoadonPump       86191
PumpIntakePressure    91140
MinEnergyKWH          91418
MinTorqueKWH          86191
MinEnergyGBLoadPct    77531
MinTorqueGBLoadPct    72305
dtype: int64
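
The source of `lib_cleaning.fill_null()` is not shown in this notebook. The sketch below is only a guess at the idea: for each well, add an all-NaN row for every day that has no reading. The function name `fill_missing_days` and the toy frame are hypothetical.

```python
import pandas as pd

def fill_missing_days(df, freq="1D"):
    """Add a NaN row for every period in which a well has no reading (hypothetical helper)."""
    parts = []
    for node, group in df.groupby("NodeID"):
        # Expected periods between the first and last reading of this well
        expected = pd.date_range(group["Date"].min().normalize(),
                                 group["Date"].max().normalize(), freq=freq)
        # Periods that already contain at least one reading
        present = pd.DatetimeIndex(group["Date"].dt.floor(freq).unique())
        missing = expected.difference(present)
        parts.append(group)
        if len(missing) > 0:
            # Only the keys are set; every other column becomes NaN on concat
            parts.append(pd.DataFrame({"NodeID": node, "Date": missing}))
    return (pd.concat(parts, ignore_index=True)
              .sort_values(["NodeID", "Date"])
              .reset_index(drop=True))

# Toy data: well A has readings on Jan 1 and Jan 4, so Jan 2 and Jan 3 are missing
toy = pd.DataFrame({
    "NodeID": ["A", "A"],
    "Date": pd.to_datetime(["2020-01-01 10:00", "2020-01-04 08:00"]),
    "SPM": [2.0, 3.0],
})
filled = fill_missing_days(toy)
print(filled)
```

Inserted rows carry a midnight timestamp, which matches the shape of the `data.head()` output later in the notebook, but the real library function may differ in details such as the role of `test_col`.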

## Failure Data

### Failure Info Locally

In [9]:
file_loc = r'Enfinite Pilot Wells Failure Summary V3.xlsx'

# Basic Cleaning
failure_data = pd.read_excel(file_loc)
cols_map = {                                     # Just to match the other files, not needed
    'WELL NAME': 'NodeID',
    'ACTUAL FAILURE START': 'StartDate',
    'ACTUAL FAILURE STOP': 'EndDate',
    'FAILURE TYPE': 'FailureInfo'
}
failure_data.rename(columns=cols_map, inplace=True)

print("Without any Cleaning")
display(failure_data)

Without any Cleaning


Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Cade 12-19HA,2019-07-17 16:32:23,2019-07-28 08:06:59,POLISH ROD BREAK
1,Cook 12-13 6B,2019-12-11 07:52:24,2019-12-25 08:17:09,TUBING LEAK
2,Helling Trust 43-22 10T,2019-07-13 14:05:57,2018-07-25 08:51:20,PUMP FAILURE
3,Helling Trust 43-22 16T3,2019-07-19 15:23:52,2019-07-29 10:01:03,TUBING LEAK
4,Helling Trust 44-22 5B,2020-03-19 01:43:54,2020-03-26 23:20:11,POLISH ROD BREAK
5,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
6,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK
7,Johnsrud 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK
8,Johnsrud 5198 14-18 15TX,2019-07-09 11:03:40,2019-08-19 13:00:04,TUBING LEAK
9,Johnsrud 5198 14-18 15TX,2019-06-10 14:50:03,2019-07-01 08:59:47,TUBING LEAK
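
In the table above, row 2 (Helling Trust 43-22 10T) has an `EndDate` in 2018 that precedes its 2019 `StartDate`. A quick sanity check for such inverted windows might look like the sketch below, which uses two rows copied from that table.

```python
import pandas as pd

# Two rows copied from the table above; the second has EndDate before StartDate
failure_data = pd.DataFrame({
    "NodeID": ["Cade 12-19HA", "Helling Trust 43-22 10T"],
    "StartDate": pd.to_datetime(["2019-07-17 16:32:23", "2019-07-13 14:05:57"]),
    "EndDate": pd.to_datetime(["2019-07-28 08:06:59", "2018-07-25 08:51:20"]),
    "FailureInfo": ["POLISH ROD BREAK", "PUMP FAILURE"],
})

# A window that ends before it starts can never match a reading
inverted = failure_data[failure_data["EndDate"] < failure_data["StartDate"]]
print(inverted[["NodeID", "StartDate", "EndDate"]])
```

An inverted window matches no rows under a `Date >= StartDate` and `Date <= EndDate` mask, which is consistent with this well not appearing in the list of wells with failures further down. The dates are worth confirming with the data owner before combining.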


In [10]:
# Cleaning the NodeID column
failure_data['NodeID'] = failure_data.NodeID.str.lower()  # Convert all to lower case
# Build a dict from the original data that maps each lower-case name to its correctly cased name
well_dict = dict(zip(data.NodeID.str.lower().unique(), data.NodeID.unique()))
failure_data["NodeID"] = failure_data.NodeID.map(well_dict)  # wells with no match become NaN

failure_data = failure_data.dropna(subset=['NodeID'])  # Drop all wells which didn't match
failure_data.reset_index(inplace=True, drop=True)
failure_data.dropna(inplace=True)  # Drop rows with any remaining NaN values

In [11]:
display(failure_data.sort_values(by='NodeID'))
print("Failures Being Considered:")
display(failure_data.FailureInfo.value_counts())


Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Cade 12-19HA,2019-07-17 16:32:23,2019-07-28 08:06:59,POLISH ROD BREAK
1,Cook 12-13 6B,2019-12-11 07:52:24,2019-12-25 08:17:09,TUBING LEAK
2,Helling Trust 43-22 10T,2019-07-13 14:05:57,2018-07-25 08:51:20,PUMP FAILURE
3,Helling Trust 43-22 16T3,2019-07-19 15:23:52,2019-07-29 10:01:03,TUBING LEAK
4,Helling Trust 44-22 5B,2020-03-19 01:43:54,2020-03-26 23:20:11,POLISH ROD BREAK
5,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
6,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK
8,Johnsrud 5198 14-18 15TX,2019-07-09 11:03:40,2019-08-19 13:00:04,TUBING LEAK
9,Johnsrud 5198 14-18 15TX,2019-06-10 14:50:03,2019-07-01 08:59:47,TUBING LEAK
7,Johnsrud 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK


Failures Being Considered:


TUBING LEAK         14
POLISH ROD BREAK     3
PUMP FAILURE         2
Name: FailureInfo, dtype: int64

### Failure Info from s3 Bucket

Use this if the failure file has not been provided locally.
Access to the s3 bucket is required.

In [81]:
# failure_data = pd.read_excel("s3://et-oasis/failure-excel/Enfinite Pilot Wells Failure Summary.xlsx")  # Read directly from the s3 bucket

# # Use only these columns
# columns_use = [
#     'WELL NAME',
#     'FAILURE START (Rig LOE Start)',
#     'FAILURE STOP (Rig LOE Finish)',
#     'EVENT OPERATIONS DESCRIPTION'
# ]
# failure_data = failure_data[columns_use]  # use only the columns we need

# # Rename columns
# cols_rename = {
#     'WELL NAME': 'NodeID',
#     'FAILURE START (Rig LOE Start)': 'StartDate',
#     'FAILURE STOP (Rig LOE Finish)': 'EndDate',
#     'EVENT OPERATIONS DESCRIPTION': 'FailureInfo'
# }
# failure_data.rename(columns=cols_rename, inplace=True)  # Rename the columns for ease of use


# display(failure_data.head())


In [82]:
# # Cleaning Failure Columns
# failure_data.loc[:, 'FailureInfo'] = failure_data.FailureInfo.str.upper()  # convert all failure to Upper case

# # Mapping specific failures
# failure_map = {
#     'ROD PART - DEEP': 'DEEP ROD PART',
#     'ROD PART DEEP': 'DEEP ROD PART',
#     'SHALLOW ROD PART': 'ROD PART SHALLOW',
#     'HOLE IN TUBING': 'TUBING LEAK'
# }
# failure_data['FailureInfo'] = failure_data.FailureInfo.map(failure_map).fillna(failure_data['FailureInfo'])  # map the values in the dict, keep the rest unchanged

# failure_data.head()

In [84]:
# # Cleaning the NodeID column
# failure_data['NodeID'] = failure_data.NodeID.str.lower()  # Convert all to lower case
# # Build a dict from the original data that maps each lower-case name to its correctly cased name
# well_dict = dict(zip(data.NodeID.str.lower().unique(), data.NodeID.unique()))
# failure_data["NodeID"] = failure_data.NodeID.map(well_dict)  # wells with no match become NaN

# failure_data = failure_data.dropna(subset=['NodeID'])  # Drop all wells which didn't match
# failure_data.reset_index(inplace=True, drop=True)
# failure_data.dropna(inplace=True)  # Drop rows with any remaining NaN values

In [85]:
# display(failure_data.head())
# display(failure_data.FailureInfo.value_counts())

## Combining

- Failure Info from failure_data is combined with data.
- 2 Columns will be added to `data`
    - FailureBin: Binary Info of Failure (0 - Normal, 1 - Failure)
    - FailureLabel: Failure Labels

In [None]:
"""
Resampling Codes
"""
freq = '1D'  # Select the frequency we need to resample the data by
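
The cell above only sets `freq`; no resampling code is included in this notebook. A minimal sketch of per-well resampling, assuming that averaging the numeric columns is acceptable, could look like this. The toy frame and the choice of `mean` are assumptions.

```python
import pandas as pd

freq = "1D"

# Toy data: well A has two readings on Jan 1 and one on Jan 2, well B has one reading
data = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01 02:00", "2020-01-01 14:00",
                            "2020-01-02 09:00", "2020-01-01 05:00"]),
    "SPM": [2.0, 4.0, 3.0, 1.0],
})

# Group by well and by calendar bin, then average the readings in each bin
resampled = (data.groupby(["NodeID", pd.Grouper(key="Date", freq=freq)])["SPM"]
                 .mean()
                 .reset_index())
print(resampled)
```

Grouping this way only returns bins that contain data, so gaps are not filled; filling them is a separate step.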

In [12]:
print("Failure Data")
display(failure_data)

print("Main Data Start and End Dates")
display(data.groupby('NodeID').agg({'Date': ['min', 'max']}))

Failure Data


Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Cade 12-19HA,2019-07-17 16:32:23,2019-07-28 08:06:59,POLISH ROD BREAK
1,Cook 12-13 6B,2019-12-11 07:52:24,2019-12-25 08:17:09,TUBING LEAK
2,Helling Trust 43-22 10T,2019-07-13 14:05:57,2018-07-25 08:51:20,PUMP FAILURE
3,Helling Trust 43-22 16T3,2019-07-19 15:23:52,2019-07-29 10:01:03,TUBING LEAK
4,Helling Trust 44-22 5B,2020-03-19 01:43:54,2020-03-26 23:20:11,POLISH ROD BREAK
5,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
6,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK
7,Johnsrud 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK
8,Johnsrud 5198 14-18 15TX,2019-07-09 11:03:40,2019-08-19 13:00:04,TUBING LEAK
9,Johnsrud 5198 14-18 15TX,2019-06-10 14:50:03,2019-07-01 08:59:47,TUBING LEAK


Main Data Start and End Dates


NodeID,Date min,Date max
Bonner 9-12H,2019-01-22 13:30:13,2020-05-25 19:49:35
Bonner 9X-12HA,2019-03-19 01:30:10,2020-05-26 06:53:43
Bonner 9X-12HB,2019-03-19 00:33:33,2020-04-10 04:31:11
Cade 12-19HA,2019-03-18 23:30:17,2020-04-10 05:25:25
Cade 12-19HB,2019-03-18 23:18:25,2020-04-10 07:39:23
Cade 12X-19H,2019-03-19 00:22:13,2020-04-10 08:39:04
Cook 12-13 6B,2019-03-19 00:12:48,2020-05-26 07:43:30
Cook 12-13 7T,2019-04-22 07:39:11,2020-05-26 07:44:25
Cook 12-13 9T,2019-04-04 15:24:42,2020-05-26 07:48:50
Cook 41-12 11T,2019-10-15 16:18:57,2020-05-26 07:26:41


In [13]:
%%time
# Transfer the failure labels with a for loop -- not very efficient

data.loc[:, 'FailureBin'] = 0
data.loc[:, 'FailureLabel'] = 'Normal'

for i in failure_data.index:
    well = failure_data.loc[i, 'NodeID']  # well name
    t_start = failure_data.loc[i, 'StartDate']  # start date
    t_end = failure_data.loc[i, 'EndDate']  # end date
    failure = failure_data.loc[i, 'FailureInfo']  # failure type
    
    bool_ = (data.NodeID == well) & (data.Date >= t_start) & (data.Date <= t_end)  # Boolean mask for main data
    data.loc[bool_, 'FailureLabel'] = failure  # attach failure for that specific boolean mask
    data.loc[bool_, 'FailureBin'] = 1  # add 1 in binary failure columns


Wall time: 361 ms
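
The loop above is fast enough here, but if the failure list grows, a merge-based version is one possible alternative. The sketch below runs on made-up data and assumes `data` has a unique index; it is not the approach used in this notebook.

```python
import pandas as pd

# Toy readings and a single failure window for well A
data = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-10", "2020-01-05"]),
})
failure_data = pd.DataFrame({
    "NodeID": ["A"],
    "StartDate": pd.to_datetime(["2020-01-04"]),
    "EndDate": pd.to_datetime(["2020-01-06"]),
    "FailureInfo": ["TUBING LEAK"],
})

# Pair each reading with every failure window of the same well
pairs = data.reset_index().merge(failure_data, on="NodeID", how="inner")
in_window = pairs[(pairs["Date"] >= pairs["StartDate"]) & (pairs["Date"] <= pairs["EndDate"])]

# If windows overlap, keep one match per reading (the last one found)
labels = in_window.drop_duplicates("index", keep="last").set_index("index")["FailureInfo"]

data["FailureLabel"] = labels.reindex(data.index).fillna("Normal")
data["FailureBin"] = data["FailureLabel"].ne("Normal").astype(int)
print(data)
```

The merge creates one row per reading and failure window of the same well, so memory use grows with the number of windows per well.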


In [14]:
print("Total Data Set Binary Failure Distribution")
display(data.FailureBin.value_counts())

print("Total Data Set Failure Label Distribution")
display(data.FailureLabel.value_counts())

print("Wells which have failure")
display(data[data.FailureBin == 1].NodeID.unique())

Total Data Set Binary Failure Distribution


0    177214
1       532
Name: FailureBin, dtype: int64

Total Data Set Failure Label Distribution


Normal              177214
TUBING LEAK            461
POLISH ROD BREAK        38
PUMP FAILURE            33
Name: FailureLabel, dtype: int64

Wells which have failure


array(['Cade 12-19HA', 'Cook 12-13 6B', 'Helling Trust 43-22 16T3',
       'Helling Trust 44-22 5B', 'Johnsrud 5198 14-18 13T',
       'Johnsrud 5198 14-18 15TX', 'Rolfson N 5198 12-17 5T',
       'Rolfson N 5198 12-17 7T', 'Rolfson S 5198 11-29 2TX',
       'Rolfson S 5198 11-29 4T', 'Rolfson S 5198 12-29 8T',
       'Rolfson S 5198 14-29 11T', 'Stenehjem 14X-9HA'], dtype=object)

In [15]:
data.head()

Unnamed: 0,Date,NodeID,SPM,StrokeLength,Runtime,Fillage,FillBasePct,cardPPRL,cardMPRL,FillagePct,...,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,MinEnergyKWH,MinTorqueKWH,MinEnergyGBLoadPct,MinTorqueGBLoadPct,FailureBin,FailureLabel
0,2019-01-22 13:30:13,Bonner 9-12H,2.0,306.0,0.0,98.1,45.0,33660.0,18996.0,,...,,,,,,,,,0,Normal
1,2019-01-23 00:00:00,Bonner 9-12H,,,,,,,,,...,,,,,,,,,0,Normal
2,2019-01-24 00:00:00,Bonner 9-12H,,,,,,,,,...,,,,,,,,,0,Normal
3,2019-01-25 00:00:00,Bonner 9-12H,,,,,,,,,...,,,,,,,,,0,Normal
4,2019-01-26 00:00:00,Bonner 9-12H,,,,,,,,,...,,,,,,,,,0,Normal


### Saving Data

Data saved in the database

**Location**

Database = 'oasis-dev'

Schema = 'clean'

Table = 'xspoc'

In [17]:
# Adding the data. Need to have write permissions
lib_aws.AddData.add_data(df=data, db='oasis-dev', table='xspoc', schema='clean',
                         merge_type='replace', card_col=None, index_col='Date')

# Create the unique index on the clean.xspoc table in the database
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    with engine.begin() as connection:
        connection.execute("""CREATE UNIQUE INDEX xspoc_idx ON clean.xspoc ("NodeID", "Date");""")
        print("Index Updated")

Connected to oasis-dev DataBase
Connection Closed
Data replaceed on Table xspoc in time 31.76s
Connected to oasis-dev DataBase
Index Updated
Connection Closed
