# Combining Data

The current analysis uses the following data sources:
- `xspoc.xdiagresults` --> from the oasis-dev db
- `xspoc.card` --> from the oasis-dev db
- `Enfinite Pilot Wells Failure Summary.xlsx` --> from s3


### Step 1
Combine `xspoc.xdiagresults` and `xspoc.card`, using specific columns from each table.
The data is queried directly with a `LEFT JOIN` that has `xspoc.xdiagresults` as the left table, matched on `NodeID` and `Date`, so every row of `xspoc.xdiagresults` is kept even when `xspoc.card` has no matching row.

Columns used from xspoc.xdiagresults
- FillagePct
- TubingPressure
- CasingPressure
- GrossProd
- PPRL
- MPRL
- FluidLoadonPump
- PumpIntakePressure


Columns used from xspoc.card
- SPM
- StrokeLength
- Runtime
- Fillage
- FillBasePct
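
The join behaviour described above can be illustrated in pandas. This is a minimal sketch on made-up toy frames, not the notebook's actual query: `xdiag` and `card` are hypothetical stand-ins for the two tables, and only one column from each is included.

```python
import pandas as pd

# Toy stand-ins for xspoc.xdiagresults and xspoc.card (values are made up).
xdiag = pd.DataFrame({
    "NodeID": ["A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
    "PPRL": [30000.0, 31000.0, 29000.0],
})
card = pd.DataFrame({
    "NodeID": ["A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "SPM": [2.0, 1.5],
})

# how="left" keeps every xdiag row; card columns are NaN where no match exists.
merged = xdiag.merge(card, on=["NodeID", "Date"], how="left")
print(merged)
```

The row for well `A` on 2020-01-02 has no card reading, so it survives the join with `SPM` set to NaN. The same effect explains why the card columns contain nulls in the queried data.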


### Step 2
Combine the failure information from `Enfinite Pilot Wells Failure Summary.xlsx`. This file can be imported locally, but in this notebook it is imported from an s3 bucket.



In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
from library import lib_aws, lib_cleaning
import s3fs  # To handle s3 urls

pd.set_option('display.max_rows', 500)

# Step 1

In [4]:
"""
Query that joins xdiagresults and card on NodeID and Date, selecting specific columns
"""

query = """
SELECT
    xdiagresults."NodeID",
    xdiagresults."Date",
    "FillagePct",
    "TubingPressure",
    "CasingPressure",
    "GrossProd",
    "PPRL",
    "MPRL",
    "FluidLoadonPump",
    "PumpIntakePressure",
    "SPM",  
    "StrokeLength",
    "Runtime",
    "Fillage",
    "FillBasePct"
FROM 
    xspoc.xdiagresults
LEFT JOIN xspoc.card
    ON xdiagresults."NodeID" = card."NodeID"
    AND xdiagresults."Date" = card."Date" 
ORDER BY 
	xdiagresults."NodeID", xdiagresults."Date"
"""

In [5]:
%%time
with lib_aws.PostgresRDS(db='oasis-dev') as engine:
    data = pd.read_sql(query, engine)
    
display(data.head())

Connected to oasis-dev DataBase
Connection Closed


Unnamed: 0,NodeID,Date,FillagePct,TubingPressure,CasingPressure,GrossProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,SPM,StrokeLength,Runtime,Fillage,FillBasePct
0,Bonner 9-12H,2019-07-12 09:43:39,0,0,0,0,,,,,1.5,306.0,0.0,0.0,45.0
1,Bonner 9-12H,2019-07-12 12:20:58,100,25,85,0,33965.0,17623.0,8822.0,880.0,2.0,306.0,19.2,99.5,45.0
2,Bonner 9-12H,2019-07-12 12:25:28,0,0,0,0,,,,,2.0,306.0,19.2,99.4,45.0
3,Bonner 9-12H,2019-07-14 09:06:04,100,29,90,48,31905.0,18588.0,9895.0,438.0,2.0,306.0,24.0,95.7,45.0
4,Bonner 9-12H,2019-07-14 10:59:27,86,26,84,48,31680.0,19194.0,7735.0,1333.0,2.0,306.0,24.0,87.8,45.0


Wall time: 33.1 s


In [7]:
"""
Add null-valued rows where data is missing at a given frequency
Uses the lib_cleaning.fill_null() function with a 1D (daily) freq
"""

print("Before Filling with nulls: Size is {}".format(data.shape[0]))
display(data.isnull().sum(axis=0))

data = lib_cleaning.fill_null(data, freq='1D', test_col='GrossProd')

print("After Filling with nulls: Size is {}".format(data.shape[0]))
display(data.isnull().sum(axis=0))

Before Filling with nulls: Size is 113139


Date                      0
NodeID                    0
FillagePct             7886
TubingPressure         7886
CasingPressure         7886
GrossProd              7886
PPRL                  21683
MPRL                  21683
FluidLoadonPump       21683
PumpIntakePressure    26629
SPM                    9103
StrokeLength           9103
Runtime                9103
Fillage                9103
FillBasePct            9103
dtype: int64

After Filling with nulls: Size is 121025


Date                      0
NodeID                    0
FillagePct            15772
TubingPressure        15772
CasingPressure        15772
GrossProd             15772
PPRL                  29569
MPRL                  29569
FluidLoadonPump       29569
PumpIntakePressure    34515
SPM                   16989
StrokeLength          16989
Runtime               16989
Fillage               16989
FillBasePct           16989
dtype: int64

In [8]:
data.head(10)

Unnamed: 0,Date,NodeID,FillagePct,TubingPressure,CasingPressure,GrossProd,PPRL,MPRL,FluidLoadonPump,PumpIntakePressure,SPM,StrokeLength,Runtime,Fillage,FillBasePct
0,2019-07-12 09:43:39,Bonner 9-12H,0.0,0.0,0.0,0.0,,,,,1.5,306.0,0.0,0.0,45.0
1,2019-07-12 12:20:58,Bonner 9-12H,100.0,25.0,85.0,0.0,33965.0,17623.0,8822.0,880.0,2.0,306.0,19.2,99.5,45.0
2,2019-07-12 12:25:28,Bonner 9-12H,0.0,0.0,0.0,0.0,,,,,2.0,306.0,19.2,99.4,45.0
3,2019-07-13 00:00:00,Bonner 9-12H,,,,,,,,,,,,,
4,2019-07-13 00:00:00,Bonner 9-12H,,,,,,,,,,,,,
5,2019-07-14 09:06:04,Bonner 9-12H,100.0,29.0,90.0,48.0,31905.0,18588.0,9895.0,438.0,2.0,306.0,24.0,95.7,45.0
6,2019-07-14 10:59:27,Bonner 9-12H,86.0,26.0,84.0,48.0,31680.0,19194.0,7735.0,1333.0,2.0,306.0,24.0,87.8,45.0
7,2019-07-14 13:10:03,Bonner 9-12H,81.0,19.0,82.0,48.0,31808.0,18385.0,8476.0,1018.0,2.0,306.0,24.0,81.7,45.0
8,2019-07-14 15:02:59,Bonner 9-12H,81.0,34.0,92.0,48.0,31522.0,18499.0,8047.0,1212.0,2.0,306.0,24.0,81.8,45.0
9,2019-07-14 17:14:02,Bonner 9-12H,95.0,35.0,93.0,48.0,31996.0,18441.0,9922.0,433.0,2.0,306.0,24.0,97.0,45.0
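
The internals of `lib_cleaning.fill_null` are not shown in this notebook. The sketch below is one plausible way to get a similar effect, namely an all-NaN placeholder row for each day on which a well has no readings. The helper name `add_missing_days` and the toy frame are hypothetical, and the real function may behave differently (for example, the output above contains two placeholder rows for 2019-07-13, whereas this sketch adds one).

```python
import pandas as pd

def add_missing_days(df, freq="1D"):
    """Hypothetical helper: add an all-NaN row for each day a well has no readings."""
    new_rows = []
    for well, grp in df.groupby("NodeID"):
        # Calendar days on which this well reported at least once.
        days_seen = pd.DatetimeIndex(grp["Date"].dt.normalize().unique())
        all_days = pd.date_range(days_seen.min(), days_seen.max(), freq=freq)
        missing = all_days.difference(days_seen)
        if len(missing):
            new_rows.append(pd.DataFrame({"NodeID": well, "Date": missing}))
    out = pd.concat([df] + new_rows, ignore_index=True)
    return out.sort_values(["NodeID", "Date"]).reset_index(drop=True)

# Toy data: well "A" reports on 12 and 14 July, but not on 13 July.
toy = pd.DataFrame({
    "NodeID": ["A", "A"],
    "Date": pd.to_datetime(["2019-07-12 09:43:39", "2019-07-14 09:06:04"]),
    "GrossProd": [0.0, 48.0],
})
filled = add_missing_days(toy)
print(filled)
```

Columns that are absent from the placeholder rows (here `GrossProd`) are filled with NaN by `pd.concat`, which is why the null counts rise after filling.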


# Step 2

- Import the failure data .xlsx from the s3 bucket (it can also be imported locally)
- Clean the data
- Combine it with the data queried in the previous section

*Note: You need at least read access to the s3 bucket the file is read from.*

### Failure Info Locally

In [27]:
file_loc = r'Enfinite Pilot Wells Failure Summary V2.xlsx'

# Basic Cleaning
failure_data = pd.read_excel(file_loc)
cols_map = {                                     # Just to match the other files, not needed
    'WELL NAME': 'NodeID',
    'ACTUAL FAILURE START': 'StartDate',
    'ACTUAL FAILURE STOP': 'EndDate',
    'FAILURE TYPE': 'FailureInfo'
}
failure_data.rename(columns=cols_map, inplace=True)

print("Without any Cleaning")
display(failure_data)

Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,HELLING TRUST 5494 44-22 5B,2020-03-19 01:43:54,2020-03-26 23:20:11,POLISH ROD BREAK
1,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
2,JOHNSRUD 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK
3,ROLFSON S 5198 12-29 8T,2020-01-20 19:21:09,2020-02-06 08:46:59,TUBING LEAK
4,ROLFSON S 5198 14-29 11T,2020-01-07 12:50:43,2020-02-01 10:00:36,PUMP FAILURE
5,COOK 5300 12-13 6B,2019-12-11 07:52:24,2019-12-25 08:17:09,TUBING LEAK
6,ROLFSON S 5198 11-29 4T,2019-11-24 19:45:19,2019-12-09 11:28:55,TUBING LEAK
7,ROLFSON S 5198 11-29 2TX,2019-10-31 15:49:45,2019-11-14 14:46:34,TUBING LEAK
8,ROLFSON N 5198 12-17 7T,2019-10-20 03:34:30,2019-11-18 16:26:33,TUBING LEAK
9,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK


In [28]:
# Clean the NodeID column
failure_data['NodeID'] = failure_data.NodeID.str.lower()  # Convert all names to lower case
well_dict = dict(zip(data.NodeID.str.lower().unique(), data.NodeID.unique()))  # Map each lower-case name to its spelling in the main data
failure_data["NodeID"] = failure_data.NodeID.map(well_dict)  # Wells with no match in the main data become NaN

failure_data = failure_data.dropna(subset=['NodeID'])  # Drop all wells that did not match
failure_data.reset_index(inplace=True, drop=True) 
failure_data.dropna(inplace=True)  # Drop rows with any remaining NaN values

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
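
This warning is commonly raised when a DataFrame derived from another one by filtering is then modified in place. A minimal sketch of the same cleaning steps written as one chain without `inplace=True` follows. The frames `main_wells` and `failures` are made-up stand-ins for the notebook's `data.NodeID` and `failure_data`, and this is only one way to restructure the cell.

```python
import pandas as pd

# Toy stand-ins: well names as stored in the main data vs. the failure sheet.
main_wells = pd.Series(["Bonner 9-12H", "Cook 12-13 6B"])
failures = pd.DataFrame({
    "NodeID": ["BONNER 9-12H", "UNKNOWN WELL 1", "cook 12-13 6b"],
    "FailureInfo": ["ROD PART", "TUBING LEAK", "TUBING LEAK"],
})

# Lower-case name -> canonical spelling used in the main data.
well_dict = dict(zip(main_wells.str.lower(), main_wells))

cleaned = (
    failures
    .assign(NodeID=failures["NodeID"].str.lower().map(well_dict))
    .dropna()                # drops unmatched wells and any other incomplete rows
    .reset_index(drop=True)
)
print(cleaned)
```

Each step returns a new DataFrame, so nothing is written into a possible view of another frame.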
  


In [31]:
failure_data.sort_values(by='NodeID')

Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
7,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK
1,Johnsrud 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK
6,Rolfson N 5198 12-17 7T,2019-10-20 03:34:30,2019-11-18 16:26:33,TUBING LEAK
5,Rolfson S 5198 11-29 2TX,2019-10-31 15:49:45,2019-11-14 14:46:34,TUBING LEAK
4,Rolfson S 5198 11-29 4T,2019-11-24 19:45:19,2019-12-09 11:28:55,TUBING LEAK
2,Rolfson S 5198 12-29 8T,2020-01-20 19:21:09,2020-02-06 08:46:59,TUBING LEAK
8,Rolfson S 5198 12-29 8T,2019-09-04 08:44:49,2019-09-22 08:48:08,TUBING LEAK
3,Rolfson S 5198 14-29 11T,2020-01-07 12:50:43,2020-02-01 10:00:36,PUMP FAILURE


### Failure Info from s3 Bucket

In [124]:
failure_data = pd.read_excel("s3://et-oasis/failure-excel/Enfinite Pilot Wells Failure Summary.xlsx")  # Read directly from the s3 bucket

# Use only these columns
columns_use = [
    'WELL NAME',
    'FAILURE START (Rig LOE Start)',
    'FAILURE STOP (Rig LOE Finish)',
    'EVENT OPERATIONS DESCRIPTION'
]
failure_data = failure_data[columns_use]  # Keep only the columns we need

# Rename columns
cols_rename = {
    'WELL NAME': 'NodeID',
    'FAILURE START (Rig LOE Start)': 'StartDate',
    'FAILURE STOP (Rig LOE Finish)': 'EndDate',
    'EVENT OPERATIONS DESCRIPTION': 'FailureInfo'
}
failure_data.rename(columns=cols_rename, inplace=True)  # Rename the columns for ease of use


display(failure_data.head())


Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,BERRY 5493 44-7 14BX,2019-12-02,2019-12-06,ESP BROKEN SHAFT
1,BERRY 5493 44-7 15TX,2019-08-05,2019-08-16,ESP BROKEN SHAFT
2,BERRY 5493 44-7 15TX,NaT,NaT,ESP - BROKEN SHAFT
3,BONNER 9-12H,2015-12-30,2015-12-30,rod part
4,BONNER 9X-12HA,2016-10-13,2016-10-13,tubing failure


In [125]:
# Cleaning Failure Columns
failure_data.loc[:, 'FailureInfo'] = failure_data.FailureInfo.str.upper()  # convert all failure to Upper case

# Mapping specific failures
failure_map = {
    'ROD PART - DEEP': 'DEEP ROD PART',
    'ROD PART DEEP': 'DEEP ROD PART',
    'SHALLOW ROD PART': 'ROD PART SHALLOW',
    'HOLE IN TUBING': 'TUBING LEAK'
}
failure_data['FailureInfo'] = failure_data.FailureInfo.map(failure_map).fillna(failure_data['FailureInfo'])  # Apply the mapping; labels not in the dict are kept as they are

failure_data.head()

Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,BERRY 5493 44-7 14BX,2019-12-02,2019-12-06,ESP BROKEN SHAFT
1,BERRY 5493 44-7 15TX,2019-08-05,2019-08-16,ESP BROKEN SHAFT
2,BERRY 5493 44-7 15TX,NaT,NaT,ESP - BROKEN SHAFT
3,BONNER 9-12H,2015-12-30,2015-12-30,ROD PART
4,BONNER 9X-12HA,2016-10-13,2016-10-13,TUBING FAILURE


In [126]:
# Clean the NodeID column
failure_data['NodeID'] = failure_data.NodeID.str.lower()  # Convert all names to lower case
well_dict = dict(zip(data.NodeID.str.lower().unique(), data.NodeID.unique()))  # Map each lower-case name to its spelling in the main data
failure_data["NodeID"] = failure_data.NodeID.map(well_dict)  # Wells with no match in the main data become NaN

failure_data = failure_data.dropna(subset=['NodeID'])  # Drop all wells that did not match
failure_data.reset_index(inplace=True, drop=True) 
failure_data.dropna(inplace=True)  # Drop rows with any remaining NaN values

In [127]:
display(failure_data.head())
display(failure_data.FailureInfo.value_counts())

Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Bonner 9-12H,2015-12-30,2015-12-30,ROD PART
1,Bonner 9X-12HA,2016-10-13,2016-10-13,TUBING FAILURE
2,Bonner 9X-12HA,2017-02-07,2017-02-08,ROD PART SHALLOW
3,Bonner 9X-12HA,2017-05-18,2017-05-21,STUCK PUMP
4,Bonner 9X-12HA,2019-08-26,2019-09-04,FRAC UNPROTECT


TUBING LEAK                   25
PUMP FAILURE                  13
ESP-GROUND                    11
GAS LIFT TO RODS               9
TUBING FAILURE                 7
ESP GROUND                     5
PUMP CHANGE                    5
ROD PART SHALLOW               5
ESP-BROKEN SHAFT               4
UPLIFT                         4
DEEP ROD PART                  2
POLISH ROD BREAK               2
ESP-TUBING LEAK                2
STUCK PUMP                     1
ESP- PUMP FAILURE              1
ESP -GROUND                    1
ESP MOTOR                      1
FRAC UNPROTECT                 1
ESP-OVERHEATING                1
ROD PART                       1
ESP-NO PRODUCTION              1
GAS LIFT TO ESP CONVERSION     1
ESP-GROUNDS                    1
Name: FailureInfo, dtype: int64
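
The counts above still contain several spellings of what look like the same failure (for example `ESP-GROUND`, `ESP GROUND` and `ESP -GROUND`). A minimal sketch of one way to merge such variants follows, assuming hyphens and repeated spaces carry no meaning in these labels. The `labels` series is a made-up sample, not the real column.

```python
import pandas as pd

# Made-up sample of label variants like those in the counts above.
labels = pd.Series([
    "ESP-GROUND", "ESP GROUND", "ESP -GROUND", "ESP- PUMP FAILURE", "TUBING LEAK",
])

normalized = (
    labels.str.upper()
    .str.replace(r"\s*-\s*", " ", regex=True)   # treat hyphens as spaces
    .str.replace(r"\s+", " ", regex=True)       # collapse repeated spaces
    .str.strip()
)
print(normalized.value_counts())
```

Variants that differ by more than punctuation, such as `ESP-GROUNDS`, would still need an explicit entry in a mapping dict like `failure_map`.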

# Step 3
Transferring the failures to the main dataset

Use `failure_data` loaded either from the s3 bucket or from the local file.

In [35]:
print("Failure Data")
display(failure_data)

print("Main Data Start and End Dates")
display(data.groupby('NodeID').agg({'Date': ['min', 'max']}))

Failure Data


Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Johnsrud 5198 14-18 13T,2020-02-13 02:09:07,2020-03-04 09:12:36,TUBING LEAK
1,Johnsrud 5198 14-18 15TX,2020-02-12 07:10:06,2020-02-27 10:16:50,TUBING LEAK
2,Rolfson S 5198 12-29 8T,2020-01-20 19:21:09,2020-02-06 08:46:59,TUBING LEAK
3,Rolfson S 5198 14-29 11T,2020-01-07 12:50:43,2020-02-01 10:00:36,PUMP FAILURE
4,Rolfson S 5198 11-29 4T,2019-11-24 19:45:19,2019-12-09 11:28:55,TUBING LEAK
5,Rolfson S 5198 11-29 2TX,2019-10-31 15:49:45,2019-11-14 14:46:34,TUBING LEAK
6,Rolfson N 5198 12-17 7T,2019-10-20 03:34:30,2019-11-18 16:26:33,TUBING LEAK
7,Johnsrud 5198 14-18 13T,2019-09-17 09:35:17,2019-10-03 09:41:31,TUBING LEAK
8,Rolfson S 5198 12-29 8T,2019-09-04 08:44:49,2019-09-22 08:48:08,TUBING LEAK


Main Data Start and End Dates


NodeID,Date (min),Date (max)
Bonner 9-12H,2019-07-12 09:43:39,2020-05-25 19:49:35
Bonner 9X-12HA,2019-09-09 09:06:55,2020-05-26 06:53:43
Bonner 9X-12HB,2019-07-09 12:55:53,2020-04-10 04:31:11
Cade 12-19HA,2019-05-30 10:49:02,2020-04-10 05:25:25
Cade 12-19HB,2019-05-30 12:38:50,2020-04-10 07:39:23
Cade 12X-19H,2019-05-27 23:25:32,2020-04-10 08:39:04
Cook 12-13 6B,2019-05-27 23:58:47,2020-05-26 07:43:30
Cook 12-13 7T,2019-05-28 02:07:32,2020-05-26 07:44:25
Cook 12-13 9T,2019-05-28 00:23:59,2020-05-26 07:48:50
Cook 41-12 11T,2019-10-15 16:18:57,2020-05-26 07:26:41


In [36]:
%%time
# Transfer the failures with a for loop -- not very efficient

data.loc[:, 'FailureInfo'] = 'Normal'

for i in failure_data.index:
    well = failure_data.loc[i, 'NodeID']  # well name
    t_start = failure_data.loc[i, 'StartDate']  # failure start date
    t_end = failure_data.loc[i, 'EndDate']  # failure end date
    failure = failure_data.loc[i, 'FailureInfo']  # failure type
    
    bool_ = (data.NodeID == well) & (data.Date >= t_start) & (data.Date <= t_end)  # Boolean mask for main data
    data.loc[bool_, 'FailureInfo'] = failure  # attach failure for that specific boolean mask

Wall time: 88.8 ms
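
The loop is fast here because there are only a few failure rows. For a longer failure list, a merge-based version is one possible alternative. The sketch below uses made-up toy frames (`readings`, `failures`) in place of `data` and `failure_data`, and assumes the failure windows of a given well do not overlap.

```python
import pandas as pd

# Toy stand-ins for the main data and the cleaned failure table.
readings = pd.DataFrame({
    "NodeID": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-20", "2020-01-05"]),
})
failures = pd.DataFrame({
    "NodeID": ["A"],
    "StartDate": pd.to_datetime(["2020-01-03"]),
    "EndDate": pd.to_datetime(["2020-01-10"]),
    "FailureInfo": ["TUBING LEAK"],
})

# Pair every reading with every failure window of the same well,
# keeping the original row label so results can be written back.
pairs = readings.reset_index().merge(failures, on="NodeID", how="inner")
in_window = pairs[pairs["Date"].between(pairs["StartDate"], pairs["EndDate"])]

readings["FailureInfo"] = "Normal"
readings.loc[in_window["index"], "FailureInfo"] = in_window["FailureInfo"].to_numpy()
print(readings)
```

`between` is inclusive on both ends by default, matching the `>=` and `<=` comparisons in the loop. The merge materializes one row per reading and failure window of the same well, so it trades memory for speed.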


In [40]:
print("Total Data Set failure Distribution")
display(data.FailureInfo.value_counts())

print("Wells which have failure")
display(data[data.FailureInfo != 'Normal'].NodeID.value_counts())

Total Data Set failure Distribution


Normal          120594
TUBING LEAK        377
PUMP FAILURE        54
Name: FailureInfo, dtype: int64

Wells which have failure


Johnsrud 5198 14-18 13T     117
Rolfson N 5198 12-17 7T      87
Rolfson S 5198 12-29 8T      72
Rolfson S 5198 14-29 11T     54
Rolfson S 5198 11-29 4T      39
Johnsrud 5198 14-18 15TX     35
Rolfson S 5198 11-29 2TX     27
Name: NodeID, dtype: int64