# Notebook Info

Clients send us failure information as Excel files. This notebook is used to understand these files and extract the relevant information.

The extracted data should contain the following fields:
- NodeID (Well Name)
- StartDate (Failure Start Timestamp)
- EndDate (Failure End Timestamp)
- Failure (The Type of Failure)

This is stored as a table in our Postgres DB Server.

Database details are as follows (may change in production):
```
database = 'oasis-prod'
schema = 'analysis'
table = 'failure_info'
```
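
A small check along these lines could confirm that a cleaned frame carries the four fields listed above before it is written to the table. The helper name `missing_failure_columns` and the sample row are illustrative assumptions, not part of the local library.

```python
import pandas as pd

# Fields listed above for the extracted failure table.
REQUIRED_COLUMNS = ["NodeID", "StartDate", "EndDate", "Failure"]


def missing_failure_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from `df` (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Toy frame for illustration only; the values are made up.
sample = pd.DataFrame(
    {
        "NodeID": ["Example Well 1H"],
        "StartDate": [pd.Timestamp("2020-01-01")],
        "EndDate": [pd.Timestamp("2020-01-05")],
    }
)
print(missing_failure_columns(sample))
```

An empty list means the frame has every expected field; anything else names what still has to be derived or renamed.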

# Imports

In [1]:
# Add the parent directory to sys.path so the local libraries can be imported
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
import s3fs  # To handle s3 urls

from library import lib_aws, lib_cleaning

# options
pd.set_option('display.max_rows', 10000)

# Failure Files

The failure files are located in an s3 bucket (`s3://et-oasis/failure-excel/*`).

The following are the Files we currently have:
```
1. Enfinite Pilot Wells Failure Summary.xlsx   -- Already Clean  
2. Downtime (2015 - Feb 2020) (ID 24960).xlsx
3. Downtime (Mar-Apr 2020) (ID 46953).xlsx
4. Oasis Complete Failure List 2018-2020.xlsx   -- Latest (2 Sheets)
```

For failures, we currently use only Sheet 2 of `Oasis Complete Failure List 2018-2020.xlsx`,
which is covered in the section Failure File 4 / Sheet 2.

## Failure File 4

### Sheet 1

In [15]:
failure4 = pd.read_excel("s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020.xlsx")

# The following columns do not seem useful
# Add or remove columns as needed
cols_drop = [
    'ADJUSTED WELL COUNT',  
    'ACTUAL ON PUMP DATE',
    'ENERTIA WELL ID',
    'ARTIFICIAL LIFT TECH',
    'LAST OIL MONTH', 
    'LAST OIL YEAR',
    'FAILURE STOP MONTH', 
    'FAILURE STOP YEAR'
]

failure4.drop(columns=cols_drop, inplace=True)
failure4.sort_values(by=['WELL NAME'], inplace=True)
failure4.reset_index(inplace=True, drop=True)
failure4.head()

Unnamed: 0,TYPE,WELL NAME,PROJECT AREA,RESERVOIR GROUP,LAST OIL,LAST OIL REASON,FAILURE START (Rig LOE Start),FAILURE STOP (Rig LOE Finish),WOE TO LAST OIL RUN TIME (DAYS),WOE TO WOE RUN TIME (DAYS),RIG NAME,EVENT OPERATIONS DESCRIPTION,JOB TYPE (EVENT TYPE IN OW),Lift Type During Failure,Chem,Premature,Under 180,Company,Route
0,preventative,A K Stangeland 5300 43-12T,INDIAN HILLS,THREE FORKS,2018-09-12,,2018-09-19,2018-09-22,546.0,553.0,Blackhawk 307,PUMP CHANGE,PUMP CHANGE,,SLB,N,N,,Oasis05
1,FAILURE,A. Johnson 12-1H,WILD BASIN,MIDDLE BAKKEN,2018-09-16,,2018-10-09,2018-10-16,432.0,455.0,MBI 13,TUBING LEAK,TUBING LEAK,,SLB,N,N,,Oasis24
2,FAILURE,AAGVIK 5298 41-35 2TX,WILD BASIN,THREE FORKS,2019-05-29,,2019-06-04,2019-06-25,80.0,86.0,ND ENERGY RIG 15,GAS LIFT,GAS LIFT,GAS LIFT,SLB,Y,Y,,Oasis25
3,FAILURE,ACADIA 31-25H,SOUTH NESSON,THREE FORKS,2020-06-01,Lease Holder,2020-06-29,2020-07-01,412.0,440.0,BLACKHAWK 310,TUBING LEAK,TUBING LEAK,,SLB,N,N,SM,Oasis24
4,FAILURE,ACADIA 31-25H,SOUTH NESSON,THREE FORKS,2018-04-11,,2018-05-05,2018-05-11,266.0,290.0,MBI 12,TUBING LEAK,TUBING LEAK,,AstroChem,N,N,SM,Oasis24


#### Columns with Strings Cleaning

Columns with string values can contain duplicates because of:
* Random case (upper and lower)
* Additional spaces and characters

The following columns are cleaned:
```
WELL NAME -- Modifying this will have to be reflected in all other tables (TODO)
TYPE
EVENT OPERATIONS DESCRIPTION
JOB TYPE (EVENT TYPE IN OW)
```
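
The effect of these rules can be sketched on a toy series: values that differ only by case, `#`, or spacing collapse to a single label. The helper name `normalize_labels` and the sample values are assumptions for illustration and are not taken from the failure files.

```python
import pandas as pd


def normalize_labels(series: pd.Series) -> pd.Series:
    """Upper-case, drop '#', collapse repeated whitespace, and trim."""
    return (
        series.str.replace("#", "", regex=False)
        .str.upper()
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


# Made-up labels that differ only in case and spacing.
raw = pd.Series(["Pump  Change", "PUMP CHANGE ", "pump change", "Tubing Leak"])
clean = normalize_labels(raw)
print(raw.nunique(), "->", clean.nunique())
```

Comparing `nunique()` before and after is a quick way to see how many apparent categories were only formatting variants.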

In [20]:
# Cleaning WELL NAMES
failure4['WELL NAME'] = (failure4['WELL NAME'].str.replace("#", "")  # remove #
#                                              .str.replace(".", "")  # remove .
                                             .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                             .str.strip()  # remove leading and trailing whitespace
                                             .str.lower()  # lower-case all characters
                                             .str.title()  # upper-case the first letter of each word
                                             .map(lambda x: x[0:-2] + x[-2:].upper()))  # upper-case the last two characters

# TYPE column
failure4['TYPE'] = (failure4['TYPE'].map(lambda x: str(x).replace("-", "").lower().title())
                                 .str.replace(r'\s+', ' ', regex=True)
                                 .str.strip())

In [22]:
# Event operations column
failure4['EVENT OPERATIONS DESCRIPTION'] = (failure4['EVENT OPERATIONS DESCRIPTION'].str.replace('-', '')
                                                                    .str.upper()
                                                                    .str.replace("ESP", "ESP ")  # add a space after ESP
                                                                    .str.replace("UPLIFT", "UPLIFT ")  # add a space after UPLIFT
                                                                    .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                                                    .str.strip())

# Values changed manually
manual_change = {
    'BROKEN POLISH ROD': 'POLISH ROD BREAK',
    'DEEP ROD PART': 'ROD PART DEEP',
    'ESP GROUD': 'ESP GROUND',
    'ESP GROUNDED': 'ESP GROUND',
    'PARTED POLISH ROD': 'POLISH ROD PART',
    'POLISH ROD BROKE': 'POLISH ROD BREAK',
    'POLISHED ROD BREAK': 'POLISH ROD BREAK',
    'PUIMP CHANGE': 'PUMP CHANGE',
    'PUJMP FAILURE': 'PUMP FAILURE',
    'PUMP FALIURE': 'PUMP FAILURE',
    'PUMP FILURE': 'PUMP FAILURE',
    'ROD PART SHALLLOW': 'ROD PART SHALLOW',
    'ROD PARTDEEP': 'ROD PART DEEP',
    'ROD PARTSHALLOW': 'ROD PART SHALLOW',
    'SHALLOW ROD PART': 'ROD PART SHALLOW',
}

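# Assign the result back instead of calling replace(..., inplace=True) on a column selection
failure4['EVENT OPERATIONS DESCRIPTION'] = failure4['EVENT OPERATIONS DESCRIPTION'].replace(manual_change)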
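
Candidate entries for a mapping like `manual_change` can be surfaced with the standard library's `difflib`, which finds labels that are close to a known-good spelling. This is only a sketch of one way to spot typos: the canonical list, the cutoff value, and the helper name `suggest_fixes` are assumptions, and every suggestion still needs a manual review.

```python
from difflib import get_close_matches


def suggest_fixes(labels, canonical, cutoff=0.85):
    """Map each non-canonical label to its closest canonical spelling, if any."""
    suggestions = {}
    for label in labels:
        if label in canonical:
            continue  # already a known-good spelling
        match = get_close_matches(label, canonical, n=1, cutoff=cutoff)
        if match:
            suggestions[label] = match[0]
    return suggestions


# Assumed canonical spellings and a few observed labels for illustration.
canonical = ["PUMP FAILURE", "PUMP CHANGE", "ROD PART SHALLOW", "ESP GROUND"]
observed = ["PUMP FALIURE", "PUIMP CHANGE", "ESP GROUD", "TUBING LEAK", "PUMP CHANGE"]
print(suggest_fixes(observed, canonical))
```

Reordered phrases such as `SHALLOW ROD PART` are unlikely to be caught by a character-similarity cutoff, so a hand-written mapping remains necessary for those.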

In [11]:
# Job Type event
failure4['JOB TYPE (EVENT TYPE IN OW)'] = (failure4['JOB TYPE (EVENT TYPE IN OW)'].str.replace('-', '')
                                                                                   .str.upper()
                                                                                   .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                                                                   .str.strip())

manual_change = {
    'POLISHED ROD BREAK': 'POLISH ROD BREAK',
    'RESPACE PUMP': 'PUMP RESPACE'
}

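# Assign the result back instead of calling replace(..., inplace=True) on a column selection
failure4['JOB TYPE (EVENT TYPE IN OW)'] = failure4['JOB TYPE (EVENT TYPE IN OW)'].replace(manual_change)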

### Sheet 2

In [3]:
df_fail = pd.read_excel("s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020.xlsx", sheet_name=1)

cols_to_keep = [
    "Well",
    "Formation",
    "LAST OIL",
    "LOE START DATE",
    "LOE FINISH DATE",
    "Run time (days)",
    "Job Type",
    "Job Bucket",
    "Components",
    "Primary Symptom",
    "Secondary Symptom",
    "Root Cause",
    "Polish Rod Run Time",
    "Pony Sub Run Time",
    "Pump Run Time",
    "Tubing Run Time (Days)"
]

df_fail = df_fail[cols_to_keep].copy()  # copy so later column assignments do not act on a view

In [4]:
# Cleaning WELL NAMES
df_fail['Well'] = (df_fail['Well'].str.replace("#", "")  # remove #
                                 .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                 .str.strip()  # remove leading and trailing whitespace
                                 .str.lower()  # lower-case all characters
                                 .str.title()  # upper-case the first letter of each word
                                 .map(lambda x: x[0:-2] + x[-2:].upper()))  # upper-case the last two characters

In [5]:
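# Cleaning 'Root Cause'
manual_change = {"Fatigue/Acceptable Run Time": "Fatigue, Acceptable Run Time"}

# Assign the cleaned strings back to the column. To inspect the categories, call
# df_fail['Root Cause'].value_counts().sort_index() separately instead of assigning the counts.
df_fail['Root Cause'] = (df_fail['Root Cause'].replace(manual_change)
                                             .str.replace(r'\s+', ' ', regex=True)
                                             .str.strip())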

In [6]:
# Cols rename
cols_rename = {
    "Well": "NodeID",
    "LAST OIL": "Last Oil",
    "LOE START DATE": "Start Date",
    "LOE FINISH DATE": "Finish Date",
    "Run time (days)": 'Run Time',
    "Tubing Run Time (Days)": 'Tubing Run Time'
}

df_fail.rename(columns = cols_rename, inplace=True)

In [7]:
df_fail.head()

Unnamed: 0,NodeID,Formation,Last Oil,Start Date,Finish Date,Run Time,Job Type,Job Bucket,Components,Primary Symptom,Secondary Symptom,Root Cause,Polish Rod Run Time,Pony Sub Run Time,Pump Run Time,Tubing Run Time
0,Ceynar 4X-18H,THREE FORKS,2020-06-08,2020-06-29,2020-06-30,300.0,"1-1/2"" PUMP",PUMP,Pump - Plunger,Abrasion - Foreign Debris,,,300.0,300.0,300.0,300
1,Casey 5200 13-30B,MIDDLE BAKKEN,2020-06-06,2020-06-26,2020-06-29,145.0,"1"" ROD SECTION",ROD,"Rod - 6"" Critical Section",Corrosion,,,144.0,144.0,144.0,144
2,Warren 5892 42-23H,MIDDLE BAKKEN,2020-06-03,2020-06-24,2020-06-29,1032.0,TUBING LEAK,TUBING,Tubing - Body,Corrosion,Abrasion - Foreign Debris,,1031.0,1031.0,1031.0,1031
3,Yeiser 5603 42-33H,MIDDLE BAKKEN,2020-06-13,2020-06-22,2020-06-25,350.0,TUBING LEAK,TUBING,Tubing - Body,Mechanically Induced Damage,Compression,,350.0,350.0,350.0,350
4,Otis 2658 43-23H,MIDDLE BAKKEN,2020-06-13,2020-06-19,2020-06-22,240.0,"1-1/2"" PUMP",PUMP,Pump - Plunger,Solids in Pump,Corrosion,,239.0,239.0,239.0,239


## Failure File 1

This file is already clean and needs only slight processing.

Failure File 4 appears to be the updated version of this file, so it can be skipped for now.
(TODO: check this statement programmatically.)
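
One way to run that check is a left merge with `indicator=True`, which flags rows of the older file that have no counterpart in the newer one. The sketch below uses toy frames and a hypothetical helper `rows_missing_from`; the choice of key columns is an assumption, and the real files may need their well names and dates normalized first (File 1 carries full timestamps, while File 4 shows dates only).

```python
import pandas as pd


def rows_missing_from(newer: pd.DataFrame, older: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return rows of `older` whose key combination does not appear in `newer`."""
    merged = older.merge(
        newer[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", older.columns]


# Toy data for illustration only; these are not rows from the real files.
older = pd.DataFrame(
    {"NodeID": ["Well A 1H", "Well B 2T"], "FailureInfo": ["TUBING LEAK", "PUMP FAILURE"]}
)
newer = pd.DataFrame(
    {"NodeID": ["Well A 1H", "Well C 3B"], "FailureInfo": ["TUBING LEAK", "ROD PART DEEP"]}
)
missing = rows_missing_from(newer, older, keys=["NodeID", "FailureInfo"])
print(missing)
```

If the returned frame is empty, every failure in the older file is also present in the newer one under the chosen keys.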

In [14]:
failure_file1 = pd.read_excel("s3://et-oasis/failure-excel/Enfinite Pilot Wells Failure Summary.xlsx")  # Query it locally

# Rename columns
cols_rename = {
    'WELL NAME': 'NodeID',
    'ACTUAL FAILURE START': 'StartDate',
    'ACTUAL FAILURE STOP': 'EndDate',
    'FAILURE TYPE': 'FailureInfo'
}
failure_file1.rename(columns=cols_rename, inplace=True)  # Rename the columns for ease of use
failure_file1.sort_values(by=['NodeID', 'StartDate'], inplace=True)
failure_file1.reset_index(inplace=True, drop=True)

display(failure_file1.head())

Unnamed: 0,NodeID,StartDate,EndDate,FailureInfo
0,Cade 12-19HA,2019-07-17 16:32:23,2019-07-28 08:06:59,POLISH ROD BREAK
1,Cook 12-13 6B,2019-12-11 07:52:24,2019-12-25 08:17:09,TUBING LEAK
2,Helling Trust 43-22 10T,2019-07-13 14:05:57,2019-07-25 08:51:20,PUMP FAILURE
3,Helling Trust 43-22 16T3,2019-07-19 15:23:52,2019-07-29 10:01:03,TUBING LEAK
4,Helling Trust 44-22 5B,2020-03-19 01:43:54,2020-03-26 23:20:11,POLISH ROD BREAK


## Failure Files 2 & 3

Information from these files has to be extracted and understood before we can save it to the database.

In [5]:
file1 = 's3://et-oasis/failure-excel/Downtime (2015 - Feb 2020) (ID 24960).xlsx'
file2 = 's3://et-oasis/failure-excel/Downtime (Mar-Apr 2020) (ID 46953).xlsx'

failure1 = pd.read_excel(file1)
failure2 = pd.read_excel(file2)

failure_df = pd.concat([failure1, failure2])
failure_df.sort_values(by=['PropertyName', 'effectivedate'], inplace=True)
failure_df.reset_index(inplace=True, drop=True)
failure_df.head()

In [120]:
# Split by start date
start_dt = pd.Timestamp('2019-01-01')
failure_latest = failure_df[failure_df.effectivedate >= start_dt].copy()
failure_latest.reset_index(inplace=True, drop=True)
failure_latest.sort_values(by=['PropertyName', 'effectivedate'], inplace=True)
failure_latest.head()

Unnamed: 0,PropertyBK,PropertyCode,PropertyName,effectivedate,DtProdDayStart,DtReason,DtBeginDateTime,DtEndDateTime,DtTotalHrs,DtRemarks
0,1890389,18066.1,9217 JV-P CAPRITO 2D SWD,2019-04-08 15:06:19,12:00a,SCHED,2019-04-08,2019-04-09,24,
1,1890389,18066.1,9217 JV-P CAPRITO 2D SWD,2019-04-09 12:55:24,12:00a,SCHED,2019-04-09,2019-04-10,24,
2,1890389,18066.1,9217 JV-P CAPRITO 2D SWD,2019-04-10 13:18:34,12:00a,SCHED,2019-04-10,2019-04-11,24,
3,1890389,18066.1,9217 JV-P CAPRITO 2D SWD,2019-04-11 17:08:54,12:00a,SCHED,2019-04-11,2019-04-12,24,
4,1890389,18066.1,9217 JV-P CAPRITO 2D SWD,2019-04-12 08:38:19,12:00a,SCHED,2019-04-12,2019-04-13,24,
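
The rows above repeat the same reason on consecutive days, whereas the target table expects one start and one end timestamp per event. One possible way to collapse such runs is sketched below. Treating back-to-back rows as a single event is an assumption that would need to be confirmed against the data, and the helper name `collapse_daily_rows` and the toy rows are illustrative.

```python
import pandas as pd


def collapse_daily_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Merge rows of one well/reason whose begin time equals the previous end time."""
    keys = ["PropertyName", "DtReason"]
    df = df.sort_values(keys + ["DtBeginDateTime"]).reset_index(drop=True)
    prev_end = df.groupby(keys)["DtEndDateTime"].shift()
    # A new event starts whenever a row does not continue the previous one
    # (the first row of each well/reason group compares against NaT, so it always starts one).
    df["event_id"] = (df["DtBeginDateTime"] != prev_end).cumsum()
    return (
        df.groupby(["event_id"] + keys, as_index=False)
        .agg(StartDate=("DtBeginDateTime", "min"), EndDate=("DtEndDateTime", "max"))
        .drop(columns="event_id")
    )


# Toy rows: three consecutive days, a gap, then one more day.
rows = pd.DataFrame(
    {
        "PropertyName": ["WELL X"] * 4,
        "DtReason": ["SCHED"] * 4,
        "DtBeginDateTime": pd.to_datetime(["2019-04-08", "2019-04-09", "2019-04-10", "2019-04-15"]),
        "DtEndDateTime": pd.to_datetime(["2019-04-09", "2019-04-10", "2019-04-11", "2019-04-16"]),
    }
)
events = collapse_daily_rows(rows)
print(events)
```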


In [121]:
failure_latest.DtReason.value_counts()

SINOE    14700
TEMPS    12114
SCHED    11183
RODPT     9312
SIFRA     7729
POCAL     3211
PUMPI     2547
PPSD      2057
HIGHP     1900
HISEP     1785
ELEC      1584
ESP D     1531
WELLD     1454
HIGHT     1441
WEATH     1157
HITEM      940
ELECR      743
MECHR      719
GENER      691
LEAKD      672
TBGF       636
DRIVE      548
PLC        457
RODHI      429
GLDOW      318
UNK        289
HPBPV      223
GLSUR      189
EMISR      185
SI FO      166
DOWNF      141
GLSUP      129
GASIN      107
ESPHI       82
JPSUR       78
COMP        71
GLHP        61
JPHP        47
JPDOW       39
AJAX-       23
Name: DtReason, dtype: int64

In [122]:
failure_latest.DtRemarks.value_counts()

T/a B.B.            542
T/A SAS             427
Off cycle           305
Parted rods         279
Frack protection

In [65]:
len(failure_df.PropertyBK.unique())

1508

In [56]:
# Reason Specific Value counts
failure_df[failure_df.DtReason == 'DOWNH'].DtRemarks.value_counts()

Patch             2334
Parted rods       2130
Hole in tubing    1342
Parted

In [55]:
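# Remarks sorted by number of counts
# NOTE: `failure_new` is not defined in the cells shown here; it appears to refer to the full downtime frame
remarks_cts = failure_new.DtRemarks.value_counts()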
remarks_cts[remarks_cts >= 100]

Parted rods                                                                           2703
Patch                                                                                 2353
Hole in tubing                                                                        1676
T/A SAS                                                                               1234
Hole in tbg                                                                            961
Parted                                                                                 865
Off cycle                                                                              786
Rig on well                                                                            766
Frac protect                                                                           669
Hard set                                                                               567
T/a B.B.                                                                               564

In [68]:
well='ZUTZ 5693 44-12T'

well_df = failure_new[failure_new.PropertyName == well]
well_df.head()

Unnamed: 0,PropertyBK,PropertyCode,PropertyName,effectivedate,DtProdDayStart,DtReason,DtBeginDateTime,DtEndDateTime,DtTotalHrs,DtRemarks
317246,1068645,15621.1,ZUTZ 5693 44-12T,2015-04-01 07:42:17,12:00a,HISEP,2015-04-01,2015-04-01 14:00:00,14,
317247,1068645,15621.1,ZUTZ 5693 44-12T,2015-04-13 07:44:07,12:00a,HISEP,2015-04-13,2015-04-13 01:00:00,1,
317248,1068645,15621.1,ZUTZ 5693 44-12T,2015-04-14 06:01:44,12:00a,FHIS,2015-04-14,2015-04-14 20:00:00,20,
317249,1068645,15621.1,ZUTZ 5693 44-12T,2015-04-16 06:38:59,12:00a,FHIS,2015-04-16,2015-04-16 20:00:00,20,
317250,1068645,15621.1,ZUTZ 5693 44-12T,2015-12-30 06:13:40,12:00a,DOWNH,2015-12-30,2015-12-31 00:00:00,24,


In [69]:
well_df.DtReason.value_counts()
well_df.DtRemarks.value_counts()

Bad bearing                                                                                                26
Waiting on rig                                                                                              5
Pump failure                                                                                                5
Bearing out                                                                                                 4
Off cycle                                                                                                   2
Belts broke                                                                                                 2
No rpm signal   Can't get to startpollution pot fine vibe switch fine BPV fine.                             1
Bad bearing on unit                                                                                         1
Faulted                                                                                                     1
Down hole 

### Compare with a string

In [111]:
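str_test = 'hole'
str_test2 = 'Down'
# `str_test and str_test2` evaluates to just 'Down', so test each word separately and combine the masks
remarks = failure_new.DtRemarks.fillna("")
bool_ = remarks.str.contains(str_test, regex=False) & remarks.str.contains(str_test2, regex=False)
masked_df = failure_new[bool_]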


In [113]:
masked_df.tail()

Unnamed: 0,PropertyBK,PropertyCode,PropertyName,effectivedate,DtProdDayStart,DtReason,DtBeginDateTime,DtEndDateTime,DtTotalHrs,DtRemarks
317297,1068645,15621.1,ZUTZ 5693 44-12T,2017-07-06 05:31:18,12:00a,WO,2017-07-06,2017-07-07,24,Down hole. Should be pumping tomorrow afternoon
317298,1068645,15621.1,ZUTZ 5693 44-12T,2017-07-07 05:14:57,12:00a,WO,2017-07-07,2017-07-08,24,Down hole
317315,1068645,15621.1,ZUTZ 5693 44-12T,2018-10-09 11:11:44,12:00a,RODPT,2018-10-09,2018-10-10,24,Down hole issues
317317,1068645,15621.1,ZUTZ 5693 44-12T,2018-10-11 12:02:51,12:00a,RODPT,2018-10-11,2018-10-12,24,Down hole problem. Bad pump ?
317318,1068645,15621.1,ZUTZ 5693 44-12T,2018-10-12 12:23:11,12:00a,RODPT,2018-10-12,2018-10-13,24,Down hole problem.
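
Because the remarks are free text with mixed case, a keyword filter that requires several words at once, regardless of case, could be wrapped in a small helper. The name `contains_all` and the sample remarks are assumptions for illustration.

```python
import pandas as pd


def contains_all(series: pd.Series, words, case: bool = False) -> pd.Series:
    """Boolean mask: True where the text contains every word in `words`."""
    text = series.fillna("")  # treat missing remarks as empty strings
    mask = pd.Series(True, index=series.index)
    for word in words:
        # regex=False matches the word literally rather than as a pattern
        mask &= text.str.contains(word, case=case, regex=False)
    return mask


# Made-up remarks in the style of the DtRemarks column.
remarks = pd.Series(["Down hole", "Hole in tubing", "Shut down", None, "down hole problem"])
mask = contains_all(remarks, ["hole", "down"])
print(remarks[mask].tolist())
```

Passing the words as a list keeps the filter easy to extend, for example to `["hole", "tubing"]` or `["hole", "tbg"]`.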


# Merging

Merge all the necessary failures in this section.

*For now we are only using Failure File 4, Sheet 2*

In [11]:
full_failures = df_fail.copy()  # using only sheet 2 from file 4

# Saving 

The clean failure data is saved in our Database. 
```
database = 'oasis-prod'
schema = 'analysis'
table = 'failure_info'
```

We will be using the class `lib_aws.AddData` from our local library. Check its docstring to see how it works.
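
The internals of `lib_aws.AddData` are not shown in this notebook. As a rough analogy only, a full-table replace resembles pandas' own `DataFrame.to_sql` with `if_exists='replace'`. The sketch below uses an in-memory SQLite database as a stand-in for the Postgres server and a made-up toy frame; it is not the library's implementation.

```python
import sqlite3

import pandas as pd

# Toy frame standing in for `full_failures`; the values are made up.
toy = pd.DataFrame({"NodeID": ["Well A 1H", "Well B 2T"], "Job Bucket": ["PUMP", "TUBING"]})

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the Postgres server

# First write creates the table with two rows.
toy.to_sql("failure_info", conn, if_exists="replace", index=False)
# Second write drops and recreates the table, so only one row remains.
toy.iloc[:1].to_sql("failure_info", conn, if_exists="replace", index=False)

stored = pd.read_sql("SELECT * FROM failure_info", conn)
conn.close()
print(len(stored))
```

A replace discards whatever was in the table before, so it only suits cases where the notebook output is the complete, authoritative data set.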

In [12]:
# Adding the data. Need to have write permissions
lib_aws.AddData.add_data(df=full_failures, 
                         db='oasis-prod',
                         schema='analysis',
                         table='failure_info',
                         merge_type='replace', 
                         index_col='NodeID')

Data replaceed on Table failure_info in time 10.68s
