# Notebook Info

Clients send us failure information as Excel files. This notebook is used to understand these files and extract the relevant information.

The extracted data should contain the following fields:
- NodeID (Well Name)
- StartDate (Failure Start Timestamp)
- EndDate (Failure End Timestamp)
- Failure (The Type of Failure)

This is stored as a table on our Postgres DB server.

The database details are as follows (they may change in production):
```
database = 'oasis-prod'
schema = 'analysis'
table = 'failure_info'
```
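As a small sanity check before saving, the target layout listed above can be verified against a frame's column names. This is only a sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the exact column spellings in the final table are assumed to match the list above.

```python
# Field names taken from the list above (spelling in the real table is an assumption).
REQUIRED_COLUMNS = ("NodeID", "StartDate", "EndDate", "Failure")


def missing_columns(columns):
    """Return the required fields that are absent from `columns`, in list order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example with plain lists; with pandas one would pass `df.columns`.
incomplete = missing_columns(["NodeID", "StartDate"])
complete = missing_columns(["NodeID", "StartDate", "EndDate", "Failure", "Formation"])
```

An empty result means every required field is present; extra columns are ignored.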

# Imports

In [1]:
# Add the project root to sys.path so that local libraries can be imported
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
import s3fs  # To handle s3 urls

from library import lib_aws, lib_cleaning

# options
pd.set_option('display.max_rows', 10000)

# Failure Files

The failure files are located in an s3 bucket (`s3://et-oasis/failure-excel/*`).

We currently have the following files:
```
1. Enfinite Pilot Wells Failure Summary.xlsx   -- Using Sheet 4  
2. Downtime (2015 - Feb 2020) (ID 24960).xlsx
3. Downtime (Mar-Apr 2020) (ID 46953).xlsx
4. Oasis Complete Failure List 2018-2020.xlsx   -- Using Sheet 2
5. Manually Labelled data
```
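The contents of the bucket can be listed programmatically instead of by hand. The sketch below assumes a filesystem object with a `glob` method, such as `s3fs.S3FileSystem`; `list_failure_files` is a hypothetical helper, and the prefix is the bucket path given above.

```python
def list_failure_files(fs, prefix="et-oasis/failure-excel"):
    """Return the Excel workbooks found directly under `prefix`, sorted by path."""
    paths = fs.glob(f"{prefix}/*")
    # Keep only Excel files; other objects in the prefix are skipped.
    return sorted(p for p in paths if p.lower().endswith((".xlsx", ".xls")))


# Usage (needs AWS credentials with read access to the bucket):
# import s3fs
# fs = s3fs.S3FileSystem()
# for path in list_failure_files(fs):
#     print(path)
```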


For failures we currently use `Oasis Complete Failure List 2018-2020.xlsx`, Sheet 2 (see the section Failure File 4 / Sheet 2), together with Failure File 1 in the Merging section.

## Failure File 4

### Sheet 1

**Note: Not Using Sheet 1 for now**

In [26]:
# failure4 = pd.read_excel("s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020.xlsx")

# # The following columns do not seem useful
# # Add or remove columns as needed
# cols_drop = [
#     'ADJUSTED WELL COUNT',  
#     'ACTUAL ON PUMP DATE',
#     'ENERTIA WELL ID',
#     'ARTIFICIAL LIFT TECH',
#     'LAST OIL MONTH', 
#     'LAST OIL YEAR',
#     'FAILURE STOP MONTH', 
#     'FAILURE STOP YEAR'
# ]

# failure4.drop(columns=cols_drop, inplace=True)
# failure4.sort_values(by=['WELL NAME'], inplace=True)
# failure4.reset_index(inplace=True, drop=True)
# failure4.head()

#### Columns with Strings Cleaning

Columns with string values can contain duplicates because of:
* Inconsistent case (upper and lower)
* Additional spaces and characters

The following columns are cleaned:
```
WELL NAME -- Modifying this will have to be reflected in all other tables (TODO)
TYPE
EVENT OPERATIONS DESCRIPTION
JOB TYPE (EVENT TYPE IN OW)
```
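The well-name normalisation applied in this notebook can be written as a plain function, which makes the individual steps easier to test. This is a sketch that mirrors the chained `.str` calls used in the cells; `clean_well_name` is a hypothetical name and the raw inputs are made-up variants.

```python
import re


def clean_well_name(name: str) -> str:
    """Normalise a well name so that spelling variants collapse to one value."""
    name = name.replace("#", "")               # remove '#'
    name = re.sub(r"\s+", " ", name).strip()   # collapse repeated whitespace, trim ends
    name = name.lower().title()                # uppercase first letter of each word
    return name[:-2] + name[-2:].upper()       # uppercase the two-character suffix


cleaned_a = clean_well_name("  acadia   #31-25h ")   # -> "Acadia 31-25H"
cleaned_b = clean_well_name("BONNER 9X-12ha")        # -> "Bonner 9X-12HA"
```

With pandas, the same function can be applied with `df['Well'].map(clean_well_name)`, provided the column has no missing values.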

In [27]:
# # Cleaning WELL NAMES
# failure4['WELL NAME'] = (failure4['WELL NAME'].str.replace("#", "")  # remove #
# #                                              .str.replace(".", "")  # remove .
#                                              .str.replace('\s+', ' ', regex=True)  # remove multiple spaces if present
#                                              .str.strip()  # Remove trailing whitespaces
#                                              .str.lower()  # lower all character
#                                              .str.title()  # Uppercase first letter of each word
#                                              .map(lambda x: x[0:-2] + x[-2:].upper()))

# # TYPE Columns
# failure4['TYPE'] = (failure4['TYPE'].map(lambda x: str(x).replace("-", "").lower().title())
#                                  .str.replace('\s+', ' ', regex=True)
#                                  .str.strip())

In [28]:
# # Event operations column
# failure4['EVENT OPERATIONS DESCRIPTION']= (failure4['EVENT OPERATIONS DESCRIPTION'].str.replace('-', '')
#                                                                     .str.upper()
#                                                                     .str.replace("ESP", "ESP ")  # Add Space After ESP
#                                                                     .str.replace("UPLIFT", "UPLIFT ")
#                                                                     .str.replace('\s+', ' ', regex=True)
#                                                                     .str.strip())

# # Values changed manually
# manual_change = {
#     'BROKEN POLISH ROD': 'POLISH ROD BREAK',
#     'DEEP ROD PART': 'ROD PART DEEP',
#     'ESP GROUD': 'ESP GROUND',
#     'ESP GROUNDED': 'ESP GROUND',
#     'PARTED POLISH ROD': 'POLISH ROD PART',
#     'POLISH ROD BROKE': 'POLISH ROD BREAK',
#     'POLISHED ROD BREAK': 'POLISH ROD BREAK',
#     'PUIMP CHANGE': 'PUMP CHANGE',
#     'PUJMP FAILURE': 'PUMP FAILURE',
#     'PUMP FALIURE': 'PUMP FAILURE',
#     'PUMP FILURE': 'PUMP FAILURE',
#     'ROD PART SHALLLOW': 'ROD PART SHALLOW',
#     'ROD PARTDEEP': 'ROD PART DEEP',
#     'ROD PARTSHALLOW': 'ROD PART SHALLOW',
#     'SHALLOW ROD PART': 'ROD PART SHALLOW',
# }

# failure4['EVENT OPERATIONS DESCRIPTION'].replace(manual_change, inplace=True)
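The mapping above was written by hand. As a possible aid for spotting similar misspellings (this is not what the notebook does), the standard library's `difflib` can suggest the closest canonical label. The `CANONICAL` list and the `suggest_label` helper are hypothetical, and the cutoff is an arbitrary starting value; suggestions should still be reviewed manually.

```python
from difflib import get_close_matches

# Canonical labels taken from the right-hand side of the manual mapping.
CANONICAL = [
    "ESP GROUND",
    "POLISH ROD BREAK",
    "PUMP CHANGE",
    "PUMP FAILURE",
    "ROD PART DEEP",
    "ROD PART SHALLOW",
]


def suggest_label(value, cutoff=0.85):
    """Return the closest canonical label, or None if nothing is similar enough."""
    matches = get_close_matches(value, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest_label("PUMP FALIURE")
no_suggestion = suggest_label("GAS LIFT")
```

Character-level similarity catches typos such as transposed or doubled letters, but not reordered phrases, so entries like those still need a manual rule.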

In [11]:
# # Job Type event
# failure4['JOB TYPE (EVENT TYPE IN OW)'] = (failure4['JOB TYPE (EVENT TYPE IN OW)'].str.replace('-', '')
#                                                                                    .str.upper()
#                                                                                    .str.replace('\s+', ' ', regex=True)
#                                                                                    .str.strip())

# manual_change = {
#     'POLISHED ROD BREAK': 'POLISH ROD BREAK',
#     'RESPACE PUMP': 'PUMP RESPACE'
# }

# failure4['JOB TYPE (EVENT TYPE IN OW)'].replace(manual_change, inplace=True)

### Sheet 2

In [29]:
failure_file4 = pd.read_excel("s3://et-oasis/failure-excel/Oasis Complete Failure List 2018-2020.xlsx", sheet_name=1)

# Keep Specific columns only
cols_to_keep = [
    "Well",
    "Formation",
    "LAST OIL",
    "LOE START DATE",
    "LOE FINISH DATE",
    "Run time (days)",
    "Job Type",
    "Job Bucket",
    "Components",
    "Primary Symptom",
    "Secondary Symptom",
    "Root Cause",
    "Polish Rod Run Time",
    "Pony Sub Run Time",
    "Pump Run Time",
    "Tubing Run Time (Days)"
]
failure_file4 = failure_file4[cols_to_keep].copy()  # copy so the column assignments below act on an independent frame

# Cleaning WELL NAMES
failure_file4['Well'] = (failure_file4['Well'].str.replace("#", "", regex=False)  # remove #
                                 .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                 .str.strip()  # remove leading/trailing whitespace
                                 .str.lower()  # lowercase all characters
                                 .str.title()  # uppercase first letter of each word
                                 .map(lambda x: x[0:-2] + x[-2:].upper()))  # uppercase the last two characters

# Cleaning 'Root Cause'
manual_change = {"Fatigue/Acceptable Run Time": "Fatigue, Acceptable Run Time"}

failure_file4['Root Cause'] = (failure_file4['Root Cause'].replace(manual_change)
                                             .str.replace(r'\s+', ' ', regex=True)
                                             .str.strip())
# failure_file4['Root Cause'].value_counts().sort_index()  # inspect the distinct values

# Renaming Specific Columns
cols_rename = {
    "Well": "NodeID",
    "LAST OIL": "Last Oil",
    "LOE START DATE": "Start Date",
    "LOE FINISH DATE": "Finish Date",
    "Run time (days)": 'Run Time',
    "Tubing Run Time (Days)": 'Tubing Run Time'
}

failure_file4.rename(columns = cols_rename, inplace=True)

# Final sorting
failure_file4.sort_values(by='NodeID', inplace=True)
failure_file4.reset_index(inplace=True, drop=True)
failure_file4.head()

Unnamed: 0,NodeID,Formation,Last Oil,Start Date,Finish Date,Run Time,Job Type,Job Bucket,Components,Primary Symptom,Secondary Symptom,Root Cause,Polish Rod Run Time,Pony Sub Run Time,Pump Run Time,Tubing Run Time
0,A. Johnson 12-1H,MIDDLE BAKKEN,2018-09-16,2018-10-09,2018-10-16,432.0,TUBING LEAK,TUBING,Tubing - Unknown,Scale,Corrosion,,,,,
1,Aagvik 1-35H,MIDDLE BAKKEN,2019-11-27,2019-12-02,2019-12-06,203.0,TUBING LEAK,TUBING,Tubing - Body,Mechanically Induced Damage,Solids in Pump,,,,,202.0
2,Aagvik 5298 41-35 2TX,THREE FORKS,2019-05-29,2019-06-04,2019-06-25,80.0,GAS LIFT,PUMP,Gas Lift - Valve - Bellows,Low Production,Blank,,,,,79.0
3,Acadia 31-25H,THREE FORKS,2019-03-30,2019-04-10,2019-04-16,323.0,"1-1/4"" PUMP",PUMP,Pump - Plunger,Corrosion,Mechanically Induced Damage,,,,,266.0
4,Acadia 31-25H,THREE FORKS,2018-04-11,2018-05-05,2018-05-11,266.0,TUBING LEAK,TUBING,Tubing - Collar,Corrosion,Sand,,,,,


## Failure File 1

This file includes failures for the first group of wells. The cleaning steps are the same as for `Failure File 4`.

Using Sheet 4 (Sheet Name: Analysis)
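Because both files go through the same steps, the cleaning can be captured once in a shared function. The following is a sketch under that assumption, not the notebook's actual code: `clean_failure_sheet` and `COLS_RENAME` are hypothetical names, and the example rows are made up.

```python
import pandas as pd

COLS_RENAME = {
    "Well": "NodeID",
    "LAST OIL": "Last Oil",
    "LOE START DATE": "Start Date",
    "LOE FINISH DATE": "Finish Date",
    "Run time (days)": "Run Time",
    "Tubing Run Time (Days)": "Tubing Run Time",
}


def clean_failure_sheet(raw: pd.DataFrame, cols_to_keep: list) -> pd.DataFrame:
    """Select, clean, rename and sort one failure sheet."""
    df = raw[cols_to_keep].copy()
    # Well names: drop '#', collapse whitespace, title-case, uppercase the suffix.
    df["Well"] = (
        df["Well"]
        .str.replace("#", "", regex=False)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
        .str.lower()
        .str.title()
        .map(lambda x: x[:-2] + x[-2:].upper())
    )
    # Root cause: apply the manual fix, then normalise whitespace.
    if "Root Cause" in df.columns:
        df["Root Cause"] = (
            df["Root Cause"]
            .replace({"Fatigue/Acceptable Run Time": "Fatigue, Acceptable Run Time"})
            .str.replace(r"\s+", " ", regex=True)
            .str.strip()
        )
    # Keys missing from the frame are ignored by rename().
    df = df.rename(columns=COLS_RENAME)
    return df.sort_values(by="NodeID").reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "Well": ["  cade  #12-19ha", "BONNER 9X-12hb"],
    "LOE START DATE": ["2017-05-21", "2017-07-09"],
    "Root Cause": ["Fatigue/Acceptable Run Time", "Rod  Wear "],
})
cleaned = clean_failure_sheet(raw, ["Well", "LOE START DATE", "Root Cause"])
```

Each workbook would then need only its own `pd.read_excel(...)` call followed by `clean_failure_sheet(...)`.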

In [32]:
failure_file1 = pd.read_excel("s3://et-oasis/failure-excel/Enfinite Pilot Wells Failure Summary.xlsx", sheet_name=3)  # Query it locally

# Keep Specific columns only
cols_to_keep = [
    "Well",
    "Formation",
    "LAST OIL",
    "LOE START DATE",
    "LOE FINISH DATE",
    "Run time (days)",
    "Job Type",
    "Job Bucket",
    "Components",
    "Primary Symptom",
    "Secondary Symptom",
    "Root Cause",
    "Polish Rod Run Time",
    "Pony Sub Run Time",
    "Pump Run Time",
    "Tubing Run Time (Days)"
]
failure_file1 = failure_file1[cols_to_keep].copy()  # copy so the column assignments below act on an independent frame

# Cleaning WELL NAMES
failure_file1['Well'] = (failure_file1['Well'].str.replace("#", "", regex=False)  # remove #
                                 .str.replace(r'\s+', ' ', regex=True)  # collapse repeated whitespace
                                 .str.strip()  # remove leading/trailing whitespace
                                 .str.lower()  # lowercase all characters
                                 .str.title()  # uppercase first letter of each word
                                 .map(lambda x: x[0:-2] + x[-2:].upper()))  # uppercase the last two characters

# Cleaning 'Root Cause'
manual_change = {"Fatigue/Acceptable Run Time": "Fatigue, Acceptable Run Time"}

failure_file1['Root Cause'] = (failure_file1['Root Cause'].replace(manual_change)
                                             .str.replace(r'\s+', ' ', regex=True)
                                             .str.strip())
# failure_file1['Root Cause'].value_counts().sort_index()  # inspect the distinct values

# Renaming Specific Columns
cols_rename = {
    "Well": "NodeID",
    "LAST OIL": "Last Oil",
    "LOE START DATE": "Start Date",
    "LOE FINISH DATE": "Finish Date",
    "Run time (days)": 'Run Time',
    "Tubing Run Time (Days)": 'Tubing Run Time'
}

failure_file1.rename(columns = cols_rename, inplace=True)

# Final sorting
failure_file1.sort_values(by='NodeID', inplace=True)
failure_file1.reset_index(inplace=True, drop=True)
failure_file1.head()

Unnamed: 0,NodeID,Formation,Last Oil,Start Date,Finish Date,Run Time,Job Type,Job Bucket,Components,Primary Symptom,Secondary Symptom,Root Cause,Polish Rod Run Time,Pony Sub Run Time,Pump Run Time,Tubing Run Time
0,Bonner 9X-12HA,THREE FORKS,2016-12-29,2017-02-07,2017-02-08,77,"1"" ROD SECTION",ROD,Rod - Main Body,Compression,Solids in Pump,,,,,
1,Bonner 9X-12HA,THREE FORKS,2017-05-10,2017-05-18,2017-05-21,91,"2"" PUMP",PUMP,Pump - Stuck Pump,Solids in Pump,Corrosion,,,,,201.0
2,Bonner 9X-12HB,THREE FORKS,2017-06-30,2017-07-09,2017-07-12,324,TUBING LEAK,TUBING,Tubing - Unknown,Corrosion,Rod Wear,,,,,
3,Cade 12-19HA,MIDDLE BAKKEN,2017-05-09,2017-05-21,2017-05-26,371,TUBING LEAK,TUBING,Tubing - Unknown,Unknown,Loose Anchor,,,,,
4,Cade 12-19HA,MIDDLE BAKKEN,2019-07-19,2019-07-24,2019-07-26,784,POLISH ROD BREAK,ROD,Polish Rod,Mechanically Induced Damage,,,,,,784.0


## Failure Files 2 & 3

The information in these files has to be extracted and understood before we can save it to the database, possibly using some NLP techniques.

**Note: Not Using these files currently**
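Before reaching for NLP, a simple keyword pass over the free-text remarks may already separate the main failure groups. The sketch below is only an illustration: the rules in `RULES` are hypothetical, the labels reuse the `Job Bucket` values seen in the other files, and `tag_remark` is a made-up helper name.

```python
import re

# Hypothetical keyword rules; real rules would come from inspecting the remarks.
RULES = [
    ("TUBING", re.compile(r"\btubing\b|\bhole\b", re.IGNORECASE)),
    ("ROD", re.compile(r"\brod\b", re.IGNORECASE)),
    ("PUMP", re.compile(r"\bpump\b", re.IGNORECASE)),
]


def tag_remark(remark):
    """Return the labels whose keywords occur in a remark (empty list if none)."""
    if not isinstance(remark, str):  # missing values arrive as None or NaN
        return []
    return [label for label, pattern in RULES if pattern.search(remark)]


tags_single = tag_remark("Down - hole in tubing")
tags_multi = tag_remark("Rod part, pump change")
```

With pandas, the helper could be applied to the remarks column via `.map(tag_remark)`.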

In [5]:
# file1 = 's3://et-oasis/failure-excel/Downtime (2015 - Feb 2020) (ID 24960).xlsx'
# file2 = 's3://et-oasis/failure-excel/Downtime (Mar-Apr 2020) (ID 46953).xlsx'

# failure1 = pd.read_excel(file1)
# failure2 = pd.read_excel(file2)

# failure_df = pd.concat([failure1, failure2])
# failure_df.sort_values(by=['PropertyName', 'effectivedate'], inplace=True)
# failure_df.reset_index(inplace=True, drop=True)
# failure_df.head()

In [25]:
# # Split by start date
# start_dt = pd.Timestamp('2019-01-01')
# failure_latest = failure_df[failure_df.effectivedate >= start_dt].copy()
# failure_latest.reset_index(inplace=True, drop=True)
# failure_latest.sort_values(by=['PropertyName', 'effectivedate'], inplace=True)
# failure_latest.head()

In [15]:
# failure_latest.DtReason.value_counts()

In [16]:
# failure_latest.DtRemarks.value_counts()

In [17]:
# len(failure_df.PropertyBK.unique())

In [18]:
# # Reason Specific Value counts
# failure_df[failure_df.DtReason == 'DOWNH'].DtRemarks.value_counts()

In [19]:
# # Remarks Sorted by number of counts
# remarks_cts = failure_latest.DtRemarks.value_counts()
# remarks_cts[remarks_cts >= 100]

In [21]:
# well='ZUTZ 5693 44-12T'

# well_df = failure_latest[failure_latest.PropertyName == well]
# well_df.head()

In [20]:
# well_df.DtReason.value_counts()
# well_df.DtRemarks.value_counts()

### Compare with a string

In [23]:
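# str_test = 'hole'
# str_test2 = 'Down'
# remarks = failure_latest.DtRemarks.fillna("None")
# # Keep rows whose remarks contain both substrings
# bool_ = remarks.str.contains(str_test, na=False) & remarks.str.contains(str_test2, na=False)
# masked_df = failure_latest[bool_]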


In [24]:
# masked_df.tail()

# Merging

Merge all the necessary failure files in this section.

Using the following files:
- `Failure File 4 (failure_file4)`
- `Failure File 1 (failure_file1)`
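The merge is a row-wise concatenation followed by de-duplication on a set of key columns. The sketch below shows the effect on toy data; `merge_failures` and `DEDUP_KEYS` are hypothetical names, and the overlap between the two toy frames is invented for illustration.

```python
import pandas as pd

DEDUP_KEYS = ["NodeID", "Last Oil", "Finish Date", "Job Type", "Job Bucket"]


def merge_failures(frames):
    """Stack the cleaned frames and keep the first row for each key combination."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates(subset=DEDUP_KEYS).reset_index(drop=True)


# Toy rows: one event present in both frames, one present in only the second.
shared = {"NodeID": "Cade 12-19HA", "Last Oil": "2019-07-19", "Finish Date": "2019-07-26",
          "Job Type": "POLISH ROD BREAK", "Job Bucket": "ROD"}
other = {"NodeID": "Bonner 9X-12HB", "Last Oil": "2017-06-30", "Finish Date": "2017-07-12",
         "Job Type": "TUBING LEAK", "Job Bucket": "TUBING"}
file4_rows = pd.DataFrame([{**shared, "Run Time": 784.0}])
file1_rows = pd.DataFrame([{**shared, "Run Time": 784.0}, {**other, "Run Time": 324.0}])

merged = merge_failures([file4_rows, file1_rows])
```

Because `drop_duplicates` keeps the first occurrence, the order of the frames decides which copy survives when two rows share the same keys but differ in other columns.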

In [45]:
full_failures = pd.concat([failure_file4, failure_file1])

# Dropping some columns
cols_drop = [
    'Root Cause',
    'Polish Rod Run Time',
    'Pony Sub Run Time',
    'Pump Run Time',
    'Tubing Run Time'
]
full_failures.drop(columns=cols_drop, inplace=True)

# Dropping Duplicates
full_failures.drop_duplicates(subset=['NodeID', 'Last Oil', 'Finish Date','Job Type', 'Job Bucket'], inplace=True)
full_failures.reset_index(inplace=True, drop=True)

full_failures.head()

Unnamed: 0,NodeID,Formation,Last Oil,Start Date,Finish Date,Run Time,Job Type,Job Bucket,Components,Primary Symptom,Secondary Symptom
0,A. Johnson 12-1H,MIDDLE BAKKEN,2018-09-16,2018-10-09,2018-10-16,432.0,TUBING LEAK,TUBING,Tubing - Unknown,Scale,Corrosion
1,Aagvik 1-35H,MIDDLE BAKKEN,2019-11-27,2019-12-02,2019-12-06,203.0,TUBING LEAK,TUBING,Tubing - Body,Mechanically Induced Damage,Solids in Pump
2,Aagvik 5298 41-35 2TX,THREE FORKS,2019-05-29,2019-06-04,2019-06-25,80.0,GAS LIFT,PUMP,Gas Lift - Valve - Bellows,Low Production,Blank
3,Acadia 31-25H,THREE FORKS,2019-03-30,2019-04-10,2019-04-16,323.0,"1-1/4"" PUMP",PUMP,Pump - Plunger,Corrosion,Mechanically Induced Damage
4,Acadia 31-25H,THREE FORKS,2018-04-11,2018-05-05,2018-05-11,266.0,TUBING LEAK,TUBING,Tubing - Collar,Corrosion,Sand


# Saving 

The cleaned failure data is saved to our database:
```
database = 'oasis-prod'
schema = 'analysis'
table = 'failure_info'
```

We use the class `lib_aws.AddData` from our local library; see its docstring for details on how it works.

In [47]:
# Add the data (requires write permissions on the database)
lib_aws.AddData.add_data(df=full_failures, 
                         db='oasis-prod',
                         schema='analysis',
                         table='failure_info',
                         merge_type='replace', 
                         index_col='NodeID')

Data replaceed on Table failure_info in time 21.37s
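`lib_aws.AddData.add_data` is a local helper whose internals are not shown in this notebook. As a rough illustration only, and assuming that `merge_type='replace'` drops and recreates the target table, a comparable step can be sketched with pandas `DataFrame.to_sql`. The sketch writes to an in-memory SQLite database rather than Postgres so that it runs without credentials; `replace_table` is a hypothetical name and the rows are toy data.

```python
import sqlite3

import pandas as pd


def replace_table(df: pd.DataFrame, conn: sqlite3.Connection, table: str) -> int:
    """Overwrite `table` with `df` and return the number of rows now stored."""
    # if_exists="replace" drops the existing table (if any) before writing.
    df.to_sql(table, conn, if_exists="replace", index=False)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
first = pd.DataFrame({"NodeID": ["Well A", "Well B"], "Job Bucket": ["ROD", "PUMP"]})
second = pd.DataFrame({"NodeID": ["Well C"], "Job Bucket": ["TUBING"]})

rows_after_first = replace_table(first, conn, "failure_info")
rows_after_second = replace_table(second, conn, "failure_info")
```

After the second call only the rows of `second` remain, which is the behaviour to keep in mind when rerunning the save cell with a partial frame.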
