___

# Energy A.I. Hackathon 2023 Workflow - Hackalopes 

#### Authors:   
<h4>  

**Richard Larson** - Energy and Earth Resources - *Jackson*  
**Karthik Menon** - Energy and Earth Resources - *Jackson*   
**Daniel Pang** - Petroleum and Geosystems Engineering - *Cockrell*    
**Benjamin Stormer** - Walker Department of Mechanical Engineering - *Cockrell* 

</h4> 

#### The University of Texas at Austin, Austin, Texas USA 
___

### Executive Summary 

We needed to predict whether each of 40 electrical submersible pumps (ESPs) would fail within 30 days. To make these predictions, we developed a data analytics and machine learning workflow in Python. We found that the gas-oil ratio (GOR), ESP vibration data, and other factors related to fluid flow through the pump are the most critical in assessing the status of an ESP. We recommend engineering features from these factors to evaluate the remaining lifespan of ESPs in the future.

___

### Workflow Goal

Our workflow should clearly lay out the steps to select the optimal features for evaluating ESP lifespan, impute or remove data where appropriate, and develop a predictive machine learning model that classifies each ESP as expected to fail or not to fail within 30 days.
___

### Workflow Steps 


1. **Data Analysis** - Separate the data for failed pumps from the data for operational pumps, and assess feature completeness and correlation.
2. **Feature Selection** - Leverage physical knowledge and domain expertise to determine the most influential features, imputing data where possible and removing redundancies where necessary.
3. **Machine Learning Model \#1** - Create a minimum viable prototype using simple machine learning methods.
4. **Machine Learning Model \#2** - Improve the minimum viable prototype by increasing model complexity and changing the feature selection.


### Import Packages

In [15]:
import numpy as np                    # model arrays
import pandas as pd                   # DataFrames
import matplotlib.pyplot as plt       # building plots
import seaborn as sns                 # Plotting help
import os                             # accessing the operating system
from pathlib import Path              # Write files out
from sklearn.impute import KNNImputer # k-nearest imputing
from sklearn.experimental import enable_iterative_imputer # required for MICE imputation
from sklearn.impute import IterativeImputer # MICE imputation
# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from tpot import TPOTClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


ModuleNotFoundError: No module named 'tpot'

### Load Data

The following workflow uses the .csv files 'wellData.csv', 'dailyData.csv', and 'solution.csv'. 'wellData.csv' is a collection of static features for each well. 'dailyData.csv' is a collection of time-series data describing the operation of each pump. 'solution.csv' lists the pumps whose failure within 30 days is to be predicted. The datasets were made available to competitors as part of the UT Austin 2023 PGE Hackathon.

We will work with the following features:
* **DLS_Critical** - ...
* **GOR** - ...
* **ESP Data - Vibration X** - ...
* **Pump_Power** - ...

In [2]:
# Relative paths used since assumption is that user is in root directory of forked Git repo.
well_data = pd.read_csv("../Hackalopes/wellData.csv")      # load the well data in
daily_data = pd.read_csv("../Hackalopes/dailyData.csv")    # load the daily data in
solution_data = pd.read_csv("../Hackalopes/solution.csv")  # load the solution data in

### Functions

The following functions will be used in the workflow.

In [3]:
# Plots a correlation matrix as a heat map
def plot_corr(dataframe,size=10):                        
    corr = dataframe.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    im = ax.matshow(corr,vmin = -1.0, vmax = 1.0)
    plt.xticks(range(len(corr.columns)), corr.columns);
    plt.yticks(range(len(corr.columns)), corr.columns);
    plt.colorbar(im, orientation = 'vertical')
    plt.title('Correlation Matrix')


# Count the NaN values in a feature, then replace them by back-filling ("bf"),
# forward-filling ("ff"), or with a constant replacement value
def count_and_replace_nan(dataset, feature, replacement_value):
    num_nan = dataset[feature].isna().sum()
    print(f"Number of NaN values in {feature} is {num_nan} out of {len(dataset)} values")
    if replacement_value == "bf":
        dataset[feature] = dataset[feature].bfill()
        print("NaN values have been replaced by back-filling")
    elif replacement_value == "ff":
        dataset[feature] = dataset[feature].ffill()
        print("NaN values have been replaced by forward-filling")
    else:
        dataset[feature] = dataset[feature].fillna(replacement_value)
        print(f"NaN values have been replaced with {replacement_value}")
    return dataset


# Report the features whose fraction of missing values meets or exceeds the cutoff
def completeness(df, cutoff=0.75):
    proportion_missing = df.isnull().mean()  # fraction of NaN values per feature
    past_cutoff = list(proportion_missing[proportion_missing >= cutoff].index)
    print(f'{past_cutoff} were missing at least {cutoff*100}% of their data')


# Calculate the (Spearman) correlation between all numeric features in a DataFrame
# and return the feature pairs whose coefficient reaches the cutoff in magnitude.
def correlation(df, cutoff=0.75):
    corr = df.corr(method='spearman', numeric_only=True)
    correlated = []
    inversely_correlated = []
    for feature in corr:
        for feature2 in corr:
            if str(feature) == str(feature2):
                continue  # skip the diagonal
            coefficient = corr.loc[feature2, feature]
            if coefficient >= cutoff:
                correlated.append((feature, feature2))
            elif coefficient <= -cutoff:
                inversely_correlated.append((feature, feature2))
    return correlated, inversely_correlated

### Aggregate and Separate Data Appropriately

The following steps combine the input .csv files into DataFrames that are easier to manage.

In [4]:
# Mark data we will be predicting with a 0 and failed pumps with a 1
solution_data = solution_data.rename(columns={"Fail in 30 days": "Failed"}) 
solution_data["Failed"] = 0 # Initialize solutions with 0

In [5]:
# Combine all data into one dataframe
combined_data = pd.merge(pd.merge(well_data, daily_data, on=["Well_ID", "AL_Key"], how="left"), 
                         solution_data, on=["Well_ID", "AL_Key"], how="left")
combined_data["Failed"] = combined_data["Failed"].replace(np.nan, 1)
combined_data = combined_data.drop(columns=["Unnamed: 0"])
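
The labeling above relies on unmatched rows coming back as NaN after the left merge. As a minimal sketch on made-up data, the same labels can be derived explicitly with the `indicator` option of `DataFrame.merge`. The toy frames and the `_merge`-based rule are illustrative assumptions, not the hackathon data.

```python
import pandas as pd

# Toy stand-ins for the pump table and the list of pumps to predict (assumed values)
pumps = pd.DataFrame({"Well_ID": [1, 2, 3], "AL_Key": [10, 20, 30]})
to_predict = pd.DataFrame({"Well_ID": [2], "AL_Key": [20]})

# indicator=True adds a "_merge" column: "both" if the pump is in to_predict,
# "left_only" otherwise
labeled = pumps.merge(to_predict, on=["Well_ID", "AL_Key"], how="left", indicator=True)

# Pumps absent from the prediction list are treated as already failed (1)
labeled["Failed"] = (labeled["_merge"] == "left_only").astype(int)
labeled = labeled.drop(columns="_merge")
print(labeled)
```

This keeps the label independent of whatever value the `Failed` column was initialized with.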

In [6]:
# Backfill NaN values in the OIL, GAS, and WATER columns
for column in ["OIL", "GAS", "WATER"]:
    combined_data[column] = combined_data[column].bfill()

In [7]:
# Insert new ratios and features in dataframe
combined_data.insert(15, "GOR", combined_data["GAS"] / combined_data["OIL"])
combined_data.insert(16, "GOF", combined_data["GAS"] / (combined_data["OIL"] + combined_data["WATER"]))
combined_data.insert(17, "GOR_Slope", combined_data["GOR"].diff())
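
The ratios above divide by production volumes that can be zero, and `diff()` on the combined frame also differences across the boundary between two pumps. The sketch below shows one possible variant that treats zero denominators as missing and differences within each pump. The helper name `add_gas_ratios` and the toy values are assumptions for illustration; this is not the calculation the notebook actually ran.

```python
import numpy as np
import pandas as pd

def add_gas_ratios(df, keys=("Well_ID", "AL_Key")):
    """Return a copy of df with GOR, GOF, and a per-pump GOR_Slope column."""
    out = df.copy()
    # Treat zero denominators as missing so the ratios never become +/-inf
    oil = out["OIL"].replace(0, np.nan)
    liquid = (out["OIL"] + out["WATER"]).replace(0, np.nan)
    out["GOR"] = out["GAS"] / oil
    out["GOF"] = out["GAS"] / liquid
    # Difference within each pump, so the first record of every pump is NaN
    out["GOR_Slope"] = out.groupby(list(keys))["GOR"].diff()
    return out

# Toy data (assumed values): two pumps, one day with zero oil and zero water
toy = pd.DataFrame({
    "Well_ID": [1, 1, 1, 2, 2],
    "AL_Key": [1, 1, 1, 1, 1],
    "OIL": [10.0, 20.0, 0.0, 5.0, 10.0],
    "GAS": [100.0, 100.0, 50.0, 10.0, 40.0],
    "WATER": [10.0, 0.0, 0.0, 5.0, 10.0],
})
toy_ratios = add_gas_ratios(toy)
print(toy_ratios[["Well_ID", "GOR", "GOF", "GOR_Slope"]])
```

With this variant, the later step that replaces infinite values before imputation would have nothing left to replace in these columns.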

In [8]:
# Create a dataframe for both the pumps still operational and the pumps that have failed
solution_data = combined_data[combined_data["Failed"] == 0] # Still operational
failed_data = combined_data[combined_data["Failed"] == 1] # Failed

### Basic Data Checking and Visualization

In [9]:
correlation(combined_data) # Compute the correlation statistics for the combined dataframe

In [10]:
completeness(solution_data) # Compute the data completeness statistics for the operational pumps

['ESP_Motor_Frequency_Rating'] were missing at least 75.0% of their data


### Minimum Viable Prototype

In [11]:
# Remove inessential data for a preliminary prototype. Essential data were
# determined from the physical knowledge of domain experts and from the
# correlation/completeness studies.
min_viable = combined_data.drop(columns=['Artificial_Lift_Type',
       'AL_Bottom_Depth', 'ESP_Pump_Stages',
       'ESP_Motor_Frequency_Rating', 'ESP_Motor_Current_Rating',
       'ESP_Motor_Voltage_Rating', 'ESP_Motor_Power_Rating',
       'DLS_at_Set_Depth', 'OIL', 'GAS', 'WATER', 'ARTIFICIAL_LIFT',
       'DOWN_TIME_HOURS', 'ESP Data - Drive Current',
       'ESP Data - Drive Voltage', 'ESP Data - Intake Pressure',
       'ESP Data - Motor Temperature Shutdown Setpoint',
       'ESP Data - Motor Winding Temperature', 'ESP Data - Output Frequency',
       'Startup_Count', 'Oil_Intake', 'Water_Intake',
       'Gas_Intake', 'Liquid_Intake', 'Gas_Saturation_at_Intake',
       'Gas_Separator_Efficiency', 'Gas_through_Annulus_Intake',
       'Gas_through_ESP_Intake', 'Gas_through_Annulus', 'Gas_through_ESP',
       'Pb_ESP', 'Discharge_Pressure', 'ESP_Fluid',
       'Gas_Saturation_at_Discharge', 'Pump_Delta_Pressure',
       'Pump_Average_Pressure', 'Gas_Saturation_in_Pump',
       'Drive_Power', 'Power_Ratio', 'Power_Difference', 'ESP_Temperature',
       'Lower_Limit', 'Failed'])

### Improved Model
Two imputation methods, k-nearest neighbors and multiple imputation by chained equations (MICE), are applied and tested separately.

In [12]:
# Remove inessential data for an improved prototype. This is slightly less
# strict in removing features than the minimum viable prototype
improved = combined_data.drop(columns=['Artificial_Lift_Type',
       'AL_Bottom_Depth', 'ESP_Pump_Stages',
       'ESP_Motor_Frequency_Rating', 'ESP_Motor_Current_Rating',
       'ESP_Motor_Voltage_Rating', 'ESP_Motor_Power_Rating',
       'DLS_at_Set_Depth', 'GAS', 'WATER', 'ARTIFICIAL_LIFT',
       'DOWN_TIME_HOURS', 'ESP Data - Drive Current',
       'ESP Data - Drive Voltage', 'ESP Data - Intake Pressure',
       'ESP Data - Motor Temperature Shutdown Setpoint',
       'ESP Data - Motor Winding Temperature', 'ESP Data - Output Frequency',
       'Startup_Count', 'Oil_Intake', 'Water_Intake',
       'Gas_Intake', 'Liquid_Intake', 'Gas_Saturation_at_Intake',
       'Gas_Separator_Efficiency', 'Gas_through_Annulus_Intake',
       'Gas_through_ESP_Intake', 'Gas_through_Annulus', 'Gas_through_ESP',
       'Discharge_Pressure', 'ESP_Fluid',
       'Gas_Saturation_at_Discharge', 'Pump_Delta_Pressure',
       'Pump_Average_Pressure', 'Gas_Saturation_in_Pump',
       'Drive_Power', 'Power_Difference', 'ESP_Temperature',
       'Lower_Limit', 'Failed'])

In [13]:
# Code modified from https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/SubsurfaceDataAnalytics_Feature_Imputation.ipynb
# Authored by Dr. Michael Pyrcz

# Impute the data using k-nearest neighbors
df_knn = improved.copy(deep=True) # make a deep copy of the DataFrame
df_knn.drop(columns=["Well_ID", "AL_Key"], inplace=True)
df_knn.replace([np.inf, -np.inf], np.nan, inplace=True) # Replace infinite values
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
df_knn.iloc[:,:] = knn_imputer.fit_transform(df_knn)

In [14]:
# Code modified from https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/SubsurfaceDataAnalytics_Feature_Imputation.ipynb
# Authored by Dr. Michael Pyrcz

# Impute the data using chained equations
df_mice = improved.copy(deep=True)  # make a deep copy of the DataFrame
df_mice.drop(columns=["Well_ID", "AL_Key"], inplace=True)
df_mice.replace([np.inf, -np.inf], np.nan, inplace=True) # Replace infinite values
mice_imputer = IterativeImputer()
df_mice.iloc[:,:] = mice_imputer.fit_transform(df_mice)
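
One way to test the two imputers separately is to hide values that are actually known and measure how well each method recovers them. The sketch below does this on synthetic data. The data, the 20% masking rate, and the helper `imputation_rmse` are assumptions for illustration, not results from the hackathon dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: feature "b" is strongly related to feature "a"
a = rng.normal(size=200)
full = pd.DataFrame({"a": a, "b": 2.0 * a + rng.normal(scale=0.1, size=200)})

# Hide roughly 20% of the known "b" values
mask = rng.random(200) < 0.2
holed = full.copy()
holed.loc[mask, "b"] = np.nan

def imputation_rmse(imputer):
    """Root-mean-square error of the imputed values against the hidden truth."""
    filled = imputer.fit_transform(holed)
    err = filled[mask, 1] - full.loc[mask, "b"].to_numpy()
    return float(np.sqrt(np.mean(err ** 2)))

rmse_knn = imputation_rmse(KNNImputer(n_neighbors=2, weights="uniform"))
rmse_mice = imputation_rmse(IterativeImputer(random_state=0))
print(f"KNN RMSE: {rmse_knn:.3f}, MICE RMSE: {rmse_mice:.3f}")
```

The same masking idea could be applied to a well-populated column of the real data to choose between `df_knn` and `df_mice`.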

### Machine Learning Workflow
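In practice, this workflow was applied first to the minimum viable prototype and then to the improved data set. For each failed pump, features are computed over two windows of its recorded life: the window between 70% and 80% of its life, labeled 0, and the final 10% before failure, labeled 1. With more time, we would have moved the repeated logic into functions to reduce the redundant code below, but the time constraints did not allow it.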
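
The per-pump window calculations repeated in the cells below could be collected into a single helper. The following is a minimal sketch of that refactoring on toy data. The function name `window_feature`, its arguments, and the toy frame are hypothetical and were not part of the submitted workflow.

```python
import numpy as np
import pandas as pd

def window_feature(groups, column, start_frac, end_frac, stat):
    """Apply `stat` to `column` over a fractional window of each pump's history.

    `groups` is a sequence of (key, DataFrame) pairs, e.g. list(df.groupby(...)).
    Fractions are measured back from the last record, mirroring the
    overall_time - int(round(overall_time * fraction)) arithmetic of the cells.
    """
    values = []
    for _, frame in groups:
        n = len(frame)
        start = n - int(round(n * start_frac))
        end = n - int(round(n * end_frac))
        values.append(stat(frame[column].iloc[start:end]))
    return np.array(values)

def slope(series):
    """Mean day-to-day change, as in the .diff().mean() cells."""
    return series.diff().mean()

# Toy data (assumed): two pumps with 20 daily records each
toy = pd.DataFrame({
    "Well_ID": [1] * 20 + [2] * 20,
    "AL_Key": [1] * 40,
    "OIL": list(range(20)) + list(range(0, 40, 2)),
})
groups = list(toy.groupby(["Well_ID", "AL_Key"]))

late = window_feature(groups, "OIL", 0.1, 0.0, slope)   # final 10% of life
early = window_feature(groups, "OIL", 0.3, 0.2, slope)  # 70% to 80% of life
print(early, late)
```

Each feature cell would then reduce to two calls, one per window, with `stat` swapped for `pd.Series.std`, `pd.Series.mean`, or another summary.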

In [None]:
combined_data.head()
combined_data["ESP Data - Vibration X"] = df_mice["ESP Data - Vibration X"]

In [None]:
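# Group the daily records by pump. The ratio columns are aliased because the
# feature cells below read them as "GAS/OIL" and "GAS/FLUID".
grouped_data = combined_data.rename(
    columns={"GOR": "GAS/OIL", "GOF": "GAS/FLUID"}
).groupby(["Well_ID", "AL_Key"])
array_life = np.array(grouped_data.size())  # number of daily records per pump
solution_data = pd.read_csv("../Hackalopes/solution.csv")  # reload the pumps to predict

# Store (key, DataFrame) pairs in an object array so they can be indexed by list
pairs = list(grouped_data)
mega_matrix = np.empty((len(pairs), 2), dtype=object)
for i, (key, frame) in enumerate(pairs):
    mega_matrix[i, 0] = key
    mega_matrix[i, 1] = frame

# Split the pump indices into pumps to predict (list_sort) and failed pumps (list_fail)
solution_keys = set(zip(solution_data["Well_ID"], solution_data["AL_Key"]))
list_sort = [i for i, (key, _) in enumerate(pairs) if key in solution_keys]
list_fail = [i for i, (key, _) in enumerate(pairs) if key not in solution_keys]

mega_s = mega_matrix[list_sort]
mega_s_life = array_life[list_sort]

mega_f = mega_matrix[list_fail]
mega_f_life = array_life[list_fail]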


old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["OIL"].diff().mean()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["OIL"].diff().mean()
    army.append(soldier)

In [None]:
# Oil production Standard deviation.
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["OIL"].std()
    old_folks.append(alpha)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["OIL"].std()
    army.append(soldier)


# Add the values to a DataFrame: earlier-window ("army") rows first with target 0,
# then final-window ("old_folks") rows with target 1.

army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df = pd.DataFrame(np.array(main_arr),columns = ["Oil Production"])    
data_df["Target"] = y 
    
data_df

In [None]:
# Gas/oil ratio     
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["GAS/OIL"].diff().mean()
    old_folks.append(alpha)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["GAS/OIL"].diff().mean()
    army.append(soldier)
        
army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["GAS/OIL"] = np.array(main_arr)

    
data_df

In [None]:
# Standard Deviation of Gas/Oil    
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["GAS/OIL"].std()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["GAS/OIL"].std()
    army.append(soldier)
        
army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["GAS/OIL(std)"] = np.array(main_arr)

    
data_df

In [None]:
# Gas/fluid slope.    
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["GAS/FLUID"].diff().mean()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["GAS/FLUID"].diff().mean()
    army.append(soldier)
    
army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["GAS/FLUID"] = np.array(main_arr)

    
data_df

In [None]:
# Gas/Fluid standard deviation.    
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["GAS/FLUID"].std()
    old_folks.append(alpha)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["GAS/FLUID"].std()
    army.append(soldier)


army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["GAS/FLUID(std)"] = np.array(main_arr)

    
data_df

In [None]:
# Static features.    
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["AL_Bottom_Depth"].mean()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["AL_Bottom_Depth"].mean()
    army.append(soldier)
        
army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["AL_Bottom_Depth"] = np.array(main_arr)
data_df['AL_Bottom_Depth'] = (data_df['AL_Bottom_Depth']) / (data_df['AL_Bottom_Depth'].abs().max())
    
data_df

In [None]:
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["DLS_Critical"].mean()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["DLS_Critical"].mean()
    army.append(soldier)
        

army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["DLS_Critical"] = np.array(main_arr)
data_df['DLS_Critical'] = (data_df['DLS_Critical']) / (data_df['DLS_Critical'].abs().max())
    
data_df

In [None]:
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["OIL"].diff().mean()
    old_folks.append(alpha)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["OIL"].diff().mean()
    army.append(soldier)

army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["OIL"] = np.array(main_arr)

    
data_df

In [None]:
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["ESP Data - Vibration X"].diff().mean()
    old_folks.append(alpha)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["ESP Data - Vibration X"].diff().mean()
    army.append(soldier)

army_arr = (np.array(army))
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["ESP Data - Vibration X"] = np.array(main_arr)

    
data_df

In [None]:
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha1=mega_f[i][1][senior:overall_time]["OIL"].sum()
    alpha2=mega_f[i][1][0:overall_time]["OIL"].sum()
    alpha3=alpha1/alpha2
    alpha3 = alpha3/ (overall_time*0.1)
    old_folks.append(alpha3)

army=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier1=mega_f[i][1][start:end]["OIL"].sum()
    soldier2=mega_f[i][1][0:end]["OIL"].sum()
    soldier3=soldier1/soldier2
    soldier3 = soldier3/(overall_time*0.2)
    army.append(soldier3)
    
army_arr2 = np.array(army)
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr2,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["Percentage Cumulative"] = np.array(main_arr)

In [None]:
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["OIL"].diff().diff().mean()
    old_folks.append(alpha)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["OIL"].diff().diff().mean()
    army.append(soldier)
    
army_arr2 = np.array(army)
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr2,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["B"] = np.array(main_arr)

In [None]:
# Age.
old_folks=[]

for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_f[i][1][senior:overall_time]["OIL"].diff().diff().mean()
    old_folks.append(overall_time)

army=[]
for i in range (len(mega_f)):
    overall_time=mega_f_life[i]
    start=overall_time-int((overall_time*0.3).round(0))
    end=overall_time-int((overall_time*0.2).round(0))
    soldier=mega_f[i][1][start:end]["OIL"].diff().diff().mean()
    army.append(end)
    
army_arr2 = np.array(army)
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((army_arr2,old_folks_arr))

zero = np.zeros(shape = (len(mega_f),1))
ones = np.ones(shape = (len(mega_f),1))

y = np.vstack((zero,ones))

data_df["Age"] = np.array(main_arr)

In [None]:
data_df = data_df.bfill()  # back-fill any remaining NaN values

In [None]:
data_df = data_df[data_df["GAS/OIL"] <= 10]
data_df = data_df[data_df["GAS/FLUID"] <= 10]

In [None]:
data_df = data_df[data_df["GAS/OIL(std)"] <= 100]
data_df = data_df[data_df["OIL"] >= -50]
data_df = data_df[data_df["OIL"] <= 50]

In [None]:
top = np.percentile(data_df["Age"].values,90)
bottom = np.percentile(data_df["Age"].values,10)
data_df = data_df[data_df["Age"] <= top]
data_df = data_df[data_df["Age"] >= bottom]

In [None]:
sns.boxplot(y = data_df['GAS/OIL'])

In [None]:
X_data = data_df[['OIL','GAS/OIL','GAS/OIL(std)','GAS/FLUID', 'GAS/FLUID(std)','Percentage Cumulative','B', 'Age']]

In [None]:
scaler = StandardScaler()
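
The scaler created above is not applied in the cells that follow. As a hedged sketch of how it could be used, the example below chains `StandardScaler` with a linear SVM in a scikit-learn pipeline so that the scaling statistics are learned from the training split only. The synthetic features and all `demo_`-prefixed names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales
demo_X = np.column_stack([
    rng.normal(size=300),
    rng.normal(scale=1000.0, size=300),
])
demo_y = (demo_X[:, 0] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, stratify=demo_y, random_state=0
)

# The pipeline fits the scaler on the training data and reuses it at predict time
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_svm.fit(Xtr, ytr)
svm_accuracy = scaled_svm.score(Xte, yte)
print(f"Held-out accuracy: {svm_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also means the solution data would be scaled with the same statistics when predictions are made.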


In [None]:
y = data_df['Target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data,y)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

In [None]:
clf2 = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
# Random Forest
clf = RandomForestClassifier(max_depth=6, random_state=0, min_samples_split=2,n_estimators = 1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
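
The random forest predictions above are not scored in the notebook. A minimal sketch of how a forest could be evaluated and inspected is shown below on synthetic data: it reports accuracy, a confusion matrix, and impurity-based feature importances. The data and the `demo_`-prefixed names are assumptions, and the printed numbers say nothing about the ESP data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in: one feature drives the label, the other is pure noise
demo_X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = (demo_X["informative"] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, stratify=demo_y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=0)
forest.fit(Xtr, ytr)
pred = forest.predict(Xte)

acc = accuracy_score(yte, pred)
cm = confusion_matrix(yte, pred)  # rows: true class, columns: predicted class
importances = pd.Series(
    forest.feature_importances_, index=demo_X.columns
).sort_values(ascending=False)

print(f"Accuracy: {acc:.3f}")
print(cm)
print(importances)
```

Applied to `clf`, `X_test`, and `y_test`, the same calls would give a ranking of the engineered features that could be compared against the domain-knowledge selection.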

In [None]:
# For the Solution Data
old_folks=[]
wellid = []
pumpn = []

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["OIL"].diff().diff().mean()
    old_folks.append(alpha)
    wellid.append(mega_s[i][0][0])
    pumpn.append(mega_s[i][0][1])


    
old_folks_arr = (np.array(old_folks))
wellid_arr = np.array(wellid)
pump_arr = np.array(pumpn)

sol_df = pd.DataFrame(np.array(wellid),columns = ["Well_ID"])
sol_df["Pump_ID"] = np.array(pumpn)

main_arr = np.hstack((old_folks_arr))

sol_df["Oil"] = np.array(main_arr)

In [None]:
# Gas/oil ratio     
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["GAS/OIL"].diff().mean()
    old_folks.append(alpha)

        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["GAS/OIL"] = np.array(main_arr)

    
sol_df

In [None]:
# Gas/oil standard deviation.    
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["GAS/OIL"].std()
    old_folks.append(alpha)

        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["GAS/OIL(std)"] = np.array(main_arr)

    
sol_df

In [None]:
# Gas/fluid ratio slope     
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["GAS/FLUID"].diff().mean()
    old_folks.append(alpha)

        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["GAS/FLUID"] = np.array(main_arr)


sol_df

In [None]:
# Gas/Fluid standard deviation
old_folks = []

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["GAS/FLUID"].std()
    old_folks.append(alpha)
    
        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["GAS/FLUID(std)"] = np.array(main_arr)

    
sol_df

In [None]:
# Static features.    
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["AL_Bottom_Depth"].mean()
    old_folks.append(alpha)
    
        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["AL_Bottom_Depth"] = np.array(main_arr)

sol_df['AL_Bottom_Depth'] = (sol_df['AL_Bottom_Depth']) / (sol_df['AL_Bottom_Depth'].abs().max())

    
sol_df

In [None]:
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["DLS_Critical"].mean()
    old_folks.append(alpha)
    
        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["DLS_Critical"] = np.array(main_arr)

sol_df['DLS_Critical'] = (sol_df['DLS_Critical']) / (sol_df['DLS_Critical'].abs().max())

    
sol_df

In [None]:
# Oil production data.
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["OIL"].diff().mean()
    old_folks.append(alpha)
    
        
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["OIL"] = np.array(main_arr)

    
sol_df

In [None]:
# Vibration Data.
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["ESP Data - Vibration X"].diff().mean()
    old_folks.append(alpha)
    
    
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["ESP Data - Vibration X"] = np.array(main_arr)

    
sol_df

In [None]:
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha1=mega_s[i][1][senior:overall_time]["OIL"].sum()
    alpha2=mega_s[i][1][0:overall_time]["OIL"].sum()
    alpha3=alpha1/alpha2
    alpha3 = alpha3/ (overall_time*0.1)
    old_folks.append(alpha3)

sol_df["Cumulative Percentage"] = np.array(old_folks)  # use this cell's values, not the stale main_arr

    
sol_df

In [None]:
# 'B' Feature
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["OIL"].diff().diff().mean()
    old_folks.append(alpha)
    
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["B"] = np.array(main_arr)

    
sol_df

In [None]:
old_folks=[]

for i in range (len(mega_s)):
    overall_time=mega_s_life[i]
    senior=overall_time-int((overall_time*0.1).round(0))
    alpha=mega_s[i][1][senior:overall_time]["OIL"].diff().diff().mean()
    old_folks.append(overall_time)
    
old_folks_arr = (np.array(old_folks))
main_arr = np.hstack((old_folks_arr))


sol_df["Age"] = np.array(main_arr)

    
sol_df

In [None]:
sol_df = sol_df.bfill()  # back-fill any remaining NaN values

In [None]:
sol_df = sol_df[sol_df["GAS/OIL"] <= 10]
sol_df = sol_df[sol_df["GAS/FLUID"] <= 10]

sol_df = sol_df[sol_df["GAS/OIL(std)"] <= 100]
sol_df = sol_df[sol_df["OIL"] >= -50]
sol_df = sol_df[sol_df["OIL"] <= 50]

In [None]:
top = np.percentile(sol_df["Age"].values,90)
bottom = np.percentile(sol_df["Age"].values,10)
sol_df = sol_df[sol_df["Age"] <= top]
sol_df = sol_df[sol_df["Age"] >= bottom]

In [None]:
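# Rename so the feature names match those used to fit the models ('Percentage Cumulative')
X_data = sol_df.rename(columns={'Cumulative Percentage': 'Percentage Cumulative'})[
    ['OIL','GAS/OIL','GAS/OIL(std)','GAS/FLUID', 'GAS/FLUID(std)','Percentage Cumulative','B','Age']]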

In [None]:
y_pred = clf.predict(X_data)

In [None]:
y_pred

In [None]:
mega_s_life

In [None]:
# SVM
y_pred_svm = clf2.predict(X_data)

In [None]:
y_pred_svm

In [None]:
sol_df.head()

In [None]:
plt.hist(sol_df["GAS/FLUID"])

In [None]:
sol_df["Predicted Labels"] = y_pred

In [None]:
sol_df
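
The notebook imports `Path` for writing files but stops at displaying `sol_df`. Below is a minimal sketch of how the labeled pumps could be written out and read back. The toy frame, the temporary directory, and the file name `predictions.csv` are assumptions; the submission format required by the hackathon is not given in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for sol_df with the columns built above (assumed values)
demo_sol = pd.DataFrame({
    "Well_ID": [101, 102],
    "Pump_ID": [1, 2],
    "Predicted Labels": [0, 1],
})

# Write to a temporary directory here; a real run would pick its own output folder
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "predictions.csv"
demo_sol[["Well_ID", "Pump_ID", "Predicted Labels"]].to_csv(out_path, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Because the outlier and age filters above remove rows from `sol_df`, it may be worth confirming that every pump to be predicted still has a row before writing the file.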