<a href="https://colab.research.google.com/github/SPE-PFAC01/ALCE/blob/main/MPFM_vfm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 4.0. Multiphase Flow Meter Virtualization
* The objective is to determine whether a virtual multiphase flow meter can be built with machine learning rather than physical modeling.
### Regression problem

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Load Python Libraries

In [2]:
# Data storage, exploration
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Data imputing
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# ML Model libraries

# The following three lines allow multiple and non-truncated outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
pd.set_option('display.max_columns', None)

# Import Data & Preliminary Data Exploration
1. How many data records?
2. How many variables / features or columns in each data record?
3. Peek at the first five records and the last five records

In [3]:
# Import Data
mpfm_file = '/content/drive/MyDrive/ALCE/MPFM.XLSX'
mpfm = pd.read_excel(mpfm_file)

In [None]:
mpfm.info()
mpfm.head()
mpfm.tail()

## Examine Data / think / discuss
1. Do we need time column?
2. Order of columns is not intuitive
   1. Outputs or targets __ref_oil_rate, ref_water_rate, ref_gas_rate__ are in the beginning
       1. multioutput regression problem
       2. Do we need all three? Won't just one do? Which one????
   2. Chokes and gas-lift rate impact production.
   3. Pressures can be organized together in the order from bottomhole to the surface.
   4. Temperature columns may be at the end just before targets.
3. **Note**: Well and Reservoir are provided as numeric columns. Generally, they are categorical variables (taking specific, distinct values rather than the continuous real values a numeric variable would take). The numbers are probably ordinal labels.
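
As a minimal sketch of the note on categorical variables, assuming a small made-up table (the values below are illustrative and not taken from MPFM.XLSX), converting identifier columns to the `category` dtype keeps them out of numeric summaries:

```python
import pandas as pd

# Hypothetical toy records; the real data comes from MPFM.XLSX
toy = pd.DataFrame({
    'well': [1, 2, 2, 3],
    'reservoir': [10, 10, 20, 20],
    'whp': [55.0, 61.5, 58.2, 49.9],
})

# Well and reservoir identifiers are labels, not measurements
toy = toy.astype({'well': 'category', 'reservoir': 'category'})

# Only the truly numeric columns remain in numeric summaries
numeric_cols = list(toy.select_dtypes('number').columns)
print(numeric_cols)
print(toy['well'].cat.categories.tolist())
```

With this dtype, calls such as `describe()` on the numeric subset no longer report a meaningless mean or standard deviation for well numbers.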

In [None]:
mpfm.drop(columns=['time'], inplace=True)
reorder = ['well', 'reservoir','chokeprod', 'chokegaslift', 'dhp', 'whp', 'chokepressdownstream', 'flowlinepressure', 'gasliftpressure', 'dht', 'wht', 'gasliftrate', 'ref_oil_rate', 'ref_water_rate', 'ref_gas_rate']
# Explicit copy avoids a possible SettingWithCopyWarning when a column is added below
mpfm = mpfm[reorder].copy()

# Total liquid rate = oil rate + water rate
mpfm['ref_liq_rate'] = mpfm.ref_oil_rate + mpfm.ref_water_rate

# Convert Well and Reservoir to categorical variables
categoricals = ['well', 'reservoir']
mpfm[categoricals] = mpfm[categoricals].astype('category')
# Numeric-only view used for statistics and modeling
mpfm_num = mpfm.drop(columns=categoricals)

mpfm_num.head()
mpfm_num.describe()

## Statistics on each column / feature / variable
1. What are NaNs in the first five records for <b>chokegaslift</b>? There could be more in other records for this variable... and for other(s).
2. Review how many measurements each variable has in the <b>count</b> row. Why does chokegaslift have fewer measurements? NaNs? How many?
3. Review Mean, min, std dev and percentile values for each variable.
    1. Could min-values be negative for <b>chokeprod, dhp</b>?
    2. How about min-values being zero for certain variables?
    3. What are the median values?
4. What does it mean when the 25th-percentile value is 0.0 for the <b>dhp, gasliftpressure, dht</b> variables?
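
To make the questions above concrete, here is a small sketch on made-up readings (the series `dhp_toy` is hypothetical, not the real `dhp` column). It shows that `count` excludes NaNs and that a 25th percentile of 0.0 means at least a quarter of the recorded values are zero or lower:

```python
import numpy as np
import pandas as pd

# Hypothetical gauge readings: zeros stand in for a sensor that reported nothing
dhp_toy = pd.Series([0.0, 0.0, 0.0, 5.0, 10.0, np.nan, 20.0, 30.0], name='dhp_toy')

summary = dhp_toy.describe()          # count ignores the NaN entry
n_missing = int(dhp_toy.isna().sum())
share_non_positive = float((dhp_toy <= 0.0).sum() / dhp_toy.count())

print(summary)
print('missing:', n_missing)
print('share of recorded values <= 0:', round(share_non_positive, 3))
```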

In [None]:
# plot statistics
plt.figure(figsize=(15,6));
plt.yscale("log");
plt.grid(axis='y');  # horizontal grid lines only; a bare 'y' would be read as visible=True
plt.xticks(rotation='vertical')
sns.boxplot(data=mpfm_num);

#### Boxplot Visualization
In the visualization above, why are some boxes very tall (long colored bars)?
1. Which variable has the smallest distribution?
2. Which variable is widely distributed?
3. What does it mean when one whisker is longer than the other?
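
A rough sketch of how the box and whiskers are derived, using a made-up skewed sample and assuming the common 1.5 × IQR whisker rule (the seaborn default):

```python
import numpy as np

# Hypothetical right-skewed sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box spans the first to third quartile; its height is the IQR
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('box:', q1, median, q3)
print('whiskers:', inside.min(), inside.max())
print('outliers:', outliers)
```

A tall box therefore signals a large IQR, and whiskers of unequal length point to a skewed distribution.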

In [None]:
plt.figure(figsize=(15,6));
plt.grid(axis='y');
sns.boxplot(data=mpfm, x="well", y="ref_liq_rate");
plt.figure(figsize=(15,6));
plt.grid(axis='y');
sns.boxplot(data=mpfm, x="reservoir", y="ref_liq_rate");

### Histograms
Help visualize how measurements are distributed.
Wouldn't we like them to be normally distributed?!?
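
One way to put a number on "how far from normal" is skewness. The sketch below uses synthetic lognormal samples as a stand-in for a skewed rate column (an assumption for illustration, not a statement about the MPFM data) and compares skewness before and after a log transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed measurements (lognormal)
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

raw_skew = raw.skew()           # strongly positive for a long right tail
log_skew = np.log(raw).skew()   # close to zero if the log values are near normal

print('skewness before log:', round(raw_skew, 2))
print('skewness after log: ', round(log_skew, 2))
```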

In [None]:
mpfm.hist(figsize=(20, 20));

In [None]:
# May skip running in class as it takes about 90s.
plot_vars=['chokeprod', 'dhp', 'whp', 'chokepressdownstream',
           #'flowlinepressure',
           #'ref_oil_rate', 'ref_water_rate','ref_gas_rate',
           'ref_liq_rate'
           ]

# Define a function to plot histogram and scatterplot for the specified variables/columns of provided dataframe
def plotPairgrid(df, plot_vars=['chokeprod', 'dhp', 'whp', 'chokepressdownstream', 'flowlinepressure',
                                'ref_oil_rate', 'ref_water_rate','ref_gas_rate']):
    g = sns.PairGrid(data=df, vars=plot_vars, hue='reservoir', diag_sharey=False);
#    g.map_upper(sns.scatterplot, s=15);
#    g.map_lower(sns.kdeplot);
    #g.map_diag(sns.kdeplot, lw=2);
    g.map_diag(sns.histplot);
    g.map_offdiag(sns.scatterplot);


plotPairgrid(mpfm, plot_vars);

### Correlations between Variables
Let's find out if two variables are correlated by calculating correlation coefficients between two variables.
1. Positive value (positive correlation) means one increases with another in the dataset; and
2. Negative value (negative correlation) means one decreases while another increases and vice versa.
3. Magnitude of the correlation coefficient indicates strength of the correlation.

#### Why do we want to perform this exercise?

There are multiple ways to perform this task. We will calculate Pearson and Spearman coefficients.
#### Pearson Correlation Coefficient
Pearson correlation measures linear association, and its usual interpretation assumes that the data being compared are normally distributed. When that assumption does not hold, the coefficient may not reflect the true association.

#### Spearman Rank Correlation
Spearman correlation does not assume that data is from a specific distribution, so it is a non-parametric correlation measure. Spearman correlation is also known as Spearman’s rank correlation as it computes correlation coefficient on rank values of the data.
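
The difference between the two coefficients can be sketched on made-up data with a monotonic but strongly non-linear relationship. Here Spearman is computed as Pearson on the ranks, which is one standard way to obtain it; the variables `x` and `y` are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical monotonic but strongly non-linear relationship
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

# Pearson measures linear association on the raw values
pearson = x.corr(y)  # default method is 'pearson'

# Spearman is Pearson applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print('Pearson: ', round(pearson, 3))
print('Spearman:', round(spearman, 3))
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman is 1, while Pearson is noticeably lower because the relationship is not a straight line.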

In [None]:
def plot_corrcoeff(method):
    fig, ax = plt.subplots(figsize=(7, 7))

    dfCorr = mpfm[plot_vars].corr(method=method)
    g1 = sns.heatmap(dfCorr, center=0.0, linewidths=0.3, square=True, annot=True, vmin=-1, vmax=1., fmt='1.2f')
    g1.set_xticklabels(g1.get_xticklabels(), rotation=90);
    g1.set_title(method.capitalize() + ' Correlation Coefficients - Heatmap')
    plt.show()

# Pearson Correlation Coefficient
plot_corrcoeff(method='pearson')

# Spearman Rank Correlation
plot_corrcoeff(method='spearman')

# Data Exploration
### Missing Data at Macro Level
1. In the box plots above, <b>chokeprod, chokegaslift, dhp, gasliftpressure, dht, gasliftrate</b> have near-zero or negative values.
2. There may be some null (NaN) measurements for some of these data columns.

In [None]:
# Q1. How many nulls are there?
mpfm.isnull().sum()

# Q2. How many values are zero or -ve
(mpfm[plot_vars] <= 0.0).sum()

## Data Exploration - Slightly Deeper Dive
1. <b>Missing Data</b>
    1. Which wells have null values for the chokegaslift variable?
    2. No gaslift pressure but +ve gaslift rate
2. <b>Illogical data</b>
    1. whp <= flp and Qliquid > 0.0
    2. dhp > 0.0 but <= whp
3. <b>Overall number of Impaired records</b>: Having one or more issues with data

<b>Any others, you'd like to check?</b>
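
A possible way to organize these checks, sketched on made-up records that reuse the notebook's column names (the values and the names of the check columns are hypothetical), is one named boolean column per rule:

```python
import pandas as pd

# Hypothetical records using the notebook's column names
toy = pd.DataFrame({
    'dhp': [250.0, 0.0, 40.0, 300.0],
    'whp': [60.0, 55.0, 50.0, 20.0],
    'flowlinepressure': [30.0, 25.0, 20.0, 35.0],
    'ref_liq_rate': [1200.0, 900.0, 700.0, 500.0],
})

# One named boolean column per check keeps each rule easy to audit
checks = pd.DataFrame({
    'dhp_missing': toy.dhp <= 0.0,
    'dhp_not_above_whp': (toy.dhp > 0.0) & (toy.dhp <= toy.whp),
    'whp_not_above_flp_while_flowing':
        (toy.whp <= toy.flowlinepressure) & (toy.ref_liq_rate > 0.0),
})

print(checks.sum())               # records failing each check
print(checks.any(axis=1).sum())   # records failing at least one check
```

Summing each column gives the per-issue counts, and `any(axis=1)` gives the overall number of impaired records without double counting.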

In [None]:
# Null records for the chokegaslift variable
mpfm[mpfm.chokegaslift.isnull()].well.unique()

# Qgl > 0.0 but Pgaslift is 0.0 or -ve?
((mpfm.gasliftpressure <= 0.0) & (mpfm.gasliftrate > 0.0)).sum()

# Records with whp <= flowlinepressure and liquid flowrate > 0.0  <-- 904
((mpfm.whp <= mpfm.flowlinepressure) &  ((mpfm.ref_oil_rate + mpfm.ref_water_rate) > 0.0)).sum()

# Records with one or more issues
# (noted counts: WHP <= FLP: 904; dhp <= 0: 17589; ProdChoke -ve: 181; total: 17781)
((mpfm.dhp <= 0.0) | (mpfm.chokeprod <= 0.0) |
 ((mpfm.whp <= mpfm.flowlinepressure) & ((mpfm.ref_oil_rate + mpfm.ref_water_rate) > 0.0)) |
 (mpfm.dht <= 0.0) | (mpfm.wht <= 0.0) | (mpfm.gasliftrate <= 0.0) | (mpfm.gasliftpressure <= 0.0)
 ).sum()

## Data Cleaning
1. Convert negatives and zeros to NaN
2. Production choke has some weird values around 0 (< 1.0). Set them to NaN
3. Recheck histograms
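
The effect of the first two cleaning steps can be sketched on a tiny made-up table (the values are hypothetical; only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'chokeprod': [-0.5, 0.2, 35.0],
                    'dhp': [0.0, 180.0, 175.0]})

# Step 1: keep strictly positive values, replace everything else with NaN
positives = toy.where(toy > 0.0, other=np.nan)

# Step 2: treat near-zero choke openings (< 1.0) as missing too
positives['chokeprod'] = positives['chokeprod'].where(positives['chokeprod'] >= 1.0)

print(positives)
print(positives.isna().sum())
```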

In [None]:
mpfm_positives = mpfm_num.where(mpfm_num > 0.0, other=np.nan)

mpfm_positives['chokeprod'] = np.where(mpfm_positives['chokeprod'] < 1.0, np.nan, mpfm_positives['chokeprod'])

plt.figure(figsize=(12, 4));
plt.subplot(121, title='Positive ChokeProd');
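# A log-scaled axis cannot show zero or negative values, so plot only the positive ones
sns.histplot(x=mpfm.chokeprod[mpfm.chokeprod > 0.0], log_scale=(True, True));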

plt.subplot(122, title='ChokeProd > 1.0');
aa = sns.histplot(x=mpfm_positives.chokeprod, log_scale=(True, True));
aa.set_xlim(0.6, 150);

In [15]:
# Separate targets from inputs
targets = ['ref_oil_rate', 'ref_water_rate', 'ref_gas_rate']

y = mpfm_positives[targets].to_numpy()
# ref_liq_rate is the sum of two targets, so it must not leak into the inputs
X = mpfm_positives.drop(columns=targets + ['ref_liq_rate']).to_numpy()

In [None]:
# Do we have nulls in targets and inputs?
np.isnan(y).any(), np.isnan(X).any()

### Impute missing values
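
As a minimal sketch of what nearest-neighbor imputation does, here is `KNNImputer` applied to a small made-up matrix (`X_toy` is illustrative; the notebook applies the same class to the full input matrix with `n_neighbors=5`):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small made-up matrix with two gaps
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that column over the 2 nearest rows,
# where distance is measured on the columns both rows have observed
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)
print(X_filled)
```

Observed values are left unchanged; only the NaN entries are replaced.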

In [18]:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Replace missing values by nearest neighbor
imputer = KNNImputer(n_neighbors = 5)
X = imputer.fit_transform(X)

In [None]:
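# Normalize inputs
# Note: Normalizer rescales each row (sample) to unit norm; it does not scale each feature.
# The result is only displayed here and is not assigned back to X,
# so the models below are trained on the unscaled inputs.
from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(X)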

### Sub-divide dataset into training and testing: 70 - 30% split

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3, random_state=1002)

### Method Evaluation using Multiple Metrics
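
For reference, the three metrics can be worked by hand on two made-up points. This is a sketch of the standard single-output definitions written with NumPy, not a copy of the scikit-learn code; `y_true` and `y_pred` are hypothetical values:

```python
import numpy as np

# Hypothetical reference and predicted rates
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

err = y_pred - y_true
rmse = np.sqrt(np.mean(err ** 2))              # same units as the target
mape = np.mean(np.abs(err) / np.abs(y_true))   # a fraction, not a percentage
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse, mape, r2)
```

Note that MAPE defined this way is a fraction (0.1 means 10%), which matters when labeling plots. With several target columns, the scikit-learn functions average across outputs by default, so a single number can mix targets with different units.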

In [21]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, r2_score

def calc_predMetrics(y_true, y_pred, method):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {'method':method, 'rmse':rmse, 'MAPE':mape, 'R2':r2}

pred_perf_metric = []

### Decision Tree Regressor

In [22]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=100, splitter='best',
                           criterion='squared_error', random_state=1002)
y_pred_dt = dt.fit(X_train, y_train).predict(X_test)

pred_perf_metric.append(calc_predMetrics(y_test, y_pred_dt, method="Decision Tree"))

### Support Vector Machine Regressor

In [None]:
# Takes very long so commented out for class exercise
'''
from sklearn.multioutput import MultiOutputRegressor

from sklearn.svm import SVR
svr = SVR(kernel="rbf")
y_pred_svr = MultiOutputRegressor(svr).fit(X_train, y_train).predict(X_test)
pred_perf_metric.append(
    calc_predMetrics(y_test, y_pred_svr, method="Support Vector Regression"))
'''

### Random Forest Regressor

In [24]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000, max_depth=1000,
                           criterion='squared_error', random_state=0)
y_pred_rf = rf.fit(X_train, y_train).predict(X_test)
pred_perf_metric.append(calc_predMetrics(y_test, y_pred_rf, method="Random Forest" ))

### Extra Trees Regressor

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
et = ExtraTreesRegressor(n_estimators=1000, max_depth=1000,
                           criterion='squared_error', random_state=0)
y_pred_et = et.fit(X_train, y_train).predict(X_test)
pred_perf_metric.append(calc_predMetrics(y_test, y_pred_et, method="Extra Trees"))

### XGBoost

In [None]:
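import xgboost as xgb
# Imported here because the SVR cell that imports it is commented out
from sklearn.multioutput import MultiOutputRegressor

xgbm = xgb.XGBRegressor()
# Fit one XGBoost regressor per target column
y_pred_xgb = MultiOutputRegressor(xgbm).fit(X_train, y_train).predict(X_test)
pred_perf_metric.append(calc_predMetrics(y_test, y_pred_xgb, method="XGBoost"))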

### ANN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop  # the optimizer used in compile() below

nn = Sequential()
nn.add(Dense(200, input_dim=X.shape[1], activation='selu', kernel_initializer='he_uniform'))
nn.add(Dense(50, activation='selu', kernel_initializer='he_uniform'))
# One output unit per target column (three targets here)
nn.add(Dense(y_train.shape[1], activation='selu'))

# compile model
nn.compile(loss='mean_squared_error', optimizer=RMSprop(learning_rate=0.01))
# fit model
nn.fit(X_train, y_train, epochs=100, verbose=0)
# evaluate the model
y_pred_nn = nn.predict(X_test)
pred_perf_metric.append(calc_predMetrics(y_test, y_pred_nn, method="ANN"))

In [None]:
X.shape

In [None]:
# Plot Relative Performances
plotDf = pd.DataFrame(pred_perf_metric)
fig, axs = plt.subplots(1, 3, figsize=(12, 6), sharey=True)
rect1 = axs[0].barh(plotDf.method, plotDf.rmse)
axs[0].set_title('RMSE, bbls/day')
axs[0].bar_label(rect1, padding=1, fmt='%.1f')
axs[0].set_xlim(0., 4000.)

rect2 = axs[1].barh(plotDf.method, plotDf.MAPE)
axs[1].set_title('MAPE (fraction)')  # sklearn returns MAPE as a fraction, not a percentage
axs[1].bar_label(rect2, padding=1, fmt='%.3f')
axs[1].set_xlim(0., 0.6)

rect3 = axs[2].barh(plotDf.method, plotDf.R2)
axs[2].set_title('R2 Coefficient')
axs[2].bar_label(rect3, padding=0, fmt='%.3f')
axs[2].set_xlim(0., 1.2)

fig.suptitle('ML Methods Comparison')