# Random Forest for Predicting Continuous Well Measurements

## Importing Libraries and Loading Data

In [1]:
import joblib


In [2]:
import pandas as pd


In [3]:
df = pd.read_csv('volve_wells.csv', usecols=['WELL', 'DEPTH', 'RHOB', 'GR', 'NPHI', 'PEF', 'DT'])

## Create Training, Testing and Validation Datasets

Our dataset should contain 4 wells. We can confirm this by calling the `unique()` method on the `WELL` column.

In [4]:
df['WELL'].unique()

array(['15/9-F-11 B', '15/9-F-11 A', '15/9-F-1 B', '15/9-F-1 A'],
      dtype=object)

Because the measurements come from multiple wells, one way to split the data into training and testing sets is to set aside a single well (a blind test well), which is then used to see how the model performs on unseen data.

In [5]:
# Training Wells
training_wells = ['15/9-F-11 B', '15/9-F-11 A', '15/9-F-1 A']

# Test Well
test_well = ['15/9-F-1 B']

Extract the data for each group from the main dataframe using the well lists above.

In [6]:
train_val_df = df[df['WELL'].isin(training_wells)].copy()
test_df = df[df['WELL'].isin(test_well)].copy()

In [7]:
train_val_df.describe()

| | DEPTH | DT | GR | NPHI | PEF | RHOB |
|---|---|---|---|---|---|---|
| count | 116914.0 | 21699.0 | 115933.0 | 37587.0 | 37668.0 | 37668.0 |
| mean | 2154.233438 | 77.252247 | 51.823119 | 0.174302 | 6.450603 | 2.443072 |
| std | 1180.976133 | 14.350893 | 37.606884 | 0.08566 | 1.478121 | 0.166466 |
| min | 145.9 | 53.165 | 0.1491 | 0.01 | 3.647 | 1.627 |
| 25% | 1148.525 | 66.85445 | 22.1261 | 0.115 | 5.07885 | 2.276 |
| 50% | 2122.8 | 72.724 | 52.217 | 0.163 | 6.5487 | 2.501 |
| 75% | 3097.1 | 86.1323 | 74.201 | 0.2121 | 7.728625 | 2.577 |
| max | 4770.2 | 126.827 | 1124.403 | 0.5932 | 13.841 | 3.09 |


In [8]:
test_df

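| | DEPTH | DT | GR | NPHI | PEF | RHOB | WELL |
|---|---|---|---|---|---|---|---|
| 81553 | 145.9 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 81554 | 146.0 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 81555 | 146.1 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 81556 | 146.2 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 81557 | 146.3 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 114739 | 3464.5 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 114740 | 3464.6 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 114741 | 3464.7 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |
| 114742 | 3464.8 | NaN | NaN | NaN | NaN | NaN | 15/9-F-1 B |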


### Remove NaN Values From Dataframe
Removing missing values from the dataframe is one way to deal with them; however, doing so reduces the amount of training data available. Other methods can be used to infill the NaNs with sensible values.
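As one illustration of infilling rather than dropping, the sketch below interpolates short gaps within each well using pandas. The toy dataframe, the `GR_filled` column name, the `limit=2` gap size and the choice of linear interpolation are all assumptions made for illustration; they are not part of the workflow in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the well logs (values are made up for illustration).
toy = pd.DataFrame({
    "WELL": ["A", "A", "A", "A", "B", "B", "B"],
    "DEPTH": [100.0, 100.1, 100.2, 100.3, 100.0, 100.1, 100.2],
    "GR": [50.0, np.nan, 54.0, 56.0, 20.0, np.nan, np.nan],
})

# Interpolate GR linearly within each well so values never leak across wells.
# limit=2 fills at most two consecutive NaNs; limit_area="inside" only fills
# gaps bracketed by valid measurements (no extrapolation past the last reading).
toy["GR_filled"] = toy.groupby("WELL")["GR"].transform(
    lambda s: s.interpolate(method="linear", limit=2, limit_area="inside")
)

print(toy)
```

In this toy case the single gap in well A is filled from its neighbours, while the trailing gap in well B is left as NaN because there is no later measurement to interpolate towards.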

In [9]:
train_val_df.isna().sum()

DEPTH        0
DT       95215
GR         981
NPHI     79327
PEF      79246
RHOB     79246
WELL         0
dtype: int64

In [10]:
train_val_df = train_val_df.dropna()
test_df = test_df.dropna()

In [11]:
# Rows containing NaNs were already dropped in the previous cell,
# so only the summary statistics are needed here.
train_val_df.describe()


| | DEPTH | DT | GR | NPHI | PEF | RHOB |
|---|---|---|---|---|---|---|
| count | 21688.0 | 21688.0 | 21688.0 | 21688.0 | 21688.0 | 21688.0 |
| mean | 3141.098875 | 77.235857 | 39.803246 | 0.166648 | 7.093603 | 2.475232 |
| std | 314.723749 | 14.336048 | 57.907158 | 0.0992 | 1.188313 | 0.147635 |
| min | 2577.0 | 53.165 | 0.852 | 0.01 | 4.2978 | 1.9806 |
| 25% | 2869.475 | 66.8493 | 9.41635 | 0.096 | 6.218475 | 2.379 |
| 50% | 3140.55 | 72.72075 | 27.552 | 0.136 | 7.4877 | 2.533 |
| 75% | 3411.625 | 86.0938 | 44.877425 | 0.2172 | 8.001 | 2.5814 |
| max | 3723.3 | 126.827 | 1124.403 | 0.5932 | 13.841 | 3.025 |


In [12]:
train_val_df.isna().sum()

DEPTH    0
DT       0
GR       0
NPHI     0
PEF      0
RHOB     0
WELL     0
dtype: int64

In [13]:
train_val_df.to_csv("notnull.csv")

## Implementing the Random Forest Model

In [14]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

### Selecting Training and Target Features

In [15]:
X = train_val_df[['RHOB', 'GR', 'NPHI', 'PEF']]
y = train_val_df['DT']

Note that `train_test_split` calls its held-out portion "test", a name commonly used within machine learning. In this case, the resulting variables `X_val` and `y_val` are our validation data. In other words, they are used to help tune our model, while the blind test well remains unseen.

In [16]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

In [17]:
y_val.shape

(4338,)

The shape of `y_val` above confirms that the data has been split as expected: 4,338 of the 21,688 rows (about 20%) are held out for validation.
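Because `train_test_split` shuffles the rows randomly, the split in this notebook will differ from run to run, and so will the metrics that follow. A minimal sketch of a reproducible split is shown below; the `random_state=42` value and the toy arrays are assumptions for illustration and are not set in the original call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 10 samples with 4 features (made-up values).
X_toy = np.arange(40).reshape(10, 4)
y_toy = np.arange(10)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_va.shape)
```

With 10 samples and `test_size=0.2`, 8 rows go to training and 2 to validation.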

### Building the Model

In [18]:
regr = RandomForestRegressor()

In [19]:
regr.fit(X_train, y_train)

In [20]:
train_pred = regr.predict(X_train)

In [21]:
y_pred = regr.predict(X_val)

In [22]:
y_pred

array([89.836215, 71.814403, 96.054845, ..., 67.451919, 74.026599,
       89.676292], shape=(4338,))

### Check the Prediction Results

In [23]:
metrics.mean_absolute_error(y_val, y_pred)

1.6117188916551404

In [24]:
mse = metrics.mean_squared_error(y_val, y_pred)
rmse = mse ** 0.5

In [25]:
rmse

2.853462935520159

In [26]:
val_r2 = metrics.r2_score(y_val, y_pred)
val_r2

0.9603986729568894

In [27]:
train_r2 = metrics.r2_score(y_train, train_pred)
train_r2

0.9938961484758699
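To make clear what the three metrics above measure, the following sketch computes the mean absolute error, the root mean squared error and R² by hand in plain Python. The function name `regression_metrics` and the input values are hypothetical and made up for illustration; they are not taken from the well data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average size of the errors, ignoring sign.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: penalises large errors more heavily.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 minus the ratio of residual variation to total variation.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up values for illustration only.
y_true_toy = [70.0, 80.0, 90.0, 100.0]
y_pred_toy = [72.0, 78.0, 91.0, 99.0]
mae, rmse, r2 = regression_metrics(y_true_toy, y_pred_toy)
print(mae, rmse, r2)
```

For these toy values the errors are -2, 2, -1 and 1, giving an MAE of 1.5, an RMSE of the square root of 2.5, and an R² of 0.98.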

Simple metrics like those above are a nice way to see how a model has performed, but you should always check the actual data.

In the plot below, we are comparing the real data against the predicted data.
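A comparison of this kind could be drawn as in the following sketch, which assumes matplotlib is installed and uses made-up stand-ins (`actual`, `predicted`) in place of `y_val` and `y_pred`. The axis labels and styling are illustrative choices, not the original figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for y_val and y_pred (random values, for illustration only).
rng = np.random.default_rng(0)
actual = rng.uniform(55, 125, size=200)
predicted = actual + rng.normal(0, 3, size=200)

fig, ax = plt.subplots(figsize=(5, 5))

# Each point is one sample: actual value on x, predicted value on y.
ax.scatter(actual, predicted, s=8, alpha=0.5)

# 1:1 reference line; points lying on it are perfect predictions.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, color="red")

ax.set_xlabel("Actual DT")
ax.set_ylabel("Predicted DT")
```

Points scattered far from the red line indicate samples where the model performs poorly, which summary metrics alone can hide.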

In [29]:
# Save the trained model to disk
joblib.dump(regr, "model.pkl")

['model.pkl']
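A saved model can later be restored with `joblib.load`. The sketch below is a self-contained round trip on made-up data: it trains a small forest, saves it to a temporary directory, loads it back and checks that the predictions match. The toy arrays, `n_estimators=10` and `random_state=0` are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up training data standing in for the four log features and the DT target.
rng = np.random.default_rng(0)
X_toy = rng.random((50, 4))
y_toy = X_toy.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Save to a temporary directory, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

In this notebook, a model restored from `model.pkl` could then be applied to the blind test well with something like `restored.predict(test_df[['RHOB', 'GR', 'NPHI', 'PEF']])`, a step the section above does not show. Saved models are best loaded with the same scikit-learn version that created them.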