<span style='color:gray'> <span style="font-size:25px;"> **Development of "Machine Learning Models"  (Workflow)**
    
In this notebook, the well-log data extracted from a DLIS file (after preprocessing, sorting, and finalizing) is loaded as input for two machine learning (ML) models:
* Random Forest Regressor
* Gradient Boosting Regressor
    
    
For the prediction of petrophysical properties, such as porosity, permeability and water saturation, these two Regressor models **Random Forest Regressor** and **Gradient Boosting Regressor** are suitable.

They are Ensemble Based Tree Methods; they are based on the generation of Decision Trees.

We use Regression Models since we want to predict a continuous variable.

**Advantages** of the 2 regression models, since they are based on Decision Trees:

* They do not need the normalization or scaling of the original dataset;
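* They are relatively insensitive to outliers in the input features, because splits depend on the ordering of feature values rather than on their magnitude, so outlier detection and removal are less critical. (Outliers in the target can still influence a model trained with a squared-error criterion.)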
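The first point above can be checked empirically. The following is a minimal sketch on synthetic data (not the well-log data; the names `X_demo`, `y_demo`, `rf_raw`, and `rf_scaled` are illustrative assumptions). It fits the same Random Forest on raw and on standardized features and compares the predictions, which should agree up to floating-point effects.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the well logs): two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 200, size=100),   # a GR-like value range
    rng.uniform(0, 0.4, size=100),   # an NPHI-like value range
])
y_demo = 0.01 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 0.05, size=100)

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Same model settings and seed, once on raw and once on scaled features
rf_raw = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)
rf_scaled = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_scaled, y_demo)

pred_raw = rf_raw.predict(X_demo)
pred_scaled = rf_scaled.predict(X_scaled)

print("Max absolute difference:", np.max(np.abs(pred_raw - pred_scaled)))
```

Standardization preserves the ordering of each feature, so the trees choose the same partitions of the samples. This would not hold for models such as SVR or k-nearest neighbours, which depend on distances.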

**==================================================================================================================**
    
In well-log machine learning models, the choice between regression and classification (Supervised ML) depends on the nature of the problem you are trying to solve and the type of data you have. Let's break down the reasons why regression is often preferred over classification in this context:

**Continuous Output**: Well-log data often involves continuous measurements such as porosity, permeability, resistivity, and other geological properties. Regression is well-suited for predicting and modeling continuous numerical values. Classification, on the other hand, is typically used when the output is categorical or discrete, like classifying lithology or rock types.

**Data Distribution**: Well-log data tends to have a wide range of continuous values. Using classification would require discretizing this data into bins or classes, which can lead to loss of information and potentially introduce biases. Regression models can capture the nuances and variations present in the continuous data more effectively.

**Evaluation Metrics**: Regression models are evaluated using metrics such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute error (MAE). These metrics are well-suited for measuring the accuracy of predictions involving continuous values. Classification models, on the other hand, use metrics like accuracy, precision, recall, and F1-score, which are designed for categorical predictions.
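As a small worked example of the regression metrics named above, the sketch below computes MSE, RMSE, MAE, and R² for four hypothetical values (`y_true` and `y_hat` are made up for illustration and are not taken from the well logs).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true and predicted values (illustrative only)
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_hat = np.array([2.5, 2.5, 4.5, 3.0])

mse = mean_squared_error(y_true, y_hat)    # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_hat)   # mean of absolute errors
r2 = r2_score(y_true, y_hat)               # 1 - SS_res / SS_tot

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
# MSE=0.1875  RMSE=0.4330  MAE=0.3750  R2=0.4000
```

The errors are 0.5, 0, -0.5, and 0.5, so the squared errors sum to 0.75 and the absolute errors to 1.5; dividing by four gives the MSE and MAE shown.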

**Feature Importance**: Well-log data analysis often involves understanding the relationships between different geological features and the target property. Regression models can provide insights into the quantitative impact of each feature on the predicted values, aiding in geological interpretation.
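One way to obtain such insights from a tree ensemble is the `feature_importances_` attribute of a fitted scikit-learn model. The sketch below uses synthetic data in which the target is built to depend strongly on the first feature and weakly on the second; the feature names and coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on feature_0 and weakly on feature_1
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 2))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.05, size=500)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances are normalized so that they sum to 1
importances = dict(zip(["feature_0", "feature_1"], rf_demo.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

These importances are relative shares of impurity reduction, not physical effect sizes, so they are best read as a ranking of the inputs.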



<span style='color:gray'> <span style="font-size:20px;"> 
**Importing Libraries, Regressors, and Required Dependencies**

In [30]:
%pip install --quiet --upgrade scikit-learn==1.2.2
%pip install --quiet qbstyles


# Importing the dependencies
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

from qbstyles import mpl_style
mpl_style(dark=False)  # Set light matplotlib style

import matplotlib.patches as mpatches  # To create a legend with a color box
import pickle

# Importing the models 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor
                                         
from sklearn.model_selection import RandomizedSearchCV

# Data splitting (train_test_split) and cross-validation utilities (cross_val_score, KFold)

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold 

# Regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# The package "Matplotlib Inline Back-end" provides support for Matplotlib to display figures directly inline
# "svg" stands for "scalable vector graphic". The plot can be scaled without compromising its quality
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg') 

Note: you may need to restart the kernel to use updated packages.


<span style='color:gray'> <span style="font-size:20px;"> 
**Loading the data file extracted from a DLIS or LAS file (after sorting, cleaning, preprocessing, and selecting the required logs)**

Load the **csv** well-log data into a pandas DataFrame.

In [31]:
file_path = '/Users/amirhosseinakhondzadeh/CODE_WELLLOGS/Petrobras Well-log Analysis/Processed Data (out put of preprocessing == Input of ML)/df0_ML.csv'
df = pd.read_csv(file_path)
df

| Unnamed: 0 | DEPTH | GR | RHGX_HILT | NPHI | AT10 | AT20 | AT30 | AT60 | AT90 | PEFZ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3241.2432 | 43.603180 | 2.831039 | 0.011302 | 51.794643 | 61.238102 | 52.368523 | 47.517567 | 38.26941 | 3.912346 |
| 1 | 3241.3955 | 43.603180 | 2.831039 | 0.011302 | 51.794643 | 61.238102 | 52.368523 | 47.517567 | 38.26941 | 3.912346 |
| 2 | 3241.5480 | 31.196218 | 2.831039 | 0.011302 | 51.794643 | 61.238102 | 52.368523 | 47.517567 | 38.26941 | 3.912346 |
| 3 | 3241.7004 | 22.927324 | 2.831039 | 0.011413 | 51.794643 | 61.238102 | 52.368523 | 47.517567 | 38.26941 | 3.912346 |
| 4 | 3241.8528 | 25.734980 | 2.832985 | 0.011976 | 51.794643 | 61.238102 | 52.368523 | 47.517567 | 38.26941 | 3.912346 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1141 | 3415.1316 | 219.444870 | 2.834162 | 0.091268 | 6.251309 | 1950.000000 | 101.902054 | 419.289830 | 207.16121 | 3.492837 |
| 1142 | 3415.2840 | 219.444870 | 2.833239 | 0.091268 | 6.248991 | 1950.000000 | 101.493515 | 412.861100 | 209.38712 | 3.481267 |
| 1143 | 3415.4365 | 219.444870 | 2.832363 | 0.091268 | 6.247348 | 1950.000000 | 101.377750 | 416.689060 | 214.43398 | 3.470307 |
| 1144 | 3415.5889 | 219.444870 | 2.831719 | 0.091268 | 6.247392 | 1950.000000 | 101.574510 | 427.813420 | 221.12366 | 3.467056 |


<span style='color:gray'> <span style="font-size:20px;"> 
**Defining the Predictors (X) and the Target (y)**

In [32]:
predictors = ["GR","NPHI"] 
X = df[predictors]
y = df["PEFZ"]

<span style='color:gray'> <span style="font-size:20px;"> 
    
**Training & Test Well Log Datasets**

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

* **Data Splitting Function**: We utilize the "train_test_split" function to perform the splitting of our data.


* **Variables to Split**: The data we're working with consists of two main variables, denoted as X and y. These are the entities that we want to partition.


* **Training Set Size**: Instead of specifying an exact training set size, we have the option to leave it as "None." In this case, the function will automatically determine the training size based on the complement of the test size. The test size is set at 20%, meaning 80% will be allocated to the training set.


* **Test Set Size**: We assign a test size of 20%, indicating that one-fifth of the entire dataset will be allocated for testing the model's performance.


* **Shuffling Data**: The default behavior is to shuffle the data prior to splitting. Because the rows are ordered by depth, this randomization helps the training and test sets cover similar value ranges instead of two separate depth intervals.


* **Reproducibility with Random Seed**: For the sake of reproducibility across multiple runs of the function, we pass an integer value known as the "random state." Here, we've chosen the value 42. The random state only has an effect when shuffling is enabled.


*In essence, we use the "train_test_split" function to divide our data into training and test portions. We provide our data variables X and y, and the function handles the allocation. The training size is the complement of the test size, which is set at 20%. Shuffling randomizes which rows go to each portion, and the random state value of 42 makes that shuffle repeatable across runs.*
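The behaviour described above can be verified on a stand-in array. This is a minimal sketch: `X_demo` and `y_demo` are synthetic placeholders with the same number of rows (1146) as the DataFrame, not the actual well-log values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the notebook's DataFrame
X_demo = np.arange(1146 * 2).reshape(1146, 2)
y_demo = np.arange(1146)

# train_size is left as None, so it becomes the complement of test_size
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(Xa_train.shape, Xa_test.shape)  # (916, 2) (230, 2)

# The same random_state reproduces exactly the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(ya_test, yb_test)
print("Identical test rows on a second call:", same_split)

# With shuffle=False the test set is simply the last 20% of the rows
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False)
print("First and last test row without shuffling:", yc_test[0], yc_test[-1])
```

The test size is rounded up (20% of 1146 is 229.2, giving 230 rows), which leaves 916 rows for training.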

In [34]:
print(X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1146, 2) (1146,) (916, 2) (230, 2) (916,) (230,)


<span style='color:gray'> <span style="font-size:20px;"> 
**Cross-Validation: Enhancing Model Evaluation**

<span style='color:black'> <span style='font-size:14px;'> **Cross-validation is a technique used to evaluate the performance of a machine learning model. It works by dividing the training dataset into k subsets, called folds. The model is then trained on k-1 folds and evaluated on the remaining fold. This procedure is repeated k times, with each fold used as the validation set exactly once. The average score of the model across all k folds (here, the coefficient of determination R²) is then used as an estimate of the model's overall performance.**</span></span>

* We use k-fold cross-validation to evaluate the performance of the two models on the training dataset. This helps us choose the model that is most likely to generalize well to unseen data.

* This step can be skipped, since cross-validation is carried out again during hyperparameter optimization. It is still worthwhile to run it on the training dataset first, because it gives a baseline score for the default models against which the tuned models can be compared.

Here are some additional details about k-fold cross-validation:

- The value of k is typically chosen to be between 5 and 10.
- The folds should be created randomly, to avoid any bias in the results.
- The model should be trained and evaluated on the same set of features for each fold.
- For regression, the performance of the model is typically measured using a metric such as the coefficient of determination (R²).


K-fold cross-validation is a powerful technique for evaluating the performance of machine learning models. It is more reliable than training and evaluating the model on a single holdout split, because averaging over k folds makes the estimate less dependent on how one particular split happened to fall.
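To make the procedure concrete, the sketch below writes the k-fold loop out by hand and compares it with `cross_val_score`. It runs on synthetic data (`X_demo`, `y_demo`) with a small forest and 5 folds, chosen only to keep the example quick; these settings are assumptions, not the notebook's configuration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=42)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on k-1 folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in kf_demo.split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], fold_pred))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop internally
library_scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring="r2")

print("Manual R2 per fold: ", np.round(manual_scores, 4))
print("Library R2 per fold:", np.round(library_scores, 4))
print("Mean R2:", round(manual_scores.mean(), 4))
```

Fixing `random_state` on both the model and the `KFold` object is what makes the two sets of scores comparable fold by fold.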

<span style='color:gray'> <span style="font-size:20px;"> 
**Creating Models + Cross-Validation [evaluating the performance of a machine learning model]**

Create the **Random Forest** and **Gradient Boosting** models.

Use a **for loop** to iterate over the models and compare them.

In [36]:
rf_model = RandomForestRegressor(random_state=42)  # Random Forest Model 
gb_model = GradientBoostingRegressor(random_state=42)  # Gradient Boosting Model 

models = [rf_model, gb_model]

kf = KFold(n_splits=10, shuffle=True, random_state=42)  # Number of folds 

def compare_models_cv():
    """Print the per-fold and average R² of each model under 10-fold cross-validation."""
    for model in models:
        # Named r2_scores so that it does not shadow the imported r2_score function
        r2_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')
        r2_scores = np.round(r2_scores, 4)
        mean_r2 = round(np.mean(r2_scores) * 100, 2)  # average R² as a percentage

        print('Coefficient of Determination for', model, '=', r2_scores)
        print('Average % Coefficient of Determination for', model, '=', mean_r2)
        print('============================================')

compare_models_cv()

Coefficient of Determination for RandomForestRegressor(random_state=42) = [0.6216 0.709  0.7017 0.4652 0.6325 0.5615 0.5508 0.5754 0.698  0.5291]
Average % Coefficient of Determination for RandomForestRegressor(random_state=42) = 60.45
Coefficient of Determination for GradientBoostingRegressor(random_state=42) = [0.6433 0.7415 0.6819 0.4158 0.5621 0.5719 0.5457 0.5043 0.72   0.5193]
Average % Coefficient of Determination for GradientBoostingRegressor(random_state=42) = 59.06


<span style='color:gray'> <span style="font-size:20px;"> 
**Hyperparameter Tuning (Randomized Search CV) - Optimization Problem**

We return to the training dataset and use the Randomized Search Cross Validation technique to determine **the best hyperparameter values** for <span style='color:blue'> <span style="font-size:15px;"> the Random Forest</span> </span>
and
<span style='color:blue'> <span style="font-size:15px;"> Gradient Boosting models </span> </span>.

*To start, we define a grid of hyperparameters that will be randomly sampled when calling the RandomizedSearchCV() function. The models are then cross-validated on these random combinations of hyperparameters.*

**The arguments passed to the RandomizedSearchCV() function are:**

* The estimator, instantiated with its default hyperparameters
* The grid of hyperparameters
* The number of combinations to be randomly sampled (n_iter=20)
* The number of folds into which the training dataset is split (cv=10)

For each of the two models, the search returns the best-scoring combination of hyperparameters among those sampled.

**Here is a more detailed explanation of each parameter:**

* **Model**: The estimator instantiated with its default hyperparameters; the search overrides the ones listed in the grid. In this case, we use the Random Forest and Gradient Boosting models.
* **Grid of hyperparameters**: The grid defines the candidate values for each hyperparameter. Sampling combinations at random lets us cover a large grid at a fraction of the cost of an exhaustive grid search over every combination.
* **Number of combinations to be randomly sampled (n_iter=20)**: The n_iter parameter specifies the number of random combinations of hyperparameters to be sampled. In this case, we are sampling 20 combinations.
* **Number of k-folds into which the training dataset is split (cv=10)**: The cv parameter specifies the number of k-folds to use for cross-validation. In this case, we are using 10 folds.
* The RandomizedSearchCV() function will randomly sample 20 combinations of hyperparameters from the grid and cross-validate each combination on 10 folds of the training dataset. The function will then return the combination of hyperparameters that resulted in the best cross-validation score.

This technique allows us to explore a wide range of hyperparameter values efficiently and to find a good combination for our models without evaluating every possible one.
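The sampling step can be inspected on its own with scikit-learn's `ParameterSampler`, which draws parameter combinations from a grid in the same way. The sketch below assumes the Random Forest grid values used later in this notebook; `demo_grid`, `sampled`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Assumed grid: the Random Forest values used later in this notebook
demo_grid = {
    "n_estimators": [100, 150, 200, 250, 300, 350, 400],
    "max_depth": [5, 10, 15, 20, 25],
    "criterion": ["squared_error"],
}

# Size of the full grid: 7 * 5 * 1 = 35 combinations
n_total = len(ParameterGrid(demo_grid))

# Draw 20 combinations; when every entry is a list, sampling is without replacement
sampled = list(ParameterSampler(demo_grid, n_iter=20, random_state=42))

# Each sampled combination is cross-validated on 10 folds
n_fits = len(sampled) * 10

print("Combinations in the full grid:", n_total)
print("Combinations sampled:", len(sampled))
print("Model fits during the search:", n_fits)
print("First sampled combination:", sampled[0])
```

With these settings the search evaluates 20 of the 35 possible combinations, that is 200 fits instead of 350, plus one final refit on the whole training set because `refit=True` by default.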

<span style='color:gray'> <span style="font-size:15px;"> 
**Random Forest Model**

We consider the following hyperparameters:

* n_estimators = number of trees in the forest;
* max_depth = the maximum depth of the tree;
* criterion = the function that measures the quality of the split.

In [39]:
# RANDOM FOREST Hyperparameters

# Number of trees to be used
rf_n_estimators = [100, 150, 200, 250, 300, 350, 400]

# Maximum number of levels in tree
rf_max_depth = [5, 10, 15, 20, 25]

# Criterion to split on
rf_criterion = ['squared_error']                         # "squared_error" is the default, so this entry is optional

# Create the grid 
rf_grid = {'n_estimators': rf_n_estimators,
           'max_depth': rf_max_depth,
           'criterion': rf_criterion}

In [41]:
# Model to be tuned 
rf_model = RandomForestRegressor(random_state=42)        # bootstrap=True by default (the estimator has no shuffle parameter)

# Create the random search Random Forest 
rf_random = RandomizedSearchCV(rf_model, rf_grid, n_iter=20, cv=10, random_state=42)

# Fit the random search model 
rf_random.fit(X_train, y_train)


In [42]:
# Print the results 
rf_random.cv_results_

{'mean_fit_time': array([0.58885643, 0.56909473, 0.41963794, 0.16617167, 0.2424422 ,
        0.25683694, 0.56711476, 0.4986397 , 0.21407225, 0.32304287,
        0.28638771, 0.58381152, 0.11730025, 0.34924774, 0.40415077,
        0.66506228, 0.40444431, 0.43023741, 0.17594767, 0.23366919]),
 'std_fit_time': array([0.01190323, 0.00262206, 0.01023218, 0.00106005, 0.00097819,
        0.01992609, 0.00791665, 0.00243257, 0.00079913, 0.00160324,
        0.00173297, 0.006448  , 0.00039869, 0.00138113, 0.00114193,
        0.0082836 , 0.00160679, 0.00531822, 0.00504721, 0.00605522]),
 'mean_score_time': array([0.02482176, 0.02482221, 0.01803899, 0.00742693, 0.01051645,
        0.01114318, 0.02328713, 0.02190804, 0.00986583, 0.01395547,
        0.01296782, 0.02429924, 0.00654097, 0.01754129, 0.0170161 ,
        0.02692366, 0.02000186, 0.01898487, 0.00935669, 0.01199079]),
 'std_score_time': array([1.35817616e-03, 3.21569315e-04, 1.40079644e-03, 7.97334468e-05,
        1.60144033e-04, 4.26167184e-
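The raw `cv_results_` dictionary is hard to read. One common way to inspect it is to load it into a pandas DataFrame and look at `best_params_` and `best_score_`. The sketch below runs a deliberately small search on synthetic data so that it is self-contained; the data, the grid (`small_grid`), and the search settings are assumptions and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(120, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=120)

# Small hypothetical grid and search, kept tiny so the example runs quickly
small_grid = {"n_estimators": [10, 20, 30], "max_depth": [3, 5]}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid, n_iter=4, cv=3, random_state=42)
search.fit(X_demo, y_demo)

# One row per sampled combination, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
columns = ["param_n_estimators", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)

print("Best hyperparameters:", search.best_params_)
print("Best mean CV score (R2 by default for regressors):", round(search.best_score_, 4))
```

In the notebook itself the same pattern would apply to `rf_random` after it has been fitted.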
