

<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## PGE 383 Final Project 

____________________



## Algorithm Chains and Pipeline Tutorial

#### Nataly Chacon Buitrago
####  Hildebrand Department of Petroleum and Geosystems Engineering, Cockrell School of Engineering

### Subsurface Machine Learning Course, The University of Texas at Austin
_____________________

Workflow supervision and review by:

#### Instructor: Prof. Michael Pyrcz, Ph.D., P.Eng., Associate Professor, The University of Texas at Austin
##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Course TA: Misael Morales, Graduate Student, The University of Texas at Austin


### Executive Summary


A machine learning pipeline is a tool that helps automate machine learning workflows. It consists of multiple steps that can cover everything in an ML workflow, from data extraction and preprocessing to model training and deployment. 
In data science teams, pipelines are often the central product, as they encapsulate the best practices for producing an ML model. In science and engineering, a pipeline can be very useful for a single model that needs to be updated frequently. 
In this workflow, I will explain how to build a basic pipeline with a simple data set. Then, I will explain how to combine pipelines and GridSearchCV to search over parameters for all processing steps at once. I will also explain how to create a pipeline using make_pipeline and how to build a pipeline when there are different types of data columns.

### Import Packages


In [1]:
import numpy as np                                        # for working with data and model arrays
import warnings
from sklearn.neighbors import KNeighborsRegressor         # for nearest k neighbours
from sklearn.decomposition import PCA                     # perform dimensionality reduction using PCA
from sklearn.preprocessing import OneHotEncoder           # Encode categorical features as a one-hot numeric array
import pandas as pd                                       # data manipulation and analysis
from sklearn import set_config                            # visualization of diagrams
from sklearn.impute import SimpleImputer                  # Complete missing values with simple strategies
from sklearn.model_selection import train_test_split      # split the model in train and test sets
from sklearn.pipeline import Pipeline                     # to build pipelines
from sklearn.pipeline import make_pipeline                # Construct a Pipeline from the given estimators
from sklearn.preprocessing import StandardScaler          # feature scaling
from sklearn.preprocessing import MinMaxScaler            # feature scaling
from sklearn.linear_model import LogisticRegression       # logistic regression
from sklearn.compose import ColumnTransformer             # Applies transformers to columns of an array or pandas df
from sklearn.model_selection import GridSearchCV          # Exhaustive search over specified parameter values for an estimator
import mglearn                                            # helper functions for the book "Introduction to 
                                                          # Machine Learning with Python"

### Load Data

 We will work with the following features:

* **Well Index (WellIndex)** 
* **porosity (Por)** - fraction of rock volume that is void space, in percent
* **permeability (LogPerm)** - ability of a fluid to flow through the rock, given as the logarithm of permeability in milliDarcy
* **acoustic impedance (AI)** - product of sonic velocity and rock density in units of $kg/m^2s*10^3$
* **Brittleness (Brittle)** - ability of a rock to break
* **Total Organic Content (TOC)** - concentration of organic material in source rocks as represented by the weight percent of organic carbon
* **Vitrinite Reflectance (VR)** - measurement of the maturity of organic matter with respect to whether it has generated hydrocarbons or could be an effective source rock.

 The target variable is:
* **Production** 

In [2]:
my_data = pd.read_csv(r"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV.csv") # load the
          # comma delimited data file from Dr. Pyrcz's GeoDataSets GitHub repository
my_data.head()

Unnamed: 0,WellIndex,Por,LogPerm,AI,Brittle,TOC,VR,Production
0,1,15.91,1.67,3.06,14.05,1.36,1.85,177.381958
1,2,15.34,1.65,2.6,31.88,1.37,1.79,1479.767778
2,3,20.45,2.02,3.13,63.67,1.79,2.53,4421.221583
3,4,11.95,1.14,3.9,58.81,0.4,2.03,1488.317629
4,5,19.53,1.83,2.57,43.75,1.4,2.11,5261.094919


###  Part 1. Step by step low level code

Using the K nearest neighbors approach, let's build a basic workflow to predict Production from Por, LogPerm, AI, Brittle, TOC, and VR. 

In [3]:
#Load and split the data

X = my_data.loc[:,['Por','LogPerm','AI', 'Brittle', 'TOC', 'VR']]
y = my_data.loc[:,'Production'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=0)

#Compute minimum and maximum on the training data
scaler = MinMaxScaler().fit(X_train)

#Rescale the training data
X_train_scaled = scaler.transform(X_train)

knn = KNeighborsRegressor()
# fit the KNN regressor on the scaled training data
knn.fit(X_train_scaled, y_train)
# scale the test data and score the scaled data
X_test_scaled = scaler.transform(X_test)

print("Test score: {:.2f}".format(knn.score(X_test_scaled, y_test)))

Test score: 0.77


Now, let's select the best parameters for KNN using a naive approach:

In [4]:
# for illustration purposes only, don't use this code! 
warnings.filterwarnings("ignore")

param_grid = {'n_neighbors': [*range(20, 200, 20)],       # start at 20: n_neighbors=0 is not a valid value
              'weights': ['uniform','distance'],
              'p':[1,2]}

grid = GridSearchCV(KNeighborsRegressor(), param_grid=param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print("Best cross-validation score (R^2): {:.2f}".format(grid.best_score_)) 
print("Test set score (R^2): {:.2f}".format(grid.score(X_test_scaled, y_test))) 
print("Best parameters: ", grid.best_params_)


Best cross-validation score (R^2): 0.76
Test set score (R^2): 0.72
Best parameters:  {'n_neighbors': 20, 'p': 2, 'weights': 'distance'}


We just ran a grid search with GridSearchCV on the scaled data. However, the code above leaks data, because the scaler was fit on all of the training data before the cross-validation. When the cross-validation splits the training set into training folds and validation folds, the scaling already contains information from the validation folds. For this reason, the cross-validation scores no longer correctly reflect how the model will do on new data.
To solve the issue, scaling and any other preprocessing step should be fit after the data set is split for cross-validation, using only the training folds. To achieve this in scikit-learn with the cross_val_score function and the GridSearchCV function, we can use the Pipeline class!
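The two approaches can be compared side by side. The sketch below is only an illustration on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, not the tutorial's data set): it scores a KNN regressor once with the scaler fit before cross-validation and once with the scaler inside a Pipeline. With a simple min-max scaling the two sets of scores are usually close; the point is where the scaler gets fit, not the size of the gap.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the unconv_MV data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Leaky version: the scaler sees every row, including rows that
# later serve as validation folds
X_leaky = MinMaxScaler().fit_transform(X_demo)
leaky_scores = cross_val_score(KNeighborsRegressor(), X_leaky, y_demo, cv=5)

# Leak-free version: the scaler is re-fit inside each split,
# using the training folds only
demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("knear", KNeighborsRegressor())])
clean_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)

print("Scaler fit before CV:", np.round(leaky_scores, 3))
print("Scaler inside the CV:", np.round(clean_scores, 3))
```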

###  Part 2. Building Pipelines

The Pipeline class from scikit-learn is a class that allows joining multiple processing steps. This class behaves like other models in the scikit-learn library, as the Pipeline class has fit, predict, and score methods. We use pipelines when we want to "glue" multiple preprocessing steps with a supervised model. In the example below we use a pipeline to glue the steps of the workflow in part 1. 

### I. Building a basic Pipeline

Let's look at how we can build a basic Pipeline for training a KNN on our data set after scaling the data with MinMaxScaler. 
To build the pipeline object, we have to provide a list of steps. Each step is a tuple with the following syntax: ("any string of your choosing", instance of an estimator)

In [5]:
###create steps as a list of tuples
steps =[("minmax_scaler", MinMaxScaler()), ('knear', KNeighborsRegressor())]

Then we will give the steps to the pipeline class:

In [6]:
pipe = Pipeline(steps)

Let's visualize our first pipeline!

In [7]:
set_config(display='diagram')

In [8]:
pipe

Now we can fit our pipeline like any other scikit-learn estimator:

In [9]:
pipe.fit(X_train,y_train)

Here, pipe.fit first calls fit on the first step (MinMaxScaler), then transforms the training data using the scaler, and finally fits the KNN with the scaled data. 

In [10]:
#evaluate on test data
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Test score: 0.77


Our results are the same as the ones achieved using the step-by-step code in Part 1!
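That a pipeline and the step-by-step code do the same work can also be checked on the predictions themselves. The following is a minimal sketch on synthetic arrays (all variable names here are hypothetical): it fits the same scaler and KNN once through a Pipeline and once by hand, then compares the two sets of predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(1)
X_tr = rng.uniform(size=(80, 3))
y_tr = X_tr.sum(axis=1)
X_te = rng.uniform(size=(20, 3))

# Pipeline version: fit() scales and then fits the KNN,
# predict() scales the new data and then predicts
demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("knear", KNeighborsRegressor())])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Manual version of the same steps
scaler = MinMaxScaler().fit(X_tr)
knn = KNeighborsRegressor().fit(scaler.transform(X_tr), y_tr)
manual_pred = knn.predict(scaler.transform(X_te))

same = np.allclose(pipe_pred, manual_pred)
print("Predictions match:", same)
```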

### II. Displaying a pipeline with standard scaler, dimensionality reduction and then estimator

Let's include a dimensionality reduction step using PCA in our pipeline:

In [11]:
steps=[("scaling",StandardScaler()), ("PCA",PCA(n_components=3)), ('knear', KNeighborsRegressor())]
pipe2 =Pipeline(steps)
pipe2

The Pipeline class lets you apply a single step to the data set. For instance, let's apply only the standard scaling:

In [12]:
pipe2["scaling"].fit_transform(X_train)

array([[ 1.14126637,  1.96045125,  0.27801418, -0.4657294 , -0.2645799 ,
        -0.57674243],
       [-0.39512967, -1.82501191,  0.34738147,  0.75907862, -0.78224858,
         0.40435041],
       [ 0.23324644,  0.76375644, -0.51970963, -0.61466606,  0.33273012,
        -0.51133624],
       ...,
       [-0.28327214,  0.76375644, -0.79717878,  1.40904341,  0.01416478,
        -0.05349292],
       [ 0.3845831 ,  0.64164473, -0.74515332, -0.2926232 , -0.10529722,
        -1.26350742],
       [-1.51041502, -1.3609874 ,  1.43991625,  2.10081498, -0.48359357,
         2.03950514]])

When we call pipe2.fit, it first calls fit on the first step (the scaler) and transforms the training data, then fits the PCA on the scaled data and projects it onto the principal components, and finally fits the KNN on the resulting principal component scores. 

In [13]:
pipe2.fit(X_train,y_train)

To evaluate on the test data, we simply call pipe2.score:

In [14]:
#evaluate on test data
print("Test score: {:.2f}".format(pipe2.score(X_test, y_test)))

Test score: 0.80
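Besides indexing a single step by name, a fitted pipeline can be sliced to get the output of all preprocessing steps at once. The sketch below assumes synthetic data and a hypothetical pipeline `demo_pipe` with the same three steps as pipe2; it is meant as an illustration, not as part of the original workflow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 6))
y_demo = X_demo[:, 0] - X_demo[:, 1]

demo_pipe = Pipeline([("scaling", StandardScaler()),
                      ("PCA", PCA(n_components=3)),
                      ("knear", KNeighborsRegressor())])
demo_pipe.fit(X_demo, y_demo)

# Slicing returns a new Pipeline made of the selected (already fitted) steps;
# here it holds everything except the final estimator
preprocessing = demo_pipe[:-1]
X_reduced = preprocessing.transform(X_demo)
print(X_reduced.shape)

# Fitted attributes of a step are available through named_steps
explained = demo_pipe.named_steps["PCA"].explained_variance_ratio_
print(np.round(explained, 3))
```

This is handy for inspecting what the final estimator actually receives, for example to plot the principal component scores.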


### Part 3:  Using Pipelines in Grid Search

To use GridSearchCV with a pipeline, we also need to build a parameter grid (see Part 1). However, each parameter in the grid must state which step of the pipeline it belongs to. In our example, n_neighbors, weights, and p are parameters of the KNN step, whose name is "knear". The key for each parameter has the syntax "step name__parameter name": the step name, two underscores, and the parameter name. Let's check the parameter grid dictionary for our example:

In [15]:
param_grid = {"knear__n_neighbors":[*range(20, 200, 20)],  # start at 20: n_neighbors=0 is not a valid value
              "knear__weights": ['uniform','distance'],
              "knear__p":[1,2]}

Use GridSearchCV as usual:

In [16]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5) 
grid.fit(X_train, y_train)
print("Best cross-validation score (R^2): {:.2f}".format(grid.best_score_)) 
print("Test set score: {:.2f}".format(grid.score(X_test, y_test))) 
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation score (R^2): 0.77
Test set score: 0.72
Best parameters: {'knear__n_neighbors': 20, 'knear__p': 2, 'knear__weights': 'distance'}


In contrast to what we did in Part 1, now for each split in the cross-validation the MinMaxScaler is fit with only the training folds, and no information is leaked!
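The same naming rule lets a grid search tune preprocessing steps and the model together. The sketch below uses synthetic data and a hypothetical three-step pipeline (the step names "scaling", "pca", and "knear" and the grid values are assumptions): it first lists the valid parameter names with get_params, then searches over the number of principal components and the number of neighbors at once.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 6))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.3, size=150)

demo_pipe = Pipeline([("scaling", StandardScaler()),
                      ("pca", PCA()),
                      ("knear", KNeighborsRegressor())])

# get_params() lists every valid "step__parameter" key
knn_keys = sorted(k for k in demo_pipe.get_params() if k.startswith("knear__"))
print(knn_keys)

# One grid covering a preprocessing step and the estimator
demo_grid = {"pca__n_components": [2, 4, 6],
             "knear__n_neighbors": [5, 10, 20]}
search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
```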

### Part 4: Pipeline Creation using make_pipeline

The make_pipeline function makes building a pipeline easier. It creates a pipeline for us and automatically names each step based on its class.

### I. Building a basic Pipeline using make_pipeline

In [17]:
# standard syntax
pipe_long = Pipeline([("minmax_scaler", MinMaxScaler()), ('knear', KNeighborsRegressor())]) 
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), KNeighborsRegressor())

Both pipe_long and pipe_short do the same thing; the only difference is that in pipe_short the steps were automatically named. Let's look at the names of the steps:

In [18]:
print("Pipeline steps:\n{}".format(pipe_short.steps))

Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('kneighborsregressor', KNeighborsRegressor())]


The names of the steps are just lowercase versions of the class names. 

### II. Building a pipeline with a standard scaler, dimensionality reduction, and a second standard scaler using make_pipeline

In this example we are going to see how the steps get named if multiple steps have the same class:

In [19]:
pipe = make_pipeline(StandardScaler(), PCA(n_components=3),StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=3)), ('standardscaler-2', StandardScaler())]


As you can see, if multiple steps have the same class, a number is appended. The first StandardScaler step was named standardscaler-1 and the second standardscaler-2. 
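The automatic names matter as soon as we build a parameter grid, because the grid keys must use them. A short sketch (the names `auto_pipe` and `auto_grid` and the grid values are hypothetical):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

auto_pipe = make_pipeline(MinMaxScaler(), KNeighborsRegressor())

# Step names generated by make_pipeline (lowercase class names)
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)

# Grid keys are built from the generated step name plus two underscores
auto_grid = {"kneighborsregressor__n_neighbors": [5, 10, 20]}

# Check that every key is a parameter the pipeline actually exposes
valid = all(key in auto_pipe.get_params() for key in auto_grid)
print("All grid keys valid:", valid)
```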

### Part 5: Using Pipelines with different types of data columns

For this example I am going to add a categorical column to our data set. Instead of using the porosity values, I will assign high to porosities greater than 16%, medium to porosities greater than 13% and up to 16%, and low to the other values.

In [20]:
# create a list of our conditions
conditions = [
    (my_data['Por'] <= 13),
    (my_data['Por'] > 13) & (my_data['Por'] <= 16),
    (my_data['Por'] > 16)
    
]

# create a list of the values we want to assign for each condition
values = ['low', 'medium', 'high']

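# create a new column and use np.select to assign values to it using our lists as arguments;
# a string default is given because newer NumPy versions may reject the integer default (0) with string choices
my_data['porosity'] = np.select(conditions, values, default='unknown') 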

# display updated DataFrame
my_data.head()

Unnamed: 0,WellIndex,Por,LogPerm,AI,Brittle,TOC,VR,Production,porosity
0,1,15.91,1.67,3.06,14.05,1.36,1.85,177.381958,medium
1,2,15.34,1.65,2.6,31.88,1.37,1.79,1479.767778,medium
2,3,20.45,2.02,3.13,63.67,1.79,2.53,4421.221583,high
3,4,11.95,1.14,3.9,58.81,0.4,2.03,1488.317629,low
4,5,19.53,1.83,2.57,43.75,1.4,2.11,5261.094919,high
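The same binning can also be written with pandas' cut function, which takes the bin edges directly. This is an alternative sketch, not the approach used in this workflow; the small DataFrame `demo` is a stand-in that reuses the five porosity values shown above.

```python
import numpy as np
import pandas as pd

# The five porosity values from the head of the data set
demo = pd.DataFrame({"Por": [15.91, 15.34, 20.45, 11.95, 19.53]})

# Bins are right-inclusive by default: (-inf, 13], (13, 16], (16, inf)
demo["porosity"] = pd.cut(demo["Por"],
                          bins=[-np.inf, 13, 16, np.inf],
                          labels=["low", "medium", "high"])
print(demo)
```

Note that pd.cut returns a pandas Categorical column rather than plain strings, and missing porosity values stay missing instead of receiving a default label.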


I will add a target column, where high production (> 2500) equals 0, medium production (2000 < production <= 2500) equals 1, and low production (<= 2000) equals 2. 

In [21]:
# create a list of our conditions
conditions = [
    (my_data['Production'] <= 2000),
    (my_data['Production'] > 2000) & (my_data['Production'] <= 2500),
    (my_data['Production'] > 2500)
    
]

# create a list of the values we want to assign for each condition
values = [2, 1, 0]

# create a new column and use np.select to assign values to it using our lists as arguments
my_data['prod_target'] = np.select(conditions, values) 

# display updated DataFrame
my_data.head()

Unnamed: 0,WellIndex,Por,LogPerm,AI,Brittle,TOC,VR,Production,porosity,prod_target
0,1,15.91,1.67,3.06,14.05,1.36,1.85,177.381958,medium,2
1,2,15.34,1.65,2.6,31.88,1.37,1.79,1479.767778,medium,2
2,3,20.45,2.02,3.13,63.67,1.79,2.53,4421.221583,high,0
3,4,11.95,1.14,3.9,58.81,0.4,2.03,1488.317629,low,2
4,5,19.53,1.83,2.57,43.75,1.4,2.11,5261.094919,high,0


Let's split our data set into training and testing sets.

In [22]:
X2 = my_data.copy()
X2 = X2.loc[:,['porosity','LogPerm','AI', 'Brittle', 'TOC', 'VR']]

y2 = my_data.copy()
y2 = y2.loc[:,'prod_target']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.30,random_state=0)

To build a pipeline for numerical and categorical features, we first have to create a numerical pipeline and a categorical pipeline.

In [23]:
#pipeline for numerical features
numeric_processor=Pipeline(
    steps=[("imputation_mean",SimpleImputer(missing_values=np.nan,strategy="mean")),
          ("scaler",StandardScaler())]
)
numeric_processor

In [24]:
#pipeline for categorical features
categorical_processor=Pipeline(
    steps=[("imputation_constant",SimpleImputer(fill_value ="missing",strategy = "constant")),
          ("onehot",OneHotEncoder(handle_unknown = "ignore"))]
)
categorical_processor

Then we combine the numerical and categorical processors using ColumnTransformer. The difference between a Pipeline and a ColumnTransformer is that Pipeline steps are executed serially (the output from the first step is passed to the second step, and so on), whereas the transformers in a ColumnTransformer are applied separately to their own columns, and the transformed features are concatenated at the end.
In a ColumnTransformer, the tuple for each step also includes the names of the columns to be transformed in that step. By default, any columns you pass into the ColumnTransformer that aren't named in any of the steps will be dropped (remainder='drop'). If you have columns that you want to keep but that do not need to be transformed, use remainder='passthrough'.

In [25]:
preprocessor = ColumnTransformer(
    [("categorical",categorical_processor,['porosity']),
    ("numerical",numeric_processor,['LogPerm','AI', 'Brittle', 'TOC', 'VR'])
    ])
preprocessor
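The effect of the remainder argument is easiest to see on a tiny table. The sketch below uses a hypothetical four-row DataFrame and a helper function `make_transformer` (both are assumptions for illustration); the TOC column is deliberately left out of every step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small made-up table; "TOC" is not assigned to any transformer
demo = pd.DataFrame({
    "porosity": ["low", "medium", "high", "medium"],
    "AI": [3.9, 3.06, 3.13, 2.6],
    "TOC": [0.4, 1.36, 1.79, 1.37],
})

def make_transformer(remainder):
    """Build the same ColumnTransformer with a given remainder setting."""
    return ColumnTransformer(
        [("categorical", OneHotEncoder(handle_unknown="ignore"), ["porosity"]),
         ("numerical", StandardScaler(), ["AI"])],
        remainder=remainder,
    )

dropped = make_transformer("drop").fit_transform(demo)
kept = make_transformer("passthrough").fit_transform(demo)

# 3 one-hot columns + 1 scaled column, with TOC either dropped or appended
print(dropped.shape)
print(kept.shape)
```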

Next, we’ll create a Pipeline using make_pipeline where preprocessor is the first step, and a Logistic Regression classifier is the second step.

In [26]:
pipe3 = make_pipeline(preprocessor,LogisticRegression())
pipe3.fit(X2_train, y2_train) #fit the model

In [27]:
#evaluate on test data
print("Test score: {:.2f}".format(pipe3.score(X2_test, y2_test)))

Test score: 0.71
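Because the preprocessor is itself a step of the pipeline, its inner parameters can be tuned in a grid search too, by chaining names with double underscores. The following is a sketch on synthetic data (the DataFrame `demo`, the labels, and the grid values are assumptions); the step names mirror the ones used above, with "columntransformer" and "logisticregression" generated by make_pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with one categorical and two numerical columns
rng = np.random.default_rng(4)
n = 120
demo = pd.DataFrame({
    "porosity": rng.choice(["low", "medium", "high"], size=n),
    "AI": rng.normal(3.0, 0.5, size=n),
    "TOC": rng.normal(1.0, 0.4, size=n),
})
labels = (demo["AI"] + demo["TOC"] > 4.0).astype(int)

numeric = Pipeline([("imputation_mean", SimpleImputer(strategy="mean")),
                    ("scaler", StandardScaler())])
categorical = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
pre = ColumnTransformer([("categorical", categorical, ["porosity"]),
                         ("numerical", numeric, ["AI", "TOC"])])
clf = make_pipeline(pre, LogisticRegression(max_iter=1000))

# pipeline step -> ColumnTransformer entry -> inner pipeline step -> parameter
demo_grid = {
    "columntransformer__numerical__imputation_mean__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(clf, param_grid=demo_grid, cv=5)
search.fit(demo, labels)
print("Best parameters:", search.best_params_)
```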


### Final Remarks

Machine learning workflows include different processing steps. In this workflow, we learned how to use the Pipeline class, a useful tool to "glue" together multiple processing steps in a machine learning workflow. The Pipeline class allows us to chain multiple steps into a single Python object that works with the scikit-learn interface.
In ML, finding the right combination of feature extraction, preprocessing, and models is an art, and it often requires trial and error. Using pipelines makes this trial and error over many different steps easier!

I hope this was helpful,

*Nataly Chacon Buitrago*

### References

* Miles, J. (2021, June 29). Getting the most out of scikit-learn pipelines. Towards Data Science. Retrieved November 20, 2022, from https://towardsdatascience.com/getting-the-most-out-of-scikit-learn-pipelines-c2afc4410f1a 

* Müller, A. C., & Guido, S. (2018). Introduction to machine learning with Python: A guide for data scientists. O'Reilly. 

* Naik, K. (2020). Creating Pipelines Using SKlearn- Machine Learning Tutorial. YouTube. Retrieved November 10, 2022, from https://youtu.be/w9IGkBfOoic 

* Pyrcz, M. (n.d.). Machine Learning Modeling in Python with scikit-learn Pipelines. GitHub. Retrieved November 13, 2022, from https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/PythonDataBasics_Pipelines.ipynb 

___________________

#### Work Supervised by:

### Michael Pyrcz, Associate Professor, University of Texas at Austin 
*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 

For more about Michael check out these links:

#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Want to Work Together?

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! 

* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

* I can be reached at mpyrcz@austin.utexas.edu.

I'm always happy to discuss,

*Michael*

Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin
