# Decision Trees and Well Logs

In this notebook we read in petrophysical data from a `.csv` file and load it into a `pandas` `DataFrame`. Next we use a decision tree to classify the samples into one of two different formations. Then we write the decision tree to a graph to visualize how the tree works. 

In [None]:
! pip install scooby

Start by importing `pandas`, `sklearn`, and `scooby` so we can see which versions of the packages I am using. If you don't have `scooby` yet, you can install it in your environment by running the cell above.

In [1]:
import pandas as pd
import scooby
import sklearn #for label encoding later, now just check the version
%matplotlib inline

Let's first check which versions of the core ML packages I am using. This way we can troubleshoot any bugs along the way.

In [2]:
scooby.Report(additional=[sklearn], sort=True)

Thu Nov 14 08:37:59 2019 Central Standard Time
OS: Windows, CPU(s): 8, Machine: AMD64
Architecture: 64bit, RAM: 15.9 GB, Environment: Jupyter
Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
IPython: 7.6.1, matplotlib: 3.1.0, numpy: 1.16.4
scipy: 1.2.1, scooby: 0.4.3, sklearn: 0.21.2
Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications


Next we want to read in the data from the `.csv` file and create a `DataFrame`

In [3]:
data = pd.read_csv(r'well_data.csv') #read it in
data.tail()

Unnamed: 0,DEPT,AHT10,AHT20,AHT30,AHT60,AHT90,AHTCO60,AHTCO90,DPHZ,DSOZ,...,ITT,NPOR,PEFZ,RSOZ,RXOZ,SDEV,SP,SPHI,RHOZ,TOP
5466,1914.0,1.6167,3.0335,7.5475,8.5244,9.1691,117.3103,109.0619,-0.4129,0.5048,...,0.2649,0.4847,10.0,0.0348,2.9599,1.0538,1.625,0.7009,3.3313,MATANUSKA
5467,1913.5,1.6164,3.0324,7.5492,8.5195,9.183,117.3782,108.8963,-0.6763,0.3208,...,0.265,0.476,10.0,0.0,1.7452,1.077,10.9375,0.6161,3.7659,MATANUSKA
5468,1913.0,1.6163,3.0317,7.5488,8.5243,9.1852,117.3116,108.8711,-0.9772,0.2371,...,0.2651,0.4754,10.0,0.0,0.3407,1.0509,43.8125,0.5991,4.2624,MATANUSKA
5469,1912.5,1.6162,3.0311,7.5493,8.5248,9.1936,117.3051,108.7711,-1.1748,0.212,...,0.2652,0.4853,10.0,0.0,0.2168,0.8236,79.5,0.6521,4.5884,MATANUSKA
5470,1912.0,1.6161,3.0305,7.5496,8.5289,9.1974,117.2483,108.7263,-1.1654,0.208,...,0.2652,0.4471,9.9845,0.0,0.1797,0.7958,108.5,0.6699,4.5729,MATANUSKA
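The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its row index. Because such a column follows row order, it can act as a stand-in for depth, so it is worth checking whether it ends up among the predictors. The sketch below uses a small hypothetical frame (not `well_data.csv`) to show where the column comes from and how `index_col=0` keeps it out of the features.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the well data
df = pd.DataFrame({"DEPT": [1912.5, 1912.0], "GR": [80.1, 82.3]})

# Writing with the default index=True stores the row index as an unnamed first column
csv_text = df.to_csv()

# Reading it back naively turns that index into a data column called "Unnamed: 0"
naive = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 restores it as the index instead of a feature
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(indexed.columns))
```

Whether to apply this to the notebook's file is a judgment call; dropping the column by name along with the other wellbore columns would have the same effect.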


Let's investigate the `TOP` column and see what the formations are that we want to predict

In [4]:
data.TOP.unique()

array(['NELCHINA', 'MATANUSKA'], dtype=object)

Next, we want to use a `LabelEncoder` to encode our formations to integers and assign that to a `tops` variable

In [5]:
from sklearn import preprocessing #for label encoding
#label encode our formation data
le = preprocessing.LabelEncoder()
le.fit(data.TOP) #classes are sorted alphabetically: MATANUSKA -> 0, NELCHINA -> 1
tops = le.transform(data.TOP)
tops

array([1, 1, 1, ..., 0, 0, 0])
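It helps to know which integer belongs to which formation. As a minimal sketch, assuming a short hypothetical label array in place of the `TOP` column, the encoder's `classes_` attribute lists the names in code order and `inverse_transform` maps codes back to names.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical labels in the same order of appearance as the notebook's TOP column
labels = np.array(["NELCHINA", "NELCHINA", "MATANUSKA", "MATANUSKA"])

le_demo = preprocessing.LabelEncoder()
codes = le_demo.fit_transform(labels)  # fit and transform in one step

# classes_ is sorted alphabetically, so MATANUSKA -> 0 and NELCHINA -> 1
print(le_demo.classes_)
print(codes)

# inverse_transform maps integer codes back to formation names
print(le_demo.inverse_transform([0, 1]))
```

Note that this sorted order differs from the order of appearance returned by `unique()`.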

Before we get to training, we assign `tops` as our target and the rest of the `DataFrame` as our predictor features. That means dropping the `TOP` column from the predictors at this point.

In [6]:
#define our target variable and predictor features
y = tops
X = data.drop(['TOP'], axis=1)

Then we split our data into train and test subsets

In [7]:
#let's split into train and test samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
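A quick look at the shapes and class counts of each subset can confirm the split did what we expect. The sketch below uses hypothetical random data with an imbalanced target; the `stratify` argument is an optional extra that the notebook does not use, shown here only as one way to keep class proportions equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, an imbalanced binary target
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Shapes of the two subsets, then per-class sample counts in each
print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```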

Import our `DecisionTreeClassifier`

In [8]:
#import our classifier
from sklearn.tree import DecisionTreeClassifier

Fit our classifier on the training data with default parameters just to see how it does

In [9]:
#and train our classifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
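The output above shows the default `criterion='gini'`. As a rough illustration of what that criterion measures (a hand-written helper with a hypothetical name, not scikit-learn's internal code), Gini impurity is one minus the sum of squared class proportions in a node: 0 for a pure node and 0.5 for an evenly mixed two-class node.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([0, 0, 1, 1]))  # evenly mixed two-class node
print(gini_impurity([1, 1, 1, 1]))  # pure node
```

The tree picks splits that reduce this impurity the most in the resulting child nodes.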

And score it on the test subset

In [10]:
#score our classifier
clf.score(X_test, y_test)

1.0

Hot dog! 100% accuracy right out of the box! That's really good! Now let's interpret our model by saving it as a `graphviz` file that we can visualize online.

In [11]:
#export the tree as a DOT file. To visualize it, open the output file in a text editor and paste its contents into http://webgraphviz.com/
from sklearn import tree
#class_names must follow the encoder's sorted class order (le.classes_), not the order of appearance from unique()
tree.export_graphviz(clf, out_file='graph', feature_names=list(X.columns), class_names=list(le.classes_), filled=True)
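If opening a file in a text editor is inconvenient, the same information can be pulled straight into the notebook. The sketch below trains a tiny tree on hypothetical data (the `GR` values and class names are made up for illustration) and shows two options: `export_graphviz` with `out_file=None`, which returns the DOT source as a string, and `export_text`, which prints an indented text view of the tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Hypothetical data: one feature that separates the two classes
X_demo = np.array([[0.1], [0.2], [0.8], [0.9]])
y_demo = np.array([0, 0, 1, 1])
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# out_file=None returns the DOT source as a string instead of writing a file
dot_source = export_graphviz(
    clf_demo,
    out_file=None,
    feature_names=["GR"],
    class_names=["MATANUSKA", "NELCHINA"],  # must follow clf_demo.classes_ order
    filled=True,
)
print(dot_source[:40])

# export_text gives an indented plain-text view of the same tree
text_view = export_text(clf_demo, feature_names=["GR"])
print(text_view)
```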

![Decision Tree One](https://github.com/jessepisel/5minutesofpython/blob/master/Machine%20Learning/decision_tree_one.JPG?raw=true "Decision Tree One")

Our decision tree is simply using depth to predict the formation! This works nicely for this single well, but how do we think it would work for a different well? Probably not so nicely; it's definitely overfitting to this well. Let's get rid of the wellbore information like depth, transit time, azimuth, deviation, caliper, and temperature to see if we can predict formation from petrophysics alone.

In [12]:
X_clean = data.drop(['TOP','DEPT', 'ITT', 'HAZI', 'SDEV', 'HCAL', 'HTEM'], axis=1) #drop wellbore information
X_train, X_test, y_train, y_test = train_test_split(X_clean,y, test_size=0.3, random_state=0) #split the data
clf_clean = DecisionTreeClassifier(max_depth=2) #new classifier with a max depth of the tree to keep it interpretable
clf_clean.fit(X_train,y_train) #fit on the data
clf_clean.score(X_test, y_test) #score the accuracy of the classifier

0.8915956151035322
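The drop from a perfect score to about 0.89 is what we might expect once the tree can no longer lean on depth. A quick way to spot that kind of shortcut without exporting a graph is the fitted tree's `feature_importances_`. The sketch below uses a hypothetical synthetic well, where the formation changes at a single depth and the `GR` and `SP` columns are pure noise, so it only illustrates the check and says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical well: the formation changes at a single depth, logs are pure noise
rng = np.random.RandomState(0)
depth = np.arange(1000.0, 1200.0, 0.5)
demo = pd.DataFrame({
    "DEPT": depth,
    "GR": rng.normal(size=depth.size),
    "SP": rng.normal(size=depth.size),
})
y_demo = (depth >= 1100.0).astype(int)

clf_demo = DecisionTreeClassifier(random_state=0).fit(demo, y_demo)

# Importances sum to 1; a depth-driven tree puts nearly all the weight on DEPT
importances = pd.Series(clf_demo.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines on `clf` and `clf_clean` would show which columns each of the notebook's trees relies on.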

Let's also export this new tree to see if we can interpret which features it is using to predict the two different formations.

In [13]:
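#class_names must follow the encoder's sorted class order (le.classes_), not the order of appearance from unique()
tree.export_graphviz(clf_clean, out_file='clean_graph', feature_names=list(X_clean.columns), class_names=list(le.classes_), filled=True)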


![Decision Tree Two](https://github.com/jessepisel/5minutesofpython/blob/master/Machine%20Learning/decision_tree_two.JPG?raw=true "Decision Tree Two")

With the wellbore information removed, the decision tree is predicting the different formations based on gamma ray `GR`, standard resolution density standoff `DSOZ`, and spontaneous potential `SP`. Not too bad for 89% accuracy based on three features and such a short tree.

This notebook is licensed as CC-BY; use and share to your heart's content.