# Intro to machine learning

In [1]:
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline

import sklearn as sk
sk.__version__

'0.18'

## Read the data

`numpy` has a convenient function, `loadtxt`, that can load a CSV file. It needs a file, and ours is on the web. That's OK: rather than downloading it to disk, we can pass its text content to a `StringIO` object, which behaves like a file handle.

In [2]:
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)
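To see what "behaves like a file handle" means without touching the network, here is a minimal sketch. The string holds a shortened header and two rows copied from the training file, and the names `text` and `f_demo` are made up for this illustration.

```python
import io

# A shortened header and two data rows, held in memory as a plain string.
text = (
    "Facies,Formation,Well Name,Depth,GR\n"
    "3,A1 SH,SHRIMPLIN,2793.0,77.45\n"
    "3,A1 SH,SHRIMPLIN,2793.5,78.26\n"
)

f_demo = io.StringIO(text)

# Read line by line, exactly as with an open file.
header = f_demo.readline().strip()
first_row = f_demo.readline().strip()

# Rewind to the start, as with a real file handle.
f_demo.seek(0)
header_again = f_demo.readline().strip()

print(header)
print(first_row)
```

Reading advances a position in the buffer, so a second pass over the same object needs `seek(0)` first.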

We can't just load the whole thing, because we only want NumPy to handle an array of floats, and this file also contains text metadata: a header row, plus formation and well name columns. (We can't tell that without looking. I just happen to know it, and it's normal for CSV files.)

Let's look at the first few rows:

In [3]:
r.text.split('\n')[:5]

['Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS',
 '3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0',
 '3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979',
 '3,A1 SH,SHRIMPLIN,2794.0,79.05,0.658,14.8,13.05,3.6,1,0.957',
 '3,A1 SH,SHRIMPLIN,2794.5,86.1,0.655,13.9,13.115,3.5,1,0.936']

For convenience later, we'll make a list of the features we're going to use.

In [4]:
# Take the header row and drop the first three columns (Facies, Formation, Well Name).
features = r.text.split('\n')[0].split(',')[3:]
features

['Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']

Now we'll load the data we want. First the feature vectors, `X`...

In [5]:
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[3,4,5,6,7,8,9,10])

And the label vector, `y`:

In [6]:
_ = f.seek(0)  # Reset the file reader.
y = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[0])

In [7]:
X.shape, y.shape

((3232, 8), (3232,))
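As an alternative sketch, the numeric columns can be read in a single pass and then sliced into labels and features, which avoids rewinding the file. The in-memory `sample` below is just the header and three rows copied from the file, and `X_demo` and `y_demo` are illustrative names, not part of the workflow above.

```python
import io
import numpy as np

# Header plus three rows copied from the training file.
sample = """Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS
3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0
3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979
3,A1 SH,SHRIMPLIN,2794.0,79.05,0.658,14.8,13.05,3.6,1,0.957
"""

# Column 0 is the label; columns 3 to 10 are the features.
# Columns 1 and 2 are text, so we skip them.
numeric_cols = [0, 3, 4, 5, 6, 7, 8, 9, 10]
data = np.loadtxt(io.StringIO(sample), skiprows=1, delimiter=',', usecols=numeric_cols)

y_demo = data[:, 0]   # Facies labels.
X_demo = data[:, 1:]  # Feature vectors.

print(X_demo.shape, y_demo.shape)
```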

We have data! We're almost ready to train; we just have to sort out our train and test subsets.

## Getting ready to train

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [10]:
X_train.shape, y_train.shape

((2585, 8), (2585,))
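To check where those numbers come from: 20% of 3232 rows is 646.4, and the training shape above implies the test count is rounded up to 647, leaving 2585 rows for training. The sketch below reproduces the shapes on a placeholder array of the same size (not the real data) and shows that a fixed `random_state` gives the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y (not the real data).
X_demo = np.arange(3232 * 8).reshape(3232, 8)
y_demo = np.arange(3232)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same random_state, same shuffle, same split.
X_tr_again, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_tr, X_tr_again))
```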

Now the fun can really begin. 

## Training and evaluating a model

In [11]:
from sklearn.ensemble import ExtraTreesClassifier 

In [12]:
clf = ExtraTreesClassifier()

In [13]:
clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [14]:
clf.score(X_test, y_test)

0.78516228748068007

Maybe we can do better by twiddling some of those parameters:

In [15]:
clf = ExtraTreesClassifier(n_estimators=2000, n_jobs=4, verbose=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:    1.6s
[Parallel(n_jobs=4)]: Done 2000 out of 2000 | elapsed:    1.9s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:    0.5s
[Parallel(n_jobs=4)]: Done 2000 out of 2000 | elapsed:    0.6s finished


0.83153013910355489

All scikit-learn models share the same `fit` and `score` API (though not the same hyperparameters), so it's very easy to try lots of models:

In [16]:
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

0.7032457496136012

In [17]:
from sklearn.svm import SVC
SVC().fit(X_train, y_train).score(X_test, y_test)

0.61514683153013905

In [18]:
from sklearn.naive_bayes import GaussianNB
GaussianNB().fit(X_train, y_train).score(X_test, y_test)

0.43585780525502316

In [19]:
from sklearn.ensemble import GradientBoostingClassifier
GradientBoostingClassifier().fit(X_train, y_train).score(X_test, y_test)

0.73724884080370945
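Because the API is shared, the same comparison can be written as a loop. This is only a sketch: it uses synthetic data from `make_classification` as a stand-in for the facies data, so the scores it prints will not match the ones above, and the `models` dictionary is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the facies data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    'ExtraTrees': ExtraTreesClassifier(n_estimators=10, random_state=0),
    'KNeighbors': KNeighborsClassifier(),
    'SVC': SVC(),
    'GaussianNB': GaussianNB(),
    'GradientBoosting': GradientBoostingClassifier(random_state=0),
}

# Every estimator exposes fit() and score(), so one loop handles them all.
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print("{:>20s}  {:.3f}".format(name, scores[name]))
```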

## More in-depth evaluation: k-fold cross-validation

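A random split can flatter a model, because samples from the same well end up in both the training and test sets. A stricter test holds out one entire well at a time, which is a grouped variant of k-fold cross-validation. For that, we need a vector that contains a label (an integer, or a name as we use here) identifying the well each sample belongs to.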

In [20]:
wells = [row.split(',')[2] for row in r.text.split('\n')[1:] if row]
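The well names work as group labels as they are. If integer codes are preferred, one way to get them (a sketch on a short made-up list, not the full `wells` vector) is `np.unique` with `return_inverse=True`:

```python
import numpy as np

# A short, made-up list of well names for illustration.
well_names = ['SHRIMPLIN', 'SHRIMPLIN', 'SHANKLE', 'NEWBY', 'SHANKLE']

# unique_wells is the sorted set of names; well_codes maps each sample
# to the index of its well in unique_wells.
unique_wells, well_codes = np.unique(well_names, return_inverse=True)

print(unique_wells)
print(well_codes)
```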

In [21]:
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)

for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))

     CHURCHMAN BIBLE  0.473
      CROSS H CATTLE  0.297
            LUKE G U  0.408
               NEWBY  0.423
               NOLAN  0.467
          Recruit F9  0.912
             SHANKLE  0.410
           SHRIMPLIN  0.554


<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2016</p>
</div>