# EXERCISE: Intro to machine learning

In [1]:
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline

import sklearn
sklearn.__version__

# Should be 0.18

'0.18'

## Read the data

`numpy` has a convenient function, `loadtxt`, that can load a CSV file. It needs a file-like object, and our data is on the web. That's OK: we don't need to save it to disk. We can fetch the text with `requests` and wrap it in a `StringIO` object, which behaves like an open text file.

In [2]:
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)

We can't just load the whole thing, because we only want NumPy to handle an array of floats, and this file also contains text metadata such as a header row and string columns. (We can't tell that yet; I just happen to know it, and it's normal for CSV files.)

Let's look at the first few rows:

In [3]:
r.text.split('\n')[:5]

['Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS',
 '3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0',
 '3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979',
 '3,A1 SH,SHRIMPLIN,2794.0,79.05,0.658,14.8,13.05,3.6,1,0.957',
 '3,A1 SH,SHRIMPLIN,2794.5,86.1,0.655,13.9,13.115,3.5,1,0.936']

Now we'll load the data we want. First the feature vectors, `X`. We'll just use the five logs, which are in columns 4 to 8 (counting from zero, so `GR` through `PE`).

For convenience later, we'll also make a list of the names of the features we're going to use:

In [4]:
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']

In [5]:
cols = [4,5,6,7,8]
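
Rather than counting columns by hand, the indices can also be looked up from the header row. This is only a sketch, using the header line printed by cell 3; `header`, `names` and `cols_from_names` are names introduced here for illustration.

```python
# Header row copied from the output of cell 3.
header = 'Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS'
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']

# Map each feature name to its zero-based column position.
names = header.split(',')
cols_from_names = [names.index(name) for name in features]

print(cols_from_names)  # [4, 5, 6, 7, 8]
```

If a feature name is misspelled, `list.index` raises a `ValueError`, which is a useful early warning.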

In [6]:
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=cols)

In [7]:
_ = f.seek(0)  # Reset the file reader.
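
The reset matters because `loadtxt` reads the buffer through to its end, so a second read from that position would find nothing. Here is a small illustration with a made-up two-line buffer (`buf` and its contents are not part of the exercise data):

```python
import io

# A tiny stand-in for the downloaded text (hypothetical data).
buf = io.StringIO('a,b\n1,2\n')

first = buf.read()    # Reads everything and leaves the position at the end.
second = buf.read()   # Nothing is left to read, so this is an empty string.

buf.seek(0)           # Rewind to the start of the buffer.
third = buf.read()    # The full text is available again.

print(repr(second), first == third)
```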

### Ex: Can you write the code to get the label vector, `y`?

In [None]:
y = 

Check that `X` is a 2D matrix with 3232 rows and 5 columns, and `y` is a 1D vector with 3232 elements.

In [10]:
X.shape, y.shape

((3232, 5), (3232,))
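
If you get stuck on the exercise, one possible approach is to call `loadtxt` again and ask only for column 0 (`Facies`). The sketch below runs on the five sample lines shown earlier, so the shapes are tiny; `sample`, `f_demo`, `X_demo` and `y_demo` are illustrative names, not part of the notebook.

```python
import io
import numpy as np

# Header plus the first four data rows, as printed by cell 3.
sample = """Facies,Formation,Well Name,Depth,GR,ILD_log10,DeltaPHI,PHIND,PE,NM_M,RELPOS
3,A1 SH,SHRIMPLIN,2793.0,77.45,0.664,9.9,11.915,4.6,1,1.0
3,A1 SH,SHRIMPLIN,2793.5,78.26,0.661,14.2,12.565,4.1,1,0.979
3,A1 SH,SHRIMPLIN,2794.0,79.05,0.658,14.8,13.05,3.6,1,0.957
3,A1 SH,SHRIMPLIN,2794.5,86.1,0.655,13.9,13.115,3.5,1,0.936
"""

f_demo = io.StringIO(sample)
X_demo = np.loadtxt(f_demo, skiprows=1, delimiter=',', usecols=[4, 5, 6, 7, 8])

f_demo.seek(0)  # Rewind before reading the same buffer again.
y_demo = np.loadtxt(f_demo, skiprows=1, delimiter=',', usecols=[0])

print(X_demo.shape, y_demo.shape)
```

Only the requested columns are converted to floats, which is why the text columns (`Formation`, `Well Name`) don't cause trouble.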

We have data! We're almost ready to train; we just have to sort out our train and test subsets.

## Getting ready to train

In [11]:
from sklearn.model_selection import train_test_split

### Ex: Use the docs for `train_test_split` to set the size of the test set, and also to set a random seed for the splitting.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [13]:
X_train.shape, y_train.shape

((2424, 5), (2424,))
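
For the exercise above, the two relevant arguments are `test_size` and `random_state`. The sketch below uses placeholder arrays of the right shape rather than the real data; the values `0.25` and `42` are arbitrary choices, and `0.25` happens to match the default split that produced the 2424 training rows shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as X and y (not the real data).
X_demo = np.zeros((3232, 5))
y_demo = np.arange(3232)

# test_size sets the held-out fraction; random_state makes the shuffle repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same seed selects the same rows in the same order.
print(np.array_equal(y_tr, split_b[2]))
```

Fixing the seed means a change in score comes from a change in the model, not from a different shuffle.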

Now the fun can really begin. 

## Training and evaluating a model

In [14]:
from sklearn.ensemble import ExtraTreesClassifier 

In [15]:
clf = ExtraTreesClassifier()

In [16]:
clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [17]:
clf.score(X_test, y_test)

0.62623762376237624

Maybe we can do better by tweaking some of those parameters, for example by growing many more trees (`n_estimators`) and fitting them in parallel (`n_jobs`):

In [44]:
clf = ExtraTreesClassifier(n_estimators=2000, n_jobs=4, verbose=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    2.6s
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:    3.5s
[Parallel(n_jobs=4)]: Done 2000 out of 2000 | elapsed:    3.9s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    0.6s
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:    0.9s
[Parallel(n_jobs=4)]: Done 2000 out of 2000 | elapsed:    1.0s finished


0.67079207920792083

All scikit-learn classifiers share the same `fit`, `predict` and `score` methods (but not the same hyperparameters), so it's very easy to try lots of models.

In [18]:
from sklearn.naive_bayes import GaussianNB
GaussianNB().fit(X_train, y_train).score(X_test, y_test)

0.43585780525502316

### Ex: Try lots of models!
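
One way to approach this is to put several classifiers in a dictionary and loop over them. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; the choice of models and their settings is arbitrary, and you would swap in `X_train`, `X_test`, `y_train` and `y_test` from above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the well-log data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    'ExtraTrees': ExtraTreesClassifier(n_estimators=50, random_state=0),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
    'GaussianNB': GaussianNB(),
    'KNeighbors': KNeighborsClassifier(),
}

# Every classifier is trained and scored through the same two calls.
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print("{:>14s}  {:.3f}".format(name, scores[name]))
```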

## Predict!

In [47]:
y_pred = clf.predict(X_test)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 1242 tasks      | elapsed:    0.6s
[Parallel(n_jobs=4)]: Done 1792 tasks      | elapsed:    0.8s
[Parallel(n_jobs=4)]: Done 2000 out of 2000 | elapsed:    0.9s finished


In [48]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.67079207920792083

In [49]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

array([[ 48,  15,   2,   0,   0,   0,   0,   0,   0],
       [  7, 132,  42,   2,   1,   1,   0,   4,   0],
       [  2,  37, 116,   3,   0,   6,   0,   3,   0],
       [  3,  11,   3,  25,   3,   1,   0,   5,   0],
       [  0,   4,   2,   0,  21,  14,   2,   8,   0],
       [  0,   5,   1,   0,   9,  76,   0,  14,   0],
       [  1,   1,   1,   0,   0,   0,  12,   3,   0],
       [  1,   7,   3,   1,   3,  29,   0,  78,   3],
       [  0,   0,   0,   0,   0,   1,   2,   0,  34]])
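
The accuracy score and the per-class metrics can all be read off this matrix. As a check, the sketch below recomputes them with NumPy from the numbers printed above; `cm`, `accuracy`, `recall` and `precision` are names introduced here.

```python
import numpy as np

# The confusion matrix printed above: rows are true facies, columns are predictions.
cm = np.array([[ 48,  15,   2,   0,   0,   0,   0,   0,   0],
               [  7, 132,  42,   2,   1,   1,   0,   4,   0],
               [  2,  37, 116,   3,   0,   6,   0,   3,   0],
               [  3,  11,   3,  25,   3,   1,   0,   5,   0],
               [  0,   4,   2,   0,  21,  14,   2,   8,   0],
               [  0,   5,   1,   0,   9,  76,   0,  14,   0],
               [  1,   1,   1,   0,   0,   0,  12,   3,   0],
               [  1,   7,   3,   1,   3,  29,   0,  78,   3],
               [  0,   0,   0,   0,   0,   1,   2,   0,  34]])

accuracy = np.trace(cm) / cm.sum()         # Correct predictions over all samples.
recall = np.diag(cm) / cm.sum(axis=1)      # Per class: correct / true members.
precision = np.diag(cm) / cm.sum(axis=0)   # Per class: correct / predicted members.

print(round(accuracy, 4))
print(np.round(recall, 2))
print(np.round(precision, 2))
```

The diagonal holds 542 of the 808 test samples, which gives the same 0.6708 as `accuracy_score`.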

In [50]:
# Sort the labels so the names line up with the rows of the report.
target_names = [str(n) for n in sorted(set(y))]

In [51]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=target_names))

             precision    recall  f1-score   support

        1.0       0.77      0.74      0.76        65
        2.0       0.62      0.70      0.66       189
        3.0       0.68      0.69      0.69       167
        4.0       0.81      0.49      0.61        51
        5.0       0.57      0.41      0.48        51
        6.0       0.59      0.72      0.65       105
        7.0       0.75      0.67      0.71        18
        8.0       0.68      0.62      0.65       125
        9.0       0.92      0.92      0.92        37

avg / total       0.68      0.67      0.67       808



## More in-depth evaluation: k-fold cross-validation

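Rather than splitting rows at random, we'll hold out one whole well at a time (leave-one-group-out, a grouped relative of k-fold). For that we need a vector that says which well each row belongs to; it can contain integers or, as here, the well names.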

In [19]:
wells = [row.split(',')[2] for row in r.text.split('\n')[1:] if row]

In [20]:
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)

for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))

     CHURCHMAN BIBLE  0.460
      CROSS H CATTLE  0.315
            LUKE G U  0.388
               NEWBY  0.376
               NOLAN  0.352
          Recruit F9  0.868
             SHANKLE  0.419
           SHRIMPLIN  0.524
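
To see what `LeaveOneGroupOut` is doing, it can help to run it on a toy example. The sketch below uses six made-up samples from three made-up wells (`A`, `B`, `C`) and prints which wells land in the training and test sets of each split.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Six made-up samples from three made-up wells.
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([1, 2, 1, 2, 1, 2])
groups = np.array(['A', 'A', 'B', 'B', 'C', 'C'])

logo = LeaveOneGroupOut()
held_out = []
for train, test in logo.split(X_demo, y_demo, groups=groups):
    # Each test fold contains exactly one well; the rest form the training set.
    train_wells = sorted(set(groups[train].tolist()))
    test_wells = sorted(set(groups[test].tolist()))
    held_out.append(test_wells)
    print(train_wells, '->', test_wells)
```

There are as many splits as there are wells, and no well ever appears on both sides of a split. That is why these scores are a tougher, more realistic test than the random split used earlier.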


### Ex: Can you improve the model by adding more features? We left some of them out when we chose the columns (back in cell 5).

<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2016</p>
</div>