# EXERCISE: Intro to machine learning

In [None]:
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline

import sklearn
sklearn.__version__

# Should be 0.18

## Read the data

`numpy` has a convenient function, `loadtxt`, that can load a CSV file. It needs a file, and ours is on the web. That's OK: we don't need to save it to disk. We can read it by passing its text content to a `StringIO` object, which behaves like a file handle.
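
To see the `StringIO` idea in isolation, here is a minimal offline sketch. The CSV text, the column names, and the variable names (`csv_text`, `handle`, `demo`) are made up for illustration and are not part of the contest data.

```python
import io

import numpy as np

# Hypothetical miniature CSV, standing in for the text of a downloaded file.
csv_text = "depth,gr,pe\n100.0,55.2,3.1\n100.5,60.1,3.4\n101.0,58.7,3.3\n"

# Wrap the string so it can be read like an open file.
handle = io.StringIO(csv_text)

# Skip the header row and keep only the second and third columns.
demo = np.loadtxt(handle, skiprows=1, delimiter=",", usecols=[1, 2])
print(demo.shape)
```

Like a real file handle, a `StringIO` object keeps track of its read position, so it has to be rewound with `seek(0)` before it can be read a second time.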

In [None]:
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)

We can't just load it as it is, because we want NumPy to handle only an array of floats, and this file also contains metadata such as a header row. (We can't tell that without looking. I just happen to know it, and it's normal for CSV files.)

Let's look at the first few rows:

In [None]:
r.text.split('\n')[:5]

For convenience later, we'll make a list of the features we're going to use.

In [None]:
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']

In [None]:
cols = [4,5,6,7,8]
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=cols)

We could do that... but Pandas is really convenient for this sort of data.

In [None]:
import pandas as pd

f.seek(0)

df = pd.read_csv(f)

df.head()

Now we'll load the data we want. First the feature vectors, `X`. We'll just get the logs, which are in columns 4 to 8:

In [None]:
X = np.array(df[features])
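
As a small illustration of what that selection does, here is a sketch on a made-up DataFrame. The values and the names `toy`, `toy_features`, and `X_toy` are hypothetical. Indexing a DataFrame with a list of column names keeps just those columns, in that order, and `np.array` turns the result into a 2D array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the well-log table.
toy = pd.DataFrame({
    "Depth": [100.0, 100.5, 101.0],
    "GR": [55.2, 60.1, 58.7],
    "PE": [3.1, 3.4, 3.3],
})

# Select columns by name rather than by position.
toy_features = ["GR", "PE"]
X_toy = np.array(toy[toy_features])

print(X_toy.shape)  # one row per sample, one column per feature
```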

### Ex: Can you write the code to get the label vector, `y`?

In [None]:
y = 

Check that `X` is a 2D matrix with 3232 rows and 5 columns, and `y` is a 1D vector with 3232 elements.

In [None]:
X.shape, y.shape

We have data! We're almost ready to train; we just have to sort out our train and test subsets.

## Getting ready to train

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape
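
To make the default behaviour concrete, here is a minimal sketch on synthetic arrays (the names `X_demo`, `y_demo`, `Xa`, `Xb`, `ya`, and `yb` are made up). With no extra arguments, `train_test_split` shuffles the rows and holds out a quarter of them for testing, keeping each row of `X` paired with its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i is [2i, 2i + 1] and its label is i % 2.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Default split: shuffled, with 25% of the rows held out for testing.
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo)

print(Xa.shape, Xb.shape)

# Rows and labels stay paired after shuffling.
print(np.array_equal(ya, (Xa[:, 0] // 2) % 2))
```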

### Ex: Use the docs for `train_test_split` to set the size of the test set, and also to set a random seed for the splitting.

Now the fun can really begin. 

## Training and evaluating a model

In [None]:
from sklearn.ensemble import ExtraTreesClassifier 

In [None]:
clf = ExtraTreesClassifier()

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

### Ex: Try changing some hyperparameters, e.g. `verbose`, `n_estimators`, `n_jobs`, and `random_state`

In [None]:
clf = ExtraTreesClassifier(
    # HYPERPARAMETERS GO HERE
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

All scikit-learn classifiers share the same `fit` / `predict` / `score` API (but not the same hyperparameters), so it's very easy to try lots of models.

In [None]:
from sklearn.naive_bayes import GaussianNB
GaussianNB().fit(X_train, y_train).score(X_test, y_test)
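
Because the API is shared, several models can be compared in a loop. The following is only a sketch on synthetic data from `make_classification`, not on the well logs; the choice of models and the names `X_syn`, `y_syn`, `models`, and `scores` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class problem with five features.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=0
)

# Every classifier exposes the same fit() and score() methods.
models = {
    "logistic regression": LogisticRegression(),
    "k-nearest neighbours": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = model.fit(Xs_train, ys_train).score(Xs_test, ys_test)
    print("{:>22s}  {:.3f}".format(name, scores[name]))
```

Scores on a synthetic problem say nothing about which model suits the facies data; the point is only that swapping models takes one line.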

### Ex: Try lots of models!

## Predict!

In [None]:
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

In [None]:
# Sort the labels so the names line up with the order used in the report.
target_names = [str(n) for n in sorted(set(y))]

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=target_names))

### What are precision, recall, and F1?
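
As a starting point, here is a minimal sketch for a two-class problem, using made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). Precision is the fraction of predicted positives that are correct, recall is the fraction of actual positives that were found, and F1 is the harmonic mean of the two. In a multi-class problem like ours, `classification_report` gives these per class, treating each class in turn as the positive one.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions for a two-class problem.
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_pred_demo == 1) & (y_true_demo == 1))  # true positives: 3
fp = np.sum((y_pred_demo == 1) & (y_true_demo == 0))  # false positives: 2
fn = np.sum((y_pred_demo == 0) & (y_true_demo == 1))  # false negatives: 1

precision = tp / (tp + fp)  # 3 / 5 = 0.6
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))

# The scikit-learn functions compute the same quantities.
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      round(f1_score(y_true_demo, y_pred_demo), 3))
```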

## More in-depth evaluation: leave-one-well-out cross-validation

We need a vector with one entry per sample, containing a label (an integer or a string) that identifies which well the sample came from.

In [None]:
wells = df['Well Name']

In [None]:
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)

for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))
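
To see how the splitting works without training anything, here is a minimal sketch with made-up group labels (`groups_demo`, `X_g`, `y_g`, and `folds` are hypothetical names). Each split holds out every row belonging to one group and trains on all the rest, so there are as many splits as there are distinct groups.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Six samples from three hypothetical wells.
groups_demo = np.array(["A", "A", "B", "B", "B", "C"])
X_g = np.zeros((6, 1))  # placeholder features; only the groups matter here
y_g = np.zeros(6)       # placeholder labels

folds = []
splitter = LeaveOneGroupOut()
for train_idx, test_idx in splitter.split(X_g, y_g, groups=groups_demo):
    held_out = set(groups_demo[test_idx])
    folds.append(held_out)
    print(sorted(held_out), "held out; test rows:", test_idx)
```

Holding out whole wells gives a more honest estimate than a random split, because neighbouring samples from the same well tend to look alike.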

### Ex: Can you improve the model by adding more features? We didn't include some of them in this run (see the cell where we defined `features`).

<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2016</p>
</div>