# Very quick machine learning

This notebook goes with 
We're going to go over a very simple machine learning exercise. We're using the data from the [2016 SEG machine learning contest](https://github.com/seg/2016-ml-contest). This exercise previously appeared as [an Agile blog post](http://ageo.co/xlines04).

In [None]:
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn

## Read the data

[Pandas](http://pandas.pydata.org/) is really convenient for this sort of data.

In [None]:
import pandas as pd

uid = "1WZsd3AqH9dEOOabZNjlu1M-a8RlzTyu9BYc2cw8g6J8"
uri = f"https://docs.google.com/spreadsheets/d/{uid}/export?format=csv"
df = pd.read_csv(uri)

df.head()

<div style="margin-top:12px; padding: 12px; border:2px solid gray; border-radius:5px; background: #eeeeee;"><p><b>A word about the data.</b> This dataset is not, strictly speaking, open data. It has been shared by the Kansas Geological Survey for the purposes of the contest. That's why I'm not copying the data into this repository, but instead reading it from the web. We are working on making an open access version of this dataset. In the meantime, I'd appreciate it if you didn't replicate the data anywhere. Thanks!</p></div>

## Inspect the data

First, we need to see what we have.

In [None]:
df.describe()

In [None]:
facies_dict = {1:'sandstone', 2:'c_siltstone', 3:'f_siltstone', 4:'marine_silt_shale',
               5:'mudstone', 6:'wackestone', 7:'dolomite', 8:'packstone', 9:'bafflestone'}

df["Facies"] = df["Facies Code"].replace(facies_dict)
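
One thing worth knowing about `replace`: codes that aren't in the dictionary pass through unchanged, whereas `map` turns them into `NaN`. Here's a small sketch on made-up codes (the `codes` Series and the shortened `names` dictionary are hypothetical, not the contest data):

```python
import pandas as pd

# Hypothetical facies codes; 10 is deliberately missing from the dictionary.
codes = pd.Series([1, 5, 9, 10])
names = {1: 'sandstone', 5: 'mudstone', 9: 'bafflestone'}

replaced = codes.replace(names)  # unknown codes are left as they are
mapped = codes.map(names)        # unknown codes become NaN

print(replaced.tolist())
print(mapped.tolist())
```

Using `map` followed by `isna()` is one way to spot codes we forgot to name.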

In [None]:
df.groupby('Facies').count()
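
If all we want is the number of samples per class, `value_counts` is a compact alternative to the `groupby` count. This is a minimal sketch on a made-up `Facies` column (the labels below are illustrative only):

```python
import pandas as pd

# Hypothetical labels standing in for df["Facies"]; not real contest data.
toy = pd.DataFrame({"Facies": ["sandstone", "mudstone", "mudstone",
                               "dolomite", "mudstone", "sandstone"]})

counts = toy["Facies"].value_counts()                    # samples per class
fractions = toy["Facies"].value_counts(normalize=True)   # share of each class

print(counts)
print(fractions.round(2))
```

The fractions make any class imbalance easy to see at a glance.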

In [None]:
features = ['GR', 'ILD', 'DeltaPHI', 'PHIND', 'PE']

sns.pairplot(df, vars=features, hue='Facies')

In [None]:
fig, axs = plt.subplots(ncols=5, figsize=(15, 3))

for ax, feature in zip(axs, features):
    sns.distplot(df[feature], ax=ax)

## Label and feature engineering

In [None]:
fig, axs = plt.subplots(nrows=5, figsize=(15, 10))

for ax, feature in zip(axs, features):
    for facies in df.Facies.unique():
        sns.kdeplot(df.loc[df.Facies==facies][feature], ax=ax, label=facies)
        ax.legend('')

In [None]:
sns.distplot(df.ILD)

In [None]:
sns.distplot(np.log10(df.ILD))

In [None]:
df['log_ILD'] = np.log10(df.ILD)
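
To put a number on what the log transform does, we can compare skewness before and after. This sketch uses synthetic log-normal values as a stand-in for a strongly skewed log like `ILD` (the distribution and its parameters are assumptions for illustration, not fitted to the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, log-normally distributed values; an assumed stand-in for df.ILD.
toy_ild = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=2000))

skew_raw = toy_ild.skew()             # long right tail, so strongly positive
skew_log = np.log10(toy_ild).skew()   # should be much closer to zero

print(f"skewness before: {skew_raw:.2f}, after log10: {skew_log:.2f}")
```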

## Get the feature vectors, `X`

In [None]:
features = ['GR', 'log_ILD', 'DeltaPHI', 'PHIND', 'PE']

Now we'll pull out the data we want. First the feature vectors, `X`. We'll just take the log columns listed in `features`:

In [None]:
X = df[features].values

In [None]:
X.shape

## Get the label vector, `y`

In [None]:
y = df.Facies.values

In [None]:
y

In [None]:
y.shape

In [None]:
plt.figure(figsize=(15,2))
# Plot the numeric facies codes rather than the string labels in `y`.
plt.fill_between(np.arange(y.size), df["Facies Code"].values, -1)

We have data! We're almost ready to train; we just have to sort out our training and validation subsets.

## Extracting some training data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y)

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

**Optional exercise:** Use [the docs for `train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to set the size of the test set, and also to set a random seed for the splitting.
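
One possible answer, sketched on random data so it runs on its own. `test_size`, `random_state` and `stratify` are real arguments of `train_test_split`; the toy arrays and the particular values chosen are just assumptions for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 5))    # 100 made-up samples, 5 features
y_toy = np.repeat(["a", "b"], 50)    # two balanced, made-up classes

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # hold out 20% of the samples
    random_state=42,    # fixed seed, so the split is repeatable
    stratify=y_toy,     # keep class proportions the same in both subsets
)

print(X_tr.shape, X_te.shape)
```

Stratifying is optional, but it can help when some facies are rare.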

Now the fun can really begin. 

## Training

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()

clf.fit(X_train, y_train)

## Predict and evaluate

In [None]:
y_pred = clf.predict(X_val)

How did we do? A quick score:

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_val, y_pred)

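A better score when the classes are imbalanced is the F1 score, here averaged over the facies and weighted by class size: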

In [None]:
from sklearn.metrics import f1_score

f1_score(y_val, y_pred, average='weighted')
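
To see why accuracy alone can flatter a model, here's a toy case with made-up labels: a "model" that always predicts the majority class. The labels are invented for illustration, and `zero_division=0` just silences the warning for the class that is never predicted:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
# A lazy "model" that always predicts the majority class.
y_lazy = ["a"] * 10

acc = accuracy_score(y_true, y_lazy)
f1_w = f1_score(y_true, y_lazy, average="weighted", zero_division=0)
f1_m = f1_score(y_true, y_lazy, average="macro", zero_division=0)

print(f"accuracy={acc:.2f}, weighted F1={f1_w:.2f}, macro F1={f1_m:.2f}")
```

The macro average treats every class equally, so it punishes the missed minority class hardest.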

We can also get a quick score without calling `predict` ourselves. It's not always obvious what this score represents, so check the docs: for scikit-learn classifiers, `score` returns the mean accuracy.

In [None]:
clf.score(X_val, y_val)
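
A quick way to convince ourselves of what `score` returns for a classifier is to compare it with `accuracy_score` directly. This sketch uses synthetic blobs from `make_blobs` rather than the well data, and the names `knn`, `X_tr` and so on are local to the example:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data with 5 features, standing in for the logs.
X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

knn = KNeighborsClassifier().fit(X_tr, y_tr)

via_score = knn.score(X_te, y_te)                     # predicts internally
via_metric = accuracy_score(y_te, knn.predict(X_te))  # explicit predict

print(via_score, via_metric)
```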

## Model tuning and model selection

**Optional exercise:** Let's change the hyperparameters of the model. E.g. try changing the `n_neighbors` argument.

In [None]:
clf = KNeighborsClassifier(... HYPERPARAMETERS GO HERE ...)
clf.fit(X_train, y_train)
clf.score(X_val, y_val)
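
As a hint for the exercise, here's one way to compare several values of `n_neighbors` in a loop. It's a sketch on synthetic, deliberately overlapping blobs, not the well data; the list of `k` values and the `weights="distance"` setting are arbitrary choices for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping classes so that the choice of k makes a difference.
X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=5,
                          cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

scores = {}
for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)   # mean accuracy on the held-out data

for k, s in scores.items():
    print(f"k={k:>2}: {s:.3f}")
```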

**Optional exercise:** Try another classifier.

In [None]:
from sklearn.svm import SVC

clf = SVC(gamma='auto')
clf.fit(X_train, y_train)
clf.score(X_val, y_val)

**Optional exercise:** Let's look at another classifier. Try changing some of its arguments, e.g. `verbose`, `n_estimators`, `n_jobs`, and `random_state`.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier 

clf = ExtraTreesClassifier(n_estimators=100)
clf.fit(X_train, y_train)
clf.score(X_val, y_val)

All models have the same API (but not the same hyperparameters), so it's very easy to try lots of models.
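
Because of that shared `fit` / `score` interface, we can loop over models without changing any other code. A minimal sketch, again on synthetic blobs rather than the well data (the dictionary keys and settings are just choices for this example):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=5,
                          cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "svc": SVC(gamma="auto"),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Same two calls for every model: fit on the training set, score on the rest.
results = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
           for name, model in models.items()}

for name, s in results.items():
    print(f"{name}: {s:.3f}")
```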

## More in-depth evaluation

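The confusion matrix shows exactly what kinds of mistakes (type 1 and type 2 errors) we're making. Note that `y_pred` still holds the predictions from the first k-nearest neighbours model, not from the classifiers we tried afterwards: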

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_val, y_pred)
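
The raw matrix is easier to read once we know its layout: rows are the true classes, columns are the predicted ones, in the order given by `labels`. Here's a tiny hand-checkable sketch with made-up labels (not the facies data):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, small enough to check by hand.
y_true = ['sand', 'sand', 'shale', 'shale', 'shale', 'lime']
y_hat = ['sand', 'shale', 'shale', 'shale', 'lime', 'lime']
labels = ['lime', 'sand', 'shale']

cm = confusion_matrix(y_true, y_hat, labels=labels)

# Row i is the true class labels[i]; column j is the predicted class labels[j].
for name, row in zip(labels, cm):
    print(f"{name:>5}: {row}")
```

Correct predictions sit on the diagonal; everything off the diagonal is a mistake, and its position tells us which class was confused with which.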

Finally, the classification report shows the precision and recall for each facies (loosely, 1 minus the type 1 and type 2 error rates), along with their combination, the F1 score:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))
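
If we want the numbers rather than the printed table, `classification_report` can also return a dictionary via `output_dict=True`. A small sketch on made-up labels (not the facies data) that are easy to verify by hand:

```python
from sklearn.metrics import classification_report

# Made-up true and predicted labels, small enough to check by hand.
y_true = ['sand', 'sand', 'shale', 'shale', 'shale', 'lime']
y_hat = ['sand', 'shale', 'shale', 'shale', 'lime', 'lime']

report = classification_report(y_true, y_hat, output_dict=True)

# Precision: of the samples predicted as a class, how many really were it.
# Recall: of the samples truly in a class, how many we found.
for name in ('lime', 'sand', 'shale'):
    p, r = report[name]['precision'], report[name]['recall']
    print(f"{name:>5}: precision={p:.2f}, recall={r:.2f}")
```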

<html><hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2019</p>
</div></html>