In [None]:
from src.data import Dataset
from src import workflow

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

# We’re gonna (data) science the *@#! out of this

Now that we're getting good at automating the `Dataset` generation process, let's actually **use** our data!

## Bjørn's Problem: Supervised Learning

Bjørn employs a large number of Finnish line cooks. He can’t understand a word they say.

Bjørn needs a trained model to do real-time translation from Finnish to Swedish.

Bjørn has decided to start with the Finnish phoneme dataset shipped with a project called lvq-pak. His objective is to train three different models, and choose the one with the best overall accuracy score.


## Load the Dataset
In a previous notebook, we created training and test versions of the lvq-pak `Dataset` object. Let's reload these and have a look.

In [None]:
workflow.available_datasets()

**Recall**: the data consists of 20-dimensional MFCC data.

In [None]:
ds_train = Dataset.load('lvq-pak_train')

In [None]:
ds_train.data.shape

In [None]:
ds_train.target

The target labels are numerical. If you are interested in the phoneme labels themselves, the mapping is stored in the `Dataset` metadata:

In [None]:
ds_train.LABEL_MAP

Let's grab the test set as well.

In [None]:
ds_test = Dataset.load('lvq-pak_test')

In [None]:
ds_test.data.shape

A quick look at the license verifies that, while we are free to use this data for experimentation, we can't turn around and ship a commercial Finnish to Swedish translator. That's okay. This is for Bjørn's kitchen only:

In [None]:
print(ds_train.LICENSE)

## Let's train a model (the old-fashioned way)
We will walk through one example of building a model by hand. Later, we will convert this process to a reproducible data science workflow. 

Let's add the **Linear Support Vector Classifier** from scikit-learn.

In [None]:
from sklearn.svm import LinearSVC

In [None]:
model = LinearSVC(random_state=42)

In [None]:
model.fit(ds_train.data, ds_train.target)

Whoops. We had better increase the number of iterations until the model actually converges.

In [None]:
%%time
model = LinearSVC(random_state=42, max_iter=200000)
model.fit(ds_train.data, ds_train.target)

## Use the model to predict phoneme classes


In [None]:
lsvc_prediction = model.predict(ds_test.data)
lsvc_prediction[:20]

## Assess the quality of the prediction


In [None]:
model.score(ds_test.data, ds_test.target)

"Score" seems a little opaque. What kind of score is being used here? Turns out it's an **accuracy score**. Here it is a little more explicitly:

In [None]:
from sklearn.metrics import accuracy_score
help(accuracy_score)

In [None]:
accuracy_score(ds_test.target, lsvc_prediction)
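
To make the definition concrete: accuracy is simply the fraction of predictions that exactly match the true labels. The minimal sketch below computes it by hand on made-up labels (the values are illustrative, not drawn from lvq-pak); for the same inputs it should agree with what `accuracy_score` reports.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only (not lvq-pak data)
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 1, 2, 1, 1]
toy_accuracy = accuracy(toy_true, toy_pred)  # 4 of the 5 labels match
```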

Now, let's automate this process, and make it reproducible.

# Step 3: Train Models (`make train`)
![The `make train` process](../references/workflow/make-train.png)
In this step, we use the processed datasets we created in *Step 2* (`make data`) to train and save models. For this workflow, a **Model** is an object that conforms to the scikit-learn `BaseEstimator` API.

## Add our algorithm to `available_algorithms()`

How do we make an algorithm available for use with our reproducible data science workflow? We give it a name (a text string), and map this string to the function we wish to call. We will use this general technique throughout this flow to make various algorithms, datasets, models, and analyses usable by our workflow process.


In [None]:
workflow.available_algorithms()

To add an algorithm to this list, we add a key:value pair to the dict `_ALGORITHMS` in `src/models/algorithms.py`.

For example, add
```
'linearSVC': LinearSVC()
```
to the `_ALGORITHMS` dict, and add
```
from sklearn.svm import LinearSVC
```
to the top of the file.

Also, add `linearSVC` to the docstring of `available_algorithms`.
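
The name-to-object mapping described above can be sketched as follows. This is a simplified, self-contained illustration, not the actual contents of `src/models/algorithms.py`: `ToyAlgorithm` is a hypothetical stand-in for an estimator such as `LinearSVC()`, and `ALGORITHM_REGISTRY`, `registered_names`, and `lookup` are made-up names.

```python
class ToyAlgorithm:
    """Hypothetical stand-in for an estimator such as LinearSVC()."""

    def __init__(self, **params):
        self.params = params


# Map a name (a text string) to the object the workflow should use.
# Assumption: the real `_ALGORITHMS` dict has the same general shape.
ALGORITHM_REGISTRY = {
    'toyAlgorithm': ToyAlgorithm(),
}


def registered_names():
    """Return the sorted names of every registered algorithm."""
    return sorted(ALGORITHM_REGISTRY)


def lookup(name):
    """Return the algorithm registered under `name`, or raise KeyError."""
    if name not in ALGORITHM_REGISTRY:
        raise KeyError(f"Unknown algorithm: {name!r}")
    return ALGORITHM_REGISTRY[name]
```

The appeal of this pattern is that everything downstream (JSON files, Makefile targets) only needs to store the string, never the Python object.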

In [None]:
help(workflow.available_algorithms)

In [None]:
workflow.available_algorithms()

Now we can add **model generation instructions** to our reproducible data science workflow. In this case, apply the `linearSVC` model to the `lvq-pak_train` dataset:

In [None]:
workflow.add_model(dataset_name='lvq-pak_train',
                   algorithm_name="linearSVC",
                   algorithm_params={'random_state': 42, 'max_iter': 200000})

We can see the complete list of model/dataset combinations using `get_model_list()`:

In [None]:
workflow.get_model_list()

To actually train this model:

In [None]:
workflow.build_models()

Or alternatively, from the Makefile:

In [None]:
!cd .. && make train

The output of this process is a **trained model**. We currently record this in two places:
* A trained model in `models/trained_models`
* A json file on disk (`models/trained_models.json`).  

Of course, we also make this information available via a workflow command: `available_models()`. Notice the clever naming scheme for the model produced by applying `linearSVC` to `lvq-pak_train`:

In [None]:
workflow.available_models()
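
Judging from the model name used in this notebook, `linearSVC_lvq-pak_train_1`, the name appears to join the algorithm name, the dataset name, and a trailing run counter with underscores. A minimal sketch of that convention follows; `model_name_for` is a hypothetical helper, not part of `src.workflow`, and the real naming logic may differ.

```python
def model_name_for(algorithm_name, dataset_name, run_number=1):
    """Build a model name as <algorithm>_<dataset>_<run number>.

    Assumption: this mirrors the naming pattern seen in this notebook.
    """
    return f"{algorithm_name}_{dataset_name}_{run_number}"


example_name = model_name_for('linearSVC', 'lvq-pak_train')
```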

### ASIDE: Under the Hood

If you take a peek into the `Makefile`, you'll notice that `make train` takes a `models/model_list.json` as input.
```
## train / fit / build models
train: models/model_list.json
	$(PYTHON_INTERPRETER) -m src.models.train_model model_list.json
```

Under the hood, a `model_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset_name`: A valid dataset name from `available_datasets()`
* `algorithm_name`: A valid algorithm name from `available_algorithms()`
* `algorithm_params`: A dictionary of parameters to use when running the specified algorithm
* `run_number`: (optional, default 1) A unique integer used to distinguish between different builds with otherwise identical parameters
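
As a rough illustration of this structure, the entry added earlier with `workflow.add_model` could be represented as shown below. This sketch only builds and serializes the list in memory; the exact layout that the workflow writes to `models/model_list.json` may differ.

```python
import json

# One dict per model to build, using the four keys listed above
model_list = [
    {
        'dataset_name': 'lvq-pak_train',
        'algorithm_name': 'linearSVC',
        'algorithm_params': {'random_state': 42, 'max_iter': 200000},
        'run_number': 1,  # optional; defaults to 1
    },
]

# Serialize to a JSON string, then parse it back to confirm it round-trips
serialized = json.dumps(model_list, indent=2, sort_keys=True)
round_tripped = json.loads(serialized)
```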

Throughout this reproducible data science workflow, we are constantly creating and storing information in json files on disk.  

In [None]:
!cat ../models/model_list.json

You don't necessarily need to know any of this, but sometimes it's nice to know what's going on under the hood.

### What exactly is a trained model in our reproducible workflow?
Let's take a look at the output from `make train`

In [None]:
from src.paths import trained_model_path

In [None]:
workflow.available_models()

In [None]:
# load up the trained model
from src.models.train import load_model

tm, tm_metadata = load_model(model_name='linearSVC_lvq-pak_train_1')

In [None]:
tm

Just as before, this is an object that conforms to the sklearn `BaseEstimator` API. In addition to the trained model, we also get back some useful metadata, which includes the hashes of the input data, the hash of the generated model, and everything we need to know to train the model from scratch.

In [None]:
tm_metadata

Just to check, we can verify that the stored dataset called `lvq-pak_train` was the same one used to train this model: (**data provenance** in action!)

In [None]:
ds = Dataset.load('lvq-pak_train')
ds.DATA_HASH
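
The idea behind this check can be sketched with the standard library: hash the raw bytes of the data at training time, store the digest alongside the model, and later recompute it and compare. The choice of SHA-1 and the `digest_of` helper are assumptions for illustration; how `Dataset` actually computes `DATA_HASH` is not shown in this notebook.

```python
import hashlib


def digest_of(raw_bytes):
    """Return a hex digest that identifies the given bytes."""
    return hashlib.sha1(raw_bytes).hexdigest()


# At training time: record the hash of the input data with the model
recorded_hash = digest_of(b'toy training data')

# Later: recompute the hash and compare it with the recorded value
same_data = digest_of(b'toy training data') == recorded_hash
changed_data = digest_of(b'toy training data!') == recorded_hash
```

Any change to the input bytes, however small, yields a different digest, which is what makes the comparison a useful provenance check.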

## An Aside: sklearn Estimator API
To implement the notion of a model, we borrow a basic data type from scikit-learn: the **BaseEstimator**. To use an algorithm as a model, we must build it into a class which:
* is a subclass of the sklearn `BaseEstimator` class (or implements `get_params`, `set_params`)
* has a `fit` method (needed for `make train`)
* has either a `predict` method (if it's a **supervised learning** problem) or a `transform` method (**unsupervised learning** problem) (needed for `make predict`)
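
To make these requirements concrete, here is a minimal sketch of a class that satisfies them for the supervised case. `MajorityClassifier` is a hypothetical toy, and it takes the second route from the first bullet: it implements `get_params` and `set_params` by hand rather than subclassing `BaseEstimator`.

```python
from collections import Counter


class MajorityClassifier:
    """Toy supervised model: always predicts the most common training label."""

    def __init__(self, fallback_label=None):
        # Label to predict if the model was fit on no examples
        self.fallback_label = fallback_label

    def get_params(self, deep=True):
        return {'fallback_label': self.fallback_label}

    def set_params(self, **params):
        for key, value in params.items():
            if key not in self.get_params():
                raise ValueError(f"Invalid parameter: {key}")
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # Learn the most frequent label in the training targets
        labels = list(y)
        if labels:
            self.majority_ = Counter(labels).most_common(1)[0][0]
        else:
            self.majority_ = self.fallback_label
        return self

    def predict(self, X):
        # Ignore the features and repeat the learned label for every row
        return [self.majority_ for _ in X]


toy_model = MajorityClassifier().fit([[0.1], [0.2], [0.3]], [1, 1, 0])
toy_predictions = toy_model.predict([[0.5], [0.9]])
```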

We will see how things work in the unsupervised case in the next workbook. 

One of the advantages of using the sklearn **Estimator** API is that a model can consist of any combination of "algorithms", as long as that combination is a `BaseEstimator` implementing the methods above. For example, you can use an sklearn `Pipeline`, or an sklearn meta-estimator like `GridSearchCV`, to implement a model.
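
As a small illustration of combining algorithms into one model, the sketch below chains a scaler and a `LinearSVC` into a single sklearn `Pipeline`. The choice of `StandardScaler` and the tiny made-up dataset are assumptions for demonstration only; nothing here is specific to lvq-pak or to this workflow.

```python
from sklearn.base import BaseEstimator
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Chain a preprocessing step and a classifier into a single estimator
pipeline_model = make_pipeline(StandardScaler(), LinearSVC(random_state=42))

# Tiny, clearly separable toy data (not lvq-pak)
X_toy = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y_toy = [0, 0, 1, 1]

# The combined object exposes the same fit/predict interface as its parts
pipeline_model.fit(X_toy, y_toy)
toy_labels = pipeline_model.predict([[0.05, 0.1], [0.95, 1.0]])
```

Because the pipeline is itself a `BaseEstimator` with `fit` and `predict`, it could in principle be registered under a name just like a single algorithm.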

If your algorithm of choice is **not yet** a `BaseEstimator` with the appropriate API, it is fairly easy to wrap it to be used in this way. While we won't have time to cover an example of doing this during the in-person part of this tutorial, we'll give an example of a custom estimator later in this notebook using `GridSearchCV`. Furthermore, the Text Embedding (advanced usage tutorial notebook) has an example of implementing gensim's FastText algorithm as an Estimator.



# Step 4: `make predict`

![The `make predict` process](../references/workflow/make-predict.png)

In the **Predict/Transform** step, we flow data through our trained models to obtain new datasets: either predictions or transformations, depending on whether we are using supervised or unsupervised-style algorithms.



### Predicting Phonemes
Bjørn is doing supervised learning (and he did a train/test split on the data before we started), so let's use the test set here to do the prediction.

In [None]:
workflow.add_prediction(dataset_name='lvq-pak_test',
                        model_name='linearSVC_lvq-pak_train_1',
                        is_supervised=True)

In [None]:
workflow.get_prediction_list()

In [None]:
workflow.run_predictions()

In [None]:
# This is the same as
!cd .. && make predict

In [None]:
workflow.available_predictions()

Yuck. We didn't specify an output dataset name, so our workflow just inferred one that makes sense (though it is a bit of a mouthful). Let's fix that.

In [None]:
workflow.get_prediction_list()

In [None]:
prediction = workflow.pop_prediction()
prediction['output_dataset'] = 'lvq-test-svc'
workflow.add_prediction(**prediction)
workflow.get_prediction_list()

In [None]:
workflow.run_predictions()

In [None]:
workflow.available_predictions()

Now we have two predictions. We'll see here that they are the same.

### What is a Prediction?

Under the hood, a Prediction is just a `Dataset` with an added `experiment` metadata header.

In [None]:
from src.paths import model_output_path
from src.utils import list_dir

In [None]:
list_dir(model_output_path)

In [None]:
predict_ds = Dataset.load('lvq-test-svc', data_path=model_output_path)

In [None]:
predict_ds.data.shape

In [None]:
predict_ds.metadata['experiment']

Here we have saved all sorts of useful information, such as the hashes of the data that went in, and the start time/duration of the prediction itself. Most importantly, the prediction we got via this process was exactly the same as the one we did manually, before converting our process to a reproducible workflow.

In [None]:
ds = Dataset.load('lvq-pak_test')
ds.DATA_HASH

Finally, check that our prediction matches what we got **before** we turned this into an automated reproducible workflow:


In [None]:
all(predict_ds.data == lsvc_prediction)

### An Aside: "Randomness" and `random_state`
Randomness is often a key feature of machine learning algorithms, but for reproducible data science, it is death. It's essential, when building reproducible data science flows, that our randomness is controlled by a deterministic `random_state` (or random_seed). 

In [None]:
model, model_meta = load_model('linearSVC_lvq-pak_train_1')
model_meta['algorithm_params']

**Always** pass in a `random_state`. If we want to run our algorithm multiple times with different random states, we can even use `GridSearchCV` where the only parameter that we're varying over is the `random_state`. 
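
The principle can be illustrated without scikit-learn. In the sketch below, Python's standard `random` module stands in for an algorithm's internal randomness, and `shuffled_indices` is a hypothetical helper: with the same seed, two runs produce identical results.

```python
import random


def shuffled_indices(n, random_state):
    """Return a shuffled list of range(n), fully determined by the seed."""
    rng = random.Random(random_state)  # private generator; no global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# Same seed, same result: this is what makes a run reproducible
first_run = shuffled_indices(10, random_state=42)
second_run = shuffled_indices(10, random_state=42)

# A different seed will generally give a different ordering
other_seed = shuffled_indices(10, random_state=7)
```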

## Summary: `make train`, `make predict`
That felt like a lot of exposition. In fact, here's what we ended up doing:

In [None]:
# Add `linearSVC` to the algorithm list in `src/models/algorithms.py`

# train a model called "linearSVC_lvq-pak_train_1"
workflow.add_model(dataset_name='lvq-pak_train',
                   algorithm_name="linearSVC",
                   algorithm_params={'random_state': 42, 'max_iter': 200000})
# predict on the test set, saving the output as "lvq-test-svc"
workflow.add_prediction(dataset_name='lvq-pak_test',
                        model_name='linearSVC_lvq-pak_train_1',
                        is_supervised=True, output_dataset='lvq-test-svc')

workflow.build_models()  # or `make train`
workflow.run_predictions() # or `make predict`