# 1. hybrid-vocal-classifier autolabel workflow

Here are the steps in the workflow for autolabeling vocalizations.

First we import the library; in Python, you must `import` a library before you can use it.

In [None]:
import hvc  # in Python we have to import a library before we can use it

### 0. Label a small set of songs to provide **training data** for the models, typically ~20 songs.
Here we download the data from a repository.  
**You don't need to run this if you've already downloaded the data.**

In [None]:
hvc.utils.fetch('gy6or6.032212')
hvc.utils.fetch('gy6or6.032612')

### 1. Pick a machine learning algorithm/**model** and the **features** used to train the model. 

In this case we'll use the k-Nearest Neighbors (k-NN) algorithm because it's fast to apply to our data. We'll use the features built into the library that have been tested with k-NN.
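
To make the idea behind k-NN concrete, here is a minimal sketch in plain Python. It is not how the library implements the algorithm; the function name `knn_predict` and the toy feature values are made up for illustration. Each new sample gets the label that is most common among its `k` closest training samples.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples (sorting by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D "features" for two syllable classes, 'a' and 'b' (made-up values)
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))
print(knn_predict(train_X, train_y, (1.0, 0.9), k=3))
```

Because there is no separate training phase beyond storing the labeled samples, k-NN is quick to set up, which fits the small training set used here.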

Picking a model and the features that go with it is simple:  
1. In a text editor, open `gy6or6_autolabel_example.extract.knn.config.yml`
2. Below the line that says `feature_group:` add `knn` after the dash.
3. Below the line that says `data_dirs:` add the path to the data you downloaded after the dash.

### 2. Extract features for that model from song files that will be used to train the model.  

We call the `extract` function and pass it the name of the YAML config file as an argument.

```Python
# 1. pick a model and 2. extract features for that model
# Model and features are defined in extract.config.yml file.
hvc.extract('gy6or6_autolabel_example.extract.knn.config.yml')
```

### 3. Pick the **hyperparameters** used by the algorithm as it trains the model on the data.
Now we use some convenience functions in Python to find the **hyperparameters** that give us the best accuracy when we train our machine learning models.
```Python
# 3. pick hyperparameters for model
# Load summary feature file to use with helper functions for
# finding best hyperparameters.
from glob import glob
# glob returns a list of matching paths, so take the first match
summary_file = glob('./extract_output*/summary*')[0]
summary_data = hvc.load_feature_file(summary_file)
# In this case, we picked a k-nearest neighbors model
# and we want to find what value of k will give us the highest accuracy
X = summary_data['features']
y = summary_data['labels']
cv_scores, best_k = hvc.utils.find_best_k(X, y, k_range=range(1, 11))
```
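
For readers curious what a search over `k` involves, the following is a rough sketch using cross-validation in plain Python. It is an assumption about the general technique, not the implementation of `hvc.utils.find_best_k`; the helper names `cv_accuracy` and `pick_best_k` and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of k-NN over `n_folds` held-out folds."""
    fold_scores = []
    for fold in range(n_folds):
        # Every n_folds-th sample (offset by `fold`) is held out for testing
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / n_folds


def pick_best_k(X, y, k_range):
    """Return ({k: mean accuracy}, k with the highest mean accuracy)."""
    scores = {k: cv_accuracy(X, y, k) for k in k_range}
    best = max(scores, key=scores.get)
    return scores, best


# Synthetic data: two noisy clusters standing in for two syllable classes
rng = random.Random(0)
X = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
X += [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(30)]
y = ['a'] * 30 + ['b'] * 30

scores, best_k = pick_best_k(X, y, range(1, 11))
print(best_k, scores[best_k])
```

When several values of `k` tie, `max` returns the first one it encounters, so this sketch prefers the smallest `k` among equally accurate choices.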

### 4. Train, i.e., fit the **model** to the data  
### 5. Select the **best** model based on some measure of accuracy. 

1. In a text editor, open `gy6or6_autolabel.example.select.knn.config.yml`
2. On the line that says `feature_file:` paste the name of the feature file after the colon. The name will have a format like `summary_file_bird_ID_date`.

Then run the code in the cell below:
```Python
# 4. Fit the model to the data and 5. select the best model
hvc.select('gy6or6_autolabel.example.select.knn.config.yml')
```
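
The `select` function handles training and model selection for you. As a conceptual sketch only (not the library's code), selecting the best model by accuracy can look like the following; the names `accuracy` and `select_best`, the two threshold "models", and the validation values are all made up for illustration.

```python
def accuracy(predict, X, y):
    """Fraction of samples in (X, y) that `predict` labels correctly."""
    return sum(predict(x) == label for x, label in zip(X, y)) / len(y)


def select_best(candidates, X_val, y_val):
    """Return (name, score) of the candidate with the highest accuracy."""
    scores = {name: accuracy(fn, X_val, y_val) for name, fn in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


# Two hypothetical "models": threshold rules on a single feature value
candidates = {
    'threshold_0.5': lambda x: 'a' if x < 0.5 else 'b',
    'threshold_2.0': lambda x: 'a' if x < 2.0 else 'b',
}

# Made-up validation set, held out from training
X_val = [0.1, 0.3, 0.4, 0.8, 0.9, 1.2]
y_val = ['a', 'a', 'a', 'b', 'b', 'b']

best_name, best_score = select_best(candidates, X_val, y_val)
print(best_name, best_score)
```

Scoring on data that was held out from training matters here: a model scored on its own training data can look more accurate than it will be on new songs.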

### 6. Using the fit model, **predict** labels for unlabeled data.

1. In a text editor, open `gy6or6_autolabel.example.predict.knn.config.yml`
2. On the line that says `model_meta_file:`, after the colon, paste the name of a meta file from the `select` output. The name will have a format like `summary_file_bird_ID_date`.
3. Below the line that says `data_dirs:`, after the dash, add the path to the other folder of data that you downloaded.

Then run the code in the cell below:
```Python
# 6. Predict labels for unlabeled data using the fit model.
hvc.predict('gy6or6_autolabel.example.predict.knn.config.yml')
```

Congratulations! You have auto-labeled an entire day's worth of data.