# feature extraction

We have already seen that supervised learning needs **features** in order to learn to classify. What do we mean when we say *features*?
Some of the features used with the **k-Nearest Neighbors** algorithm are shown below: the durations of syllables and of the brief, silent pauses between syllables. Notice that they are very *engineered*: I have to define the current syllable, then specify the duration of the syllable that precedes it, and write code that knows how to deal with edge cases (e.g., the first syllable doesn't have any syllable preceding it, so what should the value of the feature be in that case?).

![example knn features](../static/knn_duration_features.png)
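To make the edge case concrete, here is a minimal sketch of how duration features like these could be computed from syllable onset and offset times. It is not the code `hvc` uses: the function name `duration_features`, the choice of columns, and the `fill_value` of NaN for the first syllable are all assumptions made for illustration.

```python
import math


def duration_features(onsets, offsets, fill_value=math.nan):
    """Return one row per syllable: [duration, preceding duration, preceding gap].

    onsets, offsets: syllable start and end times in seconds, in song order.
    fill_value: hypothetical placeholder for features that are undefined
        because the first syllable has nothing preceding it.
    """
    durations = [off - on for on, off in zip(onsets, offsets)]
    rows = []
    for i, dur in enumerate(durations):
        if i == 0:
            # edge case: no preceding syllable, so use the placeholder
            prev_dur = fill_value
            prev_gap = fill_value
        else:
            prev_dur = durations[i - 1]
            # silent pause between the previous offset and this onset
            prev_gap = onsets[i] - offsets[i - 1]
        rows.append([dur, prev_dur, prev_gap])
    return rows


# made-up times for three syllables
onsets = [0.0, 0.5, 1.0]
offsets = [0.25, 0.75, 1.5]
rows = duration_features(onsets, offsets)
for row in rows:
    print(row)
```

Using NaN makes the missing value explicit, but many classifiers cannot handle NaN, so in practice you might substitute zero or drop those rows instead.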

How do we do feature extraction with `hvc`?
First we need some song to work with. We download two days of recordings from a public repository.

In [None]:
import hvc
# use the `fetch` function to download data from a public repository
hvc.utils.fetch('gy6or6.032212', destination='../output/')
hvc.utils.fetch('gy6or6.032612', destination='../output/')

We will call the `hvc.extract` function on some of this data. But first we need to define some arguments we'll use when we call the function. To begin, we tell `hvc` where the data lives.

In [None]:
data_dirs = ['../output/032212/']  # a list, in this case with only one element in it

We also tell `hvc` the audio file format. 

In [None]:
file_format = 'cbin'

The repository we downloaded from has `.cbin` audio files, and the annotations are in `.mat` files, which `hvc` knows how to parse.

You can also use the more common `.wav` files and annotate your song however you prefer, as long as you can get the annotations into the very simple text file format that `hvc` uses. See the docs for a simple Python script that does this, which you can adapt as needed.

Lastly, we tell `hvc` which syllables it should learn to classify, i.e., which of the labeled segments it should use. We also tell `hvc` the name of a **feature group**, i.e., a set of features, to extract. These feature groups are built into the library, and each is meant for use with one of the machine learning algorithms. We use a set of features that has been shown to be effective with the k-Nearest Neighbors algorithm (some of which we saw above).

In [None]:
labels_to_use = 'iabcdefghjk'
feature_group = 'knn'
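To illustrate what choosing labels means, here is a toy sketch of the filtering step. It is not `hvc`'s own code, and it assumes that each character in `labels_to_use` is one syllable label. The `segment_labels` list is a made-up annotation for a single song file.

```python
labels_to_use = 'iabcdefghjk'

# hypothetical annotation: one label per segmented syllable in a song file
segment_labels = ['i', 'i', 'a', 'b', 'x', 'c', '0', 'd']

# assumption: each character of the string is a separate label
keep = set(labels_to_use)

# keep only the segments whose label we asked for
kept_indices = [idx for idx, label in enumerate(segment_labels) if label in keep]
kept_labels = [segment_labels[idx] for idx in kept_indices]

print(kept_indices)
print(kept_labels)
```

Segments labeled `'x'` and `'0'` are skipped because those characters do not appear in `labels_to_use`.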

We also tell `hvc` where to save the features it extracts. And we want `hvc` to give us back the features in a variable so we can work with them in our Jupyter notebook.

In [None]:
output_dir = '../output/'
return_features = True

Now we call the `hvc.extract` function with these arguments.

In [None]:
ftrs = hvc.extract(data_dirs=data_dirs,
                   file_format=file_format,
                   labels_to_use=labels_to_use,
                   feature_group=feature_group,
                   output_dir=output_dir,
                   return_features=return_features)

`hvc` tells us it's extracting features from files and then returns a Python `dict` (dictionary), which is a mapping of "keys" to "values". For example, the `'features'` key points to a matrix where each row is an individual syllable and each column is a feature extracted from that syllable.

In [None]:
print(ftrs['features'])
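If the rows-and-columns layout is unfamiliar, the toy example below mimics it. The dictionary `toy_ftrs` and its numbers are made up, and it uses nested lists rather than a real matrix so that it runs without any extra libraries; only the `'features'` key is taken from the text above.

```python
# toy stand-in for the dict returned by hvc.extract; the numbers are made up
toy_ftrs = {
    'features': [
        [0.25, 0.10, 0.05],  # syllable 0: three feature values
        [0.50, 0.25, 0.05],  # syllable 1: three feature values
    ],
}

features = toy_ftrs['features']
n_syllables = len(features)      # number of rows
n_features = len(features[0])    # number of columns

# one row holds every feature for a single syllable
first_syllable = features[0]
# one column holds a single feature across all syllables
first_feature = [row[0] for row in features]

print(n_syllables, n_features)
print(first_syllable)
print(first_feature)
```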

This output was also saved in the `output_dir`.

In [None]:
ls ../output/extract*

Now we can load those features and use them to train models!