# Tutorial for `musiF`

`musiF` is a Python library for analysing music scores. It is a tool for extracting features from large collections of MusicXML and MuseScore files.

`musiF` was born in the context of the [Didone Project](https://didone.eu/) and, consequently,
it is specialized in 18th-century Italian opera arias. However, it is likely to be helpful for analysing other repertoires as well.

## Installation

First, you should install Python > 3.10. We recommend using virtualenvs while you develop, so that you do not interfere with the other Python applications installed on your system.

An easy way to do this is by using [`anaconda`](https://www.anaconda.com/products/distribution), especially if you are not used to the command-line interface. Alternatively, you can use more standard methods such as Python's `virtualenv` module, `pyenv`, `poetry`, `pdm`, and similar.

In any case, to install `musiF`, just run `pip install musif` from a command line inside your virtualenv.
You can also use a Jupyter notebook or console and the `!` special command:

In [1]:
!pip install musif

Now, let's import it:

In [2]:
import musif
print(musif.__version__)

0.1.0


## Introduction

### If you are new to programming, read this

If you are new to Python, we suggest that you read a Python tutorial first.
In the following, we will use some technical terminology. In general, keep the following in mind:

* a _function_ is a way to represent code that is convenient for humans. You can think of functions as mathematical functions, with some input and some output. Some programming languages call them _procedures_; Python does not, but the name helps to grasp what functions ultimately are: sequences of commands that the computer has to execute.

* an _object_ is a computational way to represent information in the memory of a computer; you can think of objects as concepts of the real world: objects have properties (in Python named _fields_) and functionalities (named _methods_). For instance, an object could be a vehicle, which has some properties (length, maximum speed, number of wheels) and some functionalities (accelerate, decelerate, stop). Objects can also have specializations (named _children_): in our example, one child of vehicle could be the car and another could be the bike; they have different properties and apply the functionalities in different ways. The vehicle, the car, and the bike may all have instances: the car that you use every day to go to work is different from your friend's car, even if they have exactly the same properties, because they are two different concrete objects. Technically, those two cars are two _instances_ of the same _class_. To create an instance, you call a function that takes the properties of the new object as arguments and returns the instance; in Python, this function has the same name as the class. To use `musiF`, you don't need to know a lot about objects, but a little knowledge helps when you search the web.

* a _dataframe_ is another way to represent information for computers. Dataframes are designed to be extremely efficient, even if some aspects of the information can sometimes get lost, and for this reason they are used for data science problems. You can think of a _dataframe_ as a table, with rows and columns. Usually, rows are _instances_ while columns are _properties_. In data science, these words often become _samples_ and _features_/_variables_. A typical operation is to select only certain columns (properties) or only certain rows (instances), in order to take a subset of the data or to modify the data itself.
* don't be scared of using web search engines such as Google: searching the web in a proper way is one of the most important skills a programmer has!
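
The notions of function, class, child, and instance can be sketched in a few lines of plain Python. The names `describe`, `Vehicle`, `Car`, and `Bike` are made up for this illustration and have nothing to do with `musiF`.

```python
# Illustrative only: these names are hypothetical and not part of musiF.

def describe(name, wheels):
    """A function: takes some input and returns some output."""
    return f"{name} has {wheels} wheels"


class Vehicle:
    """A class: the general description of a kind of object."""

    def __init__(self, max_speed, wheels):
        # fields (properties) of the object
        self.max_speed = max_speed
        self.wheels = wheels
        self.speed = 0

    def accelerate(self, amount):
        # a method (functionality): never exceed the maximum speed
        self.speed = min(self.speed + amount, self.max_speed)

    def stop(self):
        self.speed = 0


class Car(Vehicle):
    """A child of Vehicle: same functionalities, different properties."""

    def __init__(self):
        super().__init__(max_speed=180, wheels=4)


class Bike(Vehicle):
    """Another child of Vehicle."""

    def __init__(self):
        super().__init__(max_speed=40, wheels=2)


# Calling the class creates an instance.
my_car = Car()
your_car = Car()
my_bike = Bike()

my_car.accelerate(50)
my_bike.accelerate(50)  # capped at the bike's maximum speed

print(describe("my car", my_car.wheels))
print(my_car.speed, my_bike.speed)
print(my_car is your_car)  # two distinct instances of the same class
```

Both cars have exactly the same fields, yet they are separate objects in memory, which is the difference between a class and its instances.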

### Main objects

When using `musif`, you will usually interface with two objects:
1. [`FeaturesExtractor()`](API/musif.extract.html#musif.extract.extract.FeaturesExtractor), which reads scores in MusicXML and MuseScore format and computes a dataframe containing all the extracted features. In the simplest case, each row represents a music score, while each column represents a feature.
2. [`DataProcessor()`](API/musif.process.html#musif.process.processor.DataProcessor), which takes the dataframe with all the features in it and post-processes it to clean, improve, and possibly modify some of the features.

These two objects take as input two different configurations that modify their behaviour. In other words, the functions that create the instances of `FeaturesExtractor` and `DataProcessor` can accept a wide range of arguments.
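
To make the idea of "one row per score, one column per feature" concrete, here is a toy dataframe built by hand with `pandas`. The file names, column names, and numbers are invented for illustration; they are not output of `musiF`.

```python
import pandas as pd

# Hypothetical values: real musiF column names and numbers will differ.
toy_df = pd.DataFrame(
    {
        "FileName": ["aria_1.xml", "aria_2.xml", "aria_3.xml"],
        "Key": ["D major", "G minor", "D major"],
        "NumberOfNotes": [412, 388, 530],
    }
)

# Select only some columns (features).
only_keys = toy_df[["FileName", "Key"]]

# Select only some rows (scores) with a boolean condition.
in_d_major = toy_df[toy_df["Key"] == "D major"]

print(only_keys.shape)
print(in_d_major["FileName"].tolist())
```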

But let's proceed step-by-step!

## Data

To start, download one or more of the following datasets:

...

You should put the MusicXML data in the `data` directory next to this notebook!
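
Before going on, it can help to check which files are actually found. The following is a minimal sketch using only the standard library, assuming the scores sit in a `data` directory inside the current working directory. The helper name `list_scores` is ours, and the extensions are just examples.

```python
from pathlib import Path


def list_scores(root="data", extensions=(".xml", ".mxl", ".mid")):
    """Return the sorted paths under `root` whose suffix is in `extensions`."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in extensions
    )


found = list_scores("data")
print(f"{len(found)} score file(s) found under 'data'")
```

If the count is 0, the notebook is probably running from a different working directory than the one containing `data`.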

## Configuration

Let's create a configuration for our experiment. A configuration can be expressed using a YAML file or with key-value arguments. Key-value arguments are similar to a dictionary: each _key_ must be unique in the dictionary and is associated with a _value_, while the same value may appear under several keys. Python can retrieve a value from its key very efficiently!
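
If dictionaries and key-value arguments are new to you, the following plain-Python sketch shows both. The function `make_settings` and the keys are invented for this example and are unrelated to `musiF`'s real options.

```python
# Hypothetical example: not musiF code.

# A dictionary maps unique keys to values; values may repeat.
settings = {"data_dir": "data", "backup_dir": "data", "parallel": False}
print(settings["data_dir"])  # look a value up by its key


def make_settings(config_file=None, **options):
    """Collect key-value (keyword) arguments into a dictionary."""
    result = {"config_file": config_file}
    result.update(options)
    return result


# Key-value arguments are written as key=value in the call.
built = make_settings(None, data_dir="data", parallel=False)
print(built)

# A dictionary can be unpacked into key-value arguments with **.
same = make_settings(None, **{"data_dir": "data", "parallel": False})
print(built == same)
```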

First, we'll need to import the class that describes what a configuration looks like:

In [3]:
from musif.config import ExtractConfiguration

Now, we can call its constructor to obtain a configuration object:

In [4]:
import glob

config = ExtractConfiguration(
    None,
    musescore_dir="data",
    limit_files=glob.glob("**/*.mid", recursive=True, root_dir='data'),
    basic_modules=["scoring"],
    features=["core", "ambitus", "interval", "tempo", 
              "density", "texture", "lyrics", "scale", 
              "key", "dynamics", "rhythm"],
)

As you can see, the configuration has 5 values in this case:
1. `None`: this is just a placeholder: it would usually be the `yaml` file containing the configuration; since we do not need it, we use `None`
2. `musescore_dir`: the directory where MuseScore will look for the data; we'll use MuseScore to convert MIDI files to MusicXML, so you should first [download](https://musescore.org/en/download) and install it (Note: if you are running this notebook on a remote server without GUI, make sure to have a virtual display set up in the shell running this notebook, e.g. `Xvfb :99 & export DISPLAY=:99`)
3. `limit_files`: the files to consider; here, all the `.mid` files that `glob` finds recursively under `data`
4. `basic_modules`: modules used to compute basic features needed for the subsequent ones
5. `features`: the features that will be computed

Each feature name is a word that refers to a set of features and to a musiF package.
You can also create your own [custom features](./Custom_features.html), but it's a more advanced topic.
For now, just think about `basic_modules` and `features` as the same thing, but with a precedence order. We will see the true difference later on.


Now that we have our configuration, we pass it to the function that creates `FeaturesExtractor` objects. This function has exactly the same name, `FeaturesExtractor`:

In [5]:
from musif.extract.extract import FeaturesExtractor

extractor = FeaturesExtractor(config)

# Note: we could also pass the configuration values directly to FeaturesExtractor, like this:
# 
# extractor = FeaturesExtractor(
#     None,
#     musescore_dir="/home/federico/arias_example/",
#     basic_modules=["scoring"],
#     features=["core", "ambitus", "interval", "tempo", 
#               "density", "texture", "lyrics", "scale", 
#               "key", "dynamics", "rhythm"]
# )

Before starting the extraction, we also need to tell MuseScore the type of files it should look for. In this case, we want it to look for files with extension `'.mid'`:

In [6]:
import musif.musescore.constants as musescore_c
musescore_c.MUSESCORE_FILE_EXTENSION = '.mid'

Now, we can start the extraction using the method `extract`. It will return a `dataframe`:

In [None]:
df = extractor.extract()

In [None]:
# To show the dataframe in a Jupyter notebook, just use it as last instruction of the cell, like this:
df

## Post-processing

Most of the features we have computed actually need some post-processing, for instance to replace `NaN` with 0, merge columns, or remove features that were only created while computing other features.
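
To give a feeling of what such operations look like, here is a toy example written directly with `pandas`. It is not how `DataProcessor` is implemented, and the column names are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; real musiF features have other names.
raw = pd.DataFrame(
    {
        "Violin1_Notes": [120.0, np.nan, 95.0],
        "Violin2_Notes": [80.0, 60.0, np.nan],
        "Tmp_Helper": [1, 2, 3],
    }
)

clean = raw.fillna(0)  # replace NaN with 0

# Merge two columns into one by summing them.
clean["Violins_Notes"] = clean["Violin1_Notes"] + clean["Violin2_Notes"]

# Remove the merged columns and the helper column.
clean = clean.drop(columns=["Violin1_Notes", "Violin2_Notes", "Tmp_Helper"])

print(clean)
```

`DataProcessor` takes care of this kind of clean-up for us.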

For this, we need another configuration. However, we'll use the default configuration, so we'll pass `None` in place of the yaml file/configuration object. You can do the same thing with the `FeaturesExtractor` as well.

In [None]:
from musif.process.processor import DataProcessor

processed_df = DataProcessor(df, None).process().data
processed_df

As you can see, the columns are now half as many as before!

Let's try to remove `NaN`...

In [None]:
processed_df.dropna(axis=1, inplace=True)
processed_df

The dataset was not very well formatted, which is why only a few features remain.
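
Dropping columns is a blunt tool: with `axis=1`, a single missing value is enough to discard a whole feature. On a toy dataframe with invented values, one can count the missing values per column first and compare what each option would keep:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0],
        "b": [1.0, np.nan, 3.0],  # one missing value
        "c": [np.nan, np.nan, np.nan],  # entirely missing
    }
)

missing_per_column = toy.isna().sum()
print(missing_per_column.to_dict())

# axis=1 drops every column that contains at least one NaN.
strict = toy.dropna(axis=1)

# how="all" drops only the columns that are entirely NaN.
lenient = toy.dropna(axis=1, how="all")

print(list(strict.columns), list(lenient.columns))
```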

But let's try to classify them. We will set up a feature-learning approach where the model learns to classify each sample in the dataset. The usual way to do this would be to use an autoencoder architecture, but since `sklearn` does not provide one, we will use its Multilayer Perceptron with a narrow inner layer instead.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# removing FileName and Id
if 'FileName' in processed_df:
    del processed_df['FileName']
if 'Id' in processed_df:
    del processed_df['Id']

processed_df.select_dtypes([float, int])

model = make_pipeline(
    OrdinalEncoder(), # give an integer code to each category of a feature
    StandardScaler(), # subtract the mean and scale to unit variance
    MLPClassifier(
        hidden_layer_sizes=(128, 32, 8, 2, 8, 32, 128, ), # the output size is the same as the number of labels
        activation="relu",
        solver="adam",
        alpha=0,                 # regularizer L2 weight in ADAM
        batch_size=5,
        learning_rate_init=5e-5, 
        max_iter=10**3,
        tol=1e-32,
        early_stopping=False,
        beta_1=1e-9,             # Adam decay rate for momentum 1
        beta_2=0.999,            # Adam decay rate for momentum 2
        epsilon=1e-8,            # Adam numerical stability
        random_state=934,
        # shuffle=True  
    )
)

# the next call will take some time...
model.fit(processed_df, processed_df.index)
y_hat = model.predict(processed_df)
print(f"Macro-averaged F1 score: {f1_score(processed_df.index, y_hat, average='macro')}")

In [None]:
y_hat
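
The macro-averaged F1 score computes an F1 score for each class separately and then takes their unweighted mean. The sketch below re-implements this in plain Python on made-up labels, only to show the arithmetic; in practice `sklearn.metrics.f1_score` does the job. The function name `macro_f1` is ours.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); taken as 0 if the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: three classes, one mistake.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]
print(macro_f1(y_true, y_pred))
```

Because every class weighs the same, a single mistake on a rare class lowers the macro average more than it would lower plain accuracy.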

Now, we will attach a `transform` method to the `MLPClassifier`, which returns the activations of the inner layer with 2 outputs; we interpret these as latent features.

Then, we plot the music scores according to the learned feature space.

In [None]:
mlpclassifier = model['mlpclassifier']

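def mytransform_method(X):
    # one slot per layer: index 0 is the input, the last index is the output
    activations = [None for _ in range(mlpclassifier.n_layers_)]
    activations[0] = X
    # index 4 is the fourth hidden layer, i.e. the one with 2 units
    X = mlpclassifier._forward_pass(activations)[4]
    return X
    # return PCA(2).fit_transform(X)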

mlpclassifier.transform = mytransform_method

learned_features = model.transform(processed_df)

learned_features.shape
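
`_forward_pass` is a private method of scikit-learn and may change between versions. As a hedged alternative, hidden activations can be recomputed from a fitted network's public `coefs_` and `intercepts_` attributes. The helper `hidden_activations` below is our own sketch and assumes ReLU hidden layers, as configured above; it is shown on tiny hand-written weights rather than on the trained model.

```python
import numpy as np


def hidden_activations(X, coefs, intercepts, layer):
    """Activations of hidden layer number `layer` (1 = first hidden layer).

    `coefs` and `intercepts` are lists of weight matrices and bias vectors,
    in the layout of a fitted MLP's `coefs_` and `intercepts_`.
    Assumes ReLU activations on the hidden layers.
    """
    out = np.asarray(X, dtype=float)
    for W, b in list(zip(coefs, intercepts))[:layer]:
        out = np.maximum(out @ W + b, 0.0)  # affine map followed by ReLU
    return out


# Tiny hand-written network: 3 inputs -> 2 hidden units -> 1 output.
coefs = [np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]]), np.array([[1.0], [1.0]])]
intercepts = [np.array([0.0, 0.5]), np.array([0.0])]

X = np.array([[1.0, 2.0, 3.0]])
print(hidden_activations(X, coefs, intercepts, layer=1))
```

With the pipeline of this tutorial, the 2-unit layer is the fourth hidden layer, so the counterpart would presumably be `hidden_activations(model[:-1].transform(processed_df), mlpclassifier.coefs_, mlpclassifier.intercepts_, layer=4)`, where `model[:-1]` applies only the encoding and scaling steps.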

In [None]:
import seaborn
seaborn.scatterplot(x=learned_features[:, 0], y=learned_features[:, 1])

In [None]:
learned_features

For comparison, let's plot the feature space learned by PCA:

In [None]:
from sklearn.decomposition import PCA

pca_pipeline = make_pipeline(
    OrdinalEncoder(), StandardScaler(), PCA(2)
)
data_pca = pca_pipeline.fit_transform(processed_df)
seaborn.scatterplot(x=data_pca[:, 0], y=data_pca[:, 1])
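
For readers curious about what `PCA(2)` computes: it centers the data and projects it onto the two directions of largest variance. A minimal NumPy sketch of that idea via the singular value decomposition follows. The function name `pca_2d` and the random toy data are ours, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T


rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5))
projected = pca_2d(toy)
print(projected.shape)
```

Unlike the bottleneck of the neural network, this projection is linear, which is one reason why the two scatter plots can look quite different.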