# Tutorial

In order to run this notebook:
* Create a new environment and make it visible to Jupyter:
  * with conda, run `conda create --name {name} python=3.10`
  * activate it and run `pip install ipykernel`
  * run `ipython kernel install --user --name={name}`
* Within the new environment, install the requirements, e.g. `pip install -r requirements.txt`.
  * This currently involves installing the current development versions of ms3 and dimcat.
* Clone the corpus: `git clone --recurse-submodules -j8 git@github.com:DCMLab/unittest_metacorpus.git`
* Set `meta_repo` in the second cell to the path of your local clone.

If the plots are not displayed and you are in JupyterLab, use [this guide](https://plotly.com/python/getting-started/#jupyterlab-support).

In [None]:
import os
from git import Repo
import dimcat as dc

In [None]:
meta_repo = "~/unittest_metacorpus"
repo = Repo(meta_repo)
print(f"{os.path.basename(meta_repo)} @ {repo.commit().hexsha[:7]}")
print(f"dimcat version {dc.__version__}")

## The Dataset object
### Initializing a Dataset

Pass a directory to `dimcat.Dataset.load()` to discover and parse all TSV files. The property `data` simply returns an `ms3.Parse` object.

In [None]:
dataset = dc.Dataset()
dataset.load(directory=meta_repo)
dataset.data

### Accessing pieces using IDs

* The field `Dataset.pieces` holds references to `ms3.Piece` objects, each of which bundles the data facets of one loaded piece, such as note tables and (harmony) annotation tables.
* Pieces are addressed by means of an index/ID of the form `('corpus_name', 'fname')`.

In [None]:
list(dataset.pieces.keys())

In [None]:
dataset.pieces[('ravel_piano', 'Ravel_-_Jeux_dEau')]

### Groups of IDs

* Accessing any kind of information from the Dataset relies on the current grouping of IDs.
* Although the `ms3.Parse` object groups data into various corpora, DiMCAT assumes no grouping before any Grouper has been applied (see below).
* Instead, after initialization, all indices are grouped into one list, accessible through the key `()` (empty tuple).

In [None]:
dataset.indices
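
For orientation, the ungrouped state can be pictured as a plain dictionary with a single key, the empty tuple. The following is only an illustrative sketch with made-up IDs; the container that DiMCAT actually uses may differ in type and detail.

```python
# Illustrative only: hypothetical IDs, not read from a real corpus.
indices = {
    (): [
        ("corpus_a", "piece_1"),
        ("corpus_a", "piece_2"),
        ("corpus_b", "piece_3"),
    ]
}

# Before any Grouper is applied there is exactly one group, keyed by ().
group_ids = list(indices.keys())
n_indices = sum(len(ids) for ids in indices.values())
print(group_ids, n_indices)
```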

### Accessing data facets

Currently, the following facets may be available, depending on the state of annotations:

* `'measures'`
* `'notes'`
* `'rests'`
* `'notes_and_rests'`
* `'labels'`
* `'expanded'`
* `'form_labels'`
* `'cadences'`
* `'events'`
* `'chords'`

There are two ways of accessing facets of a Dataset:

#### Iterating

For example, to iterate through note lists:

In [None]:
for group_id, id2dataframe in dataset.iter_facet('notes'):
    for ID, df in id2dataframe.items():
        print(f"First note of {ID}:")
        display(df.head(1))

Since no Groupers have been applied, we can also skip the outer loop:

In [None]:
for ID, df in dataset.iter_facet('notes', ignore_groups=True):
    print(f"Time signatures in {ID}: {list(df.timesig.unique())}")

#### Getting

Or we simply retrieve a concatenated DataFrame with a MultiIndex (i.e. an index with several hierarchical levels):

In [None]:
dataset.get_facet('notes')

## Applying PipelineSteps to a Dataset

Everything else in DiMCAT is a PipelineStep. PipelineSteps are distributed over several modules:

* `dimcat.filter`: Filters return a new Dataset where certain IDs have been removed.
* `dimcat.grouper`: Groupers subdivide each of the current ID groups based on a given criterion and return a new Dataset with an altered `.indices` field.
* `dimcat.slicer`: Slicers create for each ID (read: piece) a set of chunks identified by non-overlapping intervals. Any facet retrieved from such a sliced Dataset will be sliced, cutting and duplicating any event that overlaps the interval boundaries.
* `dimcat.analyzer`: Analyzers perform an analysis on a given Dataset and return a new Dataset with the results stored in the `.processed` field. 
* `dimcat.plotter`: Plotters plot analysis ('processed') data and potentially output plots as files.
* `dimcat.writer`: Writers output analyzed data to disk.

All these PipelineSteps come with the method `process_data()` and return a copy of the given Dataset.
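
The "return a copy" principle can be sketched with toy classes. `ToyDataset` and `ToyFilter` are hypothetical names invented for this illustration and are not part of DiMCAT; the sketch only assumes that a step leaves its input untouched and hands back a modified copy.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset: group ID -> list of piece IDs."""
    indices: dict = field(default_factory=dict)


class ToyFilter:
    """Hypothetical filter step that keeps only IDs satisfying a predicate."""

    def __init__(self, keep):
        self.keep = keep

    def process_data(self, dataset):
        result = deepcopy(dataset)  # work on a copy, leave the input untouched
        result.indices = {
            group: [ID for ID in ids if self.keep(ID)]
            for group, ids in result.indices.items()
        }
        return result


original = ToyDataset({(): [("corpus_a", "piece_1"), ("corpus_b", "piece_2")]})
filtered = ToyFilter(lambda ID: ID[0] == "corpus_a").process_data(original)
print(len(original.indices[()]), len(filtered.indices[()]))
```

Because every step returns a new object, intermediate results such as the unfiltered Dataset remain available for comparison.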

### Applying a filter

Let's see this principle at work by applying the `IsAnnotatedFilter` which returns a new Dataset where all pieces contain harmony annotations:

In [None]:
annotated = dc.IsAnnotatedFilter().process_data(dataset)
print(f"Before: {dataset.n_indices} IDs, after filtering: {annotated.n_indices} IDs")

### Applying a slicer

Now we apply the `LocalKeySlicer`, slicing the annotation tables into segments that remain in one local key:

In [None]:
localkey_slices = dc.LocalKeySlicer().process_data(annotated)
print(f"Before: {annotated.n_indices} IDs, after slicing: {localkey_slices.n_indices} IDs")
print(f"Facets that have been sliced so far: {list(localkey_slices.sliced.keys())}.")

The IDs of the sliced Dataset have multiplied and received a third element, which is the interval specifying the extent of one slice. Let's have a look at the first 10 IDs:

In [None]:
localkey_slices.indices[()][:10]

The IDs make sure that all facets retrieved from this Dataset will be sliced.

This is true not only for the facet that has been used for slicing (annotation tables in the present case):

In [None]:
localkey_slices.get_facet('expanded').head(30)

But also for any other facet requested:

In [None]:
localkey_slices.get_facet('notes')

In both cases we see an additional index level `localkey_slice` containing the intervals of the localkey segments. Notes that originally overlapped a localkey boundary are now split in two, with their `duration_qb` values adapted (`duration`, however, keeps the original value).
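
To make the splitting of overlapping notes concrete, here is a minimal sketch in plain Python. The function `slice_note` and its input format are assumptions made for this illustration; it is not how DiMCAT implements slicing, it only reproduces the effect on onset and `duration_qb`.

```python
def slice_note(onset, duration_qb, boundaries):
    """Split one note into parts that do not cross any slice boundary.

    onset and duration_qb are measured in quarter notes; boundaries is a
    sorted list of slice start positions (hypothetical input format).
    Returns a list of (onset, duration_qb) pairs.
    """
    end = onset + duration_qb
    # Only boundaries strictly inside the note cut it into pieces.
    cuts = [b for b in boundaries if onset < b < end]
    starts = [onset] + cuts
    ends = cuts + [end]
    return [(start, stop - start) for start, stop in zip(starts, ends)]


# A half note starting at 3.0 overlaps a key change at 4.0.
parts = slice_note(3.0, 2.0, boundaries=[0.0, 4.0, 12.0])
print(parts)
```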

However, we might be interested only in the slices themselves, so we can get the information stored in the field `slice_info` by calling:

In [None]:
localkey_slices.get_slice_info()[['duration_qb', 'globalkey', 'localkey']]

### Applying a Grouper

If, for example, we want to analyse localkey segments separately depending on whether they are in major or minor, we could apply a `ModeGrouper`, which can only be applied to a Dataset that has already been sliced:

In [None]:
grouped_localkey_slices = dc.ModeGrouper().process_data(localkey_slices)
grouped_localkey_slices.get_slice_info()[['duration_qb', 'globalkey', 'localkey']]

The grouping is displayed as the prepended index level `localkey_is_minor`. In this case the groups are simply called `True` or `False`, as can be seen by inspecting the `.indices` dictionary. The keys are tuples whose lengths match the number of applied Groupers so far.

In [None]:
list(grouped_localkey_slices.indices.keys())
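
How the key tuples grow with each grouping can be sketched as follows. `apply_grouper`, the made-up slice IDs and the use of plain tuples for intervals are assumptions for illustration only and do not mirror DiMCAT's internals.

```python
def apply_grouper(indices, criterion):
    """Subdivide every existing group by criterion(ID).

    indices maps group keys (tuples) to lists of IDs. Each new key is the
    old key extended by one element, so key length grows with every grouping.
    """
    regrouped = {}
    for group, ids in indices.items():
        for ID in ids:
            regrouped.setdefault(group + (criterion(ID),), []).append(ID)
    return regrouped


# Hypothetical slice IDs of the form (corpus, fname, (start, end)),
# each mapped to whether its local key is minor.
is_minor = {
    ("corpus_a", "piece_1", (0.0, 8.0)): False,
    ("corpus_a", "piece_1", (8.0, 20.0)): True,
    ("corpus_b", "piece_2", (0.0, 16.0)): True,
}
ungrouped = {(): list(is_minor)}
by_mode = apply_grouper(ungrouped, is_minor.get)
by_mode_and_corpus = apply_grouper(by_mode, lambda ID: ID[0])
print(sorted(by_mode))
print(sorted(by_mode_and_corpus))
```

After one grouping the keys have length 1, after two groupings length 2, and so on.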

## Applying an Analyzer

Having seen the various ways in which a Dataset can be reshaped, let us look at how these transformations change the result of an Analyzer.
To that end, let's first initialize the `PitchClassVectors` analyzer with the desired configuration:

In [None]:
pcv_analyzer = dc.PitchClassVectors(pitch_class_format='pc', 
                                    weight_grace_durations=0.5, 
                                    normalize=True, 
                                    include_empty=True)

We want to 

* see pitch classes 0-11 (as opposed to the default `tpc`, i.e. tonal pitch classes on the line of fifths),
* include grace notes, which usually have duration 0, by halving their note values,
* normalize the resulting vectors, and
* include zero vectors where no notes occur (i.e. for completely silent segments).
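
The following sketch illustrates what these options mean for a single segment. It is not the analyzer's implementation: the functions and the simplified note format are hypothetical, grace-note weighting is left out, and the conversion assumes the convention that tonal pitch class 0 is C, 1 is G and -1 is F.

```python
def tpc_to_pc(tpc):
    """Map a position on the line of fifths (0 = C, 1 = G, -1 = F) to a pitch class 0-11."""
    return (tpc * 7) % 12


def pitch_class_vector(notes, normalize=True):
    """Sum durations per pitch class.

    notes is a list of (tpc, duration) pairs, a hypothetical simplification
    of a note table. Returns a list of 12 weights indexed by pitch class.
    """
    vector = [0.0] * 12
    for tpc, duration in notes:
        vector[tpc_to_pc(tpc)] += duration
    total = sum(vector)
    if normalize and total > 0:
        vector = [weight / total for weight in vector]
    return vector  # stays all-zero for a silent segment


# C (tpc 0) for 2 quarters, G (tpc 1) and F# (tpc 6) for 1 quarter each.
pcv = pitch_class_vector([(0, 2.0), (1, 1.0), (6, 1.0)])
print(pcv[0], pcv[7], pcv[6])
```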

We start by applying this analyzer to the filtered dataset, in which all pieces are excluded that do not contain annotations:

In [None]:
annotated_pieces_pcvs = pcv_analyzer.process_data(annotated)
annotated_pieces_pcvs.get()

Applying the same analyzer to the Dataset sliced by localkey segments yields one vector per segment:

In [None]:
localkey_segment_pcvs = pcv_analyzer.process_data(localkey_slices)
localkey_segment_pcvs.get()

Applying a `PitchClassVectors` analyzer to the localkey segments that have been grouped by mode does not seem to make much of a difference
(except that this one does not normalize):

In [None]:
grouped_localkey_pcvs = dc.PitchClassVectors(pitch_class_format='pc').process_data(grouped_localkey_slices)
grouped_localkey_pcvs.get()

However, the previous grouping allows us to iterate through the grouped pitch class vectors, e.g. for summing them up for all segments in major and minor respectively:

In [None]:
for (mode,), pcvs in grouped_localkey_pcvs.iter(as_pandas=True):
    print(f"PITCH CLASS PROFILE FOR ALL {'MINOR' if mode else 'MAJOR'} SEGMENTS:")
    summed = pcvs.sum()
    display(summed / summed.sum())

### Analyzing slice infos

In [None]:
lk_per_piece = dc.PieceGrouper().process_data(localkey_slices)
localkeys_per_piece = dc.LocalKeySequence().process_data(lk_per_piece)
localkeys_per_piece.get()

In [None]:
unique_localkeys = dc.LocalKeyUnique().process_data(lk_per_piece)
unique_localkeys.get()