# Tutorial

In order to run this notebook:
* create a new environment and make it visible to Jupyter
  * with conda: `conda create --name {name} python=3.10`
  * activate it and install ipykernel: `pip install ipykernel`
  * register the kernel: `ipython kernel install --user --name={name}`
* within the new environment, install requirements, e.g. `pip install -r requirements.txt`
  * this currently involves installing the current development versions of ms3 and dimcat
* clone the corpus: `git clone --recurse-submodules -j8 https://github.com/DCMLab/dcml_corpora`
* set `meta_repo` in the second cell to the path of your local clone

If the plots are not displayed and you are in JupyterLab, use [this guide](https://plotly.com/python/getting-started/#jupyterlab-support).

In [3]:
import os, random
from git import Repo
import dimcat as dc

In [4]:
meta_repo = "~/dcml_corpora"
repo = Repo(meta_repo)
print(f"{os.path.basename(meta_repo)} @ {repo.commit().hexsha[:7]}")
print(f"dimcat version {dc.__version__}")

dcml_corpora @ e21a5f0
dimcat version 0.2.0.post1.dev64+gda0a036


## The Dataset object

A `Dataset` object represents one or several [DCML corpora](https://github.com/DCMLab/dcml_corpora), depending on which folder(s) you load. By default, DiMCAT will discover all data it can potentially load, but parse only the tabular TSV files, not the scores from which they have been derived.

### Initializing a Dataset

Pass a directory to `dimcat.Dataset.load()` to discover and parse all TSV files. The property `data` simply returns an `ms3.Parse` object.

In [5]:
dataset = dc.Dataset()
dataset.load(directory=meta_repo)
dataset.data

[default|all]
All corpora
-----------
View: This view is called 'default'. It 
	- excludes fnames that are not contained in the metadata,
	- filters out file extensions requiring conversion (such as .xml), and
	- excludes review files and folders.

                               has   active   scores measures           notes        expanded          events          chords       
                          metadata     view detected detected parsed detected parsed detected parsed detected parsed detected parsed
corpus                                                                                                                              
ABC                            yes  default       70       70     70       70     70       70     70        0      0       70     70
beethoven_piano_sonatas        yes  default       87       87     87       87     87       64     64        0      0       87     87
chopin_mazurkas                yes  default       55       55     55       5

### Data is organized in facets and IDs

* As you can see, DiMCAT recognized the subfolders of the meta-repository `dcml_corpora` as individual corpora (rows) and shows the number of detected and parsed files per facet (columns).
* The facets that a DCML dataset includes by default are
  * `measures`: feature matrices of partial and complete measure units that make up a score
  * `notes`: feature matrices of all note heads contained in a score
  * `expanded`: feature matrices of all DCML harmony labels contained in a score
  * (`scores`): since the scores are considered to hold the latest version of the mentioned facets, DiMCAT can extract the information freshly from them; to save RAM, however, it is recommended to keep your facets up to date using the `ms3 extract` command instead
* Internally, DiMCAT addresses pieces using index tuples, also called indices or IDs, that take the form `('corpus_name', 'piece_name')`. The list of pieces currently being analysed is stored under the `.indices` field:

In [6]:
random.sample(dataset.indices, 10)

[('ABC', 'n08op59-2_04'),
 ('medtner_tales', 'op26n03'),
 ('medtner_tales', 'op48n01'),
 ('grieg_lyrical_pieces', 'op68n04'),
 ('grieg_lyrical_pieces', 'op57n06'),
 ('beethoven_piano_sonatas', '14-1'),
 ('beethoven_piano_sonatas', '03-4'),
 ('ABC', 'n01op18-1_04'),
 ('medtner_tales', 'op42n03'),
 ('beethoven_piano_sonatas', '15-2')]
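
Because each ID is a plain `(corpus, piece)` tuple, ordinary Python tools work on the list. The following minimal sketch uses a hand-copied subset of the IDs shown above in place of `dataset.indices` and counts the pieces per corpus:

```python
from collections import Counter

# Hand-copied sample of IDs; with a loaded dataset you would use dataset.indices.
sample_ids = [
    ("ABC", "n08op59-2_04"),
    ("medtner_tales", "op26n03"),
    ("medtner_tales", "op48n01"),
    ("grieg_lyrical_pieces", "op68n04"),
    ("beethoven_piano_sonatas", "14-1"),
    ("ABC", "n01op18-1_04"),
]

# Count how many IDs belong to each corpus (the first tuple element).
pieces_per_corpus = Counter(corpus for corpus, _ in sample_ids)
print(pieces_per_corpus.most_common())
```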

### Accessing a facet
To inspect a facet concatenated over all pieces, call `get_facet()` with the name of the facet:

In [7]:
dataset.get_facet('notes')

corpus,fname,interval,mc,mn,quarterbeats,duration_qb,mc_onset,mn_onset,timesig,staff,voice,duration,...,nominal_duration,scalar,tied,tpc,midi,volta,chord_id,name,octave,tremolo
ABC,n01op18-1_01,"[0.0, 1.0)",1,1,0,1.0,0,0,3/4,3,1,1/4,...,1/4,1,1,-1,53,,12,,,
ABC,n01op18-1_01,"[0.0, 1.0)",1,1,0,1.0,0,0,3/4,4,1,1/4,...,1/4,1,1,-1,53,,18,,,
ABC,n01op18-1_01,"[0.0, 1.0)",1,1,0,1.0,0,0,3/4,1,1,1/4,...,1/4,1,1,-1,65,,0,,,
ABC,n01op18-1_01,"[0.0, 1.0)",1,1,0,1.0,0,0,3/4,2,1,1/4,...,1/4,1,1,-1,65,,6,,,
ABC,n01op18-1_01,"[1.0, 1.5)",1,1,1,0.5,1/4,1/4,3/4,3,1,1/8,...,1/8,1,-1,-1,53,,13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tchaikovsky_seasons,op37a12,"[525.0, 526.0)",176,176,525,1.0,0,0,3/4,2,1,1/4,...,1/4,1,,0,72,,1071,C5,5,
tchaikovsky_seasons,op37a12,"[525.0, 526.0)",176,176,525,1.0,0,0,3/4,2,1,1/4,...,1/4,1,,-3,75,,1071,Eb5,5,
tchaikovsky_seasons,op37a12,"[525.0, 526.0)",176,176,525,1.0,0,0,3/4,1,1,1/4,...,1/4,1,,-4,80,,1070,Ab5,5,
tchaikovsky_seasons,op37a12,"[525.0, 526.0)",176,176,525,1.0,0,0,3/4,1,1,1/4,...,1/4,1,,0,84,,1070,C6,6,


As you can see, the IDs are included as index levels. The third level holds the time interval covered by each event (here, every note head), expressed in quarter notes.
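
To make this index layout concrete, here is a small stand-alone sketch that rebuilds three rows of such a table with plain pandas. The values are copied from the output above; storing the `[0.0, 1.0)` intervals as left-closed `pd.Interval` objects is an assumption made for illustration.

```python
import pandas as pd

# (corpus, fname, interval) index with left-closed intervals in quarter notes.
index = pd.MultiIndex.from_tuples(
    [
        ("ABC", "n01op18-1_01", pd.Interval(0.0, 1.0, closed="left")),
        ("ABC", "n01op18-1_01", pd.Interval(1.0, 1.5, closed="left")),
        ("tchaikovsky_seasons", "op37a12", pd.Interval(525.0, 526.0, closed="left")),
    ],
    names=["corpus", "fname", "interval"],
)
notes = pd.DataFrame(
    {"midi": [53, 53, 72], "duration_qb": [1.0, 0.5, 1.0]},
    index=index,
)

# Select all notes of one piece via its ID (the first two index levels).
piece = notes.loc[("ABC", "n01op18-1_01"), :]

# The length of each interval equals the note's duration in quarter notes.
lengths = [iv.length for iv in notes.index.get_level_values("interval")]
print(piece)
print(lengths)
```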

### Filtering the data to be analyzed

The data overview above shows that the `expanded` facet is not available for all 87 pieces included in the `beethoven_piano_sonatas` corpus (i.e., they have not been annotated with harmony labels), so in some cases we might want to filter pieces (i.e., IDs) from the dataset. This is what the objects in the `dimcat.filter` module do: They iterate through the IDs and exclude those that do not fulfill the filter's criterion.

For example, to filter out those pieces that have not been annotated we instantiate an `IsAnnotatedFilter`:

In [8]:
filter_object = dc.IsAnnotatedFilter()

All DiMCAT objects working on data come with a `.process_data()` method that takes a dataset and returns a processed copy of the dataset. For example, to apply the filter to our dataset:

In [None]:
annotated = filter_object.process_data(dataset)
print(f"Before filtering: {dataset.n_indices} IDs.\nAfter filtering: {annotated.n_indices} IDs.")

	The incomplete MC 96 (timesig 3/4, act_dur 1/4) is completed by 1 incorrect duration (expected: 1/2):
	{97: Fraction(3, 4)}
	The incomplete MC 112 (timesig 3/4, act_dur 1/2) is completed by 1 incorrect duration (expected: 1/4):
	{25: Fraction(3, 4), 113: Fraction(1, 4)}
	The incomplete MC 39 (timesig 3/4, act_dur 1/8) is completed by 1 incorrect duration (expected: 5/8):
	{40: Fraction(3, 4)}
	The incomplete MC 47 (timesig 3/4, act_dur 5/8) is completed by 1 incorrect duration (expected: 1/8):
	{14: Fraction(1, 1), 48: Fraction(1, 8)}
	The incomplete MC 267 (timesig 3/4, act_dur 1/2) is completed by 1 incorrect duration (expected: 1/4):
	{268: Fraction(1, 2)}
	The incomplete MC 269 (timesig 3/4, act_dur 1/2) is completed by 1 incorrect duration (expected: 1/4):
	{270: Fraction(1, 2)}
	The incomplete MC 271 (timesig 3/4, act_dur 1/2) is completed by 1 incorrect duration (expected: 1/4):
	{272: Fraction(1, 2)}
	The incomplete MC 273 (timesig 3/4, act_dur 1/2) is completed by 1 incorrect
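
Conceptually, a filter boils down to a predicate that is applied to each ID. The sketch below is not DiMCAT's implementation: the function `filter_ids`, the lookup `has_annotations`, and the corpus and piece names are hypothetical stand-ins used only to illustrate the idea.

```python
from typing import Callable, List, Tuple

ID = Tuple[str, str]  # (corpus_name, piece_name)


def filter_ids(ids: List[ID], criterion: Callable[[ID], bool]) -> List[ID]:
    """Return a new list containing only the IDs that satisfy the criterion."""
    return [id_ for id_ in ids if criterion(id_)]


# Hypothetical lookup telling us which pieces come with harmony labels.
has_annotations = {
    ("corpus_a", "piece_1"): True,
    ("corpus_a", "piece_2"): False,
    ("corpus_b", "piece_1"): True,
}

all_ids = list(has_annotations)
annotated_ids = filter_ids(all_ids, lambda id_: has_annotations[id_])
print(f"Before filtering: {len(all_ids)} IDs.\nAfter filtering: {len(annotated_ids)} IDs.")
```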

## Applying PipelineSteps to a Dataset

Everything else in DiMCAT is a PipelineStep. The various types are distributed over several modules:

* `dimcat.filter`: Filters return a new Dataset where certain IDs have been removed.
* `dimcat.plotter`: Plotters plot analysis ('processed') data and potentially output plots as files.
* `dimcat.writer`: Writers output analyzed data to disk.

* `dimcat.grouper`: Groupers subdivide each of the current ID groups based on a given criterion and return a new Dataset with an altered `.indices` field.
* `dimcat.slicer`: Slicers create for each ID (read: piece) a set of chunks identified by non-overlapping intervals. Any facet retrieved from such a sliced Dataset will be sliced, cutting and duplicating any event that overlaps the interval boundaries.
* `dimcat.analyzer`: Analyzers perform an analysis on a given Dataset and return a new Dataset with the results stored in the `.processed` field. 

As we have seen with the `IsAnnotatedFilter` above, these PipelineSteps come with the method `process_data()` and return a copy of the given Dataset. Of the first three types listed, filters are the only ones that return a potentially modified copy, whereas plotters and writers return an exact copy.

This is different for groupers, slicers, and analyzers which all transform datasets to the point that they output new datatypes, which are subtypes of `GroupedData`, `SlicedData`, and `AnalyzedData` respectively.
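
The shared contract can be pictured as follows. This is a deliberately simplified sketch with hypothetical classes (`ToyDataset`, `ToyStep`, `ToyFilter`), not DiMCAT's actual class hierarchy; it only illustrates that `process_data()` leaves its input untouched and returns a copy.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: just a list of IDs."""
    indices: List[Tuple[str, str]] = field(default_factory=list)


class ToyStep:
    """Hypothetical base class: every step returns a processed copy."""

    def process_data(self, dataset: ToyDataset) -> ToyDataset:
        # Default behaviour (think plotters and writers): an unchanged copy.
        return copy.deepcopy(dataset)


class ToyFilter(ToyStep):
    """Hypothetical filter that keeps only the IDs of one corpus."""

    def __init__(self, corpus: str):
        self.corpus = corpus

    def process_data(self, dataset: ToyDataset) -> ToyDataset:
        result = super().process_data(dataset)
        result.indices = [id_ for id_ in result.indices if id_[0] == self.corpus]
        return result


original = ToyDataset([("ABC", "n01op18-1_01"), ("medtner_tales", "op26n03")])
filtered = ToyFilter("ABC").process_data(original)
print(len(original.indices), len(filtered.indices))
```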

### Applying a slicer

Having selected only annotated pieces, we can apply the `LocalKeySlicer`, which slices the dataset into segments that each remain in one and the same local key:

In [None]:
localkey_slices = dc.LocalKeySlicer().process_data(annotated)
type(localkey_slices)

The type of the new dataset is a hybrid of `SlicedData` (any data processed by a slicer) and `Dataset`, the particular type representing DCML corpora:

In [None]:
isinstance(localkey_slices, dc.data.SlicedData), isinstance(localkey_slices, dc.Dataset)

This is useful because we can guarantee that sliced data always comes with its two defining fields, the dictionary `.sliced` for storing previously sliced facets and `.slice_info` for storing information about the slices themselves, regardless of the type of the dataset and its idiosyncratic ways of interacting with this slicing information.
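
In plain Python, such a hybrid can be expressed with multiple inheritance. The class names below are hypothetical and only mirror the idea; DiMCAT's real classes may be built differently.

```python
class ToyDataset:
    """Hypothetical dataset type holding piece IDs."""

    def __init__(self, indices=None):
        self.indices = list(indices or [])


class ToySlicedData:
    """Hypothetical mixin guaranteeing the two slicing-related fields."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sliced = {}      # facet name -> previously sliced facet
        self.slice_info = {}  # slice ID -> information about that slice


class ToySlicedDataset(ToySlicedData, ToyDataset):
    """Hybrid: behaves like a dataset and carries the slicing fields."""


sliced = ToySlicedDataset([("ABC", "n01op18-1_01")])
print(isinstance(sliced, ToySlicedData), isinstance(sliced, ToyDataset))
```

Code that only relies on `sliced` and `slice_info` can then accept any object passing the first `isinstance` check, whatever the underlying dataset type.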

Another thing differentiating `SlicedData` from other data is that it has new IDs, which identify slices rather than pieces:

In [None]:
print(f"Before: {annotated.n_indices} IDs, after slicing: {localkey_slices.n_indices} IDs")
print(f"Facets that have been sliced so far: {list(localkey_slices.sliced.keys())}.")

The IDs of the sliced Dataset have multiplied and received a third element, which is the interval specifying the extent of one particular slice in quarter notes. Let's have a look at the first 10 IDs of the `SlicedDataset`:

In [None]:
localkey_slices.indices[:10]

The IDs make sure that all facets retrieved from this Dataset will be sliced.

This is true not only for the facet that has been used for slicing (annotation tables in the present case):

In [None]:
localkey_slices.get_facet('expanded').head(30)

But also for any other facet requested:

In [None]:
localkey_slices.get_facet('notes')

In both cases we see an additional index level `localkey_slice` containing the intervals of the localkey segments. Notes that originally overlapped a localkey boundary are now split in two, with their `duration_qb` values adapted (but not `duration`, which keeps the original value).
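
The splitting of events at slice boundaries can be illustrated independently of DiMCAT. The helper `split_at_boundaries` below is hypothetical; it works on half-open `(start, end)` pairs in quarter notes and uses exact fractions to avoid rounding issues.

```python
from fractions import Fraction
from typing import List, Tuple

Span = Tuple[Fraction, Fraction]  # half-open interval [start, end) in quarter notes


def split_at_boundaries(event: Span, slices: List[Span]) -> List[Tuple[Span, Fraction]]:
    """Cut one event into the parts that fall into each slice.

    Returns a list of (slice, duration_qb_of_the_part) pairs.
    """
    start, end = event
    parts = []
    for s_start, s_end in slices:
        overlap_start = max(start, s_start)
        overlap_end = min(end, s_end)
        if overlap_start < overlap_end:  # the event sounds within this slice
            parts.append(((s_start, s_end), overlap_end - overlap_start))
    return parts


# Two adjacent segments and a half note (2 quarter notes) crossing their border.
segments = [(Fraction(0), Fraction(8)), (Fraction(8), Fraction(20))]
note = (Fraction(7), Fraction(9))
parts = split_at_boundaries(note, segments)
for segment, duration_qb in parts:
    print(segment, duration_qb)
```

The durations of the parts add up to the duration of the original note, which corresponds to the adapted `duration_qb` values described above.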

However, we might be interested only in the slices themselves, so we can get the information stored in the field `slice_info` by calling:

In [None]:
localkey_slices.get_slice_info()[['duration_qb', 'globalkey', 'localkey']]

### Applying a Grouper

If, for example, we want to analyse localkey segments separately depending on whether they are in major or minor, we could apply a `ModeGrouper`, which can only be applied to a Dataset that has already been sliced:

In [None]:
grouped_localkey_slices = dc.ModeGrouper().process_data(localkey_slices)
grouped_localkey_slices.get_slice_info()[['duration_qb', 'globalkey', 'localkey']]

The grouping is displayed as the prepended index level `localkey_is_minor`. In this case the groups are simply called `True` or `False`, as can be seen by inspecting the `.indices` dictionary. The keys are tuples whose lengths match the number of applied Groupers so far.

In [None]:
list(grouped_localkey_slices.indices)
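
The structure of such a grouped index can be mimicked with a plain dictionary. The sketch assumes, following the DCML convention, that a local key written as a lowercase Roman numeral is minor; the slice IDs and key labels are made up for illustration, and `is_minor` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up slice IDs (corpus, piece, (start, end)) with their local key labels.
slice_localkeys = {
    ("corpus_a", "piece_1", (0.0, 32.0)): "I",
    ("corpus_a", "piece_1", (32.0, 48.0)): "vi",
    ("corpus_a", "piece_2", (0.0, 16.0)): "i",
    ("corpus_a", "piece_2", (16.0, 40.0)): "bIII",
}


def is_minor(localkey: str) -> bool:
    """Assumption: lowercase Roman numerals denote minor keys."""
    return localkey.lstrip("#b").islower()


# Group the IDs under 1-tuples: one tuple element per applied grouper.
groups = defaultdict(list)
for slice_id, localkey in slice_localkeys.items():
    groups[(is_minor(localkey),)].append(slice_id)

for key, ids in groups.items():
    print(key, len(ids))
```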

## Applying an Analyzer

Having seen the various ways in which a Dataset can be reshaped, let us look at how these transformations change the result of an analyzer.
To that end, let's first initialize the `PitchClassVectors` analyzer with the desired configuration:

In [None]:
pcv_analyzer = dc.PitchClassVectors(pitch_class_format='pc', 
                                    weight_grace_durations=0.5, 
                                    normalize=True, 
                                    include_empty=True)

We want to 

* see pitch classes 0-11 (as opposed to the default `tpc`, i.e. tonal pitch classes on the line of fifths),
* include grace notes, which usually have duration 0, by halving their note values,
* normalize the resulting vectors, and
* include zero vectors where no notes occur (i.e. for completely silent segments).
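
What such a configuration amounts to can be sketched without DiMCAT. The functions `pitch_class_vector` and `tpc_to_pc` below are hypothetical illustrations, not the analyzer's implementation: the first derives pitch classes from MIDI numbers, weights each note by its duration in quarter notes, and normalizes the result; the second converts a tonal pitch class to a pitch class, assuming C = 0 on the line of fifths (which matches the `tpc` and `midi` columns shown earlier). Grace-note weighting is left out.

```python
from typing import Iterable, List, Tuple


def tpc_to_pc(tpc: int) -> int:
    """Convert a position on the line of fifths (assuming C = 0) to a pitch class 0-11."""
    return (tpc * 7) % 12


def pitch_class_vector(notes: Iterable[Tuple[int, float]], normalize: bool = True) -> List[float]:
    """Sum durations per pitch class (0 = C, ..., 11 = B) for (midi, duration_qb) pairs."""
    vector = [0.0] * 12
    for midi, duration_qb in notes:
        vector[midi % 12] += duration_qb
    total = sum(vector)
    if normalize and total > 0:  # an all-zero vector (silence) stays all zero
        vector = [value / total for value in vector]
    return vector


# The last four rows of the notes table above: C5, Eb5, Ab5, C6, one quarter note each.
last_notes = [(72, 1.0), (75, 1.0), (80, 1.0), (84, 1.0)]
pcv = pitch_class_vector(last_notes)
print(pcv)

# Their tpc values (0, -3, -4, 0) map onto the same pitch classes as midi % 12.
print([tpc_to_pc(tpc) for tpc in (0, -3, -4, 0)])
```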

We start by applying this analyzer to the filtered dataset, in which all pieces are excluded that do not contain annotations:

In [None]:
annotated_pieces_pcvs = pcv_analyzer.process_data(annotated)
annotated_pieces_pcvs.get_results()

Applying the same analyzer to the Dataset sliced by localkey segments yields one vector per segment:

In [None]:
localkey_segment_pcvs = pcv_analyzer.process_data(localkey_slices)
localkey_segment_pcvs.get_results()

Applying a `PitchClassVectors` analyzer to the localkey segments that have been grouped by mode does not seem to make much of a difference
(except that this one does not normalize):

In [None]:
grouped_localkey_pcvs = dc.PitchClassVectors(pitch_class_format='pc').process_data(grouped_localkey_slices)
grouped_localkey_pcvs.get_group_results()

In [None]:
grouped_localkey_pcvs.get_results()

However, the previous grouping allows us to iterate through the grouped pitch class vectors, e.g. for summing them up for all segments in major and minor respectively:

In [None]:
for (mode,), pcvs in grouped_localkey_pcvs.iter_group_results():
    print(f"PITCH CLASS PROFILE FOR ALL {'MINOR' if mode else 'MAJOR'} SEGMENTS:")
    display(pcvs / pcvs.sum())
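
One plausible reason for leaving the vectors unnormalized before summing can be seen in a small stand-alone sketch with made-up numbers: summing absolute durations and normalizing once at the end weights long segments more heavily than short ones, whereas averaging already normalized vectors lets every segment count equally.

```python
# Made-up absolute pitch-class durations (in quarter notes) for two segments,
# restricted to three pitch classes for brevity.
long_segment = {0: 30.0, 4: 20.0, 7: 10.0}  # 60 quarter notes in total
short_segment = {0: 1.0, 4: 1.0, 7: 2.0}    # 4 quarter notes in total


def normalize(vector):
    """Divide every entry by the sum of all entries."""
    total = sum(vector.values())
    return {pc: value / total for pc, value in vector.items()}


# Sum first, normalize once: the long segment dominates the profile.
summed = {pc: long_segment[pc] + short_segment[pc] for pc in long_segment}
profile = normalize(summed)

# Normalize first, then average: both segments count equally.
norm_long, norm_short = normalize(long_segment), normalize(short_segment)
averaged = {pc: (norm_long[pc] + norm_short[pc]) / 2 for pc in long_segment}

print(profile)
print(averaged)
```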

### Analyzing slice infos

In [None]:
lk_per_piece = dc.PieceGrouper().process_data(localkey_slices)
localkeys_per_piece = dc.LocalKeySequence().process_data(lk_per_piece)
localkeys_per_piece.get_results()

In [None]:
unique_localkeys = dc.LocalKeyUnique().process_data(lk_per_piece)
unique_localkeys.get_results()