vignettes/Context.Rmd

---
title: "Contextualizing humdrum data"
author: "Nathaniel Condit-Schultz"
date:   "July 2022"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Contextualizing humdrum data}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: sentence
---

```{r, include = FALSE, message=FALSE, echo = FALSE }
source('vignette_header.R')
```

Welcome to "Contextualizing humdrum data"!
This article explains how `r hm` can be used contextualize musical data.
When analyzing musical data, we often treat each and every data token as a separate, independent "data point."
However, in many cases, we want to consider data points *in context*---what other data points are nearby in the data?
Since the [humdrum syntax](HumdrumSyntax.html) encodes data in temporal order, the "context" usually means either "what is happening before or after this data point?" or "what else is happening at the same time as this this data point?"
`r Hm` provides a number of ways of analyzing data "in context."

This article, like all of our articles, closely parallels information in `r hm`'s detailed code documentation, which can be found in the "[Reference](https://humdrumr.ccml.gtcmt.gatech.edu/reference/index.html#reading-and-writing#manipulating-humdrum-data "HumdrumR function reference, Manipulating humdrum data")" section of the `r hm` [homepage](humdrumR.ccml.gtcmt.gatech.edu).
You can also find this information within R, once `r hm` is loaded, using `?context`, `group_by`, or `?withinHumdrum`.


# Groupby 

The most conventionally "R-style" way to look at data context is using R/`r hm`'s various "group by" options.
This functionality is described elsewhere, for example in the [Working With Data](WorkingWithData.html) article, and in the `within.humdrumR()` man page.
Group-by functionality isn't necessarily connected to temporal context: you can, for instance, group together all the data from each instrument in an ensemble, across all songs in a dataset---this is useful, but non-temporal-context.
If you want to use `groupby` to get more temporal context, here are a few good options:

### Group by Record

All `r hm` data has a `Record` field, indicating data points that occur at the same time. 
Using `group_by()` we can perform calculations grouped by record.
Let's load our trusty Bach-chorale data:

```{r message=FALSE}
chorales <- readHumdrum(humdrumRroot, 'HumdrumData/BachChorales/.*krn')
```

Let's count how many new note onsets (*not* rests) occur on each record of the data, and tabulate the results.
We'll look for tokens that *don't* contain an `r` (for rest), using a regular-expression match `%~% 'r'` and the negating bang (`!`).

```{r, echo = 6:9}
chorales |>
  group_by(Piece, Record) |>
  with(sum(!Token %~% 'r')) |> 
  count() -> counts

chorales |>
  group_by(Piece, Record) |>
  with(sum(!Token %~% 'r')) |>
  count() 
```

Most records (`r table(counts)['4']` out of `r sum(counts)`) have new notes in all four voices---no surprise for choral music---with *one* onset being the next most common case (`r table(counts)['1']` records).
Note that, when we called `group_by()`, we included `Piece` *and* `Record`.
Why? Because each piece in the data set repeats the same record numbers---you wouldn't want to count the 17th record from the 3rd chorale together with the 17th record from the 7th chorale, for example.
When using `group_by()`, you'll almost always want to have `Piece` as a grouping factor.

### Group by Bar

Most humdrum data includes bar lines, indicated by `=`.
When `r hm` reads data (`?readHumdrumR`) data, it will look at these barlines, count them in each file, and put that count into a field called `Bar` (`?fields`).
You can then use this data to group data by bar.
Remember, that if your data set has no `=` tokens, the `Bar` field will be nothing but useless `0`s.

What if we wanted to know what the lowest note in each bar of the music is?
Let's first extract `semits()` data, so it is easy to get the lowest value:

```{r}
chorales |>
  mutate(Semits = semits(Token),
         Kern = kern(Token)) -> chorales

```

We can now group bars (within pieces, once again!), get the minimum (using `which.min()`), and tabulate them.

```{r, echo = 6:9}
chorales |>
  group_by(Piece, Bar) |>
  with(Kern[which.min(Semits)]) |>
  count() |> table() -> lownotes

chorales |>
  group_by(Piece, Bar) |>
  with(Kern[which.min(Semits)]) |>
  count()


```

The highest lowest-note-in-bar is `r names(lownotes)[max(which(lownotes > 0))]`.
The most common lowest-note-in-bar is `r names(lownotes)[which.max(lownotes)]`.

---

For further analysis, we might want to save the lowest-note into a new field, so we'll use `mutate()`.
By default, the `mutate()` function will [recycle][recycling] any scalar (length one) result throughout its group,
which is usually what we want:


```{r}

chorales |>
  group_by(Piece, Bar) |>
  mutate(LowNote = min(Semits)) |>
  ungroup() -> chorales

chorales

```

(Note that we use `ungroup()` before overwriting `chorales`, so that grouping doesn't affect our future analyses.)
We could then, for example, subtract the bar's lowest note from each note:

```{r}
chorales |>
  mutate(Semits - LowNote) 


```


### Group by Beat

Another useful contextual group is the beat.
There will be no automatic "beat" field in our data, so we'll need to make one---obviously, we need to have data with rhythmic information encoded (like `**kern` or `**recip`) to do this!
We can use the `timecount()` function to count the beats in our data. 
We could count quarter-notes by setting `unit = '4'`; alternatively, if our data includes time-signature interpretations (like `*M3/4`) we could use the `TimeSignature` field to get the tactus of the meter. 
Let's try the later and save the result to a new field, which we'll call `Beat` (or anything else you want).

```{r}
chorales |> select(Token) |>
  mutate(Beat = timecount(Token, unit = tactus(TimeSignature))) -> chorales


```

We can now group by beat (and piece, of course).
Maybe we want to know the range of notes in each beat:

```{r}

chorales |>
  group_by(Piece, Beat) |> 
  with(diff(range(Semits))) |>
  draw(xlab = 'Pitch range within beat')


```


> Note: If your data includes spine paths, you'll want to set `mutate(timecount(Token), expandPaths = TRUE, ...)`.
> Count (and similar functions, like `timeline()`) won't work correctly without paths expanded (`?expandPaths`).

# Contextual Windows

When you use `group_by()`, your data is *exhaustively* partitioned into *non-overlapping* groups; the groups are also not necessarily ordered---we have to explicitly use (temporally) ordered groups like `group_by(Piece, Record)` if we want temporal context.
In contrast, the `context()` function, when used in combination with `with()` or `mutate()`, gives us much more flexible control over the context we want for our data.

`context()` takes an input vector and treats it as an ordered sequence of information.
It then identifies arbitrary contextual windows in the data, based on your criteria.
The windows created by `context()` are always sequentially ordered, aren't necessarily exhaustive (some data might not fall in any window), and can *overlap* (some data falls in multiple windows).

We use `context()` by telling indicating when to "open" (begin) and "close" (end) contextual windows, using the `open` and `close` arguments.
Context can use many criteria to open/close windows.
In the following sections, we layout just a few examples; we'll use our chorale data again, as well as the built-in Beethoven variation data:

```{r message=FALSE}
chorales <- readHumdrum(humdrumRroot, 'HumdrumData/BachChorales/.*krn')
beethoven <- readHumdrum(humdrumRroot, 'HumdrumData/BeethovenVariations/.*krn')
```

For most examples, we'll use `within()` and run the command `paste(Token, collapse = '|')`.
This will cause all the tokens within a contextual group to collapse to single string, separated by `|`.
This is the simplest/fastest way to *see* what `context()` is doing!


## Regular windows

In some cases, we simply want to progress through data and open/close windows at regular locations.
We can do this using the `hop()` function.
(`hop()` is a `r hm` function which is very similar to R's base `seq()` function; however, `hop()` has some special features and, more importantly, gets special treatment from `context()`).
Maybe we want to open a four-note window on every note:

```{r}
chorales |>
  context(hop(), open + 3) |>
  within(paste(Token, collapse = '|'))
  
```

Cool, we've got overlapping four-note windows, running through each spine!
How did we do this?

### Regular Open

The first argument to `context()` is the `open` argument;
we give `open` a set of indices (natural numbers) where to open windows in the data.
`hop()` simply generates a sequence of numbers "along" the input data---by default, the "hop size" is `1`, but we can change that.
For example:

```{r}
hop(letters, by = 1)

hop(letters, by = 2)

```

When we use `hop()` inside `context()`, it automatically knows what the input vector(s) (data fields) to hop along are.
So now check this out:

```{r}

chorales |>
  context(hop(2), open + 3) |>
  within(paste(Token, collapse = '|'))

```

By saying `hop(2)`, our windows only open on every *other* index.
If we don't want our four-note windows to overlap at all, we could set `hop(4)`:

```{r}
chorales |>
  context(hop(4), open + 3) |>
  within(paste(Token, collapse = '|'))

```

Note that the `by` argument can be a vector of hop-sizes, which will cause `hop()` to generate a sequence of irregular hops!
You can also use `hop()`'s `from` and `to` arguments to control when the first/last windows occur, etc.
When using `to`, you can refer to another special variable---`end`---which `context()` will interpret as the last index.
So you could, for example, say `hop(from = 5, to = end - 5)`.

### Fixed Close

So `hop()` is telling `context()` when to open windows; how are we telling it when to close windows?
The second argument to `context()` is the `close` argument.
Like `open`, the `close` argument should be a set of natural-number indices.
However, the cool thing is that the `close` argument can *refer to* the `open` argument.
So rather than manually figuring out what index would go with each `open` (don't try, because the multiple spines etc. make it confusing), we simply say `open + 3`.
If we want five-note windows, we can say `open + 4`.

----

What if we want to the window to simply close before the next opening? 
We can do this by having the `close` argument refer to the `nextopen` variable.
So instead of saying `open + 3` we say `nextopen - 1L`!

```{r}
chorales |>
  context(hop(4), nextopen - 1) |>
  within(paste(Token, collapse = '|'))
```

Now we'll get exhaustive windows no matter where we open the windows.
For example, we can change the hop size and still get exhaustive windows!

```{r}
chorales |>
  context(hop(6), nextopen - 1) |>
  within(paste(Token, collapse = '|'))
```

We could also close windows at `nextopen - 2` or `nextopen + 1`---whatever we want!


### Flip it around

We don't have to define `open` first and have `close` refer to open---we can also do the opposite!

```{r}
chorales |>
  context(close - 3, hop(4, from = 4)) |>
  within(paste(Token, collapse = '|'))

```

We've got the exact same result by telling the window closes to hop along regularly (every four indices) and having the `open` argument refer to the closing indices (`close - 3`).
The `open` argument can also refer to the `prevclose` (previous close).
Notice that our output windows are still "placed" to align with the opening---we can set `alignLeft = FALSE` in our call to `within.humdrumR()`, if we want the output to align with the close:

```{r}
chorales |>
  context(close - 3, hop(4, from = 4)) |>
  within(paste(Token, collapse = '|'), 
         alignLeft = FALSE)
```

----

Note that these regular windows we are creating are examples of N-grams.
`r Hm` also defines another approach to defining N-grams which will generally be faster than using `context()`---this alternative approach is described in the last section of this article.


## Irregular Windows

The regular windows we created in the previous section are useful, but `context()` can do a *lot* more.
You can tell `context()` to open, or close, windows based on arbitrary criteria!
For example, let's say you want to open a window any time the leading tone occurs, and stay open until the next tonic.
To do this, let's get the `solfa()` data into a `Solfa` field:

```{r}
chorales |>
  solfa(Token, simple = TRUE) -> chorales

```

Alright, the easy thing to do here is to give `context()`'s `open`/`close` arguments `character` strings, which are matched as regular expressions against the active field:

```{r}
chorales |>
  select(Solfa) |>
  context('ti', 'do') |>
  with(paste(Solfa, collapse = '|'))
```


Pretty cool! 
But wait, something seems odd; one of the outputs is `"ti-do-fa-re-so-so-fa-so-la-re-so-fa-mi-re-di-re-ti-do"`.
Why doesn't this window close when it hits that first "do"?
This has to do with `context()`s treatment of overlapping windows, which is controlled by the `overlap` argument.
By default, `overlap = 'paired'`, which means `context()` attempts to pair each open with the next *unused* close---the reason we don't close on the first "do" in `"ti-do-fa-re-so-so-fa-so-la-re-so-fa-mi-re-di-re-ti-do"`, is because the "do" was *already* the close of the previous window.
For this analysis, we might want to try `overlap = 'none'`: with this argument, a new window will only open after the current window is closed.

```{r}
chorales |>
  select(Solfa) |>
  context('ti', 'do', overlap = 'none') |>
  with(paste(Solfa, collapse = '|'))
```

Another option would be to allow multiple windows to close at the same place (i.e., on the same "do").
This can be achieved with the setting `overlap = 'edge'`:


```{r}
chorales |>
  select(Solfa) |>
  context('ti', 'do', overlap = 'edge') |>
  with(paste(Solfa, collapse = '|'))
```


----


If this is a lot to wrap your head around, you are not the only one!
There are many ways to define/control how contextual windows are defined, and it's often difficult to decide what we want in a particular analysis.
You can read the `context()` documentation for some more examples, or simply play around!
The following sections layout a few more examples, just to illustrate some possibilities.


### More criteria

What if want to have our windows close on tonic (do), but only on long-durations?
Let's extract duration information into a new field:

```{r}
chorales |>
  select(Token) |>
  duration(Token) -> chorales

```

We could now do something like this:


```{r}
chorales |>
  select(Solfa) |>
  context('ti', 
          Solfa == 'do' & Duration >= .5, 
          overlap = 'edge') |>
  with(paste(Solfa, collapse = '|'))

```

Notice that, because our `close` expression is more complicated now, I had to explicitly say `Solfa == 'do'` instead of using the shortcut of just providing a single string.

### Open until next/previous


We can use what we learned above about the `nextopen` and `nextclose` variables to make windows open/close at matches.
For example, we could have windows close every time there is a fermata (`";"` token in `**kern`) in the data, but *open* again immediately after each fermata:


```{r}
chorales |>
  select(Token) |>
  context(prevclose + 1, ';') |>
  with(paste(Token, collapse = '|'))

```

There is an issue here: the first fermata in the data doesn't get paired with anything, because there is no "previous close" before it.
This will happen whenever you use `nextopen` or `prevclose`!
You can fix this by explicitly adding an opening window at `1`:


```{r}
chorales |>
  select(Token) |>
  context(1 | prevclose + 1, ';') |>
  with(paste(Token, collapse = '|'))

```

By using the `|` (or) command in `context()`, we are saying open a window at `1` *or* at `prevclose + 1`.
When working with `nextopen` you might want to use the special `end` argument (only available inside `context()`), which is the last index in the input vector.


### Semi-fixed windows

In some cases, you might want to have windows open (or close) at a fixed interval, but close based on something irregular.
We can do this easily by combining what we've already learned.
For example, we could open a window on every third index, but close only when we see a fermata.
We'll want to use `overlap = 'edge'` again.

```{r}

chorales |> 
  select(Token) |>
  context(hop(4), ';', overlap = 'edge') |>
  with(paste(Token, collapse = '|'))

```

### Slurs

A common case of contextual information in musical scores are slurs, which are used in to indicate articulation (e.g., bowing) and phrasing information.
In `**kern`, slurs are indicated with parentheses, like `(` or `)`.
To see some examples, let's look at our `beethoven` dataset, which we loaded above. 
We will start by removing multi-stops (which would make this much more complicated) and extracting only the `**kern` data.

```{r}

beethoven |>
  filter(Exclusive == 'kern' & Stop == 1) |> 
  removeEmptySpines() |> 
  removeEmptyStops() -> beethoven
```

We can see parentheses used to indicate slurs in the piano parts.
Let's say we want to get the length of all these slurred groups:

```{r}
beethoven |>
  context('(', ')') |>
  with(length(Token)) |> 
  count()

```


Most of the slurs are only 2, 3, or 4 notes.
But there is one that is 13! I wonder where that is?

```{r}
beethoven |>
  context('(', ')') |> 
  mutate(SlurLength = length(Token)) |>
  uncontext() |>
  group_by(File, Bar, Spine) |>
  select(Token) |> 
  filter(any(SlurLength == 13)) 

```

There it is!


---

Ok, what if we want to collapse our slurred notes together, like we've been doing throughout this article?

```{r}
beethoven |> 
  context('(', ')', overlap = ) |>
  within(paste(Token, collapse = '|'))

```

That worked...but we lost all the *unslurred* notes.
We can recover these tokens when we call `uncontext()`.
Normally, `uncontext()` just removes contextual windows from your data, which doesn't actually change any data fields.
However, if you provide a `complement` argument, which must refer to an existing field, that "complement" field will be filled
into and the currently selected field, wherever no contextual window was defined.
(This behavior is similar to the `complement` argument of `unfilter()`.)


```{r}

beethoven |> 
  context('(', ')', overlap = ) |>
  within(paste(Token, collapse = '|')) |>
  uncontext(complement = 'Token')
```


### Nested windows

In some cases, we might have contextual windows "nested" inside each other.
For example, slurs in sheet music might overlap to represent fine gradations in articulation.
The `context()` funciton can handle nested windows by setting `overlap = 'nested'`.
Here is an example file we can experiment with:

```{r}
nested <- readHumdrum(humdrumRroot, 'examples/Phrases.krn')
nested
```

We've got nested slurs. Let's try `context(..., overlap = 'nested')`:

```{r}
nested |>
  context('(', ')', overlap = 'nested') |>
  with(paste(Token, collapse = '|'))

```

Very good! We get all our windows, including the nested ones.
(Look how the result differs if you set `overlap = 'paired'`.)
But what if we only want the topmost or bottommost slurs?
Use the `depth` argument: `depth` should be one or more non-zero integers, indicating how deeply nested you want your windows to be.
`depth = 1` would be the "top" (unnested) layer, `2` the next-most nested, etc.
You can also use negative numbers to start from the most nested and work backwards: `-1` is the most deeply nested layer, `-2` the second-most deeply nested, etc.
Finally, you can specify more than one depths by making depth vector, like `depth = c(1,2)`.

```{r}

nested |>
  context('(', ')', overlap = 'nested', depth = 1) |>
  with(paste(Token, collapse = '|'))

nested |>
  context('(', ')', overlap = 'nested', depth = 2) |>
  with(paste(Token, collapse = '|'))

nested |>
  context('(', ')', overlap = 'nested', depth = 2:3) |>
  with(paste(Token, collapse = '|'))

nested |>
  context('(', ')', overlap = 'nested', depth = -1) |>
  with(paste(Token, collapse = '|'))


```

# N-grams

In the previous section, we saw that the `context()` function can be used to create n-grams (and so much more).
`r Hm` also offers a different, lag-based, approach to doing n-gram analyses.
The lag-based approach is more fully vectorized than `context()` which makes it extremely fast, but also less general purpose.
Depending on what you are doing with you n-grams, `context()` may be the only way that works---basically, if you want to apply an expression separately to each every n-gram, you need to use `context()`.

The idea of lag-based n-grams can be demonstrated quite simply using the `letters` vector (built in to R) and the `lag(n)` command.
The `lag(n)` command "shifts" a vector over by `n` indices:

```{r}
cbind(letters, lag(letters), lag(letters, n = 2))

```

What happened here? We give the `cbind()` function three separate arguments: 1) the normal `letters` vector; 2) `letters` lagged by 1; 3) `letters` lagged by 3.
These three arguments are bound together into a three-column `matrix`.
We can do the same thing with `paste()`:

```{r}
paste(letters, lag(letters), lag(letters, n = 2))

```

We made three-grams!
This approach, if used with fully-vectorized functions will be extremely fast, even for large datasets.

## Lag within humdrumR

The [with/dplyr][?withHumdrum] functions allow you to create lagged vectors in a special, concise way.
Let's work again with the chorales, using just simple `kern()` data:

```{r message=FALSE}
chorales <- readHumdrum(humdrumRroot, 'HumdrumData/BachChorales/.*krn')

chorales |>
  kern(simple = TRUE) -> chorales

```

When using `with()`/`mutate()`/etc., instead of writing `lag(x, n = 1)`, we can write `x[lag = 1]`.
We can then paste a field (like `Kern`) to itself *lagged*, like this:

```{r}
chorales |>
  within(paste(Kern, Kern[lag = 1], sep = '|'))


```

We can use negative or positive lags, depending on how we want the n-grams to line up:

```{r}
chorales |>
  within(paste(Kern, Kern[lag = -1], sep = '|'))


```

An important point! `with()`/`mutate()`/etc. will automatically group lagged data by `list(File, Spine, Path)`, so the n-grams won't cross
from the end of one file/spine to the beginning of the next, etc.

----

`paste()` isn't the only vectorized function we might want to apply to lagged data.
Another common example would be `count()`:

```{r}
chorales |>
  with(count(Kern, Kern[lag = -1]))

```

We get a transition matrix!

## Larger N

For functions that accept unlimited arguments (`...`), like `paste()`, and `count()`, you can easily extend the principle to create longer n-grams:

```{r}
chorales |>
  within(paste(Kern, Kern[lag = -1], Kern[lag = -2], sep = '|'))

```

But there's an even better way! Simply give that `lag` argument a *vector* of lags!
In fact, `lag = 0` spits out the unlagged vector, so you can do it all in a single index command:

```{r}
chorales |>
  within(paste(Kern[lag = 0:-2], sep = '|'))

```

Let's create 10-grams, and see what the most frequent 10-grams are:

```{r}
chorales |>
  with(paste(Kern[lag = 0:-9], sep = '|')) |> 
  table() |> 
  sort() |> 
  tail(n = 10)

```

That's not what we want!
When you do lagged n-grams, the first and last n-grams get "padded" with `NA` values.
We can use the R function `grep(invert = TRUE, value = TRUE)` to get rid of these:


```{r}
chorales |>
  with(paste(Kern[lag = 0:-9], sep = '|')) |> 
  grep(pattern = 'NA', invert = TRUE, value = TRUE) |> 
  table() |> 
  sort() |> 
  tail(n = 10)


```

That still doesn't seem right, does it?
Actually, it is right: in these 10 chorales, there are no 10-gram pitch patterns that occur more than once!
Let's try a 5-gram instead:

```{r}
chorales |>
  with(paste(Kern[lag = 0:-4], sep = '|')) |> 
  grep(pattern = 'NA', invert = TRUE, value = TRUE) |> 
  table() |> 
  sort() |> 
  tail(n = 10)

```

Now we see a couple of n-grams (like `d e d c b`) that occur more often.