## 0. Import Python Libraries

An essential first step . . .


In [67]:
# crim intervals
import crim_intervals
import crim_intervals.visualizations as viz
import crim_intervals.corpus_tools as corpus_tools
from crim_intervals import CorpusBase, ImportedPiece, importScore
from crim_intervals.corpus_tools import corpus_notes, corpus_mel

# standard library
import glob
import os
import re
import warnings
from copy import deepcopy

# data analysis
import pandas as pd
from itertools import combinations

# visualization
import altair as alt
import plotly.express as px
import plotly.io as pio

# network analysis
import networkx as nx
from community import community_louvain

from pyvis.network import Network

# create directories if needed
for dir_name in ('saved_csv', 'MEI'):
    if not os.path.isdir(dir_name):
        os.makedirs(dir_name)

# configure plotting for Quarto
alt.renderers.enable('default')
pio.renderers.default = "plotly_mimetype+notebook_connected"

# suppress warnings
warnings.filterwarnings('ignore')


### 0.1 Helper Functions and Sorting List

- These will help us sort and regularize note names

In [68]:
# DO NOT EDIT!

# sorting intervals
def extract_descending(interval):
    match = re.search(r'-?\d+', interval)
    return int(match.group())

def extract_and_sort_reverse(items):
    # Function to extract numeric part
    def extract_num(item):
        match = re.search(r'[-+]?\d+', item)
        return int(match.group()) if match else 0
    
    # Extract numbers and sort
    sorted_neg_items = sorted(items, key=extract_num, reverse=True)
    return sorted_neg_items

def extract_and_sort_forward(items):
    # Function to extract numeric part
    def extract_num(item):
        match = re.search(r'[-+]?\d+', item)
        return int(match.group()) if match else 0
    
    # Extract numbers and sort
    sorted_items = sorted(items, key=extract_num, reverse=False)  
    return sorted_items


# set the custom order for pitches
pitch_order = ['Rest',
               'A#1', 'B1', 
               'C2', 'C#2', 'D2', 'D#2','E-2', 'E2', 'E#2', 'F2', 'F#2', 'G-2', 'G2', 'G#2', 'A-2', 'A2', 'A#2','B-2', 'B2', 'B#2',
               'C3', 'C#3', 'D-3','D3', 'D#3', 'E-3','E3', 'E#3', 'F3', 'F#3', 'G-3', 'F##3', 'G3', 'G#3', 'A-3', 'A3', 'A#3', 'B-3','B3', 'B#3',
               'C4', 'C#4', 'D-4','D4', 'D#4','E-4', 'E4', 'F-4', 'E#4', 'F4', 'F#4', 'G-4', 'F##4', 'G4', 'G#4', 'A-4','A4', 'A#4', 'B-4', 'B4', 'B#4',
               'C5', 'C#5','C##5', 'D-5','D5', 'D#5', 'E-5','E5', 'F-5','E#5','F5', 'F#5', 'G-5', 'F##5','G5', 'G#5', 'A-5', 'A5', 'A#5', 'B-5', 'B5',
              'C6']

pitch_class_order = ['C', 'C#', 'Db','D', 'D#', 'Eb', 'E', 'Fb', 'E#', 'F', 'F#', 'Gb', 'F##', 'G', 'G#', 'Ab','A', 'A#', 'Bb', 'B', 'B#']
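
The explicit order lists matter because plain alphabetical sorting does not follow pitch height: as strings, `'C#4'` sorts before `'C4'`, and `'Rest'` lands at the end. The sketch below shows one way to sort labels by their position in such a list. The shortened `demo_order` and the `found` list are made up for illustration; the plotting cells later pass the full lists to Plotly via `category_orders` instead.

```python
# Hypothetical, shortened order list; the notebook's pitch_order is much longer.
demo_order = ['Rest', 'B-3', 'B3', 'C4', 'C#4', 'D4']

# Map each label to its position so the position can serve as a sort key.
rank = {name: position for position, name in enumerate(demo_order)}

# Labels in the order they might come out of a dataframe.
found = ['D4', 'C4', 'Rest', 'B-3', 'C#4']

# Labels missing from the order list are pushed to the end.
sorted_found = sorted(found, key=lambda name: rank.get(name, len(rank)))
print(sorted_found)
```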


# 1. Introduction:  All About CRIM Intervals

In this notebook we will explore our contrapuntal duos with **CRIM Intervals**, a Python library developed especially for analyzing 'contrapuntal' music like the duos we have been considering lately. It leverages the power of Mike Cuthbert's excellent Python library [music21](https://web.mit.edu/music21/) (which takes care of all the headaches of finding notes and intervals), but brings the results into Pandas, and thus allows us to turn scores into tabular data. Learn more about [CRIM Intervals](https://github.com/HCDigitalScholarship/intervals).

### Notes and Durations
- **piece.notes()**, which finds all the notes and rests in a score, with a tabular score-like representation of the pitches, pitch classes, and durations (expressed in music21 "offsets", in which each quarter note corresponds to the value of 1.0). The location of any note can also be reported as a measure+beat reference with `piece.detailIndex()`.
- **piece.durations()**, with quarter-note = 1.0, as per music21.  

Learn more about [notes](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/02_Notes_Rests.md) and about [durations](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/03_Durations.md)
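
To make the relationship between offsets and measure+beat references concrete, here is a minimal sketch of the arithmetic. It assumes a constant 4/4 meter with no pickup measure and a quarter-note beat; `offset_to_measure_beat` is a hypothetical helper written for this illustration, not part of CRIM Intervals, and `piece.detailIndex()` remains the tool to use on real scores, where meters can change.

```python
def offset_to_measure_beat(offset, beats_per_measure=4.0):
    """Convert a zero-based quarter-note offset to a (measure, beat) pair.

    Assumes every measure holds the same number of quarter-note beats
    and that the piece starts on the downbeat of measure 1.
    """
    measure = int(offset // beats_per_measure) + 1   # measures are one-based
    beat = offset % beats_per_measure + 1            # beats are one-based
    return measure, beat

print(offset_to_measure_beat(0.25))    # second sixteenth of the piece
print(offset_to_measure_beat(83.75))   # a late event in a 21-measure piece
```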

### Melodic and Harmonic Intervals
- **piece.melodic()**, which finds melodic intervals in any voice part, with various options for diatonic, chromatic and zero-based distances. Intervals can be compound (distinguishing between tenths and thirds, for instance), or simple, and can include quality (distinguishing major and minor thirds, for instance), or not.
- **piece.harmonic()**, which finds harmonic intervals between every combination of two voices in a piece, with various options for diatonic and chromatic. These intervals can also be directed (as when a tenor voice sounds above the altus), or not.

Learn more about [melodic intervals](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/06_Melodic_Intervals.md) and [harmonic intervals](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/07_Harmonic_Intervals.md)
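
The difference between diatonic and chromatic distances can be sketched by hand. The code below is an independent illustration, not the CRIM Intervals or music21 implementation: it assumes music21-style note names (`-` for flat, `#` for sharp, followed by an octave number) and reports directed, compound intervals. The helper names `parse`, `diatonic`, and `chromatic` are made up for this example.

```python
import re

STEP_INDEX = {'C': 0, 'D': 1, 'E': 2, 'F': 3, 'G': 4, 'A': 5, 'B': 6}
SEMITONES = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def parse(note):
    """Split a name such as 'B-3' or 'F#4' into letter, accidental shift, octave."""
    letter, accidentals, octave = re.fullmatch(r'([A-G])([#-]*)(\d+)', note).groups()
    shift = accidentals.count('#') - accidentals.count('-')
    return letter, shift, int(octave)

def diatonic(first, second):
    """Directed diatonic interval: counts letter names, ignores accidentals."""
    letter_a, _, octave_a = parse(first)
    letter_b, _, octave_b = parse(second)
    steps = (STEP_INDEX[letter_b] + 7 * octave_b) - (STEP_INDEX[letter_a] + 7 * octave_a)
    # Intervals are one-based: a unison is 1, a step up is 2, a step down is -2.
    return steps + 1 if steps >= 0 else steps - 1

def chromatic(first, second):
    """Directed chromatic interval: counts half steps."""
    letter_a, shift_a, octave_a = parse(first)
    letter_b, shift_b, octave_b = parse(second)
    pitch_a = SEMITONES[letter_a] + shift_a + 12 * octave_a
    pitch_b = SEMITONES[letter_b] + shift_b + 12 * octave_b
    return pitch_b - pitch_a

print(diatonic('C4', 'G4'), chromatic('C4', 'G4'))     # a rising fifth
print(diatonic('C4', 'E-4'), chromatic('C4', 'E-4'))   # a rising minor third
print(diatonic('C4', 'E5'))                            # a compound interval (a tenth)
```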


### Ngrams
- **piece.ngrams()**, which finds n-grams of any length in each voice part. n-grams are frequently used in linguistic analysis, and can help us find repeating patterns within and among works.
- The ngram tool can be used for any of the methods above: **melodic**, **harmonic**, **durations**, **lyrics**.
- By default it finds **contrapuntal modules**, which represent in numerical values a combination of the vertical intervals made between any two voices with the melodic intervals heard in the motion of the lower voice. A module of `7_Held, 6_-2, 8` for instance, represents **vertical intervals of `7, 6, 8` between two voices** and in the **lower voice a tied note followed by a descending second**. Together these five events represent a typical cadence formula. Repeating modules are a key part of contrapuntal style.

Learn more about [ngrams](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/08_Contrapuntal_Modules.md)
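
As a rough illustration of the idea (not the library's own code), an n-gram is simply a sliding window over a sequence of events. The sketch below builds 3-grams from the directed diatonic intervals of the opening upper-voice notes shown later in this notebook (C4 D4 E4 F4 D4 E4 C4 G4), and then splits a contrapuntal module string into its vertical and melodic parts. Both `ngrams` and `parse_module` are hypothetical helpers, and the parser assumes the module is written with comma-separated steps.

```python
def ngrams(sequence, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Directed diatonic intervals between C4 D4 E4 F4 D4 E4 C4 G4
intervals = ['2', '2', '2', '-3', '2', '-3', '5']
trigrams = ngrams(intervals, 3)
for gram in trigrams:
    print('_'.join(gram))

def parse_module(module):
    """Split a module such as '7_Held, 6_-2, 8' into vertical and melodic events."""
    steps = [step.strip() for step in module.split(',')]
    vertical = [step.split('_')[0] for step in steps]
    melodic = [step.split('_', 1)[1] for step in steps if '_' in step]
    return vertical, melodic

vertical, melodic = parse_module('7_Held, 6_-2, 8')
print(vertical, melodic)
```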

### Heatmaps, Graphs, and Networks

CRIM Intervals also features tools that help us visualize musical events in different ways, including:

- **bar charts, histograms, and radar (spider) plots** of events (such as notes, or intervals), which are useful for understanding ranges and other large-scale distributions of categorical data
- **heatmaps**, showing where in a piece patterns (such as repeating n-grams) occur
- **network diagrams**, showing related patterns or pieces

### Import a single piece

Import a single piece:

`piece = importScore('MEI/Bach_BWV_0772.mei')`

Then call any of the methods above with that piece:

`notes = piece.notes()`

This will return a Pandas dataframe of the *notes and rests* in the piece.

### Import a "Corpus" of pieces

Or a set of pieces, which we call a 'corpus':

```python
file_list = ('MEI/Bach_BWV_0772_rev.mei',
             'MEI/Bartok_Mikrokosmos_022_rev.mei',
             'MEI/Morley_1595_01_Go_ye_my_canzonettes_rev.mei')
# now import those files as a Corpus
corpus = CorpusBase(file_list)
```

Then we use the `batch` method to create a list of dfs, one for each piece in the `corpus`. In this case we define a function to use (`func`), then pass in a dictionary of keyword arguments (`kwargs`). The `metadata` parameter allows us to include composer and title information in each dataframe of results. And then we can combine the dfs into a single set.

```python
func = ImportedPiece.melodic  # <- NB there are no parentheses here
list_of_dfs = corpus.batch(func = func, 
                            kwargs = {'kind': 'd', 
                                     'compound' : True}, 
                            metadata=True)
mel = pd.concat(list_of_dfs)

```

### Corpus Tools Make Things Simple!

Note that we also have a special set of corpus tools that make quick work of applying these methods to a whole corpus of pieces.

See [CRIM Intervals Tutorials](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/21_Corpus_Tools.md) for more information on these.



# 2.0 Experiment with One File

You can start with just one file so you understand what is happening as CRIM Intervals assembles dataframes.

Load one piece:

```python
# pick a file
file_name = 'MEI/Bach_BWV_0772.mei'

# load a piece object
piece = importScore(file_name)

# run a method and show results

notes = piece.notes()
notes
```

## 2.1 Get Notes

In [69]:
# pick a file
file_name = 'MEI/Bach_BWV_0772.mei'

# load a piece object
piece = importScore(file_name)

# run a method and show results
notes = piece.notes()
notes

Unnamed: 0,Part-1,Part-2
0.00,Rest,Rest
0.25,C4,
0.50,D4,
0.75,E4,
1.00,F4,
...,...,...
83.00,D4,G3
83.25,C5,
83.50,F4,G2
83.75,B4,


In [70]:
durs = piece.durations()

# as strings
durs = durs.map(str)

In [71]:
# durs was already converted to strings in the previous cell
note_dur = pd.merge(notes, durs, left_index=True, right_index=True)
# combine matching cols
note_dur['Part-1'] = note_dur['Part-1_x'] + '_' + note_dur['Part-1_y']
note_dur['Part-2'] = note_dur['Part-2_x'] + '_' + note_dur['Part-2_y']
# just the cols we need
note_dur = note_dur[['Part-1', 'Part-2']]
note_dur

Unnamed: 0,Part-1,Part-2
0.00,Rest_0.25,Rest_2.25
0.25,C4_0.25,
0.50,D4_0.25,
0.75,E4_0.25,
1.00,F4_0.25,
...,...,...
83.00,D4_0.25,G3_0.5
83.25,C5_0.25,
83.50,F4_0.25,G2_0.5
83.75,B4_4.25,


In [72]:
# rename the `parts` so they are consistent across all your work:

notes_numbered = piece.numberParts(notes)
notes_numbered

Unnamed: 0,1,2
0.00,Rest,Rest
0.25,C4,
0.50,D4,
0.75,E4,
1.00,F4,
...,...,...
83.00,D4,G3
83.25,C5,
83.50,F4,G2
83.75,B4,


In [73]:
#  add metadata

notes_numbered['Composer'] = piece.metadata['composer']
notes_numbered['Title'] = piece.metadata['title']


# fill na for clarity
notes_numbered.fillna('').head(10)

Unnamed: 0,1,2,Composer,Title
0.0,Rest,Rest,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
0.25,C4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
0.5,D4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
0.75,E4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.0,F4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.25,D4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.5,E4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.75,C4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
2.0,G4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
2.25,,C3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"


In [74]:
# but where are the measure numbers?

# by default Intervals reports the locations by Offset, in which each 1.0 represents 1 quarter note
# these are zero based, so that the first beat is 0.00

# we can pass the results to a function called detailIndex() in order to add measure and beat numbers to the results

# note that the "Progress" column is a very useful 'float' that represents the place of this event as a 'proportion' of the whole.  
# 0 is the start.  1 is the end of the piece

notes_numbered_measures = piece.detailIndex(notes_numbered, offset=True, progress=True)
notes_numbered_measures

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,1,2,Composer,Title
Measure,Beat,Offset,Progress,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1.0,1.00,0.00,0.000000,Rest,Rest,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.0,1.25,0.25,0.002976,C4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.0,1.50,0.50,0.005952,D4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.0,1.75,0.75,0.008929,E4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1.0,2.00,1.00,0.011905,F4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
...,...,...,...,...,...,...,...
21.0,4.00,83.00,0.988095,D4,G3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
21.0,4.25,83.25,0.991071,C5,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
21.0,4.50,83.50,0.994048,F4,G2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
21.0,4.75,83.75,0.997024,B4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
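
The `Progress` values in the table above are consistent with dividing each offset by a total length of 84 quarter notes (21 measures of four beats). That figure is inferred from the numbers shown here rather than taken from the library's documentation, so treat the sketch below as a way to read the column, not as the definition of how `detailIndex()` computes it.

```python
# Assumed total length in quarter notes: 21 measures x 4 beats (inferred from the table).
TOTAL_QUARTER_NOTES = 21 * 4

def progress(offset, total=TOTAL_QUARTER_NOTES):
    """Express an offset as a proportion of the whole piece (0 = start, 1 = end)."""
    return offset / total

for offset in (0.0, 0.25, 83.0, 83.75):
    print(offset, round(progress(offset), 6))
```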


In [75]:
# the measures and beats are recorded in a multi index.  
# Reset it to make these part of the df
notes_numbered_measures_reset = notes_numbered_measures.reset_index()
notes_numbered_measures_reset

Unnamed: 0,Measure,Beat,Offset,Progress,1,2,Composer,Title
0,1.0,1.00,0.00,0.000000,Rest,Rest,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
1,1.0,1.25,0.25,0.002976,C4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
2,1.0,1.50,0.50,0.005952,D4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
3,1.0,1.75,0.75,0.008929,E4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
4,1.0,2.00,1.00,0.011905,F4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
...,...,...,...,...,...,...,...,...
331,21.0,4.00,83.00,0.988095,D4,G3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
332,21.0,4.25,83.25,0.991071,C5,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
333,21.0,4.50,83.50,0.994048,F4,G2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"
334,21.0,4.75,83.75,0.997024,B4,,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772"


## 2.2  Visualize the Results

In [76]:
# let's melt the data around the voice parts to make things easier!
selected_df = notes_numbered_measures_reset
# for melting the data, we will set up value and id variables
value_vars= ['1', '2']
id_vars = [col for col in selected_df.columns if col not in value_vars]

# now melt
selected_df_melted = pd.melt(selected_df, value_vars=value_vars, id_vars=id_vars)
selected_df_melted.rename(columns={'variable': 'Voice', 'value': 'Note'}, inplace=True)
selected_df_melted

# # Option to remove rests
df_clean = selected_df_melted[selected_df_melted['Note'] != 'Rest']


# Count notes by Composer Title Note and Voice
note_counts = selected_df_melted.groupby(['Composer', 'Title', 'Note', 'Voice']).size().reset_index(name='Count')
note_counts.head()

Unnamed: 0,Composer,Title,Note,Voice,Count
0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A2,2,3
1,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A3,2,21
2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A4,1,23
3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A4,2,5
4,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A5,1,10
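
For readers new to "melting", the reshaping in the cell above can be pictured in plain Python: each wide row (one column per voice) becomes one long row per voice, and the counting step then tallies each (note, voice) pair. This is only a conceptual sketch with a handful of rows copied from the notes table; the notebook itself relies on `pd.melt` and `groupby`, and `groupby` skips missing keys by default, which the `None` check imitates here.

```python
from collections import Counter

# Wide format: (offset, voice 1, voice 2); None where no new note starts.
wide_rows = [
    (0.00, 'Rest', 'Rest'),
    (0.25, 'C4', None),
    (0.50, 'D4', None),
    (1.75, 'C4', None),
    (2.25, None, 'C3'),
]

# Long format: one record per offset and voice.
long_rows = []
for offset, voice_1, voice_2 in wide_rows:
    for voice, note in (('1', voice_1), ('2', voice_2)):
        long_rows.append({'Offset': offset, 'Voice': voice, 'Note': note})

# Count each (note, voice) pair, skipping empty cells.
counts = Counter(
    (row['Note'], row['Voice']) for row in long_rows if row['Note'] is not None
)
for (note, voice), count in sorted(counts.items()):
    print(note, voice, count)
```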


## 2.3 Visualize as Bar Chart



In [77]:
# Create the bar chart
df = note_counts

fig = px.bar(
    df,
    x='Note',
    y='Count',
    color='Voice',
    title=f"Note Distribution Analysis - {piece.metadata['composer']}:  {piece.metadata['title']} ",
    width=800,
    height=600,

    # note that the pitch order is set earlier in our NB!
    category_orders={'Note': pitch_order}
)

fig.show()

## 2.4 Visualize as Radar Plot

In [78]:
# get notes
notes = piece.notes()
notes_numbered = piece.numberParts(notes)
notes_numbered['Composer'] = piece.metadata['composer']
notes_numbered['Title'] = piece.metadata['title']


# melt (we don't need measures and offsets, so we start from notes_numbered)
selected_df = notes_numbered
# for melting the data, we will set up value and id variables
value_vars= ['1', '2']
id_vars = [col for col in selected_df.columns if col not in value_vars]

# now melt
selected_df_melted = pd.melt(selected_df, value_vars=value_vars, id_vars=id_vars)
selected_df_melted.rename(columns={'variable': 'Voice', 'value': 'Note'}, inplace=True)
selected_df_melted

# # Option to remove rests
df_clean = selected_df_melted[selected_df_melted['Note'] != 'Rest']


# Count notes by Composer Title Note (but NOT voice)
note_counts = selected_df_melted.groupby(['Composer', 'Title', 'Note', ]).size().reset_index(name='Count')

# remove octave designation
note_counts['Note'] = note_counts['Note'].str.replace(r'\d+', '', regex=True)

# Function to standardize note names
def standardize_note(note):
   
    if '-' in note:
        return note.replace('-', 'b')
    return note

note_counts['Note'] = note_counts['Note'].apply(standardize_note)

# now let's scale the counts relative to the total number of notes in the piece
total_notes = note_counts['Count'].sum()
note_counts['Scaled_Count'] = note_counts['Count']/total_notes
note_counts.head()

Unnamed: 0,Composer,Title,Note,Count,Scaled_Count
0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A,3,0.006452
1,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A,21,0.045161
2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A,28,0.060215
3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A,10,0.021505
4,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",Bb,4,0.008602
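
The table above still shows several rows for the same pitch class because the counts were made before the octave numbers were stripped; the next cell sums them per note name. The whole reduction from pitches to scaled pitch-class counts can be sketched in plain Python as follows. The note list is made up, and `pitch_class` is a hypothetical helper that mirrors the two string operations used in the cell (remove digits, then write flats as `b`).

```python
import re
from collections import Counter

def pitch_class(note):
    """Drop the octave number and write music21 flats ('-') as 'b'."""
    return re.sub(r'\d+', '', note).replace('-', 'b')

# Made-up pitches for illustration.
notes = ['A2', 'A3', 'A4', 'B-4', 'C#5', 'B-3']

counts = Counter(pitch_class(note) for note in notes)
total = sum(counts.values())

# Scale each count by the total so the values sum to 1.
scaled = {name: count / total for name, count in counts.items()}
print(counts)
print(scaled)
```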


In [79]:
df = note_counts.copy()
counted_notes_sorted = df.groupby('Note')['Scaled_Count'].sum().reset_index()


# plot with filled area
fig = px.line_polar(
    counted_notes_sorted, 
    r='Scaled_Count',                           
    theta='Note',                              
    line_close=True,
    color='Title' if 'Title' in counted_notes_sorted.columns else None,
    # pitch_class_order defined at outset
    category_orders={'Note': pitch_class_order}
)

# Add fill to all traces
fig.update_traces(fill='toself', fillcolor='rgba(135, 206, 250, 0.5)')  # Light blue with transparency

# title
fig.update_layout(title=f"Note Distribution Analysis - {piece.metadata['composer']}:  {piece.metadata['title']} ")
fig.show()

## 2.4 Try the same with Melodic and Harmonic Intervals


### Melodic

`mel = piece.melodic(kind = 'd', directed=True, compound=True)`

This will return all the melodic 'intervals' in the piece. You can select the type of interval with the `kind` parameter:  

- 'd' = diatonic (in which major and minor thirds will simply be '3') 
- 'c' = chromatic (in which we count every half step between successive notes, in this case a C-G will be '7')
- 'q' = with quality (in which major thirds will be M3 and minor thirds will be m3, for example)

Learn more via the [tutorial](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/06_Melodic_Intervals.md)

### Harmonic

`har = piece.harmonic(kind = 'd', directed=True, compound=True)`

This will return all the harmonic 'intervals' in the piece. You can select the type of interval with the `kind` parameter:  

- 'd' = diatonic (in which major and minor thirds will simply be '3') 
- 'c' = chromatic (in which we count every half step between the two sounding notes, in this case a C-G will be '7')
- 'q' = with quality (in which major thirds will be M3 and minor thirds will be m3, for example)

Learn more via the [tutorial](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/07_Harmonic_Intervals.md)
  

Then renumber parts and add measure numbers as needed!

```python
mel = piece.melodic(kind = 'd', directed=True, compound=True)
mel_numbered = piece.numberParts(mel)
mel_numbered_measures = piece.detailIndex(mel_numbered, offset=False, progress=True).reset_index()
mel_numbered_measures['Composer'] = piece.metadata['composer']
mel_numbered_measures['Title'] = piece.metadata['title']
```



In [80]:
#  your turn here!
mel = piece.melodic(kind = 'd', directed=True, compound=True)
mel_numbered = piece.numberParts(mel)
mel_numbered_measures = piece.detailIndex(mel_numbered, offset=False, progress=True).reset_index()
mel_numbered_measures['Composer'] = piece.metadata['composer']
mel_numbered_measures['Title'] = piece.metadata['title']

## 2.5 Bar Chart of Melodic Intervals

In [81]:
# for melting the data, we will set up value and id variables
value_vars= ['1', '2']
id_vars = [col for col in mel_numbered_measures.columns if col not in value_vars]

# now melt
mel_numbered_measures_melted = pd.melt(mel_numbered_measures, value_vars=value_vars, id_vars=id_vars)
mel_numbered_measures_melted.rename(columns={'variable': 'Voice', 'value': 'Interval'}, inplace=True)
mel_numbered_measures_melted


# Count intervals by Composer, Title, Interval, and Voice
mel_counts = mel_numbered_measures_melted.groupby(['Composer', 'Title', 'Interval', 'Voice']).size().reset_index(name='Count')
mel_counts.head()


Unnamed: 0,Composer,Title,Interval,Voice,Count
0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",-2,1,81
1,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",-2,2,66
2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",-3,1,28
3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",-3,2,22
4,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",-4,1,2


In [82]:
# Create the bar chart
df = mel_counts.copy()

# remove rests
df_no_rests = df[df['Interval'] !=  'Rest']

unique_intervals = df_no_rests['Interval'].unique()

# unique_intervals_no_rest = [interval for interval in unique_intervals if interval != "Rest"]

# sorting function
def sort_integer_strings_explicit(string_array):
    """
    Explicit method: separate negatives and positives, then sort each group
    """
    negatives = [s for s in string_array if int(s) < 0]
    positives = [s for s in string_array if int(s) >= 0]
    
    # Sort negatives in ascending numeric order (largest descent first: ... -3, -2)
    negatives.sort(key=lambda x: int(x), reverse=False)
    
    # Sort positives in ascending order (lowest positive first: 0, 1, 2, 3...)
    positives.sort(key=lambda x: int(x))
    
    return negatives + positives

interval_order = sort_integer_strings_explicit(unique_intervals)

fig = px.bar(
    df_no_rests,
    x='Interval',
    y='Count',
    color='Voice',
    title=f"Interval Distribution Analysis - {piece.metadata['composer']}:  {piece.metadata['title']} ",
    width=800,
    height=600,

    # interval_order was computed just above with sort_integer_strings_explicit()
    category_orders={'Interval': interval_order}
)

fig.show()

In [83]:
# you could continue to experiment with piece.harmonic, or piece.ngrams

## 2.6 Heat Map of Ngrams

- Learn about ngrams via the [tutorial](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/09_Ngrams_Heat_Maps.md)!

The function takes the following arguments (shown with their default values):
- combine_unisons_choice=False:  whether to combine unisons
- kind_choice='d':  'd' for diatonic, etc.
- directed=True:  directed intervals or not
- compound=True:  compound intervals or not
- length_choice=6:  length of ngram
- include_count=False:  include chart of count of ngrams
- entries_only=True:  limit the ngrams to 'entries' (events after rests or breaks)

In [84]:
# function for ngram heatmap:  do not edit!

def ngram_heatmap(piece, 
                  combine_unisons_choice=False, 
                  kind_choice='d', 
                  directed=True, 
                  compound=True, 
                  length_choice=6, 
                  include_count=False, 
                  entries_only=True):
    # find ngrams
    nr = piece.notes(combineUnisons = combine_unisons_choice)
    mel = piece.melodic(df = nr, 
                        kind = kind_choice,
                        directed = directed,
                        compound = compound,
                        end = False)
    
    # this is for entries only
    if entries_only == True:    
        # pass the following ngrams to the plot below as first df
        entry_ngrams = piece.entries(df = mel, 
                                    n = length_choice, 
                                    thematic = True, 
                                    anywhere = True,
                                    exclude = ['Rest'])
        # pass the ngram durations below to the plot as second df
        entry_ngrams_duration = piece.durations(df = mel, 
                                            n =length_choice, 
                                            mask_df = entry_ngrams)
        # make the heatmap
        chart = viz.plot_ngrams_heatmap(entry_ngrams, 
                                         entry_ngrams_duration, 
                                         selected_patterns=[], 
                                         voices=[],  
                                         includeCount=include_count)
        
        return chart
    
    # this is for ALL mel ngrams (if entries is False in form)
    else:
        mel_ngrams = piece.ngrams(df = mel, n = length_choice, exclude = ['Rest'])
        mel_ngrams_duration = piece.durations(df = mel, 
                                          n =length_choice)
        
        chart = viz.plot_ngrams_heatmap(mel_ngrams, 
                                         mel_ngrams_duration, 
                                         selected_patterns=[], 
                                         voices=[], 
                                         includeCount=include_count)
        
        # # mel_ngrams_detail = piece.detailIndex(mel_ngrams, offset = False)  
        return chart

In [85]:
ngram_heatmap(piece)

# 3.0 Build a Corpus

Here we obtain the MEI files from our work.

Note that you can use wildcards to limit the corpus to certain files.

```python
all_bach = glob.glob('MEI/Bach*')
all_bach = sorted(all_bach)
```

             
Now load the list as a CRIM Intervals corpus:

`corpus = CorpusBase(all_bach)`
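
To preview which files a wildcard would select without touching the disk, the same shell-style patterns can be tried against a list of names with the standard library's `fnmatch` module, which `glob` uses for its matching. The file names below are the ones from the corpus example earlier in this notebook; the second pattern is a made-up example of a narrower filter.

```python
from fnmatch import fnmatchcase

# File names as they might appear inside the MEI folder.
names = [
    'Bach_BWV_0772_rev.mei',
    'Bartok_Mikrokosmos_022_rev.mei',
    'Morley_1595_01_Go_ye_my_canzonettes_rev.mei',
]

# '*' matches any run of characters, so 'Bach*' keeps names starting with 'Bach'.
bach_only = sorted(name for name in names if fnmatchcase(name, 'Bach*'))

# Patterns can also constrain the end of the name.
b_composers = sorted(name for name in names if fnmatchcase(name, 'B*_rev.mei'))

print(bach_only)
print(b_composers)
```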

In [86]:
# for example
all_bach = glob.glob('MEI/Bach*')
all_bach = sorted(all_bach)
corpus = CorpusBase(all_bach)



## 3.1 Applying CRIM Intervals Functions to Each Piece in a Corpus

Here we apply the _same_ function to each piece in a corpus and then return a combined df that represents our results.  With this we can filter, group, as well as make charts and graphs to show _shared_ patterns across our corpus.

To learn more about how this works, see the [Corpus Tutorial](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/01_Introduction_and_Corpus.md#batch-methods-for-a-corpus-of-pieces) from the CRIM Intervals documentation.


Fortunately for you, we have also created a special set of `corpus tools` that make easy work of all of this.  Learn about them via the [relevant tutorial](https://github.com/HCDigitalScholarship/intervals/blob/main/tutorial/21_Corpus_Tools.md).  

All you need to do:

```python
# define your corpus
all_bach = glob.glob('MEI/Bach*')
all_bach = sorted(all_bach)
corpus = CorpusBase(all_bach)

# pass it to the corpus_tool, note that the arguments determine the settings

corpus_notes(corpus, combine_unisons_choice=True, combine_rests_choice=False)
```


Here are all the corpus tools, in brief:


- corpus_notes
- corpus_note_scaled
- corpus_note_durs
- corpus_mel
- corpus_har
- corpus_contrapuntal_ngrams
- corpus_melodic_ngrams
- corpus_melodic_durational_ratios_ngrams
- corpus_harmonic_ngrams
- corpus_sonority_ngrams
- corpus_cadences
- corpus_presentation_types

## 3.2  Corpus Notes

In [87]:
# define your corpus
all_bach = glob.glob('MEI/Bach*')
all_bach = sorted(all_bach)
corpus = CorpusBase(all_bach)

# pass it to the corpus_tool, note that the arguments determine the settings

corpus_note_df = corpus_notes(corpus, combine_unisons_choice=True, combine_rests_choice=False)
corpus_note_df



Unnamed: 0,Composer,Title,Date,1,2
0.0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",,Rest,Rest
0.25,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",,C4,
0.5,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",,D4,
0.75,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",,E4,
1.0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",,F4,
...,...,...,...,...,...
82.75,"Bach, Johann Sebastian","Invention No. 15 in B minor, BWV 786",,D5,G3
83.0,"Bach, Johann Sebastian","Invention No. 15 in B minor, BWV 786",,A#4,F#3
83.5,"Bach, Johann Sebastian","Invention No. 15 in B minor, BWV 786",,,F#2
83.75,"Bach, Johann Sebastian","Invention No. 15 in B minor, BWV 786",,B4,


## 3.3 Visualizing Results for Corpus

### 3.3.1 A Bar Chart of Notes for a Corpus

Here we look at the distribution of notes, breaking things down by voice-part (1 is the top part, 2 is the bottom one)

In [88]:

# run function and create chart
corpus_note_df = corpus_notes(corpus, combine_unisons_choice=True, combine_rests_choice=False)

value_vars= ['1', '2']
id_vars = [col for col in corpus_note_df.columns if col not in value_vars]
corpus_note_df_melted = pd.melt(corpus_note_df, value_vars=value_vars, id_vars=id_vars)
corpus_note_df_melted.rename(columns={'variable': 'Voice', 'value': 'Note'}, inplace=True)
corpus_note_df_melted
# Clean data - remove Rest and NaN values
df_clean = corpus_note_df_melted[
    (corpus_note_df_melted['Note'].notna()) & 
    (corpus_note_df_melted['Note'] != 'Rest')
].copy()

# Count notes by Composer Title Note and Voice
corpus_note_counts = df_clean.groupby(['Composer', 'Title', 'Note', 'Voice']).size().reset_index(name='Count')
corpus_note_counts


Unnamed: 0,Composer,Title,Note,Voice,Count
0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A2,2,3
1,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A3,2,21
2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A4,1,22
3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A4,2,5
4,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A5,1,10
...,...,...,...,...,...
704,"Bach, Johann Sebastian","Invention No. 9 in F minor, BWV 780",G2,2,4
705,"Bach, Johann Sebastian","Invention No. 9 in F minor, BWV 780",G3,2,36
706,"Bach, Johann Sebastian","Invention No. 9 in F minor, BWV 780",G4,1,20
707,"Bach, Johann Sebastian","Invention No. 9 in F minor, BWV 780",G4,2,4


In [89]:
# Create the bar chart
fig = px.bar(
    corpus_note_counts,
    x='Note',
    y='Count',
    color='Title',
    height=800,
    width=1000,
    barmode='stack',
    # note that the pitch order is set earlier in our NB!
    category_orders={'Note': pitch_order}
)

fig.show()

### 3.3.2 As Radar Plot

Here we display the notes on a radar plot.  But now the colors correspond to the title of each piece.

Of course we could set the color to correspond to some other feature in the df, like composer.  

In [90]:

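# remove rests (they were already filtered out above, so this is a safeguard)
# .copy() lets us modify the result without touching corpus_note_counts
corpus_clean = corpus_note_counts[corpus_note_counts['Note'] != 'Rest'].copy()

# remove octave designation
corpus_clean['Note'] = corpus_clean['Note'].str.replace(r'\d+', '', regex=True)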

# # Function to standardize note names
def standardize_note(note):
   
    if '-' in note:
        return note.replace('-', 'b')
    return note

corpus_clean['Note'] = corpus_clean['Note'].apply(standardize_note)

# Group by Composer, Title, and Note (but NOT voice)
# NB: .size() counts the rows in each group (one row per octave/voice combination),
# not the note totals; to total the occurrences, sum the 'Count' column instead
corpus_note_counts_grouped = corpus_clean.groupby(['Composer', 'Title', 'Note', ]).size().reset_index(name='Count')


# now let's scale the counts relative to the grand total for the whole corpus
total_notes = corpus_note_counts_grouped['Count'].sum()
corpus_note_counts_grouped['Scaled_Count'] = corpus_note_counts_grouped['Count']/total_notes
corpus_note_counts_grouped.head(10)

Unnamed: 0,Composer,Title,Note,Count,Scaled_Count
0,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",A,5,0.007052
1,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",B,4,0.005642
2,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",Bb,3,0.004231
3,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",C,6,0.008463
4,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",C#,2,0.002821
5,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",D,5,0.007052
6,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",E,4,0.005642
7,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",F,4,0.005642
8,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",F#,2,0.002821
9,"Bach, Johann Sebastian","Invention No. 1 in C major, BWV 772",G,5,0.007052
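
The scaling in the cell above divides every count by one grand total for the whole corpus, so longer pieces produce larger values. If the goal is to compare the profile of each piece on an equal footing, one alternative is to divide each count by the total for its own piece. The sketch below contrasts the two approaches in plain Python; the titles and counts are made up for illustration.

```python
from collections import defaultdict

# Made-up (title, pitch class, count) rows.
rows = [
    ('Piece A', 'C', 30),
    ('Piece A', 'G', 10),
    ('Piece B', 'C', 5),
    ('Piece B', 'G', 15),
]

# Corpus-wide scaling: one denominator for everything.
corpus_total = sum(count for _, _, count in rows)
corpus_wide = {(title, note): count / corpus_total for title, note, count in rows}

# Per-piece scaling: each title gets its own denominator.
piece_totals = defaultdict(int)
for title, _, count in rows:
    piece_totals[title] += count
per_piece = {(title, note): count / piece_totals[title] for title, note, count in rows}

for key in corpus_wide:
    print(key, corpus_wide[key], per_piece[key])
```

With per-piece scaling, the values for each title sum to 1, so the shapes on a radar plot reflect proportions within a piece rather than its length.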


In [91]:

titles = corpus_note_counts_grouped['Title'].unique()
fig = px.line_polar(
    corpus_note_counts_grouped,
    r='Scaled_Count',
    theta='Note',
    color='Title',
    line_close=True,
    range_r=[0, corpus_note_counts_grouped['Scaled_Count'].max() * 1.1],
    markers=True,
    height=800,
    width=800,
    category_orders={'Note': pitch_class_order}
)
fig.update_traces(fill='toself',
                  mode='markers+lines',
                  opacity=.7)
fig.update_layout(
    showlegend=True,
    legend=dict(
        title='Title',
        orientation="v"
    ),
    title='Weighted Note Distribution in Corpus')
fig.show()

### 3.3.3. Barchart of Melodic Intervals

In [92]:
mel = corpus_mel(corpus, kind_choice='d', compound_choice=False)
# Melt the dataframe to combine columns '1' and '2' into one
mel_melted = mel.melt(
    id_vars=['Composer', 'Title', 'Date'], 
    value_vars=['1', '2'], 
    var_name='Voice', 
    value_name='Interval'
)

# Drop NaN values
mel_melted = mel_melted.dropna(subset=['Interval'])

# Count intervals by Title (and optionally Voice)
mel_counts = mel_melted.groupby(['Title', 'Interval']).size().reset_index(name='Count')

# Get the unique interval labels (they are converted to integers inside the sort key)
unique_intervals = mel_counts['Interval'].dropna().unique()

# Sort: negatives in ascending order (most negative first), then non-negatives ascending
def interval_sort_key(x):
    try:
        val = int(x)
    except (TypeError, ValueError):
        return (2, 0)  # Put non-numeric (like 'Rest') at the end
    # Negatives get sorted by their value (ascending, so -8 comes before -7)
    # Non-negatives get sorted after all negatives
    if val < 0:
        return (0, val)  # First group, sort by value
    return (1, val)  # Second group, sort by value

sorted_intervals = sorted(unique_intervals, key=interval_sort_key)
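
To see what this ordering does, here is a small self-contained check on an invented list of interval labels. The function `toy_sort_key` is a stand-in that follows the same idea as `interval_sort_key`; the labels are placeholders rather than output from the corpus:

```python
def toy_sort_key(label):
    """Order interval labels: negatives, then non-negatives, then non-numeric labels."""
    try:
        value = int(label)
    except (TypeError, ValueError):
        return (2, 0)  # non-numeric labels such as 'Rest' go last
    return (0, value) if value < 0 else (1, value)

# invented labels, deliberately out of order
labels = ['3', '-2', 'Rest', '1', '-5', '2']
ordered = sorted(labels, key=toy_sort_key)
print(ordered)  # ['-5', '-2', '1', '2', '3', 'Rest']
```

Sorting the strings directly would compare them character by character (placing `'-2'` before `'-5'`), which is why the key converts each label to an integer first.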

In [93]:

chart_title = "Distribution of Melodic Intervals in Corpus"

# create bar chart
fig = px.bar(
    mel_counts,
    x='Interval',
    y='Count',
    color='Title',
    title=chart_title,
    category_orders={'Interval': sorted_intervals}
)
fig.update_layout(xaxis_title="Interval",
                  yaxis_title="Count",
                  legend_title='Title',
                  height=500,
                  width=750)

fig.show()

## 4.0 Network of Pieces based on Shared Melodic Ngrams


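Here we build a **network** of pieces: each piece is a node, and two pieces are joined by an edge when they share melodic ngrams. The weight of an edge is the number of distinct ngrams the two pieces have in common.

- The results are often quite dense, so we include a minimum threshold (`threshold_for_shared_ngrams`): two pieces are connected only if they share at least that many ngrams, and pieces with no such connection are left out of the network.

- A thickness adjustment (`thickness_adjust`) divides each edge weight to set the drawn width, so that the results are easier to interpret.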

```python
# the ngram conditions
n=4
combineUnisons=False
kind='c'
thematic = True
anywhere = False
threshold_for_shared_ngrams = 30
thickness_adjust = 5
```
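
The effect of the threshold and the thickness adjustment can be sketched without any network library. The example below uses invented titles and ngram strings; `ngram_to_titles`, `shared_counts`, `kept`, and `widths` are hypothetical names for illustration, and the counting follows the description above rather than reproducing the notebook's function.

```python
from collections import Counter
from itertools import combinations

# invented data: each ngram mapped to the set of pieces that contain it
ngram_to_titles = {
    '2_2_-3_2': {'Piece A', 'Piece B', 'Piece C'},
    '-2_-2_2_2': {'Piece A', 'Piece B'},
    '3_-2_-2_-2': {'Piece A', 'Piece B'},
    '2_-3_2_2': {'Piece B', 'Piece C'},
}

# count how many ngrams each pair of pieces shares
shared_counts = Counter()
for titles in ngram_to_titles.values():
    for pair in combinations(sorted(titles), 2):
        shared_counts[pair] += 1

threshold_for_shared_ngrams = 2
thickness_adjust = 5

# keep only pairs that meet the threshold, then scale the edge widths
kept = {pair: count for pair, count in shared_counts.items()
        if count >= threshold_for_shared_ngrams}
widths = {pair: count / thickness_adjust for pair, count in kept.items()}
print(kept)
print(widths)
```

Here the pair `('Piece A', 'Piece C')` shares only one ngram and is dropped. Raising the threshold thins the network further, while a larger `thickness_adjust` makes every remaining edge thinner.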

In [94]:
# Network Function--Do Not Edit!
def ng_network(data, 
               threshold_for_shared_ngrams=5, 
               thickness_adjust=5, 
               network_name='my_network.html'):

    # Build a network in which the nodes are the pieces and the edges represent the ngrams they share.
    # The thickness of each edge varies with the number of shared ngrams.
    # The colors distinguish 'communities' of pieces that are highly related.

    df = pd.DataFrame(data)
    df = df.reset_index()
    df.drop('level_1', axis=1, inplace=True)
    df = df.rename(columns={0: 'ngram'})

    #define the function to convert tuples to strings
    def convertTuple(tup):
        out = ""
        if isinstance(tup, tuple):
            out = '_'.join(tup)
        return out  
    # clean the tuples
    df['ngram'] = df['ngram'].apply(convertTuple)

    # Group by 'ngram' and extract a list of unique titles for each group
    grouped_titles = df.groupby('ngram')['title'].unique().reset_index(name='titles')

    # Generate all pairs of titles for each group
    all_pairs = []
    for _, row in grouped_titles.iterrows():
        pairs = list(combinations(row['titles'], 2))
        all_pairs.append((row['ngram'], pairs))

    # Create a new DataFrame with the results
    result_df = pd.DataFrame(all_pairs, columns=['ngram', 'title_pairs'])
    # remove the empty pairs
    df_filtered = result_df[result_df['title_pairs'].apply(len) > 0]

    # explode the lists of title pairs so that each pair gets its own row, effectively 'tidying' the data
    exploded_df = df_filtered.explode('title_pairs')

    # get the counts of each pair, which provides the basis of the weights
    pair_counts = exploded_df['title_pairs'].value_counts()

    # limit to pairs that share at least threshold_for_shared_ngrams ngrams
    pair_counts = pair_counts[pair_counts >= threshold_for_shared_ngrams]

    # Adding Louvain Communities
    def add_communities(G):
        G = deepcopy(G)
        partition = community_louvain.best_partition(G)
        nx.set_node_attributes(G, partition, "group")
        return G

    # Create an empty NetworkX graph
    G = nx.Graph()


    # Add nodes and assign weights to edges
    for pair, count in pair_counts.items():
        # Directly unpacking the tuple into node1 and node2
        node1, node2 = pair
        # Adding nodes if they don't exist already
        if node1 not in G.nodes:
            G.add_node(node1)
        if node2 not in G.nodes:
            G.add_node(node2)
        # Adding edge with weight
        G.add_edge(node1, node2, weight=count)

    # Adjusting edge thickness based on weights
    for edge in G.edges(data=True):
        edge[2]['width'] = edge[2]['weight']/thickness_adjust

    G = add_communities(G)

    # set display parameters
    ngram_map = Network(notebook=True,
                        width="800px",
                        height="800px",
                        bgcolor="black",
                        font_color="white")

    # Set the physics layout of the network
    ngram_map.set_options("""
    {
    "physics": {
    "enabled": true,
    "forceAtlas2Based": {
        "springLength": 1
    },
    "solver": "forceAtlas2Based"
    }
    }
    """)

    ngram_map.from_nx(G)
    return ngram_map.show(network_name)
    

    

In [96]:


# the ngram conditions
n=4
combineUnisons=False
kind='c'
thematic = True
anywhere = False
threshold_for_shared_ngrams = 2
thickness_adjust = 5
network_name = 'melodic_ngram_network.html'


# now we build a list of all of the ngrams in each piece
list_series = []
for item in all_bach:
    piece = importScore(item)
    title = piece.metadata['composer'] + ": " + piece.metadata['title']
    # extract the notes, then the melodic intervals between them
    nr = piece.notes(combineUnisons=combineUnisons)
    nr = piece.numberParts(df=nr)
    mel = piece.melodic(df=nr, kind=kind, compound=False, unit=0, end=False)
    # build ngrams of length n (rests excluded)
    mel_ngrams = piece.ngrams(df=mel, n=n, exclude=['Rest'])
    # label every row with the piece title and use it as the index
    mel_ngrams['title'] = title
    mel_ngrams.set_index(['title'], inplace=True)
    # stack the voice columns into one long Series of ngrams
    mel_stacked_ngs = mel_ngrams.stack()
    list_series.append(mel_stacked_ngs)

ngs = pd.concat(list_series)

print("Network of Melodic Ngrams")
ng_network(ngs, threshold_for_shared_ngrams, thickness_adjust, network_name)




Network of Melodic Ngrams
