<a href="https://colab.research.google.com/github/GrantBerg/DS-340W/blob/main/DS340w_final_project_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Required libraries installation

In [1]:
%pip install muspy
%pip install music21

Collecting muspy
  Downloading muspy-0.5.0-py3-none-any.whl.metadata (5.5 kB)
Collecting bidict>=0.21 (from muspy)
  Downloading bidict-0.23.1-py3-none-any.whl.metadata (8.7 kB)
Collecting miditoolkit>=0.1 (from muspy)
  Downloading miditoolkit-1.0.1-py3-none-any.whl.metadata (4.9 kB)
Collecting mido>=1.0 (from muspy)
  Downloading mido-1.3.3-py3-none-any.whl.metadata (6.4 kB)
Collecting pretty-midi>=0.2 (from muspy)
  Downloading pretty_midi-0.2.10.tar.gz (5.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m43.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pypianoroll>=1.0 (from muspy)
  Downloading pypianoroll-1.0.4-py3-none-any.whl.metadata (3.8 kB)
Downloading muspy-0.5.0-py3-none-any.whl (119 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.1/119.1 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading bidict-0.23.1-py3-none-any.whl (32 kB)
Downloading midit

loading libraries into python instance

In [3]:
import muspy
import random
import os
import music21
import math
from collections import defaultdict

Defining functions that will be used in the code

In [4]:
#condenses note sequences into chords and gives them labels utilizing chordify
def extract_chord_labels(filepath):
    stream = music21.converter.parse(filepath)
    chords = stream.chordify()
    labels = []
    for c in chords.flat.getElementsByClass('Chord'):
        label = c.pitchedCommonName  # e.g., "C major triad"
        labels.append(label)
    return labels

# Convert chord labels to consistent token format
def tokenize_chords(labels):
    return [label.replace(" ", "_").upper() for label in labels]

#Constructs the N-Gram Language model
def compute_ngram_probs(sequences, n=2):
    model = defaultdict(lambda: defaultdict(int))
    total = 0
    for seq in sequences:
        for i in range(len(seq)-n):
            context = tuple(seq[i:i+n-1])
            target = seq[i+n-1]
            model[context][target] += 1
            total += 1
    # Normalize to get probabilities
    for context in model:
        total_count = sum(model[context].values())
        for token in model[context]:
            model[context][token] /= total_count
    return model

#computes perplexity with smoothing to account for 0
def compute_perplexity(model, sequence, n=2):
    log_prob = 0
    count = 0
    for i in range(len(sequence)-n):
        context = tuple(sequence[i:i+n-1])
        target = sequence[i+n-1]
        prob = model.get(context, {}).get(target, 1e-6)  # smoothing
        log_prob += math.log2(prob)
        count += 1
    return 2 ** (-log_prob / count) if count > 0 else float('inf')

Load the wikifonia dataset

In [5]:
import muspy
from pathlib import Path  # Add this import
from music21 import converter
class WikifoniaDataset(muspy.RemoteFolderDataset):
    """Wikifonia dataset."""
    _NAME = "Wikifonia"
    _DESCRIPTION = "A dataset of lead sheets with melody and chords."
    _HOMEPAGE = "http://www.synthzone.com/files/Wikifonia/"
    _sources = {
        "wikifonia": {
            "filename": "Wikifonia.zip",
            "url": "http://www.synthzone.com/files/Wikifonia/Wikifonia.zip",
            "archive": True,
            "size": 35727800,
            "md5": "d26e22562e67eb7d37535e96cc5eebba",
            "sha256": "e7bce509462a73cee175308b6a3cdafa9effd6e8958b3ce03b4edb293cc6b691",
        }
    }
    _extension = "mxl"

    def read(self, filename: str | Path) -> muspy.datasets:  # Now Path is defined
      """Read a .mxl file into a Music object."""
      return muspy.read_musicxml(filename)

In [6]:
wikifonia = WikifoniaDataset(
    root="wikifonia_dataset/",
    download_and_extract=True,
    verbose=True
)

Downloading source : http://www.synthzone.com/files/Wikifonia/Wikifonia.zip ...


77952638976it [00:05, 13301493382.98it/s]


Successfully downloaded source : /content/wikifonia_dataset/Wikifonia.zip .
Extracting archive : /content/wikifonia_dataset/Wikifonia.zip ...
Successfully extracted archive : /content/wikifonia_dataset .


Creates empty lists for adding note info

In [49]:
sequences = []
chord_sequences = []



Randomly samples portion of the wikifonia datset

In [52]:
test_sample = []
for file in random.sample(wikifonia.raw_filenames, 50):
  if str(file).endswith(".mxl"):
    test_sample.append(file)

Utilizes entire dataset (WARNING TAKES UPWARDS OF 30 MINS TO RUN)

In [51]:
test_sample = []
for file in wikifonia.raw_filenames:
  if str(file).endswith(".mxl"):
    test_sample.append(file)

Basic Testing

In [53]:
count = 0
for file in test_sample:
    if count%100 == 0:
        print(f"{count/len(test_sample):.1%}")
    count += 1
    if str(file).endswith(".mxl"):
      try:
        music = muspy.read_musicxml(file)
        for notes in music:
          song = []
          for note in notes:
            song.append(note.pitch_str)
          sequences.append(song)
      except (muspy.MusicXMLError, ValueError, IndexError):
        continue
print("100%")

0.0%
100%


Chordifty testing

In [54]:
count = 0
for file in test_sample:
    if count%100 == 0:
        print(f"{count/len(test_sample):.1%}")
    count += 1
    try:
      if str(file).endswith(".mxl"):
          labels = extract_chord_labels(os.path.join(str(file)))
          tokens = tokenize_chords(labels)
          if len(tokens) > 0:
              chord_sequences.append(tokens)
    # Due to some errors while calculating a try/catch was added
    # The try/catch cuases the Keyboardinterupt error to be ignored
    # If issue appears and code needs to be stopped, runtime needs to be restarted
    except:
      continue
print("100%")

0.0%
100%


Test and train the model and calculate perplexity

In [55]:
test_seq1 = sequences[0]
train_seqs1 = sequences[1:]

model = compute_ngram_probs(train_seqs1, n=2)
pp = compute_perplexity(model, test_seq1, n=2)

print(f"Perplexity of test sequence (non chordify): {pp:.4f}")

Perplexity of test sequence (non chordify): 8.8219


In [56]:
test_seq2 = chord_sequences[0]
train_seqs2 = chord_sequences[1:]

model = compute_ngram_probs(train_seqs2, n=2)
pp = compute_perplexity(model, test_seq2, n=2)

print(f"Perplexity of test chord sequence (chordify): {pp:.4f}")

Perplexity of test chord sequence (chordify): 92.5040
