____________________
# Music 255: XML Encoding and Beautiful Soup

In this homework assignment, you will get an opportunity to engage, explore, and experiment with **XML** files through Python and **Beautiful Soup**. As mentioned in class, **XML** stands for **eXtensible Markup Language** and mainly serves to transport and store data. At its core, XML was designed to be **both machine- and human-readable**. For music files, one widely used encoding standard is **MEI**, the format of the **Music Encoding Initiative**, which is built on top of the XML framework.


Mainly, this homework focuses on:

* **access**: downloading, storing, reading, processing XML files
* **analysis**: performing basic quantitative and qualitative analysis of XML files
* **interpretation**: exploring the meaning and utility of XML files from both analytical and creative perspectives 

As we deal with these files, we will make use of Beautiful Soup – a Python library that makes interacting with XML and HTML friendly and comfortable. Or at least tries to... You can access Beautiful Soup's documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

XML files are guided by **markup rules**, which you can read more about [here](https://www.w3schools.com/xml/xml_syntax.asp) and consist of **elements**, which you can dive into [here](https://www.w3schools.com/xml/xml_elements.asp).

MEI files are primarily XML files with music-specific sections, elements, and structure. You can learn more about MEI [here](https://music-encoding.org/).


#### Find your pieces from the CRIM Corpus:

* Each of you will focus on ONE model and its derivative Mass. Find your set [here](https://docs.google.com/spreadsheets/d/140JbgdPz1yBsbnF_umN3iq-nmC9drLgBtJ9Qd4hk6zE/edit#gid=0).

----

#### Setup: Importing Python Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from bs4 import BeautifulSoup as bs
import optparse
import sys
from pathlib import Path
import requests
from lxml import etree
import re
from collections import Counter

____________________
# Part 1: Import Data

As your first and perhaps most important step, you'll have to go find some data. For this homework assignment, we will work with *these files*, which you can access *here*. Once you've picked your file and copied its URL, use the function provided below to **import it as a string of XML text**:

In [5]:
# function to fetch the text of an XML (MEI) document from a given URL
def getXML(url):
    # send an HTTP GET request for the document at the given URL
    response = requests.get(url)
    # response.text holds the body of the response (the XML markup) as a string
    return response.text

In the cell below, **read in your file** and store it in the "xml_document" variable:

In [6]:
# fetch the MEI file from its URL and keep the returned XML text
xml_document = getXML('https://crimproject.org/mei/CRIM_Model_0008.mei')

Once you've imported a file, you should be able to **convert it into a Beautiful Soup Object**:

In [7]:
soup = bs(xml_document, 'xml')

____________________
# Part 2: Data Exploration

In this part of the homework, we ask you to **investigate your file**. The **MEI** framework tends to be easy to understand (as it is meant to be human-readable, too!) – and given some simple tools you should be able to successfully navigate your data.

#### Step 2.0: Document

First, let's take a look at the document as a whole. Make the cell below **print the "pretty"** version of your MEI file:

In [8]:
# print the file
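
As a rough illustration of what "pretty" printing does, here is a minimal sketch on a made-up, heavily simplified MEI-like string rather than a real CRIM file. The names `toy_mei` and `toy_soup` are hypothetical; with your own data you would call `prettify()` on the `soup` object you created above.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified MEI-like text (not a real CRIM file)
toy_mei = (
    "<mei><meiHead><fileDesc><titleStmt>"
    "<title>Toy Piece</title>"
    "</titleStmt></fileDesc></meiHead></mei>"
)

toy_soup = BeautifulSoup(toy_mei, "xml")  # the "xml" parser relies on lxml

# prettify() returns a string with nested tags on their own indented lines
pretty = toy_soup.prettify()
print(pretty)
```

The indentation in the printed output mirrors the nesting of the elements, which makes the tree structure much easier to see.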

#### Step 2.1: General Info

Generally, every MEI file has some information about its contents. This typically covers the encoding standard, creators and editors involved with the file, and its structure.

First, get your file's **Title Statement** (hint: look for the "titleStmt" tag):

In [33]:
# output your file's title statement here
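
One possible way to look up a single element by its tag name, sketched on a made-up miniature header (the `toy_mei` text and its contents are invented for illustration, and a real CRIM header is considerably longer):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature header, invented for illustration
toy_mei = """<mei><meiHead><fileDesc>
  <titleStmt>
    <title>Toy Piece</title>
    <respStmt><persName role="composer">Anonymous</persName></respStmt>
  </titleStmt>
</fileDesc></meiHead></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")

# find() returns the first element with the given tag name, or None
title_stmt = toy_soup.find("titleStmt")
title_text = title_stmt.find("title").get_text(strip=True)
print(title_text)
```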

#### Step 2.2: Moving Down

As you have probably noticed by now, MEI (and XML, generally) files follow a **tree-like structure**. The structure is recursive: every element can contain other elements, which can in turn contain elements of their own. Intuitively, **children** are the elements contained directly within a broader element (think section), which is known as their **parent**.

In a way, the Title Statement ("\<titleStmt>") element is a wrapper for all things that define a Title Statement. Mainly, a typical MEI Title Statement describes the piece's title and the people involved with the piece in some capacity. 

Use the cell below to **find all children** of the document's Title Statement:

In [None]:
# get all children here
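
A minimal sketch of listing children, again on an invented miniature header (`toy_mei` is hypothetical). It assumes you only care about child elements: Beautiful Soup's `.children` also yields the whitespace between tags as text nodes, so the sketch filters those out.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature header, invented for illustration
toy_mei = """<mei><meiHead><fileDesc>
  <titleStmt>
    <title>Toy Piece</title>
    <respStmt><persName role="composer">Anonymous</persName></respStmt>
  </titleStmt>
</fileDesc></meiHead></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")
title_stmt = toy_soup.find("titleStmt")

# Keep only Tag objects; line breaks and indentation show up as plain strings
child_tags = [child for child in title_stmt.children if isinstance(child, Tag)]
child_names = [child.name for child in child_tags]
print(child_names)
```

Note that `persName` does not appear in the result: it is a grandchild of the Title Statement, not a direct child.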

What are some of your observations about these elements? How much do they share with the parent element? How would you go about identifying any relation between these child elements? 

Use the Markdown cell below to note some of your thoughts:

`Your thoughts here`

In our specific case, the CRIM MEI documents always have several **sections** ("\<section>"). The sections in a piece usually demarcate important divisions within a Model or Mass, and typically have a large number of "children" elements associated with them. 

Use the cell below to **find the first section** of your piece:

In [None]:
# return the first section here

Typically, one of the 'children' elements associated with a **section** is a **measure** ("\<measure>"). This should logically make sense to you, as you think about the document as a musical piece, not just an XML mess. Measures contain several things, including but not limited to **staves, layers, notes, syllables**.

**Rests** are an important part of musical notation, but remember that since a rest cannot carry a syllable (or most other performance markings, such as dynamic levels), rests are encoded with a different kind of tag/element from **notes**.  So be on the look-out for these! 

Use the cell below to find the **first measure of the first section** in your piece:

In [None]:
# return the first measure here

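As we briefly mentioned before, measures tend to have notes. Notes are *also* descendants of measures, although usually not direct children: a note typically sits inside a layer, which sits inside a staff, which sits inside the measure.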

Use the cell below to **find all notes** in the first measure of your piece:

In [None]:
# return the notes in the first measure here

Finally, a nice formatting tool is **"stripping"** the text, which essentially means removing any unnecessary white spaces. 

Use the space below to **iterate over the list of all first-measure notes** you just produced, accessing their pitch names (`note["pname"]`), and printing the "stripped" pitch name (`note["pname"].strip()`) \[hint: "for" loop is a great choice here\]:

In [None]:
# iterate over the array of notes here
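
The walk from section to measure to notes described in this step could look roughly like the sketch below. It runs on a made-up two-measure fragment (`toy_mei` is hypothetical, and real CRIM files carry many more attributes), so treat it as an illustration of the navigation pattern rather than a finished answer for your piece.

```python
from bs4 import BeautifulSoup

# Hypothetical two-measure fragment, invented for illustration
toy_mei = """<mei><music><body><mdiv><score>
<section>
  <measure n="1">
    <staff n="1"><layer n="1">
      <note pname="g" oct="4" dur="2"/>
      <rest dur="2"/>
    </layer></staff>
    <staff n="2"><layer n="1">
      <note pname="d" oct="3" dur="1"/>
    </layer></staff>
  </measure>
  <measure n="2">
    <staff n="1"><layer n="1"><note pname="a" oct="4" dur="1"/></layer></staff>
    <staff n="2"><layer n="1"><note pname="c" oct="3" dur="1"/></layer></staff>
  </measure>
</section>
</score></mdiv></body></music></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")

first_section = toy_soup.find("section")
first_measure = first_section.find("measure")

# find_all("note") searches all descendants; <rest> elements are not matched
first_measure_notes = first_measure.find_all("note")

pitch_names = []
for note in first_measure_notes:
    stripped = note["pname"].strip()  # remove any stray whitespace
    pitch_names.append(stripped)
    print(stripped)
```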

It's important to pause here and think about your work. What are some useful aspects of XML's tree-like structure? Why is it a tree after all? Can you think of a different data structure to represent your document?

Use the Markdown cell below to record your thoughts:

`your thoughts here`

#### Step 2.3: Moving Up

Working the opposite way around, you should be able to get to the entire (original) document given just one of its children. This can be done rather easily using Beautiful Soup and its recursive tree navigation.

First, use the cell below to **pick one note element in the piece** (could be one of the notes in the first measure, or the very last note in a particular staff) and optionally store it as a variable:

In [None]:
# pick a note here

Now that you have a note, you can use Beautiful Soup's hierarchy methods to access all of its parents. This can be done in a number of ways, including but not limited to:

* `element.parent.parent.parent...`
* `element.parents`
* `element.find_parents()`

Use the cell below to **find all parents** of the note you stored:

In [None]:
# find all parents here
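
To see what climbing the tree returns, here is a small sketch on an invented one-note document (`toy_mei` is hypothetical). It uses `.parents`, which yields each enclosing element in turn, ending with the Beautiful Soup object that represents the whole document.

```python
from bs4 import BeautifulSoup

# Hypothetical one-note document, invented for illustration
toy_mei = (
    "<mei><section><measure n='1'><staff n='1'><layer n='1'>"
    "<note pname='g' oct='4'/>"
    "</layer></staff></measure></section></mei>"
)

toy_soup = BeautifulSoup(toy_mei, "xml")
note = toy_soup.find("note")

# .parents walks upward from the immediate parent to the document itself
parent_names = [parent.name for parent in note.parents]
print(parent_names)
```

The last entry is named `[document]`: that is Beautiful Soup's label for the top-level object, so a single note is enough to find your way back to the full file.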

Finally, it is worth confirming that elements of the same type (homogeneous elements) often have **different parents**. 

Use the cell below to **find two notes** that have **different parents** \[note: different doesn't imply "of different types (tags)"\]: 

In [None]:
# find the two notes here
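
One way to check for different parents, sketched on a made-up single measure (`toy_mei` is hypothetical). The sketch compares parents with `is not` instead of `!=`, because Beautiful Soup treats two tags as equal when they have the same name, attributes, and contents, even if they are separate elements in the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical single measure with two staves, invented for illustration
toy_mei = """<mei><section><measure n="1">
  <staff n="1"><layer n="1"><note pname="g"/><note pname="a"/></layer></staff>
  <staff n="2"><layer n="1"><note pname="c"/></layer></staff>
</measure></section></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")
notes = toy_soup.find_all("note")

first = notes[0]
# Pick the first note whose parent is a different object in the tree
other = next(n for n in notes if n.parent is not first.parent)

print(first["pname"], first.parent.parent["n"])
print(other["pname"], other.parent.parent["n"])
```

Both parents are `<layer>` elements, which is exactly the point of the note above: different parents need not be of different types.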

#### Step 2.4: Moving Sideways

Just as hierarchically related elements are called children and parents, elements **at the same depth that share a parent** are called **siblings**. Typically, siblings can be both **homogeneous** – for example, two neighboring sections, which serve the same purpose – and **heterogeneous** – for example, a **Title Statement** features a **Title** and a **Statement of Responsibility**, which are on the same level yet serve different purposes. 

Perhaps one meaningful way to illustrate the sibling concept would be to work with **people**. Use the cell below to **find the first person** (\<persName>) involved with your piece \[note: people involved are descendants of the Title Statement, listed inside its Statement of Responsibility\]:

In [None]:
# get the first person here

Naturally, every other person involved with the same piece would relate to the first person as their **sibling element** – as they share their parent element. 

Use the cell below to **find the next siblings** of the first person involved with the piece:

In [None]:
# find next siblings of the first person here
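
A minimal sketch of sibling navigation on an invented Statement of Responsibility (`toy_mei` and the three names in it are made up). It uses `find_next_siblings("persName")`, which returns only matching elements; the plain `.next_siblings` attribute would also yield the whitespace between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical Statement of Responsibility, invented for illustration
toy_mei = """<mei><meiHead><fileDesc><titleStmt>
  <title>Toy Piece</title>
  <respStmt>
    <persName role="composer">Composer A</persName>
    <persName role="editor">Editor B</persName>
    <persName role="analyst">Analyst C</persName>
  </respStmt>
</titleStmt></fileDesc></meiHead></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")
first_person = toy_soup.find("persName")

# All later <persName> elements that share the same parent
later_people = first_person.find_next_siblings("persName")
later_names = [person.get_text(strip=True) for person in later_people]
print(later_names)
```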

As you deal with people involved with the piece, you might think to yourself, "wait a second, I'm involved too". Or something along those lines. 

Anyway, use this next code cell to **create a new \<persName> element** with its string value (`element.string`) equal to your name and its "role" attribute (`element["role"]`) equal to your role in the project (for example, "analyst"). Return the element in the cell:

In [None]:
# return your new element here

Now, use the cell below to **append the new \<persName> element** to the Statement of Responsibility of your document (`soup.titleStmt.respStmt`):

In [None]:
# append the element here
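
Creating and attaching an element could be sketched as follows, on a made-up header (`toy_mei` is hypothetical, and "Your Name" and "analyst" are placeholder values). The sketch assumes `new_tag()` is called on the soup object, so the new element belongs to the same document it is appended to.

```python
from bs4 import BeautifulSoup

# Hypothetical header with one person, invented for illustration
toy_mei = """<mei><meiHead><fileDesc><titleStmt>
  <title>Toy Piece</title>
  <respStmt>
    <persName role="composer">Composer A</persName>
  </respStmt>
</titleStmt></fileDesc></meiHead></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")

# Build the new element, then set its text and an attribute
new_person = toy_soup.new_tag("persName")
new_person.string = "Your Name"   # placeholder value
new_person["role"] = "analyst"    # placeholder value

# append() adds the element as the last child of <respStmt>
toy_soup.titleStmt.respStmt.append(new_person)

all_people = [p.get_text(strip=True) for p in toy_soup.find_all("persName")]
print(all_people)
```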

Finally, let's check the results of our work. Use the cell below to find the first person (\<persName>) involved with the piece and **return its next siblings**:

In [None]:
# all next siblings here

Did you notice anything interesting about the sibling-element structure? What are some common similarities between the sibling elements? What are some differences? Is sibling search an efficient parsing method?

Use the Markdown cell below to share your thoughts:

`your thoughts here`

____________________
# Part 3: Data Analysis

In this part, we'll use some familiar Python-based statistics and data visualization tools to analyze your file. 

#### Step 3.1: Search

First, let's perform a simple element search on the document. While there is a multitude of ways to approach this, we typically recommend looking for all elements that match a given condition on a document level.

Use the cell below to **find all notes**, extract their pitch class (that is the generic name of the note, like "c"), and store them in a **Pandas DataFrame** (`pd.DataFrame(...)`).  

Note that here we are finding an **attribute** of an **element**!

In [None]:
# return the all notes DataFrame here
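
As a sketch of moving from elements to a table, the snippet below collects the `pname` attribute of every note in an invented fragment (`toy_mei` is hypothetical) and stores the values in a one-column DataFrame. The column name `pname` is a choice made here for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for illustration
toy_mei = """<mei><section>
  <measure n="1">
    <staff n="1"><layer><note pname="g"/><note pname="a"/></layer></staff>
    <staff n="2"><layer><note pname="d"/><rest/></layer></staff>
  </measure>
  <measure n="2">
    <staff n="1"><layer><note pname="g"/></layer></staff>
  </measure>
</section></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")

# note.get("pname") reads the attribute and returns None if it is missing
pitch_classes = [note.get("pname") for note in toy_soup.find_all("note")]
pitch_df = pd.DataFrame({"pname": pitch_classes})
print(pitch_df)
```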

Let's look at these visually! Use the cell below to **produce a pitch class histogram** for the piece.

Note that we could dig more deeply into the pitches by taking note of the **octave** as well as the **pitch class** of each note.  This would tell us the span of each voice, something of interest to performers as they decide what to sing.  Doing this in one step with Beautiful Soup, however, is tricky, since we would need to get both the **pname** and the **oct** for each note.  As it happens, **CRIM Intervals** can do this kind of job quite easily, as we will discover later!

In [None]:
# pitch histogram
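
A pitch class histogram can be sketched as a bar chart of counts. The example below assumes the pitch names have already been collected; the `pitches` values are made up, and ordering the bars from "c" to "b" is a presentation choice, not a requirement.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pitch names, standing in for the values from your own piece
pitches = pd.Series(["g", "a", "g", "d", "g", "d"], name="pname")

# Count each pitch class and order the bars c, d, e, f, g, a, b
counts = pitches.value_counts().reindex(list("cdefgab"), fill_value=0)

ax = counts.plot(kind="bar")
ax.set_xlabel("pitch class")
ax.set_ylabel("number of notes")
ax.set_title("Pitch class histogram (toy data)")
```

In a notebook the figure renders automatically; when running this as a script, add `plt.show()` at the end.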

Now prepared with the general distribution of pitch classes, let's check the final notes in each voice part (staff).  These notes are particularly important, since Renaissance musicians understood these as part of the 'modality' of the composition (something like the key).

How will you go about finding the **last note of the last measure in each voice**?



In [None]:
# list final pitch in each voice part
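
One possible strategy, sketched on an invented two-measure fragment (`toy_mei` is hypothetical): take the last measure, loop over its staves, and keep the last note found in each. This assumes every voice actually sounds a note in the final measure; a staff that ends on a rest is simply skipped here and would need extra handling.

```python
from bs4 import BeautifulSoup

# Hypothetical two-measure, two-voice fragment, invented for illustration
toy_mei = """<mei><section>
  <measure n="1">
    <staff n="1"><layer><note pname="g"/><note pname="a"/></layer></staff>
    <staff n="2"><layer><note pname="d"/></layer></staff>
  </measure>
  <measure n="2">
    <staff n="1"><layer><note pname="g"/></layer></staff>
    <staff n="2"><layer><note pname="c"/><note pname="g"/></layer></staff>
  </measure>
</section></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")
last_measure = toy_soup.find_all("measure")[-1]

final_pitches = {}
for staff in last_measure.find_all("staff"):
    staff_notes = staff.find_all("note")
    if staff_notes:  # skip a staff that holds only rests
        final_pitches[staff["n"]] = staff_notes[-1]["pname"]

print(final_pitches)
```

The dictionary maps each staff number (its `n` attribute) to the pitch name of that voice's final note.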

Now compare the results of the last two tasks:  are the last notes also ones that figure prominently in the full list of pitches for the piece?  

In [None]:
# your comments on pitch histogram compared with final pitch list

#### Step 3.2: Categorical Search

One useful way to analyze your document is by describing the elements within your elements of interest. As an illustration of this, it might be useful to look at how many notes are featured in each measure.

First, use the cell below to **find all measures** of the piece.  Here we are finding an **element**.

In [None]:
# find all measures here

Now, use the cell below to iterate over the measures and count how many notes each of them has (typically, using `len(measure.find_all("note"))`). Your result should be a **list** of numbers, each representing **how many notes there were in a particular measure**:

In [None]:
# list of measure notes here

Let's visualize this one, too! Pass your list into `plt.plot(...)` to **produce a graph** for your series in the cell below.  This will be a basic plot of the activity in a piece.

In [None]:
# your line chart here
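
The counting and plotting described in this step could be sketched as below, on a made-up three-measure fragment (`toy_mei` is hypothetical). Plotting against measure numbers starting at 1 is a choice made here so that the x-axis matches how musicians count measures.

```python
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

# Hypothetical three-measure fragment, invented for illustration
toy_mei = """<mei><section>
  <measure n="1"><staff n="1"><layer>
    <note pname="g"/><note pname="a"/><note pname="b"/>
  </layer></staff></measure>
  <measure n="2"><staff n="1"><layer>
    <note pname="c"/><rest/>
  </layer></staff></measure>
  <measure n="3"><staff n="1"><layer>
    <note pname="d"/><note pname="g"/>
  </layer></staff></measure>
</section></mei>"""

toy_soup = BeautifulSoup(toy_mei, "xml")
measures = toy_soup.find_all("measure")

# One count per measure; rests are not counted because they are <rest> tags
notes_per_measure = [len(measure.find_all("note")) for measure in measures]
print(notes_per_measure)

measure_numbers = range(1, len(notes_per_measure) + 1)
plt.plot(measure_numbers, notes_per_measure)
plt.xlabel("measure")
plt.ylabel("number of notes")
```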

____________________
# Part 4: Sandbox

In this final part of the homework, you are asked to combine some of the tools you have employed in Parts 2 and 3 – as well as the lab – in order to **explore** an aspect of your piece you are particularly curious about and **reflect** on it.

Some of the possible mini-projects include:
* pitch distribution for voices
* true pitch distribution (including octaves, sharps, flats etc)
* find the most mathematically and musically interesting measure: why? 
* compare sections? how does the piece change?
* compare model and mass
* whatever else interests you!

Good luck!

In [None]:
# your exploration here

____________________
# Reflections (optional)

Please use this space to reflect on your experience, ask any additional questions, or suggest changes to this or other assignments.