## Advanced Web Scraping for String Quartets

In this notebook, MIDI files from Kunst Der Fuge are scraped with the Python bindings for Selenium. Kunst Der Fuge offers MIDI files, but downloading more than 5 files in a given day requires a paid account. Since authentication is required, a full browser automation tool like Selenium is needed.
To use the Python selenium package you will need to install the package and download a browser driver. I chose geckodriver, the driver for the Firefox web browser. Selenium is a browser automation tool (it can optionally drive the browser in headless mode) that also has bindings for other languages such as Java and C#.
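
Before launching the browser it can help to confirm that the driver is actually reachable. The sketch below is one way to check, assuming the driver executable is named `geckodriver` and is expected on the system `PATH`; `find_driver` is a helper name made up for this example.

```python
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("geckodriver was not found on PATH; download it and add its folder to PATH")
    else:
        print(f"geckodriver found at {path}")
```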

In [1]:
# Import the required packages
import time  # used to pause between downloads

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

Selenium lets us adjust the browser preferences. Below we set the download directory, disable the save-to-disk prompt for MIDI files, and launch the Firefox browser.

In [36]:
## Set up the selenium browser

# Create an options object to hold the Firefox preferences
# (Selenium 4 style; older releases passed a FirefoxProfile via firefox_profile=)
firefox_options = webdriver.FirefoxOptions()
# Never prompt before saving MIDI files, otherwise the files won't download
firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "audio/midi")
# Use a custom download directory (folderList = 2) and set its path
firefox_options.set_preference("browser.download.folderList", 2)
firefox_options.set_preference("browser.download.dir", "<your download path>")
# Don't show the download manager window when a download starts
firefox_options.set_preference("browser.download.manager.showWhenStarting", False)
# Create a Firefox browser object
driver = webdriver.Firefox(options=firefox_options)
# Navigate to the login screen
driver.get('http://www.kunstderfuge.com/-/db/log-in.asp')
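
The download preferences set above can also be kept together in one dictionary, which makes them easier to review or reuse. This is only a sketch: `download_prefs` and `apply_prefs` are hypothetical helper names, and `target` stands for whichever Firefox profile or options object exposes `set_preference`.

```python
from typing import Any, Dict


def download_prefs(download_dir: str, mime_types: str = "audio/midi") -> Dict[str, Any]:
    """Collect the Firefox download preferences used above in one dictionary."""
    return {
        # Save these MIME types without showing a prompt
        "browser.helperApps.neverAsk.saveToDisk": mime_types,
        # 2 tells Firefox to use the custom directory given in browser.download.dir
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
        # Keep the download manager window from opening
        "browser.download.manager.showWhenStarting": False,
    }


def apply_prefs(target: Any, prefs: Dict[str, Any]) -> None:
    """Call target.set_preference(name, value) for every preference."""
    for name, value in prefs.items():
        target.set_preference(name, value)
```

Calling `apply_prefs` with the object created in the cell above and `download_prefs("<your download path>")` would set the same four preferences before the browser is launched.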

Next we will need to authenticate into the site to get access to the desired files.

In [38]:
## Authenticate into the Kunst Der Fuge site

# Find the email element
username = driver.find_element(By.ID, "Email")
# Send your email value to the email field
username.send_keys("<your email>")
# Find the password element and send your password to it
password = driver.find_element(By.ID, "Password")
password.send_keys("<your password>")
# Click the submit button
driver.find_element(By.NAME, "Submit").click()
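
If you would rather not type credentials into the notebook itself, one option is to read them from environment variables. A minimal sketch follows; `KDF_EMAIL` and `KDF_PASSWORD` are variable names invented for this example, not something the site or Selenium defines.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the login email and password from environment variables.

    KDF_EMAIL and KDF_PASSWORD are hypothetical variable names chosen for
    this sketch; they are not defined by the site or by Selenium.
    """
    try:
        return env["KDF_EMAIL"], env["KDF_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None
```

The two returned values can then be passed to `send_keys` in place of the literal placeholders.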

Next we need to navigate to the section of the website that contains the MIDI files. Note that if I weren't concerned with saving money, I would have paid a bit more and downloaded the whole database.
Since I know how to scrape these files, I will do that instead. This way I can pull only the files I am interested in.

In [39]:
# Find all elements whose text is exactly 'MIDI'
MIDI = driver.find_elements(By.XPATH, "//*[text()='MIDI']")
# Loop through the matches; link ends up holding the href of the last one
for ii in MIDI:
    link = ii.get_attribute('href')
# Next navigate to the midi section of the site
driver.get(link)
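
The find-then-loop pattern above leaves `link` undefined (or stale from an earlier cell) when nothing matches. A small helper can make that failure explicit. This is a sketch, not part of the original notebook: `href_for_text` is a hypothetical name, and unlike the loop above it skips matches that have no `href`.

```python
from typing import Any


def href_for_text(driver: Any, text: str) -> str:
    """Return the href of the last linked element whose text equals `text`.

    Mirrors the loop above, but fails loudly when nothing matches.
    "xpath" is the string value of Selenium's By.XPATH.
    """
    matches = driver.find_elements("xpath", f"//*[text()='{text}']")
    hrefs = [m.get_attribute("href") for m in matches]
    hrefs = [h for h in hrefs if h]  # drop elements that are not links
    if not hrefs:
        raise LookupError(f"No link with text {text!r} on the current page")
    return hrefs[-1]
```

With this helper, navigating to the MIDI section would become `driver.get(href_for_text(driver, "MIDI"))`.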

The first composer that we will acquire MIDI tracks for is __Beethoven__. Since he is a popular composer, his links show on the main MIDI page. We just need to navigate to the chamber music section of the site.

In [10]:
# Find the elements whose text is exactly 'Chamber music'
BeethovenChamber = driver.find_elements(By.XPATH, "//*[text()='Chamber music']")
# Loop through the matches; link ends up holding the href of the last one
for ii in BeethovenChamber:
    link = ii.get_attribute('href')
# Navigate to the Beethoven chamber music page
driver.get(link)
# Next find all of the download links whose href contains 'quartet'
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'quartet')]")
# Finally loop through the elems object and click on each of the download links.
for elem in elems:
    elem.click()
    time.sleep(3)
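
A fixed three-second pause works for small files, but it does not confirm that a download finished. The sketch below polls the download directory instead, assuming Firefox keeps an in-progress download in a temporary file ending in `.part`. `wait_for_downloads` is a hypothetical helper; a short pause after the click is still useful so the download has time to start before the first check.

```python
import os
import time


def wait_for_downloads(download_dir: str, timeout: float = 60.0, poll: float = 0.5) -> bool:
    """Wait until no '.part' files remain in download_dir.

    Returns True once the directory holds no partial downloads, or False if
    the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        partial = [f for f in os.listdir(download_dir) if f.endswith(".part")]
        if not partial:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```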


Next we will download the desired midi tracks from __Mozart__.

In [10]:
# Find the elements whose text is exactly 'Mozart'
Mozart = driver.find_elements(By.XPATH, "//*[text()='Mozart']")
# Loop through the matches; link ends up holding the href of the last one
for ii in Mozart:
    link = ii.get_attribute('href')
# Navigate to the Mozart midi page
driver.get(link)

It took a bit of manual inspection to find which Mozart files I could use. The modeling technique I am going to use requires files with 4 voices, so we only want the quartet files. Unlike Beethoven, the word quartet does not appear in the Mozart download links.

In [17]:
# Find the files with a Harfesoft copyright, these are for string quartets
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'harfesoft')]")
# Loop through the links and click each one to download
for elem in elems:
    elem.click()
    time.sleep(3)
    
# Find the files with a Mutopia copyright, these are for string quartets
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'mutopia')]")
# Loop through the links and click each one to download
for elem in elems:
    elem.click()
    time.sleep(3)
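
The two Mozart queries could also be combined into a single XPath with `or`. The helper below is a sketch of how such an expression might be built. `href_xpath` is a name made up for this example, and it assumes the keywords contain no single quotes.

```python
from typing import Iterable


def href_xpath(keywords: Iterable[str]) -> str:
    """Build an XPath matching <a> tags whose href contains any keyword."""
    conditions = [f"contains(@href, '{k}')" for k in keywords]
    if not conditions:
        raise ValueError("at least one keyword is required")
    return "//a[" + " or ".join(conditions) + "]"


print(href_xpath(["harfesoft", "mutopia"]))
# prints: //a[contains(@href, 'harfesoft') or contains(@href, 'mutopia')]
```

Note that a combined query returns the links in page order rather than all Harfesoft links followed by all Mutopia links.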

The format is generally the same for the rest of the composers. Below we gather tracks from Schubert, Shostakovich, Brahms, Dvořák, and Haydn.

In [13]:
## Schubert
# Find the elements whose text is exactly 'Schubert'
Schubert = driver.find_elements(By.XPATH, "//*[text()='Schubert']")
# Loop through the matches; link ends up holding the href of the last one
for ii in Schubert:
    link = ii.get_attribute('href')
# Navigate to the Schubert midi page
driver.get(link)
# Here we only need the links on the Schubert page whose href contains 'strings'
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'strings')]")
# Loop through the links and click each one to download
for elem in elems:
    elem.click()
    time.sleep(3)
    
## Shostakovich
# Find the elements whose text is exactly 'Shostakovich'
Shostakovich = driver.find_elements(By.XPATH, "//*[text()='Shostakovich']")
# get the link for the Shostakovich page
for ii in Shostakovich:
    link = ii.get_attribute('href')
# navigate to the Shostakovich page
driver.get(link)
# Find the string quartet download links
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'string')]")
# Click on the links to download the Shostakovich files
for elem in elems:
    elem.click()
    time.sleep(3)

## Brahms
# Find the elements whose text is exactly 'Brahms'
Brahms = driver.find_elements(By.XPATH, "//*[text()='Brahms']")
# get the link for the Brahms page
for ii in Brahms:
    link = ii.get_attribute('href')
# navigate to the Brahms page
driver.get(link)
# Find the download links for quartets
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'quartet')]")
# Click and download the midi files
for elem in elems:
    elem.click()
    time.sleep(3)

## Dvorak
# Find the elements whose text is exactly 'Dvořák'
Dvorak = driver.find_elements(By.XPATH, "//*[text()='Dvořák']")
# get the link for the Dvorak page
for ii in Dvorak:
    link = ii.get_attribute('href')
# navigate to the Dvorak page
driver.get(link)
# find the Dvorak quartet download links
elems = driver.find_elements(By.XPATH, "//a[contains(@href, 'quartet')]")
# Download the Dvorak string quartet files
for elem in elems:
    elem.click()
    time.sleep(3)

## Haydn
# Find the elements whose text is exactly 'Haydn'
Haydn = driver.find_elements(By.XPATH, "//*[text()='Haydn']")
# get the link for the Haydn page
for ii in Haydn:
    link = ii.get_attribute('href')
# navigate to the Haydn page
driver.get(link)
# find the Haydn quartet download links
quartet = driver.find_elements(By.XPATH, "//a[contains(@href, 'quartet')]")
# Download the Haydn string quartet files
for elem in quartet:
    elem.click()
    time.sleep(3)
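
Since the five composers above follow the same steps, the cell could be condensed into a table and one function. The version below is a sketch under the same assumptions as the cells above: each composer's link text is present on the page the browser is currently showing, and the download links contain the listed keyword. `COMPOSERS` and `download_composer` are names introduced here for illustration.

```python
import time
from typing import Any, Dict

# Link text on the MIDI page -> substring expected in the download hrefs,
# taken from the cells above
COMPOSERS: Dict[str, str] = {
    "Schubert": "strings",
    "Shostakovich": "string",
    "Brahms": "quartet",
    "Dvořák": "quartet",
    "Haydn": "quartet",
}


def download_composer(driver: Any, name: str, keyword: str, pause: float = 3.0) -> int:
    """Open a composer's page and click every download link containing keyword.

    Returns the number of links clicked. "xpath" is the string value of
    Selenium's By.XPATH.
    """
    link = None
    for element in driver.find_elements("xpath", f"//*[text()='{name}']"):
        link = element.get_attribute("href")  # keep the last match, as above
    if not link:
        raise LookupError(f"No link found for {name!r}")
    driver.get(link)
    downloads = driver.find_elements("xpath", f"//a[contains(@href, '{keyword}')]")
    for element in downloads:
        element.click()
        time.sleep(pause)  # give each download time to start
    return len(downloads)
```

Looping over `COMPOSERS.items()` and calling `download_composer(driver, name, keyword)` for each pair would then repeat the steps of the cell above.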