<a href="https://colab.research.google.com/github/Juanvr/Dathoven/blob/main/notebooks/1%20-%20Web%20Scraping%20MIDI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping MIDI

In [1]:
import urllib.request  # urlopen lives in the urllib.request submodule
import shutil
import re
from pathlib import Path
from bs4 import BeautifulSoup
import requests
import glob, time

We define a helper that downloads a page and returns every element with a given tag whose attributes match:

In [2]:
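def get_elements(url, tag, attrs):
    """Fetch url and return all <tag> elements whose attributes match attrs."""
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # find_all is the current name of the older findAll alias
    return soup.find_all(tag, attrs)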
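
To see what this kind of query filters on without touching the network, here is a minimal sketch that runs the same BeautifulSoup lookup against an inline HTML string. The markup and the helper name `find_in_html` are made up for illustration; only the `tag`/`attrs` filtering mirrors the function above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a paginated table of links.
SAMPLE_HTML = """
<a class="page-link" href="/list?page=2">2</a>
<a href="/en/midi/details/1/some-song">Some Song</a>
<a href="/about">About</a>
"""

def find_in_html(html, tag, attrs):
    """Return every <tag> element in html whose attributes match attrs."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all(tag, attrs)

# Exact match on an attribute value.
page_links = find_in_html(SAMPLE_HTML, 'a', {'class': 'page-link'})
# Regular-expression match on an attribute value.
detail_links = find_in_html(SAMPLE_HTML, 'a', {'href': re.compile('^/en/midi/details')})

print([a['href'] for a in page_links])
print([a.get_text() for a in detail_links])
```

A plain string in `attrs` has to equal the attribute value, while a compiled regular expression only has to match it, which is why the `href` filter can select on a prefix.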

## Carlo's MIDI

The first MIDI website to try is https://www.cprato.com/en/midi/all

In [3]:
target_page = 'https://www.cprato.com/en/midi/all'

The website lists its songs in a paginated table that shows up to 50 song links per view, spread over 8 views. To collect every song we have to walk through all 8 views and follow each song link:

<img src="https://github.com/Juanvr/Dathoven/raw/main/notebooks_images/Carlos_MIDI_Table.png">

We get the pagination links for the numbered views:

In [4]:
get_elements(target_page, tag='a', attrs={'class': 'page-link'})[:3]

[<a class="page-link" href="https://www.cprato.com/en/midi/all?page=2">2</a>,
 <a class="page-link" href="https://www.cprato.com/en/midi/all?page=3">3</a>,
 <a class="page-link" href="https://www.cprato.com/en/midi/all?page=4">4</a>]

We get the link to each song:

In [5]:
get_elements(target_page, tag='a', attrs={'href': re.compile("^/en/midi/details")})[:3]

[<a href="/en/midi/details/267/3lau-feat-bright-lights-how-you-love-me" style="font-weight:700;">How You Love Me </a>,
 <a href="/en/midi/details/137/above-beyond-tri-state" style="font-weight:700;">Tri-State (Original Mix)</a>,
 <a href="/en/midi/details/247/above-beyond-were-all-we-need" style="font-weight:700;">We're All We Need (Original Mix)</a>]

Each song link leads to a detail page like this one:

<img src="https://github.com/Juanvr/Dathoven/raw/main/notebooks_images/Carlos_MIDI_Song_Detail.png">

On this page we need the anchor element that triggers the download:

In [6]:
detail_page = 'https://www.cprato.com/en/midi/details/267/3lau-feat-bright-lights-how-you-love-me'

In [7]:
target_free_download = get_elements(detail_page, tag='a', attrs={'href': re.compile("^/en/midi/download")})[0]
target_free_download

<a class="btn btn-success px-3 py-3" href="/en/midi/download/267/3lau-feat-bright-lights-how-you-love-me/My40OTUyNzEwOTgwOTc5RSsxNg==" style="display:block;"><i class="fas fa-download pr-1"></i>
        Free Download
        </a>

Once we have the download anchor, we read its `href` value:

In [8]:
target_free_download['href']

'/en/midi/download/267/3lau-feat-bright-lights-how-you-love-me/My40OTUyNzEwOTgwOTc5RSsxNg=='

The `href` is relative to the site root, so we prepend the root to build the full download URL:

In [9]:
first_url_part = 'https://www.cprato.com'

In [10]:
download_url = first_url_part + target_free_download['href']
download_url

'https://www.cprato.com/en/midi/download/267/3lau-feat-bright-lights-how-you-love-me/My40OTUyNzEwOTgwOTc5RSsxNg=='
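
String concatenation works here because every scraped `href` is site-relative. As a hedged alternative, the sketch below uses the standard library's `urljoin`, which also copes with hrefs that are already absolute (such as the pagination links shown earlier). The variable names are illustrative only.

```python
from urllib.parse import urljoin

base_url = 'https://www.cprato.com'
# Site-relative href, as returned by the scraped download anchor.
relative_href = '/en/midi/download/267/3lau-feat-bright-lights-how-you-love-me/My40OTUyNzEwOTgwOTc5RSsxNg=='

joined = urljoin(base_url, relative_href)
# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, 'https://www.cprato.com/en/midi/all?page=2')

print(joined)
print(already_absolute)
```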

## Selenium

We configure the Selenium driver so that it does not crash in Colab:

In [27]:
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')  
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:6 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:9 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:12 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Hit:14 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:15 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic

In [14]:
from selenium.webdriver import ActionChains

In [20]:
# chromedriver was copied to /usr/bin above, so it is found on the PATH;
# `options` replaces the deprecated `chrome_options` keyword
wd = webdriver.Chrome(options=chrome_options)

In [21]:
wd.get(download_url)

## Automating the process

In [22]:
target_page = 'https://www.cprato.com/en/midi/all'

first_url_part = 'https://www.cprato.com'

First we get a list of all the song detail pages:

In [23]:
# Pagination links for the other views of the table
table_view_links = get_elements(target_page, tag='a', attrs={'class': 'page-link'})

# URLs of all table views: the first view plus the paginated ones
table_view_urls = [target_page] + [link['href'] for link in table_view_links]
table_view_urls = table_view_urls[:8]  # the table has 8 views

# Iterate through the table views and collect the detail-page URL of every song
songs_urls = []
for table_view_url in table_view_urls:
    songs_links = get_elements(table_view_url, tag='a', attrs={'href': re.compile("^/en/midi/details")})
    songs_urls.extend(first_url_part + link['href'] for link in songs_links)

In [24]:
len(songs_urls)

392
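
The `[:8]` slice relies on knowing the number of views in advance. As a hedged alternative, the sketch below deduplicates a list of URLs while keeping their order, which would also cope with pagination anchors that repeat. Whether they repeat on this site is an assumption; the sample list is invented and the helper name `unique_in_order` is hypothetical.

```python
def unique_in_order(urls):
    """Drop repeated URLs while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))

# Hypothetical pagination hrefs with a repeated entry, e.g. from a "next" link.
sample_urls = [
    'https://www.cprato.com/en/midi/all',
    'https://www.cprato.com/en/midi/all?page=2',
    'https://www.cprato.com/en/midi/all?page=3',
    'https://www.cprato.com/en/midi/all?page=2',
]
unique_urls = unique_in_order(sample_urls)
print(unique_urls)
```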

We use Selenium to click the download button on each detail page and wait for the file to arrive:

In [37]:
def download_file(wd, free_download_button, download_folder_path, max_seconds_download):
    """Click the download button and wait until a new .mid file shows up.

    Returns True if a new file appears in download_folder_path within
    max_seconds_download seconds, False otherwise.
    """
    pattern = download_folder_path + "/*.mid"
    files_before = len(glob.glob(pattern))
    # Alternative click: ActionChains(wd).click(free_download_button).perform()
    free_download_button.click()
    # Poll the folder once per second until the file count grows or we time out
    for _ in range(max_seconds_download):
        if len(glob.glob(pattern)) > files_before:
            return True
        time.sleep(1)
    return False
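
The polling idea can be tried without a browser. The following is a minimal sketch, not the notebook's method: it compares sets of paths instead of file counts, so it can also report which file arrived. The helper name `wait_for_new_file`, its parameters, and the timer that stands in for the browser click are all assumptions made for the demonstration.

```python
import tempfile
import threading
import time
from pathlib import Path

def wait_for_new_file(folder, pattern, trigger, timeout_seconds, poll_seconds=0.1):
    """Call trigger(), then return the first new path matching pattern, or None on timeout."""
    folder = Path(folder)
    before = set(folder.glob(pattern))  # snapshot taken before the action starts
    trigger()
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        new_files = set(folder.glob(pattern)) - before
        if new_files:
            return sorted(new_files)[0]
        time.sleep(poll_seconds)
    return None

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'song.mid'
    # Stand-in for the browser click: the file appears 0.2 s later.
    fake_click = lambda: threading.Timer(0.2, target.touch).start()
    found = wait_for_new_file(tmp, '*.mid', fake_click, timeout_seconds=5)
    found_name = found.name if found else None
    # Nothing is triggered this time, so the wait runs into the timeout.
    missing = wait_for_new_file(tmp, '*.mid', lambda: None, timeout_seconds=0.3)

print(found_name, missing)
```

Taking the snapshot before calling `trigger` matters: a file that lands between the click and the first directory listing would otherwise be counted as already present.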

In [39]:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

download_folder_path = r"."
max_seconds_download = 30
max_retries = 3

# We initialize the Chrome web driver
wd = webdriver.Chrome(options=chrome_options)
wd.maximize_window()

for song_url in songs_urls[4:6]:
    success = False
    for retries in range(max_retries):
        print(f"Retry number: {retries}")
        print("Going to url " + song_url)
        wd.get(song_url)
        try:
            free_download_button = wd.find_element(By.LINK_TEXT, 'FREE DOWNLOAD')
        except NoSuchElementException:
            print("Error finding download button")
            break
        success = download_file(wd, free_download_button, download_folder_path, max_seconds_download)
        if success:
            print("Successful download")
            break
        # The download timed out: restart the browser before the next attempt
        print("Closing webdriver")
        wd.quit()
        wd = webdriver.Chrome(options=chrome_options)

    if not success:
        # Report the song that failed, not the earlier example download_url
        print("Could not download file: " + song_url)

# quit() closes every window and stops the chromedriver process
wd.quit()

  


Retry number: 0
Going to url https://www.cprato.com/en/midi/details/79/above-beyond-feat-alex-vargas-blue-sky-action
Successful download
Retry number: 0
Going to url https://www.cprato.com/en/midi/details/124/adam-f-circles
Successful download
