# Scrape Data from Matr.IO

The Matr.IO webpage contains the datasets used by [Severson et al.](https://data.matr.io/1/projects/5c48dd2bc625d700019f3204) and [Attia et al.](https://data.matr.io/1/projects/5d80e633f405260001c0b60a).
We will download every Arbin output file (CSV format) from both projects into a raw data directory.

In [1]:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium import webdriver
from pathlib import Path
from tqdm import tqdm
import requests
import selenium
import re

Configuration

In [2]:
base_pages = [
    'https://data.matr.io/1/projects/5c48dd2bc625d700019f3204',  # Severson
    'https://data.matr.io/1/projects/5d80e633f405260001c0b60a',  # Attia
]
out_dir = Path('raw/')

In [3]:
out_dir.mkdir(parents=True, exist_ok=True)

## Initialize Web Driver
We are going to use Selenium to drive a Chrome web browser.

In [4]:
driver = webdriver.Chrome()

## Make Functions
We need a function to iterate from a page into each of its sub pages (e.g., from a project to a batch of experiments) and one to download the Arbin file from within the experiment page.

In [5]:
def iterate_into_sub_pages(driver: webdriver.Chrome, class_name: str = 'MuiListItem-container'):
    """Click into each sub page of the current page, one at a time
    
    Args:
        driver: Web driver already navigated to the desired page
        class_name: Class of the elements to be clicked on
    Yields:
        URL of the sub page; the driver stays on that page until the next iteration
    """
    # Wait until the elements appear, then count them
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, class_name))
    )
    num_elems = len(driver.find_elements(By.CLASS_NAME, class_name))

    # Loop over each element
    for elem_id in range(num_elems):
        # Look the element up again after every navigation; assumes the order never changes
        elem = driver.find_elements(By.CLASS_NAME, class_name)[elem_id]
        elem.click()
        yield driver.current_url
        driver.back()  # Runs when the caller requests the next sub page
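
The generator above shares one driver between nested loops, so the order of clicks and `back()` calls matters. The sketch below reproduces that control flow without a browser to show the visiting order. `FakeDriver`, `SITE`, and `iterate_fake_sub_pages` are hypothetical stand-ins invented for illustration; they are not part of Selenium or of the Matr.IO site.

```python
from typing import Dict, Iterator, List

# Hypothetical site map: each page lists the sub pages it links to
SITE: Dict[str, List[str]] = {
    'project': ['batch-a', 'batch-b'],
    'batch-a': ['cell-1', 'cell-2'],
    'batch-b': ['cell-3'],
}


class FakeDriver:
    """Toy stand-in for a browser that only tracks a history of page names"""

    def __init__(self, start: str):
        self.history = [start]

    @property
    def current_url(self) -> str:
        return self.history[-1]

    def links(self) -> List[str]:
        return SITE.get(self.current_url, [])

    def click(self, index: int) -> None:
        self.history.append(self.links()[index])

    def back(self) -> None:
        self.history.pop()


def iterate_fake_sub_pages(driver: FakeDriver) -> Iterator[str]:
    """Same shape as the Selenium version: count, click by index, yield, go back"""
    num_links = len(driver.links())
    for index in range(num_links):
        driver.click(index)  # Look the link up by position on every pass
        yield driver.current_url
        driver.back()  # Runs only when the caller asks for the next item


driver = FakeDriver('project')
visited = []
for batch in iterate_fake_sub_pages(driver):
    for cell in iterate_fake_sub_pages(driver):
        visited.append((batch, cell))
print(visited)
```

Because `back()` runs only when the next item is requested, the inner loop always starts on the batch page and the driver ends up back on the project page once both loops finish.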

In [6]:
def download_arbin_file(driver: webdriver.Chrome) -> Path:
    """Download the Arbin file from the data page
    
    Args:
        driver: Driver already navigated to the target page
    Returns:
        Path to the downloaded file
    """
    
    # Find the URL of the dataset (last button) of the page
    class_name = 'MuiButton-sizeSmall'
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, class_name))
    )  # Wait until the buttons appear
    buttons = driver.find_elements(By.CLASS_NAME, class_name)
    assert len(buttons) == 3, f'Expected 3 buttons, found {len(buttons)}'
    last_button = buttons[-1]
    
    download_url = last_button.get_attribute('href')
    
    # Download to a target folder
    res = requests.get(download_url, stream=True)
    assert res.status_code == 200, f'Download failed with status {res.status_code}'
    # Assumes a header of the form `attachment; filename="<name>"`:
    # skip the 22-character prefix and drop the closing quote
    filename = res.headers['Content-Disposition'][22:-1]
    out_path = out_dir / filename
    with out_path.open('wb') as fp:
        for content in res.iter_content(chunk_size=1024 * 32):
            fp.write(content)
    return out_path
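
`download_arbin_file` takes the file name from the `Content-Disposition` response header by character position, which only works if the header has exactly the expected prefix. A minimal sketch of a less position-dependent alternative is shown below, using the standard library's `email.message.Message` to parse the header. The helper name `filename_from_header` and the example header values are made up for illustration; the actual headers returned by Matr.IO are not shown in this notebook.

```python
from email.message import Message
from pathlib import Path


def filename_from_header(content_disposition: str) -> str:
    """Extract the file name from a Content-Disposition header value

    Hypothetical helper, not part of the original notebook.
    """
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    filename = msg.get_filename()  # Handles quoted and unquoted values
    if filename is None:
        raise ValueError(f'No filename in header: {content_disposition!r}')
    # Keep only the final path component so the file stays inside the output directory
    return Path(filename).name


# Made-up header values for illustration
quoted = filename_from_header('attachment; filename="example_cell.csv"')
unquoted = filename_from_header('attachment; filename=example_cell.csv')
print(quoted, unquoted)
```

For a quoted header of this form, the result matches what the `[22:-1]` slice produces, while the unquoted variant would lose characters under the slice.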

## Download everything
Download all data from each of the project pages

In [7]:
pbar = tqdm()
for project_page in base_pages:
    driver.get(project_page)
    for batch_url in iterate_into_sub_pages(driver):
        for item_url in iterate_into_sub_pages(driver):
            download_arbin_file(driver)
            pbar.update(1)

376it [15:51,  4.42s/it]
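
A run of this length can be interrupted partway through a file, leaving a truncated CSV in the output directory. One possible safeguard, sketched below, is to stream into a temporary `.part` file and rename it only once every chunk has been written. `write_atomically` is a hypothetical helper that is not part of the original notebook, and the byte chunks stand in for `res.iter_content(...)`.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], target: Path) -> Path:
    """Write chunks to `<target>.part`, then rename to `target` when complete

    Hypothetical helper for illustration.
    """
    partial = target.with_name(target.name + '.part')
    with partial.open('wb') as fp:
        for chunk in chunks:
            fp.write(chunk)
    partial.replace(target)  # Replaces any existing file at `target`
    return target


# Demonstration with made-up CSV content in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out = write_atomically([b'a,b\n', b'1,2\n'], Path(tmp) / 'example.csv')
    demo_bytes = out.read_bytes()
    leftovers = sorted(p.name for p in Path(tmp).glob('*.part'))
print(demo_bytes, leftovers)
```

With this approach, any file without the `.part` suffix was written completely, so leftover `.part` files identify downloads that need to be repeated.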

Once we're done, close out

In [8]:
driver.quit()  # Close the browser window and end the WebDriver session