# Downloading Dataset from The Metropolitan Museum of Art
The museum shares its art collection dataset under CC0, which we can use to train our CC0 ResNet backbone for other projects!
Their website, https://metmuseum.github.io/, lists various endpoints, but we'll mostly use one: it returns the metadata about an object, from which we then download the object's primary image.

Some art entries share the same photo, so we download only one copy of it and set the 'path' attribute of the other objects to point to that copy.

Running this notebook produces a metadata/met.json file containing the objects' metadata, with the path to each downloaded image stored in 'path'. Note that not every art object has an associated image.
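
As a minimal sketch of how the resulting file might be consumed, the helper below keeps only the objects that have a downloaded image. The name `objects_with_images` is hypothetical; the sketch assumes the file is a JSON object keyed by object ID, as written by this notebook.

```python
import json
import os


def objects_with_images(metadata):
    """Return {object_id: path} for entries that have a non-empty 'path'."""
    return {
        object_id: obj['path']
        for object_id, obj in metadata.items()
        if obj.get('path')
    }


if __name__ == '__main__':
    metadata_path = os.path.join('metadata', 'met.json')
    # Only read the file if the notebook has already produced it
    if os.path.exists(metadata_path):
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
        print(f'{len(objects_with_images(metadata))} objects have an image')
```

Because several objects may point to the same file, wrapping the returned values in `set(...)` gives the number of distinct images.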


#### Downloading the whole dataset may result in about 500 GB of images. I've decided to stop downloading at about 50000 images, which is approximately 45 GB of raw data from this museum, after about 35 hours of downloading.

#### To reduce the bias that may come from image topics being correlated with their IDs, I've randomized the order in which IDs are downloaded.
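
A small sketch of the idea: shuffling the ID list before downloading means that stopping early still leaves a roughly uniform sample over IDs. The helper name `shuffled_ids` and the `seed` parameter are assumptions added here for repeatability; the notebook itself calls `random.shuffle` without a seed.

```python
import random


def shuffled_ids(ids, seed=None):
    """Return a shuffled copy of ids; a fixed seed makes the order repeatable."""
    rng = random.Random(seed)
    result = list(ids)  # copy, so the caller's list is left untouched
    rng.shuffle(result)
    return result


ids = list(range(1, 11))
order = shuffled_ids(ids, seed=0)
print(order[:3])  # the first IDs that would be downloaded
```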

### Constants and imports


In [1]:
import requests
import os
from tqdm.notebook import tqdm
import json
import random
from atomicwrites import atomic_write

BASE_URL = 'https://collectionapi.metmuseum.org'
METADATA_DIRECTORY = 'metadata'
RAW_IMAGE_DIRECTORY = os.path.join('dataset', 'raw')
DATASET_NAME = 'met'
DATASET_IMAGES_LIMIT = 90000
RESCUING_SAVE_PERIOD = 10
DOWNLOADING_SAVE_PERIOD = 5

### Preparing directories

In [2]:
os.makedirs(RAW_IMAGE_DIRECTORY, exist_ok=True)
os.makedirs(METADATA_DIRECTORY, exist_ok=True)

### Downloading ObjectIDs

In [3]:
available_ids = requests.get(f'{BASE_URL}/public/collection/v1/objects').json()['objectIDs']

In [4]:
print(f'Object IDs count: {len(available_ids)}')

Object IDs count: 486903


### Checking existing metadata

In [5]:
existing_metadata = {}
BASIC_METADATA_PATH = os.path.join(METADATA_DIRECTORY, f'{DATASET_NAME}.json')

if os.path.exists(BASIC_METADATA_PATH):
    with open(BASIC_METADATA_PATH, 'r') as f:
        existing_metadata = json.load(f)

def save_metadata():
    with atomic_write(BASIC_METADATA_PATH, overwrite=True) as f:
        json.dump(existing_metadata, f)

#### Rescuing downloaded images that lost their metadata
Sometimes, running the script several separate times and interrupting it with a SIGINT may truncate the metadata file while the images still exist. To combat this, we first rebuild the metadata for all existing image files.

In [6]:
def image_to_id(image):
    return image.replace(f'{DATASET_NAME}_', '').replace('.jpg', '')  # 'met_123.jpg' -> '123'

to_rescue = [
    image for image in os.listdir(RAW_IMAGE_DIRECTORY)
    if image.startswith(DATASET_NAME) and image_to_id(image) not in existing_metadata
]

for i, image in enumerate(tqdm(to_rescue)):
    object_id = image_to_id(image)
    obj = requests.get(f'{BASE_URL}/public/collection/v1/objects/{object_id}').json()
    obj['path'] = os.path.join(RAW_IMAGE_DIRECTORY, image)
    existing_metadata[object_id] = obj
    if i % RESCUING_SAVE_PERIOD == 0:
        save_metadata()
save_metadata()  # final save so the last few rescued entries are not lost

  0%|          | 0/49317 [00:00<?, ?it/s]

In [7]:
image_path_cache = {}
for id in existing_metadata:
    if 'primaryImage' in existing_metadata[id] and existing_metadata[id]['primaryImage'] != '':
        image_path_cache[existing_metadata[id]['primaryImage']] = existing_metadata[id]['path']

### Filtering what we have to download

In [8]:
to_download = []

for id in available_ids:
    if id not in existing_metadata and str(id) not in existing_metadata:
        to_download.append(id)

random.shuffle(to_download)

### Downloading it and updating the existing metadata file

In [9]:
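def download_image(url, path):
    # Skip files that were already downloaded in a previous run
    if os.path.exists(path):
        return
    image_data = requests.get(url, stream=True, timeout=60)
    # Raise on HTTP errors so that a failed download is not recorded as a valid path
    image_data.raise_for_status()
    with atomic_write(path, overwrite=True, mode='wb') as f:
        for chunk in image_data.iter_content(2048):
            f.write(chunk)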

In [10]:
with tqdm(total=DATASET_IMAGES_LIMIT) as pbar:
    for i, id in enumerate(to_download):
        if len(image_path_cache) >= DATASET_IMAGES_LIMIT:
            break
        try:
            obj = requests.get(f'{BASE_URL}/public/collection/v1/objects/{id}').json()
            if 'primaryImage' in obj and obj['primaryImage'] != '':
                image = obj['primaryImage']
                if image in image_path_cache:
                    # Another object already uses this photo: reuse its file
                    obj['path'] = image_path_cache[image]
                else:
                    path = os.path.join(RAW_IMAGE_DIRECTORY, f'{DATASET_NAME}_{id}.jpg')
                    download_image(image, path)
                    image_path_cache[image] = path
                    obj['path'] = path
            # Use string keys, matching what json.load returns on the next run
            existing_metadata[str(id)] = obj
            if i % DOWNLOADING_SAVE_PERIOD == 0:
                save_metadata()
        except Exception as er:
            print(f'Failed to download {id} image - {er}!')
        pbar.update(len(image_path_cache) - pbar.n)

# Save once more so the entries added since the last periodic save are kept
save_metadata()

  0%|          | 0/90000 [00:00<?, ?it/s]

Failed to download 275416 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 365609 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 853680 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 61421 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 444568 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 713809 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 64430 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 464701 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 485361 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 51184 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 451275 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 647081 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 342089 image - Expecting value: line 1 column 1 (char 0)!
Fa

KeyboardInterrupt: 
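
The `Expecting value: line 1 column 1 (char 0)` messages are raised when `.json()` is called on a response body that is not JSON, which usually means the request itself did not succeed (for example an empty body or an error page). One possible mitigation, sketched below under that assumption, is to retry a failed call a few times with a growing pause. The helper `call_with_retries` and its parameters are hypothetical, and whether retrying helps depends on why the API rejected the request.

```python
import time


def call_with_retries(func, retries=3, pause=1.0, sleep=time.sleep):
    """Call func() until it succeeds or `retries` attempts have failed.

    The pause grows linearly between attempts; the last error is re-raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            sleep(pause * attempt)


# Example with a stand-in for a flaky request: fails twice, then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError('Expecting value: line 1 column 1 (char 0)')
    return {'objectID': 1}


result = call_with_retries(flaky, pause=0.0)
print(result, len(attempts))
```

In the download loop, the metadata request could then be wrapped as `call_with_retries(lambda: requests.get(url, timeout=30).json())`, where `url` is the object endpoint used above.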