# Downloading a Dataset from The Metropolitan Museum of Art
The museum shares its art collection dataset under CC0, which we can use to train our CC0 ResNet backbone for other projects!
Their website, https://metmuseum.github.io/, lists various endpoints, but we'll mostly use the one that returns the metadata of a single object, and then download the images it references.

Some art entries share the same photo as others, so we download only one copy of it; for the other objects, the 'path' attribute points to that single copy.
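The deduplication described above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's actual code: `assign_paths` and its `path_for` argument are hypothetical names, and the example URLs are placeholders.

```python
def assign_paths(objects, path_for):
    """Map each object ID to a local image path, one file per distinct image URL.

    objects:  iterable of (object_id, image_url) pairs; image_url may be ''.
    path_for: hypothetical helper that builds a local path from an object ID.
    """
    url_to_path = {}
    result = {}
    for object_id, url in objects:
        if not url:
            continue  # object has no image
        if url not in url_to_path:
            # First object with this URL decides where the file is stored.
            url_to_path[url] = path_for(object_id)
        result[object_id] = url_to_path[url]
    return result


paths = assign_paths(
    [(1, 'https://example.org/a.jpg'), (2, 'https://example.org/a.jpg'), (3, '')],
    lambda object_id: f'dataset/raw/met_{object_id}.jpg',
)
print(paths)
# {1: 'dataset/raw/met_1.jpg', 2: 'dataset/raw/met_1.jpg'}
```

Objects 1 and 2 share a URL, so both point to the file named after object 1; object 3 has no image and gets no path.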

Running this notebook produces a metadata/met.json file containing each object's metadata, with the path to its downloaded image stored in 'path'. Note that not every art object has an associated image.


#### Downloading the full dataset may result in about 500 GB of images. I've decided to stop at about 50000 images, which is approximately 45 GB of raw data from this museum and took about 35 hours to download.
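As a rough sanity check, the figures above imply the following per-image averages. The three inputs are the approximate numbers quoted in this notebook, and 1 GB is assumed to mean 1000 MB.

```python
images = 50_000   # approximate number of downloaded images
size_gb = 45      # approximate size of the raw data
hours = 35        # approximate download time

mb_per_image = size_gb * 1000 / images      # assumes 1 GB = 1000 MB
seconds_per_image = hours * 3600 / images

print(f'{mb_per_image:.2f} MB per image, {seconds_per_image:.2f} s per image')
# 0.90 MB per image, 2.52 s per image
```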

#### To reduce the bias that may arise from image topics being correlated with their IDs, the IDs are now downloaded in random order.

### Constants and imports


In [1]:
import requests
import os
from tqdm.notebook import tqdm
import json
import random

BASE_URL = 'https://collectionapi.metmuseum.org'
METADATA_DIRECTORY = 'metadata'
RAW_IMAGE_DIRECTORY = os.path.join('dataset', 'raw')
DATASET_NAME = 'met'
DATASET_IMAGES_LIMIT = 80000

### Preparing directories

In [2]:
# exist_ok=True makes these calls a no-op when the directory already exists
os.makedirs(RAW_IMAGE_DIRECTORY, exist_ok=True)
os.makedirs(METADATA_DIRECTORY, exist_ok=True)

### Downloading ObjectIDs

In [3]:
available_ids = requests.get(f'{BASE_URL}/public/collection/v1/objects').json()['objectIDs']

In [4]:
print(f'Object IDs count: {len(available_ids)}')

Object IDs count: 486851


### Checking existing metadata

In [5]:
existing_metadata = {}
BASIC_METADATA_PATH = os.path.join(METADATA_DIRECTORY, f'{DATASET_NAME}.json')

if os.path.exists(BASIC_METADATA_PATH):
    with open(BASIC_METADATA_PATH, 'r') as f:
        existing_metadata = json.load(f)

def save_metadata():
    with open(BASIC_METADATA_PATH, 'w') as f:
        json.dump(existing_metadata, f)
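Because the metadata file is rewritten many times during a long run, an interruption in the middle of a write could leave truncated JSON on disk. One possible safeguard, sketched here under that assumption, is to write to a temporary file and swap it into place. `save_json_atomically` is a hypothetical helper, not part of this notebook.

```python
import json
import os
import tempfile


def save_json_atomically(data, path):
    """Write data as JSON to a temp file, then replace path in one step."""
    directory = os.path.dirname(path) or '.'
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(data, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Also covers KeyboardInterrupt: drop the partial temp file, keep the old target.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach, the previous version of the file stays intact until the new one has been written completely.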

In [6]:
# Map each already-downloaded image URL to its local path,
# so that objects sharing a photo reuse the same file.
image_path_cache = {}
for metadata in existing_metadata.values():
    if metadata.get('primaryImage') and 'path' in metadata:
        image_path_cache[metadata['primaryImage']] = metadata['path']

### Filtering what we have to download

In [7]:
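# Keys loaded from the JSON file are strings, so check both the int and str form of each ID.
to_download = [
    object_id for object_id in available_ids
    if object_id not in existing_metadata and str(object_id) not in existing_metadata
]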

random.shuffle(to_download)
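The shuffle above produces a different order on every run. If a reproducible order is wanted, a seeded generator is one option. The sketch below uses a hypothetical helper `shuffled_ids` and an arbitrary seed; neither is part of this notebook.

```python
import random


def shuffled_ids(ids, seed):
    """Return a shuffled copy of ids; the same seed always gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    result = list(ids)
    rng.shuffle(result)
    return result


first = shuffled_ids(range(10), seed=42)
second = shuffled_ids(range(10), seed=42)
print(first == second)
# True
```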

### Downloading it and updating the existing metadata file

In [8]:
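def download_image(url, path):
    # Stream the image to disk in small chunks. Raise on HTTP errors so the
    # caller does not record a path for a file that was never written.
    # The 60-second timeout is an arbitrary choice.
    with requests.get(url, stream=True, timeout=60) as image_data:
        image_data.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in image_data.iter_content(2048):
                f.write(chunk)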
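Every file is saved with a `.jpg` extension regardless of what the server actually returned. A cheap way to spot files that are not JPEGs (for example a saved error page) is to check the first bytes, since JPEG files begin with `FF D8 FF`. The helper name `looks_like_jpeg` is hypothetical, and the demo files are throwaway data.

```python
import os
import tempfile

JPEG_MAGIC = b'\xff\xd8\xff'  # start-of-image marker followed by the next marker prefix


def looks_like_jpeg(path):
    """Return True if the file starts with the JPEG magic bytes."""
    with open(path, 'rb') as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC


# Usage on two throwaway files
with tempfile.TemporaryDirectory() as directory:
    good = os.path.join(directory, 'good.jpg')
    bad = os.path.join(directory, 'bad.jpg')
    with open(good, 'wb') as f:
        f.write(JPEG_MAGIC + b'\xe0rest-of-file')
    with open(bad, 'wb') as f:
        f.write(b'<html>error page</html>')
    results = (looks_like_jpeg(good), looks_like_jpeg(bad))

print(results)
# (True, False)
```

This only inspects the header; it does not prove that the rest of the file decodes correctly.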

In [9]:
with tqdm(total=DATASET_IMAGES_LIMIT) as pbar:
    for id in to_download:
        if len(image_path_cache) >= DATASET_IMAGES_LIMIT:
            break
        try:
            # A response that is not valid JSON raises here and is reported by the except block.
            obj = requests.get(f'{BASE_URL}/public/collection/v1/objects/{id}').json()
            if obj.get('primaryImage'):
                image = obj['primaryImage']
                if image in image_path_cache:
                    obj['path'] = image_path_cache[image]
                else:
                    path = os.path.join(RAW_IMAGE_DIRECTORY, f'{DATASET_NAME}_{id}.jpg')
                    download_image(image, path)
                    image_path_cache[image] = path
                    obj['path'] = path
            existing_metadata[id] = obj
            save_metadata()
        except Exception as er:
            print(f'Failed to download {id} image - {er}!')
        pbar.update(len(image_path_cache) - pbar.n)

  0%|          | 0/80000 [00:00<?, ?it/s]

Failed to download 282160 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 415886 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 709583 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 367959 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 218829 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 445168 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 818195 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 360766 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 756162 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 308705 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 65218 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 227670 image - Expecting value: line 1 column 1 (char 0)!
Failed to download 775612 image - Expecting value: line 1 column 1 (char 0)!


KeyboardInterrupt: 
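The `Expecting value: line 1 column 1 (char 0)` messages in the output are JSON decoding errors: the response body for those objects was not JSON (for example, it was empty or an HTML page). The output does not record why; throttling or a missing object are plausible causes, but that is an assumption. A minimal sketch of separating such responses from valid metadata, using a hypothetical helper `parse_object_response`:

```python
import json


def parse_object_response(status_code, body):
    """Return the decoded object metadata, or None if the response is unusable."""
    if status_code != 200:
        return None  # HTTP error: do not try to decode the body
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None  # body was empty or not JSON
    # Object metadata is expected to be a JSON object (a dict in Python).
    return parsed if isinstance(parsed, dict) else None


print(parse_object_response(200, '{"objectID": 1}'))
# {'objectID': 1}
print(parse_object_response(200, ''))
# None
```

With a `requests` response, the two arguments would be `response.status_code` and `response.text`. IDs that yield `None` could be collected and retried later instead of only being printed.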