# Downloading the Dataset from the Art Institute of Chicago
This museum also shares an art collection dataset under a CC0 license, which we can use to train our CC0 ResNet backbone for other projects!
Their GitHub page, https://github.com/art-institute-of-chicago/api-data, contains metadata for us to download, and they encourage using it when downloading a larger dataset, which is exactly what we want to do. Keep in mind that this metadata takes up almost 2 GB once uncompressed.

We'll download the images at a width of 843 pixels because, supposedly, that size has a high chance of resulting in a cache hit. In the end we scale the images to 224x224, so that quality should be enough.
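To make the "843 size" concrete, here is a minimal sketch of how such an image URL could be built. It assumes the IIIF base URL `https://www.artic.edu/iiif/2` described in the museum's API documentation; it is worth checking that value against the `iiif_url` entry in the downloaded metadata. The helper name `build_image_url` and the constants are hypothetical and not part of the original notebook.

```python
# Assumption: IIIF base URL as described in the museum's API documentation.
IIIF_BASE_URL = 'https://www.artic.edu/iiif/2'
# Requested width in pixels; the height is scaled proportionally.
IMAGE_WIDTH = 843


def build_image_url(image_id, width=IMAGE_WIDTH, base_url=IIIF_BASE_URL):
    """Return an IIIF URL for the full image, scaled to the given width (hypothetical helper)."""
    # IIIF pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f'{base_url}/{image_id}/full/{width},/0/default.jpg'
```

The `843,` size segment asks the server for a width of 843 pixels and leaves the height to be derived from the aspect ratio.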

As requested, we wait a second before each request and don't spread the requests across multiple threads.
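The throttling rule above can be sketched as a plain sequential loop. This is only an illustration under stated assumptions: `fetch_sequentially`, `REQUEST_DELAY_SECONDS`, and the injectable `fetch` and `sleep` parameters are hypothetical names, not the notebook's actual download code.

```python
import time

# Hypothetical constant: delay before each request, in seconds.
REQUEST_DELAY_SECONDS = 1.0


def fetch_sequentially(urls, fetch, delay=REQUEST_DELAY_SECONDS, sleep=time.sleep):
    """Call fetch(url) for each URL, one at a time, pausing before every call."""
    results = {}
    for url in urls:
        # Wait before each request; no threads, so requests never overlap.
        sleep(delay)
        results[url] = fetch(url)
    return results
```

In practice `fetch` could be something like `lambda url: requests.get(url).content`; passing it in as a parameter keeps the loop easy to test without touching the network.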

### Constants and imports

In [9]:
import requests
import os
import random
import json
from tqdm.notebook import tqdm
import pandas as pd
import shutil
from atomicwrites import atomic_write

AIC_DATA_TARBZ = 'https://artic-api-data.s3.amazonaws.com/artic-api-data.tar.bz2'
METADATA_DIRECTORY = 'metadata'
DATASET_NAME = 'aic'
RAW_METADATA_DIRECTORY = os.path.join(METADATA_DIRECTORY, 'raw')
RAW_METADATA_ZIPPED_FILENAME = os.path.join(RAW_METADATA_DIRECTORY, 'aic.tar.bz2')
RAW_METADATA_UNZIP_LOCATION = os.path.join(RAW_METADATA_DIRECTORY, f'{DATASET_NAME}')
RAW_IMAGE_DIRECTORY = os.path.join('dataset', 'raw')
DATASET_IMAGES_LIMIT = 70000
DOWNLOADING_SAVE_PERIOD = 5

### Preparing directories

In [5]:
 
# Create the raw metadata directory (and any missing parents) if needed.
os.makedirs(RAW_METADATA_DIRECTORY, exist_ok=True)


### Downloading the compressed file onto the disk

In [6]:

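# Download the archive only once; stream it so the whole file is never held in memory.
if not os.path.exists(RAW_METADATA_ZIPPED_FILENAME):
    metadata_data = requests.get(AIC_DATA_TARBZ, stream=True)
    # Raise on an HTTP error response instead of silently skipping the download.
    metadata_data.raise_for_status()
    with open(RAW_METADATA_ZIPPED_FILENAME, 'wb') as f:
        for chunk in metadata_data.iter_content(2048):
            f.write(chunk)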
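
Because the cell above skips the download whenever the archive file already exists, an interrupted download would leave a partial file that is never retried. One possible way around this, sketched here with a hypothetical helper `save_chunks_atomically`, is to write to a temporary file in the same directory and rename it only after all chunks have been written.

```python
import os
import tempfile


def save_chunks_atomically(chunks, destination):
    """Write an iterable of byte chunks to destination, replacing it only on success."""
    directory = os.path.dirname(destination) or '.'
    # The temporary file lives in the target directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.part')
    try:
        with os.fdopen(fd, 'wb') as f:
            for chunk in chunks:
                f.write(chunk)
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, destination)
    except BaseException:
        # Clean up the partial file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the names from this notebook, the call might look like `save_chunks_atomically(metadata_data.iter_content(2048), RAW_METADATA_ZIPPED_FILENAME)`.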


### Unzipping the compressed file

In [10]:
# Unpack only if the target directory is missing or empty.
if not os.path.exists(RAW_METADATA_UNZIP_LOCATION) or len(os.listdir(RAW_METADATA_UNZIP_LOCATION)) == 0:
    shutil.unpack_archive(RAW_METADATA_ZIPPED_FILENAME, RAW_METADATA_UNZIP_LOCATION)