
# Download the Unsplash dataset

This notebook downloads all images from the Unsplash dataset: https://github.com/unsplash/datasets. There are two versions: Lite (25,000 images) and Full (2M images). The Full version requires applying for access (see [here](https://unsplash.com/data)). Once the images are downloaded, you can run CLIP on the whole dataset yourself.

Put the .TSV files in the folder `unsplash-dataset/full` or `unsplash-dataset/lite`, or adjust `unsplash_dataset_path` in the cell below so that it points at the folder containing them.

In [1]:
from pathlib import Path

dataset_version = "lite"  # either "lite" or "full"
unsplash_dataset_path = Path("unsplash-dataset-lite-latest")
print(unsplash_dataset_path)

unsplash-dataset-lite-latest
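
As a minimal sketch of the folder convention described above, the path could be derived from `dataset_version` and checked before any files are read. The helper name `resolve_dataset_path` and the default root `unsplash-dataset` are assumptions for illustration, not part of the notebook.

```python
from pathlib import Path

def resolve_dataset_path(version, root="unsplash-dataset"):
    """Return <root>/<version> after checking that it holds a photos table.

    Hypothetical helper: the layout <root>/lite and <root>/full follows the
    convention suggested in the text above.
    """
    if version not in ("lite", "full"):
        raise ValueError(f"version must be 'lite' or 'full', got {version!r}")
    path = Path(root) / version
    # Fail early if the folder is missing or contains no photos.tsv* file
    if not path.is_dir() or not any(path.glob("photos.tsv*")):
        raise FileNotFoundError(f"No photos.tsv* file found in {path}")
    return path
```

Failing early here avoids a less obvious error later, when `pd.read_csv` is asked to open a file that is not there.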


## Load the dataset

The `photos.tsv000` file contains metadata about the photos in the dataset, but not the photos themselves. We will use the photo URLs listed in this file to download the actual images.

In [2]:
import pandas as pd

# Read the photos table
photos = pd.read_csv(unsplash_dataset_path / "photos.tsv000", sep='\t', header=0)

# Extract the IDs and the URLs of the photos
photo_urls = photos[['photo_id', 'photo_image_url']].values.tolist()

# Print some statistics
print(f'Photos in the dataset: {len(photo_urls)}')

Photos in the dataset: 25000
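
The cell above reads a single file, `photos.tsv000`. The sketch below is one way to collect the same `(photo_id, photo_image_url)` pairs if the table were split across several files. That split (`photos.tsv000`, `photos.tsv001`, ...) is an assumption suggested by the numeric suffix, and `iter_photo_urls` is a hypothetical helper name. It uses only the standard library and streams rows instead of loading the whole table into memory.

```python
import csv
from pathlib import Path

def iter_photo_urls(dataset_dir, table="photos"):
    """Yield (photo_id, photo_image_url) from every <table>.tsv* file.

    Assumes each chunk starts with the same header row.
    """
    # Sort so that chunks are read in a stable order (tsv000, tsv001, ...)
    for chunk in sorted(Path(dataset_dir).glob(f"{table}.tsv*")):
        with open(chunk, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                yield row["photo_id"], row["photo_image_url"]
```

Calling `list(iter_photo_urls(unsplash_dataset_path))` would give a list of pairs that can be passed to a download function in the same way as `photo_urls`.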


Each photo is saved under a file name that corresponds to its unique Unsplash ID. We download the photos at a reduced resolution (640 pixels wide) because CLIP downscales them anyway.
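
The reduced width is requested through a `w` query parameter on the image URL. Appending the parameter as plain text works when the stored URL has no query string. As a small sketch, assuming some URLs might already carry query parameters, the width can instead be set with `urllib.parse`. The helper name `with_width` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_width(url, width=640):
    """Return `url` with its `w` query parameter set to `width`."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous `w`
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "w"]
    query.append(("w", str(width)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_width("https://images.unsplash.com/photo-123"))
# https://images.unsplash.com/photo-123?w=640
```

For URLs without a query string the result is identical to simple string concatenation, so this only matters if the input data is less uniform than expected.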

In [16]:
import urllib.request

# Path where the photos will be downloaded
# photos_donwload_path = unsplash_dataset_path / "images"
photos_donwload_path = Path("/projects/katefgroup/datasets/rewards/unsplash/images")
print(photos_donwload_path)

# Make sure the target folder exists before any download starts
photos_donwload_path.mkdir(parents=True, exist_ok=True)

# Function that downloads a single photo
def download_photo(photo):
    # Get the ID of the photo
    photo_id = photo[0]

    # Get the URL of the photo (setting the width to 640 pixels)
    photo_url = photo[1] + "?w=640"

    # Path where the photo will be stored
    photo_path = photos_donwload_path / (photo_id + ".jpg")

    # Only download a photo if it doesn't exist
    if not photo_path.exists():
        try:
            urllib.request.urlretrieve(photo_url, photo_path)
        except Exception:
            # Report the failure and continue with the next photo
            print(f"Cannot download {photo_url}")

/projects/katefgroup/datasets/rewards/unsplash/images
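
One caveat with the function above: `urlretrieve` writes directly to the target path, so an interrupted run can leave a partial file that the `exists()` check then skips on the next run. The following is a hedged sketch of one way around that, not the notebook's method: download into a temporary `.part` file, rename it only on success, and retry a few times. The names `fetch_bytes` and `download_atomic`, the retry count, and the delay are all assumptions.

```python
import os
import time
import urllib.request
from pathlib import Path

def fetch_bytes(url, timeout=30):
    """Return the response body for `url`; raises on network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_atomic(url, dest, fetch=fetch_bytes, retries=3, delay=1.0):
    """Download `url` to `dest`; return True on success, False otherwise."""
    dest = Path(dest)
    if dest.exists():
        return True
    tmp = dest.with_name(dest.name + ".part")
    for attempt in range(1, retries + 1):
        try:
            tmp.write_bytes(fetch(url))
            # The final name only appears once the file is complete
            os.replace(tmp, dest)
            return True
        except Exception as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)
    # Clean up a leftover partial file after the last failed attempt
    if tmp.exists():
        tmp.unlink()
    return False
```

The `fetch` argument exists mainly so the logic can be exercised without network access. Returning a boolean also makes it possible to count failures afterwards instead of reading them off the printed log.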


Now the actual download! Downloads parallelize very well, so we will use a thread pool. You may need to tune the `threads_count` parameter to get the best performance for your Internet connection. For me, even 128 threads worked quite well.
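
When tuning `threads_count`, it helps to know how many items succeeded and how many failed. The sketch below is one possible variation on the thread-pool approach, using `ThreadPool.imap_unordered` to collect a result per item. The helper name `run_downloads` is hypothetical, and it assumes the worker returns a truthy value on success.

```python
from multiprocessing.pool import ThreadPool

def run_downloads(worker, items, threads_count=16):
    """Run worker(item) in a thread pool; return (ok, failed) item lists."""
    ok, failed = [], []

    def safe(item):
        # Convert exceptions into a failed result so one error
        # does not abort the whole run
        try:
            return item, bool(worker(item))
        except Exception:
            return item, False

    with ThreadPool(threads_count) as pool:
        # Results arrive as soon as each item finishes, in no fixed order
        for item, success in pool.imap_unordered(safe, items):
            (ok if success else failed).append(item)
    return ok, failed
```

The notebook's `download_photo` returns nothing, so it would need a small change (returning `True` or `False`) before it could serve as `worker` here. The `failed` list can then be passed to a second run without touching the photos that already succeeded.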

In [17]:
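from multiprocessing.pool import ThreadPool

# Number of worker threads
threads_count = 16

# Start the download; the pool is shut down when the block exits
with ThreadPool(threads_count) as pool:
    pool.map(download_photo, photo_urls)

# Display some statistics: count the files on disk, since some downloads may fail
downloaded = sum(1 for _ in photos_donwload_path.glob("*.jpg"))
print(f'Photos downloaded: {downloaded}')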

/projects/katefgroup/datasets/rewards/unsplash/images/bygTaBey1Xk.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/tlbUJKfHhEw.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/o_-gToPk62c.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/3tdvWoNxvbw.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/dxZHS55WnlM.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/s3dGFU-dHXw.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/evrHojTLBKE.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/Fgp8p6KD_Ks.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/Sjx2iAwBVnM.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/4HJiA2TWe2I.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/4-XMUn95FZU.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/XDy5I86-V78.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/9ogmgks2y-U.jpg
/projects/katefgroup/datasets/rewards/unsplash/images/SfgJnbqpUWU.jpg
/projects/katefgroup

KeyboardInterrupt: 

Cannot download https://images.unsplash.com/photo-1587279825901-c33f08032ffc?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/Mo4jlwGV2Lo.jpg
Cannot download https://images.unsplash.com/photo-1569784990219-7f9e3c31313f?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/3g7WswEHuvs.jpg
Cannot download https://images.unsplash.com/photo-1528039152694-87e82954a2da?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/dqi31u8UoeM.jpg
Cannot download https://images.unsplash.com/photo-1553381263-8a6c0dcc231b?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/lGCfApDzhYw.jpg
Cannot download https://images.unsplash.com/photo-1552580703-fbead4d641ce?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/l8Xg0WdsQKc.jpg
Cannot download https://images.unsplash.com/photo-1580133318324-f2f76d987dd8?w=640
/projects/katefgroup/datasets/rewards/unsplash/images/0zh-aIFfDhE.jpg
Cannot download https://images.unsplash.com/photo-1553613063-0001e0d60cdf?w=640
/project