## Get Data

We would like to train a classifier to identify similar images of clouds. To do this, we'll make use of [NASA's Worldview Snapshots API](https://worldview.earthdata.nasa.gov/?v=-163.07942357212752,-32.18685220229665,-100.20019214430877,0.016272797703347663&t=2019-02-10-T00%3A00%3A00Z&l=VIIRS_SNPP_CorrectedReflectance_TrueColor(hidden),MODIS_Aqua_CorrectedReflectance_TrueColor(hidden),MODIS_Terra_CorrectedReflectance_TrueColor,Reference_Labels(hidden),Reference_Features(hidden),Coastlines&tr=sunglint).

Data is captured from two orbiting satellites: Terra and Aqua.

In [2]:
import os
import math
import requests
import random, string
import numpy as np
import pandas as pd

from PIL import Image
from io import BytesIO
from datetime import date
from dateutil.rrule import rrule, DAILY
from multiprocessing.pool import ThreadPool

In [4]:
# Set up folder structure.
os.makedirs('data/terra', exist_ok=True)
os.makedirs('data/aqua', exist_ok=True)

Calls to this API look like:
```
https://wvs.earthdata.nasa.gov/api/v1/snapshot?
    REQUEST=GetSnapshot
    &TIME=2019-09-24T00:00:00Z
                BOTTOM               LEFT              TOP                 RIGHT
    &BBOX=-26.523608349900595,-119.85108101391648,0.6927808151093444,-95.30684642147116
    &CRS=EPSG:4326
    &LAYERS=MODIS_Terra_CorrectedReflectance_TrueColor,Coastlines
    &WRAP=day,x
    &FORMAT=image/jpeg
    &WIDTH=559
    &HEIGHT=619
```
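As a rough sketch of how such a call can be assembled from its parts, the helper below builds the query string with the standard library. The names `BASE_URL` and `build_snapshot_url` are made up for illustration and are not part of the API; the parameter names and values are copied from the sample call above, and the bounding box in the example is rounded.

```python
from urllib.parse import urlencode

# Endpoint taken from the sample call above
BASE_URL = "https://wvs.earthdata.nasa.gov/api/v1/snapshot"

def build_snapshot_url(day, bbox, satellite, width=525, height=350):
    """Assemble a snapshot URL; bbox is (bottom, left, top, right) in degrees."""
    params = {
        "REQUEST": "GetSnapshot",
        "TIME": f"{day}T00:00:00Z",
        "BBOX": ",".join(str(v) for v in bbox),
        "CRS": "EPSG:4326",
        "LAYERS": f"MODIS_{satellite}_CorrectedReflectance_TrueColor,Coastlines",
        "WRAP": "day,x",
        "FORMAT": "image/jpeg",
        "WIDTH": width,
        "HEIGHT": height,
    }
    # safe=":,/" leaves these separators unescaped, as in the sample call
    return BASE_URL + "?" + urlencode(params, safe=":,/")

example_url = build_snapshot_url("2019-09-24", (-26.5, -119.9, 0.7, -95.3), "Terra")
print(example_url)
```

Building the URL from a dictionary keeps each parameter on its own line, which makes it easier to change the date, region, or satellite without editing a long string by hand.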

We only want images of cloud over open ocean, so we select our snapshots from the ocean regions shown below.
![https://i.gyazo.com/4f1f5d737bf4685c18e74c8b863cc859.png](https://i.gyazo.com/4f1f5d737bf4685c18e74c8b863cc859.png)

## Regions

We need to choose five sets of coordinates from which we'll take our snapshots.

### Atlantic [1]


- Bottom	 10.08984375
- Left		 -54.597656250000014
- Top		 25.48828125
- Right		 -31.535156250000014

- Width		 23.0625
- Height	 15.3984375

### South Atlantic [1]


- Bottom	 -21.48046875
- Left		 -28.160156250000014
- Top		 -6.08203125
- Right		 -5.097656250000014
- Width		 23.0625
- Height	 15.3984375

### East Pacific [2]


- Bottom	 9.66796875
- Left		 132.01171875
- Top		 25.06640625
- Right		 155.07421875
- Width		 23.0625
- Height	 15.3984375


### South Pacific [3]


- Bottom	 -20.63671875
- Left		 -110.56640624999997
- Top		 -5.23828125
- Right		 -87.50390624999997
- Width		 23.0625
- Height	 15.3984375


### South Pacific [3]

- Bottom	 -20.49609375
- Left		 -149.02734374999997
- Top		 -5.09765625
- Right		 -125.96484374999997
- Width		 23.0625
- Height	 15.3984375


In [8]:
# Each region is stored as [bottom, left, top, right] in degrees.
atlantic = np.array([10.08984375, -54.597656250000014, 25.48828125, -31.535156250000014])
south_atlantic = np.array([-21.48046875, -28.160156250000014, -6.08203125, -5.097656250000014])
east_pacific = np.array([9.66796875, 132.01171875, 25.06640625, 155.07421875])
south_pacific_1 = np.array([-20.63671875, -110.56640624999997, -5.23828125, -87.50390624999997])
south_pacific_2 = np.array([-20.49609375, -149.02734374999997, -5.09765625, -125.96484374999997])
regions = np.array([atlantic, south_atlantic, east_pacific, south_pacific_1, south_pacific_2])
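
As a quick sanity check on the bounding boxes, the sketch below computes the latitude span (top minus bottom) and the longitude span (right minus left) for the Atlantic region. The helper name `bbox_spans` is hypothetical. The ratio of the two spans comes out close to 1.5, which is the aspect ratio of a 525 by 350 pixel image.

```python
# (bottom, left, top, right) in degrees, copied from the Atlantic region above
atlantic_bbox = (10.08984375, -54.597656250000014, 25.48828125, -31.535156250000014)

def bbox_spans(bbox):
    """Return (latitude span, longitude span) of a bounding box in degrees."""
    bottom, left, top, right = bbox
    lat_span = top - bottom   # north-south extent
    lon_span = right - left   # east-west extent
    return lat_span, lon_span

lat_span, lon_span = bbox_spans(atlantic_bbox)
print(lat_span, lon_span, lon_span / lat_span)
```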

Let's build a list of all of the URLs we'll query. We keep them in tuple pairs of `(aquaUrl, terraUrl)`.

In [12]:
def getUrls():
    # Build one (aquaUrl, terraUrl) pair per day and per region.
    startDate = date(2012, 1, 1)
    endDate   = date(2019, 7, 31)

    # Shared template; placeholders are the date, the four BBOX values, and the satellite name.
    template = 'https://wvs.earthdata.nasa.gov/api/v1/snapshot?REQUEST=GetSnapshot&TIME={}T00:00:00Z&BBOX={},{},{},{}&CRS=EPSG:4326&LAYERS=MODIS_{}_CorrectedReflectance_TrueColor,Coastlines&WRAP=day,x&FORMAT=image/jpeg&WIDTH=525&HEIGHT=350&ts=1569875246328'

    allUrls = []

    for dt in rrule(DAILY, dtstart=startDate, until=endDate):

        current_date = dt.strftime("%Y-%m-%d")

        for bottom, left, top, right in regions:

            aquaUrl = template.format(current_date, bottom, left, top, right, 'Aqua')
            terraUrl = template.format(current_date, bottom, left, top, right, 'Terra')

            allUrls.append((aquaUrl, terraUrl))

    return allUrls

In [17]:
urls = getUrls()
print("Number of urls:", len(urls))
print("Sample urls:", urls[1][0], '\n', urls[1][1])

Number of urls: 13845
Sample urls: https://wvs.earthdata.nasa.gov/api/v1/snapshot?REQUEST=GetSnapshot&TIME=2012-01-01T00:00:00Z&BBOX=-21.48046875,-28.160156250000014,-6.08203125,-5.097656250000014&CRS=EPSG:4326&LAYERS=MODIS_Aqua_CorrectedReflectance_TrueColor,Coastlines&WRAP=day,x&FORMAT=image/jpeg&WIDTH=525&HEIGHT=350&ts=1569875246328 
 https://wvs.earthdata.nasa.gov/api/v1/snapshot?REQUEST=GetSnapshot&TIME=2012-01-01T00:00:00Z&BBOX=-21.48046875,-28.160156250000014,-6.08203125,-5.097656250000014&CRS=EPSG:4326&LAYERS=MODIS_Terra_CorrectedReflectance_TrueColor,Coastlines&WRAP=day,x&FORMAT=image/jpeg&WIDTH=525&HEIGHT=350&ts=1569875246328
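
The count above can be cross-checked by hand: one URL pair per day and per region, with both the start and end dates included. A minimal sketch of that arithmetic, where the variable names are chosen only for this check:

```python
from datetime import date

start_date = date(2012, 1, 1)
end_date = date(2019, 7, 31)
n_regions = 5

# Both endpoints are included in the date range, hence the + 1
n_days = (end_date - start_date).days + 1
expected_pairs = n_days * n_regions
print(n_days, expected_pairs)
```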


In [10]:
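def saveImages(urlPair):
    # Download the Aqua and Terra snapshots for one (date, region) pair.
    aquaUrl, terraUrl = urlPair
    aquaResponse = requests.get(aquaUrl)
    terraResponse = requests.get(terraUrl)

    # Fail early on HTTP errors instead of trying to decode an error page as an image.
    aquaResponse.raise_for_status()
    terraResponse.raise_for_status()

    aquaImg = Image.open(BytesIO(aquaResponse.content))
    terraImg = Image.open(BytesIO(terraResponse.content))

    # Both images share one random 16-character name so the pair can be matched later.
    name = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(16))

    aquaImg.save("data/aqua/" + name + ".jpg")
    terraImg.save("data/terra/" + name + ".jpg")

    return name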
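
A random name keeps the Aqua and Terra files paired, but it does not record which date or region an image came from. One possible alternative, sketched here as an assumption rather than what this notebook does, is to derive the name from the date and a region label. The helper `snapshot_name` is hypothetical, and using it would mean passing the date and region label along with each URL pair.

```python
import re

def snapshot_name(day, region):
    """Build a traceable file stem from a date string and a region label."""
    # Lowercase the label and collapse anything that is not a letter or digit into "_"
    safe_region = re.sub(r"[^a-z0-9]+", "_", region.lower()).strip("_")
    return f"{day}_{safe_region}"

print(snapshot_name("2012-01-01", "South Atlantic"))
```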

In [None]:
# Use eight threads to download them in parallel. This will take a while.
results = ThreadPool(8).imap_unordered(saveImages, urls)

# saveImages returns the file name shared by each saved pair.
for name in results:
    print(name)
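
With `imap_unordered`, an exception raised inside a worker is re-raised when its result is read, which ends the loop. Over thousands of downloads, a single failed request would therefore stop the run. The sketch below shows one way to record failures instead. The wrapper `tolerant` and the stand-in `fake_download` are hypothetical names; in this notebook the wrapped function would be `saveImages`.

```python
from multiprocessing.pool import ThreadPool

def tolerant(func):
    """Wrap func so a failure is returned as (item, None, error) instead of raised."""
    def wrapper(item):
        try:
            return item, func(item), None
        except Exception as exc:  # broad on purpose: record any failure and move on
            return item, None, exc
    return wrapper

def fake_download(item):
    # Stand-in for a real download; fails on one input to show the behaviour
    if item == "bad":
        raise ValueError("simulated failure")
    return item.upper()

with ThreadPool(4) as pool:
    # Consume all results inside the with-block, before the pool is shut down
    outcomes = list(pool.imap_unordered(tolerant(fake_download), ["a", "bad", "c"]))

succeeded = sorted(result for _, result, error in outcomes if error is None)
failed = [item for item, _, error in outcomes if error is not None]
print(succeeded, failed)
```

The list of failed items can then be retried in a second pass without downloading everything again.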