# [SpaceNet](https://aws.amazon.com/public-datasets/spacenet/)

"The current SpaceNet corpus includes **thousands of square kilometers of high resolution imagery** collected from **DigitalGlobe’s commercial satellites** which includes **8-band multispectral data**. This dataset is being made public to advance the development of **algorithms to automatically extract geometric features such as roads, building footprints, and points of interest using satellite imagery**. The currently available Areas of Interest (AOI) are **Rio De Janeiro**, Paris, Las Vegas, Shanghai and Khartoum."

### 0. Dependencies
Install the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) and configure it with the credentials of an active AWS account by running `aws configure`.
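
As a quick sanity check that `aws configure` has been run, the sketch below looks for a profile in the shared credentials file. It assumes the default location `~/.aws/credentials` and the INI layout the CLI writes. The helper name `has_aws_profile` is hypothetical, and credentials supplied in other ways (environment variables, instance roles) would not be detected by it.

```python
import configparser
from pathlib import Path


def has_aws_profile(credentials_path, profile="default"):
    """Return True if the credentials file defines an access key for `profile`."""
    parser = configparser.ConfigParser()
    # read() silently skips paths that do not exist, so a missing file yields False.
    parser.read(str(credentials_path))
    return parser.has_option(profile, "aws_access_key_id")


# Assumed default location of the shared credentials file.
default_path = Path.home() / ".aws" / "credentials"
print("default profile configured:", has_aws_profile(default_path))
```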

### 1. Accessing the SpaceNet Data on AWS
The dataset consists of [GeoTIFF](https://en.wikipedia.org/wiki/GeoTIFF) satellite imagery and corresponding [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) building footprints.

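The `spacenet-dataset` S3 bucket is a [Requester Pays bucket](https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html), so requests must be authenticated and the requester is billed for the transfer. We therefore download with [Boto 3](https://boto3.readthedocs.io/en/latest/index.html), the Amazon Web Services (AWS) SDK for Python, and set `RequestPayer` to `requester` on each request.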

In [None]:
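import boto3
# Import the transfer class explicitly instead of relying on the
# boto3.s3.transfer submodule having been loaded as a side effect.
from boto3.s3.transfer import S3Transfer

client = boto3.client("s3")
# See: https://boto3.readthedocs.io/en/latest/reference/customizations/s3.html#boto3.s3.transfer.S3Transfer
transfer = S3Transfer(client)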

bucket = "spacenet-dataset"
# 20 tiffs listed in "AOI_1_Rio_manifest.txt".
names = [
    "013022223310.tif",
    "013022232021.tif",
    "013022232201.tif",
    "013022223131.tif",
    "013022223113.tif",
    "013022223103.tif",
    "013022223133.tif",
    "013022223132.tif",
    "013022223301.tif",
    "013022223112.tif",
    "013022232020.tif",
    "013022232200.tif",
    "013022223123.tif",
    "013022232022.tif",
    "013022223130.tif",
    "013022232002.tif",
    "013022232023.tif",
    "013022223311.tif",
    "013022232003.tif",
    "013022223121.tif"
]

key_prefix = "AOI_1_Rio/srcData/mosaic_8band/"
!mkdir -p /tmp/spacenet-data
filename_prefix = "/tmp/spacenet-data/"

key_filename_tuples = [
    (key_prefix + name, filename_prefix + name)
    for name in names]

# Download takes around 15 minutes.
import time

start = time.time()
for (key, filename) in key_filename_tuples:
    transfer.download_file(
        bucket=bucket, key=key, filename=filename,
        extra_args={"RequestPayer":"requester"}
    )
end = time.time()
download_time = end - start
print("Download time: %s" % download_time)
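
An exception raised during the loop above would stop it partway through, so it can help to confirm afterwards that every expected file is present. The following is a minimal sketch using only the standard library. `missing_downloads` is a hypothetical helper, and the non-zero size check is only a rough proxy for a complete download.

```python
import os


def missing_downloads(directory, names):
    """Return the names that are absent from `directory` or are empty files."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Example with two of the tile names from the list above.
expected = ["013022223310.tif", "013022232021.tif"]
print(missing_downloads("/tmp/spacenet-data", expected))
```

An empty list means every expected file exists and is non-empty; any names returned can be passed back through the download loop.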

### 2. Ingest 8-Band Images with GeoPySpark

[GeoPySpark](https://github.com/locationtech-labs/geopyspark) is a Python binding for [GeoTrellis](https://github.com/locationtech/geotrellis), a Scala library that uses Spark to read, write, and operate on raster data at scale.

Refer to the [Ingesting a Grayscale Image](https://geopyspark.readthedocs.io/en/latest/tutorials/greyscale_ingest_example.html) tutorial for a breakdown of the code.

In [None]:
from pyspark import SparkContext
from geopyspark import geopyspark_conf
conf = geopyspark_conf("local[*]", "spacenet-ingest")
geopysc = SparkContext.getOrCreate(conf)

In [None]:
import time

start = time.time()

from geopyspark.geotrellis.geotiff import get
from geopyspark.geotrellis.constants import SPATIAL, ZOOM
from geopyspark.geotrellis.catalog import write

# Read the GeoTIFFs from the local download directory
rdd = get(geopysc, SPATIAL, "/tmp/spacenet-data", num_partitions=5000, max_tile_size=1024)

metadata = rdd.collect_metadata(tile_size=1024)

# tile the rdd to the layout defined in the metadata
laid_out = rdd.tile_to_layout(metadata)

# reproject the tiled rasters using a ZoomedLayoutScheme
reprojected = laid_out.reproject("EPSG:3857", scheme=ZOOM).cache().repartition(5000)

# pyramid the TiledRasterRDD to create 12 new TiledRasterRDDs
# one for each zoom level
pyramided = reprojected.pyramid(start_zoom=12, end_zoom=1)

# Save each TiledRasterRDD locally
for tiled in pyramided:
    write("file:///tmp/spacenet-catalog", "spacenet-ingest", tiled)

end = time.time()
ingest_time = end - start
print("Ingest time: %s" % ingest_time)
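
To get a feel for what the zoom levels in the pyramid mean, the sketch below computes the nominal ground resolution per pixel in EPSG:3857 (Web Mercator) at a given zoom. It assumes the usual power-of-two scheme, in which zoom `z` divides the world into `2**z` by `2**z` tiles, and treats the tile size as a parameter. Whether GeoPySpark's `ZOOM` scheme uses exactly these values for this ingest is an assumption, not something the notebook states, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, as used by Web Mercator


def web_mercator_resolution(zoom, tile_size=256):
    """Nominal metres per pixel at the equator for a power-of-two tile pyramid."""
    world_width_m = 2 * math.pi * EARTH_RADIUS_M
    return world_width_m / (tile_size * 2 ** zoom)


# Walk the same zoom range as the pyramid call, from 12 down to 1.
for zoom in range(12, 0, -1):
    print(zoom, round(web_mercator_resolution(zoom), 2))
```

Each step down in zoom doubles the metres covered by one pixel, which is why the lower levels of the pyramid are much smaller than the base layer.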