# Deep Learning for satellite image downloading

Ali Ben Abbes; Jeaneth Machicao
ali.benabbes@fondationbiodiversite.fr; machicao@usp.br

Presented at VI WORKSHOP ON DATA SCIENCE AND MACHINE LEARNING-PARSEC, 03-06 October 2022, São Paulo, Brazil

# PART 2: Google Earth Engine

## Pre-requisites
Register your Google account for Earth Engine access at [https://code.earthengine.google.com](https://code.earthengine.google.com). Approval may take a couple of days. Without a registered account, the `ee.Initialize()` call below will raise an error.


In [None]:
!pip install earthengine-api --upgrade


## Instructions

This notebook exports Landsat satellite image composites from Google Earth Engine, one per survey cluster listed in `VR_clusters.csv`.

The images are saved in gzipped TFRecord format. By default, this notebook exports images to Google Drive. If you instead prefer to export images to Google Cloud Storage (GCS), change the `EXPORT` constant below to `'gcs'` and set `BUCKET` to the desired GCS bucket name.



|      | Google Drive (default) | GCS
|------|:-----------------------|:---
| VR  | `dhs_tfrecords_raw/`   | `{BUCKET}/dhs_tfrecords_raw/`

Once the images have finished exporting, download the exported TFRecord files to the following folder:

- VR: `data/dhs_tfrecords_raw/` 


In [4]:
from __future__ import annotations

In [5]:
import math
from typing import Any, Optional
import ee
import ee_utils
import pandas as pd


In [6]:
ee.Authenticate()


Successfully saved authorization token.


In [7]:
ee.Initialize()  # initialize the Earth Engine API

In [8]:
# ========== ADAPT THESE PARAMETERS ==========

# export: str, 'drive' for Google Drive (default), 'gcs' for Google Cloud Storage
EXPORT = 'drive'
# bucket: str, name of the GCS bucket; only needed when EXPORT = 'gcs'
BUCKET = None



# export location parameters
DHS_EXPORT_FOLDER = 'dhs_tfrecords_raw'
DHSNL_EXPORT_FOLDER = 'dhsnl_tfrecords_raw'
LSMS_EXPORT_FOLDER = 'lsms_tfrecords_raw'

# Set CHUNK_SIZE to None to export a single TFRecord file per (country, year). However,
# this may fail if it exceeds Google Earth Engine memory limits. Decrease CHUNK_SIZE
# to a small number (<= 50) until Google Earth Engine stops reporting memory errors
CHUNK_SIZE = None
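
The `EXPORT` and `BUCKET` settings depend on each other: `'gcs'` only makes sense with a bucket name. A minimal sketch of a sanity check that could be run before any export task is started is shown here. The helper `check_export_config` is hypothetical and is not part of `ee_utils`; the bucket name `my-bucket` is a placeholder.

```python
from typing import Optional


def check_export_config(export: str, bucket: Optional[str]) -> str:
    """Return a readable export destination, or raise ValueError.

    Hypothetical helper for illustration; not part of ee_utils.
    """
    if export == 'drive':
        # Google Drive exports do not need a bucket.
        return 'Google Drive'
    if export == 'gcs':
        if not bucket:
            raise ValueError("EXPORT = 'gcs' requires BUCKET to be set")
        return f'gs://{bucket}'
    raise ValueError(f"EXPORT must be 'drive' or 'gcs', got {export!r}")


print(check_export_config('drive', None))
print(check_export_config('gcs', 'my-bucket'))  # placeholder bucket name
```

Failing early on a bad configuration avoids waiting for an Earth Engine task that could never write its output.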

In [9]:
# ========== DO NOT MODIFY THESE ==========

# input data paths
DHS_CSV_PATH = 'VR_clusters.csv'  # relative path: place this CSV file in the working directory


# band names
MS_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1']

# image parameters
PROJECTION = 'EPSG:3857'  # see https://epsg.io/3857
SCALE = 30                # export resolution: 30m/px
EXPORT_TILE_RADIUS = 127  # image dimension = (2*EXPORT_TILE_RADIUS) + 1 = 255px
!pwd

/Users/machicao/Documents/WDS6_hands_on
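
The image parameters above fix the size of every exported tile. As a quick check of the arithmetic in the comments, the sketch below recomputes the tile side in pixels and its nominal side length in projected metres. The function names are illustrative only.

```python
SCALE = 30                # export resolution in metres per pixel
EXPORT_TILE_RADIUS = 127  # pixels on each side of the centre pixel


def tile_size_px(radius: int) -> int:
    """Side length in pixels: the centre pixel plus `radius` pixels per side."""
    return 2 * radius + 1


def tile_extent_m(radius: int, scale: int) -> int:
    """Nominal side length in projected metres."""
    return tile_size_px(radius) * scale


side_px = tile_size_px(EXPORT_TILE_RADIUS)
side_m = tile_extent_m(EXPORT_TILE_RADIUS, SCALE)
print(f'{side_px} px, {side_m} m ({side_m / 1000:.2f} km)')
# prints: 255 px, 7650 m (7.65 km)
```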


In [10]:
def export_images(df: pd.DataFrame,
                  country: str,
                  year: int,
                  export_folder: str,
                  chunk_size: Optional[int] = None
                  ) -> dict[tuple[str, str, int, int], ee.batch.Task]:
    '''
    Args
    - df: pd.DataFrame, contains columns ['lat', 'lon', 'country', 'year']
    - country: str, together with `year` determines the survey to export
    - year: int, together with `country` determines the survey to export
    - export_folder: str, name of folder for export
    - chunk_size: int, optionally set a limit to the # of images exported per TFRecord file
        - set to a small number (<= 50) if Google Earth Engine reports memory errors

    Returns: dict, maps task name tuple (export_folder, country, year, chunk) to ee.batch.Task
    '''
    subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
    if chunk_size is None:
        chunk_size = len(subset_df)
    num_chunks = int(math.ceil(len(subset_df) / chunk_size))
    tasks = {}

    for i in range(num_chunks):
        chunk_slice = slice(i * chunk_size, (i+1) * chunk_size - 1)  # df.loc[] is inclusive
        fc = ee_utils.df_to_fc(subset_df.loc[chunk_slice, :])
        start_date, end_date = ee_utils.surveyyear_to_range(year)

        # create 3-year Landsat composite image
        roi = fc.geometry()
        imgcol = ee_utils.LandsatSR(roi, start_date=start_date, end_date=end_date).merged
        imgcol = imgcol.map(ee_utils.mask_qaclear).select(MS_BANDS)
        img = imgcol.median()

        # add nightlights, latitude, and longitude bands
        img = ee_utils.add_latlon(img)
        img = img.addBands(ee_utils.composite_nl(year))

        fname = f'{country}_{year}_{i:02d}'
        tasks[(export_folder, country, year, i)] = ee_utils.get_array_patches(
            img=img, scale=SCALE, ksize=EXPORT_TILE_RADIUS,
            points=fc, export=EXPORT,
            prefix=export_folder, fname=fname,
            bucket=BUCKET)
    return tasks
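
To make the chunking in `export_images` easier to follow, here is a minimal sketch that reproduces only the index arithmetic and the file naming, without calling Earth Engine. The helper `chunk_plan` is hypothetical. Unlike the loop above, it clamps the last index to the final row for display and returns an empty plan when there are no rows.

```python
import math
from typing import List, Optional, Tuple


def chunk_plan(num_rows: int, chunk_size: Optional[int],
               country: str, year: int) -> List[Tuple[str, int, int]]:
    """List (file name, first row, last row) per chunk; row bounds are inclusive."""
    if num_rows == 0:
        return []
    if chunk_size is None:
        chunk_size = num_rows  # a single chunk holding every row
    num_chunks = math.ceil(num_rows / chunk_size)
    plan = []
    for i in range(num_chunks):
        first = i * chunk_size
        # Inclusive upper bound, matching the inclusive slicing of df.loc[].
        last = min((i + 1) * chunk_size - 1, num_rows - 1)
        plan.append((f'{country}_{year}_{i:02d}', first, last))
    return plan


for fname, first, last in chunk_plan(120, 50, 'Brazil', 2010):
    print(fname, first, last)
# prints:
# Brazil_2010_00 0 49
# Brazil_2010_01 50 99
# Brazil_2010_02 100 119
```

With `chunk_size=None` the plan contains a single entry covering all rows, which corresponds to one TFRecord file per (country, year).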

In [11]:
tasks: dict[tuple[str, str, int, int], ee.batch.Task] = {}

In [12]:
dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False, sep=';')
dhs_surveys = list(dhs_df.groupby(['country', 'year']).groups.keys())

for country, year in dhs_surveys:
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=DHS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

In [13]:
ee_utils.wait_on_tasks(tasks, poll_interval=60)

  0%|          | 0/1 [00:00<?, ?it/s]

Task ('dhs_tfrecords_raw', 'Brazil', 2010, 0) finished in 0 min with state: COMPLETED
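
The implementation of `ee_utils.wait_on_tasks` is not shown in this notebook. As a rough illustration of what waiting on export tasks involves, the sketch below polls each task's `status()` dictionary until its `'state'` is terminal. It assumes only that each task exposes `status()` returning a dict with a `'state'` key, as `ee.batch.Task` does. The names `poll_until_done` and `FakeTask` are hypothetical, and `FakeTask` stands in for a real task so the example runs without Earth Engine.

```python
import time
from typing import Any, Callable, Dict, Hashable, List

# States after which a task will not change any more.
FINISHED_STATES = {'COMPLETED', 'FAILED', 'CANCELLED'}


def poll_until_done(tasks: Dict[Hashable, Any],
                    poll_interval: float = 60,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> Dict[Hashable, str]:
    """Poll every task until all are finished; return the final state per task."""
    remaining = dict(tasks)
    final_states: Dict[Hashable, str] = {}
    while remaining:
        for name in list(remaining):
            state = remaining[name].status()['state']
            if state in FINISHED_STATES:
                final_states[name] = state
                del remaining[name]
        if remaining:
            sleep(poll_interval)  # wait before asking again
    return final_states


class FakeTask:
    """Stand-in for ee.batch.Task: reports the given states in order."""

    def __init__(self, states: List[str]) -> None:
        self._states = list(states)

    def status(self) -> Dict[str, str]:
        # Step through the states, then keep reporting the last one.
        state = self._states.pop(0) if len(self._states) > 1 else self._states[0]
        return {'state': state}


demo_tasks = {
    ('dhs_tfrecords_raw', 'Brazil', 2010, 0):
        FakeTask(['READY', 'RUNNING', 'COMPLETED']),
}
result = poll_until_done(demo_tasks, poll_interval=0, sleep=lambda s: None)
print(result)
```

Checking the final states matters because a task can end as `FAILED` or `CANCELLED`; only `COMPLETED` means the TFRecord files are ready to download.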
