# Export satellite images

***Note August 15th 2023**: due to constraints imposed by Google Earth Engine on data exports, this script has become excessively slow to run in its current form (estimated >300 h). The notebook below was what we ran to produce the results in our paper, but for the sake of usability we are currently developing a faster, equivalent setup for exporting and loading images. This script will be included alongside the original one as soon as it is ready.*

## Table of Contents
- [Pre-requisites](#pre-requisites)
- [Instructions](#instructions)
- [Imports and initialization](#imports-and-initialization)
- [Prepare the survey data](#prepare-the-survey-data)
- [Export the images](#export-the-images)

## Pre-requisites
Register an account on [Google Earth Engine (GEE)](https://earthengine.google.com/). You will need to provide a Gmail account. Once you have registered, you will need to [sign up for the Google Earth Engine API](https://signup.earthengine.google.com/#!/). This can take a few days to be approved. Once you have been approved, you will be able to use the GEE API.

## Instructions
This notebook exports the Landsat and Nightlight images used as input data for the various models in the project from Google Earth Engine (GEE) to Google Cloud Storage (GCS). The exported images take up about 230 GB of disk space. After the images have been exported to GCS they will have to be downloaded into the data directory as specified in [config.ini](../config.ini).
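
The download itself is not part of this notebook. As a minimal sketch, assuming `gsutil` is installed and authenticated, a recursive copy could be assembled as follows. The bucket, folder and destination are placeholders, and `build_download_command` is a hypothetical helper, not part of `gee_utils`.

```python
import shlex


def build_download_command(bucket, export_folder, data_dir):
    """Build a gsutil command that copies the exported folder into data_dir.

    Hypothetical helper for illustration; not part of gee_utils.
    """
    source = f"gs://{bucket}/{export_folder}"
    # -m runs the copy in parallel, -r copies the folder recursively.
    return ["gsutil", "-m", "cp", "-r", source, data_dir]


# Placeholder values; in practice they would come from config.ini.
command = build_download_command("my-bucket", "data/dhs_tfrecords_raw", "/path/to/data")
print(shlex.join(command))
```

The printed command can be run in a terminal or passed to `subprocess.run(command, check=True)`.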

## Imports and initialization

Import the necessary libraries and config values.

In [1]:
import ee
import pandas as pd
import math
import os
from gee_utils import export_images, wait_on_tasks
import configparser

# Read config file
config = configparser.ConfigParser()
config.read('config.ini')
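
The notebook reads three values from the config file: `DATA_DIR` in the `PATHS` section, and `BUCKET` and `EXPORT_FOLDER` in the `GCS` section. Below is a minimal sketch of a matching file, parsed from a string so that the example is self-contained. All values are placeholders, and the real config.ini may contain further sections.

```python
import configparser

# Placeholder values; replace them with your own paths and bucket name.
EXAMPLE_CONFIG = """
[PATHS]
DATA_DIR = /path/to/data

[GCS]
BUCKET = my-bucket
EXPORT_FOLDER = data/dhs_tfrecords_raw
"""

example = configparser.ConfigParser()
example.read_string(EXAMPLE_CONFIG)

# Option names are case-insensitive in configparser, section names are not.
print(example["PATHS"]["DATA_DIR"])
print(example["GCS"]["BUCKET"], example["GCS"]["EXPORT_FOLDER"])
```

Note that `ConfigParser.read` silently skips files it cannot find, so a `KeyError` on `config['PATHS']` later on usually means that the path passed to `read` is wrong.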

Before using the Earth Engine API, you must authenticate your credentials. Once you have run the following cell, you will be prompted to click on a link and copy a code into the text box. This will authenticate your credentials and allow you to use the Earth Engine API. You only need to do this once, unless prompted to do so again. Make sure that you log in using a Google account which has access to the GCS bucket defined in config.ini.

In [16]:
ee.Authenticate()

Enter verification code: 4/1AY0e-g4mUuTleAp3DqL9Zd8VjdMgm5afGts3TSjPhbkC106p7y9muc91yUU

Successfully saved authorization token.


Initialize the Earth Engine API with the high-volume endpoint. See [here](https://developers.google.com/earth-engine/cloud/highvolume) for more information.

In [17]:
ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

## Prepare the survey data

Read the CSV file with the survey points.

In [None]:
data_dir = config['PATHS']['DATA_DIR']
dhs_cluster_file_path = os.path.join(data_dir, 'dhs_clusters.csv')
df = pd.read_csv(dhs_cluster_file_path)
df.head()

Get a list of all the country-year combinations included in the dataset

In [47]:
surveys = list(df.groupby(['country', 'year']).groups.keys())

To make sure that you have all the permissions, libraries, etc. before starting the full list of tasks, run this test case, which exports 10 randomly sampled clusters (with a fixed seed) from a given survey. It shouldn't take more than 10 minutes.

In [None]:
def test_export(df, country, year):
    test_df = df[(df['country'] == country) & (df['year'] == year)].sample(10, random_state=0)
    test_tasks = export_images(test_df,
                               country=country,
                               year=year,
                               export_folder=config['GCS']['EXPORT_FOLDER'],  # 'data/dhs_tfrecords_raw',
                               export='gcs',
                               bucket=config['GCS']['BUCKET'],
                               ms_bands=['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1'],
                               include_nl=True,
                               start_year=1990,
                               end_year=2020,
                               span_length=3,
                               chunk_size=5)
    wait_on_tasks(test_tasks, poll_interval=60)

test_export(df, surveys[0][0], surveys[0][1])

Note that even if the "wait_on_tasks" method fails, the tasks have still been started and are running in GEE. If the tasks complete successfully (as seen either through wait_on_tasks or in the [GEE editor](https://code.earthengine.google.com/)), you can continue to the next step.
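
Because the exports run on Google's servers, polling them is independent of starting them. The sketch below illustrates a polling loop of that kind. It is not the `wait_on_tasks` implementation from `gee_utils`, and it uses a stand-in `FakeTask` class so that it runs without an Earth Engine account. It assumes that each task has a `status()` method returning a dict with a `state` key, as `ee.batch.Task` does.

```python
import time

FINISHED_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


class FakeTask:
    """Stand-in for an export task; reports each given state in turn."""

    def __init__(self, states):
        self._states = list(states)

    def status(self):
        # Keep returning the last state once the list is exhausted.
        state = self._states.pop(0) if len(self._states) > 1 else self._states[0]
        return {"state": state}


def poll_tasks(tasks, poll_interval=60):
    """Poll until every task is finished and return the final states."""
    final_states = {}
    pending = dict(tasks)
    while pending:
        for name, task in list(pending.items()):
            state = task.status()["state"]
            if state in FINISHED_STATES:
                final_states[name] = state
                del pending[name]
        if pending:
            time.sleep(poll_interval)
    return final_states


demo_tasks = {
    "chunk_0000": FakeTask(["RUNNING", "COMPLETED"]),
    "chunk_0001": FakeTask(["READY", "RUNNING", "FAILED"]),
}
result = poll_tasks(demo_tasks, poll_interval=0)
print(result)
```

If a loop like this is interrupted, nothing is lost: rerunning it picks up the current state of each task.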

## Export the images

This section sets up and queues all the tasks for your survey data. It will take a little while to run, but once it has executed you're done. The exports are then carried out in GEE, and you can monitor them in the [GEE editor](https://code.earthengine.google.com/). Note that some of these tasks will most likely fail with memory errors. When one task fails, the next few (perhaps the next five) are also more likely to fail. It happens seemingly at random, and there's nothing we've been able to do about it. Once everyone is done, I will rerun the exports only for the survey points that are missing from the GCS bucket, so unless a lot of them fail (for Egypt 2014 it was about 10%), this is nothing you need to worry about for now.
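
Finding what to rerun comes down to comparing the chunk numbers you expect with the ones that actually reached the bucket. Here is a minimal sketch with made-up numbers. `missing_chunks` is a hypothetical helper, not part of `gee_utils`, and it assumes that chunks are numbered from 0.

```python
import math


def missing_chunks(n_clusters, chunk_size, exported):
    """Return the sorted chunk numbers that are expected but not exported."""
    expected = range(math.ceil(n_clusters / chunk_size))
    return sorted(set(expected) - set(exported))


# 23 clusters in chunks of 5 give chunks 0 to 4; chunks 1 and 3 failed.
print(missing_chunks(n_clusters=23, chunk_size=5, exported=[0, 2, 4]))
```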

### Start missing downloads

Sometimes the connection aborts before all tasks have been started. I've updated the script to first check which tasks have already been initiated before starting the rest.

In [48]:
latest_tasks = {}
for survey in surveys:
    latest_tasks[survey] = -1

Get the number of the latest task already exported to the GCS bucket for each survey:

In [49]:
for survey in surveys:
    # Use double quotes for the f-string so the single-quoted keys inside it parse on Python < 3.12.
    files_path = f"gs://{config['GCS']['BUCKET']}/{config['GCS']['EXPORT_FOLDER']}/{survey[0]}_{survey[1]}"
    files_in_bucket = !gsutil ls {files_path}*
    if files_in_bucket[-1].startswith(files_path):
        latest_file = files_in_bucket[-1]
        latest_file_nr = int(latest_file[len(files_path)+1:len(files_path)+5])
        latest_tasks[survey] = latest_file_nr
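
The slice on `latest_file` relies on the exported file names starting with `{country}_{year}_` followed by a four-digit chunk number. Here is a worked example on a hypothetical file name (the bucket name and the `.tfrecord.gz` extension are assumptions):

```python
# Hypothetical prefix and file name, following the pattern the cell above expects.
files_path = "gs://my-bucket/data/dhs_tfrecords_raw/madagascar_2020"
latest_file = files_path + "_1304.tfrecord.gz"

# Skip the underscore after the prefix, then read the four digits.
start = len(files_path) + 1
latest_file_nr = int(latest_file[start:start + 4])
print(latest_file_nr)  # 1304
```

The cell treats the last listed file as the highest-numbered one, which holds as long as the listing is sorted by name and the chunk numbers are zero-padded.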

In [50]:
print('Latest tasks already in bucket:\n', latest_tasks)

Latest tasks already in bucket:
 {('madagascar', 2020): 1304, ('ethiopia', 2020): 1005}


Get the latest task started in GEE for each survey:

In [51]:
# Get task list from GEE
gee_tasks = !earthengine task list

# Loop over these tasks. Save the latest in "latest_tasks" if it's higher than what is already in the GCS bucket.
for line in gee_tasks:
    if 'Export.table' in line:
        task = line.split()[2]
        survey_string = task.split('_')[:2]
        survey = (survey_string[0], int(survey_string[1]))
        if survey not in surveys:
            continue
        task_nr = int(task.split('_')[2][:4])
        if task_nr > latest_tasks[survey]:
            latest_tasks[survey] = task_nr
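
To make the string handling above concrete, here is the same parsing applied to a single, hypothetical line of `earthengine task list` output. The task ID is made up, and the column layout (task ID, task type, description, state, error message) is an assumption based on how the loop indexes the line.

```python
# Hypothetical output line: task id, task type, description, state, error message.
line = "ABCDEFGHIJKLMNOPQRSTUVWX  Export.table  madagascar_2020_1304  COMPLETED  ---"

task = line.split()[2]                      # 'madagascar_2020_1304'
country, year = task.split("_")[:2]
survey = (country, int(year))               # ('madagascar', 2020)
task_nr = int(task.split("_")[2][:4])       # 1304
print(survey, task_nr)
```

This parsing assumes that country names contain no underscores; a name with one would shift the fields and the year conversion would fail.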

In [52]:
print('Latest tasks already started in GEE:\n', latest_tasks)

Latest tasks already started in GEE:
 {('madagascar', 2020): 1304, ('ethiopia', 2020): 1005}


Start the remaining tasks. If the connection is aborted before all tasks have started, please rerun this section of the notebook (starting from "Export the images").

In [None]:
chunk_size = 5
all_tasks = dict()

for survey in surveys:
    last_started = latest_tasks[survey]
    survey_df = df[(df['country'] == survey[0]) & (df['year'] == survey[1])]
    expected_nr_of_tasks = int(math.ceil(len(survey_df) / chunk_size))
    if last_started < expected_nr_of_tasks - 1:
        # Some tasks have not been started yet. Start them here:
        country = survey[0]
        year = survey[1]
        already_in_bucket = list(range(last_started + 1))
        survey_tasks = export_images(df,
                                     country=country,
                                     year=year,
                                     export_folder=config['GCS']['EXPORT_FOLDER'],
                                     export='gcs',
                                     bucket=config['GCS']['BUCKET'],
                                     ms_bands=['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1'],
                                     include_nl=True,
                                     start_year=1990,
                                     end_year=2020,
                                     span_length=3,
                                     chunk_size=chunk_size,
                                     already_in_bucket=already_in_bucket)
        all_tasks.update(survey_tasks)
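
Here is a worked example of the resume arithmetic with made-up numbers. The `remaining` list reflects an assumption about `export_images`, namely that it skips the chunk numbers passed in `already_in_bucket`.

```python
import math

chunk_size = 5
n_clusters = 23          # hypothetical number of clusters in one survey
last_started = 2         # chunks 0, 1 and 2 already exist or are queued

expected_nr_of_tasks = math.ceil(n_clusters / chunk_size)   # 5 chunks: 0 to 4
already_in_bucket = list(range(last_started + 1))           # [0, 1, 2]
remaining = [i for i in range(expected_nr_of_tasks) if i not in already_in_bucket]

print(expected_nr_of_tasks, already_in_bucket, remaining)
```

For a survey where nothing has been started, `last_started` is -1, so `already_in_bucket` is empty and every chunk is queued. The logic resumes from the highest chunk number seen and does not detect gaps below it.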

If you prefer, you can also monitor the tasks in the notebook with the "wait_on_tasks" method.

In [None]:
wait_on_tasks(all_tasks, poll_interval=60)