<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pre-requisites" data-toc-modified-id="Pre-requisites-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pre-requisites</a></span></li><li><span><a href="#Instructions" data-toc-modified-id="Instructions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Instructions</a></span></li><li><span><a href="#Imports-and-Constants" data-toc-modified-id="Imports-and-Constants-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports and Constants</a></span></li><li><span><a href="#Constants" data-toc-modified-id="Constants-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Constants</a></span></li><li><span><a href="#Export-Images" data-toc-modified-id="Export-Images-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Export Images</a></span></li></ul></div>

## Pre-requisites
Register a Google account for Earth Engine at [https://code.earthengine.google.com](https://code.earthengine.google.com). Approval may take a couple of days. Without a registered account, the `ee.Initialize()` call below raises an error.

## Instructions

This notebook exports Landsat satellite image composites from Google Earth Engine. The images are saved in gzipped TFRecord format (`*.tfrecord.gz`). The exported images take up substantial storage (see the table below), so make sure enough space is available before exporting.

In this project, we download satellite images corresponding to three different datasets:

- **DHS**: 19,669 clusters from DHS surveys, for which we predict cross-sectional (*i.e.*, static in time) cluster-level asset wealth
- **DHSNL**: 260,415 locations sampled near DHS survey locations, for which we train transfer learning models to predict nightlights values
- **LSMS**: 2,913 clusters from LSMS surveys, for which we predict changes in cluster-level asset wealth over time

| Dataset | Storage  | Expected Export Time
|---------|----------|---------------------
| DHS     | ~16.0 GB | ~24h
| LSMS    |  ~2.5 GB | ~10h
| DHSNL   |  ~240 GB | ~72h

By default, this notebook exports images to Google Drive. If you instead prefer to export images to Google Cloud Storage (GCS), change the `EXPORT` constant below to `'gcs'` and set `BUCKET` to the desired GCS bucket name. The images are exported to the following locations:

| Dataset | Google Drive (default) | GCS
|---------|:-----------------------|:---
| DHS     | `dhs_tfrecords_raw/`   | `{BUCKET}/dhs_tfrecords_raw/`
| DHSNL   | `dhsnl_tfrecords_raw/` | `{BUCKET}/dhsnl_tfrecords_raw/`
| LSMS    | `lsms_tfrecords_raw/`  | `{BUCKET}/lsms_tfrecords_raw/`
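
The mapping in the table above can be sketched as a small helper. `export_destination` is a hypothetical name introduced here for illustration; it is not part of `ee_utils_building`, and it only mirrors the table rather than how the export utilities actually build their paths.

```python
from typing import Optional


def export_destination(export: str, folder: str, bucket: Optional[str] = None) -> str:
    """Return the location that exported TFRecords land in, per the table above.

    Hypothetical helper, for illustration only.
    """
    if export == 'drive':
        return f'{folder}/'
    if export == 'gcs':
        if not bucket:
            raise ValueError("BUCKET must be set when EXPORT == 'gcs'")
        return f'{bucket}/{folder}/'
    raise ValueError(f'unknown export target: {export!r}')


print(export_destination('drive', 'dhs_tfrecords_raw'))
print(export_destination('gcs', 'lsms_tfrecords_raw', bucket='mybucket'))
```

Raising on a missing bucket mirrors the requirement that `BUCKET` be set whenever `EXPORT` is `'gcs'`.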

Once the images have finished exporting, download the exported TFRecord files to the following folders:

- DHS: `data/dhs_tfrecords_raw/`
- DHSNL: `data/dhsnl_tfrecords_raw/`
- LSMS: `data/lsms_tfrecords_raw/`

After downloading the TFRecord files, the `data/` directory should look as follows, where `XX` depends on the `CHUNK_SIZE` parameter used:

```
data/
    dhs_tfrecords_raw/
        angola_2011_00.tfrecord.gz
        ...
        zimbabwe_2015_XX.tfrecord.gz
    dhsnl_tfrecords_raw/
        angola_2010_00.tfrecord.gz
        ...
        zimbabwe_2016_XX.tfrecord.gz
    lsms_tfrecords_raw/
        ethiopia_2011_00.tfrecord.gz
        ...
        uganda_2013_XX.tfrecord.gz
```
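
As a rough sketch of how `XX` follows from `CHUNK_SIZE`: each (country, year) survey is split into `ceil(n / CHUNK_SIZE)` files, numbered from `00`. The helper name `expected_filenames` and the cluster count used below are made up for illustration.

```python
import math
from typing import List, Optional


def expected_filenames(country: str, year: int, num_clusters: int,
                       chunk_size: Optional[int] = None) -> List[str]:
    """List the TFRecord file names expected for one (country, year) survey."""
    if num_clusters == 0:
        return []
    if chunk_size is None:
        chunk_size = num_clusters  # a single file holds every cluster
    num_chunks = math.ceil(num_clusters / chunk_size)
    return [f'{country}_{year}_{i:02d}.tfrecord.gz' for i in range(num_chunks)]


# 230 is an illustrative cluster count, not taken from the dataset
names = expected_filenames('angola', 2011, num_clusters=230, chunk_size=50)
print(names[0], names[-1], len(names))
```

Comparing such a list against the contents of the download folder is one way to spot missing chunks before moving on to the next notebook.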

After finishing this notebook, move on to [1_process_tfrecords.ipynb](./1_process_tfrecords.ipynb) for next steps.

## Imports and Constants

In [1]:
%load_ext autoreload
%autoreload 2

# change directory to repo root, and verify
%cd '../'
!pwd

/atlas/u/erikrozi/rural-urban-Modelbias
/atlas/u/erikrozi/rural-urban-Modelbias


In [None]:
from __future__ import annotations

import math
from typing import Any, Optional

import ee
import pandas as pd

from preprocessing import ee_utils_building

ModuleNotFoundError: No module named 'ee'

Before using the Earth Engine API, you must perform a one-time authentication that authorizes access to Earth Engine on behalf of the Google account you registered at [https://code.earthengine.google.com](https://code.earthengine.google.com). The authentication process saves a credentials file to `$HOME/.config/earthengine/credentials` for future use.

The `ee.Authenticate()` command below runs the authentication process. After authenticating successfully, you may comment out this command; you should not need to authenticate again unless you delete the credentials file. If you do not authenticate, the subsequent `ee.Initialize()` call will fail.

For more information, see [https://developers.google.com/earth-engine/python_install-conda.html](https://developers.google.com/earth-engine/python_install-conda.html).
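
To avoid re-running the interactive flow, one option is to check for the credentials file first. The sketch below assumes the default credentials path mentioned above; `credentials_path` and `needs_authentication` are hypothetical helpers, and the `ee` calls are left in a comment so the block runs without the Earth Engine package installed.

```python
import os


def credentials_path() -> str:
    """Default location of the Earth Engine credentials file, as described above."""
    return os.path.join(os.path.expanduser('~'), '.config', 'earthengine', 'credentials')


def needs_authentication(path: str = '') -> bool:
    """Return True if no credentials file exists at `path` (default location if empty)."""
    return not os.path.isfile(path or credentials_path())


print('authentication needed:', needs_authentication())

# In the notebook, this could guard the authentication cell:
# if needs_authentication():
#     ee.Authenticate()
# ee.Initialize()
```

This only checks that the file exists; it does not verify that the stored token is still valid.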

In [3]:
ee.Authenticate()

Enter verification code:  [redacted]



Successfully saved authorization token.


In [3]:
ee.Initialize()  # initialize the Earth Engine API

## Constants

In [4]:
# ========== ADAPT THESE PARAMETERS ==========

# To export to Google Drive (default), keep the next 2 lines uncommented
EXPORT = 'drive'
BUCKET = None

# To export to Google Cloud Storage (GCS), uncomment the next 2 lines
# and set the bucket to the desired bucket name
# EXPORT = 'gcs'
# BUCKET = 'mybucket'

# export location parameters
#DHS_EXPORT_FOLDER = 'dhs_tfrecords_raw'
DHS_EXPORT_FOLDER = 'dhs_pixelLonLat'
DHSNL_EXPORT_FOLDER = 'dhsnl_tfrecords_raw'
LSMS_EXPORT_FOLDER = 'lsms_tfrecords_raw'

# Set CHUNK_SIZE to None to export a single TFRecord file per (country, year). However,
# this may fail if it exceeds Google Earth Engine memory limits. Decrease CHUNK_SIZE
# to a small number (<= 50, sometimes as low as 5) until Google Earth Engine stops 
# reporting memory errors
CHUNK_SIZE = None

In [5]:
# ========== DO NOT MODIFY THESE ==========

# input data paths
DHS_CSV_PATH = 'data/dhs_clusters.csv'
DHSNL_CSV_PATH = 'data/dhsnl_locs.csv'
LSMS_CSV_PATH = 'data/lsms_clusters.csv'

# band names
MS_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1']

# image parameters
PROJECTION = 'EPSG:3857'  # see https://epsg.io/3857
SCALE = 30                # export resolution: 30m/px
EXPORT_TILE_RADIUS = 127  # image dimension = (2*EXPORT_TILE_RADIUS) + 1 = 255px
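
The image parameters above imply a fixed nominal footprint per exported tile. As a quick worked check (nominal size only, since distances in `EPSG:3857` are distorted away from the equator):

```python
SCALE = 30                # export resolution in m/px
EXPORT_TILE_RADIUS = 127  # pixels on each side of the centre pixel

tile_px = 2 * EXPORT_TILE_RADIUS + 1   # centre pixel plus the radius on both sides
tile_m = tile_px * SCALE               # nominal side length in metres

print(f'{tile_px} x {tile_px} px, about {tile_m / 1000:.2f} km per side')
# prints "255 x 255 px, about 7.65 km per side"
```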

## Export Images

In [6]:
def export_images(df: pd.DataFrame,
                  country: str,
                  year: int,
                  export_folder: str,
                  chunk_size: Optional[int] = None
                  ) -> dict[tuple[str, str, int, int], ee.batch.Task]:
    '''
    Args
    - df: pd.DataFrame, contains columns ['lat', 'lon', 'country', 'year']
    - country: str, together with `year` determines the survey to export
    - year: int, together with `country` determines the survey to export
    - export_folder: str, name of folder for export
    - chunk_size: int, optionally set a limit to the # of images exported per TFRecord file
        - set to a small number (<= 50) if Google Earth Engine reports memory errors

    Returns: dict, maps task name tuple (export_folder, country, year, chunk) to ee.batch.Task
    '''
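    subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
    if len(subset_df) == 0:
        return {}  # no rows for this (country, year); avoids dividing by a chunk size of 0
    if chunk_size is None:
        chunk_size = len(subset_df)
    num_chunks = int(math.ceil(len(subset_df) / chunk_size))
    tasks = {}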
    for i in range(num_chunks):
        chunk_slice = slice(i * chunk_size, (i+1) * chunk_size - 1)  # df.loc[] is inclusive
        fc = ee_utils_building.df_to_fc(subset_df.loc[chunk_slice, :])
        # create africa building image
        #obj=ee_utils_building.AfricaBuildings(prop='area_in_meters')
        #obj=AfricaBuildings(prop='confidence')
        ##add confidence property??
        #obj.add_layer(prop='confidence')
        #img = obj.t_img
        #print(img)
        # add latitude, and longitude bands?
        #img = ee_utils_building.add_latlon(img)
        #img = ee.Image.pixelLonLat().select(['longitude', 'latitude']).rename(['long','lat'])
        # image whose two bands ('longitude', 'latitude') hold each pixel's coordinates in degrees
        img = ee.Image.pixelLonLat()
        fname = f'{country}_{year}_{i:02d}'
        tasks[(export_folder, country, year, i)] = ee_utils_building.get_array_patches(
            img=img, scale=SCALE, ksize=EXPORT_TILE_RADIUS,
            points=fc, export=EXPORT,
            prefix=export_folder, fname=fname,
            bucket=BUCKET)
    return tasks

In [7]:
tasks: dict[tuple[str, str, int, int], ee.batch.Task] = {}

In [8]:
dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False)
dhs_surveys = list(dhs_df.groupby(['country', 'year']).groups.keys())
for country, year in dhs_surveys:
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=DHS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

In [9]:
dhs_df

Unnamed: 0,country,year,lat,lon,GID_1,GID_2,wealthpooled,households,urban_rural
0,angola,2011,-12.350257,13.534922,AGO.2,AGO.2.9,2.595618,36,1
1,angola,2011,-12.360865,13.551494,AGO.2,AGO.2.9,2.209620,32,1
2,angola,2011,-12.613421,13.413085,AGO.2,AGO.2.3,0.906469,36,1
3,angola,2011,-12.581454,13.397711,AGO.2,AGO.2.3,1.105359,35,1
4,angola,2011,-12.578135,13.418748,AGO.2,AGO.2.3,1.879344,37,1
...,...,...,...,...,...,...,...,...,...
19664,zimbabwe,2015,-17.915288,31.156115,ZWE.2,ZWE.2.1,0.237659,24,1
19665,zimbabwe,2015,-18.379501,31.872287,ZWE.3,ZWE.3.4,0.492502,25,0
19666,zimbabwe,2015,-16.660612,29.850649,ZWE.6,ZWE.6.2,-0.088922,28,0
19667,zimbabwe,2015,-17.914251,30.956975,ZWE.2,ZWE.2.1,1.613829,25,1


Check on the status of each export task at [https://code.earthengine.google.com/](https://code.earthengine.google.com/), or run the following cell, which polls the tasks at the interval set by `poll_interval`. Once all tasks have completed, download the DHS TFRecord files to `data/dhs_tfrecords_raw/`, DHSNL TFRecord files to `data/dhsnl_tfrecords_raw/`, and LSMS TFRecord files to `data/lsms_tfrecords_raw/`.

In [20]:
ee_utils_building.wait_on_tasks(tasks, poll_interval=30)

  0%|          | 0/285 [00:00<?, ?it/s]

Task ('dhs_tfrecords_raw', 'angola', 2011, 0) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 1) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 2) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 3) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 4) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 5) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 6) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 7) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 8) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'angola', 2011, 9) finished in 0 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'mozambique', 2011, 0) finished in 2 min with state: COMPLETED
Task ('dhs_tfrecords_raw', 'mozambique', 2011, 1) 
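
For a one-off status summary rather than a blocking wait, the task states can be tallied directly. This sketch relies only on `ee.batch.Task.status()` returning a dict with a `'state'` entry; `summarize_states` and the `_FakeTask` stand-in are hypothetical and exist so the example runs without Earth Engine.

```python
from collections import Counter
from typing import Any, Dict


def summarize_states(tasks: Dict[Any, Any]) -> Counter:
    """Count tasks by their current state (e.g. READY, RUNNING, COMPLETED, FAILED)."""
    return Counter(task.status()['state'] for task in tasks.values())


class _FakeTask:
    """Stand-in for ee.batch.Task, used only to make this sketch runnable offline."""

    def __init__(self, state: str) -> None:
        self._state = state

    def status(self) -> Dict[str, str]:
        return {'state': self._state}


demo_tasks = {
    ('dhs_tfrecords_raw', 'angola', 2011, 0): _FakeTask('COMPLETED'),
    ('dhs_tfrecords_raw', 'angola', 2011, 1): _FakeTask('RUNNING'),
    ('dhs_tfrecords_raw', 'angola', 2011, 2): _FakeTask('COMPLETED'),
}
print(summarize_states(demo_tasks))
```

In the notebook, `summarize_states(tasks)` could be called on the `tasks` dict built above. Each call to `status()` on a real task queries Earth Engine, so a summary over hundreds of tasks may take a while.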

In [18]:
dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False)
dhs_surveys = [('angola', 2011)]
for country, year in dhs_surveys:
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=DHS_EXPORT_FOLDER, chunk_size=5)
    tasks.update(new_tasks)