# To-Dos in this Notebook
- Consider deleting old data (year < 2013)
- Fix the export problems (export to Sciebo is not possible; Earth Engine only supports export to Drive, Google Cloud Storage, and Earth Engine assets)
- Preprocessing must continue before the export
    - Normalize the data
    - Interpolation should not be necessary in my opinion, or rather is already handled by specifying the pixel scale

# Prerequisites

The images are saved in gzipped TFRecord format. By default, this notebook exports images to Google Drive. If you instead prefer to export images to Google Cloud Storage (GCS), change the `EXPORT` constant below to `'gcs'` and set `BUCKET` to the desired GCS bucket name.
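To check a downloaded file without installing TensorFlow, the records in a gzipped TFRecord file can be counted with the standard library alone. The following is a minimal sketch, assuming the standard TFRecord framing (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload). The function name `count_tfrecords` is hypothetical, and the checksums are skipped rather than verified.

```python
import gzip
import struct


def count_tfrecords(path: str) -> int:
    """Count the records in a gzipped TFRecord file without parsing payloads.

    The CRC fields are read and discarded, not verified.
    """
    count = 0
    with gzip.open(path, 'rb') as f:
        while True:
            header = f.read(12)  # 8-byte record length + 4-byte CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError(f'truncated record header in {path}')
            (length,) = struct.unpack('<Q', header[:8])
            payload = f.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(payload) < length + 4:
                raise ValueError(f'truncated record payload in {path}')
            count += 1
    return count
```

With one image per survey cluster, the count for a file should match the number of clusters in the corresponding chunk.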

## Expected Storage Size of Landsat Data

| Survey | Storage  | Expected Export Time |
| :-: | :-: | :-: |
| DHS  | ~16.0 GB | ~24h |
| LSMS |  ~2.5 GB | ~10h |

## Expected Storage Size of Sentinel Data

Sentinel imagery has 3 times the resolution of Landsat imagery (10 m/px instead of 30 m/px), so the estimates below scale the Landsat figures by a factor of 3.

| Survey | Storage  | Expected Export Time |
| :-: | :-: | :-: |
| DHS  | ~48.0 GB | ~72h |
| LSMS |  ~7.5 GB | ~30h |


The folder structure should look as follows:

```
data/
    dhs_tfrecords_raw/
        angola_2011_00.tfrecord.gz
        ...
        zimbabwe_2015_00.tfrecord.gz
    lsms_tfrecords_raw/
        ethiopia_2011_00.tfrecord.gz
        ...
        uganda_2013_00.tfrecord.gz
```
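The file names in this layout follow the pattern `{country}_{year}_{chunk:02d}`, with one file per chunk of clusters. As a rough sketch, the expected names for one survey could be listed as follows. The helper `expected_filenames` and the parameter `num_clusters` are hypothetical, and the `.tfrecord.gz` suffix is assumed from the listing above.

```python
import math
from typing import List, Optional


def expected_filenames(country: str, year: int, num_clusters: int,
                       chunk_size: Optional[int] = None) -> List[str]:
    """List the TFRecord file names that one survey is expected to produce."""
    if chunk_size is None:
        num_chunks = 1  # no chunking: the whole survey goes into one file
    else:
        num_chunks = math.ceil(num_clusters / chunk_size)
    return [f'{country}_{year}_{i:02d}.tfrecord.gz' for i in range(num_chunks)]
```

Comparing this list with the contents of the export folder is a quick way to spot chunks that failed to export.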

## Imports

In [None]:
import math
from typing import Any, Dict, Optional, Tuple

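import ee
import pandas as pd

import ee_utils  # assumption: project-local helper module used below; adjust the import to your repo layout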

In [None]:
ee.Authenticate()

In [None]:
ee.Initialize()  # initialize the Earth Engine API

## Constants

In [None]:
# ========== ADAPT THESE PARAMETERS ==========

# Export to Google Drive (default); comment out the next 2 lines to switch to GCS
EXPORT = 'drive'
BUCKET = None

# To export to Google Cloud Storage (GCS), uncomment the next 2 lines
# and set the bucket to the desired bucket name
# EXPORT = 'gcs'
# BUCKET = 'mybucket'

# export location parameters
DHS_EXPORT_FOLDER = 'dhs_tfrecords_raw'
LSMS_EXPORT_FOLDER = 'lsms_tfrecords_raw'
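Because `EXPORT` and `BUCKET` have to be changed together, a small sanity check before any task is started can catch a mismatch early. This is only a sketch; the function `validate_export_config` is hypothetical and not part of the notebook or of the Earth Engine API.

```python
from typing import Optional


def validate_export_config(export: str, bucket: Optional[str]) -> None:
    """Raise ValueError if the export target and the bucket do not fit together."""
    if export not in ('drive', 'gcs'):
        raise ValueError(f"EXPORT must be 'drive' or 'gcs', got {export!r}")
    if export == 'gcs' and not bucket:
        raise ValueError("BUCKET must be set when EXPORT is 'gcs'")
```

Calling `validate_export_config(EXPORT, BUCKET)` right after the constants cell would fail fast instead of after the first export task has been queued.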

## Possible Sentinel-2 Bands
- 10 m/px
    - 'BLUE' (B2)
    - 'GREEN' (B3)
    - 'RED' (B4)
    - 'NIR' (B8)
- 20 m/px
    - 'RED EDGE' (B5, B6, B7)
    - 'NARROW NIR' (B8A)
    - 'SWIR' (B11, B12)
- 60 m/px
    - 'COASTAL AEROSOL' (B1)
    - 'WATER VAPOUR' (B9)
    - 'CIRRUS' (B10)
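For picking bands by resolution, the band overview can be kept as a small lookup table. The sketch below follows the Sentinel-2 MSI band layout (B5 to B7 are red-edge bands, B8A is the narrow NIR band, B9 is the water-vapour band). The short labels and the names `S2_BANDS` and `bands_at` are hypothetical and are not used elsewhere in this notebook.

```python
from typing import Dict, List, Tuple

# (label, native resolution in m/px) per Sentinel-2 band ID; labels are illustrative
S2_BANDS: Dict[str, Tuple[str, int]] = {
    'B1': ('COASTAL AEROSOL', 60),
    'B2': ('BLUE', 10),
    'B3': ('GREEN', 10),
    'B4': ('RED', 10),
    'B5': ('RED EDGE 1', 20),
    'B6': ('RED EDGE 2', 20),
    'B7': ('RED EDGE 3', 20),
    'B8': ('NIR', 10),
    'B8A': ('NARROW NIR', 20),
    'B9': ('WATER VAPOUR', 60),
    'B10': ('CIRRUS', 60),
    'B11': ('SWIR1', 20),
    'B12': ('SWIR2', 20),
}


def bands_at(resolution: int) -> List[str]:
    """Return the band IDs whose native resolution equals `resolution` (m/px)."""
    return [band for band, (_, res) in S2_BANDS.items() if res == resolution]
```

For example, `bands_at(10)` returns the four bands that are natively available at the 10 m/px export scale.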

In [None]:
# input data paths
DHS_CSV_PATH = '../data/dhs_wealthindex.csv'
# LSMS_CSV_PATH = '../data/lsms_clusters.csv'  # uncomment once the LSMS survey data is available

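# band names as selected from the Landsat composite; adapt for Sentinel-2, which has no thermal band such as 'TEMP1'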
MS_BANDS = ['BLUE', 'GREEN', 'RED', 'NIR', 'SWIR1', 'SWIR2', 'TEMP1']

# image export parameters
SCALE = 10                # export resolution: 10m/px
EXPORT_TILE_RADIUS = 335  # image dimension = (2*EXPORT_TILE_RADIUS) + 1 = 671px = 6710m
CHUNK_SIZE = 1000         # set to a small number (<= 50) if Google Earth Engine reports memory errors; the final value is still to be discussed
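The relation between `EXPORT_TILE_RADIUS`, `SCALE`, and the resulting image size can be checked with a few lines. This is a sketch only; the helper `tile_size` is hypothetical.

```python
from typing import Tuple


def tile_size(radius_px: int, scale_m: float) -> Tuple[int, float]:
    """Return (side length in pixels, side length in meters) of a square tile."""
    side_px = 2 * radius_px + 1  # center pixel plus `radius_px` pixels on each side
    return side_px, side_px * scale_m
```

With the values used here, `tile_size(335, 10)` gives `(671, 6710)`, matching the comment next to `EXPORT_TILE_RADIUS`.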

## Export Images

In [None]:
def export_images(
        df: pd.DataFrame,
        country: str,
        year: int,
        export_folder: str,
        chunk_size: Optional[int] = None,
        ) -> Dict[Tuple[Any, ...], ee.batch.Task]:
    '''
    Args
    - df: pd.DataFrame, contains columns ['lat', 'lon', 'country', 'year']
    - country: str, together with `year` determines the survey to export
    - year: int, together with `country` determines the survey to export
    - export_folder: str, name of folder for export
    - chunk_size: int, optionally set a limit to the # of images exported per TFRecord file
        - set to a small number (<= 50) if Google Earth Engine reports memory errors

    Returns: dict, maps task name tuple (export_folder, country, year, chunk) to ee.batch.Task
    '''
    subset_df = df[(df['country'] == country) & (df['year'] == year)].reset_index(drop=True)
    if chunk_size is None:
        # no limit given: export the whole survey as a single chunk
        chunk_size = max(len(subset_df), 1)
    num_chunks = int(math.ceil(len(subset_df) / chunk_size))
    tasks = {}

    for i in range(num_chunks):
        chunk_slice = slice(i * chunk_size, (i+1) * chunk_size - 1)  # df.loc[] is inclusive
        fc = ee_utils.df_to_fc(subset_df.loc[chunk_slice, :])
        start_date, end_date = ee_utils.surveyyear_to_range(year)

        # TODO: build a cloud-masked Sentinel composite here instead of LandsatSR
        # roi = fc.geometry()
        # imgcol = ee_utils.LandsatSR(roi, start_date=start_date, end_date=end_date).merged
        # imgcol = imgcol.select(MS_BANDS)
        # img = imgcol.median()

        # add nightlights, latitude, and longitude bands
        # img = ee_utils.add_latlon(img)
        # img = img.addBands(ee_utils.composite_nl(year))

        fname = f'{country}_{year}_{i:02d}'
        # The export does not work as planned yet: Earth Engine only supports export
        # to Drive, GCS, and Earth Engine assets. Until the call below is re-enabled,
        # `tasks` stays empty.
        # tasks[(export_folder, country, year, i)] = ee_utils.get_array_patches(
        #     img=img, scale=SCALE, ksize=EXPORT_TILE_RADIUS,
        #     points=fc, export=EXPORT,
        #     prefix=export_folder, fname=fname,
        #     bucket=BUCKET)
    return tasks
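The chunking in `export_images` relies on inclusive label slicing, which is easy to get wrong by one. The bounds can be sketched independently of pandas and Earth Engine as follows. The helper `chunk_bounds` is hypothetical; it treats `chunk_size=None` as "one chunk for the whole survey".

```python
import math
from typing import List, Optional, Tuple


def chunk_bounds(num_rows: int,
                 chunk_size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) row labels for each chunk."""
    if num_rows == 0:
        return []  # nothing to export
    if chunk_size is None:
        chunk_size = num_rows  # a single chunk covering every row
    num_chunks = math.ceil(num_rows / chunk_size)
    return [(i * chunk_size, min((i + 1) * chunk_size, num_rows) - 1)
            for i in range(num_chunks)]
```

For a survey with 2500 clusters and `chunk_size=1000`, this yields `[(0, 999), (1000, 1999), (2000, 2499)]`, i.e. three TFRecord files with suffixes `_00`, `_01`, and `_02`.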

In [None]:
# Imagery download based on the DHS survey
# does not work right now because of the export problems
dhs_df = pd.read_csv(DHS_CSV_PATH, float_precision='high', index_col=False)
dhs_surveys = list(dhs_df.groupby(['country', 'year']).groups.keys())
tasks = {}

for country, year in dhs_surveys:
    new_tasks = export_images(
        df=dhs_df, country=country, year=year,
        export_folder=DHS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)

In [None]:
# Imagery download based on the LSMS survey
# does not work right now: the LSMS survey data is missing (LSMS_CSV_PATH is commented out above)
# and the export problems are unresolved
lsms_df = pd.read_csv(LSMS_CSV_PATH, float_precision='high', index_col=False)
lsms_surveys = list(lsms_df.groupby(['country', 'year']).groups.keys())

for country, year in lsms_surveys:
    new_tasks = export_images(
        df=lsms_df, country=country, year=year,
        export_folder=LSMS_EXPORT_FOLDER, chunk_size=CHUNK_SIZE)
    tasks.update(new_tasks)