# Data Pipeline
The data pipeline involves three discrete steps:
1. pulling the data from the bucket
2. extracting the swath data
3. extracting the sample tiles

## Libraries
Two custom modules are used: `create_modis` contains the HDF-to-NumPy array conversion code, and `GCP_Tools` contains all of the bucket interaction code. Both live in the parent directory, so `sys.path` is extended with the parent folder for the duration of the session.

% ToDo dependencies

In [82]:
import sys
sys.path.append("..")
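
A relative entry such as `".."` is resolved against the current working directory, so the cell above only works when the notebook is launched from its own folder. A slightly more defensive sketch resolves the parent folder to an absolute path and avoids adding it twice (the name `parent_dir` is illustrative, not part of the project):

```python
import sys
from pathlib import Path

# Resolve the parent of the current working directory to an absolute path.
parent_dir = str(Path.cwd().parent)

# Append only once, so re-running the cell does not keep growing sys.path.
if parent_dir not in sys.path:
    sys.path.append(parent_dir)
```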

In [7]:
import create_modis
from GCP_Tools import GcpDataTools

## Data Download
We need a folder of `.hdf` MODIS files to develop and trial our pipeline. Our `GCP_Tools` module handles the downloading for us: we just need to initialise it, provide our connection information, and define the folder we want and where to save it.

We're using a small JSON to provide our connection details, which looks like this:

```json
{
    "project_name": "fdl-europe-atmosphere-team",
    "bucket_name": "fdl-europe-atmos-bucket"
}
```
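
How `GcpDataTools.connect()` reads this file is not shown here. As a minimal sketch, assuming it only needs the two keys above, the file could be loaded and checked like this (`load_connection_details` and `REQUIRED_KEYS` are hypothetical names used for illustration):

```python
import json
import tempfile
from pathlib import Path

# The two keys shown in the JSON above.
REQUIRED_KEYS = {"project_name", "bucket_name"}

def load_connection_details(path):
    """Read the JSON file and check that both expected keys are present."""
    with open(path, encoding="utf-8") as handle:
        details = json.load(handle)
    missing = REQUIRED_KEYS - set(details)
    if missing:
        raise KeyError(f"missing connection keys: {sorted(missing)}")
    return details

# Demonstration with a throwaway file that mirrors the JSON above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "atmos_bucket_details.json"
    demo_path.write_text(json.dumps({
        "project_name": "fdl-europe-atmosphere-team",
        "bucket_name": "fdl-europe-atmos-bucket",
    }), encoding="utf-8")
    details = load_connection_details(demo_path)
```

Failing early on a missing key gives a clearer error than a failed connection attempt later on.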

First, we initialise the wrapper and pass it the connection details:

In [8]:
gcp_wrapper = GcpDataTools()
gcp_wrapper.connect("../atmos_bucket_details.json")

connection to fdl_europe_atmos_bucket on fdl-europe-atmosphere established


Next, we drill down through the bucket structure to find a suitable directory of MODIS data. Note that without a path argument, `list_directory()` lists the bucket's top-level directory.

In [9]:
gcp_wrapper.list_directory()

Your path is fdl_europe_atmos_bucket/


['fdl_europe_atmos_bucket/mod06-aux/',
 'fdl_europe_atmos_bucket/Fabri_Tests/',
 'fdl_europe_atmos_bucket/peruvian_sc/',
 'fdl_europe_atmos_bucket/modis-l1-aqua/',
 'fdl_europe_atmos_bucket/results/',
 'fdl_europe_atmos_bucket/modis-l1/',
 'fdl_europe_atmos_bucket/naip_trained.ckpt']
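
Cloud Storage buckets have a flat namespace, so "directories" like the ones listed above are emulated by splitting object names on a prefix and a delimiter. The internals of `GcpDataTools.list_directory()` are not shown in this notebook. The sketch below illustrates the general idea on a plain list of object names, without touching the network (the function and the object names are made up for illustration):

```python
def list_directory(object_names, prefix="", delimiter="/"):
    """Return the immediate children of `prefix` in a flat list of object names."""
    if prefix and not prefix.endswith(delimiter):
        prefix += delimiter
    children = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if not remainder:
            continue
        head, sep, _ = remainder.partition(delimiter)
        # "Folders" keep their trailing delimiter; plain objects do not.
        children.add(prefix + head + sep)
    return sorted(children)

# Hypothetical object names, loosely modelled on the listing above.
objects = [
    "modis-l1/2008/2008/056/a.hdf",
    "modis-l1/2008/2008/057/b.hdf",
    "results/run1.txt",
    "naip_trained.ckpt",
]
top_level = list_directory(objects)
days = list_directory(objects, "modis-l1/2008/2008")
```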

In [10]:
gcp_wrapper.list_directory("modis-l1/2008/2008")

Your path is fdl_europe_atmos_bucket/modis-l1/2008/2008


['fdl_europe_atmos_bucket/modis-l1/2008/2008/108/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/289/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/101/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/315/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/198/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/321/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/113/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/292/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/178/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/176/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/110/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/254/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/134/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/087/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/304/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/300/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/308/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/096/',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/303/',
 'fdl_europe

In [11]:
gcp_wrapper.list_directory("modis-l1/2008/2008/057")

Your path is fdl_europe_atmos_bucket/modis-l1/2008/2008/057


['fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.A2008057.0525.061.2017255010033.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD021KM.A2008057.0700.061.2017255035619.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.A2008057.0840.061.2017255005954.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.A2008057.0700.061.2017255010039.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD021KM.A2008057.0525.061.2017255035638.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD021KM.A2008057.0215.061.2017255035524.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD021KM.A2008057.0345.061.2017255035604.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.A2008057.0350.061.2017255005942.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD021KM.A2008057.0210.061.2017255035742.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.A2008057.0210.061.2017255005917.hdf',
 'fdl_europe_atmos_bucket/modis-l1/2008/2008/057/MOD03.
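
The listing mixes two products: `MOD021KM` (Level-1B calibrated radiances at 1 km) and `MOD03` (geolocation). The file names follow the pattern `PRODUCT.AYYYYDDD.HHMM.CCC.<production timestamp>.hdf`, so granules can be matched by acquisition time. Whether `create_modis` pairs them this way is an assumption; the sketch below only shows the parsing, using four names from the listing above (`parse_granule_name` is a hypothetical helper):

```python
from collections import defaultdict
from datetime import datetime

def parse_granule_name(filename):
    """Split a MODIS granule file name into product, acquisition time, collection."""
    product, acq_date, acq_time, collection, _produced, _ext = filename.split(".")
    # acq_date looks like 'A2008057': the letter 'A', the year, then the day of year.
    acquired = datetime.strptime(acq_date[1:] + acq_time, "%Y%j%H%M")
    return product, acquired, collection

names = [
    "MOD03.A2008057.0700.061.2017255010039.hdf",
    "MOD021KM.A2008057.0700.061.2017255035619.hdf",
    "MOD021KM.A2008057.0525.061.2017255035638.hdf",
    "MOD03.A2008057.0525.061.2017255010033.hdf",
]

# Group radiance and geolocation files that share an acquisition time.
pairs = defaultdict(dict)
for name in names:
    product, acquired, _ = parse_granule_name(name)
    pairs[acquired][product] = name
```

Each entry of `pairs` then holds the `MOD021KM` and `MOD03` file for one acquisition time.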

Day 57 seems to hold enough data for a trial. Now we can pull files down using the `GcpDataTools.get_data_directory()` method. The call below fetches day 056; any other day is fetched the same way by changing `bucketpath`.

In [12]:
gcp_wrapper.get_data_directory(bucketpath="modis-l1/2008/2008/056", output_dir="../DATA/")

path: ../DATA//modis-l1/2008/2008/056 already exists, not created
path: ../DATA//modis-l1/2008/2008/056 already exists, not created
path: ../DATA//modis-l1/2008/2008/056 already exists, not created
path: ../DATA//modis-l1/2008/2008/056 already exists, not created
path: ../DATA//modis-l1/2008/2008/056 already exists, not created
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD021KM.A2008056.0755.061.2017255070653.hdf
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD021KM.A2008056.0935.061.2017255070241.hdf
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD021KM.A2008056.1115.061.2017255070329.hdf
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD03.A2008056.0755.061.2017255030609.hdf
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD03.A2008056.0935.061.2017255030503.hdf
fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD03.A2008056.1115.061.2017255030626.hdf
getting file from fdl_europe_atmos_bucket/modis-l1/2008/2008/056/MOD021KM.A2008056.0755.061.2017255070653.hdf to ../DATA//modis
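
The doubled slash in `../DATA//modis-l1/...` is usually harmless, but it suggests that the output directory and the bucket path are concatenated as strings. As a sketch of an alternative, not the actual `GCP_Tools` code, the local target path could be built with `os.path.join` and the folders created idempotently (`local_path_for` and `example.hdf` are hypothetical):

```python
import os
import tempfile

def local_path_for(object_name, output_dir):
    """Map a bucket object name to a local file path, creating parent folders."""
    local_path = os.path.normpath(os.path.join(output_dir, object_name))
    # exist_ok=True makes repeated calls safe when the folder already exists.
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    return local_path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = tmp + "/"  # trailing slash, as in the call above
    target = local_path_for("modis-l1/2008/2008/056/example.hdf", output_dir)
    # A second call neither fails nor changes the result.
    target_again = local_path_for("modis-l1/2008/2008/056/example.hdf", output_dir)
    parent_exists = os.path.isdir(os.path.dirname(target))
```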

## Extracting images

Now we extract the images from the downloaded `.hdf` files with the custom `create_modis` module, providing the data directory and an output directory.

In [83]:
import os
import create_modis

In [67]:
for root, dirs, files in os.walk("../DATA/modis-l1/"):
    print(len(files))

0
0
0
6
16


In [86]:
test_paths = []
for path, subdirs, files in os.walk("../DATA/modis-l1"):
    for file in files:
        test_paths.append(os.path.join(path, file))

In [87]:
test_paths

['../DATA/modis-l1\\2008\\2008\\056\\MOD021KM.A2008056.0755.061.2017255070653.hdf',
 '../DATA/modis-l1\\2008\\2008\\056\\MOD021KM.A2008056.0935.061.2017255070241.hdf',
 '../DATA/modis-l1\\2008\\2008\\056\\MOD021KM.A2008056.1115.061.2017255070329.hdf',
 '../DATA/modis-l1\\2008\\2008\\056\\MOD03.A2008056.0755.061.2017255030609.hdf',
 '../DATA/modis-l1\\2008\\2008\\056\\MOD03.A2008056.0935.061.2017255030503.hdf',
 '../DATA/modis-l1\\2008\\2008\\056\\MOD03.A2008056.1115.061.2017255030626.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0210.061.2017255035742.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0215.061.2017255035524.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0345.061.2017255035604.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0350.061.2017255035558.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0520.061.2017255035814.hdf',
 '../DATA/modis-l1\\2008\\2008\\057\\MOD021KM.A2008057.0525.061.2017255035638.hdf',
 

In [63]:
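modis_paths = []
# Loop header reconstructed; the walked directory is inferred from the output below.
for roots, dirs, files in os.walk("../DATA/output/"):
    for file in files:
        modis_paths.append(os.path.join(roots, file))
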
source_dirs = set()
for path in modis_paths:
    head, tail = os.path.split(path)
    source_dirs.add(head)

source_dirs

{'../DATA/output'}
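
The `test_paths` listing above mixes `/` and `\\` separators because `os.path.join` uses the platform separator on Windows. A sketch of the same file collection with `pathlib`, which filters by extension and always emits forward slashes, is shown below. It runs against a temporary folder, and the file names in it are invented for the demonstration:

```python
import tempfile
from pathlib import Path

def find_hdf_files(data_dir):
    """Return all .hdf files under data_dir as sorted, forward-slash paths."""
    return sorted(p.as_posix() for p in Path(data_dir).rglob("*.hdf"))

with tempfile.TemporaryDirectory() as tmp:
    # Build a small folder tree that mimics the downloaded layout.
    day_dir = Path(tmp) / "modis-l1" / "2008" / "2008" / "056"
    day_dir.mkdir(parents=True)
    (day_dir / "example_granule.hdf").touch()
    (day_dir / "notes.txt").touch()

    found = find_hdf_files(tmp)
    # Unique parent folders, the pathlib equivalent of os.path.split(path)[0].
    source_dirs = {Path(p).parent.as_posix() for p in found}
```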

In [71]:
create_modis.run(path="../DATA/modis-l1/2008/2008/057/", save_dir="../DATA/output/")

Don't know how to open the following files: {'../DATA/modis-l1/2008/2008/057\\MOD021KM.A2008057.0210.061.2017255035742.hdf', '../DATA/modis-l1/2008/2008/057\\MOD03.A2008057.0210.061.2017255005917.hdf'}


ValueError: No supported files found

## Tile extraction
Now we randomly sample tiles from the extracted swaths.

In [72]:
import extract_payload
import os

paths = []
for roots, dirs, files in os.walk("../DATA/output/"):
    for file in files:
        paths.append(os.path.join(roots, file))

for file in paths:
    extract_payload.random_tile_extract_from_file(
        file_in=file,
        payload_path="../DATA/tiles/payload/",
        metadata_path="../DATA/tiles/metadata",
        tile_size=10,
    )

ValueError: Only odd-sized tile sizes accepted.
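
One plausible reason for the odd-size requirement is that an odd-sized tile has a single centre pixel, so each tile can be defined by one sampled location. The sketch below illustrates that idea with NumPy. It is not the `extract_payload` implementation, and it assumes a swath array of shape `(channels, height, width)`:

```python
import numpy as np

def random_tiles(swath, tile_size, n_tiles, seed=0):
    """Cut n_tiles square tiles, each centred on a random pixel of the swath.

    swath is assumed to have shape (channels, height, width).
    """
    if tile_size % 2 == 0:
        raise ValueError("Only odd-sized tile sizes accepted.")
    half = tile_size // 2
    _, height, width = swath.shape
    rng = np.random.default_rng(seed)
    # Choose centres far enough from the border that every tile fits entirely.
    rows = rng.integers(half, height - half, size=n_tiles)
    cols = rng.integers(half, width - half, size=n_tiles)
    tiles = [swath[:, r - half:r + half + 1, c - half:c + half + 1]
             for r, c in zip(rows, cols)]
    return np.stack(tiles)

# Synthetic 15-channel swath standing in for real MODIS data.
demo_swath = np.arange(15 * 40 * 30, dtype=np.float32).reshape(15, 40, 30)
tiles = random_tiles(demo_swath, tile_size=11, n_tiles=4)
```

With `tile_size=11`, `tiles` has shape `(4, 15, 11, 11)`; an even size such as `10` raises the same kind of `ValueError` as above.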

In [8]:
walk_results = os.walk("../DATA/")

In [11]:
for root, dirs, files in walk_results:
    print ("ROOT:  {}".format(root))
    print ("DIRS:  {}".format(dirs))
    print ("FILES  {}".format(files))

ROOT:  ../DATA/.ipynb_checkpoints
DIRS:  []
FILES  []
ROOT:  ../DATA/modis-l1
DIRS:  ['2008']
FILES  []
ROOT:  ../DATA/modis-l1\2008
DIRS:  ['2008']
FILES  []
ROOT:  ../DATA/modis-l1\2008\2008
DIRS:  ['056', '057']
FILES  []
ROOT:  ../DATA/modis-l1\2008\2008\056
DIRS:  []
FILES  ['MOD021KM.A2008056.0755.061.2017255070653.hdf', 'MOD021KM.A2008056.0935.061.2017255070241.hdf', 'MOD021KM.A2008056.1115.061.2017255070329.hdf', 'MOD03.A2008056.0755.061.2017255030609.hdf', 'MOD03.A2008056.0935.061.2017255030503.hdf', 'MOD03.A2008056.1115.061.2017255030626.hdf']
ROOT:  ../DATA/modis-l1\2008\2008\057
DIRS:  []
FILES  ['MOD021KM.A2008057.0210.061.2017255035742.hdf', 'MOD021KM.A2008057.0215.061.2017255035524.hdf', 'MOD021KM.A2008057.0345.061.2017255035604.hdf', 'MOD021KM.A2008057.0350.061.2017255035558.hdf', 'MOD021KM.A2008057.0520.061.2017255035814.hdf', 'MOD021KM.A2008057.0525.061.2017255035638.hdf', 'MOD021KM.A2008057.0700.061.2017255035619.hdf', 'MOD021KM.A2008057.0840.061.2017255035846.hdf', 

In [32]:
file_paths = []

In [17]:
file_paths

[]

In [74]:
import numpy as np

In [79]:
a = np.load("../DATA/tiles/payload/payload_2008057_0840.npy")

In [80]:
a.shape

(2020, 15, 10, 10)
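
The four axes presumably correspond to (tiles, channels, tile height, tile width), although the notebook does not state this. Under that assumption, a few quick sanity checks on a payload array might look like the sketch below, which uses a small synthetic array in place of the real file:

```python
import numpy as np

# Synthetic stand-in with the same axis layout as the array loaded above;
# the axis names are an assumption, not taken from extract_payload.
payload = np.random.default_rng(0).random((8, 15, 10, 10))

n_tiles, n_channels, tile_h, tile_w = payload.shape

# Number of tiles containing at least one NaN, e.g. from missing swath pixels.
nan_tiles = int(np.isnan(payload).any(axis=(1, 2, 3)).sum())

# One mean per channel, averaged over all tiles and pixels.
channel_means = payload.mean(axis=(0, 2, 3))
```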