# Monitoring Changes in Surface Water Using Satellite Image Data 

## Set-up data pipeline

<table style="font-size: 1em; padding: 0; margin: 0;">
<p style="border: 1px solid #ff5733; border-left: 15px solid #ff5733; padding: 10px; text-align:justify;">
    <strong style="color: #ff5733">Deliverable</strong>  
    <br/>The deliverable for Part 2 is a Jupyter notebook showing a workflow to create training, validation, and testing datasets, each consisting of a folder of imagery and a folder of corresponding label imagery, ready for training a semantic segmentation model in Keras. This mostly tests your understanding of the generic workflow for preparing a dataset to train and test a deep learning model, which is an essential component of the remaining Parts.
    </p>
</table>

### Importing Python packages

In [1]:
import shutil, os
import glob
import json

from matplotlib import pyplot as plt
# Disable Pillow's pixel-count limit to avoid DecompressionBombError on very large images
from PIL import Image
Image.MAX_IMAGE_PIXELS = None

import rasterio
print("rasterio version",rasterio.__version__)
import tensorflow as tf
print("TensorFlow version",tf.__version__)


rasterio version 0.36.0
TensorFlow version 2.0.0


### Directories and data folder set-up


In [6]:
## Create directories to move the images and labels into.
## exist_ok=True makes the cell safe to re-run: directories that already
## exist are left alone, and any that are missing are still created.
for folder in ['training_images', 'training_labels',
               'validation_images', 'validation_labels',
               'testing_images', 'testing_labels']:
    os.makedirs(folder, exist_ok=True)

print("Code executed")


Code executed


### Data acquisition

#### Download the dataset from Google Drive

Warning: this will download about 415 MB. The download function below was introduced in the previous Part.

In [4]:
# from https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url
import requests

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    """
    response = streamed requests.Response object to read from
    destination = filename for output
    """    
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

import zipfile

def unzip_nwpu(f):
    """
    f = file to be unzipped
    """    
    with zipfile.ZipFile(f, 'r') as zip_ref:
        zip_ref.extractall()
print("Code loaded")


Code loaded


Download the Google Drive file to your computer as a zip archive called `NWPU_images.zip`. The file should be about 405 MB.

In [None]:
file_id = '14kkcuU6wd9UMvjaDrg3PNI-e_voCi8HL'
destination = 'NWPU_images.zip'
download_file_from_google_drive(file_id, destination)
print("Code executed")
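
Before unzipping, it can be worth confirming that the download completed. The sketch below is not part of the original workflow: `describe_archive` is a hypothetical helper that reports whether the file exists, whether it is a readable zip archive, and its size in MB, so the result can be compared against the expected size quoted above.

```python
import os
import zipfile

def describe_archive(path):
    """
    path = zip file to inspect (hypothetical helper, not part of the original notebook)
    returns a dict with keys 'exists', 'is_zip' and 'size_mb'
    """
    if not os.path.isfile(path):
        return {'exists': False, 'is_zip': False, 'size_mb': 0.0}
    size_mb = os.path.getsize(path) / 1e6  # size in megabytes (10^6 bytes)
    return {'exists': True, 'is_zip': zipfile.is_zipfile(path), 'size_mb': round(size_mb, 1)}

print(describe_archive('NWPU_images.zip'))
```

If the request returned an HTML page rather than the archive itself, `is_zip` would be `False` and the size would be far smaller than expected.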


Unzip the archive (this may take a few minutes). The contents are extracted into a new folder called `images`.

In [42]:
unzip_nwpu(destination)
print("Code executed")


Code executed
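
To confirm which folder the archive extracts into, the top-level names inside it can be listed without extracting anything. This is a minimal sketch; `top_level_entries` is a hypothetical helper name, and the check is skipped if `NWPU_images.zip` is not present or is not a valid zip file.

```python
import zipfile

def top_level_entries(zip_path):
    """
    zip_path = zip file to inspect (hypothetical helper)
    returns the sorted list of top-level names inside the archive
    """
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # member names inside a zip file always use '/' as the separator
        return sorted({name.split('/')[0] for name in zip_ref.namelist() if name})

# is_zipfile returns False for a missing or invalid file, so this is safe to run anywhere
if zipfile.is_zipfile('NWPU_images.zip'):
    print(top_level_entries('NWPU_images.zip'))
```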


Rename the `images` directory to `nwpu_images`

In [43]:
import shutil, os

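# Rename only if this has not been done already (e.g. when re-running this cell)
if os.path.isdir('images') and not os.path.exists('nwpu_images'):
    os.rename('images', 'nwpu_images')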

print("Code executed")


Code executed


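Remove the non-lake directories, which we won't need. First find all subdirectories (skipping the first entry returned by `os.walk`, which is the parent directory itself), then delete every subdirectory whose path does not contain `lake`.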

In [44]:
subdirecs = [x[0] for x in os.walk('nwpu_images')][1:]
to_delete = [s for s in subdirecs if 'lake' not in s]
for k in to_delete:
    shutil.rmtree(k, ignore_errors=True) 
print("Code executed")


Code executed
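
Because `shutil.rmtree` is irreversible, a dry run can be useful. The sketch below uses a hypothetical helper, `non_matching_subdirs`, that applies the same filter as the cell above but only returns the paths. Run before the deletion it previews what would be removed; run afterwards it should return an empty list.

```python
import os

def non_matching_subdirs(root, keyword='lake'):
    """
    root = parent folder to search
    keyword = substring that marks the folders to keep
    returns the subdirectories of root whose path does not contain keyword
    """
    subdirecs = [x[0] for x in os.walk(root)][1:]  # skip root itself
    return sorted(s for s in subdirecs if keyword not in s)

print(non_matching_subdirs('nwpu_images'))
```

Note that the filter is a substring match on the whole path, so any folder whose path contains `lake` anywhere is kept.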


### Dispatch the data into the appropriate directories

#### Dispatch training, validation, and testing data

In [9]:
import shutil, os
import json

data_dir = "/home/user/Documents/ImgSatCNN-Project/Buscombe_liveProject_Feb2020/SatImgCNN-Deliverables"
data_dir_labels = data_dir + os.sep + "nwpu_labels" 
data_dir_images = data_dir + os.sep + "nwpu_images" + os.sep + 'lake'


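def dispatch_data(grouped_json_labels,images_dir,labels_dir):
    """
    grouped_json_labels = JSON file with one top-level key per image filename
    images_dir = folder to copy each labelled image into
    labels_dir = folder to write one JSON label file per image into
    """
    with open(grouped_json_labels, 'r') as read_file:
        grouped_json_data = json.load(read_file)
    for lake_id in grouped_json_data:
        # copy the image from the lake folder (global data_dir_images) to images_dir
        shutil.copyfile(data_dir_images+os.sep+lake_id, images_dir+os.sep+lake_id)
        # write this image's labels to their own JSON file, named after the image
        individual_json_lake_label = json.dumps(grouped_json_data[lake_id])
        with open(labels_dir + os.sep + lake_id.split('.')[0] + '.json', 'w') as write_file:
            write_file.write(individual_json_lake_label)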

print("Code loaded")


Code loaded
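
To make the splitting step concrete, the sketch below applies the same idea to a small in-memory dictionary. The file names and annotation contents are invented for illustration, and `split_grouped_labels` is a hypothetical helper; the real label files may be structured differently.

```python
import json
import os
import tempfile

# Hypothetical grouped labels: one top-level key per image file name.
grouped = {
    'lake_001.jpg': {'note': 'placeholder annotation for image 1'},
    'lake_002.jpg': {'note': 'placeholder annotation for image 2'},
}

def split_grouped_labels(grouped_json_data, labels_dir):
    """
    grouped_json_data = dict keyed by image file name
    labels_dir = folder to write one JSON file per image into
    returns the list of label file paths written
    """
    written = []
    for image_name, annotation in grouped_json_data.items():
        # same naming rule as dispatch_data: text before the first '.' plus '.json'
        label_path = os.path.join(labels_dir, image_name.split('.')[0] + '.json')
        with open(label_path, 'w') as write_file:
            json.dump(annotation, write_file)
        written.append(label_path)
    return written

out_dir = tempfile.mkdtemp()  # scratch folder so nothing in the project is touched
print(sorted(os.path.basename(p) for p in split_grouped_labels(grouped, out_dir)))
# prints ['lake_001.json', 'lake_002.json']
```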


In [11]:
# training data
grouped_json_training_labels =  data_dir_labels + os.sep + "nwpu_lakes_30samples.json"
training_images_dir = data_dir + os.sep + "training_images"
training_labels_dir = data_dir + os.sep + "training_labels"
dispatch_data(grouped_json_training_labels,training_images_dir,training_labels_dir)
print("Code executed")


Code executed


In [13]:
# validation data
grouped_json_validation_labels =  data_dir_labels + os.sep + "nwpu_lakes_20samplesA.json"
validation_images_dir = data_dir + os.sep + "validation_images"
validation_labels_dir = data_dir + os.sep + "validation_labels"
dispatch_data(grouped_json_validation_labels,validation_images_dir,validation_labels_dir)
print("Code executed")


Code executed


In [14]:
# testing data
grouped_json_testing_labels =  data_dir_labels + os.sep + "nwpu_lakes_20samplesB.json"
testing_images_dir = data_dir + os.sep + "testing_images"
testing_labels_dir = data_dir + os.sep + "testing_labels"
dispatch_data(grouped_json_testing_labels,testing_images_dir,testing_labels_dir)
print("Code executed")


Code executed
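
Once the three splits are populated, a quick consistency check can confirm that every image has a label file and vice versa. This is a sketch under the assumption that the notebook is run from the folder that contains the `*_images` and `*_labels` directories; `unpaired_files` is a hypothetical helper name.

```python
import os

def unpaired_files(images_dir, labels_dir):
    """
    images_dir, labels_dir = folders to compare
    returns (images without a label, labels without an image), compared by
    the part of each file name before the first '.'
    """
    def stems(folder):
        if not os.path.isdir(folder):
            return set()
        # ignore hidden entries such as .ipynb_checkpoints
        return {f.split('.')[0] for f in os.listdir(folder) if not f.startswith('.')}
    image_stems, label_stems = stems(images_dir), stems(labels_dir)
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)

for split in ('training', 'validation', 'testing'):
    print(split, unpaired_files(split + '_images', split + '_labels'))
```

Two empty lists for a split mean that its images and labels are fully paired.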


In [None]:
print("Notebook executed")