# Dependencies

Use the additional UbuntuGIS repository to get GDAL 3.0.4 or higher, since Colab's native version 2.2.x is too old for the pipeline.

If GDAL still reports version 2.2.x after installation, restart the runtime.
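
The version requirement above can be checked programmatically. This is a minimal sketch, assuming the version string is purely numeric (for example `3.0.4`); the helper names `parse_version` and `gdal_is_recent` are hypothetical and not part of GDAL.

```python
def parse_version(text):
    """Turn a version string such as '3.0.4' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split('.')[:3])


def gdal_is_recent(version, minimum=(3, 0, 4)):
    """Return True if the version meets the minimum required by the pipeline."""
    return parse_version(version) >= minimum


# Colab's native 2.2.x fails the check, the UbuntuGIS 3.0.4 build passes it
for candidate in ('2.2.3', '3.0.4'):
    print(candidate, gdal_is_recent(candidate))
```

In the notebook itself, `gdal.__version__` would be passed as the argument once `from osgeo import gdal` succeeds.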

In [None]:
# Check container OS version (for correct UbuntuGIS package version)
!lsb_release -a

## GDAL

> If you are working locally and already have GDAL 3.0.4 or above, skip or comment out the cell below: it is only needed on Google Colab, which ships GDAL 2.2.x.

In [None]:
# Install the dependencies of the GDAL 3.0.4 build process via APT
# (from UbuntuGIS), then install GDAL itself via PyPI.
# The -y flag answers the APT confirmation prompts non-interactively.
!add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable
!apt install -y python3-gdal=3.0.4+dfsg-1~bionic0
!apt purge -y --autoremove python3-gdal
!pip install gdal==3.0.4
!apt install -y gdal-bin=3.0.4+dfsg-1~bionic0

from osgeo import gdal
print(f"GDAL version {gdal.__version__}")

## MaritimeAI

In [None]:
!pip install git+https://github.com/MaritimeAI/copernicus
!pip install git+https://github.com/MaritimeAI/maritimeai
!git clone https://github.com/MaritimeAI/resources.git

# Common

The common part of this pipeline applies to (and is used by) all three parts: snapshot downloading, processing and clustering.


## Google Drive

Mount _Google Drive_ for SAR images (recommended for storing input and output images). If _Google Colab_ is not used, this cell may be commented out.

In [None]:
# Google Drive

from os import path as osp

from google.colab import drive

PATH_DRIVE = osp.join('/', 'content', 'drive')

# Do not mount if it is already attached
if not osp.exists(PATH_DRIVE):
    print("Mounting Google Drive...")
    drive.mount(PATH_DRIVE)
else:
    print("Google Drive is already mounted!")

## Paths

Paths used by the preprocessing steps. `PATH_STORAGE` is a subdirectory hierarchy inside _Google Drive_ where processed GeoTIFFs/Shapefiles are stored (an empty string saves them to the `input`/`output` folders in the root of _Google Drive_). `PATH_STORAGE` is used only together with the `PATH_DRIVE` variable.

`PATH_TEMP` is used to store intermediate GeoTIFFs while processing (clustering).

`PATH_INPUT` is used as a source of GeoTIFFs (which may already exist in _Google Drive_).

`PATH_OUTPUT` is used to save final processed raster and vector images (clustering).

`PATH_SNAPSHOTS` is used to store _Sentinel_ .SAFE products (as zip archives).

`PATH_RESOURCES` is used as a source of auxiliary files such as GeoJSON search area or Shapefile cutline.
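
The layout described above can be sketched as a single function. This is an illustration only: `build_paths` is a hypothetical helper, and the `/content` root and the `drive/MyDrive` prefix are assumptions that match a Colab runtime with a POSIX file system.

```python
from os import path as osp


def build_paths(drive_mounted, storage='maritime.ai', root='/content'):
    """Return the directory layout; input/output move to Drive when mounted."""
    # Assumption: a mounted Google Drive is exposed under '<root>/drive/MyDrive'
    prefix = osp.join('drive', 'MyDrive', storage) if drive_mounted else ''
    return {
        'temp': osp.join(root, 'temp'),
        'input': osp.join(root, prefix, 'input'),
        'output': osp.join(root, prefix, 'output'),
        'resources': osp.join(root, 'resources'),
        'snapshots': osp.join(root, 'snapshots'),
    }


# Compare the layout with and without a mounted Google Drive
for mounted in (False, True):
    print(mounted, build_paths(mounted)['input'])
```

Only `input` and `output` depend on the Drive mount; temporary files, resources and snapshots always stay on the local disk of the runtime.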

In [None]:
from os import path as osp

PATH_STORAGE = 'maritime.ai'  # arbitrary subpath in Google Drive (if any)
if 'PATH_DRIVE' in locals():
    PREFIX_DRIVE = osp.join(osp.basename(PATH_DRIVE), 'MyDrive', PATH_STORAGE)
else:
    PREFIX_DRIVE = ''

PATH_TEMP = osp.join('/', 'content', 'temp')
PATH_INPUT = osp.join('/', 'content', PREFIX_DRIVE, 'input')
PATH_OUTPUT = osp.join('/', 'content', PREFIX_DRIVE, 'output')
PATH_RESOURCES = osp.join('/', 'content', 'resources')
PATH_SNAPSHOTS = osp.join('/', 'content', 'snapshots')

FILE_SHAPEFILE = osp.join(PATH_RESOURCES, 'clustering', 'cutline',
                          'Start_Ice_Map_UTMz40WGS84f_r.shp')

print('\n'.join((PATH_STORAGE, PATH_TEMP, PATH_INPUT, PATH_OUTPUT,
                 PATH_RESOURCES, PATH_SNAPSHOTS)))

# Part 1. Download satellite images from Copernicus Open Access Hub

A premium account at [Copernicus Open Access Hub](https://scihub.copernicus.eu) gives broader access than a free account. With a free account, only the last 45 days of snapshot history are accessible.

May 4th in the example below appears to be inaccessible to free accounts, so it is replaced with yesterday's date range.
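
The date handling can be sketched in isolation. This is a minimal example, assuming the same timestamp format as the cell below and taking the 45-day window from the text above; `full_day_range` and `within_free_window` are hypothetical helpers, not part of the `copernicus` package.

```python
from datetime import datetime, timedelta

FORMAT = '%Y-%m-%dT%H:%M:%S.%f'


def to_text(moment):
    """Format a datetime with millisecond precision and a trailing 'Z'."""
    return moment.strftime(FORMAT)[:-3] + 'Z'


def full_day_range(day):
    """Build a '[start TO stop]' range covering the whole calendar day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    stop = start + timedelta(days=1) - timedelta(microseconds=1)
    return f"[{to_text(start)} TO {to_text(stop)}]"


def within_free_window(day, today, window_days=45):
    """Check whether a day falls inside the assumed free-account history."""
    return timedelta(0) <= today - day <= timedelta(days=window_days)


print(full_day_range(datetime(2020, 5, 4)))
print(within_free_window(datetime(2020, 5, 4), today=datetime(2020, 7, 1)))
```

A fixed date that falls outside the window is the reason the search cell uses a relative range instead.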

In [None]:
%%time

import json

from datetime import datetime, timedelta

from copernicus import Config
from copernicus import DataHub
from copernicus import download
from copernicus import Polygons


FORMAT_COPERNICUS_DATETIME = '%Y-%m-%dT%H:%M:%S.%f'
PATH_AREA_SEARCH = osp.join('resources', 'copernicus',
                            'areas', 'pechora.geojson')

config = Config()
config.username = ''  # <-- set Copernicus Open Access Hub username here
config.password = ''  # <-- set Copernicus Open Access Hub password here

data_hub = DataHub(config)

# Read the search area and close the file right away
with open(PATH_AREA_SEARCH) as file:
    area = json.load(file)
search = area['features'][0]['properties'].copy()

# Example: May 4th of the current year (full day)
now = datetime.now().replace(month=5, day=4, hour=0, minute=0,
                             second=0, microsecond=0)
day = timedelta(hours=23, minutes=59,
                seconds=59, microseconds=999999)

date_start = now.strftime(FORMAT_COPERNICUS_DATETIME)[:-3] + 'Z'
date_stop = (now + day).strftime(FORMAT_COPERNICUS_DATETIME)[:-3] + 'Z'

# Restrict the search to yesterday's time range
del search['filenames']
search.update({
    'start': 0,
    'platformName': 'Sentinel-1',
    'productType': 'GRD',
    # 'beginPosition': f"[{date_start} TO {date_stop}]",
    'beginPosition': "[NOW-2DAYS TO NOW-1DAYS]",
})
print(f"DEBUG: search = {search}")

polygon, properties = Polygons.read_geojson(PATH_AREA_SEARCH)
snapshots = data_hub.search(search, area=polygon)
# print(f"DEBUG: snapshots = {snapshots}")

config.output = PATH_SNAPSHOTS

for i, snapshot in enumerate(snapshots):
    print(f"{i:03d}", snapshot.link)
    download(snapshot.link, config)
    print()

# Part 2. Process archives from Copernicus Open Access Hub

Setting `ratio=True` and/or `negative=True` in the `process_sentinel1` function may consume a lot of RAM, possibly more than the 12 GB available in Google Colab.
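
A back-of-the-envelope estimate shows why extra channels get expensive. This sketch assumes each band is held in memory as an uncompressed 32-bit float array; the raster dimensions are hypothetical, and nothing here describes the internals of `process_sentinel1`.

```python
def raster_bytes(rows, cols, bands=1, bytes_per_sample=4):
    """Size of an uncompressed in-memory raster in bytes."""
    return rows * cols * bands * bytes_per_sample


# Hypothetical scene of 16000 x 25000 pixels, stored as float32
for bands in (1, 2, 4):
    size = raster_bytes(rows=16000, cols=25000, bands=bands)
    print(f"{bands} band(s): {size / 2**30:.2f} GiB")
```

Intermediate copies made during processing add to these figures, so the real footprint can be several times larger.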

> Some snapshots may appear black after processing. That is because a small data area in a corner was cut off by the Shapefile cutline.

In [None]:
%%time

from glob import glob
from zipfile import BadZipFile

from maritimeai import process_sentinel1

if osp.isdir(PATH_SNAPSHOTS):
    for filename in glob(osp.join(PATH_SNAPSHOTS, '*.zip')):
        try:
            process_sentinel1(filename, PATH_INPUT, area='default',
                              shapes=[FILE_SHAPEFILE], negative=False)
        except BadZipFile:
            print(f"ERROR: {filename} is damaged!")
else:
    raise FileNotFoundError(f"Path '{PATH_SNAPSHOTS}' must exist!")

# Part 3. Unsupervised annotation

This part takes a snapshot and runs k-means clustering on it. The output is the clustered snapshot as a vector shapefile, which can be processed further by hand in _ArcGIS_.
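
For readers unfamiliar with k-means, the idea can be sketched on plain pixel intensities. This is a toy pure-Python version of Lloyd's algorithm with a deterministic initialisation; it is not the implementation behind `cluster_dataset`, and `kmeans_1d` is a hypothetical name.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar pixel intensities into k groups (Lloyd's algorithm)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the centroids across the sorted data
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: move each centroid to the mean of its members
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: centroids stopped moving
        centroids = updated
    return labels, centroids


pixels = [10, 12, 11, 200, 205, 198, 90, 95]
labels, centroids = kmeans_1d(pixels, k=3)
print(labels)     # [0, 0, 0, 2, 2, 2, 1, 1]
print(centroids)  # [11.0, 92.5, 201.0]
```

Each label plays the role of one cluster class in the output; vectorizing then turns connected regions of equal labels into polygons.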

> Use `gdalinfo` or `ogrinfo` to inspect the metadata of raster and vector geospatial files respectively.

In [None]:
%%time

import os

from datetime import datetime
from glob import glob
from os import path as osp

import matplotlib.pyplot as plt
import numpy as np

from osgeo import ogr, gdal, gdalconst

from maritimeai import read_to_channels
from maritimeai import enhance_dataset
from maritimeai import save_grayscale
from maritimeai import cluster_dataset
from maritimeai import vectorize_dataset

from maritimeai.utils import GDALCallback

from maritimeai import CHANNELS_RGB


FORMAT_DATE = '%Y-%m-%d-%H-%M-%S'
NUM_CLUSTERS = 7  # K-means
SHOW_HIST = False

if osp.isdir(PATH_INPUT):
    os.makedirs(PATH_OUTPUT, exist_ok=True)
else:
    raise FileNotFoundError(f"Path '{PATH_INPUT}' must exist!")

if not osp.isfile(FILE_SHAPEFILE):
    raise FileNotFoundError(f"Shapefile '{FILE_SHAPEFILE}' must exist!")

filenames = glob(osp.join(PATH_INPUT, '*', '*', '*.tiff'))
print("Source files -->\n")
print('\n'.join(filenames))
print(f"Source shape is {FILE_SHAPEFILE}")
# !gdalinfo "{filenames[0]}"

try:
    # Resolve the cutline shapefile to an absolute path
    if osp.isfile(FILE_SHAPEFILE):
        shape = osp.abspath(osp.realpath(FILE_SHAPEFILE))
    else:
        raise FileNotFoundError
except (TypeError, FileNotFoundError) as e:
    print(f"Shapefile '{FILE_SHAPEFILE}' does not exist!")
    shape = None
print(f"Available shape is {shape}")
# !ogrinfo "{shape}"

assert int(gdal.__version__.split('.')[0]) >= 3, "Required GDAL version >=3.0!"
gdal.UseExceptions()

timestamp = datetime.utcnow().strftime(FORMAT_DATE)

print("\nProcessing files -->")
for filename in filenames:
    # Input images assumed to be grayscale
    print(f"\nInput file is {filename}")
    output, _ = osp.splitext(filename.replace(PATH_INPUT,
                                              osp.join(PATH_OUTPUT, timestamp)))
    print(f"Output is {output}")
    os.makedirs(osp.dirname(output), exist_ok=True)
    temp = filename.replace(PATH_INPUT, PATH_TEMP)
    print(f"Temporary filename is {temp}")
    os.makedirs(osp.dirname(temp), exist_ok=True)

    # Create temporary RGB GeoTIFF
    if 'dataset' in locals():
        del dataset
    dataset = read_to_channels(filename, CHANNELS_RGB)
    print(f"Warping {filename} into {temp}...")
    gdal.Warp(temp, dataset, format='GTiff', dstSRS='EPSG:32640',
              srcNodata=0, dstNodata=0, xRes=40, yRes=40,
              # Pass None (not the string "None") when no cutline is available
              cutlineDSName=shape, cropToCutline=bool(shape),
              creationOptions=['COMPRESS=DEFLATE'], callback=GDALCallback())

    # Process temporary RGB GeoTIFF
    dataset = enhance_dataset(temp, bilateral=(7, 15, 15))
    try:
        # Save temporary dataset to destination (tile-wise filtering result)
        destination = osp.join(output, 'image')
        os.makedirs(destination, exist_ok=True)
        destination = osp.join(destination, osp.basename(temp))
        name, extension = osp.splitext(destination)
        destination = f"{name}_warped{extension}"

        save_grayscale(dataset, destination, callback=GDALCallback())

        # Try to process the whole RGB image at once
        # Clustering tile-wise is a bad idea
        cluster_dataset(dataset, clusters=NUM_CLUSTERS,
                        plt=plt, show_histogram=SHOW_HIST)

        # Save RGB dataset to destination (output raster file)
        destination = osp.join(output, 'image')
        os.makedirs(destination, exist_ok=True)
        destination = osp.join(destination, osp.basename(temp))
        name, extension = osp.splitext(destination)
        destination = f"{name}_clustered{extension}"

        print(f"Saving image to {destination}...")
        gdal.Translate(destination, dataset, options=['-co', 'COMPRESS=DEFLATE'],
                       callback=GDALCallback())

        # Vectorize clusters (create output shapefile(s) from clustering)
        # WARNING: shapefile may be quite large
        destination = osp.join(output, 'shape')
        os.makedirs(destination, exist_ok=True)
        destination = osp.join(destination, osp.basename(temp))
        destination = osp.splitext(destination)[0] + '.shp'

        vectorize_dataset(dataset, destination, GDALCallback())
    finally:
        try:
            dataset.FlushCache()
            del dataset
        except (NameError, AttributeError):
            pass