# SARFish dataset demo

This Jupyter notebook guides new users of the SARFish dataset. It shows how to use the convenience functions included in this repo to get working with the SARFish dataset as quickly as possible.

## What you will learn

1. What the SARFish dataset is
2. How to access the SARFish dataset
3. Dataset structure
4. How to load and visualise the SARFish imagery data
5. How to load and visualise the SARFish groundtruth labels
6. SARFish challenge prediction submission format
7. How to evaluate model performance using the SARFish metric
8. How to participate in the SARFish challenge

In [1]:
from pathlib import Path
import os
from time import time

import numpy as np
import pandas as pd
import yaml

from GeoTiff import load_GeoTiff
from visualise_labels import scale_sentinel_1_image, SARFish_Plot
from SARFish_metric import score

rng = np.random.default_rng(1234)

%gui qt
pd.set_option('display.max_columns', None)
start = time()




## 1. What is the SARFish dataset

### 1.1 Overview

SARFish is an imagery dataset for the purpose of training, validating and testing supervised machine learning models on the task of ship detection and classification. SARFish builds on the excellent work of the [xView3-SAR dataset](https://iuu.xview.us/dataset) by expanding the imagery data to include [Single Look Complex (SLC)](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-1-sar/products-algorithms/level-1-algorithms/single-look-complex) as well as [Ground Range Detected (GRD)](https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-1-sar/products-algorithms/level-1-algorithms/ground-range-detected) imagery data taken directly from the European Space Agency (ESA) Copernicus Programme [Open Access Hub Website](https://scihub.copernicus.eu/).

The following image shows a summarised description of the Sentinel-1 product family for the Interferometric Wide (IW) mode. [^1] ![Sentinel-1 product processing pipeline summary](./images/sentinel_1_data_product_processing_levels_summary.jpg)

[^1]: G. Hajduch, M. Bourbigot, H. Johnsen, and R. Piantanida, Sentinel-1 Product Specification. Sentinel-1 Mission Performance Centre, 2022, p. 34. [Online]. Available: [https://sentinel.esa.int/web/sentinel/user-guides/sentinel-1-sar/document-library/-/asset_publisher/1dO7RF5fJMbd/content/id/4762447](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-1-sar/document-library/-/asset_publisher/1dO7RF5fJMbd/content/id/4762447)

### 1.2 The Sentinel-1 processing pipeline

The following diagram shows how the SARFish dataset extends the xView3-SAR dataset by providing the minimally pre-processed GRD and SLC counterparts to the xView3-SAR dataset imagery products, and by providing labels which have been re-projected into the pixel space of the images. ![Relationship between the xView3-SAR and SARFish datasets](./images/xView3-SAR_SARFish_dataset_relation.jpg)

### 1.3 Minimal SARFish processing

The preprocessing applied to the Sentinel-1 images to create the SARFish dataset was chosen to be minimally invasive. The preprocessing of the xView3-SAR dataset included radiometric calibration, decibel scaling, range Doppler geocoding and projection to UTM using the [SeNtinel Application Platform (SNAP)](https://earth.esa.int/eogateway/tools/snap) [Graph Processing Tool](https://seadas.gsfc.nasa.gov/help-8.3.0/gpf/GraphProcessingTool.html). In contrast, the only operations applied to the SARFish data have been those necessary to make the images usable for computer vision tasks. The philosophy was to provide GRD and SLC data in a format as close as practicable to the Sentinel-1 data that can be downloaded from Copernicus.

| Operation | xView3-SAR dataset | SARFish dataset |
|-----------|--------------------|-----------------|
| [radiometric calibration](https://sentinels.copernicus.eu/web/sentinel/radiometric-calibration-of-level-1-products) | True | False |
| [decibel scaling](https://en.wikipedia.org/wiki/Decibel) | True | False |
| [range Doppler geocoding](https://sentinel.esa.int/documents/247904/1653442/Guide-to-Sentinel-1-Geocoding.pdf) | True | False |
| [projection to UTM](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system) | True | False |
| flipping | True | True |
| [de-bursting](https://sentinels.copernicus.eu/web/sentinel/level-1-post-processing-algorithms) | True | True |
| [no data masking](https://gdal.org/development/rfc/rfc15_nodatabitmask.html) | True | True  |

### Flipping

Flipping is applied to both GRD and SLC products. Sentinel-1 images are reflected with respect to the Earth's surface. This is due to the data acquisition method: the first sensed azimuth lines are placed in the first rows of the image array. As a result, images from both ascending and descending orbits cannot be mapped to the Earth's surface with a rotation alone, necessitating a flip along one axis. The images and ground control points (GCPs) are reversed along the range/x axis.
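
As a rough illustration of the flip described above, the sketch below reverses a small image along the range/x axis and mirrors the pixel column of a ground control point to match. It uses plain Python lists to stay self-contained. `flip_range_axis` and `flip_gcp_column` are hypothetical helper names rather than functions from this repo, and the 0-indexed column convention is an assumption.

```python
def flip_range_axis(image):
    """Reverse every row so the range/x axis runs in the opposite direction."""
    return [row[::-1] for row in image]


def flip_gcp_column(column, image_width):
    """Mirror a 0-indexed pixel column so a GCP stays on the same image feature."""
    return image_width - 1 - column


# Rows are azimuth lines, columns are range samples.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
flipped = flip_range_axis(image)
gcp_column = flip_gcp_column(0, image_width=4)
print(flipped)
print(gcp_column)
```

On a 2D numpy array the same reversal can be written as `image[:, ::-1]`.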

### Debursting

Debursting is applied only to SLC products. Sentinel-1 SLC products are provided as sets of 3 "swaths" per channel per scene. These swaths consist of "sub-swaths" or "bursts" which are overlapping segments of the image. Debursting is the alignment of these bursts into a contiguous image. This was done to create a one-to-one correspondence between the objects in each swath and the features on the Earth to which they correspond. It is important to note that, as the deburst images are concatenations of bursts which are themselves individual SAR images, there are significant phase discontinuities on the boundaries of the bursts. It was decided for the purposes of this dataset that the bursts within the individual swaths should be merged rather than being split into separate images.

### No data masking

No data masking is applied to both GRD and SLC products. Invalid pixels in the image have been masked using a nodata mask.

## 2. Accessing the data

### 2.1 Downloading data from huggingface

The SARFish dataset is available for download at:
- [full SARFish dataset](https://huggingface.co/datasets/ConnorLuckettDSTG/SARFish)
- [sample SARFish dataset](https://huggingface.co/datasets/ConnorLuckettDSTG/SARFishSample)

| dataset       | coincident GRD, SLC products | compressed (GB) | uncompressed (GB) |
| ------------- | ---------------------------- | --------------- | ----------------- |
| SARFishSample | 1                            | 4.3             | 8.2               |
| SARFish       | 753                          | 3293            | 6468              |

#### Full SARFish dataset

Make sure you have at least enough storage space for the uncompressed dataset.

```bash
cd /path/to/large/storage/location
```
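
A quick way to confirm the storage requirement before downloading is to query the free space of the target filesystem. The sketch below is only a convenience check, not part of this repo. `has_enough_space` is a hypothetical helper, and it assumes the sizes in the table above are decimal gigabytes.

```python
import shutil


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9  # assumption: 1 GB = 1e9 bytes


# 6468 GB is the uncompressed size of the full dataset from the table above.
# The compressed archives (3293 GB) need extra room until they are deleted.
print(has_enough_space(".", required_gb=6468))
```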

Create or log in to a [huggingface](https://huggingface.co) account.

Login to the huggingface command line interface.

```bash
huggingface-cli login
```

Copy the access token from settings/Access Tokens in your huggingface account and paste it when prompted. Then clone the dataset:

```bash
git lfs install
git clone https://huggingface.co/datasets/ConnorLuckettDSTG/SARFish
```

#### SARFish sample dataset

Substitute the final command for the full dataset with the following:

```bash
git clone https://huggingface.co/datasets/ConnorLuckettDSTG/SARFishSample
```

### 2.2 Checking the md5sums

Use the provided script to check the md5 sums of the downloaded SARFish products:

```bash
./check_SARFish_md5sum.py
```
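
For readers who want to check a single file by hand, the sketch below computes an md5 sum in chunks so that multi-gigabyte products do not have to fit in memory. It is not the implementation of check_SARFish_md5sum.py. `md5sum` and `matches_expected` are hypothetical helpers, and the expected digests are assumed to be the GRD\_md5sum and SLC\_md5sum values listed in xView3\_SLC\_GRD\_correspondences.csv.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's md5 sum against an expected hex digest."""
    return md5sum(path) == expected_md5.strip().lower()
```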

### 2.3 Unzipping the data

Use the provided unzipping function to unzip the SARFish data products in parallel.

```bash
cd /path/to/SARFish/directory/GRD
unzip_batch.sh -p $(find './' -type f -name "*.SAFE.zip")

cd /path/to/SARFish/directory/SLC
unzip_batch.sh -p $(find './' -type f -name "*.SAFE.zip")
```

### 2.4 Setting the SARFish dataset root directory.

Modify the environment.yaml file in this directory and substitute the dummy path with the SARFish root directory. For example, if your local copy of the SARFish dataset resides in /data/SARFish, substitute /path/to/SARFish/root/ with /data/.

```
SARFish_root_directory: /path/to/SARFish/root/
```

In [2]:
with open("environment.yaml", "r") as f:
    environment = yaml.safe_load(f)

SARFish_root_directory = environment['SARFish_root_directory']
os.environ['SARFISH_ROOT_DIRECTORY'] = SARFish_root_directory

## 3. Dataset Structure

The SARFish dataset is packaged in the [SAFE format](https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-1-sar/data-formats/safe-specification). The product file name consists of the unique [product identifier](https://sentinel.esa.int/documents/247904/1877131/Sentinel-1-Product-Specification) (section 3.5.1) from which the SARFish product was derived.
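
The product identifier encodes several useful fields. The sketch below splits an identifier according to the Sentinel-1 naming convention (mission, mode, product type and resolution class, processing level, product class, polarisation, start and stop times, absolute orbit, datatake ID and product unique ID). `parse_product_identifier` and the field names are hypothetical labels chosen for this example, not part of this repo.

```python
import re
from datetime import datetime

# Naming convention: MMM_BB_TTTR_LFPP_<start>_<stop>_<orbit>_<datatake>_<unique id>
PRODUCT_IDENTIFIER_PATTERN = re.compile(
    r"^(?P<mission>S1[A-D])_(?P<mode>[A-Z0-9]{2})_(?P<product_type>[A-Z]{3})"
    r"(?P<resolution>[FHM_])_(?P<level>\d)(?P<product_class>[SA])"
    r"(?P<polarisation>[A-Z]{2})_(?P<start>\d{8}T\d{6})_(?P<stop>\d{8}T\d{6})"
    r"_(?P<orbit>\d{6})_(?P<datatake>[0-9A-F]{6})_(?P<unique_id>[0-9A-F]{4})$"
)


def parse_product_identifier(product_identifier):
    """Split a Sentinel-1 product identifier into a dict of named fields."""
    match = PRODUCT_IDENTIFIER_PATTERN.match(product_identifier)
    if match is None:
        raise ValueError(f"unrecognised product identifier: {product_identifier}")
    fields = match.groupdict()
    for key in ("start", "stop"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields


fields = parse_product_identifier(
    "S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0"
)
print(fields["product_type"], fields["polarisation"], fields["start"])
```

Note that SLC identifiers contain a double underscore because their resolution class is the `_` character.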

The following tree shows an overview of the SARFish dataset:

```
SARFish/
├── GRD
│   ├── public
│   │   └── S1B_IW_GRDH_1SDV_*.SAFE
│   ├── train
│   │   └── S1B_IW_GRDH_1SDV_*.SAFE
│   └── validation
│       └── S1B_IW_GRDH_1SDV_*.SAFE
└── SLC
    ├── public
    │   └── S1B_IW_SLC__1SDV_*.SAFE
    ├── train
    │   └── S1B_IW_SLC__1SDV_*.SAFE
    └── validation
        └── S1B_IW_SLC__1SDV_*.SAFE
```
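
Given this layout, the products in each partition can be enumerated with a glob. The sketch below assumes `root` is the directory that directly contains the GRD and SLC folders. `list_products` and `count_products` are hypothetical helpers, not functions from this repo.

```python
from pathlib import Path


def list_products(root, product_type, partition):
    """Return the sorted *.SAFE product directories for one product type and partition."""
    return sorted(Path(root, product_type, partition).glob("*.SAFE"))


def count_products(root):
    """Count the products found under every product type and partition."""
    return {
        (product_type, partition): len(list_products(root, product_type, partition))
        for product_type in ("GRD", "SLC")
        for partition in ("train", "validation", "public")
    }
```

Comparing the counts for GRD and SLC is a simple sanity check that every scene has both of its coincident products unzipped.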

### 3.1 Dataset Partitions

The partitions of the dataset are as follows:

| partition   | labels provided |
| ----------- | --------------- |
| train       | True            |
| validation  | True            |
| public      | False           |

The public partition is provided with no labels. It will be used to determine competitors' rankings in the SARFish challenge. Competitors will run their model over the public partition of the dataset, producing predictions for each constituent scene, and submit these in the submission format (see section 6).

### 3.2 The two SARFish product types: GRD and SLC

The SARFish dataset consists of pairs of coincident [real-valued GRD and complex-valued SLC](https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-1-sar/product-types-processing-levels/level-1) imagery products from the Sentinel-1 satellite constellation. The GRD and SLC designations are the **product\_type**. The mapping between xView3-SAR **scene\_id** and the [Copernicus](https://www.copernicus.eu/en) product identifier is contained in the xView3\_SLC\_GRD\_correspondences.csv file. The following cell shows an example of the mapping between an xView3-SAR scene and its SARFish GRD and SLC products.

In [3]:
xView3_SLC_GRD_correspondences = pd.read_csv("./labels/xView3_SLC_GRD_correspondences.csv")
xView3_SLC_GRD_correspondences[['scene_id', 'GRD_product_identifier', 'SLC_product_identifier']].iloc[703:704]

| | scene\_id | GRD\_product\_identifier | SLC\_product\_identifier |
| --- | --- | --- | --- |
| 703 | 5c3d986db930f848v | S1B\_IW\_GRDH\_1SDV\_20200803T075721\_20200803T0757... | S1B\_IW\_SLC\_\_1SDV\_20200803T075720\_20200803T0757... |


The xView3\_SLC\_GRD\_correspondences.csv file also contains the file names of the vh, vv imagery products and their associated annotation.xml files. These are used to pick out individual imagery data for processing and evaluation.

In [4]:
xView3_SLC_GRD_correspondences.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   DATA_PARTITION             753 non-null    object
 1   scene_id                   753 non-null    object
 2   GRD_product_identifier     753 non-null    object
 3   GRD_md5sum                 753 non-null    object
 4   GRD_vh                     753 non-null    object
 5   GRD_vv                     753 non-null    object
 6   GRD_vh_annotation          753 non-null    object
 7   GRD_vv_annotation          753 non-null    object
 8   SLC_product_identifier     753 non-null    object
 9   SLC_md5sum                 753 non-null    object
 10  SLC_swath_1_vh             753 non-null    object
 11  SLC_swath_1_vv             753 non-null    object
 12  SLC_swath_1_vh_annotation  753 non-null    object
 13  SLC_swath_1_vv_annotation  753 non-null    object
 14  SLC_swath_

### 3.3 SARFish dataset format

#### GRD imagery

GRD products are uniquely identified by their:

- product\_type
- partition 
- GRD\_product\_identifier
- polarisation 

The tree of an example SARFish GRD Product:

```
SARFish/GRD/validation/S1B_IW_GRDH_1SDV_20201013T054010_20201013T054035_023790_02D350_506A.SAFE/
├── annotation
│   └── ...
├── manifest.safe
├── measurement <- imagery data
│   ├── S1B_IW_GRDH_1SDV_20201013T054010_20201013T054035_023790_02D350_506A_global_shoreline_vector.npy
│   ├── S1B_IW_GRDH_1SDV_20201013T054010_20201013T054035_023790_02D350_506A_xView3_shoreline.npy
│   ├── s1b-iw-grd-vh-20201013t054010-20201013t054035-023790-02d350-002_SARFish.tiff
│   └── s1b-iw-grd-vv-20201013t054010-20201013t054035-023790-02d350-001_SARFish.tiff
├── preview
│   └── ...
├── ...
└── support
    └── ...
```

The imagery data located in "measurement" consists of 2 images containing the polarimetric channels, called the VV and VH polarisations, packaged in the GeoTiff format. Also included in the measurement folder are numpy files containing the shoreline vectors used in the evaluation of a model's close-to-shore detection performance.

#### SLC imagery

SLC products are uniquely identified by their:

- product\_type
- partition 
- SLC\_product\_identifier
- polarisation
- swath\_index

The tree of an example SARFish SLC Product:

```
SARFish/SLC/validation/S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0.SAFE
├── annotation
│   └── ...
├── manifest.safe
├── measurement
│   ├── s1b-iw1-slc-vh-20201013t054009-20201013t054034-023790-02d350-001_SARFish.tiff                 <- swath 1 image data
│   ├── s1b-iw1-slc-vh-20201013t054009-20201013t054034-023790-02d350-001_SARFish.tiff.msk             <- swath 1 image mask
│   ├── s1b-iw1-slc-vv-20201013t054009-20201013t054034-023790-02d350-004_SARFish.tiff
│   ├── s1b-iw1-slc-vv-20201013t054009-20201013t054034-023790-02d350-004_SARFish.tiff.msk
│   ├── s1b-iw2-slc-vh-20201013t054010-20201013t054035-023790-02d350-002_SARFish.tiff
│   ├── s1b-iw2-slc-vh-20201013t054010-20201013t054035-023790-02d350-002_SARFish.tiff.msk
│   ├── s1b-iw2-slc-vv-20201013t054010-20201013t054035-023790-02d350-005_SARFish.tiff
│   ├── s1b-iw2-slc-vv-20201013t054010-20201013t054035-023790-02d350-005_SARFish.tiff.msk
│   ├── s1b-iw3-slc-vh-20201013t054008-20201013t054033-023790-02d350-003_SARFish.tiff
│   ├── s1b-iw3-slc-vh-20201013t054008-20201013t054033-023790-02d350-003_SARFish.tiff.msk
│   ├── s1b-iw3-slc-vv-20201013t054008-20201013t054033-023790-02d350-006_SARFish.tiff
│   ├── s1b-iw3-slc-vv-20201013t054008-20201013t054033-023790-02d350-006_SARFish.tiff.msk
│   ├── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_1_global_shoreline_vector.npy  <- swath 1 global shoreline vector
│   ├── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_1_xView3_shoreline.npy         <- swath 1 xView3 shoreline vector
│   ├── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_2_global_shoreline_vector.npy
│   ├── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_2_xView3_shoreline.npy
│   ├── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_3_global_shoreline_vector.npy
│   └── S1B_IW_SLC__1SDV_20201013T054008_20201013T054035_023790_02D350_04D0_3_xView3_shoreline.npy
├── preview
│   └── ...
├── ...
└── support
    └── ...
```

The imagery data located in the "measurement" directory consists of 3 sets (swaths) of 2 images containing the polarimetric channels called VV, VH packaged in the GeoTiff format. Each Sentinel-1 SLC product is the composite of 3 swaths. The process of [debursting](https://github.com/senbox-org/s1tbx/blob/master/s1tbx-op-sentinel1-ui/src/main/resources/org/esa/s1tbx/sentinel1/docs/operators/TOPSARDeburstOp.html) has been applied to each swath, but the swaths have not been merged. Each SLC swath is again accompanied by a corresponding xView3-SAR and global shoreline vector. In addition, corresponding [.msk](https://gdal.org/development/rfc/rfc15_nodatabitmask.html) (mask) files are used by the GeoTiff.load\_GeoTiff function to mask the no-data portions of the SLC products. The load\_GeoTiff function allows a user to easily load both GRD and SLC SARFish products, taking into account their respective masks.
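
The measurement file names in the trees above carry the product type, polarisation and (for SLC) the swath index. The sketch below recovers them with a regular expression inferred from the example file names shown in this section. `describe_measurement` is a hypothetical helper and the pattern is an assumption, not a specification from this repo.

```python
import re

# Matches e.g. "s1b-iw-grd-vh-..." (GRD) and "s1b-iw1-slc-vh-..." (SLC swath 1).
MEASUREMENT_PATTERN = re.compile(
    r"^s1[a-d]-iw(?P<swath_index>[1-3])?-(?P<product_type>grd|slc)-(?P<polarisation>v[hv])-"
)


def describe_measurement(file_name):
    """Return the product type, polarisation and swath index encoded in a file name."""
    match = MEASUREMENT_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unrecognised measurement file name: {file_name}")
    swath_index = match["swath_index"]
    return {
        "product_type": match["product_type"].upper(),
        "polarisation": match["polarisation"].upper(),
        # GRD file names carry no swath index
        "swath_index": int(swath_index) if swath_index else None,
    }


print(describe_measurement(
    "s1b-iw1-slc-vh-20201013t054009-20201013t054034-023790-02d350-001_SARFish.tiff"
))
```

The pattern only inspects the start of the name, so the accompanying .msk files are described in the same way as the image they mask.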

## 4. How to load and visualise the SARFish imagery data

To generate the path to the GRD and SLC products associated with a single scene, we first pick out a specific row from the xView3_SLC_GRD_correspondences table:

In [5]:
correspondence = xView3_SLC_GRD_correspondences.iloc[703:704].squeeze()
correspondence

DATA_PARTITION                                                      validation
scene_id                                                     5c3d986db930f848v
GRD_product_identifier       S1B_IW_GRDH_1SDV_20200803T075721_20200803T0757...
GRD_md5sum                                    3f8ec460304f087c8f9a59b7c0897561
GRD_vh                       s1b-iw-grd-vh-20200803t075721-20200803t075746-...
GRD_vv                       s1b-iw-grd-vv-20200803t075721-20200803t075746-...
GRD_vh_annotation            s1b-iw-grd-vh-20200803t075721-20200803t075746-...
GRD_vv_annotation            s1b-iw-grd-vv-20200803t075721-20200803t075746-...
SLC_product_identifier       S1B_IW_SLC__1SDV_20200803T075720_20200803T0757...
SLC_md5sum                                    c32f40b7d3a1304a30c287d7eae75684
SLC_swath_1_vh               s1b-iw1-slc-vh-20200803t075720-20200803t075748...
SLC_swath_1_vv               s1b-iw1-slc-vv-20200803t075720-20200803t075748...
SLC_swath_1_vh_annotation    s1b-iw1-slc-vh-20200803

### 4.1 Loading GRD imagery

The correspondence mapping is used to generate the specific GRD product path.

In [6]:
measurement_path_GRD = Path(
    SARFish_root_directory, "GRD", correspondence['DATA_PARTITION'], f"{correspondence['GRD_product_identifier']}.SAFE",
    "measurement", correspondence[f'GRD_vh']
)
str(measurement_path_GRD)

'/auto/SARFish/SARFish/GRD/validation/S1B_IW_GRDH_1SDV_20200803T075721_20200803T075746_022756_02B2FF_033A.SAFE/measurement/s1b-iw-grd-vh-20200803t075721-20200803t075746-022756-02b2ff-002_SARFish.tiff'

The image is loaded into numpy arrays using the provided GeoTiff.load\_GeoTiff function. The function returns an array of image data and a second array masking the no-data areas. Since the data is in linear scale, we use the provided scaling function visualise\_labels.scale\_sentinel\_1\_image to convert the data to a decibel scale, which is more easily interpreted by humans.

In [7]:
data_GRD, mask_GRD, _, _ = load_GeoTiff(str(measurement_path_GRD))
data_scaled_GRD = scale_sentinel_1_image(data_GRD, product_type = "GRD")
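
To make the idea of decibel scaling concrete, the sketch below converts a single linear-scale sample to decibels. It is not the implementation of scale\_sentinel\_1\_image, which may also clip or rescale the result. `to_decibels` is a hypothetical helper, and it assumes the pixel values are amplitudes (hence the factor of 20) and that the magnitude is taken for complex SLC samples.

```python
import math


def to_decibels(sample, floor_db=-100.0):
    """Convert one linear-scale amplitude sample to decibels."""
    amplitude = abs(sample)  # magnitude of a complex SLC sample, value of a GRD sample
    if amplitude == 0:
        return floor_db  # log10(0) is undefined, so return a floor value instead
    return 20.0 * math.log10(amplitude)


print(to_decibels(10))      # real-valued GRD-style sample
print(to_decibels(3 + 4j))  # complex-valued SLC-style sample
```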

### 4.2 Plotting GRD imagery

The data can be plotted using the provided SARFish\_Plot class. 

In [8]:
plot_GRD = SARFish_Plot(
    data_scaled_GRD, mask_GRD, title = f"example plotting groundtruth labels in {correspondence[f'GRD_product_identifier']}",
)

### 4.3 Plotting SLC imagery

There are three SLC swaths per product. We will load and visualise the imagery in one cell.

In [9]:
SLC_plots = []
for swath_index in [1, 2, 3]:
    start = time()
    
    measurement_path_SLC = Path(
        SARFish_root_directory, "SLC", correspondence['DATA_PARTITION'], f"{correspondence['SLC_product_identifier']}.SAFE",
        "measurement", correspondence[f'SLC_swath_{swath_index}_vh']
    )
    
    data_SLC, mask_SLC, _, _ = load_GeoTiff(str(measurement_path_SLC))
    data_SLC_scaled = scale_sentinel_1_image(data_SLC, product_type = "SLC")
    
    stop = time()
    print(f"SLC swath {swath_index} loading time: {stop - start} seconds.")

    plot_SLC = SARFish_Plot(
        data_SLC_scaled, mask_SLC,
        title = f"example plotting groundtruth labels in {correspondence['SLC_product_identifier']}, swath: {swath_index}",
    )
    SLC_plots.append(plot_SLC)
    # drop this cell's references to the large arrays before loading the next swath
    data_SLC = None
    mask_SLC = None
    data_SLC_scaled = None

SLC swath 1 loading time: 23.21375823020935 seconds.
SLC swath 2 loading time: 21.402974128723145 seconds.
SLC swath 3 loading time: 23.968801975250244 seconds.


## 5. How to load and visualise the SARFish labels

The groundtruth labels for the SARFish dataset are arranged similarly to the xView3-SAR dataset. The labels contain the location, classification and length of each maritime object present in all scenes for a particular product type and dataset partition. The information contained in the groundtruth labels for both the GRD and SLC products is summarised in the table below.

| field     | data_type | description |
| --------- | ----------- | --------- |
| partition | str: \{"train", "validation"\} | split of the dataset |
| product\_type | str: \{"GRD", "SLC"\} | product type of the data |
| scene\_id | str | unique xView3 scene ID for challenge purposes |
| detect\_id | str | unique detection ID in format scene\_id\_detect\_lat\_detect\_lon |
| \{product\_type\}\_product\_identifier | str | The Copernicus Sentinel-1 product identifier for the designated product type |
| detect\_lat | float | latitude of detection in World Geodetic System (WGS) 84 coordinates |
| detect\_lon | float | longitude of detection in WGS84 coordinates |
| detect\_scene\_row | int | pixel row of scene containing detection |
| detect\_scene\_column | int | pixel column of scene containing detection |
| top | float | pixel row of the top left corner of the bounding box, where available |
| left | float | pixel column of the top left corner of the bounding box, where available |
| bottom | float | pixel row of the bottom right corner of the bounding box, where available |
| right | float | pixel column of the bottom right corner of the bounding box, where available |
| vessel\_length\_m | float | length of vessel in meters; only provided where available from AIS |
| source | str: \{AIS, AIS/Manual, Manual\} | source of detection (AIS, manual label, or both) |
| is\_vessel | bool | True if detection is a vessel, False otherwise |
| is\_fishing | bool | True if detection is a fishing vessel, False otherwise |
| global\_shoreline\_vector\_distance\_from\_shore\_km | float | distance from shore of detection in kilometers as determined using the global shoreline vectors projected into the pixel space of the SARFish products  |
| xView3\_shoreline\_vector\_distance\_from\_shore\_km | float | distance from shore of detection in kilometers as determined using the  xView3-SAR shoreline vectors projected into the pixel space of the SARFish products  |
| confidence | str: \{HIGH, MEDIUM, LOW\} | level of confidence for is\_vessel and is\_fishing labels |

### 5.1 Loading and visualising GRD groundtruth labels

In [10]:
groundtruth_GRD = pd.read_csv(
    str(
        Path(SARFish_root_directory, "GRD", correspondence['DATA_PARTITION'], f"GRD_{correspondence['DATA_PARTITION']}.csv")
    )
)
groundtruth_GRD = groundtruth_GRD[groundtruth_GRD['GRD_product_identifier'] == correspondence['GRD_product_identifier']]

In [11]:
plot_GRD.add_bboxes(groundtruth_GRD[['left', 'right', 'bottom', 'top']])
plot_GRD.add_labels(
    columns = groundtruth_GRD['detect_scene_column'], rows = groundtruth_GRD['detect_scene_row'], 
    categories = groundtruth_GRD[['detect_id', 'is_vessel', 'is_fishing', 'vessel_length_m', 'confidence']], 
    legend_label = "groundtruth", color = "yellow"
)

In [12]:
groundtruth_GRD.info()

<class 'pandas.core.frame.DataFrame'>
Index: 329 entries, 0 to 328
Data columns (total 20 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   partition                                       329 non-null    object 
 1   product_type                                    329 non-null    object 
 2   scene_id                                        329 non-null    object 
 3   detect_id                                       329 non-null    object 
 4   GRD_product_identifier                          329 non-null    object 
 5   detect_lat                                      329 non-null    float64
 6   detect_lon                                      329 non-null    float64
 7   detect_scene_column                             329 non-null    float64
 8   detect_scene_row                                329 non-null    float64
 9   top                                             

### 5.2 Loading and visualising SLC groundtruth labels

In [13]:
groundtruth_SLC = pd.read_csv(
    str(
        Path(SARFish_root_directory, "SLC", correspondence['DATA_PARTITION'], f"SLC_{correspondence['DATA_PARTITION']}.csv")
    )
)
groundtruth_SLC = groundtruth_SLC[groundtruth_SLC['SLC_product_identifier'] == correspondence['SLC_product_identifier']]

The SLC groundtruth labels are very similar to the GRD labels, except for the addition of a 'swath\_index' column which specifies which swath (within a particular SLC product) the groundtruth label belongs to.

In [14]:
groundtruth_SLC.info()

<class 'pandas.core.frame.DataFrame'>
Index: 321 entries, 0 to 320
Data columns (total 21 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   partition                                       321 non-null    object 
 1   product_type                                    321 non-null    object 
 2   scene_id                                        321 non-null    object 
 3   detect_id                                       321 non-null    object 
 4   SLC_product_identifier                          321 non-null    object 
 5   swath_index                                     321 non-null    int64  
 6   detect_lat                                      321 non-null    float64
 7   detect_lon                                      321 non-null    float64
 8   detect_scene_column                             321 non-null    float64
 9   detect_scene_row                                

Plotting the SLC groundtruth labels.

In [15]:
for swath_index, plot_SLC in zip([1, 2, 3], SLC_plots):
    swath_groundtruth_SLC = groundtruth_SLC[groundtruth_SLC['swath_index'] == swath_index]
    plot_SLC.add_bboxes(swath_groundtruth_SLC[['left', 'right', 'bottom', 'top']])
    plot_SLC.add_labels(
        columns = swath_groundtruth_SLC['detect_scene_column'], rows = swath_groundtruth_SLC['detect_scene_row'],
        categories = swath_groundtruth_SLC[['detect_id', 'is_vessel', 'is_fishing', 'vessel_length_m', 'confidence']],
        legend_label = "groundtruth", color = "yellow"
    )

## 6. SARFish challenge prediction submission format

### 6.1 GRD

In [16]:
reference_GRD_predictions = pd.read_csv(str(Path("./labels/reference_GRD_predictions.csv")))
reference_GRD_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Unnamed: 0              300 non-null    int64 
 1   partition               300 non-null    object
 2   product_type            300 non-null    object
 3   scene_id                300 non-null    object
 4   detect_scene_column     300 non-null    int64 
 5   detect_scene_row        300 non-null    int64 
 6   vessel_length_m         300 non-null    int64 
 7   is_vessel               300 non-null    bool  
 8   is_fishing              300 non-null    bool  
 9   GRD_product_identifier  300 non-null    object
dtypes: bool(2), int64(4), object(4)
memory usage: 19.5+ KB


The following cell plots the simulated GRD predictions. Click on individual labels within the plot to see more information.

In [17]:
plot_GRD.add_labels(
    columns = reference_GRD_predictions['detect_scene_column'], rows = reference_GRD_predictions['detect_scene_row'], 
    categories = reference_GRD_predictions[['is_vessel', 'is_fishing', 'vessel_length_m']], 
    legend_label = "reference GRD predictions", color = "red"
)

### 6.2 SLC

The following cell loads a simulated example of a set of predictions to illustrate the format of submissions for the SLC product type in the SARFish challenge. Submissions for the challenge must have the following required columns. The SLC predictions format (like the SLC groundtruth labels) differs from the GRD format by requiring a swath\_index column to specify which swath each prediction belongs to.

In [18]:
reference_SLC_predictions = pd.read_csv(str(Path("./labels/reference_SLC_predictions.csv")))
reference_SLC_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Unnamed: 0              600 non-null    int64 
 1   partition               600 non-null    object
 2   product_type            600 non-null    object
 3   scene_id                600 non-null    object
 4   detect_scene_column     600 non-null    int64 
 5   detect_scene_row        600 non-null    int64 
 6   vessel_length_m         600 non-null    int64 
 7   is_vessel               600 non-null    bool  
 8   is_fishing              600 non-null    bool  
 9   SLC_product_identifier  600 non-null    object
 10  swath_index             600 non-null    int64 
dtypes: bool(2), int64(5), object(4)
memory usage: 43.5+ KB


Plotting the simulated SLC predictions.

In [19]:
for swath_index, plot_SLC in zip([1, 2, 3], SLC_plots):
    swath_reference_SLC_predictions = reference_SLC_predictions[reference_SLC_predictions['swath_index'] == swath_index]
    plot_SLC.add_labels(
        columns = swath_reference_SLC_predictions['detect_scene_column'], rows = swath_reference_SLC_predictions['detect_scene_row'],
        categories = swath_reference_SLC_predictions[['is_vessel', 'is_fishing', 'vessel_length_m']],
        legend_label = "reference SLC predictions", color = "red"
    )
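
Before submitting, it can help to check that a predictions table contains the expected columns. The sketch below derives the column lists from the reference prediction files loaded above. It is not an official validator; `required_columns` and `missing_columns` are hypothetical helpers, and the assumption is that the "Unnamed: 0" index column is not required.

```python
# Columns shared by the reference GRD and SLC prediction files shown above.
COMMON_COLUMNS = [
    "partition",
    "product_type",
    "scene_id",
    "detect_scene_column",
    "detect_scene_row",
    "vessel_length_m",
    "is_vessel",
    "is_fishing",
]


def required_columns(product_type):
    """Return the columns expected for a "GRD" or "SLC" predictions table."""
    columns = COMMON_COLUMNS + [f"{product_type}_product_identifier"]
    if product_type == "SLC":
        columns.append("swath_index")  # SLC predictions are made per swath
    return columns


def missing_columns(header, product_type):
    """List the required columns that are absent from `header`."""
    return [column for column in required_columns(product_type) if column not in header]
```

With pandas, the header can be passed as `list(reference_SLC_predictions.columns)`; an empty result means no required column is missing.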

### 6.3 SARFish reference example

Provided in reference/SARFish_reference.py is a simple didactic example of an algorithm which produces predictions. It is intended as a basic starting point showing the conceptual layout of a training and inference script.

# UPDATE for deep learning reference model

The following cell runs the SARFish reference algorithm on the command line to generate the predictions for the GRD and SLC product.

In [20]:
#! ./SARFish_reference.py --sarfish_root_directory "${SARFISH_ROOT_DIRECTORY}"

## 7. How to evaluate model performance using the SARFish metric

The evaluation metric for the SARFish dataset is the same as that of the xView3-SAR challenge. The provided SARFish\_metric script takes into consideration the imaging geometry differences between the GRD and SLC products and makes comparing the detection, classification and length regression results of models trained on the two product\_types straightforward.

### 7.1 SARFish metric tasks

#### Aggregate score

The aggregate score combines the scores of the following 5 tasks: the detection score $F1_D$ scales the average of the remaining terms. It is a scalar that sums up the performance for direct comparison on a leaderboard.

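$$ M_R = F1_D \times \frac{1 + F1_S + F1_V + F1_F + PE_L}{5} $$

$$ PE_L = 1 - \frac{1}{N} \sum_{n=1}^{N} \frac{| \hat{\ell}_n - \ell_n |}{\ell_n} $$

Here $\hat{\ell}_n$ and $\ell_n$ are the predicted and groundtruth lengths of the $n$-th matched vessel, and $N$ is the number of matched vessels with a groundtruth length.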

More information about the definitions of the metrics used can be found in the [xView3-SAR paper](https://arxiv.org/abs/2206.00897) [2].
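
The aggregate formula is simple enough to evaluate by hand. The sketch below is a direct transcription of $M_R$ for illustration only; the provided SARFish\_metric script remains the reference, and `aggregate_score` is a hypothetical name.

```python
def aggregate_score(f1_detection, f1_shore, f1_vessel, f1_fishing, length_accuracy):
    """Combine the five task scores into the aggregate score M_R."""
    return f1_detection * (1 + f1_shore + f1_vessel + f1_fishing + length_accuracy) / 5


# A perfect detector that fails every other task still scores 1 * (1 + 0) / 5.
print(aggregate_score(1.0, 0.0, 0.0, 0.0, 0.0))
```

Because the detection score multiplies everything else, a model that detects nothing scores 0 regardless of its classification or length estimates.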

#### 1. Maritime object detection task ($F1_D$)

The detection task is measured by assigning the predictions to the groundtruth maritime object locations using the Hungarian matching algorithm. The assignments determined by the algorithm are affected by the *assignment\_tolerance\_meters* and *costly\_dist* flag. The subset of groundtruth and prediction labels used in the evaluation of this task is determined by the *score\_all* and *drop\_low\_detect* flags. Due to the dependence of the following tasks on the output of the detection task, the implications of the choice of the parameters relevant to this task will propagate through the scores.
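
To illustrate the matching step, the sketch below finds a minimum-total-distance one-to-one assignment by brute force, which gives the same kind of result the Hungarian algorithm computes efficiently, and then keeps only pairs within a tolerance. It is a toy stand-in rather than the SARFish\_metric implementation: the *costly\_dist*, *score\_all* and *drop\_low\_detect* options are not modelled, positions are assumed to already be in metres, and `match_detections` and `detection_f1` are hypothetical names.

```python
from itertools import permutations
from math import dist


def match_detections(predictions, groundtruth, tolerance):
    """Return (prediction_index, groundtruth_index) pairs within `tolerance`.

    The assignment minimises the total distance and is found by brute force,
    so this is only practical for a handful of points.
    """
    # Permute the longer list so every item of the shorter list gets a partner.
    swapped = len(predictions) > len(groundtruth)
    shorter, longer = (groundtruth, predictions) if swapped else (predictions, groundtruth)
    best_cost, best_assignment = float("inf"), ()
    for assignment in permutations(range(len(longer)), len(shorter)):
        cost = sum(dist(shorter[i], longer[j]) for i, j in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return [
        (j, i) if swapped else (i, j)
        for i, j in enumerate(best_assignment)
        if dist(shorter[i], longer[j]) <= tolerance
    ]


def detection_f1(predictions, groundtruth, tolerance):
    """F1 score where matched pairs are true positives."""
    true_positives = len(match_detections(predictions, groundtruth, tolerance))
    false_positives = len(predictions) - true_positives
    false_negatives = len(groundtruth) - true_positives
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


predictions = [(0, 0), (100, 100), (500, 500)]
groundtruth = [(3, 4), (110, 100)]
print(detection_f1(predictions, groundtruth, tolerance=20))
```

For realistic numbers of detections, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time.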

#### 2. close-to-shore subset detection task ($F1_S$)

The close-to-shore detection task is measured by assigning the predictions to groundtruth maritime object locations given the *distance\_from\_shore\_tolerance\_meters* threshold. Correctly detecting maritime objects close to shore is harder than in the open sea. The definition of close to shore is determined by both the *distance\_from\_shore\_tolerance\_meters* threshold and the *shoreline\_type* used.

#### 3. 'is vessel' classification task ($F1_V$)

The boolean 'is\_vessel' classification task is measured on the predictions and groundtruth that were determined to be true positives from the detection task. Only the true positives for which the groundtruth label's 'is\_vessel' category is True or False are evaluated; NaN values are discarded. Each detection task true positive prediction and groundtruth pair is assigned as a true positive, false positive, false negative or true negative for the 'is\_vessel' task.

#### 4. 'is fishing' classification task ($F1_F$)

The boolean fishing vessel classification task is dependent on the true positive prediction and groundtruth pairs from the 'is\_vessel' classification task. Differing from the evaluation of the 'is\_vessel' classification, only true positive pairs for which the 'is\_vessel' label is True are evaluated. This has the implication of restricting the evaluation of the 'is\_fishing' task in a way that respects the hierarchical nature of the classification of fishing vessels as a subset of vessels. Again NaN values are discarded. 
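The hierarchical filtering described for the two classification tasks can be sketched with pandas. The column names and the data below are hypothetical, and the sketch reads "the 'is\_vessel' label is True" as referring to the groundtruth label, which is an assumption rather than a statement about SARFish\_metric.py.

```python
import numpy as np
import pandas as pd


def f1_from_pairs(groundtruth, predicted):
    """F1 score for boolean labels on already-matched prediction/groundtruth pairs."""
    groundtruth = np.asarray(groundtruth, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = int(np.sum(groundtruth & predicted))
    fp = int(np.sum(~groundtruth & predicted))
    fn = int(np.sum(groundtruth & ~predicted))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical detection true positive pairs; NaN marks an unknown groundtruth label.
pairs = pd.DataFrame({
    "gt_is_vessel":    [True, True, False, np.nan, True],
    "pred_is_vessel":  [True, False, False, True, True],
    "gt_is_fishing":   [True, np.nan, np.nan, np.nan, False],
    "pred_is_fishing": [True, False, False, True, True],
})

# 'is_vessel': keep only pairs whose groundtruth label is known (True or False).
vessel_pairs = pairs[pairs["gt_is_vessel"].notna()]
f1_vessel = f1_from_pairs(vessel_pairs["gt_is_vessel"], vessel_pairs["pred_is_vessel"])

# 'is_fishing': keep only pairs whose groundtruth is a vessel with a known fishing label.
is_vessel = pairs["gt_is_vessel"] == True  # NaN compares as False here
fishing_pairs = pairs[is_vessel & pairs["gt_is_fishing"].notna()]
f1_fishing = f1_from_pairs(fishing_pairs["gt_is_fishing"], fishing_pairs["pred_is_fishing"])

print(f1_vessel, f1_fishing)
```

Note how each task evaluates a smaller subset than the one before it, so few matched detections leave very few samples for the fishing score.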

#### 5. vessel length regression task ($PE_L$)

The length regression task is measured on the true positive predictions and groundtruth determined by the detection task. The subset to evaluate is further restricted to just the true positive prediction and groundtruth pairs for which the groundtruth 'vessel\_length\_m' attribute is given. Again NaN values are discarded. 

Change from xView3-SAR:

Previously in the metrics of the xView3-SAR challenge, the relative error was calculated with respect to the groundtruth 'vessel\_length\_m' attribute. The implication is that relative errors could be larger than 1. Subsequently the relative error is subtracted from 1 to get the mean relative accuracy. If the mean relative error is greater than 1, a length regression task accuracy of 0 is returned.

In this implementation, the relative error is calculated with respect to max(groundtruth\['vessel\_length\_m'\], predictions\['vessel\_length\_m'\]). This ensures that the relative error is at most 1. The implication is that the SARFish length regression performance metric is not as harsh as the xView3-SAR version.
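The difference between the two length accuracy definitions can be illustrated with a short NumPy sketch. The function names are hypothetical and the code follows only the description in this section (NaN groundtruth lengths discarded, xView3-SAR accuracy clipped at 0); it is not taken from either metric script and assumes positive, non-NaN predicted lengths.

```python
import numpy as np


def length_accuracy_xview3(groundtruth_m, predicted_m):
    """Relative error with respect to the groundtruth length, accuracy floored at 0."""
    gt = np.asarray(groundtruth_m, dtype=float)
    pred = np.asarray(predicted_m, dtype=float)
    known = ~np.isnan(gt)  # discard pairs without a groundtruth length
    relative_error = np.abs(pred[known] - gt[known]) / gt[known]
    return 1.0 - min(float(np.mean(relative_error)), 1.0)


def length_accuracy_sarfish(groundtruth_m, predicted_m):
    """Relative error with respect to max(groundtruth, prediction), so each error is at most 1."""
    gt = np.asarray(groundtruth_m, dtype=float)
    pred = np.asarray(predicted_m, dtype=float)
    known = ~np.isnan(gt)
    relative_error = np.abs(pred[known] - gt[known]) / np.maximum(gt[known], pred[known])
    return 1.0 - float(np.mean(relative_error))


# Hypothetical matched pairs: one large over-estimate, one exact, one unknown groundtruth.
groundtruth_lengths = [100.0, 50.0, np.nan]
predicted_lengths = [300.0, 50.0, 80.0]

xview3_accuracy = length_accuracy_xview3(groundtruth_lengths, predicted_lengths)
sarfish_accuracy = length_accuracy_sarfish(groundtruth_lengths, predicted_lengths)
print(xview3_accuracy, sarfish_accuracy)
```

In this example a single prediction that is three times too long drives the xView3-SAR style accuracy to 0, while the max-normalised variant still gives credit for the exact prediction.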

[2] F. Paolo et al., xView3-SAR: Detecting Dark Fishing Activity Using Synthetic Aperture Radar Imagery. arXiv, 2022. doi: 10.48550/ARXIV.2206.00897. 

### 7.2 SARFish metric parameters

#### str: shoreline\_type ["xView3\_shoreline", "global\_shoreline\_vector"] - default "xView3\_shoreline"

The type of shoreline vector to use for the evaluation of the close-to-shore detection task. The choices are:
    - "xView3\_shoreline" which are the xView3-SAR dataset shorelines projected into the pixel space of the GRD and SLC SARFish products. This shoreline allows a one to one comparison between the performance of models trained using the xView3-SAR imagery and SARFish imagery on the close-to-shore detection task.
    - "global\_shoreline\_vector" derived from the [GlobalIslands global shoreline vector](https://www.tandfonline.com/doi/full/10.1080/1755876X.2018.1529714). This shoreline is an update to the shorelines using a more accurate source. [3]

#### float: distance\_from\_shore\_tolerance\_meters - default: 2000.0

The tolerance used to designate a prediction and groundtruth as close-to-shore. This tolerance is used to pick out a subset of the predictions and groundtruth for evaluating the close-to-shore detection task.

#### float: assignment\_tolerance\_meters - default: 200.0

The tolerance used to threshold assignments between prediction and groundtruth detections as true positives. By default, any assignment between a prediction and a groundtruth less than 200.0 meters apart is counted as a true positive for that particular detection task. This tolerance is also used to determine the predictions associated with low confidence groundtruth if the *score\_all* flag is False AND the *drop\_low\_detect* flag is True.

#### bool: score\_all - default: False

Whether to score the predictions against all groundtruth label confidence levels. By default the score function drops groundtruth for which the "confidence" attribute is "LOW" and evaluates model performance against "MEDIUM" and "HIGH" confidence groundtruth only.
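The default behaviour of *score\_all* can be sketched as a pandas filter on the "confidence" attribute. The helper name, the `detect_id` column and the example rows are hypothetical; only the "confidence" attribute and its "LOW", "MEDIUM" and "HIGH" values come from the description above.

```python
import pandas as pd

# Hypothetical groundtruth labels with a confidence attribute.
groundtruth = pd.DataFrame({
    "detect_id": ["a", "b", "c", "d"],
    "confidence": ["HIGH", "LOW", "MEDIUM", "LOW"],
})


def select_groundtruth(groundtruth, score_all=False):
    """Return the groundtruth rows that predictions are scored against."""
    if score_all:
        return groundtruth
    # Default: keep only MEDIUM and HIGH confidence groundtruth.
    return groundtruth[groundtruth["confidence"].isin(["MEDIUM", "HIGH"])]


scored_default = select_groundtruth(groundtruth)
scored_all = select_groundtruth(groundtruth, score_all=True)
print(list(scored_default["detect_id"]), len(scored_all))
```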

#### bool: drop\_low\_detect - default: True

Whether to use the Hungarian matching algorithm to find the lowest distance cost assignment of predictions to the "LOW" confidence groundtruth and remove them from further consideration in the metrics. This option is used in concert with *score\_all* = False, and means that when evaluating the performance of the predictions against "MEDIUM" and "HIGH" confidence groundtruth only, an unfair penalty is not incurred for correctly detecting maritime objects with a confidence attribute of "LOW".

#### bool: costly\_dist - default: True

Whether to assign a very large distance to pairwise distances between predictions and groundtruth when the distance is larger than the *assignment\_tolerance\_meters* threshold for true positive assignment. This modifies the optimisation problem solved by the Hungarian matching algorithm so that predictions and groundtruth whose pairwise distance is larger than the threshold are unlikely to be assigned to each other. Without this option, the matching algorithm may find low-cost solutions that pair more predictions and groundtruth lying further apart than the threshold, which may increase the false positive and false negative counts.
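The effect of *costly\_dist* can be illustrated with a small sketch using `scipy.optimize.linear_sum_assignment`, SciPy's solver for the assignment problem addressed by the Hungarian algorithm. The helper name, the `LARGE_DISTANCE` constant, the distance matrix and the handling of distances exactly equal to the tolerance are all assumptions made for illustration; the actual values used in SARFish\_metric.py may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

LARGE_DISTANCE = 1e9  # hypothetical stand-in for "a very large distance"


def count_true_positives(distances_m, tolerance_m=200.0, costly_dist=True):
    """Match groundtruth (rows) to predictions (columns) and count pairs within tolerance."""
    distances_m = np.asarray(distances_m, dtype=float)
    cost = distances_m.copy()
    if costly_dist:
        # Make out-of-tolerance pairings prohibitively expensive for the solver.
        cost[cost > tolerance_m] = LARGE_DISTANCE
    rows, cols = linear_sum_assignment(cost)
    # True positives are judged on the real distances, not the modified costs.
    return int(np.sum(distances_m[rows, cols] <= tolerance_m))


# Hypothetical pairwise distances in meters: 2 groundtruth objects x 2 predictions.
distances = [[190.0, 10.0],
             [210.0, 190.0]]

with_costly = count_true_positives(distances, costly_dist=True)
without_costly = count_true_positives(distances, costly_dist=False)
print(with_costly, without_costly)
```

In this example the raw distances favour the pairing with total cost 10 + 210, which leaves one pair outside the tolerance, whereas the modified costs favour the pairing 190 + 190 in which both pairs count as true positives.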

[3] R. Sayre et al., “A new 30 meter resolution global shoreline vector and associated global islands database for the development of standardized ecological coastal units,” Journal of Operational Oceanography, vol. 12, no. sup2, pp. S47–S56, 2019. Downloaded from https://www.sciencebase.gov/catalog/item/63bdf25dd34e92aad3cda273 at https://www.sciencebase.gov/catalog/file/get/63bdf25dd34e92aad3cda273

### 7.3 GRD metric evaluation

The following cell shows an example of calling the SARFish\_metric.score function on the simulated predictions for a single scene. **Note:** the score function automatically iterates over all the scenes listed in the 'scene\_id' column of the predictions csv, so for a particular product\_type and partition, simply include all the predictions from each of the scene\_ids you want to evaluate. The output of the score function shows the confusion matrix and a summary table of F1 score, recall, and precision for the 5 metrics that comprise the xView3-SAR/SARFish aggregate metric.

In [21]:
score(
    predictions = reference_GRD_predictions, groundtruth = groundtruth_GRD, 
    xView3_SLC_GRD_correspondences = xView3_SLC_GRD_correspondences, SARFish_root_directory = SARFish_root_directory, 
    product_type = "GRD", shoreline_type = "xView3_shoreline", 
    score_all =  False, drop_low_detect = True, costly_dist = True, evaluation_mode = False
)

shoreline_type:     xView3_shoreline
score_all:          False
drop_low_detect:    True
costly_dist:        True
evaluation_mode:    False

dropping predictions corresponding to low confidence groundtruth...


0it [00:00, ?it/s]


evaluating S1B_IW_GRDH_1SDV_20200803T075721_20200803T075746_022756_02B2FF_033A
location task confusion matrix:
                      ╔═══════════════════╗
                      ║    groundtruth    ║
                      ╠═════════╤═════════╣
                      ║ True    │ False   ║
╔═════════════╦═══════╬═════════╪═════════╣
║ predictions ║ True  ║ tp:  29 │ fp: 262 ║
║             ╟───────╫─────────┼─────────╢
║             ║ False ║ fn: 188 │ tn: N/A ║
╚═════════════╩═══════╩═════════╧═════════╝
location task performance:
╔═══════════╤════════════════════╗
║ precision │ 0.0996563573883162 ║
║ recall    │ 0.1336405529953917 ║
║ F1 score  │ 0.1141732283464567 ║
╚═══════════╧════════════════════╝
location_close_to_shore task confusion matrix:
                      ╔═══════════════════╗
                      ║    groundtruth    ║
                      ╠═════════╤═════════╣
                      ║ True    │ False   

{'loc_fscore': 0.1141732283464567,
 'loc_fscore_shore': 0.1461794019933555,
 'vessel_fscore': 0.65,
 'fishing_fscore': 0.6666666666666666,
 'length_acc': 0.0,
 'aggregate': 0.05550736768140202}
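As a quick check on how to read the tables, the precision, recall and F1 score of the location task can be recomputed from the confusion matrix counts printed above (tp = 29, fp = 262, fn = 188). The helper below is a hypothetical illustration using the standard definitions, not a function from SARFish\_metric.py.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall and F1 score from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts taken from the location task confusion matrix in the output above.
precision, recall, f1 = detection_scores(tp=29, fp=262, fn=188)
print(precision, recall, f1)
```

True negatives are reported as N/A because there is no meaningful count of "correctly undetected" locations in a detection task, and none of the three scores needs one.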

The following cell illustrates how to run the SARFish metric script from the command line.

In [22]:
! ./SARFish_metric.py \
    -p ./labels/reference_GRD_predictions.csv \
    -g "${SARFISH_ROOT_DIRECTORY}"/GRD/validation/GRD_validation.csv \
    --sarfish_root_directory "${SARFISH_ROOT_DIRECTORY}"\
    --product_type GRD \
    --xview3_slc_grd_correspondences ./labels/xView3_SLC_GRD_correspondences.csv \
    --shore_type xView3_shoreline \
    --drop_low_detect \
    --costly_dist \
    --no-evaluation_mode

shoreline_type:     xView3_shoreline
score_all:          False
drop_low_detect:    True
costly_dist:        True
evaluation_mode:    False

dropping predictions corresponding to low confidence groundtruth...
1it [00:00, 51.09it/s]

evaluating S1B_IW_GRDH_1SDV_20200803T075721_20200803T075746_022756_02B2FF_033A
location task confusion matrix:
                      ╔═══════════════════╗
                      ║    groundtruth    ║
                      ╠═════════╤═════════╣
                      ║ True    │ False   ║
╔═════════════╦═══════╬═════════╪═════════╣
║ predictions ║ True  ║ tp:  29 │ fp: 262 ║
║             ╟───────╫─────────┼─────────╢
║             ║ False ║ fn: 188 │ tn: N/A ║
╚═════════════╩═══════╩═════════╧═════════╝
location task performance:
╔═══════════╤════════════════════╗
║ precision │ 0.0996563573883162 ║
║ recall    │ 0.1336405529953917 ║
║ F1 score  │ 0.1141732283464567 ║
╚═══════════╧════════════════════╝

### 7.4 SLC metric evaluation

The evaluation of predictions on an SLC product is straightforward. The metrics script handles the multiple swaths and imaging geometry of the SLC products. The script transforms the pixel indices into distances in meters so that the detection tasks are evaluated in the same units as the xView3-SAR data (in its metric script) and the SARFish GRD products shown in the cells above.

In [23]:
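score(
    predictions = reference_SLC_predictions, groundtruth = groundtruth_SLC,
    xView3_SLC_GRD_correspondences = xView3_SLC_GRD_correspondences, SARFish_root_directory = SARFish_root_directory,
    product_type = "SLC", shoreline_type = "xView3_shoreline",
    score_all = False, drop_low_detect = True, costly_dist = True, evaluation_mode = False
)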

shoreline_type:     xView3_shoreline
score_all:          False
drop_low_detect:    True
costly_dist:        True
evaluation_mode:    False

dropping predictions corresponding to low confidence groundtruth...


0it [00:00, ?it/s]


evaluating S1B_IW_SLC__1SDV_20200803T075720_20200803T075748_022756_02B2FF_E5D2
location task confusion matrix:
                      ╔═══════════════════╗
                      ║    groundtruth    ║
                      ╠═════════╤═════════╣
                      ║ True    │ False   ║
╔═════════════╦═══════╬═════════╪═════════╣
║ predictions ║ True  ║ tp:  25 │ fp: 565 ║
║             ╟───────╫─────────┼─────────╢
║             ║ False ║ fn: 190 │ tn: N/A ║
╚═════════════╩═══════╩═════════╧═════════╝
location task performance:
╔═══════════╤════════════════════╗
║ precision │ 0.0423728813559322 ║
║ recall    │ 0.1162790697674419 ║
║ F1 score  │ 0.0621118012422360 ║
╚═══════════╧════════════════════╝
location_close_to_shore task confusion matrix:
                      ╔═══════════════════╗
                      ║    groundtruth    ║
                      ╠═════════╤═════════╣
                      ║ True    │ False   

{'loc_fscore': 0.06211180124223603,
 'loc_fscore_shore': 0.08908685968819599,
 'vessel_fscore': 0.8292682926829269,
 'fishing_fscore': 0.5,
 'length_acc': 0.0,
 'aggregate': 0.029706585017703884}

## 8. How to participate in the SARFish challenge.

Competitors will use the SLC products of the SARFish dataset to generate detection, classification, and vessel length regression models. We are looking for algorithms which exploit the complex data contained in the SLC products. 

1. Download the SARFish dataset (see section 2)
2. Use the train and validation partitions of the SARFish dataset to generate your model (see section 3.1).
3. Run inference over the entire public data partition, outputting your predictions in submission format (see section 6).
4. Submit your predictions in csv format to the [Kaggle](INSERT Kaggle link)