# SARToVisibleDataset Class Usage

This notebook shows how the [SARToVisibleDataset](https://github.com/MIT-AI-Accelerator/multiearth-challenge/blob/c2318665ab94451eea4bd1b9e31a71655c6be001/src/multiearth_challenge/datasets/translation_dataset.py#L10) class can be used to sample data from the NetCDF files provided as part of the MultiEarth challenge.

In [None]:
import pkg_resources

from matplotlib import pyplot as plt
import numpy as np

from multiearth_challenge.datasets import translation_dataset as td

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Specifying Dataset Data
In this example, Sentinel-1 SAR imagery will serve as the source imagery and Sentinel-2 visible imagery will serve as the target.

In [None]:
# Set data paths to sample data included as part of the MultiEarth repository
sar_files = [pkg_resources.resource_filename("multiearth_challenge", "data/sample_dataset/sent1_sample.nc")]
visible_files = [pkg_resources.resource_filename("multiearth_challenge", "data/sample_dataset/sent2_sample.nc")]

Specify the bands to include for the source and target data. For source data from Sentinel-1 a list containing 'VV' and / or 'VH' polarizations specifies the desired bands.

For target imagery, the desired bands are set with a dictionary whose keys are sensor names and whose values are lists of bands.<br>
Acceptable sensor and visible band values are:<br>
"Landsat-5": ["SR_B3", "SR_B2", "SR_B1"]<br>
"Landsat-8": ["SR_B4", "SR_B3", "SR_B2"]<br>
"Sentinel-2": ["B4", "B3", "B2"]<br>

In [None]:
sar_bands = ["VV", "VH"]  # VV and VH polarizations

# All RGB bands for Sentinel-2 are specified here, although the default value of this argument in the SARToVisibleDataset initializer selects all visible bands automatically.
visible_bands = {
    "Sentinel-2": ["B4", "B3", "B2"],
}

### Creating the Dataset
Set the additional parameters used by the SARToVisibleDataset.

In [None]:
# If True, returned images will have multiple channels in increasing order of frequency (e.g., red, green, blue for visible), co-pol before cross-pol, and with bands not originating from collected imagery coming last and in alphabetical order.
# The metadata returned with the imagery also specifies the channel order. If False, each band is treated as a separate sample.
merge_sar_bands = False # bool
merge_visible_bands = False # bool

# The maximum cloud coverage allowed in visible imagery as a fraction in [0, 1]. Setting the maximum above 0 may be useful to incorporate additional samples even if the truth visible image is slightly obscured.
# Note, there may be some inaccuracies in the identified cloud coverage provided by the sensor's QA bands. This is especially true for Sentinel-2 data.
max_visible_cloud_coverage = 0.0  # float

# The minimum and maximum inclusive relative time window in days around the target visible image from which source imagery is pulled.
# If the minimum is None, there is no filter on the minimum relative date. Similarly, a maximum of None applies no filter on the maximum relative date.
# For example, with a value of (-7, 7) only SAR imagery collected from 7 days before to 7 days after a visible image date will be returned as source imagery.
sar_date_window = (-7, 7)  # Tuple[Optional[float], Optional[float]]

# If True, each sample pairs a target image with exactly one source image. A single source image may be paired with multiple target images and vice versa depending on the data filters applied.
# If False, each target image is returned with all source images at the same location that satisfy the applied data filters. This may be useful if you want to include information from multiple images when making a single prediction.
single_source_image = True  # bool

# If True, a ValueError is raised when no source or target images remain after data filtering. If False, the dataset will instead have length 0.
error_on_empty = True  # bool
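
The inclusive date window described above, including the meaning of a None bound, can be illustrated with a small standalone check. This is only a sketch of the semantics stated in the comments, not the library's actual filtering code; the helper name in_date_window is hypothetical.

```python
from datetime import datetime
from typing import Optional, Tuple


def in_date_window(
    source_date: datetime,
    target_date: datetime,
    window: Tuple[Optional[float], Optional[float]],
) -> bool:
    """Return True if source_date lies inside the inclusive window around target_date."""
    min_days, max_days = window
    # Signed offset in days: negative means the source was collected before the target.
    offset_days = (source_date - target_date).total_seconds() / 86400.0
    if min_days is not None and offset_days < min_days:
        return False
    if max_days is not None and offset_days > max_days:
        return False
    return True


target = datetime(2020, 6, 15)
print(in_date_window(datetime(2020, 6, 8), target, (-7, 7)))    # exactly 7 days before: True
print(in_date_window(datetime(2020, 6, 23), target, (-7, 7)))   # 8 days after: False
print(in_date_window(datetime(2020, 1, 1), target, (None, 7)))  # no lower bound: True
```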

Create the dataset. The sample data is small, but depending on the number of images contained in the NetCDF files, calculating the cloud coverage statistics may take several minutes.

In [None]:
dataset = td.SARToVisibleDataset(
    sar_files,
    visible_files,
    sar_bands,
    merge_sar_bands,
    sar_date_window,
    visible_bands,
    merge_visible_bands,
    max_visible_cloud_coverage,
    single_source_image,
    error_on_empty,       
)   

### Data Returned by the Dataset
The dataset serves as a sequence of samples. Each call to \_\_getitem__ returns a two-element tuple. The first element is a list of dictionaries, each holding a source image and its associated metadata, for source images at the same location as the target that satisfy the sar_date_window. If single_source_image is True, this is always a one-element list, with the multiple possible pairings returned as separate samples. The second element is a dictionary holding a single target image and its associated metadata.

Note, with the parameters set above, the returned target visible images have no flagged cloud coverage and the source SAR images are within ±7 days of the target visible image date. Cloud coverage is determined by information in the sensor's QA band, which may have inaccuracies.

In [None]:
print(f"Number of dataset samples: {len(dataset)}")

# Get sample with index 10
source_data, target_data = dataset[10]

# The returned source_data is a list of all source data paired with a target image.
# Since single_source_image was set to True during initialization, this list will always have one element.
print(f"Source data key values returned: {source_data[0].keys()}")
print(f"Target data key values returned: {target_data.keys()}")

### Single channel vs. Multi-channel Imagery
The images returned by the dataset above are single band, with a separate paired sample for each combination of source and target band. Below is an example where the bands have been merged, giving a 2-channel SAR source image and a 3-channel RGB target image.

In [None]:
# Make a second dataset with merged bands
merge_sar_bands = True
merge_visible_bands = True
dataset_merged_bands = td.SARToVisibleDataset(
    sar_files,
    visible_files,
    sar_bands,
    merge_sar_bands,
    sar_date_window,
    visible_bands,
    merge_visible_bands,
    max_visible_cloud_coverage,
    single_source_image,
    error_on_empty,       
)   

# Merging collapses the 3 visible bands into one target image and the 2 SAR bands into one source image, so this dataset has 1/6 the number of samples of the dataset with separate bands.
print(f"Number of merged band dataset samples: {len(dataset_merged_bands)}") 

# Get sample with index 2
source_data_merged_bands, target_data_merged_bands = dataset_merged_bands[2]
print(f"Shape of source image without merging bands: {source_data[0]['image'].shape}")
print(f"List of bands associated with the single band source image: {source_data[0]['bands']}")
print(f"Shape of the target image without merging bands: {target_data['image'].shape}")
print(f"List of bands associated with the single band target image: {target_data['bands']}\n")

print(f"Shape of the source image with merged bands: {source_data_merged_bands[0]['image'].shape}")
print(f"List of bands associated with the multi-band source image: {source_data_merged_bands[0]['bands']}") # This list corresponds to the channels in the image with the first band corresponding to channel index 0, the second channel index 1, etc.
print(f"Shape of the target image with merged bands: {target_data_merged_bands['image'].shape}")
print(f"List of bands associated with the multi-band target image: {target_data_merged_bands['bands']}") # This list corresponds to the channels in the image with the first band corresponding to channel index 0, the second channel index 1, etc.
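
Because the bands list gives the channel order, a single channel can be pulled out of a merged image by name. The sketch below assumes the image is channels-first (consistent with the transpose used for plotting later in this notebook) and that bands behaves like a list of strings; select_band is a hypothetical helper, and the array is a stand-in rather than real data.

```python
import numpy as np


def select_band(image: np.ndarray, bands: list, band_name: str) -> np.ndarray:
    """Return the 2-D channel of a channels-first image that matches band_name."""
    channel_index = list(bands).index(band_name)  # raises ValueError if the band is absent
    return image[channel_index]


# Stand-in for a merged Sentinel-2 target: 3 channels of 4x4 pixels.
fake_image = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
fake_bands = ["B4", "B3", "B2"]

green = select_band(fake_image, fake_bands, "B3")
print(green.shape)  # (4, 4)
```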

### Image Characteristics
The collected images may require normalization for visualization. Here we perform a simple normalization without color balancing.<br>
The images plotted below demonstrate some of the challenges of working with SAR imagery, where landscape features may look very different from those in visible imagery.<br>
Note also that the filtering of cloudy images uses the Sentinel-2 QA band, which may have inaccuracies, so some returned images may still contain bright, high pixel value clouds obscuring land. Additional filtering or masking of cloud coverage can be applied if needed.

In [None]:
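def normalize(img):
    """Stretch an image for display by mapping [mean - std, mean + 3 * std] to [0, 1]."""
    img = img.astype(np.float64)  # astype returns a copy, so the input is not modified
    img -= np.mean(img)  # center on the mean
    img_std = np.std(img)
    img += img_std  # shift so that (mean - std) maps to 0
    img /= img_std * 4.0  # scale so that (mean + 3 * std) maps to 1
    img = np.clip(img, 0, 1)  # clip values outside the display range
    return img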
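
The normalization above uses one mean and standard deviation for the whole image. As a possible variation, the same stretch can be applied to each channel independently, which changes the color balance of a merged RGB image. This is a sketch and not something this notebook uses; normalize_per_channel is a hypothetical name and it assumes a channels-first (C, H, W) array.

```python
import numpy as np


def normalize_per_channel(img: np.ndarray) -> np.ndarray:
    """Map [mean - std, mean + 3 * std] to [0, 1] separately for each channel of a (C, H, W) image."""
    img = img.astype(np.float64)
    mean = img.mean(axis=(1, 2), keepdims=True)
    std = img.std(axis=(1, 2), keepdims=True)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant channels
    return np.clip((img - mean + std) / (4.0 * std), 0, 1)


# Stand-in image whose channels have very different value ranges.
rng = np.random.default_rng(0)
fake_rgb = rng.normal(size=(3, 8, 8)) * np.array([1.0, 10.0, 100.0]).reshape(3, 1, 1)
stretched = normalize_per_channel(fake_rgb)
print(stretched.shape, stretched.min() >= 0.0, stretched.max() <= 1.0)
```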

In [None]:
# Plot the single band Sentinel-1 source image and a single band Sentinel-2 target image
fig1, axs1 = plt.subplots(1, 2)
fig1.suptitle("Single Band Example (source / target)")
axs1[0].set_title(f"{source_data[0]['data_source']}\n{source_data[0]['bands']}")
_ = axs1[0].imshow(normalize(source_data[0]["image"].squeeze()), cmap="gray")
axs1[1].set_title(f"{target_data['data_source']}\n{target_data['bands']}")
_ = axs1[1].imshow(normalize(target_data["image"].squeeze()), cmap="gray")

# Plot the single band Sentinel-1 source image and the RGB Sentinel-2 target image
fig2, axs2 = plt.subplots(1, 2)
fig2.suptitle("Multi-Band Target Example (source / target)")
axs2[0].set_title(f"{source_data[0]['data_source']}\n{source_data[0]['bands']}")
_ = axs2[0].imshow(normalize(source_data[0]["image"]).squeeze(), cmap="gray")
axs2[1].set_title(f"{target_data_merged_bands['data_source']}\n{target_data_merged_bands['bands']}")
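# The image is transposed from channels-first to (rows, columns, channels) for imshow; no colormap is passed because imshow ignores cmap for RGB data.
_ = axs2[1].imshow(normalize(target_data_merged_bands["image"]).transpose((1, 2, 0)))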
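
Since QA-band filtering can miss clouds, one simple option for additional filtering is a brightness heuristic: reject target images in which too many pixels are very bright. The sketch below is an assumption-laden illustration, not part of the MultiEarth code; bright_pixel_fraction is a hypothetical helper and the threshold values are placeholders that would need tuning on real data.

```python
import numpy as np


def bright_pixel_fraction(image: np.ndarray, threshold: float) -> float:
    """Fraction of pixel locations whose mean value across channels exceeds threshold."""
    image = np.asarray(image, dtype=np.float64)
    # Collapse a channels-first (C, H, W) image to a per-pixel brightness map (H, W).
    brightness = image.mean(axis=0) if image.ndim == 3 else image
    return float(np.mean(brightness > threshold))


# Stand-in RGB image with a bright "cloud" covering 4 of its 16 pixels.
fake_rgb = np.full((3, 4, 4), 500.0)
fake_rgb[:, :2, :2] = 9000.0

fraction = bright_pixel_fraction(fake_rgb, threshold=4000.0)  # placeholder threshold
keep_sample = fraction <= 0.1  # placeholder tolerance for bright pixels
print(fraction, keep_sample)  # 0.25 False
```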

### Closing the Datasets
The SARToVisibleDataset class holds open file handles to the NetCDF files, which need to be closed manually.

In [None]:
dataset.close()
dataset_merged_bands.close()
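
To make sure close() is called even if an error occurs while the dataset is in use, the standard library's contextlib.closing can wrap any object that exposes a close() method. The sketch below uses a stand-in class so that it runs without the NetCDF files; FakeDataset is hypothetical and only mimics the close() method used above.

```python
from contextlib import closing


class FakeDataset:
    """Stand-in for any object exposing close(), such as the datasets above."""

    def __init__(self):
        self.closed = False

    def __len__(self):
        return 3

    def close(self):
        self.closed = True


fake = FakeDataset()
with closing(fake) as ds:
    n_samples = len(ds)  # work with the dataset inside the with block

print(fake.closed)  # True, close() was called on leaving the block
```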

### Incorporation into Training and Evaluation
This dataset can be wrapped in a straightforward manner for use in a desired ML training / evaluation framework, allowing for selection of desired data from within the returned dictionary, applying data transforms such as image resizing, and converting to framework compatible types.
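
As a rough illustration of such a wrapper, the sketch below selects the image arrays from the returned dictionaries, converts them to float32, and applies an optional transform. It is framework-agnostic and uses stand-in samples; TranslationPairs is a hypothetical name, and it assumes single_source_image was True so that each source list has exactly one entry. In PyTorch, for example, a class like this would typically subclass torch.utils.data.Dataset.

```python
from typing import Callable, Optional, Sequence, Tuple

import numpy as np


class TranslationPairs:
    """Wrap a sequence of (source_list, target_dict) samples and return array pairs."""

    def __init__(
        self,
        dataset: Sequence,
        transform: Optional[Callable[[np.ndarray], np.ndarray]] = None,
    ):
        self.dataset = dataset
        self.transform = transform

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> Tuple[np.ndarray, np.ndarray]:
        source_data, target_data = self.dataset[index]
        # Assumes single_source_image=True, so the source list has exactly one entry.
        source = np.asarray(source_data[0]["image"], dtype=np.float32)
        target = np.asarray(target_data["image"], dtype=np.float32)
        if self.transform is not None:
            source, target = self.transform(source), self.transform(target)
        return source, target


# Stand-in samples mimicking the (source list, target dict) structure shown earlier.
fake_samples = [
    ([{"image": np.ones((1, 4, 4))}], {"image": np.zeros((1, 4, 4))}),
]
pairs = TranslationPairs(fake_samples, transform=lambda x: x * 2.0)
src, tgt = pairs[0]
print(src.shape, src.dtype)  # (1, 4, 4) float32
```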