# <center><strong>HDBSCAN Clustering of FLAIR Data</strong></center>
## <center><strong>Comprehensive Analysis</strong></center>
<br/>

<br/><center>This notebook provides a comprehensive view of the exploratory project to test whether HDBSCAN clustering can serve as a replacement for manual annotation. As a comprehensive analysis, all visualizations and analyses are displayed here. </center>
<br/> <br/> 
  


<hr style="height:1.5px;border-width:0;color:red;background-color:red">    

# <center><font color='red'>PART-1: Data visualization with the toy dataset training data</font></center>

Note that this project uses FLAIR #2, a publicly available dataset. A reference implementation (including a baseline model) is available in a GitHub repository. The baseline model uses a two-branch architecture that combines a U-Net with a pre-trained ResNet34 encoder and a U-TAE with a temporal self-attention encoder. This project experiments with an alternative data analysis technique, HDBSCAN, so this notebook does not use the baseline model. Part 1 uses the reference implementation, with minor modifications, to load and display the unanalyzed data. The novel portion of this project starts at Part 2. 

Links
Datapaper: https://arxiv.org/pdf/2305.14467.pdf
Dataset link: https://ignf.github.io/FLAIR/#FLAIR2
Reference Source code link: https://github.com/IGNF/FLAIR-2/tree/main
Challenge page: https://codalab.lisn.upsaclay.fr/competitions/13447

<p dir="auto">Citation required when using the FLAIR #2 dataset:</p>
<pre>@inproceedings{ign2023flair2,
      title={FLAIR: a Country-Scale Land Cover Semantic Segmentation Dataset From Multi-Source Optical Imagery},
      author={Anatol Garioud and Nicolas Gonthier and Loic Landrieu and Apolline De Wit and Marion Valette and Marc Poupée and Sébastien Giordano and Boris Wattrelos},
      year={2023},
      booktitle={Advances in Neural Information Processing Systems (NeurIPS) 2023},
      doi={https://doi.org/10.48550/arXiv.2310.13336},
}</pre>


<br/>

Handle all the generic imports

In [None]:
import yaml
import sys

import numpy as np
import matplotlib.pyplot as plt

from os.path import join
from pathlib import Path
from importlib import reload

Import code from or based upon the FLAIR-2 reference implementation 

In [None]:
FLAIR_path = join(Path.cwd().parents[0],'FLAIR-code/src')
if FLAIR_path not in sys.path:
    sys.path.append(FLAIR_path)

from data_display import (display_nomenclature,
                            display_samples, 
                            display_time_serie,
                            display_all_with_semantic_class, 
                            display_all, 
                            read_dates, 
                            filter_dates)
from load_data import load_data
from FusedDataset import FusedDataset
from calc_miou import calc_miou


## <font color='#90c149'>Nomenclatures</font>

<br/><hr>

The predefined semantic land-cover classes used in the FLAIR #2 dataset. <font color='#90c149'>Two nomenclatures are available</font>: 
<ul>
    <li>the <strong><font color='#90c149'>full nomenclature</font></strong> corresponds to the semantic classes used by experts in photo-interpretation to label the pixels of the ground-truth images.</li>
    <li>the <font color='#90c149'><b>main (baseline) nomenclature</b></font> is a simplified version of the full nomenclature. It regroups (into the class 'other') classes that are either strongly under-represented or irrelevant to this challenge.</li>
</ul>        
See the associated datapaper (https://arxiv.org/pdf/2305.14467.pdf) for additional details on these nomenclatures.<br/><br/>

<font color='#90c149'>Note:</font> For this project, the reduced nomenclature is used. <br/><hr><br/> 

In [None]:
display_nomenclature()

## <font color='#90c149'>Load Data</font>

<br/><hr>

Use reference code to create lists containing the paths to the input images (`images`) and supervision masks (`masks`) files of the dataset.<hr><br/>

In [None]:
config_path = "/app/FLAIR-HDBSCAN/flair-2-config.yml" 
with open(config_path, "r") as f:
    config = yaml.safe_load(f)

# Creation of the train, val, and test dictionaries with the data file paths
# Note: because this is the toy dataset, 100% of the data is assigned to training (val_percent=1).
# This is not best practice for machine learning, but with the stock reference loader any smaller
# fraction intermittently fails, because the random split can leave some semantic classes unrepresented.
# If the full FLAIR #2 dataset were employed, validation data should be kept separate.
# Due to size limitations on HDBSCAN, models are later trained on ~1% of the training data
# (by downsampling) and validated on 100% of the training data.
d_train, d_val, d_test = load_data(config, val_percent=1)

# Convert to torch Datasets
train_dataset = FusedDataset(dict_files=d_train, config=config)
valid_dataset = FusedDataset(dict_files=d_val, config=config)
test_dataset = FusedDataset(dict_files=d_test, config=config)

## <font color='#90c149'>Training Data</font>

<br/><hr>

Load the training data. <hr><br/>

In [None]:
train_aerial_images = d_train["PATH_IMG"]
train_sentinel_images = d_train["PATH_SP_DATA"]
train_labels = d_train["PATH_LABELS"]
train_sentinel_masks = d_train["PATH_SP_MASKS"] # Cloud masks
train_sentinel_products = d_train["PATH_SP_DATES"] # Needed to get the dates of the sentinel images
train_centroids = d_train["SP_COORDS"] # Position of the aerial image in the sentinel super area

In [None]:
len(train_aerial_images)

## <font color='#90c149'>Visualize Training Data</font>

<br/><hr>

Display some random samples of image and mask pairs. <font color='#90c149'>Re-run the cell below for a different image.</font> Here we also plot the Sentinel super area, super patch and patch. Even though the patch is not used in practice, it is shown to give an idea of what the Sentinel data looks like. The red rectangle shows the extent of the RGB image inside the Sentinel image. <hr><br/>

In [None]:
display_samples(train_aerial_images, train_labels, train_sentinel_images, train_centroids)

<br/><hr>
We can also plot a few images from the Sentinel time series along with their acquisition dates. Note that some dates may have extensive cloud coverage.

<hr><br/>

In [None]:
display_time_serie(train_sentinel_images, train_sentinel_masks, train_sentinel_products, nb_samples=3)

<br/><hr>

Next, let's have a closer look at a specific semantic class.<br/> By setting `semantic_class` to a class number (*e.g.*, `semantic_class`=1 for building or `semantic_class`=5 for water) we can visualize the images containing pixels of this specific class. (The full nomenclature is used.)<br/>
<hr><br/>

In [None]:
display_all_with_semantic_class(train_aerial_images, train_labels, semantic_class=1)

<br/><hr> 

We can directly display all images.<br/> <hr><br/>

In [None]:
display_all(train_aerial_images, train_labels)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-2: Naive Implementations </font></center>

<br/><hr>

In this section, I calculate performance metrics for two naive implementations:<br>
1) Randomly assign each pixel to one of the 13 semantic classes, with uniform distribution across the classes. 
2) Randomly assign each pixel to one of the 13 semantic classes, with the probability of being assigned to a class equal to the prevalence of that class. 

<br/> 
Note, all code imported from this point onward or in this notebook was developed specifically for this project. 
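The two baselines above can be sketched with NumPy. This is a minimal illustration, not the project's `naive_clustering` code: the function name `naive_predictions` is hypothetical, class indices are assumed to run from 0 to `n_classes - 1`, and the prevalence-matched baseline is implemented here by shuffling the true labels, which is only a guess prompted by the `permuted_classes` key used later in this notebook.

```python
import numpy as np

def naive_predictions(true_classes, n_classes, seed=0):
    """Return (uniform, prevalence_matched) random label arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    true_classes = np.asarray(true_classes)
    # Baseline 1: every class is equally likely for every pixel.
    uniform = rng.integers(0, n_classes, size=true_classes.shape)
    # Baseline 2 (assumed approach): shuffling the true labels keeps each
    # class's prevalence exactly while removing any link to the pixel itself.
    permuted = rng.permutation(true_classes.ravel()).reshape(true_classes.shape)
    return uniform, permuted

# Tiny example: class 0 is common, class 2 is rare.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
uniform, permuted = naive_predictions(labels, n_classes=3)
```

Drawing each pixel independently with probabilities equal to the class prevalences would be an equally valid reading of the second baseline; shuffling simply matches the prevalence exactly rather than in expectation.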

<hr><br/>

In [None]:
project_path = join(Path.cwd().parents[0],'code')
if project_path not in sys.path:
    sys.path.append(project_path)

In [None]:
import naive
from naive import naive_clustering
import display
from display import display_confusion

<br/><hr> 

Example of convenient code allowing reloading of a function. <br/> <hr><br/>

In [None]:
reload(display)
from display import display_confusion

<br/><hr> 

Generate confusion matrices for the naive implementations. <br/> <hr><br/>

In [None]:
predictions_dict = naive_clustering(train_dataset, config)

<br/><hr> 

Display the confusion matrix and MIOU metric for naive implementation #1, uniform distribution. <br/> <hr><br/>
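For readers unfamiliar with the metric, the sketch below shows one common way to compute mean intersection-over-union (mIoU) from a confusion matrix. It is an illustration under assumptions: the helper name `mean_iou` is hypothetical, rows are taken to be true classes and columns predicted classes, and the reference `calc_miou` function may differ in detail (for example, in which classes it averages over).

```python
import numpy as np

def mean_iou(confusion):
    """Mean IoU over the classes that appear in a square confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    # Union for a class = pixels truly in it + pixels predicted as it - overlap.
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - true_pos
    present = union > 0  # skip classes that never occur, avoiding 0/0
    return (true_pos[present] / union[present]).mean()

# Two-class example: IoU is 3/5 for class 0 and 5/7 for class 1.
cm = np.array([[3, 1],
               [1, 5]])
miou = mean_iou(cm)
```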

In [None]:
display_confusion(predictions_dict['true_classes'], predictions_dict['random_classes'], config, 'naive uniform')

<br/><hr> 

Display the confusion matrix and MIOU metric for naive implementation #2, distribution with matched prevalence. <br/> <hr><br/>

In [None]:
display_confusion(predictions_dict['true_classes'], predictions_dict['permuted_classes'], config, 'naive prevalence')

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-3: Visualizing the Aerial Data</font></center>

<br/><hr>

Visualizations of the aerial data from the training data. 

<hr><br/>

In [None]:
import display
reload(display)
from display import box_whisker_by_class
from display import class_distributions

import classifier
reload(classifier)
from classifier import extract_spectra

<hr>
Display the class distribution of pixels in the training data. 
<hr><br/>

In [None]:
train_data = class_distributions(train_dataset, config)

In [None]:
dataframe = extract_spectra(train_dataset, config, downsample=True, no_other=True, scale_by_intensity=False)

In [None]:
print(dataframe.shape)
dataframe.head()

<hr>
Display the distribution of channel values by semantic class.<br/> Setting the third input (`channel`) to a channel number (*e.g.*, Blue=1, Green=2, Red=3, NIR=4, ..., Elevation=15) displays a box-and-whisker plot for that channel. The box extends from the data's first quartile (Q1) to the third quartile (Q3), and the orange line marks the median. The interquartile range (IQR) is the distance between Q1 and Q3 (Q3 - Q1). Data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are classified as outliers (fliers) and are displayed individually as circles. Whiskers extend from the box in each direction to the farthest data point that is not an outlier. 
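The box-plot quantities described above can be computed directly. The following is a small sketch with a hypothetical helper `box_stats`; it uses NumPy's default (linear interpolation) percentiles and is not taken from the project's `box_whisker_by_class` function.

```python
import numpy as np

def box_stats(values):
    """Quartiles, whisker ends, and fliers using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = (values >= low_fence) & (values <= high_fence)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the farthest points that are still inside the fences.
        "whisker_low": values[inside].min(),
        "whisker_high": values[inside].max(),
        "fliers": values[~inside],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

In this example the value 50 lies above Q3 + 1.5*IQR, so it is reported as a flier and the upper whisker ends at 9.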
<hr><br/>

In [None]:
box_whisker_by_class(dataframe, config, 4)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-4: K-nearest neighbor analysis of aerial imagery</font></center>

<br/><hr>

Here, I train a k-nearest neighbor classifier on aerial imagery. The FLAIR #2 toy dataset is employed, which previously was split into training and test datasets. While best practice is typically to employ training, validation, and test datasets, when working with the toy dataset the random subsetting of the training data into training and validation caused issues as not all classes were always represented in all datasets. Additionally, HDBSCAN (run later) was found to have issues scaling to 1 million pixels. Therefore, rather than randomly assigning some of the 512x512 pixel patches to train and others to validation, we do not assign any patches to validation. Instead, we downsample the training data by a factor of 10 in each dimension before training, effectively using 1% of the training data to train. Validation is performed on the complete training data, of which 99% was not used for training. <br/> 
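As a rough illustration of the downsampling step, the sketch below keeps every tenth pixel along each axis of a single 512x512 patch. Strided slicing is an assumption made for illustration; the project's `classifier` module may downsample differently.

```python
import numpy as np

# Stand-in for one channel of a 512x512 aerial patch.
patch = np.arange(512 * 512).reshape(512, 512)

step = 10
# Keep every 10th row and every 10th column.
train_pixels = patch[::step, ::step]

# Fraction of the patch that ends up in the training subset (about 1%).
fraction = train_pixels.size / patch.size
```

With a step of 10, a 512x512 patch yields a 52x52 subset, which is why the training set is described as roughly 1% of the pixels.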

<hr><br/>

In [None]:
import classifier
reload(classifier)
from classifier import train_and_validate_model

In [None]:
%%time
knn_model_and_predictions = train_and_validate_model(train_dataset, config)

<br/><hr> 

Display the confusion matrix and MIOU metric for k-nearest neighbor classification on the aerial spectra. <br/> <hr><br/>

In [None]:
display_confusion(knn_model_and_predictions['true_classes'], knn_model_and_predictions['predicted_classes'], config, 'knn validation')

In [None]:
%reset_selective -f knn_model_and_predictions

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-5: Spectral Normalization</font></center>

<br/><hr>

Spectral analysis is sometimes improved by separating the spectral profile from the intensity. Normalize all the spectra and add an additional feature corresponding to the intensity. One reason this can be useful is that due to a combination of sun angle and/or off-nadir imaging, there might be shadows. Shadows will generally have a similar spectral shape but different intensity. 
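A minimal sketch of this idea follows. It assumes that "intensity" means the Euclidean norm of each pixel's spectrum, which is one common choice and not necessarily the definition used in this project; the helper name `normalize_spectra` is hypothetical, while `append_intensity` mirrors the keyword used in later cells.

```python
import numpy as np

def normalize_spectra(spectra, append_intensity=True, eps=1e-12):
    """Split each row into a unit-length spectral shape and a scalar intensity."""
    spectra = np.asarray(spectra, dtype=float)
    # Assumed definition: intensity is the L2 norm of the spectrum.
    intensity = np.linalg.norm(spectra, axis=1, keepdims=True)
    shape = spectra / np.maximum(intensity, eps)  # eps guards all-zero pixels
    if append_intensity:
        return np.hstack([shape, intensity])
    return shape

# A sunlit pixel and the same surface in shadow (half as bright).
sunlit = np.array([[30.0, 40.0]])
shadow = 0.5 * sunlit
features = normalize_spectra(np.vstack([sunlit, shadow]))
```

After normalization the two pixels share the same spectral shape and differ only in the appended intensity column, which is the behavior the shadow argument relies on.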

<hr><br/>

In [None]:
import display
reload(display)
from display import display_normalization_scatter

<br/><hr> 

The two plots below show scatter plots of the distribution of values of two spectral components, without and with spectral normalization. 
The input channels (`channel1`, `channel2`) can be varied using the channel numbers (*e.g.*, Blue=1, Green=2, Red=3, NIR=4, ..., Elevation=15). 
The raw spectra show a strong correlation between the red and green channels, but comparison with the normalized spectra indicates that much of that correlation is simply a shared dependence upon intensity. When the spectra are normalized, additional grouping and an altered distribution are seen. <br/> <hr><br/>

In [None]:
display_normalization_scatter(train_dataset, config, channel1=2, channel2=3)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-6: K-nearest neighbor analysis of normalized aerial imagery</font></center>

<br/><hr>
This section shows the effects of spectral normalization on the k-nearest neighbor analysis. 
<br/> 

<hr><br/>

In [None]:
%%time
knn_normalized = train_and_validate_model(train_dataset, config, scale_by_intensity=True, append_intensity=True)

<br/><hr> 

Display the confusion matrix and MIOU metric for k-nearest neighbor classification on the normalized aerial spectra with appended intensity. <br/> <hr><br/>

In [None]:
display_confusion(knn_normalized['true_classes'], knn_normalized['predicted_classes'], config, 'knn normalized spectra with intensity validation')

In [None]:
%reset_selective -f knn_normalized

<br/><hr> 

These results indicate that normalizing the spectra and appending the intensity do not improve the classification. One possibility is that appending the intensity is the issue, so the next run omits it. <br/> <hr><br/>

In [None]:
%%time
knn_normalized_no_intensity = train_and_validate_model(train_dataset, config, scale_by_intensity=True, append_intensity=False)

<br/><hr> 

Display the confusion matrix and MIOU metric for k-nearest neighbor classification on the normalized aerial spectra without appended intensity. <br/> <hr><br/>

In [None]:
display_confusion(knn_normalized_no_intensity['true_classes'], knn_normalized_no_intensity['predicted_classes'], config, 'knn normalized spectra no intensity validation')

In [None]:
%reset_selective -f knn_normalized_no_intensity

<br/><hr> 

Neither of these showed improvement over the baseline k-nearest neighbor classifier. <br/> <hr><br/>

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-7: HDBSCAN analysis of raw aerial imagery</font></center>

<br/><hr>
Here, HDBSCAN is applied to the raw aerial imagery to determine clusters. Since HDBSCAN is a density-aware clustering algorithm, some samples will be classified as outliers and not assigned to a cluster. To ensure that the HDBSCAN analysis does not simply fit only the easiest-to-classify samples, and to offer a fair comparison to standard k-nearest neighbors, processing with HDBSCAN needed to be extended to assign a cluster or class label to all samples. Therefore, analogously to the prior k-nearest neighbor analysis, a k-nearest neighbor model was trained using some pixels and their labels and then applied to all the pixels. For standard k-nearest neighbors, the model was trained using the ground truth classes and the 1% downsampled training data. For HDBSCAN, the most commonly represented class within each cluster was designated as that cluster's class label, and these labels were then used to train the k-nearest neighbor model. Additionally, only pixels assigned to a cluster were used to train the k-nearest neighbor model. 
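The cluster-to-class labeling step can be sketched as a majority vote. This is an illustration only: the helper `majority_class_per_cluster` is hypothetical and is not the code behind `train_and_validate_model`. It takes cluster labels as input and assumes the usual HDBSCAN convention that noise points are labeled -1.

```python
import numpy as np

def majority_class_per_cluster(cluster_labels, true_classes, noise_label=-1):
    """Map each cluster id to the most common ground-truth class inside it."""
    cluster_labels = np.asarray(cluster_labels)
    true_classes = np.asarray(true_classes)
    mapping = {}
    for cluster in np.unique(cluster_labels):
        if cluster == noise_label:
            continue  # outliers are not assigned to any cluster
        members = true_classes[cluster_labels == cluster]
        classes, counts = np.unique(members, return_counts=True)
        mapping[int(cluster)] = int(classes[np.argmax(counts)])
    return mapping

# Toy data: two clusters plus two noise points (-1).
clusters = np.array([0, 0, 0, 1, 1, -1, 1, -1])
classes = np.array([5, 5, 2, 1, 1, 5, 3, 2])
mapping = majority_class_per_cluster(clusters, classes)

# Only clustered pixels, relabeled with their cluster's majority class,
# would be passed on to train the k-nearest neighbor model.
clustered = clusters != -1
pseudo_labels = np.array([mapping[int(c)] for c in clusters[clustered]])
```

The resulting `pseudo_labels` replace the ground-truth labels for the clustered pixels, so a k-nearest neighbor model trained on them can then assign a class to every pixel, including those HDBSCAN left as outliers.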
<br/> 

<hr><br/>

In [None]:
%%time
hdbscan_model_and_predictions = train_and_validate_model(train_dataset, config, use_hdbscan=True)

<br/><hr> 

As hypothesized, HDBSCAN found a workable number of clusters where clusters tend to be sub-classes of the FLAIR semantic classes. The accuracy of mapping clusters to individual semantic classes is seen to be quite high. 
<br/> <hr><br/>

In [None]:
display_confusion(hdbscan_model_and_predictions['true_classes'], hdbscan_model_and_predictions['predicted_classes'], config, 'HDBSCAN Validation cluster size 10')

In [None]:
%reset_selective -f hdbscan_model_and_predictions

<br/><hr> 

Unsurprisingly, training the k-nearest neighbor model with fewer pixels and predicted labels rather than ground truth labels results in a lower metric than simply using k-nearest neighbors directly. However, k-nearest neighbors is a supervised machine learning method requiring expensive manual annotation. HDBSCAN offers excellent performance when considering that it is an unsupervised machine learning technique. <br/> <hr><br/>

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-8: HDBSCAN analysis of normalized aerial imagery</font></center>

<br/><hr>
This section shows the effects of spectral normalization on HDBSCAN analysis. While spectral normalization was not beneficial to k-nearest neighbor classification, it might help HDBSCAN clustering, given the changes observed when visualizing the distributions after spectral normalization. 
<br/> 

<hr><br/>

In [None]:
%%time
hdbscan_normalized = train_and_validate_model(train_dataset, config, use_hdbscan=True, scale_by_intensity=True, append_intensity=True)

In [None]:
display_confusion(hdbscan_normalized['true_classes'], hdbscan_normalized['predicted_classes'], config, 'HDBSCAN with Normalized Spectra')

In [None]:
%reset_selective -f hdbscan_normalized

<br/><hr> 

Performance is worse than that of the baseline HDBSCAN analysis. <br/> <hr><br/>

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-9: Analysis of robust scaled aerial imagery</font></center>

<br/><hr>
This section shows the effects of robust scaling on HDBSCAN and KNN analysis. Robust scaling is a data preprocessing step that normalizes the data by centering each component on its median and scaling it by its interquartile range. 
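A minimal NumPy sketch of robust scaling as described above is shown here. The helper `robust_scale` is hypothetical and is not necessarily how the `robust_scale=True` option is implemented in this project; scikit-learn's `RobustScaler` offers comparable behavior with its default settings.

```python
import numpy as np

def robust_scale(features):
    """Center each column on its median and divide by its interquartile range."""
    features = np.asarray(features, dtype=float)
    median = np.median(features, axis=0)
    q1, q3 = np.percentile(features, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0  # leave constant columns unscaled instead of dividing by 0
    return (features - median) / iqr

# The first column contains one extreme value (1000).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0],
                 [1000.0, 50.0]])
scaled = robust_scale(data)
```

Because the median and IQR are barely affected by the extreme value, the four ordinary rows of the first column land on the same scale as the second column, whereas mean and standard deviation scaling would compress them toward a single value.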
<br/> 

<hr><br/>

In [None]:
%%time
knn_robust = train_and_validate_model(train_dataset, config, use_hdbscan=False, robust_scale=True)

In [None]:
display_confusion(knn_robust['true_classes'], knn_robust['predicted_classes'], config, 'KNN with robust scaling')

In [None]:
%reset_selective -f knn_robust

In [None]:
%%time
hdbscan_robust = train_and_validate_model(train_dataset, config, use_hdbscan=True, robust_scale=True)

In [None]:
display_confusion(hdbscan_robust['true_classes'], hdbscan_robust['predicted_classes'], config, 'HDBSCAN with robust scaling')

In [None]:
%reset_selective -f hdbscan_robust

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-10: Analysis of robust scaled aerial imagery with spectral normalization</font></center>

<br/><hr>
This section shows the effects of combining robust scaling and spectral normalization on HDBSCAN and KNN analysis. 
<br/> 

<hr><br/>

In [None]:
%%time
knn_robust = train_and_validate_model(train_dataset, config, use_hdbscan=False, robust_scale=True, scale_by_intensity=True, append_intensity=True)

In [None]:
display_confusion(knn_robust['true_classes'], knn_robust['predicted_classes'], config, 'KNN with robust scaling and spectral normalization')

In [None]:
%reset_selective -f knn_robust

In [None]:
%%time
hdbscan_robust = train_and_validate_model(train_dataset, config, use_hdbscan=True, robust_scale=True, scale_by_intensity=True, append_intensity=True)

In [None]:
display_confusion(hdbscan_robust['true_classes'], hdbscan_robust['predicted_classes'], config, 'HDBSCAN with robust scaling and spectral normalization')

In [None]:
%reset_selective -f hdbscan_robust

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-11: Data Fusion - Satellite & Aerial Imagery</font></center>

<br/><hr>
This section shows the effects of data fusion of satellite imagery with aerial imagery.
<br/> 

<hr><br/>

In [None]:
%%time
knn_satellite = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=False, robust_scale=False, scale_by_intensity=False, append_intensity=False)

In [None]:
display_confusion(knn_satellite['true_classes'], knn_satellite['predicted_classes'], config, 'KNN Aerial and Satellite')

In [None]:
%reset_selective -f knn_satellite

In [None]:
%%time
hdbscan_satellite = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=True, robust_scale=False, scale_by_intensity=False, append_intensity=False)

In [None]:
display_confusion(hdbscan_satellite['true_classes'], hdbscan_satellite['predicted_classes'], config, 'HDBSCAN Aerial and Satellite')

In [None]:
%reset_selective -f hdbscan_satellite

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-12: Data Fusion - Spectral Normalization</font></center>

<br/><hr>
This section shows the effects of data fusion of satellite imagery with aerial imagery, combined with spectral normalization.
<br/> 

<hr><br/>

In [None]:
%%time
knn_fusion_normalized = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=False, robust_scale=False, scale_by_intensity=True, append_intensity=True)

In [None]:
display_confusion(knn_fusion_normalized['true_classes'], knn_fusion_normalized['predicted_classes'], config, 'KNN Fusion w/ Spectral Normalization')

In [None]:
%reset_selective -f knn_fusion_normalized

In [None]:
%%time
hdbscan_fusion_normalized = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=True, robust_scale=False, scale_by_intensity=True, append_intensity=True)

In [None]:
display_confusion(hdbscan_fusion_normalized['true_classes'], hdbscan_fusion_normalized['predicted_classes'], config, 'HDBSCAN Fusion w/ Spectral Normalization')

In [None]:
%reset_selective -f hdbscan_fusion_normalized

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-13: Data Fusion - Robust</font></center>

<br/><hr>
This section shows the effects of data fusion of satellite imagery with aerial imagery, combined with robust scaling.
<br/> 

<hr><br/>

In [None]:
%%time
knn_fusion_robust = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=False, robust_scale=True, scale_by_intensity=False, append_intensity=False)

In [None]:
display_confusion(knn_fusion_robust['true_classes'], knn_fusion_robust['predicted_classes'], config, 'KNN Fusion Robust')

In [None]:
%reset_selective -f knn_fusion_robust

In [None]:
%%time
hdbscan_fusion_robust = train_and_validate_model(train_dataset, config, use_satellite=True, use_hdbscan=True, robust_scale=True, scale_by_intensity=False, append_intensity=False)

In [None]:
display_confusion(hdbscan_fusion_robust['true_classes'], hdbscan_fusion_robust['predicted_classes'], config, 'HDBSCAN Fusion Robust')

In [None]:
%reset_selective -f hdbscan_fusion_robust