# <center><strong>HDBSCAN Clustering of FLAIR Data</strong></center>
## <center><strong>Complete Analysis</strong></center>
<br/>

<br/><center>This notebook provides a view of the exploratory project to test whether HDBSCAN clustering can serve as a replacement for manual annotation. </center>
<br/> <br/>
  

<hr style="height:1.5px;border-width:0;color:red;background-color:red">    

# <font color='red'>PART-0: Setting up Google Colab</font>

The section below sets up Google Colab by cloning the GitHub repository and downloading the data.

<br/>

In [None]:
! git clone https://github.com/StephenMAnthony/FLAIR-HDBSCAN.git

In [None]:
!wget https://storage.gra.cloud.ovh.net/v1/AUTH_366279ce616242ebb14161b7991a8461/defi-ia/flair_data_2/flair_2_toy_dataset.zip

In [None]:
!unzip flair_2_toy_dataset.zip -d data

In [None]:
%cd /content/FLAIR-HDBSCAN/

In [None]:
!git checkout development

In [None]:
# Install python 3.11
# !sudo apt-get update -y
# !sudo apt-get install python3.11

In [None]:
# Change default python3 to 3.11
# !sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
# !sudo apt install python3-pip

# Confirm version
# !python3 --version
# Python 3.11.9

In [None]:
%pip install -r requirements.txt

<hr style="height:1.5px;border-width:0;color:red;background-color:red">    

# <font color='red'>PART-1: Data visualization with the toy dataset training data</font>

Note that this project uses FLAIR #2, a publicly available dataset. A reference implementation (including a baseline model) is available in a GitHub repository. The baseline model uses a two-branch architecture integrating a U-Net with a pre-trained ResNet34 encoder and a U-TAE encompassing a temporal self-attention encoder. This project experiments with an alternative data analysis technique, HDBSCAN, so this notebook does not use the baseline model. Part 1 uses the reference implementation with minor modifications to load and display the unanalyzed data. The novel portion of this project starts at Part 2.

Links:
<ul>
    <li>Data paper: https://arxiv.org/pdf/2305.14467.pdf</li>
    <li>Dataset: https://ignf.github.io/FLAIR/#FLAIR2</li>
    <li>Reference source code: https://github.com/IGNF/FLAIR-2/tree/main</li>
    <li>Challenge page: https://codalab.lisn.upsaclay.fr/competitions/13447</li>
</ul>

<p dir="auto">Citation required when using the FLAIR #2 dataset:</p>
<pre>@inproceedings{ign2023flair2,
      title={FLAIR: a Country-Scale Land Cover Semantic Segmentation Dataset From Multi-Source Optical Imagery},
      author={Anatol Garioud and Nicolas Gonthier and Loic Landrieu and Apolline De Wit and Marion Valette and Marc Poupée and Sébastien Giordano and Boris Wattrelos},
      year={2023},
      booktitle={Advances in Neural Information Processing Systems (NeurIPS) 2023},
      doi={https://doi.org/10.48550/arXiv.2310.13336},
}</pre>


<br/>

In [None]:
import yaml
import sys

import numpy as np
import matplotlib.pyplot as plt

from os.path import join
from pathlib import Path
from importlib import reload

In [None]:
%cd /content/FLAIR-HDBSCAN/

In [None]:
FLAIR_path = join(Path.cwd(),'FLAIR-code/src')
if FLAIR_path not in sys.path:
    sys.path.append(FLAIR_path)

In [None]:
from data_display import (display_nomenclature,
                            display_samples,
                            display_time_serie,
                            display_all_with_semantic_class,
                            display_all,
                            read_dates,
                            filter_dates)
from load_data import load_data
from FusedDataset import FusedDataset
from calc_miou import calc_miou

## <font color='#90c149'>Nomenclatures</font>

<br/><hr>

The predefined semantic land-cover classes used in the FLAIR #2 dataset. <font color='#90c149'>Two nomenclatures are available </font> :
<ul>
    <li>the <strong><font color='#90c149'>full nomenclature</font></strong> corresponds to the semantic classes used by experts in photo-interpretation to label the pixels of the ground-truth images.</li>
    <li>the <font color='#90c149'><b>main (baseline) nomenclature</b></font> is a simplified version of the full nomenclature. It merges classes that are either strongly under-represented or irrelevant to this challenge into a single class, 'other'.</li>
</ul>        
See the associated data paper (https://arxiv.org/pdf/2305.14467.pdf) for additional details on these nomenclatures.<br/><br/>

<font color='#90c149'>Note:</font> For this project, the reduced (main) nomenclature is used. <br/><hr><br/>

In [None]:
display_nomenclature()

## <font color='#90c149'>Load Data</font>

<br/><hr>

Use reference code to create lists containing the paths to the input images (`images`) and supervision masks (`masks`) files of the dataset.<hr><br/>

In [None]:
config_path = "/content/FLAIR-HDBSCAN/flair-2-config-colab.yml"
with open(config_path, "r") as f:
    config = yaml.safe_load(f)

# Create the train, val, and test dictionaries with the data file paths.
# Note that because we use the toy dataset, 100% of the data is assigned to training.
# This is not best practice for machine learning, but with the toy dataset and the stock reference loader,
# assigning less than 100% causes intermittent issues: the random selection can leave some semantic classes
# unrepresented. If the full FLAIR #2 dataset were employed, validation data should be kept separate.
# Due to size limitations of HDBSCAN, actual training uses a downsample of ~1% of the training data,
# and validation uses 100% of the training data.
d_train, d_val, d_test = load_data(config, val_percent=1)

# Create a torch dataset of the training data
train_dataset = FusedDataset(dict_files=d_train, config=config)

## <font color='#90c149'>Training Data</font>

<br/><hr>

Load an alternate representation of the training data. <hr><br/>

In [None]:
train_aerial_images = d_train["PATH_IMG"]
train_sentinel_images = d_train["PATH_SP_DATA"]
train_labels = d_train["PATH_LABELS"]
train_sentinel_masks = d_train["PATH_SP_MASKS"] # Cloud masks
train_sentinel_products = d_train["PATH_SP_DATES"] # Needed to get the dates of the sentinel images
train_centroids = d_train["SP_COORDS"] # Position of the aerial image in the sentinel super area

## <font color='#90c149'>Visualize Training Data</font>

<br/><hr>

Display some random samples of image and mask pairs. <font color='#90c149'>Re-run the cell below for a different image.</font> Here we also plot the Sentinel super area, super patch, and patch. Even though the patch is not used in practice, it is shown to give an idea of what the Sentinel data looks like. The red rectangle shows the extent of the RGB image inside the Sentinel image. <hr><br/>

In [None]:
display_samples(train_aerial_images, train_labels, train_sentinel_images, train_centroids)

<br/><hr>
We can also plot a few images from the Sentinel time series along with their acquisition dates. Note that some dates may have extensive cloud coverage.<BR>
<font color='#90c149'>Re-run the cell below to display Sentinel images for different patches.</font>

<hr><br/>

In [None]:
display_time_serie(train_sentinel_images, train_sentinel_masks, train_sentinel_products, nb_samples=3)

<br/><hr>

Next, let's take a closer look at a specific semantic class.<br/>
By setting `semantic_class` to a class number (*e.g.*, `semantic_class`=1 for building or `semantic_class`=5 for water) we can visualize the images containing pixels of this specific class. (The full nomenclature is used here.)<br/>
<font color='#90c149'>Re-run the cell below with different `semantic_class` values as desired.</font>
<hr><br/>

In [None]:
display_all_with_semantic_class(train_aerial_images, train_labels, semantic_class=1)

<br/><hr>

We can directly display all images.<br/> <hr><br/>

In [None]:
display_all(train_aerial_images, train_labels)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-2: Initialize the Exploratory Project </font></center>

All code imported from this point onward was developed specifically for this exploratory project.

<hr><br/>

In [None]:
project_path = join(Path.cwd(),'code')
if project_path not in sys.path:
    sys.path.append(project_path)

In [None]:
import naive
reload(naive)
from naive import naive_clustering

import display
reload(display)
from display import box_whisker_by_class
from display import class_distributions
from display import display_normalization_scatter
from display import display_confusion
from display import plot_timing
from display import compare_labels
from display import display_pixel_spectrum
from display import dict_to_dataframe

import classifier
reload(classifier)
from classifier import extract_spectra
from classifier import train_and_validate_model
from classifier import apply_model
from classifier import predict_pixels

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-3: Visualizing the Data</font></center>

<br/><hr>

Visualizations of the training data.

<hr><br/>

<hr>
Display the class distribution of pixels in the training data.
<hr><br/>

In [None]:
train_labels = class_distributions(train_dataset, config)

<hr>
Display the shape of the dataframe and the first 5 rows.
<hr><br/>

In [None]:
dataframe = extract_spectra(train_dataset, config, downsample=True, no_other=True, scale_by_intensity=False)
print(f"The dataframe has shape {dataframe.shape}, and the first 5 rows look like below:")
dataframe.head()

<hr>
Display the distribution of channel values by semantic class.<br/> Setting the third input (`channel`) to a channel number (*e.g.*, Blue=1, Green=2, Red=3, NIR=4, ..., Elevation=15) displays a box and whisker plot for that channel. The box extends from the data's first quartile (Q1) to the third quartile (Q3), with an orange line at the median. The interquartile range (IQR) is the distance between Q1 and Q3 (Q3 - Q1). Data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are classified as outliers, or fliers; such points are displayed individually as circles. Whiskers extend from the box in each direction to the farthest data point that is not an outlier. <BR>
<font color='#90c149'>Re-run the cell below with a different value of the third input, `channel`, to see the results for a different channel, band, or elevation.</font>
<hr><br/>
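
The box and whisker rules described above can be reproduced numerically. The snippet below is a minimal sketch using NumPy with made-up values; `box_whisker_stats` is a hypothetical helper written for illustration and is not part of this project's `display` module.

```python
import numpy as np

def box_whisker_stats(values):
    """Compute the quantities drawn in a standard box and whisker plot."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points beyond them are treated as outliers (fliers).
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    is_flier = (values < lower_fence) | (values > upper_fence)
    inliers = values[~is_flier]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the farthest points that are not fliers.
        "whisker_low": inliers.min(),
        "whisker_high": inliers.max(),
        "fliers": values[is_flier],
    }

# Illustrative values only (not taken from the FLAIR data).
stats = box_whisker_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this toy example the value 50 lies above Q3 + 1.5*IQR, so it is reported as a flier and the upper whisker stops at 9.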

In [None]:
box_whisker_by_class(dataframe, config, 4)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-4: Spectral Normalization</font></center>

<br/><hr>

Spectral analysis is sometimes improved by separating the spectral profile from the intensity. Normalize all the spectra and add an additional feature corresponding to the intensity. One reason this can be useful is that due to a combination of sun angle and/or off-nadir imaging, there might be shadows. Shadows will generally have a similar spectral shape but different intensity.

<hr><br/>
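
To make the idea concrete, the snippet below is a minimal sketch of spectral normalization. It assumes that intensity is defined as the sum over channels; the project's `extract_spectra` may use a different definition (for example a vector norm), and `normalize_spectra` is a hypothetical helper, not a function from this repository.

```python
import numpy as np

def normalize_spectra(spectra, eps=1e-12):
    """Split each spectrum (one per row) into a unit-sum profile and a scalar intensity.

    Assumption: intensity is the sum of the channel values.
    """
    spectra = np.asarray(spectra, dtype=float)
    intensity = spectra.sum(axis=1, keepdims=True)
    # Guard against division by zero for completely dark pixels.
    profile = spectra / np.maximum(intensity, eps)
    return profile, intensity.ravel()

# A sunlit pixel and the same surface in shadow (same shape, 4x dimmer).
sunlit = np.array([0.40, 0.30, 0.20, 0.10])
shadow = 0.25 * sunlit
profile, intensity = normalize_spectra(np.stack([sunlit, shadow]))

print(profile)    # both rows share the same spectral profile
print(intensity)  # the difference is carried by the intensity feature alone
```

After normalization the two pixels have identical profiles, so a classifier working on the profile sees them as the same material, while the intensity can still be appended as a separate feature if desired.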

<br/><hr>

The two plots below show scatter plots of the distribution of values of two spectral components, without and with spectral normalization.
The inputs `channel1` and `channel2` can be varied using the channel numbers (*e.g.*, Blue=1, Green=2, Red=3, NIR=4, ..., Elevation=15).
The raw spectra show a strong correlation between the red and green channels, but comparison with the normalized spectra indicates that much of that correlation is simply a similar dependence upon intensity. When the spectra are normalized, additional grouping and an altered distribution are seen. <br/>
<font color='#90c149'>Re-run the cell below with different values of the (`channel1`) and (`channel2`) inputs to see the effect of spectral normalization on different pairs of channels or bands.</font>
<hr><br/>

In [None]:
display_normalization_scatter(train_dataset, config, channel1=2, channel2=3)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-5: Models Training and Validation</font></center>

<br/><hr>

The FLAIR #2 toy dataset is employed, which previously was split into training and test datasets. While best practice is typically to employ training, validation, and test datasets, when working with the toy dataset the random subsetting of the training data into training and validation caused issues as not all classes were always represented in all datasets. Additionally, HDBSCAN (run later) was found to have issues scaling to 1 million pixels. Therefore, rather than randomly assigning some of the 512x512 pixel patches to train and others to validation, we do not assign any patches to validation. Instead, we downsample the training data by a factor of 10 in each dimension before training, effectively using 1% of the training data to train. Validation is performed on the complete training data, of which 99% was not used for training. <br/>
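
The arithmetic behind the "1% of the training data" figure can be checked with a short sketch. It assumes downsampling is done by simple strided slicing of a 512x512 patch; the project's actual implementation may differ (for example in the starting offset).

```python
import numpy as np

# A stand-in for one 512x512 patch; the values are just pixel indices.
patch = np.arange(512 * 512).reshape(512, 512)

factor = 10
# Keep every 10th pixel along each dimension (assumed strided slicing).
downsampled = patch[::factor, ::factor]

fraction = downsampled.size / patch.size
print(downsampled.shape)  # (52, 52)
print(f"{fraction:.2%} of the pixels are kept")
```

Keeping every 10th row and every 10th column retains 52 x 52 = 2704 of the 262144 pixels in a patch, or roughly 1%, which leaves about 99% of the pixels unseen during training.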

Various model configurations can be developed here, with the exact model specified by numerous Boolean parameters. <br>

If (`use_hdbscan`) is (`False`), the KNN model will be trained using the manually annotated class labels. If (`use_hdbscan`) is (`True`), autonomously generated labels will be found using HDBSCAN clustering and these labels will be used to train the KNN model. <br>

If (`use_satellite`) is (`False`), only the aerial data will be used. If (`use_satellite`) is (`True`), the Sentinel-2 satellite spectra will be appended. <br>

If (`scale_by_intensity`) is (`False`), spectral normalization is not employed. If (`scale_by_intensity`) is (`True`), spectral normalization is employed. <br>

If (`append_intensity`) is (`False`), nothing happens. If (`append_intensity`) is (`True`) AND (`scale_by_intensity`) is (`True`), the aerial intensity is appended as an additional feature. <br>

If (`robust_scale`) is (`False`), RobustScaler() is not employed. If (`robust_scale`) is (`True`), RobustScaler() is employed. <br>

If (`check_reliability`) is (`False`), nothing happens. If (`check_reliability`) is (`True`), additional calculations are performed and some reliability metrics are displayed. <br>

The recommended models (those with the best mIoU scores), both for when (`use_hdbscan`) is (`False`) and for when (`use_hdbscan`) is (`True`), are set to run below by default. The settings are (`use_satellite`) is (`True`), (`scale_by_intensity`) is (`True`), (`append_intensity`) is (`False`), and (`robust_scale`) is (`False`).


<font color='#ff0000'>Warning: The steps in this section can be very computationally expensive. Configuration should allow 8 GB of RAM; with a normal CPU each cell may take 10-20 minutes to complete.</font>

<font color='#90c149'>Re-run the two cells below with different value(s) of the inputs specified above to try different combinations of settings as desired.</font>

<hr><br/>

In [None]:
%%time
knn_model = train_and_validate_model(train_dataset, config, use_hdbscan=False, use_satellite=True, scale_by_intensity=True, append_intensity=False, robust_scale=False, check_reliability=True)

In [None]:
%%time
hdbscan_model = train_and_validate_model(train_dataset, config, use_hdbscan=True, use_satellite=True, scale_by_intensity=True, append_intensity=False, robust_scale=False, check_reliability=True)

<br/><hr>

Display the confusion matrices and mIoU metrics for the models run above. <br/> <hr><br/>
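
For reference, the snippet below is a minimal sketch of how a confusion matrix and the mean intersection over union can be computed with NumPy. The helpers are hypothetical and written for illustration; the reference `calc_miou` may differ in details such as which classes are excluded from the mean.

```python
import numpy as np

def confusion_matrix(true, pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    flat_index = true * n_classes + pred
    counts = np.bincount(flat_index, minlength=n_classes ** 2)
    return counts.reshape(n_classes, n_classes)

def mean_iou(cm):
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    true_positive = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - true_positive
    present = union > 0
    return (true_positive[present] / union[present]).mean()

# Toy labels for six pixels and three classes (illustrative only).
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(true, pred, n_classes=3)
miou = mean_iou(cm)
print(cm)
print(miou)
```

Here the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.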

In [None]:
display_confusion(knn_model, config);

In [None]:
display_confusion(hdbscan_model, config);

<hr>

The first row displays the aerial data, the manually annotated class labels (treated as ground truth), and an overlay. 
The second row highlights the pixels where the KNN model trained on the manually annotated labels misidentifies the class. 
The third row highlights the pixels where the KNN model trained on the HDBSCAN-generated labels misidentifies the class. 

<font color='#90c149'>Re-run the cell below with a different value of the input (`index`) to see different patches as desired.</font>

<hr><br/>

In [None]:
true_labels = compare_labels(train_dataset, knn_model, hdbscan_model, config, index=4)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-6: Computational Time</font></center>

<br/><hr>
This section generates figures of the computational time. The computational time must be manually recorded from runs in Part 5. The prepopulated values correspond to the times obtained running in a Docker container configured to use 8 GB of RAM and 8 cores on a Ryzen 7 1700.
<br/>

<hr><br/>

In [None]:
times = {
        'KNN alone': (10.0 + 11/60, 10.0 + 18/60, 10.0 + 3/60, 10.0 + 16/60, 10.0 + 30/60, 11.0 + 3/60),
        'HDBSCAN + KNN': (11.0 + 9/60, 11.0 + 6/60, 10.0 + 57/60, 11.0 + 4/60, 11.0 + 10/60, 11.0 + 44/60),
    }
plot_timing(times)

In [None]:
times = {
        'KNN alone': (12.0 + 42/60, 12.0 + 7/60, 11.0 + 22/60, 13.0 + 55/60, 9*60.0 + 15.0, 9*60.0 + 19),
        'HDBSCAN + KNN': (13.0 + 17/60, 13.0 + 9/60, 12.0 + 53/60, 15.0 + 6/60, 4*60.0 + 25.0, 4*60.0 + 57.0),
    }
plot_timing(times, use_satellite=True)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-7: Trivial Algorithms for Comparison </font></center>

<br/><hr>

In this section, I calculate performance metrics for results obtainable through random chance:<br>
1) Randomly assign each pixel to one of the semantic classes, with uniform distribution across the classes.
2) Randomly assign each pixel to one of the semantic classes, with the probability of being assigned to a class equal to the prevalence of that class.

<hr><br/>
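
The two random baselines can be sketched with NumPy as shown below. This is an illustration on synthetic labels, not the implementation in `naive_clustering`; in particular, using a random permutation of the true labels to match class prevalence is an assumption, and the class probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 4

# Synthetic "ground truth" with unbalanced classes (illustrative prevalences).
true = rng.choice(n_classes, size=10_000, p=[0.6, 0.25, 0.1, 0.05])

# Baseline 1: every class is equally likely for every pixel.
uniform_pred = rng.integers(0, n_classes, size=true.size)

# Baseline 2: shuffle the true labels, which preserves class prevalence exactly.
permuted_pred = rng.permutation(true)

uniform_acc = (uniform_pred == true).mean()
permuted_acc = (permuted_pred == true).mean()
print(f"uniform accuracy:  {uniform_acc:.3f}")
print(f"permuted accuracy: {permuted_acc:.3f}")
```

With K classes the uniform baseline is expected to match about 1/K of the pixels, while the prevalence-matched baseline is expected to match a fraction close to the sum of the squared class prevalences, so it scores higher whenever the classes are unbalanced.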

In [None]:
predictions_dict = naive_clustering(train_dataset, config, hdbscan_model)

<br/><hr>

Display the confusion matrix and mIoU metric for trivial implementation #1, uniform distribution. <br/> <hr><br/>

In [None]:
display_confusion(predictions_dict['random_dict'], config);

<br/><hr>

Display the confusion matrix and mIoU metric for trivial implementation #2, distribution with matched prevalence. <br/> <hr><br/>

In [None]:
display_confusion(predictions_dict['permuted_dict'], config);

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-8: Visualizing the Testing Data</font></center>

<br/><hr>

Visualizations of the testing data.

<hr><br/>

In [None]:
# Create a torch dataset of the testing data
test_dataset = FusedDataset(dict_files=d_test, config=config)

<hr>
Display the class distribution of pixels in the testing data.
<hr><br/>

In [None]:
test_labels = class_distributions(test_dataset, config)

<hr>
Display the shape of the dataframe and the first 5 rows.
<hr><br/>

In [None]:
dataframe = extract_spectra(test_dataset, config, downsample=True, no_other=True, scale_by_intensity=False)
print(f"The dataframe has shape {dataframe.shape}, and the first 5 rows look like below:")
dataframe.head()

<hr>
Display the distribution of channel values by semantic class.<br/> Setting the third input (`channel`) to a channel number (*e.g.*, Blue=1, Green=2, Red=3, NIR=4, ..., Elevation=15) displays a box and whisker plot for that channel. The box extends from the data's first quartile (Q1) to the third quartile (Q3), with an orange line at the median. The interquartile range (IQR) is the distance between Q1 and Q3 (Q3 - Q1). Data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are classified as outliers, or fliers; such points are displayed individually as circles. Whiskers extend from the box in each direction to the farthest data point that is not an outlier. <BR>
<font color='#90c149'>Re-run the cell below with a different value of the third input, `channel`, to see the results for a different channel, band, or elevation.</font>
<hr><br/>

In [None]:
box_whisker_by_class(dataframe, config, 4)

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-9: Applying Models to Testing Data</font></center>

<br/><hr>

Applying the best KNN and HDBSCAN models to the testing dataset.

<hr><br/>

In [None]:
%%time
knn_test = apply_model(test_dataset, knn_model)

In [None]:
display_confusion(knn_test, config);

In [None]:
%%time
hdbscan_test = apply_model(test_dataset, hdbscan_model)

In [None]:
display_confusion(hdbscan_test, config);

<br><br>
<hr style="height:3px;border-width:0;color:red;background-color:red">   

# <center><font color='red'>PART-10: Application - Predicting on Input Data</font></center>

<br/><hr>

This section allows visualization of the data for any pixel in the training or testing datasets. The class can be predicted either for that individual pixel or for manually entered values. 

<hr><br/>

In [None]:
training_dataframe = extract_spectra(train_dataset, config, downsample=False, no_other=False, scale_by_intensity=False)
testing_dataframe = extract_spectra(test_dataset, config, downsample=False, no_other=False, scale_by_intensity=False)

<hr>
Visualize the data from a pixel of your choice. <BR>
The first input can be either 'training_dataframe' or 'testing_dataframe'. <BR>
Specify the index you want for the patch, the row, and the column. 

<font color='#90c149'>Re-run the cell below with different settings for the values above to see results of your choice.</font>
<hr><br/>

In [None]:
selected_df = display_pixel_spectrum(training_dataframe, config, patch_index=10, row_index=0, column_index=0)
selected_df

<hr>
Manually construct a pixel of your choice to be predicted. <BR>

<font color='#90c149'>Re-run the cell below with different settings for the values below to see results of your choice.</font>
<hr><br/>

In [None]:
manual_dict = {
    "True_Class": [0.0], 
    "Blue": [0.317627],
    "Green": [0.325439], 
    "Red": [0.42749], 
    "NIR": [0.043152],
    "490": [0.198364], 
    "560": [0.242798], 
    "665": [0.307373], 
    "705": [0.321777], 
    "740": [0.359375], 
    "783": [0.373047], 
    "842": [0.443115], 
    "865": [0.404053], 
    "1610": [0.621582], 
    "2190": [0.567871],
    "elevation": [0.133301],
}
manual_df = dict_to_dataframe(manual_dict)
manual_df


<hr>
The cells below can be used to predict the class of an individual pixel. <BR>
Either 'manual_df' or 'selected_df' can be used. <BR>

<font color='#90c149'>Re-run the cells below with a different previously generated input to see results of your choice.</font>
<hr><br/>

In [None]:
predict_pixels(selected_df, knn_model)

In [None]:
predict_pixels(selected_df, hdbscan_model)