# Example workflow
This notebook shows an example workflow for dataSetProcessor.
## Images
First we need an image covering a given area. To find one we will use find_sentinel2.

### Copernicus authentication
Open config.py and fill in your user data in "copernicus_token_data".

### Workspace
By default the workspace is created in the directory where dataSetProcessor is installed. To change it, pass an absolute path to the "workspace" parameter in config.py.

### Searching images
To find images we will use the find_sentinel2 function. It takes the following parameters:\
    1. start_date - datetime from which to search\
    2. end_date - datetime up to which to search\
    3. bbox - bounding box in "EPSG:4326", in the order left, bottom, right, top\
    4. products - list of products. It may contain 'L1C', 'L2A', or both\
    5. cloud_cover - maximum allowed cloud cover, in percent\
The output is the metadata of the images found.

In [1]:
from bbox import Bbox
from image import find_sentinel2
import datetime
found = find_sentinel2(
        datetime.datetime(2023, 9, 10, 0, 0, 0),
        datetime.datetime(2023, 9, 11, 23, 59, 59),
        bbox=Bbox(
            14.8,
            53.5,
            14.9,
            53.6,
            "EPSG:4326"
        ),
        products=['L1C', 'L2A'],
        cloud_cover=30
    )
for el in found:
    print(el.name)

S2A_MSIL1C_20230910T100601_N0509_R022_T33UVV_20230910T135051.SAFE
S2A_MSIL2A_20230910T100601_N0509_R022_T33UVV_20230910T161500.SAFE
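
The names printed above follow the compact Sentinel-2 naming convention (mission, product level, sensing start, processing baseline, relative orbit, tile, product discriminator). The sketch below splits such a name into fields, which can be handier than substring checks when filtering by level or tile. `ProductName` and `parse_product_name` are hypothetical helpers written for this illustration; they are not part of dataSetProcessor.

```python
from datetime import datetime
from typing import NamedTuple


class ProductName(NamedTuple):
    mission: str            # e.g. "S2A"
    level: str              # "L1C" or "L2A"
    sensing_start: datetime
    baseline: str           # processing baseline, e.g. "0509"
    relative_orbit: int
    tile: str               # MGRS tile, e.g. "33UVV"


def parse_product_name(name: str) -> ProductName:
    """Split a compact Sentinel-2 product name into its fields."""
    stem = name[:-5] if name.endswith(".SAFE") else name
    mission, msi_level, sensing, baseline, orbit, tile, _discriminator = stem.split("_")
    return ProductName(
        mission=mission,
        level=msi_level[3:],               # "MSIL1C" -> "L1C"
        sensing_start=datetime.strptime(sensing, "%Y%m%dT%H%M%S"),
        baseline=baseline[1:],             # drop the leading "N"
        relative_orbit=int(orbit[1:]),     # drop the leading "R"
        tile=tile[1:],                     # drop the leading "T"
    )


info = parse_product_name(
    "S2A_MSIL1C_20230910T100601_N0509_R022_T33UVV_20230910T135051.SAFE"
)
print(info.level, info.tile, info.sensing_start.isoformat())
# L1C 33UVV 2023-09-10T10:06:01
```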


### Saving images
To store this metadata on disk, use the save function.

In [2]:
for el in found:
    el.save()

### Cloud masks

DataSetProcessor can detect clouds and cirrus using a Python implementation of the IDEPIX Sentinel-2 algorithm. The get_cloud_mask function creates a 60-meter cloud mask for an L1C image, with parameters:
1. buffer_size - the queen (8-connected) distance from clouds, i.e. how far the cloud mask should be widened.
2. dem_path - path to a digital elevation model (DEM) covering the image, with an appropriate CRS and 60 m pixels aligned to the scene. The DEM for this example is in the repository under the path "dems/srtm.tif"
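
To make the buffer_size parameter concrete, here is a minimal sketch of widening a mask by a queen (8-connected) distance. It assumes the buffer is counted in pixels and operates on a plain nested list; `widen_mask` is a hypothetical helper for illustration and not the library's implementation.

```python
def widen_mask(mask, buffer_size):
    """Grow the True cells of a 2-D boolean grid by `buffer_size` queen moves."""
    rows, cols = len(mask), len(mask[0])
    widened = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            # Mark every cell whose row and column offsets are both
            # at most buffer_size (Chebyshev distance <= buffer_size).
            for rr in range(max(0, r - buffer_size), min(rows, r + buffer_size + 1)):
                for cc in range(max(0, c - buffer_size), min(cols, c + buffer_size + 1)):
                    widened[rr][cc] = True
    return widened


cloud = [[False] * 5 for _ in range(5)]
cloud[2][2] = True  # a single cloudy pixel in the middle
for row in widen_mask(cloud, 1):
    print("".join("#" if cell else "." for cell in row))
# .....
# .###.
# .###.
# .###.
# .....
```

With buffer_size=2, as used in the cell below, the same single pixel would grow into a 5 x 5 block.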

The output is a combined mask, where:\
(output & 8) != 0 marks cloud sure,\
(output & 4) != 0 marks cloud ambiguous,\
(output & 2) != 0 marks cirrus sure,\
(output & 1) != 0 marks cirrus ambiguous\
The mask will be saved to the cloud_masks directory under the name of the scene.
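
The four layers are bit flags, so one pixel value can carry several of them at once. The sketch below decodes a value with the standard library's `IntFlag`. The names `CloudFlag`, `has_flag` and `describe` are made up for this illustration; only the numeric values 8, 4, 2 and 1 come from the description above.

```python
from enum import IntFlag


class CloudFlag(IntFlag):
    CIRRUS_AMBIGUOUS = 1
    CIRRUS_SURE = 2
    CLOUD_AMBIGUOUS = 4
    CLOUD_SURE = 8


def has_flag(value, flag):
    """True if the bit for `flag` is set in the pixel value."""
    # value & flag is either 0 or the flag's own value (for example 8),
    # so the result is compared against 0.
    return (value & flag) != 0


def describe(value):
    """List the names of all flags set in a pixel value."""
    return [flag.name for flag in CloudFlag if value & flag]


print(describe(12))                        # ['CLOUD_AMBIGUOUS', 'CLOUD_SURE']
print(has_flag(12, CloudFlag.CLOUD_SURE))  # True
print(has_flag(3, CloudFlag.CLOUD_SURE))   # False
```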

In [3]:
for el in found:
    if 'L1C' in el.name:
        el.get_cloud_mask(2, "dems/srtm.tif")

C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B01.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B02.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B03.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B04.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B05.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B06.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B07.jp2
C:\Users\trole\Documents\dataSetProcessorWorkspace\temp\2939f355-a84b-4407-acb6-da49657d6bac\T33UVV_20230910T100601_B08.jp2
C:\Users

## Datasets
### Vector Dataset
Now we can generate a dataset from the scene we found. We will use a dataset from the examples as label data. To build the dataset we will use the VectorDataset class. The constructor takes these parameters:
1. dataset_name - the name under which the dataset will be saved
2. label_source - GeoDataFrame with polygons serving as labels
3. class_column - the name of the column holding classes as integers
4. sentinel2image - list of image objects to take pixel data from. It is designed to take L2A images
5. apply_cloud_mask - list of cloud mask layers whose pixels are excluded. The meaning of the integers is given above
6. window_size - radius of the processing window. With 0 the window is the labeled pixel only. For n it is a (2*n+1) x (2*n+1) square with the labeled pixel in the middle
7. is_conv - whether to prepare the dataset for convolutional neural networks. If True, each feature keeps the shape 10 x (2*n+1) x (2*n+1). Otherwise the features are flattened
8. apply_rotations_and_reflections - if True, performs data augmentation by adding rotations of each window and reflections of those rotations, which makes the dataset 8 times larger. If window_size=0 this only repeats every feature 8 times and adds no new information
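
The window and augmentation parameters can be illustrated with a small sketch. It works on a single band stored as a nested list and assumes the 10 bands mentioned above; `feature_shape` and `eight_variants` are hypothetical helpers, not functions of the dataset module.

```python
def feature_shape(window_size, is_conv, bands=10):
    """Shape of one feature for a given window radius."""
    side = 2 * window_size + 1
    return (bands, side, side) if is_conv else (bands * side * side,)


def rotate90(window):
    """Rotate a square 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*window[::-1])]


def flip(window):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in window]


def eight_variants(window):
    """Four rotations of the window, each with its reflection."""
    variants = []
    current = [list(row) for row in window]
    for _ in range(4):
        variants.append(current)
        variants.append(flip(current))
        current = rotate90(current)
    return variants


print(feature_shape(0, is_conv=False))  # (10,)
print(feature_shape(2, is_conv=True))   # (10, 5, 5)

window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # window_size=1
distinct = {tuple(map(tuple, v)) for v in eight_variants(window)}
print(len(distinct))  # 8

single = {tuple(map(tuple, v)) for v in eight_variants([[5]])}  # window_size=0
print(len(single))    # 1
```

The last line shows why augmentation adds nothing for window_size=0: all eight variants of a single pixel are identical.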

dataset.save_dataset() saves the dataset to the "datasets" directory. It also writes a docx report to "reports/dataset" under the same name.

In [2]:
import geopandas as gpd
from dataset import VectorDataset
from image import load_image

data = gpd.read_file('examples/pokrycie2.shp')
print(data)

dataset = VectorDataset(
        dataset_name="goleniow1",
        label_source=data,
        class_column="klasa",
        sentinel2image=[load_image("S2A_MSIL2A_20230910T100601_N0509_R022_T33UVV_20230910T161500.SAFE")],
        apply_cloud_mask=[8, 4],
        window_size=0,
        is_conv=False,
        apply_rotations_and_reflections=False
    )
dataset.save_dataset()

    klasa                                           geometry
0       0  POLYGON ((476941.315 5934303.877, 476935.787 5...
1       0  POLYGON ((489648.891 5935665.666, 489794.468 5...
2       0  POLYGON ((500283.378 5933406.460, 500209.668 5...
3       0  POLYGON ((497747.759 5933175.195, 497891.494 5...
4       1  POLYGON ((489860.807 5935099.943, 489693.318 5...
..    ...                                                ...
91      3  POLYGON ((493656.920 5932403.273, 493616.878 5...
92      3  POLYGON ((498117.435 5937455.194, 498239.190 5...
93      3  POLYGON ((497782.235 5937095.943, 497956.599 5...
94      3  POLYGON ((492720.158 5933242.893, 492876.485 5...
95      3  POLYGON ((492720.158 5933511.454, 492635.982 5...

[96 rows x 2 columns]
['T33UVV_20230910T100601_B02_10m.jp2', 'T33UVV_20230910T100601_B03_10m.jp2', 'T33UVV_20230910T100601_B04_10m.jp2', 'T33UVV_20230910T100601_B05_20m.jp2', 'T33UVV_20230910T100601_B06_20m.jp2', 'T33UVV_20230910T100601_B07_20m.jp2', 'T33UVV_20230910

### Partial Dataset
To make the dataset evenly distributed across classes, we can extract a subset of its features, as the code below does. The count variable sets how many objects of each class go into the partial dataset. The partial dataset will be saved to the directory "datasets/{name of full dataset}" as "{count}", and the docx report to the directory "reports/datasets/{name of full dataset}".

In [5]:
import numpy as np
from dataset import PartialDataset

indices = np.arange(len(dataset.labels))
labels_indices = {el: indices[dataset.labels == el] for el in set(dataset.labels)}

for label in labels_indices:
    np.random.shuffle(labels_indices[label])

count = 750

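part_features = []
part_labels = []
for label in labels_indices:
    # Take at most `count` samples; a class may hold fewer than that,
    # so the labels are repeated once per sample actually selected.
    selected = labels_indices[label][:count]
    part_features += dataset.features[selected].tolist()
    part_labels += [label] * len(selected)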

PartialDataset(
    dataset=dataset,
    subset_id=f"{count}",
    features=np.array(part_features),
    labels=np.array(part_labels)
).save_dataset()


(3750, 10)


## Learning models
### Defining model
Now we will define the model to be trained on the data. The parameters are:
1. name - the name under which the model will be saved.
2. classifier - a function that initializes the classification algorithm. The example passes a function that initializes a random forest classifier with 150 estimators
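
The classifier argument is a function rather than a ready-made object, and functools.partial is a convenient way to fix its arguments in advance. The sketch below shows the mechanism with `make_forest`, a hypothetical stand-in for an initializer such as model.random_forest; the reason for passing a factory (presumably so that a fresh, untrained classifier can be created on demand) is an assumption.

```python
from functools import partial


def make_forest(n_estimators=100, max_depth=None):
    """Hypothetical stand-in for a classifier initializer."""
    return {"kind": "random_forest", "n_estimators": n_estimators, "max_depth": max_depth}


# Nothing is built yet: partial only remembers the function and its arguments.
factory = partial(make_forest, n_estimators=150)

first = factory()   # the classifier is created here
second = factory()  # and a separate, independent one here

print(first["n_estimators"])  # 150
print(first is second)        # False
```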

In [5]:
import model
from functools import partial

m = model.Model(
    name="rfc150",
    classifier=partial(
        model.random_forest,
        n_estimators=150
    )
)

### Learning
To train the model we use the learn function, with parameters:
1. dataset_name - the name of the dataset to train on. To choose a partial dataset, pass "{dataset_name}_parts/{subset_id}"
2. report_name - the name of the docx report saved to the "reports/learning" directory
3. test_size - the size of the validation dataset as a fraction from 0 to 1. Pass 0 if you don't want validation

In [6]:
m.learn("goleniow1_parts/750", "goleniow1_rfc150", test_size=0.2)

(0.7626666666666667, 0.702433359560805)

## Test dataset
### Classified image
Now that the model is trained, we want to create a test dataset at random points, with a comparable number of features in each class. So first we need a classified image. To create it we will use the "classify_image" function, with parameters:
1. image_name - name of the image to be classified
2. output_file - name of the output file saved to the "classImages" directory
3. bbox - bbox of the area to classify. Pass None to classify the whole image
4. apply_cloud_mask - which layers of the cloud mask override the model's predictions and set them to invalid. Pass None to skip; defaults to None
5. invalid_value - value assigned to pixels considered invalid after applying the cloud mask; defaults to 255
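
How apply_cloud_mask and invalid_value interact can be sketched as follows. This is an illustration on nested lists under the assumption that the selected layers are combined with a bitwise OR; `apply_cloud_layers` is a hypothetical helper, not the code that classify_image runs.

```python
def apply_cloud_layers(predictions, cloud_mask, layers=(8, 4), invalid_value=255):
    """Replace predictions with invalid_value wherever a selected layer is set."""
    if layers is None:
        # No layers selected: return the predictions unchanged (as a copy).
        return [row[:] for row in predictions]
    selected = 0
    for layer in layers:
        selected |= layer  # (8, 4) becomes the bit pattern 12
    return [
        [invalid_value if (m & selected) != 0 else p for p, m in zip(pred_row, mask_row)]
        for pred_row, mask_row in zip(predictions, cloud_mask)
    ]


predictions = [[0, 1, 2], [3, 4, 0]]
cloud_mask = [[0, 8, 1], [4, 2, 12]]
print(apply_cloud_layers(predictions, cloud_mask))
# [[0, 255, 2], [255, 4, 255]]
```

Pixels flagged only as cirrus (mask values 1 and 2) keep their predicted class here because layers 1 and 2 were not selected.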

In [8]:
m.classify_image(
    "S2A_MSIL2A_20230910T100601_N0509_R022_T33UVV_20230910T161500.SAFE",
    "RFC150",
    Bbox(
        53.82,
        14.0324,
        54.1,
        14.25,
        "EPSG:4326"
    ),
    apply_cloud_mask=(8,4),
    invalid_value=255
)

### Creating test dataset
Now we choose pixels from the classified image. This is done by the create_test_dataset function, with parameters:
1. dataset_name - name under which the dataset will be saved to the "test_datasets" directory
2. class_raster - the path to the classified image
3. bbox - bounding box of the dataset. Pass None to take data from the whole image
4. mode - mode of creating the dataset. It has three possible values:\
&nbsp;&nbsp;&nbsp;&nbsp;a. "total_random" - every pixel is chosen with the same probability\
&nbsp;&nbsp;&nbsp;&nbsp;b. "proportional" - preserves the proportions between classes\
&nbsp;&nbsp;&nbsp;&nbsp;c. "equal" - every class is represented by the same number of pixels
5. count - final size of the dataset
6. included_values - values that represent classes.
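
The three modes differ only in how many pixels each class contributes. The sketch below shows one possible reading of them on a flat list of class values; `sample_pixels` and its seed argument are assumptions made for illustration, and the real function may resolve rounding and small classes differently.

```python
import random
from collections import defaultdict


def sample_pixels(classes, mode, count, included_values, seed=0):
    """Return indices into `classes` chosen according to `mode`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, value in enumerate(classes):
        if value in included_values:
            by_class[value].append(index)
    candidates = [i for indices in by_class.values() for i in indices]
    if not candidates:
        return []

    if mode == "total_random":
        # Every eligible pixel has the same chance, whatever its class.
        return rng.sample(candidates, min(count, len(candidates)))

    if mode == "equal":
        quota = {value: count // len(by_class) for value in by_class}
    elif mode == "proportional":
        # Rounding means the quotas may not add up to count exactly.
        quota = {
            value: round(count * len(indices) / len(candidates))
            for value, indices in by_class.items()
        }
    else:
        raise ValueError(f"unknown mode: {mode}")

    chosen = []
    for value, indices in by_class.items():
        chosen += rng.sample(indices, min(quota[value], len(indices)))
    return chosen


classes = [0] * 600 + [1] * 300 + [2] * 100
for mode in ("equal", "proportional"):
    picked = sample_pixels(classes, mode, 90, [0, 1, 2])
    print(mode, [sum(classes[i] == v for i in picked) for v in (0, 1, 2)])
# equal [30, 30, 30]
# proportional [54, 27, 9]
```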

In [9]:
from dataset import create_test_dataset

create_test_dataset(
    dataset_name="Swinoujscie",
    class_raster="classImages/RFC150.tif",
    bbox=None,
    mode="equal",
    count=500,
    included_values=[0,1,2,3,4]
)

### Composition
The created geojson file contains a "ground_truth" field with all values set to -1. Each value must be replaced manually with the true class of the point. A composite image, from which you can easily make RGB representations, is useful for this. We will use the "draw_composition" function, with parameters:
1. output_file - the name of the file, which will be saved to the "compositions" directory
2. bbox - bounding box of the part we want to extract. Pass None to take the whole image. Defaults to None

In [10]:
for el in found:
    if 'L2A' in el.name:
        el.draw_composition(
            "Swinoujscie",
            Bbox(
                53.82,
                14.0324,
                54.1,
                14.25,
                "EPSG:4326"
            )
        )

### Import test dataset
After filling in the "ground_truth" column, we can load the test dataset using the PointDataset class. Its constructor takes parameters like those of VectorDataset.
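
Before loading, it can be worth checking that no point still carries the placeholder value -1. The sketch below does this with the standard json module on GeoJSON text; `unlabeled_points` and the tiny sample collection are made up for illustration.

```python
import json


def unlabeled_points(geojson_text, field="ground_truth", placeholder=-1):
    """Return the positions of features whose label is missing or still the placeholder."""
    collection = json.loads(geojson_text)
    return [
        index
        for index, feature in enumerate(collection["features"])
        if feature["properties"].get(field, placeholder) == placeholder
    ]


# A made-up two-point collection; the second point has not been labeled yet.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"value": 0, "ground_truth": 0},
         "geometry": {"type": "Point", "coordinates": [447325.0, 5979825.0]}},
        {"type": "Feature",
         "properties": {"value": 4, "ground_truth": -1},
         "geometry": {"type": "Point", "coordinates": [443365.0, 5979565.0]}},
    ],
})
print(unlabeled_points(sample))  # [1]
```

To check a real file, read its contents with open(...).read() and pass the text to the function.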

In [3]:
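from dataset import PointDataset
from config import workspace
from image import load_image
import geopandas as gpd
import os

# Load the manually labeled points created in the previous steps
data = gpd.read_file(os.path.join(workspace, 'test_datasets', 'Swinoujscie.geojson'))
print(data)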

dataset = PointDataset(
        dataset_name="swinoujscie1",
        label_source=data,
        class_column="ground_truth",
        sentinel2image=[load_image("S2A_MSIL2A_20230910T100601_N0509_R022_T33UVV_20230910T161500.SAFE")],
        apply_cloud_mask=[8, 4],
        window_size=0,
        is_conv=False,
        apply_rotations_and_reflections=False
    )
dataset.save_dataset()

             y         x  value  ground_truth                        geometry
0    5979825.0  447325.0    0.0           0.0  POINT (447325.000 5979825.000)
1    5968685.0  439655.0    0.0           0.0  POINT (439655.000 5968685.000)
2    5978095.0  436515.0    0.0           0.0  POINT (436515.000 5978095.000)
3    5992845.0  445875.0    0.0           0.0  POINT (445875.000 5992845.000)
4    5966965.0  450045.0    0.0           0.0  POINT (450045.000 5966965.000)
..         ...       ...    ...           ...                             ...
495  5979565.0  443365.0    4.0           4.0  POINT (443365.000 5979565.000)
496  5971375.0  445095.0    4.0           4.0  POINT (445095.000 5971375.000)
497  5978425.0  442315.0    4.0           4.0  POINT (442315.000 5978425.000)
498  5973205.0  442765.0    4.0           4.0  POINT (442765.000 5973205.000)
499  5982025.0  440295.0    4.0           4.0  POINT (440295.000 5982025.000)

[500 rows x 5 columns]
['T33UVV_20230910T100601_B02_10m.jp2', '

### Classify test dataset
Now that the test dataset is imported, we can use it to evaluate the model. For this we will use the "classify_dataset" function, with parameters:
1. dataset_name - name of the dataset
2. report_name - name of the docx report saved to the "reports/classifying" directory

In [7]:
m.classify_dataset("swinoujscie1", "rfc150_test")

(0.722, 0.6525)
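
The printed tuple is not labeled. Its values are consistent with overall accuracy followed by Cohen's kappa: if each of the 5 predicted classes holds an equal share of the 500 points, chance agreement is 0.2, and (0.722 - 0.2) / (1 - 0.2) = 0.6525. Treat that reading as an assumption rather than documented behavior. Under it, both numbers can be computed from a confusion matrix as in this sketch, where `accuracy_and_kappa` is a hypothetical helper and the 2 x 2 matrix is made-up data.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


accuracy, kappa = accuracy_and_kappa([[40, 10], [20, 30]])
print(round(accuracy, 4), round(kappa, 4))  # 0.7 0.4
```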

That completes the example workflow. Thanks!