# Data Loaders

Two data loaders are available to make working with the provided data more straightforward:

1. A data loader providing spectral data and labels for a single pixel, useful for scikit-learn classifiers
2. A PyTorch dataset and PyTorch Lightning dataloader providing image chips together with labels

In this notebook we show how these data loaders can be used.

## Pixel data loader



In [1]:
from disfor.data import ForestDisturbanceData

The class `ForestDisturbanceData` takes arguments that filter the dataset and exposes properties that can be used to train scikit-learn classifiers.

In [None]:
data = ForestDisturbanceData(
    # If None, data gets dynamically downloaded and cached from Huggingface
    data_folder=None,
    # selecting healthy forest (110), clear cut (211) and bark beetle (231)
    target_classes=[110, 211, 231],
    # we remap salvage logging (221 and 222) to also be part of the clear cut class
    class_mapping_overrides={221: 211, 222: 211},
    # subset to only include samples with high confidence
    confidence=["high"],
    # only include acquisitions from "leaf-on" months
    months=[5, 6, 7, 8, 9],
    # also treat dark pixels (2) as valid
    valid_scl_values=[2, 4, 5, 6],
    # only include acquisitions where the clear cut is recent (maximum of 90 days),
    # for all other classes include everything
    max_days_since_event={211: 90},
    max_samples_per_event=5,
    # omit samples which have low tcd in the comment
    omit_low_tcd=True,
    # omit samples which have border in the comment
    omit_border=True,
)

Downloading file 'classes.json' from 'https://huggingface.co/datasets/JR-DIGITAL/DISFOR/resolve/main/classes.json' to 'C:\Users\Jonas.Viehweger\AppData\Local\disfor\disfor\Cache\0.1.0'.
Downloading file 'labels.parquet' from 'https://huggingface.co/datasets/JR-DIGITAL/DISFOR/resolve/main/labels.parquet' to 'C:\Users\Jonas.Viehweger\AppData\Local\disfor\disfor\Cache\0.1.0'.
Downloading file 'pixel_data.parquet' from 'https://huggingface.co/datasets/JR-DIGITAL/DISFOR/resolve/main/pixel_data.parquet' to 'C:\Users\Jonas.Viehweger\AppData\Local\disfor\disfor\Cache\0.1.0'.
Downloading file 'samples.parquet' from 'https://huggingface.co/datasets/JR-DIGITAL/DISFOR/resolve/main/samples.parquet' to 'C:\Users\Jonas.Viehweger\AppData\Local\disfor\disfor\Cache\0.1.0'.


Once initialized, the class instance provides the train and test data as NumPy arrays.

In [22]:
print(data.y_train, data.X_train, data.y_test, data.X_test, sep="\n")

[0 0 0 ... 0 0 0]
[[ 245  516  561 ... 4265 2567 1417]
 [ 353  628  648 ... 4570 2475 1414]
 [ 275  546  664 ... 4411 2556 1405]
 ...
 [ 193  333  146 ... 3737 1374  597]
 [ 174  226  103 ... 2004  553  245]
 [ 228  350  194 ... 3847 1495  650]]
[0 0 0 ... 0 0 0]
[[1136 1490 1604 ... 2288 2277 1760]
 [1011 1358 1458 ... 2384 2243 1830]
 [1058 1442 1556 ... 2514 2307 1745]
 ...
 [ 190  398  224 ... 1943  957  473]
 [ 419  534  311 ... 2495 1044  491]
 [ 242  416  264 ... 2416 1347  667]]


It also provides the label encoder that was used, which maps the encoded labels (0 to n-1) back to the original class labels.

In [6]:
data.label_encoder.inverse_transform(data.y_test)

array([110, 110, 110, ..., 110, 110, 110], shape=(7482,), dtype=uint16)
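To make the encoding explicit, here is a minimal sketch of the round trip on a handful of made-up labels. It assumes that `data.label_encoder` behaves like scikit-learn's `LabelEncoder` (the `inverse_transform` call above and the `classes_` attribute used further down suggest this, but it is an assumption); the array `original` is purely illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only: healthy forest (110), clear cut (211), bark beetle (231)
original = np.array([110, 211, 231, 110, 231], dtype=np.uint16)

encoder = LabelEncoder()
# fit_transform sorts the unique labels and maps them to 0 .. n-1
encoded = encoder.fit_transform(original)

# inverse_transform maps the integer codes back to the original class labels
decoded = encoder.inverse_transform(encoded)

print(encoder.classes_)  # sorted original labels; the position is the encoded value
print(encoded)
print(decoded)
```

The position of a label in `classes_` is its encoded value, so the smallest class code (110 in this selection) is encoded as 0.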

Now, let's very quickly train a Random Forest model and validate the output:

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(oob_score=True)
rf.fit(data.X_train, data.y_train)

print(rf.oob_score_)

0.9091450700029788
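The `oob_score_` attribute is scikit-learn's out-of-bag estimate: each training sample is scored only by the trees whose bootstrap sample did not contain it. The following sketch compares it with the accuracy on a separate held-out split. It runs on synthetic data from `make_classification`, not on the DISFOR data, and the names `rf_demo`, `oob_accuracy` and `test_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real predictors come from data.X_train / data.X_test
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_demo.fit(X_train, y_train)

# Accuracy estimated from the training samples each tree did not see
oob_accuracy = rf_demo.oob_score_
# Accuracy on samples that were never used during training
test_accuracy = rf_demo.score(X_test, y_test)

print(f"OOB accuracy:      {oob_accuracy:.3f}")
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Both numbers estimate the same quantity, but only the held-out score is computed from samples that played no part in fitting the model.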


The out-of-bag (OOB) accuracy of the Random Forest model is about 0.91. However, let's use the held-out set to get a better idea of the model accuracy. For this we apply the trained model to the held-out predictors (`X_test`) and derive accuracy metrics from the predictions.

In [23]:
from sklearn.metrics import classification_report

y_pred = rf.predict(data.X_test)
print(
    classification_report(
        data.y_test, y_pred, target_names=data.label_encoder.classes_.astype(str)
    )
)

              precision    recall  f1-score   support

         110       0.98      0.91      0.95      6806
         211       0.44      0.73      0.55       241
         231       0.43      0.76      0.55       435

    accuracy                           0.90      7482
   macro avg       0.62      0.80      0.68      7482
weighted avg       0.93      0.90      0.91      7482



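We can see that the healthy class (110) is predicted well, but the two disturbance classes are not. Their recall is acceptable (0.73 for clear cut and 0.76 for bark beetle), while their precision is only 0.44 and 0.43, meaning that more than half of the pixels predicted as disturbed actually belong to another class.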
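To get a feeling for what these precision values mean in pixel counts, the rounded numbers in the report can be turned back into approximate true and false positives. This is a rough back-of-the-envelope sketch: the inputs are the two-decimal values printed above, so the resulting counts are estimates rather than the exact confusion matrix entries, and the helper `approx_counts` is hypothetical.

```python
# (precision, recall, support) copied from the classification report above
report = {
    "211 (clear cut)": (0.44, 0.73, 241),
    "231 (bark beetle)": (0.43, 0.76, 435),
}


def approx_counts(precision, recall, support):
    """Estimate true and false positives from rounded report values."""
    # recall = TP / support  ->  TP = recall * support
    true_positives = recall * support
    # precision = TP / (TP + FP)  ->  TP + FP = TP / precision
    predicted_positives = true_positives / precision
    false_positives = predicted_positives - true_positives
    return round(true_positives), round(false_positives)


counts = {name: approx_counts(*values) for name, values in report.items()}

for name, (tp, fp) in counts.items():
    print(f"{name}: ~{tp} true positives, ~{fp} false positives")
```

Because both precision values are below 0.5, the estimated false positives outnumber the true positives for both disturbance classes.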