## Simple baseline with Sentinel-2 data — ResNet-18 + Binary Cross Entropy

The occurrence of different types of organisms, whether plants or animals, is generally associated with the characteristics of the environment or ecosystem in which they live. This relationship between the presence of species and their habitat is often interdependent and can be affected by various factors, such as climate, which is another modality we provide.

To demonstrate the performance achievable with just the _image data_, i.e., the Sentinel-2 image patches, we provide a straightforward baseline based on a slightly modified ResNet-18 trained with Binary Cross Entropy.
As described above, the satellite patches provide an image-like modality that captures habitats and other aspects of the locality.

This baseline leaves considerable room for improvement, so we encourage you to experiment with various techniques, architectures, losses, etc.

#### **Have Fun!**

In [1]:
import os
import torch
import tqdm
import rasterio
import numpy as np
import pandas as pd
import albumentations as A
import torchvision.models as models
import torchvision.transforms as transforms
import torch.nn as nn
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import precision_recall_fscore_support

from src.dataset.dataset import TrainDataset, TestDataset



## Data description

The Sentinel-2 data was acquired through the Sentinel-2 satellite program and pre-processed by [Ecodatacube](https://stac.ecodatacube.eu/) to produce raster files scaled to the entire European continent and projected into a single CRS.
Each TIFF file corresponds to a unique observation location (via "surveyId"). To load the patch for a selected observation, take the "surveyId" from any occurrence CSV and build the file path following this rule --> '…/CD/AB/XXXXABCD.tiff'. For example, the image location for the surveyId 3018575 is "./75/85/3018575.tiff". For a "surveyId" with fewer than four digits, the same rule is applied to the digits that are present. For example, the image location for the "surveyId" 1 is "./1/1.tiff".
The path can be constructed using the following method:

```python
def construct_patch_path(output_path, survey_id):
    """Construct the patch file path based on survey_id as './CD/AB/XXXXABCD.tiff'"""
    path = output_path
    for d in (str(survey_id)[-2:], str(survey_id)[-4:-2]):
        path = os.path.join(path, d)

    path = os.path.join(path, f"{survey_id}.tiff")

    return path
```
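As a quick sanity check of the rule, the sketch below restates the helper in compact form so that it runs on its own, and prints the paths for the two survey IDs used as examples in the text. The name `patch_path` is only illustrative and is not part of the original code.

```python
import os


def patch_path(root, survey_id):
    """Build '<root>/CD/AB/XXXXABCD.tiff' from the trailing digits of survey_id."""
    sid = str(survey_id)
    # Last two digits first, then the two digits before them (empty for short IDs).
    return os.path.join(root, sid[-2:], sid[-4:-2], f"{sid}.tiff")


# The two survey IDs mentioned in the text above.
examples = {sid: patch_path(".", sid) for sid in (3018575, 1)}
for sid, path in examples.items():
    print(sid, "->", path)
```

For a one-digit ID the second directory component is empty, so `os.path.join` simply skips that level.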

**For more information about data processing, normalization, and visualization, please refer to the following notebook**: [Kaggle Notebook](https://www.kaggle.com/code/picekl/sentinel-2-data-processing-and-normalization).

**References:**
- *Traceability (lineage): The dataset was produced entirely by mosaicking and seasonally aggregating imagery from the Sentinel-2 Level-2A product (https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-2-msi/product-types/level-2a)*
- *Ecodatacube.eu: Analysis-ready open environmental data cube for Europe (https://doi.org/10.21203/rs.3.rs-2277090/v3)*

## Prepare custom dataset loader

We have to slightly update the Dataset to provide the relevant data in the appropriate format.
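
The `TrainDataset` class imported from `src.dataset.dataset` is not shown in this notebook. As a rough illustration only, a multi-label dataset of this kind might look like the sketch below. It assumes the metadata has one row per occurrence with `surveyId` and `speciesId` columns, and it takes a hypothetical `load_patch` callable instead of reading TIFF files. It returns a triple so that it matches the `(data, targets, _)` unpacking in the training loop.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class MultiLabelPatchDataset(Dataset):
    """Hypothetical stand-in for TrainDataset; not the repository's implementation."""

    def __init__(self, metadata, num_classes, load_patch, transform=None):
        self.num_classes = num_classes
        self.load_patch = load_patch  # assumed callable: survey_id -> HxWxC array
        self.transform = transform
        # Assumed columns: one row per (surveyId, speciesId) occurrence.
        grouped = metadata.groupby("surveyId")["speciesId"].apply(list)
        self.species_by_survey = grouped.to_dict()
        self.survey_ids = sorted(self.species_by_survey)

    def __len__(self):
        return len(self.survey_ids)

    def __getitem__(self, idx):
        survey_id = self.survey_ids[idx]
        patch = self.load_patch(survey_id)
        if self.transform is not None:
            patch = self.transform(patch)
        # Multi-hot target: 1.0 at every species index observed in this survey.
        target = torch.zeros(self.num_classes, dtype=torch.float32)
        target[[int(s) for s in self.species_by_survey[survey_id]]] = 1.0
        return patch, target, survey_id


# Toy usage with random 4-channel patches instead of real TIFF files.
toy_metadata = pd.DataFrame({"surveyId": [7, 7, 9], "speciesId": [0, 3, 2]})
toy_dataset = MultiLabelPatchDataset(
    toy_metadata,
    num_classes=5,
    load_patch=lambda sid: np.random.rand(8, 8, 4).astype(np.float32),
)
patch, target, survey_id = toy_dataset[0]
print(patch.shape, target.tolist(), survey_id)
```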

### Load metadata and prepare data loaders

In [None]:
# Dataset and DataLoader
batch_size = 128
num_workers = 8

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5, 0.5)),
])

# Load Training metadata
train_data_path = "data/SatellitePatches/PA-train"
train_metadata_path = "data/GLC25_PA_metadata_train.csv"
train_metadata = pd.read_csv(train_metadata_path)
train_dataset = TrainDataset(train_data_path, train_metadata, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)

# Load Test metadata
test_data_path = "data/SatellitePatches/PA-test/"
test_metadata_path = "data/GLC25_PA_metadata_test.csv"
test_metadata = pd.read_csv(test_metadata_path)
test_dataset = TestDataset(test_data_path, test_metadata, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

## Modify pretrained ResNet-18 model

To use all four channels (R, G, B, and NIR), we have to modify the input layer of the standard ResNet-18, which expects three.
That is all :)

In [3]:
# Select the best available device: CUDA first, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("DEVICE = CUDA")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("DEVICE = MPS")
else:
    device = torch.device("cpu")
    print("DEVICE = CPU")

# Hyperparameters
learning_rate = 0.0001
num_epochs = 25
positive_weigh_factor = 1.0
num_classes = 11255 # Number of all unique classes within the PO and PA data.

DEVICE = MPS


In [4]:
from src.model.resnet18 import ResNet18


model = ResNet18()
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
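
To see what the scheduler does to the learning rate, the closed-form cosine annealing schedule can be evaluated directly. This is a minimal sketch using the notebook's settings (base LR 0.0001, `T_max` of 25, `eta_min` of 0). The function name `cosine_annealing_lr` is illustrative. PyTorch computes the same schedule recursively, so values may differ in the last floating-point digits.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, t_max=25, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 1, 2, 3, 25):
    print(epoch, cosine_annealing_lr(epoch))
```

The values for epochs 1 to 3 agree with the `_last_lr` entries printed in the training log (about 9.96e-05, 9.84e-05, and 9.65e-05).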

In [5]:
def set_seed(seed):
    # Set seed for PyTorch's CPU random number generator
    torch.manual_seed(seed)
    # Set seed for NumPy
    np.random.seed(seed)
    # Set seed for all CUDA devices, if available
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        # Make cuDNN use deterministic algorithms and disable its auto-tuner
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(77)

## Training Loop

Nothing special, just a standard PyTorch training loop.

In [None]:
print(f"Training for {num_epochs} epochs started.")

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (data, targets, _) in enumerate(train_loader):

        data = data.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        outputs = model(data)

        pos_weight = targets*positive_weigh_factor  # Positive labels are weighted by positive_weigh_factor (1.0 here)
        criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        loss = criterion(outputs, targets)

        loss.backward()
        optimizer.step()

        if batch_idx % 348 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item()}")

    scheduler.step()
    print("Scheduler:",scheduler.state_dict())

# Save the trained model
model.eval()
torch.save(model.state_dict(), "resnet18-eurosat.pth")

Training for 25 epochs started.
Epoch 1/25, Batch 0/696, Loss: 0.711972177028656
Epoch 1/25, Batch 348/696, Loss: 0.014029133133590221
Scheduler: {'T_max': 25, 'eta_min': 0.0, 'base_lrs': [0.0001], 'last_epoch': 1, '_step_count': 2, '_get_lr_called_within_step': False, '_last_lr': [9.96057350657239e-05]}
Epoch 2/25, Batch 0/696, Loss: 0.008208966813981533
Epoch 2/25, Batch 348/696, Loss: 0.006792403757572174
Scheduler: {'T_max': 25, 'eta_min': 0.0, 'base_lrs': [0.0001], 'last_epoch': 2, '_step_count': 3, '_get_lr_called_within_step': False, '_last_lr': [9.842915805643155e-05]}
Epoch 3/25, Batch 0/696, Loss: 0.006560108158737421
Epoch 3/25, Batch 348/696, Loss: 0.006731483619660139
Scheduler: {'T_max': 25, 'eta_min': 0.0, 'base_lrs': [0.0001], 'last_epoch': 3, '_step_count': 4, '_get_lr_called_within_step': False, '_last_lr': [9.648882429441257e-05]}
Epoch 4/25, Batch 0/696, Loss: 0.006035337224602699
Epoch 4/25, Batch 348/696, Loss: 0.006378504913300276
Scheduler: {'T_max': 25, 'eta_mi

100%|██████████| 116/116 [00:39<00:00,  7.63it/s]
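
The training loop above builds `pos_weight` as `targets * positive_weigh_factor`. The sketch below, on toy tensors, checks this construction against a hand-written weighted binary cross entropy. The function names `weighted_bce` and `manual_bce` are illustrative. With a factor of 1.0 the result matches the unweighted loss, and a larger factor scales only the positive-label terms.

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5], [0.0, 1.5, -2.0]])
targets = torch.tensor([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])


def weighted_bce(logits, targets, factor):
    """BCE with the per-element pos_weight construction used in the training loop."""
    pos_weight = targets * factor  # non-zero only where the label is positive
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)


def manual_bce(logits, targets, factor):
    """Same loss written out: positive terms scaled by `factor`, negatives untouched."""
    log_p = torch.nn.functional.logsigmoid(logits)
    log_not_p = torch.nn.functional.logsigmoid(-logits)
    return -(factor * targets * log_p + (1 - targets) * log_not_p).mean()


for factor in (1.0, 10.0):
    print(factor, weighted_bce(logits, targets, factor).item(),
          manual_bce(logits, targets, factor).item())
```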

## Test Loop

Again, nothing special, just standard inference.

In [10]:
with torch.no_grad():
    all_predictions = []
    surveys = []
    top_k_indices = None
    for data, surveyID in tqdm.tqdm(test_loader, total=len(test_loader)):

        data = data.to(device)
        
        outputs = model(data)
        predictions = torch.sigmoid(outputs).cpu().numpy()

        # Select the 25 highest-scoring classes as predictions
        top_25 = np.argsort(-predictions, axis=1)[:, :25]
        if top_k_indices is None:
            top_k_indices = top_25
        else:
            top_k_indices = np.concatenate((top_k_indices, top_25), axis=0)

        surveys.extend(surveyID.cpu().numpy())

100%|██████████| 116/116 [01:04<00:00,  1.80it/s]
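
The top-k selection in the test loop can be illustrated on a tiny score matrix. The sketch below shows the full `argsort` used above and, as an optional alternative that avoids sorting every class, `np.argpartition` followed by a sort of just the selected entries. The toy scores and `k = 2` are made up for illustration.

```python
import numpy as np

scores = np.array([[0.10, 0.90, 0.50, 0.30],
                   [0.70, 0.20, 0.30, 0.80]])
k = 2

# Full sort, as in the test loop: indices ordered by descending score.
top_k_sorted = np.argsort(-scores, axis=1)[:, :k]

# argpartition only guarantees that the k largest come first, in arbitrary order,
# so the selected indices are re-sorted by score afterwards.
candidates = np.argpartition(-scores, k - 1, axis=1)[:, :k]
order = np.argsort(-np.take_along_axis(scores, candidates, axis=1), axis=1)
top_k_partitioned = np.take_along_axis(candidates, order, axis=1)

print(top_k_sorted)
print(top_k_partitioned)
```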


## Save prediction file! 🎉🥳🙌🤗

In [11]:
data_concatenated = [' '.join(map(str, row)) for row in top_k_indices]

pd.DataFrame(
    {'surveyId': surveys,
     'predictions': data_concatenated,
    }).to_csv("submission.csv", index = False)
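
To check the layout of the resulting file, the same formatting can be applied to toy values and read back. This sketch writes to an in-memory buffer rather than `submission.csv`. The survey IDs and class indices are made up.

```python
import io

import numpy as np
import pandas as pd

toy_surveys = [101, 102]
toy_top_k = np.array([[5, 2, 9], [1, 7, 3]])

# One space-separated string of predicted class indices per survey.
rows = [" ".join(map(str, row)) for row in toy_top_k]
buffer = io.StringIO()
pd.DataFrame({"surveyId": toy_surveys, "predictions": rows}).to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the file back and recover the integer class indices per survey.
restored = pd.read_csv(io.StringIO(csv_text))
parsed = [list(map(int, p.split())) for p in restored["predictions"]]
```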