# Embed2Scale challenge "mean" baseline

This notebook creates baseline embeddings by bilinear interpolation and averaging of the modalities.

We use the E2SChallengeDataset to load the data. With all modalities concatenated, each challenge datacube has shape (1, 4, 27, 264, 264), i.e. (number of samples, number of timesteps, number of channels, height, width).

The embedding works as follows:
1. Subsample each channel to 8x8 pixels using bilinear interpolation -> shape (1, 4, 27, 8, 8)
2. Average B01 through B09 for both S2 L1C and S2 L2A along the channel dimension. Average B11 and B12 along the channel dimension. Average the S1 channels along the channel dimension. Concatenate the three averages and B10 along the channel dimension -> shape (1, 4, 4, 8, 8)
3. Flatten into a 1024-element vector -> shape (1024,)
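
As a quick sanity check on the shapes listed above, the following minimal sketch tracks the dimensions through the three steps with plain arithmetic. The helper name `embedding_shapes` is hypothetical and only serves this illustration.

```python
from math import prod

def embedding_shapes(n_timesteps=4, n_channels=27, target_size=8, n_groups=4):
    """Return the array shape after each step of the baseline embedding."""
    # Step 1: spatial subsampling keeps samples, timesteps and channels.
    after_resize = (1, n_timesteps, n_channels, target_size, target_size)
    # Step 2: channel averaging replaces the 27 channels by 4 groups
    # (B01-B09 mean, B10, B11-B12 mean, S1 mean).
    after_average = (1, n_timesteps, n_groups, target_size, target_size)
    # Step 3: flattening collapses all axes into a single vector.
    after_flatten = (prod(after_average),)
    return after_resize, after_average, after_flatten

for shape in embedding_shapes():
    print(shape)
```

The last shape is 4 timesteps x 4 channels x 8 x 8 pixels, which gives the 1024 values per embedding.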

After embedding, a submission file is created in the expected format for the embed2scale eval.ai challenge. If you use this code, verify that it produces the right number of decimals for your output.
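
One way to check the number of decimals is to inspect the written CSV text directly. The sketch below uses only the standard library; `max_decimals` is a hypothetical helper, and the sample text is made up for illustration and does not reflect a requirement of the challenge. When writing with pandas, `DataFrame.to_csv` accepts a `float_format` argument (for example `'%.6f'`) that controls the number of decimals.

```python
import csv
import io

def max_decimals(csv_text, id_column="id"):
    """Return the largest number of digits after the decimal point in a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    longest = 0
    for row in reader:
        for name, value in row.items():
            # Skip the ID column, empty cells and values without a decimal point.
            if name == id_column or not value or "." not in value:
                continue
            # Ignore an exponent part such as 'e-05' when counting digits.
            mantissa = value.lower().split("e")[0]
            longest = max(longest, len(mantissa.split(".")[1]))
    return longest

# Made-up two-row submission with three embedding dimensions.
sample = "id,0,1,2\nabc,-0.985362,1.5,0.25\ndef,0.1,-2.0,3\n"
print(max_decimals(sample))
```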

At the end, a function to test that a submission file is readable for evaluation is provided.

Note that parts of this notebook are simplified for demonstration purposes. However, the datasets and dataloaders, as well as the verification of the submission file, are intended to be directly usable and true to the data and the expected submission file format.

In [1]:
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from scipy.ndimage import zoom
from torchvision import transforms

from challenge_dataset import E2SChallengeDataset, S2L1C_MEAN, S2L1C_STD, S2L2A_MEAN, S2L2A_STD, S1GRD_MEAN, S1GRD_STD

# Configurations

In [2]:
# Order of modalities.
# In this demo, modalities are ordered the same as the default order in the SSL4EOS12 dataset class.
# Modalities are loaded in the order provided here.
# Change the order based on your needs.
modalities = ['s2l1c', 's2l2a', 's1']

# Path to challenge data folder, i.e. the folder containing the s1, s2l1c and s2l2a subfolders.
path_to_data = '/path/to/challenge/data/'

# Path to where the submission file should be written.
path_to_output_file = 'path/to/output/file.csv'

write_result_to_file = True  # Set to True to trigger saving of the csv at the end.

# Create data transformation
# Get the means and standard deviations for the modalities, in the same order as `modalities`
mean_data = S2L1C_MEAN + S2L2A_MEAN + S1GRD_MEAN
std_data = S2L1C_STD + S2L2A_STD + S1GRD_STD

data_transform = transforms.Compose([
    # Add additional transformation here
    transforms.Normalize(mean=mean_data, std=std_data)
])

# Note that both E2SChallengeDataset and SSL4EOS12Dataset output torch tensors, so there is no need for a ToTensor transform.
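
For reference, a per-channel normalization computes `(x - mean) / std` for every channel. The sketch below mimics this arithmetic in NumPy on a small random datacube. The mean and standard deviation values are made up for the example, whereas the notebook uses the constants imported from `challenge_dataset`, and `normalize_channels` is a hypothetical helper.

```python
import numpy as np

def normalize_channels(x, mean, std):
    """Normalize an array of shape (..., channels, height, width) per channel."""
    mean = np.asarray(mean, dtype=x.dtype)[:, None, None]
    std = np.asarray(std, dtype=x.dtype)[:, None, None]
    # Broadcasting aligns the (channels, 1, 1) statistics with the last three axes.
    return (x - mean) / std

rng = np.random.default_rng(0)
# Small stand-in datacube: (samples, timesteps, channels, height, width).
cube = rng.normal(loc=5.0, scale=2.0, size=(1, 4, 3, 16, 16)).astype(np.float32)
# Made-up statistics, one value per channel.
normalized = normalize_channels(cube, mean=[5.0, 5.0, 5.0], std=[2.0, 2.0, 2.0])
print(normalized.shape, float(normalized.mean()), float(normalized.std()))
```

After normalization the values should be roughly centered at zero with a standard deviation close to one.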

# Load data

In [None]:
# Load the modalities separately (concat=False).
# Each dataset item is {'data': {modality: tensor}, 'file_name': file_name}.
# Each tensor has shape [n_samples, n_seasons, n_channels, height, width].
# Concatenated along the channel dimension, the modalities would give [1, 4, 27, 264, 264].

dataset_e2s = E2SChallengeDataset(path_to_data,
                                  modalities=modalities,
                                  dataset_name='bands',
                                  transform=data_transform,
                                  concat=False,
                                  output_file_name=True)

# Print dataset length
print(f"Length of train dataset: {len(dataset_e2s)}")

# Print shape of first sample
for m, d in dataset_e2s[0]['data'].items():
    print(f'Modality {m} shape:', d.shape)

Length of train dataset: 5537
Modality s2l1c shape: torch.Size([1, 4, 13, 264, 264])
Modality s2l2a shape: torch.Size([1, 4, 12, 264, 264])
Modality s1 shape: torch.Size([1, 4, 2, 264, 264])
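
With `concat=False` the modalities arrive as separate tensors. The channel counts printed above (13 + 12 + 2) add up to the 27 channels mentioned in the introduction. The sketch below illustrates this with NumPy arrays of zeros and a reduced spatial size; it only illustrates the shapes and is not the dataset class's own concatenation logic.

```python
import numpy as np

# Stand-in arrays with the channel counts printed above, using 8x8 pixels to keep it small.
sample = {
    "s2l1c": np.zeros((1, 4, 13, 8, 8), dtype=np.float32),
    "s2l2a": np.zeros((1, 4, 12, 8, 8), dtype=np.float32),
    "s1": np.zeros((1, 4, 2, 8, 8), dtype=np.float32),
}
modalities = ["s2l1c", "s2l2a", "s1"]

# Concatenate along axis 2, the channel dimension, in the configured order.
cube = np.concatenate([sample[m] for m in modalities], axis=2)
print(cube.shape)
```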


# Create submission file

In this section, we define a helper that turns a dictionary of embeddings into a submission table, with one row per sample ID and one column per embedding dimension.
The submission file itself is written further down in the notebook.

We use the E2SChallengeDataset since we can easily get the sample ID (file name) from it.

In [4]:
def create_submission_from_dict(emb_dict):
    """Build a submission DataFrame from a dictionary of embeddings.

    The dictionary is assumed to have the format {hash-id0: embedding0, hash-id1: embedding1, ...}.
    """
    df_submission = pd.DataFrame.from_dict(emb_dict, orient='index')
    
    # Reset index with name 'id'
    df_submission.index.name = 'id'
    df_submission.reset_index(drop=False, inplace=True)
        
    return df_submission
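
As a usage sketch, the helper can be tried on a toy dictionary. The IDs and the 3-element embeddings below are made up; real submissions use the sample file names and 1024 values per embedding. The function is repeated so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def create_submission_from_dict(emb_dict):
    """Same logic as the helper above, repeated to keep the example self-contained."""
    df_submission = pd.DataFrame.from_dict(emb_dict, orient="index")
    df_submission.index.name = "id"
    df_submission.reset_index(drop=False, inplace=True)
    return df_submission

# Made-up IDs and short embeddings.
toy_embeddings = {
    "sample_a": np.array([0.1, 0.2, 0.3]),
    "sample_b": np.array([0.4, 0.5, 0.6]),
}
toy_submission = create_submission_from_dict(toy_embeddings)
print(toy_submission.columns.tolist())
print(toy_submission.shape)
```

The resulting table has an `id` column followed by one integer-named column per embedding dimension.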
        

# Compress by bilinear transform and channel averaging

In this section, we create the embeddings for the submission file by processing each sample as follows:
1. Subsample each channel to 8x8 pixels using bilinear interpolation.
2. Average channels B01 to B09 for both L1C and L2A, average B11 and B12, and average the S1 channels. Together with B10, this gives 4 new channels.
3. Flatten into a 1024-element vector.

We use the dataloader based on the E2SChallengeDataset since we can easily get the sample ID (file name) from the dataloader.
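
To illustrate step 1, the following sketch implements bilinear interpolation for a single 2D channel in plain NumPy. It is a hand-written, corner-aligned variant meant to show the idea; the notebook itself relies on `scipy.ndimage.zoom` with `order=1`, whose exact sampling grid may differ from this sketch. The function name `bilinear_resize` is hypothetical.

```python
import numpy as np

def bilinear_resize(image, out_size=8):
    """Resize a 2D array to (out_size, out_size) with corner-aligned bilinear interpolation."""
    height, width = image.shape
    # Sample positions in input pixel coordinates; the corner pixels are preserved.
    rows = np.linspace(0, height - 1, out_size)
    cols = np.linspace(0, width - 1, out_size)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, height - 1)
    c1 = np.minimum(c0 + 1, width - 1)
    # Fractional distances to the upper-left neighbour.
    fr = (rows - r0)[:, None]
    fc = (cols - c0)[None, :]
    # Interpolate along the columns first, then along the rows.
    top = image[np.ix_(r0, c0)] * (1 - fc) + image[np.ix_(r0, c1)] * fc
    bottom = image[np.ix_(r1, c0)] * (1 - fc) + image[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bottom * fr

# A horizontal ramp from 0 to 263 across a 264x264 channel.
channel = np.tile(np.arange(264, dtype=np.float64), (264, 1))
small = bilinear_resize(channel)
print(small.shape)
print(small[0])
```

Because bilinear interpolation reproduces linear functions exactly, each row of the resized ramp is again an evenly spaced ramp from 0 to 263.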

In [5]:
# Correlation analysis shows that L1C and L2A channels B01 to B09 are correlated, B11 and B12 are correlated,
# and S1 VV and VH are correlated, so we average these, leaving (together with B10) 4 averaged channels.

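def embed(data, file_name, emb_len=1024):
    # Note: emb_len is not used below; the output length follows from the shapes
    # (4 timesteps * 4 averaged channels * 8 * 8 pixels = 1024).
    # Bilinear interpolation of each channel separately.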
    rescaled_mod = {m: zoom(d, (1, 1, 1, 8/d.shape[3], 8/d.shape[4]), order=1) for m, d in data.items()}

    # Calculate mean of correlated channels.
    b1_b9 = np.mean(np.concatenate((rescaled_mod['s2l1c'][:, :, 0:9, :, :], 
                                   rescaled_mod['s2l2a'][:, :, 0:9, :, :]), axis=2), 
                    axis=2, keepdims=True)
    b10 = rescaled_mod['s2l1c'][:, :, 9:10, :, :]
    b11_b12 = np.mean(np.concatenate((rescaled_mod['s2l1c'][:, :, 10:, :, :], 
                                     rescaled_mod['s2l2a'][:, :, 10:, :, :]), axis=2), 
                      axis=2, keepdims=True)
    s1 = np.mean(rescaled_mod['s1'], axis=2, keepdims=True)

    # Concatenate aggregated channels
    emb = np.concatenate((b1_b9, b10, b11_b12, s1), axis=2)

    # Flatten
    emb = emb.flatten()

    return {'file_name': file_name, 'embedding': emb}


def mean_embedding_parallel(dataset, n_workers=4, n_samples=None):
    
    # Initialize result embeddings
    embeddings = {}

    # Run embedding in parallel
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        futures = []
        
        for ind, data_file_name in enumerate(dataset):
            data = data_file_name['data']
            # print(data)
            file_name = data_file_name['file_name']
            # Submit the batch for processing
            future = executor.submit(embed, data, file_name)
            futures.append(future)

            # Stop once n_samples samples have been submitted.
            if (n_samples is not None) and (ind + 1 >= n_samples):
                break
        
        # Extract results
        for future in futures:
            res = future.result()
            # Compile embeddings
            embeddings[res['file_name']] = res['embedding']
    return embeddings
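
The pattern above, submitting one task per sample and collecting the results into a dictionary, can be tried in isolation with the standard library. In this sketch `toy_embed` and the list of toy samples are stand-ins for `embed` and the dataset, and `itertools.islice` is used as one possible way to limit the number of processed samples.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def toy_embed(data, file_name):
    """Stand-in for embed(): returns the mean of the values as a one-element embedding."""
    return {"file_name": file_name, "embedding": [sum(data) / len(data)]}

def embed_in_parallel(dataset, n_workers=4, n_samples=None):
    """Embed up to n_samples items of an iterable dataset using a thread pool."""
    embeddings = {}
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # islice(dataset, None) yields every item, so n_samples=None means no limit.
        futures = [
            executor.submit(toy_embed, item["data"], item["file_name"])
            for item in islice(dataset, n_samples)
        ]
        # Collect the results in submission order.
        for future in futures:
            result = future.result()
            embeddings[result["file_name"]] = result["embedding"]
    return embeddings

# Toy dataset of five samples with made-up IDs.
toy_dataset = [{"data": [i, i + 2], "file_name": f"id_{i}"} for i in range(5)]
result = embed_in_parallel(toy_dataset, n_workers=2, n_samples=3)
print(result)
```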


In [6]:
n_workers = 1
if n_workers != 1:
    # Embed data
    embeddings = mean_embedding_parallel(dataset_e2s, n_workers=n_workers, n_samples=10)
else:
    embeddings = {}
    for ind, data_file_name in enumerate(dataset_e2s):
        data = data_file_name['data']
        file_name = data_file_name['file_name']
        emb = embed(data, file_name, 1024)
        embeddings[file_name] = emb['embedding']
        

In [7]:
# Create submission file
submission_file = create_submission_from_dict(embeddings)

In [8]:
print('Number of embeddings:', len(submission_file))

Number of embeddings: 5537


In [9]:
submission_file.head()

| | id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fec24d0cda8793ff55e1059c7b88763fee8d58d3decf78... | -0.985362 | -1.220834 | -1.25356 | -1.357725 | -1.309369 | -1.271099 | -1.30429 | -1.152729 | -0.949668 | ... | 0.395265 | 0.711939 | 0.601892 | 0.383944 | 0.874982 | 0.449806 | 0.952038 | -0.268883 | 0.884533 | 0.243575 |
| 1 | 67960f4c8870a8aa52f295da0f0fea6d708c3cee2555a4... | -0.589695 | -0.765339 | -0.805102 | -0.516161 | -0.775137 | -1.128073 | -1.012956 | -0.751013 | -0.794546 | ... | 0.113527 | -0.575751 | -0.560006 | -0.238343 | -0.913553 | -0.952944 | -0.011693 | -0.66444 | 0.862798 | -0.504407 |
| 2 | 9688abfaebaea5dca2ec8bde771a7bf1e2bba8e661b777... | -0.780118 | -0.73624 | -0.763768 | -0.729622 | -0.836079 | -0.952889 | -0.829808 | -0.714072 | -0.977379 | ... | 0.388513 | -0.447894 | -1.262257 | -1.520254 | -0.984263 | -1.121416 | -0.635569 | -1.050879 | -1.350882 | -0.926634 |
| 3 | fa3ae237ee6e2ee569c20a1e088112cf2105300d9272cc... | -2.013993 | -2.028528 | -2.034303 | -2.032173 | -2.028834 | -2.032128 | -2.032431 | -2.032903 | -1.997453 | ... | -1.174436 | -1.286493 | -1.486834 | -0.839548 | 0.361805 | 0.279468 | -0.059674 | -0.799558 | -0.876158 | -1.462009 |
| 4 | 430590d31e38c5b345a92dc7d9eb8d126c01abced0cf1a... | -1.015036 | -1.160182 | -1.149326 | -1.192975 | -1.233959 | -1.093594 | -1.148571 | -1.135589 | -1.070416 | ... | 1.354896 | 0.118833 | 0.74598 | 1.308391 | 0.539959 | 0.52965 | 0.233003 | 0.646347 | 0.746715 | 0.449681 |


In [10]:
# Write submission
if write_result_to_file:
    submission_file.to_csv(path_to_output_file, index=False)

# Verify submission file integrity

Below we provide a snippet from a function that reads your embeddings and tests for the same errors that the evaluation checks for. The function is similar to how the submission files are loaded.

The intention of this function is to help you verify, prior to submission, that a submission has the right structure and contents, and to check for missing embeddings or NaN values.

The function is intended as a support. Successfully completing this function does not guarantee a fault-free submission file, but it indicates that the most common errors are not present.
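
The two content checks mentioned above, missing embeddings and NaN values, boil down to a set difference and an `isnan` test. The sketch below shows them on made-up IDs and values using only the standard library; `find_problems` is a hypothetical helper and is not part of the evaluation code.

```python
import math

def find_problems(expected_ids, submitted):
    """Return (missing_ids, ids_with_nan) for a {id: embedding} mapping."""
    # IDs that are expected but absent from the submission.
    missing = set(expected_ids) - set(submitted)
    # IDs whose embedding contains at least one NaN.
    with_nan = {key for key, values in submitted.items()
                if any(math.isnan(v) for v in values)}
    return missing, with_nan

# Made-up expected IDs and a submission with one missing ID and one NaN.
expected = {"id_a", "id_b", "id_c"}
submitted = {
    "id_a": [0.1, 0.2],
    "id_b": [float("nan"), 0.3],
}
missing, with_nan = find_problems(expected, submitted)
print("Missing:", missing)
print("Contains NaN:", with_nan)
```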

In [11]:
def test_submission(path_to_submission: str, 
                    expected_embedding_ids: set, 
                    embedding_dim: int = 1024):
    # Load data
    df = pd.read_csv(path_to_submission, header=0)

    # Verify that id is in columns
    if 'id' not in df.columns:
        raise ValueError(f"""Submission file must contain column 'id'.""")

    # Temporarily set index to 'id'
    df.set_index('id', inplace=True)

    # Check that all samples are included
    submitted_embeddings = set(df.index.to_list())
    n_missing_embeddings = len(expected_embedding_ids.difference(submitted_embeddings))
    if n_missing_embeddings > 0:
        raise ValueError(f"""Submission is missing {n_missing_embeddings} embeddings.""")
    
    # Check that embeddings have the correct length
    if len(df.columns) != embedding_dim:
        raise ValueError(f"""Expected {embedding_dim} embedding dimensions, but provided embeddings have {len(df.columns)} dimensions.""")

    # Convert columns to float
    try:
        for col in df.columns:
            df[col] = df[col].astype(float)
    except Exception as e:
        raise ValueError(f"""Failed to convert embedding values to float.
    Check embeddings for any not-allowed character, for example empty strings, letters, etc.
    Original error message: {e}""")

    # Check if any NaNs 
    if df.isna().any().any():
        raise ValueError(f"""Embeddings contain NaN values.""")

    # Successful completion of the function
    return True

In [12]:
# We use the created embeddings as the list of all samples.
# This can be done since we are sure to have fully looped through the dataset.
# A better way would be to find all the IDs in the challenge data separately, e.g. from the dataloader.
embedding_ids = set(embeddings.keys())
embedding_dim = 1024

# Test submission
assert test_submission(path_to_output_file, embedding_ids, embedding_dim)
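
As noted in the comments above, the expected IDs can also be collected independently of the embeddings. A minimal sketch, assuming the dataset yields dictionaries with a `'file_name'` key as in this notebook; the list `toy_dataset` below is a stand-in for `dataset_e2s`, and `collect_ids` is a hypothetical helper.

```python
def collect_ids(dataset):
    """Collect the sample IDs (file names) from an iterable of dataset items."""
    return {item["file_name"] for item in dataset}

# Stand-in for dataset_e2s with made-up IDs; the 'data' entries are placeholders.
toy_dataset = [
    {"data": None, "file_name": "id_a"},
    {"data": None, "file_name": "id_b"},
    {"data": None, "file_name": "id_c"},
]
expected_ids = collect_ids(toy_dataset)

# Compare against the keys of an embeddings dictionary that misses one sample.
toy_embeddings = {"id_a": [0.0], "id_b": [1.0]}
print(sorted(expected_ids - set(toy_embeddings)))
```

Iterating over the real dataset this way would also load each sample's data, so it is likely slower than reusing the keys of `embeddings`.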