# Generate Major TOM Embeddings

This notebook demonstrates how to turn Major TOM data parquets from the Core datasets into embedding parquets.

The process involves reading, fragmenting, and normalizing the data, then passing it through one of the supported embedding models.

## Data Acquisition
You can quickly get a local copy of some example parquets from the Major TOM Core datasets directly in Python like this:
```python
from huggingface_hub import hf_hub_download

# Sentinel-2 L2A
hf_hub_download(repo_id="Major-TOM/Core-S2L2A",
                subfolder="images",
                filename="part_00001.parquet",
                repo_type="dataset",
                local_dir="data/Major-TOM/Core-S2L2A"
               )
               
# Sentinel-2 L1C
hf_hub_download(repo_id="Major-TOM/Core-S2L1C",
                subfolder="images",
                filename="part_00001.parquet",
                repo_type="dataset",
                local_dir="data/Major-TOM/Core-S2L1C"
               )

# Sentinel-1 RTC
hf_hub_download(repo_id="Major-TOM/Core-S1RTC",
                subfolder="images",
                filename="part_00001.parquet",
                repo_type="dataset",
                local_dir="data/Major-TOM/Core-S1RTC"
               )
```
You can also pull the data directly from HuggingFace during processing, although that can be slower.

## Define a Model & Embedder
The main tool for turning a row of a Major TOM data parquet into a Major TOM embedding parquet is `MajorTOM_Embedder`, which we need to initialize first.

At initialization, we pass it the embedding model, which takes care of normalizing the raw values and applying the model. The currently supported models are:
* `SigLIP_S2RGB_Embedder()`
* `DINOv2_S2RGB_Embedder()`
* `SSL4EO_S2L1C_Embedder()`
* `SSL4EO_S1RTC_Embedder()`

> This list may have been expanded since; check https://github.com/ESA-PhiLab/Major-TOM/tree/main/src/embedder/models for the latest options.

In [None]:
from MajorTOM import *

## Model Selection

In [3]:
# Set Up Metadata
DATASET_NAME = 'Major-TOM/Core-S2L2A'
model = DINOv2_S2RGB_Embedder()
# model = SigLIP_S2RGB_Embedder()  # also works for S2

# Sentinel-2 L1C
# DATASET_NAME = 'Major-TOM/Core-S2L1C'
# model = SSL4EO_S2L1C_Embedder()

# Sentinel-1 RTC
# DATASET_NAME = 'Major-TOM/Core-S1RTC'
# model = SSL4EO_S1RTC_Embedder()
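
The dataset and model pairings from the cell above can be captured in a small lookup, which helps avoid pairing a dataset with an embedder meant for a different product. This is only a minimal sketch: `COMPATIBLE_EMBEDDERS` and `embedders_for` are hypothetical names that are not part of the MajorTOM package, and the pairings reflect only the options listed in this notebook.

```python
# Hypothetical helper (not part of MajorTOM).
# The pairings below are taken from this notebook only and may be incomplete.
COMPATIBLE_EMBEDDERS = {
    "Major-TOM/Core-S2L2A": ["DINOv2_S2RGB_Embedder", "SigLIP_S2RGB_Embedder"],
    "Major-TOM/Core-S2L1C": ["SSL4EO_S2L1C_Embedder"],
    "Major-TOM/Core-S1RTC": ["SSL4EO_S1RTC_Embedder"],
}


def embedders_for(dataset_name):
    """Return the embedder class names this notebook pairs with a dataset."""
    try:
        return COMPATIBLE_EMBEDDERS[dataset_name]
    except KeyError:
        raise ValueError(f"No embedder listed for dataset {dataset_name!r}") from None


print(embedders_for("Major-TOM/Core-S2L2A"))
```

The lookup stores class names as strings so that it runs without importing the models; in the notebook itself you would instantiate the matching class directly, as in the cell above.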

In [4]:
embedder = MajorTOM_Embedder(model)

embedder.to('cuda')

MajorTOM_Embedder(
  (embedder): DINOv2_S2RGB_Embedder(
    (model): Dinov2Model(
      (embeddings): Dinov2Embeddings(
        (patch_embeddings): Dinov2PatchEmbeddings(
          (projection): Conv2d(3, 768, kernel_size=(14, 14), stride=(14, 14))
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (encoder): Dinov2Encoder(
        (layer): ModuleList(
          (0-11): 12 x Dinov2Layer(
            (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (attention): Dinov2Attention(
              (attention): Dinov2SelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.0, inplace=False)
              )
              (output): Dinov2SelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=Tru

## Assemble Data
This notebook works directly with Major TOM parquets, so we only need the parquet file **and** the metadata file for some additional context.

In [30]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file
import rasterio as rio

PARQUET_FILE = 'part_00001.parquet'

meta_path = 'https://huggingface.co/datasets/{}/resolve/main/metadata.parquet'.format(DATASET_NAME)
meta_df = pd.read_parquet(meta_path)

# Set Up Data File

# from HF directly
parquet_file_path = 'https://huggingface.co/datasets/{}/resolve/main/images/{}'.format(DATASET_NAME, PARQUET_FILE)

# local
#parquet_file_path = f'./data/{DATASET_NAME}/images/part_00001.parquet'

bands=embedder.bands()

f = open_parquet_file(parquet_file_path, columns=[*bands,
                                                  'product_id',
                                                  'grid_cell',
                                                  'timestamp',
                                                 ])
pf = pq.ParquetFile(f)
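
The choice between the remote and the local copy in the cell above comes down to how the two paths are built. The sketch below wraps that in one function; `major_tom_paths` and its `local_root` parameter are hypothetical names introduced for illustration, and the local layout assumes the files were downloaded under `./data/` as in the Data Acquisition step.

```python
# Hypothetical helper: builds the metadata and image-parquet paths used above.
HF_RESOLVE = "https://huggingface.co/datasets/{}/resolve/main/{}"


def major_tom_paths(dataset_name, parquet_file, local_root=None):
    """Return (metadata path, image parquet path) for a Major TOM dataset.

    With local_root=None both paths point at HuggingFace; otherwise the image
    parquet is read from <local_root>/<dataset_name>/images/.
    """
    meta_path = HF_RESOLVE.format(dataset_name, "metadata.parquet")
    if local_root is None:
        parquet_path = HF_RESOLVE.format(dataset_name, "images/" + parquet_file)
    else:
        parquet_path = f"{local_root}/{dataset_name}/images/{parquet_file}"
    return meta_path, parquet_path


# Remote copy (slower, nothing to download first)
print(major_tom_paths("Major-TOM/Core-S2L2A", "part_00001.parquet"))
# Local copy, assuming the download step was run
print(major_tom_paths("Major-TOM/Core-S2L2A", "part_00001.parquet", local_root="./data"))
```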



## Embed the rows
The `MajorTOM_Embedder()` object can simply be called with `row, row_meta` fed in as arguments and it will return a `DataFrame` containing all standardized variables.

In the process, it takes care of normalizing the data, and fragmenting it into smaller crops. The geometry of each crop is automatically generated.
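
To make the fragmentation step concrete, the sketch below reproduces a crop layout consistent with the `pixel_bbox` values in the output table further down (for example `[0, 0, 224, 224]` and `[0, 211, 224, 435]`). It is an illustration inferred from that output, not the embedder's actual implementation: the function name `crop_boxes`, the image size of 1068 pixels, the crop size of 224, and the 5 x 5 grid are all assumptions read off the table.

```python
def crop_boxes(image_size=1068, crop_size=224, crops_per_side=5):
    """Return evenly spaced, slightly overlapping crop boxes.

    Each box is [x_min, y_min, x_max, y_max] in pixels. Offsets are spread
    evenly so the first crop starts at 0 and the last ends at image_size.
    """
    span = image_size - crop_size
    offsets = [round(i * span / (crops_per_side - 1)) for i in range(crops_per_side)]
    return [
        [x, y, x + crop_size, y + crop_size]
        for x in offsets
        for y in offsets
    ]


boxes = crop_boxes()
print(len(boxes))   # number of crops per image
print(boxes[:3])    # first few boxes
```

With these assumed values the offsets are 211 pixels apart, so neighbouring 224-pixel crops overlap by 13 pixels.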

In [31]:
from tqdm.notebook import tqdm

# collect one DataFrame per row group and concatenate once at the end
embed_dfs = []

for row_idx in tqdm(range(pf.num_row_groups)):

    # data & context
    row = pf.read_row_group(row_idx, columns=[*bands, 'product_id', 'grid_cell', 'timestamp'])
    grid_cell = row['grid_cell'][0].as_py()
    product_id = row['product_id'][0].as_py()
    row_meta = meta_df[(meta_df.grid_cell == grid_cell) & (meta_df.product_id == product_id)].head(1)

    # embedding functions
    embed_dfs.append(embedder(row, row_meta))

# store_df
embed_df = pd.concat(embed_dfs).reset_index(drop=True)

  0%|          | 0/500 [00:00<?, ?it/s]

## Save the embedding parquet
Major TOM embeddings are delivered in the GeoParquet format, so the output is saved as follows:

In [32]:
embed_df

Unnamed: 0,unique_id,embedding,timestamp,product_id,grid_cell,grid_row_u,grid_col_r,geometry,centre_lat,centre_lon,utm_footprint,utm_crs,pixel_bbox
0,8018b920fd8262ca6426d8d0c7bdb09905d19daf59d6f2...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41163 -82.72221, -178.41551 -82...",-82.731987,-178.492844,POLYGON ((480037.35680356843 814306.7102547885...,EPSG:32701,"[0, 0, 224, 224]"
1,7f74df14d868694602ff4980a2176470dcb25c570de459...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41528 -82.74111, -178.41918 -82...",-82.750885,-178.496704,POLYGON ((480037.35680356843 812196.7102547885...,EPSG:32701,"[0, 211, 224, 435]"
2,304046a18855a98a0d5aa22ccded49286ac580145106a2...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41896 -82.76001, -178.42288 -82...",-82.769783,-178.500595,POLYGON ((480037.35680356843 810086.7102547885...,EPSG:32701,"[0, 422, 224, 646]"
3,f80b07f5c7ddefe1fa8aa536a911e1a9a587b453dc5084...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.42265 -82.77890, -178.42659 -82...",-82.788673,-178.504517,POLYGON ((480037.35680356843 807976.7102547885...,EPSG:32701,"[0, 633, 224, 857]"
4,55aad9b5df58ea9fd14cdbd9e4b2ffbe3a8954ac5d9f52...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.42636 -82.79780, -178.43033 -82...",-82.807571,-178.508438,POLYGON ((480037.35680356843 805866.7102547885...,EPSG:32701,"[0, 844, 224, 1068]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,9ea04479469d5955f64a2e767528a286aef8fae3f093ba...,"[1.8706962, -0.36548963, -0.16844065, -1.15036...",20231023T083029,S2B_MSIL2A_20231023T083029_N0509_R135_T22CEP_2...,918D_75L,-918,-75,"POLYGON ((-50.53373 -82.36227, -50.53251 -82.3...",-82.372375,-50.608696,POLYGON ((506918.63283075346 854703.0551686661...,EPSG:32722,"[844, 0, 1068, 224]"
12496,9c09db25fe30b6fc8d7f24dacd71c6a6dc3048a7d1865f...,"[1.6086377, 0.3244056, 0.69208705, -1.4272233,...",20231023T083029,S2B_MSIL2A_20231023T083029_N0509_R135_T22CEP_2...,918D_75L,-918,-75,"POLYGON ((-50.53258 -82.38118, -50.53135 -82.4...",-82.391281,-50.607727,POLYGON ((506918.63283075346 852593.0551686661...,EPSG:32722,"[844, 211, 1068, 435]"
12497,054f382885a34a9086701caa5d63596fed2b539afc4109...,"[1.1596069, -0.452088, -0.72552645, -0.7785724...",20231023T083029,S2B_MSIL2A_20231023T083029_N0509_R135_T22CEP_2...,918D_75L,-918,-75,"POLYGON ((-50.53142 -82.40008, -50.53019 -82.4...",-82.410179,-50.606758,POLYGON ((506918.63283075346 850483.0551686661...,EPSG:32722,"[844, 422, 1068, 646]"
12498,75bcaa8a658a9eb5cf7d2874c0da45e1c77a2555e0ce7b...,"[1.7653571, -0.40404075, -0.2192422, -1.091481...",20231023T083029,S2B_MSIL2A_20231023T083029_N0509_R135_T22CEP_2...,918D_75L,-918,-75,"POLYGON ((-50.53026 -82.41898, -50.52902 -82.4...",-82.429085,-50.605782,POLYGON ((506918.63283075346 848373.0551686661...,EPSG:32722,"[844, 633, 1068, 857]"


In [33]:
embed_df.to_parquet(f'embedded-{PARQUET_FILE}')

## Read the saved file

In [34]:
# explicit import in case `from MajorTOM import *` does not expose `gpd`
import geopandas as gpd

sanity_check = gpd.read_parquet(f'embedded-{PARQUET_FILE}')

In [35]:
sanity_check.head()

Unnamed: 0,unique_id,embedding,timestamp,product_id,grid_cell,grid_row_u,grid_col_r,geometry,centre_lat,centre_lon,utm_footprint,utm_crs,pixel_bbox
0,8018b920fd8262ca6426d8d0c7bdb09905d19daf59d6f2...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41163 -82.72221, -178.41551 -82...",-82.731987,-178.492844,POLYGON ((480037.35680356843 814306.7102547885...,EPSG:32701,"[0, 0, 224, 224]"
1,7f74df14d868694602ff4980a2176470dcb25c570de459...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41528 -82.74111, -178.41918 -82...",-82.750885,-178.496704,POLYGON ((480037.35680356843 812196.7102547885...,EPSG:32701,"[0, 211, 224, 435]"
2,304046a18855a98a0d5aa22ccded49286ac580145106a2...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.41896 -82.76001, -178.42288 -82...",-82.769783,-178.500595,POLYGON ((480037.35680356843 810086.7102547885...,EPSG:32701,"[0, 422, 224, 646]"
3,f80b07f5c7ddefe1fa8aa536a911e1a9a587b453dc5084...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.42265 -82.77890, -178.42659 -82...",-82.788673,-178.504517,POLYGON ((480037.35680356843 807976.7102547885...,EPSG:32701,"[0, 633, 224, 857]"
4,55aad9b5df58ea9fd14cdbd9e4b2ffbe3a8954ac5d9f52...,"[1.766272, -0.4046016, -0.21980742, -1.0928067...",20230119T161811,S2A_MSIL2A_20230119T161811_N0509_R111_T01CDJ_2...,922D_249L,-922,-249,"POLYGON ((-178.42636 -82.79780, -178.43033 -82...",-82.807571,-178.508438,POLYGON ((480037.35680356843 805866.7102547885...,EPSG:32701,"[0, 844, 224, 1068]"
