- Chips distribution (source, train, test)
    - [How many chips are there for the test, train and overall dataset?](#how-many-chips-are-there-for-the-test-train-and-overall-dataset)
    - [Are there some unused chips in different splits as well as the overall dataset?](#are-there-some-unused-chips-in-different-splits-as-well-as-the-overall-dataset)
- Fields distribution
    - [What is the total number of fields in different splits?](#what-is-the-total-number-of-fields-in-different-splits)
    - [What are the chips with overlapping field ids?](#what-are-the-chips-with-overlapping-field-ids)
- Crops distribution
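    - [What is the crop distribution per chip?](#what-is-the-crop-distribution-per-chip)
    - [What is the crop distribution per field id?](#what-is-the-crop-distribution-per-field-id)
    - [What is the crop distribution per pixels?](#what-is-the-crop-distribution-per-pixels)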

In [82]:
import json
import rasterio
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import glob
from tqdm import tqdm
from collections import defaultdict


In [31]:
bands = ['B01', 'B02', 'B03', 'B04','B05', 'B06', 'B07', 'B08','B8A', 'B09', 'B11', 'B12']

data_dir = '../data/train_test'

main = 'ref_agrifieldnet_competition_v1'

source_collection = f'{main}_source'
train_label_collection = f'{main}_labels_train'
test_label_collection = f'{main}_labels_test'

In [35]:
# Each collection.json lists its chips under 'links'. The first four links are skipped
# (assumed here to be non-chip entries); the chip id is taken as the last '_'-separated
# token of each href, up to the first '.'.
with open(f'{data_dir}/{main}/{source_collection}/collection.json') as f:
    source_json = json.load(f)
    source_folder_ids = [i['href'].split('_')[-1].split('.')[0] for i in source_json['links'][4:]]
    # One list of band paths (B01 ... B12) per source chip
    source_band_paths = [[f'{data_dir}/{main}/{source_collection}/{source_collection}_{i}/{band}.tif' for band in bands] for i in source_folder_ids]

with open(f'{data_dir}/{main}/{train_label_collection}/collection.json') as f:
    train_json = json.load(f)
    train_folder_ids = [i['href'].split('_')[-1].split('.')[0] for i in train_json['links'][4:]]
    train_field_paths = [f'{data_dir}/{main}/{train_label_collection}/{train_label_collection}_{i}/field_ids.tif' for i in train_folder_ids]
    train_label_paths = [f'{data_dir}/{main}/{train_label_collection}/{train_label_collection}_{i}/raster_labels.tif' for i in train_folder_ids]

with open(f'{data_dir}/{main}/{test_label_collection}/collection.json') as f:
    test_json = json.load(f)
    test_folder_ids = [i['href'].split('_')[-1].split('.')[0] for i in test_json['links'][4:]]
    test_field_paths = [f'{data_dir}/{main}/{test_label_collection}/{test_label_collection}_{i}/field_ids.tif' for i in test_folder_ids]
    test_label_paths = [f'{data_dir}/{main}/{test_label_collection}/{test_label_collection}_{i}/raster_labels.tif' for i in test_folder_ids]

## Chips distribution

### How many chips are there for the test, train and overall dataset?

In [33]:
assert len(train_folder_ids) == len(train_field_paths) == len(train_label_paths)
print("number of train chips: ", len(train_folder_ids))

number of train chips:  1165


In [34]:
assert len(test_folder_ids) == len(test_field_paths) == len(test_label_paths)
print("number of test chips: ", len(test_folder_ids))

number of test chips:  707


In [36]:
assert len(source_folder_ids) == len(source_band_paths)
print("number of source chips: ", len(source_folder_ids))

number of source chips:  1217


### Are there some unused chips in different splits as well as the overall dataset?

In [59]:
print(f"{len(set(test_folder_ids).union(set(train_folder_ids)))} total chips used in train/test splits")

1212 total chips used in train/test splits


In [43]:
print(f"{len(source_folder_ids) - len(set(source_folder_ids).intersection(set(test_folder_ids).union(set(train_folder_ids))))} unused chips from the source collection")

5 unused chips from the source collection


In [47]:
print(f"{len(source_folder_ids) - len(set(source_folder_ids).intersection(set(train_folder_ids)))} unused chips from the source collection (by the train split).")

52 unused chips from the source collection (by the train split).


In [52]:
print(f"{len(train_folder_ids) - len(set(train_folder_ids).intersection(set(test_folder_ids)))} unused chips from the train collection (by the test split).")

505 unused chips from the train collection (by the test split).


In [49]:
print(f"{len(source_folder_ids) - len(set(source_folder_ids).intersection(set(test_folder_ids)))} unused chips from the source collection (by the test split).")


510 unused chips from the source collection (by the test split).


The 510 source chips missing from the test split break down as the 505 train chips that are absent from the test split plus the 5 source chips that appear in neither split.

In [61]:
print(f"{len(test_folder_ids) - len(set(test_folder_ids).intersection(train_folder_ids))} chips in the test data but absent in the train data.")

47 chips in the test data but absent in the train data.
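
The counts in this section all follow from a handful of set operations on the chip ids. As a minimal sketch, they could be gathered in one helper; the name `summarize_chip_usage` and the toy ids below are illustrative assumptions, not part of the notebook or the dataset tooling.

```python
def summarize_chip_usage(source_ids, train_ids, test_ids):
    """Count chips for each relation between the source collection and the two splits."""
    source, train, test = set(source_ids), set(train_ids), set(test_ids)
    used = train | test  # chips referenced by at least one split
    return {
        "used_in_splits": len(used),
        "source_unused": len(source - used),
        "source_not_in_train": len(source - train),
        "source_not_in_test": len(source - test),
        "train_only": len(train - test),
        "test_only": len(test - train),
        "train_and_test": len(train & test),
    }


# Toy ids (hypothetical): chip "e" belongs to neither split
summary = summarize_chip_usage(
    source_ids=["a", "b", "c", "d", "e"],
    train_ids=["a", "b", "c"],
    test_ids=["c", "d"],
)
print(summary)
```

With the lists loaded earlier, this would be called as `summarize_chip_usage(source_folder_ids, train_folder_ids, test_folder_ids)`. When both splits are subsets of the source collection, `source_not_in_test` equals `train_only + source_unused`, which is the 505 + 5 = 510 relation noted above.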


## Fields distribution

In [89]:

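def get_field_ids_per_chips(collection, folder_ids):
    """Map each chip id to the list of unique field ids found in its field_ids.tif."""

    chips = {}

    pbar = tqdm(folder_ids)
    pbar.set_description(f'Extracting {collection}')

    for idx in pbar:
        with rasterio.open(f'{data_dir}/{main}/{collection}/{collection}_{idx}/field_ids.tif') as src:
            field_data = np.unique(src.read())
            # 0 is treated as background (no field). Filtering, unlike list.remove(0),
            # does not raise when a chip happens to contain no background pixels.
            chips[idx] = list(field_data[field_data != 0])

    return chips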

In [95]:
train_field_ids_per_chip = get_field_ids_per_chips(train_label_collection, train_folder_ids)
test_field_ids_per_chip = get_field_ids_per_chips(test_label_collection, test_folder_ids)

Extracting ref_agrifieldnet_competition_v1_labels_train: 100%|██████████| 1165/1165 [00:02<00:00, 507.52it/s]
Extracting ref_agrifieldnet_competition_v1_labels_test: 100%|██████████| 707/707 [00:01<00:00, 506.13it/s]


In [96]:
train_field_ids = [field for fields in train_field_ids_per_chip.values() for field in fields]
test_field_ids = [field for fields in test_field_ids_per_chip.values() for field in fields]

### What is the total number of fields in different splits?

- The same source chip may contain both training and testing fields, so a clean way to separate them would be useful.
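
One possible way to separate the two, sketched on a toy array rather than a real raster: mask a chip's field-id raster with `np.isin` so that only the pixels of one split's fields remain. The helper name `split_mask` and the field ids are hypothetical.

```python
import numpy as np


def split_mask(field_data, split_field_ids):
    """Boolean mask of the pixels whose field id belongs to the given split."""
    return np.isin(field_data, list(split_field_ids))


# Toy field-id raster: 0 is background, field 7 is "train", field 9 is "test"
field_data = np.array([[0, 7, 7],
                       [9, 9, 0],
                       [0, 0, 7]])

train_mask = split_mask(field_data, {7})
test_mask = split_mask(field_data, {9})
print(train_mask.sum(), test_mask.sum())
```

Because background pixels (0) belong to neither id set, they are excluded from both masks. The same mask could then be applied to the source bands of the chip, assuming they share the raster's shape.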

In [92]:
print(f"{len(np.unique(train_field_ids))} fields in the train split")

5551 fields in the train split


In [98]:
print(f"{len(np.unique(test_field_ids))} fields in the test split")

1530 fields in the test split


In [99]:
# Summing the two counts assumes that no field id is shared between the train and test splits.
print(f"{len(np.unique(test_field_ids)) + len(np.unique(train_field_ids))} fields in the dataset")

7081 fields in the dataset
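
The total above is the sum of the two per-split counts, which equals the number of distinct fields only if no field id occurs in both splits. A small sketch of how that could be checked with sets follows; the helper name `count_fields` and the toy ids are illustrative.

```python
def count_fields(train_ids, test_ids):
    """Return the number of distinct field ids overall and the number shared by both splits."""
    train, test = set(train_ids), set(test_ids)
    return {"total": len(train | test), "shared": len(train & test)}


# Toy ids (hypothetical): field 3 appears in both splits
result = count_fields([1, 2, 3, 3], [3, 4])
print(result)
```

Applied as `count_fields(train_field_ids, test_field_ids)`, a `shared` value of 0 would confirm that summing the per-split counts is safe.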


### What are the chips with overlapping field ids?

In [110]:
def get_overlapping_chips_per_field(field_ids_per_chip):
    field_ids = np.unique([field for fields in field_ids_per_chip.values() for field in fields])

    chips = {}

    for idx in tqdm(field_ids):
        chips[idx] = [chip for chip in field_ids_per_chip.keys() if idx in field_ids_per_chip[chip]]

    return chips
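
`get_overlapping_chips_per_field` scans every chip once per field id. The same mapping can be built in a single pass by inverting the chip-to-fields dictionary. The sketch below uses a hypothetical helper name and toy ids, and assumes each chip's field list has no duplicates (which holds when it comes from `np.unique`).

```python
from collections import defaultdict


def invert_field_ids_per_chip(field_ids_per_chip):
    """Map each field id to the list of chips in which it occurs."""
    chips_per_field = defaultdict(list)
    for chip, fields in field_ids_per_chip.items():
        for field in fields:
            chips_per_field[field].append(chip)
    return dict(chips_per_field)


# Toy mapping (hypothetical): field 2 is spread over three chips
toy = {"chip_a": [1, 2], "chip_b": [2, 3], "chip_c": [2]}
chips_per_field = invert_field_ids_per_chip(toy)
print(chips_per_field)
```

The keys come out in first-seen order rather than sorted, which does not affect the counts derived from the lengths of the lists.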

In [125]:
overlapping_train_chips = get_overlapping_chips_per_field(train_field_ids_per_chip)
sets, counts = np.unique([len(chips) for chips in overlapping_train_chips.values()], return_counts=True)

# Each printed line gives the number of fields (c) whose pixels appear in v different chips
for v, c in zip(sets, counts):
    print(f"{c} sets of {v} overlapping training chips")

100%|██████████| 5551/5551 [00:00<00:00, 8078.83it/s]

5285 sets of 1 overlapping training chips
264 sets of 2 overlapping training chips
1 sets of 3 overlapping training chips
1 sets of 4 overlapping training chips





In [126]:
overlapping_test_chips = get_overlapping_chips_per_field(test_field_ids_per_chip)
sets, counts = np.unique([len(chips) for chips in overlapping_test_chips.values()], return_counts=True)

# Each printed line gives the number of fields (c) whose pixels appear in v different chips
for v, c in zip(sets, counts):
    print(f"{c} sets of {v} overlapping testing chips")

100%|██████████| 1530/1530 [00:00<00:00, 19568.95it/s]

1458 sets of 1 overlapping testing chips
70 sets of 2 overlapping testing chips
2 sets of 4 overlapping testing chips





## Crops distribution

In [129]:
crops = {
    "1": "Wheat",
    "2": "Mustard",
    "3": "Lentil",
    "4": "No Crop/Fallow",
    "5": "Green pea",
    "6": "Sugarcane",
    "8": "Garlic",
    "9": "Maize",
    "13": "Gram",
    "14": "Coriander",
    "15": "Potato",
    "16": "Bersem",
    "36": "Rice",
}

### What is the crop distribution per chip?

In [187]:
def get_crop_labels_per_chips(collection, folder_ids):
    """Map each chip id to the list of crop names present in its raster_labels.tif."""

    chips = {}

    pbar = tqdm(folder_ids)
    pbar.set_description(f'Extracting {collection}')

    for idx in pbar:
        with rasterio.open(f'{data_dir}/{main}/{collection}/{collection}_{idx}/raster_labels.tif') as src:
            label_data = np.unique(src.read())
            # 0 is treated as background (no label) and dropped
            chips[idx] = [crops[str(label_idx)] for label_idx in label_data[label_data != 0]]

    return chips


def get_overlapping_crops_in_chips(crops_per_chip):
    """Map each crop name to the list of chip ids in which it occurs."""

    chips = {}

    pbar = tqdm(crops.values())

    for crop in pbar:
        pbar.set_description(f"Extracting {crop}")
        chips[crop] = [chip for chip in crops_per_chip.keys() if crop in crops_per_chip[chip]]

    return chips

In [164]:
train_crops_per_chip = get_crop_labels_per_chips(train_label_collection, train_folder_ids)

sets, counts = np.unique([len(crops_set) for crops_set in train_crops_per_chip.values()], return_counts=True)

for v, c in zip(sets, counts):
    print(f"{c} chips containing {v} crops type(s)")

Extracting ref_agrifieldnet_competition_v1_labels_train: 100%|██████████| 1165/1165 [00:02<00:00, 486.37it/s]

586 chips containing 1 crops type(s)
313 chips containing 2 crops type(s)
178 chips containing 3 crops type(s)
69 chips containing 4 crops type(s)
14 chips containing 5 crops type(s)
3 chips containing 6 crops type(s)
2 chips containing 7 crops type(s)





In [191]:
overlapping_crops_in_chips = get_overlapping_crops_in_chips(train_crops_per_chip)

crop_counts_chip = {crop: len(chips) for crop, chips in overlapping_crops_in_chips.items()}

for crop, count in crop_counts_chip.items():
    print(f"{crop} can be found in {count} chips")

Extracting Rice: 100%|██████████| 13/13 [00:00<00:00, 1978.66it/s]

Wheat can be found in 661 chips
Mustard can be found in 479 chips
Lentil can be found in 82 chips
No Crop/Fallow can be found in 446 chips
Green pea can be found in 24 chips
Sugarcane can be found in 111 chips
Garlic can be found in 40 chips
Maize can be found in 119 chips
Gram can be found in 42 chips
Coriander can be found in 11 chips
Potato can be found in 38 chips
Bersem can be found in 16 chips
Rice can be found in 55 chips





### What is the crop distribution per field id?

In [189]:
def get_crop_labels_per_field_ids(collection, folder_ids):
    """Map each field id to the set of crop names labelled inside it, across all chips."""

    crop_labels_per_field_ids = defaultdict(set)

    pbar = tqdm(folder_ids)
    pbar.set_description(f'Extracting {collection}')

    for idx in pbar:

        with rasterio.open(f'{data_dir}/{main}/{collection}/{collection}_{idx}/raster_labels.tif') as src:
            label_data = src.read()[0]

        with rasterio.open(f'{data_dir}/{main}/{collection}/{collection}_{idx}/field_ids.tif') as src:
            field_data = src.read()[0]

        field_ids = np.unique(field_data)

        # 0 is treated as background (no field) and skipped
        for fidx in field_ids[field_ids != 0]:
            # Crop codes found on the pixels of this field within the current chip
            codes = np.unique(label_data[field_data == fidx])
            crop_labels_per_field_ids[fidx].update(crops[str(cidx)] for cidx in codes)

    return crop_labels_per_field_ids


def get_overlapping_crops_in_field_ids(crops_per_field_ids):
    """Map each crop name to the list of field ids labelled with it."""

    fields = {}

    pbar = tqdm(crops.values())

    for crop in pbar:
        pbar.set_description(f"Extracting {crop}")
        fields[crop] = [field for field in crops_per_field_ids.keys() if crop in crops_per_field_ids[field]]

    return fields

In [190]:
crop_labels_per_field_ids = get_crop_labels_per_field_ids(train_label_collection, train_folder_ids)

sets, counts = np.unique([len(crops_set) for crops_set in crop_labels_per_field_ids.values()], return_counts=True)

for v, c in zip(sets, counts):
    print(f"{c} fields containing {v} crops type(s)")

Extracting ref_agrifieldnet_competition_v1_labels_train: 100%|██████████| 1165/1165 [00:04<00:00, 259.17it/s]

5551 fields containing 1 crops type(s)





In [196]:
overlapping_crops_in_field_ids = get_overlapping_crops_in_field_ids(crop_labels_per_field_ids)

crop_counts_field = {crop: len(chips) for crop, chips in overlapping_crops_in_field_ids.items()}

for crop, count in crop_counts_field.items():
    print(f"{crop} can be found in {count} fields")

Extracting Rice: 100%|██████████| 13/13 [00:00<00:00, 1624.29it/s]

Wheat can be found in 2031 fields
Mustard can be found in 990 fields
Lentil can be found in 103 fields
No Crop/Fallow can be found in 1641 fields
Green pea can be found in 23 fields
Sugarcane can be found in 163 fields
Garlic can be found in 48 fields
Maize can be found in 293 fields
Gram can be found in 59 fields
Coriander can be found in 14 fields
Potato can be found in 41 fields
Bersem can be found in 16 fields
Rice can be found in 129 fields





### What is the crop distribution per pixels?

In [197]:
def get_num_pixels_per_crops_in_each_chip(collection, folder_ids):
    """Map each chip id to a {crop name: pixel count} dictionary."""

    num_pixels_per_chip = defaultdict(dict)

    pbar = tqdm(folder_ids)
    pbar.set_description(f'Extracting {collection}')

    for idx in pbar:

        with rasterio.open(f'{data_dir}/{main}/{collection}/{collection}_{idx}/raster_labels.tif') as src:
            label_data = src.read()[0]

        for crop in crops.keys():
            # Number of pixels in this chip labelled with the current crop code
            num_pixels_per_chip[idx][crops[crop]] = np.count_nonzero(label_data == int(crop))

    return num_pixels_per_chip

In [198]:
num_pixels_per_field_ids = get_num_pixels_per_crops_in_each_chip(train_label_collection, train_folder_ids)

Extracting ref_agrifieldnet_competition_v1_labels_train: 100%|██████████| 1165/1165 [00:02<00:00, 552.11it/s]


In [210]:
# Sum the per-chip counts column-wise; the columns follow the insertion order of `crops`
pixel_dist = np.asarray([list(counts.values()) for counts in num_pixels_per_field_ids.values()]).sum(axis=0)

for crop_name, num_pixels in zip(crops.values(), pixel_dist):
    print(f"'{crop_name}' can be found in {num_pixels} pixels")

'Wheat' can be found in 75118 pixels
'Mustard' can be found in 46818 pixels
'Lentil' can be found in 2883 pixels
'No Crop/Fallow' can be found in 36397 pixels
'Green pea' can be found in 531 pixels
'Sugarcane' can be found in 5820 pixels
'Garlic' can be found in 3150 pixels
'Maize' can be found in 8773 pixels
'Gram' can be found in 3503 pixels
'Coriander' can be found in 678 pixels
'Potato' can be found in 886 pixels
'Bersem' can be found in 261 pixels
'Rice' can be found in 3410 pixels
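
The raw counts are easier to compare as shares of all labelled pixels. A minimal sketch that hard-codes the counts printed above and normalizes them follows; the variable names are illustrative, and in the notebook the values would come from `pixel_dist` instead.

```python
# Pixel counts per crop, copied from the output above
pixel_counts = {
    "Wheat": 75118,
    "Mustard": 46818,
    "Lentil": 2883,
    "No Crop/Fallow": 36397,
    "Green pea": 531,
    "Sugarcane": 5820,
    "Garlic": 3150,
    "Maize": 8773,
    "Gram": 3503,
    "Coriander": 678,
    "Potato": 886,
    "Bersem": 261,
    "Rice": 3410,
}

total_pixels = sum(pixel_counts.values())
shares = {crop: count / total_pixels for crop, count in pixel_counts.items()}

# Print crops from most to least frequent
for crop, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{crop}: {share:.2%}")
```

Shares like these could, for instance, inform class weighting when the imbalance between frequent and rare crops matters for training.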
