## Create initial dataset

This notebook splits the tile data into train and test sets based on a year configuration. It also filters tiles by the percentage of *non-nan* and *fire* pixels they contain. The train set is then sampled to maintain a specific class ratio, keeping all *fire* tiles plus a configurable number of *no-fire* tiles. Finally, the resulting train and test dataframes are saved to disk for use in semantic segmentation experiments.

***

Import packages

In [1]:
import pathlib

import numpy as np
import pandas as pd

Define a dataset configuration

In [2]:
cfg = {
    "ds_version": "dataset_v1",  # Name of dataset version to keep track of its configuration
    "ds_csv": "final_dataset_alltiles_ts32.csv",  # Name of the dataset csv being processed
    "test_years": [2019],  # List of years to be used for testing
    "train_months": [],  # Months to keep for train tiles, [] to keep all
    "test_months": ["06", "07", "08", "09"],  # Months to keep for test tiles, [] to keep all
    # The thresholds below keep a tile iff it contains at least 1 px of land; the fire pct is irrelevant.
    # All thresholds are exclusive lower bounds (the filter uses a strict `>`).
    "train_non_nan_thr": 0.0,  # Non-nan pixel percentage a train tile must exceed to be kept
    "train_fire_thr": -1.0,  # Fire pixel percentage a train tile must exceed to be kept
    "test_non_nan_thr": 0.0,  # Non-nan pixel percentage a test tile must exceed to be kept
    "test_fire_thr": -1.0,  # Fire pixel percentage a test tile must exceed to be kept
    "non_fire_ratio": 2,  # The ratio of non-fire tiles over the fire ones to sample for the train set
    "random_state": 42,
    "tile_size": 32,
}

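Load the dataset and split it into train and test sets by year. Train tiles come from the years strictly before the earliest test year; tiles from any later year that is not a test year are dropped.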

In [3]:
dataset = pd.read_csv(cfg["ds_csv"])

train_df_ = dataset[dataset["year"] < np.min(cfg["test_years"])].reset_index(
    drop=True)
test_df = dataset[dataset["year"].isin(cfg["test_years"])].reset_index(
    drop=True)

train_df_.shape, test_df.shape

((2505750, 6), (245435, 6))
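
The split rule can be checked on a toy table. This is a minimal sketch, not part of the original pipeline: the `toy` dataframe and its `tile_id` column are made up for illustration, and only the `year` column mirrors the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tiles table; only "year" matters here.
toy = pd.DataFrame({"year": [2017, 2018, 2019, 2020], "tile_id": [0, 1, 2, 3]})
test_years = [2019]

# Same rule as the cell above: train = strictly before the earliest test year.
toy_train = toy[toy["year"] < np.min(test_years)].reset_index(drop=True)
toy_test = toy[toy["year"].isin(test_years)].reset_index(drop=True)

print(toy_train["year"].tolist())  # [2017, 2018]
print(toy_test["year"].tolist())   # [2019]
# The 2020 row lands in neither split.
```

With more than one test year, the same rule still anchors the train set to the earliest of them, so no train tile is newer than any test tile.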

Filter tiles by month using the `train_months` and `test_months` settings

In [4]:
# Month as a zero-padded string, taken from characters 4-5 of the date
train_df_["month"] = train_df_["date"].astype(str).str[4:6]
if cfg["train_months"]:
    train_df_ = train_df_[
        train_df_["month"].isin(cfg["train_months"])].reset_index(drop=True)

test_df["month"] = test_df["date"].astype(str).str[4:6]
if cfg["test_months"]:
    test_df = test_df[
        test_df["month"].isin(cfg["test_months"])].reset_index(drop=True)

train_df_.shape, test_df.shape

((2505750, 7), (155485, 7))
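
The month filter relies on string slicing, so the month values in the config must be zero-padded strings such as `"06"`. The sketch below illustrates this on made-up dates, assuming the `date` column holds values shaped like `YYYYMMDD` (the slice `[4:6]` suggests this, but the source does not state the format).

```python
import pandas as pd

# Hypothetical dates, assumed to be YYYYMMDD integers.
toy = pd.DataFrame({"date": [20190105, 20190615, 20190930, 20191201]})

# Characters 4-5 of the string form are the two-digit month.
toy["month"] = toy["date"].astype(str).str[4:6]

test_months = ["06", "07", "08", "09"]
# An empty list means "keep everything", as in the notebook.
kept = toy[toy["month"].isin(test_months)] if test_months else toy

print(toy["month"].tolist())   # ['01', '06', '09', '12']
print(kept["date"].tolist())   # [20190615, 20190930]
```

Passing integers such as `[6, 7, 8, 9]` instead would match nothing, because the `month` column holds strings.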

Filter tiles using the `*_non_nan_thr` and `*_fire_thr` settings

In [5]:
# A tile is kept only if it strictly exceeds both thresholds
train_mask = (train_df_["non_nan_pct"] > cfg["train_non_nan_thr"]) & (
    train_df_["fire_pct"] > cfg["train_fire_thr"])
train_df_ = train_df_[train_mask].reset_index(drop=True)

test_mask = (test_df["non_nan_pct"] > cfg["test_non_nan_thr"]) & (
    test_df["fire_pct"] > cfg["test_fire_thr"])
test_df = test_df[test_mask].reset_index(drop=True)

train_df_.shape, test_df.shape

((2505750, 7), (155485, 7))
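
Because both comparisons are strict, a fire threshold of `-1.0` lets every tile through (including tiles with no fire at all), while a non-nan threshold of `0.0` drops only tiles that are entirely nan. The shapes above are unchanged by this step, which means no tile in either split had `non_nan_pct` equal to zero. A small sketch with made-up values shows the effect of different thresholds; `filter_tiles` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

# Hypothetical tiles: all-nan, partial land, full land, full land with fire.
toy = pd.DataFrame({
    "non_nan_pct": [0.0, 0.3, 1.0, 1.0],
    "fire_pct":    [0.0, 0.0, 0.0, 0.2],
})

def filter_tiles(df, non_nan_thr, fire_thr):
    """Keep rows that strictly exceed both thresholds."""
    mask = (df["non_nan_pct"] > non_nan_thr) & (df["fire_pct"] > fire_thr)
    return df[mask].reset_index(drop=True)

default = filter_tiles(toy, non_nan_thr=0.0, fire_thr=-1.0)
fire_only = filter_tiles(toy, non_nan_thr=0.0, fire_thr=0.0)

print(len(default))    # 3: only the all-nan tile is dropped
print(len(fire_only))  # 1: only the tile containing fire remains
```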

Sample the train tiles to obtain a fixed ratio of non-fire to fire tiles: keep all fire tiles (N) plus `non_fire_ratio`\*N randomly sampled non-fire tiles.

In [6]:
train_fire_df = train_df_[train_df_["fire_pct"] > 0.0].reset_index(drop=True)
train_non_fire_df = train_df_[train_df_["fire_pct"] == 0.0].reset_index(
    drop=True)

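# Cap the request at the number of available non-fire tiles, since sampling
# without replacement raises a ValueError when n exceeds the population size.
n_non_fire = min(cfg["non_fire_ratio"] * len(train_fire_df),
                 len(train_non_fire_df))
train_df = pd.concat([train_fire_df, train_non_fire_df.sample(
    n=n_non_fire, replace=False,
    random_state=cfg["random_state"])]).reset_index(drop=True)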

train_df = train_df.sample(frac=1, random_state=cfg["random_state"]
                          ).reset_index(drop=True)

train_df.shape

(3342, 7)
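
With `non_fire_ratio` set to 2, the resulting 3342 rows correspond to 1114 fire tiles plus 2228 sampled non-fire tiles (N + 2N = 3342). The sampling step can be sketched as a small function on toy data. `sample_train_tiles` is a hypothetical helper written for this illustration, and the cap on the sample size is an added safeguard rather than something the original cell does.

```python
import pandas as pd

def sample_train_tiles(df, non_fire_ratio, random_state):
    """Keep all fire tiles plus non_fire_ratio * N random non-fire tiles."""
    fire = df[df["fire_pct"] > 0.0]
    non_fire = df[df["fire_pct"] == 0.0]
    # Assumption: cap at the available non-fire tiles instead of raising.
    n = min(non_fire_ratio * len(fire), len(non_fire))
    sampled = non_fire.sample(n=n, replace=False, random_state=random_state)
    out = pd.concat([fire, sampled])
    # Shuffle so fire and non-fire rows are interleaved.
    return out.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Hypothetical data: 3 fire tiles and 10 non-fire tiles.
toy = pd.DataFrame({"fire_pct": [0.1, 0.4, 0.2] + [0.0] * 10})
balanced = sample_train_tiles(toy, non_fire_ratio=2, random_state=42)

print(len(balanced))                          # 9
print(int((balanced["fire_pct"] > 0).sum()))  # 3
```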

Save train and test dataframes to csv

In [7]:
output_dir = pathlib.Path(cfg["ds_version"])
if not output_dir.is_dir():
    output_dir.mkdir()
    pd.DataFrame.from_dict(cfg, orient='index').to_csv(pathlib.Path(
        output_dir, "config.csv"))

    train_df.to_csv(pathlib.Path(output_dir, "train_tiles.csv"), index=False)
    test_df.to_csv(pathlib.Path(output_dir, "test_tiles.csv"), index=False)

    print(f"Dataset directory {output_dir} created.")
else:
    print(f"Dataset directory {output_dir} already exists.")

Dataset directory dataset_v1 created.
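
Note that the cell writes nothing when the target directory already exists, so rerunning with a changed configuration under the same `ds_version` leaves the old files in place. The sketch below reproduces that guard in a temporary directory. `save_split` is a hypothetical helper, and the config file is omitted to keep the example short.

```python
import pathlib
import tempfile

import pandas as pd

def save_split(output_dir, train_df, test_df):
    """Write both CSVs only if output_dir is new; return True if written."""
    output_dir = pathlib.Path(output_dir)
    if output_dir.is_dir():
        return False
    output_dir.mkdir()
    train_df.to_csv(output_dir / "train_tiles.csv", index=False)
    test_df.to_csv(output_dir / "test_tiles.csv", index=False)
    return True

# Hypothetical toy splits.
toy_train = pd.DataFrame({"fire_pct": [0.2, 0.0, 0.0]})
toy_test = pd.DataFrame({"fire_pct": [0.0]})

with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "dataset_v1"
    first = save_split(target, toy_train, toy_test)   # directory is created
    second = save_split(target, toy_train, toy_test)  # skipped, already exists
    reloaded = pd.read_csv(target / "train_tiles.csv")

print(first, second, len(reloaded))  # True False 3
```

Bumping `ds_version` for every configuration change is the simplest way to avoid silently reusing stale files.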
