## Ash Color Images Dataset Creation Notebook

This notebook builds an Ash Color Images dataset from the competition's satellite images, for use in training our models. Main points:
* Save only the labeled frame, which will be used for training.
* Save only the human_pixel_masks.
* Save the ash color image and the mask label in the same numpy file, so that we have to load only one file during training.
* Save the final numpy arrays in float16 dtype to reduce total data size.
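
As a rough illustration of the last two points, the sketch below packs a 3-channel image and a 1-channel mask into one float16 array and splits them again at load time. The helper names `pack_sample` and `unpack_sample`, the 256×256 size, and the random data are assumptions made for illustration; they are not part of the competition code.

```python
import numpy as np

def pack_sample(img, mask):
    """Stack an (H, W, 3) image and an (H, W, 1) mask into one (H, W, 4) float16 array."""
    return np.dstack([img, mask]).astype(np.float16)

def unpack_sample(sample):
    """Split a packed (H, W, 4) array into the image (H, W, 3) and the mask (H, W)."""
    return sample[..., :3], sample[..., 3]

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))            # stand-in for an ash color image in [0, 1]
mask = rng.integers(0, 2, (256, 256, 1))   # stand-in for a binary pixel mask

sample = pack_sample(img, mask)
img_back, mask_back = unpack_sample(sample)

print(sample.shape, sample.dtype)
# float16 uses 2 bytes per value, float64 uses 8
print(sample.nbytes, sample.astype(np.float64).nbytes)
```

Mask values of 0 and 1 are represented exactly in float16, so only the image channels lose precision.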

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
from tqdm.notebook import tqdm
from pathlib import Path

In [None]:
data_dir = '/kaggle/input/google-research-identify-contrails-reduce-global-warming/'

## Make the DataFrames

We create train and validation DataFrames that hold the record ID of each image.

In [None]:
train_rs = os.listdir(data_dir + 'train')
valid_rs = os.listdir(data_dir + 'validation')

train_df = pd.DataFrame(train_rs, columns=['record_id'])
valid_df = pd.DataFrame(valid_rs, columns=['record_id'])

train_df['train'] = 'train'
valid_df['train'] = 'valid'

In [None]:
train_df.shape, valid_df.shape

In [None]:
train_df.head()

In [None]:
train_df.to_csv('train_df.csv', index=False)
valid_df.to_csv('valid_df.csv', index=False)

## Save the Images as NumPy Arrays

In [None]:
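def read_record(record_id, directory):
    """Load the three infrared bands needed for the ash color image of one record."""
    record_data = {}
    for x in [
        "band_11",
        "band_14",
        "band_15",
    ]:
        # Each band is stored as its own .npy file inside the record's folder.
        record_data[x] = np.load(os.path.join(directory, record_id, x + ".npy"))

    return record_data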

In [None]:
_T11_BOUNDS = (243, 303)
_CLOUD_TOP_TDIFF_BOUNDS = (-4, 5)
_TDIFF_BOUNDS = (-4, 2)

def normalize_range(data, bounds):
    """Linearly rescale data so that bounds map to [0, 1]; values outside the bounds are not clipped here."""
    return (data - bounds[0]) / (bounds[1] - bounds[0])

N_TIMES_BEFORE = 4  # index of the labeled frame: 4 frames precede it and 3 follow
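
To make the rescaling concrete, here is a small worked example, assuming a handful of made-up brightness temperatures in kelvin. The array `temps` is hypothetical; the bounds are the `_T11_BOUNDS` values from this notebook.

```python
import numpy as np

def normalize_range(data, bounds):
    """Linearly rescale data so that bounds[0] maps to 0 and bounds[1] maps to 1."""
    return (data - bounds[0]) / (bounds[1] - bounds[0])

t11_bounds = (243, 303)
temps = np.array([233.0, 243.0, 273.0, 303.0, 313.0])  # made-up sample values

scaled = normalize_range(temps, t11_bounds)
# 243 -> 0.0, 273 -> 0.5, 303 -> 1.0; 233 and 313 land outside [0, 1]
print(scaled)

# Clipping afterwards forces everything into [0, 1]
clipped = np.clip(scaled, 0, 1)
print(clipped)
```

Because the function itself does not clip, the clipping step has to happen later, once the channels are stacked.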

In [None]:
def get_false_color(record_data):
    """Build the ash color image of the labeled frame from bands 11, 14 and 15."""
    # Bounds come from the module-level constants defined above.
    r = normalize_range(record_data["band_15"] - record_data["band_14"], _TDIFF_BOUNDS)
    g = normalize_range(record_data["band_14"] - record_data["band_11"], _CLOUD_TOP_TDIFF_BOUNDS)
    b = normalize_range(record_data["band_14"], _T11_BOUNDS)
    # Stack to (H, W, 3, T), clip to [0, 1], then keep a single time step.
    false_color = np.clip(np.stack([r, g, b], axis=2), 0, 1)
    img = false_color[..., N_TIMES_BEFORE]

    return img
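
The shape handling in this step is easy to misread, so here is a minimal sketch on synthetic data. It assumes each band has shape (256, 256, 8) and that the labeled frame sits at index 4; the random values and the helper `scale` are placeholders, not real satellite data or competition code.

```python
import numpy as np

H, W, T = 256, 256, 8  # assumed band shape: height, width, time steps
rng = np.random.default_rng(0)
record = {
    name: rng.uniform(200, 320, (H, W, T))  # fake brightness temperatures
    for name in ("band_11", "band_14", "band_15")
}

def scale(data, lo, hi):
    """Linear rescale so that lo -> 0 and hi -> 1 (no clipping)."""
    return (data - lo) / (hi - lo)

r = scale(record["band_15"] - record["band_14"], -4, 2)
g = scale(record["band_14"] - record["band_11"], -4, 5)
b = scale(record["band_14"], 243, 303)

# Stacking on axis=2 inserts the channel axis before the time axis.
stacked = np.clip(np.stack([r, g, b], axis=2), 0, 1)
frame = stacked[..., 4]  # keep one time step (assumed labeled frame)

print(stacked.shape)  # (256, 256, 3, 8)
print(frame.shape)    # (256, 256, 3)
```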

In [None]:
path = Path('contrails')
path.mkdir(exist_ok=True, parents=True)

In [None]:
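# Train
for i in tqdm(train_rs):
    data = read_record(str(i), data_dir + 'train')
    img = get_false_color(data)
    # Load the mask directly, since read_record only returns the three bands.
    mask = np.load(os.path.join(data_dir, 'train', str(i), 'human_pixel_masks.npy'))
    # Append the mask as a fourth channel and store everything as float16.
    final = np.dstack([img, mask]).astype(np.float16)

    pathc = path / f"{i}.npy"
    np.save(str(pathc), final)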

In [None]:
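# Valid
for i in tqdm(valid_rs):
    data = read_record(str(i), data_dir + 'validation')
    img = get_false_color(data)
    # Load the mask directly, since read_record only returns the three bands.
    mask = np.load(os.path.join(data_dir, 'validation', str(i), 'human_pixel_masks.npy'))
    # Append the mask as a fourth channel and store everything as float16.
    final = np.dstack([img, mask]).astype(np.float16)

    pathc = path / f"{i}.npy"
    np.save(str(pathc), final)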
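
Once the files are written, a training pipeline needs to read them back. The following is a hedged sketch of one way to do that: `load_sample` is a hypothetical helper, and the demo array written to a temporary directory stands in for a real saved record. It treats a fourth channel, if present, as the mask.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_sample(npy_path):
    """Load one saved array and return (image, mask); mask is None if no fourth channel exists."""
    arr = np.load(str(npy_path)).astype(np.float32)  # upcast: float16 is only the storage format
    img = arr[..., :3]
    mask = arr[..., 3] if arr.shape[-1] > 3 else None
    return img, mask

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_record.npy"      # placeholder file name
    demo = np.zeros((256, 256, 4), dtype=np.float16)
    demo[..., 3] = 1                               # pretend every pixel is labeled
    np.save(str(demo_path), demo)
    img, mask = load_sample(demo_path)

print(img.shape, img.dtype)
print(mask.shape, mask.sum())
```

Upcasting to float32 on load is a design choice for this sketch: many training setups expect float32 inputs, while float16 keeps the files on disk small.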