## Data Preparation
This tutorial was created by C. Baumhoer and is based on work from https://github.com/mmorphew/unet_remote_sensing and https://github.com/karolzak/keras-unet

----- Access data via Colab ----

In [None]:
# Import the Colab package and mount the Google Drive associated with your account
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Use ! to run a shell command and % to run an IPython magic such as cd
# You have to accept the terms and conditions in the pop-up window
!mkdir ./drive/My\ Drive/DL
%cd /content/drive/

In [None]:
# Check what's in your folder
# You should see the unzipped AP-S2 data
%ls

In [None]:
#Change directory into the data folder
%cd /content/drive/MyDrive/MET-3

------ Set up the python environment -----

In [None]:
#Install all packages we need for this tutorial
%pip install opencv-python tifffile scikit-image rasterio scipy

In [None]:
import cv2
import os
import tifffile # for reading tiff files. There are other modules that can do this, but tifffile is most stable on Windows
import numpy as np # for array handling
import matplotlib.pyplot as plt # for QC
import glob # to gather up image filepath lists
from skimage.transform import resize # we're gonna do some rearranging
import rasterio
from rasterio.plot import show
import scipy # same

----- Let's explore the available data -----




The data is organized in the following folders:
- scenes
- masks
- val --> masks


The scenes folder contains Sentinel-2 imagery with 3 bands (B3, B4, B8), also called channels in DL jargon. The masks contain the value 0 for ocean and the value 1 for ice sheet.

In [None]:
#Let's have a look at the available files
top_train_list = glob.glob('./scenes/*.tif')
top_train_list = np.sort(top_train_list)
print(top_train_list)

In [None]:
#Available validation scenes
top_test_list = glob.glob('./val/*.tif')
top_test_list = np.sort(top_test_list)
print(top_test_list)

Let's see what our data looks like.

In [None]:
# Plot the first band of a training image
src = rasterio.open('./scenes/S2B_MSIL1C_20220128T131859_N0400_R095_T20DNP_20220128T150039-1.tif')
img0 = show(src.read(1),transform=src.transform, cmap='gray')

That looks great. We can see the glacier front, and the image is already square. All images are 512 x 512 pixels, so no resizing is necessary.

TASK 1: Print the pixel values of a few random training scenes. What range do the pixel values cover, and what do they represent?

----- Make our data DL ready -----

In [None]:
#creating empty arrays to store our data
imgResize = 512
channels = 3

# we create an array that has the shape : sample number, height, width, channel
top_train_total = np.zeros((len(top_train_list), imgResize, imgResize, channels)) 
top_test_total = np.zeros((len(top_test_list), imgResize, imgResize, channels)) 

In [None]:
for i in range(len(top_train_list)):
    img0 = tifffile.imread(top_train_list[i]) # read the image
    # resizing the image is not necessary in our case 
    # as they are all 512 x 512 pixels, but you could use this function if
    # you have different sized images
    #img_reshaped = resize(img0, (imgResize, imgResize, channels)) # resize it
    
    # Per-image normalization: center on the image mean, scale by half a
    # standard deviation, then clip the result to the range [-1, 1]
    img_norm = np.clip((img0 - img0.mean()) / (0.5 * img0.std()), -1, 1)
    top_train_total[i] = img_norm # add it to the array

In [None]:
for i in range(len(top_test_list)):
    #img = rasterio.open(top_test_list[i]) # read the image
    # nan values to zero
    #img0 = np.nan_to_num(img.read(1))
    #img_reshaped = resize(img0, (imgResize, imgResize, channels)) # resize it
    img0 = tifffile.imread(top_test_list[i]) # read the image
    #normalize
    img_norm = np.clip((img0 - img0.mean()) / (0.5 * img0.std()), -1, 1)
    top_test_total[i] = img_norm # add it to the array
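
To see what the normalization in the two loops above does, here is a minimal sketch on a synthetic array. The function name `normalize_scene` and the random input are illustrative assumptions, not part of the tutorial data. Values are centered on the image mean, divided by half a standard deviation, and clipped to [-1, 1], so every pixel further than 0.5 standard deviations from the mean saturates at -1 or 1.

```python
import numpy as np

def normalize_scene(img):
    """Center on the image mean, scale by half a standard deviation, clip to [-1, 1]."""
    img = img.astype(np.float64)
    return np.clip((img - img.mean()) / (0.5 * img.std()), -1, 1)

# Synthetic stand-in for a 512 x 512 scene with 3 bands (not real Sentinel-2 data)
rng = np.random.default_rng(0)
fake_scene = rng.integers(0, 10000, size=(512, 512, 3))

norm = normalize_scene(fake_scene)
print(norm.shape, norm.min(), norm.max())
```

Note that a constant image has a standard deviation of zero, so this formula would divide by zero; if such scenes can occur, the function needs a guard for that case.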

Great, now that both the training and the validation arrays are filled, we can save them.

TASK 2: Why do we need to scale our image data and what different methods exist? A great resource is: https://machinelearningmastery.com/how-to-manually-scale-image-pixel-data-for-deep-learning/


In [None]:
# Finally save our data for Part 2
np.save('data_train.npy', top_train_total)
np.save('data_test.npy', top_test_total)

----- Processing the labels ------

In [None]:
label_train_list = glob.glob('./masks/*.tif')
label_train_list = np.sort(label_train_list)
print(label_train_list)

In [None]:
label_test_list = glob.glob('./val/masks/*.tif')
label_test_list = np.sort(label_test_list)
print(label_test_list)
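
Images and labels are stored in separate arrays, so sample i of the image array has to belong to sample i of the label array. Sorting both file lists only guarantees this if every scene has a mask with a matching name. The following is a minimal sketch of such a check. It assumes, without confirmation from the tutorial data, that a scene and its mask share the same file name; `find_unpaired` and the example paths are hypothetical.

```python
import os

def find_unpaired(scene_paths, mask_paths):
    """Return scene names without a same-named mask, and mask names without a scene."""
    scene_names = {os.path.basename(p) for p in scene_paths}
    mask_names = {os.path.basename(p) for p in mask_paths}
    return sorted(scene_names - mask_names), sorted(mask_names - scene_names)

# Hypothetical file names for illustration only
scenes = ['./scenes/a-1.tif', './scenes/a-2.tif', './scenes/b-1.tif']
masks = ['./masks/a-1.tif', './masks/b-1.tif']

missing_masks, missing_scenes = find_unpaired(scenes, masks)
print(missing_masks, missing_scenes)
```

With the real data you could pass `top_train_list` and `label_train_list`; two empty lists would mean every scene has a matching mask under this naming assumption.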

In [None]:
# Let's have a look at our labels
img = tifffile.imread(label_train_list[8])
plt.imshow(img, cmap='Blues')



Let's start reshaping our labels.

In [None]:
label_train_total = np.zeros((len(label_train_list), imgResize, imgResize, 1))
for i in range(len(label_train_list)):
    img = tifffile.imread(label_train_list[i])
    # again, resizing is not necessary in our case, but we need to
    # reshape the 2D mask so that it gets a trailing channel axis
    img_reshaped = resize(img, (imgResize, imgResize, 1))
    label_train_total[i] = img_reshaped

In [None]:
label_test_total = np.zeros((len(label_test_list), imgResize, imgResize, 1))
for i in range(len(label_test_list)):
    img = tifffile.imread(label_test_list[i])
    img_reshaped = resize(img, (imgResize, imgResize, 1))
    label_test_total[i] = img_reshaped
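
Since the masks already have the target size, the only change the loops above need to make is to add a trailing channel axis. As an alternative sketch (not the tutorial's method), NumPy indexing can add that axis directly and leaves the pixel values untouched, whereas `resize` interpolates and, depending on the input dtype, may rescale the values. The synthetic mask below is an assumption for illustration.

```python
import numpy as np

# Synthetic 2D mask standing in for a label file: 0 = ocean, 1 = ice sheet
mask = np.zeros((512, 512), dtype=np.uint8)
mask[:, 256:] = 1

# Append a channel axis without changing the pixel values
mask_with_channel = mask[..., np.newaxis]

print(mask.shape, mask_with_channel.shape)
print(np.unique(mask_with_channel))
```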

TASK 3: What are the label values and dimensions before and after reshaping? And why do we need to one-hot-encode our images?

In [None]:
onehot_label_train_total = np.zeros((len(label_train_list),imgResize,imgResize,2), dtype=float)
for k in range(len(label_train_list)):
    for i in range(imgResize):
        for j in range(imgResize):
            # ocean
            if label_train_total[k,i,j,0]==0.:
                onehot_label_train_total[k,i,j,0]=1
            # ice sheet
            elif label_train_total[k,i,j,0]>0:
                onehot_label_train_total[k,i,j,1]=1

In [None]:
onehot_label_test_total = np.zeros((len(label_test_list),imgResize,imgResize,2), dtype=float)
for k in range(len(label_test_list)):
    for i in range(imgResize):
        for j in range(imgResize):
            # ocean
            if label_test_total[k,i,j,0]==0.:
                onehot_label_test_total[k,i,j,0]=1
            # ice sheet
            elif label_test_total[k,i,j,0]> 0:
                onehot_label_test_total[k,i,j,1]=1
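
The nested loops above visit every pixel of every sample in Python, which can take a while for 512 x 512 images. The same rule (value 0 goes to channel 0, any value above 0 goes to channel 1) can be written with NumPy comparisons on the whole array. This is a minimal sketch; `one_hot_ocean_ice` is a hypothetical helper name and the tiny label array is synthetic.

```python
import numpy as np

def one_hot_ocean_ice(labels):
    """Turn labels of shape (n, h, w, 1) into one-hot labels of shape (n, h, w, 2)."""
    onehot = np.zeros(labels.shape[:3] + (2,), dtype=float)
    onehot[..., 0] = labels[..., 0] == 0  # ocean
    onehot[..., 1] = labels[..., 0] > 0   # ice sheet
    return onehot

# Tiny synthetic label array: one sample with 2 x 2 pixels
labels = np.array([[[[0.0], [1.0]], [[0.5], [0.0]]]])
onehot = one_hot_ocean_ice(labels)
print(onehot[0, :, :, 0])
print(onehot[0, :, :, 1])
```

Applied to the tutorial arrays, this would be called as `one_hot_ocean_ice(label_train_total)` and `one_hot_ocean_ice(label_test_total)`.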

TASK 4: Check your one-hot-encoded labels regarding dimensions and unique values.

Let's check what our one-hot-encoded labels look like.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,10))
axes[0].imshow(onehot_label_test_total[2,:,:,0], cmap='gray')
axes[1].imshow(onehot_label_test_total[2,:,:,1], cmap='gray')
print(np.unique(onehot_label_train_total))

Great, this is looking good. Finally, we save our labels for Part 2.

In [None]:
np.save('./label_train.npy', onehot_label_train_total)
np.save('./label_test.npy', onehot_label_test_total)
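
Before moving on to Part 2, it can be worth confirming that a saved array loads back unchanged. The following is a minimal sketch using a small synthetic array and a temporary directory, both of which are assumptions for illustration; for the real files you would call `np.load` on the paths used above.

```python
import os
import tempfile
import numpy as np

# Small synthetic one-hot array standing in for the real labels
labels = np.zeros((2, 4, 4, 2), dtype=float)
labels[..., 0] = 1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'label_check.npy')  # hypothetical file name
    np.save(path, labels)
    restored = np.load(path)

print(restored.shape, restored.dtype)
print(np.array_equal(labels, restored))
```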