### Acknowledgement

In this notebook, we follow the approach outlined by Martin Görner in [Part 1 of his Keras on TPU series](https://codelabs.developers.google.com/codelabs/keras-flowers-data/#0).

### Loading libraries

In [None]:
import numpy as np
import pandas as pd
import os, sys, math
import tensorflow as tf
from pathlib import Path
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler

# AUTO will be used in tf.data.Dataset API
AUTO = tf.data.experimental.AUTOTUNE 

print("Tensorflow version " + tf.__version__)

### Setting up basic parameters

In [None]:
show_files=0

# if you want to see the full content of the
# '/kaggle/input' directory, set show_files=1

if show_files:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

In [None]:
NFOLDS=5 # this must be consistent with PATH_FOLDS below
SHARDS = 4 # the number of .tfrec files in each fold
TARGET_SIZE = [512, 512] # the desired size of the output images
CLASSES = [b'benign', b'malignant']

PATH_DATA=Path('/kaggle/input/siim-isic-melanoma-classification/')
PATH_FOLDS=Path('/kaggle/input/siim-stratified-groupkfold-5-folds/')

### Loading training data

In [None]:
train=pd.read_csv(PATH_DATA/'train.csv')
print(f"The shape of the `train` is {train.shape}.\n")
print(f"The columns present in `train` are {train.columns.values}.")

In [None]:
test=pd.read_csv(PATH_DATA/'test.csv')
print(f"The shape of the `test` is {test.shape}.\n")
print(f"The columns present in `test` are {test.columns.values}.")

Note that there is no `diagnosis` column in the test set. 

### Imputing missing values

The `sex`, `age_approx`, and `anatom_site_general_challenge` columns of the training set contain missing values:

In [None]:
train.isna().sum()

The `anatom_site_general_challenge` column of the test set contains missing values as well:

In [None]:
test.isna().sum()

We will replace the missing values of `age_approx` with the median age of the patients in the training set. In the other two columns, we will mark the missing values with the word "unknown".

In [None]:
median_age=train['age_approx'].median()
print(f"The median age of the patients in the training set is {median_age} years.")

In [None]:
# assign the result back instead of calling fillna in place on a column selection
train['age_approx']=train['age_approx'].fillna(median_age)
train.fillna('unknown', inplace=True)
test.fillna('unknown', inplace=True)

In [None]:
print(f"The total number of NA's after imputation in `train` is {train.isna().sum().sum()}.")
print(f"The total number of NA's after imputation in `test` is {test.isna().sum().sum()}.")

### One-hot encoding for categorical variables

The unique values in `train`:

In [None]:
print("The unique values of 'age_approx':")
print(np.unique(train['age_approx'].values))
print("\nThe unique values of 'sex':")
print(np.unique(train['sex'].values))
print("\nThe unique values of 'anatom_site_general_challenge':")
print(np.unique(train['anatom_site_general_challenge'].values))
print("\nThe unique values of 'diagnosis':")
print(np.unique(train['diagnosis'].values))

The unique values in `test`:

In [None]:
print("The unique values of 'age_approx':")
print(np.unique(test['age_approx'].values))
print("\nThe unique values of 'sex':")
print(np.unique(test['sex'].values))
print("\nThe unique values of 'anatom_site_general_challenge':")
print(np.unique(test['anatom_site_general_challenge'].values))

Observe that the age values are all integers in both `train` and `test`. Let's cast `age_approx` to the `np.uint8` type.

In [None]:
train['age_approx']=train['age_approx'].astype(np.uint8)
test['age_approx']=test['age_approx'].astype(np.uint8)

Let's check whether `anatom_site_general_challenge` has the same set of values in `train` and `test`:

In [None]:
# np.array_equal also returns False (instead of failing) when the lengths differ
np.array_equal(np.unique(test['anatom_site_general_challenge'].values),
               np.unique(train['anatom_site_general_challenge'].values)
              )

Yes, it does. Now we will apply one-hot encoding to `sex` and `anatom_site_general_challenge`. We will not be one-hot encoding `diagnosis` since it is present only in the training set. 

In [None]:
train = pd.concat([train, pd.get_dummies(train['sex'], prefix='sex')], axis=1)
train = pd.concat([train, pd.get_dummies(train['anatom_site_general_challenge'], 
                                         prefix='site')], axis=1)
# train = pd.concat([train, pd.get_dummies(train['diagnosis'], prefix='diagn')], axis=1)

train.shape

In [None]:
test = pd.concat([test, pd.get_dummies(test['sex'], prefix='sex')], axis=1)
test = pd.concat([test, pd.get_dummies(test['anatom_site_general_challenge'],
                                       prefix='site')], axis=1)

test.shape
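
Because `pd.get_dummies` is applied to `train` and `test` separately, the two frames end up with the same indicator columns only when both contain the same categories. The check above covers `anatom_site_general_challenge`, but not `sex`. Below is a minimal sketch of one possible guard, using two small made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical toy frames; the 'unknown' category is absent from toy_test on purpose.
toy_train = pd.DataFrame({"sex": ["male", "female", "unknown"]})
toy_test = pd.DataFrame({"sex": ["female", "male", "male"]})

train_dummies = pd.get_dummies(toy_train["sex"], prefix="sex")
test_dummies = pd.get_dummies(toy_test["sex"], prefix="sex")

# Reindex the test indicators to the training columns;
# categories absent from the test frame become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```

After the reindexing, both frames expose the same indicator columns in the same order, which matters if the same feature list is later used for both sets.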

### Scaling the age feature

In [None]:
%%time

scaler=StandardScaler()

train['age_scaled']=scaler.fit_transform(train['age_approx'].values.reshape(-1, 1))
test['age_scaled']=scaler.transform(test['age_approx'].values.reshape(-1, 1))
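
To make the scaling step concrete, here is a small sketch of what standardization computes, written with plain NumPy on made-up ages. `StandardScaler` subtracts the mean and divides by the population standard deviation (`ddof=0`), and `transform` reuses the statistics learned in `fit`. The array `ages` and the value 55 are illustrative assumptions.

```python
import numpy as np

# Made-up "training" ages, for illustration only.
ages = np.array([30, 45, 50, 65, 80], dtype=np.float64)

# np.std defaults to ddof=0, the same population estimate StandardScaler uses.
mean, std = ages.mean(), ages.std()

scaled = (ages - mean) / std

# A "test" value is scaled with the training statistics, never with its own.
new_age_scaled = (55 - mean) / std

print(scaled.mean(), scaled.std())
```

The scaled training values have zero mean and unit standard deviation (up to floating-point error); test values scaled this way generally do not, and that is expected.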

### Loading the fold indices

We will use the validation and training fold indices generated in [this kernel](https://www.kaggle.com/graf10a/siim-stratified-groupkfold-5-folds).

In [None]:
train_idx={fn: np.load(PATH_FOLDS/f"train_idx_fold_{fn}.npy") for fn in range(1, NFOLDS+1)}
val_idx={fn: np.load(PATH_FOLDS/f"val_idx_fold_{fn}.npy") for fn in range(1, NFOLDS+1)}

for fn in range(1, NFOLDS+1):
    print("="*50)
    print(f"Fold {fn}:")
    print(f"The training set consists of {len(train_idx[fn])} elements.")
    print(f"The validation set consists of {len(val_idx[fn])} elements.")

    assert len(train)==(len(train_idx[fn])+len(val_idx[fn])), "Wrong total number of elements"

In [None]:
excluded_cols=['sex', 'anatom_site_general_challenge', 'diagnosis', ]

cols=[c for c in train.columns if c not in excluded_cols]

print(cols)
print(f"\nThe total number of features is {len(cols)}.")

In [None]:
train_fold={fn: train.loc[val_idx[fn], cols] for fn in range(1, NFOLDS+1)}
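
The assertion in the loading cell only compares sizes. Assuming the folds come from a standard K-fold split, a stronger sanity check is that the validation indices of all folds together cover every row exactly once. A minimal sketch on synthetic indices (`toy_val_idx` is a hypothetical stand-in for `val_idx`):

```python
import numpy as np

n_rows, n_folds = 10, 5

# Hypothetical stand-in for val_idx: fold k holds rows k-1, k-1+n_folds, ...
toy_val_idx = {fn: np.arange(fn - 1, n_rows, n_folds) for fn in range(1, n_folds + 1)}

all_val = np.concatenate([toy_val_idx[fn] for fn in range(1, n_folds + 1)])

# Every row index must appear in exactly one validation fold.
is_partition = len(all_val) == n_rows and np.array_equal(np.sort(all_val), np.arange(n_rows))
print(is_partition)
```

This matters here because each fold's TFRecord files are built from the validation indices of that fold, so a gap or an overlap would silently drop or duplicate images.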

### Turning the fold data into a TF dataset

Our next step is to make a set of TensorFlow datasets, one for each fold. Each dataset will include the input data and the labels. The input data will be placed in a dictionary, so we can access a feature by indexing the dictionary with the corresponding key.

In [None]:
labels={fn: train_fold[fn].pop('target') for fn in range(1, NFOLDS+1)}
dataset0 = {fn: tf.data.Dataset.from_tensor_slices((dict(train_fold[fn]), labels[fn])) 
            for fn in range(1, NFOLDS+1)
           }
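
As an illustration of the resulting structure, the sketch below builds the same kind of dataset from a tiny made-up frame and inspects one element. The frame `toy` and its values are assumptions for demonstration only.

```python
import pandas as pd
import tensorflow as tf

# A tiny made-up frame with the same kinds of columns as the fold data.
toy = pd.DataFrame({"image_name": ["ISIC_a", "ISIC_b"],
                    "age_scaled": [0.5, -1.2],
                    "target": [0, 1]})
toy_labels = toy.pop("target")

# Each element is a (features_dict, label) pair; the dict is keyed by column name.
toy_ds = tf.data.Dataset.from_tensor_slices((dict(toy), toy_labels))

for features, label in toy_ds.take(1):
    print(features["image_name"].numpy(), features["age_scaled"].numpy(), label.numpy())
```

Note that string columns come back as byte strings, which is why they can later be written as bytestring features.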

The function below reads each JPEG image file from the disk using the filename provided in the `image_name` column of `train` or `test`. Then it turns the JPEG-encoded image into a uint8 tensor using `tf.image.decode_jpeg`.

In [None]:
def decode_jpeg(data_dict, label): 
    fname="/kaggle/input/siim-isic-melanoma-classification/jpeg/train/" \
          +data_dict['image_name']+".jpg"
    bits = tf.io.read_file(fname)
    data_dict['image'] = tf.image.decode_jpeg(bits)  
    return data_dict, label

Applying this function to each of the datasets using `map`:

In [None]:
dataset1 = {fn: dataset0[fn].map(decode_jpeg, num_parallel_calls=AUTO) for fn in range(1, NFOLDS+1)}

### Resizing and cropping

The function below resizes and crops the original images. It also returns the original height and width so that they can be stored as features.

In [None]:
def resize_and_crop_image(data, label):
    # Resize and crop using "fill" algorithm:
    # always make sure the resulting image
    # is cut out from the source image so that
    # it fills the TARGET_SIZE entirely with no
    # black bars and a preserved aspect ratio.
    # A decoded image has shape [height, width, channels].
    h = tf.shape(data['image'])[0]
    w = tf.shape(data['image'])[1]
    th = TARGET_SIZE[0]
    tw = TARGET_SIZE[1]
    resize_crit = (h * tw) / (w * th)
    data['image'] = tf.cond(resize_crit < 1,
                            # if true: match the target height, the width overflows
                            lambda: tf.image.resize(data['image'], [h*th/h, w*th/h],
                                                    method='lanczos3',
                                                    antialias=True
                                                   ),
                            # if false: match the target width, the height overflows
                            lambda: tf.image.resize(data['image'], [h*tw/w, w*tw/w],
                                                    method='lanczos3',
                                                    antialias=True
                                                   )
                           )
    nh = tf.shape(data['image'])[0]
    nw = tf.shape(data['image'])[1]
    data['image'] = tf.image.crop_to_bounding_box(data['image'],
                                                  (nh - th) // 2,
                                                  (nw - tw) // 2,
                                                  th, tw
                                                 )
    # h and w are the original height and width, in that order
    return data, label, h, w

In [None]:
dataset2 = {fn: dataset1[fn].map(resize_and_crop_image, num_parallel_calls=AUTO) 
            for fn in range(1, NFOLDS+1)}
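
The arithmetic behind the "fill" algorithm can be traced in plain Python. The sketch below is not the notebook's function: `fill_resize_shape` is a hypothetical helper, `h` and `w` denote the number of rows and columns of the image, and the 4000x6000 input is a made-up example.

```python
def fill_resize_shape(h, w, th, tw):
    """Return the (height, width) after the resize step, before the center crop."""
    if (h / th) < (w / tw):
        # The image is relatively wider than the target:
        # match the heights and let the width overflow.
        return th, int(w * th / h)
    # Otherwise match the widths and let the height overflow.
    return int(h * tw / w), tw

nh, nw = fill_resize_shape(h=4000, w=6000, th=512, tw=512)
print(nh, nw)                            # 512 768
print((nh - 512) // 2, (nw - 512) // 2)  # 0 128
```

The second line of output gives the top and left offsets of the center crop: nothing is cut vertically, and 128 columns are removed from each side.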

### Recompress the images

Google Cloud Storage is capable of great throughput but has a per-file access penalty. Training on thousands of individual files will be too slow. We have to use the TFRecord format to group files together. To do that, we first need to recompress our images. The bandwidth savings outweigh the decoding CPU cost.

In [None]:
def recompress_image(data, label, h, w):
    data['image'] = tf.cast(data['image'], tf.uint8)
    data['image'] = tf.image.encode_jpeg(data['image'], 
                                         #quality=100, # the default is 95% (the original images 
                                         # are already compressed, so no need to increase this 
                                         # value -- we can't create new information.)
                                         optimize_size=True, 
                                         chroma_downsampling=False)
    return data, label, h, w

In [None]:
dataset3 = {fn: dataset2[fn].map(recompress_image, num_parallel_calls=AUTO) for fn in range(1, NFOLDS+1)}

### Write dataset to TFRecord files 

In [None]:
nb_images = {fn: len(train_fold[fn]) for fn in range(1, NFOLDS+1)}
shard_size = {fn: math.ceil(1.0 * nb_images[fn] / SHARDS) for fn in range(1, NFOLDS+1)}

for fn in range(1, NFOLDS+1):
    print("="*50)
    print(f"Fold {fn}:")
    print(f"The total number of images = {nb_images[fn]}")
    print(f"The number of .tfrec files = {SHARDS}")
    print(f"The number of images in each .tfrec file = {shard_size[fn]}")

Sharding: there will be one "batch" of images per file.

In [None]:
dataset4 = {fn: dataset3[fn].batch(shard_size[fn]) for fn in range(1, NFOLDS+1)}
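
A short sketch of the shard arithmetic, in plain Python. The helper `shard_sizes` and the image count 6626 are illustrative assumptions; the point is that batching by `ceil(n / SHARDS)` leaves the last file smaller than the others, and can even yield fewer than `SHARDS` files for small inputs.

```python
import math

def shard_sizes(n_images, n_shards):
    """Sizes of the batches obtained by batching n_images with ceil(n_images / n_shards)."""
    size = math.ceil(n_images / n_shards)
    full, rest = divmod(n_images, size)
    return [size] * full + ([rest] if rest else [])

print(shard_sizes(6626, 4))  # [1657, 1657, 1657, 1655]
print(shard_sizes(9, 4))     # [3, 3, 3]
```

This is why the record count is taken from the actual batch when the files are written, rather than from the precomputed shard size.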

Three types of data can be stored in TFRecords: bytestrings, integers and floats. They are always stored as lists; a single data element is stored as a list of length 1.

In [None]:
def _bytestring_feature(list_of_bytestrings):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=list_of_bytestrings))

In [None]:
def _int_feature(list_of_ints): # int64
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list_of_ints))

In [None]:
def _float_feature(list_of_floats): # float32
    return tf.train.Feature(float_list=tf.train.FloatList(value=list_of_floats))

In [None]:
def to_tfrecord(tfrec_filewriter, image, image_name, patient_id, 
                benign_malignant, age, age_scaled, sex_female, sex_male, sex_unknown, 
                site_head_neck, site_lower_extremity, site_oral_genital, 
                site_palms_soles, site_torso, site_unknown, site_upper_extremity, 
                label, height, width):

    feature = {
        # bytestring features
        "image": _bytestring_feature([image]), 
        "image_name": _bytestring_feature([image_name]),
        "patient_id": _bytestring_feature([patient_id]), 
        "benign_malignant": _bytestring_feature([benign_malignant]),
        # integer features
        "age": _int_feature([age]),
        "sex_female": _int_feature([sex_female]),        
        "sex_male": _int_feature([sex_male]),
        "sex_unknown": _int_feature([sex_unknown]),
        "site_head/neck": _int_feature([site_head_neck]),
        "site_lower extremity": _int_feature([site_lower_extremity]),
        "site_oral/genital": _int_feature([site_oral_genital]),
        "site_palms/soles": _int_feature([site_palms_soles]), 
        "site_torso": _int_feature([site_torso]), 
        "site_unknown": _int_feature([site_unknown]), 
        "site_upper extremity": _int_feature([site_upper_extremity]),
        "height": _int_feature([height]),
        "width": _int_feature([width]),
        "target": _int_feature([label]),
        # float features
        "age_scaled": _float_feature([age_scaled]),
    }
    
    return tf.train.Example(features=tf.train.Features(feature=feature))

In [None]:
print("Writing TFRecords")

for fn in range(1, NFOLDS+1):
    
    print("="*50)
    print(f"Fold {fn} out of {NFOLDS}:")
    
    for shard, (data, label, height, width) in enumerate(dataset4[fn]):
        # batch size used as shard size here; a separate name keeps
        # the `shard_size` dictionary defined above intact
        n_records = data['image'].numpy().shape[0]
        # good practice to have the number of records in the filename
        filename = "fold_{}_{:02d}-{}.tfrec".format(fn, shard, n_records)

        with tf.io.TFRecordWriter(filename) as out_file:
            for i in range(n_records):
                example = to_tfrecord(out_file,
                                      # re-compressed image: already a byte string
                                      data['image'].numpy()[i],
                                      data['image_name'].numpy()[i],
                                      data['patient_id'].numpy()[i],
                                      data['benign_malignant'].numpy()[i],
                                      data['age_approx'].numpy()[i],
                                      data['age_scaled'].numpy()[i],
                                      data['sex_female'].numpy()[i],
                                      data['sex_male'].numpy()[i],
                                      data['sex_unknown'].numpy()[i],
                                      data['site_head/neck'].numpy()[i],
                                      data['site_lower extremity'].numpy()[i],
                                      data['site_oral/genital'].numpy()[i],
                                      data['site_palms/soles'].numpy()[i],
                                      data['site_torso'].numpy()[i],
                                      data['site_unknown'].numpy()[i],
                                      data['site_upper extremity'].numpy()[i],
                                      label.numpy()[i],
                                      height.numpy()[i],
                                      width.numpy()[i]
                                     )

                out_file.write(example.SerializeToString())

            print("Wrote file {} containing {} records".format(filename, n_records))
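
To read such files back, each feature has to be declared with the type it was stored as. The sketch below is a self-contained round trip on a single made-up record with only three of the features; the file name `demo.tfrec` and all values are assumptions for illustration. The full schema would list every key used in `to_tfrecord`.

```python
import tensorflow as tf

# Write one tiny record with a subset of the features (values are made up).
example = tf.train.Example(features=tf.train.Features(feature={
    "image_name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"ISIC_demo"])),
    "target": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "age_scaled": tf.train.Feature(float_list=tf.train.FloatList(value=[0.25])),
}))
with tf.io.TFRecordWriter("demo.tfrec") as writer:
    writer.write(example.SerializeToString())

# The parsing schema names each feature together with its stored type.
schema = {
    "image_name": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.int64),
    "age_scaled": tf.io.FixedLenFeature([], tf.float32),
}

parsed_ds = tf.data.TFRecordDataset("demo.tfrec").map(
    lambda record: tf.io.parse_single_example(record, schema))

parsed = next(iter(parsed_ds))
print(parsed["image_name"].numpy(), parsed["target"].numpy(), parsed["age_scaled"].numpy())
```

For the real files, the `image` entry would be parsed as a `tf.string` scalar and then passed through `tf.image.decode_jpeg` to recover the pixel tensor.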