## Validation Dataset

One of the most important things I've learned participating in Kaggle competitions is that [validation sets matter](https://joshvarty.com/2019/03/25/validation-sets-matter/)! Let's take a quick look at the results we've been getting so far:

From [GitHub](https://github.com/JoshVarty/KaggleClouds/issues/5):

| Description | n_train | NFolds | Threshold/MinSize | Valid | LB |
| :---         |:---         |      :---:      |    :---:      |          :---: |:---: |
| Resnet18, get_transforms() | - | 2 |  0.5/10000   | 0.446 | 0.600|
| Resnet18, no warp, no rotate  | 4000 |2 |  0.5/10000   | 0.452  | 0.597 |
| XResnet18, no warp, no rotate (no pretrain)  | 4000 |2 |  0.5/10000   | 0.401  | 0.588 |
| Resnet18, get_transforms() DiceLoss  | 4000 | 2 |  0.5/10000  | 0.508 | 0.607 |
| Resnet18, no warp, no rotate DiceLoss  | 4000 | 2 |  0.5/10000   | 0.490  | 0.600 |


Notice that our validation score (`Valid`) and leaderboard score (`LB`) do not seem strongly correlated. Often we improve on our valid dataset but do worse on the LB (and vice versa).

There are some possible reasons for this:

- Our valid score comes from `multiclass_dice_score` which does not take into account thresholding or minimum segment size. Our LB submissions use a threshold value of `0.5` and a minimum segment size of `10,000`
- Our train/valid set is small. We're only using 1,000 images (4,000 train items / 4 classes) and only 2 folds.
- Our validation datasets may come from a different distribution than our test dataset
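
The first point can be illustrated with a minimal sketch. It assumes the post-processing simply zeroes out a mask whose total area falls below the minimum size (the real submission code may filter per connected component instead) and that two empty masks score as a perfect match. `post_process` and `dice` are hypothetical helpers written for this example; they are not the notebook's `multiclass_dice_score`.

```python
import numpy as np

def post_process(probs, threshold=0.5, min_size=10_000):
    # Binarize the probability map
    mask = probs > threshold
    # Assumption: drop the whole mask when its total area is below min_size
    if mask.sum() < min_size:
        return np.zeros_like(mask)
    return mask

def dice(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumption: two empty masks count as a perfect match
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / total

probs = np.zeros((350, 525))
probs[:50, :50] = 0.9          # a small 2,500-pixel false positive
target = np.zeros((350, 525))  # ground truth: no mask for this class

raw_score = dice(probs > 0.5, target)            # thresholding only
post_score = dice(post_process(probs), target)   # thresholding + minimum size
print(raw_score, post_score)
```

In this toy case the same prediction scores very differently depending on whether the minimum-size filter is applied, which is one way a validation metric without post-processing can drift from the leaderboard metric.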


Let's start by taking a look at the images in our valid dataset.

In [2]:
import pandas as pd
import numpy as np

from PIL import Image
from fastai.vision import get_image_files
from pathlib import Path
from sklearn.model_selection import train_test_split, StratifiedKFold

In [3]:
NFOLDS = 5
RANDOM_STATE = 42
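# shuffle=True is needed for random_state to have any effect;
# recent scikit-learn versions raise an error if it is omitted
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=RANDOM_STATE)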

DATA = Path('data')
TRAIN = DATA/"train.csv"
TEST = DATA/"sample_submission.csv"

In [4]:
train = pd.read_csv(TRAIN)
test = pd.read_csv(TEST)

train['label'] = train['Image_Label'].apply(lambda x: x.split('_')[1])
train['im_id'] = train['Image_Label'].apply(lambda x: x.split('_')[0])
test['label'] = test['Image_Label'].apply(lambda x: x.split('_')[1])
test['im_id'] = test['Image_Label'].apply(lambda x: x.split('_')[0])

In [5]:
# Count how many non-NaN masks are present for each image;
# the K-Fold splits are stratified on this count
id_mask_count = (
    train.loc[train['EncodedPixels'].notnull(), 'Image_Label']
    .apply(lambda x: x.split('_')[0])
    .value_counts()
    .rename_axis('img_id')        # name the index explicitly (works across pandas versions)
    .reset_index(name='count')
)

We're using `id_mask_count` to create our folds. It contains the number of masks present for each image.

In [6]:
id_mask_count.head()

| | img_id | count |
| :--- | :--- | :---: |
| 0 | 6851470.jpg | 4 |
| 1 | e8b4c68.jpg | 4 |
| 2 | baee301.jpg | 4 |
| 3 | 7b59752.jpg | 4 |
| 4 | 0eff39e.jpg | 4 |
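
As a rough sketch of how a per-image mask count can drive the fold assignment, the example below stratifies a toy table on its `count` column. The file names and counts are made up for illustration, and the exact way the notebook feeds `id_mask_count` into `StratifiedKFold` is an assumption.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for id_mask_count (hypothetical file names and counts)
toy = pd.DataFrame({
    'img_id': ['img{}.jpg'.format(i) for i in range(8)],
    'count': [1, 1, 2, 2, 3, 3, 4, 4],
})

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

valid_counts = []
for train_idx, valid_idx in skf.split(toy['img_id'], toy['count']):
    # Each validation fold should contain every mask count in equal proportion
    counts = sorted(toy.loc[valid_idx, 'count'].tolist())
    valid_counts.append(counts)
    print(toy.loc[valid_idx, 'img_id'].tolist(), counts)
```

Stratifying on the mask count keeps the share of images with 1, 2, 3, or 4 masks the same in every fold, but it says nothing about near-duplicate images.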


This seems like a sensible way to split our data but we may also want to take other things into consideration. From our previous analyses we've noticed that many images across the dataset are very similar to one another. I believe we should also take this fact into account when creating our validation dataset.

In [8]:
# Load image paths
TRAIN_FOLDER = DATA/'train_images_350x525'
TEST_FOLDER = DATA/'test_images_350x525'
train_images = get_image_files(TRAIN_FOLDER)
test_images = get_image_files(TEST_FOLDER)

In [25]:
train_train_pairs = np.load(DATA/'train_train_pairs.npy', allow_pickle=True)[()]
train_test_pairs = np.load(DATA/'train_test_pairs.npy', allow_pickle=True)[()]
test_test_pairs = np.load(DATA/'test_test_pairs.npy', allow_pickle=True)[()]
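
The `[()]` indexing above unwraps a Python object that NumPy stored as a zero-dimensional object array. The sketch below shows the round trip with a toy dictionary; the assumption that the pair files hold a dict mapping one image index to the index of its similar image is inferred from how they are used later, and the temporary file name is hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy pairs: image 0 is similar to image 3, image 5 to image 7
pairs = {0: 3, 5: 7}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_pairs.npy')
    np.save(path, pairs)                       # stored as a 0-d object array (pickled)
    loaded = np.load(path, allow_pickle=True)  # allow_pickle is required to read it back

loaded_shape = loaded.shape  # () for a 0-d array
restored = loaded[()]        # indexing with an empty tuple returns the dict itself
print(loaded_shape, restored)
```

`loaded.item()` is an equivalent way to get the dictionary back.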

## Desired Validation Distribution

We would like our validation distribution to match our test distribution. So what does our test set look like?

In [26]:
print("Similar pairs within training set\t{}\t{}".format(len(train_train_pairs), len(train_train_pairs)/len(train_images)))
print("Similar pairs across train-test sets\t{}\t{}".format(len(train_test_pairs), len(train_test_pairs)/len(train_images)))
print("Unique images\t\t\t\t{}\t{}".format(len(train_images)-len(train_train_pairs)-len(train_test_pairs),(len(train_images)-len(train_train_pairs)-len(train_test_pairs))/len(train_images)))
print()

print("Similar pairs within test set\t\t{}\t{}".format(len(test_test_pairs), len(test_test_pairs)/len(test_images)))
print("Similar pairs across train-test sets\t{}\t{}".format(len(train_test_pairs), len(train_test_pairs)/len(test_images)))
print("Unique images\t\t\t\t{}\t{}".format(len(test_images)-len(test_test_pairs)-len(train_test_pairs),(len(test_images)-len(test_test_pairs)-len(train_test_pairs))/len(test_images)))
print()

Similar pairs within training set	1770	0.3191489361702128
Similar pairs across train-test sets	1202	0.2167327803822575
Unique images				2574	0.46411828344752976

Similar pairs within test set		798	0.21579232017306652
Similar pairs across train-test sets	1202	0.32504056246619795
Unique images				1698	0.4591671173607355



So we want a valid dataset in which:
 - 22% of images have a similar pair inside the validation set
 - 33% of images have a corresponding pair in the training set
 - 46% of images are "unique"
 
We have 5,546 training images, so we'll take 1,000 images to form our validation dataset:

 - 216 images belong to pairs within the validation set
 - 325 images have a pair in our training set
 - 459 images are unique
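
The target counts follow from scaling the test-set proportions to 1,000 images. A small sketch of that arithmetic, using the test-set counts printed earlier (the dictionary keys are labels chosen for this example):

```python
n_valid = 1000

# Test-set counts from the output above
test_counts = {
    'pair_in_same_set': 798,    # similar pairs within the test set
    'pair_in_other_set': 1202,  # similar pairs across train-test sets
    'unique': 1698,             # images with no similar pair
}
n_test = sum(test_counts.values())  # the three groups add up to all test images

# Scale each proportion to the validation size and round to whole images
targets = {k: round(n_valid * v / n_test) for k, v in test_counts.items()}
print(n_test, targets)
```

The three rounded targets happen to sum to exactly 1,000, so no adjustment is needed.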

In [117]:
valid_valid_pairs = {}
train_valid_pairs = {}
valid_unique = {}

In [118]:
# Select 216 images from train_train_pairs to put into valid_valid_pairs
keys = list(train_train_pairs.keys())
valid_valid_list = []
i = 0

while len(valid_valid_list) < 216:

    currentKey = keys[i]
    partner = train_train_pairs[currentKey]
    # Skip the pair if either image has already been selected
    if currentKey not in valid_valid_list and partner not in valid_valid_list:
        valid_valid_list.append(currentKey)
        valid_valid_list.append(partner)

    i = i + 1

In [119]:
# Select 325 images from train_train_pairs that have a corresponding pair in the training set
keys = list(train_train_pairs.keys())
train_valid_list = []
i = 0

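while len(train_valid_list) < 325:

    currentKey = keys[i]
    partner = train_train_pairs[currentKey]

    # Take the image only if it is new and its partner is not already in the validation set
    if (currentKey not in train_valid_list and currentKey not in valid_valid_list
            and partner not in train_valid_list and partner not in valid_valid_list):
        train_valid_list.append(currentKey)

    i = i + 1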

In [120]:
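# Choose 459 images that have not been selected yet.
# Caveat: the loop below does not consult train_train_pairs or train_test_pairs,
# so it does not guarantee that these images are actually unique.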
unique_images = []
i = 0

while len(unique_images) < 459:
    
    if i not in valid_valid_list and i not in train_valid_list:
        unique_images.append(i)
        
    i = i + 1

In [121]:
# Build a new list so that valid_valid_list is not modified in place
valid_list = valid_valid_list + train_valid_list + unique_images

In [123]:
print(len(valid_list))
print(len(set(valid_list)))

1000
1000
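
Checking for duplicates confirms the size of the selection, but not its composition. One possible sanity check is sketched below under the assumption that the pairs dictionary maps an image index to the index of its similar image. `split_composition` is a hypothetical helper, and it only looks at train-train pairs, not at train-test pairs.

```python
def split_composition(valid_ids, pairs):
    """Count validation images by where their similar partner ended up."""
    valid = set(valid_ids)

    # Treat pairs as symmetric even if the dict stores only one direction
    partner = {}
    for a, b in pairs.items():
        partner.setdefault(a, b)
        partner.setdefault(b, a)

    counts = {'pair_in_valid': 0, 'pair_in_train': 0, 'unique': 0}
    for i in valid:
        if i not in partner:
            counts['unique'] += 1
        elif partner[i] in valid:
            counts['pair_in_valid'] += 1
        else:
            counts['pair_in_train'] += 1
    return counts

# Toy data: images 0/1, 2/3 and 4/5 are similar; image 6 has no partner
toy_pairs = {0: 1, 2: 3, 4: 5}
composition = split_composition([0, 1, 2, 6], toy_pairs)
print(composition)
```

Running a check like this on the real selection would show how close it lands to the 216 / 325 / 459 target.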


In [126]:
valid_im_ids = [train_images[k].name for k in valid_list]

Now we can create our validation dataset and corresponding training dataset!

In [140]:
valid = train[train['im_id'].isin(valid_im_ids)]
newTrain = train[~train['im_id'].isin(valid_im_ids)]

In [144]:
print(len(valid))
print(len(newTrain))

4000
18184
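
These row counts are consistent with the numbers above: 5,546 images × 4 labels gives 22,184 rows, of which 1,000 × 4 = 4,000 go to validation and 18,184 remain for training. A small sketch of the same split on toy data, with a check that no image ends up on both sides (the file names are made up):

```python
import pandas as pd

labels = ['Fish', 'Flower', 'Gravel', 'Sugar']
im_ids = ['a.jpg', 'b.jpg', 'c.jpg']  # hypothetical image names

# One row per (image, label), mirroring the Image_Label column
toy = pd.DataFrame({
    'Image_Label': ['{}_{}'.format(im, lab) for im in im_ids for lab in labels]
})
toy['im_id'] = toy['Image_Label'].apply(lambda x: x.split('_')[0])

valid_ids = ['b.jpg']
toy_valid = toy[toy['im_id'].isin(valid_ids)]
toy_train = toy[~toy['im_id'].isin(valid_ids)]

# Every row lands in exactly one split, and no image is shared between them
rows_preserved = len(toy_valid) + len(toy_train) == len(toy)
no_leak = set(toy_valid['im_id']).isdisjoint(toy_train['im_id'])
print(len(toy_valid), len(toy_train), rows_preserved, no_leak)
```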


In [146]:
valid.to_csv(DATA/'valid_split.csv', index=False)
newTrain.to_csv(DATA/'train_split.csv', index=False)