## Validation Set

In the competitions I've looked at, probably the most important step is creating a realistic validation set. Without one it's very hard to know whether a change actually improves the model, or whether that improvement will show up in the leaderboard score.


The dataset was created by taking scans of multiple individuals and breaking them up into smaller pieces, some with cancer, some without. A good validation set will be made up of images taken from **people we've never seen before**. This will prevent us from overfitting to individuals and hopefully encourage us to build a model that generalizes across different people.

In [12]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision.models as models
from tqdm import tqdm

In [6]:
wsi = pd.read_csv('data/patch_id_wsi.csv')
len(wsi)
wsi.head()

Unnamed: 0,id,wsi
0,f38a6374c348f90b587e046aac6079959adf3835,camelyon16_train_normal_033
1,c18f2d887b7ae4f6742ee445113fa1aef383ed77,camelyon16_train_tumor_054
2,755db6279dae599ebb4d39a9123cce439965282d,camelyon16_train_tumor_008
3,bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,camelyon16_train_tumor_077
4,acfe80838488fae3c89bd21ade75be5c34e66be7,camelyon16_train_tumor_036


In [14]:
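lookup = {}
# Group patch ids by the numeric suffix of their whole-slide image (WSI) name.
# Iterating over both columns directly avoids a slow wsi.iloc[i] lookup per row.
for imageId, slide_name in tqdm(zip(wsi['id'], wsi['wsi'])):

    slide_id = slide_name.split('_')[-1]
    if slide_id not in lookup:
        lookup[slide_id] = []

    lookup[slide_id].append(imageId)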

192752it [00:18, 10154.42it/s]


In [19]:
len(lookup)

145
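
Note that the grouping key above is only the trailing number of the slide name, so two slides such as `camelyon16_train_normal_033` and `camelyon16_train_tumor_033` would share one bucket. That does not hurt the split, since a bucket is still held out as a whole, but the count is then of numeric suffixes rather than of distinct slides. A minimal sketch of grouping by the full slide name instead, on a few made-up rows (the `rows` list and the helper `group_by_slide` are illustrative and not part of the notebook):

```python
from collections import defaultdict

def group_by_slide(rows):
    """Map each full slide name to the list of patch ids cut from it."""
    groups = defaultdict(list)
    for patch_id, slide_name in rows:
        groups[slide_name].append(patch_id)
    return dict(groups)

# Hypothetical (patch id, slide name) pairs, for illustration only
rows = [
    ("a1", "camelyon16_train_normal_033"),
    ("b2", "camelyon16_train_tumor_033"),
    ("c3", "camelyon16_train_tumor_033"),
]

by_slide = group_by_slide(rows)
# Keys the notebook's approach would produce for the same rows
by_suffix = {slide_name.split('_')[-1] for _, slide_name in rows}

print(len(by_slide), len(by_suffix))
```

Whether slides that share a number also share a patient is not stated here, so treat the choice of key as an assumption to check against the dataset's documentation.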

In [22]:
validationIds = []
lookupIterator = iter(lookup.keys())

while len(validationIds) < 20000: #~10% of training set
    
    slideId = next(lookupIterator)
    
    imageIds = lookup[slideId]
    validationIds.extend(imageIds)
    print("Added slide", slideId)


Added slide 033
Added slide 054
Added slide 008
Added slide 077
Added slide 036
Added slide 004
Added slide 094
Added slide 034
Added slide 023


In [24]:
print(len(validationIds))

20381
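
The loop above takes slides in the order they first appear in the CSV. If that order is not random, the held-out slides may not be representative. One possible variant, sketched here under that assumption, shuffles the slide ids with a fixed seed before picking them. The function name `pick_validation_slides`, its parameters, and the toy `toy_lookup` are hypothetical:

```python
import random

def pick_validation_slides(lookup, target, seed=0):
    """Hold out whole slides, in seeded random order, until at least
    `target` patches are covered (or the slides run out)."""
    slide_ids = sorted(lookup)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(slide_ids)

    chosen, n_patches = [], 0
    for slide_id in slide_ids:
        if n_patches >= target:
            break
        chosen.append(slide_id)
        n_patches += len(lookup[slide_id])
    return chosen, n_patches

# Toy stand-in for the real lookup dict (slide id -> list of patch ids)
toy_lookup = {"001": ["a", "b"], "002": ["c"], "003": ["d", "e", "f"]}
slides, n = pick_validation_slides(toy_lookup, target=3)
print(slides, n)
```

Unlike the `while` loop with `next(...)`, this version stops cleanly when the slides run out instead of raising `StopIteration`.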


Now we move all of our validation images out of the `data/train` folder and into the `data/valid` folder, so that no validation image remains in the training set.

In [27]:
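import os
import shutil

os.makedirs('data/valid', exist_ok=True)  # make sure the destination folder exists

for imageId in tqdm(validationIds):
    fromPath = 'data/train/' + imageId + '.tif'
    toPath = 'data/valid/' + imageId + '.tif'
    shutil.move(fromPath, toPath)  # move rather than copy, so the image leaves data/train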

100%|██████████| 20381/20381 [00:01<00:00, 15512.68it/s]
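
With the files in place, it can be worth confirming that no slide contributes patches to both sets, since that is the property the whole split depends on. A minimal sketch of such a check, working on the `lookup` structure and a list of validation ids (the helper `slides_split_across_sets` and the toy data are hypothetical):

```python
def slides_split_across_sets(lookup, validation_ids):
    """Return the slide ids whose patches are only partly in the validation set."""
    valid = set(validation_ids)
    leaky = []
    for slide_id, image_ids in lookup.items():
        n_in_valid = sum(1 for image_id in image_ids if image_id in valid)
        # A clean slide is either fully held out or fully in training
        if 0 < n_in_valid < len(image_ids):
            leaky.append(slide_id)
    return leaky

# Toy stand-in for the real lookup dict (slide id -> list of patch ids)
toy_lookup = {"001": ["a", "b"], "002": ["c"], "003": ["d", "e"]}

clean = slides_split_across_sets(toy_lookup, ["a", "b", "c"])
leaky = slides_split_across_sets(toy_lookup, ["a", "d", "e"])
print(clean)  # []
print(leaky)  # ['001']
```

On the real data the call would be `slides_split_across_sets(lookup, validationIds)`, and an empty list means every slide sits entirely on one side of the split.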
