# Creating the training and test sets

This notebook creates our training and test sets from the HAM10000 dataset. It keeps a single image per lesion, so no lesion appears twice, and retains only the melanocytic nevi (`nv`) and melanoma (`mel`) classes. The two sets are saved as pickled DataFrames that other parts of our code load.

The HAM10000 dataset is available at https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000

## Reference
Tschandl, P., Rosendahl, C., Kittler, H.: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, 180161 (2018)


### Import statements

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from glob import glob
from PIL import Image
np.random.seed(123)


### Get all paths

This step assumes the dataset has been unzipped into the working directory. The images are stored in two subfolders, `HAM10000_images_part_1` and `HAM10000_images_part_2`.

In [2]:
path_list = glob('HAM10000_images_part_1/*.jpg') + \
        glob('HAM10000_images_part_2/*.jpg')

In [3]:
len(path_list)

10015

In [4]:
# Map each image id (the file name without its extension) to the file path
imageid_path_dict = {
    os.path.splitext(os.path.basename(x))[0]: x for x in path_list
}
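To make the id extraction concrete, here is a minimal sketch on two made-up paths. The file names in `example_paths` and the helper `image_id_from_path` are hypothetical and only serve the illustration; they are not part of the notebook.

```python
import os

# Hypothetical example paths, for illustration only.
example_paths = [
    'HAM10000_images_part_1/ISIC_0027419.jpg',
    'HAM10000_images_part_2/ISIC_0031633.jpg',
]


def image_id_from_path(path):
    """Strip the directory and the extension, leaving the image id."""
    return os.path.splitext(os.path.basename(path))[0]


example_id_to_path = {image_id_from_path(p): p for p in example_paths}

print(sorted(example_id_to_path))          # ['ISIC_0027419', 'ISIC_0031633']
print(example_id_to_path['ISIC_0027419'])  # HAM10000_images_part_1/ISIC_0027419.jpg
```

The resulting keys have the same form as the `image_id` column of the metadata file, which is what allows the lookup in the next step.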

In [5]:
lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}

Load the metadata and add three columns: the image path, a readable lesion name, and an integer class code.

In [6]:
skin_df = pd.read_csv('HAM10000_metadata.csv')
skin_df['path'] = skin_df['image_id'].map(imageid_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get) 
skin_df['cell_type_idx'] = pd.Categorical(skin_df['cell_type']).codes
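The integer codes come from `pd.Categorical`, which by default sorts the distinct string values and numbers them in that order. The sketch below shows this on a small made-up series; the `labels` values are taken from the lesion names above, but the series itself is only an illustration.

```python
import pandas as pd

# A toy label column, for illustration only.
labels = pd.Series(['Melanoma', 'Melanocytic nevi',
                    'Basal cell carcinoma', 'Melanoma'])

cat = pd.Categorical(labels)

# Categories are the sorted distinct values; codes index into them.
print(list(cat.categories))  # ['Basal cell carcinoma', 'Melanocytic nevi', 'Melanoma']
print(list(cat.codes))       # [2, 1, 0, 2]
```

Because the codes depend on which labels are present, they should be treated as local to this dataframe rather than as fixed class ids.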

In [7]:
skin_df.head()

| | lesion_id | image_id | dx | dx_type | age | sex | localization | path | cell_type | cell_type_idx |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 |
| 1 | HAM_0000118 | ISIC_0025030 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 |
| 2 | HAM_0002730 | ISIC_0026769 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 |
| 3 | HAM_0002730 | ISIC_0025661 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 |
| 4 | HAM_0001466 | ISIC_0031633 | bkl | histo | 75.0 | male | ear | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 |


In [8]:
skin_df['lesion_id'].unique().shape

(7470,)

There are 10015 images but only 7470 unique lesion ids, so some lesions were photographed more than once. Keep only the first image of each lesion.

In [9]:
skin_df = skin_df.drop_duplicates(subset='lesion_id')
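As a small illustration of what `drop_duplicates(subset='lesion_id')` does, the sketch below runs it on a made-up frame. The ids in `toy` are hypothetical. With the default `keep='first'`, the first row of each lesion survives and the original index labels are preserved.

```python
import pandas as pd

# Toy metadata with hypothetical ids: lesions A and C each have two images.
toy = pd.DataFrame({
    'lesion_id': ['HAM_A', 'HAM_A', 'HAM_B', 'HAM_C', 'HAM_C'],
    'image_id':  ['img_1', 'img_2', 'img_3', 'img_4', 'img_5'],
})

deduped = toy.drop_duplicates(subset='lesion_id')

print(deduped['image_id'].tolist())  # ['img_1', 'img_3', 'img_4']
print(deduped.index.tolist())        # [0, 2, 3]
```

The gaps in the index are expected at this point, which is why the index is reset later on.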

Count the missing values in each column. Only `age` has any, and we replace them with the mean age.

In [10]:
skin_df.isnull().sum()

lesion_id         0
image_id          0
dx                0
dx_type           0
age              52
sex               0
localization      0
path              0
cell_type         0
cell_type_idx     0
dtype: int64

In [11]:
# Assign the result back instead of using inplace=True on a column selection
skin_df['age'] = skin_df['age'].fillna(skin_df['age'].mean())
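A quick worked example of mean imputation on a made-up column. The values in `toy` are illustrative. `Series.mean()` skips NaN by default, so the mean is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value, for illustration only.
toy = pd.DataFrame({'age': [80.0, np.nan, 60.0, 70.0]})

mean_age = toy['age'].mean()           # (80 + 60 + 70) / 3, NaN is ignored
toy['age'] = toy['age'].fillna(mean_age)

print(mean_age)              # 70.0
print(toy['age'].tolist())   # [80.0, 70.0, 60.0, 70.0]
```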

Check that all null values have been filled.

In [12]:
skin_df.isnull().sum()

lesion_id        0
image_id         0
dx               0
dx_type          0
age              0
sex              0
localization     0
path             0
cell_type        0
cell_type_idx    0
dtype: int64

Open every image and store it in the dataframe as a NumPy array.

In [13]:
skin_df['image'] = skin_df['path'].map(lambda x: np.asarray(Image.open(x)))

In [14]:
# Checking the image size distribution
skin_df['image'].map(lambda x: x.shape).value_counts()

(450, 600, 3)    7470
Name: image, dtype: int64
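Since every image is kept in memory at full size, it helps to estimate the footprint before running this step. The sketch below does the arithmetic from the counts shown above, assuming one byte per channel value (`uint8`), which is the usual result of converting an RGB JPEG to an array but is an assumption here.

```python
import numpy as np

n_images = 7470                # rows left after dropping duplicate lesions
image_shape = (450, 600, 3)    # height, width, RGB channels

# Assumption: each channel value is stored as one byte (uint8).
bytes_per_image = np.zeros(image_shape, dtype=np.uint8).nbytes
total_bytes = n_images * bytes_per_image

print(bytes_per_image)                   # 810000
print(round(total_bytes / 1024**3, 2))   # 5.64 (GiB)
```

Under that assumption the image column alone needs several gigabytes of RAM, before any copies made by later operations.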

In [15]:
skin_df.head()

| | lesion_id | image_id | dx | dx_type | age | sex | localization | path | cell_type | cell_type_idx | image |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 | [[[188, 147, 191], [186, 148, 189], [187, 150,... |
| 2 | HAM_0002730 | ISIC_0026769 | bkl | histo | 80.0 | male | scalp | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 | [[[186, 128, 140], [188, 128, 136], [183, 126,... |
| 4 | HAM_0001466 | ISIC_0031633 | bkl | histo | 75.0 | male | ear | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 | [[[122, 80, 102], [124, 82, 104], [127, 83, 10... |
| 6 | HAM_0002761 | ISIC_0029176 | bkl | histo | 60.0 | male | face | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 | [[[188, 142, 118], [189, 141, 119], [189, 142,... |
| 8 | HAM_0005132 | ISIC_0025837 | bkl | histo | 70.0 | female | back | ./../../Resources/HAM10000\HAM10000_images_par... | Benign keratosis-like lesions | 2 | [[[94, 58, 58], [91, 61, 59], [91, 61, 59], [9... |


In [16]:
skin_df.shape

(7470, 11)

In [17]:
# Keep only the diagnosis label ('dx') and the image pixels
skin_df = skin_df.drop(columns=['lesion_id', 'image_id', 'dx_type', 'age', 'sex',
                                'localization', 'path', 'cell_type', 'cell_type_idx'])

Keep only the `nv` (melanocytic nevi) and `mel` (melanoma) images and drop every other class.

In [18]:
skin_df = skin_df[(skin_df.dx == 'mel') | (skin_df.dx == 'nv')]

In [19]:
# Rename the label column from 'dx' to 'id'; pop + assign also moves it to the end
skin_df['id'] = skin_df.pop('dx')

In [20]:
skin_df = skin_df.reset_index(drop=True)

In [21]:
skin_df.head()

| | image | id |
|---|---|---|
| 0 | [[[163, 137, 166], [163, 137, 162], [164, 138,... | nv |
| 1 | [[[229, 146, 164], [229, 146, 162], [233, 149,... | nv |
| 2 | [[[120, 99, 104], [125, 102, 108], [128, 103, ... | mel |
| 3 | [[[128, 106, 129], [129, 107, 128], [130, 106,... | mel |
| 4 | [[[24, 12, 22], [21, 9, 19], [18, 6, 16], [19,... | mel |


### Save a separate test set

Our test set contains 200 `mel` and 200 `nv` images. They are sampled at random with a fixed seed and then removed from the training set.

In [22]:
skin_df.shape

(6017, 2)

In [23]:
mel = skin_df[skin_df.id == 'mel'].sample(n=200,random_state=200)
mel.shape

(200, 2)


In [24]:
# Remove the sampled melanoma rows from the training set (drop by index label)
skin_df = skin_df.drop(mel.index)

In [25]:
skin_df.shape

(5817, 2)

In [26]:
skin_df = skin_df.reset_index(drop=True)

In [27]:
nv = skin_df[skin_df.id == 'nv'].sample(n=200,random_state=200)
nv.shape

(200, 2)


Drop the test images from the training set

In [28]:
skin_df = skin_df.drop(nv.index)

In [29]:
skin_df.shape

(5617, 2)

In [30]:
skin_df = skin_df.reset_index(drop=True)

In [31]:
skin_df.head()

| | image | id |
|---|---|---|
| 0 | [[[163, 137, 166], [163, 137, 162], [164, 138,... | nv |
| 1 | [[[229, 146, 164], [229, 146, 162], [233, 149,... | nv |
| 2 | [[[120, 99, 104], [125, 102, 108], [128, 103, ... | mel |
| 3 | [[[24, 12, 22], [21, 9, 19], [18, 6, 16], [19,... | mel |
| 4 | [[[180, 146, 144], [182, 147, 145], [182, 147,... | mel |


In [32]:
test_df = pd.concat((mel, nv))

In [33]:
test_df.shape

(400, 2)
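The sample, drop, and concatenate steps above can be folded into one helper. The following is a minimal sketch, not the notebook's own code: the function name `split_balanced_test_set`, its parameters, and the toy frame are hypothetical. It assumes the dataframe has a unique index so that dropping by index label removes exactly the sampled rows.

```python
import pandas as pd


def split_balanced_test_set(df, label_col, n_per_class, seed):
    """Hold out n_per_class rows of every label; return (train, test).

    Assumes df has a unique index.
    """
    parts = [
        df[df[label_col] == label].sample(n=n_per_class, random_state=seed)
        for label in df[label_col].unique()
    ]
    test = pd.concat(parts)
    train = df.drop(test.index)  # remove the held-out rows by index label
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy data: integers stand in for the image arrays.
toy = pd.DataFrame({'image': list(range(10)),
                    'id': ['mel'] * 4 + ['nv'] * 6})

train, test = split_balanced_test_set(toy, 'id', n_per_class=2, seed=200)

print(len(train), len(test))                         # 6 4
print(set(train['image']).isdisjoint(test['image'])) # True
```

Checking that the two sets share no rows, as in the last line, is a cheap guard against leaking test images into training.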

In [34]:
test_df.head()

| | image | id |
|---|---|---|
| 379 | [[[176, 157, 185], [176, 157, 185], [178, 159,... | mel |
| 105 | [[[157, 133, 159], [157, 133, 155], [156, 131,... | mel |
| 163 | [[[48, 28, 37], [51, 30, 39], [52, 32, 41], [6... | mel |
| 455 | [[[156, 111, 106], [153, 112, 106], [157, 117,... | mel |
| 386 | [[[255, 194, 189], [255, 194, 191], [255, 191,... | mel |


In [35]:
test_df = test_df.reset_index(drop=True)

In [36]:
skin_df.head()

| | image | id |
|---|---|---|
| 0 | [[[163, 137, 166], [163, 137, 162], [164, 138,... | nv |
| 1 | [[[229, 146, 164], [229, 146, 162], [233, 149,... | nv |
| 2 | [[[120, 99, 104], [125, 102, 108], [128, 103, ... | mel |
| 3 | [[[24, 12, 22], [21, 9, 19], [18, 6, 16], [19,... | mel |
| 4 | [[[180, 146, 144], [182, 147, 145], [182, 147,... | mel |


In [37]:
skin_df.shape

(5617, 2)

In [38]:
test_df.head()

| | image | id |
|---|---|---|
| 0 | [[[176, 157, 185], [176, 157, 185], [178, 159,... | mel |
| 1 | [[[157, 133, 159], [157, 133, 155], [156, 131,... | mel |
| 2 | [[[48, 28, 37], [51, 30, 39], [52, 32, 41], [6... | mel |
| 3 | [[[156, 111, 106], [153, 112, 106], [157, 117,... | mel |
| 4 | [[[255, 194, 189], [255, 194, 191], [255, 191,... | mel |


In [39]:
test_df.shape

(400, 2)

Save both dataframes as pickles. Because the file names end in `.zip`, pandas compresses the pickles with zip.

In [40]:
skin_df.to_pickle('NvAndMelNoDuplicatesFullSize.zip')

In [41]:
test_df.to_pickle('NvAndMelNoDuplicatesFullSizeTestSet.zip')
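Other parts of the code can load the sets back with `pd.read_pickle`, which infers the compression from the file extension in the same way. The sketch below checks a round trip on a tiny made-up frame written to a temporary directory. The file name `toy_set.zip` and the 2x2 images are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same two columns as the real sets.
toy = pd.DataFrame({'id': ['nv', 'mel']})
toy['image'] = toy['id'].map(
    lambda label: np.full((2, 2, 3), len(label), dtype=np.uint8)
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_set.zip')
    toy.to_pickle(path)              # zip compression inferred from '.zip'
    restored = pd.read_pickle(path)  # decompression inferred the same way

# Compare the image arrays element by element.
images_match = all(
    np.array_equal(a, b) for a, b in zip(toy['image'], restored['image'])
)
print(images_match)               # True
print(restored['id'].tolist())    # ['nv', 'mel']
```

Pickles should only be loaded from sources you trust, since unpickling can execute arbitrary code.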