# Get the data

First, we need access to the data.
- You can use this link to add the data to your drive: https://drive.google.com/drive/folders/1pHNxZVrlcKh5usWoNC_V7gR2WdeDutjv
- If you have not done this yet, right click on the **CS4MS_Data** folder and click on the **Add shortcut to Drive** option.
- Inside the folder **CS4MS_Data** you will see the folder **HAM10000** - this is the dataset (set of images) we will be working with.

Now you can run the next cell.

In [None]:
# connect notebook to your Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data_dir = "/content/drive/My Drive/CS4MS_Data/HAM10000"

classes = [ 'actinic keratoses', 'basal cell carcinoma', 'benign keratosis-like lesions',
           'dermatofibroma','melanoma', 'melanocytic nevi', 'vascular lesions']

Quick example of object-oriented programming: working with paths (folders and files)

Documentation of this module: https://docs.python.org/3/library/pathlib.html

In [None]:
# import Path class from the pathlib module
from pathlib import Path

In [None]:
# make sure data is mounted
assert Path(data_dir).is_dir(), 'you need to add the CS4MS_Data folder to your Google Drive and mount it (go to top)'

In [None]:
# create Path instance with path string from above
p = Path(data_dir)

In [None]:
# get the name of folder / file
p.name

In [None]:
# get the path object of the parent folder
p.parent

In [None]:
# does the folder nv exist in our path?
(p / 'nv').exists()

In [None]:
# how to iterate over the paths within a path?
for child in p.iterdir():
  print(child)

In [None]:
# how to iterate over specific files also within sub-folders?
# caution: this might take a while since it iterates through all subdirectories and files
for path in p.glob('**/*.csv'):
  print(path)
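
The same `Path` methods can be combined to count the files in each subfolder. The following is a minimal sketch, not part of the course material: the helper name `count_files_per_folder` is hypothetical, and the folders `mel` and `nv` are created in a temporary directory as stand-ins so that the example runs without Google Drive.

```python
import tempfile
from pathlib import Path


def count_files_per_folder(root, pattern="*.jpg"):
    """Return {subfolder name: number of files matching pattern} (hypothetical helper)."""
    root = Path(root)
    return {
        child.name: sum(1 for _ in child.glob(pattern))
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


# build a tiny throwaway folder tree (stand-in for the real dataset folder)
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for folder, n_files in [("mel", 2), ("nv", 3)]:
        (demo_root / folder).mkdir()
        for i in range(n_files):
            (demo_root / folder / f"img_{i}.jpg").touch()
    demo_counts = count_files_per_folder(demo_root)

print(demo_counts)  # {'mel': 2, 'nv': 3}
```

On the mounted drive, `count_files_per_folder(data_dir)` would do the same for the real class folders, assuming the images are stored as `.jpg` files.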

## About the data

The HAM10000 ("Human Against Machine with 10000 training images") dataset contains 10,015 dermatoscopic images. It was made publicly available through the Harvard Dataverse in June 2018 to provide training data for automating the classification of skin lesions. The motivation was to give the public a large and varied data source for training machine learning models, so that their results can be compared with those of human experts. If successful, such applications could save cost and time for hospitals and medical professionals alike.

Apart from the 10,015 images, a metadata file with demographic information for each lesion is provided as well. More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal).

You can download the dataset here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T

The 7 classes of skin lesions included in this dataset are:

- Melanocytic nevi
- Melanoma
- Benign keratosis-like lesions
- Basal cell carcinoma
- Actinic keratoses
- Vascular lesions
- Dermatofibroma

Let's analyze the metadata of the dataset.

In [None]:
# import pandas module for tabular data - https://pandas.pydata.org/docs/
import pandas as pd

# load the metadata table from the CSV file
metadata = pd.read_csv(data_dir + '/HAM10000_metadata.csv')

# label-encode the seven diagnosis classes (dx) as integer codes
metadata['label'] = pd.Categorical(metadata["dx"]).codes
metadata.sample(10)
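
To see what `pd.Categorical(...).codes` does, here is a small sketch on a toy table. The four rows are made up for illustration and are not taken from the metadata file. The categories are sorted alphabetically and each value is replaced by the position of its category.

```python
import pandas as pd

# toy stand-in for the 'dx' column of the metadata (made-up rows)
toy = pd.DataFrame({"dx": ["nv", "mel", "bkl", "nv"]})

categorical = pd.Categorical(toy["dx"])
toy["label"] = categorical.codes

# mapping from integer code back to the diagnosis abbreviation
code_to_dx = dict(enumerate(categorical.categories))

print(toy["label"].tolist())  # [2, 1, 0, 2]
print(code_to_dx)             # {0: 'bkl', 1: 'mel', 2: 'nv'}
```

Keeping such a mapping around makes it easy to translate a numeric label back into its diagnosis later on.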

In [None]:
# numerical statistics
metadata.describe()


Plot of class distribution

In [None]:
# import matplotlib module for plotting - https://matplotlib.org/3.2.1/contents.html
import matplotlib.pyplot as plt
%matplotlib inline

# bar chart showing how many images each diagnosis (dx) has
fig = plt.figure(figsize=(20,10))

ax1 = fig.add_subplot(221)
metadata['dx'].value_counts().plot(kind='bar', ax=ax1)
ax1.set_ylabel('Count')
ax1.set_title('Cell Type')


plt.tight_layout()
plt.show()

Plot 5 images of each class

In [None]:
import imageio

#Visualizing the images

label = [ 'akiec', 'bcc','bkl','df','mel', 'nv',  'vasc']
label_images = []
classes = [ 'actinic keratoses', 'basal cell carcinoma', 'benign keratosis-like lesions',
           'dermatofibroma','melanoma', 'melanocytic nevi', 'vascular lesions']

fig = plt.figure(figsize=(20, 20))
num_images = 5

for i in label:
    sample = metadata[metadata['dx'] == i]['image_id'][:num_images]
    label_images.extend(sample)

for position,ID in enumerate(label_images):
    labl = metadata[metadata['image_id'] == ID]['dx']
    im_sample = data_dir + "/" + labl.values[0] + f'/{ID}.jpg'

    im_sample = imageio.imread(im_sample)

    plt.subplot(len(label), num_images, position + 1)
    plt.imshow(im_sample)
    plt.axis('off')

    # title the first image of each row with the class name
    if position % num_images == 0:
        title = position // num_images
        plt.title(classes[title], loc='left', size=20)

plt.show()

# Loading the data

Use the **torchvision.datasets.ImageFolder** dataset class. This class requires the images to be arranged in one subfolder per class (label). The dataset is already provided in this preprocessed layout.

No augmentation is applied at this point; augmentation is covered next week.

See the documentation: https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder
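
`ImageFolder` derives the numeric label of each class from the alphabetically sorted subfolder names. The sketch below mimics that rule in plain Python without touching torchvision. The variable name `class_to_idx_sketch` is made up for illustration, and the folder names are the diagnosis abbreviations used in this notebook.

```python
# subfolder names in arbitrary order (the diagnosis abbreviations of HAM10000)
folder_names = ["nv", "mel", "bkl", "df", "vasc", "bcc", "akiec"]

# sort the names and number them, which is the rule ImageFolder follows
sorted_names = sorted(folder_names)
class_to_idx_sketch = {name: idx for idx, name in enumerate(sorted_names)}

print(class_to_idx_sketch)
# {'akiec': 0, 'bcc': 1, 'bkl': 2, 'df': 3, 'mel': 4, 'nv': 5, 'vasc': 6}
```

This is why the `classes` list defined at the top of the notebook is ordered so that its positions match these indices.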

In [None]:
# import the torchvision module from the PyTorch framework
# "The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision."
# documentation: https://pytorch.org/docs/stable/torchvision/index.html
import torchvision

Wait! Why is this module even available? It is not part of Python's standard library.

Here's why: Google assumes you are most likely interested in machine learning, so Colab comes with the most popular frameworks preinstalled.

If a module is missing, you can install it with this command:

In [None]:
!pip install torchvision

In [None]:
# create an instance of the image folder class to load images by classes defined with the folders given
dataset = torchvision.datasets.ImageFolder(root= data_dir)

Nice, that was easy. Now let's have a look at the dataset:

In [None]:
# How many images are in the dataset?
len(dataset)

In [None]:
# Some useful attributes:
print(f'folder names: {dataset.classes} ' +
      f'\n\n number of classes:  {len(dataset.classes)}' +
      f'\n\n dictionary with label (class) to encoding (target index): {dataset.class_to_idx}')

In [None]:
# What type does the dataset's get item method return?
type(dataset[0])

In [None]:
type(dataset[0][0])

In [None]:
type(dataset[0][1])

In [None]:
# Let's separate the input and output
image, label = dataset[0]

In [None]:
# show image
image

In [None]:
# print label
label

In [None]:
# What did those numbers mean again?
classes[label]

In [None]:
# little helper to show the data points
def show_data_entry(data):
  image, label = data
  print(f"Image Size (width, height): {image.size} \n Label: {label} \n Lesion Type: {classes[label]}")
  return image


In [None]:
# let's play with this a bit
show_data_entry(dataset[1000])

# Train, Test and Validation Split
It is best practice to split the entire dataset into 3 parts:
- Train: Used to train the network.
- Validation: Used to tune hyperparameters and compare models during development.
- Test: Kept as unseen data to gauge the performance of our trained network.


The splitting should be stratified (done class-wise) so that each subset has the same class proportions as the full dataset.
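
A small sketch of what stratification does, using made-up labels (80 samples of class 0 and 20 of class 1) instead of the real dataset. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels: an imbalanced two-class problem (made-up data)
toy_labels = np.array([0] * 80 + [1] * 20)
toy_indices = np.arange(len(toy_labels))

# stratify keeps the class proportions equal in both parts
toy_train, toy_test = train_test_split(toy_indices,
                                       test_size=0.2,
                                       shuffle=True,
                                       stratify=toy_labels,
                                       random_state=0)

# share of each class within the two parts
train_share = np.bincount(toy_labels[toy_train]) / len(toy_train)
test_share = np.bincount(toy_labels[toy_test]) / len(toy_test)

print(train_share, test_share)
```

Both parts should show the 80/20 class ratio of the full toy set. Without `stratify`, a rare class can end up over- or under-represented in a split purely by chance.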

In [None]:
# import the main PyTorch module - see: https://pytorch.org/docs/stable/torch.html
# "The torch package contains data structures for multi-dimensional tensors and mathematical operations over these are defined. Additionally, it provides many utilities for efficient serializing of Tensors and arbitrary types, and other useful utilities."
import torch

# import the NumPy module, a powerful package for scientific computing
# https://numpy.org/doc/stable/
import numpy as np

# import a helpful function for splitting the dataset from scikit-learn - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

In [None]:
# get the total amount of images in the dataset
num_images = len(dataset)

# create a list of indices for the whole dataset
indices = np.arange(num_images)

# get the class labels from the dataset object (0-6)
class_labels = dataset.targets

# define the fraction of data that is held out in each of the two splits below
split_size = 0.2

# call a function of sklearn that splits the dataset into training+validation (80%) and testing (20%)
train_indices, test_indices = train_test_split(indices,
                                               test_size=split_size,
                                               shuffle=True,
                                               stratify=class_labels,
                                               random_state=42)

# call the function again to split training+validation into training (64% of all data) and validation (16%)
train_indices, val_indices = train_test_split(train_indices,
                                              test_size=split_size,
                                              shuffle=True,
                                              stratify=np.asarray(class_labels)[train_indices],
                                              random_state=42)
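
Because the same `split_size` is applied twice, the second time only to the 80% that remain after the first split, the resulting shares of the full dataset are not 80/20 but roughly 64/16/20. A quick sketch of the arithmetic (the variable names are for illustration only):

```python
split_size = 0.2

# first split: 20% of all data goes to the test set
test_fraction = split_size

# second split: 20% of the remaining 80% goes to the validation set
val_fraction = (1 - split_size) * split_size

# what is left is used for training
train_fraction = (1 - split_size) * (1 - split_size)

print(f"train: {train_fraction:.2f}, val: {val_fraction:.2f}, test: {test_fraction:.2f}")
# train: 0.64, val: 0.16, test: 0.20
```

The actual split sizes can differ by an image or two from these fractions because the counts are rounded to whole numbers.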

Now, we have our dataset loaded! Next week we will look into data loaders, augmentation and how to apply the data.

# Homework

1. Count the occurrences of each class in the different splits.

In [None]:
# create a dictionary containing the list of indices for each dataset split
indices_dict = {
    'train': train_indices,
    'val': val_indices,
    'test': test_indices
}
print(indices_dict)

In [None]:
# another dictionary to save the count of each class for the 3 dataset splits
class_count = {}

In [None]:
# loop through the index lists - split is the key of the dictionary
# with .items() a dictionary can be looped through as (key, value) pairs
for split, indices_list in indices_dict.items():
  print(f'counting classes in {split}')
  # set the count of each class to 0
  class_count[split] = [0 for i in range(len(dataset.classes))]
  for index in indices_list:
    # get dataset item for each index in the split
    # dataset.targets contains only the labels, which is computationally more efficient
    label = dataset.targets[index]
    
    # this would also work but is a lot slower since every image is accessed
    # _, label = dataset[index]
    # _ discards the image since we are only interested in the label
    
    # increase the count of the label by one
    class_count[split][label] += 1
print('done')

In [None]:
# print the result
print(class_count)

In [None]:
# did the stratified split work, i.e. is the class distribution the same in every split?
# each count is divided by the largest class count of its split, so the printed lists should be nearly identical
for split, counts in class_count.items():
  normalized_counts = [count / max(counts) for count in counts]
  print(split, normalized_counts)
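
An alternative sketch for the same check: `collections.Counter` does the counting, and dividing by the size of the split (instead of by the largest class count) yields shares that sum to 1. The helper `class_shares` and the toy data are made up for illustration.

```python
from collections import Counter


def class_shares(targets, indices, num_classes):
    """Return the share of each class among the given indices (hypothetical helper)."""
    counts = Counter(targets[i] for i in indices)
    total = len(indices)
    return [counts.get(c, 0) / total for c in range(num_classes)]


# toy stand-ins for dataset.targets and indices_dict (made-up data)
toy_targets = [0, 0, 0, 1, 0, 0, 0, 1]
toy_splits = {"train": [0, 1, 2, 3], "test": [4, 5, 6, 7]}

toy_shares = {name: class_shares(toy_targets, idx, 2)
              for name, idx in toy_splits.items()}

print(toy_shares)  # {'train': [0.75, 0.25], 'test': [0.75, 0.25]}
```

With the real data, the call would be `class_shares(dataset.targets, indices_list, len(dataset.classes))` for each entry of `indices_dict`.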