# Dermatology HAM10000: Loading and Processing

This Python notebook is based on code for the Skin Cancer MNIST: HAM10000 dataset, retrieved from https://www.kaggle.com/kmader/dermatology-mnist-loading-and-processing. I will use it as a base to build my federated dataset from the **Skin Cancer MNIST: HAM10000** data.

## Step 1: Importing essential libraries and loading data from project assets

First, I load the libraries I will need, such as numpy, pandas and os. I use nb_black to format my code cells automatically, which keeps the notebook structured and clean.

In [1]:
#!pip install -U nb_black

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import types
from glob import glob
import seaborn as sns

%matplotlib inline
%load_ext nb_black


I start with the metadata file, **HAM10000_metadata.csv**.

In [2]:
from botocore.client import Config
import ibm_boto3


def __iter__(self):
    return 0

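# NOTE: client_1c5864e71f0244bbaa8eb91d268d9377 is not defined in this cell; it is
# assumed to be an ibm_boto3 client created in a hidden credentials cell.
body = client_1c5864e71f0244bbaa8eb91d268d9377.get_object(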
    Bucket="masterthesistff-donotdelete-pr-lrtkopitckdzal", Key="HAM10000_metadata.csv"
)["Body"]
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)

df_data_1 = pd.read_csv(body)
df_data_1.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear
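
The `__iter__` patch in the cell above exists only so that pandas accepts the streaming body as a file-like object. A possible alternative, sketched here on the assumption that the file is small enough to hold in memory, is to read the stream into an `io.BytesIO` buffer first. `FakeBody` and `read_csv_from_body` are hypothetical names used for illustration; `FakeBody` stands in for the object storage response so the snippet runs without credentials.

```python
import io

import pandas as pd


class FakeBody:
    """Hypothetical stand-in for a streaming body: it only offers read()."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


def read_csv_from_body(body) -> pd.DataFrame:
    # Pull all bytes into memory, then hand pandas a real file-like object.
    return pd.read_csv(io.BytesIO(body.read()))


# Tiny sample in the same shape as the metadata file (three columns only).
sample = b"lesion_id,image_id,dx\nHAM_0000118,ISIC_0027419,bkl\n"
df_demo = read_csv_from_body(FakeBody(sample))
print(df_demo)
```

With the real response, the call would presumably be `read_csv_from_body(body)`, which avoids monkey-patching the object.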



In [3]:
df_data_1.describe(exclude=[np.number])

Unnamed: 0,lesion_id,image_id,dx,dx_type,sex,localization
count,10015,10015,10015,10015,10015,10015
unique,7470,10015,7,4,3,15
top,HAM_0003789,ISIC_0029374,nv,histo,male,back
freq,6,1,6705,5340,5406,2192



Next, I download **HAM10000_images.zip** and extract it into my current directory.

In [4]:
# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    "IAM_SERVICE_ID": "YOUR_SERVICE_ID",
    "IBM_API_KEY_ID": "YOUR_API_KEY_ID",
    "ENDPOINT": "https://s3.eu-geo.objectstorage.service.networklayer.com",
    "IBM_AUTH_ENDPOINT": "https://iam.eu-de.bluemix.net/oidc/token",
    "BUCKET": "masterthesistff-donotdelete-pr-lrtkopitckdzal",
    "FILE": "HAM10000_images.zip",
}


In [5]:
# 2. Download the file from IBM Cloud Object Storage
from ibm_botocore.client import Config
import ibm_boto3

cos = ibm_boto3.client(
    service_name="s3",
    ibm_api_key_id=credentials_1["IBM_API_KEY_ID"],
    ibm_service_instance_id=credentials_1["IAM_SERVICE_ID"],
    ibm_auth_endpoint=credentials_1["IBM_AUTH_ENDPOINT"],
    config=Config(signature_version="oauth"),
    endpoint_url=credentials_1["ENDPOINT"],
)


In [6]:
cos.download_file(
    Bucket=credentials_1["BUCKET"],
    Key="HAM10000_images.zip",
    Filename="HAM10000_images.zip",
)


In [7]:
# 3. Extract Zip
from zipfile import ZipFile

print("Extract all files in ZIP to current directory")
# Open the archive for reading
with ZipFile("HAM10000_images.zip", "r") as zipObj:
    # Extract all contents of the zip file into the current directory
    zipObj.extractall()

Extract all files in ZIP to current directory
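
`extractall()` unpacks every member of the archive, including any macOS metadata stored under `__MACOSX`. If only the images are wanted, the members can be filtered before extraction. The following is a minimal sketch; `extract_jpgs` is a hypothetical helper, and the demo archive and its image name are made up so the snippet runs without the real download.

```python
import io
import os
import tempfile
from zipfile import ZipFile


def extract_jpgs(zip_source, target_dir="."):
    """Extract only .jpg members, skipping macOS '__MACOSX' metadata entries."""
    with ZipFile(zip_source, "r") as archive:
        wanted = [
            name
            for name in archive.namelist()
            if name.lower().endswith(".jpg") and not name.startswith("__MACOSX")
        ]
        archive.extractall(path=target_dir, members=wanted)
    return wanted


# Build a small in-memory archive with one image and one metadata entry.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("HAM10000_images/ISIC_0000001.jpg", b"fake image bytes")
    archive.writestr("__MACOSX/HAM10000_images/._ISIC_0000001.jpg", b"metadata")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_jpgs(buffer, tmp)
    image_exists = os.path.exists(
        os.path.join(tmp, "HAM10000_images", "ISIC_0000001.jpg")
    )
    macosx_exists = os.path.exists(os.path.join(tmp, "__MACOSX"))

print(extracted)
```

For the real file, the call would be `extract_jpgs("HAM10000_images.zip")`.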



In [8]:
os.listdir("./HAM10000_images")

['ISIC_0034105.jpg',
 'ISIC_0026189.jpg',
 'ISIC_0029897.jpg',
 'ISIC_0033837.jpg',
 'ISIC_0032690.jpg',
 'ISIC_0030698.jpg',
 'ISIC_0031282.jpg',
 'ISIC_0031895.jpg',
 'ISIC_0029186.jpg',
 'ISIC_0027036.jpg',
 'ISIC_0032245.jpg',
 'ISIC_0034266.jpg',
 'ISIC_0026370.jpg',
 'ISIC_0028922.jpg',
 'ISIC_0025199.jpg',
 'ISIC_0029968.jpg',
 'ISIC_0030513.jpg',
 'ISIC_0026909.jpg',
 'ISIC_0025222.jpg',
 'ISIC_0028032.jpg',
 'ISIC_0031241.jpg',
 'ISIC_0028343.jpg',
 'ISIC_0033109.jpg',
 'ISIC_0030834.jpg',
 'ISIC_0025232.jpg',
 'ISIC_0029448.jpg',
 'ISIC_0034126.jpg',
 'ISIC_0034098.jpg',
 'ISIC_0030781.jpg',
 'ISIC_0025194.jpg',
 'ISIC_0026291.jpg',
 'ISIC_0030775.jpg',
 'ISIC_0030977.jpg',
 'ISIC_0025006.jpg',
 'ISIC_0026442.jpg',
 'ISIC_0032208.jpg',
 'ISIC_0031687.jpg',
 'ISIC_0032652.jpg',
 'ISIC_0030431.jpg',
 'ISIC_0034057.jpg',
 'ISIC_0029697.jpg',
 'ISIC_0030187.jpg',
 'ISIC_0028062.jpg',
 'ISIC_0027807.jpg',
 'ISIC_0028380.jpg',
 'ISIC_0033699.jpg',
 'ISIC_0029658.jpg',
 'ISIC_003373


In [9]:
# 4. Verify
# Verify that extraction of images has worked and I can access them.

import os

path = os.getcwd()
dir_list = os.listdir(path)
print(dir_list)

['__MACOSX', 'HAM10000_images.zip', 'HAM10000_images']



## Step 2: Mapping images from HAM10000_images into HAM10000_metadata.csv

In [10]:
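# The archive was extracted into the current directory, and the .jpg files
# sit directly inside HAM10000_images (see the os.listdir output above).
base_skin_dir = "HAM10000_images"

# Map each image id (file name without extension) to its file path
imageid_path_dict = {
    os.path.splitext(os.path.basename(x))[0]: x
    for x in glob(os.path.join(base_skin_dir, "*.jpg"))
}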

# This dictionary is useful for displaying more human-friendly labels later on

lesion_type_dict = {
    "nv": "Melanocytic nevi",
    "mel": "Melanoma",
    "bkl": "Benign keratosis-like lesions",
    "bcc": "Basal cell carcinoma",
    "akiec": "Actinic keratoses",
    "vasc": "Vascular lesions",
    "df": "Dermatofibroma",
}


In [14]:
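# Fetch the metadata again from object storage: the body read in In [2] is a
# stream and has already been consumed. The body is passed to pd.read_csv
# directly; os.path.join only builds file paths and accepts no `body` argument.
body = client_1c5864e71f0244bbaa8eb91d268d9377.get_object(
    Bucket="masterthesistff-donotdelete-pr-lrtkopitckdzal",
    Key="HAM10000_metadata.csv",
)["Body"]
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)

tile_df = pd.read_csv(body)
tile_df["path"] = tile_df["image_id"].map(imageid_path_dict.get)
tile_df["cell_type"] = tile_df["dx"].map(lesion_type_dict.get)
tile_df["cell_type_idx"] = pd.Categorical(tile_df["cell_type"]).codes
tile_df.sample(3)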
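
Mapping `image_id` through `imageid_path_dict.get` leaves a missing value wherever no file was found, so a quick check on the `path` column shows whether the directory and glob pattern are right. Here is a small sketch with toy data; the ids and paths below are hypothetical placeholders, not real HAM10000 entries.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the metadata and the id -> path mapping.
demo_df = pd.DataFrame({"image_id": ["ISIC_A", "ISIC_B", "ISIC_C"]})
demo_paths = {
    "ISIC_A": "HAM10000_images/ISIC_A.jpg",
    "ISIC_C": "HAM10000_images/ISIC_C.jpg",
}

# Same mapping pattern as in the notebook: ids without a file become missing.
demo_df["path"] = demo_df["image_id"].map(demo_paths.get)

# Rows whose image could not be located on disk.
missing = demo_df[demo_df["path"].isna()]
print(f"{len(missing)} of {len(demo_df)} images have no file on disk")
```

If every row of the real DataFrame ends up in `missing`, the most likely cause is a wrong base directory or glob pattern rather than missing images.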


## Step 3: Create new column for client_ids

The plan is to assign each row of the DataFrame to a client ID. There are 10015 images, so dividing them into 10 client sets would give roughly 1,000 images per client. Alternatively, 20 clients with roughly 500 images each may be a good distribution.

https://www.tensorflow.org/federated/api_docs/python/tff/simulation/ClientData
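
One way the assignment could look is sketched below, assuming a simple random split into near-equal groups. `assign_client_ids` and the `client_id` column name are hypothetical, and the toy frame only mimics the row count of the metadata. A split by `lesion_id`, so that all images of one lesion stay on the same client, would be another option and is not shown here.

```python
import numpy as np
import pandas as pd


def assign_client_ids(df, num_clients, seed=0):
    """Return a copy of df with a 'client_id' column of near-equal group sizes."""
    rng = np.random.default_rng(seed)
    # Round-robin ids 0..num_clients-1, then shuffle so the assignment is random.
    ids = np.arange(len(df)) % num_clients
    rng.shuffle(ids)
    out = df.copy()
    out["client_id"] = ids
    return out


# Toy frame with as many rows as HAM10000 has images (10015).
demo_df = pd.DataFrame({"image_id": [f"img_{i}" for i in range(10015)]})
demo_df = assign_client_ids(demo_df, num_clients=20)
counts = demo_df["client_id"].value_counts().sort_index()
print(counts)
```

Because 10015 is not divisible by 20, group sizes differ by at most one image.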

## Step 4: Export the DataFrame in a format that can be used to run the tutorial code

https://www.tensorflow.org/federated/tutorials/federated_learning_for_image_classification#creating_a_model_with_keras
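
The export format is still open. As a starting point, here is a sketch that assumes one CSV file per client, each holding the image paths and labels for that client; these files could later be turned into per-client datasets for the tutorial. `export_per_client`, the file naming scheme and the toy rows are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def export_per_client(df, out_dir):
    """Write one CSV per client_id and return the list of written file paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for client_id, client_df in df.groupby("client_id"):
        path = os.path.join(out_dir, f"client_{client_id}.csv")
        client_df.to_csv(path, index=False)
        written.append(path)
    return written


# Toy rows (hypothetical values) with the columns built in the earlier steps.
demo_df = pd.DataFrame(
    {
        "path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "cell_type_idx": [0, 1, 0, 2],
        "client_id": [0, 0, 1, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    files = export_per_client(demo_df, tmp)
    file_names = [os.path.basename(f) for f in files]
    reloaded = pd.read_csv(files[0])

print(file_names)
```

Keeping only paths and labels in the CSV files means the images themselves are loaded later, when each client's data is actually needed.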