# Skin Cancer Classification

Skin cancer is a very common condition, and visual examination plays an important role in its diagnosis. Diagnosis starts with clinical screening, followed by dermoscopic analysis, a biopsy, and histopathological examination. Automating these steps is of great importance, but it involves various difficulties.

The dataset given to us has a total of 10,000 images, each labeled with one of 5 classes of skin lesion. It can be used to train a machine learning model for this task.

The classes are:

* Melanoma (MEL)
* Melanocytic nevus (NV)
* Basal cell carcinoma (BCC)
* Actinic keratosis (AK)
* Benign keratosis (BKL)

#### File descriptions

* Train.csv - Training data. It consists of 10,000 images along with their labels (also known as the “ground truth”).
* SkinCancerTest.csv - Testing data. It consists of 5,000 images. Your final submission should be similar to this file.

In this kernel, models are trained to assign each skin lesion image to the class it belongs to. Convolutional Neural Networks are chosen as the model for classifying this data.

The kernel proceeds as follows:
1. Data Analysis and Preprocessing
    * Import necessary libraries
    * Read and store the raw data 
    * Create a label dictionary 
    * Image Pipeline
    * EDA (Exploratory Data Analysis)
2. Model Building
3. Model Training
4. Model Evaluation

In [0]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
"""
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
"""
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. Data Analysis and Preprocessing

The cell above is provided by default by Kaggle. The libraries needed for the rest of the kernel are imported below.

In [0]:
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from PIL import Image as img
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Flatten,Conv2D,MaxPool2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.applications import ResNet50

## Read and store the raw data

In [0]:
img_prefix = "/kaggle/input/machinelearning412-skincancerclassification/Data_SkinCancer/Data_SkinCancer/"
df = pd.read_csv("/kaggle/input/machinelearning412-skincancerclassification/Train.csv")
testdf = pd.read_csv("/kaggle/input/machinelearning412-skincancerclassification/Test.csv")

In [0]:
df.head()

Unnamed: 0,Id,Category
0,Image_1,2
1,Image_2,2
2,Image_3,5
3,Image_4,2
4,Image_5,1


As we can see below, there is a serious class imbalance in the dataset.

In [0]:
temp = df.groupby('Category').count()
temp.rename(columns={'Id':'Count'})

Category,Count
1,2204
2,4489
3,1592
4,427
5,1288
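
One common way to counteract an imbalance like this during training is to weight each class inversely to its frequency. The following is a minimal sketch and not part of the original kernel: it derives such weights from the counts in the table above. The names `class_counts` and `balanced_class_weights` are made up for illustration, and the formula (total / (number of classes × class count)) is only one possible choice of weighting.

```python
# Counts per Category taken from the table above (labels 1..5).
class_counts = {1: 2204, 2: 4489, 3: 1592, 4: 427, 5: 1288}

def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * count); rare classes get larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    # Shift the labels from [1, 5] to [0, 4] so that they are zero-based class indices.
    return {label - 1: total / (n_classes * count) for label, count in counts.items()}

class_weights = balanced_class_weights(class_counts)
for index, weight in sorted(class_weights.items()):
    print(index, round(weight, 3))
```

A dictionary of this form could be passed as the `class_weight` argument of Keras `model.fit`. That is an option worth trying, not something this kernel does.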


In [0]:
# Open the first nine images to check that loading works
# and to inspect the image shapes

for i in [str(j) for j in range(1,10)]:
    image = img.open(img_prefix+"Image_"+i+".jpg").convert("RGB")
    print("Image dimensions, ", np.asarray(image).shape)

Image dimensions,  (450, 600, 3)
Image dimensions,  (450, 600, 3)
Image dimensions,  (1024, 1024, 3)
Image dimensions,  (450, 600, 3)
Image dimensions,  (1024, 1024, 3)
Image dimensions,  (450, 600, 3)
Image dimensions,  (450, 600, 3)
Image dimensions,  (450, 600, 3)
Image dimensions,  (1024, 1024, 3)


The image dimensions are not fixed.
We must make them uniform, either by shrinking the large images or by padding the small ones.
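
Shrinking and padding can also be combined so that the aspect ratio of each image is preserved. The sketch below is only the arithmetic behind that idea, with no image library involved: it computes the scaled size and the offset needed to center an image on a square canvas. The function name `letterbox_geometry` and the target side of 128 are assumptions made for illustration.

```python
def letterbox_geometry(size, target):
    """Return (new_size, offset) for fitting an image of `size` into a square of side `target`.

    `size` is (width, height). The image is scaled so that its longer side equals
    `target`; `offset` is the (left, top) position that centers it on the square canvas.
    """
    width, height = size
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    left = (target - new_width) // 2
    top = (target - new_height) // 2
    return (new_width, new_height), (left, top)

# The two sizes seen in the output above, given as (width, height).
print(letterbox_geometry((600, 450), 128))    # ((128, 96), (0, 16))
print(letterbox_geometry((1024, 1024), 128))  # ((128, 128), (0, 0))
```

Note that the shapes printed above are (height, width, channels), so a (450, 600, 3) array corresponds to a 600 by 450 image.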

## Create a label dictionary

In [0]:
targetLabels={
    1: "MEL",
    2: "NV",
    3: "BCC",
    4: "AK",
    5: "BKL"
}
df["CategoryNames"]=df["Category"].apply(lambda col: targetLabels[col])

The label name corresponding to each Category value is also added to the dataframe as the CategoryNames column.

## Image Pipeline

Although no full EDA has been done on the dataset yet, we already know that the images do not all have the same size.
The following function takes an image ID as input, loads the image, then resizes and normalizes it. Finally, the labels are one-hot encoded so that a softmax output layer can be used in the model.

In [0]:
WIDTH = 128
HEIGHT = 128
def imagePipeline(imgPostFix):
    return tf.cast(np.array(
        img.open(img_prefix+imgPostFix+".jpg")
                      .resize((WIDTH,HEIGHT))
                      .convert("RGB")), tf.float32)/255.0


images = np.array(df["Id"])
target = np.array(df["Category"])
target=target-1 #[1,5] ==> [0,4]
target = to_categorical(target)#One hot encoding labels
target = [tf.cast(i, tf.int64) for i in target]
target=tf.stack(target)
images = tf.stack([imagePipeline(i) for i in images])
df.head()

Unnamed: 0,Id,Category,CategoryNames
0,Image_1,2,NV
1,Image_2,2,NV
2,Image_3,5,BKL
3,Image_4,2,NV
4,Image_5,1,MEL


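Load => Resize => Convert to RGB => Create a tensor => Normalize

Therefore the image tensor has shape (10000, HEIGHT, WIDTH, 3), since an array built from a PIL image is laid out as (height, width, channels).
The label tensor has shape (10000, 5).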
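
To make the label handling concrete, here is a small plain-Python sketch of the same two steps (shift the labels from [1, 5] to [0, 4], then one-hot encode) together with the inverse mapping. It imitates what `to_categorical` produces but does not call it; the helper names `one_hot` and `decode` are invented for this illustration.

```python
def one_hot(labels, n_classes=5):
    """Encode labels in [1, n_classes] as one-hot rows, after shifting them to [0, n_classes - 1]."""
    rows = []
    for label in labels:
        row = [0] * n_classes
        row[label - 1] = 1
        rows.append(row)
    return rows

def decode(rows):
    """Invert one_hot: take the index of the largest entry and shift back to [1, n_classes]."""
    return [row.index(max(row)) + 1 for row in rows]

labels = [2, 2, 5, 2, 1]          # the Category values of the first five rows
encoded = one_hot(labels)
print(encoded[0])                 # [0, 1, 0, 0, 0]
print(decode(encoded) == labels)  # True
```

The same argmax-then-add-one step is what turns softmax outputs back into Category values at prediction time.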

In [0]:
print("Image dimensions, ", np.asarray(images).shape)

Image dimensions are fixed after the preprocessing operation.

## Activate the TPU

In [0]:
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

## Exploratory Data Analysis (EDA)

Let's start by checking how the classes are distributed in the training set.

In [0]:
temp = df.groupby('Category').count().drop(columns=["CategoryNames"]).rename(columns={"Id":"Count"})
plt.bar(targetLabels.values(),temp["Count"])

We can see that the dataset is imbalanced, so this must be taken into account when evaluating the trained model in later steps.
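
A short illustration of why plain accuracy can mislead on imbalanced data: the sketch below compares accuracy with the mean of the per-class recalls (often called balanced accuracy). The labels are hypothetical toy values, not results from this kernel, and `per_class_recall` is a helper written only for this example.

```python
def per_class_recall(y_true, y_pred, n_classes):
    """Fraction of the samples of each class predicted correctly (None if the class is absent)."""
    correct = [0] * n_classes
    total = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n if n else None for c, n in zip(correct, total)]

# Hypothetical toy labels: 8 samples of class 1 and 2 samples of class 3,
# with a classifier that always predicts class 1.
y_true = [1] * 8 + [3] * 2
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = [r for r in per_class_recall(y_true, y_pred, n_classes=5) if r is not None]
balanced_accuracy = sum(recalls) / len(recalls)
print(accuracy, balanced_accuracy)  # 0.8 0.5
```

The always-predict-the-majority classifier reaches 80% accuracy here while never recognizing the minority class, which the balanced figure exposes.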

In [0]:
n_samples = 4
fig, m_axs = plt.subplots(5, n_samples, figsize = (4*n_samples, 3*5))
for n_axs, (type_name, type_rows) in zip(m_axs, 
                                         df.sort_values(['CategoryNames']).groupby('CategoryNames')):
    n_axs[0].set_title(type_name)
    for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=1234).iterrows()):
        c_ax.imshow(img.open(img_prefix+c_row['Id']+".jpg").resize((WIDTH,HEIGHT))
                      .convert("RGB"))
        c_ax.axis('off')
fig.savefig('category_samples.png', dpi=300)

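The model is built on the TPU. Several architectures were tried. In the cell below each of them is wrapped in a triple-quoted string, so none of them runs as shown; one variant has to be enabled by removing its surrounding quotes before the training cells are executed.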

In [0]:
# instantiating the model in the strategy scope creates the model on the TPU
""" 
with tpu_strategy.scope():
    
    model = tf.keras.Sequential()
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same',input_shape=(WIDTH,HEIGHT,3)))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    

    model.add(Conv2D(128,(3,3), strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(128,(3,3), strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(5,activation='softmax'))
    model.summary()
   """ 
"""
with tpu_strategy.scope():
    model = tf.keras.Sequential()
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same',input_shape=(WIDTH,HEIGHT,3)))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(5,activation='softmax'))
    model.summary()
"""


"""
with tpu_strategy.scope():
    model = tf.keras.Sequential()
    model.add(Conv2D(32,kernel_size=3, strides=6,activation='relu',padding='Same',input_shape=(WIDTH,HEIGHT,3)))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    #model.add(Dropout(0.25))
    model.add(Conv2D(32,kernel_size=3, strides=4,activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    #model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=3, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    #model.add(Dropout(0.25))
    

    model.add(Flatten())
    model.add(Dense(32,activation='relu'))
    #model.add(Dropout(0.25))
    model.add(Dense(5,activation='softmax'))
    model.summary()

"""
"""
with tpu_strategy.scope():
    model = tf.keras.Sequential()
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same',input_shape=(WIDTH,HEIGHT,3)))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Conv2D(32,kernel_size=3, strides=2,activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))
    
    model.add(Conv2D(64,kernel_size=3, strides=2, activation='relu',padding='Same'))
    model.add(MaxPool2D(pool_size=(2,2), strides=1, padding='same'))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(5,activation='softmax'))
    model.summary()
"""
"""

with tpu_strategy.scope():
    model = tf.keras.Sequential()
    

    model.add(ResNet50(input_shape=(WIDTH, HEIGHT, 3), include_top=False, pooling='avg', weights="imagenet"))
    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.40))
    model.add(Dense(5,activation='softmax'))
    model.layers[0].trainable = False
    model.summary()



"""


Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 64, 64, 32)        896       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 64, 64, 32)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 32, 32, 32)        9248      
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 32, 32, 32)        0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 16, 16, 64)        18496     
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 16, 16, 64)       


In [0]:
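# NOTE: this Adam instance (lr=0.01) is never passed to compile(); the model is
# compiled with the default Adam() below, whose learning rate of 0.001 matches
# the reductions reported in the training log.
optimizer=Adam(lr=0.01,beta_1=0.8,beta_2=0.999,epsilon=1e-7,decay=0.0,amsgrad=False)
# Halve the learning rate when val_accuracy has not improved for 3 epochs
learning_reductor = ReduceLROnPlateau(monitor='val_accuracy',patience=3,verbose=1,factor=0.5,min_lr=0.00001)
with tpu_strategy.scope():
    model.compile(optimizer=tf.keras.optimizers.Adam(),loss='categorical_crossentropy',metrics=["accuracy"])
    history  = model.fit(images, target, validation_split=0.2,epochs=150,verbose=1,callbacks=[learning_reductor])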

Train on 8000 samples, validate on 2000 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 00027: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 00033: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 00036: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.
Epoch 37/150
Epoch 38/150


KeyboardInterrupt: 
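
The learning-rate reductions reported in the log above come from the `ReduceLROnPlateau` callback: with `factor=0.5`, each reduction halves the rate, which is why the logged values go 0.0005, 0.00025, 0.000125 and so on. The plain-Python sketch below imitates the basic rule under simplifying assumptions: it ignores the callback's `min_delta` and `cooldown` options, the function name `simulate_reduce_on_plateau` is made up, and the accuracy values are invented for illustration.

```python
def simulate_reduce_on_plateau(values, lr, patience=3, factor=0.5, min_lr=1e-5):
    """Return the learning rate in effect after each epoch for a monitored metric to maximize."""
    best = float("-inf")
    wait = 0
    history = []
    for value in values:
        if value > best:
            # The metric improved: remember it and reset the patience counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Invented validation accuracies: the best value (0.65) is not beaten in epochs 3 to 5.
val_accuracy = [0.60, 0.65, 0.64, 0.65, 0.63, 0.66]
print(simulate_reduce_on_plateau(val_accuracy, lr=0.001))
# [0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```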

This looks promising, so the next step is to train the model on all of the data.

The reason is that the validation split was only needed to select an adequate model. Once an adequate model has been chosen, it will be applied to unseen data, so none of the training data should be wasted.


In [0]:
# As above, this Adam instance is not passed to compile(); the default Adam() is used.
optimizer=Adam(lr=0.01,beta_1=0.8,beta_2=0.999,epsilon=1e-7,decay=0.0,amsgrad=False)
# There is no validation split in this run, so val_accuracy is not available;
# monitor the training accuracy instead.
learning_reductor = ReduceLROnPlateau(monitor='accuracy',patience=3,verbose=1,factor=0.5,min_lr=0.00001)
with tpu_strategy.scope():
    model.compile(optimizer=tf.keras.optimizers.Adam(),loss='categorical_crossentropy',metrics=["accuracy"])
    history  = model.fit(images, target, epochs=50,verbose=1,callbacks=[learning_reductor])

Train on 10000 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [0]:
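testids = np.array(testdf["Id"]).tolist()
preds = np.array([], dtype=np.int32)
def batch(iterable, n=1):
    """Yield successive chunks of at most n items from iterable."""
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

# Predict in batches of 100 so that the test images are not all loaded at once
for x in batch(testids, 100):
    x = tf.stack([imagePipeline(i) for i in x])
    prediction = model.predict(x)
    prediction = np.argmax(prediction, axis=1)  # most probable class index in [0,4]
    prediction = prediction+1  # back to the original labels in [1,5]
    preds=np.concatenate((preds, prediction))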

In [0]:
import csv
header = ["Id", "Category"]
lines = [[i, j] for i, j in zip(testids, preds)]
with open("Results.csv", "w", newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(header)
    for l in lines:
        writer.writerow(l)