Problem statement: To build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early. It accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

### Importing Skin Cancer Data
#### To do: Take necessary actions to read the data
### Importing all the important libraries

In [6]:
import pathlib
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing
import os
import PIL
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.layers import BatchNormalization
import warnings
warnings.filterwarnings('ignore')

In [7]:
print(tf.__version__)

# Data Reading/Data Understanding

In [14]:
## If you are using the data by mounting the google drive, use the following :
## from google.colab import drive
## drive.mount('/content/gdrive')

##Ref:https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166

This assignment uses a dataset of about 2357 images of skin cancer types. The train and test directories each contain 9 sub-directories, one for each of the 9 skin cancer types.

In [59]:
# Defining the path for train and test images
## Todo: Update the paths of the train and test dataset
#Kaggle
data_dir_train = pathlib.Path("/kaggle/input/dlcnnassignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/")
data_dir_test = pathlib.Path('/kaggle/input/dlcnnassignment/Skin cancer ISIC The International Skin Imaging Collaboration/Test/')
#google
# data_dir_train = pathlib.Path("/content/drive/MyDrive/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/")
# data_dir_test = pathlib.Path('/content/drive/MyDrive/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Test/')

In [60]:
image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
print("Total image count in Train",image_count_train)
image_count_test = len(list(data_dir_test.glob('*/*.jpg')))
print("Total image count in Test",image_count_test)
print("Total image count in Dataset",(image_count_test+image_count_train))

In [61]:
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk(data_dir_train):
    print("Train", dirname.split("/")[-1], len(filenames))
for dirname, _, filenames in os.walk(data_dir_test):
    print("Test", dirname.split("/")[-1], len(filenames))

There are 9 classes, so the output layer of the model needs 9 neurons.

# Dataset Creation

Create train & validation dataset from the train directory with a batch size of 32. Also, make sure you resize your images to 180*180.

### Load using keras.preprocessing

Let's load these images off disk using the helpful image_dataset_from_directory utility.



### Create a dataset

Define some parameters for the loader:

In [19]:
batch_size = 32
img_height = 180
img_width = 180

Use 80% of the images for training, and 20% for validation.

In [20]:
## Write your train dataset here
## Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
## Note: make sure you resize your images to the size img_height*img_width while writing the dataset
train_ds =  tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    seed=123,
    validation_split= 0.2, # 20% for validation
    subset= 'training',
    image_size=(img_height,img_width),
    batch_size = batch_size
)

In [21]:
train_ds

In [22]:
## Write your validation dataset here
## Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
## Note: make sure you resize your images to the size img_height*img_width while writing the dataset
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

In [23]:
val_ds

In [24]:
# List out all the classes of skin cancer and store them in a list. 
# You can find the class names in the class_names attribute on these datasets. 
# These correspond to the directory names in alphabetical order.
class_names = train_ds.class_names
print(class_names)

## Dataset visualisation
Write code to visualize one instance of each of the nine classes present in the dataset.


### Visualize the data
#### Todo: write code to visualize one instance of each of the nine classes present in the dataset

In [27]:
import matplotlib.pyplot as plt

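### your code goes here, you can use training or validation data to visualize
plt.figure(figsize=(10, 10))
shown = {}  # class index -> first image seen for that class
for im, lab in train_ds:
    for image, label in zip(im, lab.numpy()):
        shown.setdefault(int(label), image)
    if len(shown) == len(class_names):
        break  # every class has one example
for i, (label, image) in enumerate(sorted(shown.items())):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image.numpy().astype("uint8"))
    plt.title(class_names[label])
    plt.axis("off")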
        


In [28]:
ds=train_ds.take(1) # the dataset is already batched; take(1) yields the first batch of 32 images
print(type(ds))
print(ds)
for im, lab in ds:
    print(im.shape)
# print(im[0].numpy().astype("uint8"),im[0].numpy().astype("uint8").shape,class_names[lab[0]])
im[0]

Each image batch (`im` above) is a tensor of shape `(32, 180, 180, 3)`: a batch of 32 images of shape `180x180x3` (the last dimension holds the RGB color channels). Each label batch (`lab` above) is a tensor of shape `(32,)` holding the labels that correspond to the 32 images.
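To make the shapes concrete, here is a minimal NumPy sketch. The arrays are zero-filled stand-ins with the shapes and dtypes described above, not real images from the dataset, and the names `image_batch` and `label_batch` are only illustrative.

```python
import numpy as np

batch_size, img_height, img_width = 32, 180, 180

# Stand-in arrays shaped like one batch from the loader (no real image data).
image_batch = np.zeros((batch_size, img_height, img_width, 3), dtype=np.float32)
label_batch = np.zeros((batch_size,), dtype=np.int32)

# Indexing the first axis picks a single image out of the batch.
first_image = image_batch[0]
print(image_batch.shape, first_image.shape, label_batch.shape)

# Memory needed to hold one float32 batch.
batch_mib = image_batch.nbytes / 2**20
print(f"{batch_mib:.1f} MiB per batch")
```

The per-batch size gives a rough idea of how much memory caching the whole dataset requires.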

`Dataset.cache()` keeps the images in memory after they're loaded off disk during the first epoch.

`Dataset.prefetch()` overlaps data preprocessing and model execution while training.

In [29]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [30]:
train_ds

In [31]:
val_ds

## Model Building & training : 
Create a CNN model, which can accurately detect 9 classes present in the dataset. While building the model rescale images to normalize pixel values between (0,1).
Choose an appropriate optimiser and loss function for model training
Train the model for ~20 epochs
Write your findings after the model fit, see if there is evidence of model overfit or underfit

### Create the model
#### Todo: Create a CNN model, which can accurately detect 9 classes present in the dataset. Use ```layers.experimental.preprocessing.Rescaling``` to normalize pixel values between (0,1). The RGB channel values are in the `[0, 255]` range. This is not ideal for a neural network. Here, it is good to standardize values to be in the `[0, 1]` range.

In [32]:
### Your code goes here
normalized_layers = tf.keras.layers.experimental.preprocessing.Rescaling(1./255, input_shape=(180, 180, 3))


In [33]:
#Mapping values to images
normalized_ds = train_ds.map(lambda x, y: (normalized_layers(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))


In [34]:
print(first_image)

# Basic Model 1

In [35]:

model = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu', input_shape = (180, 180, 3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same',activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Flatten())
model.add(Dense(512, activation='relu')) # fully connected
model.add(Dropout(0.5))

model.add(Dense(9, activation = "softmax"))

### Compile the model
Choose an appropriate optimiser and loss function for model training 

In [36]:
### Todo, choose an appropriate optimiser and loss function
# The final Dense layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
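Because the model ends in a softmax layer, the values that reach the loss are probabilities rather than logits, and the `from_logits` flag should describe that. The NumPy sketch below uses hypothetical logits and a hand-written cross-entropy (not Keras's implementation) to illustrate what happens if softmax is applied a second time to values that are already probabilities. How a particular Keras version reacts to such a mismatch may differ (some detect it and warn), so treat this only as an illustration of why the flag should match the output layer.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return float(-np.log(probs[true_class]))

# Hypothetical logits for a 3-class example; softmax turns them into
# the kind of values a softmax output layer produces.
probs = softmax([2.0, 0.5, -1.0])
true_class = 0

# Loss computed directly on the probabilities (what from_logits=False means).
loss_from_probs = cross_entropy(probs, true_class)

# Loss if the probabilities are treated as logits and softmax is applied again.
loss_double_softmax = cross_entropy(softmax(probs), true_class)

print(f"on probabilities: {loss_from_probs:.4f}")
print(f"softmax applied twice: {loss_double_softmax:.4f}")
```

The second softmax flattens the distribution, so a confident and correct prediction is scored as if it were much less certain.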

In [37]:
# View the summary of all layers
model.summary()
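As a cross-check on the summary, the parameter counts can be worked out by hand. The sketch below is plain Python and assumes the Keras defaults for the architecture above: `same` padding keeps the spatial size, each 2x2 max-pool halves it (rounding down), and every layer has a bias term. The helper names are illustrative only.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per filter for a square-kernel Conv2D."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a Dense layer."""
    return (n_in + 1) * n_out

size = 180                                   # input is 180 x 180 x 3
conv_total = conv2d_params(3, 32) + conv2d_params(32, 32)
size //= 2                                   # first max-pool: 180 -> 90
conv_total += conv2d_params(32, 64)
size //= 2                                   # second max-pool: 90 -> 45
conv_total += conv2d_params(64, 64)
size //= 2                                   # third max-pool: 45 -> 22

flat = size * size * 64                      # units after Flatten
dense_total = dense_params(flat, 512) + dense_params(512, 9)
total = conv_total + dense_total

print("flattened units:", flat)
print("conv parameters:", conv_total)
print("dense parameters:", dense_total)
print("total parameters:", total)
```

Almost all of the parameters sit in the first fully connected layer, which is worth keeping in mind when judging how easily this model can overfit a dataset of a couple of thousand images.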

### Train the model

In [38]:
epochs = 20
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### Visualizing training results

In [39]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Todo: Write your findings after the model fit, see if there is evidence of model overfit or underfit

Yes, the model is clearly overfitting. The last epoch of the log reads:

Epoch 20/20

56/56 [==============================] - 2s 31ms/step - loss: 0.7274 - accuracy: 0.7494 - val_loss: 1.6475 - val_accuracy: 0.5481

Training accuracy (about 75%) is well above validation accuracy (about 55%). After about 5 epochs the validation accuracy stalled around 50% and kept oscillating, and the validation loss behaved similarly.

The likely reason is that the model has more parameters than the training data can constrain, so it fits the training set too closely. To keep the model general, some of the following steps are recommended:
1. Dropout.
2. Regularization (L1 and L2).
3. Data augmentation (to increase the training data).

Since it is the suggested next step, let us choose data augmentation for now.
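The overfitting claim can be put in numbers using the epoch 20 values from the log above. This is a minimal sketch; the helper name and the 0.10 threshold are illustrative assumptions, not a standard criterion.

```python
def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy."""
    return train_acc - val_acc

# Values copied from the Epoch 20/20 log line above.
train_acc, val_acc = 0.7494, 0.5481
train_loss, val_loss = 0.7274, 1.6475

gap = generalization_gap(train_acc, val_acc)
loss_ratio = val_loss / train_loss

# Hypothetical rule of thumb: flag a gap larger than 10 percentage points.
looks_overfit = gap > 0.10

print(f"accuracy gap: {gap:.4f}")
print(f"val_loss / train_loss: {loss_ratio:.2f}")
print("overfitting suspected:", looks_overfit)
```

A gap of about 20 percentage points, with validation loss more than twice the training loss, is consistent with the conclusion drawn from the curves.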

### Data Augmentation



There are many augmentation methods: `ImageDataGenerator`, Keras preprocessing layers, imgaug, etc.

Let us start for now with a basic, coarsely tuned augmentation strategy based on Keras preprocessing layers.

In [40]:
# Todo, after you have analysed the model fit history for presence of underfit or overfit, choose an appropriate data augmentation strategy. 
# Your code goes here

data_augument = keras.Sequential([
                             
                             layers.experimental.preprocessing.RandomFlip(mode="horizontal_and_vertical",input_shape=(180,180,3)),
                             layers.experimental.preprocessing.RandomRotation(0.2, fill_mode='reflect')
#                              layers.experimental.preprocessing.RandomZoom(height_factor=(0.2, 0.3), width_factor=(0.2, 0.3), fill_mode='reflect')
])
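To show what these two layers do to the pixel grid, here is a small NumPy sketch on a tiny stand-in array. It reproduces only the deterministic flips; the Keras layers choose them at random for each image. The rotation arithmetic assumes the documented meaning of the `RandomRotation` factor as a fraction of a full turn.

```python
import numpy as np

# Tiny stand-in "image" of height 2, width 3, and 3 channels.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Horizontal flip mirrors the columns (axis 1); vertical flip mirrors the rows (axis 0).
flipped_horizontal = np.flip(image, axis=1)
flipped_vertical = np.flip(image, axis=0)

# A factor of 0.2 is a fraction of a full 360 degree turn, so the layer
# samples an angle between -72 and +72 degrees.
factor = 0.2
max_rotation_deg = factor * 360

print(flipped_horizontal[:, 0], image[:, -1])
print(f"rotation range: +/- {max_rotation_deg:.0f} degrees")
```

Flips only rearrange pixels, so labels stay valid; rotation leaves empty corners, which is why a `fill_mode` such as `reflect` is set on the layer.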

In [41]:
# Todo, visualize how your augmentation strategy works for one instance of training image.
# Your code goes here
ds=train_ds.take(1) # the dataset is already batched; take(1) yields the first batch of 32 images
print(type(ds))
print(ds)
for im, lab in ds:
    print(im.shape)

In [42]:
img=im[0].numpy().astype("uint8")
plt.imshow(img)

In [43]:
img.shape

In [44]:
img

In [45]:
image_resized_cast = tf.cast(tf.expand_dims(img, 0), tf.float32)
image_resized_cast.shape


In [46]:
#Individual Image
plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(data_augument(image_resized_cast)[0].numpy().astype("uint8"))
    plt.axis("off")

In [47]:
augmented_image = data_augument(image_resized_cast)
augmented_image

In [48]:
rescale= keras.Sequential([
    layers.experimental.preprocessing.Rescaling(1./255)
])
rescaled_image=rescale(image_resized_cast)[0].numpy()
plt.imshow(rescaled_image)

In [49]:
rescaled_image

In [50]:
#For batch: augment the whole batch once and show the first nine images
plt.figure(figsize=(12, 12))
for images, labels in train_ds.take(1):
    augmented_batch = data_augument(images)
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(augmented_batch[i].numpy().astype("uint8"))
        plt.axis("off")

### Todo:
### Create the model, compile and train the model

In [52]:
## You can use Dropout layer if there is an evidence of overfitting in your findings
## Your code goes here
model = Sequential([
                    data_augument,
                    rescale
])
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu', input_shape = (180, 180, 3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same',activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512, activation='relu')) # fully connected
model.add(Dropout(0.5))

model.add(Dense(9, activation = "softmax"))


### Compiling the model

In [53]:
## Your code goes here
# The final Dense layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
model.summary()

### Training the model

In [54]:
## Your code goes here, note: the assignment suggests about 20 epochs; 25 are used here
epochs=25
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### Visualizing the results

In [55]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Todo: Write your findings after the model fit, see if there is evidence of model overfit or underfit. Do you think there is some improvement now as compared to the previous model run?

Overfitting has been addressed, since training and validation accuracy are now close to each other, but training accuracy has dropped from about 75% to about 54% and validation accuracy has not improved:

Epoch 25/25

56/56 [==============================] - 2s 34ms/step - loss: 1.3128 - accuracy: 0.5363 - val_loss: 1.3443 - val_accuracy: 0.5347

There are many possible reasons for this, and class imbalance is one of them. Because the number of images differs from category to category, the model may be biased toward the larger classes.

Hence let us use a technique that brings the 9 categories to similar sample counts.


# Class Imbalance

#### **Todo:** Find the distribution of classes in the training dataset.
#### **Context:** Many times real life datasets can have class imbalance, one class can have proportionately higher number of samples compared to the others. Class imbalance can have a detrimental effect on the final model quality. Hence as a sanity check it becomes important to check what is the distribution of classes in the data.

In [63]:
## Your code goes here.
class_names=[]
file_count=[]
for dirname, _, filenames in os.walk(data_dir_train):
    if dirname.split("/")[-1]!="Train":
        print("Train", dirname.split("/")[-1], len(filenames))
        class_names.append(dirname.split("/")[-1])
        file_count.append(len(filenames))
    

plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
plt.bar(class_names,file_count)
plt.xticks(fontsize=8, rotation=90)
plt.subplot(1,2,2)

plt.pie(file_count, labels = class_names,autopct='%.2f')
plt.show()

#### **Todo:** Write your findings here: 
#### - Which class has the least number of samples?
#### - Which classes dominate the data in terms of the proportionate number of samples?


From the bar and pie charts above we can infer the following:

pigmented benign keratosis has the highest share of samples: 20.63%

seborrheic keratosis has the lowest share of samples: 3.44%
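The two percentages reported above are enough to quantify the imbalance. The sketch below is plain arithmetic on those shares; the variable names are illustrative.

```python
# Shares of the training set reported above, in percent.
largest_share = 20.63    # pigmented benign keratosis
smallest_share = 3.44    # seborrheic keratosis
num_classes = 9

# How many times more samples the largest class has than the smallest.
imbalance_ratio = largest_share / smallest_share

# Share each class would have if the data were perfectly balanced.
uniform_share = 100 / num_classes

print(f"largest / smallest: {imbalance_ratio:.1f}x")
print(f"balanced share per class: {uniform_share:.2f}%")
```

The largest class has roughly six times as many images as the smallest, and it sits well above the share of about 11% that a balanced dataset would give each class.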


#### **Todo:** Rectify the class imbalance
#### **Context:** You can use a python package known as `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes have very few samples.

In [64]:

class_names=['pigmented benign keratosis',
 'melanoma',
 'vascular lesion',
 'actinic keratosis',
 'squamous cell carcinoma',
 'basal cell carcinoma',
 'seborrheic keratosis',
 'dermatofibroma',
 'nevus']

In [65]:
!pip install Augmentor

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline’s` `sample()` method.

In [68]:
path_to_training_dataset=str(data_dir_train)
import Augmentor
for i in class_names:
    p = Augmentor.Pipeline( source_directory=pathlib.Path(path_to_training_dataset +"/"+ i),output_directory="/output/"+i)
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(500)

Augmentor has stored the augmented images under `/output/`, in one sub-directory per skin cancer type. Let's take a look at the total count of augmented images.

In [69]:
# image_count_train = len(list(data_dir_train.glob('/output/*.jpg')))
# print(image_count_train)
import os
for dirname, _, filenames in os.walk('/output/'):
    print("Train", dirname.split("/")[-1], len(filenames))

In [84]:
### Lets see the distribution of augmented data after adding new images to the original training data.
from glob import glob
path_list = [x for x in os.walk('/output/')]
path_list

# dataframe_dict_new = dict(zip(path_list_new, lesion_list_new))
# df2 = pd.DataFrame(list(dataframe_dict_new.items()),columns = ['Path','Label'])
# new_df = original_df.append(df2)
# new_df['Label'].value_counts()

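So we have now generated 500 augmented images for each class, which gives every class the same number of samples. Note that the datasets created below read only from `/output/`, so the model is trained and validated on these 9 * 500 = 4500 augmented images. More images can be generated if we want to improve the training process further.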

#### **Todo**: Train the model on the data created using Augmentor

In [106]:
batch_size = 32
img_height = 180
img_width = 180

#### **Todo:** Create a training dataset

In [107]:
data_dir_train="/output/"
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset= 'training', ## Todo: choose the correct parameter value, so that only training data is referred to
  image_size=(img_height, img_width),
  batch_size=batch_size)

#### **Todo:** Create a validation dataset

In [108]:
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset = 'validation', ## Todo: choose the correct parameter value, so that only validation data is referred to
  image_size=(img_height, img_width),
  batch_size=batch_size)
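The sizes of the two datasets follow from the split arithmetic. The sketch below assumes the validation subset gets `int(validation_split * n)` images and the training subset gets the rest, and that `/output/` holds exactly the 500 augmented images per class generated above. The function name is illustrative.

```python
import math

def split_sizes(n_images, validation_split, batch_size):
    """Return (training images, validation images, training batches per epoch)."""
    n_val = int(validation_split * n_images)
    n_train = n_images - n_val
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

# 9 classes with 500 augmented images each.
n_train, n_val, steps = split_sizes(9 * 500, validation_split=0.2, batch_size=32)
print(n_train, n_val, steps)
```

The resulting number of batches per epoch matches the `113/113` step counter in the training log for this model.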

#### **Todo:** Create your model (make sure to include normalization)

In [109]:
model = Sequential([
                    
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu', input_shape = (180, 180, 3)))
model.add(BatchNormalization())
model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same',activation ='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.35))

model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

model.add(Dropout(0.4))

model.add(Flatten())
model.add(Dense(512, activation='relu')) # fully connected

model.add(Dropout(0.5))

model.add(Dense(9, activation = "softmax"))


#### **Todo:** Compile your model (Choose optimizer and loss function appropriately)

In [110]:
# The final Dense layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
model.summary()

#### **Todo:**  Train your model

In [111]:
epochs=55
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

#### **Todo:**  Visualize the model results

In [112]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

### Observations

Training and validation accuracy reach 89% and 78% at epoch 55:

Epoch 55/55

113/113 [==============================] - 12s 96ms/step - loss: 0.2986 - accuracy: 0.8914 - val_loss: 1.0442 - val_accuracy: 0.7856

From the training and validation accuracy and loss graphs and the fit logs, the validation loss keeps fluctuating, so there may be many other saddle points the model could converge to. Epochs 38-42 look like a reasonable early-stopping point: the model reaches good accuracy at near-minimum validation loss, with an acceptable gap between training and validation performance.

As below:

Epoch 38/55

113/113 [==============================] - 12s 101ms/step - loss: 0.4759 - accuracy: 0.8214 - val_loss: 0.7690 - val_accuracy: 0.7933

Epoch 39/55

113/113 [==============================] - 12s 98ms/step - loss: 0.5008 - accuracy: 0.8114 - val_loss: 0.8947 - val_accuracy: 0.7578

Epoch 40/55

113/113 [==============================] - 12s 101ms/step - loss: 0.4350 - accuracy: 0.8383 - val_loss: 0.9823 - val_accuracy: 0.7500

Epoch 41/55

113/113 [==============================] - 12s 94ms/step - loss: 0.4349 - accuracy: 0.8394 - val_loss: 0.7954 - val_accuracy: 0.7711

The model above therefore performs reasonably well. We may still experiment with regularization, different dropout rates and other hyperparameters, cross-validation, and other optimization methods to obtain better-performing models.
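The early-stopping idea can be sketched in plain Python on the four validation losses quoted above. Only epochs 38 to 41 appear in the logs shown here, so this is an illustration of the patience rule on that window, not a replay of the full run; the function name and the patience value of 2 are assumptions. In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback.

```python
def early_stop(val_losses, patience):
    """Return (index of best loss, index at which training would stop)."""
    best_idx, wait = 0, 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx, wait = i, 0          # improvement: reset the counter
        else:
            wait += 1                      # no improvement this epoch
            if wait >= patience:
                return best_idx, i
    return best_idx, len(val_losses) - 1

# val_loss values copied from the log lines for epochs 38-41.
logged = {38: 0.7690, 39: 0.8947, 40: 0.9823, 41: 0.7954}
epochs_seen = sorted(logged)

best_idx, stop_idx = early_stop([logged[e] for e in epochs_seen], patience=2)
best_epoch, stop_epoch = epochs_seen[best_idx], epochs_seen[stop_idx]
print("best epoch:", best_epoch, "stop after epoch:", stop_epoch)
```

With a patience of 2 on this window, training would stop after epoch 40 and the weights from epoch 38 would be the ones to keep.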
