<a href="https://colab.research.google.com/github/GautamPoddar18/Melanoma-Detection-Assignment/blob/main/mda_assignment_notebook_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  Melanoma Detection Assignment




**Problem statement**: To build a CNN based model which can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early. It accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists about the presence of melanoma has the potential to reduce a lot of manual effort needed in diagnosis.

Objective: Building a multiclass classification model using a custom CNN in TensorFlow

**Data Description**
The dataset consists of 2357 images of malignant and benign oncological diseases, which were formed from the International Skin Imaging Collaboration (ISIC). All images were sorted according to the classification taken with ISIC, and all subsets were divided into the same number of images, with the exception of melanomas and moles, whose images are slightly dominant.

The data set contains the following diseases:

1. Actinic keratosis
2. Basal cell carcinoma
3. Dermatofibroma
4. Melanoma
5. Nevus
6. Pigmented benign keratosis
7. Seborrheic keratosis
8. Squamous cell carcinoma
9. Vascular lesion

Dataset link: https://drive.google.com/file/d/1xLfSQUGDl8ezNNbUkpuHOYvSpTyxVhCs/

## Project Pipeline

In [4]:
## Importing Libraries

import pathlib
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import PIL
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [5]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [6]:
import tensorflow.keras.preprocessing as pre

### 1. Data Reading/Data Understanding

In [7]:
## If you are using the data by mounting the google drive, use the following :
from google.colab import drive
drive.mount('/content/gdrive')
drive_path = '/content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/'
## Defining Path Variable in goggle Colab
path_train = pathlib.Path(drive_path + "Train")
path_test = pathlib.Path(drive_path + "Test")
## Image Count
train_count = len(list(path_train.glob('*/*.jpg')))
print('Train Image Count', train_count)
test_count = len(list(path_test.glob('*/*.jpg')))
print('Test Image Count',test_count)

Mounted at /content/gdrive
Train Image Count 2239
Test Image Count 118


In [8]:
'''

## Defining Path Variable in kaggle
path_train = pathlib.Path("../input/skin-cancer/Skin cancer ISIC The International Skin Imaging Collaboration/Train")
path_test = pathlib.Path('../input/skin-cancer/Skin cancer ISIC The International Skin Imaging Collaboration/Test')
## Image Count
train_count = len(list(path_train.glob('*/*.jpg')))
print('Train Image Count', train_count)
test_count = len(list(path_test.glob('*/*.jpg')))
print('Test Image Count',test_count)

'''

'\n\n## Defining Path Variable in kaggle\npath_train = pathlib.Path("../input/skin-cancer/Skin cancer ISIC The International Skin Imaging Collaboration/Train")\npath_test = pathlib.Path(\'../input/skin-cancer/Skin cancer ISIC The International Skin Imaging Collaboration/Test\')\n## Image Count\ntrain_count = len(list(path_train.glob(\'*/*.jpg\')))\nprint(\'Train Image Count\', train_count)\ntest_count = len(list(path_test.glob(\'*/*.jpg\')))\nprint(\'Test Image Count\',test_count)\n\n'

### 2. Dataset creation

#### Creating the Dataset

In [9]:
batch_size = 32
img_height = 180
img_width = 180

#### Keeping 70/30 Train and Validation Dataset Ratio and using seed=123

In [10]:
## Writing Train dataset
train_ds = pre.image_dataset_from_directory(
    path_train,
    seed=123,
    validation_split= 0.3,
    subset= 'training',
    image_size=(img_height,img_width),
    batch_size = batch_size
)

Found 3866 files belonging to 9 classes.
Using 2707 files for training.


In [11]:
## Writing Validation dataset
val_ds = pre.image_dataset_from_directory(
    path_train,
    seed=123,
    validation_split= 0.3,
    subset= 'validation',
    image_size=(img_height,img_width),
    batch_size = batch_size
)

Found 3866 files belonging to 9 classes.
Using 1159 files for validation.


In [12]:
## Writing Test dataset
test_ds = pre.image_dataset_from_directory(
    path_test,
    seed=123,
    image_size=(img_height,img_width),
    batch_size = batch_size
)

Found 118 files belonging to 9 classes.


In [13]:
print("Train Dataset Class Names :\n",train_ds.class_names)

Train Dataset Class Names :
 ['actinic keratosis', 'basal cell carcinoma', 'dermatofibroma', 'melanoma', 'nevus', 'pigmented benign keratosis', 'seborrheic keratosis', 'squamous cell carcinoma', 'vascular lesion']


In [14]:
print("Train Dataset Class Names :\n",test_ds.class_names)

Train Dataset Class Names :
 ['actinic keratosis', 'basal cell carcinoma', 'dermatofibroma', 'melanoma', 'nevus', 'pigmented benign keratosis', 'seborrheic keratosis', 'squamous cell carcinoma', 'vascular lesion']


### 3. Dataset visualisation

In [None]:
dataset_classes = train_ds.class_names
test_dataset_classes = test_ds.class_names

In [None]:
import matplotlib.image as mpimg
plt.figure(figsize=(10,10))
for i in range(9): 
  plt.subplot(3, 3, i + 1)
  image = mpimg.imread(str(list(path_train.glob(dataset_classes[i]+'/*.jpg'))[1]))
  plt.title(dataset_classes[i])
  plt.imshow(image)

In [None]:
# test dataset visualization
plt.figure(figsize=(10,10))
for i in range(9): 
  plt.subplot(3, 3, i + 1)
  image = mpimg.imread(str(list(path_test.glob(test_dataset_classes[i]+'/*.jpg'))[1]))
  plt.title(test_dataset_classes[i])
  plt.imshow(image)

The `image_batch` is a tensor of the shape `(32, 180, 180, 3)`. This is a batch of 32 images of shape `180x180x3` (the last dimension refers to color channels RGB). The `label_batch` is a tensor of the shape `(32,)`, these are corresponding labels to the 32 images.



`Dataset.cache()` keeps the images in memory after they're loaded off disk during the first epoch.

`Dataset.prefetch()` overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### 4. Model Building & training

#### Model 1

In [None]:
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D

In [None]:
## Model Defination
num_classes = 9
model = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Flatten())
model.add(Dense(num_classes, activation = "softmax"))

In [None]:
## Model Compile
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
## View the summary of all layers
model.summary()

In [None]:
## Model Training
epochs=10
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
## Visualizing Training Results
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

**Since no pattern observed, increasing the number of epochs to 30 for observation**

In [None]:
## Model Training Epoch =20
epochs=20
history_epoch20 = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
## Visualizing Training Results
acc = history_epoch20.history['accuracy']
val_acc = history_epoch20.history['val_accuracy']

loss = history_epoch20.history['loss']
val_loss = history_epoch20.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

**Intial Findings**
There is overfitting observed in the model as validation accuracy decreases while train accuracy increases around 17th epoch

In [None]:
## Test dataset Prediction and Accuracy
y_true=[]
y_pred=[]
for images, labels in test_ds.take(1):
  # print(model.predict_classes(images))
  # print(labels.numpy())
  y_true=list(labels.numpy())
  y_pred=model.predict_classes(images)
  # break
print(classification_report(y_true,y_pred,target_names=dataset_classes))
print("------"*20)
print("Accuracy on test dataset : ",(accuracy_score(y_true,y_pred)*100      )         )

#### Using a RMSprop Optimizers

In [None]:
## Model Defination
num_classes = 9
model_rmsprop = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model_rmsprop.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_rmsprop.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_rmsprop.add(MaxPool2D(pool_size=(2,2)))
model_rmsprop.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_rmsprop.add(MaxPool2D(pool_size=(2,2)))
model_rmsprop.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_rmsprop.add(MaxPool2D(pool_size=(2,2)))
model_rmsprop.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_rmsprop.add(MaxPool2D(pool_size=(2,2)))
model_rmsprop.add(Dropout(0.25))


model_rmsprop.add(Flatten())
model_rmsprop.add(Dense(num_classes, activation = "softmax"))


In [None]:
## Model Compile
model_rmsprop.compile(optimizer='rmsprop',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
## View the summary of all layers
model_rmsprop.summary()

In [None]:
## Model Training
epochs=20
history_rmsprop = model_rmsprop.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
## Visualizing Training Results
acc = history_rmsprop.history['accuracy']
val_acc = history_rmsprop.history['val_accuracy']

loss = history_rmsprop.history['loss']
val_loss = history_rmsprop.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

In [None]:
## Test dataset Prediction and Accuracy
y_true=[]
y_pred=[]
for images, labels in test_ds.take(1):
  # print(model.predict_classes(images))
  # print(labels.numpy())
  y_true=list(labels.numpy())
  y_pred=model_rmsprop.predict_classes(images)
  # break
print(classification_report(y_true,y_pred,target_names=dataset_classes))
print("------"*20)
print("Accuracy on test dataset : ",(accuracy_score(y_true,y_pred)*100      )         )

**Observation: No major improvement in Accuracy**

### 5. Data augmentation

In [None]:
data_aug = keras.Sequential([
                             layers.experimental.preprocessing.RandomFlip(mode="horizontal_and_vertical",input_shape=(img_height,img_width,3)),
                             layers.experimental.preprocessing.RandomRotation(0.2, fill_mode='reflect'),
                             layers.experimental.preprocessing.RandomZoom(height_factor=(0.2, 0.3), width_factor=(0.2, 0.3), fill_mode='reflect')
])

In [None]:
# visualize how your augmentation strategy works for one instance of training image.
plt.figure(figsize=(12, 12))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(data_aug(images)[i].numpy().astype("uint8"))
        plt.title(dataset_classes[labels[i]])
        plt.axis("off")

### 6. Model Building & training on Augmented Data

#### Model 2

In [None]:
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D\

#### Model Building on Augmented Data using Adam Optimizer

In [None]:
 
num_classes = 9
model_aug_adam = Sequential([ data_aug,
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
      
])
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Dropout(0.25))


model_aug_adam.add(Flatten())
model_aug_adam.add(Dense(num_classes, activation = "softmax"))

## Model 2 Compilation
model_aug_adam.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=30
history_aug_adam = model_aug_adam.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
# Model 2 Visualizaiton
acc = history_aug_adam.history['accuracy']
val_acc = history_aug_adam.history['val_accuracy']

loss = history_aug_adam.history['loss']
val_loss = history_aug_adam.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

**Observation: Overfitting and loss both reduce in this model**

#### Model Building on Augmented Data using Stochastic gradient descent(SGD) Optimizer

In [None]:
 
num_classes = 9
model_aug_SGD = Sequential([ data_aug,
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
      
])
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Dropout(0.25))


model_aug_SGD.add(Flatten())
model_aug_SGD.add(Dense(num_classes, activation = "softmax"))

## Model Compilation
model_aug_SGD.compile(optimizer='sgd',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=30
history_aug_sgd = model_aug_SGD.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
# Model 2 Visualizaiton
acc = history_aug_sgd.history['accuracy']
val_acc = history_aug_sgd.history['val_accuracy']

loss = history_aug_sgd.history['loss']
val_loss = history_aug_sgd.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Model Building on Augmented Data using Adagrad Optimizer

In [None]:
 
num_classes = 9
model_aug_adagrad = Sequential([ data_aug,
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
      
])
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Dropout(0.25))


model_aug_adagrad.add(Flatten())
model_aug_adagrad.add(Dense(num_classes, activation = "softmax"))

## Model Compilation
model_aug_adagrad.compile(optimizer='adagrad',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=30
history_aug_adagrad = model_aug_adagrad.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
# Model 2 Visualizaiton
acc = history_aug_adagrad.history['accuracy']
val_acc = history_aug_adagrad.history['val_accuracy']

loss = history_aug_adagrad.history['loss']
val_loss = history_aug_adagrad.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Augmented Models Predication on Test Dataset

In [None]:
#checking the performance on the test set 
y_true=[]
y_pred=[]
for images, labels in test_ds.take(1):
  y_true=list(labels.numpy())
  y_pred=model_aug_adam.predict_classes(images)
  # break
  print("Adam optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  

  print("*"*20)
  y_pred=model_aug_adagrad.predict_classes(images)
  # break
  print("Adagrad optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  
    
  print("*"*20)
  y_pred=model_aug_SGD.predict_classes(images)
  # break
  print("SGD optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  print("*"*20)

**Findings** 
1. After addition of Agumentation layers we were able to reduce the model's overfitting. However in this case the model is not able to generalise well. 
2. We tried out different set of optmizers sgd , adagrad , adams which gave models that had low Training and Validation accuracy. 
3. The accuracy figures were less than 50% both for training and validation.
4. Maximum Accuracy on Test Dataset we were able to achive was 46%

### 7. Class distribution

In [None]:
path_list=[]
lesion_list=[]
for i in dataset_classes:
      
    for j in path_train.glob(i+'/*.jpg'):
        path_list.append(str(j))
        lesion_list.append(i)
dataframe_dict_original = dict(zip(path_list, lesion_list))
original_df = pd.DataFrame(list(dataframe_dict_original.items()),columns = ['Path','Label'])
original_df.head()

In [None]:
dataframe_dict_original = dict(zip(path_list, lesion_list))
original_df = pd.DataFrame(list(dataframe_dict_original.items()),columns = ['Path','Label'])
original_df.head()

In [None]:
count=[]
for i in dataset_classes:
    count.append(len(list(path_train.glob(i+'/*.jpg'))))
plt.figure(figsize=(25,10))
plt.bar(dataset_classes,count)

In [None]:
dataset_classes

#### Observation
* Highest Distribution Class: pigmented benign keratosis
* Lowest Distribution Class: seborrheic keratosis

### 8. Handling class imbalances

In [15]:
!pip install Augmentor

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Augmentor
  Downloading Augmentor-0.2.10-py2.py3-none-any.whl (38 kB)
Installing collected packages: Augmentor
Successfully installed Augmentor-0.2.10


In [16]:
import Augmentor

In [17]:
from glob import glob

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline’s` `sample()` method.

In [18]:
path_train

PosixPath('/content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train')

In [19]:
path_to_training_dataset = '/content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/'
path_to_training_dataset

'/content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/'

In [20]:
dataset_classes = train_ds.class_names
dataset_classes

['actinic keratosis',
 'basal cell carcinoma',
 'dermatofibroma',
 'melanoma',
 'nevus',
 'pigmented benign keratosis',
 'seborrheic keratosis',
 'squamous cell carcinoma',
 'vascular lesion']

In [21]:
for i in dataset_classes:
    p = Augmentor.Pipeline(path_to_training_dataset + i)
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(500)

Initialised with 114 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/actinic keratosis/output.

Processing <PIL.Image.Image image mode=RGB size=600x450 at 0x7F9C1CC11F50>: 100%|██████████| 500/500 [00:20<00:00, 24.84 Samples/s]


Initialised with 376 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/basal cell carcinoma/output.

Processing <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=600x450 at 0x7F9C10338610>: 100%|██████████| 500/500 [00:19<00:00, 26.22 Samples/s]


Initialised with 95 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/dermatofibroma/output.

Processing <PIL.Image.Image image mode=RGB size=600x450 at 0x7F9C10380950>: 100%|██████████| 500/500 [00:18<00:00, 26.70 Samples/s]


Initialised with 438 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/melanoma/output.

Processing <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1024x768 at 0x7F9C102EC3D0>: 100%|██████████| 500/500 [01:48<00:00,  4.62 Samples/s]


Initialised with 357 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/nevus/output.

Processing <PIL.Image.Image image mode=RGB size=919x802 at 0x7F9C102F7E90>: 100%|██████████| 500/500 [01:34<00:00,  5.28 Samples/s]


Initialised with 462 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/pigmented benign keratosis/output.

Processing <PIL.Image.Image image mode=RGB size=600x450 at 0x7F9C103B3250>: 100%|██████████| 500/500 [00:22<00:00, 22.63 Samples/s]


Initialised with 77 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/seborrheic keratosis/output.

Processing <PIL.Image.Image image mode=RGB size=1024x768 at 0x7F9C1CD857D0>: 100%|██████████| 500/500 [00:45<00:00, 10.88 Samples/s]


Initialised with 181 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/squamous cell carcinoma/output.

Processing <PIL.Image.Image image mode=RGB size=600x450 at 0x7F9C1E374050>: 100%|██████████| 500/500 [00:20<00:00, 24.42 Samples/s]


Initialised with 139 image(s) found.
Output directory set to /content/gdrive/My Drive/Machine Learning/Melanoma Detection Assignment/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/vascular lesion/output.

Processing <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=600x450 at 0x7F9C1CB63B90>: 100%|██████████| 500/500 [00:19<00:00, 26.14 Samples/s]


In [23]:
image_count_train = len(list(path_train.glob('*/output/*.jpg')))
print(image_count_train)

6127


### 9. Model Building & training on the rectified class imbalance data 

In [24]:
path_list_new = [x for x in glob(os.path.join(path_train, '*','output', '*.jpg'))]
# path_list

In [25]:
lesion_list_new = [os.path.basename(os.path.dirname(os.path.dirname(y))) for y in glob(os.path.join(path_train, '*','output', '*.jpg'))]
# lesion_list_new

In [26]:
dataframe_dict_new = dict(zip(path_list_new, lesion_list_new))

In [27]:
df2 = pd.DataFrame(list(dataframe_dict_new.items()),columns = ['Path','Label'])
# new_df = original_df.append(df2) 

In [29]:
#created 500 samples for each
df2['Label'].value_counts()

dermatofibroma                1000
actinic keratosis             1000
basal cell carcinoma          1000
melanoma                       627
squamous cell carcinoma        500
pigmented benign keratosis     500
vascular lesion                500
nevus                          500
seborrheic keratosis           500
Name: Label, dtype: int64

In [34]:
train_path_list = list(path_train.glob('*/*.jpg'))
df=pd.DataFrame({"cancer_type":[str(x).split("/")[2] for x in train_path_list]})

In [35]:
#new counts
new_list=list(df['cancer_type'].values)
new_list.extend(list(df2['Label'].values))
len(new_list)
final_df=pd.DataFrame({"cancer_type":new_list})
final_df['cancer_type'].value_counts()

gdrive                        2239
dermatofibroma                1000
actinic keratosis             1000
basal cell carcinoma          1000
melanoma                       627
squamous cell carcinoma        500
pigmented benign keratosis     500
vascular lesion                500
nevus                          500
seborrheic keratosis           500
Name: cancer_type, dtype: int64

So, now we have added 500 images to all the classes to maintain some class balance. We can add more images as we want to improve training process.

#### Creating the Dataset after augmentation

In [36]:
batch_size = 32
img_height = 180
img_width = 180

#### Keeping 70/30 Train and Validation Dataset Ratio and using seed=123

In [37]:
## Writing Train dataset
train_ds_aug = pre.image_dataset_from_directory(
    path_train,
    seed=123,
    validation_split= 0.3,
    subset= 'training',
    image_size=(img_height,img_width),
    batch_size = batch_size
)

Found 8366 files belonging to 9 classes.
Using 5857 files for training.


In [38]:
## Writing Validation dataset
val_ds_aug = pre.image_dataset_from_directory(
    path_train,
    seed=123,
    validation_split= 0.3,
    subset= 'validation',
    image_size=(img_height,img_width),
    batch_size = batch_size
)

Found 8366 files belonging to 9 classes.
Using 2509 files for validation.


#### Model Building & training on Augmented Data and Class Imblance handling

In [39]:
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D\

#### Model Building on Augmented Data using Adam Optimizer

In [None]:
num_classes = 9
model_aug_adam = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adam.add(MaxPool2D(pool_size=(2,2)))
model_aug_adam.add(Dropout(0.25))


model_aug_adam.add(Flatten())
model_aug_adam.add(Dense(num_classes, activation = "softmax"))

## Model 2 Compilation
model_aug_adam.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=20
history_aug_adam = model_aug_adam.fit(
  train_ds_aug,
  validation_data=val_ds_aug,
  epochs=epochs
)

Epoch 1/30

In [None]:
# Model 2 Visualizaiton
acc = history_aug_adam.history['accuracy']
val_acc = history_aug_adam.history['val_accuracy']

loss = history_aug_adam.history['loss']
val_loss = history_aug_adam.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

**Observation: Overfitting and loss both reduce in this model**

#### Model Building on Augmented Data using Stochastic gradient descent(SGD) Optimizer

In [None]:
 
num_classes = 9
model_aug_SGD = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_SGD.add(MaxPool2D(pool_size=(2,2)))
model_aug_SGD.add(Dropout(0.25))


model_aug_SGD.add(Flatten())
model_aug_SGD.add(Dense(num_classes, activation = "softmax"))

## Model Compilation
model_aug_SGD.compile(optimizer='sgd',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=20
history_aug_sgd = model_aug_SGD.fit(
  train_ds_aug,
  validation_data=val_ds_aug,
  epochs=epochs
)

In [None]:
# Model 2 Visualizaiton
acc = history_aug_sgd.history['accuracy']
val_acc = history_aug_sgd.history['val_accuracy']

loss = history_aug_sgd.history['loss']
val_loss = history_aug_sgd.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Model Building on Augmented Data using Adagrad Optimizer

In [None]:
 
num_classes = 9
model_aug_adagrad = Sequential([
                    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width,3))
])
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (180, 180, 32)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu'))
model_aug_adagrad.add(MaxPool2D(pool_size=(2,2)))
model_aug_adagrad.add(Dropout(0.25))


model_aug_adagrad.add(Flatten())
model_aug_adagrad.add(Dense(num_classes, activation = "softmax"))

## Model Compilation
model_aug_adagrad.compile(optimizer='adagrad',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Model 2 Training
epochs=20
history_aug_adagrad = model_aug_adagrad.fit(
  train_ds_aug,
  validation_data=val_ds_aug,
  epochs=epochs
)

In [None]:
# Model 2 Visualizaiton
acc = history_aug_adagrad.history['accuracy']
val_acc = history_aug_adagrad.history['val_accuracy']

loss = history_aug_adagrad.history['loss']
val_loss = history_aug_adagrad.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Augmented Models Predication on Test Dataset

In [None]:
#checking the performance on the test set 
y_true=[]
y_pred=[]
for images, labels in test_ds.take(1):
  y_true=list(labels.numpy())
  y_pred=model_aug_adam.predict_classes(images)
  # break
  print("Adam optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  

  print("*"*20)
  y_pred=model_aug_adagrad.predict_classes(images)
  # break
  print("Adagrad optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  
    
  print("*"*20)
  y_pred=model_aug_SGD.predict_classes(images)
  # break
  print("SGD optimizer")
  print(classification_report(y_true,y_pred,target_names=dataset_classes))
  print("Accuracy on test dataset : ",accuracy_score(y_true,y_pred))
  print("*"*20)

#### Conculsion

1. Adding augmented images helped in handling class imblance.
2. Model trained accuracy increased to above 70% and validation above 65%
3. Till first 15 epocs learning rate is very high then validation accuracy decreases while train accuracy still increases for some epochs
