Problem statement: Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early. It accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

### Importing Skin Cancer Data
#### To do: Take necessary actions to read the data

### Importing all the important libraries

In [None]:
import pathlib
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import PIL
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [None]:
from google.colab import drive
drive.mount('/content/drive')

This assignment uses a dataset of about 2357 images of skin cancer types. The Train and Test directories each contain 9 sub-directories, one per skin cancer type, holding the images of that type.

In [None]:
data_dir_train = pathlib.Path("/content/drive/MyDrive/Dataset/Train") 
data_dir_test = pathlib.Path("/content/drive/MyDrive/Dataset/Test")

In [None]:
image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
print(f"Number of training images: {image_count_train}")
image_count_test = len(list(data_dir_test.glob('*/*.jpg')))
print(f"Number of testing images: {image_count_test}")

### Load using keras.preprocessing

Let's load these images off disk using the helpful image_dataset_from_directory utility.

### Create a dataset

Define some parameters for the loader:

In [None]:
batch_size = 32
img_height = 180
img_width = 180

Use 80% of the images for training, and 20% for validation.

In [None]:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size
)

In [None]:
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size
)
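
To make the 80/20 split concrete, the sketch below reproduces the arithmetic on a hypothetical list of 1,000 file names. It assumes a seeded shuffle followed by holding out the last 20% of the files; `split_files` is an illustrative helper, not part of Keras, and the ordering Keras uses internally may differ.

```python
import random

def split_files(files, validation_split=0.2, seed=123):
    """Shuffle deterministically, then hold out the final fraction for validation."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_val = int(validation_split * len(shuffled))
    n_train = len(shuffled) - n_val
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names, used only for illustration
files = [f"img_{i}.jpg" for i in range(1000)]
train_files, val_files = split_files(files)
print(len(train_files), len(val_files))
```

Because both loader calls above pass the same `seed`, the shuffle should be identical for the two subsets, which is what keeps them from overlapping.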

In [None]:
class_names = train_ds.class_names
print(f"Classes: {class_names}")
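
As a rough illustration of where `class_names` comes from, the sketch below assumes the class names are the sub-directory names in sorted order, so that integer label `i` maps to the `i`-th name. `infer_class_names` and the temporary directory tree are hypothetical and only mimic that behavior.

```python
import pathlib
import tempfile

def infer_class_names(root):
    """Return the names of the immediate sub-directories in sorted order."""
    return sorted(p.name for p in pathlib.Path(root).iterdir() if p.is_dir())

# Build a throwaway directory tree with a few class folders
with tempfile.TemporaryDirectory() as tmp:
    for name in ["nevus", "melanoma", "dermatofibroma"]:
        (pathlib.Path(tmp) / name).mkdir()
    demo_class_names = infer_class_names(tmp)

# Integer label i corresponds to demo_class_names[i]
label_to_name = dict(enumerate(demo_class_names))
print(label_to_name)
```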

### Visualize the data
#### Todo, create a code to visualize one instance of all the nine classes present in the dataset

In [None]:
# Visualize one instance of each class
plt.figure(figsize=(10, 10))
for i, class_name in enumerate(class_names):
    # Take the first image file found in this class's directory
    image_path = sorted((data_dir_train / class_name).glob('*.jpg'))[0]
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(plt.imread(str(image_path)))
    plt.title(class_name)
    plt.axis("off")
plt.show()


Each element yielded by `train_ds` is an `(image_batch, label_batch)` pair. The `image_batch` is a tensor of shape `(32, 180, 180, 3)`: a batch of 32 images of shape `180x180x3` (the last dimension holds the RGB color channels). The `label_batch` is a tensor of shape `(32,)` holding the labels that correspond to the 32 images.

`Dataset.cache()` keeps the images in memory after they're loaded off disk during the first epoch.

`Dataset.prefetch()` overlaps data preprocessing and model execution while training.
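
Since `cache()` holds decoded images in memory, a quick size estimate can help decide whether caching is affordable. The sketch below assumes each image is stored as `180x180x3` values of 4 bytes each (float32); the image count of 2000 is a hypothetical figure, and `cached_size_bytes` is an illustrative helper.

```python
def cached_size_bytes(n_images, height=180, width=180, channels=3, bytes_per_value=4):
    """Approximate memory needed to hold n_images decoded images (float32 by assumption)."""
    return n_images * height * width * channels * bytes_per_value

bytes_per_image = cached_size_bytes(1)
total_mb = cached_size_bytes(2000) / 1e6  # 2000 is a hypothetical image count
print(f"{bytes_per_image} bytes per image, about {total_mb:.0f} MB for 2000 images")
```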

In [None]:
AUTOTUNE = tf.data.AUTOTUNE  # non-experimental alias of tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Create the model
#### Todo: Create a CNN model that can accurately detect the 9 classes present in the dataset. Use ```layers.Rescaling``` (```layers.experimental.preprocessing.Rescaling``` in older TensorFlow releases) to normalize pixel values to the `[0, 1]` range. The RGB channel values are in the `[0, 255]` range, which is not ideal for a neural network, so it is good practice to standardize them to `[0, 1]`.

In [None]:
model = Sequential([
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(len(class_names), activation='softmax')
])
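
As a sanity check on the architecture above, the feature-map sizes can be worked out by hand. The sketch assumes the Keras defaults for these layers (stride-1 convolutions without padding, and 2x2 pooling with stride 2); `conv_out` and `pool_out` are illustrative helpers.

```python
def conv_out(size, kernel=3):
    """Output size of a stride-1 convolution with no padding ('valid')."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

size = 180
sizes = []
for _ in range(3):  # three Conv2D + MaxPooling2D pairs
    size = pool_out(conv_out(size))
    sizes.append(size)

flattened = size * size * 128          # 128 filters in the last Conv2D
dense_params = flattened * 128 + 128   # weights + biases of Dense(128)
print(sizes, flattened, dense_params)
```

The numbers can be compared against the output of `model.summary()`. If the arithmetic holds, most of the parameters sit in the first `Dense` layer, which is where the dropout layer is applied.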

### Compile the model
Choose an appropriate optimiser and loss function for model training.

In [None]:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
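
`sparse_categorical_crossentropy` suits this setup because the labels are integer class indices rather than one-hot vectors. For a single sample, the loss is the negative log of the probability the model assigns to the true class. The sketch below is a hand-rolled, single-sample version for illustration only, with a hypothetical softmax output; it is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probabilities[label])

# Hypothetical softmax output over three classes
probs = [0.1, 0.7, 0.2]
loss_confident = sparse_categorical_crossentropy(1, probs)  # true class gets 0.7
loss_wrong = sparse_categorical_crossentropy(0, probs)      # true class gets 0.1
print(round(loss_confident, 4), round(loss_wrong, 4))
```

The loss grows quickly as the probability assigned to the true class shrinks, which is why a rising validation loss can signal overconfident wrong predictions.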

In [None]:
# View the summary of all layers
model.summary()

### Train the model

In [None]:
epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)


### Visualizing training results

In [None]:
# Extract accuracy and loss from history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

# Plot training and validation accuracy
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

# Plot training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()


#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit

### Write your findings here

### Findings After Model Fit
The model exhibits signs of **overfitting**.

1. **Training and Validation Trends**:
   - Training accuracy steadily improved from 20% to 64% over 20 epochs.
   - Validation accuracy peaked at ~56% but stagnated and slightly dropped afterward.
   - Training loss decreased consistently, but validation loss increased after epoch 10.

2. **Evidence of Overfitting**:
   - A noticeable gap between training and validation accuracy indicates overfitting.
   - Validation loss fluctuated and eventually increased, further suggesting overfitting.

3. **Suggested Improvements**:
   - Apply data augmentation to improve generalization by introducing randomness to training data.
   - Increase the dropout rate in the model to reduce overfitting.
   - Consider reducing the learning rate to stabilize training and validation performance.
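
The checks described above can be automated in a few lines. The sketch below takes a dictionary shaped like `history.history` and reports the epoch with the lowest validation loss and the final gap between training and validation accuracy. The curves in `demo_history` are hypothetical, not the values from this run, and `summarize_history` is an illustrative helper.

```python
def summarize_history(history):
    """Report the best validation epoch and the final train/validation accuracy gap."""
    val_loss = history["val_loss"]
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history["accuracy"][-1] - history["val_accuracy"][-1]
    return best_epoch, gap

# Hypothetical curves, not the values from the run above
demo_history = {
    "accuracy":     [0.20, 0.35, 0.48, 0.57, 0.64],
    "val_accuracy": [0.22, 0.36, 0.50, 0.56, 0.53],
    "val_loss":     [2.00, 1.70, 1.45, 1.40, 1.55],
}
best_epoch, gap = summarize_history(demo_history)
print(f"lowest validation loss at epoch index {best_epoch}, final accuracy gap {gap:.2f}")
```

A best epoch well before the last one, combined with a widening gap, is the pattern usually read as overfitting.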

In [None]:
data_augmentation = Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2)
])

# Visualize augmentation
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
    for i in range(9):
        augmented_image = data_augmentation(images[i])
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(augmented_image.numpy().astype("uint8"))
        plt.axis("off")

### Todo:
### Create the model, compile and train the model


In [None]:
model = Sequential([
    data_augmentation,  # Include data augmentation
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # Reduces overfitting
    layers.Dense(len(class_names), activation='softmax')
])

### Compiling the model

In [None]:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

### Training the model

In [None]:
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=30
)

### Visualizing the results

In [None]:
# Extract accuracy and loss from history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(30)

# Plot training and validation accuracy
plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

# Plot training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()


#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit. Do you think there is some improvement now as compared to the previous model run?

Data augmentation has improved the model's ability to generalize to unseen data. The signs of overfitting have been mitigated, but there is still room for improvement to address underfitting.

#### **Todo:** Find the distribution of classes in the training dataset.
#### **Context:** Real-life datasets often have class imbalance: one class can have a proportionately higher number of samples than the others. Class imbalance can have a detrimental effect on the final model quality, so as a sanity check it is important to examine the distribution of classes in the data.

In [None]:
for label in class_names:
    print(f"Class {label}: {len(list(data_dir_train.glob(f'{label}/*.jpg')))} images")


#### **Todo:** Write your findings here: 
#### - Which class has the least number of samples?
#### - Which classes dominate the data in terms of the proportionate number of samples?


Class counts from the Augmentor log (each `Initialised with N image(s) found.` line is printed before the name of the class it refers to; the count for actinic keratosis and the name of the last class were not captured):

- basal cell carcinoma: 376 images
- dermatofibroma: 95 images
- melanoma: 438 images
- nevus: 357 images
- pigmented benign keratosis: 462 images
- seborrheic keratosis: 77 images
- squamous cell carcinoma: 181 images
- (class name not captured): 139 images

Least Represented Class: among the counts captured, seborrheic keratosis has the fewest samples, with 77 images, so it stands to benefit most from additional samples.

Dominant Classes: pigmented benign keratosis dominates the dataset with 462 images, followed by melanoma with 438.
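
Raw counts are easier to compare as shares of the total. The sketch below computes both from a flat list of labels; `demo_labels` is hypothetical data and `class_distribution` is an illustrative helper.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, share of total)} ordered from largest to smallest class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.most_common()}

# Hypothetical labels, one entry per image
demo_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
distribution = class_distribution(demo_labels)
for name, (n, share) in distribution.items():
    print(f"{name}: {n} images ({share:.0%})")
```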

#### **Todo:** Rectify the class imbalance
#### **Context:** You can use a python package known as `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes have very few samples.

In [None]:
import Augmentor
for label in class_names:
    p = Augmentor.Pipeline(str(data_dir_train / label))
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(500)  # Adds 500 augmented samples to each class


To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline’s` `sample()` method.



Augmentor has stored the augmented images in the `output` sub-directory of each skin cancer type's directory. Let's take a look at the total count of augmented images.

In [None]:
image_count_train = len(list(data_dir_train.glob('*/output/*.jpg')))
print(image_count_train)

### Let's see the distribution of augmented data after adding new images to the original training data.

In [None]:
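from glob import glob

# Paths of all augmented images written by Augmentor
path_list_new = glob(os.path.join(data_dir_train, '*', 'output', '*.jpg'))
print(path_list_new)

In [None]:
# The class label is the directory two levels above each augmented image
lesion_list_new = [os.path.basename(os.path.dirname(os.path.dirname(y))) for y in path_list_new]
print(lesion_list_new)

In [None]:
dataframe_dict_new = dict(zip(path_list_new, lesion_list_new))

In [None]:
df2 = pd.DataFrame(list(dataframe_dict_new.items()), columns=['Path', 'Label'])

# Build a matching DataFrame for the original (non-augmented) images
path_list = glob(os.path.join(data_dir_train, '*', '*.jpg'))
lesion_list = [os.path.basename(os.path.dirname(y)) for y in path_list]
original_df = pd.DataFrame({'Path': path_list, 'Label': lesion_list})

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
new_df = pd.concat([original_df, df2], ignore_index=True)

In [None]:
print(new_df['Label'].value_counts())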

We have now added 500 images to every class. This does not equalize the class sizes, but it reduces the relative imbalance and ensures that no class is sparse. More images can be added if needed to improve the training process.
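
To quantify the effect, the sketch below compares the ratio between the largest and smallest class before and after adding 500 samples to each. It uses the smallest and largest counts listed earlier (77 and 462); `imbalance_ratio` is an illustrative helper.

```python
def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (1.0 means perfectly balanced)."""
    return max(counts) / min(counts)

smallest, largest = 77, 462   # smallest and largest counts listed earlier
added = 500                   # samples Augmentor adds to every class

before = imbalance_ratio([smallest, largest])
after = imbalance_ratio([smallest + added, largest + added])
print(f"before: {before:.2f}, after: {after:.2f}")
```

Adding the same number of samples to every class narrows the ratio without removing it; reaching exact balance would require a per-class sample count, such as a target size minus the current count.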

#### **Todo**: Train the model on the data created using Augmentor

In [None]:
batch_size = 32
img_height = 180
img_width = 180

#### **Todo:** Create a training dataset

In [None]:
data_dir_train_augmented = "/content/drive/MyDrive/Dataset/Train"

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train_augmented,
    seed=123,
    validation_split=0.2,
    subset="training", 
    image_size=(img_height, img_width),
    batch_size=batch_size
)

#### **Todo:** Create a validation dataset

In [None]:
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train_augmented,
    seed=123,
    validation_split=0.2,
    subset="validation", 
    image_size=(img_height, img_width),
    batch_size=batch_size
)


#### **Todo:** Create your model (make sure to include normalization)

In [None]:
model = Sequential([
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(len(class_names), activation='softmax')
])


#### **Todo:** Compile your model (Choose optimizer and loss function appropriately)

In [None]:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)


#### **Todo:**  Train your model

In [None]:
epochs = 50
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)


#### **Todo:**  Visualize the model results

In [None]:
# Extract accuracy and loss from history
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

# Plot training and validation accuracy
plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

# Plot training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()


#### **Todo:**  Analyze your results here. Did you get rid of underfitting/overfitting? Did class rebalance help?



Training Accuracy: Increased to 75%, indicating that the model is learning more effectively from the rebalanced and augmented dataset.

Validation Accuracy: Improved to 68%, showing better generalization to unseen data.

Loss Behavior: Both training and validation losses have decreased steadily, with validation loss showing less fluctuation, indicating enhanced model stability.

Overfitting vs. Underfitting:

- The reduced gap between training and validation accuracy suggests a decrease in overfitting.
- However, the model still doesn't achieve high accuracy, indicating potential underfitting or the need for further model tuning.

Effect of Class Balancing: Rebalancing the classes with Augmentor coincided with a clear improvement in overall accuracy, which suggests the model is less biased towards the majority classes. Per-class metrics, such as recall on melanoma, were not computed here, so the effect on individual classes, melanoma included, still needs to be verified.

Conclusion: While data augmentation and class balancing have improved both training and validation performance, the model still shows signs of underfitting. Further steps such as increasing model complexity, experimenting with different architectures, fine-tuning hyperparameters, or incorporating more data might be necessary to achieve higher accuracy.