## **ABOUT THE DATASET**

HAM10000 ("Human Against Machine with 10000 training images") is a large collection of multi-source dermatoscopic images of pigmented skin lesions.

The images were collected from different populations and were acquired and stored using different modalities. The final dataset consists of 10015 dermatoscopic images.

It covers 7 diagnostic categories of pigmented lesions (not all of them are cancerous), which are listed below:
- Melanocytic nevi
- Melanoma
- Benign keratosis-like lesions
- Basal cell carcinoma
- Actinic keratoses
- Vascular lesions
- Dermatofibroma


In [None]:
# Importing required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnull
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from glob import glob
import os

# Importing required libraries for neural network training
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
import tensorflow as tf
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix
import itertools

In [None]:
# Initialize Spark session
spark = SparkSession.builder \
    .appName("Skin Cancer MNIST") \
    .getOrCreate()

`HAM10000_metadata.csv` is the main CSV file. It holds the metadata for all training images, with the following columns:

1. `lesion_id`
2. `image_id`
3. `dx`
4. `dx_type`
5. `age`
6. `sex`
7. `localization`


In [None]:
# Reading the data from HAM10000_metadata.csv
df = spark.read.csv('../input/skin-cancer-mnist-ham10000/HAM10000_metadata.csv', header=True, inferSchema=True)

In [None]:
# Display the first few rows of the DataFrame
df.show(5)

In [None]:
# Display the data types of each column
df.printSchema()

General summary statistics for the numerical values in the dataset (here: `age`)

In [None]:
# Display summary statistics for numerical columns
df.describe().show()

## Data Imputation



In [None]:
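# Count the null values in each column.
# Spark's sum is used here; Python's built-in sum() cannot aggregate a column name.
from pyspark.sql import functions as F
df.select([col(c).isNull().cast("int").alias(c) for c in df.columns]).agg(*[F.sum(c).alias(c) for c in df.columns]).show()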

pdf = df.toPandas()
sns.displot(pdf['age'])
plt.show()

In [None]:
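# Fill missing ages with the median age (computed on the pandas copy)
median_age = float(pdf['age'].median())
df = df.withColumn('age', when(isnull(col('age')), median_age).otherwise(col('age')))
# Refresh the pandas copy so it contains the imputed ages
pdf = df.toPandas()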

In [None]:
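# Re-check the null counts after imputation (Spark's sum, not Python's built-in)
from pyspark.sql import functions as F
df.select([col(c).isNull().cast("int").alias(c) for c in df.columns]).agg(*[F.sum(c).alias(c) for c in df.columns]).show()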
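
The imputation step above can be illustrated on a tiny example. This is a minimal sketch in plain pandas, assuming a hypothetical toy column in place of the real `age` column; it is not the notebook's Spark code.

```python
import pandas as pd

# Hypothetical toy column standing in for the 'age' column.
ages = pd.Series([30.0, None, 50.0, 70.0, None], name="age")

# The median skips missing values by default.
median_age = ages.median()

# Replace every missing entry with the median.
imputed = ages.fillna(median_age)

print(median_age)
print(imputed.tolist())
```

The median is less sensitive to extreme ages than the mean, which is one reason it is often chosen for this kind of fill.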

In [None]:
# Lesion type dictionary
lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}

# Base directory for skin images
base_skin_dir = '../input/skin-cancer-mnist-ham10000'

# Merge images from both folders into one dictionary
imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x
                     for x in glob(os.path.join(base_skin_dir, '*', '*.jpg'))}

In [None]:
pdf['path'] = pdf['image_id'].map(imageid_path_dict.get)
pdf['cell_type'] = pdf['dx'].map(lesion_type_dict.get)
pdf['cell_type_idx'] = pd.Categorical(pdf['cell_type']).codes


## Image Preprocessing

The images are resized because the original dimensions of 450 × 600 × 3 take a long time to process in a neural network. Here they are scaled down to 100 × 125 × 3.

In [None]:
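from PIL import Image

# PIL's resize takes (width, height), so each resulting array has shape (100, 125, 3)
pdf['image'] = pdf['path'].map(lambda x: np.asarray(Image.open(x).resize((125, 100))))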
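
The order of the two numbers passed to `resize` is easy to get wrong. The sketch below uses a blank stand-in image (an assumption, not a dataset file) to show that PIL expects `(width, height)` while the NumPy array comes out as `(height, width, channels)`.

```python
import numpy as np
from PIL import Image

# Blank stand-in image with the original width 600 and height 450.
img = Image.new("RGB", (600, 450))

# resize() expects (width, height) ...
small = img.resize((125, 100))

# ... while the NumPy array is laid out as (height, width, channels).
arr = np.asarray(small)

print(small.size)  # (125, 100)
print(arr.shape)   # (100, 125, 3)
```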

In [None]:
n_samples = 5

# Create a plot with 7 rows and n_samples columns
fig, m_axs = plt.subplots(7, n_samples, figsize=(4 * n_samples, 3 * 7))

# Group by cell type and plot images
for n_axs, (type_name, type_rows) in zip(m_axs, pdf.sort_values(['cell_type']).groupby('cell_type')):
    n_axs[0].set_title(type_name)
    for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=2018).iterrows()):
        c_ax.imshow(c_row['image'])
        c_ax.axis('off')

# Save the figure
fig.savefig('category_samples.png', dpi=300)

In [None]:
# Check image size distribution
image_size_counts = pdf['image'].map(lambda x: x.shape).value_counts()
print(image_size_counts)

## **Exploratory Data Analysis**
Exploratory data analysis can help detect obvious errors, identify outliers in datasets, understand relationships, unearth important factors, find patterns within data, and provide new insights.

### UNIVARIATE ANALYSIS

In [None]:
plt.figure(figsize=(20,10))
plt.subplots_adjust(left=0.125, bottom=1, right=0.9, top=2, hspace=0.2)
plt.subplot(2,4,1)
plt.title("AGE",fontsize=15)
plt.ylabel("Count")
pdf['age'].value_counts().plot.bar()

plt.subplot(2,4,2)
plt.title("GENDER",fontsize=15)
plt.ylabel("Count")
pdf['sex'].value_counts().plot.bar()

plt.subplot(2,4,3)
plt.title("localization",fontsize=15)
plt.ylabel("Count")
plt.xticks(rotation=45)
pdf['localization'].value_counts().plot.bar()

plt.subplot(2,4,4)
plt.title("CELL TYPE",fontsize=15)
plt.ylabel("Count")
pdf['cell_type'].value_counts().plot.bar()

1. Lesion counts peak at around age 45 and are lowest for ages 10 and below. The counts also suggest that skin lesions become more common as age increases.
2. Skin lesions are recorded more often in men than in women or the remaining gender category.
3. Lesions appear most often on the back and least often on acral surfaces (such as limbs, fingers, or ears).
4. The most common diagnosis is Melanocytic nevi, while the least common is Dermatofibroma.

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
pdf['dx'].value_counts().plot.pie(autopct="%1.1f%%")
plt.subplot(1,2,2)
pdf['dx_type'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

1. Type of skin disease:
    * nv: Melanocytic nevi - 66.9%
    * mel: Melanoma - 11.1%
    * bkl: Benign keratosis-like lesions - 11.0%
    * bcc: Basal cell carcinoma - 5.1%
    * akiec: Actinic keratoses - 3.3%
    * vasc: Vascular lesions - 1.4%
    * df: Dermatofibroma - 1.1%

2. How the diagnosis was established:
    * histo: histopathology - 53.3%
    * follow_up: follow-up examination - 37.0%
    * consensus: expert consensus - 9.0%
    * confocal: confirmation by in-vivo confocal microscopy - 0.7%
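
Percentages like the ones listed above can also be read off as numbers instead of from a pie chart. A minimal sketch, assuming a hypothetical toy `dx` column (the labels are real codes, the counts are made up for illustration):

```python
import pandas as pd

# Hypothetical toy labels; the real column has 10015 entries.
dx = pd.Series(["nv"] * 6 + ["mel"] * 3 + ["df"], name="dx")

# normalize=True returns fractions; scale to percent and round to one decimal.
shares = dx.value_counts(normalize=True).mul(100).round(1)

print(shares)
```

Checking that the shares add up to roughly 100% is a quick sanity test for hand-copied figures.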

### BIVARIATE ANALYSIS

In [None]:
plt.figure(figsize=(25,10))
plt.title('LOCALIZATION VS GENDER',fontsize = 15)
sns.countplot(y='localization', hue='sex',data=pdf)

* The back is the most affected area overall, and more prominently so in men.
* Lesions on the lower extremity are more common in women.
* Some lesions have an unknown location; these occur in men, women, and the remaining gender category.
* Acral surfaces show the fewest cases, and those occur in men only. The other gender groups show no lesions there.


In [None]:
plt.figure(figsize=(25,10))
plt.title('LOCALIZATION VS CELL TYPE',fontsize = 15)
sns.countplot(y='localization', hue='cell_type',data=pdf)

* On the face, Benign keratosis-like lesions are the most common diagnosis.
* On all other body parts, Melanocytic nevi are the most common diagnosis.

In [None]:
plt.figure(figsize=(25,10))
plt.subplot(131)
plt.title('AGE VS CELL TYPE',fontsize = 15)
sns.countplot(y='age', hue='cell_type',data=pdf)
plt.subplot(132)
plt.title('GENDER VS CELL TYPE',fontsize = 15)
sns.countplot(y='sex', hue='cell_type',data=pdf)

1. People aged 0-75 are affected most by Melanocytic nevi, whereas people aged 80-90 are affected more by Benign keratosis-like lesions.

2. In every gender group, Melanocytic nevi are the most common diagnosis.
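
The counts behind grouped count plots like these can be tabulated directly, which makes comparisons between groups easier to verify. A minimal sketch with `pd.crosstab`, assuming a hypothetical five-row table rather than the real metadata:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the metadata file.
toy = pd.DataFrame({
    "localization": ["back", "back", "face", "back", "face"],
    "sex": ["male", "female", "male", "male", "female"],
})

# Rows: body site, columns: sex, cells: number of records.
table = pd.crosstab(toy["localization"], toy["sex"])

print(table)
```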

## **Training the model**

## ANN

Data splitting into features and labels


In [None]:
# Separate the target column ('cell_type_idx') from the features
features = pdf.drop(columns=['cell_type_idx'])
target = pdf['cell_type_idx']

In [None]:
features.head()

Data splitting into training and testing


In [None]:
# Split the data into training and testing sets
x_train_o, x_test_o, y_train_o, y_test_o = train_test_split(features, target, test_size=0.25, random_state=666)
print(pd.unique(x_train_o['cell_type'].values))


Normalization or Scaling

In [None]:
# Convert the 'image' column to numpy arrays
x_train = np.asarray(x_train_o['image'].tolist())
x_test = np.asarray(x_test_o['image'].tolist())

# Calculate mean and standard deviation on the training set only
x_train_mean = np.mean(x_train)
x_train_std = np.std(x_train)

# Normalize both sets with the training statistics, so the test set
# is scaled exactly like the data the model was trained on
x_train = (x_train - x_train_mean) / x_train_std
x_test = (x_test - x_train_mean) / x_train_std
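
A common convention for this kind of scaling is to compute the statistics on the training split and reuse them for any held-out data. The sketch below shows that convention on random stand-in arrays (the shapes and values are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for image batches: (samples, height, width, channels).
train = rng.uniform(0, 255, size=(8, 4, 5, 3))
test = rng.uniform(0, 255, size=(2, 4, 5, 3))

# Statistics come from the training split only.
mean, std = train.mean(), train.std()

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same shift and scale as training

print(train_scaled.mean(), train_scaled.std())
```

After scaling, the training split has mean close to 0 and standard deviation close to 1; the held-out split will be near those values but not exactly on them.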


Perform one-hot encoding on the labels

In [None]:
y_train = to_categorical(y_train_o, num_classes=7)
y_test = to_categorical(y_test_o, num_classes=7)
y_train_o.head()
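
To make the encoding concrete, here is a minimal NumPy-only sketch of one-hot encoding for 7 classes. It uses hypothetical labels and `np.eye` rather than `to_categorical`, so it illustrates the idea, not the Keras implementation.

```python
import numpy as np

# Hypothetical integer class labels in the range 0..6.
labels = np.array([4, 0, 6, 4])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(7, dtype="float32")[labels]

print(one_hot.shape)  # (4, 7)
print(one_hot[0])     # [0. 0. 0. 0. 1. 0. 0.]

# argmax along the class axis recovers the original labels.
recovered = one_hot.argmax(axis=1)
```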

Split the training data for validation

In [None]:
x_train, x_validate, y_train, y_validate = train_test_split(x_train, y_train, test_size=0.1, random_state=999)
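
The two splits leave three subsets whose sizes can be worked out by hand. This sketch assumes that scikit-learn rounds the held-out share up to a whole sample, which is how `train_test_split` is understood to behave for a float `test_size`.

```python
import math

n_total = 10015  # images in the dataset

# First split: 25% held out for testing (rounded up, by assumption).
n_test = math.ceil(0.25 * n_total)
n_train_full = n_total - n_test

# Second split: 10% of the remaining training data for validation.
n_val = math.ceil(0.1 * n_train_full)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)  # 6759 752 2504
```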

Reshape the data


In [None]:
# Reshape the data to the required dimensions
x_train = x_train.reshape(x_train.shape[0], *(100, 125, 3))
x_test = x_test.reshape(x_test.shape[0], *(100, 125, 3))
x_validate = x_validate.reshape(x_validate.shape[0], *(100, 125, 3))

# Flatten the images for Dense layer input
x_train = x_train.reshape(x_train.shape[0], 125 * 100 * 3)
x_test = x_test.reshape(x_test.shape[0], 125 * 100 * 3)
x_validate = x_validate.reshape(x_validate.shape[0], 125 * 100 * 3)

print(x_train.shape)
print(x_test.shape)
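
Flattening only changes the view of the data, not its values, so it can be undone later. A minimal sketch on a hypothetical two-image batch:

```python
import numpy as np

# Hypothetical batch of 2 images with shape (height, width, channels) = (100, 125, 3).
batch = np.arange(2 * 100 * 125 * 3, dtype=np.float32).reshape(2, 100, 125, 3)

# Flatten each image into one vector of 100 * 125 * 3 = 37500 values.
flat = batch.reshape(batch.shape[0], -1)

# Restore the image layout, e.g. for a convolutional model.
restored = flat.reshape(-1, 100, 125, 3)

print(flat.shape)      # (2, 37500)
print(restored.shape)  # (2, 100, 125, 3)
```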


Build and train the keras model

In [None]:
# Define the Keras model
model = Sequential()
model.add(Dense(units=64, kernel_initializer='uniform', activation='relu', input_dim=37500))
model.add(Dense(units=64, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=64, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=64, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=7, kernel_initializer='uniform', activation='softmax'))

# Define the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00075, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Compile the Keras model
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the Keras model on the dataset
history = model.fit(x_train, y_train, batch_size=10, epochs=18, validation_data=(x_validate, y_validate))

# Evaluate the model on the test set
accuracy = model.evaluate(x_test, y_test, verbose=1)[1]
print("Test: accuracy = ", accuracy * 100, "%")
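
The size of a fully connected network like this one can be estimated by hand: each Dense layer has one weight per input-output pair plus one bias per output. The helper `dense_params` below is a hypothetical name introduced for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one Dense layer."""
    return n_in * n_out + n_out

# Input size followed by the units of each Dense layer in the model above.
layer_sizes = [37500, 64, 64, 64, 64, 7]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [2400064, 4160, 4160, 4160, 455]
print(total)      # 2412999
```

Almost all parameters sit in the first layer, because every one of the 37500 input values connects to each of its 64 units.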


Plot the model architecture

In [None]:
# Import from tensorflow.keras to match the other imports in this notebook
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

## CNN


Define the CNN model

In [None]:
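# Define the CNN model
input_shape = (100, 125, 3)
num_classes = 7

# The ANN section flattened the images; restore the
# (height, width, channels) layout that Conv2D expects
x_train = x_train.reshape(-1, *input_shape)
x_test = x_test.reshape(-1, *input_shape)
x_validate = x_validate.reshape(-1, *input_shape)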

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='Same', input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='Same'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.16))

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='Same'))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='Same'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.20))

model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='Same'))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='Same'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(num_classes, activation='softmax'))

model.summary()
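
The shapes that `model.summary()` reports can be anticipated with a little arithmetic. This sketch assumes that the `same`-padded convolutions keep the spatial size and that each 2 × 2 max-pooling layer halves it, rounding down.

```python
# Input: (height, width, channels)
height, width, channels = 100, 125, 3

# Each block ends with a 2x2 max pool; the filter counts are 32, 32 and 64.
shapes = []
for filters in (32, 32, 64):
    height, width, channels = height // 2, width // 2, filters
    shapes.append((height, width, channels))

# Number of values fed into the first Dense layer after Flatten.
flatten_units = height * width * channels

print(shapes)         # [(50, 62, 32), (25, 31, 32), (12, 15, 64)]
print(flatten_units)  # 11520
```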


Compile the Model

In [None]:
# Define the optimizer
# epsilon and decay are left at their defaults; passing epsilon=None or decay
# can raise an error in recent Keras versions
optimizer = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, amsgrad=False)

# Compile the model
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=4, verbose=1, factor=0.5, min_lr=0.00001)
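
With `factor=0.5` and `min_lr=0.00001`, the annealer can only lower the learning rate a few times before it reaches the floor. The sketch below lists the possible values, assuming the rule new rate = max(old rate × factor, min_lr); each step is only taken after `patience` epochs without improvement in the monitored metric.

```python
lr, factor, min_lr = 1e-4, 0.5, 1e-5

# Collect every learning rate the schedule can pass through.
schedule = [lr]
while schedule[-1] > min_lr:
    schedule.append(max(schedule[-1] * factor, min_lr))

print(schedule)
```

The rate is halved three times and is then clamped to the minimum on the fourth reduction.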


Data Augmentation and Model Training

In [None]:
# With data augmentation to prevent overfitting
datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.12,
    height_shift_range=0.12,
    horizontal_flip=True,
    vertical_flip=True
)

datagen.fit(x_train)

# Fit the model
epochs = 60
batch_size = 64
history = model.fit(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=epochs, validation_data=(x_validate, y_validate),
                    verbose=1, steps_per_epoch=x_train.shape[0] // batch_size,
                    callbacks=[learning_rate_reduction])


Evaluate the Model and Plot Results

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, verbose=1)
loss_v, accuracy_v = model.evaluate(x_validate, y_validate, verbose=1)
print(f"Validation: accuracy = {accuracy_v*100:.2f}%  ;  loss_v = {loss_v:.4f}")
print(f"Test: accuracy = {accuracy*100:.2f}%  ;  loss = {loss:.4f}")

# Save the model
model.save("model.h5")


Plot the Model Architecture

In [None]:
# Import from tensorflow.keras to match the other imports in this notebook
from tensorflow.keras.utils import plot_model
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

Confusion Matrix and Error Analysis

In [None]:
# Function to plot a confusion matrix
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    # Normalize each row before plotting so the colors and the cell text agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell value in white on dark cells and in black on light cells
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict the values from the validation dataset
Y_pred = model.predict(x_validate)
Y_pred_classes = np.argmax(Y_pred, axis=1)
Y_true = np.argmax(y_validate, axis=1)
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
plt.figure()  # new figure so the plots do not draw over each other
plot_confusion_matrix(confusion_mtx, classes=range(7), title='Confusion matrix (validation)')
plt.show()

# Predict the values from the test dataset
Y_pred = model.predict(x_test)
Y_pred_classes = np.argmax(Y_pred, axis=1)
Y_true = np.argmax(y_test, axis=1)
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
plt.figure()
plot_confusion_matrix(confusion_mtx, classes=range(7), title='Confusion matrix (test)')
plt.show()

# Error analysis on the test set: fraction of each true class that was misclassified
label_frac_error = 1 - np.diag(confusion_mtx) / np.sum(confusion_mtx, axis=1)
plt.figure()
plt.bar(np.arange(7), label_frac_error)
plt.xlabel('True Label')
plt.ylabel('Fraction classified incorrectly')
plt.show()
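
The per-class error fraction is easier to follow on a small example. The sketch below uses a hypothetical 3-class confusion matrix (rows are true labels, columns are predictions); the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [8, 2, 0],
    [1, 4, 5],
    [0, 0, 10],
])

# Correct predictions sit on the diagonal; each row sums to the class size.
frac_error = 1 - np.diag(cm) / cm.sum(axis=1)

# Index of the class that is misclassified most often.
worst_class = int(np.argmax(frac_error))

print(frac_error)   # [0.2 0.6 0. ]
print(worst_class)  # 1
```

With a strongly imbalanced dataset, this per-class view shows weaknesses that overall accuracy can hide.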
