***Andrew Plum***<br/>
***CS 474***<br/>
***11/30/2024***

## Skin Cancer Detection
#### Project description
Skin cancer is the most common form of cancer, globally accounting for at least 40% of cancer cases. People with lighter skin are at higher risk. There are three main types of skin cancers: basal-cell skin cancer (BCC), squamous-cell skin cancer (SCC), and melanoma. Globally in 2012, melanoma occurred in 232,000 people and resulted in 55,000 deaths. Between 20% and 30% of melanomas develop from moles.

#### Task
In this project, we will develop deep learning-based solutions to classify images of skin moles into benign or malignant categories.

#### Data
The training and test sets contain 2,637 and 660 images, respectively. The dataset comes from Kaggle at https://www.kaggle.com/fanconic/skin-cancer-malignant-vs-benign. The 'data' folder is structured as follows:

- data
    - train
        - benign
        - malignant
    - test
        - benign
        - malignant
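
As a quick sanity check on the folder layout above, the following minimal sketch counts the files in each class folder. The helper name `count_images` and the default root `"data"` are assumptions for illustration; they are not part of the course template.

```python
import os

def count_images(root="data"):
    """Return {(split, label): number of files} for the layout shown above."""
    counts = {}
    for split in ("train", "test"):
        for label in ("benign", "malignant"):
            folder = os.path.join(root, split, label)
            # A missing folder counts as zero instead of raising an error.
            n = len(os.listdir(folder)) if os.path.isdir(folder) else 0
            counts[(split, label)] = n
    return counts

if __name__ == "__main__":
    for (split, label), n in count_images().items():
        print(f"{split}/{label}: {n} files")
```

If the per-split totals do not add up to 2,637 and 660, the archive was probably unzipped into a different folder than the code expects.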

#### Code templates
- If you have a computer with ample RAM (>8GB) and a fast CPU, you can download the data (zip file) and work on your own computer. You need to unzip the data file manually.
- If you would like to use Google Colab, download the 'colab' template and upload the code file and the data zip file to your Google Drive; then open and edit the code in Google Colab. The 3rd code cell unzips the data file automatically.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'
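
The `ModuleNotFoundError` above is what happens when the Colab-only import runs on a local machine. One possible way to let the same cell run in both environments is sketched below. The helper `running_in_colab` is a hypothetical name introduced here, not something the template provides.

```python
def running_in_colab():
    """Return True only if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True

if running_in_colab():
    # Same two lines as the original cell, now guarded.
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("Not running in Colab; skipping Google Drive mount.")
```

`ModuleNotFoundError` is a subclass of `ImportError`, so the single `except` clause covers the error shown above.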

In [3]:
# Unzip the image set. You only need to run this cell once.
# You may need to change the folders below.
!unzip '/content/drive/MyDrive/Colab Notebooks/data.zip' -d '/content/drive/MyDrive/Colab Notebooks/'
print('unzipped the image set to /content/drive/MyDrive/Colab Notebooks/')

unzipped the image set to /content/drive/MyDrive/Colab Notebooks/


'unzip' is not recognized as an internal or external command,
operable program or batch file.
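
The shell error above appears because `unzip` is not installed by default on Windows. A portable alternative, sketched here with Python's standard `zipfile` module, avoids the shell command entirely. The function name `unzip_dataset` and the example paths are placeholders, not part of the template.

```python
import zipfile
from pathlib import Path

def unzip_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir using only the standard library."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest

# Example call (placeholder paths; adjust to where data.zip actually lives):
# unzip_dataset("data.zip", ".")
```

As with the original cell, this only needs to run once.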


In [82]:
import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras
#from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### 1. Data preparation

In [3]:
import os
from PIL import Image

#1.1 get image lists
train_b = 'data/train/benign'
train_m = 'data/train/malignant'

test_b = 'data/test/benign'
test_m = 'data/test/malignant'

def getImList(folder):
    # get the list of file names in 'folder'
    imList = os.listdir(folder) # get all names
    fPath = [os.path.join(folder, fName) for fName in imList] # add path to names
    
    return fPath

trainList_b = getImList(train_b)
trainList_m = getImList(train_m)
trainList = trainList_b + trainList_m
print("# training images:", len(trainList))

testList_b = getImList(test_b)
testList_m = getImList(test_m)
testList = testList_b + testList_m
print("# test images:", len(testList))

#1.2 load all images
read = lambda imName: np.asarray(Image.open(imName).convert("RGB"))
print('loading images ...')
X_train = [read(name) for name in trainList]
X_train = np.array(X_train, dtype='uint8')/255
print('.   training set shape:', X_train.shape)

X_test = [read(name) for name in testList]
X_test = np.array(X_test, dtype='uint8')/255
print('.   test set shape:', X_test.shape)

print('loading ended.')

# 1.3 Create target labels
y_benign_train = np.zeros(len(trainList_b))
y_malignant_train = np.ones(len(trainList_m))
y_train = np.concatenate((y_benign_train, y_malignant_train), axis = 0)
print('.    training target shape: ', y_train.shape)
y_benign_test = np.zeros(len(testList_b))
y_malignant_test = np.ones(len(testList_m))
y_test = np.concatenate((y_benign_test, y_malignant_test), axis = 0)
print('.    test target shape', y_test.shape)

# 1.4 Shuffle data
print('shuffling data ...')
s = np.arange(X_train.shape[0])
np.random.shuffle(s)
X_train = X_train[s]
y_train = y_train[s]

s = np.arange(X_test.shape[0])
np.random.shuffle(s)
X_test = X_test[s]
y_test = y_test[s]

print('Dataset is ready for using.')

# training images: 2637
# test images: 660
loading images ...
.   training set shape: (2637, 224, 224, 3)
.   test set shape: (660, 224, 224, 3)
loading ended.
.    training target shape:  (2637,)
.    test target shape (660,)
shuffling data ...
Dataset is ready for using.
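
Dividing a `uint8` NumPy array by 255 produces a `float64` array, so each pixel value takes 8 bytes once the images are scaled. The back-of-the-envelope sketch below estimates the memory footprint from the shapes printed above, and what it would be if the arrays were cast to `float32` instead. The cast is a suggestion, not something the notebook does, and `array_bytes` is a hypothetical helper.

```python
def array_bytes(shape, bytes_per_element):
    """Number of bytes needed for a dense array of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

train_shape = (2637, 224, 224, 3)  # from the output above
test_shape = (660, 224, 224, 3)    # from the output above
GIB = 1024 ** 3

for name, shape in (("train", train_shape), ("test", test_shape)):
    f64 = array_bytes(shape, 8)  # float64: 8 bytes per value
    f32 = array_bytes(shape, 4)  # float32: 4 bytes per value
    print(f"{name}: float64 {f64 / GIB:.2f} GiB, float32 {f32 / GIB:.2f} GiB")
```

This goes some way toward explaining the ">8GB RAM" recommendation: the training set alone is close to 3 GiB in `float64`, before any temporary copies made while shuffling.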


### Familiarizing Myself With the Data

In [12]:
X_train[0]

array([[[0.91764706, 0.65490196, 0.63137255],
        [0.9254902 , 0.65882353, 0.64705882],
        [0.92156863, 0.67058824, 0.6627451 ],
        ...,
        [0.9372549 , 0.72941176, 0.70588235],
        [0.92941176, 0.72156863, 0.69803922],
        [0.92941176, 0.72156863, 0.70588235]],

       [[0.90980392, 0.64705882, 0.62352941],
        [0.90980392, 0.65490196, 0.63921569],
        [0.9254902 , 0.65882353, 0.65490196],
        ...,
        [0.9254902 , 0.72941176, 0.70196078],
        [0.92156863, 0.7254902 , 0.69803922],
        [0.94117647, 0.72941176, 0.72156863]],

       [[0.91764706, 0.65490196, 0.63137255],
        [0.90588235, 0.65098039, 0.62745098],
        [0.91372549, 0.65098039, 0.62745098],
        ...,
        [0.92941176, 0.72941176, 0.71372549],
        [0.93333333, 0.7254902 , 0.70980392],
        [0.91764706, 0.71764706, 0.70196078]],

       ...,

       [[0.89803922, 0.69019608, 0.65882353],
        [0.88627451, 0.6745098 , 0.62745098],
        [0.89019608, 0

In [43]:
print("Blue channel (third RGB value) of the pixel in the fifth row, fourth column of the first image")
X_train[0][4][3][2]

Blue channel (third RGB value) of the pixel in the fifth row, fourth column of the first image


0.6235294117647059

### Create Validation Data From Training Data

In [None]:
###X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state = 0)
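
The split above is commented out, so the model is trained without a validation set. For reference, here is a minimal sketch of a stratified split that keeps the benign/malignant ratio the same in both parts. It is a standard-library stand-in for `train_test_split(..., stratify=y_train)`, not the notebook's own method. The function name `stratified_split_indices` and the 20% validation fraction are assumptions.

```python
import random

def stratified_split_indices(labels, val_fraction=0.2, seed=0):
    """Split indices into train/validation lists, preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, val_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx

# Toy labels standing in for y_train: 60 benign (0) and 40 malignant (1).
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split_indices(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

The returned index lists can be used directly on NumPy arrays, for example `X_val = X_train[val_idx]` and `y_val = y_train[val_idx]`.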

### Build Model

In [76]:
CNN_model = keras.models.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

CNN_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_9 (Conv2D)           (None, 222, 222, 32)      896       
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 111, 111, 32)     0         
 2D)                                                             
                                                                 
 conv2d_10 (Conv2D)          (None, 109, 109, 64)      18496     
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 54, 54, 64)       0         
 2D)                                                             
                                                                 
 conv2d_11 (Conv2D)          (None, 52, 52, 64)        36928     
                                                                 
 flatten_3 (Flatten)         (None, 173056)           
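
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used by the model: 3x3 convolutions with 'valid' padding and stride 1, and 2x2 max pooling with a stride equal to the pool size. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the image.
    return size - kernel + 1

def pool_out(size, pool=2):
    # Non-overlapping pooling windows; any leftover row/column is dropped.
    return size // pool

def conv_params(kernel, in_channels, out_channels):
    # One kernel*kernel*in_channels weight block plus one bias per filter.
    return (kernel * kernel * in_channels + 1) * out_channels

size, channels = 224, 3
rows = []
for kind, filters in [("conv", 32), ("pool", None), ("conv", 64),
                      ("pool", None), ("conv", 64)]:
    if kind == "conv":
        params = conv_params(3, channels, filters)
        size, channels = conv_out(size), filters
    else:
        params = 0
        size = pool_out(size)
    rows.append((kind, (size, size, channels), params))

flat = size * size * channels  # length of the vector produced by Flatten

for kind, shape, params in rows:
    print(kind, shape, params)
print("flatten", flat)
```

The computed values match the rows shown in the summary: 896, 18496 and 36928 parameters for the three convolutions, and a flattened vector of length 173056.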

### Train the Model

In [78]:
#CNN_model_hist = CNN_model.fit(X_train, y_train, epochs = 3, verbose = 1)
CNN_model.fit(X_train, y_train, epochs = 3, verbose = 1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1f115dc6a00>

In [86]:
y_train_hat = CNN_model.predict(X_train)
y_train_hat = (0.5 < y_train_hat).astype(int)

y_test_hat = CNN_model.predict(X_test)
y_test_hat = (0.5 < y_test_hat).astype(int)

y_train_hat = y_train_hat.flatten()
y_test_hat = y_test_hat.flatten()

train_accuracy = accuracy_score(y_train, y_train_hat)
test_accuracy = accuracy_score(y_test, y_test_hat)

print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
print()

# Cross-check: recompute both accuracies by hand; they should match accuracy_score.
train_accuracy = np.mean(y_train == y_train_hat)
test_accuracy = np.mean(y_test == y_test_hat)

print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
print()

Train accuracy: 0.7986348122866894
Test accuracy: 0.7863636363636364

Train accuracy: 0.7986348122866894
Test accuracy: 0.7863636363636364
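
Accuracy alone does not show how many malignant moles the model misses. As a possible follow-up, the sketch below computes sensitivity (recall on the malignant class) and specificity from 0/1 labels. The function names and the toy labels are illustrative only; they are not results from this model.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 (malignant) as positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def sensitivity_specificity(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy example, not model output.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
print(sensitivity_specificity(y_true, y_pred))
```

The same functions accept the flattened arrays from the cell above, for example `sensitivity_specificity(y_test, y_test_hat)`.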

