# Introduction
Skin cancer is the world’s most common cancer, striking one in five people by age 70. Unlike cancers that develop inside the body, skin cancer forms on the surface of the skin. Most cases are curable if they are diagnosed early enough. Identifying what a lesion represents is therefore vital to patient outcomes: the 5-year survival rate drops from 99% to 14% depending on the stage at detection[3]. Melanoma, the most serious type of skin cancer, occurs when melanocytes, the pigment-producing cells that give color to the skin, become cancerous.

The International Skin Imaging Collaboration released the largest skin image analysis challenge, which targets the automated diagnosis of pigmented lesions for melanoma detection. We used this dataset to test multiple classifiers and find the best method for detecting skin cancer lesions.




![dataset-cover.png](attachment:dataset-cover.png)

# Data

### Data Collection
The dataset was originally retrieved from Harvard University’s Dataverse repository and consists of 10,015 dermatoscopic images[5] of pigmented lesions. Dermatoscopic images eliminate surface reflection, allowing scientists to look deeper into the layers of the skin. The dataset also comes with seven metadata fields: lesion, image, diagnosis, diagnosis method, age, gender, and location of the lesion. It includes all the representative categories of skin lesions, such as benign, pre-cancerous (actinic keratosis), low-risk (basal cell carcinoma), and malignant (melanoma) forms.

### Data Preprocessing
For the CNN and MobileNet architectures, we used Keras’s built-in `ImageDataGenerator` class, which let us randomly shear, zoom, and horizontally flip the images. This data augmentation is intended to reduce overfitting and help the models generalize beyond the training set. Specifically, we used a shear range of [0, 0.2] and a zoom range of [0, 0.2].

The data came in two separate parts, a CSV file and a set of images. Our preprocessing consisted of 5 steps:

1. Load the CSV and image data into two separate variables.
2. Separate the labels from the rest of the features in the CSV. This gives us our ground truth and our dataset 1.
3. We also wanted to try training on the pixels, so we append the pixel values to the rest of the features and save the result as our dataset 2.
4. For both datasets, convert the string-formatted data into unique integers.
5. Drop the IDs.
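The metadata steps above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`lesion_id`, `image_id`, `dx`, and so on), the tiny in-memory table standing in for the CSV, and the helper `encode_strings` are all assumptions. Step 3 (appending pixel values) is left out to keep the example short.

```python
import pandas as pd

# Tiny stand-in for the metadata CSV; column names and values are assumed.
metadata = pd.DataFrame({
    "lesion_id": ["L0", "L0", "L1", "L2"],
    "image_id": ["I0", "I1", "I2", "I3"],
    "dx": ["nv", "nv", "mel", "bcc"],
    "dx_type": ["histo", "histo", "follow_up", "consensus"],
    "age": [45.0, 45.0, 60.0, 70.0],
    "sex": ["male", "male", "female", "female"],
    "localization": ["back", "back", "face", "scalp"],
})


def encode_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace every non-numeric column with unique integer codes (step 4)."""
    encoded = frame.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # factorize returns (integer codes, unique values)
            encoded[column] = pd.factorize(encoded[column])[0]
    return encoded


# Step 2: separate the labels (ground truth) from the remaining features.
labels, label_names = pd.factorize(metadata["dx"])
features = metadata.drop(columns=["dx"])

# Step 5: drop the identifier columns (done before encoding for brevity).
features = features.drop(columns=["lesion_id", "image_id"])

# Step 4: convert string-valued features into integers.
dataset1 = encode_strings(features)
print(dataset1)
```

Encoding categories as arbitrary integers imposes an ordering that tree-based models tolerate well; for distance-based models such as SVMs, one-hot encoding may be a better fit.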

We also tried cleaning up our data because there was a disproportionately large number of “nv” (melanocytic nevus) labels, and our classifiers appeared to be learning only that class. To address this, we tried two approaches:

1. Downsample the points with nv labels: We randomly selected about 5,500 points labeled nv and dropped them from the dataset. This gave a more even class distribution, but it also left less data to train and test on.
2. Remove the labels with fewer samples: We wanted to see how accurate the classifier would be with fewer labels, so that it would have an easier time distinguishing the remaining classes from nv. This meant we couldn’t properly classify approximately 4 other lesion types, but it produced a slightly better confusion matrix, as seen below.
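The two rebalancing approaches above could look roughly like this. It is a sketch under assumptions: the helper names `downsample_class` and `keep_frequent_classes`, the label column `dx`, and the synthetic class counts are all hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def downsample_class(frame, label_column, label, n_drop, seed=0):
    """Randomly drop up to `n_drop` rows whose label equals `label`."""
    candidates = frame.index[frame[label_column] == label].to_numpy()
    n_drop = min(n_drop, len(candidates))
    rng = np.random.default_rng(seed)
    to_drop = rng.choice(candidates, size=n_drop, replace=False)
    return frame.drop(index=to_drop)


def keep_frequent_classes(frame, label_column, min_count):
    """Keep only rows whose label occurs at least `min_count` times."""
    counts = frame[label_column].value_counts()
    frequent = counts.index[counts >= min_count]
    return frame[frame[label_column].isin(frequent)]


# Synthetic, imbalanced labels (the counts are made up for illustration).
toy = pd.DataFrame({"dx": ["nv"] * 80 + ["mel"] * 12 + ["bkl"] * 6 + ["df"] * 2})

# Approach 1: drop a random subset of the dominant class.
balanced = downsample_class(toy, "dx", "nv", n_drop=60)

# Approach 2: discard the classes with too few samples.
reduced = keep_frequent_classes(toy, "dx", min_count=10)

print(balanced["dx"].value_counts().to_dict())
print(reduced["dx"].value_counts().to_dict())
```

Fixing the random seed keeps the downsampled subset reproducible, which makes results comparable across classifiers.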



# Methods
We applied unsupervised, supervised, and transfer learning methods to images from the “Skin Lesion Analysis Toward Melanoma Detection 2018”[4] dataset.
 
### Unsupervised Learning
We used K-Means as a naive initial method of partitioning the data into a fixed number of clusters, and compared these clusters to our desired classifications. We also used PCA to better understand the dimensionality of the data.

The images we use to train our classifiers have dimensions (pixels) on the order of thousands. To visualise the different clusters on a 2-D plane, we used PCA to reduce the dimensionality of our data. We plotted the first two principal components and identified clusters based on their true labels.
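As a rough illustration of this workflow, the sketch below runs K-Means and PCA with scikit-learn on synthetic data standing in for flattened images. The data, the cluster count, and the use of the adjusted Rand index to compare clusters with true labels are assumptions made for the example, not the settings used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images: 3 classes, 300 samples, 1,200 "pixels".
n_classes, n_per_class, n_pixels = 3, 100, 1200
centers = rng.normal(scale=5.0, size=(n_classes, n_pixels))
X = np.vstack([c + rng.normal(size=(n_per_class, n_pixels)) for c in centers])
y_true = np.repeat(np.arange(n_classes), n_per_class)

# Partition the data into a fixed number of clusters.
kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Compare the clusters with the true labels (1.0 means identical partitions).
ari = adjusted_rand_score(y_true, cluster_ids)

# Project onto the first two principal components for a 2-D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(f"adjusted Rand index: {ari:.2f}")
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Plotting `X_2d[:, 0]` against `X_2d[:, 1]`, colored once by `cluster_ids` and once by `y_true`, gives the kind of side-by-side comparison shown in the Results section.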

### Supervised Learning
We compared outcomes between three different methods: Support Vector Machine, Convolutional Neural Network, and Random Forest Classifier. Support Vector Machines have been shown to generalize well to large image datasets[2]; Convolutional Neural Networks allow for localized information sharing, taking advantage of images’ spatial layout; and decision trees, the building blocks of random forests, are a widely used classification method[1].
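A comparison of this kind can be set up along the following lines. The sketch covers only the two scikit-learn models (the CNN is omitted), and it uses a synthetic dataset and default-like hyperparameters that are assumptions for illustration, not our actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the lesion features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are placeholders, not tuned values.
models = {
    "svm": SVC(kernel="rbf", C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Using one shared, stratified split keeps the comparison fair; with imbalanced classes, a confusion matrix is more informative than accuracy alone.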

### Transfer Learning
To improve accuracy, we looked at what other competitors in the Kaggle competition were doing. One of them used transfer learning, where most of a different, pre-trained model is reused and its final layers are replaced and trained on the new dataset. In our case, we used the MobileNet classifier and added a final 7-node layer for classification. The idea is that the network has already learned general image features whose representation can be “transferred” to the new classification problem.
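One way to express this idea in Keras is sketched below. It is an assumption-laden illustration rather than our exact model: the function name `build_transfer_model`, the 224×224 input size, the global average pooling layer, the frozen base, and the optimizer are all choices made for the example.

```python
from tensorflow import keras


def build_transfer_model(n_classes=7, input_shape=(224, 224, 3), weights="imagenet"):
    """MobileNet feature extractor with a new softmax classification head."""
    # Pre-trained convolutional base without its original classifier.
    base = keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=weights
    )
    base.trainable = False  # assumption: keep the pre-trained features fixed

    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = keras.layers.GlobalAveragePooling2D()(x)
    # New 7-node layer, one output per lesion class.
    outputs = keras.layers.Dense(n_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialization, which is handy for quick shape checks. Input images should be scaled with `keras.applications.mobilenet.preprocess_input` before training.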


# Results
### K-Means
<div>
<img src="attachment:Screen%20Shot%202019-11-15%20at%201.33.30%20AM.png" width="500" height="500"/>
<img src="attachment:Screen%20Shot%202019-11-15%20at%201.33.37%20AM.png" width="500" height="500"/>
</div>


### PCA
<div>
<img src="attachment:Screen%20Shot%202019-11-15%20at%201.41.48%20AM.png" width="500" height="500"/>
<img src="attachment:Screen%20Shot%202019-11-15%20at%201.41.58%20AM.png" width="500" height="500"/>
</div>


### LDA

### SVM

### Random Forest

### CNN