# Coursework 2: Clinical image classification

In this coursework, we will work on a clinical imaging dataset. The dataset comes from the ISIC (International Skin Imaging Collaboration) 2016 Challenge. It consists of 900 skin lesion images, categorised into two classes, malignant (melanoma) and benign, as shown here.

![](melanoma_vs_benign.jpg)

To facilitate this coursework, we have pre-processed the images for you: the original images have been resized to a standard size of 64 by 64 and split into a training set and a test set, so you only need to focus on the visualisation and classification work. Your task is to read the imaging data and train a model that classifies images as benign or malignant.

Hint: This coursework is similar to the handwritten digit classification problem you have just learnt in class. It is a transition from data management to data analysis. You will use a machine learning library, sklearn, and we have provided guidance in this coursework. We will assess whether the whole pipeline is implemented properly, not the classification accuracy. Just have fun!

Next term, you will learn more about machine learning, what each of the models mentioned here does, and how to improve classification performance.

```python
# Load the libraries (provided)
import os
import imageio
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn
from sklearn import neighbors, ensemble, svm, metrics
```

## 1. Load and visualise data.

#### 1.1 Load the spreadsheets `training.csv` and `test.csv`. Print the first few lines. Check how many benign and malignant cases there are in training and test sets. (15 points)
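
As a rough illustration of the pandas calls involved, the sketch below reads a small in-memory table instead of the real spreadsheet. The column names `image_id` and `label`, and the values in the table, are hypothetical; check the actual header of `training.csv` before reusing any of this.

```python
import io
import pandas as pd

# Hypothetical stand-in for training.csv; the real column names may differ.
demo_csv = io.StringIO(
    "image_id,label\n"
    "img_0001,benign\n"
    "img_0002,malignant\n"
    "img_0003,benign\n"
    "img_0004,benign\n"
)

# For the real data this would be pd.read_csv("training.csv").
df_demo = pd.read_csv(demo_csv)

# Show the first few rows of the table.
print(df_demo.head())

# Count the rows per class in the (assumed) "label" column.
class_counts = df_demo["label"].value_counts()
print(class_counts)
```

The same two calls, `head()` and `value_counts()`, would apply to the test spreadsheet.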

#### 1.2 From the training image set, load five malignant cases and display their images. (10 points)

The images are packed in compressed archives with the suffix `.tar.gz`. You need to decompress them first.

Hint: You can use the library [imageio](https://imageio.github.io/) to read images.
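
One possible way to decompress a `.tar.gz` archive from Python is the standard-library `tarfile` module. The sketch below builds a throwaway archive in a temporary directory and extracts it again; the helper `extract_archive` and all file names are hypothetical and only serve the demonstration.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Decompress a .tar.gz archive into dest_dir and return the extracted file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
        names = [m.name for m in tar.getmembers() if m.isfile()]
    return [os.path.join(dest_dir, n) for n in names]


# Build a tiny demo archive (a text file standing in for an image).
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "placeholder.txt")
with open(src_path, "w") as f:
    f.write("stand-in for an image file")

archive_path = os.path.join(work_dir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(src_path, arcname="demo/placeholder.txt")

# Extract it again and list what came out.
extracted = extract_archive(archive_path, os.path.join(work_dir, "out"))
print(extracted)
```

Once the real archives are extracted, the individual image files can be read with imageio as the hint suggests.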

#### 1.3 From the training image set, load five benign cases and display their images. (10 points)

#### 1.4 Pick one image and show the three channels (R-G-B) of the image separately. (10 points)

Hint: An RGB image is an array of size X x Y x C, where C = 3 and the three channels are red, green and blue, in that order. You can use `Reds_r`, `Greens_r`, `Blues_r` as the cmap (colormap) for each channel when you plot the images.
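
The indexing described in the hint can be sketched as follows. The image here is random noise generated with NumPy, purely as a stand-in for a real lesion image, so the plots only demonstrate the slicing and the colormaps.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 64 x 64 RGB image standing in for a real lesion image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

cmaps = ["Reds_r", "Greens_r", "Blues_r"]
titles = ["Red", "Green", "Blue"]

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for c, ax in enumerate(axes):
    channel = img[:, :, c]  # 2-D array of shape (64, 64) for channel c
    ax.imshow(channel, cmap=cmaps[c])
    ax.set_title(titles[c])
    ax.axis("off")
```

Slicing with `img[:, :, c]` drops the last axis, which is why each channel can be passed to `imshow` as a single-channel image with its own colormap.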

#### 1.5 Show the intensity histogram of the image. (5 points)

Hint: Flatten the image to a vector, then use the appropriate histogram function in Python.
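
A minimal sketch of the flatten-then-histogram idea, again on a synthetic image rather than the real data. The choice of 32 bins is arbitrary and just an assumption for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 64 x 64 RGB image standing in for a real lesion image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Flatten all pixels of all channels into one 1-D vector.
pixels = img.ravel()

# Plot the histogram of intensities over the 8-bit range.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(pixels, bins=32, range=(0, 256))
ax.set_xlabel("Intensity")
ax.set_ylabel("Pixel count")
```

Here `pixels` has 64 * 64 * 3 = 12288 entries, and the bin counts add up to the same number.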

## 2. Analyse data.

Here, you need to train a classification model on the training set, then apply it to the test set. Most [sklearn](https://scikit-learn.org) classifiers require two inputs for training a model: the features X and the labels y.

X: N x M matrix, N denoting the number of samples, M denoting the number of features for each sample.

y: vector of length N (for sklearn, a 1-D array of shape `(N,)`), each element recording the label of one sample.
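
To make the shapes concrete, here is a small sketch that turns a stack of images into a feature matrix by flattening each image into one row. The image stack and the labels are synthetic, and the encoding 0 = benign, 1 = malignant is an assumption made for the example.

```python
import numpy as np

# Synthetic stack of 10 RGB images of size 64 x 64 (stand-in for real data).
rng = np.random.default_rng(0)
n_samples = 10
images = rng.integers(0, 256, size=(n_samples, 64, 64, 3), dtype=np.uint8)

# Hypothetical label encoding: 0 = benign, 1 = malignant.
labels = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Flatten each image into one row: M = 64 * 64 * 3 = 12288 features.
X_demo = images.reshape(n_samples, -1)
y_demo = labels

print(X_demo.shape)  # (10, 12288)
print(y_demo.shape)  # (10,)
```

Using every pixel value as a feature is only one simple option; it keeps the example short and mirrors the digit classification exercise.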

#### 2.1 Prepare the training data according to the above description, using variable names X_train and y_train. Print out the shapes, i.e. dimensions, of X_train and y_train. (10 points)

#### 2.2 Similarly, prepare the test data accordingly, using variable names X_test and y_test. Print out the shapes of X_test and y_test. (10 points)

#### 2.3 Train a classification model on X_train and y_train. (10 points)

Hint: You only need to train one model. You can use any classification model supported by sklearn, including

* [K nearest neighbour classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
* [Random forest classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Support vector machine classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
* or any other models

#### 2.4 Apply the model to the test set X_test to predict the labels. (5 points)
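
The fit-then-predict pattern shared by sklearn classifiers can be sketched as below, here with a K nearest neighbour classifier on random data. The data, the labels and the choice of `n_neighbors=3` are assumptions for illustration, so the predictions themselves carry no meaning.

```python
import numpy as np
from sklearn import neighbors

# Synthetic features and labels (0 = benign, 1 = malignant is assumed).
rng = np.random.default_rng(0)
X_demo = rng.random((20, 12288))
y_demo = np.array([0, 1] * 10)

# Train: fit() learns from the feature matrix and the label vector.
model = neighbors.KNeighborsClassifier(n_neighbors=3)
model.fit(X_demo, y_demo)

# Predict: predict() returns one label per row of the input matrix.
pred = model.predict(X_demo[:5])
print(pred.shape)
```

Other sklearn classifiers, such as a random forest or a support vector machine, expose the same `fit` and `predict` methods, so the model can be swapped without changing the rest of the code.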

#### 2.5 Display the [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) of the prediction results. (5 points)

Hint: You can find an [implementation](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) in sklearn.
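
As a small worked example, the sketch below calls `metrics.confusion_matrix` on hand-written label vectors. The labels are made up, and the encoding 0 = benign, 1 = malignant is an assumption.

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predictions (0 = benign, 1 = malignant is assumed).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[4 1]
#  [1 2]]
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read when the classes are later given names.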

#### 2.6 Explain how many malignant cases there are in the test set and how many are correctly predicted. (10 points)
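
One way to read these two numbers off a confusion matrix is shown below, assuming the sklearn layout (rows are true classes, columns are predicted classes) and assuming that index 1 is the malignant class. The matrix values are invented for the example.

```python
import numpy as np

# Invented confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[4, 1],
               [1, 2]])

malignant_index = 1  # assumed position of the malignant class

# Total malignant cases: sum of the malignant row.
n_malignant = cm[malignant_index].sum()

# Correctly predicted malignant cases: the diagonal entry of that row.
n_correct = cm[malignant_index, malignant_index]

print(f"{n_correct} of {n_malignant} malignant cases correctly predicted")
# 2 of 3 malignant cases correctly predicted
```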

#### Survey: How long did it take you to complete this coursework?