# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including SVMs, random forests, and gradient boosting if you feel up for the challenge) to solve an adapted version of Question 1 from the winter 2023 exam in applied machine learning.

## Introduction:

During the semester you have become very excited about the field of digital pathology, an area that is developing rapidly due to advancements in microscopy imaging hardware. These advancements have made it possible to digitize glass slides into whole-slide images. You have recently read the paper by [Veeling et al. (2018)](https://arxiv.org/abs/1806.03962) and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. Like Veeling et al. (2018), you are primarily interested in developing machine learning models that, based on patches of whole-slide images of lymph node sections, can assist pathologists in tumor detection.

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. For this assignment, however, you will only use the subset of the data that has been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep learning methods to perform image classification (tumor detection). Consider, among other things, the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b


# Hints to get you started (with a very simple model)

In [3]:
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

import tensorflow as tf

from sklearn.preprocessing import StandardScaler

Define a function that converts an image to grayscale, resizes it, and flattens it. This function might also come in handy (for deep learning later) if the original images are too large for your hardware configuration.

In [2]:
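def convert_sample(image):
    # Convert the RGB image to a single grayscale channel
    image = tf.image.rgb_to_grayscale(image)
    # Downsample to 32x32 pixels and convert the tensor to a NumPy array
    image = tf.image.resize(image,[32,32]).numpy()
    # Flatten to a single row of 32*32 = 1024 features
    image = image.reshape(1,-1)
    return image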
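To assess the importance of image resolution (item 5 in Question 1), it can help to make the target size a parameter. The following is a minimal sketch that uses only NumPy: the name `convert_sample_np` and its `size` argument are hypothetical, and block averaging is swapped in for the interpolation performed by `tf.image.resize`, so the pixel values will not match `convert_sample` exactly. A random array stands in for a real 96x96 RGB patch.

```python
import numpy as np

def convert_sample_np(image, size=32):
    """Grayscale, downsample by block averaging, and flatten one RGB image.

    Hypothetical NumPy-only stand-in for convert_sample. `size` must divide
    the image height and width evenly (for 96x96 input: 48, 32, 24, 16, ...).
    """
    image = np.asarray(image, dtype=np.float32)
    # Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)
    gray = image @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    h, w = gray.shape
    if h % size or w % size:
        raise ValueError("size must divide the image height and width")
    # Split the image into size x size blocks and average within each block
    blocks = gray.reshape(size, h // size, size, w // size)
    small = blocks.mean(axis=(1, 3))
    # Flatten to a single row of size*size features
    return small.reshape(1, -1)

# Random stand-in for one 96x96 RGB patch (assumption: not real PCam data)
rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

# The number of features grows quadratically with the chosen resolution
for size in (16, 32, 48):
    print(size, convert_sample_np(demo_image, size).shape)
```

Repeating the training and evaluation steps for several values of `size` gives a simple way to compare how each method responds to resolution.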

# Hint: resize image

In [2]:
X = np.load('/mnt/c/Users/cmd/Dropbox/Teaching/amlFall2023/assignments/Xtrain.npy')
X = np.vstack(list(map(convert_sample,X)))
# Fit the scaler on the training data only. with_mean=False keeps the original
# behavior (with_mean=0): scale to unit variance without centering
scaler = StandardScaler(with_mean=False, with_std=True)
X = scaler.fit_transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('/mnt/c/Users/cmd/Dropbox/Teaching/amlFall2023/assignments/ytrain.npy')
y = y.reshape(-1,)
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('/mnt/c/Users/cmd/Dropbox/Teaching/amlFall2023/assignments/Xtest.npy')
Xtest = np.vstack(list(map(convert_sample,Xtest)))
# Reuse the scaler fitted on the training data so both sets share the same scaling
Xtest = scaler.transform(Xtest)
print(f'Shape of test data features (observations,features): {Xtest.shape}')


The data is now ready for training and prediction with a shallow learning model such as an SVM classifier. Below is a very simple illustration of how to construct and train a support vector machine on the data we have prepared. The resulting prediction file can be submitted to Kaggle for evaluation.

In [None]:
import pandas as pd
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)
y_test_hat = clf.predict(Xtest)
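Before submitting, it is usually worth estimating accuracy locally on a held-out part of the training data, which is also where the imported `accuracy_score` comes in. The sketch below assumes a 75/25 stratified split and uses synthetic Gaussian data (`X_demo`, `y_demo`, both hypothetical) in place of the prepared `X` and `y`, so that it runs on its own.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared (X, y): two Gaussian blobs, 64 features
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 64)),  # class 0
    rng.normal(1.0, 1.0, size=(100, 64)),  # class 1
])
y_demo = np.array([0] * 100 + [1] * 100)

# Hold out 25% of the labelled data, keeping the class balance in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Train on the training part only and score on the held-out part
clf_demo = svm.SVC(kernel='rbf')
clf_demo.fit(X_tr, y_tr)
val_acc = accuracy_score(y_val, clf_demo.predict(X_val))
print(f'Validation accuracy: {val_acc:.3f}')
```

With the real data, replacing `X_demo` and `y_demo` by `X` and `y` gives a local estimate to compare against the Kaggle score.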


In [None]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('/mnt/c/Users/cmd/Dropbox/Teaching/amlFall2023/assignments/ytest_hat.csv', index=False)
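Beyond the SVM baseline, Question 1 also asks for random forests, boosting, and a combination of the methods. One possible way to combine them, sketched here under assumptions, is majority voting with scikit-learn's `VotingClassifier`. The hyperparameters (50 trees, 3-fold cross-validation) are arbitrary illustrative choices, and synthetic data (`X_demo`, `y_demo`, both hypothetical) again replaces the prepared `X` and `y`.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the prepared (X, y): two Gaussian blobs, 16 features
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 16)),  # class 0
    rng.normal(1.0, 1.0, size=(100, 16)),  # class 1
])
y_demo = np.array([0] * 100 + [1] * 100)

# Hard voting: each model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ('svm', SVC(kernel='rbf')),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting='hard',
)

# 3-fold cross-validated accuracy of the combined model
scores = cross_val_score(ensemble, X_demo, y_demo, cv=3, scoring='accuracy')
print(f'Mean CV accuracy: {scores.mean():.3f}')
```

Scoring each estimator on its own with the same `cross_val_score` call shows whether the combination actually improves on its individual members.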