# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including SVMs, random forests and, if you feel up for the challenge, gradient boosting) to solve an adapted version of Question 1 of the winter 2023 exam in applied machine learning:

## Introduction:

During the semester you have become very excited about the field of digital pathology, an area that is developing rapidly thanks to advances in microscopy imaging hardware. These advances have made it possible to digitize glass slides into whole-slide images. You have recently read the paper by [Veeling et al. (2018)](https://arxiv.org/abs/1806.03962) and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. Like Veeling et al. (2018), you are primarily interested in developing machine learning models that, based on patches of whole-slide images of lymph node sections, can assist pathologists in tumor detection.

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96 px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. For this assignment, however, you are only going to use the subset of the data that has been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep learning to perform image classification (tumor detection). Consider among other things the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b
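
As a rough illustration of item 4 in the list above, the three model families can be combined with scikit-learn's `VotingClassifier`. This is only a sketch: the data is synthetic (`make_classification` stands in for the flattened image features), and the names `X_demo`, `y_demo` and `ensemble_clf` as well as all hyperparameters are assumptions chosen for illustration, not values from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# Soft voting averages predicted class probabilities,
# so the SVC must be created with probability=True.
ensemble_clf = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble_clf.fit(X_tr, y_tr)

# Accuracy on the held-out validation split.
val_accuracy = ensemble_clf.score(X_va, y_va)
print(f"Validation accuracy of the voting ensemble: {val_accuracy:.3f}")
```

Only the SVM is wrapped in a scaling pipeline here, since tree-based models are insensitive to monotone feature scaling.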


In [2]:
from sklearn import svm
from sklearn.metrics import accuracy_score 
import numpy as np

import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from sklearn import tree, ensemble

from pygame import mixer # type: ignore

import pandas as pd
import tqdm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

pygame 2.6.0 (SDL 2.28.4, Python 3.11.5)
Hello from the pygame community. https://www.pygame.org/contribute.html


In [3]:
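def convert_sample(image):
    """Turn one RGB patch into a flat (1, 1024) grayscale feature row."""
    image = tf.image.rgb_to_grayscale(image)          # (96, 96, 3) -> (96, 96, 1)
    image = tf.image.resize(image, [32, 32]).numpy()  # downsample to 32x32
    image = image.reshape(1, -1)                      # one row per image
    return image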
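
One possible way to approach item 5 of the question (the importance of image resolution) is to repeat the same experiment at several target sizes and compare cross-validated accuracy. The sketch below makes several assumptions: it uses synthetic grayscale images instead of PCam patches, it downsamples by simple block averaging in NumPy rather than with `tf.image.resize`, and the helper `block_downsample` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def block_downsample(images, size):
    """Average-pool square images of shape (n, H, H) down to (n, size, size).

    Assumes that `size` divides the image height H exactly.
    """
    n, height, _ = images.shape
    factor = height // size
    return images.reshape(n, size, factor, size, factor).mean(axis=(2, 4))


# Synthetic 96x96 grayscale "patches": class 1 gets a brighter central square.
rng = np.random.default_rng(0)
n_images = 200
labels = rng.integers(0, 2, size=n_images)
images = rng.normal(size=(n_images, 96, 96))
images[labels == 1, 40:56, 40:56] += 0.5

accuracy_by_size = {}
for size in (8, 16, 32):
    # Flatten each downsampled image into one feature row.
    features = block_downsample(images, size).reshape(n_images, -1)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    accuracy_by_size[size] = cross_val_score(clf, features, labels, cv=3).mean()
    print(f"{size}x{size}: mean CV accuracy {accuracy_by_size[size]:.3f}")
```

On the real data, the same loop could call `convert_sample` with a different target size per iteration instead of the block-averaging helper.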

In [4]:
X = np.load('data/Xtrain.npy')
X = np.vstack(list(map(convert_sample, X)))
# Fit the scaler on the training data only and reuse it for the test data,
# so that both sets are scaled with the same statistics.
scaler = StandardScaler(with_mean=False, with_std=True).fit(X)
X = scaler.transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('data/ytrain.npy')
y = y.reshape(-1,)
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('data/Xtest.npy')
Xtest = np.vstack(list(map(convert_sample, Xtest)))
Xtest = scaler.transform(Xtest)
print(f'Shape of test data features (observations,features): {Xtest.shape}')

Shape of training data features (observations,features): (26214, 1024)
Shape of training data labels (observations,): (26214,)
Shape of test data features (observations,features): (1638, 1024)


In [46]:
models = {
    'gbc' : ensemble.GradientBoostingClassifier( 
                                 criterion="squared_error", 
                                 learning_rate=0.01, 
                                 max_depth=2,
                                 n_estimators= 500
                                 )
}
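
With a learning rate as small as 0.01, the number of boosting stages matters a lot. A minimal sketch of how one might check this uses `staged_predict`, which yields the predictions after each boosting stage. The data below is synthetic, and the names `X_demo`, `gbc_demo`, `staged_acc` and `best_n` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the image features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

gbc_demo = GradientBoostingClassifier(learning_rate=0.01, max_depth=2,
                                      n_estimators=200, random_state=0)
gbc_demo.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., n_estimators boosting stages.
staged_acc = [accuracy_score(y_va, y_hat)
              for y_hat in gbc_demo.staged_predict(X_va)]
best_n = int(np.argmax(staged_acc)) + 1
print(f"Best validation accuracy {max(staged_acc):.3f} at {best_n} stages")
```

If the curve is still rising at the last stage, more estimators or a larger learning rate may be worth trying.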

In [10]:
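def explore_models(X_train, y_train, X_test):
    """Fit every model in the global `models` dict and save its test predictions as CSV."""
    for model_name, model in models.items():
        print(f"Evaluating {model_name}...\n")

        # Fit the model (the message is printed once fitting has finished)
        model.fit(X_train, y_train)
        print(f"Fitting {model_name}...\n")

        # Make predictions on the unlabeled test set
        y_test_pred = model.predict(X_test)
        print(f"Predicting {model_name}...\n")

        # Accuracy on the training data itself, not a held-out estimate
        score = model.score(X_train, y_train)
        print(f"Score {score}...\n")

        # Write the predictions in the Id/Predicted submission format
        ytest_hat = pd.DataFrame({
            'Id': list(range(len(y_test_pred))),
            'Predicted': y_test_pred.reshape(-1,),
        })
        ytest_hat.to_csv(f'ytest_hat_{model_name}.csv', index=False)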

In [47]:
explore_models(X, y, Xtest)

Evaluating gbc...

Fitting gbc...

Predicting gbc...

Score 0.7607385366598001...



In [7]:
mixer.init()
mixer.music.load('done-for-you.mp3')
mixer.music.play()
# score: 0.76

In [9]:
model = ensemble.RandomForestClassifier(
    n_estimators=1000,
    max_depth=10,
    min_samples_leaf=3,
    min_samples_split=4,
)

print("Evaluating ...\n")

# Fit the model
model.fit(X, y)
print("Fitting ...\n")

Evaluating ...

Fitting ...



In [11]:
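# Predict labels for the unlabeled test set
y_test_pred = model.predict(Xtest)
print("Predicting ")

# Accuracy on the training data itself, not a held-out estimate
score = model.score(X, y)
print(f"Score {score}...\n")

# Write the predictions in the Id/Predicted submission format
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_pred))),
    'Predicted': y_test_pred.reshape(-1,),
})
ytest_hat.to_csv('ytest_hat_rf.csv', index=False)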

Predicting 
Score 0.8841077286945906...
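
The score above is measured on the same data the forest was trained on, so it is likely optimistic. For random forests, one inexpensive alternative is the out-of-bag (OOB) estimate, which scores each sample using only the trees that did not see it during bootstrapping. The sketch below uses synthetic data and a smaller forest than the notebook; `rf_demo`, `train_acc` and `oob_acc` are names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the image features (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=6, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=3,
    min_samples_split=4,
    oob_score=True,   # keep track of out-of-bag predictions while fitting
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

train_acc = rf_demo.score(X_demo, y_demo)  # accuracy on the training data
oob_acc = rf_demo.oob_score_               # accuracy on out-of-bag samples
print(f"Training accuracy: {train_acc:.3f}")
print(f"Out-of-bag accuracy: {oob_acc:.3f}")
```

A large gap between the two numbers is a hint that the training accuracy overstates how well the model generalizes.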



In [8]:
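importances = model.feature_importances_
names = range(1024)  # one feature per pixel of the 32x32 grayscale image

feature_importance = pd.DataFrame(zip(names, importances), columns=['Feature', 'Importance'])
# Sort by importance; reset_index() keeps the old row labels in an 'index' column
feature_importance = feature_importance.sort_values('Importance', ascending=True).reset_index()
# Keep only the features that were actually used by the model
feature_importance_sorted = feature_importance.loc[feature_importance['Importance'] > 0]
feature_importance_sorted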

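| | index | Feature | Importance |
|---|---|---|---|
| 0 | 0 | 0 | 0.001012 |
| 1 | 1 | 1 | 0.001102 |
| 2 | 2 | 2 | 0.000936 |
| 3 | 3 | 3 | 0.000880 |
| 4 | 4 | 4 | 0.001039 |
| ... | ... | ... | ... |
| 1019 | 1019 | 1019 | 0.000806 |
| 1020 | 1020 | 1020 | 0.000792 |
| 1021 | 1021 | 1021 | 0.000691 |
| 1022 | 1022 | 1022 | 0.000874 |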
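
Since each of the 1024 features is one pixel of the 32x32 grayscale image, flattened row by row, the importances can also be mapped back onto the image grid to see whether they concentrate in a particular region. A small sketch under that assumption follows; it uses random placeholder values rather than the importances of the fitted model, and `importance_grid` is a hypothetical helper.

```python
import numpy as np


def importance_grid(importances, side=32):
    """Reshape a flat importance vector into a (side, side) pixel grid.

    Assumes row-major flattening, so feature k sits at
    row k // side and column k % side.
    """
    return np.asarray(importances, dtype=float).reshape(side, side)


# Placeholder importances that sum to 1, like feature_importances_ does.
rng = np.random.default_rng(0)
fake_importances = rng.random(32 * 32)
fake_importances /= fake_importances.sum()

grid = importance_grid(fake_importances)
row, col = np.unravel_index(np.argmax(grid), grid.shape)
center_share = grid[8:24, 8:24].sum()  # share held by the central 16x16 pixels

print(f"Most important pixel: row {row}, column {col}")
print(f"Share of importance in the central 16x16 region: {center_share:.3f}")
```

With the real model, `model.feature_importances_` would replace `fake_importances`, and the grid could be shown as a heatmap.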


In [12]:
from numpy import loadtxt
from xgboost import XGBClassifier

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [50]:
# fit model on the full training data
model = XGBClassifier(max_depth=50, min_child_weight=3)
model.fit(X, y)

In [49]:
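y_pred = model.predict(Xtest)
train_score = model.score(X, y)
# NOTE: Xtest has no labels, so this compares the predictions with themselves
# and always gives 1.0. It says nothing about generalization; a real estimate
# needs labeled held-out data such as (X_test, y_test) from train_test_split,
# with the model fit on X_train only.
test_score = model.score(Xtest, y_pred)
print(train_score)
print(test_score)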

0.7642099641412985
1.0
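
The 1.0 above comes from scoring the model against its own predictions. A minimal sketch of a genuine hold-out evaluation is shown below: split the labeled data, fit on one part, and score on the other. To keep the example self-contained it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`; the names `booster`, `holdout_acc` and `self_agreement` are introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled training data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=6, random_state=42)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42)

# Fit on the training part only, so the hold-out part stays unseen.
booster = GradientBoostingClassifier(random_state=42)
booster.fit(X_tr, y_tr)
y_ho_pred = booster.predict(X_ho)

holdout_acc = accuracy_score(y_ho, y_ho_pred)          # true labels vs predictions
self_agreement = accuracy_score(y_ho_pred, y_ho_pred)  # predictions vs themselves

print(f"Hold-out accuracy: {holdout_acc:.3f}")
print(f"Predictions scored against themselves: {self_agreement:.1f}")
```

In the notebook, the same pattern would mean fitting on `X_train, y_train` and scoring on `X_test, y_test` from the earlier `train_test_split` cell.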


In [15]:
score = model.score(X, y)
print(f"Score {score}...\n")

Score 0.9855420767528802...

