# Assignment - shallow learning

Hi there! In this assignment, you will use shallow learning (including svm, random forests and gradient boosting if you feel up for the challenge) to solve an adapted Question 1 of the winter 2023 exam in applied machine learning:

## Introduction:

During the semester you have become very excited about the field of digital pathology which is an area that is developing rapidly due to advancements in microscopy imaging hardware. These advancements have allowed digitizing glass slides into whole-slide images. You have recently read the paper by [Veeling et al (2018)](https://arxiv.org/abs/1806.03962) and you are thrilled to see that the authors have derived a novel dataset, denoted PatchCamelyon (PCam), that will enable you to develop and benchmark your own machine learning models. As Veeling et al (2018) you are primarily interested in developing machine learning models that based on patches of whole-slide images of lymph node sections can assist pathologist in tumor detection. 

The primary objective of this exam is to perform image classification using the PCam dataset. The full dataset consists of 327,680 color images (96x96pxs) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating presence of metastatic tissue. For this assignment, however, you are only going to use the subset of the data which have been made available on Kaggle.

### Question 1 (adapted from the exam):
Use non-deep learning to perform image classification (tumor detection). Consider among other things the following:
1. Support vector machines
2. Random forests
3. Boosting
4. A combination of two or all three of the methods
5. Assess the importance of image resolution for the methods you are using

The assignment is posted as a Kaggle competition and is available here: https://www.kaggle.com/t/1f880200648443a3a30878d318cc6e4b


# Hints to get you started (with a very simple model)

In [2]:
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

import tensorflow as tf

from sklearn.preprocessing import StandardScaler

Defining a function that grayscale, resize and flattens the image. This function might also become handy (for deep learning later) if the original images are too large for your hardware configuration.

In [3]:
def convert_sample(image):
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image,[32,32]).numpy()
    image = image.reshape(1,-1)
    return image

In [4]:
X = np.load('C:/Users/thorp/OneDrive/Dokumenter/Uni/Kandidat/Anvendt maskinlæring/Xtrain.npy/Xtrain.npy')
X = np.vstack(list(map(convert_sample,X)))
X = StandardScaler(with_mean=0, with_std=1).fit_transform(X)
print(f'Shape of training data features (observations,features): {X.shape}')

y = np.load('C:/Users/thorp/OneDrive/Dokumenter/Uni/Kandidat/Anvendt maskinlæring/ytrain.npy')
y = y.reshape(-1,)    
print(f'Shape of training data labels (observations,): {y.shape}')

Xtest = np.load('C:/Users/thorp/OneDrive/Dokumenter/Uni/Kandidat/Anvendt maskinlæring/Xtest.npy/Xtest.npy')
Xtest = np.vstack(list(map(convert_sample,Xtest)))
Xtest = StandardScaler(with_mean=0, with_std=1).fit_transform(Xtest)
print(f'Shape of training data features (observations,features): {Xtest.shape}')



Shape of training data features (observations,features): (26214, 1024)
Shape of training data labels (observations,): (26214,)
Shape of training data features (observations,features): (1638, 1024)




The data is then ready to be applied for training and prediction in a shallow learning model such as the SVM classifier...below just a very very simple illustration on how to construct and train a support vector machine based on the data we have prepared. The predicted file can be submitted to Kaggle for evaluation.

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
''' 
clf = svm.LinearSVC(max_iter=100000)
clf.fit(X_train, y_train)
y_test_hat = clf.predict(X_test)
'''


' \nclf = svm.LinearSVC(max_iter=100000)\nclf.fit(X_train, y_train)\ny_test_hat = clf.predict(X_test)\n'

In [11]:
accuracy_linear_int = accuracy_score(y_test_hat, y_test)
print(accuracy_linear_int)

0.6202555788670608


In [8]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('C:/Users/computer/Documents/AML folder/AML-Code/AML Code/Assignments/test_hat.csv', index=False)

In [5]:
#simple hyperparameter search found no better than base case
Kernel = ["linear", "rbf"] 
Cs = [0.1, 0.5, 1, 2, 3, 100, 1000]

results = []

for kernel in Kernel:
    for C in Cs:
        svm_poly = svm.SVC(kernel=kernel, C=C, decision_function_shape='ovr')
        svm_poly.fit(X_train, y_train)
        y_val_hat = svm_poly.predict(X_test)
        accuracy = accuracy_score(y_val_hat, y_test)

        results.append([accuracy, kernel, C])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'Polynomial degree', 'C']
print(results)

# Extract best parameters.
#results[results['Accuracy'] == results['Accuracy'].max()]

I will now move on to decision trees i saw no better case than the base case of rbf in the SVM also hyperparameter seach is difficult, maybe i should do some PCA given enough computational power

In [18]:
from sklearn import tree
from matplotlib import pyplot as plt
# Initialize a DT (default setting)
dt_default = tree.DecisionTreeClassifier()


# Fit your DT
dt_default.fit(X_train, y_train)

# Predict on your test data with your DT
y_test_hat_default = dt_default.predict(X_test) 

# Obtain accuracy by using the `accuracy_score` function
accuracy_default = accuracy_score(y_test_hat_default, y_test)

# Print results
print(f'DT with default settings achieved {round(accuracy_default * 100, 1)}% accuracy with Depth: {dt_default.get_depth()}.')
#Basic model gained 64.8% with basic settings

DT with default settings achieved 64.8% accuracy with Depth: 59.


Basic model gained 64.8% with basic settings, i will try to tune it.

In [19]:
max_depth = 5 # try more values than just 5 here! Also try fractions!

# Initialize DT
dt_low_max_depth = tree.DecisionTreeClassifier(max_depth=max_depth)

# Fit your DT
dt_default.fit(X_train, y_train)

# Predict on your test data with your DT
y_test_hat_default = dt_default.predict(X_test) 

# Obtain accuracy by using the `accuracy_score` function
accuracy_default = accuracy_score(y_test_hat_default, y_test)

# Print results
print(f'DT with default settings achieved {round(accuracy_default * 100, 1)}% accuracy with Depth: {dt_default.get_depth()}.')

DT with default settings achieved 65.4% accuracy with Depth: 63.


Max depth 5 gave 65.4% accuracy now we loop  it

In [33]:
n = 30
max_depth = [i for i in range(1, n+1)]

results = []

for max_depth in max_depth:
    dt_low_max_depth = tree.DecisionTreeClassifier(max_depth=max_depth)
    dt_default.fit(X_train, y_train)
    y_test_hat_default = dt_default.predict(X_test) 
    accuracy = accuracy_score(y_test_hat_default, y_test)
    
    results.append([max_depth,round(accuracy * 100, 5)])

results = pd.DataFrame(results)
results.columns = ['Max Depth',
                   'Accuracy']
print(results)


    Max Depth  Accuracy
0           1  64.88652
1           2  64.31432
2           3  65.09632
3           4  64.46691
4           5  64.56227
5           6  64.58135
6           7  65.09632
7           8  64.96281
8           9  64.73393
9          10  65.07725
10         11  64.81022
11         12  64.92466
12         13  65.34427
13         14  65.00095
14         15  65.00095
15         16  64.79115
16         17  65.05817
17         18  65.93553
18         19  65.05817
19         20  65.32520
20         21  64.96281
21         22  65.26798
22         23  65.26798
23         24  64.98188
24         25  64.84837
25         26  65.26798
26         27  64.77208
27         28  65.30612
28         29  64.69578
29         30  65.09632


In [34]:
results[results['Accuracy'] == results['Accuracy'].max()]

Unnamed: 0,Max Depth,Accuracy
17,18,65.93553


From max_depth 1-30 we get that the depth of 18 gives 65.9, i will try other params and then try PCA

In [57]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

max_depth_values = [4, 5, 15, 17]
min_samples_split_values = [3, 5, 7, 10]
min_samples_leaf_values = [3, 5, 7, 10]

results = []

for max_depth in max_depth_values:
    for min_samples_split in min_samples_split_values:
        for min_samples_leaf in min_samples_leaf_values:
            dt_model = DecisionTreeClassifier(max_depth=max_depth, 
                                             min_samples_split=min_samples_split, 
                                             min_samples_leaf=min_samples_leaf)
            dt_model.fit(X_train, y_train)
            y_test_hat = dt_model.predict(X_test)
            accuracy = accuracy_score(y_test_hat, y_test)
            
            results.append([max_depth, min_samples_split, min_samples_leaf, accuracy])

results = pd.DataFrame(results)


In [61]:

results.columns = ['Max Depth', 
                   'Min Sample Split',
                   'Min sample_leaf',
                   'Accuracy']
print(results)
results[results['Accuracy'] == results['Accuracy'].max()]
                   

    Max Depth  Min Sample Split  Min sample_leaf  Accuracy
0           4                 3                3  0.647149
1           4                 3                5  0.647149
2           4                 3                7  0.647149
3           4                 3               10  0.647149
4           4                 5                3  0.647149
..        ...               ...              ...       ...
59         17                 7               10  0.678810
60         17                10                3  0.670799
61         17                10                5  0.674423
62         17                10                7  0.679954
63         17                10               10  0.676140

[64 rows x 4 columns]


Unnamed: 0,Max Depth,Min Sample Split,Min sample_leaf,Accuracy
62,17,10,7,0.679954


We achived a accuracy of 0.6799 with max depth 17 plit of 10 and min_leaf of 7, i will now run a super long model

I will now do a grid seach for the hyper parameters

In [68]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.experimental import enable_halving_search_cv
enable_halving_search_cv=True
from sklearn.model_selection import HalvingGridSearchCV
param_grid = {
    'max_depth': [4, 5, 15, 17],
    'min_samples_split': [3, 5, 7, 10],
    'min_samples_leaf': [3, 5, 7, 10, 12, 15, 16]
}

dt_model = DecisionTreeClassifier()
grid_search = HalvingGridSearchCV(dt_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

In [None]:
results = grid_search.cv_results_
for mean_score, params in zip(results['mean_test_score'], results['params']):
    print("Accuracy: {:.5f}, Hyperparameters: {}".format(mean_score, params))

Nothing good came of this, we try the nuclear option, random forrest

In [12]:

from sklearn.metrics import accuracy_score
from sklearn import ensemble  # ensemble instead of tree
# Initialize
rf = ensemble.RandomForestClassifier(criterion="entropy")

# Fit
rf.fit(X, y)

# Predict
y_test_hat = rf.predict(Xtest)


In [13]:
# accuracy = accuracy_score(y_test_hat, y_test)
# print(accuracy)
print(y.shape)
print(y_test_hat.shape)

(26214,)
(1638,)


Basic random just did better with a score of 76.4%, minimizing for enthropy gives us a score of 77,9 lets do some parameter search.

In [7]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = [2, 4, 6] #Must be int we are not doing regression, we are doing image classification
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


In [None]:
from sklearn.model_selection import RandomizedSearchCV
#Initiate the random forest

rf = ensemble.RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores (for a general test i do 3x3 to see what happens)
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 3, cv = 3, 
                               verbose=2, 
                               random_state=42, 
                               n_jobs = -1)
y_test_hat_rs = rf_random.fit(X, y)


In [18]:
ytest_hat = pd.DataFrame({
    'Id': list(range(len(y_test_hat))),
    'Predicted': y_test_hat.reshape(-1,),
})
ytest_hat.to_csv('C:/Users/thorp/Downloads/test_hat.csv', index=False)
print(ytest_hat.shape)

(1638, 2)
