## Course: TM10007 - Machine learning
Editors: Lishia Vergeer, Amy Roos, Maaike Pruijt, Hilde Roording.

Description: The aim of this code is to predict the tumor grade of gliomas (high or low) before surgery, 
based on features extracted from a combination of four MRI images: 
T2-weighted, T2-weighted FLAIR, and T1-weighted before and after injection of a contrast agent.

#### Import packages

In [12]:
# General packages
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets as ds

# Import code
from brats.load_data import load_data

# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score, KFold

# Pipeline and hyperparameter search
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Feature transformation and performance metrics
from sklearn import decomposition, feature_selection, metrics, preprocessing
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

# Machine learning models
from sklearn import neighbors, svm
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier


#### Load data

In [13]:
# Data loading functions. Uncomment the one you want to use
from brats.load_data import load_data

data_brats = load_data()

# Convert to dataframe
X = pd.DataFrame(data_brats)

print(f'The number of samples in data_brats: {len(X.index)}')
print(f'The number of columns in data_brats: {len(X.columns)}')





The number of samples in data_brats: 167
The number of columns in data_brats: 725


In [14]:
# Check datatypes
print(X.dtypes)
print(X.dtypes['VOLUME_ET_OVER_ED'])
print(X.dtypes['VOLUME_NET_OVER_ED'])
# The table contains both categorical (object) and numeric variables.
# These two columns raised an error because they have dtype object, so they are excluded further on.

VOLUME_ET        int64
VOLUME_NET       int64
VOLUME_ED        int64
VOLUME_TC        int64
VOLUME_WT        int64
                ...   
TGM_Cog_X_6    float64
TGM_Cog_Y_6    float64
TGM_Cog_Z_6    float64
TGM_T_6        float64
label           object
Length: 725, dtype: object
object
object


#### Split data in X and y
Split in X (data) and y (label)

In [15]:
# split column label from dataset X
y = X.pop('label')
print(f'The number of samples in y: {len(y.index)}')

The number of samples in y: 167


#### Split data in train and test set
The data is split into a train and a test component with the function train_test_split from the sklearn module.
The test_size variable sets the fraction of subjects that is held out.
With test_size=0.1 this returns a train set with 90% of the subjects and a test set with the remaining 10%.



In [16]:
# Split data in train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.1)  
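
With 167 subjects, a 10% test set holds only 17 samples, so its class balance can differ noticeably from run to run. A minimal sketch of a reproducible, stratified split follows. It runs on synthetic stand-in data, not on the BraTS features; the label names `HGG`/`LGG`, the class counts, and the `random_state` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table: 167 subjects, 5 features (assumed shapes)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(167, 5)), columns=[f"f{i}" for i in range(5)])
# Imbalanced toy labels: 100 'HGG' and 67 'LGG' (label names are an assumption)
y_demo = pd.Series(["HGG"] * 100 + ["LGG"] * 67)

# stratify keeps the class ratio similar in both parts;
# random_state makes the split repeatable between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

Fixing `random_state` also makes the scores reported further down comparable between notebook runs.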

#### Preprocessing
###### This step should probably move inside the pipeline, together with the NaN steps from the old code.
###### Should the test data be processed in the same way below? It can also contain #DIV/0! entries.

In [17]:
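# Clean a feature table: non-numeric entries (e.g. '#DIV/0!') and infinities become NaN
def clean_features(df):
    df = df.apply(pd.to_numeric, errors='coerce')
    return df.replace([np.inf, -np.inf], np.nan)

# Apply the same cleaning to train and test so both contain only numeric columns
X_train = clean_features(X_train)
X_test = clean_features(X_test)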
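
The effect of these cleaning steps can be checked on a small table. The sketch below is a toy example: the values are made up and the column `ratio` is hypothetical. It shows how a spreadsheet error marker turns a column into dtype object, and how coercion restores a numeric dtype.

```python
import numpy as np
import pandas as pd

# Toy table mimicking the issue: one numeric column, one column stored as text
# because of a spreadsheet error marker, and one column containing infinity
raw = pd.DataFrame({
    "VOLUME_ET": [10, 20, 30],
    "VOLUME_ET_OVER_ED": ["0.5", "#DIV/0!", "1.5"],
    "ratio": [1.0, np.inf, 3.0],
})

# Coerce everything to numbers (unparseable text becomes NaN), then map +/-inf to NaN
clean = raw.apply(pd.to_numeric, errors="coerce").replace([np.inf, -np.inf], np.nan)

print(raw.dtypes)
print(clean.dtypes)
print(clean.isna().sum())
```

After cleaning, every column is numeric, so the missing entries can be handled by an imputer.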


# Pipeline
##### building a pipeline to define each transformer type

In [18]:
numeric_transformer = Pipeline(steps=[
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),   # other strategies such as 'mean' or 'median' are also possible
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#### use the ColumnTransformer to apply the transformations to the correct columns in the dataframe

In [19]:
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
#categorical_features = X_train.select_dtypes(include=['object']).drop(['VOLUME_NET_OVER_ED', 'VOLUME_ET_OVER_ED'], axis=1).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)]) # ('cat', categorical_transformer,categorical_features)])
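
As an illustration of how a `ColumnTransformer` routes columns to different transformers, here is a small sketch with one numeric and one categorical column. The column names `volume` and `scanner`, the toy values, and the `median` strategy are assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical column (hypothetical names)
train = pd.DataFrame({"volume": [1.0, np.nan, 3.0, 5.0],
                      "scanner": ["A", "B", "A", "B"]})
test = pd.DataFrame({"volume": [2.0, np.nan], "scanner": ["A", "C"]})

num_pipe = Pipeline([("imputation", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array, which is easier to inspect
pre = ColumnTransformer([("num", num_pipe, ["volume"]),
                         ("cat", cat_pipe, ["scanner"])],
                        sparse_threshold=0)

# Statistics (median, mean, std, categories) are learned on the train set only
train_t = pre.fit_transform(train)
# The test set reuses the train statistics; the unseen category 'C' encodes as all zeros
test_t = pre.transform(test)

print(train_t.shape, test_t.shape)
print(test_t)
```

Because the transformer is fitted on the train set only, no information from the test set leaks into the imputation or scaling.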

#### Cross validation

In [20]:
kf = KFold(n_splits=5, shuffle= True, random_state = 1)

knn = KNeighborsRegressor()
r_2s = cross_val_score(knn, X_train, y_train, scoring = 'r2', cv=kf)
avg_r2 = np.mean(r_2s)

print(r_2s)
print(avg_r2)


[nan nan nan nan nan]
nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\amymy\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\amymy\miniconda3\lib\site-packages\sklearn\neighbors\_regression.py", line 213, in fit
    return self._fit(X, y)
  File "C:\Users\amymy\miniconda3\lib\site-packages\sklearn\neighbors\_base.py", line 400, in _fit
    X, y = self._validate_data(X, y, accept_sparse="csr", multi_output=True)
  File "C:\Users\amymy\miniconda3\lib\site-packages\sklearn\base.py", line 581, in _validate_data
    X, y = ch
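
The traceback above is cut off, so the exact error message is not visible. Two likely causes, stated here as assumptions: `X_train` still contains NaN values, which the k-nearest neighbours estimators do not accept, and `KNeighborsRegressor` with `r2` scoring expects a numeric target while `label` has dtype object. A minimal sketch of a cross-validation that avoids both follows. It uses synthetic data, the label names are invented, and it swaps in a classifier with accuracy scoring.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the glioma features (assumed sizes)
X_demo, y_num = make_classification(n_samples=150, n_features=20, random_state=0)
y_demo = np.where(y_num == 1, "HGG", "LGG")   # string labels (names are an assumption)
X_demo[::10, 0] = np.nan                      # inject some missing values

# Imputation and scaling sit inside the pipeline, so they are refit on every training fold
model = Pipeline([("imputation", SimpleImputer(strategy="median")),
                  ("scaler", StandardScaler()),
                  ("classifier", KNeighborsClassifier(n_neighbors=5))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)

print(scores)
print(scores.mean())
```

Passing `error_score='raise'` to `cross_val_score`, as the warning suggests, shows the full error of the first failing fold.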

#### Loop some classifiers and check performance


In [21]:
# Use the best classifier! Use these results to justify the choice.
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()]
    
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
    pipe.fit(X_train, y_train)   
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))

KNeighborsClassifier(n_neighbors=3)
model score: 0.765
SVC(C=0.025, probability=True)
model score: 0.706
NuSVC(probability=True)
model score: 0.824
DecisionTreeClassifier()
model score: 0.765
RandomForestClassifier()
model score: 0.824
AdaBoostClassifier()
model score: 0.765
GradientBoostingClassifier()
model score: 0.824


#### Hyperparametersearch

In [22]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter suggestions for the random forest
param_grid = { 
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['sqrt', 'log2'],   # 'auto' is deprecated since scikit-learn 1.1; for classifiers it equals 'sqrt'
    'classifier__max_depth' : [4,5,6,7,8]}          # 'classifier__criterion': ['gini', 'entropy'] could also be added, but it did not work

# Build the pipeline explicitly: after the loop above, `pipe` holds the last classifier (gradient boosting)
rf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier())])

# Grid search with 5-fold cross-validation (RandomizedSearchCV is the alternative)
Gridsearch_CV = GridSearchCV(rf_pipe, param_grid, n_jobs=1, cv=5)
Gridsearch_CV.fit(X_train, y_train)

# Check which parameters are best and use those in the final model
print(Gridsearch_CV.best_params_)
print(Gridsearch_CV.best_score_)
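
Since the choice between a grid search and a randomized search is still open, here is a hedged sketch of the randomized variant for comparison. It runs on synthetic data; the parameter ranges, `n_iter=5`, and the `roc_auc` scoring are illustrative assumptions, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (assumed sizes)
X_demo, y_demo = make_classification(n_samples=150, n_features=20, random_state=0)

pipe_demo = Pipeline([("scaler", StandardScaler()),
                      ("classifier", RandomForestClassifier(random_state=0))])

# Distributions instead of fixed lists; keys follow the '<step>__<parameter>' convention
param_dist = {
    "classifier__n_estimators": randint(50, 200),
    "classifier__max_features": ["sqrt", "log2"],
    "classifier__max_depth": randint(4, 9),   # integers 4..8, upper bound is exclusive
}

# n_iter caps the number of sampled settings, independent of the size of the search space
search = RandomizedSearchCV(pipe_demo, param_dist, n_iter=5, cv=5,
                            scoring="roc_auc", random_state=0, n_jobs=1)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

A grid search evaluates every combination, whereas the randomized search evaluates a fixed number of samples. This keeps the runtime predictable when more parameters are added.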

In [None]:
# Final model: the best pipeline found by the grid search (random forest chosen for now).
# Open question: why do other classifiers raise an error and xgb.XGBClassifier does not?
# Note: `static_params` is not defined anywhere in this notebook, and the grid-search keys
# carry the 'classifier__' prefix, so they cannot be passed to a classifier constructor directly.
# With refit=True (the default), GridSearchCV has already refitted the best pipeline
# on the complete train set, so it can be used as is.
bst = Gridsearch_CV.best_estimator_

In [None]:
score_train = roc_auc_score(y_train, bst.predict_proba(X_train)[:, 1])
print(score_train)

# The evaluation on the test set does not work yet
#score_test = roc_auc_score(y_test, bst.predict_proba(X_test)[:, 1])
# print(score_test)


1.0
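
A training AUC of 1.0 only shows that the model reproduces its own training labels; it gives no information about generalisation. The sketch below reports train and test AUC side by side. It uses synthetic data and a plain random forest as stand-ins, so the numbers it prints say nothing about the glioma data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise (flip_y) so the task is not trivially separable
X_demo, y_demo = make_classification(n_samples=167, n_features=50, n_informative=5,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of clf.classes_[1]
auc_train = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"train AUC: {auc_train:.3f}")
print(f"test AUC:  {auc_test:.3f}")
```

A large gap between the two values points to overfitting. The test set must pass through exactly the same preprocessing as the train set before `predict_proba` is called.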


# OLD CODE

#### Load data

In [None]:
# Data loading functions. Uncomment the one you want to use
from brats.load_data import load_data

data_brats = load_data()
print(f'The number of samples in data_brats: {len(X.index)}')
print(f'The number of columns in data_brats: {len(X.columns)}')

# Convert to dataframe
X = pd.DataFrame(data_brats)

The number of samples in data_brats: 167
The number of columns in data_brats: 724




#### Split data in X and y
Split in X (data) and y (label)

In [None]:
# split column label from dataset X
y = X.pop('label')
print(f'The number of samples in y: {len(y.index)}')

The number of samples in y: 167


#### Split data in train and test set
The data is split into a train and a test component with the function train_test_split from the sklearn module.
The test_size variable sets the fraction of subjects that is held out.
With test_size=0.1 this returns a train set with 90% of the subjects and a test set with the remaining 10%.

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.1)  

#### Handle NaN values
Columns with too many NaN values are deleted; the remaining NaN values are filled.
- Decide where the threshold should lie.

In [None]:
# Insight in the data
#print(f'OVERVIEW: {X_train.isnull().sum()}')

# infinity to NaN
X_train[X_train==np.inf]=np.nan

# non-numeric features to NaN
X_train = X_train.replace(['#DIV/0!'], np.nan)
X_train = X_train.apply(pd.to_numeric, errors='coerce')

# Delete a column entirely if 40% or more of its observations are NaN.
perc = 40.0
min_count = int(((100-perc)/100)*X_train.shape[0] + 1)   # minimum number of non-NaN values needed to keep a column
data_dropcolumn = X_train.dropna(axis=1, thresh=min_count)

# Fill the remaining NaN observations with the column median.
data_fill = data_dropcolumn.fillna(data_dropcolumn.median())   # still to decide: mean or median

# Insight in the data
#print(f'OVERVIEW NO NONE: {data_fill.isnull().sum()}')
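
The column filter can also be written in terms of the NaN fraction per column, which avoids the `thresh` arithmetic. Below is a small sketch on a toy table; the column names and values are made up. It keeps a column only when its NaN fraction is below the threshold, which appears to match the behaviour of the `min_count` computation above.

```python
import numpy as np
import pandas as pd

# Toy table with 5 rows: 'a' is complete, 'b' has 1/5 = 20% NaN, 'c' has 3/5 = 60% NaN
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],
})

max_nan_fraction = 0.40
# Fraction of missing values per column
nan_fraction = df.isna().mean()
# Keep only the columns whose missing fraction is below the threshold
kept = df.loc[:, nan_fraction < max_nan_fraction]

# Fill the remaining gaps with the column median
filled = kept.fillna(kept.median())

print(nan_fraction.to_dict())
print(list(kept.columns))
print(filled)
```

To avoid leakage, the kept columns and the medians should be determined on the train set and then applied unchanged to the test set.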


#### Scale features

In [None]:
# robustscaler
scaler = RobustScaler()
scaler.fit(data_fill)
X_scaled = scaler.transform(data_fill)

print(X_scaled)

[[-0.35729482  1.85343746  1.56015075 ... -0.93128157  0.21975955
  -0.12371608]
 [ 0.89552576 -0.2814291   0.62961102 ...  0.35519775  0.65292389
  -0.03776545]
 [ 0.4599639  -0.0106383   0.50867244 ... -0.1861754  -0.6246499
   0.53998392]
 ...
 [ 0.13535555  2.92167015 -0.35408497 ... -0.81228628  0.35661828
   1.50297563]
 [-0.4083876   0.30029517 -0.87507739 ... -0.41067211 -1.18149254
  -0.11757403]
 [-0.34859132  3.49335875  0.00411862 ... -0.60665591  0.09969611
   1.36893765]]


#### Transform features
- We plan to use only PCA. Is it correct that feature selection is then not used as well?
- PCA assumes a linear model. How can we tell whether our data is suitable for it?
- Are we expected to look into this further, or is it outside the scope of the course?
- Work out how to prepare X_test and y_test correctly for PCA.

In [None]:
# Perform a PCA
pca = decomposition.PCA(n_components=2)
pca.fit(X_scaled) 
X_train_pca = pca.transform(X_scaled)

#X_test_pca = pca.transform(X_test)
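
On the open point about preparing the test data for PCA, one option is sketched below under stated assumptions. The scaler and the PCA are chained in a pipeline, fitted on the train features only, and then reused to transform the test features. The arrays are synthetic stand-ins for cleaned, NaN-free data. `y_test` needs no transformation, since PCA acts on the features only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for cleaned, NaN-free train and test features (assumed shapes)
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(150, 30))
X_te = rng.normal(size=(17, 30))

# Chaining the steps guarantees the test set passes through the same fitted transforms
reduce_dim = Pipeline([("scaler", RobustScaler()),
                       ("pca", PCA(n_components=2))])

X_tr_pca = reduce_dim.fit_transform(X_tr)   # fit on the train set only
X_te_pca = reduce_dim.transform(X_te)       # reuse the train medians, IQRs and components

print(X_tr_pca.shape, X_te_pca.shape)
# Share of the variance captured by each component, useful for choosing n_components
print(reduce_dim.named_steps["pca"].explained_variance_ratio_)
```

The test features must have the same columns, in the same order, as the train features after any column dropping.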


#### Classifier: kNN

In [None]:
# # Fit kNN
# knn = neighbors.KNeighborsClassifier(n_neighbors=15)
# knn.fit(X_train_pca, y_train)
# score_train = knn.score(X_train_pca, y_train)
# #score_test = knn.score(X_test_pca, y_test)

# # Print result
# print(f"Training result: {score_train}")
# #print(f"Test result: {score_test}")

#### Classifier: SVM

#### Classifier: Random Forest