### Welcome to the Datacation Bootcamp!

Today, we present a coding challenge to you.
In this challenge, you will be given two brain tumor datasets: a training set and a test set.
The target variable has been removed from the test dataset. You will try different Machine Learning models and
use your best model to predict whether patients have a brain tumor or not.

Try to get as far as possible in the following exercises, increasing in difficulty:

1. Load the training set train_brain.csv and split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
2. Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
3. Visualize the SVM accuracy results in a graph.
4. Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
5. Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
6. Implement a Neural Network and visualize the loss and accuracy, both for the test and training dataset.
7. Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
8. Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
   Save the prediction results in a list with the same order as the patients in the test dataset.
   Use the given code to store the list as a .pkl file and save it using your group number.
   Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.


In [1]:
######################################   EXERCISE 1   ######################################
'''
Load the training set train_brain.csv and split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
'''

# Read in the dataset: (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

import pandas as pd
import sklearn

# Treat the string 'na' as a missing value, then keep only the numeric columns
data = pd.read_csv('train_brain.csv', na_values='na')
data = data.select_dtypes(include='number')
print(data)

# Split the training set in a train and validation set, use random_state = 0: (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

# Impute missing values: (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

# Scale numerical columns: (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)


              1          2          4          5          6          7  \
0      1.250042  19.145029   9.584353  34.984717   7.027265   6.689411   
1      2.970214  27.764929  20.236078  38.657649  10.392102  11.340814   
2      3.135308  32.574846  10.915899  32.318764   6.779047   7.626425   
3      4.277058  47.738345  16.489040  39.429322   6.935775  10.205494   
4      4.473978  16.115129  10.403431  38.915410   7.852578   6.757196   
...         ...        ...        ...        ...        ...        ...   
48397  5.097584  16.073367  22.228708  34.273965  10.838109   6.613083   
48398  2.479135  33.375509  16.584398  27.859037   9.112503  15.142838   
48399       NaN        NaN        NaN        NaN        NaN        NaN   
48400  7.863778  41.672625  19.498322  37.716725  11.986951  11.115559   
48401  9.408801  22.779617  11.257911  37.299801  10.411007   8.616008   

               8         9         10         11  ...         92         93  \
0      36.111964  4.899921  23.9

In [2]:
# Keep only the first five rows; every later cell therefore runs on this tiny sample
data = data.head(5)

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost as xgb
import numpy as np


X_train, X_test, y_train, y_test = \
        train_test_split(data.iloc[:,:-1], data.iloc[:,-1:], test_size=0.30, random_state=0)

In [4]:
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

X_train_new = imp.fit_transform(X_train)


# Do not fit on the held-out data; only transform it with the statistics learned from the training split
X_test_new = imp.transform(X_test)
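
Exercise 1 also lists scaling as an option, which the cells above do not do yet. The following is a minimal sketch of imputing and then scaling, with both transformers fitted on the training split only. The arrays `X_train_demo` and `X_valid_demo` are synthetic stand-ins (an assumption, so the snippet runs without train_brain.csv), and the median strategy is swapped in for the `most_frequent` strategy used above because it is a common choice for continuous features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real splits (assumption: numeric features with some NaNs).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=5.0, size=(100, 4))
X_valid_demo = rng.normal(loc=10.0, scale=5.0, size=(40, 4))
X_train_demo[::10, 0] = np.nan  # inject a few missing values
X_valid_demo[::8, 1] = np.nan

# Fit the imputer and the scaler on the training split only,
# then reuse the fitted objects on the validation split.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(imputer.fit_transform(X_train_demo))
X_valid_scaled = scaler.transform(imputer.transform(X_valid_demo))

print(X_train_scaled.mean(axis=0).round(3))  # per-column mean of the training split
print(X_train_scaled.std(axis=0).round(3))   # per-column standard deviation
```

The validation split is only ever passed to `transform`, so no information from it leaks into the fitted statistics.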

In [5]:
print(y_train)

   target
1       0
3       0
4       1


In [6]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    silent=True, nthread=1)
folds = 2
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='roc_auc', n_jobs=4, cv=skf.split(X_train_new,y_train), verbose=3, random_state=1001 )

random_search.fit(X_train_new, y_train)

Fitting 2 folds for each of 5 candidates, totalling 10 fits




Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


RandomizedSearchCV(cv=<generator object _BaseKFold.split at 0x000001C7895B3D48>,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           enable_categorical=False, gamma=None,
                                           gpu_id=None, importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate=0.02,
                                           max_delta_step=None, max_depth=None,
                                           mi...
                                           reg_alpha=None, reg_lambda=None,
                                           scale_pos_weight=None, silent=True,
                                           subsample=None, tree_met

In [16]:



# Predict on the imputed validation features, matching the data the model was fitted on
y_test_pred = random_search.predict_proba(X_test_new)
print(y_test_pred)
print(y_test)

[[0.5 0.5]
 [0.5 0.5]]
   target
2       1
0       0


In [8]:
######################################   EXERCISE 2   ######################################
'''
Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
'''

# Implement the Support Vector Machine: (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)



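One possible way to approach Exercise 2 is sketched below: fit an `SVC` for a handful of values of the regularisation parameter `C` and keep the one with the highest validation accuracy. The data comes from `make_classification` as a stand-in (an assumption), and the list `C_values` is a hypothetical grid, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the preprocessed brain tumor splits.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

# SVMs are sensitive to feature scale, so standardise using training statistics.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s = scaler.transform(X_tr), scaler.transform(X_val)

C_values = [0.01, 0.1, 1, 10, 100]  # hypothetical grid
svm_val_acc = {}
for C in C_values:
    model = SVC(C=C, kernel="rbf", gamma="scale")
    model.fit(X_tr_s, y_tr)
    svm_val_acc[C] = accuracy_score(y_val, model.predict(X_val_s))

# Pick the C with the highest validation accuracy.
best_C = max(svm_val_acc, key=svm_val_acc.get)
print(svm_val_acc)
print("best C:", best_C)
```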

In [9]:
######################################   EXERCISE 3   ######################################
'''
Visualize the SVM accuracy results in a graph.
'''

# Visualize the SVM results: (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html)



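For Exercise 3, a line plot of accuracy against a hyperparameter is one straightforward option. The sketch below recomputes training and validation accuracy for a hypothetical set of `C` values on synthetic data (both assumptions) and plots them on a logarithmic x-axis.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

C_values = [0.01, 0.1, 1, 10, 100]  # hypothetical grid
train_acc, val_acc = [], []
for C in C_values:
    model = SVC(C=C).fit(X_tr, y_tr)
    train_acc.append(model.score(X_tr, y_tr))   # accuracy on the training split
    val_acc.append(model.score(X_val, y_val))   # accuracy on the validation split

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(C_values, train_acc, marker="o", label="training")
ax.plot(C_values, val_acc, marker="o", label="validation")
ax.set_xscale("log")  # C spans several orders of magnitude
ax.set_xlabel("C")
ax.set_ylabel("accuracy")
ax.set_title("SVM accuracy vs. C")
ax.legend()
```

A widening gap between the two curves as `C` grows is the usual sign of overfitting.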

In [10]:
######################################   EXERCISE 4   ######################################
'''
Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
'''

# Make use of GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# KNN: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html



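A sketch of how Exercise 4 could be set up with `GridSearchCV` and `cv=3` follows. The parameter grids are hypothetical and deliberately small, and the data is synthetic (both assumptions). SVM and KNN are wrapped in a pipeline with a `StandardScaler` so that scaling is refitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# make_pipeline names each step after its lowercased class name,
# hence the "svc__" and "kneighborsclassifier__" prefixes.
searches = {
    "SVM": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    ),
    "KNN": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 11]},
    ),
    "RandomForest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_demo, y_demo)
    results[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```

Each fitted search keeps the full table of fold scores in `cv_results_`, which is what a later visualisation can draw on.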

In [11]:
######################################   EXERCISE 5   ######################################
'''
Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
'''

# Make use of a heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html



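For Exercise 5, the mean cross-validation accuracy of a two-parameter grid can be pivoted into a table and drawn as a heatmap. The sketch below does this for an SVM only, on synthetic data and with a hypothetical grid (both assumptions); the same pattern should carry over to KNN and RandomForest. It uses matplotlib's `imshow` rather than the `seaborn.heatmap` call from the hint so that it needs only matplotlib; the pivoted `scores` table can be passed to `seaborn.heatmap` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}  # hypothetical grid
search = GridSearchCV(SVC(), grid, cv=3, scoring="accuracy").fit(X_demo, y_demo)

# One row per parameter combination, then pivot into a C x gamma table.
cv_table = pd.DataFrame(search.cv_results_["params"])
cv_table["mean_accuracy"] = search.cv_results_["mean_test_score"]
scores = cv_table.pivot(index="C", columns="gamma", values="mean_accuracy")

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(scores.to_numpy(), cmap="viridis")
ax.set_xticks(range(len(scores.columns)))
ax.set_xticklabels(scores.columns)
ax.set_yticks(range(len(scores.index)))
ax.set_yticklabels(scores.index)
ax.set_xlabel("gamma")
ax.set_ylabel("C")
ax.set_title("SVM mean 3-fold CV accuracy")
fig.colorbar(image, ax=ax, label="accuracy")
```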

In [12]:
######################################   EXERCISE 6   ######################################
'''
Implement a basic Neural Network and visualize the loss and accuracy, both for the test and training dataset.
'''

# Make use of the MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier



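One way to obtain per-epoch curves for Exercise 6 is to train an `MLPClassifier` with `partial_fit`, which performs a single pass over the data per call, and record loss and accuracy after each pass. Because the target has been removed from test_brain.csv, this sketch tracks a held-out validation split in place of the test set (an assumption about the intent). The data is synthetic, and the network size, learning rate and `n_epochs` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

mlp = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=0.01, random_state=0)
classes = np.unique(y_tr)
history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}

n_epochs = 30  # hypothetical
for _ in range(n_epochs):
    mlp.partial_fit(X_tr, y_tr, classes=classes)  # one pass over the training data
    for split, X_part, y_part in (("train", X_tr, y_tr), ("val", X_val, y_val)):
        proba = mlp.predict_proba(X_part)
        history[f"{split}_loss"].append(log_loss(y_part, proba, labels=classes))
        history[f"{split}_acc"].append(mlp.score(X_part, y_part))

epochs = range(1, n_epochs + 1)
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
for split in ("train", "val"):
    ax_loss.plot(epochs, history[f"{split}_loss"], label=split)
    ax_acc.plot(epochs, history[f"{split}_acc"], label=split)
ax_loss.set(xlabel="epoch", ylabel="log loss", title="Loss")
ax_acc.set(xlabel="epoch", ylabel="accuracy", title="Accuracy")
ax_loss.legend()
ax_acc.legend()
```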

In [13]:
######################################   EXERCISE 7   ######################################
'''
Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
'''



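For Exercise 7, it can help to wrap preprocessing and model in a single `Pipeline` so that different candidates are compared under identical conditions. The sketch below is one such comparison on synthetic data with injected missing values. The candidate models and their settings are hypothetical choices, not a claim about what works best on the brain tumor data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with a few missing values (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_demo[::15, 2] = np.nan
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

candidates = {  # hypothetical shortlist
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(C=1.0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

val_scores, fitted = {}, {}
for name, clf in candidates.items():
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", clf),
    ])
    pipe.fit(X_tr, y_tr)                         # preprocessing is fitted on training data only
    val_scores[name] = pipe.score(X_val, y_val)  # validation accuracy
    fitted[name] = pipe

best_name = max(val_scores, key=val_scores.get)
best_model = fitted[best_name]
print(val_scores)
print("best:", best_name)
```

Keeping the fitted pipeline, rather than only the classifier, means the same imputation and scaling are applied automatically when predicting on new data.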

In [14]:
######################################   EXERCISE 8   ######################################
'''
Time to predict using your best ML model!
Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
Save the prediction results in a list with the same order as the patients in the test dataset.
Use the code given below to store the list as a .pkl file and save it using your group number.
Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.
'''

import pickle

group_number = 9
ypred = []

with open(f'test_predictions_group_{group_number}.pkl', 'wb') as f:
    pickle.dump(ypred, f)
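
As written, `ypred` is still an empty list. A minimal sketch of filling it from a fitted model and checking the pickle round trip follows. The model, the synthetic data and the temporary output directory are all assumptions made so the snippet runs on its own; in the notebook, the features would come from test_brain.csv after the same preprocessing as the training data.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): the last 50 rows play the role of the unlabelled test set.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_labelled, y_labelled = X_demo[:150], y_demo[:150]
X_unlabelled = X_demo[150:]

# Hypothetical "best model"; substitute whichever model scored best on validation.
best_model = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)

# predict() returns one label per input row, in input order;
# tolist() converts the NumPy array into a plain Python list.
ypred = best_model.predict(X_unlabelled).tolist()

group_number = 9
out_path = Path(tempfile.gettempdir()) / f"test_predictions_group_{group_number}.pkl"
with open(out_path, "wb") as f:
    pickle.dump(ypred, f)

# Read the file back to confirm that the stored list matches.
with open(out_path, "rb") as f:
    restored = pickle.load(f)
print(len(restored), restored[:5])
```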