### Welcome to the Datacation Bootcamp!

Today, we present a coding challenge to you.
In this challenge, you will be given two brain tumor datasets: a training set and a test set.
The target variable has been removed from the test set. You will try different machine learning models and
use your best model to predict whether patients have a brain tumor.

Try to get as far as possible in the following exercises, increasing in difficulty:

1. Load the training set train_brain.csv, split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
2. Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
3. Visualize the SVM accuracy results in a graph.
4. Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
5. Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
6. Implement a Neural Network and visualize the loss and accuracy for both the training and the test dataset.
7. Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
8. Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
   Save the prediction results in a list with the same order as the patients in the test dataset.
   Use the given code to store the list as a .pkl file and save it using your group number.
   Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.


In [47]:
######################################   EXERCISE 1   ######################################
'''
Load the training set train_brain.csv, split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
'''

# Read in the dataset: (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
import numpy as np
import pandas as pd
# IterativeImputer is experimental: this enabling import must come before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import train_test_split

data = pd.read_csv('train_brain.csv')
X = data.drop(columns=['target'])
Y = data['target']

# Split the training set into a train and validation set, use random_state = 0: (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
# Note: X_test and Y_test hold the validation split here, not the contents of test_brain.csv.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=0)

# Impute missing values: (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
# Numerical data: model-based imputation with IterativeImputer
X_train_numerical = X_train.select_dtypes(include=np.number)
num_columns = X_train_numerical.columns
imp = IterativeImputer(max_iter=5, random_state=0)
X_train_numerical = pd.DataFrame(imp.fit_transform(X_train_numerical))
X_train_numerical.columns = num_columns

# Categorical data
X_train_cat = X_train.select_dtypes(exclude=np.number)

# Combine again: realign the imputed numerical frame with the original row index first
X_train_numerical.set_index(X_train_cat.index, inplace=True)
X_train = pd.concat([X_train_numerical, X_train_cat], axis=1)

# Impute categorical data with the most frequent value of each column.
# Note: wrapping the returned array in a new DataFrame resets the row index,
# replaces the column names with 0..n-1 and leaves every column with dtype object.
imp_most = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train = pd.DataFrame(imp_most.fit_transform(X_train))
# Scale numerical columns: (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
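One possible way to approach the scaling step is sketched below. It is not the reference solution: the two small frames and their column names (`f0`, `f1`, `c0`) are hypothetical stand-ins for the imputed training and validation data. The point it illustrates is that the scaler is fitted on the training split only and then reused on the validation split.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the imputed training and validation frames.
train = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0, 4.0],
    "f1": [10.0, 20.0, 30.0, 40.0],
    "c0": ["A", "B", "A", "C"],
})
val = pd.DataFrame({
    "f0": [2.5, 5.0],
    "f1": [25.0, 50.0],
    "c0": ["B", "A"],
})

# Only the numerical columns are scaled; categorical columns are left alone.
num_cols = train.select_dtypes(include=np.number).columns

scaler = StandardScaler()
train_scaled = train.copy()
val_scaled = val.copy()

# Fit on the training split only, then apply the same statistics to validation.
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
val_scaled[num_cols] = scaler.transform(val[num_cols])

print(train_scaled)
print(val_scaled)
```

Fitting the scaler on the training split alone keeps information from the validation rows out of the preprocessing statistics.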




       0        1        2        3        4        5        6        7    \
0  14.1154  26.9888  18.0004  37.9845  9.57692  7.73521  31.5357  9.71818   
1  4.01769  19.3011  12.8761  34.1732  12.0313  7.44901  30.5093  3.91994   
2  6.03816  28.2465  15.4586  32.3316  9.08333  9.26768  33.6227  9.55485   
3  12.3054  23.0694  14.1018  25.7072  5.84181  11.6547  36.1815   5.8799   
4  6.03819  28.0759  15.6182  32.7191  9.05655  9.26738  33.6227  2.93995   

       8        9    ... 91  92  93  94  95  96  97  98  99  100  
0  25.4178  16.3847  ...   C   F   B   D   B   F   A   B   F   K  
1  23.1735  18.7209  ...   A   F   B   D   C   G   C   A   A   V  
2  24.0309   14.165  ...   A   B   B   B   C   G   C   A   A   A  
3  25.4109  12.2382  ...   A   B   B   B   C   F   A   A   O   R  
4  23.6222  14.1649  ...   A   F   B   D   B   C   D   B   E  AC  

[5 rows x 101 columns]


In [None]:
######################################   EXERCISE 2   ######################################
'''
Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
'''

# Implement the Support Vector Machine: (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
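A minimal sketch of how the SVM could be tuned on validation accuracy follows. It assumes synthetic data from `make_classification` in place of train_brain.csv, and the candidate values for `C` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed brain tumor data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Try several regularization strengths and record the validation accuracy of each.
c_values = [0.01, 0.1, 1, 10, 100]
val_scores = {}
for c in c_values:
    model = make_pipeline(StandardScaler(), SVC(C=c, kernel="rbf"))
    model.fit(X_tr, y_tr)
    val_scores[c] = accuracy_score(y_val, model.predict(X_val))

# Keep the setting with the highest validation accuracy.
best_c = max(val_scores, key=val_scores.get)
print(f"best C = {best_c}, validation accuracy = {val_scores[best_c]:.3f}")
```

The same loop can be extended to other hyperparameters such as `kernel` or `gamma`.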



In [None]:
######################################   EXERCISE 3   ######################################
'''
Visualize the SVM accuracy results in a graph.
'''

# Visualize the SVM results: (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html)
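The plot below is one way such a graph might look: training and validation accuracy against the SVM parameter `C` on a logarithmic axis. The data is synthetic and the range of `C` values is an assumption made for the sketch.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed brain tumor data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Collect training and validation accuracy for each value of C.
c_values = [0.01, 0.1, 1, 10, 100]
train_acc = []
val_acc = []
for c in c_values:
    clf = SVC(C=c).fit(X_tr, y_tr)
    train_acc.append(clf.score(X_tr, y_tr))
    val_acc.append(clf.score(X_val, y_val))

# One line per split; C spans several orders of magnitude, hence the log axis.
fig, ax = plt.subplots()
ax.plot(c_values, train_acc, marker="o", label="training")
ax.plot(c_values, val_acc, marker="o", label="validation")
ax.set_xscale("log")
ax.set_xlabel("C")
ax.set_ylabel("accuracy")
ax.set_title("SVM accuracy vs. C")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.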



In [None]:
######################################   EXERCISE 4   ######################################
'''
Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
'''

# Make use of GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# KNN: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
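As a rough sketch, a 3-fold grid search over the three models could be organised as a dictionary of estimators and parameter grids. The grids shown are small, illustrative assumptions, and the data is synthetic rather than train_brain.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed brain tumor data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# One (estimator, parameter grid) pair per model; the grids are illustrative only.
searches = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "RandomForest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

# Run a 3-fold cross-validated grid search for each model.
results = {}
for name, (estimator, param_grid) in searches.items():
    grid = GridSearchCV(estimator, param_grid, cv=3, scoring="accuracy")
    grid.fit(X_demo, y_demo)
    results[name] = grid
    print(name, grid.best_params_, round(grid.best_score_, 3))
```

Each fitted `GridSearchCV` object keeps the per-combination scores in `cv_results_`, which is useful for the visualization in the next exercise.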



In [None]:
######################################   EXERCISE 5   ######################################
'''
Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
'''

# Make use of a heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html
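One way to build such a heatmap is to pivot the grid search results into a table with one hyperparameter per axis. The sketch below does this for an SVM over `C` and `gamma` only; the grid values and the synthetic data are assumptions, and the same pattern would have to be repeated for KNN and RandomForest with two of their own hyperparameters.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed brain tumor data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Grid search over two hyperparameters so the scores form a 2-D table.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
grid.fit(X_demo, y_demo)

# Rows are C values, columns are gamma values, cells are mean CV accuracy.
cv_results = pd.DataFrame(grid.cv_results_)
scores = cv_results.pivot(
    index="param_C", columns="param_gamma", values="mean_test_score"
).astype(float)

ax = sns.heatmap(scores, annot=True, fmt=".3f", vmin=0.0, vmax=1.0)
ax.set_xlabel("gamma")
ax.set_ylabel("C")
ax.set_title("SVM mean 3-fold CV accuracy")
```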



In [None]:
######################################   EXERCISE 6   ######################################
'''
Implement a basic Neural Network and visualize the loss and accuracy for both the training and the test dataset.
'''

# Make use of the MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
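A possible way to obtain per-epoch curves with `MLPClassifier` is to train it one pass at a time with `partial_fit` and record the metrics after each pass. This is a sketch under assumptions: the data is synthetic, the held-out split plays the role of the test dataset, and the layer size and epoch count are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed brain tumor data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

mlp = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.unique(y_tr)
n_epochs = 30
history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}

for _ in range(n_epochs):
    # Each partial_fit call performs one pass over the training data.
    mlp.partial_fit(X_tr, y_tr, classes=classes)
    history["train_loss"].append(log_loss(y_tr, mlp.predict_proba(X_tr), labels=classes))
    history["val_loss"].append(log_loss(y_val, mlp.predict_proba(X_val), labels=classes))
    history["train_acc"].append(mlp.score(X_tr, y_tr))
    history["val_acc"].append(mlp.score(X_val, y_val))

# Loss on the left, accuracy on the right, one line per split.
epochs = range(1, n_epochs + 1)
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, history["train_loss"], label="training")
ax_loss.plot(epochs, history["val_loss"], label="held-out")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("log loss")
ax_loss.legend()
ax_acc.plot(epochs, history["train_acc"], label="training")
ax_acc.plot(epochs, history["val_acc"], label="held-out")
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("accuracy")
ax_acc.legend()
```

A widening gap between the training and held-out curves is a common sign of overfitting.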



In [None]:
######################################   EXERCISE 7   ######################################
'''
Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
'''
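One way to keep imputing, one-hot encoding, scaling and the model together is a scikit-learn `Pipeline` built around a `ColumnTransformer`. The sketch below assumes a small synthetic frame with hypothetical column names (`num_a`, `num_b`, `cat_a`) and an arbitrary choice of classifier; it is an illustration of the pattern, not a tuned solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with missing values (stand-in for train_brain.csv).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["A", "B", "C"], size=n),
})
demo["target"] = (demo["num_a"] + (demo["cat_a"] == "A") > 0.5).astype(int)
demo.loc[::10, "num_b"] = np.nan
demo.loc[::15, "cat_a"] = np.nan

# Numerical columns: impute then scale. Categorical columns: impute then one-hot encode.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, make_column_selector(dtype_include=np.number)),
    ("cat", categorical_steps, make_column_selector(dtype_exclude=np.number)),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])

X_demo = demo.drop(columns=["target"])
y_demo = demo["target"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Fitting the pipeline fits every preprocessing step on the training split only.
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Because the preprocessing lives inside the pipeline, the same fitted object can later be applied unchanged to the validation and test data.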



In [None]:
######################################   EXERCISE 8   ######################################
'''
Time to predict using your best ML model!
Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
Save the prediction results in a list with the same order as the patients in the test dataset.
Use the code given below to store the list as a .pkl file and save it using your group number.
Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.
'''

import pickle

group_number = 0
ypred = []

with open(f'test_predictions_group_{group_number}.pkl', 'wb') as f:
    pickle.dump(ypred, f)
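For illustration, the sketch below shows how `ypred` might be filled from a fitted model and how the saved file can be read back to confirm its contents. The data and model are synthetic stand-ins, and the temporary directory is used only to keep the example from writing into the working directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: the first 100 rows are labeled, the last 20 play the test set.
X_all, y_all = make_classification(n_samples=120, n_features=10, random_state=0)
X_known, y_known = X_all[:100], y_all[:100]
X_unlabeled = X_all[100:]

best_model = RandomForestClassifier(random_state=0).fit(X_known, y_known)

# predict() keeps the row order of its input; tolist() gives a plain Python list.
ypred = best_model.predict(X_unlabeled).tolist()

group_number = 0
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / f"test_predictions_group_{group_number}.pkl"
    with open(path, "wb") as f:
        pickle.dump(ypred, f)
    # Read the file back to confirm the list survived the round trip.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(len(restored), restored[:5])
```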