### Welcome to the Datacation Bootcamp!

Today, we present you with a coding challenge.
In this challenge, you are given two brain tumor datasets: a training set and a test set.
The target variable has been removed from the test dataset. You will try different Machine Learning models and
use your best model to predict whether patients have a brain tumor or not.

Try to get as far as possible in the following exercises, increasing in difficulty:

1. Load the training set train_brain.csv, split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
2. Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
3. Visualize the SVM accuracy results in a graph.
4. Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
5. Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
6. Implement a Neural Network and visualize the loss and accuracy, both for the test and training dataset.
7. Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
8. Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
   Save the prediction results in a list with the same order as the patients in the test dataset.
   Use the given code to store the list as a .pkl file and save it using your group number.
   Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.


In [2]:
######################################   EXERCISE 1   ######################################
'''
Load the training set train_brain.csv, split it into a training and a validation set, then preprocess the data. Options include:
    - imputing missing values
    - one-hot-encoding categorical values
    - scaling the data
'''
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


# Read in the dataset: (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
train = pd.read_csv("train_brain.csv")
test = pd.read_csv("test_brain.csv")
# Split the training set in a train and validation set, use random_state = 0: (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

train_split,validation = train_test_split(train, test_size=0.3, random_state=0)

# Impute missing values: (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
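# Fit the imputer on the training split only, then reuse it for the validation
# split so that no validation statistics leak into the preprocessing.
imp_mean = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mean.fit(train_split)
train_imputed = imp_mean.transform(train_split)
vali_imputed = imp_mean.transform(validation)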

# Scale numerical columns: (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
scaler = StandardScaler()

# transpose() returns a view, so writing into `transpose` also updates train_imputed.
transpose = train_imputed.transpose()

# Standardize every non-string column (note: this includes the target column).
for i, column in enumerate(transpose):
    if not isinstance(column[0], str):
        array = column.reshape(-1, 1)
        transpose[i] = scaler.fit_transform(array).reshape(-1)

train_scaled = transpose.transpose()
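The manual impute-and-scale loop above can also be expressed with scikit-learn's `ColumnTransformer`, which learns every statistic on the training split and reuses it unchanged on the validation split. The following is a minimal sketch, not the bootcamp's reference solution. It assumes a small made-up frame (`toy`, with hypothetical columns `f1`, `f2`, `cat` and `target`) in place of `train_brain.csv`, and it uses median imputation for the numeric columns, which is a choice of this sketch rather than something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for train_brain.csv: two numeric features,
# one categorical feature and a binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "cat": pd.Series(rng.choice(["A", "B", "C"], size=200), dtype=object),
    "target": rng.integers(0, 2, size=200),
})
toy.loc[::10, "f1"] = np.nan    # inject some missing numeric values
toy.loc[::15, "cat"] = np.nan   # inject some missing categorical values

# Separate the target first so it is never imputed or scaled.
X = toy.drop(columns="target")
y = toy["target"]
X_train, X_vali, y_train, y_vali = train_test_split(
    X, y, test_size=0.3, random_state=0
)

numeric_cols = ["f1", "f2"]
categorical_cols = ["cat"]

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prep = preprocess.fit_transform(X_train)  # statistics are learned here only
X_vali_prep = preprocess.transform(X_vali)        # reused, never refitted
print(X_train_prep.shape, X_vali_prep.shape)
```

The same fitted `preprocess` object could then be applied to the test features, provided the target column was removed before fitting as it is here.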

        


In [44]:
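# Preprocess the test set with the same steps. The imputer is refitted here
# because the test file has no target column, so the imputer fitted on the
# training frame (which includes the target) cannot be applied to it directly.
imp_mean.fit(test)
test_imputed = imp_mean.transform(test)
transpose = test_imputed.transpose()

# The scaler is likewise refitted per column on the test data.
for i, column in enumerate(transpose):
    if not isinstance(column[0], str):
        array = column.reshape(-1, 1)
        transpose[i] = scaler.fit_transform(array).reshape(-1)

test_scaled = transpose.transpose()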

# Rebuild a DataFrame from the scaled array, keeping the original column names
cols = test.columns
datatypes = test.dtypes

test_df = pd.DataFrame(test_scaled, columns=cols)

# The array has dtype object, so cast every non-object column back to its original dtype
for col, dtype in datatypes.items():
    if dtype != 'O':
        test_df[col] = test_df[col].astype(dtype)

In [3]:
# Same scaling loop for the validation split (the scaler is refitted per column).
transpose = vali_imputed.transpose()

for i, column in enumerate(transpose):
    if not isinstance(column[0], str):
        array = column.reshape(-1, 1)
        transpose[i] = scaler.fit_transform(array).reshape(-1)

vali_scaled = transpose.transpose()

In [19]:
# Rebuild DataFrames from the scaled arrays, keeping the original column names
cols = train.columns
datatypes = train.dtypes

train_df = pd.DataFrame(train_scaled, columns=cols)
vali_df = pd.DataFrame(vali_scaled, columns=cols)

# The arrays have dtype object, so cast every non-object column back to its original dtype.
# Note: integer columns were scaled above, so this cast truncates their scaled values.
for col, dtype in datatypes.items():
    if dtype != 'O':
        train_df[col] = train_df[col].astype(dtype)
        vali_df[col] = vali_df[col].astype(dtype)

# Separate the features from the target column
X_train = train_df.drop('target', axis=1)
y_train = train_df['target']
X_vali = vali_df.drop('target', axis=1)
y_vali = vali_df['target']

# One-hot encode the categorical columns; align the validation columns with the training columns
train_onehot = pd.get_dummies(train_df)
vali_onehot = pd.get_dummies(vali_df).reindex(columns=train_onehot.columns, fill_value=0)


In [26]:
train_num = train_df.select_dtypes(['number'])
valid_num = vali_df.select_dtypes(['number'])

In [45]:
test_num = test_df.select_dtypes(['number'])

In [17]:
vali_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,93,94,95,96,97,98,99,100,101,target
0,-0.455487,1.530271,C,0.533798,2.699543,-0.683785,-0.304898,-0.041805,0.182059,0.711944,...,-0.685880,-1.670945,-0.126435,V,0.787338,-1.334376,1,-1.156410,0.050476,0
1,-0.800250,-0.657228,C,0.328634,-0.935224,0.584068,-0.366351,-0.725802,2.904398,3.315278,...,0.000816,-0.008565,0.375755,A,-1.047877,0.444585,1,-0.078818,-0.796004,1
2,-0.800250,-0.657228,C,0.328634,-0.935224,0.584068,-0.366351,-0.725802,3.360744,3.988174,...,0.000816,-0.008565,0.375755,BY,-1.047877,0.444585,1,-0.078818,-0.796004,0
3,0.563330,1.118313,C,0.544665,0.212928,-1.873994,1.553644,0.133174,-0.400175,-0.115421,...,-1.781179,0.794111,0.692014,AT,0.388435,-0.940897,0,-0.442542,0.441158,0
4,0.602630,0.790340,C,2.352711,1.067671,1.078103,0.801526,-0.755667,-0.038245,-0.474978,...,1.515184,-1.212162,-0.824331,AG,0.465018,5.574691,3,-1.285495,2.570225,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14516,-0.800250,-0.657228,C,0.328634,-0.935224,0.584068,-0.366351,-0.725802,1.488153,1.284469,...,0.000816,-0.008565,0.375755,BJ,-1.047877,0.444585,0,-0.078818,-0.796004,0
14517,-0.800250,-0.657228,C,0.328634,-0.935224,0.584068,-0.366351,-0.725802,0.166324,0.204182,...,0.000816,-0.008565,0.375755,CG,-1.047877,0.444585,0,-0.078818,-0.796004,0
14518,-0.439200,-0.814075,C,-0.092275,0.441402,0.179724,-0.008464,-0.012484,-0.431646,-0.369674,...,1.387317,-0.002608,-0.353152,BC,2.340939,0.291083,0,-0.186038,-0.430478,0
14519,0.279904,1.330578,C,1.776671,0.254458,0.204819,1.324487,-0.112409,0.921655,0.740127,...,2.166647,0.193639,-1.021590,BL,0.859242,0.731156,0,-0.464892,0.234964,0


In [27]:
train_num

Unnamed: 0,1,2,4,5,6,7,8,9,10,11,...,92,93,94,95,97,98,99,100,101,target
0,-0.804027,-0.708418,-0.641359,0.713120,-0.656371,-0.779224,-0.046616,0.495877,-0.509877,-0.520637,...,0.930517,-0.344857,-0.314412,-0.058565,0.382354,-0.131699,0,-0.494324,-0.811185,0
1,-0.804027,-0.708418,-0.641359,0.713120,-0.656371,-0.779224,-0.046616,-0.416755,-0.883957,-0.520637,...,0.387587,-0.344857,-0.314412,-0.058565,0.382354,-0.131699,0,-0.494324,-0.811185,0
2,2.392942,-1.548836,-2.360166,-0.686210,-0.512330,-1.378592,-0.827800,1.021800,-0.491925,1.485551,...,0.426909,-1.970369,5.592551,-1.958071,-0.012836,-1.695431,0,7.379186,0.279437,0
3,-0.804027,-0.708418,-0.641359,0.713120,-0.656371,-0.779224,-0.046616,1.238357,0.842654,-0.520637,...,0.401070,-0.344857,-0.314412,-0.058565,0.382354,-0.131699,2,-0.494324,-0.811185,0
4,0.341709,-1.062458,-1.689021,-0.655218,1.054341,-1.728540,0.316132,3.295645,2.635402,4.165475,...,-1.852143,-2.046016,0.376581,1.136826,-0.820327,-2.166242,2,3.125428,0.091711,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33876,0.834978,0.354109,0.735307,0.587800,0.183637,-1.526542,0.373999,-0.602375,-0.633870,0.408800,...,-2.118199,-1.946360,-1.148921,1.002518,-2.312056,-1.273795,0,0.714005,1.646079,0
33877,0.382469,1.841819,0.108445,-0.198922,0.470824,-0.340837,0.444170,-0.416755,-0.201483,0.681515,...,-0.113212,-1.178840,-0.522095,0.338015,-0.824094,-0.572929,0,0.432639,0.469628,0
33878,-0.804027,-0.708418,-0.641359,0.713120,-0.656371,-0.779224,-0.046616,-1.252044,-0.918103,-0.520637,...,0.111776,-0.344857,-0.314412,-0.058565,0.382354,-0.131699,0,-0.494324,-0.811185,1
33879,0.586232,-0.912161,-0.507078,0.133900,0.701337,1.185910,-0.265716,1.423977,1.830150,-0.259416,...,-0.138019,0.157905,-0.019627,-0.086892,0.769676,0.879728,2,-0.297428,0.847469,0


In [29]:
valid_num

Unnamed: 0,1,2,4,5,6,7,8,9,10,11,...,92,93,94,95,97,98,99,100,101,target
0,-0.455487,1.530271,0.533798,2.699543,-0.683785,-0.304898,-0.041805,0.182059,0.711944,-0.108279,...,0.475998,-0.685880,-1.670945,-0.126435,0.787338,-1.334376,1,-1.156410,0.050476,0
1,-0.800250,-0.657228,0.328634,-0.935224,0.584068,-0.366351,-0.725802,2.904398,3.315278,0.002214,...,-1.209240,0.000816,-0.008565,0.375755,-1.047877,0.444585,1,-0.078818,-0.796004,1
2,-0.800250,-0.657228,0.328634,-0.935224,0.584068,-0.366351,-0.725802,3.360744,3.988174,0.002214,...,-0.349203,0.000816,-0.008565,0.375755,-1.047877,0.444585,1,-0.078818,-0.796004,0
3,0.563330,1.118313,0.544665,0.212928,-1.873994,1.553644,0.133174,-0.400175,-0.115421,0.150986,...,-1.628246,-1.781179,0.794111,0.692014,0.388435,-0.940897,0,-0.442542,0.441158,0
4,0.602630,0.790340,2.352711,1.067671,1.078103,0.801526,-0.755667,-0.038245,-0.474978,-0.982041,...,0.817777,1.515184,-1.212162,-0.824331,0.465018,5.574691,3,-1.285495,2.570225,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14516,-0.800250,-0.657228,0.328634,-0.935224,0.584068,-0.366351,-0.725802,1.488153,1.284469,0.002214,...,0.650074,0.000816,-0.008565,0.375755,-1.047877,0.444585,0,-0.078818,-0.796004,0
14517,-0.800250,-0.657228,0.328634,-0.935224,0.584068,-0.366351,-0.725802,0.166324,0.204182,0.002214,...,0.350312,0.000816,-0.008565,0.375755,-1.047877,0.444585,0,-0.078818,-0.796004,0
14518,-0.439200,-0.814075,-0.092275,0.441402,0.179724,-0.008464,-0.012484,-0.431646,-0.369674,3.795517,...,-0.372814,1.387317,-0.002608,-0.353152,2.340939,0.291083,0,-0.186038,-0.430478,0
14519,0.279904,1.330578,1.776671,0.254458,0.204819,1.324487,-0.112409,0.921655,0.740127,-0.476544,...,-0.542836,2.166647,0.193639,-1.021590,0.859242,0.731156,0,-0.464892,0.234964,0


In [31]:
######################################   EXERCISE 2   ######################################
'''
Implement the machine learning model called the Support Vector Machine (SVM) and optimize based on the validation accuracy.
'''

# Implement the Support Vector Machine: (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
X_train = train_num.drop('target', axis=1)
y_train = train_num['target']
X_vali = valid_num.drop('target', axis=1)
y_vali = valid_num['target']




In [32]:
X_train_1 = X_train[:100]
y_train_1 = y_train[:100]
X_vali_1 = X_vali[:100]
y_vali_1 = y_vali[:100]

In [38]:
from sklearn.svm import LinearSVC
svclassifier = LinearSVC()
svclassifier.fit(X_train, y_train)




LinearSVC()

In [40]:
# Predict on the validation split
y_vali_pred = svclassifier.predict(X_vali)

In [41]:
from sklearn.metrics import accuracy_score
accuracy_score(y_vali, y_vali_pred)

0.7544934921837338

In [48]:
# Predict on the test set with the same model
y_pred = svclassifier.predict(test_num)


In [49]:
y_pred

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [45]:
train_df.dtypes

0      object
1      object
2      object
3      object
4      object
        ...  
97     object
98     object
99     object
100    object
101    object
Length: 102, dtype: object

In [42]:
train_df = pd.get_dummies(train_df)
vali_df = pd.get_dummies(vali_df)


KeyboardInterrupt



In [None]:
train_final = pd.get_dummies(train_scaled)

In [None]:
######################################   EXERCISE 3   ######################################
'''
Visualize the SVM accuracy results in a graph.
'''

# Visualize the SVM results: (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html)
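One way to approach this exercise is to train the SVM for several values of the regularization parameter `C` and plot the validation accuracy for each. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the brain tumor features, the list `C_values` is an arbitrary choice, and the `Agg` backend line is there only so the code runs without a display (omit it inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the preprocessed features and target (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Arbitrary example grid of regularization strengths.
C_values = [0.001, 0.01, 0.1, 1, 10]
accuracies = []
for C in C_values:
    model = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_va, model.predict(X_va)))

# Line plot of validation accuracy against C on a logarithmic x-axis.
fig, ax = plt.subplots()
ax.plot(C_values, accuracies, marker="o")
ax.set_xscale("log")
ax.set_xlabel("C")
ax.set_ylabel("Validation accuracy")
ax.set_title("LinearSVC validation accuracy for different C")
```

With the real data, `X_tr`, `X_va`, `y_tr` and `y_va` would be replaced by the notebook's `X_train`, `X_vali`, `y_train` and `y_vali`.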



In [None]:
######################################   EXERCISE 4   ######################################
'''
Use a grid search to find the optimal hyperparameters of the SVM, KNN and RandomForest models using 3-fold cross validation.
'''

# Make use of GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# KNN: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
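A grid search over the three model families could be sketched as follows. This is a minimal example on synthetic data, not a tuned solution: the parameter grids in `searches` are hypothetical and deliberately small, and `SVC` is used for the SVM here although the notebook above trains a `LinearSVC`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical, deliberately small grids; widen them for real use.
searches = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "RandomForest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

best = {}
for name, (estimator, grid) in searches.items():
    # cv=3 gives the 3-fold cross validation asked for in the exercise.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Each fitted `search` also exposes `cv_results_`, which holds the mean accuracy for every parameter combination and can feed the visualization in the next exercise.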



In [None]:
######################################   EXERCISE 5   ######################################
'''
Visualize the SVM, KNN and RandomForest accuracy results in a heatmap.
'''

# Make use of a heatmap: https://seaborn.pydata.org/generated/seaborn.heatmap.html
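One possible layout for the heatmap is a table with one row per model and one column per cross-validation fold. The sketch below builds that table with `cross_val_score` on synthetic data and draws it with `matplotlib`'s `imshow` to keep dependencies minimal; this is a substitution made for the sketch, and passing the same `scores` frame to `seaborn.heatmap` would be the route the link above suggests. The models use default hyperparameters, which is an assumption; in practice the best estimators from the grid search would go here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# One row per model, one column per cross-validation fold.
scores = pd.DataFrame(
    {name: cross_val_score(m, X, y, cv=3, scoring="accuracy")
     for name, m in models.items()}
).T
scores.columns = ["fold 1", "fold 2", "fold 3"]

# Draw the table as a heatmap and annotate each cell with its accuracy.
fig, ax = plt.subplots()
image = ax.imshow(scores.to_numpy(), cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(scores.shape[1]))
ax.set_xticklabels(scores.columns)
ax.set_yticks(range(scores.shape[0]))
ax.set_yticklabels(scores.index)
for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        ax.text(j, i, f"{scores.iat[i, j]:.2f}",
                ha="center", va="center", color="white")
fig.colorbar(image, ax=ax, label="accuracy")
```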



In [None]:
######################################   EXERCISE 6   ######################################
'''
Implement a basic Neural Network and visualize the loss and accuracy, both for the test and training dataset.
'''

# Make use of the MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
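To record a loss and accuracy curve per epoch, one option is to call `MLPClassifier.partial_fit` in a loop and evaluate after each pass. The sketch below does this on synthetic data. Because the test set has no target column, it tracks the training and validation splits instead; that substitution, the hidden layer size and the number of epochs are all assumptions of this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the preprocessed features and target (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
history = {"train_loss": [], "vali_loss": [], "train_acc": [], "vali_acc": []}

for epoch in range(30):
    # Each partial_fit call performs one pass over the training data.
    mlp.partial_fit(X_tr, y_tr, classes=np.unique(y))
    history["train_loss"].append(log_loss(y_tr, mlp.predict_proba(X_tr)))
    history["vali_loss"].append(log_loss(y_va, mlp.predict_proba(X_va)))
    history["train_acc"].append(accuracy_score(y_tr, mlp.predict(X_tr)))
    history["vali_acc"].append(accuracy_score(y_va, mlp.predict(X_va)))

# Loss on the left, accuracy on the right, one line per split.
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(history["train_loss"], label="train")
ax_loss.plot(history["vali_loss"], label="validation")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("log loss")
ax_loss.legend()
ax_acc.plot(history["train_acc"], label="train")
ax_acc.plot(history["vali_acc"], label="validation")
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("accuracy")
ax_acc.legend()
```

A growing gap between the two lines in either panel is a common sign of overfitting and a reasonable point to stop training.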



In [None]:
######################################   EXERCISE 7   ######################################
'''
Apply any type of model and preprocessing steps necessary to achieve the highest possible validation accuracy.
'''



In [51]:
######################################   EXERCISE 8   ######################################
'''
Time to predict using your best ML model!
Use the test_brain.csv dataset to predict whether the patients have a brain tumor or not. The target variable has been removed from the dataset.
Save the prediction results in a list with the same order as the patients in the test dataset.
Use the code given below to store the list as a .pkl file and save it using your group number.
Finally, open a pull request on the GitHub repository. We will calculate your final accuracy score.
'''

import pickle

group_number = 3
ypred = y_pred.tolist()  # the exercise asks for a plain list; y_pred is a NumPy array

with open(f'test_predictions_group_{group_number}.pkl', 'wb') as f:
    pickle.dump(ypred, f)
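Before submitting, it can help to load the file back and confirm that it holds a list of the expected length. The sketch below shows the round trip with made-up predictions written to a temporary directory; the values in `predictions` and the temporary path are illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np

group_number = 3                      # same value as in the cell above
predictions = np.array([0, 1, 0, 0])  # made-up predictions for illustration

# Write to a temporary directory so the real submission file is not touched.
path = os.path.join(
    tempfile.mkdtemp(), f"test_predictions_group_{group_number}.pkl"
)
with open(path, "wb") as f:
    pickle.dump(predictions.tolist(), f)  # store a plain Python list

# Load the file back and inspect it.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(type(restored).__name__, len(restored))
```

For the real file, the length of the restored list should equal the number of rows in `test_brain.csv`.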