# Case 2

Adapt the code from the 'Supervised Learning using the SKLearn Library - Notebook 1' notebook to train a decision tree and a random forest classifier on the COVID-19 data set from.

As well as training additional models, we will investigate the impact of changing model hyperparameters and how to perform ROC analysis using SKLearn.

**Tasks:**

2.1 Using 5-fold cross-validation, train a decision tree and a random forest classifier, and report the accuracy and sensitivity of each model.

2.2 How does the performance of these models compare to the Perceptron and Logistic regression? 

2.3 For the random forest model, try changing the n_estimators and max_features, one at a time. By training the model multiple times (using a FOR loop), plot the accuracy as you change these parameters.
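One way to approach Task 2.3 is sketched here. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the COVID-19 set), and the grid of `n_estimators` values and the names `X_demo`, `y_demo` and `mean_accuracy` are arbitrary choices rather than anything prescribed by the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in the prepared COVID-19 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

n_estimators_grid = [1, 5, 10, 25, 50]  # arbitrary illustrative values
mean_accuracy = []
for n in n_estimators_grid:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 folds for this setting
    fold_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='accuracy')
    mean_accuracy.append(fold_scores.mean())

# Plot accuracy against the hyperparameter being varied
fig, ax = plt.subplots()
ax.plot(n_estimators_grid, mean_accuracy, marker='o')
ax.set_xlabel('n_estimators')
ax.set_ylabel('mean 5-fold accuracy')
```

The same loop should work for `max_features` by holding `n_estimators` fixed and iterating over integer values from 1 up to the number of feature columns.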

In [141]:
# data import for exercises
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.read_csv('data_cvd.csv', na_values='NA')

Import the data set. In the next cell I remove the relatively few rows in which there is no age (alternatively, these could have been set to -1 or some other placeholder value).

In [145]:
# models that will be used in this exercise
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# drop the rows in which age is missing
data = data.dropna(subset=['age'])

We wish to predict death, so assign the 'death' column to a new variable, y, and remove that column from the prediction features, X.

In [146]:
y = data['death']

In [147]:
X = data.drop(columns=['death'])

In [148]:
# Split data into 50% train and 50% test subsets
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

There is a lot of missing data here. The scheme we are going to use is to treat missingness (NaN) as another category - this may or may not be a good idea. Further exploratory data analysis would help.

As a shortcut example, we are just going to consider gender and age. We have already removed all age NaNs, and we will treat unknown gender as its own category.

In [150]:
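# select columns with a list (not a set) so that the column order is fixed: age first, gender second
Xx_train = X_train[['age', 'gender']]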

In [151]:
Xx_train = Xx_train.to_numpy()

In [157]:
Xx_train

array([[66.0, 1],
       [56.0, 0],
       [46.0, 1],
       [60.0, 0],
       [58.0, 1],
       [44.0, 0],
       [34.0, 1],
       [37.0, 1],
       [39.0, 1],
       [56.0, 1],
       [18.0, 0],
       [32.0, 0],
       [37.0, 1],
       [51.0, 1],
       [57.0, 1],
       [56.0, 1],
       [50.0, 1],
       [52.0, 0],
       [33.0, 1],
       [40.0, 1],
       [70.0, 1],
       [51.0, 1],
       [28.0, 0],
       [37.0, 1],
       [19.0, 1],
       [29.0, 1],
       [66.0, 0],
       [46.0, 1],
       [32.0, 0],
       [28.0, 1],
       [55.0, 1],
       [68.0, 1],
       [38.0, 1],
       [72.0, 1],
       [45.0, 1],
       [42.0, 1],
       [33.0, 0],
       [33.0, 0],
       [37.0, 1],
       [69.0, 1],
       [63.0, 1],
       [62.0, 0],
       [49.0, 1],
       [50.0, 1],
       [48.0, 1],
       [36.0, 1],
       [36.0, 0],
       [61.0, 1],
       [69.0, 1],
       [89.0, 1],
       [89.0, 1],
       [66.0, 1],
       [75.0, 1],
       [48.0, 0],
       [82.0, 1],
       [66

Column 2 (gender) contains 'M', 'F' and NaNs. We want to convert this column into categories. To do this we will use LabelEncoder, which requires data of a single type. NaN is stored as a float, so we need to explicitly convert everything to strings; NaN then becomes the string 'nan' and is encoded as a category of its own.
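To see what the string conversion does, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from data_cvd.csv; `gender_raw`, `encoder` and `gender_codes` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy gender column (hypothetical values) mixing strings and float NaN
gender_raw = np.array(['M', 'F', np.nan, 'M', np.nan], dtype=object)

encoder = LabelEncoder()
# astype(str) turns the float NaN into the string 'nan', giving a single dtype
gender_codes = encoder.fit_transform(gender_raw.astype(str))

print(encoder.classes_)  # categories, in sorted order
print(gender_codes)      # integer code for each row
```

LabelEncoder assigns codes in sorted order of the string labels, so here 'F' maps to 0, 'M' to 1 and 'nan' to 2. The integers are arbitrary labels; a tree-based model can split on them, but they should not be read as an ordering.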

In [153]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [154]:
gender = le.fit_transform(Xx_train[:,1].astype(str))

Replace the old gender column with the encoded one, in which NaN is now a category of its own.

In [155]:
Xx_train[:,1] = gender

In [156]:
y_train = y_train.to_numpy()

In [158]:
Xx_train[:,0]

array([66.0, 56.0, 46.0, 60.0, 58.0, 44.0, 34.0, 37.0, 39.0, 56.0, 18.0,
       32.0, 37.0, 51.0, 57.0, 56.0, 50.0, 52.0, 33.0, 40.0, 70.0, 51.0,
       28.0, 37.0, 19.0, 29.0, 66.0, 46.0, 32.0, 28.0, 55.0, 68.0, 38.0,
       72.0, 45.0, 42.0, 33.0, 33.0, 37.0, 69.0, 63.0, 62.0, 49.0, 50.0,
       48.0, 36.0, 36.0, 61.0, 69.0, 89.0, 89.0, 66.0, 75.0, 48.0, 82.0,
       66.0, 81.0, 82.0, 65.0, 80.0, 53.0, 86.0, 70.0, 84.0, 50.0, 40.0,
       45.0, 66.0, 59.0, 23.0, 50.0, 43.0, 49.0, 42.0, 32.0, 22.0, 47.0,
       52.0, 53.0, 46.0, 85.0, 69.0, 36.0, 73.0, 70.0, 81.0, 65.0, 42.0,
       30.0, 29.0, 49.0, 23.0, 56.0, 39.0, 39.0, 34.0, 49.0, 70.0, 76.0,
       72.0, 79.0, 55.0, 87.0, 66.0, 58.0, 66.0, 78.0, 67.0, 65.0, 58.0,
       67.0, 82.0, 49.0, 2.0, 59.0, 57.0, 68.0, 40.0, 46.0, 56.0, 29.0,
       29.0, 57.0, 30.0, 33.0, 20.0, 24.0, 28.0, 41.0, 46.0, 45.0, 9.0,
       40.0, 28.0, 27.0, 15.0, 42.0, 24.0, 32.0, 34.0, 63.0, 58.0, 49.0,
       33.0, 55.0, 79.0, 19.0, 58.0, 39.0, 21.0, 30.0

Now train the decision tree:

In [159]:
model = DecisionTreeClassifier()
model.fit(Xx_train,y_train)

DecisionTreeClassifier()
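The fit above uses the whole training subset once. For the 5-fold cross-validation asked for in Task 2.1, something along the following lines might be used. It is a sketch on synthetic data (an assumption, standing in for `Xx_train` and `y_train`), and it takes sensitivity to mean recall on the positive class, which is what the `'recall'` scorer reports for binary labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (Xx_train, y_train): two features, binary outcome
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_summary = {}
for name, clf in models.items():
    # 5-fold cross-validation, scoring accuracy and recall (sensitivity) on each fold
    scores = cross_validate(clf, X_demo, y_demo, cv=5, scoring=['accuracy', 'recall'])
    cv_summary[name] = {
        'accuracy': scores['test_accuracy'].mean(),
        'sensitivity': scores['test_recall'].mean(),
    }
    print(name, cv_summary[name])
```

When the positive class is rare, accuracy can look high while sensitivity stays low, which is why the task asks for both.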

Note that this template can be used to deal with the other variables with NaNs. In real life, we would want to consider whether it is even worth converting into categories. For many of the variables, most of the rows contain NaNs, and the remaining non-NaN rows may not be correlated with the outcome of interest.
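A quick way to judge whether a column is worth encoding is to look at how much of it is missing. The sketch below uses a small made-up DataFrame; the column names and the 0.5 threshold are illustrative assumptions, not properties of data_cvd.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    'age': [66, 56, np.nan, 60],
    'gender': ['M', np.nan, 'F', 'M'],
    'symptom': [np.nan, np.nan, np.nan, 'fever'],
})

# Fraction of missing entries per column, most incomplete first
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns that are mostly empty are candidates to drop rather than encode
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()
print(mostly_missing)
```

On the real data, the same two lines applied to `data` would give a first indication of which variables carry enough information to be worth the encoding step.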