# Classifying hospitalisation cases

The data for this exercise covers 20000+ knee/hip operations, each described by 32 input features, most of which are boolean or categorical. The challenge is to predict which patients will have complications after surgery.

The data is already divided into **training** (18104 cases) and **test** (3915 cases) sets in a way that preserves the chronological order, i.e. the training set lies before the test set in time. The features are labelled x0-x31 to ensure anonymity; their true meaning is not essential for the exercise.

The real challenge in this dataset is the **imbalance between the two classes**: fortunately, only about 5% of patients experience complications. This requires you to think carefully about your loss function.
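
To make the role of the loss function concrete, here is a minimal sketch of a class-weighted binary cross-entropy, which is one common way (not the only one) to handle imbalance. The function name `weighted_log_loss`, the parameter `pos_weight`, and the synthetic labels are assumptions made for illustration; nothing here uses the hospital data.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Binary cross-entropy where each positive case counts pos_weight times.

    Hypothetical helper for illustration; pos_weight=1 gives the ordinary log loss.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    w = np.where(y == 1, pos_weight, 1.0)                         # per-case weight
    losses = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return np.sum(w * losses) / np.sum(w)

# Synthetic labels with roughly 5% positives (assumption: mimics the class ratio only)
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.05).astype(int)

# A "lazy" model that predicts the base rate for every case
p_const = np.full(y_demo.shape, y_demo.mean())

# Weight chosen so that both classes carry the same total weight
pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()

loss_plain = weighted_log_loss(y_demo, p_const)
loss_weighted = weighted_log_loss(y_demo, p_const, pos_weight=pos_weight)
print(f"unweighted: {loss_plain:.3f}   weighted: {loss_weighted:.3f}")
```

The lazy model looks good under the unweighted loss because the many negatives dominate the average, while the weighted loss penalises it for never separating out the rare positives. Whether such a weighting is appropriate depends on what the predictions will be used for.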


### Questions:
See if you can answer the following questions:<br>
1) What loss function do you use? Is this "reasonable" given that the dataset is unbalanced?<br>
2) Imagine that the doctors wanted the largest possible fraction of positives (i.e. patients with complications) among the 20% of patients ranked most likely to be positive,<br> simply because the ward can give special treatment to only 20% of patients. Would that affect your choice of loss function?<br>
3) How do you deal with the missing information?<br>
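
The quantity described in question 2 can be computed directly from a model's scores. The following is a minimal sketch; the helper name `positive_fraction_in_top` and the toy arrays are hypothetical and only illustrate the ranking-based metric, not a recommended loss.

```python
import numpy as np

def positive_fraction_in_top(y_true, scores, fraction=0.20):
    """Fraction of true positives among the `fraction` highest-scored cases.

    Hypothetical helper for illustration only.
    """
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    n_top = max(1, int(round(fraction * len(s))))    # number of cases the ward can treat
    top_idx = np.argsort(-s, kind="stable")[:n_top]  # indices of the highest scores
    return y[top_idx].mean()

# Toy example (made-up numbers): 10 patients, 2 of whom have complications
y_toy = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
s_toy = np.array([0.10, 0.20, 0.90, 0.05, 0.30, 0.15, 0.25, 0.40, 0.12, 0.08])

frac = positive_fraction_in_top(y_toy, s_toy, fraction=0.20)
print("Fraction of positives in top 20%:", frac)
```

Note that this metric depends only on the ordering of the scores, not on their calibration, which is worth keeping in mind when comparing it to a loss such as cross-entropy.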

***

* Authors: Troels C. Petersen (NBI) & Christian Michelsen (NBI)
* Email:  petersen@nbi.dk
* Date:   6th of May 2022 (latest version)

In [3]:
from __future__ import print_function, division   # Ensures Python3 printing & division standard
import math
from array import array

import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# For better looking text on figures (usetex=True requires a working LaTeX installation)
from matplotlib import rc
rc('text', usetex=True)
rc('font', family='serif')

In [6]:
# Read in the data and prepare the target y. The y-files also contain an ML and a
# Logistic Regression (LR) answer, which you can consider "competitor algorithms":
X_train = pd.read_csv("X_train.csv")
df_y = pd.read_csv("y_train.csv")
y_train = df_y["y"]

X_test = pd.read_csv("X_test.csv")
df_y = pd.read_csv("y_test.csv")
y_test = df_y["y"]

print(X_train, y_train)

       x0  x1     x2    x3   x4  x5  x6  x7  x8  x9  ...  x22  x23  x24  \
0       0   0  162.0  66.0  7.9   1   0   0   0   0  ...    0    1   60   
1       0   0  168.0  67.0  9.0   0   0   0   0   0  ...    0    1   76   
2       1   0  167.0  76.0  9.1   0   0   0   0   0  ...    0    1   73   
3       0   1  175.0  65.0  9.0   0   0   0   0   0  ...    0    0   69   
4       1   0  184.0  88.0  9.6   0   0   0   0   0  ...    0    0   63   
...    ..  ..    ...   ...  ...  ..  ..  ..  ..  ..  ...  ...  ...  ...   
18098   1   0  180.0  80.0  8.9   0   1   0   0   1  ...    0    0   56   
18099   1   0  165.0  90.0  8.6   0   0   0   0   1  ...    0    0   65   
18100   0   0  175.0  75.0  8.2   0   0   0   0   1  ...    1    0   54   
18101   1   0  184.0  88.0  9.0   0   0   1   0   1  ...    0    0   79   
18102   0   0  170.0  56.0  8.8   0   0   0   0   0  ...    0    0   69   

             x25  x26  x27   x28  x29  x30  x31  
0      25.148605    0    0  2014    1    3    2  