# Classifying patients before hip/knee surgery

The data for this exercise comes from about 20000 hip/knee surgeries performed over a period of a few years. The data from the first years constitutes the training set, while the last year is used for testing. In this way, one also preserves the "causal" order, i.e. one does not use information from the "future" to predict the "past".

For training, there are two datasets:
* X_train.csv, i.e. the input features (x0-x31, i.e. 32 in total).
* y_train.csv, i.e. the target class (y: 0/1, along with y_ML and y_LR from ML and Logistic Regression, respectively).

The input features have been renamed X0, X1, etc. in order to keep the data as anonymous as possible, and it should be stressed that the data is **owned** by a research group in Denmark and is not for general use. The outcome/target is whether there were (i.e. are expected to be) medical complications, which is very useful to know **ahead** of the surgery. This is contained in the separate file "y_train.csv", and the purpose is to predict this. Here, also two "competitor" algorithm predictions (y_ML and y_LR) are included for comparison.

The data has some missing values and other flaws to consider before applying any further data analysis, but this should be a minor issue. A larger concern is that the data is **unbalanced**: medical complications are rare and only occur in about 5% of the cases! That is of course good in the real world, but a challenge when trying to train on it.
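Before modelling, it can help to quantify both issues. The following is a minimal sketch on a small synthetic stand-in (`X_demo`, `y_demo` and their values are invented for illustration and are NOT the real data); it shows how one might measure the missing fraction per feature, the positive rate, and "balanced" class weights, and how a simple median imputation could look. Median imputation is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train / y_train (assumption: NOT the real data).
rng = np.random.default_rng(seed=0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x0", "x1", "x2"])
X_demo.loc[rng.random(1000) < 0.10, "x1"] = np.nan      # roughly 10% missing in x1
y_demo = pd.Series((rng.random(1000) < 0.05).astype(int), name="y")

# Fraction of missing values per feature:
missing_fraction = X_demo.isna().mean()

# Fraction of positive cases (complications):
positive_rate = y_demo.mean()

# "Balanced" class weights: n_samples / (n_classes * n_samples_in_class)
counts = y_demo.value_counts()
class_weight = {int(c): len(y_demo) / (2 * n) for c, n in counts.items()}

# One simple option for missing values: fill with the per-feature median.
X_imputed = X_demo.fillna(X_demo.median())

print(missing_fraction)
print(f"Positive rate: {positive_rate:.3f}")
print(class_weight)
```

On the real data, the same lines would be applied to `X_train` and `y_train` instead of the synthetic frames.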

### Questions:
Try to see if you can answer the following questions:<br>
1) What loss function do you use?<br>
2) How do you counter the problem with the unbalanced data set?<br>
3) How do you deal with the missing values? Do you impute?<br>
4) How well can you classify the patients at risk of medical complications?<br>
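As a starting point for questions 1 and 2, one common option is a binary cross-entropy in which the rare positive class is up-weighted. The sketch below is only an illustration: the function name `weighted_log_loss` and the parameter `pos_weight` are hypothetical, and a weight of about 19 (i.e. 0.95/0.05) is merely what the roughly 5% complication rate would suggest, not a tuned value.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Binary cross-entropy where positive cases count pos_weight times as much."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0):
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(pos_weight * y_true * np.log(p)
                   + (1.0 - y_true) * np.log(1.0 - p))
    return per_sample.mean()

# With ~5% positives, a weight near 0.95/0.05 = 19 makes both classes
# contribute comparably to the total loss (assumed value, for illustration).
loss_unweighted = weighted_log_loss([1, 0, 0], [0.1, 0.1, 0.1])
loss_weighted = weighted_log_loss([1, 0, 0], [0.1, 0.1, 0.1], pos_weight=19.0)
print(loss_unweighted, loss_weighted)
```

Many libraries offer an equivalent mechanism through per-class or per-sample weights, so writing the loss by hand is rarely necessary; the point here is only to show what the weighting does.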

The hospitals estimate that they can keep about 20% of the patients in hospital, and the goal is really to keep the largest possible number of patients with medical complications in hospital, while accepting that most of the patients kept will not have such complications, even if they are at risk. This information should influence your choice of loss function.
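The 20% capacity suggests a natural figure of merit: among the 20% of patients with the highest predicted risk, what fraction of all complications is captured? A minimal sketch of this is given below; the function name `capture_rate_at_fraction` and the toy labels and scores are made up for illustration.

```python
import numpy as np

def capture_rate_at_fraction(y_true, y_score, fraction=0.20):
    """Fraction of all positives found among the top `fraction` highest scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Number of patients that can be kept (at least one):
    n_keep = max(1, int(round(fraction * len(y_true))))
    # Indices of the n_keep highest scores:
    top = np.argsort(-y_score, kind="stable")[:n_keep]
    n_pos = y_true.sum()
    return y_true[top].sum() / n_pos if n_pos > 0 else float("nan")

# Toy example (invented numbers): 10 patients, 2 complications, 2 kept.
y_toy = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
score_toy = [0.9, 0.1, 0.2, 0.3, 0.05, 0.8, 0.15, 0.25, 0.35, 0.4]
rate_toy = capture_rate_at_fraction(y_toy, score_toy)
print(rate_toy)
```

If the `y_ML` and `y_LR` columns hold continuous scores (an assumption, since their format is not described here), the same function could be applied to them to compare against your own model.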

***

* Authors: Troels C. Petersen (NBI) & Christian Michelsen (NBI)
* Email:  petersen@nbi.dk
* Date:   5th of May 2021 (latest version)

In [None]:
import pandas as pd

# Input features (X):
X_train = pd.read_csv("X_train.csv")

# Target values (0/1), along with two "competitor" algorithm predictions (for comparison):
df_y = pd.read_csv("y_train.csv")
y_train = df_y["y"]

# In the end, you want to apply your best model to the test data set:
X_test = pd.read_csv("X_test.csv")