# Classifying patients before hip/knee surgery

The data for this exercise comes from 20000+ hip/knee surgeries performed over a period of a few years. The surgeries from the first years make up the training set, while those from the last year are used for testing. This split preserves the "causal" order, i.e. no information from the "future" is used to predict the "past".
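
The split is already done in the files provided, so nothing below is needed for the exercise. Purely to illustrate the idea of a chronological split, here is a minimal sketch. It assumes a hypothetical `year` column, which is not part of the anonymised data, and the years and values are made up:

```python
import pandas as pd

def chronological_split(df, time_col, first_test_period):
    """Split so that every training row is strictly earlier than every test row."""
    train = df[df[time_col] < first_test_period]
    test = df[df[time_col] >= first_test_period]
    return train, test

# Made-up toy table: the "year" column and all values are hypothetical.
demo = pd.DataFrame({
    "year": [2016, 2017, 2018, 2019, 2019, 2020],
    "x0":   [0.1, 0.4, 0.3, 0.9, 0.2, 0.7],
})

train_demo, test_demo = chronological_split(demo, "year", first_test_period=2020)
print(len(train_demo), "training rows,", len(test_demo), "test rows")
```

A random shuffle before splitting would instead mix later surgeries into the training set, which is exactly what the "causal" order is meant to avoid.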

The data is already divided into **training** (18104 cases) and **test** (3915 cases) sets. For training, there are two datasets:
* X_train.csv, i.e. the input features (x0-x31, i.e. 32 in total).
* y_train.csv, i.e. the target class (y: 0/1, along with y_ML and y_LR, the predictions from an ML algorithm and a Logistic Regression, respectively).

The input features have been renamed X0, X1, etc. in order to keep the data as anonymous as possible, and it should be stressed that the data is **owned** by a research group in Denmark and is not for general use. The outcome/target is whether there were (i.e. are expected to be) medical complications, which is very useful to know ahead of the surgery. The target is contained in the separate file "y_train", and the purpose is to predict it. Two "competitor" algorithm predictions have also been included there for comparison.

The data has some missing values and other flaws to consider before applying any further data analysis, but this should be a minor issue. A larger concern is that the data is **unbalanced**: medical complications are rare (fortunately!) and only occur in about 5% of the cases. That is of course good in the real world, but a challenge when training a classifier on it.
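
Before choosing a strategy, it can help to quantify both issues. The sketch below uses a small synthetic stand-in (random numbers with about 10% missing values in one column and about 5% positive cases), not the real data. The helper name `summarize_data_quality` is just an illustration, and the same call should work on `X_train` and `y_train` once they are loaded:

```python
import numpy as np
import pandas as pd

def summarize_data_quality(X, y):
    """Return the fraction of missing values per column and the fraction of positive cases."""
    missing_fraction = X.isna().mean().sort_values(ascending=False)
    positive_fraction = float(np.mean(y))
    return missing_fraction, positive_fraction

# Synthetic stand-in for the real data (all values here are made up).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["x0", "x1", "x2", "x3"])
X_demo.loc[rng.random(1000) < 0.10, "x2"] = np.nan        # ~10% missing in one column
y_demo = pd.Series((rng.random(1000) < 0.05).astype(int))  # ~5% positive cases

missing_fraction, positive_fraction = summarize_data_quality(X_demo, y_demo)
print(missing_fraction)
print(f"Fraction of positive cases: {positive_fraction:.3f}")
```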

### Questions:
Try to see if you can answer the following questions:<br>
1) What loss function do you use?<br>
2) How do you counter the problem of the unbalanced data set?<br>
3) How do you deal with the missing values? Do you impute?<br>
4) How well can you classify the patients at risk of medical complications?<br>
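
As a possible starting point for questions 1-3 (one option among several, not the intended solution), the sketch below shows a class-weighted log loss and a simple median imputation in plain NumPy. The function names, the weight `n_neg / n_pos`, and the toy numbers are all assumptions made for illustration:

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, pos_weight):
    """Binary cross-entropy in which each positive case counts pos_weight times."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    weights = np.where(y_true == 1, pos_weight, 1.0)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(weights * losses) / np.sum(weights))

def impute_median(X_fit, X_apply):
    """Fill NaNs in X_apply with column medians computed on X_fit only."""
    medians = np.nanmedian(X_fit, axis=0)
    was_missing = np.isnan(X_apply)
    return np.where(was_missing, medians, X_apply), was_missing

# Toy example with 5% positives: a constant prediction of 0.05 looks fine
# unweighted, but is penalised much harder once positives are up-weighted.
y_demo = np.array([0] * 19 + [1])
p_demo = np.full(20, 0.05)
pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()   # here 19
print("unweighted:", weighted_log_loss(y_demo, p_demo, 1.0))
print("weighted:  ", weighted_log_loss(y_demo, p_demo, pos_weight))

# Medians come from the "training" array only, then fill the other array.
X_tr = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_new = np.array([[np.nan, 1.0], [2.0, np.nan]])
X_filled, was_missing = impute_median(X_tr, X_new)
print(X_filled)
```

Computing the medians on the training set only keeps test information out of the preprocessing. The `was_missing` mask can be kept as extra input features if missingness itself turns out to be informative.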

The hospitals estimate that they can keep about 20% of the patients in hospital. The goal is really to keep the largest possible number of patients with medical complications there, while accepting that most of the patients kept will not have such complications, even if they are at risk. This information should influence your choice of loss function.
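
One way to turn this constraint into a number is the fraction of all complications that fall among the 20% of patients with the highest predicted risk. The sketch below is an assumption about how to measure this, not a prescribed metric, and the name `capture_rate_at_budget` and the toy scores are made up:

```python
import numpy as np

def capture_rate_at_budget(y_true, scores, budget=0.20):
    """Fraction of positive cases among the top `budget` fraction of scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_keep = int(np.floor(budget * len(scores)))   # number of patients that can be kept
    kept = np.argsort(-scores, kind="stable")[:n_keep]
    n_pos = y_true.sum()
    if n_pos == 0:
        return float("nan")                        # undefined without positive cases
    return float(y_true[kept].sum() / n_pos)

# Toy example: 10 patients, 2 complications, so 2 patients can be kept.
y_toy = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
good_scores = np.array([0.90, 0.10, 0.20, 0.30, 0.40, 0.80, 0.50, 0.05, 0.06, 0.07])
poor_scores = np.array([0.90, 0.10, 0.20, 0.30, 0.40, 0.01, 0.50, 0.05, 0.06, 0.07])

print(capture_rate_at_budget(y_toy, good_scores))  # both complications are kept
print(capture_rate_at_budget(y_toy, poor_scores))  # only one of the two is kept
```

For comparison, keeping a random 20% of the patients would on average capture about 20% of the complications, so that is the baseline any classifier should beat.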

***

* Author: Troels C. Petersen (NBI) & Christian Michelsen (NBI)
* Email:  petersen@nbi.dk
* Date:   6th of May 2022 (latest version)

In [None]:
import math
from array import array

import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# For better looking text on figures (usetex=True requires a working LaTeX installation;
# set it to False if figures fail to render)
from matplotlib import rc
rc('text', usetex=True)
rc('font', family='serif')

In [None]:
# Read in the data and prepare the target y. The y-files also contain an ML and a Logistic Regression (LR)
# prediction, which you can consider "competitor algorithms":
X_train = pd.read_csv("X_train.csv")
df_y = pd.read_csv("y_train.csv")
y_train = df_y["y"]

X_test = pd.read_csv("X_test.csv")
df_y = pd.read_csv("y_test.csv")
y_test = df_y["y"]

print(X_train, y_train)