# Mortality prediction with the PBCSeq dataset
## M2 Données Massives en Santé

This notebook is an application of the 3-hour crash course on machine learning for longitudinal data analysis.

## 1. Introduction

The goal of this notebook is to predict the mortality of patients based on longitudinal measurements. Given a set of measurements, we wish to classify each patient as $1$ (deceased) or $0$ (survived) by minimizing a convex surrogate of the binary classification error.
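To make "convex surrogate" concrete, here is a minimal sketch comparing the 0-1 classification error with the logistic loss, one common surrogate. The function names `zero_one_loss` and `logistic_loss` are illustrative and are not part of the course material. The base-2 scaling is a choice made here so that the surrogate upper-bounds the 0-1 loss everywhere.

```python
import numpy as np

def zero_one_loss(y, score):
    """0-1 loss for labels y in {0, 1}; the rule predicts 1 when score > 0."""
    y_signed = 2 * np.asarray(y) - 1              # map {0, 1} to {-1, +1}
    return (y_signed * np.asarray(score) <= 0).astype(float)

def logistic_loss(y, score):
    """Logistic loss log2(1 + exp(-margin)), convex in the score."""
    y_signed = 2 * np.asarray(y) - 1
    margin = y_signed * np.asarray(score)
    # logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way
    return np.logaddexp(0.0, -margin) / np.log(2.0)

scores = np.linspace(-3.0, 3.0, 13)               # a grid of real-valued scores
labels = np.ones_like(scores, dtype=int)          # pretend every patient is labelled 1

surrogate = logistic_loss(labels, scores)
error = zero_one_loss(labels, scores)
```

Unlike the 0-1 loss, which is piecewise constant, the surrogate is smooth and convex in the score, so it can be minimized with gradient-based methods.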

The PBCSeq dataset is a public dataset. A full description can be found [here](https://openml.org/d/516). Patients are identified using the variable `case_number`. For every patient there are at least two measurement dates.

## 2. Goals

In this practical session, you should:

1. Understand the dataset, then clean and pre-process it.
2. Compute descriptive statistics of the PBCSeq dataset and visualize them.
3. Perform feature extraction on the dataset.
4. Use these features to predict the mortality of patients using different penalized prediction algorithms (logistic regression, random forests, SVM, basic deep learning algorithms).
    
To state task $4$ more precisely, we wish to answer the following question: given a timespan $\varepsilon$ starting at the first observation date, what is the probability that a patient dies within this timespan, given all the data collected on them during this period? This can be cast as a binary classification task: your classifier should output a probability of death, and your classification rule should classify the patient as $1$ if the predicted death probability is greater than $1/2$.
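One possible way to turn this question into labels is sketched below. It rests on several assumptions that the text above does not confirm: that `status == 2` encodes death, that `number_of_days` is the follow-up time counted from the first observation, and that `day` has been converted to a numeric visit day. The helper `build_window_labels` and the toy values are hypothetical.

```python
import pandas as pd

def build_window_labels(df, eps, death_code=2):
    """Return (measurements taken within the first `eps` days, one 0/1 label per patient)."""
    # number_of_days and status are constant per patient, so the first row is enough
    per_patient = df.groupby("case_number")[["number_of_days", "status"]].first()
    died = per_patient["status"] == death_code     # assumption: 2 means deceased
    followed = per_patient["number_of_days"]
    # Label 1: death observed inside the window
    label = (died & (followed <= eps)).astype(int)
    # Patients who leave the study alive before eps have an unknown outcome: drop them
    known = (died & (followed <= eps)) | (followed > eps)
    label = label[known]
    # Keep only the measurements collected during the window for the retained patients
    window = df[(df["day"] <= eps) & df["case_number"].isin(label.index)]
    return window, label

# Toy data: patients 1 and 2 mimic the preview rows, patient 3 is made up
toy = pd.DataFrame({
    "case_number":    [1, 1, 2, 2, 2, 3],
    "number_of_days": [400, 400, 5169, 5169, 5169, 300],
    "status":         [2, 2, 0, 0, 0, 0],
    "day":            [0, 192, 0, 182, 768, 0],
})
window, y = build_window_labels(toy, eps=500)
```

Since `number_of_days` and `status` define the label, they should not be used as input features. With a fitted classifier exposing predicted probabilities, the decision rule described above amounts to predicting $1$ whenever the predicted probability of death exceeds $0.5$.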

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

## Load data

In [10]:
# Read the CSV file
df = pd.read_csv("./pbcseq.csv")
print("There are %s lines and %s columns in this dataset." % df.shape)
# Count distinct patient identifiers rather than relying on the largest one
print("There are %s patients in this dataset." % df['case_number'].nunique())

# Display the first 40 rows
df.head(40)

There are 1945 lines and 19 columns in this dataset.
There are 312 patients in this dataset.


Unnamed: 0,case_number,number_of_days,status,drug,age,sex,day,presence_of_asictes,presence_of_hepatomegaly,presence_of_spiders,presence_of_edema,serum_bilirubin,serum_cholesterol,albumin,alkaline_phosphatase,SGOT,platelets,prothrombin_time,histologic_stage_of_disease
0,1,400,2,D-penicillamine,21464,female,no,yes,yes,1,1.0,14.5,261,2.6,1718,138.0,190,12.2,4
1,1,400,2,D-penicillamine,21464,female,192,yes,yes,1,1.0,21.3,?,2.94,1612,6.2,183,11.2,4
2,2,5169,0,D-penicillamine,20617,female,no,no,yes,1,0.0,1.1,302,4.14,7395,113.5,221,10.6,3
3,2,5169,0,D-penicillamine,20617,female,182,no,yes,1,0.0,0.8,?,3.6,2107,139.5,188,11.0,3
4,2,5169,0,D-penicillamine,20617,female,365,no,yes,1,0.0,1.0,?,3.55,1711,144.2,161,11.6,3
5,2,5169,0,D-penicillamine,20617,female,768,no,yes,1,0.0,1.9,?,3.92,1365,144.2,122,10.6,3
6,2,5169,0,D-penicillamine,20617,female,1790,yes,yes,1,0.5,2.6,230,3.32,1110,131.8,135,11.3,3
7,2,5169,0,D-penicillamine,20617,female,2151,yes,yes,1,1.0,3.6,?,2.92,996,131.8,100,11.5,3
8,2,5169,0,D-penicillamine,20617,female,2515,yes,yes,1,1.0,4.2,?,2.73,860,145.7,103,11.5,3
9,2,5169,0,D-penicillamine,20617,female,2882,yes,yes,1,1.0,3.6,244,2.8,779,119.0,113,11.5,3
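
The preview shows `?` entries in `serum_cholesterol`. Assuming these mark missing measurements (the preview suggests it, but it is not confirmed here), a first cleaning step could be to let pandas parse them as `NaN` so that the column becomes numeric. The sketch below uses two rows copied from the preview, restricted to a few columns for brevity.

```python
import io
import pandas as pd

# Two rows of patient 1 taken from the preview above, with a subset of the columns
sample_csv = (
    "case_number,serum_bilirubin,serum_cholesterol,albumin\n"
    "1,14.5,261,2.6\n"
    "1,21.3,?,2.94\n"
)

# Default parsing keeps "?" as text, so the whole column is read as strings
raw = pd.read_csv(io.StringIO(sample_csv))

# Declaring "?" as a missing-value marker yields a numeric column with NaN
clean = pd.read_csv(io.StringIO(sample_csv), na_values="?")

# Number of missing values per column
missing_counts = clean.isna().sum()
```

The same `na_values="?"` argument can be passed when reading `./pbcseq.csv`. How to handle the resulting gaps (imputation, carrying the last observation forward, or dropping the column) is a modelling choice left to the exercise.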
