# Predicting Time-To-Event for a Breast Cancer Data Set

This notebook walks through predicting the time-to-event for a breast cancer dataset. The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/16/breast+cancer+wisconsin+prognostic), and a formatted version is available in the GitHub repository.

The dataset contains numerous patient characteristics and an indicator of whether or not the cancer has recurred. The task is to predict the time-to-event for the patients.

The main tasks are: 

- Load the dataset & explore the data
- Plot Kaplan-Meier survival curves
- Fit a Cox proportional hazards model
- Implement a DeepSurv model
- Compare model performance

## Task 1: Load the dataset & explore the data

The dataset is available in the GitHub repository. Load it and inspect the features listed below. Pick a few features and check whether they appear to be associated with recurrence.

In [None]:
available_columns = [
    "id",
    "outcome",
    "time",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave_points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave_points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave_points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
    "tumor_size",
    "lymph_node_status",
]
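A minimal sketch of the loading and exploration step, assuming the formatted file is a CSV readable by `pandas`. The `path` argument is a placeholder for wherever the file lives in the repository, the helper names `load_dataset` and `compare_by_outcome` are illustrative, and the toy frame below is made-up data used only so the example runs on its own. It also assumes `outcome` is coded 1 for recurrence and 0 otherwise.

```python
import pandas as pd


def load_dataset(path, columns=None):
    """Read the formatted CSV; `path` is a placeholder, not a known file name."""
    df = pd.read_csv(path)
    if columns is not None:
        missing = [c for c in columns if c not in df.columns]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        df = df[columns]
    return df


def compare_by_outcome(df, features, outcome_col="outcome"):
    """Median of each feature, split by the recurrence indicator."""
    return df.groupby(outcome_col)[features].median().T


# Toy stand-in for the real data (values are invented for illustration).
toy = pd.DataFrame(
    {
        "outcome": [0, 0, 1, 1],  # assumed coding: 1 = recurred
        "time": [60.0, 45.0, 12.0, 20.0],
        "tumor_size": [1.5, 2.0, 4.0, 3.0],
        "lymph_node_status": [0, 1, 6, 3],
    }
)

summary = compare_by_outcome(toy, ["tumor_size", "lymph_node_status"])
print(summary)
```

On the real data, `df.describe()` and `df["outcome"].value_counts()` are reasonable first checks before comparing groups.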

## Task 2: Generate Kaplan-Meier Survival Curves

Kaplan-Meier survival curves are a good way to visualize survival data. Generate the Kaplan-Meier survival curves for the dataset. You can use the `lifelines` library for this task.

You should generate the survival curves for the following characteristics:

- Tumor size (greater or less than the median)
- Number of lymph nodes (0, 1-4, >=5)
- Recurrence-free survival (RFS)

You can also generate the survival curves for the features you consider important.

## Task 3: Fit a Cox Proportional Hazards model

The Cox Proportional Hazards model is a popular model for predicting time-to-event. Fit a Cox Proportional Hazards model to the dataset. You can use the `lifelines` library for this task.

You should perform cross-validation to evaluate the performance of the model, using the C-index as the evaluation metric. See the `lifelines` documentation for how to perform cross-validation. Finally, generate a box plot of the C-index values.

## Task 4: Implement a DeepSurv model

DeepSurv is a deep learning model for time-to-event prediction. Implement this model, evaluate its performance using cross-validation, and generate a box plot of the C-index values.

1. First, you need to create an appropriate **Dataset** class that returns the features, the time-to-event, and the event indicator.
2. Then you need a neural network model that takes the features as input and returns a single risk score per patient, the predicted log hazard (the loss in step 3 expects log hazards rather than hazard ratios).
3. You need to implement a training and evaluation loop for the model. As a loss function, you can use the `neg_partial_log_likelihood` from the `torchsurv` library.
4. You need to put everything together and run a cross-validation to evaluate the performance of the model. Evaluate the performance in each fold using the C-index.

Notes: 
- Remember to scale the features before feeding them into the model.
- Do not tune the hyperparameters of the model. Tuning them without biasing the performance estimate would require nested cross-validation, which is beyond the scope of this notebook. Use a learning rate of 0.01 and train for 50 epochs. If your model does not perform well, try two linear layers of 32 units each with a ReLU activation function. The final output dimension should be 1.

## Task 5: Describe your results



## Optional

- Add a feature selection method to the Cox Proportional Hazards model.
- Attempt to implement nested cross-validation to optimize the hyperparameters of the DeepSurv model.