# <center>Breast cancer survival prediction</center>

### Table of contents

1. [Introduction](#introduction)
    1. [Survival analysis](#survival_analysis)
2. [Data exploration](#data_exploration)
    1. [Python requirements](#python_requirements)
    2. [Getting the data](#getting_data)
    3. [Baseline model](#baseline_model)
3. [Submission](#submission)


## Introduction <a name="introduction"></a>

Breast cancer is one of the most common cancers and the second leading cause of cancer death among women in the United States. One in nine women will be diagnosed with breast cancer in her lifetime ([INCa, April 2016](https://www.ligue-cancer.net/article/26094_cancer-du-sein)). Approximately 70% of breast cancer patients are inoperable because of advanced tumor growth or bone metastasis ([Min Tao et al.](https://pubmed.ncbi.nlm.nih.gov/21512769/)).

It is therefore crucial to diagnose the disease accurately and to better understand its aggravating factors. Here we propose to predict patient survival from the genetic factors of the tumour.

### Survival analysis<a name="survival_analysis"></a>

*[Survival analysis](https://en.wikipedia.org/wiki/Survival_analysis)* is a branch of statistics that models the lifetime or activity time of a living being or device. Widely used in medicine, particularly in oncology, it can handle data that are often *censored* over time. In oncology, for instance, some patients inevitably become unreachable or leave the study before the measured event (death, remission, metastasis, etc.) occurs; classic statistical tools such as regression are then poorly suited and may underestimate the lifetimes studied. Here we give a brief, non-exhaustive overview of basic statistical techniques that may be useful for the rest of the challenge.
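To make the underestimation concrete, here is a minimal simulation sketch. The exponential lifetimes, the uniform censoring times and every variable name are assumptions chosen for illustration; none of them come from the challenge data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 10_000

# Hypothetical true lifetimes (in months) and independent censoring times.
true_lifetime = rng.exponential(scale=60.0, size=n_patients)
censoring_time = rng.uniform(0.0, 120.0, size=n_patients)

# Only the earlier of the two is observed, plus a flag saying which one it was.
observed_time = np.minimum(true_lifetime, censoring_time)
event = (true_lifetime <= censoring_time).astype(int)  # 1 = event seen, 0 = censored

true_mean = true_lifetime.mean()
naive_mean_all = observed_time.mean()                 # treats censoring times as lifetimes
naive_mean_events = observed_time[event == 1].mean()  # drops censored patients entirely

print(f"true mean lifetime:          {true_mean:.1f}")
print(f"naive mean, all patients:    {naive_mean_all:.1f}")
print(f"naive mean, uncensored only: {naive_mean_events:.1f}")
```

In this setup both naive averages come out below the true mean: the first because a censoring time is never longer than the lifetime it hides, the second because short lifetimes are more likely to be observed before censoring.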

#### A few definitions

To keep this introduction readable, we assume that time is a discrete variable, i.e. $t \in \{1, \dots, n\}$ with $n \in \mathbb{N}$.


Let $\tau \geq 0$ be the random variable giving the time until a relevant event happens. We wish to estimate the *survival function* $S$, which characterises the distribution of $\tau$ and is defined by:

$$S(t) = \mathrm{Prob}(\tau > t)$$


We also assume that censoring (patients leaving the study) is non-informative, i.e. it is unrelated to the outcome studied and can be considered random.
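One standard estimator of $S$ under this assumption is the Kaplan-Meier estimator, $\hat{S}(t) = \prod_{t_i \leq t} (1 - d_i / n_i)$, where $d_i$ is the number of events at time $t_i$ and $n_i$ the number of patients still at risk just before $t_i$. The text above does not name an estimator, so the sketch below is only an illustration; the function name and the toy data are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return the distinct event times and the estimated S(t) at those times.

    times  : observed time for each patient (event time or censoring time)
    events : 1 if the event was observed, 0 if the patient was censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    event_times = np.unique(times[events == 1])
    survival = np.empty(len(event_times))
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(times >= t)                   # still observed just before t
        deaths = np.sum((times == t) & (events == 1))  # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        survival[i] = s
    return event_times, survival

# Toy data: six patients, two of them censored (event = 0).
km_times, km_survival = kaplan_meier(
    times=[2, 3, 3, 5, 8, 9],
    events=[1, 1, 0, 1, 0, 1],
)
print(km_times)
print(km_survival)
```

Censored patients never trigger a drop in the curve, but they do count in the at-risk set until the time they leave, which is how the estimator uses their partial information.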

# TODO

Discuss [survival analysis](https://en.wikipedia.org/wiki/Survival_analysis), in particular with right-censored data (which is our case: it corresponds to all the records for which `event = 0`, i.e. the patient was lost to follow-up). We can build on the documentation of the two Python libraries below, which covers every aspect of our project and notably includes end-to-end example analyses, from data exploration to the choice of model and metric:

+ scikit-survival
+ pysurvival

## Data exploration <a name="data_exploration"></a>

### Python requirements <a name="python_requirements"></a>

To collect and analyse the data, the following Python libraries are required:



In [1]:
with open('requirements.txt', 'r') as requirements:
    print(requirements.read())

# Generic requirements
numpy
pandas

# Gather the data
mygene
xenaPython



These can be installed (under Linux) with:

In [2]:
%%capture
!pip install -r requirements.txt

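### Getting the data <a name="getting_data"></a>

We load the dataset with the `load_or_download` helper, which downloads it on the first call and simply loads the local copy afterwards.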

In [3]:
from problem import load_or_download
database_filename = "database_TCGA.csv" 
df = load_or_download(database_filename) # Download the dataset the first time, and then just load it

querying 1-1000...done.
querying 1001-2000...done.
...
querying 34001-35000...done.
...
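The "download once, then load" behaviour can be sketched as a small caching helper. This is not the code of `load_or_download` from the `problem` module, which is not shown here; `load_or_build`, `fake_download` and the toy columns are hypothetical names used only to illustrate the pattern.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_build(filename, build):
    """Load `filename` if it exists; otherwise call `build()`, save the result, return it."""
    path = Path(filename)
    if path.exists():
        return pd.read_csv(path)
    df = build()                  # expensive step, e.g. querying a remote service
    df.to_csv(path, index=False)
    return df

# Stand-in for the real download: it records each call so we can count them.
download_calls = []

def fake_download():
    download_calls.append(1)
    return pd.DataFrame({"expressionGENE_A": [1.0, 2.0], "event": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_database.csv"
    first = load_or_build(target, fake_download)   # builds and writes the file
    second = load_or_build(target, fake_download)  # reads the cached file

print(len(download_calls))
```

The second call never reaches `fake_download`, which is the point of the pattern: the slow querying step runs only when the local file is missing.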

In [4]:
df.head()

   expressionNUCB1-AS1  expressionHNRNPA1P76  expressionIGKV2OR2-1  ...
0               10.270                 0.000                 0.000  ...
1                0.000                 8.803                11.280  ...
2               11.350                 9.381                 9.861  ...
3                0.000                 8.658                 0.000  ...
4               10.960                 0.000                10.470  ...

Only the first three columns are shown; the rows of the full table are indexed from 0 to 1193.

Note: although 55511 genes were listed, data are available for "only" 37498 of them.

Some ideas:
+ Study the gene expression distributions: which genes are highly expressed, which are barely expressed...
+ Study the correlations among gene expression levels
+ Study the differences in gene expression between alive and dead patients (for example, the genes with the largest differences)
 
 ...
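The ideas above can be sketched with pandas on a small synthetic table. The gene names, the distributions and the use of an `event` column as the grouping flag are assumptions made for illustration; on the real data, keep in mind that `event = 0` marks a patient lost to follow-up, not necessarily one who is alive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_patients = 200

# Synthetic stand-in for the real table: three hypothetical genes plus an event flag.
toy = pd.DataFrame({
    "expressionGENE_A": rng.normal(10.0, 1.0, n_patients),
    "expressionGENE_B": rng.normal(5.0, 2.0, n_patients),
    "expressionGENE_C": np.where(
        rng.random(n_patients) < 0.6, 0.0, rng.normal(9.0, 1.0, n_patients)
    ),
    "event": rng.integers(0, 2, n_patients),
})
gene_columns = [c for c in toy.columns if c.startswith("expression")]

# 1. Distribution of each gene: mean, spread, and share of zero (unexpressed) values.
summary = toy[gene_columns].agg(["mean", "std"]).T
summary["zero_fraction"] = (toy[gene_columns] == 0).mean()

# 2. Pairwise correlations between genes.
correlations = toy[gene_columns].corr()

# 3. Genes ranked by absolute difference in mean expression between the two groups.
group_means = toy.groupby("event")[gene_columns].mean()
mean_gap = (group_means.loc[1] - group_means.loc[0]).abs().sort_values(ascending=False)

print(summary)
print(correlations.round(2))
print(mean_gap)
```

With tens of thousands of genes, the full correlation matrix becomes large, so it may be more practical to restrict step 2 to a subset of genes selected in steps 1 or 3.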

### Baseline model <a name="baseline_model"></a>



## Metric

We could use the [Integrated Brier Score](https://square.github.io/pysurvival/metrics/brier_score.html).
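As a rough illustration of what this metric measures, the sketch below computes a simplified Brier score, i.e. the mean squared gap between "still alive at time $t$" and the predicted survival probability, and averages it over a time grid. It assumes every event time is observed and therefore omits the censoring weights that a complete implementation needs; it is not the pysurvival API, and all function names are hypothetical.

```python
import numpy as np

def brier_score_uncensored(event_times, predicted_survival, t):
    """Mean squared gap between 'alive at t' (0 or 1) and the predicted S(t | x_i)."""
    alive = (np.asarray(event_times) > t).astype(float)
    return float(np.mean((alive - np.asarray(predicted_survival)) ** 2))

def integrated_brier_score_uncensored(event_times, survival_fn, time_grid):
    """Average the Brier score over time_grid using the trapezoidal rule."""
    time_grid = np.asarray(time_grid, dtype=float)
    scores = np.array([
        brier_score_uncensored(event_times, survival_fn(t), t) for t in time_grid
    ])
    area = np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(time_grid))
    return float(area / (time_grid[-1] - time_grid[0]))

# Toy check with four fully observed patients.
event_times = np.array([2.0, 4.0, 6.0, 8.0])
time_grid = np.linspace(0.0, 10.0, 11)

def perfect_model(t):
    # Predicts exactly who is still alive at time t.
    return (event_times > t).astype(float)

def coin_flip_model(t):
    # Predicts a survival probability of 0.5 for everyone, at every time.
    return np.full(len(event_times), 0.5)

ibs_perfect = integrated_brier_score_uncensored(event_times, perfect_model, time_grid)
ibs_coin = integrated_brier_score_uncensored(event_times, coin_flip_model, time_grid)
print(ibs_perfect, ibs_coin)
```

Lower is better: a perfect model scores 0, while a model that always predicts 0.5 scores 0.25, which is a common reference point when reading this metric.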



## Submission <a name="submission"></a>