# <center>Breast cancer survival prediction</center>

### Table of contents

1. [Introduction](#introduction)
2. [Data exploration](#data_exploration)
    1. [Python requirements](#python_requirements)
    2. [Getting the data](#getting_data)
    3. [Baseline model](#baseline_model)
3. [Submission](#submission)


## Introduction <a name="introduction"></a>

Breast cancer is one of the most common cancers and the second leading cause of cancer death among women in the United States. One in nine women will be diagnosed with breast cancer in her lifetime ([INCa, April 2016](https://www.ligue-cancer.net/article/26094_cancer-du-sein)). Approximately 70% of breast cancer patients are inoperable because of advanced tumour growth or bone metastasis ([Min Tao et al.](https://pubmed.ncbi.nlm.nih.gov/21512769/)).

It is therefore crucial to diagnose the disease accurately and to better understand its aggravating factors. Here we propose to predict patient survival from the genetic factors of the tumour.

## Data exploration <a name="data_exploration"></a>

### Python requirements <a name="python_requirements"></a>

The following Python libraries are required to collect and analyse the data:



In [1]:
with open('requirements.txt', 'r') as requirements:
    print(requirements.read())

# Generic requirements
numpy
pandas

# Gather the data
mygene
xenaPython



These can be installed (under Linux) with:

In [2]:
%%capture
!pip install -r requirements.txt

### Getting the data <a name="getting_data"></a>

In [3]:
from gather_data import load_or_download
database_filename = "database_TCGA.csv"
df = load_or_download(database_filename)  # Downloads the dataset on the first run, then loads it from disk

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
...
querying 34001-35000...done.
...
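
The comment in the cell above says that `load_or_download` downloads the dataset once and then reads it from disk. The `gather_data` module is not shown here, so the following is only a minimal sketch of that caching pattern, not the actual implementation; `load_or_download_sketch` and its `download` argument are hypothetical names.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_download_sketch(filename: str, download: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached CSV if it exists, otherwise build it and cache it.

    `download` is a hypothetical callable standing in for the real
    data-gathering code (mygene / xenaPython queries), which is not shown here.
    """
    path = Path(filename)
    if path.exists():
        # Later runs: read the cached file from disk.
        return pd.read_csv(path)
    # First run: fetch the data, then cache it for next time.
    df = download()
    df.to_csv(path, index=False)
    return df
```

With this pattern, only the first call pays the cost of the remote queries; deleting the CSV file forces a fresh download.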

In [4]:
df.head()

   expressionPP2672  expressionRNU6-319P  expressionFTH1P24  \
0             0.000                 0.00              0.000
1             7.287                 0.00              0.000
2             0.000                 0.00              0.000
3             7.143                 0.00             10.530
4             0.000                 0.00              0.000
...

Note: although 55511 genes were listed, expression data are available for "only" 37498 of them.
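
The number of genes with data can be checked directly from the DataFrame. This is a small sketch that assumes, as the column names printed above suggest, that every gene column is prefixed with `expression`; the toy `df_demo` frame, its gene names and its `status` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (values and the `status` column are made up).
df_demo = pd.DataFrame({
    "expressionGENE_A": [0.0, 7.3, 0.0],
    "expressionGENE_B": [0.0, 0.0, 0.0],
    "status": [0, 1, 0],
})

# Keep only the columns whose name starts with the "expression" prefix.
expression_cols = [c for c in df_demo.columns if c.startswith("expression")]
n_genes = len(expression_cols)

# Genes that are never expressed in any sample carry no information.
never_expressed = [c for c in expression_cols if (df_demo[c] == 0).all()]

print(f"{n_genes} gene columns, {len(never_expressed)} never expressed")
```

Running the same two list comprehensions on `df` would give the gene count for the real dataset.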

Some ideas:
+ Study the gene expression distributions: which genes are highly expressed, which are barely expressed...
+ Study the correlations among gene expression levels
+ Study the differences in gene expression between surviving and deceased patients (for example, the genes with the largest differences)
 
 ...
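
The ideas above can be sketched with pandas. The example below runs on synthetic data, not on the real dataset: the gene names, the values and the `deceased` label column (0 = alive, 1 = deceased) are assumptions made for illustration, since the survival column of `df` is not shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_patients = 200

# Synthetic stand-in for the real data; `deceased` is a hypothetical label column.
demo = pd.DataFrame({
    "expressionGENE_A": rng.normal(8.0, 1.0, n_patients),
    "expressionGENE_B": rng.normal(5.0, 1.0, n_patients),
    "expressionGENE_C": np.zeros(n_patients),
    "deceased": rng.integers(0, 2, n_patients),
})
# Make GENE_B depend on the outcome so that there is a difference to find.
demo.loc[demo["deceased"] == 1, "expressionGENE_B"] += 2.0

genes = demo.filter(regex=r"^expression")

# 1. Expression distributions: mean level and fraction of samples with zero expression.
summary = pd.DataFrame({
    "mean": genes.mean(),
    "zero_fraction": (genes == 0).mean(),
}).sort_values("mean", ascending=False)

# 2. Correlations among genes (constant genes have undefined correlation, so drop them).
varying = genes.loc[:, genes.std() > 0]
correlations = varying.corr()

# 3. Genes with the largest mean difference between deceased and alive patients.
group_means = genes.groupby(demo["deceased"]).mean()
difference = (group_means.loc[1] - group_means.loc[0]).abs().sort_values(ascending=False)

print(summary)
print(difference.head())
```

On the real data, a full correlation matrix over tens of thousands of genes would be very large, so it is probably wiser to compute correlations on a subset of genes, for instance the most variable ones.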

### Baseline model <a name="baseline_model"></a>

We could use a linear model and select the genes of interest with a sparsity-inducing penalty, for example a combined Lasso-Ridge penalty (elastic net). The main advantage is that the model would be interpretable: the genes of interest can be read off its non-zero coefficients.
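
As a rough sketch of such a baseline, the example below fits a logistic regression with an elastic-net penalty using scikit-learn. Several things here are assumptions: scikit-learn is not listed in `requirements.txt`, the data are synthetic, survival is treated as a binary label, and the hyperparameters (`C`, `l1_ratio`) are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients, n_genes = 300, 50

# Synthetic expression matrix; only the first 3 genes influence the outcome.
X = rng.normal(size=(n_patients, n_genes))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]
y = (logits + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardise the features so that the penalty treats all genes equally,
# then fit a logistic regression with a mixed L1/L2 (elastic-net) penalty.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
    ),
)
model.fit(X_train, y_train)

# The L1 part of the penalty drives many coefficients to exactly zero;
# the remaining genes are the ones the model considers informative.
coefficients = model.named_steps["logisticregression"].coef_.ravel()
selected_genes = np.flatnonzero(coefficients)

print("test accuracy:", model.score(X_test, y_test))
print("genes with non-zero coefficients:", selected_genes)
```

In practice, `C` and `l1_ratio` would be chosen by cross-validation, and with far more genes than patients a stronger penalty (smaller `C`) is likely to be needed.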

## Submission <a name="submission"></a>