# Fine-tuning a pre-trained model

This tutorial describes how to fine-tune a pre-trained model from the [DeepCpG model zoo](https://github.com/cangermueller/deepcpg/blob/master/docs/models.md). Fine-tuning a model that was pre-trained on cells similar to the cells of interest can considerably decrease training time.

## Table of Contents
* [Initialization](#Initialization)
* [Creating DeepCpG data files](#Creating-DeepCpG-data-files)
* [Downloading a pre-trained model](#Downloading-a-pre-trained-model)
* [Fine-tuning the model](#Fine-tuning-the-model)
* [Imputing methylation profiles](#Imputing-methylation-profiles)

## Initialization

We first initialize some variables that will be used throughout the tutorial. `test_mode=1` should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, `test_mode=0` should be used.

In [1]:
function run {
  local cmd=$@
  echo
  echo "#################################"
  echo $cmd
  echo "#################################"
  eval $cmd
}

test_mode=1 # Set to 1 for testing and 0 otherwise
example_dir="../../data/" # Directory with example data.
cpg_dir="$example_dir/cpg" # Directory with CpG profiles.
dna_dir="$example_dir/dna/mm10" # Directory with DNA sequences.
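
Before running the remaining steps, it can help to confirm that the directories defined above exist. The following is a minimal sketch; `check_dir` is a hypothetical helper introduced here for illustration and is not part of DeepCpG.

```shell
# Hypothetical helper (not part of DeepCpG): report whether a directory exists.
check_dir() {
  if [ -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Paths as defined in the cell above; adjust them to your own layout.
example_dir="../../data/"
cpg_dir="$example_dir/cpg"
dna_dir="$example_dir/dna/mm10"

check_dir "$cpg_dir"
check_dir "$dna_dir"
```

A `missing:` line usually means the notebook is being run from a different working directory than the relative paths assume.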



## Creating DeepCpG data files

First, we create DeepCpG data files using `dcpg_data.py`. Since we will fine-tune a CpG model, we do not extract sequence windows. Otherwise (for a DNA or Joint model), `--dna_files` and `--dna_wlen` must be specified as well.

In [4]:
data_dir="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --out_dir $data_dir
    --cpg_wlen 50
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 10000
        "
fi
run $cmd


#################################
dcpg_data.py --cpg_profiles ../../data//cpg/BS27_1_SER.tsv ../../data//cpg/BS27_3_SER.tsv ../../data//cpg/BS27_5_SER.tsv ../../data//cpg/BS27_6_SER.tsv ../../data//cpg/BS27_8_SER.tsv --out_dir ./data --cpg_wlen 50 --nb_sample 10000
#################################
INFO (2017-03-05 19:19:42,901): Reading single-cell profiles ...
INFO (2017-03-05 19:19:43,339): 10000 samples
INFO (2017-03-05 19:19:43,340): --------------------------------------------------------------------------------
INFO (2017-03-05 19:19:43,340): Chromosome 1 ...
INFO (2017-03-05 19:19:43,368): 10000 / 10000 (100.0%) sites matched minimum coverage filter
INFO (2017-03-05 19:19:43,369): Chunk 	1 / 1
INFO (2017-03-05 19:19:43,379): Extracting CpG neighbors ...
INFO (2017-03-05 19:19:44,498): Done!
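
If sequence windows were needed, the same call would additionally pass `--dna_files` and `--dna_wlen`. The sketch below only assembles and prints such a command, so it runs without DeepCpG installed. Passing the DNA directory to `--dna_files` and the window length of 1001 are assumptions made for illustration, not values taken from this tutorial.

```shell
example_dir="../../data"
cpg_dir="$example_dir/cpg"
dna_dir="$example_dir/dna/mm10"
data_dir="./data"
dna_wlen=1001   # assumed window length, for illustration only

# Same command as above, extended by the two DNA-related options.
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $data_dir
    --cpg_wlen 50
    --dna_wlen $dna_wlen"

# Print instead of running the command.
echo "$cmd"
```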


## Downloading a pre-trained model

`dcpg_download.py` downloads a pre-trained model from the DeepCpG model zoo. Available models and their descriptions can be found on the [model zoo website](https://github.com/cangermueller/deepcpg/blob/master/docs/source/zoo.md), or retrieved with `dcpg_download.py --show`:

In [5]:
dcpg_download.py --show

Available models: https://github.com/cangermueller/deepcpg/blob/master/docs/models.md
Hou2016_HCC_cpg
Hou2016_HCC_dna
Hou2016_HCC_joint
Hou2016_HepG2_cpg
Hou2016_HepG2_dna
Hou2016_HepG2_joint
Hou2016_mESC_cpg
Hou2016_mESC_dna
Hou2016_mESC_joint
Smallwood2014_2i_cpg
Smallwood2014_2i_dna
Smallwood2014_2i_joint
Smallwood2014_serum_cpg
Smallwood2014_serum_dna
Smallwood2014_serum_joint


A model name consists of three parts, which are separated by '_'. The first part corresponds to the publication, the second to the cell type, and the third to the model type (CpG, DNA, or Joint model). Cells from 'Hou2016' were profiled using scRRBS-seq, cells from 'Smallwood2014' using scBS-seq. 'HCC' and 'HepG2' are human cancer cells, and the rest are mouse cells. You should use the cell type that is most similar to the cell type you are working with. More information about the available models can be found [here](https://github.com/cangermueller/deepcpg/blob/master/docs/models.md).
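
The naming scheme can be taken apart with plain shell parameter expansion. This is a small sketch that assumes the name has exactly the three '_'-separated parts described above; the variable names are chosen here for illustration.

```shell
# Split a model-zoo name of the form <publication>_<cell type>_<model type>.
model_name="Smallwood2014_2i_cpg"

publication=${model_name%%_*}   # text before the first '_'
model_type=${model_name##*_}    # text after the last '_'
cell_type=${model_name#*_}      # drop the publication ...
cell_type=${cell_type%_*}       # ... and then the model type

echo "publication: $publication"
echo "cell type:   $cell_type"
echo "model type:  $model_type"
```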

Since we are dealing with 2i cells and want to train a CpG model, we will fine-tune 'Smallwood2014_2i_cpg':

In [6]:
pretrained_model="./models/Smallwood2014_2i_cpg"
cmd="dcpg_download.py
  $(basename $pretrained_model)
  -o $pretrained_model
  "
run $cmd


#################################
dcpg_download.py Smallwood2014_2i_cpg -o ./models/Smallwood2014_2i_cpg
#################################
INFO (2017-03-05 19:19:51,601): Downloading model ...
INFO (2017-03-05 19:19:51,601): Model URL: http://www.ebi.ac.uk/~angermue/deepcpg/alias/f89b2e8344012d73e95504da06bcf378
--2017-03-05 19:19:51--  http://www.ebi.ac.uk/~angermue/deepcpg/alias/f89b2e8344012d73e95504da06bcf378
Resolving www.ebi.ac.uk (www.ebi.ac.uk)... 193.62.192.80
Connecting to www.ebi.ac.uk (www.ebi.ac.uk)|193.62.192.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31068468 (30M) [text/plain]
Saving to: ‘./models/Smallwood2014_2i_cpg/model.zip’


2017-03-05 19:19:54 (10.1 MB/s) - ‘./models/Smallwood2014_2i_cpg/model.zip’ saved [31068468/31068468]

Archive:  ./models/Smallwood2014_2i_cpg/model.zip
  inflating: ./models/Smallwood2014_2i_cpg/model.h5  
  inflating: ./models/Smallwood2014_2i_cpg/model.json  
  inflating: ./models/Smallwo

The command downloads the model files and stores them in the output directory, including the weights and a JSON file with the model specification:

In [7]:
ls $pretrained_model

model.h5               model_weights.h5       model_weights_val.h5
model.json             model_weights_train.h5


`model.json` stores the model specification, and `model_weights_train.h5` and `model_weights_val.h5` the weights that yielded the highest performance on the training and validation set, respectively. `model.h5` combines `model.json` and `model_weights_val.h5`.
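
A quick way to confirm that a download is complete is to check for the files described above. The helper below is a hypothetical sketch, not a DeepCpG command; it only checks the four files named in the preceding paragraph.

```shell
# Hypothetical helper: print the described model files missing from a directory.
missing_model_files() {
  local dir="$1"
  local name
  for name in model.json model.h5 model_weights_train.h5 model_weights_val.h5; do
    if [ ! -f "$dir/$name" ]; then
      echo "$name"
    fi
  done
}

pretrained_model="./models/Smallwood2014_2i_cpg"
missing=$(missing_model_files "$pretrained_model")
if [ -z "$missing" ]; then
  echo "All expected model files found in $pretrained_model"
else
  echo "Missing from $pretrained_model:"
  echo "$missing"
fi
```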

## Fine-tuning the model

To fine-tune the downloaded model, we use `--cpg_model` followed by the model directory, and `--fine_tune` to only train the output layers.

`--cpg_model $pretrained_model` is equivalent to `--cpg_model $pretrained_model/model.json $pretrained_model/model_weights_val.h5`. To fine-tune the weights with the highest performance on the training set, you have to use `model_weights_train.h5` as input instead of `model_weights_val.h5`.

Without `--fine_tune`, `dcpg_train.py` will train all weights, not only the output layers. This is recommended if the cells that were used for the pre-trained model are only distantly related to the cells of interest, e.g. if the cell types do not match. Training all weights can lead to higher prediction performance, but also increases training time.
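
The two choices described above (which weights file to start from, and whether to pass `--fine_tune`) can be combined freely. The sketch below assembles the corresponding command lines and prints them without running anything; `build_train_cmd` is a hypothetical helper introduced here, and only the options already named in this tutorial are used.

```shell
pretrained_model="./models/Smallwood2014_2i_cpg"
data_dir="./data"

# Hypothetical helper: assemble a dcpg_train.py call.
#   $1: weights file to start from (model_weights_val.h5 or model_weights_train.h5)
#   $2: 1 to train only the output layers (--fine_tune), 0 to train all weights
build_train_cmd() {
  local weights="$1"
  local fine_tune_only="$2"
  local cmd="dcpg_train.py $data_dir/*.h5"
  cmd="$cmd --cpg_model $pretrained_model/model.json $pretrained_model/$weights"
  cmd="$cmd --out_dir ./models/cpg"
  if [ "$fine_tune_only" -eq 1 ]; then
    cmd="$cmd --fine_tune"
  fi
  echo "$cmd"
}

# Start from the validation weights and train only the output layers.
build_train_cmd model_weights_val.h5 1
# Start from the training weights and train all weights.
build_train_cmd model_weights_train.h5 0
```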

In [8]:
cmd="dcpg_train.py
    $data_dir/*.h5
    --cpg_model $pretrained_model
    --out_dir ./models/cpg
    --fine_tune
  "
if [[ $test_mode -eq 1 ]]; then
  cmd="$cmd
    --nb_epoch 2
    --nb_train_sample 1000
    --nb_val_sample 1000
    "
else
  cmd="$cmd
    --nb_epoch 25
    --early_stopping 5
    "
fi
run $cmd



#################################
dcpg_train.py ./data/c1_000000-010000.h5 --cpg_model ./models/Smallwood2014_2i_cpg --out_dir ./models/cpg --fine_tune --nb_epoch 2 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 19:20:27,727): Building model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-03-05 19:20:27,735): Loading existing CpG model ...
INFO (2017-03-05 19:20:27,736): Using model files ./models/Smallwood2014_2i_cpg/model.json ./models/Smallwood2014_2i_cpg/model_weights.h5
INFO (2017-03-05 19:20:28,772): Replicate names differ: Copying weights to new model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
cpg/state (InputLayer)           (None, 5, 50)         0                                        

## Imputing methylation profiles

Finally, we impute methylation profiles and evaluate our fine-tuned model using `dcpg_eval.py`:

In [9]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $data_dir/*.h5
    --model_files ./models/cpg
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 1000
        "
fi
run $cmd


#################################
dcpg_eval.py ./data/c1_000000-010000.h5 --model_files ./models/cpg --out_data ./eval/data.h5 --out_report ./eval/report.tsv --nb_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 19:20:46,772): Loading model ...
INFO (2017-03-05 19:20:47,542): Loading data ...
INFO (2017-03-05 19:20:47,571): Predicting ...
INFO (2017-03-05 19:20:47,587):  128/1000 (12.8%)
INFO (2017-03-05 19:20:47,686):  256/1000 (25.6%)
INFO (2017-03-05 19:20:47,759):  384/1000 (38.4%)
INFO (2017-03-05 19:20:47,833):  512/1000 (51.2%)
INFO (2017-03-05 19:20:47,914):  640/1000 (64.0%)
INFO (2017-03-05 19:20:47,991):  768/1000 (76.8%)
INFO (2017-03-05 19:20:48,078):  896/1000 (89.6%)
INFO (2017-03-05 19:20:48,158): 1000/1000 (100.0%)
  'precision', 'predicted', average, warn_for)
           output       auc       acc       tpr       tnr        f1       mcc      n
2  cpg/BS27_5_SER  0.614279  0.850000  0.989362  0.031250  0.918519
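
To look at a single metric per output, the report can be filtered with `awk`. This sketch assumes that `report.tsv` is tab-separated with a header row naming the columns shown above (`output`, `auc`, `acc`, ...); that layout is an assumption and should be checked against the actual file. `print_metric` is a hypothetical helper, and the stand-in file contains only values visible in the row printed above.

```shell
# Hypothetical helper: print "<output><TAB><metric>" for every row of a report.
#   $1: path to a tab-separated report with a header row
#   $2: name of the metric column, e.g. auc
print_metric() {
  awk -F '\t' -v metric="$2" '
    NR == 1 {
      # Locate the columns by name in the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "output") out = i
        if ($i == metric) col = i
      }
      next
    }
    out && col { print $out "\t" $col }
  ' "$1"
}

# Small stand-in for ./eval/report.tsv, using values from the row shown above.
sample_report=$(mktemp)
printf 'output\tauc\tacc\n' > "$sample_report"
printf 'cpg/BS27_5_SER\t0.614279\t0.850000\n' >> "$sample_report"

print_metric "$sample_report" auc
```

With a real report, `print_metric ./eval/report.tsv auc` would list one line per output, provided the column names match the assumption above.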