# Kipoi Python API

## Quick start

Kipoi has three basic building blocks:

- **Source** - provides Models and DataLoaders.
- **Model** - makes predictions given numpy arrays.
- **DataLoader** - loads data from raw files and transforms it into a form that the Model can consume directly.

![img](../docs/theme_dir/img/kipoi-workflow.png)

## List of main commands


- `kipoi.list_sources()`
- `kipoi.get_source()`


- `kipoi.list_models()`
- `kipoi.list_dataloaders()`


- `kipoi.get_model()`
- `kipoi.get_dataloader_factory()`



### Source

Available sources are specified in the config file located at: `~/.kipoi/config.yaml`. Here is an example config file:

```yaml
model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: git@github.com:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    gl:
        type: git-lfs  # custom model
        remote_url: https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git
        local_path: /s/project/model-zoo
```

Three types of model source are supported:

- **`git-lfs`** - a git repository in which source files are tracked by git as usual, while binary files such as model weights (located in `files*` directories) are tracked by [git-lfs](https://git-lfs.github.com).
  - Requires `git-lfs` to be installed.
- **`git`** - a git repository in which all files, including weights, are tracked by git (not recommended).
- **`local`** - a local directory containing models defined in subdirectories.

For the **`git-lfs`** source type, the large files tracked by `git-lfs` are downloaded into the `local_path` directory only once a model has been requested, i.e. when `kipoi.get_model()` is invoked.

#### Note

A particular model/dataloader is defined by its source (say `kipoi` or `my_git_models`) and the relative path of the desired model directory from the model source root (say `rbp/`).

A directory is considered a model if it contains a `model.yaml` file.
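
The two rules above (a model is named by its path relative to the source root, and a directory counts as a model when it contains `model.yaml`) can be illustrated with a short sketch. This is not Kipoi's own implementation; `find_models` is a hypothetical helper and the directory layout is created in a temporary folder purely for demonstration.

```python
import tempfile
from pathlib import Path


def find_models(source_root):
    """Return the model names found under `source_root`.

    A directory counts as a model when it directly contains `model.yaml`;
    its name is its path relative to the source root.
    """
    root = Path(source_root)
    return sorted(
        p.parent.relative_to(root).as_posix()
        for p in root.rglob("model.yaml")
    )


# Build a throwaway source layout to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["rbp_eclip/UPF1", "CpGenie/A549_ENCSR000DDI"]:
        model_dir = Path(tmp) / name
        model_dir.mkdir(parents=True)
        (model_dir / "model.yaml").touch()
    (Path(tmp) / "notes").mkdir()  # no model.yaml, so not a model

    found = find_models(tmp)

print(found)
```

The `notes` directory is skipped because it has no `model.yaml`, while the two nested directories are reported by their relative paths.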

In [1]:
import kipoi

In [2]:
import warnings
warnings.filterwarnings('ignore')

import logging
logging.disable(1000)

In [3]:
kipoi.list_sources()

| | source | type | location | local_size | n_models | n_dataloaders |
|---|---|---|---|---|---|---|
| 0 | kipoi | git-lfs | /homes/rkreuzhu/.kipoi/models/ | 5.0G | 780 | 780 |


In [4]:
s = kipoi.get_source("kipoi")

In [5]:
s

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/homes/rkreuzhu/.kipoi/models/')

In [6]:
kipoi.list_models()

| | source | model | version | authors | contributors | doc | type | inputs | targets | postproc_score_variants | license | cite_as | trained_on | training_procedure | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kipoi | CpGenie/A549_ENCSR000DDI | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 1 | kipoi | CpGenie/BE2C_ENCSR000DEB | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 2 | kipoi | CpGenie/BJ_ENCSR000DEA | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 3 | kipoi | CpGenie/CMK_ENCSR000DGJ | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 4 | kipoi | CpGenie/Caco_2_ENCSR000DDO | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 5 | kipoi | CpGenie/GM06990_ENCSR000DDN | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 6 | kipoi | CpGenie/GM12878_ENCSR000DEY | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 7 | kipoi | CpGenie/GM12878_ENCSR000DFT | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 8 | kipoi | CpGenie/GM12891_ENCSR000DFO | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |
| 9 | kipoi | CpGenie/GM12892_ENCSR000DFN | 0.1 | [Author(name='Haoyang Zeng', github='haoyangz'... | [Author(name='Roman Kreuzhuber', github='krrom... | Abstract: DNA methylation plays a crucial role... | keras | seq | methylation_prob | True | Apache License v2 | https://doi.org/10.1093/nar/gkx177 | RRBS (restricted representation bisulfite sequ... | RMSprop | [DNA methylation] |


## Model

Let's use the `rbp_eclip/UPF1` model from the `kipoi` source.

In [7]:
# Note. Install all the dependencies for that model using
# kipoi env install 
model = kipoi.get_model("rbp_eclip/UPF1")

Using TensorFlow backend.


### Available fields:

#### Model

- type
- args
- info
  - authors
  - name
  - version
  - tags
  - doc
- schema
  - inputs
  - targets
- default_dataloader - loaded dataloader class


- predict_on_batch()
- source
- source_dir
- pipeline
  - predict()
  - predict_example()
  - predict_generator()
  
#### Dataloader

- type
- defined_as
- args
- info (same as for the model)
- output_schema
  - inputs
  - targets
  - metadata


- source
- source_dir
- example_kwargs
- init_example()
- batch_iter()
- batch_train_iter()
- batch_predict_iter()
- load_all()

In [8]:
model

<kipoi.model.KerasModel at 0x2ada1ede6cc0>

In [9]:
model.type

'keras'

### Info

In [10]:
model.info

ModelInfo(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='\'RBP binding model from Avsec et al: "Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks". \'\n', name=None, version='0.1', license='MIT', tags=['RNA binding'], contributors=[Author(name='Ziga Avsec', github='avsecz', email=None)], cite_as='https://doi.org/10.1093/bioinformatics/btx727', trained_on='RBP occupancy peaks measured by eCLIP-seq (Van Nostrand et al., 2016 - https://doi.org/10.1038/nmeth.3810), https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017\n', training_procedure='Single task training with ADAM')

In [11]:
model.info.version

'0.1'

### Schema

In [12]:
dict(model.schema.inputs)

{'dist_exon_intron': ArraySchema(shape=(1, 10), doc='Distance the nearest exon_intron (splice donor) site transformed with B-splines', name='dist_exon_intron', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_end': ArraySchema(shape=(1, 10), doc='Distance the nearest gene end transformed with B-splines', name='dist_gene_end', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_start': ArraySchema(shape=(1, 10), doc='Distance the nearest gene start transformed with B-splines', name='dist_gene_start', special_type=None, associated_metadata=[], column_labels=None),
 'dist_intron_exon': ArraySchema(shape=(1, 10), doc='Distance the nearest intron_exon (splice acceptor) site transformed with B-splines', name='dist_intron_exon', special_type=None, associated_metadata=[], column_labels=None),
 'dist_polya': ArraySchema(shape=(1, 10), doc='Distance the nearest Poly-A site transformed with B-splines', name='dist_polya', special_type=None, associ

In [13]:
model.schema.targets

ArraySchema(shape=(1,), doc='Predicted binding strength', name=None, special_type=None, associated_metadata=[], column_labels=None)
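
One way to use the schema is as a sanity check on batches before calling the model. The sketch below assumes that each schema `shape` describes a single sample, without the leading batch axis; that is an assumption here, not something the output above states. `shape_of` and `check_batch` are hypothetical helpers, and nested lists stand in for numpy arrays to keep the example dependency-free.

```python
def shape_of(nested):
    """Shape of a rectangular nested list, e.g. [[0, 1], [2, 3]] -> (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def check_batch(batch, schema_shapes):
    """Return names of inputs whose per-sample shape disagrees with the schema.

    The leading axis of every batch array is treated as the batch axis and
    skipped in the comparison (assumption: schema shapes exclude it).
    """
    mismatched = []
    for name, expected in schema_shapes.items():
        if name not in batch or shape_of(batch[name])[1:] != tuple(expected):
            mismatched.append(name)
    return mismatched


schema_shapes = {"dist_polya": (1, 10)}

# Two samples, each of shape (1, 10): matches the schema.
good_batch = {"dist_polya": [[[0.0] * 10], [[0.0] * 10]]}
# Two samples, each of shape (10,): the singleton axis is missing.
bad_batch = {"dist_polya": [[0.0] * 10, [0.0] * 10]}

print(check_batch(good_batch, schema_shapes))
print(check_batch(bad_batch, schema_shapes))
```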

### Default dataloader

The model already comes with a default dataloader. It is defined in the model's source directory and is accessible through the `default_dataloader` attribute:

In [14]:
model.source_dir

'/homes/rkreuzhu/.kipoi/models/rbp_eclip/UPF1'

In [15]:
model.default_dataloader

dataloader.SeqDistDataset

In [16]:
model.default_dataloader.info

Info(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='RBP binding model taking as input 101nt long sequence as well as 8 distances to nearest genomic landmarks -  tss, poly-A, exon-intron boundary, intron-exon boundary, start codon, stop codon, gene start, gene end\n', name=None, version='0.1', license='MIT', tags=[])

### Predict_on_batch

In [17]:
model.predict_on_batch

<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x2ada1ede6cc0>>

### Pipeline

The pipeline object takes the dataloader arguments and runs the whole pipeline:

```
dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
```
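
The data flow in the diagram can be sketched with toy stand-ins. Nothing below is Kipoi code: `toy_dataloader` and `toy_model_predict_on_batch` are hypothetical placeholders, and plain lists replace numpy arrays. The sketch only shows how a batch-wise generator and a "predict everything" function could relate to each other, in the spirit of `pipeline.predict_generator()` and `pipeline.predict()`.

```python
def toy_dataloader(values, batch_size=2):
    """Stand-in dataloader: turns raw values into batches of model inputs."""
    for start in range(0, len(values), batch_size):
        yield {"inputs": values[start:start + batch_size]}


def toy_model_predict_on_batch(inputs):
    """Stand-in model: one prediction per input row."""
    return [2 * x for x in inputs]


def predict_generator(dataloader_kwargs):
    """Yield predictions batch by batch."""
    for batch in toy_dataloader(**dataloader_kwargs):
        yield toy_model_predict_on_batch(batch["inputs"])


def predict(dataloader_kwargs):
    """Run the whole pipeline and concatenate the per-batch predictions."""
    predictions = []
    for batch_predictions in predict_generator(dataloader_kwargs):
        predictions.extend(batch_predictions)
    return predictions


kwargs = {"values": [1, 2, 3, 4, 5], "batch_size": 2}
first_batch = next(predict_generator(kwargs))
all_predictions = predict(kwargs)
print(first_batch)      # [2, 4]
print(all_predictions)  # [2, 4, 6, 8, 10]
```

The generator form keeps only one batch in memory at a time, which matters when the input files are large.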

In [18]:
#model.pipeline.predict

In [19]:
#model.pipeline.predict_generator

### Others

In [20]:
# Model source
model.source

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/homes/rkreuzhu/.kipoi/models/')

In [21]:
# model location directory
model.source_dir

'/homes/rkreuzhu/.kipoi/models/rbp_eclip/UPF1'

## DataLoader

In [22]:
DataLoader = kipoi.get_dataloader_factory("rbp_eclip/UPF1")

A dataloader will most likely require input arguments that define the input files, for example the FASTA or BED files from which the model input is generated. The dataloader's keyword arguments can be displayed in several ways:

In [23]:
# Display information about the dataloader
print(DataLoader.__doc__)


    Args:
        intervals_file: file path; tsv file
            Assumes bed-like `chrom start end id score strand` format.
        fasta_file: file path; Genome sequence
        gtf_file: file path; Genome annotation GTF file.
        filter_protein_coding: Considering genomic landmarks only for protein coding genes
        preproc_transformer: file path; tranformer used for pre-processing.
        target_file: file path; path to the targets
        batch_size: int
    


In [25]:
# Alternatively the dataloader keyword arguments can be displayed using the function:
kipoi.print_dl_kwargs(DataLoader)

Keyword argument: `intervals_file`
    doc: bed6 file with `chrom start end id score strand` columns
    type: str
    optional: False
    example: example_files/intervals.bed
Keyword argument: `fasta_file`
    doc: Reference genome sequence
    type: str
    optional: False
    example: example_files/hg38_chr22.fa
Keyword argument: `gtf_file`
    doc: file path; Genome annotation GTF file
    type: str
    optional: False
    example: example_files/gencode.v24.annotation_chr22.gtf
Keyword argument: `filter_protein_coding`
    doc: Considering genomic landmarks only for protein coding genes when computing the distances to the nearest genomic landmark.
    type: str
    optional: True
    example: True
Keyword argument: `target_file`
    doc: path to the targets (txt) file
    type: str
    optional: True
    example: example_files/targets.tsv
Keyword argument: `use_linecache`
    doc: if True, use linecache https://docs.python.org/3/library/linecache.html to access bed file rows
    ty

## Run dataloader on some examples

In [26]:
# each dataloader already provides example files which can be used to illustrate its use:
DataLoader.example_kwargs

{'fasta_file': 'example_files/hg38_chr22.fa',
 'filter_protein_coding': True,
 'gtf_file': 'example_files/gencode.v24.annotation_chr22.gtf',
 'intervals_file': 'example_files/intervals.bed',
 'target_file': 'example_files/targets.tsv'}

In [27]:
import os

In [28]:
# cd into the source directory 
os.chdir(DataLoader.source_dir)
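
Changing the working directory affects the whole process. An alternative is to make the relative example paths absolute before passing them to the dataloader. The sketch below assumes that every string value in the example kwargs is a file path, which holds for the kwargs shown above but may not hold for other dataloaders. `resolve_example_kwargs` is a hypothetical helper and the directory used in the demonstration is a placeholder.

```python
import os


def resolve_example_kwargs(example_kwargs, source_dir):
    """Return a copy of `example_kwargs` with relative paths made absolute.

    Only string values are treated as paths (an assumption); other values,
    such as booleans, are passed through unchanged.
    """
    resolved = {}
    for key, value in example_kwargs.items():
        if isinstance(value, str) and not os.path.isabs(value):
            resolved[key] = os.path.join(source_dir, value)
        else:
            resolved[key] = value
    return resolved


kwargs = {
    "intervals_file": "example_files/intervals.bed",
    "filter_protein_coding": True,
}
# Placeholder directory; in practice this would be DataLoader.source_dir.
resolved = resolve_example_kwargs(kwargs, "/tmp/models/rbp_eclip/UPF1")
print(resolved["intervals_file"])
```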

In [29]:
!tree

.
├── custom_keras_objects.py -> ../template/custom_keras_objects.py
├── dataloader_files
│   └── position_transformer.pkl
├── dataloader.py -> ../template/dataloader.py
├── dataloader.yaml -> ../template/dataloader.yaml
├── example_files -> ../template/example_files
├── model_files
│   └── model.h5
├── model.yaml -> ../template/model.yaml
└── __pycache__
    ├── custom_keras_objects.cpython-35.pyc
    └── dataloader.cpython-35.pyc

4 directories, 8 files


In [30]:
dl = DataLoader(**DataLoader.example_kwargs)
# this could also be done with DataLoader.init_example()

In [31]:
# This particular dataloader is of type Dataset
# i.e. it implements the __getitem__ method:
dl[0].keys()

dict_keys(['metadata', 'inputs', 'targets'])

In [32]:
dl[0]["inputs"].keys()

dict_keys(['dist_gene_start', 'dist_gene_end', 'dist_stop_codon', 'dist_exon_intron', 'dist_start_codon', 'dist_tss', 'seq', 'dist_polya', 'dist_intron_exon'])

In [33]:
dl[0]["inputs"]["seq"][:5]

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]], dtype=float32)

In [34]:
len(dl)

14

### Get the whole dataset

In [35]:
whole_data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00,  3.29it/s]


In [36]:
whole_data.keys()

dict_keys(['metadata', 'inputs', 'targets'])

In [37]:
whole_data["inputs"]["seq"].shape

(14, 101, 4)
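
The Dataset-style interface used above (indexing, `len()` and `load_all()`) can be mimicked with a small toy class. This is only a sketch of the pattern and not the Kipoi base class: `ToyDataset` is hypothetical, and its `load_all()` gathers samples into lists rather than stacking numpy arrays.

```python
class ToyDataset:
    """Minimal dataset: indexable samples plus a helper that gathers them."""

    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # One sample, using the same top-level keys as the dataloader output.
        return {
            "inputs": {"seq": self.sequences[idx]},
            "targets": self.targets[idx],
            "metadata": {"index": idx},
        }

    def load_all(self):
        """Collect every sample into one dict with a leading sample axis."""
        samples = [self[i] for i in range(len(self))]
        return {
            "inputs": {"seq": [s["inputs"]["seq"] for s in samples]},
            "targets": [s["targets"] for s in samples],
            "metadata": {"index": [s["metadata"]["index"] for s in samples]},
        }


toy = ToyDataset(sequences=["ACGT", "TTGA", "CCAG"], targets=[1, 0, 1])
whole = toy.load_all()
print(len(toy), sorted(toy[0].keys()), whole["inputs"]["seq"])
```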

### Get the iterator to run predictions

In [38]:
it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)

In [39]:
next(it)["inputs"]["seq"].shape

(1, 101, 4)

In [40]:
model.predict_on_batch(next(it)["inputs"])

array([[0.00050414]], dtype=float32)
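
Note that the two cells above each call `next(it)`, so the prediction is made on the second batch, not on the batch whose shape was printed. The following sketch shows this with a toy generator; `batch_iter` here is a hypothetical stand-in written for the example, not the dataloader method.

```python
def batch_iter(samples, batch_size):
    """Yield consecutive batches; the last one may be shorter."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


it = batch_iter(["s0", "s1", "s2"], batch_size=1)

first = next(it)   # consumes batch 0
second = next(it)  # consumes batch 1, not batch 0 again

# Keep a reference when the same batch is needed more than once.
batch = next(it)
shape_input, predict_input = batch, batch
print(first, second, batch)
```

Storing the batch in a variable makes sure that the inspected inputs and the predicted inputs are the same.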

### Train the Keras model

The underlying Keras model is stored under the `.model` attribute.

In [41]:
model.model.compile("adam", "binary_crossentropy")

In [42]:
train_it = dl.batch_train_iter(batch_size=2)

In [43]:
# model.model.summary()

In [44]:
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x2ada28e3cba8>
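
Generator-based Keras training expects an iterator that yields `(inputs, targets)` pairs and does not run out before `steps_per_epoch * epochs` batches have been drawn. The sketch below illustrates one way such an iterator could be built from a batch iterator. It is an assumption about the general pattern, not a copy of what `batch_train_iter()` does internally, and both functions are hypothetical.

```python
from itertools import cycle, islice


def batch_iter(samples, batch_size):
    """Yield consecutive batches of sample dicts."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def batch_train_iter(samples, batch_size):
    """Endlessly yield (inputs, targets) pairs built from the batches."""
    batches = list(batch_iter(samples, batch_size))
    for batch in cycle(batches):
        inputs = [s["inputs"] for s in batch]
        targets = [s["targets"] for s in batch]
        yield inputs, targets


samples = [{"inputs": i, "targets": i % 2} for i in range(4)]

# Draw three steps from two available batches: the third wraps around.
steps = list(islice(batch_train_iter(samples, batch_size=2), 3))
print(steps)
```

In recent Keras releases `fit_generator` is deprecated and `fit` accepts generators directly, so the training call above may need adjusting depending on the installed version.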

## Pipeline: `raw files -[dataloader]-> numpy arrays -[model]-> prediction`

In [45]:
example_kwargs = model.default_dataloader.example_kwargs

In [46]:
model.pipeline.predict(example_kwargs)

1it [00:02,  2.67s/it]


array([1.        , 0.52299094, 0.52299094, 1.        , 1.        ,
       1.        , 0.52299094, 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        ], dtype=float32)

In [47]:
next(model.pipeline.predict_generator(example_kwargs, batch_size=2))

array([[1.        ],
       [0.52299076]], dtype=float32)