# Introduction to DeepBlue

### DeepBlue is accessed using the XML-RPC protocol

In [47]:
import xmlrpclib

url = "http://deepblue.mpi-inf.mpg.de/xmlrpc"
server = xmlrpclib.Server(url, allow_none=True)

#### We use the anonymous key to access DeepBlue

In [48]:
user_key = "anonymous_key"

#### We can verify the connection to DeepBlue using the command [echo](http://deepblue.mpi-inf.mpg.de/api.php#api-echo)

In [49]:
server.echo(user_key)

['okay', 'DeepBlue (1.17.7) says hi to anonymous']
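
The commands in this notebook return a two-element list: a status string followed by the payload, as the `echo` output above shows. The helper below is a minimal sketch for unpacking such a response. The name `unwrap` is hypothetical, and treating every status other than `'okay'` as a failure is an assumption rather than documented server behaviour.

```python
def unwrap(response):
    """Return the payload of a [status, payload] response, or raise."""
    status, payload = response
    if status != "okay":
        # Assumption: any status other than 'okay' signals a failure,
        # and the payload then carries the error message.
        raise RuntimeError("DeepBlue command failed: %s" % (payload,))
    return payload


# The echo response shown above, unpacked without contacting the server.
message = unwrap(["okay", "DeepBlue (1.17.7) says hi to anonymous"])
print(message)
```

With a live connection this could wrap any call, for example `unwrap(server.echo(user_key))`.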

## Accessing the data and metadata

In [50]:
(status, genomes) = server.list_genomes(user_key)

In [51]:
genomes

[['g2', 'mm9'],
 ['g3', 'mm10'],
 ['g1', 'hg19'],
 ['g4', 'hs37d5'],
 ['g7', 'GRCh38'],
 ['g8', 'GRCm38']]
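
Since `list_genomes` returns `[identifier, name]` pairs, a small lookup table makes it easy to go from an assembly name to its identifier. This is only a sketch: the function name `genome_ids_by_name` is hypothetical, and the data is copied from the output above so that the example runs without a server.

```python
# Output of list_genomes as shown above: [identifier, name] pairs.
genomes = [["g2", "mm9"], ["g3", "mm10"], ["g1", "hg19"],
           ["g4", "hs37d5"], ["g7", "GRCh38"], ["g8", "GRCm38"]]


def genome_ids_by_name(genome_list):
    """Map each genome name to its DeepBlue identifier."""
    return {name: genome_id for genome_id, name in genome_list}


ids = genome_ids_by_name(genomes)
print(ids["GRCh38"])  # g7
```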

## All DeepBlue data has an identifier.
### The identifier is one or more letters + a number.
For example:
 - genomes: 'g' + number
 - experiments: 'e' + number
 - samples: 's' + number
 - biosources: 'bs' + number
 - users: 'u' + number
 - genes: 'gn' + number
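
The naming scheme above can be checked with a regular expression. The following sketch splits an identifier into its letter prefix and its number. The function name `parse_identifier` and the `PREFIXES` table are hypothetical, and the table only covers the prefixes listed above; other prefixes may exist.

```python
import re

# Prefixes taken from the list above; this table is not exhaustive.
PREFIXES = {
    "g": "genomes",
    "e": "experiments",
    "s": "samples",
    "bs": "biosources",
    "u": "users",
    "gn": "genes",
}

# One or more lowercase letters followed by one or more digits.
_ID_PATTERN = re.compile(r"^([a-z]+)(\d+)$")


def parse_identifier(identifier):
    """Return (kind, number); kind is None for a prefix not in PREFIXES."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("not a DeepBlue-style identifier: %r" % (identifier,))
    prefix, number = match.groups()
    return PREFIXES.get(prefix), int(number)


print(parse_identifier("g7"))       # ('genomes', 7)
print(parse_identifier("e123456"))  # ('experiments', 123456)
```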
 
The command [info](http://deepblue.mpi-inf.mpg.de/api.php#api-info) retrieves information about an identifier:

In [52]:
server.info("g7", user_key)

['okay',
 [{'_id': 'g7',
   'chromosomes': [{'name': 'chr1', 'size': 248956422},
    {'name': 'chr2', 'size': 242193529},
    {'name': 'chr3', 'size': 198295559},
    {'name': 'chr4', 'size': 190214555},
    {'name': 'chr5', 'size': 181538259},
    {'name': 'chr6', 'size': 170805979},
    {'name': 'chr7', 'size': 159345973},
    {'name': 'chr8', 'size': 145138636},
    {'name': 'chr9', 'size': 138394717},
    {'name': 'chr10', 'size': 133797422},
    {'name': 'chr11', 'size': 135086622},
    {'name': 'chr12', 'size': 133275309},
    {'name': 'chr13', 'size': 114364328},
    {'name': 'chr14', 'size': 107043718},
    {'name': 'chr15', 'size': 101991189},
    {'name': 'chr16', 'size': 90338345},
    {'name': 'chr17', 'size': 83257441},
    {'name': 'chr18', 'size': 80373285},
    {'name': 'chr19', 'size': 58617616},
    {'name': 'chr20', 'size': 64444167},
    {'name': 'chr21', 'size': 46709983},
    {'name': 'chr22', 'size': 50818468},
    {'name': 'chrX', 'size': 156040895},
    {'name'
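
The genome information contains a list of `{'name', 'size'}` entries, which is often more convenient as a dictionary. The sketch below uses a two-chromosome excerpt of the output above so that it runs offline; the function name `chromosome_sizes` is hypothetical.

```python
def chromosome_sizes(genome_info):
    """Map chromosome name to size for one genome info dictionary."""
    return {c["name"]: c["size"] for c in genome_info["chromosomes"]}


# Excerpt of the info("g7") payload shown above (first two chromosomes only).
g7_excerpt = {
    "_id": "g7",
    "chromosomes": [
        {"name": "chr1", "size": 248956422},
        {"name": "chr2", "size": 242193529},
    ],
}

sizes = chromosome_sizes(g7_excerpt)
print(sizes["chr1"])  # 248956422
```

With a live connection, the input would be the first element of the payload, that is `server.info("g7", user_key)[1][0]`.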

## Data Organization

DeepBlue stores the files in *experiments*. 

The metadata of an experiment includes:
 - Name: Usually the file name
 - Description
 - Genome Assembly (hg19, GRCh38, mm10, ..)
 - Sample: A sample is a BioSource (from an ontology) together with metadata describing it
 - Epigenetic mark (H3K27ac, DNA Methylation, CTCF, ..)
 - Technique (ChIP-seq, WGBS, RRBS, ..)
 - Project (BLUEPRINT, DEEP, CEEHRC, ..)
 
An experiment also has a set of columns, where each column has a name and a type. For example, the CHROMOSOME column is a string, START is an integer, and Q_VALUE is a double.

Experiments also contain key-value pairs that are used to store metadata that does not map directly to the previous metadata fields.

The information related to an experiment can be retrieved with the [info](http://deepblue.mpi-inf.mpg.de/api.php#api-info) command.


In [53]:
(status, experiment_info) = server.info("e123456", user_key)
experiment_info[0]

{'_id': 'e123456',
 'columns': [{'column_type': 'string', 'name': 'CHROMOSOME'},
  {'column_type': 'integer', 'name': 'START'},
  {'column_type': 'integer', 'name': 'END'},
  {'column_type': 'string', 'name': 'NAME'},
  {'column_type': 'double', 'name': 'SCORE'},
  {'column_type': 'category', 'items': '+,-,.', 'name': 'STRAND'},
  {'column_type': 'double', 'name': 'SIGNAL_VALUE'},
  {'column_type': 'double', 'name': 'P_VALUE'},
  {'column_type': 'double', 'name': 'Q_VALUE'},
  {'column_type': 'integer', 'name': 'PEAK'}],
 'data_type': 'peaks',
 'description': 'GSM1523556: FL10 H3K27ac; Homo sapiens; ChIP-Seq',
 'epigenetic_mark': 'H3K27ac',
 'extra_metadata': {'ID': 'SRX730794',
  'antigen': 'H3K27ac',
  'antigen_class': 'Histone',
  'cell_type': 'B-Lymphocytes',
  'cell_type_class': 'Blood',
  'cell_type_description': 'MeSH Description=Lymphoid cells concerned with humoral immunity. They are short-lived cells resembling bursa-derived lymphocytes of birds in their production of immunog
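
The `columns` entry of the experiment information can drive the conversion of raw text values into typed values. The sketch below assumes a mapping from DeepBlue column types to Python types (`string` and `category` to `str`, `integer` to `int`, `double` to `float`); this mapping and the name `convert_row` are illustrative choices, not part of the DeepBlue API.

```python
# Assumed mapping from DeepBlue column types to Python converters.
CONVERTERS = {"string": str, "integer": int, "double": float, "category": str}


def convert_row(columns, raw_values):
    """Pair column descriptions with raw strings and convert each value."""
    row = {}
    for column, raw in zip(columns, raw_values):
        # Unknown column types are left as strings.
        convert = CONVERTERS.get(column["column_type"], str)
        row[column["name"]] = convert(raw)
    return row


# A subset of the column descriptions shown above.
columns = [
    {"column_type": "string", "name": "CHROMOSOME"},
    {"column_type": "integer", "name": "START"},
    {"column_type": "integer", "name": "END"},
    {"column_type": "double", "name": "SIGNAL_VALUE"},
]

row = convert_row(columns, ["chr1", "3003120", "3004017", "5.2902"])
print(row)
```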

## Operating on epigenomic data

DeepBlue has a set of commands for [selecting and operating](http://deepblue.mpi-inf.mpg.de/api.php#api-genomic-regions-operations) the epigenomic data.

A very useful command is [select_regions](http://deepblue.mpi-inf.mpg.de/api.php#api-select_regions). The command accepts the following _optional_ parameters and selects all regions that match them:
 - experiment_name
 - genome
 - epigenetic_mark
 - sample_id
 - technique
 - project
 - chromosomes
 - start
 - end
 
The following command selects all H3K27ac regions from the BLUEPRINT Epigenome project on the GRCh38 assembly that lie between positions 3,000,000 and 4,000,000 of chromosome 1:

In [54]:
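# Arguments, in order: experiment_name, genome, epigenetic_mark, sample_id, technique, project, chromosomes, start, end, user_key
status, regions_id = server.select_regions(None, "grch38", "h3k27ac", None, None, "blueprint epigenome", "chr1", 3000000, 4000000, user_key)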
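
Because `select_regions` takes its filters positionally, unused filters must be passed as `None` in the right slots. The following sketch builds that argument list from keyword arguments, using the parameter order listed above. The names `SELECT_REGIONS_PARAMS` and `select_regions_args` are hypothetical helpers, not part of the DeepBlue API.

```python
# Parameter order as listed above; the user key is passed last.
SELECT_REGIONS_PARAMS = (
    "experiment_name", "genome", "epigenetic_mark", "sample_id",
    "technique", "project", "chromosomes", "start", "end",
)


def select_regions_args(user_key, **filters):
    """Return the positional argument list, with None for unused filters."""
    unknown = set(filters) - set(SELECT_REGIONS_PARAMS)
    if unknown:
        raise TypeError("unknown select_regions parameters: %s"
                        % ", ".join(sorted(unknown)))
    args = [filters.get(name) for name in SELECT_REGIONS_PARAMS]
    args.append(user_key)
    return args


args = select_regions_args(
    "anonymous_key",
    genome="grch38",
    epigenetic_mark="h3k27ac",
    project="blueprint epigenome",
    chromosomes="chr1",
    start=3000000,
    end=4000000,
)
print(args)
```

With a connected server, the call would then read `status, regions_id = server.select_regions(*args)`.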

Like all data selection and operation commands, *select_regions* returns a *query_id*.
A *query_id* is an identifier, or a handle, for a data selection.

It is possible to obtain information about a *query_id* using the *info* command:

In [55]:
server.info(regions_id, user_key)

['okay',
 [{'_id': 'q994618',
   'args': {'chromosomes': ['chr1'],
    'end': 4000000,
    'epigenetic_mark': ['h3k27ac'],
    'genomes': ['grch38'],
    'project': ['BLUEPRINT Epigenome'],
    'start': 3000000},
   'type': 'experiment_select',
   'user': 'anonymous'}]]

A *query_id* can be used as input to other commands. For example, to select only *peaks* (not signal) data, we pass the previous *regions_id* to the *query_experiment_type* command:

In [56]:
status, peaks_id = server.query_experiment_type(regions_id, "peaks", user_key)

The command [count_regions](http://deepblue.mpi-inf.mpg.de/api.php#api-count_regions) is used to count how many regions are referenced by a *query_id*.

In [80]:
status, request_id = server.count_regions(peaks_id, user_key)

The *count_regions* command returns a *request_id*, which is used to retrieve the data. Because the processing is asynchronous, we have to check whether it has finished. The following code waits until the processing is finished and returns the data using the [get_request_data](http://deepblue.mpi-inf.mpg.de/api.php#api-get_request_data) command.

In [81]:
import time

def get_data(request_id, user_key):
    # Read the current state of the request.
    (status, info) = server.info(request_id, user_key)
    request_status = info[0]["state"]
    # Poll once per second until the request is done or has failed.
    while request_status != "done" and request_status != "failed":
        time.sleep(1)
        (status, info) = server.info(request_id, user_key)
        request_status = info[0]["state"]
        print request_status

    # get_request_data returns [status, data]; return only the data.
    return server.get_request_data(request_id, user_key)[1]
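
The polling loop above waits indefinitely and fetches the data even when the request has failed. The sketch below is one possible variant that adds a timeout and raises on failure. It relies only on the `"done"` and `"failed"` states used above; the names `wait_for_request` and `RequestTimeout` and the default timeout are assumptions made for illustration. The server object is passed in as a parameter so that the function can be tried with a stand-in object.

```python
import time


class RequestTimeout(Exception):
    """Raised when a request does not finish within the allowed time."""


def wait_for_request(server, request_id, user_key, timeout=60.0, interval=1.0):
    """Poll a request until it is done, then return its data."""
    deadline = time.time() + timeout
    while True:
        status, info = server.info(request_id, user_key)
        state = info[0]["state"]
        if state == "done":
            # get_request_data returns [status, data]; keep only the data.
            return server.get_request_data(request_id, user_key)[1]
        if state == "failed":
            raise RuntimeError("request %s failed" % request_id)
        if time.time() >= deadline:
            raise RequestTimeout("request %s still in state %r"
                                 % (request_id, state))
        time.sleep(interval)
```

It would be called as `wait_for_request(server, request_id, user_key)` in place of `get_data(request_id, user_key)`.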

In [82]:
count = get_data(request_id, user_key)
print count

{'count': 11953}


It is possible to retrieve the regions using the command [get_regions](http://deepblue.mpi-inf.mpg.de/api.php#api-get_regions).

The *get_regions* command requires a string containing the columns that must be included in the output. In the following example, the columns that start with *@* are called "meta-columns", and they are used to annotate the file rows with extra information. For example, the meta-column *@NAME* includes the source file of the region and *@BIOSOURCE* includes the biosource of the file.

In [83]:
status, request_id = server.get_regions(peaks_id, "CHROMOSOME,START,END,SIGNAL_VALUE,@NAME,@BIOSOURCE", user_key)

In [84]:
regions = get_data(request_id, user_key)
print regions

chr1	3003120	3004017	5.2902	S016BDH1.ERX1122502.H3K27ac.bwa.GRCh38.20151026.bed	lymphatic system
chr1	3003650	3003810	4.1740	S01KXOH1.ERX1384830.H3K27ac.bwa.GRCh38.20160701.bed	venous blood
chr1	3003710	3004009	4.5758	S01FKXH1.ERX1122578.H3K27ac.bwa.GRCh38.20160126.bed	venous blood
chr1	3004017	3004281	4.3370	S01KYMH1.ERX1305354.H3K27ac.bwa.GRCh38.20160701.bed	venous blood
chr1	3004147	3004503	3.8473	S016BDH1.ERX1122502.H3K27ac.bwa.GRCh38.20151026.bed	lymphatic system
chr1	3004858	3005008	4.1496	S0138VH1.ERX1007380.H3K27ac.bwa.GRCh38.20151015.bed	naive B cell
chr1	3005464	3005670	5.4249	S00JFZH1.ERX651406.H3K27ac.bwa.GRCh38.20150527.bed	neutrophilic metamyelocyte
chr1	3012903	3013177	3.5777	S01HHVH1.ERX1384823.H3K27ac.bwa.GRCh38.20160701.bed	venous blood
chr1	3013234	3013578	3.5777	S01HHVH1.ERX1384823.H3K27ac.bwa.GRCh38.20160701.bed	venous blood
chr1	3013754	3013928	5.4357	S00Y05H1.ERX1007422.H3K27ac.bwa.GRCh38.20151015.bed	myeloid cell
chr1	3021290	3021470	5.8729	S005GFH1.ERX943211.H3
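
The regions arrive as one text block with one region per line. The sketch below turns such a block into a list of dictionaries keyed by the requested column names. It assumes that the fields are separated by tabs and appear in the order given in the column string; the function name `parse_regions` is hypothetical. The sample consists of the first two rows of the output above.

```python
def parse_regions(regions_text, column_spec):
    """Split tab-separated region lines into dictionaries of strings."""
    names = column_spec.split(",")
    rows = []
    for line in regions_text.splitlines():
        if not line:
            continue  # skip empty lines
        rows.append(dict(zip(names, line.split("\t"))))
    return rows


sample = (
    "chr1\t3003120\t3004017\t5.2902\t"
    "S016BDH1.ERX1122502.H3K27ac.bwa.GRCh38.20151026.bed\tlymphatic system\n"
    "chr1\t3003650\t3003810\t4.1740\t"
    "S01KXOH1.ERX1384830.H3K27ac.bwa.GRCh38.20160701.bed\tvenous blood"
)

rows = parse_regions(sample, "CHROMOSOME,START,END,SIGNAL_VALUE,@NAME,@BIOSOURCE")
print(rows[0]["@BIOSOURCE"])                          # lymphatic system
print(int(rows[1]["END"]) - int(rows[1]["START"]))    # 160
```

All values are kept as strings here, so numeric columns such as START and END need an explicit conversion before arithmetic.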