# PyGML: a Python API to the GMQL system

The user first imports the library into their Python environment.

In [1]:
import gmql as gl

## Loading datasets

Set the input path from which the dataset is read and the output path where the results will be materialized.

In [2]:
# loading of a dataset
data_path = "/home/luca/Documenti/resources/hg_narrowPeaks"
output_path = "/home/luca/Documenti/resources/result"

The user then chooses a parser for the data. The available parsers are those already provided by the GMQL system.
For now, users cannot define their own parser, because data loading is performed by the GMQL system itself.

In [3]:
parser = gl.parsers.NarrowPeakParser()

data = gl.GMQLDataset(parser=parser)\
                .load_from_path(path=data_path)
    
data.show_info()

2017-04-20 15:46:01,501 - PyGML logger - INFO - Loading meta data from path /home/luca/Documenti/resources/hg_narrowPeaks


100%|██████████| 2000/2000 [00:23<00:00, 83.57it/s] 


2017-04-20 15:46:26,964 - PyGML logger - INFO - dataframe construction


100%|██████████| 115/115 [00:49<00:00,  2.58it/s]

GMQLDataset
	Parser:	NarrowPeakParser
	Index:	0





When the user loads a dataset, almost nothing is actually computed. Every operation except `materialize` only declares an *intention* to act; the computation starts when `materialize` is called.
The only thing `load_from_path` really does is load all the metadata into memory (as a Pandas dataframe) so that the user can explore it.
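
The deferred-execution idea can be illustrated with a toy class. This is only a sketch of the general pattern, not PyGML's implementation: `LazyDataset`, its `select` method, and the in-memory list standing in for the GMQL engine are all hypothetical names introduced for the example.

```python
class LazyDataset:
    """Toy dataset that records operations instead of running them (hypothetical)."""

    def __init__(self, rows, plan=()):
        self._rows = rows          # stand-in for data held by the real engine
        self._plan = tuple(plan)   # queued operations, not yet executed

    def select(self, predicate):
        # Only record the intention; return a new dataset with a longer plan.
        return LazyDataset(self._rows, self._plan + (predicate,))

    def materialize(self):
        # Execution happens here: every queued predicate is applied in order.
        result = self._rows
        for predicate in self._plan:
            result = [row for row in result if predicate(row)]
        return result


calls = []


def is_k562(row):
    calls.append(row["cell"])      # lets us observe when the predicate runs
    return row["cell"] == "K562"


data = LazyDataset([{"cell": "K562"}, {"cell": "NHEK"}])
filtered = data.select(is_k562)
assert calls == []                 # nothing has been computed yet
result = filtered.materialize()    # the work happens now
print(result)
```

Because each operation returns a new object carrying the plan, the engine sees the whole query before running it, which is what makes global optimization possible.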

In [4]:
# visualize the metadata
# for visualization purposes we only show the first rows and only 4 columns
data.meta.head()[['ID','antibody','antibody_lab','cell']]

| id_sample | ID | antibody | antibody_lab | cell |
|---|---|---|---|---|
| -9217770635305287634 | [166] | [EZH2_(39875)] | [Bernstein] | [NHEK] |
| -9207218392883319159 | [831] | [CTCF] | [Myers,] | [GM19240] |
| -9205453250035453609 | [1246] | [MAZ_(ab85725)] | [Snyder] | [IMR90] |
| -9174810527625421920 | [1194] | [Znf143_(16618-1-AP)] | [Snyder] | [HeLa-S3] |
| -9158105103324649074 | [698] | [TCF7L2] | [Farnham] | [MCF-7] |


## Selection on metadata

This section demonstrates how the user can build a selection predicate.
The design came out of a lot of reasoning and experimenting with Python operator overloading, and I believe it is the best trade-off between usability, ease of writing, and expressive power.
The user can express arbitrarily complex conditions using the usual logical operators.
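
To make the overloading idea concrete, here is a minimal sketch of how comparison and logical operators can build a predicate object instead of returning a boolean. The `Field` and `Predicate` classes are hypothetical and are not PyGML's `MetaField`; they only show the general technique, and the predicate is evaluated on a plain dictionary rather than by GMQL.

```python
class Predicate:
    """Node of a small expression tree (hypothetical, for illustration only)."""

    def __init__(self, text, test):
        self.text = text   # human-readable form of the condition
        self.test = test   # callable: metadata dict -> bool

    def __and__(self, other):
        return Predicate(f"({self.text} AND {other.text})",
                         lambda m: self.test(m) and other.test(m))

    def __or__(self, other):
        return Predicate(f"({self.text} OR {other.text})",
                         lambda m: self.test(m) or other.test(m))

    def __invert__(self):
        return Predicate(f"(NOT {self.text})", lambda m: not self.test(m))


class Field:
    """Named attribute whose comparisons build Predicate objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Return a Predicate rather than a bool: this is the overloading trick.
        return Predicate(f"{self.name} == {value!r}",
                         lambda m: m[self.name] == value)

    def __ge__(self, value):
        return Predicate(f"{self.name} >= {value!r}",
                         lambda m: m[self.name] >= value)


condition = (Field("cell") == "K562") & (Field("antibody") == "H3K4me3")
print(condition.text)

sample = {"cell": "K562", "antibody": "H3K4me3"}
print(condition.test(sample))
```

The parentheses around each comparison are required: in Python `&` binds more tightly than `==`, so without them the expression would be grouped incorrectly.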

In [5]:
# selects all the data coming from cell K562 and having antibody H3K4me3
condition = (data.MetaField("cell") == 'K562') & (data.MetaField("antibody") == 'H3K4me3')

filtered_data = data.meta_select(predicate=condition)  # nothing is computed until materialize

Here we materialize the results to `output_path` and automatically load them into memory as two Pandas dataframes for local processing, one for regions and one for metadata. The function calls the GMQL system directly, which performs all the operations and optimizations.

In [6]:
filtered_data = filtered_data.materialize(output_path=output_path)

2017-04-20 15:48:18,498 - PyGML logger - INFO - Loading meta data from path /home/luca/Documenti/resources/result/exp/


100%|██████████| 12/12 [00:00<00:00, 570.94it/s]

2017-04-20 15:48:18,607 - PyGML logger - INFO - dataframe construction



100%|██████████| 94/94 [00:00<00:00, 140.00it/s]

2017-04-20 15:48:19,298 - PyGML logger - INFO - Loading region data from path /home/luca/Documenti/resources/result/exp/



100%|██████████| 14/14 [00:13<00:00,  1.12it/s]


Let's check that the result is consistent with the query.

In [7]:
filtered_data.meta.head()[['cell', 'antibody']]

| id_sample | cell | antibody |
|---|---|---|
| -5359091651622680202 | [K562] | [H3K4me3] |
| -5125657087399478862 | [K562] | [H3K4me3] |
| -3488503578342308248 | [K562] | [H3K4me3] |
| -3421537147502246704 | [K562] | [H3K4me3] |
| -2741070829214587910 | [K562] | [H3K4me3] |
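
The same check can be made programmatically rather than by eye. The sketch below assumes that each metadata cell is a Python list of values, as the bracketed output suggests. To stay self-contained it rebuilds the five rows shown above by hand instead of calling the library.

```python
import pandas as pd

# Rows copied from the output above; cells are assumed to be lists of values.
meta = pd.DataFrame(
    {
        "cell": [["K562"]] * 5,
        "antibody": [["H3K4me3"]] * 5,
    },
    index=pd.Index(
        [
            -5359091651622680202,
            -5125657087399478862,
            -3488503578342308248,
            -3421537147502246704,
            -2741070829214587910,
        ],
        name="id_sample",
    ),
)

# A sample passes if the expected value appears among the attribute's values.
cell_ok = meta["cell"].apply(lambda values: "K562" in values)
antibody_ok = meta["antibody"].apply(lambda values: "H3K4me3" in values)

all_match = bool((cell_ok & antibody_ok).all())
print(all_match)
```

On a real result the same two `apply` calls would presumably run on `filtered_data.meta` directly, over all samples rather than only the first five.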


In [8]:
filtered_data.regs.head()

| id_sample | chr | name | pValue | peak | qValue | score | signalValue | start | stop | strand |
|---|---|---|---|---|---|---|---|---|---|---|
| 7671500149727781280 | chr1 | . | 113.109 | -1.0 | -1.0 | 0.0 | 110.0 | 137700 | 137850 | * |
| 7671500149727781280 | chr1 | . | 73.0561 | -1.0 | -1.0 | 0.0 | 81.0 | 138420 | 138570 | * |
| 7671500149727781280 | chr1 | . | 73.0561 | -1.0 | -1.0 | 0.0 | 56.0 | 138680 | 138830 | * |
| 7671500149727781280 | chr1 | . | 44.4925 | -1.0 | -1.0 | 0.0 | 78.0 | 138960 | 139110 | * |
| 7671500149727781280 | chr1 | . | 80.5307 | -1.0 | -1.0 | 0.0 | 62.0 | 139320 | 139470 | * |
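
Once the regions are in a Pandas dataframe, local processing is ordinary dataframe work. As a small sketch, the block below computes the region lengths and a per-chromosome summary. It uses only the five rows printed above, rebuilt by hand with a subset of the columns, so the numbers describe those rows and not the whole result.

```python
import pandas as pd

# Five region rows copied from the output above (subset of the columns).
regs = pd.DataFrame(
    {
        "chr": ["chr1"] * 5,
        "start": [137700, 138420, 138680, 138960, 139320],
        "stop": [137850, 138570, 138830, 139110, 139470],
        "signalValue": [110.0, 81.0, 56.0, 78.0, 62.0],
    }
)

# Region length in base pairs.
regs["length"] = regs["stop"] - regs["start"]

# Per-chromosome count, mean length, and mean signal.
summary = regs.groupby("chr").agg(
    n_regions=("length", "size"),
    mean_length=("length", "mean"),
    mean_signal=("signalValue", "mean"),
)
print(summary)
```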


## Selection on region data

In [9]:
output_path = "/home/luca/Documenti/resources/result1"
# selects the regions on chr9 that start at or after 138680 and stop at or before 142000
condition = (data.RegField("chr") == 'chr9') & (data.RegField("start") >= 138680) & (data.RegField("stop") <= 142000)

filtered_data_regs = data.reg_select(predicate=condition)

In [10]:
filtered_data_regs = filtered_data_regs.materialize(output_path=output_path)

2017-04-20 15:57:51,237 - PyGML logger - INFO - Loading meta data from path /home/luca/Documenti/resources/result1/exp/


100%|██████████| 1999/1999 [00:00<00:00, 3095.86it/s]


2017-04-20 15:57:54,345 - PyGML logger - INFO - dataframe construction


100%|██████████| 115/115 [00:45<00:00,  2.17it/s]


2017-04-20 15:58:40,267 - PyGML logger - INFO - Loading region data from path /home/luca/Documenti/resources/result1/exp/


100%|██████████| 14/14 [00:00<00:00, 435.43it/s]


In [11]:
filtered_data_regs.show_info()

GMQLDataset
	Parser:	NarrowPeakParser
	Index:	2


In [12]:
filtered_data_regs.regs.head()

| id_sample | chr | name | pValue | peak | qValue | score | signalValue | start | stop | strand |
|---|---|---|---|---|---|---|---|---|---|---|
| 1908981323311506945 | chr9 | . | -1.0 | -1.0 | -1.0 | 0.0 | 11.0 | 140000 | 140150 | * |
| -3837274876828151293 | chr9 | . | 19.7974 | -1.0 | -1.0 | 0.0 | 43.0 | 140000 | 140150 | * |
| -6369496270910918077 | chr9 | . | 9.18639 | -1.0 | -1.0 | 0.0 | 14.0 | 140000 | 140150 | * |
| -4672557036496796443 | chr9 | . | 5.6505 | -1.0 | -1.0 | 0.0 | 7.0 | 140000 | 140150 | * |
| 2402220549503469350 | chr9 | . | 38.6793 | -1.0 | -1.0 | 0.0 | 67.0 | 139960 | 140110 | * |


In [15]:
filtered_data_regs.meta.head()[['ID','antibody','antibody_lab','cell']]

| id_sample | ID | antibody | antibody_lab | cell |
|---|---|---|---|---|
| -9214550995092909750 | [731] | [CTCF] | [Myers,] | [GM19239] |
| -9187999009443647067 | [1614] | [] | [] | [LHCN-M2] |
| -9173927680047723274 | [151] | [EZH2_(39875)] | [Bernstein] | [K562] |
| -9161798996683708116 | [1258] | [BRF2] | [Struhl] | [K562] |
| -9159107200289324465 | [1823] | [H3K4me3] | [Bernstein,] | [HUVEC] |
