# Proof of concept: *PyGMQL*

In [1]:
import gmql as gl

In [2]:
gl.sc.getConf().getAll()

[('spark.driver.host', '192.168.1.132'),
 ('spark.driver.extraClassPath',
  '/home/luca/Scrivania/GMQL/GMQL-Spark/target/GMQL-Spark-4.0.jar'),
 ('spark.rdd.compress', 'True'),
 ('spark.app.id', 'local-1491511308797'),
 ('spark.app.name', 'gmql_spark'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.executor.extraClassPath',
  '/home/luca/Scrivania/GMQL/GMQL-Spark/target/GMQL-Spark-4.0.jar'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.port', '56531')]

## Loading a dataset 

In [7]:
# path of the dataset
input_path = "/home/luca/Documenti/resources/hg_narrowPeaks/"

The library provides the user with a set of different parsers for datasets. In particular, for this demonstration we use the `NarrowPeakParser`.

In [8]:
np_parser = gl.parsers.NarrowPeakParser()

One of the main abstractions of the library is the `GMQLDataset`, the user's main access point to the data. Every data manipulation operation is applied to a `GMQLDataset` and returns a *new* object of the same type.
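
A toy class can illustrate this "new object per operation" behaviour. `ToyDataset` and its `select` method are hypothetical names invented for this sketch and are not part of PyGMQL; the point is only that the original object stays unchanged after an operation.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT the PyGMQL implementation."""

    def __init__(self, rows):
        # keep an immutable copy so later operations cannot alter it
        self._rows = tuple(rows)

    def select(self, predicate):
        # build and return a new object instead of modifying self
        return ToyDataset(r for r in self._rows if predicate(r))

    def count(self):
        return len(self._rows)


original = ToyDataset([1, 2, 3, 4])
evens = original.select(lambda r: r % 2 == 0)

print(original.count())  # 4: the original is untouched
print(evens.count())     # 2
```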

In [9]:
dataset = gl.GMQLDataset(parser=np_parser)

In [10]:
dataset = dataset.load_from_path(path=input_path)

2017-04-06 22:42:23,340 - gmql_logger - INFO - loading metadata
2017-04-06 22:42:25,032 - gmql_logger - INFO - parsing metadata
2017-04-06 22:42:46,878 - gmql_logger - INFO - dataframe construction


100%|██████████| 115/115 [00:37<00:00,  3.04it/s]


2017-04-06 22:43:25,499 - gmql_logger - INFO - loading region data
2017-04-06 22:43:26,067 - gmql_logger - INFO - parsing region data


In [None]:
sample_dataset = dataset.sample(fraction=0.01)

In [None]:
sample_dataset._meta_dataset.count()

In [None]:
sample_dataset._reg_dataset.count()

## Metadata management

Unlike the GMQL query engine, PyGMQL stores metadata directly in local memory as a **pandas dataframe** whose index (`id_sample`) is the sample id generated by the GMQL engine (based on the hash of the name of the file from which the sample comes).
Each column of the dataframe represents one of the attributes found in the dataset, so a sample (a row) may have no values for a given column.

Another important fact is that each cell of the metadata dataframe is a **list**, because an attribute can have multiple values.
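
The layout described above can be mimicked with a small hand-built dataframe. The sample ids and attribute values below are invented, and representing a missing attribute as an empty list is an assumption of this sketch, not necessarily what PyGMQL does.

```python
import pandas as pd

# toy metadata: ids and values are invented for illustration only
meta = pd.DataFrame(
    {
        "antibody": [["CTCF"], ["POLR2A"], ["CTCF", "RAD21"]],
        "cell": [["MCF-7"], ["HeLa-S3"], []],  # empty list = no value (assumption)
    },
    index=pd.Index([101, 102, 103], name="id_sample"),
)

# every cell is a list, so filtering uses a membership test per cell
has_ctcf = meta["antibody"].apply(lambda values: "CTCF" in values)
print(meta[has_ctcf].index.tolist())  # [101, 103]
```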

In [None]:
# for visualization purposes we only show 3 columns of 
# the dataframe (for a total of 115) and we use the 'head' 
# function to show only the first rows of the dataframe
dataset.meta_dataset.head()[['antibody','cell','antibody_lab']]

### Select based on metadata with a logical predicate

We select the samples of the dataset in which 'antibody' has the value 'CTCF'.
The function to apply is `meta_select`, which accepts a generic predicate, that is, an arbitrarily complex function of a metadata row; in this example we use a lambda expression.
The selection affects both the metadata and the region samples.
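
Because the predicate is an ordinary function, it can also be defined with `def` and tried out on its own before being passed to `meta_select`. The sketch below assumes that a row can be indexed by attribute name and yields a list of values, as in the lambda used in this example; the function name and the sample rows are invented.

```python
def is_ctcf_in_mcf7(row):
    """Return True when the sample has antibody CTCF and cell MCF-7."""
    # each metadata cell is a list of values, hence the membership tests
    return "CTCF" in row["antibody"] and "MCF-7" in row["cell"]


# plain dictionaries stand in for metadata rows (invented values)
rows = [
    {"antibody": ["CTCF"], "cell": ["MCF-7"]},
    {"antibody": ["CTCF"], "cell": ["HeLa-S3"]},
    {"antibody": ["POLR2A"], "cell": ["MCF-7"]},
]
print([is_ctcf_in_mcf7(r) for r in rows])  # [True, False, False]
```

Under the same assumption about rows, such a function could be passed directly, for example `dataset.meta_select(is_ctcf_in_mcf7)`.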

In [None]:
filtered_dataset = dataset.meta_select(lambda row: 'CTCF' in row['antibody'])

In [None]:
# visualize only the first rows of the metadata pandas dataframe
filtered_dataset.meta_dataset.head()[['antibody','cell','antibody_lab']]

We can use the function `get_reg_sample(n)` to materialize a small sample of the regions in memory as a pandas dataframe.

In [None]:
filtered_dataset.get_reg_sample(1)

### Project metadata based on an attribute list

Another possible metadata operation is `meta_project`, which, in its simplest form, keeps only the specified columns of the dataframe and (as always) returns a new `GMQLDataset`.

In [None]:
filtered_proj_data = filtered_dataset.meta_project(['antibody', 'cell'])
filtered_proj_data.meta_dataset.head()

### Add a new column

To add a new attribute to the metadata (that is, a new column of the dataframe), call the `add_meta` function, which takes the name of the new attribute and the default value to assign to each sample of the dataset.

In [None]:
filtered_proj_data = filtered_proj_data.add_meta('creator', 'luca')
filtered_proj_data.meta_dataset.head()

In [None]:
# we can visualize all the attribute names
all_attributes = filtered_proj_data.get_meta_attributes()
all_attributes

### Project and also compute new columns based on complex functions

The `meta_project` function can take another argument, a dictionary of the following form:
```
    new_attributes = {
        'new_attribute_name_1' : complex_function_1,
        'new_attribute_name_2' : complex_function_2,
        ...
        
        'new_attribute_name_N' : complex_function_N,
    }
```
This argument enables the user to build new columns/attributes of the metadata dataframe based on the values of the other attributes.
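
One way to picture what such a dictionary does is the following plain-Python sketch. The helper `project_with_new_attributes` is hypothetical and is not how PyGMQL implements `meta_project`; rows are modelled as dictionaries with invented values.

```python
def project_with_new_attributes(rows, attr_list, new_attr_dict):
    """Hypothetical helper, NOT PyGMQL code: keep attr_list, add computed columns."""
    projected = []
    for row in rows:
        # keep only the requested attributes
        new_row = {name: row[name] for name in attr_list}
        for new_name, function in new_attr_dict.items():
            # each function receives the whole original row
            new_row[new_name] = function(row)
        projected.append(new_row)
    return projected


rows = [{"antibody": ["CTCF"], "cell": ["MCF-7"], "lab": ["X"]}]
result = project_with_new_attributes(
    rows,
    attr_list=["antibody", "cell"],
    new_attr_dict={"n_values": lambda row: len(row["antibody"]) + len(row["cell"])},
)
print(result)
# [{'antibody': ['CTCF'], 'cell': ['MCF-7'], 'n_values': 2}]
```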

In [None]:
# define a function that operates on rows of the metadata dataset and 
# gives us the resulting new column value

# in particular this function simply concatenates the lists of 
# antibody and cell values
def complex_function(row):
    x = list(row['antibody'])
    y = list(row['cell'])
    #print("antibody: {}\t cell: {}".format(x, y))
    return x + y

In [None]:
new_attr_dict = {
    'extended' : complex_function
}

extended_dataset = filtered_proj_data.meta_project(attr_list=all_attributes,
                                                   new_attr_dict=new_attr_dict)

In [None]:
extended_dataset.meta_dataset.head()

## _Example: working with metadata_

We demonstrate the use of some of the functions described above with a very simple, artificial example.
We add two new attributes (the same for every sample) holding the birth date and the death date of the patient.
We then generate a third attribute, computed from the other two, that represents the patient's age at death.

In [None]:
from datetime import datetime

born_date = datetime.strptime("30 Nov 1935","%d %b %Y")
death_date = datetime.strptime("30 Nov 1999","%d %b %Y")

In [None]:
example_dataset = filtered_proj_data.add_meta('born_date', born_date)
example_dataset = example_dataset.add_meta('death_date', death_date)
all_attributes = example_dataset.get_meta_attributes()
all_attributes

In [None]:
def calculate_age(row):
    #print(row)
    born_date = row['born_date'][0]
    death_date = row['death_date'][0]
    return (death_date - born_date).days / 365
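
Dividing the number of days by 365 ignores leap days, so the result is slightly larger than the number of calendar years. If whole years are wanted, a variant such as the following could be used; `whole_years_between` is a hypothetical helper introduced only for this sketch.

```python
from datetime import datetime


def whole_years_between(start, end):
    """Count completed calendar years between two dates."""
    years = end.year - start.year
    # subtract one year if the anniversary has not been reached yet
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


born_date = datetime(1935, 11, 30)
death_date = datetime(1999, 11, 30)

print(whole_years_between(born_date, death_date))  # 64
print((death_date - born_date).days / 365)         # slightly above 64 (16 leap days)
```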

In [None]:
new_attr_dict = {
    'age' : calculate_age
}
example_dataset = example_dataset.meta_project(attr_list=all_attributes,
                                               new_attr_dict=new_attr_dict)

In [None]:
example_dataset.meta_dataset.head()

## Region management

### Region selection

In [None]:
ctcf_mcf7_dataset = dataset.meta_select(lambda attr: 'CTCF' in attr['antibody']
                                        and 'MCF-7' in attr['cell'])\
                           .reg_select(lambda reg: reg['pValue'] < 2)

In [None]:
ctcf_mcf7_dataset.get_reg_sample(20)