fetch CLI utility fails to download the molecular signatures database
- Work around: Manually download the hallmark gene set from the MsigDB website and move it to app/data/raw/ with the filename gene_set.entrez.gmt
- RAM likely too small on client machine for some processing steps. This mainly affects the ETL step where biased walks are generated. The networkx graph is huge! It takes roughly 40GB of RAM. This is a point for future work.
- Work around: Ping me an email and I will make and send you whatever data assets you need.
etl process all has a bug which I haven't identified yet
- Work around: Process each data asset individually
- Project description
- Features & How it works
- Requirements
- Installation
- Usage
- Project Structure
- Documentation
- Theory
- How to approach model development
- Examples
- Contact
- Credits
Graph Studio is an experiment/reproducibility framework for creating node embeddings. Its main contribution is the automatic management of the data assets used to train node embedding algorithms with SGD. The particular values used to generate datasets are defined in an experiment object, which is recorded in a config file and managed using a CLI utility. See the examples to find out more!
This project still has a massive amount of headroom. There are a heap more features that I would like to include.
This project grew out of my curiosity about AI x Biology and learning about bioinformatics. I started out with an open course from Stanford which introduced me to some of the current methods used in bioinformatics. From there I began implementing the foundational algorithms and testing them on biological datasets. This was initially in a style not dissimilar to a notebook, as you so often see in ML.
The problem with this is that you often find yourself relying on the sequential execution of code. It's also not as common to persist intermediate data assets during processing, something which quickly became a problem during development. Additionally, I wasn't making full use of the computational resources on my machine. So I did a refactor, and in doing so I created Graph Studio, a nascent reproducibility/experimentation framework, which aims to catalogue and cache data assets for later training. The fundamental insight here is that
You can do experiments, both in the model and the data.
So far, this project helps the user manage the parameterisation of intermediate data assets by abstracting away their creation, and providing archive utilities.
This consists of a CLI utility which fetches the BioGrid dataset of protein-protein interactions (PPIs) and the MsigDB gene annotation dataset, and stores these as raw inputs to the ETL pipeline.
Note: This is currently semi-functional, see bugs and issues for more details.
The parameter values which are used to describe both models and data are encapsulated in objects called experiments. These are given whatever names you like and consist of key:value pairs which are used in creating derived data products and in training models. This is performed by a config management tool and users interact with this via a CLI utility.
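As an illustration, an experiment entry in a config file might look something like this (the actual schema and names here are assumptions, not the project's real file):

```yaml
# Hypothetical config layout -- the real schema may differ
data:
  current: my_experiment          # experiment the ETL toolkit will use
  experiments:
    default:
      p: 1.0
      q: 1.0
    my_experiment:                # a user-defined experiment
      p: 0.3
      q: 0.9
```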
As alluded to a moment ago, some data products are derived from other raw or intermediate data products. This naturally results in dependency hierarchies, which have non-trivial consequences for the final node embeddings. To manage dependencies and prevent duplication of data assets, an archive manager keeps track of what data assets exist and what parameters were used to create them. This is achieved by assigning each data asset a UUID which is created from its parameter values. This is a cross-cutting utility and is not directly accessible by the user.
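A minimal sketch of how a deterministic ID can be derived from parameter values (the project's actual scheme is not shown here, so the function name and hash details are assumptions):

```python
import hashlib

def asset_id(params: dict) -> str:
    """Derive a deterministic ID from parameter values.

    Sorting the items first makes the ID independent of insertion
    order, so {'p': .3, 'q': .9} and {'q': .9, 'p': .3} map to the
    same data asset and no duplicate is created.
    """
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(canonical.encode()).hexdigest()[:14]
```

The key property is that the same parameter set always produces the same ID, so the archive manager can check for an existing asset before creating a new one.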
There are various steps along the journey from raw data to training dataset. Each step along the way is handled by a class designed especially for that step. Each class shares a common set of methods:
- process
- describe
- head
- validate
which are presented to the user via a CLI utility. These methods are designed to perform the processing step, provide reassurance that the result is valid and give the user a peek at the result and any statistics about the data.
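As a rough sketch, a common interface for these processing classes might look like the following (the class name and docstrings are assumptions; only the four method names come from the list above):

```python
from abc import ABC, abstractmethod

class ProcessingStep(ABC):
    """Common interface shared by each ETL processing class."""

    @abstractmethod
    def process(self):
        """Run the processing step and persist the result."""

    @abstractmethod
    def describe(self):
        """Print summary statistics about the result."""

    @abstractmethod
    def head(self, n: int = 5):
        """Show the first n records of the result."""

    @abstractmethod
    def validate(self):
        """Check the result for missing values and other anomalies."""
```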
We combine the data definitions and model definitions by training the model on the data. This is performed by another CLI utility and is described by experiments for a given model. This is organised as three layers. First, there is the CLI layer. Second, there is the training layer which aligns the model with the appropriate data asset. Finally, the third layer consists of the model definitions which are defined by the user. See the docs and examples to see how this works in practice.
- ~20GB Disk
- ~40GB RAM
- Docker v20.10.0 and up
- Docker-compose v2.5.0 and up (probably can get away with v1.28.0+)
- An MsigDB account email address
Although this project is written in Python, you don't actually need to worry about installing it or managing any packages because this is all taken care of in the Docker image.
You will need to ensure that you have docker installed. Some useful guides for this can be found here:
- Official docker website
You will also need to ensure that you have docker compose installed:
- Official docker compose website
- How to upgrade docker-compose to latest version
If you want to use GPU-accelerated training (and I haven't checked whether training without it works yet), then you will need to install the NVIDIA Container Toolkit. This lets the container access any GPUs available on the host machine:
In a working directory of your choice, clone the repo and enter that directory. For example
cd ~/Desktop
git clone https://github.com/BMDuke/GraphStudio.git
cd GraphStudio
Once you've cloned the project, you need to build the Docker image. This requires a fair few heavy libraries and may take some time.
sudo docker build . -t graphstudio
First you will want to test whether the nvidia-docker installation has been successful. To do this, run
sudo docker-compose run test-nvidia
and you should get the nvidia-smi print out. After that, take the container down
sudo docker-compose down test-nvidia
After that you are ready to spin up the container with
sudo docker-compose run graphstudio bash
and you should now be inside the container with a command prompt something like
root@h93d9dkw:/graphstudio#
Now you're ready to get started embedding some nodes 🤘.
To get started you are going to need to fetch the raw data. This project uses data from two public bioinformatics repositories, BioGrid and MsigDB. In order to download it, you need to have created an account with MsigDB, because you will need to use the email address registered with the account to download the data. To download the data, call the fetch command, passing your email as an argument to the -e flag.
fetch -e=your@email.goes.here
Note: This is currently semi-functional, see bugs and issues for more details.
Before you run the ETL toolkit over the raw data, you will want to think about how you want to parameterise your data. To understand how the choice of parameters affects the data, have a look at the explanation in the theory section. But as an example, suppose we want to generate biased walks with p equal to 0.3 and q equal to 0.9.
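To make p and q concrete, here is a minimal sketch of how node2vec-style transition weights are biased by these two parameters. This is a simplification (unweighted graph, plain dicts instead of the project's networkx graph), and the function name is an assumption, not the project's code:

```python
def transition_weights(prev, curr, neighbors, p, q):
    """Unnormalised transition weights for the next step of a biased walk.

    prev      -- node visited on the previous step of the walk
    curr      -- current node
    neighbors -- dict mapping each node to its set of neighbours
    """
    weights = {}
    for nxt in neighbors[curr]:
        if nxt == prev:                 # backtracking to the previous node
            weights[nxt] = 1.0 / p
        elif nxt in neighbors[prev]:    # shared neighbour: stay local (BFS-like)
            weights[nxt] = 1.0
        else:                           # move outward (DFS-like)
            weights[nxt] = 1.0 / q
    return weights

# Triangle a-b-c plus a pendant node d attached to c
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
w = transition_weights(prev="a", curr="c", neighbors=g, p=0.3, q=0.9)
# Returning to a is weighted 1/p, b is a shared neighbour (weight 1.0),
# and d moves outward so it is weighted 1/q.
```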
The first thing to do is to create a new config file. This will hold all the experiments you are going to perform in your research. Typically a research project won't need more than one config file.
conf new humanPPI
Next, you want to create a new experiment for the data. In a real-life scenario, you will want to design a schema of parameter combinations so that you can produce an adequate selection of combinations on which to base your experiments. However, for the sake of this example, suppose we want to run an experiment called more_out_than_in with p=.3 and q=.9 like we previously said. So, let's do it
conf new experiment data more_out_than_in p=.3 q=.9
Here, we are saying we want the config utility to create a new experiment in the data called more_out_than_in with parameter values p=.3 and q=.9.
Note: Don't use hyphens in experiment names; only use underscores. Hyphens will cause the type-checking functionality to fail downstream.
To check that worked, let's have a peek at the config to make sure.
conf show
If we look at the data key and then have a look under the experiments header, we can see that, indeed, we have created a new experiment called more_out_than_in with the correct values. You will notice that there are heaps of other experiment configurations in the file. They are all default configurations. These are the default values that would be used to process the data and the default training combinations. Also, note that there is the current field under the data key. This is the experiment that will be used by the ETL toolkit, unless specified otherwise.
We'd better change it in case we forget to tell the ETL tools which experiment we want to use
conf use experiment data more_out_than_in
This is saying to the config utility, use the experiment called more_out_than_in as a description for the data. Let's check that worked
conf show
Now, the current experiment specified for the data is more_out_than_in. Super.
Now we are ready to process the raw data and produce our training data. For a full breakdown of the processing steps and the order they must occur in, see the docs.
The first thing we need to do is select PPIs from the BioGrid dataset that are between humans only. Then select the format of gene identifier that we want to work with for this project (this is Entrez gene IDs, btw). Luckily, all this is taken care of by the BioGrid ETL utility. So all we have to do is tell the ETL CLI tool to process the BioGrid data and we're sweet.
etl process biogrid
Now, this should have loaded in the raw BioGrid data, processed it and saved the result for us to use later. Not only that, but it should be archived and given a unique identifier based on the values of the parameters we have used in the experiment definition more_out_than_in. Let's have a look
etl ls
According to me, this has the id 0df726c9cc7bd8. Additionally, we can see from the table that the values of .3 and .9 were used for p and q respectively. Good stuff.
Ok, but what does the processed data look like? Let's have a peek
etl head biogrid
Looking good. Now, I wonder how many connections there are in this network. And how many objects there are interacting. Let's see what we can find out.
etl describe biogrid
Ok, that's heaps of info. Let's chew the fat on that another time. Better check there are no missing values or any other weird things going on in this dataset.
etl validate biogrid
Ok, sweet as, we're good to move onto the next step.
The next data asset we need to make is the transition probability graph for the PPI network. But that might take a while. And I want to finish writing this documentation before I go to bed tonight, so we might take a quick shortcut. BUT it's important that you know that we can repeat the same steps illustrated above for any of the data assets we need to make in this project. Each one can be processed, inspected (head), described and validated. But for the sake of brevity, we are moving right along.
etl process all
Now that we have processed all the data, we are ready to train some models on it. If you have a peek in the models directory, you will see that there are subdirectories that correspond to various different classes of models. At present, these are binaryClasifier, multiLabelClassifier and node2vec. The directory node2vec/ is the one where we define the various embedding architectures. A look in there will reveal that there is a default.py script which contains the definition for the default embedding architecture. We will use the default architecture to start with.
A key point here is that trained embedding models are dependent on both the architecture and the data. And trained multi-label classifiers and binary classifiers are dependent on their architecture, a particular encoder (trained embedder) and the data. Again, here we have a hierarchy of dependencies which we must manage, because it has non-trivial consequences for the final model.
So to do this we again turn to our configuration file to manage these for us by defining experiments. Let's start by defining an experiment to train the encoder (node2vec) and telling the config manager to use it
conf new experiment node2vec my_first_encoder data=more_out_than_in epochs=1
conf use experiment node2vec my_first_encoder
conf show
Sweet, that looks about right. Here, we are telling the config utility to create a new experiment for node2vec called my_first_encoder, and that it should be trained on the data described by the experiment more_out_than_in for 1 epoch.
Important: the value you pass to the data argument, is the name of the experiment that was used to generate the data.
Then we are telling config to use that node2vec experiment unless otherwise specified.
While we're here, we could also define an experiment for a model which uses the trained encoder to embed its inputs. Let's create a new multi-label classifier experiment
conf new experiment multiLabelClassifier mlc_from_vec encoder=my_first_encoder
conf use experiment multiLabelClassifier mlc_from_vec
Here, we are telling the config to create a new experiment which is for a multiLabelClassifier called mlc_from_vec which uses a trained encoder, described by the experiment my_first_encoder.
Important: the value you pass to the encoder argument, is the name of the experiment that was used to train the encoder.
Looking in the project's root directory, there are two main subdirectories called app/ and models/. The app/ subdirectory contains all the source code for the project and any data assets and model weights which are generated. Users aren't expected to wade through here if they don't want to. Rather, users can put their model definitions in the models/ directory under the appropriate subdir. These are then made available to the app when the Docker container runs. This allows the user to include various architectures using a plugin-style pattern.
We have previously seen the default architectures. The default architectures are alright for illustrative purposes, but it's likely that you will want to define your own custom architecture to experiment with creating information-dense embeddings and other models. Suppose you have designed a new encoder architecture, called deep_encoder. This inherits from tensorflow.keras.Model and describes the computation in the call() method. For illustration let's pretend we have a deep_encoder (we will really just copy the default, but for illustration...). Outside of the Docker container, in the host environment
cp models/node2vec/default.py models/node2vec/deep_encoder.py
Now fire the container up and tell the config manager about the new architecture by identifying it in an experiment
sudo docker-compose run graphstudio bash
conf new experiment node2vec custom_encoder architecture=deep_encoder data=more_out_than_in
conf use experiment node2vec custom_encoder
conf show
Here, we have told the config manager to create a new experiment for the node2vec embedding model, called custom_encoder. We have told it to use the architecture defined in the deep_encoder file and to train it on the dataset described by the more_out_than_in experiment.
Note: The value of architecture must be the same as the filename given to the model definition. This is how the plugin architecture works.
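The plugin pattern described here (selecting a model definition by filename) can be sketched roughly like this. The loader function, its signature and the directory default are all assumptions for illustration, not the project's actual code:

```python
import importlib.util
from pathlib import Path

def load_architecture(name: str, models_dir: str = "models/node2vec"):
    """Import <models_dir>/<name>.py as a module, plugin-style.

    The experiment's 'architecture' value selects the file, which is
    why the experiment value and the filename must match exactly.
    """
    path = Path(models_dir) / f"{name}.py"
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)   # run the file to define its classes
    return module
```

A training layer built this way never needs to know the architecture in advance; it just resolves whatever name the experiment specifies.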
Now we are ready to train the model.
All the hard work has now been done and we are ready to train the model. This is simply done with the command
train train node2vec
After training, the model is saved along with its weights and checkpoints. Like the data assets, it is saved under a UUID which is generated based on the parameters of the data it is trained on, the architecture used and training conditions used. To have a look at what models have been trained and the parameters that were used to train them
train ls
The project is structured so that it has two main sections. In the GraphStudio/ root dir, there are two subdirectories. The app/ subdirectory contains all the source code for the project and any data assets and model weights which are generated. Users aren't expected to wade through here if they don't want to. Rather, users can put their model definitions in the models/ directory under the appropriate subdir. This is the project structure in the host context
When you spin up the container, the project has a slightly different structure. By looking through the Dockerfile, you can see that the project structure in the container context is as if app/ was the working directory in the host context. It just copies the contents of app/ to the working directory in the container.
By looking through the docker-compose file, we see that the cache/, conf/ and data/ directories on the host machine are mounted into the container. This means that all data generated within the container is persisted on the host machine.
We can also see that each of the subdirectories in the models/ directory on the host machine are mounted to the corresponding subdirectory in the app/source/models/ directory in the container. This provides users with an easy and less-error prone way to include custom architectures.
Note: This means that you can have your custom architecture open in a text editor on your host machine and make changes which will instantly be reflected in the container environment. This may be useful for model prototyping and development.
Important: Make sure you create custom models in the host context, because if you create the file within the container context it will be read-only on the host machine.
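The mounts just described might look something like this in a compose file. This is illustrative only; the service name, image name and container paths are assumptions, not the project's actual docker-compose.yml:

```yaml
services:
  graphstudio:
    image: graphstudio
    volumes:
      # persist generated assets on the host
      - ./app/cache:/graphstudio/cache
      - ./app/conf:/graphstudio/conf
      - ./app/data:/graphstudio/data
      # plugin-style model definitions, editable from the host
      - ./models/node2vec:/graphstudio/source/models/node2vec
```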
This is a CLI utility which fetches data from two public bioinformatics repositories, BioGrid and MsigDB. It fetches both datasets at once, so they can't be retrieved individually.
Important: The email address that is associated with your MsigDB account is required to download the data.
From within the container
fetch -e=<your email> [-v=<biogrid version>]
---------------------------------------------
Args:
-e Email address associated with MsigDB
--email account. Required.
-v Biogrid version to download. Default is
--version 4.4.207. Optional.
This is a CLI utility that allows users to define experiments that describe how to process data assets and train models.
From within the container
conf new [experiment] [asset] name [**params]
---------------------------------------------
Creates a new config file or experiment
Args:
experiment Directive which indicates to the config
manager whether the user wants to create
a new config file or an experiment. This is
passed as a literal 'experiment'. If this
is provided then an asset must also be given
to indicate what data/model the experiment
is for. If this is omitted, a new config
file will be created.
asset Name of the project resource that the
experiment is describing. This must match
top level keys in the config file eg.
node2vec, multiLabelClassifier etc...
name The name of the config file / experiment.
This shouldn't contain hyphens. Required.
params This then accepts a variable number of
kwargs. These should be the keys associated
with the resource being created. They need the
format 'key=value' for example 'p=.5'.
conf use [experiment] [asset] name
---------------------------------------------
Switches between active config file / experiment.
Args:
experiment Directive which indicates to the config
manager whether the user wants to switch to
a particular config file or a particular
experiment within the current config file.
This is passed as a literal 'experiment'.
If this is passed then it must be accompanied
with a value for asset also.
asset The name of the project resource whose active
experiment is being changed. This must match
top level keys in the config file eg.
node2vec, multiLabelClassifier etc...
name The name of config file / experiment that should
be activated.
conf show [name]
---------------------------------------------
Prints the current config file or a config file given by 'name'
Args:
name The name of the config file to display. Optional.
If no name is provided then this defaults to current.
conf current
---------------------------------------------
Prints the name of the current config file
conf edit [experiment] [asset] name [**params]
---------------------------------------------
Edit the current values of a config file, or an experiment within a config
file.
Args:
experiment Directive which indicates to the config
manager whether the user wants to create
a new config file or an experiment. This is
passed as a literal 'experiment'. If this
is provided then an asset must also be given
to indicate what data/model the experiment
is for. If this is omitted modifications will
be made to the top level of the config file.
asset Name of the project resource that is being
modified. This must match top level keys in
the config file eg. node2vec,
multiLabelClassifier etc...
name The name of the config file / experiment.
params These are kwargs of the parameters you would like
to edit, along with their new values. They need
the format 'key=value', for example 'p=.5'.
conf delete [experiment] [asset] name
---------------------------------------------
Delete a config file, or an experiment within a config file.
Args:
experiment Directive which indicates to the config
manager whether the user wants to delete
a particular config file or a particular
experiment within the current config file.
This is passed as a literal 'experiment'.
If this is passed then it must be accompanied
with a value for asset also.
asset The name of the project resource to be deleted.
This must match top level keys in the config
file eg. node2vec, multiLabelClassifier etc...
name The name of config file / experiment that should
be deleted.
This is a CLI utility that allows users to process the raw data into derived data assets and finally training datasets. There is an order in which data assets must be produced as dependencies exist among them. Here, you can see the order in which they should be created, along with the parameter values that have been used to create them.
Raw Data:
| Name | Description | Dependencies |
|---|---|---|
| MsigDB | Gene annotations. Used to create a multi-labelled dataset | None |
| Biogrid | Protein-protein dataset. Used to build interaction graph | None |
Processed Data:
| Name | Description | Dependencies |
|---|---|---|
| Gene IDs | This is a processed version of the MSigDB dataset. | None |
| Biogrid | Processes the raw biogrid data so that it only contains human-human interactions and gene id format is entrez | version |
| Transition Probabilities | Precalculated transition probabilities required to sample biased walks from the PPI graph. This is stored as a pickled nx graph and has a massive memory footprint. | version, p, q |
| walks | This is a dataset consisting of the biased walks sampled from the PPI graph. This indicates the topology of the graph. The nodes that are visited at each step of the walk are selected based on the transition probabilities computed at the previous step. | version, p, q, num_walks, walk_length |
| Skipgrams | This is a training dataset used to train the encoder for node embeddings. It is a TFRecords dataset that contains positive and negative skipgrams generated from the random walks | version, p, q, num_walks, walk_length, negative_samples, window_size |
| Vertices | This is a training dataset used to train a multi-label classifier which takes entrez gene IDs as inputs. It is a TFRecords dataset. | None |
| Edges | Coming soon... | version |
A more complete description of each of the parameters can be found in the theory section, but a brief definition of their responsibilities is given here.
Parameters:
| Name | Description |
|---|---|
| version | The version of the biogrid dataset that was used to build the PPI graph |
| p | Return parameter. Used to generate biased walks. Controls the likelihood of immediately revisiting the previous node in the walk (lower p makes backtracking more likely) |
| q | In-out parameter. Used to generate biased walks. Controls whether the agent explores in a more BFS-like (q > 1) or DFS-like (q < 1) manner |
| num_walks | The number of times a biased walk should be initiated from each node |
| walk_length | The number of steps taken for each walk |
| window_size | The size of the window from which skipgrams are generated |
| negative_samples | The number of negative samples that should be generated for each positive sample. This is used to optimise training. |
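As a rough illustration of how window_size and negative_samples interact when skipgrams are built from a walk, here is a simplified pure-Python sketch (not the project's TFRecords pipeline; function name and return shape are assumptions):

```python
import random

def skipgrams(walk, window_size, negative_samples, vocab):
    """Generate (target, context, label) triples from one walk.

    Pairs inside the window get label 1; for each positive pair,
    `negative_samples` random nodes are drawn from the vocabulary
    as label-0 contexts.
    """
    examples = []
    for i, target in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if i == j:
                continue
            examples.append((target, walk[j], 1))        # positive pair
            for _ in range(negative_samples):
                examples.append((target, random.choice(vocab), 0))
    return examples
```

Note how the number of training examples scales with num_walks, walk_length, window_size and negative_samples together, which is why these parameters all feed into the dataset's UUID.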
From within the container
etl process dataset[=experiment_name]
---------------------------------------------
Process a given data resource.
Args:
dataset The data resource to process. This will be one of the ones
given in the names column of the processed data table. This
also takes a special value 'all' which will attempt to process
all data resources for a given experiment.
experiment_name If provided, it specifies which experiment definition
should be used to process the data. If this is not provided
then whatever experiment is provided by the 'current' key in
the config file will be used.
etl describe dataset[=experiment_name]
---------------------------------------------
Describe a given data resource.
Args:
dataset The data resource to describe. This will be one of the ones
given in the names column of the processed data table.
experiment_name If provided, it specifies which experiment definition
should be used to describe the data. If this is not provided
then whatever experiment is provided by the 'current' key in
the config file will be used.
etl head dataset[=experiment_name] [n=value]
---------------------------------------------
Print the top n rows of a given data resource.
Args:
dataset The data resource to display. This will be one of the ones
given in the names column of the processed data table.
experiment_name If provided, it specifies which experiment definition
for the data to be displayed. If this is not provided
then whatever experiment is provided by the 'current' key in
the config file will be used.
n The number of records to be printed. Default is 5. Optional.
etl validate dataset[=experiment_name]
---------------------------------------------
Validate a given data resource.
Args:
dataset The data resource to validate. This will be one of the ones
given in the names column of the processed data table.
experiment_name If provided, it specifies which experiment definition
for the data to be validated. If this is not provided
then whatever experiment is provided by the 'current' key in
the config file will be used.
etl ls
---------------------------------------------
Display a table of all the data resources that have been created, their UUIDs, and the parameter values that were used to create them.
This is a CLI utility that is used to train the models. Because all the details are handled by the configuration utility, this has a very simple interface.
train train dataset[=experiment_name]
---------------------------------------------
Train a given model. The syntax here could do with a tweak.
Args:
dataset The model to be trained. This will be one of:
node2vec: for node2vec
mlc: for multi-label classifier
bc: for binary classifier
experiment_name If provided, it specifies which experiment definition
should be used to define the model. If this is not provided
then whatever experiment is provided by the 'current' key in
the config file will be used.
train ls
---------------------------------------------
Display a table of all the models that have been trained, their UUIDs, and the parameter values that were used to create them.
The exact definition of the data for training the node2vec encoder will vary depending on the number of negative samples you decide to use. However, the basic structure of a training example will be something like this.
(
x = {
'target': tf.Tensor(shape=(1,)),
'context': tf.Tensor(shape=(negative_samples + 1,))
},
y = tf.Tensor(shape=(negative_samples + 1,))
)
So what this means is that in your call() method, you should arrange your computation such that target (a scalar) is applied to every element of context (a vector) returning y_pred which is a vector. Einsum is a good way to do this!
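Sketched in NumPy for illustration (the real model would use the tf equivalent), the einsum pattern looks like this. The function name and shapes beyond those shown above are assumptions:

```python
import numpy as np

def score(target_emb, context_emb):
    """Dot each target vector against its stack of context vectors.

    target_emb  -- shape (batch, 1, dim)
    context_emb -- shape (batch, negative_samples + 1, dim)
    returns     -- logits of shape (batch, negative_samples + 1)
    """
    # sum over the singleton target axis (l) and the embedding axis (d)
    return np.einsum("bld,bcd->bc", target_emb, context_emb)

batch, negative_samples, dim = 2, 4, 8
t = np.ones((batch, 1, dim))
c = np.ones((batch, negative_samples + 1, dim))
logits = score(t, c)
# each logit is the dot product of two all-ones vectors of length dim
```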
A training example for MLC is something like this
(
x = tf.Tensor(shape=(1,)),
y = tf.Tensor(shape=(num_labels,))
)
where num_labels is the number of hallmarks in the MsigDB database. At the time of development, this was 50.
y is a multi-hot encoding of labels.
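A multi-hot target can be sketched like this (pure Python, assuming labels are indexed 0..num_labels-1; the function name is hypothetical):

```python
def multi_hot(label_indices, num_labels=50):
    """Encode a collection of label indices as a multi-hot vector.

    With MsigDB's 50 hallmarks, a gene annotated with hallmarks
    3 and 7 becomes a length-50 vector with ones at positions 3 and 7.
    """
    vec = [0.0] * num_labels
    for i in label_indices:
        vec[i] = 1.0
    return vec
```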
I will try and write a comprehensive digest of the theory behind the models used in this project. However, until I do, you can find out more information about the methods used from the following resources.
- Snap Stanford: This is a resource produced by the creators of node2vec. This can be found here.
- Towards data science: This is a nice breakdown of node2vec in an easy-to-understand way. This can be found here
Coming soon...
Coming soon...
If you want to talk about anything you have seen in this project, please get in touch! Also, I am more than happy to send trained weights and any intermediate data resources if processing them is not possible on your machine. ✌️
This work was inspired by the work done by A. Grover and J. Leskovec at Stanford.
I also based my PPI graph implementation and sampling strategy on the work of E. Cohen