How do we best view a collection of molecular structures and their biological activities?
This is a central challenge we have in OSM. It would be good if the casual medicinal chemist could simply browse the project's structures.
Contributors are making molecules every day and, regularly but less frequently, receiving potencies or other bio/chem data. We need to be able to share the structures and the activities effectively: the data should be easy to share but also easy to browse. So we need a sheet (or something similar) with:
1) Structures, i.e. 2D pictures of the molecules that are human-friendly
2) Associated informatics data (e.g. InChI) that are machine-friendly
3) Potencies or other data
4) Any associated ID numbers
5) A weblink or two to where the molecule is featured/made
I think it's useful for the project to have a discrete place where the data are kept, just to maintain identity - i.e. not just to be subsumed by a larger database. Or at least for the project's structures to be group-able if they are part of a larger database. But really this is a problem about human visualization.
The initial solution was a Google sheet, but we found that beyond about 50 structures the sheet didn't handle the images well.
An alternative is a shared Excel sheet, but as I understand it we would need a plugin to handle the chemical structures. That's doable for us, but we can't expect every reader to have the same plugin.
The current solution is an sd file - a succinct and easy-to-update text file that contains all the information:
HOWEVER, reading the data (i.e. browsing the structures) is not easy to do for the casual observer.
So what is needed? Well, we're batch-uploading data to ChEMBL (https://www.ebi.ac.uk/chembl/malaria/doc/inspect/CHEMBL2113921). If we could auto-upload to ChEMBL (daily), the problem would be solved, since ChEMBL is very cool and is doing cool things with visualization:
Another possible solution is a system where the sd file is displayed on a webpage with a static address (so it can be bookmarked). Whenever the sd file is updated, the page would render the new data the next time it is loaded. The page would need to show the structures and be interactive, so that the data can be re-ordered on demand, like in a spreadsheet.
The main sd file has now been joined by an Excel sheet and sd file of lots of exciting new molecules for the latest series the project's looking at:
We coincidentally need to combine these two sd files, and we need to browse the new structures because we need to think about which molecules to make next in that new series.
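As a rough illustration of the combining step, here is a minimal Python sketch that concatenates two sd files and drops records whose key field has already been seen. It works purely on the text of the files (no chemistry toolkit), and it assumes each record carries a data field that can serve as a key; the field name `InChIKey` and the function names are assumptions for illustration, not something taken from the project's files.

```python
import re

def split_records(sdf_text):
    """Split SD file text into records, each ending with its '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records

def field_value(record, name):
    """Return the first value line of the named data field, or None if absent."""
    pattern = r"^>.*<" + re.escape(name) + r">.*\n(.*)$"
    match = re.search(pattern, record, re.MULTILINE)
    return match.group(1).strip() if match else None

def merge_sdf(first, second, key="InChIKey"):
    """Concatenate two SD files, skipping records whose key was already seen.

    Records without a value for the key field are always kept.
    The default key name is an assumption; use whatever field the files share.
    """
    seen, merged = set(), []
    for record in split_records(first) + split_records(second):
        value = field_value(record, key)
        if value and value in seen:
            continue  # same compound already present
        if value:
            seen.add(value)
        merged.append(record)
    return "".join(merged)
```

Calling `merge_sdf(open("a.sdf").read(), open("b.sdf").read())` would return the text of the combined file; the file names here are placeholders. When the same compound appears in both files with different data, this sketch keeps the first copy only, so the order of the arguments matters.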
I know that there are solutions that are appropriate for cheminformaticians. We need solutions for people who are happy with email and web browsers only.
Any ideas, anyone?
Egon Willighagen spoke about possible solutions using the sd file during the previous OSM project meeting (http://youtu.be/GxJnWg_eR2I)
This post at:
I haven't had time to write it up yet, but please check out this page, which visualizes the SD file of Oct 10:
You can select the "Properties" checkbox to visualize all the SD properties, as shown in the second screenshot:
I have spoken with @vedina about options to automate updating the data set daily...
Another possible option: JSDraw by Scilligence has a JavaScript solution for SDF display. It can be embedded in HTML and might look like this:
This is all on good lines but I can add: 1) make sure the InChIKeys are open to the Googlebot; 2) I would have thought monthly ChEMBL uploads were fine, or even just before a new release, so they flow to PubChem a couple of weeks later; 3) at least for the SD masterfiles, keep the result-linked and designed virtuals plus made-but-not-yet-tested cleanly separate (but can merge as well of course for clustering display etc.). Since I have an academic ChemAxon licence I'd probably try JChem for Excel or the new MSWord integration, but I'm generally agnostic about communal web-centric choices on the display side. As Egon can give us expert support directly with Ambit, that's a bonus for a start.
There is a command line application, which can convert any SDF file into JSON and a folder with images. The page is a static one with some jQuery scripts and a bit of styling. The entire directory structure can be hosted at any HTTP server.
Here's a screenshot of that page:
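To give a feel for the SDF-to-JSON step, here is a minimal Python sketch. It is not the command line application mentioned above, just an illustration of the idea under simple assumptions: records end with a `$$$$` line, each data item starts with a `> <NAME>` header, and a blank line ends its value. It extracts only the data fields and does not render any structure images. The sample records, IDs and field names are made up.

```python
import json

def parse_sdf_properties(sdf_text):
    """Return one dict of data fields per '$$$$'-terminated record."""
    records, props, field = [], {}, None
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":            # record terminator
            records.append(props)
            props, field = {}, None
        elif field is not None:               # inside a data item's value
            if line.strip():
                props[field] = f"{props[field]}\n{line}" if props[field] else line
            else:
                field = None                  # a blank line ends the value
        elif line.startswith(">"):            # data header such as "> <ID>"
            start, end = line.find("<"), line.rfind(">")
            if 0 <= start < end:
                field = line[start + 1:end]
                props[field] = ""
    return records

# Two made-up records with empty mol blocks, for illustration only.
SAMPLE = """\
example-1
  made-up record

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
EXAMPLE-1

> <IC50_uM>
0.5

$$$$
example-2
  made-up record

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
EXAMPLE-2

> <IC50_uM>
>10

$$$$
"""

records = parse_sdf_properties(SAMPLE)
print(json.dumps(records, indent=2))
```

A static page could then load the resulting JSON and hand it to whatever table script it uses. A final record that lacks its `$$$$` terminator is ignored by this sketch.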
As some of the posts above point out: it's pretty easy to create and use cheminformatics tools to pull the data from github, add whatever info is required, and build the kind of web page you want. I'm sure a lot of us cheminfogeeks would be happy to help out with that.
It's probably also no problem to provide a relatively easy to use desktop tool for grabbing, manipulating, and displaying the data. If I were doing this for a chemist, I'd probably use Knime. It's not that tough to use and has a great selection of tools for munging and manipulating data. For a more expert user (or someone willing to invest some more time in learning), I'd use an IPython notebook.
The trickier part is having the tool hosted on a reliable server somewhere that the community can access and keeping the contents up-to-date. Here, unfortunately, I don't have much in the way of ideas to offer.
Coming back to Chris's question above: how often does the web page need to be updated?
"reliable server and keeping the contents up-to-date" - with only a static page with scripts to host, the choice of HTTP servers is very broad (no application server or database backend needed). The JSON and images that the page in my post relies on can be updated with a simple cron script any time the source SDF file is changed.
And for a very small dataset like this one, I would say github is the perfect place. Version control as an added value to hosting ;)
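The "update any time the source SDF file is changed" part could look something like the following Python sketch, meant to be run on a schedule (from cron, or by hand on a laptop). It stores a hash of the SD file in a small stamp file and only calls the regeneration step when the hash differs. The function names and the stamp-file approach are assumptions for illustration; `regenerate` stands in for whatever actually rebuilds the JSON and images.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def refresh(sdf_path, stamp_path, regenerate):
    """Call regenerate(sdf_path) only if the SD file changed since the last run.

    The digest of the last processed file is kept in stamp_path. The stamp is
    written after regenerate() returns, so a failed run is retried next time.
    Returns True if regeneration happened, False otherwise.
    """
    digest = file_digest(sdf_path)
    stamp = Path(stamp_path)
    if stamp.exists() and stamp.read_text().strip() == digest:
        return False
    regenerate(sdf_path)
    stamp.write_text(digest)
    return True
```

Comparing content hashes rather than modification times means a fresh checkout from github does not trigger a rebuild unless the file content has really changed.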
Oh yeah, that's definitely true for static files, but someone either has to run the scripts to generate those files and upload them to the server or you need another server somewhere that's running the cron job.
@greglandrum Not necessarily a server - run the scripts anywhere (laptop, desktop) and check the files into github (or server of choice). More or less similar to compiling source files.
That tool looks pretty cool. If the files are purely static and we don't need anything too special we could simply host the site from an S3 bucket (however, the cron job would still, as mentioned above, have to run on something).
Another option is to host the web server and the cron job on the same server. This could either be an AWS instance (a t1.micro would be sufficient) or, alternatively, Nectar ( http://nectar.org.au/ ), an Australian cloud for researchers that we have free access to and that would provide a box with better specs than a low-cost AWS instance.
@mike - yes - all the files behind the molbrowser page are static and currently on github gh-pages. Great idea to host on a cloud service!
Ping @vedina could you please add a LICENSE to the repo folder that outlines usage/license? Cheers.
Done - added LGPL3 license in the pom.xml
Thanks, much appreciated.
Nice examples everyone, thank you. Just to clarify, is someone actively constructing something, or is the ball back in my court?
@greglandrum and @cdsouthan I'm not sure how often we'd update the sd file. We add only those compounds that have been biologically evaluated, so it's periodic. At a best guess, weekly?
And yes, Knime is a very good advanced tool for playing with data, but we just need something for the interested med chemist. Structures, some info like an ID, potencies. I think it's important people can follow the trail from the data to more info about the compounds, but if that's not already in the sd file we can handle it separately later.
@mike - true, the JSON data as it is now will not be indexed properly by Google. One option is to modify the application to directly generate HTML tables instead of JSON, this is quite straightforward. The same jQuery datatable can be used for rendering, but the HTML will contain all the values.
In fact Google might index the JSON pages with a bit more effort. Google seems to have a patent for indexing and searching JSON objects ...
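For the option of generating HTML tables directly, a minimal Python sketch of the rendering step might look like this. It assumes the SD properties have already been read into a list of dictionaries; the function name, the table `id` and the column names in the usage note are made up for illustration and are not taken from the existing application.

```python
from html import escape

def records_to_html_table(records, columns):
    """Render a list of dicts as a plain HTML table with all values inline.

    Values are HTML-escaped, and missing fields become empty cells.
    """
    head = "".join(f"<th>{escape(col)}</th>" for col in columns)
    rows = []
    for record in records:
        cells = "".join(
            f"<td>{escape(str(record.get(col, '')))}</td>" for col in columns
        )
        rows.append(f"<tr>{cells}</tr>")
    return (
        '<table id="compounds">\n'
        f"<thead><tr>{head}</tr></thead>\n"
        "<tbody>\n" + "\n".join(rows) + "\n</tbody>\n</table>"
    )
```

For example, `records_to_html_table([{"ID": "EXAMPLE-1", "IC50_uM": "0.5"}], ["ID", "IC50_uM"])` returns a table with one header row and one data row. Because every value is present in the markup itself, a crawler sees the data without running any script, and a table script can still be attached to the same element for sorting.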
@mattodd I can't help on the IT construction side but it looks like you have very able folk engaging. Re updating, I know OSM keeps up a cracking pace but longer result consolidation intervals have advantages. Besides, if Google gets to the ELNs and folk add the InChIKeys for what they are designing, and the more novel reagents, you have de facto real-time surfacing anyway. My suggestion would be that @georgeisyourman tips you off for the last two-week window before ChEMBLMalaria > ChEMBL18 loads. You then put the call out for any new collaborator results, shut the fume cupboards, order the pizzas, crack a few tubes, and have a data catch-up session, filling in the tables the way GP needs. As mentioned before, you should then not only hit the ChEMBL release but also flow into PubChemBioassay a couple of weeks later. It's important that the MMV and OSM codes come through (via the SIDs) as searchable synonyms in the CID records. Note that you should also update any data and code name additions in the pre-existing ChEMBL records, e.g. newer assay results. These will automatically refresh in the SIDs (there may be issues with multiple results needing new assay rows, but maybe G&I can take a look).
@vedina I wouldn't be too surprised if they can index JSON, I'm more concerned about them indexing the entry page that generates the data dynamically based on loading the contents of the JSON file.
@mattodd - well, try uploading to an Ambit instance - nothing more than SDF upload through a web form or a REST API and all substructure and similarity searches are enabled once the file is uploaded.
@mattodd I refer exactly to the same solution as suggested by @egonw at the second comment.
Sorry for the self advertising.
Ambit is a chemical database (MySQL) with a REST web service API (doi:10.1186/1758-2946-3-18). LGPL2 licence. It supports substructure and similarity search.
It runs under the Apache Tomcat container and is distributed as a web application archive, which means anyone can host it. Unlike ChEMBL, the database is distributed without default content - that is, you can upload any chemical file through a web form or an HTTP POST call. Bioclipse can do this (among other things, such as running calculations and a variety of predictions through the same API).
What I had in mind is 1) installing a Tomcat server 2) deploy ambit.war 3) upload SDF files (or CML, SMILES, InChI, etc). This could be an existing installation or one dedicated to OSM, at a location of choice.
I've created a new repo for this project/discussion here
I've started work on this, I just haven't made enough progress to warrant pushing anything to that repo yet. For those interested it will be a mongodb backed node.js app. Creating a quick and dirty visualisation doesn't take too long (and can actually be done in basically one line using openbabel) but I'd like to take the time to develop an awesome, user friendly system that has some cool features (like pulling in external data from ChEMBL).
Just to tell @mike: Ambit can already pull in external data through a REST API ...
@greglandrum Have you guys got a repo for that? Just want to make sure we aren't doubling up on work.
We are in the process of deploying a KNIME Server for use on projects such as this. We should be able to write out to any backend, and can do cheminformatics reports both as scheduled tasks and as part of ad hoc requests, if this is useful. It is deployed on EC2, and we will point some DNS at it once we get to that stage (hopefully soon).
For now, you can run workflows from the link below, and make requests for reports/visualizations directly to me if you don't want to do them yourselves in KNIME.
I'm not sure how this fits into your ongoing efforts, but we should be able to stitch things together.
I would upload an example report, but don't see a way to attach a PDF. Is there somewhere decent to upload it (sorry, I'm new here)?
Aaron Hart, KNIME.com
ps. Nice to meet you all!
Hi @Aaron-KNIME, looks cool. Is there a repo somewhere with the source?
Closing this for now due to more recent Issues linked above. Meeting held before Christmas. Minutes and actions here: http://malaria.ourexperiment.org/osddmalaria_meeting_/8407