
Ingest ~2,000 high-quality models into Terarium #363

Closed
liunelson opened this issue Jan 4, 2024 · 7 comments
liunelson commented Jan 4, 2024

The goal is to pre-populate Terarium with a significant number of "high quality" models from an existing repository such as BioModels.

The models that we want to ingest are the ~2432 models returned by the BioModels search interface with only the filter "model format = SBML". We should use the REST API to download the SBML file of each model.

Each SBML model file (extension .xml or .sbml) should be run through the following script (requires the MIRA package) to convert it from SBML format to PetriNet AMR JSON format:

import glob
import json
import os

import tqdm

from mira.metamodel.ops import simplify_rate_laws
from mira.modeling import Model
from mira.modeling.amr.petrinet import AMRPetriNetModel
from mira.sources.sbml import template_model_from_sbml_file

PATH = "data/biomodels_sbml"
fnames = glob.glob(os.path.join(PATH, "*.*ml"))  # matches both .xml and .sbml

fnames_succ = []
fnames_fail = []
for fname in tqdm.tqdm(fnames):
    try:
        # SBML -> MIRA template model -> simplified rate laws -> PetriNet AMR
        model_tm = template_model_from_sbml_file(fname)
        model_tm = simplify_rate_laws(model_tm)
        model_pn = AMRPetriNetModel(Model(model_tm))
        model_pn_json = model_pn.to_json()

        # Write the AMR JSON next to the source file, swapping the extension
        with open(os.path.splitext(fname)[0] + ".json", "w") as f:
            json.dump(model_pn_json, f, indent=4)

        fnames_succ.append(fname)

    except Exception:
        fnames_fail.append(fname)

print(f"{len(fnames_succ)} successes and {len(fnames_fail)} failures")

I've tested ~200 models and ~60% can be successfully converted into a PetriNet AMR JSON.

We'll need @j2whiting 's help to subsequently populate the "Model Card" associated with each model.

@liunelson

@bigglesandginger @YohannParis
Does the above make sense to you?

@bigglesandginger

@liunelson Have you used the API? When I try curl -XGET https://www.ebi.ac.uk/biomodels/search\?query\=homo+sapiens\&format\=SBML I get a web page, not XML or whatever else one might expect from an API.

@liunelson

You are right about the API. It seems to return the search page itself, as opposed to a nice JSON listing the model IDs.

https://www.ebi.ac.uk/biomodels/search?query=%3A%20AND%20modelformat%3A%22SBML%22&domain=biomodels&offset=0&numResults=10

With the model IDs, you can then use this endpoint to get the SBML file of a model:
https://www.ebi.ac.uk/biomodels/search/download?models=MODEL0913095435

Does this make sense?
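The two endpoints above can be wrapped in small URL builders before wiring up the actual downloads. A minimal sketch; the function names are my own, and the format=json parameter on the search endpoint is an assumption based on the BioModels REST docs that should be verified:

```python
import urllib.parse

SEARCH_URL = "https://www.ebi.ac.uk/biomodels/search"
DOWNLOAD_URL = "https://www.ebi.ac.uk/biomodels/search/download"


def search_url(query: str, offset: int = 0, num_results: int = 100) -> str:
    """Build a BioModels search URL for paging through model IDs.
    format=json is an assumption from the REST docs; check before relying on it."""
    params = {
        "query": query,
        "offset": offset,
        "numResults": num_results,
        "format": "json",
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def download_url(model_id: str) -> str:
    """Build the SBML download URL for one model ID (the endpoint shown above)."""
    return DOWNLOAD_URL + "?" + urllib.parse.urlencode({"models": model_id})
```

Feeding the IDs returned by paged search_url calls into download_url would give one download target per model.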


j2whiting commented Jan 20, 2024

I can write a script to crawl these and pull models & metadata if you'd like. I should have time on Monday/Tues and I think it should only take a couple hours tops.

Just let me know the schema you need for the output.

@YohannParis YohannParis assigned j2whiting and unassigned liunelson Jan 22, 2024

j2whiting commented Jan 23, 2024

This was a little annoying since I had to render the JavaScript instead of just building a crawler with simple requests and HTML parsing, but it is done.

I managed to pull all 2435 model href tags, the URL for the source publication and the download link for the model file.

>>> import json
>>> with open('model_data.json', 'r') as f:
...     data = json.load(f)
...
>>> next(iter(data.items()))
('/biomodels/BIOMD0000000573', {'publication_link': 'http://identifiers.org/pubmed/24997239', 'model_files': ['https://www.ebi.ac.uk/biomodels/services/download/get-files/MODEL1503180001/3/BIOMD0000000573_url.xml', 'https://www.ebi.ac.uk/biomodels/services/download/get-files/MODEL1503180001/3/BIOMD0000000573_urn.xml']})

How do we add these to Terarium?

Update:

URLs, models and reference links can be found in the JSON here: https://drive.google.com/file/d/1Upv84-fWmSqBvTxSzRpJqSEQ3OQ61GTc/view?usp=share_link
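Assuming the JSON keeps the shape shown in the snippet above (href key mapping to publication_link and model_files), a small helper can flatten it into per-model download targets. The helper name and the choice of taking the first listed file are my own:

```python
def model_downloads(data: dict) -> list:
    """Flatten the scraped model_data.json mapping into (model_id, file_url)
    pairs, taking the first listed file for each model."""
    pairs = []
    for href, info in data.items():
        # e.g. "/biomodels/BIOMD0000000573" -> "BIOMD0000000573"
        model_id = href.rsplit("/", 1)[-1]
        files = info.get("model_files", [])
        if files:
            pairs.append((model_id, files[0]))
    return pairs
```

Each (model_id, file_url) pair could then be handed to a plain HTTP GET to fetch the SBML file.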

  • Test adding new models to Terarium on Jan 24th

@YohannParis YohannParis changed the title Ingest a number (10-1000s) of high-quality models into Terarium Ingest a number 245 of high-quality models into Terarium Jan 23, 2024
@j2whiting

@liunelson to convert these to AMR and upload to Terarium

@liunelson

I've converted, ad hoc, ~2k models from SBML to AMR JSON that Julian has scraped from the BioModels repository: see here.

I've also used the Open Access Button API to find the download link of the associated paper PDF:
model_data_oa.json

I was only able to download PDFs from 10.8% of the open-access URLs, but I didn't spend any time figuring out why the other ~60% of open-access URL downloads failed. For some, a GET on the OA URL simply returns 404. However, model BIOMD0000000598, for example, has this link which allowed me to download a PDF in a browser, but I don't know why it returns 403 with a plain GET.
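For reference, the Open Access Button lookups reduce to a single query URL per paper. A minimal sketch; the endpoint path is from my reading of the OA Button docs and should be double-checked, and the browser-like header is only a guess at why some servers return 403 to scripted GETs, not a confirmed fix:

```python
import urllib.parse

OAB_FIND = "https://api.openaccessbutton.org/find"


def oab_find_url(identifier: str) -> str:
    """Build an Open Access Button 'find' query for a DOI or article URL.
    Endpoint path is per my reading of the OA Button docs; verify before use."""
    return OAB_FIND + "?" + urllib.parse.urlencode({"id": identifier})


# Some publishers reject requests without a browser-like User-Agent, which may
# explain a share of the 403s; this header is an assumption, not a verified fix.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
```

Retrying the failing OA URLs with BROWSER_HEADERS attached would be a cheap first experiment before digging deeper into the 403s.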

Charles, can you have Terarium ingest all these models with their paper (if available)?

Number of models:        2435
  with SBML:             99.8%
  converted to AMR:      55.6%
  with PDF link:         99.4%
  with OA PDF link:      72.3%
  with downloaded PDF:   10.8%

@YohannParis YohannParis changed the title Ingest a number 245 of high-quality models into Terarium Ingest a number 2245 of high-quality models into Terarium Jan 29, 2024
@liunelson liunelson changed the title Ingest a number 2245 of high-quality models into Terarium Ingest ~2,000 high-quality models into Terarium Jan 29, 2024