A package that profiles SPARQL endpoints to extract the nodes and relations represented in a knowledge graph.
This package follows the recommendations of the HCLS (Health Care and Life Sciences) Community Profile to generate metadata describing the content of a SPARQL endpoint.
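To give an idea of what profiling involves, here is a minimal sketch of the kind of aggregation query a profiler can run against an endpoint to count the instances of each class. The query and the use of the SPARQLWrapper library are illustrative assumptions, not the exact queries run by this package:
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative only: count instances per class, one of the statistics
# described by the HCLS Community Profile (not necessarily the exact
# query used by sparql-profiler)
sparql = SPARQLWrapper("https://graphdb.dumontierlab.com/repositories/umids-kg")
sparql.setQuery("""
SELECT ?class (COUNT(?s) AS ?count)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?count)
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["class"]["value"], row["count"]["value"])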
This package requires Python >=3.7. Install it with:
pip install sparql-profiler
You can easily use the sparql-profiler from your terminal after installing it with pip.
Quickly profile a small SPARQL endpoint to generate HCLS descriptive metadata for each graph:
sparql-profiler profile https://graphdb.dumontierlab.com/repositories/umids-kg
Profiling a bigger SPARQL endpoint will take more time:
sparql-profiler profile https://bio2rdf.org/sparql
Display more debugging logs with -l debug:
sparql-profiler profile https://bio2rdf.org/sparql -l debug
Use a profiling method specific to Bio2RDF when profiling a Bio2RDF endpoint:
sparql-profiler profile https://bio2rdf.org/sparql --profiler bio2rdf
You can also add extra metadata about the dataset distribution (description, license, etc.) by answering a few questions, using the -q flag:
sparql-profiler profile https://graphdb.dumontierlab.com/repositories/umids-kg -q
See all options for the profile command with:
sparql-profiler profile --help
Get a full rundown of all available commands with:
sparql-profiler --help
Use the sparql-profiler in Python scripts:
from sparql_profiler import SparqlProfiler
sp = SparqlProfiler("https://graphdb.dumontierlab.com/repositories/umids-kg")
print(sp.metadata.serialize(format="turtle"))
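Since sp.metadata supports serialize(), it appears to be an rdflib Graph; assuming that, here is a minimal sketch showing how to save the generated metadata to a file:
# Assumption: sp.metadata is an rdflib Graph, as suggested by .serialize()
sp.metadata.serialize(destination="profile.ttl", format="turtle")
print(len(sp.metadata), "metadata triples generated")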
This final section of the README is for you if you want to run the package in development and get involved by making a code contribution.
Clone the repository:
git clone https://github.com/MaastrichtU-IDS/sparql-profiler
cd sparql-profiler
Install Hatch; it will automatically handle virtual environments and make sure all dependencies are installed when you run a script in the project:
pip install --upgrade hatch
Install the dependencies in a local virtual environment:
hatch -v env create
Alternatively, if you are already handling the virtual environment yourself or installing in a Docker container, you can use:
pip install -e ".[test,dev]"
While in development, you can easily run the sparql-profiler in your terminal with hatch to profile a specific SPARQL endpoint:
hatch run sparql-profiler profile https://graphdb.dumontierlab.com/repositories/umids-kg
Make sure the existing tests still work by running pytest. Note that any pull request to the sparql-profiler repository on GitHub will automatically trigger the test suite:
hatch run test
To display all print() outputs when running the tests:
hatch run test -s
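If you add a test, it could look like this minimal sketch, based on the Python API shown above (illustrative, not an excerpt of the actual test suite; it also assumes sp.metadata is an rdflib Graph):
from sparql_profiler import SparqlProfiler

def test_profile_endpoint():
    # Hypothetical test: profile a small endpoint and check that metadata was generated
    sp = SparqlProfiler("https://graphdb.dumontierlab.com/repositories/umids-kg")
    assert len(sp.metadata) > 0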
The code will be automatically formatted when you commit your changes, using pre-commit. But you can also run the script to format the code yourself:
hatch run fmt
Check the code for errors and compliance with the PEP8 style guide by running flake8 and mypy:
hatch run check
In case you are facing issues with dependencies not updating properly, you can easily reset the virtual environment with:
hatch env prune
The deployment of new releases is done automatically by a GitHub Actions workflow when a new release is created on GitHub. To release a new version:
- Make sure the PYPI_TOKEN secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from PyPI at pypi.org/manage/account.
- Increment the version number in the pyproject.toml file in the root folder of the repository.
- Create a new release on GitHub, which will automatically trigger the publish workflow and publish the new release to PyPI.
You can also manually trigger the workflow from the Actions tab in your GitHub repository webpage.
A related tool: the SPARQL profiler of the Umaka viewer (https://umaka-viewer.dbcls.jp/, code at https://github.com/dbcls/umakaparser), written in Java, which should be able to work with large graphs: https://bitbucket.org/yayamamo/tripledataprofiler/src/master/src/jp/ac/rois/dbcls/TripleDataProfiler.java
Run:
java -jar TripleDataProfiler.jar -ep https://bio2rdf.org/sparql
Build:
javac -cp commons-cli-1.2.jar:commons-lang3-3.3.2.jar:apache-jena-2.11.1/lib/*:./src ./src/jp/ac/rois/dbcls/TripleDataProfiler.java
Note that the build expects specific deprecated versions of Jena and of the Commons libraries, and the project does not provide a Maven build, so you will need to locate and download these dependencies yourself.