Semantic data.gouv.fr (0.9.0)

Various stuff around uplifting the French open data portal, data.gouv.fr, to the Semantic Web (Web 3.0).

This is the foundation work that fuels CasanovaLD.

I have forked this project to deal with data.gov.uk metadata. It became datagovuk-rdf.

Update script

build.xml is an Apache Ant script that runs the following tasks:

  1. Downloading the latest metadata dumps from data.gouv.fr (CSV)
  2. Cleaning the data dumps (empty lines, wrongly escaped quotes)
  3. Converting the CSV into RDF (using TARQL)
  4. Uploading to an RDF repository
  5. Converting text identifiers into URIs for better linking across the data
  6. Postprocessing the data to make the objects more interlinked in the graph, and integrating the output of beheader into the graph
  7. Adding some metadata about the resulting data set (DCAT, VoID, PROV)
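
As an illustration of steps 2 and 5, here is a minimal Python sketch. It is not the code run by build.xml, which is an Ant script. The function names are hypothetical, the example.org base URI is a placeholder, and the sketch assumes that "wrongly escaped quotes" means backslash-escaped quotes (\") that should be doubled ("") as CSV parsers usually expect.

```python
from urllib.parse import quote

# Hypothetical base URI; the real URI scheme is defined by the project, not here.
BASE_URI = "https://example.org/id/"


def clean_csv_lines(lines):
    """Drop empty lines and turn backslash-escaped quotes into doubled quotes.

    Assumption: "wrongly escaped" means \\" where CSV expects "".
    """
    cleaned = []
    for line in lines:
        stripped = line.rstrip("\r\n")
        if not stripped.strip():
            continue  # skip empty or whitespace-only lines
        cleaned.append(stripped.replace('\\"', '""'))
    return cleaned


def identifier_to_uri(kind, identifier):
    """Build a URI from a text identifier so that records can link to each other."""
    return f"{BASE_URI}{kind}/{quote(identifier.strip(), safe='')}"


if __name__ == "__main__":
    raw = ['id;title\n', '\n', '42;"A \\"quoted\\" title"\n']
    for row in clean_csv_lines(raw):
        print(row)
    print(identifier_to_uri("datasets", "42"))
```

Percent-encoding the identifier keeps the resulting URI valid even when the source identifier contains spaces or other reserved characters.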

This script is run every 2 hours to update the RDF metadata (see here in French, in English).

This would be neither possible nor so easy without the publication of live CSVs by @noirbizarre for Etalab.

The data model can be seen here.

Requirements

  • Apache Ant, with [ANT INSTALL]/bin directory added to your PATH environment variable
  • cURL, with [CURL INSTALL] directory added to your PATH environment variable
  • TARQL by Richard Cyganiak (@cygri), with [TARQL INSTALL] directory added to your PATH environment variable
  • An RDF repository. Apache Fuseki is a good choice, but there are plenty of alternatives.
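
To check that the command-line tools are reachable before running the script, something like the following Python sketch may help. The executable names (ant, curl, tarql) are assumptions and may differ on your system, for example on Windows.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("ant", "curl", "tarql")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```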

Configuration

  • Copy config_template.properties and rename it config.properties
  • Open it and fill in the values. As written, the script assumes that your repository requires a user:password combination
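
Before running the script, it can be worth verifying that no value was left empty in the properties file. The Python sketch below is one way to do that. The key names (repository.url, repository.user, repository.password) are hypothetical, so check the template for the real ones. The parser only handles plain key=value lines and comments, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not props.get(key)]


if __name__ == "__main__":
    # Hypothetical content; the real keys are listed in the template file.
    sample = """
    # RDF repository settings
    repository.url = http://localhost:3030/ds
    repository.user = admin
    repository.password =
    """
    props = parse_properties(sample)
    required = ["repository.url", "repository.user", "repository.password"]
    print(missing_keys(props, required))
```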

Run it

  • If the requirements are fulfilled, just run ant in the datagouvfr-rdf root folder.
  • If you have already run the process and just want to reload the data in the triple store, run ant quick.

Next steps

  • Tell me!

Contact

I would love to read your feedback/comments/suggestions!

If you have a Github account, you can create an issue.

Otherwise, you can reach me:

Change log

0.9

  • Externalized more configuration to config.properties
  • Restructured tasks
0.8.4
  • Trims trailing spaces off resource URLs and replaces other spaces with %20 (#52)
0.8.3
  • Added KML formats, JSON-LD and N3 to the list of machine readable formats (ColinMaudry/datagouvfr-rdf#37)
0.8.2
  • Added application/shp+zip to the list of machine readable formats
0.8.1
  • Fixed CasanovaLD address for documentation (/.doc)

0.8

  • Added backup-repository and load-backup targets to enable the management of the repository as a service
  • Added dgfr:machineReadable property to distinguish machine readable resources (CSV, XML, JSON, RDF, plain text, etc.) from the others
  • Added a clean version of dcat:mediaType values, without charset=

0.7

  • Added properties dgfr:responseStatusCode, dgfr:responseTime and dgfr:availabilityCheckedOn to the ontology and API configuration
  • Added direct link between organizations and published distributions (see the result in the data model)
  • Added a view for unavailable resources in the API (https://www.data.maudry.com/fr/resources/unavailable)
  • Icons for boolean values (true/false) are clearer now

0.6

0.5

  • Availability and unavailability count at dataset and organization levels
0.4.3
  • Made SPARQL endpoint configuration more flexible
0.4.2
  • Fixed errors in ontology
0.4.1
  • Disabled archiving of RDF due to disk space. Will enable again when I have a clearer archiving strategy.

0.4

  • Calculation of popularity points for all objects, and aggregate sums on organisations and datasets
  • Integration of the data collected by beheader (availability of the distributions, content type, content length)
0.3.3
  • Enabled ETL with previously downloaded data to have CasanovaLD up quicker
0.3.2
  • Not much...
0.3.1

0.3

  • The RDF data is now loaded in a single atomic transaction in the repository
  • Switch from Dydra (http://dydra.com) to a local Apache Fuseki instance
  • Added organizations and reuses data, with all identifiers turned into URIs for full linking
0.2.1
  • That was a lame name. Say hi to CasanovaLD!
  • Improved documentation

0.2

  • The data.gouv.fr explorer app, with somewhat documented APIs, is live!
  • URIs have changed to match the domain of the app
  • Added dgfr:visits and dcterms:keywords (as comma-separated list, meh) in the data
0.1.5
0.1.4
  • Fixed missing properties (mismatch at conversion stage). Still no tags
0.1.3
  • Fixed RDF dataset modification date
0.1.2
  • Fixed resources that have spaces in their URLs (url-encode)
  • Added dgfr:slug for datasets
0.1.1
  • Configured upload and update of VoID and PROV metadata (in default graph)
  • Enabled scheduled task to update data every day

0.1

  • Script to download/clean/convert/publish data.gouv.fr dataset metadata
  • Basic documentation