This repository has been archived by the owner on Mar 26, 2019. It is now read-only.

Epic: Track changes to a data series and push the date of the latest change to CKAN #298

Open
cjhendrix opened this issue Oct 17, 2014 · 26 comments

Comments

@cjhendrix
Contributor

The goal is to allow all the information about a CKAN indicator (which is simply a CKAN dataset sourced from CPS) to be maintained in one place: CPS. CPS would then be able to push changes to this information (let's call it Ancillary Indicator Information, AI2) to CKAN via CKAN's action API.

The goal of this epic is to set up the framework for this using a high value test case, described below.

Consider all the indicators returned from this search: https://data.hdx.rwlabs.org/dataset?q=fts+cross-appeal. Note that the "Updated By" date for all of them is July 7, which is the date the CKAN datasets were created. However, the data on CPS has been updated at least weekly since then, and CKAN has no way of knowing this. This epic will result in these dates being updated by CPS whenever a change is made to the data series. Later we will expand this approach to allow all of the AI2 to be managed in CPS.

The list of AI2 to be managed by CPS will ultimately include:

  • dataset AI2 (0..1 per data series)
    • most recent date of changed data or metadata (for "updated by")
    • description
    • source
    • source link
    • date range (calculated)
    • locations (1..*)
    • public/private
    • license (including other + text)
    • methodology (including other + text)
    • caveats
    • tags (0..*)
    • topics (fixed list)
    • Org
    • data display precision
    • Resource AI2 (0..* per data series)
      • CPS API URL for Resource
      • Resource Name
      • Resource Note
      • Resource Format
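To make the push concrete, here is a minimal sketch of updating a dataset's "last updated" metadata through CKAN's action API. The endpoint path is standard CKAN; `package_patch` assumes a CKAN version that provides it (older versions would need `package_update`, which requires sending the full dataset to avoid clobbering other fields). The dataset name and the `last_update_date` extras key are illustrative assumptions, not agreed names:

```python
import json
import urllib.request

CKAN_BASE = "https://data.hdx.rwlabs.org"  # HDX instance from the search link above


def build_metadata_patch(dataset_name, last_update_date):
    """Build the body for a package_patch call that only touches the update date.

    The extras key name ("last_update_date") is illustrative; the real key
    would be whatever CPS and CKAN agree on.
    """
    return {
        "id": dataset_name,
        "extras": [{"key": "last_update_date", "value": last_update_date}],
    }


def push_patch(payload, api_key):
    # POST to CKAN's action API; the API key of a user with edit rights
    # on the dataset goes in the Authorization header.
    req = urllib.request.Request(
        CKAN_BASE + "/api/3/action/package_patch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Build (but do not send) a patch for a hypothetical dataset name.
payload = build_metadata_patch("fts_cross_appeal___fts", "2014-10-17")
```

The advantage of a patch-style call over a full update is that CPS only has to know about the fields it owns; everything else on the CKAN side stays untouched.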
@rufuspollock

I'm a bit unclear on why you wouldn't manage all of this info directly in CKAN? It already has the capability to store all this kind of info, and it saves you having to reinvent the wheel by adding support for this in CPS (and then pushing it back across into CKAN).

@cjhendrix
Contributor Author

We use CPS to import and normalize data and maintain referential integrity. Since at least some of the info has to be maintained on the CPS side, our data managers feel it would be easier to manage all of it there. This is just for those datasets that we curate, not user contributed datasets which live solely on CKAN.

@rufuspollock

@cjhendrix I guess the question is why you couldn't maintain all the info on the CKAN side here, based on DRY principles? Generally, I think it would be really useful (for me) to understand a bit more about the overall architecture, especially of CPS, to understand what is being done where and how, as I can then offer more useful input :-)

@cjhendrix cjhendrix added this to the Sprint 37 milestone Oct 23, 2014
@cjhendrix
Contributor Author

Note to Sam. Understood that this one will likely carry over multiple sprints given your availability.

@seustachi
Contributor

The biggest difficulty I see here is that CPS does not know about the curated datasets.

Instead, the curated datasets know about CPS.

If we add some kind of mapping, allowing CPS to know which curated datasets to update when some data (or metadata) changes are detected, we still have 2 places to maintain. If we add a new indicator, we have to create it in CPS, create the curated dataset, and they both must know about each other.

So we don't follow the DRY principle, and I am not sure this will be simpler for the data team.

The gain here would be that once this is set up, the updates should be replicated.

I think we should have a call dedicated to this topic.

@seustachi
Contributor

So after discussion, here is the plan:

There is a 1 to 1 relationship between dataseries and ckan datasets. So if we detect a change in the data or metadata for a dataserie, we can push it to the dataset.

  • The metadata would be pushed to the dataset metadata.
  • For the data, this is still to be discussed. We might want to push a new file, and / or invalidate the cache... Many things might have to be done, to be discussed with Alex and Serban.

What I can do already is the following:

  1. Add some fields to the dataserie table:
  • Name of the dataset (to be able to push to CKAN)
  • Last metadata update
  • Last metadata push
  • Last data update
  • Last data push
  2. Set up a job that will find the dataseries where last metadata update > last metadata push, push the metadata to CKAN, and update the last metadata push value
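A minimal sketch of that job, using an in-memory SQLite table to stand in for the real CPS dataserie table (the table and column names follow the list above; the actual push to CKAN is stubbed out):

```python
import sqlite3
from datetime import datetime, timezone


def find_pending(conn):
    # Dataseries whose metadata changed after the last successful push
    # (or that have never been pushed at all).
    return conn.execute(
        "SELECT id, dataset_name FROM dataserie"
        " WHERE last_metadata_push IS NULL"
        " OR last_metadata_update > last_metadata_push"
    ).fetchall()


def run_sync_job(conn, push_metadata):
    pushed = []
    for ds_id, dataset_name in find_pending(conn):
        push_metadata(dataset_name)  # would call CKAN's action API here
        conn.execute(
            "UPDATE dataserie SET last_metadata_push = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), ds_id),
        )
        pushed.append(dataset_name)
    conn.commit()
    return pushed


# Demo: one dataserie updated since its last push, one already in sync.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataserie (id INTEGER PRIMARY KEY, dataset_name TEXT,"
             " last_metadata_update TEXT, last_metadata_push TEXT)")
conn.execute("INSERT INTO dataserie VALUES (1, 'fts-cross-appeal',"
             " '2014-11-10T00:00:00+00:00', '2014-11-01T00:00:00+00:00')")
conn.execute("INSERT INTO dataserie VALUES (2, 'other-series',"
             " '2014-11-01T00:00:00+00:00', '2014-11-02T00:00:00+00:00')")
pushed = run_sync_job(conn, lambda name: None)  # only 'fts-cross-appeal' is pushed
```

Updating `last_metadata_push` only after the push succeeds means a failed push is simply retried on the next job run.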

seustachi added a commit that referenced this issue Nov 9, 2014
@rufuspollock

@seustachi it would be super useful to get a bit of a diagram here to understand what is going on - as mentioned, you'll want to be careful about not ending up with your authoritative metadata in 2 places (and getting stuff out of sync).

@cjhendrix
Contributor Author

@seustachi The key thing we need to urgently solve is the high value test case listed in the original issue above. If I understand your last comment above, it sounds like you are putting that one as secondary. Happy to discuss, but I think you need to focus your effort on that one.

@seustachi
Contributor

@cjhendrix I don't put it as secondary priority.

To detect a change related to a dataserie is a prerequisite.
To know how dataseries and datasets are related is also a prerequisite.

Then we will be able to push information to CKAN.

@cjhendrix
Contributor Author

Ok, thanks for the clarification.

@seustachi
Contributor

So, we agreed that:

  • a dataset is related to a dataserie
  • we need to change the names of the dataset to be able to have several datasets for an indicatorName (title_with_underscore___sourceCode)
  • we want in the extras : sourceCode, sourceName, IndicatorTypeCode, IndicatorTypeName, lastUpdateDate
  • we will propose to Serban to use a static FS for reports that will be updated by the CkanSynchronizerJob (This job is in CPS) instead.

LastUpdateDate changes only if at least one value was added or updated

seustachi added a commit that referenced this issue Nov 14, 2014
@seustachi
Contributor

List of the extras keys we want to use:

"dataset_source" for the sourceName
"dataset_source_code" for the source code

"indicator_type" for the IT Name
"indicator_type_code" for the IT code

"dataset_date": "11/02/2014-11/20/2014", for the date range of the data

"dataset_summary"
"methodology"
"more_info"
"terms_of_use"
"validation_notes_and_comments"
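CKAN stores extras as a list of key/value pairs, so the mapping above could be assembled like this (the CPS-side field names are assumptions for illustration; the extras keys are the ones listed above):

```python
# Map from assumed CPS-side field names to the agreed CKAN extras keys.
EXTRAS_KEYS = {
    "sourceName": "dataset_source",
    "sourceCode": "dataset_source_code",
    "indicatorTypeName": "indicator_type",
    "indicatorTypeCode": "indicator_type_code",
    "dateRange": "dataset_date",
}


def to_ckan_extras(cps_metadata):
    """Convert a dict of CPS metadata into CKAN's extras format:
    a list of {"key": ..., "value": ...} dicts, skipping absent fields."""
    return [
        {"key": ckan_key, "value": cps_metadata[cps_key]}
        for cps_key, ckan_key in EXTRAS_KEYS.items()
        if cps_key in cps_metadata
    ]


extras = to_ckan_extras({
    "sourceName": "OCHA FTS",
    "sourceCode": "fts",
    "dateRange": "11/02/2014-11/20/2014",
})
```

Keeping the mapping in one table like this makes it easy to extend as more of the AI2 fields (dataset_summary, methodology, terms_of_use, ...) move under CPS management.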

@seustachi
Contributor

Format of the action we want to use is documented here:
https://gist.github.com/alexandru-m-g/09155dff01e8302acf47

seustachi added a commit that referenced this issue Nov 27, 2014
@seustachi
Contributor

@cjhendrix @alexandru-m-g
I don't remember what we decided about the change to the dataset names.

Do we keep a human-readable title (title_with_underscore___sourceCode) or do we want (indTypeCode_SourceCode)?

I think I remember that CJ preferred the human-readable option. If we do that, we have to manage the title in CPS (to be able to push updates). Is that what we want?

seustachi added a commit that referenced this issue Dec 27, 2014
seustachi added a commit that referenced this issue Dec 27, 2014
seustachi added a commit that referenced this issue Dec 27, 2014
@teodorescuserban
Member

Please, when in doubt about any names, favor human readable over anything else and url slug over human readable.

@cjhendrix
Contributor Author

@seustachi It's the former, for example: https://data.hdx.rwlabs.org/dataset/proportion_of_the_population_using_improved_sanitation_facilities___mdgs

Alex is making the change in sprint 46 (2 week sprint starting 5 Jan): OCHA-DAP/hdx-ckan#1771

As for managing the title in CPS, that should be fine. The only thing we shouldn't manage is the "name", which is used for the URL.

seustachi added a commit that referenced this issue Dec 31, 2014
seustachi added a commit that referenced this issue Jan 2, 2015
Upon completion of metadata update, the ts in db is updated so the job
is considered as done
seustachi added a commit that referenced this issue Jan 2, 2015
@seustachi
Contributor

What we want now is to trigger the metadata update when a new indicator value is added or an existing one is changed, because we need to update the range of value dates

@seustachi
Contributor

And we also want to update the date of the last "update" of the dataset. Check with @alexandru-m-g whether we store it on the dataset or the resource. This is a new metadata field, with the update triggered whenever the data is updated
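The "dataset_date" range extra from earlier in the thread ("11/02/2014-11/20/2014") would need to be recomputed whenever a value is added or changed. A minimal sketch of that recomputation, assuming only the MM/DD/YYYY-MM/DD/YYYY format shown above (the function name is illustrative):

```python
from datetime import date


def dataset_date_range(value_dates):
    """Recompute the "dataset_date" extra from the dates of all indicator
    values in a dataserie, in the MM/DD/YYYY-MM/DD/YYYY form shown above."""
    if not value_dates:
        return None  # no values yet, so no range to publish
    fmt = "%m/%d/%Y"
    return min(value_dates).strftime(fmt) + "-" + max(value_dates).strftime(fmt)


# Example: three values whose dates span 2 Nov to 20 Nov 2014.
new_range = dataset_date_range(
    [date(2014, 11, 20), date(2014, 11, 2), date(2014, 11, 10)]
)
# new_range == "11/02/2014-11/20/2014"
```

Because the range depends only on the minimum and maximum value dates, it only actually changes when a value outside the current range is added, which keeps pushes infrequent.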

seustachi added a commit that referenced this issue Jan 12, 2015
which there is some data for the dataserie
seustachi added a commit that referenced this issue Jan 12, 2015
seustachi added a commit that referenced this issue Jan 12, 2015
seustachi added a commit that referenced this issue Jan 14, 2015
alexandru-m-g added a commit that referenced this issue Jan 20, 2015
to appear under "name" instead of "id"
@seustachi seustachi modified the milestones: Sprint 47, Sprint 37 Jan 23, 2015
@seustachi seustachi modified the milestones: Sprint 48, Sprint 47 Jan 29, 2015
@seustachi
Contributor

@cjhendrix Moved to sprint 48.

Even though we started implementing this epic in sprint 46, and some work was also done in sprint 47, some sub-tasks are still pending and planned for sprint 48 or later
