
On import update modified values #309

Open · alexandru-m-g opened this issue Nov 14, 2014 · 9 comments

@alexandru-m-g
Member

Right now we only import new indicator values; we do not update values that were modified since the last import.

CHANGES:

  1. At import time, if an indicator value already exists with the same source, indicator type, entity (and entity type), periodicity, and start time BUT has a different value, THEN we need to update it and mark it as modified by the current import (see the sketch below)
  2. Add a new feature that allows a user to REMOVE all the values for a specific dataseries (i.e. for a specific indicator type + source)
  3. DISABLE the feature that allows a user to delete all values from a certain import

This is related to #278
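
A minimal sketch of point (1), assuming a hypothetical `indicator_value` table whose natural key is (source, indicator type, entity, entity type, periodicity, start time); the table and column names are illustrative, not the actual CPS schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE indicator_value (
        source TEXT, indicator_type TEXT, entity TEXT, entity_type TEXT,
        periodicity TEXT, start_time TEXT, value TEXT,
        modified_by_import INTEGER,
        PRIMARY KEY (source, indicator_type, entity, entity_type,
                     periodicity, start_time)
    )
""")

KEY_WHERE = ("source=? AND indicator_type=? AND entity=? AND entity_type=?"
             " AND periodicity=? AND start_time=?")

def import_value(row, import_id):
    """Insert a new indicator value, or update it when the natural key
    already exists with a different value, recording which import
    modified it."""
    key = (row["source"], row["indicator_type"], row["entity"],
           row["entity_type"], row["periodicity"], row["start_time"])
    existing = conn.execute(
        "SELECT value FROM indicator_value WHERE " + KEY_WHERE,
        key).fetchone()
    if existing is None:
        conn.execute("INSERT INTO indicator_value VALUES (?,?,?,?,?,?,?,?)",
                     key + (row["value"], import_id))
    elif existing[0] != row["value"]:
        conn.execute("UPDATE indicator_value SET value=?, modified_by_import=?"
                     " WHERE " + KEY_WHERE, (row["value"], import_id) + key)

# Example call shape; re-running with a changed value updates the row
# in place and tags it with the importing run's id.
import_value({"source": "WB", "indicator_type": "population",
              "entity": "RO", "entity_type": "country",
              "periodicity": "yearly", "start_time": "2014-01-01",
              "value": "19900000"}, import_id=42)
```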

@alexandru-m-g
Member Author

I have committed the change allowing the update of modified values: point (1) in the list of changes above.

@alexandru-m-g
Member Author

I'm adding more detailed requirements about (2) - deleting dataseries values, per a discussion with @cjhendrix:

  1. From the main menu:
    • In the Curated data menu, in the first section, there should be a new entry like Manage dataseries values
    • This will show the user a table (similar to the one we have for indicator types) with: indicator type, source, number of values, and a delete/empty action for each of the dataseries
  2. During the import process:
    • A similar action should be part of the import process. Basically, during the validation step (after going to Detected CKAN resources and clicking download and validate), an option should be shown called Manage dataseries values
    • This will open, in a new window, a table similar to the one from (1): it will show information from the CPS database about the dataseries that are about to be imported
    • This will allow the data team to completely remove dataseries values before they are reimported (see the sketch after this list)
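
For reference, the table behind Manage dataseries values boils down to one aggregation per dataseries, plus the delete/empty action; a sketch reusing the `conn` and illustrative schema from the first sketch above:

```python
# One row per dataseries (indicator type + source) with its value count,
# feeding the Manage dataseries values table.
for indicator_type, source, value_count in conn.execute(
        "SELECT indicator_type, source, COUNT(*)"
        " FROM indicator_value GROUP BY indicator_type, source"):
    print(indicator_type, source, value_count)

def delete_dataseries(indicator_type, source):
    # The per-row delete/empty action: remove every value of one dataseries.
    conn.execute("DELETE FROM indicator_value"
                 " WHERE indicator_type=? AND source=?",
                 (indicator_type, source))
```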

Caveat:
If we go this way, there is a moment when there is no data in the database for a dataseries: the moment between the deletion of the dataseries' values and the reimporting of these from the file. This could lead to some charts showing up empty OR to users calling the API directly and getting incomplete data.

One solution could be to:

  1. at first, not actually delete the indicator values for a dataseries, but instead mark them as to be deleted
  2. then, when importing new values, mark them as not yet activated; these values should not be used in the API or reports yet
  3. when the import finishes, run -- inside a single transaction -- an action that deletes the to be deleted values and activates the not yet activated values (see the sketch after this list)
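
A minimal sketch of step 3, assuming the illustrative schema above gains a `status` column with values 'active', 'to_be_deleted', and 'not_yet_activated'; running both statements in one transaction means API readers see either the old series or the new one, never the gap:

```python
def finalize_import(conn):
    """Run once the import has finished successfully: drop the values
    staged for deletion and activate the freshly imported ones."""
    with conn:  # a sqlite3 connection commits (or rolls back) as one transaction
        conn.execute(
            "DELETE FROM indicator_value WHERE status='to_be_deleted'")
        conn.execute(
            "UPDATE indicator_value SET status='active'"
            " WHERE status='not_yet_activated'")
```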

@cjhendrix @seustachi we should discuss how to proceed on this issue

@cjhendrix
Contributor

Thanks for documenting this, Alex. I'll add it to the CPS planning doc.

As for the caveat, I have a couple of questions:

  1. I know on the reports side, we would avoid most data gaps because of the caching that is in place. However, I'm not sure that's true on the API side. Is there any caching on CPS or CKAN or nginx that would make a gap in data availability unlikely?
  2. How much more effort is it to implement the "staged delete" solution you describe above?

@seustachi
Contributor

It doesn't sound like an enormous effort.

The model should be changed to have this new status in the unique key as well.
I can work on this issue as well; @alexandru-m-g @cjhendrix, up to you.

@cjhendrix
Contributor

Then I suggest we build it to avoid gaps in data availability, either as Alex suggested or some other approach.

@alexandru-m-g
Member Author

I don't think it's a huge effort; I'd estimate about 1 day.

On the plus side, the solution suggested above would solve another problem we have: during import (which takes some time in our case), someone accessing the API might get incomplete data, and the result could also get cached.
One thing that needs to be taken care of during implementation, though, is updating/modifying indicator values in the import process. In order to keep data consistency:

  1. If a value is found that needs to be updated, it should be marked as to be deleted
  2. Instead of updating the value that was found, a new value needs to be inserted and marked as not yet activated (see the sketch after this list)
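
A minimal sketch of these two steps, again with the illustrative schema and assuming `status` is part of the unique key (as @seustachi noted above) so the old and new rows can coexist:

```python
def stage_update(conn, row, import_id):
    """Staged replacement for an in-place update: readers keep seeing
    the old value until finalize_import() swaps the rows atomically."""
    key = (row["source"], row["indicator_type"], row["entity"],
           row["entity_type"], row["periodicity"], row["start_time"])
    where = ("source=? AND indicator_type=? AND entity=? AND entity_type=?"
             " AND periodicity=? AND start_time=?")
    old = conn.execute("SELECT value FROM indicator_value WHERE " + where +
                       " AND status='active'", key).fetchone()
    if old is not None and old[0] != row["value"]:
        # (1) Mark the existing row as "to be deleted" instead of updating it.
        conn.execute("UPDATE indicator_value SET status='to_be_deleted'"
                     " WHERE " + where + " AND status='active'", key)
        # (2) Insert the replacement as "not yet activated"; it stays out of
        # the API and reports until the import finishes.
        conn.execute(
            "INSERT INTO indicator_value (source, indicator_type, entity,"
            " entity_type, periodicity, start_time, value,"
            " modified_by_import, status)"
            " VALUES (?,?,?,?,?,?,?,?,'not_yet_activated')",
            key + (row["value"], import_id))
```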

@cjhendrix
Contributor

Sounds like the plan is clear then. Let's implement a solution like the one you described that avoids any gaps in data availability.

@seustachi
Contributor

@alexandru-m-g @cjhendrix is this done, or should it roll to the next sprint?

@alexandru-m-g
Member Author

There are still some tasks that were not done in this ticket:

  1. Add a new feature that allows a user to REMOVE all the values for a specific dataseries (for a specific indicator type + source)
  2. Implement a solution for the caveat, possibly the staged delete described above (this is related to the previous point)
  3. DISABLE the feature that allows a user to delete all values from a certain import

@cjhendrix cjhendrix modified the milestones: Sprint 47, Sprint 41, Sprint 48 Jan 26, 2015
@danmihaila danmihaila added the CPS label Jun 22, 2016