Incorrect data order for datasets 2_P_UN_WPP_Future_2100 (dataset_id 195) and 2_P_UN_WPP_Historic_2015 (dataset_id 196) #6

Closed
stefanpauliuk opened this issue May 28, 2020 · 2 comments

@stefanpauliuk

  • IEDC GitHub issue for 2_P_UN_WPP_Future_2100 (dataset_id 195) and 2_P_UN_WPP_Historic_2015 (dataset_id 196):
    The data entries for these datasets are not correct (discovered by Indecol Freiburg PhD student Stefanie Klose).
    The tables are 2D, region x year. When these datasets are downloaded from the db, the values for a given region don't make sense. On checking, it turns out that for each additional year, the value is taken from one row (region) further down in the original table:

When parsing the table for a fixed row (here: region), the parser does not enter the values on that row but goes down one row for each column to the right (year).
For the leftmost year, the data are sometimes correct and sometimes taken from one row below; for the next year, the data are taken from one more row below, and so on. Then the data start again at the top of the table.

I checked about 20 other TABLE datasets and found no problem at all. So far, this problem has only been seen with datasets 195 and 196.

My only explanation at the moment is that these datasets were uploaded with an earlier version of the parser, and a simple re-upload will fix it.

Also, what makes these two datasets a bit special is that the region labels are 3-digit country codes (attribute4 of the classification), whereas in most cases we use attribute1. But I checked some TABLE datasets that use custom classifications or attribute3 or so, and found no problem there.

@stefanpauliuk stefanpauliuk self-assigned this May 28, 2020
@nheeren

nheeren commented May 28, 2020

Let me know if I should investigate. I would be surprised if this is an issue with this version as it is the one I used to upload the last 10–20 datasets.

@stefanpauliuk

Problem solved!
First, there was a duplicate entry for region code 780 (see the report here: IndEcol/IE_data_commons#25), which led the following command to produce a table/df longer than the original one, with a duplicate match for each year:
tmp = data.merge(db_classitems2, left_on=class_name, right_on='attribute%s_oto' % str(int(attribute_no)), how='left')
Second, when that longer merged df was matched with the original one, the result was a period mismatch of 1 (one df: data for a new year start after 273 entries; other df: after 274 entries), which leads to an offset of 1 region for each year in
data.loc[:, class_name] = tmp['i']
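
The mechanism described above can be reproduced on a toy example. This is only a minimal sketch: the three region codes, the `i` values, and the column names `region`, `value`, and `region_id` are made up for illustration, and the result is written to a new column rather than overwriting `class_name` as the upload routine does. It assumes the data are stored in long format, one row per (year, region).

```python
import pandas as pd

# Hypothetical long-format data: 3 regions x 2 years, one row per (year, region).
data = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2001, 2001, 2001],
    "region": ["276", "780", "840", "276", "780", "840"],
    "value":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Hypothetical classification lookup with a duplicate entry for code 780.
db_classitems2 = pd.DataFrame({
    "i": [10, 11, 12, 13],
    "attribute4_oto": ["276", "780", "780", "840"],
})

# The left merge returns one row per match, so every 780 row appears twice.
tmp = data.merge(db_classitems2, left_on="region",
                 right_on="attribute4_oto", how="left")
print(len(data.index), len(tmp.index))  # 6 rows in, 8 rows out

# Assigning back aligns on the fresh 0..n-1 index of tmp, so every row after
# a duplicate receives the id of an earlier row: the offset grows by one
# with each year.
data["region_id"] = tmp["i"]
print(data)
```

In this toy case the year 2000 rows receive the ids 10, 11, 12 and the year 2001 rows receive 13, 10, 11, i.e. the second year is shifted by one region, which mirrors the 273 vs. 274 period mismatch.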

Solution:

  1. Remove the duplicate entry.
  2. Add a check to the upload routine (very helpful!) to report and avoid such double instances:
    if len(tmp.index) != len(data.index):
        raise AssertionError("The database classification table contains at least one conflicting duplicate entry for the unique attribute attribute%s_oto of classification %s. Data upload halted. Check classification for duplicate entries!" % (str(int(attribute_no)), class_name))
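
For reference, the check can be wrapped around the merge roughly as follows. This is a hedged sketch, not the actual upload routine: the function name `merge_classification` and the extra reporting of the offending keys are assumptions added for illustration.

```python
import pandas as pd

def merge_classification(data, classitems, class_name, attribute_no):
    """Left-merge data with a classification lookup and halt on duplicates.

    Hypothetical helper: the name and the duplicate-key report are
    illustrative and not taken from the upload routine itself.
    """
    key = "attribute%s_oto" % str(int(attribute_no))
    tmp = data.merge(classitems, left_on=class_name, right_on=key, how="left")
    # A left merge on a unique key keeps the row count; a longer result
    # means at least one key matched more than one classification item.
    if len(tmp.index) != len(data.index):
        dupes = classitems.loc[classitems[key].duplicated(keep=False), key]
        raise AssertionError(
            "The database classification table contains at least one "
            "conflicting duplicate entry for the unique attribute %s of "
            "classification %s (duplicated values: %s). Data upload halted."
            % (key, class_name, sorted(dupes.unique().tolist()))
        )
    return tmp
```

As an alternative, pandas can perform a similar uniqueness check itself: passing `validate="many_to_one"` to `merge` raises a `pandas.errors.MergeError` when the right-hand keys are not unique.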
