Incorrect data order for datasets 2_P_UN_WPP_Future_2100 (dataset_id 195) and 2_P_UN_WPP_Historic_2015 (dataset_id 196) #6

stefanpauliuk · 2020-05-28T20:13:13Z

IEDC GitHub issue for 2_P_UN_WPP_Future_2100 (dataset_id 195) and 2_P_UN_WPP_Historic_2015 (dataset_id 196):
The data entries for these datasets are not correct (discovered by Indecol Freiburg PhD student Stefanie Klose).
The tables are 2D, region x year. When downloading these datasets from the db, the values for a given region don't make sense. On checking, it turns out that for each additional year, the values are taken from one row (region) further down in the original table:

When parsing the table for a fixed row (here: region), the parser does not stay on that row but moves down one row for each column to the right (year).
For the leftmost year, the data are sometimes correct and sometimes taken from one row below; for the next year, data from one further row below are taken, and so on. After the last row, the data continue from the top of the table.

I checked ca. 20 other TABLE datasets and found no problem at all. So far, this problem has only been seen with datasets 195 and 196.

My only explanation at the moment is that these datasets were uploaded with an earlier version of the parser, and a simple re-upload will fix it.

Also, what makes these two datasets a bit special is that the region labels are 3-digit country codes (attribute4 of the classification), whereas in most cases we use attribute1. But I checked some TABLE datasets that use custom classifications or attribute3 or so, and found no problem there.

nheeren · 2020-05-28T20:25:46Z

Let me know if I should investigate. I would be surprised if this is an issue with this version as it is the one I used to upload the last 10–20 datasets.

stefanpauliuk · 2021-08-03T15:05:35Z

Problem solved!
First, there was a duplicate entry for region code 780 (see report here: IndEcol/IE_data_commons#25), which led the following command to produce a table/df longer than the original one, with a duplicate match for each year:
tmp = data.merge(db_classitems2, left_on=class_name, right_on='attribute%s_oto' % str(int(attribute_no)), how='left')
Second, when that longer merged df was matched against the original one, the two were out of step by 1 per year (in one df, the data for a new year start after 273 entries; in the other, after 274 entries), which leads to an offset of one region for each year in
data.loc[:, class_name] = tmp['i']
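
For anyone hitting something similar, here is a minimal sketch of the mechanism on a toy table. The column names `region`, `year`, `value`, the ids in `i` and the 3 x 2 layout are made up for illustration; only the merge call and the index-aligned assignment mirror the upload routine. The result is written to a new column `region_id` to keep the toy example simple.

```python
import pandas as pd

# Toy stand-in for the flattened upload table: 3 regions x 2 years (hypothetical values).
data = pd.DataFrame({
    "region": ["776", "780", "784", "776", "780", "784"],
    "year":   [2000, 2000, 2000, 2001, 2001, 2001],
    "value":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Toy classification table with a duplicate entry for code 780 (hypothetical ids in 'i').
db_classitems2 = pd.DataFrame({
    "i": [10, 11, 12, 13],
    "attribute4_oto": ["776", "780", "780", "784"],
})

class_name = "region"
attribute_no = 4

# Same merge as in the upload routine: each 780 row matches twice,
# so tmp has one extra row per year.
tmp = data.merge(db_classitems2, left_on=class_name,
                 right_on='attribute%s_oto' % str(int(attribute_no)), how='left')
print(len(data.index), len(tmp.index))  # 6 rows in data, 8 rows in tmp

# The assignment aligns on the row index, not on region/year, so the extra
# rows push every later entry down by one position per year.
data["region_id"] = tmp["i"]

# Row 3 is region 776 in year 2001 and should get id 10,
# but it receives 13, the id of region 784.
print(data.loc[3, ["region", "year", "region_id"]].tolist())
```

In the real tables the period was 273 vs. 274 entries per year instead of 3 vs. 4, but the drift works the same way.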

Solution:

1) remove the duplicate entry and 2) add a check in the upload routine (very helpful!) to report and avoid such double instances:
if len(tmp.index) != len(data.index): raise AssertionError("The database classification table contains at least one conflicting duplicate entry for the unique attribute attribute%s_oto of classification %s. Data upload halted. Check classification for duplicate entries!" % (str(int(attribute_no)), class_name))
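
A self-contained sketch of how such a check could be wrapped, in case it is useful elsewhere. The helper name `map_to_class_ids` and the toy tables are hypothetical and not part of the actual upload routine; the sketch additionally lists the offending codes in the error message, which the check above does not do.

```python
import pandas as pd

def map_to_class_ids(data, db_classitems2, class_name, attribute_no):
    """Hypothetical helper: map class labels to ids, refusing duplicate keys."""
    key = 'attribute%s_oto' % str(int(attribute_no))
    tmp = data.merge(db_classitems2, left_on=class_name, right_on=key, how='left')
    # A left merge against a unique key keeps the row count unchanged.
    # Any growth means at least one label matched more than one classification item.
    if len(tmp.index) != len(data.index):
        dupes = db_classitems2.loc[db_classitems2[key].duplicated(keep=False), key]
        raise AssertionError(
            "The database classification table contains at least one conflicting "
            "duplicate entry for the unique attribute %s of classification %s: %s. "
            "Data upload halted." % (key, class_name, sorted(dupes.unique()))
        )
    return tmp['i']

# Toy data (hypothetical codes and ids).
data = pd.DataFrame({"region": ["776", "780", "784"], "value": [1.0, 2.0, 3.0]})
clean = pd.DataFrame({"i": [10, 11, 13], "attribute4_oto": ["776", "780", "784"]})
ids = map_to_class_ids(data, clean, "region", 4)
print(ids.tolist())
```

As an alternative, pandas can enforce the same condition inside the merge itself: passing `validate="many_to_one"` to `DataFrame.merge` raises a `MergeError` when the right-hand keys are not unique.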

stefanpauliuk self-assigned this May 28, 2020

stefanpauliuk closed this as completed Aug 3, 2021
