Concerns v0.1.4 (figshare) / v0.1.40 (pypi)
Some proteomics tables contain multiple entries per entrez_id, improve_sample_id "key-pair". The way these duplicates appear is not consistent across tables.
Proteomics tables checked (only at a surface level, so there might be more inconsistencies):
- beataml_proteomics: contains multiple copies of completely identical rows. Most likely this can be fixed with a simple pandas.DataFrame.drop_duplicates() call during build. 10,468,364 total entries, 1,570,668 unique rows, 1,526,225 unique "key-pairs". Since unique rows still exceed unique "key-pairs", some "key-pairs" would remain duplicated after that call.
- cptac_proteomics: contains some duplicate entrez_id, improve_sample_id "key-pairs". 11,458,410 total entries, 10,614,710 unique "key-pairs". The proteomics measurements for duplicate "key-pairs" are different.
- mpnst_proteomics: seems to contain two entries with differing proteomics measurements for each entrez_id, improve_sample_id "key-pair". 113,800 total entries, 55,231 unique "key-pairs", i.e. for some "key-pairs" there must be more than 2 entries.
- broad_sanger_proteomics: contains some duplicate entrez_id, improve_sample_id "key-pairs" (to be expected from what I understand, since proteomics measurements can come from either broad, sanger or both). 8,722,803 entries in total, 6,903,778 unique "key-pairs". The proteomics measurements for duplicate "key-pairs" are different (also to be expected).
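
For reference, the counts above can be reproduced with something like the following. This is only a rough sketch: it assumes each table is loaded as a pandas DataFrame with columns entrez_id, improve_sample_id and proteomics; the helper name summarize_duplicates and the toy data are made up for illustration and are not part of the package.

```python
import pandas as pd

# Assumed key columns; adjust if the actual table schema differs.
KEYS = ["entrez_id", "improve_sample_id"]


def summarize_duplicates(df, keys=KEYS, value_col="proteomics"):
    """Count total rows, unique rows, unique key-pairs and conflicting key-pairs."""
    per_key = df.groupby(keys)[value_col]
    # nunique() ignores NaN, so key-pairs that differ only by NaN are not flagged.
    distinct_values = per_key.nunique()
    return {
        "total_entries": len(df),
        "unique_rows": len(df.drop_duplicates()),
        "unique_key_pairs": len(df.drop_duplicates(subset=keys)),
        # key-pairs that carry more than one distinct measurement
        "conflicting_key_pairs": int((distinct_values > 1).sum()),
        "max_entries_per_key_pair": int(per_key.size().max()),
    }


# Toy stand-in for a proteomics table (hypothetical values).
toy = pd.DataFrame(
    {
        "entrez_id": [1, 1, 1, 2, 2, 3],
        "improve_sample_id": [10, 10, 10, 10, 10, 11],
        "proteomics": [0.5, 0.5, 0.7, 1.2, 1.2, -0.3],
    }
)

summary = summarize_duplicates(toy)
print(summary)

# Dropping fully identical rows only removes exact copies;
# key-pairs with differing measurements stay duplicated.
deduped = toy.drop_duplicates()
print(len(deduped), deduped.duplicated(subset=KEYS).any())
```

In the toy data the key-pair (1, 10) still has two rows after drop_duplicates(), which is the same situation the beataml_proteomics numbers point to.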