duplicate rows in proteomics table

Concerns v0.1.4 (figshare) / v0.1.40 (pypi)

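Some proteomics tables contain multiple entries per (`entrez_id`, `improve_sample_id`) "key-pair". The duplicates are not consistent across tables: in some the repeated rows are completely identical, in others the `proteomics` measurements differ.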

Proteomics tables checked (only at a "surface level", so there might be more inconsistencies):

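- beataml_proteomics: contains multiple copies of completely identical rows. Most likely this can be fixed with a simple `pandas.DataFrame.drop_duplicates()` call during build. 10,468,364 total entries, 1,570,668 unique rows, 1,526,225 unique "key-pairs". Note that unique rows exceed unique "key-pairs" by 44,443, so some "key-pairs" would still have more than one row after exact duplicates are dropped.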
- cptac_proteomics: contains some duplicate `entrez_id` & `improve_sample_id` "key-pairs". 11,458,410 total entries, 10,614,710 unique "key-pairs". `proteomics` measurements for duplicate "key-pairs" are different.
- mpnst_proteomics: seems to contain two entries with differing `proteomics` measurements for each `entrez_id` & `improve_sample_id` "key-pair". 113,800 total entries, 55,231 unique "key-pairs"; since 2 × 55,231 = 110,462 is less than 113,800, some "key-pairs" must have more than 2 entries.
- broad_sanger_proteomics: contains some duplicate `entrez_id` & `improve_sample_id` "key-pairs" (to be expected from what I understand, since proteomics measurements can come from either Broad, Sanger, or both). 8,722,803 total entries, 6,903,778 unique "key-pairs". `proteomics` measurements for duplicate "key-pairs" are different (also to be expected).
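
For reference, counts like the ones above can be reproduced with a few lines of pandas. This is only a sketch: the column names `entrez_id`, `improve_sample_id` and `proteomics` are taken from this issue, while the helper name `summarize_duplicates` and the toy data are made up for illustration and are not part of the package.

```python
import pandas as pd

KEY = ["entrez_id", "improve_sample_id"]  # "key-pair" columns named in this issue


def summarize_duplicates(df: pd.DataFrame, value_col: str = "proteomics") -> dict:
    """Hypothetical helper: count entries, unique rows, and duplicated key-pairs."""
    # Distinct measurement values per key-pair. Caveat: groupby skips rows with
    # a missing key by default, and nunique() ignores missing measurements.
    values_per_key = df.groupby(KEY)[value_col].nunique()
    return {
        "total_entries": len(df),
        "unique_rows": len(df.drop_duplicates()),
        "unique_key_pairs": len(values_per_key),
        # key-pairs whose repeated rows carry different measurements
        "conflicting_key_pairs": int((values_per_key > 1).sum()),
        "max_rows_per_key_pair": int(df.groupby(KEY).size().max()),
    }


# Toy data, invented for illustration (not taken from any of the tables above).
toy = pd.DataFrame(
    {
        "entrez_id": [1, 1, 2, 2, 3, 3, 3],
        "improve_sample_id": [10, 10, 10, 10, 11, 11, 11],
        "proteomics": [0.5, 0.5, 1.0, 2.0, 0.1, 0.2, 0.3],
    }
)
summary = summarize_duplicates(toy)

# Dropping exact duplicates removes only the identical copy of key-pair (1, 10);
# the conflicting key-pairs (2, 10) and (3, 11) still have several rows each.
deduplicated = toy.drop_duplicates()
```

Comparing `unique_rows` with `unique_key_pairs` shows whether `drop_duplicates()` alone is enough: if the two differ, some "key-pairs" have conflicting measurements and need a separate decision (e.g. keeping the source as an extra column or aggregating).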


duplicate rows in proteomics table #256
