duplicate rows in proteomics table #256

@ymahlich

Concerns v0.1.4 (figshare) / v0.1.40 (pypi)

Some proteomics tables contain multiple entries per entrez_id & improve_sample_id "key-pair". The duplicates are not consistent across tables.

Proteomics tables checked (only at "surface level", so there might be more inconsistencies):

  • beataml_proteomics: contains multiple copies of completely identical rows. Most likely this can be fixed with a simple pandas.DataFrame.drop_duplicates() call during build. 10,468,364 total entries, 1,570,668 unique rows, 1,526,225 unique "key-pairs".
  • cptac_proteomics: contains some duplicate entrez_id & improve_sample_id "key-pairs". 11,458,410 total entries, 10,614,710 unique "key-pairs". Proteomics measurements for duplicate "key-pairs" differ.
  • mpnst_proteomics: seems to contain two entries with differing proteomics measurements for each entrez_id & improve_sample_id "key-pair". 113,800 total entries, 55,231 unique "key-pairs", i.e. for some "key-pairs" there must be more than 2 entries.
  • broad_sanger_proteomics: contains some duplicate entrez_id & improve_sample_id "key-pairs" (to be expected from what I understand, since proteomics measurements can come from Broad, Sanger, or both). 8,722,803 total entries, 6,903,778 unique "key-pairs". Proteomics measurements for duplicate "key-pairs" differ (also to be expected).
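
For reference, the counts above (total entries, unique rows, unique "key-pairs") can be reproduced with something like the sketch below. It is a minimal illustration, not the code used for the numbers in this issue: the helper name `duplicate_report` is hypothetical, the measurement column name `proteomics` is an assumption, and the toy DataFrame stands in for a real table.

```python
import pandas as pd

# The "key-pair" discussed in this issue.
KEY = ["entrez_id", "improve_sample_id"]


def duplicate_report(df: pd.DataFrame, key=KEY, value="proteomics") -> dict:
    """Summarise duplication in a proteomics table.

    `value` is the measurement column; its name is an assumption here.
    """
    # Number of distinct measurement values observed for each key-pair.
    values_per_key = df.groupby(key)[value].nunique()
    return {
        "total_rows": len(df),
        # Rows left after dropping rows that are identical in every column.
        "unique_rows": len(df.drop_duplicates()),
        # Distinct (entrez_id, improve_sample_id) combinations.
        "unique_key_pairs": len(df.drop_duplicates(subset=key)),
        # Key-pairs whose duplicate rows disagree on the measurement.
        "conflicting_key_pairs": int((values_per_key > 1).sum()),
    }


# Toy stand-in for a real table: one key-pair with an identical repeat
# plus a conflicting value, one key-pair with an identical repeat only,
# and one key-pair that appears once.
toy = pd.DataFrame(
    {
        "entrez_id": [1, 1, 1, 2, 2, 3],
        "improve_sample_id": [10, 10, 10, 10, 10, 10],
        "proteomics": [0.5, 0.5, 0.7, 1.2, 1.2, -0.3],
    }
)

report = duplicate_report(toy)
print(report)
```

If `unique_rows` is larger than `unique_key_pairs`, dropping identical rows alone does not leave one row per "key-pair". Going by the beataml_proteomics counts above (1,570,668 unique rows vs. 1,526,225 unique "key-pairs"), that may be the case there as well.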
