dereplicate proteins on genus level to reduce database size #21

jvollme · 2022-02-21T17:22:13Z

Maybe use a 98% amino acid identity cut-off?
Proteins that are unique to one species within a genus would still be attributed to that individual species (but only one copy would be kept in the case of multi-copy entries).
"Redundant" proteins that occur identically in multiple species of a genus would be attributed to the genus instead of the species (again represented by only one copy).
--> reduces dataset size
--> increases diamond/blast search speeds
--> increases speed of LCA-classifications (a little bit)?
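
A rough sketch of the relabeling rule described above, to make the idea concrete. This is not existing code from the project: the function name `dereplicate_by_genus` and the `(protein_id, genus, species, sequence)` record layout are hypothetical, and exact sequence identity is used as a stand-in for the proposed 98% cut-off (a real implementation would presumably need a clustering tool for that step).

```python
from collections import defaultdict


def dereplicate_by_genus(records):
    """Collapse identical proteins within each genus (hypothetical helper).

    records: iterable of (protein_id, genus, species, sequence) tuples.
    Returns a list of (protein_id, taxon_label, sequence) tuples with one
    representative per distinct sequence per genus.
    """
    # Group by (genus, sequence). ASSUMPTION: exact identity stands in for
    # the 98% amino acid identity clustering suggested in the issue.
    groups = defaultdict(list)
    for protein_id, genus, species, sequence in records:
        groups[(genus, sequence)].append((protein_id, species))

    dereplicated = []
    for (genus, sequence), members in groups.items():
        species_seen = {species for _, species in members}
        representative_id = members[0][0]  # keep the first copy encountered
        # Found in a single species: keep the species-level label.
        # Found in several species of the genus: lift the label to the genus.
        taxon = next(iter(species_seen)) if len(species_seen) == 1 else genus
        dereplicated.append((representative_id, taxon, sequence))
    return dereplicated


# Made-up toy records, for illustration only.
example = [
    ("p1", "Escherichia", "Escherichia coli", "MKTAYIAK"),
    ("p2", "Escherichia", "Escherichia coli", "MKTAYIAK"),      # multi-copy in one species
    ("p3", "Escherichia", "Escherichia coli", "MSLLNQ"),
    ("p4", "Escherichia", "Escherichia fergusonii", "MSLLNQ"),  # shared across species
    ("p5", "Salmonella", "Salmonella enterica", "MSLLNQ"),      # different genus
]
result = dereplicate_by_genus(example)
for entry in result:
    print(entry)
```

In this toy input, five records collapse to three: the duplicated E. coli protein keeps its species label, the protein shared by two Escherichia species is relabeled to the genus, and the Salmonella entry is untouched because dereplication never crosses genus boundaries.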

jvollme added this to the improvementsB milestone Feb 21, 2022

jvollme self-assigned this Feb 21, 2022

