[RFC] Enabling Cython-based PDB parser backend for speed improvements #139

a-r-j · 2023-08-29T11:27:29Z

Describe the workflow you want to enable

Currently, the pure-Python PDB parsing in BioPandas is quite slow, certainly too slow for high-throughput structural bioinformatics or ML.

Describe your proposed solution

I have written a Cython-based implementation (CPDB) which is considerably faster, and I would like to set it as the default parsing backend. As it stands, I believe this to be one of the fastest (if not the fastest) available PDB parsers for Python.

Performance comparison

However, given BioPandas' widespread usage, I am unsure whether distributing it with a Cython component will lead to dependency problems for users.

Describe alternatives you've considered, if relevant

Speeding up the passage of time

Additional context

rasbt · 2023-08-29T11:54:04Z

@a-r-j This is super cool.

Btw, perhaps we don't need to worry about extra dependencies here because NumPy already uses Cython (https://github.com/numpy/numpy/blob/main/build_requirements.txt), and pandas is built on NumPy, and BioPandas is built on pandas :P

a-r-j · 2023-08-29T13:27:47Z

That's a good point! I was mostly concerned about the potential for build problems (mostly as cpdb is my first time working with Cython). I'll make a PR tonight and push a dev release so we can collect some feedback.

Ruibin-Liu · 2023-08-30T18:45:52Z

One difference in the comparison is that your Cython implementation only reads ATOM, HETATM, and ENDMDL lines while BioPandas reads all of them. It would be interesting to compare the performance if all lines are read (no need to parse like BioPandas?).

a-r-j · 2023-08-30T18:49:39Z

@Ruibin-Liu Hmm, that's a really great point. I could add a read_header arg to cpdb. In any case, I wouldn't have thought it would make a huge difference to speed; in terms of line count, PDB files are mostly coordinates.
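
For anyone who wants to sanity-check the line-count point on their own files, here is a rough pure-Python sketch. It is not cpdb's code: `select_lines`, `coordinate_fraction`, and the `read_header` flag are hypothetical names that only mirror the idea discussed above, and the sample records are toy data rather than a real structure. It relies only on the PDB convention that the record name occupies the first six columns of each line.

```python
from collections import Counter

# Record types the Cython parser is said to read in this thread.
COORD_RECORDS = {"ATOM", "HETATM", "ENDMDL"}


def select_lines(lines, read_header=False):
    """Yield PDB lines, skipping non-coordinate records unless read_header is True.

    `read_header` is a hypothetical flag, not an existing cpdb argument.
    """
    for line in lines:
        record = line[:6].strip()  # record name lives in columns 1-6
        if read_header or record in COORD_RECORDS:
            yield line


def coordinate_fraction(lines):
    """Return the share of non-blank lines that are ATOM/HETATM/ENDMDL records."""
    counts = Counter(line[:6].strip() for line in lines if line.strip())
    total = sum(counts.values())
    coord = sum(counts[name] for name in COORD_RECORDS)
    return coord / total if total else 0.0


# Toy input for illustration only.
sample = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    "REMARK   1 toy data, not a real structure",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "HETATM    3  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O",
    "ENDMDL",
]

print(len(list(select_lines(sample))))
print(len(list(select_lines(sample, read_header=True))))
print(coordinate_fraction(sample))
```

Running `coordinate_fraction` over the lines of a real file (for example `open(path).read().splitlines()`) would give a quick estimate of how much extra work reading every record adds, though actual timings would still need to be measured.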
