Refactoring of PDB_Prepare #2

Open
discoleo opened this issue Jan 6, 2023 · 0 comments
discoleo commented Jan 6, 2023

Function PDB_Prepare

This function can benefit a lot from refactoring:

  • split sub-functions into separate functions: these can be useful on their own;
  • reduce the complexity of the code;
  • optimize the code and remove dependency on dplyr;
  • correct an ugly bug;

I will address all these topics below. The refactored code is also available on GitHub:

PDB_Prepare Main Function

The refactored main function is in file Proteins.Structure.FiScore.R (this is only a convenience name for my script files; the initial name can be retained).

  • most of the code has been moved to external helper functions;
  • the lower limit on the number of amino acids (aa) can be set explicitly: default = 5;
  • the last for-loop should run more efficiently and the dependence on dplyr has also been removed;
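
A minimal sketch of how an explicit lower limit could be exposed as an argument. The helper name check.aa.limit, the argument name min.aa and the choice to stop with an error are assumptions made for illustration; the refactored main function may handle the limit differently.

```r
# Hypothetical helper (not part of the original code):
# checks that a structure has at least `min.aa` distinct residues.
check.aa.limit = function(resno, min.aa = 5) {
  n = length(unique(resno));
  if (n < min.aa) {
    stop("Too few amino acids: ", n, " (minimum: ", min.aa, ")");
  }
  invisible(n);
}

# Residue numbers as they might appear per atom (hypothetical values)
resno = c(1, 1, 2, 2, 3, 4, 5, 6);
n.aa = check.aa.limit(resno);              # passes with the default limit of 5
n.ok = check.aa.limit(resno, min.aa = 6);  # a stricter limit can be requested
```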

Features

The features are extracted by the helper function features.pdb (see file Proteins.Structure.R). This function can be used on its own and should be exported by the package.

  • the function also uses the helper functions: as.type.helix and as.type.sheet;
  • the structure name is stored as a factor (for efficiency): it therefore requires an explicit as.character() when used in the main function;
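
The need for the explicit conversion can be illustrated with a small sketch. The vector type is a hypothetical stand-in for the structure names returned by the features function; the labels are made up.

```r
# Hypothetical structure names, stored as a factor
type = factor(c("helix", "sheet", "helix"));

# A factor holds integer codes plus a table of levels:
codes  = as.integer(type);    # codes into levels(type), not the names
labels = as.character(type);  # the names themselves

# Code that expects strings should therefore convert explicitly:
isHelix = as.character(type) == "helix";
```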

Torsions & B-Factor

The torsions and the B-factor are computed by separate functions (see file Proteins.Structure.R). These functions can be used on their own as well.

  • string extraction: the vectorized version is used directly and should run far more efficiently, e.g.:
    df_resno = as.numeric(stringr::str_extract(rownames(pdb_df), "[0-9]{1,}"));
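
A minimal sketch of the vectorized extraction on hypothetical row names (the format of the names is an assumption). It uses base R so that it runs without additional packages; the helper name extract.resno is made up for this example. For the inputs shown, stringr::str_extract with the pattern from the line above returns the same values.

```r
# Hypothetical helper: extracts the first run of digits from each string;
# strings without digits yield NA.
extract.resno = function(x) {
  m   = regexpr("[0-9]+", x);
  ok  = ! is.na(m) & m > 0;
  res = rep(NA_character_, length(x));
  res[ok] = regmatches(x, m);  # regmatches returns only the matched elements
  as.numeric(res);
}

# Hypothetical row names; the last one contains no residue number
nms = c("1.A.MET", "25.A.GLY", "130.B.LYS", "HOH");
df_resno = extract.resno(nms);
```

The whole vector is processed in a single call, so no explicit loop over the rows is needed.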

An ugly bug was also corrected:

  • the torsions function now stores an attribute with the complete cases (as a logical vector):
    attr(pdb_df, "complete") = isComplete;
  • the BFactor function explicitly uses this information to select only the complete cases;
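
The mechanism can be sketched as follows. The data, the column names and the variable bfactor are hypothetical; only the attribute name "complete" and the variable name isComplete are taken from the line above, and the actual functions may differ in detail.

```r
# Hypothetical torsions table: one row per residue, some angles missing
tors = data.frame(
  phi = c(NA, -60, -65, -120),
  psi = c(150, -45, NA, 130));

# Torsions step: keep the complete rows and remember which ones were kept
isComplete = complete.cases(tors);
pdb_df = tors[isComplete, , drop = FALSE];
attr(pdb_df, "complete") = isComplete;

# B-factor step: hypothetical per-residue values, one per original row
bfactor = c(20.1, 15.3, 18.7, 30.2);
keep = attr(pdb_df, "complete");
stopifnot(length(keep) == length(bfactor));
pdb_df$B = bfactor[keep];  # aligned with the rows that were kept
```

Storing the logical vector alongside the data lets the second step select exactly the same residues without having to recompute which rows were dropped.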

Other

  • read.pdb: is a minor helper function not actually used in the code;

The refactored code should be faster and more robust. The function names are provisional and may be changed or adapted to better suit various workflows.

Note:

  • the refactored code has NOT been thoroughly tested!