Tool for the canonicalization of Polymer SMILES (PSMILES) strings

Ramprasad-Group/canonicalize_psmiles

Canonicalize PSMILES

I recommend using the psmiles Python package, which integrates canonicalization with other tools for working with PSMILES.

PSMILES (Polymer SMILES) is a chemical language for representing polymer structures. A PSMILES string contains two star symbols ([*] or *) that mark the two endpoints of the polymer repeat unit; otherwise it follows the Daylight SMILES syntax as defined by OpenSMILES. PSMILES was developed as part of work available on arXiv.
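The two-star rule can be checked with a few lines of plain Python. This is only a minimal sketch: `has_two_endpoints` is a hypothetical helper name, not part of this package, and it does not verify that the rest of the string is valid SMILES.

```python
def has_two_endpoints(psmiles: str) -> bool:
    """Return True if the string contains exactly two star atoms.

    Hypothetical helper for illustration. Both spellings, "[*]" and "*",
    contain exactly one "*" character, so counting that character
    covers both.
    """
    return psmiles.count("*") == 2


print(has_two_endpoints("[*]CCO[*]"))  # True
print(has_two_endpoints("*CC(*)C"))    # True
print(has_two_endpoints("CCO"))        # False
```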

The raw PSMILES syntax is ambiguous and non-unique; i.e., the same polymer may be written using many PSMILES strings:

| Polyethylene | Polyethylene oxide | Polypropylene |
| --- | --- | --- |
| [*]C[*] | [*]CCO[*] | [*]CC([*])C |
| [*]CC[*] | [*]COC[*] | [*]CC(CC([*])C)C |
| [*]CCC[*] | [*]OCC[*] | CC([*])C[*] |
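To see why the three polyethylene oxide strings above describe the same polymer, note that an endless chain of repeat units can be cut at any bond and read in either direction. The sketch below enumerates those variants with plain string operations. It is a toy illustration, not the method used by this package: `strip_stars` and `equivalent_units` are hypothetical helpers, and they assume an unbranched chain of single-character atoms with a star at each end (no branches, ring digits, bracket atoms, or two-letter elements).

```python
def strip_stars(psmiles: str) -> str:
    """Remove the endpoint markers from a linear PSMILES string.

    Assumes one star at the very start and one at the very end.
    """
    for star in ("[*]", "*"):
        if psmiles.startswith(star) and psmiles.endswith(star):
            return psmiles[len(star):len(psmiles) - len(star)]
    raise ValueError("expected a star at both ends of the string")


def equivalent_units(unit: str) -> set:
    """Return every rotation of the unit, read forwards and backwards."""
    variants = set()
    for s in (unit, unit[::-1]):
        for i in range(len(s)):
            variants.add(s[i:] + s[:i])
    return variants


peo = ["[*]CCO[*]", "[*]COC[*]", "[*]OCC[*]"]
variant_sets = [equivalent_units(strip_stars(p)) for p in peo]

# All three strings produce the same set of rotations.
print(variant_sets[0] == variant_sets[1] == variant_sets[2])  # True
```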

The canonicalization routine of this package finds a canonical form of a PSMILES string by:

  1. Finding the shortest representation of a PSMILES string

[*]CCOCCO[*] -> [*]CCO[*]

  2. Making the PSMILES string cyclic

[*]CCO[*] -> C1 CCO C1

  3. Applying the canonicalization routine as implemented in RDKit

C1 CCO C1 -> C1 COC C1

  4. Breaking the cyclic bond

C1 COC C1 -> [*]COC[*]
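The first step above can be illustrated with a string-level sketch. It assumes a linear PSMILES string written with `[*]` at both ends and looks for the shortest substring whose repetition reproduces the text between the stars. `shortest_repeat_unit` is a hypothetical helper for illustration; the package itself may work differently (for example, on the molecular graph), and this sketch does not cover steps 2 to 4, which rely on RDKit.

```python
def shortest_repeat_unit(psmiles: str) -> str:
    """Reduce a linear PSMILES string to its shortest repeating unit.

    Assumption: the string starts and ends with "[*]". Only exact
    textual repetition is detected, e.g. "CCOCCO" -> "CCO".
    """
    star = "[*]"
    if not (psmiles.startswith(star) and psmiles.endswith(star)):
        raise ValueError("expected '[*]' at both ends of the string")

    body = psmiles[len(star):len(psmiles) - len(star)]
    n = len(body)

    # Try candidate unit lengths from shortest to longest.
    for size in range(1, n + 1):
        if n % size == 0 and body[:size] * (n // size) == body:
            return star + body[:size] + star

    # Empty body: nothing to shorten.
    return psmiles


print(shortest_repeat_unit("[*]CCOCCO[*]"))  # [*]CCO[*]
print(shortest_repeat_unit("[*]CCC[*]"))     # [*]C[*]
```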

Install

pip install git+https://github.com/Ramprasad-Group/canonicalize_psmiles.git

How to use

See also test.ipynb

from canonicalize_psmiles.canonicalize import canonicalize

smiles = "[*]NC(C)CC([*])=O"
print(smiles)
print(canonicalize(smiles))