populate drugs structure property table #142

Closed
sgosline opened this issue Apr 9, 2024 · 5 comments · Fixed by #192

@sgosline
Member

sgosline commented Apr 9, 2024

need to add drug structure/fingerprints.

@sgosline sgosline added the new data Request for additional data to be added label Apr 9, 2024
@jjacobson95 jjacobson95 self-assigned this Apr 11, 2024
@jjacobson95
Collaborator

This is pretty straightforward with Python's RDKit. We just need to know how detailed we want the fingerprints to be.

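import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles):
    # MolFromSmiles returns None when the SMILES string cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Morgan fingerprint as a fixed-length bit vector
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)  # update these parameters
    fingerprint_array = np.array(fingerprint)
    return fingerprint_array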

@sgosline
Member Author

also need to add mordred descriptors

@jjacobson95
Collaborator

@sgosline What are your thoughts on changing the schema for Drug Descriptor to have three columns, improve_drug_id, fingerprint, and mordred?

Also what should these files be called? "{dataset}_drug_descriptor.tsv"? or "{dataset}_structures.tsv"?

Current Schema:

  Drug:
    description: List of chemicals/drugs used in the data package. Each identifier corresponds to a distinct structure.
    slots:
      - improve_drug_id
    attributes:
      chem_name:
        description: Name of drug
      canSMILES:
        description: Canonical SMILES string
      isoSMILES:
        description: Isomeric SMILES string
      InChIKey:
        description: InChIKey
      formula:
        description: Chemical formula
      weight:
        description: Molecular weight
        range: float
      pubchem_id:
        description: PubChem Identifier for this drug, can be many.
        range: int
  Drug Descriptor:
    description: Computational summary of drug chemical properties
    slots:
      - improve_drug_id
    attributes:
      structural_descriptor:
        description: string name describing structural descriptor
      descriptor_value:
        range: any
        description: value representing descriptor value

@sgosline
Member Author

There are currently 1800 different mordred descriptors, so creating an entry for each drug will be time-consuming and space-intensive. We need a use case/algorithm to motivate this further.
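
To put the space concern in rough numbers, here is a back-of-the-envelope sketch. The 1800 figure comes from the comment above; the drug counts are hypothetical placeholders, not the sizes of any actual dataset, and `long_format_rows` is an illustrative helper name.

```python
def long_format_rows(n_drugs, n_descriptors):
    """Rows needed when each (drug, descriptor) pair is stored as its own entry."""
    return n_drugs * n_descriptors


N_MORDRED = 1800  # descriptor count quoted in the comment above

# Hypothetical drug counts, chosen only to show how the table scales.
for n_drugs in (100, 1_000, 10_000):
    print(n_drugs, "drugs ->", long_format_rows(n_drugs, N_MORDRED), "rows")
```

The row count scales linearly with the number of drugs, so every additional drug adds another 1800 entries under these assumptions.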

@sgosline
Member Author

Here is the script to integrate: https://github.com/adpartin/mol-features/blob/master/src/gen_mol_fea.py
