# Post-Translational Modification (PTM) SMILES Notebook

This notebook updates the Unimod-derived modification table by adding SMILES strings for a subset of post-translational modifications (PTMs). For each structure, we also derive the elemental composition, and from it the mass, so that the structure can be validated against the reference data already in the table.

## Setup and Imports

First, we import the necessary libraries and modules:


In [1]:
import os
import pandas as pd
from rdkit import Chem

from alphabase.constants._const import CONST_FILE_FOLDER
from alphabase.constants.atom import ChemicalCompositonFormula
from alphabase.constants.aa import AA_Formula

from alphabase.constants.modification import MOD_DF, add_new_modifications
from alphabase.smiles.smiles import modify_amino_acid
from alphabase.smiles.smiles import n_term_modifications as n_term_modifications_smi
from alphabase.smiles.smiles import c_term_modifications as c_term_modifications_smi
from alphabase.smiles.smiles import ptm_dict as ptm_dict_smi

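Before defining the structures, two short cells show how `ChemicalCompositonFormula` arithmetic behaves: subtracting one formula from another yields signed element counts. We then define dictionaries with SMILES structures for N-terminal modifications, C-terminal modifications, and PTMs: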

In [2]:
(ChemicalCompositonFormula("C(1)H(4)") - ChemicalCompositonFormula("H(2)O(1)")).elements

defaultdict(int, {'H': 2, 'O': -1, 'C': 1})

In [3]:
ChemicalCompositonFormula("C(1)H(2)O(-1)")

ChemicalCompositonFormula('C(1)H(2)O(-1)')
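The subtraction shown in the two cells above can be mimicked in plain Python. The following is a minimal sketch of the idea only, not alphabase's implementation; `parse_formula` and `subtract` are hypothetical helper names introduced for illustration.

```python
import re
from collections import defaultdict


def parse_formula(formula: str) -> dict:
    """Parse a Unimod-style formula such as 'C(1)H(4)' into {element: count}."""
    counts = defaultdict(int)
    # An optional isotope prefix (e.g. '13C') is kept as part of the element key.
    for element, count in re.findall(r"([0-9]*[A-Z][a-z]*)\((-?\d+)\)", formula):
        counts[element] += int(count)
    return dict(counts)


def subtract(a: dict, b: dict) -> dict:
    """Return a - b element-wise, dropping elements whose counts cancel to zero."""
    result = defaultdict(int, a)
    for element, count in b.items():
        result[element] -= count
    return {element: n for element, n in result.items() if n != 0}


# Same operands as the notebook cell: C(1)H(4) minus H(2)O(1).
delta = subtract(parse_formula("C(1)H(4)"), parse_formula("H(2)O(1)"))
print(delta)
```

Negative counts are kept on purpose: a modification that removes atoms (such as a water loss) is expressed with negative element counts.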

In [4]:
n_term_modifications = {'mTRAQ@Any_N-term': 'C(=O)CN1CCN(CC1)C',
 'mTRAQ:13C(3)15N(1)@Any_N-term': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])N(CC1)C',
 'mTRAQ:13C(6)15N(2)@Any_N-term': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])[15N]([13C]([H])([H])[13C]1([H])([H]))[13C]([H])([H])([H])',
 'Acetyl@Any_N-term': 'C(=O)C',
 'Propionyl@Any_N-term': 'C(=O)CC',
 'Biotin@Any_N-term': 'C(=O)CCCCC1SCC2NC(=O)NC21',
 'Carbamidomethyl@Any_N-term': 'C(=O)NC',
 'Carbamyl@Any_N-term': 'C(=O)N',
 'Propionamide@Any_N-term': 'CCC(N)=O',
 'Pyridylacetyl@Any_N-term': 'C(=O)Cc1ccccn1',
 'Methyl@Any_N-term': 'C',
 'Dimethyl@Any_N-term': 'C',
 'Dimethyl:2H(6)13C(2)@Any_N-term': '[13C]([2H])([2H])([2H])',
 'Dimethyl:2H(4)@Any_N-term': 'C([2H])([2H])([1H])',
 'Dimethyl:2H(4)13C(2)@Any_N-term': '[13C]([2H])([2H])([1H])'}


c_term_modifications = {'Methyl@Any_C-term': 'OC',
 'Ethyl@Any_C-term': 'OCC',
 'Propyl@Any_C-term': 'OCCC',
 'Amidated@Any_C-term': 'N',
 'Cation:Na@Any_C-term': 'O[Na]',
 'Cation:K@Any_C-term': 'O[K]',
 'Cation:Cu[I]@Any_C-term': 'O[Cu]',
 'Cation:Li@Any_C-term': 'O[Li]'}


ptm_dict = {'Carbamidomethyl@C': 'C(C(C(=O)[Rn])N([Xe])([Xe]))SCC(=O)N',
 'Oxidation@M': 'O=C([Rn])C(N([Xe])([Xe]))CCS(=O)C',
 'GlyGly@K': 'NCC(=O)NCC(=O)NCCCC[C@H](N([Xe])([Xe]))C([Rn])=O',
 'Deamidated@N': 'C([C@@H](C(=O)[Rn])N([Xe])([Xe]))C(=O)O',
 'Propionyl@K': 'CCC(=O)NCCCCC(C(=O)[Rn])N([Xe])([Xe])',
 'Deamidated@Q': 'C(CC(=O)O)[C@@H](C(=O)[Rn])N([Xe])([Xe])',
 'Gln->pyro-Glu@Q^Any_N-term': 'O=C([Rn])[C@H]1N([Xe])C(=O)CC1',
 'Glu->pyro-Glu@E^Any_N-term': 'O=C([Rn])[C@H]1N([Xe])C(=O)CC1',
 'Phospho@S': 'O=P(O)(O)OC[C@@H](C(=O)[Rn])N([Xe])([Xe])',
 'Nitro@Y': 'O=[N+]([O-])c1cc(ccc1O)C[C@@H](C(=O)[Rn])N([Xe])([Xe])',
 'Acetyl@K': 'CC(=O)NCCCC[C@H](N([Xe])([Xe]))C(=O)[Rn]',
 'Dimethyl@K': 'CN(C)CCCC[C@H](N([Xe])([Xe]))C(=O)[Rn]',
 'mTRAQ@K': '[H]N(CCCC[C@H](N([Xe])([Xe]))C(=O)[Rn])C(=O)CN1CCN(C)CC1',
 'mTRAQ:13C(3)15N(1)@K': '[H]N(CCCC[C@H](N([Xe])([Xe]))C(=O)[Rn])C(=O)[13CH2][15N]1CCN(C)[13CH2][13CH2]1',
 'mTRAQ:13C(6)15N(2)@K': '[H]N(CCCC[C@H](N([Xe])([Xe]))C(=O)[Rn])C(=O)[13CH2][15N]1[13CH2][13CH2][15N]([13CH3])[13CH2][13CH2]1',
 'Pyridylethyl@C': 'C1=CN=CC=C1CCSCC(C(=O)[Rn])N([Xe])([Xe])',
 'Butyryl@K': 'CCCC(=O)NCCCCC(C(=O)[Rn])N([Xe])([Xe])',
 'Phospho@T': 'CC(C(C(=O)[Rn])N([Xe])([Xe]))OP(=O)(O)O',
 'Methylthio@C': 'CSSC[C@H](N([Xe])([Xe]))C([Rn])=O',
 'Carbamidomethyl@M': 'CS(CCC(N([Xe])([Xe]))C([Rn])=O)=CC(N)=O',
 'Succinyl@K': 'C(CCN)CC(C(=O)[Rn])N([Xe])C(=O)CCC(=O)O',
 'Crotonyl@K': 'CC=CC(=O)NCCCCC(C(=O)[Rn])N([Xe])([Xe])',
 'Phospho@Y': 'C1=CC(=CC=C1CC(C(=O)[Rn])N([Xe])([Xe]))OP(=O)(O)O',
 'Malonyl@K': 'N([Xe])([Xe])[C@@H](CCCC(NC(=O)CC(=O)O))C(=O)[Rn]',
 'Met->Hse@M^Any_C-term': 'N([Xe])([Xe])[C@H](C(=O)[Rn])CCO',
 'Pro->(2S,4R)-4-fluoroproline@P': 'F[C@@H]1C[C@H](N([Xe])C1)C(=O)[Rn]',
 'Pro->(2S,4S)-4fluoroproline@P': 'F[C@H]1C[C@H](N([Xe])C1)C(=O)[Rn]',
 'Pro->(2S)-1,3-thiazolidine-2-carboxylic_acid@P': 'S1[C@H](N([Xe])CC1)C(=O)[Rn]',
 'Pro->(4R)-1,3-Thiazolidine-4-carboxylic_acid@P': 'S1CN([Xe])[C@@H](C1)C(=O)[Rn]',
 'Pro->(2S,4R)-4-hydroxyproline@P': 'O[C@@H]1C[C@H](N([Xe])C1)C(=O)[Rn]',
 'Pro->(DL)-pipecolic_acid@P': 'C1CCN([Xe])C(C1)C(=O)[Rn]',
 'Pro->3,4-Dehydro-L-proline@P': 'C1C=CC(N1([Xe]))C(=O)[Rn]',
 'Pro->(1S,3S,5S)-2-Azabicyclo[3.1.0]hexane-3-carboxylic_acid@P': '[C@H]12N([Xe])[C@@H](C[C@@H]2C1)C(=O)[Rn]',
 'Pro->(1R,3S,5R)-2-Azabicyclo[3.1.0]hexane-3-carboxylic_acid@P': '[C@@H]12N([Xe])[C@@H](C[C@H]2C1)C(=O)[Rn]',
 'Pro->(2S,3aS,7aS)-Octahydro-1H-indole-2-carboxylic_acid@P': 'N1([Xe])[C@@H](C[C@@H]2CCCC[C@H]12)C(=O)[Rn]',
 'Pro->(DL)-5-trifluoromethylproline@P': 'FC(C1CCC(N1([Xe]))C(=O)[Rn])(F)F'}


Each dictionary maps a modification name in `Name@site` form to the SMILES string of the corresponding structure.
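The entries in `ptm_dict` carry `[Xe]` and `[Rn]` placeholder atoms. Assuming, as the strings suggest, that `[Xe]` marks the substituent positions on the backbone nitrogen and `[Rn]` marks the C-terminal attachment point, a quick string-level check can catch entries where a placeholder was forgotten. This is a sketch with hypothetical helper names, and the `Broken@X` entry is invented for illustration.

```python
def count_placeholders(smiles: str) -> tuple:
    """Return the number of [Xe] and [Rn] placeholder atoms in a SMILES string."""
    return smiles.count("[Xe]"), smiles.count("[Rn]")


def check_placeholders(ptms: dict) -> list:
    """Return names of entries that lack exactly one [Rn] and one or two [Xe]."""
    suspicious = []
    for name, smiles in ptms.items():
        n_xe, n_rn = count_placeholders(smiles)
        # Assumption: primary amines carry two [Xe], secondary amines (e.g. proline) one.
        if n_rn != 1 or n_xe not in (1, 2):
            suspicious.append(name)
    return suspicious


example = {
    "Carbamidomethyl@C": "C(C(C(=O)[Rn])N([Xe])([Xe]))SCC(=O)N",
    "Pro->(DL)-pipecolic_acid@P": "C1CCN([Xe])C(C1)C(=O)[Rn]",
    "Broken@X": "CC(N)C(=O)O",  # hypothetical entry without placeholders
}
suspicious = check_placeholders(example)
print(suspicious)
```

A check like this only inspects the text of the SMILES string; it does not replace the composition-based validation performed later in the notebook.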

## Updating Modification Dictionaries

We copy these entries into the dictionaries exported by `alphabase.smiles.smiles`, overwriting any existing entries with the same name. This is needed when `modification.tsv` has not been updated yet, because in that case the regular dictionaries in `alphabase.smiles.smiles` (for example `n_term_modifications`) would be empty:

In [5]:
# Copy the new entries into alphabase's dictionaries in place,
# overwriting any entries that share a key.
n_term_modifications_smi.update(n_term_modifications)
c_term_modifications_smi.update(c_term_modifications)
ptm_dict_smi.update(ptm_dict)

## Validation of Amino Acid Formulas

Next, we validate the amino acid entries by comparing the composition derived from each SMILES string with the stored formula:


In [6]:
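for aa in AA_Formula.index:
    aa_row = AA_Formula.loc[aa]
    # Skip amino acids that have no SMILES entry.
    if pd.isna(aa_row["smiles"]):
        continue
    # Composition of the amino acid as encoded by its SMILES string.
    aa_smiles = modify_amino_acid(aa_row["smiles"])
    aa_mol = Chem.MolFromSmiles(aa_smiles)
    chem_composition = ChemicalCompositonFormula.from_rdkit_mol(aa_mol)
    # The stored formula plus one water must account for every atom.
    residual = chem_composition - ChemicalCompositonFormula(aa_row["formula"])
    residual = residual - ChemicalCompositonFormula("H(2)O(1)")
    assert str(residual) == "", f"Composition mismatch for {aa}: {residual}"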



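The assertion passes only if the composition derived from each SMILES string equals the stored formula plus one water, `H(2)O(1)`, which is the difference between a free amino acid and its residue within a peptide chain.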
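The arithmetic behind this check can be worked through by hand. The sketch below uses `collections.Counter` in place of `ChemicalCompositonFormula`; the atom counts were typed in manually for illustration (alanine as `C(3)H(7)N(1)O(2)` with residue formula `C(3)H(5)N(1)O(1)`, and phosphoserine as `C(3)H(8)N(1)O(6)P(1)` with serine residue formula `C(3)H(5)N(1)O(2)`) and are not read from `AA_Formula`. The helper name `composition_delta` is hypothetical.

```python
from collections import Counter

WATER = Counter({"H": 2, "O": 1})


def composition_delta(molecule: Counter, residue: Counter) -> dict:
    """Return molecule - residue - water, keeping only non-zero element counts."""
    residual = Counter(molecule)
    residual.subtract(residue)  # Counter.subtract keeps zero and negative counts
    residual.subtract(WATER)
    return {element: n for element, n in residual.items() if n != 0}


# Unmodified alanine: nothing should be left over.
free_alanine = Counter({"C": 3, "H": 7, "N": 1, "O": 2})
alanine_residue = Counter({"C": 3, "H": 5, "N": 1, "O": 1})
alanine_delta = composition_delta(free_alanine, alanine_residue)

# Phosphoserine against the serine residue: the leftover is the modification itself.
phosphoserine = Counter({"C": 3, "H": 8, "N": 1, "O": 6, "P": 1})
serine_residue = Counter({"C": 3, "H": 5, "N": 1, "O": 2})
phospho_delta = composition_delta(phosphoserine, serine_residue)

print(alanine_delta, phospho_delta)
```

An empty result corresponds to the empty string the notebook asserts on; a non-empty result is the composition contributed by a modification.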

## Processing PTMs

We then process the PTMs, calculating their compositions and creating a dictionary of modifications to add:


In [7]:
ptms_to_add = {}

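for ptm in ptm_dict:
    mol = Chem.MolFromSmiles(modify_amino_acid(ptm_dict[ptm]))
    ptm_formula = ChemicalCompositonFormula.from_rdkit_mol(mol)
    # "Name@X^Any_N-term" -> "X": the modified residue sits between "@" and "^".
    original_aa = ptm.split("@")[1].split("^")[0]
    # Sites starting with "Any" name no residue, so alanine serves as the reference.
    if original_aa.startswith("Any"):
        original_aa = "A"
    original_aa_brutto_formula = AA_Formula.loc[original_aa, "formula"]
    # Modification composition = modified amino acid - residue formula - water.
    composition = (
        ptm_formula
        - ChemicalCompositonFormula(original_aa_brutto_formula)
        - ChemicalCompositonFormula("H(2)O(1)")
    )
    ptms_to_add[ptm] = {"composition": str(composition), "smiles": ptm_dict[ptm]}
ptms_to_add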

{'Carbamidomethyl@C': {'composition': 'C(2)H(3)N(1)O(1)',
  'smiles': 'C(C(C(=O)[Rn])N([Xe])([Xe]))SCC(=O)N'},
 'Oxidation@M': {'composition': 'O(1)',
  'smiles': 'O=C([Rn])C(N([Xe])([Xe]))CCS(=O)C'},
 'GlyGly@K': {'composition': 'C(4)H(6)N(2)O(2)',
  'smiles': 'NCC(=O)NCC(=O)NCCCC[C@H](N([Xe])([Xe]))C([Rn])=O'},
 'Deamidated@N': {'composition': 'H(-1)N(-1)O(1)',
  'smiles': 'C([C@@H](C(=O)[Rn])N([Xe])([Xe]))C(=O)O'},
 'Propionyl@K': {'composition': 'C(3)H(4)O(1)',
  'smiles': 'CCC(=O)NCCCCC(C(=O)[Rn])N([Xe])([Xe])'},
 'Deamidated@Q': {'composition': 'H(-1)N(-1)O(1)',
  'smiles': 'C(CC(=O)O)[C@@H](C(=O)[Rn])N([Xe])([Xe])'},
 'Gln->pyro-Glu@Q^Any_N-term': {'composition': 'H(-3)N(-1)',
  'smiles': 'O=C([Rn])[C@H]1N([Xe])C(=O)CC1'},
 'Glu->pyro-Glu@E^Any_N-term': {'composition': 'H(-2)O(-1)',
  'smiles': 'O=C([Rn])[C@H]1N([Xe])C(=O)CC1'},
 'Phospho@S': {'composition': 'H(1)O(3)P(1)',
  'smiles': 'O=P(O)(O)OC[C@@H](C(=O)[Rn])N([Xe])([Xe])'},
 'Nitro@Y': {'composition': 'H(-1)N(1)O(2)',
  '
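The residue lookup in the loop above depends on how the modification key is split. The following sketch isolates that step; `site_residue` is a hypothetical helper name that mirrors the two `split` calls and the `"Any"` fallback used in the cell.

```python
def site_residue(mod_key: str) -> str:
    """Return the one-letter residue a modification key refers to.

    Mirrors the notebook logic: take the text after '@', drop any '^...'
    terminal qualifier, and fall back to alanine for 'Any_*' sites.
    """
    site = mod_key.split("@")[1].split("^")[0]
    return "A" if site.startswith("Any") else site


residues = {
    key: site_residue(key)
    for key in (
        "Carbamidomethyl@C",
        "Gln->pyro-Glu@Q^Any_N-term",
        "Met->Hse@M^Any_C-term",
        "Acetyl@Any_N-term",
    )
}
print(residues)
```

Note that a key such as `Name@Protein_N-term` would not be caught by the `"Any"` prefix test and would be passed on unchanged, so this parsing is only suitable for the key styles that appear in `ptm_dict`.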

We then add these modifications to the modification dataframe, `MOD_DF`:

In [8]:
add_new_modifications(ptms_to_add)


## Processing N-terminal Modifications

Similar to PTMs, we process N-terminal modifications:


In [9]:
nterms_to_add = {}


for ptm in n_term_modifications:
    original_mod = ptm.split("@")[0]
    # Attach the modification to alanine, which serves as the reference residue.
    mol = Chem.MolFromSmiles(modify_amino_acid(AA_Formula.loc["A", "smiles"], n_term_mod=ptm))
    ptm_formula = ChemicalCompositonFormula.from_rdkit_mol(mol)
    original_aa_brutto_formula = AA_Formula.loc["A", "formula"]
    # The composition does not depend on the suffix, so compute it once.
    composition = str(
        ptm_formula
        - ChemicalCompositonFormula(original_aa_brutto_formula)
        - ChemicalCompositonFormula("H(2)O(1)")
    )
    # Register the same entry for both peptide and protein N-termini.
    for suffix in ["Any_N-term", "Protein_N-term"]:
        nterms_to_add[original_mod + "@" + suffix] = {
            "composition": composition,
            "smiles": n_term_modifications[ptm],
        }
nterms_to_add

{'mTRAQ@Any_N-term': {'composition': 'C(7)H(12)N(2)O(1)',
  'smiles': 'C(=O)CN1CCN(CC1)C'},
 'mTRAQ@Protein_N-term': {'composition': 'C(7)H(12)N(2)O(1)',
  'smiles': 'C(=O)CN1CCN(CC1)C'},
 'mTRAQ:13C(3)15N(1)@Any_N-term': {'composition': '13C(3)15N(1)C(4)H(12)N(1)O(1)',
  'smiles': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])N(CC1)C'},
 'mTRAQ:13C(3)15N(1)@Protein_N-term': {'composition': '13C(3)15N(1)C(4)H(12)N(1)O(1)',
  'smiles': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])N(CC1)C'},
 'mTRAQ:13C(6)15N(2)@Any_N-term': {'composition': '13C(6)15N(2)C(1)H(12)O(1)',
  'smiles': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])[15N]([13C]([H])([H])[13C]1([H])([H]))[13C]([H])([H])([H])'},
 'mTRAQ:13C(6)15N(2)@Protein_N-term': {'composition': '13C(6)15N(2)C(1)H(12)O(1)',
  'smiles': 'C(=O)[13C]([H])([H])[15N]1[13C]([H])([H])[13C]([H])([H])[15N]([13C]([H])([H])[13C]1([H])([H]))[13C]([H])([H])([H])'},
 'Acetyl@Any_N-term': {'composition': 'C(2)H(2)O(1)', 'smi
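As the output above shows, every N-terminal modification is registered twice, once per site suffix. The expansion step on its own can be sketched as follows; `expand_terminal` is a hypothetical helper, and the Acetyl composition and SMILES are copied from the notebook output.

```python
def expand_terminal(entries: dict, suffixes: tuple) -> dict:
    """Duplicate each 'Name@site' entry under every suffix in `suffixes`."""
    expanded = {}
    for key, entry in entries.items():
        mod_name = key.split("@")[0]
        for suffix in suffixes:
            # Copy the entry so later edits to one site do not affect the other.
            expanded[f"{mod_name}@{suffix}"] = dict(entry)
    return expanded


acetyl = {"Acetyl@Any_N-term": {"composition": "C(2)H(2)O(1)", "smiles": "C(=O)C"}}
expanded = expand_terminal(acetyl, ("Any_N-term", "Protein_N-term"))
print(list(expanded))
```

Copying the inner dictionary is a design choice of this sketch; the notebook builds a fresh dictionary literal per suffix, which has the same effect.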

In [10]:
add_new_modifications(nterms_to_add)


## Processing C-terminal Modifications

We also process C-terminal modifications in a similar manner:


In [11]:
cterms_to_add = {}


for ptm in c_term_modifications:
    original_mod = ptm.split("@")[0]
    # Attach the modification to alanine, which serves as the reference residue.
    mol = Chem.MolFromSmiles(modify_amino_acid(AA_Formula.loc["A", "smiles"], c_term_mod=ptm))
    ptm_formula = ChemicalCompositonFormula.from_rdkit_mol(mol)
    original_aa_brutto_formula = AA_Formula.loc["A", "formula"]
    # The composition does not depend on the suffix, so compute it once.
    composition = str(
        ptm_formula
        - ChemicalCompositonFormula(original_aa_brutto_formula)
        - ChemicalCompositonFormula("H(2)O(1)")
    )
    # Register the same entry for both peptide and protein C-termini.
    for suffix in ["Any_C-term", "Protein_C-term"]:
        cterms_to_add[original_mod + "@" + suffix] = {
            "composition": composition,
            "smiles": c_term_modifications[ptm],
        }
cterms_to_add

{'Methyl@Any_C-term': {'composition': 'C(1)H(2)', 'smiles': 'OC'},
 'Methyl@Protein_C-term': {'composition': 'C(1)H(2)', 'smiles': 'OC'},
 'Ethyl@Any_C-term': {'composition': 'C(2)H(4)', 'smiles': 'OCC'},
 'Ethyl@Protein_C-term': {'composition': 'C(2)H(4)', 'smiles': 'OCC'},
 'Propyl@Any_C-term': {'composition': 'C(3)H(6)', 'smiles': 'OCCC'},
 'Propyl@Protein_C-term': {'composition': 'C(3)H(6)', 'smiles': 'OCCC'},
 'Amidated@Any_C-term': {'composition': 'H(1)N(1)O(-1)', 'smiles': 'N'},
 'Amidated@Protein_C-term': {'composition': 'H(1)N(1)O(-1)', 'smiles': 'N'},
 'Cation:Na@Any_C-term': {'composition': 'H(-1)Na(1)', 'smiles': 'O[Na]'},
 'Cation:Na@Protein_C-term': {'composition': 'H(-1)Na(1)', 'smiles': 'O[Na]'},
 'Cation:K@Any_C-term': {'composition': 'H(-1)K(1)', 'smiles': 'O[K]'},
 'Cation:K@Protein_C-term': {'composition': 'H(-1)K(1)', 'smiles': 'O[K]'},
 'Cation:Cu[I]@Any_C-term': {'composition': 'Cu(1)H(-1)', 'smiles': 'O[Cu]'},
 'Cation:Cu[I]@Protein_C-term': {'composition': 'Cu(

In [12]:
add_new_modifications(cterms_to_add)


## Final Steps

Finally, we examine the updated modification database:

In [13]:
MOD_DF

| mod_name (index) | mod_name | unimod_mass | unimod_avge_mass | composition | unimod_modloss | modloss_composition | classification | unimod_id | smiles | modloss_importance | mass | modloss_original | modloss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Acetyl@T | Acetyl@T | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1.0 | | 0.0 | 42.010565 | 0.0 | 0.0 |
| Acetyl@Protein_N-term | Acetyl@Protein_N-term | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1.0 | C(=O)C | 0.0 | 42.010565 | 0.0 | 0.0 |
| Acetyl@S | Acetyl@S | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1.0 | | 0.0 | 42.010565 | 0.0 | 0.0 |
| Acetyl@C | Acetyl@C | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Post-translational | 1.0 | | 0.0 | 42.010565 | 0.0 | 0.0 |
| Acetyl@Any_N-term | Acetyl@Any_N-term | 42.010565 | 42.0367 | H(2)C(2)O(1) | 0.0 | | Multiple | 1.0 | C(=O)C | 0.0 | 42.010565 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Ethyl@Protein_C-term | Ethyl@Protein_C-term | 0.000000 | 0.0000 | C(2)H(4) | 0.0 | | User-added | 0.0 | OCC | 0.0 | 28.031300 | 0.0 | 0.0 |
| Cation:Na@Protein_C-term | Cation:Na@Protein_C-term | 0.000000 | 0.0000 | H(-1)Na(1) | 0.0 | | User-added | 0.0 | O[Na] | 0.0 | 21.981944 | 0.0 | 0.0 |
| Cation:K@Protein_C-term | Cation:K@Protein_C-term | 0.000000 | 0.0000 | H(-1)K(1) | 0.0 | | User-added | 0.0 | O[K] | 0.0 | 37.955881 | 0.0 | 0.0 |
| Cation:Cu[I]@Protein_C-term | Cation:Cu[I]@Protein_C-term | 0.000000 | 0.0000 | Cu(1)H(-1) | 0.0 | | User-added | 0.0 | O[Cu] | 0.0 | 61.921773 | 0.0 | 0.0 |


This displays the final, updated database of modifications, including the newly added SMILES representations and calculated compositions.
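The `mass` column can be cross-checked against the `composition` column by summing monoisotopic element masses. The sketch below is independent of alphabase: the `MONO_MASS` values are standard monoisotopic masses typed in for illustration, `composition_mass` is a hypothetical helper, and isotope-labelled elements such as `13C` are deliberately not handled.

```python
import re

# Monoisotopic masses in Da, entered by hand for this illustration only.
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "Na": 22.9897692809,
}


def composition_mass(composition: str) -> float:
    """Sum monoisotopic masses for a formula such as 'H(2)C(2)O(1)'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)\((-?\d+)\)", composition):
        total += MONO_MASS[element] * int(count)
    return total


ethyl = composition_mass("C(2)H(4)")
acetyl = composition_mass("H(2)C(2)O(1)")
sodium = composition_mass("H(-1)Na(1)")
print(round(ethyl, 6), round(acetyl, 6), round(sodium, 6))
```

Under these assumptions the three results agree with the table values 28.031300, 42.010565, and 21.981944 to within rounding.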
If needed, we can save the updated database to a file:

In [14]:
# orig_df = pd.read_csv(os.path.join(CONST_FILE_FOLDER, "modification.tsv"), sep="\t", index_col=0)
# MOD_DF[["mod_name", *orig_df.columns]].to_csv(os.path.join(CONST_FILE_FOLDER, "modification.tsv"), sep="\t", index=False)