# __Analyzing the ring systems within the ChEMBL small molecule database__

__In this notebook we'll analyze a set of marketed drugs from the ChEMBL database and find the most commonly occurring ring systems. To do this, we'll follow these steps:__

1. Read the drugs as SMILES

2. Convert the SMILES to RDKit molecules

3. Identify the ring systems in the molecules

4. Collect the individual ring systems and count their frequencies

This analysis is similar to the one performed in Taylor, R. D., MacCoss, M., & Lawson, A. D. (2014). Rings in drugs: Miniperspective. Journal of Medicinal Chemistry, 57(14), 5845-5859.

### Requirements for notebook
!pip install useful_rdkit_utils mols2grid

In [2]:
import pandas as pd
from rdkit import Chem
import mols2grid
import useful_rdkit_utils as uru
from tqdm.auto import tqdm
from itertools import chain

# Set progress bar for pandas
tqdm.pandas()

In [4]:
# Read in molecules
df = pd.read_csv("HDAC1_ChEMBL_prepared_data/HDAC1_ChEMBL_IC50.csv")


In [5]:
# Make RDKit molecules from SMILES
df["rdkit_mol"] = df["SMILES"].progress_apply(Chem.MolFromSmiles)

  0%|          | 0/5752 [00:00<?, ?it/s]
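
`Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed, and a `None` entry would likely cause the ring-system step that follows to fail. The sketch below separates parsed from unparsed inputs before going further. The helper name `split_parsed` and the stub parser are hypothetical, used only so the example runs without RDKit; in this notebook the parser would be `Chem.MolFromSmiles`.

```python
def split_parsed(smiles_list, parser):
    """Return (parsed, failed).

    parsed is a list of (smiles, mol) pairs; failed is a list of the
    SMILES strings for which parser returned None.
    """
    parsed, failed = [], []
    for smi in smiles_list:
        mol = parser(smi)
        if mol is None:
            failed.append(smi)
        else:
            parsed.append((smi, mol))
    return parsed, failed


# Stand-in parser for illustration only (an assumption, not RDKit):
# any string containing "?" is treated as unparseable.
def stub_parser(smi):
    return None if "?" in smi else object()


parsed, failed = split_parsed(["c1ccccc1", "not?smiles", "CCO"], stub_parser)
print(len(parsed), failed)
```

With the DataFrame used here, an equivalent filter would be `df = df[df["rdkit_mol"].notna()]`.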

In [6]:
# Identify ring systems in molecules
ring_sys_finder = uru.RingSystemFinder()

df["ring_systems"] = df["rdkit_mol"].progress_apply(ring_sys_finder.find_ring_systems)

  0%|          | 0/5752 [00:00<?, ?it/s]
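
What counts as a ring system depends on the tool. One common definition groups rings that share atoms (fused or spiro rings) into a single system. The sketch below illustrates that grouping on plain sets of atom indices. It illustrates the idea only and does not describe how `RingSystemFinder` is implemented; the function name `merge_rings` and the example atom indices are made up.

```python
def merge_rings(rings):
    """Group rings (iterables of atom indices) that share at least one atom."""
    systems = []
    for ring in rings:
        current = set(ring)
        untouched = []
        for system in systems:
            if current & system:
                # This ring overlaps an existing system: absorb it.
                current |= system
            else:
                untouched.append(system)
        untouched.append(current)
        systems = untouched
    return systems


# Hypothetical atom indices: a six-membered ring (atoms 0-5) fused to a
# five-membered ring through atoms 4 and 5, plus one isolated ring.
rings = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8), (10, 11, 12, 13, 14, 15)]
systems = merge_rings(rings)
print(len(systems))
```

With RDKit, the ring atom tuples for a molecule could come from `mol.GetRingInfo().AtomRings()`.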

In [12]:
# Collect the individual ring systems and count their frequencies.
# The ring_systems column holds one list per molecule, so we flatten it first.
# chain.from_iterable from itertools flattens a list of lists lazily
ring_list = chain.from_iterable(df["ring_systems"].values)

# Create a pandas Series from the iterator returned by chain
ring_series = pd.Series(ring_list)

# Count number of ring systems
print(ring_series.value_counts())


c1ccccc1                                                            7581
c1ccncc1                                                             596
c1cncnc1                                                             458
c1ccc2[nH]ccc2c1                                                     426
c1cscn1                                                              377
                                                                    ... 
C1=Nc2ccccc2C2=NCCN12                                                  1
O=C1NCc2ncc([nH]2)-c2ccc3c(c2)C2CCC3N2C/C=C/CN2CCC3(CC2)C[C@H]13       1
O=c1[nH]c(=O)c2ccccc2o1                                                1
O=C1CCc2ccccc21                                                        1
O=C1CNC(=O)CNC(=O)CNC(=O)CN1                                           1
Name: count, Length: 411, dtype: int64
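
The counts above are occurrences, not molecules: benzene appears 7581 times across 5752 molecules, so a molecule with two phenyl rings contributes twice. The sketch below repeats the flatten-and-count step with the standard library, alongside a per-molecule count that de-duplicates ring systems within each molecule first. The toy lists stand in for the `ring_systems` column and are not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df["ring_systems"]: one list of ring-system SMILES per molecule.
ring_systems = [
    ["c1ccccc1", "c1ccccc1"],   # two benzene rings in one molecule
    ["c1ccccc1", "c1ccncc1"],
    ["c1ccncc1"],
    [],                         # acyclic molecule
]

# Every occurrence counts (same quantity as the value_counts() output above).
occurrence_counts = Counter(chain.from_iterable(ring_systems))

# Each ring system counts at most once per molecule.
molecule_counts = Counter(chain.from_iterable(set(rs) for rs in ring_systems))

for smiles, n in occurrence_counts.most_common():
    print(smiles, n, molecule_counts[smiles])
```

Which of the two counts is appropriate depends on the question: occurrences describe how often a ring system is drawn, while per-molecule counts describe how many compounds contain it.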


In [14]:
# Convert value counts to dataframe
ring_df = pd.DataFrame(ring_series.value_counts()).reset_index()
ring_df.columns = ["SMILES", "Count"]

# Visualize the ring systems
mols2grid.display(ring_df,
                  subset=["SMILES", "img", "Count"],
                  n_cols=4)


MolGridWidget()