## Data Exploration

### Objectives
- Understand the structure of the Tox21 and ChEMBL datasets
- Interpret assay labels (NR vs SR)
- Inspect missing values and class balance
- Validate chemical representations (SMILES)


In [2]:
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Chemistry
from rdkit import Chem

In [3]:
tox21_path = "../data/raw/tox21.csv"
tox21 = pd.read_csv(tox21_path)

tox21.head()


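|   | NR-AR | NR-AR-LBD | NR-AhR | NR-Aromatase | NR-ER | NR-ER-LBD | NR-PPAR-gamma | SR-ARE | SR-ATAD5 | SR-HSE | SR-MMP | SR-p53 | mol_id | smiles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | NaN | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | TOX3021 | `CCOc1ccc2nc(S(N)(=O)=O)sc2c1` |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | TOX3020 | `CCN1C(=O)NC(c2ccccc2)C1=O` |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | TOX3024 | `CC[C@]1(O)CC[C@H]2[C@@H]3CCC4=CCCC[C@@H]4[C@H]...` |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | TOX3027 | `CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C` |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | TOX20800 | `CC(O)(P(=O)(O)O)P(=O)(O)O` |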


In [4]:
tox21.shape

(7831, 14)

In [5]:
tox21.columns

Index(['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
       'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53',
       'mol_id', 'smiles'],
      dtype='object')

In [6]:
tox21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7831 entries, 0 to 7830
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   NR-AR          7265 non-null   float64
 1   NR-AR-LBD      6758 non-null   float64
 2   NR-AhR         6549 non-null   float64
 3   NR-Aromatase   5821 non-null   float64
 4   NR-ER          6193 non-null   float64
 5   NR-ER-LBD      6955 non-null   float64
 6   NR-PPAR-gamma  6450 non-null   float64
 7   SR-ARE         5832 non-null   float64
 8   SR-ATAD5       7072 non-null   float64
 9   SR-HSE         6467 non-null   float64
 10  SR-MMP         5810 non-null   float64
 11  SR-p53         6774 non-null   float64
 12  mol_id         7831 non-null   object 
 13  smiles         7831 non-null   object 
dtypes: float64(12), object(2)
memory usage: 856.6+ KB


### What does one row represent?

Each row represents **one chemical compound**, identified by:
- `mol_id` → unique molecule identifier
- `smiles` → chemical structure
- Assay columns → toxicity outcomes for that molecule

A molecule can:
- Activate some pathways
- Be inactive in others
- Have missing assay results
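
The three cases above can be tallied per molecule. The following is a minimal sketch on a toy frame, not the real Tox21 data: the values and the `summarize_rows` helper are illustrative assumptions. It also checks the assumption that `mol_id` is unique per row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tox21: two assay columns plus the identifier columns.
toy = pd.DataFrame({
    "NR-AR": [0.0, 1.0, np.nan],
    "SR-p53": [1.0, np.nan, np.nan],
    "mol_id": ["A", "B", "C"],
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
})

def summarize_rows(df, id_cols=("mol_id", "smiles")):
    """Count active, inactive, and missing assay labels for each row."""
    assays = df.drop(columns=list(id_cols))
    return pd.DataFrame({
        "n_active": (assays == 1).sum(axis=1),
        "n_inactive": (assays == 0).sum(axis=1),
        "n_missing": assays.isna().sum(axis=1),
    })

row_summary = summarize_rows(toy)
ids_unique = toy["mol_id"].is_unique

print(row_summary)
print("mol_id unique:", ids_unique)
```

Applied to the real frame, the same call would be `summarize_rows(tox21)`, provided no extra non-assay columns have been added to `tox21` at that point.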


In [7]:
assay_columns = tox21.columns.drop(["mol_id", "smiles"])
assay_columns

Index(['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
       'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53'],
      dtype='object')

### Assay Categories

#### Nuclear Receptor (NR)
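- Ligand-activated transcription factors, many of them hormone-regulated
- Disruption tends to produce subtle, long-term effects (e.g. endocrine disruption)
- Examples: estrogen (`NR-ER`) and androgen (`NR-AR`) signaling

#### Stress Response (SR)
- Pathways that cells switch on when they detect damage
- Effects tend to be more direct and acute
- Examples: oxidative stress (`SR-ARE`), DNA damage (`SR-p53`)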

#### Label Meaning
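- `1` → Active (the compound triggered the response measured by the assay)
- `0` → Inactive
- `NaN` → No label available (typically the compound was not tested in that assay, or the result was inconclusive)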
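
Because the category is encoded in the column prefix, the assays can be split into the two groups with a simple string test. A small sketch, using the twelve column names shown in the `tox21.columns` output (the names `nr_assays` and `sr_assays` are introduced here for illustration):

```python
import pandas as pd

# Assay names as they appear in tox21.columns.
assay_columns = pd.Index([
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
])

# The prefix before the first hyphen marks the category: "NR" or "SR".
nr_assays = [c for c in assay_columns if c.startswith("NR-")]
sr_assays = [c for c in assay_columns if c.startswith("SR-")]

print(len(nr_assays), "nuclear receptor assays:", nr_assays)
print(len(sr_assays), "stress response assays:", sr_assays)
```

Keeping the two lists separate makes it easy to compare missingness or class balance between the NR and SR groups later on.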


In [8]:
missing_fraction = tox21.isna().mean().sort_values(ascending=False)
missing_fraction

SR-MMP           0.258077
NR-Aromatase     0.256672
SR-ARE           0.255268
NR-ER            0.209169
NR-PPAR-gamma    0.176350
SR-HSE           0.174180
NR-AhR           0.163708
NR-AR-LBD        0.137020
SR-p53           0.134976
NR-ER-LBD        0.111863
SR-ATAD5         0.096922
NR-AR            0.072277
mol_id           0.000000
smiles           0.000000
dtype: float64

In [9]:
tox21[assay_columns].apply(lambda x: x.value_counts(normalize=True))

| label | NR-AR | NR-AR-LBD | NR-AhR | NR-Aromatase | NR-ER | NR-ER-LBD | NR-PPAR-gamma | SR-ARE | SR-ATAD5 | SR-HSE | SR-MMP | SR-p53 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.957467 | 0.96493 | 0.88273 | 0.948462 | 0.871952 | 0.949676 | 0.971163 | 0.838477 | 0.96267 | 0.942477 | 0.841997 | 0.937555 |
| 1.0 | 0.042533 | 0.03507 | 0.11727 | 0.051538 | 0.128048 | 0.050324 | 0.028837 | 0.161523 | 0.03733 | 0.057523 | 0.158003 | 0.062445 |
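
Actives make up only about 3% to 16% of the labelled compounds in each assay, so every task is imbalanced. One common way to express this is the ratio of inactive to active labels per assay, ignoring `NaN`. The sketch below runs on toy data, and `imbalance_ratio` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy labels: 1 = active, 0 = inactive, NaN = no label.
toy = pd.DataFrame({
    "NR-AR": [0.0, 0.0, 0.0, 1.0, np.nan],
    "SR-MMP": [1.0, 0.0, np.nan, np.nan, 1.0],
})

def imbalance_ratio(labels):
    """Return the inactive/active count ratio per column; NaN labels are ignored."""
    n_active = (labels == 1).sum()
    n_inactive = (labels == 0).sum()
    # A column with no actives would yield inf here.
    return n_inactive / n_active

ratios = imbalance_ratio(toy)
print(ratios)
```

On the real data this would be called as `imbalance_ratio(tox21[assay_columns])`. Ratios like these are sometimes used as per-task positive-class weights during training, though whether that is appropriate depends on the model chosen later.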


In [10]:
def is_valid_smiles(smiles):
    return Chem.MolFromSmiles(smiles) is not None

tox21["valid_smiles"] = tox21["smiles"].apply(is_valid_smiles)
tox21["valid_smiles"].value_counts()


[19:19:24] Explicit valence for atom # 8 Al, 6, is greater than permitted
[19:19:24] Explicit valence for atom # 3 Al, 6, is greater than permitted
[19:19:24] Explicit valence for atom # 4 Al, 6, is greater than permitted
[19:19:24] Explicit valence for atom # 4 Al, 6, is greater than permitted
[19:19:24] Explicit valence for atom # 9 Al, 6, is greater than permitted
[19:19:25] Explicit valence for atom # 5 Al, 6, is greater than permitted
[19:19:25] Explicit valence for atom # 16 Al, 6, is greater than permitted
[19:19:25] Explicit valence for atom # 20 Al, 6, is greater than permitted


valid_smiles
True     7823
False       8
Name: count, dtype: int64
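
All eight failures logged above are aluminium valence errors raised during RDKit sanitization. To see which rows failed and to set them aside, the boolean result can be used as a mask. This is a sketch on a toy frame: the over-valent carbon SMILES is a deliberately broken example, and the names `invalid_rows` and `clean` are introduced here for illustration.

```python
import pandas as pd
from rdkit import Chem, RDLogger

# Silence RDKit's parser messages; failures are reported through the mask instead.
RDLogger.DisableLog("rdApp.*")

# Toy stand-in for tox21; the third SMILES has a five-valent carbon and fails sanitization.
toy = pd.DataFrame({
    "mol_id": ["A", "B", "C"],
    "smiles": ["CCO", "c1ccccc1", "C(C)(C)(C)(C)C"],
})

def is_valid_smiles(smiles):
    """Return True if RDKit can parse and sanitize the SMILES string."""
    return Chem.MolFromSmiles(smiles) is not None

valid = toy["smiles"].apply(is_valid_smiles)

# Rows RDKit could not parse, kept for inspection.
invalid_rows = toy.loc[~valid, ["mol_id", "smiles"]]

# Rows that parsed cleanly, re-indexed from zero.
clean = toy.loc[valid].reset_index(drop=True)

print(invalid_rows)
print(len(clean), "valid rows kept")
```

Whether to drop the failing rows or to try repairing them is a decision for the cleaning stage; the mask only makes them easy to find.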

In [15]:
from rdkit.Chem import SDMolSupplier

chembl_path = "../data/raw/chembl_36.sdf"
supplier = SDMolSupplier(chembl_path)

sample_mols = []

# Collect the first 1,000 records RDKit can parse; unparsable records come back as None.
for mol in supplier:
    if mol is not None:
        sample_mols.append(mol)
    if len(sample_mols) == 1000:
        break

len(sample_mols)


1000

### Role of ChEMBL in This Project

ChEMBL provides:
- Drug-like chemical space
- Real pharmaceutical molecules
- Structures for descriptor learning

We will later:
- Filter ChEMBL
- Use it to train drug-likeness models

## Summary

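✔ Dataset structure understood: 7,831 compounds, 12 assay columns plus `mol_id` and `smiles`  
✔ Biological meaning of labels clarified: NR vs SR assays; 1 = active, 0 = inactive, NaN = no result  
✔ Data quality assessed: 7% to 26% of labels missing per assay, 3% to 16% actives per assay, 8 SMILES that RDKit cannot parse  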
✔ Ready for cleaning & preprocessing  

➡️ Next Stage:  
**Data Cleaning & Label Preparation**
