# Drug Interaction and Side Effect Integration Pipeline

This notebook builds a drug-pair dataset that annotates DrugBank drugs with single-drug side effects from SIDER and drug-pair (polypharmacy) side effects from TWOSIDES.

### Steps:
1. Load DrugBank filtered dataset (DrugBank ID + Name).
2. Map DrugBank drugs to STITCH IDs.
   - If a DrugBank ↔ STITCH mapping file is provided, use it.
   - Otherwise, attempt name-based fuzzy matching.
3. Integrate **SIDER**:
   - For each drug, extract all `MedDRA_term` side effects.
4. Integrate **TWOSIDES**:
   - For each drug pair, if `(STITCH1, STITCH2)` exists, extract polypharmacy side effects.
5. Build a final dataset with:
   - `drug1` (DrugBank Name)
   - `drug2` (DrugBank Name)
   - `possible_interactions` (sources such as `twosides_pair_signals`)
   - `sideeffects` (merged side effect terms)


In [1]:
import pandas as pd
from tqdm import tqdm
from itertools import combinations
import difflib

# Load datasets
drugbank = pd.read_csv('data/processed/drugbank_filtered.csv')
sider = pd.read_csv('data/sider.csv')
twosides = pd.read_csv('data/twosides.csv')

drugbank.head(), sider.head(), twosides.head()

(  DrugBank ID                 Name
 0     DB00001            Lepirudin
 1     DB00002            Cetuximab
 2     DB00003         Dornase alfa
 3     DB00004  Denileukin diftitox
 4     DB00005           Etanercept,
   STITCH_compound_ID_flat  STITCH_compound_ID_stereo  UMLS_concept_ID  \
 0            CID100000085               CID000010917         C0000729   
 1            CID100000085               CID000010917         C0000729   
 2            CID100000085               CID000010917         C0000737   
 3            CID100000085               CID000010917         C0000737   
 4            CID100000085               CID000010917         C0000737   
 
    MedDRA_type  MedDRA_concept_ID            MedDRA_term  
 0          LLT           C0000729       Abdominal cramps  
 1           PT           C0000737         Abdominal pain  
 2          LLT           C0000737         Abdominal pain  
 3           PT           C0687713  Gastrointestinal pain  
 4           PT           C0000737   

## Step 1: Build DrugBank → STITCH mapping

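If you have a DrugBank ↔ STITCH mapping file, load it here. Without one, the fallback in the cell below is only a placeholder: it maps every DrugBank ID to `None`, so no SIDER or TWOSIDES annotations can be attached until a real mapping (curated or name-based) replaces it.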

In [2]:
# Placeholder for a DrugBank ↔ STITCH mapping file.
# Example: mapping = pd.read_csv('drugbank_stitch_mapping.csv')
mapping = None

if mapping is not None:
    drug_to_stitch = dict(zip(mapping['DrugBank ID'], mapping['STITCH_ID']))
else:
    # Fallback placeholder: no name matching is implemented yet, so every
    # DrugBank ID maps to None. Replace this loop with a real mapping method
    # (ideally one backed by an external name dictionary).
    drug_to_stitch = {}
    for drugbank_id in tqdm(drugbank['DrugBank ID'], total=len(drugbank)):
        drug_to_stitch[drugbank_id] = None

len(drug_to_stitch)

100%|██████████| 17430/17430 [00:00<00:00, 61519.47it/s]


17430
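
The name-based fallback described above is not implemented in the cell, so here is a minimal sketch of what it could look like with `difflib.get_close_matches`. It assumes a name table with `chemical` and `name` columns; the helper names (`build_name_index`, `match_name`), the `0.9` cutoff, and all IDs in the toy data are hypothetical and chosen only for illustration.

```python
import difflib

import pandas as pd


def build_name_index(names_df):
    """Map lowercased compound name -> STITCH ID (first occurrence wins)."""
    index = {}
    for stitch_id, name in zip(names_df["chemical"], names_df["name"]):
        if pd.notna(name):
            index.setdefault(str(name).strip().lower(), stitch_id)
    return index


def match_name(drug_name, name_index, cutoff=0.9):
    """Return the STITCH ID of the best name match, or None if nothing is close."""
    key = str(drug_name).strip().lower()
    if key in name_index:  # exact match after normalisation
        return name_index[key]
    close = difflib.get_close_matches(key, list(name_index), n=1, cutoff=cutoff)
    return name_index[close[0]] if close else None


# Toy data with made-up IDs (not real STITCH or DrugBank identifiers)
toy_names = pd.DataFrame({
    "chemical": ["CID_A", "CID_B"],
    "name": ["aspirin", "warfarin"],
})
toy_drugbank = pd.DataFrame({
    "DrugBank ID": ["DB_X1", "DB_X2", "DB_X3"],
    "Name": ["Aspirin", "Warfarine", "Lepirudin"],
})

name_index = build_name_index(toy_names)
toy_drug_to_stitch = {
    db_id: match_name(name, name_index)
    for db_id, name in zip(toy_drugbank["DrugBank ID"], toy_drugbank["Name"])
}
print(toy_drug_to_stitch)
```

Fuzzy matching compares each query against every candidate name, so it is probably worth restricting the candidates to compounds that actually occur in SIDER or TWOSIDES before running it on the full DrugBank list. A high cutoff reduces false matches between similarly spelled drugs, but any fuzzy hit should still be reviewed.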

## Step 2: Integrate SIDER side effects per drug

In [4]:
sider_map = {}

# Ensure consistent column names (strip spaces etc.)
sider.columns = sider.columns.str.strip()

for _, row in sider.iterrows():
    stitch_id = row['STITCH_compound_ID_flat']
    se = row['MedDRA_term']
    if pd.notna(stitch_id) and pd.notna(se):
        sider_map.setdefault(stitch_id, set()).add(se)

print(f"Mapped {len(sider_map)} drugs to side effects")


Mapped 1430 drugs to side effects
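
The row-by-row loop above can also be expressed as a grouped aggregation. The sketch below is an alternative under the assumption that the same two SIDER columns are used; the toy frame and its IDs are made up.

```python
import pandas as pd

# Toy stand-in for the SIDER table (made-up IDs and terms)
toy_sider = pd.DataFrame({
    "STITCH_compound_ID_flat": ["CID_A", "CID_A", "CID_B", None],
    "MedDRA_term": ["Nausea", "Nausea", "Headache", "Rash"],
})

# Drop incomplete rows, then collect the side-effect terms of each drug into a set
toy_sider_map = (
    toy_sider
    .dropna(subset=["STITCH_compound_ID_flat", "MedDRA_term"])
    .groupby("STITCH_compound_ID_flat")["MedDRA_term"]
    .apply(set)
    .to_dict()
)
print(toy_sider_map)
```

The result has the same shape as `sider_map`: one key per STITCH ID, with a set of `MedDRA_term` values.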


## Step 3: Integrate TWOSIDES side effects per drug pair

In [5]:
twosides_map = {}
for _, row in tqdm(twosides.iterrows(), total=len(twosides)):
    d1, d2, se = row['STITCH 1'], row['STITCH 2'], row['Side Effect Name']
    key = tuple(sorted([d1, d2]))
    if pd.notna(se):
        twosides_map.setdefault(key, set()).add(se)

len(twosides_map)

100%|██████████| 4649441/4649441 [01:45<00:00, 44077.51it/s]


63473
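
Iterating over 4.6 million rows with `iterrows()` takes close to two minutes here. A possible alternative is to normalise the pair order column-wise and then group, as sketched below. It assumes the same three TWOSIDES column names used in the cell above; the `lo` and `hi` column names and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for the TWOSIDES table (made-up IDs and terms)
toy_twosides = pd.DataFrame({
    "STITCH 1": ["CID_B", "CID_A", "CID_A", "CID_A"],
    "STITCH 2": ["CID_A", "CID_B", "CID_C", "CID_C"],
    "Side Effect Name": ["Bleeding", "Dizziness", "Rash", None],
})

pairs = toy_twosides.dropna(subset=["STITCH 1", "STITCH 2", "Side Effect Name"])
a, b = pairs["STITCH 1"], pairs["STITCH 2"]
swap = a > b  # rows where the two IDs are not already in sorted order

# lo/hi hold the smaller and larger ID of each pair, mirroring tuple(sorted([d1, d2]))
pairs = pairs.assign(lo=a.where(~swap, b), hi=b.where(~swap, a))

# Grouping on two columns yields (lo, hi) tuple keys after to_dict()
toy_twosides_map = (
    pairs.groupby(["lo", "hi"])["Side Effect Name"].apply(set).to_dict()
)
print(toy_twosides_map)
```

The keys are `(smaller_id, larger_id)` tuples, so lookups with `tuple(sorted([s1, s2]))` work the same way as with `twosides_map`.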

## Step 4: Build final annotated dataset
Merge DrugBank names with SIDER/TWOSIDES signals.

In [8]:
output_rows = []
drug_names = drugbank[['DrugBank ID','Name']]

# Only iterate over a subset for demonstration: all 17,430 drugs would give
# roughly 152 million pairs.
subset = drug_names.head(100)
for (id1, name1), (id2, name2) in combinations(zip(subset['DrugBank ID'], subset['Name']), 2):
    s1, s2 = drug_to_stitch.get(id1), drug_to_stitch.get(id2)
    se_list = set()
    notes = set()
    # Single-drug side effects from SIDER (for either drug of the pair)
    for s in (s1, s2):
        if s in sider_map:
            se_list |= sider_map[s]
            notes.add('sider_individual_SEs')
    # Pair-level side effects from TWOSIDES (keys are sorted STITCH ID pairs)
    if s1 and s2:
        key = tuple(sorted([s1, s2]))
        if key in twosides_map:
            se_list |= twosides_map[key]
            notes.add('twosides_pair_signals')
    output_rows.append({
        'drug1': name1,
        'drug2': name2,
        # sorted() keeps the joined strings deterministic across runs
        'possible_interactions': ';'.join(sorted(notes)),
        'sideeffects': '|'.join(sorted(se_list))
    })

final_df = pd.DataFrame(output_rows)
final_df.to_csv('drug_interactions_annotated.csv', index=False)
final_df.head()

       drug1                drug2 possible_interactions sideeffects
0  Lepirudin            Cetuximab
1  Lepirudin         Dornase alfa
2  Lepirudin  Denileukin diftitox
3  Lepirudin           Etanercept
4  Lepirudin          Bivalirudin
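
Both annotation columns are empty because the placeholder mapping from Step 1 assigns `None` to every drug. Once a real mapping exists, one way to avoid enumerating all drug pairs is to iterate over the pairs that TWOSIDES actually contains. The sketch below illustrates that idea; note that it changes the output to only pairs with a TWOSIDES signal. The function name `annotate_twosides_pairs`, the `stitch_to_name` lookup (an inverted `drug_to_stitch` joined with DrugBank names), and the toy data are all hypothetical.

```python
import pandas as pd


def annotate_twosides_pairs(twosides_map, sider_map, stitch_to_name):
    """Build one row per TWOSIDES pair whose two drugs both have a known name."""
    rows = []
    for (s1, s2), pair_effects in twosides_map.items():
        if s1 not in stitch_to_name or s2 not in stitch_to_name:
            continue  # skip pairs that cannot be mapped back to DrugBank
        effects = set(pair_effects)
        sources = {"twosides_pair_signals"}
        for s in (s1, s2):
            if s in sider_map:
                effects |= sider_map[s]
                sources.add("sider_individual_SEs")
        rows.append({
            "drug1": stitch_to_name[s1],
            "drug2": stitch_to_name[s2],
            "possible_interactions": ";".join(sorted(sources)),
            "sideeffects": "|".join(sorted(effects)),
        })
    columns = ["drug1", "drug2", "possible_interactions", "sideeffects"]
    return pd.DataFrame(rows, columns=columns)


# Toy inputs with made-up IDs and names
toy_twosides_map = {("CID_A", "CID_B"): {"Bleeding"}, ("CID_A", "CID_Z"): {"Rash"}}
toy_sider_map = {"CID_A": {"Nausea"}}
toy_stitch_to_name = {"CID_A": "Drug A", "CID_B": "Drug B"}

toy_pairs_df = annotate_twosides_pairs(toy_twosides_map, toy_sider_map, toy_stitch_to_name)
print(toy_pairs_df)
```

With this approach the number of rows is bounded by the number of TWOSIDES pairs (63,473 in this run) rather than by the number of possible drug combinations.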


In [4]:
import pandas as pd

# Input and output paths (forward slashes also work on Windows)
input_path = "C:/Users/rd773/Desktop/PolyRisk AI Risk Prediction/data/processed/chemicals.v5.0.tsv"
output_path = "C:/Users/rd773/Desktop/PolyRisk AI Risk Prediction/data/processed/chemicals.csv"

# Read the TSV file
df = pd.read_csv(input_path, sep="\t", encoding="utf-8")

# Save it as a CSV file
df.to_csv(output_path, index=False)

print(f"✅ Conversion complete! Saved as: {output_path}")
print(f"Total rows: {len(df)}")
print(f"Columns: {list(df.columns)}")


✅ Conversion complete! Saved as: C:/Users/rd773/Desktop/PolyRisk AI Risk Prediction/data/processed/chemicals.csv
Total rows: 116224359
Columns: ['chemical', 'name', 'molecular_weight', 'SMILES_string']
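
With more than 116 million rows, reading the whole file into one DataFrame may exceed available memory on some machines. A possible alternative is to convert it in chunks, as sketched below. The function name `tsv_to_csv_chunked`, the default chunk size, and the temporary demo files are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def tsv_to_csv_chunked(input_path, output_path, chunksize=1_000_000):
    """Convert a TSV file to CSV chunk by chunk and return the number of rows written."""
    total_rows = 0
    reader = pd.read_csv(input_path, sep="\t", encoding="utf-8", chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # The first chunk creates the file and writes the header; later chunks append
        chunk.to_csv(
            output_path,
            mode="w" if i == 0 else "a",
            header=(i == 0),
            index=False,
        )
        total_rows += len(chunk)
    return total_rows


# Small self-contained demo on a temporary file (made-up rows)
tmp_dir = tempfile.mkdtemp()
demo_in = os.path.join(tmp_dir, "demo.tsv")
demo_out = os.path.join(tmp_dir, "demo.csv")
with open(demo_in, "w", encoding="utf-8") as fh:
    fh.write("chemical\tname\nCID_A\taspirin\nCID_B\twarfarin\nCID_C\tlepirudin\n")

demo_rows = tsv_to_csv_chunked(demo_in, demo_out, chunksize=2)
print(f"Rows written: {demo_rows}")
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the machine.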
