## Data Processing
This Jupyter notebook covers some of the data processing steps followed to obtain usable data for our visualization and model.

First, data was obtained from DrugBank using a script found at: https://gist.github.com/rosherbal/56461421c69a8a7da775336c95fa62e0
The script was slightly modified to obtain the desired data. To run the script:

python extract_db.py

The output lists every drug along with its synonyms, classification, drug interactions, pathways, and targets.

Several other intermediary files were used to extract the desired data from the output above, but they are unnecessary to include here.

### Imports

In [None]:
import pandas as pd
import ast
from rdkit import Chem

### Getting Features for ML model
The scripts below outline how the features were obtained for the ML model.

#### Jaccard Similarity

In [None]:
# Step 1: Extract Drug-Target Data
def extract_drug_targets(input_csv):
    df = pd.read_csv(input_csv)

    # Ensure target columns are lists of strings or empty
    def parse_targets(x):
        if pd.isna(x) or x == "[]":
            return []
        try:
            return list(set(ast.literal_eval(x)))  # Remove duplicates
        except (ValueError, SyntaxError):
            return [x]

    df['target_name'] = df['target_name'].apply(parse_targets)

    # Group targets by drug and remove duplicates
    drug_targets = (
        df.groupby(['dg_name'])['target_name']
        .agg(lambda x: list(set(sum(x, []))))  # Flatten lists and remove duplicates
        .reset_index()
    )
    return drug_targets

# Step 2: Match Drug Interactions with Targets
def get_interaction_targets(interaction_csv, drug_targets):
    interactions = pd.read_csv(interaction_csv)

    # Lookup for drug-target data
    drug_target_map = {
        row['dg_name']: set(row['target_name']) for _, row in drug_targets.iterrows()
    }

    # Match interactions with targets
    interaction_targets = []
    for _, row in interactions.iterrows():
        drug1, drug2 = row['Drug_Name_1'], row['Drug_Name_2']
        drug1_targets = drug_target_map.get(drug1, set())
        drug2_targets = drug_target_map.get(drug2, set())

        interaction_targets.append([drug1, drug2, drug1_targets, drug2_targets])

    columns = ['Drug_Name_1', 'Drug_Name_2', 'Drug1_Targets', 'Drug2_Targets']
    interaction_df = pd.DataFrame(interaction_targets, columns=columns)
    return interaction_df

# Step 3: Calculate Jaccard Similarity
def calculate_jaccard(interaction_df, output_csv):
    jaccard_scores = []
    for _, row in interaction_df.iterrows():
        try:
            targets1 = row['Drug1_Targets']
            targets2 = row['Drug2_Targets']
            if targets1 and targets2:  # Avoid empty sets
                jaccard = len(targets1 & targets2) / len(targets1 | targets2)
            else:
                jaccard = 0.0  # One or both drugs have no known targets
        except Exception:
            jaccard = 0.0  # Score problematic rows as 0.0 rather than dropping them
        jaccard_scores.append(jaccard)

    # Create final output df
    result_df = interaction_df[['Drug_Name_1', 'Drug_Name_2']].copy()
    result_df['Jaccard_Similarity'] = jaccard_scores

    # Save results
    result_df.to_csv(output_csv, index=False)
    return result_df

# Main workflow
input_csv = 'extracted_full.csv'
interaction_csv = 'predict_pairs_dwindle.csv'
output_similarity_csv = 'jaccard_similarity_ddinter.csv'

drug_targets = extract_drug_targets(input_csv)
interaction_targets = get_interaction_targets(interaction_csv, drug_targets)
similarity_results = calculate_jaccard(interaction_targets, output_similarity_csv)

print("Jaccard similarity computation complete. Results saved to", output_similarity_csv)
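
As a small worked example of the target-overlap score computed above, the sketch below parses two stringified target lists and scores them. It uses only the standard library. The target names are made up for illustration, and the helper names `parse_target_list` and `jaccard` are hypothetical stand-ins, not part of the project code.

```python
import ast


def parse_target_list(raw: str) -> set:
    """Parse a stringified list such as "['A', 'B']" into a set of names."""
    try:
        return set(ast.literal_eval(raw))
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the raw text as a single target name
        return {raw}


def jaccard(targets_a: set, targets_b: set) -> float:
    """Shared targets divided by all distinct targets; 0.0 if either set is empty."""
    if not targets_a or not targets_b:
        return 0.0
    return len(targets_a & targets_b) / len(targets_a | targets_b)


# Made-up target lists for illustration only
drug1_targets = parse_target_list("['Prothrombin', 'Factor X', 'Factor X']")
drug2_targets = parse_target_list("['Factor X', 'Antithrombin-III']")

# One shared target out of three distinct targets
score = jaccard(drug1_targets, drug2_targets)
print(round(score, 3))
```

Duplicate names collapse when the list becomes a set, so repeated targets in the raw data do not inflate the score.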

#### Tanimoto Similarity
The Tanimoto similarity itself is calculated in the ML model code. The code below shows how the SMILES strings used for that calculation are extracted from the SDF file.

In [None]:
# Load the SDF file
sdf_file = "structures.sdf"
suppl = Chem.SDMolSupplier(sdf_file)

data = [] # Store data

# Extract information
for mol in suppl:
    if mol is not None:
        database_id = mol.GetProp('DRUGBANK_ID') if mol.HasProp('DRUGBANK_ID') else ""
        smiles = Chem.MolToSmiles(mol)
        data.append([database_id, smiles])

# Create a DataFrame and save to CSV
df = pd.DataFrame(data, columns=['DRUGBANK_ID', 'SMILES'])
df.to_csv("SMILES_per_drugID.csv", index=False)
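
For reference, the Tanimoto coefficient compares two binary fingerprints by dividing the number of bits set in both by the number of bits set in either. The model code is not shown here, so the sketch below is only an illustration of the formula: it assumes fingerprints are stored as plain Python integers used as bitmasks, and the function name `tanimoto` is hypothetical. In practice the fingerprints would be derived from the SMILES strings extracted above.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integer bitmasks."""
    bits_in_either = bin(fp_a | fp_b).count("1")
    if bits_in_either == 0:
        return 0.0  # Both fingerprints are empty; define the similarity as 0
    bits_in_both = bin(fp_a & fp_b).count("1")
    return bits_in_both / bits_in_either


# Toy 6-bit fingerprints, for illustration only
fp_drug1 = 0b101101
fp_drug2 = 0b100111

# 3 bits are set in both, 5 bits are set in at least one
print(tanimoto(fp_drug1, fp_drug2))
```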

#### FAERS Data Extraction
This was done using two scripts: get_features_faers.py and batch_run.py.

batch_run.py splits the work into batches of a specified size (1000 in this case) and, for each batch, launches get_features_faers.py as a subprocess. get_features_faers.py is the script that actually calls the FAERS API and outputs the data.

To run:

python batch_run.py

The output is a file listing every specified drug pair with its number of reported adverse events and the severity level of those events.
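
The batching pattern described above can be sketched as follows. This is not the actual batch_run.py: the `--start` and `--end` arguments are a hypothetical command-line interface for get_features_faers.py, and the function names are assumptions made for illustration. The `runner` parameter exists only so the loop can be exercised without launching real subprocesses.

```python
import subprocess
import sys


def batch_ranges(n_items: int, batch_size: int = 1000):
    """Yield (start, end) index ranges that cover n_items in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


def build_command(start: int, end: int) -> list:
    """Build the subprocess command for one batch.

    The --start/--end flags are hypothetical; the real script's
    arguments are not documented in this notebook.
    """
    return [sys.executable, "get_features_faers.py",
            "--start", str(start), "--end", str(end)]


def run_all(n_items: int, batch_size: int = 1000, runner=subprocess.run):
    """Run one subprocess per batch, stopping if any batch fails."""
    for start, end in batch_ranges(n_items, batch_size):
        runner(build_command(start, end), check=True)
```

Running each batch in its own process keeps one failed API call from discarding the results of batches that already finished.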

#### Combining Extracted/Calculated Data
The code below merges all of the extracted and calculated data into the input for the model. It is one example of the helper scripts used in this project. There are many others, but only the most important one is included here.

In [None]:
# Get FAERS data
df_faers = pd.read_csv("faers_feats.csv", low_memory=False)

# Tanimoto calculations
tanimoto = pd.read_csv("tanimoto_sim_all.csv", low_memory=False)

# Merge these two
faers_tan = pd.merge(df_faers, tanimoto,
                     on=['Drug_Name_1', 'Drug_Name_2'], how='inner')
filt_faers_tan = faers_tan.drop_duplicates()

# Get Jaccard Similarities
jaccard = pd.read_csv("jaccard_similarity.csv", low_memory=False)

# Merge these with all the previous data
merged_df = pd.merge(filt_faers_tan, jaccard,
                     on=['Drug_Name_1', 'Drug_Name_2'], how='inner')
filtered_df = merged_df.drop_duplicates()

# Add the label for testing/training if desired
# The label lets evaluation metrics measure how the model performed
# DO NOT USE WHEN RUNNING FINAL DATA - ONLY FOR TEST AND TRAIN
def add_label(filtered_df):
    df = pd.read_csv("data.csv", low_memory=False)
    
    # Extract drug names and label only
    df_label = df[['Drug_Name_1', 'Drug_Name_2', 'label']]
    
    # Merge with other data
    final_df = pd.merge(filtered_df, df_label,
                        on=['Drug_Name_1', 'Drug_Name_2'], how='inner')
    return final_df

# Save the filtered df - no label
filtered_df.to_csv("data_with_features.csv", index=False)

# Save the df with label
labeled_df = add_label(filtered_df)
labeled_df.to_csv("data_with_features_and_label.csv", index=False)
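
The inner merges above keep a pair only when both files list the two drugs in the same column order. Whether the feature files here share one ordering is not stated, so the following is only a precaution one might take, not part of the project code. The helper name `canonical_pair` is hypothetical; it puts each pair into a fixed alphabetical order so that (A, B) and (B, A) are treated as the same pair.

```python
def canonical_pair(drug_a: str, drug_b: str) -> tuple:
    """Return the two drug names in a fixed, case-insensitive alphabetical order."""
    return tuple(sorted((drug_a, drug_b), key=str.casefold))


# Made-up rows for illustration: the first two are the same pair in swapped order
rows = [("Warfarin", "Aspirin"), ("Aspirin", "Warfarin"), ("Ibuprofen", "Aspirin")]

# Swapped duplicates collapse to one entry after canonical ordering
unique_pairs = sorted({canonical_pair(a, b) for a, b in rows})
print(unique_pairs)
```

Applying the same ordering to Drug_Name_1 and Drug_Name_2 in every file before merging would make the join keys comparable.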