<a href="https://colab.research.google.com/github/JihongOh/PC-netpharm-transcriptomics/blob/main/CT_network/Compound_target_network_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
Compound-Target Network Preprocessing from HERB 2.0 Database
Related targets of Core-4 compounds of Polygonum cuspidatum

This module processes target proteins identified from HERB 2.0 database
for the Core-4 compounds (Resveratrol, Polydatin, Emodin, Physcion)
of Polygonum cuspidatum and generates a structured network file.

Database Information:
    - HERB 2.0: http://herb.ac.cn/v2 (Accessed: 2025-01-21)
    - TCMSP 2.3: Traditional Chinese Medicine Systems Pharmacology Database (Accessed: 2025-01-21)
    - Matching basis: InChIKey from TCMSP matched against HERB 2.0 targets

Workflow:
    1. Upload HERB 2.0 extracted target files (one per compound)
    2. Parse and deduplicate target genes
    3. Clean gene symbols (strip whitespace, drop empty values)
    4. Generate compound-target network CSV file
    5. Generate summary statistics

Author: Jihong Oh
License: MIT
"""

import io
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from google.colab import files


class CompoundTargetNetworkGenerator:
    """
    Generate compound-target association networks from HERB 2.0 database extracts.

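    Attributes:
        COMPOUND_DICT (dict): Maps lowercase filename prefixes to standardized compound names
        TARGET_COL_ALIASES (list): Candidate column names for the target gene symbol
        compound_target_pairs (list): Accumulated {'compound', 'target'} records
        processing_log (list): Per-file status messages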
    """

    # Standard compound naming (Core-4 components of Polygonum cuspidatum)
    COMPOUND_DICT = {
        'physcion': 'Physcion',
        'emodin': 'Emodin',
        'resveratrol': 'Resveratrol',
        'polydatin': 'Polydatin'
    }

    # Possible column name variations for gene targets
    TARGET_COL_ALIASES = [
        'Gene Symbol', 'gene symbol', 'Gene', 'gene',
        'Target Gene', 'target gene', 'Gene Name', 'gene name'
    ]

    def __init__(self):
        """Initialize the network generator."""
        self.compound_target_pairs = []
        self.processing_log = []

    def extract_compound_name(self, filename: str) -> str:
        """
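        Extract the compound name from the text before the first underscore
        in the filename.

        Args:
            filename (str): Input filename (e.g., 'physcion_targets_HERB2.0.xlsx')

        Returns:
            str: Standardized compound name from COMPOUND_DICT, or the
            capitalized prefix if the prefix is not listed there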
        """
        compound_prefix = filename.split('_')[0].lower()
        return self.COMPOUND_DICT.get(compound_prefix, compound_prefix.capitalize())

    def find_target_column(self, df: pd.DataFrame) -> str:
        """
        Identify target gene column from DataFrame.

        Attempts to match column names against known aliases and returns
        the first match found (case-insensitive).

        Args:
            df (pd.DataFrame): Input dataframe

        Returns:
            str or None: Column name if found, None otherwise
        """
        # str() guards against non-string headers (e.g. numeric column labels)
        df_columns_lower = {str(col).lower(): col for col in df.columns}

        for alias in self.TARGET_COL_ALIASES:
            alias_lower = alias.lower()
            if alias_lower in df_columns_lower:
                return df_columns_lower[alias_lower]

        return None

    def preprocess_genes(self, genes: np.ndarray) -> List[str]:
        """
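        Clean gene symbols: strip surrounding whitespace, drop missing or
        empty values, and remove duplicates that differed only by whitespace.

        Args:
            genes (np.ndarray): Array of gene symbols

        Returns:
            list: Cleaned, unique gene symbols in first-seen order
        """
        cleaned = [str(gene).strip() for gene in genes if pd.notna(gene)]
        # dict.fromkeys() deduplicates while preserving order
        return list(dict.fromkeys(g for g in cleaned if g))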

    def process_file(self, filename: str, file_content: bytes) -> Tuple[int, int]:
        """
        Process a single compound target file from HERB 2.0.

        Extracts unique target genes and creates compound-target associations.
        Duplicates are removed at this stage.

        Args:
            filename (str): Name of the input file
            file_content (bytes): File content as bytes

        Returns:
            tuple: (unique_genes_count, compound_target_pairs_added)
        """
        print(f"\n🔍 Processing: {filename}")

        compound = self.extract_compound_name(filename)

        try:
            # Read Excel file
            df = pd.read_excel(io.BytesIO(file_content))
            print(f"   ✓ File loaded: {len(df)} rows")
            print(f"   Available columns: {list(df.columns)}")

        except Exception as e:
            log_msg = f"   ❌ Error reading file: {e}"
            print(log_msg)
            self.processing_log.append(log_msg)
            return 0, 0

        # Find gene symbol column
        gene_col = self.find_target_column(df)

        if gene_col is None:
            log_msg = (f"   ⚠️ Warning: No target gene column found. "
                      f"Available columns: {list(df.columns)}")
            print(log_msg)
            self.processing_log.append(log_msg)
            return 0, 0

        print(f"   Target gene column: '{gene_col}'")

        # Extract and deduplicate genes
        raw_genes = df[gene_col].dropna().unique()
        genes = self.preprocess_genes(raw_genes)
        unique_count = len(genes)

        print(f"   Raw entries: {len(raw_genes)} → Unique: {unique_count}")

        # Create compound-target pairs (one record per unique gene)
        self.compound_target_pairs.extend(
            {'compound': compound, 'target': gene} for gene in genes
        )
        pairs_count = len(genes)

        log_msg = (f"   ✓ {compound}: {unique_count} unique targets, "
                   f"{pairs_count} pairs created")
        print(log_msg)
        self.processing_log.append(log_msg)

        return unique_count, pairs_count

    def remove_duplicates(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, int]:
        """
        Remove duplicate compound-target pairs.

        Args:
            df (pd.DataFrame): Dataframe with 'compound' and 'target' columns

        Returns:
            tuple: (deduplicated_dataframe, duplicates_removed)
        """
        initial_count = len(df)
        df_deduplicated = df.drop_duplicates(subset=['compound', 'target'])
        duplicates_removed = initial_count - len(df_deduplicated)

        return df_deduplicated, duplicates_removed

    def generate_statistics(self, df: pd.DataFrame) -> Dict:
        """
        Generate summary statistics for the network.

        Args:
            df (pd.DataFrame): Processed network dataframe

        Returns:
            dict: Dictionary containing network statistics
        """
        stats = {
            'total_pairs': len(df),
            'unique_compounds': df['compound'].nunique(),
            'unique_targets': df['target'].nunique(),
            'compound_distribution': df['compound'].value_counts().to_dict(),
            'target_frequency': df['target'].value_counts()
        }

        # Identify targets that are linked to every compound in the network
        target_compound_count = df.groupby('target')['compound'].nunique()
        all_compound_targets = target_compound_count[
            target_compound_count == stats['unique_compounds']
        ]
        stats['shared_targets'] = all_compound_targets.index.tolist()
        stats['shared_targets_count'] = len(stats['shared_targets'])

        return stats

    def run(self, output_filename: str = 'compound_target_network.csv') -> pd.DataFrame:
        """
        Execute the complete preprocessing pipeline.

        Steps:
            1. Upload compound target files from HERB 2.0
            2. Parse and extract genes from each file
            3. Remove duplicate pairs
            4. Generate summary statistics
            5. Preview the data
            6. Save to CSV file
            7. Download the CSV file

        Args:
            output_filename (str): Name for output CSV file

        Returns:
            pd.DataFrame or None: Processed network dataframe, or None if no
            files were uploaded or no compound-target data could be extracted
        """
        print("="*70)
        print("Compound-Target Network Preprocessing from HERB 2.0")
        print("="*70)

        # Step 1: Upload files
        print("\n📁 Uploading HERB 2.0 target files...")
        print("   Expected files for Core-4 compounds:")
        print("   - physcion_targets_HERB2.0.xlsx")
        print("   - emodin_targets_HERB2.0.xlsx")
        print("   - resveratrol_targets_HERB2.0.xlsx")
        print("   - polydatin_targets_HERB2.0.xlsx\n")

        uploaded = files.upload()

        if not uploaded:
            print("❌ No files uploaded.")
            return None

        # Step 2: Process each file
        print("\n" + "="*70)
        print("Processing Files")
        print("="*70)

        total_genes = 0
        total_pairs = 0

        for filename, file_content in uploaded.items():
            genes, pairs = self.process_file(filename, file_content)
            total_genes += genes
            total_pairs += pairs

        if not self.compound_target_pairs:
            print("\n❌ Error: No compound-target data extracted.")
            return None

        # Step 3: Create dataframe and remove duplicates
        print("\n" + "="*70)
        print("Deduplication")
        print("="*70)

        df = pd.DataFrame(self.compound_target_pairs)
        df_deduplicated, duplicates = self.remove_duplicates(df)

        if duplicates > 0:
            print(f"⚠️  Found {duplicates} duplicate pairs")
            print(f"✓ Removed duplicates: {len(df_deduplicated)} pairs remain")
        else:
            print("✓ No duplicates found")

        # Step 4: Generate statistics
        print("\n" + "="*70)
        print("Network Statistics")
        print("="*70)

        stats = self.generate_statistics(df_deduplicated)

        print("\n📊 Summary:")
        print(f"   Total compound-target pairs: {stats['total_pairs']}")
        print(f"   Unique compounds: {stats['unique_compounds']}")
        print(f"   Unique targets: {stats['unique_targets']}")

        print("\n📈 Compound Distribution:")
        for compound, count in stats['compound_distribution'].items():
            print(f"   {compound}: {count} targets")

        if stats['shared_targets_count'] > 0:
            print(f"\n🔗 Shared Targets (all {stats['unique_compounds']} compounds):")
            print(f"   Count: {stats['shared_targets_count']}")
            print(f"   Examples: {', '.join(stats['shared_targets'][:10])}")
        else:
            print("\n🔗 No targets shared across all compounds")

        # Step 5: Display preview
        print("\n" + "="*70)
        print("Data Preview (first 15 rows)")
        print("="*70)
        print(df_deduplicated.head(15).to_string(index=False))

        # Step 6: Export to CSV
        df_deduplicated.to_csv(output_filename, index=False)
        print(f"\n✅ Network file saved: {output_filename}")
        print(f"   Rows: {len(df_deduplicated)}")
        print(f"   Columns: {list(df_deduplicated.columns)}")

        # Step 7: Download file
        print(f"\n⬇️  Downloading {output_filename}...")
        files.download(output_filename)

        print("\n" + "="*70)
        print("✨ Preprocessing Complete!")
        print("="*70)

        return df_deduplicated


# Main execution
if __name__ == "__main__":
    generator = CompoundTargetNetworkGenerator()
    network_df = generator.run()

Compound-Target Network Preprocessing from HERB 2.0

📁 Uploading HERB 2.0 target files...
   Expected files for Core-4 compounds:
   - physcion_targets_HERB2.0.xlsx
   - emodin_targets_HERB2.0.xlsx
   - resveratrol_targets_HERB2.0.xlsx
   - polydatin_targets_HERB2.0.xlsx



Saving polydatin_ingredient_reference_target_2026. 1. 21..xlsx to polydatin_ingredient_reference_target_2026. 1. 21..xlsx
Saving emodin_ingredient_reference_target_2026. 1. 21..xlsx to emodin_ingredient_reference_target_2026. 1. 21..xlsx
Saving resveratrol_ingredient_reference_target_2026. 1. 21..xlsx to resveratrol_ingredient_reference_target_2026. 1. 21..xlsx
Saving physcion_ingredient_target_2026. 1. 21..xlsx to physcion_ingredient_target_2026. 1. 21..xlsx

Processing Files

🔍 Processing: polydatin_ingredient_reference_target_2026. 1. 21..xlsx
   ✓ File loaded: 64 rows
   Available columns: ['Target id', 'Gene symbol', 'Protein name', 'Reference ID', 'PubMed ID', 'Reference title', 'Relationship', 'Grade', 'Supporting sentences']
   Target gene column: 'Gene symbol'
   Raw entries: 54 → Unique: 54
   ✓ Polydatin: 54 unique targets, 54 pairs created

🔍 Processing: emodin_ingredient_reference_target_2026. 1. 21..xlsx
   ✓ File loaded: 131 rows
   Available columns: ['Tar


✨ Preprocessing Complete!
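
The files uploaded in the run above follow a `*_ingredient_reference_target_...xlsx` naming pattern rather than the `*_targets_HERB2.0.xlsx` pattern printed in the prompt, yet Polydatin is still recognized because only the text before the first underscore is used. Below is a minimal standalone sketch of that rule and of the case-insensitive column lookup. It assumes the same logic as `extract_compound_name` and `find_target_column`; the module-level functions and the shortened alias list are for illustration only and are not part of the class.

```python
# Illustrative copies of the class constants (alias list shortened here).
COMPOUND_DICT = {
    'physcion': 'Physcion',
    'emodin': 'Emodin',
    'resveratrol': 'Resveratrol',
    'polydatin': 'Polydatin',
}
TARGET_COL_ALIASES = ['Gene Symbol', 'Gene', 'Target Gene', 'Gene Name']


def extract_compound_name(filename):
    """Map the text before the first underscore to a compound name."""
    prefix = filename.split('_')[0].lower()
    # Unknown prefixes fall back to a capitalized version of the prefix.
    return COMPOUND_DICT.get(prefix, prefix.capitalize())


def find_target_column(columns):
    """Return the first column matching an alias, ignoring case."""
    lowered = {str(col).lower(): col for col in columns}
    for alias in TARGET_COL_ALIASES:
        if alias.lower() in lowered:
            return lowered[alias.lower()]
    return None


uploaded_name = 'polydatin_ingredient_reference_target_2026. 1. 21..xlsx'
herb_columns = ['Target id', 'Gene symbol', 'Protein name']

print(extract_compound_name(uploaded_name))  # Polydatin
print(find_target_column(herb_columns))      # Gene symbol
```

One consequence of this design is that a file whose name does not begin with the compound name followed by an underscore is labeled with whatever its prefix happens to be, so it may be worth checking the compound names in the summary before using the network file.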

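
`generate_statistics` reports a target as shared only when it is linked to every compound in the network. The following is a minimal pure-Python sketch of the same criterion, assuming pairs are given as `(compound, target)` tuples. The `shared_targets` helper and the placeholder gene names are hypothetical and are not HERB 2.0 results.

```python
from collections import defaultdict

# Hypothetical toy pairs for illustration only, NOT data from HERB 2.0.
pairs = [
    ('Emodin', 'GENE_A'), ('Emodin', 'GENE_B'),
    ('Physcion', 'GENE_A'),
    ('Resveratrol', 'GENE_A'), ('Resveratrol', 'GENE_C'),
    ('Polydatin', 'GENE_A'), ('Polydatin', 'GENE_C'),
]


def shared_targets(pairs):
    """Return targets linked to every compound present in `pairs`."""
    compounds_by_target = defaultdict(set)
    for compound, target in pairs:
        compounds_by_target[target].add(compound)

    n_compounds = len({compound for compound, _ in pairs})
    return sorted(
        target
        for target, compounds in compounds_by_target.items()
        if len(compounds) == n_compounds
    )


print(shared_targets(pairs))  # ['GENE_A']
```

Because the threshold is the number of compounds actually present, a run in which one file fails to load would report targets shared by the remaining compounds only.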