# Generate input files for Hierarchical HotNet
==============================================

This notebook processes the edge file obtained from WGCNA and the differential expression analysis results to produce the input files for Hierarchical HotNet (edge list file, index-to-gene file, gene-to-score files).

Input:
------
- edges_module.txt: File containing protein-protein interactions with the following columns:
  * fromNode: Identifier for the source protein.
  * toNode: Identifier for the target protein.
  * weight: Strength of the interaction between proteins.
- differential.csv: Differential expression results including:
  * Protein: Protein identifier
  * Group comparisons
  * P-values
  * Q-values (FDR corrected)
  * Log2 fold changes
  * Beta coefficients

Output:
-------
- network_1_index_gene.tsv: A file mapping protein identifiers to unique indices for network analysis.
- network_1_edge_list.tsv: A list of the protein interactions, with proteins represented by their indices and weighted interactions. Only edges with weight >= 0.1 are kept.
- scores_1.tsv: A file containing proteins and their corresponding -log10 transformed Q-values (related to the CJD vs CTRL differential expression analysis).
- scores_2.tsv: A file containing proteins and their corresponding absolute Log2 fold changes (related to the CJD vs CTRL differential expression analysis).
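
As a rough illustration of the two score transformations described above, the sketch below applies them to a few made-up values. The protein names and numbers are hypothetical, and the column names `Q_Value` and `Log2_Fold_Change` are assumed to match those in differential.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical differential expression results for three proteins
toy = pd.DataFrame({
    'Protein': ['P1', 'P2', 'P3'],
    'Q_Value': [0.001, 0.05, 0.5],
    'Log2_Fold_Change': [-2.0, 0.5, 0.0],
})

# scores_1: -log10(Q-value), so smaller Q-values give larger scores
toy['score_1'] = -np.log10(toy['Q_Value'])

# scores_2: absolute log2 fold change, ignoring the direction of change
toy['score_2'] = toy['Log2_Fold_Change'].abs()

print(toy[['Protein', 'score_1', 'score_2']])
```

Note that a Q-value of exactly 0 would produce an infinite score under this transform, so such values may need to be clipped beforehand.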

Analysis Steps:
---------------
1. Get edges and index files:
   - Loads the edge file, removes interactions with weight below the threshold (0.1), and builds the list of unique proteins.
   - Maps each protein to a unique index and stores the mapping in a new file (network_1_index_gene.tsv).
   - Maps the protein interaction edges to the corresponding indices based on the gene names.
   - Saves the edge list, including interaction weights, to the network_1_edge_list.tsv file.

2. Get scores files:
   - Loads the differential.csv file, keeping only the rows for the "CJD vs CTRL" comparison.
   - Extracts the Q-value and Log2 fold change for each protein.
   - Transforms Q-values to -log10(Q-value).
   - Maps the differential expression data to the proteins in the network.
   - Saves the score files (scores_1.tsv and scores_2.tsv).
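
The first step above (filtering, indexing, and remapping edges) can be sketched on a toy edge table. The protein names and weights are invented for illustration, and sorting the protein names is an addition here so that the index assignment is the same on every run.

```python
import pandas as pd

# Hypothetical WGCNA-style edge table
edges = pd.DataFrame({
    'fromNode': ['A', 'A', 'B', 'C'],
    'toNode':   ['B', 'C', 'C', 'D'],
    'weight':   [0.50, 0.05, 0.20, 0.10],
})

# Keep edges at or above the threshold; copy so later assignments are safe
kept = edges[edges['weight'] >= 0.1].copy()

# Assign 1-based indices to the sorted unique proteins
proteins = sorted(set(kept['fromNode']) | set(kept['toNode']))
gene_to_index = {gene: i for i, gene in enumerate(proteins, start=1)}

# Replace protein names with their indices
kept['fromNode'] = kept['fromNode'].map(gene_to_index)
kept['toNode'] = kept['toNode'].map(gene_to_index)

print(gene_to_index)
print(kept)
```

In this toy case the A-C edge (weight 0.05) is dropped, while the C-D edge (weight exactly 0.1) is kept.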

In [13]:
import pandas as pd
import numpy as np
import os

In [None]:
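def process_network_data(edges_file, differential_file, output_directory):
    """Write the index, edge list and score files from a WGCNA edge file and differential.csv.

    All paths are interpreted relative to the current working directory.
    """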
    # Get the absolute path of the current working directory
    current_dir = os.getcwd()

    # Step 1: get Edges and Nodes files

    # Load input files
    edges_file_path = os.path.join(current_dir, edges_file)
    edges_df = pd.read_csv(edges_file_path, delimiter='\t')

    # Filter edges based on weight threshold (copy so later column assignments do not act on a view)
    edges_df = edges_df[edges_df['weight'] >= 0.1].copy()

    # Extract unique proteins from filtered edges (sorted so that indices are reproducible across runs)
    proteins = sorted(set(edges_df['fromNode']).union(edges_df['toNode']))

    # Create DataFrame with indices and gene names
    network_df = pd.DataFrame({'Index': range(1, len(proteins) + 1), 'Gene': proteins})

    # Save the network index file (with relative paths)
    network_output_file_path = os.path.join(current_dir, output_directory, 'network_1_index_gene.tsv')
    network_df.to_csv(network_output_file_path, sep='\t', header=False, index=False)

    # Create gene-to-index mapping
    gene_to_index = dict(zip(network_df['Gene'], network_df['Index']))

    # Replace fromNode and toNode with their respective indices
    edges_df['fromNode'] = edges_df['fromNode'].map(gene_to_index)
    edges_df['toNode'] = edges_df['toNode'].map(gene_to_index)

    # Filter out NaN values (nodes that are not in network_df)
    edges_df = edges_df.dropna(subset=['fromNode', 'toNode'])

    # Ensure node indices are integers
    edges_df[['fromNode', 'toNode']] = edges_df[['fromNode', 'toNode']].astype(int)

    # Create final edge list with weight
    edge_list_output_file_path = os.path.join(current_dir, output_directory, 'network_1_edge_list.tsv')
    edges_df[['fromNode', 'toNode', 'weight']].to_csv(edge_list_output_file_path, sep='\t', header=False, index=False)

    # Step 2: get scores files
    
    # Load the differential.csv file
    differential_file_path = os.path.join(current_dir, differential_file)
    differential_df = pd.read_csv(differential_file_path)

    # Filter for 'CJD vs CTRL'
    filtered_df = differential_df[differential_df['Group1_vs_Group2'] == 'CJD vs CTRL']

    # Create mapping dictionaries
    protein_to_qvalue = dict(zip(filtered_df['Protein'], filtered_df['Q_Value']))
    protein_to_log2fc = dict(zip(filtered_df['Protein'], filtered_df['Log2_Fold_Change']))

    # Map Q_Value
    network_df['Q_Value'] = network_df['Gene'].map(protein_to_qvalue)
    network_df = network_df.dropna(subset=['Q_Value']).copy()  # Remove proteins without a Q-value

    # Convert Q_Value to -log10(Q_Value)
    network_df['Q_Value'] = -np.log10(network_df['Q_Value'])

    # Save scores_1.tsv
    scores_1_output_file_path = os.path.join(current_dir, output_directory, 'scores_1.tsv')
    network_df[['Gene', 'Q_Value']].to_csv(scores_1_output_file_path, sep='\t', header=False, index=False)

    # Map absolute Log2_Fold_Change (network_df already contains only proteins that have a Q-value)
    network_df['Log2_Fold_Change'] = network_df['Gene'].map(protein_to_log2fc).abs()
    network_df = network_df.dropna(subset=['Log2_Fold_Change'])  

    # Save scores_2.tsv
    scores_2_output_file_path = os.path.join(current_dir, output_directory, 'scores_2.tsv')
    network_df[['Gene', 'Log2_Fold_Change']].to_csv(scores_2_output_file_path, sep='\t', header=False, index=False)

    print("Process completed successfully. Files saved in output path")



In [22]:
process_network_data(
    edges_file='../../data/results/WGCNA/Edges-blue.txt', # change module name to analyse different modules
    differential_file='../../data/results/differential/differential.csv',
    output_directory='.'
)


Process completed successfully. Files saved in output path
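
After the files are written, a quick consistency check can catch index mismatches early. The helper below is a hypothetical sketch, not part of the original notebook: `check_hotnet_inputs` is an assumed name, and it only relies on the headerless, tab-separated layout of the two network files produced above.

```python
import pandas as pd

def check_hotnet_inputs(index_path, edge_path):
    """Return True if indices are unique and every edge endpoint is in the index file."""
    # Both files are written without a header row, so column names are supplied here
    index_df = pd.read_csv(index_path, sep='\t', header=None, names=['Index', 'Gene'])
    edge_df = pd.read_csv(edge_path, sep='\t', header=None,
                          names=['fromNode', 'toNode', 'weight'])

    known = set(index_df['Index'])
    used = set(edge_df['fromNode']) | set(edge_df['toNode'])

    # Every node index used by an edge must be listed in the index file
    return bool(used <= known and index_df['Index'].is_unique)
```

With the output directory used above, it could be called as `check_hotnet_inputs('network_1_index_gene.tsv', 'network_1_edge_list.tsv')`.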
