# WGCNA and hierarchical hotnet results
This notebook plots WGCNA and hierarchical hotnet results using two functions.

## 1. **generate_subnetwork_plot** Function

### **Input:**
- **subnetwork.xlsx:** Excel file containing the subnetwork data, which includes information about protein modules.
- **differential.csv:** CSV file containing differential expression data, including log2 fold changes and p-values for proteins.
- **Edge files:** One tab-separated file per module (`Edges-<module>.txt`) storing the protein interactions within that module.

### **Output:**
- **subnetwork_plot.png:** A plot with subnetwork graphs for each module (blue, brown, green, red, turquoise, and yellow). Each subnetwork is displayed with labeled proteins, edges, and color-coded nodes based on log2 fold change and q-value intensity. 

### **Analysis Steps:**
1. **Modules Definition:** The list of modules to be analyzed is predefined as `['blue', 'brown', 'green', 'red', 'turquoise', 'yellow']`.
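2. **Data Loading:** The subnetwork table is read with `pd.read_excel` and the differential expression table with `pd.read_csv`; only rows where `Group1_vs_Group2` is `'CJD vs CTRL'` are used.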
3. **Creating Subnetwork Plots:**
   - A 3x2 grid of subplots is created using `matplotlib` to hold the six module plots.
   - For each module, corresponding edge files are loaded, and data is filtered based on the proteins present in the subnetwork.
   - A graph is created using NetworkX (`nx.Graph`), where edges represent interactions and are weighted by the `weight` column. Each protein's differential expression (log2 fold change and q-value, from the `Log2_Fold_Change` and `Q_Value` columns) is looked up when the node is drawn.
4. **Graph Visualization:**
   - The graph is drawn with `networkx.draw_networkx_edges` for the edges and `networkx.draw_networkx_labels` for the node labels.
   - Each node label sits in a box whose fill color reflects the log2 fold change (blue to red) and whose border color and width reflect the q-value (darker and thicker for smaller q-values).
5. **Legends and Colorbars:**
   - Two colorbars are added to the figure: one for log2 fold change and one for q-value intensity.
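
The filtering and border-width steps above can be sketched without any plotting library. This is a minimal illustration under stated assumptions, not the notebook's implementation: `filter_edges` and `border_width` are hypothetical helper names, the toy proteins and weights are invented, and `border_width` mirrors what `np.interp(q, (q_min, q_max), (3, 0.5))` does in the plotting function (the guard for `q_min == q_max` is an added assumption).

```python
def filter_edges(edges, proteins):
    """Keep only edges whose two endpoints both belong to the subnetwork."""
    keep = set(proteins)
    return [(u, v, w) for u, v, w in edges if u in keep and v in keep]


def border_width(q, q_min, q_max, widest=3.0, thinnest=0.5):
    """Linearly map a q-value to a line width: small q gives a thick border."""
    if q_max == q_min:  # assumption: fall back to the widest border
        return widest
    # Clamp to [0, 1] so values outside the observed range stay bounded
    t = min(max((q - q_min) / (q_max - q_min), 0.0), 1.0)
    return widest + t * (thinnest - widest)


# Hypothetical toy data: (fromNode, toNode, weight)
edges = [("P1", "P2", 0.8), ("P2", "P3", 0.4), ("P3", "P4", 0.6)]
subnetwork_proteins = ["P1", "P2", "P3"]

kept = filter_edges(edges, subnetwork_proteins)
print(kept)                         # the P3-P4 edge is dropped
print(border_width(0.5, 0.0, 1.0))  # midpoint between 3.0 and 0.5
```

An edge is kept only when both endpoints are in the subnetwork, which is the same condition the notebook applies with `isin` on `fromNode` and `toNode`.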

## 2. **generate_heatmaps** Function

### **Input:**
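- **Differential_Module_Expression.xlsx:** Excel file containing, for each protein module, the fold changes (FC), p-values (P), module size, and subnetwork size. The notebook passes `modules.xlsx`, the copy of this file extended with the `Subnetwork_Size` and `Subnetwork_pvalue` columns.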

### **Output:**
- **differential_module_expression_heatmap.png:** A heatmap plot displaying:
  - Average log fold change of the proteins in each module, for each sCJD subtype.
  - Module size and subnetwork size.
  - Bold values mark statistically significant results.

### **Analysis Steps:**
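1. **Data Loading:** The Excel file is read with `pd.read_excel`, the `grey` module is dropped, and the FC columns are rounded to two decimals.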
2. **Creating Heatmaps:**
   - A 2x1 GridSpec layout is created for the figure to display two heatmaps:
     - The first heatmap visualizes the fold changes for each module in each subtype (MMV1, MV2K, VV2).
     - The second heatmap visualizes the sizes of modules and subnetworks.
   - In the first heatmap, bold font is applied to the FC values where the corresponding p-value is less than 0.05.
   - In the second heatmap, bold font is applied to the `Subnetwork_Size` values where `Subnetwork_pvalue` is less than 0.05.
3. **Legends and Colorbars:**
   - Two colorbars are added to the figure: one for log2 fold change and one for module and subnetwork size.
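
The bolding step relies on matching each annotation to the right p-value. A minimal sketch of that index mapping follows, assuming the annotations are produced in row-major order over the transposed table (rows are subtypes, columns are modules), which is the order the plotting code iterates in with `np.ndindex`. The function name `significant_cells` and all numbers are hypothetical.

```python
from itertools import product

modules = ["blue", "brown"]            # hypothetical subset of modules
subtypes = ["MMV1", "MV2K", "VV2"]

# Hypothetical p-values, indexed as p_values[module][subtype]
p_values = {
    "blue": {"MMV1": 0.01, "MV2K": 0.20, "VV2": 0.04},
    "brown": {"MMV1": 0.30, "MV2K": 0.03, "VV2": 0.50},
}


def significant_cells(p_values, modules, subtypes, alpha=0.05):
    """Return (flat_index, subtype, module) for each cell that should be bold.

    The heatmap is drawn transposed, so `row` walks the subtypes and `col`
    walks the modules, while the p-value table is indexed module first.
    """
    bold = []
    cells = product(range(len(subtypes)), range(len(modules)))
    for flat_index, (row, col) in enumerate(cells):
        if p_values[modules[col]][subtypes[row]] < alpha:
            bold.append((flat_index, subtypes[row], modules[col]))
    return bold


print(significant_cells(p_values, modules, subtypes))
```

Swapping `row` and `col` when reading the p-value is the step that is easy to get wrong: the heatmap is transposed but the p-value table is not.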

In [1]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as mcolors
from matplotlib.ticker import FuncFormatter
from datetime import datetime
import seaborn as sns
from matplotlib.gridspec import GridSpec
import os
import re

In [2]:
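# Build paths with os.path.join so they stay portable across operating systems
base_path = os.path.join(os.path.dirname(os.getcwd()), '10_hierarchical_hotnet', 'hierarchical-hotnet')
wgcna_path = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), 'data', 'results', 'WGCNA', 'Differential_Module_Expression.xlsx')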

### WGCNA Summary file
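
The cell in this section reads two header values from each module's `clusters_network_1_scores_1.tsv` file. A minimal sketch of that extraction is shown here on a hypothetical header: only the two line prefixes come from the notebook's regular expressions, and the sample values and the helper name `parse_cluster_header` are invented for illustration. The p-value pattern in the sketch also accepts scientific notation, which is an assumption about how small p-values might be written.

```python
import re

SIZE_RE = re.compile(
    r"# Observed size of largest cluster at observed cut height: (\d+)"
)
# Accept plain decimals and scientific notation such as 1e-03 (assumption)
PVALUE_RE = re.compile(r"# p-value: ([0-9.eE+-]+)")


def parse_cluster_header(text):
    """Return (size, p_value); each is None when its line is missing."""
    size = SIZE_RE.search(text)
    pvalue = PVALUE_RE.search(text)
    return (
        int(size.group(1)) if size else None,
        float(pvalue.group(1)) if pvalue else None,
    )


# Hypothetical header text
sample = (
    "# Observed size of largest cluster at observed cut height: 42\n"
    "# p-value: 1e-03\n"
)
print(parse_cluster_header(sample))
print(parse_cluster_header("no header here"))
```

Returning `None` for a missing line keeps the caller free to leave the corresponding spreadsheet cell empty, as the notebook does.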

In [3]:
# Load the WGCNA results file
df = pd.read_excel(wgcna_path)

# Create new columns if they don't exist
if "Subnetwork_Size" not in df.columns:
    df["Subnetwork_Size"] = None
if "Subnetwork_pvalue" not in df.columns:
    df["Subnetwork_pvalue"] = None

# Get module folders that start with a number followed by '_'
module_folders = [d for d in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, d)) and re.match(r"\d+_", d)]

# Create a mapping of color names to folder names
folder_map = {}
for folder in module_folders:
    color_match = re.search(r"\d+_(.+)", folder)
    if color_match:
        color = color_match.group(1).lower()
        folder_map[color] = folder

# Determine the module column name
possible_module_columns = ["module_name", "Module_Name", "module", "Module", "color", "Color", "moduleColor", "module_color"]
module_column = next((col for col in possible_module_columns if col in df.columns), None)

if not module_column:
    # Try to find a column containing color names
    for col in df.columns:
        color_values = df[col].astype(str).str.lower()
        if any(color in ' '.join(color_values.tolist()) for color in folder_map.keys()):
            module_column = col
            break

if not module_column:
    raise ValueError(f"Could not find module column. Available columns: {df.columns.tolist()}")

# Process each row in the dataframe
for index, row in df.iterrows():
    module_value = str(row.get(module_column, "")).lower()
    
    # Skip empty module values
    if not module_value or module_value == 'nan':
        continue
    
    # Find matching folder
    matching_folder = None
    
    # Direct match
    if module_value in folder_map:
        matching_folder = folder_map[module_value]
    else:
        # Partial match: one name must contain the other.
        # (Comparing single characters would match almost any pair of names.)
        for color, folder in folder_map.items():
            if color in module_value or module_value in color:
                matching_folder = folder
                break
    
    if matching_folder:
        # Path to cluster file
        cluster_file_path = os.path.join(base_path, matching_folder, "results", "clusters_network_1_scores_1.tsv")
        
        if os.path.exists(cluster_file_path):
            # Read the file and extract values
            with open(cluster_file_path, 'r') as file:
                content = file.read()
                
                # Extract size and p-value
                size_match = re.search(r"# Observed size of largest cluster at observed cut height: (\d+)", content)
                # Also accept scientific notation (e.g. 1e-05) for very small p-values
                pvalue_match = re.search(r"# p-value: ([0-9.eE+-]+)", content)
                
                if size_match:
                    df.at[index, "Subnetwork_Size"] = int(size_match.group(1))
                if pvalue_match:
                    df.at[index, "Subnetwork_pvalue"] = float(pvalue_match.group(1))

# Save the results
df.to_excel('modules.xlsx', index=False)

### HHT Summary file
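
The cell in this section takes the first cluster listed after `# Clusters:` in each module's file and writes the modules side by side as padded columns. A minimal sketch of both steps follows, using only the standard library. The sample text, the protein names, and the helpers `first_cluster` and `pad_columns` are hypothetical, and the line-by-line scan is a simplification of the notebook's regular expression.

```python
from itertools import zip_longest


def first_cluster(text):
    """Return the tab-separated proteins on the first line after '# Clusters:'."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "# Clusters:":
            for following in lines[i + 1:]:
                if following.strip():
                    return following.strip().split("\t")
            break
    return []  # no clusters section, or nothing listed under it


def pad_columns(columns):
    """Turn {name: list} into (names, rows), padding short lists with ''."""
    names = list(columns)
    rows = zip_longest(*(columns[name] for name in names), fillvalue="")
    return names, [list(row) for row in rows]


# Hypothetical file content: one cluster per line, proteins separated by tabs
sample = "# Clusters:\nA\tB\tC\nD\tE\n"
print(first_cluster(sample))

names, rows = pad_columns({"blue": ["A", "B", "C"], "brown": ["D"]})
print(names)
print(rows)
```

Padding with empty strings gives every column the same length, which is what `pd.DataFrame` needs when it is built from a dictionary of lists.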

In [None]:
# Module colors from the directory structure
modules = ['blue', 'brown', 'green', 'grey', 'red', 'turquoise', 'yellow']

# Dictionary to store data from each module
module_data = {}

# Read data from each module
for module in modules:
    # Construct the path to the clusters file
    module_dir = os.path.join(base_path, f'0{modules.index(module)+1}_{module}', 'results', 'clusters_network_1_scores_1.tsv')
    
    try:
        # Read the TSV file
        with open(module_dir, 'r') as f:
            content = f.read()
            
            # Find the section after "# Clusters:" which contains the proteins
            # \Z also ends the match at end of file, so a file with a single cluster line still matches
            clusters_match = re.search(r"# Clusters:\n(.*?)(?:\n\n|\n[A-Z]|\Z)", content, re.DOTALL)
            
            if clusters_match:
                # Get only the first line after "# Clusters:" (the first cluster)
                first_cluster_line = clusters_match.group(1).strip().split('\n')[0]
                
                # Split by tabs to get individual proteins in the first cluster
                first_cluster_proteins = first_cluster_line.strip().split('\t')
                module_data[module] = first_cluster_proteins
            else:
                print(f"Warning: No clusters section found in file for {module} module")
                module_data[module] = []
    except FileNotFoundError:
        print(f"Warning: Could not find clusters_network_1_scores_1.tsv for {module} module")
        module_data[module] = []

# Find the maximum length of protein lists
max_length = max(len(proteins) for proteins in module_data.values())

# Create a DataFrame with aligned columns
df_dict = {}
for module in modules:
    # Pad shorter lists with empty strings to match the longest list
    padded_proteins = module_data[module] + [''] * (max_length - len(module_data[module]))
    df_dict[f'{module}'] = padded_proteins

# Create DataFrame
df = pd.DataFrame(df_dict)

# Save to Excel in the current directory
output_file = 'subnetwork.xlsx'
df.to_excel(output_file, index=False)
print(f"Excel file created successfully at: {output_file}")

# Print summary statistics
print("\nSummary of proteins per module:")
for module in modules:
    protein_count = len([p for p in module_data[module] if p])  # Count non-empty proteins
    print(f"{module.capitalize()}: {protein_count} proteins")

### Main plots

In [4]:
def generate_subnetwork_plot(subnetwork_file_path, differential_file_path, edges_folder, figure_folder):
    # Define modules
    modules = ['blue', 'brown', 'green', 'red', 'turquoise', 'yellow']
    
    # Load data
    subnetwork_df = pd.read_excel(subnetwork_file_path)
    differential_df = pd.read_csv(differential_file_path)
    
    # Create figure and axes grid
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(40, 50))
    axes = axes.flatten()

    for i, module in enumerate(modules):
        # Load edges file for the module
        edges_file_path = os.path.join(edges_folder, f'Edges-{module}.txt')
        edges_df = pd.read_csv(edges_file_path, sep='\t')
        
        # Filter proteins and edges
        proteins_subgroup = subnetwork_df[module].dropna().tolist()
        filtered_edges_df = edges_df[edges_df['fromNode'].isin(proteins_subgroup) & edges_df['toNode'].isin(proteins_subgroup)]
        filtered_df = differential_df[differential_df['Group1_vs_Group2'] == 'CJD vs CTRL']
        
        # Create graph
        G = nx.Graph()
        for _, row in filtered_edges_df.iterrows():
            G.add_edge(row['fromNode'], row['toNode'], weight=row['weight'])

        # Node attributes
        protein_attributes = filtered_df.set_index("Protein")
        ax = axes[i]
        pos = nx.spring_layout(G, seed=42, weight='weight', k=0.05)

        # Draw edges and nodes
        edges = G.edges(data=True)
        nx.draw_networkx_edges(G, pos, edgelist=edges, width=[d['weight'] * 5 for (_, _, d) in edges], alpha=0.6, ax=ax)
        
        # Color maps
        vmin_fc, vmax_fc = -2, 2
        cmap_fc = plt.get_cmap("coolwarm")
        norm_fc = mcolors.Normalize(vmin=vmin_fc, vmax=vmax_fc)

        vmin_qvalue, vmax_qvalue = 0, 0.10
        cmap_qvalue = plt.get_cmap("Greys_r")
        norm_qvalue = mcolors.Normalize(vmin=vmin_qvalue, vmax=vmax_qvalue)
        
        # Draw labels
        for node in G.nodes():
            if node in protein_attributes.index:
                q_value = protein_attributes.loc[node, "Q_Value"]
                log2_fc = protein_attributes.loc[node, "Log2_Fold_Change"]
                edge_width = np.interp(q_value, (filtered_df["Q_Value"].min(), filtered_df["Q_Value"].max()), (3, 0.5))
                color_fc = cmap_fc(norm_fc(log2_fc))
                color_qvalue = cmap_qvalue(norm_qvalue(q_value))
                nx.draw_networkx_labels(G, pos, {node: node}, font_size=20, font_weight="bold", 
                                        bbox=dict(facecolor=color_fc, edgecolor=color_qvalue, linewidth=edge_width, boxstyle="round,pad=0.9"), ax=ax)

        ax.axis("off")
        ax.set_title(module, fontsize=90, fontweight='bold')

    # 1. Legend for Log2 Fold Change color 
    cbar_ax = fig.add_axes([0.885, 0.59, 0.01, 0.2])  # Adjust position to the side of the plot
    sm_fc = plt.cm.ScalarMappable(cmap=cmap_fc, norm=norm_fc)
    sm_fc.set_array([])  # Required for colorbar
    colorbar_fc  = fig.colorbar(sm_fc, cax=cbar_ax, label="Log2 Fold Change", shrink=0.2)
    colorbar_fc.set_label("Log2 Fold Change", fontsize=20)  # Change font size of the label
    colorbar_fc.ax.tick_params(labelsize=12)  # Change font size of the ticks
    colorbar_fc.set_ticks([-2, 0, 2])  # Custom ticks positions for example

    # 2. Legend for Q-Value intensity
    cbar_ax_qvalue = fig.add_axes([0.885, 0.3, 0.01, 0.2])  # Another axis for the second colorbar
    sm_qvalue = plt.cm.ScalarMappable(cmap=cmap_qvalue, norm=norm_qvalue)
    sm_qvalue.set_array([])  # Required for colorbar
    # The plotted values come from the Q_Value column, so label the colorbar accordingly
    colorbar_qvalue = fig.colorbar(sm_qvalue, cax=cbar_ax_qvalue, label="q value", shrink=0.2)
    colorbar_qvalue.set_label("q value", fontsize=20)  # Change font size of the label
    colorbar_qvalue.ax.tick_params(labelsize=12)  # Change font size of the ticks
    colorbar_qvalue.set_ticks([0, 0.05, 0.10])  # Custom ticks positions for example
    colorbar_qvalue.ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{x:.2f}"))

    plt.subplots_adjust(wspace=0.000005, hspace=0.1, right=0.88)

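    # Save figure (create the output folder first so savefig cannot fail on a missing directory)
    os.makedirs(figure_folder, exist_ok=True)
    figure_filename = "subnetwork_plot.png"
    figure_path = os.path.join(figure_folder, figure_filename)
    fig.savefig(figure_path, bbox_inches='tight', dpi=300)
    plt.show()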

In [None]:
# Define paths
subnetwork_file = os.path.join(os.getcwd(), 'subnetwork.xlsx')
differential_file = os.path.join(os.getcwd(), '..', '..', 'data', 'results', 'differential', 'differential.csv')
edges_folder = os.path.join(os.getcwd(), '..', '..', 'data', 'results', 'WGCNA')
figure_folder = os.path.join(os.getcwd(), '..', '..', 'figures', 'WGCNA_HHN')

# Call the function
generate_subnetwork_plot(subnetwork_file, differential_file, edges_folder, figure_folder)

In [None]:
def generate_heatmaps(file_path, figure_folder):
    # Load Excel file
    df = pd.read_excel(file_path)

    # Remove 'grey' module
    df = df[df["Module"] != "grey"]

    # Round FC values
    fc_columns = ["MMV1-FC", "MV2K-FC", "VV2-FC"]
    p_columns = ["MMV1-P", "MV2K-P", "VV2-P"]
    df[fc_columns] = df[fc_columns].round(2)

    # Prepare data for heatmaps
    heatmap_data = df.set_index("Module")[fc_columns]
    p_values = df.set_index("Module")[p_columns]
    heatmap_data_sizes = df[['Module_Size', 'Subnetwork_Size']]

    # Color scale limits
    vmin, vmax = 0, 200

    # Create figure with GridSpec layout
    fig = plt.figure(figsize=(12, 4))
    gs = GridSpec(2, 1, height_ratios=[3, 2], hspace=0)

    # First heatmap (FC values)
    ax0 = fig.add_subplot(gs[0])
    sns.heatmap(heatmap_data.T, cmap="coolwarm", annot=True, fmt=".2f", center=0, ax=ax0, 
                cbar_kws={'shrink': 0.8, 'label': 'Fold Change'})

    # Bold values with P < 0.05
    for text, (row, col) in zip(ax0.texts, np.ndindex(heatmap_data.T.shape)):
        if p_values.iloc[col, row] < 0.05:
            text.set_weight("bold")

    ax0.set_xlabel("")
    ax0.set_ylabel("")

    # Second heatmap (Module_Size and Subnetwork_Size)
    ax1 = fig.add_subplot(gs[1])
    sns.heatmap(heatmap_data_sizes.T, cmap="pink", annot=True, fmt=".0f", center=0, vmin=vmin, vmax=vmax, 
                xticklabels=df["Module"], yticklabels=["Module Size", "Subnetwork Size"], ax=ax1, 
                cbar_kws={'shrink': 0.8, 'label': 'Size'})

    # Bold Subnetwork_Size values where Subnetwork_pvalue < 0.05
    for text, (row, col) in zip(ax1.texts, np.ndindex(heatmap_data_sizes.T.shape)):
        if row == 1 and df.iloc[col]["Subnetwork_pvalue"] < 0.05:
            text.set_weight("bold")

    ax1.set_xlabel("")
    ax1.set_ylabel("")

    # Save the figure
    os.makedirs(figure_folder, exist_ok=True)
    figure_filename = "differential_module_expression_heatmap.png"
    figure_path = os.path.join(figure_folder, figure_filename)
    fig.savefig(figure_path, bbox_inches='tight', dpi=300)

    # Show the plot
    plt.show()

In [None]:
base_dir = os.getcwd()
file_path = os.path.join(base_dir, "modules.xlsx")
figure_folder = os.path.join(base_dir, '..', '..', 'figures', 'WGCNA_HHN')
generate_heatmaps(file_path, figure_folder)