# Cell 0: Notebook Header & Documentation
# Description: Provides context and instructions for this setup notebook.

## Notebook Title: Ablation Study - Setup, Configuration & Data Generation

### Purpose and Context

*   **Goal:** To establish the common foundation for the AIFM1 Network Automaton ablation study. This includes importing libraries, setting up the environment, defining baseline configurations, preparing the AIFM1 network subgraph, and saving essential configuration and graph data outputs to files.
*   **Contribution:** Generates the standardized input files (configuration, graph, node lists, layout, seed index) used by subsequent experimental notebooks (`ablation_01` onwards). Defines base directories and a MASTER_SEED.
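*   **Inputs:** Requires the STRING v12.0 `protein.links.full` file for the target's species (taxon 9606). Cell 3 downloads and extracts it into `DATA_ROOT_DIR` if it is not already present.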
*   **Outputs:**
    *   Creates base output directories (`simulation_results`, `biological_analysis_results`).
    *   Creates a dedicated setup output directory (`simulation_results/Ablation_Setup_Files`).
    *   Saves the baseline configuration (including `MASTER_SEED`) to `baseline_config.json`.
    *   Saves the prepared NetworkX graph `G` to `graph_G.pkl`.
    *   Saves the calculated layout `pos` to `graph_pos.pkl`.
    *   Saves `node_list`, `node_to_int`, `int_to_node`, and `INITIAL_SEED_NODES_IDX` to respective `.pkl` files.
    *   Saves device information (`device_info.json`).
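
A downstream notebook might reload these outputs roughly as follows. This is a minimal sketch, not code taken from the ablation notebooks: the helper name `load_setup_files` is hypothetical, and the demonstration writes small stand-in objects to a temporary directory instead of the real graph. The file names match those written by Cell 2 and Cell 3.

```python
import json
import os
import pickle
import tempfile

PICKLE_NAMES = ("graph_G", "graph_pos", "node_list", "node_to_int",
                "int_to_node", "initial_seed_nodes_idx")

def load_setup_files(setup_dir):
    """Hypothetical helper: load baseline_config.json and the .pkl outputs."""
    with open(os.path.join(setup_dir, "baseline_config.json"), "r") as f:
        config = json.load(f)
    objects = {}
    for name in PICKLE_NAMES:
        # Only unpickle files you produced yourself; pickle can execute code.
        with open(os.path.join(setup_dir, f"{name}.pkl"), "rb") as f:
            objects[name] = pickle.load(f)
    return config, objects

# Demonstration with stand-in data in a throwaway directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "baseline_config.json"), "w") as f:
    json.dump({"MASTER_SEED": 42, "STATE_DIM": 2}, f)
stand_ins = {
    "graph_G": {"A": ["B"], "B": ["A"]},  # placeholder for the NetworkX graph
    "graph_pos": None,                     # the layout is saved even when it is None
    "node_list": ["A", "B"],
    "node_to_int": {"A": 0, "B": 1},
    "int_to_node": {0: "A", 1: "B"},
    "initial_seed_nodes_idx": [0],
}
for name, obj in stand_ins.items():
    with open(os.path.join(demo_dir, f"{name}.pkl"), "wb") as f:
        pickle.dump(obj, f)

config, objects = load_setup_files(demo_dir)
print(config["MASTER_SEED"], objects["initial_seed_nodes_idx"])
```

In real use, `setup_dir` would be `simulation_results/Ablation_Setup_Files`, and `graph_pos` should be checked for `None` before plotting.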

### How to Run

*   **Prerequisites:** Python environment with necessary libraries installed (`pandas`, `networkx`, `numpy`, `requests`, `tqdm`, `torch`).
*   **Configuration:** Check/set base directory paths (`DATA_ROOT_DIR`, `OUTPUT_DIR`, `ANALYSIS_DIR`) in Cell 1, and `MASTER_SEED`, `TARGET_NODE_ID`, and baseline parameters in Cell 2, if defaults need changing.
*   **Execution:** Run all cells sequentially from top to bottom (Cell 1 through Cell 3, then Cell 3.1). **This notebook MUST be run successfully before any `ablation_01` through `ablation_08` notebooks.**
*   **Expected Runtime:** A few minutes on a first run, dominated by the STRING download/extraction, graph construction, and layout calculation (about 34 seconds for the layout in the logged run). File saving is fast.
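
Before running the cells, the prerequisite libraries can be checked with a short snippet like the one below. It is a sketch only: `missing_packages` is a hypothetical helper, and it tests whether each package can be located for import, not whether its version is compatible.

```python
import importlib.util

# Package names taken from the prerequisites list above.
REQUIRED_PACKAGES = ["pandas", "networkx", "numpy", "requests", "tqdm", "torch"]

def missing_packages(names):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All prerequisite packages are importable.")
```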

### Expected Results & Analysis (within this notebook)

*   This notebook performs setup, configuration, and graph preparation, saving outputs to files.
*   Successful execution results in:
    *   Confirmation messages for setup steps.
    *   Logs detailing graph preparation.
    *   Printout of AIFM1 subgraph properties.
    *   Confirmation messages listing saved files in `simulation_results/Ablation_Setup_Files/`.
*   **NO simulations are run, NO analysis is performed, and NO functions are defined for external use.**

Copyright 2025 Michael G. Young II, Emergenics Foundation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

In [1]:
# Cell 1: Initial Setup, Imports, Device Check
# Description: Basic imports, setup directories, checks for GPU availability, saves device info.

import os
import sys
import numpy as np
import pandas as pd
# Only import plotting libraries if actually plotting *in this notebook*
# import matplotlib.pyplot as plt
# import matplotlib.colors as mcolors
# import matplotlib.cm as cm
# import seaborn as sns
import networkx as nx
import torch
import requests
import io
import gzip
import shutil
import copy
import math
import json
import time
import pickle
import warnings
import traceback
import random # Import random for seeding
# Keep imports needed by graph prep functions defined later
from tqdm.auto import tqdm
import gc # For memory management

# Ignore common warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")
warnings.filterwarnings("ignore", category=RuntimeWarning)

print(f"--- Cell 1: Initial Setup & Imports ({time.strftime('%Y-%m-%d %H:%M:%S')}) ---")

# --- Device Check ---
dev_name = None # Initialize
if torch.cuda.is_available():
    # Check actual device assignment just in case
    try:
        # Try to get the current device index, default to 0 if error
        current_dev_index = torch.cuda.current_device()
        device = torch.device(f'cuda:{current_dev_index}')
        dev_name = torch.cuda.get_device_name(current_dev_index)
        print(f"✅ CUDA available, using default GPU: {device} ({dev_name})")
    except Exception as e:
        print(f"⚠️ CUDA available, but error getting current device: {e}. Falling back to cuda:0.")
        device = torch.device('cuda:0') # Fallback
        try:
            dev_name = torch.cuda.get_device_name(0)
        except Exception:
             dev_name = "CUDA Device (Unknown Name)"
        print(f"   Using fallback: {device} ({dev_name})")
    device_type = 'cuda'
else:
    device = torch.device('cpu')
    print("⚠️ CUDA not available, using CPU.")
    device_type = 'cpu'
    dev_name = 'CPU'

# --- Base Directories ---
DATA_ROOT_DIR = "/tmp/cakg_data"
OUTPUT_DIR = "simulation_results"
ANALYSIS_DIR = "biological_analysis_results"
SETUP_OUTPUT_SUBDIR = "Ablation_Setup_Files" # Specific folder for setup outputs
SETUP_OUTPUT_DIR = os.path.join(OUTPUT_DIR, SETUP_OUTPUT_SUBDIR)

# Create all directories
os.makedirs(DATA_ROOT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(ANALYSIS_DIR, exist_ok=True)
os.makedirs(SETUP_OUTPUT_DIR, exist_ok=True) # Create the specific setup dir
print("Checked/created base directories.")
print(f"Setup files will be saved in: {SETUP_OUTPUT_DIR}")

# --- Save Device Info ---
# Use standard library types for JSON compatibility
device_info = {'device_type': device_type, 'device_name': dev_name, 'torch_device_str': str(device)}
device_info_path = os.path.join(SETUP_OUTPUT_DIR, "device_info.json")
try:
    with open(device_info_path, 'w') as f:
        json.dump(device_info, f, indent=4)
    print(f"  ✅ Saved device info to {device_info_path}")
except Exception as e:
    print(f"  ⚠️ Warning: Could not save device info: {e}")


print("\nCell 1: Initial Setup complete.")

--- Cell 1: Initial Setup & Imports (2025-04-28 20:44:56) ---
✅ CUDA available, using default GPU: cuda:0 (NVIDIA GeForce RTX 2060)
Checked/created base directories.
Setup files will be saved in: simulation_results/Ablation_Setup_Files
  ✅ Saved device info to simulation_results/Ablation_Setup_Files/device_info.json

Cell 1: Initial Setup complete.
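
The saved `device_info.json` records the device of the machine that ran this setup, so a later notebook on different hardware may need a fallback. The sketch below shows one way to handle that. `resolve_device_str` is a hypothetical helper, and the caller is assumed to pass in the result of `torch.cuda.is_available()`, which keeps the example runnable without a GPU.

```python
import json

def resolve_device_str(device_info, cuda_available):
    """Return the saved device string, or 'cpu' if the saved GPU is unavailable."""
    saved = device_info.get("torch_device_str", "cpu")
    if saved.startswith("cuda") and not cuda_available:
        return "cpu"
    return saved

# Stand-in for the contents of device_info.json from the logged run.
example_info = json.loads(
    '{"device_type": "cuda", "device_name": "NVIDIA GeForce RTX 2060", '
    '"torch_device_str": "cuda:0"}'
)
print(resolve_device_str(example_info, cuda_available=True))
print(resolve_device_str(example_info, cuda_available=False))
```

The returned string can then be passed to `torch.device(...)`.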


In [2]:
# Cell 2: Baseline Configuration Definition & Saving (Includes Master Seed)
# Description: Defines baseline configuration, rule parameters (typically 2D H+P), and MASTER_SEED.
#              Saves the complete configuration dictionary to a JSON file in the setup directory.
#              This configuration will be loaded by subsequent notebooks.

import numpy as np
import os
import json
import traceback
import copy
import random # Import random for seeding

print("\n--- Cell 2: Baseline Configuration Definition & Saving ---")

# Retrieve setup output path defined in Cell 1
SETUP_OUTPUT_DIR = globals().get("SETUP_OUTPUT_DIR")
if SETUP_OUTPUT_DIR is None or not os.path.isdir(SETUP_OUTPUT_DIR):
    # Attempt to recreate if missing, maybe from globals set in Cell 1 if run standalone
    output_base = globals().get("OUTPUT_DIR", "simulation_results")
    setup_subdir = "Ablation_Setup_Files"
    SETUP_OUTPUT_DIR = os.path.join(output_base, setup_subdir)
    print(f"Warning: SETUP_OUTPUT_DIR not found globally, attempting default: {SETUP_OUTPUT_DIR}")
    os.makedirs(SETUP_OUTPUT_DIR, exist_ok=True) # Create if doesn't exist

# --- MASTER SEED for Reproducibility ---
MASTER_SEED = 42
print(f"MASTER_SEED set to: {MASTER_SEED}")
# Apply seed globally in this setup notebook for consistency in graph/layout
np.random.seed(MASTER_SEED)
random.seed(MASTER_SEED)

# --- Experiment Info ---
DATASET_NAME = "STRING"
# Define a base name that later notebooks can modify/append
BASE_EXPERIMENT_NAME = 'string_ca_subgraph_AIFM1_CORRECTED' # Base name for study

# --- Subgraph Selection (TARGET AIFM1) ---
TARGET_NODE_ID = '9606.ENSP00000287295' # AIFM1 ID
SUBGRAPH_RADIUS = 2

# --- Baseline State Parameters (2D H+P) ---
STATE_DIM = 2 # Start with 2D as baseline default
ACT_IDX = 0; INH_IDX = 1
PLACEHOLDER_W_IDX = 2; PLACEHOLDER_X_IDX = 3; PLACEHOLDER_Y_IDX = 4 # Define indices even if dim=2

INIT_MODE = 'seeds'
SEED_ACTIVATION_VALUE = [1.0, 0.0] # Use list for JSON compatibility
DEFAULT_INACTIVE_STATE = [0.0, 0.0] # Use list for JSON compatibility

# --- Simulation Parameters ---
MAX_SIMULATION_STEPS = 500
CONVERGENCE_THRESHOLD = 0.0001

# --- Visualization & Output Parameters ---
NODES_TO_PLOT_COUNT = 10
SNAPSHOT_STEPS = [0, 50, 100, 250, 499] # Baseline snapshot steps (adjust end for 0-based indexing)

# --- Baseline Rule Parameters (2D H+P Configuration) ---
# Contains ALL potential parameters used across different ablation runs,
# with defaults set for the baseline 2D H+P run.
rule_params = {
    'activation_threshold': 0.5, 'activation_increase_rate': 0.15, 'activation_decay_rate': 0.05,
    'inhibition_threshold': 0.5, 'inhibition_increase_rate': 0.1, 'inhibition_decay_rate': 0.1,
    'inhibition_feedback_threshold': 0.6, 'inhibition_feedback_strength': 0.3,
    'diffusion_factor': 0.05,
    'noise_level': 0.001,
    'harmonic_factor': 0.05,             # H Active in baseline
    'pheromone_increase_rate': 0.02,     # P Active in baseline
    'pheromone_multiplicative_decay_rate': 0.99,
    # Placeholder decay keys (inactive in baseline)
    'w_decay_rate': 0.0, 'x_decay_rate': 0.0, 'y_decay_rate': 0.0, 'placeholder_decay_rate': 0.0,
    # Refractory/Dynamic Weights keys (inactive in baseline)
    'refractory_threshold': 1.1, 'additional_decay_factor': 0.0, 'use_dynamic_weights': False,
    'potentiation_threshold': 1.1, 'potentiation_increase_rate': 0.0, 'potentiation_decay_rate': 1.0, 'max_potentiation': 0.0,
    # Weight usage flag
    'use_confidence_weight': True,
}

# --- Combine All Config into One Dictionary ---
# This dictionary is saved to file for other notebooks to load
baseline_config = {
    # Reproducibility
    'MASTER_SEED': MASTER_SEED,
    # Experiment Info
    'DATASET_NAME': DATASET_NAME,
    'EXPERIMENT_NAME': BASE_EXPERIMENT_NAME, # Base name saved
    'TARGET_NODE_ID': TARGET_NODE_ID,
    'SUBGRAPH_RADIUS': SUBGRAPH_RADIUS,
    # State Info
    'STATE_DIM': STATE_DIM, # Save the baseline state dim
    'ACT_IDX': ACT_IDX, 'INH_IDX': INH_IDX,
    'PLACEHOLDER_W_IDX': PLACEHOLDER_W_IDX, 'PLACEHOLDER_X_IDX': PLACEHOLDER_X_IDX, 'PLACEHOLDER_Y_IDX': PLACEHOLDER_Y_IDX,
    'INIT_MODE': INIT_MODE,
    'SEED_ACTIVATION_VALUE': SEED_ACTIVATION_VALUE,
    'DEFAULT_INACTIVE_STATE': DEFAULT_INACTIVE_STATE,
    # Simulation Params
    'MAX_SIMULATION_STEPS': MAX_SIMULATION_STEPS,
    'CONVERGENCE_THRESHOLD': CONVERGENCE_THRESHOLD,
    # Visualization Params
    'NODES_TO_PLOT_COUNT': NODES_TO_PLOT_COUNT,
    'SNAPSHOT_STEPS': SNAPSHOT_STEPS,
    # Rule Params (Contains ALL possible keys, values set for baseline)
    'rule_params': rule_params,
    # Base Dirs (useful for context when loading)
    'DATA_ROOT_DIR': globals().get('DATA_ROOT_DIR', '/tmp/cakg_data'),
    'OUTPUT_DIR': globals().get('OUTPUT_DIR', 'simulation_results'),
    'ANALYSIS_DIR': globals().get('ANALYSIS_DIR', 'biological_analysis_results'),
    'SETUP_OUTPUT_DIR': SETUP_OUTPUT_DIR
}

print(f"Baseline State Dim: {baseline_config['STATE_DIM']}")
print(f"Baseline Seed Value: {baseline_config['SEED_ACTIVATION_VALUE']}")
print(f"Baseline Default State: {baseline_config['DEFAULT_INACTIVE_STATE']}")
print("\n--- Baseline Rule Parameters (Initial: 2D H+P Active) ---")
print(json.dumps(baseline_config['rule_params'], indent=2))

# --- Save Configuration to File ---
config_save_path = os.path.join(SETUP_OUTPUT_DIR, "baseline_config.json")
try:
    with open(config_save_path, 'w') as f:
        # Use default=str to handle potential numpy types if they crept in
        json.dump(baseline_config, f, indent=4, default=str)
    print(f"\n✅ Saved baseline configuration (including MASTER_SEED) to: {config_save_path}")
except Exception as e:
    print(f"\n❌ Error saving baseline configuration: {e}")
    traceback.print_exc(limit=1)

print("\nCell 2: Baseline configuration defined and saved.")


--- Cell 2: Baseline Configuration Definition & Saving ---
MASTER_SEED set to: 42
Baseline State Dim: 2
Baseline Seed Value: [1.0, 0.0]
Baseline Default State: [0.0, 0.0]

--- Baseline Rule Parameters (Initial: 2D H+P Active) ---
{
  "activation_threshold": 0.5,
  "activation_increase_rate": 0.15,
  "activation_decay_rate": 0.05,
  "inhibition_threshold": 0.5,
  "inhibition_increase_rate": 0.1,
  "inhibition_decay_rate": 0.1,
  "inhibition_feedback_threshold": 0.6,
  "inhibition_feedback_strength": 0.3,
  "diffusion_factor": 0.05,
  "noise_level": 0.001,
  "harmonic_factor": 0.05,
  "pheromone_increase_rate": 0.02,
  "pheromone_multiplicative_decay_rate": 0.99,
  "w_decay_rate": 0.0,
  "x_decay_rate": 0.0,
  "y_decay_rate": 0.0,
  "placeholder_decay_rate": 0.0,
  "refractory_threshold": 1.1,
  "additional_decay_factor": 0.0,
  "use_dynamic_weights": false,
  "potentiation_threshold": 1.1,
  "potentiation_increase_rate": 0.0,
  "potentiation_decay_rate": 1.0,
  "max_potentiation": 0.0,
  "use_confidence_weight": true
}
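
Because `rule_params` holds every key with baseline values, an ablation run can be derived by copying the configuration and overriding a few entries. The sketch below illustrates that idea under assumptions: `make_ablation_config` is a hypothetical helper, the trimmed `baseline` dictionary stands in for the loaded `baseline_config.json`, and zeroing `harmonic_factor` is only an illustrative override, since the parameters each ablation notebook actually changes are not shown here.

```python
import copy

def make_ablation_config(baseline, name_suffix, rule_overrides):
    """Hypothetical helper: copy the baseline config and override rule parameters."""
    unknown = set(rule_overrides) - set(baseline["rule_params"])
    if unknown:
        raise KeyError(f"Unknown rule parameters: {sorted(unknown)}")
    cfg = copy.deepcopy(baseline)  # deep copy so the nested rule_params is not shared
    cfg["EXPERIMENT_NAME"] = f"{baseline['EXPERIMENT_NAME']}_{name_suffix}"
    cfg["rule_params"].update(rule_overrides)
    return cfg

# Trimmed stand-in for the loaded baseline configuration.
baseline = {
    "MASTER_SEED": 42,
    "EXPERIMENT_NAME": "string_ca_subgraph_AIFM1_CORRECTED",
    "rule_params": {"harmonic_factor": 0.05, "pheromone_increase_rate": 0.02},
}

ablated = make_ablation_config(baseline, "no_H", {"harmonic_factor": 0.0})
print(ablated["EXPERIMENT_NAME"], ablated["rule_params"])
```

Rejecting unknown keys guards against a misspelled parameter name silently leaving the baseline value in place.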

In [3]:
# Cell 3: Graph Preparation and Saving
# Description: Downloads/extracts STRING data, filters by score, builds NetworkX graph G,
#              extracts AIFM1 ego subgraph, calculates layout 'pos', generates node lists/mappings,
#              and saves all graph-related objects to files in the setup directory.

import os
import pandas as pd
import requests
import gzip
import shutil
from tqdm.auto import tqdm
import warnings
import numpy as np
import networkx as nx
import random
import traceback
import time
import pickle # Needed for saving graph objects
import json # Needed for loading config path
import gc # Import garbage collector

print("\n--- Cell 3: Graph Preparation and Saving (AIFM1 Subgraph) ---")

# --- Load Config for Paths and Parameters ---
config = {}
graph_prep_error = False
local_G = None; local_node_list = []; local_node_to_int = {}; local_int_to_node = {}; local_pos = None; local_INITIAL_SEED_NODES_IDX = []
SETUP_OUTPUT_DIR_save = None # Initialize

try:
    # Load config saved by Cell 2
    SETUP_OUTPUT_DIR_load = os.path.join("simulation_results", "Ablation_Setup_Files")
    if not os.path.isdir(SETUP_OUTPUT_DIR_load): raise NotADirectoryError(f"Setup directory not found: {SETUP_OUTPUT_DIR_load}. Run Cell 1 & 2 first.")
    config_path_load = os.path.join(SETUP_OUTPUT_DIR_load, "baseline_config.json")
    if not os.path.exists(config_path_load): raise FileNotFoundError(f"Baseline config file not found: {config_path_load}")
    with open(config_path_load, 'r') as f: config = json.load(f)
    print(f"  Loaded config from {config_path_load}")

    # Extract necessary parameters
    DATA_ROOT_DIR_local = config.get('DATA_ROOT_DIR')
    DATASET_NAME_local = config.get('DATASET_NAME', 'STRING')
    TARGET_NODE_ID_local = config.get('TARGET_NODE_ID')
    SUBGRAPH_RADIUS_local = config.get('SUBGRAPH_RADIUS')
    SCORE_THRESHOLD_local = 0.6 # Fixed threshold for consistency
    INIT_MODE_local = config.get('INIT_MODE')
    SETUP_OUTPUT_DIR_save = config.get('SETUP_OUTPUT_DIR') # Use path from loaded config
    MASTER_SEED = config.get('MASTER_SEED', 42)

    if None in [DATA_ROOT_DIR_local, TARGET_NODE_ID_local, SUBGRAPH_RADIUS_local, INIT_MODE_local, SETUP_OUTPUT_DIR_save]:
        raise ValueError("Essential parameters missing from loaded config.")
    print(f"  Target Node: {TARGET_NODE_ID_local}, Radius: {SUBGRAPH_RADIUS_local}, Score Threshold: {SCORE_THRESHOLD_local}")

except Exception as e_conf:
    print(f"❌ Error loading config for graph prep: {e_conf}")
    graph_prep_error = True

# --- Helper Functions (Define locally for self-containment) ---
def download_file(url, dest_path, desc=None):
    """Downloads a file with progress, robust error handling."""
    if os.path.exists(dest_path):
        print(f"File already exists: {dest_path}")
        return dest_path
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    effective_desc = desc or os.path.basename(dest_path)
    print(f"Downloading {url} to {dest_path}...")
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, stream=True, timeout=300, headers=headers) # Increased timeout
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))
        with open(dest_path, 'wb') as f, tqdm(
            desc=effective_desc, total=total_size, unit='iB',
            unit_scale=True, unit_divisor=1024, miniters=1, leave=False
        ) as bar:
            for chunk in response.iter_content(chunk_size=1024*1024): # Larger chunk size
                if chunk:
                    size = f.write(chunk)
                    bar.update(size)
        actual_size = os.path.getsize(dest_path)
        if total_size != 0 and actual_size < total_size :
            warnings.warn(f"Size mismatch (may be OK): Expected {total_size}, got {actual_size}")
        print(f"Download completed: {dest_path}")
        return dest_path
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {url}: {e}")
        if os.path.exists(dest_path):
             try: os.remove(dest_path); print(f"Removed partial download: {dest_path}")
             except OSError as rm_err: print(f"Error removing partial download {dest_path}: {rm_err}")
        raise
    except Exception as e:
        print(f"Unexpected download error: {e}")
        if os.path.exists(dest_path):
             try: os.remove(dest_path); print(f"Removed partial download: {dest_path}")
             except OSError as rm_err: print(f"Error removing partial download {dest_path}: {rm_err}")
        raise

def extract_file(src_path, dest_dir, desc=None):
    """Extracts .gz files with progress and robust checks."""
    if not os.path.exists(src_path):
        raise FileNotFoundError(f"Source not found: {src_path}")
    os.makedirs(dest_dir, exist_ok=True)
    effective_desc = desc or f"Extracting {os.path.basename(src_path)}"
    extracted_path = None
    try:
        if src_path.endswith('.gz') and not src_path.endswith('.tar.gz'):
            dest_file = os.path.join(dest_dir, os.path.basename(src_path)[:-3])
            if os.path.exists(dest_file) and os.path.getmtime(dest_file) >= os.path.getmtime(src_path):
                print(f"Extracted file {dest_file} up-to-date. Skipping.")
                return dest_file
            print(f"Extracting {src_path} to {dest_file}...")
            src_file_size = os.path.getsize(src_path)
            # Open the raw file separately so progress tracks compressed bytes consumed.
            # GzipFile.tell() reports the uncompressed offset, which would overshoot the total.
            with open(src_path, 'rb') as f_raw, \
                 gzip.GzipFile(fileobj=f_raw, mode='rb') as f_in, \
                 open(dest_file, 'wb') as f_out, \
                 tqdm(total=src_file_size, unit='B', unit_scale=True,
                      desc=effective_desc, leave=False) as pbar:
                chunk_size = 1024 * 1024 # 1MB chunks (uncompressed)
                while True:
                    chunk = f_in.read(chunk_size)
                    if not chunk:
                        break
                    f_out.write(chunk)
                    # Advance the bar by the compressed bytes read since the last update
                    pbar.update(max(0, f_raw.tell() - pbar.n))
            extracted_path = dest_file
        else:
            print(f"Unsupported format: {src_path}")
            return None
    except Exception as e:
        print(f"Extraction error: {e}")
        raise

    if extracted_path and os.path.exists(extracted_path):
        return extracted_path
    else:
        warnings.warn(f"Extraction path '{extracted_path}' invalid or non-existent.")
        return None


# --- Main Graph Preparation Logic ---
if not graph_prep_error:
    try:
        # Apply MASTER_SEED for layout reproducibility later
        layout_seed = config.get('MASTER_SEED', 42)
        random.seed(layout_seed)
        np.random.seed(layout_seed)

        # --- Download/Extract ---
        string_species_id = TARGET_NODE_ID_local.split('.')[0]
        string_version = "v12.0"
        edge_url = f"https://stringdb-downloads.org/download/protein.links.full.{string_version}/{string_species_id}.protein.links.full.{string_version}.txt.gz"
        dataset_root = os.path.join(DATA_ROOT_DIR_local, DATASET_NAME_local)
        gz_filename = os.path.basename(edge_url)
        gz_file_path = os.path.join(dataset_root, gz_filename)
        extracted_filename = f"{string_species_id}.protein.links.full.{string_version}.txt"
        edge_file_path = os.path.join(dataset_root, extracted_filename)
        os.makedirs(dataset_root, exist_ok=True)
        download_file(edge_url, gz_file_path, desc="STRING Links Archive")
        extracted_path_check = extract_file(gz_file_path, dataset_root, desc="Extracting STRING Links")
        if not extracted_path_check or not os.path.exists(edge_file_path): raise FileNotFoundError(f"Required STRING file not found: {edge_file_path}")
        print(f"Using STRING data file: {edge_file_path}")

        # --- Load Data & Clean Score ---
        print("Loading STRING data...")
        required_cols = ['protein1', 'protein2', 'combined_score']; col_dtypes_load = {'protein1': str, 'protein2': str, 'combined_score': float}
        edges_df = pd.read_csv(edge_file_path, sep=' ', usecols=required_cols, dtype=col_dtypes_load, low_memory=False)
        print(f"Loaded {len(edges_df)} raw interactions.")
        if edges_df.empty: raise ValueError("Loaded empty DataFrame.")
        initial_rows = len(edges_df); edges_df.dropna(subset=['combined_score'], inplace=True); rows_after_dropna = len(edges_df)
        if rows_after_dropna < initial_rows: print(f"  Dropped {initial_rows - rows_after_dropna} rows with non-numeric score.")
        edges_df['combined_score'] = edges_df['combined_score'].astype(int); print("  Score column cleaned and converted to integer.")

        # --- Pre-Filter Check ---
        target_node_str = str(TARGET_NODE_ID_local)
        if not (edges_df['protein1'].eq(target_node_str).any() or edges_df['protein2'].eq(target_node_str).any()): warnings.warn(f"🚨 Target '{target_node_str}' NOT FOUND in raw data!")
        else: print(f"✅ Target '{target_node_str}' present in raw data.")

        # --- Filter by Score ---
        score_int_threshold = int(SCORE_THRESHOLD_local * 1000)
        initial_count = len(edges_df)
        edges_df = edges_df[edges_df['combined_score'] >= score_int_threshold].copy() # Apply filter after conversion
        print(f"Filtered edges by score >= {score_int_threshold}. Kept {len(edges_df)} / {initial_count} interactions.")
        if edges_df.empty: raise ValueError(f"No interactions left after filtering at score {score_int_threshold}.")

        # --- Build Full Graph ---
        print("Building NetworkX graph (G_full)...")
        G_full = nx.Graph()
        unique_nodes = pd.unique(edges_df[['protein1', 'protein2']].values.ravel('K')); G_full.add_nodes_from(unique_nodes); print(f"Added {G_full.number_of_nodes()} nodes.")
        skipped_self_loops = 0
        for _, row in tqdm(edges_df.iterrows(), total=len(edges_df), desc="Adding Edges"):
            u, v, score = str(row['protein1']), str(row['protein2']), row['combined_score']
            if u == v: skipped_self_loops += 1; continue
            # Add edge only if it doesn't exist to avoid duplicate checks/warnings
            if not G_full.has_edge(u,v):
                 G_full.add_edge(u, v, weight=float(score / 1000.0))
        if skipped_self_loops > 0: print(f"Skipped {skipped_self_loops} self-loops.")
        print(f"Built full filtered graph: {G_full.number_of_nodes()} nodes, {G_full.number_of_edges()} edges.")
        if target_node_str not in G_full: warnings.warn(f"🚨 Target '{target_node_str}' NOT FOUND in filtered graph nodes!")
        del edges_df; gc.collect() # Free memory

        # --- Extract Ego Graph ---
        print(f"Extracting ego graph (radius {SUBGRAPH_RADIUS_local}) for '{target_node_str}'...")
        temp_G = None
        if target_node_str in G_full:
            try:
                temp_G = nx.ego_graph(G_full, target_node_str, radius=SUBGRAPH_RADIUS_local, undirected=True, center=True)
            except Exception as ego_err:
                print(f"Error extracting ego graph: {ego_err}. Using full.")
                temp_G = G_full
        else:
             print("Target not found in filtered graph. Using full.")
             temp_G = G_full
        del G_full; gc.collect() # Cleanup memory
        if temp_G is None or temp_G.number_of_nodes() == 0:
            raise ValueError("Final graph 'G' is None or empty.")
        local_G = temp_G # Assign to local variable for saving
        print(f"Final graph 'G' created: {local_G.number_of_nodes()} nodes, {local_G.number_of_edges()} edges.")

        # --- Build Mappings ---
        print("Building node list and mappings...")
        local_node_list = sorted([str(n) for n in local_G.nodes()])
        local_node_to_int = {node_id: i for i, node_id in enumerate(local_node_list)}
        local_int_to_node = {i: node_id for node_id, i in local_node_to_int.items()}
        print(f"Node list length: {len(local_node_list)}")

        # --- Set Seed Index ---
        local_INITIAL_SEED_NODES_IDX = []
        if INIT_MODE_local == 'seeds' and target_node_str in local_node_to_int:
            seed_idx = local_node_to_int[target_node_str]
            local_INITIAL_SEED_NODES_IDX = [seed_idx] # Store as list
            print(f"Target '{target_node_str}' mapped to index: {local_INITIAL_SEED_NODES_IDX}")
        elif INIT_MODE_local == 'seeds':
             warnings.warn(f"Target '{target_node_str}' not in final map! Cannot set seed.")

        # --- Calculate Layout ---
        print("Calculating graph layout (Spring)...")
        local_pos = None; layout_start_time = time.time()
        try:
            layout_seed = config.get('MASTER_SEED', 42)
            random.seed(layout_seed); np.random.seed(layout_seed) # Re-seed just before layout
            node_count = local_G.number_of_nodes(); k_val = 0.8 / np.sqrt(max(1, node_count)); iterations_val = 75
            print(f"  Layout params: k={k_val:.3f}, iterations={iterations_val}")
            local_pos = nx.spring_layout(local_G, k=k_val, iterations=iterations_val, seed=layout_seed)
            layout_duration = time.time() - layout_start_time; print(f"  Layout finished in {layout_duration:.2f} sec.")
            if len(local_pos) != node_count :
                 warnings.warn(f"Layout node count mismatch ({len(local_pos)} vs {node_count})")
        except Exception as layout_err:
             print(f"Layout error: {layout_err}. Setting pos=None."); local_pos = None

        print("\n--- Graph Preparation Steps Completed ---")

        # --- Save Graph Objects to Files ---
        print(f"\n--- Saving Graph Data to Files in {SETUP_OUTPUT_DIR_save} ---")
        save_graph_error = False
        objects_to_save = {
            "graph_G.pkl": local_G,
            "graph_pos.pkl": local_pos,
            "node_list.pkl": local_node_list,
            "node_to_int.pkl": local_node_to_int,
            "int_to_node.pkl": local_int_to_node,
            "initial_seed_nodes_idx.pkl": local_INITIAL_SEED_NODES_IDX # Save the potentially empty list
        }
        for filename, obj in objects_to_save.items():
             filepath = os.path.join(SETUP_OUTPUT_DIR_save, filename)
             # Save even if obj is None (like pos) or empty list
             if obj is not None or isinstance(obj, (dict, list)) or filename == "graph_pos.pkl":
                 try:
                      with open(filepath, 'wb') as f:
                           pickle.dump(obj, f)
                      print(f"  ✅ Saved {filename}")
                 except Exception as e:
                      print(f"  ❌ Error saving {filename}: {e}"); save_graph_error = True
             else:
                  print(f"  ⚠️ Skipping {filename} (object is None and not pos/list/dict).")

        if save_graph_error:
            warnings.warn("Errors occurred during graph object saving.")
        else:
            print("--- Graph data saving complete. ---")

    except Exception as prep_error:
        print(f"\n❌❌❌ ERROR during graph preparation or saving: {prep_error} ❌❌❌")
        traceback.print_exc()
        graph_prep_error = True

else: # graph_prep_error was True from config loading
    print("Skipping graph preparation due to configuration loading errors.")

# --- Final Sanity Check ---
if not graph_prep_error:
     print("\n--- Verifying Saved Files ---")
     files_verified = True
     try:
         # Verify essential files were saved
         essential_files = ["graph_G.pkl", "node_list.pkl", "node_to_int.pkl", "int_to_node.pkl", "initial_seed_nodes_idx.pkl"]
         for fname in essential_files:
             fpath = os.path.join(SETUP_OUTPUT_DIR_save, fname)
             if not os.path.exists(fpath):
                 print(f"  ❌ Verification Error: Essential file '{fname}' not found.")
                 files_verified = False
             else:
                 # Optionally load and check content basic properties
                 with open(fpath, 'rb') as f_check: obj_check = pickle.load(f_check)
                 if fname == "graph_G.pkl" and (not isinstance(obj_check, nx.Graph) or obj_check.number_of_nodes() == 0):
                     print(f"  ❌ Verification Error: Loaded graph from '{fname}' is invalid or empty.")
                     files_verified = False
                 elif fname == "node_list.pkl" and (not isinstance(obj_check, list)):
                      print(f"  ❌ Verification Error: Loaded node_list from '{fname}' is not a list.")
                      files_verified = False
                 # Add more checks if needed...
         if files_verified: print("  ✅ Essential saved files verified.")

     except Exception as e_verify:
         print(f"  Error verifying saved files: {e_verify}")
         files_verified = False
     if not files_verified: print("  ⚠️ Verification failed. Check errors above.")
else:
     print("\n--- Graph Preparation Failed - Cannot Verify Outputs ---")

print("\nCell 3: Graph Preparation and Saving execution complete.")


--- Cell 3: Graph Preparation and Saving (AIFM1 Subgraph) ---
  Loaded config from simulation_results/Ablation_Setup_Files/baseline_config.json
  Target Node: 9606.ENSP00000287295, Radius: 2, Score Threshold: 0.6
File already exists: /tmp/cakg_data/STRING/9606.protein.links.full.v12.0.txt.gz
Extracted file /tmp/cakg_data/STRING/9606.protein.links.full.v12.0.txt up-to-date. Skipping.
Using STRING data file: /tmp/cakg_data/STRING/9606.protein.links.full.v12.0.txt
Loading STRING data...
Loaded 9310247 raw interactions.
  Dropped 1 rows with non-numeric score.
  Score column cleaned and converted to integer.
✅ Target '9606.ENSP00000287295' present in raw data.
Filtered edges by score >= 600. Kept 490325 / 9310246 interactions.
Building NetworkX graph (G_full)...
Added 17674 nodes.


Built full filtered graph: 17674 nodes, 322463 edges.
Extracting ego graph (radius 2) for '9606.ENSP00000287295'...
Final graph 'G' created: 2334 nodes, 79930 edges.
Building node list and mappings...
Node list length: 2334
Target '9606.ENSP00000287295' mapped to index: [556]
Calculating graph layout (Spring)...
  Layout params: k=0.017, iterations=75
  Layout finished in 33.88 sec.

--- Graph Preparation Steps Completed ---

--- Saving Graph Data to Files in simulation_results/Ablation_Setup_Files ---
  ✅ Saved graph_G.pkl
  ✅ Saved graph_pos.pkl
  ✅ Saved node_list.pkl
  ✅ Saved node_to_int.pkl
  ✅ Saved int_to_node.pkl
  ✅ Saved initial_seed_nodes_idx.pkl
--- Graph data saving complete. ---

--- Verifying Saved Files ---
  ✅ Essential saved files verified.

Cell 3: Graph Preparation and Saving execution complete.
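
The verification step in Cell 3 only confirms that `graph_G.pkl` holds a non-empty graph and that `node_list.pkl` holds a list, and it leaves room for more checks. The following is a minimal sketch of what such extra checks might look like for the saved mappings and seed indices. The helper name `check_node_mappings` and its return format are hypothetical and not part of the notebook. It assumes that `node_to_int` maps each node ID to its position in `node_list` and that `int_to_node` is the inverse mapping, as the file names suggest.

```python
def check_node_mappings(node_list, node_to_int, int_to_node, seed_idx):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: assumes node_to_int maps each node ID to its
    position in node_list and int_to_node is the inverse mapping.
    """
    problems = []
    if len(set(node_list)) != len(node_list):
        problems.append("node_list contains duplicate node IDs")
    if len(node_to_int) != len(node_list) or len(int_to_node) != len(node_list):
        problems.append("mapping sizes differ from node_list length")
    # Each node must map to its own position, and each position back to the node
    for idx, node in enumerate(node_list):
        if node_to_int.get(node) != idx:
            problems.append(f"node_to_int[{node!r}] does not equal {idx}")
        if int_to_node.get(idx) != node:
            problems.append(f"int_to_node[{idx}] does not equal {node!r}")
    # Seed indices must point inside node_list
    for idx in seed_idx:
        if not 0 <= idx < len(node_list):
            problems.append(f"seed index {idx} out of range")
    return problems


# Toy data standing in for the unpickled objects (not real STRING IDs)
nodes = ["P1", "P2", "P3"]
to_int = {n: i for i, n in enumerate(nodes)}
to_node = {i: n for n, i in to_int.items()}

print(check_node_mappings(nodes, to_int, to_node, [1]))  # → []
print(check_node_mappings(nodes, to_int, to_node, [7]))  # → ['seed index 7 out of range']
```

In the notebook, the arguments would be the objects loaded back from `node_list.pkl`, `node_to_int.pkl`, `int_to_node.pkl`, and `initial_seed_nodes_idx.pkl`. Any non-empty result could set `files_verified = False`.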


In [4]:
# Cell 3.1: Calculate and Save Static Baselines (Degree and RWR)
# Description: Calculates the Top Degree and Top Random Walk with Restart (RWR)
#              nodes for the prepared AIFM1 graph. Saves these lists to a text file
#              (baseline_nodes.txt) in the main simulation_results directory
#              for use by analysis notebooks.

import networkx as nx
import numpy as np
import os
import time
import warnings
import random # Ensure random is imported for potential seeding
import traceback # Ensure traceback is imported

print(f"\n--- Cell 3.1: Calculate and Save Static Baselines (Degree and RWR) ({time.strftime('%Y-%m-%d %H:%M:%S')}) ---")

# --- Prerequisites Check ---
baselines_calc_error = False

# Check if graph data is available from Cell 3
if 'local_G' not in globals() or not isinstance(local_G, nx.Graph) or local_G.number_of_nodes() == 0:
    print("❌ Baseline Calculation Error: Graph 'local_G' missing or invalid (Run Cell 3).")
    baselines_calc_error = True
if 'local_node_list' not in globals() or not isinstance(local_node_list, list) or not local_node_list:
    print("❌ Baseline Calculation Error: Node list 'local_node_list' missing or invalid (Run Cell 3).")
    baselines_calc_error = True
if 'local_node_to_int' not in globals() or not isinstance(local_node_to_int, dict) or not local_node_to_int:
    print("❌ Baseline Calculation Error: Node map 'local_node_to_int' missing or invalid (Run Cell 3).")
    baselines_calc_error = True
if 'local_INITIAL_SEED_NODES_IDX' not in globals() or not isinstance(local_INITIAL_SEED_NODES_IDX, list):
    print("❌ Baseline Calculation Error: Seed index list 'local_INITIAL_SEED_NODES_IDX' missing or invalid (Run Cell 3).")
    baselines_calc_error = True

# Check if output directory is available from Cell 1 or 2
output_dir_base = globals().get('OUTPUT_DIR', 'simulation_results')
if not output_dir_base or not os.path.isdir(output_dir_base):
    print(f"❌ Baseline Saving Error: Base output directory '{output_dir_base}' missing or invalid. Check Cell 1 or 2.")
    baselines_calc_error = True


# --- Define Output File Path ---
# This file is saved in the main simulation_results directory
baseline_output_filepath = os.path.join(output_dir_base, "baseline_nodes.txt")
print(f"Static baseline node lists will be saved to: {baseline_output_filepath}")


# --- Define Number of Top Nodes to Save ---
# Saving a reasonable number (e.g., top 100) allows analysis notebooks to slice as needed.
N_TOP_BASELINE_NODES_TO_SAVE = 100 # Save top 100 for each baseline type
print(f"Will calculate and save top {N_TOP_BASELINE_NODES_TO_SAVE} nodes for each baseline.")


# --- Initialize baseline lists ---
top_degree_nodes = []
top_rwr_nodes = []


# --- Execute Baseline Calculation and Saving ---
if not baselines_calc_error:
    print("\nCalculating static baseline node lists...")
    try:
        # --- 1. Calculate Top Degree Nodes ---
        print("  Calculating Top Degree nodes...")
        try:
            degrees = dict(local_G.degree())
            # Sort nodes by degree in descending order and keep the top N node IDs.
            # local_node_list is built from local_G, so no filtering against it should be needed.
            top_degree_nodes = sorted(degrees, key=degrees.get, reverse=True)[:N_TOP_BASELINE_NODES_TO_SAVE]
            print(f"  ✅ Calculated Top {len(top_degree_nodes)} Degree nodes.")
        except Exception as e_deg:
            print(f"  ❌ Error calculating Top Degree nodes: {e_deg}")
            traceback.print_exc(limit=1)


        # --- 2. Calculate Top RWR Nodes ---
        print("  Calculating Top RWR nodes from target...")
        try:
            # Need the target node ID to start RWR
            if not local_INITIAL_SEED_NODES_IDX:
                warnings.warn("⚠️ Cannot calculate RWR baseline: No seed index available in 'local_INITIAL_SEED_NODES_IDX'. Skipping RWR.")
            else:
                # Get ID from the first seed index (assuming seeds mode)
                target_node_id = local_node_list[local_INITIAL_SEED_NODES_IDX[0]]

                if target_node_id not in local_G:
                    warnings.warn(f"⚠️ Cannot calculate RWR baseline: Target node '{target_node_id}' not found in graph 'local_G'. Skipping RWR.")
                elif local_G.number_of_edges() == 0:
                    warnings.warn("⚠️ Cannot calculate RWR baseline: Graph 'local_G' has no edges. Skipping RWR.")
                else:
                    # Define personalization vector: 1.0 for the target node, 0.0 for others
                    personalization = {node: 0.0 for node in local_G.nodes()}
                    personalization[target_node_id] = 1.0

                    # Calculate personalized PageRank (RWR) scores.
                    # alpha=0.85 is the damping factor, i.e. a restart probability of 0.15.
                    # nx.pagerank accepts no 'seed' argument; its power iteration is deterministic.
                    rwr_scores = nx.pagerank(local_G, alpha=0.85, personalization=personalization,
                                             weight='weight', max_iter=1000, tol=1.0e-6)

                    # Sort nodes by RWR score in descending order and get the top N
                    top_rwr_nodes = sorted(rwr_scores, key=rwr_scores.get, reverse=True)[:N_TOP_BASELINE_NODES_TO_SAVE]
                    print(f"  ✅ Calculated Top {len(top_rwr_nodes)} RWR nodes from '{target_node_id}'.")

        except Exception as e_rwr:
            print(f"  ❌ Error calculating Top RWR nodes: {e_rwr}")
            traceback.print_exc(limit=1)


        # --- 3. Save Baseline Node Lists to File ---
        print("\n  Saving baseline node lists to file...")
        try:
            with open(baseline_output_filepath, 'w') as f:
                f.write("--- Baseline: Top Nodes by Degree ---\n")
                # Write node IDs, one per line
                for node_id in top_degree_nodes:
                    f.write(f"{node_id}\n")

                f.write("\n--- Baseline: Top Nodes by RWR from Target ---\n")
                # Write node IDs, one per line
                for node_id in top_rwr_nodes:
                    f.write(f"{node_id}\n")

                # Add placeholder for Leiden if that analysis is added later
                f.write("\n--- Baseline: Target Node Community (Leiden) ---\n")
                f.write("N/A (Leiden baseline not calculated in setup)\n") # Explicitly state not included here


            print(f"  ✅ Static baseline node lists saved to: {baseline_output_filepath}")

        except Exception as e_save:
            print(f"  ❌ Error saving static baseline node lists: {e_save}")
            traceback.print_exc(limit=1)

    except Exception as e_calc_save_block:
        # Catch any error in the main calculation/save block if not caught above
        print(f"❌ An unexpected error occurred during baseline calculation or saving: {e_calc_save_block}")
        traceback.print_exc()

else: # baselines_calc_error was True from prereqs
    print("Skipping static baseline calculation and saving due to missing prerequisites.")

print("\n--- Cell 3.1: Static Baseline Calculation and Saving Complete ---")


--- Cell 3.1: Calculate and Save Static Baselines (Degree and RWR) (2025-04-28 20:46:19) ---
Static baseline node lists will be saved to: simulation_results/baseline_nodes.txt
Will calculate and save top 100 nodes for each baseline.

Calculating static baseline node lists...
  Calculating Top Degree nodes...
  ✅ Calculated Top 100 Degree nodes.
  Calculating Top RWR nodes from target...
  ✅ Calculated Top 100 RWR nodes from '9606.ENSP00000287295'.

  Saving baseline node lists to file...
  ✅ Static baseline node lists saved to: simulation_results/baseline_nodes.txt

--- Cell 3.1: Static Baseline Calculation and Saving Complete ---
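
Cell 3.1 writes `baseline_nodes.txt` as plain text: each section starts with a `--- Baseline: ... ---` header, followed by one node ID per line, and the Leiden section holds a single `N/A` placeholder. The following is a minimal sketch of how an analysis notebook might read that file back. The function name `parse_baseline_nodes` is hypothetical, and the sample uses toy IDs rather than real STRING identifiers.

```python
import io


def parse_baseline_nodes(lines):
    """Parse the sectioned baseline file into {section title: [node IDs]}.

    Hypothetical helper; expects the layout written by Cell 3.1.
    """
    prefix, suffix = "--- Baseline:", "---"
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator between sections
        if line.startswith(prefix) and line.endswith(suffix):
            # Header line: strip the markers to get the section title
            current = line[len(prefix):-len(suffix)].strip()
            sections[current] = []
        elif current is not None and not line.startswith("N/A"):
            # Node ID line; the "N/A" placeholder is skipped
            sections[current].append(line)
    return sections


# Toy file content in the same layout (IDs are placeholders)
sample = io.StringIO(
    "--- Baseline: Top Nodes by Degree ---\n"
    "node_a\n"
    "node_b\n"
    "\n"
    "--- Baseline: Top Nodes by RWR from Target ---\n"
    "node_b\n"
    "\n"
    "--- Baseline: Target Node Community (Leiden) ---\n"
    "N/A (Leiden baseline not calculated in setup)\n"
)
baselines = parse_baseline_nodes(sample)
print(baselines["Top Nodes by Degree"])  # → ['node_a', 'node_b']
```

The function accepts any iterable of lines, so an open file handle for `simulation_results/baseline_nodes.txt` can be passed directly. Because 100 nodes are saved per baseline, a caller can slice each list down to the top-N it needs.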
