## ATHENA: Residue-Level IDR Classification Tutorial
This Google Colab notebook provides a step-by-step guide for running the ATHENA residue-level (per-amino-acid) classifier for intrinsically disordered regions (IDRs), ATHENA_IDR_classification.py.

This pipeline operates in two main stages:

1. **Embedding Generation:** It uses an ESM model (e.g., esmc_300m) to generate per-residue embeddings for each protein in your FASTA file.

2. **IDR Prediction:** It feeds these embeddings into the trained Bi-LSTM + Transformer model to predict a disorder score and label for every single residue.
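
To make the second stage concrete, the sketch below shows how a list of per-residue disorder scores could be turned into labels and contiguous regions. It is illustrative only: the 0.5 cutoff and the helper names `scores_to_labels` and `labels_to_regions` are assumptions, not part of ATHENA, whose script produces the scores and labels for you.

```python
from typing import List, Tuple

def scores_to_labels(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Label a residue 1 (disordered) when its score meets the assumed threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def labels_to_regions(labels: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of 1s into (start, end) pairs, 1-based and inclusive."""
    regions = []
    start = None
    for pos, label in enumerate(labels, start=1):
        if label == 1 and start is None:
            start = pos                       # a disordered run begins here
        elif label != 1 and start is not None:
            regions.append((start, pos - 1))  # the run ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, len(labels)))  # run extends to the C-terminus
    return regions

# Toy scores for a 7-residue sequence (made up for illustration)
toy_scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.95]
toy_labels = scores_to_labels(toy_scores)
print(toy_labels)                     # [1, 1, 0, 0, 1, 1, 1]
print(labels_to_regions(toy_labels))  # [(1, 2), (5, 7)]
```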

### Expected Runtime

**Roughly 1 to 10 minutes**, depending on the number of sequences to be classified.

### Step 1: Clone Repository and Set Up Environment
First, we will clone the ATHENA GitHub repository, move into the new directory, and install the required Python libraries. This includes the esm library, which is required for generating protein embeddings.

In [None]:
# 1. Clone the Repository

!git clone https://github.com/Shimizu-team/ATHENA
%cd ATHENA

# 2. Install Dependencies
!pip install torch pandas esm httpx tabulate
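
Before moving on, it can help to confirm that the installed packages are importable in the current runtime. The sketch below uses only the standard library; `check_packages` is a hypothetical helper, not part of ATHENA.

```python
import importlib.util

def check_packages(names):
    """Return {module_name: True/False} depending on whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Module names matching the pip install line above
report = check_packages(["torch", "pandas", "esm", "httpx", "tabulate"])
for name, found in report.items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

Any package reported as MISSING can be reinstalled with `pip` before continuing.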


### Step 2: Prepare Python Scripts
The main script ATHENA_IDR_classification.py imports the model architecture from a file named Transformer_LSTM.py, but the repository ships this model as ATHENA_IDR_Model.py.

We therefore rename ATHENA_IDR_Model.py to Transformer_LSTM.py before importing any modules.

In [None]:
# Rename Model Script for Import
import os

model_source = "ATHENA_IDR_Model.py"
model_target = "Transformer_LSTM.py"

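if os.path.exists(model_target):
    # Already renamed (e.g. this cell was run before); nothing to do
    print(f"'{model_target}' already exists; skipping rename.")
elif os.path.exists(model_source):
    os.rename(model_source, model_target)
    print(f"Renamed '{model_source}' to '{model_target}' for import.")
else:
    print(f"Could not find '{model_source}'. Did the git clone fail or was the file misnamed?")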
    

### Step 3: Import Libraries
We import the necessary modules.

In [None]:
# Import Libraries
import os
import sys
import torch
import pandas as pd
import argparse

# Import the main function
try:
    # We rename the import for clarity
    from ATHENA_IDR_classification import main as run_idr_prediction
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Please ensure 'Transformer_LSTM.py' (renamed from 'ATHENA_IDR_Model.py') is present.")
else:
    # Report success only when the import actually succeeded
    print("Core modules imported successfully.")

### Step 4: Prepare Model Parameters
The trained model weights are stored as a zip archive inside the repository. We now extract them into the same directory.


In [None]:
# Define the directory and the zip file
model_dir = "ATHENA_IDR_model_params"
zip_filename = "ATHENA_IDR_Weights.pt.zip"
zip_filepath = os.path.join(model_dir, zip_filename)

print(f"Attempting to unzip '{zip_filepath}' into its parent directory '{model_dir}'...")

# Check if the zip file exists at the specified path
if not os.path.isfile(zip_filepath):
    print(f"\nError: File not found at '{zip_filepath}'.")
    print("Please ensure the GitHub repository was cloned correctly and this file exists.")
else:
    print(f"\nExtracting '{zip_filepath}'...")
    
    !unzip -q -o "{zip_filepath}" -d "{model_dir}"
    
    print(f"Unzip command finished. Files extracted to '{model_dir}'.")
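
If the `unzip` shell command is unavailable (for example, when running outside Colab), Python's standard `zipfile` module can do the same job. This is a minimal sketch: `extract_weights` is a hypothetical helper, and the paths in the commented usage simply mirror the cell above.

```python
import os
import zipfile

def extract_weights(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"Zip archive not found: {zip_path}")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)  # overwrites existing files, like `unzip -o`
        return archive.namelist()

# Usage with the paths from this step:
# members = extract_weights("ATHENA_IDR_model_params/ATHENA_IDR_Weights.pt.zip",
#                           "ATHENA_IDR_model_params")
# print(members)
```

Returning the member names makes it easy to confirm that the expected weights file was actually in the archive.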
    

### Step 5: Define Run Configuration
The original `ATHENA_IDR_classification.py` script is designed to be run from the command line. To call it from a notebook, we set `sys.argv` to the argument list the script would normally receive.

The `--fasta_file` argument selects the input FASTA file; change it to classify your own sequences.

In [None]:
# Define Configuration (via sys.argv)
import sys

# These arguments match the 'README' and the script defaults
args_list = [
    'ATHENA_IDR_classification.py',
    
    '--fasta_file',
    'input/example_sequences.fasta',
    
    '--idr_model_path',
    'ATHENA_IDR_model_params/ATHENA_IDR_Weights.pt', 
    
    '--output_csv',
    'output/IDR_predictions.csv',
    
    '--esm_model_name',
    'esmc_300m', 
    
    '--batch_size',
    '8'
]

# Set sys.argv
sys.argv = args_list

print("--- Running Configuration ---")
print(" ".join(args_list))
print("-----------------------------")
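
Before launching the run, it can be worth checking that the FASTA file parses and that the output directory exists. The sketch below is a convenience under stated assumptions, not part of ATHENA: `count_fasta_records` and `prepare_run` are hypothetical helpers, and whether the script creates `output/` on its own is not confirmed here, so the directory is created defensively.

```python
import os

def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def prepare_run(fasta_path, output_csv):
    """Validate the input FASTA and make sure the output directory exists."""
    if not os.path.isfile(fasta_path):
        raise FileNotFoundError(f"FASTA file not found: {fasta_path}")
    n_records = count_fasta_records(fasta_path)
    if n_records == 0:
        raise ValueError(f"No FASTA records found in {fasta_path}")
    out_dir = os.path.dirname(output_csv)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return n_records

# Usage with the values from args_list:
# n = prepare_run("input/example_sequences.fasta", "output/IDR_predictions.csv")
# print(f"{n} sequences ready for prediction")
```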

### Step 6: Run Prediction
With the configuration set, we can now call the main function from the script. This executes the full pipeline: embedding generation followed by IDR prediction.

In [None]:
# Execute Prediction
print("Starting End-to-End IDR Prediction")
print("STEP 1: Generating Protein Embeddings (this may take time)...")

try:
    # Call the main function we imported
    run_idr_prediction()
    print("\nPipeline Complete")

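except ImportError as e:
    # Show the underlying error instead of assuming a single cause
    print(f"Import error while running the pipeline: {e}")
    print("If 'EnhancedLSTMTransformerIDRPredictor' was not found, ensure Step 2 (renaming the .py file) completed successfully.")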

### Step 7: Review and Interpret Results
The script saves all predictions to a single CSV file. We load it with pandas to see the results.

The output contains one row for every amino acid in your input file.

In [None]:
# Load and Display Results
import pandas as pd
import os

output_file = 'output/IDR_predictions.csv'

if os.path.exists(output_file):
    print(f"Loading results from '{output_file}'...")
    
    df = pd.read_csv(output_file)
    
    print(f"\nTotal residues predicted: {len(df)}")
    
    # Display the first 20 predictions
    print("\n--- First 20 Residue Predictions ---")
    print(df.head(20).to_markdown(index=False))
    
    # Display predictions for the start of the second protein
    print("\n--- Predictions for 'protein_sample_02' ---")
    print(df[df['Protein ID'] == 'protein_sample_02'].head(20).to_markdown(index=False))

else:
    print(f"Output file '{output_file}' not found.")
    print("Please ensure Step 6 (Run Prediction) completed without errors.")
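
Because the CSV has one row per residue, a per-protein summary is often easier to read. The sketch below computes the fraction of residues labelled disordered for each protein. It relies on assumptions: `disordered_fraction` is a hypothetical helper, the `"Label"` column name is a placeholder (check `df.columns` for the real name), and labels are assumed to be encoded as 0/1. Only `"Protein ID"` is taken from the cell above.

```python
from collections import defaultdict

def disordered_fraction(rows, id_key="Protein ID", label_key="Label"):
    """Return {protein_id: fraction of residues labelled disordered (label == 1)}.

    label_key is a placeholder; replace it with the actual label column name.
    """
    totals = defaultdict(int)
    disordered = defaultdict(int)
    for row in rows:
        pid = row[id_key]
        totals[pid] += 1
        if int(row[label_key]) == 1:
            disordered[pid] += 1
    return {pid: disordered[pid] / totals[pid] for pid in totals}

# Toy rows standing in for df.to_dict("records"); not real ATHENA output
toy_rows = [
    {"Protein ID": "p1", "Label": 1},
    {"Protein ID": "p1", "Label": 0},
    {"Protein ID": "p2", "Label": 1},
    {"Protein ID": "p2", "Label": 1},
]
print(disordered_fraction(toy_rows))  # {'p1': 0.5, 'p2': 1.0}
```

With the real results, the same function can be fed `df.to_dict("records")` once `label_key` matches the CSV header.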