## ATHENA: Protein-Level IDP Classification Tutorial
This Google Colab notebook provides a step-by-step guide for running the ATHENA protein-level classifier (ATHENA_IDP_classification.py).

It walks through cloning the repository, setting up the environment, preparing the model parameters and input data, and running the prediction pipeline to generate the ATHENA Score for your protein sequences.

### Expected Runtime

**Anywhere from 1-10 minutes**, depending on the number of sequences to be scored.

### Step 1: Clone Repository and Set Up Environment
First, we will clone the ATHENA GitHub repository and move into the new directory. Then, we'll install the required Python libraries.

In [None]:
# Colab Cell 1: Clone Repo & Install Dependencies

# Clone the Repository
!git clone https://github.com/Shimizu-team/ATHENA
%cd ATHENA

# Install Dependencies
!pip install torch transformers peft tqdm einops

print("\nRepository cloned and dependencies installed.")
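
As an optional sanity check after installation, the sketch below reports which of the expected packages cannot be found on the import path. The helper name `missing_packages` is hypothetical and not part of the ATHENA repository; the package list mirrors the install command above plus pandas, which Step 2 imports.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages from the pip command above, plus pandas (imported in Step 2).
required = ["torch", "transformers", "peft", "tqdm", "einops", "pandas"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is listed as missing, re-running the install cell (or restarting the runtime) is usually the first thing to try.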

### Step 2: Import Required Libraries
Now that we are in the repository's directory, we can import the necessary modules from the repository (config and ATHENA_IDP_classification), along with standard libraries.

In [None]:
# Import Libraries
import os
import sys
import torch
import pandas as pd
import argparse
import json
from google.colab import files

from config import Config
from ATHENA_IDP_classification import predict


### Step 3: Prepare Model Parameters and Input Data
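This step prepares the files the model needs to run. Step 4 also expects the input sequences at input/example_sequences.fasta, so make sure that file exists before running the prediction.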

**3. Unzip Model Parameters**

The model parameters ship as a zip file (ATHENA_IDP_model_params.zip) inside the repository's ATHENA_IDP_model_params directory. The script expects the unzipped files to be in that same directory.

In [None]:
# Define the directory and the zip file
model_dir = "ATHENA_IDP_model_params"
zip_filename = "ATHENA_IDP_model_params.zip"
zip_filepath = os.path.join(model_dir, zip_filename)

print(f"Attempting to unzip '{zip_filepath}' into its parent directory '{model_dir}'...")

# Check if the zip file exists at the specified path
if not os.path.isfile(zip_filepath):
    print(f"\nError: File not found at '{zip_filepath}'.")
    print("Please ensure the GitHub repository was cloned correctly and this file exists.")
else:
    print(f"\nExtracting '{zip_filepath}'...")
    
    !unzip -q -o "{zip_filepath}" -d "{model_dir}"
    
    print(f"Unzip command finished. Files extracted to '{model_dir}'.")
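
If the shell `unzip` command is unavailable, the same extraction can be sketched in pure Python with the standard-library `zipfile` module. The helper name `extract_zip` is hypothetical; the paths are the ones used in the cell above.

```python
import os
import zipfile


def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"Zip archive not found: {zip_path}")
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites existing files, similar to `unzip -o`
        archive.extractall(target_dir)
        return archive.namelist()


# Same paths as in the cell above; skipped silently if the archive is absent.
zip_filepath = os.path.join("ATHENA_IDP_model_params", "ATHENA_IDP_model_params.zip")
if os.path.isfile(zip_filepath):
    names = extract_zip(zip_filepath, "ATHENA_IDP_model_params")
    print(f"Extracted {len(names)} entries.")
```

Returning the member names makes it easy to confirm which files actually landed in the directory.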
    

### Step 4: Define Run Configuration

In [None]:
# Colab Cell 4: Define Configuration
os.makedirs("output", exist_ok=True)
manual_args = argparse.Namespace()

# Basic Settings
manual_args.output_type = "before_softmax" # "IDP_probability" or "before_softmax"
manual_args.base_model = "Synthyra/ESMplusplus_small"
manual_args.classifier_params_path = "ATHENA_IDP_model_params" # Relative path to unzipped folder
manual_args.fasta_path = "input/example_sequences.fasta" # Relative path to the input FASTA file
manual_args.output_dir = "output"

# Adapter Settings
# Simulates: --adapter_paths "IDP_LoRA=ATHENA_IDP_model_params"
manual_args.adapter_paths = {"IDP_LoRA": "ATHENA_IDP_model_params"} 
manual_args.adapter_weights = None # Not needed for a single adapter

# Model Settings 
manual_args.num_labels = 2
manual_args.max_length = 10000
manual_args.batch_size = 32 # Adjust based on Colab GPU memory (e.g., T4)

# Output Settings
manual_args.title = "IDP_Inference_Colab" # Title for output files

# Create the final config object
conf = Config(manual_args)

print("--- Running Configuration ---")
print(json.dumps(vars(conf), indent=2))
print("-----------------------------")
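
Before running the prediction, it can help to check the FASTA input against the configuration. The following is a minimal sketch using only the standard library: `read_fasta` and `too_long` are hypothetical helpers, not part of ATHENA, and the sketch assumes the full header line (without `>`) serves as the sequence ID. How the ATHENA script itself treats sequences longer than max_length is not covered here.

```python
import os


def read_fasta(path):
    """Parse a FASTA file into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequence lines may be wrapped; collect and join them later
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def too_long(records, max_length):
    """Return the IDs whose sequence length exceeds max_length."""
    return [name for name, seq in records.items() if len(seq) > max_length]


# Path and limit mirror manual_args.fasta_path and manual_args.max_length above.
fasta_path = "input/example_sequences.fasta"
if os.path.isfile(fasta_path):
    records = read_fasta(fasta_path)
    print(f"{len(records)} sequences; over 10000 residues: {too_long(records, 10000)}")
else:
    print(f"FASTA file not found: {fasta_path}")
```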

### Step 5: Run Prediction 
With the modules imported, the model parameters unzipped, and the configuration set, we can now call the predict function.

This will:

1. Download the base model (Synthyra/ESMplusplus_small) from Hugging Face.

2. Load the base model and apply the LoRA adapter from ATHENA_IDP_model_params.

3. Load the fine-tuned classifier weights from classifier_params.pth.

4. Process the sequences from example_sequences.fasta in batches.

5. Save the results to the output directory.

In [None]:
import torch

print("--- Starting Prediction ---")
try:
    # Call the main predict function from ATHENA_IDP_classification.py
    predictions = predict(conf)
    print("\nPrediction Complete")

except FileNotFoundError as e:
    print("\nExecution Error")
    print(e)
    print("Prediction failed. Please double-check that your model files (e.g., 'classifier_params.pth')")
    print("are correctly located inside the 'ATHENA_IDP_model_params' directory after unzipping,")
    print("and that the FASTA file set in manual_args.fasta_path exists.")
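
When a FileNotFoundError does occur, listing what is actually present in the model directory often narrows the problem down quickly. Below is a small diagnostic sketch; `missing_files` is a hypothetical helper, and `classifier_params.pth` is the only file name taken from this tutorial (any other expected files would need to be added by hand).

```python
import os


def missing_files(directory, expected_names):
    """Return the expected file names that are absent from `directory`."""
    return [
        name for name in expected_names
        if not os.path.isfile(os.path.join(directory, name))
    ]


model_dir = "ATHENA_IDP_model_params"
if os.path.isdir(model_dir):
    print("Contents:", sorted(os.listdir(model_dir)))
print("Missing:", missing_files(model_dir, ["classifier_params.pth"]))
```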

### Step 6: Review and Interpret Results
The predict function saves the results as a .pt file and also prints a summary. Let's load the saved file with torch and display it with pandas for a cleaner view.

Since we set output_type to "before_softmax" in Step 4, the output is the raw logit score (the "ATHENA Score").
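
To illustrate what "before softmax" means in general, the sketch below applies a numerically stable softmax to a pair of class logits. The logit values are made up for illustration, and the class order (index 0 = non-IDP, index 1 = IDP) is an assumption; this is not a description of how the ATHENA script stores or converts its scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration only; assumed order: [non-IDP, IDP].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
print(f"P(IDP) = {probs[1]:.3f}")
```

Softmax preserves the ordering of the logits within a pair, so the class with the larger logit always receives the larger probability.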

In [None]:
import torch
import pandas as pd
import os

# Determine the output file path based on our config
if conf.output_type == "before_softmax":
    output_file = os.path.join(conf.output_dir, f"IDP_score_before_softmax_{conf.title}.pt")
    column_name = "ATHENA Score (Logit)"
else:
    output_file = os.path.join(conf.output_dir, f"IDP_score_{conf.title}.pt")
    column_name = "IDP Probability"

# Check if the file was created
if os.path.exists(output_file):
    print(f"Loading results from '{output_file}'...")
    
    # Load the saved dictionary (mapping to CPU for safety)
    data = torch.load(output_file, map_location='cpu')
    
    # Convert to Pandas DataFrame for nice formatting
    df = pd.DataFrame(list(data.items()), columns=['Sequence ID', column_name])
    
    # Sort by score, descending
    df_sorted = df.sort_values(by=column_name, ascending=False).reset_index(drop=True)
    
    print(f"\nPrediction Results ({len(df_sorted)} sequences)")
    
    # Display as a clean markdown table (to_markdown requires the optional 'tabulate' package)
    print(df_sorted.to_markdown(index=False))

else:
    print(f"Output file '{output_file}' not found.")
    print("Please ensure Step 5 completed without errors.")
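
To keep the scores outside the Colab session, one option is to write them to a CSV file. This is a minimal sketch with the standard-library `csv` module; `write_scores_csv` is a hypothetical helper, and it assumes the loaded dictionary maps each sequence ID to a single value that `float()` can convert.

```python
import csv


def write_scores_csv(scores, csv_path, score_column="ATHENA Score (Logit)"):
    """Write {sequence_id: score} to a CSV file, highest score first."""
    rows = sorted(
        ((seq_id, float(value)) for seq_id, value in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Sequence ID", score_column])
        writer.writerows(rows)
    return csv_path
```

In the notebook this could be called as `write_scores_csv(data, "output/athena_scores.csv")` (the file name is arbitrary), followed by `files.download("output/athena_scores.csv")` using the `files` module imported from google.colab in Step 2.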