<a href="https://colab.research.google.com/github/JKourelis/BUSCO-Colab/blob/main/BUSCO_ProteomeQualityAssessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title **BUSCO v6 Proteome Quality Assessment**
# @markdown ---
# @markdown ### 🧬 **Benchmarking Universal Single-Copy Orthologs**
# @markdown
# @markdown This notebook runs **BUSCO v6.0.0** analysis on proteome (protein) FASTA files to assess completeness and quality based on evolutionarily-informed expectations of gene content.
# @markdown
# @markdown **What BUSCO Does:**
# @markdown - Identifies single-copy ortholog genes expected in your lineage
# @markdown - Calculates completeness metrics (Complete, Fragmented, Missing)
# @markdown - Validates genome assembly or gene prediction quality
# @markdown
# @markdown **Required Input:**
# @markdown - Proteome file in FASTA format (.faa, .fasta, .fa)
# @markdown - Selection of appropriate lineage dataset (odb12 recommended)
# @markdown
# @markdown **Expected Runtime:** 1-30 minutes depending on proteome size and runtime selection
# @markdown
# @markdown ---
# @markdown
# @markdown ### ⚡ **IMPORTANT: Runtime Selection for Best Performance**
# @markdown
# @markdown Before running this notebook, select **TPU runtime** for maximum speed:
# @markdown 1. Go to **Runtime → Change runtime type**
# @markdown 2. Set **Hardware accelerator** to **TPU v5e-1** or **TPU v6e-1**
# @markdown 3. This gives you **24-44 CPU threads** instead of 2
# @markdown
# @markdown **Performance comparison:**
# @markdown - **Standard GPU/CPU runtime**: 2 threads → 30-60 min for large proteomes
# @markdown - **TPU v5e-1 runtime**: 24 threads → 5-10 min for large proteomes (12x faster)
# @markdown - **TPU v6e-1 runtime**: 44 threads → 3-8 min for large proteomes (22x faster)
# @markdown
# @markdown **Note:** You're using the TPU's CPU host for computation, not the TPU cores themselves. BUSCO is CPU-intensive (HMMER/gene prediction), so TPU runtime is optimal.
# @markdown
# @markdown ---
# @markdown
# @markdown 📚 **Documentation:** [BUSCO Official Site](https://busco.ezlab.org/) | 📄 **Paper:** [Manni et al. 2021](https://doi.org/10.1093/molbev/msab199)
# @markdown
# @markdown ---

print("╔════════════════════════════════════════════════════════════╗")
print("║           BUSCO v6 Proteome Analysis Notebook            ║")
print("║           Using OrthoDB v12 Datasets (odb12)             ║")
print("╚════════════════════════════════════════════════════════════╝")
print("\n⚡ PERFORMANCE TIP:")
print("   For best results, use TPU runtime:")
print("   • TPU v5e-1: 24 threads (12x faster)")
print("   • TPU v6e-1: 44 threads (22x faster)")
print("   Runtime → Change runtime type → TPU v5e-1 or v6e-1")
print("   Standard GPU/CPU runtime only provides 2 threads\n")
print("📋 Follow the cells in order:")
print("   1. Setup Conda & Restart Kernel")
print("   2. Install BUSCO v6 and dependencies")
print("   3. Upload your proteome file")
print("   4. Browse available lineage datasets")
print("   5. Download your chosen lineage")
print("   6. Run BUSCO analysis\n")

╔════════════════════════════════════════════════════════════╗
║           BUSCO Proteome Analysis Notebook                ║
║                    Ready to Start                          ║
╚════════════════════════════════════════════════════════════╝

📋 Follow the cells in order:
   1. Connect to Google Drive
   2. Install BUSCO and dependencies
   3. Select your lineage dataset
   4. Upload your proteome file
   5. Run BUSCO analysis



In [1]:
# @title **Connect to Google Drive** { display-mode: "form" }
# @markdown Mount your Google Drive to access files and save results.

import os
from google.colab import drive

# Mount Google Drive
try:
    drive.mount('/content/drive', force_remount=False)
    print("✅ Google Drive connected successfully!")
    print(f"📁 Drive mounted at: /content/drive/MyDrive/")

    # Create BUSCO results directory if it doesn't exist
    results_dir = "/content/drive/MyDrive/BUSCO_Results"
    os.makedirs(results_dir, exist_ok=True)
    print(f"📂 Results will be saved to: {results_dir}")

except Exception as e:
    print(f"❌ Error mounting Google Drive: {e}")
    print("Please authorize access when prompted and try again.")

Mounted at /content/drive
✅ Google Drive connected successfully!
📁 Drive mounted at: /content/drive/MyDrive/
📂 Results will be saved to: /content/drive/MyDrive/BUSCO_Results


In [2]:
# @title Setup Conda & Restart Kernel
# @markdown Run this cell first to install the Mamba/Conda environment.
# @markdown The kernel will restart automatically (~1 minute).
# @markdown After the restart, proceed to Cell 2.

import subprocess
import sys

# Check if condacolab is already installed to avoid re-running on a prepped machine
try:
    import condacolab
    print("✅ Condacolab is already installed and the environment is ready.")
    print("\n➡️ Proceed to Cell 2 to install BUSCO.")
except ImportError:
    print("🔧 Installing Conda environment for the first time...")

    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "condacolab"], check=True)
    import condacolab

    print("\n✅ Condacolab installed.")
    print("⚠️ The kernel will now restart automatically to activate the environment.")
    print("   This is normal. After the restart, please run Cell 2.")

    condacolab.install()

🔧 Installing Conda environment for the first time...

✅ Condacolab installed.
⚠️ The kernel will now restart automatically to activate the environment.
   This is normal. After the restart, please run Cell 2.
⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:06
🔁 Restarting kernel...


In [3]:
# @title Install BUSCO
# @markdown Run this cell AFTER the kernel has restarted from the previous cell.
# @markdown This will install BUSCO and all its dependencies (~3-5 minutes).

import subprocess
import sys

# Step 1: Clean up any previous failed environments to ensure a fresh start
print("="*70)
print("STEP 1: Cleaning up previous installation attempts...")
print("="*70)
subprocess.run("mamba env remove -n busco -y", shell=True, capture_output=True)
print("✅ Previous environments removed.\n")

# Step 2: Install BUSCO and all dependencies with a single command
print("="*70)
print("STEP 2: Installing BUSCO v6...")
print("="*70)
print("⏳ This will take 3-5 minutes...\n")

# This one command creates a new environment and installs Python and BUSCO.
# Mamba handles all dependencies (HMMER, pandas, biopython, etc.) correctly.
install_cmd = 'mamba create -n busco -y -c conda-forge -c bioconda "python=3.11" "busco=6.0.0"'

print(f"🔩 Running command: {install_cmd}\n")

# Stream the output so you can monitor the progress
process = subprocess.Popen(
    install_cmd,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    bufsize=1
)
for line in iter(process.stdout.readline, ''):
    print(line, end='')
process.wait()

# Step 3: Final Verification
print("\n" + "="*70)
print("STEP 3: Verifying installation...")
print("="*70)
if process.returncode == 0:
    verify_cmd = "mamba run -n busco busco --version"
    busco_version = subprocess.run(verify_cmd, shell=True, capture_output=True, text=True)

    if busco_version.returncode == 0:
        print(f"🎉 SUCCESS! {busco_version.stdout.strip()} is installed correctly.")
        print("\n💡 To run BUSCO, you must prefix your command like this:")
        print("   !mamba run -n busco busco -i your_proteins.faa -m proteins ...")
    else:
        print("❌ ERROR: Verification failed. The `busco` command is not working.")
        print(busco_version.stderr)
else:
    print(f"❌ ERROR: Mamba installation failed with exit code {process.returncode}.")
    print("   Please review the log above.")

STEP 1: Cleaning up previous installation attempts...
✅ Previous environments removed.

STEP 2: Installing BUSCO v6...
⏳ This will take 3-5 minutes...

🔩 Running command: mamba create -n busco -y -c conda-forge -c bioconda "python=3.11" "busco=6.0.0"

Transaction

  Prefix: /usr/local/envs/busco

  Updating specs:

   - python=3.11
   - busco=6.0.0


  Package                             Version  Build                 Channel           Size
─────────────────────────────────────────────────────────────────────────────────────────────
  Install:
─────────────────────────────────────────────────────────────────────────────────────────────

  + font-ttf-dejavu-sans-mono            2.37  hab24e00_0            conda-forge      397kB
  + python_abi                           3.11  8_cp311               conda-forge        7kB
  + font-ttf-inconsolata                3.000  h77eed37_0            conda-forge       97kB
  + font-ttf-source-code-pro            2.038  h77eed37_0            conda-forg

In [1]:
# @title Fetch and Display Lineages
# @markdown Run this cell to see all available BUSCO datasets.
# @markdown Copy the name of the lineage you want from the output below.

import subprocess
import re

print("📥 Fetching all available BUSCO lineage datasets...")

# Command to get the list
cmd = "mamba run -n busco busco --list-datasets"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

if result.returncode != 0:
    print("\n❌ FATAL ERROR: Could not retrieve dataset list.")
    print("   Please ensure the BUSCO installation (Cell 2 in the previous step) ran successfully.")
else:
    # Find and sort all lineage names
    output = result.stdout
    lineage_pattern = r'([a-zA-Z_]+_odb\d+)'
    all_lineages = sorted(list(set(re.findall(lineage_pattern, output))))

    print(f"✅ Found {len(all_lineages)} total lineages.\n")
    print("="*70)
    print("COPY a lineage name from the list below (e.g., 'metazoa_odb10')")
    print("="*70)

    # Print the full list for the user
    for lineage in all_lineages:
        print(lineage)

    print("\n➡️ Now, paste your chosen lineage into the text field in the next cell and run it.")

📥 Fetching all available BUSCO lineage datasets...
✅ Found 524 total lineages.

COPY a lineage name from the list below (e.g., 'metazoa_odb10')
acari_odb12
acetobacter_odb12
acetobacteraceae_odb12
acholeplasmataceae_odb12
achromobacter_odb12
acidobacteriaceae_odb12
acidobacteriota_odb12
acidovorax_odb12
acinetobacter_odb12
aconoidasida_odb12
actinomadura_odb12
actinomyces_odb12
actinomycetaceae_odb12
actinomycetes_odb12
actinomycetota_odb12
actinoplanes_odb12
actinopterygii_odb12
aculeata_odb12
aerococcaceae_odb12
aeromicrobium_odb12
aeromonadaceae_odb12
aeromonas_odb12
afipia_odb12
agaricales_odb12
agaricomycetes_odb12
agrobacterium_group_odb12
agrobacterium_odb12
agromyces_odb12
ajellomycetaceae_odb12
alcaligenaceae_odb12
alcanivorax_odb12
algoriphagus_odb12
alicyclobacillaceae_odb12
alicyclobacillus_odb12
alistipes_odb12
alphabaculovirus_odb10
alphaherpesvirinae_odb10
alphaproteobacteria_odb12
alteromonadales_odb12
alteromonas_odb12
alveolata_odb12
amoebozoa_odb12
amycolatopsis_odb1

In [2]:
# @title Select and Download Your Chosen Lineage
# @markdown 1. Run the cell above and copy a lineage name.
# @markdown 2. Paste it into the `lineage_to_download` field below.
# @markdown 3. Run this cell.

import os
import subprocess
import sys

# @markdown ---
lineage_to_download = "solanales_odb12" # @param {type:"string"}
# @markdown ---

if not lineage_to_download.strip():
    print("❌ ERROR: The `lineage_to_download` field is empty. Please paste a lineage name and run again.")
    sys.exit()

final_lineage = lineage_to_download.strip()
print(f"✅ You have selected: **{final_lineage}**")

# Set up download path and config file
busco_downloads_path = "/content/busco_downloads"
os.makedirs(busco_downloads_path, exist_ok=True)

config_path = '/content/busco_config.ini'
with open(config_path, 'w') as f:
    f.write(f"[busco]\ndownload_path = {busco_downloads_path}\n")

print(f"\n📥 Downloading '{final_lineage}' dataset...")
print("   (This may take 1-5 minutes)\n")

# Run the download command
cmd_download = f"mamba run -n busco busco --download {final_lineage} --config {config_path}"
result = subprocess.run(cmd_download, shell=True, capture_output=True, text=True)

# Verify the download - accept multiple success indicators
success_indicators = [
    "successfully downloaded",
    "is the last available version",
    "Decompressing file"
]

download_successful = result.returncode == 0 and any(indicator in result.stdout for indicator in success_indicators)

if download_successful:
    print(f"✅ **SUCCESS!** Dataset '{final_lineage}' is ready.")

    lineage_path = os.path.join(busco_downloads_path, "lineages", final_lineage)

    if os.path.exists(lineage_path):
        print(f"📂 Dataset location: {lineage_path}")

        # Save the name for the final analysis step
        with open('/content/selected_lineage.txt', 'w') as f:
            f.write(final_lineage)

        print("\n➡️ You are now ready to run the BUSCO analysis.")
    else:
        print(f"⚠️ Download reported success, but the directory was not found at:")
        print(f"   {lineage_path}")
        print("\n🔍 Check the output above for the actual download location.")
else:
    print(f"❌ **ERROR:** Failed to download '{final_lineage}'")
    print("\n**BUSCO Error Message:**")
    print(result.stderr if result.stderr else result.stdout)

✅ You have selected: **solanales_odb12**

📥 Downloading 'solanales_odb12' dataset...
   (This may take 1-5 minutes)

✅ **SUCCESS!** Dataset 'solanales_odb12' is ready.
📂 Dataset location: /content/busco_downloads/lineages/solanales_odb12

➡️ You are now ready to run the BUSCO analysis.


In [3]:
# @title **Upload Proteome File** { display-mode: "form" }
# @markdown Upload your proteome FASTA file or specify a path from Google Drive.
# @markdown ---
upload_method = "Upload from computer" # @param ["Upload from computer", "Use Google Drive path"]
drive_path = "" # @param {type:"string"}
# @markdown **Upload Method:**
# @markdown - `Upload from computer` - Direct file upload (recommended for files < 100 MB)
# @markdown - `Use Google Drive path` - Path to file in your Drive (e.g., `/content/drive/MyDrive/my_proteome.faa`)

import os
import gzip
from google.colab import files

proteome_path = None

print("📄 Proteome File Upload\n")

if upload_method == "Upload from computer":
    print("Please select your proteome FASTA file...")
    print("Accepted formats: .faa, .fasta, .fa, .pep, .faa.gz, .fasta.gz, .fa.gz\n")

    uploaded = files.upload()

    if uploaded:
        filename = list(uploaded.keys())[0]

        # Handle .gz decompression
        if filename.endswith('.gz'):
            print(f"\n🗜️ Decompressing {filename}...")
            decompressed_filename = filename[:-3]

            with gzip.open(filename, 'rt') as f_in:
                with open(decompressed_filename, 'w') as f_out:
                    f_out.write(f_in.read())

            os.remove(filename)
            filename = decompressed_filename
            print(f"✅ Decompressed to: {filename}")

        proteome_path = f"/content/{filename}"
        print(f"\n✅ File uploaded: {filename}")
        print(f"   Location: {proteome_path}")
    else:
        print("❌ No file uploaded. Please run this cell again and select a file.")

elif upload_method == "Use Google Drive path":
    if drive_path.strip():
        # Handle .gz in Drive path
        if drive_path.endswith('.gz'):
            if os.path.exists(drive_path):
                print(f"🗜️ Decompressing {os.path.basename(drive_path)}...")
                decompressed_path = drive_path[:-3]

                with gzip.open(drive_path, 'rt') as f_in:
                    with open(decompressed_path, 'w') as f_out:
                        f_out.write(f_in.read())

                proteome_path = decompressed_path
                print(f"✅ Decompressed and found file at: {decompressed_path}")
            else:
                print(f"❌ File not found: {drive_path}")
                print("Please check the path and make sure Google Drive is mounted.")
        else:
            if os.path.exists(drive_path):
                proteome_path = drive_path
                print(f"✅ Found file at: {drive_path}")
            else:
                print(f"❌ File not found: {drive_path}")
                print("Please check the path and make sure Google Drive is mounted.")
    else:
        print("❌ Please enter a valid Google Drive path in the 'drive_path' field above.")

# Validate the proteome file
if proteome_path and os.path.exists(proteome_path):
    print("\n🔍 Validating proteome file...")

    # Check file size
    file_size_mb = os.path.getsize(proteome_path) / (1024 * 1024)
    print(f"   File size: {file_size_mb:.2f} MB")

    # Quick FASTA format check
    with open(proteome_path, 'r') as f:
        first_lines = [f.readline() for _ in range(5)]

    if first_lines[0].startswith('>'):
        print("   ✅ Valid FASTA format detected")

        # Count sequences (rough estimate from first 1000 lines)
        with open(proteome_path, 'r') as f:
            sample = [f.readline() for _ in range(1000)]
            seq_count_estimate = sum(1 for line in sample if line.startswith('>'))

        if seq_count_estimate > 0:
            print(f"   📊 Estimated protein sequences: ~{seq_count_estimate * max(1, int(file_size_mb / 0.1))}")

        print("\n✅ Proteome file ready for BUSCO analysis!")
        print(f"   Path saved for next step: {proteome_path}")

        # Store path for next cell
        with open('/content/proteome_path.txt', 'w') as f:
            f.write(proteome_path)
    else:
        print("   ⚠️ Warning: File doesn't appear to be in FASTA format")
        print("   FASTA files should start with '>' followed by sequence ID")
        print("   BUSCO may fail if the format is incorrect.")

📄 Proteome File Upload

Please select your proteome FASTA file...
Accepted formats: .faa, .fasta, .fa, .pep, .faa.gz, .fasta.gz, .fa.gz



Saving NbT2T.pep.fasta.gz to NbT2T.pep.fasta.gz

🗜️ Decompressing NbT2T.pep.fasta.gz...
✅ Decompressed to: NbT2T.pep.fasta

✅ File uploaded: NbT2T.pep.fasta
   Location: /content/NbT2T.pep.fasta

🔍 Validating proteome file...
   File size: 20.39 MB
   ✅ Valid FASTA format detected
   📊 Estimated protein sequences: ~101500

✅ Proteome file ready for BUSCO analysis!
   Path saved for next step: /content/NbT2T.pep.fasta


In [4]:
# @title **Run BUSCO Analysis and Download Results** { display-mode: "form" }
# @markdown Execute BUSCO analysis and download comprehensive results.

import os
import subprocess
import shutil
import re
import psutil
from datetime import datetime

# Auto-detect available CPU cores
available_cores = psutil.cpu_count(logical=True)
cpu_threads = available_cores

# Load proteome path to auto-populate output name
try:
    with open('/content/proteome_path.txt', 'r') as f:
        proteome_path = f.read().strip()
    proteome_basename = os.path.splitext(os.path.basename(proteome_path))[0]
    default_name = re.sub(r'[^a-zA-Z0-9._-]', '_', proteome_basename)
except:
    default_name = "my_proteome"

# @markdown ---
output_name = "" # @param {type:"string"}
# @markdown **Optional:** Leave empty to auto-name from your proteome file, or enter a custom name to override.
# @markdown ---

print("🚀 Starting BUSCO Analysis\n")

# Display detected CPU configuration with CORRECT runtime identification
print("="*60)
print("CPU CONFIGURATION")
print("="*60)
print(f"🖥️  Detected CPU threads: {cpu_threads}")

# Correct runtime detection based on actual thread counts
if cpu_threads >= 40:
    print(f"✅ TPU v6e-1 Runtime detected - EXCELLENT performance ({cpu_threads} threads)")
elif cpu_threads >= 20:
    print(f"✅ TPU v5e-1 Runtime detected - EXCELLENT performance ({cpu_threads} threads)")
elif cpu_threads >= 8:
    print(f"✅ Pro+ High-RAM Runtime detected - GOOD performance ({cpu_threads} threads)")
elif cpu_threads >= 4:
    print(f"⚠️  Standard Runtime detected - MODERATE performance ({cpu_threads} threads)")
    print(f"💡 TIP: Use TPU v5e-1 or v6e-1 runtime for 12-22x faster analysis")
else:
    print(f"⚠️  Standard Runtime detected - SLOW performance ({cpu_threads} threads)")
    print(f"💡 TIP: Use TPU v5e-1 (24 threads) or v6e-1 (44 threads) for much faster analysis")
    print(f"   Runtime → Change runtime type → TPU v5e-1 or v6e-1")

print(f"⚙️  BUSCO will use all {cpu_threads} threads (100% CPU utilization)")
print("="*60 + "\n")

# Use auto-detected name if box is empty, otherwise use user input
if not output_name.strip():
    output_name = default_name
    print(f"📝 Output name: {output_name} (auto-detected)")
else:
    # Sanitize user input
    output_name = re.sub(r'[^a-zA-Z0-9._-]', '_', output_name.strip())
    print(f"📝 Output name: {output_name} (custom)")

# Verify file exists
if not os.path.exists(proteome_path):
    print(f"❌ Proteome file not found: {proteome_path}")
    raise FileNotFoundError("Please upload your proteome file first.")

# Get BUSCO version
print("\n🔍 Detecting BUSCO version...")
version_cmd = "mamba run -n busco busco --version 2>&1"
version_result = subprocess.run(version_cmd, shell=True, capture_output=True, text=True)
busco_version = "unknown"
for line in version_result.stdout.split('\n'):
    if 'BUSCO' in line and any(char.isdigit() for char in line):
        version_match = re.search(r'(\d+\.\d+\.\d+)', line)
        if version_match:
            busco_version = version_match.group(1)
            break
print(f"✅ BUSCO version: {busco_version}\n")

# Clean FASTA headers in-place
print("🧹 Cleaning FASTA headers...")
with open(proteome_path, 'r') as f:
    lines = f.readlines()
cleaned_count = 0
with open(proteome_path, 'w') as f:
    for line in lines:
        if line.startswith('>'):
            original = line
            clean_header = re.sub(r'[/|:;,\s\(\)\[\]\{\}].*', '', line.strip())
            f.write(clean_header + '\n')
            if original != clean_header + '\n':
                cleaned_count += 1
        else:
            f.write(line)
print(f"✅ {cleaned_count} headers cleaned\n")

# Get lineage from previous cell
lineage_file = '/content/busco_downloads/lineages'
if not os.path.exists(lineage_file):
    print("❌ Lineage dataset not found. Please run the 'Select Lineage' cell first.")
    raise FileNotFoundError("Please download a lineage dataset first.")

available_lineages = os.listdir(lineage_file)
if not available_lineages:
    print("❌ No lineage datasets found. Please run the 'Select Lineage' cell first.")
    raise FileNotFoundError("Please download a lineage dataset first.")

selected_lineage = available_lineages[0]

# Construct final folder name: output_name_BUSCOvX.X.X_database
drive_folder_name = f"{output_name}_BUSCOv{busco_version}_{selected_lineage}"
print(f"📁 Results will be saved as: {drive_folder_name}\n")

# Count sequences for progress estimation
seq_count = sum(1 for line in open(proteome_path) if line.startswith('>'))

# Calculate estimated time based on ACTUAL CPU thread count
if cpu_threads >= 40:  # TPU v6e-1
    time_min = max(1, seq_count * 0.00015)
    time_max = max(3, seq_count * 0.0008)
elif cpu_threads >= 20:  # TPU v5e-1
    time_min = max(1, seq_count * 0.0003)
    time_max = max(5, seq_count * 0.001)
elif cpu_threads >= 8:  # Pro+ High-RAM
    time_min = max(2, seq_count * 0.002)
    time_max = max(8, seq_count * 0.008)
else:  # Standard 2-4 threads
    time_min = max(5, seq_count * 0.01)
    time_max = max(30, seq_count * 0.05)

print(f"📊 Analysis Configuration:")
print(f"   Proteome: {os.path.basename(proteome_path)}")
print(f"   Sequences: {seq_count:,}")
print(f"   Lineage: {selected_lineage}")
print(f"   CPU threads: {cpu_threads}")
print(f"\n⏳ Estimated time: {time_min:.0f}-{time_max:.0f} minutes")
print(f"⏰ Started: {datetime.now().strftime('%H:%M:%S')}\n")

# Create temp output directory
temp_output = f"/content/busco_temp_{output_name}"
os.makedirs(temp_output, exist_ok=True)

# Build BUSCO command with mamba environment wrapper
busco_cmd = f"""mamba run -n busco busco \
    -i {proteome_path} \
    -o busco_temp_{output_name} \
    -m protein \
    -l {selected_lineage} \
    -c {cpu_threads} \
    --offline \
    --download_path /content/busco_downloads \
    -f"""

# Run BUSCO with real-time output and progress tracking
print("="*60)
print(f"📈 BUSCO Progress (live output - using {cpu_threads} threads):")
print("="*60)

start_time = datetime.now()
process = subprocess.Popen(
    busco_cmd,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1
)

stage_markers = {
    'Checking': '🔍',
    'Running': '⚙️',
    'Hmmsearch': '🔬',
    'Results': '📊'
}

for line in process.stdout:
    for marker, emoji in stage_markers.items():
        if marker in line:
            line = f"{emoji} {line}"
            break
    print(line.rstrip())

process.wait()
elapsed = (datetime.now() - start_time).total_seconds() / 60
print("="*60)
print(f"⏱️  Completed in {elapsed:.1f} minutes using {cpu_threads} CPU threads")
print("="*60)

if process.returncode == 0:
    print("\n✅ BUSCO Analysis Complete!\n")

    # Find the actual results directory
    busco_results_dir = f"/content/busco_temp_{output_name}/run_{selected_lineage}"

    # Find and display summary
    summary_file = f"{busco_results_dir}/short_summary.txt"
    if not os.path.exists(summary_file):
        summary_file = f"/content/busco_temp_{output_name}/short_summary.specific.{selected_lineage}.busco_temp_{output_name}.txt"

    if os.path.exists(summary_file):
        print("📊 BUSCO RESULTS SUMMARY:")
        print("="*60)
        with open(summary_file, 'r') as f:
            summary = f.read()
            print(summary)

            for line in summary.split('\n'):
                if 'Complete BUSCOs' in line:
                    print(f"\n🎯 {line.strip()}")
        print("="*60)

    # Copy results to Google Drive
    print(f"\n💾 Saving to Google Drive: BUSCO_Results/{drive_folder_name}")
    drive_base = "/content/drive/MyDrive/BUSCO_Results"
    os.makedirs(drive_base, exist_ok=True)
    drive_output = os.path.join(drive_base, drive_folder_name)

    try:
        if os.path.exists(busco_results_dir):
            shutil.copytree(busco_results_dir, drive_output)
            print(f"✅ Results saved to: {drive_output}")
        else:
            shutil.copytree(f"/content/busco_temp_{output_name}", drive_output)
            print(f"✅ Results saved to: {drive_output}")
    except Exception as e:
        print(f"⚠️ Could not save to Drive: {e}")

    # Create downloadable ZIP
    print("\n📦 Creating downloadable archive...")
    zip_basename = f"{drive_folder_name}"
    if os.path.exists(busco_results_dir):
        shutil.make_archive(f"/content/{zip_basename}", 'zip', busco_results_dir)
    else:
        shutil.make_archive(f"/content/{zip_basename}", 'zip', f"/content/busco_temp_{output_name}")

    print("\n🎉 Analysis Complete!")
    print(f"📁 Google Drive: BUSCO_Results/{drive_folder_name}")
    print(f"\n📥 Downloading: {zip_basename}.zip")

    from google.colab import files
    files.download(f"/content/{zip_basename}.zip")

    print(f"\n✅ Download complete!")

else:
    print("\n❌ BUSCO analysis failed. Please check the error messages above.")
    print("\nCommon issues:")
    print("   • Invalid FASTA format")
    print("   • Incorrect lineage selection")
    print("   • Insufficient memory (try smaller proteome or restart runtime)")

🚀 Starting BUSCO Analysis

CPU CONFIGURATION
🖥️  Detected CPU threads: 24
✅ High-RAM Runtime detected - GOOD performance (8 threads)
⚙️  BUSCO will use all 24 threads (100% CPU utilization)

📝 Output name: NbT2T.pep (auto-detected)

🔍 Detecting BUSCO version...
✅ BUSCO version: 6.0.0

🧹 Cleaning FASTA headers...
✅ 0 headers cleaned

📁 Results will be saved as: NbT2T.pep_BUSCOv6.0.0_solanales_odb12

📊 Analysis Configuration:
   Proteome: NbT2T.pep.fasta
   Sequences: 57,023
   Lineage: solanales_odb12
   CPU threads: 24

⏳ Estimated time: 114-456 minutes
⏰ Started: 17:56:23

📈 BUSCO Progress (live output - using 24 threads):
2025-10-21 17:56:25 INFO:	***** Start a BUSCO v6.0.0 analysis, current time: 10/21/2025 17:56:25 *****
2025-10-21 17:56:25 INFO:	Configuring BUSCO with local environment
⚙️ 2025-10-21 17:56:25 INFO:	Running proteins mode
2025-10-21 17:56:25 INFO:	'Force' option selected; overwriting previous results directory
2025-10-21 17:56:25 INFO:	Input file is /content/NbT2T.pe

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>


✅ Download complete!
