<a href="https://colab.research.google.com/github/JKourelis/Colab_TMbed/blob/main/TMbed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TMbed: Transmembrane Protein Prediction via Language Model Embeddings

Easy-to-use transmembrane protein prediction with [TMbed](https://github.com/BernhoferM/TMbed). TMbed predicts transmembrane beta-barrel and alpha-helical proteins, the position and orientation of their transmembrane segments, and signal peptides from ProtT5-XL-U50 language model embeddings.

**Key Features:**
- **Transmembrane Prediction**: Beta barrel and alpha helical transmembrane proteins with high accuracy
- **Segment Orientation**: Predicts position and directionality (IN→OUT vs OUT→IN) of transmembrane segments  
- **Signal Peptides**: Identifies signal peptide regions
- **Multiple Output Formats**: Five TMbed output formats (3-line and tabular variants), each accompanied by a GFF3 conversion
- **Language Model Powered**: Uses ProtT5-XL-U50 protein language model embeddings
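
As a minimal sketch of how per-residue labels turn into segments, the snippet below run-length encodes a directed 3-line prediction string and formats the runs as GFF3 rows. The helper names `label_segments` and `to_gff3`, the `label` attribute, and the example string are hypothetical. The label meanings (`H`/`h` and `B`/`b` for the two orientations, `S` for signal peptide) are assumed from the format descriptions in this notebook, so check them against the TMbed version you run.

```python
from itertools import groupby

# Assumed label meanings; upper and lower case encode the two orientations.
FEATURE_TYPES = {
    "H": "TMhelix", "h": "TMhelix",
    "B": "TMbeta", "b": "TMbeta",
    "S": "signal_peptide",
}

def label_segments(labels):
    """Return (label, start, end) for each run of a feature label (1-based, inclusive)."""
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label in FEATURE_TYPES:
            segments.append((label, pos + 1, pos + length))
        pos += length
    return segments

def to_gff3(seq_id, segments):
    """Format segments as GFF3 rows with nine tab-separated columns."""
    rows = []
    for n, (label, start, end) in enumerate(segments, 1):
        ftype = FEATURE_TYPES[label]
        attrs = f"ID={seq_id}_{n};Name={ftype};label={label}"
        rows.append("\t".join(
            [seq_id, "TMbed", ftype, str(start), str(end), ".", ".", ".", attrs]
        ))
    return rows

example = "SSSS....HHHHH....hhhhh.."  # hypothetical prediction line
segs = label_segments(example)
print(segs)
for row in to_gff3("protein1", segs):
    print(row)
```

Unlike the conversion cell later in this notebook, which maps `B`/`b` and `H`/`h` to a single label each, this sketch keeps upper- and lower-case runs separate so the orientation survives in the `label` attribute.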

**Usage Options:**
1. **FASTA Upload**: Upload protein sequences for batch prediction
2. **Google Drive Integration**: Automatic result storage and organization
3. **Format Selection**: Choose from 5 different output formats for downstream analysis
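
For batch prediction on large files, the prediction cell in this notebook sets aside sequences longer than 5,000 residues and processes the rest in chunks of 10,000 sequences. The sketch below illustrates that idea on in-memory data. `read_fasta`, `split_by_length`, `chunked`, and `estimate_gb` are hypothetical helper names, and the memory figure is only the rough `length² × 4 bytes` estimate that the notebook prints, not a measured value.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)

def split_by_length(records, max_length=5000):
    """Separate records into (kept, excluded) lists by sequence length."""
    kept, excluded = [], []
    for header, seq in records:
        (kept if len(seq) <= max_length else excluded).append((header, seq))
    return kept, excluded

def chunked(records, size=10000):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def estimate_gb(length):
    """Rough quadratic estimate: length^2 values of 4 bytes each, in GB."""
    return length ** 2 * 4 / 1e9

# Tiny hypothetical input; the thresholds are shrunk so the effect is visible.
fasta = io.StringIO(">a\nMKT\nLLV\n>b\nMK\n>c\nMKTAYIAK\n")
kept, excluded = split_by_length(read_fasta(fasta), max_length=6)
print([h for h, _ in kept], [h for h, _ in excluded])
print([len(c) for c in chunked(kept, size=1)])
print(f"{estimate_gb(58014):.1f} GB")
```

The estimate grows with the square of the sequence length, which is why a handful of very long proteins can dominate memory use even when the average length is modest.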

**Precomputed Predictions:**
- Visit [TMVisDB](https://tmvisdb.predictprotein.org) for precomputed predictions on AlphaFold DB structures
- Human proteome and UniProtKB/Swiss-Prot predictions available on [Zenodo](https://zenodo.org/records/14705941)

**Citations:**

[Bernhofer M, Littmann M, Heinzinger M, et al. TMbed: transmembrane proteins predicted through language model embeddings. *BMC Bioinformatics*, 2022](https://doi.org/10.1186/s12859-022-04873-x)

[Bernhofer M, Littmann M, Heinzinger M, et al. TMbed: transmembrane proteins predicted through language model embeddings. *bioRxiv*, 2022](https://doi.org/10.1101/2022.06.12.495804)

[Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. *IEEE TPAMI*, 2021](https://doi.org/10.1109/TPAMI.2021.3095381)

**Also Available:**
- [TMbed Original Software Repository](https://github.com/BernhoferM/TMbed)
- [TMbed Colab Repository](https://github.com/JKourelis/Colab_TMbed)
- [bio_embeddings Integration](https://github.com/sacdallago/bio_embeddings)  
- [LambdaPP Web Interface](https://embed.predictprotein.org/)
- [Original Google Colab](https://colab.research.google.com/drive/1FbT2rQHYT67NNHCrGw4t0WMEHCY9lqR2?usp=sharing)

In [None]:
# @title FASTA Upload + Google Drive Setup + Output Format Selection
# @markdown Configure all settings: upload file, connect to Google Drive, and select output format
# @markdown
# @markdown **OUTPUT FORMATS:**
# @markdown - Format 0: 3-line format with directed segments (B/b for beta strands, H/h for alpha helices) + GFF
# @markdown - Format 1: 3-line format with undirected segments and inside/outside annotation + GFF
# @markdown - Format 2: Tabular format with directed segments and probabilities + GFF
# @markdown - Format 3: Tabular format with undirected segments and probabilities + GFF
# @markdown - Format 4: 3-line format with directed segments and explicit inside/outside prediction + GFF

import os
import re
from google.colab import files

# Output format selection
output_format = 0 #@param ["0", "1", "2", "3", "4"] {type:"raw"}

print(f"✅ Selected output format: {output_format}")

# Google Drive setup
setup_google_drive = True #@param {type:"boolean"}
#@markdown Connect to Google Drive for automatic result upload
gdrive_folder_name = "TMbed_Predictions" #@param {type:"string"}
#@markdown Google Drive folder name for storing results

# Setup Google Drive if requested
drive = None
if setup_google_drive:
    try:
        from pydrive2.drive import GoogleDrive
        from pydrive2.auth import GoogleAuth
        from google.colab import auth
        from oauth2client.client import GoogleCredentials

        print("Setting up Google Drive...")
        auth.authenticate_user()
        gauth = GoogleAuth()
        gauth.credentials = GoogleCredentials.get_application_default()
        drive = GoogleDrive(gauth)
        print("✅ Google Drive connected successfully!")
    except Exception as e:
        print(f"❌ Google Drive setup failed: {e}")
        drive = None

# Upload FASTA file
print("Upload your FASTA file:")
uploaded = files.upload()

if uploaded:
    # Get the uploaded file
    uploaded_filename = list(uploaded.keys())[0]
    file_content = uploaded[uploaded_filename]

    # Clean filename (remove any duplicate numbers that Colab may have added)
    clean_filename = re.sub(r'\s*\(\d+\)', '', uploaded_filename)

    # Write the content to the clean filename
    with open(clean_filename, 'wb') as f:
        f.write(file_content)

    # Get base name without extension for output file
    base_name = os.path.splitext(clean_filename)[0]
    output_filename = f"{base_name}.pred"

    print(f"✅ File uploaded and saved as: {clean_filename}")
    print(f"Output will be saved as: {output_filename}")

    # clean_filename, output_filename, base_name, drive, gdrive_folder_name, and
    # output_format are module-level names here, so later cells can use them directly
else:
    print("❌ No file was uploaded. Please run this cell again and upload a file.")

✅ Selected output format: 0
Setting up Google Drive...
✅ Google Drive connected successfully!
Upload your FASTA file:


Saving GWHBKBG00000000_AA_Named_v20251016.fasta to GWHBKBG00000000_AA_Named_v20251016.fasta
✅ File uploaded and saved as: GWHBKBG00000000_AA_Named_v20251016.fasta
Output will be saved as: GWHBKBG00000000_AA_Named_v20251016.pred
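
Colab may append a counter such as ` (1)` to an uploaded file when a file with the same name already exists, which is why the cell above strips that suffix before deriving the output name. A minimal sketch of the same clean-up, using the regular expression from the cell; the function name `clean_upload_name` and the file names are hypothetical:

```python
import os
import re

def clean_upload_name(name):
    """Strip counters like ' (1)' and return (clean name, prediction file name)."""
    clean = re.sub(r"\s*\(\d+\)", "", name)
    base = os.path.splitext(clean)[0]
    return clean, f"{base}.pred"

for name in ["proteins.fasta", "proteins (1).fasta", "run(2) final (10).fa"]:
    print(name, "->", clean_upload_name(name))
```

The pattern removes every parenthesised number, not only a trailing counter, so a name like `run(2) final (10).fa` becomes `run final.fa`.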


In [None]:
# @title Dependencies and Environment Setup

import os
import subprocess
import sys

print("🔧 Installing TMbed and dependencies...")

# Install TMbed from GitHub
!pip install -q git+https://github.com/BernhoferM/TMbed.git

# Install PyDrive2 for Google Drive integration
!pip install -q PyDrive2

print("✅ Installation complete!")

# Verify GPU availability
import torch
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU detected: {gpu_name}")
else:
    print("⚠️ No GPU detected - will use CPU (slower)")

# Pre-download the T5 model to avoid race conditions during prediction
print("\n📥 Pre-downloading T5 encoder model (2.4GB - this may take a minute)...")
print("This prevents errors during prediction and speeds up subsequent runs.")

try:
    from transformers import T5Tokenizer, T5EncoderModel
    import torch

    model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"

    # Download tokenizer
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)

    # Download model (use float16 if GPU available, float32 otherwise)
    if torch.cuda.is_available():
        model = T5EncoderModel.from_pretrained(model_name, torch_dtype=torch.float16)
        print("✅ Model downloaded and cached (GPU-optimized)")
    else:
        model = T5EncoderModel.from_pretrained(model_name)
        print("✅ Model downloaded and cached (CPU mode)")

    # Clear memory
    del model
    del tokenizer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

except Exception as e:
    print(f"⚠️ Model pre-download encountered an issue: {e}")
    print("Will attempt to download during prediction instead.")

print("\n🎉 Setup complete! Proceed to the next cell.")

🔧 Installing TMbed and dependencies...
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for tmbed (pyproject.toml) ... done
✅ Installation complete!
✅ GPU detected: NVIDIA A100-SXM4-80GB

📥 Pre-downloading T5 encoder model (2.4GB - this may take a minute)...
This prevents errors during prediction and speeds up subsequent runs.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model downloaded and cached (GPU-optimized)



🎉 Setup complete! Proceed to the next cell.


In [None]:
# @title Run TMbed Prediction + Process Results + Upload to Google Drive/Download Results

import time
import os
import shutil
import torch
import gc
from pathlib import Path

# Check if required variables exist
if 'clean_filename' not in globals() or 'output_filename' not in globals():
    print("❌ Error: Please run the configuration cell first.")
else:
    # Clear GPU memory before starting
    print("🧹 Clearing GPU memory from previous runs...")
    torch.cuda.empty_cache()
    gc.collect()

    if torch.cuda.is_available():
        before_mem = torch.cuda.memory_allocated() / 1e9
        reserved_mem = torch.cuda.memory_reserved() / 1e9
        print(f"   GPU memory: {before_mem:.2f}GB allocated, {reserved_mem:.2f}GB reserved")

    # Google Drive helper functions
    def find_or_create_folder(drive, folder_name, parent_id='root'):
        """Find existing folder or create new one in Google Drive."""
        if not drive:
            return None

        try:
            file_list = drive.ListFile({'q': f"title='{folder_name}' and '{parent_id}' in parents and mimeType='application/vnd.google-apps.folder' and trashed=false"}).GetList()

            if file_list:
                print(f"✅ Found existing folder: {folder_name}")
                return file_list[0]['id']
            else:
                folder = drive.CreateFile({
                    'title': folder_name,
                    'mimeType': 'application/vnd.google-apps.folder',
                    'parents': [{'id': parent_id}]
                })
                folder.Upload()
                print(f"✅ Created new folder: {folder_name}")
                return folder['id']
        except Exception as e:
            print(f"❌ Error with folder '{folder_name}': {e}")
            return None

    def upload_to_gdrive(drive, file_path, folder_id, description):
        """Upload file to Google Drive folder."""
        if not drive or not os.path.exists(file_path):
            return None

        try:
            uploaded_file = drive.CreateFile({
                'title': os.path.basename(file_path),
                'parents': [{'id': folder_id}]
            })
            uploaded_file.SetContentFile(file_path)
            uploaded_file.Upload()

            file_url = f"https://drive.google.com/file/d/{uploaded_file['id']}/view"
            print(f"✅ Uploaded {description}: {file_url}")
            return file_url
        except Exception as e:
            print(f"❌ Upload failed for {description}: {e}")
            return None

    def analyze_sequence_lengths(fasta_file):
        """Analyze sequence lengths and return statistics."""
        lengths = []
        current_seq = []
        current_header = None

        with open(fasta_file, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('>'):
                    if current_seq:
                        seq_len = len(''.join(current_seq))
                        lengths.append((current_header, seq_len))
                    current_header = line[1:].split()[0]
                    current_seq = []
                else:
                    current_seq.append(line)

            if current_seq:
                seq_len = len(''.join(current_seq))
                lengths.append((current_header, seq_len))

        return lengths

    def filter_by_length(input_file, output_file, max_length=5000):
        """Filter sequences by maximum length. Returns (kept, excluded, excluded_list) and writes excluded IDs and lengths to '<output_file>.excluded.txt'."""
        kept = 0
        excluded = 0
        excluded_list = []

        with open(input_file, 'r') as infile, \
             open(output_file, 'w') as outfile, \
             open(f"{output_file}.excluded.txt", 'w') as excluded_file:

            current_header = None
            current_seq = []

            for line in infile:
                line_stripped = line.strip()
                if line_stripped.startswith('>'):
                    # Process previous sequence
                    if current_header is not None:
                        seq = ''.join(current_seq)
                        if len(seq) <= max_length:
                            outfile.write(current_header + '\n')
                            outfile.write(seq + '\n')
                            kept += 1
                        else:
                            excluded += 1
                            excluded_list.append((current_header[1:].split()[0], len(seq)))
                            excluded_file.write(f"{current_header[1:].split()[0]}\t{len(seq)}\n")

                    current_header = line_stripped
                    current_seq = []
                else:
                    current_seq.append(line_stripped)

            # Don't forget last sequence
            if current_header is not None:
                seq = ''.join(current_seq)
                if len(seq) <= max_length:
                    outfile.write(current_header + '\n')
                    outfile.write(seq + '\n')
                    kept += 1
                else:
                    excluded += 1
                    excluded_list.append((current_header[1:].split()[0], len(seq)))
                    excluded_file.write(f"{current_header[1:].split()[0]}\t{len(seq)}\n")

        return kept, excluded, excluded_list

    def count_sequences(fasta_file):
        """Count number of sequences in FASTA file."""
        count = 0
        with open(fasta_file, 'r') as f:
            for line in f:
                if line.startswith('>'):
                    count += 1
        return count

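    def merge_prediction_files(chunk_outputs, final_output):
        """Merge prediction outputs from multiple chunks, deleting each chunk once merged."""
        with open(final_output, 'w') as outfile:
            for chunk_file in chunk_outputs:
                if os.path.exists(chunk_file):
                    with open(chunk_file, 'r') as infile:
                        content = infile.read()
                    outfile.write(content)
                    # Keep records separated if a chunk lacks a trailing newline
                    if content and not content.endswith('\n'):
                        outfile.write('\n')
                    os.remove(chunk_file)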

    # Universal TMbed to GFF conversion function
    def tmbed_to_gff(tmbed_file, gff_file):
        """Convert any TMbed output format to GFF format."""
        with open(tmbed_file, 'r') as f:
            lines = f.readlines()

        with open(gff_file, 'w') as out:
            out.write("##gff-version 3\n")

            i = 0
            while i < len(lines):
                line = lines[i].strip()

                if not line:
                    i += 1
                    continue

                if line.startswith('>'):
                    header = line[1:]
                    sequence_name = header.split()[0]

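                    next_fields = lines[i + 1].split() if i + 1 < len(lines) else []
                    # Tabular output has a multi-column header row starting with 'AA';
                    # a bare sequence that happens to start with "AA" is a single field
                    if len(next_fields) > 1 and next_fields[0] == 'AA':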
                        sequence = ''
                        prediction = ''
                        i += 2
                        while i < len(lines) and not lines[i].strip().startswith('>') and lines[i].strip():
                            data_line = lines[i].strip()
                            if data_line:
                                parts = data_line.split()
                                if len(parts) >= 2:
                                    sequence += parts[0]
                                    prediction += parts[1]
                            i += 1
                    elif i + 2 < len(lines):
                        sequence = lines[i + 1].strip()
                        prediction = lines[i + 2].strip()
                        i += 3
                    else:
                        i += 1
                        continue

                    if prediction:
                        simplified_pred = ''
                        for char in prediction:
                            if char in 'Bb':
                                simplified_pred += 'B'
                            elif char in 'Hh':
                                simplified_pred += 'H'
                            elif char == 'S':
                                simplified_pred += 'S'
                            else:
                                # 'i', 'o', '.', and any other label count as non-membrane
                                simplified_pred += '.'

                        segments = []
                        j = 0
                        while j < len(simplified_pred):
                            if simplified_pred[j] in 'BHS':
                                feature_type = 'signal_peptide' if simplified_pred[j] == 'S' else \
                                              'TMbeta' if simplified_pred[j] == 'B' else 'TMhelix'
                                start = j
                                while j < len(simplified_pred) and simplified_pred[j] == simplified_pred[start]:
                                    j += 1
                                segments.append({
                                    'type': feature_type,
                                    'start': start + 1,
                                    'end': j
                                })
                            else:
                                j += 1

                        has_signal = any(seg['type'] == 'signal_peptide' for seg in segments)

                        if segments:
                            final_regions = []
                            for seg in segments:
                                final_regions.append(seg)

                            if has_signal:
                                sorted_segments = sorted(segments, key=lambda x: x['start'])
                                current_location = 'inside'
                                last_end = 1

                                for seg in sorted_segments:
                                    if seg['start'] > last_end:
                                        final_regions.append({
                                            'type': current_location,
                                            'start': last_end,
                                            'end': seg['start'] - 1
                                        })

                                    if seg['type'] == 'signal_peptide':
                                        current_location = 'outside'
                                    elif seg['type'] in ['TMbeta', 'TMhelix']:
                                        current_location = 'outside' if current_location == 'inside' else 'inside'

                                    last_end = seg['end'] + 1

                                if last_end <= len(simplified_pred):
                                    final_regions.append({
                                        'type': current_location,
                                        'start': last_end,
                                        'end': len(simplified_pred)
                                    })

                            final_regions.sort(key=lambda x: x['start'])

                            feature_counts = {
                                'TMbeta': 0,
                                'TMhelix': 0,
                                'signal_peptide': 0,
                                'inside': 0,
                                'outside': 0
                            }

                            for region in final_regions:
                                feature_type = region['type']
                                feature_counts[feature_type] += 1
                                count = feature_counts[feature_type]
                                region_id = f"{sequence_name}_{feature_type}-{count}"
                                gff_line = f"{sequence_name}\tTMbed\t{feature_type}\t{region['start']}\t{region['end']}\t.\t.\t.\tID={region_id};Name={feature_type}\n"
                                out.write(gff_line)
                else:
                    i += 1

    # STEP 1: Analyze sequence lengths
    print("\n📊 STEP 1: Analyzing sequence lengths...")
    print("=" * 60)
    lengths = analyze_sequence_lengths(clean_filename)

    if lengths:
        lengths_sorted = sorted(lengths, key=lambda x: x[1], reverse=True)
        total_seqs = len(lengths)
        min_len = min(l[1] for l in lengths)
        max_len = max(l[1] for l in lengths)
        avg_len = sum(l[1] for l in lengths) / len(lengths)

        print(f"Total sequences: {total_seqs}")
        print(f"Length range: {min_len} - {max_len} amino acids")
        print(f"Average length: {avg_len:.1f} amino acids")

        # Check for problematic sequences
        very_long = sum(1 for _, l in lengths if l > 5000)
        extremely_long = sum(1 for _, l in lengths if l > 10000)

        if very_long > 0 or extremely_long > 0:
            print(f"\n⚠️  WARNING: Found problematic sequences!")
            print(f"   Sequences >5,000 aa: {very_long}")
            print(f"   Sequences >10,000 aa: {extremely_long}")
            print(f"\n🔴 Longest sequences (memory killers):")
            for i, (header, length) in enumerate(lengths_sorted[:10], 1):
                memory_gb = (length ** 2) * 4 / 1e9
                print(f"   {i}. {header[:50]}: {length} aa (~{memory_gb:.1f} GB)")

            # Filter sequences
            MAX_LENGTH = 5000
            print(f"\n🔧 FILTERING sequences longer than {MAX_LENGTH} amino acids...")
            filtered_file = f"{clean_filename}.filtered.fasta"
            kept, excluded, excluded_list = filter_by_length(clean_filename, filtered_file, MAX_LENGTH)

            print(f"✅ Filtered file created:")
            print(f"   Kept: {kept} sequences")
            print(f"   Excluded: {excluded} sequences")

            if excluded > 0:
                print(f"\n📝 Excluded sequences saved to: {filtered_file}.excluded.txt")
                print(f"   (You can run these separately if needed)")

            # Use filtered file for prediction
            clean_filename = filtered_file
        else:
            print(f"\n✅ All sequences are within acceptable length (≤5,000 aa)")

    # STEP 2: Determine chunking strategy
    print(f"\n📊 STEP 2: Determining processing strategy...")
    print("=" * 60)
    total_sequences = count_sequences(clean_filename)
    print(f"Sequences to process: {total_sequences}")

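    CHUNK_SIZE = 10000

    # TMbed is installed with pip, so the install step does not create this working
    # directory; make sure it exists before files are copied into it
    os.makedirs("/content/tmbed", exist_ok=True)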

    if total_sequences > CHUNK_SIZE:
        print(f"⚠️  Large file - will process in chunks of {CHUNK_SIZE}")

        # Split and process
        print(f"\n🔧 Splitting file...")
        chunks = []
        chunk_num = 0
        current_chunk = []
        current_count = 0

        with open(clean_filename, 'r') as f:
            current_seq = []
            for line in f:
                if line.startswith('>'):
                    if current_seq:
                        current_chunk.extend(current_seq)
                        current_count += 1

                        if current_count >= CHUNK_SIZE:
                            chunk_file = f"{clean_filename}.chunk{chunk_num}.fasta"
                            with open(chunk_file, 'w') as out:
                                out.writelines(current_chunk)
                            chunks.append(chunk_file)
                            chunk_num += 1
                            current_chunk = []
                            current_count = 0

                    current_seq = [line]
                else:
                    current_seq.append(line)

            if current_seq:
                current_chunk.extend(current_seq)
                current_count += 1

            if current_chunk:
                chunk_file = f"{clean_filename}.chunk{chunk_num}.fasta"
                with open(chunk_file, 'w') as out:
                    out.writelines(current_chunk)
                chunks.append(chunk_file)

        print(f"   Created {len(chunks)} chunks")

        # Process chunks
        chunk_outputs = []
        total_start_time = time.time()

        for i, chunk_file in enumerate(chunks, 1):
            print(f"\n🔄 Processing chunk {i}/{len(chunks)}...")

            torch.cuda.empty_cache()
            gc.collect()

            chunk_output = f"{chunk_file}.pred"
            shutil.copy2(chunk_file, "/content/tmbed/")

            %cd '/content/tmbed/'
            !python -m tmbed predict -f {os.path.basename(chunk_file)} -p {os.path.basename(chunk_output)} --out-format={output_format} --batch-size=2000
            %cd '/content'

            if os.path.exists(f"/content/tmbed/{os.path.basename(chunk_output)}"):
                shutil.copy2(f"/content/tmbed/{os.path.basename(chunk_output)}", chunk_output)
                chunk_outputs.append(chunk_output)
                print(f"   ✅ Chunk {i} completed")
            else:
                print(f"   ❌ Chunk {i} failed")

            os.remove(chunk_file)
            if os.path.exists(f"/content/tmbed/{os.path.basename(chunk_file)}"):
                os.remove(f"/content/tmbed/{os.path.basename(chunk_file)}")

        print(f"\n🔧 Merging {len(chunk_outputs)} chunks...")
        merge_prediction_files(chunk_outputs, output_filename)

        total_time = time.time() - total_start_time
        print(f"✅ Completed in {total_time:.1f} seconds")

    else:
        print(f"✅ File size OK - processing normally")
        start_time = time.time()

        shutil.copy2(clean_filename, "/content/tmbed/")

        %cd '/content/tmbed/'
        !python -m tmbed predict -f {os.path.basename(clean_filename)} -p {output_filename} --out-format={output_format} --batch-size=2000
        %cd '/content'

        if os.path.exists(f"/content/tmbed/{output_filename}"):
            shutil.copy2(f"/content/tmbed/{output_filename}", f"/content/{output_filename}")

        elapsed_time = time.time() - start_time
        print(f"✅ Completed in {elapsed_time:.1f} seconds")

    # STEP 3: Process results
    if os.path.exists(output_filename):
        print(f"\n📁 STEP 3: Processing results...")
        print("=" * 60)

        folder_id = None
        if drive:
            folder_id = find_or_create_folder(drive, gdrive_folder_name)

        uploaded_files = []

        gff_filename = f"{base_name}.gff"
        print("🔧 Converting to GFF format...")
        tmbed_to_gff(output_filename, gff_filename)
        print(f"✅ GFF file created: {gff_filename}")

        if drive and folder_id:
            pred_url = upload_to_gdrive(drive, output_filename, folder_id, f"prediction ({output_filename})")
            gff_url = upload_to_gdrive(drive, gff_filename, folder_id, f"GFF ({gff_filename})")

            if pred_url:
                uploaded_files.append(pred_url)
            if gff_url:
                uploaded_files.append(gff_url)
        else:
            from google.colab import files
            files.download(output_filename)
            files.download(gff_filename)

        print(f"\n✅ SUCCESS!")
        print("=" * 60)
        if uploaded_files:
            print(f"📤 Uploaded {len(uploaded_files)} files to Google Drive")
            for url in uploaded_files:
                print(f"   • {url}")
        else:
            print(f"📥 Files downloaded:")
            print(f"   • {output_filename}")
            print(f"   • {gff_filename}")

    else:
        print("❌ Error: Prediction failed")

🧹 Clearing GPU memory from previous runs...
   GPU memory: 0.00GB allocated, 0.00GB reserved

📊 STEP 1: Analyzing sequence lengths...
Total sequences: 48431
Length range: 49 - 58014 amino acids
Average length: 424.6 amino acids

   Sequences >5,000 aa: 29
   Sequences >10,000 aa: 15

🔴 Longest sequences (memory killers):
   1. Niben10Scf004g00347.1: 58014 aa (~13.5 GB)
   2. Niben10Scf013g02480.1: 47246 aa (~8.9 GB)
   3. Niben10Scf003g01768.1: 43278 aa (~7.5 GB)
   4. Niben10Scf003g01788.1: 41879 aa (~7.0 GB)
   5. Niben10Scf004g00340.1: 33879 aa (~4.6 GB)
   6. Niben10Scf003g01792.1: 29048 aa (~3.4 GB)
   7. Niben10Scf009g00077.1: 27390 aa (~3.0 GB)
   8. Niben10Scf004g00341.1: 26808 aa (~2.9 GB)
   9. Niben10Scf003g01761.1: 16098 aa (~1.0 GB)
   10. Niben10Scf003g01790.1: 15170 aa (~0.9 GB)

🔧 FILTERING sequences longer than 5000 amino acids...
✅ Filtered file created:
   Kept: 48402 sequences
   Excluded: 29 sequences

📝 Excluded sequences saved to: GWHBKBG00000000_AA_Named_v202510