<a href="https://colab.research.google.com/github/JKourelis/Colab_TMbed/blob/main/TMbed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://colab.research.google.com/assets/colab-badge.svg" height="40" align="right" style="height:40px">

## TMbed: Transmembrane Protein Prediction via Language Model Embeddings

Easy to use transmembrane protein prediction using [TMbed](https://github.com/BernhoferM/TMbed). TMbed predicts transmembrane beta barrel and alpha helical proteins, the position and orientation of their transmembrane segments, and signal peptides using ProtT5-XL-U50 language model embeddings.

**Key Features:**
- **Transmembrane Prediction**: Beta barrel and alpha helical transmembrane proteins with high accuracy
- **Segment Orientation**: Predicts position and directionality (IN→OUT vs OUT→IN) of transmembrane segments  
- **Signal Peptides**: Identifies signal peptide regions
- **Multiple Output Formats**: 5 different output formats including 3-line, tabular, and GFF conversion
- **Language Model Powered**: Uses ProtT5-XL-U50 protein language model embeddings

**Usage Options:**
1. **FASTA Upload**: Upload protein sequences for batch prediction
2. **Google Drive Integration**: Automatic result storage and organization
3. **Format Selection**: Choose from 5 different output formats for downstream analysis

**Precomputed Predictions:**
- Visit [TMVisDB](https://tmvisdb.predictprotein.org) for precomputed predictions on AlphaFold DB structures
- Human proteome and UniProtKB/Swiss-Prot predictions available on [Zenodo](https://zenodo.org/records/14705941)

**Citations:**

[Bernhofer M, Littmann M, Heinzinger M, et al. TMbed: transmembrane proteins predicted through language model embeddings. *BMC Bioinformatics*, 2022](https://doi.org/10.1186/s12859-022-04873-x)

[Bernhofer M, Littmann M, Heinzinger M, et al. TMbed: transmembrane proteins predicted through language model embeddings. *bioRxiv*, 2022](https://doi.org/10.1101/2022.06.12.495804)

[Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. *IEEE TPAMI*, 2021](https://doi.org/10.1109/TPAMI.2021.3095381)

**Also Available:**
- [TMbed GitHub Repository](https://github.com/BernhoferM/TMbed)
- [bio_embeddings Integration](https://github.com/sacdallago/bio_embeddings)  
- [LambdaPP Web Interface](https://embed.predictprotein.org/)
- [Original Google Colab](https://colab.research.google.com/drive/1FbT2rQHYT67NNHCrGw4t0WMEHCY9lqR2?usp=sharing)

In [None]:
# @title FASTA Upload + Google Drive Setup + Output Format Selection
# @markdown Configure all settings: upload file, connect to Google Drive, and select output format
# @markdown
# @markdown **OUTPUT FORMATS:**
# @markdown - Format 0: 3-line format with directed segments (B/b for beta strands, H/h for alpha helices) + GFF
# @markdown - Format 1: 3-line format with undirected segments and inside/outside annotation + GFF
# @markdown - Format 2: Tabular format with directed segments and probabilities + GFF
# @markdown - Format 3: Tabular format with undirected segments and probabilities + GFF
# @markdown - Format 4: 3-line format with directed segments and explicit inside/outside prediction + GFF

import os
import re
from google.colab import files

# Output format selection
output_format = 0 #@param ["0", "1", "2", "3", "4"] {type:"raw"}

print(f"✅ Selected output format: {output_format}")

# Google Drive setup
setup_google_drive = True #@param {type:"boolean"}
#@markdown Connect to Google Drive for automatic result upload
gdrive_folder_name = "TMbed_Predictions" #@param {type:"string"}
#@markdown Google Drive folder name for storing results

# Setup Google Drive if requested
drive = None
if setup_google_drive:
    try:
        from pydrive2.drive import GoogleDrive
        from pydrive2.auth import GoogleAuth
        from google.colab import auth
        from oauth2client.client import GoogleCredentials

        print("Setting up Google Drive...")
        auth.authenticate_user()
        gauth = GoogleAuth()
        gauth.credentials = GoogleCredentials.get_application_default()
        drive = GoogleDrive(gauth)
        print("✅ Google Drive connected successfully!")
    except Exception as e:
        print(f"❌ Google Drive setup failed: {e}")
        drive = None

# Upload FASTA file
print("Upload your FASTA file:")
uploaded = files.upload()

if uploaded:
    # Get the uploaded file
    uploaded_filename = list(uploaded.keys())[0]
    file_content = uploaded[uploaded_filename]

    # Clean filename (remove any duplicate numbers that Colab may have added)
    clean_filename = re.sub(r'\s*\(\d+\)', '', uploaded_filename)

    # Write the content to the clean filename
    with open(clean_filename, 'wb') as f:
        f.write(file_content)

    # Get base name without extension for output file
    base_name = os.path.splitext(clean_filename)[0]
    output_filename = f"{base_name}.pred"

    print(f"✅ File uploaded and saved as: {clean_filename}")
    print(f"Output will be saved as: {output_filename}")

    # Store variables for next cells
    globals()['clean_filename'] = clean_filename
    globals()['output_filename'] = output_filename
    globals()['base_name'] = base_name
    globals()['drive'] = drive
    globals()['gdrive_folder_name'] = gdrive_folder_name
    globals()['output_format'] = output_format
else:
    print("❌ No file was uploaded. Please run this cell again and upload a file.")

OUTPUT FORMAT EXPLANATIONS:
• Format 0: 3-line format with directed segments (B/b for beta strands, H/h for alpha helices)
• Format 1: 3-line format with undirected segments and inside/outside annotation
• Format 2: Tabular format with directed segments and probabilities
• Format 3: Tabular format with undirected segments and probabilities
• Format 4: 3-line format with directed segments with explicit inside/outside prediction
✅ Selected output format: 0
Setting up Google Drive...
❌ Google Drive setup failed: Error: credential propagation was unsuccessful
Upload your FASTA file:


KeyboardInterrupt: 

In [None]:
# @title Install TMbed and dependencies
print("Installing TMbed...")

# Install dependencies exactly as in the official example
!pip install -q h5py numpy tqdm typer transformers sentencepiece torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

# Clone TMbed repository (don't install it as a package)
!git clone https://github.com/BernhoferM/TMbed.git tmbed

# Verify GPU is available
!nvidia-smi

print("TMbed installation complete. Ready for prediction.")

Installing TMbed...
fatal: destination path 'tmbed' already exists and is not an empty directory.
Mon Jun 23 13:47:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             46W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------

In [1]:
# @title Run TMbed Prediction + Process Results + Upload to Google Drive/Download Files

import time
import os
import shutil

# Check if required variables exist
if 'clean_filename' not in globals() or 'output_filename' not in globals():
    print("❌ Error: Please run the configuration cell first.")
else:
    # Google Drive helper functions
    def find_or_create_folder(drive, folder_name, parent_id='root'):
        """Find existing folder or create new one in Google Drive."""
        if not drive:
            return None

        try:
            file_list = drive.ListFile({'q': f"title='{folder_name}' and '{parent_id}' in parents and mimeType='application/vnd.google-apps.folder' and trashed=false"}).GetList()

            if file_list:
                print(f"✅ Found existing folder: {folder_name}")
                return file_list[0]['id']
            else:
                folder = drive.CreateFile({
                    'title': folder_name,
                    'mimeType': 'application/vnd.google-apps.folder',
                    'parents': [{'id': parent_id}]
                })
                folder.Upload()
                print(f"✅ Created new folder: {folder_name}")
                return folder['id']
        except Exception as e:
            print(f"❌ Error with folder '{folder_name}': {e}")
            return None

    def upload_to_gdrive(drive, file_path, folder_id, description):
        """Upload file to Google Drive folder."""
        if not drive or not os.path.exists(file_path):
            return None

        try:
            uploaded_file = drive.CreateFile({
                'title': os.path.basename(file_path),
                'parents': [{'id': folder_id}]
            })
            uploaded_file.SetContentFile(file_path)
            uploaded_file.Upload()

            file_url = f"https://drive.google.com/file/d/{uploaded_file['id']}/view"
            print(f"✅ Uploaded {description}: {file_url}")
            return file_url
        except Exception as e:
            print(f"❌ Upload failed for {description}: {e}")
            return None

    # Universal TMbed to GFF conversion function
    def tmbed_to_gff(tmbed_file, gff_file):
        """
        Convert any TMbed output format to GFF format.
        Handles both 3-line formats (0,1,4) and tabular formats (2,3).
        Only includes inside/outside regions if there's a signal peptide.
        """
        with open(tmbed_file, 'r') as f:
            lines = f.readlines()

        with open(gff_file, 'w') as out:
            # Write GFF header
            out.write("##gff-version 3\n")

            i = 0
            while i < len(lines):
                line = lines[i].strip()

                # Skip empty lines
                if not line:
                    i += 1
                    continue

                # Look for sequence header
                if line.startswith('>'):
                    header = line[1:]  # Remove '>' character
                    sequence_name = header.split()[0]  # Get first part as sequence name

                    # Check if this is 3-line format or tabular format
                    if i + 2 < len(lines) and not lines[i + 2].strip().startswith(('AA', 'M', 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'F', 'P', 'S', 'T', 'W', 'Y', 'V')):
                        # This looks like 3-line format (line 3 is not amino acids)
                        sequence = lines[i + 1].strip()
                        prediction = lines[i + 2].strip()
                        i += 3
                    elif i + 1 < len(lines) and lines[i + 1].strip().startswith('AA'):
                        # This is tabular format (line 2 starts with "AA PRD...")
                        sequence = ''
                        prediction = ''

                        # Skip column header line
                        i += 2

                        # Read data rows until next header or end of file
                        while i < len(lines) and not lines[i].strip().startswith('>') and lines[i].strip():
                            data_line = lines[i].strip()
                            if data_line:
                                parts = data_line.split()
                                if len(parts) >= 2:
                                    sequence += parts[0]  # Amino acid
                                    prediction += parts[1]  # Prediction
                            i += 1
                    else:
                        # Skip malformed entries
                        i += 1
                        continue

                    # Now process the prediction using universal logic
                    if prediction:
                        # Standardize prediction format - convert to simplified format
                        simplified_pred = ''
                        for char in prediction:
                            if char in 'Bb':
                                simplified_pred += 'B'  # All beta strands as 'B'
                            elif char in 'Hh':
                                simplified_pred += 'H'  # All helices as 'H'
                            elif char == 'S':
                                simplified_pred += 'S'  # Signal peptides stay as 'S'
                            elif char in 'io':
                                simplified_pred += '.'  # Inside/outside becomes dot
                            else:
                                simplified_pred += '.'  # Everything else becomes a dot

                        # First pass: Identify all TM segments and signal peptides
                        segments = []
                        j = 0
                        while j < len(simplified_pred):
                            if simplified_pred[j] in 'BHS':
                                feature_type = 'signal_peptide' if simplified_pred[j] == 'S' else \
                                              'TMbeta' if simplified_pred[j] == 'B' else 'TMhelix'

                                start = j
                                while j < len(simplified_pred) and simplified_pred[j] == simplified_pred[start]:
                                    j += 1

                                segments.append({
                                    'type': feature_type,
                                    'start': start + 1,  # 1-based
                                    'end': j  # exclusive end -> inclusive end
                                })
                            else:
                                j += 1

                        # Check if we have a signal peptide
                        has_signal = any(seg['type'] == 'signal_peptide' for seg in segments)

                        # Only continue if we have TM or signal features to annotate
                        if segments:
                            final_regions = []

                            # Add all the TM and signal peptide segments
                            for seg in segments:
                                final_regions.append(seg)

                            # If we have a signal peptide, we need to add inside/outside regions
                            if has_signal:
                                # Sort segments by start position
                                sorted_segments = sorted(segments, key=lambda x: x['start'])

                                # Default: start as INSIDE
                                current_location = 'inside'

                                # Find regions between features
                                last_end = 1  # Start of the sequence

                                for seg in sorted_segments:
                                    # If there's a gap before this segment, add it as inside/outside
                                    if seg['start'] > last_end:
                                        # Add the gap region
                                        final_regions.append({
                                            'type': current_location,
                                            'start': last_end,
                                            'end': seg['start'] - 1
                                        })

                                    # Update location based on segment type
                                    if seg['type'] == 'signal_peptide':
                                        # After signal peptide is always OUTSIDE
                                        current_location = 'outside'
                                    elif seg['type'] in ['TMbeta', 'TMhelix']:
                                        # Flip side after TM segment
                                        current_location = 'outside' if current_location == 'inside' else 'inside'

                                    # Update last end
                                    last_end = seg['end'] + 1

                                # Add the final region if there's space after the last segment
                                if last_end <= len(simplified_pred):
                                    final_regions.append({
                                        'type': current_location,
                                        'start': last_end,
                                        'end': len(simplified_pred)
                                    })

                            # Sort all regions by start position
                            final_regions.sort(key=lambda x: x['start'])

                            # Count feature types for numbering
                            feature_counts = {
                                'TMbeta': 0,
                                'TMhelix': 0,
                                'signal_peptide': 0,
                                'inside': 0,
                                'outside': 0
                            }

                            # Write GFF entries for each region
                            for region in final_regions:
                                feature_type = region['type']
                                feature_counts[feature_type] += 1
                                count = feature_counts[feature_type]

                                # Create region ID
                                region_id = f"{sequence_name}_{feature_type}-{count}"

                                # Write GFF entry
                                gff_line = f"{sequence_name}\tTMbed\t{feature_type}\t{region['start']}\t{region['end']}\t.\t.\t.\tID={region_id};Name={feature_type}\n"
                                out.write(gff_line)
                else:
                    i += 1

    # Run TMbed prediction
    print(f"Starting prediction on {clean_filename}...")
    print(f"Using output format: {output_format}")
    start_time = time.time()

    # Move the file to the TMbed directory for processing
    shutil.copy2(clean_filename, "/content/tmbed/")

    # Change directory to TMbed and run the prediction
    %cd '/content/tmbed/'
    !python -m tmbed predict -f {clean_filename} -p {output_filename} --out-format={output_format}
    %cd '/content'

    # Move the output file back if it was created in the tmbed directory
    if os.path.exists(f"/content/tmbed/{output_filename}"):
        shutil.copy2(f"/content/tmbed/{output_filename}", f"/content/{output_filename}")

    elapsed_time = time.time() - start_time

    if os.path.exists(output_filename):
        print(f"✅ Prediction completed in {elapsed_time:.1f} seconds")
        print(f"Results saved to {output_filename}")

        # Setup Google Drive folder if available
        folder_id = None
        if drive:
            folder_id = find_or_create_folder(drive, gdrive_folder_name)

        # Process results based on output format
        uploaded_files = []

        if output_format == 0:
            # Format 0: Convert to GFF and upload both files
            gff_filename = f"{base_name}.gff"
            print("Converting to GFF format...")
            tmbed_format0_to_gff(output_filename, gff_filename)
            print(f"✅ Converted TMbed format 0 output to GFF format: {gff_filename}")

            # Upload to Google Drive if available, otherwise download normally
            if drive and folder_id:
                pred_url = upload_to_gdrive(drive, output_filename, folder_id, f"prediction file ({output_filename})")
                gff_url = upload_to_gdrive(drive, gff_filename, folder_id, f"GFF file ({gff_filename})")

                if pred_url:
                    uploaded_files.append(pred_url)
                if gff_url:
                    uploaded_files.append(gff_url)
            else:
                print("⚠️ Google Drive not available - downloading files normally")
                from google.colab import files
                print(f"Downloading prediction file: {output_filename}")
                files.download(output_filename)
                print(f"Downloading GFF file: {gff_filename}")
                files.download(gff_filename)

            # Show preview
            print(f"\nPreview of the GFF file:")
            !head -10 "{gff_filename}"

        elif output_format == 1:
            # Format 1: Convert to GFF and upload both files
            gff_filename = f"{base_name}.gff"
            print("Converting to GFF format...")
            tmbed_format1_to_gff(output_filename, gff_filename)
            print(f"✅ Converted TMbed format 1 output to GFF format: {gff_filename}")

            # Upload to Google Drive if available, otherwise download normally
            if drive and folder_id:
                pred_url = upload_to_gdrive(drive, output_filename, folder_id, f"prediction file ({output_filename})")
                gff_url = upload_to_gdrive(drive, gff_filename, folder_id, f"GFF file ({gff_filename})")

                if pred_url:
                    uploaded_files.append(pred_url)
                if gff_url:
                    uploaded_files.append(gff_url)
            else:
                print("⚠️ Google Drive not available - downloading files normally")
                from google.colab import files
                print(f"Downloading prediction file: {output_filename}")
                files.download(output_filename)
                print(f"Downloading GFF file: {gff_filename}")
                files.download(gff_filename)

            # Show preview
            print(f"\nPreview of the GFF file:")
            !head -10 "{gff_filename}"

        else:
            # Formats 2-4: Just upload the prediction file
            if drive and folder_id:
                pred_url = upload_to_gdrive(drive, output_filename, folder_id, f"prediction file ({output_filename})")
                if pred_url:
                    uploaded_files.append(pred_url)
            else:
                print("⚠️ Google Drive not available - file remains in Colab session")

            # Show preview
            print(f"\nPreview of the prediction file:")
            !head -10 "{output_filename}"

        # Summary
        if uploaded_files:
            print(f"\n✅ Successfully uploaded {len(uploaded_files)} file(s) to Google Drive folder: {gdrive_folder_name}")
            for i, url in enumerate(uploaded_files, 1):
                print(f"   {i}. {url}")
        elif drive and folder_id:
            print(f"\n⚠️ Google Drive available but no files uploaded - check for errors above")
        else:
            print(f"\n📁 Files downloaded to your computer:")
            print(f"   • {output_filename}")
            print(f"   • {base_name}.gff")

    else:
        print("❌ Error: Prediction failed")

❌ Error: Please run the configuration cell first.
