# antiSMASH Analysis on Google Colab

**Purpose:**
- Run antiSMASH (Biosynthetic Gene Cluster analysis) on GenBank or FASTA files
- Designed for Google Colab with easy micromamba setup
- Download and explore results interactively

**Before you start:**
- Runtime: Works on Google Colab (CPU is sufficient, no GPU needed)
- Input: GenBank (.gbk/.gbff) or nucleotide FASTA files (**.fna files are supported**)
- Output: Interactive HTML report with BGC annotations
- Requirements: Python 3.12+ recommended

**Quick start:**
1. Run all cells (Runtime → Run all)
2. Upload your GenBank/FASTA file when prompted
3. Wait for analysis to complete
4. Download results from the file browser

---

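## Detailed Instructions

### 📋 Steps:

1. **Run all cells in order** (Cell 0 → Cell 7)
   - Click each code cell and press `Shift + Enter` to run it
   - Or use the menu: `Runtime → Run all`

2. **Cell 3 - File input**:
   - You will be prompted: `Download sample GenBank file for testing? (y/N):`
   - Enter `y` to download the sample file for testing
   - Enter `N` to upload your own file
   - **Supported file formats**:
     - ✅ GenBank: `.gbk`, `.gbff`
     - ✅ FASTA: `.fna`, `.fa`, `.fasta`

3. **File paths** (handled automatically; no need to set them manually):
   - Input directory: `/content/antismash_workspace/input/`
   - Output directory: `/content/antismash_workspace/output/` (each run is written to a subfolder named after the input file)

4. **Cell 4 - Configuration**:
   - Taxon: choose `bacteria` or `fungi`
   - Analysis depth: choose a level (1 = fastest, 2 = recommended, 3 = full)

5. **Cell 5 - Run the analysis**:
   - Runs antiSMASH automatically (10-30 minutes)
   - Results are saved in the output directory

6. **Cell 7 - Download results**:
   - Optionally packages all results into a ZIP file for download
   - The main file to open is `index.html`

### ⚠️ Notes:
- At least 5 GB of disk space is required (for the databases)
- Large genome files take longer to analyze
- Keep your network connection stable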
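
The disk space note above can be checked before the databases are downloaded. The following is a minimal sketch using only the standard library; the helper names `free_space_gb` and `has_enough_space` are hypothetical and not part of this notebook, and the 5 GB threshold is simply the figure quoted above.

```python
import shutil

REQUIRED_GB = 5  # threshold taken from the note above; adjust if needed


def free_space_gb(path="/"):
    """Return the free disk space at `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path="/", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GiB are free at `path`."""
    return free_space_gb(path) >= required_gb


print(f"Free space: {free_space_gb():.1f} GB (about {REQUIRED_GB} GB needed)")
if not has_enough_space():
    print("Warning: the database download may fail for lack of disk space")
```

Running this before Cell 1 gives an early warning instead of a failed download partway through.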

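### 🤔 What is Micromamba?

**Micromamba** is a lightweight package manager, used here to install antiSMASH and its dependencies (times and sizes below are approximate):

| Feature | Micromamba | Conda/Mamba |
|------|------------|-------------|
| **Install speed** | 🚀 Very fast (1-2 min) | 🐢 Slower (5-15 min) |
| **Size** | 📦 ~5 MB | 📦 ~500 MB |
| **Dependency resolution** | ⚡ Fast | 🐌 Slower |
| **End result** | ✅ Identical antiSMASH | ✅ Identical antiSMASH |
| **On Colab** | ✅ Recommended | ⚠️ Works but slow |

**Why Micromamba?**
- ✅ Faster to install on Colab
- ✅ Smaller footprint
- ✅ Fully compatible with conda packages
- ✅ The resulting antiSMASH version and features are identical

**Installation methods compared:**
- **Micromamba (recommended on Colab)**: fast and lightweight
- **Conda/Mamba**: the traditional method, common for local environments
- **Docker**: uses the official antiSMASH Docker image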


In [None]:
# Cell 0 - Setup working directory
from pathlib import Path
import os

# Detect environment
ON_COLAB = Path('/content').exists()

# Setup base directory
if ON_COLAB:
    BASE = Path('/content/antismash_workspace')
else:
    BASE = Path.cwd() / 'antismash_workspace'

INPUT_DIR = BASE / 'input'
OUTPUT_DIR = BASE / 'output'

# Create directories
for d in [BASE, INPUT_DIR, OUTPUT_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('=' * 60)
print('🧬 antiSMASH Workspace Setup')
print('=' * 60)
print(f'Environment: {"Google Colab" if ON_COLAB else "Local"}')
print(f'Base directory: {BASE}')
print(f'Input directory: {INPUT_DIR}')
print(f'Output directory: {OUTPUT_DIR}')
print('✅ Directories created successfully')
print()

In [None]:
# Cell 1 - Install micromamba and antiSMASH
import shutil
import subprocess

print('=' * 60)
print('📦 Installing antiSMASH')
print('=' * 60)
print()
print('ℹ️  Using Micromamba for installation (lightweight, usually much faster than Conda)')
print()

# Check if micromamba is already installed
if shutil.which('micromamba'):
    print('✅ Micromamba already installed')
else:
    print('📥 Installing micromamba...')
    if ON_COLAB:
        # Install micromamba on Colab
        !mkdir -p /usr/local/bin
        !curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xj -C /usr/local bin/micromamba
        print('✅ Micromamba installed to /usr/local/bin')
    else:
        print('⚠️ Please install micromamba manually:')
        print('   macOS: brew install micromamba')
        print('   Linux: curl -Ls https://micro.mamba.pm/install.sh | bash')
        print('   Or visit: https://mamba.readthedocs.io/en/latest/installation.html')

# Stop with a clear message if micromamba is still unavailable;
# otherwise the subprocess call below fails with FileNotFoundError
if not shutil.which('micromamba'):
    raise RuntimeError('micromamba not found on PATH; install it and re-run this cell')

# Check if antiSMASH environment exists
env_check = subprocess.run(
    ['micromamba', 'env', 'list'],
    capture_output=True,
    text=True
)

if 'antismash' in env_check.stdout:
    print('✅ antiSMASH environment already exists')
else:
    print('📦 Creating antiSMASH environment...')
    print('   (this may take 5-10 minutes)')
    !micromamba create -y -n antismash -c conda-forge -c bioconda antismash python=3.12
    print('✅ antiSMASH environment created')

print()
print('📥 Downloading antiSMASH databases...')
print('   (this may take a few minutes)')
!micromamba run -n antismash download-antismash-databases --verbose

print()
print('✅ Installation complete!')
print('=' * 60)

In [None]:
# Cell 2 - Verify installation
print('=' * 60)
print('🔍 Verifying antiSMASH Installation')
print('=' * 60)

try:
    # Copy the environment and drop MPLBACKEND so the notebook's matplotlib
    # backend setting does not leak into the antismash environment
    env = os.environ.copy()
    env.pop('MPLBACKEND', None)  # Remove incompatible backend setting

    result = subprocess.run(
        ['micromamba', 'run', '-n', 'antismash', 'antismash', '--version'],
        capture_output=True,
        text=True,
        check=True,
        env=env
    )
    print('✅ antiSMASH is ready!')
    print(f'Version: {result.stdout.strip()}')
except FileNotFoundError:
    # micromamba itself is missing from PATH
    print('❌ micromamba not found')
    print('Please re-run Cell 1')
except subprocess.CalledProcessError as e:
    print('❌ antiSMASH verification failed')
    print('Error:', e.stderr)
    print('Please re-run Cell 1')

print('=' * 60)

In [None]:
# Cell 3 - Upload or download sample data
print('=' * 60)
print('📁 Input File Setup')
print('=' * 60)
print()
print('Supported formats:')
print('  ✅ GenBank: .gbk, .gbff')
print('  ✅ FASTA: .fna, .fa, .fasta')
print()

use_sample = input('Download sample GenBank file for testing? (y/N): ').strip().lower() == 'y'

if use_sample:
    print('📥 Downloading sample bacterial genome...')
    import urllib.request
    import gzip
    import shutil
    
    # E. coli K-12 MG1655 reference assembly from NCBI RefSeq
    sample_url = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff.gz'
    sample_gz = INPUT_DIR / 'sample.gbff.gz'
    sample_file = INPUT_DIR / 'sample.gbff'
    
    try:
        urllib.request.urlretrieve(sample_url, sample_gz)
        
        # Decompress in chunks rather than reading the whole file into memory
        with gzip.open(sample_gz, 'rb') as f_in, open(sample_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
        
        sample_gz.unlink()  # Remove .gz
        print(f'✅ Sample file downloaded: {sample_file.name}')
        input_file = sample_file
        
    except Exception as e:
        print(f'❌ Download failed: {e}')
        input_file = None
else:
    # Upload file on Colab
    if ON_COLAB:
        from google.colab import files
        print('📤 Please upload your file:')
        print('   Supported: GenBank (.gbk/.gbff) or FASTA (.fna/.fa/.fasta)')
        print()
        uploaded = files.upload()
        
        if uploaded:
            filename = list(uploaded.keys())[0]
            input_file = INPUT_DIR / filename
            
            # Move uploaded file to input directory
            import shutil
            shutil.move(filename, input_file)
            print(f'✅ File uploaded: {input_file.name}')
            print(f'📂 Saved to: {input_file}')
        else:
            print('⚠️ No file uploaded')
            input_file = None
    else:
        print('Please place your GenBank/FASTA file in:', INPUT_DIR)
        file_path = input('Enter filename (or full path): ').strip()
        
        if file_path:
            input_file = Path(file_path) if Path(file_path).is_absolute() else INPUT_DIR / file_path
            if not input_file.exists():
                print(f'❌ File not found: {input_file}')
                input_file = None
            else:
                print(f'✅ Using file: {input_file}')
        else:
            input_file = None

print('=' * 60)

In [None]:
# Cell 4 - Configure antiSMASH parameters
print('=' * 60)
print('⚙️ antiSMASH Configuration')
print('=' * 60)

print('\nCommon options (press Enter for defaults):')
print()

# Taxon (antiSMASH accepts only these two values)
print('Taxon (bacteria/fungi):')
taxon = input('  [default: bacteria]: ').strip().lower() or 'bacteria'
if taxon not in ('bacteria', 'fungi'):
    print(f'⚠️ Unknown taxon "{taxon}", using bacteria')
    taxon = 'bacteria'

# Additional features
print('\nAnalysis depth:')
print('  1. Minimal (fastest)')
print('  2. Standard (recommended)')
print('  3. Full (includes all extra features, slowest)')
depth = input('  Select [1-3, default: 2]: ').strip() or '2'
if depth not in ('1', '2', '3'):
    print(f'⚠️ Unknown depth "{depth}", using 2 (Standard)')
    depth = '2'
depth_names = {'1': 'Minimal', '2': 'Standard', '3': 'Full'}

# Build command arguments
extra_args = ['--taxon', taxon]

if depth == '3':
    extra_args.extend([
        '--cb-general',
        '--cb-subclusters',
        '--cb-knownclusters',
        '--asf',
        '--pfam2go',
        '--smcog-trees'
    ])
elif depth == '1':
    extra_args.append('--minimal')

print()
print('Configuration:')
print(f'  Taxon: {taxon}')
print(f'  Depth: {depth_names[depth]}')
print(f'  Extra args: {" ".join(extra_args)}')
print('=' * 60)

In [None]:
# Cell 5 - Run antiSMASH analysis
if input_file is None:
    print('❌ No input file specified. Please run Cell 3 first.')
else:
    print('=' * 60)
    print('🔬 Running antiSMASH Analysis')
    print('=' * 60)
    print(f'Input: {input_file.name}')
    print(f'Output: {OUTPUT_DIR}')
    print()
    print('⏳ This may take 10-30 minutes depending on genome size...')
    print()
    
    # Prepare output subdirectory
    run_output = OUTPUT_DIR / input_file.stem
    run_output.mkdir(parents=True, exist_ok=True)
    
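    # Build command
    cmd = [
        'micromamba', 'run', '-n', 'antismash',
        'antismash',
        str(input_file.resolve()),
        '--output-dir', str(run_output.resolve())
    ] + extra_args
    
    # FASTA input carries no gene annotations, and antiSMASH does not run
    # gene finding unless asked, so request a gene finder explicitly
    if input_file.suffix.lower() in {'.fna', '.fa', '.fasta'}:
        gene_finder = 'glimmerhmm' if taxon == 'fungi' else 'prodigal'
        cmd += ['--genefinding-tool', gene_finder]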
    
    print('Running command:')
    print(' '.join(cmd))
    print()
    
    try:
        # Copy the environment and drop MPLBACKEND so the notebook's matplotlib
        # backend setting does not leak into the antismash environment
        env = os.environ.copy()
        env.pop('MPLBACKEND', None)  # Remove incompatible backend setting

        result = subprocess.run(cmd, check=True, env=env)

        print()
        print('=' * 60)
        print('✅ Analysis Complete!')
        print('=' * 60)
        
        # Check for output files
        index_html = run_output / 'index.html'
        if index_html.exists():
            print(f'📊 Results: {run_output}')
            print(f'📄 Main report: {index_html.name}')
            print()
            print('To view results:')
            if ON_COLAB:
                print('  - Use the file browser (left panel) to navigate to output folder')
                print('  - Download index.html and open in your browser')
            else:
                print(f'  - Open {index_html} in your browser')
        else:
            print('⚠️ index.html not found, but analysis completed')
            print(f'Check {run_output} for output files')
        
    except subprocess.CalledProcessError as e:
        print()
        print('❌ Analysis failed')
        print('Error code:', e.returncode)
        print('\nTroubleshooting:')
        print('  - Check that input file is valid GenBank/FASTA format')
        print('  - Ensure databases were downloaded successfully (Cell 1)')
        print('  - Try re-running the cell')

In [None]:
# Cell 6 - Display results summary
if 'run_output' in globals() and run_output.exists():
    print('=' * 60)
    print('📋 Results Summary')
    print('=' * 60)
    
    # List output files
    print('\nGenerated files:')
    for item in sorted(run_output.iterdir()):
        if item.is_file():
            size = item.stat().st_size
            size_str = f'{size:,} bytes' if size < 1024*1024 else f'{size/(1024*1024):.1f} MB'
            print(f'  📄 {item.name} ({size_str})')
    
    # Region files also end in .gbk, so count them separately
    # from the whole-record GenBank files
    region_files = sorted(run_output.glob('*.region*.gbk'))
    gbk_files = [f for f in run_output.glob('*.gbk') if f not in region_files]
    if gbk_files:
        print(f'\n✅ Found {len(gbk_files)} annotated GenBank file(s)')
    
    if region_files:
        print(f'✅ Found {len(region_files)} BGC region(s)')
    
    print('\n' + '=' * 60)
    print('🎉 Analysis complete! Download the output folder to view results.')
    print('=' * 60)
else:
    print('⚠️ No results to display. Please run Cell 5 first.')

In [None]:
# Cell 7 - Download results (Colab only)
if ON_COLAB and 'run_output' in globals() and run_output.exists():
    print('=' * 60)
    print('📥 Download Results')
    print('=' * 60)
    
    download_all = input('Create ZIP archive of all results? (y/N): ').strip().lower() == 'y'
    
    if download_all:
        import shutil
        from google.colab import files
        
        print('📦 Creating ZIP archive...')
        zip_path = BASE / f'{input_file.stem}_results.zip'
        
        shutil.make_archive(
            str(zip_path.with_suffix('')),
            'zip',
            run_output
        )
        
        print(f'✅ Archive created: {zip_path.name}')
        print('📥 Starting download...')
        
        files.download(str(zip_path))
        print('✅ Download started (check your browser)')
    else:
        print('💡 Tip: Use the file browser (left panel) to download individual files')
        print(f'   Navigate to: {run_output}')
else:
    if not ON_COLAB:
        print('ℹ️ Download cell only works on Google Colab')
    else:
        print('⚠️ No results to download. Run Cell 5 first.')

## Notes

### Understanding antiSMASH Results

Key output files:
- `index.html`: Main interactive report with all results
- `*.gbk`: Annotated GenBank files with BGC predictions
- `*.region*.gbk`: Individual BGC regions
- `*.json`: Machine-readable results
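
The region files listed above can also be summarized programmatically. The sketch below groups them by record using only the `*.region*.gbk` naming pattern; it assumes names of the form `<record>.region<NNN>.gbk`, and `regions_by_record` is a hypothetical helper, not part of antiSMASH.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <record>.region<NNN>.gbk
REGION_RE = re.compile(r"^(?P<record>.+)\.region(?P<num>\d+)\.gbk$")


def regions_by_record(output_dir):
    """Map each record name to the sorted list of its region numbers."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).glob("*.region*.gbk")):
        match = REGION_RE.match(path.name)
        if match:
            grouped[match["record"]].append(int(match["num"]))
    return dict(grouped)
```

Calling `regions_by_record(run_output)` after Cell 5 would give a quick per-record count of predicted regions without opening the HTML report.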

### Common BGC Types

- **NRPS**: Nonribosomal peptide synthetase
- **PKS**: Polyketide synthase
- **RiPP**: Ribosomally synthesized and post-translationally modified peptides
- **Terpene**: Terpene biosynthesis
- **Bacteriocin**: Antimicrobial peptides (recent antiSMASH versions report these as `RiPP-like`)

### Troubleshooting

**Installation issues:**
- If databases fail to download, try running Cell 1 again
- Check internet connection
- Ensure sufficient disk space (>5 GB)

**Analysis issues:**
- Verify input file format (GenBank or FASTA)
- For large genomes (>10 MB), expect longer runs and keep the Colab session active so it does not disconnect
- For a faster analysis, select depth 1 in Cell 4, which passes the `--minimal` flag
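
To verify the input format before starting a long run, the first non-blank line of the file is usually enough: GenBank records begin with `LOCUS` and FASTA records with `>`. The following is a rough sketch of such a check; `sniff_format` is a hypothetical helper, and it inspects only the first line rather than validating the whole file.

```python
def sniff_format(path):
    """Guess 'genbank', 'fasta', or 'unknown' from the first non-blank line."""
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip leading blank lines
            if stripped.startswith("LOCUS"):
                return "genbank"
            if stripped.startswith(">"):
                return "fasta"
            return "unknown"
    return "unknown"  # empty file
```

A result of `unknown` for `sniff_format(input_file)` suggests the upload is compressed, truncated, or in a different format.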

### Resources

- [antiSMASH Documentation](https://docs.antismash.secondarymetabolites.org/)
- [antiSMASH Web Server](https://antismash.secondarymetabolites.org/)
- [Publication (antiSMASH 6.0, Nucleic Acids Research, 2021)](https://academic.oup.com/nar/article/49/W1/W29/6274535)
