<a href="https://colab.research.google.com/github/Palaeoprot/TimeTree/blob/main/TimeTree_Downloader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cell 1 – Introduction & Environment Setup

## 🌳 TimeTree Newick File Downloader

This notebook automates the process of downloading phylogenetic trees in the **Newick (`.nwk`) format** from [TimeTree.org](http://timetree.org/). This is a necessary preliminary step for integrating divergence date information into our main ZooMS taxonomy analysis.

**Why is a separate script needed?**
TimeTree does not provide a simple public API for bulk data downloads. This script works by programmatically simulating the web request a user's browser makes when they click the "Download" button for a Newick file.
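To make "simulating the web request" concrete, the sketch below builds (but does not send) a form-encoded `POST` using only the Python standard library, so it runs without network access. The form field names are copied from Cell 3 of this notebook; whether the TimeTree server accepts them is an assumption that this offline sketch does not verify. `build_form_request` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url: str, fields: dict) -> Request:
    """Build (without sending) the kind of POST a browser form submission produces."""
    body = urlencode(fields).encode("ascii")  # spaces become '+', pairs joined by '&'
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Field names taken from Cell 3; treat them as assumptions about the
# TimeTree form, not as a documented API.
form_fields = {
    "newick_str": "Get Newick",
    "tree_type": "species",
    "taxon_search": "Mammalia",
}

request = build_form_request("https://timetree.org/home", form_fields)
print(request.get_method())          # HTTP verb of the prepared request
print(request.data.decode("ascii"))  # the encoded form body
```

Cell 3 does the same thing with `requests.post(url, data=payload)`, which performs the form encoding internally.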

**What this cell does:**
*   **Installs Packages:** It ensures the `requests` library, used for making web requests, is installed.
*   **Imports Libraries:** It imports `requests`, `Path` (from `pathlib`), `time`, and Colab's `drive` helper.
*   **Connects to Google Drive:** It requests permission to mount your Google Drive, allowing the notebook to save the downloaded files directly to your project folder.

**⚠️ Important Note on Responsible Use:**
This script sends automated requests to the TimeTree server. To avoid overloading it, the script pauses between requests. Please use this tool responsibly and for academic purposes.

In [4]:
# ===== Cell 1 =====
# Environment Setup

print("🌳 TimeTree Downloader - Environment Setup")
print("=" * 50)

# Install required packages
!pip install -q requests

# Import necessary libraries
import requests
from pathlib import Path # Import Path from pathlib
import time
from google.colab import drive

# Mount Google Drive
try:
    drive.mount('/content/drive')
    print("✅ Google Drive mounted successfully.")
except Exception as e:
    print(f"❌ Google Drive mount failed: {e}")

print("\n✅ Setup complete. Proceed to Cell 2 to configure the download list.")

🌳 TimeTree Downloader - Environment Setup
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully.

✅ Setup complete. Proceed to Cell 2 to configure the download list.


# Cell 2 – Configuration

## 🎯 Configure Your Downloads

This is the **only cell you need to edit**. Choose the clade (or clades) you want to download and ensure the output directory is correct.

**Instructions:**
1.  **`SELECTED_CLADES`:** Click the dropdown to select a clade. Colab form dropdowns are single-select, so to download several clades in one run, type them into the field as a comma-separated list (e.g., `Aves, Mammalia`).
2.  **Advanced Users:** You can also click into the dropdown field and type a custom clade name (e.g., "Carnivora") that is not on the list. Any common name in parentheses is dropped, so only the scientific name is sent to TimeTree and used in the file name.
3.  **`OUTPUT_DIRECTORY`:** This path **must exactly match** the path you configured in your main analysis notebook so it can find the files later. It should point to the `TimeTree_Newick_Files` folder inside your project directory.

In [None]:
# ===== Cell 2 =====
# Configuration

# @title 1. Select Target Clades
# @markdown Select a vertebrate clade from the list, or type a comma-separated list of clade names to download several.
SELECTED_CLADES = "Mammalia (mammals)" # @param ["Myxini (hagfish)", "Petromyzontida (lampreys)", "Chondrichthyes (cartilaginous fish)", "Actinopterygii (ray-finned fish)", "Dipnoi (lungfish)", "Actinistia (coelacanths)", "Amphibia (amphibians)", "Testudines (turtles)", "Lepidosauria (lizards, snakes, tuatara)", "Crocodylia (crocodilians)", "Aves (birds)", "Mammalia (mammals)"] {allow-input: true}


# @title 2. Output Directory
# @markdown Enter the full path to the directory where the Newick files will be saved.
OUTPUT_DIRECTORY = "/content/drive/MyDrive/Colab_Notebooks/GitHub/_SHARED_DATA/FASTA/ZooMS_Taxonomy/TimeTree_Newick_Files" #@param {type:"string"}

# @title 3. Download Delay
# @markdown Delay in seconds between each download request to be polite to the server.
DOWNLOAD_DELAY_SECONDS = 5 #@param {type:"slider", min:1, max:10, step:1}

# --- End of Configuration ---

import re  # used only to parse SELECTED_CLADES

# The dropdown gives us a single string, not a list, so it has to be split explicitly.
# Drop any "(common name)" suffix first (these can contain commas), then split on commas.
scientific_names = re.sub(r"\([^)]*\)", "", SELECTED_CLADES)
clade_list = [clade.strip() for clade in scientific_names.split(",") if clade.strip()]
output_path = Path(OUTPUT_DIRECTORY)

# Create the output directory if it doesn't exist
output_path.mkdir(parents=True, exist_ok=True)

print(f"✅ Configuration loaded.")
if not clade_list:
    print("   ⚠️ No clades selected. The next cell will have nothing to download.")
else:
    print(f"   Target clades to download ({len(clade_list)}): {', '.join(clade_list)}")
    print(f"   Output directory: {output_path}")

# Cell 3 – Main Execution

## 🚀 Download The Files

This cell is the "run" button. It will iterate through the list of clades you configured in Cell 2 and attempt to download a Newick file for each one.

**The Process:**
1.  It loops through each clade name.
2.  For each name, it constructs and sends a `POST` request to the TimeTree server, mimicking a browser.
3.  It checks whether the server's response looks like Newick data (it must start with `(` and end with `;`).
4.  If successful, it saves the data to a `.nwk` file in your specified output directory.
5.  If it fails (e.g., the clade is not found on TimeTree), it prints an error message and moves to the next one.
6.  It waits for the configured delay before starting the next download.
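The validation in step 3 can be illustrated in isolation. The sketch below is a slightly stricter variant that also checks that parentheses balance; `looks_like_newick` is a hypothetical helper, not the function used in Cell 3, and it assumes labels do not contain quoted parentheses. The example trees use placeholder labels and branch lengths, not TimeTree data.

```python
def looks_like_newick(text: str) -> bool:
    """Heuristic check: starts with '(', ends with ';', and parentheses balance."""
    candidate = text.strip()
    if not (candidate.startswith("(") and candidate.endswith(";")):
        return False
    depth = 0
    for char in candidate:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


print(looks_like_newick("((A:1,B:1):1,C:2);"))                   # well-formed toy tree
print(looks_like_newick("<html><body>Not found</body></html>"))  # an HTML error page
print(looks_like_newick("((A:1,B:1):1,C:2;"))                    # missing a ')'
```

A check like this only guards against obviously wrong responses such as an HTML page; it is not a full Newick parser.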

In [None]:
# ===== Cell 3 =====
# Main Execution

def download_newick_from_timetree(clade_name: str, save_path: Path) -> bool:
    """
    Sends a POST request to TimeTree to download the Newick file for a given clade.

    Returns True if a Newick-looking response was saved under `save_path`, False otherwise.
    """
    print(f"-> Attempting to download: '{clade_name}'...")

    # This is the URL endpoint that handles the download request
    url = "https://timetree.org/home"

    # This payload mimics the form data sent by a browser when clicking the download button
    payload = {
        'newick_str': 'Get Newick',
        'tree_type': 'species',
        'taxon_search': clade_name
    }

    try:
        # Make the POST request
        response = requests.post(url, data=payload, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes (like 404, 500)

        # Validate the response content to ensure it looks like a Newick tree
        newick_text = response.text.strip()
        if newick_text.startswith("(") and newick_text.endswith(";"):
            # Construct the output file path
            file_path = save_path / f"{clade_name.replace(' ', '_')}.nwk"

            # Write the tree to the file without surrounding whitespace
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(newick_text)

            print(f"   ✅ Success! Saved to {file_path.name}")
            return True
        else:
            print(f"   ⚠️ Failed. The server's response for '{clade_name}' was not a valid Newick tree.")
            return False

    except requests.exceptions.RequestException as e:
        # Covers connection problems, timeouts, and HTTP errors raised by raise_for_status()
        print(f"   ❌ Failed. A network or HTTP error occurred for '{clade_name}': {e}")
        return False

# --- Main Download Loop ---

if not clade_list:
    print("No clades configured in Cell 2. Nothing to download.")
else:
    print(f"\n🚀 Starting download of {len(clade_list)} Newick files...")
    print("=" * 50)

    success_count = 0
    failure_count = 0

    for i, clade in enumerate(clade_list):
        if download_newick_from_timetree(clade, output_path):
            success_count += 1
        else:
            failure_count += 1

        # Don't sleep after the very last request
        if i < len(clade_list) - 1:
            print(f"   ...waiting for {DOWNLOAD_DELAY_SECONDS} seconds...")
            time.sleep(DOWNLOAD_DELAY_SECONDS)

    print("\n" + "="*50)
    print("✨ Download process complete!")
    print(f"   Successfully downloaded: {success_count}")
    print(f"   Failed to download: {failure_count}")
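
After a run, one way to sanity-check the saved files is to list the `.nwk` files in the output directory and estimate how many tips each tree has. The sketch below is an optional check, not part of the original notebook; `count_tips` and `summarize_newick_files` are hypothetical helpers, and the tip estimate (commas + 1) assumes that labels contain no commas.

```python
from pathlib import Path


def count_tips(newick_text: str) -> int:
    """Estimate the number of tips as commas + 1 (assumes no commas inside labels)."""
    return newick_text.count(",") + 1


def summarize_newick_files(folder: Path) -> dict:
    """Map each .nwk file name in `folder` to its estimated tip count."""
    return {
        path.name: count_tips(path.read_text(encoding="utf-8"))
        for path in sorted(folder.glob("*.nwk"))
    }


# Example usage in this notebook (output_path comes from Cell 2):
# for name, tips in summarize_newick_files(output_path).items():
#     print(f"{name}: ~{tips} tips")
```

An empty result means no `.nwk` files were found, which usually points to a mismatch between `OUTPUT_DIRECTORY` and the folder the downloads were written to.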