# 🚀 Colab Ingest

Download files from **Pixeldrain**, **BuzzHeavier**, and **Bunkr**, then upload them to Google Drive.

## Quick Start
1. Run **Cell 1**: Setup (only once)
2. Run **Cell 2**: Mount Google Drive
3. Run **Cell 3**: Paste your links
4. Run **Cell 4**: Start download!

---

## 📦 Cell 1: Setup & Installation
Run this cell once to install everything.

In [1]:
#@title 1️⃣ Setup & Installation { display-mode: "form" }
#@markdown Run this cell to install colab-ingest and dependencies.

import os
import subprocess

print("📦 Installing system dependencies...")
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "p7zip-full", "unrar-free"], check=True)
print("✅ System dependencies installed")

# Clone repo if not exists
REPO_PATH = "/content/colab-ingest"
if not os.path.exists(REPO_PATH):
    print("\n📥 Cloning colab-ingest repository...")
    subprocess.run([
        "git", "clone", "--recurse-submodules",
        "https://github.com/Caowo0/colab-ingest.git",
        REPO_PATH
    ], check=True)
    print("✅ Repository cloned")
else:
    print(f"\n📁 Repository already exists at {REPO_PATH}")
    # Update to latest
    subprocess.run(["git", "-C", REPO_PATH, "pull"], check=True)
    subprocess.run(["git", "-C", REPO_PATH, "submodule", "update", "--init", "--recursive"], check=True)
    print("✅ Repository updated")

# Install Python package
print("\n🐍 Installing Python package...")
subprocess.run(["pip", "install", "-q", "-e", REPO_PATH], check=True)
print("✅ Python package installed")

# Install third-party downloader dependencies
print("\n📦 Installing third-party downloader dependencies...")
bunkr_req = os.path.join(REPO_PATH, "third_party", "BunkrDownloader", "requirements.txt")
if os.path.exists(bunkr_req):
    subprocess.run(["pip", "install", "-q", "-r", bunkr_req], check=True)
    print("✅ BunkrDownloader dependencies installed")
else:
    print("⚠️ BunkrDownloader requirements.txt not found")

print("\n" + "="*50)
print("🎉 Setup complete! Proceed to the next cell.")
print("="*50)

📦 Installing system dependencies...
✅ System dependencies installed

📥 Cloning colab-ingest repository...
✅ Repository cloned

🐍 Installing Python package...
✅ Python package installed

🎉 Setup complete! Proceed to the next cell.
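
Once setup finishes, it can help to confirm that the tools Cell 1 installs are actually on the `PATH`. The sketch below is not part of colab-ingest: `missing_tools` is a hypothetical helper, and the executable name `7z` is an assumption about what the `p7zip-full` package provides. The **System Check** utility cell remains the more thorough check.

```python
import shutil


def missing_tools(names):
    """Return the entries of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names: "7z" from p7zip-full, and "colab-ingest",
# the CLI entry point that the later cells invoke.
missing = missing_tools(["git", "7z", "colab-ingest"])
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All tools found")
```

If anything is reported missing, re-running Cell 1 is usually the simplest fix.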


## 💾 Cell 2: Mount Google Drive
Required for uploading downloaded files to your Drive.

In [2]:
#@title 2️⃣ Mount Google Drive { display-mode: "form" }
#@markdown Click the link and authorize access to your Google Drive.

from google.colab import drive
from pathlib import Path

MOUNT_POINT = "/content/drive"

# Check if already mounted
mydrive = Path(MOUNT_POINT) / "MyDrive"
if mydrive.exists():
    print("✅ Google Drive already mounted!")
    print(f"📁 MyDrive path: {mydrive}")
else:
    print("🔗 Mounting Google Drive...")
    drive.mount(MOUNT_POINT)
    print("\n✅ Google Drive mounted successfully!")

# Show some contents
print("\n📂 Contents of MyDrive (first 10 items):")
for i, item in enumerate(mydrive.iterdir()):
    if i >= 10:
        print("  ...")
        break
    icon = "📁" if item.is_dir() else "📄"
    print(f"  {icon} {item.name}")

🔗 Mounting Google Drive...


KeyboardInterrupt: 
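
The later cells print the upload destination as `/content/drive/` followed by a Drive-relative path such as `MyDrive/Downloads`. As a minimal sketch of how such a path could be joined and sanity-checked, the hypothetical helper `resolve_drive_dest` below (not part of colab-ingest) rejects destinations that would fall outside the mount point.

```python
from pathlib import Path


def resolve_drive_dest(drive_dest, mount_point="/content/drive"):
    """Join a Drive-relative destination onto the mount point.

    Raises ValueError if the result would fall outside the mount point,
    for example when drive_dest contains ".." or is an absolute path.
    """
    root = Path(mount_point).resolve()
    target = (root / drive_dest).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"{drive_dest!r} escapes {mount_point}")
    return target


print(resolve_drive_dest("MyDrive/Downloads"))
```

The check works on paths only, so it can run before Drive is mounted.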

## 📝 Cell 3: Enter Your Links
Paste your download links in the text area below (one per line).

In [None]:
#@title 3️⃣ Paste Your Links Here { display-mode: "form" }

import ipywidgets as widgets
from IPython.display import display, clear_output
import re
import json

# Configuration inputs
print("📋 Paste your links below (one URL per line):")
print("─" * 50)

links_textarea = widgets.Textarea(
    value='https://pixeldrain.com/u/example1\nhttps://buzzheavier.com/f/example2\nhttps://bunkr.si/a/example3',
    placeholder='Paste your links here, one per line...',
    layout=widgets.Layout(width='100%', height='200px')
)

drive_dest_input = widgets.Text(
    value='MyDrive/Downloads',
    description='Drive Dest:',
    layout=widgets.Layout(width='400px')
)

api_key_input = widgets.Password(
    value='',
    description='Pixeldrain Key:',
    placeholder='Optional - for Pixeldrain links',
    layout=widgets.Layout(width='400px')
)

concurrency_slider = widgets.IntSlider(
    value=3,
    min=1,
    max=10,
    step=1,
    description='Concurrency:'
)

max_retries_slider = widgets.IntSlider(
    value=3,
    min=1,
    max=10,
    step=1,
    description='Max Retries:'
)

save_button = widgets.Button(
    description='üíæ Save Configuration',
    button_style='primary',
    layout=widgets.Layout(width='200px')
)

output_area = widgets.Output()

def save_config(b):
    with output_area:
        clear_output()
        
        # Parse links - extract URLs from mixed text
        links_text = links_textarea.value
        lines = links_text.strip().split("\n")
        valid_links = []
        invalid_lines = []
        
        # Pattern to extract URLs from mixed text
        # NOTE: For full Bunkr support (24 TLDs, CDN subdomains, direct file URLs),
        # see colab_ingest.utils.url_detect, which has comprehensive patterns.
        # This inline pattern covers common cases for quick validation.
        #
        # Bunkr domains supported (via url_detect module):
        # - TLDs: si, su, la, ru, is, to, sk, ac, black, red, cat, ws, fi, ph,
        #         cr, site, media, click, se, cx, pk, ax, ps, org (24 total)
        # - CDN subdomains: cdn*, media-files*, i*, stream, v, videos, player
        # - 'bunkr', 'bunkrr', and 'bunkrrr' variants (1-3 r's)
        # - Direct CDN file URLs (without /a/, /f/, /v/ path prefixes)
        url_pattern = re.compile(
            r'(https?://(?:'
            r'pixeldrain\.com/[ul]/[a-zA-Z0-9]+'
            r'|buzzheavier\.com/f?/?[a-zA-Z0-9]+'
            r'|bzzhr\.co/[a-zA-Z0-9]+'
            # Bunkr: expanded TLDs (24 total), bunkr/bunkrr/bunkrrr variants
            r'|(?:(?:cdn\d*|media-files\d*|i\d*|stream|v|videos|player)\.)?bunkr{1,3}\.(?:si|su|la|ru|is|to|sk|ac|black|red|cat|ws|fi|ph|cr|site|media|click|se|cx|pk|ax|ps|org)(?:/[afvdi]/[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9]+)?|/[a-zA-Z0-9_-]+\.[a-zA-Z0-9]+)'
            r'))',
            re.IGNORECASE
        )
        
        for line in lines:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            # Find all URLs in the line (supports mixed text)
            found_urls = url_pattern.findall(line)
            if found_urls:
                valid_links.extend(found_urls)
            else:
                # Check if line has any http(s) URL that we couldn't match
                if re.search(r'https?://', line):
                    invalid_lines.append(line)
        
        # Save to file
        LINKS_FILE = "/content/links.txt"
        with open(LINKS_FILE, 'w') as f:
            f.write("\n".join(valid_links))
        
        # Display summary
        print("=" * 50)
        print("📋 LINKS SUMMARY")
        print("=" * 50)
        print(f"\n✅ Valid links: {len(valid_links)}")
        
        # Categorize by host
        hosts = {"pixeldrain": 0, "buzzheavier": 0, "bunkr": 0, "other": 0}
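        for link in valid_links:
            # The URL pattern is case-insensitive, so compare in lowercase
            lowered = link.lower()
            if "pixeldrain" in lowered:
                hosts["pixeldrain"] += 1
            elif "buzzheavier" in lowered or "bzzhr.co" in lowered:
                # bzzhr.co short links are counted with BuzzHeavier
                hosts["buzzheavier"] += 1
            elif "bunkr" in lowered:
                hosts["bunkr"] += 1
            else:
                hosts["other"] += 1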
        
        print("\n📊 By host:")
        for host, count in hosts.items():
            if count > 0:
                print(f"   • {host}: {count}")
        
        if invalid_lines:
            print(f"\n⚠️ Invalid lines (skipped): {len(invalid_lines)}")
        
        print(f"\n📁 Links saved to: {LINKS_FILE}")
        print(f"📂 Upload destination: /content/drive/{drive_dest_input.value}")
        
        # Warnings
        if hosts["pixeldrain"] > 0 and not api_key_input.value:
            print("\n⚠️ WARNING: Pixeldrain links found but no API key provided!")
            print("   Get your key at: https://pixeldrain.com/user/api_keys")
        
        # Store config for next cell
        config = {
            "links_file": LINKS_FILE,
            "drive_dest": drive_dest_input.value,
            "pixeldrain_api_key": api_key_input.value,
            "concurrency": concurrency_slider.value,
            "max_retries": max_retries_slider.value,
            "num_links": len(valid_links)
        }
        with open("/content/.ingest_config.json", 'w') as f:
            json.dump(config, f)
        
        print("\n" + "=" * 50)
        print("✅ Configuration saved! Run Cell 4 to start.")
        print("=" * 50)

save_button.on_click(save_config)

# Display widgets
display(links_textarea)
print("\n⚙️ Configuration:")
display(drive_dest_input)
display(api_key_input)
display(concurrency_slider)
display(max_retries_slider)
print()
display(save_button)
display(output_area)

📋 Paste your links below (one URL per line):
──────────────────────────────────────────────────


Textarea(value='https://pixeldrain.com/u/example1\nhttps://buzzheavier.com/f/example2\nhttps://bunkr.si/a/exam…


⚙️ Configuration:


Text(value='MyDrive/Downloads', description='Drive Dest:', layout=Layout(width='400px'))

Password(description='Pixeldrain Key:', layout=Layout(width='400px'), placeholder='Optional - for Pixeldrain l…

IntSlider(value=3, description='Concurrency:', max=10, min=1)

IntSlider(value=3, description='Max Retries:', max=10, min=1)




Button(button_style='primary', description='💾 Save Configuration', layout=Layout(width='200px'), style=ButtonS…

Output()
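
The link parsing in Cell 3 relies on `re.findall` with a single capturing group, which is what lets it pull URLs out of lines that also contain other text. The sketch below illustrates the idea with a deliberately simplified pattern covering only Pixeldrain and BuzzHeavier; `URL_PATTERN` and `extract_links` are illustrative names, not part of colab-ingest.

```python
import re

# Simplified pattern covering only two hosts; the full pattern in Cell 3
# also handles bzzhr.co and the Bunkr domains.
URL_PATTERN = re.compile(
    r'(https?://(?:'
    r'pixeldrain\.com/[ul]/[a-zA-Z0-9]+'
    r'|buzzheavier\.com/f?/?[a-zA-Z0-9]+'
    r'))',
    re.IGNORECASE,
)


def extract_links(text):
    """Return supported URLs found in text, in order, skipping '#' comment lines."""
    links = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # With exactly one capturing group, findall returns that group's text.
        links.extend(URL_PATTERN.findall(line))
    return links


sample = """# my batch
see https://pixeldrain.com/u/abc123, also https://buzzheavier.com/f/xyz789
https://example.com/not-supported
"""
print(extract_links(sample))
# ['https://pixeldrain.com/u/abc123', 'https://buzzheavier.com/f/xyz789']
```

Because the ID part of each pattern only allows letters and digits, trailing punctuation such as the comma in the sample is left out of the match.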

## 🚀 Cell 4: Start Download & Upload
This will download all files and upload them to your Google Drive.

In [None]:
#@title 4️⃣ Start Download & Upload { display-mode: "form" }
#@markdown Click **Run** to start the pipeline!

#@markdown ---
#@markdown ### Options:
dry_run = False #@param {type:"boolean"}
verbose = True #@param {type:"boolean"}
retry_failed = False #@param {type:"boolean"}

import json
import subprocess
import os

# Load config from previous cell
config_file = "/content/.ingest_config.json"
if not os.path.exists(config_file):
    print("❌ Error: Please run Cell 3 first and click 'Save Configuration'!")
    raise SystemExit(1)

with open(config_file, 'r') as f:
    config = json.load(f)

if config["num_links"] == 0:
    print("❌ Error: No valid links found. Please check Cell 3.")
    raise SystemExit(1)

print("🚀 Starting colab-ingest pipeline...")
print("=" * 50)
print(f"📋 Links: {config['num_links']}")
print(f"📂 Destination: /content/drive/{config['drive_dest']}")
print(f"⚡ Concurrency: {config['concurrency']}")
print(f"🔄 Max retries: {config['max_retries']}")
print(f"🧪 Dry run: {dry_run}")
print("=" * 50 + "\n")

# Build command
cmd = [
    "colab-ingest", "run",
    "--links", config["links_file"],
    "--drive-dest", config["drive_dest"],
    "--workdir", "/content/work",
    "--concurrency", str(config["concurrency"]),
    "--max-retries", str(config["max_retries"]),
]

if config["pixeldrain_api_key"]:
    cmd.extend(["--pixeldrain-api-key", config["pixeldrain_api_key"]])

if dry_run:
    cmd.append("--dry-run")

if verbose:
    cmd.append("--verbose")

if retry_failed:
    cmd.append("--retry-failed")

# Run pipeline
try:
    result = subprocess.run(cmd, check=False)
    
    if result.returncode == 0:
        print("\n" + "=" * 50)
        print("🎉 Pipeline completed successfully!")
        print(f"📂 Files uploaded to: /content/drive/{config['drive_dest']}")
        print("=" * 50)
    else:
        print("\n" + "=" * 50)
        print("⚠️ Pipeline completed with some errors.")
        print("Run the 'Check Status' cell below for details.")
        print("=" * 50)
        
except KeyboardInterrupt:
    print("\n⚠️ Pipeline interrupted. You can resume by running this cell again.")

---
## 🔧 Utility Cells

In [None]:
#@title 📊 Check Status { display-mode: "form" }
#@markdown View status of all download tasks.

!colab-ingest status --workdir /content/work

In [None]:
#@title üîÑ Reset Failed Tasks { display-mode: "form" }
#@markdown Reset failed tasks so they can be retried.

!colab-ingest reset --workdir /content/work

In [None]:
#@title üßπ Cleanup Temp Files { display-mode: "form" }
#@markdown Remove temporary download files to free up space.

!colab-ingest clean --workdir /content/work --force

# Show disk usage
print("\n📊 Disk usage:")
!df -h /content
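
If the free-space figure is needed inside Python, for instance to decide whether a cleanup is worthwhile, the standard library's `shutil.disk_usage` reports the same totals as `df`. A minimal sketch, with `format_free_space` as a hypothetical helper name:

```python
import os
import shutil


def format_free_space(path):
    """Describe free and total space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return f"{usage.free / gib:.1f} GiB free of {usage.total / gib:.1f} GiB"


# Fall back to "/" when running outside Colab, where /content may not exist.
target = "/content" if os.path.isdir("/content") else "/"
print(format_free_space(target))
```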

In [None]:
#@title üîç System Check { display-mode: "form" }
#@markdown Verify all dependencies are installed correctly.

!colab-ingest check