# Planet Minecraft Scraper - Exploration

This notebook explores scraping Minecraft maps/schematics from Planet Minecraft.

## Key Differences from minecraft-schematics.com:
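- ✅ **Easier**: No login required and no Cloudflare challenge observed (plain `requests` calls are still rejected with HTTP 403; see the bot-protection note below)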
- ❌ **Harder**: No simple ID-based index system
- 🔗 **Complex**: Multiple download sources (direct, MediaFire, Dropbox, Patreon, etc.)

## Goals:
1. Understand page structure and metadata
2. Extract all relevant information (title, description, tags, category, date, etc.)
3. Identify download link patterns
4. Handle different download sources
5. Build a scraper to collect metadata first, download later
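
Goal 5 separates metadata collection from downloading. One way to sketch that split, assuming a JSON Lines file as the intermediate store (the file name `projects.jsonl` and all three helper names are hypothetical and not part of this notebook):

```python
import json
from pathlib import Path


def append_record(path, record):
    """Append one metadata dict to the file as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read every metadata dict back, skipping blank lines."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pending_downloads(records, done_urls):
    """Return the records whose project URL has not been downloaded yet."""
    return [r for r in records if r["url"] not in done_urls]
```

Appending one line per project keeps the scrape resumable: a crash loses at most the record being written, and the download pass can run later against whatever is already on disk.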

In [2]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import re
from pathlib import Path
import json
from datetime import datetime, timedelta

# Setup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

print("✓ Libraries imported")

✓ Libraries imported


## ⚠️ Bot Protection Detected!

Planet Minecraft rejects plain `requests` calls with HTTP 403 (see the Step 1 output), so the pages are loaded through a real browser driven by Selenium instead.
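
The 403 status from Step 1 and the page-title check used in the Selenium cell can be folded into one heuristic. This is a minimal sketch; `looks_blocked` is a hypothetical helper, and treating HTTP 429 as a block is an assumption that was not observed in this notebook.

```python
# Same marker words that the title check in this notebook looks for
BLOCK_MARKERS = ("blocked", "sorry")


def looks_blocked(status_code=None, page_title=""):
    """Heuristic: True when a response looks like a bot-protection page."""
    # 403 was observed in Step 1; 429 (rate limiting) is an assumption
    if status_code is not None and status_code in (403, 429):
        return True
    title = (page_title or "").lower()
    return any(marker in title for marker in BLOCK_MARKERS)
```

With Selenium there is no status code to inspect, so only the title is passed: `looks_blocked(page_title=driver.title)`.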

In [5]:
# Use Selenium instead of requests
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Setup Firefox with options
options = Options()
# Don't run headless - site might detect it
# options.add_argument('--headless')

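# Anti-detection (best effort; neither preference is guaranteed to have any effect)
options.set_preference("dom.webdriver.enabled", False)
# Note: 'useAutomationExtension' is a Chrome option; Firefox most likely ignores it
options.set_preference('useAutomationExtension', False)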

print("Setting up browser...")
driver = webdriver.Firefox(options=options)
print("✓ Browser ready")

Setting up browser...
✓ Browser ready


In [6]:
# Fetch page with Selenium
# (example_urls is defined in the Step 1 cell below, which was executed first as In [3])
url = example_urls['direct_download']
print(f"Navigating to: {url}")

driver.get(url)

# Wait for page to load
print("Waiting for page to load...")
time.sleep(3)  # Give it a moment

# Check if we got through
page_title = driver.title
print(f"Page title: {page_title}")

# Get page source
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

# Check if blocked
if "blocked" in page_title.lower() or "sorry" in page_title.lower():
    print("❌ Still blocked - might need manual intervention")
else:
    print("✓ Page loaded successfully!")

Navigating to: https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/
Waiting for page to load...
Page title: [1.21.8] GK first church [Download] Minecraft Map
✓ Page loaded successfully!


## Step 1: Fetch and Examine Example Pages

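Let's start by fetching the first of the three example URLs with plain `requests` and examining the response. This cell was executed before the Selenium cells above; the 403 it returns is what prompted the switch to a browser.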

In [3]:
# Example URLs
example_urls = {
    'direct_download': 'https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/',
    'external_links': 'https://www.planetminecraft.com/project/the-great-kingdoms/',
    'paid_patreon': 'https://www.planetminecraft.com/project/fantasy-blue-house/'
}

# Fetch the first example (direct download)
url = example_urls['direct_download']
print(f"Fetching: {url}")

response = requests.get(url, headers=headers, timeout=30)
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.content)} bytes")

# Parse with BeautifulSoup (a 403 error page still parses, so check the status above)
soup = BeautifulSoup(response.content, 'html.parser')
print("✓ Page parsed successfully")

Fetching: https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/
Status: 403
Content length: 5759 bytes
✓ Page parsed successfully


## Step 2: Extract Title and Category

In [7]:
# Extract title
title_elem = soup.find('h1')
title = title_elem.get_text(strip=True) if title_elem else 'N/A'
print(f"Title: {title}")

# Extract category (breadcrumb navigation)
breadcrumb = soup.find('ol', class_='breadcrumb')
if breadcrumb:
    categories = [a.get_text(strip=True) for a in breadcrumb.find_all('a')]
    category_full = ' / '.join(categories)
    print(f"Category: {category_full}")
else:
    print("No breadcrumb found")
    
# Look for category in meta tags
meta_category = soup.find('meta', {'property': 'og:type'})
if meta_category:
    print(f"Meta category: {meta_category.get('content')}")

Title: [1.21.8] GK first church [Download]
No breadcrumb found
Meta category: article


## Step 3: Extract Posted Date

The date is shown as relative time (e.g., "yesterday", "3 weeks ago"). Let's find how it's stored.

In [8]:
# Look for time elements
time_elements = soup.find_all('time')
print(f"Found {len(time_elements)} time elements:\n")

for i, time_elem in enumerate(time_elements):
    print(f"Time element {i+1}:")
    print(f"  Text: {time_elem.get_text(strip=True)}")
    print(f"  datetime attribute: {time_elem.get('datetime')}")
    print(f"  title attribute: {time_elem.get('title')}")
    print()

# Also look for "Published" text
# Note: soup.find(string=...) also matches text inside <script> tags, such as JSON-LD
published_text = soup.find(string=re.compile(r'Published|posted on', re.IGNORECASE))
if published_text:
    print(f"Found 'Published' text: {published_text.strip()}")
    # Get parent element
    parent = published_text.parent
    print(f"Parent element: {parent.name}")
    print(f"Parent text: {parent.get_text(strip=True)}")

Found 0 time elements:

Found 'Published' text: {"@context":"http:\/\/schema.org\/","@type":"CreativeWork","mainEntityOfPage":{"@type":"WebPage","@id":"https:\/\/www.planetminecraft.com\/project\/1-21-8-gk-first-church-download\/"},"url":"https:\/\/www.planetminecraft.com\/project\/1-21-8-gk-first-church-download\/","name":"[1.21.8] GK first church [Download] Minecraft Map","description":"Church first Graveyard Keeper This is a small, beginner level medieval church. The inspiration came from the church in the Graveyard Keeper game. I...","image":{"@type":"ImageObject","url":"https:\/\/static.planetminecraft.com\/files\/image\/minecraft\/project\/2025\/813\/19208273-x_l.jpg","name":"[1.21.8] GK first church [Download] Minecraft Map","width":"800px","height":"689px"},"thumbnailUrl":"https:\/\/www.planetminecraft.com\/files\/image\/minecraft\/project\/2025\/813\/19208273-x_l.jpg","genre":"Map","headline":"[1.21.8] GK first church [Download] Minecraft Map","dateCreated":"2025-10-08T00:00:0

## Step 4: Extract Description

In [9]:
# Look for description in meta tags first
meta_description = soup.find('meta', {'property': 'og:description'})
if meta_description:
    print("Meta description:")
    print(meta_description.get('content')[:200] + "...")
    print()

# Look for main content div
content_divs = soup.find_all('div', class_=re.compile(r'content|description|body', re.IGNORECASE))
print(f"Found {len(content_divs)} potential content divs")
print()

# Try to find the main project description
# Usually in a section with class containing "description" or similar
description_section = soup.find('section', class_=re.compile(r'description|content'))
if description_section:
    print("Description section found:")
    print(description_section.get_text(strip=True)[:300] + "...")
else:
    print("No obvious description section found, exploring structure...")

Meta description:
Church first Graveyard Keeper This is a small, beginner level medieval church. The inspiration came from the church in the Graveyard Keeper game. I......

Found 18 potential content divs

No obvious description section found, exploring structure...


## Step 5: Extract Tags

Tags are crucial metadata. Let's find where they're stored.

In [10]:
# Look for tag elements
tag_links = soup.find_all('a', class_=re.compile(r'tag', re.IGNORECASE))
print(f"Found {len(tag_links)} tag links:\n")

tags = []
for tag_link in tag_links[:15]:  # Show first 15
    tag_text = tag_link.get_text(strip=True)
    tag_href = tag_link.get('href', '')
    tags.append(tag_text)
    print(f"  • {tag_text} → {tag_href}")

print(f"\nAll tags: {tags}")

# Also check meta keywords
meta_keywords = soup.find('meta', {'name': 'keywords'})
if meta_keywords:
    print(f"\nMeta keywords: {meta_keywords.get('content')}")

Found 0 tag links:


All tags: []

Meta keywords: [1.21.8] GK first church [Download] Minecraft Map, Minecraft Map, medieval,land-structure,download,church,temple,ruined,downloadable,graveyardkeeper


## Step 6: Find Download Links (MOST IMPORTANT!)

This is the critical part - we need to identify:
1. Direct download links (hosted on Planet Minecraft)
2. External links (MediaFire, Dropbox, etc.)
3. Paid/Patreon links

In [11]:
# Look for all links
all_links = soup.find_all('a', href=True)
print(f"Total links on page: {len(all_links)}\n")

# Filter for download-related links
download_links = []
for link in all_links:
    href = link.get('href', '')
    text = link.get_text(strip=True).lower()
    classes = ' '.join(link.get('class', []))
    
    # Look for download indicators
    if any(keyword in text for keyword in ['download', 'get', 'access']):
        download_links.append({
            'text': link.get_text(strip=True),
            'href': href,
            'classes': classes
        })
    elif 'download' in href.lower():
        download_links.append({
            'text': link.get_text(strip=True),
            'href': href,
            'classes': classes
        })

print(f"Found {len(download_links)} potential download links:\n")
for i, dl in enumerate(download_links[:10]):  # Show first 10
    print(f"{i+1}. Text: {dl['text']}")
    print(f"   URL: {dl['href']}")
    print(f"   Classes: {dl['classes']}")
    print()

Total links on page: 231

Found 23 potential download links:

1. Text: Download Schematic
   URL: /project/1-21-8-gk-first-church-download/download/schematic/
   Classes: branded-download tooltip tipso_style

2. Text: Download
   URL: /projects/tag/download/
   Classes: 

3. Text: Downloadable
   URL: /projects/tag/downloadable/
   Classes: 

4. Text: 
   URL: /project/1-21-8-blood-eagle-download/
   Classes: 

5. Text: [1.21.8] Blood Eagle [Download]
   URL: /project/1-21-8-blood-eagle-download/
   Classes: r-title

6. Text: VIEW
   URL: /project/1-21-8-blood-eagle-download/
   Classes: 

7. Text: 
   URL: /project/1-21-8-gk-first-church-download/
   Classes: 

8. Text: [1.21.8] GK first church [Download]
   URL: /project/1-21-8-gk-first-church-download/
   Classes: r-title

9. Text: VIEW
   URL: /project/1-21-8-gk-first-church-download/
   Classes: 

10. Text: 
   URL: /project/1-20-8-rock-pack-x10-download/
   Classes: 



## Step 7: Classify Download Link Types

Now let's classify the download links by their source.

In [12]:
def classify_download_link(href):
    """Classify download link by source."""
    href_lower = href.lower()
    
    if 'planetminecraft.com' in href_lower and '/data_' in href_lower:
        return 'direct_pm'  # Direct Planet Minecraft hosted file
    elif 'mediafire.com' in href_lower:
        return 'mediafire'
    elif 'dropbox.com' in href_lower:
        return 'dropbox'
    elif 'patreon.com' in href_lower:
        return 'patreon_paid'
    elif 'drive.google.com' in href_lower or 'docs.google.com' in href_lower:
        return 'google_drive'
    elif 'mega.nz' in href_lower:
        return 'mega'
    elif 'github.com' in href_lower:
        return 'github'
    elif 'planetminecraft.com' in href_lower:
        return 'pm_redirect'  # Planet Minecraft page (might redirect)
    else:
        return 'other'

# Classify the download links
classified = {}
for dl in download_links:
    link_type = classify_download_link(dl['href'])
    if link_type not in classified:
        classified[link_type] = []
    classified[link_type].append(dl)

print("Download links by type:\n")
for link_type, links in classified.items():
    print(f"{link_type}: {len(links)} link(s)")
    for link in links[:2]:  # Show first 2 of each type
        print(f"  • {link['text'][:50]} → {link['href'][:80]}")
    if len(links) > 2:
        print(f"  ... and {len(links) - 2} more")
    print()

Download links by type:

other: 23 link(s)
  • Download Schematic → /project/1-21-8-gk-first-church-download/download/schematic/
  • Download → /projects/tag/download/
  ... and 21 more
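
Every link lands in `other` because the hrefs on the page are relative (for example `/project/.../download/schematic/`), so the `'planetminecraft.com' in href` checks never match. A sketch of one possible fix is to resolve each href against the page URL first and then classify by host name. `classify_by_host`, `HOST_TYPES`, and the `pm_page` label are hypothetical names, and the `/project/<slug>/download/` path pattern is inferred from the single direct link seen in Step 6.

```python
import re
from urllib.parse import urljoin, urlparse

# Host suffix -> label, reusing the labels from classify_download_link
HOST_TYPES = {
    "mediafire.com": "mediafire",
    "dropbox.com": "dropbox",
    "patreon.com": "patreon_paid",
    "drive.google.com": "google_drive",
    "docs.google.com": "google_drive",
    "mega.nz": "mega",
    "github.com": "github",
}

# Assumed pattern for files hosted on Planet Minecraft itself
PM_DOWNLOAD_PATH = re.compile(r"^/project/[^/]+/download/")


def classify_by_host(href, page_url):
    """Resolve href against page_url, then return (absolute_url, label)."""
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "planetminecraft.com":
        if PM_DOWNLOAD_PATH.match(parsed.path):
            return absolute, "direct_pm"
        return absolute, "pm_page"  # tag pages, related projects, etc.
    for domain, label in HOST_TYPES.items():
        if host == domain or host.endswith("." + domain):
            return absolute, label
    return absolute, "other"
```

Matching on the parsed host rather than on a substring also avoids false positives such as a tag page whose path merely contains the word `download`.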



## Step 8: Test on Other Example Pages

Let's fetch and examine the other two examples to see different download patterns.

In [None]:
# Test on external links example (MediaFire/Dropbox)
# Plain requests gets HTTP 403 (see Step 1), so reuse the Selenium driver
url_external = example_urls['external_links']
print(f"Fetching: {url_external}\n")

driver.get(url_external)
time.sleep(3)  # Give the page a moment to finish loading
soup_external = BeautifulSoup(driver.page_source, 'html.parser')

# Find download links
external_links = []
for link in soup_external.find_all('a', href=True):
    href = link.get('href', '')
    text = link.get_text(strip=True).lower()
    
    if any(keyword in text for keyword in ['download', 'get', 'access']) or 'download' in href.lower():
        external_links.append({
            'text': link.get_text(strip=True),
            'href': href,
            'type': classify_download_link(href)
        })

print("Download links found:")
for link in external_links[:10]:
    print(f"  Type: {link['type']}")
    print(f"  Text: {link['text'][:50]}")
    print(f"  URL: {link['href'][:80]}")
    print()

In [None]:
# Test on Patreon example
# Plain requests gets HTTP 403 (see Step 1), so reuse the Selenium driver
url_patreon = example_urls['paid_patreon']
print(f"Fetching: {url_patreon}\n")

driver.get(url_patreon)
time.sleep(3)  # Give the page a moment to finish loading
soup_patreon = BeautifulSoup(driver.page_source, 'html.parser')

# Find download links
patreon_links = []
for link in soup_patreon.find_all('a', href=True):
    href = link.get('href', '')
    text = link.get_text(strip=True).lower()
    
    if any(keyword in text for keyword in ['download', 'get', 'access', 'patreon']) or 'download' in href.lower():
        patreon_links.append({
            'text': link.get_text(strip=True),
            'href': href,
            'type': classify_download_link(href)
        })

print("Download links found:")
for link in patreon_links[:10]:
    print(f"  Type: {link['type']}")
    print(f"  Text: {link['text'][:50]}")
    print(f"  URL: {link['href'][:80]}")
    print()

## Step 9: Extract Additional Metadata

Let's extract other useful information like file size, version, etc.
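
The example titles appear to carry the Minecraft version as a bracketed prefix such as `[1.21.8]`. A minimal sketch for pulling it out, assuming that convention holds (it is an author habit, not a site guarantee); `version_from_title` is a hypothetical helper:

```python
import re

# Matches a bracketed two- or three-part version such as [1.21] or [1.21.8]
VERSION_RE = re.compile(r"\[(\d+(?:\.\d+){1,2})\]")


def version_from_title(title):
    """Return the first bracketed version string in the title, or None."""
    match = VERSION_RE.search(title or "")
    return match.group(1) if match else None
```

Titles without such a prefix return `None`, so the field should be treated as optional.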

In [None]:
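# Back to first example (reload via Selenium; `response` still holds the 403 page from Step 1)
driver.get(example_urls['direct_download'])
time.sleep(3)  # Give the page a moment to finish loading
soup = BeautifulSoup(driver.page_source, 'html.parser')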

# Look for stats/info sections
print("Looking for additional metadata:\n")

# Search for common patterns
metadata_patterns = ['version', 'size', 'file type', 'downloads', 'views', 'favorites']

for pattern in metadata_patterns:
    # Look for labels/spans containing these words
    elements = soup.find_all(string=re.compile(pattern, re.IGNORECASE))
    if elements:
        print(f"\n{pattern.upper()}:")
        for elem in elements[:3]:  # Show first 3
            parent = elem.parent
            # Try to get the value (next sibling or parent text)
            if parent:
                print(f"  {parent.get_text(strip=True)[:100]}")

# Look for structured data (JSON-LD)
json_ld = soup.find('script', type='application/ld+json')
if json_ld:
    print("\n\n" + "="*60)
    print("JSON-LD Structured Data Found!")
    print("="*60)
    try:
        data = json.loads(json_ld.string)  # json is already imported at the top
        print(json.dumps(data, indent=2)[:500] + "...")
    except (TypeError, ValueError):  # empty tag or malformed JSON
        print("Could not parse JSON-LD")

## Step 10: Build a Complete Metadata Extractor Function

Now let's create a function that extracts all metadata from a Planet Minecraft project page.

In [13]:
def extract_project_metadata(url, soup=None):
    """
    Extract all metadata from a Planet Minecraft project page.
    
    Args:
        url: The project URL
        soup: Optional pre-fetched BeautifulSoup object
    
    Returns:
        dict: Metadata including title, category, tags, description, download links, etc.
    """
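    if soup is None:
        # Note: plain requests returned HTTP 403 in Step 1, so pass a Selenium-built soup when possible
        response = requests.get(url, headers=headers, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')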
    
    metadata = {
        'url': url,
        'scraped_at': datetime.now().isoformat()
    }
    
    # Title
    title_elem = soup.find('h1')
    metadata['title'] = title_elem.get_text(strip=True) if title_elem else 'N/A'
    
    # Category from breadcrumb
    breadcrumb = soup.find('ol', class_='breadcrumb')
    if breadcrumb:
        categories = [a.get_text(strip=True) for a in breadcrumb.find_all('a')]
        metadata['category'] = ' / '.join(categories)
        metadata['subcategory'] = categories[-1] if categories else 'N/A'
    else:
        metadata['category'] = 'N/A'
        metadata['subcategory'] = 'N/A'
    
    # Posted date
    time_elem = soup.find('time', {'datetime': True})
    if time_elem:
        metadata['posted_date'] = time_elem.get('datetime')
        metadata['posted_date_relative'] = time_elem.get_text(strip=True)
    else:
        metadata['posted_date'] = 'N/A'
        metadata['posted_date_relative'] = 'N/A'
    
    # Description
    meta_description = soup.find('meta', {'property': 'og:description'})
    if meta_description:
        metadata['description'] = meta_description.get('content', 'N/A')
    else:
        metadata['description'] = 'N/A'
    
    # Tags
    tag_links = soup.find_all('a', class_=re.compile(r'tag', re.IGNORECASE))
    metadata['tags'] = [tag.get_text(strip=True) for tag in tag_links]
    
    # Download links
    download_links = []
    for link in soup.find_all('a', href=True):
        href = link.get('href', '')
        text = link.get_text(strip=True).lower()
        
        if any(keyword in text for keyword in ['download', 'get', 'access']) or 'download' in href.lower():
            download_links.append({
                'text': link.get_text(strip=True),
                'url': href,
                'type': classify_download_link(href)
            })
    
    metadata['download_links'] = download_links
    metadata['download_count'] = len(download_links)
    
    # Classify primary download type
    if download_links:
        types = [dl['type'] for dl in download_links]
        metadata['primary_download_type'] = types[0] if types else 'none'
        metadata['has_direct_download'] = 'direct_pm' in types
        metadata['has_external_download'] = any(t in types for t in ['mediafire', 'dropbox', 'google_drive', 'mega'])
        metadata['is_paid'] = 'patreon_paid' in types
    else:
        metadata['primary_download_type'] = 'none'
        metadata['has_direct_download'] = False
        metadata['has_external_download'] = False
        metadata['is_paid'] = False
    
    return metadata

# Test it
print("Testing metadata extractor on first example:\n")
test_metadata = extract_project_metadata(example_urls['direct_download'], soup)

# Pretty print the results
import json
print(json.dumps(test_metadata, indent=2))

Testing metadata extractor on first example:

{
  "url": "https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/",
  "scraped_at": "2025-10-09T15:10:37.492166",
  "title": "[1.21.8] GK first church [Download]",
  "category": "N/A",
  "subcategory": "N/A",
  "posted_date": "N/A",
  "posted_date_relative": "N/A",
  "description": "Church first Graveyard Keeper This is a small, beginner level medieval church. The inspiration came from the church in the Graveyard Keeper game. I...",
  "tags": [],
  "download_links": [
    {
      "text": "Download Schematic",
      "url": "/project/1-21-8-gk-first-church-download/download/schematic/",
      "type": "other"
    },
    {
      "text": "Download",
      "url": "/projects/tag/download/",
      "type": "other"
    },
    {
      "text": "Downloadable",
      "url": "/projects/tag/downloadable/",
      "type": "other"
    },
    {
      "text": "",
      "url": "/project/1-21-8-blood-eagle-download/",
      "type": "other"
    

## ✅ Key Findings Summary

### What We Learned:

1. **Bot Protection**: Planet Minecraft answers plain `requests` calls with HTTP 403, so pages must be loaded through a real browser (Selenium here)

2. **JSON-LD Structured Data** 🎯: 
   - The page has two `<script type="application/ld+json">` blocks; the `CreativeWork` block holds most of the metadata
   - Contains: title, description, **datePublished**, author, **keywords (tags)**, genre
   - Much easier than scraping HTML!

3. **Direct Download Pattern**:
   - Format: `/project/{slug}/download/schematic/`
   - Example: `/project/1-21-8-gk-first-church-download/download/schematic/`

4. **Tags from JSON-LD**:
   - In the `keywords` field: `"medieval,land-structure,download,church,temple,ruined,downloadable,graveyardkeeper"`
   - Just split by comma!

5. **Date Format**:
   - Exact timestamp in JSON-LD: `"datePublished":"2025-10-08T00:00:00-04:00"`
   - No need to parse relative dates!
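
Findings 4 and 5 can be sketched with the standard library alone. `parse_keywords` and `parse_published` are hypothetical helper names; the input formats are the ones shown in the findings above.

```python
from datetime import datetime


def parse_keywords(keywords):
    """Split a comma-separated keywords string into clean tags, dropping empties."""
    return [tag.strip() for tag in (keywords or "").split(",") if tag.strip()]


def parse_published(value):
    """Parse an ISO 8601 timestamp with UTC offset; return None if missing or invalid."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None
```

Filtering out empty strings means a missing `keywords` field yields an empty list rather than a list containing one empty tag.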

### Next Steps:
1. Update extractor to parse JSON-LD first (easiest!)
2. Improve download link classification
3. Figure out how to discover all projects (no ID-based system)

## Step 11: Parse JSON-LD for ALL Metadata

The JSON-LD has everything we need! Let's extract it properly.

In [14]:
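# Find and parse JSON-LD
# Note: soup.find returns only the FIRST JSON-LD tag, which on this page is the
# BreadcrumbList (hence the None values in the output below)
json_ld_script = soup.find('script', type='application/ld+json')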

if json_ld_script:
    print("✓ Found JSON-LD script tag\n")
    
    # Parse the JSON
    json_data = json.loads(json_ld_script.string)
    
    # Extract all the fields
    print("Extracted data from JSON-LD:")
    print(f"Title: {json_data.get('name')}")
    print(f"Description: {json_data.get('description')}")
    print(f"Date Published: {json_data.get('datePublished')}")
    print(f"Date Modified: {json_data.get('dateModified')}")
    print(f"Author: {json_data.get('author', {}).get('name')}")
    print(f"Genre: {json_data.get('genre')}")  # This is the category!
    
    # Keywords = Tags
    keywords = json_data.get('keywords', '')
    tags_list = [tag.strip() for tag in keywords.split(',')]
    print(f"\nTags ({len(tags_list)}): {tags_list}")
    
    # Pretty print full JSON
    print("\n" + "="*60)
    print("Full JSON-LD:")
    print("="*60)
    print(json.dumps(json_data, indent=2))
else:
    print("❌ No JSON-LD found")

✓ Found JSON-LD script tag

Extracted data from JSON-LD:
Title: None
Description: None
Date Published: None
Date Modified: None
Author: None
Genre: None

Tags (1): ['']

Full JSON-LD:
{
  "@context": "http://schema.org/",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "item": {
        "@id": "https://www.planetminecraft.com/projects/",
        "name": "Minecraft Maps"
      }
    },
    {
      "@type": "ListItem",
      "position": 2,
      "item": {
        "@id": "https://www.planetminecraft.com/projects/1-21-8-gk-first-church-download/",
        "name": "[1.21.8] GK first church [Download] Minecraft Map"
      }
    }
  ]
}
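
The output above shows that the first JSON-LD tag is a `BreadcrumbList`, so blocks need to be selected by `@type` rather than by position. The sketch below indexes the raw script contents by type and also reads the breadcrumb names, which could serve as a category source since no HTML breadcrumb was found in Step 2. Both helper names are hypothetical, and handling a top-level list is a precaution that was not needed on this page.

```python
import json


def index_json_ld(raw_blocks):
    """Map each JSON-LD @type to the first parsed object of that type."""
    by_type = {}
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):
            continue  # empty script tag or malformed JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and isinstance(item.get("@type"), str):
                by_type.setdefault(item["@type"], item)
    return by_type


def breadcrumb_names(breadcrumb_list):
    """Return breadcrumb names ordered by their 'position' field."""
    elements = sorted(
        breadcrumb_list.get("itemListElement", []),
        key=lambda el: el.get("position", 0),
    )
    return [el.get("item", {}).get("name") for el in elements]
```

With BeautifulSoup the raw blocks would come from `(s.string for s in soup.find_all('script', type='application/ld+json'))`.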


In [15]:
# Find ALL JSON-LD script tags
json_ld_scripts = soup.find_all('script', type='application/ld+json')

print(f"Found {len(json_ld_scripts)} JSON-LD script tags\n")

for i, script in enumerate(json_ld_scripts, 1):
    print(f"{'='*60}")
    print(f"JSON-LD #{i}:")
    print(f"{'='*60}")
    
    try:
        json_data = json.loads(script.string)
        json_type = json_data.get('@type')
        
        print(f"Type: {json_type}")
        
        # Show preview of what this contains
        if json_type == 'CreativeWork':
            print(f"✓ This is the one we want!")
            print(f"  Title: {json_data.get('name')}")
            print(f"  Published: {json_data.get('datePublished')}")
            print(f"  Genre: {json_data.get('genre')}")
            print(f"  Keywords: {(json_data.get('keywords') or '')[:50]}...")
        
        # Print full JSON
        print(json.dumps(json_data, indent=2))
        
    except Exception as e:
        print(f"Error parsing: {e}")
    
    print()

Found 2 JSON-LD script tags

JSON-LD #1:
Type: BreadcrumbList
{
  "@context": "http://schema.org/",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "item": {
        "@id": "https://www.planetminecraft.com/projects/",
        "name": "Minecraft Maps"
      }
    },
    {
      "@type": "ListItem",
      "position": 2,
      "item": {
        "@id": "https://www.planetminecraft.com/projects/1-21-8-gk-first-church-download/",
        "name": "[1.21.8] GK first church [Download] Minecraft Map"
      }
    }
  ]
}

JSON-LD #2:
Type: CreativeWork
✓ This is the one we want!
  Title: [1.21.8] GK first church [Download] Minecraft Map
  Published: 2025-10-08T00:00:00-04:00
  Genre: Map
  Keywords: medieval,land-structure,download,church,temple,rui...
{
  "@context": "http://schema.org/",
  "@type": "CreativeWork",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://www.planetminecraft.com/project/1-21-8-gk-first-chu

## Step 12: Updated Metadata Extractor with JSON-LD Parsing

Now let's update the extraction function to parse JSON-LD first:

In [18]:
def extract_project_metadata_v2(soup, project_url):
    """
    Extract all metadata from Planet Minecraft project page
    Uses JSON-LD first (most reliable), HTML as fallback
    """
    metadata = {
        "url": project_url,
        "title": "N/A",
        "category": "N/A",
        "subcategory": "N/A",
        "posted_date": "N/A",
        "tags": [],
        "description": "N/A",
        "author": "N/A",
        "download_links": [],
        "has_direct_download": False
    }
    
    # Step 1: Try to extract from JSON-LD (most reliable)
    json_ld_scripts = soup.find_all('script', type='application/ld+json')
    creative_work = None
    
    for script in json_ld_scripts:
        try:
            json_data = json.loads(script.string)
            if json_data.get('@type') == 'CreativeWork':
                creative_work = json_data
                break
        except (TypeError, ValueError, AttributeError):  # empty tag, bad JSON, or non-dict JSON
            continue
    
    # If we found JSON-LD, extract from it
    if creative_work:
        metadata['title'] = creative_work.get('name', 'N/A')
        metadata['description'] = creative_work.get('description', 'N/A')
        metadata['posted_date'] = creative_work.get('datePublished', 'N/A')
        metadata['category'] = creative_work.get('genre', 'N/A')  # "Map", "Skin", "Texture Pack", etc.
        
        # Extract tags from keywords
        keywords = creative_work.get('keywords', '')
        if keywords:
            metadata['tags'] = [tag.strip() for tag in keywords.split(',')]
        
        # Extract author
        author_data = creative_work.get('author', {})
        if isinstance(author_data, dict):
            metadata['author'] = author_data.get('name', 'N/A')
    
    # Step 2: HTML fallback, used only when no CreativeWork JSON-LD was found
    else:
        # Title fallback
        h1 = soup.find('h1')
        if h1:
            metadata['title'] = h1.get_text(strip=True)
    
    # Step 3: Extract download links (always from HTML)
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        text = link.get_text(strip=True)
        
        # Look for download-related links
        if any(keyword in href.lower() for keyword in ['download', 'schematic', 'world']):
            # Skip tag links and other non-download links
            if '/tags/' in href or '/projects/' in href:
                continue
            
            download_info = {
                'text': text,
                'url': href,
                'type': classify_download_link(href)
            }
            metadata['download_links'].append(download_info)
            
            # Check for direct PM download
            if '/download/schematic/' in href or '/download/world/' in href:
                metadata['has_direct_download'] = True
    
    return metadata

# Test it on our loaded page
metadata_v2 = extract_project_metadata_v2(soup, url)

print("=" * 60)
print("UPDATED METADATA EXTRACTION:")
print("=" * 60)
print(json.dumps(metadata_v2, indent=2))

UPDATED METADATA EXTRACTION:
{
  "url": "https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/",
  "title": "[1.21.8] GK first church [Download] Minecraft Map",
  "category": "Map",
  "subcategory": "N/A",
  "posted_date": "2025-10-08T00:00:00-04:00",
  "tags": [
    "medieval",
    "land-structure",
    "download",
    "church",
    "temple",
    "ruined",
    "downloadable",
    "graveyardkeeper"
  ],
  "description": "Church first Graveyard Keeper This is a small, beginner level medieval church. The inspiration came from the church in the Graveyard Keeper game. I...",
  "author": "Psemata",
  "download_links": [
    {
      "text": "Download Schematic",
      "url": "/project/1-21-8-gk-first-church-download/download/schematic/",
      "type": "other"
    },
    {
      "text": "",
      "url": "/project/1-21-8-blood-eagle-download/",
      "type": "other"
    },
    {
      "text": "[1.21.8] Blood Eagle [Download]",
      "url": "/project/1-21-8-blood-eagle-downlo

## Step 13: Update Download Link Classifier

Fix the classifier to properly recognize Planet Minecraft download patterns:

In [24]:
def classify_download_link_v2(url, text=""):
    """
    Classify download links by type
    """
    url_lower = url.lower()
    text_lower = text.lower()
    
    # Direct Planet Minecraft download
    if '/download/schematic/' in url_lower or '/download/world/' in url_lower:
        return 'direct_pm'
    
    # External file hosts
    if 'mediafire.com' in url_lower:
        return 'mediafire'
    if 'dropbox.com' in url_lower:
        return 'dropbox'
    if 'patreon.com' in url_lower:
        return 'patreon'
    if 'drive.google.com' in url_lower or 'docs.google.com' in url_lower:
        return 'google_drive'
    if 'mega.nz' in url_lower:
        return 'mega'
    
    # Skip these (not actual downloads)
    # Tag pages on this site look like /projects/tag/<name>/ (see the Step 6 output)
    if '/projects/tag/' in url_lower or '/tags/' in url_lower:
        return 'tag_link'
    if '/projects/' in url_lower and 'download' not in url_lower:
        return 'related_project'
    
    return 'other'


# Now rebuild the extractor with updated classifier
def extract_project_metadata_v3(soup, project_url):
    """
    Extract all metadata from Planet Minecraft project page
    Uses JSON-LD first (most reliable), HTML as fallback
    """
    metadata = {
        "url": project_url,
        "title": "N/A",
        "category": "N/A",
        "subcategory": "N/A",
        "posted_date": "N/A",
        "tags": [],
        "description": "N/A",
        "author": "N/A",
        "download_links": [],
        "has_direct_download": False
    }
    
    # Step 1: Try to extract from JSON-LD (most reliable)
    json_ld_scripts = soup.find_all('script', type='application/ld+json')
    creative_work = None
    
    for script in json_ld_scripts:
        try:
            json_data = json.loads(script.string)
            if json_data.get('@type') == 'CreativeWork':
                creative_work = json_data
                break
        except (TypeError, ValueError, AttributeError):  # empty tag, bad JSON, or non-dict JSON
            continue
    
    # If we found JSON-LD, extract from it
    if creative_work:
        metadata['title'] = creative_work.get('name', 'N/A')
        metadata['description'] = creative_work.get('description', 'N/A')
        metadata['posted_date'] = creative_work.get('datePublished', 'N/A')
        metadata['category'] = creative_work.get('genre', 'N/A')  # "Map", "Skin", "Texture Pack", etc.
        
        # Extract tags from keywords
        keywords = creative_work.get('keywords', '')
        if keywords:
            metadata['tags'] = [tag.strip() for tag in keywords.split(',')]
        
        # Extract author
        author_data = creative_work.get('author', {})
        if isinstance(author_data, dict):
            metadata['author'] = author_data.get('name', 'N/A')
    
    # Step 2: HTML fallback, used only when no CreativeWork JSON-LD was found
    else:
        # Title fallback
        h1 = soup.find('h1')
        if h1:
            metadata['title'] = h1.get_text(strip=True)
    
    # Step 3: Extract download links (always from HTML)
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        text = link.get_text(strip=True)
        
        # Only look for actual download links (not just anything with "download" in URL)
        # Direct PM downloads have this specific pattern
        if '/download/schematic/' in href.lower() or '/download/world/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'direct_pm'
            }
            metadata['download_links'].append(download_info)
            metadata['has_direct_download'] = True
        
        # PM redirect links to external hosts (/download/mirror/, /download/website/)
        elif '/download/mirror/' in href.lower() or '/download/website/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'pm_external_redirect'
            }
            metadata['download_links'].append(download_info)
        
        # Direct external download hosts
        elif any(host in href.lower() for host in ['mediafire.com', 'dropbox.com', 'drive.google.com', 'mega.nz', 'patreon.com']):
            link_type = classify_download_link_v2(href, text)
            download_info = {
                'text': text,
                'url': href,
                'type': link_type
            }
            metadata['download_links'].append(download_info)
    
    return metadata


# Test the updated version
metadata_v3 = extract_project_metadata_v3(soup, url)

print("=" * 60)
print("FINAL METADATA EXTRACTION (with classifier fix):")
print("=" * 60)
print(json.dumps(metadata_v3, indent=2))
print("\n✓ All fields properly extracted from JSON-LD!")
print(f"✓ Found {len(metadata_v3['download_links'])} download link(s)")
print(f"✓ Has direct download: {metadata_v3['has_direct_download']}")

FINAL METADATA EXTRACTION (with classifier fix):
{
  "url": "https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/",
  "title": "[1.21.8] GK first church [Download] Minecraft Map",
  "category": "Map",
  "subcategory": "N/A",
  "posted_date": "2025-10-08T00:00:00-04:00",
  "tags": [
    "medieval",
    "land-structure",
    "download",
    "church",
    "temple",
    "ruined",
    "downloadable",
    "graveyardkeeper"
  ],
  "description": "Church first Graveyard Keeper This is a small, beginner level medieval church. The inspiration came from the church in the Graveyard Keeper game. I...",
  "author": "Psemata",
  "download_links": [
    {
      "text": "Download Schematic",
      "url": "/project/1-21-8-gk-first-church-download/download/schematic/",
      "type": "direct_pm"
    }
  ],
  "has_direct_download": true
}

✓ All fields properly extracted from JSON-LD!
✓ Found 1 download link(s)
✓ Has direct download: True
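
The extracted `url` values are site-relative (for example `/project/.../download/schematic/`), so a later download step would presumably need them resolved against the project URL. A minimal sketch using the standard library's `urljoin`; the helper name `absolutize_download_links` is hypothetical and not part of the extractor above:

```python
from urllib.parse import urljoin


def absolutize_download_links(metadata):
    """Return a copy of metadata whose download link URLs are absolute.

    Hypothetical helper: resolves each link against metadata['url'].
    """
    resolved = dict(metadata)
    resolved['download_links'] = [
        {**link, 'url': urljoin(metadata['url'], link['url'])}
        for link in metadata.get('download_links', [])
    ]
    return resolved


# Values copied from the extraction output above
example = {
    'url': 'https://www.planetminecraft.com/project/1-21-8-gk-first-church-download/',
    'download_links': [{
        'text': 'Download Schematic',
        'url': '/project/1-21-8-gk-first-church-download/download/schematic/',
        'type': 'direct_pm',
    }],
}

resolved = absolutize_download_links(example)
print(resolved['download_links'][0]['url'])
```

Already-absolute URLs (such as a direct MediaFire link) pass through `urljoin` unchanged, so the same helper can be applied to every link type.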


## Step 14: Test on Other Example URLs

Let's test the extractor on the other two examples to verify it handles different download types correctly.

In [23]:
# Test #2: External links example (The Great Kingdoms - MediaFire/Dropbox)
print("="*70)
print("TEST #2: External Download Links (MediaFire/Dropbox)")
print("="*70)

url_external = example_urls['external_links']
print(f"URL: {url_external}\n")

# Navigate with Selenium
driver.get(url_external)
time.sleep(3)  # Wait for load

# Parse the page
page_source_external = driver.page_source
soup_external = BeautifulSoup(page_source_external, 'html.parser')

# Extract metadata
metadata_external = extract_project_metadata_v3(soup_external, url_external)

print(json.dumps(metadata_external, indent=2))
print(f"\n✓ Found {len(metadata_external['download_links'])} download link(s)")
print(f"✓ Download types: {[dl['type'] for dl in metadata_external['download_links']]}")
print(f"✓ Has direct download: {metadata_external['has_direct_download']}")

TEST #2: External Download Links (MediaFire/Dropbox)
URL: https://www.planetminecraft.com/project/the-great-kingdoms/

{
  "url": "https://www.planetminecraft.com/project/the-great-kingdoms/",
  "title": "The Great Kingdoms Minecraft Map",
  "category": "Map",
  "subcategory": "N/A",
  "posted_date": "2025-10-09T00:00:00-04:00",
  "tags": [
    "redstone-device",
    "kingdom",
    "portals",
    "commandblocks",
    "dimensions",
    "teleport",
    "superpowers",
    "other"
  ],
  "description": "A large scale world created by me and some buddies. Each one went out to establish his own land, his own kingdom, a place that he could call his own....",
  "author": "PokecrafterChamp",
  "download_links": [],
  "has_direct_download": false
}

✓ Found 0 download link(s)
✓ Download types: []
✓ Has direct download: False


In [22]:
# Debug: Let's see what links are actually on this page
print("\n" + "="*70)
print("DEBUG: Searching for ALL links that might be downloads...")
print("="*70)

all_links_external = soup_external.find_all('a', href=True)
potential_downloads = []

for link in all_links_external:
    href = link['href']
    text = link.get_text(strip=True)
    
    # Check for external hosts or download keywords
    if any(host in href.lower() for host in ['mediafire', 'dropbox', 'drive.google', 'mega.nz', 'patreon', 'download']):
        potential_downloads.append({
            'text': text[:60],
            'href': href[:100]
        })

print(f"\nFound {len(potential_downloads)} potential download links:")
for i, link in enumerate(potential_downloads[:10], 1):
    print(f"\n{i}. Text: {link['text']}")
    print(f"   URL: {link['href']}")


DEBUG: Searching for ALL links that might be downloads...

Found 5 potential download links:

1. Text: Kingdoms Download
   URL: /project/the-great-kingdoms/download/mirror/748184/

2. Text: Kingdoms Download
   URL: /project/the-great-kingdoms/download/website/748185/

3. Text: 
   URL: /project/nuestra-se-ora-de-la-sant-sima-trinidad-ship-of-the-line-free-download/

4. Text: Nuestra Señora de la Santísima Trinidad, ship of the line (F
   URL: /project/nuestra-se-ora-de-la-sant-sima-trinidad-ship-of-the-line-free-download/

5. Text: VIEW
   URL: /project/nuestra-se-ora-de-la-sant-sima-trinidad-ship-of-the-line-free-download/
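
The debug output shows why the first run found nothing: the real download buttons live under `/download/mirror/<id>/` and `/download/website/<id>/`, while the other three hits are related-project links that merely contain "download" in their slug. A small sketch of a path-based classifier that separates the two cases; `classify_pm_href` and `PM_DOWNLOAD_RE` are hypothetical names, and the type labels mirror the ones used by the extractor:

```python
import re

# Matches only PM download endpoints, not slugs that happen to contain "download"
PM_DOWNLOAD_RE = re.compile(r'/download/(schematic|world|mirror|website)/', re.IGNORECASE)


def classify_pm_href(href):
    """Return 'direct_pm', 'pm_external_redirect', or None for a non-download link."""
    match = PM_DOWNLOAD_RE.search(href)
    if not match:
        return None
    kind = match.group(1).lower()
    return 'direct_pm' if kind in ('schematic', 'world') else 'pm_external_redirect'


# hrefs taken from the debug output above
hrefs = [
    '/project/the-great-kingdoms/download/mirror/748184/',
    '/project/the-great-kingdoms/download/website/748185/',
    '/project/nuestra-se-ora-de-la-sant-sima-trinidad-ship-of-the-line-free-download/',
]
for href in hrefs:
    print(classify_pm_href(href), href)
```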


In [25]:
# Re-extract with updated function
metadata_external_v2 = extract_project_metadata_v3(soup_external, url_external)

print("\n" + "="*70)
print("UPDATED EXTRACTION:")
print("="*70)
print(json.dumps(metadata_external_v2, indent=2))
print(f"\n✓ Found {len(metadata_external_v2['download_links'])} download link(s)")
if metadata_external_v2['download_links']:
    print(f"✓ Download types: {[dl['type'] for dl in metadata_external_v2['download_links']]}")
print(f"✓ Has direct download: {metadata_external_v2['has_direct_download']}")


UPDATED EXTRACTION:
{
  "url": "https://www.planetminecraft.com/project/the-great-kingdoms/",
  "title": "The Great Kingdoms Minecraft Map",
  "category": "Map",
  "subcategory": "N/A",
  "posted_date": "2025-10-09T00:00:00-04:00",
  "tags": [
    "redstone-device",
    "kingdom",
    "portals",
    "commandblocks",
    "dimensions",
    "teleport",
    "superpowers",
    "other"
  ],
  "description": "A large scale world created by me and some buddies. Each one went out to establish his own land, his own kingdom, a place that he could call his own....",
  "author": "PokecrafterChamp",
  "download_links": [
    {
      "text": "Kingdoms Download",
      "url": "/project/the-great-kingdoms/download/mirror/748184/",
      "type": "pm_external_redirect"
    },
    {
      "text": "Kingdoms Download",
      "url": "/project/the-great-kingdoms/download/website/748185/",
      "type": "pm_external_redirect"
    }
  ],
  "has_direct_download": false
}

✓ Found 2 download link(s)
✓ Download t

In [26]:
# Test #3: Patreon/Paid example
print("\n" + "="*70)
print("TEST #3: Patreon/Paid Download")
print("="*70)

url_patreon = example_urls['paid_patreon']
print(f"URL: {url_patreon}\n")

# Navigate with Selenium
driver.get(url_patreon)
time.sleep(3)  # Wait for load

# Parse the page
page_source_patreon = driver.page_source
soup_patreon = BeautifulSoup(page_source_patreon, 'html.parser')

# Extract metadata
metadata_patreon = extract_project_metadata_v3(soup_patreon, url_patreon)

print(json.dumps(metadata_patreon, indent=2))
print(f"\n✓ Found {len(metadata_patreon['download_links'])} download link(s)")
if metadata_patreon['download_links']:
    print(f"✓ Download types: {[dl['type'] for dl in metadata_patreon['download_links']]}")
print(f"✓ Has direct download: {metadata_patreon['has_direct_download']}")


TEST #3: Patreon/Paid Download
URL: https://www.planetminecraft.com/project/fantasy-blue-house/

{
  "url": "https://www.planetminecraft.com/project/fantasy-blue-house/",
  "title": "Fantasy blue house Minecraft Map",
  "category": "Map",
  "subcategory": "N/A",
  "posted_date": "2025-09-21T00:00:00-04:00",
  "tags": [
    "fantasy",
    "medieval",
    "land-structure",
    "castle",
    "minecraft",
    "creative",
    "statue",
    "build",
    "house",
    "modern",
    "realistic",
    "recreation",
    "download",
    "vanilla",
    "idea",
    "free",
    "new"
  ],
  "description": "CONTACT ME Want a CUSTOM BUILD for your server or PROJECT DM me on DISCORD SpicyTmc All my socials Download Patreon .litematic Sub lvl Green chili...",
  "author": "SpicyT",
  "download_links": [
    {
      "text": "File Download",
      "url": "/project/fantasy-blue-house/download/mirror/746905/",
      "type": "pm_external_redirect"
    }
  ],
  "has_direct_download": false
}

✓ Found 1 download

## ✅ Test Results Summary

All three example projects extracted successfully!

### Test #1: Direct Download (GK First Church)
- **Title**: "[1.21.8] GK first church [Download] Minecraft Map"
- **Category**: Map
- **Posted**: 2025-10-08
- **Tags**: 8 tags (medieval, land-structure, church, etc.)
- **Download**: 1 link - `direct_pm` type ✓
- **Has Direct Download**: True ✓

### Test #2: External Links (The Great Kingdoms)
- **Title**: "The Great Kingdoms Minecraft Map"
- **Category**: Map
- **Posted**: 2025-10-09
- **Tags**: 8 tags (kingdom, portals, commandblocks, etc.)
- **Download**: 2 links - both `pm_external_redirect` type ✓
- **Has Direct Download**: False ✓

### Test #3: Patreon/Paid (Fantasy Blue House)
- **Title**: "Fantasy blue house Minecraft Map"
- **Category**: Map
- **Posted**: 2025-09-21
- **Tags**: 17 tags (fantasy, medieval, house, etc.)
- **Download**: 1 link - `pm_external_redirect` type ✓
- **Has Direct Download**: False ✓

### Key Findings:
1. ✓ JSON-LD extraction worked for all three test projects
2. ✓ Direct PM downloads detected: `/download/schematic/` pattern
3. ✓ External redirects detected: `/download/mirror/` and `/download/website/` patterns
4. ✓ Title, category, date, tags, author, and description populated correctly; `subcategory` is still "N/A" in every result
5. ✓ No false positives in these three tests - only actual download buttons captured
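
The summary lists plain dates (e.g. 2025-10-08) while `posted_date` holds a full ISO 8601 timestamp with a UTC offset. A minimal sketch of normalizing it with the standard library; `posted_day` is a hypothetical helper name, and it returns `None` for the "N/A" placeholder rather than raising:

```python
from datetime import datetime


def posted_day(posted_date):
    """Return the calendar date (YYYY-MM-DD) of an ISO 8601 timestamp, or None."""
    try:
        return datetime.fromisoformat(posted_date).date().isoformat()
    except (TypeError, ValueError):
        # Covers None and placeholder strings such as "N/A"
        return None


print(posted_day("2025-10-08T00:00:00-04:00"))
print(posted_day("N/A"))
```

The date is taken in the timestamp's own offset, with no conversion to UTC.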

## Step 15: Improved Extractor with Better Category/Subcategory Logic

"land-structure" should be the subcategory. Here is how the data flows:

### Data Extraction Process:

1. **JSON-LD Script Tag** (Primary Source):
   - Planet Minecraft embeds structured data in `<script type="application/ld+json">`
   - We parse this JSON to get: title, description, date, genre (category), keywords (tags), author

2. **Genre vs Tags**:
   - `genre`: "Map" (broad category)
   - `keywords`: "medieval,land-structure,download,church,..." 
   - "land-structure" is a **tag** inside `keywords`; it does not appear in a separate category field

3. **Subcategory Detection**:
   - We scan the tags for structure-type indicators
   - Common patterns: land-structure, air-structure, castle, house, city, etc.
   - First matching tag becomes the subcategory
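
The JSON-LD lookup from step 1 can be sketched in isolation. This version works on the raw text of each `<script type="application/ld+json">` tag and, as an assumption beyond what the three test pages showed, also tolerates empty tags and payloads that are a list of objects. `find_creative_work` is a hypothetical helper name:

```python
import json


def find_creative_work(payloads):
    """Return the first JSON-LD object whose @type is 'CreativeWork', else None.

    payloads: iterable of raw script-tag strings (entries may be None).
    """
    for raw in payloads:
        if not raw:
            continue  # empty <script> tag
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get('@type') == 'CreativeWork':
                return item
    return None


# Illustrative payloads, not copied from a real page
payloads = [
    None,
    'not json',
    '{"@type": "BreadcrumbList"}',
    '{"@type": "CreativeWork", "genre": "Map", "keywords": "medieval,land-structure,church"}',
]
work = find_creative_work(payloads)
print(work['genre'], work['keywords'].split(','))
```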

Let's update the extractor:

In [27]:
def extract_project_metadata_final(soup, project_url):
    """
    Final version: Extract all metadata from Planet Minecraft project page
    Uses JSON-LD first (most reliable), HTML as fallback
    Properly detects subcategory from tags
    """
    metadata = {
        "url": project_url,
        "title": "N/A",
        "category": "N/A",
        "subcategory": "N/A",
        "posted_date": "N/A",
        "tags": [],
        "description": "N/A",
        "author": "N/A",
        "download_links": [],
        "has_direct_download": False
    }
    
    # Step 1: Try to extract from JSON-LD (most reliable)
    json_ld_scripts = soup.find_all('script', type='application/ld+json')
    creative_work = None
    
    for script in json_ld_scripts:
        try:
            json_data = json.loads(script.string)
            if json_data.get('@type') == 'CreativeWork':
                creative_work = json_data
                break
        except (json.JSONDecodeError, TypeError, AttributeError):  # bad JSON, empty tag, or non-dict payload
            continue
    
    # If we found JSON-LD, extract from it
    if creative_work:
        metadata['title'] = creative_work.get('name', 'N/A')
        metadata['description'] = creative_work.get('description', 'N/A')
        metadata['posted_date'] = creative_work.get('datePublished', 'N/A')
        
        # Category is the broad type from genre field
        metadata['category'] = creative_work.get('genre', 'N/A')  # "Map", "Skin", "Texture Pack", etc.
        
        # Extract tags from keywords (comma-separated string)
        keywords = creative_work.get('keywords', '')
        if keywords:
            metadata['tags'] = [tag.strip() for tag in keywords.split(',')]
            
            # Detect subcategory from structure-related tags
            # These are common subcategory indicators on Planet Minecraft
            structure_tags = [
                'land-structure', 'air-structure', 'underground-structure', 
                'water-structure', 'floating-structure',
                'castle', 'house', 'temple', 'tower', 'city', 'village',
                'mansion', 'fort', 'palace', 'church', 'cathedral'
            ]
            
            # Find first matching structure tag
            for tag in metadata['tags']:
                if tag in structure_tags:
                    metadata['subcategory'] = tag
                    break
        
        # Extract author
        author_data = creative_work.get('author', {})
        if isinstance(author_data, dict):
            metadata['author'] = author_data.get('name', 'N/A')
    
    # Step 2: HTML fallback for fields not in JSON-LD
    else:
        # Title fallback
        h1 = soup.find('h1')
        if h1:
            metadata['title'] = h1.get_text(strip=True)
    
    # Step 3: Extract download links (always from HTML)
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        text = link.get_text(strip=True)
        
        # Only look for actual download links
        # Direct PM downloads
        if '/download/schematic/' in href.lower() or '/download/world/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'direct_pm'
            }
            metadata['download_links'].append(download_info)
            metadata['has_direct_download'] = True
        
        # PM redirect links to external hosts
        elif '/download/mirror/' in href.lower() or '/download/website/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'pm_external_redirect'
            }
            metadata['download_links'].append(download_info)
        
        # Direct external download hosts
        elif any(host in href.lower() for host in ['mediafire.com', 'dropbox.com', 'drive.google.com', 'mega.nz', 'patreon.com']):
            link_type = classify_download_link_v2(href, text)
            download_info = {
                'text': text,
                'url': href,
                'type': link_type
            }
            metadata['download_links'].append(download_info)
    
    return metadata


# Test on all three examples
print("="*70)
print("FINAL EXTRACTOR - Testing All Three Examples")
print("="*70)

for name, url_test in example_urls.items():
    print(f"\n{'='*70}")
    print(f"Testing: {name}")
    print(f"{'='*70}")
    
    # Get the soup (reuse if already loaded)
    if name == 'direct_download':
        soup_test = soup
    elif name == 'external_links':
        soup_test = soup_external
    else:  # patreon
        soup_test = soup_patreon
    
    result = extract_project_metadata_final(soup_test, url_test)
    
    # Show key fields
    print(f"Title: {result['title']}")
    print(f"Category: {result['category']}")
    print(f"Subcategory: {result['subcategory']} ← Detected from tags!")
    print(f"Posted: {result['posted_date']}")
    print(f"Author: {result['author']}")
    print(f"Tags: {result['tags'][:5]}..." if len(result['tags']) > 5 else f"Tags: {result['tags']}")
    print(f"Download links: {len(result['download_links'])} ({[dl['type'] for dl in result['download_links']]})")
    print(f"Has direct download: {result['has_direct_download']}")

FINAL EXTRACTOR - Testing All Three Examples

Testing: direct_download
Title: [1.21.8] GK first church [Download] Minecraft Map
Category: Map
Subcategory: land-structure ← Detected from tags!
Posted: 2025-10-08T00:00:00-04:00
Author: Psemata
Tags: ['medieval', 'land-structure', 'download', 'church', 'temple']...
Download links: 1 (['direct_pm'])
Has direct download: True

Testing: external_links
Title: The Great Kingdoms Minecraft Map
Category: Map
Subcategory: N/A ← Detected from tags!
Posted: 2025-10-09T00:00:00-04:00
Author: PokecrafterChamp
Tags: ['redstone-device', 'kingdom', 'portals', 'commandblocks', 'dimensions']...
Download links: 2 (['pm_external_redirect', 'pm_external_redirect'])
Has direct download: False

Testing: paid_patreon
Title: Fantasy blue house Minecraft Map
Category: Map
Subcategory: land-structure ← Detected from tags!
Posted: 2025-09-21T00:00:00-04:00
Author: SpicyT
Tags: ['fantasy', 'medieval', 'land-structure', 'castle', 'minecraft']...
Download links: 1 (['
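
Note that the test loop prints "← Detected from tags!" even when nothing matched: the `external_links` project has no tag in `structure_tags`, so its subcategory stays "N/A" while the marker still appears. A small sketch of a formatter that annotates only real matches; `format_subcategory` is a hypothetical helper and is not used in the cell above:

```python
def format_subcategory(subcategory):
    """Build the display line, marking detection only when a tag actually matched."""
    if subcategory and subcategory != 'N/A':
        return f"Subcategory: {subcategory} (detected from tags)"
    return "Subcategory: N/A (no structure tag matched)"


print(format_subcategory('land-structure'))
print(format_subcategory('N/A'))
```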

## Step 16: Extract Additional Stats (Views, Downloads, Diamonds, Hearts, Updated Date)

Project pages also show views, downloads, diamonds, hearts, and an updated date. Let's explore where these stats are stored on the page.

In [28]:
# Load the new example page with stats
url_stats = "https://www.planetminecraft.com/project/medieval-house-fully-decorated-interior-download-6638738"
print(f"Loading example with stats: {url_stats}\n")

driver.get(url_stats)
time.sleep(3)

page_source_stats = driver.page_source
soup_stats = BeautifulSoup(page_source_stats, 'html.parser')

print("✓ Page loaded\n")

# Look for stat elements
print("="*70)
print("Searching for stats elements...")
print("="*70)

# Search for common stat patterns
stat_keywords = ['view', 'download', 'diamond', 'heart', 'favorite', 'update', 'upload']

for keyword in stat_keywords:
    elements = soup_stats.find_all(string=re.compile(keyword, re.IGNORECASE))
    if elements:
        print(f"\n{keyword.upper()}:")
        for elem in elements[:5]:
            parent = elem.parent
            # Get surrounding context
            if parent:
                text = parent.get_text(strip=True)
                # Also check for data attributes
                attrs = {k: v for k, v in parent.attrs.items() if 'data' in k or 'title' in k}
                print(f"  Text: {text[:100]}")
                if attrs:
                    print(f"  Attrs: {attrs}")
        print()

Loading example with stats: https://www.planetminecraft.com/project/medieval-house-fully-decorated-interior-download-6638738

✓ Page loaded

Searching for stats elements...

VIEW:
  Text: 1,460views,60today
  Text: VIEW
  Attrs: {'title': 'Medieval House – Fully Decorated Interior | Download Minecraft Map & Project'}
  Text: VIEW
  Attrs: {'title': 'Medieval House – Fully Decorated Interior | Download Minecraft Map & Project'}
  Text: VIEW
  Attrs: {'title': 'Medieval House – Fully Decorated Interior | Download Minecraft Map & Project'}
  Text: VIEW
  Attrs: {'title': 'Medieval House – Fully Decorated Interior | Download Minecraft Map & Project'}


DOWNLOAD:
  Text: Medieval House – Fully Decorated Interior | Download Minecraft Map
  Text: {"@context":"http:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem",
  Text: {"@context":"http:\/\/schema.org\/","@type":"CreativeWork","mainEntityOfPage":{"@type":"WebPage","@i
  Text: Medieval House – Fully Decorated

In [29]:
# Look for stat containers with specific patterns
print("="*70)
print("Looking for stat numbers...")
print("="*70)

# Find elements with stat-like text (numbers followed by keywords)
all_text = soup_stats.find_all(string=re.compile(r'\d+.*?(view|download|diamond|heart|favorite)', re.IGNORECASE))

for text in all_text[:15]:
    parent = text.parent
    print(f"\nText: {text.strip()}")
    print(f"Parent tag: {parent.name}")
    print(f"Parent class: {parent.get('class', 'no class')}")
    
# Also check for specific date elements
print("\n" + "="*70)
print("Looking for dates...")
print("="*70)

time_elements = soup_stats.find_all('time')
for time_elem in time_elements:
    print(f"\nTime element:")
    print(f"  Text: {time_elem.get_text(strip=True)}")
    print(f"  datetime: {time_elem.get('datetime')}")
    print(f"  title: {time_elem.get('title')}")
    
# Check JSON-LD for dateModified
json_ld_scripts_stats = soup_stats.find_all('script', type='application/ld+json')
for script in json_ld_scripts_stats:
    try:
        json_data = json.loads(script.string)
        if json_data.get('@type') == 'CreativeWork':
            print("\n" + "="*70)
            print("JSON-LD dates:")
            print("="*70)
            print(f"dateCreated: {json_data.get('dateCreated')}")
            print(f"datePublished: {json_data.get('datePublished')}")
            print(f"dateModified: {json_data.get('dateModified')}")
            break
    except (json.JSONDecodeError, TypeError, AttributeError):  # bad JSON, empty tag, or non-dict payload
        continue

Looking for stat numbers...

Text: {"@context":"http:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/www.planetminecraft.com\/projects\/","name":"Minecraft Maps"}},{"@type":"ListItem","position":2,"item":{"@id":"https:\/\/www.planetminecraft.com\/projects\/medieval-house-fully-decorated-interior-download-6638738\/","name":"Medieval House \u2013 Fully Decorated Interior | Download Minecraft Map"}}]}
Parent tag: script
Parent class: no class

Text: {"@context":"http:\/\/schema.org\/","@type":"CreativeWork","mainEntityOfPage":{"@type":"WebPage","@id":"https:\/\/www.planetminecraft.com\/project\/medieval-house-fully-decorated-interior-download-6638738\/"},"url":"https:\/\/www.planetminecraft.com\/project\/medieval-house-fully-decorated-interior-download-6638738\/","name":"Medieval House \u2013 Fully Decorated Interior | Download Minecraft Map","description":"This Medieval House comes with a fully decorated interior, pe

In [30]:
# Search for the specific stat pattern we saw: "1,460views,60today"
print("="*70)
print("Looking for stat bars with numbers...")
print("="*70)

# Find all elements with class containing "stat" or "count"
stat_elements = soup_stats.find_all(class_=re.compile(r'stat|count|number|metric', re.IGNORECASE))
print(f"Found {len(stat_elements)} stat-like elements\n")

for elem in stat_elements[:20]:
    text = elem.get_text(strip=True)
    if any(keyword in text.lower() for keyword in ['view', 'download', 'diamond', 'heart', 'favorite']):
        print(f"Class: {elem.get('class')}")
        print(f"Text: {text}")
        print(f"Tag: {elem.name}")
        print()

# Also try to find by looking for number patterns
print("="*70)
print("Looking for elements with comma-separated numbers...")
print("="*70)

number_elements = soup_stats.find_all(string=re.compile(r'\d{1,3}(,\d{3})*'))
relevant_stats = []

for elem in number_elements:
    text = elem.strip()
    parent = elem.parent
    
    # Check if contains stat keywords
    siblings_text = ' '.join([s.get_text() if hasattr(s, 'get_text') else str(s) for s in parent.parent.children if s])
    
    if any(keyword in siblings_text.lower() for keyword in ['view', 'download', 'diamond', 'heart']):
        relevant_stats.append({
            'text': text,
            'context': siblings_text[:100],
            'parent_tag': parent.name,
            'parent_class': parent.get('class')
        })

print(f"Found {len(relevant_stats)} relevant stat elements:\n")
for stat in relevant_stats[:10]:
    print(f"Number: {stat['text']}")
    print(f"Context: {stat['context']}")
    print(f"Parent: <{stat['parent_tag']} class='{stat['parent_class']}'>")
    print()

Looking for stat bars with numbers...
Found 31 stat-like elements

Class: ['resource-statistics']
Text: 1,460views,60today185downloads,4today
Tag: ul

Looking for elements with comma-separated numbers...
Found 25 relevant stat elements:

Number: {"@context":"http:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/www.planetminecraft.com\/projects\/","name":"Minecraft Maps"}},{"@type":"ListItem","position":2,"item":{"@id":"https:\/\/www.planetminecraft.com\/projects\/medieval-house-fully-decorated-interior-download-6638738\/","name":"Medieval House \u2013 Fully Decorated Interior | Download Minecraft Map"}}]}
Context:    Medieval House – Fully Decorated Interior | Download Minecraft Map 
  
          
  
          
 
Parent: <script class='None'>

Number: {"@context":"http:\/\/schema.org\/","@type":"CreativeWork","mainEntityOfPage":{"@type":"WebPage","@id":"https:\/\/www.planetminecraft.com\/project\/medieval-house-ful

In [31]:
# Try regex to find the exact pattern: "1,460views,60today"
print("="*70)
print("Searching for combined stat pattern (e.g., '1,460views,60today')...")
print("="*70)

pattern = re.compile(r'(\d[\d,]*)\s*(view|download|diamond|heart)', re.IGNORECASE)
matches = soup_stats.find_all(string=pattern)

print(f"Found {len(matches)} matches:\n")
for match in matches[:15]:
    print(f"Text: '{match.strip()}'")
    parent = match.parent
    print(f"Parent: <{parent.name} class='{parent.get('class')}'>")
    # Try to extract the numbers
    found = pattern.findall(match)
    if found:
        print(f"Extracted: {found}")
    print()

# Alternative: Look for title or data attributes with stats
print("="*70)
print("Checking for title/data attributes with stats...")
print("="*70)

elements_with_title = soup_stats.find_all(attrs={'title': re.compile(r'\d')})
for elem in elements_with_title[:10]:
    title = elem.get('title')
    if any(keyword in title.lower() for keyword in ['view', 'download', 'diamond', 'heart']):
        print(f"Title: {title}")
        print(f"Element: <{elem.name} class='{elem.get('class')}'>")
        print()

Searching for combined stat pattern (e.g., '1,460views,60today')...
Found 0 matches:

Checking for title/data attributes with stats...
Title: Download from third-party: https://www.patreon.com/posts/medieval-house-130857478?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link
Element: <a class='['third-party-download', 'branded-download']'>
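
A likely reason for the zero matches: a `string=` search tests each text node on its own, and the number and its label appear to sit in separate nodes (the earlier `get_text(strip=True)` output, "1,460views,60today", hints at this). The sketch below reproduces the effect with the standard library's `HTMLParser` instead of BeautifulSoup; the markup in `snippet` is an assumption about the page structure, and `TextNodeCollector` is a hypothetical class:

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect each non-empty text node separately."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(data)


# Assumed markup: the number in its own <span>, the label in a sibling text node
snippet = '<li><span class="stat">1,460</span> views, <span class="stat">60</span> today</li>'
pattern = re.compile(r'(\d[\d,]*)\s*(view|download|diamond|heart)', re.IGNORECASE)

collector = TextNodeCollector()
collector.feed(snippet)
collector.close()

# Per-node search: no single node contains both a number and a keyword
per_node_hits = [node for node in collector.nodes if pattern.search(node)]
# Joined text: the pattern matches once the nodes are concatenated
joined_hits = pattern.findall(''.join(collector.nodes))

print(collector.nodes)
print(per_node_hits)
print(joined_hits)
```

If that assumption holds, matching against the element's combined text (or selecting the number elements directly) would be the more dependable approach.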



In [32]:
# Search in the raw page source for the stats
print("="*70)
print("Searching raw HTML for stats...")
print("="*70)

# Find "views" in the page source
if '1,460' in page_source_stats:
    # Find the context around it
    idx = page_source_stats.find('1,460')
    context = page_source_stats[max(0, idx-200):min(len(page_source_stats), idx+200)]
    print("Found '1,460' in context:")
    print(context)
    print("\n" + "="*70 + "\n")

# Look for data attributes or aria-labels
print("Checking for aria-label or data- attributes with numbers...")
elements_with_aria = soup_stats.find_all(attrs={'aria-label': True})
for elem in elements_with_aria:
    aria = elem.get('aria-label')
    if re.search(r'\d', aria):
        print(f"aria-label: {aria}")
        print(f"Element: <{elem.name} class='{elem.get('class')}'>")
        print(f"Text: {elem.get_text(strip=True)[:50]}")
        print()

# Check data attributes
elements_with_data = soup_stats.find_all(attrs=lambda x: any(k.startswith('data-') for k in (x or {})))
for elem in elements_with_data[:20]:
    data_attrs = {k: v for k, v in elem.attrs.items() if k.startswith('data-')}
    if any(re.search(r'\d', str(v)) for v in data_attrs.values()):
        text = elem.get_text(strip=True)
        if any(keyword in text.lower() for keyword in ['view', 'download', 'diamond', 'heart', 'today']):
            print(f"Data attrs: {data_attrs}")
            print(f"Text: {text[:100]}")
            print()

Searching raw HTML for stats...
Found '1,460' in context:
data-timeago="1" data-original-date="2025-06-06T10:04:26-04:00" data-timestamp="1749218666000" data-timeago-list="0">5 months ago</abbr>
</div><ul class="resource-statistics">
<li>
<span class="stat">1,460</span> views, <span class="stat">60</span> today</li><li>
<span class="stat">185</span> downloads, <span class="stat">4</span> today</li>
</ul></div>
</div><div id="resource-options">
<ul class=


Checking for aria-label or data- attributes with numbers...
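
The raw HTML also exposes a machine-readable date on the `<abbr>` element just before the stats list, via `data-original-date` and `data-timestamp`. A sketch of parsing both with regular expressions, assuming `data-timestamp` is Unix time in milliseconds; `parse_abbr_dates` is a hypothetical helper, and the output above does not establish whether this is the posted or the updated date:

```python
import re
from datetime import datetime, timezone

# Fragment copied from the raw HTML context printed above
raw = ('data-timeago="1" data-original-date="2025-06-06T10:04:26-04:00" '
       'data-timestamp="1749218666000" data-timeago-list="0">5 months ago</abbr>')


def parse_abbr_dates(html):
    """Extract both date attributes as timezone-aware datetimes (None if absent)."""
    original = re.search(r'data-original-date="([^"]+)"', html)
    stamp = re.search(r'data-timestamp="(\d+)"', html)
    return {
        'original': datetime.fromisoformat(original.group(1)) if original else None,
        # Assumption: the timestamp is in milliseconds since the Unix epoch
        'from_timestamp': (
            datetime.fromtimestamp(int(stamp.group(1)) / 1000, tz=timezone.utc)
            if stamp else None
        ),
    }


dates = parse_abbr_dates(raw)
print(dates['original'].isoformat())
print(dates['from_timestamp'].isoformat())
print(dates['original'] == dates['from_timestamp'])
```

Both attributes describe the same instant here, so either one can be stored; the ISO string keeps the site's local offset.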


In [33]:
# Perfect! Found the structure. Now extract it properly
print("="*70)
print("Extracting stats from resource-statistics...")
print("="*70 + "\n")

# Find the resource-statistics ul
resource_stats = soup_stats.find('ul', class_='resource-statistics')

if resource_stats:
    # Get all stat spans
    stat_spans = resource_stats.find_all('span', class_='stat')
    
    print(f"Found {len(stat_spans)} stat values:")
    for i, span in enumerate(stat_spans):
        print(f"  Stat {i+1}: {span.get_text(strip=True)}")
    
    # Get the full text to understand labels
    print(f"\nFull stats text:")
    print(f"  {resource_stats.get_text(strip=True)}")
    
    # Parse it properly
    li_elements = resource_stats.find_all('li')
    stats_dict = {}
    
    for li in li_elements:
        text = li.get_text(strip=True)
        stat_values = [s.get_text(strip=True) for s in li.find_all('span', class_='stat')]
        
        print(f"\nLI text: {text}")
        print(f"Stat values: {stat_values}")
        
        # Determine what stat this is
        if 'view' in text.lower():
            if len(stat_values) >= 2:
                stats_dict['views'] = stat_values[0]
                stats_dict['views_today'] = stat_values[1]
        elif 'download' in text.lower():
            if len(stat_values) >= 2:
                stats_dict['downloads'] = stat_values[0]
                stats_dict['downloads_today'] = stat_values[1]
    
    print(f"\n{'='*70}")
    print("Extracted Stats:")
    print("="*70)
    for key, value in stats_dict.items():
        print(f"{key}: {value}")

else:
    print("❌ resource-statistics ul not found")

# Now look for diamonds and hearts (likely in a different location)
print(f"\n{'='*70}")
print("Looking for diamonds and hearts...")
print("="*70 + "\n")

# Search for elements with "diamond" or "heart"
diamond_elements = soup_stats.find_all(string=re.compile(r'diamond', re.IGNORECASE))
heart_elements = soup_stats.find_all(string=re.compile(r'(heart|favorite)', re.IGNORECASE))

print(f"Found {len(diamond_elements)} diamond mentions")
print(f"Found {len(heart_elements)} heart/favorite mentions")

# Look for specific counters
for elem in diamond_elements[:5]:
    parent = elem.parent
    # Check for numbers nearby
    parent_text = parent.get_text(strip=True)
    if re.search(r'\d+', parent_text):
        print(f"\nDiamond context: {parent_text[:100]}")
        print(f"Parent: <{parent.name} class='{parent.get('class')}'>")

for elem in heart_elements[:5]:
    parent = elem.parent
    parent_text = parent.get_text(strip=True)
    if re.search(r'\d+', parent_text) and 'log' not in parent_text.lower():
        print(f"\nHeart context: {parent_text[:100]}")
        print(f"Parent: <{parent.name} class='{parent.get('class')}'>")


Extracting stats from resource-statistics...

Found 4 stat values:
  Stat 1: 1,460
  Stat 2: 60
  Stat 3: 185
  Stat 4: 4

Full stats text:
  1,460views,60today185downloads,4today

LI text: 1,460views,60today
Stat values: ['1,460', '60']

LI text: 185downloads,4today
Stat values: ['185', '4']

Extracted Stats:
views: 1,460
views_today: 60
downloads: 185
downloads_today: 4

Looking for diamonds and hearts...

Found 2 diamond mentions
Found 2 heart/favorite mentions


In [34]:
# Look for diamond and heart counts more specifically
# They're likely in buttons or links for giving diamonds/hearts

print("="*70)
print("Searching for diamond/heart counts in buttons/links...")
print("="*70 + "\n")

# Find all buttons and links
buttons = soup_stats.find_all(['button', 'a'])

for btn in buttons:
    text = btn.get_text(strip=True)
    
    # Check for diamond
    if 'diamond' in text.lower():
        # Look for numbers
        numbers = re.findall(r'\d+', text)
        if numbers:
            print(f"Diamond button: {text}")
            print(f"Numbers found: {numbers}")
            print(f"Classes: {btn.get('class')}")
            print()
    
    # Check for heart/favorite
    if any(keyword in text.lower() for keyword in ['heart', 'favorite', 'follow']):
        numbers = re.findall(r'\d+', text)
        if numbers:
            print(f"Heart/Favorite button: {text}")
            print(f"Numbers found: {numbers}")
            print(f"Classes: {btn.get('class')}")
            print()

# Also check for specific classes
print("="*70)
print("Checking for social stats section...")
print("="*70 + "\n")

# Look for common social stat patterns
social_elements = soup_stats.find_all(class_=re.compile(r'social|engagement|action', re.IGNORECASE))

for elem in social_elements[:10]:
    text = elem.get_text(strip=True)
    if re.search(r'\d+', text) and any(keyword in text.lower() for keyword in ['diamond', 'heart', 'favorite', 'follow']):
        print(f"Text: {text[:100]}")
        print(f"Tag: <{elem.name} class='{elem.get('class')}'>")
        print()

# Check raw HTML for diamond/heart counts
print("="*70)
print("Searching raw HTML...")
print("="*70 + "\n")

# Look for patterns like "5 diamonds" or "1 heart"
diamond_pattern = r'(\d+)\s*(diamond|gem)'
heart_pattern = r'(\d+)\s*(heart|favorite)'

diamond_matches = re.findall(diamond_pattern, page_source_stats, re.IGNORECASE)
heart_matches = re.findall(heart_pattern, page_source_stats, re.IGNORECASE)

print(f"Diamond pattern matches: {diamond_matches[:5]}")
print(f"Heart pattern matches: {heart_matches[:5]}")

Searching for diamond/heart counts in buttons/links...

Checking for social stats section...

Text: 51Diamond LogFavorite LogFeature on profileEmbedReport
Tag: <ul class='['site-actions', 'r-data']'>

Searching raw HTML...

Diamond pattern matches: []
Heart pattern matches: [('24', 'favorite')]


In [35]:
# Found "51Diamond LogFavorite LogFeature..." - let's parse this better
print("="*70)
print("Parsing social actions more carefully...")
print("="*70 + "\n")

# Find the site-actions ul
site_actions = soup_stats.find('ul', class_='site-actions')

if site_actions:
    print("Found site-actions ul")
    
    # Get all li elements
    li_elements = site_actions.find_all('li')
    
    print(f"Found {len(li_elements)} action items:\n")
    
    for li in li_elements:
        # Get the link/button text
        link = li.find(['a', 'button'])
        if link:
            text = link.get_text(strip=True)
            print(f"Action: {text}")
            
            # Check for numbers at the start
            number_match = re.match(r'^(\d+)', text)
            if number_match:
                count = number_match.group(1)
                remaining_text = text[len(count):]
                print(f"  Count: {count}")
                print(f"  Label: {remaining_text}")
            
            # Check for data attributes
            data_attrs = {k: v for k, v in link.attrs.items() if k.startswith('data-')}
            if data_attrs:
                print(f"  Data: {data_attrs}")
            print()

# Alternative: look for specific action links
print("="*70)
print("Looking for specific give-diamond and give-favorite links...")
print("="*70 + "\n")

give_diamond = soup_stats.find('a', href=re.compile(r'/give_diamond/'))
give_favorite = soup_stats.find('a', href=re.compile(r'/give_favorite/'))

if give_diamond:
    print(f"Give Diamond link found:")
    print(f"  Text: {give_diamond.get_text(strip=True)}")
    print(f"  Href: {give_diamond.get('href')}")
    
if give_favorite:
    print(f"\nGive Favorite link found:")
    print(f"  Text: {give_favorite.get_text(strip=True)}")
    print(f"  Href: {give_favorite.get('href')}")

Parsing social actions more carefully...

Found site-actions ul
Found 11 action items:

Looking for specific give-diamond and give-favorite links...



In [36]:
# Let's inspect the raw structure of site-actions
print("="*70)
print("Raw HTML of site-actions...")
print("="*70 + "\n")

if site_actions:
    # Print the HTML
    print(site_actions.prettify()[:1500])
    
    # Try different approaches
    print("\n" + "="*70)
    print("All links in site-actions:")
    print("="*70 + "\n")
    
    all_links = site_actions.find_all('a')
    for link in all_links[:15]:
        text = link.get_text(strip=True)
        href = link.get('href', '')
        
        # Extract number if at start of text
        number_match = re.match(r'^(\d+)', text)
        if number_match or any(keyword in text.lower() for keyword in ['diamond', 'favorite', 'heart']):
            print(f"Text: '{text}'")
            print(f"Href: {href}")
            if number_match:
                print(f"Number: {number_match.group(1)}")
            print()

Raw HTML of site-actions...

<ul class="site-actions r-data" data-id="6638738" data-type="resource">
 <li class="resource-score">
  <div class="c-vote" title="Give diamond!">
   <div class="s-rate">
    <div class="c-icon">
    </div>
   </div>
   <span class="c-num-votes stat txtlrg">
    5
   </span>
  </div>
 </li>
 <li class="c-fav" title="Favorite &amp; follow updates">
  <div class="c-icon">
  </div>
  <span class="c-num-favs stat txtlrg">
   1
  </span>
 </li>
 <li class="c-cmt" title="Comment">
  <div class="c-icon">
   <i class="material-icons comment">
   </i>
  </div>
  <span class="num_comments stat txtlrg">
  </span>
 </li>
 <li class="collectable" data-id="6638738" data-subkey="projects" title="Add to collection...">
  <i class="material-icons playlist_add">
  </i>
 </li>
 <li class="action-share submenu_trigger" data-image="https://static.planetminecraft.com/files/image/minecraft/project/2025/738/19211527-medievalhouseb_l.jpg" data-url="/project/medieval-house-fully-deco

## Step 17: Final Complete Extractor with All Stats

Now I have found all the data locations! Let me build the complete extractor:

### Data Locations Summary:
1. **Basic metadata** → JSON-LD CreativeWork
2. **Category** → JSON-LD `genre` field ("Map")
3. **Subcategory** → From the fixed list, found in tags
4. **Views/Downloads** → `<ul class="resource-statistics">` with `<span class="stat">`
5. **Diamonds** → `.c-num-votes.stat` 
6. **Hearts/Favorites** → `.c-num-favs.stat`
7. **Dates** → JSON-LD `datePublished` and `dateModified`
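
The JSON-LD lookup from item 1 can be made a little more defensive. The following is a minimal sketch, not the notebook's extractor: `find_creative_work` is a hypothetical helper, and the handling of top-level lists and `@graph` wrappers is an assumption about how JSON-LD can be serialized in general, not something observed on Planet Minecraft pages.

```python
import json


def find_creative_work(raw_payloads):
    """Return the first JSON-LD object typed CreativeWork, or None.

    raw_payloads: iterable of strings (or None), for example the .string
    of each <script type="application/ld+json"> tag.
    """
    for raw in raw_payloads:
        if not raw:
            continue  # empty <script> tag
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of aborting

        # Assumption: a payload may be one object, a list of objects,
        # or an object wrapping its nodes in "@graph".
        if isinstance(data, dict):
            graph = data.get("@graph")
            candidates = graph if isinstance(graph, list) else [data]
        elif isinstance(data, list):
            candidates = data
        else:
            candidates = []

        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if types == "CreativeWork" or (
                isinstance(types, list) and "CreativeWork" in types
            ):
                return item
    return None


# Hypothetical payloads for illustration only
samples = [
    None,
    "{not valid json",
    '[{"@type": "BreadcrumbList"}, {"@type": "CreativeWork", "name": "Example Map", "genre": "Map"}]',
]
work = find_creative_work(samples)
print(work["name"], work["genre"])
```

Catching only `json.JSONDecodeError` keeps unrelated bugs visible, which a bare `except` would hide.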

In [43]:
def extract_project_metadata_complete(soup, project_url):
    """
    Complete extractor with all metadata and stats from Planet Minecraft
    """
    metadata = {
        "url": project_url,
        "title": "N/A",
        "category": "N/A",
        "subcategory": "N/A",
        "posted_date": "N/A",
        "updated_date": "N/A",
        "tags": [],
        "description": "N/A",
        "author": "N/A",
        "views": "0",
        "views_today": "0",
        "downloads": "0",
        "downloads_today": "0",
        "diamonds": "0",
        "hearts": "0",
        "download_links": [],
        "has_direct_download": False
    }
    
    # Step 1: Extract from JSON-LD
    json_ld_scripts = soup.find_all('script', type='application/ld+json')
    creative_work = None
    
    for script in json_ld_scripts:
        try:
            json_data = json.loads(script.string)
            if json_data.get('@type') == 'CreativeWork':
                creative_work = json_data
                break
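        except (json.JSONDecodeError, TypeError, AttributeError):
            # Malformed JSON, an empty <script> (string is None), or a non-dict payload
            continue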
    
    if creative_work:
        metadata['title'] = creative_work.get('name', 'N/A')
        metadata['description'] = creative_work.get('description', 'N/A')
        metadata['posted_date'] = creative_work.get('datePublished', 'N/A')
        metadata['updated_date'] = creative_work.get('dateModified', 'N/A')
        metadata['category'] = creative_work.get('genre', 'N/A')
        
        # Extract tags from keywords
        keywords = creative_work.get('keywords', '')
        if keywords:
            metadata['tags'] = [tag.strip() for tag in keywords.split(',')]
            
            # Detect subcategory from the official list
            valid_subcategories = [
                '3d-art', 'air-structure', 'challenge-adventure', 'complex', 'educational',
                'environment-landscaping', 'land-structure', 'minecart', 'music',
                'nether-structure', 'piston', 'pixel-art', 'redstone-device',
                'underground-structure', 'water-structure', 'other'
            ]
            
            for tag in metadata['tags']:
                if tag in valid_subcategories:
                    metadata['subcategory'] = tag
                    break
        
        # Extract author
        author_data = creative_work.get('author', {})
        if isinstance(author_data, dict):
            metadata['author'] = author_data.get('name', 'N/A')
    
    # Step 2: Extract views and downloads stats
    resource_stats = soup.find('ul', class_='resource-statistics')
    if resource_stats:
        li_elements = resource_stats.find_all('li')
        
        for li in li_elements:
            text = li.get_text(strip=True)
            stat_values = [s.get_text(strip=True) for s in li.find_all('span', class_='stat')]
            
            if 'view' in text.lower() and len(stat_values) >= 2:
                metadata['views'] = stat_values[0]
                metadata['views_today'] = stat_values[1]
            elif 'download' in text.lower() and len(stat_values) >= 2:
                metadata['downloads'] = stat_values[0]
                metadata['downloads_today'] = stat_values[1]
    
    # Step 3: Extract diamonds and hearts
    diamond_elem = soup.find('span', class_='c-num-votes')
    if diamond_elem:
        metadata['diamonds'] = diamond_elem.get_text(strip=True)
    
    heart_elem = soup.find('span', class_='c-num-favs')
    if heart_elem:
        metadata['hearts'] = heart_elem.get_text(strip=True)
    
    # Step 4: Extract download links
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        text = link.get_text(strip=True)
        
        # Direct PM downloads - distinguish between world and schematic
        if '/download/schematic/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'direct_pm_schematic',
                'file_type': 'schematic'
            }
            metadata['download_links'].append(download_info)
            metadata['has_direct_download'] = True
        
        elif '/download/world/' in href.lower() or '/download/worldmap/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'direct_pm_world',
                'file_type': 'world'
            }
            metadata['download_links'].append(download_info)
            metadata['has_direct_download'] = True
        
        # PM redirect links to external hosts
        elif '/download/mirror/' in href.lower() or '/download/website/' in href.lower():
            download_info = {
                'text': text,
                'url': href,
                'type': 'pm_external_redirect',
                'file_type': 'unknown'
            }
            metadata['download_links'].append(download_info)
        
        # Direct external download hosts
        elif any(host in href.lower() for host in ['mediafire.com', 'dropbox.com', 'drive.google.com', 'mega.nz', 'patreon.com']):
            link_type = classify_download_link_v2(href, text)
            download_info = {
                'text': text,
                'url': href,
                'type': link_type,
                'file_type': 'unknown'
            }
            metadata['download_links'].append(download_info)
    
    return metadata


# Test on the stats example
print("="*70)
print("COMPLETE EXTRACTOR - Testing with Stats Example")
print("="*70)

result = extract_project_metadata_complete(soup_stats, url_stats)

print(json.dumps(result, indent=2))

print("\n" + "="*70)
print("Summary:")
print("="*70)
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print(f"Category: {result['category']}")
print(f"Subcategory: {result['subcategory']}")
print(f"Posted: {result['posted_date']}")
print(f"Updated: {result['updated_date']}")
print(f"Views: {result['views']} ({result['views_today']} today)")
print(f"Downloads: {result['downloads']} ({result['downloads_today']} today)")
print(f"Diamonds: {result['diamonds']}")
print(f"Hearts: {result['hearts']}")
print(f"Tags: {len(result['tags'])} tags")
print(f"Download links: {len(result['download_links'])}")

COMPLETE EXTRACTOR - Testing with Stats Example
{
  "url": "https://www.planetminecraft.com/project/medieval-house-fully-decorated-interior-download-6638738",
  "title": "Medieval House \u2013 Fully Decorated Interior | Download Minecraft Map",
  "category": "Map",
  "subcategory": "other",
  "posted_date": "2025-06-06T00:00:00-04:00",
  "updated_date": "2025-10-08T08:11:49-04:00",
  "tags": [
    "medieval",
    "city",
    "house",
    "download",
    "casa",
    "schematics",
    "home",
    "schematic",
    "rustic",
    "interior",
    "fully",
    "downloadable",
    "decorated",
    "other",
    "litematica",
    "downloadablemap",
    "litematic"
  ],
  "description": "This Medieval House comes with a fully decorated interior, perfect for adding life and detail to your medieval towns or survival worlds. Download...",
  "author": "Raekon",
  "views": "1,460",
  "views_today": "60",
  "downloads": "185",
  "downloads_today": "4",
  "diamonds": "5",
  "hearts": "1",
  "download_li

## Step 18: Test on New Example - The Sunken Island

Let's test the extractor on another project to verify it works across different types of projects.
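
The extractor returns counts as display strings (for example `"1,460"`), which makes projects hard to compare numerically. A minimal sketch of a converter follows; `parse_count` is a hypothetical helper, and it assumes counts are always comma-grouped integers rather than abbreviations such as "1.4M".

```python
def parse_count(text, default=0):
    """Convert a display count such as '1,493,309' to an int.

    Returns `default` for empty, missing, or non-numeric input.
    """
    cleaned = (text or "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdecimal() else default


# Values in the same string format the extractor produces
stats = {"views": "1,493,309", "downloads": "635,293", "diamonds": "1,939", "hearts": "660"}
numeric = {key: parse_count(value) for key, value in stats.items()}
print(numeric)
```

Keeping the raw strings in the scraped metadata and converting at analysis time is one option; converting at extraction time is another, at the cost of losing the original formatting.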

In [38]:
# Load the new example
url_sunken = "https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/"
print(f"Loading: {url_sunken}\n")

driver.get(url_sunken)
time.sleep(3)

page_source_sunken = driver.page_source
soup_sunken = BeautifulSoup(page_source_sunken, 'html.parser')

print("✓ Page loaded\n")

# Extract metadata
result_sunken = extract_project_metadata_complete(soup_sunken, url_sunken)

print("="*70)
print("EXTRACTION RESULTS:")
print("="*70)
print(json.dumps(result_sunken, indent=2))

print("\n" + "="*70)
print("SUMMARY:")
print("="*70)
print(f"Title: {result_sunken['title']}")
print(f"Author: {result_sunken['author']}")
print(f"Category: {result_sunken['category']}")
print(f"Subcategory: {result_sunken['subcategory']}")
print(f"Posted: {result_sunken['posted_date']}")
print(f"Updated: {result_sunken['updated_date']}")
print(f"Views: {result_sunken['views']} ({result_sunken['views_today']} today)")
print(f"Downloads: {result_sunken['downloads']} ({result_sunken['downloads_today']} today)")
print(f"Diamonds: {result_sunken['diamonds']}")
print(f"Hearts: {result_sunken['hearts']}")
print(f"Tags: {result_sunken['tags']}")
print(f"Download links: {len(result_sunken['download_links'])} ({[dl['type'] for dl in result_sunken['download_links']]})")
print(f"Has direct download: {result_sunken['has_direct_download']}")

Loading: https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/

✓ Page loaded

EXTRACTION RESULTS:
{
  "url": "https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/",
  "title": "Custom Terrain: The Sunken Island Adventure(1.2.5) Minecraft Map",
  "category": "Map",
  "subcategory": "challenge-adventure",
  "posted_date": "2011-08-22T00:00:00-04:00",
  "updated_date": "2012-04-02T20:16:28-04:00",
  "tags": [
    "adventure",
    "environment",
    "landscaping",
    "challenge",
    "terrain",
    "island",
    "water",
    "mountains",
    "sunken",
    "challenge-adventure"
  ],
  "description": "This started as a simple custom terrain, but has now spawned into a highly detailed, full adventure map, with objectives, story line, and ending event....",
  "author": "inHaze",
  "views": "1,493,309",
  "views_today": "25",
  "downloads": "635,293",
  "downloads_today": "2",
  "diamonds": "1,939",
  "hearts": "660",
  "download_links": [
    {
      

## Step 19: Investigate Download Links - World vs Schematic

World downloads and schematic downloads have very different file sizes, so the extractor needs to tell them apart. Let's examine the actual download links on this page.

In [39]:
# Look for ALL download-related links on The Sunken Island page
print("="*70)
print("Searching for ALL download links...")
print("="*70 + "\n")

all_links = soup_sunken.find_all('a', href=True)
download_related = []

for link in all_links:
    href = link['href']
    text = link.get_text(strip=True)
    
    # Look for anything download-related
    if any(keyword in href.lower() for keyword in ['download', 'mirror', 'website']) or \
       any(keyword in text.lower() for keyword in ['download', 'world', 'schematic', 'map']):
        
        # Skip navigation/tag links
        if '/tags/' not in href and '/projects/' not in href and text:
            download_related.append({
                'text': text,
                'href': href,
                'classes': link.get('class', [])
            })

print(f"Found {len(download_related)} potential download links:\n")

for i, link in enumerate(download_related[:20], 1):
    print(f"{i}. Text: '{link['text']}'")
    print(f"   URL: {link['href']}")
    print(f"   Classes: {link['classes']}")
    
    # Check if it's a direct PM hosted file
    if '/download/schematic/' in link['href']:
        print(f"   ⚠️ Type: Direct PM SCHEMATIC download")
    elif '/download/world/' in link['href']:
        print(f"   ⚠️ Type: Direct PM WORLD download")
    elif '/download/mirror/' in link['href'] or '/download/website/' in link['href']:
        print(f"   ⚠️ Type: PM redirect (need to check destination)")
    
    print()

# Check the raw HTML around download links
print("="*70)
print("Checking raw HTML for download patterns...")
print("="*70 + "\n")

if '/download/' in page_source_sunken:
    # Find all download URLs in the source (the numeric ID is optional: /download/worldmap/ has none)
    download_urls = re.findall(r'/download/\w+/(?:\d+/)?', page_source_sunken)
    unique_urls = list(set(download_urls))
    
    print(f"Found {len(unique_urls)} unique download URLs:\n")
    for url in unique_urls[:10]:
        print(f"  {url}")
        
        # Get context around this URL
        idx = page_source_sunken.find(url)
        if idx > 0:
            context = page_source_sunken[max(0, idx-150):min(len(page_source_sunken), idx+150)]
            # Look for keywords
            if 'world' in context.lower():
                print(f"    → Contains 'world' keyword")
            if 'schematic' in context.lower():
                print(f"    → Contains 'schematic' keyword")
        print()

Searching for ALL download links...

Found 10 potential download links:

1. Text: 'Part of the starting room with map info, story and rules.'
   URL: https://static.planetminecraft.com/files/resource_media/screenshot/1145/2011-11-06_024704_814571.jpg
   Classes: ['rsImg']

2. Text: 'Download Minecraft Map'
   URL: /project/custom-terrain-the-sunken-island/download/worldmap/
   Classes: ['branded-download']

3. Text: 'Downloadable Map'
   URL: /project/custom-terrain-the-sunken-island/download/mirror/285279/
   Classes: ['third-party-download', 'branded-download']
   ⚠️ Type: PM redirect (need to check destination)

4. Text: 'How to install Minecraft Maps on Java Edition'
   URL: https://www.planetminecraft.com/blog/how-to-download-and-install-minecraft-maps/
   Classes: []

5. Text: 'Hillside Manor World [1.8] 4 Year Anniversary'
   URL: /project/hillside-manor/
   Classes: ['r-title']

6. Text: 'Landscape Maps'
   URL: /collection/168383/landscape-maps/
   Classes: ['collection-title'

In [40]:
# Examine the worldmap download link more carefully
print("="*70)
print("Examining the /download/worldmap/ link...")
print("="*70 + "\n")

worldmap_link = soup_sunken.find('a', href=re.compile(r'/download/worldmap/'))

if worldmap_link:
    print("✓ Found worldmap download link!")
    print(f"Text: '{worldmap_link.get_text(strip=True)}'")
    print(f"Href: {worldmap_link.get('href')}")
    print(f"Classes: {worldmap_link.get('class')}")
    print(f"Title: {worldmap_link.get('title')}")
    
    # Get parent context
    parent = worldmap_link.parent
    print(f"\nParent tag: <{parent.name}>")
    print(f"Parent text: {parent.get_text(strip=True)}")
else:
    print("❌ worldmap link not found")

# Check for schematic pattern too
print("\n" + "="*70)
print("Checking for schematic download patterns...")
print("="*70 + "\n")

schematic_link = soup_sunken.find('a', href=re.compile(r'/download/schematic/'))
if schematic_link:
    print("✓ Found schematic download link!")
    print(f"Text: '{schematic_link.get_text(strip=True)}'")
    print(f"Href: {schematic_link.get('href')}")
else:
    print("❌ No schematic download link on this page (expected - it's a world map)")

# Summary of patterns
print("\n" + "="*70)
print("DOWNLOAD PATTERN SUMMARY:")
print("="*70)
print("\nDirect PM Downloads:")
print("  /download/worldmap/      → Full world download (large file)")
print("  /download/world/         → World download (alternative pattern)")
print("  /download/schematic/     → Schematic file (smaller, structure only)")
print("\nExternal Redirects:")
print("  /download/mirror/XXXX/   → Redirect to external host")
print("  /download/website/XXXX/  → Redirect to creator's website")

Examining the /download/worldmap/ link...

✓ Found worldmap download link!
Text: 'Download Minecraft Map'
Href: /project/custom-terrain-the-sunken-island/download/worldmap/
Classes: ['branded-download']
Title: Download this file for Minecraft.

Parent tag: <li>
Parent text: Download Minecraft Map

Checking for schematic download patterns...

❌ No schematic download link on this page (expected - it's a world map)

DOWNLOAD PATTERN SUMMARY:

Direct PM Downloads:
  /download/worldmap/      → Full world download (large file)
  /download/world/         → World download (alternative pattern)
  /download/schematic/     → Schematic file (smaller, structure only)

External Redirects:
  /download/mirror/XXXX/   → Redirect to external host
  /download/website/XXXX/  → Redirect to creator's website
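
The pattern summary above can be expressed as a lookup table keyed on the path segment that follows `/download/`. This is a sketch under the assumption that these five segments are the only ones in use; `DOWNLOAD_KINDS` and `classify_pm_download` are hypothetical names, and the type labels mirror the ones used by the extractor.

```python
import re

# Segment after /download/ -> (link type, file type), per the summary above
DOWNLOAD_KINDS = {
    "worldmap": ("direct_pm_world", "world"),
    "world": ("direct_pm_world", "world"),
    "schematic": ("direct_pm_schematic", "schematic"),
    "mirror": ("pm_external_redirect", "unknown"),
    "website": ("pm_external_redirect", "unknown"),
}

# Capture the whole segment so "worldmap" is never confused with "world"
_DOWNLOAD_RE = re.compile(r"/download/([a-z]+)(?:/|$)", re.IGNORECASE)


def classify_pm_download(href):
    """Return (link_type, file_type) for a Planet Minecraft download href, or None."""
    match = _DOWNLOAD_RE.search(href)
    if not match:
        return None
    return DOWNLOAD_KINDS.get(match.group(1).lower())


for href in [
    "/project/custom-terrain-the-sunken-island/download/worldmap/",
    "/project/custom-terrain-the-sunken-island/download/mirror/285279/",
    "https://www.planetminecraft.com/blog/how-to-download-and-install-minecraft-maps/",
]:
    print(href, "->", classify_pm_download(href))
```

Matching the full segment sidesteps a pitfall of substring checks: `'/download/world/'` is not a substring of `'/download/worldmap/'`, so a check for the former alone misses the latter.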


## Step 20: Test Updated Extractor with World vs Schematic Detection

Now let's test the updated extractor that properly distinguishes between world files and schematic files.

In [41]:
# Test the updated extractor on The Sunken Island
result_updated = extract_project_metadata_complete(soup_sunken, url_sunken)

print("="*70)
print("UPDATED EXTRACTION - The Sunken Island")
print("="*70)
print(json.dumps(result_updated, indent=2))

print("\n" + "="*70)
print("DOWNLOAD LINKS DETAILS:")
print("="*70)

for i, dl in enumerate(result_updated['download_links'], 1):
    print(f"\n{i}. {dl['text']}")
    print(f"   URL: {dl['url']}")
    print(f"   Type: {dl['type']}")
    print(f"   File Type: {dl['file_type']}")

print(f"\n{'='*70}")
print(f"✓ Found {len(result_updated['download_links'])} download link(s)")
print(f"✓ Has direct download: {result_updated['has_direct_download']}")

# Also test on the first example (GK First Church) to verify it detects schematic
print("\n\n" + "="*70)
print("TESTING ON GK FIRST CHURCH (should have schematic)")
print("="*70)

result_church = extract_project_metadata_complete(soup, url)

print("\nDownload links:")
for i, dl in enumerate(result_church['download_links'], 1):
    print(f"{i}. {dl['text']}")
    print(f"   Type: {dl['type']}")
    print(f"   File Type: {dl['file_type']}")
    print()

UPDATED EXTRACTION - The Sunken Island
{
  "url": "https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/",
  "title": "Custom Terrain: The Sunken Island Adventure(1.2.5) Minecraft Map",
  "category": "Map",
  "subcategory": "challenge-adventure",
  "posted_date": "2011-08-22T00:00:00-04:00",
  "updated_date": "2012-04-02T20:16:28-04:00",
  "tags": [
    "adventure",
    "environment",
    "landscaping",
    "challenge",
    "terrain",
    "island",
    "water",
    "mountains",
    "sunken",
    "challenge-adventure"
  ],
  "description": "This started as a simple custom terrain, but has now spawned into a highly detailed, full adventure map, with objectives, story line, and ending event....",
  "author": "inHaze",
  "views": "1,493,309",
  "views_today": "25",
  "downloads": "635,293",
  "downloads_today": "2",
  "diamonds": "1,939",
  "hearts": "660",
  "download_links": [
    {
      "text": "Downloadable Map",
      "url": "/project/custom-terrain-the-sunken-isl

KeyError: 'file_type'
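
The `KeyError` above is consistent with link dictionaries that were built before the extractor started setting `file_type`. Besides re-running the function definition, downstream code can tolerate such records by filling in defaults. The sketch below assumes the four keys used in this notebook; `normalize_download_link` and the sample dictionary are hypothetical.

```python
def normalize_download_link(link):
    """Return a copy of a link dict with every expected key present."""
    defaults = {"text": "", "url": "", "type": "unknown", "file_type": "unknown"}
    # Keys present in `link` override the defaults; missing ones are filled in
    return {**defaults, **link}


# Hypothetical record lacking the 'file_type' key
old_style = {
    "text": "Downloadable Map",
    "url": "/project/custom-terrain-the-sunken-island/download/mirror/285279/",
    "type": "pm_external_redirect",
}

dl = normalize_download_link(old_style)
print(f"{dl['text']}: type={dl['type']}, file_type={dl['file_type']}")
```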

In [42]:
# The function needs to be re-run. Let me check if the world link is being captured
print("="*70)
print("DEBUG: Checking what links are being found...")
print("="*70 + "\n")

all_links_check = soup_sunken.find_all('a', href=True)

world_links = []
schematic_links = []
mirror_links = []

for link in all_links_check:
    href = link['href']
    text = link.get_text(strip=True)
    
    if '/download/worldmap/' in href or '/download/world/' in href:
        world_links.append({'text': text, 'href': href})
    elif '/download/schematic/' in href:
        schematic_links.append({'text': text, 'href': href})
    elif '/download/mirror/' in href or '/download/website/' in href:
        mirror_links.append({'text': text, 'href': href})

print(f"World links found: {len(world_links)}")
for wl in world_links:
    print(f"  - {wl['text']}: {wl['href']}")

print(f"\nSchematic links found: {len(schematic_links)}")
for sl in schematic_links:
    print(f"  - {sl['text']}: {sl['href']}")

print(f"\nMirror/Website links found: {len(mirror_links)}")
for ml in mirror_links:
    print(f"  - {ml['text']}: {ml['href']}")

print("\n" + "="*70)
print("The world link exists but wasn't captured by the old function!")
print("="*70)

DEBUG: Checking what links are being found...

World links found: 1
  - Download Minecraft Map: /project/custom-terrain-the-sunken-island/download/worldmap/

Schematic links found: 0

Mirror/Website links found: 1
  - Downloadable Map: /project/custom-terrain-the-sunken-island/download/mirror/285279/

The world link exists but wasn't captured by the old function!


In [44]:
# Re-extract with the updated function
result_sunken_fixed = extract_project_metadata_complete(soup_sunken, url_sunken)

print("="*70)
print("FIXED EXTRACTION - The Sunken Island (with world detection)")
print("="*70 + "\n")

print(f"Total download links found: {len(result_sunken_fixed['download_links'])}\n")

for i, dl in enumerate(result_sunken_fixed['download_links'], 1):
    print(f"{i}. {dl['text']}")
    print(f"   URL: {dl['url']}")
    print(f"   Type: {dl['type']}")
    print(f"   File Type: {dl['file_type']}")
    print()

print("="*70)
print("VERIFICATION:")
print("="*70)
print(f"✓ Has direct download: {result_sunken_fixed['has_direct_download']}")

# Count by type
world_count = sum(1 for dl in result_sunken_fixed['download_links'] if dl['file_type'] == 'world')
schematic_count = sum(1 for dl in result_sunken_fixed['download_links'] if dl['file_type'] == 'schematic')
external_count = sum(1 for dl in result_sunken_fixed['download_links'] if dl['file_type'] == 'unknown')

print(f"✓ World downloads: {world_count}")
print(f"✓ Schematic downloads: {schematic_count}")
print(f"✓ External/Unknown: {external_count}")

print("\n" + "="*70)
print("SUCCESS! Now capturing:")
print("  - Direct PM world download (/download/worldmap/)")
print("  - External redirect (/download/mirror/)")
print("="*70)

FIXED EXTRACTION - The Sunken Island (with world detection)

Total download links found: 2

1. Download Minecraft Map
   URL: /project/custom-terrain-the-sunken-island/download/worldmap/
   Type: direct_pm_world
   File Type: world

2. Downloadable Map
   URL: /project/custom-terrain-the-sunken-island/download/mirror/285279/
   Type: pm_external_redirect
   File Type: unknown

VERIFICATION:
✓ Has direct download: True
✓ World downloads: 1
✓ Schematic downloads: 0
✓ External/Unknown: 1

SUCCESS! Now capturing:
  - Direct PM world download (/download/worldmap/)
  - External redirect (/download/mirror/)
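
The captured download URLs are site-relative (they start with `/project/`), so they need a base before they can be fetched later. A minimal sketch using the standard library's `urljoin` follows; `absolute_download_url` is a hypothetical helper, and it assumes the project page URL is available as the base.

```python
from urllib.parse import urljoin


def absolute_download_url(project_url, href):
    """Resolve a possibly relative download href against the project page URL."""
    return urljoin(project_url, href)


project_url = "https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/"

for href in [
    "/project/custom-terrain-the-sunken-island/download/worldmap/",
    "/project/custom-terrain-the-sunken-island/download/mirror/285279/",
]:
    print(absolute_download_url(project_url, href))
```

Hrefs that are already absolute pass through `urljoin` unchanged, so the same helper can be applied to every entry in `download_links`.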


## Step 21: Test on "Finally Hotel"

Let's test another project to verify the extractor works consistently.

In [45]:
# Load the Finally Hotel project
url_hotel = "https://www.planetminecraft.com/project/finally-hotel/"
print(f"Loading: {url_hotel}\n")

driver.get(url_hotel)
time.sleep(3)

page_source_hotel = driver.page_source
soup_hotel = BeautifulSoup(page_source_hotel, 'html.parser')

print("✓ Page loaded\n")

# Extract metadata
result_hotel = extract_project_metadata_complete(soup_hotel, url_hotel)

print("="*70)
print("EXTRACTION RESULTS - Finally Hotel")
print("="*70)
print(json.dumps(result_hotel, indent=2))

print("\n" + "="*70)
print("SUMMARY:")
print("="*70)
print(f"Title: {result_hotel['title']}")
print(f"Author: {result_hotel['author']}")
print(f"Category: {result_hotel['category']}")
print(f"Subcategory: {result_hotel['subcategory']}")
print(f"Posted: {result_hotel['posted_date']}")
print(f"Updated: {result_hotel['updated_date']}")
print(f"Views: {result_hotel['views']} ({result_hotel['views_today']} today)")
print(f"Downloads: {result_hotel['downloads']} ({result_hotel['downloads_today']} today)")
print(f"Diamonds: {result_hotel['diamonds']}")
print(f"Hearts: {result_hotel['hearts']}")

print(f"\nDownload Links ({len(result_hotel['download_links'])}):")
for i, dl in enumerate(result_hotel['download_links'], 1):
    print(f"  {i}. {dl['text'][:40]}")
    print(f"     Type: {dl['type']}")
    print(f"     File: {dl['file_type']}")

print(f"\n✓ Has direct download: {result_hotel['has_direct_download']}")

Loading: https://www.planetminecraft.com/project/finally-hotel/

✓ Page loaded

EXTRACTION RESULTS - Finally Hotel
{
  "url": "https://www.planetminecraft.com/project/finally-hotel/",
  "title": "Minecraft Hotel / Resort | 3D View | Map Download Minecraft Map",
  "category": "Map",
  "subcategory": "land-structure",
  "posted_date": "2013-11-01T00:00:00-04:00",
  "updated_date": "2022-06-24T11:30:38-04:00",
  "tags": [
    "city",
    "land-structure",
    "complex",
    "minecraft",
    "building",
    "modern",
    "realistic",
    "download",
    "beach",
    "hotel",
    "real",
    "pool",
    "realism",
    "schematic",
    "casino",
    "hall",
    "3d",
    "large",
    "resort",
    "sauna",
    "gym",
    "bar",
    "megastructure",
    "scheme",
    "spa",
    "modernhotel",
    "indoorpool",
    "top",
    "minecrafthotel",
    "minecraftresort"
  ],
  "description": "This is my 1st attempt at a hotel, took me about 4 months to complete. It features an indoor pool, outdoor 

## Step 22: Discovering All Projects - Exploration

Planet Minecraft doesn't have sequential IDs, so we need another way to enumerate all projects. Let's explore:

### Possible approaches:
1. **Browse/Search pages** - Categories, tags, sort by date/popularity
2. **Sitemap** - XML sitemaps for search engines
3. **API** - Check if they have a public API
4. **RSS/Feeds** - RSS feeds for new content
5. **Pagination** - Browse pages with filters

Let's start investigating!
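
For approach 2, the sketch below shows how project URLs could be pulled from a sitemap in the standard sitemaps.org format. Whether Planet Minecraft publishes such a sitemap, and at what URL, has not been checked here; the XML is a hand-written sample, and `project_urls_from_sitemap` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
PROJECT_URL_RE = re.compile(r"^https://www\.planetminecraft\.com/project/[^/]+/?$")


def project_urls_from_sitemap(xml_text):
    """Return unique project URLs from a <urlset> document, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if PROJECT_URL_RE.match(url) and url not in urls:
            urls.append(url)
    return urls


# Hand-written sample, not real sitemap content
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.planetminecraft.com/project/finally-hotel/</loc></url>
  <url><loc>https://www.planetminecraft.com/project/custom-terrain-the-sunken-island/</loc></url>
  <url><loc>https://www.planetminecraft.com/projects/</loc></url>
  <url><loc>https://www.planetminecraft.com/project/finally-hotel/</loc></url>
</urlset>"""

for url in project_urls_from_sitemap(sample):
    print(url)
```

For very large sitemaps a set would make the duplicate check faster; the list is kept here so that document order is preserved.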

In [46]:
# Approach 1: Check the main projects browse page
print("="*70)
print("Approach 1: Browse/Search Pages")
print("="*70 + "\n")

browse_url = "https://www.planetminecraft.com/projects/"
print(f"Loading: {browse_url}\n")

driver.get(browse_url)
time.sleep(3)

page_source_browse = driver.page_source
soup_browse = BeautifulSoup(page_source_browse, 'html.parser')

print("✓ Page loaded\n")

# Look for project links
project_links = []
all_links = soup_browse.find_all('a', href=True)

for link in all_links:
    href = link['href']
    # Project URLs have pattern /project/SLUG/
    if re.match(r'^/project/[^/]+/$', href):
        project_links.append(href)

# Remove duplicates
unique_projects = list(set(project_links))

print(f"Found {len(unique_projects)} unique project links on the browse page")
print("\nFirst 10 projects:")
for proj in unique_projects[:10]:
    print(f"  {proj}")

# Look for pagination
print("\n" + "="*70)
print("Checking for pagination...")
print("="*70 + "\n")

# Look for "next page" or page numbers
# Note: this pattern does not match `?p=N`, the parameter that In [48] finds to work
pagination_links = soup_browse.find_all('a', href=re.compile(r'page=|/p\d+'))
print(f"Found {len(pagination_links)} pagination links")

for pag_link in pagination_links[:5]:
    print(f"  Text: '{pag_link.get_text(strip=True)}'")
    print(f"  URL: {pag_link.get('href')}")
    print()

# Check for filters/sorting
print("="*70)
print("Checking for filters and sorting options...")
print("="*70 + "\n")

# Look for select elements or filter links
selects = soup_browse.find_all('select')
print(f"Found {len(selects)} select dropdowns")

for sel in selects[:5]:
    name = sel.get('name', 'no name')
    print(f"\nSelect: {name}")
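    # Named select_options so it doesn't shadow the Selenium `options` object
    select_options = sel.find_all('option')
    print(f"  Options ({len(select_options)}):")
    for opt in select_options[:8]:
        print(f"    - {opt.get('value')}: {opt.get_text(strip=True)}")
    if len(select_options) > 8:
        print(f"    ... and {len(select_options)-8} more")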

Approach 1: Browse/Search Pages

Loading: https://www.planetminecraft.com/projects/

✓ Page loaded

Found 34 unique project links on the browse page

First 10 projects:
  /project/big-yellow-crane/
  /project/greenspire-island-1k-custom-survival-friendly-world-1-21/
  /project/containment-field/
  /project/suquare-colosseum-of-rome-palace-of-italian-civilization/
  /project/f-e-a-r/
  /project/small-contemporary-house/
  /project/preview-medieval-starter-map-moonvale-1-21-8-vanilla/
  /project/brownstone-halloween-house-download/
  /project/a-cottage-in-the-woods-4341009/
  /project/reuterstra-e-6b-townhouse-bonn-germany/

Checking for pagination...

Found 0 pagination links
Checking for filters and sorting options...

Found 0 select dropdowns


In [47]:
# Let's examine the page more carefully for pagination
# Often pagination is handled with JavaScript, so we need to look for different patterns

print("=" * 70)
print("Examining page structure for pagination...")
print("=" * 70)

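# Inspect the browse page parsed in In [46]; without this, `soup` still
# refers to whichever page was parsed last in an earlier cell
soup = soup_browse

# Look for "next" buttons or pagination elements
next_buttons = soup.find_all('a', string=lambda s: s and 'next' in s.lower())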
print(f"\nLinks containing 'next': {len(next_buttons)}")
for btn in next_buttons[:5]:
    print(f"  Text: {btn.get_text(strip=True)}, href: {btn.get('href')}")

# Look for page numbers
page_numbers = soup.find_all('a', string=lambda s: s and s.strip().isdigit())
print(f"\nLinks that are just numbers: {len(page_numbers)}")
for num in page_numbers[:5]:
    print(f"  Text: {num.get_text(strip=True)}, href: {num.get('href')}")

# Look for common pagination class names
pagination_classes = ['pagination', 'pager', 'pages', 'page-numbers']
for cls in pagination_classes:
    elements = soup.find_all(class_=lambda c: c and cls in c.lower())
    if elements:
        print(f"\nFound elements with class containing '{cls}': {len(elements)}")
        for elem in elements[:2]:
            print(f"  Classes: {elem.get('class')}")
            print(f"  Content preview: {elem.get_text(strip=True)[:100]}")

# Check for "load more" buttons
load_more = soup.find_all(string=lambda s: s and 'load more' in s.lower())
print(f"\nElements containing 'load more': {len(load_more)}")

# Check the URL structure - maybe they use query parameters
print("\n" + "=" * 70)
print("Current page URL structure")
print("=" * 70)
print(f"URL: {driver.current_url}")

# Check if there's an infinite scroll mechanism
print("\n" + "=" * 70)
print("Checking for infinite scroll indicators...")
print("=" * 70)
scripts = soup.find_all('script')
for script in scripts:
    if script.string and 'infinite' in script.string.lower():
        print("Found script mentioning 'infinite'")
        break

Examining page structure for pagination...

Links containing 'next': 0

Links that are just numbers: 0

Elements containing 'load more': 0

Current page URL structure
URL: https://www.planetminecraft.com/projects/

Checking for infinite scroll indicators...


In [48]:
# Try different URL patterns to see if they support pagination
# Common patterns: /projects/?p=2, /projects/page/2/, /projects/?page=2, etc.

test_urls = [
    "https://www.planetminecraft.com/projects/?p=2",
    "https://www.planetminecraft.com/projects/?page=2",
    "https://www.planetminecraft.com/projects/page/2/",
    "https://www.planetminecraft.com/projects/?offset=50",
    "https://www.planetminecraft.com/projects/order/rating/",
    "https://www.planetminecraft.com/projects/order/popular/",
]

print("=" * 70)
print("Testing different URL patterns...")
print("=" * 70)

for test_url in test_urls[:3]:  # Test first 3
    print(f"\nTesting: {test_url}")
    driver.get(test_url)
    time.sleep(2)  # Wait for page to load
    
    test_soup = BeautifulSoup(driver.page_source, 'html.parser')
    project_links = test_soup.find_all('a', href=lambda h: h and '/project/' in h)
    unique_links = set([link.get('href').split('?')[0] for link in project_links])
    
    print(f"  Final URL: {driver.current_url}")
    print(f"  Found {len(unique_links)} unique project links")
    
    if len(unique_links) > 0:
        first_project = list(unique_links)[0]
        print(f"  First project: {first_project}")
        
# Go back to main page
driver.get("https://www.planetminecraft.com/projects/")
time.sleep(2)

Testing different URL patterns...

Testing: https://www.planetminecraft.com/projects/?p=2
  Final URL: https://www.planetminecraft.com/projects/?p=2
  Found 25 unique project links
  First project: /project/bulldozer-6736655/

Testing: https://www.planetminecraft.com/projects/?page=2
  Final URL: https://www.planetminecraft.com/projects/?page=2
  Found 33 unique project links
  First project: /project/big-yellow-crane/

Testing: https://www.planetminecraft.com/projects/page/2/
  Final URL: https://www.planetminecraft.com/projects/
  Found 33 unique project links
  First project: /project/big-yellow-crane/
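
The cells from In [48] onward match links with `'/project/' in h`, which is looser than the regex used in In [46] and can pick up hrefs that are not project pages. A stricter helper could look like the sketch below. `normalize_project_href` and `PROJECT_PATH` are hypothetical names, and the assumption is that every project link is a site-relative `/project/<slug>/` path, as in the outputs above.

```python
import re

# /project/<slug>/ with a non-empty slug; a query string or fragment is allowed.
PROJECT_PATH = re.compile(r"^/project/([^/?#]+)/?(?:[?#].*)?$")


def normalize_project_href(href):
    """Return '/project/<slug>/' for a project link, or None for anything else."""
    match = PROJECT_PATH.match(href or "")
    return f"/project/{match.group(1)}/" if match else None


for candidate in ["/project/big-yellow-crane/", "/project/big-yellow-crane/?ref=x",
                  "/projects/?p=2", "/project/"]:
    print(candidate, "->", normalize_project_href(candidate))
```

Normalising to a single form also makes set comparisons between pages insensitive to trailing slashes and query strings.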


In [49]:
# Verify that ?p=N gives us different pages
# Let's compare page 1, 2, and 3

print("=" * 70)
print("Verifying pagination with ?p=N parameter")
print("=" * 70)

pages_data = {}

for page_num in [1, 2, 3]:
    url = f"https://www.planetminecraft.com/projects/?p={page_num}"
    print(f"\nLoading page {page_num}...")
    driver.get(url)
    time.sleep(2)
    
    page_soup = BeautifulSoup(driver.page_source, 'html.parser')
    project_links = page_soup.find_all('a', href=lambda h: h and '/project/' in h)
    unique_links = set([link.get('href').split('?')[0] for link in project_links])
    
    pages_data[page_num] = unique_links
    print(f"  Found {len(unique_links)} unique project links")
    print(f"  First 3: {list(unique_links)[:3]}")

# Check for overlap
print("\n" + "=" * 70)
print("Checking for overlap between pages...")
print("=" * 70)

overlap_1_2 = pages_data[1].intersection(pages_data[2])
overlap_2_3 = pages_data[2].intersection(pages_data[3])
overlap_1_3 = pages_data[1].intersection(pages_data[3])

print(f"Page 1 & 2 overlap: {len(overlap_1_2)} projects")
print(f"Page 2 & 3 overlap: {len(overlap_2_3)} projects")
print(f"Page 1 & 3 overlap: {len(overlap_1_3)} projects")

print(f"\nTotal unique projects across pages 1-3: {len(pages_data[1] | pages_data[2] | pages_data[3])}")

# Try to find the last page by testing higher page numbers
print("\n" + "=" * 70)
print("Finding the last page...")
print("=" * 70)

# Try a very high page number
driver.get("https://www.planetminecraft.com/projects/?p=999999")
time.sleep(2)
print(f"Navigating to page 999999 redirects to: {driver.current_url}")

test_soup = BeautifulSoup(driver.page_source, 'html.parser')
test_links = test_soup.find_all('a', href=lambda h: h and '/project/' in h)
unique_test = set([link.get('href').split('?')[0] for link in test_links])
print(f"Found {len(unique_test)} projects on that page")

Verifying pagination with ?p=N parameter

Loading page 1...
  Found 33 unique project links
  First 3: ['/project/big-yellow-crane/', '/project/greenspire-island-1k-custom-survival-friendly-world-1-21/', '/project/space-station-end-hub/']

Loading page 2...
  Found 25 unique project links
  First 3: ['/project/bulldozer-6736655/', '/project/iralaya-5k-custom-terrain-map-download-12-custom-biomes-1-20-java-bedrock-varuna-studios/', '/project/fnaf-fall-fest-minecraft-tutorial-series-map/']

Loading page 3...
  Found 25 unique project links
  First 3: ['/project/big-yellow-crane/', '/project/dental-clinic-littletiles-minecraft/', '/project/basic-house-6744566/']

Checking for overlap between pages...
Page 1 & 2 overlap: 0 projects
Page 2 & 3 overlap: 0 projects
Page 1 & 3 overlap: 1 projects

Total unique projects across pages 1-3: 82

Finding the last page...
Navigating to page 999999 redirects to: https://www.planetminecraft.com/projects/?p=999999
Found 25 projects on that page


In [50]:
# Check if there's a page count indicator or total projects count
# Also check different sorting/filtering options

print("=" * 70)
print("Checking for total project count on the page...")
print("=" * 70)

driver.get("https://www.planetminecraft.com/projects/")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Look for text containing "projects" or "results"
text_with_numbers = soup.find_all(string=lambda s: s and any(word in s.lower() for word in ['projects', 'results', 'total', 'showing']))
print("\nText containing count keywords:")
for text in text_with_numbers[:10]:
    clean_text = ' '.join(text.strip().split())
    if clean_text and any(char.isdigit() for char in clean_text):
        print(f"  {clean_text}")

# Check if there are filter options that might tell us about categories
print("\n" + "=" * 70)
print("Checking available filters/categories...")
print("=" * 70)

# Look for filter links
filter_links = soup.find_all('a', href=lambda h: h and '/projects/' in h and '?' in h)
print(f"\nFound {len(filter_links)} filter/category links")
unique_filters = set()
for link in filter_links:
    href = link.get('href')
    if href.startswith('/projects/'):
        unique_filters.add(href)

print(f"Unique filter URLs: {len(unique_filters)}")
for filt in sorted(unique_filters)[:15]:
    print(f"  {filt}")

# Also check for category-specific pages
print("\n" + "=" * 70)
print("Checking if category pages exist...")
print("=" * 70)

test_category_urls = [
    "https://www.planetminecraft.com/projects/?category=land-structure",
    "https://www.planetminecraft.com/projects/land-structure/",
    "https://www.planetminecraft.com/resources/land-structure/",
]

for cat_url in test_category_urls[:1]:
    print(f"\nTrying: {cat_url}")
    driver.get(cat_url)
    time.sleep(2)
    print(f"  Final URL: {driver.current_url}")
    
    cat_soup = BeautifulSoup(driver.page_source, 'html.parser')
    cat_links = cat_soup.find_all('a', href=lambda h: h and '/project/' in h)
    unique_cat = set([link.get('href').split('?')[0] for link in cat_links])
    print(f"  Found {len(unique_cat)} projects")

Checking for total project count on the page...

Text containing count keywords:
  {"@context":"http:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/www.planetminecraft.com\/projects\/","name":"Minecraft Maps"}}]}

Checking available filters/categories...

Found 53 filter/category links
Unique filter URLs: 49
  /projects/?monetization=0
  /projects/?monetization=1
  /projects/?monetization=2
  /projects/?monetization=any
  /projects/?order=order_downloads
  /projects/?order=order_hot
  /projects/?order=order_latest
  /projects/?order=order_popularity
  /projects/?order=order_views
  /projects/?platform=1
  /projects/?platform=2
  /projects/?platform=any
  /projects/?share=any
  /projects/?share=schematic
  /projects/?share=seed

Checking if category pages exist...

Trying: https://www.planetminecraft.com/projects/?category=land-structure
  Final URL: https://www.planetminecraft.com/projects/?category=land-structure

In [51]:
# Let's do a binary search to find approximately how many pages exist
# Start with a reasonable range

import time

print("=" * 70)
print("Finding the approximate number of pages using binary search...")
print("=" * 70)

def has_projects(page_num):
    """Check if a page number has any projects"""
    url = f"https://www.planetminecraft.com/projects/?p={page_num}"
    driver.get(url)
    time.sleep(1.5)  # Shorter delay for searching
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    project_links = soup.find_all('a', href=lambda h: h and '/project/' in h)
    
    # Filter to actual project pages (not just any /project/ link)
    actual_projects = [link for link in project_links 
                      if link.get('href', '').startswith('/project/') 
                      and len(link.get('href', '').split('/')) >= 3]
    
    return len(actual_projects) > 0

# Binary search
low, high = 1, 50000  # Assuming max 50k pages (wild guess)
last_valid = 1

print("\nSearching for the last page...")
iteration = 0
max_iterations = 20  # Prevent infinite loop

while low <= high and iteration < max_iterations:
    mid = (low + high) // 2
    iteration += 1
    
    print(f"\nIteration {iteration}: Testing page {mid} (range: {low}-{high})")
    
    has_proj = has_projects(mid)
    
    if has_proj:
        print(f"  ✓ Page {mid} has projects")
        last_valid = mid
        low = mid + 1  # Try higher
    else:
        print(f"  ✗ Page {mid} is empty")
        high = mid - 1  # Try lower

print("\n" + "=" * 70)
print(f"Approximate last page: {last_valid}")
print("=" * 70)

# Verify by testing a few pages around the found number
print(f"\nVerifying pages around {last_valid}...")
for test_page in [last_valid - 1, last_valid, last_valid + 1, last_valid + 2]:
    if test_page > 0:
        has_proj = has_projects(test_page)
        status = "✓ HAS PROJECTS" if has_proj else "✗ EMPTY"
        print(f"  Page {test_page}: {status}")

Finding the approximate number of pages using binary search...

Searching for the last page...

Iteration 1: Testing page 25000 (range: 1-50000)
  ✓ Page 25000 has projects

Iteration 2: Testing page 37500 (range: 25001-50000)
  ✓ Page 37500 has projects

Iteration 3: Testing page 43750 (range: 37501-50000)
  ✓ Page 43750 has projects

Iteration 4: Testing page 46875 (range: 43751-50000)
  ✓ Page 46875 has projects

Iteration 5: Testing page 48438 (range: 46876-50000)
  ✓ Page 48438 has projects

Iteration 6: Testing page 49219 (range: 48439-50000)
  ✓ Page 49219 has projects

Iteration 7: Testing page 49610 (range: 49220-50000)
  ✓ Page 49610 has projects

Iteration 8: Testing page 49805 (range: 49611-50000)
  ✓ Page 49805 has projects

Iteration 9: Testing page 49903 (range: 49806-50000)
  ✓ Page 49903 has projects

Iteration 10: Testing page 49952 (range: 49904-50000)
  ✓ Page 49952 has projects

Iteration 11: Testing page 49976 (range: 49953-50000)
  ✓ Page 49976 has projects

Iter

In [52]:
# Let's try with a much higher upper bound
print("=" * 70)
print("Extending search to find the actual limit...")
print("=" * 70)

# Test some very high page numbers directly
test_pages = [100000, 200000, 500000, 1000000]

print("\nQuick spot checks on high page numbers:")
for page in test_pages:
    url = f"https://www.planetminecraft.com/projects/?p={page}"
    driver.get(url)
    time.sleep(1.5)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    project_links = soup.find_all('a', href=lambda h: h and '/project/' in h)
    actual_projects = [link for link in project_links 
                      if link.get('href', '').startswith('/project/') 
                      and len(link.get('href', '').split('/')) >= 3]
    
    status = "✓ HAS PROJECTS" if len(actual_projects) > 0 else "✗ EMPTY"
    print(f"  Page {page:,}: {status} ({len(actual_projects)} project links)")

print("\n" + "=" * 70)
print("Analysis:")
print("=" * 70)
print("It appears Planet Minecraft might not have a hard page limit,")
print("or the limit is very high (>1M pages).")
print("\nFor practical scraping, we have a few options:")
print("1. Scrape pages sequentially until we hit empty pages")
print("2. Use filters (?share=schematic, ?category=X) to narrow scope")
print("3. Use ?order=order_latest to get newest first, then stop when we")
print("   reach projects we've already seen (for incremental scraping)")
print("4. Estimate based on projects per page: if ~25-30 projects/page")
print("   and they show 50k+ pages, that's 1.25M+ projects total")

Extending search to find the actual limit...

Quick spot checks on high page numbers:
  Page 100,000: ✓ HAS PROJECTS (75 project links)
  Page 200,000: ✓ HAS PROJECTS (75 project links)
  Page 500,000: ✓ HAS PROJECTS (75 project links)
  Page 1,000,000: ✓ HAS PROJECTS (75 project links)

Analysis:
It appears Planet Minecraft might not have a hard page limit,
or the limit is very high (>1M pages).

For practical scraping, we have a few options:
1. Scrape pages sequentially until we hit empty pages
2. Use filters (?share=schematic, ?category=X) to narrow scope
3. Use ?order=order_latest to get newest first, then stop when we
   reach projects we've already seen (for incremental scraping)
4. Estimate based on projects per page: if ~25-30 projects/page
   and they show 50k+ pages, that's 1.25M+ projects total
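
A binary search only finds a last page if its test is monotonic: true up to that page and false afterwards. `has_projects` returned true for every page tried, so the search above could only run to its upper bound. The sketch below shows the search itself with the test passed in as a parameter. `find_last_page` and `is_real_page` are hypothetical names, and what a reliable `is_real_page` would be for Planet Minecraft is still an open question.

```python
def find_last_page(is_real_page, start=1, max_page=10_000_000):
    """Return the largest page p >= start for which is_real_page(p) is true.

    Assumes is_real_page is monotonic (true up to the last page, false after).
    Returns None if `start` itself fails, and max_page if no failing page is
    found at or below max_page.
    """
    if not is_real_page(start):
        return None

    # Gallop: double the probe until it fails or passes the cap.
    low, high = start, start * 2
    while high <= max_page and is_real_page(high):
        low, high = high, high * 2
    high = min(high, max_page + 1)  # max_page + 1 acts as an untested sentinel

    # Bisect: is_real_page(low) is true, `high` is false or the sentinel.
    while high - low > 1:
        mid = (low + high) // 2
        if is_real_page(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in predicate for a site with 1,234 pages.
print(find_last_page(lambda page: page <= 1234))
```

Galloping first avoids guessing an upper bound such as 50,000, and the number of page loads stays logarithmic in the answer.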


In [53]:
# Check if high page numbers are just showing duplicates/cycling
print("=" * 70)
print("Checking if high page numbers are cycling content...")
print("=" * 70)

def get_projects_from_page(page_num):
    """Get list of unique project URLs from a page"""
    url = f"https://www.planetminecraft.com/projects/?p={page_num}"
    driver.get(url)
    time.sleep(1.5)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    project_links = soup.find_all('a', href=lambda h: h and '/project/' in h)
    
    # Get unique project URLs
    projects = set()
    for link in project_links:
        href = link.get('href', '')
        if href.startswith('/project/') and len(href.split('/')) >= 3:
            # Clean URL (remove query params and trailing slash)
            clean_href = href.split('?')[0].rstrip('/')
            projects.add(clean_href)
    
    return projects

# Compare different page ranges
test_pages = [1, 100, 1000, 10000, 100000, 500000]
all_projects = {}

print("\nSampling pages:")
for page in test_pages:
    projects = get_projects_from_page(page)
    all_projects[page] = projects
    print(f"  Page {page:>7,}: {len(projects):>2} unique projects")
    if len(projects) > 0:
        print(f"    First: {list(projects)[0]}")

print("\n" + "=" * 70)
print("Checking for duplicates across sampled pages...")
print("=" * 70)

# Check overlaps
pages_list = list(all_projects.keys())
for i in range(len(pages_list)):
    for j in range(i+1, len(pages_list)):
        page1, page2 = pages_list[i], pages_list[j]
        overlap = all_projects[page1].intersection(all_projects[page2])
        if overlap:
            print(f"Page {page1:,} & {page2:,}: {len(overlap)} shared projects")

# Count total unique
all_unique = set()
for projects in all_projects.values():
    all_unique.update(projects)

print(f"\nTotal unique projects across all sampled pages: {len(all_unique)}")

print("\n" + "=" * 70)
print("Conclusion:")
print("=" * 70)
print("Based on the overlap analysis, we can determine if:")
print("- High page numbers show unique content → need to scrape many pages")
print("- High page numbers repeat content → pagination wraps around")
print("- Few/no overlaps → sequential scraping will work well")

Checking if high page numbers are cycling content...

Sampling pages:
  Page       1: 33 unique projects
    First: /project/fanmade-erebor-the-lonely-mountain
  Page     100: 25 unique projects
    First: /project/fnaf-movie-2-withered-bonnie-the-bunny
  Page   1,000: 25 unique projects
    First: /project/dream-world-6729795
  Page  10,000: 25 unique projects
    First: /project/hamburg-building-6-by-tim0fei
  Page 100,000: 25 unique projects
    First: /project/brandi-modern-mansion-full-interior
  Page 500,000: 25 unique projects
    First: /project/messi-6689544

Checking for duplicates across sampled pages...
Page 1 & 1,000: 1 shared projects
Page 1 & 10,000: 2 shared projects
Page 1 & 100,000: 2 shared projects
Page 1 & 500,000: 2 shared projects

Total unique projects across all sampled pages: 151

Conclusion:
Based on the overlap analysis, we can determine if:
- High page numbers show unique content → need to scrape many pages
- High page numbers repeat content → pagination wr

## Summary: Planet Minecraft Discovery Strategy

### What We Found:
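1. **Pagination works with the `?p=N` parameter**
   - Example: `https://www.planetminecraft.com/projects/?p=2`
   - Page 1 yielded 33 unique project links; every other sampled page yielded 25
   - Sampled pages barely overlap: the only shared projects (1-2 per pair) all involve page 1

2. **No last page found**
   - Every page tested, up to 1,000,000, still returned project links, so no empty page was ever hit
   - Pages 1, 100, 1,000, 10,000, 100,000 and 500,000 were distinct apart from the page 1 overlaps, but cycling at high page numbers has not been ruled out
   - Working guesses, not measurements: a practical range of 1-100,000 pages and 1-3 million total projects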

3. **Available filters:**
   - `?share=schematic` - schematic files only
   - `?share=seed` - seed only
   - `?category=land-structure` - filter by subcategory (the URL loaded without a redirect, but no project count was recorded above)
   - `?order=order_latest` - sort by newest
   - `?order=order_popularity` - sort by popular
   - `?order=order_downloads` - sort by downloads
   - `?order=order_views` - sort by views
   - `?platform=1` - Java Edition
   - `?platform=2` - Bedrock Edition
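
These parameters can be combined into one query string. A minimal sketch using the standard library follows; `build_browse_url` and `BASE_URL` are hypothetical names, and whether the site honours several filters at once was not tested in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.planetminecraft.com/projects/"


def build_browse_url(page=1, **filters):
    """Build a browse URL from the filters listed above; None values are dropped."""
    params = {name: value for name, value in filters.items() if value is not None}
    params["p"] = page
    return f"{BASE_URL}?{urlencode(params)}"


print(build_browse_url(2, share="schematic", order="order_latest"))
print(build_browse_url(category=None))
```

`urlencode` also escapes any characters that are not URL-safe, which hand-built f-strings would pass through unchanged.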

### Recommended Scraping Strategy:

**Option 1: Full Sequential Scraping**
- Start at page 1, scrape sequentially
- Continue until pages stop adding new URLs; an empty-page check alone may never trigger, since even page 1,000,000 returned projects
- Store URLs in a set to avoid duplicates
- Estimated time: Days to weeks (depending on rate limiting)

**Option 2: Filtered Scraping**
- Use `?share=schematic` to get only schematic files
- Reduces total pages significantly
- Better for our use case (voxel models)

**Option 3: Incremental/Recent Scraping**
- Use `?order=order_latest` to get newest first
- Scrape until reaching projects already in database
- Good for keeping dataset updated

**Option 4: Category-Based Scraping**
- Iterate through each subcategory
- Scrape each category separately
- Easier to resume if interrupted
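
Because no empty page was found during exploration, any of these options needs a stopping rule that does not rely on empty pages. One possibility, sketched below, is to stop after several consecutive pages that add no new URLs. `collect_until_stale` and `fetch_page` are hypothetical names; `fetch_page` stands in for the Selenium-based page fetch, and the threshold of 3 is an arbitrary choice.

```python
def collect_until_stale(fetch_page, start_page=1, max_stale_pages=3, max_pages=None):
    """Collect URLs page by page, stopping once pages stop adding new ones.

    fetch_page(page_num) must return an iterable of project URLs.
    """
    seen = set()
    stale = 0
    page = start_page
    while max_pages is None or page - start_page < max_pages:
        new_urls = set(fetch_page(page)) - seen
        seen |= new_urls
        # Count consecutive pages that contributed nothing new.
        stale = 0 if new_urls else stale + 1
        if stale >= max_stale_pages:
            break
        page += 1
    return seen


def fake_fetch(page_num):
    """Stand-in site: 5 real pages, after which page 5 repeats forever."""
    page_num = min(page_num, 5)
    return {f"/project/p{page_num}-{i}" for i in range(3)}


print(len(collect_until_stale(fake_fetch)))
```

This rule would stop early if the site repeats content before its real end, so a generous threshold is safer when sorting by anything other than `order_latest`.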

### Next Steps:
1. Build URL collection function that iterates through pages
2. Add rate limiting (delays between requests)
3. Implement duplicate detection
4. Add progress saving/resuming capability
5. Combine with our metadata extraction function
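
For step 4, progress can be kept in a small JSON file. The sketch below is one way to do it, assuming progress consists of a set of scraped URLs and a dict of failures; `save_progress` and `load_progress` are hypothetical helpers. Sets are not JSON-serialisable, so they are stored as sorted lists, and the file is written to a temporary name first so that an interrupted run cannot leave a half-written file behind.

```python
import json
import os
import tempfile


def save_progress(path, scraped_urls, failed_urls):
    """Write progress atomically; the URL set is stored as a sorted list."""
    payload = {"scraped_urls": sorted(scraped_urls), "failed_urls": failed_urls}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)  # replaces the old file in a single step


def load_progress(path):
    """Return (scraped_urls as a set, failed_urls as a dict); empty if no file."""
    if not os.path.exists(path):
        return set(), {}
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return set(data.get("scraped_urls", [])), data.get("failed_urls", {})
```

Calling `save_progress` every few projects rather than after each one keeps disk writes low while limiting how much work a crash can lose.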

## Step 23: Build the Complete Scraper

Now we'll create a full scraper that:
1. Collects project URLs from paginated browse pages
2. Extracts metadata from each project
3. Saves results to CSV
4. Supports optional search parameters (category, share type, order, etc.)
5. Implements progress saving and resuming
6. Adds rate limiting to be polite
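
For item 6, a fixed `time.sleep(delay)` is the simplest option. A small random jitter on top makes the request timing less regular; the sketch below shows one way to add it. `polite_sleep` is a hypothetical helper and the default values are arbitrary. The `sleep` and `rng` parameters exist only so the function can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(base_delay=2.0, jitter=0.5, sleep=time.sleep, rng=random.random):
    """Sleep for base_delay plus up to `jitter` extra seconds; return the delay used."""
    delay = base_delay + jitter * rng()
    sleep(delay)
    return delay


# Dry run: record the delay instead of waiting.
recorded = []
print(polite_sleep(2.0, 0.5, sleep=recorded.append, rng=lambda: 0.5), recorded)
```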

In [54]:
# URL Collection Function - Scrape project URLs from browse pages
def collect_project_urls(driver, start_page=1, max_pages=None, 
                        category=None, share=None, order=None, 
                        platform=None, delay=2.0, max_empty_pages=5):
    """
    Collect project URLs from Planet Minecraft browse pages.
    
    Parameters:
    - driver: Selenium WebDriver instance
    - start_page: Page number to start from (default: 1)
    - max_pages: Maximum number of pages to scrape (None = unlimited; set a limit
      in practice, since no empty page was found even at p=1,000,000)
    - category: Filter by category (e.g., 'land-structure', 'challenge-adventure')
    - share: Filter by share type ('schematic', 'seed', or None for all)
    - order: Sort order ('order_latest', 'order_popularity', 'order_downloads', 'order_views', 'order_hot')
    - platform: Filter by platform (1=Java, 2=Bedrock, None=any)
    - delay: Delay between page requests in seconds
    - max_empty_pages: Stop after this many consecutive empty pages
    
    Returns:
    - set: Unique project URLs
    """
    
    base_url = "https://www.planetminecraft.com/projects/"
    all_project_urls = set()
    empty_page_count = 0
    current_page = start_page
    
    # Build query parameters
    params = []
    if category:
        params.append(f"category={category}")
    if share:
        params.append(f"share={share}")
    if order:
        params.append(f"order={order}")
    if platform:
        params.append(f"platform={platform}")
    
    param_string = "&".join(params) if params else ""
    
    print("=" * 70)
    print(f"Starting URL collection from page {start_page}")
    if param_string:
        print(f"Filters: {param_string}")
    print("=" * 70)
    
    while True:
        # Check if we've reached max_pages
        if max_pages and (current_page - start_page + 1) > max_pages:
            print(f"\nReached max_pages limit ({max_pages})")
            break
        
        # Build URL with page number
        if param_string:
            url = f"{base_url}?{param_string}&p={current_page}"
        else:
            url = f"{base_url}?p={current_page}"
        
        print(f"\nPage {current_page}: {url}")
        
        try:
            driver.get(url)
            time.sleep(delay)
            
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            project_links = soup.find_all('a', href=lambda h: h and '/project/' in h)
            
            # Extract unique project URLs
            page_projects = set()
            for link in project_links:
                href = link.get('href', '')
                # Require a non-empty slug after /project/ (a bare '/project/' link is skipped)
                if re.match(r'^/project/[^/?#]+', href):
                    clean_href = href.split('?')[0].rstrip('/')
                    page_projects.add(clean_href)
            
            # Check if page has projects
            if len(page_projects) == 0:
                empty_page_count += 1
                print(f"  ⚠ Empty page (count: {empty_page_count}/{max_empty_pages})")
                
                if empty_page_count >= max_empty_pages:
                    print(f"\n  Stopping: {max_empty_pages} consecutive empty pages")
                    break
            else:
                empty_page_count = 0  # Reset counter
                new_projects = page_projects - all_project_urls
                all_project_urls.update(page_projects)
                print(f"  ✓ Found {len(page_projects)} projects ({len(new_projects)} new)")
                print(f"  Total unique: {len(all_project_urls)}")
            
            current_page += 1
            
        except Exception as e:
            print(f"  ✗ Error on page {current_page}: {e}")
            break
    
    print("\n" + "=" * 70)
    print(f"Collection complete: {len(all_project_urls)} unique project URLs")
    print("=" * 70)
    
    return all_project_urls

In [57]:
# Full scraper with progress saving
import csv
import json
from pathlib import Path
from datetime import datetime

def scrape_planet_minecraft(driver, 
                            output_csv='planet_minecraft_projects.csv',
                            progress_file='scraping_progress.json',
                            start_page=1, 
                            max_pages=None,
                            max_projects=None,
                            category=None, 
                            share=None, 
                            order=None,
                            platform=None,
                            page_delay=2.0,
                            project_delay=1.0,
                            resume=True):
    """
    Complete scraper for Planet Minecraft projects.
    
    Parameters:
    - driver: Selenium WebDriver instance
    - output_csv: Path to output CSV file
    - progress_file: Path to progress tracking JSON file
    - start_page: Page number to start URL collection
    - max_pages: Maximum pages to scrape for URLs (None = unlimited)
    - max_projects: Maximum projects to scrape metadata for (None = unlimited)
    - category: Filter by category (e.g., 'land-structure')
    - share: Filter by share type ('schematic', 'seed', None)
    - order: Sort order ('order_latest', 'order_popularity', etc.)
    - platform: Platform filter (1=Java, 2=Bedrock, None=any)
    - page_delay: Delay between page requests (seconds)
    - project_delay: Delay between project metadata requests (seconds)
    - resume: Whether to resume from progress file if it exists
    
    Returns:
    - dict: Scraping statistics
    """
    
    # Initialize progress tracking
    progress = {
        'scraped_urls': set(),
        'failed_urls': {},
        'total_collected': 0,
        'total_scraped': 0,
        'start_time': datetime.now().isoformat(),
        'filters': {
            'category': category,
            'share': share,
            'order': order,
            'platform': platform
        }
    }
    
    # Load existing progress if resuming
    if resume and Path(progress_file).exists():
        print(f"Loading progress from {progress_file}...")
        with open(progress_file, 'r') as f:
            saved_progress = json.load(f)
            progress['scraped_urls'] = set(saved_progress.get('scraped_urls', []))
            progress['failed_urls'] = saved_progress.get('failed_urls', {})
            print(f"  Resuming: {len(progress['scraped_urls'])} already scraped")
    
    # Check if output CSV exists and load existing URLs to avoid duplicates
    csv_exists = Path(output_csv).exists()
    if csv_exists and resume:
        print(f"Loading existing data from {output_csv}...")
        with open(output_csv, 'r', newline='', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                if 'url' in row:
                    progress['scraped_urls'].add(row['url'])
        print(f"  Found {len(progress['scraped_urls'])} existing projects")
    
    # Step 1: Collect project URLs
    print("\n" + "=" * 70)
    print("STEP 1: Collecting project URLs")
    print("=" * 70)
    
    project_urls = collect_project_urls(
        driver=driver,
        start_page=start_page,
        max_pages=max_pages,
        category=category,
        share=share,
        order=order,
        platform=platform,
        delay=page_delay
    )
    
    progress['total_collected'] = len(project_urls)
    
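    # Filter out already scraped URLs. Collected URLs are site-relative paths,
    # while the CSV stores full URLs, so check both forms.
    site_root = "https://www.planetminecraft.com"
    urls_to_scrape = [url for url in project_urls
                      if url not in progress['scraped_urls']
                      and f"{site_root}{url}" not in progress['scraped_urls']]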
    
    if max_projects:
        urls_to_scrape = urls_to_scrape[:max_projects]
    
    print(f"\nURLs to scrape: {len(urls_to_scrape)} (out of {len(project_urls)} collected)")
    
    if len(urls_to_scrape) == 0:
        print("No new URLs to scrape!")
        return progress
    
    # Step 2: Extract metadata from each project
    print("\n" + "=" * 70)
    print("STEP 2: Extracting project metadata")
    print("=" * 70)
    
    # Prepare CSV file
    csv_headers = [
        'title', 'url', 'author', 'category', 'subcategory', 'description',
        'tags', 'posted_date', 'updated_date', 'views', 'views_today',
        'downloads', 'downloads_today', 'diamonds', 'hearts',
        'download_links', 'download_types', 'file_types'
    ]
    
    # Open CSV file in append mode if resuming, write mode if new
    file_mode = 'a' if (csv_exists and resume) else 'w'
    with open(output_csv, file_mode, newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_headers)
        
        # Write header only if new file
        if file_mode == 'w':
            writer.writeheader()
        
        # Process each project
        for idx, project_url in enumerate(urls_to_scrape, 1):
            full_url = f"https://www.planetminecraft.com{project_url}"
            
            print(f"\n[{idx}/{len(urls_to_scrape)}] {project_url}")
            
            try:
                # Load project page
                driver.get(full_url)
                time.sleep(project_delay)
                
                # Extract metadata
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                metadata = extract_project_metadata_complete(soup, full_url)
                
                if metadata and metadata.get('title'):
                    # Prepare CSV row
                    csv_row = {
                        'title': metadata.get('title', ''),
                        'url': full_url,
                        'author': metadata.get('author', ''),
                        'category': metadata.get('category', ''),
                        'subcategory': metadata.get('subcategory', ''),
                        'description': metadata.get('description', '')[:500],  # Truncate
                        'tags': ', '.join(metadata.get('tags', [])),
                        'posted_date': metadata.get('posted_date', ''),
                        'updated_date': metadata.get('updated_date', ''),
                        'views': metadata.get('views', ''),
                        'views_today': metadata.get('views_today', ''),
                        'downloads': metadata.get('downloads', ''),
                        'downloads_today': metadata.get('downloads_today', ''),
                        'diamonds': metadata.get('diamonds', ''),
                        'hearts': metadata.get('hearts', ''),
                        'download_links': ' | '.join([d['url'] for d in metadata.get('download_links', [])]),
                        'download_types': ' | '.join([d['type'] for d in metadata.get('download_links', [])]),
                        'file_types': ' | '.join([d.get('file_type', 'unknown') for d in metadata.get('download_links', [])])
                    }
                    
                    writer.writerow(csv_row)
                    csvfile.flush()  # Ensure data is written immediately
                    
                    progress['scraped_urls'].add(project_url)
                    progress['total_scraped'] += 1
                    
                    print(f"  ✓ {metadata.get('title')} - {metadata.get('subcategory', 'N/A')}")
                    print(f"    Stats: {metadata.get('views', 0)} views, {metadata.get('downloads', 0)} downloads")
                else:
                    print(f"  ⚠ No metadata extracted")
                    progress['failed_urls'][project_url] = "No metadata"
                    
            except Exception as e:
                print(f"  ✗ Error: {e}")
                progress['failed_urls'][project_url] = str(e)
            
            # Save progress every 10 projects
            if idx % 10 == 0:
                with open(progress_file, 'w') as f:
                    json.dump({
                        'scraped_urls': list(progress['scraped_urls']),
                        'failed_urls': progress['failed_urls'],
                        'total_collected': progress['total_collected'],
                        'total_scraped': progress['total_scraped'],
                        'start_time': progress['start_time'],
                        'last_update': datetime.now().isoformat(),
                        'filters': progress['filters']
                    }, f, indent=2)
                print(f"  📝 Progress saved ({progress['total_scraped']} scraped)")
    
    # Final progress save
    progress['end_time'] = datetime.now().isoformat()
    with open(progress_file, 'w') as f:
        json.dump({
            'scraped_urls': list(progress['scraped_urls']),
            'failed_urls': progress['failed_urls'],
            'total_collected': progress['total_collected'],
            'total_scraped': progress['total_scraped'],
            'start_time': progress['start_time'],
            'end_time': progress['end_time'],
            'filters': progress['filters']
        }, f, indent=2)
    
    print("\n" + "=" * 70)
    print("SCRAPING COMPLETE")
    print("=" * 70)
    print(f"Total URLs collected: {progress['total_collected']}")
    print(f"Successfully scraped: {progress['total_scraped']}")
    print(f"Failed: {len(progress['failed_urls'])}")
    print(f"Output saved to: {output_csv}")
    print(f"Progress saved to: {progress_file}")
    print("=" * 70)
    
    return progress

### Example Usage - Test with Small Sample

Let's test the scraper with a small sample first (3 pages, at most 10 projects, schematics only).

In [58]:
# Test scraper with small sample
# Scrape only 3 pages of schematic files

test_output_dir = Path('../data/planet_minecraft/')
test_output_dir.mkdir(exist_ok=True, parents=True)

test_stats = scrape_planet_minecraft(
    driver=driver,
    output_csv=str(test_output_dir / 'test_schematics.csv'),
    progress_file=str(test_output_dir / 'test_progress.json'),
    start_page=1,
    max_pages=3,  # Only 3 pages for testing
    max_projects=10,  # Max 10 projects for quick test
    share='schematic',  # Only schematic files
    order='order_latest',  # Get newest first
    page_delay=2.0,
    project_delay=1.5,
    resume=True
)

Loading progress from ../data/planet_minecraft/test_progress.json...
  Resuming: 0 already scraped
Loading existing data from ../data/planet_minecraft/test_schematics.csv...
  Found 0 existing projects

STEP 1: Collecting project URLs
Starting URL collection from page 1
Filters: share=schematic&order=order_latest

Page 1: https://www.planetminecraft.com/projects/?share=schematic&order=order_latest&p=1
  ✓ Found 25 projects (25 new)
  Total unique: 25

Page 2: https://www.planetminecraft.com/projects/?share=schematic&order=order_latest&p=2
  ✓ Found 25 projects (25 new)
  Total unique: 50

Page 3: https://www.planetminecraft.com/projects/?share=schematic&order=order_latest&p=3
  ✓ Found 25 projects (25 new)
  Total unique: 75

Reached max_pages limit (3)

Collection complete: 75 unique project URLs

URLs to scrape: 10 (out of 75 collected)

STEP 2: Extracting project metadata

[1/10] /project/piraeus-tower-heliopolis
  ✓ Piraeus tower Heliopolis Minecraft Map - land-structure
    Stats:

### Other Usage Examples

Here are examples for different scraping scenarios:

In [None]:
# Example 1: Scrape all land-structure category projects
# Uncomment to run:
"""
scrape_planet_minecraft(
    driver=driver,
    output_csv='../data/planet_minecraft/land_structures.csv',
    progress_file='../data/planet_minecraft/land_structures_progress.json',
    category='land-structure',
    order='order_popularity',  # Get most popular first
    page_delay=2.0,
    project_delay=1.5
)
"""

# Example 2: Scrape all schematic files (no category filter)
# Uncomment to run:
"""
scrape_planet_minecraft(
    driver=driver,
    output_csv='../data/planet_minecraft/all_schematics.csv',
    progress_file='../data/planet_minecraft/all_schematics_progress.json',
    share='schematic',
    order='order_latest',  # Get newest first
    max_pages=1000,  # Limit to 1000 pages
    page_delay=2.0,
    project_delay=1.5
)
"""

# Example 3: Scrape Java Edition projects only
# Uncomment to run:
"""
scrape_planet_minecraft(
    driver=driver,
    output_csv='../data/planet_minecraft/java_projects.csv',
    progress_file='../data/planet_minecraft/java_progress.json',
    platform=1,  # 1 = Java Edition
    max_pages=500,
    page_delay=2.0,
    project_delay=1.5
)
"""

# Example 4: Scrape challenge/adventure maps with schematics
# Uncomment to run:
"""
scrape_planet_minecraft(
    driver=driver,
    output_csv='../data/planet_minecraft/challenge_schematics.csv',
    progress_file='../data/planet_minecraft/challenge_progress.json',
    category='challenge-adventure',
    share='schematic',
    order='order_downloads',  # Get most downloaded first
    page_delay=2.0,
    project_delay=1.5
)
"""

# Example 5: Full scrape (no filters) - WARNING: This will take a VERY long time!
# Uncomment to run:
"""
scrape_planet_minecraft(
    driver=driver,
    output_csv='../data/planet_minecraft/all_projects.csv',
    progress_file='../data/planet_minecraft/all_progress.json',
    order='order_latest',
    max_pages=10000,  # Limit to prevent extremely long scraping
    page_delay=2.5,  # Slower to be more polite
    project_delay=2.0,
    resume=True
)
"""

print("Examples ready to use!")
print("\nAvailable filter options:")
print("  category: land-structure, challenge-adventure, 3d-art, etc.")
print("  share: 'schematic', 'seed', or None (all)")
print("  order: 'order_latest', 'order_popularity', 'order_downloads', 'order_views', 'order_hot'")
print("  platform: 1 (Java), 2 (Bedrock), None (any)")
print("\nTip: Start with small tests (max_pages=3-5) to verify everything works!")

## Step 24: Fix Download Detection and Switch to JSON

Let's investigate the missing downloads and improve the scraper to use JSON format.

In [65]:
# Let's investigate the problematic projects
test_urls = [
    "https://www.planetminecraft.com/project/halloween-dropper",
    "https://www.planetminecraft.com/project/fnaf-universe-bedrock"
]

for url in test_urls:
    print("=" * 70)
    print(f"Investigating: {url}")
    print("=" * 70)
    
    driver.get(url)
    time.sleep(2)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find all links with 'download' in text or href
    download_candidates = []
    
    for link in soup.find_all('a', href=True):
        href = link.get('href', '')
        text = link.get_text(strip=True).lower()
        
        # Check if it's download-related
        if 'download' in text or '/download/' in href.lower():
            download_candidates.append({
                'text': link.get_text(strip=True),
                'href': href,
                'class': link.get('class', [])
            })
    
    print(f"\nFound {len(download_candidates)} download candidates:")
    for idx, dl in enumerate(download_candidates, 1):
        print(f"\n  {idx}. Text: {dl['text']}")
        print(f"     Href: {dl['href']}")
        print(f"     Class: {dl['class']}")
    
    print("\n")

Investigating: https://www.planetminecraft.com/project/halloween-dropper

Found 2 download candidates:

  1. Text: Download Bedrock mcworld
     Href: /project/halloween-dropper/download/mcworld/
     Class: ['branded-download', 'tooltip', 'tipso_style']

  2. Text: Nuestra Señora de la Santísima Trinidad, ship of the line (Free Download)
     Href: /project/nuestra-se-ora-de-la-sant-sima-trinidad-ship-of-the-line-free-download/
     Class: ['r-title']


Investigating: https://www.planetminecraft.com/project/fnaf-universe-bedrock

Found 3 download candidates:

  1. Text: Download Bedrock mcworld
     Href: /project/fnaf-universe-bedrock/download/mcworld/
     Class: ['branded-download', 'tooltip', 'tipso_style']

  2. Text: JAVA
     Href: /project/fnaf-universe-bedrock/download/mirror/748193/
     Class: ['third-party-download', 'branded-download']

  3. Text: Nuestra Señora de la Santísima Trinidad, ship of the line (Free Download)
     Href: /project/nuestra-se-ora-de-la-sant-sima-t

In [66]:
# Updated metadata extractor with better download detection and JSON structure
def extract_project_metadata_v2(soup, project_url):
    """
    Enhanced extractor with complete download detection and JSON-friendly structure
    """
    metadata = {
        "url": project_url,
        "title": "N/A",
        "category": "N/A",
        "subcategory": "N/A",
        "posted_date": "N/A",
        "updated_date": "N/A",
        "tags": [],
        "description": "N/A",
        "author": "N/A",
        "stats": {
            "views": 0,
            "views_today": 0,
            "downloads": 0,
            "downloads_today": 0,
            "diamonds": 0,
            "hearts": 0
        },
        "downloads": [],  # Array of download objects
        "has_direct_download": False
    }
    
    # Step 1: Extract from JSON-LD
    json_ld_scripts = soup.find_all('script', type='application/ld+json')
    creative_work = None
    
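    for script in json_ld_scripts:
        try:
            json_data = json.loads(script.string)
        except (TypeError, ValueError):
            # Empty <script> tag (string is None) or invalid JSON
            continue
        # JSON-LD may also be a list; only a dict can be the CreativeWork record
        if isinstance(json_data, dict) and json_data.get('@type') == 'CreativeWork':
            creative_work = json_data
            break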
    
    if creative_work:
        metadata['title'] = creative_work.get('name', 'N/A')
        metadata['description'] = creative_work.get('description', 'N/A')
        metadata['posted_date'] = creative_work.get('datePublished', 'N/A')
        metadata['updated_date'] = creative_work.get('dateModified', 'N/A')
        metadata['category'] = creative_work.get('genre', 'N/A')
        
        # Extract tags from keywords
        keywords = creative_work.get('keywords', '')
        if keywords:
            metadata['tags'] = [tag.strip() for tag in keywords.split(',')]
            
            # Detect subcategory from the official list
            valid_subcategories = [
                '3d-art', 'air-structure', 'challenge-adventure', 'complex', 'educational',
                'enviroment-landscaping', 'land-structure', 'minecart', 'music',
                'nether-structure', 'piston', 'pixel-art', 'redstone-device',
                'underground-structure', 'water-structure', 'other'
            ]
            
            for tag in metadata['tags']:
                if tag in valid_subcategories:
                    metadata['subcategory'] = tag
                    break
        
        # Extract author
        author_data = creative_work.get('author', {})
        if isinstance(author_data, dict):
            metadata['author'] = author_data.get('name', 'N/A')
    
    # Step 2: Extract views and downloads stats
    resource_stats = soup.find('ul', class_='resource-statistics')
    if resource_stats:
        li_elements = resource_stats.find_all('li')
        
        for li in li_elements:
            text = li.get_text(strip=True)
            stat_values = [s.get_text(strip=True) for s in li.find_all('span', class_='stat')]
            
            if 'view' in text.lower() and len(stat_values) >= 2:
                try:
                    metadata['stats']['views'] = int(stat_values[0].replace(',', ''))
                    metadata['stats']['views_today'] = int(stat_values[1].replace(',', ''))
                except ValueError:
                    # Non-numeric stat text: keep the default of 0
                    pass
            elif 'download' in text.lower() and len(stat_values) >= 2:
                try:
                    metadata['stats']['downloads'] = int(stat_values[0].replace(',', ''))
                    metadata['stats']['downloads_today'] = int(stat_values[1].replace(',', ''))
                except ValueError:
                    # Non-numeric stat text: keep the default of 0
                    pass
    
    # Step 3: Extract diamonds and hearts
    diamond_elem = soup.find('span', class_='c-num-votes')
    if diamond_elem:
        try:
            metadata['stats']['diamonds'] = int(diamond_elem.get_text(strip=True).replace(',', ''))
        except ValueError:
            # Non-numeric vote count: keep the default of 0
            pass
    
    heart_elem = soup.find('span', class_='c-num-favs')
    if heart_elem:
        try:
            metadata['stats']['hearts'] = int(heart_elem.get_text(strip=True).replace(',', ''))
        except ValueError:
            # Non-numeric favorite count: keep the default of 0
            pass
    
    # Step 4: Enhanced download link detection
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        text = link.get_text(strip=True)
        classes = link.get('class', [])
        
        download_info = None
        
        # Direct PM downloads - Java schematic
        if '/download/schematic/' in href.lower():
            download_info = {
                'text': text or 'Download Schematic',
                'url': href,
                'type': 'direct_pm_schematic',
                'file_type': 'schematic',
                'platform': 'java'
            }
            metadata['has_direct_download'] = True
        
        # Direct PM downloads - Java world
        elif '/download/world/' in href.lower() or '/download/worldmap/' in href.lower():
            download_info = {
                'text': text or 'Download World',
                'url': href,
                'type': 'direct_pm_world',
                'file_type': 'world',
                'platform': 'java'
            }
            metadata['has_direct_download'] = True
        
        # Direct PM downloads - Bedrock mcworld (NEW!)
        elif '/download/mcworld/' in href.lower():
            download_info = {
                'text': text or 'Download Bedrock World',
                'url': href,
                'type': 'direct_pm_mcworld',
                'file_type': 'world',
                'platform': 'bedrock'
            }
            metadata['has_direct_download'] = True
        
        # PM redirect links to external hosts
        elif '/download/mirror/' in href.lower() or '/download/website/' in href.lower():
            # Try to detect platform from text
            platform = 'unknown'
            if 'java' in text.lower():
                platform = 'java'
            elif 'bedrock' in text.lower():
                platform = 'bedrock'
            
            download_info = {
                'text': text or 'External Download',
                'url': href,
                'type': 'pm_external_redirect',
                'file_type': 'unknown',
                'platform': platform
            }
        
        # Direct external download hosts
        elif any(host in href.lower() for host in ['mediafire.com', 'dropbox.com', 'drive.google.com', 'mega.nz', 'patreon.com']):
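            # Look the helper up in globals(): dir() inside a function lists only local names,
            # so the original check could never find a notebook-level function
            link_type = classify_download_link_v2(href, text) if 'classify_download_link_v2' in globals() else 'external'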
            download_info = {
                'text': text,
                'url': href,
                'type': link_type,
                'file_type': 'unknown',
                'platform': 'unknown'
            }
        
        # Add download if we found one and it's not a duplicate
        if download_info and not any(d['url'] == download_info['url'] for d in metadata['downloads']):
            metadata['downloads'].append(download_info)
    
    return metadata

In [67]:
# Test the improved extractor on the problematic projects
print("Testing improved extractor...")
print("=" * 70)

for url in test_urls:
    print(f"\n{url}")
    driver.get(url)
    time.sleep(2)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    metadata = extract_project_metadata_v2(soup, url)
    
    print(f"  Title: {metadata['title']}")
    print(f"  Downloads found: {len(metadata['downloads'])}")
    
    for idx, dl in enumerate(metadata['downloads'], 1):
        print(f"    {idx}. [{dl['platform']}] {dl['type']}: {dl['url']}")
        print(f"       Text: {dl['text']}")
    
    print()

Testing improved extractor...

https://www.planetminecraft.com/project/halloween-dropper
  Title: Halloween Dropper Minecraft Map
  Downloads found: 1
    1. [bedrock] direct_pm_mcworld: /project/halloween-dropper/download/mcworld/
       Text: Download Bedrock mcworld


https://www.planetminecraft.com/project/fnaf-universe-bedrock
  Title: FNAF: UNIVERSE - BEDROCK Minecraft Map
  Downloads found: 2
    1. [bedrock] direct_pm_mcworld: /project/fnaf-universe-bedrock/download/mcworld/
       Text: Download Bedrock mcworld
    2. [java] pm_external_redirect: /project/fnaf-universe-bedrock/download/mirror/748193/
       Text: JAVA



In [68]:
# JSON-based scraper (better for variable-length arrays and nested data)
def scrape_planet_minecraft_json(driver, 
                                 output_json='planet_minecraft_projects.json',
                                 progress_file='scraping_progress.json',
                                 start_page=1, 
                                 max_pages=None,
                                 max_projects=None,
                                 category=None, 
                                 share=None, 
                                 order=None,
                                 platform=None,
                                 page_delay=2.0,
                                 project_delay=1.0,
                                 resume=True):
    """
    Complete scraper for Planet Minecraft - JSON output version.
    
    Same parameters as the CSV version, except that output_json replaces output_csv.
    Writes nested JSON (tags and downloads as arrays) instead of a flat CSV.
    """
    
    # Initialize progress tracking
    progress = {
        'scraped_urls': set(),
        'failed_urls': {},
        'total_collected': 0,
        'total_scraped': 0,
        'start_time': datetime.now().isoformat(),
        'filters': {
            'category': category,
            'share': share,
            'order': order,
            'platform': platform
        }
    }
    
    # Load existing progress if resuming
    if resume and Path(progress_file).exists():
        print(f"Loading progress from {progress_file}...")
        with open(progress_file, 'r') as f:
            saved_progress = json.load(f)
            progress['scraped_urls'] = set(saved_progress.get('scraped_urls', []))
            progress['failed_urls'] = saved_progress.get('failed_urls', {})
            print(f"  Resuming: {len(progress['scraped_urls'])} already scraped")
    
    # Load existing data if resuming
    existing_projects = []
    if Path(output_json).exists() and resume:
        print(f"Loading existing data from {output_json}...")
        with open(output_json, 'r', encoding='utf-8') as f:
            data = json.load(f)
            existing_projects = data.get('projects', [])
            for project in existing_projects:
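                # Stored URLs are full URLs, collected URLs are site-relative paths;
                # strip the site prefix so the "already scraped" filter below can match
                progress['scraped_urls'].add(
                    project['url'].replace('https://www.planetminecraft.com', '', 1)
                )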
        print(f"  Found {len(existing_projects)} existing projects")
    
    # Step 1: Collect project URLs
    print("\n" + "=" * 70)
    print("STEP 1: Collecting project URLs")
    print("=" * 70)
    
    project_urls = collect_project_urls(
        driver=driver,
        start_page=start_page,
        max_pages=max_pages,
        category=category,
        share=share,
        order=order,
        platform=platform,
        delay=page_delay
    )
    
    progress['total_collected'] = len(project_urls)
    
    # Filter out already scraped URLs
    urls_to_scrape = [url for url in project_urls if url not in progress['scraped_urls']]
    
    if max_projects:
        urls_to_scrape = urls_to_scrape[:max_projects]
    
    print(f"\nURLs to scrape: {len(urls_to_scrape)} (out of {len(project_urls)} collected)")
    
    if len(urls_to_scrape) == 0:
        print("No new URLs to scrape!")
        return progress
    
    # Step 2: Extract metadata from each project
    print("\n" + "=" * 70)
    print("STEP 2: Extracting project metadata")
    print("=" * 70)
    
    newly_scraped = []
    
    for idx, project_url in enumerate(urls_to_scrape, 1):
        full_url = f"https://www.planetminecraft.com{project_url}"
        
        print(f"\n[{idx}/{len(urls_to_scrape)}] {project_url}")
        
        try:
            # Load project page
            driver.get(full_url)
            time.sleep(project_delay)
            
            # Extract metadata
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            metadata = extract_project_metadata_v2(soup, full_url)
            
            if metadata and metadata.get('title') != 'N/A':
                newly_scraped.append(metadata)
                
                progress['scraped_urls'].add(project_url)
                progress['total_scraped'] += 1
                
                print(f"  ✓ {metadata.get('title')} - {metadata.get('subcategory', 'N/A')}")
                print(f"    Stats: {metadata['stats']['views']} views, {metadata['stats']['downloads']} downloads")
                print(f"    Downloads: {len(metadata['downloads'])} link(s)")
            else:
                print(f"  ⚠ No metadata extracted")
                progress['failed_urls'][project_url] = "No metadata"
                
        except Exception as e:
            print(f"  ✗ Error: {e}")
            progress['failed_urls'][project_url] = str(e)
        
        # Save progress every 10 projects
        if idx % 10 == 0:
            # Combine existing + newly scraped
            all_projects = existing_projects + newly_scraped
            
            # Save JSON
            output_data = {
                'metadata': {
                    'scraped_at': datetime.now().isoformat(),
                    'total_projects': len(all_projects),
                    'filters': progress['filters']
                },
                'projects': all_projects
            }
            
            with open(output_json, 'w', encoding='utf-8') as f:
                json.dump(output_data, f, indent=2, ensure_ascii=False)
            
            # Save progress
            with open(progress_file, 'w') as f:
                json.dump({
                    'scraped_urls': list(progress['scraped_urls']),
                    'failed_urls': progress['failed_urls'],
                    'total_collected': progress['total_collected'],
                    'total_scraped': progress['total_scraped'],
                    'start_time': progress['start_time'],
                    'last_update': datetime.now().isoformat(),
                    'filters': progress['filters']
                }, f, indent=2)
            
            print(f"  📝 Progress saved ({len(all_projects)} total projects)")
    
    # Final save
    all_projects = existing_projects + newly_scraped
    
    output_data = {
        'metadata': {
            'scraped_at': datetime.now().isoformat(),
            'total_projects': len(all_projects),
            'filters': progress['filters'],
            'scrape_stats': {
                'total_collected': progress['total_collected'],
                'successfully_scraped': progress['total_scraped'],
                'failed': len(progress['failed_urls'])
            }
        },
        'projects': all_projects
    }
    
    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    
    # Final progress save
    progress['end_time'] = datetime.now().isoformat()
    with open(progress_file, 'w') as f:
        json.dump({
            'scraped_urls': list(progress['scraped_urls']),
            'failed_urls': progress['failed_urls'],
            'total_collected': progress['total_collected'],
            'total_scraped': progress['total_scraped'],
            'start_time': progress['start_time'],
            'end_time': progress['end_time'],
            'filters': progress['filters']
        }, f, indent=2)
    
    print("\n" + "=" * 70)
    print("SCRAPING COMPLETE")
    print("=" * 70)
    print(f"Total URLs collected: {progress['total_collected']}")
    print(f"Successfully scraped: {progress['total_scraped']}")
    print(f"Failed: {len(progress['failed_urls'])}")
    print(f"Output saved to: {output_json}")
    print(f"Progress saved to: {progress_file}")
    print("=" * 70)
    
    return progress

### Test JSON Scraper

Now let's test with the same parameters but JSON output:

In [69]:
# Test JSON scraper with small sample
test_stats_json = scrape_planet_minecraft_json(
    driver=driver,
    output_json=str(test_output_dir / 'test_schematics.json'),
    progress_file=str(test_output_dir / 'test_progress_json.json'),
    start_page=1,
    max_pages=2,  # Only 2 pages
    max_projects=5,  # Only 5 projects for quick test
    share='schematic',
    order='order_latest',
    page_delay=2.0,
    project_delay=1.5,
    resume=False  # Fresh start for JSON
)


STEP 1: Collecting project URLs
Starting URL collection from page 1
Filters: share=schematic&order=order_latest

Page 1: https://www.planetminecraft.com/projects/?share=schematic&order=order_latest&p=1
  ✓ Found 25 projects (25 new)
  Total unique: 25

Page 2: https://www.planetminecraft.com/projects/?share=schematic&order=order_latest&p=2
  ✓ Found 25 projects (25 new)
  Total unique: 50

Reached max_pages limit (2)

Collection complete: 50 unique project URLs

URLs to scrape: 5 (out of 50 collected)

STEP 2: Extracting project metadata

[1/5] /project/toulouse-house
  ✓ Toulouse house Minecraft Map - other
    Stats: 87 views, 12 downloads
    Downloads: 1 link(s)

[2/5] /project/hamburg-building-6-by-tim0fei
  ✓ Hamburg Building 6 by tim0fei Minecraft Map - other
    Stats: 55 views, 7 downloads
    Downloads: 1 link(s)

[3/5] /project/redstonefy
  ✓ RedstoneFy Minecraft Map - redstone-device
    Stats: 43 views, 1 downloads
    Downloads: 1 link(s)

[4/5] /project/estate-6745324
 

In [70]:
# Load and display the JSON structure
json_file = test_output_dir / 'test_schematics.json'

with open(json_file, 'r', encoding='utf-8') as f:
    data = json.load(f)

print("=" * 70)
print("JSON Structure:")
print("=" * 70)
print(f"\nMetadata:")
print(f"  Scraped at: {data['metadata']['scraped_at']}")
print(f"  Total projects: {data['metadata']['total_projects']}")
print(f"  Filters: {data['metadata']['filters']}")

print(f"\n\nFirst project (full structure):")
print("=" * 70)
first_project = data['projects'][0]
print(json.dumps(first_project, indent=2))

print("\n\n" + "=" * 70)
print("Download links summary:")
print("=" * 70)
for project in data['projects']:
    print(f"\n{project['title']}")
    print(f"  Subcategory: {project['subcategory']}")
    print(f"  Downloads: {len(project['downloads'])} link(s)")
    for dl in project['downloads']:
        print(f"    - [{dl['platform']}] {dl['type']}: {dl['url']}")
    print(f"  Tags: {', '.join(project['tags'][:5])}...")  # First 5 tags

JSON Structure:

Metadata:
  Scraped at: 2025-10-10T13:48:55.113189
  Total projects: 5
  Filters: {'category': None, 'share': 'schematic', 'order': 'order_latest', 'platform': None}


First project (full structure):
{
  "url": "https://www.planetminecraft.com/project/toulouse-house",
  "title": "Toulouse house Minecraft Map",
  "category": "Map",
  "subcategory": "other",
  "posted_date": "2025-10-09T00:00:00-04:00",
  "updated_date": "2025-10-09T14:59:47-04:00",
  "tags": [
    "city",
    "build",
    "house",
    "french",
    "red",
    "street",
    "toulouse",
    "other"
  ],
  "description": "Here are some typical houses in the city of Toulouse, France. Nicknamed the pink city, Toulouse is the third largest city in France and the city where...",
  "author": "Mineraf7",
  "stats": {
    "views": 87,
    "views_today": 87,
    "downloads": 12,
    "downloads_today": 12,
    "diamonds": 4,
    "hearts": 2
  },
  "downloads": [
    {
      "text": "Download Minecraft Map",
      "

## Summary: JSON vs CSV Format

### ✅ JSON Format Advantages (RECOMMENDED)

**Better data structure:**
- Native arrays for tags (no comma escaping issues)
- Native arrays for downloads (each with platform, type, file_type)
- Nested stats object keeps related data together
- No need for pipe separators or string concatenation

**Better for variable-length data:**
- Projects can have 1-20+ tags without formatting issues
- Projects can have 1-10+ download links cleanly stored
- Easy to add new fields without breaking parsing

**Better for programmatic use:**
- Direct parsing in Python/JavaScript/etc
- Easy filtering (e.g., `downloads.platform == 'java'`)
- Easy aggregation and analysis
- Can be loaded into pandas, e.g. `pd.json_normalize(data['projects'])`, since the records sit under the top-level `projects` key
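
The filtering and aggregation mentioned above can be sketched with the standard library alone. This is a minimal illustration, assuming the `projects` / `downloads` layout written by the JSON scraper; the `sample` data and the helper names `projects_with_platform` and `count_download_types` are made up for the example.

```python
from collections import Counter

# Hypothetical sample mirroring the scraper's JSON layout (not real scraped data)
sample = {
    "projects": [
        {"title": "A", "downloads": [{"platform": "java", "type": "direct_pm_schematic"}]},
        {"title": "B", "downloads": [
            {"platform": "bedrock", "type": "direct_pm_mcworld"},
            {"platform": "java", "type": "pm_external_redirect"},
        ]},
        {"title": "C", "downloads": []},
    ]
}

def projects_with_platform(projects, platform):
    """Return projects that have at least one download for the given platform."""
    return [
        p for p in projects
        if any(d.get("platform") == platform for d in p.get("downloads", []))
    ]

def count_download_types(projects):
    """Count how often each download type occurs across all projects."""
    return Counter(
        d.get("type", "unknown")
        for p in projects
        for d in p.get("downloads", [])
    )

java_projects = projects_with_platform(sample["projects"], "java")
type_counts = count_download_types(sample["projects"])

print([p["title"] for p in java_projects])
print(dict(type_counts))
```

With real output, `sample` would be replaced by the dictionary returned from `json.load()` on the scraper's output file.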

**Example JSON structure:**
```json
{
  "title": "Project Name",
  "tags": ["land-structure", "modern", "city"],
  "downloads": [
    {
      "url": "/download/schematic/",
      "type": "direct_pm_schematic",
      "platform": "java",
      "file_type": "schematic",
      "text": "Download Schematic"
    },
    {
      "url": "/download/mcworld/",
      "type": "direct_pm_mcworld",
      "platform": "bedrock",
      "file_type": "world",
      "text": "Download Bedrock mcworld"
    }
  ],
  "stats": {
    "views": 1234,
    "downloads": 567
  }
}
```
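
The platform filter mentioned above can be sketched in plain Python. This is a minimal illustration, assuming the output file wraps projects in a top-level `projects` list (as the verification file later in this notebook does). The helper name `downloads_for_platform` is hypothetical and not part of the scraper.

```python
import json

# Hypothetical sample mirroring the example structure above
sample = json.loads("""
{
  "projects": [
    {
      "title": "Project Name",
      "downloads": [
        {"url": "/download/schematic/", "type": "direct_pm_schematic",
         "platform": "java", "file_type": "schematic"},
        {"url": "/download/mcworld/", "type": "direct_pm_mcworld",
         "platform": "bedrock", "file_type": "world"}
      ]
    }
  ]
}
""")


def downloads_for_platform(projects, platform):
    """Return (title, url) pairs for every download tagged with `platform`."""
    return [
        (project["title"], dl["url"])
        for project in projects
        for dl in project.get("downloads", [])
        if dl.get("platform") == platform
    ]


java_links = downloads_for_platform(sample["projects"], "java")
bedrock_links = downloads_for_platform(sample["projects"], "bedrock")
print(java_links)
print(bedrock_links)
```

Because each download is its own object, no string splitting is needed to answer "which links are Java?".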

### CSV Format (Still available)

**When to use:**
- Need to open in Excel/Google Sheets
- Simpler analysis tools
- Flat structure preferred

**Limitations:**
- Uses `|` separators for multi-value fields
- All downloads concatenated into strings
- Harder to filter by platform or download type
- Tags may need truncating or escaping when they contain the separator character
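
To make the limitation concrete, here is a rough sketch of what a consumer has to do to rebuild per-download records from the CSV columns. It assumes the `download_links`, `download_types`, and `file_types` cells are joined with `' | '` and are aligned position by position. That alignment is an assumption, and the helper names are hypothetical.

```python
def split_multi(value, sep=" | "):
    """Split a multi-value CSV cell into a list; empty or NaN cells give []."""
    # pandas represents empty cells as float NaN, and NaN != NaN
    if value is None or (isinstance(value, float) and value != value):
        return []
    return [part.strip() for part in str(value).split(sep) if part.strip()]


def row_to_downloads(row):
    """Rebuild download dicts from the three pipe-separated columns."""
    links = split_multi(row.get("download_links"))
    types = split_multi(row.get("download_types"))
    file_types = split_multi(row.get("file_types"))
    # zip() silently drops extras if the columns are not the same length
    return [
        {"url": url, "type": dl_type, "file_type": file_type}
        for url, dl_type, file_type in zip(links, types, file_types)
    ]


# Hypothetical row with two downloads
example_row = {
    "download_links": "/project/x/download/mcworld/ | /project/x/download/mirror/1/",
    "download_types": "direct_pm_mcworld | pm_external_redirect",
    "file_types": "world | unknown",
}
rebuilt = row_to_downloads(example_row)
print(rebuilt)
```

Any link or tag that itself contains the separator would break this round trip, which is the main argument for the JSON layout.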

### Recommendation

**Use JSON format for:**
- Machine learning / data science workflows
- Building download scripts
- Filtering by platform (Java vs Bedrock)
- Projects with multiple downloads
- Long-term data collection

**Use CSV format for:**
- Quick Excel analysis
- Simple spreadsheet workflows
- When JSON parsing is not available
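
If a flat table is still wanted for analysis, the JSON output can be flattened on demand instead of maintaining a separate CSV. The sketch below uses `pd.json_normalize` on a small hypothetical payload shaped like the example structure above; the `metadata`/`projects` wrapper is assumed.

```python
import pandas as pd

# Hypothetical payload shaped like the scraper's JSON output
data = {
    "metadata": {"total_projects": 1},
    "projects": [
        {
            "title": "Project Name",
            "tags": ["land-structure", "modern", "city"],
            "stats": {"views": 1234, "downloads": 567},
            "downloads": [
                {"url": "/download/schematic/", "type": "direct_pm_schematic", "platform": "java"},
                {"url": "/download/mcworld/", "type": "direct_pm_mcworld", "platform": "bedrock"},
            ],
        }
    ],
}

# One row per project; nested stats become "stats.views", "stats.downloads"
projects_df = pd.json_normalize(data["projects"])

# One row per download, carrying the parent project's title along
downloads_df = pd.json_normalize(data["projects"], record_path="downloads", meta=["title"])

print(projects_df[["title", "stats.views", "stats.downloads"]])
print(downloads_df[["title", "platform", "url"]])
```

The second frame gives one row per download, which makes platform filtering a single boolean mask.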

### Verify Fix: Test the problematic projects

Let's verify the Halloween Dropper and FNAF Universe projects now have all downloads:

In [71]:
# Create a mini scrape of just the two problematic projects
verification_urls = [
    '/project/halloween-dropper',
    '/project/fnaf-universe-bedrock'
]

print("=" * 70)
print("Verification Test: Problematic Projects")
print("=" * 70)

verification_projects = []

for url in verification_urls:
    full_url = f"https://www.planetminecraft.com{url}"
    print(f"\nScraping: {url}")
    
    driver.get(full_url)
    time.sleep(2)
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    metadata = extract_project_metadata_v2(soup, full_url)
    
    verification_projects.append(metadata)
    
    print(f"  ✓ {metadata['title']}")
    print(f"  Downloads found: {len(metadata['downloads'])}")
    for dl in metadata['downloads']:
        print(f"    - [{dl['platform']}] {dl['type']}")
        print(f"      URL: {dl['url']}")
        print(f"      Text: {dl['text']}")

# Save verification results
verification_output = test_output_dir / 'verification_test.json'
with open(verification_output, 'w', encoding='utf-8') as f:
    json.dump({
        'metadata': {
            'test_purpose': 'Verify multi-download detection',
            'tested_at': datetime.now().isoformat()
        },
        'projects': verification_projects
    }, f, indent=2, ensure_ascii=False)

print(f"\n✓ Verification results saved to: {verification_output}")

# Check download counts: Halloween Dropper should have at least 1, FNAF Universe at least 2
print("\n" + "=" * 70)
print("VERIFICATION RESULT:")
print("=" * 70)
halloween = verification_projects[0]
fnaf = verification_projects[1]

print(f"\n✓ Halloween Dropper: {len(halloween['downloads'])} download(s)")
if len(halloween['downloads']) >= 1:
    print("  ✅ PASS: Has Bedrock mcworld download")
else:
    print("  ❌ FAIL: Missing downloads")

print(f"\n✓ FNAF Universe: {len(fnaf['downloads'])} download(s)")
if len(fnaf['downloads']) >= 2:
    print("  ✅ PASS: Has both Bedrock mcworld and Java external link")
else:
    print("  ❌ FAIL: Missing downloads")

Verification Test: Problematic Projects

Scraping: /project/halloween-dropper
  ✓ Halloween Dropper Minecraft Map
  Downloads found: 1
    - [bedrock] direct_pm_mcworld
      URL: /project/halloween-dropper/download/mcworld/
      Text: Download Bedrock mcworld

Scraping: /project/fnaf-universe-bedrock
  ✓ FNAF: UNIVERSE - BEDROCK Minecraft Map
  Downloads found: 2
    - [bedrock] direct_pm_mcworld
      URL: /project/fnaf-universe-bedrock/download/mcworld/
      Text: Download Bedrock mcworld
    - [java] pm_external_redirect
      URL: /project/fnaf-universe-bedrock/download/mirror/748193/
      Text: JAVA

✓ Verification results saved to: ../data/planet_minecraft/verification_test.json

VERIFICATION RESULT:

✓ Halloween Dropper: 1 download(s)
  ✅ PASS: Has Bedrock mcworld download

✓ FNAF Universe: 2 download(s)
  ✅ PASS: Has both Bedrock mcworld and Java external link
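
The check above only counts downloads. A slightly stricter variant would compare the platforms that were actually found against the ones expected. The sketch below uses the download entries printed in the verification output; the helper name `missing_platforms` is hypothetical.

```python
def missing_platforms(project, required):
    """Return the required platforms that no download in `project` is tagged with."""
    found = {dl.get("platform") for dl in project.get("downloads", [])}
    return sorted(set(required) - found)


# Download entries as reported in the verification output above
halloween_example = {
    "title": "Halloween Dropper Minecraft Map",
    "downloads": [{"platform": "bedrock", "type": "direct_pm_mcworld"}],
}
fnaf_example = {
    "title": "FNAF: UNIVERSE - BEDROCK Minecraft Map",
    "downloads": [
        {"platform": "bedrock", "type": "direct_pm_mcworld"},
        {"platform": "java", "type": "pm_external_redirect"},
    ],
}

print(missing_platforms(halloween_example, ["bedrock"]))
print(missing_platforms(fnaf_example, ["bedrock", "java"]))
print(missing_platforms(halloween_example, ["bedrock", "java"]))
```

An empty list means every expected platform was detected, so a regression that finds two Bedrock links but no Java link would no longer pass.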


## ✅ Complete Planet Minecraft Scraper - READY TO USE

### What's Fixed:
1. ✅ **Download detection improved** - Now detects:
   - `/download/schematic/` (Java schematic files)
   - `/download/world/` or `/download/worldmap/` (Java world files)
   - `/download/mcworld/` (Bedrock world files) **← NEW!**
   - `/download/mirror/` or `/download/website/` (External redirects)
   - Direct external hosts (MediaFire, Dropbox, etc.)

2. ✅ **Multiple downloads per project** - Correctly captures projects with multiple download options (Java + Bedrock, schematic + world, etc.)

3. ✅ **Platform detection** - Each download tagged with platform: `java`, `bedrock`, or `unknown`

4. ✅ **JSON format** - Clean nested structure for variable-length data (tags, downloads)

5. ✅ **Progress tracking** - Resume capability if scraping is interrupted
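
As a rough illustration of the detection rules listed above, the sketch below maps a link to a type and platform by URL pattern. It is not the notebook's `extract_project_metadata_v2` logic. The function names, the `external_host` label, the treatment of `/download/worldmap/` as `direct_pm_world`, and the use of link text to guess the platform of redirects are all assumptions.

```python
import re
from urllib.parse import urlparse

# (pattern, type, platform); platform None means "guess from the link text"
PM_PATTERNS = [
    (re.compile(r"/download/schematic/"), "direct_pm_schematic", "java"),
    (re.compile(r"/download/mcworld/"), "direct_pm_mcworld", "bedrock"),
    (re.compile(r"/download/world(map)?/"), "direct_pm_world", "java"),
    (re.compile(r"/download/(mirror|website)/"), "pm_external_redirect", None),
]
EXTERNAL_HOSTS = ("mediafire.com", "dropbox.com")


def platform_from_text(text):
    """Guess the platform from the link text; fall back to 'unknown'."""
    lowered = (text or "").lower()
    if "bedrock" in lowered or "mcworld" in lowered:
        return "bedrock"
    if "java" in lowered:
        return "java"
    return "unknown"


def classify_download(url, text=""):
    """Return {'type': ..., 'platform': ...} for a download link."""
    for pattern, dl_type, platform in PM_PATTERNS:
        if pattern.search(url):
            return {"type": dl_type, "platform": platform or platform_from_text(text)}
    host = urlparse(url).netloc.lower()
    if any(host == h or host.endswith("." + h) for h in EXTERNAL_HOSTS):
        return {"type": "external_host", "platform": platform_from_text(text)}
    return {"type": "unknown", "platform": "unknown"}


print(classify_download("/project/fnaf-universe-bedrock/download/mcworld/",
                        "Download Bedrock mcworld"))
print(classify_download("/project/fnaf-universe-bedrock/download/mirror/748193/", "JAVA"))
```

Keeping the rules in one ordered table makes it easy to add a new pattern, as was needed for `/download/mcworld/`.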

### Available Functions:

**For JSON output (RECOMMENDED):**
```python
scrape_planet_minecraft_json(
    driver=driver,
    output_json='planet_minecraft_projects.json',
    start_page=1,
    max_pages=100,  # Or None for unlimited
    category='land-structure',  # Optional filter
    share='schematic',  # Optional: 'schematic', 'seed', None
    order='order_latest',  # Optional sort
    platform=1,  # Optional: 1=Java, 2=Bedrock
    page_delay=2.0,
    project_delay=1.5,
    resume=True
)
```

**For CSV output:**
```python
scrape_planet_minecraft(  # Original CSV version
    driver=driver,
    output_csv='planet_minecraft_projects.csv',
    # ... same parameters
)
```

### Ready for Production!

The scraper is now ready for full-scale scraping. You can:
- Scrape all matching projects by passing `max_pages=None` (expect a very long run for an unfiltered crawl)
- Filter by category, platform, or file type
- Resume if interrupted
- Get clean, structured data with all download links and metadata
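
The resume behaviour can be pictured with a small progress file. The sketch below is one possible approach, not necessarily how `scrape_planet_minecraft_json` implements `resume=True`. The file layout (`last_page`, `seen_urls`) and all function names are assumptions, and `max_pages` is taken to be an integer here.

```python
import json
import tempfile
from pathlib import Path


def load_progress(path):
    """Load saved progress, or return a fresh state if no file exists yet."""
    path = Path(path)
    if path.exists():
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"last_page": 0, "seen_urls": []}


def save_progress(path, progress):
    """Write progress to a temp file, then rename so a crash cannot leave a half-written file."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(progress, f)
    tmp.replace(path)


def pages_to_scrape(progress, start_page, max_pages):
    """Pages still to visit: skip everything up to the last completed page."""
    first = max(start_page, progress["last_page"] + 1)
    last = start_page + max_pages - 1
    return range(first, last + 1)


# Demo: pretend pages 1-3 finished before an interruption
progress_file = Path(tempfile.mkdtemp()) / "progress.json"
fresh_state = load_progress(progress_file)
save_progress(progress_file, {"last_page": 3, "seen_urls": ["/project/halloween-dropper"]})
demo_pages = list(pages_to_scrape(load_progress(progress_file), start_page=1, max_pages=5))
print(demo_pages)  # [4, 5]
```

Saving after each listing page rather than after each project keeps the file small while bounding the work lost on interruption to a single page.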

### Loading and Analyzing Results

In [61]:
# Load and analyze scraped data
import pandas as pd

# Example: Load the test results
csv_file = test_output_dir / 'test_schematics.csv'

if csv_file.exists():
    df = pd.read_csv(csv_file)
    
    print("=" * 70)
    print(f"Loaded {len(df)} projects from {csv_file}")
    print("=" * 70)
    
    print("\nDataset Overview:")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Date range: {df['posted_date'].min()} to {df['posted_date'].max()}")
    
    print("\nSubcategory Distribution:")
    print(df['subcategory'].value_counts())
    
    print("\nTop 10 by Views:")
    top_views = df.nlargest(10, 'views')[['title', 'author', 'views', 'downloads', 'subcategory']]
    print(top_views.to_string(index=False))
    
    print("\nTop 10 by Downloads:")
    top_downloads = df.nlargest(10, 'downloads')[['title', 'author', 'views', 'downloads', 'subcategory']]
    print(top_downloads.to_string(index=False))
    
    print("\nFile Type Distribution:")
    # Count file types from the file_types column
    file_type_counts = {}
    for types_str in df['file_types'].dropna():
        for file_type in types_str.split(' | '):
            file_type_counts[file_type] = file_type_counts.get(file_type, 0) + 1
    
    for file_type, count in sorted(file_type_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"  {file_type}: {count}")
    
else:
    print(f"CSV file not found: {csv_file}")
    print("Run the test scraper first!")

Loaded 10 projects from ../data/planet_minecraft/test_schematics.csv

Dataset Overview:
  Columns: ['title', 'url', 'author', 'category', 'subcategory', 'description', 'tags', 'posted_date', 'updated_date', 'views', 'views_today', 'downloads', 'downloads_today', 'diamonds', 'hearts', 'download_links', 'download_types', 'file_types']
  Date range: 2025-10-06T00:00:00-04:00 to 2025-10-09T00:00:00-04:00

Subcategory Distribution:
subcategory
other                    7
land-structure           1
underground-structure    1
redstone-device          1
Name: count, dtype: int64

Top 10 by Views:
                                                          title             author  views  downloads           subcategory
                         Piraeus tower Heliopolis Minecraft Map     HeliopolisCity    237         75        land-structure
                        The Emperor&#039;s castle Minecraft Map Dimitrius von Brit    135         36                 other
    Water tower | 1.21.1 | litematic
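
The title `The Emperor&#039;s castle` in the output above still contains an HTML entity, which suggests titles are stored without unescaping. A minimal cleanup sketch using the standard library follows. Where exactly to apply it inside the extraction function is left open, and `clean_title` is a hypothetical helper.

```python
import html


def clean_title(raw_title):
    """Decode HTML entities such as &#039; and trim surrounding whitespace."""
    return html.unescape(raw_title).strip()


cleaned = clean_title("The Emperor&#039;s castle Minecraft Map")
print(cleaned)  # The Emperor's castle Minecraft Map
```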

In [60]:
%pip install pandas

Collecting pandas
  Downloading pandas-2.3.3-cp311-cp311-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.3-cp311-cp311-macosx_11_0_arm64.whl (10.8 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.3.3 pytz-2025.2 tzdata-2025.2


In [62]:
df.head()

Unnamed: 0,title,url,author,category,subcategory,description,tags,posted_date,updated_date,views,views_today,downloads,downloads_today,diamonds,hearts,download_links,download_types,file_types
0,Piraeus tower Heliopolis Minecraft Map,https://www.planetminecraft.com/project/piraeu...,HeliopolisCity,Map,land-structure,Piraeus tower modern office skyscraper Origina...,"city, land-structure, architecture, modern, sk...",2025-10-07T00:00:00-04:00,2025-10-07T10:44:05-04:00,237,71,75,22,6,4,/project/piraeus-tower-heliopolis/download/wor...,direct_pm_world,world
1,Water tower | 1.21.1 | litematica | Grotiva | ...,https://www.planetminecraft.com/project/water-...,outsider-Grotiva,Map,other,"Hello everyone, today I present to you a water...","build, building, download, buildings, buitiful...",2025-10-06T00:00:00-04:00,2025-10-06T16:23:58-04:00,133,37,18,8,1,1,/project/water-tower-1-21-1-litematica-grotiva...,direct_pm_world,world
2,Hamburg Building 6 by tim0fei Minecraft Map,https://www.planetminecraft.com/project/hambur...,tim0fei,Map,other,Hamburg Building 6 by tim0fei. Subscribe and l...,"city, minecraft, building, architecture, moder...",2025-10-08T00:00:00-04:00,2025-10-08T18:02:23-04:00,38,38,7,7,4,0,/project/hamburg-building-6-by-tim0fei/downloa...,direct_pm_schematic,schematic
3,Fractured Femur- Mining defence training world...,https://www.planetminecraft.com/project/fractu...,Laborleiter42,Map,underground-structure,You might get any item you think you are able ...,underground-structure,2025-10-08T00:00:00-04:00,2025-10-08T01:28:30-04:00,45,13,2,0,1,0,/project/fractured-femur-mining-defence-traini...,direct_pm_world,world
4,Halloween Dropper Minecraft Map,https://www.planetminecraft.com/project/hallow...,CristionWolf,Map,other,Happy Halloween 4 levels theme for halloween y...,"challenge, puzzle, dropper, halloween, other",2025-10-07T00:00:00-04:00,2025-10-07T05:09:11-04:00,99,26,25,8,2,1,,,


In [64]:
df["Halloween Dropper Minecraft Map"]

KeyError: 'Halloween Dropper Minecraft Map'
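
The `KeyError` occurs because `df[...]` with a string looks up a column name, and the project title is a value in the `title` column, not a column. Selecting the row needs a boolean mask instead. The sketch below uses a small stand-in frame (`df_example`, built from two rows shown in the `df.head()` output) so it runs on its own.

```python
import pandas as pd

# Stand-in for the scraped DataFrame, using values from df.head() above
df_example = pd.DataFrame({
    "title": [
        "Halloween Dropper Minecraft Map",
        "Hamburg Building 6 by tim0fei Minecraft Map",
    ],
    "author": ["CristionWolf", "tim0fei"],
    "downloads": [25, 7],
})

# Exact match on the title column
row = df_example[df_example["title"] == "Halloween Dropper Minecraft Map"]

# Case-insensitive substring match; na=False treats missing titles as non-matches
matches = df_example[df_example["title"].str.contains("halloween", case=False, na=False)]

print(row)
print(matches["title"].tolist())
```

On the real data the same pattern is `df[df["title"] == "Halloween Dropper Minecraft Map"]`.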