# Internet Archive Science Fiction Downloader

Source: https://github.com/jjjake/internetarchive

This notebook downloads major science fiction collections from the Internet Archive:
1. **ultimate-pgsf-txt** - 1,900+ Project Gutenberg SF texts
2. **Pulp Magazine Archive** - Thousands of pulp magazine issues (Amazing Stories, Weird Tales, Galaxy, etc.)
3. **sciencefiction collection** - Individual SF books and texts (thousands available)

Features:
- Downloads the selected file formats from each collection
- Creates a metadata.csv for each dataset
- Uses checksums to skip already-downloaded files
- Extracts: title, author, ia_identifier, language, subject, url, local_path, format, size, md5
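
The checksum-based skip can be pictured with a few lines of standard-library Python. This is a minimal sketch of the idea, not the `internetarchive` library's own code; `md5_of_file` and `needs_download` are hypothetical helper names introduced here for illustration.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """True if the file is missing or its MD5 differs from the expected value."""
    if not os.path.isfile(path):
        return True
    return md5_of_file(path) != expected_md5
```

The `md5` column written to metadata.csv holds the kind of value that would be passed as `expected_md5`.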

## Installation

In [None]:
!pip install internetarchive

## Import Libraries

In [None]:
from internetarchive import download, get_item, search_items
import os
import csv
from pathlib import Path

## Configuration

In [None]:
# Configure download directory
DOWNLOAD_DIR = os.path.expanduser("~/scifi_datasets/internet_archive")
os.makedirs(DOWNLOAD_DIR, exist_ok=True)

print(f"Download location: {DOWNLOAD_DIR}")

## Dataset Definitions

### Available Collections:

1. **ultimate-pgsf-txt** (1,900+ texts, 87 MB)
   - Plain text files from Project Gutenberg
   - Classic SF authors: Asimov, Leinster, etc.
   - Pre-1960s public domain works

2. **Pulp Magazine Archive** (Thousands of magazines)
   - PDF scans with original layout and artwork
   - Amazing Stories (1926+), Weird Tales (1923-1954), Galaxy (355 issues)
   - Authors: Asimov, Clarke, Dick, Lovecraft, Bradbury

3. **sciencefiction** (Collection with thousands of items)
   - Individual books, scholarly works, critical essays
   - Mixed content types and time periods

In [None]:
# Primary collections to download
COLLECTIONS = [
    {
        "id": "ultimate-pgsf-txt",
        "name": "Ultimate Project Gutenberg SF Collection",
        "description": "1,900+ plain text science fiction files from Project Gutenberg",
        "formats": ["Text"]  # Download only text files
    },
    {
        "id": "pulpmagazinearchive",
        "name": "Pulp Magazine Archive (Science Fiction subset)",
        "description": "Thousands of pulp magazine issues - PDF scans with original layout",
        "formats": ["PDF"],  # Download PDF format
        # Quote the phrase so both words are matched against the subject field
        "query": 'collection:pulpmagazinearchive AND subject:"science fiction"',  # Search query for SF subset
        "use_search": True  # This needs to be searched, not downloaded directly
    },
]
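
Search queries like the one above follow a Lucene-style `field:value` syntax, where a multi-word value generally has to be quoted to be treated as a single phrase. The sketch below shows one way to assemble such queries; `build_query` is a hypothetical helper, not part of the `internetarchive` library, and it assumes plain string values without embedded quotes.

```python
def build_query(collection, **fields):
    """Build a 'field:value AND ...' query, quoting values that contain spaces."""
    parts = [f"collection:{collection}"]
    for field, value in fields.items():
        if " " in value:
            # Quote multi-word values so they are searched as one phrase
            value = f'"{value}"'
        parts.append(f"{field}:{value}")
    return " AND ".join(parts)


pulp_query = build_query("pulpmagazinearchive", subject="science fiction")
print(pulp_query)
```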

## Helper Functions

In [None]:
def create_metadata_csv(item, item_dir, csv_path):
    """
    Create a metadata CSV file for an Internet Archive item.
    Format: title, author, ia_identifier, language, subject, url, local_path, format, size, md5
    """
    metadata = item.item_metadata.get('metadata', {})
    
    # Prepare metadata records for each file
    records = []
    
    for file in item.files:
        # Skip metadata and derivative files
        if file['name'].endswith(('.xml', '.sqlite', '_meta.mrc', '.torrent')):
            continue
        
        # Extract file metadata
        file_path = os.path.join(item_dir, file['name'])
        
        # Get metadata fields
        title = metadata.get('title', '')
        creator = metadata.get('creator', metadata.get('author', ''))
        if isinstance(creator, list):
            creator = '; '.join(creator)
        
        language = metadata.get('language', '')
        if isinstance(language, list):
            language = '; '.join(language)
        
        subject = metadata.get('subject', '')
        if isinstance(subject, list):
            subject = '; '.join(subject)
        
        identifier = item.identifier
        url = f"https://archive.org/details/{identifier}"
        
        records.append({
            'title': title,
            'author': creator,
            'ia_identifier': identifier,
            'language': language,
            'subject': subject,
            'url': url,
            'local_path': file_path,
            'format': file.get('format', ''),
            'size': file.get('size', ''),
            'md5': file.get('md5', '')
        })
    
    # Write CSV
    if records:
        with open(csv_path, 'w', newline='', encoding='utf-8') as f:
            fieldnames = ['title', 'author', 'ia_identifier', 'language', 'subject',
                         'url', 'local_path', 'format', 'size', 'md5']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
        
        print(f"  ✓ Created metadata CSV: {csv_path}")
        return True
    
    return False
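
Metadata fields such as creator, language and subject may arrive either as a single string or as a list, which is why the function joins lists with `'; '`. A minimal sketch of that normalisation as one reusable helper follows; `flatten_field` is a hypothetical name, and treating a missing value as an empty string is an assumption.

```python
def flatten_field(value, sep="; "):
    """Turn a metadata value (string, list, or None) into a single CSV-safe string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return sep.join(str(v) for v in value)
    return str(value)


print(flatten_field(["Science fiction", "Pulp magazines"]))
print(flatten_field("eng"))
```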

In [None]:
def download_collection(collection_info, checksum=True):
    """
    Download a single collection from Internet Archive.
    """
    item_id = collection_info["id"]
    name = collection_info["name"]
    formats = collection_info.get("formats")
    
    print(f"\n[Downloading] {name}")
    print(f"Identifier: {item_id}")
    
    try:
        # Get item metadata first
        item = get_item(item_id)
        print(f"Title: {item.item_metadata['metadata'].get('title', 'N/A')}")
        print(f"Size: {item.item_size / (1024**3):.2f} GB")
        print(f"Files: {item.files_count}")
        
        # Create directory for this collection
        item_dir = os.path.join(DOWNLOAD_DIR, item_id)
        os.makedirs(item_dir, exist_ok=True)
        
        # Download straight into item_dir. destdir sets the target folder and
        # no_directory=True skips the extra <identifier>/ subfolder, so the
        # local_path values written to metadata.csv match the files on disk.
        if formats:
            print(f"Downloading formats: {', '.join(formats)}")
        else:
            print("Downloading all formats...")
        download(item_id, verbose=True, checksum=checksum, formats=formats,
                 destdir=item_dir, no_directory=True)
        
        # Create metadata CSV
        csv_path = os.path.join(item_dir, 'metadata.csv')
        create_metadata_csv(item, item_dir, csv_path)
        
        print(f"✓ Successfully downloaded {name}\n")
        return True
        
    except Exception as e:
        print(f"✗ Error downloading {item_id}: {e}\n")
        return False
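
The size printed by `download_collection` covers the whole item, while a `formats` filter fetches only part of it. The sketch below estimates the size of just the selected formats from a list of file records shaped like the ones in `item.files`. `selected_size_bytes` is a hypothetical helper, and the sample records are made up for illustration; it assumes sizes may be given as strings.

```python
def selected_size_bytes(files, formats=None):
    """Sum the sizes of file records, optionally restricted to the given formats."""
    total = 0
    for f in files:
        if formats and f.get("format") not in formats:
            continue
        # Sizes may be strings or missing entirely; treat missing as 0
        total += int(f.get("size") or 0)
    return total


# Made-up records for illustration only
sample_files = [
    {"name": "story.txt", "format": "Text", "size": "100"},
    {"name": "story.pdf", "format": "PDF", "size": "900"},
    {"name": "story_meta.xml", "format": "Metadata"},
]
print(selected_size_bytes(sample_files, formats=["Text"]))
```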

In [None]:
def search_and_download_collection(query, collection_name, max_items=100, formats=None):
    """
    Search for items matching a query and download them.
    Used for collections like Pulp Magazine Archive and sciencefiction.
    """
    print(f"\n{'='*70}")
    print(f"[Collection Search] {collection_name}")
    print(f"{'='*70}")
    print(f"Query: {query}")
    print(f"Downloading up to {max_items} items...\n")
    
    # Create subdirectory for collection items
    collection_dir = os.path.join(DOWNLOAD_DIR, collection_name.lower().replace(' ', '_'))
    os.makedirs(collection_dir, exist_ok=True)
    
    # Aggregate metadata CSV for entire collection
    aggregate_records = []
    
    try:
        count = 0
        for result in search_items(query):
            if count >= max_items:
                break
            
            item_id = result['identifier']
            print(f"\n[{count+1}/{max_items}] Downloading: {item_id}")
            
            try:
                # Get item metadata
                item = get_item(item_id)
                
                # Create item subdirectory
                item_dir = os.path.join(collection_dir, item_id)
                os.makedirs(item_dir, exist_ok=True)
                
                # Download straight into the item subdirectory; no_directory=True
                # skips the extra <identifier>/ level the library adds by default
                download(item_id, verbose=True, checksum=True, formats=formats,
                         destdir=item_dir, no_directory=True)
                
                # Create individual metadata CSV
                csv_path = os.path.join(item_dir, 'metadata.csv')
                if create_metadata_csv(item, item_dir, csv_path):
                    # Read records for aggregate CSV
                    with open(csv_path, 'r', encoding='utf-8') as f:
                        reader = csv.DictReader(f)
                        aggregate_records.extend(list(reader))
                
                count += 1
            except Exception as e:
                print(f"  ✗ Error: {e}")
                continue
        
        print(f"\n✓ Downloaded {count} items from {collection_name}")
        
        # Create aggregate metadata CSV
        if aggregate_records:
            aggregate_csv_path = os.path.join(collection_dir, 'metadata_all.csv')
            with open(aggregate_csv_path, 'w', newline='', encoding='utf-8') as f:
                fieldnames = aggregate_records[0].keys()
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(aggregate_records)
            print(f"✓ Created aggregate metadata CSV: {aggregate_csv_path}")
        
        return count
        
    except Exception as e:
        print(f"✗ Error searching collection: {e}")
        return 0
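
The `max_items` cap above is applied with a manual counter. An alternative is `itertools.islice`, which stops pulling from a lazy iterator after a fixed number of results. This is a sketch under one stated difference: it caps the number of items attempted, whereas the function above only counts successful downloads. `take` and `fake_search` are hypothetical names; `fake_search` stands in for a real search iterator.

```python
from itertools import islice


def take(results, max_items):
    """Return at most max_items results from a (possibly very long) iterator."""
    return list(islice(results, max_items))


def fake_search():
    """Stand-in for a search iterator; yields made-up identifiers forever."""
    n = 0
    while True:
        yield {"identifier": f"example-item-{n}"}
        n += 1


for result in take(fake_search(), 3):
    print(result["identifier"])
```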

## Explore Collections (Optional)

Run these cells to preview collection metadata before downloading.

In [None]:
# List available collections with metadata
print("Available Science Fiction Collections:")
print("-" * 70)

for i, coll in enumerate(COLLECTIONS, 1):
    print(f"\n{i}. {coll['name']}")
    print(f"   ID: {coll['id']}")
    print(f"   Description: {coll['description']}")
    
    # Skip size check for search-based collections
    if coll.get('use_search'):
        print(f"   Note: This is a search-based collection with many items")
        continue
    
    try:
        item = get_item(coll['id'])
        print(f"   Size: {item.item_size / (1024**3):.2f} GB")
        print(f"   Files: {item.files_count}")
    except Exception as e:
        print(f"   (Metadata unavailable: {e})")

## Download Collections

### 1. Download Ultimate Project Gutenberg SF Collection

In [None]:
# Download the Ultimate Project Gutenberg SF Collection
download_collection(COLLECTIONS[0], checksum=True)

### 2. Download Pulp Magazine Archive (Science Fiction)

⚠️ **Warning**: The Pulp Magazine Archive is HUGE!
- Thousands of magazines available
- Each magazine is 10-50 MB (PDF scans)
- Set a reasonable `max_items` limit below (default: 100)
- Start small and increase if needed
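
To pick a sensible `max_items`, the 10-50 MB per-issue range above can be turned into a rough disk-space estimate. This is a back-of-the-envelope sketch; `estimate_download_gb` is a hypothetical helper, and actual file sizes will vary.

```python
def estimate_download_gb(n_items, mb_low=10, mb_high=50):
    """Return a (low, high) size estimate in GB for n_items magazine issues."""
    return (n_items * mb_low / 1024, n_items * mb_high / 1024)


low, high = estimate_download_gb(100)
print(f"100 items: roughly {low:.1f} to {high:.1f} GB")
```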

In [None]:
# Download Pulp Magazine Archive - Science Fiction subset
# Adjust max_items as needed (at 10-50 MB per issue, 100 items is roughly 1-5 GB and 1000 items roughly 10-50 GB)
MAX_PULP_ITEMS = 100

pulp_collection = COLLECTIONS[1]
search_and_download_collection(
    query=pulp_collection['query'],
    collection_name=pulp_collection['name'],
    max_items=MAX_PULP_ITEMS,
    formats=pulp_collection['formats']
)

### 3. (Optional) Download from sciencefiction Collection

This collection contains thousands of individual SF books, scholarly works, and more.

In [None]:
# Download items from the sciencefiction collection
# Adjust max_items as needed
MAX_SCIFI_ITEMS = 50

search_and_download_collection(
    query='collection:sciencefiction AND mediatype:texts',
    collection_name='Science Fiction Collection',
    max_items=MAX_SCIFI_ITEMS,
    formats=['Text', 'DjVuTXT']
)

## Download Specific Collections

### Amazing Stories Magazine

In [None]:
# Download Amazing Stories specifically
search_and_download_collection(
    query='collection:pulpmagazinearchive AND title:"Amazing Stories"',
    collection_name='Amazing Stories Magazine',
    max_items=50,
    formats=['PDF']
)

### Weird Tales Magazine

In [None]:
# Download Weird Tales specifically
search_and_download_collection(
    query='collection:pulpmagazinearchive AND title:"Weird Tales"',
    collection_name='Weird Tales Magazine',
    max_items=50,
    formats=['PDF']
)

### Galaxy Magazine

In [None]:
# Download Galaxy magazine specifically
search_and_download_collection(
    query='collection:pulpmagazinearchive AND title:"Galaxy"',
    collection_name='Galaxy Magazine',
    max_items=50,
    formats=['PDF']
)

## Verify Downloads

Check what was downloaded and verify that the metadata CSVs were created.

In [None]:
# List downloaded collections
import os

print("\nDownloaded Collections:")
print("=" * 70)

for item in sorted(os.listdir(DOWNLOAD_DIR)):
    item_path = os.path.join(DOWNLOAD_DIR, item)
    if os.path.isdir(item_path):
        # Collect files recursively: search-based collections keep each item
        # in its own subdirectory
        files = [os.path.join(root, name)
                 for root, _, names in os.walk(item_path)
                 for name in names]
        
        # Check for a metadata CSV at the top level of the collection directory
        has_metadata = any(
            os.path.isfile(os.path.join(item_path, name))
            for name in ('metadata.csv', 'metadata_all.csv')
        )
        
        print(f"\n{item}:")
        print(f"  Files: {len(files)}")
        print(f"  Has metadata CSV: {'✓' if has_metadata else '✗'}")
        
        # Show directory size
        total_size = sum(os.path.getsize(f) for f in files)
        print(f"  Size: {total_size / (1024**2):.2f} MB")
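
Beyond counting files, the `local_path` column of a metadata CSV can be checked against the disk. The sketch below lists entries whose file is absent. `missing_files` is a hypothetical helper. Because `create_metadata_csv` records every file of an item, including formats that were not requested, some missing entries are expected when a `formats` filter was used.

```python
import csv
import os


def missing_files(csv_path):
    """Return local_path values from a metadata CSV that do not exist on disk."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["local_path"]
                for row in csv.DictReader(f)
                if not os.path.isfile(row["local_path"])]
```

A typical call would be `missing_files(os.path.join(DOWNLOAD_DIR, "ultimate-pgsf-txt", "metadata.csv"))`, assuming that collection has already been downloaded.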

## Quick Reference: Example NASA Download

Original example, kept for reference.

In [None]:
# Example: Download NASA collection with checksum verification
# from internetarchive import download
# download('nasa', verbose=True, checksum=True)