Critical: gh-pages History Cleanup - 96% Repository Size Reduction (4.2GB → 100MB) #647

@mmcky

🚨 Executive Summary

The lecture-python.myst repository is 4.2 GB in size, and more than 99% of the stored history comes from the gh-pages branch. Converting gh-pages to an orphan branch that keeps only the current deployment would cut the repository size by about 96%, from 4.2 GB to ~100 MB.

This is a much simpler and more effective solution than cleaning the main branch history.


📊 Analysis Results

Current Repository State

| Metric | Value |
|---|---|
| Total repository size | 4.2 GB |
| Git history (.git) | 3.9 GB |
| Working directory | ~300 MB |
| gh-pages commits | 193 deployments |
| Date range | December 2020 - Present |

Size Breakdown by Branch

| Branch/Content | Uncompressed Size | % of Total |
|---|---|---|
| gh-pages history | 7.56 GB | 99.6% |
| Main branch + other branches | ~30 MB | 0.4% |
| Current gh-pages HEAD only | 95 MB | (what's actually needed) |

What's in gh-pages History?

Each of the 193 deployments includes a complete website build:

| File Type | Total in History | Description |
|---|---|---|
| Jupyter Notebooks (.ipynb) | 1.70 GB | Executed notebooks with outputs |
| Images (.png, .jpg, .svg) | 1.55 GB | Generated matplotlib plots |
| HTML/JS/CSS | 1.41 GB | Built website pages |
| Search indices & other | 2.90 GB | searchindex.js and assets |
| Total | 7.56 GB | All 193 deployments combined |
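Figures like the ones above can be re-derived in a local clone. The following is a minimal sketch, not necessarily the method used for this analysis: it sums the uncompressed size of every object reachable from a ref. The helper name `ref_uncompressed_bytes` is hypothetical and introduced here only for illustration.

```shell
# Hypothetical helper: total uncompressed bytes of all objects reachable from a ref.
ref_uncompressed_bytes() {
  git rev-list --objects "$1" |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize)' |
    awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Example (run inside a clone of the repository):
#   ref_uncompressed_bytes origin/gh-pages
#   git rev-list --count origin/gh-pages   # number of deployment commits
```

Because objects shared between branches are counted once per ref queried, the per-branch numbers are an upper bound rather than an exact partition of the repository.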

🎯 The Problem

Every time the documentation is built and deployed to gh-pages:

  1. Full website generated: ~50-100 MB of HTML, images, notebooks
  2. Committed to gh-pages: Creates a new commit with all files
  3. History preserved forever: All 193 deployments remain in git history
  4. Clone downloads everything: Fresh clones download 4.2 GB instead of 100 MB

Impact on Users

  • Fresh clone time: 5-10 minutes (downloading 4.2 GB)
  • Disk space usage: 4.2 GB per clone
  • Bandwidth costs: Significant for CI/CD and contributors
  • Git operations: Slower due to large object database
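Until the history is cleaned up, the clone cost described above can be avoided by fetching only the source branch. This is a minimal sketch, assuming the default branch is `main`; `clone_source_only` is a hypothetical helper name.

```shell
# Hypothetical helper: clone one branch only, so gh-pages objects are not fetched.
# Usage: clone_source_only <url> <target-dir> [branch]
clone_source_only() {
  git clone --quiet --single-branch --branch "${3:-main}" "$1" "$2"
}

# Example:
#   clone_source_only https://github.com/QuantEcon/lecture-python.myst.git lecture-python.myst
```

This sidesteps the download for one contributor but does not shrink the repository on GitHub, so it complements the cleanup rather than replacing it.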

✅ Recommended Solution: Orphan gh-pages Branch

Convert gh-pages to an orphan branch with only the current deployment (no history).

Expected Results

| Metric | Before | After | Improvement |
|---|---|---|---|
| Repository size | 4.2 GB | ~100 MB | 96% reduction |
| Fresh clone time | 5-10 min | ~30 sec | 90% faster |
| gh-pages commits | 193 | 1 | 99% fewer |
| Bandwidth per clone | 4.2 GB | 100 MB | 97% less |
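The improvement column can be checked with a one-line formula, (before - after) / before. The sketch below is only an arithmetic aid; `pct_reduction` is a hypothetical helper, 4.2 GB is taken as 4200 MB, and the 450-second clone time is an assumed midpoint of the 5-10 minute range.

```shell
# Hypothetical helper: percentage reduction from a "before" to an "after" value.
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 4200 100   # repository size in MB, ~100 MB target
pct_reduction 4200 150   # repository size in MB, 150 MB upper target
pct_reduction 193 1      # gh-pages commit count
pct_reduction 450 30     # clone time in seconds (assumed midpoint vs ~30 s)
```

With these inputs the sketch prints 97.6, 96.4, 99.5 and 93.3, so the headline 96% is on the conservative side and corresponds to a final size closer to 150 MB than to 100 MB.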

📋 Implementation Plan

Option 1: Manual Cleanup (One-Time)

Safe, controlled approach for one-time cleanup:

# ======================================
# STEP 1: Backup (CRITICAL!)
# ======================================
cd /path/to/lecture-python.myst

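# Create a local backup branch from the latest remote state.
# Keep it local: pushing it to origin would keep the old history on GitHub.
git fetch origin gh-pages
git branch gh-pages-backup origin/gh-pages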

# Optional: Create a backup clone
cd ..
cp -r lecture-python.myst lecture-python.myst.backup
cd lecture-python.myst

# ======================================
# STEP 2: Create Fresh gh-pages
# ======================================

# Fetch latest gh-pages
git fetch origin gh-pages

# Checkout gh-pages, resetting any stale local branch to the fetched remote state
git checkout -B gh-pages origin/gh-pages

# Create new orphan branch (no parent commits)
git checkout --orphan gh-pages-new

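# Stage all current files (--orphan already leaves the gh-pages files staged;
# run `git status` first so stray untracked files are not swept in by -A)
git add -A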

# Create single commit with current state
git commit -m "Fresh gh-pages deployment (history removed to reduce repo size from 4.2GB to ~100MB)"

# ======================================
# STEP 3: Replace old gh-pages
# ======================================

# Delete old gh-pages branch
git branch -D gh-pages

# Rename new branch to gh-pages
git branch -m gh-pages

# ======================================
# STEP 4: Force Push (REQUIRES COORDINATION!)
# ======================================

# Push the orphaned branch (overwrites history)
git push origin gh-pages --force

# ======================================
# STEP 5: Return to main branch
# ======================================

git checkout main

# ======================================
# STEP 6: Cleanup (IMPORTANT!)
# ======================================

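# Remove all references to old commits
# (nothing is freed locally while gh-pages-backup still exists; the old objects
# only disappear once that branch is deleted or lives in a separate backup clone)
git reflog expire --expire=now --all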

# Aggressive garbage collection
git gc --aggressive --prune=now

# Verify new size
git count-objects -vH
du -sh .git

# ======================================
# STEP 7: Notify contributors
# ======================================
# All contributors must re-clone or run:
# git fetch origin
# git checkout gh-pages
# git reset --hard origin/gh-pages
# git checkout main
# git reflog expire --expire=now --all
# git gc --aggressive --prune=now
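Before the force push in Step 4, it can be worth confirming that the new branch contains exactly the same files as the old one and really has a single commit. The helpers below are a hedged sketch; `same_tree` and `is_single_commit` are hypothetical names, not part of git.

```shell
# Hypothetical helper: succeed only if two refs point at identical file trees.
same_tree() {
  [ "$(git rev-parse "$1^{tree}")" = "$(git rev-parse "$2^{tree}")" ]
}

# Hypothetical helper: succeed only if a ref has exactly one commit in its history.
is_single_commit() {
  [ "$(git rev-list --count "$1")" -eq 1 ]
}

# Example, run after Step 3 and before the force push in Step 4:
#   same_tree gh-pages gh-pages-backup && is_single_commit gh-pages && echo "safe to push"
```

If `same_tree` fails, the orphan commit differs from the last deployment, for instance because untracked files were staged along with the site.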

Option 2: Automated with GitHub Actions (Future Deployments)

Prevent history buildup in the future:

Update your deployment workflow to use force_orphan:

# .github/workflows/publish.yml
name: Build and Deploy Documentation

on:
  push:
    branches: [main]
    tags: ['*']

jobs:
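  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # lets GITHUB_TOKEN push to gh-pages when the default token is read-only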
    
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      
      - name: Setup Python environment
        # ... your existing setup steps ...
      
      - name: Build documentation
        run: |
          jb build lectures --path-output ./ -W --keep-going
      
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_build/html
          force_orphan: true  # ← KEY: Creates orphan commits each time
          commit_message: "deploy: ${{ github.sha }}"
        
      # Optional: Upload build artifacts for debugging
      - name: Upload build artifacts
        uses: actions/upload-artifact@v4  # v3 has been deprecated by GitHub
        with:
          name: website-build
          path: _build/html
          retention-days: 30

Key change: force_orphan: true

This ensures each deployment:

  • ✅ Creates a fresh orphan commit
  • ✅ No history accumulation
  • ✅ Repository stays small forever
  • ✅ Only current deployment is kept
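Conceptually, an orphan deployment means publishing the build directory as a brand-new, parentless commit and force-pushing it over the branch. The sketch below illustrates that idea with plain git; it is not the action's actual implementation, and `deploy_orphan`, the `deploy-bot` identity, and the commit message are all hypothetical.

```shell
# Hypothetical helper: publish a directory as a single parentless commit.
# Usage: deploy_orphan <publish-dir> <remote-url-or-absolute-path> [branch]
deploy_orphan() {
  publish_dir=$1
  remote_url=$2
  branch=${3:-gh-pages}
  work=$(mktemp -d) || return 1
  # Copy the built site (including dotfiles) into a scratch directory.
  cp -R "$publish_dir"/. "$work"/ || return 1
  (
    cd "$work" &&
    git init -q . &&
    git symbolic-ref HEAD "refs/heads/$branch" &&
    git add -A &&
    git -c user.name="deploy-bot" -c user.email="deploy-bot@example.invalid" \
      commit -q -m "deploy: single-commit snapshot" &&
    # A fresh repository has no shared history, so the push must be forced.
    git push -q --force "$remote_url" "$branch"
  )
  status=$?
  rm -rf "$work"
  return "$status"
}
```

Each call replaces the remote branch with one new root commit, which is why the branch never grows beyond a single deployment.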

🔄 Alternative: Switch to External Hosting

If you want even more control, consider:

Netlify / Vercel Deployment

Instead of gh-pages, deploy to Netlify/Vercel:

Benefits:

  • No gh-pages branch needed at all
  • Preview deployments for PRs
  • Better performance (CDN)
  • Deploy history managed externally
  • Repository stays purely source code

GitHub Actions Example:

name: Deploy to Netlify

on:
  push:
    branches: [main]
  pull_request:

jobs:
  deploy:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Build site
        run: jb build lectures --path-output ./
      
      - name: Deploy to Netlify
        uses: nwtgck/actions-netlify@v2
        with:
          publish-dir: './_build/html'
          production-deploy: ${{ github.event_name == 'push' }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
        env:
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}

This completely eliminates the need for a gh-pages branch.


⚠️ Important Considerations

Before Running the Cleanup

  • Backup the repository (clone or branch backup)
  • Notify all contributors about the upcoming change
  • Check for dependencies on old gh-pages commits (unlikely)
  • Schedule maintenance window (low-impact operation)
  • Test on a fork first to verify the process

What Gets Removed

  • ❌ All 193 historical deployments of the website
  • ❌ Old versions of generated images
  • ❌ Old versions of built HTML/notebooks
  • ❌ Historical search indices

What Gets Kept

  • ✅ Current (latest) website deployment
  • ✅ All main branch history (source files)
  • ✅ All tags and releases
  • ✅ All other branches
  • ✅ Commit history for source code

Impact on Website

  • ✅ No impact: The live website will continue to work
  • ✅ No downtime: The current deployment remains available
  • ✅ No broken links: All current URLs remain valid
  • ✅ Immediate improvement: Faster clones for everyone

📝 Post-Cleanup Actions

For Repository Maintainers

# After force push, verify the cleanup worked
git clone https://github.com/QuantEcon/lecture-python.myst.git test-clone
cd test-clone
du -sh .git  # Should be ~100 MB instead of 3.9 GB

# Check gh-pages has only 1 commit
git log origin/gh-pages --oneline  # Should show only 1 commit

# Verify website still works
open https://python.quantecon.org  # or your website URL (open is macOS; use xdg-open on Linux)

For Contributors

After the cleanup, contributors need to update their local clones:

Option A: Fresh clone (recommended)

cd ~/projects
rm -rf lecture-python.myst
git clone https://github.com/QuantEcon/lecture-python.myst.git
cd lecture-python.myst

Option B: Update existing clone

cd lecture-python.myst

# Save any uncommitted work
git stash

# Update remote references
git fetch origin

# Reset gh-pages to match remote
git checkout gh-pages
git reset --hard origin/gh-pages

# Return to main
git checkout main

# Clean up local objects
git reflog expire --expire=now --all
git gc --aggressive --prune=now

# Verify size reduction
du -sh .git

📊 Monitoring and Validation

Verify Repository Size

# Check local repository size
du -sh .git

# Check GitHub's reported repository size (REST API "size" field, in KB)
gh api repos/QuantEcon/lecture-python.myst --jq '.size'
# The value may lag behind the force push until GitHub runs its own garbage collection

Expected Metrics After Cleanup

| Metric | Target Value |
|---|---|
| .git directory size | 50-100 MB |
| gh-pages commit count | 1 |
| Fresh clone time | < 1 minute |
| Total repository size | < 150 MB |

🎯 Recommended Timeline

Week 1: Preparation

  • Create backup branch: gh-pages-backup
  • Notify all active contributors
  • Document current state (commits, size, etc.)
  • Test cleanup process on a fork

Week 2: Execution

  • Perform gh-pages cleanup during low-activity period
  • Force push orphaned gh-pages branch
  • Verify website still works
  • Confirm repository size reduction

Week 3: Follow-up

  • Ensure all contributors have updated
  • Update deployment workflow with force_orphan: true
  • Document new process in CONTRIBUTING.md
  • Monitor repository size going forward

πŸ’‘ Future Prevention

To prevent this from happening again:

1. Use force_orphan in Deployments

Always use force_orphan: true in GitHub Actions (see Option 2 above).

2. Periodic Monitoring

Set up monthly checks:

# Add to CI or create a scheduled GitHub Action
git clone --bare https://github.com/QuantEcon/lecture-python.myst.git repo-check
cd repo-check
git count-objects -vH
# Alert if size-pack > 500 MB
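The alert step above can be sketched as a small filter over the output of `git count-objects -v`, which reports `size-pack` in KiB when `-H` is omitted. The helper name `check_pack_size` and its messages are hypothetical; the 500 MB threshold comes from the comment above.

```shell
# Hypothetical helper: read `git count-objects -v` output on stdin and
# fail (exit status 1) when the pack size exceeds the given limit in MiB.
check_pack_size() {
  limit_mib=$1
  kib=$(awk -F': ' '$1 == "size-pack" { print $2 }')
  mib=$(( kib / 1024 ))
  if [ "$mib" -gt "$limit_mib" ]; then
    echo "ALERT: pack is ${mib} MiB (limit ${limit_mib} MiB)"
    return 1
  fi
  echo "OK: pack is ${mib} MiB (limit ${limit_mib} MiB)"
}

# Example (inside the bare clone from the step above):
#   git count-objects -v | check_pack_size 500
```

A non-zero exit status makes the check usable directly as a failing step in a scheduled CI job.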

3. Separate Build Artifacts

Consider:

  • GitHub Releases: For official PDF/ZIP distributions
  • GitHub Actions Artifacts: For CI build outputs (30-90 day retention)
  • External CDN: For large media files
  • Git LFS: For binary files that must be tracked (not recommended for builds)

4. Documentation

Update CONTRIBUTING.md:

## Build Artifacts

**NEVER commit build outputs to the main branch:**

- ❌ `_build/` directory
- ❌ `_pdf/` directory  
- ❌ Generated images from builds
- ❌ Compiled HTML/notebooks

These are auto-generated and deployed via GitHub Actions.

The `.gitignore` file is configured to prevent this, but double-check
before committing.

🔗 Related Issues

This issue focuses specifically on the quick win: cleaning gh-pages history.


❓ FAQ

Q: Will this break anything?
A: No. The website continues to work normally. Only historical deployments are removed.

Q: Can we recover old deployments if needed?
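A: Yes, as long as the gh-pages-backup branch (or the backup clone) is kept, since it contains the full history. Keep it for 6-12 months, but only in a local or separate backup copy: pushing it to origin would keep the old objects in the GitHub repository and undo the size reduction.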

Q: Do we need to coordinate with contributors?
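A: Only lightly. Contributors work on the main branch and gh-pages is auto-generated, so nobody's work is rewritten. They should still be told about the force push, and they need to re-clone (or prune their existing clone) to get the size benefits.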

Q: How long does the cleanup take?
A: 5-10 minutes to execute the commands. Instant for users after they re-clone.

Q: What about GitHub Pages custom domain?
A: Unaffected. The CNAME file will be preserved in the new orphan commit.

Q: Will GitHub show contribution graphs correctly?
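A: Mostly. GitHub counts commits on the default branch and on gh-pages, so any deployment commits attributed to a user would drop off that user's graph. Contributions made on the main branch are unaffected.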

Q: Can we automate this completely?
A: Yes! Use force_orphan: true in your deployment action (Option 2).


📞 Next Steps

  1. Review this proposal and gather feedback
  2. Choose implementation approach: Manual cleanup or GitHub Actions update
  3. Schedule the cleanup for a convenient time
  4. Execute the plan following the steps above
  5. Verify results and update documentation

Document Version: 1.0
Date: October 16, 2025
Impact: 🔴 High (96% size reduction)
Difficulty: 🟢 Low (simple git commands)
Risk: 🟢 Low (easily reversible with backup)
