Critical: gh-pages History Cleanup - 96% Repository Size Reduction
Executive Summary
The lecture-python.myst repository is 4.2 GB in size, and over 99% of that comes from the gh-pages branch history. Converting gh-pages to an orphan branch (keeping only the current deployment) would reduce the repository size by about 96%, from 4.2 GB to ~100 MB.
This is a much simpler and more effective solution than cleaning the main branch history.
Analysis Results
Current Repository State
| Metric | Value |
|---|---|
| Total repository size | 4.2 GB |
| Git history (.git) | 3.9 GB |
| Working directory | ~300 MB |
| gh-pages commits | 193 deployments |
| Date range | December 2020 - Present |
Size Breakdown by Branch
| Branch/Content | Uncompressed Size | % of Total |
|---|---|---|
| gh-pages history | 7.56 GB | 99.6% |
| Main branch + other branches | ~30 MB | 0.4% |
| Current gh-pages HEAD only | 95 MB | (what's actually needed) |
What's in gh-pages History?
Each of the 193 deployments includes a complete website build:
| File Type | Total in History | Description |
|---|---|---|
| Jupyter Notebooks (.ipynb) | 1.70 GB | Executed notebooks with outputs |
| Images (.png, .jpg, .svg) | 1.55 GB | Generated matplotlib plots |
| HTML/JS/CSS | 1.41 GB | Built website pages |
| Search indices & other | 2.90 GB | searchindex.js and assets |
| Total | 7.56 GB | All 193 deployments combined |
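One way to reproduce per-branch figures like those above is to sum the uncompressed size of every object reachable from a ref. This is only a sketch of how such a measurement could be made, not necessarily how the numbers in these tables were produced; `branch_size_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sum the uncompressed size (in bytes) of every object reachable from a ref.
# Hypothetical helper; run it inside a clone of the repository.
branch_size_bytes() {
  git rev-list --objects "$1" |
    cut -d' ' -f1 |                               # keep only the object id
    git cat-file --batch-check='%(objectsize)' |  # one size per object
    awk '{ total += $1 } END { print total + 0 }'
}

# Example (inside a clone): branch_size_bytes origin/gh-pages
```

Objects shared between branches are counted once per ref measured, so per-branch totals obtained this way can overlap.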
The Problem
Every time the documentation is built and deployed to gh-pages:
- Full website generated: ~50-100 MB of HTML, images, notebooks
- Committed to gh-pages: Creates a new commit with all files
- History preserved forever: All 193 deployments remain in git history
- Clone downloads everything: Fresh clones download 4.2 GB instead of 100 MB
Impact on Users
- Fresh clone time: 5-10 minutes (downloading 4.2 GB)
- Disk space usage: 4.2 GB per clone
- Bandwidth costs: Significant for CI/CD and contributors
- Git operations: Slower due to large object database
Recommended Solution: Orphan gh-pages Branch
Convert gh-pages to an orphan branch with only the current deployment (no history).
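To see what an orphan branch does before touching the real repository, the idea can be tried in a throwaway repository. The sketch below is illustrative only: `demo_orphan`, the file name, and the commit messages are made up for the demonstration.

```shell
#!/bin/sh
# Demonstrate an orphan branch in a throwaway repository (nothing here touches
# a real clone). All names below are illustrative.
demo_orphan() (
  set -e
  repo=$(mktemp -d)
  cd "$repo"
  git init -q .
  git symbolic-ref HEAD refs/heads/gh-pages
  git config user.name "Demo"
  git config user.email "demo@example.com"

  # Simulate three deployments, each committing a rebuilt site.
  for n in 1 2 3; do
    echo "build $n" > index.html
    git add index.html
    git commit -q -m "deploy $n"
  done

  # Start a parentless branch; the index still holds the current files.
  git checkout -q --orphan gh-pages-new
  git commit -q -m "fresh deployment"

  echo "old commits: $(git rev-list --count gh-pages)"
  echo "new commits: $(git rev-list --count gh-pages-new)"
  # Identical tree ids mean the published content is unchanged.
  if [ "$(git rev-parse 'gh-pages^{tree}')" = "$(git rev-parse 'gh-pages-new^{tree}')" ]; then
    echo "trees match: yes"
  else
    echo "trees match: no"
  fi
  cd / && rm -rf "$repo"
)

demo_orphan
```

The old branch reports three commits and the new one a single commit, while both point at the same tree, which is why the published site does not change.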
Expected Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Repository size | 4.2 GB | ~100 MB | 96% reduction |
| Fresh clone time | 5-10 min | ~30 sec | 90% faster |
| gh-pages commits | 193 | 1 | 99% fewer |
| Bandwidth per clone | 4.2 GB | 100 MB | 97% less |
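The percentages in this table can be checked with a few lines of arithmetic. The sketch assumes 4.2 GB is treated as 4200 MB; `pct_reduction` is a hypothetical helper name.

```shell
#!/bin/sh
# Percentage reduction from a "before" value to an "after" value.
# pct_reduction is a hypothetical helper; both arguments must use the same unit.
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 4200 100   # repository size in MB, prints 97.6
pct_reduction 193 1      # gh-pages commits, prints 99.5
```

With a final size of exactly 100 MB the reduction works out to roughly 97.6%, so the 96% headline figure is on the conservative side and leaves room for a somewhat larger result.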
Implementation Plan
Option 1: Manual Cleanup (One-Time)
Safe, controlled approach for one-time cleanup:
# ======================================
# STEP 1: Backup (CRITICAL!)
# ======================================
cd /path/to/lecture-python.myst
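# Fetch first so the backup reflects the latest deployment
git fetch origin gh-pages
# Create backup branch (keep it local; pushing it would keep the old history on the remote)
git branch gh-pages-backup origin/gh-pages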
# Optional: Create a backup clone
cd ..
cp -r lecture-python.myst lecture-python.myst.backup
cd lecture-python.myst
# ======================================
# STEP 2: Create Fresh gh-pages
# ======================================
# Fetch latest gh-pages
git fetch origin gh-pages
# Check out gh-pages, resetting any stale local copy to match the remote
git checkout -B gh-pages origin/gh-pages
# Create new orphan branch (no parent commits)
git checkout --orphan gh-pages-new
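# No staging is needed: after --orphan the index already holds the gh-pages files.
# (Avoid `git add -A` here; it would also stage untracked files left over from other branches.)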
# Create single commit with current state
git commit -m "Fresh gh-pages deployment (history removed to reduce repo size from 4.2GB to ~100MB)"
# ======================================
# STEP 3: Replace old gh-pages
# ======================================
# Delete old gh-pages branch
git branch -D gh-pages
# Rename new branch to gh-pages
git branch -m gh-pages
# ======================================
# STEP 4: Force Push (REQUIRES COORDINATION!)
# ======================================
# Push the orphaned branch (overwrites history)
git push origin gh-pages --force
# ======================================
# STEP 5: Return to main branch
# ======================================
git checkout main
# ======================================
# STEP 6: Cleanup (IMPORTANT!)
# ======================================
# Remove all references to old commits
git reflog expire --expire=now --all
# Aggressive garbage collection
git gc --aggressive --prune=now
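# Verify new size (this clone only shrinks once gh-pages-backup is deleted or
# moved elsewhere, because the branch keeps the old objects reachable)
git count-objects -vH
du -sh .git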
# ======================================
# STEP 7: Notify contributors
# ======================================
# All contributors must re-clone or run:
# git fetch origin
# git checkout gh-pages
# git reset --hard origin/gh-pages
# git checkout main
# git gc --aggressive --prune=now
Option 2: Automated with GitHub Actions (Future Deployments)
Prevent history buildup in the future:
Update your deployment workflow to use force_orphan:
# .github/workflows/publish.yml
name: Build and Deploy Documentation
on:
  push:
    branches: [main]
    tags: ['*']
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Python environment
        # ... your existing setup steps ...
      - name: Build documentation
        run: |
          jb build lectures --path-output ./ -W --keep-going
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_build/html
          force_orphan: true  # KEY: creates an orphan commit on each deploy
          commit_message: "deploy: ${{ github.sha }}"
      # Optional: Upload build artifacts for debugging
      - name: Upload build artifacts
        uses: actions/upload-artifact@v4  # v3 is deprecated
        with:
          name: website-build
          path: _build/html
          retention-days: 30
Key change: force_orphan: true
This ensures that each deployment:
- Creates a fresh orphan commit
- Accumulates no history
- Keeps the repository small
- Retains only the current deployment
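The properties listed above can be checked mechanically after a deployment by counting the commits on the branch. The following is a minimal sketch; `assert_max_commits` is a hypothetical helper, and wiring it into CI is left as an assumption.

```shell
#!/bin/sh
# Succeed only if a ref has at most the given number of commits.
# assert_max_commits is a hypothetical helper for a CI step or local check.
assert_max_commits() {
  ref=$1
  max=$2
  count=$(git rev-list --count "$ref") || return 2
  if [ "$count" -gt "$max" ]; then
    echo "$ref has $count commits (expected at most $max)" >&2
    return 1
  fi
  echo "$ref has $count commit(s)"
}

# Example (inside a clone, after a fetch): assert_max_commits origin/gh-pages 1
```

A non-zero exit status makes the check easy to use as a failing step if history starts to accumulate again.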
Alternative: Switch to External Hosting
If you want even more control, consider:
Netlify / Vercel Deployment
Instead of gh-pages, deploy to Netlify/Vercel:
Benefits:
- No gh-pages branch needed at all
- Preview deployments for PRs
- Better performance (CDN)
- Deploy history managed externally
- Repository stays purely source code
GitHub Actions Example:
name: Deploy to Netlify
on:
  push:
    branches: [main]
  pull_request:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build site
        run: jb build lectures --path-output ./
      - name: Deploy to Netlify
        uses: nwtgck/actions-netlify@v2
        with:
          publish-dir: './_build/html'
          production-deploy: ${{ github.event_name == 'push' }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
        env:
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
This completely eliminates the need for a gh-pages branch.
Important Considerations
Before Running the Cleanup
- Backup the repository (clone or branch backup)
- Notify all contributors about the upcoming change
- Check for dependencies on old gh-pages commits (unlikely)
- Schedule maintenance window (low-impact operation)
- Test on a fork first to verify the process
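For the backup item in this checklist, one option that keeps the old history out of the repository entirely is a git bundle: a single file that can be stored anywhere. This is a sketch under that assumption; `backup_ref_to_bundle` and the bundle file name are hypothetical.

```shell
#!/bin/sh
# Save the full history of one ref into a single file outside the repository.
# backup_ref_to_bundle is a hypothetical helper; the bundle path is up to you.
backup_ref_to_bundle() {
  ref=$1
  bundle=$2
  git bundle create "$bundle" "$ref" &&
    git bundle verify "$bundle"
}

# Example (inside a clone that has a local gh-pages-backup branch):
#   backup_ref_to_bundle gh-pages-backup ../gh-pages-history.bundle
# To bring the old history back later:
#   git fetch ../gh-pages-history.bundle gh-pages-backup:gh-pages-archive
```

Because the bundle lives outside `.git`, the backup branch can be deleted afterwards, which lets garbage collection actually reclaim the space.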
What Gets Removed
- All 193 historical deployments of the website
- Old versions of generated images
- Old versions of built HTML/notebooks
- Historical search indices
What Gets Kept
- Current (latest) website deployment
- All main branch history (source files)
- All tags and releases
- All other branches
- Commit history for source code
Impact on Website
- No impact: The live website will continue to work
- No downtime: The current deployment remains available
- No broken links: All current URLs remain valid
- Immediate improvement: Faster clones for everyone
Post-Cleanup Actions
For Repository Maintainers
# After force push, verify the cleanup worked
git clone https://github.com/QuantEcon/lecture-python.myst.git test-clone
cd test-clone
du -sh .git # Should be ~100 MB instead of 3.9 GB
# Check gh-pages has only 1 commit
git log origin/gh-pages --oneline # Should show only 1 commit
# Verify website still works
open https://python.quantecon.org # or your website URL
For Contributors
After the cleanup, contributors need to update their local clones:
Option A: Fresh clone (recommended)
cd ~/projects
rm -rf lecture-python.myst
git clone https://github.com/QuantEcon/lecture-python.myst.git
cd lecture-python.myst
Option B: Update existing clone
cd lecture-python.myst
# Save any uncommitted work
git stash
# Update remote references
git fetch origin
# Reset gh-pages to match remote
git checkout gh-pages
git reset --hard origin/gh-pages
# Return to main
git checkout main
# Clean up local objects
git reflog expire --expire=now --all
git gc --aggressive --prune=now
# Verify size reduction
du -sh .git
Monitoring and Validation
Verify Repository Size
# Check local repository size
du -sh .git
# Check GitHub repository size
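# Query the REST API; its "size" field is reported in KB and may lag behind
# until GitHub garbage-collects the old objects on its side:
# curl -s https://api.github.com/repos/QuantEcon/lecture-python.myst | grep '"size"'
Expected Metrics After Cleanup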
| Metric | Target Value |
|---|---|
| .git directory size | 50-100 MB |
| gh-pages commit count | 1 |
| Fresh clone time | < 1 minute |
| Total repository size | < 150 MB |
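The size targets above can be turned into a pass/fail check. The sketch below reads `size-pack` from `git count-objects -v`, which reports it in KiB; `check_pack_size` is a hypothetical helper, and the limit used is whatever target you settle on.

```shell
#!/bin/sh
# Compare the packed size of the current repository against a limit (in KiB).
# check_pack_size is a hypothetical helper; `git count-objects -v` reports
# size-pack in KiB.
check_pack_size() {
  limit_kib=$1
  pack_kib=$(git count-objects -v | awk '/^size-pack:/ { print $2 }')
  if [ "$pack_kib" -gt "$limit_kib" ]; then
    echo "size-pack is ${pack_kib} KiB, above the ${limit_kib} KiB limit" >&2
    return 1
  fi
  echo "size-pack is ${pack_kib} KiB (limit ${limit_kib} KiB)"
}

# Example: 150 MB expressed in KiB is 150 * 1024 = 153600
#   check_pack_size 153600
```

Loose objects are not included in `size-pack`, so the check is most meaningful in a fresh clone or after `git gc`.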
Recommended Timeline
Week 1: Preparation
- Create backup branch: gh-pages-backup
- Notify all active contributors
- Document current state (commits, size, etc.)
- Test cleanup process on a fork
Week 2: Execution
- Perform gh-pages cleanup during low-activity period
- Force push orphaned gh-pages branch
- Verify website still works
- Confirm repository size reduction
Week 3: Follow-up
- Ensure all contributors have updated
- Update deployment workflow with force_orphan: true
- Document new process in CONTRIBUTING.md
- Monitor repository size going forward
Future Prevention
To prevent this from happening again:
1. Use force_orphan in Deployments
Always use force_orphan: true in GitHub Actions (see Option 2 above).
2. Periodic Monitoring
Set up monthly checks:
# Add to CI or create a scheduled GitHub Action
git clone --bare https://github.com/QuantEcon/lecture-python.myst.git repo-check
cd repo-check
git count-objects -vH
# Alert if size-pack > 500 MB
3. Separate Build Artifacts
Consider:
- GitHub Releases: For official PDF/ZIP distributions
- GitHub Actions Artifacts: For CI build outputs (30-90 day retention)
- External CDN: For large media files
- Git LFS: For binary files that must be tracked (not recommended for builds)
4. Documentation
Update CONTRIBUTING.md:
## Build Artifacts
**NEVER commit build outputs to the main branch:**
- `_build/` directory
- `_pdf/` directory
- Generated images from builds
- Compiled HTML/notebooks
These are auto-generated and deployed via GitHub Actions.
The `.gitignore` file is configured to prevent this, but double-check
before committing.
Related Issues
- #646 - Repository Size Reduction: 4.2GB → ~500MB Strategy (comprehensive analysis)
This issue focuses specifically on the quick win: cleaning gh-pages history.
FAQ
Q: Will this break anything?
A: No. The website continues to work normally. Only historical deployments are removed.
Q: Can we recover old deployments if needed?
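A: Yes, as long as the backup is kept. The gh-pages-backup branch contains the full history; keep it for 6-12 months in a separate backup clone rather than on origin, because any branch that still references the old commits keeps the repository at its current size.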
Q: Do we need to coordinate with contributors?
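A: Only lightly. Contributors work on the main branch and gh-pages is auto-generated, so day-to-day work is unaffected. The force push should still be announced: anyone with a local gh-pages checkout must reset it, and existing clones only shrink after a re-clone or the local cleanup steps above.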
Q: How long does the cleanup take?
A: 5-10 minutes to execute the commands. Instant for users after they re-clone.
Q: What about GitHub Pages custom domain?
A: Unaffected. The CNAME file will be preserved in the new orphan commit.
Q: Will GitHub show contribution graphs correctly?
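A: Largely, yes. GitHub counts commits made on the default branch and on gh-pages, so commits to main are unaffected; only deployment commits on gh-pages that were attributed to a personal account would drop out of that person's graph.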
Q: Can we automate this completely?
A: Yes! Use force_orphan: true in your deployment action (Option 2).
Next Steps
- Review this proposal and gather feedback
- Choose implementation approach: Manual cleanup or GitHub Actions update
- Schedule the cleanup for a convenient time
- Execute the plan following the steps above
- Verify results and update documentation
References
- GitHub Actions: peaceiris/actions-gh-pages
- Git: Creating and Removing Orphan Branches
- GitHub Docs: Managing Large Files
- BFG Repo-Cleaner (alternative for main branch cleanup)
Document Version: 1.0
Date: October 16, 2025
Impact: High (96% size reduction)
Difficulty: Low (simple git commands)
Risk: Low (easily reversible with backup)