Repository Size Reduction: 4.2GB → ~500MB Strategy #646

@mmcky

Repository Size Analysis and Reduction Strategy

Executive Summary

The lecture-python.myst repository has grown to 4.2 GB, of which 3.9 GB (93%) is git history. Fresh clones are slow because build artifacts that were committed before being added to .gitignore are still stored in that history and are downloaded with every clone. This document outlines the root causes and provides actionable solutions to reduce the repository size by 75-85% (down to ~500MB-1GB).

Current Repository Size Analysis

Overall Statistics

  • Total repository size: 4.2 GB
  • Git history (.git directory): 3.9 GB (93% of total)
  • Working directory: ~300 MB
  • Git pack files: 3.92 GB
  • Number of tracked files: 217
  • Number of branches: 390 (including remote branches)

Size Breakdown by File Type in Git History

| File Type | Total Size | Versions | Status |
| --- | --- | --- | --- |
| PDF files (_pdf/quantecon-python.pdf) | 2.93 GB | 137 commits | ❌ In history |
| Jupyter Notebooks (.ipynb) | 1.74 GB | Many | ❌ In history |
| Images (.png, .jpg, .svg) | 1.59 GB | 128-181 per file | ❌ In history |
| HTML files | 1.39 GB | Many | ❌ In history |
| Total build artifacts | ~7.65 GB (uncompressed) | - | 🔴 CRITICAL |

Note: These files are currently in .gitignore ✅ but their historical versions remain in git history, meaning every clone downloads all past versions.

Root Causes

1. Build Artifacts Committed to Git History (75% of .git size)

The primary issue is that build outputs were historically committed to the repository:

  • _pdf/quantecon-python.pdf - Committed 137 times (~30MB each time)
  • _images/*.png - Committed 128-181 times across versions
  • _sources/*.ipynb - Large notebook files with embedded outputs
  • HTML build files - Multiple versions stored

These are now in .gitignore, but all historical versions remain in every clone.

2. Excessive Branch Count

  • 390 branches exist (mostly remote branches from merged PRs)
  • Branches themselves take almost no space, but every branch keeps its commits reachable, so git cannot garbage-collect old history (including large blobs) that only those branches reference
  • Many branches appear to be old deployment branches and merged feature branches
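
To see which branches are stale, a minimal sketch like the one below lists them oldest first. The function name branches_by_age is hypothetical, and the default prefix assumes the remote is called origin.

```shell
# Print branches under a ref prefix, oldest commit first: "<date> <branch>".
# branches_by_age is a hypothetical helper name; the default prefix assumes
# the remote is named "origin".
branches_by_age() {
  prefix="${1:-refs/remotes/origin}"
  git for-each-ref --sort=committerdate \
    --format='%(committerdate:short) %(refname:short)' "$prefix"
}
```

Running `branches_by_age | head -n 30` inside a clone would show the 30 least recently updated remote branches, which are the likeliest candidates for deletion.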

3. Large Individual Files in History

Largest individual file versions in history (summary of the top 20):

_pdf/quantecon-python.pdf: 33.3 MB (20 versions >30MB each)
_sources/bayes_nonconj.ipynb: 11.6 MB (multiple versions >10MB)
bayes_nonconj.html: 8.9 MB (multiple versions >8MB)
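
A list like the one above can be reproduced with standard git plumbing. The sketch below is one way to do it, not necessarily how the figures in this issue were produced; largest_blobs is a hypothetical helper name.

```shell
# List the N largest blobs reachable from any ref, largest first.
# Output: "<size in bytes> <path>". largest_blobs is a hypothetical name.
largest_blobs() {
  n="${1:-20}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '/^blob / { sub(/^blob /, ""); print }' |
    sort -rn |
    head -n "$n"
}
```

Calling `largest_blobs 20` in a full clone prints the 20 biggest file versions ever committed, including ones that were later deleted or ignored.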

Recommended Solutions

Option 1: BFG Repo-Cleaner ⭐ (RECOMMENDED)

Expected reduction: 75-85% (4.2GB → 500MB-1GB)
Difficulty: Medium
Risk: Medium (requires force push and coordination)

BFG is a fast, simple tool for removing large files from git history; it is typically much quicker than git filter-branch.

Prerequisites

# Install BFG Repo-Cleaner
brew install bfg

# Or download from: https://rtyley.github.io/bfg-repo-cleaner/

Implementation Steps

# 1. BACKUP THE REPOSITORY FIRST!
cd /path/to/parent/directory
cp -r lecture-python.myst lecture-python.myst.backup
# Also ensure GitHub has all important branches backed up

# 2. Clone a fresh mirror of the repository
git clone --mirror https://github.com/QuantEcon/lecture-python.myst.git lecture-python.myst-clean.git
cd lecture-python.myst-clean.git

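# 3. Remove large files and directories from history
# BFG matches on file/folder NAME anywhere in the tree, not on a full path
bfg --delete-files quantecon-python.pdf
bfg --delete-folders _pdf
bfg --delete-folders _build
bfg --delete-folders _images
bfg --delete-folders _sources
# This matches every .html file in history; check first that no source .html file is needed
bfg --delete-files '*.html'
# --no-blob-protection also strips matching files from the current HEAD commit,
# which BFG otherwise leaves untouched
bfg --delete-files '*.ipynb' --no-blob-protection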

# 4. Clean up the repository
git reflog expire --expire=now --all
git gc --prune=now --aggressive

# 5. Verify the new size
git count-objects -vH
du -sh .

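# 6. If satisfied with the results, push the cleaned history
# ⚠️ THIS REQUIRES COORDINATION WITH ALL CONTRIBUTORS
# A mirror clone pushes every ref; GitHub rejects updates to its read-only
# refs/pull/* refs, so errors for those refs are expected
git push --force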

# 7. Clean up
cd ..
rm -rf lecture-python.myst-clean.git

Post-cleanup Actions

After force pushing, all contributors must:

# Option A: Fresh clone (recommended)
cd /path/to/workspace
rm -rf lecture-python.myst
git clone https://github.com/QuantEcon/lecture-python.myst.git

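# Option B: Reset existing clone (discards local commits and untracked files)
cd lecture-python.myst
git fetch origin
git reset --hard origin/main
git clean -fd
# The old objects stay in .git until they are pruned; to reclaim the space:
git reflog expire --expire=now --all
git gc --prune=now
# Local branches based on the old history must be deleted or rebased, never merged back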

Option 2: git-filter-repo (More Control)

Expected reduction: 75-85%
Difficulty: Medium-High
Risk: Medium (requires force push)

Provides more granular control than BFG.

Prerequisites

# Install git-filter-repo
brew install git-filter-repo
# Or: pip install git-filter-repo

Implementation Steps

# 1. BACKUP FIRST!
cd /path/to/parent/directory
cp -r lecture-python.myst lecture-python.myst.backup

# 2. Fresh clone for safety
git clone https://github.com/QuantEcon/lecture-python.myst.git lecture-python.myst-clean
cd lecture-python.myst-clean

# 3. Remove paths from entire history in a single pass
# (one invocation rewrites history once instead of six times)
git filter-repo --invert-paths \
  --path _pdf \
  --path _build \
  --path _images \
  --path _sources \
  --path-glob '*.html' \
  --path-glob '*.ipynb'

# 4. Add back the remote (filter-repo removes it)
git remote add origin https://github.com/QuantEcon/lecture-python.myst.git

# 5. Verify size
git count-objects -vH

# 6. Force push (coordinate with team!)
git push --force --all origin
git push --force --tags origin

Option 3: Branch Cleanup (Lower Impact, Lower Risk)

Expected reduction: 10-15%
Difficulty: Low
Risk: Low

This can be done independently or in combination with other options.

Implementation Steps

# 1. List all merged remote branches
git branch -r --merged origin/main | grep -v "main\|HEAD"

# 2. Review the list, then delete merged branches
# Delete individual branches:
git push origin --delete branch-name

# Or batch delete all merged branches (review carefully first!):
git branch -r --merged origin/main | 
  grep -v "main\|HEAD" | 
  grep -v "GA4-code-update" |  # Keep any branches you want to preserve
  sed 's/origin\///' | 
  xargs -n 1 echo git push --delete origin  # Remove 'echo' to actually delete

# 3. Clean up local references
git fetch --prune origin

# 4. Run garbage collection
git gc --aggressive --prune=now

Option 4: Shallow Clone + Fresh Start (Nuclear Option)

Expected reduction: 90%+ (but loses history)
Difficulty: Low
Risk: High (loses all git history)

Only consider if git history is not critical.

Option 4a: Limited History

# Clone with only recent history (e.g., last 100 commits)
git clone --depth 100 https://github.com/QuantEcon/lecture-python.myst.git lecture-python.myst-shallow

This is useful for CI/CD systems but not recommended for development.

Option 4b: Complete Fresh Start

# 1. In existing repository
git checkout --orphan fresh-start
git add .
git commit -m "Fresh start - removed bloated history"

# 2. Replace main branch
git branch -D main
git branch -m main

# 3. Force push
git push --force origin main

# 4. Clean up old branches
git push origin --delete $(git branch -r | grep -v "main\|HEAD" | sed 's/origin\///')

⚠️ WARNING: This destroys all git history permanently.


Option 5: Git LFS for Future Large Files (Preventive)

Impact: Prevents future bloat
Difficulty: Low
Risk: Low

Even after cleanup, consider Git LFS for any large files that must be tracked.

Implementation Steps

# 1. Install Git LFS
brew install git-lfs
git lfs install

# 2. Track file types that should use LFS
git lfs track "*.pdf"
git lfs track "*.png"
# Build output such as _build/ should stay in .gitignore rather than go into LFS

# 3. Commit the LFS configuration
git add .gitattributes
git commit -m "Configure Git LFS for large files"
git push

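# 4. Migrate existing files (if any) to LFS
# This rewrites history, so it needs the same coordination as Options 1 and 2
git lfs migrate import --include="*.pdf,*.png" --everything
git push --force --all origin

Note: Git LFS requires every contributor to install the git-lfs client, and GitHub applies storage and bandwidth quotas to LFS objects.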


Recommended Action Plan

Phase 1: Immediate Actions (Low Risk)

  • ✅ Verify .gitignore is working correctly (already done)
  • Clean up merged branches using Option 3
  • Document the plan and notify contributors

Phase 2: Major Cleanup (Requires Coordination)

  • Choose Option 1 (BFG) for best results with least complexity
  • Set a maintenance window for the cleanup
  • Notify all contributors 1 week in advance
  • Perform the cleanup during low-activity period
  • Force push the cleaned repository
  • Provide updated clone instructions to all contributors

Phase 3: Preventive Measures

  • Enable the repository setting "Automatically delete head branches" so merged PR branches are removed automatically
  • Consider Git LFS for any necessary large files (Option 5)
  • Document build artifact policy in CONTRIBUTING.md
  • Add pre-commit hooks to prevent accidental commits of build outputs
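
As an illustration of the pre-commit hook idea, the sketch below rejects staged files that look like build output. The function name reject_build_artifacts is hypothetical, and the path pattern is an assumption based on the artifact paths named in this issue; it would also block any legitimate .html or .ipynb source file, so adjust it to the repository's policy.

```shell
# Fail if any staged file looks like a build artifact.
# reject_build_artifacts is a hypothetical name; the pattern is an assumption
# taken from the paths listed in this issue (_pdf, _build, _images, _sources,
# plus *.ipynb and *.html anywhere).
reject_build_artifacts() {
  staged="$(git diff --cached --name-only --diff-filter=ACMR |
    grep -E '^(_pdf|_build|_images|_sources)/|\.(ipynb|html)$' || true)"
  if [ -n "$staged" ]; then
    echo "Refusing to commit build artifacts:" >&2
    echo "$staged" >&2
    return 1
  fi
  return 0
}
```

To use it as a hook, the function could be placed in an executable .git/hooks/pre-commit file that ends with `reject_build_artifacts || exit 1`.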

Pre-Cleanup Checklist

Before performing any history-rewriting operation:

  • Backup: Create complete backup of repository
  • Notify: Email/message all contributors about upcoming changes
  • Document: List any critical commit hashes referenced elsewhere
  • Schedule: Choose maintenance window (weekend/low-activity period)
  • Test: Run cleanup on test clone first
  • Verify: Check new size and ensure critical files remain
  • Prepare: Write contributor instructions for re-cloning
  • Archive: Consider archiving old branches before deletion
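
For the Backup and Archive items, one possible approach is a git bundle, which stores every ref and its full history in a single file. This is a sketch only; backup_bundle is a hypothetical helper name and the output path is up to the maintainer.

```shell
# Write all refs and their history into one file, then check the file is usable.
# backup_bundle is a hypothetical helper name.
backup_bundle() {
  out="$1"
  git bundle create "$out" --all && git bundle verify "$out"
}
```

For example, `backup_bundle ../lecture-python.myst-pre-cleanup.bundle` run from a full clone produces a file that can later be cloned from like any other remote.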

Post-Cleanup Verification

After cleanup, verify:

# Check repository size
du -sh .git
git count-objects -vH

# Verify critical files are present
ls lectures/*.md
cat lectures/_config.yml

# Test build process
jb build lectures --path-output ./ -W --keep-going

# Check branch count
git branch -a | wc -l

Expected results:

  • .git directory: ~500MB-1GB (down from 3.9GB)
  • All source .md files intact
  • Build process works correctly
  • Reduced branch count
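
Beyond checking the size, it may help to confirm that the artifact paths no longer appear anywhere in history. The sketch below counts commits on any ref that touch the given paths; commits_touching is a hypothetical helper name.

```shell
# Count commits on any ref that touch the given paths.
# 0 means the paths no longer appear in history.
# commits_touching is a hypothetical helper name.
commits_touching() {
  git log --all --oneline -- "$@" | wc -l | tr -d ' '
}
```

After a successful cleanup, `commits_touching _pdf _build _images _sources` should print 0.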

Communication Template for Contributors

📢 Repository Cleanup Notice

We're performing maintenance on the lecture-python.myst repository to reduce its size from 4.2GB to ~500MB-1GB.

**What's happening:**
- Removing build artifacts from git history (PDFs, notebooks, images, HTML)
- This requires rewriting git history (force push)

**When:** [INSERT DATE/TIME]

**What you need to do:**
After the cleanup is complete, you MUST re-clone the repository:

# Save any local changes OUTSIDE the clone (the clone is deleted below)
cd lecture-python.myst
git diff HEAD > ../my-changes.patch  # uncommitted changes, staged and unstaged
# Untracked files are not in the patch; copy them elsewhere by hand

# Remove old clone
cd ..
rm -rf lecture-python.myst

# Fresh clone
git clone https://github.com/QuantEcon/lecture-python.myst.git
cd lecture-python.myst

# Reapply changes if needed
git apply ../my-changes.patch

**Benefits:**
- Faster clones (4.2GB → ~500MB-1GB)
- Faster git operations
- Less disk space usage

Questions? Please contact [MAINTAINER]

Estimated Time Requirements

| Task | Estimated Time |
| --- | --- |
| BFG cleanup execution | 10-30 minutes |
| Testing cleaned repository | 30-60 minutes |
| Force push and verification | 10 minutes |
| Contributor re-clone time | 5-10 minutes per person |
| Total maintenance window | 2-3 hours (recommended) |

Technical Details

Why Build Artifacts Are So Large

  1. PDF files: Each build generates a 30MB+ PDF that was committed 137 times
  2. Images: Matplotlib/plotting outputs were committed with each documentation build
  3. Jupyter notebooks: .ipynb files store outputs inline, making them large
  4. HTML files: Full website builds were committed before using external hosting

Why .gitignore Isn't Enough

Adding files to .gitignore only prevents future commits. Git stores the complete history of all files ever committed, so historical versions remain in the repository forever unless explicitly removed.
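
A related check is whether any file is still tracked even though an ignore rule now matches it, since .gitignore has no effect on files that are already in the index. A minimal sketch, with tracked_but_ignored as a hypothetical helper name:

```shell
# List files that are still tracked even though an ignore rule matches them.
# tracked_but_ignored is a hypothetical helper name.
tracked_but_ignored() {
  git ls-files --cached --ignored --exclude-standard
}
```

Any path printed here would need `git rm --cached <path>` to stop being tracked; its earlier versions still stay in history until one of the cleanup options is applied.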

Safety Considerations

  • Force pushing rewrites git history, breaking existing clones
  • All contributors must re-clone or reset their repositories
  • Any references to old commit hashes (in issues, PRs, documentation) will break
  • CI/CD systems with cached clones will need to re-clone
  • GitHub links to specific old commits (e.g. in PR comments) may stop resolving once the old objects are no longer reachable

Alternative: Using GitHub Artifacts for Builds

Instead of committing build outputs, consider:

# .github/workflows/build.yml
- name: Upload PDF artifact
  uses: actions/upload-artifact@v4
  with:
    name: quantecon-python-pdf
    path: _pdf/quantecon-python.pdf
    retention-days: 90

This keeps build outputs accessible without bloating the repository.


Questions & Answers

Q: Will this affect the published website?
A: No, the published website is deployed separately and won't be affected.

Q: Will we lose commit history?
A: No, commit messages and code changes remain. Only the large binary files are removed from history. Commit hashes will change, however, because every rewritten commit gets a new ID.

Q: What if someone has unpushed commits?
A: They should push before the cleanup or save patches and reapply after re-cloning.

Q: Can we selectively keep some PDF versions?
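A: Partly. By default BFG leaves the files in the current HEAD commit untouched, and --protect-blobs-from can extend that protection to other refs; older versions are still removed.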

Q: How often should we do this?
A: Once, if we prevent build artifacts from being committed going forward.
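
For the unpushed-commits question above, saving patches could look like the sketch below. The function name export_unpushed, the default base origin/main, and the default output directory are assumptions for illustration.

```shell
# Export commits that are on HEAD but not on the base ref as patch files.
# export_unpushed is a hypothetical helper; the defaults (origin/main and
# ../unpushed-patches) are assumptions, chosen so the patches live outside
# the clone that will be deleted.
export_unpushed() {
  base="${1:-origin/main}"
  outdir="${2:-../unpushed-patches}"
  git format-patch --quiet -o "$outdir" "$base..HEAD"
}
```

Run it before the cleanup, then in the fresh clone apply the files with `git am ../unpushed-patches/*.patch`.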


Next Steps

  1. Decide which option to use (recommend Option 1: BFG)
  2. Schedule maintenance window
  3. Notify all contributors
  4. Execute cleanup during maintenance window
  5. Verify results
  6. Update documentation and CONTRIBUTING.md
  7. Implement preventive measures


Document Version: 1.0
Date: October 16, 2025
Analysis performed on: macOS with git 2.x
