Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Gitea git cat-file Subprocess Management Optimization #33952

Open
lunny opened this issue Mar 20, 2025 · 9 comments
Open

Proposal: Gitea git cat-file Subprocess Management Optimization #33952

lunny opened this issue Mar 20, 2025 · 9 comments
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@lunny
Copy link
Member

lunny commented Mar 20, 2025

Feature Description

Background

Currently, Gitea creates a new git cat-file --batch subprocess for each request when handling git operations. While this approach is straightforward, it leads to the following issues in high-concurrency scenarios:

  • High system overhead due to frequent subprocess creation/destruction
  • Increased response latency as each request requires subprocess initialization
  • Potential overconsumption of system resources (such as file descriptors)
  • Non-gogit version of Gitea in Windows is slow for git related operations

Proposal Overview

Design and implement a lightweight git cat-file --batch subprocess manager to improve performance and resource utilization through subprocess reuse. Key features include:

  • Maintaining subprocess pools organized by repository path
  • Dynamically allocating idle subprocesses to handle requests
  • Automatically recycling long-idle subprocesses
  • Gracefully handling high-load situations

Detailed Design

Subprocess Manager Structure

type GitCatFileManager struct {
    // Subprocess pools indexed by repository path
    procPools     map[string]*ProcPool
    mutex         sync.RWMutex
    maxProcsPerRepo int     // Maximum number of subprocesses per repository
    idleTimeout   time.Duration // Idle timeout period
}

type ProcPool struct {
    repoPath      string
    processes     []*GitCatFileProcess
    mutex         sync.Mutex
}

type GitCatFileProcess struct {
    cmd           *exec.Cmd
    stdin         io.WriteCloser
    stdout        io.ReadCloser
    lastUsed      time.Time
    inUse         bool
    mutex         sync.Mutex
}

Core Functionality Implementation

Acquiring a Subprocess

func (m *GitCatFileManager) Get(repoPath string) (*GitCatFileProcess, error) {
    m.mutex.RLock()
    pool, exists := m.procPools[repoPath]
    m.mutex.RUnlock()
    
    if !exists {
        m.mutex.Lock()
        // Double-check to avoid race conditions
        pool, exists = m.procPools[repoPath]
        if !exists {
            pool = &ProcPool{repoPath: repoPath}
            m.procPools[repoPath] = pool
        }
        m.mutex.Unlock()
    }
    
    return pool.getProcess()
}

func (p *ProcPool) getProcess() (*GitCatFileProcess, error) {
    p.mutex.Lock()
    defer p.mutex.Unlock()
    
    // Look for an idle process
    for _, proc := range p.processes {
        if !proc.inUse {
            proc.inUse = true
            proc.lastUsed = time.Now()
            return proc, nil
        }
    }
    
    // Check if maximum limit has been reached
    if len(p.processes) >= maxProcsPerRepo {
        return nil, errors.New("reached max processes limit for repository")
    }
    
    // Create a new process
    proc, err := newGitCatFileProcess(p.repoPath)
    if err != nil {
        return nil, err
    }
    
    p.processes = append(p.processes, proc)
    return proc, nil
}

Creating a New Subprocess

func newGitCatFileProcess(repoPath string) (*GitCatFileProcess, error) {
    cmd := exec.Command("git", "-C", repoPath, "cat-file", "--batch")
    
    stdin, err := cmd.StdinPipe()
    if err != nil {
        return nil, err
    }
    
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        stdin.Close()
        return nil, err
    }
    
    if err := cmd.Start(); err != nil {
        stdin.Close()
        stdout.Close()
        return nil, err
    }
    
    return &GitCatFileProcess{
        cmd:      cmd,
        stdin:    stdin,
        stdout:   stdout,
        lastUsed: time.Now(),
        inUse:    true,
    }, nil
}

Releasing a Subprocess

func (m *GitCatFileManager) Release(proc *GitCatFileProcess) {
    proc.mutex.Lock()
    proc.inUse = false
    proc.lastUsed = time.Now()
    proc.mutex.Unlock()
}

Periodic Cleanup

func (m *GitCatFileManager) StartCleaner(interval time.Duration) {
    ticker := time.NewTicker(interval)
    
    go func() {
        for range ticker.C {
            m.cleanIdleProcesses()
        }
    }()
}

func (m *GitCatFileManager) cleanIdleProcesses() {
    now := time.Now()
    m.mutex.Lock()
    defer m.mutex.Unlock()
    
    for repoPath, pool := range m.procPools {
        pool.mutex.Lock()
        
        activeProcs := make([]*GitCatFileProcess, 0, len(pool.processes))
        for _, proc := range pool.processes {
            proc.mutex.Lock()
            if !proc.inUse && now.Sub(proc.lastUsed) > m.idleTimeout {
                // Close long-idle processes
                proc.stdin.Close()
                proc.cmd.Process.Kill()
                proc.cmd.Wait() // Avoid zombie processes
                proc.mutex.Unlock()
            } else {
                proc.mutex.Unlock()
                activeProcs = append(activeProcs, proc)
            }
        }
        
        pool.processes = activeProcs
        pool.mutex.Unlock()
        
        // Remove empty process pools
        if len(pool.processes) == 0 {
            delete(m.procPools, repoPath)
        }
    }
}

start the manager

// Global instance
var gitCatFileManager = NewGitCatFileManager(
    10,               // Maximum subprocesses per repository
    5*time.Minute,    // Idle timeout period
)

func init() {
    // Start the cleanup goroutine, checking once per minute
    gitCatFileManager.StartCleaner(1 * time.Minute)
}

Implementation Considerations

  • Error Handling: Detect and handle subprocess abnormal exit situations
  • Thread Safety: Use appropriate mutex locks to ensure concurrency safety
  • Resource Limits: Add a global maximum process limit to prevent resource exhaustion
  • Monitoring Metrics: Add monitoring for subprocess pool usage to facilitate troubleshooting

Performance Expectations

  • Reduced Latency: Most requests use already-initialized subprocesses, avoiding startup overhead
  • Increased Throughput: Reduced system-level call overhead in high-concurrency scenarios
  • Lowered Resource Consumption: Control of total subprocess count prevents excessive resource usage

Drawbacks

  • Increased Complexity: The solution adds complexity to Gitea's codebase with new data structures, synchronization mechanisms, and lifecycle management that will need to be maintained.
  • Memory Footprint: Long-running subprocesses will consume more memory over time compared to short-lived ones. Each cached subprocess maintains open file handles and memory buffers.
  • UnReleased sub process maybe stuck forever, there should be timeout for a subprocess session.
  • When a git cat-file --batch run for a long time and repository updated, what will happen.

TODO

  • Implement a basic version and conduct performance benchmark tests
  • Consider adding subprocess health check mechanisms
  • Integrate with Gitea's monitoring and trace system
@lunny lunny added the type/proposal The new feature has not been accepted yet but needs to be discussed first. label Mar 20, 2025
@wxiaoguang
Copy link
Contributor

If we do not like creating sub-processes, why not just use "gogit"?

@lunny
Copy link
Member Author

lunny commented Mar 21, 2025

I think gogit has its own problems.

I don't whether it's fast enough to just replace cat-file --batch. Maybe an evaluation is necessary.

@wxiaoguang
Copy link
Contributor

But I do not think you made it right: #33954 (comment)

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Mar 21, 2025

I think gogit has its own problems.

We only need the " cat-file / attr-check " (reading some entry information) in this case, other problems won't really happen. And maybe go-git v6 will have sha256 support.

@lunny
Copy link
Member Author

lunny commented Mar 21, 2025

I think gogit has its own problems.

We only need the " cat-file / attr-check " (reading some entry information) in this case, other problems won't really happen. And maybe go-git v6 will have sha256 support.

Yes. That's a better alternative than this one if the performance is enough and sha256 will be supported.

@TheFox0x7
Copy link
Contributor

I've looked briefly at gitlab's gitaly and it looks like they are just using git under the hood there (no idea if they have some patches on top of it) and that got me curious - what benefit does gogit bring over git binary, besides not having to carry the git binary with gitea? Is it that some operations are faster there?

@eeyrjmr
Copy link
Contributor

eeyrjmr commented Mar 21, 2025

I've looked briefly at gitlab's gitaly and it looks like they are just using git under the hood there (no idea if they have some patches on top of it) and that got me curious - what benefit does gogit bring over git binary, besides not having to carry the git binary with gitea? Is it that some operations are faster there?

you still need local git installed even for gogit.
gogit use to be faster for windows but I think improvements in go and/or git in general have removed whatever the slowdown is. I am now using the non-gogit on windows because of this (and sha256)

@lunny
Copy link
Member Author

lunny commented Mar 22, 2025

Gitaly used to rely on libgit2 about two years ago but later switched to using native Git commands. It now utilizes git cat-file --batch and git cat-file --batch-command, which is similar to Gitea’s approach of spawning a new subprocess for each request. I think they have implemented a similiar subprocess cache https://gitlab.com/gitlab-org/gitaly/-/tree/master/internal/git/catfile?ref_type=heads

@eeyrjmr Have you run any benchmark tests comparing the two Gitea binaries?

Unlike native Git, go-git lacks features like git cat-file --batch, which makes it significantly slower when working with large repositories such as the Linux kernel. One potential solution is to implement a mechanism similar to git cat-file --batch based on go-git to improve performance.

@TheFox0x7
Copy link
Contributor

You're right, they have a cache layer for cat-file so it might be a good idea to look through it for reference. I think gitea will end up supporting gitaly for distributed scenarios so it's not a bad idea to mirror some solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.
Projects
None yet
Development

No branches or pull requests

4 participants