ggshield and pygitguardian disagree on the way to check the maximum document size
The problem
Sometimes a file is considered small enough to be scanned by ggshield, but is then rejected by py-gitguardian. This is because ggshield checks the character length of the decoded string, whereas py-gitguardian checks the byte length of the utf-8 encoded string.
ggshield can't simply use the byte length of the file it reads, though: the file may not be utf-8, and the GitGuardian API expects utf-8 encoded content.
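To make the mismatch concrete, here is a small illustrative snippet (not code from either project) showing how the two checks disagree as soon as the content contains non-ASCII characters:

```python
# Character count vs utf-8 byte count of the same string.
text = "é" + "e" * 10

print(len(text))                  # 11 characters: what ggshield currently checks
print(len(text.encode("utf-8")))  # 12 bytes ("é" encodes to 2 bytes): what py-gitguardian checks
```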
This file, 561.txt, consists of one é character followed by enough e characters that the encoded size is one byte over the limit. It can be used to reproduce the bug:
$ ggshield secret scan path 561.txt
Scanning Path... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1 files scanned out of 1 0:00:00
Error: Scanning failed: file exceeds the maximum allowed size of 1048576B
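For reference, a file like 561.txt can be generated with a short script along these lines (a sketch that assumes both checks reject only content strictly larger than the 1048576 limit reported above, counted in characters on the ggshield side and in bytes on the API side):

```python
# Sketch: build a file whose character count is exactly at the limit but whose
# utf-8 encoded size is one byte over it, because the leading "é" takes 2 bytes.
MAX_DOC_SIZE = 1048576  # limit reported in the error message above

content = "é" + "e" * (MAX_DOC_SIZE - 1)
assert len(content) == MAX_DOC_SIZE                      # passes the character-length check
assert len(content.encode("utf-8")) == MAX_DOC_SIZE + 1  # fails the byte-length check

with open("561.txt", "w", encoding="utf-8") as f:
    f.write(content)
```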
Discarded solutions
Do not check the document size on the ggshield side
This would make ggshield load and decode all files in memory, including huge ones, only to have them rejected by py-gitguardian. That is wasteful.
Change ggshield check to use the encoded size
This would mean each file would be encoded twice: once by ggshield, then again by py-gitguardian. ggshield cannot pass the encoded content to py-gitguardian, because py-gitguardian only accepts decoded content.
Proposed solution
The proposed solution is to make ggshield do the utf-8 encoding (and check the size) and make py-gitguardian accept encoded content.
Details:
Add a static encode_content(str) -> bytes method to pygitguardian.Document. This would perform what the post_load methods of DocumentSchema do. Replace said post_load methods with a call to Document.encode_content().
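A rough sketch of what this method could look like, assuming the current post_load processing amounts to the 0x00 to 0xFFFD replacement mentioned below plus the utf-8 encoding (the actual DocumentSchema code should be checked):

```python
class Document:
    # Only the proposed method is sketched here, not the real pygitguardian.Document class.
    @staticmethod
    def encode_content(content: str) -> bytes:
        # Replace NUL characters the way the post_load step is assumed to do today,
        # then encode to utf-8 so the byte length can be checked directly.
        return content.replace("\x00", "\uFFFD").encode("utf-8")
```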
Make the GGClient.content_scan() and GGClient.multi_content_scan() methods accept bytes for document in addition to str. When document is bytes, the scan methods can assume the content is utf-8 and just check the byte length, without encoding.
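The str/bytes handling inside the scan methods could look roughly like this (a sketch, not actual GGClient code; prepare_document and DOCUMENT_SIZE_THRESHOLD_BYTES are illustrative names, and it reuses encode_content() from the sketch above):

```python
from typing import Union

DOCUMENT_SIZE_THRESHOLD_BYTES = 1048576  # illustrative name for the existing limit


def prepare_document(document: Union[str, bytes]) -> bytes:
    if isinstance(document, bytes):
        # The caller already provides utf-8 bytes: trust them, no re-encoding needed.
        encoded = document
    else:
        encoded = Document.encode_content(document)
    if len(encoded) > DOCUMENT_SIZE_THRESHOLD_BYTES:
        raise ValueError(
            f"file exceeds the maximum allowed size of {DOCUMENT_SIZE_THRESHOLD_BYTES}B"
        )
    return encoded
```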
Make ggshield encode the content it scans into utf-8 using Document.encode_content(), then make it pass the encoded content to GGClient.multi_content_scan(). Nice side-effect: if the content is already utf-8, no need to decode and encode it (need to check how this can be made to work with the 0x0 to 0xFFFD replacement, maybe use a single-byte utf-8 character as a replacement instead).
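On the ggshield side, the file-loading step could then look something like this (a sketch with illustrative names that reuses the Document.encode_content() sketch above; the latin-1 fallback only stands in for whatever encoding detection ggshield already performs):

```python
def load_document_for_scan(path: str) -> bytes:
    # Hypothetical helper: read a file and return utf-8 bytes ready to pass to
    # GGClient.multi_content_scan().
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")  # validation only
    except UnicodeDecodeError:
        # Not utf-8: decode (stand-in detection), then go through the proposed
        # Document.encode_content() to get utf-8 bytes.
        return Document.encode_content(raw.decode("latin-1"))
    # Already valid utf-8: skip the decode/encode round trip. Note that the
    # 0x0 to 0xFFFD replacement question raised above still applies here.
    return raw
```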