Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggshield and pygitguardian disagree on the way to check the maximum document size #561

Closed
agateau-gg opened this issue Jun 9, 2023 · 0 comments · Fixed by #631
Closed
Labels
status:confirmed This issue has been reviewed and confirmed type:bug Something isn't working

Comments

@agateau-gg
Copy link
Collaborator

agateau-gg commented Jun 9, 2023

The problem

Sometimes a file is considered as small enough to be scanned by ggshield, but then it is rejected by py-gitguardian. This is because ggshield checks the character length of the decoded string whereas py-gitguardian checks the byte length of the utf-8 encoded string.

ggshield can't just use the byte length of the file it reads though: it may not be utf-8, and GitGuardian API expects utf-8 encoded content.

This diagram summarizes the current situation:

     content in bytes (any encoding)
             ↓
---------[ggshield]--------------------
             ↓
      detect encoding
      decode content
      decoded size too large? → exit
             ↓
       decoded content
             ↓ 
------[py-gitguardian]-----------------
             ↓
      encode to utf-8
      encoded size too large? → exit
             ↓
          { HTTP }
             ↓
------[GitGuardian API]----------------
    expects utf-8 content
             ↓
    check encoded size

This file: 561.txt, consists of one é character followed by enough e characters so that the encoded size is too long by one byte. It can be used to reproduce the bug:

$ ggshield secret scan path 561.txt
Scanning Path... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1 files scanned out of 1 0:00:00

Error: Scanning failed: file exceeds the maximum allowed size of 1048576B

Discarded solutions

Do not check the document size on ggshield side

This would make ggshield load and decode all files in memory, including huge ones, only to have them rejected by py-gitguardian. That is wasteful.

Change ggshield check to use the encoded size

This would mean each file would be encoded twice: once by ggshield, then again by py-gitguardian. ggshield cannot pass the encoded content to py-gitguardian, because py-gitguardian only accepts decoded content.

Proposed solution

The proposed solution is to make ggshield do the utf-8 encoding (and check the size) and make py-gitguardian accept encoded content.

Details:

  1. Add a static encode_content(str) -> bytes method to pygitguardian.Document. This would perform what the post_load methods of DocumentSchema do. Replace said post_load methods with a call to Document.encode_content().

  2. Make GGClient.content_scan() and GGClient.multi_content_scan() methods accept bytes for document in addition to str. When document is bytes, the scan methods can assume the content is utf-8 and just check the byte length, without encoding.

  3. Make ggshield encode the content it scans into utf-8 using Document.encode_content(), then make it pass the encoded content to GGClient.multi_content_scan(). Nice side-effect: if the content is already utf-8, no need to decode and encode it (need to check how this can be made to work with the 0x0 to 0xFFFD replacement, maybe use a single-byte utf-8 character as a replacement instead).

@agateau-gg agateau-gg added type:bug Something isn't working status:confirmed This issue has been reviewed and confirmed labels Jun 9, 2023
@agateau-gg agateau-gg changed the title ggshield and pygitguardian disagree on the way to check for the maximum document size ggshield and pygitguardian disagree on the way to check the maximum document size Jun 9, 2023
@Walz Walz closed this as completed in #631 Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:confirmed This issue has been reviewed and confirmed type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant