ggshield and pygitguardian disagree on the way to check the maximum document size
The problem
Sometimes a file is considered small enough to be scanned by ggshield, but is then rejected by py-gitguardian. This is because ggshield checks the character length of the decoded string, whereas py-gitguardian checks the byte length of the utf-8 encoded string.
ggshield can't simply use the byte length of the file it reads, though: the file may not be utf-8, and the GitGuardian API expects utf-8 encoded content.
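To make the mismatch concrete, here is a small illustrative snippet (not code from either project) showing how the two checks disagree as soon as the content contains non-ASCII characters:

```python
# Character count vs utf-8 byte count of the same string.
text = "é" + "e" * 10

print(len(text))                  # 11 characters: what ggshield currently checks
print(len(text.encode("utf-8")))  # 12 bytes ("é" encodes to 2 bytes): what py-gitguardian checks
```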
This file, 561.txt, consists of one é character followed by enough e characters that the encoded size is one byte over the limit. It can be used to reproduce the bug:
$ ggshield secret scan path 561.txt
Scanning Path... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1 files scanned out of 1 0:00:00
Error: Scanning failed: file exceeds the maximum allowed size of 1048576B
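For reference, a file like 561.txt can be generated with a short script along these lines (a sketch that assumes both checks reject only content strictly larger than the 1048576 limit reported above, counted in characters on the ggshield side and in bytes on the API side):

```python
# Sketch: build a file whose character count is exactly at the limit but whose
# utf-8 encoded size is one byte over it, because the leading "é" takes 2 bytes.
MAX_DOC_SIZE = 1048576  # limit reported in the error message above

content = "é" + "e" * (MAX_DOC_SIZE - 1)
assert len(content) == MAX_DOC_SIZE                      # passes the character-length check
assert len(content.encode("utf-8")) == MAX_DOC_SIZE + 1  # fails the byte-length check

with open("561.txt", "w", encoding="utf-8") as f:
    f.write(content)
```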
Discarded solutions
Do not check the document size on the ggshield side
This would make ggshield load and decode all files in memory, including huge ones, only to have them rejected by py-gitguardian. That is wasteful.
Change ggshield check to use the encoded size
This would mean each file would be encoded twice: once by ggshield, then again by py-gitguardian. ggshield cannot pass the encoded content to py-gitguardian, because py-gitguardian only accepts decoded content.
Proposed solution
The proposed solution is to make ggshield do the utf-8 encoding (and check the size) and make py-gitguardian accept encoded content.
Details:
Add a static encode_content(str) -> bytes method to pygitguardian.Document. This would perform what the post_load methods of DocumentSchema do. Replace said post_load methods with a call to Document.encode_content().
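A rough sketch of what this method could look like, assuming the current post_load processing amounts to the 0x00 to 0xFFFD replacement mentioned below plus the utf-8 encoding (the actual DocumentSchema code should be checked):

```python
class Document:
    # Only the proposed method is sketched here, not the real pygitguardian.Document class.
    @staticmethod
    def encode_content(content: str) -> bytes:
        # Replace NUL characters the way the post_load step is assumed to do today,
        # then encode to utf-8 so the byte length can be checked directly.
        return content.replace("\x00", "\uFFFD").encode("utf-8")
```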
Make the GGClient.content_scan() and GGClient.multi_content_scan() methods accept bytes for document in addition to str. When document is bytes, the scan methods can assume the content is utf-8 and just check the byte length, without encoding.
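The str/bytes handling inside the scan methods could look roughly like this (a sketch, not actual GGClient code; prepare_document and DOCUMENT_SIZE_THRESHOLD_BYTES are illustrative names, and it reuses encode_content() from the sketch above):

```python
from typing import Union

DOCUMENT_SIZE_THRESHOLD_BYTES = 1048576  # illustrative name for the existing limit


def prepare_document(document: Union[str, bytes]) -> bytes:
    if isinstance(document, bytes):
        # The caller already provides utf-8 bytes: trust them, no re-encoding needed.
        encoded = document
    else:
        encoded = Document.encode_content(document)
    if len(encoded) > DOCUMENT_SIZE_THRESHOLD_BYTES:
        raise ValueError(
            f"file exceeds the maximum allowed size of {DOCUMENT_SIZE_THRESHOLD_BYTES}B"
        )
    return encoded
```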
Make ggshield encode the content it scans into utf-8 using Document.encode_content(), then make it pass the encoded content to GGClient.multi_content_scan(). Nice side-effect: if the content is already utf-8, no need to decode and encode it (need to check how this can be made to work with the 0x0 to 0xFFFD replacement, maybe use a single-byte utf-8 character as a replacement instead).
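On the ggshield side, the file-loading step could then look something like this (a sketch with illustrative names that reuses the Document.encode_content() sketch above; the latin-1 fallback only stands in for whatever encoding detection ggshield already performs):

```python
def load_document_for_scan(path: str) -> bytes:
    # Hypothetical helper: read a file and return utf-8 bytes ready to pass to
    # GGClient.multi_content_scan().
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")  # validation only
    except UnicodeDecodeError:
        # Not utf-8: decode (stand-in detection), then go through the proposed
        # Document.encode_content() to get utf-8 bytes.
        return Document.encode_content(raw.decode("latin-1"))
    # Already valid utf-8: skip the decode/encode round trip. Note that the
    # 0x0 to 0xFFFD replacement question raised above still applies here.
    return raw
```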