fix: input file encoding #596

JCHacking · 2023-10-12T19:07:46Z

Input files in lock format are expected to be in a fixed encoding;
for other input files, the encoding is detected.

fixes #448

The input is now read as bytes, its encoding is detected, and the bytes are decoded to a string with that encoding.
Signed-off-by: JCHacking <juancruzmencia@gmail.com>
Refs: #448

jkowalleck · 2023-10-19T08:41:03Z

Why did you use a niche library ( https://pypi.org/project/faust-cchardet )?
Why not use a widely used library, like https://pypi.org/project/chardet ?

If it was for py3.6 compatibility: chardet provides older, compatible versions.

JCHacking · 2023-10-19T09:18:02Z

> Why did you use a niche library ( https://pypi.org/project/faust-cchardet )?
> Why not use a widely used library, like https://pypi.org/project/chardet ?
>
> If it was for py3.6 compatibility: chardet provides older, compatible versions.

That was one of the reasons. The other is that some encodings are reported under a different name, for example cp1252 is returned as Windows-1252, so the result is not exactly what needs to be passed to the decode method.

Also, faust-cchardet is a maintained fork of cchardet, which is written in C, so it performs better.

jkowalleck · 2023-10-19T09:56:43Z

re #596 (comment)

> [...] it has better performance

Performance should not be a concern at this point.
Maintainability and support of the libraries we use are more important here. So better use a well-maintained, widely used library like chardet.
If this choice turns out to be insufficient at any point, we can still report bugs to the maintainers of that library.

see #448 (comment)

Could you test the following on your system?

for poetry.lock and Pipfile.lock, use open(..., encoding="utf8"): no mode argument, just the path and the encoding
for requirements.txt, the char-detection is okay in general.
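
As a rough illustration of the suggestion above, here is a minimal sketch of reading a Pipfile.lock in text mode with an explicit encoding. The helper name `load_pipfile_lock` is hypothetical and is not taken from this PR; the sketch only assumes that Pipfile.lock is a JSON document.

```python
import json
from pathlib import Path


def load_pipfile_lock(path: Path) -> dict:
    """Hypothetical helper: parse a Pipfile.lock that is assumed to be UTF-8."""
    # Text mode with an explicit encoding, so the platform default
    # (for example cp1252 on Windows) is never used for decoding.
    with open(path, encoding="utf8") as fh:
        return json.load(fh)
```

With the encoding pinned like this, the result no longer depends on the locale of the machine that runs the tool.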

JCHacking · 2023-10-19T09:58:55Z

re #596 (comment)

> [...] it has better performance
>
> Performance should not be a concern at this point. Maintainability and support of the libraries we use are more important here. So better use a well-maintained, widely used library like chardet. If this choice turns out to be insufficient at any point, we can still report bugs to the maintainers of that library.
>
> see #448 (comment)
>
> Could you test the following on your system?
>
> for poetry.lock and Pipfile.lock, use open(..., encoding="utf8"): no mode argument, just the path and the encoding
>
> for requirements.txt, the char-detection is okay in general.

Perfect. I will switch to chardet, using a version that supports Python 3.6, and then try the other option you mention: only detecting the encoding of requirements.txt.

Changed to use the chardet library and now only the encoding in requirements.txt is inspected.
Signed-off-by: JCHacking <juancruzmencia@gmail.com>
Refs: #448

JCHacking · 2023-10-19T10:55:04Z

The change is done.

Basically, if the file is opened in byte mode, the encoding is inspected; otherwise the content is assumed to be decoded correctly already.

As for chardet, I added a replace to make it work on Windows, since it returns Windows-1252 but Python only understands cp1252.
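
The behaviour described in this comment could be sketched roughly as follows. This is not the code merged in this PR: `to_text` and `_chardet_guess` are hypothetical names, and the fallback to UTF-8 when no encoding is guessed is an assumption. The only chardet call used is `chardet.detect`, which returns a dict with an `encoding` key.

```python
from typing import Callable, Optional, Union


def _chardet_guess(data: bytes) -> Optional[str]:
    # chardet is a third-party package; it is imported lazily so that
    # this sketch can be loaded even where chardet is not installed.
    import chardet
    return chardet.detect(data)["encoding"]


def to_text(data: Union[str, bytes],
            guess: Callable[[bytes], Optional[str]] = _chardet_guess) -> str:
    """Hypothetical helper: return str input unchanged, decode bytes input."""
    if isinstance(data, str):
        # Text mode: the caller already chose an encoding, nothing to inspect.
        return data
    # Byte mode: guess the encoding (assumption: fall back to UTF-8 if unknown),
    # then apply the name replace mentioned in the comment above.
    encoding = (guess(data) or "utf-8").replace("Windows-1252", "cp1252")
    return data.decode(encoding)
```

Passing the detector in as the `guess` parameter is a choice made for this sketch only; it lets the decoding logic be exercised without chardet, for example `to_text(raw, guess=lambda b: "Windows-1252")`.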


jkowalleck · 2023-10-19T13:11:07Z

I had to do minor version range adjustments and other chores.

I will add a regression test, and then, this fix is ready to go.
Thank you for your effort, @JCHacking

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>

JCHacking · 2023-10-19T13:41:56Z

> I had to do minor version range adjustments and other chores.
>
> I will add a regression test, and then, this fix is ready to go.
> Thank you for your effort, @JCHacking

Thank you for letting me contribute to the fix.

jkowalleck · 2023-10-19T13:55:05Z

fix is available as of v3.11.3

JCHacking requested a review from a team as a code owner October 12, 2023 19:07

fix: Fix windows poetry charset error decode

9565cc1


JCHacking mentioned this pull request Oct 12, 2023

Error getting BOM file from poetry #448

Closed

JCHacking added 2 commits October 19, 2023 12:34

Merge branch 'CycloneDX:main' into char_encode

2b75b40

chore: User chardet and only inspect requirements.txt encoding

2982d7c


jkowalleck reviewed Oct 19, 2023

pyproject.toml (outdated, resolved)

JCHacking and others added 2 commits October 19, 2023 13:51

chore: Use chardet and only inspect requirements.txt encoding

1547635


stype

419cfd5

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>

jkowalleck changed the title ~~fix: Fix windows poetry charset error decode~~ fix: input file encoding Oct 19, 2023

jkowalleck added 2 commits October 19, 2023 14:50

adjust dep boundaries of chardet

6714ae3

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>

silence mypy

cf02aec

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>

add potential regression test

071976a

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>

jkowalleck approved these changes Oct 19, 2023


jkowalleck merged commit a9dda4b into CycloneDX:main Oct 19, 2023
22 checks passed

jkowalleck mentioned this pull request Oct 19, 2023

v4 finalization #535

Closed

11 tasks

JCHacking deleted the char_encode branch November 18, 2023 11:32
