Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Problem] DXF files produced by older AutoCAD versions where text contains non-ASCII characters cannot be imported/opened by either the legacy or non-legacy DXF importer. #8704

Closed
2 tasks done
kpemartin opened this issue Feb 28, 2023 · 17 comments
Labels
File format: DXF WB Draft Related to the Draft Workbench

Comments

@kpemartin
Copy link
Contributor

kpemartin commented Feb 28, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Version

0.20 (Release)

Full version info

OS: Windows 7 Version 6.1 (Build 7601: SP 1)
Word size of FreeCAD: 64-bit
Version: 0.20.2.29177 +426 (Git)
Build type: Release
Branch: (HEAD detached from 0.20.2)
Hash: 930dd9a76203a3260b1e6256c70c1c3cad8c5cb8
Python 3.8.10, Qt 5.15.2, Coin 4.0.1, Vtk 8.2.0, OCC 7.6.3
Locale: English/United States (en_US)

Also: AutoCAD version 13 (yes, it is that old)

Subproject(s) affected?

None

Problem description

Per the DXF format documentation at https://help.autodesk.com/view/OARX/2018/ENU/?guid=GUID-2553CF98-44F6-4828-82DD-FE3BC7448113 ("Storage of String Values" section), text in the DXF file is encoded as UTF-8 for versions 2007 and beyond ($ACADVER >= "AC1021"), but as "Plain ASCII" [sic] for versions 2004 and previous ($ACADVER <= "AC1018").

In the latter case, examination of the contents of the DXF file reveal that the encoding is not really "ASCII" because it contains bytes larger then 0x7F. Such characters can be entered using ALT+0ddd keyboarding on the numeric keypad or by pasting from another program such as Character Map, or dimension text for angular dimensions can contain a degree sign. These strings can also contain in-band encoding syntax \U+xxxx to select Unicode code points above 0xFF, a feature only useful if the text font was one that supported wide character sets.

Autocad Screen.png

The way AutoCAD renders characters above 0x7F on the screen indicates that the values are actually ISO-6659-1 (or perhaps Windows-1252) with no encoding (i.e. raw bytes). Thus the byte 0xB0 in the text in the sample DXF file renders as a degree sign, and the two bytes 0xC2 0xB0, which are the bytes of the UTF-8 encoding for the degree sign (0xB0), render as an accented uppercase A and (somewhat coincidentally) a degree sign.

The DXF file contains a variable called "DWGCODEPAGE" whose value (in this case) is "ansi-1252" which is "set to the system code page when a new drawing is created, but not otherwise maintained by AutoCAD", so perhaps this is the encoding that should be used for versions 2004 and previous.

Both importers attempt to interpret the text as UTF-8 and fail because the single byte B0 is a continuation byte in that encoding, and it is not preceded by a start byte in the range C0-FF. The actual failure point is different for the two importers.

The legacy importer is handling the character encoding a block (2k bytes?) at a time within the text reader so it gets the error as it tries to read the first character of the block that contains the string text. It would appear that the new importer initially creates
the strings with the raw bytes from the file, and later calls Draft.convertDraftTexts which attempts to convert the text in all text objects from raw bytes to proper text assuming UTF-8 encoding.

Characters causing this problem can appear in TEXT, MTEXT, DIMENSION text, block definition ATTDEF, block insertion ATTRIB and perhaps other places, but apparently cannot occur in things such as layer and text style names. This means that it is safe to read the HEADER section with no encoding, allowing determination of ACADVER and DWGCODEPAGE.


To reproduce the problem using the legacy importer:
In Edit->Preferences..., select Import-Export in the left panel, then the DXF tab in the right panel. Ensure "Use legacy python importer" is checked. Ensure "Import ... texts and dimensions" is checked. Select "Create ... simple Part shapes" (note that I'm not
sure if the latter matters).
Open or import the attached DXF file.
The open/import fails:

07:07:25  opening C:/Users/KPMartin/Documents/FreeCAD Issues/Encoding Error/Encoding-error.dxf...
07:07:25  Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\FreeCAD 0.20\Mod\Draft\importDXF.py", line 2781, in open
    processdxf(doc, filename)
  File "C:\Program Files\FreeCAD 0.20\Mod\Draft\importDXF.py", line 2191, in processdxf
    drawing = dxfReader.readDXF(filename)
  File "C:\Users\KPMartin\AppData\Roaming\FreeCAD\Macro\dxfReader.py", line 372, in readDXF
    drawing = sm.run(infile)
  File "C:\Users\KPMartin\AppData\Roaming\FreeCAD\Macro\dxfReader.py", line 95, in run
    (newState, cargo) = handler(cargo)
  File "C:\Users\KPMartin\AppData\Roaming\FreeCAD\Macro\dxfReader.py", line 288, in start_section
    obj = handleObject(infile)
  File "C:\Users\KPMartin\AppData\Roaming\FreeCAD\Macro\dxfReader.py", line 196, in handleObject
    line = infile.readline()
  File "C:\Program Files\FreeCAD 0.20\bin\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
<class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xb0 in position 418: invalid start byte

To reproduce the problem using the new importer:
In Edit->Preferences..., select Import-Export in the left panel, then the DXF tab in the right panel. Ensure "Use legacy python importer" is unchecked. Ensure "Import ... texts and dimensions" is checked. Select "Create ... simple Part shapes" (note that I'm not sure if the latter matters).
Open or import the attached DXF file.
The open/import fails:

07:04:01  This function will be deprecated. Please use 'convert_draft_texts'.
07:04:01  ----------------
07:04:01  Convert Draft texts
07:04:01  Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\FreeCAD 0.20\Mod\Draft\importDXF.py", line 2797, in open
    Draft.convertDraftTexts() # convert annotations to Draft texts
  File "C:\Program Files\FreeCAD 0.20\Mod\Draft\draftmake\make_text.py", line 210, in convertDraftTexts
    return convert_draft_texts(textslist)
  File "C:\Program Files\FreeCAD 0.20\Mod\Draft\draftmake\make_text.py", line 191, in convert_draft_texts
    new_obj = make_text(obj.LabelText, placement=obj.Position)
<class 'UnicodeError'>: UTF8 conversion failure at PropertyStringList::getPyObject()

Note: I have a fix for the legacy importer (not included here) but it is not optimal as it reads the entire first SECTION (the one containing the variables) twice, and only works with python 3.7 and up which allows an existing file object to be reconfigured with a new encoding.

Attachments (all in the single zip file Encoding Error.zip):
Encoding-error.dxf: The DXF file to be imported/opened
Autocad Screen.png: A screenshot of the text in the AutoCAD window
LegacyError.txt: The text from the FreeCAD Report View after failing to import the file using the legacy importer
NewError.png: The error message box after failing to import the file using the new importer
NewError.txt: The text from the FreeCAD Report View after failing to import the file using the new importer

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@luzpaz luzpaz added File format: DXF WB Draft Related to the Draft Workbench labels Mar 1, 2023
@luzpaz
Copy link
Contributor

luzpaz commented Mar 1, 2023

Updated original ticket to make it a little bit easier for devs to grok the issue. Thanks for being thorough!

@kpemartin
Copy link
Contributor Author

Thanks for the edit. I'm not very familiar with this web interface yet.

I've been looking at the new importer code to get a better idea of the origin of the error there. The 'obj' in the traceback is an App::Annotation object. The importer initially converts DXF text into App::Annotation objects, and convertDraftTexts converts these to App:FeaturePython/Text objects. It appears that, when creating the App::Annotation, the LabelText is set by calling setValues (in C++) which expects UTF-8-encoded std::string(s). The fault appears to be in ImpExpDxfRead::OnReadText or ImpExpDxfRead::Deformat, which should behave differently depending on the AutoCad version from the DXF file; for newer versions the text can be used as read from the file, but for older versions the text must be UTF-8 encoded before setting it into the LabelText property.

@luzpaz
Copy link
Contributor

luzpaz commented Mar 3, 2023

Kind of similar to #7987

@kpemartin
Copy link
Contributor Author

There is some similarity to #7987 but only insofar as being related to text encoding and the assumption of a particular encoding (UTF-8) rather than relying on information in the file to specify the encoding (I am assuming here that a STEP file is a text file that would start with a BOM if it were UTF-8).

I think the similarity ends there though, different importer (DXF vs STEP), different string context (drawing text vs layer/object name)

@luzpaz
Copy link
Contributor

luzpaz commented Mar 8, 2023

@Roy-043 any chance you can take a look at this ?

@yorikvanhavre
Copy link
Member

Yes, I believe the way non-UTF-8 strings are encoded in DXF is fundamentally different than how it's done in STEP so there is little chance a common solution could work.
Can you share your Python solution @kpemartin ? Even if it does only solve for the Python importer, and even if it's not optimized, that would at least offer a solution, and possibly give hints to develop the C++ counterpart

@kpemartin
Copy link
Contributor Author

Ok, I'm new at this forum. What would be the best way to provide the changes?

@yorikvanhavre
Copy link
Member

The ideal way (long version at https://wiki.freecad.org/Source_code_management ):

  • Fork the FreeCAD repo
  • Do your changes on your own copy (this is ideally done on your computer, but can be done online on github too)
  • Commit your changes to your own copy
  • Open a Pull Request to have your changes merged into the main FreeCAD code.

This might be a bit daunting at first but if you plan to work more with FreeCAD (or any other FOSS project) and write code, it's totally worth learning it ;)

Alternatively, just attach the modified file here.

@kpemartin
Copy link
Contributor Author

Yorik, the file I have changes to is DXFReader.py which is not under the main FreeCAD repository, but under your own git repository so I've also done the fork/clone on that repository.

I've just put the changes into my copy and created a pull request

@kpemartin
Copy link
Contributor Author

The C++ version doesn't seem to interpret the HEADER section at all, so to get a fix there this would have to be added, at least enough to find $ACADVER and $DWGCODEPAGE. That's probably the hard part.
The actual encoding change would be in either ImpExpDxfRead::OnReadText or ImpExpDxfRead::Deformat, at some point checking the DXF encoding to ensure utf-8 bytes are passed to setValues on the App::Annotation's LabelText member.

To avoid reading the entire HEADER section twice in the legacy importer, another approach to this change would be to read the entire file with encoding=None, and instead have the convert function do the appropriate conversion from the bytes read to a Python string. A global variable would tell it what encoding to use. Reading the HEADER would have to look for the two variables on the way by, instead of the current blind collecting of section contents.

I'm not sure the size of the HEADER section is big enough to worry about re-reading it, we're talking about perhaps 4k bytes here.

@yorikvanhavre
Copy link
Member

yorikvanhavre commented Mar 9, 2023

IIRC the c++ importer already reads some of the header variables. Maybe it's not that hard. Another option would be to split the problem into several. For ex, to begin with, let the user specify the encoding manually via python, ex: dxfimport.open(filename, encoding="xxx") then later on we figure out the rest

@kpemartin
Copy link
Contributor Author

I've implemented the code in the C++ importer and it appears to work, though I had some difficulty testing this:

I'm running into problems due to this change which causes the C++ importer to get its settings from a new location. However there appears to be no corresponding change to have the Preferences dialog place the settings in the new location, and as a result I'm getting the default of 'false' for the option to import annotations and text. Needless to say, this making testing text import rather difficult.

Although I see merit in the fact that DXF import settings don't really belong under Mod/Draft, this change by itself breaks several things: The Preferences still places the settings in the old location, so the settings used by the C++ importer cannot be changed from the defaults of (no text/annotation import, no layer import, and scaling=1.0), and the legacy importer has not been changed to look in the new location.

This change is about 8 months old so it does not appear to be part of some other work in progress, and there is nothing in the change log to justify the change. I don't know how (if it is possible) to refer back to some Issue that might have prompted the change. From my (perhaps naïve) viewpoint this looks like a bad change that should be rolled back.

Because the ctor for ImpExpDxfRead loads the options from the old place, then Import::Module::readDXF in AppImportPy.cpp makes it load from the new place, I've added a temporary workaround to make ImpExpDxfRead use its existing settings as defaults if the specific property is not found in the given options source.

Should this just be addressed in a new Issue?

@kpemartin
Copy link
Contributor Author

I've just made a PR on the FreeCAD repository for the fix to the C++ importer, to go along with my previous PR for the legacy importer. Once both these are honoured, this issue can be closed as fixed.

@yorikvanhavre
Copy link
Member

The python importer PR looks good, I just suggested a couple of changes, have a look if it makes sense to you. The C++ is harder to assess, I'm looking at it now :)

@kpemartin
Copy link
Contributor Author

Well, I messed up when trying to remove some lint I had introduced and also trying to rebase my changes, and somehow closed #8862, so I'm putting in the changes fresh as #9060.

There are still lint warnings in the files I have changed, but not in the part of the code I've changed. Should I be trying to clean up these warnings while I'm changing the file, even if I did not introduce them? Should I just make a separate PR to clean up the lint? Or should I just ignore it?

@luzpaz
Copy link
Contributor

luzpaz commented Mar 26, 2023

Should I just make a separate PR to clean up the lint?

If you don't mind, please do make a separate PR for the linting. It's just easier to rollback commits if they are more targeted. TIA!

@kpemartin
Copy link
Contributor Author

#9060 has been merged into the main branch so this issue is resolved. I'll do the linting as an unrelated PR later.

Syres916 added a commit to Syres916/FreeCAD that referenced this issue Jun 2, 2023
See discussion https://forum.freecad.org/viewtopic.php?t=78719

I don't believe FreeCAD#8704 is fixed even after this change, the example file Encoding-error.dxf which can be opened by Varicad Viewer still opens error free but has nothing in the 3D view.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
File format: DXF WB Draft Related to the Draft Workbench
Projects
None yet
Development

No branches or pull requests

3 participants