New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Problem] DXF files produced by older AutoCAD versions where text contains non-ASCII characters cannot be imported/opened by either the legacy or non-legacy DXF importer. #8704
Comments
Updated original ticket to make it a little bit easier for devs to grok the issue. Thanks for being thorough! |
Thanks for the edit. I'm not very familiar with this web interface yet. I've been looking at the new importer code to get a better idea of the origin of the error there. The 'obj' in the traceback is an App::Annotation object. The importer initially converts DXF text into App::Annotation objects, and convertDraftTexts converts these to App:FeaturePython/Text objects. It appears that, when creating the App::Annotation, the LabelText is set by calling setValues (in C++) which expects UTF-8-encoded std::string(s). The fault appears to be in ImpExpDxfRead::OnReadText or ImpExpDxfRead::Deformat, which should behave differently depending on the AutoCad version from the DXF file; for newer versions the text can be used as read from the file, but for older versions the text must be UTF-8 encoded before setting it into the LabelText property. |
Kind of similar to #7987 |
There is some similarity to #7987 but only insofar as being related to text encoding and the assumption of a particular encoding (UTF-8) rather than relying on information in the file to specify the encoding (I am assuming here that a STEP file is a text file that would start with a BOM if it were UTF-8). I think the similarity ends there though, different importer (DXF vs STEP), different string context (drawing text vs layer/object name) |
@Roy-043 any chance you can take a look at this ? |
Yes, I believe the way non-UTF-8 strings are encoded in DXF is fundamentally different than how it's done in STEP so there is little chance a common solution could work. |
Ok, I'm new at this forum. What would be the best way to provide the changes? |
The ideal way (long version at https://wiki.freecad.org/Source_code_management ):
This might be a bit daunting at first but if you plan to work more with FreeCAD (or any other FOSS project) and write code, it's totally worth learning it ;) Alternatively, just attach the modified file here. |
Yorik, the file I have changes to is DXFReader.py which is not under the main FreeCAD repository, but under your own git repository so I've also done the fork/clone on that repository. I've just put the changes into my copy and created a pull request |
The C++ version doesn't seem to interpret the HEADER section at all, so to get a fix there this would have to be added, at least enough to find $ACADVER and $DWGCODEPAGE. That's probably the hard part. To avoid reading the entire HEADER section twice in the legacy importer, another approach to this change would be to read the entire file with encoding=None, and instead have the convert function do the appropriate conversion from the bytes read to a Python string. A global variable would tell it what encoding to use. Reading the HEADER would have to look for the two variables on the way by, instead of the current blind collecting of section contents. I'm not sure the size of the HEADER section is big enough to worry about re-reading it, we're talking about perhaps 4k bytes here. |
IIRC the c++ importer already reads some of the header variables. Maybe it's not that hard. Another option would be to split the problem into several. For ex, to begin with, let the user specify the encoding manually via python, ex: |
I've implemented the code in the C++ importer and it appears to work, though I had some difficulty testing this: I'm running into problems due to this change which causes the C++ importer to get its settings from a new location. However there appears to be no corresponding change to have the Preferences dialog place the settings in the new location, and as a result I'm getting the default of 'false' for the option to import annotations and text. Needless to say, this making testing text import rather difficult. Although I see merit in the fact that DXF import settings don't really belong under Mod/Draft, this change by itself breaks several things: The Preferences still places the settings in the old location, so the settings used by the C++ importer cannot be changed from the defaults of (no text/annotation import, no layer import, and scaling=1.0), and the legacy importer has not been changed to look in the new location. This change is about 8 months old so it does not appear to be part of some other work in progress, and there is nothing in the change log to justify the change. I don't know how (if it is possible) to refer back to some Issue that might have prompted the change. From my (perhaps naïve) viewpoint this looks like a bad change that should be rolled back. Because the ctor for ImpExpDxfRead loads the options from the old place, then Import::Module::readDXF in AppImportPy.cpp makes it load from the new place, I've added a temporary workaround to make ImpExpDxfRead use its existing settings as defaults if the specific property is not found in the given options source. Should this just be addressed in a new Issue? |
I've just made a PR on the FreeCAD repository for the fix to the C++ importer, to go along with my previous PR for the legacy importer. Once both these are honoured, this issue can be closed as fixed. |
The python importer PR looks good, I just suggested a couple of changes, have a look if it makes sense to you. The C++ is harder to assess, I'm looking at it now :) |
Well, I messed up when trying to remove some lint I had introduced and also trying to rebase my changes, and somehow closed #8862, so I'm putting in the changes fresh as #9060. There are still lint warnings in the files I have changed, but not in the part of the code I've changed. Should I be trying to clean up these warnings while I'm changing the file, even if I did not introduce them? Should I just make a separate PR to clean up the lint? Or should I just ignore it? |
If you don't mind, please do make a separate PR for the linting. It's just easier to rollback commits if they are more targeted. TIA! |
#9060 has been merged into the main branch so this issue is resolved. I'll do the linting as an unrelated PR later. |
See discussion https://forum.freecad.org/viewtopic.php?t=78719 I don't believe FreeCAD#8704 is fixed even after this change, the example file Encoding-error.dxf which can be opened by Varicad Viewer still opens error free but has nothing in the 3D view.
Is there an existing issue for this?
Version
0.20 (Release)
Full version info
Subproject(s) affected?
None
Problem description
Per the DXF format documentation at https://help.autodesk.com/view/OARX/2018/ENU/?guid=GUID-2553CF98-44F6-4828-82DD-FE3BC7448113 ("Storage of String Values" section), text in the DXF file is encoded as UTF-8 for versions 2007 and beyond ($ACADVER >= "AC1021"), but as "Plain ASCII" [sic] for versions 2004 and previous ($ACADVER <= "AC1018").
In the latter case, examination of the contents of the DXF file reveal that the encoding is not really "ASCII" because it contains bytes larger then 0x7F. Such characters can be entered using ALT+0ddd keyboarding on the numeric keypad or by pasting from another program such as Character Map, or dimension text for angular dimensions can contain a degree sign. These strings can also contain in-band encoding syntax \U+xxxx to select Unicode code points above 0xFF, a feature only useful if the text font was one that supported wide character sets.
The way AutoCAD renders characters above 0x7F on the screen indicates that the values are actually ISO-6659-1 (or perhaps Windows-1252) with no encoding (i.e. raw bytes). Thus the byte 0xB0 in the text in the sample DXF file renders as a degree sign, and the two bytes 0xC2 0xB0, which are the bytes of the UTF-8 encoding for the degree sign (0xB0), render as an accented uppercase A and (somewhat coincidentally) a degree sign.
The DXF file contains a variable called "DWGCODEPAGE" whose value (in this case) is "ansi-1252" which is "set to the system code page when a new drawing is created, but not otherwise maintained by AutoCAD", so perhaps this is the encoding that should be used for versions 2004 and previous.
Both importers attempt to interpret the text as UTF-8 and fail because the single byte B0 is a continuation byte in that encoding, and it is not preceded by a start byte in the range C0-FF. The actual failure point is different for the two importers.
The legacy importer is handling the character encoding a block (2k bytes?) at a time within the text reader so it gets the error as it tries to read the first character of the block that contains the string text. It would appear that the new importer initially creates
the strings with the raw bytes from the file, and later calls Draft.convertDraftTexts which attempts to convert the text in all text objects from raw bytes to proper text assuming UTF-8 encoding.
Characters causing this problem can appear in TEXT, MTEXT, DIMENSION text, block definition ATTDEF, block insertion ATTRIB and perhaps other places, but apparently cannot occur in things such as layer and text style names. This means that it is safe to read the HEADER section with no encoding, allowing determination of ACADVER and DWGCODEPAGE.
To reproduce the problem using the legacy importer:
In Edit->Preferences..., select Import-Export in the left panel, then the DXF tab in the right panel. Ensure "Use legacy python importer" is checked. Ensure "Import ... texts and dimensions" is checked. Select "Create ... simple Part shapes" (note that I'm not
sure if the latter matters).
Open or import the attached DXF file.
The open/import fails:
To reproduce the problem using the new importer:
In Edit->Preferences..., select Import-Export in the left panel, then the DXF tab in the right panel. Ensure "Use legacy python importer" is unchecked. Ensure "Import ... texts and dimensions" is checked. Select "Create ... simple Part shapes" (note that I'm not sure if the latter matters).
Open or import the attached DXF file.
The open/import fails:
Note: I have a fix for the legacy importer (not included here) but it is not optimal as it reads the entire first SECTION (the one containing the variables) twice, and only works with python 3.7 and up which allows an existing file object to be reconfigured with a new encoding.
Attachments (all in the single zip file Encoding Error.zip):
Encoding-error.dxf
: The DXF file to be imported/openedAutocad Screen.png
: A screenshot of the text in the AutoCAD windowLegacyError.txt
: The text from the FreeCAD Report View after failing to import the file using the legacy importerNewError.png
: The error message box after failing to import the file using the new importerNewError.txt
: The text from the FreeCAD Report View after failing to import the file using the new importerAnything else?
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: