Document.append does not handle utf-8 encoding #17

Closed
tcmitchell opened this issue Nov 12, 2019 · 3 comments · Fixed by #20

@tcmitchell
Collaborator

Saw this error via Document.append():

======================================================================
ERROR: test_AAA (test.test_roundtrip.TestRoundTripSBOL2) [pICSL50014.xml] (filename='pICSL50014.xml')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/SBOL/sbol/test/test_roundtrip.py", line 82, in test_AAA
    self.run_round_trip(f)
  File "/SBOL/sbol/test/test_roundtrip.py", line 45, in run_round_trip
    split_path[0] + split_path[1]))
  File "/SBOL/sbol/document.py", line 330, in read
    self.append(filename)
  File "/SBOL/sbol/document.py", line 381, in append
    self.graph.parse(f, format="application/rdf+xml")
  File "/usr/local/lib/python3.6/dist-packages/rdflib/graph.py", line 1043, in parse
    parser.parse(source, self, **args)
  File "/usr/local/lib/python3.6/dist-packages/rdflib/plugins/parsers/rdfxml.py", line 578, in parse
    self._parser.parse(source)
  File "/usr/lib/python3.6/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python3.6/xml/sax/xmlreader.py", line 123, in parse
    buffer = file.read(self._bufsize)
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 8027: ordinal not in range(128)

The file sbol/test/SBOLTestSuite/SBOL2/pICSL50014.xml contains non-ASCII characters (see lines 136 and 137 of that file). The file passes SBOL validation.

Two alternatives seem to work:

  1. Open the file in binary mode (with open(filename, 'rb')). This is what RDFLib does (see parser.py).
  2. Stop opening the file altogether and pass the filename on to RDFLib. This approach seems the safest.

Both approaches pass the tests in test_roundtrip.py.
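
For illustration, here is a minimal sketch of the failure and of both alternatives. It uses only the standard library's xml.sax, the layer where the traceback above ends, not RDFLib or Document.append itself. The sample XML, the TextCollector class, and the parse_text helper are hypothetical names invented for this example. Passing encoding="ascii" simulates an environment whose default text encoding is ASCII.

```python
import os
import tempfile
import xml.sax

# Hypothetical sample document containing one non-ASCII character.
XML_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root><name>caf\u00e9</name></root>\n"
)


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data seen while parsing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(source):
    """Parse a filename or file object and return its character data."""
    handler = TextCollector()
    xml.sax.parse(source, handler)
    return "".join(handler.chunks)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as out:
        out.write(XML_TEXT)

    # Text mode with an ASCII codec: decoding fails when the parser reads.
    try:
        with open(path, "r", encoding="ascii") as f:
            parse_text(f)
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Alternative 1: binary mode, so the XML parser handles the decoding.
    with open(path, "rb") as f:
        from_binary = parse_text(f)

    # Alternative 2: pass the filename and let the library open the file.
    from_filename = parse_text(path)
```

In both working cases the parser receives raw bytes and decodes them itself, so the result no longer depends on the process locale.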

@tcmitchell
Collaborator Author

Here's the change I think we can make for approach two above:

diff --git a/sbol/document.py b/sbol/document.py
index 4eb9bf5..6191954 100644
--- a/sbol/document.py
+++ b/sbol/document.py
@@ -372,18 +372,17 @@ class Document(Identified):
         :return: None
         """
         self.logger.debug("Appending data from file: " + filename)
-        with open(filename, 'r') as f:
-            if not self.graph:
-                self.graph = rdflib.Graph()
-            # Save any changes we've made to the graph.
-            self.update_graph()
-            # Use rdflib to automatically merge the graphs together
-            self.graph.parse(f, format="application/rdf+xml")
-            # Clean up our internal data structures.
-            # (There's probably a more efficient way to merge.)
-            self.clear(clear_graph=False)
-            # Base our internal representation on the new graph.
-            self.parse_all()
+        if not self.graph:
+            self.graph = rdflib.Graph()
+        # Save any changes we've made to the graph.
+        self.update_graph()
+        # Use rdflib to automatically merge the graphs together
+        self.graph.parse(filename, format="application/rdf+xml")
+        # Clean up our internal data structures.
+        # (There's probably a more efficient way to merge.)
+        self.clear(clear_graph=False)
+        # Base our internal representation on the new graph.
+        self.parse_all()
 
     def parse_all(self):
         # Parse namespaces

@tcmitchell
Collaborator Author

This issue only happens when the environment variable LANG is unset, as it is in a Docker environment. In that case open() in text mode defaults to the ASCII codec, as the traceback shows. When LANG=en_US.utf8, the UTF-8 document is read properly.

Since the recommended fix works in both environments, it is probably desirable. It wouldn't be surprising to see the SBOL module used in a Docker environment.
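
As a rough diagnostic sketch, the encoding that a text-mode open() uses by default can be inspected with locale.getpreferredencoding(False). The default_text_encoding helper and the sample file are hypothetical, added only for this example. The printed value depends on the Python version and the environment, so no particular output is assumed here.

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Return the locale encoding that text-mode open() falls back to
    when no encoding argument is given."""
    return locale.getpreferredencoding(False)


# The value varies by environment: with LANG unset it may be an ASCII
# codec, while with LANG=en_US.utf8 it is a UTF-8 codec.
print("Default text encoding:", default_text_encoding())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write("caf\u00e9")

    # An explicit encoding makes the read independent of LANG.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
```

Running this inside and outside the container is one way to confirm whether the two environments differ in their default encoding.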

@jakebeal
Collaborator

jakebeal commented Nov 14, 2019 via email
