COMPAT: Properly encode filenames in read_csv (pandas-dev#24758)
Python 3.6+ changes the default filesystem
encoding on Windows to UTF-8 (PEP 529), which
conflicts with the native Windows encoding (MBCS).

This fix checks whether we're running Python 3.6+
on Windows and, if so, forces the filename
encoding to "mbcs".

Closes gh-15086.
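
The encoding choice described in the commit message can be sketched in plain Python. This is a minimal illustration, not the code from the commit: `pick_filename_encoding` is a hypothetical helper name, treating both `win32` and `cygwin` as Windows is an assumption of this sketch, and the version, platform, and filesystem encoding are passed as parameters so the rule can be exercised on any system.

```python
import sys


def pick_filename_encoding(version_info=sys.version_info[:2],
                           platform=sys.platform,
                           fs_encoding=None):
    """Return the codec name used to turn a text filename into bytes.

    Hypothetical helper mirroring the rule in the commit message:
    on Windows with Python 3.6+, force "mbcs"; otherwise use the
    filesystem encoding, falling back to "utf-8" if it is empty.
    """
    if fs_encoding is None:
        # Ask the running interpreter when no value is supplied.
        fs_encoding = sys.getfilesystemencoding()

    # Assumption: these two platform strings count as Windows.
    is_windows = platform in ("win32", "cygwin")

    if version_info >= (3, 6) and is_windows:
        return "mbcs"
    return fs_encoding or "utf-8"


# Only the codec name is computed here; actually encoding with "mbcs"
# works on Windows only, so the sketch stops short of calling encode().
chosen = pick_filename_encoding()
```

Passing the inputs explicitly keeps the decision testable without a Windows machine; a caller on the real platform would simply use the defaults.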
gfyoung authored and Pingviinituutti committed Feb 28, 2019
1 parent a362e92 commit ee1be2c
Showing 3 changed files with 20 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -1790,6 +1790,7 @@ I/O
 - Bug in :meth:`DataFrame.to_dict` when the resulting dict contains non-Python scalars in the case of numeric data (:issue:`23753`)
 - :func:`DataFrame.to_string()`, :func:`DataFrame.to_html()`, :func:`DataFrame.to_latex()` will correctly format output when a string is passed as the ``float_format`` argument (:issue:`21625`, :issue:`22270`)
 - Bug in :func:`read_csv` that caused it to raise ``OverflowError`` when trying to use 'inf' as ``na_value`` with integer index column (:issue:`17128`)
+- Bug in :func:`read_csv` that caused the C engine on Python 3.6+ on Windows to improperly read CSV filenames with accented or special characters (:issue:`15086`)
 - Bug in :func:`read_fwf` in which the compression type of a file was not being properly inferred (:issue:`22199`)
 - Bug in :func:`pandas.io.json.json_normalize` that caused it to raise ``TypeError`` when two consecutive elements of ``record_path`` are dicts (:issue:`22706`)
 - Bug in :meth:`DataFrame.to_stata`, :class:`pandas.io.stata.StataWriter` and :class:`pandas.io.stata.StataWriter117` where a exception would leave a partially written and invalid dta file (:issue:`23573`)
8 changes: 7 additions & 1 deletion pandas/_libs/parsers.pyx
@@ -677,7 +677,13 @@ cdef class TextReader:

         if isinstance(source, basestring):
             if not isinstance(source, bytes):
-                source = source.encode(sys.getfilesystemencoding() or 'utf-8')
+                if compat.PY36 and compat.is_platform_windows():
+                    # see gh-15086.
+                    encoding = "mbcs"
+                else:
+                    encoding = sys.getfilesystemencoding() or "utf-8"
+
+                source = source.encode(encoding)

             if self.memory_map:
                 ptr = new_mmap(source)
12 changes: 12 additions & 0 deletions pandas/tests/io/parser/test_common.py
@@ -1904,6 +1904,18 @@ def test_suppress_error_output(all_parsers, capsys):
     assert captured.err == ""


+def test_filename_with_special_chars(all_parsers):
+    # see gh-15086.
+    parser = all_parsers
+    df = DataFrame({"a": [1, 2, 3]})
+
+    with tm.ensure_clean("sé-es-vé.csv") as path:
+        df.to_csv(path, index=False)
+
+        result = parser.read_csv(path)
+        tm.assert_frame_equal(result, df)
+
+
 def test_read_table_deprecated(all_parsers):
     # see gh-21948
     parser = all_parsers
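
The round trip exercised by the new test can be approximated without pandas. The following is a rough sketch using only the standard library `csv` module, not the test from this commit; `roundtrip_csv` is a hypothetical helper, and the sketch only checks that the running interpreter and platform can create and reopen a file with an accented name.

```python
import csv
import os
import tempfile


def roundtrip_csv(filename, rows):
    """Write rows to a CSV called `filename` in a temp dir and read them back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)

        # newline="" is the csv module's recommended setting for file handles.
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)

        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.reader(handle))


# Same accented filename as the test above; the csv module returns strings.
rows = [["a"], ["1"], ["2"], ["3"]]
result = roundtrip_csv("sé-es-vé.csv", rows)
```

If the filename could not be opened on the current platform, the call would raise an error rather than return mismatched data, which is the same failure mode the pandas test guards against.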