Skip to content

Commit

Permalink
BUG: Fixing memory leaks in read_csv (pandas-dev#23072)
Browse files Browse the repository at this point in the history
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-devgh-21353.
  • Loading branch information
zhezherun authored and Pingviinituutti committed Feb 28, 2019
1 parent 8b15fa2 commit 6219fb8
Show file tree
Hide file tree
Showing 2 changed files with 29 additions and 18 deletions.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Expand Up @@ -1382,6 +1382,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- Bug in :func:`DataFrame.to_string()` that caused representations of :class:`DataFrame` to not take up the whole window (:issue:`22984`)
- Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).
- Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
- Bug in :func:`read_csv()` in which memory leaks occurred in the C engine when parsing ``NaN`` values due to insufficient cleanup on completion or error (:issue:`21353`)
- Bug in :func:`read_csv()` in which incorrect error messages were being raised when ``skipfooter`` was passed in along with ``nrows``, ``iterator``, or ``chunksize`` (:issue:`23711`)
- Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
- Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
Expand Down
46 changes: 28 additions & 18 deletions pandas/_libs/parsers.pyx
Expand Up @@ -1070,18 +1070,6 @@ cdef class TextReader:

conv = self._get_converter(i, name)

# XXX
na_flist = set()
if self.na_filter:
na_list, na_flist = self._get_na_list(i, name)
if na_list is None:
na_filter = 0
else:
na_filter = 1
na_hashset = kset_from_list(na_list)
else:
na_filter = 0

col_dtype = None
if self.dtype is not None:
if isinstance(self.dtype, dict):
Expand All @@ -1106,13 +1094,34 @@ cdef class TextReader:
self.c_encoding)
continue

# Should return as the desired dtype (inferred or specified)
col_res, na_count = self._convert_tokens(
i, start, end, name, na_filter, na_hashset,
na_flist, col_dtype)
# Collect the list of NaN values associated with the column.
# If we aren't supposed to do that, or none are collected,
# we set `na_filter` to `0` (`1` otherwise).
na_flist = set()

if self.na_filter:
na_list, na_flist = self._get_na_list(i, name)
if na_list is None:
na_filter = 0
else:
na_filter = 1
na_hashset = kset_from_list(na_list)
else:
na_filter = 0

if na_filter:
self._free_na_set(na_hashset)
# Attempt to parse tokens and infer dtype of the column.
# Should return as the desired dtype (inferred or specified).
try:
col_res, na_count = self._convert_tokens(
i, start, end, name, na_filter, na_hashset,
na_flist, col_dtype)
finally:
# gh-21353
#
# Cleanup the NaN hash that we generated
# to avoid memory leaks.
if na_filter:
self._free_na_set(na_hashset)

if upcast_na and na_count > 0:
col_res = _maybe_upcast(col_res)
Expand Down Expand Up @@ -2059,6 +2068,7 @@ cdef kh_str_t* kset_from_list(list values) except NULL:

# None creeps in sometimes, which isn't possible here
if not isinstance(val, bytes):
kh_destroy_str(table)
raise ValueError('Must be all encoded bytes')

k = kh_put_str(table, PyBytes_AsString(val), &ret)
Expand Down

0 comments on commit 6219fb8

Please sign in to comment.