Plotting datetime values from Pandas dataframe #5550

arc-jim · 2015-11-23T19:52:20Z

This appears to be a new issue in 1.5.0.

The script below attempts to plot two 2-D graphs whose X and Y values are Pandas series. The issue seems to occur when pyplot is passed a datetime column whose index does not contain the value 0. Note how the second dataframe contains only odd indices (1, 3, 5, etc.).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# create sample dataframe with preset dates and values columns
dates = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
values = np.sin(np.array(range(len(dates))))
df = pd.DataFrame({'dates': dates, 'values': values})

# matplotlib figure + two subplots for comparison
fig, axes = plt.subplots(1, 2)

# create two dataframes for comparison - one with all indices, including 0, and one with only odd indices
with_zero_index = df.copy()
without_zero_index = df[np.array(df.index) % 2 == 1].copy()

# plot both - note how second plot fails without a 0 index
axes[0].plot(with_zero_index['dates'], with_zero_index['values'])
axes[1].plot(without_zero_index['dates'], without_zero_index['values'])

Stack trace:

KeyError                                  Traceback (most recent call last)
<ipython-input-29-28e878247c17> in <module>()
     17 # plot both - note how second plot fails without a 0 index
     18 axes[0].plot(with_zero_index['dates'], with_zero_index['values'])
---> 19 axes[1].plot(without_zero_index['dates'], without_zero_index['values'])

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
   1809                     warnings.warn(msg % (label_namer, func.__name__),
   1810                                   RuntimeWarning, stacklevel=2)
-> 1811             return func(ax, *args, **kwargs)
   1812         pre_doc = inner.__doc__
   1813         if pre_doc is None:

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in plot(self, *args, **kwargs)
   1425             kwargs['color'] = c
   1426 
-> 1427         for line in self._get_lines(*args, **kwargs):
   1428             self.add_line(line)
   1429             lines.append(line)

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _grab_next_args(self, *args, **kwargs)
    384                 return
    385             if len(remaining) <= 3:
--> 386                 for seg in self._plot_args(remaining, kwargs):
    387                     yield seg
    388                 return

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _plot_args(self, tup, kwargs)
    362             x, y = index_of(tup[-1])
    363 
--> 364         x, y = self._xy_from_xy(x, y)
    365 
    366         if self.command == 'plot':

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _xy_from_xy(self, x, y)
    195     def _xy_from_xy(self, x, y):
    196         if self.axes.xaxis is not None and self.axes.yaxis is not None:
--> 197             bx = self.axes.xaxis.update_units(x)
    198             by = self.axes.yaxis.update_units(y)
    199 

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axis.pyc in update_units(self, data)
   1387         neednew = self.converter != converter
   1388         self.converter = converter
-> 1389         default = self.converter.default_units(data, self)
   1390         if default is not None and self.units is None:
   1391             self.set_units(default)

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/dates.pyc in default_units(x, axis)
   1562 
   1563         try:
-> 1564             x = x[0]
   1565         except (TypeError, IndexError):
   1566             pass

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    519     def __getitem__(self, key):
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 
    523             if not np.isscalar(result):

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
   1593 
   1594         try:
-> 1595             return self._engine.get_value(s, k)
   1596         except KeyError as e1:
   1597             if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3113)()

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:2844)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7224)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7162)()

KeyError: 0

Looks like matplotlib's dates.py module is attempting to access the first value in the datetime series. But when x is a pandas Series, x[0] means the value at index label 0, which a) might not exist, and b) isn't necessarily the first value in the series! A possible fix might be to catch the KeyError and attempt an x.iloc[0] call instead.
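To make points a) and b) concrete without depending on pandas, here is a minimal sketch using a hypothetical stand-in class, LabelIndexed, that mimics only the relevant Series behaviour: [] looks values up by label, while .values holds them in positional order. The class and its attribute names are assumptions for illustration, not pandas code.

```python
class LabelIndexed:
    """Hypothetical stand-in for a pandas Series: [] looks up by label."""

    def __init__(self, values, index):
        self.values = list(values)
        # Map each index label to the position of its value.
        self._positions = {label: pos for pos, label in enumerate(index)}

    def __getitem__(self, label):
        # Label-based lookup: raises KeyError when the label is absent.
        return self.values[self._positions[label]]


# Case a): only odd labels, so the label 0 does not exist.
odd = LabelIndexed(["2005-02-02", "2005-02-04", "2005-02-06"], index=[1, 3, 5])
try:
    odd[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = odd.values[0]

# Case b): the label 0 exists, but it is not the first element.
shuffled = LabelIndexed(["b", "a"], index=[1, 0])
by_label = shuffled[0]
by_position = shuffled.values[0]
```

In case a) the lookup raises KeyError, just as in the traceback above; in case b) it succeeds but silently returns the second element.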

Script above worked fine in 1.4.3, but fails specifically in 1.5.0.

8one6 · 2015-11-23T20:19:34Z

I'd consider trying to do a PR on this (it'd be my first!). As a general question, what's the preferred "style" for dealing with Pandas support within matplotlib? In particular, is it better to catch the KeyError thrown by Pandas in a try/except clause or better to do an explicit check to see if x is a Series or DataFrame and, if so, to call x.iloc[0] instead of x[0]?

tacaswell · 2015-11-23T21:01:50Z

We have a strict do not import pandas rule (same goes for scipy) so using isinstance is out.

attn @TomAugspurger any advice on how to reliably get the first element?

I suspect that this is fall-out from @dopplershift's changes to make mpl work with pint, which made it less aggressive about casting inputs to numpy arrays.

@arc-jim The quick workaround is to use .values on your series on the way in.

8one6 · 2015-11-23T21:04:10Z

How about:

try:
    x = x[0]
except (KeyError):
    x = x.iloc[0]
except (TypeError, IndexError):
    pass

TomAugspurger · 2015-11-23T21:08:20Z

Haven't looked closely at what's going on, but @8one6 is correct that .iloc is the way to get the first element from a Series (by position).

8one6 · 2015-11-23T21:08:58Z

s.iloc[0] should be the same thing as s.values[0].

8one6 · 2015-11-23T21:17:05Z

Actually, .values[0] is probably a better idea. If nothing else it's way faster...so:

How about:

try:
    x = x[0]
except (KeyError):
    x = x.values[0]
except (TypeError, IndexError):
    pass

efiring · 2015-11-23T21:39:38Z

Maybe this gets it past the units handling, but won't it then fail later? Instead, would it be better to check whether an input has a values attribute and use it if it does? Or just leave that as the responsibility of the pandas user? Given that a DataFrame is quite a different beast than an ndarray, it seems like we either need to support it fully, or require that the user handle the transformation to an ndarray.
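The attribute check suggested here could look roughly like the sketch below. It is only an illustration under stated assumptions: first_element_via_values and FakeSeries are hypothetical names, and this is not what matplotlib ships. One wrinkle worth noting is that a plain dict also has a values attribute, but it is a method, so the sketch ignores callable attributes.

```python
class FakeSeries:
    """Hypothetical Series-like object with label-based [] and a .values list."""

    def __init__(self, values, index):
        self.values = list(values)
        self.index = list(index)

    def __getitem__(self, label):
        try:
            return self.values[self.index.index(label)]
        except ValueError:
            raise KeyError(label)


def first_element_via_values(x):
    """Prefer x.values when it looks like data; otherwise index x directly."""
    data = getattr(x, "values", x)
    if callable(data):
        # e.g. dict.values is a method, not array-like data.
        data = x
    try:
        return data[0]
    except (TypeError, IndexError, KeyError):
        # Mirror the original code path: leave x unchanged if it cannot be indexed.
        return x


from_series = first_element_via_values(FakeSeries([10, 20], index=[1, 3]))
from_list = first_element_via_values([7, 8])
from_scalar = first_element_via_values(5)
```

This avoids importing pandas, but it still special-cases one attribute name, which is the kind of per-library handling the comment above is wary of.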

8one6 · 2015-11-24T14:35:04Z

Is there a statement somewhere about how the matplotlib community has decided to handle Pandas datastructures, in general? Particularly with the newest release, the docs make reference to supporting Pandas structures (or at least Pandas-like structures): for example, here. However, there are still suggestions that users use only np.ndarray's in other parts of the docs, like here. Is it possible to clarify where the community stands on this issue?

I don't really mind either approach. But I think consistency would help people write code that avoids corner cases. Is the example here just scratching the surface? Specifically, it would be great to know what the "right answer" is here: is the commitment to supporting "labelled data" real, in which case matplotlib will need to be patched here (and likely elsewhere), or should users know that labelled data handling is a bit of a "buyer beware" situation (not fully supported) and thus to ensure maximum stability, they should always "strip" their data before plotting.

efiring · 2015-11-24T17:05:49Z

No, there is no such statement that I am aware of. This is an area that needs attention. I think the prevailing opinion is that we want to make mpl "just work" when possible, but at the same time we don't want to add dependencies on packages like pandas, and we don't want to add more than small amounts of code to handle their inputs. Unfortunately, this leads to the ambiguous situation you highlight.

8one6 · 2015-11-24T17:59:35Z

For reference, here is a non-date-related case where a very similar thing seems to be causing errors for a SO user.

tacaswell · 2015-11-24T18:29:55Z

Regarding the labeled data, please be careful to keep that separate from pandas. The promise on labeled data is that the data kwarg can be anything that supports __getitem__ with a string and returns things that 'smell like' numpy arrays. DataFrames are just one of many things that are supported (the simplest being a dict of ndarrays).
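As a rough illustration of that contract (not matplotlib's actual implementation), the sketch below swaps string arguments for data[arg] whenever the container supports the lookup. The helper name resolve_args is hypothetical, and plain lists stand in for ndarrays.

```python
def resolve_args(args, data=None):
    """Replace string arguments with data[arg] when the lookup succeeds."""
    if data is None:
        return list(args)
    resolved = []
    for arg in args:
        if isinstance(arg, str):
            try:
                resolved.append(data[arg])
                continue
            except (KeyError, IndexError, TypeError):
                # Not a key of data: keep the string (it may be a format like 'r-').
                pass
        resolved.append(arg)
    return resolved


# The simplest labeled container: a dict of array-likes.
table = {"t": [0, 1, 2], "y": [0.0, 0.84, 0.91]}
resolved = resolve_args(("t", "y", "r-"), data=table)
```

Anything with a string-keyed __getitem__ would work in place of the dict, which is why the promise is independent of pandas.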

The problem here is that pandas chose to give the convenient [] API on Series objects to index-based (label) lookup rather than to positional lookup. (I am a bit curious whether this was always the case or whether it is a side effect of Series no longer being a subclass of ndarray.)

Probably the safest way to get the first element is to do

el = next(iter(obj))

which, so long as the object is more ndarray-like than dict/mapping-like, should do the right thing.

It is marginally slower (40ns vs 170ns for lists and 86ns vs 225ns for arrays), but it does the right thing with array-like data structures that have non-standard __getitem__ behavior.
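A small sketch of the next(iter(obj)) idea, using a hypothetical Series-like stand-in (OddLabels) so that it runs without pandas. The helper name first_element is an assumption for illustration. It also shows a side effect: a generator passed in is advanced by one item.

```python
class OddLabels:
    """Hypothetical Series-like object: [] is a label lookup, iteration is positional."""

    def __init__(self, values, labels):
        self._by_label = dict(zip(labels, values))
        self._values = list(values)

    def __getitem__(self, label):
        return self._by_label[label]

    def __iter__(self):
        return iter(self._values)


def first_element(obj):
    """Return the first item produced by iterating over obj."""
    return next(iter(obj))


series_like = OddLabels([0.84, 0.14, -0.96], labels=[1, 3, 5])
from_list = first_element([4, 5, 6])
from_series_like = first_element(series_like)  # series_like[0] would raise KeyError

# Side effect: a generator is consumed by one item.
squares = (n * n for n in range(3, 6))
head = first_element(squares)
remaining = list(squares)
```

Note that an empty iterable raises StopIteration here rather than IndexError, so a caller relying on except (TypeError, IndexError) would need to handle that case as well.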

PR coming.

8one6 · 2015-11-24T18:37:28Z

Yeah, no question, the fact that s[0] and s.values[0] do different things (in some cases) for pandas series is very frustrating at times.

I'm not sure exactly all the situations that matplotlib needs this idea of a "first element". Maybe what you have above works well for Series but I think it might do something odd in the case of a DataFrame. I think when you iterate over a DataFrame you are iterating over the columns, not the rows.

TomAugspurger · 2015-11-24T18:39:58Z

iter(DataFrame) is an iterator over the column keys, so no values at all (like a dictionary).
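The dictionary comparison can be checked directly with a plain dict of columns standing in for a DataFrame (an assumption for illustration; no pandas involved). Iterating yields the keys, so next(iter(...)) returns a column name, and the first value is only reached by iterating the column itself.

```python
# A dict of columns standing in for a DataFrame.
frame_like = {
    "dates": ["2005-02-01", "2005-02-02"],
    "values": [0.0, 0.84],
}

first = next(iter(frame_like))      # a column key, not a data value
column = frame_like[first]          # the column behaves like a sequence
first_value = next(iter(column))    # now we get an actual element
```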

tacaswell · 2015-11-24T19:25:51Z

@TomAugspurger By the time we got to these parts of the code we should always have a Series not a DataFrame. All of the data-frame unpacking should be happening in the data kwarg or higher.

pd.Series prefers indexing by searching the index over positional indexing. This method will get the first element of any iterable. It will advance a generator, but generators did not work previously anyway. Closes matplotlib#5550

tacaswell · 2015-11-24T20:18:33Z

@8one6 Can you make a new issue for the hist based issue? It is an entirely different code path (and a bit more work to clean up).

8one6 · 2015-11-24T21:18:45Z

xref #5557

see e.g. matplotlib/matplotlib#5550 matplotlib/matplotlib#5557

tacaswell added this to the Critical bugfix release (1.5.1) milestone Nov 23, 2015

tacaswell mentioned this issue Nov 24, 2015

FIX: pandas indexing error #5556

Merged

tacaswell self-assigned this Nov 24, 2015

tacaswell added the status: needs review label Nov 24, 2015

8one6 mentioned this issue Dec 7, 2015

plt.hist throws KeyError when passed a pandas.Series without 0 in index #5557

Closed

TomAugspurger mentioned this issue Dec 17, 2015

Index without 0 in xerr/yerr causes KeyError pandas-dev/pandas#11858

Closed

tacaswell closed this as completed in 2f73778 Jan 1, 2016

tacaswell removed the status: needs review label Jan 1, 2016

ilia-kats added a commit to ilia-kats/CombinatorialProfiler that referenced this issue Nov 26, 2018

require matplotlib >= 1.5.1

0d7684b

see e.g. matplotlib/matplotlib#5550 matplotlib/matplotlib#5557
