Plotting datetime values from Pandas dataframe #5550

Closed
arc-jim opened this issue Nov 23, 2015 · 16 comments

@arc-jim

arc-jim commented Nov 23, 2015

This appears to be a new issue in 1.5.0.

The script below attempts to plot two 2-D graphs whose X and Y values are Pandas series. The issue seems to occur when pyplot is passed a datetime column whose index does not contain the label 0. Note how the second dataframe contains only odd indices (1, 3, 5, etc.).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# create sample dataframe with preset dates and values columns
dates = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
values = np.sin(np.array(range(len(dates))))
df = pd.DataFrame({'dates': dates, 'values': values})

# matplotlib figure + two subplots for comparison
fig, axes = plt.subplots(1, 2)

# create two dataframes for comparison - one with all indices, including 0, and one with only odd indices
with_zero_index = df.copy()
without_zero_index = df[np.array(df.index) % 2 == 1].copy()

# plot both - note how second plot fails without a 0 index
axes[0].plot(with_zero_index['dates'], with_zero_index['values'])
axes[1].plot(without_zero_index['dates'], without_zero_index['values'])

Stack trace:

KeyError                                  Traceback (most recent call last)
<ipython-input-29-28e878247c17> in <module>()
     17 # plot both - note how second plot fails without a 0 index
     18 axes[0].plot(with_zero_index['dates'], with_zero_index['values'])
---> 19 axes[1].plot(without_zero_index['dates'], without_zero_index['values'])

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
   1809                     warnings.warn(msg % (label_namer, func.__name__),
   1810                                   RuntimeWarning, stacklevel=2)
-> 1811             return func(ax, *args, **kwargs)
   1812         pre_doc = inner.__doc__
   1813         if pre_doc is None:

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in plot(self, *args, **kwargs)
   1425             kwargs['color'] = c
   1426 
-> 1427         for line in self._get_lines(*args, **kwargs):
   1428             self.add_line(line)
   1429             lines.append(line)

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _grab_next_args(self, *args, **kwargs)
    384                 return
    385             if len(remaining) <= 3:
--> 386                 for seg in self._plot_args(remaining, kwargs):
    387                     yield seg
    388                 return

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _plot_args(self, tup, kwargs)
    362             x, y = index_of(tup[-1])
    363 
--> 364         x, y = self._xy_from_xy(x, y)
    365 
    366         if self.command == 'plot':

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axes/_base.pyc in _xy_from_xy(self, x, y)
    195     def _xy_from_xy(self, x, y):
    196         if self.axes.xaxis is not None and self.axes.yaxis is not None:
--> 197             bx = self.axes.xaxis.update_units(x)
    198             by = self.axes.yaxis.update_units(y)
    199 

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/axis.pyc in update_units(self, data)
   1387         neednew = self.converter != converter
   1388         self.converter = converter
-> 1389         default = self.converter.default_units(data, self)
   1390         if default is not None and self.units is None:
   1391             self.set_units(default)

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/matplotlib/dates.pyc in default_units(x, axis)
   1562 
   1563         try:
-> 1564             x = x[0]
   1565         except (TypeError, IndexError):
   1566             pass

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    519     def __getitem__(self, key):
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 
    523             if not np.isscalar(result):

/home/jim/arcemweb/venv/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
   1593 
   1594         try:
-> 1595             return self._engine.get_value(s, k)
   1596         except KeyError as e1:
   1597             if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3113)()

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:2844)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7224)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7162)()

KeyError: 0

It looks like matplotlib's dates.py module is trying to access the first value in the datetime series. When x is a pandas Series, though, x[0] looks up the value at index label 0, which a) might not exist, and b) isn't necessarily the first value in the series! A possible fix might be to catch the KeyError and attempt an x.iloc[0] call instead.

Script above worked fine in 1.4.3, but fails specifically in 1.5.0.
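The label-versus-position distinction described above can be reproduced without matplotlib. This is a minimal sketch, assuming a pandas version in which integer keys on an integer-labelled Series are treated as labels; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

dates = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
s = pd.Series(dates)

# Keep only the odd index labels (1, 3, 5, ...), as in the report.
odd = s[s.index % 2 == 1]

# Label-based lookup: there is no label 0, so this raises KeyError.
try:
    odd[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional lookup: the first stored element, whatever its label is.
first_by_position = odd.iloc[0]
first_from_values = odd.values[0]
```

Both positional forms return the 2 February 2005 entry, which sits at label 1.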

@tacaswell tacaswell added this to the Critical bugfix release (1.5.1) milestone Nov 23, 2015
@8one6

8one6 commented Nov 23, 2015

I'd consider trying to do a PR on this (it'd be my first!). As a general question, what's the preferred "style" for dealing with Pandas support within matplotlib? In particular, is it better to catch the KeyError thrown by Pandas in a try/except clause or better to do an explicit check to see if x is a Series or DataFrame and, if so, to call x.iloc[0] instead of x[0]?

@tacaswell
Member

We have a strict "do not import pandas" rule (the same goes for scipy), so using isinstance is out.

attn @TomAugspurger any advice on how to reliably get the first element?

I suspect that this is fall-out from @dopplershift's changes to make mpl work with pint, which are less aggressive about casting inputs to numpy arrays.

@arc-jim The quick workaround is to use .values on your series on the way in.
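A sketch of that workaround applied to the second plot from the report. It assumes a headless run, hence the Agg backend, and reuses the reporter's variable names; passing plain ndarrays means matplotlib never indexes the Series by label.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
values = np.sin(np.arange(len(dates)))
df = pd.DataFrame({'dates': dates, 'values': values})
without_zero_index = df[df.index % 2 == 1]

fig, ax = plt.subplots()
# .values strips the pandas index and hands matplotlib plain ndarrays.
lines = ax.plot(without_zero_index['dates'].values,
                without_zero_index['values'].values)
```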

@8one6

8one6 commented Nov 23, 2015

How about:

try:
    x = x[0]
except (KeyError):
    x = x.iloc[0]
except (TypeError, IndexError):
    pass

@TomAugspurger
Contributor

Haven't looked closely at what's going on, but @8one6 is correct that .iloc is the way to get the first element from a Series (by position).

@8one6

8one6 commented Nov 23, 2015

s.iloc[0] should be the same thing as s.values[0].

@8one6

8one6 commented Nov 23, 2015

Actually, .values[0] is probably a better idea. If nothing else it's way faster...so:

How about:

try:
    x = x[0]
except (KeyError):
    x = x.values[0]
except (TypeError, IndexError):
    pass

@efiring
Member

efiring commented Nov 23, 2015

Maybe this gets it past the units handling, but won't it then fail later? Instead, would it be better to check whether an input has a values method and use it if it does? Or just leave that as the responsibility of the pandas user? Given that a DataFrame is quite a different beast than an ndarray, it seems like we either need to support it fully, or require that the user handle the transformation to an ndarray.
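One way the "use values if it is there" idea could look, as a rough sketch only. The helper `as_plain_array` and the stand-in class `LabelledColumn` are hypothetical and are not part of matplotlib or pandas. One wrinkle worth noting: on a dict, `values` is a method rather than an array attribute, so the sketch skips anything callable.

```python
import numpy as np


def as_plain_array(x):
    """Hypothetical helper: unwrap a pandas-like container to an ndarray."""
    values = getattr(x, 'values', None)
    # On a pandas Series, ``values`` is an attribute holding an ndarray.
    # On a dict it is a method, so anything callable is ignored here.
    if values is not None and not callable(values):
        return np.asarray(values)
    return np.asarray(x)


class LabelledColumn:
    """Stand-in for a pandas Series: exposes its data through ``.values``."""

    def __init__(self, data):
        self.values = np.asarray(data)


col = as_plain_array(LabelledColumn([1.0, 2.0, 3.0]))
plain = as_plain_array([4, 5, 6])
```

This avoids importing pandas, but it relies on an attribute name rather than a type, which is the kind of duck typing the thread is weighing.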

@8one6

8one6 commented Nov 24, 2015

Is there a statement somewhere about how the matplotlib community has decided to handle Pandas datastructures, in general? Particularly with the newest release, the docs make reference to supporting Pandas structures (or at least Pandas-like structures): for example, here. However, there are still suggestions that users use only np.ndarray's in other parts of the docs, like here. Is it possible to clarify where the community stands on this issue?

I don't really mind either approach. But I think consistency would help people write code that avoids corner cases. Is the example here just scratching the surface? Specifically, it would be great to know what the "right answer" is here: is the commitment to supporting "labelled data" real, in which case matplotlib will need to be patched here (and likely elsewhere), or should users know that labelled data handling is a bit of a "buyer beware" situation (not fully supported) and thus to ensure maximum stability, they should always "strip" their data before plotting.

@efiring
Member

efiring commented Nov 24, 2015

No, there is no such statement that I am aware of. This is an area that needs attention. I think the prevailing opinion is that we want to make mpl "just work" when possible, but at the same time we don't want to add dependencies on packages like pandas, and we don't want to add more than small amounts of code to handle their inputs. Unfortunately, this leads to the ambiguous situation you highlight.

@8one6

8one6 commented Nov 24, 2015

For reference, here is a non-date-related case where a very similar thing seems to be causing errors for a SO user.

@tacaswell
Member

Regarding the labeled data, please be careful to keep that separated from pandas. The promise on labeled data is that the data kwarg can be anything that supports __getitem__ with a string and returns things that 'smell like' numpy arrays. DataFrames are just one of many things that are supported (the simplest being a dict of ndarrays).
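A small sketch of the simplest case mentioned here, a dict of ndarrays passed through the data kwarg. The key names `t` and `signal` are made up for illustration, and the Agg backend is assumed so that the snippet runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Any object supporting string __getitem__ works; a dict is the simplest.
data = {'t': np.arange(5), 'signal': np.arange(5) ** 2}

fig, ax = plt.subplots()
# The strings are looked up in ``data`` before plotting.
(line,) = ax.plot('t', 'signal', data=data)
```

No pandas is involved, which is the point of keeping the two topics apart.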

The problem here is that pandas chose to give the convenient API on Series objects to index-based lookup rather than to position-based lookup. (I am a bit curious whether this was always the case or whether it is a side effect of Series no longer being a subclass of ndarray.)

Probably the safest way to get the first element is to do

el = next(iter(obj))

which, so long as the object is more ndarray-like than dict/mapping-like, should do the right thing.

It is marginally slower (40ns vs 170ns for lists and 86ns vs 225ns for arrays), but it does the right thing with array-like data structures that have non-standard __getitem__ behavior.
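To illustrate how `next(iter(obj))` behaves across container types, here is a minimal sketch. The wrapper name `first_element` is hypothetical and is not claimed to match the name used in the eventual PR.

```python
import numpy as np
import pandas as pd


def first_element(obj):
    """Hypothetical helper wrapping the ``next(iter(obj))`` idea."""
    return next(iter(obj))


from_list = first_element([10, 20, 30])
from_array = first_element(np.array([10, 20, 30]))
# Series labelled 1, 3, 5: iteration yields values in stored order,
# so the missing label 0 does not matter.
from_series = first_element(pd.Series([10, 20, 30], index=[1, 3, 5]))
# Mapping-like objects iterate over their keys, so this returns a key.
from_dict = first_element({'a': 10, 'b': 20})
```

The dict case shows the caveat stated above: for mapping-like inputs the result is a key, not a data value.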

PR coming.

@8one6

8one6 commented Nov 24, 2015

Yeah, no question, the fact that s[0] and s.values[0] do different things (in some cases) for pandas series is very frustrating at times.

I'm not sure exactly all the situations that matplotlib needs this idea of a "first element". Maybe what you have above works well for Series but I think it might do something odd in the case of a DataFrame. I think when you iterate over a DataFrame you are iterating over the columns, not the rows.

@TomAugspurger
Contributor

iter(DataFrame) is an iterator over the column keys, so no values at all (like a dictionary).
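A quick sketch of that behaviour, using made-up column names. Iterating a DataFrame yields column labels, while iterating a single column yields data values.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': pd.date_range('2005-02-01', periods=3),
    'values': [0.0, 0.5, 1.0],
})

column_keys = list(iter(df))                  # column labels, like dict keys
first_from_frame = next(iter(df))             # a column label, not a value
first_from_column = next(iter(df['values']))  # an actual data value
```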

@tacaswell
Member

@TomAugspurger By the time we get to these parts of the code, we should always have a Series, not a DataFrame. All of the DataFrame unpacking should happen in the data kwarg handling or higher.

tacaswell added a commit to tacaswell/matplotlib that referenced this issue Nov 24, 2015
pd.Series prefer indexing via searching the index to positional
indexing.  This method will get the first element of any iterable.

It will advance a generator, but generators did not work previously anyway.

Closes matplotlib#5550
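The generator caveat in the commit message can be seen in a few lines. This is a sketch with a hypothetical `first_element` helper, not the code from the commit itself.

```python
def first_element(obj):
    """Hypothetical helper: first item of any iterable."""
    return next(iter(obj))


gen = (n * n for n in range(4))
first = first_element(gen)
remaining = list(gen)  # the first item has already been consumed

seq = [0, 1, 4, 9]
first_of_list = first_element(seq)  # a list is left untouched
```

Peeking at a generator consumes its first item, whereas re-iterable containers such as lists are unaffected.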
@tacaswell tacaswell self-assigned this Nov 24, 2015
@tacaswell
Member

@8one6 Can you make a new issue for the hist based issue? It is an entirely different code path (and a bit more work to clean up).

@8one6

8one6 commented Nov 24, 2015

xref #5557

tacaswell added a commit to tacaswell/matplotlib that referenced this issue Dec 21, 2015
tacaswell added a commit to tacaswell/matplotlib that referenced this issue Dec 28, 2015
ilia-kats added a commit to ilia-kats/CombinatorialProfiler that referenced this issue Nov 26, 2018