Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

Merged
merged 4 commits into from Aug 11, 2013

Conversation

jtratner
Copy link
Contributor

This PR covers:

Fixes: #4202, #3714, #3742 (there are might be some others, but I've blanked on them...)

Bug fixes:

  • MultiIndex preserves names as much as possible and it's now harder to overwrite index metadata by making changes down the line.
  • set_values no longer messes up names.

External API Changes:

  • Names, levels and labels are now validated each time and 'mostly' immutable.
  • names, levels and labels produce containers that are immutable (using new containers FrozenList and FrozenNDArray)
  • MultiIndex now shallow copies levels and labels before storing them.
  • Adds astype method to MultiIndex to resolve issue with set_values in NDFrame
  • Direct setting of levels and labels is "deprecated" with a setter that raises a DeprecationWarning (but still functions)
  • New set_names, set_labels, and set_levels methods allow setting of these attributes and take an inplace=True keyword argument to mutate in place.
  • Index has a rename method that works similarly to the set_* methods.
  • Improved exceptions on Index methods to be more descriptive / more specific (e.g., replacing Exception with ValueError, etc.)
  • Index.copy() now accepts keyword arguments (name=,names=, levels=, labels=,) which return a new copy with those attributes set. It also accepts deep, which is there for compatibility with other copy() methods, but doesn't actually change what copy does (though, for MultiIndex, it makes the copy operation slower)

Internal changes:

  • MultiIndex now uses _set_levels, _get_levels, _set_labels, _get_labels internally to handle labels and levels (and uses that directly in __array_finalize__ and __setstate__, etc.)
  • MultiIndex.copy(deep=True) will deepcopy levels, labels, and names.
  • Index objects handle names with _set_names and _get_names.
  • Index now inherits from FrozenNDArray which (mostly) blocks mutable methods (except for view() and reshape())
  • Index now actually copies ndarrays when copy=True is passed to constructor and dtype=None

values = list(values)
if len(values) != self.nlevels:
raise ValueError(('Length of names (%d) must be same as level '
'(%d)') % (len(values),self.nlevels))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete bikeshedding, but no need for double parens here, as long as there's one set of parens python knows what 2 do :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, I just moved it straight from __new__, certainly worth it to change. :)

@jtratner
Copy link
Contributor Author

as an aside, this leads to weird things like

idx.names = list("abcdef")
idx.droplevel("d") # Gives an index out of range-esque error

@cpcloud
Copy link
Member

cpcloud commented Jun 26, 2013

yep i used to get that and i hacked around by recreating frames/series and some other trickery that will hopefully never see the light of day :)

@jtratner
Copy link
Contributor Author

@cpcloud okay, I think I'm understanding more of the problem here: when you slice an index, the levels remain the same...e.g.:

chunklet = idx[-3:]
assert chunklet.levels[0] is idx.levels[0] # True

So, when you assign names, it mutates the underlying levels of both. This seems to follow the convention in __new__ to assign names to the levels directly...but if you do this after the fact, you end up mutating earlier copies.

Moreover, if you pass levels to the MultiIndex constructor, it doesn't copy them, e.g.:

new_idx = MultiIndex(idx.levels, idx.labels)
assert new_idx.levels[0] is idx.levels[0] # True

so what ought to be happening here? Should names be assigned to underlying levels or just left alone?

index.names = ["a", "b"]
ind_names = list(index.names)
level_names = [level.name for level in index.levels]
self.assertListEqual(ind_names, level_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi you can't use this because it was introduced in py27

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the heads up ... didn't realize (and definitely not worth
porting). Second change to make.

On Wed, Jun 26, 2013 at 10:01 PM, Phillip Cloud notifications@github.comwrote:

In pandas/tests/test_index.py:

  •    # initializing with bad names (should always be equivalent)
    
  •    major_axis, minor_axis = self.index.levels
    
  •    major_labels, minor_labels = self.index.labels
    
  •    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
    
  •                       labels=[major_labels, minor_labels],
    
  •                       names=['first'])
    
  •    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
    
  •                       labels=[major_labels, minor_labels],
    
  •                       names=['first', 'second', 'third'])
    
  •    # names are assigned
    
  •    index.names = ["a", "b"]
    
  •    ind_names = list(index.names)
    
  •    level_names = [level.name for level in index.levels]
    
  •    self.assertListEqual(ind_names, level_names)
    

fyi you can't use this because it was introduced in py27http://docs.python.org/2/library/unittest.html#unittest.TestCase.assertListEqual


Reply to this email directly or view it on GitHubhttps://github.com//pull/4039/files#r4907657
.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python >= 2.7 assertEqual will dispatch to e.g., assertListEqual if two lists are passed, so that's nice.

@jtratner
Copy link
Contributor Author

@cpcloud made those two changes and rebased.

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

@wesm any reason why names should be settable to a sequence larger than nlevels after the fact?

@wesm
Copy link
Member

wesm commented Jun 27, 2013

nope. i just never added any validation and left names as a simple attribute.

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

+1 for a validator here...@jtratner what breaks?

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

this is basically same issue as #3742

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

indeed...close this one, that one?

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

def should be some kind of validator on setting of names attrib in index; weird things can happen if e.g. levels are changed

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

can close that one (maybe move example to here though) as another test case

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2013

Example from #3742 cc @thriveth

I have raised the issue in this question on Stack Overflow, but I'm not sure it ever made it to the Pandas issue tracker.

I have a MultiIndex'ed DataFrame which I want to expand by using set_value(), but doing this destroys the names attribute of the index. This does not happen when setting the value of an already existing entry in the DataFrame. An easily reproducible example is to create the dataframe by:

lev1 = ['hans', 'hans', 'hans', 'grethe', 'grethe', 'grethe']
lev2 = ['1', '2', '3'] * 2
idx = pd.MultiIndex.from_arrays(
    [lev1, lev2], 
    names=['Name', 'Number'])
df = pd.DataFrame(
    np.random.randn(6, 4),
    columns=['one', 'two', 'three', 'four'],
    index=idx)
df = df.sortlevel()
df 

This shows a neat and nice object, just as I expected, with proper naming of the index columns. If I now run:

df.set_value(('grethe', '3'), 'one', 99.34)

the result is also as expected. But if I run:

df.set_value(('grethe', '4'), 'one', 99.34)

The column names of the index are gone, and the names attribute has been set to [None, None].

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

also #3714 same issue too, except assigning to levels, needs validation as well

@jtratner
Copy link
Contributor Author

This is what I was trying to get across earlier :) If you pass levels through the MultiIndex constructor, they have their names set to the names keyword argument or to None (and are not copied).

        if names is None:
            # !!!This is why names get reset to None
            subarr.names = [None] * subarr.nlevels
        else:
            if len(names) != subarr.nlevels:
                raise AssertionError(('Length of names (%d) must be same as level '
                                      '(%d)') % (len(names),subarr.nlevels))

            subarr.names = list(names)

        # THIS IS WHERE NAMES GET OVERWRITTEN WITHOUT BEING COPIED
        # set the name
        for i, name in enumerate(subarr.names):
            subarr.levels[i].name = name

An easy solution would be for the MultiIndex to copy the levels it receives first and then rename them. Then, any time you set the names attribute, it would set the name on every level and the names attribute on the copied levels, e.g. force _ensure_index to make a copy if it's already an index, so that at this point you'd be all good:

levels = [_ensure_index(lev) for lev in levels]

Because later on it just assigns it to the object:

subarr.levels = levels

I think levels should be a cached_readonly property, so that you don't end up creating indices twice (once in the levels = part and another on the setattr levels). If not,you could turn levels into a property that makes a copy before setting on the object.

If this all makes sense to you, I can write it up into a PR soon.

@jtratner
Copy link
Contributor Author

@jreback @cpcloud Are Index and MultiIndex supposed to be immutable? If so, then neither should be able to set names directly (like MultiIndex does in its __new__ method). I think it makes more sense to allow them to mutate on names...much easier internally.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2013

well they are immutable on the values
but not names / levels (though they really should be)
because state sharing is a problem then

@jtratner
Copy link
Contributor Author

@cpcloud I learned something today: Python cheats and compares lists first by object equality, then checks individual items...I'm wondering if this error is lurking other places in the code:

>>> arr1, arr2 = np.array(range(10)), np.array(range(10))
>>> assert [arr1, arr2] == [arr1, arr2] # succeeds
>>> assert [arr1, arr2] == [arr1, arr2.copy()]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> assert_almost_equal([arr1, arr2], [arr1.copy(), arr2])
True

I found this in test_copy, where a test case was not actually testing what it claimed to be testing :P .

    def test_copy(self):
        i_copy = self.index.copy()

        # Equal...but not the same object
        self.assert_(i_copy.levels == self.index.levels)
        self.assert_(i_copy.levels is not self.index.levels)

@jtratner
Copy link
Contributor Author

Actually that whole test is not great, because it's actually (was) testing that two different lists were created.

@cpcloud
Copy link
Member

cpcloud commented Jun 28, 2013

@jtratner that is good to know.

@cpcloud
Copy link
Member

cpcloud commented Jun 28, 2013

i always assumed that sequence equality was done recursively. docs seem to imply that that is the case..strange

@jtratner
Copy link
Contributor Author

@cpcloud yep, that's exactly right. (I'm sure that's implementation dependent, but it is important to know if using numpy arrays.

@cpcloud
Copy link
Member

cpcloud commented Jun 28, 2013

maybe some sort of lightweight Levels class is in order? overloading __eq__ to compare levels or something more general for use with labels and levels?

@cpcloud
Copy link
Member

cpcloud commented Jun 28, 2013

Maybe you could use Categorical?

@jtratner
Copy link
Contributor Author

@cpcloud maybe. Right now I just changed levels, names and labels to return tuples and used assert_almost_equals. Very simple.

On that note - is it okay to make that change? Much easier to prevent erroneous assignment (like index.levels[i] = some_new_index) by producing an immutable object than to try to use something else. Have to change a ton of test cases that assumed names produces a list, but hopefully that's not a big deal...

@cpcloud
Copy link
Member

cpcloud commented Jun 28, 2013

Categorical might be jumping the gun a bit. I would rather have it immutable, but that prevents sharing. a perf comparison might be useful here

@jtratner
Copy link
Contributor Author

@cpcloud well, I'm using shallow copies/views, which I think means that only metadata is copied. This is necessary anyways, because you want to be able to set names on the underlying levels without worrying about messing up other indexes.

@jtratner
Copy link
Contributor Author

(I just pushed what I have so far - it's failing because a ton of tests assume that index names, levels and labels will be lists...)

* `FrozenNDArray` - thin wrapper around ndarray that disallows setting methods
  (will be used for levels on `MultiIndex`)
* `FrozenList` - thin wrapper around list that disallows setting methods
  (needed because of type checks elsewhere)

Index inherits from FrozenNDArray now and also actually copies for deepcopy.
Assumption is that underlying array is still immutable-ish
@jtratner
Copy link
Contributor Author

@jreback tell me if you want docs on set_names, set_levels, or the new copy constructor.

@jreback
Copy link
Contributor

jreback commented Aug 10, 2013

I think an example for v0.13 is good
and maybe add in indexing.rst? not sure we actually have a section on index names so put where u think

@jtratner
Copy link
Contributor Author

okay, there's an example in v0.13.0.txt and changed indexing.rst slightly to add the index names section (moved around the Index objects part so it could address MultiIndex too)

@jtratner
Copy link
Contributor Author

@jreback this is all working now + has the docs, etc.

@@ -82,6 +88,20 @@ pandas 0.13
- removed the ``warn`` argument from ``open``. Instead a ``PossibleDataLossError`` exception will
be raised if you try to use ``mode='w'`` with an OPEN file handle (:issue:`4367`)
- allow a passed locations array or mask as a ``where`` condition (:issue:`4467`)
- ``Index`` and ``MultiIndex`` changes (:issue:`4039`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this need to be indented? (eg outer level should be same as existing and inner level indented?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep needs to be indented

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That indentation level is because the previous entry is a sub-bullet of the HDFStore changes. (this matches up with the outer indentation level, which is 2 spaces.).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can you change the ones below it cuz this one sticks out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What follows are sub-bullets of "Index and MultiIndex changes", just like how the HDFStore changes are grouped above it - I can change it if you want, I was matching the look of other elements that have multiple changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh you're right! sorry about that. only thing is that to prevent sphinx from complaining you should put a newline between an outdented bullet point and the previously indented one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud added the space.

* `names` is now a property *and* is set up as an immutable tuple.
* `levels` are always (shallow) copied now and it is deprecated to set directly
* `labels` are set up as a property now, moving all the processing of
  labels out of `__new__` + shallow-copied.
* `levels` and `labels` are immutable.
* Add names tests, motivating example from pandas-dev#3742, reflect tuple-ish
  output from names, and level names check to reindex test.
* Add set_levels, set_labels, set_names and rename to index
* Deprecate setting labels and levels directly

Similar to other set_* methods...allows mutation if necessary but
otherwise returns same object.

Labels are now converted to `FrozenNDArray` and wrapped in a
`FrozenList`. Should mostly resolve pandas-dev#3714 because you have to work to
actually make assignments to an `Index`.

BUG: Give MultiIndex its own astype method

Fixes issue with set_value forgetting names.
* Index derivatives can set `name` or `names` as well as
  `dtype` on copy. MultiIndex can set `levels`, `labels`, and `names`.
* Also, `__deepcopy__` just calls `copy(deep=True)`
* Now, BlockManager.copy() takes an additional argument `copy_axes` which
  copies axes as well. Defaults to False.
* `Series.copy()` takes an optional deep argument, which causes it to
  copy its index.
* `DataFrame.copy()` passes `copy_axes=True` when deepcopying.
* Add copy kwarg to MultiIndex `__new__`
@jreback
Copy link
Contributor

jreback commented Aug 11, 2013

ok bombs away

jreback added a commit that referenced this pull request Aug 11, 2013
ENH/BUG: Fix names, levels and labels handling in MultiIndex
@jreback jreback merged commit 429e9f3 into pandas-dev:master Aug 11, 2013
@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

@jtratner

just rebased #3482 on top of this in master, no problem

noticed that you had a separate copy for Series (which copies index and such properly). Is this not also needed in core/internals/copy? specificy the axes are ALWAYS shallow copied here...?

@jtratner
Copy link
Contributor Author

@jreback actually, I was flip-flopping on this for a while. I tried copying axes in a few places, and I kept getting issues with the check that self.ref_items is self.items failing. It's fine for series, because it passes it through the constructor again, but DataFrame passes its data to the new object, so it doesn't reassign ref_items. I didn't want to mess with it too much because I wasn't sure where to change it. Maybe I needed to change it in block instead?

@jtratner jtratner deleted the fix-multiindex-naming branch August 12, 2013 02:04
@jtratner
Copy link
Contributor Author

@jreback After you brought this up - I'm thinking that you could copy index first, then pass it as a parameter to the copy on blocks to overwrite items and ref_items. Maybe that would work? What's the difference between ref_items and items?

@jreback
Copy link
Contributor

jreback commented Aug 12, 2013

@jtratner items are references to ref_items that are identical or a take (and not an indexing operation).

So the only thing you can do is to copy it BEFORE you start and then use that one (and yes, could be done in copy), you could have to pass the new ref_items, then change the Block.copy to use the new items, e.g. items = ref_items.take(self.ref_locs); very much like a renaming operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Deep copy not copying index/columns
6 participants