ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

jtratner · 2013-06-26T01:22:53Z

This PR covers:

Fixes: #4202, #3714, #3742 (there are might be some others, but I've blanked on them...)

Bug fixes:

MultiIndex preserves names as much as possible and it's now harder to overwrite index metadata by making changes down the line.
set_values no longer messes up names.

External API Changes:

Names, levels and labels are now validated each time and 'mostly' immutable.
names, levels and labels produce containers that are immutable (using new containers FrozenList and FrozenNDArray)
MultiIndex now shallow copies levels and labels before storing them.
Adds astype method to MultiIndex to resolve issue with set_values in NDFrame
Direct setting of levels and labels is "deprecated" with a setter that raises a DeprecationWarning (but still functions)
New set_names, set_labels, and set_levels methods allow setting of these attributes and take an inplace=True keyword argument to mutate in place.
Index has a rename method that works similarly to the set_* methods.
Improved exceptions on Index methods to be more descriptive / more specific (e.g., replacing Exception with ValueError, etc.)
Index.copy() now accepts keyword arguments (name=,names=, levels=, labels=,) which return a new copy with those attributes set. It also accepts deep, which is there for compatibility with other copy() methods, but doesn't actually change what copy does (though, for MultiIndex, it makes the copy operation slower)

Internal changes:

MultiIndex now uses _set_levels, _get_levels, _set_labels, _get_labels internally to handle labels and levels (and uses that directly in __array_finalize__ and __setstate__, etc.)
MultiIndex.copy(deep=True) will deepcopy levels, labels, and names.
Index objects handle names with _set_names and _get_names.
Index now inherits from FrozenNDArray which (mostly) blocks mutable methods (except for view() and reshape())
Index now actually copies ndarrays when copy=True is passed to constructor and dtype=None

cpcloud · 2013-06-26T01:30:52Z

pandas/core/index.py

+        values = list(values)
+        if len(values) != self.nlevels:
+            raise ValueError(('Length of names (%d) must be same as level '
+                              '(%d)') % (len(values),self.nlevels))


complete bikeshedding, but no need for double parens here, as long as there's one set of parens python knows what 2 do :)

heh, I just moved it straight from __new__, certainly worth it to change. :)

jtratner · 2013-06-26T01:39:04Z

as an aside, this leads to weird things like

idx.names = list("abcdef")
idx.droplevel("d") # Gives an index out of range-esque error

cpcloud · 2013-06-26T01:46:32Z

yep i used to get that and i hacked around by recreating frames/series and some other trickery that will hopefully never see the light of day :)

jtratner · 2013-06-26T02:00:06Z

@cpcloud okay, I think I'm understanding more of the problem here: when you slice an index, the levels remain the same...e.g.:

chunklet = idx[-3:]
assert chunklet.levels[0] is idx.levels[0] # True

So, when you assign names, it mutates the underlying levels of both. This seems to follow the convention in __new__ to assign names to the levels directly...but if you do this after the fact, you end up mutating earlier copies.

Moreover, if you pass levels to the MultiIndex constructor, it doesn't copy them, e.g.:

new_idx = MultiIndex(idx.levels, idx.labels)
assert new_idx.levels[0] is idx.levels[0] # True

so what ought to be happening here? Should names be assigned to underlying levels or just left alone?

cpcloud · 2013-06-27T02:01:20Z

pandas/tests/test_index.py

+        index.names = ["a", "b"]
+        ind_names = list(index.names)
+        level_names = [level.name for level in index.levels]
+        self.assertListEqual(ind_names, level_names)


fyi you can't use this because it was introduced in py27

thanks for the heads up ... didn't realize (and definitely not worth
porting). Second change to make.

On Wed, Jun 26, 2013 at 10:01 PM, Phillip Cloud notifications@github.comwrote:

In pandas/tests/test_index.py:

# initializing with bad names (should always be equivalent)

major_axis, minor_axis = self.index.levels

major_labels, minor_labels = self.index.labels

assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],

labels=[major_labels, minor_labels],

names=['first'])

assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],

labels=[major_labels, minor_labels],

names=['first', 'second', 'third'])

# names are assigned

index.names = ["a", "b"]

ind_names = list(index.names)

level_names = [level.name for level in index.levels]

self.assertListEqual(ind_names, level_names)

fyi you can't use this because it was introduced in py27http://docs.python.org/2/library/unittest.html#unittest.TestCase.assertListEqual

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/4039/files#r4907657
.

In python >= 2.7 assertEqual will dispatch to e.g., assertListEqual if two lists are passed, so that's nice.

jtratner · 2013-06-27T02:51:17Z

@cpcloud made those two changes and rebased.

cpcloud · 2013-06-27T03:39:19Z

@wesm any reason why names should be settable to a sequence larger than nlevels after the fact?

wesm · 2013-06-27T19:00:00Z

nope. i just never added any validation and left names as a simple attribute.

cpcloud · 2013-06-27T19:04:23Z

+1 for a validator here...@jtratner what breaks?

jreback · 2013-06-27T19:08:26Z

this is basically same issue as #3742

cpcloud · 2013-06-27T19:09:05Z

indeed...close this one, that one?

jreback · 2013-06-27T19:09:56Z

def should be some kind of validator on setting of names attrib in index; weird things can happen if e.g. levels are changed

jreback · 2013-06-27T19:10:23Z

can close that one (maybe move example to here though) as another test case

cpcloud · 2013-06-27T19:12:29Z

Example from #3742 cc @thriveth

I have raised the issue in this question on Stack Overflow, but I'm not sure it ever made it to the Pandas issue tracker.

I have a MultiIndex'ed DataFrame which I want to expand by using set_value(), but doing this destroys the names attribute of the index. This does not happen when setting the value of an already existing entry in the DataFrame. An easily reproducible example is to create the dataframe by:

lev1 = ['hans', 'hans', 'hans', 'grethe', 'grethe', 'grethe']
lev2 = ['1', '2', '3'] * 2
idx = pd.MultiIndex.from_arrays(
    [lev1, lev2], 
    names=['Name', 'Number'])
df = pd.DataFrame(
    np.random.randn(6, 4),
    columns=['one', 'two', 'three', 'four'],
    index=idx)
df = df.sortlevel()
df

This shows a neat and nice object, just as I expected, with proper naming of the index columns. If I now run:

df.set_value(('grethe', '3'), 'one', 99.34)

the result is also as expected. But if I run:

df.set_value(('grethe', '4'), 'one', 99.34)

The column names of the index are gone, and the names attribute has been set to [None, None].

jreback · 2013-06-27T19:15:03Z

also #3714 same issue too, except assigning to levels, needs validation as well

jtratner · 2013-06-27T21:15:34Z

This is what I was trying to get across earlier :) If you pass levels through the MultiIndex constructor, they have their names set to the names keyword argument or to None (and are not copied).

        if names is None:
            # !!!This is why names get reset to None
            subarr.names = [None] * subarr.nlevels
        else:
            if len(names) != subarr.nlevels:
                raise AssertionError(('Length of names (%d) must be same as level '
                                      '(%d)') % (len(names),subarr.nlevels))

            subarr.names = list(names)

        # THIS IS WHERE NAMES GET OVERWRITTEN WITHOUT BEING COPIED
        # set the name
        for i, name in enumerate(subarr.names):
            subarr.levels[i].name = name

An easy solution would be for the MultiIndex to copy the levels it receives first and then rename them. Then, any time you set the names attribute, it would set the name on every level and the names attribute on the copied levels, e.g. force _ensure_index to make a copy if it's already an index, so that at this point you'd be all good:

levels = [_ensure_index(lev) for lev in levels]

Because later on it just assigns it to the object:

subarr.levels = levels

I think levels should be a cached_readonly property, so that you don't end up creating indices twice (once in the levels = part and another on the setattr levels). If not,you could turn levels into a property that makes a copy before setting on the object.

If this all makes sense to you, I can write it up into a PR soon.

jtratner · 2013-06-27T22:34:56Z

@jreback @cpcloud Are Index and MultiIndex supposed to be immutable? If so, then neither should be able to set names directly (like MultiIndex does in its __new__ method). I think it makes more sense to allow them to mutate on names...much easier internally.

jreback · 2013-06-27T23:16:01Z

well they are immutable on the values
but not names / levels (though they really should be)
because state sharing is a problem then

jtratner · 2013-06-28T21:29:56Z

@cpcloud I learned something today: Python cheats and compares lists first by object equality, then checks individual items...I'm wondering if this error is lurking other places in the code:

>>> arr1, arr2 = np.array(range(10)), np.array(range(10))
>>> assert [arr1, arr2] == [arr1, arr2] # succeeds
>>> assert [arr1, arr2] == [arr1, arr2.copy()]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> assert_almost_equal([arr1, arr2], [arr1.copy(), arr2])
True

I found this in test_copy, where a test case was not actually testing what it claimed to be testing :P .

    def test_copy(self):
        i_copy = self.index.copy()

        # Equal...but not the same object
        self.assert_(i_copy.levels == self.index.levels)
        self.assert_(i_copy.levels is not self.index.levels)

jtratner · 2013-06-28T21:31:18Z

Actually that whole test is not great, because it's actually (was) testing that two different lists were created.

cpcloud · 2013-06-28T22:01:38Z

@jtratner that is good to know.

cpcloud · 2013-06-28T22:05:49Z

i always assumed that sequence equality was done recursively. docs seem to imply that that is the case..strange

jtratner · 2013-06-28T22:13:48Z

@cpcloud yep, that's exactly right. (I'm sure that's implementation dependent, but it is important to know if using numpy arrays.

cpcloud · 2013-06-28T22:17:11Z

maybe some sort of lightweight Levels class is in order? overloading __eq__ to compare levels or something more general for use with labels and levels?

cpcloud · 2013-06-28T22:19:33Z

Maybe you could use Categorical?

jtratner · 2013-06-28T23:04:27Z

@cpcloud maybe. Right now I just changed levels, names and labels to return tuples and used assert_almost_equals. Very simple.

On that note - is it okay to make that change? Much easier to prevent erroneous assignment (like index.levels[i] = some_new_index) by producing an immutable object than to try to use something else. Have to change a ton of test cases that assumed names produces a list, but hopefully that's not a big deal...

cpcloud · 2013-06-28T23:18:41Z

Categorical might be jumping the gun a bit. I would rather have it immutable, but that prevents sharing. a perf comparison might be useful here

jtratner · 2013-06-28T23:37:55Z

@cpcloud well, I'm using shallow copies/views, which I think means that only metadata is copied. This is necessary anyways, because you want to be able to set names on the underlying levels without worrying about messing up other indexes.

jtratner · 2013-06-28T23:38:44Z

(I just pushed what I have so far - it's failing because a ton of tests assume that index names, levels and labels will be lists...)

* `FrozenNDArray` - thin wrapper around ndarray that disallows setting methods (will be used for levels on `MultiIndex`) * `FrozenList` - thin wrapper around list that disallows setting methods (needed because of type checks elsewhere) Index inherits from FrozenNDArray now and also actually copies for deepcopy. Assumption is that underlying array is still immutable-ish

jtratner · 2013-08-10T16:45:46Z

@jreback tell me if you want docs on set_names, set_levels, or the new copy constructor.

jreback · 2013-08-10T16:49:44Z

I think an example for v0.13 is good
and maybe add in indexing.rst? not sure we actually have a section on index names so put where u think

jtratner · 2013-08-10T17:09:38Z

okay, there's an example in v0.13.0.txt and changed indexing.rst slightly to add the index names section (moved around the Index objects part so it could address MultiIndex too)

jtratner · 2013-08-10T19:41:34Z

@jreback this is all working now + has the docs, etc.

jreback · 2013-08-11T16:43:57Z

doc/source/release.rst

@@ -82,6 +88,20 @@ pandas 0.13
    - removed the ``warn`` argument from ``open``. Instead a ``PossibleDataLossError`` exception will
      be raised if you try to use ``mode='w'`` with an OPEN file handle (:issue:`4367`)
    - allow a passed locations array or mask as a ``where`` condition (:issue:`4467`)
+  - ``Index`` and ``MultiIndex`` changes (:issue:`4039`):


doesn't this need to be indented? (eg outer level should be same as existing and inner level indented?)

yep needs to be indented

That indentation level is because the previous entry is a sub-bullet of the HDFStore changes. (this matches up with the outer indentation level, which is 2 spaces.).

then can you change the ones below it cuz this one sticks out

What follows are sub-bullets of "Index and MultiIndex changes", just like how the HDFStore changes are grouped above it - I can change it if you want, I was matching the look of other elements that have multiple changes.

oh you're right! sorry about that. only thing is that to prevent sphinx from complaining you should put a newline between an outdented bullet point and the previously indented one

@cpcloud added the space.

* `names` is now a property *and* is set up as an immutable tuple. * `levels` are always (shallow) copied now and it is deprecated to set directly * `labels` are set up as a property now, moving all the processing of labels out of `__new__` + shallow-copied. * `levels` and `labels` are immutable. * Add names tests, motivating example from pandas-dev#3742, reflect tuple-ish output from names, and level names check to reindex test. * Add set_levels, set_labels, set_names and rename to index * Deprecate setting labels and levels directly Similar to other set_* methods...allows mutation if necessary but otherwise returns same object. Labels are now converted to `FrozenNDArray` and wrapped in a `FrozenList`. Should mostly resolve pandas-dev#3714 because you have to work to actually make assignments to an `Index`. BUG: Give MultiIndex its own astype method Fixes issue with set_value forgetting names.

* Index derivatives can set `name` or `names` as well as `dtype` on copy. MultiIndex can set `levels`, `labels`, and `names`. * Also, `__deepcopy__` just calls `copy(deep=True)` * Now, BlockManager.copy() takes an additional argument `copy_axes` which copies axes as well. Defaults to False. * `Series.copy()` takes an optional deep argument, which causes it to copy its index. * `DataFrame.copy()` passes `copy_axes=True` when deepcopying. * Add copy kwarg to MultiIndex `__new__`

jreback · 2013-08-11T20:00:01Z

ok bombs away

ENH/BUG: Fix names, levels and labels handling in MultiIndex

jreback · 2013-08-12T01:49:16Z

@jtratner

just rebased #3482 on top of this in master, no problem

noticed that you had a separate copy for Series (which copies index and such properly). Is this not also needed in core/internals/copy? specificy the axes are ALWAYS shallow copied here...?

jtratner · 2013-08-12T02:03:07Z

@jreback actually, I was flip-flopping on this for a while. I tried copying axes in a few places, and I kept getting issues with the check that self.ref_items is self.items failing. It's fine for series, because it passes it through the constructor again, but DataFrame passes its data to the new object, so it doesn't reassign ref_items. I didn't want to mess with it too much because I wasn't sure where to change it. Maybe I needed to change it in block instead?

jtratner · 2013-08-12T02:07:16Z

@jreback After you brought this up - I'm thinking that you could copy index first, then pass it as a parameter to the copy on blocks to overwrite items and ref_items. Maybe that would work? What's the difference between ref_items and items?

jreback · 2013-08-12T02:12:23Z

@jtratner items are references to ref_items that are identical or a take (and not an indexing operation).

So the only thing you can do is to copy it BEFORE you start and then use that one (and yes, could be done in copy), you could have to pass the new ref_items, then change the Block.copy to use the new items, e.g. items = ref_items.take(self.ref_locs); very much like a renaming operation.

cpcloud reviewed Jun 26, 2013
View reviewed changes

cpcloud reviewed Jun 27, 2013
View reviewed changes

cpcloud mentioned this pull request Jun 27, 2013

MultiIndex names attribute broken when using set_value() to expand DataFrame() #3742

Closed

jtratner mentioned this pull request Aug 10, 2013

Remove deprecated setters for MultiIndex levels and labels #4535

Closed

jreback reviewed Aug 11, 2013
View reviewed changes

jtratner added 2 commits August 11, 2013 13:35

jreback added a commit that referenced this pull request Aug 11, 2013

Merge pull request #4039 from jtratner/fix-multiindex-naming

429e9f3

ENH/BUG: Fix names, levels and labels handling in MultiIndex

jreback merged commit 429e9f3 into pandas-dev:master Aug 11, 2013

jtratner deleted the fix-multiindex-naming branch August 12, 2013 02:04

jreback mentioned this pull request Aug 14, 2013

CLN: Post Series subclass from NDFrame #4324

Closed

18 tasks

jreback mentioned this pull request Aug 21, 2013

API/CLN: make rename and copy consistent across NDFrame #4627

Merged

jtratner mentioned this pull request Sep 17, 2013

ENH: Allow fast comparisons of Index views, similar to 'is' checks. #4859

Closed

cancan101 mentioned this pull request Oct 20, 2013

New in version 0.13 for "Setting index metadata" #5279

Closed

tlmaloney mentioned this pull request Jan 12, 2015

sort_index behavior differs for the same DataFrame? #9212

Closed

TomAugspurger modified the milestones: Next Major Release, 0.13 Jul 9, 2017

TomAugspurger added the Deprecate Functionality to remove in pandas label Jul 9, 2017

TomAugspurger modified the milestones: 0.13, Next Major Release Jul 9, 2017

jorisvandenbossche mentioned this pull request Nov 13, 2017

CLN/DEPR: remove setter for MultiIndex levels/labels properties #18256

Merged

rhshadrach mentioned this pull request Jan 26, 2024

API: Change MultiIndex.levels/codes to use tuple instead of FrozenList? #53531

Closed

Uh oh!

ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

Uh oh!

Conversation

jtratner commented Jun 26, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtratner commented Jun 26, 2013

Uh oh!

cpcloud commented Jun 26, 2013

Uh oh!

jtratner commented Jun 26, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtratner commented Jun 27, 2013

Uh oh!

cpcloud commented Jun 27, 2013

Uh oh!

wesm commented Jun 27, 2013

Uh oh!

cpcloud commented Jun 27, 2013

Uh oh!

jreback commented Jun 27, 2013

Uh oh!

cpcloud commented Jun 27, 2013

Uh oh!

jreback commented Jun 27, 2013

Uh oh!

jreback commented Jun 27, 2013

Uh oh!

cpcloud commented Jun 27, 2013

Uh oh!

jreback commented Jun 27, 2013

Uh oh!

jtratner commented Jun 27, 2013

Uh oh!

jtratner commented Jun 27, 2013

Uh oh!

jreback commented Jun 27, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

cpcloud commented Jun 28, 2013

Uh oh!

cpcloud commented Jun 28, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

cpcloud commented Jun 28, 2013

Uh oh!

cpcloud commented Jun 28, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

cpcloud commented Jun 28, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

jtratner commented Jun 28, 2013

Uh oh!

jtratner commented Aug 10, 2013

Uh oh!

jreback commented Aug 10, 2013

Uh oh!

jtratner commented Aug 10, 2013

Uh oh!

jtratner commented Aug 10, 2013

Uh oh!

Choose a reason for hiding this comment