BUG: Fix copy s.t. it always copies index/columns. #4830

jtratner · 2013-09-13T00:56:27Z

Fixes #4202 (and maybe some others too).

Only copies index/columns with deep=True on BlockManager. Plus some
tests...yay!

jtratner · 2013-09-13T00:58:37Z

@hayd copying will work now :)

jreback · 2013-09-13T00:59:39Z

pandas/core/internals.py

+        if deep:
+            new_axes = [ax.copy(deep=deep) for ax in self.axes]
+        else:
+            new_axes = list(self.axes)
        return self.apply('copy', axes=new_axes, deep=deep, do_integrity_check=False)


just pass ref_items=new_axes[0] if deep else None (and it will get passed to the block), the you dont' need the mod in apply

duh - good catch. 😄

jtratner · 2013-09-14T15:56:10Z

this works for everything but SparseSeries and SparsePanel at this point...still trying to figure them out.

jtratner · 2013-09-15T03:28:31Z

pandas/core/groupby.py

@@ -1988,7 +1988,8 @@ def transform(self, func, *args, **kwargs):

            # broadcasting
            if isinstance(res, Series):
-                if res.index is obj.index:
+                # TODO: Possibly could just check for length here...not clear
+                if res.index.identical(obj.index):


@jreback, @cpcloud or others - can't do this check any more here, because of the change in how copying works (and because more ops end up returning views rather than the same object), I think identical is okay, because it seems relatively fast. I didn't see much of a change on perf tests, but they are relatively unreliable on my computer. What do you think? I wasn't sure whether what mattered was whether it was the same object or if it was just the length...

Hm is is much faster than identical, even for tiny indices:

In [9]: i = Index([1, 2, 3]) In [10]: timeit i.identical(i) 1000000 loops, best of 3: 1.43 µs per loop In [11]: timeit i is i 10000000 loops, best of 3: 60.2 ns per loop

Is there any way to try the is check and if that fails then use identical.

@cpcloud with these changes, in the entire test suite, the is check never succeeds. I was thinking of checking is on the base first, since it's a view. Would that work? Here's what I get timing wise for this:

In [3]: %timeit i is i %timeit 10000000 loops, best of 3: 95.6 ns per loop In [4]: %timeit i.base is i.base 1000000 loops, best of 3: 369 ns per loop In [5]: %timeit i.identical(i) 100000 loops, best of 3: 2.1 µs per loop

Works with MultiIndex too:

In [6]: mi = MultiIndex.from_tuples(zip(range(10), range(10))) In [7]: mi.base Out[7]: array([], dtype=object) In [8]: mi2 = MultiIndex.from_tuples(zip(range(10), range(10)) ...: ) In [9]: mi2 Out[9]: MultiIndex [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)] In [10]: mi2.base Out[10]: array([], dtype=object) In [11]: mi.base is mi2.base Out[11]: False In [12]: %timeit mi is mi %timeit mi.base i10000000 loops, best of 3: 95.8 ns per loop In [13]: %timeit mi.base is mi.base 1000000 loops, best of 3: 351 ns per loop In [14]: %timeit mi.identical(mi) 100000 loops, best of 3: 12.6 µs per loop In [15]: %timeit mi.base is mi2.base 1000000 loops, best of 3: 355 ns per loop

Length check is not MI friendly:

In [18]: %timeit len(mi) == len(mi2) 100000 loops, best of 3: 4.96 µs per loop

btw - maybe we should, at some point, optimize the length check on multiindex...

nope. base doesn't work in every case either :-(

jreback · 2013-09-15T04:32:21Z

re identical

u might try something like

if a is b or a.base is b.base: ( maybe do a share memory check too)
return True

return a.identizal(b)

so the actual identity case is a lot faster

jtratner · 2013-09-15T05:31:52Z

Problem is that bases gets changed with view. I.e.:

In [2]: i = Index(range(10)
   ...: )

In [3]: i.base
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: i.view().base
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [5]: i.view().view().base is i
Out[5]: True

so slight tweak on base.

jtratner · 2013-09-15T05:54:29Z

@jreback @cpcloud here's what I'm thinking for index, adding an is_ method:

    def is_(self, other):
        """returns True if and only if the indices actually are the same underlying array.
         Generally survives through views. Doesn't care about metadata."""
        if self is other:
            return True
        s_base, o_base = self.base, other.base
        return self is o_base or other is s_base or s_base is o_base

(base calls aren't cached, which is why it needs to be setup this way)

cpcloud · 2013-09-15T12:28:12Z

pandas/sparse/panel.py

+        d = self._construct_axes_dict()
+        if deep:
+            new_data = dict((k, v.copy()) for k, v in compat.iteritems(self._frames))
+            d = dict((k, v.copy()) for k, v in compat.iteritems(d))


should these be v.copy(deep=True)?

yes they should

I'll change.

jtratner · 2013-09-15T18:13:18Z

blah, this doesn't quite work either :-/ multiple levels of view screw up .base.

jtratner · 2013-09-16T00:22:46Z

It's going to take some thinking to decide on how best to efficiently handle flexible metadata for Index. I was thinking of setting an _id attribute that gets copied on views but changed when slicing, copying or setting attributes. then the is_ check would just be comparing the equality of two ids (probably hashes of id + type name).

jtratner · 2013-09-21T00:41:12Z

Got this working (and faster) with the is_ method from #4909. @cpcloud you want to take a look at either this or #4909 to see whether you think this is fast enough for groupby?

cpcloud · 2013-09-21T00:42:15Z

i'll vbench it and report back :)

jtratner · 2013-09-21T00:44:34Z

if it's not good enough, we can shave off about 600-800ns per iteration by skipping the function call and using self.index._id is other.index._id

jtratner · 2013-09-22T22:33:46Z

Is this the right behavior?

ind = Index(range(10))
df = DataFrame(range(10), ind)
assert df.index is ind
assert (df + 1).index is ind

jreback · 2013-09-22T22:50:19Z

@jtratner yep

jtratner · 2013-09-22T22:54:58Z

pandas/sparse/series.py

@@ -116,7 +116,7 @@ def __init__(self, data, index=None, sparse_index=None, kind='block',

            if is_sparse_array:
                if isinstance(data, SparseSeries) and index is None:
-                    index = data.index
+                    index = data.index.view()


so then, should all of these places actually make views on the index? Less sure about sparse...

in this case what's the different? whether I assign an index or its view?

just whether you want to be creating a view every time. Minimal difference (if at all).

Only copies index/columns with `deep=True` on `BlockManager`. Plus some tests...yay! Change copy to make views of indices. Requires changing groupby to check whether indices are identical, rather than comparing with `is`. Plus add tests for name checks through everything. Also, fixes tests to use `is_()` rather than `is` CLN: Change groupby to use 'is_' instead

jtratner · 2013-09-24T03:27:38Z

@jreback @cpcloud does this look okay to merge?

jreback · 2013-09-24T03:32:46Z

ok

cpcloud · 2013-09-24T03:40:45Z

yep

BUG: Fix copy s.t. it always copies index/columns.

jreback reviewed Sep 13, 2013
View reviewed changes

jtratner reviewed Sep 15, 2013
View reviewed changes

cpcloud reviewed Sep 15, 2013
View reviewed changes

jtratner mentioned this pull request Sep 17, 2013

ENH: Allow fast comparisons of Index views, similar to 'is' checks. #4859

Closed

jtratner reviewed Sep 22, 2013
View reviewed changes

jtratner and others added 3 commits September 23, 2013 23:27

ENH: enable deep copy for sparse Series/Panel

8de0778

CLN: PEP8 tests/test_index.py and core/index

42d1d74

jtratner added a commit that referenced this pull request Sep 24, 2013

Merge pull request #4830 from jtratner/copy-index-and-columns

d7d9a6c

BUG: Fix copy s.t. it always copies index/columns.

jtratner merged commit d7d9a6c into pandas-dev:master Sep 24, 2013

jtratner deleted the copy-index-and-columns branch September 24, 2013 03:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix copy s.t. it always copies index/columns. #4830

BUG: Fix copy s.t. it always copies index/columns. #4830

jtratner commented Sep 13, 2013

jtratner commented Sep 13, 2013

jreback Sep 13, 2013

jtratner Sep 13, 2013

jtratner commented Sep 14, 2013

jtratner Sep 15, 2013

cpcloud Sep 15, 2013

jtratner Sep 15, 2013

jtratner Sep 15, 2013

jreback commented Sep 15, 2013

jtratner commented Sep 15, 2013

jtratner commented Sep 15, 2013

cpcloud Sep 15, 2013

jreback Sep 15, 2013

jtratner Sep 15, 2013

jtratner commented Sep 15, 2013

jtratner commented Sep 16, 2013

jtratner commented Sep 21, 2013

cpcloud commented Sep 21, 2013

jtratner commented Sep 21, 2013

jtratner commented Sep 22, 2013

jreback commented Sep 22, 2013

jtratner Sep 22, 2013

jreback Sep 22, 2013

jtratner Sep 22, 2013

jtratner commented Sep 24, 2013

jreback commented Sep 24, 2013

cpcloud commented Sep 24, 2013

BUG: Fix copy s.t. it always copies index/columns. #4830

BUG: Fix copy s.t. it always copies index/columns. #4830

Conversation

jtratner commented Sep 13, 2013

jtratner commented Sep 13, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtratner commented Sep 14, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Sep 15, 2013

jtratner commented Sep 15, 2013

jtratner commented Sep 15, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtratner commented Sep 15, 2013

jtratner commented Sep 16, 2013

jtratner commented Sep 21, 2013

cpcloud commented Sep 21, 2013

jtratner commented Sep 21, 2013

jtratner commented Sep 22, 2013

jreback commented Sep 22, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtratner commented Sep 24, 2013

jreback commented Sep 24, 2013

cpcloud commented Sep 24, 2013