Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix copy s.t. it always copies index/columns. #4830

Merged
merged 3 commits into from Sep 24, 2013

Conversation

jtratner
Copy link
Contributor

Fixes #4202 (and maybe some others too).

Only copies index/columns with deep=True on BlockManager. Plus some
tests...yay!

@jtratner
Copy link
Contributor Author

@hayd copying will work now :)

if deep:
new_axes = [ax.copy(deep=deep) for ax in self.axes]
else:
new_axes = list(self.axes)
return self.apply('copy', axes=new_axes, deep=deep, do_integrity_check=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just pass ref_items=new_axes[0] if deep else None (and it will get passed to the block), the you dont' need the mod in apply

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duh - good catch. 😄

@jtratner
Copy link
Contributor Author

this works for everything but SparseSeries and SparsePanel at this point...still trying to figure them out.

@@ -1988,7 +1988,8 @@ def transform(self, func, *args, **kwargs):

# broadcasting
if isinstance(res, Series):
if res.index is obj.index:
# TODO: Possibly could just check for length here...not clear
if res.index.identical(obj.index):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback, @cpcloud or others - can't do this check any more here, because of the change in how copying works (and because more ops end up returning views rather than the same object), I think identical is okay, because it seems relatively fast. I didn't see much of a change on perf tests, but they are relatively unreliable on my computer. What do you think? I wasn't sure whether what mattered was whether it was the same object or if it was just the length...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm is is much faster than identical, even for tiny indices:

In [9]: i = Index([1, 2, 3])

In [10]: timeit i.identical(i)
1000000 loops, best of 3: 1.43 µs per loop

In [11]: timeit i is i
10000000 loops, best of 3: 60.2 ns per loop

Is there any way to try the is check and if that fails then use identical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud with these changes, in the entire test suite, the is check never succeeds. I was thinking of checking is on the base first, since it's a view. Would that work? Here's what I get timing wise for this:

In [3]: %timeit i is i
%timeit 10000000 loops, best of 3: 95.6 ns per loop

In [4]: %timeit i.base is i.base
1000000 loops, best of 3: 369 ns per loop

In [5]: %timeit i.identical(i)
100000 loops, best of 3: 2.1 µs per loop

Works with MultiIndex too:

In [6]: mi = MultiIndex.from_tuples(zip(range(10), range(10)))

In [7]: mi.base
Out[7]: array([], dtype=object)
In [8]: mi2 = MultiIndex.from_tuples(zip(range(10), range(10))
   ...: )

In [9]: mi2
Out[9]:
MultiIndex
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]

In [10]: mi2.base
Out[10]: array([], dtype=object)

In [11]: mi.base is mi2.base
Out[11]: False

In [12]: %timeit mi is mi
%timeit mi.base i10000000 loops, best of 3: 95.8 ns per loop

In [13]: %timeit mi.base is mi.base
1000000 loops, best of 3: 351 ns per loop

In [14]: %timeit mi.identical(mi)
100000 loops, best of 3: 12.6 µs per loop

In [15]: %timeit mi.base is mi2.base
1000000 loops, best of 3: 355 ns per loop

Length check is not MI friendly:

In [18]: %timeit len(mi) == len(mi2)
100000 loops, best of 3: 4.96 µs per loop

btw - maybe we should, at some point, optimize the length check on multiindex...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. base doesn't work in every case either :-(

@jreback
Copy link
Contributor

jreback commented Sep 15, 2013

re identical

u might try something like

if a is b or a.base is b.base: ( maybe do a share memory check too)
return True

return a.identizal(b)

so the actual identity case is a lot faster

@jtratner
Copy link
Contributor Author

Problem is that bases gets changed with view. I.e.:

In [2]: i = Index(range(10)
   ...: )

In [3]: i.base
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: i.view().base
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [5]: i.view().view().base is i
Out[5]: True

so slight tweak on base.

@jtratner
Copy link
Contributor Author

@jreback @cpcloud here's what I'm thinking for index, adding an is_ method:

    def is_(self, other):
        """returns True if and only if the indices actually are the same underlying array.
         Generally survives through views. Doesn't care about metadata."""
        if self is other:
            return True
        s_base, o_base = self.base, other.base
        return self is o_base or other is s_base or s_base is o_base

(base calls aren't cached, which is why it needs to be setup this way)

d = self._construct_axes_dict()
if deep:
new_data = dict((k, v.copy()) for k, v in compat.iteritems(self._frames))
d = dict((k, v.copy()) for k, v in compat.iteritems(d))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be v.copy(deep=True)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they should

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change.

@jtratner
Copy link
Contributor Author

blah, this doesn't quite work either :-/ multiple levels of view screw up .base.

@jtratner
Copy link
Contributor Author

It's going to take some thinking to decide on how best to efficiently handle flexible metadata for Index. I was thinking of setting an _id attribute that gets copied on views but changed when slicing, copying or setting attributes. then the is_ check would just be comparing the equality of two ids (probably hashes of id + type name).

@jtratner
Copy link
Contributor Author

Got this working (and faster) with the is_ method from #4909. @cpcloud you want to take a look at either this or #4909 to see whether you think this is fast enough for groupby?

@cpcloud
Copy link
Member

cpcloud commented Sep 21, 2013

i'll vbench it and report back :)

@jtratner
Copy link
Contributor Author

if it's not good enough, we can shave off about 600-800ns per iteration by skipping the function call and using self.index._id is other.index._id

@jtratner
Copy link
Contributor Author

Is this the right behavior?

ind = Index(range(10))
df = DataFrame(range(10), ind)
assert df.index is ind
assert (df + 1).index is ind

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@jtratner yep

@@ -116,7 +116,7 @@ def __init__(self, data, index=None, sparse_index=None, kind='block',

if is_sparse_array:
if isinstance(data, SparseSeries) and index is None:
index = data.index
index = data.index.view()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so then, should all of these places actually make views on the index? Less sure about sparse...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case what's the different? whether I assign an index or its view?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just whether you want to be creating a view every time. Minimal difference (if at all).

jtratner and others added 3 commits September 23, 2013 23:27
Only copies index/columns with `deep=True` on `BlockManager`. Plus some
tests...yay! Change copy to make views of indices.

Requires changing groupby to check whether indices are identical, rather
than comparing with `is`. Plus add tests for name checks through
everything. Also, fixes tests to use `is_()` rather than `is`

CLN: Change groupby to use 'is_' instead
@jtratner
Copy link
Contributor Author

@jreback @cpcloud does this look okay to merge?

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

ok

@cpcloud
Copy link
Member

cpcloud commented Sep 24, 2013

yep

jtratner added a commit that referenced this pull request Sep 24, 2013
BUG: Fix copy s.t. it always copies index/columns.
@jtratner jtratner merged commit d7d9a6c into pandas-dev:master Sep 24, 2013
@jtratner jtratner deleted the copy-index-and-columns branch September 24, 2013 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Deep copy not copying index/columns
4 participants