Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor indexing #54

Merged
merged 3 commits into from
Feb 28, 2016
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 26 additions & 3 deletions lib/daru/index.rb
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,13 @@ def [](*key)

slice first, last
when key.size > 1
Daru::Index.new key.map { |k| self[k] }
if include? key[0]
Daru::Index.new key.map { |k| k }
else
# Assume the user is specifing values for index not keys
# Return index object having keys corresponding to values provided
Daru::Index.new key.map { |k| key k }
end
else
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v0dro I am on my way to improve the indexing infrastructure. I have made a major change in Index behavior. I think an example will illustrate it best:

[3] pry(main)> i = Daru::Index.new([:a, :b, :c])
=> #<Daru::Index:0x00000003f5ca50
 @keys=[:a, :b, :c],
 @relation_hash={:a=>0, :b=>1, :c=>2},
 @size=3>
[4] pry(main)> i[0, 2]
=> #<Daru::Index:0x00000003ed14f0
 @keys=[:a, :c],
 @relation_hash={:a=>0, :c=>1},
 @size=2>

Earlier it used to be:

[4] pry(main)> i = Daru::Index.new([:a, :b, :c])
=> #<Daru::Index:0x000000030be4f8
 @keys=[:a, :b, :c],
 @relation_hash={:a=>0, :b=>1, :c=>2},
 @size=3>
[5] pry(main)> i[0, 2]
=> #<Daru::Index:0x00000003006560
 @keys=[0, 2],
 @relation_hash={0=>0, 2=>1},
 @size=2>

The motive behind this is to remove the frequent checking of type of index and moving the functionality of guessing that the user is perhaps giving the index values (not keys) from Vector class to Index class because it seems more inherent to index.

Are you good with this?

If you are good with this then I plan to the same with MultiIndex and finally remove the conditionals.

v = @relation_hash[loc]
return loc if v.nil?
Expand Down Expand Up @@ -143,6 +149,14 @@ def self._load data

Daru::Index.new(h[:relation_hash].keys)
end

# Provide an Index for sub vector produced
#
# @param input_indexes [Array] the input by user to index the vector
# @return [Object] the Index object for sub vector produced
def conform input_indexes
self
end
end # class Index

class MultiIndex < Index
Expand Down Expand Up @@ -214,7 +228,7 @@ def [] *key
case
when key[0].is_a?(Range) then retrieve_from_range(key[0])
when (key[0].is_a?(Integer) and key.size == 1) then try_retrieve_from_integer(key[0])
else retrieve_from_tuples(key)
else retrieve_from_tuples key
end
end

Expand All @@ -236,7 +250,7 @@ def retrieve_from_tuples key
chosen = find_all_indexes label, level_index, chosen
end

return chosen[0] if chosen.size == 1
return chosen[0] if chosen.size == 1 and key.size == @levels.size
return multi_index_from_multiple_selections(chosen)
end

Expand Down Expand Up @@ -330,5 +344,14 @@ def values
def inspect
"Daru::MultiIndex:#{self.object_id} (levels: #{levels}\nlabels: #{labels})"
end

# Provide a MultiIndex for sub vector produced
#
# @param input_indexes [Array] the input by user to index the vector
# @return [Object] the MultiIndex object for sub vector produced
def conform input_indexes
return self if input_indexes[0].is_a? Range
drop_left_level input_indexes.size
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that you've used this method to remove the left most level on this line and have kept it here for a uniform interface between indexes.

But according to the previous functionality, the number of levels that were removed were equal to the number of levels that were specified by the user.

For example:

d =        Daru::MultiIndex.from_tuples([
  [:c,:one,:bar],          
  [:c,:one,:baz],          
  [:c,:two,:foo],          
  [:c,:two,:bar]          
])            
#=> Daru::MultiIndex:96634260 (levels: [[:c], [:one, :two], [:bar, :baz, :foo]]
#labels: [[0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 0]])
v = Daru::Vector.new([1,2,3,4], index: d)
#<Daru::Vector:101434370 @name = nil @size = 4 >
#                              nil
#[:c, :one, :bar]                1
#[:c, :one, :baz]                2
#[:c, :two, :foo]                3
#[:c, :two, :bar]                4

v[:c, :one]
#<Daru::Vector:101291480 @name = nil @size = 2 >
#          nil
#[:bar]      1
#[:baz]      2

v[:c]
#=> 
#<Daru::Vector:101176290 @name = nil @size = 4 >
#                      nil
#[:one, :bar]            1
#[:one, :baz]            2
#[:two, :foo]            3
#[:two, :bar]            4

I don't quite understand how the correct number of levels are being dropped in spite of you nowhere specifying the number of levels the user has asked for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also factor_out is not a very verbose method name. Can you make it more specific? Also add documentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand how the correct number of levels are being dropped in spite of you nowhere specifying the number of levels the user has asked for.

factor_out method drops all levels from left until a level with size greater than 1 is encountered.
For example

[17] pry(main)> d = Daru::MultiIndex.from_tuples([[:a, :one], [:a, :two]])
=> Daru::MultiIndex:20963520 (levels: [[:a], [:one, :two]]
labels: [[0, 0], [0, 1]])
[5] pry(main)> d.factor_out
=> Daru::MultiIndex:20450720 (levels: [[:one, :two]]
labels: [[0, 1]])

Here [:a] level got dropped because it's size was 1. But there's a little problem with this. It's a little inconsistent. For example:

[21] pry(main)> d = Daru::MultiIndex.from_tuples([[:a, :one], [:b, :two]])
=> Daru::MultiIndex:22630440 (levels: [[:a, :b], [:one, :two]]
labels: [[0, 1], [0, 1]])
[22] pry(main)> v = Daru::Vector.new([1,2], index: d)
=> 
#<Daru::Vector:22281380 @name = nil @size = 2 >
                  nil
[:a, :one]          1
[:b, :two]          2

[23] pry(main)> v[:a]
=> 1

Notice, it never showed the [:one].

I checked out the previous functionality and it too didn't showed [:one].

If I want to correct this inconsistency I think I would need to pass arguments to the factor_out how much levels to remove. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you will absolutely have to pass the number of levels to be dropped because when a user specifies a given number of indexes for a MultiIndex vector/dataframe, only those many levels should be dropped.

Your previous approach worked mainly because the specs were so written. What if the levels were like this:

levels: [[:a], [:one], [:one, :two, :three, :four]]

And I have specified only v[:a] when calling the vector?
Don't rely only on the tests, they're there just as a reference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And about showing the :one; if I remember correctly, I had ported that functionality from Python's pandas. I think they follow the same convention if there's only one element with top level index :a.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v0dro No, they don't follow the same convention. For example:

>>> index = pd.MultiIndex.from_tuples([['a', 'one'], ['b', 'two']], names=['first', 'second'])
>>> s = pd.Series(np.random.randn(8), index=index)
>>> s
first  second
a      one       1.387915
b      two      -1.169805
dtype: float64
>>> s['a']
second
one    1.387915
dtype: float64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Port the pandas functionality then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just it fixed it. Now it's showing :one. Should I add test for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely.

end
end
69 changes: 11 additions & 58 deletions lib/daru/vector.rb
Original file line number Diff line number Diff line change
Expand Up @@ -199,56 +199,19 @@ def self.[](*args)
#
# # For vectors employing hierarchial multi index
#
def [](*indexes)
location = indexes[0]
if @index.is_a?(MultiIndex)
sub_index = @index[indexes]
result =
if sub_index.is_a?(Integer)
@data[sub_index]
else
elements = sub_index.map do |tuple|
@data[@index[tuple]]
end

if !indexes[0].is_a?(Range) and indexes.size < @index.width
sub_index = sub_index.drop_left_level indexes.size
end
Daru::Vector.new(
elements, index: sub_index, name: @name, dtype: @dtype)
end

return result
else
raise TypeError, "Invalid index type #{location.inspect}.\
\nUsage: v[:a, :b] gives elements with keys :a and :b for vector v." if location.is_a? Array

unless indexes[1]
case location
when Range
first = location.first
last = location.last
indexes = @index.slice first, last
else
pos = @index[location]
if pos.is_a?(Numeric)
return @data[pos]
else
indexes = pos
end
end
else
indexes = indexes.map { |e| named_index_for(e) }
end
def [](*input_indexes)
# Get a proper index object
indexes = @index[*input_indexes]

begin
Daru::Vector.new(
indexes.map { |loc| @data[@index[loc]] },
name: @name, index: indexes, dtype: @dtype)
rescue NoMethodError
raise IndexError, "Specified index #{pos.inspect} does not exist."
end
# If one object is asked return it
if indexes.is_a? Numeric
return @data[indexes]
end

# Form a new Vector using indexes and return it
Daru::Vector.new(
indexes.map { |loc| @data[@index[loc]] },
name: @name, index: indexes.conform(input_indexes), dtype: @dtype)
end

# Just like in Hashes, you can specify the index label of the Daru::Vector
Expand Down Expand Up @@ -1311,16 +1274,6 @@ def cast_vector_to dtype, source=nil, nm_dtype=nil
new_vector
end

def named_index_for index
if @index.include? index
index
elsif @index.key index
@index.key index
else
raise IndexError, "Specified index #{index} does not exist."
end
end

def index_for index
if @index.include?(index)
@index[index]
Expand Down
2 changes: 1 addition & 1 deletion spec/dataframe_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -455,7 +455,7 @@
end

it "returns a Vector if the last level of MultiIndex is tracked" do
expect(@df_mi[:a, :one]).to eq(
expect(@df_mi[:a, :one, :bar]).to eq(
Daru::Vector.new(@vector_arry1, index: @multi_index))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't tracking the last level.

end
end
Expand Down
10 changes: 5 additions & 5 deletions spec/index_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@
context "#[]" do
before do
@id = Daru::Index.new [:one, :two, :three, :four, :five, :six, :seven]
@mixed_id = Daru::Index.new ['a','b','c',:d,:a,0,3,5]
@mixed_id = Daru::Index.new ['a','b','c',:d,:a,8,3,5]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason why you replaced the 0 with 8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0 was a key as well as index value, and since the first input that is 0 was a key the new code treated the whole input as key values rather than index values during this test, that's why I had to change it to 8 so that the code treats input as index values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. But keep in mind that if 0 exists in the named Index, the Index must return the corresponding element position.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give an example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example 1. Notice that since 0 is present in the index, daru should not return the 0th element but rather the element at index 0. This applies for ranges too.

[18] pry(main)> a = Daru::Vector.new([1,2,3,4], index: [2,3,0,1])
=> 
#<Daru::Vector:88293040 @name = nil @size = 4 >
    nil
  2   1
  3   2
  0   3
  1   4

[19] pry(main)> a[0]
=> 3

Example 2. Here 0 isn't present in the index so it returns the 0th element:

[24] pry(main)> b = Daru::Vector.new([4,3,5,9], index: [:a, :b, :c, :d])
=> 
#<Daru::Vector:87873110 @name = nil @size = 4 >
    nil
  a   4
  b   3
  c   5
  d   9

[25] pry(main)> b[0]
=> 4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My code is good with these tests!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem then :)

end

it "works with ranges" do
Expand All @@ -88,12 +88,12 @@
expect(@mixed_id[0..2]).to eq(Daru::Index.new(['a','b','c']))

# If atleast one is a number then refer to actual indexing
expect(@mixed_id.slice('b',0)).to eq(Daru::Index.new(['b','c',:d,:a,0]))
expect(@mixed_id.slice('b',8)).to eq(Daru::Index.new(['b','c',:d,:a,8]))
end

it "returns multiple keys if specified multiple indices" do
expect(@id[0,1,3,4]).to eq(Daru::Index.new([0,1,3,4]))
expect(@mixed_id[0,5,3,2]).to eq(Daru::Index.new([5, 7, 6, 2]))
expect(@id[0,1,3,4]).to eq(Daru::Index.new([:one, :two, :four, :five]))
expect(@mixed_id[0,5,3,2]).to eq(Daru::Index.new(['a', 8, :d, 'c']))
end

it "returns correct index position for non-numeric index" do
Expand All @@ -102,7 +102,7 @@
end

it "returns correct index position for mixed index" do
expect(@mixed_id[0]).to eq(5)
expect(@mixed_id[8]).to eq(5)
expect(@mixed_id['c']).to eq(2)
end
end
Expand Down
11 changes: 9 additions & 2 deletions spec/vector_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -177,11 +177,12 @@
[:c,:one,:bar],
[:c,:one,:baz],
[:c,:two,:foo],
[:c,:two,:bar]
[:c,:two,:bar],
[:d,:one,:foo]
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to add [:d, :one, :foo] to test a special case i.e. single index exist under second layer.

@multi_index = Daru::MultiIndex.from_tuples(@tuples)
@vector = Daru::Vector.new(
Array.new(12) { |i| i }, index: @multi_index,
Array.new(13) { |i| i }, index: @multi_index,
dtype: dtype, name: :mi_vector)
end

Expand Down Expand Up @@ -211,6 +212,12 @@
dtype: dtype, name: :sub_sub_vector))
end

it "returns sub vector not a single element when passed the partial tuple" do
mi = Daru::MultiIndex.from_tuples([[:foo]])
expect(@vector[:d, :one]).to eq(Daru::Vector.new([12], index: mi,
dtype: dtype, name: :sub_sub_vector))
end

it "returns a vector with corresponding MultiIndex when specified numeric Range" do
mi = Daru::MultiIndex.from_tuples([
[:a,:two,:baz],
Expand Down