Erroneous behaviours of MultiIndex #342

Open
zverok opened this issue May 9, 2017 · 14 comments
@zverok
Collaborator

zverok commented May 9, 2017

Shown at #340

df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [101,102,103,104,105],
        c: [11,22,33,44,55]},
      order: [:a, :b, :c],
      index: [[:k], [:k], [:k], [:l], [:l]])
# => #<Daru::DataFrame(5x3)>
#       a   b   c
#   k 101  11  11
#     102  12  22
#     103  13  33
#   l 104  14  44
#     105  15  55 

Problems:

  1. one-level MultiIndex should not be a thing
  2. MultiIndex with repeated tuples should not be a thing
@zverok zverok added the bug label May 9, 2017
@zverok zverok mentioned this issue May 9, 2017
3 tasks
@Shekharrajak
Member

@zverok, can you please show some examples and the expected output?

@zverok
Collaborator Author

zverok commented Jul 22, 2017

@Shekharrajak, I believe that:

  1. Creating a MultiIndex from single-element tuples should either be impossible (raise an error) or be silently converted into a plain Index.
  2. Attempting to create a MultiIndex with repeating tuples should be prohibited.

@zverok
Collaborator Author

zverok commented Jul 22, 2017

E.g.:

df = Daru::DataFrame.new({b: [11,12], a: [101,102], c: [11,22]},
      order: [:a, :b, :c],
      index: [[:k], [:l]])
# v1:
# ArgumentError: MultiIndex can't consist of single-element tuples!
# or v2:
df.index
# => #<Daru::Index(2): {k, l}> -- not MultiIndex!

And

df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [101,102,103,104,105],
        c: [11,22,33,44,55]},
      order: [:a, :b, :c],
      index: [[:k], [:k], [:k], [:l], [:l]])
# ArgumentError: repeating values in index!
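
The two proposed rules can be sketched in plain Ruby, independently of Daru. Note that `normalize_index_labels` is a hypothetical helper invented for this illustration, not part of Daru's API; it assumes the v2 behaviour for single-element tuples and raises on repeated labels.

```ruby
# Hypothetical helper (not part of Daru): applies the two proposed rules
# to a raw list of index labels.
def normalize_index_labels(labels)
  # Rule 1 (v2): single-element tuples collapse to plain labels.
  if labels.all? { |l| l.is_a?(Array) && l.size == 1 }
    labels = labels.map(&:first)
  end

  # Rule 2: repeated labels (or repeated tuples) are rejected.
  unless labels.uniq.size == labels.size
    raise ArgumentError, 'repeating values in index!'
  end

  labels
end

normalize_index_labels([[:k], [:l]])          # => [:k, :l]
normalize_index_labels([[:k, :m], [:l, :m]])  # => [[:k, :m], [:l, :m]]
```

With this sketch, `[[:k], [:k], [:k], [:l], [:l]]` is first unwrapped to plain labels and then rejected as containing duplicates.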

@Shekharrajak
Member

Thanks! I think v2 is the better option for the 1st example.

For the 2nd example, I think it should allow repeating index values. That means in the 2nd example df must be:

=> #<Daru::DataFrame(5x3)>
       a   b   c
   k 101  11  11
   k 102  12  22
   k 103  13  33
   l 104  14  44
   l 105  15  55

So when the user wants the values at index k:

df[:a][:k] 

       a  
   k 101 
   k 102 
   k 103  

That means

irb(main):025:0> df = Daru::DataFrame.new({b: [11,12], a: [101,102], c: [11,22]},
irb(main):026:1*       order: [:a, :b, :c],
irb(main):027:1*       index: [[:k, :m], [:k, :m]])
=> #<Daru::DataFrame(2x3)>
           a   b   c
   k   m 101  11  11
       m 102  12  22

not this :

=> #<Daru::DataFrame(2x3)>
           a   b   c
   k   m 101  11  11
         102  12  22

That way we can access the rows using df[:a][:k], which gives:

        a   
   m  101  
   m  102  

Is it a good idea? @zverok

@zverok
Collaborator Author

zverok commented Jul 22, 2017

I think, it should allow repeating index values.

I believe an index should by definition be unique (it becomes complicated with "category indexes", and I don't feel I understand that area clearly, but the generic rule is simple: "an index is a set of unique names for rows"). But that is just my opinion.

@v0dro @lokeshh WDYT?

@v0dro
Member

v0dro commented Aug 3, 2017

Pandas allows repeating values in an index. However, since we haven't come across a concrete use case where this functionality is useful, I think there is no need to spend effort on making it happen. We would most likely need to change the underlying data structure that stores the index (it's currently a Hash), and making the replacement as fast as a Hash (in pure Ruby) would be a challenge.
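
To illustrate why a Hash-backed index and duplicate labels do not mix, here is a minimal plain-Ruby sketch. It assumes the index is modelled as a Hash from label to row position; that model is an assumption made for illustration, not a description of Daru's actual internals.

```ruby
# Assumption: the index is modelled as a Hash from label to row position.
labels = [:k, :k, :k, :l, :l]

# label => single position: later duplicates overwrite earlier ones,
# so rows 0, 1 and 3 become unreachable by label.
unique_map = labels.each_with_index.to_h
# unique_map == { k: 2, l: 4 }

# label => list of positions: keeps duplicates, but every lookup now
# returns an Array and the structure costs more to build and maintain.
multi_map = Hash.new { |hash, key| hash[key] = [] }
labels.each_with_index { |label, pos| multi_map[label] << pos }
# multi_map == { k: [0, 1, 2], l: [3, 4] }
```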

@lokeshh
Member

lokeshh commented Aug 4, 2017

@zverok CategoricalIndex is there to deal with duplicate indexes, so I think it's fine if we restrict Index and MultiIndex to be unique. But I can't agree with the definition that an index should be unique, because the widely accepted view of an index is:

"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure." (see https://en.wikipedia.org/wiki/Database_index)

which doesn't presume that an index should be unique.

@v0dro
Member

v0dro commented Aug 5, 2017

Lokesh has a point. However, let's put off the uniqueness issue until someone comes up with a concrete use case.

@zverok
Collaborator Author

zverok commented Aug 5, 2017

"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure."

I don't believe "database index" is a good metaphor here: in that case it would be an auxiliary structure, added to the dataframe for easier access (and we could have 10 different indexes for different types of access).

A dataframe index is rather a set of unique names for the rows, as far as I can understand, and therefore https://en.wikipedia.org/wiki/Index_(publishing) is a better comparison.

@gnilrets
Contributor

gnilrets commented Aug 6, 2017

I'm still struggling to understand why indexes more complex than sequential integers are really necessary for dataframes. Ideally, #where on a single vector should be as performant as any index lookup, especially since we're restricted to only one index per dataframe.

@lokeshh
Member

lokeshh commented Aug 6, 2017

Ideally, #where on a single vector should be as performant as any index lookup, especially since we're restricted to only one index per dataframe.

@gnilrets We cannot increase the lookup performance of #where, because doing so would cost us additional updates and writes, which are expensive. This is the whole point of having an index: it gives us faster lookup, but at an additional cost.

@gnilrets Do you agree?
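
The trade-off can be sketched in plain Ruby. This is only an illustration under the assumption that a filter-style lookup has to scan every element while an index is a prebuilt label-to-position Hash; it does not reproduce how Daru implements `#where` or its index classes.

```ruby
labels = [:a, :b, :c, :d, :e]
values = [101, 102, 103, 104, 105]

# Scan-based lookup: no extra structure to maintain, but each query
# visits every label (O(n) per lookup).
scan_positions = labels.each_index.select { |i| labels[i] == :d }
# scan_positions == [3]

# Index-based lookup: the Hash is built once and must be kept in sync
# on every insert, delete or reorder, but each query is a single
# Hash access (O(1) on average).
position_of = labels.each_with_index.to_h
indexed_position = position_of[:d]
# indexed_position == 3

values[indexed_position] # => 104
```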

@zverok
Collaborator Author

zverok commented Aug 11, 2017

I'm still struggling with understanding why indexes more complex than sequential integers are really necessary for dataframes.

At least because of the "special" indexes (MultiIndex, which is easy to slice by part of a tuple, and DateTimeIndex, where you can query an entire year). I believe the notion of Index in the sense we use it in Daru comes from spreadsheets/accounting, where typical tables look like

          Observation1  Observation2  Observation3
Subject1  value11       value12       value13
Subject2  value21       value22       value23
Subject3  value31       value32       value33

This is the typical way scientists think of data, I believe.
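
As a rough plain-Ruby illustration of what these "special" lookups mean (not Daru's implementation; `positions_for_prefix` is a hypothetical helper made up for this sketch):

```ruby
require 'date'

# Hypothetical helper: positions of all tuples that start with `prefix`.
def positions_for_prefix(tuples, prefix)
  tuples.each_index.select { |i| tuples[i].first(prefix.size) == prefix }
end

# Slicing a tuple-based index by part of a tuple.
tuples = [[:k, :m], [:k, :n], [:l, :m]]
positions_for_prefix(tuples, [:k])      # => [0, 1]
positions_for_prefix(tuples, [:l, :m])  # => [2]

# Querying a date-based index for an entire year.
dates = [Date.new(2016, 12, 31), Date.new(2017, 1, 1), Date.new(2017, 6, 15)]
in_2017 = dates.each_index.select { |i| dates[i].year == 2017 }
# in_2017 == [1, 2]
```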

@Shekharrajak
Member

I think that if the index values are not unique, then Daru::Index should automatically return a Daru::CategoricalIndex, just as Daru::Index returns a Daru::MultiIndex when tuples are passed.

That is:

irb(main):012:0> Daru::Index.new([1,2,3])
=> #<Daru::Index(3): {1, 2, 3}>
irb(main):013:0> Daru::Index.new([[1,2,3], [2,3,4]])
=> #<Daru::MultiIndex(2x3)>
   1   2   3
   2   3   4
irb(main):014:0> Daru::Index.new([1,1,2,2,3,3])
=> #<Daru::Index(3): {1, 2, 3}>  # this must be => #<Daru::CategoricalIndex(6): {1, 1, 2, 2, 3, 3}>
 

Wouldn't that be good?
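
The dispatch rule suggested here could look roughly like the following plain-Ruby sketch. `index_kind` is a hypothetical function written for illustration; it only names which class would be chosen and does not call into Daru.

```ruby
# Hypothetical dispatch rule (not Daru code): decide which kind of index
# a list of labels would map to under the proposal.
def index_kind(labels)
  # Tuples already produce a MultiIndex today, per the irb session above.
  return :multi_index if labels.first.is_a?(Array)
  # Proposed: duplicate labels switch to CategoricalIndex.
  return :categorical_index if labels.uniq.size < labels.size

  :index
end

index_kind([1, 2, 3])               # => :index
index_kind([[1, 2, 3], [2, 3, 4]])  # => :multi_index
index_kind([1, 1, 2, 2, 3, 3])      # => :categorical_index
```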

@Shekharrajak
Member

Shekharrajak commented Sep 3, 2017

I am using CategoricalIndex when only one level and its labels are left (and duplicate index values are present); see: https://github.com/SciRuby/daru/pull/340/files#diff-df0c816a5a6b82ab4d961bf9d1a0acbfR248

@zverok zverok added this to the Version 1.0 milestone Oct 8, 2017