Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loss of DF multi-index when joining a DF without multi-index #429

Open
info-rchitect opened this issue Oct 10, 2017 · 12 comments
Open

Loss of DF multi-index when joining a DF without multi-index #429

info-rchitect opened this issue Oct 10, 2017 · 12 comments
Labels
Milestone

Comments

@info-rchitect
Copy link

Hi,

I am seeing a multi-index DF lose its multi-index (goes to 0,1,2,3...) when doing a left join with a DF without a multi-index. Is that the expected behavior?

[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.index
=> #<Daru::MultiIndex(26x3)>
 2017W27 MYPROD1       A
 2017W28 MYPROD1       A
 2017W29 MYPROD1       A
 2017W30 MYPROD1       A
 2017W31 MYPROD1       A
 2017W32 MYPROD1       A
 2017W33 MYPROD1       A
 2017W34 MYPROD1       A
 2017W35 MYPROD1       A
 2017W36 MYPROD1       A
 2017W37 MYPROD1       A
 2017W38 MYPROD1       A
 2017W39 MYPROD1       A
 2017W40 MYPROD1       A
 2017W41 MYPROD1       A
 2017W42 MYPROD1       A
 2017W43 MYPROD1       A
 2017W44 MYPROD1       A
 2017W45 MYPROD1       A
 2017W46 MYPROD1       A
     ...     ...     ...
[7] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::Index(30): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 29}>
[8] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
[9] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(26): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 25}>

thx

@zverok
Copy link
Collaborator

zverok commented Oct 11, 2017

I believe it is a bug, but can you please provide a self-containing example for easier reproduction?

@info-rchitect
Copy link
Author

@zverok sorry for the delay, will get to this today.

@info-rchitect
Copy link
Author

OK here we go:

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data
=> #<Daru::DataFrame(4x7)>
                unit testA_ft1 testA_ws1 testB_ft1 testB_ws1 testC_ft1 testC_ws1
         1         1       nil       0.0       nil       0.0       nil       1.0
         2         2       0.0       nil       1.0       nil       0.0       nil
         3         3       nil       0.0       nil       0.0       nil       1.0
         4         4       0.0       nil       1.0       nil       0.0       nil
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj
=> #<Daru::DataFrame(4x4)>
                unit testA_ws1 testD_ws1 testE_ws1
         1         1       1.0       0.0       1.0
         2         2       0.0       1.0       0.0
         3         3       1.0       1.0       0.0
         4         4       0.0       1.0       0.0
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[4] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[5] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
=> #<Daru::DataFrame(4x10)>
             testA_ft1 testA_ws1_  testB_ft1  testB_ws1  testC_ft1  testC_ws1       unit testA_ws1_  testD_ws1  testE_ws1
          0        nil        0.0        nil        0.0        nil        1.0          1        1.0        0.0        1.0
          1        0.0        nil        1.0        nil        0.0        nil          2        0.0        1.0        0.0
          2        nil        0.0        nil        0.0        nil        1.0          3        1.0        1.0        0.0
          3        0.0        nil        1.0        nil        0.0        nil          4        0.0        1.0        0.0
[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(4): {0, 1, 2, 3}>

In reality, it seems like any join between a multiindex DF and any type of indexed DF results in a Daru::Index DF. I believe this is related to the merge issue.

@zverok
Copy link
Collaborator

zverok commented Oct 16, 2017

Please check if it works on current master should be solved by #431

@info-rchitect
Copy link
Author

@zverok I am still seeing the join method turn 2 DFs with multiindex into a single index but my example perhaps is not the best due to the fact that it looks like daru is actually 'optimizing' away the multi-index.

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index.to_a
=> [[1], [2], [3], [4]]
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index.to_a
=> [[1], [2], [3], [4]]
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)>
    60:           binding.pry
    61:           updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
 => 62:           binding.pry
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index.to_a
=> [0, 1, 2, 3]

Is this intended behavior? I will try to find a better example where the pre-join index cannot be flattened to be the same as the post-join index.

@zverok
Copy link
Collaborator

zverok commented Oct 19, 2017

Well, in fact, currently:

  • join does NOT preserve index (because in general case, two joined dataframes can have different indexes, how'd you preserve them?)
  • ...and just provides "default" index.

I have a suspicion that the most reasonable behavior would be "make join columns an index", but not sure it should/would/can work (see "duplicate index discussion" here ;)).

Anyways, I am pretty sure your initial request (preserving index of both dataframes on join) can't be performed. Why do you believe otherwise?

@info-rchitect
Copy link
Author

@zverok if two dataframes have the same index shouldn't the index be preserved?

@zverok
Copy link
Collaborator

zverok commented Oct 19, 2017

But why it should?
Logically, "joining" is merging of different data sources: people (indexed by name) join departments (indexed by id), books (indexed by isbn) join authors (indexed by name)...
What is the real-life use case for two dataframes that should be joined, but have the same index?

@info-rchitect
Copy link
Author

@zverok so let's say my multi-index is [:work_week, :factory] and i pull yield data from one database and I pull volume forecast data from another database that uses the same multi-index of [:work_week, :factory] and want them to be joined. why should I lose the matching multi-index on the subsequently created dataframe?

@zverok
Copy link
Collaborator

zverok commented Oct 19, 2017

By which column you join them, then?

@info-rchitect
Copy link
Author

those 2 columns.

@zverok
Copy link
Collaborator

zverok commented Oct 19, 2017

So, you join them on index?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants