Loss of DF multi-index when joining a DF without multi-index #429

info-rchitect · 2017-10-10T17:01:40Z

Hi,

I am seeing a multi-index DF lose its multi-index (goes to 0,1,2,3...) when doing a left join with a DF without a multi-index. Is that the expected behavior?

[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.index
=> #<Daru::MultiIndex(26x3)>
 2017W27 MYPROD1       A
 2017W28 MYPROD1       A
 2017W29 MYPROD1       A
 2017W30 MYPROD1       A
 2017W31 MYPROD1       A
 2017W32 MYPROD1       A
 2017W33 MYPROD1       A
 2017W34 MYPROD1       A
 2017W35 MYPROD1       A
 2017W36 MYPROD1       A
 2017W37 MYPROD1       A
 2017W38 MYPROD1       A
 2017W39 MYPROD1       A
 2017W40 MYPROD1       A
 2017W41 MYPROD1       A
 2017W42 MYPROD1       A
 2017W43 MYPROD1       A
 2017W44 MYPROD1       A
 2017W45 MYPROD1       A
 2017W46 MYPROD1       A
     ...     ...     ...
[7] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::Index(30): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 29}>
[8] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
[9] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(26): {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... 25}>

thx

zverok · 2017-10-11T18:52:28Z

I believe it is a bug, but can you please provide a self-containing example for easier reproduction?

info-rchitect · 2017-10-13T11:31:15Z

@zverok sorry for the delay, will get to this today.

info-rchitect · 2017-10-14T13:15:07Z

OK here we go:

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data
=> #<Daru::DataFrame(4x7)>
                unit testA_ft1 testA_ws1 testB_ft1 testB_ws1 testC_ft1 testC_ws1
         1         1       nil       0.0       nil       0.0       nil       1.0
         2         2       0.0       nil       1.0       nil       0.0       nil
         3         3       nil       0.0       nil       0.0       nil       1.0
         4         4       0.0       nil       1.0       nil       0.0       nil
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj
=> #<Daru::DataFrame(4x4)>
                unit testA_ws1 testD_ws1 testE_ws1
         1         1       1.0       0.0       1.0
         2         2       0.0       1.0       0.0
         3         3       1.0       1.0       0.0
         4         4       0.0       1.0       0.0
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[4] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index
=> #<Daru::MultiIndex(4x1)>
   1
   2
   3
   4
[5] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
=> #<Daru::DataFrame(4x10)>
             testA_ft1 testA_ws1_  testB_ft1  testB_ws1  testC_ft1  testC_ws1       unit testA_ws1_  testD_ws1  testE_ws1
          0        nil        0.0        nil        0.0        nil        1.0          1        1.0        0.0        1.0
          1        0.0        nil        1.0        nil        0.0        nil          2        0.0        1.0        0.0
          2        nil        0.0        nil        0.0        nil        1.0          3        1.0        1.0        0.0
          3        0.0        nil        1.0        nil        0.0        nil          4        0.0        1.0        0.0
[6] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index
=> #<Daru::Index(4): {0, 1, 2, 3}>

In reality, it seems like any join between a multiindex DF and any type of indexed DF results in a Daru::Index DF. I believe this is related to the merge issue.

zverok · 2017-10-16T18:44:10Z

Please check if it works on current master should be solved by #431

info-rchitect · 2017-10-16T20:54:00Z

@zverok I am still seeing the join method turn 2 DFs with multiindex into a single index but my example perhaps is not the best due to the fact that it looks like daru is actually 'optimizing' away the multi-index.

[1] pry(#<Nemawashi::Analytics::Transforms::Update>)> @source.data.index.to_a
=> [[1], [2], [3], [4]]
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> dataobj.index.to_a
=> [[1], [2], [3], [4]]
[3] pry(#<Nemawashi::Analytics::Transforms::Update>)>
    60:           binding.pry
    61:           updated_dataframe = @source.data.join(dataobj, how: :left, on: match_columns)
 => 62:           binding.pry
[2] pry(#<Nemawashi::Analytics::Transforms::Update>)> updated_dataframe.index.to_a
=> [0, 1, 2, 3]

Is this intended behavior? I will try to find a better example where the pre-join index cannot be flattened to be the same as the post-join index.

zverok · 2017-10-19T19:22:32Z

Well, in fact, currently:

join does NOT preserve index (because in general case, two joined dataframes can have different indexes, how'd you preserve them?)
...and just provides "default" index.

I have a suspicion that the most reasonable behavior would be "make join columns an index", but not sure it should/would/can work (see "duplicate index discussion" here ;)).

Anyways, I am pretty sure your initial request (preserving index of both dataframes on join) can't be performed. Why do you believe otherwise?

info-rchitect · 2017-10-19T19:25:23Z

@zverok if two dataframes have the same index shouldn't the index be preserved?

zverok · 2017-10-19T19:31:43Z

But why it should?
Logically, "joining" is merging of different data sources: people (indexed by name) join departments (indexed by id), books (indexed by isbn) join authors (indexed by name)...
What is the real-life use case for two dataframes that should be joined, but have the same index?

info-rchitect · 2017-10-19T19:36:23Z

@zverok so let's say my multi-index is [:work_week, :factory] and i pull yield data from one database and I pull volume forecast data from another database that uses the same multi-index of [:work_week, :factory] and want them to be joined. why should I lose the matching multi-index on the subsequently created dataframe?

zverok · 2017-10-19T19:45:04Z

By which column you join them, then?

info-rchitect · 2017-10-19T20:00:22Z

those 2 columns.

zverok · 2017-10-19T20:01:12Z

So, you join them on index?

info-rchitect mentioned this issue Oct 11, 2017

Merge on non default index drops index values #422

Closed

zverok added the bug label Oct 11, 2017

zverok added this to the Version 1.0 milestone Oct 11, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loss of DF multi-index when joining a DF without multi-index #429

Loss of DF multi-index when joining a DF without multi-index #429

info-rchitect commented Oct 10, 2017

zverok commented Oct 11, 2017

info-rchitect commented Oct 13, 2017

info-rchitect commented Oct 14, 2017

zverok commented Oct 16, 2017

info-rchitect commented Oct 16, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017

Loss of DF multi-index when joining a DF without multi-index #429

Loss of DF multi-index when joining a DF without multi-index #429

Comments

info-rchitect commented Oct 10, 2017

zverok commented Oct 11, 2017

info-rchitect commented Oct 13, 2017

info-rchitect commented Oct 14, 2017

zverok commented Oct 16, 2017

info-rchitect commented Oct 16, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017

info-rchitect commented Oct 19, 2017

zverok commented Oct 19, 2017