Inconsistent behavior in keyed/unkeyed joins against duplicate columns #4891

@MichaelChirico

Observed while answering this SO Question: https://stackoverflow.com/a/66041678/3576984

# NB: recall CJ output is keyed by default
DT1 = CJ(a = 1:3, b = 4:5, c = 6)
# the same table, but column two is re-named a
DT2 = CJ(a = 1:3, a = 4:5, d = 6)

Observe how the result differs depending on whether DT2 is keyed:

# KEYED
DT1[DT2]
#    a b c i.a
# 1: 1 1 6   4
# 2: 1 1 6   5
# 3: 2 2 6   4
# 4: 2 2 6   5
# 5: 3 3 6   4
# 6: 3 3 6   5

# UNKEYED
setkey(DT2, NULL)
DT1[DT2]
#    a b c
# 1: 1 4 6
# 2: 1 5 6
# 3: 2 4 6
# 4: 2 5 6
# 5: 3 4 6
# 6: 3 5 6

Is there some reason the first case would be the intended behavior?

The verbose output suggests it starts doing the right thing, then gets tripped up later on:

DT1[DT2, verbose=TRUE]
# i.a has same type (integer) as x.a. No coercion needed.
# i.a has same type (integer) as x.b. No coercion needed.
# i.d has same type (double) as x.c. No coercion needed.
# on= matches existing key, using key
# Starting bmerge ...
# bmerge done in 0.000s elapsed (0.000s cpu) 
# Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
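
To rule out a difference in the data itself, the setup can be inspected along these lines. This is only a diagnostic sketch, not a fix: DT2_unkeyed is a name introduced here for illustration, and the key is dropped on a copy so that the keyed DT2 stays available for comparison.

```r
library(data.table)

DT1 = CJ(a = 1:3, b = 4:5, c = 6)
DT2 = CJ(a = 1:3, a = 4:5, d = 6)

# Column names of i: "a" appears twice
names(DT2)
anyDuplicated(names(DT2))  # non-zero when a name is repeated

# Keys of both tables before anything is dropped
key(DT1)
key(DT2)

# Drop the key on a copy (DT2_unkeyed is a hypothetical name),
# leaving the keyed DT2 untouched
DT2_unkeyed = copy(DT2)
setkey(DT2_unkeyed, NULL)
haskey(DT2_unkeyed)

# The two joins being compared
DT1[DT2]
DT1[DT2_unkeyed]
```

If the two versions of DT2 differ only in their key, the discrepancy in the join output presumably comes from how the key is used when i has duplicated column names.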
