Improve Pig file size to increase parallelism according to shard size #294

@costin

Description

By default, Pig runs the reduce phase with a single task. As a result, if the user applies a DISTINCT followed by a GROUP BY, the parallelism of the Pig stream gets funnelled into one reducer.
The workaround is to use the PARALLEL keyword (though that requires user intervention), or potentially to report a dynamic file size so that Pig raises the reducer count automatically (following Pig's naive InputSizeReducerEstimator).
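
To make the second option concrete, here is a rough Python sketch of the arithmetic that Pig's input-size-based estimation performs: the total input size is divided by `pig.exec.reducers.bytes.per.reducer` (default 1,000,000,000) and capped at `pig.exec.reducers.max` (default 999). The function names `estimate_reducers` and `synthetic_input_size` are hypothetical and only illustrate the idea; how a shard-based size would actually be reported to Pig is an assumption, not something this issue specifies.

```python
# Pig's default values for the two properties the estimator reads.
BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 999                 # pig.exec.reducers.max


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Approximate the size-based estimate: ceil(size / bytes per reducer),
    clamped to the range [1, max_reducers]."""
    # Integer ceiling division avoids float rounding on very large sizes.
    wanted = -(-total_input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


def synthetic_input_size(num_shards, bytes_per_reducer=BYTES_PER_REDUCER):
    """Hypothetical helper: a reported size chosen so that the estimate
    above yields one reducer per shard (an assumption for illustration)."""
    return num_shards * bytes_per_reducer


if __name__ == "__main__":
    # An input with no meaningful size collapses to a single reducer,
    # which is the funnelling effect described above.
    print(estimate_reducers(0))
    # Reporting a size proportional to the shard count spreads the work.
    print(estimate_reducers(synthetic_input_size(5)))
```

An explicit PARALLEL clause (or `default_parallel`) takes precedence over this estimate, so a size-based approach would only affect scripts that do not set parallelism themselves.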
