DataX---channel count differs from the actual task count (to be continued) #90

Open
AronChung opened this issue Nov 26, 2020 · 0 comments
Labels
DataX datax深入研究

AronChung commented Nov 26, 2020

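There are three ways to configure higher channel concurrency within a job:

  • Configure a global byte rate limit and a per-channel byte rate limit. Channel count = global byte limit / per-channel byte limit
  • Configure a global record rate limit and a per-channel record rate limit. Channel count = global record limit / per-channel record limit
  • Configure the channel count directly. (This only takes effect when neither of the two limits above is set. When both limits are set, the smaller of the two resulting channel counts is used as the final channel count.)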
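
The three rules above can be sketched as follows. This is a minimal illustration, not DataX's actual code: the class name `ChannelCountSketch`, the method `channelCount`, the convention that a value of 0 means "not configured", and the clamp to at least one channel are all assumptions made for the example.

```java
class ChannelCountSketch {
    /**
     * Hypothetical helper: derives the channel count from the three settings.
     * A limit of 0 (or less) is treated as "not configured" (assumption).
     */
    static int channelCount(long globalByte, long channelByte,
                            long globalRecord, long channelRecord,
                            int configuredChannel) {
        boolean byteSet = globalByte > 0 && channelByte > 0;
        boolean recordSet = globalRecord > 0 && channelRecord > 0;

        // Channel count implied by each limit; clamped to at least 1 (assumption).
        int byByte = byteSet
                ? (int) Math.max(1L, globalByte / channelByte)
                : Integer.MAX_VALUE;
        int byRecord = recordSet
                ? (int) Math.max(1L, globalRecord / channelRecord)
                : Integer.MAX_VALUE;

        // If either limit is set, the smaller implied count wins.
        if (byteSet || recordSet) {
            return Math.min(byByte, byRecord);
        }
        // Otherwise the directly configured channel count is used.
        return configuredChannel;
    }

    public static void main(String[] args) {
        // Only "channel": 3 is configured, as in the test in this post.
        System.out.println(channelCount(0, 0, 0, 0, 3));
        // Byte limit implies 10 channels, record limit implies 2: the smaller wins.
        System.out.println(channelCount(10_485_760L, 1_048_576L, 1000, 500, 3));
    }
}
```

With only `channel` set, the helper returns the configured value unchanged, which matches the expectation of 3 in the test below.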

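I ran a simple test that exports the data of a single table, 47 records in total.
For this run I only set the following configuration, expecting 3 tasks to run: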

"setting": {
    "speed": {
        "channel": 3
    }
},

The actual result, however, was:

2020-11-26 10:51:18.160 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [16] tasks.
2020-11-26 10:51:18.160 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [16] tasks.

I expected 3 tasks, so why did the job end up with 16? I dug further:

// adviceNumber is the channel count; assume it is 3
// tableNumber is assumed to be 1
// so the computed eachTableShouldSplittedNumber is 3
private static int calculateEachTableShouldSplittedNumber(int adviceNumber,
                                                          int tableNumber) {
    double tempNum = 1.0 * adviceNumber / tableNumber;
    return (int) Math.ceil(tempNum);
}
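
To make the rounding concrete, here is a small sketch that reuses the same arithmetic for a few combinations of channel count and table count. The class name `CeilSplitSketch`, the method name `eachTableShouldSplit` and the sample inputs other than (3, 1) are illustrative and do not come from DataX.

```java
class CeilSplitSketch {
    // Same arithmetic as the method above: ceil(adviceNumber / tableNumber).
    static int eachTableShouldSplit(int adviceNumber, int tableNumber) {
        double tempNum = 1.0 * adviceNumber / tableNumber;
        return (int) Math.ceil(tempNum);
    }

    public static void main(String[] args) {
        System.out.println(eachTableShouldSplit(3, 1)); // 3: the case in this post
        System.out.println(eachTableShouldSplit(3, 2)); // 2: 1.5 rounds up
        System.out.println(eachTableShouldSplit(4, 2)); // 2: exact division
        System.out.println(eachTableShouldSplit(1, 3)); // 1: never below 1 for positive inputs
    }
}
```

Because the result is rounded up, the per-table split count multiplied by the table count can exceed the channel count even before any further adjustment.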

Why does the final channel count differ from the actual task count?

// The final number of splits is not necessarily equal to eachTableShouldSplittedNumber
boolean needSplitTable = eachTableShouldSplittedNumber > 1
        && StringUtils.isNotBlank(splitPk);
if (needSplitTable) {
    if (tables.size() == 1) {
        // Previously: for a single table, the primary-key split count was num = num * 2 + 1
        // Rows where splitPk is null are far fewer than the real data volume, so they
        // are best left out when relating the split count to the channel count
        //eachTableShouldSplittedNumber = eachTableShouldSplittedNumber * 2 + 1;// the +1 should not be added, it causes a long tail (long tail: skew)

        // Consider another ratio? (splitPk is null; ignore this long tail)
        eachTableShouldSplittedNumber = eachTableShouldSplittedNumber * 5;
    }
    // Try to split each table into eachTableShouldSplittedNumber slices
    for (String table : tables) {
        tempSlice = sliceConfig.clone();
        tempSlice.set(Key.TABLE, table);

        List<Configuration> splittedSlices = SingleTableSplitUtil
                .splitSingleTable(tempSlice, eachTableShouldSplittedNumber);

        splittedConfigs.addAll(splittedSlices);
    }
}
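
Tracing the test through the excerpt above: 3 channels and 1 table give eachTableShouldSplittedNumber = 3, and the single-table branch multiplies that by 5, so 15 slices are requested. The log reports 16. One possible explanation, which the excerpt does not show and which is therefore only an assumption here, is that `splitSingleTable` appends one extra slice for rows where splitPk IS NULL (the comments above hint at such a case). The sketch below works through that arithmetic; `SplitCountSketch`, `requestedSlicesPerTable` and `ASSUMED_NULL_PK_SLICES` are hypothetical names, and the method assumes splitPk is configured.

```java
class SplitCountSketch {
    // ASSUMPTION: one extra slice for "splitPk IS NULL" rows; not shown in the excerpt.
    static final int ASSUMED_NULL_PK_SLICES = 1;

    // Slices requested per table, following the excerpt (splitPk assumed to be set).
    static int requestedSlicesPerTable(int channel, int tableCount) {
        int n = (int) Math.ceil(1.0 * channel / tableCount);
        boolean needSplitTable = n > 1;
        if (needSplitTable && tableCount == 1) {
            n = n * 5; // single-table ratio from the excerpt
        }
        return n;
    }

    public static void main(String[] args) {
        int requested = requestedSlicesPerTable(3, 1);
        System.out.println(requested);                          // 15
        System.out.println(requested + ASSUMED_NULL_PK_SLICES); // 16, under the assumption above
    }
}
```

Under that assumption the numbers line up with the 16 tasks in the log, but the +1 would need to be confirmed against the `SingleTableSplitUtil` source.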
@AronChung AronChung added the DataX datax深入研究 label Nov 26, 2020
@AronChung AronChung added this to DataX in My Blog Jun 29, 2022