hive---DDL statements finish after about 200s #106

Open
AronChung opened this issue Sep 29, 2021 · 0 comments
AronChung commented Sep 29, 2021

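Symptom

DDL statements such as create and drop each take 200 seconds to execute.

Root cause analysis

The DDL flow passes through these components: hs2 -> hms -> sentry -> hdfs acl
Troubleshooting steps: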

  1. Slow queries
  2. show locks
  3. hs2 log
  4. hms log: Failed to sync requested HMS notifications up to the event ID xxxx
  5. namenode log
  6. sentry log: timed out wait request for id xxxx
Looking up the event behind "timed out wait request for id xxxx":
use hive;
select * from NOTIFICATION_LOG where event_id=xxxx;
Check whether the event ID that sentry has processed matches the one in hms:
sentry: 
select * from  sentry.SENTRY_HMS_NOTIFICATION_ID order by NOTIFICATION_ID desc limit 10;

hms:
select * from hive.NOTIFICATION_SEQUENCE Order by NEXT_EVENT_ID desc limit 10;
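
The comparison between the two query results comes down to a subtraction. Below is a minimal Java sketch, assuming the top NOTIFICATION_ID from sentry and the top NEXT_EVENT_ID from hms have been copied in by hand. The class and method names are hypothetical and are not part of Sentry or Hive. Whether NEXT_EVENT_ID is the latest issued ID or one past it is not settled here; an off-by-one does not change the conclusion.

```java
// Hypothetical helper, not part of Sentry or Hive.
final class NotificationLagCheck {

    // Number of HMS events that Sentry has not yet recorded as processed.
    // hmsEventId: value read from hive.NOTIFICATION_SEQUENCE (NEXT_EVENT_ID).
    // sentryProcessedId: highest NOTIFICATION_ID in sentry.SENTRY_HMS_NOTIFICATION_ID.
    static long lag(long hmsEventId, long sentryProcessedId) {
        return Math.max(0L, hmsEventId - sentryProcessedId);
    }

    // True when the lag exceeds a tolerance chosen by the operator (assumption).
    static boolean isBehind(long hmsEventId, long sentryProcessedId, long tolerance) {
        return lag(hmsEventId, sentryProcessedId) > tolerance;
    }
}
```

Running the two queries a few minutes apart and comparing the lag each time shows the trend: a lag that shrinks means sentry is still consuming, while a sentry ID that never moves suggests the sync has stalled.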

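After working through these steps, the two log lines from hms and sentry pinned the problem on HMS notifications. I downloaded the master branch of the sentry source and found the code that raises the exception.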
That led to the parameter behind the 200-second timeout:

    // Should match the value for RPC timeout in HMS client config
    public static final String SENTRY_NOTIFICATION_SYNC_TIMEOUT_MS = "sentry.notification.sync.timeout.ms";
    public static final int SENTRY_NOTIFICATION_SYNC_TIMEOUT_DEFAULT = 200000;
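
To make the 200-second figure concrete, here is a minimal sketch of a "wait until an ID is reached or a timeout expires" pattern using plain Java monitors. This is not Sentry's actual implementation; NotificationSyncSketch, advanceTo and waitForId are hypothetical names, and the only thing taken from the source is the idea of a caller waiting for a notification ID with a timeout.

```java
// Hypothetical sketch, not Sentry code.
final class NotificationSyncSketch {
    private long lastProcessedId;

    NotificationSyncSketch(long initialId) {
        this.lastProcessedId = initialId;
    }

    // Called by the (hypothetical) consumer thread after it processes events.
    synchronized void advanceTo(long id) {
        if (id > lastProcessedId) {
            lastProcessedId = id;
            notifyAll(); // wake up callers waiting for this ID
        }
    }

    // Blocks until lastProcessedId >= targetId or timeoutMs elapses.
    // Returns true if the ID was reached, false on timeout or interrupt.
    synchronized boolean waitForId(long targetId, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (lastProcessedId < targetId) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timed out: the consumer never caught up
            }
            try {
                // wait(0) would block forever, so wait at least 1 ms
                wait(Math.max(1L, remainingNanos / 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With timeoutMs set to 200000 and a consumer that never reaches targetId, every call returns only after roughly 200 seconds, which is the shape of the symptom reported here.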

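Reading this part of the source: once hdfs-sentry acl sync is enabled, this code handles the permission-sync messages exchanged among hdfs, sentry, and the hive metastore server. When a large batch of directory-permission messages suddenly arrives, the background thread cannot keep up, messages back up, and this exception appears. The exception does not affect cluster usability. It only makes create and drop table slow, each waiting 200s, and the wait exists so that sentry can catch up to the latest id. You can lower the sentry parameter sentry.notification.sync.timeout.ms (default 200s) to shorten the wait. If the backlog is small, you can let it drain on its own.

In our case, the hive metastore server thread that takes part in sync message processing had also exited abnormally, so the sentry_hms_notification_id table in sentry stopped being updated, and the hive metastore server had to be restarted.

If too many messages have piled up, letting them drain slowly takes too long and may never catch up. In that case you can choose to discard them. To do so, insert a row with the maximum value into sentry's sentry_hms_notification_id table (equal to the current message id, taken from the notification_sequence table), then restart the sentry service. The notification_log table stores the message log.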
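
The choice between letting the backlog drain and discarding it can be framed as a rough estimate. The sketch below is an illustration only: the class name, the method name, and the idea of measuring consumption and production rates are assumptions, and none of it comes from Sentry.

```java
// Hypothetical estimate, not part of Sentry.
final class BacklogDrainEstimate {

    // Seconds until the backlog is consumed, given how fast Sentry consumes
    // events and how fast new events arrive. Returns POSITIVE_INFINITY when
    // consumption does not outpace production, i.e. it never catches up.
    static double secondsToDrain(long backlog, double consumedPerSec, double producedPerSec) {
        if (backlog <= 0) {
            return 0.0;
        }
        double netPerSec = consumedPerSec - producedPerSec;
        if (netPerSec <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return backlog / netPerSec;
    }
}
```

For example, with made-up rates, a backlog of 20000 events consumed at 30 per second while 20 per second keep arriving would take 2000 seconds to clear; if arrivals match or exceed consumption, the estimate is infinite and discarding the backlog becomes the practical option.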

Summary:

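  • Starting at 10:35:08 yesterday morning, users' DDL operations began to slow down
  • Around that time there was a burst of DDL partition-drop operations (more than 20,000 in 20 minutes)
  • As a result, the sentry to hdfs acl path could not keep up with the rate of DDL requests and fell behind
  • The timeout defaults to 200s, which is why everyone's drop/create table took exactly 200s
  • This lasted until 00:50 today, when the DDL volume returned to normal and the problem eased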
@AronChung AronChung added the Hive hive label Sep 29, 2021
@AronChung AronChung changed the title hive---DDL statements finish after about 200s (to be continued) hive---DDL statements finish after about 200s Oct 20, 2021
@AronChung AronChung added this to Hive in My Blog Jun 29, 2022