Components involved in the DDL flow: hs2 -> hms -> sentry -> hdfs acl
Troubleshooting steps:
Slow queries
show locks
hs2 logs
hms logs: Failed to sync requested HMS notifications up to the event ID xxxx
namenode logs
sentry logs: timed out wait request for id xxxx
Analyzing the "timed out wait request for id xxxx" event
use hive;
select * from NOTIFICATION_LOG where event_id=xxxx;
Check whether the event ID that sentry has processed matches the one in hms
sentry:
select * from sentry.SENTRY_HMS_NOTIFICATION_ID order by NOTIFICATION_ID desc limit 10;
hms:
select * from hive.NOTIFICATION_SEQUENCE Order by NEXT_EVENT_ID desc limit 10;
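The two lookups above can be combined into one query that shows roughly how far sentry is behind hms. This is only a minimal sketch. It assumes the `hive` and `sentry` schemas live in the same MySQL instance (as the schema-qualified queries above suggest), and the column aliases are made up for illustration.

```sql
-- Sketch only: assumes both schemas are reachable from one MySQL session.
-- hms_next_event_id, sentry_last_processed_id and approx_backlog are
-- illustrative aliases, not names used by hms or sentry.
SELECT
    h.hms_next_event_id,
    s.sentry_last_processed_id,
    h.hms_next_event_id - s.sentry_last_processed_id AS approx_backlog
FROM
    (SELECT MAX(NEXT_EVENT_ID) AS hms_next_event_id
       FROM hive.NOTIFICATION_SEQUENCE) AS h,
    (SELECT MAX(NOTIFICATION_ID) AS sentry_last_processed_id
       FROM sentry.SENTRY_HMS_NOTIFICATION_ID) AS s;
```

If `approx_backlog` stays small or keeps shrinking between runs, sentry is catching up on its own. If it keeps growing, the backlog is not being consumed.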
// Should match the value for RPC timeout in HMS client config
public static final String SENTRY_NOTIFICATION_SYNC_TIMEOUT_MS = "sentry.notification.sync.timeout.ms";
public static final int SENTRY_NOTIFICATION_SYNC_TIMEOUT_DEFAULT = 200000;
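Symptom
DDL statements such as create and drop all take 200 seconds to run
Root cause analysis
Components involved in the DDL flow: hs2 -> hms -> sentry -> hdfs acl
Troubleshooting steps:
Failed to sync requested HMS notifications up to the event ID xxxx
timed out wait request for id xxxx
The investigation narrowed the problem down to these two log lines, one from hms and one from sentry, which confirmed that HMS notifications were at fault. We downloaded the master branch of the sentry source code and located the code that raises the exception,
which in turn led us to the 200-second timeout parameter, sentry.notification.sync.timeout.ms (default 200000 ms).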
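This part of the source code handles the permission-sync messages exchanged between hdfs, sentry, and the hive metastore server once hdfs-sentry ACL synchronization is enabled. When a large batch of directory-permission messages suddenly arrives, the background thread cannot keep up, the messages back up, and this exception appears. The exception does not affect cluster availability. It only makes create and drop table slow, because each statement has to wait 200 s. The wait exists so that sentry can catch up to the latest ID. You can lower the sentry parameter sentry.notification.sync.timeout.ms (default 200 s) to shorten the wait, and if the backlog is small you can let sentry consume it on its own.
In our case, the hive metastore server thread that takes part in processing the sync messages had also exited abnormally, so the sentry_hms_notification_id table in sentry was never updated. Fixing that required restarting the hive metastore server.
If the backlog is too large, consuming it gradually takes too long and sentry may never catch up. In that case you can choose to discard the messages: insert a row into sentry's sentry_hms_notification_id table holding the maximum value (equal to the current message ID, taken from the notification_sequence table), then restart the sentry service. The notification_log table stores the message log.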
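The "discard the backlog" step described above could look roughly like the following. It is a hedged sketch, not a verified procedure. It assumes both schemas are in the same MySQL instance and that SENTRY_HMS_NOTIFICATION_ID has a single NOTIFICATION_ID column. It also takes the "current message ID" directly from NEXT_EVENT_ID, as the text describes. Check the table definition and back up both tables before running anything like this, since the skipped events will not be processed by sentry.

```sql
-- 1. Inspect both positions first (read-only).
SELECT MAX(NEXT_EVENT_ID)   FROM hive.NOTIFICATION_SEQUENCE;
SELECT MAX(NOTIFICATION_ID) FROM sentry.SENTRY_HMS_NOTIFICATION_ID;

-- 2. Record the current hms event ID as already processed by sentry.
--    Assumption: NOTIFICATION_ID is the only column that must be supplied.
INSERT INTO sentry.SENTRY_HMS_NOTIFICATION_ID (NOTIFICATION_ID)
SELECT MAX(NEXT_EVENT_ID) FROM hive.NOTIFICATION_SEQUENCE;

-- 3. Restart the sentry service, then re-run step 1 to confirm
--    that the two values now agree.
```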
Summary: