codis-fe UI: sentinel often shows stats ERROR although its actual state is healthy #1345

zhaomingzhu · 2017-09-06T11:09:07Z

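In the codis-fe UI, sentinels frequently show a stats ERROR state, but checking the sentinel with redis-cli shows that it is healthy.
Clicking sync does not resync and reports no error. After removing and re-adding the sentinel, the state returns to normal.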

spinlock · 2017-09-06T16:42:58Z

The log should record the cause of the error. Can you check what it says?


zhaomingzhu · 2017-09-13T02:53:38Z

@spinlock The following log messages appear:
2017/09/13 10:50:48 sentinel.go:67: [WARN] sentinel subscribe canceled (context canceled)
2017/09/13 10:50:48 topom_cache.go:224: [WARN] update sentinel:

zhaomingzhu · 2017-10-16T03:34:40Z

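@spinlock I observed for a while longer. On 3.2 builds from before the sentinel_client_timeout parameter was added, this did not happen. After upgrading the cluster to a build that has the sentinel_client_timeout config parameter, this error (sentinel subscribe canceled (context canceled)) shows up easily. Restarting the dashboard, or removing and re-adding the sentinel, restores the normal state. Clicking sync directly does not resync. Could you check whether there is a good way to fix this?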

spinlock · 2017-10-16T06:35:22Z

@zhaomingzhu Could you send me your contact details privately? Let's talk through another channel: wnzheng AT gmail.com

vipwangtian · 2017-10-19T09:13:20Z

I am hitting the same problem and don't know how to resolve it.

spinlock · 2017-10-19T09:54:17Z

Which version? Is it a branch or a release?


vipwangtian · 2017-10-19T14:02:48Z

release3.2. Following the comment above, I set sentinel_client_timeout to 1000 and restarted the dashboard. I'll keep observing.

vipwangtian · 2017-10-20T00:59:37Z

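The problem persists; removing and re-adding the sentinel resolves it. Below is the error information from the dashboard's topom. The dashboard log shows nothing abnormal, the sentinel log is normal too, and I can connect with redis-cli and run commands.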

"192.168.112.155:26379": {
                    "error": {
                        "Cause": "redigo: unexpected type for String, got type []interface {}",
                        "Stack": [
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/utils/redis.(*Client).Info",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/utils/redis/client.go",
                                "Line": 105
                            },
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/topom.(*Topom).RefreshRedisStats.func3",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/topom/topom_stats.go",
                                "Line": 83
                            },
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/topom.(*Topom).newRedisStats.func1",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/topom/topom_stats.go",
                                "Line": 33
                            }
                        ]
                    },
                    "unixtime": 1508458576
                },

spinlock · 2017-10-20T02:32:26Z

Sorry, I checked this code and it looks like it should be fine. In particular, this error is a RESP reply parsing error: INFO should return a String, not []interface{}. Very strange.
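To make the error in the stack above concrete, here is a minimal sketch of the kind of type check a RESP client helper performs when the caller expects a string. The function `asString` is a hypothetical stand-in written for illustration; it is not the redigo or Codis source, and the sample reply contents are made up.

```go
package main

import "fmt"

// asString mimics a "reply must be a string" helper. It is an
// illustrative stand-in, not the actual redigo implementation.
func asString(reply interface{}) (string, error) {
	switch v := reply.(type) {
	case []byte:
		return string(v), nil
	case string:
		return v, nil
	}
	return "", fmt.Errorf("unexpected type for String, got type %T", reply)
}

func main() {
	// An INFO reply decodes to a single bulk string.
	info := []byte("# Server\r\nredis_mode:sentinel\r\n")
	s, err := asString(info)
	fmt.Println(len(s) > 0, err)

	// An array reply (for example from a command that returns a list)
	// decodes to []interface{} and fails the same check.
	array := []interface{}{[]byte("name"), []byte("example")}
	_, err = asString(array)
	fmt.Println(err)
}
```

If the check itself is correct, the message suggests that the reply read for INFO belonged to a different command that returns an array.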


vipwangtian · 2017-10-23T09:44:11Z

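@spinlock Strange indeed. My sentinels do not have dedicated machines forming their own cluster, and the sentinel that has the problem is always the one deployed on the same machine as the dashboard.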

fancy-rabbit · 2017-11-05T09:13:24Z

The same problem occurred in my production environment. It seems that the sentinel pipelining misbehaves.

2002wmj · 2017-12-21T03:04:13Z

My environment has this problem too.

spinlock · 2017-12-21T08:10:21Z

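Hmm, I suspect that in my sentinel pipeline handling, the error path does not close the failed connection promptly.

Let me explain my guess: (1) the cluster has a fairly large number of groups, and (2) sentinel is slow to process commands. Together these two factors make the sync time out, but the client that timed out is not closed promptly, so when the client is reused, replies no longer match the commands that were sent.

So I reworked the sentinel pipeline handling logic. You can try replacing your dashboard with it.

Looking forward to your feedback, thanks.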
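The hypothesis above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `fakeConn` is a hypothetical in-memory stand-in for one sentinel connection, replies are returned strictly in the order commands were sent, and the reply contents are invented. It does not reproduce the Codis code path.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn is a hypothetical in-memory connection. Replies come back
// in the same order in which the commands were sent.
type fakeConn struct {
	pending []interface{} // replies produced but not yet read
}

// send queues the reply the server would produce for cmd.
func (c *fakeConn) send(cmd string) {
	switch cmd {
	case "INFO":
		c.pending = append(c.pending, "# Sentinel")
	case "SENTINEL masters":
		c.pending = append(c.pending, []interface{}{"name", "example"})
	}
}

// recv returns the oldest unread reply.
func (c *fakeConn) recv() (interface{}, error) {
	if len(c.pending) == 0 {
		return nil, errors.New("no reply available")
	}
	r := c.pending[0]
	c.pending = c.pending[1:]
	return r, nil
}

// info sends INFO and expects a string reply.
func info(c *fakeConn) (string, error) {
	c.send("INFO")
	r, err := c.recv()
	if err != nil {
		return "", err
	}
	s, ok := r.(string)
	if !ok {
		return "", fmt.Errorf("unexpected type for String, got type %T", r)
	}
	return s, nil
}

func main() {
	// A sync step sends a command, then times out before reading the reply.
	reused := &fakeConn{}
	reused.send("SENTINEL masters")

	// The connection is reused without being closed: INFO reads the stale
	// array reply, and its own reply stays queued for the next caller.
	_, err := info(reused)
	fmt.Println("reused connection:", err)

	// A fresh connection has no stale replies, so INFO succeeds.
	s, err := info(&fakeConn{})
	fmt.Println("fresh connection:", s, err)
}
```

In this simulation the reused connection stays off by one reply after the first mismatch, which would be consistent with the state only recovering once the connection is replaced.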

vipwangtian · 2018-01-24T06:57:42Z

I upgraded the dashboard on January 12 to version 2017-12-28 13:21:33 +0000 @9fde2809cca131e3da1a7e0920ea151029301fb4 @3.2.1-10-g9fde280, and the problem persists to this day.

spinlock · 2018-02-09T05:42:58Z

@vipwangtian Sorry, I only just saw this. I can hardly reproduce this bug now. Could you provide more information?

@fancy-rabbit If possible, could you help me debug this? Thanks!

spinlock · 2018-02-09T05:45:09Z

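@zhaomingzhu Previously, communication with sentinel was not pipelined. The upside was that the code was simple; the downside was that on a large cluster a single sentinel operation could take tens of seconds or even minutes, which is unacceptable. That is why I changed it to use a pipeline. Unfortunately, I have never hit this error in my own maintenance work, and with the setup I have it is hard to debug.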

ecvjacky · 2018-02-21T04:56:29Z

Using the latest version, the problem still exists. Please keep following up, thanks.

version = 2017-12-28 13:21:33 +0000 @9fde2809cca131e3da1a7e0920ea151029301fb4 @3.2.2
compile = 2018-02-08 15:31:11 +0800 by go version go1.9.4 linux/amd64

spinlock · 2018-02-23T02:36:54Z

Happy New Year!

This problem looks fairly serious. I'll find time next week to debug it. Since I have no environment right now, I may not find the real cause.

spinlock · 2018-02-23T02:39:01Z

@vipwangtian Your stack is very helpful, thanks!


spinlock · 2018-02-23T03:53:58Z

@fancy-rabbit Hello, I just made some changes. Could you review them?

The main change: wherever a Pipeline is used, compare Client.Pipeline.Send with Client.Pipeline.Recv, and close the connection immediately if they do not match.
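The idea of comparing sent and received counts can be sketched as follows. This is a minimal illustration, not the actual commit: the types `pipelineClient` and `pool` and the method `recyclable` are hypothetical names chosen for the example, and no real network I/O takes place.

```go
package main

import "fmt"

// pipelineClient is a hypothetical wrapper that counts how many commands
// were sent and how many replies were read on one connection.
type pipelineClient struct {
	sent     int
	received int
	closed   bool
}

func (c *pipelineClient) Send()  { c.sent++ }
func (c *pipelineClient) Recv()  { c.received++ }
func (c *pipelineClient) Close() { c.closed = true }

// recyclable reports whether every sent command has had its reply read,
// so that no stale reply can be left on the connection.
func (c *pipelineClient) recyclable() bool {
	return !c.closed && c.sent == c.received
}

// pool is a hypothetical idle list of reusable clients.
type pool struct {
	idle []*pipelineClient
}

// put keeps the client for reuse only if it is recyclable; otherwise it
// closes the client immediately. It reports whether the client was kept.
func (p *pool) put(c *pipelineClient) bool {
	if !c.recyclable() {
		c.Close()
		return false
	}
	p.idle = append(p.idle, c)
	return true
}

func main() {
	p := &pool{}

	// Healthy round: two commands sent, two replies read.
	healthy := &pipelineClient{}
	healthy.Send()
	healthy.Send()
	healthy.Recv()
	healthy.Recv()
	fmt.Println("healthy kept:", p.put(healthy))

	// Timed-out round: two commands sent, only one reply read.
	timedOut := &pipelineClient{}
	timedOut.Send()
	timedOut.Send()
	timedOut.Recv()
	fmt.Println("timed-out kept:", p.put(timedOut), "closed:", timedOut.closed)
}
```

Checking the counters at the point of reuse means a connection with unread replies is discarded regardless of which error path left it in that state.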

fancy-rabbit · 2018-02-25T15:03:29Z

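@spinlock Comparing the two and closing immediately is a sound approach, but I still cannot see where the previous code would go wrong.
Scratching my head. The new version is now in production for validation.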

vipwangtian · 2018-03-12T07:09:43Z

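Sorry, I only just saw this. We have already removed sentinel from our production environment. I can upgrade the dashboard at the next maintenance window and observe again. @spinlock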

zhaomingzhu · 2018-03-28T06:43:12Z

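After long-term observation, all three nodes have never been in the error state at the same time. The sentinel error also occurs even when the cluster has very few groups.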

pengdafu · 2019-01-18T07:45:39Z

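This still happens. At worst, two sentinels show error (while actually running fine). It is at line 284 of sentinel.go, in masterCommand.

It does not appear right after the sentinels are added, only a few days later.

It should be
values, err := redigo.Values(client.Do("SENTINEL", "masters"))
that produces the error.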

wsgzao · 2020-01-16T03:17:03Z

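Problem description

Version: Codis Latest release 3.2.2
Commit: 9fde280

In the Dashboard, sentinels show Status Error one after another at intervals, and eventually all sentinels end up that way, which affects Redis HA failover. The workaround is to remove and re-add them. The logs resemble those reported by others above.

Separately, for our core production environment, to improve safety and failover speed we use Codis Proxy + Redis master/replica, with a Keepalived VIP as the Server IP. It uses somewhat more resources, but it is very stable. If your deployment is not large, you could try this combination.

Analysis

From the logs we initially guessed that the problem was related to Sentinel or Dashboard.
After reading the source we found that the master branch contains commits that modify the Dashboard code to fix the Status Error problem, but they are not included in Codis latest release 3.2.2.
https://github.com/CodisLabs/codis/commits/release3.2

My question

Is it recommended that users download the master branch source, build the Dashboard binary themselves, and replace the one shipped with latest release 3.2.2 to fix this problem? Since this involves production, if anyone else troubled by the same problem has tried it, please report whether it fixed the issue.

Outlook

Many thanks to the author for developing and maintaining Codis for so long. It gives us a Redis cluster solution that is easy to scale out, relatively stable, and client-friendly.

Redis 6.0 adds redis cluster proxy, and I expect new breakthroughs in technical solutions there as well.

I previously wrote an article about Codis that I hope is helpful:
Redis (Codis) distributed cluster deployment in practice
https://wsgzao.github.io/post/codis/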

spinlock mentioned this issue Dec 21, 2017

Release3.2 fix 1345 #1419

Merged

spinlock-pony mentioned this issue Feb 7, 2018

What changed in codis 3.2.2? #1442

Open

spinlock added help wanted bug labels Feb 23, 2018

spinlock added a commit that referenced this issue Feb 23, 2018

utils: revert b9272a9 (#1419,#1345)

d525223

spinlock added a commit that referenced this issue Feb 23, 2018

utils: close redis connection immediately if client is not recyclable (…

67c7cf5

…#1419,#1345)

fancy-rabbit mentioned this issue Feb 23, 2018

codis3.2 sentinel status intermittently shows error although it is running on Linux; removing and re-adding fixes it, but it recurs after a while! #1439

Closed

yz1509 mentioned this issue Feb 9, 2021

Fix the bug that healthy sentinel displays ERROR on the codis-fe #1730

Open
