Skip to content

Commit

Permalink
add light_monitor.md (#1578)
Browse files Browse the repository at this point in the history
  • Loading branch information
ywy2090 committed Nov 23, 2022
1 parent f48dd38 commit 42e0774
Show file tree
Hide file tree
Showing 2 changed files with 142 additions and 0 deletions.
141 changes: 141 additions & 0 deletions 3.x/zh_CN/docs/develop/light_monitor.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# light_monitor.sh监控工具

标签:``监控`` ``monitor``

----

## 介绍

### 简介

`FISCO-BCOS 3.0`区块链轻量级监控工具,可以监控区块链是否正常工作,也提供简单的接入告警系统的方式.

### 功能

- 监控共识是否正常.
- 监控区块同步是否正常.
- 监控磁盘空间.
- 对接告警系统,发送告警信息.

## 使用

帮助:
`bash light_monitor.sh -h`

```shell
$ bash light_monitor.sh -h
Usage:
Optional:
-g [Require] group id
-i [Require] rpc server ip
-p [Require] rpc server port
-t [Optional] block number far behind warning threshold, default: 30
-d [Optional] disk directory to be monitor
-T [Optional] disk capacity alarm threshold, default: 5%
-h Help.
Example:
bash light_monitor.sh -i 127.0.0.1 -p 20200 -g group0
bash light_monitor.sh -i 127.0.0.1 -p 20200 -g group0 -d /data -T 10
```

**参数:**

- `-g`: 监控的群组id,`rpc`接入多个群组时,可以部署多个light_monitor.sh,分别监控不同群组
- `-i`: rpc ip
- `-p`: rpc port
- `-t`: 区块同步告警的阈值,共识节点之间的区块高度差值超过该阈值,说明共识或者区块同步异常,默认值`30`
- `-d`: 磁盘容量需要监控的目录
- `-T`: 磁盘告警阈值,监控的磁盘剩余空间百分比小于该值时,触发磁盘告警告警,默认值`5%`
- `-h`: 帮助信息

### 状态描述

**参数:**

- $config_ip: rpc ip
- $config_port:rpc port
- $group: 群组id
- $height: 区块高度

#### ```OK! $config_ip:$config_port $node:$group is working properly: height $height```

群组`${group}`正常工作, 共识模块/区块同步正常工作。

#### ```ERROR! Cannot connect to $config_ip:$config_port ${group}, method: xxxx```

调用`rpc`接口`xxxx`失败,`rpc`服务宕机, 严重错误,此时需要重启`rpc`服务。

#### ```ERROR! Consensus timeout $config_ip:$config_port ${group}:${node}```

**群组共识超时,连续出现时为严重错误**
排查网络连接是否正常。

#### ```ERROR! insufficient disk capacity, monitor disk directory: ${dir}, left disk space percent: ${disk_space_left_percent}%```

磁盘空间不足,剩余`${disk_space_left_percent}%`的空间。

### 配置crontab任务

为了能够持续监控区块链节点的状态, 需要将`light_monitor.sh`配置到`crontab`定期执行.

```shell
# 每分钟执行一次,查看节点是否正常启动, 正常共识, 有无关键错误打印
*/1 * * * * /data/app/127.0.0.1/light_monitor.sh >> /data/app/127.0.0.1/light_monitor.log 2>&1
```

`light_monitor.log`保存`light_monitor.sh`的输出

**用户需要根据实际部署修改示例中的路径.**

## 对接告警系统

- 接口
`light_monitor.sh`对接告警系统的接口`alarm`, 默认实现如下:

```shell
alarm() {
echo "$1"
alert_msg="$1"
alert_ip=$(/sbin/ifconfig eth0 | grep inet | grep -v inet6 | awk '{print $2}')
alert_time=$(date "+%Y-%m-%d %H:%M:%S")

# TODO: alarm the message, mail or phone

# echo "[${alert_time}]:[${alert_ip}]:${alert_msg}"| mail -s "fisco-bcos alarm message" xxxxxx@qq.com
}
```

`light_monitor.sh`执行触发的关键错误处都会调用该函数, 并将错误信息作为入参, 用户可以调用监控平台的API将错误信息发送至告警平台.

- 示例

假设用户的告警系统

- API:
`http://127.0.0.1:1111/alarm/request`
POST参数:
```{'title':'告警主题','alert_ip':'告警服务器IP', 'alert_info':'告警内容'}```

修改`alarm`函数:

```shell
alarm() {
local message="$1"
# 获取本机IP
alert_ip=$(/sbin/ifconfig eth0 2>/dev/null | grep inet | awk '{print $2}')

# 发送告警信息
curl -H "Content-Type: application/json" -X POST --data "{'title':'alarm','alert_ip':'${alert_ip}','alert_info':'${message}'}" http://127.0.0.1:1111/alarm/request
}

alarm() {
echo "$1"
alert_msg="$1"
alert_ip=$(/sbin/ifconfig eth0 | grep inet | grep -v inet6 | awk '{print $2}')
alert_time=$(date "+%Y-%m-%d %H:%M:%S")

# TODO: alarm the message, mail or phone

curl -H "Content-Type: application/json" -X POST --data "{'title':'alarm','alert_ip':'${alert_ip}','alert_info':'${alert_msg}'}" http://127.0.0.1:1111/alarm/request
}
```
1 change: 1 addition & 0 deletions 3.x/zh_CN/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,7 @@ FISCO BCOS 是一个稳定、高效、安全的区块链底层平台,经过多
docs/develop/account.md
docs/develop/stress_testing.md
docs/develop/committee_usage.md
docs/develop/light_monitor.md


.. toctree::
Expand Down

0 comments on commit 42e0774

Please sign in to comment.