docs: translate chinese doc of faq.md and monitoring.md into english (#1796) #1815

Merged
merged 8 commits into from
May 24, 2022
69 changes: 69 additions & 0 deletions docs/en/maintain/faq.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Operation and Maintenance FAQ

## Deploy and Startup FAQ

### 1. How to confirm that the cluster is running normally?
Although a one-click startup script is provided, the large number of configuration items means that problems such as "the port is occupied" or "the directory does not have read and write permissions" may occur. These problems only surface once the server process actually runs, and the process then exits without any timely feedback. (If monitoring is configured, you can check directly through monitoring.)
Therefore, please make sure that all server processes in the cluster are running normally.

You can check with `ps axu | grep openmldb`. (Note that the official run script uses `mon` as the daemon process, but a running `mon` process does not mean that the OpenMLDB server process is running.)

If all processes are running and the cluster still behaves abnormally, check the server logs. Look at the `WARN` and `ERROR` level logs first, since they most likely point to the root cause.
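As a rough illustration of the log triage described above, the following sketch keeps only the warning and error lines of a log. It assumes glog-style lines whose first character is the severity (`I`, `W`, `E`, `F`); that format, the helper name `severe_lines`, and the sample lines are assumptions for illustration, not taken from the OpenMLDB documentation.

```python
import re

# Assumption: glog-style lines such as
#   "W0524 10:15:30.000002 1234 some_file.cc:42] message"
# where the first character is the severity (I, W, E or F).
SEVERE_LINE = re.compile(r"^[WEF]\d{4} ")


def severe_lines(lines):
    """Return only the WARN/ERROR/FATAL lines, preserving their order."""
    return [line.rstrip("\n") for line in lines if SEVERE_LINE.match(line)]


# Made-up sample lines, for illustration only.
sample = [
    "I0524 10:15:29.000001 1234 example.cc:100] server started\n",
    "W0524 10:15:30.000002 1234 example.cc:200] something looks off\n",
    "E0524 10:15:31.000003 1234 example.cc:300] request error\n",
]

for line in severe_lines(sample):
    print(line)
```

In practice you would pass an open log file instead of `sample`, for example `severe_lines(open(path))`, where `path` is wherever your deployment writes its logs.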

## Server FAQ

### 1. Why is there a warning of "Fail to write into Socket" in the log?
```
http_rpc_protocol.cpp:911] Fail to write into Socket{id=xx fd=xx addr=xxx} (0x7a7ca00): Unknown error 1014 [1014]
```
This log is printed by the server. Generally, the client uses connection-pool or short-connection mode and closes the connection after an RPC times out. When the server then writes back the response, it finds that the connection has already been closed and reports this error. (`Got EOF` means that EOF was received earlier, that is, the peer closed the connection normally.) When the client uses single-connection mode, the server generally does not report this error.

### 2. The initial ttl setting of table data is not suitable, how to adjust it?
The ttl must be modified with nsclient; ordinary clients cannot do this. For how to start nsclient and the commands it supports, see [ns client](../reference/cli.md#ns-client).

Use the `setttl` command in nsclient to change the ttl of a table. The syntax is:
```
setttl table_name ttl_type ttl [ttl] [index_name]
```
If you specify an index name at the end of the command, only the ttl of that single index is modified.
```{caution}
Changes made with `setttl` do not take effect immediately; the delay depends on the `gc_interval` configuration of the tablet server. (Each tablet server is configured independently, and the configurations do not affect each other.)

For example, if the `gc_interval` of a tablet server is 1h, the new ttl configuration is reloaded at the very end of the next gc (in the worst case, 1h later). The gc run that reloads the ttl does not yet eliminate data according to the new ttl. The new ttl is first used for data elimination during the gc after that.

Therefore, **after the ttl is changed, it takes up to two gc intervals to take effect**. Please wait patiently.

You can adjust the `gc_interval` of the tablet server, but this configuration cannot be changed dynamically and only takes effect after a restart. If memory pressure is high, you can instead try to expand capacity and migrate data shards to reduce it. Changing `gc_interval` casually is not recommended.
```
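The two-interval rule above can be worked through numerically. The sketch below is only an arithmetic illustration, assuming gc runs at a fixed period of `gc_interval` and that a gc run itself takes negligible time; the function name and parameters are hypothetical.

```python
def seconds_until_ttl_applies(gc_interval_s, since_last_gc_s):
    """Delay between issuing setttl and the first gc that uses the new ttl.

    Assumptions (not from the OpenMLDB docs): gc runs at a fixed period and
    takes negligible time. since_last_gc_s is how long ago the last gc ran.
    """
    if not 0 <= since_last_gc_s < gc_interval_s:
        raise ValueError("since_last_gc_s must be in [0, gc_interval_s)")
    # The next gc only reloads the ttl configuration at its end.
    until_reload = gc_interval_s - since_last_gc_s
    # The gc after that is the first one to eliminate data with the new ttl.
    return until_reload + gc_interval_s


# Worst case: setttl is issued right after a gc finished.
worst_case = seconds_until_ttl_applies(3600, 0)
# Halfway between two gc runs.
halfway = seconds_until_ttl_applies(3600, 1800)
print(worst_case, halfway)
```

With a `gc_interval` of 1h, the delay is therefore between just over one hour and two hours, depending on when in the gc cycle the command is issued.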

### 3. If a warning log appears: Last Join right table is empty, what does it mean?
Generally speaking, this is normal and does not indicate an anomaly in the cluster. It only means that the right table of the join in the runner is empty. This is a possible situation, and it is more likely to be a data problem than a cluster problem.

## Client FAQ

### 1. Why am I getting a warning log for Reached timeout?
```
rpc_client.h:xxx] request error. [E1008] Reached timeout=xxxms
```
This is because the timeout configured for rpc requests sent by the client is too small, so the client closes the connection itself. Note that this is an rpc timeout.

There are two cases:
#### Synchronized offline job
This happens easily when using synchronous offline commands. You can adjust the rpc timeout, in milliseconds, with:
```sql
> SET @@job_timeout = "600000";
```
#### Normal request
If a simple query or insert times out, the general `request_timeout` configuration needs to be changed.
1. CLI: cannot be changed at this time
2. Java: for a direct SDK connection, adjust `SdkOption.requestTimeout`; for JDBC, adjust the `requestTimeout` parameter in the url
3. Python: cannot be changed at this time

### 2. Why am I getting the warning log of Got EOF of Socket?
```
rpc_client.h:xxx] request error. [E1014]Got EOF of Socket{id=x fd=x addr=xxx} (xx)
```
This is because the `addr` side actively disconnected, and `addr` is most likely the taskmanager. This does not mean that the taskmanager is abnormal. Rather, the taskmanager considers the connection inactive because it has exceeded the keepAliveTime, and actively closes the communication channel.
In version 0.5.0 and later, you can increase the taskmanager's `server.channel_keep_alive_time` to tolerate inactive channels for longer. The default value is 1800s (0.5h). This value may need to be increased, especially when using synchronous offline commands.
In versions before 0.5.0, this configuration cannot be changed. Please upgrade the taskmanager.
2 changes: 2 additions & 0 deletions docs/en/maintain/index.rst
@@ -7,4 +7,6 @@ Maintenance

upgrade
cli
faq
monitoring

195 changes: 195 additions & 0 deletions docs/en/maintain/monitoring.md
@@ -0,0 +1,195 @@
# Monitoring

## Overview


The monitoring scheme of OpenMLDB is outlined as follows:

- Use [prometheus](https://prometheus.io) to collect monitoring metrics, [grafana](https://grafana.com/oss/grafana/) to visualize metrics
- OpenMLDB exporter exposes database-level and component-level monitoring metrics
- Node uses [node_exporter](https://github.com/prometheus/node_exporter) to expose machine and operating system related metrics

## Install and Run OpenMLDB Exporter

### Introduction

[![PyPI](https://img.shields.io/pypi/v/openmldb-exporter?label=openmldb-exporter)](https://pypi.org/project/openmldb-exporter/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/openmldb-exporter?style=flat-square)

The OpenMLDB exporter is a prometheus exporter implemented in Python. At its core, it connects to the OpenMLDB instance through the database SDK and queries the monitoring metrics it exposes with SQL statements. The exporter is released to PyPI along with each OpenMLDB version. For production use, you can install the latest `openmldb-exporter` directly with pip. For development and usage instructions, please refer to the [README](https://github.com/4paradigm/OpenMLDB/tree/main/monitoring) in the code directory.


### Environmental Requirements

- Python >= 3.8
- OpenMLDB >= 0.5.0

### Preparation

1. Get OpenMLDB

You can download precompiled packages from the [OpenMLDB release](https://github.com/4paradigm/OpenMLDB/releases) page, or build from source.

Note that when compiling, make sure the compile option `-DTCMALLOC_ENABLE=ON` is enabled (it is `ON` by default):
```sh
git clone https://github.com/4paradigm/OpenMLDB
cd OpenMLDB
# OpenMLDB exporter depends on compiled Python SDK
make SQL_PYSDK_ENABLE=ON
make install
```
See [compile.md](../deploy/compile.md).

2. Start OpenMLDB

See [install_deploy](../deploy/install_deploy.md) for how to start OpenMLDB components.

The OpenMLDB exporter requires OpenMLDB to start with the server status function enabled. To do so, add the startup parameter `--enable_status_service=true`, and make sure that `--enable_status_service=true` is set in `conf/(tablet|nameserver).flags` in the installation directory.

The default startup script `bin/start.sh` enables server status, so no additional configuration is required.

3. Note: Choose the binding IP addresses of the OpenMLDB components, the OpenMLDB exporter, prometheus, and grafana carefully, so that grafana can access prometheus, and prometheus, the OpenMLDB exporter, and the OpenMLDB components can access each other.


### Deploy the OpenMLDB exporter

1. Install openmldb-exporter from PyPI

```bash
pip install openmldb-exporter==0.5.0
```

2. Run

An executable named `openmldb-exporter` is installed by default. Make sure the pip install path is in your `$PATH` environment variable.

```bash
openmldb-exporter
```

Make sure to pass in the appropriate parameters. Run `openmldb-exporter -h` to see the help:

```bash
usage: openmldb-exporter [-h] [--log.level LOG.LEVEL] [--web.listen-address WEB.LISTEN_ADDRESS]
                         [--web.telemetry-path WEB.TELEMETRY_PATH] [--config.zk_root CONFIG.ZK_ROOT]
                         [--config.zk_path CONFIG.ZK_PATH] [--config.interval CONFIG.INTERVAL]

OpenMLDB exporter

optional arguments:
  -h, --help            show this help message and exit
  --log.level LOG.LEVEL
                        config log level, default WARN
  --web.listen-address WEB.LISTEN_ADDRESS
                        process listen port, default 8000
  --web.telemetry-path WEB.TELEMETRY_PATH
                        Path under which to expose metrics, default metrics
  --config.zk_root CONFIG.ZK_ROOT
                        endpoint to zookeeper, default 127.0.0.1:6181
  --config.zk_path CONFIG.ZK_PATH
                        root path in zookeeper for OpenMLDB, default /
  --config.interval CONFIG.INTERVAL
                        interval in seconds to pull metrics periodically, default 30.0

```

3. View the list of metrics

```bash
$ curl http://127.0.0.1:8000/metrics
# HELP openmldb_connected_seconds_total duration for a component conncted time in seconds
# TYPE openmldb_connected_seconds_total counter
openmldb_connected_seconds_total{endpoint="172.17.0.15:9520",role="tablet"} 208834.70900011063
openmldb_connected_seconds_total{endpoint="172.17.0.15:9521",role="tablet"} 208834.70700001717
openmldb_connected_seconds_total{endpoint="172.17.0.15:9522",role="tablet"} 208834.71399998665
openmldb_connected_seconds_total{endpoint="172.17.0.15:9622",role="nameserver"} 208833.70000004768
openmldb_connected_seconds_total{endpoint="172.17.0.15:9623",role="nameserver"} 208831.70900011063
openmldb_connected_seconds_total{endpoint="172.17.0.15:9624",role="nameserver"} 208829.7230000496
# HELP openmldb_connected_seconds_created duration for a component conncted time in seconds
# TYPE openmldb_connected_seconds_created gauge
openmldb_connected_seconds_created{endpoint="172.17.0.15:9520",role="tablet"} 1.6501813860467942e+09
openmldb_connected_seconds_created{endpoint="172.17.0.15:9521",role="tablet"} 1.6501813860495396e+09
openmldb_connected_seconds_created{endpoint="172.17.0.15:9522",role="tablet"} 1.650181386050323e+09
openmldb_connected_seconds_created{endpoint="172.17.0.15:9622",role="nameserver"} 1.6501813860512116e+09
openmldb_connected_seconds_created{endpoint="172.17.0.15:9623",role="nameserver"} 1.650181386051238e+09
openmldb_connected_seconds_created{endpoint="172.17.0.15:9624",role="nameserver"} 1.6501813860512598e+09

```
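To check the exporter output programmatically rather than by eye, the metrics text can be parsed into samples. The following is a minimal sketch, not a full parser for the Prometheus text format: it only handles simple `name{labels} value` lines without timestamps or escaped label values, and the helper name `parse_metrics` is hypothetical. The sample text is copied from the output above.

```python
import re

# Matches simple sample lines: name, optional {labels}, then a value.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern does not cover
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


text = '''# TYPE openmldb_connected_seconds_total counter
openmldb_connected_seconds_total{endpoint="172.17.0.15:9520",role="tablet"} 208834.70900011063
openmldb_connected_seconds_total{endpoint="172.17.0.15:9622",role="nameserver"} 208833.70000004768
'''

# Group the connected endpoints by component role.
by_role = {}
for name, labels, value in parse_metrics(text):
    by_role.setdefault(labels.get("role", "unknown"), []).append(labels.get("endpoint"))
print(by_role)
```

For anything beyond a quick check, a maintained parser from a Prometheus client library is likely a better choice than this hand-written pattern.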

## Deploy Node Exporter
For how to install and deploy prometheus and grafana, please refer to the official documents [prometheus get started](https://prometheus.io/docs/prometheus/latest/getting_started/) and [grafana get started](https://grafana.com/docs/grafana/latest/getting-started/getting-started-prometheus/).

[node_exporter](https://github.com/prometheus/node_exporter) is an official implementation of prometheus that exposes system metrics.

Go to the [release](https://github.com/prometheus/node_exporter/releases) page, then download and extract the package for your platform. For example, on the linux amd64 platform:
```sh
curl -SLO https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xzf node_exporter-1.3.1.*.tar.gz
cd node_exporter-1.3.1.*/

# Start node_exporter
./node_exporter
```

## Deploy Prometheus and Grafana

For installation and deployment of prometheus and grafana, please refer to the official documents [prometheus get started](https://prometheus.io/docs/prometheus/latest/getting_started/) and [grafana get started](https://grafana.com/docs/grafana/latest/getting-started/getting-started-prometheus/).
OpenMLDB provides prometheus and grafana configuration files for reference, see [OpenMLDB mixin](https://github.com/4paradigm/OpenMLDB/tree/main/monitoring/openmldb_mixin/README.md).

- prometheus_example.yml: Prometheus configuration example. Modify the target addresses in the `node`, `openmldb_components` and `openmldb_exporter` jobs to match your deployment
- openmldb_dashboard.json: grafana dashboard configuration for OpenMLDB metrics. Import it in two steps:
1. Under the grafana data source page, add the started prometheus server address as the data source
2. Under the dashboard browsing page, click New to import a dashboard, and upload the json configuration file

## Understand Existing Monitoring Metrics

Taking the OpenMLDB cluster system as an example, the monitoring indicators are divided into two categories according to different prometheus pull jobs:

1. DB-Level metrics, exposed through the OpenMLDB exporter, correspond to the `job_name=openmldb_exporter` entry in the `prometheus_example.yml` configuration:

```yaml
  - job_name: openmldb_exporter
    # pull OpenMLDB DB-Level specific metric
    # change the 'targets' value to your deployed OpenMLDB exporter endpoint
    static_configs:
      - targets:
        - 172.17.0.15:8000
```

The categories of indicators exposed are mainly:

- component status: cluster component status

- table status: database table related information, such as `rows_cout`, `memory_bytes`

- deploy query response time: the runtime of the deployment query inside the tablet

The full DB-Level metrics can be listed through the following command:

```bash
curl http://172.17.0.15:8000/metrics
```

2. Component-Level metrics. The OpenMLDB components (nameserver, tablet, etc.) are themselves BRPC servers and expose [prometheus related metrics](https://github.com/apache/incubator-brpc/blob/master/docs/en/bvar.md#export-to-prometheus), so you only need to configure the prometheus server to pull metrics from the corresponding addresses. This corresponds to the `job_name=openmldb_components` entry in `prometheus_example.yml`:

```yaml
  - job_name: openmldb_components
    # job to pull component metrics from OpenMLDB like tablet/nameserver
    # tweak the 'targets' list in 'static_configs' on your need
    # every nameserver/tablet component endpoint should be added into targets
    metrics_path: /brpc_metrics
    static_configs:
      - targets:
        - 172.17.0.15:9622
```

The exposed metrics are mainly:

- BRPC server process related information
- Metrics for the RPC methods defined by the BRPC server, such as the RPC request `count`, `error_count`, `qps` and `response_time`

Metrics and help information can be shown through the following command (Note that the metrics exposed by different components will vary):

```bash
curl http://${COMPONENT_IP}:${COMPONENT_PORT}/brpc_metrics
```