TT produces huge index files #64

Open
jens-ylja opened this issue Dec 27, 2023 · 17 comments

@jens-ylja (Contributor) commented Dec 27, 2023

Hi,

I'm back, having just installed TickTockDB again.
After months of successful operation on my first system, I'm setting up another one, using the latest version 0.12.1-beta, built from the main branch just a few days ago. The TT instance was set up on the afternoon of Dec 20.

This worked well until yesterday, but today yesterday's data can no longer be queried.
In the logs I find a series of error messages:

2023-12-27 08:56:11.227 [ERROR] [http_20_task_0] Failed to mmap file /var/lib/ticktock/data/2023/12/1703548800.1703635200/index, errno = 12

This reminds me of #48, which in the end turned out to be a problem on 32-bit machines only.
So I ran the same steps to drill down.

  1. top -> gives me a virtual size of 2121952 and a resident size of 3792
  2. ps -eLf -> gives me 62 threads
  3. pmap -X -> gives me a total size of 2121956 and a resident size of 5536, with one file (index) having a size of 1593392

To cross-check, I stopped TT, created a backup (tar cvfz data.tgz data/), and switched back to TT version 0.11.8-beta (as used on my working first setup). But the behaviour didn't change. So I inspected the file system and found the following:

ticktock/data$ ls -lR | grep index
-rw-r--r-- 1 jba jba  24576 Dec 22 02:00 index
-rw-r--r-- 1 jba jba  24576 Dec 21 19:31 index
-rw-r--r-- 1 jba jba  24576 Dec 23 02:00 index
-rw-r--r-- 1 jba jba  24576 Dec 21 01:51 index
-rw-r--r-- 1 jba jba  24576 Dec 24 02:00 index
-rw-r--r-- 1 jba jba  24576 Dec 22 01:51 index
-rw-r--r-- 1 jba jba  24576 Dec 25 02:00 index
-rw-r--r-- 1 jba jba  24576 Dec 23 01:51 index
-rw-r--r-- 1 jba jba 1631632048 Dec 26 15:15 index
-rw-r--r-- 1 jba jba 1631632078 Dec 27 01:51 index

I verified this with the backup tar:

ticktock$ tar tvfz data.tgz | grep index
-rw-r--r-- jba/jba       24576 2023-12-21 19:31 data/2023/12/1703030400.1703116800.back/index
-rw-r--r-- jba/jba       24576 2023-12-21 01:51 data/2023/12/1703116800.1703203200.back/index
-rw-r--r-- jba/jba       24576 2023-12-24 02:00 data/2023/12/1703203200.1703289600/index
-rw-r--r-- jba/jba       24576 2023-12-22 02:00 data/2023/12/1703030400.1703116800/index
-rw-r--r-- jba/jba       24576 2023-12-25 02:00 data/2023/12/1703289600.1703376000/index
-rw-r--r-- jba/jba       24576 2023-12-22 01:51 data/2023/12/1703203200.1703289600.back/index
-rw-r--r-- jba/jba  1631632078 2023-12-27 01:51 data/2023/12/1703635200.1703721600/index
-rw-r--r-- jba/jba       24576 2023-12-23 01:51 data/2023/12/1703289600.1703376000.back/index
-rw-r--r-- jba/jba       24576 2023-12-23 02:00 data/2023/12/1703116800.1703203200/index
-rw-r--r-- jba/jba  1631632048 2023-12-26 15:15 data/2023/12/1703548800.1703635200/index

The data.tgz file itself has a size of 5MB only:

ticktock$ ls -lh data.tgz 
-rw-rw-r-- 1 jba jba 5.0M Dec 27 10:02 data.tgz

This means both large index files consist mostly of compressible content (probably gaps or zeros).

I double-checked the index files of my other installation (running since April, with a total data size of about 350 MB) - all index files there have exactly the same size of 24576 bytes.

The failure happened last night, seven days after the DB was set up. Is there an operation that runs with a delay of one week?
I'll remove the database and re-run with 0.12.1-beta for the next (seven) days - we'll see if it happens again.

@ylin30 (Collaborator) commented Dec 27, 2023

errno 12 means out of resources, usually OOM or out of virtual memory. The two index files are too big (1.6 GB each) to both fit into virtual memory. I assume you are using a 32-bit OS?

Index file size is mainly determined by the number of data pages and the page size. What is your config? Please run admin/config.sh and check for differences from the defaults.
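
To illustrate (a minimal sketch, not TickTockDB code): on a 32-bit OS, mmap() of a file this large can fail simply because the process has no contiguous virtual address range left for it, which shows up exactly as errno 12.

// Minimal illustration (not TickTockDB code): mapping a ~1.6 GB file on a
// 32-bit process often fails with errno 12 (ENOMEM) because there is no
// contiguous virtual address range left for the mapping.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    // Hypothetical path; point it at the oversized index file to see the failure.
    const char* path = argc > 1 ? argv[1] : "index";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        std::printf("mmap failed, errno = %d (%s)\n", errno, std::strerror(errno));
        close(fd);
        return 1;
    }
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}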

@ylin30 (Collaborator) commented Dec 27, 2023

Can you please share the backup data.tgz with us?

@ylin30 (Collaborator) commented Dec 27, 2023

I noticed that the last two indexes don't have corresponding *.back folders (the original folders kept after compaction). Since this happened after one week, I suspect there is a bug in archive/compact, which is kicked off every week.

If you want to rerun before we figure out the bug, please disable compaction by setting TSDB.archive.threshold = 0h.
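
For example, add the following line to the config file you start TT with (the exact file depends on your setup, e.g. conf/tt.conf when started with ./bin/tt -c conf/tt.conf):

TSDB.archive.threshold = 0h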

@jens-ylja (Contributor, Author)

errno 12 means out of resources, usually OOM or out of virtual memory. The two index files are too big (1.6 GB each) to both fit into virtual memory. I assume you are using a 32-bit OS?

Index file size is mainly determined by the number of data pages and the page size. What is your config? Please run admin/config.sh and check for differences from the defaults.

Yes, it's a 32-bit OS.
I'm running TT within Docker, so admin/config.sh lives in scripts. Here is the output:

ticktock@4f447483e0f8:/opt/ticktock/scripts$ ./config.sh 
{
  "log.file": "/var/lib/ticktock/log/ticktock.log",
  "tsdb.data.dir": "/var/lib/ticktock/data",
}

So the config is pretty much the default.

I'll try to find a way to provide the data.tgz; GitHub doesn't accept it.

@jens-ylja (Contributor, Author)

data.tar.gz

Got it uploaded as data.tar.gz - it's now a plain tar cf archive, then compressed with gzip.

@jens-ylja (Contributor, Author)

I noticed that the last two indexes don't have corresponding *.back folders (the original folders kept after compaction). Since this happened after one week, I suspect there is a bug in archive/compact, which is kicked off every week.

If you want to rerun before we figure out the bug, please disable compaction by setting TSDB.archive.threshold = 0h.

I'll think about TSDB.archive.threshold = 0h tomorrow. For now my "working" day is over.

@ylin30 (Collaborator) commented Dec 27, 2023

The problem is not archive/compact. Rather, the ticktock.meta format was completely messed up. Its last modification time is Dec 26 05:31, which happens to be just before the two big index files were created.

[Yi-MBP data-1]$ ls -l
total 8
drwxr--r--@ 3 ylin30  staff   96 Dec 20 06:33 2023
-rw-r--r--@ 1 ylin30  staff  459 Dec 26 05:31 ticktock.meta
[Yi-MBP data-1]$ cat ticktock.meta
sensor.status device=s10e,type=s10e val=0
test.cpu host=rpi,id=1 usr=1 sys=2 idle=3

 ; =4
0 type=s10e 1703590340
=5
0
 ; =6
s10e.power ; battery=1703590342 grid=1703590343 home=1703590344 pv=1703590345
s10e.status ; autarky=1703590346 ep=1703590347 self=1703590348 soc=1703590349 swVersion=1703590350
s10e.dcVoltage ; garage=1703590351 house=1703590352
s10e.dcCurrent ; garage=1703590353 house=1703590354
s10e.dcPower ; garage=1703590355 house=1703590356
[Yi-MBP data-1]$

Note that only the first 2 lines are in the correct format. The subsequent lines are completely messed up. It is obvious that lines 3-8 are wrong, and the lines after 8 are also in the wrong format. The last number, in house=1703590356, indicates there are 1703590356 time series, which is wrong (I guess).

If you want to manually fix the data folders, you can simply remove the lines after line 3 in ticktock.meta and delete the two TSDB folders with the big index files. But we don't know whether the problem would come back.

Basically, an index file is a map from time series ID to the position of that series' first data page in the data file. TT learns all time series and their IDs by reading ticktock.meta. Since ticktock.meta is in the wrong format, TT can't generate correct index files. It likely kept trying to allocate entries for 1703590356 time series in the index file, causing the huge index files.
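
A rough sketch of that relationship (purely illustrative; TT's real entry size and on-disk layout differ): if the index keeps one fixed-size slot per series ID, its size is driven by the highest ID it has ever seen.

// Purely illustrative, NOT TickTockDB's real layout: a flat index file with
// one fixed-size slot per time series ID. Its size is driven by the highest
// series ID, so a bogus ID like 1703590356 forces a gigabyte-scale file even
// though almost all slots stay empty.
#include <cstdint>
#include <cstdio>

struct IndexEntry {        // hypothetical 8-byte entry
    uint32_t file_index;   // which data file holds the series' first page
    uint32_t page_index;   // position of that first data page
};

int main() {
    const uint64_t sane_max_id   = 28;          // a healthy ticktock.meta
    const uint64_t broken_max_id = 1703590356;  // the timestamp misread as an ID

    std::printf("sane index:   %llu bytes\n",
                (unsigned long long)((sane_max_id + 1) * sizeof(IndexEntry)));
    std::printf("broken index: %llu bytes\n",
                (unsigned long long)((broken_max_id + 1) * sizeof(IndexEntry)));
    return 0;
}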

We don't yet know what caused ticktock.meta to end up in this weird format. It is hard for us to repro. If you could shed some light on reproducing it, that would be great! For example, do you recognize s10e.power, battery, garage, etc.? What commands did you use to add the data points for those metrics?

@ylin30 (Collaborator) commented Dec 27, 2023

You use the InfluxDB line protocol. The first line, sensor.status device=s10e,type=s10e val=0, is correct: the time series is interpreted as metric name sensor.status, two tags (device=s10e, type=s10e), and a field val, and it is assigned time series ID 0.
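
(For reference, a data point in InfluxDB line protocol, as sent to TT, has the general shape

measurement[,tag_key=tag_value,...] field_key=field_value[,...] [timestamp]

for example sensor.status,device=s10e,type=s10e val=0 1703590340 - the timestamp here is only for illustration.)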

s10e.power ; battery=1703590342 grid=1703590343 home=1703590344 pv=1703590345

The line above (line 9) is strange, with a ; in the position of the tag=value pairs. Do you happen to know where the ; comes from?

 ; =4

Line 4 is even weirder, indicating a time series with no metric name, no tag=value pairs, no field, and just a ;. And TT assigned ID 4 to this time series.

Do you use a metric collector, or your own scripts, to collect data points and send them to TT?

@ylin30 (Collaborator) commented Dec 27, 2023

We are getting closer to the cause. I can partially repro your problems.

s10e.power ; battery=1703590342 grid=1703590343 home=1703590344 pv=1703590345

The line above (line 9) is strange, with a ; in the position of the tag=value pairs. Do you happen to know where the ; comes from?

This line is actually valid. If you don't specify tags in the InfluxDB line protocol (according to InfluxDB, tags are optional), then ticktock.meta will have a line like this. You can repro it with:

curl -XPOST 'http://localhost:6182/api/write' -d $'test.cpu usr=10,sys=20,idle=70'
 ; =4

Line 4 is even weirder, indicating a time series with no metric name, no tag=value pairs, no field, and just a ;. And TT assigned ID 4 to this time series.

This weird line is invalid. You might have sent an empty line to /api/write. TT would crash, but ticktock.meta would end up with a line like the one above. You can repro it with:

curl -XPOST 'http://localhost:6182/api/write' -d $' '

While TT absolutely needs more protection here, for now please double-check your inputs and avoid sending empty lines to /api/write.

The huge ID was actually caused by line 5 of the ticktock.meta above. The trailing number is supposed to be a timestamp, but your line protocol input was probably missing its fields; ticktock.meta then ends up with such a line. For example,

curl -XPOST 'http://localhost:6182/api/write' -d $'0,type=s10e 1703590340'

The POST above results in ticktock.meta having a line like:

0 type=s10e 1703590340=27

Note that lines 5 & 6 in your ticktock.meta are actually supposed to be one line. You probably ran your client on Windows, which appends a ^M at the end of 1703590340 in your input. I haven't been able to verify that yet.

Line 5 thus caused subsequent time series IDs to start from 1703590340, since TT thought the last used ID was 1703590340 (instead of 5). The index files therefore assumed there are more than 1703590340 time series and grew so big. Line 5 is the real cause of the problem.
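
A simplified sketch of that failure mode (assumed behaviour for illustration, not TT's actual parser): if the meta loader keeps a running maximum of the IDs it reads and hands out max + 1 for new series, one corrupted line poisons every ID assigned afterwards.

// Assumed behaviour for illustration, NOT TT's actual parser: the meta loader
// tracks the highest series ID seen so far, and every new series gets
// max_id + 1. A corrupted line whose "ID" is really the timestamp 1703590340
// pushes max_id into the billions, so all later series (and the index file
// sized after them) inherit huge IDs.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Simplified stand-in for the IDs recorded in ticktock.meta; the last
    // entry is the timestamp that ended up parsed as an ID.
    std::vector<std::string> meta_ids = {"0", "1", "2", "3", "4", "1703590340"};

    uint64_t max_id = 0;
    for (const std::string& s : meta_ids)
        max_id = std::max<uint64_t>(max_id, std::stoull(s));

    // Instead of 5, the next series ID becomes 1703590341.
    std::printf("next series ID: %llu\n", (unsigned long long)(max_id + 1));
    return 0;
}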

Sorry that we didn't add much protection to TT against typos or wrongly formatted inputs. TT definitely needs to be improved here. We will soon release a new subversion that at least addresses the problems you reported. For the time being, please make sure your inputs are in the right format to avoid weird problems in TT.

Please let us know whether the repro cases above make sense.

@jens-ylja (Contributor, Author)

Thanks very much for the detailed analysis.
Regarding the completely corrupted ticktock.meta file:

  • I did some experiments with the input data on Dec 26 around 12:00-13:00, using telnet to send Influx line protocol data - the test.cpu series.
  • During these tests I may have sent some weird lines and empty lines too.
  • Maybe - I'm not sure anymore - TT crashed in the middle of these experiments.
  • Afterwards I continued writing data the ordinary way (using a Java program that has been doing this job well in my other setup for months).
  • Finally, I completely reworked my service setup (from multiple docker commands to docker compose), which caused TT to be restarted multiple times in short intervals.

So I probably corrupted the ticktock.meta file somehow and wasn't aware of it, because the first huge index file was still OK for my small machine's memory, but the second one on Dec 27 was not.

I believe what happened isn't really reproducible.

@jens-ylja (Contributor, Author)

@ylin30
Regarding your last post - you're so right.

When sending with telnet:

  • I definitely sent blank lines - not intentionally, but it happened
  • I wanted to stop telnet and certainly hit ^D and ^C
  • I'm working on Linux, so I'm unsure about the ^M

Nevertheless I'm sure I corrupted the ticktock.meta file this way.
I'll try to reproduce and come back.

@jens-ylja (Contributor, Author) commented Dec 28, 2023

Reproduction starting with a fresh TickTockDB

Ordinary operation:

$ telnet hcwatch 6180
Trying 172.16.1.110...
Connected to hcwatch.fritz.box.
Escape character is '^]'.
sensor.status,device=s10e,type=s10e val=0 1703697200
s10e.power battery=-417,grid=8,home=425,pv=0 1703697200
^]
telnet> quit
Connection closed.

results in

$ cat ticktock.meta 
sensor.status device=s10e,type=s10e val=0
s10e.power ; battery=1 grid=2 home=3 pv=4

weird chars:

$ telnet hcwatch 6000
Trying 172.16.1.110...
Connected to hcwatch.fritz.box.
Escape character is '^]'.
s10e.dcVoltage garage=48.0,house=89.0 1703697200
^C
Connection closed by foreign host.

Not shown are a ^D before the ^C and a newline after the ^C.
The ticktock.meta is corrupted now:

$ cat ticktock.meta 
sensor.status device=s10e,type=s10e val=0
s10e.power ; battery=1 grid=2 home=3 pv=4
s10e.dcVoltage ; garage=5 house=6
����
 ; =7
0
User-Agent: ; curl/7.81.0
Accept: */*



=8

And TT crashed (it's now in a restart loop due to docker --restart unless-stopped).

I wasn't able to reproduce the complete behaviour, but I produced a serious amount of garbage anyway.
Fortunately I didn't run the original tests on my "PROD" instance :)

As you mentioned, some protection regarding invalid input data would be great.
In the meantime I've learned: "Strictly adhere to the line protocols, otherwise you may corrupt your database."

Thanks a lot for your fast response.
I'll start over with a fresh database and correct input data :)

@ylin30 (Collaborator) commented Dec 28, 2023

@jens-ylja Oh, you used telnet. That's where the weird chars come from. I didn't think of that. Thanks for the info. Now I can repro line 5, which caused the big ID.

Two conditions: 1. a line protocol input with tags and a timestamp but missing fields; 2. a ^] appended directly at the end of the line.

[yi-IdeaPad ~]$ telnet 192.168.1.16 6180
Trying 192.168.1.16...
Connected to 192.168.1.16.
Escape character is '^]'.
test1,type=s10e 1703590340^]

telnet>

Connection closed by foreign host.
[yi-IdeaPad ~]$

TT crashed with the backtrace below; here is my ticktock.meta afterwards. Its last two lines are caused by the test case above.

ylin30@raspberrypi:~/ticktock $ ./bin/tt[0x17898]
/lib/arm-linux-gnueabihf/libc.so.6(__default_sa_restorer+0x0)[0xb6b36900]
./bin/tt[0x41d38]
./bin/tt[0x311e8]
./bin/tt[0x30438]
/lib/arm-linux-gnueabihf/libstdc++.so.6(+0xa9150)[0xb6dcc150]
Interrupted (11), shutting down...
Start shutdown process...
Shutdown process complete

cat ./data/ticktock.meta
temperature location=office 0
......
test1 type=s10e 1703590340
=28
[1]+  Done                    ./bin/tt -c conf/tt.conf
ylin30@raspberrypi:~/ticktock $

@ytyou and I will figure out a solution to fix this kind of error. Let me explain some of the TT design principles we have in mind first.

Technically we could add many protections (e.g., validating the input format) in TT, but that would add a lot of overhead and slow TT down. TT runs extremely fast (millions of operations per second, less than 1 ms per operation, less than 1/10 or even 1/100 of a ms per data point), and any overhead slows it down dramatically. We believe clients will very likely send requests in the correct format (and if not, QA would have caught the typos in testing before PROD, which is essentially what we are doing right now). So we try to minimize format validation in TT.
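
Just as a rough illustration of what a cheap guard could look like (a sketch, not a committed design): rejecting empty lines and lines containing control bytes costs a single pass over the input and would have caught the telnet garbage above.

// Sketch only, not a committed TT design: a cheap pre-parse guard that
// rejects empty lines and lines containing control or non-ASCII bytes (the
// kind of garbage telnet's ^C/^D/IAC sequences produce) before they reach
// the line-protocol parser and ticktock.meta.
#include <string>

bool looks_like_sane_line(const std::string& line) {
    if (line.empty()) return false;              // e.g. an empty /api/write body
    for (unsigned char c : line) {
        if (c < 0x20 && c != '\t') return false; // control bytes such as ^C, ^D
        if (c >= 0x80) return false;             // telnet IAC sequences, binary noise
    }
    return true;                                 // hand off to the real parser
}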

Now it comes down to a discussion of how to balance performance with correctness. A mere typo (or maybe a malicious attack) can easily break TT, permanently. That is definitely not acceptable. We will think it through carefully.

Again, we're grateful that you reported such an important issue!

@ylin30 (Collaborator) commented Dec 28, 2023

@jens-ylja BTW, we've observed that someone has kept pulling TT docker images every 5 minutes for about 25 days, since the end of November. Since you mentioned you've tested TT in Docker for months, was it you doing the pulling?

@jens-ylja (Contributor, Author)

@ylin30 - I totally agree - every validation takes time and slows down processing. What I did should never be the ordinary way of inserting data into TT. Nevertheless, it had some dramatic consequences.

So please at least add a big notice about this fact to the documentation of the supported input formats.

@jens-ylja (Contributor, Author) commented Dec 29, 2023

@jens-ylja BTW, we've observed that someone has kept pulling TT docker images every 5 minutes for about 25 days, since the end of November. Since you mentioned you've tested TT in Docker for months, was it you doing the pulling?

Yes, I'm using TT in Docker, but I build it myself from source because I run it on ARMv7.
I always use the tagged versions (0.11.8-beta for about 7 months now, 0.12.1-beta since Dec 20).

@ylin30 (Collaborator) commented Dec 31, 2023

Added warnings to the user guide and FAQ.

(screenshot of the warning added to the user guide and FAQ)
