TT produces huge index files #64
Comments
12 means out of resources, usually OOM or out of virtual memory. The two index files, at 1.6 GB, are too big to both fit into virtual memory. I assume you are using a 32-bit OS? Index file size is mainly determined by the number of data pages and the page size. What is your config? Please run admin/config.sh and see the difference from the default.
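(A minimal sketch of that check, run from the TT install directory -- the path is an assumption, the script name is as given above:)

```sh
# Dump the effective TickTockDB configuration so it can be compared with the defaults.
cd /path/to/ticktock   # assumed install directory
./admin/config.sh
```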
Can you please share the backup data.tgz with us?
I noticed that the last two indexes don't have a corresponding *.back, which is the original folder left after compact. Since this happened after one week, I suspect there is likely a bug in archive/compact, which is kicked off every week. If you want to rerun it before we figure out the bug, please disable compact by setting TSDB.archive.threshold = 0h.
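(A hedged sketch of where that setting goes; the config file name below is an assumption, the setting itself is exactly as given above:)

```
# In the TickTockDB config file (e.g. conf/tt.conf -- file name assumed), disable the
# weekly archive/compact and restart TT:
TSDB.archive.threshold = 0h
```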
Yes, it's a 32-bit OS.
So the config is quite the default. I'll try to find a way to provide the …
Got it uploaded as …
Will think about …
The problem is not archive/compact. Rather, the ticktock.meta format was completely messed up. Its last modification time is Dec 26 05:31, which happens to be just before the two big index files.
Note that only the first 2 lines are in the correct format. The subsequent lines are completely messed up. It is obvious that lines 3-8 are wrong. Lines after 8 are also in the wrong format. The last number in …

If you want to manually fix the data folders, you can simply remove the lines after line 3, and the two TSDBs with the big index files. But we don't know whether the problem would come back.

Basically, an index file is a map from a time series ID to the position of its first data page in the data file. TT knows all time series and their IDs by looking at ticktock.meta. Since ticktock.meta is in the wrong format, TT won't be able to generate correct index files. It likely kept trying to generate 1703590356 time series in the index file, causing the huge index files.

We don't know what caused ticktock.meta to end up in this weird format yet. It is hard for us to repro this. If you could shed some light on reproducing it, that would be great! For example, did you recognize …
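(To make this concrete, a purely illustrative sketch -- the real ticktock.meta layout is not shown in this thread, so the lines below are hypothetical. Conceptually, each line names a series and ends with its numeric ID, and TT continues numbering from the largest ID it reads back:)

```
cpu.load host=pi 0
mem.used host=pi 1
```

(If a corrupted line ends with a timestamp such as 1703590356 where the ID should be, TT would treat that as the highest ID seen so far and size the index accordingly.)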
You use the InfluxDB line protocol. The first line …
The line above (line 9) is strange, with a …
Line 4 is even weirder, indicating a time series with no metric name, no tag=value, no field, and a …

Do you use some metric collectors, or your own scripts, to collect data points and send them to TT?
We are getting closer to the cause. I can partially repro your problems.
This line is actually valid. If you don't specify tags in the InfluxDB line protocol (according to InfluxDB, tags are optional), then ticktock.meta will have a line like this. You can try this to repro:
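(A hedged sketch of such a write -- TT's HTTP port and the metric/field names below are assumptions, not taken from this thread:)

```sh
# InfluxDB line-protocol point with no tags (tags are optional), POSTed to TT's write endpoint.
# 6182 is assumed to be TT's HTTP port; adjust to your config.
curl -X POST 'http://localhost:6182/api/write' \
     --data-binary 'cpu_load value=0.5 1703590356'
```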
This weird line is invalid. You might have sent an empty line to /api/write. TT would crash, but ticktock.meta would have a line like the above. You can try this to repro it:
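(Again a hedged sketch; the port is an assumption:)

```sh
# Send a single empty line to /api/write (6182 assumed to be TT's HTTP port).
printf '\n' | curl -X POST 'http://localhost:6182/api/write' --data-binary @-
```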
While TT absolutely needs to add more protection, for now please double-check your inputs and avoid sending empty lines to /api/write.

The huge ID was actually caused by line 5 of ticktock.meta above. That number is supposed to be a timestamp, but you might have had missing fields in the line protocol. Then ticktock.meta would end up with such a line. For example,
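(a hedged sketch of such a POST -- the port, metric and tag names are assumptions:)

```sh
# A tag and a timestamp are present, but the field set is missing entirely (invalid line protocol).
curl -X POST 'http://localhost:6182/api/write' \
     --data-binary 'cpu_load,host=pi4 1703590340'
```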
The POST above will result in ticktock.meta having a line such as: …
Note that lines 5 & 6 in your ticktock.meta are actually supposed to be one line. You probably ran your client on Windows, which appends a … Line 5 thus caused subsequent time series IDs to start from 1703590340, since TT thought the last ID was 1703590340 (instead of 5). So the index files assumed there are more than 1703590340 time series and thus grew so big. Line 5 is the real cause of the problem.

Sorry we didn't add much protection to TT against typos or wrongly formatted inputs. TT definitely needs to be improved. We will soon release a new subversion which at least addresses the problems you reported. For the time being, please make sure your inputs are in the right format to avoid weird problems in TT.

Please do let us know if our reproed cases above make sense.
Thanks very much for the detailed analysis.
So maybe I corrupted the …

I believe the things that happened are not reproducible.
@ylin30 When sending with …
Nevertheless I'm sure I corrupted the …
Reproduction, starting with a fresh TickTockDB. Ordinary work:
…
results in:
…
weird chars:
…
not printed are a …
and TT crashed (restart-loop due to …).

I wasn't able to reproduce the complete behaviour but produced a serious amount of garbage anyway. As you mentioned, some protection regarding invalid input data would be great. Thanks a lot for your fast response.
@jens-ylja Oh, you used telnet. That's where the weird chars come from. I didn't think of it. Thanks for the info.

Now I can repro line 5, which caused the big ID. Two conditions (sketched below): 1. tags present but fields missing, with a timestamp, in line protocol; 2. directly append …
Here is my ticktock.meta. The last two lines are caused by the above test case.
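(For reference, a hedged sketch of a write matching the two conditions above -- a tag and a timestamp but no field, terminated with an explicit carriage return as telnet or a Windows client would send. This is not necessarily the exact test case used here; the port and names are assumptions:)

```sh
# Line-protocol point with a tag and a timestamp but no field, ending in CR+LF.
printf 'sensors,room=kitchen 1703590340\r\n' \
  | curl -X POST 'http://localhost:6182/api/write' --data-binary @-
```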
@ytyou and I will figure out a solution to fix this kind of error. Let me explain some of the TT design principles in our minds first.

Technically, we could add many protections (e.g., validating correct formats, etc.) in TT. But that would add lots of overhead and slow TT down. TT runs extremely fast (millions of operations per second, less than 1 ms per operation, less than 1/10 or even 1/100 of a ms per data point). Any overhead would slow TT down dramatically. We believe clients will very likely send requests in correct formats (and if not, QA would have caught the typos in testing before PROD already, as we are doing right now). So we try to minimize format validation in TT.

Now it comes down to a discussion of how to balance perf with correctness. A mis-typed input (maybe even a malicious attack) can easily break down TT, forever. This is definitely not acceptable. We will think it through carefully. Again, we are grateful that you reported such an important issue!
@jens-ylja BTW, we observed that someone kept pulling TT Docker images every 5 min for about 25 days since the end of Nov. Since you mentioned you tested TT in Docker for months, was it you who did the pulling?
@ylin30 - I totally agree - in fact each validation takes time and slows down processing. The things I did should never ever be the ordinary way of inserting data into TT. Nevertheless they had some dramatic consequences. So please at least add a big notice about this fact to the documentation of the supported input formats.
Yes, I'm using TT in Docker. But I build it on my own from the sources because I run it on an ARMv7.
Add warnings in user guide and FAQ.
Hi,
I'm just back, with another TickTockDB installation.
After months of successful operation, I'm setting up another system - using the latest version, 0.12.1-beta, built from the main branch just a few days ago. The TT instance was set up on the afternoon of Dec 20.
This worked well till yesterday, but today the data from yesterday cannot be queried any longer.
Within the logs I find a series of error messages: …
This reminds me of #48, which in the end turned out to be a problem on 32-bit machines only.
Thus I ran the same steps to drill down.
top -> gives me a virtual size of 2121952 and a resident size of 3792
ps -eLf -> gives me 62 threads
pmap -X -> gives me a size (sum) of 2121956 and a resident of 5536, with one file (index) having a size of 1593392

To cross-check, I stopped TT, created a backup (tar cvfz data.tgz data/) and switched back to TT version 0.11.8-beta (as used on my working first setup). But the behaviour didn't change. Thus I inspected the file system and found the following: …

I verified this with the backup tar: …

The data.tgz file itself has a size of 5 MB only: …

This means both index files have a serious amount of compressible content (maybe gaps or zeros; a quick check for this is sketched at the end). I double-checked the index files with my other installation (running since April and having a total size for data of about 350 MB) - all index files there have exactly the same size of 24576 bytes.

The failure happened this night, seven days after the DB was set up. Do we have an operation which runs with a delay of one week?
I'll remove the database and re-run for the next (seven) days with 0.12.1-beta - will see if it happens again.
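(A quick way to double-check the gaps/zeros suspicion mentioned above: compare an index file's apparent size with the disk blocks actually allocated; the path below is a placeholder:)

```sh
# Apparent size vs. actual disk usage; a large gap suggests the file is sparse (holes/zeros).
ls -l data/<tsdb-dir>/index
du -h data/<tsdb-dir>/index
```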