
disk space consumption of data folder is pretty high #62

Open
Soren-klars opened this issue Nov 11, 2023 · 9 comments

@Soren-klars

Hi there,
I'm using TickTockDB version 0.12.0 and have been collecting IoT data with it for the whole year now. I re-imported my old data into the new version of TT to have a fresh start. The issue I'm facing is that my actual data is very small, while the header files in each day's data folder consume a lot more space.
For this year it accumulates to more than 500 MB of disk space, which compresses down to less than 2 MB in the zip files, indicating lots of unused allocated space in the header files.
They (the header.0 files) are consistently 852 kB in size, while the actual data is only a few kB.
There are also always 2 folders for each day, one with the ending .back. They both contain the same header file.
I attached all the data for this year, and the latest log file covering the last 2 months. It would be great if you could optimize this, since space on small 32-bit computers is limited.
Otherwise I'm very happy with the stability and performance of your DB; it's really great. Thank you so much for all your amazing work.
Kind regards
Soren

ticktock.log.tar.gz
data.tar.gz

@ylin30
Collaborator

ylin30 commented Nov 11, 2023

It is likely due to the new format in v0.12. Do you remember which version you used before the import? The .back files are the original files. We will look at your files today.

@Soren-klars
Author

I had used 0.10.2, but it shouldn't matter, because I wiped the data folder and imported the old data via API calls, using a SQL DB as a secondary data store. I really tried to start clean, and I was aware of the different data formats. I also used the default config file for version 0.12, without any old settings.

@ylin30
Collaborator

ylin30 commented Nov 11, 2023

We looked at your data and logs. The mismatched sizes of the header and data files are due to the fact that your data cardinality is very small, i.e., only 4 time series. TT's default setting (tsdb.page.count) is 32768, so a header file is around 900 kB no matter how many time series there are.

The data file size, on the other hand, is determined by your real data size, i.e., how many data points there are in total. If a TSDB (default: 1 day) holds so many data points that more than 32768 data pages are needed for them, we create another data file. In your case you will never use 32768 pages per day, so your data files are very small (also thanks to compression).
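To put rough numbers on it (the per-page figure below is derived from the file sizes reported in this thread, not from the source code):

```
852 kB / 32768 pages ≈ 26 B of header per page
3000 pages × 26 B ≈ 78 kB   (observed later in this thread)
 128 pages × 26 B ≈ ~3 kB   (estimated)
```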

What we suggest is to reduce tsdb.page.count to a small number (e.g., 128) in your conf file (default: conf/tt.conf). Don't worry about whether you may have more than 128 time series in the future; TT will handle that correctly.
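In tt.conf that looks like this (a minimal sketch; the key=value syntax matches the other settings quoted in this thread):

```
# conf/tt.conf
# Header size scales with this value (see the estimate above).
tsdb.page.count = 128
```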

The bad news is that your data may be inconsistent/corrupted since 10/06. We found in your log that a TSDB has been compacted twice every day since 10/09. Could you please help us verify that you aren't running two TT processes on your OrangePi?

We are trying to restore your data using the *.back files in your data.tar.gz. Let's see if we can recover it correctly.

For now, please disable compaction by adding the setting tsdb.compact.frequency=0s to your conf file (default: conf/tt.conf). And don't forget to restart TT.
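As a config sketch (same file and assumed syntax as above):

```
# conf/tt.conf
# 0s turns the periodic compaction off entirely.
tsdb.compact.frequency = 0s
```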

@ytyou
Owner

ytyou commented Nov 12, 2023

I've re-imported your data using a smaller tsdb.page.count (128). The size is much smaller now.

data-new.tar.gz

@Soren-klars
Author

Hi guys,
thank you very much for your support.
I will re-import the data again with the setting you propose. It would be great if you could also add this to the documentation, because I didn't understand this before.

But I would like to know how you do the re-imports and the data recovery you mentioned. Any tricks are very welcome; there is nothing about this in your docs, and I think it would be useful to many of us. Also, when exporting the data first, do I need to do anything special? I'm trying to understand what you said about some data being in the .back files and some in the normal data files. I don't know how TT behaves in this case. Could you shed some light, please?

Regarding your assumption that there were 2 TT processes running: I really don't think that could have happened. Could it also be caused by me copying the data folder from one machine to another? I think I did the re-import on my laptop, copied the whole data folder over to the OrangePi, and started TT there.
Soren

@ylin30
Collaborator

ylin30 commented Nov 13, 2023

> I will re-import the data again with the setting you propose. It would be great if you could also add this to the documentation, because I didn't understand this before.

Sure, we will add more to the docs. Thanks for bringing the issue up.

> But I would like to know how you do the re-imports and the data recovery you mentioned. Any tricks are very welcome; there is nothing about this in your docs, and I think it would be useful to many of us.

Your case is special, since some of your data was corrupted (caused by compaction bugs) under the old version of TT you used. @ytyou had to write specific code to correct your binary data.

> Also, when exporting the data first, do I need to do anything special?

You don't need to do anything special. Unlike HBase and other DBs whose data files are hard to export, TT is designed so that data files are not bound to a machine or folder. The data files can simply be moved/copied to another machine/folder and restored by another TT process (32 or 64 bit, ARM or x86, as long as you specify the correct path in the config). As you may have already noticed, the data files are organized by time (year/month/day), so you can delete or back up old data simply by removing unwanted directories.
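For example, a sketch only (the data root, host name, and paths are illustrative, and it's safest to stop TT before copying):

```
# Copy the whole data folder to another machine; point the receiving
# TT's config at the restored path and start it there.
scp -r ./data orangepi:/opt/ticktock/data

# Delete old data by removing unwanted time-based directories,
# e.g. everything from 2022:
rm -rf ./data/2022
```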

We wrote a script to migrate data from an old TT version to a new one (in case the binary data format differs in the new version), but we haven't released it yet.

> I'm trying to understand what you said about some data being in the .back files and some in the normal data files. I don't know how TT behaves in this case. Could you shed some light, please?

The .back files are the original files from before compaction. After compaction is done on a TSDB (by default once a day, during off hours), the original files are backed up as .back. Technically you don't need to keep the .back files at all; we were just being careful, since TT is still pre-release, and kept the raw data files just in case.
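So if disk space is tight after compaction has run, the .back directories can be dropped; a sketch, assuming the data root is ./data as in the layout described above:

```
# Remove all pre-compaction backup directories under the data root.
find ./data -type d -name '*.back' -prune -exec rm -rf {} +
```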

> Regarding your assumption that there were 2 TT processes running: I really don't think that could have happened. Could it also be caused by me copying the data folder from one machine to another? I think I did the re-import on my laptop, copied the whole data folder over to the OrangePi, and started TT there.

That doesn't sound like the cause. Never mind!

Thanks for your diligence in making TT more reliable. If you have any questions or problems, don't hesitate to contact us; we will get back to you ASAP.

@Soren-klars
Author

I have one more question about the files on disk. The setting tsdb.rotation.frequency seems to allow only a number of days, right? I tried it with '1m' for 1 month, and it showed weird behavior and even crashed. This setting also didn't show up in the log file, by the way. Could you please extend the documentation here and there with the possible values? It would also be great to know which config changes are safe on existing data and which ones cause trouble.

Right now I'm using tsdb.rotation.frequency=30d, which works fine, except that the start and end dates are fairly arbitrary and not aligned to calendar months. But I'm happy with the reduction in disk space I achieved this way. For the small amount of data I currently have (~20k records a month), 1 data file (~250 kB) per period is enough. I also set tsdb.page.count = 3000, giving a header size of 78 kB. What do you think of this setup?
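For anyone else with a similar low-cardinality setup, the configuration described above boils down to the following (same assumed tt.conf syntax as earlier in the thread):

```
# conf/tt.conf -- ~20k data points/month across 4 series
tsdb.rotation.frequency = 30d   # one TSDB (one set of files) per 30 days
tsdb.page.count = 3000          # ~78 kB header per TSDB instead of ~852 kB
```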

@ylin30
Collaborator

ylin30 commented Jan 17, 2024 via email

@ylin30
Collaborator

ylin30 commented Jan 18, 2024

@Soren-klars I verified that 1n refers to 1 month. However, the data folders are not aligned to the calendar; tsdb.rotation.frequency=1n behaves more or less the same as 30d. I created a bug for this issue.
