A typical log monitoring tool watches log files for changes so that it can copy or stream those changes
to a central server for housekeeping and analytics. One way of doing this is to periodically call os.stat
on each log file and check whether anything changed since the last visit, and if so, copy or stream the changes.
If a customer has hundreds or thousands of log files being monitored, performing those os.stat calls serially
is quite expensive, especially when only a few of the log files are actively being updated.
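
To make the cost concrete, here is a minimal sketch of the polling approach described above. The function name `poll_for_changes` and the `last_seen` dictionary are hypothetical and only serve the illustration; the point is that every pass issues one os.stat call per file, whether or not the file changed.

```python
import os
import tempfile


def poll_for_changes(paths, last_seen):
    """Stat every path once and return the ones that changed since the last pass.

    `last_seen` maps path -> (mtime, size) and is updated in place.
    """
    changed = []
    for path in paths:
        st = os.stat(path)  # one system call per file, on every pass
        current = (st.st_mtime, st.st_size)
        if last_seen.get(path) != current:
            changed.append(path)
            last_seen[path] = current
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "app.log")
        with open(log, "w") as f:
            f.write("first line\n")

        seen = {}
        print(poll_for_changes([log], seen))  # first pass reports every file
        print(poll_for_changes([log], seen))  # no change since the last pass

        with open(log, "a") as f:
            f.write("second line\n")
        print(poll_for_changes([log], seen))  # the file grew
```

With thousands of mostly idle files, almost all of those calls return the same values as the previous pass.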
One way of solving this problem is to write a cache layer that listens to file system events using inotify,
a file system notification API provided by the Linux kernel, and updates the cache only when an event is received
about a file change. This abstracts away the os.stat call, and the cache layer serves as the single point of
contact for file stats.
Furthermore, an LRU (Least Recently Used) cache implementation can be used to cap the cache layer at a certain number
of files, where the least recently used file is replaced by a new file whose details were requested. This way
we can manage resources efficiently on low-powered devices as well.
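
The LRU idea can be sketched with `collections.OrderedDict` from the standard library. This is an illustration under assumptions, not the library's code: the class name `LRUStatCache`, its `capacity` parameter and the `paths` helper are hypothetical, and the inotify bookkeeping is left out.

```python
import os
from collections import OrderedDict


class LRUStatCache:
    """Keeps stats for at most `capacity` files, evicting the least recently used."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # path -> {"ts": mtime, "size": bytes}

    def get(self, path):
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            return self._entries[path]
        st = os.stat(path)  # cache miss: fall back to the file system
        entry = {"ts": st.st_mtime, "size": st.st_size}
        self._entries[path] = entry
        if len(self._entries) > self.capacity:
            # Drop the entry that was used longest ago.
            # A real cache layer would also stop watching that file here.
            self._entries.popitem(last=False)
        return entry

    def paths(self):
        """Cached paths, ordered from least to most recently used."""
        return list(self._entries)
```

With a capacity of 2, requesting files a, b, a and then c evicts b, because a was touched more recently.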
- Pre-build the cache by supplying a list of files to monitor for changes
- When one of the files being watched is modified, the stats (last modification time and size) of that file are updated in the cache by invoking os.stat
- If a file that is not in the original watch list is accessed for the first time, its stats are fetched using os.stat and the file is added to the inotify watch list; subsequent calls to get its stats are faster because they are returned from the cache
- We only listen for Modify and Delete events from inotify and update the cache accordingly
- Calling invalidate un-watches all the files in the cache from inotify and clears the cache
- Verified with inotify-simple v1.2.1 and Python 3.7.0
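
The behaviour listed above can be sketched without inotify by treating the events as plain method calls. In this illustration `build`, `get_file_stats` and `invalidate` mirror the names used by the library, while the class name `EventDrivenStatCache` and the `on_modify`, `on_delete` and `cached_paths` methods are hypothetical stand-ins for whatever the real event loop does when inotify reports a change.

```python
import os


class EventDrivenStatCache:
    """Serves file stats from memory and refreshes them only on change events."""

    def __init__(self):
        self._stats = {}  # absolute path -> {"ts": mtime, "size": bytes}

    def _refresh(self, path):
        st = os.stat(path)
        self._stats[path] = {"ts": st.st_mtime, "size": st.st_size}

    def build(self, paths):
        # Pre-build the cache. A real implementation would also register
        # an inotify watch for each path at this point.
        for path in paths:
            self._refresh(path)

    def get_file_stats(self, path):
        if path not in self._stats:
            self._refresh(path)  # first access: stat once, then keep it cached
        return self._stats[path]

    def on_modify(self, path):
        # Hypothetical hook, called when a Modify event arrives for `path`.
        if path in self._stats:
            try:
                self._refresh(path)
            except FileNotFoundError:
                self._stats.pop(path, None)

    def on_delete(self, path):
        # Hypothetical hook, called when a Delete event arrives for `path`.
        self._stats.pop(path, None)

    def cached_paths(self):
        return list(self._stats)

    def invalidate(self):
        # A real implementation would remove every inotify watch before clearing.
        self._stats.clear()
```

Note that `get_file_stats` never touches the file system for a cached path, so the returned stats are only as fresh as the last event that was handled.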
from fstat_cache import FStatCache
import time
if __name__ == '__main__':
    cache = FStatCache()
    cache.build(["/tmp/test_file1"])
    print(cache.get_file_stats("/tmp/test_file1"))
    time.sleep(10)
    # From another terminal, update /tmp/test_file1 during the 10 second sleep;
    # the next line should then print the updated info.
    print(cache.get_file_stats("/tmp/test_file1"))
    cache.invalidate()
Another example of consuming this library in a Flask app
Follow these steps to run the Flask app.
pip install -r requirements.txt
pip install flask
python fstat_cache/example_flask_app.py
# in another terminal or a browser
# will fetch the file size and last modification timestamp from cache
curl <ip:port>/cache/<path-to-a-file-in-/tmp-dir>
# eg: curl http://127.0.0.1:5000/cache/test_file_1
# will fetch the file size and last modification timestamp directly using os.stat
curl <ip:port>/stat/<path-to-a-file-in-/tmp-dir>
# eg: curl http://127.0.0.1:5000/stat/test_file_1
cd fstat-cache/fstat_cache
python fstat_cache.py
You should see output similar to:
extreme@a11973d3ad9c:/codefresh/volume/waas/fstat-cache/fstat_cache$ python fstat_cache.py
stats for /tmp/test_file1 = {'ts': 1579282851.43988, 'size': 47}
stats for /tmp/test_file2 = {'ts': 1579282851.43988, 'size': 67}
list of files in the cache = ['/tmp/test_file1', '/tmp/test_file2']
stats for /tmp/test_file1 = {'ts': 1579282851.43988, 'size': 61}
stats for /tmp/test_file2 = {'ts': 1579282851.44988, 'size': 82}
extreme@a11973d3ad9c:/codefresh/volume/waas/fstat-cache/fstat_cache$
cd fstat-cache/tests
nosetests -v
nosetests -v --with-coverage --cover-package=fstat_cache
Sample output:
$ nosetests -v --with-coverage --cover-package=fstat_cache
test_add_file_to_watch_and_remove (tests.test_fstat_cache.FStatCacheTestCase) ... ok
test_get_file_size_from_cache (tests.test_fstat_cache.FStatCacheTestCase) ... ok
test_get_file_size_using_stat (tests.test_fstat_cache.FStatCacheTestCase) ... ok
test_list_files_in_cache (tests.test_fstat_cache.FStatCacheTestCase) ... ok
Name                         Stmts   Miss  Cover
------------------------------------------------
fstat_cache/__init__.py          2      0   100%
fstat_cache/fstat_cache.py     108     20    81%
------------------------------------------------
TOTAL                          110     20    82%
----------------------------------------------------------------------
Ran 4 tests in 0.025s
OK
The following benchmark fetches the stats of 10,000 files 100 times, once using this cache library and once using os.stat directly:
sheshagiri@ubuntu-vm-1:~/workspace/sheshagiri/fstat-cache/fstat_cache$ python3 benchmarks.py
creating 10000 temp files
created temp files in 0.393253 seconds
building cache
built cache in 0.756536 seconds
starting benchmarks now
using fstat-cache library: 0.295163 seconds
using os.stat: 3.294907 seconds
cleaning up tmp files
sheshagiri@ubuntu-vm-1:~/workspace/sheshagiri/fstat-cache/fstat_cache$
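Measured on an Ubuntu 18 VM with 4 GiB of memory and 1 vCPU running on VMware Fusion. In this run the cache lookups took about 0.30 seconds against about 3.29 seconds for os.stat, roughly 11 times faster, not counting the 0.76 seconds spent building the cache. PS: I had to increase the ulimit size.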
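
For readers who want to try a comparison of this kind without the library, here is a rough sketch. It is not the benchmarks.py script from this repository: the function name `time_stat_vs_dict` is hypothetical, and a plain dictionary stands in for the cache, so it only shows the gap between an in-memory lookup and an os.stat call.

```python
import os
import tempfile
import time


def time_stat_vs_dict(num_files=1000, rounds=100):
    """Return (dict_seconds, stat_seconds) for reading every file's size `rounds` times."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(num_files):
            path = os.path.join(tmp, "file_%d" % i)
            with open(path, "w") as f:
                f.write("x" * i)
            paths.append(path)

        # Stand-in for the cache: sizes collected once and kept in a dict.
        cached = {p: os.stat(p).st_size for p in paths}

        start = time.perf_counter()
        for _ in range(rounds):
            for p in paths:
                _ = cached[p]
        dict_seconds = time.perf_counter() - start

        start = time.perf_counter()
        for _ in range(rounds):
            for p in paths:
                _ = os.stat(p).st_size
        stat_seconds = time.perf_counter() - start

    return dict_seconds, stat_seconds


if __name__ == "__main__":
    dict_s, stat_s = time_stat_vs_dict()
    print("dict lookup: %f seconds" % dict_s)
    print("os.stat:     %f seconds" % stat_s)
```

Absolute numbers depend heavily on the machine and file system, so treat the output as a relative comparison only.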
TBD
- Only works for files that already exist when the cache is created, as we only listen for MODIFY and DELETE events from the inotify Python wrapper
- As of now only absolute paths are supported; there is no support for watching a whole directory
- Works only on Linux; it does not work on Windows or macOS