Add cached KnownHosts implementation - continued... #744
Conversation
This adds a new class Net::SSH::KnownHosts::Cached which loads the known_hosts files once, builds an index, and uses that index when searching for keys for a given host. An instance of this object can be passed via the `known_hosts` option in order to avoid loading known_hosts multiple times when making multiple connections, whether to the same host or to several hosts. This behavior is implemented in a new class because it differs from the existing implementation when the content of the known_hosts file changes outside the current process. With the standard implementation, those changes are always picked up the next time a connection is made, whereas the Cached implementation won't automatically incorporate changes made out-of-band.
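A minimal usage sketch of the option described above (hostnames and the username are hypothetical; this requires the branch from this PR and reachable SSH hosts, so it is illustrative rather than runnable as-is):

```ruby
require 'net/ssh'
require 'net/ssh/known_hosts/cached'

# Build the known_hosts index once and share it across connections,
# so the files are parsed a single time rather than once per connection.
cached = Net::SSH::KnownHosts::Cached.new({})

%w[host-a.example.com host-b.example.com].each do |host|
  Net::SSH.start(host, 'deploy', known_hosts: cached) do |ssh|
    ssh.exec!('uptime')
  end
end
```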
```diff
@@ -17,12 +24,10 @@ def search_for(host, options={})
       @hmac_lookups.each do |(hmac, salt), entry|
         digest = OpenSSL::Digest.new('sha1')
         host_hmac = OpenSSL::HMAC.digest(digest, salt, hostname)
-        entries << entry if Base64.encode64(host_hmac.chomp) == hmac
+        entries.concat(entry) if Base64.encode64(host_hmac).chomp == hmac
```
This was a bug whereby using `<<` would insert the array as a single nested element rather than merging its entries. Also, the newline that `chomp` is meant to strip is added by the Base64 encoding, so it needs to be called on the encoded string rather than on the raw HMAC bytes.
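A small stdlib-only illustration of both fixes (not the PR's code, just the underlying Ruby behavior):

```ruby
require 'base64'
require 'openssl'

# 1) `<<` appends the array as one nested element; `concat` merges it.
entries = [:a]
entries << [:b, :c]       # => [:a, [:b, :c]] -- two elements, one nested
entries = [:a]
entries.concat([:b, :c])  # => [:a, :b, :c]   -- three flat elements

# 2) Base64.encode64 appends a trailing newline, so chomp must be
#    applied to the encoded string, not to the raw HMAC digest.
hmac = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'), 'salt', 'host')
encoded = Base64.encode64(hmac)
puts encoded.end_with?("\n")  # the newline lives in the encoded form
puts encoded.chomp            # the value as stored in known_hosts
```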
lib/net/ssh/known_hosts/cached.rb
Outdated

```ruby
file.each_line do |line|
  hosts, type, key_content = line.split(' ')
  next unless hosts || hosts.start_with?('#')
```
Pulls in work from #742
lib/net/ssh/known_hosts/cached.rb
Outdated

```diff
@@ -54,7 +61,7 @@ def parse_known_hosts(source)
       hostlist.each do |host|
         entry = Entry.new(hostlist, key)
         if host.include?('*') || host.include?('?')
           regex = regexify_pattern(host)
```
I think this was just a typo.
```diff
@@ -73,14 +80,17 @@ def parse_known_hosts(source)
       end
     end

-    def add(host, key)
+    def add(host, key, _options = nil)
```
I needed to keep support for a third positional parameter for `HostKeys#add_host_key`:

net-ssh/lib/net/ssh/known_hosts.rb, line 24 in 54476fa:

```ruby
@known_hosts.add(@host, key, @options)
```
test/test_known_hosts_cached.rb
Outdated

```ruby
assert_equal(0, keys.count)
f.write(File.read(path("known_hosts/github")))
f.close
# Ensure mtime is different
```
This is where I found issues with mtimes being so close that the file would not be re-parsed.
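One way such a test can be made deterministic is to set the mtime explicitly instead of relying on the clock, e.g. with `FileUtils.touch` and its `mtime:` keyword. This is a hedged sketch, not the PR's actual test code:

```ruby
require 'tempfile'
require 'fileutils'

file = Tempfile.new('known_hosts')
file.write("github.com ssh-rsa AAAA...\n")  # placeholder entry
file.flush
before = File.mtime(file.path)

# Two writes can land within the filesystem's timestamp resolution,
# leaving the mtime unchanged. Forcing a known, later mtime avoids
# that race in tests.
file.write("example.com ssh-rsa BBBB...\n")
file.flush
FileUtils.touch(file.path, mtime: before + 2)

puts File.mtime(file.path) > before  # guaranteed to differ now
```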
Codecov Report

```
@@            Coverage Diff             @@
##           master     #744      +/-   ##
==========================================
+ Coverage   95.58%   95.95%   +0.37%
==========================================
  Files         153      155       +2
  Lines        9963    10241     +278
==========================================
+ Hits         9523     9827     +304
+ Misses        440      414      -26
```

Continue to review full report at Codecov.
lib/net/ssh/known_hosts/cached.rb
Outdated

```ruby
return unless File.readable?(source)

File.open(source) do |file|
  @md5_sums[source] = Digest::MD5.hexdigest(File.read(file))
```
Please use SHA256 instead of MD5.
This seems like an easy change to make. For now I'd like to see what folks think about using a hash instead of mtime. If we want to move forward with a hash, I think using sha256 should be just fine.
Updated to use sha256.
When the cache is built, store the file modification times. Only re-build the cache when the mtime does not exist or has changed since last read. Similarly with the add method, only rebuild the cache when the write is the only change since the last read.
force-pushed from 7cf78c8 to deaaee0
lib/net/ssh/known_hosts/cached.rb
Outdated

```ruby
def cache_invalid?
  (@user_files + @global_files).each do |file|
    return true if File.readable?(file) && @md5_sums[file] != Digest::MD5.hexdigest(File.read(file))
```
I think it also needs to check whether the file was removed; that should invalidate the cache as well.
Updated to invalidate on delete.
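The invalidation check might then look roughly like this. This is a sketch under the assumption that a hash maps each tracked path to its SHA256 digest at parse time; the names `cache_invalid?` and `sha256_sums` mirror the review discussion but the body is illustrative, not the PR's exact code:

```ruby
require 'digest'

# files:        paths tracked when the cache was built
# sha256_sums:  path => SHA256 hexdigest at last parse (nil if unreadable then)
def cache_invalid?(files, sha256_sums)
  files.any? do |file|
    current = File.readable?(file) ? Digest::SHA256.hexdigest(File.read(file)) : nil
    # Covers both out-of-band edits (digest differs) and deletion
    # (file was hashed before but is unreadable now: current is nil).
    current != sha256_sums[file]
  end
end
```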
Yes, I think using some hash is a great idea, much better than mtime.
Instead of tracking mtime to determine whether known_hosts needs to be re-parsed, track file sha256 checksums. This avoids cases where mtimes are close enough together that a change goes unnoticed (discovered while writing tests). Based on feedback, also invalidate the cache if a known_hosts file is deleted out-of-band.
force-pushed from deaaee0 to 8ae4097
@nicklewis Did you have any scripts to test performance with this change?
Looked a bit at what gains there would be from using cached known_hosts. I built up a very large known hosts file with the following strategy: take an existing known_hosts entry, save everything but the host name, and build up 100K host entries with it.

Cached

Script for using cached:

```ruby
#!/usr/bin/env ruby
require 'net/ssh'
require 'net/ssh/known_hosts/cached'

options = {
  verify_host_key: Net::SSH::Verifiers::Always.new,
  keys: '/home/cas/.ssh/id_rsa-acceptance',
  known_hosts: Net::SSH::KnownHosts::Cached.new({})
}

(0..100).each do
  Net::SSH.start('tasty-calibre.delivery.puppetlabs.net', 'root', options)
end
```

Example time:

Profile: https://gist.github.com/donoghuc/0e13a59b544935f1b87316cbbc2b4f03#file-cached_hosts-svg

Not cached

Script for non-cached:

```ruby
#!/usr/bin/env ruby
require 'net/ssh'
require 'net/ssh/known_hosts/cached'

options = {
  verify_host_key: Net::SSH::Verifiers::Always.new,
  keys: '/home/cas/.ssh/id_rsa-acceptance'
}

(0..100).each do
  Net::SSH.start('tasty-calibre.delivery.puppetlabs.net', 'root', options)
end
```

Time example:

Profile: https://gist.github.com/donoghuc/0e13a59b544935f1b87316cbbc2b4f03#file-non_cached_hosts-svg

I also noticed that sha256 is a bit slower than md5. Given that collisions seem like they would be extremely rare, could we just use md5 (or some even faster algorithm)? There are not really any security concerns here, just detecting file changes.

sha256

```ruby
#!/usr/bin/env ruby
require 'digest/sha2'

path = File.expand_path('~/.ssh/known_hosts')
(0..1000).each do
  Digest::SHA256.hexdigest(File.read(path))
end
```

md5

```ruby
#!/usr/bin/env ruby
require 'digest/md5'

path = File.expand_path('~/.ssh/known_hosts')
(0..1000).each do
  Digest::MD5.hexdigest(File.read(path))
end
```
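The two digest loops above could also be compared in a single run with the stdlib Benchmark module, which avoids timing the two scripts externally. A sketch using a synthetic in-memory payload rather than a real known_hosts file:

```ruby
#!/usr/bin/env ruby
require 'benchmark'
require 'digest'

# Synthetic stand-in for a large known_hosts file.
data = "example.com ssh-rsa AAAA...\n" * 100_000

md5_time    = Benchmark.realtime { 100.times { Digest::MD5.hexdigest(data) } }
sha256_time = Benchmark.realtime { 100.times { Digest::SHA256.hexdigest(data) } }

puts format('md5: %.3fs  sha256: %.3fs', md5_time, sha256_time)
```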
This commit ensures file paths are expanded, given the defaults contain `~`. It also handles the case where *no* known hosts files are readable by initializing caches to empty hashes in the init method.
How would this work with centralized known_hosts database implementations?
This PR continues the work that @nicklewis did in #686, based on the suggestion from @mfazekas to rebuild the cache when file changes occur out of band. There are two approaches for this in the PR: the first uses file mtime (originally suggested by @mfazekas) and the second uses an md5 hash to determine whether a known_hosts file has changed out of band. While writing tests for the original mtime implementation, I came across a case where a file write happened so close to the time the file was originally read that comparing mtimes did not detect a change.