Add cached KnownHosts implementation - continued... #744

Open · wants to merge 4 commits into base: master

Conversation

@donoghuc (Contributor) commented Mar 1, 2020

This PR continues the work that @nicklewis did in #686, based on the suggestion from @mfazekas to rebuild the cache when file changes occur out of band. There are two approaches to this in the PR: the first uses file mtime (originally suggested by @mfazekas), and the second uses an md5 hash to determine whether a known_hosts file has changed out of band. While writing tests for the original mtime implementation I came across a case where a file write happened so close to the time the file was originally read that comparing mtimes did not detect the change.

This adds a new class Net::SSH::KnownHosts::Cached which loads the
known_hosts files once, builds an index, and uses that when searching
for keys for a given host. An instance of this object can be passed
using the `known_hosts` option in order to avoid loading known_hosts
multiple times when making multiple connections to either the same host
or several hosts.

This behavior is implemented in a new class because it differs from the
existing implementation in the case where the content of the known_hosts
file changes outside the current process. With the standard
implementation, those changes will always be picked up the next time a
connection is made, whereas the Cached implementation won't
automatically incorporate changes made out-of-band.
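
For orientation, a rough sketch of the idea: apart from the Net::SSH::KnownHosts::Cached name, the `known_hosts` option, and the @hmac_lookups variable quoted in the diff below, the names here are assumptions, not the PR's actual code.

class CachedKnownHostsSketch
  def initialize(options = {})
    @options      = options
    @host_lookups = Hash.new { |h, k| h[k] = [] } # "github.com" => [entries]
    @hmac_lookups = Hash.new { |h, k| h[k] = [] } # [hmac, salt] => [entries]
    # each user/global known_hosts file is parsed exactly once here
  end

  # Plain host names come straight from the index; hashed ("|1|...") entries
  # are matched by recomputing the HMAC, as in the diff excerpt below.
  def search_for(host, options = {})
    @host_lookups[host]
  end
end

# An instance is then shared across connections via the known_hosts option:
#   Net::SSH.start(host, user, known_hosts: Net::SSH::KnownHosts::Cached.new({}))
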
@@ -17,12 +24,10 @@ def search_for(host, options={})
@hmac_lookups.each do |(hmac, salt), entry|
digest = OpenSSL::Digest.new('sha1')
host_hmac = OpenSSL::HMAC.digest(digest, salt, hostname)
- entries << entry if Base64.encode64(host_hmac.chomp) == hmac
+ entries.concat(entry) if Base64.encode64(host_hmac).chomp == hmac
@donoghuc (author) commented Mar 1, 2020

This was a bug whereby using << would insert an array rather than merging the entries. Also, the newline that chomp is meant to strip is added by the Base64 encoding, so chomp needs to be called on the encoded content, not on the raw HMAC.
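
To illustrate both points (values are made up for the example):

require 'base64'

matched = ['entry-a', 'entry-b']   # a lookup yields an array of entries

entries = []
entries << matched                 # => [["entry-a", "entry-b"]]  (nested array inserted)

entries = []
entries.concat(matched)            # => ["entry-a", "entry-b"]    (entries merged in)

# Base64.encode64 is what appends the trailing newline, so chomp has to run
# on the encoded string, not on the raw HMAC bytes:
hmac = "\x01\x02"
Base64.encode64(hmac.chomp)        # => "AQI=\n"  (newline still there)
Base64.encode64(hmac).chomp        # => "AQI="    (newline removed)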

file.each_line do |line|
hosts, type, key_content = line.split(' ')
next if hosts.nil? || hosts.start_with?('#')
@donoghuc (author):

Pulls in work from #742
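
For reference, the fields that split produces from a known_hosts line (the key value below is made up):

line = 'github.com,192.30.255.113 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... optional-comment'
hosts, type, key_content = line.split(' ')

hosts        # => "github.com,192.30.255.113"  (comma-separated host patterns, or "|1|salt|hmac" when hashed)
type         # => "ssh-ed25519"
key_content  # => "AAAAC3NzaC1lZDI1NTE5AAAA..."  (base64-encoded public key; any trailing comment is ignored)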

@@ -54,7 +61,7 @@ def parse_known_hosts(source)
hostlist.each do |host|
entry = Entry.new(hostlist, key)
if host.include?('*') || host.include?('?')
regex = regexify_pattern(host)
@donoghuc (author):

I think this was just a typo.
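
For anyone unfamiliar with regexify_pattern, a hedged sketch of what such a helper does (the actual net-ssh implementation may differ): it turns the known_hosts wildcards * and ? into their regexp equivalents so wildcard host entries can be matched.

# Sketch only; not the library's exact code.
def regexify_pattern_sketch(host)
  escaped = Regexp.escape(host)   # "*.example.com" => "\\*\\.example\\.com"
  Regexp.new("\\A#{escaped.gsub('\*', '.*').gsub('\?', '.')}\\z")
end

regexify_pattern_sketch('*.example.com').match?('git.example.com') # => true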

@@ -73,14 +80,17 @@ def parse_known_hosts(source)
end
end

- def add(host, key)
+ def add(host, key, _options = nil)
@donoghuc (author):

I needed to keep support for a third positional parameter for HostKeys#add_host_key, which calls:

@known_hosts.add(@host, key, @options)

assert_equal(0, keys.count)
f.write(File.read(path("known_hosts/github")))
f.close
# Ensure mtime is different
@donoghuc (author):

This is where I found issues with mtimes being so close that the file would not be re-parsed.
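
A minimal illustration of the race (whether it reproduces depends on the filesystem's timestamp resolution):

require 'tempfile'

f = Tempfile.new('known_hosts')
f.write("host-a ssh-ed25519 AAAA...\n")
f.flush
cached_mtime = File.mtime(f.path)   # what the cache remembers

# A write landing within the same clock tick as the read above...
f.write("host-b ssh-ed25519 BBBB...\n")
f.flush

# ...can leave the mtime unchanged, so an mtime check considers the cache valid.
File.mtime(f.path) == cached_mtime  # => can be true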

@codecov-io commented Mar 1, 2020

Codecov Report

Merging #744 into master will increase coverage by 0.37%.
The diff coverage is 99.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #744      +/-   ##
==========================================
+ Coverage   95.58%   95.95%   +0.37%     
==========================================
  Files         153      155       +2     
  Lines        9963    10241     +278     
==========================================
+ Hits         9523     9827     +304     
+ Misses        440      414      -26
Impacted Files Coverage Δ
test/test_known_hosts_cached.rb.rb 100% <100%> (ø)
lib/net/ssh/known_hosts/cached.rb 98.68% <98.68%> (ø)
lib/net/ssh/transport/algorithms.rb 98% <0%> (-0.5%) ⬇️
test/authentication/test_agent.rb 100% <0%> (ø) ⬆️
lib/net/ssh/test/socket.rb 96.42% <0%> (ø) ⬆️
...t/transport/kex/test_diffie_hellman_group1_sha1.rb 100% <0%> (ø) ⬆️
lib/net/ssh/test/channel.rb 81.57% <0%> (ø) ⬆️
lib/net/ssh/test/remote_packet.rb 93.75% <0%> (ø) ⬆️
test/transport/kex/test_ecdh_sha2_nistp256.rb 100% <0%> (ø) ⬆️
lib/net/ssh/test/kex.rb 100% <0%> (ø) ⬆️
... and 36 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54476fa...8ae4097.

return unless File.readable?(source)

File.open(source) do |file|
@md5_sums[source] = Digest::MD5.hexdigest(File.read(file))
Collaborator:

Please use SHA256 instead of MD5.

@donoghuc (author):

This seems like an easy change to make. For now I'd like to see what folks think about using a hash instead of mtime. If we want to move forward with a hash, I think using SHA256 should be just fine.

@donoghuc (author):

Updated to use SHA256.

When the cache is built, store the file modification times. Only re-build the cache when the mtime does not exist or has changed since last read. Similarly with the add method, only rebuild the cache when the write is the only change since the last read.

def cache_invalid?
(@user_files + @global_files).each do |file|
return true if File.readable?(file) && @md5_sums[file] != Digest::MD5.hexdigest(File.read(file))
Collaborator:

I think it also needs to check whether the file has been removed; that should invalidate the cache as well.

@donoghuc (author):

Updated to invalidate on delete.
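
Putting both review points together, a hedged sketch of what the invalidation check might look like after these updates (instance-variable names such as @sha_sums are assumptions, not necessarily the PR's final code):

require 'digest/sha2'

def cache_invalid?
  (@user_files + @global_files).each do |file|
    if File.readable?(file)
      # content changed out of band since the cache was built
      return true if @sha_sums[file] != Digest::SHA256.hexdigest(File.read(file))
    elsif @sha_sums.key?(file)
      # file was present when the cache was built but has since been removed
      return true
    end
  end
  false
end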

@mfazekas (Collaborator) commented Mar 1, 2020

> and the second to use an md5 hash to determine if a known_hosts file has changed out of band.

Yes, I think using some hash is a great idea, much better than mtime!

Instead of tracking mtime to determine whether known_hosts needs to be re-parsed, track file sha256 checksums. This avoids cases where mtimes are close enough together that a change goes unnoticed (discovered while writing tests). Also, based on feedback, invalidate the cache if a known_hosts file is deleted out of band.
@donoghuc (author) commented Mar 2, 2020

@nicklewis Did you have any scripts to test performance with this change?

@donoghuc (author) commented Mar 8, 2020

Looked a bit at what gains there would be from using cached known_hosts. I built up a very large known_hosts file with the following strategy: take an existing known_hosts entry and save everything but the host name, generate 100K host entries of the form www.foo_[1..100k],1.2.3.4 ssh-ed25519 ..., then add the actual "good" known_hosts entry as the last line (a rough generator sketch follows).
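
A rough sketch of that generator (the output path, the placeholder key, and the real_entry value are assumptions; the real entry would be pasted in from an existing known_hosts file):

#!/usr/bin/env ruby
# Sketch of the benchmark fixture described above.
fake_key   = 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...placeholder'
real_entry = 'tasty-calibre.delivery.puppetlabs.net,10.0.0.1 ssh-ed25519 AAAA...real-key-here'

File.open(File.expand_path('~/.ssh/known_hosts'), 'w') do |f|
  (1..100_000).each { |i| f.puts "www.foo_#{i},1.2.3.4 #{fake_key}" }
  f.puts real_entry  # the one "good" entry, last so a naive scan reads everything
end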

Cached

Test script for using cached:

#!/usr/bin/env ruby

require 'net/ssh'
require 'net/ssh/known_hosts/cached'
options = {
  verify_host_key: Net::SSH::Verifiers::Always.new,
  keys: '/home/cas/.ssh/id_rsa-acceptance',
  known_hosts: Net::SSH::KnownHosts::Cached.new({})
}

(0..100).each do 
  Net::SSH.start('tasty-calibre.delivery.puppetlabs.net', 'root', options)
end

Example time:

real	0m42.614s
user	0m8.878s
sys	0m1.049s

Profile: https://gist.github.com/donoghuc/0e13a59b544935f1b87316cbbc2b4f03#file-cached_hosts-svg

Not cached

Script for non-cached:

#!/usr/bin/env ruby

require 'net/ssh'
require 'net/ssh/known_hosts/cached'
options = {
  verify_host_key: Net::SSH::Verifiers::Always.new,
  keys: '/home/cas/.ssh/id_rsa-acceptance'
}

(0..100).each do 
  Net::SSH.start('tasty-calibre.delivery.puppetlabs.net', 'root', options)
end

Time example:

real	0m55.168s
user	0m21.868s
sys	0m0.446s

Profile: https://gist.github.com/donoghuc/0e13a59b544935f1b87316cbbc2b4f03#file-non_cached_hosts-svg

I also noticed that SHA256 is a bit slower than MD5. Given that collisions seem like they would be extremely rare, could we just use MD5 (or some even faster algorithm)? There are not really any security concerns here, just detecting file changes.

sha256

#!/usr/bin/env ruby
require 'digest/sha2'
path = File.expand_path('~/.ssh/known_hosts')
(0..1000).each do 
  Digest::SHA256.hexdigest(File.read(path))
end

Example time:

real	0m33.535s
user	0m29.791s
sys	0m3.723s

md5

#!/usr/bin/env ruby
require 'digest/md5'
path = File.expand_path('~/.ssh/known_hosts')
(0..1000).each do 
  Digest::MD5.hexdigest(File.read(path))
end

Example time:

real	0m22.207s
user	0m18.901s
sys	0m3.307s

This commit ensures file paths are expanded, given the defaults contain `~`. It also handles the case where *no* known_hosts files are readable by initializing caches to empty hashes in the init method.
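
A hedged sketch of those two fixes (the option names match Net::SSH's user_known_hosts_file / global_known_hosts_file options, but the instance-variable names and defaults shown here are assumptions):

def initialize(options = {})
  user_defaults   = %w[~/.ssh/known_hosts ~/.ssh/known_hosts2]
  global_defaults = %w[/etc/ssh/ssh_known_hosts /etc/ssh/ssh_known_hosts2]

  # Expand the paths up front, since the defaults contain '~'.
  @user_files   = Array(options[:user_known_hosts_file] || user_defaults).map { |f| File.expand_path(f) }
  @global_files = Array(options[:global_known_hosts_file] || global_defaults)

  # Start every cache as an empty hash so lookups stay safe even when no
  # known_hosts file is readable.
  @sha_sums     = {}
  @host_lookups = Hash.new { |h, k| h[k] = [] }
  @hmac_lookups = Hash.new { |h, k| h[k] = [] }
end
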
@logicminds

How would this work with centralized known_hosts database implementations?

e.g. https://gitlab.com/nwops/net-ssh-exthosts/-/tree/master
