
feat(core) upstreams #1735

Merged: Tieske merged 81 commits into next from feature/upstreams2 on Dec 28, 2016
Conversation

@Tieske (Member) commented Oct 11, 2016

Summary

Implements the upstreams feature as discussed in #157.

Full changelog

  • adds upstream entity (a load-balanced upstream, identified by a unique virtual hostname)
  • adds target entity (an upstream target, identified by a name/IP + port combination)

Issues resolved

Implements #157

Todos

  • implement cluster events for upstream and target entities
  • implement consistent hashing algorithm
  • implement dns refresh mechanism (for failed queries, currently at 0 slots)
  • implement ttl=0 logic
  • release the feature/balancer branch of the dns.lua module
  • add balancer integration tests
  • implement millisecond precision for timestamps

Usage

To create an upstream, use the following request:

http POST localhost:8001/upstreams \
   name=service-xyz-v1

The upstream name can now be used in an API, where the hostname is the newly created upstream name:

upstream_url=http://service-xyz-v1/my/api
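
For example, an API pointing at the upstream could be created like this (a sketch using the 0.9-era Admin API fields; my-api and the request path are placeholders, and the exact fields may differ on the next branch):

http POST localhost:8001/apis \
   name=my-api \
   request_path=/my/api \
   upstream_url=http://service-xyz-v1/my/api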

Targets can now be added to the upstream:

http POST localhost:8001/upstreams/service-xyz-v1/targets \
   target=service.v1.host1:123 \
   weight=10

The target can be an IP address or a name. A name will be resolved, and all entries will be added with the same weight. So if, in the example above, service.v1.host1 resolves to an A record containing the IP addresses 1.2.3.4 and 5.6.7.8, then both will be added with weight 10. When a record expires (per its ttl), it will be refreshed upon the first access after expiring.

When the name resolves to an SRV record, both the specified port and weight are ignored; the information from the SRV record is used instead. For SRV records, only the primary servers are used (the entries with the lowest priority value).
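
For illustration, given a hypothetical SRV response for service.v1.host1 such as (priority, weight, port, target):

service.v1.host1.  60  IN  SRV  0 10 8000 host-a.example.com.
service.v1.host1.  60  IN  SRV  0 20 8001 host-b.example.com.
service.v1.host1.  60  IN  SRV  1 10 8002 host-c.example.com.

only host-a and host-b (priority 0) would be used, with ports 8000/8001 and weights 10/20 taken from the record; host-c (priority 1) would be ignored.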

A dns record with a ttl setting of 0 will be added as a single target (even if it resolves to multiple IP addresses) and will be resolved upon each request.

To remove a target, issue another POST request setting weight=0, as shown below.
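
For example, disabling the target added above:

http POST localhost:8001/upstreams/service-xyz-v1/targets \
   target=service.v1.host1:123 \
   weight=0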

Tieske and others added 28 commits (September 1, 2016 14:37)

  • patch the global tcp.connect function to use the internal dns resolver
  • …rs an empty table if nothing is specified
    instead of the previous `nil`. So if the table length is 0, drop it and revert to defaults.
    Using postgres it won't start, complaining no options table was provided, yet the debug line shows it does.
    Using cassandra the tcp override debug lines don't show --> cassandra bypasses those overrides. How??
@Tieske added the pr/wip label (a work in progress PR opened to receive feedback) on Oct 11, 2016
@Tieske added this to the 0.10 RC milestone on Oct 11, 2016
@thibaultcha (Member) left a comment:

First round of reviewing! Left a few "nitpick" comments that you are free to ignore, but which we should debate further in the future.

However, a couple of changes are quite important to me, including the log levels mentioned a few times, the format of the log messages, some variable names, and the good practices for hot code paths :)

PS: we really need to agree on a readable and strictly applied code style.

end,

POST = function(self, dao_factory, helpers)
local cleanup_factor = 10 -- when to cleanup; invalid-entries > (valid-ones * cleanup_factor)
Member:

nitpick: in your comments, it is very difficult to understand what you mean at first glance because you use semicolons (;) instead of colons (:).

Semicolons are meant to separate two independent clauses in a sentence. Colons, on the other hand, are used to present an explanation after what could stand as an independent clause. That is, the latter should be used when you make a statement and present the explanation for that statement, not the former.

I know this can sound like a nitpick, but believe me: many times I have found myself re-reading your comments because I did not understand that the second clause was the explanation for the first one.
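
For instance, the comment on the quoted line would then read:

local cleanup_factor = 10 -- when to cleanup: invalid-entries > (valid-ones * cleanup_factor)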

delete[#delete+1] = entry
end
end
end
Member:

style/nitpick: feel free to use more space to make the code more readable. Line breaks, for example, could be inserted in many places in this snippet. A good practice I have recently grown fond of is inserting a blank line before elseif and else statements as well. It makes the whole code greatly more readable.

Member:

At some point, I believe we should formalize the code-style used in this project.

-- either nothing left, or when 10x more outdated than active entries
if (#cleaned == 0 and #delete > 0) or
(#delete >= (math.max(#cleaned,1)*cleanup_factor)) then
ngx.log(ngx.WARN, "Starting cleanup of target table for upstream "..tostring(self.params.upstream_id))
Member:

  • should be logged at the INFO log level.
  • better to respect good practices and use the variadic-arguments form of ngx.log: use , instead of .. to avoid Lua-land string concatenations.
  • avoid lines longer than 80 chars
  • prefix the message with the component logging it: [admin API] (see the sketch below)
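
Applied to the quoted line, that would give something like (a sketch, not final wording):

ngx.log(ngx.INFO, "[admin API] starting cleanup of target table for upstream ",
        tostring(self.params.upstream_id))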

-- do we need to cleanup?
-- either nothing left, or when 10x more outdated than active entries
if (#cleaned == 0 and #delete > 0) or
(#delete >= (math.max(#cleaned,1)*cleanup_factor)) then
Member:

style: could use some spacing as well...
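
One possible layout:

-- do we need to cleanup?
-- either nothing left, or when 10x more outdated than active entries
if (#cleaned == 0 and #delete > 0)
or (#delete >= math.max(#cleaned, 1) * cleanup_factor) then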

-- in case another kong-node does the same cleanup simultaneously
cnt = cnt + 1
end
ngx.log(ngx.WARN, "Finished cleanup of target table for upstream "..
Member:

ditto:

  • prefix message with component
  • INFO log level
  • avoid Lua-land string concatenations

config.orderlist = t
end
return true
end,
Member:

Ok. I'll take your word for this part 😅

Member Author:

added some comments to clarify

return responses.send_HTTP_INTERNAL_SERVER_ERROR()
return responses.send_HTTP_INTERNAL_SERVER_ERROR("failed to retry the "..
"dns/balancer resolver for '"..addr.upstream.host..
"' with; "..tostring(err))
Member:

Use a colon instead of a semicolon.

valid, errors, check = validate_entity({ name = "valid.host.name" }, upstreams_schema)
assert.True(valid)
assert.Nil(errors)
assert.Nil(check)
Member:

As mentioned multiple times before: we have always been using the is_ form of luassert:

  • assert.is_table
  • assert.is_string
  • assert.is_true
  • assert.is_nil

Capital letters are confusing and carry less meaning than the is_ prefix.
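
The assertions quoted above would then become:

assert.is_true(valid)
assert.is_nil(errors)
assert.is_nil(check)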

valid, errors, check = validate_entity(data, upstreams_schema)
assert.False(valid)
assert.Nil(errors)
assert.are.equal("invalid orderlist",check.message)
Member:

Each of those use cases should be a different test, but that's OK.

valid, errors, check = validate_entity(data, targets_schema)
assert.True(valid)
assert.is_nil(errors)
assert.is_nil(check)
Member:

This one is using the is_nil form, as we should. Better to stay consistent: another reason to switch to the is_ form altogether.

@thibaultcha (Member):

A dns record with a ttl setting of 0 will be added as a single target (even if it resolves to multiple ip addresses) and will be resolved upon each request.

👍

Question: based on our previous discussion, I think this was the correct decision. Do you have any side effects in mind about this behavior?


describe("Ring-balancer #"..kong_config.database, function()

local config_db
Member:

lint: It appears this variable is unused and is causing the linting to fail.

Member Author:

was fixed

@Tieske (Member Author) commented Dec 22, 2016

A dns record with a ttl setting of 0 will be added as a single target (even if it resolves to multiple ip addresses) and will be resolved upon each request.

👍
Question: based on our previous discussion, I think this was the correct decision. Do you have any side effects in mind about this behavior?

This relates more to the underlying dns resolution than specifically to this upstream/target implementation. But the effects are as discussed before: resource usage under load (sockets) and additional latency. So imo resolving upon each request is theoretically correct, but doesn't bring anything to the table in a practical sense.

@subnetmarco (Member):

I can add multiple identical targets, which should not be allowed, i.e. the target with host 127.0.0.1:3001:

{
	"data": [{
		"weight": 100,
		"id": "ef33c439-0fb7-409f-a7c1-f3e4fb78edf5",
		"target": "127.0.0.1:3001",
		"created_at": 1482445955029,
		"upstream_id": "11920b28-24cb-4fc7-bad7-8124d886fbd6"
	}, {
		"weight": 100,
		"id": "9041af17-7d83-42d5-a04b-3fec76b39512",
		"target": "127.0.0.1:3000",
		"created_at": 1482445937573,
		"upstream_id": "11920b28-24cb-4fc7-bad7-8124d886fbd6"
	}, {
		"weight": 100,
		"id": "59a7f050-f9cb-4cd6-a9f4-399387b05bdc",
		"target": "127.0.0.1:3001",
		"created_at": 1482445939323,
		"upstream_id": "11920b28-24cb-4fc7-bad7-8124d886fbd6"
	}],
	"total": 3
}

@subnetmarco (Member) commented Dec 22, 2016

Also, I get an error after properly configuring an upstream with targets and then making a request to my API:

2016/12/22 14:34:08 [error] 45731#0: *3840 lua entry thread aborted: runtime error: ./kong/core/balancer.lua:186: attempt to index local 'balancer' (a nil value)
stack traceback:
coroutine 0:
	./kong/core/balancer.lua: in function 'get_balancer'
	./kong/core/balancer.lua:261: in function 'balancer_execute'
	./kong/core/handler.lua:60: in function 'before'
	./kong/kong.lua:232: in function 'access'
	access_by_lua(nginx-kong.conf:70):2: in function <access_by_lua(nginx-kong.conf:70):1>, client: 127.0.0.1, server: kong, request: "GET /test HTTP/1.1", host: "127.0.0.1:8000"

@subnetmarco (Member) commented Dec 22, 2016

When DNS cannot resolve the upstream, the error returned should be 502, not 500:

2016/12/22 15:01:16 [error] 45726#0: *17684 [lua] responses.lua:101: before(): failed the initial dns/balancer resolve for 'mashape' with: dns server error; 3 name error, client: 127.0.0.1, server: kong, request: "GET /test2 HTTP/1.1", host: "localhost:8000"

Should be 502 Bad Gateway (or 503?).

@subnetmarco (Member):

Also this happens when no targets have been created:

2016/12/22 15:05:08 [error] 45725#0: *19659 lua entry thread aborted: runtime error: ./kong/core/balancer.lua:190: attempt to index a nil value
stack traceback:
coroutine 0:
	./kong/core/balancer.lua: in function 'get_balancer'
	./kong/core/balancer.lua:261: in function 'balancer_execute'
	./kong/core/handler.lua:60: in function 'before'
	./kong/kong.lua:232: in function 'access'
	access_by_lua(nginx-kong.conf:70):2: in function <access_by_lua(nginx-kong.conf:70):1>, client: 127.0.0.1, server: kong, request: "GET /test2 HTTP/1.1", host: "localhost:8000"

@Tieske (Member Author) commented Dec 23, 2016

I can add multiple identical targets, which should not be allowed, i.e. the target with host 127.0.0.1:3001

This should be allowed. The targets are not a list of active targets, but a history of changes. So we're not updating an entry, but adding a new entry to the history of the upstream. Hence you cannot delete an entry; you can only disable it by adding a new entry with weight=0. In the list above, the most recent entry for 127.0.0.1:3001 (created_at 1482445939323) is the one in effect.

When the DNS cannot find the upstream, the error returned should be 502 but not 500:

2016/12/22 15:01:16 [error] 45726#0: *17684 [lua] responses.lua:101: before(): failed the initial dns/balancer resolve for 'mashape' with: dns server error; 3 name error, client: 127.0.0.1, server: kong, request: "GET /test2 HTTP/1.1", host: "localhost:8000"

Should be 502 Bad Gateway.

This is not related to the upstream, but to the dns resolver. Currently the dns resolver throws an error if it fails to resolve a name. I'll create a separate issue for this.

Also this happens when no targets have been created:

2016/12/22 15:05:08 [error] 45725#0: *19659 lua entry thread aborted: runtime error: ./kong/core/balancer.lua:190: attempt to index a nil value
stack traceback:
coroutine 0:
./kong/core/balancer.lua: in function 'get_balancer'
./kong/core/balancer.lua:261: in function 'balancer_execute'
./kong/core/handler.lua:60: in function 'before'
./kong/kong.lua:232: in function 'access'
access_by_lua(nginx-kong.conf:70):2: in function <access_by_lua(nginx-kong.conf:70):1>, client: 127.0.0.1, server: kong, request: "GET /test2 HTTP/1.1", host: "localhost:8000"

This was an issue with the in-memory cache updating only a single worker, instead of all workers. Running tests on the fix right now.

Tieske and others added 2 commits (December 23, 2016 16:07)

  • Reloading an upstream read the upstream and invalidated the balancer, but that code only ran in a single worker. Invalidating the balancer (and recreating it) should be done in every worker.
@subnetmarco (Member):

It now seems to be working better, but this error still appears in my logs:

2016/12/23 10:57:14 [error] 69798#0: *733 lua entry thread aborted: runtime error: ./kong/core/balancer.lua:197: attempt to index a nil value
stack traceback:
coroutine 0:
	./kong/core/balancer.lua: in function 'get_balancer'
	./kong/core/balancer.lua:268: in function 'balancer_execute'
	./kong/core/handler.lua:60: in function 'before'
	./kong/kong.lua:232: in function 'access'
	access_by_lua(nginx-kong.conf:80):2: in function <access_by_lua(nginx-kong.conf:80):1>, client: 127.0.0.1, server: kong, request: "GET /test HTTP/1.1", host: "127.0.0.1:8000"

@Tieske (Member Author) commented Dec 23, 2016

fixed.

@Tieske merged commit 0a18ac2 into next on Dec 28, 2016
@Tieske deleted the feature/upstreams2 branch on December 28, 2016 16:45
thibaultcha pushed a commit that referenced this pull request Jan 12, 2017
* adds loadbalancing on specified targets
* adds service registry
* implements #157
* adds entities: upstreams and targets
* modifies timestamps to millisecond precision (except for the non-related tables when using postgres)
* adds collecting health-data on a per-request basis (unused for now)