
fix(CubeProxy): initialize random seed in worker phase #138

Merged: kinwin-ustc merged 1 commit into TencentCloud:master from novahe:fix/random-seed on May 9, 2026

Conversation

@novahe (Contributor) commented May 4, 2026

Summary

Seed the random number generator for each worker to ensure that cache TTL jitter (math.random) works correctly.

For example, in our cubeProxy logic, we use math.random to calculate the cache TTL jitter:

-- Pick a jittered cache TTL between the configured bounds;
-- timeout_min and timeout_max are nginx variables.
local function get_cache_timeout()
    return math.random(tonumber(ngx.var.timeout_min), tonumber(ngx.var.timeout_max))
end

(Reference: rewrite_phase.lua#L20-L22)
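
For context, the jittered value is then used as the expiration time when the response is written to the shared cache. Here is a minimal sketch of such a call site; the dict name and key derivation are illustrative placeholders, not the actual CubeProxy code:

-- Hypothetical call site; "cube_cache" stands in for the real shared dict,
-- assumed to be declared via lua_shared_dict in nginx.conf.
local cache = ngx.shared.cube_cache

local function cache_response(key, body)
    local ttl = get_cache_timeout()  -- each worker should draw a different jitter here
    local ok, err = cache:set(key, body, ttl)
    if not ok then
        ngx.log(ngx.ERR, "failed to cache response: ", err)
    end
end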

Without a unique seed per worker, all OpenResty workers initialize with the exact same default internal state. If multiple workers receive concurrent requests and calculate this timeout, they will all call math.random() and receive the exact same value. This causes all workers to set identical expiration times in the shared cache, leading to a synchronized stampede to the backend when that specific time is reached, completely defeating the purpose of adding jitter.

Here is a simple standalone demo (runnable with plain lua or luajit, no nginx required) illustrating how the lack of unique seeds synchronizes behavior across workers:

-- Simulate behavior in init_worker_phase.lua
-- Scenario: Workers starting simultaneously to handle requests

-- [Case A]: No fix (All Workers share the same seed, default is 1)
local function worker_without_fix(id)
    math.randomseed(1) -- OpenResty default behavior, all workers start with the same state
    local jitter = math.random(1, 100)
    print(string.format("Worker %d (No Fix) - Calculated Jitter: %d", id, jitter))
end

-- [Case B]: With fix (Each Worker has a unique seed)
local function worker_with_fix(id)
    -- Fix: Use timestamp + Worker ID to ensure unique seeds
    math.randomseed(os.time() + id) 
    local jitter = math.random(1, 100)
    print(string.format("Worker %d (Fixed)  - Calculated Jitter: %d", id, jitter))
end

print("--- Testing Synchronized Behavior ---")
for i = 1, 3 do worker_without_fix(i) end

print("\n--- Testing Distributed Behavior ---")
for i = 1, 3 do worker_with_fix(i) end

Output:

--- Testing Synchronized Behavior ---
Worker 1 (No Fix) - Calculated Jitter: 82
Worker 2 (No Fix) - Calculated Jitter: 82
Worker 3 (No Fix) - Calculated Jitter: 82  <-- All random values are identical!

--- Testing Distributed Behavior ---
Worker 1 (Fixed)  - Calculated Jitter: 45
Worker 2 (Fixed)  - Calculated Jitter: 12
Worker 3 (Fixed)  - Calculated Jitter: 98  <-- Truly distributed randomness

By uniquely seeding each worker based on its ngx.worker.id() and the current time, we ensure the random distribution works as intended across the entire proxy layer, keeping backend refreshes safely distributed.
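
For reference, the fix added in this PR is the single seeding line shown in the review thread below, placed in CubeProxy/lua/init_worker_phase.lua (presumably loaded via init_worker_by_lua_file; the wiring is not shown in this PR):

-- CubeProxy/lua/init_worker_phase.lua (as added in this PR)
-- Seed the random number generator for each worker to ensure
-- that cache TTL jitter (math.random) works correctly.
math.randomseed(ngx.now() * 1000 + ngx.worker.id())

One subtlety: ngx.now() returns nginx's cached timestamp, so workers forked in the same event-loop tick can observe an identical value; the ngx.worker.id() term (0, 1, 2, ...) is what guarantees distinct seeds in that case.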

Verification with Real-world Data (from dev environment)

To further validate this, I performed an end-to-end test in a live environment with multiple workers handling concurrent requests.
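
The log lines below presumably came from instrumentation along these lines (a hypothetical sketch; the actual log call in the dev branch is not shown in this PR):

-- Hypothetical logging added around the jitter calculation.
local timeout = get_cache_timeout()
ngx.log(ngx.NOTICE, "worker=", ngx.worker.id(), " timeout=", timeout)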

Case 1: Without Fix (Seeding commented out)

As expected, different workers generated identical timeout sequences, which would trigger a synchronized stampede:

2026/05/07 01:21:19 [notice] worker=6  timeout=618
2026/05/07 01:21:19 [notice] worker=9  timeout=618  <-- COLLISION!
2026/05/07 01:21:19 [notice] worker=6  timeout=516
2026/05/07 01:21:08 [notice] worker=0  timeout=516  <-- COLLISION!

Case 2: With Fix (Unique seeding applied)

Each worker now generates its own unique, distributed timeout value even when requests are handled simultaneously:

2026/05/07 01:19:31 [notice] worker=0  timeout=629
2026/05/07 01:19:31 [notice] worker=3  timeout=661
2026/05/07 01:19:31 [notice] worker=9  timeout=657
2026/05/07 01:19:31 [notice] worker=10 timeout=506

This confirms that the fix successfully decouples the workers' random state, ensuring that the cache jitter works as intended across the entire proxy layer.

ref: #135 (comment)

Comment thread on CubeProxy/lua/init_worker_phase.lua (Outdated), on lines +1 to +3:
-- Seed the random number generator for each worker to ensure
-- that cache TTL jitter (math.random) works correctly.
math.randomseed(ngx.now() * 1000 + ngx.worker.id())
Collaborator:

Please elaborate more on this change in the commit message. I still don't get the point of this change.

Contributor Author:

done

Collaborator:

> This is essential for features like cache TTL jitter to work as intended and avoid synchronized cache expiration stampedes.

The cache is shared by all workers. How does this change help with the issue described above?

Contributor Author:

I've updated the above.

In OpenResty, all worker processes inherit the same state from the
master process. Without explicitly seeding the random number generator
in the init_worker phase, each worker starts with the same default
seed. This results in math.random() producing the exact same sequence
of numbers across all workers.

Seeding with (ngx.now() * 1000 + ngx.worker.id()) ensures that each
worker has a unique, time-varying seed. This is essential for features
like cache TTL jitter to work as intended and avoid synchronized
cache expiration stampedes.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe force-pushed the fix/random-seed branch from ba71c20 to 0b5f058 on May 6, 2026 05:04
@novahe requested a review from chenhengqi on May 6, 2026 06:39
@chenhengqi (Collaborator):

The change itself looks good, but I still doubt the rationale.

What's the real issue in the following scenario?

Worker 1 (No Fix) - Calculated Jitter: 82 -> for sandbox A on node 1
Worker 2 (No Fix) - Calculated Jitter: 82 -> for sandbox B on node 2
Worker 3 (No Fix) - Calculated Jitter: 82 -> for sandbox C on node 3

@kinwin-ustc merged commit f746f8e into TencentCloud:master on May 9, 2026
2 of 8 checks passed