Skip to content

CI: Add behavioral validation tests for topology, problems, and operations#91

Merged
renecannao merged 9 commits intomasterfrom
issue88-90-behavioral-tests
Apr 9, 2026
Merged

CI: Add behavioral validation tests for topology, problems, and operations#91
renecannao merged 9 commits intomasterfrom
issue88-90-behavioral-tests

Conversation

@renecannao
Copy link
Copy Markdown

@renecannao renecannao commented Apr 9, 2026

Summary

The existing functional tests are mostly HTTP status code checks ("does the endpoint return 200?"). This PR adds real behavioral tests that verify orchestrator actually works correctly.

New test scripts

test-topology-discovery.sh (#88) — Cross-references orchestrator's API data against direct MySQL queries:

  • Correct instance count (3)
  • Master identified correctly (read_only=false, no upstream)
  • Replicas identified correctly (read_only=true, correct master, replication running)
  • GTID tracking active
  • Instance freshness (LastChecked recent)
  • Uptime detection

test-problem-detection.sh (#89) — Induces problems and verifies detection:

  • Stops replication → verifies orchestrator detects it → restarts → verifies it clears
  • Sets replica writable → verifies detection → restores read_only → verifies cleared
  • Injects errant GTID → verifies GtidErrant field populated

test-topology-operations.sh (#90) — Tests actual refactoring operations:

  • relocate: move mysql3 under mysql2, verify via API and direct MySQL
  • move-up: move mysql3 back under mysql1
  • repoint: GTID-based repoint to different master
  • set-read-only / set-writeable: toggle read_only via API

Depends on #85
Closes #88, closes #89, closes #90

Test plan

  • All 3 test scripts pass on MySQL 8.4
  • Tests run in the MySQL version matrix
  • Topology operations correctly modify replication and orchestrator detects the changes

Summary by CodeRabbit

  • Tests

    • Added functional tests for replication topology discovery, instance state validation, problem detection/clearing, topology refactoring (relocate/move-up/repoint) and read-only management.
    • Improved test reliability with polling, validation, and cleanup steps.
  • Chores

    • Extended CI workflow to run the new functional test steps as part of MySQL job.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 9, 2026

Warning

Rate limit exceeded

@renecannao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 20 minutes and 11 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 20 minutes and 11 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51052a23-1f43-41b7-88ee-be2de97b33fe

📥 Commits

Reviewing files that changed from the base of the PR and between 47988d9 and cd02379.

📒 Files selected for processing (1)
  • tests/functional/test-topology-operations.sh
📝 Walkthrough

Walkthrough

Three new functional test scripts were added to validate topology discovery, problem detection, and topology operations. The CI workflow was updated to run these tests between smoke and regression stages. The shared test helper mysql_source_host() parsing was adjusted, and an existing test (test-named-channels.sh) gained extra validation and cleanup.

Changes

Cohort / File(s) Summary
CI Workflow
.github/workflows/functional.yml
Inserted three new functional test steps — tests/functional/test-topology-discovery.sh, tests/functional/test-problem-detection.sh, tests/functional/test-topology-operations.sh — into the MySQL workflow job between smoke and regression tests.
Test Helpers
tests/functional/lib.sh
Modified mysql_source_host() to use SHOW ... STATUS without \G and parse tab-separated output via awk -F'\t' '{print $2; exit}' for replication source extraction.
Existing Test
tests/functional/test-named-channels.sh
Added pre-check on mysql2 for expected data before configuring named channel, replaced fixed sleep with a polling loop (up to 15s) and improved failure messaging; added cleanup step to drop extra_db.
New Tests
tests/functional/test-topology-discovery.sh, tests/functional/test-problem-detection.sh, tests/functional/test-topology-operations.sh
Added three executable Bash functional tests: discovery (validates instance counts, master/replica identification, GTID, freshness, uptime), problem detection (stops/starts replication, detects/clears problems, read_only mismatch, errant GTIDs), and topology operations (relocate, move-up, repoint, read-only toggling) with polling-driven assertions and recovery steps.

Sequence Diagram(s)

sequenceDiagram
    participant TestRunner as Test Runner
    participant Orchestrator as Orchestrator API
    participant MySQL1 as mysql1 (master)
    participant MySQL2 as mysql2 (replica)
    participant MySQL3 as mysql3 (replica)

    TestRunner->>Orchestrator: wait for /api/cluster and /api/instance readiness
    TestRunner->>Orchestrator: POST /api/discover/mysql1/3306 (discover topology)
    Orchestrator->>MySQL1: query instance status, GTID, read_only, uptime
    Orchestrator->>MySQL2: query instance status, replication status
    Orchestrator->>MySQL3: query instance status, replication status
    Orchestrator-->>TestRunner: cluster JSON (instances, master/replica info)

    alt Problem detection test
        TestRunner->>MySQL2: execute SQL to stop replication
        TestRunner->>Orchestrator: POST /api/discover/mysql2/3306
        Orchestrator->>MySQL2: detect replication stopped
        Orchestrator-->>TestRunner: /api/problems includes mysql2 problem
        TestRunner->>MySQL2: restart replication
        TestRunner->>Orchestrator: POST /api/discover/mysql2/3306
    end

    alt Topology operations test
        TestRunner->>Orchestrator: POST /api/relocate/mysql3
        Orchestrator->>MySQL3: orchestrate move (exec SQL or GTID operations)
        Orchestrator-->>TestRunner: operation result
        TestRunner->>MySQL3: query @@read_only and replication source to verify
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰
I hopped through logs and chased the thread—
Discovery, problems, operations fed.
With polls and checks I nibbled each clue,
Topology tidy, tests passing too! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main change: adding three behavioral validation tests for topology, problems, and operations, which is directly supported by all file summaries.
Linked Issues check ✅ Passed All three linked issues (#88, #89, #90) are met: topology discovery validation (test-topology-discovery.sh), replication problem detection (test-problem-detection.sh), and topology refactoring operations (test-topology-operations.sh) are implemented with required test cases and cross-validation.
Out of Scope Changes check ✅ Passed All changes are directly related to the PR objectives: workflow configuration changes to run the three tests, helper function updates for parsing reliability, and the three new test scripts with supporting modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue88-90-behavioral-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces three new functional test scripts to validate orchestrator's problem detection, topology discovery, and topology refactoring operations. My review identified a bug in the errant GTID test cleanup where the database drop operation would fail due to super_read_only being enabled, and I have provided a correction. Additionally, I recommended replacing fixed sleep intervals with polling loops for better test reliability, suggested consolidating redundant Python parsing logic, and noted a discrepancy between a comment and the actual implementation regarding timeout thresholds.

Comment on lines +191 to +195
$COMPOSE exec -T mysql3 mysql -uroot -ptestpass -e "
SET GLOBAL read_only=0;
DROP DATABASE IF EXISTS errant_detect_test;
SET GLOBAL read_only=1;
" 2>/dev/null
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The cleanup for the errant GTID test will fail silently. After the injection at lines 158-164, super_read_only is set to 1. The cleanup attempts to DROP DATABASE, which is a write operation that will be blocked by super_read_only=1. The error is suppressed by 2>/dev/null.

The cleanup needs to disable super_read_only to be able to drop the database and then restore it.

Suggested change
$COMPOSE exec -T mysql3 mysql -uroot -ptestpass -e "
SET GLOBAL read_only=0;
DROP DATABASE IF EXISTS errant_detect_test;
SET GLOBAL read_only=1;
" 2>/dev/null
$COMPOSE exec -T mysql3 mysql -uroot -ptestpass -e "
SET GLOBAL read_only=0;
SET GLOBAL super_read_only=0;
DROP DATABASE IF EXISTS errant_detect_test;
SET GLOBAL read_only=1;
SET GLOBAL super_read_only=1;
" 2>/dev/null

echo "Setting mysql2 writeable via API..."
RESULT=$(curl -s --max-time 10 "$ORC_URL/api/set-writeable/mysql2/3306" 2>/dev/null)

sleep 2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a fixed sleep (here and on line 158) can make tests flaky if the operation takes longer than expected. A more robust approach is to poll for the expected read_only status in a loop with a timeout.

For example:

SUCCESS=false
for i in $(seq 1 10); do
    # check condition
    if [ ... ]; then
        SUCCESS=true
        break
    fi
    sleep 1
done
# assert SUCCESS is true

Comment on lines +80 to +88
HAS_MYSQL2=$(echo "$PROBLEMS" | python3 -c "
import json, sys
problems = json.load(sys.stdin)
for p in problems:
h = p.get('Key', {}).get('Hostname', '')
if 'mysql2' in h:
sys.exit(0)
sys.exit(1)
" 2>/dev/null && echo "yes" || echo "no")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The python script used here to check for problems on mysql2 is identical to the one used in the first test (lines 30-38). This duplication should be avoided to improve maintainability. Consider creating a reusable shell function that takes the instance name and problem list as arguments.

Comment on lines +42 to +43
ORC_MASTER_HOST=$(echo "$MASTER_INFO" | python3 -c "import json,sys; print(json.load(sys.stdin)['Key']['Hostname'])" 2>/dev/null || echo "")
ORC_MASTER_RO=$(echo "$MASTER_INFO" | python3 -c "import json,sys; print(json.load(sys.stdin)['ReadOnly'])" 2>/dev/null || echo "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These two commands parse the same JSON string ($MASTER_INFO) in two separate python3 processes. This is inefficient. Consider parsing the JSON once and extracting both values in a single python3 call. You could output them separated by a delimiter and use cut or read to assign them to variables.

echo ""
echo "--- Test 5: Instance freshness ---"

# Verify LastChecked is recent (within last 60 seconds)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment states the check is for the "last 60 seconds", but the implementation at line 193 uses a threshold of 120 seconds. The comment should be updated to match the code to avoid confusion.

Suggested change
# Verify LastChecked is recent (within last 60 seconds)
# Verify LastChecked is recent (within last 120 seconds)

# Use move-up to move mysql3 from under mysql2 back to mysql1
echo "Moving mysql3 up via API..."
RESULT=$(curl -s --max-time 30 "$ORC_URL/api/move-up/mysql3/3306" 2>/dev/null)
CODE=$(echo "$RESULT" | python3 -c "import json,sys; print(json.load(sys.stdin).get('Code',''))" 2>/dev/null || echo "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line is identical to line 59 and line 118. To improve maintainability and reduce duplication, consider creating a small helper function to extract the Code field from the API JSON response.

For example:

get_api_code() {
    echo "$1" | python3 -c "import json,sys; print(json.load(sys.stdin).get('Code',''))" 2>/dev/null || echo ""
}

# Usage
CODE=$(get_api_code "$RESULT")

@renecannao renecannao force-pushed the issue80-81-mysql-matrix-channels branch 2 times, most recently from e81056f to fa42679 Compare April 9, 2026 07:33
…89, #90)

Add three new test scripts that verify orchestrator actually works
correctly, not just that API endpoints return 200:

- test-topology-discovery.sh: validates orchestrator's view of the
  topology matches actual MySQL state (master/replica roles, read_only,
  GTID tracking, replication threads, instance freshness)

- test-problem-detection.sh: verifies orchestrator detects and clears
  replication problems (stopped replication, writable replica, errant
  GTID) by inducing problems and checking the API response

- test-topology-operations.sh: tests actual topology refactoring
  operations (relocate, move-up, repoint, set-read-only/writeable)
  and cross-references results with direct MySQL queries
The test created extra_db on mysql2 but only cleaned up the channel
on mysql3. Add DROP DATABASE to prevent residual state.
@renecannao renecannao force-pushed the issue88-90-behavioral-tests branch from 34f3c7c to cbda41b Compare April 9, 2026 09:44
@renecannao renecannao changed the base branch from issue80-81-mysql-matrix-channels to master April 9, 2026 10:01
The \G (vertical format) flag is unreliable when passed through
docker exec pipes — it returns empty output on MySQL 8.4 and 9.x.
Use tabular SHOW REPLICA STATUS instead; column 2 is always the
source host.
awk splits on whitespace by default, picking up "for" from
"Waiting for source..." (column 1 = Replica_IO_State).
Use -F'\t' to split on tabs so column 2 = Source_Host.
Orchestrator caches instance state and only refreshes on its poll
interval. After STOP REPLICA / SET GLOBAL read_only / injecting errant
GTIDs, the tests need to force a re-discover call so orchestrator
updates its cached data before the test assertions check the API.
- Verify data exists on mysql2 before setting up the extra channel
- Replace single sleep+check with a 15s retry loop for data replication
- Add better error messages for debugging if setup or replication fails
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
tests/functional/test-topology-discovery.sh (1)

5-5: Add error handling for cd command.

If the cd fails (e.g., directory doesn't exist), the script will continue with incorrect working directory, potentially causing confusing failures.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || { echo "Failed to cd to repository root"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-topology-discovery.sh` at line 5, The script currently
runs cd "$(dirname "$0")/../.." without checking for failure; update the script
so the cd command's failure is detected and handled (for example, test the exit
status of cd or enable errexit) and exit with a clear error message;
specifically modify the invocation around cd "$(dirname "$0")/../.." (or the
script's top-of-file bootstrapping) to abort the script and log why (unable to
change to the expected working directory) when the cd fails.
tests/functional/test-problem-detection.sh (1)

5-5: Add error handling for cd command.

Consistent with the other test scripts, this should fail fast if the directory change fails.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || { echo "Failed to cd to repository root"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-problem-detection.sh` at line 5, The cd command in
tests/functional/test-problem-detection.sh should fail fast like the other
scripts: update the cd "$(dirname "$0")/../.." invocation to check its exit
status and abort on failure (e.g., append an immediate failure handler such as
|| exit 1 or an equivalent error message + exit) so the script stops if the
directory change fails.
tests/functional/test-topology-operations.sh (1)

5-5: Add error handling for cd command.

Same issue as in other test scripts — the script should fail fast if the directory change fails.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || { echo "Failed to cd to repository root"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-topology-operations.sh` at line 5, The cd command cd
"$(dirname "$0")/../.." should fail fast if it cannot change directories; update
that line to check the exit status and abort on failure (e.g., append a
conditional OR that prints an error to stderr and exits non‑zero), or ensure the
script enables strict failure (set -e) before running the cd; target the exact
cd invocation (cd "$(dirname "$0")/../..") when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/functional/test-problem-detection.sh`:
- Line 5: The cd command in tests/functional/test-problem-detection.sh should
fail fast like the other scripts: update the cd "$(dirname "$0")/../.."
invocation to check its exit status and abort on failure (e.g., append an
immediate failure handler such as || exit 1 or an equivalent error message +
exit) so the script stops if the directory change fails.

In `@tests/functional/test-topology-discovery.sh`:
- Line 5: The script currently runs cd "$(dirname "$0")/../.." without checking
for failure; update the script so the cd command's failure is detected and
handled (for example, test the exit status of cd or enable errexit) and exit
with a clear error message; specifically modify the invocation around cd
"$(dirname "$0")/../.." (or the script's top-of-file bootstrapping) to abort the
script and log why (unable to change to the expected working directory) when the
cd fails.

In `@tests/functional/test-topology-operations.sh`:
- Line 5: The cd command cd "$(dirname "$0")/../.." should fail fast if it
cannot change directories; update that line to check the exit status and abort
on failure (e.g., append a conditional OR that prints an error to stderr and
exits non‑zero), or ensure the script enables strict failure (set -e) before
running the cd; target the exact cd invocation (cd "$(dirname "$0")/../..") when
applying the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f699f0c4-2392-4ad4-a3b1-fea6df43c768

📥 Commits

Reviewing files that changed from the base of the PR and between 5b13080 and 1d88f3f.

📒 Files selected for processing (6)
  • .github/workflows/functional.yml
  • tests/functional/lib.sh
  • tests/functional/test-named-channels.sh
  • tests/functional/test-problem-detection.sh
  • tests/functional/test-topology-discovery.sh
  • tests/functional/test-topology-operations.sh

After STOP+START REPLICA, mysql2 may still appear in /api/problems due
to other issues (replication lag, errant GTID from stop/start cycle).
Instead of checking /api/problems for any mysql2 entry, verify that the
SQL thread is running again via the instance API. Also force re-discovery
on each poll iteration to ensure fresh state.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
tests/functional/test-named-channels.sh (1)

65-65: Use _ for the intentionally unused loop variable.

Line 65 uses i but never reads it, which triggers SC2034 and adds noise.

#!/bin/bash
# Verify the loop variable is not referenced anywhere in this script.
rg -n --type=sh -C2 '\bfor\s+i\s+in\s+\$\(seq 1 15\)' tests/functional/test-named-channels.sh
rg -n --type=sh '\$i\b' tests/functional/test-named-channels.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-named-channels.sh` at line 65, The loop variable `i` in
the `for i in $(seq 1 15); do` line is intentionally unused and should be
renamed to `_` to avoid the SC2034 unused-variable warning; update the loop to
`for _ in $(seq 1 15); do` and verify there are no references to `$i` elsewhere
(if any exist, replace or remove them) so the change is safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/test-named-channels.sh`:
- Around line 129-131: The DROP DATABASE cleanup currently suppresses stderr and
ignores failures for the command `$COMPOSE exec -T mysql2 mysql -uroot
-ptestpass -e "DROP DATABASE IF EXISTS extra_db;"`; change this so failures are
detected and cause the test run to fail: remove the `2>/dev/null` suppression,
capture and check the command exit status (or enable `set -e`/`set -o errexit`
for the script), and on non-zero exit print a clear error message and exit
non-zero so stale state cannot leak into later tests.

In `@tests/functional/test-problem-detection.sh`:
- Line 23: The mysql state-changing exec lines (e.g. the call using $COMPOSE
exec -T mysql2 mysql -uroot -ptestpass -e "$STOP_SQL") suppress stderr and
ignore exit codes; update each such mutation call (including analogous
START/INSERT/UPDATE/DELETE commands) to capture stdout/stderr and check the
command's exit status immediately, and if non-zero print a descriptive error
including the captured stderr and exit non-zero to fail the test fast. Locate
the commands by the pattern "$COMPOSE exec -T mysql* mysql -uroot -ptestpass -e
\"...\"" and modify them to redirect/collect stderr (or assign output to a
variable), test $? (or the command result), and abort with an explanatory
message when the command fails.
- Around line 100-105: The current check marks CLEARED when only SQL_RUNNING is
"True", which can pass while IO_RUNNING is "False"; update the conditional that
sets CLEARED (the block using SQL_RUNNING, IO_RUNNING, CLEARED and REPL_STATE)
to require both SQL_RUNNING = "True" and IO_RUNNING = "True" before setting
CLEARED and breaking the loop, and adjust the echo message to reflect both
thread states (e.g., "SQL=True, IO=True") so the test only passes when both
threads are healthy.
- Line 15: discover_topology is being treated as optional but must be a hard
precondition; update the test to fail immediately when discover_topology
"mysql1" returns non-zero by checking its exit status and exiting with a
non-zero code and an explanatory stderr message (e.g., run discover_topology
"mysql1" and if it fails print "Topology discovery failed for mysql1" to stderr
and exit 1), or alternatively enable strict shell error handling at the top of
the script (set -euo pipefail) to enforce failure; refer to the
discover_topology invocation to locate where to add the check.
- Around line 5-6: The script should fail fast if bootstrap steps fail: add a
strict shell flag (e.g., set -euo pipefail or at minimum set -e) at the top of
the script (immediately after the shebang) so that failures in the cd "$(dirname
"$0")/../.." or source tests/functional/lib.sh commands abort the run; ensure
the flags are applied before those two lines so any failure in cd or source
stops the script.

---

Nitpick comments:
In `@tests/functional/test-named-channels.sh`:
- Line 65: The loop variable `i` in the `for i in $(seq 1 15); do` line is
intentionally unused and should be renamed to `_` to avoid the SC2034
unused-variable warning; update the loop to `for _ in $(seq 1 15); do` and
verify there are no references to `$i` elsewhere (if any exist, replace or
remove them) so the change is safe.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e9671a70-a5c5-46e7-96f5-90bb0c864147

📥 Commits

Reviewing files that changed from the base of the PR and between 1d88f3f and 47988d9.

📒 Files selected for processing (2)
  • tests/functional/test-named-channels.sh
  • tests/functional/test-problem-detection.sh

Comment on lines +129 to +131
# Clean up test database on mysql2
$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF EXISTS extra_db;" 2>/dev/null

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle cleanup failures explicitly to avoid cross-test contamination.

Line 130 suppresses stderr and does not check command success. If DROP DATABASE fails, stale state can leak into later functional tests in the same environment.

Suggested fix
 # Clean up test database on mysql2
-$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF EXISTS extra_db;" 2>/dev/null
+if ! $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF EXISTS extra_db;" 2>/dev/null; then
+    fail "Cleanup: failed to drop extra_db on mysql2"
+else
+    pass "Cleanup: dropped extra_db on mysql2"
+fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Clean up test database on mysql2
$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF EXISTS extra_db;" 2>/dev/null
# Clean up test database on mysql2
if ! $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF EXISTS extra_db;" 2>/dev/null; then
echo "Error: Failed to drop extra_db on mysql2"
exit 1
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-named-channels.sh` around lines 129 - 131, The DROP
DATABASE cleanup currently suppresses stderr and ignores failures for the
command `$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "DROP DATABASE IF
EXISTS extra_db;"`; change this so failures are detected and cause the test run
to fail: remove the `2>/dev/null` suppression, capture and check the command
exit status (or enable `set -e`/`set -o errexit` for the script), and on
non-zero exit print a clear error message and exit non-zero so stale state
cannot leak into later tests.

Comment on lines +5 to +6
cd "$(dirname "$0")/../.."
source tests/functional/lib.sh
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail fast when bootstrap steps fail.

If cd or source fails, the script still continues (no set -e), which can cascade into misleading failures later.

Suggested fix
-cd "$(dirname "$0")/../.."
-source tests/functional/lib.sh
+cd "$(dirname "$0")/../.." || { echo "FATAL: unable to cd to repository root"; exit 1; }
+source tests/functional/lib.sh || { echo "FATAL: unable to load tests/functional/lib.sh"; exit 1; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cd "$(dirname "$0")/../.."
source tests/functional/lib.sh
cd "$(dirname "$0")/../.." || { echo "FATAL: unable to cd to repository root"; exit 1; }
source tests/functional/lib.sh || { echo "FATAL: unable to load tests/functional/lib.sh"; exit 1; }
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 5-5: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[info] 6-6: Not following: tests/functional/lib.sh was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-problem-detection.sh` around lines 5 - 6, The script
should fail fast if bootstrap steps fail: add a strict shell flag (e.g., set
-euo pipefail or at minimum set -e) at the top of the script (immediately after
the shebang) so that failures in the cd "$(dirname "$0")/../.." or source
tests/functional/lib.sh commands abort the run; ensure the flags are applied
before those two lines so any failure in cd or source stops the script.

START_SQL=$(mysql_start_replica_sql)

wait_for_orchestrator || { echo "FATAL: Orchestrator not reachable"; exit 1; }
discover_topology "mysql1"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Treat topology discovery as a hard precondition.

Line 15 ignores discover_topology failure, so tests can run against partial/undiscovered state and produce noisy results.

Suggested fix
-discover_topology "mysql1"
+discover_topology "mysql1" || { echo "FATAL: initial topology discovery failed"; exit 1; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
discover_topology "mysql1"
discover_topology "mysql1" || { echo "FATAL: initial topology discovery failed"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-problem-detection.sh` at line 15, discover_topology is
being treated as optional but must be a hard precondition; update the test to
fail immediately when discover_topology "mysql1" returns non-zero by checking
its exit status and exiting with a non-zero code and an explanatory stderr
message (e.g., run discover_topology "mysql1" and if it fails print "Topology
discovery failed for mysql1" to stderr and exit 1), or alternatively enable
strict shell error handling at the top of the script (set -euo pipefail) to
enforce failure; refer to the discover_topology invocation to locate where to
add the check.


# Stop replication on mysql2
echo "Stopping replication on mysql2..."
$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "$STOP_SQL" 2>/dev/null
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Check exit codes for state-changing MySQL commands.

These commands mutate DB state but suppress stderr and do not verify success. If any command fails, subsequent assertions may incorrectly blame orchestrator behavior.

Suggested pattern
+# helper
+run_mysql() {
+    local host="$1"
+    local sql="$2"
+    if ! $COMPOSE exec -T "$host" mysql -uroot -ptestpass -e "$sql" >/dev/null; then
+        fail "MySQL command failed on ${host}" "$sql"
+        return 1
+    fi
+    return 0
+}
-$COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "$STOP_SQL" 2>/dev/null
+run_mysql "mysql2" "$STOP_SQL" || exit 1

Apply the same pattern to the other mutation calls.

Also applies to: 81-81, 122-122, 150-150, 179-185, 216-220

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-problem-detection.sh` at line 23, The mysql
state-changing exec lines (e.g. the call using $COMPOSE exec -T mysql2 mysql
-uroot -ptestpass -e "$STOP_SQL") suppress stderr and ignore exit codes; update
each such mutation call (including analogous START/INSERT/UPDATE/DELETE
commands) to capture stdout/stderr and check the command's exit status
immediately, and if non-zero print a descriptive error including the captured
stderr and exit non-zero to fail the test fast. Locate the commands by the
pattern "$COMPOSE exec -T mysql* mysql -uroot -ptestpass -e \"...\"" and modify
them to redirect/collect stderr (or assign output to a variable), test $? (or
the command result), and abort with an explanatory message when the command
fails.

Comment on lines +100 to +105
SQL_RUNNING=$(echo "$REPL_STATE" | cut -d: -f1)
IO_RUNNING=$(echo "$REPL_STATE" | cut -d: -f2)
if [ "$SQL_RUNNING" = "True" ]; then
CLEARED=true
echo "Replication recovered after ${i}s (SQL=True, IO=${IO_RUNNING})"
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Require both SQL and IO threads for “cleared” state.

Current clear condition only checks SQL=True; replication can still be unhealthy with IO=False, causing a false pass.

Suggested fix
-    if [ "$SQL_RUNNING" = "True" ]; then
+    if [ "$SQL_RUNNING" = "True" ] && [ "$IO_RUNNING" = "True" ]; then
         CLEARED=true
-        echo "Replication recovered after ${i}s (SQL=True, IO=${IO_RUNNING})"
+        echo "Replication recovered after ${i}s (SQL=True, IO=True)"
         break
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SQL_RUNNING=$(echo "$REPL_STATE" | cut -d: -f1)
IO_RUNNING=$(echo "$REPL_STATE" | cut -d: -f2)
if [ "$SQL_RUNNING" = "True" ]; then
CLEARED=true
echo "Replication recovered after ${i}s (SQL=True, IO=${IO_RUNNING})"
break
SQL_RUNNING=$(echo "$REPL_STATE" | cut -d: -f1)
IO_RUNNING=$(echo "$REPL_STATE" | cut -d: -f2)
if [ "$SQL_RUNNING" = "True" ] && [ "$IO_RUNNING" = "True" ]; then
CLEARED=true
echo "Replication recovered after ${i}s (SQL=True, IO=True)"
break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-problem-detection.sh` around lines 100 - 105, The
current check marks CLEARED when only SQL_RUNNING is "True", which can pass
while IO_RUNNING is "False"; update the conditional that sets CLEARED (the block
using SQL_RUNNING, IO_RUNNING, CLEARED and REPL_STATE) to require both
SQL_RUNNING = "True" and IO_RUNNING = "True" before setting CLEARED and breaking
the loop, and adjust the echo message to reflect both thread states (e.g.,
"SQL=True, IO=True") so the test only passes when both threads are healthy.

@renecannao renecannao self-assigned this Apr 9, 2026
@renecannao renecannao merged commit c3f54cc into master Apr 9, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: Topology refactoring operation tests CI: Replication problem detection tests CI: Topology discovery validation tests

1 participant