"i/o timeout" err not retried when using Watch()

Issue tracker is used for reporting bugs and discussing new features. Please use
[stackoverflow](https://stackoverflow.com) for supporting issues.



## Expected Behavior

I would expect the client to retry commands related to executing the `ClusterClient.Watch()` method.



## Current Behavior

Currently if there's a broken connection the TCP call will timeout after the configured or default (3 seconds) timeout, causing the whole method call to fail. This can happen as early as when trying to execute the initial `WATCH` command against the server. The `Watch()` method call returns with an error like `redis: Conn is in a bad state: read tcp 123.123.123.123:37106->42.42.42.42:6379: i/o timeout`



## Possible Solution



Detect and unwrap a `BadConnErr` in `shouldRetry` before evaluating retry conditions.

## Steps to Reproduce

N/A, this is an intermittent issue that occurs in a cloud environment.

## Context (Environment)

Version: `github.com/redis/go-redis/v9 v9.5.1`






## Detailed Description



Reading through the code, this happens because the `Watch()` method uses a `Tx` object, which uses a `StickyConnPool`:

https://github.com/redis/go-redis/blob/0f0a28464c3db3ff13f80c44db43dbb4ad6ac2c7/tx.go#L26-L32

This StickyConnPool captures any errors that occur while executing commands and wraps them in the internal `BadConnErr` and stores internally:

https://github.com/redis/go-redis/blob/0f0a28464c3db3ff13f80c44db43dbb4ad6ac2c7/internal/pool/pool_sticky.go#L122

When the initial command fails, it may get retried (depending on options) by `baseClient._process`, but a repeated call to `StickyConnPool.Get` will now return this `BadConnErr`, which will bubble out.

When the `ClusterClient.Watch` method machinery [attempts a retry](https://github.com/redis/go-redis/blob/0f0a28464c3db3ff13f80c44db43dbb4ad6ac2c7/osscluster.go#L1648) it will evaluate the error using `shouldRetry` which will return false, even though the wrapped error is a `timeoutError`.

## Possible Implementation

Change `shouldRetry` to detect if the passed in err is `BadConnErr` and unwrap it before invoking the existing logic.

However note that there's a wrinkle in that if `baseClient._process` (which also uses `shouldRetry`) that is executing on a `Tx` object that is using the sticky pool now attempts a retry it will continue trying to get a `conn` that is in a bad state. There is the `StickyConnPool.Reset()` method that appears to be intended to restore the sticky pool to an uninitialized state so that calling `Get` on it can attempt to get a fresh conn from the underlying pool, however this method is not called anywhere in the code at the moment. It appears that there needs to be extra logic added to reset the sticky pool before retries are attempted.

	tx := Tx{
	baseClient: baseClient{
	opt: c.opt,
	connPool: pool.NewStickyConnPool(c.connPool),
	},
	hooksMixin: c.hooksMixin.clone(),
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

"i/o timeout" err not retried when using Watch() #2999

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Context (Environment)

Detailed Description

Possible Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

"i/o timeout" err not retried when using Watch() #2999

Description

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Context (Environment)

Detailed Description

Possible Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions