Conflicts when updating JobParameters in ElasticJobParametersDB #6715

chrisburr · 2023-01-24T09:33:45Z

In LHCb we see the call stack below from time to time. I think it's a race between different RPC calls that asynchronously set the Status parameter. Elasticsearch has a retry_on_conflict option, but I'm not convinced that using it is the right thing to do.

2023-01-24 07:39:07 UTC WorkloadManagement/JobStateUpdate ERROR: Uncaught exception when serving RPC Function setJobStatusBulk
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 296, in __RPCCallFunction
    uReturnValue = oMethod(*args)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/JobStateUpdateHandler.py", line 137, in export_setJobStatusBulk
    return cls._setJobStatusBulk(jobID, statusDict, force=force)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/JobStateUpdateHandler.py", line 257, in _setJobStatusBulk
    result = cls.elasticJobParametersDB.setJobParameter(int(jobID), "Status", status)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/DB/ElasticJobParametersDB.py", line 177, in setJobParameter
    result = self.updateDoc(index=self._indexName(jobID), docID=str(jobID), body={"doc": data})
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ElasticSearchDB.py", line 44, in wrapper_decorator
    return method(self, *args, **kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ElasticSearchDB.py", line 269, in updateDoc
    return S_OK(self.client.update(index, docID, body))
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/client/utils.py", line 178, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1767, in update
    return self.transport.perform_request(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/transport.py", line 408, in perform_request
    raise e
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/transport.py", line 369, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 266, in perform_request
    self._raise_error(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/connection/base.py", line 301, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
opensearchpy.exceptions.ConflictError: ConflictError(409, 'version_conflict_engine_exception', '[709792449]: version conflict, required seqNo [16420917], primary term [1]. current document has seqNo [16421015] and primary term [1]')
2023-01-24 07:39:07 UTC WorkloadManagement/JobStateUpdate NOTICE: Returning response ([::ffff:81.180.86.124]:54022)[lhcb_mc:fstagni] (0.30 secs) ERROR: Server error while serving setJobStatusBulk: ConflictError(409, 'version_conflict_engine_exception', '[709792449]: version conflict, required seqNo [16420917], primary term [1]. current document has seqNo [16421015] and primary term [1]')
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    result = self._processProposal(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 538, in _processProposal
    result = self._executeAction(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 558, in _executeAction
    response = handlerObj._rh_executeAction(proposalTuple)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 124, in _rh_executeAction
    retVal = self.__doRPC(actionTuple[1])
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 255, in __doRPC
    return self.__RPCCallFunction(method, args)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 306, in __RPCCallFunction
    return S_ERROR(f"Server error while serving {method}: {str(e)}")
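
To illustrate the kind of race suspected here: a partial update is a read of the current document, a merge, and a write that is only accepted if the document's sequence number has not moved in between. The sketch below models that with an in-memory stand-in. `ToyIndex`, `VersionConflict` and `partial_update_steps` are hypothetical names for illustration only, not DIRAC or opensearch-py code.

```python
class VersionConflict(Exception):
    """Stand-in for opensearchpy.exceptions.ConflictError (HTTP 409)."""


class ToyIndex:
    """In-memory model of one document guarded by a sequence number."""

    def __init__(self):
        self.doc = {}
        self.seq_no = 0

    def get(self):
        # Return a copy of the document and the seqNo it was read at.
        return dict(self.doc), self.seq_no

    def index(self, doc, if_seq_no):
        # Reject the write if the document changed since it was read.
        if if_seq_no != self.seq_no:
            raise VersionConflict(
                f"required seqNo [{if_seq_no}], "
                f"current document has seqNo [{self.seq_no}]"
            )
        self.doc = doc
        self.seq_no += 1


def partial_update_steps(store, fields):
    """Do the read and merge now; return the write step to run later."""
    doc, seq_no = store.get()
    doc.update(fields)
    return lambda: store.index(doc, if_seq_no=seq_no)


store = ToyIndex()
# Two RPC calls read the same version before either has written.
write_a = partial_update_steps(store, {"Status": "Running"})
write_b = partial_update_steps(store, {"Status": "Done"})

write_a()  # accepted, seqNo goes 0 -> 1
try:
    write_b()  # still expects seqNo 0, so it is rejected
    conflict = None
except VersionConflict as exc:
    conflict = str(exc)
print(conflict)
```

The second writer fails in the same way as the log message above: it requires an older seqNo than the one the document currently has.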


fstagni · 2023-01-24T15:44:50Z

This is not specific to the 8.0 release: it was already happening in the LHCbDIRAC installation before. Colleagues running older versions of DIRAC might see the same thing. The error can also appear in slightly different forms, e.g.

2023-01-23 16:13:09 UTC WorkloadManagement/JobStateUpdate NOTICE: Returning response ([::ffff:202.122.32.249]:49810)[lhcb_mc:fstagni] (0.49 secs) ERROR: Server error while serving setJobParameters: ConflictError(409, 'version_conflict_engine_exception', '[709513455]: version conflict, required seqNo [8530694], primary term [1]. current document has seqNo [8531980] and primary term [1]')

But it is the same underlying error.

The indices where we do these updates only have 1 replica, but in any case the update operation performed here is "heavy" from OpenSearch's point of view.

I am also not convinced that using retry_on_conflict is the right thing to do, but at the moment I don't see a better option.
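
For reference, a client-side alternative to retry_on_conflict could look roughly like the sketch below: catch the conflict, wait, and try again. This is only an illustration under assumptions, not the code of any actual fix. `update_with_retry`, its parameters and `flaky_update` are hypothetical names, and `ConflictError` here is a local stand-in for `opensearchpy.exceptions.ConflictError` so that the example runs without a cluster.

```python
import random
import time


class ConflictError(Exception):
    """Local stand-in for opensearchpy.exceptions.ConflictError."""


def update_with_retry(update, retries=3, base_delay=0.1, sleep=time.sleep):
    """Call update(); on ConflictError wait and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return update()
        except ConflictError:
            if attempt == retries:
                raise  # out of retries: let the caller see the conflict
            # Exponential backoff with jitter so competing writers drift apart.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Hypothetical update that conflicts twice before succeeding.
calls = {"n": 0}


def flaky_update():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConflictError("version_conflict_engine_exception")
    return {"result": "updated"}


waits = []  # record the delays instead of actually sleeping
result = update_with_retry(flaky_update, sleep=waits.append)
```

In real use the `update` argument would wrap the existing `updateDoc` call. Either way, a retry only re-applies the same partial document on top of whatever was written in between, so when two calls set the same key (such as Status) the last writer still wins.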

chrisburr added this to the v8.0 milestone Jan 24, 2023

chrisburr assigned fstagni Jan 24, 2023

fstagni mentioned this issue Feb 14, 2023

[8.0] fix: wait and retry on ConflictError #6802

Merged

fstagni linked a pull request Feb 28, 2023 that will close this issue

[8.0] fix: wait and retry on ConflictError #6802

Merged

fstagni closed this as completed Mar 15, 2023
