Conflicts when updating JobParameters in ElasticJobParametersDB #6715

Closed
chrisburr opened this issue Jan 24, 2023 · 1 comment · Fixed by #6802

@chrisburr
Member

In LHCb we occasionally see the traceback below. I think it is a race between different RPC calls that asynchronously set the Status parameter. Elasticsearch has a retry_on_conflict option, but I'm not convinced that is the right thing to do.

2023-01-24 07:39:07 UTC WorkloadManagement/JobStateUpdate ERROR: Uncaught exception when serving RPC Function setJobStatusBulk
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 296, in __RPCCallFunction
    uReturnValue = oMethod(*args)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/JobStateUpdateHandler.py", line 137, in export_setJobStatusBulk
    return cls._setJobStatusBulk(jobID, statusDict, force=force)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/JobStateUpdateHandler.py", line 257, in _setJobStatusBulk
    result = cls.elasticJobParametersDB.setJobParameter(int(jobID), "Status", status)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/DB/ElasticJobParametersDB.py", line 177, in setJobParameter
    result = self.updateDoc(index=self._indexName(jobID), docID=str(jobID), body={"doc": data})
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ElasticSearchDB.py", line 44, in wrapper_decorator
    return method(self, *args, **kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/Utilities/ElasticSearchDB.py", line 269, in updateDoc
    return S_OK(self.client.update(index, docID, body))
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/client/utils.py", line 178, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1767, in update
    return self.transport.perform_request(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/transport.py", line 408, in perform_request
    raise e
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/transport.py", line 369, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 266, in perform_request
    self._raise_error(
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/opensearchpy/connection/base.py", line 301, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
opensearchpy.exceptions.ConflictError: ConflictError(409, 'version_conflict_engine_exception', '[709792449]: version conflict, required seqNo [16420917], primary term [1]. current document has seqNo [16421015] and primary term [1]')
2023-01-24 07:39:07 UTC WorkloadManagement/JobStateUpdate NOTICE: Returning response ([::ffff:81.180.86.124]:54022)[lhcb_mc:fstagni] (0.30 secs) ERROR: Server error while serving setJobStatusBulk: ConflictError(409, 'version_conflict_engine_exception', '[709792449]: version conflict, required seqNo [16420917], primary term [1]. current document has seqNo [16421015] and primary term [1]')
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    result = self._processProposal(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 538, in _processProposal
    result = self._executeAction(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 558, in _executeAction
    response = handlerObj._rh_executeAction(proposalTuple)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 124, in _rh_executeAction
    retVal = self.__doRPC(actionTuple[1])
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 255, in __doRPC
    return self.__RPCCallFunction(method, args)
  File "/opt/dirac/versions/v11.0.0-1674461755/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 306, in __RPCCallFunction
    return S_ERROR(f"Server error while serving {method}: {str(e)}")
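
The conflict reported in the traceback above can be reproduced in miniature. The following is a minimal sketch of the suspected race, assuming a partial update behaves as read, merge, then write conditioned on the sequence number that was read. `ToyIndex`, `VersionConflict`, `partial_update` and the `before_write` hook are hypothetical names used for illustration only; this is not OpenSearch's or DIRAC's implementation.

```python
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when a write is based on a stale sequence number."""


@dataclass
class Doc:
    source: dict = field(default_factory=dict)
    seq_no: int = 0


class ToyIndex:
    """In-memory stand-in for an index (illustrative, not OpenSearch)."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Return a copy of the document body and the seq_no it was read at.
        doc = self._docs.setdefault(doc_id, Doc())
        return dict(doc.source), doc.seq_no

    def write(self, doc_id, source, if_seq_no):
        # Accept the write only if nobody else wrote since our read.
        doc = self._docs.setdefault(doc_id, Doc())
        if doc.seq_no != if_seq_no:
            raise VersionConflict(
                f"[{doc_id}]: version conflict, required seqNo [{if_seq_no}], "
                f"current document has seqNo [{doc.seq_no}]"
            )
        doc.source = source
        doc.seq_no += 1


def partial_update(index, doc_id, fields, retries=0, before_write=None):
    """Read-merge-write; on conflict, re-read and retry up to `retries` times.

    Returns the number of retries that were needed.
    """
    for attempt in range(retries + 1):
        source, seq_no = index.get(doc_id)
        source.update(fields)
        if before_write is not None and attempt == 0:
            before_write()  # hook that lets a competing writer slip in
        try:
            index.write(doc_id, source, if_seq_no=seq_no)
            return attempt
        except VersionConflict:
            if attempt == retries:
                raise


if __name__ == "__main__":
    index = ToyIndex()
    # A second "RPC call" that updates the same job between our read and write.
    competing = lambda: partial_update(index, 709792449, {"Status": "Running"})
    try:
        partial_update(index, 709792449, {"Status": "Done"}, before_write=competing)
    except VersionConflict as exc:
        print(exc)
```

With `retries=1` the same call succeeds on the second attempt, and the retried writer's value overwrites the competing one. In this toy model a retry therefore resolves the error but does not decide which Status is logically the latest, which may be one reason to be cautious about retrying blindly.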

[Image: Screenshot 2023-01-24 at 10 31 37]

@chrisburr added this to the v8.0 milestone Jan 24, 2023
@fstagni
Contributor

fstagni commented Jan 24, 2023

This is not specific to the 8.0 release: it was already happening in the LHCbDIRAC installation before. Colleagues running older versions of DIRAC might see the same. The error can also appear in slightly different forms, e.g.

2023-01-23 16:13:09 UTC WorkloadManagement/JobStateUpdate NOTICE: Returning response ([::ffff:202.122.32.249]:49810)[lhcb_mc:fstagni] (0.49 secs) ERROR: Server error while serving setJobParameters: ConflictError(409, 'version_conflict_engine_exception', '[709513455]: version conflict, required seqNo [8530694], primary term [1]. current document has seqNo [8531980] and primary term [1]')

But it is the same underlying error.

The indices where we do these updates have only 1 replica, but in any case the update operation performed here is "heavy" from the OpenSearch point of view.

I am also not convinced that using retry_on_conflict is the right thing to do, but at the moment I don't see a better option.
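
For reference, this is roughly what the retry_on_conflict option would look like at the call site. It is a sketch under assumptions: `update_doc_with_retry` is a hypothetical helper, the default of 3 retries is arbitrary, and `client` is assumed to behave like an opensearch-py client whose `update` accepts `retry_on_conflict` as a query parameter. `RecordingClient` is a stand-in so the snippet runs without a cluster. This is not necessarily the change that was eventually merged.

```python
def update_doc_with_retry(client, index, doc_id, body, retries=3):
    """Partial update that asks the server to retry on version conflicts.

    Assumption: `client.update` forwards `retry_on_conflict` to the Update
    API, as opensearch-py and elasticsearch-py clients do.
    """
    return client.update(
        index=index,
        id=doc_id,
        body=body,
        retry_on_conflict=retries,
    )


class RecordingClient:
    """Stand-in client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update(self, **kwargs):
        self.calls.append(kwargs)
        return {"result": "updated"}


if __name__ == "__main__":
    client = RecordingClient()
    # The index name here is made up for the example.
    update_doc_with_retry(
        client, "example_index", "709792449", {"doc": {"Status": "Done"}}
    )
    print(client.calls[0]["retry_on_conflict"])
```

The retry happens on the server side, between the internal get and index phases of the update, so it adds no extra round trips from the service. It still leaves the ordering question open when two calls set the same parameter.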
