
Oracle-ES consistency #240

Merged: 28 commits merged into master from oracle-es-consistency on May 28, 2019
Conversation

@Evildoor (Contributor) commented Apr 3, 2019

Add a mechanism for checking the consistency of data between Oracle and elasticsearch.

Currently, consistency is checked only for tasks (not datasets), by comparing the timestamps.

Consistency check is launched by executing Utils/Dataflow/run/data4es-consistency-check.

https://trello.com/c/QjYjIFi6

The first step in ensuring consistency between Oracle and ES is to obtain
a very basic set of task data - id and timestamp - from Oracle. Add a query
for doing so.
es.get() raises NotFoundError in both cases - when index does not exist and
when document does not exist. Also, it's more reasonable to check index once
since it's the same for all messages.
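
A minimal sketch of the resulting behaviour (the es, index and doc_type names are illustrative, not the stage's actual code; Python 2 elasticsearch-py, as used elsewhere in this thread):

from elasticsearch.exceptions import NotFoundError

def document_exists(es, index, doc_type, doc_id):
    # The index itself is assumed to have been checked once with
    # es.indices.exists(index), so NotFoundError here can only mean
    # that the document is missing, not the index.
    try:
        es.get(index=index, doc_type=doc_type, id=doc_id)
        return True
    except NotFoundError:
        return False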
- Add/update descriptions of functions and their parameters.
- Update the script's description.
- Add the consistency check's description to the README.
@Evildoor self-assigned this Apr 3, 2019
Check that all fields supplied in the input data are present in ES and their
values match the input data, instead of working only with tasks and
their timestamps. This will allow checking tasks' other fields as well as
different types of documents, such as datasets.
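
A minimal sketch of such a field-by-field comparison (a hypothetical helper, not the stage's actual code; service fields starting with '_' are assumed to be handled separately):

def find_inconsistencies(input_data, es_doc):
    # input_data is one message built from Oracle data, es_doc is the
    # '_source' part of the corresponding ES document.
    diff = {}
    for field, value in input_data.items():
        if field.startswith('_'):
            continue
        if es_doc.get(field) != value:
            diff[field] = (value, es_doc.get(field))
    return diff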

Add stage 016 into consistency chain because it adds the fields required
for getting documents of given type from ES.
Prepare the script for further development, where inconsistent tasks will be
automatically reloaded into ES.
@Evildoor changed the title from "[WIP] Oracle-ES consistency" to "Oracle-ES consistency" Apr 5, 2019
@Evildoor requested a review from mgolosova April 5, 2019 13:06
@mgolosova (Collaborator) left a comment

I think data samples should also be added for this new pipeline: the output of 009 with the new query, of 016 with that new input, etc.

Review threads (resolved) on:
  • Utils/Dataflow/009_oracleConnector/query/consistency.sql
  • Utils/Dataflow/069_upload2es/README
  • Utils/Dataflow/069_upload2es/consistency.py
  • Utils/Dataflow/run/data4es-consistency-check
  • Utils/Dataflow/shell_lib/get_config
These functions are either used by several scripts or will be in the future.
Move them to library to uphold DRY principle.
DEBUG mode in data4es-start exists to check the workflow without uploading
anything to ES. Consistency check writes nothing, so DEBUG is unnecessary
here.
Do not redirect the stages' stderrs, leave them as-is.
While the script is stage 069's counterpart in data4es-consistency-check,
they share no functionality.
- State what is retrieved by the query.
- Remove unnecessary information.
- Show an error message and exit if no host, port, or index is specified.
- Remove default values of the parameters.
Printing all discovered inconsistent records to stdout as a batch conflicts
with various things, such as pyDKB's file mode and the possibility of
controlling the workflow with Apache Kafka.

Create an output message with _id and _type for each inconsistent record.
Still exit with code 1 if at least one inconsistent record was found,
0 otherwise.
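
A rough sketch of the per-record reporting described above (names are illustrative and pyDKB message handling is reduced to plain JSON lines):

import json
import sys

def report(inconsistent_records):
    # One output message per inconsistent record, then the exit code.
    for rec in inconsistent_records:
        sys.stdout.write(json.dumps({'_id': rec['_id'],
                                     '_type': rec['_type']}) + '\n')
    sys.exit(1 if inconsistent_records else 0)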
@Evildoor (Contributor, Author) commented Apr 18, 2019

#199 is almost complete; I will wait for it and then merge master here before adding samples, to avoid conflicts in the READMEs.

@mgolosova (Collaborator) commented on the diff:

@@ -153,7 +153,11 @@ def main(args):
    cfg = load_config(stage.ARGS.conf)
    stage.process = process
    es_connect(cfg)
    stage.run()
    if not es.indices.exists(INDEX):

Just for information: this check would cause AuthorizationException in case of remote connection via Nginx proxy:

>>> es = elasticsearch.Elasticsearch('http://login:password@aiatlas171.cern.ch:9200') 
>>> es.indices.exists('test_prodsys_rucio_ami')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 213, in exists
    params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 185, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.AuthorizationException: TransportError(403, u'')

I updated the proxy configuration to allow HEAD requests (to readable locations), as there's no actual need to block them; so now there should be no problem with it.

@mgolosova (Collaborator) commented:

And (again, just for information), there was another way to achieve the goal:

Check for index' existence before working.
es.get() raises NotFoundError in both cases - when index does not exist and
when document does not exist. Also, it's more reasonable to check index once
since it's the same for all messages.

Even with NotFoundError it is possible to tell one situation from another:

>>> try: es.get(index='test_prodsys_rucio_ami', doc_type='task', id=1468216)
... except Exception, err: pass
... 
>>> err.info
{u'found': False, u'_type': u'task', u'_id': u'1468216', u'_index': u'test_prodsys_rucio_ami'}
>>>
>>> try: es.get(index='_no_such_index_', doc_type='task', id=14682166)                                                                                     
... except Exception, err: pass
>>> err.info
{u'status': 404, u'error': {u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'root_cause': [{u'index_uuid': u'_na_', u'index': u'tprodsys_rucio_ami', u'resource.type': u'index_expression', u'resource.id': u'tprodsys_rucio_ami', u'reason': u'no such index', u'type': u'index_not_found_exception'}], u'reason': u'no such index', u'type': u'index_not_found_exception', u'resource.id': u'tprodsys_rucio_ami'}}
>>> err.info['error']['reason']
u'no such index'

And in case of this error -- or, maybe, even in case of any error when info['error'] is defined (though I am not sure if there can possibly be any other error) -- re-raise the exception to indicate that the process cannot be continued. Maybe wrapping the exception into DataflowException.
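
A sketch of that alternative (DataflowException is kept here as a placeholder for the pyDKB exception mentioned above; the exact import is an assumption):

from elasticsearch.exceptions import NotFoundError

class DataflowException(Exception):
    # Placeholder for the pyDKB exception mentioned above.
    pass

def get_document(es, index, doc_type, doc_id):
    try:
        return es.get(index=index, doc_type=doc_type, id=doc_id)
    except NotFoundError as err:
        info = getattr(err, 'info', None)
        if isinstance(info, dict) and info.get('error'):
            # Missing index (or similar): processing can not continue.
            raise DataflowException(info['error'].get('reason', str(err)))
        # Plain "document not found".
        return None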

The only situation in which it makes any difference is when the index is removed or the access policy changes during the process execution: the check was successfully passed at the start, so any NotFoundError after it will be taken as "record missing", no matter what. But I don't think it is likely to happen, so there's nothing wrong with the one-time check.

@mgolosova (Collaborator) left a comment

In general everything looks fine.

But I didn't add any comments to the last commit (808280c): they will appear later or go to e-mail (I need to think a little more about it).

The only thing that should be done in this PR before it can be merged is the handling of "service" JSON fields. Specifically, two points:

  • "service" fields should be filtered out from the "data" fields before checking the consistency (as "service" fields most likely are not written to ES as document content fields);
  • in case of parent/child relationship in ES, child document can not be taken by the ID with get(): it also requires parent ID specified.

The last point is where I feel like I might owe you an apology: if the search() method were used (as you meant to do initially), it would work the same way for both tasks and output_datasets. Another point is that the ids query is definitely more suitable for batch mode processing (which, I hope, will sooner or later be added to the processing).

Due to this I see here at least three ways to go:

  • leave things 'as-is' (adding a note to the README that for output_datasets it most likely won't work properly). This is OK as right now we only check the tasks' consistency -- but it may be a bit frustrating;
  • add _parent handling in the current version. It looks like the most straightforward and cheap way to make things work as they were meant to;
  • rework things once again to use the ids query. It feels like the most time-consuming way, and may be excessive (see below why).

If you take a look at PR #244 -- I'm working on some kind of a "common" wrapper around ES with commonly used functions, including "get record [fields] by id[+type][+parent]", etc. When I finish, I most probably will try to adapt your code to use it -- to check that the library module covers all (known) needed use cases, as well as to make sure that things one of us has already stumbled over will not appear again in different ES-using scenarios (and will be handled the same way everywhere). And as I thought about it here -- the "ids" version of get() will be added there as well; so switching back to search() only in view of the batch processing plans may be a premature optimization.

But the _parent-aware version and/or leaving things as is may still look better to you (or maybe you see a fourth or fifth way to go) -- so it is totally up to you what to choose.

Review threads (resolved) on:
  • Utils/Dataflow/069_upload2es/consistency.py
  • Utils/Dataflow/shell_lib/eop_filter
@mgolosova (Collaborator) commented:

@Evildoor, another thing I have just noticed after a regular consistency check: the scenario does not seem to handle the chain_data and chain_id fields properly. The error messages I've got look this way:
(WARN) 2019-05-20T08:05:01.301149 Document (task, 18024963) differs between Oracle and ES: Oracle:{u'chain_data': [18024963], u'chain_id': 18024963, u'taskid': 18024963, u'task_timestamp': u'14-05-2019 02:11:40'} ES:{u'task_timestamp': u'14-05-2019 02:11:40', u'chain_data': [17988291, 17988294, 18024963], u'chain_id': 17988291, u'taskid': 18024963}

@Evildoor (Contributor, Author) commented May 20, 2019

the scenario seems not to handle chain_data and chain_id fields properly

Checked - this seems to happen as follows:

  1. 009 produces data with taskid and timestamp.
  2. 016, among other things, checks data for chain_data, finds none, considers it to be missing and sets the task to be a root of its own chain.
  3. 071 sees incorrectly defined chain_data and chain_id, treats them as fields to check, and finds a discrepancy between them and correct values in ES.

Considering the possible situations:

  • Normal data loading, where it is possible to encounter records with missing chain_data.
  • Consistency check described above, where chain_data and chain_id shouldn't actually be checked.
  • Consistency check which includes chain_data and/or chain_id that should be checked.

I believe that this issue should be fixed by adding a --skip-empty-chains key to stage 016 that will skip the processing of chain_data if it is missing. Will do so later if there are no objections.
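
A rough sketch of the proposed behaviour (the option name is taken from the comment above; Stage 016's real code is organised differently):

def process_chain_fields(data, skip_empty_chains=False):
    # With --skip-empty-chains a record without chain information is
    # passed through untouched instead of being marked as a chain root.
    if not data.get('chain_data'):
        if skip_empty_chains:
            return data
        # Current behaviour: the task becomes the root of its own chain.
        data['chain_data'] = [data['taskid']]
        data['chain_id'] = data['taskid']
    return data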

@Evildoor (Contributor, Author) commented:

Moved 808280c into a separate branch per email discussion, so that we can finish and merge the existing commits without it hindering us (as it's a somewhat different problem that may require extensive work to deal with):

https://github.com/PanDAWMS/dkb/commits/oracle-es-consistency-limit

@mgolosova (Collaborator) commented:

@Evildoor wrote:

@mgolosova wrote:

the scenario seems not to handle chain_data and chain_id fields properly

Checked - this seems to happen as follows:

  1. 009 produces data with taskid and timestamp.
  2. 016, among other things, checks data for chain_data, finds none, considers it to be missing and sets the task to be a root of its own chain.
  3. 071 sees incorrectly defined chain_data and chain_id, treats them as fields to check, and finds a discrepancy between them and correct values in ES.

Right, thank you for checking. That's what I thought was happening, too.

I believe that this issue should be fixed by adding key --skip-empty-chains to stage 016 that will skip the processing of chain_data if it is missing. Will do so later if no objections will be present.

In fact, there is a similar situation with phys_category:

# Crutch. Remove unwanted (for now) field added by Stage 016.
if 'phys_category' in data:
    del data['phys_category']

I think these two cases (the chain_* and phys_category fields) should be treated similarly,
so handling the chain_* fields just the same way looks like the cheapest solution to me.

Yet there's nothing wrong in asking Stage 016 not to produce extra fields, as you suggest. Except that it would be better to take care of both chain_* and phys_category with this new option.


What is below is definitely out of this PR's scope, yet worth thinking about.

This situation itself says that something went wrong in general: stage 016 does not perform an "atomic" data transformation, and that hinders us from reusing it 'as is'. It means that the "good solution" here would be to split Stage 016 into two smaller, more "atomic" stages: one that takes care of ES indexing information, another transforming/extending/checking task metadata. And maybe the first one should work with any type of input messages, not only those with task metadata -- to move this functionality out of 091 as well, making it more "atomic" too.

Again, this shouldn't be done in this PR; and in view of the coming ES schema reworking (which will heavily affect the data4es process), maybe it isn't worth doing at all. Or maybe it can become the first step towards these changes; we just need to think a bit longer and try to foresee what's coming.
Maybe we should discuss this question at the next meeting (on Fri)?

@Evildoor (Contributor, Author) commented May 21, 2019

This situation itself says that something went wrong in general: stage 016 does not perform an "atomic" data transformation, and that hinders us from reusing it 'as is'.
...
Maybe we should discuss this question at the next meeting (on Fri)?

Good idea, let's do that. Meanwhile, I've implemented this suggestion:

I think these two cases (the chain_* and phys_category fields) should be treated similarly,
so handling the chain_* fields just the same way looks like the cheapest solution to me.

for here-and-now checking of just taskid and timestamp.
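
A minimal sketch of that "crutch"-style filtering (field names as in the discussion above; the stage's actual code may look different):

def drop_stage016_extras(data):
    # Fields added by Stage 016 that should not take part in the check.
    for field in ('phys_category', 'chain_data', 'chain_id'):
        data.pop(field, None)
    return data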

@Evildoor (Contributor, Author) commented May 22, 2019

The _parent field is now passed to get(); other service fields (ones starting with an underscore and not already treated) are removed from the data before the ES request.

If you take a look at PR #244 -- I'm working on some kind of a "common" wrapper around ES with commonly used functions, including "get record [fields] by id[+type][+parent]", etc. When I finish, I most probably will try to adapt your code to use it -- to check that the library module covers all (known) needed use cases, as well as to make sure that things one of us has already stumbled over will not appear again in different ES-using scenarios (and will be handled the same way everywhere). And as I thought about it here -- the "ids" version of get() will be added there as well; so switching back to search() only in view of the batch processing plans may be a premature optimization.

Good to hear, we definitely need a common library for ES-related work.
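
A sketch of the handling described above (helper and variable names are illustrative; Python 2 elasticsearch-py, as elsewhere in this thread):

def fetch_for_check(es, index, data):
    # '_id' and '_type' select the document; '_parent', if present, is
    # passed to get() so that child documents (e.g. output datasets) can
    # be found; other underscore-prefixed "service" fields are dropped.
    kwargs = {'index': index, 'doc_type': data['_type'], 'id': data['_id']}
    if '_parent' in data:
        kwargs['parent'] = data['_parent']
    fields = dict((k, v) for k, v in data.items() if not k.startswith('_'))
    doc = es.get(**kwargs)
    return fields, doc.get('_source', {})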

@mgolosova previously approved these changes May 27, 2019
@mgolosova (Collaborator) left a comment

This PR is approved for further merging.
Yet there are a couple of insignificant comments, so before I merge things, please let me know when you have taken a look at them and decided whether to change anything.

@Evildoor (Contributor, Author) commented:

@mgolosova, please review again.

Type of _id is unknown - it can be str or int for task, and str for dataset.
The _parent field is required to get child documents such as output datasets.
Service fields are different from data fields and shouldn't be checked.
These are unnecessary because library files are not supposed to be executed.
@mgolosova merged commit 51accc9 into master May 28, 2019
@mgolosova deleted the oracle-es-consistency branch May 28, 2019 13:22