Fixing purview test issues and improve performance #350

Merged
xiaoyongzhu merged 12 commits into main on Jun 15, 2022

Conversation

xiaoyongzhu
Member

@xiaoyongzhu xiaoyongzhu commented Jun 11, 2022

Currently there are two issues:

  1. Failed to connect to PurView caused CI tests failure #349: connecting to Purview always yields this error:
E           requests.exceptions.ConnectionError: HTTPSConnectionPool(host='some-purview-name.catalog.purview.azure.com', port=443): Max retries exceeded with url: /api/atlas/v2/types/typedefs (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f90c3e7ea60>: Failed to establish a new connection: [Errno 111] Connection refused'))
  2. Listing entities in Purview is extremely slow.

The current way we get or list features for a project isn't very scalable. We first fetch all the entities and then filter out the ones we need on the client side, which sometimes causes the service to throttle. We also call Purview repeatedly for entities that we have already fetched.

This PR solves these issues by issuing a server-side filtering query and by optimizing the get_features_from_registry logic to avoid duplicate Purview calls.

Call time of get_features_from_registry() is reduced from ~20s to around 3s.
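
For illustration only, here is a minimal sketch of the two ideas in this description: letting the catalog apply the filter, and remembering entities that were already fetched. It is not the code in this PR. `FakeCatalog`, `CachedReader`, and the `project__feature` naming are hypothetical stand-ins so the example runs without a Purview account.

```python
from typing import Dict, List


class FakeCatalog:
    """Hypothetical in-memory stand-in for a remote catalog; counts round trips."""

    def __init__(self, entities: List[dict]) -> None:
        self._by_guid = {e["guid"]: e for e in entities}
        self.calls = 0

    def list_all(self) -> List[dict]:
        # Every entity crosses the wire, whatever the caller needs.
        self.calls += 1
        return list(self._by_guid.values())

    def search(self, prefix: str) -> List[dict]:
        # The filter runs on the "server"; only matches are returned.
        self.calls += 1
        return [e for e in self._by_guid.values()
                if e["qualifiedName"].startswith(prefix)]

    def get(self, guid: str) -> dict:
        self.calls += 1
        return self._by_guid[guid]


class CachedReader:
    """Remembers entities already fetched so each GUID costs one call."""

    def __init__(self, catalog: FakeCatalog) -> None:
        self._catalog = catalog
        self._cache: Dict[str, dict] = {}

    def get(self, guid: str) -> dict:
        if guid not in self._cache:
            self._cache[guid] = self._catalog.get(guid)
        return self._cache[guid]


catalog = FakeCatalog([
    {"guid": "g1", "qualifiedName": "proj_a__f1"},
    {"guid": "g2", "qualifiedName": "proj_a__f2"},
    {"guid": "g3", "qualifiedName": "proj_b__f1"},
])
matches = catalog.search("proj_a__")   # one call, only proj_a entities returned
reader = CachedReader(catalog)
for guid in ["g1", "g2", "g1", "g1"]:  # repeated GUIDs are served from the cache
    reader.get(guid)
print(len(matches), catalog.calls)
```

The call counter makes the saving visible: four lookups cost two remote calls, because the repeated GUID is served from the dictionary.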

@xiaoyongzhu xiaoyongzhu linked an issue Jun 11, 2022 that may be closed by this pull request
@xiaoyongzhu xiaoyongzhu changed the title Fixing purview issues Fixing purview test issues and improve performance Jun 11, 2022
@xiaoyongzhu
Member Author

@YihuiGuo this should also fix your earlier issue

@windoze
Member

windoze commented Jun 13, 2022

Could you please add more technical details and investigation about the bug? I'm not sure what actually happened.
To confirm that server-side throttling is the root cause, you may need to relax the server-side parameters and check whether that fixes the problem; otherwise the root cause could be somewhere else.

blrchen
blrchen previously approved these changes Jun 13, 2022
@xiaoyongzhu
Member Author

> Could you please add more technical details and investigation about the bug? I'm not sure what actually happened. To confirm that server-side throttling is the root cause, you may need to relax the server-side parameters and check whether that fixes the problem; otherwise the root cause could be somewhere else.

As described in the PR, the search_entities API in the underlying pyapacheatlas package is implemented here (https://github.com/wjohnson/pyapacheatlas/blob/master/pyapacheatlas/core/discovery/purview.py#L226). It uses a while True loop to keep calling the query API, and so might get throttled.

More details are available in this issue: wjohnson/pyapacheatlas#206. I talked with Will offline, and he agreed to add some backoff to the search_entities API.
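
As a rough sketch of what "some backoff" around a paginated `while True` loop could look like: the code below is not the pyapacheatlas implementation. `ThrottledError`, `search_all`, and the `fetch_page` callback are hypothetical names, and the retry count and delays are arbitrary assumptions.

```python
import random
import time
from typing import Callable, Iterator, List


class ThrottledError(Exception):
    """Hypothetical error a client could raise when the service throttles it."""


def search_all(fetch_page: Callable[[int], List[dict]],
               max_retries: int = 5,
               base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield every result, page by page, backing off when throttled."""
    offset = 0
    while True:
        for attempt in range(max_retries + 1):
            try:
                page = fetch_page(offset)
                break
            except ThrottledError:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                # Exponential backoff plus jitter, so retries spread out.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
        if not page:
            return  # an empty page means there is nothing left
        yield from page
        offset += len(page)
```

Passing `sleep` as a parameter keeps the function testable: a test can record the requested delays instead of actually waiting.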

@xiaoyongzhu
Member Author

Also updated the PR description to make it a bit more descriptive.

Collaborator

@blrchen blrchen left a comment

:shipit:

@windoze
Member

windoze commented Jun 14, 2022

The canonical way to retrieve an entity or entities from PurView is to fetch by GUID, and we already store the related GUIDs in our data model.
Querying/searching is not the correct approach to load a project and may return unwanted results if this PurView contains data other than what Feathr created. It only fits for-eyes-only scenarios, not cases that require accuracy and consistency, such as the Feathr client.
Please re-think and re-design the whole solution.
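
To make the accuracy concern concrete, here is a small sketch with made-up data. The entity names, the `created_by` field, and the `project_guids` mapping are all hypothetical; the point is only that a prefix match can pick up entities that belong to another project or another tool, while a stored GUID list cannot.

```python
from typing import Dict, List

# Hypothetical catalog contents: two Feathr projects plus an unrelated asset.
entities = [
    {"guid": "g1", "qualifiedName": "nyc_taxi__f_trip_distance", "created_by": "feathr"},
    {"guid": "g2", "qualifiedName": "nyc_taxi_v2__f_trip_distance", "created_by": "feathr"},
    {"guid": "g3", "qualifiedName": "nyc_taxi_raw_table", "created_by": "other_tool"},
]

# Hypothetical data model: each project records the GUIDs it owns.
project_guids: Dict[str, List[str]] = {"nyc_taxi": ["g1"]}


def load_by_prefix(prefix: str) -> List[str]:
    """Search-style lookup: returns anything whose name starts with the prefix."""
    return [e["guid"] for e in entities if e["qualifiedName"].startswith(prefix)]


def load_by_guid(project: str) -> List[str]:
    """GUID-style lookup: returns exactly what the project recorded."""
    return list(project_guids[project])


print(load_by_prefix("nyc_taxi"))  # also matches the v2 project and the raw table
print(load_by_guid("nyc_taxi"))    # only the entity the project owns
```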

@blrchen
Collaborator

blrchen commented Jun 14, 2022

> The canonical way to retrieve an entity or entities from PurView is to fetch by GUID, and we already store the related GUIDs in our data model.

I agree. When the CLI saves data to Purview, it does not specify any hints to partition data by project, which means search via startsWith might still experience perf issues as data volume grows. Since the CLI already writes lineage relationships (for example, a project contains feature/anchor/derived entities), using a GUID list to fetch registered features sounds more efficient and scalable.
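
A minimal sketch of the GUID-list idea, assuming each entity records the GUIDs of the entities it contains: `load_project`, `chunked`, the `children` field, and the `get_entities` callback are hypothetical names for illustration, not Feathr or pyapacheatlas APIs. The traversal walks the relationship level by level and fetches each level in batches, so the number of remote calls grows with the depth of the project rather than with the size of the catalog.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list of GUIDs into batches of at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def load_project(get_entities: Callable[[List[str]], List[dict]],
                 project_guid: str,
                 batch_size: int = 50) -> Dict[str, dict]:
    """Fetch a project and everything reachable from it, by GUID only."""
    loaded: Dict[str, dict] = {}
    seen = {project_guid}
    frontier = [project_guid]
    while frontier:
        next_frontier: List[str] = []
        for batch in chunked(frontier, batch_size):
            for entity in get_entities(batch):  # one remote call per batch
                loaded[entity["guid"]] = entity
                for child in entity.get("children", []):
                    if child not in seen:       # never request a GUID twice
                        seen.add(child)
                        next_frontier.append(child)
        frontier = next_frontier
    return loaded
```

Entities that the project does not reference are never requested, which also avoids picking up data that other tools wrote to the same catalog.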

@xiaoyongzhu xiaoyongzhu merged commit cd6558a into main Jun 15, 2022
@xiaoyongzhu
Member Author

> The canonical way to retrieve an entity or entities from PurView is to fetch by GUID, and we already store the related GUIDs in our data model. Querying/searching is not the correct approach to load a project and may return unwanted results if this PurView contains data other than what Feathr created. It only fits for-eyes-only scenarios, not cases that require accuracy and consistency, such as the Feathr client. Please re-think and re-design the whole solution.

Agreed, but the goal of this PR is not to solve all of those issues. I have a separate PR that addresses them; please take a look: #368

bozhonghu pushed a commit that referenced this pull request Jun 15, 2022
* main:
  Fixing purview test issues and improve performance (#350)
  [feathr] Add product_recommendation advanced sample (#348)
  obejectId query cmd update (#360)
  add license, release, docs, python api ref badges with shields img (#357)
  quick fix the 404 not found in read me link (#355)
  Python SQL Registry (#311)
  enable JWT token param in frontend API calls (#337)
  Optimize environment variable behavior (#333)
  Adding better warning message to let user know that config file is missing and they need to set env parameters. (#347)
  Feature Monitoring (#330)
  Windoze/211 maven submission (#334)
  Windoze/211 maven submission (#334)
  Windoze/211 maven submission (#334)
  Fix Synapse quickstart link (#346)
  Show feature details when click feature in lineage graph (#339)
  Update pull_request_push_test.yml
  Update UI README for how to create overrides for local development (#335)
  Update databricks quick start experience (#217)
Development

Successfully merging this pull request may close these issues.

Failed to connect to PurView caused CI tests failure
4 participants