Registry-sweeper upgrade for multitenant registry #120

Closed
tloubrieu-jpl opened this issue Apr 15, 2024 · 15 comments · Fixed by #130
Labels: B14.1, B15.0, i&t.skip (Skip I&T of this task/ticket), s.high (High severity), sprint-backlog, task

Comments

@tloubrieu-jpl
Member

tloubrieu-jpl commented Apr 15, 2024

💡 Description

Using OpenSearch Serverless

We now have a single OpenSearch Serverless URL.

Sweeper still takes a single node as an argument; AWS infrastructure takes care of running the needed sweepers (one for each node). There will be one task definition per node.

Create new roles with read/write access that will be associated with the sweeper's ECS task.

We should not need to sign the HTTP requests to connect to OpenSearch; to do so, we should use an AWS SDK (see https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-sdk.html). @sjoshi-jpl has an example of using this code.
If we want to run sweeper from a local laptop we would still need the signed URLs to be implemented.

⚔️ Parent Epic / Related Tickets

No response

@tloubrieu-jpl
Member Author

Alex is making progress on this ticket.

@jordanpadams
Member

Status: Sagar to take a look at necessary roles to perform this action.

@alexdunnjpl
Contributor

Currently being tested on MCP - image has been pushed to ECR but need ECS settings to continue. Will follow up with @sjoshi-jpl tomorrow

Backwards compatibility with non-MT registry has been manually tested.

@tloubrieu-jpl
Member Author

Deployed in dev but issues with 403 errors.

@alexdunnjpl
Contributor

status: auth is handled, but lidvids appear to be typed as text instead of keyword, causing sweepers to fail
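For anyone checking the same symptom, a minimal sketch of inspecting a mapping response for the field type. The index name `registry` and field name `lidvid` are assumptions for illustration, as are the helper names; the dict shape follows the standard `GET /<index>/_mapping` response body.

```python
from typing import Any, Dict, Optional


def field_type(mapping_response: Dict[str, Any], index: str, field: str) -> Optional[str]:
    """Return the mapped type of a top-level field, or None if it is unmapped."""
    properties = (
        mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    )
    return properties.get(field, {}).get("type")


def require_keyword(mapping_response: Dict[str, Any], index: str, field: str) -> None:
    """Raise early if a field is not mapped as keyword (hypothetical pre-flight check)."""
    actual = field_type(mapping_response, index, field)
    if actual != "keyword":
        raise TypeError(f"{index}.{field} is mapped as {actual!r}, expected 'keyword'")


# Hypothetical response, shaped like the body of GET /registry/_mapping
example = {"registry": {"mappings": {"properties": {"lidvid": {"type": "text"}}}}}
print(field_type(example, "registry", "lidvid"))  # prints: text
```

With opensearch-py, the response dict would come from `client.indices.get_mapping(index="registry")`. Whether the sweepers should run such a check up front is a design question this sketch does not settle.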

@tloubrieu-jpl
Member Author

@alexdunnjpl is unblocked on this ticket and will resume the testing when he is done with the resolution of the data migration.

@alexdunnjpl
Contributor

alexdunnjpl commented May 31, 2024

Status: nontrivial refactoring required.

Need to look at EC2 scratch scripts in aoss branch and existing work in multitenancy-update branch, merge the work (i.e. cherry-pick anything relevant from aoss - DON'T FORGET THE SEARCH-AFTER POLLUTION GUARD), and poach whatever's relevant into the cognito wrapper work, too.

Separate credential instantiation from auth/opensearch-py client instantiation.
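A rough sketch of what that separation could look like, so it is written down. It assumes SigV4 signing through opensearch-py's `AWSV4SignerAuth` with boto3-resolved credentials; that is an assumption, not necessarily what the branches do. All function names, the endpoint, and the region below are hypothetical.

```python
from typing import Any, Dict


def load_credentials() -> Any:
    """Step 1: resolve AWS credentials only (ECS task role, env vars, or local profile)."""
    import boto3  # lazy import: the other steps do not depend on boto3

    return boto3.Session().get_credentials()


def build_auth(credentials: Any, region: str, service: str = "aoss") -> Any:
    """Step 2: turn credentials into a request signer for OpenSearch Serverless."""
    from opensearchpy import AWSV4SignerAuth

    return AWSV4SignerAuth(credentials, region, service)


def client_kwargs(endpoint: str, auth: Any) -> Dict[str, Any]:
    """Step 3a: connection settings, agnostic to how `auth` was produced."""
    return {
        "hosts": [{"host": endpoint, "port": 443}],
        "http_auth": auth,
        "use_ssl": True,
        "verify_certs": True,
    }


def build_client(endpoint: str, auth: Any) -> Any:
    """Step 3b: instantiate the opensearch-py client from prepared auth."""
    from opensearchpy import OpenSearch, RequestsHttpConnection

    return OpenSearch(connection_class=RequestsHttpConnection, **client_kwargs(endpoint, auth))


# Usage (needs boto3, opensearch-py, and valid AWS credentials; values are placeholders):
# creds = load_credentials()
# auth = build_auth(creds, region="us-west-2")
# client = build_client("example.us-west-2.aoss.amazonaws.com", auth)
```

Because `build_client` only receives an `auth` object, a different auth source (for example a basic-auth tuple, or whatever the cognito wrapper ends up producing) could be swapped in without touching the client code.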

Will be straightforward, but if I don't write it down now I'll need to figure it out again later.

@tloubrieu-jpl
Member Author

So more work is needed.

@alexdunnjpl
Contributor

Unsure about current state/status - probably will need to cherrypick/reimplement PR from new branch

#122

@alexdunnjpl
Contributor

@jordanpadams pulling from sprint for now - but if this is a high priority, feel free to add it back. I've lost track of where this sits in terms of priority and being blocked by other work.

@alexdunnjpl
Contributor

Cherrypick complete(?) and open in #130, but not tested yet

@jordanpadams
Member

Status: blocked in testing this on prod.

@alexdunnjpl
Contributor

Status: IAM role created, but looks like permissions need tweaking. Working with @sjoshi-jpl to resolve this.

@alexdunnjpl
Contributor

@jordanpadams @tloubrieu-jpl @sjoshi-jpl Is the ECS sweepers deployment in another ticket? If not, feel free to re-open this if you don't want to create a separate one.

@alexdunnjpl
Contributor

@jordanpadams @tloubrieu-jpl PSA: the ancestry sweeper terminated on MCP prod1 due to an out-of-memory condition.

All other sweepers/nodes have completed successfully. Recommend against attempting to re-run on prod1 - it can wait until we have it deployed to ECS (where we can size accordingly).

Projects
Status: 🏁 Done
3 participants