maintenance/scaling of the platform #428

Closed
1 of 2 tasks
esraneufeld opened this issue Mar 3, 2021 · 19 comments

@esraneufeld

esraneufeld commented Mar 3, 2021

the scrum master wrote:
We are having quite a hard time maintaining the platformS, so that they pass our E2E tests [...]

in this sprint:

  • [...] I would say we could say having a stable E2E with 20 users for this sprint. [...]

in the following ones:

  • [...] then it should be recurring because we always have things to be done in maintenance. [...]
@esraneufeld esraneufeld added the PO issue Created by Product owners label Mar 3, 2021
@sanderegg sanderegg added the Epic Zenhub label (Pleas do not modify) label Mar 3, 2021
@sanderegg

sanderegg commented Mar 25, 2021

Update on sprint Red Panda

E2E

  • Added a Jupyter e2e test that runs in parallel to simulate 20 users on the master deploy and 10 users on each of the staging AWS and production AWS deploys
  • A number of issues were detected and fixed to improve the success rate of the e2e/p2p testing (see platform stability #1426 case)
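
A minimal sketch of how a parallel e2e run like the one above could be driven and scored, assuming each simulated user is an independent coroutine. `simulate_users` and `fake_session` are hypothetical names for illustration only; the actual e2e scripts are not shown in this thread.

```python
import asyncio
from typing import Awaitable, Callable


async def simulate_users(
    n_users: int, session: Callable[[int], Awaitable[None]]
) -> float:
    """Run n_users sessions concurrently and return the fraction that succeeded."""
    results = await asyncio.gather(
        *(session(user_id) for user_id in range(n_users)),
        return_exceptions=True,  # collect failures instead of aborting the whole run
    )
    failures = [r for r in results if isinstance(r, BaseException)]
    return (n_users - len(failures)) / n_users


async def fake_session(user_id: int) -> None:
    """Hypothetical stand-in for one user's scenario (open study, start Jupyter, ...)."""
    await asyncio.sleep(0.01)  # placeholder for real browser/API interaction
    if user_id % 10 == 0:
        raise RuntimeError(f"user {user_id}: service did not start")


success_rate = asyncio.run(simulate_users(20, fake_session))
print(f"success rate with 20 users: {success_rate:.0%}")
```

Collecting exceptions rather than failing fast keeps one broken session from hiding the outcome of the other simulated users.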

Maintenance

  • 3 data loss problems were detected:
    1. Jupyter lab state data was lost due to an (un)archiving issue, fixed through #2192. A DevOps change was also made to ensure automated backups and versioning of the S3 storage, which should prevent losing data. This is currently tested in the on-premise master deploy and will be moved to staging/production (on-premise and AWS) next sprint.
    2. A user opened a shared study and could not see the sharing user's data in a Jupyter lab. After the study was closed, the empty data overwrote the original data, thus losing the original Jupyter lab data. Fixed through #2041.
    3. User A, with a shared study open, got disconnected from the platform for a long time. User B worked on the shared study and closed it properly. When user A eventually reconnected, this overwrote the data of user B. Fixed through #2225.
  • All maintenance cases are listed in platform stability #1426. In the future they will be linked to this issue.
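
For illustration, the third data loss case is a classic stale-write problem. One common guard is optimistic concurrency: every save must name the version it was based on. The sketch below is an assumption about how such a check could look and is not necessarily how #2225 fixed it; `StudyStore` and `StaleWriteError` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


class StaleWriteError(Exception):
    """Raised when a client saves on top of a version it has not seen."""


@dataclass
class StudyStore:
    """Hypothetical in-memory store keeping one versioned payload per study."""

    _data: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def load(self, study_id: str) -> Tuple[int, Any]:
        # Unknown studies start at version 0 with no payload.
        return self._data.get(study_id, (0, None))

    def save(self, study_id: str, payload: Any, expected_version: int) -> int:
        current_version, _ = self.load(study_id)
        if expected_version != current_version:
            raise StaleWriteError(
                f"{study_id}: based on v{expected_version}, store is at v{current_version}"
            )
        self._data[study_id] = (current_version + 1, payload)
        return current_version + 1


store = StudyStore()
store.save("shared-study", "initial", expected_version=0)

version_a, _ = store.load("shared-study")  # user A opens the study, then disconnects
version_b, _ = store.load("shared-study")  # user B opens the same study
store.save("shared-study", "work of user B", expected_version=version_b)

try:
    # User A reconnects and tries to save state based on the old version.
    store.save("shared-study", "stale state of user A", expected_version=version_a)
    stale_write_rejected = False
except StaleWriteError:
    stale_write_rejected = True
```

With this check the reconnecting client gets an error it can handle (for example by reloading) instead of silently overwriting newer data.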

NOTE: these fixes are currently all deployed in master/staging/staging_AWS. They will be deployed to production ASAP.

@sanderegg

sanderegg commented May 5, 2021

Update on sprint Schwarznasenschaf

Improvements

  • Studies/template pagination #2268, #2273, #2292, #2298, to improve platform reactivity
  • Optimize listing of services from backend #2313
  • Allow cancelling request to list services #2302

Bugfixes

  • Creating node link after selecting/moving node works again #2317
  • Storage project copy failing #2301, #2295
  • Notebook service cannot start #117

Testing facilities

E2E

  • Bugfixes in E2E scripts #2311, #2299, #2294
  • Added a studies cleaner script to remove studies left over after failed E2Es #2312

Maintenance

  • Security updates #2307
  • Upgrade testing and tooling requirements #2291

NOTE: these fixes are currently all deployed in master/staging/staging_AWS. They will be deployed to production ASAP.

@sanderegg

sanderegg commented Jun 2, 2021

Update on sprint Chinchilla

Improvements

  • Added static-webserver backend service, which improves static website responsiveness #2342

Bugfixes

  • Revision of fastapi-based backend services startup #2356
  • Fix slow blocking calls at startup #2346
  • Variables passed to computational services are no longer global #2330, #2316
  • Faster access to S3 storage #2329
  • Webserver background task auto-restarts when database connection is invalidated #2246
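
The auto-restart behaviour in the last item can be pictured as a small supervisor loop around the background task. This is a sketch under stated assumptions, not the webserver's actual code: `ConnectionInvalidated`, `supervise`, and `flaky_task` are hypothetical names, and the retry limits are arbitrary.

```python
import asyncio
from typing import Awaitable, Callable


class ConnectionInvalidated(Exception):
    """Hypothetical error signalling that the database connection was lost."""


async def supervise(
    task_factory: Callable[[], Awaitable[None]],
    retry_delay: float = 0.01,
    max_restarts: int = 3,
) -> int:
    """Run the task, restarting it on ConnectionInvalidated; return the restart count."""
    restarts = 0
    while True:
        try:
            await task_factory()
            return restarts
        except ConnectionInvalidated:
            if restarts >= max_restarts:
                raise  # give up and surface the error to the caller
            restarts += 1
            await asyncio.sleep(retry_delay)  # brief pause before reconnecting


attempts = {"count": 0}


async def flaky_task() -> None:
    """Fails on the first two runs, then completes, to exercise the supervisor."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionInvalidated("database connection invalidated")


restarts = asyncio.run(supervise(flaky_task))
```

Bounding the number of restarts keeps a permanently broken connection from turning into a silent infinite loop.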

Testing facilities

E2E

  • Added test for guest study dispatcher #2362

Maintenance

@sanderegg

sanderegg commented Jun 30, 2021

Update on sprint Marmoset

Improvements

  • Improved service configuration approach
    • New settings-library #2395
    • Updates on settings in storage #2369
  • Re-enabled DAT-Core/Pennsieve access after the name change #2391

Bug fixes

Changed

  • Upgrade to Python 3.8.10 in all osparc-simcore services #2079
  • Libraries updates #2394, #2367
  • Cap unarchiving workers to 2 #2384
  • Improved error handling in frontend #2283
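
As an illustration of the worker cap listed above, a bounded thread pool is one simple way to keep at most two unarchiving jobs running at once. This is a sketch only; `unarchive`, the archive names, and the use of `ThreadPoolExecutor` are assumptions, since #2384 is not quoted in this thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_UNARCHIVING_WORKERS = 2  # the cap mentioned in the list above

_lock = threading.Lock()
stats = {"active": 0, "peak": 0}  # tracks how many jobs run at the same time


def unarchive(archive_name: str) -> str:
    """Hypothetical stand-in for unpacking one archive."""
    with _lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # placeholder for the real CPU and I/O heavy work
    with _lock:
        stats["active"] -= 1
    return f"{archive_name}: done"


archives = [f"study-{i}.zip" for i in range(6)]
with ThreadPoolExecutor(max_workers=MAX_UNARCHIVING_WORKERS) as pool:
    results = list(pool.map(unarchive, archives))
```

Extra jobs simply wait in the pool's queue, so bursts of unarchiving requests cannot starve the rest of the service.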

Testing facilities

  • Upgrades of e2e libraries #2352
  • Upgrades of tests linked to Python 3.8 #2387

Open issue / ongoing

  • webserver responsiveness issue (503s)

@sanderegg

sanderegg commented Aug 4, 2021

Update on sprint Wombat

Improvements

Bug fixes

  • Fixes running computational tasks not being aborted #2449
  • Fixes dynamic-sidecar settings in director-v2 #2431
  • Fixes API upload timeout too short #2433
  • Fixes timeout for synchronisation of storage metadata #2420

Changed

  • Webserver responsiveness issues: disabling unused metrics with high cardinality #2452
  • Improve log messages #2430
  • Director-v2 uses settings-library #2427
  • Replaced auto-generated internal storage REST API client #2578

Testing facilities

Open issue / ongoing

  • webserver responsiveness issue (503s) under observation
  • test to check metrics endpoints #2417
  • improving CI workflow speed #2466
  • storage service refactoring #2396
  • webserver service refactoring #2008

@sanderegg

Update on sprint Chevrotain

Improvements

Bug fixes

Changed

  • Refactor of service library (separation of concerns) #2516
  • Upgrade of library, services dependencies #2485, #2524

Testing facilities

  • Added data consistency scripts #2531, #2533, #2535
  • Improve GitHub CI workflow #2510
  • Upgrade testing and tooling dependencies #2475

3rd party services

  • Fix dependencies in Mattward viewer #137

Open issue / ongoing

  • webserver responsiveness issue (503s) under observation
  • test to check metrics endpoints #2417
  • improving CI workflow speed #2466
  • storage service refactoring #2396
  • webserver service refactoring #2008

@sanderegg

sanderegg commented Oct 6, 2021

Update on sprint Capra delle nevi

Bug fixes

  • Frontend: improve logging when retrieving data in a service fails #2574
  • Frontend: Copy thumbnail only if it exists #2548
  • webserver/storage: Fixes failure to copy studies with a lot of data #2542
  • webserver: Handle error when webserver tries to disconnect an already disconnected websocket #2551
  • Frontend: Shared guided mode settings in template #2561

Changed

  • Refactoring webserver module "director-v2" and removing cyclic dependencies #2567
  • Updated settings practices in api-server #2563
  • Refactoring servicelib #2550
  • Upgrade of testing and tooling dependencies #2547
  • Maintenance of dependencies #2545

Testing facilities

  • New scripts to validate project database tables #2550

Open issue / ongoing

  • Adding tracing in fastapi-based services #2558
  • Adding tracing in aiohttp-based services #2559
  • Adding more logs from pending services #2566
  • Refactoring webserver settings #2376
  • webserver responsiveness issue (503s) under observation
  • improving CI workflow speed #2466, #2525
  • storage service refactoring #2396
  • webserver service refactoring #2008
  • service-integration backlog 2409

@sanderegg

sanderegg commented Nov 4, 2021

Update on sprint Anti-PER

Bug fixes

  • Improve CI reliability/stability #2626, #2623, #2624, #2616, #2609, #2600
  • Use director-v0 expected signature #2619
  • Fixes unhandled exception when parsing invalid string #2608
  • Various fixes (codestyle, openapi, dockerignore) #2593

E2E testing

  • Fix 3D e2e after design changes #2601

Changed

  • Ensure director-v2 auto-restart only triggers when src code folder changes #2598, #2596
  • Adding tracing in aiohttp-based services #2559
  • Adding more logs from pending services #2566

Open issue / ongoing

  • Adding tracing in fastapi-based services #2558
  • Refactoring webserver settings #2376
  • webserver responsiveness issue (503s) under observation
  • improving CI workflow speed #2466, #2525
  • storage service refactoring #2396
  • webserver service refactoring #2008
  • service-integration backlog 2409

@sanderegg

sanderegg commented Dec 9, 2021

Update on sprint Meerkat

New

  • Script for listing repo contents #2639
  • New handling of staging hotfixes #2649
  • Add config for GitHub automatic changelog generation upon release #2662

Bug fixes

  • Properly handle timeouts when copying a project #2655
  • Fixes makefile recipe for hotfix-releases #2650
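
A minimal sketch of bounding a long-running project copy with a timeout, assuming the copy is an awaitable operation. `copy_project` and `copy_with_timeout` are hypothetical names, and this is not necessarily what #2655 changed.

```python
import asyncio


async def copy_project(duration: float) -> str:
    """Hypothetical stand-in for copying a project and its data."""
    await asyncio.sleep(duration)  # placeholder for the real copy
    return "copied"


async def copy_with_timeout(duration: float, timeout: float) -> str:
    """Copy a project, reporting a timeout instead of hanging or crashing."""
    try:
        return await asyncio.wait_for(copy_project(duration), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the copy for us; the caller gets a clear outcome.
        return "timed out"


fast = asyncio.run(copy_with_timeout(duration=0.01, timeout=1.0))
slow = asyncio.run(copy_with_timeout(duration=1.0, timeout=0.01))
```

In a real service the timeout branch would typically also clean up any partially copied data before reporting the failure.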

Deployment

  • Fix Storage lazy update of files #2636

Changed

  • Adding tracing in fastapi-based services #2558
  • New interfaces for webserver.director_v2 plugin #2647
  • Maintenance of libraries dependencies and repo tooling #2660, #2663, #2679, #2669, #2676, #2664
  • Remove overload of pytest loop fixture #2674
  • GitHub issue template reworked #2673

Open issue / ongoing

  • Refactoring webserver settings #2376
  • webserver responsiveness issue (503s) under observation
  • improving CI workflow speed #2466, #2525
  • storage service refactoring #2396
  • webserver service refactoring #2008
  • service-integration backlog 2409

@sanderegg

sanderegg commented Jan 26, 2022

Update on sprint Rudolph

New

Bug fixes

Changed

Open issue / ongoing

@mguidon

mguidon commented Feb 17, 2022

@sanderegg @pcrespov These bullet lists are priceless for the quarterly reports! Thanks a lot for updating them.

@sanderegg

sanderegg commented Feb 23, 2022

Update on sprint R. Schumann

New

Bug fixes

Changed

Open issue / ongoing

@sanderegg

sanderegg commented Apr 1, 2022

Update on sprint E. Schackleton

View of this issue on Zenhub March 1 - Apr 3

New

Bug fixes

Changed

Open issue / ongoing

@sanderegg

sanderegg commented Apr 28, 2022

Update on sprint Macarons

View of this issue on Zenhub March 1 - Apr 3

Bug fixes

Changed

Open issue / ongoing

@sanderegg

sanderegg commented Jun 2, 2022

Update on sprint Croissant

Done

Ongoing

Open

@sanderegg

sanderegg commented Jul 3, 2022

Update on sprint Diolkos

Done

Ongoing

Open

@sanderegg

Update on sprint Diolkos

Done

Ongoing

Open

@sanderegg

sanderegg commented Aug 25, 2022

Update on sprint Brutalism

Done

Ongoing

Open

@elisabettai

Closing this one, since we have a new one for y6 (#675)
