Enhance the job run process not to kill its own process, instead let … #1440
Merged
…it to MPM manage.
yhwen requested review from chesterxgchen, IsaacYangSLA, YuanTingHsieh, yanchengnv and nvidianz February 28, 2023 22:33
nvidianz previously approved these changes Feb 28, 2023
/build
YuanTingHsieh approved these changes Mar 1, 2023
/build
guopengf pushed a commit to holgerroth/NVFlare that referenced this pull request Mar 9, 2023
NVIDIA#1440)
* Enhance the job run process not to kill its own process, instead let it to MPM manage.
* refactored.
---------
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
chesterxgchen added a commit that referenced this pull request Apr 8, 2023
* Multi-process worker integration with FCI cellnet (#1393) * FL server integrate with FCI cellnet. * Codestyle reformat. * fed_server_test.py integrate with Cellnet. * FL client integrate with Cellnet. Client register to server. * Reformat codestyle. * reformat codestyle. * update the message. * Codestyle reformat. * Fixed the import sort. * Addes the PR reviews. * codestyle fix. * made the cell_timeout configurable. * codestyle fix. * removed no use import. * disable simulator_runner_test temporary. * FCI integration for job run. * Fix for the admin auto login. * reformat codestyle. * moved create_admin_server() to utils. * removed the no use import. * sort import. * Changes after the PR reviews. * rolled back the change dh_psi_test.py. * Removed no use import. * PR review changes. * type hint change. * FCI integration multi-gpu changes. * PR review changes. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Quick start [skip ci] (#1385) * update quick start * rewrite quick start * quick start guide * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update * update * update * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * Update based PR comments * update * Move HE from app_common to app_opt. Update app_opt requirements (#1392) * Move HE from app_common to app_opt. Update app_opt requirements management. 
* Address comments * Fix github premerge * Change all to dev for test env. * Use scikit-learn instead of sklearn * Fix circular import * Fix typo * Use dev for test env. * Fix in time model selector (#1401) * Use get cookie instead of get header to get CONTRIBUTION_ROUND * Fix intime model selector issue * Fix HE imports * Fix unit test * Simulator integration with FCI Cellnet (#1398) * simulator integrate with FCI. * codestyle reformat. * Removed the no use import. * Removed no use import. * PR reviews change. * rolled back a change. * Refactored. * Removed no use import. * sort import. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add limit to number of jobs in list_jobs and options to flare_api (#1381) * Add limit to number of jobs in list_jobs and options to flare_api * remove print * Remove print Remove print statement that should not be there --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed close_cb bug and added socket cleanup (#1399) * Merged async TCP driver to dev (#1397) * fix new_insecure_session (#1403) * Update SKLearn readmes and refactor SKLearnExecutor [skip ci] (#1388) * update readmes and refactor SKLearnExecutor add SVC link update return type hints and readme * update type hint * Merged async UDS driver to dev (#1404) * add auc log (#1406) add Homogeneity log * update README for hello-pt on model initialization [skip ci] (#1402) * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Graceful cell stop (#1405) * help graceful cell closing and shutdown * reformat * no need to join daemon thread --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Ha fix (#1407) * 
Fixes for HA. * codestyle fix. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * update README.md (#1408) Co-authored-by: chesterc <n9Z0GoPp5u1Y> * README Update for PSI [skip ci] (#1409) * update README.md * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * update README 3 [skip ci] (#1410) * update README.md * update README.md * update README.md * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Add note for brats18 data access (#1245) * Add a note for brats18 dataset and fix a bug in prostate example * reorganize folder * reorganize folder * update brats link * Readme 4 [skip ci] (#1413) * fix some sentence * fix some sentence * formatting changes * formatting changes * update PSI image and README.md * update PSI image and README.md * update PSI image and README.md * update PSI image and README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Fix integration tests (#1370) * Fix integration tests * Fix dummy yaml * fix yaml * clean up workspace * use secure mode * Increase buffer size * Try not start server * raise exception if things go wrong * Read more lines * Debugging * Use subprocess * Use subprocess as default rather than pty * To be consistent with CI env * Fix admin console test * Update run_integration_tests.sh --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * 1. Added remove_endpoint() (#1414) 2. Unified to max message size to 2GB 3. Fixed the deleting socket file problem. * RESTORE Old README before Release [skip ci] (#1418) * update README.md * update README.md * update README.md * update README.md * RESTORE OLD README before release --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * update fl context to sync correctly; make current round sticky in SaG workflow (#1400) update unit test * Randomize azure client resource group (#1419) * Enahce Simulator to avoid the Cell Error at end run. 
(#1421) * Hide cell cmds (#1420) * hide cell commands * changed for_test to diagnose --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Changed the fetch_task fetch_again without delay. (#1423) * fix default order of jobs in list_jobs command (#1416) * fix default order of jobs in list_jobs command * revise behavior of list_jobs * fix ci * Add back the SimulatorRunner (#1425) * Add required stuff back * Fix year * Move virtual env of all examples to main folder (#1411) * Move virtual env of all examples to main folder * Reverse change to cifar * Remove venv prefix --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Move CIFAR10 example and update CI tests (#1415) * Add debug mode to ci (#1428) * Add debug mode to ci * Undo other changes * Restructure hello-world examples to standardize for tests (#1412) * restructure hello-world examples to standardize for tests * update tests with new locations of jobs * rename job_configs directory to jobs * add missed rename * Move TBReceiver to experiment tracking (#1424) * Move TBReceiver to experiment tracking * Move job_configs to jobs * Change tensorflow to tensorboard * Use setup steps in CI * Update setup.cfg * Add __init__.py to decomposers folder so build system will include it. (#1430) * Add messages at the end of cloud launch scripts so (#1432) users know how to delete the resource group / terminate the EC2 instance * UPDATE PSI README.md (#1434) * Avoid the simulator cell error after END_RUN. (#1431) * Cleaned up logs (#1426) * Enable Simulator to use resources.json. (#1435) * Enable Simulator to use resources.json. * update log. * Fix list jobs command argument parsing bug (#1427) * Fix list job bug * Keep the default behavior the same as 2.2 * Fix CI issue --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed the simulator hang due to missing import. (#1436) * Fixed the simulator hang due to missing import. 
* Added log for the error. * Removed commented out code. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Switch to --use-device-code for all az login cases (#1437) * update nightly build version (#1439) * Enhance the job run process not to kill its own process, instead let … (#1440) * Enhance the job run process not to kill its own process, instead let it to MPM manage. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Remove unused codes (#1442) * Fixed a few QA bugs. (#1445) * Random forest update (#1441) * Fix SnG workflow allowing empty global model for random forest and xgboost * Fix SnG workflow allowing empty global model for random forest and xgboost * Reverse error in auto refectoring * Move the allow empty check * Update readme * Update util functions and folder names * Update util functions and folder names * Add model validation script and results * change server json * Improve POC shutdown (#1438) Change to use new FLARE API remove print statement use insecure_session fix formatting issue Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Randomize resource group to avoid duplicate resource group names (#1450) * Added more detail when recursive data is found in FOBS (#1448) * Fixed the QA test recursive ref issue. (#1451) * Fixed the issue job status not updated to exception when controller e… (#1447) * Fixed the issue job status not updated to exception when controller exception. * Added a job_id in runner_process check. * removed comment out codes. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update integration tests; Add test config auto generation code (#1446) * Update integration tests; Add test config auto generation code * Remove files that should not be checked in * add more options for ci script * Fix handling admin_api response * Update tb streaming test * Shorten ci premerge * Remove unused dependecies * Change test_diff_job_config from POC to HA for clean shutdown --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * README redesign [skip ci] (#1449) * change README to remove Quick start, reduce POC and other in quick start move feature highlights in release node README redesign * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * Change to use new FLARE API * Change to use new FLARE API * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * update * update * update * Add notes for traditional ML and FedSM (#2) * Update README.md * Update README.md * update readme (#3) * update readme * update * more updates * Update README.md * Update README.md * Update release_notes.md * Update release_notes.md * minor text edit * minor text edit --------- Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * Check if resource group exists. If yes, reuse it. 
(#1456) * Move split learning to advanced examples; update release notes (#1457) * Fix admin API issues and support optional messages (#1458) * fix qa issues * reformat * restor executable scripts (#1460) * Fix jupyter notebook FLARE API path issue (#1462) Add codes to set username in jupyter notebook at provisioning time * Silent Reconnect (#1463) * Added more detail when recursive data is found in FOBS * Added silent reconnect * fix shutdown log messages (#1465) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add FedSM example (#562) * Initial deposit for unorganized FedSM example * update readme for FedSM * update config for FedSM * update config for FedSM * update config for FedSM * move fedsm from example to research * change fedsm to use simulator * change fedsm to use simulator * Format compliance * Format compliance * Fully functional FedSM * 3-client version for stable simulation without error * Update tb record plot and testing scripts * Update tb record plot and testing scripts * Code update * fix typos; add citaton * Update readme correct num_clients and datapath * Code update to reflect the latest reviews * Update to reflect suggestions * Update global best model saving and testing scripts * Update global best model saving and testing scripts * Update readme and remove single-line scripts * Update to reflect comments * code refactor and corrections * code refactor for new dev branch * change jobs folder name * latest communication pattern * update the learnable pattern * Remove after train validation * Update to reflect the results under latest dev branch * Add testing results and update curve * Update config, plot, requirement, and readme * Update readme --------- Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Randomize security group in AWS client scripts (#1467) * Fix cifar and auto generated integration tests (#1455) * Update Federated 
Stats to follow the new example structure [skip ci] (#1464) * 1. restructure and example to the standard format : add prepare_data.sh 2. update README.md (due to example structure changes) * 1. restructure and example to the standard format : add prepare_data.sh 2. update README.md (due to example structure changes) * cleanup * cleanup * update Image_stats job as well * restore the original version * remove invalid tests * restructure research folders (#1469) follow template requirements section fix typo restor xgboost example reword * fixed peer context handling in aux runner (#1470) * fixed peer context handling in aux runner * remove unused import * convert PSI to the standard test structure (#1468) * Update docs to have release notes in whats new, new glossary, fixes [skip ci] (#1461) * Update docs to have release notes in whats new, new glossary, fixes * Fix issue with jquery not being available in built docs * address PR comments, link to previous versions of examples, further additions --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Server Listens on All Interfaces (#1471) * fix configuration for readthedocs to build docs with new requirements (#1472) * Fix fl context prop (#1474) * Fix fl context prop * Change to sticky * Fixed exception in list_jobs (#1473) * Added more detail when recursive data is found in FOBS * Fixed exception in list_jobs when no jobs --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed the simulator threads option for multi_gpu case. 
(#1476) * Change Azure VM create to remove warning (#1477) Show Azure VM login info * cleanup error msg; fix sag wait; fix get_task timeout (#1479) * cleanup error msg; fix sag wait; fix get_task timeout * update test case * Fix job runner multiple start issue (#1466) * Start job runner when server is turn to hot * undo changes * Address comments * Address comments * Address comments * Get rid of hello-examples warnings (#1475) * Get rid of hello-examples warnings * Fix import --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update protobuf version (#1478) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Configuration exception handling (#1480) * WIP * fix exception swallow bugs * fix exception swallow bugs * 1) change definition is_class to have path argument with class_path, no need for "args" 2) add more unit tests for no argument case 3) fix test case failure for python3.10 where the failure message changes from version less than 3.9 4) restore example config formatting. * fix api status and dead job message (#1484) * fix list_job in flare api (#1487) * protect server state against multiple state changes (#1489) * Fix loading conf in aws scripts (#1488) Add early stop on error cases in aws * add wait_for_system_shutdown [skip ci] (#1481) * add wait_for_system_shutdown * add wait_for_system_shutdown * add wait_for_system_shutdown * handle code with N answer * Add Jupyter-Lab notebooks [skip ci] (#1482) * Add Jupyter-Lab notebooks 1) getting_started.ipynb 2) install_in_container.ipynb 3) data_frame_fed_stats.ipynb 4) readme update for df_stats * add POC notebook and POC run * add new Notebooks * clean up * 1. clean up 2. remove install_in_container.ipynb * 1. clean up 2. remove install_in_container.ipynb 3. remove exmaples notebook * 1. clean up 2. remove install_in_container.ipynb 3. 
remove exmaples notebook * update * Fix controller timing issue (#1459) * Fix many build warning and issues, more documentation additions [skip ci] (#1486) * fix many build warning and issues, more documentation additions * fix ci --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Splitnn fix (#1485) * fix paths in split learning example add new line in configs * fix circular import * new line --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * minor fixes (#1495) * fix job listing (#1496) * fix job listing * updated test cases * improve authz user print format * Add user guide on cloud deployment (#1497) * add back sections for migrating that were removed (#1498) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * simulator create the clients in parallel. (#1491) * simulator create the clients in parallel. * Changed to use threadpool to create the clients in parallel. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add notebooks for traditional ml examples (#1483) * add notebook for kmeans example * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * use common JupyterLab instructions (#1499) update link update readme restore getting started notebook delete some output * fix wf task exit status handling (#1494) * fix wf task exit status handling * fix dead client detection * update what's new (#1502) * MONAI example updates (#1506) * update instructions & paths * remove virtualenv folders * fix links * Add check on az login exit code (#1504) Add check on derived location and specified location * silent abort message logging (#1505) * fix listjobs detail handling (#1503) * not 
creating internal listener for the job cell. (#1507) * not creating internal listener for the job cell. * create the client internal listener for multi-gpu case. * Update README, Notebook, Fed Stats fix (#1501) * 1. notebook and fed status and README.md * update * update * fix typo * rm unnecessary virtualenv folder (#1512) * Ensure the start_run event for sub_worker_process. (#1514) * Remove things in __init__.py in app_opt (#1508) * Add notebooks for other machine learning methods (#1500) * add notebook for random forest * add notebook for random forest * add notebook for random forest * update readme for random forest * add linear model * add linear model * add linear model * add svm model * add xgboost tree model * add xgboost tree model * add notebook for xgboost tree * add notebook for xgboost histogram * correction to xgboost sharable generator and executor * correction to xgboost sharable generator and executor * correction to xgboost sharable generator and executor * rename job_configs to jobs * remove notebook outputs --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * More docs additions and fixes [skip ci] (#1510) * add notebook for simulator and other docs additions and fixes * remove notebook for simulator and add logging configuration page and info for migrating to 2.3 * Hello World Notebook (New) [skip ci] (#1518) * Added more detail when recursive data is found in FOBS * Added hello-world notebook * Removed N >= 2 * Fix auth test (#1519) * CIFAR-10 Auto-FedRL example (#1283) * Try to fix unsigned commits * refactor ScatterAndGatherAutoFedRL using python inheritance update path to accommodate latest nvflare change Note to TODO add license and update README * remove virtualenv folder * add reproduced results on cifar-10 clean code clean code * remove decomposers in PR add more exp details to README * pt_decomposers -> decomposers * add more util details remove nvflare from req file job_configs -> jobs * correct 
typo and add nvflare req --------- Co-authored-by: Pengfei Guo <pengfeig@nvidia.com> Co-authored-by: Pengfei Guo <32000655+guopengf@users.noreply.github.com> * Limit the ip address range of inbound ssh to creator's public ip only * Add one FAQ item to describe DNS cache/propagation and how to resolve it * update what's new (#1522) * restore set_env.sh (#1513) * Check that requirements are consistent through examples, update doc [skip ci] (#1521) * check that requirements are consistent through examples and add an item to migration notes * one more requirements txt * Doc & Talks updates [skip ci] (#1525) * update what's new * update year * updates to readmes and talks * Add config_type to distinguish (#1526) * Throw exception when connection monitor is not registered (#1520) * Added more detail when recursive data is found in FOBS * Throw exception when no monitor is registered --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Optimize the get_all_clients, move to the training process beginning. (#1524) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * add pengfei to blossim-ci (#1528) * fix job status and speed up fed event end_run (#1523) * fix job status and speed up fedevent end_run * reformat * remove a debug line --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update logging config example (#1530) * Fix a style issue on FAQ about server DNS propagation/caching. * Added Decomposers for HE Classes [skip ci] (#1527) * Added more detail when recursive data is found in FOBS * Added decomposer for CKKSVector * Fixed a supported_type() bug * Black reformat * Black format fix * Renamed he_decomposers to decomposers * When execute has result_error, raise exception instead of simple logging. 
(#1529) * Fix SAG typo (#1536) * Notebooks update [skip ci] (#1541) * repeat POC Setup based on QA inputs * fix typos * fix typos * Add new notebooks and some updates in docs [skip ci] (#1545) * add new notebooks from Kris' DLI and some updates in docs * add image for MONAI * update nvflare version (#1546) * Support direct cell message (#1534) * support direct cell comm * support direct cell msg * improve based on review comments * updated based on review comments --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add a controller_lock to prevent racing condition (#1537) * Add a controller lock * Fix typo * Add link to on-shot-vfl repo (#1548) * Ha authentication fix (#1535) * Added the missing authentication functions for server job process. * notify the server state change to the running jobs. * codestyle fix. * renamed a logger. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Fixed the shared object issue in the controller task return. (#1549) * Fixed the shared object issue in the controller task return. * codestyle fix. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Remove unneeded cancel all task call (#1540) * Add information about ssh source IP * Fix integration tests (#1492) * update scatter & gather messages (#1552) * Limit the FOBS error log size (#1544) * Added more detail when recursive data is found in FOBS * Limit the size of the log message for FOBS errors --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Enhance job meta validator (#1555) * Enhance job meta validator * Fix typo * use python3 command (#1551) * update comments and exception messages[skip ci] (#1559) * update comments and exception messages * add items cache to executor * Remove manual serialize/deserialize for HE components (#1538) * HE refactoring to rm serialize/deserialize calls * fix simplify he aggregation code * run cifar10 with he * reset processed_algorithms * fix unit test * restore fl_context_utils.py * restore docstring formatting * fix weighted aggregation with HE * move bool flag to constructor * only check for the same process algorithm when accepting * formatting * remove abstract decorators when unnecessary; rename class * remove unused aggregation_weights in config * also introduce process_post_get() filter routine * Fix HE * use HECrossSiteModelEval in monai example * fix x-site val misconfig * use encryption during x-site validation * update warning message * add todos --------- Co-authored-by: YuanTingHsieh <yuantingh@nvidia.com> * Fix cell timing (#1558) * fix cell setup timing * fix cell setup timing * fixed list job * make client_cmd channel messages optional * fix invalid client error --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Restructure docs and notebooks as discussed [skip ci] (#1554) * restructure docs and notebooks as discussed * make updates * fix kernel * some more edits --------- Co-authored-by: Chester 
Chen <512707+chesterxgchen@users.noreply.github.com> * update monai integration versions [skip_ci] (#1560) * update monai integration versions * update monai integration versions * Enhance preflight check (#1557) * Add preflight check to non-primary server * fix typo * Change optional to required * Fixed -m option in list_jobs [skip ci] (#1556) * Fix integration tests issues (#1562) * Fix incorrect server status after job aborted and server restarted * Updated a re-activate client error message. (#1567) * Early stop on both AWS/Azure when duplicate servers are launched (by design) Add document on this behavior * Fix abort job with only connected clients (#1563) * Fix a typo * update notebooks based on feedback [skip ci] (#1570) * update notebooks based on feedback * minor notebook change * more fixes and add links to notebooks in READMEs * Fix max client in client_manager (#1572) * Update fed policy example (#1575) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Qa issues (#1568) * QA issues. * Refactored. * Removed commented out lines. * Changed to use logger. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * fix a typo in a script (#1577) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Notebooks upgrade [skip ci] (#1574) * 1) change requirements.txt to make it possible to test 2) update POC and hello_world.ipynb * add provision.ipynb * remove outputs * split notebook pre-, post- run scripts * split notebook pre-, post- run scripts * update * update * update fed stats * update wording * update wording * remove clean up directories * fix RESULT_ERROR in FedStats (#1579) * fix RESULT_ERROR * check potential error condition * check potential error condition * check potential error condition * Fix SAG client result error handling (#1571) * update POC and tutorial storage locations [skip ci] (#1580) * update POC and tutorial storage locations * formatting --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Controller no deepcopy (#1565) * Optimize controller not use deepcopy. * codestyle fix. * removed no used import. * Added interval and task_processed in the log message. * reformatted. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Ensure to end the simulator run after client exception. 
(#1582) * Update xgboost path (#1584) * Notebook and documentation fixes [skip ci] (#1581) * notebook and documentation fixes * revise for PR * add link --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update notebook setup_poc [skip_ci] (#1588) * update notebook * update notebook * add notebook links to README.md (#1585) * Update InitializeGlobalWeights workflow to not require clients (#1576) * update InitializeGlobalWeights workflow to not require clients * add type information * fix typo * handle different input args * addition reorganization of the linking for the documentation (#1591) * Fix provision notebook bugs [ski ci] (#1589) * fix bug * fix bug * minor fix to menu (#1594) * Change job_configs to jobs for consistency (#1596) * Add example of fednlp for NER task using BERT model (#1564) * Add nlp example for NER task using BERT model * minor updates * code polish * add data example * update learner for data loading * further refinement on docstring and pad_token * add seqeval licence * modify metric output and custom folder * format * add ner task details * config correction --------- Co-authored-by: Holger Roth <hroth@nvidia.com> * Ignore unknown task result in SAG (#1595) * Cell no executor pool (#1590) * Optimize controller not use deepcopy. * codestyle fix. * removed no used import. * Added interval and task_processed in the log message. * reformatted. * Changes for measure simulator performance. * Cell not use executor pool. * codestyle. * Removed the no use import. * optimized. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * test client-side model initialization (#1593) * test client-side model initialization * delete unused file * Fixed cell not been stopped properly when config error. (#1597) * Fixed cell not been stopped properly when config error. * added the exception trace. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * fix bugs and cleanup notebooks (#1598) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Ensure the daemon process to re-start client root process will end if error happens (#1578) * Add job submit success to CI (#1601) * Fix typoe in fuel communicator (#1604) * Fix abort job command return message (#1603) * validate client name type in GlobalWeightsInitializer (#1606) * Revert "Ignore unknown task result in SAG (#1595)" (#1607) This reverts commit 4db55be. * fix workspace bug in notebook [skip ci] (#1605) * fix workspace bug * fix workspace bug * fix workspace bug * fix POC command bug (#1609) * fix workspace bug * fix workspace bug * fix workspace bug * fix workspace bug * fix workspace bug * restore some changes * restore dev branch for now * update split learning readme (#1610) * Re-factor PSI and add user email match to CI (#1583) * Add section on run modes and fix description for list_jobs in notebook (#1600) * add section on run modes * fix link * fix description for list_job in the notebook for the FLARE API --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fix various notebook bugs [skip ci] (#1618) * Fix bugs * update * update * update * fix a bug * Don't submit update with task data from old SSID (#1611) * Don't submit update with task data from old SSID * undo other changes * Use fl context instead of cookie * Job status management enhancement (#1613) * job status enhancement. Added HA mode. * codestyle fix. * Added reviews. 
* Add docstring to executor (#1599) * fix controller dead client handling; added stats pool to_dict (#1617) * fix controller dead client handling; added stats pool to_dict * changed to handle all finished job status * remove unused imports; change to use parse_hist_mode * Make consistent the error message for shutdown_system without auth (#1614) * make consistent the error message for shutdown_system without auth * update command * fix ci * make updates as discussed * fix ci * fix ci * more changes from PR feedback --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * more notebooks bug fixes and updates [slip ci] (#1624) * update * fix notebooks --------- Co-authored-by: Zhihong Zhang <100308595+nvidianz@users.noreply.github.com> * Fixes several shutdown related issues (#1608) * Added more detail when recursive data is found in FOBS * Added exit_func to shutdown communicator * fixed the job status for config error. (#1615) * fixed the job status for config error. * Added FINISHED_ABNORMAL state to indicate the job complete with abnormal complete return code. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed job could not run when overseer is offline. (#1625) * Fixed job could not run when overseer is offline. * removed no used import. * Removed the duplicate call. * add qat to repo (#1628) * add qat to repo * fix format * remove combo stuff * Removing UDS (#1616) * Added more detail when recursive data is found in FOBS * Removed UDS drivers * Add link and base readme to fed-ce repo [skip ci] (#1623) * Add link and base readme to fed-ce repo * Add link and base readme to fed-ce repo * Add link and base readme to fed-ce repo * add abstracts to fedsm and fedce --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * Add readme for one-shot VFL paper [skip ci] (#1629) * add readme * update license statement * update the abort_job status after the job complete. 
(#1627) * Change default initial task fetch interval at client side from 0.1 to 0.5 (#1621) * Add missing parent constructor (#1612) * fix POC stop exception (#1620) * Reduced the non-meaningful logs. (#1630) * Reduced the non-meaningful logs. * Added a space in the log. * Clean up fed stats example (#1602) * Clean up fed stats example * Address comments --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add default_task_fetch_interval (#1633) * Fixed a save_workspace error. (#1634) * delay the overseer agent start for client job worker process. (#1636) * [PSI] add fl_ctx to finalize() and fix bug (#1638) * update README.md * update README.md * update README.md * update README.md * Create index.html * Add fl_ctx to finalize() method * remove extra files --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Scripts refactoring and notebooks bug fixes/update [skip ci] (#1635) * refactor shutdown_system * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * include optional requirements * move the start and shutdown system to api_utils.py * Fix AIO task cancellation and improve abort_job (#1637) * add qat to repo * fix format * remove combo stuff * fix aio task cancellation; improve abort_job cmd * Fix CI (#1639) * update the aborted job status immediately (#1640) * update the aborted job status immediately. * Enhance the shutdown server running job check. * remove the _ensure_daemon_process_shutdown which caused restart fail. 
(#1642) * Correction to xgboost requirements files [skip ci] (#1641) * correction to xgboost requirements files * update xgboost version * Add GPT-2 model (#1626) * add got-2 functionality with corrected data loading and align * add got-2 functionality with corrected data loading and align * add got-2 functionality with corrected data loading and align * remove residules from notebook execution * add creating model message * update model diff computation --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Print job schedule result (#1631) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Do not shutdown job runner when server turn to cold state (#1619) * Do not shutdown job runner when server turn to cold state * Fix review comments * address comments * use 1 arg instead of 2 args * Fix file license headers (#1643) * Fix header year * Fix issues * Update run test * Add to documentation (#1644) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Use secure logging for exceptions (#1645) * Fixed the server_command_agent AUTHENTICATION_ERROR reply. (#1648) * Update the _turn_to_cold to set to ColdState first. 
(#1649) * Improvement on model diff computation (#1647) * adjust the computation of model diff / update * adjust the computation of model diff / update * adjust the computation of model diff / update --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * fix description of list_jobs in FLARE API notebook (#1646) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fix readme typos (#1653) * Change abort_job command to return None (#1650) * add qat to repo * fix format * remove combo stuff * fix aio task cancellation; improve abort_job cmd * change abort_job to return None * do not raise error when closing --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * notebooks tweaks [skip ci] (#1651) * upgrade notebooks * update notebooks * update notebooks * update notebooks * update notebooks --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * fix abort_job in old FLAdminAPI (#1657) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update monai integration notebook [skip ci] (#1652) * Update split nn notebook (#1654) * Update xgboost notebooks [skip ci] (#1655) * Update RF notebook (#1656) * Add notebook info [skip ci] (#1658) * add section on notebook setup to docs, clean up index page * add sentence for VDR feedback * Improve example readme [skip ci] (#1659) * Improve example readme * Add install * update readme * Add markdown link check workflow [skip ci] (#1660) (#1661) * Add markdown link check workflow * Fix links * Fix links * Check modified files only * Remove unused file (#1671) * Update RC to real release (#1668) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Cherry pick docs update to 2.3 branch (#1669) * Add markdown link check workflow [skip ci] (#1660) * Add markdown link check workflow * Fix links * Fix links * Check modified files only * cherry pick docs update to 2.3 branch --------- Co-authored-by: Yuan-Ting Hsieh 
(謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yuhong Wen <yuhongw@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: nvkevlu <55759229+nvkevlu@users.noreply.github.com> Co-authored-by: Zhihong Zhang <100308595+nvidianz@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Pengfei Guo <pengfeig@nvidia.com> Co-authored-by: Pengfei Guo <32000655+guopengf@users.noreply.github.com>
holgerroth
pushed a commit
to holgerroth/NVFlare
that referenced
this pull request
May 15, 2023
NVIDIA#1440) * Enhance the job run process not to kill its own process, instead let it to MPM manage. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
holgerroth
pushed a commit
to holgerroth/NVFlare
that referenced
this pull request
Dec 4, 2023
NVIDIA#1440) * Enhance the job run process not to kill its own process, instead let it to MPM manage. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Fixes # .
Fixes the CI issue in which a job would randomly wait 3600 seconds to shut down before the second job could start.
Description
Instead of monitoring for the job to finish and then killing its own process, the job run process now sends stop() to the Runner to stop the main loop, and leaves the process shutdown to MPM to manage.
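The change in approach can be illustrated with a minimal sketch. This is not the code from this PR: `Runner`, `watch_job`, and `run_job` are hypothetical names, and the real Runner and MPM in the repository may look quite different. The sketch only shows the idea of asking the main loop to stop and returning normally, rather than having the watcher kill the process.

```python
import threading


class Runner:
    """Hypothetical stand-in for the object that owns the job's main loop."""

    def __init__(self):
        self._stop_event = threading.Event()
        self.iterations = 0

    def run(self):
        # Main loop: keeps going until stop() has been called.
        while not self._stop_event.is_set():
            self.iterations += 1
            # Sleep briefly, but wake up immediately if stop() is called.
            self._stop_event.wait(timeout=0.01)

    def stop(self):
        # Cooperative shutdown: only signals the loop, kills nothing.
        self._stop_event.set()


def watch_job(job_done, runner):
    """Wait for the job to finish, then ask the runner to stop.

    The approach being replaced would terminate the process here
    (for example with os.kill on its own PID). This version only
    calls stop() and leaves process teardown to whoever manages
    the process (MPM in the PR description).
    """
    job_done.wait()
    runner.stop()


def run_job():
    runner = Runner()
    job_done = threading.Event()

    watcher = threading.Thread(target=watch_job, args=(job_done, runner), daemon=True)
    watcher.start()

    # Simulate the job finishing shortly after it starts (assumption for the demo).
    timer = threading.Timer(0.05, job_done.set)
    timer.start()

    runner.run()  # returns normally once stop() has been called
    watcher.join(timeout=1.0)
    timer.join(timeout=1.0)
    return runner.iterations
```

Calling `run_job()` returns the number of loop iterations once the simulated job has finished. Because `run()` returns normally, the caller gets control back and any cleanup code after it can still execute, which is not the case when the process kills itself.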