Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testutils/iotlab.py: Randomize the selected node #298

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

MrKevinWeiss
Copy link
Contributor

@MrKevinWeiss MrKevinWeiss commented Jan 29, 2024

It seems there have been some failures mainly due to infrastructure. Specifically the samr21-xpro failing to flash will cause many reruns with the same faulty hardware.

Previously it would just take the first available node in the list, which is deterministic but doesn't help with flakey test reruns. This may cause an issue with distance to other nodes, but if random selection of nodes becomes a problem we would have to introduce node pairing lists... Which is a bit more work.

This is at least a first step.

@MrKevinWeiss
Copy link
Contributor Author

Running tox -e test -- -k "spec04 and task02" locally prevented the selection of the failing samr21-11.saclay.iot-lab.info and instead selected samr21-13.saclay.iot-lab.info which is why this passed.

@MrKevinWeiss
Copy link
Contributor Author

This doesn't actually seem to trigger in all tests at least...

It would appear that

    def _submit(self, site, duration):
        """Submit an experiment with required nodes"""
        api = Api(*self.user_credentials())
        resources = []
        for ctrl in self.ctrls:
            if ctrl.env.get('IOTLAB_NODE') is not None:
                resources.append(exp_resources([ctrl.env.get('IOTLAB_NODE')]))
            elif ctrl.board() is not None:
                board = IoTLABExperiment._archi_from_board(ctrl.board())
                alias = AliasNodes(1, site, board)
                resources.append(exp_resources(alias))
            else:
                raise ValueError("neither BOARD or IOTLAB_NODE are set")
        return submit_experiment(api, self.name, duration, resources)['id']

Is responsible for selecting the nodes.

@MrKevinWeiss
Copy link
Contributor Author

So it must have previously just had the bad nodes blocked... I force pushed and update that gets a list of all available nodes that fit our requirements, then randomly selects them... Maybe we would want to add a flag to use this on not...

@MrKevinWeiss
Copy link
Contributor Author

hmmm I still have to adjust some tests

miri64
miri64 previously requested changes Jan 30, 2024
Copy link
Member

@miri64 miri64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might complicate things, but since you are getting the node information anyway from the site (which IIRC also include the position of the node), would it make sense to have some kind of distance heuristic in the random selection?

testutils/iotlab.py Outdated Show resolved Hide resolved
@MrKevinWeiss
Copy link
Contributor Author

would it make sense to have some kind of distance heuristic in the random selection?

Yes, I can look into that but I imagine it would already complicate an complicated process. Maybe a simpler solution would be to only randomize boards that don't report error conditions (such as m3 boards), typically the grouping of the "special" boards are pretty close together anyways.

Also if it randomly selects a poor choice we can rerun anyways.

@miri64 miri64 dismissed their stale review January 30, 2024 18:34

Then I dismiss my review for now

@MrKevinWeiss
Copy link
Contributor Author

I made the change to only randomize non-m3 nodes... Also IoTlabs fixed the samrs so it is hard for me to reproduce the failure but easy to show it selects random nodes.

It seems there have been some failures mainly
due to infrastructure. Specifically the samr21-xpro
failing to flash will cause many reruns with the same
faulty hardware.

Previously it would just take the first available
node in the list, which is deterministic but doesn't
help with flakey test reruns. This may cause an issue
with distance to other nodes, but if random selection
of nodes becomes a problem we would have to introduce
node pairing lists... Which is a bit more work.

This is at least a first step.
@MrKevinWeiss
Copy link
Contributor Author

I think we should have this in @Teufelchen1

Currently contiki is always selecting the same nodes and they are not working...

@MrKevinWeiss
Copy link
Contributor Author

As soon as I posted that, the nodes started working.

@MrKevinWeiss
Copy link
Contributor Author

Maybe @mguetschow would be interested in looking at this?

Apply miri64s suggestion

Co-authored-by: Martine Lenders <martine.lenders@tu-dresden.de>
Copy link
Contributor

@Teufelchen1 Teufelchen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, did not run it though.

@mguetschow
Copy link
Contributor

mguetschow commented Jul 3, 2024

When I run this locally with tox -- -k "spec04 and task02" --non-RC, I get:

test: commands[0]> pytest -k 'spec04 and task02' --non-RC
======================================= test session starts =======================================
platform linux -- Python 3.11.2, pytest-7.3.2, pluggy-1.5.0 -- /home/mikolai/TUD/Code/Release-Specs/.tox/test/bin/python
cachedir: .tox/test/.pytest_cache
rootdir: /home/mikolai/TUD/Code/Release-Specs
configfile: setup.cfg
plugins: cov-5.0.0, rerunfailures-14.0
collected 136 items / 135 deselected / 1 selected                                                 

04-single-hop-6lowpan-icmp/test_spec04.py::test_task02[nodes0] RERUN                        [100%]
04-single-hop-6lowpan-icmp/test_spec04.py::test_task02[nodes0] RERUN                        [100%]
04-single-hop-6lowpan-icmp/test_spec04.py::test_task02[nodes0] RERUN                        [100%]
04-single-hop-6lowpan-icmp/test_spec04.py::test_task02[nodes0] ERROR                        [100%]

============================================= ERRORS ==============================================
______________________________ ERROR at setup of test_task02[nodes0] ______________________________

local = False, request = <SubRequest 'nodes' for <Function test_task02[nodes0]>>
boards = ['samr21-xpro', 'iotlab-m3'], iotlab_site = 'saclay'

    @pytest.fixture
    def nodes(local, request, boards, iotlab_site):
        """
        Provides the nodes for a test as a list of RIOTCtrl objects
        """
        ctrls = []
        if boards is None:
            boards = request.param
        only_native = all(b.startswith("native") for b in boards)
        for board in boards:
            if local or only_native or IoTLABExperiment.valid_board(board):
                env = {'BOARD': f'{board}'}
                if only_native:
                    # XXX this does not work for a mix of native and non-native boards,
                    # but we do not have these in the release tests at the moment.
                    env["RIOT_TERMINAL"] = "native"
            else:
                env = {
                    'BOARD': IoTLABExperiment.board_from_iotlab_node(board),
                    'IOTLAB_NODE': f'{board}',
                }
            ctrls.append(RIOTCtrl(env=env))
        if local or only_native:
            yield ctrls
        else:
            name_fmt = get_namefmt(request)
            # Start IoT-LAB experiment if requested
            exp = IoTLABExperiment(
                # pylint: disable=C0209
                name="RIOT-release-test-{module}-{function}".format(**name_fmt),
                ctrls=ctrls,
                site=iotlab_site,
            )
            RUNNING_EXPERIMENTS.append(exp)
>           exp.start(duration=IOTLAB_EXPERIMENT_DURATION)

conftest.py:306: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
testutils/iotlab.py:144: in start
    self.exp_id = self._submit(site=self.site, duration=duration)
testutils/iotlab.py:193: in _submit
    return submit_experiment(api, self.name, duration, resources)['id']
.tox/test/lib/python3.11/site-packages/iotlabcli/experiment.py:73: in submit_experiment
    experiment.add_exp_resources(res_dict)
.tox/test/lib/python3.11/site-packages/iotlabcli/experiment.py:605: in add_exp_resources
    self._set_type(resources['type'])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <iotlabcli.experiment._Experiment object at 0x7f24af5abb10>, exp_type = 'alias'

    def _set_type(self, exp_type):
        """ Set current experiment type.
        If type was already set and is different ValueError is raised
        """
        if self.type is not None and self.type != exp_type:
>           raise ValueError(
                "Invalid experiment, should be only physical or only alias")
E           ValueError: Invalid experiment, should be only physical or only alias

.tox/test/lib/python3.11/site-packages/iotlabcli/experiment.py:596: ValueError
======================================== warnings summary =========================================
.tox/test/lib/python3.11/site-packages/_pytest/cacheprovider.py:387
  /home/mikolai/TUD/Code/Release-Specs/.tox/test/lib/python3.11/site-packages/_pytest/cacheprovider.py:387: PytestCacheWarning: cache could not write path /home/mikolai/TUD/Code/Release-Specs/.tox/test/.pytest_cache/v/cache/lastfailed
    config.cache.set("cache/lastfailed", self.lastfailed)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------ generated xml file: /home/mikolai/TUD/Code/Release-Specs/test-report.xml -------------
================= 135 deselected, 1 warning, 1 error, 3 rerun in 93.90s (0:01:33) =================
test: exit 1 (94.30 seconds) /home/mikolai/TUD/Code/Release-Specs> pytest -k 'spec04 and task02' --non-RC pid=240341
test: FAIL ✖ in 1 minute 34.33 seconds
flake8: commands[0]> flake8
flake8: OK ✔ in 0.25 seconds
pylint: commands[0]> pylint conftest.py testutils/ 03-single-hop-ipv6-icmp/ 04-single-hop-6lowpan-icmp/ 05-single-hop-route/ 06-single-hop-udp/ 07-multi-hop/ 08-interop/ 09-coap/ 10-icmpv6-error/ 11-lorawan/

--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

pylint: OK ✔ in 10.16 seconds
black: commands[0]> black --check --diff .
All done! ✨ 🍰 ✨
31 files would be left unchanged.
  test: FAIL code 1 (94.33=setup[0.02]+cmd[94.30] seconds)
  flake8: OK (0.25=setup[0.00]+cmd[0.24] seconds)
  pylint: OK (10.16=setup[0.01]+cmd[10.15] seconds)
  black: OK (0.35=setup[0.00]+cmd[0.35] seconds)
  evaluation failed :( (105.12 seconds)

Edit: does not happen on master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants