-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HLD document on add/remove ports dynamically feature #900
HLD document on add/remove ports dynamically feature #900
Conversation
praveen-li comments are located here: |
The comments from tomer-israel#1:
@praveen-li 20 days ago Owner
@praveen-li praveen-li 20 days ago Owner
@praveen-li praveen-li 20 days ago Owner
@praveen-li praveen-li 20 days ago Owner " If del port is followed by add Port immediately. We may need to validate the config in that case." - why del port and add port immediately is problematic? can you explain why do you think we need to validate from config? |
Xcvrd, buffermgrd, natmgr, natsync – waiting for PortInitDone | ||
|
||
## Init types: | ||
The Dynamic port add/remove configuration will be supported for both types of init types:<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both -> all ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be changed
|
||
## Init with zero ports: | ||
Starting with zero ports requires new SKU for zero ports with these changes:<br /> | ||
**Port_confg.ini** – without entries<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the port_config.ini deprecated and not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be removed
**Port_confg.ini** – without entries<br /> | ||
**Hwsku.json** – without interfaces<br /> | ||
**Platform.json** – without interfaces<br /> | ||
**Sai xml** file – needs to be without port entries. <br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make it more generic name like sai profile? some platforms would have other file than xml file.
Also, do we even have intention to have sai profile without ports? It makes sense to have superset of the ports in profile and up to SONiC to decide what to expose like host interfaces, admin up on ports etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it to sai profile
on our sai profile we have all the ports configured, but probably in the future, I will consult with the team and it will be changed.
# Initialization stage | ||
|
||
|
||
![Init stage](images/init_stage_diagram.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diagram is very hard to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I Enlarged the image
|
||
#### Add port: | ||
|
||
![Add port](images/add_port_diagram.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diagram is very hard to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I Enlarged the image
9. Portsyncd will remove the port entry from state db | ||
|
||
|
||
## Modules that “listen” to changes on config port table & App port table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also stateDB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also state db, I will add it
**No need to change the code** | ||
|
||
#### Teammgrd: | ||
Listen to events from state db, when entry is added -> add the port to lag<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to complete the picture, mention the case when port is removed from config db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be added
• If the portsyncd is “quicker” than the buffermgr the orchagent will try to remove the port from SAI before the buffer configuration was removed.<br /> | ||
• Need to test this scenario in order to check if this race condition is reproducing or it’s rare scenario<br /> | ||
• Solution for this: <br /> | ||
Need to add to orchagent the ability to add the buffer configuration of a port and increase a reference counter for each port, in the same way ACL cfg on port is working. We already have infrastructure for this just need to add the buffer cfg to use it. If a port has with buffer cfg on – this port will not be removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have retry logic implemented before in case port was in-use, see: https://github.com/Azure/sonic-swss/blob/master/orchagent/portsorch.cpp#L3312 , so I think this condition should be handled unless SAI is not implemented right with correct return code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware of this retry implementation, but after trying the current implementation and the proposed one, i prefer the proposed one from these reasone:
in the current implementation we will receive endless orchagent/SAI error messages:
...
Nov 30 11:05:38.892675 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.895147 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.895147 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
Nov 30 11:05:38.895195 r-bulldog-04 WARNING swss#orchagent: :- doPortTask: Failed to remove port 1000000000170, as the object is in use
Nov 30 11:05:38.895230 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.897932 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.897932 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
Nov 30 11:05:38.897932 r-bulldog-04 WARNING swss#orchagent: :- doPortTask: Failed to remove port 1000000000170, as the object is in use
Nov 30 11:05:38.897932 r-bulldog-04 NOTICE swss#orchagent: :- doPortTask: Deleting Port Ethernet0
Nov 30 11:05:38.900494 r-bulldog-04 ERR swss#orchagent: :- meta_is_object_in_default_state: object oid:0x1a000000000005 has non default state on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE: oid:0x190000000004c8, expected NULL
Nov 30 11:05:38.900494 r-bulldog-04 ERR swss#orchagent: :- meta_port_remove_validation: port related object oid:0x1a000000000005 is not in default state, can't remove
...
In the suggested fix we don't have these errors, and the ref count warning message is printed only once
Also, I noticed that the SAI permits to remove the port even when there are few buffer_pg configurations, for example:
removing "BUFFER_PG|Ethernet8|0" will cause the port to be removed even if "BUFFER_PG|Ethernet8|3-4" wasn't removed yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit surprised that the SAI error messages were continuous with current implementation. By design it should retry periodically (e,g, every 1 second). and typical timing issue should be resolved after a few tries. The retry logic should be the same with the ref counter change.
If SAI does not track some of the dependencies, it sounds like a sai redis api bug to me. we should open a issue.
It might be case by case issue, but in general, we should let SAI redis layer to handle the dependencies as that is the final gate anyways. it would be easier this way to onboard any new features with DPB.
@lguohan Any comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand.
I wanted the buffer configuration to be the same as ACL/VLAN/INTERFACE configuration, which uses the ref counter for the dependencies, and before removing a port we check this ref counter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this caveat into the doc itself, so we know this was the choice we made and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
• When port is added to the config db – the speed and the admin state is saved on internal db<br /> | ||
• After port was added the user can add buffer configuration to this port (dynamic or static configuration) and only then the buffermgr will set the buffer configuration on App table<br /> | ||
|
||
• We have rare situation of race condition in the add port flow:<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when applying buffer cfg into port, we should always check if port is ready first, so this race condition should be handled. If this check is not done, we should do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no check if port exist, we can check it before trying to add buffer cfg, if it is not exist need to retry.
I will add it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any description added yet to the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to the document
|
||
|
||
|
||
Suggested change: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify what problem you are trying to address here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a description for it
@zhenggen-xu could you please check recent update following your comments? |
Please resolve/mark the things you have addressed. |
@@ -0,0 +1,301 @@ | |||
# Delete or remove ports dynamically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest we change the HLD title to "dynamic port add and del enhancements" or "Enhancements to add or del ports dynamically", so it is clear that we are fixing a few cases that need to be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to "Enhancements to add or del ports dynamically"
|
||
• We have rare situation of race condition in the add port flow:<br /> | ||
|
||
![possible buffermgr race condition](images/buffermgr_possible_race.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I enlarged the image
|
||
**Note:** This is a new type of init that was never tested and will be supported.<br /> | ||
The zero-port system is a special case of this feature. <br /> | ||
Few PRs were already added in order to support zero ports init.<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR links?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be added to document
**No need to change the code** | ||
|
||
#### PMON - Xcvrd: | ||
Listen to events on cfg port table and update transeiver information <br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have design doc for this? how other platforms can leverage this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have the PR for this:
sonic-net/sonic-buildimage#8422
I will add it to the document
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking for the pmon - xcvrd design document, I still haven't seen it in the latest doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I'm not sure if we have a design document for it
@Junchao-Mellanox , do we have a design document for the xcvrd changes of the dynamic port configuration ?
• When port is added to the config db – the speed and the admin state is saved on internal db<br /> | ||
• After port was added the user can add buffer configuration to this port (dynamic or static configuration) and only then the buffermgr will set the buffer configuration on App table<br /> | ||
|
||
• We have rare situation of race condition in the add port flow:<br /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any description added yet to the doc.
• If the portsyncd is “quicker” than the buffermgr the orchagent will try to remove the port from SAI before the buffer configuration was removed.<br /> | ||
• Need to test this scenario in order to check if this race condition is reproducing or it’s rare scenario<br /> | ||
• Solution for this: <br /> | ||
Need to add to orchagent the ability to add the buffer configuration of a port and increase a reference counter for each port, in the same way ACL cfg on port is working. We already have infrastructure for this just need to add the buffer cfg to use it. If a port has with buffer cfg on – this port will not be removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit surprised that the SAI error messages were continuous with current implementation. By design it should retry periodically (e,g, every 1 second). and typical timing issue should be resolved after a few tries. The retry logic should be the same with the ref counter change.
If SAI does not track some of the dependencies, it sounds like a sai redis api bug to me. we should open a issue.
It might be case by case issue, but in general, we should let SAI redis layer to handle the dependencies as that is the final gate anyways. it would be easier this way to onboard any new features with DPB.
@lguohan Any comments?
add the reason why we chose the ref counter mechanism on Buffer configuration
@zhenggen-xu kindly reminder. Any further comments or should we go ahead and merge? |
@zhenggen-xu kindly reminder. Any further comments or should we go ahead and merge? |
@zhenggen-xu I assume if no further comments, the HLD PR can be merged. Let me know if you have any additional feedback and if not the HLD will be assumed as approved. Thanks. |
as no additional feedback and seems like all the comments were addressed, the PR is merged. |
… added or removed (#2019) - What I did Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are: 1. pg watermark counters 2. pg drop counters 3. queue stat counters 4. queue watermark counters 5. debug counters 6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900 - Why I did it In order to support dynamically add or removed ports on sonic - How I verified it tested manually that the flex counters were added or removed correctly whenever we add or remove ports added new test cases to the following vs tests: test_flex_counters.py test_drop_counters.py test_pg_drop_counter.py Co-authored-by: dprital <drorp@nvidia.com>
… added or removed (sonic-net#2019) - What I did Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are: 1. pg watermark counters 2. pg drop counters 3. queue stat counters 4. queue watermark counters 5. debug counters 6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900 - Why I did it In order to support dynamically add or removed ports on sonic - How I verified it tested manually that the flex counters were added or removed correctly whenever we add or remove ports added new test cases to the following vs tests: test_flex_counters.py test_drop_counters.py test_pg_drop_counter.py Co-authored-by: dprital <drorp@nvidia.com>
sonic-net/sonic-swss#2194 need be added into the code PR list to track. @tomer-israel |
… added or removed (sonic-net#2019) - What I did Add support for adding & removing port counters dynamically each time port was added or removed the counters that were supported are: 1. pg watermark counters 2. pg drop counters 3. queue stat counters 4. queue watermark counters 5. debug counters 6. buffer drop counters and port stat counters are already supported to be added or removed each time port is added/removed Implemented according to the - 'HLD document on add/remove ports dynamically feature' sonic-net/SONiC#900 - Why I did it In order to support dynamically add or removed ports on sonic - How I verified it tested manually that the flex counters were added or removed correctly whenever we add or remove ports added new test cases to the following vs tests: test_flex_counters.py test_drop_counters.py test_pg_drop_counter.py Co-authored-by: dprital <drorp@nvidia.com>
…ter (#2194) - What I did This PR replace PR #2022 Added increasing/decreasing to the port ref counter each time a port buffer configuration is added or removed Implemented according to the - sonic-net/SONiC#900 - Why I did it In order to avoid cases where a port is removed before the buffer configuration on this port were removed also - How I verified it VS Test was added in order to test it. we remove a port with buffer configuration and the port is not removed. only after all buffer configurations on this port were removed - this port will be removed.
…ter (#2194) - What I did This PR replace PR #2022 Added increasing/decreasing to the port ref counter each time a port buffer configuration is added or removed Implemented according to the - sonic-net/SONiC#900 - Why I did it In order to avoid cases where a port is removed before the buffer configuration on this port were removed also - How I verified it VS Test was added in order to test it. we remove a port with buffer configuration and the port is not removed. only after all buffer configurations on this port were removed - this port will be removed.
This HLD document is describing the dynamic port add/remove feature.
Flow diagrams and code changes are described in details.
Also described the flow of initialize without any ports .
Related PRs: