Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC execution stuck when NETCONF server closes session unexpectedly #914

Closed
nitishv opened this issue Apr 16, 2019 · 5 comments
Closed

RPC execution stuck when NETCONF server closes session unexpectedly #914

nitishv opened this issue Apr 16, 2019 · 5 comments
Assignees
Milestone

Comments

@nitishv
Copy link

nitishv commented Apr 16, 2019

Expected Behavior

RPC invocation must exit with an error if the underlying NETCONF session is closed by the server.

Current Behavior

NETCONF session has been established using the NetconfServiceProvider, then an RPC is invoked which gets stuck when NETCONF server is killed or crashed.
The python script never exits, neither does it dump any error messages. The last log message seen is "============= Sending RPC to device ============="

Steps to Reproduce

In the test environment, NCA app container connects to NSO to modify a device config, which is managed by this NSO.
The NSO server is running on 192.168.123.11 port 2022

Steps to get into the test setup:

  1. ssh cw-admin@172.20.80.59 (unicast nvashish@cisco.com for password)
  2. Step into NCA container
    kubectl exec -it $(kubectl get pods|grep robot-nca|grep Running|cut -d' ' -f1) bash
  3. Run the attached script
  4. Kill the NSO NETCONF server when the script is sleeping.

Your Script

test_session_close.py.txt

#!/usr/bin/env python

from ydk.providers import NetconfServiceProvider
from ydk.path import Repository
from ydk.path import Codec
from ydk.types import EncodingFormat
import logging
from time import sleep

log = logging.getLogger('ydk')
log.setLevel(logging.DEBUG)
handler = logging.FileHandler('test_session_close.log')
formatter = logging.Formatter(("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
handler.setFormatter(formatter)
log.addHandler(handler)

router_ip = '192.168.123.11'
DEVICE_CONFIG_TMPL = '''<devices xmlns="http://tail-f.com/ns/ncs"><device><name>%s</name><config>%s</config></device></devices>'''
config_xml = '''<interface xmlns="http://tail-f.com/ned/cisco-ios-xr"><GigabitEthernet><id>0/0/0/1</id><description>test</description></GigabitEthernet></interface>'''
provider = NetconfServiceProvider(repo=Repository('/opt/robot/data/cache'), address='192.168.123.111', username='nso', password='cisco123', port=2022)

log.info("RUNNING start-transaction")
rpc = provider.get_session().get_root_schema().create_rpc("tailf-netconf-transactions:start-transaction")
rpc.get_input_node().create_datanode("target/running")
data = (rpc)(provider.get_session())

log.info("RUNNING edit-config")
rpc = provider.get_session().get_root_schema().create_rpc("ietf-netconf:edit-config")
rpc.get_input_node().create_datanode("target/running")
rpc.get_input_node().create_datanode("config", DEVICE_CONFIG_TMPL%(router_ip,config_xml))
data = (rpc)(provider.get_session())

# kill the NSO netconf server process when this process is sleeping
log.info("************************* SLEEP *************************")
sleep(5)


log.info("RUNNING prepare-transaction")
rpc = provider.get_session().get_root_schema().create_rpc("tailf-netconf-transactions:prepare-transaction")

# this RPC should get stuck
result = (rpc)(provider.get_session())


log.info("RUNNING commit-transaction")
rpc = provider.get_session().get_root_schema().create_rpc("tailf-netconf-transactions:commit-transaction")
result = (rpc)(provider.get_session())

Logs

test_session_close.log

System Information

Linux robot-nca-74f49fcd9-487rz 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Python 2.7.15rc1

ydk==0.7.2.post1
ydk-models-cisco-ios-xr==6.3.2
ydk-models-ietf==0.1.5
ydk-models-ned-ios==5.9.2
ydk-models-ned-ios-xr==6.6.1
ydk-models-openconfig==0.1.2
ydk-models-tailf==6.4.1

Reported by

Nitish Vashishtha (nvashish@cisco.com)
SPNAC / Crosswork / NCA

@ygorelik
Copy link
Collaborator

Reproduced the issue with this test script:

import logging
import os

from ydk.providers import NetconfServiceProvider
from ydk.services import CRUDService
from ydk.errors import YError

from test_utils import enable_logging

from ydk.models.ydktest_oc_nis import openconfig_interfaces as oc_interfaces

def create_interface(provider):
    try:
        interfaces = oc_interfaces.Interfaces()
        interface = interfaces.Interface()
        interface.name = 'Loopback10'
        interface.config.name = 'Loopback10'
        interface.config.description = 'Testing'
        interface.config.enabled = False
        interfaces.interface.append(interface)

        crud = CRUDService()
        r = crud.update(provider, interfaces)
    except YError as err:
        print ("Caught exception: %s" % err.message)

if __name__ == "__main__":

    enable_logging(logging.DEBUG)

    provider = NetconfServiceProvider(
                               address='127.0.0.1',
                               port=12022,
                               username='admin',
                               password='admin')
    # Kill Netconf server
    os.system("pkill confd")
    
    # Attempt to perform regular Netconf operation
    create_interface(provider)

To reproduce the issue

# Start confd server
source ~/confd/confdrc
make -C /home/yan/ydk-workspace/ydk-gen/sdk/cpp/core/tests/confd/ydktest start

# Execute the script
python test_914.py

Observe the python process is hanging until manually killed.

Expected behavior
The script should develop YClientError exception and then normally exit.

@ygorelik
Copy link
Collaborator

Root cause
The Netconf operations actually performed in C++ client code based on libnetconf library. Most of the times when connectivity to server is broken, the library code returns valid values, which then processed by YDK, which then develops exception. But sometimes the C process gets interrupted with SIGNAL message (obtained with gdb):

Program received signal SIGPIPE, Broken pipe.
0x00007ffff6ca699d in __libc_send (fd=4, buf=0xa61c30, n=420, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/send.c:26
26	../sysdeps/unix/sysv/linux/x86_64/send.c: No such file or directory.

Because of unexpected process interruption, the control is not returned to Python process, which then gets in limbo.

@ygorelik
Copy link
Collaborator

After the fixes applied the exception caught as expected and script finished normally.

2019-04-18 14:46:02,943 - ydk - INFO - Path where models are to be downloaded: /home/yan/.ydk/127.0.0.1
2019-04-18 14:46:02,959 - ydk - INFO - Connected to 127.0.0.1 on port 12022 using ssh with timeout of -1
2019-04-18 14:46:02,987 - ydk - INFO - Executing CRUD update operation on [openconfig-interfaces:interfaces]
2019-04-18 14:46:02,988 - ydk - INFO - Executing 'edit-config' RPC on [openconfig-interfaces:interfaces]
2019-04-18 14:46:03,005 - ydk - INFO - ============= Sending RPC to device =============
<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"><edit-config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <target>
    <candidate/>
  </target>
  <config><interfaces xmlns="http://openconfig.net/yang/interfaces" xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0" nc:operation="merge">
  <interface>
    <name>Loopback10</name>
    <config>
      <name>Loopback10</name>
      <description>Testing</description>
      <enabled>false</enabled>
    </config>
  </interface>
</interfaces>
</config>
</edit-config>
</rpc>
2019-04-18 14:46:03,005 - ydk - ERROR - Cannot not receive RPC reply. Not connected to 127.0.0.1: SSH session is not connected
Caught exception:  Cannot not receive RPC reply. Not connected to 127.0.0.1: SSH session is not connected
2019-04-18 14:46:03,006 - ydk - INFO - Disconnected from device

@ygorelik ygorelik self-assigned this Apr 18, 2019
@ygorelik ygorelik added this to the 0.8.3 milestone Apr 18, 2019
ygorelik pushed a commit that referenced this issue Apr 18, 2019
ygorelik pushed a commit to ygorelik/ydk-gen that referenced this issue Apr 18, 2019
@ygorelik
Copy link
Collaborator

I found that the C++ API works fine in all environments when connection to Netconf is broken. The Python API is also working as expected in all the tests with confd, But all the Python tests fail showing the same symptoms when connection is set to ASR9K sandbox. Therefore reopening the bug.

@ygorelik ygorelik reopened this Apr 22, 2019
@ygorelik
Copy link
Collaborator

Resolved the issue by altering libnetconf code (session.c) to prevent endless loops. As a result the Python API started to work and develops YClientError exception as expected. Here is end of the log for original script:

2019-04-23 18:32:31,282 - ydk - INFO - ============= Sending RPC to device =============
<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"><prepare-transaction xmlns="http://tail-f.com/ns/netconf/transactions/1.0"/>
</rpc>
2019-04-23 18:32:31,282 - ydk - DEBUG - NetconfSSHClient: NC session status: 1
2019-04-23 18:32:31,282 - ydk - DEBUG - Trace: Missing message-id in rpc.
2019-04-23 18:32:31,282 - ydk - DEBUG - Netconf SSH Client: sending RPC
2019-04-23 18:32:31,282 - ydk - DEBUG - Trace: Writing message (session 1751):
<?xml version="1.0"?>
<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="4">
  <prepare-transaction xmlns="http://tail-f.com/ns/netconf/transactions/1.0"/>
</rpc>

2019-04-23 18:32:31,283 - ydk - ERROR - Connection error occurred: Writing data into the communication channel failed (Socket error: disconnected).
2019-04-23 18:32:31,283 - ydk - ERROR - Failed to send RPC
<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"><prepare-transaction xmlns="http://tail-f.com/ns/netconf/transactions/1.0"/>
</rpc>
Traceback (most recent call last):
  File "test_session_close.py", line 39, in <module>
    result = (rpc)(provider.get_session())
RuntimeError: YClientError: Failed to send RPC
2019-04-23 18:32:31,284 - ydk - INFO - Disconnected from device
root@robot-nca-f55cd4b49-7jr46:/var/log/robot#

ghost pushed a commit that referenced this issue May 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants