title | description | author | ms.author | ms.topic | ms.custom | ms.date
---|---|---|---|---|---|---
Troubleshoot Azure IoT Operations Preview | Troubleshoot your Azure IoT Operations deployment | kgremban | kgremban | troubleshooting-general | | 01/22/2024
[!INCLUDE public-preview-note]
This article contains troubleshooting tips for Azure IoT Operations Preview.
For general deployment and configuration troubleshooting, you can use the Azure CLI IoT Operations check and support commands.
These commands require Azure CLI version 2.46.0 or higher with the Azure IoT Operations extension installed.
-
Use az iot ops check to evaluate Azure IoT Operations service deployment for health, configuration, and usability. The check command can help you find problems in your deployment and configuration.
-
Use az iot ops support create-bundle to collect logs and traces to help you diagnose problems. The support create-bundle command creates a standard support bundle zip archive you can review or provide to Microsoft Support.
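Before running these commands, you might confirm that the installed Azure CLI meets the 2.46.0 minimum. The following is a minimal sketch, not part of the Azure CLI: version_ge is a hypothetical helper name, and it assumes plain dotted numeric version strings.

```shell
# Hypothetical helper (not an Azure CLI feature): succeeds when the
# version in $1 is greater than or equal to the version in $2.
# Assumes dotted numeric versions such as 2.46.0.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      xi = x[i] + 0; yi = y[i] + 0   # missing parts count as 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0                            # versions are equal
  }'
}
```

One way to use it, assuming the query below matches the output shape of your CLI, is version_ge "$(az version --query '"azure-cli"' -o tsv)" 2.46.0 followed by a check of the exit status.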
If your Data Processor pipeline deployment status is showing as Failed, use the following commands to find the pipeline error codes.
To list the Data Processor pipeline deployments, run the following command:
kubectl get pipelines -A
The output from the previous command looks like the following example:
NAMESPACE NAME AGE
azure-iot-operations passthrough-data-pipeline 2d20h
azure-iot-operations reference-data-pipeline 2d20h
azure-iot-operations contextualized-data-pipeline 2d20h
To view detailed information for a pipeline, run the following command:
kubectl describe pipelines passthrough-data-pipeline -n azure-iot-operations
The output from the previous command looks like the following example:
...
Status:
Provisioning Status:
Error
Code: <ErrorCode>
Message: <ErrorMessage>
Status: Failed
Events: <none>
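When several pipelines fail, it can help to reduce the describe output to just the error fields. The following is a minimal sketch, assuming the Code and Message lines appear as in the example above; pipeline_error is a hypothetical helper name, not a kubectl feature.

```shell
# Hypothetical helper: reads `kubectl describe pipelines ...` output on
# stdin and prints only the lines that start with "Code:" or "Message:".
pipeline_error() {
  awk '
    { sub(/^[ \t]+/, "") }          # drop leading indentation
    /^(Code|Message):/ { print }    # keep the error code and message lines
  '
}
```

For example, kubectl describe pipelines passthrough-data-pipeline -n azure-iot-operations | pipeline_error. The filter keeps any line that starts with Code: or Message:, so review the full output if other sections use the same field names.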
If you see the following message when you try to access the Pipelines tab in the Azure IoT Operations (preview) portal:
Data Processor not found in the current deployment. Please re-deploy with the additional argument to include the Data Processor.
You need to deploy Azure IoT Operations with the optional Data Processor component included. To do this, add the --include-dp argument when you run the az iot ops init command. You must include this argument when you first deploy Azure IoT Operations. You can't add this optional component to an existing deployment.
Tip
If you want to delete the Azure IoT Operations deployment but plan on reinstalling it on your cluster, use the az iot ops delete command.
The troubleshooting guidance in this section is specific to Azure IoT Operations when using the Layered Network Management component. For more information, see How does Azure IoT Operations Preview work in layered network?.
If the Layered Network Management operator install fails or you can't apply the custom resource for a Layered Network Management instance:
- Verify the regions are supported for public preview. Public preview supports eight regions. For more information, see Quickstart: Deploy Azure IoT Operations Preview.
- If there are any other errors in installing Layered Network Management Arc extensions, follow the guidance included with the error. Try uninstalling and reinstalling the extension.
- Verify the Layered Network Management operator is in the Running and Ready state.
- If applying the custom resource with kubectl apply -f cr.yaml fails, the output of the command lists the reason for the error. For example, a CRD version mismatch or a wrong entry in the CRD.
If you repeatedly remove and onboard a cluster with the same machine, you might get an error while Arc-enabling the cluster on nested layers. For example:
Error: We found an issue with outbound network connectivity from the cluster to the endpoints required for onboarding.
Please ensure to meet the following network requirements 'https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli#meet-network-requirements'
If your cluster is behind an outbound proxy server, please ensure that you have passed proxy parameters during the onboarding of your cluster.
-
Run the following command:
sudo systemctl restart systemd-networkd
-
Reboot the host machine.
- Add the --debug parameter when running the connectedk8s command.
- Capture and investigate a network packet trace. For more information, see capture Layered Network Management packet trace.
You can't install Azure IoT Operations components on nested layers. For example, Layered Network Management on level 4 is running, but you can't install Azure IoT Operations on level 3.
-
Verify the nodes can access the Layered Network Management service running on the parent level. For example, run ping <IP-ADDRESS-L4-LNM> from the node.
-
Verify that DNS queries resolve to the Layered Network Management service running on the parent level by using the following command:
nslookup management.azure.com
DNS should respond with the IP address of the Layered Network Management service.
-
If the domain is being resolved correctly, verify the domain is added to the allowlist. For more information, see Check the allowlist of Layered Network Management.
-
Capture and investigate a network packet trace. For more information, see capture Layered Network Management packet trace.
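The DNS check in the steps above can be scripted. The following is a minimal sketch, assuming the common Linux nslookup layout in which the answer section starts with a Name: line followed by Address: lines; resolved_addresses and resolves_to are hypothetical helper names.

```shell
# Hypothetical helpers: both read nslookup output on stdin.
# resolved_addresses prints the addresses listed after the first "Name:"
# line, which skips the address of the DNS server itself.
resolved_addresses() {
  awk '
    /^Name:/ { answer = 1; next }
    answer && /^Address/ { print $2 }
  '
}

# resolves_to succeeds when the expected IP address in $1 is among
# the resolved addresses.
resolves_to() {
  resolved_addresses | grep -Fxq "$1"
}
```

For example, nslookup management.azure.com | resolves_to <IP-ADDRESS-L4-LNM> succeeds only when the answer includes the address of the parent-level Layered Network Management service.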
When installing the Azure IoT Operations components to a cluster, the installation starts and proceeds. However, initialization of one or more of the components (pods) fails.
-
Identify the failed pod:
kubectl get pods -n azure-iot-operations
-
Get details about the pod:
kubectl describe pod [POD NAME] -n azure-iot-operations
-
Check the container image information. If the image download fails, check whether the domain name of the download path is on the allowlist. For example:
Warning Failed 3m14s kubelet Failed to pull image "…
Layered Network Management blocks traffic if the destination domain isn't on the allowlist.
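To pick out the failed pods from a long listing, you can filter on the STATUS column. The following is a minimal sketch, assuming the default kubectl get pods column order for a single namespace (NAME, READY, STATUS, RESTARTS, AGE); not_running_pods is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods -n <namespace>` output on
# stdin and prints the names of pods whose STATUS is neither Running nor
# Completed. Assumes STATUS is the third column.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For example, kubectl get pods -n azure-iot-operations | not_running_pods. A pod can report Running without all of its containers being ready, so also check the READY column.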
-
Run the following command to list the config maps.
kubectl get cm -n azure-iot-operations
-
The output should look like the following example:
NAME DATA AGE
aio-lnm-level4-config 1 50s
aio-lnm-level4-client-config 1 50s
-
The config map whose name ends in -client-config contains the allowlist. To view it, run:
kubectl get cm aio-lnm-level4-client-config -n azure-iot-operations -o yaml
-
All the allowed domains are listed in the output.
In some cases, you might suspect that the Layered Network Management instance at the parent level isn't forwarding network traffic to a particular endpoint, and the failed connection to that required endpoint is causing an issue for the service running on your node. The service you enabled might be trying to connect to a new endpoint after an update, or you might be installing a new Arc extension or service that requires connections to endpoints that aren't on the default allowlist. Usually the error message identifies the connection failure. However, if it gives no clear information about the missing endpoint, you can capture the network traffic on the child node for detailed debugging.
- Install Wireshark network traffic analyzer on the host.
- Run Wireshark and start capturing.
- Reproduce the installation or connection failure.
- Stop capturing.
-
Run the following command to start capturing:
sudo tcpdump -W 5 -C 10 -i any -w AIO-deploy -Z root
-
Reproduce the installation or connection failure.
-
Stop capturing.
Use Wireshark to open the trace file. Look for connection failures or unresponsive connections.
- Filter the packets with the ip.addr == [IP address] parameter. Use the IP address of your custom DNS service.
- Review the DNS queries and responses, and check whether any domain name isn't on the allowlist of Layered Network Management.