# AKS Monitoring: Logs and Alarms

Monitoring your AKS clusters is a fundamental practice for optimizing performance, managing resources, and enhancing security in your containerized applications. In this lesson, we will delve into two critical aspects of AKS monitoring: effective log analysis and setting up custom alerts.

## Log Analytics Workspace and Queries

Log Analytics is the go-to platform for writing, testing, and executing log queries. It offers a wide range of capabilities, from simple queries that filter, sort, and analyze records to advanced queries that perform statistical analyses, allowing you to visualize trends and patterns within your data. 

To access Log Analytics for the AKS cluster in the Azure Portal, navigate to the AKS cluster and select **Logs** in the **Monitoring** section. Accessing Log Analytics will redirect you to the following page:

<p align=center> <img src=images/LogAnalyticsVideo.png width=700 height=450> </p>

This video provides a useful overview of the Log Analytics service, so make sure to watch it before continuing further. Once you've finished watching the introductory video, click **X**, and you will be redirected to a dialog box that contains example queries:


<p align=center> <img src=images/LogQueries.png width=750 height=450> </p>

These queries are excellent starting points, and you can browse or search for queries that align with your requirements. You might find an example query that meets your needs as is; otherwise, you can load an example query into the editor and modify it.

The example queries cover a wide range of topics by which you can sort, such as:

- **Alerts**: This category typically includes queries related to setting up and managing alert rules for monitoring your AKS resources. It helps you define the conditions and thresholds for alerting when specific events occur.

- **Container Logs**: Queries in this category focus on container-level logs, allowing you to dig into the details of what's happening inside your containers. You can filter, search, and analyze logs to troubleshoot issues and monitor application behavior.

- **Find in Table**: These queries are incredibly useful for searching specific data in your logs. They can help you quickly locate information within large datasets, saving you time and effort in pinpointing relevant records.

- **Availability**: The availability category offers queries to assess the uptime and availability of your AKS cluster, including checks on the health of nodes, pods, and services. These queries help you maintain consistent service availability.

- **Performance**: The performance category includes queries that focus on assessing the performance of your AKS cluster. This can involve monitoring key metrics related to CPU and memory usage, network performance, and other performance-related aspects.

Let's run the **Container CPU** query under the **Performance** category as an example. First, you will observe you have two choices, either **Run** or **Load to editor**. By selecting **Run**, you will execute the query immediately, and the results will be displayed in the **Results** section of Log Analytics. Choosing **Load to editor** loads the selected query into the query editor without executing it right away. This allows you to review and potentially edit the query before running it. Let's choose the first option for now.

Running this will redirect you to the Log Analytics main interface:

<p align=center> <img src=images/LogAnalyticsInterface.png width=900 height=500> </p>

To help you make the most of Log Analytics, let's familiarize ourselves with the essential components of its user interface:

- **Top Action Bar**: The top bar provides controls for working with queries in the query window. Here, you can find the following features:

  - **Scope**: Specifies the scope of data to be used for the query. This means you can define whether the query should analyze all the data within a Log Analytics workspace or data specific to a particular resource across multiple workspaces. It's a crucial feature for narrowing down the focus of your query.
  
  - **Run Button**: The **Run** button executes the selected query in the query window. You can also run a query by selecting it and pressing **Shift+Enter**. It initiates the query and displays the results.
  
  - **Time Range**: Allows you to select the time range for the data available to the query. This is essential for specifying the period you want to analyze. If your query includes a time filter, the time range settings will be overridden.

  - **Save Button**: The **Save** button is used to save the query to the **Query Explorer** for the workspace. Query Explorer is a feature in Log Analytics that enables easy access to saved queries.

- **Left Sidebar**: The sidebar on the left presents tables within the workspace, sample queries, and filter options for the active query.

- **Query Window**: The query window provides you with the ability to create, modify, and execute queries using the *Kusto Query Language (KQL)*. KQL is a versatile query language designed for querying and analyzing data within Microsoft services like Log Analytics, Azure Monitor, and more. It offers a structured and SQL-like syntax for data manipulation, filtering, and transformation. 

- **Results Window**: The results are presented in a table format, organized by columns and rows. This view allows you to expand row values, modify the list of columns, sort results, and apply filters. Results can also be presented visually using the **Chart** feature. The chart view transforms query results into various chart types, with the option to choose your preferred chart style and specify columns for the x-axis, y-axis, and series. 
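
To give a feel for how a KQL query is structured, the sketch below assembles one as a Python string: a table name followed by a chain of operators separated by the pipe character. The helper `build_kql` is hypothetical, and the `Perf` table with its `ObjectName`, `CounterName`, `CounterValue`, and `Computer` columns is an assumption about what your workspace collects, so check the table list in the left sidebar before relying on those names.

```python
def build_kql(table, time_window, filters, value_column, bin_size, group_by):
    """Assemble a KQL query: a table name followed by piped operators."""
    stages = [table, f"where TimeGenerated > ago({time_window})"]
    stages += [f"where {condition}" for condition in filters]
    stages.append(
        f"summarize avg({value_column}) by bin(TimeGenerated, {bin_size}), {group_by}"
    )
    return "\n| ".join(stages)


# Table and column names below are assumptions, not taken from this lesson.
query = build_kql(
    table="Perf",
    time_window="1h",
    filters=['ObjectName == "K8SNode"', 'CounterName == "cpuUsageNanoCores"'],
    value_column="CounterValue",
    bin_size="1m",
    group_by="Computer",
)
print(query)
```

The printed text can be pasted into the query window. Because this query carries its own `TimeGenerated` filter, that filter, rather than the **Time Range** control, determines the period analyzed.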

### Hands-On: Exploring AKS Cluster Logs

In this hands-on, we will walk through the process of executing various log queries for an AKS cluster, and demonstrate how to save them for easy access. This allows us to analyze and monitor our AKS cluster efficiently.

#### Step 1: Access Log Analytics Workspace

If you haven't already accessed the Log Analytics interface, navigate to the **Logs** tab for the desired AKS cluster.

#### Step 2: Explore and Save Log Queries

We will explore and save different types of log queries related to our AKS cluster:

**Log 1: Average Nodes CPU Usage Percentage per Minute**

- Begin by tracking the average node CPU usage per minute. To do this, choose **Avg node CPU usage percentage per minute** under the **Alerts** log queries.

- Execute the query to visualize the results. Each row in the results section represents the average node CPU usage within a one-minute time bin.

<p align=center> <img src=images/AverageCPUMin.png width=800 height=300> </p>

- To ensure easy access, save the query by clicking the **Save** button, then selecting **Save as query**. Choose a descriptive name, and categorize it under **Containers**. Organizing AKS-related queries within this category will keep them grouped together for easier access. Finally, hit the **Save** button.
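
If the idea of a one-minute time bin is unfamiliar, the following minimal Python sketch shows the same kind of aggregation on a few made-up samples. The function `average_per_minute` and the sample values are hypothetical; the real query performs this grouping inside Log Analytics.

```python
from collections import defaultdict
from datetime import datetime


def average_per_minute(samples):
    """Group (timestamp, cpu_percent) samples into one-minute bins and average each bin."""
    bins = defaultdict(list)
    for timestamp, cpu_percent in samples:
        # Truncate the timestamp to the start of its minute to find its bin.
        bins[timestamp.replace(second=0, microsecond=0)].append(cpu_percent)
    return {minute: sum(values) / len(values) for minute, values in sorted(bins.items())}


# Made-up node CPU samples (percent).
samples = [
    (datetime(2024, 1, 1, 12, 0, 5), 40.0),
    (datetime(2024, 1, 1, 12, 0, 35), 60.0),
    (datetime(2024, 1, 1, 12, 1, 10), 30.0),
]

result = average_per_minute(samples)
for minute, average in result.items():
    print(minute.isoformat(), average)
```

The two samples that fall within 12:00 are averaged into a single row, while the 12:01 sample forms its own row, which mirrors how each row in the results section corresponds to one bin.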

**Log 2: Average Nodes Memory Usage Percentage per Minute**

- Start by creating a new log query by clicking on the **+** button in the top ribbon of the Log Analytics interface. Now, let's focus on monitoring the average node memory usage per minute. To do this, choose **Avg node memory usage percentage per minute** under the **Alerts** log queries.

- Execute the query to visualize the results. Each row in the results section represents the average node memory usage within a one-minute time bin.

- To ensure easy access, save the query by clicking the **Save** button, then selecting **Save as query**. Choose a descriptive name, and categorize it under **Containers**. Finally, hit the **Save** button.

**Log 3: Pods Count with Phase**

- Start by creating a new log query by clicking on the **+** button in the top ribbon of the Log Analytics interface. Now, let's focus on monitoring the number of pods in various phases. To do this, choose **List all the pods count with phase** under the **Availability** log queries.

- Execute the query to visualize the results. Each row in the results section represents the number of pods in a given phase within a one-minute time bin.

- To ensure easy access, save the query by clicking the **Save** button, then selecting **Save as query**. Choose a descriptive name, and categorize it under **Containers**. Finally, hit the **Save** button.
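
As a rough illustration of what counting pods by phase means, the sketch below tallies a small, made-up pod inventory in Python. The pod names are hypothetical, and time binning is left out to keep the example short; the phase values are the standard Kubernetes pod phases.

```python
from collections import Counter

# Hypothetical snapshot of pods and their current phase.
pods = [
    {"name": "web-7d9f", "phase": "Running"},
    {"name": "web-8c2a", "phase": "Running"},
    {"name": "worker-1b3e", "phase": "Pending"},
    {"name": "migrate-55aa", "phase": "Succeeded"},
    {"name": "report-9e10", "phase": "Failed"},
]

# Count how many pods are in each phase.
phase_counts = Counter(pod["phase"] for pod in pods)
for phase, count in sorted(phase_counts.items()):
    print(phase, count)
```

A growing `Pending` or `Failed` count over successive time bins is the kind of pattern this query helps surface.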

**Log 4: Container Logs**

- Start by creating a new log query by clicking on the **+** button in the top ribbon of the Log Analytics interface. Now, let's focus on monitoring the occurrence of a specific keyword within container logs. To do this, choose **Find a value in Container Logs Table** under the **Container Logs** log queries. Open this log query using the **Load to editor** option.

- Take a look at the query instructions in the query window. You'll need to update the `FindString` value with the desired keyword, which in this case will be `warning`.

<p align=center> <img src=images/CustomizeQuery.png width=800 height=175> </p>

- Execute the query to visualize the results. Each row in the results section represents an instance where the specific keyword was found in the container logs.

- To ensure easy access, save the query by clicking the **Save** button, then selecting **Save as query**. Choose a descriptive name, and categorize it under **Containers**. Finally, hit the **Save** button.
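
The keyword search can be pictured with a small Python sketch that filters made-up log records. The record layout and the function `find_in_logs` are hypothetical, and the sketch uses a case-insensitive substring match, which may differ from the matching operator the example query applies.

```python
def find_in_logs(entries, find_string):
    """Return entries whose message contains find_string, ignoring case."""
    needle = find_string.lower()
    return [entry for entry in entries if needle in entry["message"].lower()]


# Hypothetical container log records.
container_logs = [
    {"container": "web", "message": "Server started on port 8080"},
    {"container": "web", "message": "WARNING: response time above 2s"},
    {"container": "worker", "message": "Job 17 finished"},
    {"container": "worker", "message": "warning: retrying connection"},
]

# Plays the role of the FindString value set in the query window.
find_string = "warning"
matches = find_in_logs(container_logs, find_string)
for entry in matches:
    print(entry["container"], "|", entry["message"])
```

Each element of `matches` corresponds to one row you would expect in the results section: a log entry in which the keyword appears.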

**Log 5: Kubernetes Events**

- Start by creating a new log query by clicking on the **+** button in the top ribbon of the Log Analytics interface. We'll focus on monitoring Kubernetes events. To do this, choose **Kubernetes events** under the **Diagnostic** log queries.

- Execute the query to visualize the results. Each row in the results section represents a Kubernetes event recorded for the cluster.

- To ensure easy access, save the query by clicking the **Save** button, then selecting **Save as query**. Choose a descriptive name, and categorize it under **Containers**. Finally, hit the **Save** button.

#### Step 3: Access Saved Log Queries

- In the Log Analytics interface, look for the **Other** category on the left sidebar. Click it to expand the category and reveal the list of saved queries.

- You will find the saved log queries that you created in the previous steps, organized under the appropriate categories you specified during the save process.

- Select the relevant query you want to run or analyze. The query will load into the query window for further examination. You can now run the selected query, visualize its results, and perform any necessary analysis.

By following these steps, you have gained practical experience in running and saving log queries for your AKS cluster using the Log Analytics workspace. Monitoring logs is essential for several reasons. Logs provide detailed, event-level information about your AKS cluster's operations and can help you identify issues, troubleshoot problems, and ensure compliance with various operational and security standards. 

Unlike metrics, which provide aggregated data and performance statistics, logs offer a granular view of what's happening within your cluster. This granularity enables you to dig deep into specific events and diagnose problems precisely. Effective log monitoring, in conjunction with metrics, offers a comprehensive approach to managing and maintaining your AKS resources, ensuring optimal performance and enhancing security.

## Alarms

Alarms are a fundamental component of any monitoring strategy. They ensure you can detect and address issues promptly, reducing the risk of disruptions and optimizing the performance of your applications.

In this section, we'll walk through the process of creating custom alerts based on metrics and insights. We'll define alert conditions, thresholds, and notification methods to help you set up an effective monitoring system for your AKS clusters.

The process of setting up alarms includes the following steps:

### 1. Define Alert Conditions

Setting up alarms involves defining specific alert conditions that determine when an alert should be triggered. These conditions are typically based on various metrics related to your AKS cluster's performance. Let's dive into how you can define these alert conditions in the Azure Portal:

- Begin by opening the Alerts homepage for your AKS cluster. You can access it by selecting **Alerts** in the **Monitoring** section.

- To set up a new alarm, click on the **Alert rules** in the top ribbon of the page. Here, you'll be able to create and manage your alert rules. This should redirect you to the following page:

<p align=center> <img src=images/AlertRules.png width=900 height=225> </p>

- By default, there are already two alert rules present for **CPU Usage Percentage** and **Memory Working Set Percentage**. The first rule monitors CPU usage and triggers an alert when it exceeds a specified threshold, allowing you to address performance issues caused by high CPU usage. The second monitors memory usage and sends alerts when it crosses a predefined threshold, helping you maintain sufficient memory resources for your workloads. These serve as good examples and can be customized to suit your specific monitoring needs.

- Click the **+ Create** button to initiate the process of setting up a new alert. You'll be prompted to create a new alert rule. Start by specifying the alert's condition. This condition depends on the metric you want to monitor. For example, if you want to keep an eye on your AKS cluster's disk usage, you can select the appropriate metric (**Disk Used Percentage**) under the **Signal name** field and set your conditions.

- Configure the threshold that triggers the alert. For example, let's set the threshold to 90%.

- Next, we need to configure the time settings of the alert rule. The **Check every** setting determines how often the rule checks for alert conditions. We will set this to 5 minutes, meaning the rule evaluates the conditions every 5 minutes. The **Lookback period** specifies how far back the rule looks when evaluating alert conditions. We will set this to 15 minutes, meaning the alert rule will look at the last 15 minutes of data to determine whether the conditions are met.

<p align=center> <img src=images/ExampleAlertRule.png width=950 height=600> </p>
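
To see how the threshold and the two time settings interact, here is a minimal Python simulation of a rule that is checked every 5 minutes, looks back over the last 15 minutes, and fires above 90%. It assumes the rule compares the average of the samples in the window against the threshold; the function `evaluate_rule` and the disk-usage samples are hypothetical and only illustrate the idea, not how Azure Monitor evaluates rules internally.

```python
from datetime import datetime, timedelta


def evaluate_rule(samples, now, lookback, threshold):
    """Return True if the average of the samples in the look-back window exceeds the threshold."""
    window = [value for timestamp, value in samples if now - lookback < timestamp <= now]
    if not window:
        return False
    return sum(window) / len(window) > threshold


start = datetime(2024, 1, 1, 12, 0)
# Made-up disk usage: one sample per minute, rising from 80% by 0.5 points per minute.
samples = [(start + timedelta(minutes=i), 80.0 + 0.5 * i) for i in range(31)]

check_every = timedelta(minutes=5)
lookback = timedelta(minutes=15)
threshold = 90.0

fired = []
now = start + lookback
while now <= start + timedelta(minutes=30):
    if evaluate_rule(samples, now, lookback, threshold):
        fired.append(now)
    now += check_every

print([moment.strftime("%H:%M") for moment in fired])
```

In this made-up series, individual samples pass 90% from 12:21 onward, yet the rule first fires at the 12:30 check, because only then does the 15-minute average exceed the threshold. A longer look-back window smooths out short spikes at the cost of slower detection.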


### 2. Create Action Groups

> *Action groups* are collections of notification preferences and actions that can be bundled together to form a comprehensive response to an alert.

- To create a new action group, click the **+ Create action group** button.

- Start by configuring the basics: the action group name (provide a name that helps you identify its purpose), the display name (for easy reference), the subscription, and its resource group.

<p align=center> <img src=images/ActionGroupBasics.png width=850 height=475> </p>

- Choose the notification type(s) you want for this action group. Options include:

  - **Email**: Sends notifications via email to designated recipients 
  - **SMS**: Sends notifications via SMS to mobile devices
  - **Push**: Sends push notifications to the Azure mobile app or other configured channels
  - **Voice**: Delivers voice call notifications to specified phone numbers
  - **Azure Resource Manager Role**: Sends email notifications to the users assigned to a selected Azure Resource Manager role (for example, Owner) on the subscription

- For this example, we will choose email as the preferred method:

<p align=center> <img src=images/EmailNotification.png width=850 height=475> </p>

- Optionally, you can configure specific actions to run when an alert is triggered, on the **Actions** tab of the **Create action group** page. You can choose from a range of actions, such as integrating webhooks or utilizing Azure Logic Apps or Azure Functions for complex automation. For our example, we will skip this step.

- Click **Review + create** to review the configured settings, ensuring they align with your alert notification requirements. Once you're satisfied, click the **Create** button to create the action group. 
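
Conceptually, an action group is a named bundle of receivers that all get notified when a linked alert fires. The Python sketch below models that idea with two small data classes; the class names, fields, and addresses are hypothetical and do not correspond to any Azure SDK type.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    kind: str    # notification type, e.g. "email", "sms", "push", "voice"
    target: str  # email address or phone number (made up here)


@dataclass
class ActionGroup:
    name: str
    display_name: str
    receivers: list = field(default_factory=list)

    def notify(self, alert_name):
        """Build one notification message per receiver in the group."""
        return [
            f"{receiver.kind} -> {receiver.target}: alert '{alert_name}' fired"
            for receiver in self.receivers
        ]


group = ActionGroup(
    name="aks-ops-action-group",
    display_name="AKS Ops",
    receivers=[
        Receiver("email", "oncall@example.com"),
        Receiver("sms", "+10000000000"),
    ],
)

messages = group.notify("Disk Used Percentage above 90%")
for message in messages:
    print(message)
```

Because the receivers live in the group rather than in the alert rule, the same group can be attached to several rules, which keeps notification settings in one place.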

Once created, you should be redirected back to the **Create an alert rule** page. To finish the process of creating a new alert rule, fill in the necessary information on the **Details** page, including the alert rule name and description. Finally, review and create your new alert rule.

If the alert rule is created successfully, you will find it listed in the alert rules. You may need to click on **Refresh** to ensure the rule appears:


<p align=center> <img src=images/NewAlertRules.png width=950 height=250> </p>

Once an alert rule is created, you can modify it by double-clicking on its name, which will redirect you to the alert rule homepage. Here, select **Edit** to reconfigure the alert rule as desired.

### 3. Responding to Alerts

Setting up alarms for your AKS cluster is just the first step in effective monitoring. Once the alerts are in place, it's crucial to have a well-defined response plan to address issues promptly and efficiently. Here are some important aspects to consider when responding to alerts:

- **Alert Notifications**: When an alert is triggered, notifications are sent to the designated team or individual. Ensure that these notifications are configured to reach the right people who can take immediate action.

- **Acknowledgement and Triage**: Upon receiving an alert, it's essential to acknowledge it promptly. After acknowledgement, perform initial triage to determine the severity of the issue. Some alerts might require immediate action, while others can be part of routine maintenance or troubleshooting.

- **Remediation Actions**: Based on the nature of the alert, execute predefined remediation actions. These actions can include scaling resources, restarting containers, or applying configuration changes to mitigate the issue.

- **Incident Documentation**: Maintain clear documentation of each alert and the actions taken to resolve it. This documentation is valuable for post-incident analysis and for building a knowledge base.

- **Continuous Improvement**: After responding to alerts, conduct post-incident reviews to understand the root cause of the issue and to identify preventive measures. Use these findings to refine your monitoring and alerting strategies.

## Key Takeaways

- Log Analytics is an essential tool for writing, testing, and executing log queries. It provides insights into your AKS cluster operations.
- Saving log queries in Log Analytics allows for easy access and efficient analysis of AKS cluster logs.
- Setting up custom alerts for metrics and insights is a fundamental component of any monitoring strategy. Alerts help detect and address issues promptly, reducing the risk of disruptions and optimizing application performance.
- Defining specific alert conditions based on relevant metrics is the first step in setting up alarms. Azure Portal provides an interface to configure the alert's condition, threshold, and evaluation settings.
- Action groups are collections of notification preferences and actions that define how alerts are handled. These include email, SMS, push, voice, or custom actions. Creating action groups allows for comprehensive alert responses.
- A well-defined response plan when alerts are triggered is essential, including notifications, acknowledgment, triage, remediation, documentation, and continuous improvement.