# Azure Monitor Agent (AMA) Gateway Monitoring Setup

This notebook provides step-by-step guidance for setting up **Azure Monitor Agent (AMA)** to monitor on-premises data gateways used by Microsoft Fabric.

Azure Monitor Agent is Microsoft's **native, enterprise-ready solution** for infrastructure monitoring:

✅ **Zero Maintenance**: Microsoft manages updates and bug fixes  
✅ **Enterprise Scale**: Proven across thousands of machines  
✅ **Built-in Resilience**: Automatic retry, buffering, offline queuing  
✅ **Performance Optimized**: Native agent optimized for log collection  
✅ **Cost Effective**: Standard ingestion pricing, with no separate collector infrastructure to run  

## What This Setup Monitors
- **Gateway Performance Logs**: Query execution metrics, duration, success rates
- **Windows Event Logs**: Gateway service events, errors, warnings
- **System Counters**: CPU, memory, disk usage on gateway machines
- **Ingestion-time Processing**: KQL transforms in Data Collection Rules parse records before they reach Log Analytics

## Step 1: Prerequisites and Environment Setup

In [None]:
# Required Azure CLI extensions and login
import subprocess
import json
import os

print("🔧 Checking Azure CLI and extensions...")

# Check if Azure CLI is installed
try:
    result = subprocess.run(['az', '--version'], capture_output=True, text=True)
    if result.returncode == 0:
        print("✅ Azure CLI is installed")
    else:
        print("❌ Azure CLI not found. Please install: https://docs.microsoft.com/cli/azure/install-azure-cli")
except FileNotFoundError:
    print("❌ Azure CLI not found. Please install: https://docs.microsoft.com/cli/azure/install-azure-cli")

# Install required extensions
extensions = ['monitor-control-service']

for ext in extensions:
    try:
        # --upgrade installs the extension, or updates it if it is already present
        result = subprocess.run(['az', 'extension', 'add', '--name', ext, '--upgrade'],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"✅ Extension '{ext}' installed/updated")
        else:
            print(f"⚠️ Extension '{ext}' installation issue: {result.stderr}")
    except Exception as e:
        print(f"❌ Failed to install extension '{ext}': {e}")

print("\n🔑 Please ensure you're logged into Azure CLI:")
print("   Run: az login")
print("   Run: az account set --subscription 'your-subscription-name'")
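On Windows the Azure CLI is installed as `az.cmd`, and launching it by the bare name `az` through `subprocess` without a shell can raise `FileNotFoundError` even when the CLI is present. A minimal sketch of resolving the executable path first; the helper names `find_cli` and `cli_command` are hypothetical and not part of the notebook:

```python
import shutil


def find_cli(name="az"):
    """Return the full path of a CLI found on PATH, or None if it is missing."""
    return shutil.which(name)


def cli_command(args, name="az"):
    """Build an argument list that starts with the resolved executable path."""
    path = find_cli(name)
    if path is None:
        raise FileNotFoundError(f"{name} was not found on PATH")
    return [path, *args]
```

A call such as `subprocess.run(cli_command(['--version']), capture_output=True, text=True)` could then stand in for the bare `['az', '--version']` list.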

## Step 2: Configure Environment Variables

In [None]:
# Configuration - Update these values for your environment
SUBSCRIPTION_ID = "your-subscription-id"
RESOURCE_GROUP = "your-resource-group"
LOCATION = "canadacentral"  # or your preferred region
LOG_ANALYTICS_WORKSPACE = "your-log-analytics-workspace-name"
DCR_NAME = "FabricGatewayMonitoring-DCR"
GATEWAY_VM_NAME = "your-gateway-vm-name"  # Name of VM running the gateway

# Validate configuration
config_valid = True
required_vars = {
    'SUBSCRIPTION_ID': SUBSCRIPTION_ID,
    'RESOURCE_GROUP': RESOURCE_GROUP, 
    'LOG_ANALYTICS_WORKSPACE': LOG_ANALYTICS_WORKSPACE,
    'GATEWAY_VM_NAME': GATEWAY_VM_NAME
}

print("🔍 Validating configuration...")
for var_name, var_value in required_vars.items():
    if var_value.startswith('your-'):
        print(f"❌ Please update {var_name}: {var_value}")
        config_valid = False
    else:
        print(f"✅ {var_name}: {var_value}")

if config_valid:
    print("\n✅ Configuration looks good!")
else:
    print("\n❌ Please update the configuration values above before proceeding.")
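The placeholder check above only catches values that still start with `your-`. A small sketch of a stricter check, assuming the subscription ID should be a hyphenated GUID, together with a helper that builds the same workspace resource ID string used later in the DCR; both function names are hypothetical:

```python
import uuid


def is_guid(value):
    """Return True if value is a hyphenated GUID such as a subscription ID."""
    try:
        # Compare against the canonical form so unhyphenated input is rejected.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def workspace_resource_id(subscription_id, resource_group, workspace):
    """Build the ARM resource ID of a Log Analytics workspace."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.OperationalInsights/workspaces/{workspace}"
    )
```

For example, `is_guid(SUBSCRIPTION_ID)` could be added to the validation loop before any Azure CLI call is made.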

## Step 3: Create Data Collection Rule (DCR)

In [None]:
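# Create Data Collection Rule JSON configuration.
# The transforms below split each CSV line on commas and read columns by position.
# The positions are assumptions: compare them with the header row of your own
# report files. A plain split also breaks on quoted fields that contain commas.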
dcr_config = {
    "location": LOCATION,
    "properties": {
        "description": "Gateway monitoring via AMA - performance logs and events",
        "dataSources": {
            "logFiles": [
                {
                    "name": "GatewayQueryExecutionLogs",
                    "streams": ["Custom-FabricGatewayPerformance_CL"],
                    "filePatterns": [
                        "C:\\Users\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\QueryExecutionReport*.csv",
                        "C:\\Windows\\ServiceProfiles\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\QueryExecutionReport*.csv"
                    ],
                    "format": "text",
                    # The text-log setting is a single string property
                    "settings": {
                        "text": {
                            "recordStartTimestampFormat": "ISO 8601"
                        }
                    }
                },
                {
                    "name": "GatewayQueryStartLogs",
                    "streams": ["Custom-FabricGatewayQueries_CL"],
                    "filePatterns": [
                        "C:\\Users\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\QueryStartReport*.csv",
                        "C:\\Windows\\ServiceProfiles\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\QueryStartReport*.csv"
                    ],
                    "format": "text"
                },
                {
                    "name": "GatewaySystemCounters",
                    "streams": ["Custom-FabricGatewayCounters_CL"],
                    "filePatterns": [
                        "C:\\Users\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\SystemCounterAggregationReport*.csv",
                        "C:\\Windows\\ServiceProfiles\\PBIEgwService\\AppData\\Local\\Microsoft\\On-premises data gateway\\Report\\SystemCounterAggregationReport*.csv"
                    ],
                    "format": "text"
                }
            ],
            "windowsEventLogs": [
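                {
                    "name": "GatewayServiceEvents",
                    # Assumption to verify: Windows event data sources normally emit the
                    # built-in "Microsoft-Event" stream, so a custom stream may be rejected here.
                    "streams": ["Custom-FabricGatewayEvents_CL"],
                    "xPathQueries": [
                        "On-premises data gateway service/Operational!*[System[Level=1 or Level=2 or Level=3]]",
                        "Application!*[System[Provider[@Name='On-premises data gateway service']]]"
                    ]
                }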
            ]
        },
        "destinations": {
            "logAnalytics": [
                {
                    "workspaceResourceId": f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights/workspaces/{LOG_ANALYTICS_WORKSPACE}",
                    "name": "FabricGatewayWorkspace"
                }
            ]
        },
        "dataFlows": [
            {
                "streams": ["Custom-FabricGatewayPerformance_CL"],
                "destinations": ["FabricGatewayWorkspace"],
                "transformKql": """
                    source
                    | extend CSVData = split(RawData, ',')
                    | extend 
                        QueryTrackingId = tostring(CSVData[0]),
                        GatewayObjectId = tostring(CSVData[1]),
                        DataSource = tostring(CSVData[2]),
                        QueryType = tostring(CSVData[3]),
                        QueryExecutionDuration = toint(CSVData[4]),
                        DataProcessingDuration = toint(CSVData[5]),
                        Success = tobool(CSVData[6]),
                        ErrorMessage = tostring(CSVData[7]),
                        QueryExecutionEndTimeUTC = todatetime(CSVData[8])
                    | extend LogType = "query_execution"
                    | extend CollectionSource = "AMA_FileLog"
                    | project-away RawData, CSVData
                """
            },
            {
                "streams": ["Custom-FabricGatewayQueries_CL"],
                "destinations": ["FabricGatewayWorkspace"],
                "transformKql": """
                    source
                    | extend CSVData = split(RawData, ',')
                    | extend 
                        QueryTrackingId = tostring(CSVData[0]),
                        GatewayObjectId = tostring(CSVData[1]),
                        DataSource = tostring(CSVData[2]),
                        QueryType = tostring(CSVData[3]),
                        QueryTextBase64 = tostring(CSVData[4]),
                        QueryExecutionStartTimeUTC = todatetime(CSVData[5])
                    | extend LogType = "query_start"
                    | extend CollectionSource = "AMA_FileLog"
                    | extend QueryTextDecoded = base64_decode_tostring(QueryTextBase64)
                    | project-away RawData, CSVData
                """
            },
            {
                "streams": ["Custom-FabricGatewayCounters_CL"],
                "destinations": ["FabricGatewayWorkspace"],
                "transformKql": """
                    source
                    | extend CSVData = split(RawData, ',')
                    | extend 
                        GatewayObjectId = tostring(CSVData[0]),
                        CounterName = tostring(CSVData[1]),
                        Max = todouble(CSVData[2]),
                        Min = todouble(CSVData[3]),
                        Average = todouble(CSVData[4]),
                        AggregationStartTimeUTC = todatetime(CSVData[5]),
                        AggregationEndTimeUTC = todatetime(CSVData[6])
                    | extend LogType = "system_counters"
                    | extend CollectionSource = "AMA_FileLog"
                    | project-away RawData, CSVData
                """
            },
            {
                "streams": ["Custom-FabricGatewayEvents_CL"],
                "destinations": ["FabricGatewayWorkspace"],
                "transformKql": """
                    source
                    | extend LogType = "event_log"
                    | extend CollectionSource = "AMA_EventLog"
                    | extend Severity = case(
                        Level == 1, "Critical",
                        Level == 2, "Error",
                        Level == 3, "Warning",
                        Level == 4, "Information",
                        "Unknown"
                    )
                """
            }
        ]
    }
}

# Save DCR configuration to file
dcr_file = "fabric-gateway-dcr.json"
with open(dcr_file, 'w') as f:
    json.dump(dcr_config, f, indent=2)

print(f"✅ DCR configuration saved to {dcr_file}")
print(f"📋 Configuration includes:")
print(f"   - Performance logs: QueryExecution, QueryStart, SystemCounters")
print(f"   - Windows Event Logs: Gateway service events")
print(f"   - KQL transforms: CSV parsing and base64 decoding")
print(f"   - Destination: {LOG_ANALYTICS_WORKSPACE}")
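Azure Monitor's documented text-log examples declare each custom stream under `streamDeclarations` (with `TimeGenerated` and `RawData` columns) and name the target table through `outputStream` in each data flow. The configuration above sets neither. The following is a sketch of filling them in, under two assumptions: each destination table is named after its stream, and the tables already exist in the workspace. The helper name is hypothetical:

```python
import copy


def with_stream_declarations(dcr):
    """Return a copy of a DCR dict with stream declarations and output streams added."""
    result = copy.deepcopy(dcr)
    props = result["properties"]
    declarations = props.setdefault("streamDeclarations", {})

    # Declare every custom stream that a text-log data source emits.
    for source in props.get("dataSources", {}).get("logFiles", []):
        for stream in source.get("streams", []):
            if stream.startswith("Custom-"):
                declarations.setdefault(stream, {
                    "columns": [
                        {"name": "TimeGenerated", "type": "datetime"},
                        {"name": "RawData", "type": "string"},
                    ]
                })

    # Assumption: the destination table shares the stream's name.
    for flow in props.get("dataFlows", []):
        streams = flow.get("streams", [])
        if len(streams) == 1 and streams[0] in declarations:
            flow.setdefault("outputStream", streams[0])
    return result
```

Calling `dcr_config = with_stream_declarations(dcr_config)` before `json.dump` would write the extended configuration to the file. The input dict is left unchanged, so the original can still be inspected.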

## Step 4: Deploy Data Collection Rule

In [None]:
# Deploy the Data Collection Rule
print("🚀 Deploying Data Collection Rule...")

deploy_cmd = [
    'az', 'monitor', 'data-collection', 'rule', 'create',
    '--resource-group', RESOURCE_GROUP,
    '--location', LOCATION,
    '--rule-file', dcr_file,
    '--name', DCR_NAME
]

try:
    result = subprocess.run(deploy_cmd, capture_output=True, text=True, check=True)
    dcr_info = json.loads(result.stdout)
    dcr_resource_id = dcr_info['id']
    
    print(f"✅ Data Collection Rule deployed successfully")
    print(f"   Name: {DCR_NAME}")
    print(f"   Resource ID: {dcr_resource_id}")
    print(f"   Location: {LOCATION}")
    
    # Save DCR resource ID for next steps
    DCR_RESOURCE_ID = dcr_resource_id
    
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to deploy DCR: {e.stderr}")
    print("\n🔍 Troubleshooting:")
    print("   - Ensure you have sufficient permissions")
    print("   - Verify the Log Analytics workspace exists")
    print("   - Check resource group and subscription are correct")
except Exception as e:
    print(f"❌ Unexpected error: {e}")
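Many deployment failures come from a data flow that names a stream or destination the rule never defines. A quick local check before calling the CLI can catch these. This is a sketch with a hypothetical helper name, and it checks only cross-references, not the full DCR schema:

```python
def check_dcr_references(dcr):
    """Return a list of cross-reference problems found in a DCR dict."""
    props = dcr.get("properties", {})
    problems = []

    def as_list(value):
        # Some sections hold a single object instead of a list of objects.
        return value if isinstance(value, list) else [value]

    # Streams are known if declared explicitly or emitted by a data source.
    known_streams = set(props.get("streamDeclarations", {}))
    for sources in props.get("dataSources", {}).values():
        for source in as_list(sources):
            known_streams.update(source.get("streams", []))

    # Destination names across every destination type.
    known_destinations = set()
    for dests in props.get("destinations", {}).values():
        for dest in as_list(dests):
            known_destinations.add(dest.get("name"))

    for index, flow in enumerate(props.get("dataFlows", [])):
        for stream in flow.get("streams", []):
            if stream not in known_streams:
                problems.append(f"dataFlows[{index}]: unknown stream '{stream}'")
        for dest in flow.get("destinations", []):
            if dest not in known_destinations:
                problems.append(f"dataFlows[{index}]: unknown destination '{dest}'")
    return problems
```

An empty list from `check_dcr_references(dcr_config)` means every data flow points at a defined stream and destination.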

## Step 5: Install Azure Monitor Agent on Gateway VM

In [None]:
# Install Azure Monitor Agent extension on the gateway VM
print("🔧 Installing Azure Monitor Agent on gateway VM...")

# First, get the VM resource ID
vm_cmd = [
    'az', 'vm', 'show',
    '--resource-group', RESOURCE_GROUP,
    '--name', GATEWAY_VM_NAME,
    '--query', 'id',
    '--output', 'tsv'
]

try:
    result = subprocess.run(vm_cmd, capture_output=True, text=True, check=True)
    vm_resource_id = result.stdout.strip()
    print(f"✅ Found gateway VM: {vm_resource_id}")
    
    # Install AMA extension
    ama_cmd = [
        'az', 'vm', 'extension', 'set',
        '--resource-group', RESOURCE_GROUP,
        '--vm-name', GATEWAY_VM_NAME,
        '--name', 'AzureMonitorWindowsAgent',
        '--publisher', 'Microsoft.Azure.Monitor',
        '--enable-auto-upgrade', 'true'
    ]
    
    print("   Installing AMA extension (this may take a few minutes)...")
    result = subprocess.run(ama_cmd, capture_output=True, text=True, check=True)
    
    print(f"✅ Azure Monitor Agent installed successfully")
    print(f"   VM: {GATEWAY_VM_NAME}")
    print(f"   Extension: AzureMonitorWindowsAgent")
    print(f"   Auto-upgrade: Enabled")
    
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to install AMA: {e.stderr}")
    print("\n🔍 Troubleshooting:")
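    print("   - Verify the VM name is correct")
    print("   - Ensure the VM is running Windows")
    print("   - Confirm the VM has a managed identity enabled (required by AMA)")
    print("   - Check the VM has outbound connectivity to Azure Monitor endpoints")
    print("   - If the host is not an Azure VM, onboard it to Azure Arc first")
    print("   - Verify sufficient permissions")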
except Exception as e:
    print(f"❌ Unexpected error: {e}")
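A zero exit code from `az vm extension set` shows the request completed, but it can still be useful to confirm the extension's provisioning state afterwards. A sketch, assuming the JSON returned by `az vm extension show` carries a top-level `provisioningState` field; the helper names are hypothetical:

```python
import json


def extension_show_cmd(resource_group, vm_name, extension="AzureMonitorWindowsAgent"):
    """Build the CLI call that reads one VM extension's status."""
    return [
        'az', 'vm', 'extension', 'show',
        '--resource-group', resource_group,
        '--vm-name', vm_name,
        '--name', extension,
        '--output', 'json',
    ]


def extension_succeeded(show_output):
    """Return True if the JSON text reports a successful provisioning state."""
    info = json.loads(show_output)
    return info.get("provisioningState") == "Succeeded"
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting `stdout` handed to `extension_succeeded`.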

## Step 6: Associate DCR with Gateway VM

In [None]:
# Associate the Data Collection Rule with the gateway VM
print("🔗 Associating DCR with gateway VM...")

association_name = f"{GATEWAY_VM_NAME}-gateway-monitoring"

associate_cmd = [
    'az', 'monitor', 'data-collection', 'rule', 'association', 'create',
    '--resource', vm_resource_id,
    '--rule-id', DCR_RESOURCE_ID,
    '--association-name', association_name
]

try:
    result = subprocess.run(associate_cmd, capture_output=True, text=True, check=True)
    
    print(f"✅ DCR association created successfully")
    print(f"   Association: {association_name}")
    print(f"   VM: {vm_resource_id}")
    print(f"   DCR: {DCR_RESOURCE_ID}")
    
    print("\n🎉 Gateway monitoring setup complete!")
    print("\n⏰ Data collection will start within 5-10 minutes")
    print("   Log files will be monitored automatically")
    print("   Windows Event Logs will be collected in real-time")
    print("   KQL transforms will parse and enrich the data")
    
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to create DCR association: {e.stderr}")
    print("\n🔍 Troubleshooting:")
    print("   - Verify DCR was created successfully")
    print("   - Check VM resource ID is correct")
    print("   - Ensure sufficient permissions for association")
except Exception as e:
    print(f"❌ Unexpected error: {e}")
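To cover additional gateway servers, the same association call can be repeated for each machine. A minimal sketch that builds the command from a VM resource ID, reusing the flags from this step; the helper name and the name suffix are hypothetical choices:

```python
def association_cmd(vm_resource_id, dcr_resource_id, suffix="gateway-monitoring"):
    """Build the CLI call that associates one VM with a Data Collection Rule."""
    # The last segment of an ARM resource ID is the resource's own name.
    vm_name = vm_resource_id.rstrip("/").split("/")[-1]
    return [
        'az', 'monitor', 'data-collection', 'rule', 'association', 'create',
        '--resource', vm_resource_id,
        '--rule-id', dcr_resource_id,
        '--association-name', f"{vm_name}-{suffix}",
    ]
```

Looping over a list of VM resource IDs and passing each command to `subprocess.run` would associate every gateway host with the same rule.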

## Step 7: Verify Data Collection

In [None]:
# Provide KQL queries to verify data collection
print("🔍 Use these KQL queries in Log Analytics to verify data collection:")
print("\n" + "="*60)

# Columns in DCR-based custom tables keep the names produced by the transforms,
# so the queries use LogType, Success, etc. without _s/_b/_d suffixes.
verification_queries = {
    "Gateway Performance Data": """
FabricGatewayPerformance_CL
| where TimeGenerated >= ago(1h)
| summarize RecordCount = count() by LogType
| order by RecordCount desc
""",
    
    "Gateway Event Logs": """
FabricGatewayEvents_CL
| where TimeGenerated >= ago(1h)
| summarize EventCount = count() by Severity
| order by EventCount desc
""",
    
    "Gateway Query Performance": """
FabricGatewayPerformance_CL
| where LogType == "query_execution"
| where TimeGenerated >= ago(24h)
| summarize 
    TotalQueries = count(),
    SuccessRate = countif(Success == true) * 100.0 / count(),
    AvgDurationMs = avg(QueryExecutionDuration)
by DataSource
| order by TotalQueries desc
""",
    
    "Gateway System Performance": """
FabricGatewayCounters_CL
| where TimeGenerated >= ago(24h)
| where CounterName in ("SystemCPUPercent", "SystemMEMUsedPercent")
| summarize AvgValue = avg(Average) by CounterName, GatewayObjectId
| evaluate pivot(CounterName, take_any(AvgValue))
""",
    
    "Gateway Error Analysis": """
FabricGatewayPerformance_CL
| where LogType == "query_execution"
| where Success == false
| where TimeGenerated >= ago(24h)
| summarize ErrorCount = count() by ErrorMessage, DataSource
| order by ErrorCount desc
| take 10
"""
}

for title, query in verification_queries.items():
    print(f"\n## {title}")
    print(f"```kusto\n{query.strip()}\n```")
    print()

print("\n" + "="*60)
print("\n📊 Next Steps:")
print("1. Wait 5-10 minutes for data collection to start")
print("2. Run the verification queries in Log Analytics")
print("3. Create alerts for critical gateway issues")
print("4. Build dashboards for gateway monitoring")
print("5. Set up retention policies for gateway data")
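The verification queries can also be run from the notebook instead of the portal. This sketch assumes the `az monitor log-analytics query` command is available in your CLI (it may require the `log-analytics` extension) and that it returns a JSON array with one object per row. The helper names are hypothetical, and the workspace argument is the workspace GUID (customer ID), not its name:

```python
import json


def log_query_cmd(workspace_guid, query, timespan="PT1H"):
    """Build the CLI call that runs one KQL query against a workspace."""
    return [
        'az', 'monitor', 'log-analytics', 'query',
        '--workspace', workspace_guid,
        '--analytics-query', query,
        '--timespan', timespan,
        '--output', 'json',
    ]


def row_count(query_output):
    """Count result rows, assuming the output is a JSON array of row objects."""
    rows = json.loads(query_output)
    return len(rows)
```

A `row_count` of zero some time after setup suggests that no records have reached the table yet.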

## Step 8: Create Alerts and Dashboards

In [None]:
# Sample alert rule for gateway failures
print("🚨 Sample alert rule for gateway query failures:")

alert_rule = {
    "location": LOCATION,
    "properties": {
        "displayName": "Gateway Query Failure Rate High",
        "description": "Alert when gateway query failure rate exceeds 10% in 15 minutes",
        "severity": 2,
        "enabled": True,
        "evaluationFrequency": "PT5M",
        "windowSize": "PT15M",
        "criteria": {
            "allOf": [
                {
                    # Column names match the transform output (no _s/_b suffixes)
                    "query": """
FabricGatewayPerformance_CL
| where LogType == "query_execution"
| where TimeGenerated >= ago(15m)
| summarize 
    TotalQueries = count(),
    FailedQueries = countif(Success == false)
| extend FailureRate = (FailedQueries * 100.0) / TotalQueries
| where FailureRate > 10
""",
                    "timeAggregation": "Count",
                    "operator": "GreaterThan",
                    "threshold": 0
                }
            ]
        }
    }
}

print("\nTo create this alert rule:")
print("1. Go to Azure Portal > Monitor > Alerts")
print("2. Create new alert rule")
print("3. Select your Log Analytics workspace as scope")
print("4. Use the KQL query printed below")
print("5. Configure action groups for notifications")

# Show the query so it can be pasted into the alert rule
print("\nAlert query:")
print(alert_rule["properties"]["criteria"]["allOf"][0]["query"].strip())

print("\n📈 Recommended alerts:")
recommended_alerts = [
    "Query failure rate > 10% in 15 minutes",
    "Average query duration > 30 seconds in 15 minutes", 
    "Gateway service errors in Windows Event Log",
    "Gateway CPU usage > 80% for 10 minutes",
    "Gateway memory usage > 90% for 5 minutes",
    "No gateway performance data received for 1 hour"
]

for i, alert in enumerate(recommended_alerts, 1):
    print(f"   {i}. {alert}")

print("\n🎯 Benefits of AMA Gateway Monitoring:")
benefits = [
    "🔧 Zero maintenance - Microsoft manages the agent",
    "📈 Enterprise scale - handles thousands of machines",
    "🛡️ Built-in resilience - automatic retry and buffering",
    "⚡ Performance optimized - native log collection",
    "💰 Cost effective - standard ingestion pricing only",
    "🔍 Real-time processing - KQL transforms in DCR",
    "🎨 Rich analysis - combine with other Azure Monitor data"
]

for benefit in benefits:
    print(f"   {benefit}")
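The sample alert query fires when the share of failed queries in the window is strictly above 10%. A small worked sketch of that arithmetic in Python, useful for reasoning about thresholds before creating the rule; the function names are hypothetical, and an empty window is treated as "no alert":

```python
def failure_rate(total_queries, failed_queries):
    """Percentage of failed queries, or None when there were no queries."""
    if total_queries == 0:
        return None
    return failed_queries * 100.0 / total_queries


def should_alert(total_queries, failed_queries, threshold_percent=10.0):
    """Mirror the alert condition: rate strictly above the threshold."""
    rate = failure_rate(total_queries, failed_queries)
    return rate is not None and rate > threshold_percent
```

With 200 queries in the window, 30 failures give a rate of 15% and trigger the alert, while 20 failures give exactly 10% and do not.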

## Summary

You have successfully set up **Azure Monitor Agent (AMA)** for comprehensive gateway monitoring! 🎉

### ✅ **What's Now Monitoring Your Gateway:**
- **Performance Logs**: Query execution metrics with automatic CSV parsing
- **Windows Event Logs**: Real-time gateway service event collection
- **System Counters**: Resource utilization monitoring
- **KQL Transforms**: Automatic data parsing and enrichment
- **Integrated Analytics**: Data flows directly to Log Analytics

### 🚀 **Next Steps:**
1. **Verify Data Flow**: Run the verification queries in Log Analytics
2. **Set Up Alerts**: Create alerts for critical gateway issues
3. **Build Dashboards**: Visualize gateway performance and health
4. **Scale Deployment**: Apply to additional gateway servers
5. **Integrate Security**: Connect with Microsoft Sentinel for security monitoring

### 📚 **Documentation:**
- [Azure Monitor Agent Overview](https://docs.microsoft.com/azure/azure-monitor/agents/azure-monitor-agent-overview)
- [Data Collection Rules](https://docs.microsoft.com/azure/azure-monitor/agents/data-collection-rule-overview)
- [Gateway Performance Monitoring](https://learn.microsoft.com/en-us/data-integration/gateway/service-gateway-performance)

**AMA provides enterprise-grade gateway monitoring with zero maintenance overhead!** 🎯