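# Diagnosing Root Causes with SPL: Slowness

This notebook implements the step-by-step procedure described in 《如何用SPL快速诊断问题根因 -- 慢.md》 ("How to Quickly Diagnose Root Causes with SPL: Slowness") and uses SPL queries to diagnose high end-to-end latency.

## Analysis Flow

1. Find spans with high exclusive time: use the trace_exclusive_duration operator to identify spans with a long exclusive duration.
2. Pattern analysis: use diff_patterns to discover what characterizes the high-latency spans.
3. CPU metric analysis: query the CPU usage metrics of the identified service through CMS entity queries.
4. Anomaly detection: use series_decompose_anomalies to detect anomalies in CPU usage.

The goal is to find a root cause that explains the latency increase (for example, recommendation.cpu).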

In [1]:
import sys
import time
import json
from datetime import datetime, timedelta

# Add the parent directory to the path so the local modules can be imported
sys.path.append('..')

# Import the custom modules, tolerating import failures
try:
    from find_root_cause_spans_rt import FindRootCauseSpansRT
    print("✅ FindRootCauseSpansRT imported successfully")
except ImportError as e:
    print(f"⚠️ Warning: Could not import FindRootCauseSpansRT: {e}")
    FindRootCauseSpansRT = None

try:
    from test_cms_query import TestCMSQuery
    print("✅ TestCMSQuery imported successfully")
except ImportError as e:
    print(f"⚠️ Warning: Could not import TestCMSQuery: {e}")
    print("💡 Please install required dependencies: pip install -r requirements.txt")
    TestCMSQuery = None

try:
    from utils.constants import HIGH_RT_TRACES, TRACES_FOR_AVG_RT, PERCENT_95
    print("✅ Constants imported successfully")
except ImportError as e:
    print(f"⚠️ Warning: Could not import constants: {e}")
    # Fall back to default values if the constants cannot be imported
    HIGH_RT_TRACES = 1000
    TRACES_FOR_AVG_RT = 1000
    PERCENT_95 = 0.95
    print("✅ Using default constant values")

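# Load environment variables (nothing is loaded in this cell; credentials are read with os.getenv in a later cell)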

print("✅ All available imports loaded successfully")
print("\n💡 If you see warnings above, please run: pip install -r requirements.txt")

✅ FindRootCauseSpansRT imported successfully
✅ TestCMSQuery imported successfully
✅ Constants imported successfully
✅ All available imports loaded successfully



## Configure Environment Variables

Set the project and logstore to query, and the incident time window given in the task.

In [2]:
# SLS configuration
PROJECT_NAME = "proj-xtrace-a46b97cfdc1332238f714864c014a1b-cn-qingdao"
LOGSTORE_NAME = "logstore-tracing"
REGION = "cn-qingdao"

# Analysis time windows
# Anomaly window (period of elevated latency)
ANOMALY_START_TIME = "2025-08-28 20:00:20"
ANOMALY_END_TIME = "2025-08-28 20:05:20"

# Normal window (baseline for comparison)
NORMAL_START_TIME = "2025-08-27 20:00:20"
NORMAL_END_TIME = "2025-08-27 20:05:20"

# Analysis parameters
DURATION_THRESHOLD = 2000000000  # 2000 ms, expressed in nanoseconds
LIMIT_NUM = 1000  # number of traces to analyze

# CMS metrics configuration
CMS_WORKSPACE = "quanxi-tianchi-test"
CMS_ENDPOINT = 'metrics.cn-qingdao.aliyuncs.com'

print(f"📊 Analysis Configuration:")
print(f"  SLS Project: {PROJECT_NAME}")
print(f"  Logstore: {LOGSTORE_NAME}")
print(f"  Anomaly Period: {ANOMALY_START_TIME} to {ANOMALY_END_TIME}")
print(f"  Normal Period: {NORMAL_START_TIME} to {NORMAL_END_TIME}")
print(f"  Duration Threshold: {DURATION_THRESHOLD/1000000000:.1f}s")
print(f"  CMS Workspace: {CMS_WORKSPACE}")

📊 Analysis Configuration:
  SLS Project: proj-xtrace-a46b97cfdc1332238f714864c014a1b-cn-qingdao
  Logstore: logstore-tracing
  Anomaly Period: 2025-08-28 20:00:20 to 2025-08-28 20:05:20
  Normal Period: 2025-08-27 20:00:20 to 2025-08-27 20:05:20
  Duration Threshold: 2.0s
  CMS Workspace: quanxi-tianchi-test
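
The later cells convert these time strings to Unix timestamps with `time.mktime`, which depends on the local time zone of the machine running the notebook. The sketch below makes the zone explicit. It assumes the configured times are UTC+8 wall-clock times, which is consistent with the first timestamp in the recorded memory metric output (1756296020) but is not stated anywhere in the notebook. `to_epoch_seconds` and `seconds_to_ns` are hypothetical helper names, not part of the project.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the configured timestamps are wall-clock times in UTC+8.
ASSUMED_TZ = timezone(timedelta(hours=8))


def to_epoch_seconds(ts: str, tz: timezone = ASSUMED_TZ) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' in the given time zone to Unix seconds."""
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return int(naive.replace(tzinfo=tz).timestamp())


def seconds_to_ns(seconds: float) -> int:
    """Express a duration in nanoseconds, the unit used by DURATION_THRESHOLD."""
    return int(seconds * 1_000_000_000)


if __name__ == "__main__":
    print(to_epoch_seconds("2025-08-27 20:00:20"))
    print(seconds_to_ns(2))
```

Using an explicit zone keeps the query window stable when the notebook is run on a machine configured for a different time zone.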


## Create Clients with STS

Set your account credentials, obtain access permission through STS, and create the client.

In [3]:
import os
import sys

from aliyun.log import LogClient 
from alibabacloud_sts20150401.client import Client as StsClient
from alibabacloud_sts20150401 import models as sts_models
from alibabacloud_tea_openapi import models as open_api_models
from Tea.exceptions import TeaException

sys.path.append('..')


# ---------- Create an environment variable file and set the two variables below
# ---------- to the AccessKey pair you saved when creating the user, for example:
'''
ALIBABA_CLOUD_ACCESS_KEY_ID="your saved AccessKey ID"
ALIBABA_CLOUD_ACCESS_KEY_SECRET="your saved AccessKey Secret"
'''
MAIN_ACCOUNT_ACCESS_KEY_ID = os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID')
MAIN_ACCOUNT_ACCESS_KEY_SECRET = os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
ALIBABA_CLOUD_ROLE_ARN = os.getenv('ALIBABA_CLOUD_ROLE_ARN','acs:ram::1672753017899339:role/tianchi-user-a')
STS_SESSION_NAME = os.getenv('ALIBABA_CLOUD_ROLE_SESSION_NAME', 'my-sls-access')  # custom session name; any value works

if MAIN_ACCOUNT_ACCESS_KEY_ID and MAIN_ACCOUNT_ACCESS_KEY_SECRET and ALIBABA_CLOUD_ROLE_ARN:
    print("✅ SLS访问凭证配置正确")
else:
    print("❌ 请在环境变量文件中配置ALIBABA_CLOUD_ACCESS_KEY_ID和ALIBABA_CLOUD_ACCESS_KEY_SECRET")


def get_sts_credentials():
    
    if not all([MAIN_ACCOUNT_ACCESS_KEY_ID, MAIN_ACCOUNT_ACCESS_KEY_SECRET, ALIBABA_CLOUD_ROLE_ARN]):
        print("❌ 个人账号信息缺失! 请在环境变量文件中配置 ALIBABA_CLOUD_ACCESS_KEY_ID, ALIBABA_CLOUD_ACCESS_KEY_SECRET")
        return None

    config = open_api_models.Config(
        access_key_id=MAIN_ACCOUNT_ACCESS_KEY_ID, # type: ignore
        access_key_secret=MAIN_ACCOUNT_ACCESS_KEY_SECRET, # type: ignore
        endpoint=f'sts.{REGION}.aliyuncs.com'
    )
    sts_client = StsClient(config)
    
    assume_role_request = sts_models.AssumeRoleRequest(
        role_arn=ALIBABA_CLOUD_ROLE_ARN, # type: ignore
        role_session_name=STS_SESSION_NAME,
        duration_seconds=3600
    )
    
    try:
        response = sts_client.assume_role(assume_role_request)
        print("✅ 成功获取访问权限！")
        return response.body.credentials
    except TeaException as e:
        print(f"❌ 获取STS临时凭证失败: {e.message}")
        print(f"  错误码: {e.code}")
        print("  请检查:1. 主账号AK是否正确;2. 目标角色ARN是否正确;3. 目标角色的信任策略是否已配置为信任您的主账号。")
        return None
    except Exception as e:
        print(f"❌ 发生未知错误在获取STS凭证时: {e}")
        return None

# --- Function: create the SLS client ---
def create_sls_client_with_sts():
    
    sts_credentials = get_sts_credentials()
    
    if not sts_credentials:
        return None
        
    sls_endpoint = f"{REGION}.log.aliyuncs.com"
    
    # aliyun-log-python-sdk takes the temporary token through the securityToken parameter
    log_client = LogClient(
        endpoint=sls_endpoint,
        accessKeyId=sts_credentials.access_key_id,
        accessKey=sts_credentials.access_key_secret,
        securityToken=sts_credentials.security_token  
    )
    
    print("✅ SLS客户端已使用临时凭证创建。")
    return log_client

# Create the SLS client with STS credentials
log_client_instance = create_sls_client_with_sts()

✅ SLS访问凭证配置正确
✅ 成功获取访问权限！
✅ SLS客户端已使用临时凭证创建。
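
The cell above reads the AccessKey pair with `os.getenv`, but the notebook never shows how the environment variable file gets loaded. One possible way to do it with only the standard library is sketched here. `load_env_file` is a hypothetical helper and the `.env` file name is an assumption; the project may rely on a different mechanism.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Read KEY=VALUE lines from a file into os.environ and return what was set."""
    loaded = {}
    file = Path(path)
    if not file.is_file():
        return loaded
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Keep values that are already exported unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` before the credential cell would make `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET` visible to `os.getenv`, provided the file sits in the working directory.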


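## Step 1: Find Spans with High Exclusive Time

Use the trace_exclusive_duration operator to identify spans with high exclusive time during the anomaly window.

Exclusive time: the time actually spent inside a span itself, excluding the time spent in its child spans.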
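
To make the definition concrete, here is a minimal local sketch of an exclusive-time calculation. It subtracts the union of the child intervals (clipped to the parent) from the parent's duration, so overlapping children are not counted twice. This is an illustration under that assumption only; how trace_exclusive_duration treats overlapping or asynchronous children is not described in this notebook. The `Span` class and its field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    start: int  # start time in nanoseconds
    end: int    # end time in nanoseconds


def exclusive_durations(spans: List[Span]) -> Dict[str, int]:
    """Return span_id -> duration minus the time covered by direct children."""
    children = defaultdict(list)
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)

    result = {}
    for s in spans:
        # Clip each child to the parent's interval, then merge overlaps.
        intervals = sorted(
            (max(c.start, s.start), min(c.end, s.end)) for c in children[s.span_id]
        )
        covered = 0
        cur_start = cur_end = None
        for lo, hi in intervals:
            if hi <= lo:
                continue  # child lies entirely outside the parent
            if cur_end is None or lo > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = lo, hi
            else:
                cur_end = max(cur_end, hi)
        if cur_end is not None:
            covered += cur_end - cur_start
        result[s.span_id] = (s.end - s.start) - covered
    return result
```

A span whose children account for nearly all of its duration ends up with a small exclusive time, which is why this measure points at the span doing the slow work rather than at its callers.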

In [4]:
print("🔍 Step 1: Finding high exclusive time spans...")
print("="*60)

from find_root_cause_spans_rt import FindRootCauseSpansRT

# Initialize the root cause analyzer with baseline comparison
finder = FindRootCauseSpansRT(
    client=log_client_instance,
    project_name=PROJECT_NAME,
    logstore_name=LOGSTORE_NAME,
    region=REGION,
    start_time=ANOMALY_START_TIME,
    end_time=ANOMALY_END_TIME,
    duration_threshold=DURATION_THRESHOLD,
    limit_num=LIMIT_NUM,
    normal_start_time=NORMAL_START_TIME,
    normal_end_time=NORMAL_END_TIME,
    minus_average=True,  # subtract the baseline average, which helps surface anomalies
    only_top1_per_trace=False
) # type: ignore

print("⚙️  Analyzer 已初始化，并启用了基线对比功能")

🔍 Step 1: Finding high exclusive time spans...
获取独占时间数据...
正常时间段查询到的独占时间日志条数: 3000
开始计算正常时间段的平均独占时间...
收集到 16382 个span的独占时间信息
查询span的serviceName和spanName信息...
从原始span中采样了 3000 个用于计算平均值
查询第 1 批，共 500 个span...
查询到 100 条记录
查询第 2 批，共 500 个span...
查询到 100 条记录
查询第 3 批，共 500 个span...
查询到 100 条记录
查询第 4 批，共 500 个span...
查询到 100 条记录
查询第 5 批，共 500 个span...
查询到 100 条记录
查询第 6 批，共 500 个span...
查询到 100 条记录
组合键 frontend<sep>GET 的平均独占时间: 11552850.68
组合键 frontend-proxy<sep>router frontend egress 的平均独占时间: 22568228.58
组合键 flagd<sep>flagd.evaluation.v1.Service/EventStream 的平均独占时间: 2616811684.30
组合键 frontend<sep>grpc.oteldemo.RecommendationService/ListRecommendations 的平均独占时间: 47885921.92
组合键 frontend-web<sep>resourceFetch 的平均独占时间: 98057042.21
组合键 frontend-web<sep>documentLoad 的平均独占时间: 230399902.20
组合键 cart<sep>POST 的平均独占时间: 70976127.80
组合键 frontend-web<sep>HTTP GET 的平均独占时间: 95925965.52
组合键 recommendation<sep>get_product_list 的平均独占时间: 42056843.00
组合键 recommendation<sep>POST 的平均独占时间: 1298774413.00
组合键 currenc

In [5]:
print(f"📈 最多分析 {LIMIT_NUM} 条持续时间大于 {DURATION_THRESHOLD/1000000000:.1f} 秒的 trace")

# Find the spans that contribute the top 95% of exclusive time
print("🎯 正在查找贡献独占时间前95%的span...")

top_95_percent_spans = finder.find_top_95_percent_spans()

print(f"\n📊 结果:")
print(f"  共找到 {len(top_95_percent_spans)} 个贡献了独占时间前95%的span")

if top_95_percent_spans:
    print(f"\n🔍 示例span IDs:")
    for i, span_id in enumerate(top_95_percent_spans[:5]):
        print(f"  {i+1}. {span_id}")
    
    if len(top_95_percent_spans) > 5:
        print(f"  ... and {len(top_95_percent_spans) - 5} more spans")
        
    # Get the query conditions for these spans
    span_conditions, detailed_query = finder.get_top_95_percent_spans_query()
    print(f"\n📝 已生成用于进一步分析的查询条件")
else:
    print("⚠️  未找到高独占时间的span")

📈 最多分析 1000 条持续时间大于 2.0 秒的 trace
🎯 正在查找贡献独占时间前95%的span...
查询到的日志条数: 2000
🔧 处理模式: 处理每个trace中的所有span
总共找到 9308 个有效的span独占时间数据
成功映射 7453 个span的serviceName和spanName
方案1覆盖率: 80.07% (7453/9308)
✅ 选择方案1：直接使用span_list中的serviceName和spanName（推荐）
🚀 [方案1] 使用span_list中的serviceName和spanName进行调整...
🚀 [方案1] 无需额外查询，直接处理 9308 个span
开始本地计算调整后的独占时间...
完成 9308 个span的时间调整计算
总独占时间: 15163002577.5
占前95%独占时间的span数量: 3277
这些span的累计独占时间: 14405595686.5, 占总时间的: 95.00%

📊 结果:
  共找到 3277 个贡献了独占时间前95%的span

🔍 示例span IDs:
  1. 54ba7fd7b19710b8
  2. fab7fa32543a7b59
  3. e4f52bfdb3a16926
  4. 745f145460b09a98
  5. 50505a824cc567d3
  ... and 3272 more spans
查询到的日志条数: 2000
🔧 处理模式: 处理每个trace中的所有span
总共找到 9308 个有效的span独占时间数据
成功映射 7453 个span的serviceName和spanName
方案1覆盖率: 80.07% (7453/9308)
✅ 选择方案1：直接使用span_list中的serviceName和spanName（推荐）
🚀 [方案1] 使用span_list中的serviceName和spanName进行调整...
🚀 [方案1] 无需额外查询，直接处理 9308 个span
开始本地计算调整后的独占时间...
完成 9308 个span的时间调整计算
总独占时间: 15163002577.5
占前95%独占时间的span数量: 3277
这些span的累计独占时间: 14405595686.5,
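
The log above reports 3277 spans whose cumulative exclusive time reaches 95.00% of the total. A selection of that kind can be sketched as a sort followed by a running sum. This is not the source of `find_top_95_percent_spans`, which is not shown in this notebook; `top_fraction_spans` is a hypothetical name, and dropping non-positive values (which can appear after baseline subtraction) is an assumption.

```python
from typing import Dict, List


def top_fraction_spans(exclusive_ns: Dict[str, float], fraction: float = 0.95) -> List[str]:
    """Return the largest spans whose cumulative exclusive time reaches `fraction` of the total."""
    # Assumption: spans at or below zero after baseline subtraction are ignored.
    positive = {k: v for k, v in exclusive_ns.items() if v > 0}
    total = sum(positive.values())
    if total <= 0:
        return []

    selected, running = [], 0.0
    ranked = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)
    for span_id, value in ranked:
        if running >= fraction * total:
            break
        selected.append(span_id)
        running += value
    return selected
```

For example, with exclusive times `{"a": 90, "b": 6, "c": 3, "d": 1}` the function keeps `a` and `b`, because together they cover 96 of the 100 units.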

## Step 2: Pattern Analysis with diff_patterns

Use the diff_patterns operator to find the features that distinguish high-exclusive-time spans from normal spans.

In [6]:
print("🔍 Step 2: Pattern analysis with diff_patterns...")
print("="*60)

if top_95_percent_spans:
    # Join the high-exclusive-time span IDs (capped at 2000 to limit query size) into one condition string for the diff_patterns query
    span_conditions_for_patterns = " or ".join([f"spanId='{span_id}'" for span_id in top_95_percent_spans[:2000]])  # Limit for query size
    
    param_str = """{"minimum_support_fraction": 0.03}"""
    # The core step: call the diff_patterns algorithm to analyze pattern differences
    diff_patterns_query = f"""
duration > {DURATION_THRESHOLD} | set session enable_remote_functions=true; set session velox_support_row_constructor_enabled=true; 
with t0 as (
    select spanName, serviceName, cast(duration as double) as duration,
           JSON_EXTRACT_SCALAR(resources, '$["k8s.pod.ip"]') AS pod_ip,
           JSON_EXTRACT_SCALAR(resources, '$["k8s.node.name"]') AS node_name,
           JSON_EXTRACT_SCALAR(resources, '$["service.version"]') AS service_version,  
           if(({span_conditions_for_patterns}), 'true', 'false') as anomaly_label, 
           cast(if((statusCode = 2 or statusCode = 3), 1, 0) as double) as error_count 
    from log
), 
t1 as (
    select array_agg(spanName) as spanName, 
           array_agg(serviceName) as serviceName, 
           array_agg(duration) as duration,
           array_agg(pod_ip) as pod_ip, 
           array_agg(node_name) as node_name, 
           array_agg(service_version) as service_version, 
           array_agg(anomaly_label) as anomaly_label, 
           array_agg(error_count) as error_count 
    from t0
),
t2 as (
    select row(spanName, serviceName, anomaly_label) as table_row 
    from t1
),
t3 as (
    select diff_patterns(table_row, ARRAY['spanName', 'serviceName', 'anomaly_label'], 'anomaly_label', 'true', 'false', '', '', '{param_str}') as ret 
    from t2
)
select * from t3
"""
    
    print("📋 已生成用于模式分析的 diff_patterns 查询语句")
    # print(f"🎯 正在分析前 {min(50, len(top_95_percent_spans))} 个高独占时间span的模式")
    
    # Run the query with the SLS client
    print("\n🚀 正在使用SLS客户端执行 diff_patterns 查询...")
    
    try:
        # Build the GetLogsRequest for the diff_patterns query
        from aliyun.log import GetLogsRequest
        
        request = GetLogsRequest(
            project=PROJECT_NAME,
            logstore=LOGSTORE_NAME,
            query=diff_patterns_query.strip(),
            fromTime=int(time.mktime(datetime.strptime(ANOMALY_START_TIME, "%Y-%m-%d %H:%M:%S").timetuple())),
            toTime=int(time.mktime(datetime.strptime(ANOMALY_END_TIME, "%Y-%m-%d %H:%M:%S").timetuple())),
            line=100  # Limit results
        )
        
        # Execute the query through the finder's SLS client
        patterns_result = finder.client.get_logs(request)
        
        if patterns_result and patterns_result.get_logs():
            logs = [log_item.get_contents() for log_item in patterns_result.get_logs()]
            print(f"✅ 模式分析完成：共返回 {len(logs)} 条结果记录")
            
            # Show the pattern analysis results
            # The result compares how often the same service appears in the normal and anomalous samples
            print("\n📊 模式分析结果:")
            for i, log_entry in enumerate(logs[:3]):  # Show first 3 results
                print(f"  结果 {i+1}: {log_entry}")
                
            # Extract service patterns from diff_patterns results
            service_patterns = {}
            span_patterns = []  # Store span patterns for service inference
            
            for log_entry in logs:
                if hasattr(log_entry, 'get_contents'):
                    contents = log_entry.get_contents()
                else:
                    contents = log_entry
                print(f"  🔍 Analyzing pattern result: {contents}")
                
                # Parse structured result from diff_patterns query
                if 'ret' in contents:
                    ret_value = contents['ret']
                    
                    if isinstance(ret_value, str):
                        try:
                            # ret is a JSON document (including JSON null), so parse it with json.loads instead of eval
                            result = json.loads(ret_value)
                            
                            if len(result) >= 2 and isinstance(result[0], list) and isinstance(result[1], list):
                                patterns = result[0]  # Pattern names
                                counts = result[1]    # Pattern counts
                                
                                print(f"    📊 Extracted patterns: {patterns}")
                                print(f"    📊 Pattern counts: {counts}")
                                
                                for i, pattern in enumerate(patterns):
                                    if i < len(counts):
                                        count = counts[i]
                                        
                                        # Parse serviceName patterns from diff_patterns results
                                        if 'serviceName' in pattern and '=' in pattern:
                                            # Handle complex patterns like "serviceName"='cart' AND "spanName"='POST'
                                            import re
                                            match = re.search(r'"serviceName"=\'([^\']+)\'', pattern)
                                            if match:
                                                service_part = match.group(1)
                                                service_patterns[service_part] = service_patterns.get(service_part, 0) + count
                                                print(f"    ✅ Found serviceName pattern: '{service_part}' (count: {count})")
                                            else:
                                                print(f"    ⚠️ Could not parse serviceName from: '{pattern}'")
                                        
                                        # Log spanName patterns for service inference
                                        elif 'spanName' in pattern:
                                            span_name = pattern.split('=')[1].strip('\'"') if '=' in pattern else pattern
                                            print(f"    ℹ️ Found spanName pattern: '{span_name}' (count: {count})")
                                            span_patterns.append((span_name, count))
                                            
                        except Exception as e:
                            print(f"    ⚠️ Error parsing ret field: {e}")
                            import traceback
                            traceback.print_exc()
                
            # If no serviceName patterns found, try to infer from spanName patterns
            if not service_patterns and span_patterns:
                print(f"    📊 No serviceName patterns found - attempting service inference from spanName patterns")
                
                service_candidates = {}
                for span_name, count in span_patterns:
                    # Map spanName patterns to likely services
                    if 'CartService' in span_name or 'cart' in span_name.lower():
                        service_candidates['cart'] = service_candidates.get('cart', 0) + count
                    elif 'ProductCatalogService' in span_name or 'product' in span_name.lower():
                        service_candidates['product-catalog'] = service_candidates.get('product-catalog', 0) + count  
                    elif 'PaymentService' in span_name or 'payment' in span_name.lower():
                        service_candidates['payment'] = service_candidates.get('payment', 0) + count
                    elif 'CheckoutService' in span_name or 'checkout' in span_name.lower():
                        service_candidates['checkout'] = service_candidates.get('checkout', 0) + count
                    elif 'RecommendationService' in span_name or 'recommendation' in span_name.lower():
                        service_candidates['recommendation'] = service_candidates.get('recommendation', 0) + count
                    elif 'CurrencyService' in span_name or 'currency' in span_name.lower():
                        service_candidates['currency'] = service_candidates.get('currency', 0) + count
                    elif 'frontend' in span_name.lower():
                        service_candidates['frontend'] = service_candidates.get('frontend', 0) + count
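                    # Match 'adservice' rather than the bare substring 'ad', which also hits names such as 'documentLoad'
                    elif 'flagservice' in span_name.lower() or 'adservice' in span_name.lower():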
                        service_candidates['ad'] = service_candidates.get('ad', 0) + count
                    else:
                        print(f"    ❓ Cannot infer service from span: '{span_name}'")
                
                if service_candidates:
                    print(f"    📊 Service inference from spans: {dict(service_candidates)}")
                    service_patterns = service_candidates
                else:
                    print(f"    ❌ Cannot determine target service from available span patterns")
                    
            # Store span patterns globally for additional analysis if needed
            globals()['span_patterns'] = span_patterns
                    
            if service_patterns:
                print(f"\n🎯 识别出的service模式:")
                for service, count in sorted(service_patterns.items(), key=lambda x: x[1], reverse=True):
                    print(f"  - {service}: {count} 次模式匹配")
                    
                # Update TARGET_SERVICE from the most frequent pattern
                most_common_service = max(service_patterns.items(), key=lambda x: x[1])[0]
                print(f"\n💡 在模式中出现最多的serviceName: {most_common_service}")
                
                # Store the most frequent service name for the following steps
                globals()['TARGET_SERVICE_FROM_PATTERNS'] = most_common_service
                
        else:
            print("⚠️  未返回任何模式分析结果")
            
    except Exception as e:
        print(f"❌ 执行 diff_patterns 查询时出错: {e}")
        print("💡 建议在SLS控制台手动执行")
    
        # Print the query so it can be run manually in the SLS console if needed
        print("\n📝 SLS控制台手动执行用查询语句（如有需要）：")
        print("="*50)
        print(diff_patterns_query)
        print("="*50)
        
        print("\n💡 预期结果：该查询应该能识别哪个 serviceName 拥有最多高独占时间的span")
        print("   例如：分析结果可能显示 serviceName='recommendation' 的span占比最高")
    
else:
    print("⚠️  无法进行模式分析 - 未找到高独占时间的span")

🔍 Step 2: Pattern analysis with diff_patterns...
📋 已生成用于模式分析的 diff_patterns 查询语句

🚀 正在使用SLS客户端执行 diff_patterns 查询...
✅ 模式分析完成：共返回 1 条结果记录

📊 模式分析结果:
  结果 1: {'ret': '[["\\"spanName\\"=\'grpc.oteldemo.CartService/GetCart\'","\\"spanName\\"=\'router flagservice egress\'","\\"serviceName\\"=\'cart\'"],[705,487,735],[3457,1674,6023],[0.3525,0.2435,0.3675],[0.114356599404565,0.05537545484617929,0.1992391663910023],[0.23814340059543499,0.1881245451538207,0.16826083360899769],[1.0,1.0,1.0],[0.0,0.0,0.0],null]'}
  🔍 Analyzing pattern result: {'ret': '[["\\"spanName\\"=\'grpc.oteldemo.CartService/GetCart\'","\\"spanName\\"=\'router flagservice egress\'","\\"serviceName\\"=\'cart\'"],[705,487,735],[3457,1674,6023],[0.3525,0.2435,0.3675],[0.114356599404565,0.05537545484617929,0.1992391663910023],[0.23814340059543499,0.1881245451538207,0.16826083360899769],[1.0,1.0,1.0],[0.0,0.0,0.0],null]'}
    📊 Extracted patterns: ['"spanName"=\'grpc.oteldemo.CartService/GetCart\'', '"spanName"=\'router flagser
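
The `ret` value printed above is a JSON array of parallel lists. Judging from the numbers, the rows appear to be: pattern, count in the anomalous group, count in the normal group, support in each group, and the support difference (for example 705 / 2000 = 0.3525, and 0.3525 - 0.1144 ≈ 0.2381). That reading is inferred from this one output and is not confirmed against the diff_patterns documentation. Under that assumption, the sketch below parses the result and ranks patterns by support difference. `PatternRow`, `parse_diff_patterns`, and `service_of` are hypothetical names.

```python
import json
import re
from typing import List, NamedTuple, Optional


class PatternRow(NamedTuple):
    pattern: str
    anomaly_count: int
    normal_count: int
    anomaly_support: float
    normal_support: float
    support_diff: float


def parse_diff_patterns(ret: str) -> List[PatternRow]:
    """Parse the JSON `ret` string and sort patterns by support difference, largest first."""
    data = json.loads(ret)
    # Assumed row order; only the first six parallel lists are used.
    patterns, a_cnt, n_cnt, a_sup, n_sup, diff = data[:6]
    rows = [PatternRow(*vals) for vals in zip(patterns, a_cnt, n_cnt, a_sup, n_sup, diff)]
    return sorted(rows, key=lambda r: r.support_diff, reverse=True)


SERVICE_RE = re.compile(r'"serviceName"=\'([^\']+)\'')


def service_of(pattern: str) -> Optional[str]:
    """Extract the service name from a pattern such as "serviceName"='cart'."""
    m = SERVICE_RE.search(pattern)
    return m.group(1) if m else None


# Sample rebuilt from the values in the recorded output above.
SAMPLE_RET = json.dumps([
    ["\"spanName\"='grpc.oteldemo.CartService/GetCart'",
     "\"spanName\"='router flagservice egress'",
     "\"serviceName\"='cart'"],
    [705, 487, 735],
    [3457, 1674, 6023],
    [0.3525, 0.2435, 0.3675],
    [0.114356599404565, 0.05537545484617929, 0.1992391663910023],
    [0.23814340059543499, 0.1881245451538207, 0.16826083360899769],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
    None,
])

if __name__ == "__main__":
    for row in parse_diff_patterns(SAMPLE_RET):
        print(f"{row.support_diff:.4f}  {row.pattern}  service={service_of(row.pattern)}")
```

Ranking by support difference instead of raw count is a design choice: a pattern that is common in both groups has a high count but says little about what sets the slow spans apart.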

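## Step 3: CPU Metric Analysis

Based on the pattern analysis result (for example, if the recommendation service shows high exclusive time), use CMS entity queries to fetch that service's CPU metrics.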

In [7]:
print("🔍 Step 3: CPU Metrics Analysis...")
print("="*60)

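# Initialize the CMS test client used for metric queries
# If the import failed, fall back to building a client class directly (see the except branch)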
try:
    if TestCMSQuery is not None:
        cms_tester = TestCMSQuery()
        cms_tester.setUp()
        print(f"✅ 已通过导入的 TestCMSQuery 初始化CMS客户端")
    else:
        raise ImportError("TestCMSQuery is None")
except Exception:
    print("⚠️  TestCMSQuery import failed, creating CMS client directly...")
    
    import os
    from alibabacloud_cms20240330.client import Client as Cms20240330Client
    from alibabacloud_tea_openapi import models as open_api_models
    from alibabacloud_cms20240330 import models as cms_20240330_models
    from alibabacloud_tea_util import models as util_models
    
    
    class DirectCMSClient:
        def __init__(self):
            self.access_key_id = os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID')
            self.access_key_secret = os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
            self.workspace = CMS_WORKSPACE
            self.endpoint = CMS_ENDPOINT
            
            if not self.access_key_id or not self.access_key_secret:
                raise ValueError("请设置环境变量 ALIBABA_CLOUD_ACCESS_KEY_ID 和 ALIBABA_CLOUD_ACCESS_KEY_SECRET")
            
            config = open_api_models.Config(
                access_key_id=self.access_key_id,
                access_key_secret=self.access_key_secret,
            )
            config.endpoint = self.endpoint
            self.cms_client = Cms20240330Client(config)
        
        def _execute_spl_query(self, query: str, from_time: int = None, to_time: int = None):
            """Run an SPL query; the time range defaults to the last hour."""
            if from_time is None:
                from_time = int(time.time()) - 60 * 60 * 1
            if to_time is None:
                to_time = int(time.time())
            
            try:
                headers = cms_20240330_models.GetEntityStoreDataHeaders()
                request = cms_20240330_models.GetEntityStoreDataRequest(
                    query=query,
                    from_=from_time,
                    to=to_time
                )
                runtime = util_models.RuntimeOptions()
                response = self.cms_client.get_entity_store_data_with_options(
                    self.workspace, request, headers, runtime
                )
                return response.body
            except Exception as e:
                print(f"❌ CMS查询错误: {e}")
                return None
    
    cms_tester = DirectCMSClient()
    print(f"✅ CMS client created directly")

print(f"🔧 CMS客户端已初始化")
print(f"🔧 workspace: {CMS_WORKSPACE}")
print(f"🔧 Endpoint: {CMS_ENDPOINT}")

# Determine target service from runtime observations
def infer_service_from_additional_evidence():
    """Try additional service inference methods when primary analysis fails"""
    if 'span_patterns' not in globals():
        return None
    
    service_candidates = {}
    for span_name, count in span_patterns:
        # More comprehensive service mapping
        if 'CartService' in span_name:
            service_candidates['cart'] = service_candidates.get('cart', 0) + count
        elif 'ProductCatalogService' in span_name:
            service_candidates['product-catalog'] = service_candidates.get('product-catalog', 0) + count  
        elif 'PaymentService' in span_name:
            service_candidates['payment'] = service_candidates.get('payment', 0) + count
        elif 'CheckoutService' in span_name:
            service_candidates['checkout'] = service_candidates.get('checkout', 0) + count
        elif 'RecommendationService' in span_name:
            service_candidates['recommendation'] = service_candidates.get('recommendation', 0) + count
        elif 'CurrencyService' in span_name:
            service_candidates['currency'] = service_candidates.get('currency', 0) + count
        elif 'frontend' in span_name.lower():
            service_candidates['frontend'] = service_candidates.get('frontend', 0) + count
        elif 'flagservice' in span_name.lower():
            service_candidates['ad'] = service_candidates.get('ad', 0) + count
    
    if service_candidates:
        best_service = max(service_candidates.items(), key=lambda x: x[1])
        return best_service[0]
    
    return None

# Determine target service using runtime data
if 'TARGET_SERVICE_FROM_PATTERNS' in globals():
    TARGET_SERVICE = TARGET_SERVICE_FROM_PATTERNS 
    print(f"🎯 Using service from diff_patterns: {TARGET_SERVICE}")
else:
    # Try additional inference methods
    inferred_service = infer_service_from_additional_evidence()
    if inferred_service:
        TARGET_SERVICE = inferred_service
        print(f"🎯 Inferred service from span patterns: {TARGET_SERVICE}")
    else:
        print(f"🎯 Cannot determine target service from available data")
        if 'span_patterns' in globals():
            print(f"    Available span patterns: {globals()['span_patterns']}")
        TARGET_SERVICE = None

if TARGET_SERVICE:
    print(f"🎯 正在分析目标service:{TARGET_SERVICE}的CPU指标")
else:
    print(f"⚠️ 无法确定目标service，将跳过service级别的分析")

🔍 Step 3: CPU Metrics Analysis...
✅ 成功获取临时访问凭证！
✅ 已通过导入的 TestCMSQuery 初始化CMS客户端
🔧 CMS客户端已初始化
🔧 workspace: quanxi-tianchi-test
🔧 Endpoint: metrics.cn-qingdao.aliyuncs.com
🎯 Using service from diff_patterns: cart
🎯 正在分析目标service:cart的CPU指标


In [8]:
# Query CPU usage versus limits
print("📊 正在查询目标service的CPU使用率与限制...")

if TARGET_SERVICE:
    cpu_usage_vs_limits_query = f"""
.entity_set with(domain='k8s', name='k8s.deployment', query=`name='{TARGET_SERVICE}'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_vs_limits', 'range', '1m')
"""
    print(f"🔍 Query: {cpu_usage_vs_limits_query.strip()}")
else:
    print(f"⚠️ 跳过CPU分析 - 无目标service")
    cpu_usage_vs_limits_query = None

# Run the query
if TARGET_SERVICE and cpu_usage_vs_limits_query:
    try:
        # Convert the time strings to the timestamps required by the CMS query
        from_time = int(time.mktime(datetime.strptime(NORMAL_START_TIME, "%Y-%m-%d %H:%M:%S").timetuple()))
        to_time = int(time.mktime(datetime.strptime(ANOMALY_END_TIME, "%Y-%m-%d %H:%M:%S").timetuple()))
        
        cpu_usage_vs_limits_result = cms_tester._execute_spl_query(
            cpu_usage_vs_limits_query.strip(),
            from_time=from_time,
            to_time=to_time
        )
        
        if cpu_usage_vs_limits_result and cpu_usage_vs_limits_result.data:
            print(f"✅ 已获取 CPU使用率与限制 数据：共 {len(cpu_usage_vs_limits_result.data)} 条记录")
            
            # Show a few sample records
            if cpu_usage_vs_limits_result.header:
                print(f"📋 Fields: {cpu_usage_vs_limits_result.header}")
            
            print(f"📊 数据样例（前3条）:")
            for i, record in enumerate(cpu_usage_vs_limits_result.data[:3]):
                print(f"  Record {i+1}: {record}")
        else:
            print(f"⚠️  未找到 {TARGET_SERVICE} 的 CPU使用率与限制 数据")
            
    except Exception as e:
        print(f"❌ 查询 CPU使用率与限制 时出错: {e}")
else:
    print("⚠️ 跳过CPU分析 - 无有效目标service")

📊 正在查询目标service的CPU使用率与限制...
🔍 Query: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_vs_limits', 'range', '1m')
🔍 查询参数:
  Workspace: quanxi-tianchi-test
  时间范围: 2025-08-27 20:00:20 到 2025-08-28 20:05:20
  查询语句: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_vs_limits', 'range', '1m')

📊 查询响应:
  状态码: 200
  返回header: ['__labels__', '__name__', '__ts__', '__value__', '__source__']
  返回data行数: 1

✅ 已获取 CPU使用率与限制 数据：共 1 条记录
📋 Fields: ['__labels__', '__name__', '__ts__', '__value__', '__source__']
📊 数据样例（前3条）:
  Record 1: ['{}', 'null', '[1756348280000000000,1756348340000000000,1756348400000000000,1756348460000000000,1756348520000000000,1756348580000000000,1756348640000000000,1756348700000000000,1756348760000000000,1756348820000000000,1756348
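
Each record comes back as a flat list aligned with the header `['__labels__', '__name__', '__ts__', '__value__', '__source__']`, with `__ts__` holding a JSON array of nanosecond timestamps. The sketch below turns one record into `(datetime, value)` pairs for easier inspection. It assumes `__value__` is a JSON array string of the same length as `__ts__`; the recorded output is truncated before that field, so this is not verified. `record_to_series` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone
from typing import List, Sequence, Tuple


def record_to_series(header: Sequence[str], record: Sequence[str]) -> List[Tuple[datetime, float]]:
    """Pair each nanosecond timestamp with its metric value."""
    row = dict(zip(header, record))
    ts_ns = json.loads(row["__ts__"])
    # Assumption: __value__ is a JSON array string aligned with __ts__.
    values = json.loads(row["__value__"])
    return [
        (datetime.fromtimestamp(t // 1_000_000_000, tz=timezone.utc), float(v))
        for t, v in zip(ts_ns, values)
    ]


if __name__ == "__main__":
    header = ["__labels__", "__name__", "__ts__", "__value__", "__source__"]
    # Synthetic values for illustration only; they are not taken from the query result.
    record = ["{}", "null", "[1756348280000000000,1756348340000000000]", "[0.10,0.25]", ""]
    for when, value in record_to_series(header, record):
        print(when.isoformat(), value)
```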

In [9]:
# Query the service's memory usage versus limits for comparison
print("📊 正在查询目标service的内存使用率与限制（用于对比）...")

memory_usage_vs_limits_query = f"""
.entity_set with(domain='k8s', name='k8s.deployment', query=`name='{TARGET_SERVICE}'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_memory_usage_vs_limits', 'range', '1m')
"""

print(f"🔍 Query: {memory_usage_vs_limits_query.strip()}")

try:
    memory_usage_vs_limits_result = cms_tester._execute_spl_query(
        memory_usage_vs_limits_query.strip(),
        from_time=from_time,
        to_time=to_time
    )
    
    if memory_usage_vs_limits_result and memory_usage_vs_limits_result.data:
        print(f"✅ 已获取内存使用率与限制数据：共: {len(memory_usage_vs_limits_result.data)} 条记录")
        
        # Show sample records
        print(f"📊 数据样例（前3条）:")
        for i, record in enumerate(memory_usage_vs_limits_result.data[:3]):
            print(f"  Record {i+1}: {record}")
    else:
        print(f"⚠️  未找到 {TARGET_SERVICE} 的内存使用率与限制数据")
        
except Exception as e:
    print(f"❌ 查询内存使用率与限制时出错: {e}")

📊 正在查询目标service的内存使用率与限制（用于对比）...
🔍 Query: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_memory_usage_vs_limits', 'range', '1m')
🔍 查询参数:
  Workspace: quanxi-tianchi-test
  时间范围: 2025-08-27 20:00:20 到 2025-08-28 20:05:20
  查询语句: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_memory_usage_vs_limits', 'range', '1m')

📊 查询响应:
  状态码: 200
  返回header: ['__labels__', '__name__', '__ts__', '__value__', '__source__']
  返回data行数: 1

✅ 已获取内存使用率与限制数据：共: 1 条记录
📊 数据样例（前3条）:
  Record 1: ['{}', 'null', '[1756296020000000000,1756296080000000000,1756296140000000000,1756296200000000000,1756296260000000000,1756296320000000000,1756296380000000000,1756296440000000000,1756296500000000000,1756296560000000000,1756296620000000000,1756296680000000000,1756296740000000000,175629680000

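## Step 4: Anomaly Detection with series_decompose_anomalies

Use SPL's built-in anomaly detection to automatically determine whether the CPU usage metric behaved abnormally during the latency spike.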
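
For a quick local sanity check of a metric series, a simple outlier test can be run on the values before or alongside the server-side query. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD). This is a deliberately different and much simpler technique, swapped in for illustration; it is not the algorithm behind series_decompose_anomalies, and the 3.5 threshold is a conventional default rather than a value taken from this notebook.

```python
from statistics import median
from typing import List, Sequence


def robust_zscores(values: Sequence[float]) -> List[float]:
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # A flat series has no spread, so nothing can be scored as an outlier.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_upper_anomalies(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return the indexes of points that sit far above the typical level."""
    return [i for i, z in enumerate(robust_zscores(values)) if z > threshold]


if __name__ == "__main__":
    # Synthetic CPU readings for illustration only.
    cpu = [10, 11, 10, 12, 11, 10, 11, 50]
    print(flag_upper_anomalies(cpu))
```

If the indexes flagged locally fall inside the anomaly window, that supports the server-side result; if they do not, the detection parameters or the chosen time range deserve a second look.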

In [10]:
print("🔍 Step 4: CPU Usage Anomaly Detection...")
print("="*60)

# Build the query with anomaly detection
cpu_anomaly_detection_query = f"""
.entity_set with(domain='k8s', name='k8s.deployment', query=`name='{TARGET_SERVICE}'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{{"confidence": 0.035}}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg
"""

print(f"🔍 Query: {cpu_anomaly_detection_query.strip()}")

try:
    # Reuse the time range from step 3 (normal-window start to anomaly-window end) so the detector sees baseline context
    extended_from_time = from_time 
    extended_to_time = to_time 
    
    anomaly_result = cms_tester._execute_spl_query(
        cpu_anomaly_detection_query.strip(),
        from_time=extended_from_time,
        to_time=extended_to_time
    )
    
    if anomaly_result and anomaly_result.data:
        print(f"✅ 已获取异常检测结果：共 {len(anomaly_result.data)} 条记录")
        
        if anomaly_result.header:
            print(f"📋 Fields: {anomaly_result.header}")
        
        # Look for anomaly markers in the result
        anomaly_found = False
        anomaly_details = []

        # Locate the anomalies_type_series column by name when a header is available,
        # so numeric columns such as __value__ are not scanned by mistake
        header = list(anomaly_result.header or [])
        type_idx = header.index('anomalies_type_series') if 'anomalies_type_series' in header else None

        for i, record in enumerate(anomaly_result.data[:5]):
            print(f"  Record {i+1}: {record}")

            if not isinstance(record, (list, tuple)):
                continue
            # Fall back to scanning every field when the column cannot be located
            if type_idx is not None and type_idx < len(record):
                items = [record[type_idx]]
            else:
                items = record

            for item in items:
                if not isinstance(item, str):
                    continue
                # Check both bounds independently: one series may contain both types.
                # The former substring test for '1.0' was dropped because it also
                # matched ordinary metric values such as 11.05.
                for anomaly_type in ('ExceedUpperBound', 'ExceedLowerBound'):
                    detail = f'{anomaly_type} detected'
                    if anomaly_type in item and detail not in anomaly_details:
                        anomaly_found = True
                        anomaly_details.append(detail)
        
        if anomaly_found:
            print(f"🚨 在 {TARGET_SERVICE} 的CPU使用率中检测到异常！")
            for detail in anomaly_details:
                print(f"    📊 {detail}")
        else:
            print(f"✅ {TARGET_SERVICE} 的CPU使用率未检测到明显异常")
            
    else:
        print(f"⚠️  未找到 {TARGET_SERVICE} 的异常检测结果")
        
except Exception as e:
    print(f"❌ 异常检测过程中出错: {e}")

🔍 Step 4: CPU Usage Anomaly Detection...
🔍 Query: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{"confidence": 0.035}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg
🔍 查询参数:
  Workspace: quanxi-tianchi-test
  时间范围: 2025-08-27 20:00:20 到 2025-08-28 20:05:20
  查询语句: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='cart'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{"confidence": 0.035}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg

📊 查询响应:
  状态码: 200
  返

## Step 5: Root cause analysis summary

Summarize the analysis results and determine the most likely root cause candidate.
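
The decision rule applied by the summary cell in this step can be restated as a small pure function, which makes the three outcomes easier to see. This is a sketch: `classify_root_cause` and its parameter names are hypothetical, and the confidence labels are written in English here rather than as the strings the notebook prints.

```python
def classify_root_cause(service, span_count, has_cpu_data, anomaly_points, min_points=3):
    """Return (root_cause_candidate, confidence) from the collected evidence.

    - high:   high-exclusive-time spans + CPU data + at least min_points anomaly points
    - medium: spans and CPU data exist, but the anomaly is not confirmed
    - low:    anything else
    """
    anomaly_confirmed = anomaly_points >= min_points
    if anomaly_confirmed and span_count > 0 and has_cpu_data:
        return f"{service}.cpu", "high"
    if span_count > 0 and has_cpu_data:
        return f"{service}.cpu", "medium"
    return "unknown", "low"


# Values taken from this notebook's run: 3277 spans, CPU data present, 210 anomaly points
print(classify_root_cause("cart", 3277, True, 210))
# Same spans and data, but only two anomaly points: not enough to confirm
print(classify_root_cause("cart", 3277, True, 2))
```

Keeping the rule in one function also makes it easy to change the minimum number of anomaly points without touching the reporting code.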

In [11]:
print("🔍 Step 5: Root Cause Analysis Summary")
print("="*60)

# 分析结果
print(f"📊 分析总结：")
print(f"   异常时间段：{ANOMALY_START_TIME} 到 {ANOMALY_END_TIME}")
print(f"   正常时间段：{NORMAL_START_TIME} 到 {NORMAL_END_TIME}")
print(f"   分析目标service：{TARGET_SERVICE}")
print(f"   发现高独占时间的span数量：{len(top_95_percent_spans) if 'top_95_percent_spans' in locals() else 0}")

print(f"\n🎯 根因发现：")

# 检查是否有CPU问题的证据
cpu_evidence = False
memory_evidence = False
anomaly_evidence = False

if 'cpu_usage_vs_limits_result' in locals() and cpu_usage_vs_limits_result and cpu_usage_vs_limits_result.data:
    cpu_evidence = True
    print(f"   ✅ 已获取 {TARGET_SERVICE} 的CPU使用数据")
else:
    print(f"   ❌ 未找到 {TARGET_SERVICE} 的CPU使用数据")

if 'memory_usage_vs_limits_result' in locals() and memory_usage_vs_limits_result and memory_usage_vs_limits_result.data:
    memory_evidence = True
    print(f"   ✅ 已获取 {TARGET_SERVICE} 的内存使用数据")
else:
    print(f"   ❌ 未找到 {TARGET_SERVICE} 的内存使用数据")

# 检查实际异常点，至少需要3个异常点
if 'anomaly_result' in locals() and anomaly_result and anomaly_result.data:
    print(f"   ✅ 异常检测分析已完成")
    
    # 统计异常点数量，至少3个才能确认有异常
    anomaly_point_count = 0
    anomaly_types_found = []
    
    for record in anomaly_result.data:
        if isinstance(record, (list, tuple)):
            for item in record:
                if isinstance(item, str):
                    # 统计ExceedUpperBound出现次数
                    exceed_upper_count = item.count('ExceedUpperBound')
                    exceed_lower_count = item.count('ExceedLowerBound')
                    
                    anomaly_point_count += exceed_upper_count + exceed_lower_count
                    
                    if exceed_upper_count > 0:
                        anomaly_types_found.extend(['ExceedUpperBound'] * exceed_upper_count)
                    if exceed_lower_count > 0:
                        anomaly_types_found.extend(['ExceedLowerBound'] * exceed_lower_count)
    
    print(f"   📊 检测到的异常点总数：{anomaly_point_count}")
    
    # 只有存在3个及以上异常点才确认异常
    if anomaly_point_count >= 3:
        anomaly_evidence = True
        print(f"   🚨 异常确认：发现 {anomaly_point_count} 个异常点（已达≥3阈值）")
        print(f"   📝 异常类型：{', '.join(set(anomaly_types_found))}")
    elif anomaly_point_count > 0:
        anomaly_evidence = False
        print(f"   ⚠️  异常证据不足：仅检测到 {anomaly_point_count} 个点（需要≥3）")
    else:
        anomaly_evidence = False
        print(f"   ℹ️  异常检测已完成，但未发现异常")
else:
    print(f"   ❌ 异常检测分析失败")

print(f"\n🏆 根因候选：")

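# Evidence-based assessment: evidence is True only when anomaly points were actually detected
# Guard against Step 1 not having run, so the len() calls below cannot raise NameError
if 'top_95_percent_spans' not in locals():
    top_95_percent_spans = []
evidence = anomaly_evidence and len(top_95_percent_spans) > 0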

if evidence and cpu_evidence:
    root_cause_candidate = f"{TARGET_SERVICE}.cpu"
    confidence = "高"
    
    print(f"   🎯 {root_cause_candidate}")
    print(f"   📈 置信度：{confidence}")
    print(f"   ✅ 证据：TRUE（已检测到异常）")
    print(f"   📝 支持证据：")
    print(f"      - 发现 {len(top_95_percent_spans)} 个高独占时间span")
    print(f"      - 模式分析表明涉及服务 {TARGET_SERVICE}")
    print(f"      - 有CPU指标数据可详细分析")
    print(f"      - 自动异常检测确认了CPU使用异常")
        
elif len(top_95_percent_spans) > 0 and cpu_evidence:
    root_cause_candidate = f"{TARGET_SERVICE}.cpu"
    confidence = "中"
    
    print(f"   🎯 {root_cause_candidate}")
    print(f"   📈 置信度：{confidence}")
    print(f"   ❌ 证据：FALSE（未确认异常）")
    print(f"   📝 支持证据：")
    print(f"      - 发现 {len(top_95_percent_spans)} 个高独占时间span")
    print(f"      - 模式分析表明涉及服务 {TARGET_SERVICE}")
    print(f"      - 有CPU指标数据，但未检测到明显异常")
    
else:
    root_cause_candidate = "unknown"
    confidence = "低"
    
    print(f"   🎯 {root_cause_candidate}")
    print(f"   📈 置信度：{confidence}")
    print(f"   ❌ 证据：FALSE（证据不足）")
    print(f"   📝 支持证据：")
    print(f"      - 高独占时间span或指标数据有限")
    print(f"      - 分析可能需要不同的参数或时间范围")

print(f"\n💡 建议：")
if confidence == "高":
    print(f"   - 检查 {TARGET_SERVICE} 部署的CPU资源限制")
    print(f"   - 检查异常期间是否有高CPU消耗操作")
    print(f"   - 考虑扩容 {TARGET_SERVICE} 或优化CPU使用")
elif confidence == "中":
    print(f"   - 进一步排查 {TARGET_SERVICE} 部署的资源使用")
    print(f"   - 检查异常期间 {TARGET_SERVICE} 的日志")
    print(f"   - 即使未检测到明显异常，也要核查是否存在细微性能问题")
else:
    print(f"   - 调整分析参数（时间范围、持续时间阈值等）")
    print(f"   - 核查指定时间段内数据是否齐全")
    print(f"   - 可考虑扩展到其他服务的分析")

print(f"\n" + "="*60)
print(f"🎯 最终答复：{root_cause_candidate}")
print(f"📈 置信度：{confidence}")
print(f"🔍 证据：{'TRUE' if evidence else 'FALSE'}")
print(f"" + "="*60)

🔍 Step 5: Root Cause Analysis Summary
📊 分析总结：
   异常时间段：2025-08-28 20:00:20 到 2025-08-28 20:05:20
   正常时间段：2025-08-27 20:00:20 到 2025-08-27 20:05:20
   分析目标service：cart
   发现高独占时间的span数量：3277

🎯 根因发现：
   ✅ 已获取 cart 的CPU使用数据
   ✅ 已获取 cart 的内存使用数据
   ✅ 异常检测分析已完成
   📊 检测到的异常点总数：210
   🚨 异常确认：发现 210 个异常点（已达≥3阈值）
   📝 异常类型：ExceedLowerBound, ExceedUpperBound

🏆 根因候选：
   🎯 cart.cpu
   📈 置信度：高
   ✅ 证据：TRUE（已检测到异常）
   📝 支持证据：
      - 发现 3277 个高独占时间span
      - 模式分析表明涉及服务 cart
      - 有CPU指标数据可详细分析
      - 自动异常检测确认了CPU使用异常

💡 建议：
   - 检查 cart 部署的CPU资源限制
   - 检查异常期间是否有高CPU消耗操作
   - 考虑扩容 cart 或优化CPU使用

🎯 最终答复：cart.cpu
📈 置信度：高
🔍 证据：TRUE
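
The 210 anomaly points reported above are counted over the whole query range (2025-08-27 20:00:20 to 2025-08-28 20:05:20), not just the five-minute anomaly window. A stricter check counts only the points whose timestamps fall inside that window. The sketch below assumes that `__ts__` and `anomalies_type_series` have already been parsed into two Python lists of equal length (the raw string format is not shown in this notebook), and that the notebook's time strings are UTC+8, which matches the first timestamp in the metric output (1756296020000000000 ns). The helper names `to_ns` and `count_anomalies_in_window` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=8))  # assumption: notebook time strings are UTC+8
NS_PER_SECOND = 1_000_000_000


def to_ns(text, tz=TZ):
    """Convert 'YYYY-MM-DD HH:MM:SS' to epoch nanoseconds."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp()) * NS_PER_SECOND


def count_anomalies_in_window(ts_ns, labels, start, end):
    """Count ExceedUpperBound/ExceedLowerBound points with start <= ts <= end."""
    lo, hi = to_ns(start), to_ns(end)
    return sum(
        1
        for ts, label in zip(ts_ns, labels)
        if lo <= ts <= hi and label in ("ExceedUpperBound", "ExceedLowerBound")
    )


# Made-up sample at 1-minute resolution, starting two minutes before the window
ts = [to_ns("2025-08-28 19:58:20") + i * 60 * NS_PER_SECOND for i in range(6)]
labels = ["ExceedLowerBound", "None", "None",
          "ExceedUpperBound", "ExceedUpperBound", "None"]

n = count_anomalies_in_window(ts, labels, "2025-08-28 20:00:20", "2025-08-28 20:05:20")
print(n)
```

With a count restricted to the window, the ≥3 threshold reflects behavior during the incident rather than noise accumulated over the preceding day.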


## Additional analysis tools

The following code provides additional analysis that can be run on demand.

In [None]:
# Additional：其他service的高级异常检测
print("🔍 附加分析：其他service的CPU和内存异常检测")
print("="*60)

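# Other common services to check; TARGET_SERVICE is skipped because it was already analyzed above
OTHER_SERVICES = [s for s in ["frontend", "cart", "checkout", "payment", "shipping", "currency", "ad", "recommendation"] if s != TARGET_SERVICE]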
other_services_anomalies = {}

for service in OTHER_SERVICES:
    try:
        print(f"\n🔍 正在分析 {service} 服务的异常...")

        # CPU异常检测
        service_cpu_anomaly_query = f"""
.entity_set with(domain='k8s', name='k8s.deployment', query=`name='{service}'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{{"confidence": 0.035}}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg
"""
        
        cpu_result = cms_tester._execute_spl_query(
            service_cpu_anomaly_query.strip(),
            from_time=from_time,
            to_time=to_time
        )
        
        cpu_anomaly_found = False
        cpu_anomaly_details = []
        cpu_anomaly_count = 0
        
        if cpu_result and cpu_result.data:
            print(f"   ✅ {service}：已获取CPU指标数据")
            
            # 在CPU数据中统计异常点数 - 至少3个才算异常
            for record in cpu_result.data:
                if isinstance(record, (list, tuple)):
                    for item in record:
                        if isinstance(item, str):
                            exceed_upper_count = item.count('ExceedUpperBound')
                            exceed_lower_count = item.count('ExceedLowerBound')
                            cpu_anomaly_count += exceed_upper_count + exceed_lower_count
            
            # 只有在异常点数达到3及以上时才认为有异常
            if cpu_anomaly_count >= 3:
                cpu_anomaly_found = True
                cpu_anomaly_details.append(f'CPU异常点：{cpu_anomaly_count} 个')
        
        # 内存异常检测
        service_memory_anomaly_query = f"""
.entity_set with(domain='k8s', name='k8s.deployment', query=`name='{service}'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_memory_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{{"confidence": 0.035}}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg
"""
        
        memory_result = cms_tester._execute_spl_query(
            service_memory_anomaly_query.strip(),
            from_time=from_time,
            to_time=to_time
        )
        
        memory_anomaly_found = False
        memory_anomaly_details = []
        memory_anomaly_count = 0
        
        if memory_result and memory_result.data:
            print(f"   ✅ {service}：已获取内存指标数据")
            
            # 在内存数据中统计异常点数 - 至少3个才算异常
            for record in memory_result.data:
                if isinstance(record, (list, tuple)):
                    for item in record:
                        if isinstance(item, str):
                            exceed_upper_count = item.count('ExceedUpperBound')
                            exceed_lower_count = item.count('ExceedLowerBound')
                            memory_anomaly_count += exceed_upper_count + exceed_lower_count
            
            # 只有在异常点数达到3及以上时才认为有异常
            if memory_anomaly_count >= 3:
                memory_anomaly_found = True
                memory_anomaly_details.append(f'内存异常点：{memory_anomaly_count} 个')
        
        # 汇总当前服务的异常分析结果
        service_anomalies = cpu_anomaly_details + memory_anomaly_details
        
        if service_anomalies:
            print(f"   🚨 {service}：检测到异常")
            for anomaly in service_anomalies:
                print(f"      📊 {anomaly}")
            other_services_anomalies[service] = service_anomalies
        else:
            if (cpu_result and cpu_result.data) or (memory_result and memory_result.data):
                print(f"   ✅ {service}：未检测到异常")
            else:
                print(f"   ❌ {service}：无指标数据可用")
            
    except Exception as e:
        print(f"   ❌ {service}：出错 - {e}")
        
    # 增加小延迟以避免API被请求过快
    time.sleep(0.5)

# 其他服务异常检测结果总结
print(f"\n" + "="*60)
print(f"🎯 其他服务异常汇总：")

if other_services_anomalies:
    print(f"📊 检测到异常的服务：")
    for service, anomalies in other_services_anomalies.items():
        print(f"  🚨 {service}：")
        for anomaly in anomalies:
            print(f"    - {anomaly}")
    
    print(f"\n💡 建议进一步调查：")
    for service in other_services_anomalies.keys():
        print(f"   - 检查 {service} 部署的资源约束情况")
        print(f"   - 检查异常期间 {service} 的日志")
        
else:
    print(f"✅ 其他服务未检测到异常")
    print(f"💡 主要异常似乎只影响 {TARGET_SERVICE} 服务")

print(f"="*60)

🔍 附加分析：其他service的CPU和内存异常检测

🔍 正在分析 frontend 服务的异常...
🔍 查询参数:
  Workspace: quanxi-tianchi-test
  时间范围: 2025-08-27 20:00:20 到 2025-08-28 20:05:20
  查询语句: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='frontend'` ) 
| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', 'deployment_cpu_usage_total', 'range', '1m')
| extend ret = series_decompose_anomalies(__value__, '{"confidence": 0.035}')
| extend anomalies_score_series = ret.anomalies_score_series, anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg

📊 查询响应:
  状态码: 200
  返回header: ['__labels__', '__name__', '__ts__', '__value__', 'ret', 'anomalies_score_series', 'anomalies_type_series', 'error_msg', '__source__']
  返回data行数: 1

   ✅ frontend：已获取CPU指标数据
🔍 查询参数:
  Workspace: quanxi-tianchi-test
  时间范围: 2025-08-27 20:00:20 到 2025-08-28 20:05:20
  查询语句: .entity_set with(domain='k8s', name='k8s.deployment', query=`name='frontend'` ) 
| entity-call get_metric('k8s', 'k8s.metri

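## Usage

1. **Configure parameters** - In the second cell, set the parameters for your specific incident
2. **Run all cells in order** - The analysis steps depend on each other and must be executed in sequence
3. **Check the pattern analysis query** - Review the pattern analysis query in Step 2 and, if needed, run it manually in the SLS console
4. **Adjust the target service** - In Step 3, set `TARGET_SERVICE` according to the pattern analysis result
5. **Review the final summary** - Step 5 prints the final summary of root cause candidates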

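### Expected workflow results

- **Step 1**: Identifies spans with abnormally high exclusive time
- **Step 2**: Reveals which service (e.g. `recommendation`) has the most problematic spans
- **Step 3**: Shows the CPU usage metrics of the identified service
- **Step 4**: Confirms whether CPU usage shows an anomalous pattern
- **Step 5**: Produces the root cause conclusion (e.g. `recommendation.cpu`)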

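### Troubleshooting

- **No high-exclusive-time spans found**: Adjust `DURATION_THRESHOLD` or the time range
- **CMS query fails**: Verify the environment variables and workspace access permissions
- **Pattern analysis shows no clear service**: Check whether multiple services are affected
- **No anomaly detected**: The problem may be unrelated to CPU; check memory, network, or external dependencies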
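
For the last item, the same anomaly query can be pointed at a different metric by parameterizing the metric name. Below is a minimal sketch that rebuilds the query text used in this notebook; `build_anomaly_query` is a hypothetical helper, and the only metric names known from this notebook are `deployment_cpu_usage_total`, `deployment_memory_usage_total` and `deployment_memory_usage_vs_limits`.

```python
import json


def build_anomaly_query(service, metric, confidence=0.035):
    """Build the entity_set + get_metric + series_decompose_anomalies query text."""
    config = json.dumps({"confidence": confidence})  # -> {"confidence": 0.035}
    return "\n".join([
        f".entity_set with(domain='k8s', name='k8s.deployment', query=`name='{service}'` )",
        f"| entity-call get_metric('k8s', 'k8s.metric.high_level_metric_deployment', '{metric}', 'range', '1m')",
        f"| extend ret = series_decompose_anomalies(__value__, '{config}')",
        "| extend anomalies_score_series = ret.anomalies_score_series, "
        "anomalies_type_series = ret.anomalies_type_series, error_msg = ret.error_msg",
    ])


query = build_anomaly_query("cart", "deployment_memory_usage_total")
print(query)
```

The resulting string can be passed to the same query call used in Step 4, with the same time range.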