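# GNTO Model Demo

This notebook demonstrates what each model component in the GNTO project does and how to use it.

## Project Structure
- **DataPreprocessor**: converts a raw query plan into a structured PlanNode tree
- **Encoder**: encodes a PlanNode tree as a numeric vector
- **TreeModel**: aggregates encoded vectors into a single representation
- **Predictioner**: produces a prediction from the feature vector
- **GNTO**: the complete end-to-end pipeline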


In [9]:
# Import the required libraries
import numpy as np
import json
import random
from pprint import pprint

# Import the GNTO model components
from DataPreprocessor import DataPreprocessor, PlanNode
from Encoder import Encoder
from TreeModel import TreeModel
from Predictioner import Predictioner
from Gnto import GNTO
from Utils import set_seed, flatten

# Set the random seed so that results are reproducible
set_seed(42)
print("Environment setup complete!")


Environment setup complete!


## 1. DataPreprocessor Demo

DataPreprocessor converts a raw query-plan dictionary into a structured PlanNode tree.
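
To make the conversion concrete, here is a minimal sketch of how such a tree could be built. It is not the project's implementation: `SketchPlanNode` and `sketch_preprocess` are hypothetical names, and the assumption that everything except the nested `"Plans"` list ends up in `extra_info` is inferred from the demo output rather than from the source code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchPlanNode:
    """Hypothetical stand-in for PlanNode (assumed fields, not the real class)."""
    node_type: str
    children: List["SketchPlanNode"] = field(default_factory=list)
    extra_info: Dict[str, Any] = field(default_factory=dict)


def sketch_preprocess(plan: Dict[str, Any]) -> SketchPlanNode:
    """Recursively turn a plan dict into a tree, descending into "Plans"."""
    # Assumption: every key except the nested "Plans" list is kept as extra_info.
    extra = {key: value for key, value in plan.items() if key != "Plans"}
    children = [sketch_preprocess(child) for child in plan.get("Plans", [])]
    return SketchPlanNode(plan.get("Node Type", "Unknown"), children, extra)


demo = sketch_preprocess({
    "Node Type": "Hash Join",
    "Plans": [{"Node Type": "Seq Scan"}, {"Node Type": "Hash"}],
})
print(demo.node_type, [child.node_type for child in demo.children])
# prints: Hash Join ['Seq Scan', 'Hash']
```

Keeping the children in a separate list, rather than leaving them inside the dictionary, lets later stages walk the tree without knowing the key names of the raw plan format.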


In [10]:
# Create a DataPreprocessor instance
preprocessor = DataPreprocessor()

# Example 1: a simple query plan
simple_plan = {
    "Node Type": "Seq Scan",
    "Relation Name": "users",
    "Cost": 100.0,
    "Rows": 1000
}

print("=== Simple query plan ===")
print("Raw plan:")
pprint(simple_plan)

# Preprocess the plan
processed_simple = preprocessor.preprocess(simple_plan)
print("\nProcessed PlanNode:")
print(f"Node Type: {processed_simple.node_type}")
print(f"Children: {len(processed_simple.children)}")
print(f"Extra Info: {processed_simple.extra_info}")


=== Simple query plan ===
Raw plan:
{'Cost': 100.0, 'Node Type': 'Seq Scan', 'Relation Name': 'users', 'Rows': 1000}

Processed PlanNode:
Node Type: Seq Scan
Children: 0
Extra Info: {'Node Type': 'Seq Scan', 'Relation Name': 'users', 'Cost': 100.0, 'Rows': 1000}


In [11]:
# Example 2: a more complex, nested query plan
complex_plan = {
    "Node Type": "Hash Join",
    "Join Type": "Inner",
    "Hash Cond": "(users.id = orders.user_id)",
    "Cost": 500.0,
    "Rows": 2500,
    "Plans": [
        {
            "Node Type": "Seq Scan",
            "Relation Name": "users",
            "Cost": 100.0,
            "Rows": 1000,
            "Filter": "users.active = true"
        },
        {
            "Node Type": "Hash",
            "Cost": 200.0,
            "Rows": 5000,
            "Plans": [
                {
                    "Node Type": "Index Scan",
                    "Index Name": "orders_user_id_idx",
                    "Relation Name": "orders",
                    "Cost": 150.0,
                    "Rows": 5000,
                    "Index Cond": "user_id IS NOT NULL"
                }
            ]
        }
    ]
}

print("\n=== Complex nested query plan ===")
print("Raw plan:")
pprint(complex_plan)

# Preprocess the complex plan
processed_complex = preprocessor.preprocess(complex_plan)

def print_plan_tree(node, indent=0):
    """Recursively print the structure of the plan tree."""
    print("  " * indent + f"├─ {node.node_type}")
    if node.extra_info:
        for key, value in list(node.extra_info.items())[:2]:  # show only the first two attributes
            print("  " * indent + f"   {key}: {value}")
    for child in node.children:
        print_plan_tree(child, indent + 1)

print("\nProcessed plan tree:")
print_plan_tree(processed_complex)



=== Complex nested query plan ===
Raw plan:
{'Cost': 500.0,
 'Hash Cond': '(users.id = orders.user_id)',
 'Join Type': 'Inner',
 'Node Type': 'Hash Join',
 'Plans': [{'Cost': 100.0,
            'Filter': 'users.active = true',
            'Node Type': 'Seq Scan',
            'Relation Name': 'users',
            'Rows': 1000},
           {'Cost': 200.0,
            'Node Type': 'Hash',
            'Plans': [{'Cost': 150.0,
                       'Index Cond': 'user_id IS NOT NULL',
                       'Index Name': 'orders_user_id_idx',
                       'Node Type': 'Index Scan',
                       'Relation Name': 'orders',
                       'Rows': 5000}],
            'Rows': 5000}],
 'Rows': 2500}

Processed plan tree:
├─ Hash Join
   Node Type: Hash Join
   Join Type: Inner
  ├─ Seq Scan
     Node Type: Seq Scan
     Relation Name: users
  ├─ Hash
     Node Type: Hash
     Cost: 200.0
    ├─ Index Scan
       Node Type: Index Scan
       Index Name: orders_user_id_idx


## 2. Encoder Demo

Encoder encodes a PlanNode tree as a numeric vector: each node type is represented by a one-hot vector, and child nodes are aggregated recursively.
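
The following sketch shows one way this encoding could work. It assumes the one-hot vectors of a node and all its descendants are summed, and that the vocabulary grows in pre-order as new node types appear. `SketchEncoder` and the dict-based node layout (`"type"`, `"children"`) are hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np


class SketchEncoder:
    """Hypothetical encoder: sums one-hot vectors over every node in a tree."""

    def __init__(self):
        self.node_index = {}  # node type -> position in the vector

    def _register(self, node):
        # Assign the next free index the first time a node type is seen (pre-order).
        self.node_index.setdefault(node["type"], len(self.node_index))
        for child in node.get("children", []):
            self._register(child)

    def _accumulate(self, node, vec):
        # Adding 1 at the type's index is the same as adding its one-hot vector.
        vec[self.node_index[node["type"]]] += 1.0
        for child in node.get("children", []):
            self._accumulate(child, vec)

    def encode(self, node):
        self._register(node)
        vec = np.zeros(len(self.node_index))
        self._accumulate(node, vec)
        return vec


enc = SketchEncoder()
leaf = {"type": "Seq Scan"}
tree = {
    "type": "Hash Join",
    "children": [leaf, {"type": "Hash", "children": [{"type": "Index Scan"}]}],
}
first = enc.encode(leaf)
second = enc.encode(tree)
print(first, second)
# prints: [1.] [1. 1. 1. 1.]
```

Because the vocabulary grows over time, vectors encoded earlier are shorter than later ones, so a downstream consumer has to cope with mixed lengths.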


In [12]:
# Create an Encoder instance
encoder = Encoder()

print("=== Encoder demo ===")

# Encode the simple plan
simple_encoded = encoder.encode(processed_simple)
print(f"Encoded vector for the simple plan: {simple_encoded}")
print(f"Vector dimension: {len(simple_encoded)}")
print(f"Current node-type index: {encoder.node_index}")

print("\n" + "="*50)

# Encode the complex plan
complex_encoded = encoder.encode(processed_complex)
print(f"Encoded vector for the complex plan: {complex_encoded}")
print(f"Vector dimension: {len(complex_encoded)}")
print(f"Updated node-type index: {encoder.node_index}")

# Show the one-hot encoding of each node type
print("\nOne-hot encoding of each node type:")
for node_type, idx in encoder.node_index.items():
    one_hot = encoder._one_hot(idx)
    print(f"{node_type}: {one_hot}")


=== Encoder demo ===
Encoded vector for the simple plan: [1.]
Vector dimension: 1
Current node-type index: {'Seq Scan': 0}

==================================================
Encoded vector for the complex plan: [1. 1. 1. 1.]
Vector dimension: 4
Updated node-type index: {'Seq Scan': 0, 'Hash Join': 1, 'Hash': 2, 'Index Scan': 3}

One-hot encoding of each node type:
Seq Scan: [1. 0. 0. 0.]
Hash Join: [0. 1. 0. 0.]
Hash: [0. 0. 1. 0.]
Index Scan: [0. 0. 0. 1.]


In [13]:
# Test batch encoding
multiple_plans = [
    {
        "Node Type": "Sort",
        "Sort Key": ["name"],
        "Cost": 50.0,
        "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "products", "Cost": 30.0}
        ]
    },
    {
        "Node Type": "Aggregate",
        "Group Key": ["category"],
        "Cost": 75.0,
        "Plans": [
            {"Node Type": "Index Scan", "Index Name": "category_idx", "Cost": 25.0}
        ]
    }
]

print("\n=== Batch encoding demo ===")
processed_multiple = preprocessor.preprocess_all(multiple_plans)
encoded_multiple = encoder.encode_all(processed_multiple)

for i, (plan, encoded) in enumerate(zip(processed_multiple, encoded_multiple)):
    print(f"Plan {i+1} ({plan.node_type}): {encoded}")

print(f"\nFinal node-type vocabulary: {encoder.node_index}")
print(f"Vocabulary size: {len(encoder.node_index)}")



=== Batch encoding demo ===
Plan 1 (Sort): [1. 0. 0. 0. 1.]
Plan 2 (Aggregate): [0. 0. 0. 1. 0. 1.]

Final node-type vocabulary: {'Seq Scan': 0, 'Hash Join': 1, 'Hash': 2, 'Index Scan': 3, 'Sort': 4, 'Aggregate': 5}
Vocabulary size: 6


## 3. TreeModel Demo

TreeModel aggregates multiple encoded vectors into a single vector representation and supports two reductions, mean and sum.
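
As a rough illustration of the aggregation step, the sketch below zero-pads vectors of different lengths to a common width and then reduces them element-wise. `sketch_aggregate` is a hypothetical helper, not the project's `TreeModel`, and returning an empty vector for an empty input is a choice made for this sketch only.

```python
import numpy as np


def sketch_aggregate(vectors, reduction="mean"):
    """Zero-pad vectors to a common length, then reduce element-wise."""
    if reduction not in ("mean", "sum"):
        raise ValueError(f"unsupported reduction: {reduction}")
    if not vectors:
        # Sketch-only design choice: an empty input gives an empty vector.
        return np.zeros(0)
    width = max(len(vec) for vec in vectors)
    padded = np.zeros((len(vectors), width))
    for row, vec in enumerate(vectors):
        padded[row, :len(vec)] = vec  # shorter vectors keep trailing zeros
    return padded.mean(axis=0) if reduction == "mean" else padded.sum(axis=0)


vecs = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0]), np.array([6.0])]
total = sketch_aggregate(vecs, "sum")      # element-wise sums: 10, 6, 5
average = sketch_aggregate(vecs, "mean")   # the sums divided by 3
print(total, average)
```

Note that with zero-padding the mean divides by the number of vectors, not by the number of vectors that actually have a value at that position, so later dimensions are pulled towards zero.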


In [14]:
print("=== TreeModel demo ===")

# Create TreeModel instances with different reduction methods
tree_model_mean = TreeModel(reduction="mean")
tree_model_sum = TreeModel(reduction="sum")

# Prepare the set of test vectors
test_vectors = [simple_encoded, complex_encoded] + encoded_multiple

print("Input vectors:")
for i, vec in enumerate(test_vectors):
    print(f"Vector {i+1}: {vec}")

print(f"\nDimension of each vector: {[len(v) for v in test_vectors]}")

# Aggregate with mean
aggregated_mean = tree_model_mean.forward(test_vectors)
print(f"\nMean aggregation result: {aggregated_mean}")
print(f"Aggregated vector dimension: {len(aggregated_mean)}")

# Aggregate with sum
aggregated_sum = tree_model_sum.forward(test_vectors)
print(f"\nSum aggregation result: {aggregated_sum}")
print(f"Aggregated vector dimension: {len(aggregated_sum)}")

# Compare the two reductions: mean should equal sum / n, so the ratio is all ones
print("\nComparison of the reductions:")
print(f"Mean / (Sum / n): {aggregated_mean / (aggregated_sum / len(test_vectors))}")
print(f"Same vector length: {len(aggregated_mean) == len(aggregated_sum)}")


=== TreeModel demo ===
Input vectors:
Vector 1: [1.]
Vector 2: [1. 1. 1. 1.]
Vector 3: [1. 0. 0. 0. 1.]
Vector 4: [0. 0. 0. 1. 0. 1.]

Dimension of each vector: [1, 4, 5, 6]

Mean aggregation result: [0.75 0.25 0.25 0.5  0.25 0.25]
Aggregated vector dimension: 6

Sum aggregation result: [3. 1. 1. 2. 1. 1.]
Aggregated vector dimension: 6

Comparison of the reductions:
Mean / (Sum / n): [1. 1. 1. 1. 1. 1.]
Same vector length: True


In [15]:
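# Test edge cases
print("\n=== Edge-case tests ===")

# Empty list of vectors: the result is a scalar nan, so the len() call raises and is caught
try:
    empty_result = tree_model_mean.forward([])
    print(f"Aggregation result for an empty list: {empty_result}")
    print(f"Dimension of the empty result: {len(empty_result)}")
except Exception as e:
    print(f"Empty-list handling: {e}")

# Single vector
single_result = tree_model_mean.forward([complex_encoded])
print(f"\nSingle-vector aggregation result: {single_result}")
print(f"Original vector: {complex_encoded}")
print(f"Results identical: {np.array_equal(single_result, complex_encoded)}")

# Vectors of different dimensions (TreeModel pads them automatically)
different_dims = [
    np.array([1.0, 2.0]),
    np.array([3.0, 4.0, 5.0]),
    np.array([6.0])
]

mixed_result = tree_model_mean.forward(different_dims)
print("\nAggregating vectors of different dimensions:")
print(f"Input: {different_dims}")
print(f"Aggregation result: {mixed_result}")
print(f"Result dimension: {len(mixed_result)}")



=== Edge-case tests ===
Aggregation result for an empty list: nan
Empty-list handling: object of type 'numpy.float64' has no len()

Single-vector aggregation result: [1. 1. 1. 1.]
Original vector: [1. 1. 1. 1.]
Results identical: True

Aggregating vectors of different dimensions:
Input: [array([1., 2.]), array([3., 4., 5.]), array([6.])]
Aggregation result: [3.33333333 2.         1.66666667]
Result dimension: 3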


## 4. Predictioner Demo

Predictioner is a linear prediction head that maps a feature vector to a scalar prediction.
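
A linear head of this kind is essentially a dot product between the features and a weight vector. The sketch below is a hypothetical `SketchLinearHead`, not the project's class. It assumes that missing weights default to ones sized to the first input, that a weight vector shorter than the features is zero-padded, and that surplus weights are ignored.

```python
import numpy as np


class SketchLinearHead:
    """Hypothetical linear head: prediction = features . weights."""

    def __init__(self, weights=None):
        self.weights = None if weights is None else np.asarray(weights, dtype=float)

    def predict(self, features):
        features = np.asarray(features, dtype=float)
        if self.weights is None:
            # Assumption: default to all-ones weights sized to the first input.
            self.weights = np.ones(len(features))
        if len(self.weights) < len(features):
            # Assumption: new dimensions get a weight of zero.
            pad = len(features) - len(self.weights)
            self.weights = np.concatenate([self.weights, np.zeros(pad)])
        # Surplus weights beyond the feature length are simply ignored.
        return float(features @ self.weights[:len(features)])


head = SketchLinearHead([0.5, 1.0, 0.3])
score = head.predict([1.0, 2.0, 3.0, 4.0, 5.0])  # 0.5*1 + 1.0*2 + 0.3*3
print(f"{score:.1f}", head.weights)
```

One consequence of zero-padding is that any feature dimension added after the weights were fixed contributes nothing to the prediction until the weights are updated.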


In [16]:
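print("=== Predictioner demo ===")

# 1. Default weights (all ones)
predictor_default = Predictioner()
prediction_default = predictor_default.predict(aggregated_mean)
print("Prediction with default weights:")
print(f"Input vector: {aggregated_mean}")
print(f"Prediction: {prediction_default}")
print(f"Weights used: {predictor_default.weights}")

print("\n" + "="*50)

# 2. Custom weights (seven weights for a six-dimensional input)
custom_weights = [0.1, 0.5, 0.8, 0.3, 0.9, 0.2, 0.7]
predictor_custom = Predictioner(weights=custom_weights)
prediction_custom = predictor_custom.predict(aggregated_mean)
print("Prediction with custom weights:")
print(f"Custom weights: {custom_weights}")
print(f"Prediction: {prediction_custom}")

print("\n" + "="*50)

# 3. Predictions for different input vectors
# Note: simple_encoded has length 1, so the slice below leaves it unchanged.
test_features = [aggregated_mean, aggregated_sum, simple_encoded[:len(aggregated_mean)]]
feature_names = ["Mean aggregation", "Sum aggregation", "Simple encoding (truncated)"]

print("Predictions for different feature vectors:")
for name, features in zip(feature_names, test_features):
    pred = predictor_custom.predict(features)
    print(f"{name}: {pred:.4f}")

print("\n" + "="*50)

# 4. Weight-dimension adaptation test
print("Weight adaptation test:")
short_weights = [0.5, 1.0, 0.3]  # fewer weights than features
predictor_short = Predictioner(weights=short_weights)

long_features = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # more features than weights
pred_adaptive = predictor_short.predict(long_features)

print(f"Original weights: {short_weights}")
print(f"Feature vector: {long_features}")
print(f"Weights after adaptation: {predictor_short.weights}")
print(f"Prediction: {pred_adaptive}")


=== Predictioner demo ===
Prediction with default weights:
Input vector: [0.75 0.25 0.25 0.5  0.25 0.25]
Prediction: 2.25
Weights used: [1. 1. 1. 1. 1. 1.]

==================================================
Prediction with custom weights:
Custom weights: [0.1, 0.5, 0.8, 0.3, 0.9, 0.2, 0.7]
Prediction: 0.825

==================================================
Predictions for different feature vectors:
Mean aggregation: 0.8250
Sum aggregation: 3.3000
Simple encoding (truncated): 0.1000

==================================================
Weight adaptation test:
Original weights: [0.5, 1.0, 0.3]
Feature vector: [1. 2. 3. 4. 5.]
Weights after adaptation: [0.5 1.  0.3 0.  0. ]
Prediction: 3.4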


## 5. Utils Demo

The Utils module provides utility functions, including random-seed setup and flattening of nested lists.
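
For orientation, helpers like these are often only a few lines long. The sketch below shows one plausible shape for them. `sketch_set_seed` and `sketch_flatten` are hypothetical names, and the assumption that seeding covers exactly Python's `random` module and NumPy's legacy global generator is a guess, not something taken from the Utils source.

```python
import random
from itertools import chain

import numpy as np


def sketch_set_seed(seed):
    """Seed Python's random module and NumPy's legacy global generator."""
    random.seed(seed)
    np.random.seed(seed)


def sketch_flatten(nested):
    """Flatten exactly one level of nesting into a plain list."""
    return list(chain.from_iterable(nested))


sketch_set_seed(123)
first_draw = (random.random(), np.random.random(3).tolist())
sketch_set_seed(123)
second_draw = (random.random(), np.random.random(3).tolist())
print(first_draw == second_draw)                 # prints: True
print(sketch_flatten([[1, 2, 3], [4, 5], []]))   # prints: [1, 2, 3, 4, 5]
```

`chain.from_iterable` flattens a single level only, so a list nested two levels deep would still contain lists after flattening.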


In [17]:
print("=== Utils demo ===")

# 1. Random-seed demo
print("Random-seed demo:")

# Before re-seeding. The generators still carry the state from set_seed(42)
# in the setup cell, so these values are deterministic as well.
print("Random numbers before re-seeding:")
print(f"random: {[random.random() for _ in range(3)]}")
print(f"numpy: {np.random.random(3)}")

# Set the seed
set_seed(123)
print("\nRandom numbers after setting the seed (123):")
print(f"random: {[random.random() for _ in range(3)]}")
print(f"numpy: {np.random.random(3)}")

# Set the same seed again to verify reproducibility
set_seed(123)
print("\nRandom numbers after setting the same seed again:")
print(f"random: {[random.random() for _ in range(3)]}")
print(f"numpy: {np.random.random(3)}")

print("\n" + "="*50)

# 2. flatten demo
print("flatten demo:")

nested_data = [
    [1, 2, 3],
    [4, 5],
    [6, 7, 8, 9],
    [10]
]

flattened = flatten(nested_data)
print(f"Original nested list: {nested_data}")
print(f"Flattened: {flattened}")

# A more varied nested structure
complex_nested = [
    ["a", "b"],
    ["c"],
    ["d", "e", "f"],
    []  # empty list
]

complex_flattened = flatten(complex_nested)
print(f"\nComplex nested list: {complex_nested}")
print(f"Flattened: {complex_flattened}")

# Using flatten with numpy arrays
array_nested = [
    np.array([1, 2]),
    np.array([3, 4, 5]),
    np.array([6])
]

array_flattened = flatten(array_nested)
print(f"\nNested arrays: {[arr.tolist() for arr in array_nested]}")
print(f"Flattened: {[x for x in array_flattened]}")
print(f"Types of the flattened elements: {[type(x) for x in array_flattened[:3]]}")


=== Utils demo ===
Random-seed demo:
Random numbers before re-seeding:
random: [0.6394267984578837, 0.025010755222666936, 0.27502931836911926]
numpy: [0.37454012 0.95071431 0.73199394]

Random numbers after setting the seed (123):
random: [0.052363598850944326, 0.08718667752263232, 0.4072417636703983]
numpy: [0.69646919 0.28613933 0.22685145]

Random numbers after setting the same seed again:
random: [0.052363598850944326, 0.08718667752263232, 0.4072417636703983]
numpy: [0.69646919 0.28613933 0.22685145]

==================================================
flatten demo:
Original nested list: [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
Flattened: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Complex nested list: [['a', 'b'], ['c'], ['d', 'e', 'f'], []]
Flattened: ['a', 'b', 'c', 'd', 'e', 'f']

Nested arrays: [[1, 2], [3, 4, 5], [6]]
Flattened: [1, 2, 3, 4, 5, 6]
Types of the flattened elements: [<class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>]


## 6. Full GNTO Pipeline Demo

The GNTO class ties all the components together and provides end-to-end performance prediction for query plans.
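
Conceptually, an end-to-end pipeline of this kind just chains four stages. The sketch below illustrates that composition with toy stages. `SketchPipeline`, `count_nodes`, and the lambda stages are hypothetical stand-ins used only to show the data flow, not the GNTO class itself.

```python
class SketchPipeline:
    """Hypothetical composition: preprocess -> encode -> aggregate -> predict."""

    def __init__(self, preprocess, encode, aggregate, predict):
        self.preprocess = preprocess
        self.encode = encode
        self.aggregate = aggregate
        self.predict = predict

    def run(self, plan):
        tree = self.preprocess(plan)          # raw dict -> structured tree
        encoded = self.encode(tree)           # tree -> numeric vector
        vector = self.aggregate([encoded])    # list of vectors -> one vector
        return self.predict(vector)           # vector -> scalar


def count_nodes(plan):
    """Toy feature: the number of nodes in the plan tree."""
    return 1 + sum(count_nodes(child) for child in plan.get("Plans", []))


pipeline = SketchPipeline(
    preprocess=lambda plan: plan,                       # toy stage: pass through
    encode=lambda tree: [float(count_nodes(tree))],     # toy stage: one feature
    aggregate=lambda vectors: vectors[0],               # toy stage: first vector
    predict=lambda vector: sum(vector),                 # toy stage: sum features
)
result = pipeline.run({"Node Type": "Sort", "Plans": [{"Node Type": "Seq Scan"}]})
print(result)  # prints: 2.0
```

Passing the stages in through the constructor is what makes each of them replaceable without touching the `run` logic.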


In [18]:
# Reset the random seed so that results are reproducible
set_seed(42)

print("=== Full GNTO pipeline demo ===")

# 1. Create a GNTO instance with the default components
gnto_default = GNTO()

# Set of test query plans
test_plans = [
    {
        "Node Type": "Nested Loop",
        "Join Type": "Inner",
        "Cost": 1000.0,
        "Rows": 10000,
        "Plans": [
            {
                "Node Type": "Seq Scan",
                "Relation Name": "customers",
                "Cost": 200.0,
                "Filter": "age > 25"
            },
            {
                "Node Type": "Index Scan",
                "Index Name": "orders_customer_id_idx",
                "Relation Name": "orders",
                "Cost": 50.0
            }
        ]
    },
    {
        "Node Type": "Hash Join",
        "Join Type": "Left",
        "Cost": 800.0,
        "Rows": 8000,
        "Plans": [
            {
                "Node Type": "Seq Scan",
                "Relation Name": "products",
                "Cost": 300.0
            },
            {
                "Node Type": "Hash",
                "Cost": 100.0,
                "Plans": [
                    {
                        "Node Type": "Seq Scan",
                        "Relation Name": "categories",
                        "Cost": 80.0
                    }
                ]
            }
        ]
    },
    {
        "Node Type": "Sort",
        "Sort Key": ["price DESC", "name ASC"],
        "Cost": 150.0,
        "Rows": 1500,
        "Plans": [
            {
                "Node Type": "Bitmap Heap Scan",
                "Relation Name": "items",
                "Cost": 120.0,
                "Plans": [
                    {
                        "Node Type": "Bitmap Index Scan",
                        "Index Name": "items_price_idx",
                        "Cost": 20.0
                    }
                ]
            }
        ]
    }
]

print("Test query plans:")
for i, plan in enumerate(test_plans):
    print(f"\nPlan {i+1}: {plan['Node Type']}")
    print(f"  Cost: {plan['Cost']}, Rows: {plan['Rows']}")

    # Run the full pipeline
    prediction = gnto_default.run(plan)
    print(f"  Prediction: {prediction:.4f}")

print("\n" + "="*60)


=== Full GNTO pipeline demo ===
Test query plans:

Plan 1: Nested Loop
  Cost: 1000.0, Rows: 10000
  Prediction: 3.0000

Plan 2: Hash Join
  Cost: 800.0, Rows: 8000
  Prediction: 2.0000

Plan 3: Sort
  Cost: 150.0, Rows: 1500
  Prediction: 0.0000

============================================================



In [19]:
# 2. Create a GNTO instance with custom components
print("GNTO with custom components:")

# Create the custom components
custom_preprocessor = DataPreprocessor()
custom_encoder = Encoder()
custom_tree_model = TreeModel(reduction="sum")  # use sum aggregation
custom_predictor = Predictioner(weights=[0.2, 0.4, 0.1, 0.8, 0.6, 0.3, 0.9, 0.5])

# Create the custom GNTO instance
gnto_custom = GNTO(
    preprocessor=custom_preprocessor,
    encoder=custom_encoder,
    tree_model=custom_tree_model,
    predictioner=custom_predictor
)

print("Custom configuration:")
print(f"  TreeModel reduction: {custom_tree_model.reduction}")
print(f"  Predictor weights: {custom_predictor.weights}")

# Compare the predictions of the default and custom configurations
print("\nPrediction comparison:")
print(f"{'Plan':<15} {'Default':<12} {'Custom':<12} {'Diff':<10}")
print("-" * 50)

for i, plan in enumerate(test_plans):
    pred_default = gnto_default.run(plan)
    pred_custom = gnto_custom.run(plan)
    diff = abs(pred_custom - pred_default)

    label = f"Plan {i+1}"
    print(f"{label:<15} {pred_default:<12.4f} {pred_custom:<12.4f} {diff:<10.4f}")

print("\n" + "="*60)


GNTO with custom components:
Custom configuration:
  TreeModel reduction: sum
  Predictor weights: [0.2 0.4 0.1 0.8 0.6 0.3 0.9 0.5]

Prediction comparison:
Plan            Default      Custom       Diff      
--------------------------------------------------
Plan 1          3.0000       0.7000       2.3000    
Plan 2          2.0000       2.2000       0.2000    
Plan 3          0.0000       1.7000       1.7000    

============================================================



In [20]:
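# 3. Step-by-step analysis of the pipeline internals
print("Step-by-step analysis of the pipeline internals:")

sample_plan = test_plans[0]  # use the first test plan
print(f"Plan under analysis: {sample_plan['Node Type']}")

# Step 1: preprocessing
structured = gnto_default.preprocessor.preprocess(sample_plan)
print("\nStep 1 - Preprocessing:")
print(f"  Keys of the raw plan: {list(sample_plan.keys())}")
print(f"  Structured node type: {structured.node_type}")
print(f"  Number of children: {len(structured.children)}")

# Step 2: encoding
encoded = gnto_default.encoder.encode(structured)
print("\nStep 2 - Encoding:")
print(f"  Encoded vector: {encoded}")
print(f"  Vector dimension: {len(encoded)}")
print(f"  Node-type vocabulary: {list(gnto_default.encoder.node_index.keys())}")

# Step 3: tree-model aggregation
vector = gnto_default.tree_model.forward([encoded])
print("\nStep 3 - Tree-model aggregation:")
print("  Number of input vectors: 1")
print(f"  Reduction: {gnto_default.tree_model.reduction}")
print(f"  Output vector: {vector}")

# Step 4: prediction
# Note: the weights printed below are 1 only in the first three positions. This
# suggests that the default Predictioner sizes its all-ones weights to the first
# input it sees and zero-pads them afterwards, which would explain the 0.0000
# prediction for Plan 3 in the earlier comparison.
prediction = gnto_default.predictioner.predict(vector)
print("\nStep 4 - Prediction:")
print(f"  Prediction weights: {gnto_default.predictioner.weights}")
print(f"  Final prediction: {prediction}")

print(f"\nFull pipeline result: {gnto_default.run(sample_plan)}")
print("✓ Manual steps match the pipeline result" if abs(prediction - gnto_default.run(sample_plan)) < 1e-10 else "✗ Results differ")


Step-by-step analysis of the pipeline internals:
Plan under analysis: Nested Loop

Step 1 - Preprocessing:
  Keys of the raw plan: ['Node Type', 'Join Type', 'Cost', 'Rows', 'Plans']
  Structured node type: Nested Loop
  Number of children: 2

Step 2 - Encoding:
  Encoded vector: [1. 1. 1. 0. 0. 0. 0. 0.]
  Vector dimension: 8
  Node-type vocabulary: ['Nested Loop', 'Seq Scan', 'Index Scan', 'Hash Join', 'Hash', 'Sort', 'Bitmap Heap Scan', 'Bitmap Index Scan']

Step 3 - Tree-model aggregation:
  Number of input vectors: 1
  Reduction: mean
  Output vector: [1. 1. 1. 0. 0. 0. 0. 0.]

Step 4 - Prediction:
  Prediction weights: [1. 1. 1. 0. 0. 0. 0. 0.]
  Final prediction: 3.0

Full pipeline result: 3.0
✓ Manual steps match the pipeline result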


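## Summary

This demo covered the full functionality of the GNTO project:

1. **DataPreprocessor**: converts JSON query plans into a structured PlanNode tree
2. **Encoder**: turns plan nodes into numeric vectors using one-hot encoding, recursing into child nodes
3. **TreeModel**: aggregates multiple vectors into a single representation, using either mean or sum
4. **Predictioner**: a linear prediction head that maps a feature vector to a scalar prediction
5. **GNTO**: the end-to-end pipeline that ties all components together
6. **Utils**: utility functions such as random-seed setup and list flattening

Each component is modular: it can be used on its own or swapped for a custom implementation, which gives a real project a sound architectural base for replacing these simple components with more complex neural-network models.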
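
As an illustration of swapping in a custom implementation, the sketch below defines an aggregator that uses an element-wise maximum instead of mean or sum. `SketchMaxTreeModel` is hypothetical. It assumes that a replacement only needs a `forward(vectors)` method and a `reduction` attribute, which is what the demo cells use, and whether the GNTO class requires anything more is not confirmed here.

```python
import numpy as np


class SketchMaxTreeModel:
    """Hypothetical drop-in aggregator: element-wise max over padded vectors."""

    reduction = "max"

    def forward(self, vectors):
        # Raises ValueError on an empty list, since max() has nothing to compare.
        width = max(len(vec) for vec in vectors)
        padded = np.zeros((len(vectors), width))
        for row, vec in enumerate(vectors):
            padded[row, :len(vec)] = vec
        # Zero-padding is only neutral for max when all features are non-negative.
        return padded.max(axis=0)


pooled = SketchMaxTreeModel().forward([np.array([1.0, 0.0]), np.array([0.0, 2.0, 3.0])])
print(pooled)  # prints: [1. 2. 3.]
```

Under that assumption, an instance could be passed through the `tree_model=` keyword used in the custom-components cell.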
