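# Stage 0: Data Preparation and Splitting

This notebook performs the following tasks, in order:
1. Load the three data sources
2. Explore the data and check its quality
3. Engineer features (aggregate pairwise features, integrate network metrics, fill missing values)
4. Split the data by group (8/2/2), split each training group by time (70%/30%), and standardize the features
5. Save the processed data
6. Write a data split report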


In [1]:
# Import the required libraries and configuration
%run 00_config_and_setup.ipynb

# Imported explicitly so later cells do not depend on names defined by the setup notebook
import pandas as pd
from IPython.display import display

import config
from utils.data_utils import save_intermediate, load_intermediate, load_all_data, check_data_quality
from utils.feature_utils import aggregate_pairwise_features, normalize_features
from utils.visualization_utils import plot_data_distribution

print("Ready to start data preparation...")


✓ All libraries imported
✓ Configuration file loaded
Project root: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025
Output directory: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs
Intermediate results directory: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate
✓ Display options set
✓ Output directories created
  - Intermediate results: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate
  - Model files: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\models
  - Report files: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\reports
  - Visualizations: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\visualizations
✓ Random seed set to: 42
✓ Visualization style set: seaborn-v0_8
✓ Utility functions imported
Ready to start data preparation...


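## 1. Data Loading


In [2]:
# Load all raw data
pairwise_df, windowed_df, task_df = load_all_data()


Loading data files...
✓ Pairwise feature data: 1126602 rows, 27 columns
✓ Window-level network metrics: 1740 rows, 9 columns
✓ Task performance metrics: 12 rows, 9 columns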


## 2. Data Exploration


In [3]:
# Inspect the pairwise feature data
print("=== Pairwise feature data ===")
print(f"Shape: {pairwise_df.shape}")
print("\nFirst 5 rows:")
display(pairwise_df.head())
print(f"\nColumn names: {list(pairwise_df.columns)}")
print("\nRows per group:")
print(pairwise_df['group'].value_counts().sort_index())


=== Pairwise feature data ===
Shape: (1126602, 27)

First 5 rows:


Unnamed: 0,group,pair,window_idx,window_start,speak_overlap,speak_only_i,speak_only_j,speaker_switch,silence,floor_streak_i,floor_streak_j,resp_latency,prox_binary,dist_mean,approach_rate,dist_accel,dist_jerk,joint_att_count,joint_hover_dur,shared_att_ratio,burst_switch_rate,burst_overlap_rate,dominance_ratio,speaking_entropy,bigram_entropy,fano_switch,material_diversity
0,1,A-B,0,0.0,0,0,0,0,1,0,0,0.0,0,0.9144,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.1591,0.3898,0.6972,0.9688,4.726
1,1,A-B,1,1.0,0,0,0,0,1,0,0,0.0,0,0.9144,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.1591,0.3898,0.6972,0.9688,4.726
2,1,A-B,2,2.0,0,0,0,0,1,0,0,0.0,0,0.9144,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.1591,0.3898,0.6972,0.9688,4.726
3,1,A-B,3,3.0,0,0,0,0,1,0,0,0.0,0,0.9144,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.1591,0.3898,0.6972,0.9688,4.726
4,1,A-B,4,4.0,0,0,0,0,1,0,0,0.0,0,0.9144,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.1591,0.3898,0.6972,0.9688,4.726



Column names: ['group', 'pair', 'window_idx', 'window_start', 'speak_overlap', 'speak_only_i', 'speak_only_j', 'speaker_switch', 'silence', 'floor_streak_i', 'floor_streak_j', 'resp_latency', 'prox_binary', 'dist_mean', 'approach_rate', 'dist_accel', 'dist_jerk', 'joint_att_count', 'joint_hover_dur', 'shared_att_ratio', 'burst_switch_rate', 'burst_overlap_rate', 'dominance_ratio', 'speaking_entropy', 'bigram_entropy', 'fano_switch', 'material_diversity']

Rows per group:
group
1       4038
2       4968
3       7512
4       6594
5       3510
6      20946
7     526164
8       4422
9       4596
10      4326
11    535380
12      4146
Name: count, dtype: int64
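
The row counts above can be turned into window counts with a short calculation. The sketch below assumes four participants per group (the value of `num_participants` in the task performance data), so each window contributes C(4, 2) = 6 pair rows. The counts are copied from the output above.

```python
from math import comb

# Pairwise rows per group, copied from the value_counts() output above.
rows_per_group = {
    1: 4038, 2: 4968, 3: 7512, 4: 6594, 5: 3510, 6: 20946,
    7: 526164, 8: 4422, 9: 4596, 10: 4326, 11: 535380, 12: 4146,
}

# Assumption: four participants per group, so each window has C(4, 2) = 6 pairs.
pairs_per_window = comb(4, 2)

windows_per_group = {g: n // pairs_per_window for g, n in rows_per_group.items()}
total_rows = sum(rows_per_group.values())
share_7_11 = (rows_per_group[7] + rows_per_group[11]) / total_rows

print(windows_per_group)
print(f"Groups 7 and 11 hold {share_7_11:.1%} of all pairwise rows")
```

Groups 7 and 11 account for most of the rows, so any statistic computed over pooled rows is dominated by these two groups. This matters for the split and the standardization later in the notebook.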


In [4]:
# Inspect the window-level network metrics
print("=== Window-level network metrics ===")
print(f"Shape: {windowed_df.shape}")
print("\nFirst 5 rows:")
display(windowed_df.head())
print(f"\nColumn names: {list(windowed_df.columns)}")
print("\nRows per group:")
print(windowed_df['group'].value_counts().sort_index())
print("\nModality types:")
print(windowed_df['modality'].value_counts())


=== Window-level network metrics ===
Shape: (1740, 9)

First 5 rows:


Unnamed: 0,group,window,modality,t_start,t_end,density,avg_clustering,eigenvector,reciprocity
0,12,1,shared_attention,2025-01-23 14:43:36,2025-01-23 14:44:08,0.6667,0.5833,0.4847,
1,12,1,proximity,2025-01-23 14:43:36,2025-01-23 14:44:08,0.6667,0.5833,0.4847,
2,12,1,conversation,2025-01-23 14:43:36,2025-01-23 14:44:08,0.25,0.0,0.433,0.0
3,12,1,fused,2025-01-23 14:43:36,2025-01-23 14:44:08,0.6667,0.8333,0.4857,0.75
4,12,2,shared_attention,2025-01-23 14:43:52,2025-01-23 14:44:24,0.8333,0.8333,0.4963,



Column names: ['group', 'window', 'modality', 't_start', 't_end', 'density', 'avg_clustering', 'eigenvector', 'reciprocity']

Rows per group:
group
1     116
2     136
3     152
4     120
5      68
6     204
7     160
8     272
9     108
10    152
11    124
12    128
Name: count, dtype: int64

Modality types:
modality
shared_attention    435
proximity           435
conversation        435
fused               435
Name: count, dtype: int64


In [5]:
# Inspect the task performance metrics
print("=== Task performance metrics ===")
print(f"Shape: {task_df.shape}")
print("\nData:")
display(task_df)


=== Task performance metrics ===
Shape: (12, 9)

Data:


Unnamed: 0,group,completion_time_seconds,total_interactions,num_participants,has_data,participant_A_interactions,participant_B_interactions,participant_C_interactions,participant_D_interactions
0,1,415.539,116,4,Yes,42,23,22,29
1,2,620.586,173,4,Yes,24,14,95,40
2,3,676.778,234,4,Yes,32,80,33,89
3,4,513.085,127,4,Yes,26,35,48,18
4,5,209.265,153,4,Yes,56,19,30,48
5,6,622.308,160,4,Yes,43,41,46,30
6,7,562.773,194,4,Yes,34,42,35,83
7,8,994.054,218,4,Yes,9,5,39,165
8,9,430.891,109,4,Yes,28,25,33,23
9,10,652.171,229,4,Yes,69,90,31,39


In [6]:
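# Data quality checks
check_data_quality(pairwise_df, "Pairwise feature data")
check_data_quality(windowed_df, "Window-level network metrics")
check_data_quality(task_df, "Task performance metrics")



=== Pairwise feature data quality check ===
Shape: (1126602, 27)
Missing values:
No missing values
Duplicate rows: 0
Data types:
float64    15
int64      11
object      1
Name: count, dtype: int64

=== Window-level network metrics quality check ===
Shape: (1740, 9)
Missing values:
reciprocity    884
dtype: int64
Duplicate rows: 0
Data types:
float64    4
object     3
int64      2
Name: count, dtype: int64

=== Task performance metrics quality check ===
Shape: (12, 9)
Missing values:
No missing values
Duplicate rows: 0
Data types:
int64      7
float64    1
object     1
Name: count, dtype: int64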
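
The source of `check_data_quality` is not shown in this notebook. Judging only from its printed output (shape, missing values, duplicate rows, dtype counts), a minimal stand-in might look like the sketch below. The function name `check_data_quality_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def check_data_quality_sketch(df: pd.DataFrame, name: str) -> dict:
    """Hypothetical stand-in for utils.data_utils.check_data_quality."""
    missing = df.isnull().sum()
    missing = missing[missing > 0]  # keep only columns that have gaps
    summary = {
        "shape": df.shape,
        "missing": missing.to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).value_counts().to_dict(),
    }
    print(f"=== {name} quality check ===")
    print(f"Shape: {summary['shape']}")
    print("Missing values:", summary["missing"] or "none")
    print(f"Duplicate rows: {summary['duplicates']}")
    print("Data types:", summary["dtypes"])
    return summary

# Toy frame: reciprocity is missing for the proximity row,
# mirroring the NaNs visible in the network metrics preview above.
toy = pd.DataFrame({
    "group": [1, 1, 1],
    "modality": ["proximity", "conversation", "fused"],
    "density": [0.5, 0.25, 0.75],
    "reciprocity": [None, 0.0, 0.75],
})
summary = check_data_quality_sketch(toy, "toy network metrics")
```

Returning the summary as a dictionary, in addition to printing it, makes it easy to assert on data quality in later cells.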


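## 3. Feature Engineering

### 3.1 Aggregate Pairwise Features to the Window Level


In [7]:
# Aggregate the pairwise features to one row per (group, window)
print("Aggregating pairwise features...")
pairwise_aggregated = aggregate_pairwise_features(pairwise_df, group_col='group', window_col='window_idx')
print(f"Shape after aggregation: {pairwise_aggregated.shape}")
print(f"Columns after aggregation: {len(pairwise_aggregated.columns)}")
display(pairwise_aggregated.head())


Aggregating pairwise features...
Shape after aggregation: (187767, 94)
Columns after aggregation: 94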


Unnamed: 0,group,window_idx,speak_overlap_mean,speak_overlap_std,speak_overlap_max,speak_overlap_min,speak_only_i_mean,speak_only_i_std,speak_only_i_max,speak_only_i_min,speak_only_j_mean,speak_only_j_std,speak_only_j_max,speak_only_j_min,speaker_switch_mean,speaker_switch_std,speaker_switch_max,speaker_switch_min,silence_mean,silence_std,silence_max,silence_min,floor_streak_i_mean,floor_streak_i_std,floor_streak_i_max,floor_streak_i_min,floor_streak_j_mean,floor_streak_j_std,floor_streak_j_max,floor_streak_j_min,resp_latency_mean,resp_latency_std,resp_latency_max,resp_latency_min,prox_binary_mean,prox_binary_std,prox_binary_max,prox_binary_min,dist_mean_mean,dist_mean_std,dist_mean_max,dist_mean_min,approach_rate_mean,approach_rate_std,approach_rate_max,approach_rate_min,dist_accel_mean,dist_accel_std,dist_accel_max,dist_accel_min,dist_jerk_mean,dist_jerk_std,dist_jerk_max,dist_jerk_min,joint_att_count_mean,joint_att_count_std,joint_att_count_max,joint_att_count_min,joint_hover_dur_mean,joint_hover_dur_std,joint_hover_dur_max,joint_hover_dur_min,shared_att_ratio_mean,shared_att_ratio_std,shared_att_ratio_max,shared_att_ratio_min,burst_switch_rate_mean,burst_switch_rate_std,burst_switch_rate_max,burst_switch_rate_min,burst_overlap_rate_mean,burst_overlap_rate_std,burst_overlap_rate_max,burst_overlap_rate_min,dominance_ratio_mean,dominance_ratio_std,dominance_ratio_max,dominance_ratio_min,speaking_entropy_mean,speaking_entropy_std,speaking_entropy_max,speaking_entropy_min,bigram_entropy_mean,bigram_entropy_std,bigram_entropy_max,bigram_entropy_min,fano_switch_mean,fano_switch_std,fano_switch_max,fano_switch_min,material_diversity_mean,material_diversity_std,material_diversity_max,material_diversity_min
0,1,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835
1,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835
2,1,2,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835
3,1,3,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835
4,1,4,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835
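
The column names above suggest that every pairwise feature is summarized across the pairs in a window with four statistics (`mean`, `std`, `max`, `min`). The sketch below reproduces that naming on toy data. It is an assumption about how `aggregate_pairwise_features` behaves, not its actual source, and `aggregate_pairwise_sketch` is a hypothetical name.

```python
import pandas as pd

def aggregate_pairwise_sketch(df, group_col="group", window_col="window_idx",
                              skip_cols=("pair", "window_start")):
    """Hypothetical stand-in for aggregate_pairwise_features."""
    keys = [group_col, window_col]
    feature_cols = [c for c in df.columns if c not in keys and c not in skip_cols]
    agg = df.groupby(keys)[feature_cols].agg(["mean", "std", "max", "min"])
    # Flatten the (feature, statistic) MultiIndex into names like dist_mean_std.
    agg.columns = [f"{feat}_{stat}" for feat, stat in agg.columns]
    return agg.reset_index()

# Toy data: one group, two windows, two pairs per window.
toy = pd.DataFrame({
    "group": [1, 1, 1, 1],
    "pair": ["A-B", "A-C", "A-B", "A-C"],
    "window_idx": [0, 0, 1, 1],
    "window_start": [0.0, 0.0, 1.0, 1.0],
    "silence": [1, 0, 1, 1],
    "dist_mean": [0.9, 1.1, 0.8, 1.2],
})
out = aggregate_pairwise_sketch(toy)
print(out)
```

The column count is consistent with this reading: the raw table has 27 columns, of which `group`, `pair`, `window_idx`, and `window_start` are not aggregated, leaving 23 features. 23 features × 4 statistics + 2 key columns gives the 94 columns reported above.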


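### 3.2 Integrate Window-Level Network Metrics


In [8]:
# Convert the window-level network metrics from long to wide format
# (one column per metric-modality combination)
print("Integrating window-level network metrics...")

# Metric columns to spread across modalities
metric_cols = ['density', 'avg_clustering', 'eigenvector', 'reciprocity']

# Build one pivot table per metric.
# pivot_table drops all-NaN columns by default (dropna=True), which is why
# reciprocity only yields columns for the conversation and fused modalities.
windowed_pivot_list = []
for metric in metric_cols:
    if metric in windowed_df.columns:
        pivot = windowed_df.pivot_table(
            index=['group', 'window'],
            columns='modality',
            values=metric,
            aggfunc='first'
        )
        pivot.columns = [f"{metric}_{col}" for col in pivot.columns]
        windowed_pivot_list.append(pivot)

# Combine all metrics
if windowed_pivot_list:
    windowed_wide = pd.concat(windowed_pivot_list, axis=1)
    windowed_wide = windowed_wide.reset_index()
    print(f"Shape after integration: {windowed_wide.shape}")
    display(windowed_wide.head())
else:
    print("Warning: no metrics found to integrate")
    windowed_wide = windowed_df[['group', 'window']].drop_duplicates()


Integrating window-level network metrics...
Shape after integration: (435, 16)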


Unnamed: 0,group,window,density_conversation,density_fused,density_proximity,density_shared_attention,avg_clustering_conversation,avg_clustering_fused,avg_clustering_proximity,avg_clustering_shared_attention,eigenvector_conversation,eigenvector_fused,eigenvector_proximity,eigenvector_shared_attention,reciprocity_conversation,reciprocity_fused
0,1,1,0.5,0.5,0.3333,0.3333,0.8333,0.8333,0.0,0.0,0.4743,0.4743,0.4268,0.4268,0.3333,0.3333
1,1,2,0.5,0.5833,0.5,0.3333,0.8333,1.0,0.75,0.0,0.4743,0.4264,0.433,0.4268,0.3333,0.2857
2,1,3,0.75,0.75,0.8333,0.0,1.0,1.0,0.8333,0.0,0.491,0.491,0.4963,0.1738,0.6667,0.6667
3,1,4,0.75,0.75,1.0,0.0,1.0,1.0,1.0,0.0,0.491,0.491,0.5,0.0862,0.6667,0.6667
4,1,5,0.75,0.75,0.6667,0.3333,1.0,1.0,0.5833,0.0,0.491,0.491,0.4847,0.4268,0.6667,0.6667
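
The wide table has 16 columns rather than 2 + 4 × 4 = 18 because `reciprocity` has no values for the `proximity` and `shared_attention` modalities, and `pivot_table` drops all-NaN columns by default. The toy example below illustrates this behavior; the values are made up.

```python
import numpy as np
import pandas as pd

long_df = pd.DataFrame({
    "group": [1, 1, 1, 1],
    "window": [1, 1, 2, 2],
    "modality": ["proximity", "conversation", "proximity", "conversation"],
    "density": [0.33, 0.50, 0.50, 0.50],
    # Reciprocity is NaN for every proximity row in this toy data.
    "reciprocity": [np.nan, 0.33, np.nan, 0.33],
})

wide_parts = []
for metric in ["density", "reciprocity"]:
    part = long_df.pivot_table(index=["group", "window"], columns="modality",
                               values=metric, aggfunc="first")
    part.columns = [f"{metric}_{m}" for m in part.columns]
    wide_parts.append(part)

wide = pd.concat(wide_parts, axis=1).reset_index()
print(wide.columns.tolist())
```

If a fixed set of columns is required downstream, reindexing the result against an explicit column list is one way to make the missing combinations visible instead of silently absent.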


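### 3.3 Add Task Performance Features


In [9]:
# Merge the aggregated pairwise features with the window-level metrics,
# then attach the task performance metrics
print("Merging data...")

# Merge pairwise features with window-level metrics
# (an outer merge keeps windows that appear in only one source)
merged_df = pairwise_aggregated.merge(
    windowed_wide,
    left_on=['group', 'window_idx'],
    right_on=['group', 'window'],
    how='outer'
)

# Add task performance features (one row per group, repeated for every window)
merged_df = merged_df.merge(
    task_df,
    on='group',
    how='left'
)

print(f"Shape after merging: {merged_df.shape}")
print(f"Columns after merging: {len(merged_df.columns)}")
display(merged_df.head())


Merging data...
Shape after merging: (187767, 117)
Columns after merging: 117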


Unnamed: 0,group,window_idx,speak_overlap_mean,speak_overlap_std,speak_overlap_max,speak_overlap_min,speak_only_i_mean,speak_only_i_std,speak_only_i_max,speak_only_i_min,speak_only_j_mean,speak_only_j_std,speak_only_j_max,speak_only_j_min,speaker_switch_mean,speaker_switch_std,speaker_switch_max,speaker_switch_min,silence_mean,silence_std,silence_max,silence_min,floor_streak_i_mean,floor_streak_i_std,floor_streak_i_max,floor_streak_i_min,floor_streak_j_mean,floor_streak_j_std,floor_streak_j_max,floor_streak_j_min,resp_latency_mean,resp_latency_std,resp_latency_max,resp_latency_min,prox_binary_mean,prox_binary_std,prox_binary_max,prox_binary_min,dist_mean_mean,dist_mean_std,dist_mean_max,dist_mean_min,approach_rate_mean,approach_rate_std,approach_rate_max,approach_rate_min,dist_accel_mean,dist_accel_std,dist_accel_max,dist_accel_min,dist_jerk_mean,dist_jerk_std,dist_jerk_max,dist_jerk_min,joint_att_count_mean,joint_att_count_std,joint_att_count_max,joint_att_count_min,joint_hover_dur_mean,joint_hover_dur_std,joint_hover_dur_max,joint_hover_dur_min,shared_att_ratio_mean,shared_att_ratio_std,shared_att_ratio_max,shared_att_ratio_min,burst_switch_rate_mean,burst_switch_rate_std,burst_switch_rate_max,burst_switch_rate_min,burst_overlap_rate_mean,burst_overlap_rate_std,burst_overlap_rate_max,burst_overlap_rate_min,dominance_ratio_mean,dominance_ratio_std,dominance_ratio_max,dominance_ratio_min,speaking_entropy_mean,speaking_entropy_std,speaking_entropy_max,speaking_entropy_min,bigram_entropy_mean,bigram_entropy_std,bigram_entropy_max,bigram_entropy_min,fano_switch_mean,fano_switch_std,fano_switch_max,fano_switch_min,material_diversity_mean,material_diversity_std,material_diversity_max,material_diversity_min,window,density_conversation,density_fused,density_proximity,density_shared_attention,avg_clustering_conversation,avg_clustering_fused,avg_clustering_proximity,avg_clustering_shared_attention,eigenvector_conversation,eigenvector_fused,eigenvector_proximity,eigenvector_shared_attention,reciprocity_conversation,reciprocity_fused,completion_time_seconds,total_interactions,num_participants,has_data,participant_A_interactions,participant_B_interactions,participant_C_interactions,participant_D_interactions
0,1,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835,,,,,,,,,,,,,,,,415.539,116,4,Yes,42,23,22,29
1,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835,1.0,0.5,0.5,0.3333,0.3333,0.8333,0.8333,0.0,0.0,0.4743,0.4743,0.4268,0.4268,0.3333,0.3333,415.539,116,4,Yes,42,23,22,29
2,1,2,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835,2.0,0.5,0.5833,0.5,0.3333,0.8333,1.0,0.75,0.0,0.4743,0.4264,0.433,0.4268,0.3333,0.2857,415.539,116,4,Yes,42,23,22,29
3,1,3,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835,3.0,0.75,0.75,0.8333,0.0,1.0,1.0,0.8333,0.0,0.491,0.491,0.4963,0.1738,0.6667,0.6667,415.539,116,4,Yes,42,23,22,29
4,1,4,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0,0,1.0,0.0,1,1,0.0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.9144,0.0,0.9144,0.9144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1582,0.1667,0.4757,0.0165,0.8987,0.276,1.1942,0.3898,1.5481,0.4757,2.1013,0.6972,0.8358,0.0659,0.9688,0.7949,4.7187,0.0213,4.7418,4.6835,4.0,0.75,0.75,1.0,0.0,1.0,1.0,1.0,0.0,0.491,0.491,0.5,0.0862,0.6667,0.6667,415.539,116,4,Yes,42,23,22,29
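
In the preview above, the row with `window_idx` 0 has empty network-metric fields because `window` starts at 1 in the metrics table. The two sources also differ greatly in size: group 1 has 673 aggregated windows but only 116 / 4 = 29 network-metric windows. Whether `window_idx` and `window` index the same windowing scheme is not confirmed by the notebook. One way to check how many rows actually match is the `indicator` option of `DataFrame.merge`, sketched here on toy data.

```python
import pandas as pd

# Toy stand-ins for pairwise_aggregated and windowed_wide.
left = pd.DataFrame({"group": [1, 1, 1, 1], "window_idx": [0, 1, 2, 3],
                     "silence_mean": [1.0, 1.0, 0.5, 0.0]})
right = pd.DataFrame({"group": [1, 1], "window": [1, 2],
                      "density_fused": [0.5, 0.5833]})

merged = left.merge(right, left_on=["group", "window_idx"],
                    right_on=["group", "window"], how="outer", indicator=True)

# _merge marks each row as 'both', 'left_only', or 'right_only'.
n_both = int((merged["_merge"] == "both").sum())
n_left_only = int((merged["_merge"] == "left_only").sum())
n_right_only = int((merged["_merge"] == "right_only").sum())
print(f"matched={n_both}, left only={n_left_only}, right only={n_right_only}")
```

A large `left_only` count signals that most windows have no network metrics of their own, so any values they receive later come from imputation.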


### 3.4 Handle Missing Values


In [11]:
# Check for missing values
print("Missing value counts:")
missing_stats = merged_df.isnull().sum()
missing_stats = missing_stats[missing_stats > 0].sort_values(ascending=False)
if len(missing_stats) > 0:
    print(missing_stats)
    print(f"\nOverall missing ratio: {merged_df.isnull().sum().sum() / (merged_df.shape[0] * merged_df.shape[1]):.2%}")

    # Handle missing values: forward fill, then backward fill, then fill any remainder with 0
    print("\nHandling missing values...")
    merged_df = merged_df.sort_values(['group', 'window_idx'])

    # Fill within each group; the key columns are skipped, so `window` keeps its NaNs
    for col in merged_df.columns:
        if merged_df[col].isnull().sum() > 0 and col not in ['group', 'window', 'window_idx']:
            merged_df[col] = merged_df.groupby('group')[col].transform(
                lambda x: x.ffill().bfill().fillna(0)
            )

    print("Missing value handling complete")
    print(f"Missing values after handling: {merged_df.isnull().sum().sum()}")
else:
    print("No missing values")


Missing value counts:
reciprocity_conversation           187346
density_conversation               187332
window                             187332
density_proximity                  187332
density_shared_attention           187332
avg_clustering_conversation        187332
density_fused                      187332
avg_clustering_fused               187332
avg_clustering_proximity           187332
eigenvector_conversation           187332
avg_clustering_shared_attention    187332
eigenvector_fused                  187332
eigenvector_proximity              187332
eigenvector_shared_attention       187332
reciprocity_fused                  187332
dtype: int64

Overall missing ratio: 12.79%

Handling missing values...
Missing value handling complete
Missing values after handling: 187332
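
The fill strategy used above (forward fill, then backward fill, then zero, all within a group) can be seen on a toy column. The sketch also reports the share of imputed values, which is worth tracking here: for each network-metric column, 187332 of 187767 rows (about 99.8%) were missing before filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 1, 1, 2, 2],
    "window_idx": [0, 1, 2, 3, 0, 1],
    # Group 1 has one observed value; group 2 has none.
    "density_fused": [np.nan, 0.5, np.nan, np.nan, np.nan, np.nan],
})

df = df.sort_values(["group", "window_idx"])
df["density_filled"] = df.groupby("group")["density_fused"].transform(
    lambda s: s.ffill().bfill().fillna(0)
)

# Share of values that were imputed rather than observed.
imputed_share = df["density_fused"].isna().mean()
print(df)
print(f"Imputed share: {imputed_share:.0%}")
```

Group 1 inherits its single observed value everywhere, while group 2 falls back to zero. The remaining 187332 missing values reported by the notebook all sit in the `window` key column, which the loop deliberately skips.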


## 4. Data Splitting

### 4.1 Split by Group


In [12]:
# Split the data by group
train_groups = config.TRAIN_GROUPS
val_groups = config.VAL_GROUPS
test_groups = config.TEST_GROUPS

train_raw = merged_df[merged_df['group'].isin(train_groups)].copy()
val_raw = merged_df[merged_df['group'].isin(val_groups)].copy()
test_raw = merged_df[merged_df['group'].isin(test_groups)].copy()

print(f"Training groups: {train_groups}")
print(f"Validation groups: {val_groups}")
print(f"Test groups: {test_groups}")
print(f"\nTraining group data: {len(train_raw)} windows")
print(f"Validation group data: {len(val_raw)} windows")
print(f"Test group data: {len(test_raw)} windows")


Training groups: [1, 2, 3, 4, 5, 6, 7, 8]
Validation groups: [9, 10]
Test groups: [11, 12]

Training group data: 96359 windows
Validation group data: 1487 windows
Test group data: 89921 windows
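
A group-level split only prevents leakage if no group appears in more than one list. The sketch below adds that check to the same `isin` filtering. The constants mirror the values printed above, and `split_by_group` is a hypothetical helper, not part of the project's utilities.

```python
import pandas as pd

# Assumed values, matching the groups printed above.
TRAIN_GROUPS = [1, 2, 3, 4, 5, 6, 7, 8]
VAL_GROUPS = [9, 10]
TEST_GROUPS = [11, 12]

def split_by_group(df, train_groups, val_groups, test_groups, group_col="group"):
    """Split rows by group ID after checking that no group is shared."""
    sets = [set(train_groups), set(val_groups), set(test_groups)]
    if sum(len(s) for s in sets) != len(set().union(*sets)):
        raise ValueError("Group lists overlap")
    return tuple(df[df[group_col].isin(s)].copy() for s in sets)

# Toy data: two rows for each of the 12 groups.
toy = pd.DataFrame({"group": list(range(1, 13)) * 2, "x": range(24)})
train_raw, val_raw, test_raw = split_by_group(toy, TRAIN_GROUPS, VAL_GROUPS, TEST_GROUPS)
print(len(train_raw), len(val_raw), len(test_raw))
```

Note that in the real data the test set (89921 windows) is almost as large as the training groups combined, because group 11 alone contributes 89230 windows.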


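### 4.2 Split Training Groups by Time


In [13]:
# For each training group, split by time:
# the first 70% of windows go to training, the last 30% to training-validation
train_list = []
train_val_list = []

for group in train_groups:
    group_data = train_raw[train_raw['group'] == group].copy()
    group_data = group_data.sort_values('window_idx')

    n_windows = len(group_data)
    split_idx = int(n_windows * config.TRAIN_TIME_SPLIT)

    train_list.append(group_data.iloc[:split_idx])
    train_val_list.append(group_data.iloc[split_idx:])

    print(f"Group {group}: total windows={n_windows}, training={split_idx}, training-validation={n_windows-split_idx}")

train_data = pd.concat(train_list, ignore_index=True)
train_val_data = pd.concat(train_val_list, ignore_index=True)

print("\nFinal split:")
print(f"Training set: {len(train_data)} windows")
print(f"Training-validation set: {len(train_val_data)} windows")
print(f"Validation set: {len(val_raw)} windows")
print(f"Test set: {len(test_raw)} windows")


Group 1: total windows=673, training=471, training-validation=202
Group 2: total windows=828, training=579, training-validation=249
Group 3: total windows=1252, training=876, training-validation=376
Group 4: total windows=1099, training=769, training-validation=330
Group 5: total windows=585, training=409, training-validation=176
Group 6: total windows=3491, training=2443, training-validation=1048
Group 7: total windows=87694, training=61385, training-validation=26309
Group 8: total windows=737, training=515, training-validation=222

Final split:
Training set: 67447 windows
Training-validation set: 28912 windows
Validation set: 1487 windows
Test set: 89921 windows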
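
The same chronological split can be written as a small reusable function. The sketch below assumes `config.TRAIN_TIME_SPLIT` is 0.7 (inferred from the 70%/30% description) and uses toy groups sized like groups 1 and 5 so the cut points can be compared with the output above. `time_split_within_groups` is a hypothetical name.

```python
import pandas as pd

def time_split_within_groups(df, train_frac=0.7, group_col="group",
                             order_col="window_idx"):
    """Per group, put the earliest train_frac of windows first and the rest second."""
    head_parts, tail_parts = [], []
    for _, g in df.groupby(group_col):
        g = g.sort_values(order_col)
        cut = int(len(g) * train_frac)  # truncates, as in the notebook cell
        head_parts.append(g.iloc[:cut])
        tail_parts.append(g.iloc[cut:])
    return (pd.concat(head_parts, ignore_index=True),
            pd.concat(tail_parts, ignore_index=True))

# Toy data with the same window counts as groups 1 (673) and 5 (585).
toy = pd.DataFrame({
    "group": [1] * 673 + [5] * 585,
    "window_idx": list(range(673)) + list(range(585)),
})
train_part, holdout_part = time_split_within_groups(toy)
print(train_part["group"].value_counts().sort_index().to_dict())
```

Because the cut is made on sorted window indices, every training window in a group precedes every held-out window in that group, which keeps future information out of the training portion.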


### 4.3 Standardize Features


In [20]:
# Standardize all splits using statistics from the training set
print("Standardizing features...")

# Columns that should not be standardized
exclude_cols = ['group', 'window', 'window_idx', 'window_start', 'pair', 'has_data']

# Standardize
train_scaled, train_val_scaled, val_scaled, test_scaled, scaler = normalize_features(
    train_data, val_raw, test_raw, train_val_data,
    exclude_cols=exclude_cols
)

print("Standardization complete")
print(f"Training set shape: {train_scaled.shape}")
print(f"Training-validation set shape: {train_val_scaled.shape}")
print(f"Validation set shape: {val_scaled.shape}")
print(f"Test set shape: {test_scaled.shape}")


Standardizing features...
Standardization complete
Training set shape: (67447, 117)
Training-validation set shape: (28912, 117)
Validation set shape: (1487, 117)
Test set shape: (89921, 117)
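
The key point of this step is that the scaling statistics come from the training set only and are then reused for every other split. The implementation of `normalize_features` is not shown, so the sketch below is a plain-pandas illustration of the idea (z-scoring with population standard deviation), not the project's code. The function name and toy data are hypothetical.

```python
import pandas as pd

def standardize_with_train_stats(train, others, exclude_cols):
    """Z-score numeric columns using the training mean and standard deviation only."""
    cols = [c for c in train.select_dtypes("number").columns if c not in exclude_cols]
    mean = train[cols].mean()
    # Constant columns have zero spread; use 1.0 to avoid dividing by zero.
    std = train[cols].std(ddof=0).replace(0, 1.0)

    def apply(df):
        out = df.copy()
        out[cols] = (df[cols] - mean) / std
        return out

    return apply(train), [apply(df) for df in others], {"mean": mean, "std": std}

train = pd.DataFrame({"group": [1, 1, 1], "x": [1.0, 2.0, 3.0], "const": [4.0, 4.0, 4.0]})
test = pd.DataFrame({"group": [11], "x": [5.0], "const": [4.0]})
train_s, (test_s,), stats = standardize_with_train_stats(train, [test], exclude_cols=["group"])
print(test_s)
```

With group 7 supplying 61385 of the 67447 training windows, the training mean and standard deviation largely reflect that one group. Scaling per group is a possible alternative if that turns out to be a problem.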


## 5. Save the Processed Data


In [21]:
# Save all processed data
save_intermediate('train_data', train_scaled)
save_intermediate('train_val_data', train_val_scaled)
save_intermediate('val_data', val_scaled)
save_intermediate('test_data', test_scaled)
save_intermediate('scaler', scaler)

# Save the list of feature names
feature_cols = [col for col in train_scaled.columns if col not in exclude_cols]
save_intermediate('feature_names', feature_cols)

print(f"\n✓ All data saved to {config.INTERMEDIATE_DIR}")
print(f"✓ Number of features: {len(feature_cols)}")


Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\train_data.pkl
Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\train_val_data.pkl
Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\val_data.pkl
Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\test_data.pkl
Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\scaler.pkl
Saved: C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate\feature_names.pkl

✓ All data saved to C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\intermediate
✓ Number of features: 113
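
The `.pkl` paths in the output suggest that `save_intermediate` pickles each object under its name. A minimal sketch of such a save/load pair, using only the standard library, might look like this. The function names and the explicit `directory` argument are assumptions; the real helpers presumably read the directory from `config`.

```python
import pickle
import tempfile
from pathlib import Path

def save_intermediate_sketch(name, obj, directory):
    """Hypothetical stand-in for save_intermediate: pickle obj to <directory>/<name>.pkl."""
    path = Path(directory) / f"{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def load_intermediate_sketch(name, directory):
    """Hypothetical stand-in for load_intermediate."""
    with open(Path(directory) / f"{name}.pkl", "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    feature_names = ["density_fused", "silence_mean"]
    saved_path = save_intermediate_sketch("feature_names", feature_names, tmp)
    restored = load_intermediate_sketch("feature_names", tmp)

print(saved_path.name, restored)
```

Saving the scaler and the feature name list alongside the data lets later notebooks apply exactly the same columns and scaling without recomputing them.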


## 6. Data Split Report


In [22]:
# Build the data split report
report_lines = []
report_lines.append("=" * 60)
report_lines.append("Data Split Report")
report_lines.append("=" * 60)
report_lines.append(f"\nTraining groups: {train_groups}")
report_lines.append(f"Validation groups: {val_groups}")
report_lines.append(f"Test groups: {test_groups}")
report_lines.append(f"\nTraining set: {len(train_scaled)} windows")
report_lines.append(f"Training-validation set: {len(train_val_scaled)} windows")
report_lines.append(f"Validation set: {len(val_scaled)} windows")
report_lines.append(f"Test set: {len(test_scaled)} windows")
report_lines.append(f"\nTotal features: {len(feature_cols)}")
report_lines.append("\nWindows per group:")
report_lines.append(f"Training groups: {train_scaled['group'].value_counts().sort_index().to_dict()}")
report_lines.append(f"Validation groups: {val_scaled['group'].value_counts().sort_index().to_dict()}")
report_lines.append(f"Test groups: {test_scaled['group'].value_counts().sort_index().to_dict()}")

report_text = "\n".join(report_lines)
print(report_text)

# Save the report
with open(config.REPORTS_DIR / "data_split_report.txt", 'w', encoding='utf-8') as f:
    f.write(report_text)

print(f"\n✓ Report saved to {config.REPORTS_DIR / 'data_split_report.txt'}")


============================================================
Data Split Report
============================================================

Training groups: [1, 2, 3, 4, 5, 6, 7, 8]
Validation groups: [9, 10]
Test groups: [11, 12]

Training set: 67447 windows
Training-validation set: 28912 windows
Validation set: 1487 windows
Test set: 89921 windows

Total features: 113

Windows per group:
Training groups: {1: 471, 2: 579, 3: 876, 4: 769, 5: 409, 6: 2443, 7: 61385, 8: 515}
Validation groups: {9: 766, 10: 721}
Test groups: {11: 89230, 12: 691}

✓ Report saved to C:\Users\nowan\Downloads\Shared_with_EECS215_Fall2025\outputs\reports\data_split_report.txt


## Data Preparation Complete!

Next step: run `02_unsupervised_feature_selection.ipynb` to perform unsupervised feature selection.
