# ZTA Survey Analysis

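This notebook covers the following steps:

1. Data cleaning and transformation (missing values, Yes/No, Likert recoding)
2. Reliability analysis (Cronbach's alpha)
3. Exploratory factor analysis (EFA)
4. Descriptive statistics (demographics, model variables)
5. Correlation analysis (ZTA familiarity, necessity, satisfaction, etc.)
6. ANOVA/chi-square (differences in adoption by organisation size and industry)

Result files are written to `analysis_outputs/`.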


In [1]:
# Import core dependencies and configure display options
import os
import json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tabulate import tabulate
from scipy import stats
from sklearn.preprocessing import LabelEncoder
from factor_analyzer.factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
import pingouin as pg
import statsmodels.api as sm
import statsmodels.formula.api as smf

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 160)

OUTPUT_DIR = 'analysis_outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

DATA_PATH = 'Cleaned Data-5.20.xlsx'
assert os.path.exists(DATA_PATH), f"Data file not found: {DATA_PATH}"

print('Environment and dependencies loaded.')


Environment and dependencies loaded.


In [2]:
# Load the data
raw = pd.read_excel(DATA_PATH)
df = raw.copy()
print(df.shape)
df.head()


(65, 39)


Unnamed: 0,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,Consent Form,Q1.1,Q1.2,Q1.3,Q1.4,Q1.4_12_TEXT,Q1.5,Q1.6,Q2.1,Q2.2,Q2.3,Q2.4,Q2.5,Q2.6,Q2.7,Q2.7_11_TEXT,Q2.8,Q2.8_4_TEXT,Q2.9_1,Q2.10,Q2.11,Q2.11_4_TEXT,Q2.12_1,Q2.13,Q3.1_1,Q3.2_1,Q3.3,Q3.3_6_TEXT,Q4.1,Q4.1_6_TEXT,Q4.2,Q5.1_4,Q5.2,Q6.1
0,IP 地址,进度,持续时间（秒）,已完成,记录日期,Statement of consent\n\n\n\nI have read the st...,What is your age?,Size of your current organisation (number of e...,Do you have any experience working at or doing...,What is the industry sector of your organisati...,What is the industry sector of your organisati...,What is your role or job title in your organis...,How many years of experience have you had in c...,Does your organisation have a need for cloud c...,Did your organisation experience any difficult...,Does your organisation use telecommuting with ...,Has your organisation experienced difficulties...,Has your organisation experienced a cybersecur...,Did your organization experience difficulties ...,What are the primary cybersecurity threats exp...,What are the primary cybersecurity threats exp...,What security model does your organisation cur...,What security model does your organisation cur...,What is your level of satisfaction with the cu...,Has your organisation used other security mode...,Which security model has your organization pre...,Which security model has your organization pre...,What is your level of satisfaction with the pr...,What are the main reasons your organisation ch...,How familiar are you with Zero Trust Architect...,Do you think it is necessary to implement ZTA ...,What major benefits do you anticipate or have ...,What major benefits do you anticipate or have ...,What are the primary challenges faced when imp...,What are the primary challenges faced when imp...,Would these challenges prevent your organisati...,"Compared to traditional security models, do yo...",Please briefly justify your response to previo...,Additional comments or recommendations regardi...
1,116.237.144.10,100,2343,真,2025-03-27 21:07:26.820000,Agree,18-24,50–250,,Manufacturing,,IT Professional,0-3 year(s),Yes,Yes,Yes,Yes,No,,,,Hybrid,,3,No,,,,,4,4,"Enhanced security,Reduced attack surface,Impro...",,"Financial investment,Legacy system integration...",,Yes,2,SMEs may have financial problems to buy and de...,It depends on the type of organization. If the...
2,116.237.144.10,100,420,真,2025-03-27 21:27:09.925000,Agree,25-34,50–250,,Technology,,Cybersecurity Professional,4-7 years,No,,Yes,No,No,,,,Traditional perimeter based,,3,No,,,,,3,1,"Enhanced security,Reduced attack surface,Impro...",,"Complexity of Identity and Access Management,S...",,No,5,提高公司信息安全水平，简化员工信息安全防护要求，提高系统使用效率，便于日志审计等。,
3,219.89.140.125,100,391,真,2025-04-10 16:33:54.389000,Agree,,251–500,,Manufacturing,,Other role with experiences with security,12-15 years,No,,No,,No,,,,Hybrid,,2,Yes,Traditional perimeter based,,3,not sufficient,3,3,Improved compliance,,Staff training,,Unsure,2,not sure,
4,184.22.39.135,100,373,真,2025-04-15 14:01:23.820000,Agree,18-24,50–250,,Technology,,Cybersecurity Professional,0-3 year(s),Yes,No,Yes,No,No,,,,Hybrid,,3,No,,,,,3,5,"Enhanced security,Reduced attack surface",,"Complexity of Identity and Access Management,S...",,Unsure,5,I think it is beneficial for all business size...,


In [3]:
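# Qualtrics pre-cleaning: drop the question-label row and reset the index
if 'Progress' in df.columns:
    # Keep only rows whose Progress parses as a number (drops label rows such as "进度")
    _prog_num = pd.to_numeric(df['Progress'], errors='coerce')
    df = df[_prog_num.notna()].copy()

# Placeholder for dropping obvious instruction-text rows (not implemented; currently a no-op)
text_like_cols = [c for c in df.columns if c.endswith('_TEXT')]
for c in text_like_cols:
    # A long text that closely resembles the question wording and appears only once is usually
    # a label row, so the column is ignored when filtering
    pass

# Reset the index
df = df.reset_index(drop=True)
print('Pre-cleaning done, sample size:', len(df))


Pre-cleaning done, sample size: 64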


In [4]:
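# Numeric conversion: classify each text column as Likert (1-7) or Yes/No by the share of
# matching values, then convert it (index-aligned, with a real numeric dtype)
for col in df.columns:
    s = df[col]
    if s.dtype == 'object':
        mask = s.notna()
        if mask.sum() == 0:
            continue
        vals = s[mask].astype(str).str.strip()
        low = vals.str.lower()
        # Share of non-missing values that are a single digit 1-7
        numeric_ratio = (low.str.fullmatch(r'[1-7]').sum()) / len(vals)
        # Share of non-missing values that are Yes/No tokens
        yesno_ratio = (low.isin(['yes','no','y','n','是','否']).sum()) / len(vals)
        if numeric_ratio >= 0.5:
            # Reindex to the full column so missing rows stay NaN and the dtype is numeric, not object
            df[col] = pd.to_numeric(vals, errors='coerce').reindex(s.index)
        elif yesno_ratio >= 0.5:
            ymap = {'yes':1, 'no':0, 'y':1, 'n':0, '是':1, '否':0}
            # Answers outside the map (for example "Unsure") become NaN
            df[col] = low.map(ymap).reindex(s.index).astype('float')

print('Numeric conversion done.')


Numeric conversion done.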
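
The share-based rule above can be tried on a toy frame. This is a minimal sketch rather than the notebook's exact code: `classify_column` and its `threshold` argument are hypothetical names introduced here, and the sample values are made up.

```python
import pandas as pd

def classify_column(s: pd.Series, threshold: float = 0.5) -> str:
    """Return 'likert', 'yesno', or 'other' for a text column (hypothetical helper)."""
    vals = s.dropna().astype(str).str.strip().str.lower()
    if len(vals) == 0:
        return 'other'
    likert_share = vals.str.fullmatch(r'[1-7]').mean()
    yesno_share = vals.isin(['yes', 'no']).mean()
    if likert_share >= threshold:
        return 'likert'
    if yesno_share >= threshold:
        return 'yesno'
    return 'other'

# Made-up data in the same shape as the survey export (everything read as text)
toy = pd.DataFrame({
    'score': ['4', '3', None, '5'],
    'uses_cloud': ['Yes', 'No', 'Unsure', 'Yes'],
    'sector': ['Technology', 'Manufacturing', 'Technology', None],
}, dtype='object')

kinds = {c: classify_column(toy[c]) for c in toy.columns}

# Convert to a real numeric dtype instead of leaving numbers inside an object column
score_num = pd.to_numeric(toy['score'], errors='coerce')
cloud_num = toy['uses_cloud'].str.lower().map({'yes': 1, 'no': 0})
```

In this sketch the "Unsure" answer ends up as NaN in `cloud_num`, which is worth keeping in mind before treating a Yes/No/Unsure question as binary.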


In [5]:
# Automatic construct detection and key-variable mapping (edit manually as needed)
AUTO_CONSTRUCTS = {
    'familiarity': [c for c in df.columns if str(c).startswith('Q3.1_')],
    'necessity': [c for c in df.columns if str(c).startswith('Q3.2_')],
    'satisfaction_current': [c for c in df.columns if str(c).startswith('Q2.9_')],
    'satisfaction_previous': [c for c in df.columns if str(c).startswith('Q2.12_')],
    'perceived_benefits': [c for c in df.columns if str(c).startswith('Q5.1_')],
}

# Constructs that can be set manually (for multi-item scales, list the item column names here)
CONSTRUCTS = {
    # 'pts': ['PTS1','PTS2','PTS3'],
    # 'ce': ['CE1','CE2','CE3'],
}

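# Map the auto-detected constructs onto the generic keys used by later cells
CONSTRUCTS['pb'] = AUTO_CONSTRUCTS['perceived_benefits']
# Q4.* is used here as the adoption-intention item set. In this dataset Q4.1 is a multi-select
# question and Q4.2 is Yes/No/Unsure, so check the items before relying on this mapping
CONSTRUCTS['ai'] = [c for c in df.columns if str(c).startswith('Q4.')]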

KEY_VARS = {
    'zta_familiarity': 'Q3.1_1',
    'zta_necessity': 'Q3.2_1',
    'zta_satisfaction': 'Q2.9_1',
    'zta_adoption': 'Q2.8',
    'org_size': 'Q1.2',
    'industry': 'Q1.4'
}

print('Construct and key-variable configuration:')
print({k: v for k, v in KEY_VARS.items() if v in df.columns})
print('CONSTRUCTS preview:', {k: len(v) for k, v in CONSTRUCTS.items()})


Construct and key-variable configuration:
{'zta_familiarity': 'Q3.1_1', 'zta_necessity': 'Q3.2_1', 'zta_satisfaction': 'Q2.9_1', 'zta_adoption': 'Q2.8', 'org_size': 'Q1.2', 'industry': 'Q1.4'}
CONSTRUCTS preview: {'pb': 1, 'ai': 3}


In [6]:
# Helper functions for data cleaning and transformation
LIKERT_MAP = {
    'Strongly disagree': 1, 'Disagree': 2, 'Somewhat disagree': 3,
    'Neutral': 4,
    'Somewhat agree': 5, 'Agree': 6, 'Strongly agree': 7,
    # Chinese labels; extend with other localised or numeric versions as needed
    '非常不同意': 1, '不同意': 2, '有点不同意': 3,
    '中立': 4,
    '有点同意': 5, '同意': 6, '非常同意': 7
}

YESNO_MAP = {
    'Yes': 1, 'No': 0,
    '是': 1, '否': 0,
    'Y': 1, 'N': 0
}

NA_TOKENS = {'NA', 'N/A', 'na', 'n/a', 'none', 'None', 'NULL', 'null', ''}


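def normalize_missing(x):
    if pd.isna(x):
        return np.nan
    # Compare case-insensitively so title-cased tokens such as 'Na' or 'Null' are also caught
    if isinstance(x, str) and x.strip().lower() in {t.lower() for t in NA_TOKENS}:
        return np.nan
    return x


def recode_likert(series: pd.Series) -> pd.Series:
    # Match labels case-insensitively so the mapping still works after title-casing
    if series.dtype != 'object':
        return series
    lookup = {k.lower(): v for k, v in LIKERT_MAP.items()}
    return series.map(lambda v: lookup.get(v.strip().lower(), np.nan) if isinstance(v, str) else np.nan).astype('float')


def recode_yesno(series: pd.Series) -> pd.Series:
    # Same case-insensitive matching as recode_likert
    if series.dtype != 'object':
        return series
    lookup = {k.lower(): v for k, v in YESNO_MAP.items()}
    return series.map(lambda v: lookup.get(v.strip().lower(), np.nan) if isinstance(v, str) else np.nan).astype('float')


def cronbach_alpha(df_items: pd.DataFrame) -> float:
    # df_items: one numeric column per item; rows with any missing item are dropped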
    df_items = df_items.dropna(axis=0, how='any')
    k = df_items.shape[1]
    if k < 2:
        return np.nan
    item_vars = df_items.var(axis=0, ddof=1)
    total_var = df_items.sum(axis=1).var(ddof=1)
    if total_var == 0:
        return np.nan
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    return float(alpha)

print('Cleaning helper functions defined.')


Cleaning helper functions defined.
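
The `cronbach_alpha` helper computes alpha = k/(k-1) × (1 − sum of item variances / variance of the total score). A minimal NumPy sketch of the same formula on made-up responses follows; `cronbach_alpha_np` and the toy arrays are illustrative names and data, not part of the survey.

```python
import numpy as np

def cronbach_alpha_np(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) array with no missing values."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # sample variance of the summed score
    return float((k / (k - 1)) * (1 - item_vars.sum() / total_var))

# Made-up responses: 5 respondents x 3 items on a 1-7 scale
responses = np.array([
    [2, 3, 2],
    [4, 4, 5],
    [5, 6, 5],
    [6, 6, 7],
    [3, 3, 3],
])
alpha_toy = cronbach_alpha_np(responses)

# Identical items are perfectly consistent, so alpha should come out as 1
identical = np.tile(np.array([[1], [3], [5], [7]]), (1, 3))
alpha_identical = cronbach_alpha_np(identical)
```

For the toy responses, alpha works out to roughly 0.97. A construct with a single item has no defined alpha, which is why the helper returns NaN when `k < 2`.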


In [7]:
# Automatic cleaning: trim whitespace, normalise missing values, unify capitalisation
for col in df.columns:
    df[col] = df[col].apply(lambda x: x.strip() if isinstance(x, str) else x)
    # Normalise missing tokens before title-casing so lowercase tokens such as 'na' still match NA_TOKENS
    df[col] = df[col].apply(normalize_missing)
    df[col] = df[col].apply(lambda x: x.title() if isinstance(x, str) else x)  # Title Case for all text values

print('Missing-value normalisation done.')

# Detect Likert and Yes/No columns automatically and recode them (original values are kept in *_raw)
# The lookup keys are title-cased so they match the title-cased data above
likert_title_map = {k.title(): v for k, v in LIKERT_MAP.items()}
yesno_title_map = {k.title(): v for k, v in YESNO_MAP.items()}
likert_like_tokens = set(likert_title_map)
yesno_tokens = set(yesno_title_map)

recode_report = []
for col in df.columns:
    if df[col].dtype == 'object':
        values = set(v for v in df[col].dropna().unique())
        # Treat as Likert when at least three distinct Likert labels appear
        if len(values & likert_like_tokens) >= 3:
            df[f'{col}_raw'] = df[col]
            df[col] = df[col].map(likert_title_map).astype('float')
            recode_report.append((col, 'Likert'))
        # Treat as Yes/No when at least two distinct Yes/No tokens appear
        elif len(values & yesno_tokens) >= 2:
            df[f'{col}_raw'] = df[col]
            df[col] = df[col].map(yesno_title_map).astype('float')
            recode_report.append((col, 'Yes/No'))

pd.DataFrame(recode_report, columns=['column', 'type']).to_csv(os.path.join(OUTPUT_DIR, 'recode_report.csv'), index=False)
print(f'Automatic recoding done: {len(recode_report)} column(s).')

# Save the cleaned data (Parquet preferred; fall back to CSV when no Parquet engine is installed)
clean_parquet = os.path.join(OUTPUT_DIR, 'cleaned_dataset.parquet')
clean_csv = os.path.join(OUTPUT_DIR, 'cleaned_dataset.csv')
try:
    df.to_parquet(clean_parquet, index=False)
    print('Cleaned data saved to:', clean_parquet)
except Exception as e:
    print('Parquet save failed, falling back to CSV:', str(e))
    df.to_csv(clean_csv, index=False)
    print('Cleaned data saved to:', clean_csv)


Missing-value normalisation done.
Automatic recoding done: 0 column(s).
Parquet save failed, falling back to CSV: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
Cleaned data saved to: analysis_outputs/cleaned_dataset.csv


In [8]:
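# Configure constructs and scale items (edit to match your column names)
# Replace the commented placeholders below with the item columns in your data
# Note: as written, this cell replaces the CONSTRUCTS and KEY_VARS configured from the Q-codes above.
# Until the placeholder names are changed to real column names, the reliability, descriptive,
# correlation, and ANOVA/chi-square cells find no matching columns.
CONSTRUCTS = {
    # Example: Perceived Threat Severity
    'pts': [
        # 'PTS1', 'PTS2', 'PTS3', ...
    ],
    # Example: Coping Efficacy
    'ce': [
        # 'CE1', 'CE2', 'CE3', ...
    ],
    # Example: Adoption Intention
    'ai': [
        # 'AI1', 'AI2', 'AI3', ...
    ],
    # Example: Perceived Benefits
    'pb': [
        # 'PB1', 'PB2', 'PB3', ...
    ],
}

# Other key variables (edit to match your column names)
KEY_VARS = {
    'zta_familiarity': 'ZTA Familiarity',
    'zta_necessity': 'Perceived Necessity of ZTA',
    'zta_satisfaction': 'Satisfaction with Current Security',
    'zta_adoption': 'ZTA Adoption',  # 0/1 or categorical
    'org_size': 'Organization Size',
    'industry': 'Industry'
}

print('Check and edit the column names in CONSTRUCTS and KEY_VARS to match the dataset.')


Check and edit the column names in CONSTRUCTS and KEY_VARS to match the dataset.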
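
One way to keep the auto-detected configuration while still allowing manual overrides is to merge the two dictionaries and ignore empty placeholders. The sketch below is an assumption about how that could look, not the notebook's method: `merge_config` and the name `'Industry sector'` are hypothetical, and the dictionaries are abbreviated.

```python
def merge_config(auto: dict, manual: dict) -> dict:
    """Return auto-detected settings, overridden only by non-empty manual entries."""
    merged = dict(auto)
    for key, value in manual.items():
        if value:  # skip empty lists and empty strings left as placeholders
            merged[key] = value
    return merged

# Abbreviated auto-detected settings (Q-codes as in the survey export)
auto_constructs = {'pb': ['Q5.1_4'], 'ai': ['Q4.1', 'Q4.2']}
manual_constructs = {'pts': [], 'ce': [], 'ai': [], 'pb': []}
constructs = merge_config(auto_constructs, manual_constructs)

auto_key_vars = {'org_size': 'Q1.2', 'industry': 'Q1.4'}
manual_key_vars = {'org_size': '', 'industry': 'Industry sector'}  # hypothetical override
key_vars = merge_config(auto_key_vars, manual_key_vars)
```

With this pattern an untouched template leaves the detected Q-codes in place, and only entries that were actually filled in take effect.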


In [9]:
# Reliability analysis: Cronbach's alpha per construct
alpha_rows = []
for name, cols in CONSTRUCTS.items():
    cols = [c for c in cols if c in df.columns]
    if len(cols) >= 2:
        alpha = cronbach_alpha(df[cols])
        alpha_rows.append({'construct': name, 'k_items': len(cols), 'alpha': alpha})

alpha_df = pd.DataFrame(alpha_rows)
print(alpha_df)
alpha_df.to_csv(os.path.join(OUTPUT_DIR, 'cronbach_alpha.csv'), index=False)


Empty DataFrame
Columns: []
Index: []


In [10]:
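# Exploratory factor analysis (EFA), using Perceived Benefits (PB) and Adoption Intention (AI) as examples
# Collect the items to analyse (EFA can also be run separately per construct)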
EFA_TARGETS = {
    'pb': [c for c in CONSTRUCTS.get('pb', []) if c in df.columns],
    'ai': [c for c in CONSTRUCTS.get('ai', []) if c in df.columns],
}

efa_results = {}
for key, cols in EFA_TARGETS.items():
    if len(cols) >= 3:
        sub = df[cols].dropna()
        # KMO and Bartlett's test of sphericity
        chi_square_value, p_value = calculate_bartlett_sphericity(sub)
        kmo_all, kmo_model = calculate_kmo(sub)
        # Estimate the number of factors with the eigenvalue > 1 rule
        fa_check = FactorAnalyzer(n_factors=min(6, len(cols)), rotation=None)
        fa_check.fit(sub)
        ev, v = fa_check.get_eigenvalues()
        n_factors = int((ev > 1).sum())
        n_factors = max(1, n_factors)
        # Final fit with varimax rotation
        fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax')
        fa.fit(sub)
        loadings = pd.DataFrame(fa.loadings_, index=cols, columns=[f'F{i+1}' for i in range(n_factors)])
        efa_results[key] = {
            'bartlett_chi2': float(chi_square_value), 'bartlett_p': float(p_value),
            'kmo_model': float(kmo_model),
            'eigenvalues': ev.tolist(),
            'n_factors': int(n_factors),
            'loadings': loadings
        }
        loadings.to_csv(os.path.join(OUTPUT_DIR, f'efa_loadings_{key}.csv'))
        pd.DataFrame({'eigenvalue': ev}).to_csv(os.path.join(OUTPUT_DIR, f'efa_eigenvalues_{key}.csv'), index=False)

# Save a brief EFA summary as JSON
with open(os.path.join(OUTPUT_DIR, 'efa_summary.json'), 'w') as f:
    json.dump({k: {kk: vv for kk, vv in d.items() if kk != 'loadings'} for k, d in efa_results.items()}, f, indent=2)

print('EFA done.')


EFA done.
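
The eigenvalue > 1 rule used to pick the number of factors can be illustrated without `factor_analyzer`. The sketch below assumes the rule is applied to the eigenvalues of the item correlation matrix; the data are simulated (two made-up latent factors with three items each), not survey responses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two made-up latent factors, each driving three observed items plus noise
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
items = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])

# Eigenvalues of the 6 x 6 correlation matrix, largest first
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors whose eigenvalue exceeds 1 (at least one)
n_factors = max(1, int((eigenvalues > 1).sum()))
```

The eigenvalues of a correlation matrix sum to the number of items, so an eigenvalue above 1 marks a factor that explains more variance than a single item would on its own. With only one or two items per construct, as in this survey, the rule has little to work with.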


In [11]:
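# Descriptive statistics: demographics and core variables
# Demographic columns (adjust to your column names)
DEMOGRAPHICS = ['Gender', 'Age', 'Education', KEY_VARS['org_size'], KEY_VARS['industry']]

# Numeric core variables (must already be recoded or numeric)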
NUMERIC_VARS = []
for key in ['zta_familiarity', 'zta_necessity', 'zta_satisfaction']:
    col = KEY_VARS.get(key)
    if col in df.columns:
        NUMERIC_VARS.append(col)

# Summary statistics
desc_num = df[NUMERIC_VARS].describe().T if NUMERIC_VARS else pd.DataFrame()
print(desc_num)
desc_num.to_csv(os.path.join(OUTPUT_DIR, 'desc_numeric.csv'))

# Category frequencies
cat_reports = {}
for col in DEMOGRAPHICS:
    if col in df.columns:
        counts = df[col].value_counts(dropna=False)
        cat_reports[col] = counts
        counts.to_csv(os.path.join(OUTPUT_DIR, f'freq_{col}.csv'))

print('Descriptive statistics done.')


Empty DataFrame
Columns: []
Index: []
Descriptive statistics done.


In [12]:
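# If there is no multi-item Likert construct, write a note and skip alpha/EFA
has_any_construct = any(len(v) >= 2 for v in CONSTRUCTS.values())
if not has_any_construct:
    note = 'No multi-item Likert construct is available for Cronbach/EFA; skipping reliability and EFA. Consider MCA or clustering on the binary multi-select indicators instead.'
    with open(os.path.join(OUTPUT_DIR, 'note_no_constructs.txt'), 'w') as f:
        f.write(note)
    print(note)
else:
    print('Multi-item constructs detected; alpha/EFA are computed in the cells above.')


No multi-item Likert construct is available for Cronbach/EFA; skipping reliability and EFA. Consider MCA or clustering on the binary multi-select indicators instead.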


In [13]:
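# Split the Q3.3 multi-select question into dummy variables and compute co-occurrence
multi_col = 'Q3.3'
if multi_col in df.columns:
    # Fill missing answers before casting; astype(str) first would turn NaN into the text 'nan'
    s = df[multi_col].fillna('').astype(str)
    # Split on commas
    split = s.str.get_dummies(sep=',')
    # Strip whitespace from the option names
    split.columns = [c.strip() for c in split.columns]
    # Drop the empty-string column, if any
    if '' in split.columns:
        split = split.drop(columns=[''])
    # Save the binary indicator matrix
    split.to_csv(os.path.join(OUTPUT_DIR, 'Q3_3_dummies.csv'), index=False)

    # Simple co-occurrence matrix (diagonal zeroed)
    co = split.T @ split
    np.fill_diagonal(co.values, 0)
    co.to_csv(os.path.join(OUTPUT_DIR, 'Q3_3_cooccurrence.csv'))
    print('Q3.3 multi-select split and co-occurrence done.')
else:
    print('Column Q3.3 not found; skipping multi-select processing.')


Q3.3 multi-select split and co-occurrence done.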
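
The split-and-count step can be checked on a few made-up answers. This is a small sketch with illustrative option strings; it also shows why missing answers are best filled before casting to text, since casting first would produce a spurious `nan` option.

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers, one comma-separated string per respondent
answers = pd.Series([
    'Enhanced security,Reduced attack surface',
    'Enhanced security',
    np.nan,
    'Improved compliance,Enhanced security',
], dtype='object')

# One 0/1 indicator column per option; the missing answer becomes an all-zero row
dummies = answers.fillna('').astype(str).str.get_dummies(sep=',').astype(int)
dummies.columns = [c.strip() for c in dummies.columns]

# Co-occurrence counts: entry (i, j) is the number of respondents who picked both options
co = dummies.T @ dummies
co_values = co.to_numpy().copy()
np.fill_diagonal(co_values, 0)  # the diagonal would otherwise hold each option's own count
co = pd.DataFrame(co_values, index=co.index, columns=co.columns)
```

Option labels that themselves contain a comma would be split into separate options by this approach, so it is worth checking the export for such labels.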


In [14]:
# Logistic regression: adopt_zta ~ familiarity + necessity + satisfaction + organisation size + industry
from statsmodels.discrete.discrete_model import Logit
from patsy import dmatrices

if 'adopt_zta' not in df.columns and 'Q2.8' in df.columns:
    df['adopt_zta'] = (df['Q2.8'].astype(str).str.contains('Zero Trust Architecture', case=False, na=False)).astype(int)

# Every column used in the formula must be present
required = ['Q3.1_1','Q3.2_1','Q2.9_1','Q1.2','Q1.4','adopt_zta']
if all(c in df.columns for c in required):
    # Treat organisation size and industry as categorical
    for cat in ['Q1.2','Q1.4']:
        df[cat] = df[cat].astype('category')
    # Wrap column names containing dots in Q("...") so patsy parses them as names
    formula = 'adopt_zta ~ Q("Q3.1_1") + Q("Q3.2_1") + Q("Q2.9_1") + C(Q("Q1.2")) + C(Q("Q1.4"))'
    y, X = dmatrices(formula, df, return_type='dataframe')
    # Binary logistic regression (dmatrices drops rows with missing values by default)
    yv = y.iloc[:,0]
    model = Logit(yv, X).fit(disp=False)
    summ = model.summary2().as_text()
    with open(os.path.join(OUTPUT_DIR, 'logit_adopt.txt'), 'w') as f:
        f.write(summ)
    print('Logistic regression done; results saved to logit_adopt.txt')
else:
    print('Required regression variables are missing; skipping logistic regression.')


Logistic regression done; results saved to logit_adopt.txt




In [15]:
# MCA: multiple correspondence analysis on the Q3.3 multi-select dummy variables
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import prince

multi_col = 'Q3.3'
if multi_col in df.columns:
    # Fill missing answers before casting; astype(str) first would turn NaN into the text 'nan'
    s = df[multi_col].fillna('').astype(str)
    split = s.str.get_dummies(sep=',')
    split.columns = [c.strip() for c in split.columns]
    if '' in split.columns:
        split = split.drop(columns=[''])
    # Drop all-zero columns (options nobody selected)
    split = split.loc[:, split.sum(axis=0) > 0]
    # Require at least two options and at least five respondents
    if split.shape[1] >= 2 and split.shape[0] >= 5:
        mca = prince.MCA(n_components=2, copy=True, check_input=True, random_state=42)
        mca = mca.fit(split)
        row_coords = mca.row_coordinates(split)
        col_coords = mca.column_coordinates(split)

        row_coords.to_csv(os.path.join(OUTPUT_DIR, 'mca_Q3_3_row_coords.csv'), index=False)
        col_coords.to_csv(os.path.join(OUTPUT_DIR, 'mca_Q3_3_col_coords.csv'))

        # Eigenvalues (explained inertia)
        try:
            eig = pd.DataFrame({'eigenvalue': mca.eigenvalues_})
        except Exception:
            # Some prince versions lack eigenvalues_; fall back to explained_inertia_
            ev = np.array(mca.explained_inertia_)
            eig = pd.DataFrame({'eigenvalue': ev})
        eig.to_csv(os.path.join(OUTPUT_DIR, 'mca_Q3_3_eigenvalues.csv'), index=False)

        # Simple scatter plot of the row (respondent) points
        plt.figure(figsize=(5,4))
        plt.scatter(row_coords.iloc[:,0], row_coords.iloc[:,1], s=20, alpha=0.6)
        plt.axhline(0, color='gray', lw=0.5)
        plt.axvline(0, color='gray', lw=0.5)
        plt.title('MCA of Q3.3 (Rows)')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_DIR, 'mca_Q3_3_rows.png'), dpi=150)
        plt.close()

        # Scatter plot of the option (column) points
        plt.figure(figsize=(5,4))
        plt.scatter(col_coords.iloc[:,0], col_coords.iloc[:,1], s=30, c='C1')
        for name, (x,y) in col_coords.iloc[:, :2].iterrows():
            plt.text(x, y, str(name)[:20], fontsize=7)
        plt.axhline(0, color='gray', lw=0.5)
        plt.axvline(0, color='gray', lw=0.5)
        plt.title('MCA of Q3.3 (Options)')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_DIR, 'mca_Q3_3_options.png'), dpi=150)
        plt.close()
        print('MCA done.')
    else:
        print('Too few Q3.3 options or respondents; skipping MCA.')
else:
    print('Column Q3.3 not found; skipping MCA.')


MCA done.


In [16]:
# Correlation analysis: ZTA familiarity, necessity, satisfaction
corr_cols = [KEY_VARS[k] for k in ['zta_familiarity', 'zta_necessity', 'zta_satisfaction'] if KEY_VARS[k] in df.columns]
if len(corr_cols) >= 2:
    sub = df[corr_cols].dropna()
    corr = sub.corr(method='pearson')
    print(corr)
    corr.to_csv(os.path.join(OUTPUT_DIR, 'correlation_core.csv'))
    # Save the scatter-plot matrix
    sns.pairplot(sub)
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'pairplot_core.png'), dpi=150)
    plt.close()
else:
    print('Not enough variables for the correlation analysis.')


Not enough variables for the correlation analysis.
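
The cell above uses Pearson correlation. For single-item Likert scores, which are ordinal, a rank-based coefficient is a common alternative; the sketch below computes Spearman's rho as Pearson's r on average ranks. The column names and values are made up for illustration, and whether Spearman is preferable for this survey is a judgment call rather than something the notebook states.

```python
import pandas as pd

# Made-up 1-5 ratings for two single-item measures
toy = pd.DataFrame({
    'familiarity': [1, 2, 3, 4, 5, 5],
    'necessity':   [2, 2, 3, 5, 4, 5],
})

pearson = toy.corr(method='pearson')

# Spearman's rho equals Pearson's r computed on average ranks (ties share a rank)
spearman = toy.rank().corr(method='pearson')
rho = float(spearman.loc['familiarity', 'necessity'])
```

Because ranks ignore the spacing between scale points, the rank-based value is less sensitive to how the Likert categories are numbered.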


In [17]:
# ANOVA and chi-square tests
adoption_col = KEY_VARS.get('zta_adoption')
size_col = KEY_VARS.get('org_size')
industry_col = KEY_VARS.get('industry')

# If adoption is numeric (0/1), compare group means of the core continuous variables
if adoption_col in df.columns:
    # Example: perceived necessity of ZTA by adoption status
    target = KEY_VARS.get('zta_necessity')
    if target in df.columns:
        tmp = df[[adoption_col, target]].dropna()
        if tmp[adoption_col].nunique() >= 2:
            groups = [g[target].values for _, g in tmp.groupby(adoption_col)]
            fval, pval = stats.f_oneway(*groups)
            print('ANOVA (ZTA adoption -> necessity): F=%.3f, p=%.4g' % (fval, pval))

# Industry × adoption chi-square
if industry_col in df.columns and adoption_col in df.columns:
    ct = pd.crosstab(df[industry_col], df[adoption_col])
    chi2, p, dof, exp = stats.chi2_contingency(ct)
    print('Chi-square (industry × adoption): chi2=%.3f, p=%.4g, dof=%d' % (chi2, p, dof))
    ct.to_csv(os.path.join(OUTPUT_DIR, 'chisq_industry_adoption_ct.csv'))

# Organisation size × adoption chi-square
if size_col in df.columns and adoption_col in df.columns:
    ct = pd.crosstab(df[size_col], df[adoption_col])
    chi2, p, dof, exp = stats.chi2_contingency(ct)
    print('Chi-square (size × adoption): chi2=%.3f, p=%.4g, dof=%d' % (chi2, p, dof))
    ct.to_csv(os.path.join(OUTPUT_DIR, 'chisq_size_adoption_ct.csv'))

print('ANOVA/chi-square tests done.')


ANOVA/chi-square tests done.
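
The chi-square statistic reported by the cell above can be reproduced by hand, which also exposes the expected counts. The sketch below uses a made-up 2 × 2 table and plain NumPy. It computes the uncorrected Pearson statistic, whereas `stats.chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, so the two values differ for 2 × 2 tables.

```python
import numpy as np

# Made-up 2 x 2 table: rows = organisation size group, columns = adopted ZTA (no, yes)
observed = np.array([
    [20, 10],
    [10, 20],
])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square without continuity correction
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Rule of thumb: the chi-square approximation is doubtful when many expected counts fall below 5
share_small_expected = float((expected < 5).mean())
```

With 64 respondents spread over several industries, small expected counts are plausible, so inspecting the `exp` array returned by `chi2_contingency` before reading the p-value seems prudent.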


In [18]:
# Optional: export key figures (histograms and a correlation heatmap)
if NUMERIC_VARS:
    # Histograms
    for col in NUMERIC_VARS:
        plt.figure(figsize=(5,3))
        sns.histplot(df[col].dropna(), kde=True)
        plt.title(f'Distribution - {col}')
        plt.tight_layout()
        plt.savefig(os.path.join(OUTPUT_DIR, f'hist_{col}.png'), dpi=150)
        plt.close()

    # Correlation heatmap
    corr = df[NUMERIC_VARS].corr()
    plt.figure(figsize=(6,5))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'corr_heatmap.png'), dpi=150)
    plt.close()

print('Figure export done.')


Figure export done.
