The client provided four raw datasets: (1) the curriculum/knowledge framework (Skills and Knowledge years 3-10.json), (2) a STEM careers requirement catalog (STEM Careers.json), (3) sample student profiles (mock_users_progress.json), and (4) a field glossary (json_schema_documentation.md).

The main issues were: inconsistent schemas across files (id vs user_id, knowledge vs knowledge_progress, uneven level naming); career requirements whose knowledge node IDs do not match the curriculum's IDs; missing or non-standard thresholds, weights, and difficulties; user profiles with near-zero signals (every user looks like a “cold start”); and no production-ready learning content library to feed recommendations.

To keep the data usable and explainable, we preprocessed it as follows: unified node IDs and level semantics (L1/L2/L3) and built a searchable skills index; normalized the careers file to id/title/min_skill_levels/required_knowledge[{node,min_level,weight}]/threshold; standardized student profiles to id/grade/inquiry_skills/knowledge with robust handling of types and blanks; and, lacking real content, generated a placeholder learning content library (one unit per “knowledge node × level”) to enable end-to-end testing. Preprocessing yields four engine-ready artifacts (skills_index.json, careers_normalized.json, users_normalized.json, and units_placeholder.json) with aligned schemas, complete fields, and traceable, interpretable mappings suitable for the current rules-based recommendation engine and frontend integration.

Step 1 — Curriculum / Knowledge base (Skills and Knowledge years 3–10.json)

What
Normalize the curriculum file into a flat, searchable index of knowledge nodes, unify the levels structure, and (optionally) auto-generate placeholder learning units so recommendations can already produce “next actions”.

Why

The raw levels field appears as both arrays and objects; inconsistent shapes make parsing brittle.

The recommender needs fast lookups by node id, hence a flat index.

Before you have real content, “node × level” placeholders keep the UX loop complete.

How

Recursively walk the JSON; collect id/title/subject/grade/levels for every node.

Normalize levels to a single shape: {1:{…}, 2:{…}, …} preserving description/outcomes.

Extract progression_to (if present) into directed edges.

Generate placeholders: one unit per node × level (difficulty≈level, default lengthMin, pointing back to that node).
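
As a small worked example of the level normalization described above, the sketch below feeds the same two levels through in both raw shapes (a list and a dict with differently named keys) and checks that they collapse to one structure. The helper name to_level_map and the sample values are illustrative assumptions, not part of the client data; the logic is a condensed version of the normalize_levels function in the cell that follows.

```python
import re

def to_level_map(levels):
    """Condensed sketch: coerce list- or dict-shaped levels into {int: {...}}."""
    if isinstance(levels, list):
        # List position defines the level number, starting at 1.
        pairs = list(enumerate(levels, start=1))
    elif isinstance(levels, dict):
        # The first run of digits in each key defines the level number.
        pairs = []
        for key, value in levels.items():
            match = re.search(r"(\d+)", str(key))
            if match:
                pairs.append((int(match.group(1)), value))
        pairs.sort(key=lambda p: p[0])
    else:
        pairs = []

    out = {}
    for n, item in pairs:
        if isinstance(item, dict):
            outcomes = item.get("outcomes")
            if not isinstance(outcomes, list):
                outcomes = [outcomes] if outcomes else []
            out[n] = {"description": item.get("description"), "outcomes": outcomes}
        else:
            # A bare string is treated as the description.
            out[n] = {"description": str(item), "outcomes": []}
    return out

# Hypothetical sample data covering both raw shapes.
as_list = [{"description": "Recall", "outcomes": ["name parts"]}, "Apply"]
as_dict = {"L2": "Apply", "level_1": {"description": "Recall", "outcomes": "name parts"}}

list_result = to_level_map(as_list)
dict_result = to_level_map(as_dict)
print(list_result == dict_result)  # True
```

Keying by integer level keeps ordering and comparisons simple while the data is in memory.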

Output

skills_index.json — flat index: id → {title, subject, grade, levels{1:{…}}}

skills_edges.json — [{"from","to"}] progression edges (may be empty)

units_placeholder.json — optional placeholder content (node × level)
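
One detail worth keeping in mind when consuming these files: JSON object keys are always strings, so the integer level keys used in memory come back as "1", "2", ... once skills_index.json is reloaded. The sketch below illustrates this on a toy index (the description values are made up) and shows the placeholder unit id convention, `<node>::L<level>`, that Step 1 uses. The helper unit_id is a hypothetical name introduced here for illustration.

```python
import json

# Toy in-memory index in the shape Step 1 produces (values are illustrative).
skills_index = {
    "BIO.Y3.AC9S3U01": {
        "title": "Living vs non-living; life cycles (intro)",
        "subject": None,
        "grade": 3,
        "levels": {
            1: {"description": "d1", "outcomes": []},
            2: {"description": "d2", "outcomes": []},
        },
    }
}

# Round-trip through JSON, as happens when a later step reloads the file.
reloaded = json.loads(json.dumps(skills_index))
level_keys = list(reloaded["BIO.Y3.AC9S3U01"]["levels"].keys())
print(level_keys)  # ['1', '2']

def unit_id(node_id, level):
    """Build a placeholder unit id; accepts the level as int or str."""
    return f"{node_id}::L{int(level)}"

unit_ids = [unit_id("BIO.Y3.AC9S3U01", k) for k in level_keys]
print(unit_ids)  # ['BIO.Y3.AC9S3U01::L1', 'BIO.Y3.AC9S3U01::L2']
```

Casting the level with int() at the boundary is one way to keep downstream comparisons such as `level == 1` from silently failing on string keys.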

In [3]:
# Step 1: process "Skills and Knowledge years 3-10.json"
# Outputs: skills_index.json, skills_edges.json, units_placeholder.json

import json, re
from pathlib import Path

BASE = Path.cwd()  # notebook dir
SRC  = BASE / "Skills and Knowledge years 3-10.json"
assert SRC.exists(), f"not found: {SRC}"

with SRC.open("r", encoding="utf-8") as f:
    curriculum = json.load(f)

def normalize_levels(levels_obj):
    """Normalize levels to {1: {description, outcomes[]}, 2: {...}}."""
    norm = {}
    if isinstance(levels_obj, list):
        for i, item in enumerate(levels_obj, start=1):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[i] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[i] = {"description": str(item), "outcomes": []}
    elif isinstance(levels_obj, dict):
        tmp = []
        for k, v in levels_obj.items():
            m = re.search(r"(\d+)", str(k))
            if m: tmp.append((int(m.group(1)), v))
        for n, item in sorted(tmp, key=lambda x: x[0]):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[n] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[n] = {"description": str(item), "outcomes": []}
    return norm

# Collect nodes and edges
skills_index = {}      # id -> {title, subject, grade, levels{1:{...}}}
edges = []             # [{from, to}]

def walk(obj, parent_subject=None, parent_grade=None):
    if isinstance(obj, dict):
        subject = obj.get("subject") or parent_subject
        grade   = obj.get("year") or obj.get("grade") or parent_grade

        node_id = obj.get("id")
        title   = obj.get("title")
        if isinstance(node_id, str) and isinstance(title, str):
            skills_index[node_id] = {
                "title": title,
                "subject": subject,
                "grade": grade,
                "levels": normalize_levels(obj.get("levels")),
            }
            prog = obj.get("progression_to", [])
            if isinstance(prog, list):
                for t in prog:
                    if isinstance(t, str):
                        edges.append({"from": node_id, "to": t})

        for v in obj.values():
            walk(v, subject, grade)
    elif isinstance(obj, list):
        for x in obj:
            walk(x, parent_subject, parent_grade)

walk(curriculum)

# Generate placeholder learning units (node × level)
units_placeholder = []
for nid, info in skills_index.items():
    title = info["title"]
    for lvl in sorted(info["levels"].keys()):
        lvl_obj = info["levels"][lvl]
        units_placeholder.append({
            "id": f"{nid}::L{lvl}",
            "title": f"Advance '{title}' to L{lvl}",
            "kind": "practice",
            "difficulty": min(max(int(lvl),1),3),
            "lengthMin": 8 if lvl == 1 else (10 if lvl == 2 else 12),
            "knowledge_nodes": [{"id": nid, "weight": 1.0}],
            "skills": [],
            "source": "placeholder-from-curriculum",
            "license": "internal",
            "steps": (lvl_obj.get("outcomes") or [])[:5]
        })

# Write the three output files
( BASE / "skills_index.json"   ).write_text(json.dumps(skills_index,   ensure_ascii=False, indent=2), encoding="utf-8")
( BASE / "skills_edges.json"   ).write_text(json.dumps(edges,          ensure_ascii=False, indent=2), encoding="utf-8")
( BASE / "units_placeholder.json").write_text(json.dumps(units_placeholder, ensure_ascii=False, indent=2), encoding="utf-8")


print("✅ Output complete")
print("  Node count:", len(skills_index))
print("  Level entries:", sum(len(v["levels"]) for v in skills_index.values()))
print("  Progression edges:", len(edges))

try:
    import pandas as pd
    df = pd.DataFrame([
        {"id": nid, "title": info["title"], "subject": info.get("subject"), "grade": info.get("grade"), "levels": len(info["levels"])}
        for nid, info in list(skills_index.items())[:20]
    ])
    display(df)
except Exception:
    # pandas or display() unavailable: fall back to a plain-text preview
    sample = list(skills_index.items())[:3]
    print("Preview (first 3):")
    for nid, info in sample:
        print("-", nid, "=>", info["title"], "| levels:", list(info["levels"].keys()))


✅ Output complete
  Node count: 33
  Level entries: 99
  Progression edges: 0


index,id,title,subject,grade,levels
0,BIO.Y3.AC9S3U01,Living vs non-living; life cycles (intro),,3,3
1,EARTH.Y3.AC9S3U02,"Soils, rocks, minerals (properties & resources)",,3,3
2,PHYS.Y3.AC9S3U03,Heat sources & temperature change,,3,3
3,CHEM.Y3.AC9S3U04,Solids & liquids; changes of state,,3,3
4,BIO.Y4.AC9S4U01,Life cycles (development & variation),,4,3
5,BIO.Y4.AC9S4U02,Interdependence & environments,,4,3
6,EARTH.Y4.AC9S4U03,Earth’s surface changes,,4,3
7,CHEM.Y4.AC9S4U04,Material properties & uses,,4,3
8,PHYS.Y4.AC9S4U05,Forces: contact & non-contact,,4,3
9,BIO.Y5.AC9S5U01,Adaptations for survival,,5,3


Step 2 — Careers (STEM Careers.json)

What
Standardize knowledge node IDs in the careers file to match the curriculum, deduplicate/merge repeated required_knowledge entries, and validate references against the curriculum index.

Why

Careers use long prefixes (BIOLOGICAL/PHYSICAL/CHEMICAL) while the curriculum uses short ones (BIO/PHYS/CHEM); without normalization you’ll miss joins.

The same node may appear multiple times; without merging you’ll double count weights.

Early reference checks prevent runtime errors and highlight data gaps.

How

Apply a fixed prefix map: BIOLOGICAL→BIO, PHYSICAL→PHYS, CHEMICAL→CHEM, EARTH→EARTH.

For each career’s required_knowledge: group by node, max the min_level, sum the weight.

Compare the normalized node IDs against skills_index.json keys and list any missing references.

Keep the original structure; only the normalized/merged required_knowledge changes.
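
The normalize-then-merge rule above can be sketched on a toy career as follows. The three required_knowledge entries are invented for illustration (two of them are the same node written with the long and the short prefix), and merge_required is a hypothetical helper name; the prefix map and the max/sum rule follow the description above.

```python
import re

PREFIX_MAP = {"BIOLOGICAL": "BIO", "PHYSICAL": "PHYS", "CHEMICAL": "CHEM", "EARTH": "EARTH"}

def normalize_id(node_id):
    """Rewrite a long subject prefix (before the first dot) to its short form."""
    match = re.match(r"^([A-Z]+)(\..+)$", node_id)
    if not match:
        return node_id
    prefix, rest = match.groups()
    return PREFIX_MAP.get(prefix, prefix) + rest

def merge_required(entries):
    """Group by normalized node: keep the max min_level, sum the weights."""
    merged = {}
    for entry in entries:
        nid = normalize_id(entry["node"])
        slot = merged.setdefault(nid, {"node": nid, "min_level": 0, "weight": 0.0})
        slot["min_level"] = max(slot["min_level"], int(entry["min_level"]))
        slot["weight"] += float(entry["weight"])
    return list(merged.values())

# Hypothetical input: the first two entries refer to the same node.
required = [
    {"node": "BIOLOGICAL.Y5.AC9S5U01", "min_level": 1, "weight": 0.25},
    {"node": "BIO.Y5.AC9S5U01", "min_level": 2, "weight": 0.5},
    {"node": "PHYSICAL.Y4.AC9S4U05", "min_level": 1, "weight": 0.25},
]

merged = merge_required(required)
for row in merged:
    print(row)
# {'node': 'BIO.Y5.AC9S5U01', 'min_level': 2, 'weight': 0.75}
# {'node': 'PHYS.Y4.AC9S4U05', 'min_level': 1, 'weight': 0.25}
```

Because weights are summed, a merged node can carry more weight than any single original entry; whether weights should be renormalized afterwards is a scoring decision left to the engine.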

Output

careers_normalized.json — careers with normalized node IDs and merged duplicates

careers_validation_report.md — stats (merged counts, missing references detail)

In [4]:
# Step 2: process STEM Careers.json (prefix normalization / duplicate merging / reference validation)
import json, re
from pathlib import Path

BASE = Path.cwd()
SRC_CAREERS = BASE / "STEM Careers.json"
SRC_SKILLS  = BASE / "skills_index.json"                     # from Step 1
SRC_CURR    = BASE / "Skills and Knowledge years 3-10.json"  # fallback source if skills_index.json is missing

OUT_CAREERS = BASE / "careers_normalized.json"
OUT_REPORT  = BASE / "careers_validation_report.md"

assert SRC_CAREERS.exists(), f"not found: {SRC_CAREERS}"

def normalize_levels(levels_obj):
    norm = {}
    if isinstance(levels_obj, list):
        for i, item in enumerate(levels_obj, start=1):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[i] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[i] = {"description": str(item), "outcomes": []}
    elif isinstance(levels_obj, dict):
        tmp = []
        for k, v in levels_obj.items():
            m = re.search(r"(\d+)", str(k))
            if m: tmp.append((int(m.group(1)), v))
        for n, item in sorted(tmp, key=lambda x: x[0]):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[n] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[n] = {"description": str(item), "outcomes": []}
    return norm

def build_skills_index_from_curriculum(curr_path: Path):
    with curr_path.open("r", encoding="utf-8") as f:
        cur = json.load(f)
    idx = {}
    def walk(o, subject=None, grade=None):
        if isinstance(o, dict):
            subject = o.get("subject") or subject
            grade   = o.get("year") or o.get("grade") or grade
            if isinstance(o.get("id"), str) and isinstance(o.get("title"), str):
                idx[o["id"]] = {
                    "title": o["title"],
                    "subject": subject,
                    "grade": grade,
                    "levels": normalize_levels(o.get("levels"))
                }
            for v in o.values(): walk(v, subject, grade)
        elif isinstance(o, list):
            for x in o: walk(x, subject, grade)
    walk(cur)
    return idx

if SRC_SKILLS.exists():
    skills_index = json.loads(SRC_SKILLS.read_text(encoding="utf-8"))
else:
    assert SRC_CURR.exists(), "skills_index.json is missing and the curriculum source JSON was not found as a fallback."
    skills_index = build_skills_index_from_curriculum(SRC_CURR)
    SRC_SKILLS.write_text(json.dumps(skills_index, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"ℹ️ Built a temporary skills_index.json from {SRC_CURR.name} ({len(skills_index)} nodes)")

node_exists = set(skills_index.keys())

# Prefix normalization
PREFIX_MAP = {"BIOLOGICAL":"BIO", "PHYSICAL":"PHYS", "CHEMICAL":"CHEM", "EARTH":"EARTH"}
def normalize_id(node_id: str) -> str:
    if not isinstance(node_id, str): return node_id
    m = re.match(r"^([A-Z]+)(\..+)$", node_id)
    if not m: return node_id
    pref, rest = m.group(1), m.group(2)
    return (PREFIX_MAP.get(pref, pref)) + rest

# Load the careers catalog and process it
careers_doc = json.loads(SRC_CAREERS.read_text(encoding="utf-8"))
career_list = careers_doc["careers"] if isinstance(careers_doc, dict) and "careers" in careers_doc else careers_doc

normalized_careers = []
dup_merged_total = 0
missing_details = []   # [{career_id, missing:[node,...]}]

for c in career_list:
    c2 = dict(c)
    req = c.get("required_knowledge", []) or []

    # Merge entries for the same node: take the max min_level, sum the weights
    merged = {}
    for r in req:
        nid = normalize_id(r["node"])
        ml  = int(r["min_level"])
        wt  = float(r["weight"])
        if nid not in merged:
            merged[nid] = {"min_level": ml, "weight": wt}
        else:
            merged[nid]["min_level"] = max(merged[nid]["min_level"], ml)
            merged[nid]["weight"] += wt

    after_list = [{"node": nid, "min_level": v["min_level"], "weight": v["weight"]}
                  for nid, v in merged.items()]
    dup_merged_total += (len(req) - len(after_list))
    c2["required_knowledge"] = after_list
    normalized_careers.append(c2)

    # Reference validation
    missing = [x["node"] for x in after_list if x["node"] not in node_exists]
    if missing:
        missing_details.append({"career_id": c.get("id"), "missing": missing})

# Export
OUT_CAREERS.write_text(json.dumps({"careers": normalized_careers}, ensure_ascii=False, indent=2), encoding="utf-8")

lines = []
lines.append("# Careers normalization and validation report\n")
lines.append(f"- Total careers: {len(normalized_careers)}\n")
lines.append(f"- Duplicate entries merged: {dup_merged_total}\n")
missing_total = sum(len(x["missing"]) for x in missing_details)
lines.append(f"- Missing node references: {missing_total}\n")
if missing_details:
    lines.append("\n## Missing reference details (first 30 careers)\n")
    for row in missing_details[:30]:
        lines.append(f"- {row['career_id']}: {', '.join(row['missing'][:10])}\n")
OUT_REPORT.write_text("".join(lines), encoding="utf-8")

print("✅ Output complete:", OUT_CAREERS.name, "and", OUT_REPORT.name)


✅ Output complete: careers_normalized.json and careers_validation_report.md


Step 3 — Users (mock_users_progress.json)

What
Normalize user knowledge node IDs, coerce all level/grade values to non-negative integers, validate node references against the curriculum, and optionally map career_interests titles → IDs.

Why

If user node IDs don’t match curriculum IDs, lookups will fail during scoring.

Non-integer or negative values distort thresholds and ranking logic.

Early detection of missing references helps data QA and backfilling.

How

Use the same prefix map as Step 2 to normalize knowledge keys; merge duplicates (keep the max level).

Clamp grade into [3, 10]; convert inquiry_skills and knowledge levels to non-negative integers.

If careers_normalized.json is available, try mapping career_interests titles to career IDs (fallback to original string if not found).

For each user/node, verify the node exists in skills_index.json; count missing references.
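
To make the cleaning rules above concrete, here is a minimal sketch applied to one invented user record. The record, the helper name clean_user, and the skill names are assumptions for illustration; unlike the full cell below, this sketch does not count fixes for the report or map career_interests.

```python
import re

PREFIX_MAP = {"BIOLOGICAL": "BIO", "PHYSICAL": "PHYS", "CHEMICAL": "CHEM", "EARTH": "EARTH"}

def normalize_id(node_id):
    """Rewrite a long subject prefix to its short form; leave other IDs alone."""
    if not isinstance(node_id, str):
        return node_id
    match = re.match(r"^([A-Z]+)(\..+)$", node_id)
    if not match:
        return node_id
    prefix, rest = match.groups()
    return PREFIX_MAP.get(prefix, prefix) + rest

def to_nonneg_int(value, default=0):
    """Coerce to a non-negative int; fall back to `default` if unparseable."""
    try:
        return max(int(round(float(value))), 0)
    except (TypeError, ValueError, OverflowError):
        return default

def clean_user(user):
    """Clamp grade to [3, 10], coerce levels, and merge colliding node keys."""
    grade = min(max(to_nonneg_int(user.get("grade")), 3), 10)
    inquiry = {k: to_nonneg_int(v) for k, v in (user.get("inquiry_skills") or {}).items()}
    knowledge = {}
    for key, level in (user.get("knowledge") or {}).items():
        nid = normalize_id(key)
        knowledge[nid] = max(knowledge.get(nid, 0), to_nonneg_int(level))
    return {**user, "grade": grade, "inquiry_skills": inquiry, "knowledge": knowledge}

# Hypothetical raw record with typical problems: string numbers, a negative
# value, a blank level, and one node present under both prefix styles.
raw_user = {
    "id": "u1",
    "grade": "12",
    "inquiry_skills": {"questioning": "2.6", "planning": -1},
    "knowledge": {
        "BIOLOGICAL.Y3.AC9S3U01": 1,
        "BIO.Y3.AC9S3U01": "2",
        "CHEM.Y3.AC9S3U04": None,
    },
}

cleaned = clean_user(raw_user)
print(cleaned["grade"])           # 10
print(cleaned["inquiry_skills"])  # {'questioning': 3, 'planning': 0}
print(cleaned["knowledge"])       # {'BIO.Y3.AC9S3U01': 2, 'CHEM.Y3.AC9S3U04': 0}
```

Note that Python's round() rounds exact halves to the nearest even integer (round(2.5) is 2), which matters if half-levels ever appear in the source data.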

Output

users_normalized.json — clean user profiles ready for the rules engine

users_validation_report.md — stats (grade clamping, numeric fixes, interest mappings, missing references)

In [5]:
# Step 3: process mock_users_progress.json (prefix normalization / numeric cleaning / reference validation)
import json, re
from pathlib import Path

BASE = Path.cwd()
SRC_USERS   = BASE / "mock_users_progress.json"
SRC_SKILLS  = BASE / "skills_index.json"
SRC_CURR    = BASE / "Skills and Knowledge years 3-10.json"   # fallback: rebuild skills_index from the curriculum source if it is missing
SRC_CAREERS = BASE / "careers_normalized.json"                 # optional: used to map interest titles to career IDs

OUT_USERS   = BASE / "users_normalized.json"
OUT_REPORT  = BASE / "users_validation_report.md"

assert SRC_USERS.exists(), f"not found: {SRC_USERS}"

def normalize_levels(levels_obj):
    norm = {}
    if isinstance(levels_obj, list):
        for i, item in enumerate(levels_obj, start=1):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[i] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[i] = {"description": str(item), "outcomes": []}
    elif isinstance(levels_obj, dict):
        tmp = []
        for k, v in levels_obj.items():
            m = re.search(r"(\d+)", str(k))
            if m: tmp.append((int(m.group(1)), v))
        for n, item in sorted(tmp, key=lambda x: x[0]):
            if isinstance(item, dict):
                outs = item.get("outcomes")
                norm[n] = {
                    "description": item.get("description"),
                    "outcomes": outs if isinstance(outs, list) else ([outs] if outs else []),
                }
            else:
                norm[n] = {"description": str(item), "outcomes": []}
    return norm

def build_skills_index_from_curriculum(curr_path: Path):
    with curr_path.open("r", encoding="utf-8") as f:
        cur = json.load(f)
    idx = {}
    def walk(o, subject=None, grade=None):
        if isinstance(o, dict):
            subject = o.get("subject") or subject
            grade   = o.get("year") or o.get("grade") or grade
            if isinstance(o.get("id"), str) and isinstance(o.get("title"), str):
                idx[o["id"]] = {
                    "title": o["title"],
                    "subject": subject,
                    "grade": grade,
                    "levels": normalize_levels(o.get("levels"))
                }
            for v in o.values(): walk(v, subject, grade)
        elif isinstance(o, list):
            for x in o: walk(x, subject, grade)
    walk(cur)
    return idx

if SRC_SKILLS.exists():
    skills_index = json.loads(SRC_SKILLS.read_text(encoding="utf-8"))
else:
    assert SRC_CURR.exists(), "skills_index.json is missing and the curriculum source JSON was not found as a fallback."
    skills_index = build_skills_index_from_curriculum(SRC_CURR)
    SRC_SKILLS.write_text(json.dumps(skills_index, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"ℹ️ Built a temporary skills_index.json from {SRC_CURR.name} ({len(skills_index)} nodes)")

node_exists = set(skills_index.keys())


PREFIX_MAP = {"BIOLOGICAL":"BIO", "PHYSICAL":"PHYS", "CHEMICAL":"CHEM", "EARTH":"EARTH"}
def normalize_id(node_id: str) -> str:
    if not isinstance(node_id, str): return node_id
    m = re.match(r"^([A-Z]+)(\..+)$", node_id)
    if not m: return node_id
    pref, rest = m.group(1), m.group(2)
    return (PREFIX_MAP.get(pref, pref)) + rest

career_title_to_id = {}
if SRC_CAREERS.exists():
    careers_doc = json.loads(SRC_CAREERS.read_text(encoding="utf-8"))
    career_list = careers_doc["careers"] if isinstance(careers_doc, dict) and "careers" in careers_doc else careers_doc
    for c in career_list:
        if isinstance(c.get("title"), str) and isinstance(c.get("id"), str):
            career_title_to_id[c["title"].strip().lower()] = c["id"]

# Load users and process them
users_doc = json.loads(SRC_USERS.read_text(encoding="utf-8"))
user_list = users_doc["users"] if isinstance(users_doc, dict) and "users" in users_doc else users_doc

normalized_users = []
missing_nodes_total = 0
out_of_range_grades = 0
fixed_values = 0
interest_mapped = 0

def to_nonneg_int(x, default=0):
    global fixed_values
    try:
        n = int(round(float(x)))
        if n < 0: 
            fixed_values += 1
            return 0
        return n
    except Exception:
        fixed_values += 1
        return default

for u in user_list:
    u2 = dict(u)

    # 1) grade: clamp to [3, 10]
    g = to_nonneg_int(u2.get("grade", 0), 0)
    if g < 3 or g > 10:
        out_of_range_grades += 1
        g = min(max(g, 3), 10)
    u2["grade"] = g

    # 2) inquiry_skills: coerce every value to a non-negative integer
    skills_inq = u2.get("inquiry_skills", {}) or {}
    cleaned_inq = {}
    for k, v in skills_inq.items():
        cleaned_inq[k] = to_nonneg_int(v, 0)
    u2["inquiry_skills"] = cleaned_inq

    # 3) knowledge: normalize prefixes, coerce levels to non-negative integers, and merge keys that collide after normalization (keep the max)
    know = u2.get("knowledge", {}) or {}
    cleaned_kn = {}
    for k, v in know.items():
        nk = normalize_id(k)
        lv = to_nonneg_int(v, 0)
        cleaned_kn[nk] = max(cleaned_kn.get(nk, 0), lv)
    u2["knowledge"] = cleaned_kn

    # 4) career_interests
    interests = u2.get("career_interests")
    if isinstance(interests, list):
        mapped = []
        for item in interests:
            if isinstance(item, str):
                s = item.strip()
                if s in career_title_to_id.values():
                    mapped.append(s)
                else:
                    cid = career_title_to_id.get(s.lower())
                    mapped.append(cid if cid else s)  
                    if cid: interest_mapped += 1
        u2["career_interests"] = mapped

    for nid in list(cleaned_kn.keys()):
        if nid not in node_exists:
            missing_nodes_total += 1

    normalized_users.append(u2)

# Export
OUT_USERS.write_text(json.dumps({"users": normalized_users}, ensure_ascii=False, indent=2), encoding="utf-8")

# Report
lines = []
lines.append("# Users normalization and validation report\n")
lines.append(f"- Total users: {len(normalized_users)}\n")
lines.append(f"- Grades out of range and clamped: {out_of_range_grades}\n")
lines.append(f"- Numeric fixes (unparseable or negative values corrected): {fixed_values}\n")
lines.append(f"- career_interests titles mapped to IDs: {interest_mapped}\n")
lines.append(f"- Knowledge node references missing after normalization: {missing_nodes_total}\n")
OUT_REPORT.write_text("".join(lines), encoding="utf-8")

print("✅ Output complete:", OUT_USERS.name, "and", OUT_REPORT.name)


✅ Output complete: users_normalized.json and users_validation_report.md


Final check: all artifacts exist and every node reference resolves

In [6]:
# Expect: "Missing files: None", "Career missing refs: 0", "User missing refs: 0"
from pathlib import Path
import json

needed = ["skills_index.json","careers_normalized.json","users_normalized.json"]
missing = [p for p in needed if not Path(p).exists()]
print("Missing files:", missing or "None")

skills = set(json.load(open("skills_index.json","r",encoding="utf-8")).keys())
careers = json.load(open("careers_normalized.json","r",encoding="utf-8"))["careers"]
users = json.load(open("users_normalized.json","r",encoding="utf-8"))["users"]

miss_c = [rk["node"] for c in careers for rk in c["required_knowledge"] if rk["node"] not in skills]
miss_u = [nid for u in users for nid in u["knowledge"].keys() if nid not in skills]
print("Career missing refs:", len(miss_c))
print("User missing refs:", len(miss_u))


Missing files: None
Career missing refs: 0
User missing refs: 0
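
If this check needs to run outside the notebook (for example before handing the artifacts to the engine), the same comparison can be wrapped in a small function that returns the offending IDs instead of only counting them. This is a sketch on toy in-memory data; the function name find_missing_refs and the sample records are assumptions, and loading the three JSON files is left out.

```python
def find_missing_refs(skill_ids, careers, users):
    """Return node IDs referenced by careers/users that are absent from the index."""
    career_missing = sorted({
        rk["node"]
        for c in careers
        for rk in c.get("required_knowledge", [])
        if rk["node"] not in skill_ids
    })
    user_missing = sorted({
        nid
        for u in users
        for nid in u.get("knowledge", {})
        if nid not in skill_ids
    })
    return {"careers": career_missing, "users": user_missing}

# Toy data: one career reference and one user reference do not resolve.
skill_ids = {"BIO.Y3.AC9S3U01", "CHEM.Y3.AC9S3U04"}
careers = [{"id": "c1", "required_knowledge": [
    {"node": "BIO.Y3.AC9S3U01", "min_level": 1, "weight": 1.0},
    {"node": "PHYS.Y9.UNKNOWN", "min_level": 1, "weight": 1.0},
]}]
users = [{"id": "u1", "knowledge": {"CHEM.Y3.AC9S3U04": 1, "EARTH.Y9.UNKNOWN": 0}}]

report = find_missing_refs(skill_ids, careers, users)
print(report)
# {'careers': ['PHYS.Y9.UNKNOWN'], 'users': ['EARTH.Y9.UNKNOWN']}
```

Returning the IDs rather than a count makes it easier to trace a failure back to a specific career or user record.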
