# Resume Parser Requirements
For this automated resume parser to correctly extract each experience and its bullet points, the resume should follow the formatting conventions below:

Section headings use all-uppercase English
For example: EDUCATION, SKILLS, EXPERIENCE, PROJECTS. The parser uses these all-uppercase lines to determine which section it is currently in.
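
As a rough illustration of the all-caps rule, the sketch below uses Python's `str.isupper()` with a length bound, which is the check the parser code in this notebook applies. `str.isupper()` ignores digits and punctuation, so a content line such as `GPA 3.8/4.0` also passes the plain check. The stricter `looks_like_header` variant and its `HEADER_PATTERN` are hypothetical helpers added here for comparison, not part of the parser.

```python
import re

def is_all_caps_line(line: str) -> bool:
    """Plain rule: every cased character is uppercase and the length is reasonable."""
    clean = line.strip()
    return clean.isupper() and 3 <= len(clean) <= 40

# Hypothetical stricter variant: only uppercase letters, spaces, "&" and "/".
HEADER_PATTERN = re.compile(r"[A-Z][A-Z &/]{2,39}")

def looks_like_header(line: str) -> bool:
    return HEADER_PATTERN.fullmatch(line.strip()) is not None

for sample in ["EDUCATION", "Education", "GPA 3.8/4.0", "HONORS & AWARDS"]:
    print(f"{sample!r}: plain={is_all_caps_line(sample)}, strict={looks_like_header(sample)}")
```

Whether the stricter pattern is worth adopting depends on the resumes being parsed; it would reject headers that contain digits or other punctuation.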

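Work/research experience (EXPERIENCE) uses a "two-line entry + multi-line bullets" structure

Line 1: organization/company name + location (e.g., CAYIN Technology ... Taipei, Taiwan)

Line 2: job title + start and end dates (e.g., AI Engineering Intern Jul 2024 - Aug 2024)

These are followed by multiple bullet points, each starting with • or -.
The parser combines the company line and the title line into a single entry and tags all subsequent bullets as belonging to that experience.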
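
The second line of an entry carries both the job title and the date range. The parser code in this notebook recognizes that line with a keyword list; as an alternative sketch, a date-range pattern could serve as the signal and also split the line into its parts. `ROLE_LINE` and `split_role_line` are assumptions made for illustration, and the pattern only covers three-letter English month abbreviations followed by a four-digit year.

```python
import re
from typing import Optional

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# title, then "Mon YYYY", a hyphen or en dash (\u2013), then "Mon YYYY" or "Present"
ROLE_LINE = re.compile(
    rf"^(?P<title>.+?)\s+"
    rf"(?P<start>{MONTH}\s+\d{{4}})\s*[-\u2013]\s*"
    rf"(?P<end>{MONTH}\s+\d{{4}}|Present)"
)

def split_role_line(line: str) -> Optional[dict]:
    """Return title/start/end if the line contains a date range, else None."""
    m = ROLE_LINE.match(line.strip())
    if m is None:
        return None
    return {"title": m.group("title"), "start": m.group("start"), "end": m.group("end")}

print(split_role_line("AI Engineering Intern Jul 2024 - Aug 2024"))
print(split_role_line("CAYIN Technology ... Taipei, Taiwan"))
```

A date-based check does not depend on the wording of the title, but it would miss role lines that omit dates or spell out month names in full.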

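Projects (PROJECTS) use a "one-line title + multi-line bullets" structure

The project title and a short description go on the same line (e.g., Financial Argument Mining with LLMs, NTU Spring 2024)

Bullets starting with • or - begin on the next line.
The parser treats this line as the project entry name and binds all subsequent bullets to that entry.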

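Bullet points should start with • or - (a hyphen must be followed by a space), and their text should run continuously without deliberate line breaks
PDF-to-text conversion may wrap lines automatically; the parser tries to merge the wrapped lines of a single bullet, but a new symbol or heading inserted in the middle may degrade the parsing result.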
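
To make the merging behaviour concrete, here is a minimal sketch of rejoining a bullet that was wrapped during PDF-to-text conversion. `merge_wrapped_bullets` is a simplified, hypothetical helper: it treats every non-bullet, non-header line that follows a bullet as a continuation, and it leaves out the entry detection a full parser would need. The sample text is invented.

```python
def merge_wrapped_bullets(lines):
    """Join continuation lines onto the bullet they belong to."""
    bullets = []
    open_bullet = False  # True while continuation lines may still be appended

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("•") or line.startswith("- "):
            # A new bullet starts: strip the marker and keep the text.
            bullets.append(line.lstrip("•- ").strip())
            open_bullet = True
        elif line.isupper():
            # An all-caps section header closes the current bullet.
            open_bullet = False
        elif open_bullet:
            # Anything else directly after a bullet is treated as its wrapped tail.
            bullets[-1] += " " + line
    return bullets

sample = [
    "EXPERIENCE",
    "• Built a data pipeline that ingests daily logs and",
    "writes summary tables for reporting",
    "- Wrote unit tests",
    "PROJECTS",
    "Not a bullet, so it is ignored here",
]
print(merge_wrapped_bullets(sample))
```

This also shows why an extra heading or symbol in the middle of a bullet is a problem: the header closes the bullet, so the remainder of the sentence is dropped rather than merged.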

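Sections such as EDUCATION and SKILLS are currently used only as whole blocks of text; they are not split into bullets for RAG
Their content is passed directly to the LLM during the "resume walkthrough" or for background questions, rather than serving as retrieval units for experience-based answers.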
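
A minimal sketch of how these whole-section blocks could be assembled into one background string for an LLM prompt. It assumes a `{section_name: [lines]}` dictionary like the one the section-segmentation code in this notebook returns; `build_background_context` and its output layout are hypothetical.

```python
def build_background_context(sections, names=("EDUCATION", "SKILLS")):
    """Concatenate the selected sections verbatim, each under its own header."""
    blocks = []
    for name in names:
        lines = sections.get(name)
        if lines:  # skip sections that are missing or empty
            blocks.append(name + "\n" + "\n".join(lines))
    return "\n\n".join(blocks)

sections = {
    "EDUCATION": ["Columbia University New York, NY"],
    "SKILLS": ["Programming: Python, C/C++, JavaScript, R, SQL, MATLAB, Git/Github"],
    "EXPERIENCE": ["• bullets in this section are indexed separately"],
}
print(build_background_context(sections))
```

Keeping these sections out of the retrieval index and passing them as fixed context means they are always available for background questions, regardless of what the retriever returns.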

In [1]:
import pdfplumber

def extract_pdf_text(pdf_path: str) -> str:
    text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text.append(page_text)
    return "\n".join(text)


In [None]:
import re
def extract_bullets_from_resume(text: str):
    lines = text.split("\n")
    bullets = []

    for line in lines:
        clean = line.strip()
        if not clean:
            continue

        # Explicit bullets: lines starting with "•" or "- "
        if clean.startswith("•"):
            bullets.append(clean.lstrip("• ").strip())
        elif clean.startswith("- "):
            bullets.append(clean.lstrip("- ").strip())

    return bullets

In [None]:
def segment_resume_sections(text: str):
    lines = text.split("\n")
    sections = {}
    current = None

    # Optional: known section names (matched case-insensitively)
    known = [
        "EDUCATION",
        "SKILLS",
        "EXPERIENCE",
        "PROJECTS",
        "RESEARCH",
        "PUBLICATIONS"
    ]

    for line in lines:
        clean = line.strip()
        if not clean:
            continue

        upper = clean.upper()

        # 1) All uppercase with a reasonable length -> treat as a section header
        if clean.isupper() and 3 <= len(clean) <= 40:
            current = upper
            sections[current] = []
            continue

        # 2) Or an exact match against the known list (case-insensitive)
        if upper in known:
            current = upper
            sections[current] = []
            continue

        # 3) Any other line is content of the current section
        if current:
            sections[current].append(clean)

    # Fallback if no section was detected: treat the whole resume as one section
    if not sections:
        sections["FULL_RESUME"] = [l.strip() for l in lines if l.strip()]

    return sections

In [None]:
def extract_bullets_with_sections(text: str):
    sections = segment_resume_sections(text)
    bullet_data = []

    for section, lines in sections.items():
        merged = "\n".join(lines)
        bullets = extract_bullets_from_resume(merged)

        for b in bullets:
            bullet_data.append({
                "section": section,
                "text": b
            })

    # If no bullet was captured at all, run once over the full text as FULL_RESUME
    if not bullet_data:
        bullets = extract_bullets_from_resume(text)
        bullet_data = [
            {"section": "FULL_RESUME", "text": b}
            for b in bullets
        ]

    return bullet_data


In [12]:
pdf_path = "../user_data/resume.pdf"
raw_text = extract_pdf_text(pdf_path)

print(raw_text[:2000])   # preview only the first 2,000 characters

Chih-Yuan (Edward) Chang
New York (Open to relocate) +1 (646) 995-8648 | cc5375@columbia.edu | linkedin.com/edwardchang716 | Portfolio
EDUCATION
Columbia University New York, NY
Master of Science in Data Science Aug 2025 - Dec 2026 (expected)
Courses: Machine Learning, Algorithm Design and Analysis, Exploratory Data Analysis
National Taiwan University (NTU) Top1 university in Taiwan Taipei, Taiwan
BS in Environmental Engineering, BS in Psychology, GPA 3.8/4.0
Courses: Python, C/C++, Data structures, Natural Language Processing, Artificial Neural Networks, Experiment Design
SKILLS
Programming: Python, C/C++, JavaScript, R, SQL, MATLAB, Git/Github
Machine Learning & AI: TensorFlow, PyTorch, Keras, scikit-learn, Hugging Face, LLMs, NLP, RAG, LangChain
Analytics & Statistics: Pandas, NumPy, Matplotlib/Seaborn, NLTools, statsmodels, Pymer4/lme4, FactorAnalyzer
EXPERIENCE
CAYIN Technology Worldwide top 11 digital signage software provider Taipei, Taiwan
AI Engineering Intern Jul 2024 - Aug 2

In [11]:
bullets = extract_bullets_from_resume(raw_text)

for b in bullets[:20]:
    print("-", b)

- New York (Open to relocate) +1 (646) 995-8648 | cc5375@columbia.edu | linkedin.com/edwardchang716 | Portfolio
- Master of Science in Data Science Aug 2025 - Dec 2026 (expected)
- Courses: Machine Learning, Algorithm Design and Analysis, Exploratory Data Analysis
- National Taiwan University (NTU) Top1 university in Taiwan Taipei, Taiwan
- BS in Environmental Engineering, BS in Psychology, GPA 3.8/4.0
- Courses: Python, C/C++, Data structures, Natural Language Processing, Artificial Neural Networks, Experiment Design
- Programming: Python, C/C++, JavaScript, R, SQL, MATLAB, Git/Github
- Machine Learning & AI: TensorFlow, PyTorch, Keras, scikit-learn, Hugging Face, LLMs, NLP, RAG, LangChain
- Analytics & Statistics: Pandas, NumPy, Matplotlib/Seaborn, NLTools, statsmodels, Pymer4/lme4, FactorAnalyzer
- CAYIN Technology Worldwide top 11 digital signage software provider Taipei, Taiwan
- AI Engineering Intern Jul 2024 - Aug 2024
- Initiated an LLM-powered generative AI project using OpenA

In [16]:
bullet_data = extract_bullets_with_sections(raw_text)

for b in bullet_data[:10]:
    print(b)

{'section': 'EXPERIENCE', 'text': 'Initiated an LLM-powered generative AI project using OpenAI API, enabling customers to generate advertising'}
{'section': 'EXPERIENCE', 'text': 'Developed a web app (Python, HTML/CSS, JavaScript) to showcase AI-generated ad copies; collaborated with R&D'}
{'section': 'EXPERIENCE', 'text': 'Identified limitations in the company’s initial Canva API plan and independently proposed an alternative solution,'}
{'section': 'EXPERIENCE', 'text': 'Applied NLP (NLTK), LLMs (OpenAI API, Transformers) and ML (scikit-learn) models to analyze verbal,'}
{'section': 'EXPERIENCE', 'text': 'Initiated an A/B test experiment (Pandas, NumPy, OpenAI API, Transformers) revealing significant effects of text'}
{'section': 'EXPERIENCE', 'text': 'Co-authored a study on mental representations of friendship and well-being, applying statistical modeling (Pandas,'}
{'section': 'EXPERIENCE', 'text': 'Led the NLP team and mentored undergraduates on LLMs, data analysis and Python codi

In [17]:
def parse_resume(pdf_path):
    raw = extract_pdf_text(pdf_path)
    bullets = extract_bullets_with_sections(raw)
    for i, b in enumerate(bullets):
        b["id"] = i
    return bullets

resume_bullets = parse_resume("../user_data/resume.pdf")

for b in resume_bullets[:10]:
    print(b)

{'section': 'EXPERIENCE', 'text': 'Initiated an LLM-powered generative AI project using OpenAI API, enabling customers to generate advertising', 'id': 0}
{'section': 'EXPERIENCE', 'text': 'Developed a web app (Python, HTML/CSS, JavaScript) to showcase AI-generated ad copies; collaborated with R&D', 'id': 1}
{'section': 'EXPERIENCE', 'text': 'Identified limitations in the company’s initial Canva API plan and independently proposed an alternative solution,', 'id': 2}
{'section': 'EXPERIENCE', 'text': 'Applied NLP (NLTK), LLMs (OpenAI API, Transformers) and ML (scikit-learn) models to analyze verbal,', 'id': 3}
{'section': 'EXPERIENCE', 'text': 'Initiated an A/B test experiment (Pandas, NumPy, OpenAI API, Transformers) revealing significant effects of text', 'id': 4}
{'section': 'EXPERIENCE', 'text': 'Co-authored a study on mental representations of friendship and well-being, applying statistical modeling (Pandas,', 'id': 5}
{'section': 'EXPERIENCE', 'text': 'Led the NLP team and mentored

In [27]:
def is_section_header(line: str) -> bool:
    clean = line.strip()
    return clean.isupper() and 3 <= len(clean) <= 40

def is_bullet(line: str) -> bool:
    clean = line.strip()
    return clean.startswith("•") or clean.startswith("- ")

def is_role_title(line: str) -> bool:
    # Role-title keywords that commonly appear in your resume
    keywords = [
        "Intern", "Assistant", "Research", "Engineer", "Scientist",
        "Developer", "Associate", "Fellow"
    ]
    return any(k in line for k in keywords)


def parse_resume_entries(text: str):
    # Drop empty lines
    lines = [l.rstrip() for l in text.split("\n") if l.strip()]

    results = []
    current_section = None
    current_entry = None

    i = 0
    while i < len(lines):
        line = lines[i].strip()

        # ----------------------------------------------------
        # 1) SECTION header detection
        # ----------------------------------------------------
        if is_section_header(line):
            current_section = line
            current_entry = None
            i += 1
            continue

        # ----------------------------------------------------
        # 2) EXPERIENCE entry: company + role (two-line entry)
        # ----------------------------------------------------
        if (
            current_section == "EXPERIENCE"
            and i + 1 < len(lines)
            and not is_bullet(line)
            and not is_section_header(line)
            and not is_bullet(lines[i+1])
            and is_role_title(lines[i+1])
        ):
            company = line
            role = lines[i+1].strip()
            current_entry = f"{company} — {role}"
            i += 2
            continue

        # ----------------------------------------------------
        # 2b) PROJECTS entry detection (one-line title)
        # Only triggers while current_entry is still None,
        # so that a bullet's continuation line is not mistaken for a new entry
        # ----------------------------------------------------
        if (
            current_section == "PROJECTS"
            and current_entry is None
            and not is_bullet(line)
            and not is_section_header(line)
            and i + 1 < len(lines)
            and is_bullet(lines[i+1])
        ):
            # e.g. "Financial Argument Mining with LLMs, NTU Spring 2024"
            current_entry = line
            i += 1
            continue

        # ----------------------------------------------------
        # 3) BULLET parsing with multi-line merge
        # ----------------------------------------------------
        if is_bullet(line):
            bullet = line.lstrip("•- ").strip()
            j = i + 1

            while j < len(lines):
                nxt = lines[j].strip()

                # The bullet ends when the next line is a bullet or a section header
                if is_bullet(nxt) or is_section_header(nxt):
                    break

                # Also stop if the next line looks like the start of a new EXPERIENCE entry
                if (
                    j + 1 < len(lines)
                    and not is_bullet(nxt)
                    and is_role_title(lines[j+1])
                ):
                    break

                # Otherwise treat it as a continuation of the same bullet
                bullet += " " + nxt
                j += 1

            results.append({
                "section": current_section,
                "entry": current_entry,
                "text": bullet
            })

            i = j
            continue

        # ----------------------------------------------------
        # 4) Any other ordinary line: skip
        # ----------------------------------------------------
        i += 1

    return results


In [28]:
raw_text = extract_pdf_text("../user_data/resume.pdf")
entries = parse_resume_entries(raw_text)

for e in entries:
    print(e)

{'section': 'EXPERIENCE', 'entry': 'CAYIN Technology Worldwide top 11 digital signage software provider Taipei, Taiwan — AI Engineering Intern Jul 2024 - Aug 2024', 'text': 'Initiated an LLM-powered generative AI project using OpenAI API, enabling customers to generate advertising images and text via prompts, cutting manual design effort by ~60%'}
{'section': 'EXPERIENCE', 'entry': 'CAYIN Technology Worldwide top 11 digital signage software provider Taipei, Taiwan — AI Engineering Intern Jul 2024 - Aug 2024', 'text': 'Developed a web app (Python, HTML/CSS, JavaScript) to showcase AI-generated ad copies; collaborated with R&D team to integrat into CAYIN’s software, boosting demo success and earning 80%+ positive client feedback'}
{'section': 'EXPERIENCE', 'entry': 'CAYIN Technology Worldwide top 11 digital signage software provider Taipei, Taiwan — AI Engineering Intern Jul 2024 - Aug 2024', 'text': 'Identified limitations in the company’s initial Canva API plan and independently propos