# Parsing HTML Tests

For better ingestion, I should reduce the HTML to its minimal form: just the relevant sections and the infobox.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup, Tag, NavigableString
import codecs
import re
import json
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk import sent_tokenize




In [9]:
df = pd.read_csv("../data/processed/wikipedia_files/authors_wikipedia.csv")

# keep only the row for Joseph R. Biden Jr
df = df[df["author_name"] == "Joseph R. Biden Jr"]
df = df[["source"]]
# extract the page source from the dataframe as text
source = df["source"].values[0]
# extra step: decode escape sequences so the HTML is readable
html = codecs.decode(source, "unicode_escape")
soup = BeautifulSoup(html, "html5lib")

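`codecs.decode(source, "unicode_escape")` is convenient for turning literal `\n`-style escapes back into characters, but when the input is a `str` that already contains non-ASCII characters, those characters are re-interpreted byte by byte and come back as mojibake. A minimal sketch of one common workaround follows, assuming `source` holds text with literal backslash escapes. The `sample` string is made up for illustration and is not taken from the CSV.

```python
import codecs

# Hypothetical sample: a literal backslash-n escape mixed with real non-ASCII text.
sample = "Early life (1942\u20131965)\\nn\u00e9e Finnegan"

# Decoding the str directly: the en dash and the accented letter turn into mojibake.
naive = codecs.decode(sample, "unicode_escape")

# Workaround: encode to Latin-1 first, replacing anything outside Latin-1 with
# \uXXXX escapes, and only then decode the escapes.
repaired = sample.encode("latin-1", "backslashreplace").decode("unicode_escape")

print(repr(naive))
print(repr(repaired))
```

The second form keeps characters such as the en dash in section titles intact while still resolving the literal escapes.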


## TEST1: Directly get the infobox and sections

### Sections

In [10]:
def extract_section_text(tag):
    """
    Given a heading tag (<h1>…<h6>), return all of its following
    siblings' text up until the next heading of <= its level.
    """
    level = int(tag.name[1])
    texts = []
    for sib in tag.next_siblings:
        # stop once we hit an H# of the same or higher level
        if isinstance(sib, Tag) and re.fullmatch(r"h[1-6]", sib.name):
            if int(sib.name[1]) <= level:
                break
        # accumulate text
        if isinstance(sib, Tag):
            texts.append(sib.get_text(" ", strip=True))
        elif isinstance(sib, NavigableString) and sib.strip():
            texts.append(str(sib).strip())
    return " ".join(texts).strip()


def build_heading_tree_with_content(html):
    soup     = BeautifulSoup(html, "html5lib")
    headings = soup.find_all(re.compile(r"^h[1-6]$"))

    tree, stack = [], []
    for h in headings:
        node = {
            "level":   int(h.name[1]),
            "title":   h.get_text(strip=True),
            "id":      h.get("id"),
            "content": extract_section_text(h),
            "children": []
        }

        # pop deeper or equal
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()

        if stack:
            stack[-1]["children"].append(node)
        else:
            tree.append(node)

        stack.append(node)

    return tree


# ——— Usage ———
# html = ...  your raw page source as a str
tree = build_heading_tree_with_content(html)

# pretty-print
print(json.dumps(tree, ensure_ascii=False, indent=2))


[
  {
    "level": 2,
    "title": "Early life (1942–1965)",
    "id": "Early_life_(1942–1965)",
    "content": "Main article: Early life and career of Joe Biden Joseph Robinette Biden Jr. was born on November 20, 1942, [ 1 ] at St. Mary's Hospital in Scranton, Pennsylvania , [ 2 ] to Catherine Eugenia \"Jean\" Biden ( née Finnegan ) and Joseph Robinette Biden Sr. [ 3 ] [ 4 ] He was the oldest child in a Catholic family of mostly Irish descent. Biden has a sister, Valerie , and two brothers, Francis and James . [ 5 ] Home life Joseph Sr. had been wealthy, and the family purchased a home in the affluent Long Island suburb of Garden City, New York , in 1946. [ 6 ] After he suffered business setbacks around the time Biden was seven years old, [ 7 ] [ 8 ] [ 9 ] the family lived with Jean's parents in Scranton for several years. [ 10 ] Scranton fell into economic decline during the 1950s, and Joseph Sr. could not find steady work. [ 11 ] Beginning in 1953, when Biden was ten, [ 12 ] 

### Infobox

In [11]:
infobox = soup.find("table", class_="infobox")
if not infobox:
    raise ValueError("No <table class='infobox'> found")
info = {}
for row in infobox.find_all("tr"):
    header = row.find("th")
    cell   = row.find("td")
    if header and cell:
        key = header.get_text(" ", strip=True)
        val = cell.get_text(" ", strip=True)
        info[key] = val
print(info['Education'])

University of Delaware ( BA ) Syracuse University ( JD )
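
The infobox value comes back as one flat string. If institution and degree pairs are needed, one option is to split on the parenthesised groups. This is only a sketch: `split_infobox_education` is a hypothetical helper, and it assumes institution names themselves contain no parentheses. The two sample strings are infobox values as printed in this notebook.

```python
import re

# One institution name followed by a parenthesised, comma-separated degree list.
PAIR_RE = re.compile(r"([^()]+)\(([^()]*)\)")

def split_infobox_education(value):
    """Split 'School ( BA , JD ) Other School ( LLM )' into (school, [degrees]) pairs."""
    pairs = []
    for school, degrees in PAIR_RE.findall(value):
        degree_list = [d.strip() for d in degrees.split(",") if d.strip()]
        pairs.append((school.strip(), degree_list))
    return pairs

biden = split_infobox_education(
    "University of Delaware ( BA ) Syracuse University ( JD )"
)
marks = split_infobox_education(
    "University of Pennsylvania ( BA , BS , JD ) University of Cambridge ( LLM )"
)
print(biden)
print(marks)
```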


## TEST2: Look for section with educational info
I have collected some keywords that might be useful, but fallback strategies still need to be developed.

### Using just the keywords

In [12]:
def load_keywords(path):
    with open(path, 'r', encoding='utf-8') as f:
        # strip out empty lines
        return [line.strip() for line in f if line.strip()]

def search_keywords_in_tree(tree, keywords):
    """
    Recursively search for any of `keywords` in each branch's 'title' field,
    and return a flat list of matching branches (with their id, content, level
    and the keyword that matched).
    """
    matching_branches = []

    for branch in tree:
        # if *any* keyword matches in this branch, record it
        for kw in keywords:
            if kw in branch['title']:
                matching_branches.append({
                    "id": branch["id"],
                    "content": branch["content"],
                    "level": branch["level"],
                    "keyword": kw
                })
                # once matched, no need to test other keywords on this branch
                break

        # recurse into children and *capture* their results
        if branch.get("children"):
            child_matches = search_keywords_in_tree(branch["children"], keywords)
            matching_branches.extend(child_matches)

    return matching_branches




# 1) Load your keywords into a simple list
keywords = load_keywords('../data/keywords/keywords.txt')

# 2) Call the search once, passing that list
results = search_keywords_in_tree(tree, keywords)

print(f"Found {len(results)} matching branches:")
for hit in results:
    print(f"  [{hit['level']}] {hit['id']}: “…{hit['content'][:50]}…” (matched “{hit['keyword']}”)")


Found 4 matching branches:
  [2] Early_life_(1942–1965): “…Main article: Early life and career of Joe Biden J…” (matched “Early life”)
  [3] Sports_and_young_adulthood: “…At Archmere Academy in Claymont, [ 16 ] Biden play…” (matched “young”)
  [2] Marriages,_law_school,_and_early_career_(1966–1973): “…Main article: Early career of Joe Biden  See also:…” (matched “career”)
  [2] References: “…Citations 1 2 .mw-parser-output cite.citation{font…” (matched “References”)
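
One possible fallback strategy, sketched here under assumptions, is to match keywords against section titles first and to search section content only when no title matches. The names `iter_nodes` and `find_sections`, the toy tree, and the keywords are hypothetical and not part of the notebook. The nodes only mimic the shape produced by `build_heading_tree_with_content`.

```python
import re

def iter_nodes(tree):
    """Yield every node of a heading tree, depth-first."""
    for node in tree:
        yield node
        yield from iter_nodes(node.get("children", []))

def find_sections(tree, keywords):
    """Return (node, keyword, field) hits: title matches first, content as fallback."""
    # Whole-word, case-insensitive patterns, one per keyword.
    patterns = [(kw, re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)) for kw in keywords]
    for field in ("title", "content"):
        hits = []
        for node in iter_nodes(tree):
            for kw, pat in patterns:
                if pat.search(node.get(field) or ""):
                    hits.append((node, kw, field))
                    break  # one hit per node is enough
        if hits:
            return hits  # only fall through to "content" when no title matched
    return []

toy_tree = [
    {"title": "Early life", "content": "Born in Scranton.", "children": [
        {"title": "Home life", "content": "He studied at a local school.", "children": []},
    ]},
    {"title": "Career", "content": "Elected in 1972.", "children": []},
]

title_hits = [(n["title"], field) for n, _, field in find_sections(toy_tree, ["early life"])]
fallback_hits = [(n["title"], field) for n, _, field in find_sections(toy_tree, ["school"])]
print(title_hits)
print(fallback_hits)
```

Returning the matched field alongside each node makes it easy to treat content-only hits as lower-confidence results.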


In [13]:
import re
from nltk.tokenize import sent_tokenize

# 1) Catch one or more “[ 33 ]”, “[33]”, “[33][34]”, etc.
CITATION_RE = re.compile(r'(?:\[\s*\d+\s*\]\s*)+')

# 2) Your combined degree pattern (unchanged)
degree_pattern = re.compile(r"""
    (?:# ABBREVIATIONS (with context checks)
    (?<!\.)\bB\.?(?:A|S|Sc|F\.A|B\.A|Phil)(?=\s|\.|$)
    | (?<!\.)\bM\.?(?:A|S|Sc|Res|Phil|F\.A|B\.A|Ed)(?=\s|\.|$)
    | (?<!\.)\b(?:Ph|J|M|Ed|Sc|D|LL)\.?D(?=\s|\.|$)
    )
    |
    (?:# FULL NAMES
        # literal spaces are ignored under re.VERBOSE, so match them with \s+
        \bBachelor\s+of\s+(?:Arts|Science|Fine\s+Arts|Philosophy|Laws)\b
        | \bMaster\s+of\s+(?:Arts|Science|Research|Philosophy|Education|Laws)\b
        | \bDoctor\s+of\s+(?:Philosophy|Science|Medicine|Engineering)\b
        | \bJuris\s+Doctor\b | \bDr\.?\s*rer\.?\s*nat\b
    )
""", re.IGNORECASE | re.VERBOSE)


for hit in results:
    raw = hit["content"]


    # 1) strip out all bracketed citations (and any trailing space)
    clean = CITATION_RE.sub('', raw)

    # 2) tidy up leftover spaces before commas/periods
    clean = re.sub(r'\s+([,\.])', r'\1', clean)
    # 3) collapse multiple spaces
    clean = re.sub(r'\s{2,}', ' ', clean).strip()

    # 4) sentence‐split
    sentences = sent_tokenize(clean)

    # 5) pick out only those with a degree
    degree_sentences = [s for s in sentences if degree_pattern.search(s)]
    for s in degree_sentences:
        print("→", s)


error: missing ), unterminated subpattern at position 5 (line 2, column 5)
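
Degree patterns written with `re.VERBOSE` are easy to get subtly wrong. The sketch below is not a diagnosis of the error above. It only illustrates three behaviours of Python's `re` module that matter for this kind of pattern: comments swallowing a closing parenthesis, literal spaces being ignored, and unanchored abbreviations matching inside words. All pattern strings and the `compiles` helper are made-up examples.

```python
import re

def compiles(pattern, flags=re.VERBOSE):
    """Return True if the pattern compiles, False if re raises an error."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False

# 1) Under VERBOSE, an unescaped '#' starts a comment that runs to the end of
#    the line, so a ')' written after it on the same line is never parsed.
swallowed_paren = compiles(r"(?: JD | PhD  # abbreviations )")
closed_on_next_line = compiles("(?: JD | PhD  # abbreviations\n)")

# 2) Under VERBOSE, literal spaces are ignored; use \s+ to match them.
spaces_ignored = re.compile(r"\bJuris Doctor\b", re.VERBOSE)
spaces_explicit = re.compile(r"\bJuris\s+Doctor\b", re.VERBOSE)

# 3) With IGNORECASE, short abbreviations need word boundaries, otherwise
#    they match inside ordinary words.
unanchored = re.compile(r"MA|JD", re.IGNORECASE)
anchored = re.compile(r"\b(?:MA|JD)\b", re.IGNORECASE)

print(swallowed_paren, closed_on_next_line)
print(bool(spaces_ignored.search("a Juris Doctor degree")),
      bool(spaces_explicit.search("a Juris Doctor degree")))
print(bool(unanchored.search("assigned female at birth")),
      bool(anchored.search("assigned female at birth")),
      bool(anchored.search("received an MA in 1977")))
```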

## Putting it all together

In [41]:
import re
import sys
import codecs
import pandas as pd
from bs4 import BeautifulSoup, Tag, NavigableString
from nltk.tokenize import sent_tokenize

# ----------------------------------------------------------------
# 1) Regex for precise degree parsing (sections)
DEGREE_PATTERN = re.compile(r"""
    (?:
        \bB\.?A\.?\b
      | \bA\.?B\.?\b               # "AB" or "A.B."
      | \bB(?:Sc|\.?S\.?)\b        # "BSc" or "BS" or "B.S."
      | \bM\.?A\.?\b               # "MA" or "M.A."
      | \bM(?:Sc|\.?S\.?)\b        # "MSc" or "MS" or "M.S."
      | \bPh\.?D\.?s?\b            # "PhD" variants
      | \bJ\.?D\.?s?\b             # "JD" variants
      | \bL\.?L\.?M\.?s?\b        # "LLM" variants
    )
    |   # full names
    (?:
        \bBachelor(?:'s)?\b
      | \bMaster(?:'s)?\b
      | \bDoctorate\b
    )
""", re.IGNORECASE | re.VERBOSE)

# ----------------------------------------------------------------
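# 2) Loose regex fallback: any sentence containing degree tokens
# NOTE: the abbreviation alternatives (BA, MA, MS, ...) carry no \b anchors, so
# with IGNORECASE they also match inside ordinary words, e.g. "ma" in "female".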
LOOSE_DEGREE_RE = re.compile(r"""
    (?:\b(?:degree|degrees|graduating|Bachelor(?:'s)?|Master(?:'s)?|Doctorate)\b
      | BA|A\.?B\.?|BS|B\.?S\.?|MA|M\.?A\.?|MS|M\.?S\.?|Ph\.?D\.?|J\.?D\.?|L\.?L\.?M\.?  
    )
""", re.IGNORECASE | re.VERBOSE)

# ----------------------------------------------------------------
# Section blacklist
BLACKLIST_SECTIONS = {
    "See also", "References", "External links", "Further reading", "Notes"
}

# ----------------------------------------------------------------
# Helper: detect HTML

def is_html(text: str) -> bool:
    return bool(re.search(r'<[a-zA-Z][^>]*>', text))

# ----------------------------------------------------------------
# Extract all sections (lead + headings + infobox education)

def extract_all_sections(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html5lib")
    # strip site-wide junk
    junk_selectors = [
        "style", "script", "table.navbox", "sup.reference",
        "span.mw-cite-backlink", "ol.references", "div.reflist",
        "div.hatnote", "div#toc",
    ]
    for sel in junk_selectors:
        for el in soup.select(sel):
            el.decompose()

    body = soup.select_one("div.mw-parser-output") or soup
    sections = []
    # Lead paragraph(s)
    lead_chunks = []
    first_h = body.find(re.compile(r"^h[1-6]$"))
    for sib in body.children:
        if sib is first_h:
            break
        if isinstance(sib, Tag) and sib.name == 'p':
            lead_chunks.append(sib.get_text(" ", strip=True))
    lead_text = " ".join(lead_chunks).strip()
    if lead_text:
        sections.append({"title": "_lead_", "content": lead_text})

    # Heading-based sections
    def extract_section_text(tag: Tag) -> str:
        level = int(tag.name[1])
        texts = []
        for sib in tag.next_siblings:
            if isinstance(sib, Tag) and re.fullmatch(r"h[1-6]", sib.name):
                if int(sib.name[1]) <= level:
                    break
            if isinstance(sib, Tag) and sib.name == 'p':
                texts.append(sib.get_text(" ", strip=True))
            elif isinstance(sib, NavigableString) and sib.strip():
                texts.append(str(sib).strip())
        return " ".join(texts).strip()

    for h in body.find_all(re.compile(r"^h[1-6]$")):
        title = h.get_text(strip=True)
        if title in BLACKLIST_SECTIONS:
            continue
        content = extract_section_text(h)
        if content:
            sections.append({"title": title, "content": content})

    # Infobox education as a pseudo-section
    infobox = soup.find("table", class_="infobox")
    if infobox:
        edu_texts = []
        for row in infobox.find_all("tr"):
            hdr = row.find("th")
            cell = row.find("td")
            if not (hdr and cell):
                continue
            # compute the header label once and test both field names against it
            label = hdr.get_text(" ", strip=True).lower()
            if "education" in label or "alma mater" in label:
                edu_texts.append(cell.get_text(" ", strip=True))
        if edu_texts:
            sections.append({"title": "_infobox_education_", "content": "; ".join(edu_texts)})

    return sections

# ----------------------------------------------------------------
# Primary parse: extract degree sentences per section

def parse_degrees_from_sections(sections: list[dict]) -> dict:
    extracted = {}
    for sec in sections:
        title = sec["title"]
        for sent in sent_tokenize(sec["content"]):
            if DEGREE_PATTERN.search(sent):
                extracted.setdefault(title, []).append(sent.strip())
    return extracted

# ----------------------------------------------------------------
# Fallback: scan cleaned sections for any degree tokens

def extract_every_degree_sentence(html: str) -> list[str]:
    sections = extract_all_sections(html)
    hits = []
    for sec in sections:
        for sent in sent_tokenize(sec["content"]):
            if LOOSE_DEGREE_RE.search(sent):
                hits.append(sent.strip())
    # dedupe
    return list(dict.fromkeys(hits))

# ----------------------------------------------------------------
# Convert extracted mapping to Markdown

def degrees_to_markdown(degrees_map: dict) -> str:
    md_lines = []
    for section, sents in degrees_map.items():
        header = "# Lead" if section == "_lead_" else f"## {section}"
        md_lines.append(header)
        for s in sents:
            md_lines.append(f"- {s}")
        md_lines.append("")
    return "\n".join(md_lines)

# ----------------------------------------------------------------
# Main extraction: returns markdown text

def extract_degrees_markdown(html: str) -> str:
    if not is_html(html):
        raise ValueError("Input is not valid HTML")
    sections = extract_all_sections(html)
    # tight section parse
    sec_map = parse_degrees_from_sections(sections)
    if sec_map:
        return degrees_to_markdown(sec_map)
    # fallback: catch-all sentences from cleaned sections
    fallback = extract_every_degree_sentence(html)
    md = "## Degree Mentions"
    for s in fallback:
        md += f"\n- {s}"
    return md

# ----------------------------------------------------------------
# Example usage
if __name__ == "__main__":
    df = pd.read_csv("../data/processed/wikipedia_files/authors_wikipedia.csv")
    for author in ["Robert Hughes", "Troy Gil", "Bruce Marks", "Paul Krugman", "Schuyler Bailar", "Joseph R. Biden Jr", "Elon Musk", "Keith B Richburg", "Jennifer Rubin", "Bob Greene"]:
        row = df[df["author_name"] == author].iloc[0]
        html = codecs.decode(row["source"], "unicode_escape")
        try:
            md = extract_degrees_markdown(html)
            print(f"=== {author} ===")
            print(md)
        except ValueError as e:
            print(author, "error:", e)


Robert Hughes error: Input is not valid HTML
=== Troy Gil ===
## Biography
- He attended Jamaica High School , and received his bachelor's degree, master's degree, and doctorate from Harvard University .

=== Bruce Marks ===
## Education
- Marks attended University of Pennsylvania where he graduated cum laude with bachelor's degrees in economics and Russian.
- Marks also received a JD from the Law School of the University of Pennsylvania cum laude , and an L.L.M.

## _infobox_education_
- University of Pennsylvania ( BA , BS , JD ) University of Cambridge ( LLM )





=== Paul Krugman ===
## Early life and education
- In 1974, Krugman earned his BA summa cum laude in economics from Yale University , where he was a National Merit Scholar .
- He then went on to pursue a PhD in economics from Massachusetts Institute of Technology (MIT).
- In 1977, he successfully completed his PhD in three years, with a thesis titled Essays on flexible exchange rates.
- Krugman later praised his PhD thesis advisor, Rudi Dornbusch , as "one of the great economics teachers of all time" and said that he "had the knack of inspiring students to pick up his enthusiasm and technique, but find their own paths".

## Personal life
- He is currently married to Robin Wells , an academic economist who received her BA from the University of Chicago and her PhD from the University of California, Berkeley .

## _infobox_education_
- Yale University ( BA ) Massachusetts Institute of Technology ( MA , PhD )

=== Schuyler Bailar ===
## Degree Mentions
- Bailar was assigned female at birt