# 02 - Normalize and Label

## Goal

Normalize MEDLINE XML into flat rows with `pmid`, `title`, `abstract`, `journal`, `year`, `pub_types[]`, `mesh_terms[]`. Then map Publication Types (PT) and keywords to `labels[]`. Labeling stays multi-label: one paper can carry several labels.


## Why This Step Matters

Raw XML cannot be fed to an ML model directly. We need:

- **Flat structure:** One row per paper
- **Clean text:** Title + abstract concatenated
- **Structured metadata:** Publication Types and MeSH terms as lists
- **Canonical labels:** Map messy PT to clean categories

This is where **data quality** is won or lost.


In [None]:
# TODO: Import libraries
# Hint: import pandas as pd, yaml
# from pathlib import Path
# from lxml import etree
# from tqdm import tqdm


In [None]:
# TODO: Collect raw XML paths
# Hint: xml_files = sorted(Path('../data/raw').glob('*.xml'))
#       print(f"Found {len(xml_files)} XML files")


## Parse One Article

XPath expressions to extract key fields:

- `PMID`: `.//MedlineCitation/PMID/text()`
- `Title`: `.//ArticleTitle//text()`
- `Abstract`: `.//AbstractText//text()`
- `Journal`: `.//Journal/Title/text()`
- `Year`: `.//PubDate/Year/text()`
- `Publication Types`: `.//PublicationType/text()`
- `MeSH Terms`: `.//MeshHeading/DescriptorName/text()`
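
To see what these paths return, here is a minimal sketch that runs on a hand-written record. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies; `ElementTree` does not support `text()` in paths, so `itertext()` and `findtext()` stand in for it. The sample record, the helper `text_of`, and the function name `parse_article_stdlib` are illustrative assumptions, not part of this project. The sketch also falls back to `<MedlineDate>`, which MEDLINE uses in place of `<Year>` for some records.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical record, shaped like a MEDLINE <PubmedArticle> (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal>
          <JournalIssue>
            <PubDate><MedlineDate>2019 Jan-Feb</MedlineDate></PubDate>
          </JournalIssue>
          <Title>Example Journal of Dentistry</Title>
        </Journal>
        <ArticleTitle>Effect of <i>in vitro</i> aging on bond strength.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Background text.</AbstractText>
          <AbstractText Label="METHODS">Methods text.</AbstractText>
        </Abstract>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Dental Bonding</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def text_of(elem):
    """Join all text inside an element, including nested inline tags."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def parse_article_stdlib(article):
    """Flatten one <PubmedArticle> element into a plain dict."""
    year_text = article.findtext(".//PubDate/Year")
    if not year_text:
        # Fallback: take the first 4-digit run from <MedlineDate>, if any.
        medline_date = article.findtext(".//PubDate/MedlineDate") or ""
        match = re.search(r"\d{4}", medline_date)
        year_text = match.group(0) if match else None
    return {
        "pmid": article.findtext("MedlineCitation/PMID"),
        "title": text_of(article.find(".//ArticleTitle")),
        # Structured abstracts have several sections; join them with spaces.
        "abstract": " ".join(text_of(e) for e in article.iter("AbstractText")),
        "journal": article.findtext(".//Journal/Title") or "",
        "year": int(year_text) if year_text else None,
        "pub_types": [e.text for e in article.iter("PublicationType")],
        "mesh_terms": [e.text for e in article.findall(".//MeshHeading/DescriptorName")],
    }


root = ET.fromstring(SAMPLE_XML.strip())
row = parse_article_stdlib(next(root.iter("PubmedArticle")))
print(row)
```

With `lxml`, the XPath expressions listed above can be used directly; the fallback for `<MedlineDate>` would need the same extra lookup.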


In [None]:
# TODO: Parse one article element
# Hint: def parse_article(article_elem):
#     pmid = article_elem.xpath('.//MedlineCitation/PMID/text()')
#     title = ''.join(article_elem.xpath('.//ArticleTitle//text()')).strip()
#     # Structured abstracts have several <AbstractText> sections; join them.
#     abstract = ' '.join(article_elem.xpath('.//AbstractText//text()')).strip()
#     journal = ''.join(article_elem.xpath('.//Journal/Title/text()')).strip()
#     # Records that use <MedlineDate> instead of <Year> get year=None here.
#     year_list = article_elem.xpath('.//PubDate/Year/text()')
#     year = int(year_list[0]) if year_list else None
#     # str() turns lxml "smart strings" into plain strings, so the rows
#     # do not keep references back to the parsed XML tree.
#     pub_types = [str(pt) for pt in article_elem.xpath('.//PublicationType/text()')]
#     mesh_terms = [str(m) for m in article_elem.xpath('.//MeshHeading/DescriptorName/text()')]
#     return {
#         'pmid': str(pmid[0]) if pmid else None,
#         'title': title,
#         'abstract': abstract,
#         'journal': journal,
#         'year': year,
#         'pub_types': pub_types,
#         'mesh_terms': mesh_terms
#     }


In [None]:
# TODO: Build rows list → DataFrame
# Hint: rows = []
# for xml_file in tqdm(xml_files):
#     tree = etree.parse(str(xml_file))
#     for article in tree.xpath('//PubmedArticle'):
#         row = parse_article(article)
#         if row['abstract']:  # Filter out papers without abstracts
#             rows.append(row)
# df = pd.DataFrame(rows)
# print(f"Parsed {len(df)} articles with abstracts")


In [None]:
# TODO: Save interim parquet
# Hint: Path('../data/interim').mkdir(parents=True, exist_ok=True)
#       df.to_parquet('../data/interim/normalized.parquet', index=False)
#       print("Saved normalized.parquet")


## Label Design

We define **10 canonical labels** for study design:

1. **SystematicReview:** Systematic reviews
2. **MetaAnalysis:** Meta-analyses (quantitative synthesis)
3. **RCT:** Randomized Controlled Trials
4. **ClinicalTrial:** Non-randomized clinical trials
5. **Cohort:** Cohort studies (prospective/retrospective)
6. **CaseControl:** Case-control studies
7. **CaseReport:** Case reports / case series
8. **InVitro:** In vitro or ex vivo laboratory studies
9. **Animal:** Animal studies
10. **Human:** Human subjects (not mutually exclusive)

### Why Multi-label?

- A paper can be **both RCT and Human**
- Systematic reviews may also be **MetaAnalysis**
- Some studies combine **Animal and InVitro** work
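
One common way to represent multi-label targets is a fixed-order 0/1 vector with one slot per label. The sketch below assumes the ten label names from the Label Design list; the helper `to_multi_hot` is a hypothetical name used only for illustration.

```python
# The ten canonical labels, in a fixed order that defines the vector layout.
LABELS = [
    "SystematicReview", "MetaAnalysis", "RCT", "ClinicalTrial", "Cohort",
    "CaseControl", "CaseReport", "InVitro", "Animal", "Human",
]


def to_multi_hot(labels, vocabulary=LABELS):
    """Turn a list of label names into a 0/1 vector aligned with `vocabulary`."""
    present = set(labels)
    unknown = present - set(vocabulary)
    if unknown:
        # Fail loudly on typos rather than silently dropping a label.
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if name in present else 0 for name in vocabulary]


vector = to_multi_hot(["RCT", "Human"])
print(vector)  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
```

A paper with no labels maps to an all-zero vector, which is one reason unlabeled rows are filtered out later in this notebook.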

### Mapping Strategy

1. **Primary:** Match Publication Types (PT) from MEDLINE
2. **Backfill:** Use keywords in the title/abstract for InVitro/Animal/Human when no PT matches
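
The two-step strategy above can be sketched in plain Python. Everything in `LABEL_MAP` is an assumption about how `pt_to_labels.yaml` might be shaped (a `pt` list and an optional `keywords` list per label); the keyword lists are made up for illustration. The sketch reads "when PT missing" as "when no PT matched this label", and it matches keywords on word boundaries so that a short keyword such as `rat` does not fire inside `demonstrate`.

```python
import re

# Hypothetical config, mirroring the assumed shape of pt_to_labels.yaml.
LABEL_MAP = {
    "RCT": {"pt": ["Randomized Controlled Trial"]},
    "MetaAnalysis": {"pt": ["Meta-Analysis"]},
    "CaseReport": {"pt": ["Case Reports"]},
    "Animal": {"pt": [], "keywords": ["rat", "rats", "mice", "animal model"]},
    "InVitro": {"pt": [], "keywords": ["in vitro", "ex vivo"]},
}


def keyword_hit(keyword, text):
    """True if `keyword` occurs in `text` as a whole word or phrase."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return re.search(pattern, text) is not None


def assign_labels(title, abstract, pub_types, label_map):
    """Primary: match Publication Types. Backfill: keywords if no PT matched."""
    text = f"{title} {abstract}".lower()
    pt_set = set(pub_types)
    labels = []
    for name, config in label_map.items():
        if pt_set.intersection(config.get("pt") or []):
            labels.append(name)
        elif any(keyword_hit(k, text) for k in config.get("keywords") or []):
            labels.append(name)
    return labels


rct_labels = assign_labels(
    "A randomized trial",
    "We demonstrate improved outcomes.",
    ["Randomized Controlled Trial", "Journal Article"],
    LABEL_MAP,
)
lab_labels = assign_labels(
    "Bone healing in rats",
    "An in vitro and animal study.",
    ["Journal Article"],
    LABEL_MAP,
)
print(rct_labels)  # ['RCT']
print(lab_labels)  # ['Animal', 'InVitro']
```

Labels come out in the order the config lists them, which keeps the output stable from run to run.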


In [None]:
# TODO: Load pt_to_labels.yaml
# Hint: with open('../configs/pt_to_labels.yaml') as f:
#     label_map = yaml.safe_load(f)
# print(label_map)


In [None]:
# TODO: Map pub_types → labels
# Hint: def assign_labels(row, label_map):
#     labels = set()
#     text = (row['title'] + ' ' + row['abstract']).lower()
#     pub_types = set(row['pub_types'])
#
#     for label_name, config in label_map.items():
#         # Primary: any configured Publication Type present on the paper
#         if pub_types.intersection(config.get('pt') or []):
#             labels.add(label_name)
#             continue
#
#         # Backfill: keywords, only when no PT matched this label.
#         # Note: this is a substring test, so short keywords can also
#         # match inside longer words.
#         keywords = config.get('keywords') or []
#         if any(keyword.lower() in text for keyword in keywords):
#             labels.add(label_name)
#
#     # Sort so the label order is stable across runs
#     return sorted(labels)
#
# df['labels'] = df.apply(lambda row: assign_labels(row, label_map), axis=1)


In [None]:
# TODO: Filter unlabeled rows
# Hint: df['label_count'] = df['labels'].apply(len)
#       print(f"Papers with labels: {(df['label_count'] > 0).sum()} / {len(df)}")
#       df_labeled = df[df['label_count'] > 0].copy()
#       print(f"Keeping {len(df_labeled)} labeled papers ({len(df_labeled)/len(df)*100:.1f}%)")


In [None]:
# TODO: Save labeled parquet
# Hint: Path('../data/processed').mkdir(parents=True, exist_ok=True)
#       df_labeled.to_parquet('../data/processed/dental_abstracts.parquet', index=False)
#       print("Saved dental_abstracts.parquet")


## Recommendations

Before moving forward:

- **Keep examples** of ambiguous rows (e.g., both RCT and ClinicalTrial)
- **Track label cardinality:** What's the average number of labels per paper?
- **Note class imbalance:** Some labels will be much rarer than others
- **Spot-check:** Manually review 10-20 random papers to validate labeling logic
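
These checks can be sketched with the standard library alone. The `rows` below are hypothetical stand-ins for the labeled records, and the fixed seed is only there so that the spot-check sample is reproducible; the same ideas carry over to `df_labeled` with pandas.

```python
import random
from collections import Counter

# Hypothetical labeled rows (stand-ins for records in df_labeled).
rows = [
    {"pmid": "1", "labels": ["RCT", "Human"]},
    {"pmid": "2", "labels": ["RCT", "ClinicalTrial", "Human"]},
    {"pmid": "3", "labels": ["Animal", "InVitro"]},
    {"pmid": "4", "labels": ["CaseReport"]},
]

# Class imbalance: how often each label occurs across all papers.
label_counts = Counter(label for row in rows for label in row["labels"])

# Label cardinality: average number of labels per paper.
mean_cardinality = sum(len(row["labels"]) for row in rows) / len(rows)

# Ambiguous rows: papers tagged both RCT and ClinicalTrial.
ambiguous = [
    row["pmid"] for row in rows
    if {"RCT", "ClinicalTrial"} <= set(row["labels"])
]

# Spot-check: a small reproducible random sample for manual review.
rng = random.Random(42)
spot_check = rng.sample(rows, k=min(2, len(rows)))

print(label_counts.most_common())
print(f"Average labels per paper: {mean_cardinality:.2f}")
print(f"Ambiguous PMIDs: {ambiguous}")
print(f"Review these PMIDs: {[row['pmid'] for row in spot_check]}")
```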

### Quick Stats


In [None]:
# TODO: Label cardinality stats
# Hint: print(f"Average labels per paper: {df_labeled['label_count'].mean():.2f}")
#       print(f"Max labels: {df_labeled['label_count'].max()}")
#       # Show distribution
#       print("\nLabel count distribution:")
#       print(df_labeled['label_count'].value_counts().sort_index())


## 🧘 Reflection Log

**What did you learn in this session?**
- 

**What challenges did you encounter?**
- 

**How will this improve Periospot AI?**
- 
