# Step 3: Data Cleaning, ID Mapping, Unified Graph Construction, and Article Indexing

## 3. Merge and Process All Data Tables

This step consolidates all previously collected internal and external entities and relationships, cleans them, generates ID mappings, and builds the unified Aging-KG graph.

The final outputs include:

* Cleaned node and relation parquet tables
* Entity/ID bi-directional mapping tables
* Unified graph edges & nodes
* Article index enriched with classifier predictions

## 3.1 Clean Raw Data Tables

This step standardizes CSV formats, normalizes types, removes duplicates, and writes out optimized Parquet files for fast downstream processing.

```python
from pathlib import Path
from haldxai.postprocess.build_clean_parquet import build_clean

ROOT = Path("/path/to/HALDxAI-Project")

build_clean(ROOT, force=True)
```

### Example Output

```
✔ Finished reading raw files
• collected_ext_nodes.csv        (4,816,220 rows)
• collected_ext_relations.csv    (161,243,192 rows)
• all_annotated_entities.csv     (7,365,569 rows)
• all_annotated_relationships.csv (593,847 rows)

📦 Output written:
  collected_ext_nodes_clean.parquet        (4,816,220 rows)
  collected_ext_rels_clean.parquet         (161,243,192 rows)
  annotated_entities_clean.parquet         (7,365,569 rows)
  annotated_relationships_clean.parquet    (593,847 rows)

🎉 Cleaning complete!
```
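
To make the cleaning operations described above concrete, here is a minimal standard-library sketch of the same idea: normalize cell values, drop incomplete rows, and remove exact duplicates. It is not the implementation of `build_clean`. The column names (`entity_id`, `name`, `entity_type`), the sample rows, and the `clean_rows` helper are all hypothetical, and the Parquet writing step is omitted.

```python
import csv
import io

# Hypothetical raw node table; the real HALDxAI column names may differ.
RAW_CSV = """entity_id,name,entity_type
E1, TP53 ,Gene
E1,TP53,Gene
E2,rapamycin,Drug
,orphan row,Gene
"""


def clean_rows(rows, required=("entity_id", "name")):
    """Strip whitespace, drop rows missing required fields, remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize every cell to a stripped string.
        norm = {key: (value or "").strip() for key, value in row.items()}
        # Skip rows where any required column is empty.
        if any(not norm[col] for col in required):
            continue
        # Use the full normalized row as the duplicate fingerprint.
        fingerprint = tuple(norm[key] for key in sorted(norm))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(norm)
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = clean_rows(rows)
print(len(rows), "raw rows ->", len(cleaned), "clean rows")
```

In this toy input the second `E1` row only becomes a duplicate after whitespace is stripped, which is why normalization runs before deduplication.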

## 3.2 Build ID Mapping Tables

The mapping step creates:

* **name2id.csv**: maps biological names/aliases → unified Entity-ID
* **id2name.csv**: maps Entity-ID → canonical label

```python
from pathlib import Path
from haldxai.postprocess.build_id_mapping import build_id_mapping

ROOT = Path("/path/to/HALDxAI-Project")

build_id_mapping(ROOT, force=False)
```

### Example Output

```
INFO: ✓ NAME→ID mappings: 7,203,837
INFO: ✓ ID→NAME mappings: 3,047,924

🎉 ID mapping files generated:
  • data/mappings/name2id.csv
  • data/mappings/id2name.csv
```

These mapping files ensure consistent entity identities across all datasets (PubMed, DeepSeek, SciSpacy, BioPortal, external databases).
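
The relationship between the two tables can be illustrated with a small sketch. This is not the logic of `build_id_mapping`. The record layout `(entity_id, name, is_canonical)`, the lowercase normalization of names, and the "first ID wins" conflict rule are assumptions made for illustration only.

```python
# Hypothetical (entity_id, name, is_canonical) records; not the real schema.
RECORDS = [
    ("E1", "TP53", True),
    ("E1", "p53", False),
    ("E1", "tumor protein p53", False),
    ("E2", "Rapamycin", True),
    ("E2", "sirolimus", False),
]


def build_mappings(records):
    """Return (name2id, id2name) dictionaries from alias records."""
    name2id = {}
    id2name = {}
    for entity_id, name, is_canonical in records:
        # Assumed normalization: case-insensitive, whitespace-trimmed lookup keys.
        key = name.strip().lower()
        # Assumed conflict rule: the first ID seen for a name is kept.
        name2id.setdefault(key, entity_id)
        if is_canonical:
            # Each ID gets exactly one canonical label.
            id2name[entity_id] = name
    return name2id, id2name


name2id, id2name = build_mappings(RECORDS)
print(len(name2id), "name->ID entries,", len(id2name), "ID->name entries")
```

Because several aliases can point to one entity while each entity has a single canonical label, the name-to-ID table is larger than the ID-to-name table, which is consistent with the counts in the log above.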

## 3.3 Build the Unified Knowledge Graph

This step:

* Merges curated PubMed extractions + external dataset relationships
* Applies the ID mappings
* Produces the unified Aging-KG (nodes + relations)

```python
from pathlib import Path
from haldxai.enrich.graph_build import build_unified_graph

ROOT = Path("/path/to/HALDxAI-Project")

build_unified_graph(ROOT, force=True)
```

The output is stored under:

```
data/finals/unified_graph_nodes.parquet
data/finals/unified_graph_edges.parquet
```
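
As a rough picture of the merge described in this section, the sketch below resolves relation endpoints through a name-to-ID table, merges two relation sources, and deduplicates the resulting edges. It is not how `build_unified_graph` works internally. The sample data, the triple format, and the policy of skipping relations with unmapped endpoints are all assumptions.

```python
# Hypothetical name-to-ID table and relation triples (source, relation, target).
NAME2ID = {"tp53": "E1", "rapamycin": "E2", "mtor": "E3"}

LITERATURE_RELATIONS = [
    ("Rapamycin", "INHIBITS", "mTOR"),
    ("TP53", "REGULATES", "unknown gene"),
]
EXTERNAL_RELATIONS = [
    ("rapamycin", "INHIBITS", "MTOR"),
    ("mTOR", "INTERACTS_WITH", "TP53"),
]


def resolve(name, name2id):
    """Look up a unified ID using the assumed lowercase, trimmed key."""
    return name2id.get(name.strip().lower())


def build_graph(relation_sets, name2id):
    """Merge several relation lists into sorted, deduplicated nodes and edges."""
    nodes = set()
    edges = set()
    for relations in relation_sets:
        for source, relation, target in relations:
            src_id = resolve(source, name2id)
            tgt_id = resolve(target, name2id)
            # Assumed policy: skip relations whose endpoints cannot be mapped.
            if src_id is None or tgt_id is None:
                continue
            edges.add((src_id, relation, tgt_id))
            nodes.update((src_id, tgt_id))
    return sorted(nodes), sorted(edges)


nodes, edges = build_graph([LITERATURE_RELATIONS, EXTERNAL_RELATIONS], NAME2ID)
print(nodes)
print(edges)
```

The two `INHIBITS` triples differ only in spelling, so after ID resolution they collapse into a single edge. This is the main reason for applying the mappings before merging.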

## 3.4 Build Article Index

This index links:

* PMID
* abstract
* extracted entities
* extracted relations
* predicted probability of aging relevance (`aging_prob`)

```python
from pathlib import Path
from haldxai.enrich.article_build import build_articles

ROOT = Path("/path/to/HALDxAI-Project")

build_articles(ROOT, force=True)   # Use force=True to overwrite
```

### Example Output

```
INFO: ▶ Reading article CSV …
INFO:    Total articles: 445,435
INFO:    Valid abstracts: 445,435
INFO: ▶ Loading classifier model: models/aging_classifier_tfidf_lr_v1/model.pkl
INFO: ▶ Predicting aging_prob …

🎉 ARTICLE index built
INFO:
  • cache/articles.parquet   (445,435 rows)
```

This file becomes the backbone for:

* Querying by PMID
* Computing article-level similarity
* Entity and relation provenance tracing
* HALDxAI WebApp article search and ranking
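
The sketch below shows one way an index with the fields listed in this section could support PMID lookup and ranking by `aging_prob`. It does not reflect the actual layout of `cache/articles.parquet` or the code in `build_articles`. The `ArticleRecord` class, the sample PMIDs, and the keyword-based `toy_score` function are hypothetical; `toy_score` is only a stand-in for the trained classifier.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleRecord:
    """One hypothetical index row linking an article to its extractions."""
    pmid: str
    abstract: str
    aging_prob: float
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def toy_score(abstract):
    # Placeholder scorer: fraction of keywords present. NOT the real model.
    keywords = ("aging", "senescence", "lifespan")
    hits = sum(word in abstract.lower() for word in keywords)
    return hits / len(keywords)


def build_index(articles, entities, relations, score):
    """Build a PMID-keyed index and attach entities and relations to each article."""
    index = {}
    for pmid, abstract in articles:
        index[pmid] = ArticleRecord(pmid, abstract, score(abstract))
    for pmid, entity_id in entities:
        if pmid in index:
            index[pmid].entities.append(entity_id)
    for pmid, edge in relations:
        if pmid in index:
            index[pmid].relations.append(edge)
    return index


# Made-up PMIDs and abstracts, for illustration only.
ARTICLES = [
    ("1001", "Rapamycin extends lifespan and delays aging in mice."),
    ("1002", "A survey of crop irrigation methods."),
]
ENTITIES = [("1001", "E2"), ("1002", "E9")]
RELATIONS = [("1001", ("E2", "INHIBITS", "E3"))]

index = build_index(ARTICLES, ENTITIES, RELATIONS, toy_score)
ranked = sorted(index.values(), key=lambda rec: rec.aging_prob, reverse=True)
print([rec.pmid for rec in ranked])
```

Keying the index by PMID keeps lookup and provenance tracing simple: every entity or relation can be traced back to its source article through a single dictionary access.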