# Step 4: Build Final HALDxAI Database Tables

This step compiles all cleaned, annotated, and predicted information into the **final database tables** used by HALDxAI:

* Entity catalog & unified identifiers
* Aging-related article index
* Entity evidence
* Relation evidence
* Node table
* Relation table
* Entity and relation type tables (external and prediction-based)

These tables form the backbone of the HALDxAI Aging-KG and are directly consumed by the Web API, Web App, and downstream analytics.

## 4.1 Load Cached Sources

HALDxAI loads all cleaned Parquet/JSON files once, up front, so that the table builders in the following steps can reuse the in-memory sources and table construction stays fast.

```python
from pathlib import Path

from haldxai.enrich.tables.loader import load_sources
from haldxai.enrich.tables.articles import build_articles
from haldxai.enrich.tables.entity_catalog import build_entity_catalog
from haldxai.enrich.tables.entity_catalog_ext import build_entity_catalog_ext
from haldxai.enrich.tables.entity_evidence import build_entity_evidence
from haldxai.enrich.tables.entity_types import build_entity_types
from haldxai.enrich.tables.entity_types_pred import build_entity_types_pred
from haldxai.enrich.tables.nodes import build_nodes
from haldxai.enrich.tables.relation import build_relations
from haldxai.enrich.tables.relation_evidence import build_relation_evidence
from haldxai.enrich.tables.relation_types import build_relation_types

ROOT = Path("/path/to/HALDxAI-Project")

# Load parquet/json sources once
src = load_sources(ROOT)
print("🟢 Sources loaded!")
```

Example message:

```
🟢 Sources loaded!
```
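
The load-once idea can be pictured as a small in-memory cache keyed by source name. The sketch below is only an illustration and makes no claim about how `load_sources` is actually implemented: the `SourceCache` class, the `disk_reads` counter, and the one-JSON-file-per-source layout are all hypothetical, and it uses only the standard library so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


class SourceCache:
    """Hypothetical load-once cache: each source file is read from disk at most once."""

    def __init__(self, root):
        self.root = Path(root)
        self._cache = {}
        self.disk_reads = 0  # counts real file reads, for illustration only

    def get(self, name):
        if name not in self._cache:
            path = self.root / f"{name}.json"  # assumed layout: one JSON file per source
            with path.open(encoding="utf-8") as fh:
                self._cache[name] = json.load(fh)
            self.disk_reads += 1
        return self._cache[name]


# A temporary directory stands in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Articles.json").write_text(json.dumps([{"pmid": "1"}]), encoding="utf-8")

    cache = SourceCache(root)
    first = cache.get("Articles")
    second = cache.get("Articles")  # served from memory, no second disk read
    reads = cache.disk_reads
    same_object = first is second
```

Passing the already-loaded objects (such as `src["Articles"]`) into each builder follows the same principle: the expensive read happens once and every later step reuses the result.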

## 4.2 Build Articles Table

This step writes the final **articles.csv**, containing for each PMID:

* Title / Abstract
* Meta-information
* `aging_prob` (from the classifier)
* Metadata processed by ID mapping

```python
df_articles = build_articles(ROOT, src["Articles"], force=True)
```

Output:

```
▶ Building articles.csv …
✓ articles.csv: wrote 445,435 rows → data/database/articles.csv
```
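
After writing a table of this size, a quick sanity check can catch obvious problems early. The following is a minimal sketch, assuming the CSV has a `pmid` column and an `aging_prob` column holding a probability in [0, 1]; the `pmid` and `title` column names and the sample rows are hypothetical, and only `aging_prob` is named in this guide.

```python
import csv
import io

# Hypothetical sample standing in for data/database/articles.csv.
SAMPLE = """pmid,title,aging_prob
101,Senescence in mice,0.97
102,Unrelated topic,0.03
102,Duplicate row,0.50
103,Broken score,1.70
"""


def check_articles(handle):
    """Return (row_count, duplicate_pmids, out_of_range_pmids) for an articles CSV."""
    seen, duplicates, out_of_range = set(), [], []
    count = 0
    for row in csv.DictReader(handle):
        count += 1
        pmid = row["pmid"]
        if pmid in seen:
            duplicates.append(pmid)
        seen.add(pmid)
        # A classifier probability should lie in the closed interval [0, 1].
        if not 0.0 <= float(row["aging_prob"]) <= 1.0:
            out_of_range.append(pmid)
    return count, duplicates, out_of_range


count, duplicates, out_of_range = check_articles(io.StringIO(SAMPLE))
```

To run the same check against the real file, pass an open file handle instead of the in-memory sample, after adjusting the column names to the actual header.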

## 4.3 Build Entity Catalog (name2id)

This table ensures that **all entity names, aliases, and synonyms** map to unified Entity-IDs.

```python
df_entity_catalog = build_entity_catalog(ROOT, force=True)
```

Output:

```
✓ name2id.json updated, current entry count = 6,770,180
✓ entity_catalog: wrote 389,669 rows → data/database/entity_catalog.csv
```
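
To make the name-to-ID idea concrete, here is a minimal sketch of how many surface names can resolve to far fewer entity IDs. It is an illustration under stated assumptions, not the logic inside `build_entity_catalog`: the `normalize` rule, the `build_name2id` helper, and the `Entity-0001` style IDs are all hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so spelling variants share one key."""
    return " ".join(name.lower().split())


def build_name2id(catalog):
    """Flatten {entity_id: [names...]} into a {normalized_name: entity_id} lookup."""
    name2id = {}
    for entity_id, names in catalog.items():
        for name in names:
            # setdefault keeps the first ID seen for a name; a real pipeline
            # would need an explicit rule for names claimed by several entities.
            name2id.setdefault(normalize(name), entity_id)
    return name2id


# Hypothetical catalog: two entities, five names in total.
catalog = {
    "Entity-0001": ["Tumor protein p53", "TP53", "p53"],
    "Entity-0002": ["Sirtuin 1", "SIRT1"],
}

name2id = build_name2id(catalog)
lookup = name2id.get(normalize("  tp53 "))
unknown = name2id.get(normalize("not in catalog"))
```

This also shows why the `name2id.json` entry count can be much larger than the number of rows in `entity_catalog.csv`: every alias adds a lookup key without adding an entity.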

## 4.4 Build External Entity Catalog (EXTERNAL sources)

This step is optional. Run it only if you need to rebuild the external entity catalog from the external node sources:

```python
df_entity_catalog_ext = build_entity_catalog_ext(ROOT, src["ExtNodes"])
```

## 4.5 Build Entity Evidence Table

This table stores **sentence-level evidence** for each entity appearing in PubMed abstracts, merged from DeepSeek, SciSpacy, and predicted entities.

```python
df_entity_evidence = build_entity_evidence(
    ROOT,
    src["Articles"],
    df_llm_entities=src["LlmEnts"],
    force=True,
)
```

Output:

```
✓ entity_evidence: wrote 7,365,014 rows → data/database/entity_evidence.csv
✓ name2id.json updated
```

## 4.6 Build Entity Types (External + LLM + Predictions)

### ① External + LLM + BioPortal types

```python
# Store the result under its own name so df_articles from step 4.2 is not overwritten
df_entity_types = build_entity_types(
    ROOT,
    src["ExtNodes"],
    src["LlmEnts"],
    src["PredEnts"],
    force=True,
)
```

Output:

```
✓ entity_types: wrote 4,182,168 rows → data/database/entity_types.csv
```

### ② Entity Type Prediction Table

For entities with model-based predicted types:

```python
# Use a distinct name so the predicted-type table does not shadow the unified entity types
df_entity_types_pred = build_entity_types_pred(ROOT, src["PredEnts"], force=True)
```

Output:

```
✓ entity_types_pred: wrote 1,043,682 rows
```

## 4.7 Build Unified Nodes Table

This step merges all entity sources into a single table that covers:

* External nodes
* LLM-extracted nodes
* Predicted nodes
* Unified ID fields

```python
df_nodes = build_nodes(ROOT, src["ExtNodes"], src["LlmEnts"], force=True)
```

Output:

```
✓ nodes: wrote 2,843,928 rows → data/database/nodes.csv
```
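
The merge described in this step can be sketched as collapsing per-source node lists into one record per unified ID while remembering which sources contributed. This is a simplified illustration, not the behaviour of `build_nodes`: the `merge_nodes` helper, the `entity_id` and `name` fields, the "first name wins" rule, and the sample IDs are all assumptions.

```python
from collections import defaultdict


def merge_nodes(**sources):
    """Merge per-source node lists into one record per entity ID."""
    merged = defaultdict(lambda: {"name": None, "sources": set()})
    for source_name, nodes in sources.items():
        for node in nodes:
            record = merged[node["entity_id"]]
            # Assumed rule: the first source to mention an entity sets its display name.
            if record["name"] is None:
                record["name"] = node["name"]
            record["sources"].add(source_name)
    return dict(merged)


# Hypothetical inputs: E1 appears in both sources, E2 only in the LLM output.
external = [{"entity_id": "E1", "name": "SIRT1"}]
llm = [
    {"entity_id": "E1", "name": "sirtuin 1"},
    {"entity_id": "E2", "name": "rapamycin"},
]

nodes = merge_nodes(external=external, llm=llm)
```

Keeping a set of contributing sources per node preserves provenance after the merge, which is useful when an entity is supported by only one extraction method.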

## 4.8 Build Unified Relation Table

This step merges the following relation sources:

* External relations
* LLM relation extractions
* Predicted relation extractions
* Article-derived relations

```python
df_relations = build_relations(
    ROOT,
    src["ExtRels"],
    src["LlmRels"],
    src["PredRelsArt"],
    force=True,
)
```

Output:

```
✓ relations: wrote 188,101,878 rows → data/database/relations.csv
```

## 4.9 Build Relation Evidence Table

This table stores:

* Sentence-level evidence
* Model confidence
* Provenance (LLM, external, predicted)

```python
df_relation_evidence = build_relation_evidence(
    ROOT,
    src["LlmRels"],
    src["PredRelsLlm"],
    src["PredRelsArt"],
    force=True,
)
```

Output:

```
✓ relation_evidence: wrote 27,457,917 rows → data/database/relation_evidence.csv
```
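
One way downstream code might use sentence-level evidence is to summarize it per relation triple. The sketch below assumes each evidence row carries a head, a relation label, a tail, a confidence, and a provenance tag; the field names (`head`, `rel`, `tail`, `pmid`, `confidence`, `source`) and the sample values are hypothetical and may not match the real `relation_evidence.csv` header.

```python
from collections import defaultdict

# Hypothetical evidence rows, one per supporting sentence.
evidence = [
    {"head": "E1", "rel": "REGULATES", "tail": "E2", "pmid": "101",
     "confidence": 0.91, "source": "llm"},
    {"head": "E1", "rel": "REGULATES", "tail": "E2", "pmid": "205",
     "confidence": 0.64, "source": "predicted"},
    {"head": "E3", "rel": "ASSOCIATED_WITH", "tail": "E2", "pmid": "101",
     "confidence": 0.80, "source": "llm"},
]


def summarize(rows):
    """Group evidence by (head, rel, tail): count, best confidence, provenance set."""
    summary = defaultdict(
        lambda: {"n_evidence": 0, "max_confidence": 0.0, "sources": set()}
    )
    for row in rows:
        entry = summary[(row["head"], row["rel"], row["tail"])]
        entry["n_evidence"] += 1
        entry["max_confidence"] = max(entry["max_confidence"], row["confidence"])
        entry["sources"].add(row["source"])
    return dict(summary)


summary = summarize(evidence)
top = summary[("E1", "REGULATES", "E2")]
```

A relation backed by several sentences from different provenance types is usually easier to trust than one resting on a single prediction, which is why the count and the source set are kept alongside the confidence.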

## 4.10 Build Relation Types Table

Relation types are integrated from:

* External structured DBs
* LLM extractions
* LLM-predicted relation types
* Article-level predicted relations

```python
df_relation_types = build_relation_types(
    ROOT,
    src["ExtRels"],
    src["LlmRels"],
    src["PredRelsLlm"],
    src["PredRelsArt"],
    force=True,
)
```

Output:

```
✓ relation_types: wrote 188,217,942 rows → data/database/relation_types.csv
```

# ✔ Final Output Summary

After Step 4, your `data/database/` directory contains the **final HALDxAI database**:

| File                     | Description                                       |
| ------------------------ | ------------------------------------------------- |
| `articles.csv`           | All PubMed articles with classifier probabilities |
| `entity_catalog.csv`     | Name → ID catalog                                 |
| `entity_catalog_ext.csv` | External dataset entity catalog                   |
| `entity_evidence.csv`    | Sentence-level entity evidence                    |
| `entity_types.csv`       | Unified entity type table                         |
| `entity_types_pred.csv`  | Predicted entity types                            |
| `nodes.csv`              | Unified node table                                |
| `relations.csv`          | Unified relation table                            |
| `relation_evidence.csv`  | Sentence-level relation evidence                  |
| `relation_types.csv`     | Final relation types                              |
