# Step 2 — External Dataset Collection and Integration

## 2. External Dataset Collection and Aggregation

HALDxAI integrates multiple aging-related external datasets (e.g., AgingAtlas, HAGR, CTD, and UniProt) into a unified schema.
This step demonstrates how to:

1. **Download external databases**
2. **Aggregate nodes and relations into unified CSV files**

Before running the commands below, ensure that:

* The HALDxAI project has been initialized (see **Step 1**)
* You have installed all required dependencies

## 2.1 Collect External Databases

In [None]:
from pathlib import Path
from haldxai.enrich.external_db import cli   # Typer-based CLI module

project_root = Path("/path/to/HALDxAI-Project")

cli.main(
    names=["all"],   # Download all available external datasets
    root=project_root,
    force=False      # Set to True to overwrite downloaded files
)

This call downloads every supported external database into its own directory:

```
data/external_dbs/<database_name>/
```
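To confirm what was actually fetched, the download directory can be inspected directly. The following is a minimal sketch, not part of HALDxAI: `list_downloaded` is a hypothetical helper, and it assumes the `data/external_dbs/` directory sits directly under the project root.

```python
from pathlib import Path


def list_downloaded(root: Path) -> dict[str, int]:
    """Map each sub-directory of data/external_dbs to the number of files it holds.

    Hypothetical helper for illustration; assumes the layout described above.
    """
    base = root / "data" / "external_dbs"
    if not base.is_dir():
        # Nothing downloaded yet (or the root path is wrong).
        return {}
    return {
        d.name: sum(1 for f in d.rglob("*") if f.is_file())
        for d in sorted(base.iterdir())
        if d.is_dir()
    }


project_root = Path("/path/to/HALDxAI-Project")
print(list_downloaded(project_root))
```

A database that appears with a file count of zero is a likely sign of an interrupted download and a candidate for rerunning with `force=True`.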

## 2.2 Aggregate External Dataset Nodes & Relations

After downloading, HALDxAI merges all external datasets into two consistent table types:

* **Node tables** (entities)
* **Relation tables** (typed edges)

In [None]:
from pathlib import Path
from haldxai.enrich.ext_collect import build_collect

ROOT = Path("/path/to/HALDxAI-Project")

# Aggregate external entities and relationships
build_collect(ROOT, force=True)

### ✔ Output Files

After aggregation, two key CSV files are generated:

| File                                      | Description                                 |
| ----------------------------------------- | ------------------------------------------- |
| `data/finals/collected_ext_nodes.csv`     | All integrated nodes from external datasets |
| `data/finals/collected_ext_relations.csv` | All integrated relations across datasets    |
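Before loading the tables, it can help to check that both files exist and are not empty. The sketch below uses only the standard library; `check_outputs` and `EXPECTED` are hypothetical names introduced here, and the file paths are taken from the table above.

```python
import csv
from pathlib import Path

# File names from the table above.
EXPECTED = ("collected_ext_nodes.csv", "collected_ext_relations.csv")


def check_outputs(root: Path) -> dict[str, int]:
    """Return the number of data rows per expected file; raise if one is missing."""
    counts = {}
    for name in EXPECTED:
        path = root / "data" / "finals" / name
        if not path.is_file():
            raise FileNotFoundError(f"expected output not found: {path}")
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            next(reader, None)  # skip the header row
            counts[name] = sum(1 for _ in reader)
    return counts
```

Calling `check_outputs(ROOT)` after aggregation should return a positive row count for each file; a count of zero suggests that the download step produced nothing to aggregate.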

### Preview Results

In [None]:
import pandas as pd

nodes = pd.read_csv(ROOT / "data/finals/collected_ext_nodes.csv")
rels  = pd.read_csv(ROOT / "data/finals/collected_ext_relations.csv")

print(nodes.shape, rels.shape)
nodes.head()

You should now see:

* A large merged node table containing genes, diseases, pathways, compounds, etc.
* A multi-source relation table spanning several biological knowledge bases
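Because the column names of the merged tables are not listed here, a schema-agnostic overview is a safe first look. The following sketch makes no assumption about the columns; `column_overview` is a hypothetical helper, and the small `demo` frame is made-up data used only to show the output shape.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # missing values are not counted
    })


# Made-up example frame; replace with `nodes` or `rels` from the cell above.
demo = pd.DataFrame({"id": ["a", "b", "c"], "kind": ["gene", "gene", None]})
overview = column_overview(demo)
print(overview)
```

Running `column_overview(nodes)` and `column_overview(rels)` shows which columns are sparsely populated and which have few distinct values, which are the natural candidates for a `value_counts()` breakdown.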