# THEMAP: Computing Distances Between Molecular Datasets

This notebook demonstrates how to use THEMAP to compute distances between molecular datasets for transfer learning and task selection.

## What You'll Learn
1. How to organize your data
2. How to compute distances between molecular datasets with a single function call
3. How to analyze results to select the best source datasets

## 1. Setup

In [None]:
import os
import sys

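# Add the repo root to the path. Assumes the notebook is run from notebooks/;
# the check keeps the cell safe to re-run, since cwd changes after the first run.
cwd = os.path.abspath("")
repo_path = os.path.dirname(cwd) if os.path.basename(cwd) == "notebooks" else cwd
os.chdir(repo_path)
sys.path.insert(0, repo_path)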

import matplotlib.pyplot as plt  # noqa: E402
import pandas as pd  # noqa: E402
import seaborn as sns  # noqa: E402

sns.set_palette("Set2")
plt.rcParams["figure.figsize"] = (12, 5)

## 2. Data Organization

THEMAP expects your data in this structure:

```
datasets/
├── train/                    # Source datasets (to transfer FROM)
│   ├── CHEMBL123456.jsonl.gz
│   ├── CHEMBL789012.jsonl.gz
│   └── ...
└── test/                     # Target datasets (to transfer TO)
    ├── CHEMBL111111.jsonl.gz
    └── ...
```

Each `.jsonl.gz` file is gzip-compressed JSON Lines: one molecule per line, with a `SMILES` string and a `Property` value:
```json
{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}
```
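
To make the format concrete, the sketch below writes a tiny file in this layout and reads it back using only the standard library. The file name `CHEMBL000000.jsonl.gz`, the temporary directory, and the helpers `write_jsonl_gz` and `read_jsonl_gz` are made up for illustration and are not part of THEMAP.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(path, records):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy records in the same shape as the example above
records = [
    {"SMILES": "CCO", "Property": 1},
    {"SMILES": "CCCO", "Property": 0},
]

# Hypothetical file name inside a throwaway directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "CHEMBL000000.jsonl.gz")

write_jsonl_gz(path, records)
loaded = read_jsonl_gz(path)

print(f"Read {len(loaded)} molecules")
print(f"Labels: {[r['Property'] for r in loaded]}")
```

Reading a real file back this way is a quick check that its keys match `SMILES` and `Property` before running the full pipeline.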

In [None]:
# Check available datasets
import glob

train_files = glob.glob("datasets/train/CHEMBL*.jsonl.gz")
test_files = glob.glob("datasets/test/CHEMBL*.jsonl.gz")

print(f"Found {len(train_files)} training (source) datasets")
print(f"Found {len(test_files)} test (target) datasets")

## 3. Compute Distances (One Line!)

The simplest way to compute distances between all source and target datasets:

In [None]:
from themap import quick_distance

# Compute distances - this does everything:
# 1. Loads all datasets
# 2. Computes molecular fingerprints (ECFP)
# 3. Calculates pairwise distances
# 4. Saves results to CSV

results = quick_distance(
    data_dir="datasets",  # Your data directory
    output_dir="output",  # Where to save results
    molecule_featurizer="ecfp",  # Fingerprint type (ecfp, maccs, desc2D, etc.)
    molecule_method="euclidean",  # Distance metric
    n_jobs=8,  # Parallel workers
)

print("Done! Results saved to output/")
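
The `molecule_method` argument selects the distance metric. As a rough, library-independent illustration of how two of the options named in this notebook (`euclidean` and `cosine`) can disagree, the sketch below compares them on two made-up feature vectors. This notebook does not show how THEMAP turns per-molecule features into a dataset-level distance, so the helpers `euclidean_distance` and `cosine_distance` illustrate only the metrics themselves, not the library's implementation.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up feature vectors: b points in the same direction as a but is twice as long
a = [1.0, 0.0, 1.0, 0.0]
b = [2.0, 0.0, 2.0, 0.0]

print(f"euclidean: {euclidean_distance(a, b):.4f}")
print(f"cosine:    {cosine_distance(a, b):.4f}")
```

Cosine distance ignores vector length while Euclidean distance does not, so the two metrics can rank source datasets differently. Comparing them on your own data is a reasonable first experiment.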

## 4. Load and Visualize Results

In [None]:
# Load the distance matrix from CSV
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)

print(f"Distance matrix shape: {distances.shape}")
print(f"Sources (rows): {list(distances.index)}")
print(f"Targets (columns): {list(distances.columns)}")

distances

In [None]:
# Visualize as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(distances, annot=True, fmt=".2f", cmap="YlOrRd")
plt.title("Molecular Distance Matrix")
plt.xlabel("Target Datasets")
plt.ylabel("Source Datasets")
plt.tight_layout()
plt.show()

## 5. Find Best Source Datasets

For each target, find the closest source dataset. Under this measure, a smaller distance marks a more promising source for transfer learning:

In [None]:
# Find closest source for each target
print("Best source dataset for each target:")
print("=" * 50)

for target in distances.columns:
    closest = distances[target].idxmin()
    dist = distances[target].min()
    print(f"{target} <- {closest} (distance: {dist:.4f})")
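
A single nearest source can be a fragile choice, so it is sometimes useful to look at the few closest candidates instead. The sketch below ranks the `k` nearest sources per target with `Series.nsmallest`. It uses a made-up distance matrix with placeholder dataset names so that it runs on its own. With real results, the `distances` DataFrame loaded above would take the place of `toy`.

```python
import pandas as pd

# Made-up distances: rows are sources, columns are targets (placeholder names)
toy = pd.DataFrame(
    {"target_A": [0.9, 0.2, 0.5], "target_B": [0.3, 0.8, 0.4]},
    index=["source_1", "source_2", "source_3"],
)

k = 2  # Number of candidate sources to keep per target

top_k = {}
for target in toy.columns:
    # nsmallest returns the k smallest distances, sorted in ascending order
    nearest = toy[target].nsmallest(k)
    top_k[target] = list(nearest.index)
    ranked = ", ".join(f"{s} ({d:.2f})" for s, d in nearest.items())
    print(f"{target}: {ranked}")
```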

In [None]:
# Visualize distances for a specific target
target = distances.columns[0]  # Pick first target

plt.figure(figsize=(12, 5))
distances[target].sort_values().plot(kind="bar")
plt.title(f"Distances from all sources to {target}")
plt.xlabel("Source Datasets")
plt.ylabel("Distance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

## 6. Compute Task Hardness

Here, task hardness is defined as the average distance from a target to its k nearest source datasets.
Higher values mean the target lies farther from the available sources and is likely harder to transfer to.
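
Written out, with $d(s, t)$ denoting the distance-matrix entry for source $s$ and target $t$, and $N_k(t)$ the set of the $k$ sources nearest to $t$ (notation introduced here for illustration only), the definition reads:

$$
H_k(t) = \frac{1}{k} \sum_{s \in N_k(t)} d(s, t)
$$

If fewer than $k$ sources are available, the code below averages over all of them, since `nsmallest` returns at most as many rows as exist.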

In [None]:
k = 3  # Number of nearest neighbors

hardness = {}
for target in distances.columns:
    k_nearest_avg = distances[target].nsmallest(k).mean()
    hardness[target] = k_nearest_avg

hardness_df = pd.Series(hardness).sort_values()

print(f"Task Hardness (avg of {k}-nearest sources):")
print("=" * 50)
for task, h in hardness_df.items():
    print(f"{task}: {h:.4f}")

print(f"\nEasiest target: {hardness_df.idxmin()}")
print(f"Hardest target: {hardness_df.idxmax()}")

In [None]:
# Visualize task hardness
plt.figure(figsize=(10, 5))
hardness_df.plot(kind="bar", color="steelblue")
plt.title(f"Task Hardness (based on {k}-nearest sources)")
plt.xlabel("Target Dataset")
plt.ylabel("Hardness Score")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

## 7. Using a Config File (Optional)

For more control, you can use a YAML config file:

In [None]:
config_content = """
# THEMAP Pipeline Configuration
data:
  directory: "datasets"
  task_list: null  # Auto-discover all files

molecule:
  enabled: true
  featurizer: "ecfp"    # Options: ecfp, maccs, desc2D, ChemBERTa-77M-MLM, etc.
  method: "euclidean"   # Options: euclidean, cosine, otdd

protein:
  enabled: false        # Set to true if you have protein data

output:
  directory: "output"
  format: "csv"
  save_features: true   # Cache features for faster reruns

compute:
  n_jobs: 8
  device: "auto"        # auto, cpu, or cuda
"""

# Save config
with open("config.yaml", "w") as f:
    f.write(config_content)

print("Config saved to config.yaml")
print("\nRun with: themap run config.yaml")

In [None]:
# Run from config file
from themap import run_pipeline

results = run_pipeline("config.yaml")
print("Pipeline complete!")

## Summary

THEMAP makes it easy to:

1. **Compute distances** between molecular datasets with one function call
2. **Find best sources** for transfer learning to your target task
3. **Measure task hardness** to identify challenging targets

### Next Steps

- Try different featurizers: `maccs`, `desc2D`, `ChemBERTa-77M-MLM`
- Try different distance methods: `cosine`, `otdd`
- Add protein distances by setting `protein.enabled: true`
- Use the CLI: `themap run config.yaml`