# TCMxPlore:
## discovering anti-aging TCM formulas with bioinformatics and generative AI

This notebook will walk you through key functions of TCMxPlore. You will start with a list of aging-related genes obtained from a generative biology model [Precious3GPT](https://github.com/insilicomedicine/precious3-gpt/tree/main), then screen natural compounds recorded in [BATMAN-TCM2](https://doi.org/10.57967/hf/3314) to select the molecules that interact with these targets. Finally, you will find TCM herbs and formulas that contain the compounds and read more about their properties in the context of TCM practice.

Firstly, install TCMxPlore and import the `connector` module.
It will allow you to download, parse, and cross-reference TCM entities, such as molecular targets, ingredients, herbs, formulas, and conditions.

In [1]:
from connector import connector

import pickle, json
import pandas as pd
from itertools import product

`BatmanDragonConnector` in particular will download and unpack the Hugging Face datasets you need.
Depending on your Internet speed, pre-processing can take some time.

In [2]:
con = connector.BatmanDragonConnector.from_huggingface()



📂 Accessing HuggingFace repository...
✅ Connector for BATMAN-TCM2 and DragonTCM has been loaded
Start processing individual databases:
Adding cross mappings
Done adding cross mappings
Adding equivalents
Done adding equivalents
Loading DB: BATMAN
⚠️ Loading this dataset usually takes ~3-5 minutes.
⏳ Downloading compressed database file from HuggingFace...
📂 Database file downloaded, now decompressing and loading...
✅ JSON file has been read!
Loading the DB into memory...
✅ Database has been loaded successfully!
Uploaded DB: BATMAN
Loading DB: American Dragon
Added 1044 herbs
Added 1119 conditions
Added 2580 formulas
Uploaded DB: American Dragon
Adding word maps
Done adding word maps
from_huggingface execution time: 51.90 seconds


You can access and work within the individual TCM databases stored in the connector.

In [3]:
batman_db = con.dbs['BATMAN']
dragon_db = con.dbs['American Dragon']

Now, load the files that will serve as the foundation of our TCM exploration:
- `aging_signatures.txt`: species- and tissue-specific signatures of aging, i.e. lists of the top 100 genes that are upregulated in older age groups, as assessed by the Precious3GPT AI model;
- `P3GPT_compound_targets.pckl`: a list of compounds featured in both Precious3GPT and BATMAN-TCM2, that is, natural compounds whose effect on gene expression may be estimated using generative AI.

In [4]:
p_signs = './materials/aging_signatures.txt'
p_cpd_filter = "./materials/P3GPT_compound_targets.pckl"

with open(p_cpd_filter, "rb") as f:
    cpd_filter = pickle.load(f)
results = pd.read_csv(p_signs, sep='\t', index_col=None, header=0)

In [5]:
results.head()

| Unnamed: 0 | tissue | dataset_type | species | control | case | direction | hallmarks |
|---|---|---|---|---|---|---|---|
| 0 | skin | methylation | mouse | Mouse-19.95-30 | Mouse-350-400 | up | LOC100270710;GABRP;KRTAP5-7;IL2RG;CSMD2;IL1RL2... |
| 1 | skin | methylation | human | 19.95-25.0 | 70.0-80.0 | up | C11orf90;LPAR1;LOC100270710;BANF2;LOR;MYF6;OBP... |
| 2 | skin | expression | mouse | Mouse-19.95-30 | Mouse-350-400 | up | UTY;CD1A;AKR1C4;PLEKHG1;CALCRL;SLC22A7;SLC17A7... |
| 3 | skin | expression | human | 19.95-25.0 | 70.0-80.0 | up | SLC1A1;KRT15;BBOX1;CYP2B6;KDM5D;SELENOP;UTY;CD... |
| 4 | liver | methylation | mouse | Mouse-19.95-30 | Mouse-350-400 | up | CD101;PM20D1;NAPSB;PCDH12;LOC100132724;NRG4;LR... |


In [6]:
len(cpd_filter)

659

📝 We advise you to check out the [precious3gpt library](https://github.com/insilicomedicine/precious3-gpt) that we used to generate the aging signatures shown above.
You do not need it to work through this notebook, but many operations, such as gene list intersection and pathway enrichment, are much easier to handle with `precious3gpt`.

## Using cross-species aging hallmarks to select TCM formulas
We define the genes to be targeted with TCM herbs as those that appear in both the human and the murine aging signatures.

In [7]:
params = {"tissue": ['liver', 'muscle', 'lung'],
          "dataset_type": ['expression'],
          "species": ['mouse', 'human']}

In [8]:
# All (tissue, dataset_type) parameter combinations; kept for reference, not used below
sibling_generations = list(product(*[[(x, z) for z in y] for x, y in params.items()][:2]))

signature_genes = dict()
for t in params['tissue']:
    df_slice = results[(results.tissue == t) & (results.dataset_type == 'expression')]
    signature_genes[t] = set.intersection(*[set(x.split(";")) for x in df_slice.hallmarks.tolist()])


The mouse and human aging signatures overlap substantially in all three tissues examined:

In [9]:
{x:len(y) for x,y in signature_genes.items()}

{'liver': 33, 'muscle': 22, 'lung': 27}
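
The intersection step can be tried on toy data without loading the real signature file. The sketch below is only an illustration: the rows and gene lists in `toy_rows` are invented, and `shared_genes` is a hypothetical helper, not part of TCMxPlore. It also guards against a tissue with no matching rows, a case in which `set.intersection(*[])` would raise a `TypeError`.

```python
# Toy rows: gene lists are invented for illustration,
# they are NOT taken from aging_signatures.txt
toy_rows = [
    {"tissue": "liver", "species": "mouse", "hallmarks": "C3;SCD;LBP;UTY"},
    {"tissue": "liver", "species": "human", "hallmarks": "SCD;C3;NNMT"},
    {"tissue": "lung", "species": "mouse", "hallmarks": "CFD;KRT15"},
    {"tissue": "lung", "species": "human", "hallmarks": "MIOX;SRPX"},
]


def shared_genes(rows, tissue):
    """Return the genes present in every signature recorded for `tissue`."""
    gene_sets = [set(r["hallmarks"].split(";"))
                 for r in rows if r["tissue"] == tissue]
    # An empty list cannot be unpacked into set.intersection, so return
    # an empty set for tissues that have no rows at all
    return set.intersection(*gene_sets) if gene_sets else set()


toy_shared = {t: shared_genes(toy_rows, t) for t in ("liver", "lung", "heart")}
print({t: sorted(genes) for t, genes in toy_shared.items()})
```

An empty result can mean either "no shared genes" (lung here) or "no data" (heart here), so it is worth checking the number of rows per tissue before interpreting the counts.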

First, we look for the compounds whose targets (both known and predicted) significantly overlap with the identified signature genes: 

In [10]:
picked_cpds = dict()
for t in params['tissue']:
    picked_cpds[t] = batman_db.find_enriched_cpds(
                                  signature_genes[t], # genes to be targeted by natural compounds
                                  tg_type='both', # consider both known and predicted molecular targets
                                  thr=0.01, # significance threshold, with multiple comparison
                                  cpd_subset=cpd_filter # consider only compounds known to P3GPT
                                                )
    print(f"\nCompounds targeting {t} aging in mice and humans:\n"
          f"{'; '.join([x['name'].capitalize() for x in picked_cpds[t].values()][-5:])}... (total N = {len(picked_cpds[t])})")

with open("./materials/27Oct2024_picked_cpds_cross-species.pkl", "wb") as f:
    pickle.dump(picked_cpds, f)


Compounds targeting liver aging in mice and humans:
Cholic acid; Deoxycholic acid; Honokiol; Indobufen; Sorbic acid... (total N = 90)

Compounds targeting muscle aging in mice and humans:
Cyclosporin a; Nomegestrol; Cholic acid; Riboflavin; Honokiol... (total N = 27)

Compounds targeting lung aging in mice and humans:
Docosanoic acid; Arachidic acid; Vitamin e; Triamcinolone; L-glutamic acid... (total N = 25)
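
To give an intuition for what "significant overlap" can mean, here is a minimal sketch of one common approach: a one-sided hypergeometric test followed by a Bonferroni correction. This is an assumption made for illustration. The notebook does not show how `find_enriched_cpds` computes significance, and the function names, the gene universe size, and the target counts below are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_signature, n_targets, n_overlap):
    """P(overlap >= n_overlap) if n_targets genes were drawn at random from
    n_universe genes, of which n_signature belong to the aging signature."""
    total = comb(n_universe, n_targets)
    upper = min(n_signature, n_targets)
    tail = sum(comb(n_signature, i) * comb(n_universe - n_signature, n_targets - i)
               for i in range(n_overlap, upper + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 20,000 genes in the universe, a 33-gene signature,
# a compound with 150 targets, 5 of which fall into the signature.
p = hypergeom_pvalue(20_000, 33, 150, 5)
# Using the 659 screened compounds as the number of tests is an assumption
p_adj = bonferroni(p, n_tests=659)
print(f"raw p = {p:.3g}, adjusted p = {p_adj:.3g}, below 0.01: {p_adj < 0.01}")
```

Under this scheme, a stricter threshold or a larger number of screened compounds both shrink the list of enriched compounds.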


Among these compounds, eight appear in all three tissue lists:

In [11]:
multitis_cpds = set.intersection(*[set(picked_cpds[x].keys()) for x in params['tissue']])
print("CID\tCompound name")
print(*[(x, batman_db.ingrs[x].pref_name) for x in multitis_cpds], sep = '\n')

CID	Compound name
(10467, 'Arachidic acid')
(936, 'nicotinamide')
(177, 'acetate')
(243, 'benzoate')
(8215, 'Docosanoic acid')
(2266, 'nonanedioic acid')
(1054, 'pyridoxine')
(3647, 'hydroflumethiazide')


We can now easily identify the formulas from BATMAN-TCM2 that contain all eight by inspecting their ingredients:

In [12]:
flas_w_all_cpds = [x for x, y in batman_db.formulas.items()
                   if multitis_cpds <= {a.cid for a in y.ingrs}]  # subset check
print(*flas_w_all_cpds, sep="\n")

SHEN RONG LU TAI GAO


There is only one such formula: [SHEN RONG LU TAI GAO](http://www.tcmip.cn/ETCM/index.php/Home/Index/fj_details.html?pid=SHEN%20RONG%20LU%20TAI%20GAO), which is an ointment containing ginseng and deer placenta used to improve female reproductive health.

There are alternative ways to query the TCM databases when searching for geroprotectors. For example, we may look for formulas that contain compounds affecting two or more signatures of aging:

In [13]:
cpd_counts =  sum([list(x.keys()) for x in picked_cpds.values()], [])
cpd_counts = [x for x in set(cpd_counts) if cpd_counts.count(x)>1]

sel_flas = batman_db.select_N_formulas_by_cids(cids = cpd_counts, # look for these compounds in TCM formulas
                                               min_cids = 20, # a formula must contain at least 20 of them
                                               N_top = 100 # maximum number of formulas to return
                                              )

The only formula that contains at least 20 of these compounds is HUA SHAN WU ZI DAN:

In [14]:
sel_flas

(20, ['HUA SHAN WU ZI DAN'])
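
The "two or more signatures" selection can also be expressed with `collections.Counter`, which avoids calling `list.count` once per compound. The sketch below uses a toy stand-in for `picked_cpds`: the CIDs appear in the output above, but their grouping by tissue is invented for illustration.

```python
from collections import Counter

# Toy stand-in for picked_cpds (tissue -> {CID: record});
# the tissue assignments are invented, not real results
toy_picked = {
    "liver": {177: {}, 243: {}, 936: {}},
    "muscle": {177: {}, 936: {}},
    "lung": {177: {}, 1054: {}},
}

# Each CID occurs at most once per tissue, so the count is the number of tissues
tissue_hits = Counter(cid for cpds in toy_picked.values() for cid in cpds)
multi_tissue = sorted(cid for cid, n in tissue_hits.items() if n > 1)
print(multi_tissue)
```

The resulting list can be passed on in the same way as `cpd_counts` in the cell above.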

We can inspect these 20 compounds:

In [21]:
[x.pref_name.capitalize() for x in batman_db.formulas['HUA SHAN WU ZI DAN'].ingrs if x.cid in cpd_counts]

['Arachidic acid',
 'L-ascorbic acid',
 'Benzoate',
 'Cholic acid',
 'Docosanoic acid',
 'Pseudoephedrine',
 'Acetate',
 'Nonanedioic acid',
 'Butyric acid',
 'Picolinic acid',
 'Thiamine',
 'Protoporphyrin ix',
 'Chenodeoxycholic acid',
 'Desoxycortone',
 'Glycerol',
 'Vitamin e',
 'Hypoxanthine',
 'Uric acid',
 'Hydroflumethiazide',
 'L-glutamic acid']

In addition to searching for formulas based on the compounds they contain, TCMxPlore allows picking herbs based on the genes they target:

In [24]:
# Get a set of genes that are present in 2+ signatures of cross-species aging 
double_tgs = ((signature_genes["lung"] & signature_genes["liver"]) |
              (signature_genes["lung"] & signature_genes["muscle"]) |
              (signature_genes["muscle"] & signature_genes["liver"]))
print(f"{len(double_tgs)} genes are upregulated in at least two tissues in mice and humans\n")
print(*double_tgs, sep ='; ', end="\n\n")

# With this call, you'll see how many genes each herb has as its target
tg_based_herbs = batman_db.select_herbs_for_targets(double_tgs)
# Finally, select only the herbs that hit the highest number of genes
N_max = max(tg_based_herbs.values())
tg_based_herbs = {x:y for x,y in tg_based_herbs.items() if y == N_max}
print("(Herb, N targets from set)")
print(*tg_based_herbs.items(), sep = '\n')

13 genes are upregulated in at least two tissues in mice and humans

OVCH1; LBP; ACSM1; SCD; MOBP; SERPINA3; C3; SLC7A2; CFD; KRT15; MIOX; SRPX; NNMT

(Herb, N targets from set)
('SHA YUAN ZI', 7)
('ROU CONG RONG', 7)
('SHAN ZHA YE', 7)


You can define your own formulas locally, so they appear in your next searches:

In [25]:
batman_db.formulas["Custom Formula ~1"] = batman_db.create_formula_from_herbs(tg_based_herbs.keys())
batman_db.formulas["Custom Formula ~1"]

<Formula: Custom Formula ~1>

## Using cross-tissue human aging hallmarks to select TCM formulas
We may also focus on the herbs whose compounds are expected to target aging signatures in several human tissues at once.

In [26]:
all_tis = ['liver', 'muscle', 'lung', 'skin', 'heart', 'kidney', 'fat tissue']
signature_genes = dict()

In [27]:
for t in all_tis:
    df_slice = results[(results.tissue == t) & 
                       (results.dataset_type == 'expression') & 
                       (results.species == 'human')]
    signature_genes[t] = set.intersection(*[set(x.split(";")) for x in df_slice.hallmarks.tolist()])


In [28]:
picked_cpds = dict()
for t in all_tis:
    picked_cpds[t] = batman_db.find_enriched_cpds(signature_genes[t],
                                                  tg_type='both',
                                                  thr=0.001,
                                                  cpd_subset=cpd_filter)
    print(f"Done with {t} — {len(picked_cpds[t])} compounds picked")

Done with liver — 177 compounds picked
Done with muscle — 160 compounds picked
Done with lung — 162 compounds picked
Done with skin — 143 compounds picked
Done with heart — 132 compounds picked
Done with kidney — 192 compounds picked
Done with fat tissue — 147 compounds picked


In [29]:
cpd_counts = {x:[t for t in picked_cpds if x in picked_cpds[t]] for x in set(cpd_filter)}
cpd_counts = {x:y for x,y in cpd_counts.items() if y}
cpd_counts = dict(sorted(cpd_counts.items(), key=lambda x:len(x[1]), reverse=True))
top_cpds = [x for x in cpd_counts if len(cpd_counts[x]) == 7]

In [33]:
print(f"Compounds found that affect 1+ tissues: {len(cpd_counts)}\n")
for i in range(6):
    cpd_hits = len([x for x,y in cpd_counts.items() if len(y) == i+1])
    print(f"Compounds found to affect exactly {i+1} tissues: {cpd_hits}")

print(f"Compounds found that affect all 7 tissues: {len(top_cpds)}\n")
print(*[batman_db.ingrs[x].pref_name.capitalize() for x in top_cpds][-5:], sep="\n")
print("...")

Compounds found that affect 1+ tissues: 294

Compounds found to affect exactly 1 tissues: 64
Compounds found to affect exactly 2 tissues: 52
Compounds found to affect exactly 3 tissues: 21
Compounds found to affect exactly 4 tissues: 40
Compounds found to affect exactly 5 tissues: 34
Compounds found to affect exactly 6 tissues: 29
Compounds found that affect all 7 tissues: 54

Ethinyl estradiol
Progesterone
Chenodeoxycholic acid
Adenosine 5'-monophosphate
Menadione
...
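
The tissue tally above relies on inverting a tissue-to-compounds mapping into a compound-to-tissues mapping. A compact way to do this and to build the same kind of histogram is sketched below on toy data. The CIDs and their tissue assignments are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy stand-in for picked_cpds: tissue -> set of CIDs (invented assignments)
toy_picked = {
    "liver": {177, 936, 243},
    "muscle": {177, 936},
    "lung": {177},
}

# Invert the mapping: CID -> list of tissues in which it was picked
cpd_tissues = defaultdict(list)
for tissue, cids in toy_picked.items():
    for cid in cids:
        cpd_tissues[cid].append(tissue)

# Histogram: number of tissues -> number of compounds
histogram = Counter(len(tissues) for tissues in cpd_tissues.values())
for n_tissues in sorted(histogram):
    print(f"Compounds picked in exactly {n_tissues} tissue(s): {histogram[n_tissues]}")

# Sanity check: the histogram must account for every compound exactly once
assert sum(histogram.values()) == len(cpd_tissues)
```

The same sanity check holds for the real output: the seven per-count values add up to the 294 compounds reported.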


No existing TCM formula contains all the identified compounds that target aging processes in all 7 tissues. The best contender, [TOU GU ZHEN FENG WAN](https://bidd.group/TCMID/tcmf.php?formula=TCMFx5163), features only 25 of them.

In [34]:
best_fla = batman_db.select_N_formulas_by_cids(top_cpds,
                                               min_cids = 25,
                                               N_top=1)
print("Compounds found in the best-fitting formula: %s\nFormula name: %s"%best_fla)

Compounds found in the best-fitting formula: 25
Formula name: ['TOU GU ZHEN FENG WAN (TOU GU ZHEN FENG DAN )']


TCMxPlore provides you with several points of control when it comes to designing new formulas.
By default, the `get_greedy_formula(...)` method will keep adding herbs until all the sought compounds are represented in a formula. Such formulas may end up too complicated. You may limit the total number of herbs in your formula with the `max_herb` parameter, or exclude certain herbs from consideration with the `blacklist` parameter.
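
The behavior described above resembles a greedy set cover: at each step, pick the herb that adds the most compounds still missing from the formula. The sketch below illustrates that idea on toy data. It is an assumption about the general technique, not the actual implementation of `get_greedy_formula`. The function `greedy_formula`, the herb names, and the compound IDs are all hypothetical.

```python
def greedy_formula(herb_cpds, wanted, max_herb=None, blacklist=()):
    """Greedy set cover sketch: repeatedly add the herb that covers the most
    still-missing compounds. Returns (chosen herbs, compounds left uncovered)."""
    banned = set(blacklist)
    remaining = set(wanted)
    candidates = {h: set(c) for h, c in herb_cpds.items() if h not in banned}
    chosen = []
    while remaining and candidates and (max_herb is None or len(chosen) < max_herb):
        best = max(candidates, key=lambda h: len(candidates[h] & remaining))
        gain = candidates[best] & remaining
        if not gain:  # no remaining herb adds anything new
            break
        chosen.append(best)
        remaining -= gain
        del candidates[best]
    return chosen, remaining


# Hypothetical herbs and compound IDs, for illustration only
toy_herbs = {"HERB A": {1, 2, 3}, "HERB B": {3, 4}, "HERB C": {4, 5, 6}, "HERB D": {6}}
wanted_cpds = {1, 2, 3, 4, 5, 6}

full = greedy_formula(toy_herbs, wanted_cpds)
small = greedy_formula(toy_herbs, wanted_cpds, max_herb=1)
restricted = greedy_formula(toy_herbs, wanted_cpds, blacklist=["HERB C"])
print(full, small, restricted, sep="\n")
```

In this toy run, capping the number of herbs or excluding one of them leaves some compounds uncovered, which is the same trade-off as between full coverage and a simpler formula.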

In [56]:
greedy_fla = batman_db.get_greedy_formula(top_cpds)
# Enforce a simpler composition at the cost of including fewer compounds
smol_fla = batman_db.get_greedy_formula(top_cpds, max_herb=4)
# Some components may be excluded from the search to avoid potential
# health hazards and animal-based products
tcm_blacklist = ["HA MA YOU", # Forest frog's oviduct
                 'XIONG DAN', # Bear gall
                 'SHE XIANG', # Deer musk
                 'ZI HE CHE', # Human placenta
                 'DONG CHONG XIA CAO', # Cordyceps caterpillar
                 'LU RONG', # Deer antlers
                 'JIU', # Alcohol used in extraction
                 'REN NIAO' # Human urine
                ]
floral_fla = batman_db.get_greedy_formula(top_cpds, max_herb=4,
                                          blacklist = tcm_blacklist)
# Relax the limit on the number of herbs
floral_fla_full = batman_db.get_greedy_formula(top_cpds, max_herb=100,
                                               blacklist = tcm_blacklist)

In [64]:
tcm_flas = {"TCM-ISM-1":greedy_fla,
            "TCM-ISM-2":smol_fla,
            "ISM-Formula#1":floral_fla_full,
            "ISM-Formula#2":floral_fla}
for name, fla in tcm_flas.items():
    cpds_hit = len(set([x.cid for x in fla.ingrs])&set(top_cpds))
    n_herbs = len(fla.herbs)
    print(f"Formula {name} contains {n_herbs} ingredients which feature {cpds_hit}/{len(top_cpds)} of selected compounds")
    for i,h in enumerate(fla.herbs[:4]):
        common_name = h.synonyms[2].split(", ")[0]
        cpds_in_herb = len(set([x.cid for x in h.ingrs])&set(top_cpds))
        print(f"\t{i+1}. {common_name} (carries {cpds_in_herb} ingredients)")

# Serialize the formulas before saving to a file
tcm_flas = {"TCM-ISM-1":greedy_fla.serialize(),
            "TCM-ISM-2":smol_fla.serialize(),
            "ISM-Formula#1":floral_fla_full.serialize(),
            "ISM-Formula#2":floral_fla.serialize()}
with open('./all_formulas.json', 'w') as f:
    json.dump(tcm_flas, f, indent=4)

Formula TCM-ISM-1 contains 30 ingredients which feature 54/54 of selected compounds
	1. Croton lechleri (carries 1 ingredients)
	2. Heterophylly falsestarwort root (carries 3 ingredients)
	3. Chinese floweringquince (carries 3 ingredients)
	4. Common tea (carries 5 ingredients)
Formula TCM-ISM-2 contains 4 ingredients which feature 22/54 of selected compounds
	1. Ginseng (carries 5 ingredients)
	2. Chinese ephedra equivalent plant: ephedra equisetina  (carries 8 ingredients)
	3. Human placenta (carries 4 ingredients)
	4. Common sainfoin (carries 5 ingredients)
Formula ISM-Formula#1 contains 30 ingredients which feature 53/54 of selected compounds
	1. Indigoplant leaf (carries 3 ingredients)
	2. Croton lechleri (carries 1 ingredients)
	3. Chinese floweringquince (carries 3 ingredients)
	4. Common tea (carries 5 ingredients)
Formula ISM-Formula#2 contains 4 ingredients which feature 21/54 of selected compounds
	1. Ginseng (carries 5 ingredients)
	2. Chinese ephedra equivalent plant: ephe

## Applying agents to finalize the formula

The descriptions of herbs and formulas in BATMAN-TCM2 are rather brief and give little context on how to combine the herbs or how they act from the viewpoint of TCM.

The Dragon-TCM database is better suited for such tasks and can be easily combined with LLM-based agents to retrieve this information.



In [61]:
# Retrieve the names of the herbs to be looked up in Dragon-TCM
herb_annots = [x.pref_name for x in floral_fla.herbs]
# Look up their counterparts in Dragon-TCM
other_names = []
name_mapper = con.cross_mappings['herbs'][('BATMAN', 'American Dragon')]
for i in herb_annots:
    if i in name_mapper:
        other_names.append(name_mapper[i])
        print(f"[+] '{i.upper()}' found in Dragon-TCM")
    else:
        print(f"[-] '{i}' not found in Dragon-TCM")
# Prepare all information about the herbs for export
herb_annots = [con.dbs['American Dragon'].herbs[x].serialize() for x in other_names]

[+] 'REN SHEN' found in Dragon-TCM
[+] 'MA HUANG' found in Dragon-TCM
[+] 'CHA YE' found in Dragon-TCM
[+] 'LV DOU' found in Dragon-TCM
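
The lookup above is a plain dictionary translation between the naming schemes of two databases. A small sketch of the same pattern, which also collects the names that could not be mapped, is shown below. The mapping contents and the `map_names` helper are hypothetical. The real structure of the values in `con.cross_mappings` is not shown in this notebook.

```python
# Hypothetical name mapping between two databases, for illustration only
toy_mapper = {"REN SHEN": "Ren Shen", "MA HUANG": "Ma Huang"}
query_names = ["REN SHEN", "MA HUANG", "UNKNOWN HERB"]


def map_names(names, mapper):
    """Split names into those with a counterpart in the other database
    and those without one."""
    found = {name: mapper[name] for name in names if name in mapper}
    missing = [name for name in names if name not in mapper]
    return found, missing


found, missing = map_names(query_names, toy_mapper)
print(f"mapped: {found}")
print(f"not mapped: {missing}")
```

Keeping the unmapped names in a separate list makes it easier to follow up on herbs that need a manual lookup.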


You may now save these herbs in a separate file:

In [63]:
with open('./herb_annots.json', 'w') as f:
    json.dump(herb_annots, f, indent=4)

*You may now proceed to the [next notebook](./2.TCMxPlore%20Agent%20Annotation.ipynb), in which we demonstrate how AI agents can be used to personalize such formulas.*