# Genome-microbiome interactions

We now look at associations between microbes and metabolites that are modulated by the genetic background of the host. This comes down to testing interaction terms between host genotype and microbial abundance in a linear model with the metabolite as the response. We will again use the confounder-corrected metabolite abundances and do a train and validation split.
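To make the idea of an interaction term concrete, here is a minimal sketch on synthetic data. It is not the analysis from this notebook: the effect sizes, the sample size and the helper `fit_r2` are made up for illustration. It fits a baseline model (genotype + microbe) and a full model (genotype + microbe + genotype × microbe) by ordinary least squares and reports the gain in R² attributable to the interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # hypothetical sample size

# Synthetic data, for illustration only
genotype = rng.integers(0, 3, size=n).astype(float)  # allele dosage 0, 1 or 2
microbe = rng.normal(size=n)                          # stand-in for a CLR abundance
noise = rng.normal(size=n)
metabolite = 0.3 * genotype + 0.2 * microbe + 0.5 * genotype * microbe + noise


def fit_r2(columns, y):
    """Least-squares fit with an intercept; returns coefficients and R^2."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, 1.0 - resid.var() / y.var()


# Baseline model: main effects only
_, base_r2 = fit_r2([genotype, microbe], metabolite)
# Full model: main effects plus the genotype x microbe product
beta_full, full_r2 = fit_r2([genotype, microbe, genotype * microbe], metabolite)

interaction_beta = beta_full[3]
interaction_r2 = full_r2 - base_r2
print(f"interaction beta={interaction_beta:.3f}, added R2={interaction_r2:.3f}")
```

The difference `full_r2 - base_r2` plays the same role as the `interaction_r2` column computed later in this notebook.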

Let's start by loading the required packages and reading the required data sets.

In [1]:
from pyplink import PyPlink
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import fdrcorrection
from rich.progress import track
import arivale_data_interface as adi
from utils import rsid2gene
import warnings
warnings.simplefilter("ignore")

metabolites = pd.read_csv("data/metabolites_residuals.csv")
microbes = pd.read_csv("data/genera_clr_filtered.csv")
microbes = microbes[microbes.stool_sample_id.isin(metabolites.stool_sample_id)]
genotypes = PyPlink("input_bed/all_chr/all_genomes_09112019_all_chr")

# Load the metabolite-taxon and metabolite-SNP association results
sig_microbes = pd.read_csv("data/sig_metabolite_taxon.csv")
sig_snps = pd.read_csv("data/final_results.csv")
# Add gene annotations to the variants by querying dbSNP
sig_snps["rsid"] = sig_snps.SNP.str.split(";").str[0]
genes = rsid2gene(sig_snps.rsid.unique())
sig_snps = sig_snps.merge(genes, left_on="rsid", right_on="snp", how="left")


Now we can start with the interaction terms. We will run tests for all metabolites that have a significant genetic association. We start by assembling the microbiome and metabolite data.

In [2]:
metabolites_and_microbes = pd.merge(metabolites, microbes, on="stool_sample_id")

Now we can run the association tests. We start by writing a function that runs the tests for a single metabolite-microbe combination, looping over all SNPs associated with that metabolite.

In [3]:
MIN_N = 30
genome_ids = genotypes.get_fam().iid.values

def interactions(args):
    """Run the interaction analysis for a metabolite-microbe pair."""
    met, mic = args
    snps = sig_snps.SNP[sig_snps.metabolite == met]
    geno = pd.DataFrame({s: genotypes.get_geno_marker(s) for s in snps}, index=genome_ids)
    df = pd.merge(metabolites[["genome_id", "stool_sample_id", met]], geno, left_on="genome_id", right_index=True)
    df = pd.merge(df, metabolites_and_microbes[["stool_sample_id"] + [mic]], on="stool_sample_id").dropna(subset=[mic])
    
    result = pd.DataFrame(
        index = [met + ':' + mic + ':' + i for i in snps],
        columns = ['metabolite','taxon','snps', 'snp_beta', 'interaction_beta','p', 
                   'baseline_r2', 'interaction_r2', 'full_r2', 'n_major', 'n_heterozygous', 'n_homozygous']
    )
    
    if df.shape[0] < MIN_N:
        return result
    
    for snp in snps:
        n_allele = df[snp].value_counts()
        if (n_allele < MIN_N).any() or len(n_allele) < 3:
            continue
        gxe_term = f"Q('{snp}'):Q('{mic}')"
        formula_base = f"{met} ~ Q('{snp}') + Q('{mic}')"
        formula_full = f"{met} ~ Q('{snp}') + Q('{mic}') + {gxe_term}"
        base_fit = smf.ols(formula_base, data=df).fit()
        full_fit = smf.ols(formula_full, data=df).fit()
        result.loc[met + ':' + mic + ':' + snp] = [
            met, 
            mic, 
            snp,
            full_fit.params.loc[f"Q('{snp}')"],
            full_fit.params.loc[gxe_term],
            full_fit.pvalues.loc[gxe_term],
            base_fit.rsquared,
            full_fit.rsquared - base_fit.rsquared,
            full_fit.rsquared,
            n_allele.get(0, 0),
            n_allele.get(1, 0),
            n_allele.get(2, 0)
        ]
    return result
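The genotype filter inside `interactions` is easy to miss, so here is a small stand-alone sketch of the same check on toy data. The helper name `passes_genotype_filter` and the toy dosage lists are hypothetical, and the sketch assumes dosages are coded 0, 1 and 2. A SNP is only tested if at least three genotype classes are observed and none of them has fewer than `MIN_N` carriers.

```python
import pandas as pd

MIN_N = 30  # same threshold as in the notebook


def passes_genotype_filter(dosages, min_n=MIN_N):
    """Mirror of the check in `interactions`: skip SNPs with sparse genotype classes."""
    counts = pd.Series(dosages).value_counts()
    return not ((counts < min_n).any() or len(counts) < 3)


# Toy dosage vectors, for illustration only
common = [0] * 100 + [1] * 60 + [2] * 35        # all classes well populated
rare = [0] * 150 + [1] * 40 + [2] * 5           # too few minor homozygotes
missing_class = [0] * 150 + [1] * 50            # only two classes observed

for name, dosages in [("common", common), ("rare", rare), ("missing_class", missing_class)]:
    print(name, passes_genotype_filter(dosages))
```

Requiring every class to be populated keeps the interaction slope from being driven by a handful of individuals in the rarest genotype group.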

With that we can run our models for all combinations.

In [4]:
from os import path
from multiprocessing import Pool

if not path.exists("data/interaction_results.csv"):
    with Pool(8) as pool:
        args = [(met, mic) for met in list(sig_snps.metabolite.unique())+["metabolite_100000961"] for mic in microbes.columns[1:]]
        it = track(pool.imap_unordered(interactions, args), total=len(args), description='Fitting models')
        results = pd.concat(list(it))
    results['q'] = fdrcorrection(results.p)[1]
    results.to_csv("data/interaction_results.csv", index=False)
else:
    results = pd.read_csv("data/interaction_results.csv")
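Because `interactions` leaves the rows of skipped SNPs empty, `results.p` can contain missing values. One way to make the handling of those rows explicit is to apply the correction only to the tests that were actually run. The following is a minimal sketch of the Benjamini-Hochberg procedure with that behavior; `bh_qvalues` is a hypothetical helper written for illustration, not a statsmodels function.

```python
import numpy as np


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values; missing inputs stay missing."""
    p = np.asarray(pvalues, dtype=float)
    q = np.full(p.shape, np.nan)
    tested = ~np.isnan(p)
    m = int(tested.sum())  # number of tests actually performed
    if m == 0:
        return q
    order = np.argsort(p[tested])
    # Scale the sorted p-values by m / rank
    ranked = p[tested][order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    q[tested] = adjusted
    return q


q = bh_qvalues([0.01, np.nan, 0.04, 0.03, 0.005])
print(q)
```

With this variant the number of tests `m` counts only fitted models, so skipped SNPs neither inflate the correction nor receive a q-value.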

Now we merge in the annotations for genes and metabolites and extract the bacterial genus and family.
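The taxon labels in this analysis have the form `Family|Genus`, so the genus and family can be recovered by splitting on the pipe character. As a toy illustration (the two labels are taken from the results table, the variable names are arbitrary):

```python
import pandas as pd

taxa = pd.Series(
    ["Ruminococcaceae|UBA1819", "Lachnospiraceae|Coprococcus_1"], name="taxon"
)

# Split each label once on "|" and spread the pieces over two columns
parts = taxa.str.split("|", expand=True)
parts.columns = ["family", "genus"]
print(parts)
```

Using `expand=True` yields both columns in one pass, which is equivalent to indexing the split result with `.str[0]` and `.str[1]`.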

In [5]:
metabolite_meta = adi.get_snapshot("metabolomics_metadata", clean=True) [["CHEMICAL_ID", "BIOCHEMICAL_NAME", "SUPER_PATHWAY", "SUB_PATHWAY", "CAS", "KEGG"]]
metabolite_meta["metabolite"] = "metabolite_" + metabolite_meta.CHEMICAL_ID.astype(str)
merged = pd.merge(results, metabolite_meta, on="metabolite")
merged = pd.merge(merged, sig_snps[["SNP", "CHR", "genes", "rsid"]].drop_duplicates(), left_on="snps", right_on="SNP")
merged["genus"] = merged.taxon.str.split("|").str[1]
merged["family"] = merged.taxon.str.split("|").str[0]

merged.sort_values(by=["p", "interaction_r2"], inplace=True)
merged.to_csv("data/interaction_results_annotated.csv", index=False)
merged

Unnamed: 0,metabolite,taxon,snps,snp_beta,interaction_beta,p,baseline_r2,interaction_r2,full_r2,n_major,...,SUPER_PATHWAY,SUB_PATHWAY,CAS,KEGG,SNP,CHR,genes,rsid,genus,family
9541,metabolite_100001104,Ruminococcaceae|Ruminococcaceae_UCG-002,rs56672945,0.309384,-0.032553,0.000001,0.189084,0.014297,0.203381,944,...,Amino Acid,Phenylalanine and Tyrosine Metabolism,537-55-3,,rs56672945,2,ALMS1,rs56672945,Ruminococcaceae_UCG-002,Ruminococcaceae
1632,metabolite_340,Ruminococcaceae|UBA1819,rs1047891,0.158704,0.024361,0.000001,0.149686,0.013943,0.163629,755,...,Amino Acid,"Glycine, Serine and Threonine Metabolism",56-40-6,C00037,rs1047891,2,CPS1,rs1047891,UBA1819,Ruminococcaceae
2524,metabolite_803,Ruminococcaceae|Faecalibacterium,rs1260326,-0.041809,0.02759,0.000002,0.049705,0.014559,0.064264,247,...,Carbohydrate,"Fructose, Mannose and Galactose Metabolism",3458-28-4,C00159,rs1260326,2,GCKR,rs1260326,Faecalibacterium,Ruminococcaceae
18315,metabolite_100010896,Lachnospiraceae|Coprococcus_1,rs7048932,0.249172,0.021118,0.000004,0.240782,0.011562,0.252344,607,...,Nucleotide,"Pyrimidine Metabolism, Uracil containing",2140-76-3,,rs7048932,9,PHYHD1,rs7048932,Coprococcus_1,Lachnospiraceae
1881,metabolite_100001294,Ruminococcaceae|UBA1819,rs1047891,0.190981,0.032918,0.000007,0.101733,0.012099,0.113832,755,...,Peptide,Gamma-glutamyl Amino Acid,1948-29-4,,rs1047891,2,CPS1,rs1047891,UBA1819,Ruminococcaceae
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
906,metabolite_244,Bifidobacteriaceae|Bifidobacterium,rs6800284,-0.08661,-0.0,0.999882,0.047732,0.0,0.047732,477,...,Nucleotide,"Pyrimidine Metabolism, Uracil containing",56-41-7;107-95-9,C00099,rs6800284,3,,rs6800284,Bifidobacterium,Bifidobacteriaceae
22769,metabolite_999946997,Ruminococcaceae|Candidatus_Soleaferrea,rs765285,0.161182,-0.000002,0.999889,0.052567,0.0,0.052567,324,...,,,,,rs765285,6,SLC17A1,rs765285,Candidatus_Soleaferrea,Ruminococcaceae
14710,metabolite_100002259,Lachnospiraceae|Fusicatenibacter,rs12134854,-0.182032,-0.000001,0.999923,0.062553,0.0,0.062553,819,...,Lipid,Fatty Acid Metabolism(Acyl Carnitine),98930-66-6,,rs12134854,1,SLC44A5,rs12134854,Fusicatenibacter,Lachnospiraceae
16583,metabolite_100009078,Erysipelotrichaceae|Erysipelotrichaceae_UCG-003,rs35853021,0.156265,-0.000001,0.999923,0.04425,0.0,0.04425,638,...,Lipid,Phospholipid Metabolism,,,rs35853021,15,,rs35853021,Erysipelotrichaceae_UCG-003,Erysipelotrichaceae


Let's have a look at the significant interactions.

In [6]:
selected = merged[merged.q < 0.05].sort_values(by="interaction_r2", ascending=False)[[
    "BIOCHEMICAL_NAME", "interaction_r2", "interaction_beta", "baseline_r2", 
    "SNP", "genes", "n_major", "n_heterozygous", "n_homozygous", "taxon", "p", "q"]]

In [7]:
selected.taxon.value_counts()

Ruminococcaceae|Ruminococcaceae_UCG-013     20
Ruminococcaceae|Angelakisella               13
Ruminococcaceae|UBA1819                     12
Ruminococcaceae|Ruminiclostridium_5         12
Ruminococcaceae|DTU089                      11
                                            ..
Acidaminococcaceae|Phascolarctobacterium     1
Streptococcaceae|Streptococcus               1
Ruminococcaceae|Ruminococcus_2               1
Eggerthellaceae|Adlercreutzia                1
Desulfovibrionaceae|Bilophila                1
Name: taxon, Length: 79, dtype: int64

In [8]:
selected.head()

Unnamed: 0,BIOCHEMICAL_NAME,interaction_r2,interaction_beta,baseline_r2,SNP,genes,n_major,n_heterozygous,n_homozygous,taxon,p,q
2524,mannose,0.014559,0.02759,0.049705,rs1260326,GCKR,247,758,564,Ruminococcaceae|Faecalibacterium,2e-06,3e-06
9541,N-acetyltyrosine,0.014297,-0.032553,0.189084,rs56672945,ALMS1,944,529,96,Ruminococcaceae|Ruminococcaceae_UCG-002,1e-06,1e-06
1632,glycine,0.013943,0.024361,0.149686,rs1047891,CPS1,755,663,151,Ruminococcaceae|UBA1819,1e-06,1e-06
3609,"cys-gly, oxidized",0.012553,0.041491,0.052908,rs258341,DPEP1,261,785,523,Ruminococcaceae|Faecalibacterium,8e-06,1.6e-05
1881,gamma-glutamylglycine,0.012099,0.032918,0.101733,rs1047891,CPS1,755,663,151,Ruminococcaceae|UBA1819,7e-06,9e-06
