# Cohort assembly

Here we go through the Arivale data and assemble the cohort for the analysis. The first goal is a mapping from public client IDs to each individual's genome, metabolite abundances, and microbiome profile. This mapping is then used to assemble the fit and validation cohorts.

In [1]:
import warnings
warnings.simplefilter("ignore")

## Genetic variants

We start by reading the set of samples with assigned variants from the generated BED files and matching them to the Arivale client list.

In [2]:
import arivale_data_interface as adi
from pyplink import PyPlink

# Sample IDs (iid) present in the merged PLINK file set
fam = PyPlink("input_bed/all_chr/all_genomes_09112019_all_chr").get_fam()
dashboard = adi.get_snapshot("genetics_snp")
# Keep clients whose genome vendor is NEXTCODE and whose genome ID appears in the FAM file
has_genome = dashboard.genome_id.isin(fam.iid) & (dashboard.genome_vendor == "NEXTCODE")
with_genomes = dashboard[has_genome][["public_client_id", "genome_id"]]
with_genomes.shape

(2629, 2)
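
The filter above combines two boolean masks. As a minimal sketch on made-up data (the IDs, the vendor name "OTHER", and the `fam_iids` stand-in for `fam.iid` are illustrative assumptions, not Arivale values), the same pattern looks like this:

```python
import pandas as pd

# Hypothetical stand-in for fam.iid: genome IDs present in the BED files
fam_iids = pd.Series(["G1", "G2", "G3"])

# Hypothetical stand-in for the genetics dashboard
dashboard = pd.DataFrame({
    "public_client_id": ["A", "B", "C", "D"],
    "genome_id": ["G1", "G2", "G9", "G3"],
    "genome_vendor": ["NEXTCODE", "OTHER", "NEXTCODE", "NEXTCODE"],
})

# Parentheses around the comparison are required because & binds tighter than ==
keep = dashboard.genome_id.isin(fam_iids) & (dashboard.genome_vendor == "NEXTCODE")
with_genomes = dashboard.loc[keep, ["public_client_id", "genome_id"]]
print(with_genomes)
```

Client B is dropped because of its vendor, and client C because its genome ID is not in the file set.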

## Metabolome data

We now load the metabolome data and check how many individuals with genomes also have metabolomics samples.

In [3]:
mets = adi.get_snapshot("metabolomics_corrected")
with_metabolomics = mets[["public_client_id", "sample_id", "days_in_program"]].rename(
    columns={"days_in_program": "blood_days_in_program", "sample_id": "blood_sample_id"}
)
with_metabolomics

,public_client_id,blood_sample_id,blood_days_in_program
0,01000261,A477AV558-002,65
1,01001621,A391BM948-002,265
2,01001621,A776BI445-003,11
3,01002183,A595AV320-002,13
4,01002412,A294AU415-002,13
...,...,...,...
3300,HX409129,A581BK409-002,5
3301,HX460562,A641BO324-003,28
3302,HX794171,A229BM682-002,56
3303,INEW,A750AX220-002,149


In [4]:
with_genomes.public_client_id.isin(with_metabolomics.public_client_id).sum()

1964

## Microbiome data

Next we get the sample IDs for the microbiome data. These are read from a recent reprocessing of the 16S data with DADA2 and SILVA.

In [6]:
import pandas as pd

# Drop samples without a collection day; they cannot be matched to blood draws
micro = pd.read_csv("/proj/arivale/microbiome/16S_processed/metadata.csv").dropna(subset=["days_in_program"])
with_microbiome = micro[["public_client_id", "vendor_observation_id", "days_in_program"]].rename(
    columns={"days_in_program": "stool_days_in_program", "vendor_observation_id": "stool_sample_id"}
)
# Integer days; merge_asof later requires matching key dtypes on both sides
with_microbiome["stool_days_in_program"] = with_microbiome["stool_days_in_program"].astype("int64")
micro.shape

(5231, 24)

In [7]:
pd.Series(with_microbiome.public_client_id.unique()).isin(with_metabolomics.public_client_id).sum()

1905
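
The count above is taken over unique client IDs because one individual can contribute several stool samples. A small sketch with made-up IDs shows how the row-level and individual-level counts differ:

```python
import pandas as pd

# Hypothetical client IDs; "A" has two stool samples
stool_ids = pd.Series(["A", "A", "B", "C"])
blood_ids = pd.Series(["A", "B", "B", "D"])

# Counts stool rows whose client also has blood data
row_overlap = stool_ids.isin(blood_ids).sum()

# Counts individuals with both data types
person_overlap = pd.Series(stool_ids.unique()).isin(blood_ids).sum()

print(row_overlap, person_overlap)
```

The row-level count includes client A twice, so only the second number answers "how many individuals have both data types".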

We also create a table of individuals that have blood (metabolome) and stool (microbiome) samples taken within 30 days of each other.

In [8]:
# For each blood sample, find the same individual's stool sample closest in time
with_micro_metab = pd.merge_asof(
    with_metabolomics.sort_values(by="blood_days_in_program"),
    with_microbiome.sort_values(by="stool_days_in_program"),
    by="public_client_id",
    left_on="blood_days_in_program",
    right_on="stool_days_in_program",
    direction="nearest",
)
# Keep only pairs collected at most 30 days apart
diffs = with_micro_metab.blood_days_in_program - with_micro_metab.stool_days_in_program
with_micro_metab = with_micro_metab[diffs.abs() <= 30].reset_index(drop=True)
with_micro_metab.public_client_id.nunique()

1623
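
To make the pairing step concrete, here is a minimal sketch of the same nearest-match-then-filter pattern on toy data. The client IDs and day values are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy blood and stool sample tables (made-up values)
blood = pd.DataFrame({
    "public_client_id": ["A", "A", "B", "C"],
    "blood_days_in_program": [10, 200, 15, 40],
})
stool = pd.DataFrame({
    "public_client_id": ["A", "B", "C"],
    "stool_days_in_program": [12, 90, 45],
})

# merge_asof needs both frames sorted by their key and keys of the same dtype
paired = pd.merge_asof(
    blood.sort_values("blood_days_in_program"),
    stool.sort_values("stool_days_in_program"),
    by="public_client_id",
    left_on="blood_days_in_program",
    right_on="stool_days_in_program",
    direction="nearest",
)

# Discard pairs whose collection days are more than 30 days apart
gap = (paired.blood_days_in_program - paired.stool_days_in_program).abs()
paired = paired[gap <= 30].reset_index(drop=True)
print(paired)
```

Here client A's day-10 blood sample pairs with the day-12 stool sample, while A's day-200 blood sample and client B's only pair are dropped for exceeding the 30-day window. `merge_asof` also accepts a `tolerance` argument, which would leave out-of-window rows unmatched (missing stool values) rather than requiring the explicit filter.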

## Combining all data types

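Finally, we select all individuals with all three data types. Some individuals have two or more blood and/or fecal samples. We track those cases in a separate table but remove non-baseline samples from the default data set by keeping only the first pair per individual, which is the earliest blood sample because the pairs are ordered by blood_days_in_program.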

In [9]:
with_all = pd.merge(with_genomes, with_micro_metab.drop_duplicates(subset=["public_client_id"]), on="public_client_id")

In [10]:
with_all.shape

(1569, 6)
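
The baseline selection relies on row order: `drop_duplicates` keeps the first row per individual, so the table must be sorted by collection day for that row to be the earliest sample. A small sketch on invented data (the explicit `sort_values` is added here for safety; in the notebook the order comes from the earlier `merge_asof`):

```python
import pandas as pd

# Toy blood-stool pairs; client "A" has a baseline and a follow-up pair
pairs = pd.DataFrame({
    "public_client_id": ["A", "B", "A"],
    "blood_days_in_program": [10, 15, 200],
    "stool_days_in_program": [12, 20, 190],
})

# One row per individual: the pair with the earliest blood sample
baseline = (
    pairs.sort_values("blood_days_in_program")
    .drop_duplicates(subset=["public_client_id"])
    .reset_index(drop=True)
)

# All rows belonging to individuals with more than one pair
repeated = pairs[pairs.public_client_id.duplicated(keep=False)]
print(baseline)
print(repeated)
```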

In [11]:
with_all.to_csv("data/all_feature_types.csv", index=False)

In [12]:
with_all_multiple = pd.merge(with_genomes, with_micro_metab, on="public_client_id")

In [13]:
with_all_multiple.shape

(1786, 6)

In [14]:
with_all_multiple.to_csv("data/all_feature_types_multiple.csv", index=False)