General idea: Find out whether genetics influence the microbiome by comparing the beta diversity between the two samples within each monozygotic and dizygotic twin pair, then test whether the difference between the two groups is significant.
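
As a rough illustration of this comparison, the sketch below runs a permutation test on within-pair beta diversity distances. The distance values and the names `mz_distances`, `dz_distances` and `permutation_test` are made up for illustration; the real values would come from the beta diversity distance matrix, and a different test may suit the data better.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # reassign the group labels at random
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # add-one correction so the p-value is never exactly zero
    return (hits + 1) / (n_permutations + 1)

# hypothetical within-pair distances (one value per twin pair)
mz_distances = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33]
dz_distances = [0.42, 0.39, 0.47, 0.44, 0.40, 0.45]

p_value = permutation_test(mz_distances, dz_distances)
print(f"mean MZ = {mean(mz_distances):.3f}, mean DZ = {mean(dz_distances):.3f}, p = {p_value:.4f}")
```

If monozygotic pairs are consistently closer to each other than dizygotic pairs, the p-value of such a test would be small.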

In [5]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
import matplotlib.pyplot as plt
%matplotlib inline
from operator import itemgetter
import matplotlib.patches as mpatches
from scipy.stats import shapiro

or_dir = '../data' #original data (demux sequences, metadata)
data_dir = 'data' #data from polybox (ASV, taxonomy analysis)


1. Separate the metadata table into mono- and dizygotic twins and generate a table for each twin individually. (Or maybe for each pair?)

In [25]:
metadata = pd.read_csv(or_dir + '/metadata.tsv', sep = '\t')
host_numbers = metadata['host_id'].unique()
    
host_numbers

array([42.1, 27.2, 28.1, 28.2, 39.2,  8.1,  8.2, 29.1, 40.1, 40.2, 35.1,
       35.2, 47.1, 47.2,  4.1,  4.2, 29.2,  3.1, 30.2, 36.1, 36.2,  6.1,
        6.2, 30.1, 33.1, 33.2, 43.2, 44.1, 44.2, 45.1, 45.2,  5.1, 37.1,
       37.2, 39.1, 46.1,  3.2, 43.1, 42.2, 46.2,  5.2, 27.1, 48.2, 48.1,
       32.1, 32.2, 12.2, 13.2, 14.1, 14.2, 10.1, 10.2, 12.1, 13.1, 15.1,
       15.2, 16.1, 25.1, 25.2, 26.2, 11.1,  2.1,  2.2, 20.1, 20.2, 21.1,
       21.2, 23.1, 23.2, 19.2, 16.2, 17.1, 17.2, 18.1, 18.2, 19.1, 24.2,
       11.2, 24.1, 26.1])

In [26]:
for host_id in host_numbers:
    # select all rows that belong to this host (one twin)
    df = metadata[metadata['host_id'] == host_id]
    # save the table as a TSV file named after the host;
    # sep must be the tab escape '\t' ('/t' raises
    # TypeError: "delimiter" must be a 1-character string)
    df.to_csv(f'host_{host_id}.tsv', sep='\t', index=False)
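
For the "for each pair" variant, a minimal sketch is shown below. It assumes that `host_id` encodes `<pair>.<twin>` (for example, 42.1 is twin 1 of pair 42), which the values above suggest but which is not confirmed. The `toy` table and the column names `pair_id` and `twin` are invented for illustration.

```python
import pandas as pd

# toy stand-in for the metadata table; only the columns used here
toy = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's4', 's5'],
    'host_id': [42.1, 42.2, 8.1, 8.2, 42.1],
    'zygosity': ['Monozygotic', 'Monozygotic', 'Dizygotic', 'Dizygotic', 'Monozygotic'],
})

# assumption: the integer part is the pair, the first decimal is the twin
toy['pair_id'] = toy['host_id'].astype(int)
toy['twin'] = ((toy['host_id'] - toy['pair_id']) * 10).round().astype(int)

# one table per twin pair and one per zygosity, kept in memory
by_pair = {pair: group for pair, group in toy.groupby('pair_id')}
by_zygosity = {zyg: group for zyg, group in toy.groupby('zygosity')}

print(sorted(by_pair))
print({zyg: len(group) for zyg, group in by_zygosity.items()})
```

Keeping the tables in dictionaries avoids writing many small files; each one can still be saved with `to_csv` if needed.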

2. Problem: some samples contain NaN values for diet_weaning even though the host had already been weaned at an earlier time point. We need to keep those samples and assign them the status weaned, and drop all other samples that contain no weaning information.

In [19]:
metadata[metadata['host_id']==23.1].sort_values(by=['collection_date'])

Unnamed: 0,id,Library Layout,Instrument,collection_date,geo_location_name,geo_latitude,geo_longitude,host_id,age_days,weight_kg,...,birth_length_cm,sex,delivery_mode,zygosity,race,ethnicity,delivery_preterm,diet_milk,diet_weaning,age_months
1600,ERR1311612,PAIRED,Illumina MiSeq,2010-06-09 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,36.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,False,1.0
1589,ERR1311616,PAIRED,Illumina MiSeq,2010-07-08 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,65.0,4.763,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,False,2.0
1258,ERR1310030,PAIRED,Illumina MiSeq,2010-11-10 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,190.0,6.804,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,True,6.0
1259,ERR1310031,PAIRED,Illumina MiSeq,2010-12-03 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,213.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,True,7.0
904,ERR1310681,PAIRED,Illumina MiSeq,2011-01-14 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,256.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,,,8.0
905,ERR1310682,PAIRED,Illumina MiSeq,2011-02-17 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,290.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,,,10.0
906,ERR1310683,PAIRED,Illumina MiSeq,2011-03-16 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,316.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,True,10.0
1617,ERR1311611,PAIRED,Illumina MiSeq,2011-05-05 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,366.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,fd,True,12.0
1587,ERR1311614,PAIRED,Illumina MiSeq,2011-07-07 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,428.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,,,14.0
1586,ERR1311613,PAIRED,Illumina MiSeq,2011-08-05 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,23.1,457.0,,...,47.0,male,Vaginal,Dizygotic,Caucasian,Not Hispanic,True,,,15.0


The NaN values appear to occur only after weaning or in hosts that were never weaned, but not before weaning. (How do we check whether that is really true?)
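
One possible way to check this is sketched below: for every host, sort the samples by date and look for a NaN in `diet_weaning` that comes before the first True. The function name `nan_before_weaning` and the toy table are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def nan_before_weaning(df):
    """Return the host_ids that have a NaN in diet_weaning before their first True."""
    offenders = []
    for host_id, group in df.sort_values('collection_date').groupby('host_id'):
        weaned = group['diet_weaning']
        # True from the first weaned sample onwards
        seen_weaned = weaned.eq(True).cumsum() > 0
        if not seen_weaned.any():
            continue  # host was never weaned, NaN values are expected here
        if (weaned.isna() & ~seen_weaned).any():
            offenders.append(host_id)
    return offenders

# toy data: host 1.2 has a NaN before weaning, the others do not
toy = pd.DataFrame({
    'host_id': [1.1, 1.1, 1.1, 1.2, 1.2, 1.2, 2.1, 2.1],
    'collection_date': ['2010-01-01', '2010-02-01', '2010-03-01',
                        '2010-01-01', '2010-02-01', '2010-03-01',
                        '2010-01-01', '2010-02-01'],
    'diet_weaning': [False, True, None, None, False, True, False, None],
})

print(nan_before_weaning(toy))
```

An empty list for the real metadata table would support the observation.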

If so, we can assign True to each NaN value of a host if metadata['diet_weaning'].sum() >= 1 for that host.

example:

In [22]:
example = metadata[metadata['host_id'] == 42.1]  # separate table for one twin (the loop above would already do this step)
example.sort_values(by=['collection_date'])  # we can see that the NaN values only appear after weaning

# note: fillna(True) on the whole table fills NaN in every column (see weight_kg in the output),
# so the final version should fill only the diet_weaning column
example.sort_values(by=['collection_date']).fillna(True)
# idea: for each host, if example['diet_weaning'].sum() >= 1, fill the NaN values in diet_weaning with True

Unnamed: 0,id,Library Layout,Instrument,collection_date,geo_location_name,geo_latitude,geo_longitude,host_id,age_days,weight_kg,...,birth_length_cm,sex,delivery_mode,zygosity,race,ethnicity,delivery_preterm,diet_milk,diet_weaning,age_months
567,ERR1315586,PAIRED,Illumina MiSeq,2011-05-28 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,65.0,5.018,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,False,2.0
445,ERR1315248,PAIRED,Illumina MiSeq,2011-06-25 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,92.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,False,3.0
718,ERR1315184,PAIRED,Illumina MiSeq,2011-07-24 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,122.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,False,4.0
174,ERR1314292,PAIRED,Illumina MiSeq,2011-08-13 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,142.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,5.0
39,ERR1314167,PAIRED,Illumina MiSeq,2011-09-11 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,171.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,6.0
753,ERR1314978,PAIRED,Illumina MiSeq,2011-10-19 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,208.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,7.0
0,ERR1314182,PAIRED,Illumina MiSeq,2011-11-11 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,232.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,8.0
317,ERR1314733,PAIRED,Illumina MiSeq,2011-12-15 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,266.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,9.0
1419,ERR1313849,PAIRED,Illumina MiSeq,2012-01-08 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,290.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,10.0
1423,ERR1313857,PAIRED,Illumina MiSeq,2012-02-08 00:00:00,"USA, Missouri, St. Louis",38.63699,-90.263794,42.1,321.0,True,...,47.0,male,Cesarean,Monozygotic,Caucasian,Not Hispanic,True,fd,True,11.0
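
A minimal sketch of the rule described above, restricted to the `diet_weaning` column so that other columns such as `weight_kg` keep their NaN values. It assumes that NaN values really occur only after weaning; the function name `fill_weaning` and the toy data are hypothetical.

```python
import pandas as pd

def fill_weaning(df):
    """Fill NaN in diet_weaning with True for hosts with at least one weaned sample."""
    out = df.copy()
    # hosts that were recorded as weaned at least once
    weaned_hosts = out.loc[out['diet_weaning'].eq(True), 'host_id'].unique()
    to_fill = out['diet_weaning'].isna() & out['host_id'].isin(weaned_hosts)
    out.loc[to_fill, 'diet_weaning'] = True
    # drop the rows that still carry no weaning information
    return out.dropna(subset=['diet_weaning'])

# toy data: host 42.1 was weaned, host 9.1 never was
toy = pd.DataFrame({
    'host_id': [42.1, 42.1, 42.1, 9.1, 9.1],
    'weight_kg': [5.0, None, None, 4.0, None],
    'diet_weaning': [False, True, None, False, None],
})

filled = fill_weaning(toy)
print(filled)
```

The last sample of host 9.1 is dropped because there is no evidence that this host was ever weaned.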


3. Filter feature table according to metadata table

In [24]:
! qiime demux filter-samples --help

Usage: qiime demux filter-samples [OPTIONS]

  Filter samples indicated in given metadata out of demultiplexed data.
  Specific samples can be further selected with the WHERE clause, and the
  `exclude_ids` parameter allows for filtering of all samples not specified.

Inputs:
  --i-demux ARTIFACT SampleData[SequencesWithQuality¹ |
    PairedEndSequencesWithQuality² | JoinedSequencesWithQuality³]
                       The demultiplexed data from which samples should be
                       filtered.                                    [required]
Parameters:
  --m-metadata-file METADATA...
    (multiple          Sample metadata indicating which sample ids to filter.
     arguments will    The optional `where` parameter may be used to filter
     be merged)        ids based on specified conditions in the metadata. The
                       optional `exclude-ids` parameter may be used to exclude
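
A hedged sketch of how the filtering could be prepared from the metadata table: write the ids of the samples to keep into a small metadata file and assemble the command around it. The sample ids, the file names `demux.qza` and `demux_monozygotic.qza`, and the helper `write_sample_ids` are made up. The output option is not visible in the truncated help text and is assumed here to be `--o-filtered-demux`. The command is only assembled, not executed.

```python
import os
import tempfile
import pandas as pd

def write_sample_ids(df, path):
    """Write the sample ids of df as a minimal metadata file (one 'id' column)."""
    df[['id']].to_csv(path, sep='\t', index=False)
    return path

# toy stand-in for the metadata table with made-up sample ids
toy = pd.DataFrame({
    'id': ['ERR0000001', 'ERR0000002', 'ERR0000003'],
    'zygosity': ['Monozygotic', 'Dizygotic', 'Monozygotic'],
})

out_dir = tempfile.mkdtemp()
ids_file = write_sample_ids(toy[toy['zygosity'] == 'Monozygotic'],
                            os.path.join(out_dir, 'monozygotic_ids.tsv'))

# hypothetical artifact names; --o-filtered-demux is an assumption
command = [
    'qiime', 'demux', 'filter-samples',
    '--i-demux', 'demux.qza',
    '--m-metadata-file', ids_file,
    '--o-filtered-demux', 'demux_monozygotic.qza',
]
print(' '.join(command))
```

In the notebook, the printed line could be run in a cell with a leading `!` once the file names match the real artifacts.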
      

4. Get a table for each twin pair and each stage, then find F values for the twin column with ANCOM to show differences between individuals.

5. Run ANCOM on the zygosity column to find significance.