# Prevention study

First, we curate all of the datasets associated with the prevention study. These include:
- Microbiome data
- Proteomics
- Hormone data
- Inflammation data

The final files will be found under `processed_data` named
- 1706_1145_otu_table.biom (otu_table)
- 1706_1145_mapping.txt (hormone, inflammation, all of the other data)

In [1]:
import pandas as pd
import numpy as np
from biom import load_table
import re

In [2]:
mapping = pd.read_table('../original_data/1706_prep_1145_qiime_20160731-151247.txt')
otu_table = load_table('../original_data/otu_table_prev.biom')
inflammation_mapping = pd.read_excel('../original_data/inflammation_prevention.xlsx')
hormone_mapping = pd.read_excel('../original_data/prevention_hormones_analysis.xlsx')

In [3]:
submapping = mapping.loc[mapping['body_site'] == 'UBERON:distal colon']
remaining_mapping = mapping.loc[mapping['body_site'] != 'UBERON:distal colon']

First, we'll match up all of the IDs corresponding to the inflammation markers to the IDs 
in the mapping file.

In [4]:
# rename column name so that it doesn't clash with the name in the original mapping file
inflammation_mapping=inflammation_mapping.rename(columns = {'#SampleID':'orig_name'})

submapping = pd.merge(submapping, inflammation_mapping, 
                      how='outer', on='orig_name')
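
An outer merge keeps rows that have no partner on the other side, so it can be useful to check which IDs failed to match. The following is a minimal sketch on toy data, not the real tables: the frames `toy_mapping` and `toy_inflammation`, the `marker` column, and all values are made up for illustration. It relies on the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented for illustration).
toy_mapping = pd.DataFrame({'orig_name': ['DD1', 'DD2', 'DD3'],
                            'host_subject_id': ['1', '2', '3']})
toy_inflammation = pd.DataFrame({'orig_name': ['DD2', 'DD3', 'DD4'],
                                 'marker': [0.5, 1.2, 0.9]})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
merged = pd.merge(toy_mapping, toy_inflammation,
                  how='outer', on='orig_name', indicator=True)

# IDs that appear in only one of the two tables.
unmatched = merged.loc[merged['_merge'] != 'both', 'orig_name'].tolist()
print(sorted(unmatched))
```

In the notebook, the same check could be run on `submapping` and `inflammation_mapping` before the `_merge` column is dropped.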

Next, we would link the sample names found in the proteomics file and add a column to the mapping file that matches
those sample names.

EDIT:  We will refrain from doing this here, because the numbers in the sample indices don't line up with the
host_subject ids.  It is not clear which proteomics samples originated from which pigs.

In [5]:
# proteome_ids = proteome_table.ids(axis='sample')
# index = list(map(lambda x: x.split('_')[1], proteome_ids))
# proteome_ids = pd.DataFrame({'proteome_name': proteome_ids,
#                              'host_subject_id':index})

# pd.merge(submapping, proteome_ids, how='outer', 
#          on='host_subject_id')['proteome_name']

Finally, we'll add weight, feed and hormone information to the mapping file.

In [6]:
hormone_mapping = hormone_mapping.rename(columns = {'Sample ID':'hormone_sample_name'})
hormone_mapping['#SampleID'] = ['1706.DD%d.prev'%i for i in hormone_mapping.hormone_sample_name]
submapping = pd.merge(submapping, hormone_mapping, how='outer', 
                      on='#SampleID')
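
The `'%d'` formatting used to build `#SampleID` raises an error if a hormone sample ID is missing. Below is a small sketch of a more defensive version. The helper name `to_sample_id` and its `template` argument are hypothetical; only the `1706.DD%d.prev` pattern comes from the cell above.

```python
import math

def to_sample_id(pig_number, template='1706.DD%d.prev'):
    """Build a mapping-file sample ID from a numeric hormone ID (hypothetical helper)."""
    # Missing values cannot be formatted with %d, so pass them through as None.
    if pig_number is None:
        return None
    if isinstance(pig_number, float) and math.isnan(pig_number):
        return None
    # int() also accepts whole-number floats such as 12.0, as read from Excel.
    return template % int(pig_number)

ids = [to_sample_id(n) for n in [7, 12.0, float('nan')]]
print(ids)
```

Rows whose ID comes back as `None` simply find no partner in the subsequent merge.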

It is also important to note that some of the pigs were actually placed in separate cages
(see `Animal housing details.xlsx`).  Specifically, pigs 1-50 were placed in a separate pen
from all of the other pigs.


In [7]:
mapping = pd.concat((submapping, remaining_mapping))

In [11]:
def cage_number(x):
    try:
        if int(x) <= 50:
            return '0'
        else:
            return '1'
    except ValueError:
        return 'NA'
mapping['cage'] = mapping.host_subject_id.apply(cage_number)

def potato_color(x):
    if pd.isnull(x):
        return 'Not applicable'
    if x == 'Control_1':
        return 'Control'
    if x == 'Control_2':
        return 'HCD'
    if 'purple' in x:
        return 'purple'
    if 'white' in x:
        return 'white'
    else:
        #raise ValueError('x (%s) is bad' % x)
        return 'Not applicable'
    
mapping['color'] = mapping['treatment'].apply(potato_color)
    
def potato_processing(x):
    if pd.isnull(x):
        return 'Not applicable'    
    if x == 'Control_1':
        return 'Control'
    if x == 'Control_2':
        return 'HCD'
    if 'baked' in x:
        return 'baked'
    if 'raw' in x:
        return 'raw'
    if 'chipped' in x:
        return 'chipped'
    else:
        #raise ValueError('x (%s) is bad' % x)
        # use the same placeholder as potato_color for unrecognized treatments
        return 'Not applicable'
        
mapping['processing'] = mapping['treatment'].apply(potato_processing)

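def week_number(x):
    # missing values and any capitalization of 'not applicable' have no week;
    # an exact match on 'Not applicable' let other spellings fall through to int()
    if pd.isnull(x) or str(x).strip().lower() == 'not applicable':
        return np.nan
    if x == 'Final':
        return 14
    # remaining timepoints are assumed to look like 'week 5'
    return int(x.split(' ')[1])

mapping['week'] = mapping.timepoint.apply(week_number)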

So, we have combined most of the metadata into a single file.

__Some outstanding questions__
1. Which pigs did the proteomics samples originate from?  Until this is known, it is not possible to correlate proteomics with anything else.
2. What are the actual proteomics ids?  Right now, they are only labeled numerically from 1 to 4000.
3. Where exactly was the inflammation measured?  (Here we assumed that it was measured in the distal colon.)

In [None]:
mapping.to_csv('../processed_data/1706_1145_mapping.txt', sep='\t')

# Reversal Study


Next, we curate all of the datasets associated with the reversal study. These include:

- Microbiome data
- Inflammation data

The final files will be found under `processed_data` named
- 1706_1911_otu_table.biom (otu_table)
- 1706_1911_mapping.txt (hormone, inflammation, all of the other data)

In [None]:
mapping = pd.read_table('../original_data/mapping_jairam_reversal.txt')
otu_table = load_table('../original_data/otu_table_rev.biom')  
inflammation_mapping = pd.read_excel('../original_data/Reversal study inflammatory markers.xlsx')

# includes weight, feed intake and some hormone data
mapping_parameters = pd.read_table('../original_data/mapping_all_parameters.txt') 

Since there is no explicit timepoint variable, we'll add one.

In [None]:
def timepoint(x):
    if 'Initial' in x:
        return 'week 0'
    else:
        return 'week 5'
mapping['timepoint'] = mapping.site_potato.apply(timepoint)

Since there are a whole bunch of missing samples, we'll assign the inflammation values to the 3 main body sites.

In [None]:
distal_mapping = mapping.loc[mapping['Ext_Desc']=='Distal_Colon']
proximal_mapping = mapping.loc[mapping['Ext_Desc']=='Proximal_Colon']
ileum_mapping = mapping.loc[mapping['Ext_Desc']=='Ileum']

In [None]:
inflammation_mapping['#SampleID'] = ['DD%d'%i for i in inflammation_mapping['Pig ID #']]
distal_mapping = pd.merge(distal_mapping, inflammation_mapping, how='inner',
                          on='#SampleID').dropna(subset=['BarcodeSequence'])

inflammation_mapping['#SampleID'] = ['ID%d'%i for i in inflammation_mapping['Pig ID #']]
ileum_mapping = pd.merge(ileum_mapping, inflammation_mapping, how='inner',
                         on='#SampleID').dropna(subset=['BarcodeSequence'])

inflammation_mapping['#SampleID'] = ['PD%d'%i for i in inflammation_mapping['Pig ID #']]
proximal_mapping = pd.merge(proximal_mapping, inflammation_mapping, how='inner',
                            on='#SampleID').dropna(subset=['BarcodeSequence'])

In [None]:
mapping = pd.concat((distal_mapping, ileum_mapping, proximal_mapping), axis=0)
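
The three merges and the concatenation above follow one pattern: build site-specific sample IDs from `Pig ID #`, inner-merge, and drop rows without a barcode. The sketch below shows one way to express that pattern once, on toy data. The helper `attach_inflammation`, the `site_prefixes` dictionary, the `marker` column, and all values are illustrative assumptions; the column names `#SampleID`, `Ext_Desc`, `BarcodeSequence`, and `Pig ID #` are taken from the cells above.

```python
import pandas as pd

def attach_inflammation(site_mapping, inflammation, prefix):
    """Merge inflammation values onto one body site's rows (hypothetical helper)."""
    # Work on a copy so the caller's inflammation table is left untouched.
    keyed = inflammation.copy()
    keyed['#SampleID'] = ['%s%d' % (prefix, i) for i in keyed['Pig ID #']]
    merged = pd.merge(site_mapping, keyed, how='inner', on='#SampleID')
    return merged.dropna(subset=['BarcodeSequence'])

# Toy data (invented): DD2 has no barcode and should be dropped.
toy_mapping = pd.DataFrame({
    '#SampleID': ['DD1', 'ID1', 'PD1', 'DD2'],
    'Ext_Desc': ['Distal_Colon', 'Ileum', 'Proximal_Colon', 'Distal_Colon'],
    'BarcodeSequence': ['ACGT', 'TGCA', 'GGCC', None],
})
toy_inflammation = pd.DataFrame({'Pig ID #': [1, 2], 'marker': [0.3, 0.8]})

site_prefixes = {'Distal_Colon': 'DD', 'Ileum': 'ID', 'Proximal_Colon': 'PD'}
parts = [
    attach_inflammation(toy_mapping.loc[toy_mapping['Ext_Desc'] == site],
                        toy_inflammation, prefix)
    for site, prefix in site_prefixes.items()
]
combined = pd.concat(parts, axis=0)
print(sorted(combined['#SampleID']))
```

Copying the inflammation table inside the helper avoids overwriting its `#SampleID` column between merges, which makes the cells safe to re-run in any order.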

Now we'll add on the weight and feed intake parameters.

In [None]:
mapping = pd.merge(mapping, mapping_parameters, 
                   on=['#SampleID','BarcodeSequence','LinkerPrimerSequence', 
                       'Treatment', 'Potato'])
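
When merging on several key columns, duplicated key combinations silently multiply rows. As a hedged illustration, `pd.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the keys are not unique. The toy frames and the `weight` column below are invented; only the key column names come from the cell above.

```python
import pandas as pd

left = pd.DataFrame({'#SampleID': ['DD1', 'DD2'],
                     'Treatment': ['a', 'b']})
# The key ('DD1', 'a') appears twice here, which would duplicate rows on merge.
right = pd.DataFrame({'#SampleID': ['DD1', 'DD1'],
                      'Treatment': ['a', 'a'],
                      'weight': [10, 11]})

try:
    pd.merge(left, right, on=['#SampleID', 'Treatment'],
             validate='one_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```

Whether `one_to_one` is the right expectation for `mapping` and `mapping_parameters` depends on the real files and is an assumption here.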

Finally, we'll make this mapping file QIITA compliant.

In [None]:
def body_site(x):
    if x == 'Distal_Colon':
        return 'UBERON:distal colon'
    elif x == 'Ileum':
        return 'UBERON:ileum'
    elif x == 'Proximal_Colon':
        return 'UBERON:proximal colon'
    else:
        return 'UBERON:feces'
mapping['body_site'] = mapping.Ext_Desc.apply(body_site)

In [None]:
mapping['instrument'] = '454 GS FLX Titanium'
mapping['center_name'] = 'CSU Fort Collins'
mapping['center_project_name'] = 'Jairam Purple potatoes'
mapping['library_construction_protocol'] = 'Samples sequenced with 515/806rbc primers for V4 of 16S rRNA'
mapping['key_seq'] = 'TCAG'
mapping['linker'] = 'GA'
mapping['platform'] = 'LS454'
mapping['run_prefix'] = 'GW1M5C'
mapping['sequencing_meth'] = 'pyrosequencing'
mapping['target_gene']='16S rRNA'
mapping['target_subfragment'] = 'V4'
mapping['altitude']=0
mapping['common_name'] = 'pig gut metagenome'
mapping['country'] = 'USA'
mapping['depth'] = 0
mapping['elevation'] = 1525
mapping['env_biome'] = 'urban biome'
mapping['env_feature'] = 'animal-associated habitat'
mapping['host_common_name'] = 'pig'
mapping['host_subject_id'] = mapping['Pig ID #']
mapping['latitude'] = 40.58526
mapping['longitude'] = -105.084423
mapping['host_scientific_name'] = 'Sus scrofa'
mapping['host_taxid']='9825'
mapping['physical_specimen_location'] = 'U Penn'
mapping['physical_specimen_remaining']='FALSE'
mapping['public']='FALSE'
mapping['required_sample_info_status']='completed'
mapping['sex']='male'
mapping['taxon_id']='1510822'
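
The constant columns above could also be collected in one dictionary and applied in a single step, which keeps the study-wide values easy to review. This is only a sketch on a toy frame: it uses three of the values from the cell above, and the names `constants` and `toy` are illustrative.

```python
import pandas as pd

# A subset of the study-wide constants from the cell above.
constants = {
    'platform': 'LS454',
    'target_gene': '16S rRNA',
    'target_subfragment': 'V4',
}

toy = pd.DataFrame({'#SampleID': ['DD1', 'DD2']})

# assign() broadcasts each scalar to every row and returns a new frame.
toy = toy.assign(**constants)
print(toy.columns.tolist())
```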

In [None]:
mapping.to_csv('../processed_data/1706_1911_mapping.txt', sep='\t')

__Some outstanding questions__
1. Was there no hormone data collected for the reversal study?
2. Which pigs did the proteomics samples originate from?  Until this is known, it is not possible to correlate proteomics with anything else.
3. What are the actual proteomics ids?  Right now, they are only labeled numerically from 1 to 4000.

Also note that the original QIITA mapping file is missing a whole bunch of samples.
While this new mapping file contains more samples, it doesn't have the same format as the sample ids in QIITA.
But the sample ids in this new mapping file are consistent with the sample ids in the OTU table.
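
A quick way to check this kind of consistency is to compare the two sets of sample ids directly. The sketch below uses plain Python sets on made-up ids; the helper `compare_ids` is hypothetical. In the notebook, the inputs would presumably be `mapping['#SampleID']` and `otu_table.ids(axis='sample')`.

```python
def compare_ids(mapping_ids, table_ids):
    """Report which sample ids are shared or unique to each source (hypothetical helper)."""
    mapping_ids, table_ids = set(mapping_ids), set(table_ids)
    return {
        'shared': sorted(mapping_ids & table_ids),
        'mapping_only': sorted(mapping_ids - table_ids),
        'table_only': sorted(table_ids - mapping_ids),
    }

# Invented ids for illustration only.
report = compare_ids(['DD1', 'ID1', 'PD1'], ['DD1', 'ID1', 'PD2'])
print(report)
```

Empty `mapping_only` and `table_only` lists would indicate that the mapping file and the OTU table describe exactly the same samples.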

# Potatoes

The potato metabolite data is under `original_data/LCMS_potatoes.csv`.  

__Some outstanding concerns__
1. The feature finding was done via XCMS.
2. Only positive mode was analysed.