# Process transporter data and run DESeq2

Because statistical analysis of metagenomes may suffer from genes with low abundance ([Jonsson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727335/)), we filter out transporters with an average read count below 100. This is a trade-off between producing trustworthy results and producing any results at all: filtering at a higher average read count would remove too many transporters. In addition, the statistical analysis is performed on a representative protein family for each transporter cluster. Representative families are selected by sorting the families by mean abundance across the samples.

Representative protein families were identified as part of the [01.process_data.ipynb](01.process_data.ipynb) notebook.
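As a rough illustration of the selection idea, the sketch below ranks protein families by mean abundance and keeps the top-ranked family per transporter. It is not the code from `01.process_data.ipynb`: the toy counts, the `fam2transporter` mapping, and the `pick_representatives` helper are hypothetical, and taking the single most abundant family is an assumption about how the sorted list is used.

```python
import pandas as pd

# Hypothetical toy data: rows are protein families, columns are samples.
counts = pd.DataFrame(
    {"sample1": [10, 200, 5], "sample2": [30, 100, 15]},
    index=["famA", "famB", "famC"],
)
# Hypothetical mapping from protein family to transporter cluster.
fam2transporter = pd.Series({"famA": "T1", "famB": "T1", "famC": "T2"})


def pick_representatives(counts, fam2transporter):
    """Return a Series mapping each transporter to its most abundant family."""
    # Sort families by mean abundance across samples, highest first.
    means = counts.mean(axis=1).sort_values(ascending=False)
    # Look up the transporter of each family in that sorted order.
    transporters = fam2transporter.reindex(means.index)
    # The first family seen for each transporter has the highest mean abundance.
    first = transporters.drop_duplicates(keep="first")
    return pd.Series(first.index, index=first.values)


reps = pick_representatives(counts, fam2transporter)
print(reps)
```

Here `famB` would represent `T1` because its mean count is higher than that of `famA`.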

In [1]:
import pandas as pd

Read selected transporters.

In [2]:
transinfo = pd.read_table("selected_transporters_classified.tab", header=0, sep="\t", index_col=0)

Read raw counts for transporters (calculated from representative protein families).

In [3]:
mg_trans_reps = pd.read_table("results/mg/rep_trans.raw_counts.tsv", header=0, sep="\t", index_col=0)
mt_trans_reps = pd.read_table("results/mt/rep_trans.raw_counts.tsv", header=0, sep="\t", index_col=0)

Intersect with the selected transporters.

In [4]:
mg_select_trans_reps = mg_trans_reps.reindex(transinfo.index)
mt_select_trans_reps = mt_trans_reps.reindex(transinfo.index)
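Note that `reindex` is not a strict intersection: selected transporters that are missing from the count table are kept as all-NaN rows. The sketch below, using hypothetical toy tables in place of `transinfo` and the raw counts, shows the difference and why it should not change the outcome of a mean-based filter, since a NaN mean never satisfies `>= threshold`.

```python
import pandas as pd

# Hypothetical toy tables standing in for transinfo and a raw count table.
selected = pd.DataFrame({"name": ["x", "y", "z"]}, index=["T1", "T2", "T3"])
counts = pd.DataFrame({"s1": [150, 20], "s2": [250, 40]}, index=["T1", "T4"])

# reindex keeps every selected transporter; those without counts become NaN rows.
reindexed = counts.reindex(selected.index)

# A strict intersection keeps only transporters present in both tables.
shared = counts.index.intersection(selected.index)
intersected = counts.loc[shared]

# Comparisons against NaN are False, so both routes keep the same rows.
threshold = 100
kept_reindexed = reindexed.loc[reindexed.mean(axis=1) >= threshold]
kept_intersected = intersected.loc[intersected.mean(axis=1) >= threshold]

print(reindexed)
print(list(kept_reindexed.index), list(kept_intersected.index))
```

If the unfiltered tables are used elsewhere, dropping the NaN rows explicitly (for example with `dropna(how="all")`) may be the safer choice.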

## Filter out transporters with low coverage

In [5]:
threshold = 100

In [6]:
mg_select_trans_reps_filt = mg_select_trans_reps.loc[mg_select_trans_reps.mean(axis=1) >= threshold]
mg_select_trans_reps_filt.to_csv("results/mg/rep_trans_filt.raw_counts.tsv", sep="\t")
print("{} transporters remaining after filtering".format(len(mg_select_trans_reps_filt)))

41 transporters remaining after filtering


In [7]:
mt_select_trans_reps_filt = mt_select_trans_reps.loc[mt_select_trans_reps.mean(axis=1) >= threshold]
mt_select_trans_reps_filt.to_csv("results/mt/rep_trans_filt.raw_counts.tsv", sep="\t")
print("{} transporters remaining after filtering".format(len(mt_select_trans_reps_filt)))

23 transporters remaining after filtering
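To get a feel for the trade-off behind the cutoff, the following sketch counts how many rows survive at a few different thresholds. The `filter_by_mean` helper and the toy counts are hypothetical and only mirror the filtering expression used above; the numbers say nothing about the real data.

```python
import pandas as pd


def filter_by_mean(counts, threshold):
    """Keep rows whose mean count across samples is at least `threshold`."""
    return counts.loc[counts.mean(axis=1) >= threshold]


# Hypothetical counts: rows are transporters, columns are samples.
toy = pd.DataFrame(
    {"s1": [500, 120, 90, 5], "s2": [700, 80, 130, 15]},
    index=["T1", "T2", "T3", "T4"],
)

# Number of transporters remaining at each candidate threshold.
remaining = {t: len(filter_by_mean(toy, t)) for t in (10, 100, 500)}
print(remaining)
```

Running the same loop on the real tables would show how quickly the set of testable transporters shrinks as the threshold rises.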


## Run the DESeq2 Rscript

In [8]:
!Rscript run_deseq2.R

1: Setting LC_COLLATE failed, using "C" 
2: Setting LC_TIME failed, using "C" 
3: Setting LC_MESSAGES failed, using "C" 
4: Setting LC_MONETARY failed, using "C" 
Loading required package: S4Vectors
Loading required package: methods
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colMeans, colSums, colnames,
    dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, mapply, m

mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
final dispersion estimates
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
fitting model and testing
converting counts to integer mode
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and 

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
final dispersion estimates
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
fitting model and testing
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting mode