macaqueICD

Python, R, and bash scripts used to analyze the gut microbiomes of rhesus macaques with idiopathic chronic diarrhea (ICD). Unless otherwise noted, all tools used are components of the SAMSA2 pipeline (https://github.com/transcript/samsa2).

Base pipeline and analysis

Script used: bash_scripts/macaque_master_script.bash

Explanation: This is the main pipeline script of SAMSA2, set up to run on a cluster for the macaque metatranscriptome files. The reference databases used are NCBI's RefSeq Bacterial non-redundant database (version 102, released 22 December 2015), and TheSEED Subsystems database (retrieved 18 January 2017).

Host read analysis

Script used: bash_scripts/macaque_host_pipeline_script.bash

Explanation: Macaque host reads were screened out of the metatranscriptome after the bacterial sequences had already been analyzed, so a simplified pipeline script was created for the host analysis. The reference database used is NCBI's RefSeq macaque proteins database (retrieved 4 October 2017).

Figure generation R scripts

Shannon and Simpson diversity graphs

Script used: R_scripts/diversity_stats.R

$ Rscript diversity_stats.R -d working_directory/

Diversity was calculated based on the genus counts for annotations against the RefSeq Bacteria database. The vegan package is used for calculating Shannon and Simpson diversity measures.
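As a rough illustration of what the two measures compute, here is a minimal Python sketch. It is not the code in diversity_stats.R, which relies on vegan. It uses the natural-log Shannon index and the 1 - sum(p^2) form of the Simpson index, which are vegan's defaults; the function names and the example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson index in the 1 - sum(p_i^2) form."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical genus counts for a single sample (not data from the study).
example_counts = {"Prevotella": 50, "Bacteroides": 30, "Lactobacillus": 20}
values = list(example_counts.values())
print(f"Shannon: {shannon_index(values):.4f}")
print(f"Simpson: {simpson_index(values):.4f}")
```

Both functions take the raw genus counts for one sample, so they would be applied once per sample before plotting.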

PCA plot

Script used: R_scripts/make_DESeq_PCA.R

$ Rscript make_DESeq_PCA.R working_directory/ PCAplot_save_name

The PCA plot included in the paper was calculated using genus counts for annotations against the RefSeq Bacteria database, comparing the ICD samples to control samples.

Heatmaps

Script used: R_scripts/make_DESeq_heatmap.R
Script used for host heatmap: R_scripts/macaque_host_heatmap.R

$ Rscript make_DESeq_heatmap.R working_directory/ heatmap_save_name

The heatmap in Figure 1 in the paper was calculated using genus counts for annotations against the RefSeq Bacteria database, comparing each sample against the others. Numbers 1-12 indicate the ICD samples, while 13-24 indicate healthy controls. For Figure 9, the heatmap of immune functions found in host annotations draws on functional annotations against NCBI's list of macaque proteins.

Stacked bar graphs

Script used: R_scripts/make_combined_graphs_top10.R

$ Rscript make_combined_graphs_top10.R -g graph_title -d working_directory -o graph_save_name

For Figure 2, the graph was created using genus counts for annotations against the RefSeq Bacteria database. Note that the optional "Other" setting in the script was kept, to preserve Other counts in the figure.
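The idea of keeping the top genera and pooling the rest into "Other" can be sketched in Python as follows. This is an illustration only, not the logic of make_combined_graphs_top10.R, which may differ in its details; the function name, the parameter names, and the example counts are all hypothetical.

```python
def collapse_to_top_n(counts, n=10, other_label="Other"):
    """Keep the n most abundant genera and sum the remainder into one bucket."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    collapsed = dict(ranked[:n])
    remainder = sum(count for _, count in ranked[n:])
    if remainder > 0:
        # Add to any existing "Other" entry instead of overwriting it.
        collapsed[other_label] = collapsed.get(other_label, 0) + remainder
    return collapsed


# Hypothetical genus counts for one sample (not data from the study).
genus_counts = {
    "Prevotella": 900,
    "Bacteroides": 400,
    "Treponema": 250,
    "Lactobacillus": 120,
    "Helicobacter": 30,
    "Campylobacter": 10,
}
print(collapse_to_top_n(genus_counts, n=3))
```

Keeping the pooled bucket means each bar still sums to the sample's full count, which is the reason for preserving "Other" in the figure.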

Subsystems variable-width bar graphs

Script used: R_scripts/Subsystems_DESeq_graphs.R

$ Rscript Subsystems_DESeq_graphs.R -d working_directory/ -L Subsystems_level (1-4) -o graph_save_name

Using scripts included in the SAMSA2 pipeline, the SEED Subsystems annotations were filtered to select only reads that annotated against a Prevotella species in RefSeq Bacteria results. This subset of Subsystems annotations was used for creating a bar graph of the top functions of Prevotella. Figure 3 features reads mapped to level 1 annotations, the broadest category.

Figure 4 in the paper used the same script listed above, but no filtering was applied to the Subsystems-annotated reads, so all Subsystems annotations were used to create the barplot.
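The Prevotella filtering step used for Figure 3 can be pictured with the Python sketch below. It is not the SAMSA2 script itself. It assumes, purely for illustration, that each read's organism and function annotations are already available as dictionaries keyed by read ID; the function name, the data layout, and the example reads are hypothetical.

```python
from collections import Counter


def filter_by_genus(organism_by_read, function_by_read, genus="Prevotella"):
    """Return the functional annotations whose read was assigned to `genus`."""
    selected_reads = {
        read_id
        for read_id, organism in organism_by_read.items()
        # Compare the first word of the organism name to the genus.
        if organism.split()[:1] == [genus]
    }
    return {
        read_id: function
        for read_id, function in function_by_read.items()
        if read_id in selected_reads
    }


# Hypothetical per-read annotations (not data from the study).
organisms = {
    "read1": "Prevotella copri",
    "read2": "Bacteroides fragilis",
    "read3": "Prevotella stercorea",
}
functions = {
    "read1": "Carbohydrates",
    "read2": "Protein Metabolism",
    "read3": "Carbohydrates",
}

prevotella_functions = filter_by_genus(organisms, functions)
print(Counter(prevotella_functions.values()).most_common())
```

Skipping the filter and counting every entry in the function table corresponds to the unfiltered case used for Figure 4.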

Boxplots

Script used: R_scripts/macaque_pathogen_boxplots.R
Script used: R_scripts/mucin_degraders_boxplots.R
Script used: R_scripts/mucin_enzymes_boxplot_all.R

$ Rscript macaque_pathogen_boxplots.R 
$ Rscript mucin_degraders_boxplots.R
$ Rscript mucin_enzymes_boxplot_all.R

These boxplot scripts are not standardized to the same degree as others; they use hard-coded values for initial counts. For Figure 5, the macaque pathogens, values are derived from genus counts in a normalized counts table generated by DESeq, annotated against NCBI Bacteria. Similarly, for Figure 6A, mucin degraders counts are derived from normalized counts of genus-level annotations against the NCBI Bacteria database. For Figure 6B, mucin degrading enzymes counts are derived from normalized counts of functional annotations against the NCBI Bacteria database.
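For readers unfamiliar with DESeq-style normalized counts, the Python sketch below illustrates the median-of-ratios idea behind size-factor normalization. It is a simplified illustration, not the DESeq code used to produce the values in these scripts; the function names and example counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts_by_sample):
    """Median-of-ratios size factors.

    counts_by_sample maps sample name -> list of counts, with the same
    feature (genus) order in every sample.
    """
    samples = list(counts_by_sample)
    n_features = len(counts_by_sample[samples[0]])

    # Log geometric mean per feature; features with any zero count are skipped.
    log_geo_means = []
    for i in range(n_features):
        row = [counts_by_sample[s][i] for s in samples]
        if all(c > 0 for c in row):
            log_geo_means.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_geo_means.append(None)

    if all(lg is None for lg in log_geo_means):
        raise ValueError("every feature has a zero count in at least one sample")

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts_by_sample[s][i]) - lg
            for i, lg in enumerate(log_geo_means)
            if lg is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalize(counts_by_sample):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts_by_sample)
    return {
        s: [c / factors[s] for c in counts]
        for s, counts in counts_by_sample.items()
    }


# Hypothetical counts: sample_B was sequenced twice as deeply as sample_A.
raw = {"sample_A": [10, 20, 30], "sample_B": [20, 40, 60]}
print(size_factors(raw))
print(normalize(raw))
```

After normalization the two hypothetical samples have matching counts, since they differ only in sequencing depth.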

Databases used

Warning: these are links to large files (several gigabytes) that may take time to download. It may be preferable to download them from the command line using wget or curl.

RefSeq database used: https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/RefSeq_bac.fa

SEED Subsystems database used: https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/subsys_db.fa
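As an alternative to wget or curl, the databases can also be fetched from a script. The Python sketch below streams a URL to disk in chunks so a multi-gigabyte file is never held in memory. The function name and the output file names are hypothetical, and the example calls are commented out to avoid starting a large download by accident.

```python
import shutil
import urllib.request


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream `url` to `destination` in chunks of `chunk_size` bytes."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination


# Example calls, using the links listed above (uncomment to run):
# download_file(
#     "https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/RefSeq_bac.fa",
#     "RefSeq_bac.fa",
# )
# download_file(
#     "https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/subsys_db.fa",
#     "subsys_db.fa",
# )
```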
