GitHub - transcript/SAMSA: Programs and documentation for the SAMSA pipeline, version 1.0

Edit: There's a new version of SAMSA, designated as SAMSA2, that is now available!

SAMSA, this pipeline, is designed to work in conjunction with MG-RAST as the annotation engine. These scripts take the outputs from MG-RAST and help further analyze and break down the results.

SAMSA2, on the other hand, is designed to run locally, either on a user's computer or on a computing cluster. It uses DIAMOND to annotate results against any custom database that the user chooses. SAMSA2 installs with the NCBI RefSeq database for organism matches and uses TheSeed's SEED Subsystems database for functional annotation, including functional hierarchy information.

SAMSA2 is available at https://www.github.com/transcript/SAMSA2

README for SAMSA pipeline

SAMSA version 1.0.0, created winter 2016

Created by Sam Westreich, stwestreich@ucdavis.edu, github.com/transcript/SAMSA

Citation information:

Westreich, S.T., Korf, I., Mills, D.A., Lemay, D.G. (29 September 2016). SAMSA: A comprehensive metatranscriptome analysis pipeline. BMC Bioinformatics, 17(1):399.

Included files:

SAMSA_pre_annotation_pipeline.py - runs Trimmomatic, FLASH (if paired-end files), prepares raw sequences for upload to MG-RAST
MG-RAST_API_downloader.py - allows for batch downloads of annotation results files from MG-RAST via API commands
analysis_counter.py - creates a sorted summary of abundance counts from MG-RAST annotation results files
analysis_counter_w-md5s.py - performs the same step as analysis_counter, but maintains the m5nr ID from MG-RAST for transcript identification
RefSeq_output_reducer.py - consolidates identical organism matches in summary files
func_data_trimmer.py - removes excess functional annotation data from summary files
R_organism_script.Rmd - imports organism summary files and performs statistical significance correlations
R_functional_script_w-md5s.Rmd - imports functional category summary files and performs statistical significance correlations
long_tail_threshold.py - permits thresholding of summary files at a chosen cutoff percentage
taxonomy_shifter.py - converts organism summary files (initially at Genus level) to other taxonomic levels for analysis
Bacteria_Genus_flattened.tsv - reference database for taxonomy_shifter.py
md5-R-results-query.py - submits MG-RAST internal IDs for a functional transcript file to MG-RAST's API to determine the RefSeq ID
functional_search_by_organism.py - retrieves all functional annotation hits for a specific organism
package_install.Rmd - short R script that installs all necessary R packages.
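
To illustrate the kind of summary that analysis_counter.py and long_tail_threshold.py are described as producing, here is a minimal sketch. It is not the code from those scripts. It assumes the annotations have already been reduced to one name per annotated read, and it assumes that "thresholding at a cutoff percentage" means dropping entries whose share of all reads falls below the cutoff. The function names and the sample data are hypothetical.

```python
from collections import Counter


def summarize(annotations):
    """Return (percent, count, name) rows sorted by descending count."""
    counts = Counter(annotations)
    total = sum(counts.values())
    rows = []
    # most_common() yields entries from the highest count to the lowest.
    for name, count in counts.most_common():
        rows.append((100.0 * count / total, count, name))
    return rows


def threshold(rows, cutoff_percent):
    """Keep only rows whose share of all reads is at least cutoff_percent."""
    return [row for row in rows if row[0] >= cutoff_percent]


# Hypothetical genus-level hits, one entry per annotated read.
hits = ["Bifidobacterium"] * 6 + ["Bacteroides"] * 3 + ["Escherichia"]
summary = summarize(hits)
for percent, count, name in threshold(summary, 20.0):
    print("%.1f\t%d\t%s" % (percent, count, name))
```

With a 20 percent cutoff, the single Escherichia read (10 percent of the total) falls into the long tail and is left out of the printed table.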

Dependencies:
	SAMSA has several dependencies, including the programs FLASH and Trimmomatic, Python 
	version 2.7, and R (preferably run through RStudio).  For a complete list of the
	required Python modules and R packages, see the documentation.  
	
	The documentation also contains links to download Trimmomatic and FLASH.
	
	SAMSA links with MG-RAST and is currently used with MG-RAST version 3.6, which was
	current as of May 2016.

The purpose of the SAMSA pipeline is to handle beginning-to-end metatranscriptome analysis. Starting with raw sequence files in fastq format, it cleans the reads and removes adaptor contamination with Trimmomatic, then merges paired-end files with FLASH (if specified). The files must then be submitted to MG-RAST for annotation.

After annotation is complete, the results are downloaded, aggregated into summary files, and then analyzed using R and DESeq2. The results can be viewed in either graphical or tabular format, focusing either on organism-specific activity or functional activity within the sample.
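
As a rough illustration of the aggregation step, the sketch below combines per-sample counts into a single table with one row per organism or function and one column per sample, which is the general shape of count matrix that DESeq2 works on. It is not taken from the SAMSA R scripts. The function name, the sample names, and the counts are all hypothetical, and the real summary file layout is described in the documentation.

```python
def build_count_matrix(samples):
    """Combine per-sample counts into one table.

    samples maps a sample name to a dict of {feature name: raw count}.
    Returns (sample_names, rows), where each row is
    (feature, [count in each sample]) and absent features count as 0.
    """
    sample_names = sorted(samples)
    features = sorted({f for counts in samples.values() for f in counts})
    rows = []
    for feature in features:
        rows.append((feature, [samples[s].get(feature, 0) for s in sample_names]))
    return sample_names, rows


# Hypothetical per-sample organism counts.
samples = {
    "control_1": {"Bacteroides": 30, "Bifidobacterium": 60},
    "treated_1": {"Bacteroides": 45, "Escherichia": 5},
}
names, rows = build_count_matrix(samples)
print("\t".join(["feature"] + names))
for feature, counts in rows:
    print("\t".join([feature] + [str(c) for c in counts]))
```

Filling missing entries with zero keeps every sample column the same length, so the table can be written out as tab-separated text and read into R in one step.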

The SAMSA pipeline is designed to run in an environment with Python and R installed.
It also requires an internet connection, as annotations are currently handled by the MG-RAST analysis server. Users of the SAMSA pipeline must have an MG-RAST account for assigning annotations.

For more information on how each step in the pipeline functions, consult the included documentation.

At the time of this writing, the SAMSA pipeline does not have any install routine.
Each program listed above must be run from the command line, either individually for each metatranscriptome file or using a bash batch script. For more details on the necessary inputs for each program, run that specific program from the command line with the "-usage" flag, or consult the documentation.
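
As an alternative to a bash batch script, the same loop can be sketched in Python. The only flag taken from this README is "-usage". The helper names, the "{input}" placeholder convention, and the interpreter name "python" are assumptions made for illustration, and the real arguments for each program should be copied from its usage text.

```python
import subprocess


def usage_command(script, interpreter="python"):
    """Command that asks a SAMSA script to print its own usage text."""
    return [interpreter, script, "-usage"]


def batch_commands(script, input_files, arg_template, interpreter="python"):
    """Build one command per input file.

    arg_template is a list of arguments in which the literal string
    "{input}" is replaced by each file name. The real flags differ per
    script and must be copied from that script's -usage output.
    """
    commands = []
    for path in sorted(input_files):
        args = [arg.replace("{input}", path) for arg in arg_template]
        commands.append([interpreter, script] + args)
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)


# Printing the command is harmless; run_all() would actually execute it.
print(" ".join(usage_command("analysis_counter.py")))
```

Building the commands as lists before running them makes it easy to print and check the whole batch first, and avoids shell quoting problems with unusual file names.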
