Skip to content

SethCommichaux/AMRplusplus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 

Repository files navigation

Finding the RPKM for AMRplusplus output files

These scripts are for creating a matrix of RPKM values from a set of AMRplusplus output files.

The dependencies are python3 and the numpy, pandas, and biopython python3 libaries.

To calculate the RPKM values, we need to know the number of reads and mean read length per fastq file that was analyzed with AMRplusplus. The following script and command can be used to find the number of reads and mean read length for a list of fastq files, provided as a file with one file path per line. The script works for fastq files and gzipped fastq files with a "gz" suffix (*.fastq.gz). The output is written to a tabular file called "read_counts.txt" whose columns list the base filename, the number of reads, and the mean read length. Here is an example command.

python fastq_read_count.py file_of_filepaths.txt

To find the RPKM values, two input files are required. The first is a file of paths to the AMRplusplus *.gene files, with one file path per line. For example,

/path/to/sample1.gene
/path/to/sample2.gene
/path/to/sample3.gene

The second input file is a tab delimited list of the samples, the number of reads per sample, and the mean read length. For example,

sample1 19545332  151
sample2 11791473  149
sample3 12888422  123

Notice that the sample names correspond to the prefix of the *.gene file names.

With these two file you can run the main script for calculating the rpkm values. This will output two tab-delimited files. One for the rpkm values by gene class (rpkm_table_class.txt) and one for the rpkm values per gene (rpkm_table_gene.txt).

python amr_normalized_abundance.py file_of_filenames.txt read_counts.txt

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages