Bio-PanGenome

The algorithm works as follows:

1. Extract protein sequences for each CDS from GFF files, reorienting to the positive strand.
2. Create a combined protein sequence file, filtering out proteins with more than 5% missing data (assembly errors).
3. Cluster sequences with 99% identity and 99% length using cd-hit.
4. Run a parallel all-against-all blastp on the clustered sequences.
5. Cluster with TribeMCL.
6. Reinflate the MCL groups with the cd-hit clusters.
7. Transfer annotation (gene names) to the groups.
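Two of the steps above, the 5% missing-data filter and the reinflation of MCL groups, can be sketched in a few lines. This is only an illustration in Python, not the code this module uses: the function names are hypothetical, and it assumes that missing data shows up as `X` residues in the protein sequence and that each MCL group lists cd-hit representative IDs.

```python
def missing_fraction(protein, missing_char="X"):
    """Fraction of residues equal to the (assumed) unknown-residue symbol."""
    if not protein:
        return 1.0  # treat an empty sequence as entirely missing
    return protein.upper().count(missing_char) / len(protein)


def filter_proteins(proteins, max_missing=0.05):
    """Keep proteins whose missing fraction does not exceed max_missing."""
    return {
        name: seq
        for name, seq in proteins.items()
        if missing_fraction(seq) <= max_missing
    }


def reinflate(mcl_groups, cdhit_clusters):
    """Expand each representative ID in an MCL group to its cd-hit cluster.

    mcl_groups:     list of lists of representative sequence IDs
    cdhit_clusters: dict mapping a representative ID to all member IDs
                    (assumed to include the representative itself)
    """
    inflated = []
    for group in mcl_groups:
        members = []
        for rep in group:
            # A representative with no recorded cluster stands for itself.
            members.extend(cdhit_clusters.get(rep, [rep]))
        inflated.append(members)
    return inflated
```

For example, `reinflate([["a1", "b7"]], {"a1": ["a1", "a2"]})` expands the single group to `["a1", "a2", "b7"]`.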

Outputs:

1. Plot of group frequency in isolates, so you can visually see core and accessory genes
2. Spreadsheet with statistics on groups, annotation, etc.
3. FASTA file with a single representative sequence per group
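The numbers behind a group frequency plot can be sketched as follows. This is a rough Python illustration with hypothetical function names and input layout; in particular, it assumes "core" means a group present in every isolate, which may be stricter than the threshold used in practice.

```python
from collections import Counter


def group_frequencies(groups):
    """Map each group name to the number of distinct isolates it occurs in.

    groups: dict mapping a group name to a list of isolate names, one
            entry per gene, so an isolate may appear more than once.
    """
    return {name: len(set(isolates)) for name, isolates in groups.items()}


def frequency_histogram(groups):
    """Count how many groups occur in exactly k isolates, for each k."""
    return Counter(group_frequencies(groups).values())


def split_core_accessory(groups, n_isolates):
    """Split group names into core (in all isolates) and accessory."""
    freqs = group_frequencies(groups)
    core = sorted(g for g, n in freqs.items() if n == n_isolates)
    accessory = sorted(g for g, n in freqs.items() if n < n_isolates)
    return core, accessory
```

The histogram from `frequency_histogram` is the data a frequency plot would draw: groups found in all isolates sit at one end and rarely seen accessory groups at the other.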

Querying the data:

1. Given two sets of isolates, output the genes unique to each set, and the set of common genes (3 files).
2. Given a set of isolates, output the union, intersection or complement.
3. Given a list of genes, create multifasta files for each gene from all isolates.
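The first two queries are plain set operations, sketched below in Python. The function names and the `genes_by_isolate` layout are hypothetical. The sketch also assumes that a gene belongs to a set of isolates if any isolate in the set carries it, and that the complement is taken against all genes seen in any isolate.

```python
def genes_in_any(isolates, genes_by_isolate):
    """Union: genes present in at least one of the given isolates."""
    result = set()
    for isolate in isolates:
        result |= genes_by_isolate[isolate]
    return result


def genes_in_all(isolates, genes_by_isolate):
    """Intersection: genes present in every one of the given isolates."""
    sets = [genes_by_isolate[isolate] for isolate in isolates]
    return set.intersection(*sets) if sets else set()


def genes_in_none(isolates, genes_by_isolate):
    """Complement: genes in the whole collection absent from these isolates."""
    all_genes = genes_in_any(genes_by_isolate.keys(), genes_by_isolate)
    return all_genes - genes_in_any(isolates, genes_by_isolate)


def compare_sets(set_a, set_b, genes_by_isolate):
    """Return (unique to A, unique to B, common to both)."""
    a = genes_in_any(set_a, genes_by_isolate)
    b = genes_in_any(set_b, genes_by_isolate)
    return a - b, b - a, a & b
```

The three sets returned by `compare_sets` correspond to the three output files mentioned in the first query.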

Dependencies

- exonerate
- BedTools
- gsed