Home
Welcome to the GEnView tutorial!
In this example, we'll walk you through the process of visualizing a gene's genetic environment in several genomes using the GEnView pipeline. Our goal is to identify and visualize the PER-type resistance genes in the genera Rheinheimera and Pararheinheimera.
Step 1 - genview-makedb
1.1 Downloading and searching genomes/plasmids from NCBI by specifying taxa:
The target genes to search for in the publicly available (Para)Rheinheimera genomes and plasmids are antibiotic resistance genes from https://card.mcmaster.ca/. The first command is therefore:
genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -p 20 -id 80 -scov 80 --taxa 'rheinheimera' 'pararheinheimera' --assemblies --plasmids --log --uniprot_db path/to/uniprotKB.dmnd
If you are running genview-makedb for the first time, you will not yet have downloaded the database specified with the --uniprot_db flag. If you do not specify --uniprot_db, GEnView will prompt you to download the database at the beginning of the run. Due to bandwidth issues at the database storage location, please specify --uniprot_db with the path to the uniprotKB.dmnd file on subsequent runs.
1.2 Downloading genome assemblies/plasmids by accession number
If you have specific genomes in mind that you want to search for target genes, you can supply a file containing their accession numbers with the --accessions flag:
genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -p 20 -id 80 -scov 80 --accessions /path/to/accessions/accessions.txt --log --uniprot_db path/to/uniprotKB.dmnd --assemblies
1.3 Searching local sequences
If you have local sequences that you want to search for your target gene, you can specify the path to those sequences in FASTA format using --local:
genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -id 80 -scov 80 --local /path/to/local/sequence_directory --uniprot_db path/to/uniprotKB.dmnd
Make sure that the headers of your local sequences are as simple as possible and do not contain special characters. If spaces are present, each header will be split at the first space, and the first element of the resulting list (plus an id added by genview) will be used to identify the respective sequence.
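To see in advance which identifiers your local sequences would be reduced to, you can preview the first whitespace-delimited token of each header. This is a minimal sketch that only mimics the "split at the first space" rule described above. It is not GEnView's own code, and it does not reproduce the id that genview adds. The function names and the set of characters treated as "safe" are assumptions made for illustration.

```shell
# Print the first whitespace-delimited token of every FASTA header line.
# $1: path to a FASTA file (hypothetical example input)
preview_fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print only those identifiers that contain characters outside
# letters, digits, underscore, dot and hyphen (an assumed "safe" set).
flag_unusual_ids() {
    preview_fasta_ids "$1" | grep -E '[^A-Za-z0-9_.-]' || true
}

# Example usage (path is a placeholder):
#   preview_fasta_ids /path/to/local/sequence_directory/my_genome.fasta
#   flag_unusual_ids  /path/to/local/sequence_directory/my_genome.fasta
```

If flag_unusual_ids prints nothing, all identifiers consist of characters from the assumed safe set.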
1.4 Updating a database
Since version 0.2, you can also add new genomes to an existing database using the --update flag:
genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -id 80 -scov 80 --taxa 'pararheinheimera' --assemblies --uniprot_db path/to/uniprotKB.dmnd --update
In this case, you specify the target directory of a previous genview run under -d. The database contained in that directory will then be updated with the specified sequences (in this case, all target genes found in the genomes of Pararheinheimera species).
File explanations
protein_fasta_protein_homolog_model.fasta is a FASTA or multi-FASTA file containing the amino acid sequence of the protein to investigate, such as:
>PER-1
MNVIIKAVVTASTLLMVSFSSFETSAQSPLLKEQIESIVIGKKATVGVAVWGPDDLEPLLINPFEKFPMQSVFKLHLAMLVLHQVDQGKLDLNQTVIVNRAKVLQNTWAPIMKAYQGDEFSVPVQQLLQYSVSHSDNVACDLLFELVGGPAALHDYIQSMGIKETAVVANEAQMHADDQVQYQNWTSMKGAAEILKKFEQKTQLSETSQALLWKWMVETTTGPERLKGLLPAGTVVAHKTGTSGIKAGKTAATNDLGIILLPDGRPLLVAVFVKDSAESSRTNEAIIAQVAQTAYQFELKKLSALSPN
The header line for each protein sequence should be as simple as possible and must not contain '|'. Because genview-visualize extracts gene sequences from the database by name, the amino acid sequences of translated genes you want to visualize together should have similar names. For example, different variants of PER-type beta-lactamases should be named PER-1, PER-2 and PER-6, and can later be extracted from the database using -gene per (or PER). If you only want to extract e.g. PER-1, specify 'PER-1' under the -gene flag when using genview-visualize.
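If your reference file uses pipe-delimited headers, they could be simplified before running genview-makedb. The following is a minimal sketch, assuming headers laid out like >gb|ACCESSION|ARO:NNNNNNN|PER-1 [organism] (an assumed layout; check your own file first). It keeps only the last pipe-delimited field up to the first space. The function name simplify_headers and the output file name are hypothetical and not part of GEnView.

```shell
# Rewrite FASTA headers so that only the gene name remains.
# $1: input FASTA file; the simplified FASTA is written to stdout.
simplify_headers() {
    awk '
        /^>/ {
            n = split($0, parts, "[|]")   # split the header on "|"
            name = parts[n]               # keep the last field
            sub(/^>/, "", name)           # drop ">" if there was no "|"
            sub(/ .*/, "", name)          # cut at the first space
            print ">" name
            next
        }
        { print }                         # sequence lines pass through unchanged
    ' "$1"
}

# Example usage (paths are placeholders):
#   simplify_headers protein_fasta_protein_homolog_model.fasta > simplified_reference.fasta
```

Headers that are already simple, such as >PER-1, pass through unchanged.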
/path/to/accessions/accessions.txt is a file containing the GenBank accessions of the assemblies/plasmids you want to search, one accession per line. Example:
GCA_000217935.2
GCA_003990335.1
GCA_001275035.1
GCA_000986865.1
GCF_003862465.1
GCA_004005375.1
GCA_008017875.1
The --assemblies option specifies that these are accession numbers for assemblies stored in the NCBI Assembly database, not plasmids. If you want to include plasmids from the NCBI Nucleotide database, specify --plasmids as well.
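Before starting a long download, it can help to check that every line of the accession file looks like an assembly accession. This is a minimal sketch, assuming the usual NCBI assembly format of GCA_ or GCF_ followed by nine digits and a version suffix, as in the example list above. Plasmid accessions from the Nucleotide database follow a different format and would be flagged by this check. The function name check_assembly_accessions is hypothetical.

```shell
# Print every line (with its line number) that does not look like
# an NCBI assembly accession such as GCA_000217935.2.
# $1: path to the accession list, one accession per line
check_assembly_accessions() {
    grep -nEv '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1" || true
}

# Example usage (path is a placeholder):
#   check_assembly_accessions /path/to/accessions/accessions.txt
```

No output means every line matched the expected pattern. Empty lines, trailing spaces and Windows line endings are reported as well.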
The product of this command is the SQLite database file 'genview_database.db'. It contains all identified instances of the target gene, which genome it is found in and the genes that were predicted and annotated in its genetic environment. The database can be viewed with any SQLite compatible database viewer (e.g. 'DB browser for SQLite').
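This tutorial does not list the table names inside genview_database.db, so a reasonable first step on the command line is to ask SQLite itself. The following is a minimal sketch, assuming the sqlite3 command-line shell is installed. list_tables is a hypothetical helper name, and the path is a placeholder.

```shell
# List the names of all tables in an SQLite database file.
# $1: path to the database, e.g. /path/to/genview_database.db
list_tables() {
    sqlite3 "$1" "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
}

# Example usage (path is a placeholder):
#   list_tables /path/to/genview_database.db
```

Once you know the table names, running sqlite3 /path/to/genview_database.db ".schema" shows the columns of each table.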
Step 2 - genview-visualize
2.1 Visualizing all sequences containing the target gene
Once you have created a genview database, genview-visualize creates a phylogeny-based visualization of the selected target gene locus:
genview-visualize -db /path/to/genview_database.db -id 80 -gene 'your_target_gene_name'
For example, to visualize the PER-type genes in the previously created database containing the Rheinheimera/Pararheinheimera genomes, run:
genview-visualize -gene 'per-' -db /path/to/genview_database.db -id 80
Once the run is done, genview-visualize will print the location of the output file.
2.2 Visualizing only sequences that are less than 95% similar in sequence identity
If you want to visualize only sequences that are more than 5% dissimilar (essentially removing near-duplicates), you can use the --compress flag:
genview-visualize -gene 'per-' -db /path/to/genview_database.db -id 80 --compress
You can further restrict the sequences you want to display with the -taxa flag. Only target sequences contained in the genomes of the specified taxa will be visualized.
This will generate the output files in /path/to/genview_output_directory/per_80_analysis/, as described in the README.txt on the main page, e.g. a visualization of the sequences and metadata on the genes and their genetic environments.
You can now compare the PER-1 gene loci in your browser, e.g. by running firefox /path/to/genview_output_directory/per_80_analysis/interactive_visualization.html
It will look like this, with your target gene marked in red (the default; you can specify custom colors using the --custom_colors flag).
You can inspect specific sequences more closely by clicking on a given gene sequence:
This will give you access to the nucleotide sequence of the respective gene.
Clicking on the line between single genes will give you access to the complete sequence surrounding your target gene:
2.3 Customizing visualization colors
The colors of specific sequences can be customized using the --custom_colors flag of genview-visualize.
In the visualizations above, we have identified several genes coding for transferases and transporter proteins. We want to color the transferases in green, the transporters in pink and the target gene (per-) in blue. custom_colors.txt is a text file that contains coloring instructions for specific annotated sequences. The values in each row are tab-separated. Each row contains the following items: geneclass\tkeyword1,keyword2,...,keywordN\trgb_colorcode. 'geneclass' is a name of your choice.
The file described above would look like this:
target per- rgb(0,128,255)
mfs transporter,mfs rgb(255,0,127)
transferase transferase rgb(102,204,0)
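Because the columns must be separated by tabs, and many editors silently convert tabs to spaces, one option is to write the file with printf. This is a minimal sketch that reproduces the three rows above. It reads the second row as gene class mfs with keywords transporter,mfs, which is an assumption about how that row splits into columns.

```shell
# Write the three rows with literal tab separators (\t).
# printf reuses the format string for each group of three arguments.
printf '%s\t%s\t%s\n' \
    'target'      'per-'            'rgb(0,128,255)' \
    'mfs'         'transporter,mfs' 'rgb(255,0,127)' \
    'transferase' 'transferase'     'rgb(102,204,0)' \
    > custom_colors.txt

# Sanity check: report any row that does not have exactly three
# tab-separated fields. No output means the layout is as expected.
awk -F'\t' 'NF != 3 { print "line " NR ": expected 3 fields, found " NF }' custom_colors.txt
```

The same awk check can be run on a hand-edited file to catch rows where tabs were replaced by spaces.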
Running genview-visualize -gene per -db /path/to/genview_database.db -id 70 --custom_colors custom_colors.txt now results in the following visualization: