EbmeyerSt edited this page Apr 29, 2022 · 30 revisions

Welcome to the GEnView tutorial!

In this example, we'll walk you through the process of visualizing a gene's genetic environment in several genomes using the GEnView pipeline. Our goal is to identify and visualize the PER-type resistance genes in the genera Rheinheimera and Pararheinheimera.

Step 1 - genview-makedb


1.1 Downloading and searching genomes/plasmids from NCBI by specifying taxa:

The target genes to search for in the publicly available (Para)Rheinheimera genomes and plasmids are antibiotic resistance genes from CARD (https://card.mcmaster.ca/). The first command is thus:

genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -p 20 -id 80 -scov 80 --taxa 'rheinheimera' 'pararheinheimera' --assemblies --plasmids --log --uniprot_db path/to/uniprotKB.dmnd

If you are running genview-makedb for the first time, you will not yet have downloaded the database specified with the --uniprot_db flag. If you do not specify --uniprot_db, GEnView will prompt you to download the database at the beginning of the run. Due to bandwidth limitations at the database storage location, please specify --uniprot_db with the path to the uniprotKB.dmnd file on subsequent runs.
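If you launch genview-makedb from a script, the rule above (pass --uniprot_db only once the file exists) could be sketched roughly as follows. This is only an illustration: the helper name makedb_command and its parameters are hypothetical, and only flags shown in this tutorial are used.

```python
import os
import shlex


def makedb_command(target_dir, reference_fasta, taxa, uniprot_db=None,
                   processes=20, identity=80, scov=80):
    """Build the argument list for the taxa-based genview-makedb call.

    Hypothetical helper: it only assembles flags that appear in this tutorial.
    --uniprot_db is added only if the given file already exists; otherwise it
    is left out so that GEnView can prompt for the download.
    """
    cmd = [
        "genview-makedb",
        "-d", target_dir,
        "-db", reference_fasta,
        "-p", str(processes),
        "-id", str(identity),
        "-scov", str(scov),
        "--taxa", *taxa,
        "--assemblies", "--plasmids", "--log",
    ]
    if uniprot_db and os.path.isfile(uniprot_db):
        cmd += ["--uniprot_db", uniprot_db]
    return cmd


cmd = makedb_command(
    "/path/to/target_directory",
    "/storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta",
    ["rheinheimera", "pararheinheimera"],
    uniprot_db="path/to/uniprotKB.dmnd",
)
# Print a copy-pasteable command line (shlex.join quotes arguments if needed).
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) if you prefer to start the run from Python rather than from the shell.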

1.2 Downloading genome assemblies/plasmids by accession number

If you have specific genomes in mind that you want to search for target genes, you can supply a file containing their accession numbers with the --accessions flag:

genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -p 20 -id 80 -scov 80 --accessions /path/to/accessions/accessions.txt --log --uniprot_db path/to/uniprotKB.dmnd --assemblies

1.3 Searching local sequences

If you have local sequences that you want to search for your target gene, you can specify the path to those sequences in FASTA format using --local:

genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -id 80 -scov 80 --local /path/to/local/sequence_directory --uniprot_db path/to/uniprotKB.dmnd

Make sure that the headers of your local sequences are as simple as possible and do not contain special characters. If spaces are present, headers will be split at the first space, and the first element of the resulting list (plus an id added by GEnView) will be used to identify the respective sequence.
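To avoid surprises, you may want to simplify headers yourself before running genview-makedb. The following is a minimal sketch, not GEnView's own code: it keeps the text before the first space and replaces anything outside letters, digits, '_', '.' and '-' with an underscore. Which characters count as "special" is an assumption here, and the id that GEnView adds is not reproduced.

```python
import re


def simplify_header(header):
    """Return the header text before the first space, with characters
    outside [A-Za-z0-9_.-] replaced by underscores (assumed rule)."""
    first_token = header.lstrip(">").split(" ", 1)[0]
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", first_token)


def simplify_fasta(lines):
    """Yield FASTA lines with simplified headers; sequence lines are unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + simplify_header(line)
        else:
            yield line


# Hypothetical example record, for illustration only.
example = [">contig_1 Rheinheimera sp. strain X|plasmid", "ATGC"]
cleaned = list(simplify_fasta(example))
print("\n".join(cleaned))
```

If two headers share the same first token, they become identical after this step, so it is worth checking the simplified headers for duplicates.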

1.4 Updating a database

Since version 0.2, you can also add new genomes to an existing database using the --update flag:

genview-makedb -d /path/to/target_directory -db /storage/path/to/reference_db/protein_fasta_protein_homolog_model.fasta -id 80 -scov 80 --taxa 'pararheinheimera' --assemblies --uniprot_db path/to/uniprotKB.dmnd --update

In this case, you specify the target directory of a previous GEnView run under -d. The database contained in that directory will then be updated with the specified sequences (here, all target genes found in the genomes of Pararheinheimera species).

File explanations

protein_fasta_protein_homolog_model.fasta is a FASTA or multi-FASTA file containing the amino acid sequence(s) of the protein(s) to investigate, such as:

>PER-1
MNVIIKAVVTASTLLMVSFSSFETSAQSPLLKEQIESIVIGKKATVGVAVWGPDDLEPLLINPFEKFPMQSVFKLHLAMLVLHQVDQGKLDLNQTVIVNRAKVLQNTWAPIMKAYQGDEFSVPVQQLLQYSVSHSDNVACDLLFELVGGPAALHDYIQSMGIKETAVVANEAQMHADDQVQYQNWTSMKGAAEILKKFEQKTQLSETSQALLWKWMVETTTGPERLKGLLPAGTVVAHKTGTSGIKAGKTAATNDLGIILLPDGRPLLVAVFVKDSAESSRTNEAIIAQVAQTAYQFELKKLSALSPN

The header line for each protein sequence should be as simple as possible and must not contain '|'. Because genview-visualize extracts gene sequences from the database by name, the amino acid sequences of translated genes you want to visualize together should have similar names. For example, different variants of PER-type beta-lactamases should be named PER-1, PER-2 and PER-6, and can then be extracted from the database later using -gene PER. If you only want to extract a single variant, e.g. PER-1, specify 'PER-1' under the -gene flag when using genview-visualize.
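A quick check of the reference file against these naming rules might look like the sketch below. It flags headers containing '|' and lists the names a given keyword would select. The case-insensitive substring match is an assumption made for illustration, not a statement about how genview-visualize matches names internally, and the example records are placeholders.

```python
def header_names(fasta_lines):
    """Return the header text (without '>') of every FASTA record."""
    return [line[1:].strip() for line in fasta_lines if line.startswith(">")]


def check_reference_names(fasta_lines, keyword):
    """Return (headers containing '|', headers matching the keyword).

    Matching is a case-insensitive substring test (assumed for illustration).
    """
    names = header_names(fasta_lines)
    with_pipe = [name for name in names if "|" in name]
    matching = [name for name in names if keyword.lower() in name.lower()]
    return with_pipe, matching


# Placeholder records: sequences are truncated, the third header is hypothetical.
example = [">PER-1", "MNVII", ">PER-2", "MNVII", ">gb|X|OXA-48", "MRVLA"]
with_pipe, matching = check_reference_names(example, "per-")
print("headers to rename:", with_pipe)
print("selected by keyword:", matching)
```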

/path/to/accessions/accessions.txt is a file containing the GenBank accessions of the assemblies/plasmids you want to search, one accession per line. Example:

GCA_000217935.2
GCA_003990335.1
GCA_001275035.1
GCA_000986865.1
GCF_003862465.1
GCA_004005375.1
GCA_008017875.1

The --assemblies option specifies that these are accession numbers for assemblies stored in the NCBI Assembly database, not plasmids. If you want to include plasmids from the NCBI Nucleotide database, specify --plasmids as well.
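Before starting a long run, it can be worth checking that every line of the accession file looks like an assembly accession. The sketch below assumes the usual NCBI assembly pattern (GCA_ or GCF_, nine digits, a dot and a version number). It does not cover plasmid accessions from the Nucleotide database, which follow other formats, and it is not part of GEnView.

```python
import re

# NCBI assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq) + 9 digits + version.
ASSEMBLY_ACCESSION = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def check_accessions(lines):
    """Split lines into (valid, invalid) assembly accessions, ignoring blank lines."""
    valid, invalid = [], []
    for raw in lines:
        accession = raw.strip()
        if not accession:
            continue
        if ASSEMBLY_ACCESSION.match(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid


# The last entry is deliberately malformed to show what gets flagged.
valid, invalid = check_accessions(
    ["GCA_000217935.2", "GCF_003862465.1", "", "GCA_12345"]
)
print("valid:", valid)
print("invalid:", invalid)
```

To check a real file, pass the open file handle instead of the list, e.g. check_accessions(open("/path/to/accessions/accessions.txt")).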

The product of this command is the SQLite database file 'genview_database.db'. It contains all identified instances of the target gene, which genome it is found in and the genes that were predicted and annotated in its genetic environment. The database can be viewed with any SQLite compatible database viewer (e.g. 'DB browser for SQLite').
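If you prefer a script over a graphical viewer, Python's built-in sqlite3 module can list what the database contains. The sketch below makes no assumption about GEnView's table or column names: it simply reads them from the file. The function name list_tables is illustrative.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table name: [column names]} for every table in an SQLite file.

    The file is opened read-only, so a mistyped path raises an error
    instead of silently creating an empty database.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        return {
            table: [col[1] for col in con.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        con.close()


# Usage (path as produced by genview-makedb):
# for table, columns in list_tables("/path/to/genview_database.db").items():
#     print(table, columns)
```

Once you know the table names, ordinary SELECT statements can be used in the same way to pull out individual records.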

Step 2 - genview-visualize


2.1 Visualizing all sequences containing the target gene

Once you have created a GEnView database, genview-visualize creates a phylogeny-based visualization of the selected target gene locus:

genview-visualize -db /path/to/genview_database.db -id 80 -gene 'your_target_gene_name'

For example, to visualize the PER-type genes in the previously created database containing the Rheinheimera/Pararheinheimera genomes, run:

genview-visualize -gene 'per-' -db /path/to/genview_database.db -id 80

Once the run is done, genview-visualize will print the location of the output file.

2.2 Visualizing only sequences that are less than 95% similar in sequence identity

If you only want to visualize sequences that are more than 5% dissimilar (essentially removing duplicates), you can use the --compress flag:

genview-visualize -gene 'per-' -db /path/to/genview_database.db -id 80 --compress

You can further select the sequences you want to display by the -taxa flag. Only target sequences contained in the genomes of the specified taxa will be visualized.

This will generate the output files in /path/to/genview_output_directory/per_80_analysis/, as described in the README.txt on the main page, e.g. a visualization of the sequences and metadata on the genes and their genetic environments.

You can now compare the PER-1 gene loci in your browser, e.g. by running firefox /path/to/genview_output_directory/per_80_analysis/interactive_visualization.html

It will look like this, with your target gene marked in red (the default; you can specify custom colors using the --custom_colors flag).

genview_example

You can inspect specific sequences more closely by clicking on a given gene sequence:

genview_example1

This will give you access to the nucleotide sequence of the respective gene.

Clicking on the line between single genes will give you access to the complete sequence surrounding your target gene:

genview_example2

2.3 Customizing visualization colors

The colors of specific sequences can be customized using GEnView's --custom_colors flag. In the visualizations above, we have identified several genes coding for transferases and transporter proteins. We want to color the transferases in green, the transporters in pink and the target gene (per-) in blue. custom_colors.txt is a text file that contains coloring instructions for specific annotated sequences. The values in each row are tab-separated, and each row contains the following items: geneclass\tkeyword1,keyword2,...,keywordN\trgb_colorcode. 'geneclass' is a name of your choice. The file described above would look like this:

target	per-	rgb(0,128,255)
mfs	transporter,mfs	rgb(255,0,127)
transferase	transferase	rgb(102,204,0)
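Because tabs are easy to lose when copying text from a web page, you may prefer to generate the file programmatically. The sketch below builds the three rows from the format described above. The helper name format_color_rows and the 0-255 range check are additions for illustration, not part of GEnView.

```python
def format_color_rows(rules):
    """Turn (geneclass, [keywords], (r, g, b)) tuples into tab-separated rows."""
    rows = []
    for geneclass, keywords, (r, g, b) in rules:
        for value in (r, g, b):
            if not 0 <= value <= 255:
                raise ValueError(f"RGB value out of range in '{geneclass}': {value}")
        rows.append(f"{geneclass}\t{','.join(keywords)}\trgb({r},{g},{b})")
    return rows


rules = [
    ("target", ["per-"], (0, 128, 255)),             # target gene in blue
    ("mfs", ["transporter", "mfs"], (255, 0, 127)),  # transporters in pink
    ("transferase", ["transferase"], (102, 204, 0)), # transferases in green
]
rows = format_color_rows(rules)
print("\n".join(rows))

# To write the file that is passed to --custom_colors:
# with open("custom_colors.txt", "w") as handle:
#     handle.write("\n".join(rows) + "\n")
```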

Running genview-visualize -gene per -db /path/to/genview_database.db -id 70 --custom_colors custom_colors.txt now results in the following visualization:

genview_custom