# Submodule 4: Comparative genomics analysis
--------
## Overview
In this submodule, you will begin with a directory of proteomes from *de novo* assembled and annotated genomes. A single bacterial genome was analyzed in Submodules 1 & 2 and we automated the process on many genomes in Submodule 3 to produce a total of XX genome sequences. We will add to this dataset with reference genomes publicly available from NCBI. These genome sequences are curated for quality and provide gene accessions with curated functional information. These datasets are crucial for providing context to our new dataset.

Genome assembly is a foundational step that allows researchers to generate a complete picture of the genetic material in a given organism. Assessing the quality and completeness of a genome assembly ensures that it accurately reflects the original genome, providing a reliable foundation for further analysis. Comparative genomics builds on genome assembly by comparing the genomes of different species or individuals. This field focuses on identifying similarities and differences in DNA sequences to understand evolutionary relationships and reveal the genetic underpinnings of adaptation and diversity. Comparative genomics can uncover how species evolve and adapt over time and identify genes associated with specific traits or diseases.

<p align="center">
  <img src="images/toxo_multi_alignment.png" width="80%"/>
</p>

### Learning Objectives

Through this submodule, users will gain experience in comparative genomics and learn how to use existing genomes as reference points to better understand the context of a novel genome. You will learn to perform comparative genomic analyses to identify similarities and differences across genomes, run phylogenomic analyses, and construct pangenomes to capture genetic diversity. You will then apply these techniques to address biological questions and hypotheses.

- **Access Genome Datasets from the NCBI**:<br>
  Participants will use command line tools to access genomes on NCBI for use in comparative genomics analyses.
    
- **Perform and Visualize Comparative Genomics Analyses**:<br>
  Gain proficiency in using comparative genomics tools and visualizing their outputs. Participants will use these outputs to identify patterns across genomes and develop hypotheses about genome relatedness, gene loss or gain events, and gene duplications.

- **Create and Understand Phylogenetic Trees**:<br>
  Use both ortholog groups and average nucleotide identity to create phylogenetic trees. Use these trees to understand species relatedness and taxonomic groupings.

- **Draw Conclusions about Assembled Genome**:<br>
  Using the comparisons to fully annotated genomes, participants should be able to identify patterns in host and strain of comparator genomes and draw their own conclusions about the genomes assembled in submodules 1-3.

## **Install required software**

A few more tools are required for Submodule 4: OrthoFinder, UpSet plots, and fastANI. As with Submodule 1, we will install these tools using __[Conda](https://docs.conda.io/en/latest/)__.

Each piece of software, along with links to publications and documentation, will be described in turn. Below is a brief summary of these tools.

### List of software
| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **OrthoFinder**      | Finds orthogroups and orthologs, infers rooted gene trees for all orthogroups and identifies all of the gene duplication events in those gene trees.                         |
| **UpSet Plots**      | UpSet plots are a data visualization method for showing set data with more than three intersecting sets.                                                  |
| **fastANI**        | Developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as mean nucleotide identity of orthologous gene pairs shared between two microbial genomes.     |
| **seaborn**        | A Python library necessary for creating complex heatmaps.   |

In [60]:
%%bash

# other installs are already complete, we just need to install seaborn
mamba install seaborn --quiet

## Starting Data

This submodule begins with a directory of genomes in FASTA format and a directory of proteomes in FAA (FASTA amino acid) format. This module is designed to work with data produced from Submodule 3, but feel free to replace the FAA files within the directory *proteomes* or add additional FAA files as needed.

In [1]:
%%bash

ls data/proteomes/

SRR10056681.faa
SRR10056778.faa
SRR10056784.faa
SRR10056829.faa
SRR10056855.faa
SRR10056856.faa
SRR10056914.faa
SRR10067958.faa
SRR10068079.faa
SRR10068117.faa


In [2]:
%%bash

ls data/genomes/

SRR10056681.fasta
SRR10056778.fasta
SRR10056784.fasta
SRR10056829.fasta
SRR10056855.fasta
SRR10056856.fasta
SRR10056914.fasta
SRR10067958.fasta
SRR10068079.fasta
SRR10068117.fasta


## Process 1: Downloading Reference Genomes and Proteomes from NCBI
We have the `genomes` and `proteomes` directories we created in Submodules 1-3. To give these datasets context, we can also access thousands of deposited and annotated bacterial genomes and proteomes from NCBI with a few commands.

In [3]:
%%bash

# download the list of all refseq assemblies
if [[ ! -s assembly_summary_refseq.txt ]]
then
    wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt -O assembly_summary_refseq.txt --quiet --no-check-certificate
fi

In [41]:
%%bash

# we can use grep to search for our target organism
grep "Campylobacter jejuni" assembly_summary_refseq.txt | head -n 3

GCF_000009085.1	PRJNA57587	SAMEA1705929	na	reference genome	192222	197	Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819	strain=NCTC 11168	na	latest	Complete Genome	Major	Full	2003-05-06	ASM908v1	Sanger Institute	GCA_000009085.1	identical	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/085/GCF_000009085.1_ASM908v1	na	na	na        	haploid	bacteria	1641481	1641481	30.500000	1	1	1	NCBI RefSeq	Annotation submitted by NCBI RefSeq	2022-07-18	1668	1572	56	10688204;17565669
GCF_000011865.1	PRJNA224116	SAMN02604007	na	na	195099	197	Campylobacter jejuni RM1221	strain=RM1221	na	latest	Complete Genome	Major	Full	2005-01-06	ASM1186v1	TIGR	GCA_000011865.1	identical	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/865/GCF_000011865.1_ASM1186v1	na	na	na        	haploid	bacteria	1777831	1777831	30.500000	1	1	1	NCBI RefSeq	GCF_000011865.1-RS_2024_05_07	2024-05-07	1891	1767	56	15660156
GCF_000015525.1	PRJNA224116	SAMN02604044	na	na	354242	197	Campylobacter jejuni subsp. jejuni 81-176	

<div class="alert alert-block alert-info"><b>Tip</b>: Click on one of the blue highlighted links above to be taken to an assembly page on NCBI's server. Notice how the assembly page is laid out just like a directory and the link looks like a path. This is just like how our cloud server works!</div>

<p align="center">
  <img src="images/ncbi_screenshot.png" width="80%"/>
</p>

Let's use grep again to search for our organism, but this time we are going to `sort` the results randomly and take the top 10 results. We can then use `cut` to select only the 20th column which contains the link. After, we will loop through the links and download them by adding the file we want from the assembly directory to the end of the link.

In [44]:
%%bash

links=$(grep "Campylobacter jejuni" assembly_summary_refseq.txt | sort -R | head -n 10 | cut -f 20)

for link in $links
do
    echo "Downloading $link"

    # the last part of the link is the assembly name, which prefixes every file name
    assembly=$(basename "$link")
    genome_file="${assembly}_genomic.fna.gz"
    proteome_file="${assembly}_protein.faa.gz"

    # builds the fna (genome) and faa (proteome) download links
    fna="$link/$genome_file"
    faa="$link/$proteome_file"

    # wget downloads the file
    # -P specifies a directory prefix
    wget -P data/genomes/ $fna --quiet --no-check-certificate
    wget -P data/proteomes/ $faa --quiet --no-check-certificate

done

Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/574/495/GCF_036574495.1_ASM3657449v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/178/195/GCF_002178195.1_ASM217819v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/877/725/GCF_001877725.1_ASM187772v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/733/045/GCF_037733045.1_ASM3773304v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/297/285/GCF_019297285.1_ASM1929728v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/578/165/GCF_036578165.1_ASM3657816v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/964/233/685/GCF_964233685.1_CNRERY-01897
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/764/295/GCF_001764295.1_ASM176429v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/763/575/GCF_001763575.1_ASM176357v1
Downloading https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/297/005/GCF_019297005.1_ASM1929700v1
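
The link-building step in the loop above can also be sketched in Python, which makes the column arithmetic explicit. This is a minimal sketch, assuming the summary file keeps the assembly directory link in the 20th tab-separated column (as the `cut -f 20` call implies). The function name `build_download_urls` and the placeholder row are hypothetical.

```python
def build_download_urls(summary_line, ftp_column=20):
    """Return (genome_url, proteome_url) for one assembly summary row.

    Assumes the row is tab-separated and that `ftp_column` (1-based, like
    `cut -f`) holds the assembly directory link.
    """
    fields = summary_line.rstrip("\n").split("\t")
    link = fields[ftp_column - 1].rstrip("/")
    # The last path component is the assembly name that prefixes every file.
    assembly = link.rsplit("/", 1)[-1]
    genome_url = f"{link}/{assembly}_genomic.fna.gz"
    proteome_url = f"{link}/{assembly}_protein.faa.gz"
    return genome_url, proteome_url


# Hypothetical row: 19 placeholder columns followed by the link column.
example_row = "\t".join(["na"] * 19 + [
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/085/GCF_000009085.1_ASM908v1"
])
genome_url, proteome_url = build_download_urls(example_row)
print(genome_url)
print(proteome_url)
```

Keeping the URL construction in a small function makes it easy to check a link by eye before any download is attempted.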


In [45]:
%%bash

# lastly, we have to unzip the genome and proteome files
gunzip data/genomes/*.gz
gunzip data/proteomes/*.gz

We now have 10 genomes and 10 proteomes from NCBI which we can use for comparative genomics! Let's start by comparing our genomes to the NCBI isolates with FastANI.

## Process 2: Comparing Average Nucleotide Identity using FastANI
Program: FastANI - Fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI)<br>
Citation: Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. *High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.* Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9<br>
Manual: https://github.com/ParBLiSS/FastANI

FastANI was developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. FastANI supports pairwise comparison of both complete and draft genome assemblies and avoids the expensive sequence alignments used by most ANI tools. With all our genomes on hand, we can make an initial comparison of ANI to give a preview of potential patterns among our genome set.

In [75]:
%%bash

# take a look at the help menu
fastANI -h

-----------------
fastANI is a fast alignment-free implementation for computing whole-genome Average Nucleotide Identity (ANI) between genomes
-----------------
Example usage:
$ fastANI -q genome1.fa -r genome2.fa -o output.txt
$ fastANI -q genome1.fa --rl genome_list.txt -o output.txt

SYNOPSIS
--------
fastANI [-h] [-r <value>] [--rl <value>] [-q <value>] [--ql <value>] [-k
        <value>] [-t <value>] [--fragLen <value>] [--minFraction <value>]
        [--maxRatioDiff <value>] [--visualize] [--matrix] [-o <value>] [-s] [-v]

OPTIONS
--------
-h, --help
     print this help page

-r, --ref <value>
     reference genome (fasta/fastq)[.gz]

--rl, --refList <value>
     a file containing list of reference genome files, one genome per line

-q, --query <value>
     query genome (fasta/fastq)[.gz]

--ql, --queryList <value>
     a file containing list of query genome files, one genome per line

-k, --kmer <value>
     kmer size <= 16 [default : 16]

-t, --threads <value>
     thread coun

We want to run FastANI with all of our genomes queried against each other. This corresponds to the example in the manual `fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o [OUTPUT_FILE]`. The query and reference lists are lists of the paths to our genomes. Let's first make these files with some more bash looping.

In [68]:
%%bash

# looping through all genome files in our genome directory
for genome in data/genomes/*
do
    # echo prints the path to the genome; tee -a appends it to both list files at once
    # (because -a appends, delete the list files before rerunning this cell to avoid duplicate entries)
    echo $genome | tee -a query_list.txt reference_list.txt
done

data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna
data/genomes/GCF_001764295.1_ASM176429v1_genomic.fna
data/genomes/GCF_001877725.1_ASM187772v1_genomic.fna
data/genomes/GCF_002178195.1_ASM217819v1_genomic.fna
data/genomes/GCF_019297005.1_ASM1929700v1_genomic.fna
data/genomes/GCF_019297285.1_ASM1929728v1_genomic.fna
data/genomes/GCF_036574495.1_ASM3657449v1_genomic.fna
data/genomes/GCF_036578165.1_ASM3657816v1_genomic.fna
data/genomes/GCF_037733045.1_ASM3773304v1_genomic.fna
data/genomes/GCF_964233685.1_CNRERY-01897_genomic.fna
data/genomes/SRR10056681.fasta
data/genomes/SRR10056778.fasta
data/genomes/SRR10056784.fasta
data/genomes/SRR10056829.fasta
data/genomes/SRR10056855.fasta
data/genomes/SRR10056856.fasta
data/genomes/SRR10056914.fasta
data/genomes/SRR10067958.fasta
data/genomes/SRR10068079.fasta
data/genomes/SRR10068117.fasta
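
The list-building step can also be sketched in Python. This is a minimal sketch under stated assumptions: the helper name `write_genome_lists` is hypothetical, and unlike the `tee -a` loop above it overwrites the list files rather than appending to them, so running it twice does not duplicate entries. The demonstration uses a throwaway directory with placeholder files instead of `data/genomes/`.

```python
import tempfile
from pathlib import Path


def write_genome_lists(genome_dir, list_paths):
    """Write every file path in `genome_dir`, sorted, to each list file.

    Files are opened in write mode, so rerunning replaces the lists
    instead of appending duplicate entries.
    """
    genomes = sorted(str(p) for p in Path(genome_dir).iterdir() if p.is_file())
    for list_path in list_paths:
        Path(list_path).write_text("\n".join(genomes) + "\n")
    return genomes


# Demonstration on a throwaway directory with two placeholder genome files.
with tempfile.TemporaryDirectory() as tmp:
    genome_dir = Path(tmp) / "genomes"
    genome_dir.mkdir()
    for name in ("SRR_b.fasta", "SRR_a.fasta"):
        (genome_dir / name).write_text(">contig1\nACGT\n")
    query_list = Path(tmp) / "query_list.txt"
    reference_list = Path(tmp) / "reference_list.txt"
    written = write_genome_lists(genome_dir, [query_list, reference_list])
    query_lines = query_list.read_text().splitlines()
    reference_lines = reference_list.read_text().splitlines()
```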


In [76]:
%%bash

fastANI --ql query_list.txt --rl reference_list.txt -t 24 -o fastani_output.tsv

>>>>>>>>>>>>>>>>>>
Reference = [data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna, data/genomes/GCF_001764295.1_ASM176429v1_genomic.fna, data/genomes/GCF_001877725.1_ASM187772v1_genomic.fna, data/genomes/GCF_002178195.1_ASM217819v1_genomic.fna, data/genomes/GCF_019297005.1_ASM1929700v1_genomic.fna, data/genomes/GCF_019297285.1_ASM1929728v1_genomic.fna, data/genomes/GCF_036574495.1_ASM3657449v1_genomic.fna, data/genomes/GCF_036578165.1_ASM3657816v1_genomic.fna, data/genomes/GCF_037733045.1_ASM3773304v1_genomic.fna, data/genomes/GCF_964233685.1_CNRERY-01897_genomic.fna, data/genomes/SRR10056681.fasta, data/genomes/SRR10056778.fasta, data/genomes/SRR10056784.fasta, data/genomes/SRR10056829.fasta, data/genomes/SRR10056855.fasta, data/genomes/SRR10056856.fasta, data/genomes/SRR10056914.fasta, data/genomes/SRR10067958.fasta, data/genomes/SRR10068079.fasta, data/genomes/SRR10068117.fasta]
Query = [data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna, data/genomes/GCF_001764295.1_ASM176429v

In [77]:
%%bash

# taking a look at our output
head -n 5 fastani_output.tsv

data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	100	547	559
data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	data/genomes/GCF_002178195.1_ASM217819v1_genomic.fna	98.3077	451	559
data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	data/genomes/GCF_964233685.1_CNRERY-01897_genomic.fna	98.2654	511	559
data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	data/genomes/GCF_037733045.1_ASM3773304v1_genomic.fna	98.2606	512	559
data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna	data/genomes/SRR10056855.fasta	98.2039	507	559


### Explanation of FastANI output

FastANI outputs a tab-delimited file with one row for every reported query-reference pair and five columns. The columns correspond to the query genome, the reference genome, the ANI value, the count of bidirectional fragment mappings, and the total number of query fragments, respectively. Since our NCBI genomes are from the same species as our isolates, our ANI values should be >95%.
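
The column layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of FastANI itself: the names `AniRecord` and `read_fastani` are hypothetical, and the 95% cutoff is simply the species-level threshold mentioned in the text. The aligned fraction (mapped fragments divided by total query fragments) is an extra quantity derived from the last two columns.

```python
import csv
import io
from typing import NamedTuple


class AniRecord(NamedTuple):
    query: str
    reference: str
    ani: float
    mapped_fragments: int
    total_fragments: int

    @property
    def aligned_fraction(self):
        # Share of query fragments that mapped bidirectionally.
        return self.mapped_fragments / self.total_fragments


def read_fastani(handle):
    """Parse FastANI's five-column tab-delimited output into records."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        records.append(
            AniRecord(row[0], row[1], float(row[2]), int(row[3]), int(row[4]))
        )
    return records


# Two rows copied from the output shown above.
sample = (
    "data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna\t"
    "data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna\t100\t547\t559\n"
    "data/genomes/GCF_001763575.1_ASM176357v1_genomic.fna\t"
    "data/genomes/SRR10056855.fasta\t98.2039\t507\t559\n"
)
records = read_fastani(io.StringIO(sample))
below_species_cutoff = [r for r in records if r.ani < 95.0]
for r in records:
    print(r.reference, r.ani, round(r.aligned_fraction, 3))
```

To run this on the real results, the `io.StringIO(sample)` argument could be swapped for an open handle to `fastani_output.tsv`.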

For some downstream analyses, we should clean these names up a bit. We don't need the leading `data/genomes/` before every one of our genomes. Below, we use `sed` to edit the file in place, replacing every instance of `data/genomes/` with empty text. More info on `sed` can be found in its manual page [here](https://linux.die.net/man/1/sed).

In [80]:
%%bash

# sed syntax is 's|what we want to find|what we want to replace it with|'
# the -i flag specifies that the operation is in place and won't make a new file
# the g at the end of the sed expression stands for global--replacing every instance in the whole document
sed -i 's|data/genomes/||g' fastani_output.tsv

In [81]:
%%bash

# we should see the effect of sed now
head -n 5 fastani_output.tsv

GCF_001763575.1_ASM176357v1_genomic.fna	GCF_001763575.1_ASM176357v1_genomic.fna	100	547	559
GCF_001763575.1_ASM176357v1_genomic.fna	GCF_002178195.1_ASM217819v1_genomic.fna	98.3077	451	559
GCF_001763575.1_ASM176357v1_genomic.fna	GCF_964233685.1_CNRERY-01897_genomic.fna	98.2654	511	559
GCF_001763575.1_ASM176357v1_genomic.fna	GCF_037733045.1_ASM3773304v1_genomic.fna	98.2606	512	559
GCF_001763575.1_ASM176357v1_genomic.fna	SRR10056855.fasta	98.2039	507	559


In [82]:
%%bash

scripts/fastANI_heatmap.py -h

usage: fastANI_heatmap.py [-h] [--input INPUT] [--out OUT]

Creates a heatmap from fastANI results

options:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        FastANI results to make a heatmap from
  --out OUT             Name of the resulting heatmap


In [83]:
%%bash

scripts/fastANI_heatmap.py --input fastani_output.tsv --out fastani_heatmap.png

<p align="center">
  <img src="fastani_heatmap.png" width="80%"/>
</p>
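
The contents of `scripts/fastANI_heatmap.py` are not shown here, so the following is only a sketch of one plausible first step for any heatmap of these results: pivoting the long list of query/reference rows into a square matrix. The function name `ani_matrix` and the toy values are hypothetical and are not real results.

```python
def ani_matrix(rows):
    """Pivot (query, reference, ani) rows into labels plus a square matrix.

    Pairs that do not appear in `rows` are left as None.
    """
    labels = sorted(
        {name for query, reference, _ in rows for name in (query, reference)}
    )
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[None] * len(labels) for _ in labels]
    for query, reference, ani in rows:
        matrix[index[query]][index[reference]] = ani
    return labels, matrix


# Toy values, not real results.
rows = [
    ("genomeA", "genomeA", 100.0),
    ("genomeA", "genomeB", 98.3),
    ("genomeB", "genomeA", 98.1),
    ("genomeB", "genomeB", 100.0),
]
labels, matrix = ani_matrix(rows)
for label, values in zip(labels, matrix):
    print(label, values)
```

A matrix like this could then be handed to a plotting library such as seaborn once any missing values are filled or masked.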

# HCGS-Comparative-Genomics
NCBI download and OrthoFinder analysis

## Our Starting data

```bash
ls /home/share/workshop/faa_files/
```

## What we will be doing

We will be using **OrthoFinder** for our main comparative genomic analysis. The manual is very detailed, and I recommend taking some time to read it. To run the program we will need some genomes to compare.

OrthoFinder Manual: https://github.com/davidemms/OrthoFinder

The program takes a set of protein sequences for each species and runs pairwise comparisons to identify orthologous groups. For each orthogroup a gene tree is calculated, and in the end an overall species tree is computed. To get any sort of meaningful phylogenetic tree we need to be sure to include **at least four different genome datasets**. Ideally we would run this program with all of the available sequences on NCBI. As you can imagine, a pairwise comparison with 1,185 Streptomyces genomes will take a long time (days). We will therefore run it with a reduced set. Next we will determine which genomes we want to download and go over the best ways to retrieve them from NCBI.
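
To see why the full dataset is impractical, it helps to count the searches involved. The sketch below assumes an all-versus-all design in which every proteome is searched against every proteome, including itself, so the work grows with the square of the number of genomes. The function name `all_vs_all_searches` is hypothetical.

```python
def all_vs_all_searches(n_proteomes):
    """Number of proteome-against-proteome searches in an all-versus-all run."""
    return n_proteomes * n_proteomes


for n in (4, 20, 1185):
    print(n, all_vs_all_searches(n))
```

Under that assumption, going from 20 genomes to 1,185 raises the number of searches from 400 to 1,404,225, which is why a reduced set is used here.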


## Set up working directories
```bash
cd ~/genomics_tutorial/
mkdir genbank_downloads
cd genbank_downloads/
```

## Locate Reference Data on NCBI

FAQs about genome download from NCBI: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#GBorRS

Whichever method you use, consider including an outgroup; whether to do so is your call.

### Method 1: Download specific genomes

We ran an NCBI BLAST during the genome assembly tutorial. This BLAST should have given you the closest match against the nt database. Chances are it 'hit' well to many genomes. Choose the top hit to a full genome and follow the links to retrieve the download link for the genome FAA from the FTP site.

Alternatively, you can download the reference genome used as part of the MR study in *Staphylococcus*.

Staphylococcus aureus ATCC 29213 is the reference strain.
https://www.ncbi.nlm.nih.gov/assembly/GCF_001879295.1

Follow the links to the FTP download.

```bash
wget "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/879/295/GCF_001879295.1_StAu00v1/GCF_001879295.1_StAu00v1_protein.faa.gz"
```


### Method 2: Download all refseq genomes for your genus

link to NCBI prokaryote tables: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/
link to genome reports FTP: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

* Download the genome report file for all prokaryotes

We will download the file directly to the server, so there is no need to download it to your computer. Right-click on the link and copy the link address. This is a big file, so we will filter it a bit before opening it with tabview.

This file has a lot of useful information. For now we really care about column 21, which is the download link for the genome on the FTP site. Copy that link and paste it into a browser to see the files. We will then download the FAA files to the server.


```bash
# download the file
wget "ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt"

# view it
tabview prokaryotes.txt

# grep for species in question and view.
grep -i "Staphylococcus" prokaryotes.txt | grep REPR | tabview -

# print the download commands
grep "Staphylococcus" prokaryotes.txt | grep REPR | awk -F'\t' '{print "wget "$21"/*protein.faa.gz"}'

# download all the faa files automatically. -n 1 passes one link to each wget call; -P is the number of processes at a time.
grep "Staphylococcus" prokaryotes.txt | grep REPR | awk -F'\t' '{print $21"/*protein.faa.gz"}' | xargs -n 1 -P 1 wget

# or even better, rename the files as you go. (Delete the files created from the previous command before proceeding).
grep "Staphylococcus" prokaryotes.txt | grep REPR | sed 's/ /_/g' | awk -F'\t' '{print $1"_"$19".faa.gz",$21"/*protein.faa.gz"}' | xargs -n 2 -P 1 wget -O
```


Remove the empty files. Some assemblies don't have annotations, and their empty files will break the next part.

```bash
zgrep -c '>' *.faa.gz
zgrep -c '>' *.faa.gz | awk -F':' '$2 == 0'

# automatic deletion with xargs
zgrep -c '>' *.faa.gz| awk -F':' '$2 == 0 {print $1}' | xargs rm
```
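
The same check can be sketched in Python for readers who prefer to inspect the counts before deleting anything. This is a minimal sketch with hypothetical function names (`count_proteins`, `find_empty_proteomes`). It counts lines that start with `>`, which is slightly stricter than `zgrep -c '>'` (that command counts any line containing `>`). The demonstration writes two placeholder files to a temporary directory.

```python
import gzip
import tempfile
from pathlib import Path


def count_proteins(faa_gz_path):
    """Count FASTA header lines (those starting with '>') in a gzipped file."""
    with gzip.open(faa_gz_path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def find_empty_proteomes(directory):
    """Return the sorted names of *.faa.gz files that contain no proteins."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.faa.gz")
        if count_proteins(path) == 0
    )


# Placeholder files: one with two proteins, one with no annotations.
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "annotated.faa.gz", "wt") as out:
        out.write(">prot1\nMKV\n>prot2\nMST\n")
    with gzip.open(Path(tmp) / "unannotated.faa.gz", "wt") as out:
        out.write("")
    annotated_count = count_proteins(Path(tmp) / "annotated.faa.gz")
    empty_files = find_empty_proteomes(tmp)
print(annotated_count, empty_files)
```

Listing the empty files first, rather than deleting them in the same step, gives a chance to confirm that only unannotated assemblies are affected.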

Unzip all the files

```bash
gunzip *.faa.gz
```


## Set up orthofinder directory

```bash
# move to analysis folder
mkdir ~/genomics_tutorial/orthofinder-analysis
cd ~/genomics_tutorial/orthofinder-analysis


# create a soft link to the FAA files we just downloaded
ln -s ../genbank_downloads/*.faa ./

# create a soft link to the FAA files from our PROKKA analysis
ln -s /home/share/workshop/faa_files/*.faa ./
```

## Count the number of proteins in all the starting files
Think about what these numbers tell us right off the bat.

```bash
grep -c '>' *.faa
```

## Run Orthofinder2

The input to the program is a directory containing a FAA file for each species.

```bash
# view the manual
orthofinder --help
# run the program, it will take some time
nohup time orthofinder -t 16 -a 16 -S diamond -f ./ &
```

## Examine the output files

I will review some, but not all of the files. The manual goes into extensive detail.

```bash
cd Results*/
ls
```

### * Orthogroups.csv

A **tab**-separated table. Each orthogroup is a row, and each column is a different sample.

The table provides all of the data for orthogroups that are in at least two different samples. If a sample has more than one protein for a particular orthogroup, then its entry will be a comma-separated list.

```bash
tabview Orthogroups.csv
```

### * Orthogroups_UnassignedGenes.csv

The same style of table, but this one contains genes that were not assigned to any orthogroup; each is unique to a single sample. As you scroll down you should notice that the proteins belong to different samples.

```bash
tabview Orthogroups_UnassignedGenes.csv
```

###  * Orthogroups.GeneCount.csv

My favorite 'Orthogroup' output file. Orthogroups are the rows, and columns are gene counts per species. This can be easily parsed to see which orthogroups are specific to which species. It also provides total gene counts for each sample.

```bash
tabview Orthogroups.GeneCount.csv
```
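
As an illustration of how a gene count table can be parsed, here is a minimal sketch that lists the orthogroups found in only one sample. It assumes a tab-separated layout in which the first column is the orthogroup ID, the last column is a total, and the columns in between are per-sample counts; check the header of your own file before relying on this. The function name and the toy table are hypothetical and are not real results.

```python
import csv
import io


def sample_specific_orthogroups(handle):
    """Map each sample to the orthogroups whose genes occur only in that sample.

    Assumes a tab-separated table: the first column is the orthogroup ID, the
    last column is a total, and the columns in between are per-sample counts.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:-1]
    specific = {sample: [] for sample in samples}
    for row in reader:
        counts = [int(value) for value in row[1:-1]]
        present = [s for s, c in zip(samples, counts) if c > 0]
        if len(present) == 1:
            specific[present[0]].append(row[0])
    return specific


# Toy table, not real results.
toy_table = (
    "\tsampleA\tsampleB\tsampleC\tTotal\n"
    "OG0000000\t1\t1\t1\t3\n"
    "OG0000001\t2\t0\t0\t2\n"
    "OG0000002\t0\t1\t1\t2\n"
)
specific = sample_specific_orthogroups(io.StringIO(toy_table))
print(specific)
```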

* add annotations from a reference sequence

Usage: `~/orthogroups_add_annotations.py <reference_faa> Orthogroups.txt Orthogroups.GeneCount.csv`

```bash
orthogroups_add_annotations.py ../GCF_000203835.1_ASM20383v1_protein.faa Orthogroups.txt  Orthogroups.GeneCount.csv | tabview -
```


## Statistics

### * Statistics_Overall.csv

A file containing the overall statistics for the analysis, such as the total number of genes in the dataset.

```bash
tabview Statistics_Overall.csv
```

### * Statistics_PerSpecies.csv

In my opinion this is the most important statistics output file. It provides details for each sample, including how many genes were specific to that sample. If you want a quick statistic for how 'different' your genome is, this is it.

```bash
tabview Statistics_PerSpecies.csv
```

### * WorkingDirectory/

All of the work done by external programs such as BLAST or DIAMOND. `ls` this directory. It contains all the results for each pairwise comparison.

### * Orthologues_DATE/

This directory contains a lot of useful data related to the OrthoFinder analysis and how it computes the phylogenetic trees.

#### * Recon_Gene_Trees/

A directory containing inferred trees for every orthogroup.

#### * SpeciesTree_rooted.txt

A rooted species tree. OrthoFinder computes a root for the tree automatically. You can view this in any tree-viewing program such as FigTree or TreeView (Mac). This file is in Newick format. Check it out.

```bash
more Orthologues*/SpeciesTree_rooted.txt
```
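
Newick is a plain-text format, so a quick sanity check is to list the leaf names and confirm that every sample made it into the tree. The sketch below uses only a regular expression and a toy tree. It is not a full Newick parser, it does not handle quoted labels, and the function name `newick_leaf_names` is hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Return leaf labels from a Newick string, in the order they appear.

    A leaf label is the text that directly follows '(' or ','. Labels after
    ')' are internal node names or support values and are ignored.
    Quoted labels containing punctuation are not handled.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


# Toy tree, not a real result.
toy_tree = "((sampleA:0.1,sampleB:0.2)0.9:0.3,(sampleC:0.1,sampleD:0.4)1.0:0.2);"
leaves = newick_leaf_names(toy_tree)
print(leaves)
```

For real analyses, a dedicated tree library is a safer choice than a regular expression.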

## Export the tree file and view.


## Bonus - Figures
```
orthogroups_add_annotations.py ../GCF_000203835.1_ASM20383v1_protein.faa Orthogroups.txt  Orthogroups.GeneCount.csv

orthotools-venn.py Results_*/ PROKKA_*.faa species1.faa species2.faa  venn

orthotools-UpSet.R Results_*/Orthogroups.GeneCount.csv

```
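
The internals of `orthotools-UpSet.R` are not shown here, so the following is only a sketch of the quantity an UpSet plot typically displays: the number of orthogroups for each exact combination of samples. The function name `exclusive_intersections` and the toy counts are hypothetical and are not real results.

```python
from collections import Counter


def exclusive_intersections(gene_counts):
    """Count orthogroups for each exact combination of samples.

    `gene_counts` maps orthogroup ID -> {sample: gene count}. An orthogroup
    is tallied under the set of samples where its count is above zero.
    """
    tally = Counter()
    for counts in gene_counts.values():
        present = frozenset(s for s, c in counts.items() if c > 0)
        if present:
            tally[present] += 1
    return tally


# Toy values, not real results.
toy_counts = {
    "OG0000000": {"sampleA": 1, "sampleB": 1, "sampleC": 1},
    "OG0000001": {"sampleA": 2, "sampleB": 0, "sampleC": 0},
    "OG0000002": {"sampleA": 0, "sampleB": 1, "sampleC": 1},
    "OG0000003": {"sampleA": 1, "sampleB": 1, "sampleC": 2},
}
tally = exclusive_intersections(toy_counts)
for combo, size in tally.most_common():
    print(sorted(combo), size)
```

Each key in `tally` corresponds to one bar in an UpSet-style plot, with the bar height given by the count.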


In [None]:
# import the required packages
import requests
import json
import ipywidgets as widgets
from IPython.display import display
import random
print("done importing required packages")

# import run_quiz from the local module quiz_module.py
from quiz_module import run_quiz
print("done importing quiz_module")

In [None]:
#This randomizes the order of the possible answers.
##import_type should be one of two str values: 'json' or 'url'
##import_path here defines the json filepath
run_quiz(import_type="json", import_path="questions/1-1.json", instant_feedback=False, shuffle_questions=False, shuffle_answers=True)

In [None]:
%%bash

aws s3 cp s3://nh-inbre-genome-sequencing-and-comparative-genomic-analysis/ ./

In [None]:
# Phylogenetic tree

Phylo
ETE toolkit
ToyTree

https://github.com/etetoolkit/ete

http://etetoolkit.org/ipython_notebook/
 ETE Toolkit - Visualization and analyses using Ipython Notebooks 
The ETE toolkit - Ipython notebook integration

https://toytree.readthedocs.io/en/latest/

#South Dakota is doing one

aws s3 cp s3://PATH 