# Submodule 4: Comparative genomics analysis

In this submodule, you will begin with a directory of proteomes from de novo assembled and annotated genomes. 

A single bacterial genome was analyzed in Submodules 1 & 2, and we automated the process across many genomes in Submodule 3 to produce a total of XX genome sequences.

We will add publicly available reference genomes from NCBI to this dataset. These genome sequences are curated for quality and provide gene accessions with curated functional information, which makes them crucial for providing context to our new dataset.
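
One possible way to fetch a reference proteome is the NCBI Datasets command-line tool (`datasets`). The sketch below only assembles the command and leaves the download switched off by default. It assumes the `datasets` CLI is installed separately; the helper name `build_download_command`, the `RUN_DOWNLOAD` flag, and the example accession (the *E. coli* K-12 MG1655 RefSeq assembly) are illustrative choices and are not part of this module.

```python
import shutil
import subprocess


def build_download_command(accession, out_zip):
    """Build an NCBI Datasets CLI call that requests the protein FASTA for one assembly."""
    return [
        "datasets", "download", "genome", "accession", accession,
        "--include", "protein",   # request the annotated proteins only
        "--filename", out_zip,    # name of the zip archive to write
    ]


# Illustrative accession (assumption): E. coli K-12 MG1655 RefSeq assembly
cmd = build_download_command("GCF_000005845.2", "GCF_000005845.2.zip")
print(" ".join(cmd))

# Set to True to actually download (requires network access and the datasets CLI)
RUN_DOWNLOAD = False

if RUN_DOWNLOAD:
    if shutil.which("datasets") is None:
        print("NCBI 'datasets' CLI not found on PATH; command not run")
    else:
        subprocess.run(cmd, check=True)
```

After unzipping, the protein FASTA is typically found under `ncbi_dataset/data/<accession>/protein.faa` and can be copied into the *proteomes* directory alongside the other FAA files.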


### Learning Objectives

- **Access genome datasets from the NCBI**

- Gain proficiency in using bioinformatics tools and foster skills in data analysis and interpretation.

- Connect the principles of the central dogma (DNA → RNA → Protein) to the process of genome annotation, understanding how gene annotations contribute to functional genomics and biological interpretations.

- Participants will refine the genome based on their assessments, ensuring a high-quality, annotated genome that can be used for further analysis or research applications.

## **Install required software**

A few more tools are required for Submodule 4: OrthoFinder, UpSet plot, BWA, Samtools, BLAST, BlobTools, and Prokka. As with Submodule 1, we will install these tools using __[Conda](https://docs.conda.io/en/latest/)__.

Each piece of software, along with links to publications and documentation, will be described in turn. Below is a brief summary of these tools.

### List of software
| **Tool**       | **Description**                                                                                                                                                           |
|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **QUAST**      | Used for evaluating and reporting the quality of genome assemblies by comparing them against reference genomes or generating statistical summaries.                          |
| **BUSCO**      | Utilized for assessing genome completeness by searching for conserved single-copy orthologs from specific lineage datasets.                                                 |
| **BWA**        | A fast and memory-efficient tool for aligning sequence reads to large reference genomes, commonly used in variant calling pipelines.                                         |
| **Samtools**   | Used for manipulating and processing sequence alignments stored in SAM/BAM format. Essential for sorting, indexing, and viewing alignment files.                            |
| **BLAST**      | A widely used tool for comparing an input sequence to a database of sequences, identifying regions of local similarity and aiding in functional annotation.                  |
| **BlobTools**  | A versatile tool for visualizing and analyzing genome assemblies, helping to identify contamination or misassembled regions by correlating sequence features with taxonomy.   |
| **Prokka**     | Used for rapid annotation of prokaryotic genomes, identifying genes, coding sequences, rRNAs, tRNAs, and other genomic features.                                             |
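
After the Conda installation finishes, it can help to confirm that the tools are visible on `PATH`. The following is a minimal sketch using only the Python standard library. The executable names in `TOOLS` are assumptions based on how these packages are commonly installed (for example, BLAST is checked through its `blastp` program), so adjust them to match your environment.

```python
import shutil

# Executable names are assumptions; edit to match the installed packages
TOOLS = ["orthofinder", "bwa", "samtools", "blastp", "blobtools", "prokka"]


def find_tools(names):
    """Map each executable name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


# Print one line per tool so missing installs are easy to spot
for name, path in find_tools(TOOLS).items():
    print(f"{name:12s} {path if path else 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` most likely needs to be installed, or the Conda environment that contains it needs to be activated.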

## Starting Data

This submodule begins with a directory of proteomes in FAA (FASTA amino acid) format. This module is designed to work with data produced in Submodule 3, but you can replace the FAA files within the working directory *proteomes* or add additional FAA files as needed.
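
Before starting the analysis, a quick sanity check on the input can be useful. The sketch below counts the protein records in each FAA file by counting FASTA header lines. It assumes the files use the `.faa` extension and sit in a directory named `proteomes`; the function names are illustrative.

```python
from pathlib import Path


def count_proteins(faa_path):
    """Count FASTA records in one file by counting header lines that start with '>'."""
    with open(faa_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_proteomes(directory="proteomes"):
    """Return {file name: protein count} for every .faa file in the directory."""
    return {p.name: count_proteins(p) for p in sorted(Path(directory).glob("*.faa"))}


# Print a tab-separated summary, one proteome per line
for name, n_proteins in summarize_proteomes().items():
    print(f"{name}\t{n_proteins}")
```

A proteome with an unexpectedly low count, or a count of zero, may point to an incomplete annotation or an empty file and is worth checking before it is included in the comparison.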

In [None]:
%%bash

ls proteomes/




In [None]:
# Import the required packages
import requests
import json
import ipywidgets as widgets
from IPython.display import display
import random
print("done importing required packages")

# Import run_quiz from the local module quiz_module.py
from quiz_module import run_quiz
print("done importing quiz_module")

In [None]:
# Run the quiz; shuffle_answers=True randomizes the order of the possible answers.
# import_type must be one of two str values: 'json' or 'url'
# import_path here is the path to the JSON file of questions
run_quiz(import_type="json", import_path="questions/1-1.json", instant_feedback=False, shuffle_questions=False, shuffle_answers=True)

## Phylogenetic tree

Candidate tools for tree visualization:

- Phylo
- ETE Toolkit: https://github.com/etetoolkit/ete
  - IPython notebook integration: http://etetoolkit.org/ipython_notebook/
- ToyTree: https://toytree.readthedocs.io/en/latest/

Note: South Dakota is doing one.


