# Species Identification through Blastn search

## Files needed for the Blastn search

### Text files
Sequences can be provided in different formats as shown below.

#### Sequences downloaded from the website directly
If you open the file, you will see
```
GCGCCCTCCCGAAGGTTAAGCTACCTACTTCTTTTGCAACCCA
```
or, in some cases,
```
"Contig 1" (1,1355)
  Contig Length:                 1355 bases
  Average Length/Sequence:        707 bases
  Total Sequence Length:         1415 bases
  Top Strand:                       1 sequences
  Bottom Strand:                    1 sequences
  Total:                            2 sequences
FEATURES             Location/Qualifiers
     contig          1..1347
                     /Note="Contig 1(1>1355)"
                     /dnas_scaffold_ID=0
                     /dnas_scaffold_POS=0
     coverage_once   1..662
                     /Note="Only_once"
     coverage_below  663..722
                     /Note="Below threshold"
     coverage_once   723..1347
                     /Note="Only_once"

^^
GCAGTCGAGCGGGCCCTTCGGGGTCAGCGGCAGACGGGTGAGTAACACGTGGGAACGTACCCTTTGGTTCGGAATAACGCTGGGAAACTAGCGCTAATACCGGATACGCCCTTTTGGGGAAAGGCTTGCTGCCGAAGGATCGGCCCGCGTC
```

Note:  
The file name is used as the name of the sequence, so make sure that no two files share the same name.  
In most cases you will have many text files with DNA sequences. Compress them into a single archive and send it to Xiaolong Cao.
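
The two layouts above can be turned into fasta records with a few lines of Python. This is only a sketch, not the script used for the actual search: the function names are hypothetical, and it assumes that a contig report, when present, always ends with a line containing only `^^` and that the files end in `.txt`.

```python
from pathlib import Path


def read_plain_sequence(text: str) -> str:
    """Return the DNA sequence from a plain-text sequence file.

    Assumption: if a contig report is present, the sequence is
    everything after the line that contains only "^^".
    """
    stripped = [line.strip() for line in text.splitlines()]
    if "^^" in stripped:
        # Discard the report and the separator line itself.
        stripped = stripped[stripped.index("^^") + 1:]
    # Join the remaining non-blank lines into one sequence string.
    return "".join(line for line in stripped if line)


def folder_to_fasta(folder: str, pattern: str = "*.txt") -> str:
    """Build fasta text; each file name (without extension) becomes a sequence name."""
    records = {}
    # rglob also searches subfolders, so duplicate names can occur.
    for path in sorted(Path(folder).rglob(pattern)):
        name = path.stem
        if name in records:
            raise ValueError(f"duplicate sequence name: {name}")
        records[name] = read_plain_sequence(path.read_text())
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

Raising an error on a duplicate name mirrors the note above: two files with the same name would otherwise produce two sequences that cannot be told apart in the result.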


#### Sequences in fasta format

File with one sequence
```
>10-14
GCAGTCGAACGATGAAACCGCCCTCGGGCGGACATGAAGTGGC
```

File with multiple sequences

```
>10-14
GCAGTCGAACGATGAAACCGCCCTCGGGCGGACATGAAGTGGCG
>10-15
GGCCGATCCTTGCGGTTACGGACTTCAGGTACCCCCGGCTCCCA

```

Note:  
`>10-14` is the header line, and `10-14` is used as the name of the sequence. Make sure that no two sequences share the same name.  
If you have many fasta files, compress them into a single archive and send it to Xiaolong Cao.  
If you have one file with multiple DNA sequences, send it to Xiaolong Cao directly.
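
Before sending a multi-sequence fasta file, it can help to check for repeated names. The sketch below is illustrative only; the function names are hypothetical, and it assumes that the whole header line after `>` is the sequence name.

```python
from collections import Counter


def fasta_names(text: str) -> list[str]:
    """Return the sequence names (header lines without '>') in file order."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def duplicated_names(text: str) -> list[str]:
    """Return every name that appears more than once, sorted alphabetically."""
    counts = Counter(fasta_names(text))
    return sorted(name for name, n in counts.items() if n > 1)


example = ">10-14\nGCAGTCGAACG\n>10-15\nGGCCGATCCTT\n>10-14\nAAAACCCC\n"
print(duplicated_names(example))  # lists the names that need renaming
```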

#### Sequences in fastq format
This case is rare. The fastq format looks like
```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```
Note:  
Each sequence takes four lines in fastq format: `@SEQ_ID` gives the name of the sequence; the second line is the sequence; the third line is `+` or `+SEQ_ID`; the fourth line holds the quality score for each base in the sequence.  
Send fastq files the same way as fasta files.  
You may need to include a note saying that the files are in fastq format.
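
The four-line layout described above can be checked and converted to fasta as follows. This is a minimal sketch with a hypothetical function name; it assumes that sequences are not wrapped over several lines and it simply discards the quality scores.

```python
def fastq_to_fasta(text: str) -> str:
    """Convert four-line fastq records to fasta text, dropping quality scores."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("fastq input should have four lines per sequence")
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        record_no = i // 4 + 1
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {record_no}")
        # There must be exactly one quality character per base.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {record_no}")
        out.append(f">{header[1:]}\n{seq}\n")
    return "".join(out)
```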

### Binary files

Currently there is only one binary format for sequence files: files whose names end with ".ab1".  
To see file extensions in Windows 10, open File Explorer, click "View", and check the "File name extensions" box.  
Compress multiple files into a single archive and send it to Xiaolong Cao.
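
If you prefer not to compress the files by hand, Python's standard `zipfile` module can do it. The sketch below is one possible way, with a hypothetical function name and archive name; it collects every `.ab1` file in one folder (not in subfolders) into a zip archive.

```python
import zipfile
from pathlib import Path


def zip_sequence_files(folder: str, archive: str, suffixes=(".ab1",)) -> list[str]:
    """Zip every file in `folder` whose extension is in `suffixes`.

    Returns the names of the files that were added to the archive.
    """
    added = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(folder).iterdir()):
            # Compare extensions case-insensitively (".AB1" also matches).
            if path.is_file() and path.suffix.lower() in suffixes:
                zf.write(path, arcname=path.name)
                added.append(path.name)
    return added
```

For text files, the same function could be called with, for example, `suffixes=(".txt", ".fasta")`.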

## Databases for Blastn search

Typically, SILVA 16S or Ezbio are good choices for bacteria and ITS7.2 for fungi.

### SILVA 16S
A comprehensive on-line resource for quality checked and aligned ribosomal RNA sequence data. It includes almost all known high-quality 16S rRNA sequences, more 16S sequences than the NCBI nt database contains.


### SILVA 18S
A comprehensive on-line resource for quality checked and aligned ribosomal RNA sequence data. It includes almost all known high-quality 18S rRNA sequences, more 18S sequences than the NCBI nt database contains.


### nt_micro
Sequences from the NCBI nt database. Only sequences from Archaea, Bacteria, Viroids, Viruses and Fungi are included.

### nt
First suggested by Kejing Wang. Many thanks.  
The same as the NCBI nt database.

### Ezbio
Suggestion from Guozhen Zhao. Many thanks.  
The same as Ezbiocloud (https://www.ezbiocloud.net/identify).  
Introduction from the web (http://help.bioiplug.com/ezbiocloud-16s-database/)  
**Advantages**
* Designed for species-level identification
* Complete taxonomic hierarchy from phylum to species
* Covering species with validly published names, candidatus, potential new species and uncultured phylotypes. For example, >10 tentatively new Acinetobacter species are included.
* Good coverage on human microbiome: >96% species-level identification can be achieved.

**The content of the EzBioCloud 16S database**  
* EzBioCloud 16S database contains the following information:  
    * Standardized 16S rRNA gene sequence representing reference taxa  
        * All sequences are extracted between two most popular PCR primers (27F-1492R), so similarity calculation should be consistently carried out.  
        * In principle, single 16S is assigned to single reference taxon.  
    * The reference taxa mean  
        * Currently validly published taxonomic names  
        * Some of the invalid names (that are likely representing distinct species).  
        * Candidatus taxa  
        * Unnamed phylotypes that do not belong to the above. These include 16S amplicons and genome sequences.  
    * Complete taxonomic hierarchy is given for all 16S sequences (from species to phylum). The hierarchy is based on the maximum likelihood phylogenetic tree of 16S with consideration of the currently accepted classification.  
* Source of 16S data  
    * Since we have tried to secure the best quality of 16S sequences, the source of a 16S sequence can vary and is one of the following:  
    * NCBI 16S amplicon sequences of validly published taxa: e.g.,  AY692362 for Adiaceo aphidicola  
    * NCBI 16S amplicon sequences of phylotypes: e.g.,  AJ290038 for AJ290038_s (phylotype corresponding species)  
    * 16S sequence extracted from NCBI genome assembly: e.g., CP000238 for Baumannia cicadellinicola.  
    * 16S sequence extracted from JGI genome assembly (this genome data may not be available in NCBI): e.g. jgi.1096475 for phylotype jgi.1096475_s in the genus Geodermatophilus.  
    * 16S sequence compiled from Pacific Biosciences full-length sequencing of microbiome samples. These represent high-quality 16S sequences using PacBio’s circular consensus sequencing (ccs) technology: e.g. PAC000364 for phylotype PAC000364_s.  
    * 16S sequence extracted from internally assembled genome data: e.g. CLG_48533 for Arthrobacter oryzae.  
    Consequently, not all data are available in NCBI database. However, all data are freely accessible through www.ezbiocloud.net.  

* Why 16S sequences from genome assemblies were used in EzBioCloud, instead of PCR  
    * Genome assemblies are usually of better quality than PCR amplicon sequencing. Typical NGS sequencing results in 50X or higher depth of coverage.  
    * When we include genome sequence-derived 16S to EzBioCloud database, we always check the quality by manual alignment using secondary structural information. In our experience, using genome sequences we can improve the quality of 16S databases for reference purposes.  


### ITS7.2
Unified system for the DNA based fungal species linked to the classification Ver. 7.2 (https://unite.ut.ee/repository.php)  

Following Kõljalg et al. (2013), each terminal fungal taxon for which two or more ITS sequences are available is referred to as a species hypothesis (SH). One sequence is chosen to represent each SH; these sequences are called representative sequences (RepS) when chosen automatically by the computer and reference sequences (RefS) when those choices are overridden (or confirmed) by users with expert knowledge of the taxon at hand.

Used to identify fungal species.

## Search result

After searching against one database, four files will be created.

### \*\*\*AllSeqs.fasta

All sequences stored in fasta format

### \*\*\*sequencesNotIdentified\*\*\*.fasta
Sequences not identified in the `***Best***Result.xlsx` file.

### \*\*\*Best\*\*\*Result.xlsx
Best result for each of the query sequences selected by the program.

Selection standards:
* Good species name: not "unknown" or "environmental samples"
* Good taxonomies: at most two of the six ranks (species, genus, family, order, class, phylum) may be missing.
* qcover > 95%
* identity > 97%

Explanation of each column:
* query: name of query sequence
* subject: name of the subject sequence, usually an accession number; you can find the sequence by searching for this name in NCBI.
* identity: percentage of identical bases within the aligned part of the query and subject sequences.
* matchLength: matched length between the query and subject sequences.
* qcover: percentage of the query sequence that aligned with the subject sequence.
* taxID: taxonomy ID of the species, which you can search in NCBI Taxonomy to find the full taxonomy information.
* identical: identical base count between the query and subject sequence.
* species
* genus
* family
* order
* class
* phylum
* \*\*\*Taxa: taxonomy info stored within the database itself.
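
The selection standards listed above can be written as a small filter over one row of the result table. This is a sketch of how such a check might look, not the program's actual code: the function name is hypothetical, a row is represented as a plain dictionary keyed by the column names, an empty species name is treated as not good, and "unknown" or "environmental samples" is matched anywhere in the species name.

```python
RANKS = ("species", "genus", "family", "order", "class", "phylum")
BAD_NAME_PARTS = ("unknown", "environmental samples")


def passes_selection(hit: dict) -> bool:
    """Return True if one result row meets all four selection standards."""
    species = (hit.get("species") or "").strip()
    # Good species name (assumption: an empty name is also rejected).
    if not species or any(bad in species.lower() for bad in BAD_NAME_PARTS):
        return False
    # Good taxonomies: at most two of the six ranks may be missing.
    missing = sum(1 for rank in RANKS if not (hit.get(rank) or "").strip())
    if missing > 2:
        return False
    # Both thresholds are strict, as in "qcover > 95%" and "identity > 97%".
    return hit["qcover"] > 95 and hit["identity"] > 97


row = {
    "species": "Bacillus subtilis", "genus": "Bacillus", "family": "Bacillaceae",
    "order": "", "class": "", "phylum": "Firmicutes",
    "qcover": 99.0, "identity": 99.5,
}
print(passes_selection(row))
```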

### \*\*\*All\*\*\*Result.xlsx
All results of the blast search.

## Use the result

Typically, the `***Best***Result.xlsx` file is good enough. If the `***sequencesNotIdentified***.fasta` file is not empty, it is usually because the sequences contain some low-quality bases such as `N`'s and the qcover is below 95%. You can check the `***All***Result.xlsx` file and select a good hit yourself.
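
Going through the full result table for the unidentified sequences can be partly automated. The sketch below is one possible approach under stated assumptions: the function name and the relaxed `min_qcover` default of 80 are hypothetical, the rows are plain dictionaries keyed by the column names (they could be loaded from the xlsx file with a spreadsheet library), and hits are ranked by identity first and qcover second. The final choice should still be made by a person.

```python
def candidate_hits(all_hits, unidentified, min_identity=97.0, min_qcover=80.0):
    """For each unidentified query, list hits passing relaxed thresholds, best first."""
    wanted = set(unidentified)
    by_query = {name: [] for name in unidentified}
    for hit in all_hits:
        if (hit["query"] in wanted
                and hit["identity"] > min_identity
                and hit["qcover"] > min_qcover):
            by_query[hit["query"]].append(hit)
    for hits in by_query.values():
        # Highest identity first; qcover breaks ties.
        hits.sort(key=lambda h: (h["identity"], h["qcover"]), reverse=True)
    return by_query
```

A query that still has an empty list after this step has no hit above the relaxed thresholds, which may indicate that the sequence should be re-sequenced or trimmed.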
