# Database

To download all FASTA files and build the database from scratch, run the provided script with the input file from the GTDB project. Then use the `--add-to-library` and `--build` options of `kraken2-build` to format the database. Example commands are below.

If a download fails with `Error 429: Too Many Requests`, NCBI is rate-limiting your requests. NCBI allows a higher request rate for clients that supply an API key. Log into your NCBI profile and copy your key, then add it to your environment like so: `export NCBI_API_KEY=1fe2...`
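
As a rough illustration of how the key is used, the sketch below builds an NCBI E-utilities URL and appends the `api_key` parameter when `NCBI_API_KEY` is set. It assumes requests go directly to E-utilities; the download script used here may handle the key differently. The helper names `build_eutils_url` and `min_request_interval` are hypothetical.

```python
import os
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(tool, params, env=None):
    """Return an E-utilities URL, adding api_key when NCBI_API_KEY is set."""
    env = os.environ if env is None else env
    query = dict(params)
    api_key = env.get("NCBI_API_KEY")
    if api_key:
        query["api_key"] = api_key
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(query)}"


def min_request_interval(env=None):
    """Seconds to wait between requests: 10/s with a key, 3/s without."""
    env = os.environ if env is None else env
    return 0.1 if env.get("NCBI_API_KEY") else 1 / 3


# Example with an explicit (fake) environment so nothing is read from the shell
url = build_eutils_url(
    "esearch", {"db": "assembly", "term": "archaea"}, env={"NCBI_API_KEY": "abc"}
)
print(url)
```

Sleeping for `min_request_interval()` seconds between requests is one simple way to stay under the limit and avoid the 429 response.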

Kraken2 is a k-mer-based classifier used for species classification of high-throughput sequencing data. When building the Kraken database, one can choose to include sequences from the NCBI non-redundant nucleotide database (NT) to increase the accuracy and coverage of classification.

The NT database contains DNA and RNA sequences from various species and sources, including genes, transcripts, non-coding RNAs, genomes, viruses, plasmids, and more. These sequences cover a wide range of biological diversity from different organisms, pathogens, and host cells.

Including the NT database in the Kraken database may be helpful if you want to classify a wide range of biological diversity or if you want to classify sequences from unknown sources. However, if you are only interested in specific species or samples or do not require highly accurate classification results, it may not be necessary to include the NT database.

It is important to note that including the NT database in the Kraken database will increase the size and build time of the database, so specific needs should be considered when making this decision.

## False positives in the database
![](https://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1006277.g005)
>Top 10 species identified in corneal samples per database.
>
>The non-human reads from the 20 corneal samples were classified against four different Kraken databases: the original EuPathDB (A), EuPathDB-clean (B), RefSeq EuPathDB (C), and the final MicrobeDB (D). The plot above shows the 10 species with the most classified reads per megabase in a single corneal sample {cite:p}`lu2018removing`.
>
>https://doi.org/10.1371/journal.pcbi.1006277.g005

For example, plasmid sequences are often included in bacterial genome sequence data submitted to NCBI, which can lead to incorrect taxonomic classification. After removing all plasmid sequences included in bacterial RefSeq genomes and reassigning them to a distinct taxonomic group, the proportion of reads correctly classified to a specific bacterium was reduced {cite:p}`doster_cautionary_2019`.
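
A minimal sketch of separating plasmid records from chromosomal ones is shown below. It assumes that plasmid records can be recognized by the word "plasmid" in the FASTA header, which is common in RefSeq deflines but not guaranteed. The function name and the example accessions are hypothetical.

```python
def split_plasmids(fasta_lines):
    """Split FASTA lines into (chromosomal, plasmid) lists based on header text."""
    chromosomal, plasmid = [], []
    target = chromosomal
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: plasmid records mention "plasmid" in their header
            target = plasmid if "plasmid" in line.lower() else chromosomal
        target.append(line)
    return chromosomal, plasmid


# Hypothetical two-record genome file
example = [
    ">NZ_X1.1 Escherichia coli strain K chromosome, complete genome",
    "ACGTACGT",
    ">NZ_X2.1 Escherichia coli strain K plasmid pK1, complete sequence",
    "TTGACA",
]
chrom, plas = split_plasmids(example)
print(len(chrom), len(plas))
```

The plasmid records could then be added to the library under a separate taxonomic ID rather than under the host bacterium.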

In [None]:
kraken2-inspect /data/database/RefSeqV205_500G  | head -5

K-mer-based sequence classifiers such as Kraken2 and KrakenUniq can make erroneous assignments when genomes share similar sequences. One reported false positive is the assignment of human reads to Clostridium botulinum (see [Kraken2 issue #621: Host Reads Being Classified as Microbial reads](https://github.com/DerrickWood/kraken2/issues/621)).
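
The toy example below illustrates how this can happen. It is not Kraken2's algorithm (Kraken2 uses minimizers and lowest-common-ancestor assignment); it is a simplified exact k-mer voting scheme with made-up sequences. When a microbial reference contains a host-like stretch and the host is missing from the database, a host read is assigned to the microbe.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of labels whose reference contains it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Vote with k-mers that belong to exactly one label; None if no votes."""
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer, set())
        if len(labels) == 1:
            votes[next(iter(labels))] += 1
    return votes.most_common(1)[0][0] if votes else None


K = 5
# Made-up sequences: the microbe reference carries a host-like stretch
microbe = "ACGTTGCAAGGCTTAACCGGTT"
host = "TTTAGGCTTAACCAAA"
host_read = "AGGCTTAACC"

without_host = build_index({"microbe": microbe}, K)
with_host = build_index({"microbe": microbe, "host": host}, K)

print(classify(host_read, without_host, K))  # assigned to the microbe
print(classify(host_read, with_host, K))     # ambiguous, left unassigned
```

Including the host genome in the database, or removing host reads beforehand, makes the shared k-mers ambiguous instead of microbe-specific.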

## Database construction
Because the default Kraken2 database-building procedure performs poorly, we use custom scripts to download the genomes and build the database.


In [None]:
python /data/project/host-microbiome/kraken_metaphlan_comparison/database_building_scripts/get_ncbi_other_domains.py --domain fungi,viral,vertebrate_mammalian,bacteria,archaea --complete True --folder  /data/database/kraken2_RefSeqV217_Complete_Chrom/download/ --download_genomes True --log_file ./run_all_download.log --processors 50

The download folder will then contain `summary_to_download.csv`.

In [1]:
!grep -c "Complete Genome" /data/database/kraken2_RefSeqV217_Complete_Chrom/download/summary_to_download.csv

44588


In [2]:
!grep -c "Chromosome" /data/database/kraken2_RefSeqV217_Complete_Chrom/download/summary_to_download.csv

5394
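
The same tally can be sketched in Python by counting values in one column rather than matching text anywhere in the line. This assumes `summary_to_download.csv` has a column named `assembly_level`, as the NCBI assembly summary files do; the actual column name in the script's output may differ. The sample rows are made up.

```python
import csv
import io
from collections import Counter


def tally_assembly_levels(handle, column="assembly_level"):
    """Count rows per value of the given column in a CSV file handle."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# Made-up sample standing in for summary_to_download.csv
sample = io.StringIO(
    "accession,assembly_level\n"
    "GCF_A,Complete Genome\n"
    "GCF_B,Chromosome\n"
    "GCF_C,Complete Genome\n"
)
levels = tally_assembly_levels(sample)
print(levels["Complete Genome"], levels["Chromosome"])
```

On the real file, pass `open(path, newline="")` instead of the in-memory sample. Column-based counts can differ from `grep -c` if the search string also appears in other fields.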


In [None]:
kraken2-build --download-taxonomy --db kraken2_RefSeqV217_Complete_Chrom --use-ftp

In [1]:
%cd /data/database/kraken2_RefSeqV217_Complete_Chrom/download

/data/database/kraken2_RefSeqV217_Complete_Chrom/download


In [5]:
import pandas as pd
# header=1 takes the column names from the second line of the file
assembly = pd.read_csv('archaea_assembly_summary.txt', header=1, index_col=0, sep="\t")
assembly

assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,species_taxid,organism_name,infraspecific_name,isolate,version_status,...,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material,asm_not_live_date
GCF_002287175.1,PRJNA224116,SAMN04229035,LMVM00000000.1,representative genome,2161,2161,Methanobacterium bryantii,strain=M.o.H.,,latest,...,Full,2017/09/06,ASM228717v1,University of California Santa Barbara,GCA_002287175.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,assembly from type material,na
GCF_000762265.1,PRJNA224116,SAMN03085433,,na,2162,2162,Methanobacterium formicicum,strain=BRM9,,latest,...,Full,2014/10/02,ASM76226v1,PGgRc,GCA_000762265.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
GCF_001458655.1,PRJNA224116,SAMEA2779801,,representative genome,2162,2162,Methanobacterium formicicum,,Mb9,latest,...,Full,2015/11/16,Mb9,CEBITEC,GCA_001458655.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
GCF_015351695.1,PRJNA224116,SAMN16521386,JADIIL000000000.1,na,2162,2162,Methanobacterium formicicum,,bin2,latest,...,Full,2020/11/10,ASM1535169v1,"Water Research Institute, IRSA-CNR",GCA_015351695.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,derived from metagenome,,na
GCF_000824705.1,PRJNA224116,SAMEA2796325,CCXV00000000.1,na,2173,2173,Methanobrevibacter smithii,strain=ACE6,,latest,...,Full,2014/10/02,Methanobrevibacter smithii,URMITE,GCA_000824705.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_029338335.1,PRJNA224116,SAMN33708874,,na,3034020,3034020,Halovivax sp. TS33,strain=TS33,,latest,...,Full,2023/03/22,ASM2933833v1,Jiangsu University,GCA_029338335.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
GCF_029338355.1,PRJNA224116,SAMN33716776,,na,3034023,3034023,Halosegnis sp. DT85,strain=DT85,,latest,...,Full,2023/03/22,ASM2933835v1,Jiangsu University,GCA_029338355.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
GCF_029338375.1,PRJNA224116,SAMN33716851,,na,3034024,3034024,Halorussus sp. DT80,strain=DT80,,latest,...,Full,2023/03/22,ASM2933837v1,Jiangsu University,GCA_029338375.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
GCF_029338395.1,PRJNA224116,SAMN33716999,,na,3034025,3034025,Halorussus sp. DT72,strain=DT72,,latest,...,Full,2023/03/22,ASM2933839v1,Jiangsu University,GCA_029338395.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na


In [7]:
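import pandas as pd

# Assembly summary tables downloaded for each domain
assembly_lists = ['archaea_assembly_summary.txt', 'bacteria_assembly_summary.txt', 'fungi_assembly_summary.txt', 'vertebrate_mammalian_assembly_summary.txt', 'viral_assembly_summary.txt']

# Collect one [accession, taxid] pair per assembly across all domains
genome_taxid = []
for summary_file in assembly_lists:
  # low_memory=False reads each file in one pass so column dtypes are inferred consistently
  summary = pd.read_csv(summary_file, header=1, index_col=0, sep="\t", low_memory=False)
  for accession, taxid in summary['taxid'].items():
    genome_taxid.append([accession, taxid])

genome_taxid = pd.DataFrame(genome_taxid, columns=['Genome accession', 'taxid']).set_index('Genome accession')
# Save the accession-to-taxid mapping for the header-renaming step
genome_taxid.to_csv('db_samples.tsv', sep='\t')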


In [8]:
genome_taxid

Genome accession,taxid
GCF_002287175.1,2161
GCF_000762265.1,2162
GCF_001458655.1,2162
GCF_015351695.1,2162
GCF_000824705.1,2173
...,...
GCF_027946335.1,3003632
GCF_027946345.1,3003632
GCF_027574445.1,3003729
GCF_028515065.1,3020045


In [None]:
python /data/project/host-microbiome/kraken_metaphlan_comparison/database_building_scripts/rename_fasta_headers.py --genome_folder /data/database/kraken2_RefSeqV217_Complete_Chrom/download --genome_list /data/database/kraken2_RefSeqV217_Complete_Chrom/download/db_samples.tsv --log_file logfile.txt --processors 40
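
The internals of `rename_fasta_headers.py` are not shown here. As a hedged sketch of what such a renaming step typically needs to achieve, Kraken2 accepts sequence IDs that carry the taxonomy ID in the form `kraken:taxid|<taxid>`. The helper below, a hypothetical function with a made-up sequence ID, rewrites one header line that way.

```python
def add_taxid_to_header(header, taxid):
    """Insert '|kraken:taxid|<taxid>' after the sequence ID of a FASTA header."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Split the sequence ID from the free-text description
    seq_id, _, description = header[1:].partition(" ")
    new_header = f">{seq_id}|kraken:taxid|{taxid}"
    return f"{new_header} {description}" if description else new_header


# taxid 2161 is Methanobacterium bryantii in the table above; "seq1" is a placeholder ID
renamed = add_taxid_to_header(">seq1 Methanobacterium bryantii", 2161)
print(renamed)
```

With the taxid embedded in each header, `kraken2-build --add-to-library` can assign sequences to taxa without relying on accession-to-taxid lookup files.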

In [None]:
python /data/project/host-microbiome/kraken_metaphlan_comparison/database_building_scripts/unzip_add_library.py --genome_folder fasta_renamed_RefSeqV205_Complete --database RefSeqV205_Complete --processors 12
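
Once the library files are in place, the remaining step is `kraken2-build --build`. The sketch below only assembles and prints that command. It assumes that `unzip_add_library.py` has not already run the build, and it reuses the database name and thread count from the command above as placeholders.

```python
import shlex


def kraken2_build_command(database, threads):
    """Return the argument list for the final kraken2-build step."""
    return ["kraken2-build", "--build", "--db", database, "--threads", str(threads)]


cmd = kraken2_build_command("RefSeqV205_Complete", 12)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems if the database path contains spaces.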

```{bibliography}
:style: unsrt
```