## Preparing the BLAST databases
Two databases were obtained: 
- All annotated 16S rRNA sequences from the NCBI FTP [website](https://ftp-ncbi-nlm-nih-gov.ejournal.mahidol.ac.th/blast/db/v5/)
- The annotated 16S rRNA sequences used by the tool [MicFunPred](https://github.com/microDM/MicFunPred)

Because these databases were of incompatible versions (the MicFunPred database is version 4, while the NCBI database is version 5), I extracted all the 

In [None]:
# Combining nucleotide databases
blastdb_aliastool -dblist "db1 db2" -dbtype nucl -out 16sCombined -title "16sCombined"

In [None]:
# Verify the combined database
blastdbcmd -entry all -db 16sCombined -out combined_check.fasta
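
One way to sanity-check the FASTA dump written by `blastdbcmd` is to count its records and look for repeated identifiers. The following is a minimal sketch using only the standard library. The helper name `summarize_fasta` is hypothetical, and it assumes `combined_check.fasta` is the plain-text FASTA file produced by the command above.

```python
from collections import Counter
import os


def summarize_fasta(path):
    """Return (record_count, duplicated_ids) for a FASTA file."""
    ids = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The identifier is the first whitespace-delimited token of the header
                header = line[1:].split()
                ids[header[0] if header else ""] += 1
    duplicated = sorted(seq_id for seq_id, n in ids.items() if n > 1)
    return sum(ids.values()), duplicated


# Only runs when the dump from the previous cell is present
if os.path.exists("combined_check.fasta"):
    count, dups = summarize_fasta("combined_check.fasta")
    print(f"{count} records, {len(dups)} duplicated identifiers")
```

A record count equal to the sum of the two source databases, with few or no duplicated identifiers, would suggest that the alias database covers both inputs.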

In [None]:
# Search with a specific output format
# -outfmt 5 is for xml format, 6 for tabular
blastn -query <query file in fasta or raw seqs> -db 16sCombined -outfmt 5 -out <output filename>
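
When the search is launched from Python, it can help to build the command as an argument list rather than a shell string, so that file names with spaces need no quoting. This is a sketch only: the helper `build_blastn_command` and the output name `results.xml` are hypothetical, while the flags are the ones shown in the cell above.

```python
import shlex


def build_blastn_command(query, db, out, outfmt=5, extra=None):
    """Build a blastn argument list suitable for subprocess.run."""
    cmd = ["blastn", "-query", str(query), "-db", str(db),
           "-outfmt", str(outfmt), "-out", str(out)]
    # Optional flags, e.g. {"-max_target_seqs": 10, "-perc_identity": 90}
    for flag, value in (extra or {}).items():
        cmd += [flag, str(value)]
    return cmd


cmd = build_blastn_command("testblast.fasta", "16sCombined", "results.xml")
print(shlex.join(cmd))
# To run it (requires BLAST+ on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing `outfmt=6` instead would request the tabular format mentioned in the comment above.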

## Parsing BLAST output
Biopython is used to parse the BLAST output.

In [61]:
import subprocess as sp
import pandas as pd
from Bio import SearchIO, SeqIO
from Bio.Blast import NCBIXML

blastdb: str = "data/blastdb/16sCombined"
params: str = (
            "-outfmt 5 " +
            "-max_target_seqs 10 " +
            "-perc_identity 90 "
)
to_parse = "testblast.fasta"
parse_type = "fasta"

sequences: list[str] = []
ids: list[str] = []
top_hits: list[str] = []
# Search the fasta file, one query sequence at a time
for query in SeqIO.parse(to_parse, parse_type):
    # The with-block closes (and flushes) the file before blastn reads it
    with open('temp.fasta', 'w') as temp:
        temp.write(f'>{query.id}\n{query.seq}\n')
    blastsearch: str = f"blastn {params} -query temp.fasta -db {blastdb} > temp.txt"
    # check=True raises an error if blastn exits with a failure
    sp.run(blastsearch, shell=True, check=True)
    parse = SearchIO.read('temp.txt', 'blast-xml')
    # Skip queries with no hits
    if not len(parse):
        continue
    sequences.append(str(query.seq))
    top_hits.append(parse[0].id)  # Hits are ordered best-first
    ids.append(query.id)

# Construct the dataframe
seq_frame = pd.DataFrame({
    'Query id': ids,
    'Query sequence': sequences,
    'Top hit': top_hits,
})
seq_frame.to_csv('output.csv')

In [21]:
# For parsing through blast output with 1+ queries
with open("my_blast.xml") as result_handle:
    # NCBIXML.parse returns an iterator, so you can loop over it to access each query
    blast_records = NCBIXML.parse(result_handle)
    for record in blast_records:
        # Placeholder body: print each query with its number of hits
        print(record.query, len(record.alignments))

In [5]:
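import pandas as pd
# The file has no header row; these names assume the default -outfmt 6 column order
columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
           'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
new = pd.read_csv('data/blastdb/tab.txt', sep='\t', header=None, names=columns)
new

| | qseqid | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M01457:82:000000000-B6GV3:1:1101:20647:1278 | JX224393.1.1496_Desulfotomaculum | 75.745 | 235 | 52 | 3 | 6 | 240 | 340 | 569 | 1.60e-24 | 113 |
| 1 | M01457:82:000000000-B6GV3:1:1101:20647:1278 | JX225284.1.1296_Desulfotomaculum | 87.778 | 90 | 11 | 0 | 6 | 95 | 123 | 212 | 2.68e-22 | 106 |
| 2 | M01457:82:000000000-B6GV3:1:1101:20647:1278 | JX224226.1.1352_Desulfotomaculum | 92.105 | 76 | 4 | 2 | 6 | 80 | 339 | 413 | 2.68e-22 | 106 |
| 3 | M01457:82:000000000-B6GV3:1:1101:20647:1278 | JX224392.1.1496_Desulfotomaculum | 85.149 | 101 | 15 | 0 | 6 | 106 | 340 | 440 | 9.65e-22 | 104 |
| 4 | M01457:82:000000000-B6GV3:1:1101:20647:1278 | JX224446.1.1287_Desulfosporosinus | 90.789 | 76 | 5 | 2 | 6 | 80 | 136 | 210 | 1.25e-20 | 100 |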
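
The tabular output lists several hits for the same query. To keep a single hit per query, one option is to take the row with the highest bit score. The sketch below assumes the default `-outfmt 6` column order and uses only the standard library so that it runs without extra dependencies. The helper name `best_hits` is hypothetical.

```python
import csv
import io

# Default column order of blastn -outfmt 6
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Map each query id to its hit row (a dict) with the highest bit score."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        # Keep the first row seen on ties, which preserves BLAST's own ordering
        if current is None or score > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Two rows taken from the output above
example = io.StringIO(
    "M01457:82:000000000-B6GV3:1:1101:20647:1278\tJX224393.1.1496_Desulfotomaculum\t"
    "75.745\t235\t52\t3\t6\t240\t340\t569\t1.60e-24\t113\n"
    "M01457:82:000000000-B6GV3:1:1101:20647:1278\tJX225284.1.1296_Desulfotomaculum\t"
    "87.778\t90\t11\t0\t6\t95\t123\t212\t2.68e-22\t106\n"
)
hits = best_hits(example)
for query_id, row in hits.items():
    print(query_id, row["sseqid"], row["bitscore"])
```

For a real file, pass an open file handle instead of the `io.StringIO` example, for instance `open('data/blastdb/tab.txt', newline='')`.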


# Datasets
Use the following as test datasets:
- https://www.ebi.ac.uk/ena/browser/view/PRJEB9039 - Microbial life in geothermal and volcanic areas
- https://www.ebi.ac.uk/ena/browser/view/PRJEB10824 - Functional diversity of Outokumpu fractures
- https://www.ebi.ac.uk/ena/browser/view/PRJEB10822 - Outokumpu fracture communities
- https://www.ebi.ac.uk/ena/browser/view/PRJNA188465 - Sulfidic cave snottites Metagenome
- https://www.ebi.ac.uk/ena/browser/view/PRJNA487671 - BACTERIAL COMMUNITIES IN THE ACID AND THERMOPHILIC CRATER-LAKE OF THE VOLCANO "EL CHICHON", MEXICO
- https://www.ebi.ac.uk/ena/browser/view/PRJEB36834 - Prokaryotic Community Profiling of the Mt. Makiling Mudspring
- https://www.ebi.ac.uk/ena/browser/view/PRJEB10725 - This dataset has both amplicon and shotgun data, so you can run your method on the amplicon data only and check whether it recovers the same GO annotations. It is 551 GB, so be careful when downloading. It could also be used as a training set.
    - https://www.ebi.ac.uk/ena/browser/view/PRJEB87662 - This one too
    - https://www.ebi.ac.uk/ena/browser/view/PRJNA394849 - This could also be used; it is only 14 GB
    - https://www.ebi.ac.uk/ena/browser/view/PRJEB34718 - This one is on lichens and is probably the best option because of its size
- [Here](https://www-ncbi-nlm-nih-gov.ejournal.mahidol.ac.th/sra) is a link to the 16S rRNA data from [this](https://www.ebi.ac.uk/ena/browser/view/PRJNA682552?show=publications) sediment study
- [Another](https://www-ncbi-nlm-nih-gov.ejournal.mahidol.ac.th/sra) 16S rRNA resource from [this](https://www.ebi.ac.uk/metagenomics/studies/MGYS00000666#overview) soil study
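
For the datasets with both amplicon and shotgun data, the check described above amounts to measuring how many of the reference GO terms the amplicon-based prediction recovers. A minimal sketch of that comparison follows, assuming each annotation set is available as a collection of GO identifiers. The function name `go_recovery` is hypothetical, and the identifiers are placeholders for illustration, not results from these datasets.

```python
def go_recovery(predicted, reference):
    """Compare predicted GO terms against a reference set.

    Returns precision, recall, and the Jaccard index as floats.
    """
    predicted, reference = set(predicted), set(reference)
    shared = predicted & reference
    union = predicted | reference
    return {
        "precision": len(shared) / len(predicted) if predicted else 0.0,
        "recall": len(shared) / len(reference) if reference else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder identifiers for illustration only
amplicon_terms = {"GO:0000001", "GO:0000002", "GO:0000003"}
shotgun_terms = {"GO:0000002", "GO:0000003", "GO:0000004", "GO:0000005"}
print(go_recovery(amplicon_terms, shotgun_terms))
```

Recall is the number most directly tied to the question of whether the same annotations were recovered; precision shows how many predicted terms lack support in the reference.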

# Machine learning
https://www.kaggle.com/code/singhakash/dna-sequencing-with-machine-learning - Use this as the reference for the machine learning step