# TODOs

1. Find reference sequences for sbr - https://www.alliancegenome.org/gene/FB:FBgn0003321 (Sequence Details (Mode: cDNA)):
   - find the coordinates of the gene on the X chromosome
   - full gene sequence
   - spliced variant
   - conserved cassette
2. Write a script that:
   - searches the NCBI database for the sbr / nxf1 gene in the family Drosophilidae
   - keeps only the adequate hits and builds a BLAST database from them
3. BLAST the variants from item 1 against the variants from item 2

# Imports

In [1]:
from Bio import Entrez
from Bio import Blast

from entrez import nucl_search, save_esearch_results
from parse_blast_results import calculate_qc, update_df
from fasta_processing import plain_to_fasta, read_fasta
from data_processing import cluster_analysis, save_seqs

Entrez.email = "artemvaskaa@gmail.com"
Blast.email = "artemvaskaa@gmail.com"

# Fasta processing funcs

In [42]:
path_to_plain_dir = "/home/artemvaska/Master_degree/Diploma/References/"

In [None]:
plain_to_fasta(path_to_plain_dir + "sbr_RA_gene_plain.fa", fasta_line_length=80, uppercase=True)

In [45]:
plain_to_fasta(path_to_plain_dir + "sbr_RA_CDS_plain.fa", fasta_line_length=80, uppercase=True)
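
The source of `plain_to_fasta` (from the local `fasta_processing` module) is not shown in this notebook. The following is only a minimal sketch of what such a conversion might look like, assuming the plain file holds a bare sequence with stray whitespace. The function name `plain_text_to_fasta` and the `header` argument are hypothetical; the real helper works on file paths.

```python
def plain_text_to_fasta(sequence, header, line_length=80, uppercase=True):
    """Format a raw nucleotide string as one FASTA record (hypothetical helper)."""
    # Drop all whitespace (spaces, tabs, newlines) that plain-text exports may contain.
    cleaned = "".join(sequence.split())
    if uppercase:
        cleaned = cleaned.upper()
    # Cut the sequence into fixed-width lines.
    lines = [
        cleaned[i:i + line_length] for i in range(0, len(cleaned), line_length)
    ]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy input, wrapped at 8 characters so the effect is visible.
record = plain_text_to_fasta("atgc atgc\natgcat", header="sbr_RA_gene", line_length=8)
print(record)
```

Returning a string instead of writing a file keeps the sketch easy to check; writing the result next to the input file would be a small wrapper around it.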

# Entrez funcs

In [8]:
query = "(Drosophilidae[ORGN] NOT Drosophila melanogaster[ORGN]) AND (chromosome X[WORD] NOT PREDICTED[WORD] NOT gene[WORD]) AND 15000000:75000000[SLEN]"
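
The query above is easy to break when edited by hand, so it may help to assemble it from its parts. This is a sketch only: `build_query` and its parameter names are hypothetical and are not part of the local `entrez` module.

```python
def build_query(taxon, excluded_taxon, required_word, excluded_words, min_len, max_len):
    """Compose an Entrez nucleotide query from its parts (hypothetical helper)."""
    # Organism filter: everything in the taxon except one species.
    organism = f"({taxon}[ORGN] NOT {excluded_taxon}[ORGN])"
    # Word filter: one required word followed by NOT clauses.
    words = f"{required_word}[WORD]" + "".join(
        f" NOT {word}[WORD]" for word in excluded_words
    )
    # Sequence length range in base pairs.
    length = f"{min_len}:{max_len}[SLEN]"
    return f"{organism} AND ({words}) AND {length}"


rebuilt_query = build_query(
    taxon="Drosophilidae",
    excluded_taxon="Drosophila melanogaster",
    required_word="chromosome X",
    excluded_words=["PREDICTED", "gene"],
    min_len=15000000,
    max_len=75000000,
)
print(rebuilt_query)
```

With these arguments the function reproduces the hand-written query string character for character.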

In [9]:
id_list_test = nucl_search(query)

In [63]:
save_esearch_results(id_list_test[:3], "Drosophilidae")

In [13]:
# Save all sequences into one file

with open("3_species.fa", "w") as ouf:
    for rec in id_list_test[:3]:
        # Fetch one record as FASTA text
        handle = Entrez.efetch(
            db="nucleotide", id=rec, retmode="text", rettype="fasta"
        )
        ouf.write(handle.read() + "\n")
        handle.close()  # release the network connection

# Local blast+

Databases available to NCBI `blastn -remote`:

https://rc.dartmouth.edu/index.php/blast-introduction/blast-databases/

Installing blast+ on Linux (https://www.ncbi.nlm.nih.gov/books/NBK52640/):

1. https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ (ncbi-blast-2.16.0+-x64-linux.tar.gz)
2. `mv ~/Downloads/ncbi-blast-2.16.0+-x64-linux.tar.gz ~`
3. `tar zxvpf ncbi-blast-2.16.0+-x64-linux.tar.gz`
4. `export PATH=$PATH:$HOME/ncbi-blast-2.16.0+/bin`
5. `mkdir $HOME/blastdb`
6. `export BLASTDB=$HOME/blastdb`
7. Add both `export` lines to `~/.bash_profile` (`nano ~/.bash_profile`) so they persist across sessions

## Commands for local db

`$ makeblastdb -in Master_degree/Diploma/References/sbr_RA_gene.fa -dbtype nucl -out blastdb/sbr_ra/sbr_ra`

`$ blastn -db blastdb/sbr_ra/sbr_ra -query Master_degree/Diploma/References/Drosophilidae/1797095071.fa`

`$ makeblastdb -in Master_degree/Diploma/References/3_species.fa -dbtype nucl -out blastdb/3_sp/3_sp`

`$ blastn -db blastdb/3_sp/3_sp -query Master_degree/Diploma/References/sbr_RA_gene.fa`
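
The same `blastn` call can be assembled in Python so that results land in a file the parser below can read. This is a sketch under assumptions: `build_blastn_command` is a hypothetical helper, the output file name and the `task` / `max_target_seqs` values are illustrative, and `-outfmt 16` requests single-file BLAST XML2.

```python
import shlex


def build_blastn_command(db, query, out, task="megablast", outfmt=16, max_target_seqs=250):
    """Assemble a blastn argument list (hypothetical helper, illustrative defaults)."""
    return [
        "blastn",
        "-task", task,
        "-db", db,
        "-query", query,
        "-outfmt", str(outfmt),  # 16 = single-file BLAST XML2
        "-max_target_seqs", str(max_target_seqs),
        "-out", out,
    ]


cmd = build_blastn_command(
    db="blastdb/3_sp/3_sp",
    query="Master_degree/Diploma/References/sbr_RA_gene.fa",
    out="sbr_RA_vs_3_sp.xml",  # hypothetical output name
)
# Print the shell-quoted command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with paths, and printing it first makes it easy to compare against the manual commands above.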

Accession prefixes:

- AE: NCBI, genome projects
- AJ: EBI, direct submissions
- NM: curated RefSeq

# sbr gene

https://www.alliancegenome.org/gene/FB:FBgn0003321

Sequence Details (Mode: cDNA)

---

# Parsing BLAST XML2, calculating QC and saving found sequences

Drosophilidae (taxid:7214)

Drosophila melanogaster (taxid:7227)

https://biochem.slu.edu/bchm628/handouts/2013/Entrez_boolian_searches.pdf

https://biopython.org/docs/dev/Tutorial/chapter_blast.html

https://biopython.org/docs/dev/Tutorial/chapter_blast.html#the-blast-records-record-and-hit-classes

In [1]:
name_of_blast_res = "../Blast_res/full_sbr_RA_wgs_megablast_250_16.xml" # XML2 !!!
result_stream = open(name_of_blast_res, "rb")

In [4]:
blast_record = Blast.read(result_stream)

In [5]:
original_blast_record = blast_record[:]

In [5]:
hit = blast_record[0]
# hsp = hit[0] # HSPs -- “High-scoring Segment Pairs”
# hit.target  # hit.targets
# print(hit)  # span -- the alignment length including gaps

In [8]:
hsp = hit[0]
# print(hsp)

1. Obtain BLAST results (NCBI / Bio.Blast / cmd local db / cmd -remote)
2. Parse (target.id, QC)
3. Filter based on QC_THRESHOLD `#TODO`
4. Group QC `#TODO Scikit-learn cluster?`
5. Download into group_folder via Entrez.efetch
6. Run a multiple sequence alignment `#TODO MEGA / UGENE / cmd`: probably from the command line, then check the result in UGENE?
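
Steps 2 and 3 can be sketched as follows. The source of `calculate_qc` is not shown here, so this assumes that QC stands for query coverage, meaning the fraction of query positions covered by at least one HSP. The function `query_coverage`, the toy ranges, and the `QC_THRESHOLD` value are all hypothetical.

```python
def query_coverage(hsp_query_ranges, query_length):
    """Fraction of the query covered by at least one HSP (0-based, end-exclusive ranges)."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(hsp_query_ranges):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: bank it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / query_length


QC_THRESHOLD = 0.5  # hypothetical cut-off, to be tuned

hits = {
    "hit_A": [(0, 600), (500, 900)],    # union covers 900 of 1000 positions
    "hit_B": [(100, 200), (150, 250)],  # union covers 150 of 1000 positions
}
qcs = {name: query_coverage(ranges, 1000) for name, ranges in hits.items()}
passed = [name for name, qc in qcs.items() if qc >= QC_THRESHOLD]
print(qcs, passed)
```

Merging the ranges before summing matters because overlapping HSPs would otherwise be counted twice and could push the coverage above 1.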

## Pipeline

In [2]:
name_of_blast_res = "../Blast_res/full_sbr_RA_wgs_megablast_250_16.xml" # XML2 !!!
result_stream = open(name_of_blast_res, "rb")
blast_record = Blast.read(result_stream)

qcs = calculate_qc(blast_record)
qcs_df = cluster_analysis(qcs)
qcs_df = update_df(qcs_df, blast_record)

In [3]:
qcs_df

Unnamed: 0,QC,Cluster,Acc,Strand,Start,Stop
gi|2735068466|gb|JBBODO010000732.1|,0.9797,0,JBBODO010000732.1,1,160127,174040
gi|2735070232|gb|JBBODR010000286.1|,0.9764,0,JBBODR010000286.1,1,4625347,4639269
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,0,JBAMBY010000011.1,1,4400346,4414275
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,0,JBAMBW010000153.1,2,10646675,10660638
gi|2735068651|gb|JBBODP010000434.1|,0.9773,0,JBBODP010000434.1,1,4652416,4666336
...,...,...,...,...,...,...
gi|823984223|gb|JXPY01022046.1|,0.1416,1,JXPY01022046.1,1,9944,11982
gi|2030294827|gb|JAECWR010000194.1|,0.2088,1,JAECWR010000194.1,2,17051973,17063133
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,1,JAWNPD010000679.1,1,8198711,8212105
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,1,JAWNOD010000520.1,1,3143039,3154598
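
The `Acc` column looks like the fourth field of the pipe-delimited hit ID. A minimal sketch of that extraction is shown here; `accession_from_hit_id` is a hypothetical name, and whether `update_df` does it this way is an assumption.

```python
def accession_from_hit_id(hit_id):
    """Return the accession.version field of a 'gi|...|gb|...|' style identifier."""
    fields = [field for field in hit_id.split("|") if field]
    # Layout assumed here: gi | <gi number> | <database tag> | <accession.version>
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    # Fall back to the unchanged ID when the layout is different.
    return hit_id


print(accession_from_hit_id("gi|2735068466|gb|JBBODO010000732.1|"))
```

The fallback branch keeps IDs that are already bare accessions intact instead of raising an error.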


In [4]:
# CAREFUL !!!

save_seqs("Drosophilidae", qcs_df)

Fix the file names in the subfolders

Add a filter so that very large and very small sequences are not downloaded

Check range(Start, Stop) to make sure that Start < Stop

In [7]:
check = qcs_df.Start < qcs_df.Stop

In [13]:
check.nunique()  # 1 unique value: the comparison gives the same result for every row

1
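
The size filter from the notes above could be combined with the Start < Stop check in one predicate. This is a sketch with plain Python values taken from three rows of `qcs_df`; the bounds `MIN_SPAN` and `MAX_SPAN` and the function name are hypothetical and would need tuning.

```python
MIN_SPAN = 5_000   # hypothetical lower bound, bp
MAX_SPAN = 50_000  # hypothetical upper bound, bp


def has_plausible_span(start, stop, min_span=MIN_SPAN, max_span=MAX_SPAN):
    """True when start < stop and the region length lies within [min_span, max_span]."""
    return start < stop and min_span <= stop - start <= max_span


regions = {
    "JBBODO010000732.1": (160127, 174040),    # 13,913 bp
    "JXPY01022046.1": (9944, 11982),          # 2,038 bp
    "JARPSD010000001.1": (7188510, 8410026),  # 1,221,516 bp
}
kept = [acc for acc, (start, stop) in regions.items() if has_plausible_span(start, stop)]
print(kept)
```

On the DataFrame itself the same condition could be written as a boolean mask over `qcs_df.Stop - qcs_df.Start` before calling `save_seqs`.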

In [14]:
qcs_df.loc[qcs_df["Acc"] == "JARPSD010000001.1"]

Unnamed: 0,QC,Cluster,Acc,Strand,Start,Stop
gi|2514662160|gb|JARPSD010000001.1|,0.3883,1,JARPSD010000001.1,2,7188510,8410026


In [16]:
print(blast_record["gi|2514662160|gb|JARPSD010000001.1|"])

Query: Query_719443
       sbr-RA-gene
  Hit: gi|2514662160|gb|JARPSD010000001.1| (length=32685500)
       Drosophila kikkawai strain 14028-0561.14 chromosome X, whole genome
       shotgun sequence
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    1696.34    1825      [2170:3987]      [8408741:8406927]
          1  7.5e-116     431.39     856     [9890:10718]      [8399803:8398959]
          2     3e-50     213.49     793    [12389:13157]      [8396961:8396190]
          3   2.3e-46     200.56     330        [668:993]      [8410026:8409703]
          4     3e-35     163.62     253      [7462:7706]      [8402572:8402321]
          5   2.4e-31     150.70     524      [2235:2751]      [7189026:7188510]
          6   2.4e-31     150.70     525      [2235:2751]      [7297950:
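
For this hit, the minimum and maximum target coordinates over the complete HSP rows 0 to 5 equal the Start and Stop stored in `qcs_df`, which suggests (without proving it) that a single distant HSP stretches the region to about 1.2 Mb. One possible remedy is to group HSPs that lie close together on the target and keep the best-scoring group. The sketch below uses only rows 0 to 5 printed above; `best_hsp_chain` and the `max_gap` value are hypothetical.

```python
def best_hsp_chain(hsps, max_gap=20_000):
    """Group HSPs within max_gap of each other on the target; return (start, stop) of the top-scoring group."""
    # Normalise coordinates so that low < high regardless of strand.
    spans = sorted((min(a, b), max(a, b), score) for a, b, score in hsps)
    chains = []
    for low, high, score in spans:
        if chains and low - chains[-1]["stop"] <= max_gap:
            # Close enough to the previous HSP: extend the current group.
            chain = chains[-1]
            chain["stop"] = max(chain["stop"], high)
            chain["score"] += score
        else:
            # Too far away: start a new group.
            chains.append({"start": low, "stop": high, "score": score})
    best = max(chains, key=lambda chain: chain["score"])
    return best["start"], best["stop"]


# (hit range start, hit range end, bit score) for HSPs 0 to 5
hsps = [
    (8408741, 8406927, 1696.34),
    (8399803, 8398959, 431.39),
    (8396961, 8396190, 213.49),
    (8410026, 8409703, 200.56),
    (8402572, 8402321, 163.62),
    (7189026, 7188510, 150.70),
]
naive = (min(min(a, b) for a, b, _ in hsps), max(max(a, b) for a, b, _ in hsps))
print(naive)                 # whole range, inflated by the distant HSP 5
print(best_hsp_chain(hsps))  # range of the best-scoring group only
```

With a 20 kb gap the five neighbouring HSPs form one group of about 13.8 kb, while HSP 5 ends up alone in a second, low-scoring group.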

In [17]:
qcs_df.loc[qcs_df["Acc"] == "JAECXM010000281.1"]

Unnamed: 0,QC,Cluster,Acc,Strand,Start,Stop
gi|2030316345|gb|JAECXM010000281.1|,0.3523,1,JAECXM010000281.1,2,2269971,3363902


In [18]:
print(blast_record["gi|2030316345|gb|JAECXM010000281.1|"])

Query: Query_719443
       sbr-RA-gene
  Hit: gi|2030316345|gb|JAECXM010000281.1| (length=27625346)
       Drosophila kikkawai isolate 14028-0561.14 contig_284, whole genome
       shotgun sequence
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    1696.34    1825      [2170:3987]      [3362617:3360803]
          1  7.5e-116     431.39     856     [9890:10718]      [3353679:3352835]
          2     3e-50     213.49     793    [12389:13157]      [3350837:3350066]
          3   2.3e-46     200.56     330        [668:993]      [3363902:3363579]
          4     3e-35     163.62     253      [7462:7706]      [3356448:3356197]
          5   2.4e-31     150.70     525      [2235:2751]      [2270487:2269971]
          6     3e-30     147.01     250    [13344:13591]      [3349865:3