# TODOs

1. Find reference sequences for sbr - https://www.alliancegenome.org/gene/FB:FBgn0003321 (Sequence Details (Mode: cDNA)):
   - find the gene's coordinates on the X chromosome
   - full gene sequence
   - spliced variant
   - conserved cassette
2. Write a script that:
   - searches the NCBI database for the sbr / nxf1 gene within the family Drosophilidae
   - keeps only the adequate hits and builds a BLAST database from them
3. BLAST the variants from item 1 against the variants from item 2

# Fasta processing funcs

In [42]:
import numpy as np

path_to_plain_dir = "/home/artemvaska/Master_degree/Diploma/References/"

In [None]:
plain_to_fasta(path_to_plain_dir + "sbr_RA_gene_plain.fa", fasta_line_length=80, uppercase=True)

In [45]:
plain_to_fasta(path_to_plain_dir + "sbr_RA_CDS_plain.fa", fasta_line_length=80, uppercase=True)
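`plain_to_fasta` comes from the author's `fasta_processing` module, and its code is not shown here. As a rough sketch of what such a conversion might involve, assuming it wraps a plain sequence at `fasta_line_length` characters and optionally uppercases it (the name `to_fasta` and the explicit `header` argument are illustrative, not the author's API):

```python
def to_fasta(header: str, sequence: str, line_length: int = 80, uppercase: bool = True) -> str:
    """Return one FASTA record with the sequence wrapped to a fixed width."""
    # Drop spaces and line breaks that a copy-pasted plain sequence may contain.
    seq = "".join(sequence.split())
    if uppercase:
        seq = seq.upper()
    # Cut the sequence into consecutive chunks of at most line_length characters.
    lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
    return f">{header}\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_fasta("sbr_RA_gene", "acgt acgtac\n", line_length=4), end="")
```

The sketch returns a string instead of writing a file, which keeps it easy to check before pointing it at real data.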

# Entrez funcs

In [8]:
query = "(Drosophilidae[ORGN] NOT Drosophila melanogaster[ORGN]) AND (chromosome X[WORD] NOT PREDICTED[WORD] NOT gene[WORD]) AND 15000000:75000000[SLEN]"
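The query above combines three Entrez field filters: organism (`[ORGN]`), text words (`[WORD]`) and sequence length (`[SLEN]`). A minimal sketch of assembling the same string from its parts and passing it to `Bio.Entrez.esearch` follows. The helper names `build_query` and `search_nucleotide_ids` are hypothetical, and the author's `nucl_search` may work differently:

```python
def build_query(taxon, excluded_taxon, required_words, excluded_words, min_len, max_len):
    """Assemble an Entrez nucleotide query from organism, word and length filters."""
    organism = f"({taxon}[ORGN] NOT {excluded_taxon}[ORGN])"
    words = " NOT ".join([f"{required_words}[WORD]"] + [f"{w}[WORD]" for w in excluded_words])
    length = f"{min_len}:{max_len}[SLEN]"
    return f"{organism} AND ({words}) AND {length}"


def search_nucleotide_ids(query, email, retmax=20):
    """Return the ID list of an esearch against the nucleotide database (needs network)."""
    from Bio import Entrez  # imported lazily so build_query works without Biopython

    Entrez.email = email
    stream = Entrez.esearch(db="nucleotide", term=query, retmax=retmax)
    try:
        record = Entrez.read(stream)
    finally:
        stream.close()
    return list(record["IdList"])


query_sketch = build_query(
    "Drosophilidae", "Drosophila melanogaster",
    "chromosome X", ["PREDICTED", "gene"],
    15000000, 75000000,
)
```

Building the query from named parts makes it easier to swap the taxon or the length window without retyping the boolean expression.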

In [9]:
id_list_test = nucl_search(query)

In [63]:
save_esearch_results(id_list_test[:3], "Drosophilidae")

In [13]:
# save all seqs in 1 file

with open("3_species.fa", "w") as ouf:
    for rec in id_list_test[:3]:
        # fetch one record as FASTA text and append it to the combined file
        stream = Entrez.efetch(db="nucleotide", id=rec, retmode="text", rettype="fasta")
        ouf.write(stream.read() + "\n")
        stream.close()

# Local blast+

Databases available to NCBI `blastn -remote`:

https://rc.dartmouth.edu/index.php/blast-introduction/blast-databases/

Installing BLAST+ on Linux (https://www.ncbi.nlm.nih.gov/books/NBK52640/):

1. https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ (ncbi-blast-2.16.0+-x64-linux.tar.gz)
2. `mv ~/Downloads/ncbi-blast-2.16.0+-x64-linux.tar.gz ~`
3. `tar zxvpf ncbi-blast-2.16.0+-x64-linux.tar.gz`
4. `export PATH=$PATH:$HOME/ncbi-blast-2.16.0+/bin`
5. `mkdir $HOME/blastdb`
6. `export BLASTDB=$HOME/blastdb`
7. Add both `export` lines to `~/.bash_profile` (`nano ~/.bash_profile`) so they persist across sessions

## Commands for local db

`$ makeblastdb -in Master_degree/Diploma/References/sbr_RA_gene.fa -dbtype nucl -out blastdb/sbr_ra/sbr_ra`

`$ blastn -db blastdb/sbr_ra/sbr_ra -query Master_degree/Diploma/References/Drosophilidae/1797095071.fa`

`$ makeblastdb -in Master_degree/Diploma/References/3_species.fa -dbtype nucl -out blastdb/3_sp/3_sp`

`$ blastn -db blastdb/3_sp/3_sp -query Master_degree/Diploma/References/sbr_RA_gene.fa`
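The parsing cells later in this notebook expect BLAST XML2 output. As a sketch, the `blastn` call above could be driven from Python and made to write XML2 directly. `-outfmt 16` is BLAST+'s single-file XML2 format; the helper names and the output file name `sbr_vs_3_sp.xml` are assumptions made for illustration:

```python
import shlex
import subprocess


def blastn_command(db, query, out, outfmt=16, task="megablast"):
    """Build the argument list for a local blastn run that writes to a file."""
    return [
        "blastn", "-task", task,
        "-db", db,
        "-query", query,
        "-outfmt", str(outfmt),  # 16 = single-file BLAST XML2
        "-out", out,
    ]


def run_blastn(cmd):
    """Run the command; raises CalledProcessError if blastn exits with an error."""
    subprocess.run(cmd, check=True)


cmd = blastn_command(
    db="blastdb/3_sp/3_sp",
    query="Master_degree/Diploma/References/sbr_RA_gene.fa",
    out="sbr_vs_3_sp.xml",  # hypothetical output name
)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths, and `run_blastn(cmd)` is only called once BLAST+ is on `PATH`.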

Accession prefixes:

- AE: NCBI, genome projects
- AJ: EBI, direct submissions
- NM: curated RefSeq

# sbr gene

https://www.alliancegenome.org/gene/FB:FBgn0003321

Sequence Details (Mode: cDNA)

---

# Parsing BLAST XML2, calculating QC and saving found sequences

Drosophilidae (taxid:7214)

Drosophila melanogaster (taxid:7227)

https://biochem.slu.edu/bchm628/handouts/2013/Entrez_boolian_searches.pdf

https://biopython.org/docs/dev/Tutorial/chapter_blast.html

https://biopython.org/docs/dev/Tutorial/chapter_blast.html#the-blast-records-record-and-hit-classes

In [2]:
name_of_blast_res = "../Blast_res/full_sbr_RA_wgs_megablast_250_16.xml" # XML2 !!!
result_stream = open(name_of_blast_res, "rb")

In [3]:
blast_record = Blast.read(result_stream)

In [4]:
original_blast_record = blast_record[:]

In [5]:
hit = blast_record[0]
# hsp = hit[0] # HSPs -- “High-scoring Segment Pairs”
# hit.target  # hit.targets
# print(hit)  # span -- the alignment length including gaps

In [6]:
hsp = hit[0]
# print(hsp)

1. Obtain BLAST results (NCBI web / Bio.Blast / local DB from the command line / command line with `-remote`)
2. Parse the results (target.id, QC)
3. Filter by QC_THRESHOLD
4. Cluster the QC values with scikit-learn `DBSCAN`
5. Download the sequences into group_folder via Entrez.efetch
6. Group the files by species and keep only one per species
7. Multiple alignment: `#TODO MEGA / UGENE / cmd` (command line, then check in UGENE?)

# Imports

In [1]:
from Bio import Entrez
from Bio import Blast

from entrez import nucl_search, save_esearch_results
from parse_blast_results import calculate_qc, filter_df, update_df
from fasta_processing import plain_to_fasta, read_fasta
from data_processing import (cluster_analysis_preview,
                             cluster_analysis,
                             save_seqs,
                             extract_genome_coverages,
                             add_genome_coverages,
                             select_max_ids,
                             filter_genome_coverages
                             )
from group_species import group_species, group_species_genome_coverage
from alignment import edit_names_for_alignment

Entrez.email = "artemvaskaa@gmail.com"
Blast.email = "artemvaskaa@gmail.com"

## Pipeline

In [2]:
name_of_blast_res = "../Blast_res/full_sbr_RA_wgs_megablast_250_16.xml" # XML2 !!!
result_stream = open(name_of_blast_res, "rb")
blast_record = Blast.read(result_stream)

In [3]:
print(blast_record[0])

Query: Query_719443
       sbr-RA-gene
  Hit: gi|2735068466|gb|JBBODO010000732.1| (length=1492563)
       Drosophila simulans strain SZ45 tig00001109, whole genome shotgun
       sequence
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    9745.89    8606     [5873:14341]        [165809:174040]
          1         0    6167.08    4460      [1113:5494]        [161115:165507]
          2         0    1055.56     761       [360:1116]        [160331:161089]
          3   7.9e-81     315.05     231      [5500:5728]        [165482:165708]
          4     5e-43     189.48     217          [0:217]        [160127:160331]


In [4]:
qcs_df = calculate_qc(blast_record)

In [5]:
qcs_df

Unnamed: 0,QC
gi|2735068466|gb|JBBODO010000732.1|,0.9797
gi|2735070232|gb|JBBODR010000286.1|,0.9764
gi|2733175125|gb|JBAMBY010000011.1|,0.9729
gi|2733173904|gb|JBAMBW010000153.1|,0.9757
gi|2735068651|gb|JBBODP010000434.1|,0.9773
...,...
gi|76492874|gb|AAKO01002607.1|,0.0683
gi|2644252255|gb|JAWNPD010000679.1|,0.1697
gi|2644243871|gb|JAWNOD010000520.1|,0.1625
gi|111231287|gb|AASR01047437.1|,0.0522
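The notebook does not define QC. If it stands for query coverage, one common way to compute it is the fraction of the query covered by the union of HSP query ranges. The sketch below illustrates that reading only; `calculate_qc` may use a different formula, and the function names here are hypothetical. Ranges are treated as half-open `[start:end]`, as Biopython prints them:

```python
def merge_ranges(ranges):
    """Merge overlapping half-open (start, end) ranges into disjoint ones."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def covered_length(ranges):
    """Number of query positions covered by at least one range."""
    return sum(end - start for start, end in merge_ranges(ranges))


def query_coverage(ranges, query_length):
    """Covered fraction of the query, between 0 and 1."""
    return covered_length(ranges) / query_length


# Query ranges of the five HSPs printed for the first hit above.
hsp_query_ranges = [(5873, 14341), (1113, 5494), (360, 1116), (5500, 5728), (0, 217)]
print(covered_length(hsp_query_ranges))
```

Merging first matters because HSPs 1 and 2 overlap by three bases; summing raw spans would count those positions twice.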


In [6]:
qcs_df_upd = update_df(qcs_df, blast_record)  # 8395741 8410026 3349617 3363902

Target range cannot be calculated automatically. Please enter coordinates manually from the list below:
[7188510, 7189026, 7297434, 7297950, 8395741, 8395989, 8396190, 8396961, 8397432, 8397472, 8398959, 8399803, 8400836, 8400909, 8402321, 8402572, 8402845, 8402932, 8405238, 8405296, 8406927, 8408741, 8409289, 8409341, 8409703, 8410026]
Target range cannot be calculated automatically. Please enter coordinates manually from the list below:
[2269971, 2270487, 3349617, 3349865, 3350066, 3350837, 3351311, 3351351, 3352835, 3353679, 3354712, 3354785, 3356197, 3356448, 3356721, 3356808, 3359114, 3359172, 3360803, 3362617, 3363165, 3363217, 3363579, 3363902]
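The coordinates entered by hand in the cell above (8395741 8410026 and 3349617 3363902) are the ends of the densest run in each list. A possible way to suggest them automatically is to split the sorted coordinates wherever the gap exceeds a threshold and take the widest group. This is a sketch, not what `update_df` does; the 10,000 bp gap is an assumption borrowed from the `range_threshold` value used later:

```python
def split_by_gap(coords, max_gap):
    """Group sorted coordinates so that neighbours within a group differ by <= max_gap."""
    groups = []
    for c in sorted(coords):
        if groups and c - groups[-1][-1] <= max_gap:
            groups[-1].append(c)
        else:
            groups.append([c])
    return groups


def widest_range(coords, max_gap=10_000):
    """Return (start, stop) of the group spanning the most bases."""
    groups = split_by_gap(coords, max_gap)
    best = max(groups, key=lambda g: g[-1] - g[0])
    return best[0], best[-1]


first = [7188510, 7189026, 7297434, 7297950, 8395741, 8395989, 8396190, 8396961,
         8397432, 8397472, 8398959, 8399803, 8400836, 8400909, 8402321, 8402572,
         8402845, 8402932, 8405238, 8405296, 8406927, 8408741, 8409289, 8409341,
         8409703, 8410026]
second = [2269971, 2270487, 3349617, 3349865, 3350066, 3350837, 3351311, 3351351,
          3352835, 3353679, 3354712, 3354785, 3356197, 3356448, 3356721, 3356808,
          3359114, 3359172, 3360803, 3362617, 3363165, 3363217, 3363579, 3363902]

print(widest_range(first))   # (8395741, 8410026)
print(widest_range(second))  # (3349617, 3363902)
```

The short outlying pairs, such as 7188510 to 7189026, fall into their own small groups and are ignored.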


In [7]:
qcs_df_upd

Unnamed: 0,QC,Acc,Strand,Start,Stop
gi|2735068466|gb|JBBODO010000732.1|,0.9797,JBBODO010000732.1,1,160127,174040
gi|2735070232|gb|JBBODR010000286.1|,0.9764,JBBODR010000286.1,1,4625347,4639269
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,JBAMBY010000011.1,1,4400346,4414275
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,JBAMBW010000153.1,2,10646675,10660638
gi|2735068651|gb|JBBODP010000434.1|,0.9773,JBBODP010000434.1,1,4652416,4666336
...,...,...,...,...,...
gi|76492874|gb|AAKO01002607.1|,0.0683,AAKO01002607.1,2,60,1028
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598
gi|111231287|gb|AASR01047437.1|,0.0522,AASR01047437.1,2,0,764


In [8]:
qcs_df_upd_filtered = filter_df(qcs_df_upd, qc_threshold=0.1, range_threshold=10_000)
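`filter_df` is imported from the author's `parse_blast_results` module and its code is not shown. The sketch below gives one plausible reading, assuming `qc_threshold` is a minimum QC and `range_threshold` a minimum target span (`Stop - Start`). The helper `filter_rows` and the plain-dict rows are illustrative only:

```python
def filter_rows(rows, qc_threshold=0.1, range_threshold=10_000):
    """Keep rows whose QC and target span both reach their thresholds."""
    return [
        row for row in rows
        if row["QC"] >= qc_threshold and row["Stop"] - row["Start"] >= range_threshold
    ]


# Four rows copied from the table printed above.
rows = [
    {"Acc": "JBBODO010000732.1", "QC": 0.9797, "Start": 160127, "Stop": 174040},
    {"Acc": "AAKO01002607.1", "QC": 0.0683, "Start": 60, "Stop": 1028},
    {"Acc": "JAWNPD010000679.1", "QC": 0.1697, "Start": 8198711, "Stop": 8212105},
    {"Acc": "AASR01047437.1", "QC": 0.0522, "Start": 0, "Stop": 764},
]

kept = [row["Acc"] for row in filter_rows(rows)]
print(kept)
```

Under these assumptions the two short, low-QC contigs (AAKO01002607.1 and AASR01047437.1) are dropped, consistent with their absence from the filtered table.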

In [9]:
qcs_df_upd_filtered

Unnamed: 0,QC,Acc,Strand,Start,Stop
gi|2735068466|gb|JBBODO010000732.1|,0.9797,JBBODO010000732.1,1,160127,174040
gi|2735070232|gb|JBBODR010000286.1|,0.9764,JBBODR010000286.1,1,4625347,4639269
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,JBAMBY010000011.1,1,4400346,4414275
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,JBAMBW010000153.1,2,10646675,10660638
gi|2735068651|gb|JBBODP010000434.1|,0.9773,JBBODP010000434.1,1,4652416,4666336
...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598


In [10]:
cluster_analysis_preview(qcs_df_upd_filtered)

eps: 0.01, n_clusters: 16
cluster: 15, qcs_range: (0.1554, 0.1803), items_in_cluster: 7
cluster: 14, qcs_range: (0.1963, 0.1963), items_in_cluster: 1
cluster: 11, qcs_range: (0.2065, 0.3523), items_in_cluster: 100
cluster: 13, qcs_range: (0.3883, 0.3883), items_in_cluster: 1
cluster: 12, qcs_range: (0.4376, 0.4376), items_in_cluster: 1
cluster: 9, qcs_range: (0.4747, 0.4808), items_in_cluster: 3
cluster: 10, qcs_range: (0.4984, 0.4984), items_in_cluster: 1
cluster: 8, qcs_range: (0.5211, 0.5304), items_in_cluster: 6
cluster: 4, qcs_range: (0.5582, 0.6002), items_in_cluster: 14
cluster: 6, qcs_range: (0.6168, 0.6168), items_in_cluster: 1
cluster: 5, qcs_range: (0.651, 0.6661), items_in_cluster: 9
cluster: 7, qcs_range: (0.6859, 0.6874), items_in_cluster: 2
cluster: 1, qcs_range: (0.8843, 0.9011), items_in_cluster: 9
cluster: 2, qcs_range: (0.9171, 0.9171), items_in_cluster: 3
cluster: 3, qcs_range: (0.9375, 0.9382), items_in_cluster: 2
cluster: 0, qcs_range: (0.9649, 0.9797), items_in_c

In [11]:
qcs_df_upd_filtered_clustered = cluster_analysis(qcs_df_upd_filtered, eps=0.04)
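To see how `eps` shapes the clusters, note that for one-dimensional values DBSCAN with `min_samples=1` reduces to splitting the sorted values wherever the gap between neighbours exceeds `eps`. The sketch below uses that plain-Python equivalent in place of scikit-learn; whether `cluster_analysis` uses `min_samples=1` is an assumption, and cluster numbering here follows ascending QC, so it will not match the labels in the tables:

```python
def cluster_1d(values, eps):
    """Label values so that sorted neighbours closer than or equal to eps share a cluster."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] > eps:
            current += 1  # gap too large: start a new cluster
        labels[idx] = current
    return labels


# A handful of QC values taken from the outputs in this notebook.
qcs = [0.9797, 0.9764, 0.9375, 0.9382, 0.9171, 0.8988, 0.9649,
       0.2088, 0.1697, 0.1625, 0.1554]

labels_wide = cluster_1d(qcs, eps=0.04)
labels_narrow = cluster_1d(qcs, eps=0.01)
print(len(set(labels_wide)), len(set(labels_narrow)))
```

A larger `eps` bridges more gaps, which is why the preview at `eps=0.01` reports many small clusters while `eps=0.04` merges them into a few.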

In [12]:
qcs_df_upd_filtered_clustered

Unnamed: 0,QC,Acc,Strand,Start,Stop,Cluster
gi|2735068466|gb|JBBODO010000732.1|,0.9797,JBBODO010000732.1,1,160127,174040,0
gi|2735070232|gb|JBBODR010000286.1|,0.9764,JBBODR010000286.1,1,4625347,4639269,0
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,JBAMBY010000011.1,1,4400346,4414275,0
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,JBAMBW010000153.1,2,10646675,10660638,0
gi|2735068651|gb|JBBODP010000434.1|,0.9773,JBBODP010000434.1,1,4652416,4666336,0
...,...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610,2
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133,2
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105,2
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598,2


In [13]:
save_seqs(qcs_df_upd_filtered_clustered, "Drosophilidae")

In [14]:
group_species(qcs_df_upd_filtered_clustered, "Drosophilidae", "Drosophilidae_grouped")

In [15]:
qcs_df_upd_filtered_clustered

Unnamed: 0,QC,Acc,Strand,Start,Stop,Cluster,Species_name
gi|2735068466|gb|JBBODO010000732.1|,0.9797,JBBODO010000732.1,1,160127,174040,0,Drosophila_simulans
gi|2735070232|gb|JBBODR010000286.1|,0.9764,JBBODR010000286.1,1,4625347,4639269,0,Drosophila_simulans
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,JBAMBY010000011.1,1,4400346,4414275,0,Drosophila_mauritiana
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,JBAMBW010000153.1,2,10646675,10660638,0,Drosophila_mauritiana
gi|2735068651|gb|JBBODP010000434.1|,0.9773,JBBODP010000434.1,1,4652416,4666336,0,Drosophila_simulans
...,...,...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610,2,Hirtodrosophila_confusa
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133,2,Drosophila_pruinosa
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105,2,Drosophila_maculinotata
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598,2,Drosophila_pegasa


In [16]:
genome_coverages = extract_genome_coverages(qcs_df_upd_filtered_clustered)

In [17]:
genome_coverages  # check if everything is OK

['            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 103.8x',
 '            Genome Coverage        :: 165.0x',
 '            Genome Coverage        :: 160.0x',
 '            Genome Coverage           :: 120.0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 180.0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 104.0x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 75.0x',
 '            Genome Coverage        :: 123.1x',
 '          

In [18]:
qcs_df_upd_filtered_clustered_coverages = add_genome_coverages(genome_coverages, qcs_df_upd_filtered_clustered)
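The raw strings in `genome_coverages` vary in spacing and number format (`12x`, `103.8x`), so turning them into the floats seen in `Genome_Coverage` needs a tolerant parser. A small regular-expression sketch follows; the name `parse_coverage` is hypothetical, and `add_genome_coverages` may parse differently:

```python
import re

# Matches e.g. "Genome Coverage        :: 103.8x" with any amount of spacing.
COVERAGE_RE = re.compile(r"Genome Coverage\s*::\s*([0-9]+(?:\.[0-9]+)?)\s*x", re.IGNORECASE)


def parse_coverage(line):
    """Return the coverage as a float, or None if the line has no coverage value."""
    match = COVERAGE_RE.search(line)
    return float(match.group(1)) if match else None


samples = [
    "            Genome Coverage        :: 12x",
    "            Genome Coverage        :: 103.8x",
    "            Genome Coverage           :: 120.0x",
]
print([parse_coverage(s) for s in samples])
```

Returning `None` for unparseable lines makes malformed records easy to spot before they reach the table.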

In [19]:
qcs_df_upd_filtered_clustered_coverages

Unnamed: 0,QC,Acc,Strand,Start,Stop,Cluster,Species_name,Genome_Coverage
gi|2735068466|gb|JBBODO010000732.1|,0.9797,JBBODO010000732.1,1,160127,174040,0,Drosophila_simulans,12.0
gi|2735070232|gb|JBBODR010000286.1|,0.9764,JBBODR010000286.1,1,4625347,4639269,0,Drosophila_simulans,12.0
gi|2733175125|gb|JBAMBY010000011.1|,0.9729,JBAMBY010000011.1,1,4400346,4414275,0,Drosophila_mauritiana,50.0
gi|2733173904|gb|JBAMBW010000153.1|,0.9757,JBAMBW010000153.1,2,10646675,10660638,0,Drosophila_mauritiana,50.0
gi|2735068651|gb|JBBODP010000434.1|,0.9773,JBBODP010000434.1,1,4652416,4666336,0,Drosophila_simulans,12.0
...,...,...,...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610,2,Hirtodrosophila_confusa,96.8
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133,2,Drosophila_pruinosa,67.6
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105,2,Drosophila_maculinotata,129.4
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598,2,Drosophila_pegasa,87.1


In [20]:
qcs_df_upd_filtered_clustered_coverages_max = select_max_ids(qcs_df_upd_filtered_clustered_coverages)
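Judging by the tables before and after this call, `select_max_ids` appears to keep, for each species, the record with the highest genome coverage. A minimal sketch of that selection on plain dictionaries follows. The name `select_max_per_species` is hypothetical, and how the author's function breaks ties is unknown (here the first record seen wins):

```python
def select_max_per_species(rows):
    """Keep one row per species: the one with the highest Genome_Coverage."""
    best = {}
    for row in rows:
        current = best.get(row["Species_name"])
        if current is None or row["Genome_Coverage"] > current["Genome_Coverage"]:
            best[row["Species_name"]] = row
    return list(best.values())


# Rows copied from the tables in this notebook.
rows = [
    {"Acc": "JBBODO010000732.1", "Species_name": "Drosophila_simulans", "Genome_Coverage": 12.0},
    {"Acc": "JBAMBY010000011.1", "Species_name": "Drosophila_mauritiana", "Genome_Coverage": 50.0},
    {"Acc": "JMCE01000012.1", "Species_name": "Drosophila_simulans", "Genome_Coverage": 180.0},
    {"Acc": "NIGA01000006.1", "Species_name": "Drosophila_mauritiana", "Genome_Coverage": 165.0},
]

selected = {row["Species_name"]: row["Acc"] for row in select_max_per_species(rows)}
print(selected)
```

With these four rows the sketch picks JMCE01000012.1 for *D. simulans* and NIGA01000006.1 for *D. mauritiana*, the same accessions that appear in the selected table.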

In [21]:
qcs_df_upd_filtered_clustered_coverages_max

Unnamed: 0,QC,Acc,Strand,Start,Stop,Cluster,Species_name,Genome_Coverage
gi|650401384|gb|JMCE01000012.1|,0.9745,JMCE01000012.1,2,10109018,10122925,0,Drosophila_simulans,180.0
gi|1601089141|gb|NIGA01000006.1|,0.9665,NIGA01000006.1,2,10614217,10628136,0,Drosophila_mauritiana,165.0
gi|2053677094|gb|JAEIGV010000073.1|,0.9649,JAEIGV010000073.1,1,965229,979127,0,Drosophila_sechellia,123.1
gi|1495148738|gb|QMER02000023.1|,0.9171,QMER02000023.1,1,1390290,1404493,0,Drosophila_erecta,190.0
gi|2074594557|gb|JAEDAA020000001.1|,0.8988,JAEDAA020000001.1,2,15091719,15106748,0,Drosophila_teissieri,168.0
...,...,...,...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610,2,Hirtodrosophila_confusa,96.8
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133,2,Drosophila_pruinosa,67.6
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105,2,Drosophila_maculinotata,129.4
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598,2,Drosophila_pegasa,87.1


In [22]:
qcs_df_upd_filtered_clustered_coverages_max_filtered = filter_genome_coverages(qcs_df_upd_filtered_clustered_coverages_max, genome_coverage_threshold=50)

In [23]:
qcs_df_upd_filtered_clustered_coverages_max_filtered

Unnamed: 0,QC,Acc,Strand,Start,Stop,Cluster,Species_name,Genome_Coverage
gi|650401384|gb|JMCE01000012.1|,0.9745,JMCE01000012.1,2,10109018,10122925,0,Drosophila_simulans,180.0
gi|1601089141|gb|NIGA01000006.1|,0.9665,NIGA01000006.1,2,10614217,10628136,0,Drosophila_mauritiana,165.0
gi|2053677094|gb|JAEIGV010000073.1|,0.9649,JAEIGV010000073.1,1,965229,979127,0,Drosophila_sechellia,123.1
gi|1495148738|gb|QMER02000023.1|,0.9171,QMER02000023.1,1,1390290,1404493,0,Drosophila_erecta,190.0
gi|2074594557|gb|JAEDAA020000001.1|,0.8988,JAEDAA020000001.1,2,15091719,15106748,0,Drosophila_teissieri,168.0
...,...,...,...,...,...,...,...,...
gi|2644202470|gb|JAWNNI010000085.1|,0.1554,JAWNNI010000085.1,1,1293088,1307610,2,Hirtodrosophila_confusa,96.8
gi|2030294827|gb|JAECWR010000194.1|,0.2088,JAECWR010000194.1,2,17051973,17063133,2,Drosophila_pruinosa,67.6
gi|2644252255|gb|JAWNPD010000679.1|,0.1697,JAWNPD010000679.1,1,8198711,8212105,2,Drosophila_maculinotata,129.4
gi|2644243871|gb|JAWNOD010000520.1|,0.1625,JAWNOD010000520.1,1,3143039,3154598,2,Drosophila_pegasa,87.1


In [24]:
group_species_genome_coverage(qcs_df_upd_filtered_clustered_coverages_max_filtered, folder_name="Drosophilidae", new_folder_name="Drosophilidae_filtered")

In [25]:
edit_names_for_alignment("Drosophilidae_filtered")

## Pipeline short version

In [2]:
from Bio import Entrez
from Bio import Blast

from entrez import nucl_search, save_esearch_results
from parse_blast_results import calculate_qc, filter_df, update_df
from fasta_processing import plain_to_fasta, read_fasta
from data_processing import (cluster_analysis_preview,
                             cluster_analysis,
                             save_seqs,
                             extract_genome_coverages,
                             add_genome_coverages,
                             select_max_ids,
                             filter_genome_coverages
                             )
from group_species import group_species, group_species_genome_coverage
from alignment import edit_names_for_alignment

Entrez.email = "artemvaskaa@gmail.com"
Blast.email = "artemvaskaa@gmail.com"

In [3]:
name_of_blast_res = "../Blast_res/full_sbr_RA_wgs_megablast_250_16.xml" # XML2 !!!
result_stream = open(name_of_blast_res, "rb")
blast_record = Blast.read(result_stream)
df = calculate_qc(blast_record)

In [4]:
df = update_df(df, blast_record)  # 8395741 8410026 3349617 3363902

Target range cannot be calculated automatically. Please enter coordinates manually from the list below:
[7188510, 7189026, 7297434, 7297950, 8395741, 8395989, 8396190, 8396961, 8397432, 8397472, 8398959, 8399803, 8400836, 8400909, 8402321, 8402572, 8402845, 8402932, 8405238, 8405296, 8406927, 8408741, 8409289, 8409341, 8409703, 8410026]
Target range cannot be calculated automatically. Please enter coordinates manually from the list below:
[2269971, 2270487, 3349617, 3349865, 3350066, 3350837, 3351311, 3351351, 3352835, 3353679, 3354712, 3354785, 3356197, 3356448, 3356721, 3356808, 3359114, 3359172, 3360803, 3362617, 3363165, 3363217, 3363579, 3363902]


In [5]:
df = filter_df(df)

In [6]:
cluster_analysis_preview(df)  # select eps

eps: 0.01, n_clusters: 16
cluster: 15, qcs_range: (0.1554, 0.1803), items_in_cluster: 7
cluster: 14, qcs_range: (0.1963, 0.1963), items_in_cluster: 1
cluster: 11, qcs_range: (0.2065, 0.3523), items_in_cluster: 100
cluster: 13, qcs_range: (0.3883, 0.3883), items_in_cluster: 1
cluster: 12, qcs_range: (0.4376, 0.4376), items_in_cluster: 1
cluster: 9, qcs_range: (0.4747, 0.4808), items_in_cluster: 3
cluster: 10, qcs_range: (0.4984, 0.4984), items_in_cluster: 1
cluster: 8, qcs_range: (0.5211, 0.5304), items_in_cluster: 6
cluster: 4, qcs_range: (0.5582, 0.6002), items_in_cluster: 14
cluster: 6, qcs_range: (0.6168, 0.6168), items_in_cluster: 1
cluster: 5, qcs_range: (0.651, 0.6661), items_in_cluster: 9
cluster: 7, qcs_range: (0.6859, 0.6874), items_in_cluster: 2
cluster: 1, qcs_range: (0.8843, 0.9011), items_in_cluster: 9
cluster: 2, qcs_range: (0.9171, 0.9171), items_in_cluster: 3
cluster: 3, qcs_range: (0.9375, 0.9382), items_in_cluster: 2
cluster: 0, qcs_range: (0.9649, 0.9797), items_in_c

In [7]:
df = cluster_analysis(df, eps=0.04)

In [8]:
# long-time execution

save_seqs(df, "Drosophilidae")

In [9]:
group_species(df, "Drosophilidae", "Drosophilidae_grouped")

In [10]:
# long-time execution

genome_coverages = extract_genome_coverages(df)

In [11]:
genome_coverages  # check if everything is OK

['            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 103.8x',
 '            Genome Coverage        :: 165.0x',
 '            Genome Coverage        :: 160.0x',
 '            Genome Coverage           :: 120.0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 50x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 180.0x',
 '            Genome Coverage        :: 0x',
 '            Genome Coverage        :: 104.0x',
 '            Genome Coverage        :: 12x',
 '            Genome Coverage        :: 75.0x',
 '            Genome Coverage        :: 123.1x',
 '          

In [12]:
df = add_genome_coverages(genome_coverages, df)
df = select_max_ids(df)
df = filter_genome_coverages(df, genome_coverage_threshold=50)

In [13]:
group_species_genome_coverage(df, folder_name="Drosophilidae", new_folder_name="Drosophilidae_filtered")

In [35]:
edit_names_for_alignment("Drosophilidae_filtered")