# Marker gene database maker
This Jupyter notebook walks through a workflow for creating a BLAST database of protein sequences for a given gene, drawn from a wide range of taxonomic groups, against which newly submitted sequences can be validated.

Broadly, this process involves the following steps: 

1. Starting with an Entrez query for the Gene database, download sequences and metadata for genes, transcripts and proteins using NCBI Datasets
2. Parse the data archive from step 1 to tabulate names and symbols for review
3. Parse the data archive from step 1 to tabulate variability in the sequence lengths for review 
4. Given a set of taxonomic group identifiers, tabulate the number of sequences for each group that are present in the data archive
5. Extract sequences from each taxonomic node and generate all-vs-all BLAST alignments 
6. Review the BLAST tabular output to list accessions that are outliers or incorrect and must be removed from the final BLAST database
7. Generate a final BLAST database that can be used with VADR and other tools for validating newly submitted sequences.

## Download data

Sequences and metadata are downloaded with NCBI Datasets, using an Entrez query provided by the user.

In [None]:
## specify Entrez query, contact email, and output filename
entrez_query = 'primates [ORGN] AND cytb [GENE] AND source mitochondrion [PROP] NOT rnatype mrna [PROP] NOT srcdb pdb [PROP] NOT uncultured NOT unverified'
email = 'kodalivk@ncbi.nlm.nih.gov'
output_file = 'ncbi_dataset.zip'
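
The query string above combines an organism, a gene symbol, required properties, and a series of exclusions. As a minimal sketch of how such a query could be assembled from its parts, the helper below rebuilds the same string. `build_entrez_query` and its parameter names are hypothetical and are not part of the notebook's `scripts` package.

```python
def build_entrez_query(organism, gene, include_props=(), exclude_props=(), exclude_terms=()):
    """Assemble an Entrez query string from its parts (hypothetical helper)."""
    # Required clauses are joined with AND.
    parts = [f"{organism} [ORGN]", f"{gene} [GENE]"]
    parts += [f"{prop} [PROP]" for prop in include_props]
    query = " AND ".join(parts)
    # Exclusions are appended as NOT clauses, properties first, then free-text terms.
    for prop in exclude_props:
        query += f" NOT {prop} [PROP]"
    for term in exclude_terms:
        query += f" NOT {term}"
    return query


query = build_entrez_query(
    organism="primates",
    gene="cytb",
    include_props=["source mitochondrion"],
    exclude_props=["rnatype mrna", "srcdb pdb"],
    exclude_terms=["uncultured", "unverified"],
)
print(query)
```

Building the query from named parts makes it easier to swap the organism or gene symbol when adapting the notebook to another marker gene.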

In [None]:
import scripts.obtain_gene_datasets as dl

gene_ids_file = 'gene_ids.txt'
dl.populate_gene_ids_file(entrez_query, email, gene_ids_file)
json_data = dl.format_file_data_into_json(gene_ids_file)
dl.obtain_gene_datasets(json_data, output_file)

## Unzip Datasets archive

In [None]:
!unzip -o {output_file}

## Tabulate unique names

In [None]:
bdbag = 'ncbi_dataset/'
data_table = bdbag + 'data/data_table.tsv'
gene_names = 'gene_names.tsv'

In [None]:
%%bash -s {data_table} {gene_names}

data_table=$1
gene_names=$2

python3 scripts/unique.py ${data_table} > ${gene_names}

In [None]:
import pandas as pd 

df = pd.read_csv(gene_names, sep='\t', header=None, names=['Gene Name', 'Count', 'Gene IDs'])
display(df.sort_values(by=['Count'], ascending=False))
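
The table above has one row per distinct gene name, with a count and the gene IDs that carry that name. The sketch below shows one way such a table could be produced. It is not the code in `scripts/unique.py`; the column names `gene_id` and `description` and the sample rows are assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict


def tabulate_names(handle, name_col="description", id_col="gene_id"):
    """Group gene IDs by gene name and return (name, count, ids) rows.

    The column names are assumptions; adjust them to match the real data table.
    """
    ids_by_name = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        ids_by_name[row[name_col]].add(row[id_col])
    rows = [(name, len(ids), ",".join(sorted(ids))) for name, ids in ids_by_name.items()]
    # Most frequent names first; ties are broken alphabetically.
    return sorted(rows, key=lambda r: (-r[1], r[0]))


# Toy input with made-up gene IDs, standing in for data_table.tsv.
sample = io.StringIO(
    "gene_id\tdescription\n"
    "101\tcytochrome b\n"
    "102\tcytochrome b\n"
    "103\tcytochrome B\n"
)
table = tabulate_names(sample)
for name, count, ids in table:
    print(name, count, ids, sep="\t")
```

Names that appear only once or twice are the ones worth reviewing, since they often point to spelling variants or mislabeled records.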

## Identify outliers based on protein size

In [None]:
data_table_df = pd.read_csv(data_table, sep='\t', index_col=1)
data_table_df.head()

In [None]:
data_table_df[['transcript_length', 'protein_length']].describe()

In [None]:
## keep only sequences whose protein length lies strictly between min_len and max_len

min_len = 350
max_len = 400

rightlength = data_table_df.loc[(data_table_df['protein_length'] > min_len) & (data_table_df['protein_length'] < max_len)]
rightlength.to_csv(data_table, sep='\t')
rightlength.head()
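
The `min_len` and `max_len` values above are chosen by eye from the `describe()` summary. As a hedged alternative, the sketch below derives bounds from the median protein length with a fractional tolerance. The function `length_bounds`, the 10% tolerance, and the sample lengths are all assumptions for illustration, not values taken from the notebook.

```python
import statistics


def length_bounds(lengths, tolerance=0.1):
    """Return (low, high) bounds at median * (1 -/+ tolerance)."""
    median = statistics.median(lengths)
    return median * (1 - tolerance), median * (1 + tolerance)


# Made-up protein lengths: mostly near 379, plus one fragment and one fusion-like outlier.
lengths = [379, 379, 380, 378, 379, 120, 760]
low, high = length_bounds(lengths)

# Same strict comparison as the pandas filter above.
kept = [n for n in lengths if low < n < high]
print(kept)
```

The median is used rather than the mean so that a few very short or very long records do not shift the window.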


## Extract sequences from specific taxonomic group(s) for further analysis

Analyzing all of the sequences using all-vs-all BLAST is time-consuming. In this step, we will group the sequences into broad taxonomic groups for further analysis. 

In [None]:
acclist_for_blast = 'acclist_for_blast.tsv'
taxids = !cut -f2 example_data/tax_nodes.tsv | head -n 3 | paste -s -d ','
print(taxids)
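
The shell pipeline above takes the second tab-separated column of `example_data/tax_nodes.tsv`, keeps the first three rows, and joins them with commas. A rough Python equivalent is sketched below for readers less familiar with `cut`, `head`, and `paste`. The helper name `first_taxids` and the sample rows are placeholders, and the file layout (group name, then taxid) is an assumption.

```python
import csv
import io
from itertools import islice


def first_taxids(handle, n=3, column=1):
    """Join the values of one column from the first n rows with commas."""
    reader = csv.reader(handle, delimiter="\t")
    return ",".join(row[column] for row in islice(reader, n))


# Placeholder rows; the names and numbers are not real taxonomy identifiers.
sample = io.StringIO(
    "group_a\t1001\n"
    "group_b\t1002\n"
    "group_c\t1003\n"
    "group_d\t1004\n"
)
taxid_list = first_taxids(sample)
print(taxid_list)
```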

In [None]:
!python3 scripts/seqids_by_taxa.py --bdbag {bdbag} --taxids {taxids[0]} --output {acclist_for_blast} --email {email}

## Run all-vs-all BLAST

In [None]:
!scripts/blast_all.sh -b {bdbag} -a {acclist_for_blast} -t 6

## Evaluate BLAST results and filter data

In [None]:
%%bash -s {acclist_for_blast}

acclist_for_blast=$1
final_acclist='final_acclist.txt'

cut -f1 "${acclist_for_blast}" | while read -r txid ; do
    tbl="${txid}_output.tsv"
    python3 scripts/blastfilter.py -i "${tbl}" --pident 98 --qcov 99
done > "${final_acclist}"
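
The cell above applies a percent-identity threshold of 98 and a query-coverage threshold of 99 to each per-taxon BLAST table. The sketch below illustrates one plausible reading of such a filter: an accession is kept when at least one hit to a different sequence meets both thresholds. This is not the logic of `scripts/blastfilter.py`, which is not shown here. The four-column layout (query, subject, percent identity, query coverage), the retention rule, and the sample rows are all assumptions.

```python
import csv
import io


def accessions_with_support(handle, min_pident=98.0, min_qcov=99.0):
    """Return query accessions that have at least one non-self hit passing both thresholds.

    Assumes four tab-separated columns: query, subject, percent identity, query coverage.
    """
    supported = []
    seen = set()
    for qseqid, sseqid, pident, qcov in csv.reader(handle, delimiter="\t"):
        # Self-hits always score 100% and carry no information about outliers.
        if qseqid == sseqid or qseqid in seen:
            continue
        if float(pident) >= min_pident and float(qcov) >= min_qcov:
            seen.add(qseqid)
            supported.append(qseqid)
    return supported


# Toy all-vs-all table with made-up accessions A, B, and C.
sample = io.StringIO(
    "A\tA\t100.0\t100\n"
    "A\tB\t99.2\t100\n"
    "B\tA\t99.2\t100\n"
    "C\tA\t71.5\t100\n"
    "C\tC\t100.0\t100\n"
)
kept_accessions = accessions_with_support(sample)
print(kept_accessions)
```

In this toy table, accession C matches only itself at high identity, so under the assumed rule it would be left out of the final list and flagged for review.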

## Create final BLAST database

In [None]:
final_acclist = 'final_acclist.txt'
filename_prefix = 'cytb_genes'

In [None]:
!scripts/make_finaldb.sh -b {bdbag} -a {final_acclist} -p {filename_prefix} -t 6