# Marker gene database maker
This Jupyter notebook walks through a workflow for creating a BLAST database of protein sequences for a given gene, drawn from a wide range of taxonomic groups, against which newly submitted sequences can be validated.

Broadly, this process involves the following steps: 

1. Starting with an Entrez query for the Gene database, download sequences and metadata for genes, transcripts and proteins using NCBI Datasets
2. Parse the data archive from step 1 to tabulate names and symbols for review
3. Parse the data archive from step 1 to tabulate variability in the sequence lengths for review 
4. Given a set of taxonomic group identifiers, tabulate the number of sequences for each group that are present in the data archive
5. Extract sequences from each taxonomic node and generate all-vs-all BLAST alignments 
6. Review the BLAST tabular output to list accessions that are outliers or incorrect and must be removed from the final BLAST database
7. Generate a final BLAST database that can be used with VADR and other tools for validating newly submitted sequences.

## Download data

Sequences and metadata are downloaded with NCBI Datasets, using an Entrez query provided by the user.

In [17]:
## specify Entrez query and output filename
entrez_query = 'mammalia [ORGN] AND cytb [GENE] AND source mitochondrion [PROP] NOT rnatype mrna [PROP] NOT srcdb pdb [PROP] NOT uncultured NOT unverified'
email = 'mcveigh@ncbi.nlm.nih.gov'
output_file = 'ncbi_dataset.zip'

In [18]:
import scripts.obtain_gene_datasets as dl

gene_ids_file = 'gene_ids.txt'
dl.populate_gene_ids_file(entrez_query, email, gene_ids_file)
json_data = dl.format_file_data_into_json(gene_ids_file)
dl.obtain_gene_datasets(json_data, output_file)

Gene search for query 'mammalia [ORGN] AND cytb [GENE] AND source mitochondrion [PROP] NOT rnatype mrna [PROP] NOT srcdb pdb [PROP] NOT uncultured NOT unverified' returned 1273 results
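
The query above combines field-tagged terms with boolean operators. As a minimal sketch, the same string can be composed from its parts, which may make it easier to swap in a different organism or gene. `build_entrez_query` is a hypothetical helper written for illustration; it is not part of `scripts.obtain_gene_datasets`.

```python
def build_entrez_query(organism, gene, required=(), excluded=()):
    """Join field-tagged terms into one Entrez query string.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Terms that must match are joined with AND.
    terms = [f"{organism} [ORGN]", f"{gene} [GENE]", *required]
    query = " AND ".join(terms)
    # Each excluded term is appended with NOT.
    for term in excluded:
        query += f" NOT {term}"
    return query


entrez_query = build_entrez_query(
    "mammalia",
    "cytb",
    required=["source mitochondrion [PROP]"],
    excluded=[
        "rnatype mrna [PROP]",
        "srcdb pdb [PROP]",
        "uncultured",
        "unverified",
    ],
)
print(entrez_query)
```

With these arguments the helper reproduces the query string used in this notebook.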


## Unzip Datasets archive

In [19]:
!unzip -o {output_file}

Archive:  ncbi_dataset.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/data_table.tsv  
  inflating: ncbi_dataset/data/dataset_catalog.json  
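
Before unzipping, it can be useful to confirm that the archive contains the files the later steps rely on. The sketch below uses only the standard-library `zipfile` module; `list_archive_members` and `missing_members` are hypothetical helper names, not part of this repository.

```python
import zipfile


def list_archive_members(archive_path):
    """Return the file names stored in a zip archive, without extracting it."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


def missing_members(archive_path, expected):
    """Return the expected member names that are absent from the archive."""
    present = set(list_archive_members(archive_path))
    return [name for name in expected if name not in present]
```

For example, `missing_members(output_file, ['ncbi_dataset/data/protein.faa', 'ncbi_dataset/data/data_table.tsv'])` should return an empty list for the archive shown above.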


## Tabulate unique names

In [20]:
bdbag = 'ncbi_dataset/'
data_table = bdbag + 'data/data_table.tsv'
gene_names = 'gene_names.tsv'

In [21]:
%%bash -s {data_table} {gene_names}

data_table=$1
gene_names=$2

python3 scripts/unique.py ${data_table} > ${gene_names}

In [22]:
import pandas as pd 

df = pd.read_csv(gene_names, sep='\t', header=None, names=['Gene Name', 'Count', 'Gene IDs'])
display(df.sort_values(by=['Count'], ascending=False))

,Gene Name,Count,Gene IDs
0,CYTB,1228,
2,MT-CYTB,2,1771126192.0
1,MT-CYB,1,4519.0
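
The internals of `scripts/unique.py` are not shown here. As a rough sketch of the kind of tally involved, the function below groups gene IDs by gene symbol using the `gene_symbol` and `gene_id` columns of the data table. `tally_gene_symbols` is a hypothetical name, and the real script may tabulate different fields.

```python
import csv
from collections import defaultdict


def tally_gene_symbols(tsv_lines, symbol_col="gene_symbol", id_col="gene_id"):
    """Map each gene symbol to the list of gene IDs that carry it.

    `tsv_lines` is any iterable of tab-separated lines with a header row,
    such as an open file handle.
    """
    ids_by_symbol = defaultdict(list)
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        ids_by_symbol[row[symbol_col]].append(row[id_col])
    return dict(ids_by_symbol)
```

Opening `data_table` and passing the handle to `tally_gene_symbols` gives a dictionary whose list lengths are the per-symbol counts; rare symbols are the ones worth reviewing by hand.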


## Identify outliers based on protein size

In [23]:
data_table_df = pd.read_csv(data_table, sep='\t', index_col=1)
data_table_df.head()

gene_symbol,gene_id,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
CYTB,10020650,cytochrome b,Panthera tigris amoyensis,Amoy tiger,253258,NC_014770.1:15113-16252,+,chr MT,PROTEIN_CODING,,,,,YP_004062165.1,,379,cytochrome b
CYTB,10079733,cytochrome b,Rattus lutreolus,Australian swamp rat,472760,NC_014858.1:14137-15279,+,chr MT,PROTEIN_CODING,,,,,YP_004123242.1,,380,cytochrome b
CYTB,10079783,cytochrome b,Rattus tunneyi,Tunney's rat,10121,NC_014861.1:14132-15274,+,chr MT,PROTEIN_CODING,,,,,YP_004123282.1,,380,cytochrome b
CYTB,10079857,cytochrome b,Rattus villosissimus,long-haired rat,10122,NC_014864.1:14134-15276,+,chr MT,PROTEIN_CODING,,,,,YP_004123323.1,,380,cytochrome b
CYTB,10079923,cytochrome b,Rattus fuscipes,bush rat,10119,NC_014867.1:14131-15273,+,chr MT,PROTEIN_CODING,,,,,YP_004123362.1,,380,cytochrome b


In [24]:
data_table_df[['transcript_length', 'protein_length']].describe()

,transcript_length,protein_length
count,0.0,1231.0
mean,,379.261576
std,,0.717818
min,,376.0
25%,,379.0
50%,,379.0
75%,,379.0
max,,385.0
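
The summary above shows that protein lengths cluster tightly around the median. One possible way to flag candidates for review is to list the accessions whose length differs from the median by more than a chosen tolerance. This is only a sketch: `length_outliers` is a hypothetical helper, and the default tolerance of 10 residues is an arbitrary illustrative value, not one taken from this notebook.

```python
from statistics import median


def length_outliers(lengths_by_accession, tolerance=10):
    """Return accessions whose length differs from the median by more than `tolerance`."""
    center = median(lengths_by_accession.values())
    return {
        accession: length
        for accession, length in lengths_by_accession.items()
        if abs(length - center) > tolerance
    }
```

The input could be built from the data table, for example with `dict(zip(data_table_df['protein_accession'], data_table_df['protein_length']))`.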


In [25]:
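## keep only sequences whose length falls strictly between min_len and max_len

min_len = 350
max_len = 400

in_range = (data_table_df['protein_length'] > min_len) & (data_table_df['protein_length'] < max_len)
rightlength = data_table_df.loc[in_range]
## note: this overwrites the data table in the unzipped archive, with gene_symbol written as the first column
rightlength.to_csv(data_table, sep='\t')
rightlength.head()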


gene_symbol,gene_id,description,scientific_name,common_name,tax_id,genomic_range,orientation,location,gene_type,transcript_accession,transcript_name,transcript_length,transcript_cds_coords,protein_accession,isoform_name,protein_length,protein_name
CYTB,10020650,cytochrome b,Panthera tigris amoyensis,Amoy tiger,253258,NC_014770.1:15113-16252,+,chr MT,PROTEIN_CODING,,,,,YP_004062165.1,,379,cytochrome b
CYTB,10079733,cytochrome b,Rattus lutreolus,Australian swamp rat,472760,NC_014858.1:14137-15279,+,chr MT,PROTEIN_CODING,,,,,YP_004123242.1,,380,cytochrome b
CYTB,10079783,cytochrome b,Rattus tunneyi,Tunney's rat,10121,NC_014861.1:14132-15274,+,chr MT,PROTEIN_CODING,,,,,YP_004123282.1,,380,cytochrome b
CYTB,10079857,cytochrome b,Rattus villosissimus,long-haired rat,10122,NC_014864.1:14134-15276,+,chr MT,PROTEIN_CODING,,,,,YP_004123323.1,,380,cytochrome b
CYTB,10079923,cytochrome b,Rattus fuscipes,bush rat,10119,NC_014867.1:14131-15273,+,chr MT,PROTEIN_CODING,,,,,YP_004123362.1,,380,cytochrome b


## Extract sequences from specific taxonomic group(s) for further analysis

Running all-vs-all BLAST on every sequence at once is time-consuming, so in this step we split the sequences into broad taxonomic groups and analyze each group separately.

In [26]:
acclist_for_blast = 'acclist_for_blast.tsv'
taxids = !cut -f2 example_data/tax_nodes.tsv | head -n 3 | paste -s -d ','
print(taxids)

['9254,311790,1437010']


In [27]:
!python3 scripts/seqids_by_taxa.py --bdbag {bdbag} --taxids {taxids[0]} --output {acclist_for_blast} --email {email}

9254	3
311790	19
1437010	1135
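
The counts above give the number of sequences that fall under each taxonomic group. How `scripts/seqids_by_taxa.py` resolves lineages is not shown here. The sketch below illustrates only the counting step and assumes the lineage of each accession is already available as a list of taxids. Both `count_by_taxonomic_group` and that input format are hypothetical.

```python
def count_by_taxonomic_group(lineage_by_accession, group_taxids):
    """Count accessions whose lineage contains each group taxid.

    `lineage_by_accession` maps a protein accession to the list of taxids
    on the path from the root to its organism (an assumed input format).
    """
    counts = {taxid: 0 for taxid in group_taxids}
    for lineage in lineage_by_accession.values():
        members = set(lineage)
        for taxid in group_taxids:
            if taxid in members:
                counts[taxid] += 1
    return counts
```

A group with very few sequences, such as the first one listed above, gives the all-vs-all comparison little to work with, so its results may need closer manual review.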


## Run all-vs-all BLAST

In [28]:
!scripts/blast_all.sh -b {bdbag} -a {acclist_for_blast} -t 6

Number of taxids in acclist_for_blast.tsv: 3
Processing 1437010
Tue Oct 27 14:16:10 EDT 2020 Filtering protein fasta...
Tue Oct 27 14:16:11 EDT 2020 Create a BLAST database...


Building a new DB, current time: 10/27/2020 14:16:11
New DB name:   /home/mcveigh/notebook/Marker-Gene-Validator/1437010_blastdb
New DB title:  1437010_input.fa
Sequence type: Protein
Deleted existing Protein BLAST database named /home/mcveigh/notebook/Marker-Gene-Validator/1437010_blastdb
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1135 sequences in 0.0849681 seconds.


Tue Oct 27 14:16:12 EDT 2020 Running all-vs-all blast...
Tue Oct 27 14:18:37 EDT 2020 Generating blast tabular output...
Tue Oct 27 14:19:25 EDT 2020 Generating blast seq-align asn...
Processing 311790
Tue Oct 27 14:19:41 EDT 2020 Filtering protein fasta...
Tue Oct 27 14:19:41 EDT 2020 Create a BLAST database...


Building a new DB, current time: 10/27/2020 14:19:41
New DB name:   /home/mcveigh/notebook/Marke
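
The log above shows that each taxonomic group goes through the same steps: filter the protein FASTA, build a BLAST database, then run all-vs-all BLAST. The exact commands inside `scripts/blast_all.sh` are not shown, so the following is only a sketch of how comparable `makeblastdb` and `blastp` invocations could be assembled. The file names follow the `<taxid>_input.fa`, `<taxid>_blastdb` and `<taxid>_output.tsv` patterns that appear in this notebook, while the `-outfmt` column selection is an assumption.

```python
import shlex


def blast_commands(taxid, threads=6):
    """Build makeblastdb and blastp argument lists for one taxonomic group."""
    fasta = f"{taxid}_input.fa"
    db = f"{taxid}_blastdb"
    # Build a protein database from the group's FASTA file.
    make_db = ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db]
    # Search every sequence in the group against that same database.
    all_vs_all = [
        "blastp",
        "-query", fasta,
        "-db", db,
        "-outfmt", "6 qseqid sseqid pident qcovs evalue bitscore",  # assumed columns
        "-num_threads", str(threads),
        "-out", f"{taxid}_output.tsv",
    ]
    return make_db, all_vs_all


for command in blast_commands("1437010"):
    print(shlex.join(command))
```

The argument lists can be passed to `subprocess.run(command, check=True)` to execute them, provided BLAST+ is installed.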

## Evaluate BLAST results and filter data

In [30]:
%%bash -s {acclist_for_blast}

acclist_for_blast=$1
final_acclist='final_acclist.txt'

## one BLAST table per taxid; results are appended, so delete final_acclist.txt before re-running this cell
cut -f1 "${acclist_for_blast}" | while read -r txid ; do
    tbl="${txid}_output.tsv"
    python3 scripts/blastfilter.py -i "${tbl}" --pident 98 --qcov 99 >> "${final_acclist}"
done
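
The filter above is driven by two thresholds, percent identity (`--pident 98`) and query coverage (`--qcov 99`). How `scripts/blastfilter.py` turns those thresholds into a keep-or-remove decision is not shown, so the sketch below covers only the thresholding step. It assumes a tab-separated table whose first four columns are `qseqid`, `sseqid`, `pident` and `qcovs`, which may not match the real table layout. `passing_hits` is a hypothetical name.

```python
import csv


def passing_hits(tsv_lines, min_pident=98.0, min_qcov=99.0):
    """Yield (query, subject) pairs for non-self hits that meet both thresholds.

    Assumes tab-separated rows whose first four columns are
    qseqid, sseqid, pident, qcovs; the real table layout may differ.
    """
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        pident, qcov = float(row[2]), float(row[3])
        if query != subject and pident >= min_pident and qcov >= min_qcov:
            yield query, subject
```

Self-hits are skipped because every sequence matches itself at 100% identity in an all-vs-all search and therefore says nothing about its neighbours.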

## Create final BLAST database

In [33]:
final_acclist = 'final_acclist.txt'
filename_prefix = 'cytb_genes'

In [34]:
!scripts/make_finaldb.sh -b {bdbag} -a {final_acclist} -p {filename_prefix} -t 6

Tue Oct 27 16:23:29 EDT 2020 Filtering protein fasta...
Tue Oct 27 16:23:30 EDT 2020 Create a BLAST database...


Building a new DB, current time: 10/27/2020 16:23:30
New DB name:   /home/mcveigh/notebook/Marker-Gene-Validator/cytb_genes_blastdb
New DB title:  cytb_genes.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/mcveigh/notebook/Marker-Gene-Validator/cytb_genes_blastdb
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 74 sequences in 0.037137 seconds.
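
The "Filtering protein fasta..." step in the log above restricts the protein FASTA to the accessions on the final list before the database is built. The internals of `scripts/make_finaldb.sh` are not shown. The following is a minimal sketch of such a filter, assuming that each record's identifier is the text between `>` and the first whitespace on its header line. `filter_fasta` is a hypothetical name.

```python
def filter_fasta(fasta_lines, keep_accessions):
    """Yield the lines of FASTA records whose identifier is in `keep_accessions`.

    The identifier is taken as the text after '>' up to the first whitespace.
    """
    keep = set(keep_accessions)
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            # Decide once per record whether its lines are emitted.
            writing = bool(header) and header[0] in keep
        if writing:
            yield line
```

Reading `final_acclist.txt` into a set also removes any duplicate accessions, so each sequence is written at most once.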


