## This Jupyter notebook is targeted towards QC analysis of gene models without alleles in a phased assembly
This notebook picks up where *_defining_alleles_v01.ipynb left off. In general, it takes a list of genes without alleles in a phased assembly and investigates why those alleles were missed.

It will consider the following options:

#### The allele was left out of the annotation in the other haplotig:
* It takes the gene sequence of a single-allele gene and BLASTs it against the other haplotig
* If there is a matching gene sequence, it pulls out a region around the hit and aligns the protein sequence to it using exonerate
* The exonerate alignment is used to scan for matches without frame shifts and without stop codons
* In cases where a good alignment is possible, these alleles are noted and written out
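
The check described in the list above can be sketched on a raw exonerate vulgar line. This is only a minimal illustration: the helper names `parse_vulgar` and `is_candidate_allele` and the example lines are made up for this sketch, and the notebook itself reads the vulgar files with Biopython's `SearchIO` further down.

```python
def parse_vulgar(line):
    """Split an exonerate vulgar line into header fields and (label, q_len, t_len) triplets."""
    fields = line.strip().split()
    if not fields or fields[0] != "vulgar:":
        raise ValueError("not a vulgar line")
    header = {
        "query_id": fields[1],
        "query_start": int(fields[2]),
        "query_end": int(fields[3]),
        "target_id": fields[5],
        "score": int(fields[9]),
    }
    body = fields[10:]
    triplets = [(body[i], int(body[i + 1]), int(body[i + 2]))
                for i in range(0, len(body), 3)]
    return header, triplets


def is_candidate_allele(line, query_length):
    """True if the alignment spans the whole protein and contains no frameshift (F) block."""
    header, triplets = parse_vulgar(line)
    covers_query = (header["query_start"], header["query_end"]) == (0, query_length)
    has_frameshift = any(label == "F" for label, _, _ in triplets)
    return covers_query and not has_frameshift


# Made-up example lines for a hypothetical 100 aa protein
full_length = ("vulgar: prot1 0 100 . pcontig_000_1_60000 30000 30350 + 480 "
               "M 50 150 5 0 2 I 0 46 3 0 2 M 50 150")
frameshifted = ("vulgar: prot1 0 100 . pcontig_000_1_60000 30000 30301 + 300 "
                "M 50 150 F 0 1 M 50 150")
partial = "vulgar: prot1 10 60 . pcontig_000_1_60000 30000 30150 + 200 M 50 150"

for example in (full_length, frameshifted, partial):
    print(is_candidate_allele(example, 100))
```

Only the first example line passes, because the second contains an `F` block and the third covers only part of the protein.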

#### The allele was left out because it was not phased in the first place
This step relies on the Pst_104E_v12_coverage_analysis_training script to report homozygous regions, i.e. regions identified when comparing p mapping with ph mapping. It might need to be adapted a bit more.
* All remaining single-allele genes are tested for whether they fall into a homozygous coverage area
* They are also tested for whether they do not overlap with an orthologous contig alignment (p on h mapping and the reverse)
* Possibly also whether they overlap with a unique coverage area. This is only possible for p alleles so far.
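
The overlap test behind these bullet points can be sketched with plain tuples. This is a stand-in written for illustration, assuming BED-style 0-based, half-open intervals; the function name `overlaps_any` and the coordinates are hypothetical, and the actual analysis may well use `pybedtools` instead.

```python
def overlaps_any(gene, regions):
    """Return True if the gene interval overlaps at least one region on the same contig.

    gene: (contig, start, end); regions: iterable of (contig, start, end).
    All intervals are assumed to be 0-based and half-open, as in BED files.
    """
    contig, start, end = gene
    return any(c == contig and s < end and start < e for c, s, e in regions)


# Hypothetical homozygous-coverage regions and genes
homo_cov_regions = [("pcontig_000", 1000, 5000), ("pcontig_001", 200, 900)]
gene_inside = ("pcontig_000", 4500, 6000)
gene_adjacent = ("pcontig_000", 5000, 6000)   # starts exactly where the region ends
gene_other_contig = ("pcontig_002", 1000, 2000)

print(overlaps_any(gene_inside, homo_cov_regions))
print(overlaps_any(gene_adjacent, homo_cov_regions))
print(overlaps_any(gene_other_contig, homo_cov_regions))
```

With half-open intervals, a gene that starts exactly at the end coordinate of a region does not count as overlapping.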

#### Other things to consider:
* look at genes that have no haplotig aligned and at their variation in terms of SNPs


##### script considerations

What to do when multiple filtered allele files are present? Maybe have the previous script write the different options to different folders if they already exist.
When filtering the exonerate vulgar output, one might want to consider partial alignments as well, covering QcPct of the protein sequence?

What to do with h proteins that have a good p hit but not the allele?
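
One way the partial-alignment idea could look is sketched here. This is not implemented in the notebook; `min_pct` stands in for the QcPct value mentioned above, and both function names are hypothetical.

```python
def query_coverage_pct(query_start, query_end, query_length):
    """Percentage of the protein covered by an alignment with 0-based, half-open query coordinates."""
    return 100.0 * (query_end - query_start) / query_length


def passes_partial_filter(query_start, query_end, query_length, min_pct=80):
    """Accept alignments that cover at least min_pct percent of the protein."""
    return query_coverage_pct(query_start, query_end, query_length) >= min_pct


print(query_coverage_pct(0, 90, 100))
print(passes_partial_filter(0, 90, 100))
print(passes_partial_filter(10, 60, 100))
```

The frameshift check would still have to be applied on top of this coverage threshold.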

For now the script runs on Pst_104E_v12_p_ctg.no.Qcov80.PctID70.alleles and Pst_104E_v12_h_ctg.no.no_p_hits.Qcov80.PctID70.alleles. This ignores the fact that some h proteins are not the best hit of the p protein they are linked with in 

In [12]:
%matplotlib inline

In [13]:
import pandas as pd
import os
import re
from Bio import SeqIO
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil
from Bio.Seq import Seq
import pysam
from Bio import SearchIO
import json
import glob

In [14]:
def blast_outfmt6_to_bed(x):
    "Quick function that converts a blast outfmt6 file to a bed file."
    bed_file_name = x + '.bed'
    #context managers make sure both files are closed even if a line fails to parse
    with open(x, 'r') as blast_fo, open(bed_file_name, 'w') as bed_fo:
        for l in blast_fo:
            content = l.rstrip('\n').split('\t')
            #subject start <= subject end means the hit is on the plus strand; bed starts are 0-based
            if int(content[8]) - int(content[9]) < 1:
                print(content[1], int(content[8]) -1, content[9], content[0], content[10], "+", sep="\t", file=bed_fo)
            else:
                print(content[1], int(content[9]) -1, content[8], content[0], content[10], "-", sep="\t", file=bed_fo)
    return bed_file_name
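
To make the coordinate handling of this conversion concrete, here is a small sketch that applies the same rule to a single line. The helper `outfmt6_line_to_bed` and the two example hits are made up for illustration; the column order is the standard BLAST outfmt 6 layout, which is also what the function above relies on.

```python
def outfmt6_line_to_bed(line):
    """Convert one BLAST outfmt 6 line to BED fields: contig, start, end, query, e-value, strand."""
    content = line.rstrip("\n").split("\t")
    sstart, send = int(content[8]), int(content[9])
    if sstart <= send:
        # plus strand: shift the 1-based start to a 0-based BED start
        return [content[1], str(sstart - 1), str(send), content[0], content[10], "+"]
    # minus strand: BLAST reports start > end, so swap them for BED
    return [content[1], str(send - 1), str(sstart), content[0], content[10], "-"]


# Made-up hits of one gene on a haplotig, once per strand
plus_hit = "evm.TU.pcontig_000.1\thcontig_000_003\t98.5\t200\t3\t0\t1\t200\t1001\t1200\t1e-50\t350\n"
minus_hit = "evm.TU.pcontig_000.1\thcontig_000_003\t98.5\t200\t3\t0\t1\t200\t1200\t1001\t1e-50\t350\n"

print("\t".join(outfmt6_line_to_bed(plus_hit)))
print("\t".join(outfmt6_line_to_bed(minus_hit)))
```

Both hits end up with the same BED interval (1000 to 1200) and differ only in the strand column.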

In [15]:
pwh_set = []
def pwh_filter (q_contig, pwh_set=pwh_set):
    '''Checks if contig belongs to the primary with haplotig set.'''
    if q_contig in pwh_set:
        return True
    else:
        return False

In [16]:
def same_contig_blast(x,y):
    '''Function that checks if the blast hit contig in column y carries the same contig number as the
    query sequence ID in column x, i.e. whether the hit lies on the associated contig.
    '''
    q_contig = x.split('.')[2].split('_')[1]
    hit_contig = y.split('_')[1]
    if q_contig == hit_contig:
        return True
    else:
        return False
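
The string splitting in this function depends on a specific naming scheme. The sketch below restates the logic with example IDs; the IDs are assumptions about the format (gene IDs such as `evm.TU.pcontig_000.1`, haplotigs such as `hcontig_000_003`) and are not taken from the actual files.

```python
def same_contig_blast(query_id, hit_contig):
    """Compare the contig number inside the query gene ID with the one in the hit contig name."""
    q_contig = query_id.split('.')[2].split('_')[1]   # 'evm.TU.pcontig_000.1' -> '000'
    h_contig = hit_contig.split('_')[1]               # 'hcontig_000_003' -> '000'
    return q_contig == h_contig


print(same_contig_blast('evm.TU.pcontig_000.1', 'hcontig_000_003'))
print(same_contig_blast('evm.TU.pcontig_000.1', 'hcontig_012_001'))
print(same_contig_blast('evm.TU.hcontig_000_003.5', 'pcontig_000'))
```

IDs that do not follow this dot and underscore layout would raise an `IndexError`, so the scheme has to hold for every query and hit.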

In [17]:
def on_primary_contig (q_contig):
    '''Quick function that checks if query is on primary contig or not'''
    if q_contig.startswith('hcontig'):
        return False
    elif q_contig.startswith('pcontig'):
        return True
    else:
        print('Contig annotation needs to start with hcontig or pcontig')

In [18]:
def exn_asso_contig(query):
    '''Quick function that adds the summary of exonerate on the associated contig to the df'''
    if query in exonerate_no_filtered_allele_asso_contig_bool_dict.keys():
        return exonerate_no_filtered_allele_asso_contig_bool_dict[query]
    else:
        return 'nan'

In [19]:
def exn_no_asso_contig(query):
    '''Quick function that adds the summary of exonerate on non-associated contigs to the df'''
    if query in exonerate_no_filtered_allele_no_asso_contig_bool_dict.keys():
        return exonerate_no_filtered_allele_no_asso_contig_bool_dict[query]
    else:
        return 'nan'

In [20]:
def col_8_id(x):
    '''Function that pulls out the ID from the 9th column of a df.'''
    pattern = r'ID=([a-zA-Z0-9_.]*);'
    regex = re.compile(pattern)  
    m = regex.search(x)
    match = m.groups()[0].replace('TU', 'model')
    if match.startswith('cds.'):
        match = match[4:]
    if 'exon' in match:
        _list = match.split('.')
        match = '.'.join(_list[:-1])
    return match
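
What this ID clean-up does is easiest to see on a few attribute strings. The sketch below restates the same steps under a hypothetical name, `gff_id_to_model_id`, and adds a guard for attribute strings without an `ID=...;` field; the example strings are made up in the style of EVM GFF3 output.

```python
import re

ID_PATTERN = re.compile(r'ID=([a-zA-Z0-9_.]*);')


def gff_id_to_model_id(attributes):
    """Map the ID of a gene, CDS or exon feature to the ID of its gene model."""
    m = ID_PATTERN.search(attributes)
    if m is None:
        return None  # guard added for this sketch; the notebook function assumes a match
    match = m.group(1).replace('TU', 'model')
    if match.startswith('cds.'):
        match = match[4:]
    if 'exon' in match:
        match = '.'.join(match.split('.')[:-1])
    return match


examples = [
    'ID=evm.TU.pcontig_000.1;Name=EVM_prediction',
    'ID=cds.evm.model.pcontig_000.1;Parent=evm.model.pcontig_000.1',
    'ID=evm.model.pcontig_000.1.exon1;Parent=evm.model.pcontig_000.1',
]
for attributes in examples:
    print(gff_id_to_model_id(attributes))
```

All three example features map to the same model ID, which is what allows them to be joined on the protein ID later.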

In [21]:
tmp_SNP_dict ={}
def number_of_SNPs(protein_id, tmp_SNP_dict=tmp_SNP_dict):
    if protein_id in tmp_SNP_dict.keys():
        return tmp_SNP_dict[protein_id]
    else:
        return 0

### ENV parameters and Qcov and PctID cut_offs to define

In [22]:
#Define ENV parameters for blast hits and threads used in blast analysis
n_threads = 4
e_value = 1e-3
blast_stderr_dict ={} #keep track of all the blast outputs and errors, if any
#enter here the Qcov and PctID cut-offs you would like to have analyzed.
Qcov_cut_off = 80 #this defines the minimum coverage of the query required for filtering. Will become part of the name.
PctID_cut_off = 70 #this defines the minimum PctID across the alignment required for filtering. Will become part of the name.

### PATH variables to define

In [23]:
#Define the PATH
BASE_AA_PATH = '/home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12'
BASE_A_PATH = '/home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/032017_assembly'
BLAST_RESULT_PATH = os.path.join(BASE_AA_PATH,'allele_analysis' )
ALLELE_PATH =os.path.join(BASE_AA_PATH ,'allele_analysis/alleles')
BLAST_DB = os.path.join(BASE_AA_PATH, 'blast_DB')
OUT_PATH = os.path.join(BASE_AA_PATH, 'allele_analysis', 'no_alleles_QC_Qcov%s_PctID%s'% (Qcov_cut_off, PctID_cut_off))
OUT_PATH_ELSE = os.path.join(OUT_PATH, 'maybe_useful')
OUT_PATH_tmp = os.path.join(OUT_PATH, 'tmp')
EXONERATE_PATH = os.path.join(OUT_PATH_tmp, 'exonerate')
COV_PATH = os.path.join(BASE_AA_PATH, 'COV')
VCF_SRM_PATH = os.path.join(BASE_AA_PATH, 'SRM_VCF')
if not os.path.isdir(OUT_PATH):
    os.mkdir(OUT_PATH)
if not os.path.isdir(OUT_PATH_tmp):
    os.mkdir(OUT_PATH_tmp)
if not os.path.isdir(EXONERATE_PATH):
    os.mkdir(EXONERATE_PATH)
if not os.path.exists(OUT_PATH_ELSE):
    os.mkdir(OUT_PATH_ELSE)

### script variables to define

In [24]:
#clean up the tmp folder?
clean_up = True #True will delete the tmp folder with tmp blast hits and exonerate output files
exonerate_script_name = 'exonerate_alignments_vulgar.sh'

### Genome IDs to enter

In [25]:
#genome
p_genome = 'Pst_104E_v12_p_ctg'
h_genome = 'Pst_104E_v12_h_ctg'

### For COV and SNP analysis enter bed file names or endings

In [26]:
homo_cov_ph_p = '.ph_p_homo_cov.bed' #this is the coverage bed file which defines regions that have homozygous coverage
                                    #when doing ph mapping on p
vcf_file_endings = '.DP10Q20.vcf' #this should be your filter settings for SNP calling

In [27]:
protein_fa_files = [os.path.join(BASE_A_PATH, x) for x in os.listdir(BASE_A_PATH) if x.endswith('protein.fa')]

In [28]:
#read in the protein sequences for p and h contigs and store each record and its length in dicts that use the
#protein id as key.
fa_protein_dict = {}
fa_protein_seq_dict = {}
fa_protein_length_dict = {}
for file in protein_fa_files:
    seq_list = []
    length_list =[]
    for seq in SeqIO.parse(open(file), 'fasta'):
        fa_protein_seq_dict[seq.id] = seq
        fa_protein_length_dict[seq.id] = len(seq)


In [29]:
#get the file names of the no allele cases: the filtered set (with Qcov and PctID cut-offs) and the 'no alleles at all' set,
#which in principle is all p and h proteins without a blast hit at the given e-value (right now 0.001)
filtered_no_alleles = [os.path.join(ALLELE_PATH, x) for x in os.listdir(ALLELE_PATH)\
                       if (x.split('.')[1] == 'no' and 'Qcov' in x and 'PctID' in x and x.startswith(p_genome) )or \
                           (x.startswith(h_genome) and 'Qcov' in x and 'PctID' in x and 'no.no_p_hits' in x)]
filtered_no_alleles_dict = {}
for x in filtered_no_alleles:
    key = x.split('/')[-1].split('.')[0]
    filtered_no_alleles_dict[key] = x
    
no_alleles_at_all = [os.path.join(ALLELE_PATH, x) for x in os.listdir(ALLELE_PATH)\
                       if (x.split('.')[1] == 'no' and 'Qcov' not in x and 'PctID' not in x and x.startswith(p_genome) )or \
                           (x.startswith(h_genome) and 'Qcov' not in x and 'PctID' not in x and 'no.no_p_hits' in x)]
no_alleles_at_all_dict ={}
for x in no_alleles_at_all:
    key = x.split('/')[-1].split('.')[0]
    no_alleles_at_all_dict[key] = x
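
The file selection above is fairly dense, so here is the same rule for the filtered set written as a standalone predicate. The function name is hypothetical, the length guard is an addition for this sketch, and the third and fourth file names are invented to show non-matching cases.

```python
def is_filtered_no_allele_file(name, p_genome, h_genome):
    """Mirror of the list comprehension for the filtered 'no allele' files."""
    parts = name.split('.')
    has_cutoffs = 'Qcov' in name and 'PctID' in name
    # guard added here: file names without a '.' would otherwise raise an IndexError
    p_match = len(parts) > 1 and parts[1] == 'no' and has_cutoffs and name.startswith(p_genome)
    h_match = name.startswith(h_genome) and has_cutoffs and 'no.no_p_hits' in name
    return p_match or h_match


p_genome = 'Pst_104E_v12_p_ctg'
h_genome = 'Pst_104E_v12_h_ctg'
names = [
    'Pst_104E_v12_p_ctg.no.Qcov80.PctID70.alleles',
    'Pst_104E_v12_h_ctg.no.no_p_hits.Qcov80.PctID70.alleles',
    'Pst_104E_v12_h_ctg.no.Qcov80.PctID70.alleles',
    'Pst_104E_v12_p_ctg.no.alleles',
]
selected = [n for n in names if is_filtered_no_allele_file(n, p_genome, h_genome)]
print(selected)
```

Only the first two names are selected, which matches the one pair of files the notebook expects later on.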

###### Might be incorporated into the script in the future
Pull the gff and genome fasta files over into the tmp folder, make a gene gff, and pull out the gene sequences with bedtools getfasta on the command line using subprocess. See the ideas from the original script below.


In [30]:
#get the gene.fa files and put them in a dict that has the genome as a key
gene_fa_files = [os.path.join(BASE_A_PATH, x) for x in os.listdir(BASE_A_PATH) if x.endswith('gene.fa')]
gene_fa_files_dict = {}
for x in gene_fa_files:
    key = x.split('/')[-1].split('.')[0]
    gene_fa_files_dict[key] = x

In [31]:
#generate the blast databases if not already present
os.chdir(BLAST_DB)
blast_dir_content = os.listdir(BLAST_DB)
for x in blast_dir_content:
    if x.endswith('.fa') and ({os.path.isfile(x + e) for e in ['.psq', '.phr', '.pin'] } != {True}\
           and {os.path.isfile(x + e) for e in ['.nin', '.nhr', '.nsq'] } != {True} ):

        make_DB_options = ['-in']
        make_DB_options.append(x)
        make_DB_options.append('-dbtype')
        if 'protein' in x:
            make_DB_options.append('prot')
        else:
            make_DB_options.append('nucl')
        make_DB_command = 'makeblastdb %s' % ' '.join(make_DB_options)
        make_DB_stderr = subprocess.check_output(make_DB_command, shell=True, stderr=subprocess.STDOUT)
        print('%s is done!' % make_DB_command)
print("All databases generated and ready to go!")

All databases generated and ready to go!
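
The set comparison used to decide whether a database already exists can be read as "all three protein index files or all three nucleotide index files are present". A minimal sketch of that rule, with a hypothetical helper name and a throwaway temporary directory instead of the real `blast_DB` folder:

```python
import os
import tempfile


def blast_db_exists(fasta_path):
    """True if a complete protein or nucleotide BLAST index sits next to the fasta file."""
    protein = all(os.path.isfile(fasta_path + ext) for ext in ('.psq', '.phr', '.pin'))
    nucleotide = all(os.path.isfile(fasta_path + ext) for ext in ('.nsq', '.nhr', '.nin'))
    return protein or nucleotide


with tempfile.TemporaryDirectory() as tmp_dir:
    fasta = os.path.join(tmp_dir, 'example_ctg.fa')
    open(fasta, 'w').close()
    before = blast_db_exists(fasta)
    # create only two of the three nucleotide index files: still incomplete
    for ext in ('.nsq', '.nhr'):
        open(fasta + ext, 'w').close()
    incomplete = blast_db_exists(fasta)
    open(fasta + '.nin', 'w').close()
    complete = blast_db_exists(fasta)

print(before, incomplete, complete)
```

A partially written index is treated as missing, so an interrupted `makeblastdb` run gets repeated.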


In [32]:
#get the blast db files and put them in a dict that has the genome as a key
gene_blast_db = [os.path.join(BLAST_DB, x) for x in os.listdir(BLAST_DB) if x.endswith('gene.fa')]
gene_blast_db_dict ={}
for x in gene_blast_db:
    key = x.split('/')[-1].split('.')[0]
    gene_blast_db_dict[key] = x
genome_blast_db = [os.path.join(BLAST_DB, x) for x in os.listdir(BLAST_DB) if x.endswith('_ctg.fa')]
genome_blast_db_dict ={}
for x in genome_blast_db:
    key = x.split('/')[-1].split('.')[0]
    genome_blast_db_dict[key] = x


In [33]:
#using the dictionary approach to stitch together all the different input files. The key is always the genome, in this case
#the part of the file name before the first '.'
if len(filtered_no_alleles) != 2:
    print("This script right now is only designed for one set of filter files.")
    print("Please hold!")
else:
    print("One pair of filtered non-allele files given. Good to go!")
    
#simply pulls in the gene sequences of missing alleles. Do this on the filtered set as the unfiltered set is a subset anyway
no_filtered_allele_gene_dict = {}
for no_alleles_key in filtered_no_alleles_dict.keys():
    #read in all the alleles from file this assumes that only one filter setting was run in the allele folder
    no_filtered_allele_list = pd.read_csv(os.path.join(ALLELE_PATH, filtered_no_alleles_dict[no_alleles_key]), header=None, sep='\t')[0].tolist()
    #convert from protein IDs to gene IDs
    no_filtered_allele_list =  [x.replace('evm.model', 'evm.TU') for x in no_filtered_allele_list]
    
    no_filtered_allele_seq = []
    for seq in SeqIO.parse(open(gene_blast_db_dict[no_alleles_key]), 'fasta'):
        if seq.id in no_filtered_allele_list:
            no_filtered_allele_seq.append(seq)
    #get the proper file name
    out_f_prefix = filtered_no_alleles_dict[no_alleles_key].split('/')[-1]
    out_f = out_f_prefix + '.gene.fa'
    f_handle = open(os.path.join(OUT_PATH_ELSE, out_f),'w') #need to generate handle for writing and
    SeqIO.write(no_filtered_allele_seq, f_handle, 'fasta')
    f_handle.close() #closing file afterwards again
    no_filtered_allele_gene_dict[no_alleles_key] = os.path.join(OUT_PATH_ELSE, out_f)

One pair of filtered non-allele files given. Good to go!


In [34]:
#do the gene against other haplotype blast
no_filtered_allele_gene_genome_blast_dict ={}
for no_alleles_key in no_filtered_allele_gene_dict.keys():
    blast_options = ['-query']
    query = no_filtered_allele_gene_dict[no_alleles_key]
    blast_options.append(query)
    blast_options.append('-db')
    if no_alleles_key == p_genome:
        db = genome_blast_db_dict[h_genome]
    elif no_alleles_key == h_genome:
        db = genome_blast_db_dict[p_genome]
    else:
        print("There is something wrong with the file name prefixes and the genome (h and p) provided!")
    blast_options.append(db)
    blast_options.append('-outfmt 6')
    blast_options.append('-evalue')
    blast_options.append(str(e_value))
    blast_options.append('-num_threads')
    blast_options.append(str(n_threads))
    #blast_options.append('-max_target_seqs 1')
    blast_options.append('>')
    if 'gene' in query:
        out_name_list = [ query.split('/')[-1], 'db_' + db.split('/')[-1], str(e_value), 'blastn.outfmt6']
        out_name = os.path.join(OUT_PATH_tmp ,'.'.join(out_name_list))
        blast_options.append(out_name)
        blast_command = 'blastn %s' % ' '.join(blast_options)
    no_filtered_allele_gene_genome_blast_dict[no_alleles_key] = out_name
    print(blast_command)
    if not os.path.exists(out_name):
        blast_stderr_dict[blast_command] = subprocess.check_output(blast_command, shell=True, stderr=subprocess.STDOUT)
        print("New blast run and done!")
    else:
        blast_stderr_dict[blast_command] = 'Previously done already!'
        print('Previously done already!')

blastn -query /home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12/allele_analysis/no_alleles_QC_Qcov80_PctID70/maybe_useful/Pst_104E_v12_p_ctg.no.Qcov80.PctID70.alleles.gene.fa -db /home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12/blast_DB/Pst_104E_v12_h_ctg.fa -outfmt 6 -evalue 0.001 -num_threads 4 > /home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12/allele_analysis/no_alleles_QC_Qcov80_PctID70/tmp/Pst_104E_v12_p_ctg.no.Qcov80.PctID70.alleles.gene.fa.db_Pst_104E_v12_h_ctg.fa.0.001.blastn.outfmt6
New blast run and done!
blastn -query /home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12/allele_analysis/no_alleles_QC_Qcov80_PctID70/maybe_useful/Pst_104E_v12_h_ctg.no.no_p_hits.Qcov80.PctID70.alleles.gene.fa -db /home/benjamin/genome_assembly/PST79/FALCON/p_assemblies/v9_1/Pst_104E_v12/blast_DB/Pst_104E_v12_p_ctg.fa -outfmt 6 -evalue 0.001 -num_threads 4 > /home/benjamin/genome_assembly/PST79/FALC
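
As an alternative to assembling a shell string with a `>` redirect, the same blastn call can be expressed as an argument list that writes through `-out`. This is only a sketch of that design choice; the helper name and the file names are placeholders, and the command is built but not executed here.

```python
def build_blastn_command(query, db, out_name, e_value=1e-3, n_threads=4):
    """Build the blastn call as an argument list, so no shell quoting or redirection is needed."""
    return [
        'blastn',
        '-query', query,
        '-db', db,
        '-outfmt', '6',
        '-evalue', str(e_value),
        '-num_threads', str(n_threads),
        '-out', out_name,
    ]


# Placeholder file names, for illustration only
cmd = build_blastn_command('no_alleles.gene.fa', 'h_ctg.fa', 'no_alleles.blastn.outfmt6')
print(' '.join(cmd))
# To actually run it (requires BLAST+ on the PATH):
# subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
```

Passing a list to `subprocess.run` without `shell=True` also keeps paths with spaces or special characters from being misread by the shell.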

In [35]:
#now convert all the gene level against genome blast hits to bed files
no_filtered_allele_gene_genome_blast_bed_dict = {}
for key, value in no_filtered_allele_gene_genome_blast_dict.items():
    no_filtered_allele_gene_genome_blast_bed_dict[key] = blast_outfmt6_to_bed(value)
    

In [36]:
#here track what happens with the no_alleles, meaning how many of those have a gene vs. genome hit and how many don't

#these dicts will hold the list of SeqRecord objects of blast hit regions for each no_allele hitting the other haplotype, split into
#hits on associated and on unlinked contigs. The id of each record will be h/pcontig_xxx_start_end of the DNA sequence

#hit on associated contig
no_filtered_allele_gene_genome_hit_asso_contig_dict = {}

#hit on unlinked contigs
no_filtered_allele_gene_genome_hit_no_asso_contig_dict = {}

no_filtered_allele_gene_no_genome_hit_dict ={}

for key, no_filtered_alllele_fn in filtered_no_alleles_dict.items():

    no_filtered_alleles = pd.read_csv(no_filtered_alllele_fn, sep='\t', header=None)[0].unique()
    genome_hits_header = ['Contig', 'start', 'end', 'blast_query', 'e-value', 'strand']
    gene_genome_hits_df = pd.read_csv(no_filtered_allele_gene_genome_blast_bed_dict[key], sep='\t', \
                                         names = genome_hits_header, header=None)
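    #regex=False makes sure the '.' is treated as a literal dot regardless of the pandas version
    gene_genome_hits_df['Protein_ID'] = gene_genome_hits_df['blast_query'].str.replace('evm.TU', 'evm.model', regex=False)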
    #get all alleles with no gene vs. genome hit and save them to file with ending 'no_allele_no_gene_genome_blast_hit.txt'
    no_filtered_allele_gene_no_genome_hit = np.setdiff1d(no_filtered_alleles, gene_genome_hits_df.Protein_ID.unique(), assume_unique= True)
    out_fn = os.path.join(OUT_PATH, key +'.no_allele_no_gene_genome_blast_hit.txt')
    no_filtered_allele_gene_no_genome_hit_dict[key] = out_fn
    np.savetxt(out_fn, no_filtered_allele_gene_no_genome_hit, fmt='%s')
    #now filter out the best hit on an associated contig
    gene_genome_hits_df['asso_contig'] = gene_genome_hits_df['blast_query'].combine(gene_genome_hits_df['Contig'], func=same_contig_blast)
    tmp_same_contig_df = ''
    tmp_same_contig_df = gene_genome_hits_df[gene_genome_hits_df['asso_contig'] == True]
    #now filter out the best hit on a non-associated contig <- not for now as this might get a bit complicated with paralogs and such
    tmp_diff_contig_df = ''
    tmp_diff_contig_df_grouped = gene_genome_hits_df[gene_genome_hits_df['asso_contig'] == False].groupby('blast_query')
    tmp_diff_contig_best_hits = tmp_diff_contig_df_grouped.apply(lambda g: g[g['e-value'] == g['e-value'].min()])
    #now get all query protein ids
    tmp_protein_id = gene_genome_hits_df['Protein_ID'].unique()
    genome_name = ''
    if key == p_genome:
        genome_name = os.path.join(BASE_A_PATH, h_genome+'.fa')
    elif key == h_genome:
        genome_name = os.path.join(BASE_A_PATH, p_genome+'.fa')
    genome_fa = pysam.FastaFile(genome_name)
    
    for protein_id in tmp_protein_id:
        
        #now loop through the protein_ids of no_alleles hiting the associated contig aka same contig
        #could do something like gene_genome_hits_df.pivot_table(columns=['Protein_ID', 'Contig'], aggfunc={'start' : 'min', 'end':'min'})
        tmp_protein_id_df = gene_genome_hits_df[(gene_genome_hits_df['Protein_ID'] == protein_id) & (gene_genome_hits_df['asso_contig'] == True) ]
        
        if len(tmp_protein_id_df) < 1:
            continue
        tmp_hit_contig = tmp_protein_id_df["Contig"].unique()
        tmp_gene_genome_seq_list = [] #saves SeqRecords of blast hits and their surroundings
        #now loop through the associated contig hits in case we have multiple associated contigs hit
        for hit in tmp_hit_contig:
            tmp_df_2 = tmp_protein_id_df[tmp_protein_id_df['Contig'] == hit]
            #get the smallest starting point on the specific contig and add 30 kb of flanking sequence on both sides
            start = tmp_df_2['start'].min() - 30000
            if start < 0:
                start = 0 #pysam fetch uses 0-based, half-open coordinates, so 0 is the first base
            end = tmp_df_2['end'].max() + 30000
            seq = genome_fa.fetch(hit, start, end)
            seq_r = '' #initialize empty SeqIO record
            seq_id = hit + '_' + str(start) + '_' + str(end)
            seq_ob = Seq(seq)
            seq_ob.alphabet = 'fasta'
            seq_r = SeqRecord(seq_ob)
            seq_r.id = seq_id
            tmp_gene_genome_seq_list.append(seq_r)
        no_filtered_allele_gene_genome_hit_asso_contig_dict[protein_id] = tmp_gene_genome_seq_list
        
        
    #need to loop through the protein_ids twice as the len(tmp_protein_id_df) < 1 check would otherwise silently skip
    #proteins that only hit non-associated contigs
    for protein_id in tmp_protein_id:   
        #now loop through the protein_ids of no_alleles hitting non-associated contigs aka diff_contig
        tmp_protein_id_df = tmp_diff_contig_best_hits[(tmp_diff_contig_best_hits['Protein_ID'] == protein_id)]
        if len(tmp_protein_id_df) < 1:
            continue
        tmp_hit_contig = tmp_protein_id_df["Contig"].unique()
        tmp_gene_genome_seq_list = [] #saves SeqRecords of blast hits and their surroundings
        #now loop through the non-associated contig hits in case we have multiple contigs hit
        #pull out the blast hit regions (for one contig start(min) and end(max) if multiple hits on the same contig).
        #save the SeqRecords for each protein_id in a list
        for hit in tmp_hit_contig:
            tmp_df_2 = tmp_protein_id_df[tmp_protein_id_df['Contig'] == hit]
            #get the smallest starting point on the specific contig and add 30 kb of flanking sequence on both sides
            start = tmp_df_2['start'].min() - 30000
            if start < 0:
                start = 0 #pysam fetch uses 0-based, half-open coordinates, so 0 is the first base
            end = tmp_df_2['end'].max() + 30000
            seq = genome_fa.fetch(hit, start, end)
            seq_r = '' #initialize empty SeqIO record
            seq_id = hit + '_' + str(start) + '_' + str(end)
            seq_ob = Seq(seq)
            seq_ob.alphabet = 'fasta'
            seq_r = SeqRecord(seq_ob)
            seq_r.id = seq_id
            tmp_gene_genome_seq_list.append(seq_r)
        no_filtered_allele_gene_genome_hit_no_asso_contig_dict[protein_id] = tmp_gene_genome_seq_list
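
The window that is cut out around the blast hits can be sketched independently of pandas and pysam. The function below is a hypothetical helper: it merges all hits of one query on one contig, adds the 30 kb flank used above, and clamps the result. The `contig_length` argument is an assumption added for this sketch; the cell above does not pass contig lengths around.

```python
def flanked_window(starts, ends, flank=30000, contig_length=None):
    """Return a 0-based, half-open (start, end) window around all hits plus flanking sequence."""
    start = max(min(starts) - flank, 0)
    end = max(ends) + flank
    if contig_length is not None:
        end = min(end, contig_length)
    return start, end


print(flanked_window([45000, 46000], [45500, 47000]))
print(flanked_window([1000], [2000]))
print(flanked_window([1000], [2000], contig_length=20000))
```

Clamping at both ends keeps the coordinates in the record ID consistent with the sequence that is actually extracted.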
    
    

In [37]:
#now write an exonerate script that aligns the protein sequences to the DNA sequences
EXONERATE_PATH_asso = os.path.join(EXONERATE_PATH, 'hit_associated_contigs')
EXONERATE_PATH_no_asso = os.path.join(EXONERATE_PATH, 'hit_nonassociated_contigs')
if not os.path.exists(EXONERATE_PATH_asso):
    os.mkdir(EXONERATE_PATH_asso)
if not os.path.exists(EXONERATE_PATH_no_asso):
    os.mkdir(EXONERATE_PATH_no_asso)
#open up the script
exonerate_script = os.path.join(OUT_PATH, exonerate_script_name)
out_exonerate = open(exonerate_script, 'w')
out_exonerate.write('#!/bin/bash\n')
for contig_key, contig_seq_list in no_filtered_allele_gene_genome_hit_asso_contig_dict.items():
    out_folder = os.path.join(EXONERATE_PATH_asso, contig_key)
    if not os.path.exists(out_folder):
        os.mkdir(out_folder)
    out_protein_fn = os.path.join(out_folder, contig_key + '.fa')
    out_handle = open(out_protein_fn, 'w')
    #write down the protein sequence
    SeqIO.write(fa_protein_seq_dict[contig_key], out_handle, 'fasta')
    out_handle.close()
    #write the exonerate script
    out_exonerate.write('cd %s\n'% out_folder)
    #write out all the genomic regions
    for seq in contig_seq_list:
        out_seq_name = os.path.join(out_folder, seq.id +'.fa')
        out_seq_handle = open(out_seq_name, 'w')
        SeqIO.write(seq, out_seq_handle, 'fasta')
        out_seq_handle.close()
        #write the exonerate command to the script
        out_exonerate.write('exonerate --model protein2genome --percent 20 -q %s -t %s --showalignment False -S > %s.vulgar_exn\n'\
                           %(out_protein_fn, out_seq_name, out_seq_name))

    #out_exonerate.write('cd %s\n'% out_folder) #not necessary
    
for contig_key, contig_seq_list in no_filtered_allele_gene_genome_hit_no_asso_contig_dict.items():
    out_folder = os.path.join(EXONERATE_PATH_no_asso, contig_key)
    if not os.path.exists(out_folder):
        os.mkdir(out_folder)
    out_protein_fn = os.path.join(out_folder, contig_key + '.fa')
    out_handle = open(out_protein_fn, 'w')
    #write down the protein sequence
    SeqIO.write(fa_protein_seq_dict[contig_key], out_handle, 'fasta')
    out_handle.close()
    #write the exonerate script
    out_exonerate.write('cd %s\n'% out_folder)
    #write out all the genomic regions
    for seq in contig_seq_list:
        out_seq_name = os.path.join(out_folder, seq.id +'.fa')
        out_seq_handle = open(out_seq_name, 'w')
        SeqIO.write(seq, out_seq_handle, 'fasta')
        out_seq_handle.close()
        #write the exonerate command to the script
        out_exonerate.write('exonerate --model protein2genome --percent 20 -q %s -t %s --showalignment False -S > %s.vulgar_exn\n'\
                           %(out_protein_fn, out_seq_name, out_seq_name))

    #out_exonerate.write('cd %s\n'% out_folder)       

out_exonerate.close()

In [38]:
#now run the exonerate script
exonerate_command = 'bash %s' % exonerate_script
exonerate_stderr = subprocess.check_output(exonerate_command , shell=True, stderr=subprocess.STDOUT)
print('Exonerate script run successfully')

Exonerate script run successfully


In [39]:
#now loop through the exonerate vulgar results and generate a dictionary of the results
#if the hsps query range == (0, query_length) and there is no F in .vulgar_comp, it is likely that the alignment is actually good
#and the gene model might have been dropped for another reason
#a dict that has the protein ID as key and the exonerate results as a list as value, one entry per contig [contig : True/False].
exonerate_best_hit_dict = {}
exonerate_no_filtered_allele_gene_genome_hit_asso_contig_dict = {}
exonerate_no_filtered_allele_asso_contig_bool_dict = {}


#generate a best hit dict dummy place holder for each contig key
for contig_key in set(list(no_filtered_allele_gene_genome_hit_asso_contig_dict.keys())\
                      + list(no_filtered_allele_gene_genome_hit_no_asso_contig_dict.keys())):
    exonerate_best_hit_dict[contig_key] = ['dummy : 0']



#now loop through the exonerate folders
for contig_key in no_filtered_allele_gene_genome_hit_asso_contig_dict.keys():
    out_folder = os.path.join(EXONERATE_PATH_asso, contig_key)
    query_length = fa_protein_length_dict[contig_key]
    #the results list will store the result of each individual exonerate alignment as a boolean value.
    #True == alignment successful (alignment range == full length of the protein sequence, no F(rameshift) in vulgar string)
    exonerate_result_list = []
    counter = 0
    overall_best_score = 0
    overall_best_hit = ''
    #get all vulgar alignment results
    vulgar_exn_list = [os.path.join(out_folder, x) for x in os.listdir(out_folder) if x.endswith('vulgar_exn')]
    opt_query_range = (0, query_length)
    #loop through vulgar parser and see if hit is valid 
    for fname in vulgar_exn_list:
        best_score = 0
        best_hit = ''
        result = SearchIO.parse(fname, 'exonerate-vulgar')
        genome_region = fname.split('/')[-1].split('.')[0]
        for hit in result:
            #loop through all hsps hits
            for hsps in hit.hsps:
                hsps_range = hsps.query_range
                vulgar_list = hsps.vulgar_comp.strip(' ').split(' ')
                #print(hsps_range, vulgar_list)
                #this is the condition for something being a potential full-length protein alignment:
                #True == alignment successful (alignment range == full length of the protein sequence,
                #no F(rameshift) in vulgar string)
                if hsps_range == opt_query_range and 'F' not in vulgar_list:
                    counter += 1
                    if hsps.score > best_score:
                        best_hit = hsps.hit_id
                        best_score = hsps.score
                    if hsps.score > overall_best_score:
                        overall_best_hit = hsps.hit_id
                        overall_best_score = hsps.score
                    #print(key)
        if best_score > 0:
            exonerate_result_list.append('%s : True' % genome_region)
        else:
            exonerate_result_list.append('%s : False' % genome_region)
            
    exonerate_no_filtered_allele_gene_genome_hit_asso_contig_dict[contig_key] = exonerate_result_list
    
    
    
    if counter > 0:
        exonerate_no_filtered_allele_asso_contig_bool_dict[contig_key] = True
        
        if contig_key in exonerate_best_hit_dict.keys():
            if int(exonerate_best_hit_dict[contig_key][0].split(':')[1][1:]) < overall_best_score:
                exonerate_best_hit_dict[contig_key] = ['%s : %s' % (overall_best_hit, overall_best_score)]
    else:
        exonerate_no_filtered_allele_asso_contig_bool_dict[contig_key] = False
    


In [40]:
#now loop through the exonerate vulgar results and generate a dictionary of the results
#if the hsps query range == (0, query_length) and there is no F in .vulgar_comp, it is likely that the alignment is actually good
#and the gene model might have been dropped for another reason
#a dict that has the protein ID as key and the exonerate results as a list as value, one entry per contig [contig : True/False].
exonerate_no_filtered_allele_gene_genome_hit_no_asso_contig_dict = {}
exonerate_no_filtered_allele_no_asso_contig_bool_dict = {}
#now loop through the exonerate folders


for contig_key in no_filtered_allele_gene_genome_hit_no_asso_contig_dict.keys():
    out_folder = os.path.join(EXONERATE_PATH_no_asso, contig_key)
    query_length = fa_protein_length_dict[contig_key]
    #the results list will store the result of each individual exonerate alignment as a boolean value.
    #True == alignment successful (alignment range == full length of the protein sequence, no F(rameshift) in vulgar string)
    exonerate_result_list = []
    counter = 0
    overall_best_score = 0
    overall_best_hit = ''
    #get all vulgar alignment results
    vulgar_exn_list = [os.path.join(out_folder, x) for x in os.listdir(out_folder) if x.endswith('vulgar_exn')]
    opt_query_range = (0, query_length)
    #loop through vulgar parser and see if hit is valid 
    for fname in vulgar_exn_list:
        best_score = 0
        best_hit = ''
        result = SearchIO.parse(fname, 'exonerate-vulgar')
        genome_region = fname.split('/')[-1].split('.')[0]
        for hit in result:
            #loop through all hsps hits
            for hsps in hit.hsps:
                hsps_range = hsps.query_range
                vulgar_list = hsps.vulgar_comp.strip(' ').split(' ')
                #print(hsps_range, vulgar_list)
                #this is the condition for something being a potential full-length protein alignment:
                #True == alignment successful (alignment range == full length of the protein sequence,
                #no F(rameshift) in vulgar string)
                if hsps_range == opt_query_range and 'F' not in vulgar_list:
                    counter += 1
                    if hsps.score > best_score:
                        best_hit = hsps.hit_id
                        best_score = hsps.score
                    if hsps.score > overall_best_score:
                        overall_best_hit = hsps.hit_id
                        overall_best_score = hsps.score
                    #print(key)
        if best_score > 0:
            exonerate_result_list.append('%s : True' % genome_region)
        else:
            exonerate_result_list.append('%s : False' % genome_region)
            
    exonerate_no_filtered_allele_gene_genome_hit_no_asso_contig_dict[contig_key] = exonerate_result_list
    
    
    
    if counter > 0:
        exonerate_no_filtered_allele_no_asso_contig_bool_dict[contig_key] = True
        
        if contig_key in exonerate_best_hit_dict.keys():
            if int(exonerate_best_hit_dict[contig_key][0].split(':')[1][1:]) < overall_best_score:
                exonerate_best_hit_dict[contig_key] = ['%s : %s' % (overall_best_hit, overall_best_score)]
    else:
        exonerate_no_filtered_allele_no_asso_contig_bool_dict[contig_key] = False
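
The best-hit bookkeeping in the two cells above encodes hit and score in a `'name : score'` string and parses the score back out with `split`. A possible simplification, sketched here with a hypothetical helper and made-up IDs, is to keep `(hit_id, score)` tuples instead:

```python
def update_best_hit(best_hits, protein_id, hit_id, score):
    """Keep only the highest scoring (hit_id, score) pair per protein."""
    current = best_hits.get(protein_id)
    if current is None or score > current[1]:
        best_hits[protein_id] = (hit_id, score)
    return best_hits


best_hits = {}
update_best_hit(best_hits, 'evm.model.pcontig_000.1', 'hcontig_000_003_0_61000', 450)
update_best_hit(best_hits, 'evm.model.pcontig_000.1', 'hcontig_012_001_5000_66000', 380)
update_best_hit(best_hits, 'evm.model.pcontig_000.1', 'hcontig_000_007_100_62000', 520)
print(best_hits)
```

Tuples are written out as lists by `json.dump`, so this form would still be compatible with writing the dictionary to a text file.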
    


In [41]:
#write out the exonerate dictionaries, combine the vulgar results and delete all the exonerate files if clean_up == True
out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg', 'ph_ctg') + 'exonerate_no_filtered_allele_asso_contig_bool_dict.txt')
json.dump(exonerate_no_filtered_allele_asso_contig_bool_dict,open(out_name, 'w'))

out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg', 'ph_ctg') + 'exonerate_no_filtered_allele_no_asso_contig_bool_dict.txt')
json.dump(exonerate_no_filtered_allele_no_asso_contig_bool_dict,open(out_name, 'w'))

out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg', 'ph_ctg') + 'exonerate_no_filtered_allele_gene_genome_hit_asso_contig_dict.txt')
json.dump(exonerate_no_filtered_allele_gene_genome_hit_asso_contig_dict,open(out_name, 'w'))

out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg', 'ph_ctg') + 'exonerate_no_filtered_allele_gene_genome_hit_no_asso_contig_dict.txt')
json.dump(exonerate_no_filtered_allele_gene_genome_hit_no_asso_contig_dict,open(out_name, 'w'))

out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg', 'ph_ctg') + 'exonerate_best_hit_dict.txt')
json.dump(exonerate_best_hit_dict,open(out_name, 'w'))


vulgar_files = glob.glob(os.path.join(EXONERATE_PATH, '*/*/*.vulgar_exn' ))
out_name = os.path.join(OUT_PATH_ELSE, p_genome.replace('p_ctg','ph_ctg') +'exonerate_vulgar_exn_all.txt')
with open(out_name, 'wb') as outfile:
    for f in vulgar_files:
        with open(f, 'rb') as infile:
            outfile.write(infile.read())

if clean_up == True:
    shutil.rmtree(EXONERATE_PATH)
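
The repeated dump and concatenate steps of this cell could be wrapped into two small helpers that also close every file handle. A minimal sketch; `dump_dicts` and `concatenate_files` are hypothetical names and not part of the notebook.

```python
import json
import os
import shutil
import tempfile


def dump_dicts(named_dicts, out_dir, prefix):
    """Write each dictionary to <out_dir>/<prefix><name>.txt as JSON."""
    paths = {}
    for name, data in named_dicts.items():
        out_name = os.path.join(out_dir, prefix + name + '.txt')
        with open(out_name, 'w') as handle:
            json.dump(data, handle)
        paths[name] = out_name
    return paths


def concatenate_files(in_files, out_name):
    """Concatenate files in sorted order so that the result is reproducible."""
    with open(out_name, 'wb') as outfile:
        for fn in sorted(in_files):
            with open(fn, 'rb') as infile:
                shutil.copyfileobj(infile, outfile)


demo_dir = tempfile.mkdtemp()
print(dump_dicts({'exonerate_best_hit_dict': {'gene_1': ['hit : 10']}}, demo_dir, 'demo.'))
```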

In [42]:
#now read in the initial blast dataframes, filter them down to all no_alleles_filtered
blast_out_dict = {}
blast_header = ['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery', 'StopQuery', 'StartTarget',\
              'StopTarget', 'e-value','BitScore']
blastp_result_files = [os.path.join(BLAST_RESULT_PATH,x) for x in os.listdir(BLAST_RESULT_PATH) if x.endswith('outfmt6') and x.split('.')[-2] == 'blastp' ]
blastp_results_dict = {}
for x in blastp_result_files:
    key = x.split('/')[-1].split('.')[0]
    blastp_results_dict[key] = x

for key, blastp_fn in blastp_results_dict.items():
    tmp_df = pd.read_csv(blastp_fn, sep='\t', header=None, names=blast_header)
    tmp_no_allele_list = pd.read_csv(filtered_no_alleles_dict[key], sep ='\t', header = None)[0].tolist()
    #copy the filtered frame so that the column assignments below do not act on a view
    tmp_df = tmp_df[tmp_df.Query.isin(tmp_no_allele_list)].copy()
    tmp_df["QLgth"] = tmp_df["Query"].apply(lambda x: fa_protein_length_dict[x])
    tmp_df["QCov"] = tmp_df['AlnLgth']/tmp_df['QLgth']*100
    tmp_df.sort_values(by=['Query', 'e-value','BitScore', ],ascending=[True, True, False], inplace=True)
    #now add proteins/genes without a blast hit to the dataframe, e.g. some of the no_alleles had no hit in the initial blast
    tmp_all_queries_w_hit = tmp_df["Query"].unique()
    tmp_queries_no_hit = set(tmp_no_allele_list) - set(tmp_all_queries_w_hit)
    no_hit_list = []
    #loop over the queries with no hit and make a list of lists out of them, the first element being the query id
    for x in tmp_queries_no_hit:
        NA_list = ['False'] * len(tmp_df.columns)
        NA_list[0] = x
        no_hit_list.append(NA_list)
    tmp_no_hit_df = pd.DataFrame(no_hit_list)
    tmp_no_hit_df.columns = tmp_df.columns
    tmp_no_hit_df['QLgth'] = tmp_no_hit_df.Query.apply(lambda x: fa_protein_length_dict[x])
    #DataFrame.append was removed in pandas 2.0, pd.concat gives the same result
    tmp_df = pd.concat([tmp_df, tmp_no_hit_df])
    #[ph] matches p or h; the former [p|h] also matched a literal '|'
    tmp_df['q_contig'] = tmp_df['Query'].str.extract(r'([ph][a-z]*_[^.]*).?')
    tmp_df['t_contig'] = tmp_df['Target'].str.extract(r'([ph][a-z]*_[^.]*).?')
    #if nothing could be extracted, return False instead of NaN
    tmp_df['t_contig'] = tmp_df['t_contig'].fillna(False)
    tmp_df['q_contig == t_contig'] = (tmp_df["Query"].str.extract(r'[ph][a-z]*_([0-9]*)') == tmp_df["Target"].str.extract(r'[ph][a-z]*_([0-9]*)'))
    tmp_df.reset_index(inplace=True, drop=True)
    blast_out_dict[key] = tmp_df.iloc[:,:]
#now make one summary_df for everything.
no_filtered_allele_summary_df = pd.concat(blast_out_dict.values())
no_filtered_allele_summary_df.reset_index(drop=True, inplace=True)
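
The per-row logic of the cell above (query coverage, contig extraction, same primary contig test) can be checked without pandas. A minimal sketch; the protein IDs are made up for illustration and the function names are hypothetical.

```python
import re

# same patterns as in the cell above, with [ph] as the character class
CONTIG_RE = re.compile(r'([ph][a-z]*_[^.]*)')
CONTIG_NUMBER_RE = re.compile(r'[ph][a-z]*_([0-9]*)')


def contig_of(protein_id):
    """Return the contig part of a protein ID, or False if there is none."""
    match = CONTIG_RE.search(protein_id)
    return match.group(1) if match else False


def same_primary_contig(query_id, target_id):
    """True if query and target share the same primary contig number."""
    q = CONTIG_NUMBER_RE.search(query_id)
    t = CONTIG_NUMBER_RE.search(target_id)
    return bool(q and t) and q.group(1) == t.group(1)


def query_coverage(aln_length, query_length):
    """Alignment length as a percentage of the query protein length."""
    return aln_length / query_length * 100


# hypothetical IDs, the real ones depend on the annotation
print(contig_of('evm.model.hcontig_000_003.5'))
print(same_primary_contig('evm.model.pcontig_000.1', 'evm.model.hcontig_000_003.5'))
```

Queries without a blast hit carry the string 'False' as target, for which `contig_of` returns False, mirroring the `fillna(False)` step.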



In [43]:
#get all primary contigs with and without haplotigs as pwh_set and pwoh_set
p_ctgs = []
h_ctgs = []
for seq in SeqIO.parse(os.path.join(BASE_A_PATH, p_genome + '.fa'), 'fasta'):
    p_ctgs.append(seq.id)
for seq in SeqIO.parse(os.path.join(BASE_A_PATH, h_genome + '.fa'), 'fasta'):
    h_ctgs.append(seq.id)
pwh_set = {re.search(r'[a-z]*_[0-9]*', h_ctg).group().replace('h', 'p') for h_ctg in h_ctgs}
#primary contigs without haplotigs are all primary contigs minus those with haplotigs
pwoh_set = set(p_ctgs) - pwh_set
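
The split into primary contigs with and without haplotigs can be tested on toy contig names. A minimal sketch, assuming the naming scheme `pcontig_XXX` and `hcontig_XXX_YYY`; `split_primary_contigs` is a hypothetical helper.

```python
import re


def split_primary_contigs(p_ctgs, h_ctgs):
    """Return (primary contigs with haplotigs, primary contigs without haplotigs)."""
    pwh = set()
    for h_ctg in h_ctgs:
        match = re.search(r'[a-z]*_[0-9]*', h_ctg)
        if match:
            # 'hcontig_000' -> 'pcontig_000'; only the leading h is replaced
            pwh.add(match.group().replace('h', 'p', 1))
    pwoh = set(p_ctgs) - pwh
    return pwh, pwoh


pwh_demo, pwoh_demo = split_primary_contigs(
    ['pcontig_000', 'pcontig_001'],
    ['hcontig_000_001', 'hcontig_000_002'])
print(pwh_demo, pwoh_demo)
```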

In [44]:
#add column for being on primary contig
no_filtered_allele_summary_df['primary_contig'] = no_filtered_allele_summary_df.q_contig.apply(on_primary_contig)

In [45]:
#add column for being on primary contig with haplotig
no_filtered_allele_summary_df['pwh_contig'] = no_filtered_allele_summary_df.q_contig.apply(pwh_filter)

In [46]:
#get the list of genes without a gene vs genome blast hit
no_filtered_allele_gene_no_genome_hit_list = []
for key, value in no_filtered_allele_gene_no_genome_hit_dict.items():
    no_filtered_allele_gene_no_genome_hit_list += pd.read_csv(value, header=None, sep ='\t')[0].tolist()

In [47]:
#add gene_on_genome hit column
no_filtered_allele_summary_df['gene_on_genome_blast_hit'] = ~no_filtered_allele_summary_df["Query"].isin(no_filtered_allele_gene_no_genome_hit_list)

In [48]:
#quick QC step to check that the exonerate dictionaries and gene_on_genome_blast_hit agree
exn_set = set(list(exonerate_no_filtered_allele_asso_contig_bool_dict.keys()) + list(exonerate_no_filtered_allele_no_asso_contig_bool_dict.keys()))
if exn_set == set(no_filtered_allele_summary_df[no_filtered_allele_summary_df.gene_on_genome_blast_hit  == True]['Query'].unique()):
    print("All good at the exonerate summary step! Please continue")

All good at the exonerate summary step! Please continue


In [49]:
#add exonerate summaries to the summary df
no_filtered_allele_summary_df['exn_asso_contig'] = no_filtered_allele_summary_df.Query.apply(exn_asso_contig)
no_filtered_allele_summary_df['exn_no_asso_contig'] = no_filtered_allele_summary_df.Query.apply(exn_no_asso_contig)

In [50]:
gff_files = [os.path.join(BASE_A_PATH, x) for x in os.listdir(BASE_A_PATH) if x.endswith('anno.gff3')] 
gff_file_dict ={}
for fn in gff_files:
    key = fn.split('/')[-1].split('.')[0]
    gff_file_dict[key] = fn

In [51]:
#read in gene annotation and gff files for downstream analysis
gene_anno_gff_dict = {}
for assembly, file in gff_file_dict.items() :
    tmp_df =  pd.read_csv(file, header=None, sep='\t')
    tmp_df['protein_id'] = tmp_df[8].apply(col_8_id)
    gene_anno_gff_dict[assembly] =  tmp_df
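
The helper `col_8_id` is defined earlier in the notebook and not shown here. As an illustration only, column 9 of a GFF3 file could be parsed along these lines; `gff3_attributes` is a hypothetical function and may differ from what `col_8_id` actually does.

```python
def gff3_attributes(column_9):
    """Parse a GFF3 attribute string 'key=value;key=value' into a dictionary."""
    attrs = {}
    for field in column_9.strip().strip(';').split(';'):
        if '=' in field:
            key, value = field.split('=', 1)
            attrs[key.strip()] = value.strip()
    return attrs


# made-up attribute string for illustration
print(gff3_attributes('ID=gene_1;Name=example_gene;'))
```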

In [52]:
# drop this option for now and simply investigate homozygous regions for all genes, e.g. a gene could have a hit somewhere else and not at the same position on the associated contig.
#filter down the no hits gene vs genome
no_gene_vs_genome_hits = \
    no_filtered_allele_summary_df[(no_filtered_allele_summary_df.gene_on_genome_blast_hit == False)]['Query'].unique()

In [53]:
#get all no filtered allele IDs to investigate whether they are in a homozygous region if on p contigs
no_filtered_alleles = no_filtered_allele_summary_df['Query'].unique()

In [54]:
#save out gff files for the genes without a gene vs genome blast hit
no_gene_vs_genome_gff_dict = {}
for assembly, gff in gene_anno_gff_dict.items():
    tmp_df_no_gene_vs_genome_df = ''
    tmp_df_no_gene_vs_genome_df = gff[gff.protein_id.isin(no_gene_vs_genome_hits)]
    out_fn = os.path.join(OUT_PATH_tmp, assembly + '.no_gene_vs_genome.anno.gff')
    no_gene_vs_genome_gff_dict[assembly] = out_fn
    tmp_df_no_gene_vs_genome_df.iloc[:,0:9].to_csv(out_fn, header=None, index=None, sep='\t')

In [55]:
#save out gff files for all no filtered alleles
no_filtered_alleles_gff_dict = {}
for assembly, gff in gene_anno_gff_dict.items():
    tmp_df_no_filtered_alleles_df = ''
    tmp_df_no_filtered_alleles_df = gff[gff.protein_id.isin(no_filtered_alleles)]
    out_fn = os.path.join(OUT_PATH_tmp, assembly + '.Qcov%s.PctID%s.no_filtered_alleles.anno.gff' % (Qcov_cut_off, PctID_cut_off))
    no_filtered_alleles_gff_dict[assembly] = out_fn
    tmp_df_no_filtered_alleles_df.iloc[:,0:9].to_csv(out_fn, header=None, index=None, sep='\t')

In [56]:
#get the bed dataframe of homozygous coverage of p contigs when mapping against p and h
homo_cov_ph_p_fn = [os.path.join(COV_PATH, x) for x in os.listdir(COV_PATH) if x.endswith(homo_cov_ph_p)][0]
homo_cov_ph_p_bed = pybedtools.BedTool(homo_cov_ph_p_fn)

In [57]:
#get the no gene vs genome df of the primary assembly
no_gene_vs_genome_gff_p_bed = pybedtools.BedTool(no_gene_vs_genome_gff_dict[p_genome])

In [58]:
#get the no filtered alleles gff of the primary assembly and of the combined ph assembly
no_filtered_alleles_gff_p_bed = pybedtools.BedTool(no_filtered_alleles_gff_dict[p_genome])
no_filtered_alleles_gff_ph_bed = pybedtools.BedTool(no_filtered_alleles_gff_dict[p_genome.replace('p_ctg', 'ph_ctg')])

In [59]:
#get the ids of all no filtered allele genes on p that overlap a homozygous coverage region
gene_ids_ph_p_homo = []
for x in no_filtered_alleles_gff_p_bed.intersect(homo_cov_ph_p_bed):
    y = col_8_id(x[8])
    gene_ids_ph_p_homo.append(y)
gene_ids_ph_p_homo = set(gene_ids_ph_p_homo)

In [60]:
#add a column describing if a protein is encoded in homozygous coverage region on p when mapping against ph
no_filtered_allele_summary_df['ph_p_homo_region'] = \
    no_filtered_allele_summary_df.Query.isin(gene_ids_ph_p_homo)
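
What the intersect step reports can be spelled out in plain Python. A minimal sketch, not a replacement for pybedtools: `genes_in_regions` is a hypothetical helper, and it assumes GFF features with 1-based inclusive coordinates and BED regions with 0-based half-open coordinates.

```python
from collections import defaultdict


def genes_in_regions(gff_features, bed_regions):
    """Return the IDs of all features that overlap at least one region.

    gff_features: iterable of (seqid, start, end, gene_id), 1-based inclusive
    bed_regions:  iterable of (seqid, start, end), 0-based half-open
    """
    by_seq = defaultdict(list)
    for seqid, start, end in bed_regions:
        by_seq[seqid].append((start, end))
    hits = set()
    for seqid, start, end, gene_id in gff_features:
        # convert the GFF interval to 0-based half-open before comparing
        g_start, g_end = start - 1, end
        for r_start, r_end in by_seq.get(seqid, []):
            if g_start < r_end and r_start < g_end:
                hits.add(gene_id)
                break
    return hits


demo_features = [('pcontig_000', 101, 200, 'g1'), ('pcontig_000', 501, 600, 'g2')]
demo_regions = [('pcontig_000', 150, 300)]
print(genes_in_regions(demo_features, demo_regions))
```

A single overlapping base is enough to flag a gene here, which matches the default behaviour of bedtools intersect.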

In [61]:
#write out the summary dataframe
out_fn = os.path.join(OUT_PATH, p_genome.replace('_p_ctg', '_ph_ctg.no_allele_QC.Qcov%s.PctID%s.txt'%(Qcov_cut_off,PctID_cut_off)))
no_filtered_allele_summary_df.to_csv(out_fn, sep='\t', index = None)

In [62]:
#now pull in the vcf files for filtering; consider extending this to include multiple SNP callers
#dict of vcf files keyed by '<assembly>.<SNP caller>'
srm_vcf_fn = [os.path.join(VCF_SRM_PATH, x) for x in os.listdir(VCF_SRM_PATH) if x.endswith(vcf_file_endings)] 
srm_vcf_dict ={}
for fn in srm_vcf_fn:
    key_list = fn.split('/')[-1].split('.')
    key = '%s.%s' % (key_list[0], key_list[-3])
    srm_vcf_dict[key] = fn

In [63]:
#now loop through all vcf SNP calls and report whether a no_filtered_alleles gene overlaps with a SNP when mapping against ph
srm_vcf_ph_mapping_keys = [x for x in srm_vcf_dict.keys() if 'ph_ctg' in x]
for key in srm_vcf_ph_mapping_keys:
    tmp_vcf_bed = pybedtools.BedTool(srm_vcf_dict[key])
    #now get the intersect between vcf files and no_filtered_alleles
    tmp_gene_ids = []
    tmp_df = no_filtered_alleles_gff_ph_bed.intersect(tmp_vcf_bed, wo=True).to_dataframe()
    tmp_df['protein_id'] = tmp_df[8].apply(col_8_id)
    #get the number of SNPs per coding sequence
    tmp_exons_SNP = tmp_df[tmp_df[2] == 'CDS'].groupby('protein_id')[19].sum()
    tmp_SNP_dict = dict(zip(tmp_exons_SNP.index, tmp_exons_SNP))
    #add the SNP boolean values to the df
    column_name = key + '_SNP'
    no_filtered_allele_summary_df[column_name] = no_filtered_allele_summary_df.Query.isin(tmp_exons_SNP.index)
    #add the number of SNPs to the df
    column_name = column_name +'_#'
    no_filtered_allele_summary_df[column_name] = no_filtered_allele_summary_df.apply(lambda row: number_of_SNPs(row['Query'], tmp_SNP_dict), axis =1)
    #add the % SNPs per bp in CDS to dataframe
    no_filtered_allele_summary_df[column_name.replace('_#', '_%')] = round(no_filtered_allele_summary_df[column_name]*100/(no_filtered_allele_summary_df.QLgth*3+3), 3)

['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 20 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
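
The SNP columns boil down to a lookup and a percentage. A minimal sketch: `number_of_snps` is a guess at what the notebook's `number_of_SNPs` helper does (it is defined outside this section), and `snp_percent` assumes that the CDS length is the protein length times three plus one stop codon, as in the formula above.

```python
def number_of_snps(query, snp_dict):
    """Number of SNPs recorded for a protein ID, 0 if the ID has none."""
    return snp_dict.get(query, 0)


def snp_percent(n_snps, protein_length):
    """SNPs per 100 bp of coding sequence, rounded to three decimals."""
    cds_length = protein_length * 3 + 3  # codons plus the stop codon
    return round(n_snps * 100 / cds_length, 3)


demo_snps = {'gene_1': 3}
print(snp_percent(number_of_snps('gene_1', demo_snps), 99))
print(snp_percent(number_of_snps('gene_2', demo_snps), 99))
```

Note that the cell above sums the overlap column of the `-wo` output, so the count equals the number of variants only if every variant covers a single base.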


In [64]:
#now pull together the *.h_by_p_cov.gff and *.p_by_h_cov.gff files from the allele analysis and save them as 
#single gff for the whole genome and split out for each haploid genome
h_by_p_cov_files = glob.glob(os.path.join(BLAST_RESULT_PATH, '*php_8kbp/*.h_by_p_cov.gff' ))
p_by_h_cov_files = glob.glob(os.path.join(BLAST_RESULT_PATH, '*php_8kbp/*.p_by_h_cov.gff' ))
p_on_h_cov_files = h_by_p_cov_files + p_by_h_cov_files
out_name = os.path.join(BLAST_RESULT_PATH, p_genome.replace('p_ctg','ph_ctg') +'p_on_h_cov.gff')
with open(out_name, 'wb') as outfile:
    for f in p_on_h_cov_files:
        with open(f, 'rb') as infile:
            outfile.write(infile.read())
tmp_df = pd.read_csv(out_name,sep='\t', header =None)
#for now a fix here for swapping column 3 and 4 of GFF if 4 < 3
#tmp_df['comp'] = tmp_df[3] - tmp_df[4]
#tmp_swap_index  = tmp_df['comp'] > 0
#tmp_df.loc[tmp_swap_index, 3] , tmp_df.loc[tmp_swap_index, 4] = tmp_df[4], tmp_df[3]
#tmp_df.sort_values(by=[0,3], inplace=True)
#and if 3 == 0
#is_null_index  = tmp_df[3] == 0
#tmp_df.loc[is_null_index, 3] = 1
#tmp_df = tmp_df.iloc[:, 0:9]
tmp_df[tmp_df[1].str.contains('pcontig')].to_csv(os.path.join(BLAST_RESULT_PATH, h_genome +'.h_by_p_cov.gff'), sep='\t', header=None, index=None)
tmp_df[tmp_df[1].str.contains('hcontig')].to_csv(os.path.join(BLAST_RESULT_PATH, p_genome +'.p_by_h_cov.gff'), sep='\t', header=None, index=None)
tmp_df.to_csv(out_name, sep='\t', header=None, index=None )

In [65]:
#now add an overlap column to the summary dataframe
haplotig_mapping_bed = pybedtools.BedTool(out_name)
gene_ids_haplotig_mapping = []
for x in no_filtered_alleles_gff_p_bed.intersect(haplotig_mapping_bed):
    y = col_8_id(x[8])
    gene_ids_haplotig_mapping.append(y)
gene_ids_haplotig_mapping = set(gene_ids_haplotig_mapping)
no_filtered_allele_summary_df['overlap_p_on_h_mapping'] = \
    no_filtered_allele_summary_df.Query.isin(gene_ids_haplotig_mapping)

In [72]:
#write out the summary dataframe
out_fn = os.path.join(OUT_PATH, p_genome.replace('_p_ctg', '_ph_ctg.no_alleles_QC.Qcov%s.PctID%s.df'%(Qcov_cut_off,PctID_cut_off)))
no_filtered_allele_summary_df.to_csv(out_fn, sep='\t', index = None)

In [67]:
#now write out the gene lists for a couple of filter combinations
#no blastp, no exonerate, no homozygous region, no SNPs
df = no_filtered_allele_summary_df
out_filter = (df.Target == 'False') & (df.exn_asso_contig != True) & (df.exn_no_asso_contig != True) \
    & (df.ph_p_homo_region != True) & (df['Pst_E104_v1_ph_ctg.freebayes_SNP'] != True)
out_name = os.path.join(OUT_PATH, p_genome.replace('p_ctg', 'ph_ctg')+'.no_alleles_postQC.txt')
np.savetxt(out_name, df[out_filter]["Query"].unique(), fmt='%s')
len(df[out_filter]["Query"].unique())

894

In [68]:
#no exonerate, homozygous region, SNPs
out_filter = (df.exn_asso_contig != True) & (df.exn_no_asso_contig != True) \
    & (df.ph_p_homo_region == True) & (df['Pst_E104_v1_ph_ctg.freebayes_SNP'] == True) #& (df.overlap_p_on_h_mapping == True)
out_name = os.path.join(OUT_PATH, p_genome.replace('p_ctg', 'ph_ctg')+'.no_alleles_unphased.txt')
np.savetxt(out_name, df[out_filter]["Query"].unique(), fmt='%s')
len(df[out_filter]["Query"].unique())

417

In [69]:
#homozygous region and SNPs, irrespective of the exonerate result (the overlap_p_on_h_mapping filter is currently commented out)
out_filter =  (df.ph_p_homo_region == True) & (df['Pst_E104_v1_ph_ctg.freebayes_SNP'] == True) #& (df.overlap_p_on_h_mapping == True)
#out_name = os.path.join(OUT_PATH, p_genome.replace('p_ctg', 'ph_ctg')+'.no_alleles_unphased.txt')
#np.savetxt(out_name, df[out_filter]["Query"].unique(), fmt='%s')
len(df[out_filter]["Query"].unique())

442
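
The first two filter combinations above can be written as one classification per row, which makes it easier to see that they are mutually exclusive. A minimal sketch on plain dictionaries; `classify` and the column name `snp` are hypothetical, and the category names only follow the output file names used above.

```python
def classify(row, snp_column='snp'):
    """Assign a summary row to 'postQC', 'unphased' or 'other'."""
    no_exonerate = not row['exn_asso_contig'] and not row['exn_no_asso_contig']
    in_homo = bool(row['ph_p_homo_region'])
    has_snp = bool(row[snp_column])
    # no blastp hit, no exonerate hit, no homozygous region, no SNPs
    if row['Target'] == 'False' and no_exonerate and not in_homo and not has_snp:
        return 'postQC'
    # no exonerate hit, but in a homozygous region and with SNPs
    if no_exonerate and in_homo and has_snp:
        return 'unphased'
    return 'other'


demo_row = {'Target': 'False', 'exn_asso_contig': False, 'exn_no_asso_contig': False,
            'ph_p_homo_region': True, 'snp': True}
print(classify(demo_row))
```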

In [70]:
if clean_up == True:
    shutil.rmtree(OUT_PATH_tmp)