## Defining Alleles v02

This notebook performs allelic comparison between primary contigs and haplotigs. It was used to generate the DataFrames for the analysis and evolved along with it, so it may be more complicated than it needs to be, but it works.

This notebook does NOT use proteinortho to identify allelic pairs.

Based on original code by Benjamin Schwessinger.

This notebook defines alleles based on a Falcon-Unzip assembly, where relational information between primary contigs and their haplotigs is available.

#### The input is as follows:
* Assemblytics alignment of all haplotigs onto their corresponding primary contigs. The *.oriented_coords.csv will be converted to a gff file of haplotig alignments onto their respective primary contigs
* gff file of all features. This will be filtered down to genes only and separated out for each contig
* protein and gene fa files for reciprocal blast of haplotig sequences onto primary sequences

#### What happens:
* for each primary gene/protein the overlapping haplotig is pulled out and added to the blast dataframe
* this blast dataframe of primary proteins onto haplotig proteins is then filtered to provide the files listed below
* best blast hits are selected first by minimum e-value, then by maximum BitScore
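
The best-hit rule above (lowest e-value, then highest BitScore) can be sketched on a toy table. This is a minimal illustration, not the notebook's own filtering code: the data are invented, the column names follow the `blast_header` used further down, and keeping only the first row per query is an assumption that silently discards equally good hits.

```python
import pandas as pd

# Toy blast table (invented values); column names follow the notebook's blast_header.
hits = pd.DataFrame({
    'Query':    ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Target':   ['h1', 'h2', 'h3', 'h4', 'h5'],
    'e-value':  [1e-50, 1e-50, 1e-10, 1e-5, 1e-20],
    'BitScore': [180.0, 200.0, 90.0, 40.0, 75.0],
})

# Rank hits within each query: lowest e-value first, ties broken by highest BitScore.
ranked = hits.sort_values(by=['Query', 'e-value', 'BitScore'],
                          ascending=[True, True, False])

# Assumption: keep exactly one row per query (the top-ranked one).
best_hits = ranked.drop_duplicates(subset='Query', keep='first').reset_index(drop=True)
print(best_hits[['Query', 'Target']])
```

If equally good hits need to be retained (for example to look for duplications), the `drop_duplicates` step would have to be replaced by a comparison against the per-query best values.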

#### What comes out in the alleles folder:
##### for primary genes/proteins
* a set of files that contain two proteins IDs identifying the two alleles.  
* *_p_ctg.h_contig_overlap.alleles contains the best blast hits of each primary protein on proteins of the overlapping haplotig (truly linked alleles).
* *_p_ctg.no_respective_h_contig_overlap.alleles contains all primary proteins that have no blast hit on the overlapping haplotig but do have one on another haplotig associated with their primary contig.
* *_p_ctg.no_specific_h_contig_overlap.alleles contains all primary proteins that have no blast hit on an associated haplotig but do have one on another (non-associated) haplotig.
* *_p_ctg.no_alleles contains all primary proteins without a blast hit on any haplotig.
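
The four output categories for primary proteins can be read as a simple decision cascade. The function below is a hypothetical sketch of that logic, not code from this notebook: `classify_primary_hit` and its arguments are illustrative names, and association between contigs is inferred purely from the `pcontig_000` / `hcontig_000_0xx` naming scheme.

```python
def classify_primary_hit(q_contig, t_contig, overlapping_haplotigs):
    """Return the output category for the best hit of one primary protein.

    q_contig: primary contig of the query, e.g. 'pcontig_000'
    t_contig: haplotig carrying the best hit, or False/None if there is no hit
    overlapping_haplotigs: haplotigs aligned over the query gene
    """
    if not t_contig:
        return 'no_alleles'
    if t_contig in overlapping_haplotigs:
        return 'h_contig_overlap'
    # Assumption: hcontig_000_003 belongs to pcontig_000 when the numeric parts match.
    if t_contig.split('_')[1] == q_contig.split('_')[1]:
        return 'no_respective_h_contig_overlap'
    return 'no_specific_h_contig_overlap'


overlap = ['hcontig_000_003']
for target in ['hcontig_000_003', 'hcontig_000_007', 'hcontig_012_001', False]:
    print(target, '->', classify_primary_hit('pcontig_000', target, overlap))
```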

##### for haplotig genes/proteins
* *_h_ctg.no.alleles contains all haplotig proteins that are not associated with a primary allele in the above analysis.
* *_h_ctg.best_p_hits.no_alleles contains all haplotig proteins that are not associated with a primary allele in the above analysis, but have a blast hit (h on p). Only the best blast hit for each haplotig protein is recorded in this file. These might be duplications in the haplotigs.
* *_h_ctg.no_p_hits.alleles contains all haplotig proteins that have no blast hit.

##### filtering of blast output
* blast output can be filtered on the query coverage of the blast hit [Qcov] and the percentage identity of the alignment [PctID]. This generates additional outfiles named *.QcovXX.PctIDYY.alleles. The default values are set to Qcov80 and PctID70.
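
As a rough illustration of this filter, the sketch below applies both thresholds to a toy table and builds the matching file suffix. `filter_hits` is a hypothetical helper, the data are invented, and the use of `>=` (rather than `>`) for the cut-offs is an assumption; `QCov` is computed as in the blast-parsing cell further down (alignment length over query length).

```python
import pandas as pd

def filter_hits(df, qcov_cut_off=80, pctid_cut_off=70):
    """Keep hits meeting both thresholds and return them with an outfile suffix."""
    kept = df[(df['QCov'] >= qcov_cut_off) & (df['PctID'] >= pctid_cut_off)]
    suffix = 'Qcov%s.PctID%s.alleles' % (qcov_cut_off, pctid_cut_off)
    return kept.reset_index(drop=True), suffix

# Toy hits (invented values).
hits = pd.DataFrame({
    'Query':   ['p1', 'p2', 'p3'],
    'Target':  ['h1', 'h2', 'h3'],
    'PctID':   [95.0, 65.0, 88.0],
    'AlnLgth': [90, 100, 30],
    'QLgth':   [100, 100, 100],
})
hits['QCov'] = hits['AlnLgth'] / hits['QLgth'] * 100

kept, suffix = filter_hits(hits)
print(kept[['Query', 'Target']])
print(suffix)
```

Here only `p1` passes: `p2` fails on identity and `p3` on query coverage.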

#### What else to consider:
* the script was tested for outputting the right number of protein sequences for primary and haplotig sequences.
* this is a script and not a program. No warranties.
* Feedback always welcome.
* Paths and other variables are defined at the top of the script.
* It assumes that primary contig sequences are labeled as e.g. pcontig_000 and associated haplotigs as hcontig_000_0xx.
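
The naming assumption in the last point is what lets a haplotig be tied back to its primary contig by name alone. A minimal sketch of that lookup follows; `primary_of` and `is_associated` are hypothetical helpers, not functions defined in this notebook.

```python
import re

def primary_of(contig_id):
    """Return the primary contig a pcontig/hcontig ID belongs to, or None if it does not match."""
    match = re.match(r'^[ph]contig_(\d+)', contig_id)
    if match is None:
        return None
    return 'pcontig_' + match.group(1)

def is_associated(p_contig, h_contig):
    """True if h_contig is a haplotig of p_contig under the naming scheme."""
    return h_contig.startswith('hcontig_') and primary_of(h_contig) == p_contig

print(primary_of('hcontig_000_012'))
print(is_associated('pcontig_000', 'hcontig_000_012'))
print(is_associated('pcontig_000', 'hcontig_001_003'))
```

Contig names that do not follow this pattern return `None` rather than raising, so they can be reported instead of stopping the run.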

#### Downstream considerations:
* Why do some proteins have no blast hits? Is it simply that their gene models are missing? This has been seen and is being worked on.
* In the case of primary genes without blast hits, do those lie in areas of homozygous coverage? Working on this.
* Are some of the h genes/proteins without alleles but with blast hits duplications that have diverged?
* Do some of the primary proteins/genes without blast hit on haplotigs have a good blast hit on other primary contigs? Is this reciprocal?
* Can we use some of this information to talk about linkage of contigs or does this go too far without fully phased genomes using loooong reads or single nuclei sequencing? Would say yes so far.
* Some primary proteins have two equally good hits. Are those duplications in the haplotig?
* If we perform downstream genetic variation analysis between alleles, do we see any GO, KEGG, effector or other functional domains enriched?


In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import os
import re
from Bio import SeqIO
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil


In [3]:
#Define the PATH

GENOME_VERSION = 'v032'

BASE_OUT_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/%s_no_proteinortho' % GENOME_VERSION
GENOME_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/genome_%s/' % GENOME_VERSION
ASSEMBLYTICS_IN_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/nucmer_assemblytics/%s/Assemblytics/' % GENOME_VERSION

BLAST_DB = os.path.join(BASE_OUT_PATH, 'blast_DB')
OUT_PATH = os.path.join(BASE_OUT_PATH, 'allele_analysis')
OUT_tmp = os.path.join(OUT_PATH, 'tmp')
PROTEINORTHO_OUT_PATH = os.path.join(BASE_OUT_PATH, 'proteinortho')
    
#renamed the allele path to reflect the proteinortho results
ALLELE_PATH = os.path.join(OUT_PATH, 'alleles_proteinortho_graph516')
if not os.path.exists(BASE_OUT_PATH):
    os.mkdir(BASE_OUT_PATH)
if not os.path.isdir(OUT_PATH):
    os.mkdir(OUT_PATH)
if not os.path.isdir(OUT_tmp):
    os.mkdir(OUT_tmp)
if not os.path.exists(ALLELE_PATH):
    os.mkdir(ALLELE_PATH)
if not os.path.exists(BLAST_DB):
    os.mkdir(BLAST_DB)
if not os.path.exists(PROTEINORTHO_OUT_PATH):
    os.mkdir(PROTEINORTHO_OUT_PATH)

#Define your p and h genome and move it into the allele analysis folder
P_GENOME = 'DK_0911_%s_p_ctg' % GENOME_VERSION
H_GENOME = 'DK_0911_%s_h_ctg' % GENOME_VERSION
for x in (x + '.fa' for x in [P_GENOME, H_GENOME]):
    shutil.copy2(GENOME_PATH+'/'+x, OUT_tmp)

In [4]:
#Define ENV parameters for blast hits and threads used in blast analysis
n_threads = 16
e_value = 1e-3
Qcov_cut_off = 0 #this defines the minimum coverage of the Query required for filtering. Will become part of name.
PctID_cut_off = 70 #this defines the minimum PctID across the alignment required for filtering. Will become part of name.

In [5]:
def att_column_p(x, y, z, a, b):
    '''Generate attribute column of h on p p_gff dataframe.
    '''
    string = 'h_contig=%s;query_start=%s;query_stop=%s;query_length=%s;query_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

def att_column_h(x, y, z, a, b):
    '''Generate attribute column of h on p h_gff dataframe.
    '''
    string = 'p_contig=%s;ref_start=%s;ref_stop=%s;ref_length=%s;ref_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

def gene_contig_subset(df, contig, feature):
    '''Input the gff and subset the feature column by the given feature (e.g. gene).
        Return data frame subset by contig and feature, sorted by start position.'''
    #.copy() so the in-place sort below acts on an independent frame, not on a slice of df
    tmp_df = df[(df[0].str.contains(contig)) & (df[2]==feature)].copy()
    tmp_df.sort_values(3,inplace=True)
    tmp_df.reset_index(drop=True, inplace=True)
    return tmp_df

def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x.
    '''
    q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
    if y != 'NaN':
        hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
    else:
        hit_contig = 'NaN'
    return q_contig == hit_contig

def target_on_mapped_haplotig(t_contig, h_contig_overlap):
    '''Simple function that checks if the target contig is in the list of h overlapping contigs'''
    if t_contig == False or h_contig_overlap == False:
        return False
    return t_contig in h_contig_overlap

#### Considerations for blast analysis
Do gene blast and protein blast. Initially go off protein blast results. They should be more informative as they also reflect the coding region and the frame.
> Pull this in as a dataframe and filter it down as well to each contig. Do blasts both ways as well.
> Copy over primary & haplotig .protein.fa and .gene.fa files from the GENOME_PATH, if not already present.
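
One way the two blast directions could be combined is a reciprocal-best-hit check. The sketch below is an illustration on invented data, not a step this notebook is stated to perform; `best_hits` is a hypothetical helper and the column names follow the `blast_header` used later on.

```python
import pandas as pd

def best_hits(df):
    """One row per query: lowest e-value, ties broken by highest BitScore."""
    ranked = df.sort_values(by=['Query', 'e-value', 'BitScore'],
                            ascending=[True, True, False])
    return ranked.drop_duplicates(subset='Query', keep='first')[['Query', 'Target']]

# Toy results for both directions (invented values).
p_on_h = pd.DataFrame({'Query': ['p1', 'p1', 'p2'],
                       'Target': ['h1', 'h2', 'h2'],
                       'e-value': [1e-50, 1e-20, 1e-30],
                       'BitScore': [200.0, 100.0, 150.0]})
h_on_p = pd.DataFrame({'Query': ['h1', 'h2', 'h2'],
                       'Target': ['p1', 'p1', 'p2'],
                       'e-value': [1e-50, 1e-40, 1e-30],
                       'BitScore': [200.0, 170.0, 150.0]})

p_best = best_hits(p_on_h)
# Swap the column names so both tables read (primary, haplotig).
h_best = best_hits(h_on_p).rename(columns={'Query': 'Target', 'Target': 'Query'})

# A pair is reciprocal if it is the best hit in both directions.
reciprocal = p_best.merge(h_best, on=['Query', 'Target']).reset_index(drop=True)
print(reciprocal)
```

In this toy case `p2` hits `h2` best, but `h2` hits `p1` best, so only the `p1`/`h1` pair is reciprocal.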

In [6]:
#generate the blast databases if not already present
os.chdir(BLAST_DB)
# copy over .protein.fa and .gene.fa files from the GENOME_PATH folder
# if not already present, for both primary and haplotigs
shutil.copy2(os.path.join(GENOME_PATH, P_GENOME + '.protein.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, P_GENOME + '.gene.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, H_GENOME + '.protein.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, H_GENOME + '.gene.fa'), BLAST_DB)

blast_dir_content = os.listdir(BLAST_DB)
for x in blast_dir_content:
    if x.endswith('.fa') and ({os.path.isfile(x + e) for e in ['.psq', '.phr', '.pin'] } != {True}\
           and {os.path.isfile(x + e) for e in ['.nin', '.nhr', '.nsq'] } != {True} ):

        make_DB_options = ['-in']
        make_DB_options.append(x)
        make_DB_options.append('-dbtype')
        if 'protein' in x:
            make_DB_options.append('prot')
        else:
            make_DB_options.append('nucl')
        make_DB_command = 'makeblastdb %s' % ' '.join(make_DB_options)
        make_DB_stderr = subprocess.check_output(make_DB_command, shell=True, stderr=subprocess.STDOUT)
        print('%s is done!' % make_DB_command)
print("All databases generated and ready to go!")

All databases generated and ready to go!


In [7]:
blast_dict = {}
blast_dict['gene_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('gene.fa') and 'ph_ctg' not in x]
blast_dict['protein_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('protein.fa') and 'ph_ctg' not in x]
blast_stderr_dict = {}
for key in blast_dict.keys():
    for n, query in enumerate(blast_dict[key]):
        tmp_list = blast_dict[key][:]
        tmp_stderr_list = []
        #this loops through all remaining files and does blast against all those
        del tmp_list[n]
        for db in tmp_list:
            print("\nBlasting %s ..." %db)
            blast_options = ['-query']
            blast_options.append(query)
            blast_options.append('-db')

            blast_options.append(db)
            blast_options.append('-outfmt 6')
            blast_options.append('-evalue')
            blast_options.append(str(e_value))
            blast_options.append('-num_threads')
            blast_options.append(str(n_threads))
            blast_options.append('>')
            if 'gene' in query and 'gene' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastn.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastn %s' % ' '.join(blast_options)
            elif 'protein' in query and 'protein' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastp.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastp %s' % ' '.join(blast_options)
            print("Blast command generated:\n%s" % blast_command)
            if not os.path.exists(out_name):
                print("Executing new blast command...")
                blast_stderr_dict[blast_command] = subprocess.check_output(blast_command, shell=True, stderr=subprocess.STDOUT)
                print("New blast executed!")
            else:
                blast_stderr_dict[blast_command] = 'Previously done already!'
                print("Blast command previously completed!")


Blasting DK_0911_v032_h_ctg.protein.fa ...
Blast command generated:
blastp -query DK_0911_v032_p_ctg.protein.fa -db DK_0911_v032_h_ctg.protein.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/v032_no_proteinortho/allele_analysis/DK_0911_v032_p_ctg.DK_0911_v032_h_ctg.0.001.blastp.outfmt6
Blast command previously completed!

Blasting DK_0911_v032_p_ctg.protein.fa ...
Blast command generated:
blastp -query DK_0911_v032_h_ctg.protein.fa -db DK_0911_v032_p_ctg.protein.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/v032_no_proteinortho/allele_analysis/DK_0911_v032_h_ctg.DK_0911_v032_p_ctg.0.001.blastp.outfmt6
Blast command previously completed!

Blasting DK_0911_v032_p_ctg.gene.fa ...
Blast command generated:
blastn -query DK_0911_v032_h_ctg.gene.fa -db DK_0911_v032_p_ctg.gene.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Ric

In [8]:
#make a sequence length dict for all genes and proteins for which blast was performed
seq_list = []
length_list =[]
for key in blast_dict.keys():
    for file in blast_dict[key]:
        for seq in SeqIO.parse(open(file), 'fasta'):
            seq_list.append(seq.id)
            length_list.append(len(seq.seq))
length_dict = dict(zip(seq_list, length_list))

In [9]:
#get assemblytics folders
ass_folders = [os.path.join(ASSEMBLYTICS_IN_PATH, x) for x in os.listdir(ASSEMBLYTICS_IN_PATH) if x.endswith('_php_8kbp')]
ass_folders.sort()

In [10]:
def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x. Prints the offending values and exits if either ID cannot be parsed.
    '''
    try:
        q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
        if y != 'NaN':
            hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
        else:
            hit_contig = 'NaN'
        return q_contig == hit_contig
    except AttributeError:
        print("AttributeError\nx: %s\ny: %s" % (x, y))
        sys.exit()
    except TypeError:
        print("x: %s\ny: %s" % (x, y))
        print("TypeError\ntype(x): %s\ntype(y): %s" % (type(x), type(y)))
        sys.exit()


In [11]:
#read in the gff files
p_gff = pd.read_csv(GENOME_PATH+'/'+P_GENOME+'.anno.gff3', sep='\t', header=None)
h_gff = pd.read_csv(GENOME_PATH+'/'+H_GENOME+'.anno.gff3', sep='\t', header=None)

In [12]:
#read in the blast df and add QLgth and QCov columns
blast_out_dict = {}
blast_header = ['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery', 'StopQuery', 'StartTarget',\
              'StopTarget', 'e-value','BitScore']
for key in blast_stderr_dict.keys():
    file_loc = key.split('>')[-1][1:]
    # file_loc = .../output/defining_alleles/allele_analysis/DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    file_name = file_loc.split('/')[-1]
    # DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    tmp_df = pd.read_csv(file_loc, sep='\t', header=None, names=blast_header)
    
    tmp_df['QLgth'] = tmp_df['Query'].apply(lambda x: length_dict[x])
    tmp_df['QCov'] = tmp_df['AlnLgth']/tmp_df['QLgth']*100
    
    tmp_df.sort_values(by=['Query', 'e-value','BitScore', ],ascending=[True, True, False], inplace=True)
    
    # assert(len(tmp_df.loc[tmp_df['Query'] == 'pcontig_000:10023-10824']) == 0)
    
    #now make sure to add proteins/genes without blast hit to the DataFrame
    #check if gene blast or protein blast and pull out respective query file
    if file_loc.split('.')[-2] == 'blastp':
        tmp_blast_seq = [x for x in blast_dict['protein_fa'] if x.startswith(file_name.split('.')[0])][0]
    elif file_loc.split('.')[-2] == 'blastn':
        tmp_blast_seq = [x for x in blast_dict['gene_fa'] if x.startswith(file_name.split('.')[0])][0]
    tmp_all_blast_seq = []
    #make list of all query sequences
    for seq in SeqIO.parse(os.path.join(BLAST_DB, tmp_blast_seq), 'fasta'):
        tmp_all_blast_seq.append(str(seq.id))
    tmp_all_queries_w_hit = tmp_df["Query"].unique()
    tmp_queries_no_hit = set(tmp_all_blast_seq) - set(tmp_all_queries_w_hit)
    no_hit_list = []
    
    #loop over the queries with no hit and make a list of lists out of them, the first element being the query id
    for x in tmp_queries_no_hit:
        NA_list = ['NaN'] * len(tmp_df.columns)
        NA_list[0] = x
        no_hit_list.append(NA_list)
    
    tmp_no_hit_df = pd.DataFrame(no_hit_list)
    tmp_no_hit_df.columns = tmp_df.columns
    tmp_no_hit_df['QLgth'] = tmp_no_hit_df.Query.apply(lambda x: length_dict[x])
    #map stuff at the tmp level
    
    #pd.concat instead of DataFrame.append, which was removed in pandas 2.0
    tmp_df = pd.concat([tmp_df, tmp_no_hit_df])
    assert(len(tmp_df.loc[tmp_df['Query'] == 'NaN']) == 0)
    
    
    tmp_df['q_contig'] = tmp_df['Query'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    tmp_df['t_contig'] = tmp_df['Target'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    # print(tmp_df['q_contig'])
    
    #fix that if you don't extract anything return False and not 'nan'
    tmp_df['t_contig'].fillna(False, inplace=True)
    
    # print(tmp_df.apply(lambda row: print(row['Query'])), axis = 1)
    
    tmp_df['q_contig == t_contig'] = tmp_df.apply(lambda row: same_contig_blast(row['Query'], row['Target']), axis = 1)
    tmp_df.reset_index(inplace=True, drop=True)
    blast_out_dict[file_name] = tmp_df.iloc[:,:]

In [13]:
for folder in ass_folders:
    '''
    From oriented_coords.csv files generate gff files with the following set up.
    h_contig overlap
    'query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h'
    p_contig overlap
    'ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p'.
    Save those file in a new folder for further downstream analysis. The same folder should include the contig and gene filtered
    gffs. 
    '''
    orient_coords_file = [os.path.join(folder, x) for x in os.listdir(folder) if x.endswith('oriented_coords.csv')][0]
    #load in df and generate several additions columns
    tmp_df = pd.read_csv(orient_coords_file, sep=',')
    #check if there is any reasonable alignment; if not, go to the next one
    if len(tmp_df) == 0:
        print('Check on %s assemblytics' % (folder))
        continue
    tmp_df['p_feature'] = "haplotig"
    tmp_df['h_feature'] ='primary_contig'
    tmp_df['strand'] = "+"
    tmp_df['frame'] = 0
    tmp_df['query_aln_len'] = abs(tmp_df['query_end']-tmp_df['query_start'])
    tmp_df['alignment_length'] = abs(tmp_df['ref_end'] - tmp_df['ref_start'])
    tmp_df.reset_index(drop=True, inplace=True)
    tmp_df.sort_values('query', inplace=True)
    tmp_df['attribute_p'] = tmp_df.apply(lambda row: att_column_p(row['query'],row['query_start'], row['query_end'],\
                                                         row['query_length'], row['query_aln_len']), axis=1)
    
    tmp_df['attribute_h'] = tmp_df.apply(lambda row: att_column_h(row['ref'], row['ref_start'], row['ref_end'],\
                                                                  row['ref_length'], row['alignment_length']), axis=1)
    
    #generate tmp gff dataframe (.copy() so the column assignments below act on an independent frame)
    tmp_p_gff_df = tmp_df.loc[:, ['ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p']].copy()
    #swap start and end where start > end, and set a start of 0 to 1 (gff is 1-based). This was mostly (only?) the case for haplotigs
    tmp_p_gff_df['comp'] = tmp_p_gff_df['ref_start'] - tmp_p_gff_df['ref_end']
    tmp_swap_index  = tmp_p_gff_df['comp'] > 0
    #assign both columns at once from a copied array so neither read sees an already overwritten value
    tmp_p_gff_df.loc[tmp_swap_index, ['ref_start', 'ref_end']] = tmp_p_gff_df.loc[tmp_swap_index, ['ref_end', 'ref_start']].values
    #and if 'ref_start' == 0
    is_null_index  = tmp_p_gff_df['ref_start'] == 0
    tmp_p_gff_df.loc[is_null_index, 'ref_start'] = 1
    tmp_p_gff_df = tmp_p_gff_df.iloc[:, 0:9]
    tmp_p_gff_df.sort_values('ref_start',inplace=True)
    tmp_h_gff_df = tmp_df.loc[:, ['query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h']].copy()
    #same fix for tmp_h_gff
    tmp_h_gff_df['comp'] = tmp_h_gff_df['query_start'] - tmp_h_gff_df['query_end']
    tmp_swap_index  = tmp_h_gff_df['comp'] > 0
    tmp_h_gff_df.loc[tmp_swap_index, ['query_start', 'query_end']] = tmp_h_gff_df.loc[tmp_swap_index, ['query_end', 'query_start']].values
    #and if 'query_start' == 0
    is_null_index  = tmp_h_gff_df['query_start'] == 0
    tmp_h_gff_df.loc[is_null_index, 'query_start'] = 1
    tmp_h_gff_df = tmp_h_gff_df.iloc[:, 0:9]
    tmp_h_gff_df.sort_values('query_start', inplace=True)
    #make the outfolder and save gff files of overlap and filtered gene files
    folder_suffix = folder.split('/')[-1]
    #get the contig
    contig = re.search(r'p([a-z]*_[0-9]*)_', folder_suffix).group(1)
    out_folder = os.path.join(OUT_PATH, folder_suffix)
    if not os.path.exists(out_folder):
        os.mkdir(out_folder)
    out_name_p = os.path.join(out_folder, folder_suffix )
    tmp_p_gff_df.to_csv(out_name_p+'.p_by_h_cov.gff', sep='\t' ,header=None, index =None)
    out_name_h =  os.path.join(out_folder, folder_suffix)
    tmp_h_gff_df.to_csv(out_name_h+'.h_by_p_cov.gff' , sep='\t' ,header=None, index =None)
    #write those out to new folder for php together with the p and h gff of genes
    p_gene_gff = gene_contig_subset(p_gff, contig, 'gene')
    p_gene_gff.to_csv(out_name_p+'.p_gene.gff', sep='\t' ,header=None, index =None)
    h_gene_gff = gene_contig_subset(h_gff, contig, 'gene')
    h_gene_gff.to_csv(out_name_h+'.h_gene.gff', sep='\t' ,header=None, index =None)
    
    #next would be to do overlap between the gene gff and the corresponding alignment gff. Write this out. Dict for each gene and its corresponding h_contig/p_contig
    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Check on /home/gamran/genome_analysis/Warrior/Richard/output/nucmer_assemblytics/v032/Assemblytics/DK_0911_v032_pcontig_075_php_8kbp assemblytics


In [14]:
#now loop over the outfolders
out_folders = [os.path.join(OUT_PATH, x) for x in os.listdir(OUT_PATH) if x.endswith('_php_8kbp')]
out_folders.sort()

In [15]:
p_gene_h_contig_overlap_dict = {}
for out_folder in out_folders:
    tmp_folder_content = os.listdir(out_folder)
    tmp_p_gene_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_gene.gff') ][0]
    tmp_p_gene_bed = pybedtools.BedTool(tmp_p_gene_gff).remove_invalid().saveas()
    tmp_p_by_h_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_by_h_cov.gff') ][0]
    tmp_p_by_h_bed = pybedtools.BedTool(tmp_p_by_h_gff).remove_invalid().saveas()
    #generate a bed intersect
    p_gene_and_p_by_h = tmp_p_gene_bed.intersect(tmp_p_by_h_bed, wb=True)
    #check the length of the obtained intersect is 
    print("This is the length of the intersect %i" %len(p_gene_and_p_by_h ))
    if len(p_gene_and_p_by_h) == 0:
        continue
    #turn this into a dataframe 
    p_gene_and_p_by_h_df = p_gene_and_p_by_h.to_dataframe()
    p_gene_and_p_by_h_df['p_gene'] = p_gene_and_p_by_h_df[8].str.extract(r'ID=([^;]*);', expand=False)
    p_gene_and_p_by_h_df['p_protein'] = p_gene_and_p_by_h_df['p_gene'].str.replace('TU', 'model')
    #save this in the same folder as dataframe
    p_gene_and_p_by_h_df.to_csv(tmp_p_gene_gff.replace('p_gene.gff', 'gene_haplotig_intersect.df'), sep='\t', index=None)
    p_gene_and_p_by_h_df_2 = p_gene_and_p_by_h_df.loc[:, [10, 'p_protein']]
    #make a dict for the overlap of each gene with it's correspoding haplotig
    tmp_p_gene_h_overlap_df = p_gene_and_p_by_h_df_2.groupby('p_protein')[10].apply(list)
    tmp_p_gene_h_overlap_dict = dict(zip(tmp_p_gene_h_overlap_df.index, tmp_p_gene_h_overlap_df.values))
    p_gene_h_contig_overlap_dict.update(tmp_p_gene_h_overlap_dict)
    

This is the length of the intersect 786


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 472
This is the length of the intersect 552


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 500
This is the length of the intersect 498


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 457
This is the length of the intersect 358


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 334
This is the length of the intersect 417


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 302
This is the length of the intersect 365


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 344
This is the length of the intersect 402


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 95
This is the length of the intersect 336


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 291
This is the length of the intersect 302


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 241
This is the length of the intersect 241


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 275
This is the length of the intersect 246
This is the length of the intersect 135


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 250
This is the length of the intersect 205


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 0
This is the length of the intersect 207
This is the length of the intersect 89


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 107
This is the length of the intersect 173
This is the length of the intersect 219


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 115
This is the length of the intersect 224


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 196
This is the length of the intersect 213


This is the length of the intersect 165
This is the length of the intersect 185


This is the length of the intersect 148
This is the length of the intersect 89


This is the length of the intersect 144
This is the length of the intersect 154


This is the length of the intersect 115
This is the length of the intersect 114
This is the length of the intersect 6


This is the length of the intersect 131
This is the length of the intersect 74
This is the length of the intersect 71


This is the length of the intersect 81
This is the length of the intersect 73
This is the length of the intersect 77


This is the length of the intersect 104
This is the length of the intersect 43
This is the length of the intersect 72


This is the length of the intersect 92
This is the length of the intersect 82


This is the length of the intersect 108
This is the length of the intersect 84
This is the length of the intersect 79


This is the length of the intersect 0
This is the length of the intersect 3
This is the length of the intersect 52
This is the length of the intersect 42


This is the length of the intersect 35
This is the length of the intersect 40
This is the length of the intersect 13
This is the length of the intersect 64


This is the length of the intersect 26
This is the length of the intersect 8
This is the length of the intersect 21
This is the length of the intersect 17
This is the length of the intersect 21
This is the length of the intersect 23


This is the length of the intersect 31
This is the length of the intersect 10
This is the length of the intersect 14


This is the length of the intersect 20
This is the length of the intersect 7
This is the length of the intersect 7
This is the length of the intersect 8


This is the length of the intersect 12
This is the length of the intersect 8
This is the length of the intersect 7


This is the length of the intersect 25
This is the length of the intersect 4
This is the length of the intersect 0
This is the length of the intersect 1


In [16]:
#get the p_protein-on-h_protein blast output dataframe and append the overlapping h_contig for each protein
_key = [x for x in blast_out_dict.keys() if x.startswith(P_GENOME) and x.split('.')[-2] == 'blastp' and x.endswith('outfmt6')][0]
p_gene_h_contig_overlap_df = pd.DataFrame([p_gene_h_contig_overlap_dict.keys(), p_gene_h_contig_overlap_dict.values()],  index = ['p_protein', 'h_contig_overlap']).T
_tmp_df = blast_out_dict[_key]
_tmp_df = _tmp_df.merge(p_gene_h_contig_overlap_df, how='outer' ,left_on='Query', right_on='p_protein')
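# assign the result back; an in-place fillna on a selected column may not modify the dataframe in recent pandas versions
_tmp_df['h_contig_overlap'] = _tmp_df['h_contig_overlap'].fillna(False)
#check if the protein hit resides on the overlapping haplotig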
_tmp_df['t_contig == h_contig_overlap'] = _tmp_df.apply(lambda row: target_on_mapped_haplotig(row['t_contig'], row['h_contig_overlap']), axis =1 )
#blast_out_dict[_key] = _tmp_df
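
The merge-and-flag step in the cell above can be illustrated on toy data. This is only a sketch under stated assumptions: all protein and contig IDs are made up, and `on_mapped_haplotig` is a hypothetical stand-in for the notebook's `target_on_mapped_haplotig`, whose definition is not shown in this section.

```python
import pandas as pd

# Toy BLAST hits (primary protein -> haplotig protein); all IDs are invented.
blast_df = pd.DataFrame({
    'Query':    ['p1', 'p1', 'p2'],
    'Target':   ['h1', 'h2', 'h3'],
    't_contig': ['hcontig_000_001', 'hcontig_000_002', 'hcontig_001_001'],
})

# Toy mapping of primary proteins to the haplotig that overlaps their gene.
# 'p3' has an overlapping haplotig but no BLAST hit at all.
overlap_dict = {'p1': 'hcontig_000_001', 'p3': 'hcontig_002_001'}
overlap_df = pd.DataFrame(list(overlap_dict.items()),
                          columns=['p_protein', 'h_contig_overlap'])

# The outer merge keeps hits without overlap info and overlaps without hits.
merged = blast_df.merge(overlap_df, how='outer',
                        left_on='Query', right_on='p_protein')

# Use False as the "no overlapping haplotig" marker, as the notebook does.
merged['h_contig_overlap'] = (
    merged['h_contig_overlap'].astype(object).fillna(False))


def on_mapped_haplotig(t_contig, h_contig_overlap):
    """Hypothetical helper: True if the hit lies on the overlapping haplotig."""
    if not isinstance(t_contig, str) or not isinstance(h_contig_overlap, str):
        return False  # missing hit or missing overlap
    return t_contig == h_contig_overlap


merged['t_contig == h_contig_overlap'] = merged.apply(
    lambda row: on_mapped_haplotig(row['t_contig'], row['h_contig_overlap']),
    axis=1)
```

In this toy example only the `p1` to `h1` hit is flagged, because `h1` sits on the haplotig that overlaps the `p1` gene.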

In [17]:
# p_gene_h_contig_overlap_df['p_protein'] = p_gene_h_contig_overlap_df['p_protein'].apply(lambda x: x.replace('model', 'TU'))

In [18]:
# write the dataframe again
_tmp_df.to_csv(os.path.join(OUT_PATH, _key+'.allele_analysis'), index=None, sep='\t')

In [19]:
#now mangle the data a bit
#write out t_contig == h_contig_overlap best hits

In [20]:
#Qcov_cut_off = ''
#PctID_cut_off = ''
if Qcov_cut_off and PctID_cut_off:
    pass
else: 
    print('No cut off defined! Do you want to define it. Please go ahead above.')

No cut off defined! Do you want to define it. Please go ahead above.


In [21]:
# Finding:
# • (htg & pCtg on same contig and overlap)

_tmp_grouped = _tmp_df[(_tmp_df['t_contig == h_contig_overlap'] == True)].groupby('Query')
# .groupby('Query') creates a GroupBy object that can be thought of as a set of mini-DataFrames,
# each with the same Query. These groups are then reduced according to the
# minimum e-value and maximum BitScore of each group (i.e. retaining the
# best hit(s) for each Query; ties are kept).

# filter on the e-value, then on the BitScore
best_hits_t_contig_and_h_contig = _tmp_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())])
best_hits_t_contig_and_h_contig = best_hits_t_contig_and_h_contig.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])

out_name = P_GENOME + '.h_contig_overlap.alleles'

best_hits_t_contig_and_h_contig.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = P_GENOME + '.h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.
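
The two-step best-hit selection (minimum e-value first, then maximum BitScore) can be sketched on a toy table. The helper name `best_hits` and the data are assumptions for illustration. The sketch uses `groupby(...).transform` rather than the `apply` calls in the cell above; it should select the same rows without adding `Query` as an index level.

```python
import pandas as pd

# Toy BLAST table; IDs and scores are invented.
hits = pd.DataFrame({
    'Query':    ['q1', 'q1', 'q1', 'q2', 'q2'],
    'Target':   ['t1', 't2', 't3', 't4', 't5'],
    'e-value':  [1e-50, 1e-50, 1e-10, 0.0, 0.0],
    'BitScore': [200.0, 180.0, 90.0, 500.0, 500.0],
})


def best_hits(df):
    """Keep, per Query, the rows with minimal e-value, then maximal BitScore."""
    min_e = df.groupby('Query')['e-value'].transform('min')
    df = df[df['e-value'] == min_e]
    max_bits = df.groupby('Query')['BitScore'].transform('max')
    return df[df['BitScore'] == max_bits].reset_index(drop=True)


best = best_hits(hits)
print(best[['Query', 'Target']])
```

Here `q1` keeps only `t1` (the e-value tie with `t2` is broken by BitScore), while `q2` keeps both `t4` and `t5` because they tie on both criteria.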


In [22]:
# test if filtering like this works
if set(_tmp_df[(_tmp_df['t_contig == h_contig_overlap'] == True)]['Query'].unique()) == set(best_hits_t_contig_and_h_contig['Query'].unique().tolist()):
    print("Filtering is working; proceed.")
else:
    print('The filter threw an error. Not the expected output')

Filtering is working; proceed.


In [23]:
# drop everything that has already been matched well
# already matched:
# • (htg belongs to pCtg and matching sequence overlaps)

query_ids_to_drop = best_hits_t_contig_and_h_contig['Query'].unique().tolist()
target_ids_to_drop = best_hits_t_contig_and_h_contig['Target'].unique().tolist()

In [24]:
# finding
# • (htg belongs to pCtg but matching sequence does not overlap)
_tmp_grouped = _tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))&\
                       (_tmp_df['q_contig == t_contig'] == True)].groupby('Query')
best_hits_same_contig_no_overlap = _tmp_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())])
best_hits_same_contig_no_overlap = best_hits_same_contig_no_overlap.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])

out_name = P_GENOME + '.no_specific_h_contig_overlap.alleles'
best_hits_same_contig_no_overlap.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = P_GENOME + '.no_specific_h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.


In [25]:
#test if filtering like this works
if set(_tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop)) & (_tmp_df['q_contig == t_contig'] == True)]['Query'].unique())\
        == set(best_hits_same_contig_no_overlap['Query'].unique().tolist()):
    print("All good please proceed!")
else:
    print('The filter threw an error. Not the expected output')

All good please proceed!


In [26]:
# drop everything that has already been matched well
# already matched:
# • (htg belongs to pCtg and matching sequence overlaps)
# • (htg belongs to pCtg but matching sequence does not overlap)

query_ids_to_drop = query_ids_to_drop + best_hits_same_contig_no_overlap['Query'].unique().tolist()
target_ids_to_drop = target_ids_to_drop + best_hits_same_contig_no_overlap['Target'].unique().tolist()

In [27]:
# now get all remaining matches, i.e.
# • (htg does not belong to pCtg)

_tmp_grouped = _tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))&\
                       (_tmp_df['q_contig == t_contig'] == False)\
                       & (_tmp_df['Target'] != 'NaN' )].groupby('Query')
best_hits_different_contig = _tmp_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())])
best_hits_different_contig = best_hits_different_contig.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
out_name = P_GENOME + '.no_respective_h_contig_overlap.alleles'
best_hits_different_contig.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = P_GENOME + '.no_respective_h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.


In [28]:
if set(_tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))& (_tmp_df['q_contig == t_contig'] == False) &\
               (_tmp_df['Target'] != 'NaN' )]['Query'].unique().tolist())\
        == set(best_hits_different_contig['Query'].unique().tolist()):
    print("All good please proceed!")
else:
    print('The filter threw an error. Not the expected output')

All good please proceed!


In [29]:
# drop everything that has already been matched well
# already matched:
# • (htg belongs to pCtg and matching sequence overlaps)
# • (htg belongs to pCtg but matching sequence does not overlap)
# • (htg does not belong to pCtg)

query_ids_to_drop = query_ids_to_drop + best_hits_different_contig['Query'].unique().tolist()
target_ids_to_drop = target_ids_to_drop + best_hits_different_contig['Target'].unique().tolist()
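
The cascade built up over the preceding cells (overlapping haplotig first, then another haplotig of the same primary contig, then any other haplotig) can be summarised in a compact sketch. The tier names, the `claim` helper and the toy data are assumptions for illustration; the best-hit reduction applied in the notebook before each tier is written out is omitted here for brevity.

```python
import pandas as pd

# Toy hit table with the two boolean columns used by the notebook.
hits = pd.DataFrame({
    'Query':  ['p1', 'p1', 'p2', 'p3', 'p4'],
    'Target': ['h1', 'h9', 'h2', 'h3', 'NaN'],  # 'NaN' string marks "no hit"
    't_contig == h_contig_overlap': [True, False, False, False, False],
    'q_contig == t_contig':         [True, False, True, False, False],
})

matched = set()  # queries that already received an allele in an earlier tier
tiers = {}


def claim(name, mask):
    """Assign rows matching mask to a tier, skipping already matched queries."""
    rows = hits[mask & ~hits['Query'].isin(list(matched))]
    tiers[name] = rows
    matched.update(rows['Query'])


# Tier 1: hit lies on the haplotig overlapping the primary gene.
claim('overlap', hits['t_contig == h_contig_overlap'])
# Tier 2: hit lies on a haplotig of the same primary contig, without overlap.
claim('same_contig', hits['q_contig == t_contig'])
# Tier 3: any remaining real hit on a haplotig of a different primary contig.
claim('other_contig', ~hits['q_contig == t_contig'] & (hits['Target'] != 'NaN'))

# Whatever is left has no allele.
no_alleles = sorted(set(hits['Query']) - matched)
```

Because each tier excludes queries claimed earlier, the second hit of `p1` is never considered once `p1` has been assigned in the first tier.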

In [30]:
#now print out the p_proteins without alleles (i.e. no good match)
out_name = P_GENOME + '.no_alleles'
_tmp_df[~_tmp_df['Query'].isin(query_ids_to_drop)].drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#when the QCov/PctID filter is applied, separate lists of query and target IDs to drop are required
if Qcov_cut_off and PctID_cut_off:
    query_ids_to_drop_filtered = \
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID']>= PctID_cut_off)]['Query'].unique().tolist() \
    + best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]['Query'].unique().tolist()\
    + best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]['Query'].unique().tolist()
    
    target_ids_to_drop_filtered = \
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID']>= PctID_cut_off)]['Target'].unique().tolist()\
    + best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]['Target'].unique().tolist() \
    + best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]['Target'].unique().tolist()
    out_name = P_GENOME + '.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    #save the dataframe
    _tmp_df[~_tmp_df['Query'].isin(query_ids_to_drop_filtered)].drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)    
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")                                                                                                                             

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.


In [31]:
#save back the dataframe with '.analyzed' added to the key
blast_out_dict[_key+'.analyzed'] = _tmp_df

In [32]:
#pull out the h on p blast dataframe
_key = [x for x in blast_out_dict.keys() if x.startswith(H_GENOME) and x.split('.')[-2] == 'blastp' and x.endswith('outfmt6')][0]
_tmp_df = blast_out_dict[_key]


In [33]:
#pull out all h_proteins without alleles
_tmp_df_no_alleles = _tmp_df[~_tmp_df['Query'].isin(target_ids_to_drop)]
out_name = H_GENOME + '.no_alleles'
_tmp_df_no_alleles.drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

In [34]:
#now pull out all h_proteins that have a blast hit but were left behind by the
#best blast hit selection above. No QCov and PctID filtering
_tmp_df_no_alleles_grouped = _tmp_df_no_alleles[_tmp_df_no_alleles["Target"] != 'NaN'].groupby('Query')
h_protein_no_alleles = _tmp_df_no_alleles_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())]) 
h_protein_no_alleles = h_protein_no_alleles.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
out_name = H_GENOME + '.best_p_hits.no_alleles'
h_protein_no_alleles.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
#write down the h_proteins with no blast hit at all, meaning blasting h on p gave no results
out_name = H_GENOME + '.no_p_hits.no_alleles'
_tmp_df_no_alleles[_tmp_df_no_alleles["Target"] == 'NaN']['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None) 

In [35]:
#now pull out all h_proteins that have a blast hit but were left behind with selection of best blast hits above. With QCov and PctID filtering.
if Qcov_cut_off and PctID_cut_off:
#pull out all h proteins without alleles when filtering was applied
    _tmp_df_no_alleles_filtered = _tmp_df[~_tmp_df['Query'].isin(target_ids_to_drop_filtered)]
    out_name = H_GENOME + '.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    _tmp_df_no_alleles_filtered.drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

    #pull out all the h proteins with a blast hit when filtering was applied and filter those as well.
    #first get rid of all NAs; in a future version maybe convert QCov and PctID to floats or ints so this is not necessary anymore
    _tmp_df_no_alleles_filtered_no_NA = _tmp_df_no_alleles_filtered[(_tmp_df_no_alleles_filtered["Target"] != 'NaN')]
    _tmp_df_no_alleles_filtered_grouped = \
        _tmp_df_no_alleles_filtered_no_NA[(_tmp_df_no_alleles_filtered_no_NA['QCov'] >= Qcov_cut_off) &\
                                    (_tmp_df_no_alleles_filtered_no_NA['PctID'] >= PctID_cut_off)].groupby('Query') 

    h_protein_no_alleles_filtered = _tmp_df_no_alleles_filtered_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())]) 
    h_protein_no_alleles_filtered = h_protein_no_alleles_filtered.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
    out_name = H_GENOME + '.best_p_hits.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    h_protein_no_alleles_filtered.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
    target_ids_to_drop_filtered = target_ids_to_drop_filtered + h_protein_no_alleles_filtered['Query'].unique().tolist()
    out_name = H_GENOME + '.no_p_hits.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    _tmp_df_no_alleles_filtered[~_tmp_df_no_alleles_filtered['Query'].isin(target_ids_to_drop_filtered)].drop_duplicates(subset='Query')['Query']\
    .to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.
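
The `QCov`/`PctID` condition is repeated in several cells above; a small helper could express it once. This is a sketch, not the notebook's own code: the function name `passes_cut_offs`, the toy values, the `toy_p_ctg` prefix and the assumption that both columns are on a 0 to 100 scale are all illustrative.

```python
import pandas as pd


def passes_cut_offs(df, qcov_cut_off, pctid_cut_off):
    """Return the rows whose QCov and PctID both reach the given cut-offs."""
    return df[(df['QCov'] >= qcov_cut_off) & (df['PctID'] >= pctid_cut_off)]


# Toy best-hit table; values are invented and assumed to be percentages.
best = pd.DataFrame({
    'Query':  ['p1', 'p2', 'p3'],
    'Target': ['h1', 'h2', 'h3'],
    'QCov':   [95.0, 60.0, 99.0],
    'PctID':  [98.0, 99.0, 70.0],
})

Qcov_cut_off, PctID_cut_off = 80, 90
kept = passes_cut_offs(best, Qcov_cut_off, PctID_cut_off)

# Same naming pattern as the filtered output files written above.
out_name = 'toy_p_ctg' + '.h_contig_overlap.Qcov%s.PctID%s.alleles' % (
    Qcov_cut_off, PctID_cut_off)
```

Note that the cells test `if Qcov_cut_off and PctID_cut_off`, so a cut-off of `0` (or an empty string) counts as "not set" and the filtered files are skipped.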


In [36]:
#write the dataframe again
_tmp_df.to_csv(os.path.join(OUT_PATH, _key+'.allele_analysis'), index=None, sep='\t')

In [37]:
_tmp_df.columns

Index(['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery',
       'StopQuery', 'StartTarget', 'StopTarget', 'e-value', 'BitScore',
       'QLgth', 'QCov', 'q_contig', 't_contig', 'q_contig == t_contig'],
      dtype='object')

In [38]:
#might be good to write these analyzed dataframes out as well to make life a bit easier
blast_out_dict['%s.%s.0.001.blastp.outfmt6.analyzed' % (P_GENOME, H_GENOME)].columns

Index(['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery',
       'StopQuery', 'StartTarget', 'StopTarget', 'e-value', 'BitScore',
       'QLgth', 'QCov', 'q_contig', 't_contig', 'q_contig == t_contig',
       'p_protein', 'h_contig_overlap', 't_contig == h_contig_overlap'],
      dtype='object')

In [39]:
htgOnPctgBlast = pd.read_csv(os.path.join(OUT_PATH, '%s.%s.0.001.blastp.outfmt6' % (H_GENOME, P_GENOME)), sep='\t', index_col=None, header=None, names=blast_header)
htgNoAllelesDf = pd.read_csv(os.path.join(ALLELE_PATH, '%s.no_alleles' % H_GENOME), names=['Query'])

htgNoAllelesBlastDf = htgOnPctgBlast[htgOnPctgBlast['Query'].isin(htgNoAllelesDf.Query)]
print('%s haplotig proteins have no allele' % len(htgNoAllelesDf))
print('Of these haplotig proteins, %s have a BLAST hit' % len(htgNoAllelesBlastDf['Query'].unique()))
print('Thus, %s haplotig proteins have both no allele and no BLAST hit' % (len(htgNoAllelesDf['Query']) - len(htgNoAllelesBlastDf['Query'].unique())))

# Look at the haplotig on primary contig blast for haplotig proteins with no allele,
# and take the best hit (in the code below: the hit(s) with the highest PctID). If it is
# tied, then all best hits will be taken. If the best hit(s) is >=90%ID, assign the haplotig.

print('\nFor each haplotig protein with no allele, if it has a BLAST hit (blast h on p) we take the best BLAST hit. If the %ID of ths hit is >=90%, we assign it as an allele.')
htgNoAllelesBlastDf_grouped = htgNoAllelesBlastDf.groupby('Query')
htgNoAllelesBlastDf = htgNoAllelesBlastDf_grouped.apply(lambda g: g[g['PctID'] == g['PctID'].max()])

htgNoAllelesBlastDf = htgNoAllelesBlastDf[htgNoAllelesBlastDf['PctID'] >= 90]
out_name = H_GENOME + '.manual_assigned.alleles'
htgNoAllelesBlastDf.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
print('Using this method, we reassigned %s haplotig proteins as alleles' % len(htgNoAllelesBlastDf))

htgNoAllelesDf = htgNoAllelesDf[~htgNoAllelesDf['Query'].isin(htgNoAllelesBlastDf.Query)]
out_name = H_GENOME + '.no_alleles'
htgNoAllelesDf.to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

2275 haplotig proteins have no allele
Of these haplotig proteins, 2017 have a BLAST hit
Thus, 258 haplotig proteins have both no allele and no BLAST hit

For each haplotig protein with no allele, if it has a BLAST hit (blast h on p) we take the best BLAST hit. If the %ID of ths hit is >=90%, we assign it as an allele.
Using this method, we reassigned 1619 haplotig proteins as alleles
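
The reassignment rule used in the cell above (keep the hit or hits with the highest `PctID` per haplotig protein, then require at least 90 %ID) can be sketched on toy data. The data are invented. The sketch also separates the number of rows from the number of distinct proteins, since ties contribute several rows for one protein.

```python
import pandas as pd

# Toy haplotig-on-primary BLAST hits for proteins without an allele.
blast = pd.DataFrame({
    'Query':  ['h1', 'h1', 'h1', 'h2', 'h3'],
    'Target': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'PctID':  [100.0, 100.0, 95.0, 85.0, 92.5],
})

# Keep the top-PctID hit(s) per Query; ties are all retained.
max_pctid = blast.groupby('Query')['PctID'].transform('max')
best = blast[blast['PctID'] == max_pctid]

# Only hits with at least 90 %ID are accepted as alleles.
assigned = best[best['PctID'] >= 90]

n_rows = len(assigned)                        # allele pairs written out
n_proteins = assigned['Query'].nunique()      # distinct haplotig proteins
print('%s pairs for %s haplotig proteins' % (n_rows, n_proteins))
```

In this example `h1` ties on two targets, so three pairs are written for two proteins; `len()` of the filtered table counts pairs rather than proteins.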


In [40]:
htgNoAllelesBlastDf

Query (index),row (index),Query,Target,PctID,AlnLgth,NumMis,NumGap,StartQuery,StopQuery,StartTarget,StopTarget,e-value,BitScore
evm.model.hcontig_000_001.2,720374,evm.model.hcontig_000_001.2,evm.model.pcontig_002.624,100.00,1416,0,0,1,1416,1,1416,0.000000e+00,2958.0
evm.model.hcontig_000_001.2,720375,evm.model.hcontig_000_001.2,evm.model.pcontig_010.134,100.00,1416,0,0,1,1416,1,1416,0.000000e+00,2958.0
evm.model.hcontig_000_001.2,720422,evm.model.hcontig_000_001.2,evm.model.pcontig_032.198,100.00,354,0,0,1063,1416,1,354,0.000000e+00,737.0
evm.model.hcontig_000_001.8,721038,evm.model.hcontig_000_001.8,evm.model.pcontig_000.1077,100.00,167,0,0,1,167,1,167,6.000000e-120,351.0
evm.model.hcontig_000_003.1,605477,evm.model.hcontig_000_003.1,evm.model.pcontig_021.1,100.00,112,0,0,1,112,183,294,2.000000e-75,236.0
evm.model.hcontig_000_003.1,605478,evm.model.hcontig_000_003.1,evm.model.pcontig_000.534,100.00,112,0,0,1,112,183,294,7.000000e-72,234.0
evm.model.hcontig_000_007.8,588367,evm.model.hcontig_000_007.8,evm.model.pcontig_000.501,95.40,261,12,0,1,261,899,1159,2.000000e-172,509.0
evm.model.hcontig_000_013.2,18588,evm.model.hcontig_000_013.2,evm.model.pcontig_000.916,100.00,327,0,0,92,418,2,328,0.000000e+00,667.0
evm.model.hcontig_000_013.4,18590,evm.model.hcontig_000_013.4,evm.model.pcontig_000.917,100.00,74,0,0,28,101,109,182,7.000000e-49,155.0
evm.model.hcontig_000_014.1,607671,evm.model.hcontig_000_014.1,evm.model.pcontig_000.106,99.79,477,1,0,1,477,1,477,0.000000e+00,979.0
