## Defining Alleles

This notebook performs allelic comparison between primary contigs and haplotigs. It was used to generate the DataFrames for the analysis and evolved along with it, so it may be more complicated than it needs to be, but it works.

Based on original code by Benjamin Schwessinger.

This notebook defines alleles based on a Falcon-Unzip assembly, where relational information between two contigs is available.

#### The input is as follows:
* Assemblytics alignment of all haplotigs onto their corresponding primary contigs. The *.oriented_coords.csv will be converted to a gff file of haplotig alignments onto their respective primary contigs
* gff file of all features. This will be filtered down to genes only and separated out for each contig
* protein and gene fa files for reciprocal blast of haplotig sequences onto primary sequences

#### What happens:
* for each primary gene/protein, the overlapping haplotig is pulled out and added to the blast dataframe
* this blast dataframe of primary proteins onto haplotig proteins is then used for filtering to provide the files listed below
* best blast hits are filtered first on minimum e-value, then on maximum BitScore

* proteinortho was also used to identify alleles. This was run as follows:

/home/gamran/anaconda3/proteinortho_v5.16b/proteinortho5.pl -project=ph_ctg_516 -synteny ph_ctg_516/DK_0911_v04_h_ctg.protein.faa ph_ctg_516/DK_0911_v04_p_ctg.protein.faa

* Use the poff-graph and poff-proteinortho-graph as bases for analysis
* Depending on this run, the poff_header needs to be defined. In the current case it is 'Target' followed by 'Query'
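The best-hit rule listed above (lowest e-value first, then highest BitScore) can be sketched on toy data as follows. This is a minimal illustration, not the code used further down in the notebook: the `best_hits` helper and the example values are assumptions made for this sketch, and only the column names follow the blast header used in this notebook. Like the filtering used later, it keeps ties rather than forcing a single row per query.

```python
import pandas as pd

# Toy BLAST table; the values are invented for illustration only.
hits = pd.DataFrame({
    'Query': ['q1', 'q1', 'q1', 'q2'],
    'Target': ['t1', 't2', 't3', 't4'],
    'e-value': [1e-50, 1e-50, 1e-10, 1e-5],
    'BitScore': [200.0, 250.0, 300.0, 50.0],
})

def best_hits(df):
    """Hypothetical helper: keep, per query, the hits with the lowest
    e-value and, among those, the hits with the highest BitScore."""
    min_e = df.groupby('Query')['e-value'].transform('min')
    df = df[df['e-value'] == min_e]
    max_bits = df.groupby('Query')['BitScore'].transform('max')
    return df[df['BitScore'] == max_bits].reset_index(drop=True)

best = best_hits(hits)
print(best[['Query', 'Target']])
```

Note that `t3` has the highest BitScore for `q1` but is dropped, because the e-value filter is applied first.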

#### The additional input required:
* the .poff output from proteinortho.


#### What comes out in the alleles folder:
* DK_0911_v0x_p_ctg.full_df.alleles: contains all proteinortho hits and the best BLAST hits from the primary onto haplotig BLAST
* DK_0911_v0x_h_ctg.full_df.alleles: contains all the best BLAST hits from the haplotig onto primary contig BLAST
* these files can be filtered based on QCov/%ID in DK_0911_post_allele_analysis

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import os
import re
from Bio import SeqIO
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil


In [3]:
#Define the PATH

GENOME_VERSION = 'v032'

BASE_OUT_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/%s' % GENOME_VERSION
GENOME_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/genome_%s/' % GENOME_VERSION
ASSEMBLYTICS_IN_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/nucmer_assemblytics/%s/Assemblytics/' % GENOME_VERSION

BLAST_DB = os.path.join(BASE_OUT_PATH, 'blast_DB')
OUT_PATH = os.path.join(BASE_OUT_PATH, 'allele_analysis')
OUT_tmp = os.path.join(OUT_PATH, 'tmp')
PROTEINORTHO_OUT_PATH = os.path.join(BASE_OUT_PATH, 'proteinortho')
    
#renamed the allele path to reflect the proteinortho results
ALLELE_PATH = os.path.join(OUT_PATH, 'alleles_proteinortho_graph516')
if not os.path.exists(BASE_OUT_PATH):
    os.mkdir(BASE_OUT_PATH)
if not os.path.isdir(OUT_PATH):
    os.mkdir(OUT_PATH)
if not os.path.isdir(OUT_tmp):
    os.mkdir(OUT_tmp)
if not os.path.exists(ALLELE_PATH):
    os.mkdir(ALLELE_PATH)
if not os.path.exists(BLAST_DB):
    os.mkdir(BLAST_DB)
if not os.path.exists(PROTEINORTHO_OUT_PATH):
    os.mkdir(PROTEINORTHO_OUT_PATH)

#Define your p and h genome and move it into the allele analysis folder
P_GENOME = 'DK_0911_%s_p_ctg' % GENOME_VERSION
H_GENOME = 'DK_0911_%s_h_ctg' % GENOME_VERSION
for x in (x + '.fa' for x in [P_GENOME, H_GENOME]):
    shutil.copy2(GENOME_PATH+'/'+x, OUT_tmp)

In [4]:
#get proteinortho synteny file which ends with .poff
poff_graph_fn = os.path.join(PROTEINORTHO_OUT_PATH, 'ph_ctg_516.poff-graph')

In [5]:
#Define ENV parameters for blast hits and threads used in blast analysis
n_threads = 16
e_value = 1e-3
Qcov_cut_off = 0 #this defines the minimum coverage of the Query required for filtering. Will become part of name.
PctID_cut_off = 70 #this defines the minimum PctID across the alignment required for filtering. Will become part of name.

In [6]:
def att_column_p(x, y, z, a, b):
    '''Generate the attribute column for the p_gff dataframe (describes the aligned haplotig region).
    '''
    string = 'h_contig=%s;query_start=%s;query_stop=%s;query_length=%s;query_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

def att_column_h(x, y, z, a, b):
    '''Generate the attribute column for the h_gff dataframe (describes the aligned primary contig region).
    '''
    string = 'p_contig=%s;ref_start=%s;ref_stop=%s;ref_length=%s;ref_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

def gene_contig_subset(df, contig, feature):
    '''Subset a gff dataframe by contig (column 0) and feature type (column 2), e.g. 'gene'.
        Return the subset sorted by start position (column 3).'''
    # .copy() so the in-place operations below act on an independent frame and not on a view of df
    tmp_df = df[(df[0].str.contains(contig)) & (df[2]==feature)].copy()
    tmp_df.sort_values(3,inplace=True)
    tmp_df.reset_index(drop=True, inplace=True)
    return tmp_df

def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x.
    '''
    q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
    if y != 'NaN':
        hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
    else:
        hit_contig = 'NaN'
    return q_contig == hit_contig

def target_on_mapped_haplotig(t_contig, h_contig_overlap):
    '''Simple function that checks if the target contig is in the list of h overlapping contigs'''
    if t_contig == False or h_contig_overlap == False:
        return False
    return t_contig in h_contig_overlap
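To make the contig comparison in `same_contig_blast` more concrete, here is a small sketch of what the regular expression extracts. The sequence IDs below are hypothetical examples of Falcon-Unzip style names, and `contig_core` is a helper introduced only for this illustration. The character class is written as `[ph]`; the notebook's `[p|h]` matches the same letters plus a literal `|`.

```python
import re

# Same pattern idea as in same_contig_blast: skip the leading p/h and
# capture the shared 'contig_NNN' part.
CONTIG_RE = re.compile(r'[ph]([a-z]*_[0-9]*)_?')

def contig_core(seq_id):
    """Hypothetical helper: return the contig part shared by a primary
    contig and its haplotigs, or None if the ID does not match."""
    match = CONTIG_RE.search(seq_id)
    return match.group(1) if match else None

# Hypothetical IDs, for illustration only.
primary_id = 'pcontig_000.12'
haplotig_id = 'hcontig_000_003.7'
other_id = 'hcontig_001_002.1'

print(contig_core(primary_id), contig_core(haplotig_id), contig_core(other_id))
print(contig_core(primary_id) == contig_core(haplotig_id))
```

A primary contig and its associated haplotig reduce to the same core, so a hit between them counts as being on the same contig, whereas a hit on a haplotig of a different primary contig does not.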

#### Considerations for blast analysis
Do gene blast and protein blast. Initially go off the protein blast results. They should be more informative, as they also include the coding region and the frame.
> Pull this in as a dataframe and filter it down as well to each contig. Do blasts both ways as well.
> Copy over primary & haplotig .protein.fa and .gene.fa files from the BASE_A_folder, if not already present.

In [None]:
#generate the blast databases if not already present
os.chdir(BLAST_DB)
# copy over .protein.fa and .gene.fa files from the GENOME_PATH folder
# if not already present, for both primary and haplotigs
shutil.copy2(os.path.join(GENOME_PATH, P_GENOME + '.protein.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, P_GENOME + '.gene.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, H_GENOME + '.protein.fa'), BLAST_DB)
shutil.copy2(os.path.join(GENOME_PATH, H_GENOME + '.gene.fa'), BLAST_DB)

blast_dir_content = os.listdir(BLAST_DB)
for x in blast_dir_content:
    if x.endswith('.fa') and ({os.path.isfile(x + e) for e in ['.psq', '.phr', '.pin'] } != {True}\
           and {os.path.isfile(x + e) for e in ['.nin', '.nhr', '.nsq'] } != {True} ):

        make_DB_options = ['-in']
        make_DB_options.append(x)
        make_DB_options.append('-dbtype')
        if 'protein' in x:
            make_DB_options.append('prot')
        else:
            make_DB_options.append('nucl')
        make_DB_command = 'makeblastdb %s' % ' '.join(make_DB_options)
        make_DB_stderr = subprocess.check_output(make_DB_command, shell=True, stderr=subprocess.STDOUT)
        print('%s is done!' % make_DB_command)
print("All databases generated and ready to go!")

makeblastdb -in DK_0911_v032_h_ctg.gene.fa -dbtype nucl is done!
makeblastdb -in DK_0911_v032_p_ctg.protein.fa -dbtype prot is done!
makeblastdb -in DK_0911_v032_h_ctg.protein.fa -dbtype prot is done!
makeblastdb -in DK_0911_v032_p_ctg.gene.fa -dbtype nucl is done!
All databases generated and ready to go!


In [None]:
blast_dict = {}
blast_dict['gene_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('gene.fa') and 'ph_ctg' not in x]
blast_dict['protein_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('protein.fa') and 'ph_ctg' not in x]
blast_stderr_dict = {}

for key in blast_dict.keys():
    for n, query in enumerate(blast_dict[key]):
        tmp_list = blast_dict[key][:]
        tmp_stderr_list = []
        #this loops through all remaining files and does blast against all those
        del tmp_list[n]
        for db in tmp_list:
            print("\nBlasting %s ..." %db)
            blast_options = ['-query']
            blast_options.append(query)
            blast_options.append('-db')

            blast_options.append(db)
            blast_options.append('-outfmt 6')
            blast_options.append('-evalue')
            blast_options.append(str(e_value))
            blast_options.append('-num_threads')
            blast_options.append(str(n_threads))
            blast_options.append('>')
            if 'gene' in query and 'gene' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastn.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastn %s' % ' '.join(blast_options)
            elif 'protein' in query and 'protein' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastp.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastp %s' % ' '.join(blast_options)
            print("Blast command generated:\n%s" % blast_command)
            if not os.path.exists(out_name):
                print("Executing new blast command...")
                blast_stderr_dict[blast_command] = subprocess.check_output(blast_command, shell=True, stderr=subprocess.STDOUT)
                print("New blast executed!")
            else:
                blast_stderr_dict[blast_command] = 'Previously done already!'
                print("Blast command completed previously!")


Blasting DK_0911_v032_p_ctg.gene.fa ...
Blast command generated:
blastn -query DK_0911_v032_h_ctg.gene.fa -db DK_0911_v032_p_ctg.gene.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/v032/allele_analysis/DK_0911_v032_h_ctg.DK_0911_v032_p_ctg.0.001.blastn.outfmt6
Executing new blast command...
New blast executed!

Blasting DK_0911_v032_h_ctg.gene.fa ...
Blast command generated:
blastn -query DK_0911_v032_p_ctg.gene.fa -db DK_0911_v032_h_ctg.gene.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/v032/allele_analysis/DK_0911_v032_p_ctg.DK_0911_v032_h_ctg.0.001.blastn.outfmt6
Executing new blast command...
New blast executed!

Blasting DK_0911_v032_h_ctg.protein.fa ...
Blast command generated:
blastp -query DK_0911_v032_p_ctg.protein.fa -db DK_0911_v032_h_ctg.protein.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output
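The command construction above can also be sketched with argument lists instead of shell strings. This is an alternative sketch, not what the notebook ran: `build_makeblastdb_cmd` and `build_blast_cmd` are hypothetical helpers, and the BLAST+ `-out` option is used in place of the shell redirection (`>`). The output file name follows the same `query.db.evalue.program.outfmt6` pattern as the cell above.

```python
import os

def build_makeblastdb_cmd(fasta_fn):
    """Hypothetical helper: makeblastdb arguments for one FASTA file."""
    dbtype = 'prot' if 'protein' in fasta_fn else 'nucl'
    return ['makeblastdb', '-in', fasta_fn, '-dbtype', dbtype]

def build_blast_cmd(query_fn, db_fn, out_dir, e_value=1e-3, n_threads=16):
    """Hypothetical helper: blastp/blastn arguments for one query/db pair."""
    program = 'blastp' if 'protein' in query_fn else 'blastn'
    out_name = '.'.join([query_fn.split('.')[0], db_fn.split('.')[0],
                         str(e_value), program, 'outfmt6'])
    return [program,
            '-query', query_fn,
            '-db', db_fn,
            '-outfmt', '6',
            '-evalue', str(e_value),
            '-num_threads', str(n_threads),
            '-out', os.path.join(out_dir, out_name)]

db_cmd = build_makeblastdb_cmd('DK_0911_v032_h_ctg.protein.fa')
blast_cmd = build_blast_cmd('DK_0911_v032_p_ctg.protein.fa',
                            'DK_0911_v032_h_ctg.protein.fa',
                            'allele_analysis')
print(' '.join(db_cmd))
print(' '.join(blast_cmd))
```

A list like this can be passed to `subprocess.run(cmd, check=True)` without `shell=True`, which avoids quoting problems with unusual file names.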

In [None]:
# make a sequence length dict for all genes and proteins for which blast was performed
seq_list = []
length_list = []
for key in blast_dict.keys():
    for file in blast_dict[key]:
        for seq in SeqIO.parse(open(file), 'fasta'):
            seq_list.append(seq.id)
            length_list.append(len(seq.seq))
length_dict = dict(zip(seq_list, length_list))

In [None]:
# get assemblytics folders
ass_folders = [os.path.join(ASSEMBLYTICS_IN_PATH, x) for x in os.listdir(ASSEMBLYTICS_IN_PATH) if x.endswith('_php_8kbp')]
ass_folders.sort()

In [None]:
# read in the gff files
p_gff = pd.read_csv(os.path.join(GENOME_PATH, P_GENOME+'.anno.gff3'), sep='\t', header=None)
h_gff = pd.read_csv(os.path.join(GENOME_PATH, H_GENOME+'.anno.gff3'), sep='\t', header=None)

In [None]:
def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x.
    '''
    try:
        q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
        if y != 'NaN':
            hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
        else:
            hit_contig = 'NaN'
        return q_contig == hit_contig
    except AttributeError:
        print("AttributeError\nx: %s\ny: %s" % (x, y))
        sys.exit()
    except TypeError:
        print("x: %s\ny: %s" % (x, y))
        print("TypeError\ntype(x): %s\ntype(y): %s" % (type(x), type(y)))
        sys.exit()


In [None]:
#read in the blast df and add QLgth and QCov columns
blast_out_dict = {}
blast_header = ['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery', 'StopQuery', 'StartTarget',\
              'StopTarget', 'e-value','BitScore']
for key in blast_stderr_dict.keys():
    file_loc = key.split('>')[-1][1:]
    # file_loc = .../output/defining_alleles/allele_analysis/DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    file_name = file_loc.split('/')[-1]
    # DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    tmp_df = pd.read_csv(file_loc, sep='\t', header=None, names=blast_header)

    tmp_df['QLgth'] = tmp_df['Query'].apply(lambda x: length_dict[x])
    tmp_df['QCov'] = tmp_df['AlnLgth']/tmp_df['QLgth']*100
    
    tmp_df.sort_values(by=['Query', 'e-value','BitScore', ],ascending=[True, True, False], inplace=True)
    
    # assert(len(tmp_df.loc[tmp_df['Query'] == 'pcontig_000:10023-10824']) == 0)
    
    #now make sure to add proteins/genes without blast hit to the DataFrame
    #check if gene blast or protein blast and pull out respective query file
    if file_loc.split('.')[-2] == 'blastp':
        tmp_blast_seq = [x for x in blast_dict['protein_fa'] if x.startswith(file_name.split('.')[0])][0]
    elif file_loc.split('.')[-2] == 'blastn':
        tmp_blast_seq = [x for x in blast_dict['gene_fa'] if x.startswith(file_name.split('.')[0])][0]
    tmp_all_blast_seq = []
    #make list of all query sequences
    for seq in SeqIO.parse(os.path.join(BLAST_DB, tmp_blast_seq), 'fasta'):
        tmp_all_blast_seq.append(str(seq.id))
    tmp_all_queries_w_hit = tmp_df["Query"].unique()
    tmp_queries_no_hit = set(tmp_all_blast_seq) - set(tmp_all_queries_w_hit)
    no_hit_list = []
    
    #loop over the queries with no hit and build a list of lists from them, the first element being the query id
    for x in tmp_queries_no_hit:
        NA_list = ['NaN'] * len(tmp_df.columns)
        NA_list[0] = x
        no_hit_list.append(NA_list)
    
    tmp_no_hit_df = pd.DataFrame(no_hit_list)
    tmp_no_hit_df.columns = tmp_df.columns
    tmp_no_hit_df['QLgth'] = tmp_no_hit_df.Query.apply(lambda x: length_dict[x])
    #map stuff at the tmp level
    
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    tmp_df = pd.concat([tmp_df, tmp_no_hit_df])
    assert(len(tmp_df.loc[tmp_df['Query'] == 'NaN']) == 0)
    
    tmp_df['q_contig'] = tmp_df['Query'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    tmp_df['t_contig'] = tmp_df['Target'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    # print(tmp_df['q_contig'])
    
    #if nothing was extracted, store False instead of NaN
    tmp_df['t_contig'] = tmp_df['t_contig'].fillna(False)
    
    # print(tmp_df.apply(lambda row: print(row['Query'])), axis = 1)
    
    tmp_df['q_contig == t_contig'] = tmp_df.apply(lambda row: same_contig_blast(row['Query'], row['Target']), axis = 1)
    tmp_df.reset_index(inplace=True, drop=True)
    blast_out_dict[file_name] = tmp_df.iloc[:,:]
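The two main steps of the cell above, computing query coverage and adding queries without any hit, can be sketched on toy data. The sequence names and lengths are invented for this illustration. One deliberate difference from the notebook: the sketch leaves missing values as real `NaN` instead of the string `'NaN'`.

```python
import pandas as pd

# Toy sequence lengths and BLAST hits; values are invented.
length_dict = {'q1': 100, 'q2': 200, 'q3': 50}
blast = pd.DataFrame({
    'Query': ['q1', 'q2'],
    'Target': ['t1', 't2'],
    'AlnLgth': [80, 50],
})

# Query length and query coverage in percent.
blast['QLgth'] = blast['Query'].map(length_dict)
blast['QCov'] = blast['AlnLgth'] / blast['QLgth'] * 100

# Queries that are in the FASTA file but have no hit in the table.
no_hit = sorted(set(length_dict) - set(blast['Query']))
no_hit_df = pd.DataFrame({'Query': no_hit})
no_hit_df['QLgth'] = no_hit_df['Query'].map(length_dict)

# Columns missing from no_hit_df are filled with NaN by concat.
full = pd.concat([blast, no_hit_df], ignore_index=True)
print(full)
```

Keeping the no-hit queries in the table means that every gene or protein appears at least once in the final allele files, even if it has no BLAST partner.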

In [None]:
for folder in ass_folders:
    '''
    From oriented_coords.csv files generate gff files with the following set up.
    h_contig overlap
    'query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h'
    p_contig overlap
    'ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p'.
    Save those file in a new folder for further downstream analysis. The same folder should include the contig and gene filtered
    gffs. 
    '''
    orient_coords_file = [os.path.join(folder, x) for x in os.listdir(folder) if x.endswith('oriented_coords.csv')][0]
    #load in df and generate several additional columns
    tmp_df = pd.read_csv(orient_coords_file, sep=',')
    #check if there is any reasonable alignment; if not, go to the next one
    if len(tmp_df) == 0:
        print('Check on %s assemblytics' % (folder))
        continue
    tmp_df['p_feature'] = "haplotig"
    tmp_df['h_feature'] ='primary_contig'
    tmp_df['strand'] = "+"
    tmp_df['frame'] = 0
    tmp_df['query_aln_len'] = abs(tmp_df['query_end']-tmp_df['query_start'])
    tmp_df['alignment_length'] = abs(tmp_df['ref_end'] - tmp_df['ref_start'])
    tmp_df.reset_index(drop=True, inplace=True)
    tmp_df.sort_values('query', inplace=True)
    tmp_df['attribute_p'] = tmp_df.apply(lambda row: att_column_p(row['query'],row['query_start'], row['query_end'],\
                                                         row['query_length'], row['query_aln_len']), axis=1)
    
    tmp_df['attribute_h'] = tmp_df.apply(lambda row: att_column_h(row['ref'], row['ref_start'], row['ref_end'],\
                                                                  row['ref_length'], row['alignment_length']), axis=1)
    
    #generate tmp gff dataframe
    tmp_p_gff_df = tmp_df.loc[:, ['ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p']]
    #swap start and end where start > end, and set a start of 0 to 1. This was mostly (only?) the case for haplotigs
    tmp_p_gff_df['comp'] = tmp_p_gff_df['ref_start'] - tmp_p_gff_df['ref_end']
    tmp_swap_index  = tmp_p_gff_df['comp'] > 0
    #swap both columns in one assignment from a snapshot (.values), so neither column can pick up already-swapped data
    tmp_p_gff_df.loc[tmp_swap_index, ['ref_start', 'ref_end']] = tmp_p_gff_df.loc[tmp_swap_index, ['ref_end', 'ref_start']].values
    #and if 'ref_start' == 0
    is_null_index  = tmp_p_gff_df['ref_start'] == 0
    tmp_p_gff_df.loc[is_null_index, 'ref_start'] = 1
    tmp_p_gff_df = tmp_p_gff_df.iloc[:, 0:9]
    tmp_p_gff_df.sort_values('ref_start',inplace=True)
    tmp_h_gff_df = tmp_df.loc[:, ['query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h']]
    #same fix for tmp_h_gff
    tmp_h_gff_df['comp'] = tmp_h_gff_df['query_start'] - tmp_h_gff_df['query_end']
    tmp_swap_index  = tmp_h_gff_df['comp'] > 0
    #swap both columns in one assignment from a snapshot (.values), so neither column can pick up already-swapped data
    tmp_h_gff_df.loc[tmp_swap_index, ['query_start', 'query_end']] = tmp_h_gff_df.loc[tmp_swap_index, ['query_end', 'query_start']].values
    #and if 'query_start' == 0
    is_null_index  = tmp_h_gff_df['query_start'] == 0
    tmp_h_gff_df.loc[is_null_index, 'query_start'] = 1
    tmp_h_gff_df = tmp_h_gff_df.iloc[:, 0:9]
    tmp_h_gff_df.sort_values('query_start', inplace=True)
    #make the outfolder and save gff files of overlap and filtered gene files
    folder_suffix = folder.split('/')[-1]
    #get the contig
    contig = re.search(r'p([a-z]*_[0-9]*)_', folder_suffix).group(1)
    out_folder = os.path.join(OUT_PATH, folder_suffix)
    if not os.path.exists(out_folder):
        os.mkdir(out_folder)
    out_name_p = os.path.join(out_folder, folder_suffix )
    tmp_p_gff_df.to_csv(out_name_p+'.p_by_h_cov.gff', sep='\t' ,header=None, index =None)
    out_name_h =  os.path.join(out_folder, folder_suffix)
    tmp_h_gff_df.to_csv(out_name_h+'.h_by_p_cov.gff' , sep='\t' ,header=None, index =None)
    #write those out to new folder for php together with the p and h gff of genes
    p_gene_gff = gene_contig_subset(p_gff, contig, 'gene')
    p_gene_gff.to_csv(out_name_p+'.p_gene.gff', sep='\t' ,header=None, index =None)
    h_gene_gff = gene_contig_subset(h_gff, contig, 'gene')
    h_gene_gff.to_csv(out_name_h+'.h_gene.gff', sep='\t' ,header=None, index =None)
    
    #next would be to do overlap between the gene gff and the corresponding alignment gff. Write this out. Dict for each gene and its corresponding h_contig/p_contig
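The coordinate clean-up in the loop above (start must not exceed end, and positions are 1-based in gff) can be illustrated on a tiny table. The numbers are invented, and the sketch only shows the normalisation step, not the full gff construction.

```python
import pandas as pd

# Toy alignment coordinates: row 1 is reversed, row 2 starts at 0.
coords = pd.DataFrame({
    'ref_start': [100, 500, 0],
    'ref_end': [200, 300, 50],
})

# Swap start and end where start > end. Taking .to_numpy() of the
# reordered columns avoids re-alignment by column label.
swap = coords['ref_start'] > coords['ref_end']
coords.loc[swap, ['ref_start', 'ref_end']] = (
    coords.loc[swap, ['ref_end', 'ref_start']].to_numpy()
)

# gff coordinates are 1-based, so a start of 0 becomes 1.
coords.loc[coords['ref_start'] == 0, 'ref_start'] = 1
print(coords)
```

After this step every row satisfies `ref_start <= ref_end` with `ref_start >= 1`, which is what bedtools expects from a valid gff interval.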
    

In [None]:
p_gene_gff.head()

In [None]:
#now loop over the outfolders
out_folders = [os.path.join(OUT_PATH, x) for x in os.listdir(OUT_PATH) if x.endswith('_php_8kbp')]
out_folders.sort()

In [None]:
p_gene_h_contig_overlap_dict = {}
for out_folder in out_folders:
    tmp_folder_content = os.listdir(out_folder)
    tmp_p_gene_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_gene.gff') ][0]
    tmp_p_gene_bed = pybedtools.BedTool(tmp_p_gene_gff).remove_invalid().saveas()
    tmp_p_by_h_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_by_h_cov.gff') ][0]
    tmp_p_by_h_bed = pybedtools.BedTool(tmp_p_by_h_gff).remove_invalid().saveas()
    #generate a bed intersect
    p_gene_and_p_by_h = tmp_p_gene_bed.intersect(tmp_p_by_h_bed, wb=True)
    #check the length of the obtained intersect
    print("This is the length of the intersect %i" %len(p_gene_and_p_by_h ))
    if len(p_gene_and_p_by_h) == 0:
        continue
    #turn this into a dataframe 
    p_gene_and_p_by_h_df = p_gene_and_p_by_h.to_dataframe()
    p_gene_and_p_by_h_df['p_gene'] = p_gene_and_p_by_h_df[8].str.extract(r'ID=([^;]*);', expand=False)
    p_gene_and_p_by_h_df['p_protein'] = p_gene_and_p_by_h_df['p_gene'].str.replace('TU', 'model')
    #save this in the same folder as dataframe
    p_gene_and_p_by_h_df.to_csv(tmp_p_gene_gff.replace('p_gene.gff', 'gene_haplotig_intersect.df'), sep='\t', index=None)
    p_gene_and_p_by_h_df_2 = p_gene_and_p_by_h_df.loc[:, [10, 'p_protein']]
    #make a dict for the overlap of each gene with its corresponding haplotig
    tmp_p_gene_h_overlap_df = p_gene_and_p_by_h_df_2.groupby('p_protein')[10].apply(list)
    tmp_p_gene_h_overlap_dict = dict(zip(tmp_p_gene_h_overlap_df.index, tmp_p_gene_h_overlap_df.values))
    p_gene_h_contig_overlap_dict.update(tmp_p_gene_h_overlap_dict)
    

In [None]:
def get_h_gene_p_contig_overlap_dict(p_gene_h_contig_overlap_dict):
    d = {}
    for key in p_gene_h_contig_overlap_dict.keys():
        for val in p_gene_h_contig_overlap_dict[key]:
            if val in d:
                d[val].append(key)
            else:
                d[val] = [key]
    return d
h_gene_p_contig_overlap_dict = get_h_gene_p_contig_overlap_dict(p_gene_h_contig_overlap_dict)
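The function above inverts a mapping of primary proteins to lists of overlapping haplotigs into a mapping of haplotigs to lists of primary proteins. A compact sketch of the same idea with `collections.defaultdict` follows. The protein and haplotig names are made up, and `invert_overlap_dict` is a hypothetical name used only here.

```python
from collections import defaultdict

def invert_overlap_dict(overlap_dict):
    """Hypothetical helper: turn {key: [values]} into {value: [keys]}."""
    inverted = defaultdict(list)
    for key, values in overlap_dict.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)

# Toy mapping: primary protein -> haplotigs overlapping its gene.
p_to_h = {
    'protein_a': ['hcontig_000_001'],
    'protein_b': ['hcontig_000_001', 'hcontig_000_002'],
}
h_to_p = invert_overlap_dict(p_to_h)
print(h_to_p)
```

The inverted dictionary is what the haplotig-onto-primary analysis further down uses to look up which primary proteins sit under a given haplotig.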

In [None]:
#get the p_protein on h_protein blast dataframe and append the h_contig overlap for each protein
_key = [x for x in blast_out_dict.keys() if x.startswith(P_GENOME) and x.split('.')[-2] == 'blastp' and x.endswith('outfmt6')][0]
p_gene_h_contig_overlap_df = pd.DataFrame([p_gene_h_contig_overlap_dict.keys(), p_gene_h_contig_overlap_dict.values()],  index = ['p_protein', 'h_contig_overlap']).T
_tmp_df = blast_out_dict[_key]
_tmp_df = _tmp_df.merge(p_gene_h_contig_overlap_df, how='outer' ,left_on='Query', right_on='p_protein')
_tmp_df['h_contig_overlap'] = _tmp_df['h_contig_overlap'].fillna(False)
#check if the protein hit resides on the overlapping haplotig
_tmp_df['t_contig == h_contig_overlap'] = _tmp_df.apply(lambda row: target_on_mapped_haplotig(row['t_contig'], row['h_contig_overlap']), axis =1 )
#blast_out_dict[_key] = _tmp_df

In [None]:
# p_gene_h_contig_overlap_df['p_protein'] = p_gene_h_contig_overlap_df['p_protein'].apply(lambda x: x.replace('model', 'TU'))

In [None]:
#write the dataframe again
_tmp_df.to_csv(os.path.join(OUT_PATH, _key+'.allele_analysis'), index=None, sep='\t')

In [None]:
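#Qcov_cut_off = ''
#PctID_cut_off = ''
#compare against the empty-string/None sentinels so that a valid cut off of 0 is not reported as undefined
if Qcov_cut_off not in ('', None) and PctID_cut_off not in ('', None):
    pass
else:
    print('No cut off defined! Do you want to define it? Please go ahead above.')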

In [None]:
os.chdir('/home/gamran/genome_analysis/Warrior/Richard/scripts')
%run 'DK_0911_proteinortho.ipynb'

In [None]:
poff_graph_header = ['Target', 'Query', 'evalue_ab', 'bitscore_ab', 'evalue_ba', 'bitscore_ba', 'same_strand' , 'simscore']
po_df = pd.read_csv(poff_graph_fn, sep='\t', header=None, names=poff_graph_header, comment='#' )

In [None]:
def reduceGroups(g, labels, maxBest):
    '''returns the best hit based on e-value and BitScore per group'''
    for label, maxBest in zip(labels, maxBest):
        if len(g) == 1:
            return g
        if maxBest:
            g = g[g[label] == g[label].max()]
        else:
            g = g[g[label] == g[label].min()]
#     if len(g) > 1:
#         print('Could not reduce group to 1 element. %s ' % g['Query'])
    return g

# _tmp_df = _tmp_df.groupby('Query').apply(lambda g: reduceGroups(g, ['e-value', 'BitScore'], [False, True]))

In [None]:
#now add a comparison column to po_df and _tmp_df
po_df['comp'] = po_df['Query'] + po_df['Target']
_tmp_df['comp'] = _tmp_df['Query'] + _tmp_df['Target']
#generate a new allele_source column
_tmp_df['allele_source'] = 'BLAST'
_tmp_df.loc[_tmp_df[_tmp_df.comp.isin(po_df.comp.unique())].index, 'allele_source' ] = 'PO'
#now drop the comp column again
_tmp_df = _tmp_df.drop('comp', axis=1)

print(_tmp_df.columns)
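The labelling step above marks every BLAST row whose Query/Target pair also appears in the proteinortho graph as `'PO'`. Below is a toy sketch of the same logic. It uses a set of `(Query, Target)` tuples instead of concatenated strings, which is a swapped-in variant and not the notebook's method; the IDs are invented.

```python
import pandas as pd

# Toy BLAST hits and toy proteinortho pairs; IDs are invented.
blast = pd.DataFrame({
    'Query': ['p1', 'p1', 'p2'],
    'Target': ['h1', 'h2', 'h3'],
})
po = pd.DataFrame({'Query': ['p1'], 'Target': ['h2']})

# Pairs reported by proteinortho, stored as tuples.
po_pairs = set(zip(po['Query'], po['Target']))

# Label each BLAST row by whether its pair is in the proteinortho set.
blast['allele_source'] = [
    'PO' if pair in po_pairs else 'BLAST'
    for pair in zip(blast['Query'], blast['Target'])
]
print(blast)
```

Tuples keep the two IDs separate, so two different pairs can never collapse into the same key, which is in principle possible when IDs are simply concatenated.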

In [None]:
print(len(_tmp_df))
no_PO_df = _tmp_df[_tmp_df['allele_source'] == 'BLAST']
PO_df = _tmp_df[_tmp_df['allele_source'] == 'PO']

no_PO_df = no_PO_df.groupby('Query').apply(lambda g: reduceGroups(g, ['e-value', 'BitScore'], [False, True]))
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
_tmp_df = pd.concat([no_PO_df, PO_df], ignore_index=True)
print(len(_tmp_df))

In [None]:
_tmp_df.to_csv(os.path.join(ALLELE_PATH, P_GENOME + '.full_df.alleles'), sep='\t', index=None, header=None)

In [None]:
htgOnPctgBlast = pd.read_csv(os.path.join(OUT_PATH, '%s.%s.0.001.blastp.outfmt6' % (H_GENOME, P_GENOME)), sep='\t', index_col=None, header=None, names=blast_header)
htgOnPctgBlast['QLgth'] = htgOnPctgBlast['Query'].apply(lambda x: length_dict[x])
htgOnPctgBlast['QCov'] = htgOnPctgBlast['AlnLgth']/htgOnPctgBlast['QLgth']*100

htgOnPctgBlast['q_contig'] = htgOnPctgBlast['Query'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
htgOnPctgBlast['t_contig'] = htgOnPctgBlast['Target'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)

htgOnPctgBlast['q_contig == t_contig'] = htgOnPctgBlast.apply(lambda row: same_contig_blast(row['Query'], row['Target']), axis = 1)
htgOnPctgBlast.reset_index(inplace=True, drop=True)

#build the reverse mapping dataframe. Column names are reused from the primary analysis above:
#here 'p_protein' holds the haplotig contig IDs and 'h_contig_overlap' the lists of overlapping primary proteins
p_gene_h_contig_overlap_df = pd.DataFrame([h_gene_p_contig_overlap_dict.keys(), h_gene_p_contig_overlap_dict.values()],  index = ['p_protein', 'h_contig_overlap']).T

htgOnPctgBlast = htgOnPctgBlast.merge(p_gene_h_contig_overlap_df, how='outer' ,left_on='q_contig', right_on='p_protein')
htgOnPctgBlast['h_contig_overlap'] = htgOnPctgBlast['h_contig_overlap'].fillna(False)
#check if the protein hit resides on the overlapping haplotig

htgOnPctgBlast['t_contig == h_contig_overlap'] = htgOnPctgBlast.apply(lambda row: target_on_mapped_haplotig(row['Target'], row['h_contig_overlap']), axis = 1)
htgOnPctgBlast['allele_source'] = 'h_rBLAST'
htgOnPctgBlast = htgOnPctgBlast.groupby('Query').apply(lambda g: reduceGroups(g, ['e-value', 'BitScore'], [False, True]))


In [None]:
htgOnPctgBlast.to_csv(os.path.join(ALLELE_PATH, H_GENOME + '.full_df.alleles'), sep='\t', index=None, header=None)