## Defining Alleles

This notebook performs allelic comparisons between primary contigs and haplotigs. It was used to generate the DataFrames for the analysis and evolved along with it, so it may be more complicated than it needs to be, but it works.

Based on original code by Benjamin Schwessinger.

This notebook defines alleles based on a Falcon-Unzip assembly, where relational information between primary contigs and their haplotigs is available.

#### The input is as follows:
* Assemblytics alignment of all haplotigs onto their corresponding primary contigs. The *.oriented_coords.csv files are converted to gff files of haplotig alignments onto their respective primary contigs.
* gff file of all features. This is filtered down to genes only and split by contig.
* protein and gene fa files for reciprocal blast of haplotig sequences onto primary sequences.

#### What happens:
* for each primary gene/protein the overlapping haplotig is pulled out and added to the blast dataframe
* this blast dataframe of primary proteins onto haplotig proteins is then filtered to provide the files listed below
* best blast hits are selected by lowest e-value first, followed by highest BitScore
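
The best-hit rule described above (lowest e-value, then highest BitScore) can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the notebook's actual filtering code; the column names follow the `blast_header` used later in this notebook, and `best_hits` is a hypothetical helper name.

```python
import pandas as pd

# Toy blast table; column names follow the blast_header used in this notebook.
blast_df = pd.DataFrame({
    'Query':    ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Target':   ['h1', 'h2', 'h3', 'h4', 'h5'],
    'e-value':  [1e-50, 1e-50, 1e-10, 1e-5, 1e-30],
    'BitScore': [180.0, 200.0, 90.0, 40.0, 120.0],
})

def best_hits(df):
    """Keep one row per Query: lowest e-value first, ties broken by highest BitScore."""
    ordered = df.sort_values(by=['Query', 'e-value', 'BitScore'],
                             ascending=[True, True, False])
    # After sorting, the first row of each Query group is its best hit.
    return ordered.drop_duplicates(subset='Query', keep='first').reset_index(drop=True)

best = best_hits(blast_df)
print(best)
```

If two hits tie on both e-value and BitScore, this sketch keeps whichever sorts first, so such cases would need separate inspection.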

#### What comes out in the alleles folder:
##### for primary genes/proteins
* a set of files that each contain two protein IDs identifying the two alleles.
* *_p_ctg.h_contig_overlap.alleles contains the best blast hits of the primary protein on haplotig proteins of the overlapping haplotig (truly linked alleles).
* *_p_ctg.no_respective_h_contig_overlap.alleles contains all primary proteins that have no blast hit on the overlapping haplotig but do have one on another haplotig associated with the same primary contig.
* *_p_ctg.no_specific_h_contig_overlap.alleles contains all primary proteins that have no blast hit on an associated haplotig but do have one on some other haplotig.
* *_p_ctg.no_alleles contains all primary proteins without a blast hit on any haplotig.

##### for haplotig genes/proteins
* *_h_ctg.no.alleles contains all haplotig proteins that are not associated with a primary allele in the above analysis.
* *_h_ctg.no.best_p_hits.alleles contains all haplotig proteins that are not associated with a primary allele in the above analysis but nonetheless have a blast hit. These might be duplications in the haplotigs.
* *_h_ctg.no.no_p_hits.alleles contains all haplotig proteins that have no blast hit.

##### filtering of blast output
* blast output can be filtered on the query coverage of the blast hit [Qcov] and the percentage identity of the alignment [PctID]. This generates additional output files named *.QcovXX.PctIDYY.alleles. The default values are set to Qcov80 and PctID70.
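
The coverage and identity filter can be sketched as below. This is a minimal example on made-up hits, assuming inclusive (`>=`) thresholds, which is a choice made here and not confirmed by the notebook; the output file name prefix is hypothetical and only illustrates the `*.QcovXX.PctIDYY.alleles` pattern.

```python
import pandas as pd

Qcov_cut_off = 80   # minimum query coverage in percent
PctID_cut_off = 70  # minimum percentage identity across the alignment

# Made-up blast hits with the query length already attached.
hits = pd.DataFrame({
    'Query':   ['p1', 'p2', 'p3'],
    'Target':  ['h1', 'h2', 'h3'],
    'PctID':   [95.0, 65.0, 88.0],
    'AlnLgth': [90, 100, 40],
    'QLgth':   [100, 100, 100],
})

# Query coverage as computed later in this notebook: alignment length / query length.
hits['QCov'] = hits['AlnLgth'] / hits['QLgth'] * 100

# Assumption: thresholds are inclusive.
passed = hits[(hits['QCov'] >= Qcov_cut_off) & (hits['PctID'] >= PctID_cut_off)]

# Hypothetical file name showing how the cut-offs become part of the name.
out_name = 'example_p_ctg.h_contig_overlap.Qcov%i.PctID%i.alleles' % (Qcov_cut_off, PctID_cut_off)
print(passed)
print(out_name)
```

Because the blast alignment length includes gaps, QCov computed this way can exceed 100 for some hits.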

#### What else to consider:
* the script was tested for outputting the right number of protein sequences for primary and haplotig sequences.
* this is a script and not a program. No warranties.
* Feedback is always welcome.
* Paths and other variables are defined at the top of the script.
* It assumes that primary contig sequences are labeled as e.g. pcontig_000 and associated haplotigs as hcontig_000_0xx.
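
The naming assumption in the last point can be made explicit with a small parser. The sketch below is illustrative only; `parse_contig` and `same_primary` are hypothetical helpers and not part of the notebook, and the pattern assumes names of exactly the form `pcontig_NNN` or `hcontig_NNN_MMM`.

```python
import re

# 'p' or 'h', then 'contig_', the primary contig number, and an optional haplotig number.
CONTIG_RE = re.compile(r'^([ph])contig_(\d+)(?:_(\d+))?')

def parse_contig(name):
    """Return (kind, primary_number, haplotig_number) or None if the name does not fit."""
    match = CONTIG_RE.match(name)
    if match is None:
        return None
    return match.groups()

def same_primary(name_a, name_b):
    """True if both names belong to the same primary contig number."""
    a, b = parse_contig(name_a), parse_contig(name_b)
    return a is not None and b is not None and a[1] == b[1]

print(parse_contig('hcontig_000_012'))
print(same_primary('pcontig_000', 'hcontig_000_003'))
```

Names that do not follow the convention return `None`, so they can be reported instead of being silently mismatched.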

#### Downstream considerations:
* Why do some proteins have no blast hits? Is it simply that their gene models are missing? We have seen this and are working on it.
* Do primary genes without blast hits lie in areas of homozygous coverage? Working on this.
* Are some of the h genes/proteins without alleles but with blast hits duplications that have diverged?
* Do some of the primary proteins/genes without a blast hit on haplotigs have a good blast hit on other primary contigs? Is this reciprocal?
* Can we use some of this information to talk about linkage of contigs, or does this go too far without fully phased genomes from very long reads or single-nuclei sequencing? Would say yes so far.
* Some primary proteins have two equally good hits. Are those duplications in the haplotig?
* If we perform downstream genetic variation analysis between alleles do we see any GO, KEGG, effector or other functional domains enriched?


This second version of the allele analysis takes the proteinortho5.pl output as the reference for synteny analysis.

This was run as follows:

proteinortho5.pl -project=ph_ctg  -synteny ph_ctg/Pst_104E_v12_h_ctg_protein.faa ph_ctg/Pst_104E_v12_p_ctg_protein.faa

<- this gave different results in the graph and the regular file in version 5.15.

Run the latest version with:

perl ~/anaconda3/envs/funannotate/proteinortho_v5.16/proteinortho5.pl -cpus=20 -project=ph_ctg_516  -synteny ph_ctg/Pst_104E_v12_h_ctg_protein.faa ph_ctg/Pst_104E_v12_p_ctg_protein.faa

Use the poff-graph and poff-proteinortho-graph as the basis for the analysis.

Depending on this run, the poff_header needs to be defined. In the current case it is 'Target' followed by 'Query'.

#### The additional input required
* the .poff file from proteinortho. It is used for filtering: an allele pair needs to have True in the PO_allele column to be included in the output. All other filtering is done as described above.
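
How a PO_allele flag could be derived is sketched below. The code that builds this column is not shown in this part of the notebook, so this is an assumption-laden illustration: the toy `poff_df` stands in for the parsed proteinortho graph, its two ID columns are named 'Target' and 'Query' as in the poff_header mentioned above, and all protein IDs are made up.

```python
import pandas as pd

# Toy stand-in for the parsed proteinortho graph: one row per reported pair.
poff_df = pd.DataFrame({
    'Target': ['h_prot_1', 'h_prot_2'],
    'Query':  ['p_prot_1', 'p_prot_2'],
})
# A set of (primary, haplotig) tuples allows a fast membership test.
po_pairs = set(zip(poff_df['Query'], poff_df['Target']))

# Toy blast best hits of primary proteins on haplotig proteins.
blast_df = pd.DataFrame({
    'Query':  ['p_prot_1', 'p_prot_2', 'p_prot_3'],
    'Target': ['h_prot_1', 'h_prot_9', 'h_prot_3'],
})

# Flag a blast pair only if proteinortho reports the same pair.
blast_df['PO_allele'] = [
    (q, t) in po_pairs for q, t in zip(blast_df['Query'], blast_df['Target'])
]
alleles = blast_df[blast_df['PO_allele']]
print(alleles)
```

If the graph lists the two species in the opposite column order, the tuple order in `po_pairs` would need to be swapped accordingly.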



In [2]:
%matplotlib inline

In [3]:
import pandas as pd
import os
import re
from Bio import SeqIO
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil


In [4]:
#Define the PATH
BASE_OUT_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/'
GENOME_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/genome_v03/'
ASSEMBLYTICS_IN_PATH = '/home/gamran/genome_analysis/Warrior/Richard/output/nucmer_assemblytics/Assemblytics'

BLAST_DB = os.path.join(BASE_OUT_PATH, 'blast_DB')
OUT_PATH = os.path.join(BASE_OUT_PATH, 'allele_analysis')
OUT_tmp = os.path.join(OUT_PATH, 'tmp')
PROTEINORTHO_OUT_PATH = os.path.join(BASE_OUT_PATH, 'proteinortho')

#renamed the allele path to reflect the proteinortho results
ALLELE_PATH = os.path.join(OUT_PATH, 'alleles_proteinortho_graph516')
if not os.path.isdir(OUT_PATH):
    os.mkdir(OUT_PATH)
if not os.path.isdir(OUT_tmp):
    os.mkdir(OUT_tmp)
if not os.path.exists(ALLELE_PATH):
    os.mkdir(ALLELE_PATH)
if not os.path.exists(BLAST_DB):
    os.mkdir(BLAST_DB)
if not os.path.exists(PROTEINORTHO_OUT_PATH):
    os.mkdir(PROTEINORTHO_OUT_PATH)


In [5]:
#Define your p and h genome and move it into the allele analysis folder
p_genome = 'DK_0911_v03_p_ctg'
h_genome = 'DK_0911_v03_h_ctg'
genome_file_suffix = '.genome_file'
for x in (x + '.fa' for x in [p_genome, h_genome]):
    shutil.copy2(GENOME_PATH+'/'+x, OUT_tmp)

In [6]:
#get proteinortho synteny file which ends with .poff
poff_graph_fn = os.path.join(PROTEINORTHO_OUT_PATH, 'ph_ctg_516.poff-graph')

In [7]:
#Define ENV parameters for blast hits and threads used in blast analysis
n_threads = 16
e_value = 1e-3
Qcov_cut_off = 0 #this defines the minimum coverage of the Query required for filtering. Will become part of the name.
PctID_cut_off = 70 #this defines the minimum PctID across the alignment required for filtering. Will become part of the name.

In [8]:
def att_column_p(x, y, z, a, b):
    '''Generate attribute column of h on p p_gff dataframe.
    '''
    string = 'h_contig=%s;query_start=%s;query_stop=%s;query_length=%s;query_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

In [9]:
def att_column_h(x, y, z, a, b):
    '''Generate attribute column of h on p h_gff dataframe.
    '''
    string = 'p_contig=%s;ref_start=%s;ref_stop=%s;ref_length=%s;ref_aln_ln=%s' % (x, str(y), str(z), str(a), str(b))
    return string

In [10]:
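def gene_contig_subset(df, contig, feature):
    '''Input the gff and subset the feature column by the given feature (e.g. gene).
        Return the data frame subset by contig and feature, sorted by start position.'''
    #.copy() so the in-place sort below acts on an independent frame and not on a slice of df
    tmp_df = df[(df[0].str.contains(contig)) & (df[2]==feature)].copy()
    tmp_df.sort_values(3,inplace=True)
    tmp_df.reset_index(drop=True, inplace=True)
    return tmp_df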

In [11]:
def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x.
    '''
    q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
    if y != 'NaN':
        hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
    else:
        hit_contig = 'NaN'
    return q_contig == hit_contig

In [12]:
def target_on_mapped_haplotig(t_contig, h_contig_overlap):
    '''Simple function that checks if the target contig is in the list of h overlapping contigs'''
    if t_contig == False or h_contig_overlap == False:
        return False
    return t_contig in h_contig_overlap

#### Considerations for blast analysis
Do both gene blast and protein blast. Initially go off the protein blast results. They should be more informative, as they also reflect the coding region and the frame.
> Pull this in as a dataframe and filter it down as well to each contig. Do blasts both ways as well.
> Copy over primary & haplotig .protein.fa and .gene.fa files from the BASE_A_folder, if not already present.
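
One way to assemble the blast calls for both directions is sketched below. This is an illustrative alternative to building a shell string, not the notebook's own code: `build_blast_command` is a hypothetical helper, the program is chosen from the file names as in the cell further down, and the result is written with the BLAST+ `-out` option instead of a shell redirect. Nothing is executed here.

```python
import os

def build_blast_command(query_fa, db_fa, out_dir, e_value=1e-3, n_threads=16):
    """Return (argument list, output file name) for a blastp or blastn run."""
    if 'protein' in query_fa and 'protein' in db_fa:
        program = 'blastp'
    elif 'gene' in query_fa and 'gene' in db_fa:
        program = 'blastn'
    else:
        # Mixed or unrecognised inputs are rejected instead of reusing an old command.
        raise ValueError('query and db must both be gene or both be protein files')
    out_name = os.path.join(out_dir, '.'.join([
        query_fa.split('.')[0], db_fa.split('.')[0], str(e_value), program + '.outfmt6']))
    cmd = [program, '-query', query_fa, '-db', db_fa, '-outfmt', '6',
           '-evalue', str(e_value), '-num_threads', str(n_threads), '-out', out_name]
    return cmd, out_name

cmd, out_name = build_blast_command('DK_0911_v03_p_ctg.protein.fa',
                                    'DK_0911_v03_h_ctg.protein.fa',
                                    'allele_analysis')
print(' '.join(cmd))
```

The returned list could be passed to `subprocess.run(cmd, check=True)` without `shell=True`, provided BLAST+ is installed and the database has been built.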

In [13]:
#generate the blast databases if not already present
os.chdir(BLAST_DB)
#copy over .protein.fa and .gene.fa files from the BASE_A_folder
#if not already present. for both primary and haplotigs

blast_dir_content = os.listdir(BLAST_DB)
for x in blast_dir_content:
    if x.endswith('.fa') and ({os.path.isfile(x + e) for e in ['.psq', '.phr', '.pin'] } != {True}\
           and {os.path.isfile(x + e) for e in ['.nin', '.nhr', '.nsq'] } != {True} ):

        make_DB_options = ['-in']
        make_DB_options.append(x)
        make_DB_options.append('-dbtype')
        if 'protein' in x:
            make_DB_options.append('prot')
        else:
            make_DB_options.append('nucl')
        make_DB_command = 'makeblastdb %s' % ' '.join(make_DB_options)
        make_DB_stderr = subprocess.check_output(make_DB_command, shell=True, stderr=subprocess.STDOUT)
        print('%s is done!' % make_DB_command)
print("All databases generated and ready to go!")

All databases generated and ready to go!


In [14]:
blast_dict = {}
blast_dict['gene_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('gene.fa') and 'ph_ctg' not in x]
blast_dict['protein_fa'] = [x for x  in os.listdir(BLAST_DB) if x.endswith('protein.fa') and 'ph_ctg' not in x]
blast_stderr_dict = {}
for key in blast_dict.keys():
    for n, query in enumerate(blast_dict[key]):
        tmp_list = blast_dict[key][:]
        tmp_stderr_list = []
        #this loops through all remaining files and does blast against all those
        del tmp_list[n]
        for db in tmp_list:
            print("\nBlasting %s ..." %db)
            blast_options = ['-query']
            blast_options.append(query)
            blast_options.append('-db')

            blast_options.append(db)
            blast_options.append('-outfmt 6')
            blast_options.append('-evalue')
            blast_options.append(str(e_value))
            blast_options.append('-num_threads')
            blast_options.append(str(n_threads))
            blast_options.append('>')
            if 'gene' in query and 'gene' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastn.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastn %s' % ' '.join(blast_options)
            elif 'protein' in query and 'protein' in db:
                out_name_list = [ query.split('.')[0], db.split('.')[0], str(e_value), 'blastp.outfmt6']
                out_name = os.path.join(OUT_PATH,'.'.join(out_name_list))
                blast_options.append(out_name)
                blast_command = 'blastp %s' % ' '.join(blast_options)
            print("Blast command generated:\n%s" % blast_command)
            if not os.path.exists(out_name):
                print("Executing new blast command...")
                blast_stderr_dict[blast_command] = subprocess.check_output(blast_command, shell=True, stderr=subprocess.STDOUT)
                print("New blast executed!")
            else:
                blast_stderr_dict[blast_command] = 'Previously done already!'
                print("Blast command completed previously!")


Blasting DK_0911_v03_h_ctg.protein.fa ...
Blast command generated:
blastp -query DK_0911_v03_p_ctg.protein.fa -db DK_0911_v03_h_ctg.protein.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/allele_analysis/DK_0911_v03_p_ctg.DK_0911_v03_h_ctg.0.001.blastp.outfmt6
Blast command previous completed!

Blasting DK_0911_v03_p_ctg.protein.fa ...
Blast command generated:
blastp -query DK_0911_v03_h_ctg.protein.fa -db DK_0911_v03_p_ctg.protein.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/allele_analysis/DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
Blast command previous completed!

Blasting DK_0911_v03_p_ctg.gene.fa ...
Blast command generated:
blastn -query DK_0911_v03_h_ctg.gene.fa -db DK_0911_v03_p_ctg.gene.fa -outfmt 6 -evalue 0.001 -num_threads 16 > /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/allele_analysis/DK_0911_v03_h_

In [15]:
#make a sequence length dict for all genes and proteins for which blast was performed
seq_list = []
length_list =[]
for key in blast_dict.keys():
    for file in blast_dict[key]:
        for seq in SeqIO.parse(open(file), 'fasta'):
            seq_list.append(seq.id)
            length_list.append(len(seq.seq))
length_dict = dict(zip(seq_list, length_list))

In [16]:
#get assemblytics folders
ass_folders = [os.path.join(ASSEMBLYTICS_IN_PATH, x) for x in os.listdir(ASSEMBLYTICS_IN_PATH) if x.endswith('_php_8kbp')]
ass_folders.sort()

In [19]:
os.chdir('/home/gamran/genome_analysis/Warrior/Richard/scripts')
%run DK_0911_v03_dictionaries.ipynb
locusToIdDict = getLocusToIdDict()

In [267]:
def same_contig_blast(x,y):
    '''Function that checks if the blast hit in column y is on the same contig as the query sequence in
    column x.
    '''
    try:
        q_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', x).group(1)
        if y != 'NaN':
            hit_contig = re.search(r'[p|h]([a-z]*_[0-9]*)_?', y).group(1)
        else:
            hit_contig = 'NaN'
        return q_contig == hit_contig
    except AttributeError:
        print("AttributeError\nx: %s\ny: %s" % (x, y))
        sys.exit()
    except TypeError:
        print("x: %s\ny: %s" % (x, y))
        print("TypeError\ntype(x): %s\ntype(y): %s" % (type(x), type(y)))
        sys.exit()


In [268]:
def mapWithDict(x):
    if x == 'NaN':
        return x
    if x in locusToIdDict:
        return locusToIdDict[x]
    print("x: %s\n is not in the dictionary mapping loci to id." %x)
    sys.exit()

In [269]:
#read in the blast df and add QLgth and QCov columns
blast_out_dict = {}
blast_header = ['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery', 'StopQuery', 'StartTarget',\
              'StopTarget', 'e-value','BitScore']
for key in blast_stderr_dict.keys():
    file_loc = key.split('>')[-1][1:]
    # file_loc = .../output/defining_alleles/allele_analysis/DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    file_name = file_loc.split('/')[-1]
    # DK_0911_v03_h_ctg.DK_0911_v03_p_ctg.0.001.blastp.outfmt6
    
    tmp_df = pd.read_csv(file_loc, sep='\t', header=None, names=blast_header)
    
    tmp_df['QLgth'] = tmp_df['Query'].apply(lambda x: length_dict[x])
    tmp_df['QCov'] = tmp_df['AlnLgth']/tmp_df['QLgth']*100
    
    tmp_df.sort_values(by=['Query', 'e-value','BitScore', ],ascending=[True, True, False], inplace=True)
    
    # assert(len(tmp_df.loc[tmp_df['Query'] == 'pcontig_000:10023-10824']) == 0)
    
    #now make sure to add proteins/genes without blast hit to the DataFrame
    #check if gene blast or protein blast and pull out respective query file
    if file_loc.split('.')[-2] == 'blastp':
        tmp_blast_seq = [x for x in blast_dict['protein_fa'] if x.startswith(file_name.split('.')[0])][0]
    elif file_loc.split('.')[-2] == 'blastn':
        tmp_blast_seq = [x for x in blast_dict['gene_fa'] if x.startswith(file_name.split('.')[0])][0]
    tmp_all_blast_seq = []
    #make list of all query sequences
    for seq in SeqIO.parse(os.path.join(BLAST_DB, tmp_blast_seq), 'fasta'):
        tmp_all_blast_seq.append(str(seq.id))
    tmp_all_queries_w_hit = tmp_df["Query"].unique()
    tmp_queries_no_hit = set(tmp_all_blast_seq) - set(tmp_all_queries_w_hit)
    no_hit_list = []
    
    #loop over the queries with no hit and make a list of lists out of them, the first element being the query id
    for x in tmp_queries_no_hit:
        NA_list = ['NaN'] * len(tmp_df.columns)
        NA_list[0] = x
        no_hit_list.append(NA_list)
    
    tmp_no_hit_df = pd.DataFrame(no_hit_list)
    tmp_no_hit_df.columns = tmp_df.columns
    tmp_no_hit_df['QLgth'] = tmp_no_hit_df.Query.apply(lambda x: length_dict[x])
    #map stuff at the tmp level
    
    #DataFrame.append was removed in pandas 2.0; pd.concat does the same row-wise stacking
    tmp_df = pd.concat([tmp_df, tmp_no_hit_df])
    assert(len(tmp_df.loc[tmp_df['Query'] == 'NaN']) == 0)
    
    # assert(len(tmp_df.loc[tmp_df['Query'] == 'pcontig_000:10023-10824']) == 0)
    tmp_df['Query'] = tmp_df['Query'].apply(mapWithDict)
    tmp_df['Target'] = tmp_df['Target'].apply(mapWithDict)
    
    tmp_df['q_contig'] = tmp_df['Query'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    tmp_df['t_contig'] = tmp_df['Target'].str.extract(r'([p|h][a-z]*_[^.]*).?', expand = False)
    # print(tmp_df['q_contig'])
    
    #fix that if you don't extract anything return False and not 'nan'
    tmp_df['t_contig'].fillna(False, inplace=True)
    
    # print(tmp_df.apply(lambda row: print(row['Query'])), axis = 1)
    
    tmp_df['q_contig == t_contig'] = tmp_df.apply(lambda row: same_contig_blast(row['Query'], row['Target']), axis = 1)
    tmp_df.reset_index(inplace=True, drop=True)
    blast_out_dict[file_name] = tmp_df.iloc[:,:]
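
The step in the cell above that adds queries without any blast hit back into the table can also be written as a left merge. The sketch below is an alternative on toy data, not the notebook's code: it uses real missing values (`NaN`) instead of the string 'NaN' used above, so `isna()` works on the result, and all IDs are made up.

```python
import pandas as pd

# All sequences that were used as blast queries (normally read from the fasta file).
all_queries = ['p1', 'p2', 'p3']

# Toy blast output: only p1 produced hits.
hits = pd.DataFrame({
    'Query':    ['p1', 'p1'],
    'Target':   ['h1', 'h2'],
    'BitScore': [200.0, 150.0],
})

# A left merge keeps every query; queries without a hit get NaN in the hit columns.
padded = pd.DataFrame({'Query': all_queries}).merge(hits, on='Query', how='left')

no_hit_queries = padded.loc[padded['Target'].isna(), 'Query'].tolist()
print(padded)
print(no_hit_queries)
```

Downstream code that compares against the string 'NaN', as several functions in this notebook do, would need to test with `pd.isna` instead if this variant were adopted.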

In [270]:
for folder in ass_folders:
    '''
    From oriented_coords.csv files generate gff files with the following set up.
    h_contig overlap
    'query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h'
    p_contig overlap
    'ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p'.
    Save those file in a new folder for further downstream analysis. The same folder should include the contig and gene filtered
    gffs. 
    '''
    orient_coords_file = [os.path.join(folder, x) for x in os.listdir(folder) if x.endswith('oriented_coords.csv')][0]
    #load in df and generate several additional columns
    tmp_df = pd.read_csv(orient_coords_file, sep=',')
    #check if there is any reasonable alignment; if not, go to the next one
    if len(tmp_df) == 0:
        print('Check on %s assemblytics' % (folder))
        continue
    tmp_df['p_feature'] = "haplotig"
    tmp_df['h_feature'] ='primary_contig'
    tmp_df['strand'] = "+"
    tmp_df['frame'] = 0
    tmp_df['query_aln_len'] = abs(tmp_df['query_end']-tmp_df['query_start'])
    tmp_df['alignment_length'] = abs(tmp_df['ref_end'] - tmp_df['ref_start'])
    tmp_df.reset_index(drop=True, inplace=True)
    tmp_df.sort_values('query', inplace=True)
    tmp_df['attribute_p'] = tmp_df.apply(lambda row: att_column_p(row['query'],row['query_start'], row['query_end'],\
                                                         row['query_length'], row['query_aln_len']), axis=1)
    
    tmp_df['attribute_h'] = tmp_df.apply(lambda row: att_column_h(row['ref'], row['ref_start'], row['ref_end'],\
                                                                  row['ref_length'], row['alignment_length']), axis=1)
    
    #generate tmp gff dataframe
    tmp_p_gff_df = tmp_df.loc[:, ['ref', 'query', 'p_feature','ref_start','ref_end', 'alignment_length', 'strand', 'frame', 'attribute_p']]
    #filter added: if start > end, swap them round. Also fix start == 0. This was mostly (only?) the case for haplotigs
    tmp_p_gff_df['comp'] = tmp_p_gff_df['ref_start'] - tmp_p_gff_df['ref_end']
    tmp_swap_index  = tmp_p_gff_df['comp'] > 0
    tmp_p_gff_df.loc[tmp_swap_index, 'ref_start'] , tmp_p_gff_df.loc[tmp_swap_index, 'ref_end'] = tmp_p_gff_df['ref_end'], tmp_p_gff_df['ref_start']
    #and if 'ref_start' == 0, set it to 1
    is_null_index  = tmp_p_gff_df['ref_start'] == 0
    tmp_p_gff_df.loc[is_null_index, 'ref_start'] = 1
    tmp_p_gff_df = tmp_p_gff_df.iloc[:, 0:9]
    tmp_p_gff_df.sort_values('ref_start',inplace=True)
    tmp_h_gff_df = tmp_df.loc[:, ['query', 'ref', 'h_feature', 'query_start', 'query_end', 'query_aln_len', 'strand', 'frame', 'attribute_h']]
    #same fix for tmp_h_gff
    tmp_h_gff_df['comp'] = tmp_h_gff_df['query_start'] - tmp_h_gff_df['query_end']
    tmp_swap_index  = tmp_h_gff_df['comp'] > 0
    tmp_h_gff_df.loc[tmp_swap_index, 'query_start'] , tmp_h_gff_df.loc[tmp_swap_index, 'query_end'] = tmp_h_gff_df['query_end'], tmp_h_gff_df['query_start']
    #and if 'query_start' == 0, set it to 1
    is_null_index  = tmp_h_gff_df['query_start'] == 0
    tmp_h_gff_df.loc[is_null_index, 'query_start'] = 1
    tmp_h_gff_df = tmp_h_gff_df.iloc[:, 0:9]
    tmp_h_gff_df.sort_values('query_start', inplace=True)
    #make the outfolder and save gff files of overlap and filtered gene files
    folder_suffix = folder.split('/')[-1]
    #get the contig
    contig = re.search(r'p([a-z]*_[0-9]*)_', folder_suffix).group(1)
    out_folder = os.path.join(OUT_PATH, folder_suffix)
    if not os.path.exists(out_folder):
        os.mkdir(out_folder)
    out_name_p = os.path.join(out_folder, folder_suffix )
    tmp_p_gff_df.to_csv(out_name_p+'.p_by_h_cov.gff', sep='\t' ,header=None, index =None)
    out_name_h =  os.path.join(out_folder, folder_suffix)
    tmp_h_gff_df.to_csv(out_name_h+'.h_by_p_cov.gff' , sep='\t' ,header=None, index =None)
    #write those out to new folder for php together with the p and h gff of genes
    p_gene_gff = gene_contig_subset(p_gff, contig, 'gene')
    p_gene_gff.to_csv(out_name_p+'.p_gene.gff', sep='\t' ,header=None, index =None)
    h_gene_gff = gene_contig_subset(h_gff, contig, 'gene')
    h_gene_gff.to_csv(out_name_h+'.h_gene.gff', sep='\t' ,header=None, index =None)
    
    #next would be to do overlap between the gene gff and the corresponding alignment gff. Write this out. Dict for each gene and its corresponding h_contig/p_contig
    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Check on /home/gamran/genome_analysis/Warrior/Richard/output/nucmer_assemblytics/Assemblytics/DK_0911_v03_pcontig_075_php_8kbp assemblytics
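
The coordinate clean-up done in the loop above (start and end swapped where start > end, and a start of 0 moved to 1) can be expressed without paired `.loc` assignments. The sketch below is a minimal alternative on made-up coordinates; it uses row-wise minimum and maximum, and `clip(lower=1)`, which matches the start == 0 fix as long as no coordinate is negative.

```python
import pandas as pd

# Made-up alignment coordinates: row 0 is reversed, row 1 starts at 0.
coords = pd.DataFrame({
    'ref_start': [500, 0, 120],
    'ref_end':   [100, 250, 480],
})

# Row-wise min/max puts the smaller coordinate in start and the larger in end.
start = coords[['ref_start', 'ref_end']].min(axis=1)
end = coords[['ref_start', 'ref_end']].max(axis=1)

# GFF coordinates are 1-based, so a start of 0 is moved to 1.
coords['ref_start'] = start.clip(lower=1)
coords['ref_end'] = end
print(coords)
```

Computing both new columns before assigning either avoids reading a column that has already been overwritten, and working on the full frame does not trigger the copy-of-a-slice warning shown above.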


In [271]:
p_gene_gff.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,pcontig_107,EVM,gene,17,416,.,-,.,ID=evm.TU.pcontig_107.1;locus_tag=DK0911_18212...
1,pcontig_107,EVM,gene,570,1161,.,+,.,ID=evm.TU.pcontig_107.2;locus_tag=DK0911_18213...
2,pcontig_107,EVM,gene,1726,4134,.,+,.,ID=evm.TU.pcontig_107.3;locus_tag=DK0911_18214...
3,pcontig_107,EVM,gene,4177,6307,.,+,.,ID=evm.TU.pcontig_107.4;locus_tag=DK0911_18215...
4,pcontig_107,EVM,gene,6742,7821,.,-,.,ID=evm.TU.pcontig_107.5;locus_tag=DK0911_18216...


In [272]:
#now loop over the outfolders
out_folders = [os.path.join(OUT_PATH, x) for x in os.listdir(OUT_PATH) if x.endswith('_php_8kbp')]
out_folders.sort()

In [273]:
p_gene_h_contig_overlap_dict = {}
for out_folder in out_folders:
    tmp_folder_content = os.listdir(out_folder)
    tmp_p_gene_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_gene.gff') ][0]
    tmp_p_gene_bed = pybedtools.BedTool(tmp_p_gene_gff).remove_invalid().saveas()
    tmp_p_by_h_gff = [os.path.join(out_folder, x) for x in tmp_folder_content if x.endswith('p_by_h_cov.gff') ][0]
    tmp_p_by_h_bed = pybedtools.BedTool(tmp_p_by_h_gff).remove_invalid().saveas()
    #generate a bed intersect
    p_gene_and_p_by_h = tmp_p_gene_bed.intersect(tmp_p_by_h_bed, wb=True)
    #check the length of the obtained intersect and skip this folder if it is empty
    print("This is the length of the intersect %i" %len(p_gene_and_p_by_h ))
    if len(p_gene_and_p_by_h) == 0:
        continue
    #turn this into a dataframe 
    p_gene_and_p_by_h_df = p_gene_and_p_by_h.to_dataframe()
    p_gene_and_p_by_h_df['p_gene'] = p_gene_and_p_by_h_df[8].str.extract(r'ID=([^;]*);', expand=False)
    p_gene_and_p_by_h_df['p_protein'] = p_gene_and_p_by_h_df['p_gene'].str.replace('TU', 'model')
    #save this in the same folder as dataframe
    p_gene_and_p_by_h_df.to_csv(tmp_p_gene_gff.replace('p_gene.gff', 'gene_haplotig_intersect.df'), sep='\t', index=None)
    p_gene_and_p_by_h_df_2 = p_gene_and_p_by_h_df.loc[:, [10, 'p_protein']]
    #make a dict for the overlap of each gene with its corresponding haplotig
    tmp_p_gene_h_overlap_df = p_gene_and_p_by_h_df_2.groupby('p_protein')[10].apply(list)
    tmp_p_gene_h_overlap_dict = dict(zip(tmp_p_gene_h_overlap_df.index, tmp_p_gene_h_overlap_df.values))
    p_gene_h_contig_overlap_dict.update(tmp_p_gene_h_overlap_dict)
    

This is the length of the intersect 786


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 472
This is the length of the intersect 552
This is the length of the intersect 500
This is the length of the intersect 498
This is the length of the intersect 457
This is the length of the intersect 358
This is the length of the intersect 334
This is the length of the intersect 417
This is the length of the intersect 302
This is the length of the intersect 365
This is the length of the intersect 344
This is the length of the intersect 402
This is the length of the intersect 95
This is the length of the intersect 336
This is the length of the intersect 291
This is the length of the intersect 302
This is the length of the intersect 241
This is the length of the intersect 241
This is the length of the intersect 275
This is the length of the intersect 246
This is the length of the intersect 135
This is the length of the intersect 250
This is the length of the intersect 205
This is the length of the intersect 0
This is the length of the intersect 207
This is the length of the intersect 89
This is the length of the intersect 107
This is the length of the intersect 173
This is the length of the intersect 219
This is the length of the intersect 115
This is the length of the intersect 224
This is the length of the intersect 196
This is the length of the intersect 213
This is the length of the intersect 165


This is the length of the intersect 185


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 148


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 89


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 144


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 154


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 115


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 114


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 6


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 131


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 74


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 71


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 81


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 73


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 77


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 104


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 43


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 72


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 92


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 82


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 108


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 84


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 79


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 0
This is the length of the intersect 3


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 52


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 42


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 35


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 40


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 13


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 64


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 26


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 8


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 21


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 17


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 21


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 23


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 31


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 10


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 14


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 20


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 7


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 7


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 8


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 12


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 8


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 7


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 25


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 4


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


This is the length of the intersect 0
This is the length of the intersect 1


['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
but file has 18 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))


In [274]:
#get the p_protein on h_protein blast output dataframe and append the h_contig overlap for each protein
_key = [x for x in blast_out_dict.keys() if x.startswith(p_genome) and x.split('.')[-2] == 'blastp' and x.endswith('outfmt6')][0]
p_gene_h_contig_overlap_df = pd.DataFrame(list(p_gene_h_contig_overlap_dict.items()), columns=['p_protein', 'h_contig_overlap'])
_tmp_df = blast_out_dict[_key]
_tmp_df = _tmp_df.merge(p_gene_h_contig_overlap_df, how='outer', left_on='Query', right_on='p_protein')
#assign back instead of using a chained inplace fillna, which newer pandas versions warn about
_tmp_df['h_contig_overlap'] = _tmp_df['h_contig_overlap'].fillna(False)
#check whether the protein hit resides on the overlapping haplotig
_tmp_df['t_contig == h_contig_overlap'] = _tmp_df.apply(lambda row: target_on_mapped_haplotig(row['t_contig'], row['h_contig_overlap']), axis=1)
#blast_out_dict[_key] = _tmp_df
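
The merge-and-flag step above can be illustrated on toy data. This is only a minimal sketch: the protein and contig IDs are invented, and `same_haplotig` is a hypothetical stand-in for the notebook's `target_on_mapped_haplotig`, which is defined in an earlier cell. It assumes that `h_contig_overlap` holds a list of haplotig names for proteins with an overlap and is missing otherwise.

```python
import pandas as pd

# Toy blast table: one row per query/target hit (all IDs are made up).
blast = pd.DataFrame({
    'Query': ['p1', 'p1', 'p2'],
    'Target': ['h1', 'h2', 'h3'],
    't_contig': ['hcontig_000_001', 'hcontig_000_002', 'hcontig_001_001'],
})

# Toy mapping of primary protein -> haplotigs that overlap its gene.
overlap = {'p1': ['hcontig_000_001'], 'p3': ['hcontig_002_001']}
overlap_df = pd.DataFrame(list(overlap.items()),
                          columns=['p_protein', 'h_contig_overlap'])

# The outer merge keeps proteins without any blast hit (p3) as well as
# hits whose query has no overlapping haplotig (p2).
merged = blast.merge(overlap_df, how='outer',
                     left_on='Query', right_on='p_protein')

def same_haplotig(t_contig, h_contig_overlap):
    """Hypothetical check: True if the hit's contig is among the overlapping haplotigs."""
    return isinstance(h_contig_overlap, list) and t_contig in h_contig_overlap

merged['t_contig == h_contig_overlap'] = merged.apply(
    lambda row: same_haplotig(row['t_contig'], row['h_contig_overlap']), axis=1)

print(merged[['Query', 'Target', 'p_protein', 't_contig == h_contig_overlap']])
```

Only the `p1`/`h1` hit is flagged here, because it is the only target that sits on a haplotig listed for its query.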

In [275]:
# p_gene_h_contig_overlap_df['p_protein'] = p_gene_h_contig_overlap_df['p_protein'].apply(lambda x: x.replace('model', 'TU'))

In [3]:
#write the dataframe again
_tmp_df.to_csv(os.path.join(OUT_PATH, _key+'.allele_analysis'), index=None, sep='\t')


In [277]:
#now mangle the data a bit
#write out t_contig == h_contig_overlap best hits

In [278]:
#Qcov_cut_off = ''
#PctID_cut_off = ''
if not (Qcov_cut_off and PctID_cut_off):
    print('No cut-off defined! If you want to filter on QCov and PctID, please define both above.')
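
The cells below only apply the QCov/PctID filter when both cut-offs are set. As a hedged alternative, the sketch below treats each cut-off as optional, so a single filter can be applied without a placeholder value for the other. The function name `apply_cutoffs` and the toy values are assumptions for illustration; only the `QCov` and `PctID` column names come from the notebook.

```python
import pandas as pd

def apply_cutoffs(df, qcov_cut_off=None, pctid_cut_off=None):
    """Keep rows that pass every cut-off that is not None (hypothetical helper)."""
    keep = pd.Series(True, index=df.index)
    if qcov_cut_off is not None:
        keep &= df['QCov'] >= qcov_cut_off
    if pctid_cut_off is not None:
        keep &= df['PctID'] >= pctid_cut_off
    return df[keep]

# Toy hits with invented coverage and identity values.
hits = pd.DataFrame({
    'Query': ['p1', 'p2', 'p3'],
    'QCov': [95.0, 60.0, 99.0],
    'PctID': [98.0, 99.0, 70.0],
})

both = apply_cutoffs(hits, qcov_cut_off=80, pctid_cut_off=90)
qcov_only = apply_cutoffs(hits, qcov_cut_off=80)
unfiltered = apply_cutoffs(hits)
print(both['Query'].tolist(), qcov_only['Query'].tolist(), len(unfiltered))
```

Testing against `None` rather than truthiness also means a cut-off of `0` is treated as a real value instead of "not set".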

In [279]:
#now pull in the proteinortho_df for allele analysis
#this will add a new column to the dataframe called PO_allele, which will be True when proteinortho
#reports the query/target pair as orthologs
#co-orthologs will be filtered based on same contig and e-values

In [4]:
os.chdir('/home/gamran/genome_analysis/Warrior/Richard/scripts')
%run 'DK_0911_v03_proteinortho.ipynb'

Checking for correct files and directories...
gff file at: /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/proteinortho/ph_ctg_516/DK_0911_v03_h_ctg.protein.gff already exists... no new gff file was generated.
gff file at: /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/proteinortho/ph_ctg_516/DK_0911_v03_p_ctg.protein.gff already exists... no new gff file was generated.
Files and directories required for proteinortho analysis exist.

Checking whether proteinortho files already exist in /home/gamran/genome_analysis/Warrior/Richard/output/defining_alleles/proteinortho...
Folder reference dictionary:
{'poff-graph': 1, 'pin': 2, 'proteinortho': 1, 'phr': 2, 'proteinortho-graph': 1, 'blast-graph': 1, 'faa': 2, 'psq': 2, 'gff': 2, 'ffadj-graph': 1, 'poff': 1}
All proteinortho files, according to the reference dictionary, appear to already exist.

Proteinortho appears to have been ran previously, therefore it was not run this time.


In [281]:
poff_graph_header = ['Target', 'Query', 'evalue_ab', 'bitscore_ab', 'evalue_ba', 'bitscore_ba', 'same_strand' , 'simscore']
po_df = pd.read_csv(poff_graph_fn, sep='\t', header=None, names=poff_graph_header, comment='#' )
po_df['Query'] = po_df['Query'].map(locusToIdDict)
po_df['Target'] = po_df['Target'].map(locusToIdDict)
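
A small, self-contained sketch of the loading step above, using an in-memory table in place of `poff_graph_fn`. The file content, the locus names and the `locus_to_id` mapping are invented for illustration, and the real graph file may carry additional comment lines. The point of the sketch is that `Series.map` with a dictionary silently turns unmapped IDs into `NaN`, which is worth counting before the pairs are used.

```python
import io
import pandas as pd

poff_graph_header = ['Target', 'Query', 'evalue_ab', 'bitscore_ab',
                     'evalue_ba', 'bitscore_ba', 'same_strand', 'simscore']

# Invented stand-in for the graph file: comment lines start with '#'.
toy_graph = (
    "# toy_a.faa\ttoy_b.faa\n"
    "locusA1\tlocusB1\t1e-50\t200\t1e-48\t198\t1\t0.95\n"
    "locusA2\tlocusB9\t1e-10\t60\t1e-9\t58\t1\t0.40\n"
)

po_df = pd.read_csv(io.StringIO(toy_graph), sep='\t', header=None,
                    names=poff_graph_header, comment='#')

# Hypothetical locus -> protein ID mapping; 'locusB9' is deliberately missing.
locus_to_id = {'locusA1': 'p1', 'locusA2': 'p2', 'locusB1': 'h1'}
po_df['Query'] = po_df['Query'].map(locus_to_id)
po_df['Target'] = po_df['Target'].map(locus_to_id)

# IDs absent from the mapping become NaN rather than raising an error.
n_unmapped = int(po_df[['Query', 'Target']].isna().sum().sum())
print(po_df[['Target', 'Query']])
print('unmapped IDs:', n_unmapped)
```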

In [282]:
#filter out the co-orthologs
#co_po_df = po_df[po_df.Target.str.contains(',')].loc[:, ['Target', 'Query']].copy()
#po_df = po_df[~po_df.Target.str.contains(',')].loc[:, ['Target', 'Query']]
#not required anymore as those are pairs in the graph file as well

In [283]:
#now add a comparison column to po_df and _tmp_df
po_df['comp'] = po_df['Query'] + po_df['Target']
_tmp_df['comp'] = _tmp_df['Query'] + _tmp_df['Target']
#generate a new PO_allele column
_tmp_df['PO_allele'] = False
#and set it to True where the Query/Target pair is also reported by proteinortho
_tmp_df.loc[_tmp_df['comp'].isin(po_df['comp'].unique()), 'PO_allele'] = True
#now drop the comp column again
_tmp_df = _tmp_df.drop('comp', axis=1)
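
The cell above matches pairs by concatenating `Query` and `Target` into one string. A possible alternative, sketched here on invented IDs, compares the pairs as tuples via a `MultiIndex`, which avoids any chance that two different pairs concatenate to the same string. This is an illustration under those assumptions, not the method used in the notebook.

```python
import pandas as pd

# Toy blast hits and toy proteinortho pairs (IDs are made up).
blast = pd.DataFrame({'Query': ['p1', 'p1', 'p2'],
                      'Target': ['h1', 'h2', 'h3']})
po = pd.DataFrame({'Query': ['p1', 'p2'],
                   'Target': ['h1', 'h9']})

# Build (Query, Target) tuples on both sides and test membership.
blast_pairs = pd.MultiIndex.from_frame(blast[['Query', 'Target']])
po_pairs = pd.MultiIndex.from_frame(po[['Query', 'Target']].drop_duplicates())
blast['PO_allele'] = blast_pairs.isin(po_pairs)

print(blast)
```

Only the `p1`/`h1` row is marked, since `p2`/`h3` does not appear among the proteinortho pairs.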

In [284]:
# This accounts for proteinortho allele analysis as well
_tmp_grouped = _tmp_df[(_tmp_df['t_contig == h_contig_overlap'] == True)\
                      & (_tmp_df['PO_allele'] ==  True)].groupby('Query')

#filter on the e-value and then on the BitScore
best_hits_t_contig_and_h_contig = _tmp_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())])
best_hits_t_contig_and_h_contig = best_hits_t_contig_and_h_contig.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])

out_name = p_genome + '.h_contig_overlap.alleles'

best_hits_t_contig_and_h_contig.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = p_genome + '.h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")
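
The two-step best-hit selection (lowest e-value first, then highest bit score) recurs in several cells. A minimal sketch of how it could be wrapped in one helper is shown below. The name `best_hits` and the toy table are assumptions; the column names `Query`, `e-value` and `BitScore` are those used in the notebook. It uses `groupby(...).transform` instead of `groupby(...).apply`, so the result keeps a flat index.

```python
import pandas as pd

def best_hits(df, query_col='Query', evalue_col='e-value', bitscore_col='BitScore'):
    """Per query, keep the rows with the lowest e-value, then the highest bit score."""
    min_evalue = df.groupby(query_col)[evalue_col].transform('min')
    lowest = df[df[evalue_col] == min_evalue]
    max_bitscore = lowest.groupby(query_col)[bitscore_col].transform('max')
    return lowest[lowest[bitscore_col] == max_bitscore].reset_index(drop=True)

# Toy hits: p1 has two hits tied on e-value, separated by bit score.
hits = pd.DataFrame({
    'Query': ['p1', 'p1', 'p1', 'p2'],
    'Target': ['h1', 'h2', 'h3', 'h4'],
    'e-value': [0.0, 0.0, 1e-5, 1e-20],
    'BitScore': [300.0, 250.0, 400.0, 90.0],
})

best = best_hits(hits)
print(best[['Query', 'Target']])
```

Note that `h3` is dropped despite its higher bit score, because the e-value filter is applied first. Hits that tie on both e-value and bit score are all kept, as in the cell above.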

In [285]:
#test if filtering like this works
if set(_tmp_df[(_tmp_df['t_contig == h_contig_overlap'] == True)\
                      & (_tmp_df['PO_allele'] ==  True)]['Query'].unique()) == set(best_hits_t_contig_and_h_contig['Query'].unique().tolist()):
    print("All good please proceed!")
else:
    print('The filter threw an error. Not the expected output')

All good please proceed!


In [286]:
#now drop everything that was already captured
query_ids_to_drop = best_hits_t_contig_and_h_contig['Query'].unique().tolist()
target_ids_to_drop = best_hits_t_contig_and_h_contig['Target'].unique().tolist()

In [287]:
#get all queries whose hit lies on a haplotig of the same primary contig (q_contig == t_contig) but that have not been caught before
_tmp_groupbed = _tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))&\
                        (_tmp_df['q_contig == t_contig'] == True)&\
                        (_tmp_df['PO_allele'] ==  True)].groupby('Query')
best_hits_same_contig_no_overlap = _tmp_groupbed.apply(lambda g: g[(g['e-value'] == g['e-value'].min())]) 
best_hits_same_contig_no_overlap = best_hits_same_contig_no_overlap.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
out_name = p_genome + '.no_specific_h_contig_overlap.alleles'
best_hits_same_contig_no_overlap.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = p_genome + '.no_specific_h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

In [288]:
#test if filtering like this works
if set(_tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))& (_tmp_df['q_contig == t_contig'] == True)\
              & (_tmp_df['PO_allele'] ==  True)]['Query'].unique())\
        == set(best_hits_same_contig_no_overlap['Query'].unique().tolist()):
    print("All good please proceed!")
else:
    print('The filter threw an error. Not the expected output')

All good please proceed!


In [289]:
#now drop everything that was already captured
query_ids_to_drop = query_ids_to_drop + best_hits_same_contig_no_overlap['Query'].unique().tolist()
target_ids_to_drop = target_ids_to_drop + best_hits_same_contig_no_overlap['Target'].unique().tolist()

In [290]:
#get everything else, meaning hits that are not on a corresponding haplotig but on another haplotig

In [291]:
_tmp_groupbed = _tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))&\
                        (_tmp_df['q_contig == t_contig'] == False)\
                        & (_tmp_df['Target'] != 'NaN' )&
                        (_tmp_df['PO_allele'] ==  True)].groupby('Query')
best_hits_different_contig = _tmp_groupbed.apply(lambda g: g[(g['e-value'] == g['e-value'].min()) ])
best_hits_different_contig = best_hits_different_contig.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
out_name = p_genome + '.no_respective_h_contig_overlap.alleles'
best_hits_different_contig.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#now save including the filter as well
if Qcov_cut_off and PctID_cut_off:
    out_name = p_genome + '.no_respective_h_contig_overlap.Qcov%s.PctID%s.alleles' % (Qcov_cut_off,PctID_cut_off)
    best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]\
    .loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

In [292]:
if set(_tmp_df[(~_tmp_df['Query'].isin(query_ids_to_drop))& (_tmp_df['q_contig == t_contig'] == False) &\
               (_tmp_df['Target'] != 'NaN' )\
              &(_tmp_df['PO_allele'] ==  True)]['Query'].unique().tolist())\
        == set(best_hits_different_contig['Query'].unique().tolist()):
    print("All good please proceed!")
else:
    print('The filter threw an error. Not the expected output')

All good please proceed!


In [293]:
#now drop everything that was already captured
query_ids_to_drop = query_ids_to_drop + best_hits_different_contig['Query'].unique().tolist()
target_ids_to_drop = target_ids_to_drop + best_hits_different_contig['Target'].unique().tolist()

In [294]:
#now print out the p_proteins without alleles
out_name = p_genome + '.no_alleles'
_tmp_df[~_tmp_df['Query'].isin(query_ids_to_drop)].drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

#with filtering, separate lists of query and target IDs to drop are required
if Qcov_cut_off and PctID_cut_off:
    query_ids_to_drop_filtered = \
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID']>= PctID_cut_off)]['Query'].unique().tolist() \
    + best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]['Query'].unique().tolist()\
    + best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]['Query'].unique().tolist()
    
    target_ids_to_drop_filtered = \
    best_hits_t_contig_and_h_contig[(best_hits_t_contig_and_h_contig['QCov'] >= Qcov_cut_off) & (best_hits_t_contig_and_h_contig['PctID']>= PctID_cut_off)]['Target'].unique().tolist()\
    + best_hits_same_contig_no_overlap[(best_hits_same_contig_no_overlap['QCov'] >= Qcov_cut_off) & (best_hits_same_contig_no_overlap['PctID'] >= PctID_cut_off)]['Target'].unique().tolist() \
    + best_hits_different_contig[(best_hits_different_contig['QCov'] >= Qcov_cut_off) & (best_hits_different_contig['PctID'] >= PctID_cut_off)]['Target'].unique().tolist()
    out_name = p_genome + '.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    #save the dataframe
    _tmp_df[~_tmp_df['Query'].isin(query_ids_to_drop_filtered)].drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)    
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")                                                                                                                             
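
The cells above assign each primary protein to the first category it qualifies for and exclude it from all later ones. The sketch below condenses that tiered logic on toy data. The tier labels, IDs and flag values are invented for illustration; the flag column names and the `'NaN'` string placeholder for a missing target follow the notebook.

```python
import pandas as pd

# Toy hit table with the flag columns used above (all values are made up).
hits = pd.DataFrame({
    'Query': ['p1', 'p1', 'p2', 'p3', 'p4'],
    'Target': ['h1', 'h2', 'h3', 'h4', 'NaN'],
    't_contig == h_contig_overlap': [True, False, False, False, False],
    'q_contig == t_contig': [True, True, True, False, False],
    'PO_allele': [True, True, True, True, False],
})

# Ordered tiers: a query is assigned to the first tier in which it has a hit.
tiers = [
    ('overlapping_haplotig', hits['t_contig == h_contig_overlap']),
    ('same_primary_contig', hits['q_contig == t_contig']),
    ('different_contig', ~hits['q_contig == t_contig'] & (hits['Target'] != 'NaN')),
]

assigned = {}
for name, mask in tiers:
    # Only consider proteinortho-supported hits of queries not assigned yet.
    candidates = hits[mask & hits['PO_allele'] & ~hits['Query'].isin(list(assigned))]
    for query in candidates['Query'].unique():
        assigned[query] = name

# Whatever is left after all tiers has no allele.
no_alleles = sorted(set(hits['Query']) - set(assigned))
print(assigned)
print(no_alleles)
```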

In [295]:
#save back the dataframe with '.analyzed' added to the key
blast_out_dict[_key+'.analyzed'] = _tmp_df

In [296]:
#pull out the h on p blast dataframe
_key = [x for x in blast_out_dict.keys() if x.startswith(h_genome) and x.split('.')[-2] == 'blastp' and x.endswith('outfmt6')][0]
_tmp_df = blast_out_dict[_key]


In [297]:
#pull out all h_proteins without alleles
_tmp_df_no_alleles = _tmp_df[~_tmp_df['Query'].isin(target_ids_to_drop)]
out_name = h_genome + '.no_alleles'
_tmp_df_no_alleles.drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

In [298]:
#now pull out all h_proteins that have a blast hit but were left behind by the selection
#based on proteinortho and best blast hits above. No QCov and PctID filtering.
_tmp_df_no_alleles_grouped = _tmp_df_no_alleles[_tmp_df_no_alleles["Target"] != 'NaN'].groupby('Query')
h_protein_no_alleles = _tmp_df_no_alleles_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())]) 
h_protein_no_alleles = h_protein_no_alleles.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
out_name = h_genome + '.best_p_hits.no_alleles'
h_protein_no_alleles.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
#write down the h_proteins with no blast hit at all, meaning blasting h on p gave no results
out_name = h_genome + '.no_p_hits.no_alleles'
_tmp_df_no_alleles[_tmp_df_no_alleles["Target"] == 'NaN']['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None) 

In [311]:
#now pull out all h_proteins that have a blast hit but were left behind with selection of best blast hits above. With QCov and PctID filtering.
if Qcov_cut_off and PctID_cut_off:
    #pull out all h proteins without alleles when filtering was applied
    _tmp_df_no_alleles_filtered = _tmp_df[~_tmp_df['Query'].isin(target_ids_to_drop_filtered)]
    out_name = h_genome + '.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    _tmp_df_no_alleles_filtered.drop_duplicates(subset='Query')['Query'].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)

    #pull out all the h proteins with a blast hit when filtering was applied and filter those as well
    #first get rid of all NAs; in a future version maybe convert QCov and PctID to floats or ints so this is not necessary anymore
    _tmp_df_no_alleles_filtered_no_NA = _tmp_df_no_alleles_filtered[(_tmp_df_no_alleles_filtered["Target"] != 'NaN')]
    _tmp_df_no_alleles_filtered_grouped = \
        _tmp_df_no_alleles_filtered_no_NA[(_tmp_df_no_alleles_filtered_no_NA['QCov'] >= Qcov_cut_off) &\
                                    (_tmp_df_no_alleles_filtered_no_NA['PctID'] >= PctID_cut_off)].groupby('Query') 

    h_protein_no_alleles_filtered = _tmp_df_no_alleles_filtered_grouped.apply(lambda g: g[(g['e-value'] == g['e-value'].min())]) 
    h_protein_no_alleles_filtered = h_protein_no_alleles_filtered.reset_index(drop=True).groupby('Query').apply(lambda g: g[(g['BitScore'] == g['BitScore'].max())])
    out_name = h_genome + '.best_p_hits.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    h_protein_no_alleles_filtered.loc[:, ['Query', 'Target']].to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
    target_ids_to_drop_filtered = target_ids_to_drop_filtered + h_protein_no_alleles_filtered['Query'].unique().tolist()
    out_name = h_genome + '.no_p_hits.Qcov%s.PctID%s.no_alleles' % (Qcov_cut_off,PctID_cut_off)
    _tmp_df_no_alleles_filtered[~_tmp_df_no_alleles_filtered['Query'].isin(target_ids_to_drop_filtered)].drop_duplicates(subset='Query')['Query']\
    .to_csv(os.path.join(ALLELE_PATH, out_name), sep='\t', index=None, header=None)
else:
    print("No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.")

No filter selected for Qcov and PctID cut-off. If single filter desired set other filter to 1000.


In [312]:
#write the dataframe again
_tmp_df.to_csv(os.path.join(OUT_PATH, _key+'.allele_analysis'), index=None, sep='\t')

In [313]:
_tmp_df.columns

Index(['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery',
       'StopQuery', 'StartTarget', 'StopTarget', 'e-value', 'BitScore',
       'QLgth', 'QCov', 'q_contig', 't_contig', 'q_contig == t_contig'],
      dtype='object')

In [314]:
#might be good to write these analyzed dataframes out as well to make life a bit easier
blast_out_dict['DK_0911_v03_p_ctg.DK_0911_v03_h_ctg.0.001.blastp.outfmt6.analyzed'].columns

Index(['Query', 'Target', 'PctID', 'AlnLgth', 'NumMis', 'NumGap', 'StartQuery',
       'StopQuery', 'StartTarget', 'StopTarget', 'e-value', 'BitScore',
       'QLgth', 'QCov', 'q_contig', 't_contig', 'q_contig == t_contig',
       'p_protein', 'h_contig_overlap', 't_contig == h_contig_overlap',
       'PO_allele'],
      dtype='object')
