### Notebook to get synteny measures between two assemblies
The goal is to get a tool that takes in orthofinder results and a CDS files to check fof the length of synteny with different gene categories.

The idea is that BUSCOs are more syntenous then effectors or such. 

The program tries to count how many neighbours of an allele pair are in the same orthogroup. This requires first to anchor the allele pairing e.g. find out what the 1:1 allele is and then walk outward to see if each others neighbours are in the same orthogroup. For now we won't allow for any skips and just go one by one.

GENOMEA_orthoarray =  [1,2,3,4]
GENOMEB_orthoarray =  [1,2,3,4]

should result into a tuple that where the first element is the number of observed matches and the second is the number of possible matches. here this would be (8,8).

If this analysis is performed on two different set of gene groups this may tell something if microsynteny is more conserved within one gene group then another.

#### initial outline

get a gene ID -> get its neightbours n in two different arrays up and down stream -> get the orthogroup arrays for the neighbours 
get a gene ID  -> get orthogroup of the gene -> get all members of the orthogroup that belong to the other genome -> get their neighbours +/- 1 -> get the best seed where both neighbours add up,  
* if we have only one best match that is easy. Use this seed as 1:1 match get the neighbourhood array up and down -> compare the arrays one element at a time and safe the tuple in a dictonary.
* if multiple hits. Move one further out with each hit and look at those with the same ideas.

Things also to generate:
A match dictionary for allele pairing.  
A unqiue gene dictionary.  
A paraloge dictionray. Meaning where there is no allele pairing but thing are in the same ortho group.

In [1]:
import pandas as pd
import os
import re
from Bio import SeqIO
from Bio import SeqUtils
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil
from Bio.Seq import Seq
import pysam
from Bio import SearchIO
import json
import glob
import scipy.stats as stats
import statsmodels as sms
import statsmodels.sandbox.stats.multicomp
import distance
import seaborn as sns
from pybedtools import BedTool
import matplotlib
from sklearn.externals.joblib import Parallel, delayed
import itertools as it
import tempfile
from scipy.signal import argrelextrema
import scipy
from IPython.display import Image
from PIL import Image
from collections import OrderedDict



In [2]:
### define some path that at the end should come as args
ORTHOFINDER_FILE_NAME = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/OrthoFinder_all/Orthogroups_2.txt'
QUERY_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/Pst104E_annotations/Pst_104E_v13_p_ctg.gene.bed'
SUBJECT_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/DK0911_annotations/DK_0911_v04_p_ctg.genes.gene.bed'
QUERY_SELECT_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/test_folder/Pst_104E_v13_p_ctg.gene1000.bed'

In [3]:
#now write some functions
def get_neighbours(gene_id, bed_filename, n=5, direction='up'):
    """A function that either takes a filename to return an array of downstream and upstream neighbouring genes.
    Input: 
        gene_id
        bed6 filename
        n being the number of neighbours we want to get
        direction being up or down.
    Output:
        returns the largest possible array of neighbours up to n.
        The 0 element is always the closest neighbour now matter if you return and 'up' or 'down' array."""
    bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
    try:
        bed_df = pd.read_csv(bed_filename, sep='\t', header=None, names=bed_6_header)
    except:
        print('Check if the bedfiles are bed6')
    if not direction in ['up', 'down']:
        print('Ensure direction is up or down.')
        
    #fix to make sure the gene id is actually in the bed_file if not just return an empty list    
    if gene_id not in bed_df['gene_id'].unique():
        return []
    gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
    contig = bed_df.loc[gene_index, ['chrom']]['chrom']
    contig_index = bed_df[bed_df['chrom']== contig].index
    if direction == 'up':
        index_list = []
        if (gene_index+(n)) in contig_index:
            for i in range(gene_index+1, gene_index+(n+1)):
                index_list.append(i)
        else:
            for i in range(gene_index+1, contig_index[-1]+1):
                index_list.append(i)
        return bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist()
    if direction == 'down':
        index_list = []
        if (gene_index-n) in contig_index:
            for i in range(gene_index-(n), gene_index):
                index_list.append(i)
        else:
            for i in range(contig_index[0], gene_index):
                index_list.append(i)
        index_list.reverse()
        return bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist()

In [4]:
def get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, other_id):
    """The function returns all the potential orthologs of the comparative species
    Input:
        gene_id
        gene_to_ortho_dict, the ortho dict
        orthofinder_dict, the dictionary of the orthgroups to get all genes belonging to the orthogroup in question.
        other_id, is the identifier of the comparative species."""

    return [x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]] if x.startswith(other_id)]

In [5]:
def count_pairings(query_ortho, subject_ortho, simple = True):
    """Function that counts the pair wise overlap of two lists.
    In the simple = True mode this is a stricked 1:1 pairing.
    In the simple = Flase mode this is a sliding window of 1 vs [0, 1 , 2] elements.
    Input: 
        query_ortho list
        subject_orth list
        simple True or False
    output:
        obs the number of observed pairings
        max_len the number of potential pairings"""
    if len(query_ortho) <= len(subject_ortho):
        max_len = len(query_ortho)
    else:
        max_len = len(subject_ortho)
    obs = 0
    if simple == True:
        for i in range(0, max_len):
            if query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
        return obs, max_len
            
        
    if simple == False:
        for i in range(0, max_len):
            if i > 0 and i < max_len -1:    
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1] \
                or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            elif i == max_len -1:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1]:
                    obs = obs + 1
            elif query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
            elif i == 0:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            
        return obs, max_len

In [6]:
def get_up_and_down(gene_id, query_downstream_ortho,query_upstream_ortho, subject_bed_fn, n, simple):
    """A function that returns a list of two tuples. The first tuple is the for the downstream pairing the second for the
    upstream pairing of obs, max possible matches of orthologs.
    Input: 
           Gene_id to check from the subject genome, e.g. initial orthogroup matches of query gene.
           query_downstream_ortho is the list of downstream ortho groups from the gene_ids neighbours.
           query_upstream_ortho is the list of upstream ortho groups from the gene_ids neighbours.
           subject_bed_fn is the absolute path of the subject_bed_fn to get the neighbouring genes of the gene_id.
           n is the numbers of neighbours to search.
           simple can be True or False for searching without skiping (window of three) or with skipping enabled.
    Output:
           [(obs_down, max_down), (obs_up, max_up)]"""
    
    
    if simple not in [True, False]:
        simple = True
    
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='down')]
    
    #define in case something has no upstream or downstream hits
    obs_up, max_up, obs_down, max_down = 0, 0, 0, 0
    
    if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 
        if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
            #now look at the downstream
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
            
            if len(query_downstream_ortho) > 0 and len(query_upstream_ortho):
                if query_downstream_ortho[0] == query_upstream_ortho[0]:
                    print("Orthogroups of up and downstream are the same. Gene_id: %s" % gene_id)
                    if len(subject_downstream_ortho) > 0 and len(query_upstream_ortho) >0 :
                        obs_up_new, max_up_new = count_pairings(query_upstream_ortho, subject_downstream_ortho, simple=simple)
                        obs_down_new, max_down_new = count_pairings(query_downstream_ortho, subject_upstream_ortho, simple=simple)
                        if up_down_ratio([(obs_down_new, max_down_new), (obs_up_new, max_up_new)])\
                        > up_down_ratio([(obs_down, max_down), (obs_up, max_up)]):
                        #think about if we also want to test for 
                        #(max_down_new+max_up_new) > (max_down, max_up) :
                            obs_up, max_up = obs_up_new, max_up_new
                            obs_down, max_down = obs_down_new, max_down_new
            
            
            return [(obs_down, max_down), (obs_up, max_up)]
        
    if len(subject_downstream_ortho) > 0 and len(query_upstream_ortho) >0 :
        if query_upstream_ortho[0] == subject_downstream_ortho[0]:
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, simple=simple)
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, simple=simple)
            return [(obs_down, max_down), (obs_up, max_up)]
        
    if len(subject_downstream_ortho) > 0 and len(query_downstream_ortho) > 0:
        if subject_downstream_ortho[0] == query_downstream_ortho[0]:
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
            #now look at the downstream
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
            return [(obs_down, max_down), (obs_up, max_up)]
    
    return [(obs_down, max_down), (obs_up, max_up)]

In [79]:
def count_pairings_array(query_ortho, subject_ortho, n, simple = True):
    """Function that does a pairwise by element comparison of two arrays.
    In the simple = True mode this is a stricked 1:1 pairing.
    In the simple = Flase mode this is a sliding window of 1 vs [0, 1 , 2] elements.
    Input: 
        query_ortho list
        subject_orth list
        simple True or False
    output:
        result array that has the positional overlap of the array.
        e.g. for n = 8 with one list being lenght 8 and the other length 6
        [0, 1, 0, 1, 0, 1, nan, nan]"""
    
    array = np.empty(n)
    array[:] = np.nan
    array
    
    if simple == True:
        for i in range(0, n):
            if i < len(query_ortho) and i < len(subject_ortho):
                if query_ortho[i] == subject_ortho[i]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
            else:
                continue
        return array
            
        
    if simple == False:

        for i in range(0, len(query_ortho)):
            
            if i == len(subject_ortho)-1:
                if query_ortho[i] == subject_ortho[i-1] or query_ortho[i] == subject_ortho[i]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
            elif i > 0 and i < len(subject_ortho):    
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1] \
                or query_ortho[i] == subject_ortho[i+1]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)

            elif i == 0:
                do_it = False
                try:
                    if query_ortho[i] == subject_ortho[i+1]:
                        do_it = True
                except IndexError:
                        pass
                try:
                    if query_ortho[i] == subject_ortho[i]:
                        do_it = True
                except IndexError:
                        pass
                if do_it == True:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
        return array
        
        

In [8]:
count_pairings_array([0, 1, 2, 6, 6, 7],[0, 0, 1], n=5, simple = False )

array([  1.,   1.,   0.,  nan,  nan])

In [29]:
def get_up_and_down_array(gene_id, query_downstream_ortho,query_upstream_ortho, subject_bed_fn, n, simple):
    """A function that returns a list of two tuples. The first tuple is the for the downstream pairing the second for the
    upstream pairing of obs, max possible matches of orthologs.
    Input: 
           Gene_id to check from the subject genome, e.g. initial orthogroup matches of query gene.
           query_downstream_ortho is the list of downstream ortho groups from the gene_ids neighbours.
           query_upstream_ortho is the list of upstream ortho groups from the gene_ids neighbours.
           subject_bed_fn is the absolute path of the subject_bed_fn to get the neighbouring genes of the gene_id.
           n is the numbers of neighbours to search.
           simple can be True or False for searching without skiping (window of three) or with skipping enabled.
    Output:
           [(obs_down, max_down), (obs_up, max_up)]"""
    
    
    if simple not in [True, False]:
        simple = True
    
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='down')]
    
    #now move to empty arrays
    
    
    #define in case something has no upstream or downstream hits
    array = array = np.empty(n)
    array[:] = np.nan
    array
    up_array, down_array = array, array
    
    if simple == True:
    
        if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 

            if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho, n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)

                if len(query_downstream_ortho) > 0 and len(query_upstream_ortho) > 0:
                    if query_downstream_ortho[0] == query_upstream_ortho[0]:
                        print("Orthogroups of up and downstream are the same. Gene_id: %s" % gene_id)
                        up_array_new = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                        down_array_new = count_pairings(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                        if up_down_ratio_array([down_array_new, up_array_new])\
                            > up_down_ratio_array([down_array, up_array]):
                            #think about if we also want to test for 
                            #(max_down_new+max_up_new) > (max_down, max_up) :
                            down_array, up_array = down_array_new, up_array_new


                return [down_array, up_array]

        if len(subject_downstream_ortho) > 0 and len(query_upstream_ortho) >0 :
            if query_upstream_ortho[0] == subject_downstream_ortho[0]:
                up_array = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                down_array = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                return [down_array, up_array]

        if len(subject_downstream_ortho) > 0 and len(query_downstream_ortho) > 0:
            if subject_downstream_ortho[0] == query_downstream_ortho[0]:
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho,n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)
                return [down_array, up_array]

        return [down_array, up_array]
    
    elif simple == False:
        
        do_it = False
        one_exists = False
        
        if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 
            try:
                if query_upstream_ortho[0] == subject_upstream_ortho[0] or \
                query_upstream_ortho[0] == subject_upstream_ortho[1]:
                    do_it = True
            
            except IndexError:
                if query_upstream_ortho[0] == subject_upstream_ortho[0]:
                    do_it = True
                    
            if do_it == True:        
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho, n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)

                if len(query_downstream_ortho) > 0 and len(query_upstream_ortho) > 0:
                    do_it_too = False
                    try:
                        if query_downstream_ortho[0] == query_upstream_ortho[0] or \
                        query_upstream_ortho[0] == subject_upstream_ortho[1]:
                            do_it_too = True
                    except IndexError:
                        if query_downstream_ortho[0] == query_upstream_ortho[0]:
                            do_it_too = True
                            
                        if do_it_too == True:
                            print("Orthogroups of up and downstream are the same. Gene_id: %s" % gene_id)
                            up_array_new = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                            down_array_new = count_pairings(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                            if up_down_ratio_array([down_array_new, up_array_new])\
                                > up_down_ratio_array([down_array, up_array]):
                                #think about if we also want to test for 
                                #(max_down_new+max_up_new) > (max_down, max_up) :
                                down_array, up_array = down_array_new, up_array_new


                    return [down_array, up_array]

        if len(subject_downstream_ortho) > 0 and len(query_upstream_ortho) >0 :
            try:
                
                if query_upstream_ortho[0] == subject_downstream_ortho[0] or \
                query_upstream_ortho[0] == subject_upstream_ortho[1]:
                    do_it = True
            except IndexError:
                if query_upstream_ortho[0] == subject_downstream_ortho[0]:
                    do_it = True
            if do_it == True:
                up_array = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                down_array = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                return [down_array, up_array]

        if len(subject_downstream_ortho) > 0 and len(query_downstream_ortho) > 0:
            try: 
                if subject_downstream_ortho[0] == query_downstream_ortho[0] or \
                subject_downstream_ortho[0] == query_downstream_ortho[1]:
                    do_it = True
            except IndexError:
                if subject_downstream_ortho[0] == query_downstream_ortho[0]:
                    do_it = True
            if do_it == True:
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho,n, simple=simple)       
                    #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)
                return [down_array, up_array]

        return [down_array, up_array]

In [38]:
def array_to_tuple(array_list):
    if len(array_list) != 2:
        print('The length of the list is not 2.')
        return 0
    tuple_list = []
    for array in array_list:
        tuple_list.append(((np.nan_to_num(array)).sum(),np.count_nonzero(~np.isnan(array)) ))
    return tuple_list

In [11]:
def up_down_ratio_array(up_down_list):
    if len(up_down_list) != 2:
        print('The length of the list is not 2.')
        return 0
    sum_observed = np.nan_to_num(up_down_list[0]).sum() + np.nan_to_num(up_down_list[1]).sum()
    sum_possible = np.count_nonzero(~np.isnan(up_down_list[0])) + np.count_nonzero(~np.isnan(up_down_list[1]))
    if sum_possible > 0:
        
        return sum_observed/sum_possible
    else:
        return 0

In [12]:
def up_down_ratio(up_down_list):
    if len(up_down_list) == 0:
        return 0
    if (up_down_list[0][1] + up_down_list[1][1]) > 0:
        ratio = (up_down_list[0][0] + up_down_list[1][0]) / (up_down_list[0][1] + up_down_list[1][1])
        return ratio
    else:
        return 0

In [73]:
#generate dicts that will be used to track things
allele_dict = {}
singleton_dict = {}
paralog_dict = {}
up_match_dict = {}
down_match_dict = {}

In [74]:
#get the identifiers if not provided.
query_id = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]
subject_id = pd.read_csv(SUBJECT_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]

In [75]:
#get the orthofinder results read in
#this is the dict for all orthogroups and what members they have
orthofinder_dict = {}
with open(ORTHOFINDER_FILE_NAME) as fh:
    for line in fh:
        line.strip()
        orthofinder_dict[int(line.split(':')[0].strip('OG'))] = line.split(':')[1].strip().split(' ')

In [76]:
#this is the dict for proteins and what orthogroup they have
gene_to_ortho_dict = {}
for key,value in orthofinder_dict.items():
    for item in value:
        gene_to_ortho_dict[item] = key

In [64]:
#gene_id = 'Pst104E_00003'
gene_id = 'Pst104E_00000'
print([x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]]if x.startswith('DK0911')])
n=8
simple = False

['DK0911_15056']


In [65]:
#populate the query arrays for neighbours and their corresponding ortho group ids
query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]

In [66]:
#get the first orthologs (aka potential allele pairings)
first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
if len(first_orthologs) == 0:
    singleton_dict[gene_id] = True
elif len(first_orthologs) == 1:
    down_up_list = get_up_and_down_array(first_orthologs[0],\
                        query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6,\
                                         n=n, simple=simple)
    down_match_dict[gene_id], up_match_dict[gene_id] = down_up_list[0], down_up_list[1]
    if sum((~np.isnan(down_up_list[0]))) > 0 and sum((~np.isnan(down_up_list[1]))) > 0:
        allele_dict[gene_id] = first_orthologs[0]
    else:
        paralog_dict[gene_id] = first_orthologs[0]
elif len(first_orthologs) > 1:
    ortho_dict = {}
    for ortho in first_orthologs:
        down_up_list = get_up_and_down_array(ortho,query_downstream_ortho,query_upstream_ortho,\
                                           SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
        if sum((~np.isnan(down_up_list[0]))) == 0 and sum((~np.isnan(down_up_list[1]))) == 0:
            continue
        else:
            ortho_dict[ortho] = down_up_list
    
    
    max_list = [(0,0), (0,0)]
    real_ortho = ''
    real_ortho_list = []
    
    for ortho, ortho_list in ortho_dict.items():
        ortho_tuple_list = array_to_tuple(ortho_list)
        if up_down_ratio(ortho_tuple_list) > up_down_ratio(max_list):
            print(ortho_tuple_list)
            max_list = ortho_list
            real_ortho = ortho
            real_ortho_list = ortho_list
            down_up_list = real_ortho_list
    if max_list != [(0,0), (0,0)]:
        allele_dict[gene_id] = real_ortho
        down_match_dict[gene_id], up_match_dict[gene_id] = ortho_list[0], ortho_list[1]
    else:
        paralog_dict[gene_id] = list(ortho_dict.keys())

print(down_up_list)
#think about catching paralogs meaning that +1/-1 are not in the same orthogroup

[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ 1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.])]


In [67]:
down_match_dict

{'Pst104E_00000': array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]),
 'Pst104E_00001': array([  1.,  nan,  nan,  nan,  nan,  nan,  nan,  nan]),
 'Pst104E_00002': array([ 1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.]),
 'Pst104E_00022': array([ 1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.]),
 'Pst104E_00033': array([  1.,   1.,   0.,   0.,   0.,  nan,  nan,  nan])}

In [68]:
up_match_dict

{'Pst104E_00000': array([ 1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.]),
 'Pst104E_00001': array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.]),
 'Pst104E_00022': array([ 0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.]),
 'Pst104E_00033': array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.])}

In [20]:
print(query_downstream_ortho, query_upstream_ortho)

[3456, 34, 1077, 6914, 1007, 14, 3819, 8542] [94582, 685, 3115, 18428, 39658, 22733, 262, 15049]


In [21]:
if len(first_orthologs) == 1:
    subject_ortho_up = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='up')]
    subject_ortho_down = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')]
elif len(first_orthologs) > 1:
    subject_ortho_up = [gene_to_ortho_dict[x] for x in get_neighbours(real_ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='up')]
    subject_ortho_down = [gene_to_ortho_dict[x] for x in get_neighbours(real_ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='down')]

print(subject_ortho_down, subject_ortho_up)    

    

[3456, 34, 1077, 9190, 4, 4, 2, 9189] [685, 39657, 3115, 18428, 39658, 14638, 28956, 22733]


In [24]:
get_up_and_down_array(first_orthologs[0],\
                        query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6,\
                                         n=n, simple=simple)

[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]),
 array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]

In [49]:
for x in first_orthologs:
    print(get_up_and_down_array(x,\
                        query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6,\
                                         n=n, simple=simple))

[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  na

[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])]
[array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]), array([ nan,  nan,  na

In [81]:
gene_done_list = []
for x in range(0,1000):
    if x < 10:
        gene_id = 'Pst104E_0000%s' % x
    elif x < 100:
        gene_id = 'Pst104E_000%s' % x
    elif x < 1000:
        gene_id = 'Pst104E_00%s' % x
    gene_done_list.append(gene_id)
    #now some testing here    
    query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
    query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
    query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
    query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]
    
    first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
    if len(first_orthologs) == 0:
        singleton_dict[gene_id] = True
    elif len(first_orthologs) == 1:
        down_up_list = get_up_and_down_array(first_orthologs[0],\
                            query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6,\
                                             n=n, simple=simple)
        down_match_dict[gene_id], up_match_dict[gene_id] = down_up_list[0], down_up_list[1]
        if sum((~np.isnan(down_up_list[0]))) > 0 and sum((~np.isnan(down_up_list[1]))) > 0:
            allele_dict[gene_id] = first_orthologs[0]
        else:
            paralog_dict[gene_id] = first_orthologs[0]
    elif len(first_orthologs) > 1:
        ortho_dict = {}
        for ortho in first_orthologs:
            down_up_list = get_up_and_down_array(ortho,query_downstream_ortho,query_upstream_ortho,\
                                               SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
            if sum((~np.isnan(down_up_list[0]))) == 0 and sum((~np.isnan(down_up_list[1]))) == 0:
                continue
            else:
                ortho_dict[ortho] = down_up_list


        max_list = [(0,0), (0,0)]
        real_ortho = ''
        real_ortho_list = []

        for ortho, ortho_list in ortho_dict.items():
            ortho_tuple_list = array_to_tuple(ortho_list)
            if up_down_ratio(ortho_tuple_list) > up_down_ratio(max_list):
                print(ortho_tuple_list)
                max_list = ortho_list
                real_ortho = ortho
                real_ortho_list = ortho_list
                down_up_list = real_ortho_list
        if max_list != [(0,0), (0,0)]:
            allele_dict[gene_id] = real_ortho
            down_match_dict[gene_id], up_match_dict[gene_id] = ortho_list[0], ortho_list[1]
        else:
            paralog_dict[gene_id] = list(ortho_dict.keys())

[(3.0, 3), (5.0, 8)]




[(0.0, 8), (7.0, 8)]
[(0.0, 8), (1.0, 8)]
[(2.0, 8), (5.0, 8)]
[(3.0, 8), (4.0, 8)]
[(4.0, 8), (3.0, 8)]
[(6.0, 8), (1.0, 8)]
[(7.0, 8), (0.0, 8)]
[(1.0, 8), (6.0, 8)]
[(1.0, 8), (8.0, 8)]
[(0.0, 3), (3.0, 8)]
[(1.0, 3), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 2), (2.0, 8)]
[(1.0, 8), (1.0, 5)]
[(1.0, 8), (2.0, 8)]
[(1.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 8), (2.0, 8)]
[(1.0, 8), (3.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 8), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (3.0, 7)]
[(1.0, 8), (2.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 8), (2.0, 8)]
[(1.0, 8), (2.0, 8)]
[(0.0, 8), (1.0, 8)]
[(1.0, 4), (2.0, 8)]
[(1.0, 2), (2.0, 8)]
[(0.0, 8), (4.0, 8)]
[(3.0, 8), (7.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 8), (1.0, 8)]
[(1.0, 5), (2.0, 8)]
[(1.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 5), (2.0, 8)]
[(0.0, 8), (2.0, 8)]
[(1.0, 6), (1

[(1.0, 5), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 1)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 4)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 2), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 1), (0.0, 8)]
[(0.0, 8), (1.0, 4)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 1)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 1)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(3.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(3.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(2.0, 8), (1.0, 8)]
[(1.0, 8), (0.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 4)]
[(0.0, 3), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 3)]
[(1.0, 8), (0

[(0.0, 8), (1.0, 8)]
[(2.0, 8), (1.0, 8)]
[(1.0, 8), (0.0, 8)]
[(3.0, 8), (0.0, 8)]
[(7.0, 8), (7.0, 8)]
[(7.0, 8), (7.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 1), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 4)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (2.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 8)]
[(7.0, 8), (7.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0.0, 8)]
[(1.0, 8), (0

In [80]:
#problem gene_id
print(gene_id)

Pst104E_00136


In [None]:
print(query_downstream_ortho, query_upstream_ortho)

In [None]:
down_match_dict

In [None]:
query_upstream

In [None]:
'DK0911_15063'

In [None]:
gene_done_list = []
for x in range(0,1000):
    if x < 10:
        gene_id = 'Pst104E_0000%s' % x
    elif x < 100:
        gene_id = 'Pst104E_000%s' % x
    elif x < 1000:
        gene_id = 'Pst104E_00%s' % x
    gene_done_list.append(gene_id)
    #now some testing here    
    query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
    query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
    query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
    query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]
    
    first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
    if len(first_orthologs) == 0:
        singleton_dict[gene_id] = True
    elif len(first_orthologs) == 1:
        up_and_down_list = get_up_and_down(first_orthologs[0],query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
        up_down_dict[gene_id] = up_and_down_list
        if up_and_down_list[0][0] > 0 and up_and_down_list[1][0] > 0:
            allele_dict[gene_id] = first_orthologs[0]
        else:
            paralog_dict[gene_id] = first_orthologs[0]
    elif len(first_orthologs) > 1:
        ortho_dict = {}
        for ortho in first_orthologs:
            up_and_down_list = get_up_and_down(ortho,query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
            if up_and_down_list[0][0] == 0 and up_and_down_list[1][0] == 0:
                continue
            else:
                ortho_dict[ortho] = up_and_down_list
        max_list = [(0,0), (0,0)]
        for ortho, ortho_list in ortho_dict.items():
            if up_down_ratio(ortho_list) > up_down_ratio(max_list) and ortho_list[0][0] > 0 and ortho_list[1][0] >0:
                max_list = ortho_list
        if max_list[0][0] > 0 and max_list[1][0] > 0:
            allele_dict[gene_id] = first_orthologs[0]
        else:
            paralog_dict[gene_id] = first_orthologs[0] 

### get in the ideas of np arrarys

In [None]:
array = np.empty(8)
array[:] = np.nan
array

In [None]:
array = np.empty(8)
array[:] = np.nan
array
array[2] = int(1)

np.count_nonzero(np.isnan(array))


In [None]:
[array, array]

In [None]:
np.concatenate((array,array),axis=0)

In [None]:
array + array

In [None]:
singleton_dict

In [None]:
allele_list = list(allele_dict.keys())
singleton_list = list(singleton_dict.keys())
paralogs_list = list(paralog_dict.keys())

In [None]:
set(gene_done_list) - set(allele_list) - set(singleton_list) - set(paralogs_list)

In [None]:
up_down_dict.keys()

In [None]:
up_and_down_list[0][0]

In [None]:
first_orthologs

#now we need to work on the case when we have multiple orthologs. This will be done on a 1:1 basis moving outward if the first neighbours match return a 1.
#if in the first round nothing matches don't move forward. Just return all possible paraloge pairings.
#if multiple match move one further out and add half a point for each match either side.
#still need to think about missing orthologs from the original search and how to pull those in  

In [None]:
def up_down_ratio(up_down_list):
    if len(up_and_down_list) == 0:
        return 0
    if (up_and_down_list[0][1] + up_and_down_list[1][1]) > 0:
        ratio = (up_and_down_list[0][0] + up_and_down_list[1][0]) / (up_and_down_list[0][1] + up_and_down_list[1][1])
        return ratio
    else:
        return 0

In [None]:
ortho_dict = {}
for ortho in first_orthologs:
    print(ortho)
    up_and_down_list = get_up_and_down(ortho,query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
    if up_and_down_list[0][0] == 0 and up_and_down_list[1][0] == 0:
        continue
    else:
        ortho_dict[ortho] = up_and_down_list
max_list = [(0,0), (0,0)]
for ortho, ortho_list in ortho_dict.items():
    if up_down_ratio(max_list) < up_down_ratio(ortho_list):
        max_list = ortho_list


In [None]:
ortho_dict

In [None]:
broken = ortho
gene_to_ortho_dict[ortho]

In [None]:
print([gene_to_ortho_dict[x] for x in get_neighbours(ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='up')])
print([gene_to_ortho_dict[x] for x in get_neighbours(ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='down')])

In [None]:
ortho

In [None]:
#get the first orthologs (aka potential allele pairings)
first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
if len(first_orthologs) == 0:
    singleton_dict[gene_id] = True
elif len(first_orthologs) == 1:
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')]
    if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
        obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
        #now look at the downstream
        obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
    elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
        obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, simple=simple)
        obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, simple=simple)

In [None]:
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:
    obs_up = 0
    obs_down = 0 
    if len(query_upstream_ortho) <= len(subject_upstream_ortho):
        
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_upstream_ortho)
        
    for i in range(0, max_up):
        if query_upstream_ortho[i] == subject_upstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_downstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_downstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_downstream_ortho[i]:
            obs_down = obs_down + 1
    
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up = 0
    obs_down = 0
    if len(query_upstream_ortho) <= len(subject_downstream_ortho):
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_downstream_ortho)
    for i in range(0, obs_down):
        if query_upstream_ortho[i] == subject_downstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_upstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_upstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_upstream_ortho[i]:
            obs_down = obs_down + 1
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

In [None]:
def count_pairings(query_ortho, subject_ortho, simple = True):
    """Function that counts the pair wise overlap of two lists.
    In the simple = True mode this is a stricked 1:1 pairing.
    In the simple = Flase mode this is a sliding window of 1 vs [0, 1 , 2] elements.
    Input: 
        query_ortho list
        subject_orth list
        simple True or False
    output:
        obs the number of observed pairings
        max_len the number of potential pairings"""
    if len(query_ortho) <= len(subject_ortho):
        max_len = len(query_ortho)
    else:
        max_len = len(subject_ortho)
    obs = 0
    if simple == True:
        for i in range(0, max_len):
            if query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
        return obs, max_len
            
        
    if simple == False:
        for i in range(0, max_len):
            if i > 0 and i < max_len -1:    
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1] \
                or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            elif i == max_up -1:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1]:
                    obs = obs + 1
            elif query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
            elif i == 0:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            
        return obs, max_len

In [None]:
#second iterations that allows for +1/=-1 difference once the first neighbour is the same
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho)       
    #now look at the downstream
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho)
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho)
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho)
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

In [None]:
#second iterations that allows for +1/=-1 difference once the first neighbour is the same
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, False)       
    #now look at the downstream
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, False)
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, False)
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, False)
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

In [None]:
get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')

In [None]:
first_orthologs

In [None]:
query_upstream_ortho

In [None]:
subject_upstream_ortho

In [None]:
query_downstream_ortho

In [None]:
subject_downstream_ortho

In [None]:
line.split(':')[1].strip()

In [None]:
get_neighbours('Pst104E_00002', QUERY_GENOME_GENE_BED6, n=10, direction = 'down')

In [None]:
bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
bed_df = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None, names=bed_6_header)
gene_id = pd.read_csv(QUERY_SELECT_GENE_BED6, sep='\t', header=None, names=bed_6_header).loc[0,['gene_id']]['gene_id']


In [None]:
n=5
direction = 'up'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']

In [None]:
contig_index = bed_df[bed_df['chrom']== contig].index

In [None]:
gene_id = 'Pst104E_00002'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']
contig_index = bed_df[bed_df['chrom']== contig].index
direction = 'down'
if direction == 'up':
    index_list = []
    if (gene_index+(n)) in contig_index:
        for i in range(gene_index+1, gene_index+(n+1)):
            index_list.append(i)
    else:
        for i in range(gene_index+1, contig_index[-1]+1):
            index_list.append(i)
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())
if direction == 'down':
    index_list = []
    if (gene_index-n) in contig_index:
        for i in range(gene_index-(n), gene_index):
            index_list.append(i)
    else:
        for i in range(contig_index[0], gene_index):
            index_list.append(i)
    index_list.reverse()
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())

In [None]:
gene_index-n in contig_index

In [None]:
index_list

In [None]:
gene_index[0]

In [None]:
[gene_index: gene_index+6]