### Notebook to get synteny measures between two assemblies
The goal is to get a tool that takes in orthofinder results and a CDS files to check fof the length of synteny with different gene categories.

The idea is that BUSCOs are more syntenous then effectors or such. 

The program tries to count how many neighbours of an allele pair are in the same orthogroup. This requires first to anchor the allele pairing e.g. find out what the 1:1 allele is and then walk outward to see if each others neighbours are in the same orthogroup. We can run this simple without allowing for skips and in simple=Flase mode to allow for a skip by one.

GENOMEA_orthoarray =  [1,2,3,4]
GENOMEB_orthoarray =  [1,2,3,5]

The result of a comparison is an array of length n where n is the size of the tested synteny block. The initial outcome is an array of length n where 1 represents a orthology mathch, where 0 represents not an orthology match and where np.nan represents the lack of test for orthology match (e.g. where one of the genes is at the edge of a contig). These arrays can be converted in tuples, where the first is number of observed matches (sum of 1s) and where the second value is the number of possible values (0 and 1 == non np.nan in the array).  
For the case above the match array for n = 5 should be [1,1,1,0,nan] and the tuple (3,4)

If this analysis is performed on two different set of gene groups this may tell something if microsynteny is more conserved within one gene group then another. Also have a look at http://chibba.pgml.uga.edu/mcscan2/ for synteny analysis.

#### initial outline

get a gene ID -> get its neightbours n in two different arrays up and down stream -> get the orthogroup arrays for the neighbours 
get a gene ID  -> get orthogroup of the gene -> get all members of the orthogroup that belong to the other genome -> get their neighbours +/- 1 -> get the best seed where both neighbours add up,  
* if we have only one best match that is easy. Use this seed as 1:1 match get the neighbourhood array up and down -> compare the arrays one element at a time and safe the match_up and match_down dictionray.
* if multiple hits. Move one further out with each hit and look at those with the same ideas.

Things also to generate:
A match dictionary for allele pairing.  
A unqiue gene dictionary.  
A paraloge dictionray. Meaning where there is no allele pairing but thing are in the same ortho group.

In [1]:
import pandas as pd
import os
import re
from Bio import SeqIO
from Bio import SeqUtils
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil
from Bio.Seq import Seq
import pysam
from Bio import SearchIO
import json
import glob
import scipy.stats as stats
import statsmodels as sms
import statsmodels.sandbox.stats.multicomp
import distance
import seaborn as sns
from pybedtools import BedTool
import matplotlib
from sklearn.externals.joblib import Parallel, delayed
import itertools as it
import tempfile
from scipy.signal import argrelextrema
import scipy
from IPython.display import Image
from PIL import Image
from collections import OrderedDict
from datetime import date



In [16]:
### define some path that at the end should come as args
ORTHOFINDER_FILE_NAME = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/result_update_redundant_protein_sets_01032019/Orthogroups_3_combined.csv'
SUBJECT_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/Pst104E_annotations/Pst_104E_v13_ph_ctg.genes.gene.bed'
QUERY_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/DK0911_annotations/DK_0911_v04_ph_ctg.genes.gene.bed'

In [3]:
OUT_PATH = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/DK0911_vs_Ps104E_v13'
if not os.path.exists(OUT_PATH):
    os.mkdir(OUT_PATH)

In [4]:
#now write some functions
def get_neighbours(gene_id, bed_filename, n=5, direction='up'):
    """A function that either takes a filename to return an array of downstream and upstream neighbouring genes.
    Input: 
        gene_id
        bed6 filename
        n being the number of neighbours we want to get
        direction being up or down.
    Output:
        returns the largest possible array of neighbours up to n.
        The 0 element is always the closest neighbour no matter if you return and 'up' or 'down' array."""
    bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
    try:
        bed_df = pd.read_csv(bed_filename, sep='\t', header=None, names=bed_6_header)
    except:
        print('Check if the bedfiles are bed6')
    if not direction in ['up', 'down']:
        print('Ensure direction is up or down.')
        
    #fix to make sure the gene id is actually in the bed_file if not just return an empty list    
    if gene_id not in bed_df['gene_id'].unique():
        print('Warning gene %s is not in bed file!' % gene_id)
        return []
    gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
    contig = bed_df.loc[gene_index, ['chrom']]['chrom']
    contig_index = bed_df[bed_df['chrom']== contig].index
    if direction == 'up':
        index_list = []
        if (gene_index+(n)) in contig_index:
            for i in range(gene_index+1, gene_index+(n+1)):
                index_list.append(i)
        else:
            for i in range(gene_index+1, contig_index[-1]+1):
                index_list.append(i)
        return bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist()
    if direction == 'down':
        index_list = []
        if (gene_index-n) in contig_index:
            for i in range(gene_index-(n), gene_index):
                index_list.append(i)
        else:
            for i in range(contig_index[0], gene_index):
                index_list.append(i)
        index_list.reverse()
        return bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist()

In [5]:
def dict_to_df(new_dict, path, date, name):
    """Saves one of the generate dicts as dataframe.
    INPUT:
        dict
        path
        data
        name"""
    df = pd.DataFrame.from_dict(new_dict, orient='index')
    df['Query'] = df.index
    df.rename(columns={0: 'Target'}, inplace = True)
    df.reset_index(drop=True, inplace=True)
    df.loc[:, ['Query', 'Target']].to_csv(os.path.join(path, ('%s_%s.csv')% (date, name)))

In [6]:
def get_ortho_dict(file_name):
    """Function that takes a orthofinder file name and generates an orthofinder dict.
    Input:
        CSV Filename of orthofinder output.
    Output:
        Orthofinder dict with keys intergers of the numerical part of the orthofinder group.
        Values are the gene identifier of each orthogroup."""
    orthofinder_dict = {}
    try:
        with open(file_name) as fh:
            for line in fh:
                if line.startswith('OG'):
                    line.strip()
                    OG = line.split('\t')[0]
                    value = [x.strip() for x in line.split(OG)[1].replace('\t',',').split(',') if x != '']
                    orthofinder_dict[int(line.split('\t')[0].strip('OG'))] = value
        return orthofinder_dict
    except FileNotFoundError:
        print("Please check the orthofinder in put file.")

In [7]:
def get_gene_to_ortho_dict(orthofinder_dict):
    """Function that makes a gene to ortho dict."""
    gene_to_ortho_dict = {}
    for key,value in orthofinder_dict.items():
        for item in value:
            gene_to_ortho_dict[item] = key
    return gene_to_ortho_dict

In [8]:
def get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, other_id):
    """The function returns all the potential orthologs of the comparative species
    Input:
        gene_id
        gene_to_ortho_dict, the ortho dict
        orthofinder_dict, the dictionary of the orthgroups to get all genes belonging to the orthogroup in question.
        other_id, is the identifier of the comparative species."""

    return [x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]] if x.startswith(other_id)]

In [9]:
def count_pairings_array(query_ortho, subject_ortho, n=5, simple = False):
    """Function that does a pairwise by element comparison of two arrays.
    In the simple = True mode this is a stricked 1:1 pairing.
    In the simple = Flase mode this is a sliding window of 1 either way. E.g. position query 1 is compared
    to position subject 0, 1 and 2.
    Input: 
        query_ortho list
        subject_orth list
        n is the length of the comparison to be made.
        simple True or False, default is Flase allowing for skiping of 1.
    output:
        result array that has the positional overlap of the array.
        e.g. for n = 8 with one list being length 8 and the other length 6
        [0, 1, 0, 1, 0, 1, nan, nan]"""
    
    array = np.empty(n)
    array[:] = np.nan
    array
    
    if simple == True:
        #this is a simple 1:1 comparison for the two arrays.
        #comparison is only possible till the shortest list is done.
        for i in range(0, n):
            if i < len(query_ortho) and i < len(subject_ortho):
                if query_ortho[i] == subject_ortho[i]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
            else:
                continue
        return array
            
        
    if simple == False:
        #do a three point comparison when possible else do whats possible.
        #retunr the array
        for i in range(0, len(query_ortho)):
            
            if i == len(subject_ortho)-1:
                if query_ortho[i] == subject_ortho[i-1] or query_ortho[i] == subject_ortho[i]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
            elif i > 0 and i < len(subject_ortho):    
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1] \
                or query_ortho[i] == subject_ortho[i+1]:
                    array[i] = int(1)
                else:
                    array[i] = int(0)

            elif i == 0:
                do_it = False
                try:
                    if query_ortho[i] == subject_ortho[i+1]:
                        do_it = True
                except IndexError:
                        pass
                try:
                    if query_ortho[i] == subject_ortho[i]:
                        do_it = True
                except IndexError:
                        pass
                if do_it == True:
                    array[i] = int(1)
                else:
                    array[i] = int(0)
        return array
        
        

In [10]:
def get_up_and_down_array(gene_id, query_downstream_ortho,query_upstream_ortho, subject_bed_fn, n, simple):
    ### TO-DO think about adding control on how to initially find pairings 0-0 and how to do the counting
    ### of matches later.
    
    """A function that returns a list two arrays. The first array is the for the downstream pairing the second for the
    upstream pairing 1 being a match, 0 being no match, nan being no match possible.
    Input: 
           Gene_id to check from the subject genome, e.g. initial orthogroup matches of query gene.
           query_downstream_ortho is the list of downstream ortho groups from the gene_ids neighbours.
           query_upstream_ortho is the list of upstream ortho groups from the gene_ids neighbours.
           subject_bed_fn is the absolute path of the subject_bed_fn to get the neighbouring genes of the gene_id.
           n is the numbers of neighbours to search.
           simple can be True or False for searching without skiping (window of three) or with skipping enabled.
    Output:
           [down_array,up_array]"""
    
    
    if simple not in [True, False]:
        simple = False
    
    #get the ortho dictionaries for the gene_id which is an ortho hit of the query.
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x \
                              in get_neighbours(gene_id, subject_bed_fn, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x \
                                in get_neighbours(gene_id, subject_bed_fn, n=n, direction='down')]
    
    #now move to empty arrays
    
    
    #define in case something has no upstream or downstream hits
    array = array = np.empty(n)
    array[:] = np.nan
    up_array, down_array = array, array
    
    if simple == True:
        
        if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 
        #check the upstream upstream matching and ask if 0 in subject is 0 in the query  
        #TO-DO check if we maybe want to do 0, 1 subject == 0 as well.
            if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho, n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)

                if len(query_downstream_ortho) > 0:
                    if query_downstream_ortho[0] == subject_upstream_ortho[0]:
                        print("Orthogroups of up and downstream are the same. Gene_id: %s" % gene_id)
                        up_array_new = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                        down_array_new = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                        #this compares if the one has more pairing then the other combination of possible 
                        #neighboorhood match arrays
                        if up_down_ratio_array([down_array_new, up_array_new])\
                                > up_down_ratio_array([down_array, up_array]):
                                #think about if we also want to test for 
                                #(max_down_new+max_up_new) > (max_down, max_up) :
                            down_array, up_array = down_array_new, up_array_new
                        #break the stallmate by assiningn randomly
                        elif up_down_ratio_array([down_array_new, up_array_new])\
                                == up_down_ratio_array([down_array, up_array])\
                                and((np.random.random() < 0.5)):
                            down_array, up_array = down_array_new, up_array_new


                return [down_array, up_array]

        if len(query_upstream_ortho) >0 and len(subject_downstream_ortho) > 0:
            
            if query_upstream_ortho[0] == subject_downstream_ortho[0]:
                up_array = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                down_array = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                return [down_array, up_array]

        if len(query_downstream_ortho) > 0 and len(subject_downstream_ortho) > 0: 
            if subject_downstream_ortho[0] == query_downstream_ortho[0]:
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho,n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)
                return [down_array, up_array]

        return [down_array, up_array]
    
    elif simple == False:
        
        do_it = False
        one_exists = False
        ###To-Do this still needs to be fixed to check if also the downstream match
        ###Right now this generates a bias in the analysis that the +1 position is more common the -1
        if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 
            try:
                if query_upstream_ortho[0] == subject_upstream_ortho[0] or \
                query_upstream_ortho[0] == subject_upstream_ortho[1]:
                    do_it = True
            
            except IndexError:
                if query_upstream_ortho[0] == subject_upstream_ortho[0]:
                    do_it = True
                    
            if do_it == True:        
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho, n, simple=simple)       
                #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)

                if len(query_downstream_ortho) > 0 and len(subject_upstream_ortho) > 0:
                    do_it_too = False
                    try:
                        if query_downstream_ortho[0] == subject_upstream_ortho[0] or \
                        query_downstream_ortho[0] == subject_upstream_ortho[1]:
                            do_it_too = True
                    except IndexError:
                        if query_downstream_ortho[0] == subject_upstream_ortho[0]:
                            do_it_too = True
                            
                        if do_it_too == True:
                            #print("Orthogroups of up and downstream are the same. Gene_id: %s" % gene_id)
                            up_array_new = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                            down_array_new = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                            if up_down_ratio_array([down_array_new, up_array_new])\
                                > up_down_ratio_array([down_array, up_array]):
                                #think about if we also want to test for 
                                #(max_down_new+max_up_new) > (max_down, max_up) :
                                down_array, up_array = down_array_new, up_array_new
                            elif up_down_ratio_array([down_array_new, up_array_new])\
                                == up_down_ratio_array([down_array, up_array])\
                                and((np.random.random() < 0.5)):
                                down_array, up_array = down_array_new, up_array_new


                    return [down_array, up_array]

        if len(query_upstream_ortho) >0 and len(subject_downstream_ortho) > 0:
            try:
                
                if query_upstream_ortho[0] == subject_downstream_ortho[0] or \
                query_upstream_ortho[0] == subject_downstream_ortho[1]:
                    do_it = True
            except IndexError:
                if query_upstream_ortho[0] == subject_downstream_ortho[0]:
                    do_it = True
            if do_it == True:
                up_array = count_pairings_array(query_upstream_ortho, subject_downstream_ortho, n, simple=simple)
                down_array = count_pairings_array(query_downstream_ortho, subject_upstream_ortho, n, simple=simple)
                return [down_array, up_array]

        if len(query_downstream_ortho) > 0 and len(subject_downstream_ortho) > 0:
            try: 
                if subject_downstream_ortho[0] == query_downstream_ortho[0] or \
                subject_downstream_ortho[0] == query_downstream_ortho[1]:
                    do_it = True
            except IndexError:
                if subject_downstream_ortho[0] == query_downstream_ortho[0]:
                    do_it = True
            if do_it == True:
                up_array = count_pairings_array(query_upstream_ortho, subject_upstream_ortho,n, simple=simple)       
                    #now look at the downstream
                down_array = count_pairings_array(query_downstream_ortho, subject_downstream_ortho,n, simple=simple)
                return [down_array, up_array]

        return [down_array, up_array]

In [11]:
def array_to_tuple(array_list):
    """Converts arrays to (obs, max_poss_obs) tuples ignoring nans."""
    if len(array_list) != 2:
        print('The length of the list is not 2.')
        return 0
    tuple_list = []
    for array in array_list:
        tuple_list.append(((np.nan_to_num(array)).sum(),np.count_nonzero(~np.isnan(array)) ))
    return tuple_list

In [12]:
def up_down_ratio_array(up_down_list):
    """Returns the ratio of observed matches over possible matches of both down and up array together."""
    if len(up_down_list) != 2:
        print('The length of the list is not 2.')
        return 0
    sum_observed = np.nan_to_num(up_down_list[0]).sum() + np.nan_to_num(up_down_list[1]).sum()
    sum_possible = np.count_nonzero(~np.isnan(up_down_list[0])) + np.count_nonzero(~np.isnan(up_down_list[1]))
    if sum_possible > 0:
        
        return sum_observed/sum_possible
    else:
        return 0

In [13]:
def up_down_ratio(up_down_list):
    """Up down ratio for tuples (obs, max_poss_obs)."""
    if len(up_down_list) == 0:
        return 0
    if (up_down_list[0][1] + up_down_list[1][1]) > 0:
        ratio = (up_down_list[0][0] + up_down_list[1][0]) / (up_down_list[0][1] + up_down_list[1][1])
        return ratio
    else:
        return 0

In [14]:
#generate dicts that will be used to track things
allele_dict = {}
singleton_dict = {}
paralog_dict = {}
up_match_dict = {}
down_match_dict = {}

In [17]:
#get the identifiers if not provided.
query_id = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]
subject_id = pd.read_csv(SUBJECT_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]

In [18]:
#get the orthofinder results read in
#this is the dict for all orthogroups and what members they have
orthofinder_dict = get_ortho_dict(ORTHOFINDER_FILE_NAME)

In [19]:
#this is the dict for proteins and what orthogroup they have
gene_to_ortho_dict = get_gene_to_ortho_dict(orthofinder_dict)

In [20]:
query_genes = pd.read_csv(SUBJECT_GENOME_GENE_BED6, sep='\t', header=None)[3]

In [21]:
#check whats in the input orthofile and what's missing
for gene in query_genes:
    try:
        gene_to_ortho_dict[gene]
    except:
        print(gene)

#### Main program

In [None]:
#generate dicts that will be used to track things
allele_dict = {}
singleton_dict = {}
paralog_dict = {}
up_match_dict = {}
down_match_dict = {}
n=8
simple = False
#testing set loop
gene_done_list = []
df = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None)
for gene_id in df[3].tolist():
    gene_done_list.append(gene_id)
    #now some testing here    
    query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
    query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
    query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
    query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]
    
    first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
    if len(first_orthologs) == 0:
        #if we don't have an ortho hit put it in the singleton dict
        singleton_dict[gene_id] = True
    elif len(first_orthologs) == 1:
        down_up_list = get_up_and_down_array(first_orthologs[0],\
                            query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6,\
                                             n=n, simple=simple)
        down_match_dict[gene_id], up_match_dict[gene_id] = down_up_list[0], down_up_list[1]
        
        #this here defines "alleles". Alleles are genes where the immediate neighbour is conserved.
        #consider simple True/False here.
        try:
            if down_up_list[0][0] == 1 and down_up_list[1][0] == 1:
                allele_dict[gene_id] = first_orthologs[0]
            else:
                paralog_dict[gene_id] = first_orthologs
        except IndexError:
                paralog_dict[gene_id] = first_orthologs
    
    elif len(first_orthologs) > 1:
        #generate a potential ortho_dict that stores the options
        #afterwards we loop over the option and see what is best.
        ortho_dict = {}
        for ortho in first_orthologs:
            down_up_list = get_up_and_down_array(ortho,query_downstream_ortho,query_upstream_ortho,\
                                               SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
            #TO-DO check for edge cases here
            if sum((~np.isnan(down_up_list[0]))) == 0 and sum((~np.isnan(down_up_list[1]))) == 0:
                continue
            else:
                ortho_dict[ortho] = down_up_list

        array = array = np.empty(n)
        array[:] = np.nan
        max_list = [(0,0), (0,0)]
        real_ortho = ''
        real_ortho_list = [array, array]
        
        #this loop intentifies the best ortho hit. If two get the best score
        #the first one is kept. has a change element.
        #TO-DO consider how splits are handled better.
        for ortho, ortho_list in ortho_dict.items():
            ortho_tuple_list = array_to_tuple(ortho_list)
            #fix the issue with always staying on the state when equal
            if (up_down_ratio(ortho_tuple_list) == up_down_ratio(max_list)) \
            and (np.random.random() < 0.5):
                max_list = ortho_tuple_list
                real_ortho = ortho
                real_ortho_list = ortho_list
                
            elif up_down_ratio(ortho_tuple_list) > up_down_ratio(max_list):
                print(ortho_tuple_list)
                max_list = ortho_tuple_list
                real_ortho = ortho
                real_ortho_list = ortho_list
        

        down_match_dict[gene_id], up_match_dict[gene_id] = real_ortho_list[0], real_ortho_list[1]
        
        #this is how alleles and paralogs are defined.
        #TO-DO consider if paralogs should be treated differently then simply assiging all in the
        #ortho dict
        
        if ortho_dict == {}:
            paralog_dict[gene_id] = first_orthologs
        else:
        
            paralog_dict[gene_id] = list(ortho_dict.keys())

            try:
                if real_ortho_list[0][0] == 1 and real_ortho_list[1][0] == 1:
                    allele_dict[gene_id] = real_ortho
                else:
                    paralog_dict[gene_id] = list(ortho_dict.keys())
            except IndexError:
                paralog_dict[gene_id] = list(ortho_dict.keys())
        print('Done comparing!')

Done comparing!
[(0.0, 2), (5.0, 6)]
[(0.0, 2), (6.0, 6)]
Done comparing!
[(1.0, 3), (4.0, 4)]
[(1.0, 3), (5.0, 5)]
Done comparing!
[(2.0, 4), (3.0, 3)]
[(2.0, 4), (4.0, 4)]
Done comparing!
[(3.0, 5), (1.0, 2)]
Done comparing!
[(3.0, 6), (2.0, 2)]
[(4.0, 6), (2.0, 2)]
Done comparing!
[(5.0, 7), (1.0, 1)]
[(6.0, 7), (1.0, 1)]
Done comparing!
[(1.0, 8), (0.0, 0)]
[(1.0, 4), (0.0, 0)]
[(1.0, 2), (0.0, 0)]
[(6.0, 8), (0.0, 0)]
Done comparing!
Done comparing!
[(1.0, 1), (2.0, 2)]
Done comparing!
[(0.0, 2), (1.0, 1)]
[(1.0, 2), (1.0, 1)]
Done comparing!
[(1.0, 3), (0.0, 0)]
[(2.0, 3), (0.0, 0)]
Done comparing!
Done comparing!
[(0.0, 1), (5.0, 5)]
Done comparing!
[(1.0, 2), (4.0, 4)]
Done comparing!
[(2.0, 3), (3.0, 3)]
Done comparing!
[(3.0, 4), (2.0, 2)]
Done comparing!
[(4.0, 5), (1.0, 1)]
Done comparing!
Done comparing!
Done comparing!
Done comparing!
[(1.0, 1), (5.0, 6)]
[(1.0, 1), (1.0, 1)]
Done comparing!
[(2.0, 2), (0.0, 1)]
[(2.0, 2), (4.0, 5)]
Done comparing!
[(3.0, 3), (3.0, 4)]
Do

[(2.0, 2), (4.0, 6)]
Done comparing!
[(5.0, 5), (5.0, 8)]
Done comparing!
Done comparing!
[(5.0, 6), (1.0, 2)]
Done comparing!
[(2.0, 8), (7.0, 8)]
Done comparing!
[(3.0, 8), (7.0, 8)]
Done comparing!
[(4.0, 8), (7.0, 8)]
Done comparing!
[(4.0, 8), (3.0, 8)]
[(5.0, 8), (7.0, 8)]
Done comparing!
[(6.0, 8), (7.0, 8)]
Done comparing!
[(4.0, 8), (3.0, 8)]
[(7.0, 8), (7.0, 8)]
Done comparing!
[(7.0, 8), (7.0, 8)]
Done comparing!
[(7.0, 8), (8.0, 8)]
Done comparing!
[(7.0, 8), (6.0, 6)]
Done comparing!
[(4.0, 8), (5.0, 5)]
[(7.0, 8), (5.0, 5)]
Done comparing!
[(4.0, 8), (4.0, 4)]
[(7.0, 8), (4.0, 4)]
Done comparing!
[(7.0, 8), (3.0, 3)]
Done comparing!
[(5.0, 8), (2.0, 2)]
[(7.0, 8), (2.0, 2)]
Done comparing!
Done comparing!
Done comparing!
[(0.0, 0), (6.0, 8)]
Done comparing!
[(0.0, 1), (5.0, 8)]
[(1.0, 1), (5.0, 8)]
Done comparing!
[(2.0, 2), (5.0, 8)]
Done comparing!
[(3.0, 3), (5.0, 8)]
Done comparing!
[(0.0, 1), (2.0, 7)]
[(4.0, 4), (4.0, 7)]
Done comparing!
[(4.0, 5), (1.0, 6)]
[(3.0, 

[(4.0, 4), (6.0, 8)]
Done comparing!
[(4.0, 5), (6.0, 8)]
Done comparing!
[(6.0, 6), (4.0, 8)]
Done comparing!
[(7.0, 7), (3.0, 8)]
Done comparing!
[(7.0, 8), (4.0, 8)]
Done comparing!
[(8.0, 8), (3.0, 8)]
[(8.0, 8), (4.0, 8)]
Done comparing!
[(8.0, 8), (4.0, 8)]
Done comparing!
[(0.0, 8), (1.0, 8)]
[(0.0, 8), (1.0, 2)]
[(8.0, 8), (3.0, 8)]
Done comparing!
Done comparing!
Done comparing!
[(0.0, 8), (4.0, 5)]
Done comparing!
[(4.0, 8), (2.0, 3)]
Done comparing!
[(1.0, 8), (1.0, 2)]
[(6.0, 8), (1.0, 2)]
Done comparing!
[(6.0, 8), (0.0, 1)]
Done comparing!
Done comparing!
[(1.0, 1), (3.0, 5)]
Done comparing!
[(2.0, 2), (2.0, 4)]
Done comparing!
[(2.0, 3), (1.0, 4)]
[(3.0, 3), (1.0, 3)]
Done comparing!
[(4.0, 5), (0.0, 2)]
Done comparing!
[(1.0, 2), (1.0, 8)]
Done comparing!
Done comparing!
[(2.0, 3), (7.0, 8)]
Done comparing!
Done comparing!
[(1.0, 8), (7.0, 8)]
[(1.0, 4), (7.0, 8)]
Done comparing!
[(2.0, 8), (6.0, 8)]
[(2.0, 5), (7.0, 8)]
Done comparing!
[(3.0, 8), (5.0, 8)]
[(3.0, 6), (

[(8.0, 8), (4.0, 8)]
Done comparing!
[(7.0, 8), (3.0, 8)]
[(8.0, 8), (3.0, 8)]
Done comparing!
[(8.0, 8), (2.0, 8)]
Done comparing!
[(8.0, 8), (1.0, 7)]
Done comparing!
[(8.0, 8), (0.0, 6)]
Done comparing!
Done comparing!
Done comparing!
Done comparing!
Done comparing!
Done comparing!
Done comparing!
[(0.0, 1), (3.0, 3)]
Done comparing!
[(1.0, 2), (2.0, 2)]
Done comparing!
[(2.0, 3), (1.0, 1)]
Done comparing!
[(3.0, 4), (0.0, 0)]
Done comparing!
[(0.0, 0), (2.0, 3)]
Done comparing!
Done comparing!
Done comparing!
Done comparing!
[(1.0, 7), (2.0, 3)]
[(2.0, 6), (2.0, 3)]
Done comparing!
Done comparing!
Done comparing!
[(0.0, 0), (8.0, 8)]
Done comparing!
[(1.0, 1), (8.0, 8)]
Done comparing!
[(2.0, 2), (8.0, 8)]
Done comparing!
[(3.0, 3), (7.0, 8)]
Done comparing!
[(4.0, 4), (6.0, 8)]
Done comparing!
[(5.0, 5), (5.0, 7)]
Done comparing!
[(6.0, 6), (4.0, 6)]
Done comparing!
[(7.0, 7), (3.0, 5)]
Done comparing!
[(8.0, 8), (2.0, 4)]
Done comparing!
[(8.0, 8), (1.0, 3)]
Done comparing!
Done 

In [25]:
down_match_df = pd.DataFrame.from_dict(down_match_dict).T
up_match_df = pd.DataFrame.from_dict(up_match_dict).T

In [27]:
#save it out 
d = date.today().strftime("%y%m%d")
name = OUT_PATH.split('/')[-1]
up_match_df.to_csv(os.path.join(OUT_PATH, ('%s_%s_up_match_simple_true_df.csv' % (d, name))))
down_match_df.to_csv(os.path.join(OUT_PATH, ('%s_%s_down_match_simple_true_df.csv'% (d, name))))

In [28]:
for key, value in paralog_dict.items():
    if type(value) == str:
        paralog_dict[key] = [value]
        

In [29]:
dict_to_df(allele_dict, OUT_PATH, d, '%s_allele' % name)
dict_to_df(singleton_dict, OUT_PATH, d, '%s_singleton' %name)
dict_to_df(paralog_dict, OUT_PATH, d, '%s_paraloge' %name)

In [30]:
print('hello')

hello
