### Notebook to get synteny measures between two assemblies
The goal is a tool that takes OrthoFinder results and CDS files and checks the length of synteny for different gene categories.

The idea is that BUSCOs are more syntenic than effectors or similar gene categories.

The program counts how many neighbours of an allele pair are in the same orthogroup. This first requires anchoring the allele pairing, i.e. finding the 1:1 allele, and then walking outward to see whether the neighbours on each side are in the same orthogroup. For now we won't allow any skips and just go one by one.

GENOMEA_orthoarray =  [1,2,3,4]
GENOMEB_orthoarray =  [1,2,3,4]

This should result in a tuple where the first element is the number of observed matches and the second is the number of possible matches. Here this would be (8,8).

If this analysis is performed on two different sets of gene groups, it may indicate whether microsynteny is more conserved within one gene group than another.
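The counting idea can be sketched in a few lines. This is only an illustration: `count_matches` is a hypothetical helper name, and reading (8,8) as four upstream plus four downstream neighbours is an assumption about the example above, not something the notebook states.

```python
def count_matches(query_orthoarray, subject_orthoarray):
    """Return (observed, possible) for two neighbour arrays.

    Arrays are compared position by position, without skips.
    'possible' is limited by the shorter of the two arrays.
    """
    possible = min(len(query_orthoarray), len(subject_orthoarray))
    observed = sum(
        1 for q, s in zip(query_orthoarray, subject_orthoarray) if q == s
    )
    return observed, possible


# Assumption: the same four orthogroups flank the anchor on both sides.
GENOMEA_orthoarray = [1, 2, 3, 4]
GENOMEB_orthoarray = [1, 2, 3, 4]

up = count_matches(GENOMEA_orthoarray, GENOMEB_orthoarray)
down = count_matches(GENOMEA_orthoarray, GENOMEB_orthoarray)

# Add the upstream and downstream counts into one (observed, possible) tuple.
synteny = (up[0] + down[0], up[1] + down[1])
print(synteny)
```

Because the comparison is strictly positional, a single inserted gene shifts every later position and ends the run of matches, which is the "no skips" behaviour described above.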

#### initial outline

get a gene ID -> get its n neighbours in two different arrays, upstream and downstream -> get the orthogroup arrays for the neighbours 
get a gene ID -> get the orthogroup of the gene -> get all members of the orthogroup that belong to the other genome -> get their neighbours +/- 1 -> get the best seed, i.e. the one where both neighbours match
* If there is only one best match, that is easy. Use this seed as the 1:1 match, get the neighbourhood arrays up and down, compare the arrays one element at a time and save the tuple in a dictionary.
* If there are multiple hits, move one position further out for each hit and apply the same idea.
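A minimal sketch of the seed selection step, under stated assumptions: each flank is a `(down, up)` tuple of orthogroup IDs for the +/- 1 neighbours, `None` marks a contig end, and the names `score_seed` and `pick_seeds` as well as the toy gene IDs are hypothetical. Ties are only reported here; moving further out to break them is not implemented.

```python
def score_seed(query_flank, subject_flank):
    """Count agreeing +/- 1 neighbour orthogroups (0, 1 or 2).

    Both orientations are tried because the subject contig may be
    reverse-complemented relative to the query contig.
    """
    q_down, q_up = query_flank
    s_down, s_up = subject_flank

    def agree(a, b):
        # A missing neighbour (contig end) never counts as a match.
        return int(a is not None and a == b)

    same = agree(q_down, s_down) + agree(q_up, s_up)
    flipped = agree(q_down, s_up) + agree(q_up, s_down)
    return max(same, flipped)


def pick_seeds(query_flank, candidate_flanks):
    """Return the candidate gene IDs sharing the best non-zero score."""
    scores = {
        gene: score_seed(query_flank, flank)
        for gene, flank in candidate_flanks.items()
    }
    best = max(scores.values(), default=0)
    if best == 0:
        return []
    return sorted(gene for gene, score in scores.items() if score == best)


# Toy data: three members of the query gene's orthogroup in the other genome.
query_flank = ('OG1', 'OG2')
candidate_flanks = {
    'B_10': ('OG2', 'OG1'),   # both neighbours agree, flipped orientation
    'B_55': ('OG1', 'OG9'),   # only one neighbour agrees
    'B_80': ('OG7', None),    # nothing agrees, sits at a contig end
}
seeds = pick_seeds(query_flank, candidate_flanks)
print(seeds)
```

A single returned ID corresponds to the first bullet (one best match); more than one ID corresponds to the second bullet, where the comparison would have to be extended outward.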

Things to generate as well:
A match dictionary for allele pairing.  
A unique gene dictionary.  
A paralog dictionary, meaning cases where there is no allele pairing but the genes are in the same orthogroup.
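One way these dictionaries could relate to each other is sketched below. It assumes the allele-pair dictionary already exists as the output of the anchoring step, and it treats a gene as "unique" when its orthogroup has no member from the other genome; both the function name `classify_unpaired` and that definition of "unique" are assumptions, and the gene IDs are toy data.

```python
def classify_unpaired(query_genes, gene_to_ortho, orthogroups,
                      subject_prefix, allele_pairs):
    """Split genes without an allele pair into unique genes and paralogs.

    unique_genes: gene -> its orthogroup (None if unassigned), no member
                  of the other genome in that orthogroup.
    paralogs:     gene -> members of the other genome in the same orthogroup.
    """
    unique_genes = {}
    paralogs = {}
    for gene in query_genes:
        if gene in allele_pairs:
            continue  # already anchored as a 1:1 allele
        orthogroup = gene_to_ortho.get(gene)
        members = orthogroups.get(orthogroup, [])
        others = [m for m in members if m.startswith(subject_prefix)]
        if others:
            paralogs[gene] = others
        else:
            unique_genes[gene] = orthogroup
    return unique_genes, paralogs


# Toy orthogroups for two genomes with prefixes 'A' and 'B'.
orthogroups = {
    'OG1': ['A_1', 'B_1'],
    'OG2': ['A_2', 'A_3', 'B_2'],
    'OG3': ['A_4'],
}
gene_to_ortho = {g: og for og, genes in orthogroups.items() for g in genes}
allele_pairs = {'A_1': 'B_1', 'A_2': 'B_2'}   # hypothetical anchoring result

unique_genes, paralogs = classify_unpaired(
    ['A_1', 'A_2', 'A_3', 'A_4', 'A_5'],
    gene_to_ortho, orthogroups, 'B', allele_pairs,
)
print(unique_genes)
print(paralogs)
```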

In [4]:
import pandas as pd
import os
import re
from Bio import SeqIO
from Bio import SeqUtils
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil
from Bio.Seq import Seq
from Bio import SearchIO
import json
import glob
import scipy.stats as stats
import statsmodels as sms
import statsmodels.sandbox.stats.multicomp
import distance
import seaborn as sns
import matplotlib
#sklearn.externals.joblib was removed from newer scikit-learn releases, import joblib directly
from joblib import Parallel, delayed
import itertools as it
import tempfile
from scipy.signal import argrelextrema
import scipy
from IPython.display import Image
from PIL import Image
from collections import OrderedDict



In [17]:
### define some paths that should eventually come in as args
ORTHOFINDER_FILE_NAME = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/OrthoFinder_all/Orthogroups_2.txt'
QUERY_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/Pst104E_annotations/Pst_104E_v13_p_ctg.gene.bed'
SUBJECT_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/DK0911_annotations/DK_0911_v04_p_ctg.genes.gene.bed'
QUERY_SELECT_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/test_folder/Pst_104E_v13_p_ctg.gene1000.bed'

In [22]:
#now write some functions
def get_neighbours(gene_id, bed_filename, n=5, direction='up'):
    """Take a bed6 filename and return a list of neighbouring genes on the same contig.
    Input: 
        gene_id
        bed6 filename
        n being the number of neighbours we want to get
        direction being 'up' (following rows in the bed file) or 'down' (preceding rows).
    Output:
        returns the largest possible list of neighbours up to n.
        The 0 element is always the closest neighbour no matter if you return an 'up' or 'down' list."""
    bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
    if direction not in ['up', 'down']:
        raise ValueError('Ensure direction is up or down.')
    try:
        bed_df = pd.read_csv(bed_filename, sep='\t', header=None, names=bed_6_header)
    except Exception:
        #re-raise so we never carry on with an undefined bed_df
        print('Check if the bedfiles are bed6')
        raise
    #assumes the bed file is sorted so that all genes of a contig sit in consecutive rows
    gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
    contig = bed_df.loc[gene_index, 'chrom']
    contig_index = bed_df[bed_df['chrom'] == contig].index
    if direction == 'up':
        #take up to n rows after the gene, stopping at the last row of the contig
        last = min(gene_index + n, contig_index[-1])
        index_list = list(range(gene_index + 1, last + 1))
    else:
        #take up to n rows before the gene, stopping at the first row of the contig
        first = max(gene_index - n, contig_index[0])
        index_list = list(range(first, gene_index))
        #reverse so the closest neighbour comes first
        index_list.reverse()
    return bed_df.loc[index_list, 'gene_id'].tolist()
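`get_neighbours` re-reads the bed file on every call, which may become slow once it is run for every gene. The sketch below shows one possible alternative that parses the file once and then answers neighbour queries from plain dictionaries. The names `build_contig_index` and `neighbours_from_index` and the toy bed lines are hypothetical; unlike the function above it groups genes by contig name, so it does not depend on DataFrame row numbers.

```python
from collections import defaultdict


def build_contig_index(bed_lines):
    """Parse bed6 lines once.

    Returns contig -> list of gene IDs in file order,
    and gene ID -> (contig, position within that list).
    """
    contig_genes = defaultdict(list)
    gene_position = {}
    for line in bed_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 6:
            continue  # skip blank or non-bed6 lines
        contig, gene_id = fields[0], fields[3]
        gene_position[gene_id] = (contig, len(contig_genes[contig]))
        contig_genes[contig].append(gene_id)
    return contig_genes, gene_position


def neighbours_from_index(gene_id, contig_genes, gene_position,
                          n=5, direction='up'):
    """Return up to n neighbours, closest neighbour first."""
    if direction not in ('up', 'down'):
        raise ValueError('Ensure direction is up or down.')
    contig, pos = gene_position[gene_id]
    genes = contig_genes[contig]
    if direction == 'up':
        return genes[pos + 1:pos + 1 + n]
    # 'down': preceding genes, reversed so the closest one is element 0
    return genes[max(0, pos - n):pos][::-1]


# Toy bed6 content: four genes on ctg1, one gene on ctg2.
toy_bed = [
    'ctg1\t10\t50\tA_0\t.\t+',
    'ctg1\t60\t90\tA_1\t.\t-',
    'ctg1\t100\t150\tA_2\t.\t+',
    'ctg1\t200\t250\tA_3\t.\t+',
    'ctg2\t5\t40\tA_4\t.\t-',
]
contig_genes, gene_position = build_contig_index(toy_bed)
print(neighbours_from_index('A_2', contig_genes, gene_position, n=5, direction='down'))
print(neighbours_from_index('A_2', contig_genes, gene_position, n=5, direction='up'))
```

For a real file the same functions could be fed an open file handle, since iterating over it yields lines.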

In [62]:
def get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, other_id):
    """The function returns all the potential orthologs in the comparative species.
    Input:
        gene_id
        gene_to_ortho_dict, the dictionary mapping each gene to its orthogroup
        orthofinder_dict, the dictionary of the orthogroups to get all genes belonging to the orthogroup in question.
        other_id, the identifier (gene ID prefix) of the comparative species."""

    return [x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]] if x.startswith(other_id)]

In [28]:
#get the identifiers if not provided.
query_id = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]
subject_id = pd.read_csv(SUBJECT_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]

In [34]:
#get the orthofinder results read in
#this is the dict for all orthogroups and what members they have
orthofinder_dict = {}
with open(ORTHOFINDER_FILE_NAME) as fh:
    for line in fh:
        line = line.strip()
        orthofinder_dict[line.split(':')[0]] = line.split(':')[1].strip().split(' ')

In [38]:
#this is the dict for proteins and what orthogroup they have
gene_to_ortho_dict = {}
for key,value in orthofinder_dict.items():
    for item in value:
        gene_to_ortho_dict[item] = key
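The two dictionaries can also be built in a single pass. The sketch below assumes the `OGxxxxxxx: gene1 gene2 ...` line layout that the parsing code above relies on; `parse_orthogroups` is a hypothetical helper and the example lines are made-up toy data, not taken from the real file.

```python
def parse_orthogroups(lines):
    """Build orthogroup -> members and gene -> orthogroup from text lines."""
    orthogroups = {}
    gene_to_ortho = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split only on the first ':' so the member list stays intact.
        name, _, members = line.partition(':')
        genes = members.split()
        orthogroups[name] = genes
        for gene in genes:
            gene_to_ortho[gene] = name
    return orthogroups, gene_to_ortho


# Toy input in the assumed layout.
toy_lines = [
    'OG0000001: Pst104E_00002 DK0911_15058\n',
    '\n',
    'OG0000002: Pst104E_00003\n',
]
orthogroups, gene_to_ortho = parse_orthogroups(toy_lines)
print(orthogroups)
print(gene_to_ortho['DK0911_15058'])
```

Using `split()` without an argument also guards against empty strings in the member list when a line contains repeated spaces.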

In [84]:
gene_id = 'Pst104E_00002'
print([x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]] if x.startswith('DK0911')])
n=8

['DK0911_15058']


In [74]:
orthofinder_dict[gene_to_ortho_dict[gene_id]]

['DK0911_15058',
 'KNZ49618.1',
 'POW13770.1',
 'POW17438.1',
 'Pst104E_00002',
 'XP_007406928.1',
 'jgi|Croqu1|660945|fgenesh1_kg.106_#_31_#_Locus8754v1rpkm8.61',
 'jgi|Melap1finSC_191|1668711|estExt_Genemark1.C_9250008',
 'jgi|Melli1|206489|MELLI_sc_4122.1',
 'jgi|Mellp2_3|103969|Mellp1.fgenesh2_pg.7_277',
 'jgi|PuccoSD80_1|7118|PCA_SD_11668-T1',
 'jgi|Pucst1|495976|maker-PST130_28344-snap-gene-0.5-mRNA-1',
 'jgi|Pucst_PST78_1|4751|PSTG_14207T0',
 'jgi|Puctr1|2417|PTTG_28700T0']

In [75]:
query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]

In [76]:
first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
if len(first_orthologs) == 1:
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')]

In [77]:
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:
    obs_up = 0
    obs_down = 0 
    if len(query_upstream_ortho) <= len(subject_upstream_ortho):
        
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_upstream_ortho)
        
    for i in range(0, max_up):
        if query_upstream_ortho[i] == subject_upstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_downstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_downstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_downstream_ortho[i]:
            obs_down = obs_down + 1
    
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up = 0
    obs_down = 0
    if len(query_upstream_ortho) <= len(subject_downstream_ortho):
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_downstream_ortho)
    for i in range(0, max_up):
        if query_upstream_ortho[i] == subject_downstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_upstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_upstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_upstream_ortho[i]:
            obs_down = obs_down + 1

In [78]:
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))

obs_up 3, max_up 8, obs_down 2, max_down 2


In [81]:
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

sub_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656', 'OG0008482', 'OG0000157', 'OG0000001']
query_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676', 'OG0039656', 'OG0008482', 'OG0002271']
sub_down ['OG0009613', 'OG0006130']
query_down ['OG0009613', 'OG0006130']
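The direction pairing and counting could be wrapped into one function, sketched here with the orthogroup arrays printed above as input. `compare_neighbourhoods` is a hypothetical name; returning `None` when the first neighbours do not pair in either orientation is a design choice of this sketch, not behaviour of the notebook cell.

```python
def compare_neighbourhoods(query_up, query_down, subject_up, subject_down):
    """Return (obs_up, max_up, obs_down, max_down).

    If the first query upstream neighbour pairs with the first subject
    downstream neighbour, the subject arrays are swapped (flipped contig).
    """
    def matches(query, subject):
        # Position-by-position comparison, limited by the shorter array.
        possible = min(len(query), len(subject))
        observed = sum(1 for q, s in zip(query, subject) if q == s)
        return observed, possible

    if query_up and subject_up and query_up[0] == subject_up[0]:
        pass  # same orientation
    elif query_up and subject_down and query_up[0] == subject_down[0]:
        subject_up, subject_down = subject_down, subject_up
    else:
        return None  # first neighbours do not pair in either orientation
    obs_up, max_up = matches(query_up, subject_up)
    obs_down, max_down = matches(query_down, subject_down)
    return obs_up, max_up, obs_down, max_down


sub_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676',
          'OG0039656', 'OG0008482', 'OG0000157', 'OG0000001']
query_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814',
            'OG0012676', 'OG0039656', 'OG0008482', 'OG0002271']
sub_down = ['OG0009613', 'OG0006130']
query_down = ['OG0009613', 'OG0006130']

result = compare_neighbourhoods(query_up, query_down, sub_up, sub_down)
print(result)
```

The upstream run stops at position 3 because the query carries `OG0037814` there while the subject does not, which shifts all later positions out of register.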


In [60]:
get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')

['DK0911_15057', 'DK0911_15056']

In [61]:
first_orthologs

['DK0911_15058']

In [56]:
query_upstream_ortho

['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']

In [57]:
subject_upstream_ortho

['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']

In [58]:
query_downstream_ortho

['OG0009613', 'OG0006130']

In [59]:
subject_downstream_ortho

['OG0009613', 'OG0006130']

In [32]:
line.split(':')[1].strip()

'Pst104E_25431'

In [25]:
get_neighbours('Pst104E_00002', QUERY_GENOME_GENE_BED6, n=10, direction = 'down')

['Pst104E_00001', 'Pst104E_00000']

In [8]:
bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
bed_df = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None, names=bed_6_header)
gene_id = pd.read_csv(QUERY_SELECT_GENE_BED6, sep='\t', header=None, names=bed_6_header).loc[0,['gene_id']]['gene_id']


In [40]:
n=5
direction = 'up'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']

In [42]:
contig_index = bed_df[bed_df['chrom']== contig].index

In [79]:
gene_id = 'Pst104E_00002'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']
contig_index = bed_df[bed_df['chrom']== contig].index
direction = 'down'
if direction == 'up':
    index_list = []
    if (gene_index+(n)) in contig_index:
        for i in range(gene_index+1, gene_index+(n+1)):
            index_list.append(i)
    else:
        for i in range(gene_index+1, contig_index[-1]+1):
            index_list.append(i)
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())
if direction == 'down':
    index_list = []
    if (gene_index-n) in contig_index:
        for i in range(gene_index-(n), gene_index):
            index_list.append(i)
    else:
        for i in range(contig_index[0], gene_index):
            index_list.append(i)
    index_list.reverse()
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())

['Pst104E_00001', 'Pst104E_00000']


In [67]:
gene_index-n in contig_index

True

In [65]:
index_list

[994, 995, 996, 997, 998]

In [36]:
gene_index[0]

999

In [50]:
[gene_index: gene_index+6]

SyntaxError: invalid syntax (<ipython-input-50-0f1b2462552f>, line 1)