### Notebook to get synteny measures between two assemblies
The goal is a tool that takes OrthoFinder results and CDS files and checks the length of synteny for different gene categories.

The idea is that BUSCOs are more syntenic than effectors and similar genes.

The program counts how many neighbours of an allele pair are in the same orthogroup. This first requires anchoring the allele pairing, i.e. finding the 1:1 allele, and then walking outward to see whether the neighbours on both sides are in the same orthogroup. For now we won't allow any skips and just go one by one.

GENOMEA_orthoarray =  [1,2,3,4]
GENOMEB_orthoarray =  [1,2,3,4]

This should result in a tuple where the first element is the number of observed matches and the second is the number of possible matches. Here this would be (8,8).
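A minimal sketch of that counting. It assumes the two arrays above describe the neighbours on each side of the anchor and that upstream and downstream counts are simply summed, which is one way to arrive at (8, 8) for four neighbours per side. The names `count_matches` and `synteny_tuple` are hypothetical and not part of the notebook.

```python
def count_matches(query, subject):
    """Position-wise matches between two orthogroup arrays.

    Returns (observed, possible); possible is the length of the shorter array.
    """
    possible = min(len(query), len(subject))
    observed = sum(1 for q, s in zip(query, subject) if q == s)
    return observed, possible


def synteny_tuple(query_up, subject_up, query_down, subject_down):
    """Sum the upstream and downstream counts into one (observed, possible) tuple."""
    obs_up, max_up = count_matches(query_up, subject_up)
    obs_down, max_down = count_matches(query_down, subject_down)
    return obs_up + obs_down, max_up + max_down


# assumption: the same example array is used for both directions
GENOMEA_orthoarray = [1, 2, 3, 4]
GENOMEB_orthoarray = [1, 2, 3, 4]
result = synteny_tuple(GENOMEA_orthoarray, GENOMEB_orthoarray,
                       GENOMEA_orthoarray, GENOMEB_orthoarray)
print(result)
```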

If this analysis is performed on two different sets of genes, it may tell us whether microsynteny is more conserved within one gene group than another.
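One simple way to compare two gene groups could be to pool the per-gene tuples and look at the fraction of conserved neighbours. The sketch below is only an illustration: `conserved_fraction` is a hypothetical helper, and the numbers in both lists are invented, not results from these assemblies.

```python
def conserved_fraction(tuples):
    """Pool (observed, possible) tuples and return observed / possible."""
    observed = sum(obs for obs, _ in tuples)
    possible = sum(poss for _, poss in tuples)
    if possible == 0:
        return float('nan')
    return observed / possible


# invented example values, for illustration only
group_one = [(8, 8), (6, 8), (7, 8)]
group_two = [(2, 8), (0, 4), (3, 8)]

print(conserved_fraction(group_one))
print(conserved_fraction(group_two))
```

A proper comparison would still need a statistical test on the pooled counts; the fraction alone only summarises them.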

#### initial outline

get a gene ID -> get its n neighbours in two different arrays, upstream and downstream -> get the orthogroup arrays for the neighbours 
get a gene ID  -> get the orthogroup of the gene -> get all members of the orthogroup that belong to the other genome -> get their neighbours +/- 1 -> get the best seed where both neighbours match
* If we have only one best match, that is easy. Use this seed as the 1:1 match, get the neighbourhood arrays up and down -> compare the arrays one element at a time and save the tuple in a dictionary.
* If there are multiple hits, move one further out with each hit and compare those in the same way.

Things also to generate:
A match dictionary for allele pairing.  
A unique gene dictionary.  
A paralog dictionary, meaning cases where there is no allele pairing but things are in the same orthogroup.
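The outline above can be sketched on toy data. This is only one possible reading of the plan: it works on a single contig per genome, scores each candidate by how many of its +/- 1 neighbours share an orthogroup with the query's neighbours (in either orientation), and sorts the gene into one of the three categories. All function names and gene ids are invented for the illustration.

```python
def flank_ogs(order, gene, gene_to_og):
    """Orthogroups of the -1 and +1 neighbours (None at the end of the contig)."""
    i = order.index(gene)
    left = gene_to_og[order[i - 1]] if i > 0 else None
    right = gene_to_og[order[i + 1]] if i + 1 < len(order) else None
    return left, right


def seed_score(query_flanks, subject_flanks):
    """Number of matching flanks, taking the better of the two orientations."""
    def hits(a, b):
        return sum(1 for x, y in zip(a, b) if x is not None and x == y)
    return max(hits(query_flanks, subject_flanks),
               hits(query_flanks, subject_flanks[::-1]))


def classify(gene, query_order, subject_order, gene_to_og, og_members, subject_prefix):
    """Return ('unique' | 'paralog' | 'allele', list of subject genes)."""
    candidates = [g for g in og_members[gene_to_og[gene]] if g.startswith(subject_prefix)]
    if not candidates:
        return 'unique', []
    query_flanks = flank_ogs(query_order, gene, gene_to_og)
    scores = {c: seed_score(query_flanks, flank_ogs(subject_order, c, gene_to_og))
              for c in candidates}
    best = max(scores.values())
    if best == 0:
        # same orthogroup but no shared neighbourhood
        return 'paralog', candidates
    return 'allele', [c for c in candidates if scores[c] == best]


# toy data, all ids invented
query_order = ['A_1', 'A_2', 'A_3', 'A_4']
subject_order = ['B_1', 'B_2', 'B_3', 'B_9']
og_members = {'OG1': ['A_1', 'B_1'], 'OG2': ['A_2', 'B_2', 'B_9'],
              'OG3': ['A_3', 'B_3'], 'OG4': ['A_4']}
gene_to_og = {g: og for og, members in og_members.items() for g in members}

allele_call = classify('A_2', query_order, subject_order, gene_to_og, og_members, 'B_')
unique_call = classify('A_4', query_order, subject_order, gene_to_og, og_members, 'B_')
print(allele_call, unique_call)
```

If several candidates tie for the best score, the returned list has more than one entry, which is the case where the outline suggests moving one neighbour further out.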

In [1]:
import pandas as pd
import os
import re
from Bio import SeqIO
from Bio import SeqUtils
import pysam
from Bio.SeqRecord import SeqRecord
from pybedtools import BedTool
import numpy as np
import pybedtools
import time
import matplotlib.pyplot as plt
import sys
import subprocess
import shutil
from Bio.Seq import Seq
import pysam
from Bio import SearchIO
import json
import glob
import scipy.stats as stats
import statsmodels as sms
import statsmodels.sandbox.stats.multicomp
import distance
import seaborn as sns
from pybedtools import BedTool
import matplotlib
#sklearn.externals.joblib is no longer shipped with recent scikit-learn, import joblib directly
from joblib import Parallel, delayed
import itertools as it
import tempfile
from scipy.signal import argrelextrema
import scipy
from IPython.display import Image
from PIL import Image
from collections import OrderedDict



In [2]:
### define some paths that should eventually come in as args
ORTHOFINDER_FILE_NAME = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/OrthoFinder_all/Orthogroups_2.txt'
QUERY_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/Pst104E_annotations/Pst_104E_v13_p_ctg.gene.bed'
SUBJECT_GENOME_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/DK0911_annotations/DK_0911_v04_p_ctg.genes.gene.bed'
QUERY_SELECT_GENE_BED6 = '/home/benjamin/genome_assembly/Warrior/DK0911_v04/comp_orthology/orthofinder/test_folder/Pst_104E_v13_p_ctg.gene1000.bed'

In [45]:
#now write some functions
def get_neighbours(gene_id, bed_filename, n=5, direction='up'):
    """Return up to n neighbouring genes of gene_id from a bed6 file.
    Input: 
        gene_id
        bed6 filename
        n being the number of neighbours we want to get
        direction being 'up' (following rows in the bed file) or 'down' (preceding rows).
    Output:
        returns the largest possible array of neighbours up to n on the same contig.
        The 0 element is always the closest neighbour no matter if you return an 'up' or 'down' array.
        An empty list is returned if gene_id is not in the bed file.
    Assumes the bed file is sorted so that genes of one contig sit on consecutive rows."""
    if direction not in ['up', 'down']:
        raise ValueError('Ensure direction is up or down.')
    bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
    try:
        bed_df = pd.read_csv(bed_filename, sep='\t', header=None, names=bed_6_header)
    except Exception:
        print('Check if the bedfiles are bed6')
        raise

    #fix to make sure the gene id is actually in the bed_file if not just return an empty list
    if gene_id not in bed_df['gene_id'].unique():
        return []
    gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
    contig = bed_df.loc[gene_index, 'chrom']
    contig_index = bed_df[bed_df['chrom'] == contig].index
    if direction == 'up':
        #walk towards higher row numbers but stop at the end of the contig
        last = min(gene_index + n, contig_index[-1])
        index_list = list(range(gene_index + 1, last + 1))
    else:
        #walk towards lower row numbers but stop at the start of the contig
        first = max(gene_index - n, contig_index[0])
        index_list = list(range(first, gene_index))
        index_list.reverse()
    return bed_df.loc[index_list, 'gene_id'].tolist()
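`get_neighbours` reads the bed file again on every call. As a sketch of the same index walk without file access, the hypothetical helper below works on a list of `(chrom, gene_id)` rows that is already in memory. The contig names are invented; the gene ids follow the naming used in the query bed file.

```python
def neighbours_from_rows(rows, gene_id, n=5, direction='up'):
    """rows: list of (chrom, gene_id) tuples in bed file order.

    Returns up to n neighbours on the same contig, closest neighbour first.
    """
    if direction not in ('up', 'down'):
        raise ValueError("direction must be 'up' or 'down'")
    ids = [gene for _, gene in rows]
    if gene_id not in ids:
        return []
    i = ids.index(gene_id)
    chrom = rows[i][0]
    step = 1 if direction == 'up' else -1
    out = []
    j = i + step
    # stop at the list boundary, at a contig change, or once n genes are collected
    while 0 <= j < len(rows) and rows[j][0] == chrom and len(out) < n:
        out.append(rows[j][1])
        j += step
    return out


# toy rows; 'ctg1' and 'ctg2' are made-up contig names
rows = [('ctg1', 'Pst104E_00000'), ('ctg1', 'Pst104E_00001'),
        ('ctg1', 'Pst104E_00002'), ('ctg1', 'Pst104E_00003'),
        ('ctg2', 'Pst104E_00004')]

down = neighbours_from_rows(rows, 'Pst104E_00002', n=10, direction='down')
up = neighbours_from_rows(rows, 'Pst104E_00002', n=10, direction='up')
print(down, up)
```

The `up` walk stops at `Pst104E_00003` because the next row belongs to a different contig.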

In [4]:
def get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, other_id):
    """The function returns all the potential orthologs in the comparative species.
    Input:
        gene_id
        gene_to_ortho_dict, the dict mapping each gene to its orthogroup
        orthofinder_dict, the dictionary of the orthogroups to get all genes belonging to the orthogroup in question.
        other_id, the identifier (gene id prefix) of the comparative species."""

    return [x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]] if x.startswith(other_id)]
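`get_orthologs` raises a `KeyError` if a gene has no orthogroup entry. A defensive variant could look like the sketch below; `get_orthologs_safe` is a hypothetical name, and the orthogroup and gene ids in the toy dictionaries are invented.

```python
def get_orthologs_safe(gene_id, gene_to_ortho_dict, orthofinder_dict, other_id):
    """Like get_orthologs, but returns an empty list for genes without an orthogroup."""
    orthogroup = gene_to_ortho_dict.get(gene_id)
    if orthogroup is None:
        return []
    return [x for x in orthofinder_dict.get(orthogroup, []) if x.startswith(other_id)]


# toy dictionaries with invented ids
demo_orthofinder_dict = {'OG_demo': ['Pst104E_x1', 'DK0911_y1', 'DK0911_y2']}
demo_gene_to_ortho_dict = {gene: og
                           for og, members in demo_orthofinder_dict.items()
                           for gene in members}

hits = get_orthologs_safe('Pst104E_x1', demo_gene_to_ortho_dict,
                          demo_orthofinder_dict, 'DK0911')
no_hits = get_orthologs_safe('Pst104E_unknown', demo_gene_to_ortho_dict,
                             demo_orthofinder_dict, 'DK0911')
print(hits, no_hits)
```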

In [15]:
def count_pairings(query_ortho, subject_ortho, simple = True):
    """Function that counts the pairwise overlap of two lists.
    In the simple = True mode this is a strict 1:1 pairing.
    In the simple = False mode this is a sliding window of 1 vs the [i-1, i, i+1] elements.
    Input: 
        query_ortho list
        subject_ortho list
        simple True or False
    Output:
        obs the number of observed pairings
        max_len the number of potential pairings"""
    max_len = min(len(query_ortho), len(subject_ortho))
    obs = 0
    for i in range(0, max_len):
        if simple:
            #strict mode: only the same position counts
            window = subject_ortho[i:i+1]
        else:
            #sliding mode: also accept the subject positions i-1 and i+1,
            #clipped to [0, max_len) so that i-1 never wraps around to the end of the list
            window = subject_ortho[max(0, i-1):min(i+2, max_len)]
        if query_ortho[i] in window:
            obs = obs + 1
    return obs, max_len
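To see the difference between the two modes, the sketch below re-implements them compactly (so the block runs on its own) and applies them to the upstream orthogroup arrays printed further down in this notebook. `strict_matches` and `sliding_matches` are hypothetical names.

```python
def strict_matches(query, subject):
    """1:1 comparison, position by position."""
    max_len = min(len(query), len(subject))
    return sum(query[i] == subject[i] for i in range(max_len)), max_len


def sliding_matches(query, subject):
    """Accept a match at subject position i-1, i or i+1 (clipped to the shared length)."""
    max_len = min(len(query), len(subject))
    obs = 0
    for i in range(max_len):
        window = subject[max(0, i - 1):min(i + 2, max_len)]
        if query[i] in window:
            obs += 1
    return obs, max_len


sub_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']
query_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']

strict_result = strict_matches(query_up, sub_up)
sliding_result = sliding_matches(query_up, sub_up)
print(strict_result, sliding_result)
```

The query carries one extra gene (`OG0037814`), which shifts `OG0012676` by one position. The strict mode therefore counts 3 of 5, while the sliding window recovers the shifted gene and counts 4 of 5.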

In [57]:
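def get_up_and_down(gene_id, query_downstream_ortho,query_upstream_ortho, subject_bed_fn, n, simple):
    """Compare the neighbourhood of a subject gene with the query neighbourhood.
    gene_id is the subject gene (the candidate allele). The orthogroups of its up to n
    upstream and downstream neighbours are looked up in the global gene_to_ortho_dict
    and paired against the query arrays once a first neighbour matches.
    Returns [(obs_down, max_down), (obs_up, max_up)]; both tuples are (0, 0)
    if no first neighbour matches."""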
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(gene_id, subject_bed_fn, n=n, direction='down')]
    
    #define in case something has no upstream or downstream hits
    obs_up, max_up, obs_down, max_down = 0, 0, 0, 0
    
    if len(query_upstream_ortho) > 0 and len(subject_upstream_ortho) > 0: 
        if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
            #now look at the downstream
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
            return [(obs_down, max_down), (obs_up, max_up)]
        
    elif len(subject_downstream_ortho) > 0 and len(query_upstream_ortho) >0 :
        if query_upstream_ortho[0] == subject_downstream_ortho[0]:
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, simple=simple)
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, simple=simple)
            return [(obs_down, max_down), (obs_up, max_up)]
        
    elif len(subject_downstream_ortho) > 0 and len(query_downstream_ortho) > 0:
        if subject_downstream_ortho[0] == query_downstream_ortho[0]:
            obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
            #now look at the downstream
            obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
            return [(obs_down, max_down), (obs_up, max_up)]
    
    return [(obs_down, max_down), (obs_up, max_up)]
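In `get_up_and_down` the later `elif` branches are only reached when an earlier length check fails. A possible variant that tests both orientations independently, and works on plain lists without reading a bed file, is sketched below. `pair_counts` and `score_both_orientations` are hypothetical names; the example arrays are the ones printed further down in this notebook.

```python
def pair_counts(query, subject):
    """Strict position-wise count, returns (observed, possible)."""
    max_len = min(len(query), len(subject))
    return sum(query[i] == subject[i] for i in range(max_len)), max_len


def score_both_orientations(q_up, q_down, s_up, s_down):
    """Return (orientation, [(obs_down, max_down), (obs_up, max_up)]).

    orientation is 'same', 'flipped' or None if no first neighbour matches.
    """
    def first_match(a, b):
        return bool(a) and bool(b) and a[0] == b[0]

    if first_match(q_up, s_up) or first_match(q_down, s_down):
        return 'same', [pair_counts(q_down, s_down), pair_counts(q_up, s_up)]
    if first_match(q_up, s_down) or first_match(q_down, s_up):
        # subject contig runs in the opposite direction
        return 'flipped', [pair_counts(q_down, s_up), pair_counts(q_up, s_down)]
    return None, [(0, 0), (0, 0)]


sub_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']
query_up = ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']
sub_down = ['OG0009613', 'OG0006130']
query_down = ['OG0009613', 'OG0006130']

same = score_both_orientations(query_up, query_down, sub_up, sub_down)
# swapping the subject arrays simulates an inverted subject contig
flipped = score_both_orientations(query_up, query_down, sub_down, sub_up)
print(same, flipped)
```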

In [7]:
#generate dicts that will be used to track things
allele_dict = {}
singleton_dict = {}
paralog_dict = {}

In [8]:
#get the identifiers if not provided.
query_id = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]
subject_id = pd.read_csv(SUBJECT_GENOME_GENE_BED6, sep='\t', header=None).loc[0,[3]][3].split('_')[0]

In [9]:
#get the orthofinder results read in
#this is the dict for all orthogroups and what members they have
orthofinder_dict = {}
with open(ORTHOFINDER_FILE_NAME) as fh:
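    for line in fh:
        #strip() returns a new string, so the result has to be assigned
        line = line.strip()
        if not line:
            continue
        orthogroup, members = line.split(':', 1)
        orthofinder_dict[orthogroup] = members.split()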

In [10]:
#this is the dict for proteins and what orthogroup they have
gene_to_ortho_dict = {}
for key,value in orthofinder_dict.items():
    for item in value:
        gene_to_ortho_dict[item] = key
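The two dictionaries can be checked on a small in-memory example. The sketch assumes the line layout implied by the parsing code above, one orthogroup per line in the form `name: member member ...`. The helper names and all ids in the sample text are invented.

```python
import io


def read_orthogroups(handle):
    """Parse 'OG: gene gene ...' lines into {orthogroup: [genes]}."""
    groups = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, members = line.split(':', 1)
        groups[name.strip()] = members.split()
    return groups


def invert_orthogroups(groups):
    """Build {gene: orthogroup} from {orthogroup: [genes]}."""
    return {gene: name for name, members in groups.items() for gene in members}


# invented sample content, including a single-member orthogroup and a blank line
sample = io.StringIO(
    "OG_demo1: Pst104E_a DK0911_a DK0911_b\n"
    "OG_demo2: Pst104E_b\n"
    "\n"
)

demo_groups = read_orthogroups(sample)
demo_gene_to_og = invert_orthogroups(demo_groups)
print(demo_groups)
print(demo_gene_to_og)
```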

In [24]:
#gene_id = 'Pst104E_00003'
gene_id = 'Pst104E_00003'
print([x for x in orthofinder_dict[gene_to_ortho_dict[gene_id]]if x.startswith('DK0911')])
n=8
simple = True

['DK0911_03127', 'DK0911_03133', 'DK0911_03917', 'DK0911_08493', 'DK0911_08500', 'DK0911_08605', 'DK0911_08613', 'DK0911_09268', 'DK0911_12572', 'DK0911_12633', 'DK0911_15059', 'DK0911_15247', 'DK0911_16983', 'DK0911_17341', 'DK0911_20052', 'DK0911_30677']


In [21]:
#orthofinder_dict[gene_to_ortho_dict[gene_id]]

In [25]:
#populate the query arrays for neighbours and their corresponding ortho group ids
query_upstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='up')
query_downstream = get_neighbours(gene_id, QUERY_GENOME_GENE_BED6, n=n, direction='down')
query_upstream_ortho = [gene_to_ortho_dict[x] for x in query_upstream]
query_downstream_ortho = [gene_to_ortho_dict[x] for x in query_downstream]

In [26]:
#get the first orthologs (aka potential allele pairings)
first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
if len(first_orthologs) == 0:
    singleton_dict[gene_id] = True
elif len(first_orthologs) == 1:
    up_and_down_list = get_up_and_down(first_orthologs[0],query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)
    print(up_and_down_list)
#think about catching paralogs meaning that +1/-1 are not in the same orthogroup

In [27]:
up_and_down_list

[(1, 1), (4, 8)]

In [28]:
first_orthologs

['DK0911_03127',
 'DK0911_03133',
 'DK0911_03917',
 'DK0911_08493',
 'DK0911_08500',
 'DK0911_08605',
 'DK0911_08613',
 'DK0911_09268',
 'DK0911_12572',
 'DK0911_12633',
 'DK0911_15059',
 'DK0911_15247',
 'DK0911_16983',
 'DK0911_17341',
 'DK0911_20052',
 'DK0911_30677']

#now we need to work on the case when we have multiple orthologs. This will be done on a 1:1 basis moving outward; if the first neighbours match return a 1.
#if in the first round nothing matches don't move forward. Just return all possible paralog pairings.
#if multiple match move one further out and add half a point for each match on either side.

In [58]:
ortho_dict = {}
for ortho in first_orthologs:
    print(ortho)
    ortho_dict[ortho] = get_up_and_down(ortho,query_downstream_ortho,query_upstream_ortho, SUBJECT_GENOME_GENE_BED6, n=n, simple=simple)

DK0911_03127
DK0911_03133
DK0911_03917
DK0911_08493
DK0911_08500
DK0911_08605
DK0911_08613
DK0911_09268
DK0911_12572
DK0911_12633
DK0911_15059
DK0911_15247
DK0911_16983
DK0911_17341
DK0911_20052
DK0911_30677


In [59]:
ortho_dict

{'DK0911_03127': [(0, 0), (0, 0)],
 'DK0911_03133': [(0, 0), (0, 0)],
 'DK0911_03917': [(0, 0), (0, 0)],
 'DK0911_08493': [(0, 0), (0, 0)],
 'DK0911_08500': [(0, 0), (0, 0)],
 'DK0911_08605': [(0, 0), (0, 0)],
 'DK0911_08613': [(0, 0), (0, 0)],
 'DK0911_09268': [(0, 0), (0, 0)],
 'DK0911_12572': [(0, 0), (0, 0)],
 'DK0911_12633': [(0, 0), (0, 0)],
 'DK0911_15059': [(3, 3), (2, 8)],
 'DK0911_15247': [(0, 0), (0, 0)],
 'DK0911_16983': [(0, 0), (0, 0)],
 'DK0911_17341': [(0, 0), (0, 0)],
 'DK0911_20052': [(0, 0), (0, 0)],
 'DK0911_30677': [(0, 0), (0, 0)]}
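With a dictionary like this, picking the candidate allele could be as simple as taking the ortholog with the most observed matches. The sketch below uses three of the entries shown above; `best_seeds` is a hypothetical helper, and ranking by the summed observed counts is an assumption rather than the notebook's final scoring.

```python
def best_seeds(ortho_dict):
    """Return the orthologs with the highest total of observed matches.

    An empty list means no candidate shares a neighbourhood with the query.
    """
    totals = {gene: sum(obs for obs, _ in pairs) for gene, pairs in ortho_dict.items()}
    best = max(totals.values(), default=0)
    if best == 0:
        return []
    return [gene for gene, total in totals.items() if total == best]


# subset of the ortho_dict output above
ortho_subset = {
    'DK0911_03127': [(0, 0), (0, 0)],
    'DK0911_15059': [(3, 3), (2, 8)],
    'DK0911_30677': [(0, 0), (0, 0)],
}

seeds = best_seeds(ortho_subset)
print(seeds)
```

If more than one gene is returned, the tie would still have to be resolved, for example by walking further outward as described in the comments above.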

In [56]:
broken = ortho
gene_to_ortho_dict[ortho]

'OG0000028'

In [47]:
print([gene_to_ortho_dict[x] for x in get_neighbours(ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='up')])
print([gene_to_ortho_dict[x] for x in get_neighbours(ortho, SUBJECT_GENOME_GENE_BED6, n=n, direction='down')])

[]
[]


In [44]:
ortho

'DK0911_20052'

In [126]:
#get the first orthologs (aka potential allele pairings)
first_orthologs = get_orthologs(gene_id, gene_to_ortho_dict, orthofinder_dict, subject_id)
if len(first_orthologs) == 0:
    singleton_dict[gene_id] = True
elif len(first_orthologs) == 1:
    subject_upstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='up')]
    subject_downstream_ortho = [gene_to_ortho_dict[x] for x in get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')]
    if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
        obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, simple=simple)       
        #now look at the downstream
        obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, simple=simple)
    elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
        obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, simple=simple)
        obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, simple=simple)

In [100]:
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:
    obs_up = 0
    obs_down = 0 
    if len(query_upstream_ortho) <= len(subject_upstream_ortho):
        
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_upstream_ortho)
        
    for i in range(0, max_up):
        if query_upstream_ortho[i] == subject_upstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_downstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_downstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_downstream_ortho[i]:
            obs_down = obs_down + 1
    
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up = 0
    obs_down = 0
    if len(query_upstream_ortho) <= len(subject_downstream_ortho):
        max_up = len(query_upstream_ortho)
    else:
        max_up = len(subject_downstream_ortho)
    for i in range(0, max_up):
        if query_upstream_ortho[i] == subject_downstream_ortho[i]:
            obs_up = obs_up + 1
            
    #now look at the downstream
    if len(query_downstream_ortho) <= len(subject_upstream_ortho):
        
        max_down = len(query_downstream_ortho)
    else:
        max_down = len(subject_upstream_ortho)
        
    for i in range(0, max_down):
        if query_downstream_ortho[i] == subject_upstream_ortho[i]:
            obs_down = obs_down + 1
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

obs_up 3, max_up 5, obs_down 2, max_down 2
sub_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']
query_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']
sub_down ['OG0009613', 'OG0006130']
query_down ['OG0009613', 'OG0006130']


In [110]:
def count_pairings(query_ortho, subject_ortho, simple = True):
    """Function that counts the pairwise overlap of two lists.
    In the simple = True mode this is a strict 1:1 pairing.
    In the simple = False mode this is a sliding window of 1 vs the [i-1, i, i+1] elements.
    Input: 
        query_ortho list
        subject_orth list
        simple True or False
    output:
        obs the number of observed pairings
        max_len the number of potential pairings"""
    if len(query_ortho) <= len(subject_ortho):
        max_len = len(query_ortho)
    else:
        max_len = len(subject_ortho)
    obs = 0
    if simple == True:
        for i in range(0, max_len):
            if query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
        return obs, max_len
            
        
    if simple == False:
        for i in range(0, max_len):
            if i > 0 and i < max_len -1:    
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1] \
                or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            elif i == max_len -1:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i-1]:
                    obs = obs + 1
            elif query_ortho[i] == subject_ortho[i]:
                obs = obs + 1
            elif i == 0:
                if query_ortho[i] == subject_ortho[i] or query_ortho[i] == subject_ortho[i+1]:
                    obs = obs + 1
            
        return obs, max_len

In [107]:
#second iteration that allows for a +1/-1 difference once the first neighbour is the same
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho)       
    #now look at the downstream
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho)
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho)
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho)
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

obs_up 3, max_up 5, obs_down 2, max_down 2
sub_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']
query_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']
sub_down ['OG0009613', 'OG0006130']
query_down ['OG0009613', 'OG0006130']


In [109]:
#second iteration that allows for a +1/-1 difference once the first neighbour is the same
#now we need to look at the pairing of the directions
#and add stuff up
if query_upstream_ortho[0] == subject_upstream_ortho[0]:   
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_upstream_ortho, False)       
    #now look at the downstream
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_downstream_ortho, False)
elif query_upstream_ortho[0] == subject_downstream_ortho[0]:
    obs_up, max_up = count_pairings(query_upstream_ortho, subject_downstream_ortho, False)
    obs_down, max_down = count_pairings(query_downstream_ortho, subject_upstream_ortho, False)
print('obs_up %i, max_up %i, obs_down %i, max_down %i' % (obs_up, max_up, obs_down, max_down))
print('sub_up %s\nquery_up %s\nsub_down %s\nquery_down %s' % (subject_upstream_ortho, query_upstream_ortho,subject_downstream_ortho,query_downstream_ortho ))

obs_up 4, max_up 5, obs_down 2, max_down 2
sub_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']
query_up ['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']
sub_down ['OG0009613', 'OG0006130']
query_down ['OG0009613', 'OG0006130']


In [60]:
get_neighbours(first_orthologs[0], SUBJECT_GENOME_GENE_BED6, n=n, direction='down')

['DK0911_15057', 'DK0911_15056']

In [61]:
first_orthologs

['DK0911_15058']

In [56]:
query_upstream_ortho

['OG0000028', 'OG0005025', 'OG0015028', 'OG0037814', 'OG0012676']

In [57]:
subject_upstream_ortho

['OG0000028', 'OG0005025', 'OG0015028', 'OG0012676', 'OG0039656']

In [58]:
query_downstream_ortho

['OG0009613', 'OG0006130']

In [59]:
subject_downstream_ortho

['OG0009613', 'OG0006130']

In [32]:
line.split(':')[1].strip()

'Pst104E_25431'

In [25]:
get_neighbours('Pst104E_00002', QUERY_GENOME_GENE_BED6, n=10, direction = 'down')

['Pst104E_00001', 'Pst104E_00000']

In [8]:
bed_6_header = ['chrom', 'start', 'stop', 'gene_id', 'phase', 'strand']
bed_df = pd.read_csv(QUERY_GENOME_GENE_BED6, sep='\t', header=None, names=bed_6_header)
gene_id = pd.read_csv(QUERY_SELECT_GENE_BED6, sep='\t', header=None, names=bed_6_header).loc[0,['gene_id']]['gene_id']


In [40]:
n=5
direction = 'up'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']

In [42]:
contig_index = bed_df[bed_df['chrom']== contig].index

In [79]:
gene_id = 'Pst104E_00002'
gene_index = bed_df[bed_df['gene_id'] == gene_id].index[0]
contig = bed_df.loc[gene_index, ['chrom']]['chrom']
contig_index = bed_df[bed_df['chrom']== contig].index
direction = 'down'
if direction == 'up':
    index_list = []
    if (gene_index+(n)) in contig_index:
        for i in range(gene_index+1, gene_index+(n+1)):
            index_list.append(i)
    else:
        for i in range(gene_index+1, contig_index[-1]+1):
            index_list.append(i)
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())
if direction == 'down':
    index_list = []
    if (gene_index-n) in contig_index:
        for i in range(gene_index-(n), gene_index):
            index_list.append(i)
    else:
        for i in range(contig_index[0], gene_index):
            index_list.append(i)
    index_list.reverse()
    print(bed_df.loc[index_list, ['gene_id']]['gene_id'].tolist())

['Pst104E_00001', 'Pst104E_00000']


In [67]:
gene_index-n in contig_index

True

In [65]:
index_list

[994, 995, 996, 997, 998]

In [36]:
gene_index[0]

999

In [50]:
[gene_index: gene_index+6]

SyntaxError: invalid syntax (<ipython-input-50-0f1b2462552f>, line 1)