# <span style="color:#ff1414"> BEDtools analysis. </span>

This is a script to answer research questions outlined elsewhere. In summary, this script:

1. compares methylation results between different methylation-callers, and between different methylation sequencing methods.

2. compares methylation between genes and non-gene regions

3. compares methylation between transposons and non-repetitive regions

4. compares transposons and genes


Note:
- PB/pb = PacBio
- ONT/ont = Oxford Nanopore Technology
- NP = Nanopolish

In [234]:
# load modules
import os
import glob
import pprint
import pandas as pd
import scipy
import numpy as np
import pybedtools
from pybedtools import BedTool

In [235]:
#First we need to define the base dirs
DIRS = {}
DIRS['BASE2'] = '/home/anjuni/analysis'
DIRS['FEATURES'] = os.path.join(DIRS['BASE2'], 'coverage', 'feature_files')
DIRS['RAND'] = os.path.join(DIRS['BASE2'], 'coverage', 'randomisation')
DIRS['COVERAGE'] = os.path.join(DIRS['BASE2'], 'coverage')
DIRS['GENE'] = os.path.join(DIRS['COVERAGE'], 'gene_level')
DIRS['GENE_BODY'] = os.path.join(DIRS['GENE'], 'gene_body')
DIRS['TE'] = os.path.join(DIRS['GENE'], 'te')
DIRS['GFF_INPUT'] = os.path.join(DIRS['BASE2'], 'gff_output')
DIRS['FIGURES'] = os.path.join(DIRS['BASE2'], 'figures')
DIRS['COMPARE'] = os.path.join(DIRS['BASE2'], 'gene_te_comparison')

In [236]:
#Quick chech if directories exist
for value in DIRS.values():
    if not os.path.exists(value):
        print('%s does not exist' % value)

In [79]:
#Make filepaths
bed_file_list = [fn for fn in glob.iglob('%s/*.bed' % DIRS['BED_INPUT'], recursive=True)]
gff_file_list = [fn for fn in glob.iglob('%s/*anno.gff3' % DIRS['GFF_INPUT'], recursive=True)]
te_file_list = [fn for fn in glob.iglob('%s/*.gff' % DIRS['GFF_INPUT'], recursive=True)]

In [80]:
#Check that the list works
print(*bed_file_list, sep='\n')
print(*gff_file_list, sep='\n')
print(*te_file_list, sep='\n')

/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_plus_tombo_sorted.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_CpG_tombo_np.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_tombo_np.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_plus_CpG_np_tombo.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_hc_tombo_sorted.CpG.plus.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_s_nanopolish.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/nanopolish_rerun_subtract.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_hc_nanopolish_sorted.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_plus_CpG_tombo_np.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/nano_plus_tombo_overlap.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_tombo_sorted.bed
/home/anjuni/analysis/bedtools_output/sequencing_comparison/5m

## <span style='color:#9ac615'> 8. Comparing methylated transposons and genes. <span/>

In [None]:
# make a bed file of the 500 most highly methylated and 500 least methylated genes
# write a function so you can find top 100 as well
# or use a cutoff for methylation fraction by ranking

In [58]:
# TE coverage files in bed6 format were generated in notebook 6 for the next part and then copied over to the input folder
# Gene annotation files in bed6 that were previously generated were copied over to the input folder

!cp /home/anjuni/analysis/coverage/gene_level/gene_body/input/*anno.s* /home/anjuni/analysis/gene_te_comparison/input
!cp /home/anjuni/analysis/coverage/gene_level/te/coverage/* /home/anjuni/analysis/gene_te_comparison/input

In [242]:
!cp /home/anjuni/analysis/coverage/gene_level/te/coverage/* /home/anjuni/analysis/gene_te_comparison/input

cp: omitting directory '/home/anjuni/analysis/coverage/gene_level/te/coverage/old'


In [243]:
# make a dataframe of transposons and rank by methylation level
te_fn_list = [fn for fn in glob.iglob('%s/*REPET*' % os.path.join(DIRS['COMPARE'], 'input'), recursive=True)]

In [244]:
te_fn_list

['/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed',
 '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed',
 '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed',
 '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed',
 '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed',
 '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed']

In [245]:
headings = ['contig', 'start', 'stop', 'name', 'overlap_fraction', 'strand']

In [246]:
def fn_to_df_dict(fn_list):
    "Returns a dictionary of pandas dataframes for a list of file paths."
    df_dict = {}
    for fn in fn_list:
        df = pd.read_csv(fn, sep='\t', header=None, names = headings)
        df_dict[fn.split('/')[-1]] = df
    return df_dict

In [247]:
te_df_dict = fn_to_df_dict(te_fn_list)

In [248]:
print(*te_df_dict, sep='\n')

Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed


In [250]:
te_df_dict['Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'].head()

Unnamed: 0,contig,start,stop,name,overlap_fraction,strand
0,hcontig_000_003,148,221,ID=ms32995_hcontig_000_003_DXX-MITE_MCL744_Pst...,13,0.178082
1,hcontig_000_003,243,270,ID=ms206957_hcontig_000_003_TTTTCGAAATTGAA2;Ta...,0,0.0
2,hcontig_000_003,674,693,ID=ms206958_hcontig_000_003_CCTCCGTGTT2;Target...,6,0.31579
3,hcontig_000_003,860,879,ID=ms206959_hcontig_000_003_CAGGGAGTG2;Target=...,1,0.052632
4,hcontig_000_003,1048,1059,ID=ms206960_hcontig_000_003_TCTCC2;Target=TCTC...,3,0.272727


In [276]:
# rank by methylation level and subset to a new df
def top_te(df_dict, asc, number):
    """Takes a dictionary of dataframes, sorts by overlap fraction, and returns a dictionary of dataframes sliced by the top number."""
    top_dict = {}
    for key, value in df_dict.items():
        df = value.sort_values(by='overlap_fraction', ascending=asc).iloc[:500]
        top_dict[key] = df
        if asc == True:
            out_fn = os.path.join(DIRS['COMPARE'], 'low_te_files', key)
            out_fn = out_fn.replace('bed', '500.bed')
            df.to_csv(out_fn, sep='\t', header=None, index=None)
        if asc == False:
            out_fn = os.path.join(DIRS['COMPARE'], 'top_te_files', key)
            out_fn = out_fn.replace('bed', '500.bed')
            df.to_csv(out_fn, sep='\t', header=None, index=None)
    return top_dict

In [277]:
print(*top500_te_df_dict, sep='\n')

Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed


In [282]:
low500_te_df_dict['Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed'].shape

(500, 6)

In [280]:
top500_te_df_dict = top_te(te_df_dict, False, 500)
low500_te_df_dict = top_te(te_df_dict, True, 500)
#top100_te_df_dict = top_te(te_df_dict, False, 100)
#low100_te_df_dict = top_te(te_df_dict, True, 100)

In [281]:
print(*top500_te_df_dict, sep='\n')
print(*low500_te_df_dict, sep='\n')

Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.bed


In [283]:
top500_fn_list = [fn for fn in glob.iglob('%s/*' % os.path.join(DIRS['COMPARE'], 'top_te_files'), recursive=True)]
top500_dict = {}
for fn in top500_fn_list:
    top500_dict[fn.split('/')[-1]] = fn

low500_fn_list = [fn for fn in glob.iglob('%s/*' % os.path.join(DIRS['COMPARE'], 'low_te_files'), recursive=True)]
low500_dict = {}
for fn in top500_fn_list:
    low500_dict[fn.split('/')[-1]] = fn

In [284]:
low500_dict

{'Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.6mA_h

In [285]:
top500_dict

{'Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 'Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed': '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.6mA_h

In [286]:
low500_fn_list

['/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/low_te_files/Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed']

In [287]:
top500_fn_list

['/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.bed',
 '/home/anjuni/analysis/gene_te_comparison/top_te_files/Pst_104E_v13_p_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.500.bed']

In [None]:
# copy files to NCI
!scp -r top_te_files/ $NCI:/short/sd34/ap5514/te/input/
!scp -r low_te_files/ $NCI:/short/sd34/ap5514/te/input/

In [None]:
%%bash

#Ran this on NCI
#Add identifierse for top or bottom 500 TE methylation files

module load bedtools

cd /short/sd34/ap5514/te/input/top_te_files

for x in *bed
do
len=${#x}
sortBed -i ${x} > ${x::len-4}.top.sorted.bed
done

cd /short/sd34/ap5514/te/input/low_te_files
for x in *bed
do
len=${#x}
sortBed -i ${x} > ${x::len-4}.low.sorted.bed
done

In [None]:
%%bash

#copy sorted files from NCI
cd /home/anjuni/analysis/gene_te_comparison/
mkdir sorted_te_files
cd sorted_te_files
scp -r ap5514@r-dm.nci.org.au:/short/sd34/ap5514/te/input/top_te_files/*top* .
scp -r ap5514@r-dm.nci.org.au:/short/sd34/ap5514/te/input/low_te_files/*low* .

In [289]:
gene_list = [fn for fn in glob.iglob('%s/*e.bed' % os.path.join(DIRS['COMPARE'], 'input'), recursive=True)]
gene_dict = {}
for fn in gene_list:
    gene_dict[fn.split('/')[-1]] = fn

In [290]:
gene_dict

{'Pst_104E_v13_h_ctg.anno.sorted.score.bed': '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_h_ctg.anno.sorted.score.bed',
 'Pst_104E_v13_p_ctg.anno.sorted.score.bed': '/home/anjuni/analysis/gene_te_comparison/input/Pst_104E_v13_p_ctg.anno.sorted.score.bed'}

In [None]:
def add_score(fn_dict):
    """This function adds an integer score to replace the '.' in the score column in the annotation bed file, so it can be parsed as bed6."""
    out_fn_dict = {}
    for key, value in fn_dict.items():
        in_fn = fn_dict[key]
        df = pd.read_csv(in_fn, sep='\t', header = None)
        for index, row in df.iterrows():
            df.iat[index,4] = 0 # change the gene end site to 500bp downstream of TSS
        out_fn = in_fn.replace('.bed', '.score.bed') # make the outfile name
        df.to_csv(out_fn, header=None, index=None, sep='\t') # save the new tss df to a bed file
        outkey = out_fn.split('/')[-1]
        out_fn_dict[outkey] = out_fn # save the outfile names to a dictionary
    return out_fn_dict

In [None]:
gene_dict = add_score(gene_dict)

In [362]:
te_list = [fn for fn in glob.iglob('%s/*d.bed' % os.path.join(DIRS['COMPARE'], 'sorted_te_files'), recursive=True)]
te_dict = {}
for fn in te_list:
    te_dict[fn.split('/')[-1]] = fn

In [363]:
te_dict

{'Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.low.sorted.bed': '/home/anjuni/analysis/gene_te_comparison/sorted_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.low.sorted.bed',
 'Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.top.sorted.bed': '/home/anjuni/analysis/gene_te_comparison/sorted_te_files/Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.500.top.sorted.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.low.sorted.bed': '/home/anjuni/analysis/gene_te_comparison/sorted_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.low.sorted.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.top.sorted.bed': '/home/anjuni/analysis/gene_te_comparison/sorted_te_files/Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.500.top.sorted.bed',
 'Pst_104E_v13_h_ctg.REPET.6mA_prob_smrtlink_sorted.cutoff.0.80.overlap.

In [364]:
#Quick chech if directories exist
for value in gene_dict.values():
    if not os.path.exists(value):
        print('%s does not exist' % value)

In [365]:
#Quick chech if directories exist
for value in te_dict.values():
    if not os.path.exists(value):
        print('%s does not exist' % value)

In [None]:
#ran for these two groups:
#overall closest gene
#non-overlapping overall closest gene

In [399]:
def genes_near_te(te_bed_fn, gene_bed_fn, io, iu, di, out_fh):
    """Takes two bed6 filenames and returns dataframe with 5' and 3' distances."""
    te = BedTool(te_bed_fn)
    gene = BedTool(gene_bed_fn)
    df = te.closest(gene ,io=io, iu=iu, id=di, N=True, d=True).to_dataframe()
    df.rename(columns={1:'start', 2:'stop', 12:'distance', 11: 'strand', 5:'score', 3:'name', 9:'gene', 0:'contig'}, inplace=True)
    df = df[['contig', 'start', 'stop', 'name', 'score', 'gene', 'distance']]
    df = df[df['distance'] < 1000]
    folder = 'output' + '/' + out_fh
    out_fn = te_bed_fn.replace('sorted_te_files', folder)
    df.to_csv(out_fn, sep='\t', index=None, header=None)
    return df

In [411]:
# run for genes in both directions
for tkey, tvalue in te_dict.items():
    for gkey, gvalue in gene_dict.items():
        if tkey[13] == gkey[13]:
            genes_near_te(tvalue, gvalue, True, False, False, 'both')

['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_typ

In [412]:
# run for genes in both directions
for tkey, tvalue in te_dict.items():
    for gkey, gvalue in gene_dict.items():
        if tkey[13] == gkey[13]:
            genes_near_te(tvalue, gvalue, False, False, False, 'with_overlaps')

['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_type, _names, self.field_count()))
['chrom', 'start', 'end', 'name', 'score', 'strand', 'thickStart', 'thickEnd', 'itemRgb', 'blockCount', 'blockSizes', 'blockStarts']
but file has 13 fields; you can supply custom names with the `names` kwarg
  % (self.file_typ

In [402]:
df.columns

Index(['contig', 'start', 'stop', 'name', 'score', 'gene', 'distance'], dtype='object')

In [405]:
header_row_gt = ['contig', 'start', 'stop', 'name', 'score', 'gene', 'distance']

##### Troubleshooting:

Previous errors in the closest() function were due to '.' in the score column, which needs to be an integer between 0 and 1000 in bed6 file format.
Errors in closest() function are likely due to white space in the name column for transposons.
This is being remedied by running everything again for TE annotation files with only their ID in the name column.
New bed6 annotation files were generated for TEs, and these only have ID, hopefully without white space that cause errors.