# <span style="color:#ff1414"> BEDtools analysis. </span>

This is a script to answer research questions outlined elsewhere. In summary, this script:

1. compares methylation results between different methylation-callers, and between different methylation sequencing methods.

2. compares methylation between genes and non-gene regions

3. compares methylation between transposons and non-repetitive regions

4. compares transposons and genes


Note:
- PB/pb = PacBio
- ONT/ont = Oxford Nanopore Technology
- NP = Nanopolish

In [104]:
import pybedtools
import scipy

import matplotlib.patches as mpatches

import numpy as np # need for  stats

from scipy.stats import wilcoxon
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

In [142]:
# load modules
import os
import glob
import pprint
from pybedtools import BedTool
from scipy.stats import spearmanr
import pandas as pd
from scipy.stats import wilcoxon
import scipy

In [106]:
#First we need to define the base dirs

DIRS = {}
DIRS['BASE2'] = '/home/anjuni/analysis'
DIRS['BASE1'] = '/home/anjuni/methylation_calling/pacbio'
DIRS['GFF_INPUT'] = os.path.join(DIRS['BASE2'], 'gff_output')
DIRS['FIGURES'] = os.path.join(DIRS['BASE2'], 'figures')
DIRS['COVERAGE'] = os.path.join(DIRS['BASE2'], 'coverage')
DIRS['FEATURES'] = os.path.join(DIRS['COVERAGE'], 'feature_files')
DIRS['RAND'] = os.path.join(DIRS['BASE2'], 'coverage', 'randomisation')
DIRS['TE_SF'] = os.path.join(DIRS['COVERAGE'], 'superfamily_files')
DIRS['GENE'] = os.path.join(DIRS['COVERAGE'], 'gene_level')
DIRS['GENE_BODY'] = os.path.join(DIRS['GENE'], 'gene_body')
DIRS['BOTH_U_D'] = os.path.join(DIRS['GENE'], 'both_upstream_downstream')
DIRS['DOWN_STR'] = os.path.join(DIRS['GENE'], 'downstream')
DIRS['UP_STR'] = os.path.join(DIRS['GENE'], 'upstream')
DIRS['TSS'] = os.path.join(DIRS['GENE'], 'tss_6mA_only')
DIRS['TE'] = os.path.join(DIRS['GENE'], 'te')

In [107]:
#Quick chech if directories exist
for value in DIRS.values():
    if not os.path.exists(value):
        print('%s does not exist' % value)

## <span style='color:#14c4ff'> 6. Intersecting methylation with gene annotation files. <span/>

In [108]:
# Function to easily convert the values in the file name dict into bedtools objects
def make_bed_dict(fn_dict):
    """Takes an input filename dictionary and outputs a dictionary of pybedtools objects for the filenames."""
    bed_dict = {}
    for key, value in fn_dict.items():      
        bed_dict[key] = BedTool(value)
    return bed_dict

### <span style='color:#14c4ff'> 6.A Intersecting 5mC and 6mA with gene body. <span/>

#### <span style='color:#a347ff'> 6.A.1  Collecting annotation files <span/>

In [9]:
# Copy the gene annotation files to the input folder
# sections: gene_body, upstream, downstream, both_u_d, tss_6mA_only, te

! cp /home/anjuni/analysis/gff_output/*combined*  /home/anjuni/analysis/coverage/gene_level/gene_body/input/

In [109]:
# Converted the gff anno files to bed files in the anno_file_prep notebook
# Set filepaths for the anno bed files

gene_body_fn_dict = {}
for fn in glob.iglob('%s/*.bed' % os.path.join(DIRS['GENE_BODY'], 'input'), recursive=True):
    key = fn.split('/')[-1]
    gene_body_fn_dict[key] = fn

# Make dictionary of bedtools objects and check if it worked (it did!)
gene_body_bed_dict = make_bed_dict(gene_body_fn_dict)

In [110]:
gene_body_fn_dict

{'Pst_104E_v13_h_ctg_combined_sorted_anno.bed': '/home/anjuni/analysis/coverage/gene_level/gene_body/input/Pst_104E_v13_h_ctg_combined_sorted_anno.bed',
 'Pst_104E_v13_p_ctg_combined_sorted_anno.bed': '/home/anjuni/analysis/coverage/gene_level/gene_body/input/Pst_104E_v13_p_ctg_combined_sorted_anno.bed'}

In [111]:
gene_body_bed_dict

{'Pst_104E_v13_h_ctg_combined_sorted_anno.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/gene_body/input/Pst_104E_v13_h_ctg_combined_sorted_anno.bed)>,
 'Pst_104E_v13_p_ctg_combined_sorted_anno.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/gene_body/input/Pst_104E_v13_p_ctg_combined_sorted_anno.bed)>}

#### <span style='color:#a347ff'> 6.A.2 Collecting methylation files and randomised methylation files for the rest of the analysis.<span/>

In [112]:
# Make a dictionary of methylation files
methyl_fn_dict = {} # lowest cutoff as it had the highest similarity between sequencers/callers
methyl_fn_dict['5mC_hc_tombo_sorted.cutoff.0.80.bed'] = os.path.join(DIRS['FEATURES'], '5mC_hc_tombo_sorted.cutoff.0.80.bed',)
methyl_fn_dict['6mA_hc_tombo_sorted.cutoff.0.80.bed'] = os.path.join(DIRS['FEATURES'], '6mA_hc_tombo_sorted.cutoff.0.80.bed',)

# Make dictionary of bedtools objects and check if it worked (it did!)
methyl_bed_dict = make_bed_dict(methyl_fn_dict)

In [113]:
methyl_bed_dict

{'5mC_hc_tombo_sorted.cutoff.0.80.bed': <BedTool(/home/anjuni/analysis/coverage/feature_files/5mC_hc_tombo_sorted.cutoff.0.80.bed)>,
 '6mA_hc_tombo_sorted.cutoff.0.80.bed': <BedTool(/home/anjuni/analysis/coverage/feature_files/6mA_hc_tombo_sorted.cutoff.0.80.bed)>}

In [114]:
# Randomisation
# Generating randomised files as controls

# Dictionary of filepaths 
methyl_rand_fn_dict = {}
methyl_rand_fn_dict['5mC_hc_tombo_sorted.cutoff.0.80.bed'] = '/home/anjuni/analysis/coverage/randomisation/5mC_hc_tombo_sorted.cutoff.0.80_rand.bed'
methyl_rand_fn_dict['6mA_hc_tombo_sorted.cutoff.0.80.bed'] = '/home/anjuni/analysis/coverage/randomisation/6mA_hc_tombo_sorted.cutoff.0.80_rand.bed'

# Make a dictionary of randomised methylation bed files
methyl_rand_bed_dict = make_bed_dict(methyl_rand_fn_dict)

In [115]:
methyl_rand_bed_dict

{'5mC_hc_tombo_sorted.cutoff.0.80.bed': <BedTool(/home/anjuni/analysis/coverage/randomisation/5mC_hc_tombo_sorted.cutoff.0.80_rand.bed)>,
 '6mA_hc_tombo_sorted.cutoff.0.80.bed': <BedTool(/home/anjuni/analysis/coverage/randomisation/6mA_hc_tombo_sorted.cutoff.0.80_rand.bed)>}

#### <span style='color:#a347ff'> 6.A.3 Running gene body coverage for methylation and randomised methylation.<span/>

In [55]:
# Make a function to do overlaps for gene files and methylation
def coverage_gene(genebed_dict, methylbed_dict, genefn_dict, old_folder_name, new_folder_name):
    """Create coverage files from:
    Inputs: dictionary of gene pybedtools objects, dictionary of methylation pybedtools objects, dictionary of gene filenames, old folder name and new folder name.
    Output: dictionary of pandas dataframes for all coverage files."""
    feature_overlap_df_dict = {}
    for gkey, gbed in genebed_dict.items():
        for mkey, mbed in methylbed_dict.items():
            tmp_df = gbed.coverage(mbed, s=True).to_dataframe().iloc[:,[0,1,2,3,6,9]] # make a dataframe to put headings
            tmp_df.rename(columns={'thickStart': 'overlap_count', 'blockCount': 'overlap_fraction'}, inplace=True) # rename headings
            tmp_fn = genefn_dict[gkey].replace('.bed', '.%s.overlap.bed' % mkey[:-4]) # change output file path
            tmp_fn = tmp_fn.replace(old_folder_name, new_folder_name)
            feature_overlap_df_dict[tmp_fn.split('/')[-1]] = tmp_df # file name as key and dataframe as value for overlap dict
            tmp_df.to_csv(tmp_fn, sep='\t', header=None, index=None) # save to a csv(pybedtools outputs more d.p. than BEDTools)
    return feature_overlap_df_dict

In [133]:
te_methyl_overlap_dict = coverage_gene(te_bed_dict, methyl_bed_dict, te_fn_dict, 'input', 'coverage')
te_rand_methyl_overlap_dict = coverage_gene(te_bed_dict, methyl_bed_dict, te_fn_dict, 'input', 'rand')

In [134]:
# run overlap over the gene body for the genes
gene_methyl_overlap_dict = coverage_gene(gene_body_bed_dict, methyl_bed_dict, gene_body_fn_dict, 'input', 'coverage')
gene_rand_methyl_overlap_dict = coverage_gene(gene_body_bed_dict, methyl_rand_bed_dict, gene_body_fn_dict, 'input', 'rand')

### <span style='color:#14c4ff'> 6.B Intersecting 5mC and 6mA with upstream and downstream regions. <span/>

#### <span style='color:#a347ff'> 6.B.1 Making flank files for the upstream and downstream regions.<span/>

In [116]:
gene_anno_dict = {}
for fn in glob.iglob('%s/*combined_sorted_anno.bed' % DIRS['GENE_ANNO'], recursive=True):
    key = fn.split('/')[-1]
    gene_anno_dict[key] = fn

# Make dictionary of bedtools objects and check if it worked (it did!)
gene_bed_dict = make_bed_dict(gene_anno_dict)

KeyError: 'GENE_ANNO'

In [32]:
gene_bed_dict

{'Pst_104E_v13_h_ctg_combined_sorted_anno.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/gene_anno/Pst_104E_v13_h_ctg_combined_sorted_anno.bed)>,
 'Pst_104E_v13_p_ctg_combined_sorted_anno.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/gene_anno/Pst_104E_v13_p_ctg_combined_sorted_anno.bed)>}

In [33]:
# Make files of upstream and downstream regions
def make_flank_files(genebed_dict, genefn_dict, genome_size_fn, up_window, down_window, old_folder_name, new_folder_name, suffix):
    """This function makes flanking regions upstream and downstream of the input feature file."""
    out_fn_dict = {}
    for gkey, gbed in genebed_dict.items():
            tmp_bed = gbed.flank(g=genome_size_f_fn, l=up_window, r=down_window, s= True)
            tmp_fn = genefn_dict[gkey].replace('.bed', '.%s.bed' % suffix)
            tmp_fn = tmp_fn.replace(old_folder_name, new_folder_name) # change output file path
            tmp_bed.saveas(tmp_fn) # save flank to a csv
            out_fn_dict[tmp_fn.split('/')[-1]] = tmp_fn # file name as key and dataframe as value for overlap dict
    return out_fn_dict

In [35]:
genome_size_f_fn = os.path.join(DIRS['WINDOW_INPUT'], 'Pst_104E_v13_ph_ctg.sorted.genome_file')

In [36]:
# Make filename dict for inputs
upstream_flank_fn_dict = make_flank_files(gene_bed_dict, gene_anno_dict, genome_size_f_fn, 1000, 0, 'gene_anno', 'upstream/input', 'upstream')
downstream_flank_fn_dict = make_flank_files(gene_bed_dict, gene_anno_dict, genome_size_f_fn, 0, 1000, 'gene_anno', 'downstream/input', 'downstream')
both_u_d_flank_fn_dict = make_flank_files(gene_bed_dict, gene_anno_dict, genome_size_f_fn, 1000, 1000, 'gene_anno', 'both_upstream_downstream/input', 'both_u_d')

In [37]:
# Make bedtool object dictionary of inputs
upstream_flank_bed_dict = make_bed_dict(upstream_flank_fn_dict)
downstream_flank_bed_dict = make_bed_dict(downstream_flank_fn_dict)
both_u_d_flank_bed_dict = make_bed_dict(both_u_d_flank_fn_dict)

In [120]:
# Convert list to dict of fn
def file_name_dict(file_list):
    """Outputs a dictionary of input file paths for a given list of input file paths."""
    file_dict = {}
    for file in file_list:
        file_dict[file.split('/')[-1]] = file
    return file_dict

In [123]:
upstream_flank_fn_dict = file_name_dict([fn for fn in glob.iglob('%s/*bed' % os.path.join(DIRS['UP_STR'], 'input'), recursive=True)])
downstream_flank_fn_dict = file_name_dict([fn for fn in glob.iglob('%s/*bed' % os.path.join(DIRS['DOWN_STR'], 'input'), recursive=True)])
both_u_d_flank_fn_dict = file_name_dict([fn for fn in glob.iglob('%s/*bed' % os.path.join(DIRS['BOTH_U_D'], 'input'), recursive=True)])

In [124]:
upstream_flank_bed_dict = make_bed_dict(upstream_flank_fn_dict)
downstream_flank_bed_dict = make_bed_dict(downstream_flank_fn_dict)
both_u_d_flank_bed_dict = make_bed_dict(both_u_d_flank_fn_dict)

#### <span style='color:#a347ff'> 6.B.2 Running upstream, downstream, and both_u_d coverage for methylation and randomised methylation.<span/>

In [39]:
# Make a function to do overlaps for flank files and methylation
def coverage_flank(genebed_dict, methylbed_dict, genefn_dict, old_folder_name, new_folder_name, suffix):
    """Create coverage files from:
    Inputs: dictionary of gene pybedtools objects, dictionary of methylation pybedtools objects, dictionary of gene filenames, old folder name and new folder name.
    Output: dictionary of pandas dataframes for all coverage files."""
    feature_overlap_df_dict = {}
    for gkey, gbed in genebed_dict.items():
        for mkey, mbed in methylbed_dict.items():
            tmp_df = gbed.coverage(mbed, s=True).to_dataframe().iloc[:,[0,1,2,3,6,9]] # make a dataframe to put headings
            tmp_df.rename(columns={'thickStart': 'overlap_count', 'blockCount': 'overlap_fraction'}, inplace=True) # rename headings
            tmp_fn = genefn_dict[gkey].replace('.%s.bed' % suffix, '.bed')
            tmp_fn = tmp_fn.replace('.bed', '.%s.%s.overlap.bed' % (mkey[:-4], suffix)) # change output file path
            tmp_fn = tmp_fn.replace(old_folder_name, new_folder_name)
            feature_overlap_df_dict[tmp_fn.split('/')[-1]] = tmp_df # file name as key and dataframe as value for overlap dict
            tmp_df.to_csv(tmp_fn, sep='\t', header=None, index=None) # save to a csv(pybedtools outputs more d.p. than BEDTools)
    return feature_overlap_df_dict

In [135]:
# Make dictionaries of dataframes for each coverage file
upstream_coverage_dict = coverage_flank(upstream_flank_bed_dict, methyl_bed_dict, upstream_flank_fn_dict, 'input', 'coverage', 'upstream')
downstream_coverage_dict = coverage_flank(downstream_flank_bed_dict, methyl_bed_dict, downstream_flank_fn_dict, 'input', 'coverage', 'downstream')
both_u_d_coverage_dict = coverage_flank(both_u_d_flank_bed_dict, methyl_bed_dict, both_u_d_flank_fn_dict, 'input', 'coverage', 'both_u_d')

In [None]:
print(*upstream_coverage_dict, sep='\n')

Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed
Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed
Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed
Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed


In [None]:
# Make dictionaries of dataframes for randomised methylation for each coverage file
upstream_rand_coverage_dict = coverage_flank(upstream_flank_bed_dict, methyl_rand_bed_dict, upstream_flank_fn_dict, 'input', 'rand', 'upstream')
downstream_rand_coverage_dict = coverage_flank(downstream_flank_bed_dict, methyl_rand_bed_dict, downstream_flank_fn_dict, 'input', 'rand', 'downstream')
both_u_d_rand_coverage_dict = coverage_flank(both_u_d_flank_bed_dict, methyl_rand_bed_dict, both_u_d_flank_fn_dict, 'input', 'rand', 'both_u_d')

### <span style='color:#14c4ff'> 6.C Intersecting 6mA with transcription start site. <span/>

#### <span style='color:#a347ff'> 6.C.1 Making TSS files.<span/>

In [None]:
# 6mA is at TSS or slightly downstream (+/- 500bp)
# Make gene files with just the first 500bp of the gene
# Run coverage for these

In [19]:
# write a function to take the files in gene_anno_dict, convert to df and edit the df and save out
def make_tss_file(fn_dict):
    out_fn_dict = {}
    for key, value in fn_dict.items():
        in_fn = fn_dict[key]
        df = pd.read_csv(in_fn, sep='\t', header = None)
        for index, row in df.iterrows():
            tss = row[1] + 500
            df.iat[index,2] = tss # change the gene end site to 500bp downstream of TSS
        out_fn = in_fn.replace('.bed', '.tss.bed')
        out_fn = out_fn.replace('gene_anno', 'tss_6mA_only/input') # make the outfile name
        df.to_csv(out_fn, header=None, index=None, sep='\t') # save the new tss df to a bed file
        outkey = out_fn.split('/')[-1]
        out_fn_dict[outkey] = out_fn # save the outfile names to a dictionary
    return out_fn_dict

In [22]:
# Make the TSS files and return dictionary of filepaths to TSS files
tss_fn_dict = make_tss_file(gene_anno_dict)

# Make a bedtools object dictionary of TSS files
tss_bed_dict = make_bed_dict(tss_fn_dict)

In [129]:
tss_fn_dict = file_name_dict([fn for fn in glob.iglob('%s/*bed' % os.path.join(DIRS['TSS'], 'input'), recursive=True)])
tss_bed_dict = make_bed_dict(tss_fn_dict)

In [130]:
tss_bed_dict

{'Pst_104E_v13_h_ctg_combined_sorted_anno.tss.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/tss_6mA_only/input/Pst_104E_v13_h_ctg_combined_sorted_anno.tss.bed)>,
 'Pst_104E_v13_p_ctg_combined_sorted_anno.tss.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/tss_6mA_only/input/Pst_104E_v13_p_ctg_combined_sorted_anno.tss.bed)>}

#### <span style='color:#a347ff'> 6.C.2 Running TSS coverage for methylation files and randomised methylation files .<span/>

In [None]:
def coverage_gene(genebed_dict, methylbed_dict, genefn_dict, old_folder_name, new_folder_name):
    """Create coverage files from:
    Inputs: dictionary of gene pybedtools objects, dictionary of methylation pybedtools objects, dictionary of gene filenames, old folder name and new folder name.
    Output: dictionary of pandas dataframes for all coverage files."""
    feature_overlap_df_dict = {}
    for gkey, gbed in genebed_dict.items():
        for mkey, mbed in methylbed_dict.items():
            tmp_df = gbed.coverage(mbed, s=True).to_dataframe().iloc[:,[0,1,2,3,6,9]] # make a dataframe to put headings
            tmp_df.rename(columns={'thickStart': 'overlap_count', 'blockCount': 'overlap_fraction'}, inplace=True) # rename headings
            tmp_fn = genefn_dict[gkey].replace('.bed', '.%s.overlap.bed' % mkey[:-4]) # change output file path
            tmp_fn = tmp_fn.replace(old_folder_name, new_folder_name)
            feature_overlap_df_dict[tmp_fn.split('/')[-1]] = tmp_df # file name as key and dataframe as value for overlap dict
            tmp_df.to_csv(tmp_fn, sep='\t', header=None, index=None) # save to a csv(pybedtools outputs more d.p. than BEDTools)
    return feature_overlap_df_dict

In [187]:
# Make a function to do overlaps for tss files and methylation
def coverage_tss(genebed_dict, methylbed_dict, genefn_dict, old_folder_name, new_folder_name, suffix):
    """Create coverage files from:
    Inputs: dictionary of gene pybedtools objects, dictionary of methylation pybedtools objects, dictionary of gene filenames, old folder name and new folder name.
    Output: dictionary of pandas dataframes for all coverage files."""
    feature_overlap_df_dict = {}
    for gkey, gbed in genebed_dict.items():
        for mkey, mbed in methylbed_dict.items():
            tmp_df = gbed.coverage(mbed, s=True).to_dataframe().iloc[:,[0,1,2,3,6,9]] # make a dataframe to put headings
            tmp_df.rename(columns={'thickStart': 'overlap_count', 'blockCount': 'overlap_fraction'}, inplace=True) # rename headings
            tmp_fn = genefn_dict[gkey].replace('.%s.bed' % suffix, '.bed')
            tmp_fn = tmp_fn.replace('.bed', '.%s.%s.overlap.bed' % (mkey[:-4], suffix)) # change output file path
            tmp_fn = tmp_fn.replace(old_folder_name, new_folder_name)
            feature_overlap_df_dict[tmp_fn.split('/')[-1]] = tmp_df # file name as key and dataframe as value for overlap dict
            tmp_df.to_csv(tmp_fn, sep='\t', header=None, index=None) # save to a csv(pybedtools outputs more d.p. than BEDTools)
    return feature_overlap_df_dict

In [189]:
# Run coverage for tss and return dictionary of dataframes
tss_overlap_dict = coverage_tss(tss_bed_dict, methyl_bed_dict, tss_fn_dict, 'input', 'coverage', 'tss')

In [158]:
print(*tss_overlap_dict, sep='\n')

Pst_104E_v13_h_ctg_combined_sorted_anno.tss.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_h_ctg_combined_sorted_anno.tss.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg_combined_sorted_anno.tss.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed
Pst_104E_v13_p_ctg_combined_sorted_anno.tss.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed


In [190]:
# Run randomised coverage for tss and return dictionary of dataframes
tss_rand_dict = coverage_tss(tss_bed_dict, methyl_rand_bed_dict, tss_fn_dict, 'input', 'rand', 'tss')

In [226]:
# get only 6mA for tss
del tss_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed']
del tss_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed']
del tss_rand_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed']
del tss_rand_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed']

### <span style='color:#14c4ff'> 6.D Intersecting transposons files with methylation. <span/>

#### <span style='color:#a347ff'> 6.D.1 Collecting transposon annotation files .<span/>

In [53]:
# Converted the gff anno files to bed files in the anno_file_prep notebook
# Set filepaths for the te anno bed files

!cp /home/anjuni/analysis/coverage/feature_files/*REPET* /home/anjuni/analysis/coverage/gene_level/te/input

te_fn_dict = {}
for fn in glob.iglob('%s/*.bed' % os.path.join(DIRS['TE'], 'input'), recursive=True):
    key = fn.split('/')[-1]
    te_fn_dict[key] = fn

# Make dictionary of bedtools objects and check if it worked (it did!)
te_bed_dict = make_bed_dict(te_fn_dict)

In [161]:
te_bed_dict

{'Pst_104E_v13_h_ctg.REPET.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/te/input/Pst_104E_v13_h_ctg.REPET.bed)>,
 'Pst_104E_v13_p_ctg.REPET.bed': <BedTool(/home/anjuni/analysis/coverage/gene_level/te/input/Pst_104E_v13_p_ctg.REPET.bed)>}

In [132]:
print(*te_bed_dict, sep='\n')
print(*methyl_bed_dict, sep='\n')

Pst_104E_v13_p_ctg.REPET.bed
Pst_104E_v13_h_ctg.REPET.bed
5mC_hc_tombo_sorted.cutoff.0.80.bed
6mA_hc_tombo_sorted.cutoff.0.80.bed


#### <span style='color:#a347ff'> 6.D.2 Running transposon coverage for methylation files and randomised methylation files.<span/>

In [None]:
te_methyl_overlap_dict = coverage_gene(te_bed_dict, methyl_bed_dict, te_fn_dict, 'input', 'coverage')
te_rand_methyl_overlap_dict = coverage_gene(te_bed_dict, methyl_bed_dict, te_fn_dict, 'input', 'rand')

### <span style='color:#14c4ff'> 6.E Testing statistical significance of gene coverage files. <span/>

#### <span style='color:#a347ff'> 6.E.1 Running Wilcoxon test.<span/>

Wilcoxon test used to test for significance of methylation at genes.

In [83]:
def coverage_wilcoxon_same_key(obs_df_dict, exp_df_dict):
    """This function returns a dictionary of Wilcoxon statistic and p-value for a test of observed and randomised sites."""
    wilcoxon_dict = {}
    for okey, ovalue in obs_df_dict.items():
        for ekey, evalue in exp_df_dict.items():
            if okey == ekey:
                o_df = ovalue
                e_df = evalue
                obs = o_df['overlap_fraction']
                exp = e_df['overlap_fraction']
                stat, p = wilcoxon(obs, exp)
                wilcoxon_dict[okey] = stat, p
    return wilcoxon_dict

In [143]:
# Run wilcoxon for gene body
gene_body_wilcoxon = coverage_wilcoxon_same_key(gene_methyl_overlap_dict, gene_rand_methyl_overlap_dict)

In [144]:
# Run wilcoxon for regions outside gene body
upstream_wilcoxon = coverage_wilcoxon_same_key(upstream_coverage_dict, upstream_rand_coverage_dict)
downstream_wilcoxon = coverage_wilcoxon_same_key(downstream_coverage_dict, downstream_rand_coverage_dict)
both_u_d_wilcoxon = coverage_wilcoxon_same_key(both_u_d_coverage_dict, both_u_d_rand_coverage_dict)

In [145]:
# Run wilcoxon for TSS
tss_wilcoxon = coverage_wilcoxon_same_key(tss_overlap_dict, tss_rand_dict)

In [169]:
te_wilcoxon = coverage_wilcoxon_same_key(te_methyl_overlap_dict, te_rand_methyl_overlap_dict)

  z = (T - mn - correction) / se


In [170]:
te_wilcoxon

{'Pst_104E_v13_h_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed': (0.0,
  nan),
 'Pst_104E_v13_h_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed': (0.0,
  nan),
 'Pst_104E_v13_p_ctg.REPET.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed': (0.0,
  nan),
 'Pst_104E_v13_p_ctg.REPET.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed': (0.0,
  nan)}

#### <span style='color:#a347ff'> 6.E.2 Making a table of Wilcoxon stats.<span/>

In [None]:
wilcoxon_df = pd.DataFrame.from_dict(ug_spearman_dict, orient='index')
spearman_df.rename(columns={0: 'Wilcoxon T statistic', 1: 'p-value'}, inplace=True)

spearman_df.to_csv(out_fn, header=True, sep = '\t')

In [172]:
dlist = [gene_body_wilcoxon, upstream_wilcoxon, downstream_wilcoxon, both_u_d_wilcoxon, tss_wilcoxon]

In [177]:
def make_df_list(dict_list):
    dfs = []
    for dct in dict_list:
        df = pd.DataFrame.from_dict(dct, orient='index')
        df.rename(columns={0: 'Wilcoxon T statistic', 1: 'p-value'}, inplace=True)
        dfs.append(df)
    return dfs

In [178]:
dfs_list = make_df_list(dlist)

In [182]:
# save out the wilcoxon stats
wilcoxon_df = pd.concat(dfs_list)
out_fn = os.path.join(DIRS['FIGURES'], 'coverage', 'gene_level_wilcoxon_t_table.tsv')
wilcoxon_df.to_csv(out_fn, header=True, sep = '\t')

In [183]:
wilcoxon_df

Unnamed: 0,Wilcoxon T statistic,p-value
Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed,49414335.5,0.0
Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed,40109462.0,0.0
Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed,46935264.5,2.2878469999999999e-240
Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed,42178461.5,0.0
Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed,44702720.5,0.0
Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed,45586442.5,0.0
Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed,43225635.5,0.0
Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed,46128200.0,2.5304880000000003e-248
Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed,44772456.0,1.642831e-291
Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed,45154961.5,1.4499549999999999e-270


In [164]:
# this dict has all the tss overlap files
#tss_all_wilcoxon = tss_wilcoxon
tss_all_wilcoxon

{'Pst_104E_v13_h_ctg_combined_sorted_anno.tss.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed': (45858018.5,
  3.739094782863053e-220),
 'Pst_104E_v13_h_ctg_combined_sorted_anno.tss.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed': (41348298.0,
  7.061467931510914e-307),
 'Pst_104E_v13_p_ctg_combined_sorted_anno.tss.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed': (48814229.0,
  0.0),
 'Pst_104E_v13_p_ctg_combined_sorted_anno.tss.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed': (40870391.5,
  0.0)}

In [191]:
gene_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'].head()

Unnamed: 0,chrom,start,end,name,overlap_count,overlap_fraction
0,hcontig_000_003,1022,1469,gene_model_hcontig_0000_03.1,47,0.105145
1,hcontig_000_003,4849,5854,gene_model_hcontig_0000_03.2,2,0.00199
2,hcontig_000_003,7889,8965,gene_model_hcontig_0000_03.3,6,0.005576
3,hcontig_000_003,11922,13626,gene_model_hcontig_0000_03.4,52,0.030516
4,hcontig_000_003,14562,15702,EVM prediction%2hcontig_0000_003.5,16,0.014035


#### <span style='color:#a347ff'> 6.E.3 Running Wilcoxon for all the gene types.<span/>

# <span style='color:#a347ff'> H contig <span/>

In [283]:
h_contig_df = None

In [284]:
# make one big data frame of gene_ID and all the coverage for every region and their corresponding randomised score

# initial df = just the gene names
# get gene names from a file
h_contig_df = gene_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name']]

In [351]:
h_contig_df.head()

Unnamed: 0,name,5mC_gene_body,6mA_gene_body,5mC_gene_body_rand,6mA_gene_body_rand,5mC_upstream,6mA_upstream,5mC_upstream_rand,6mA_upstream_rand,5mC_downstream,6mA_downstream,5mC_downstream_rand,6mA_downstream_rand,6mA_tss,6mA_tss_rand,5mC_both,6mA_both,5mC_both_rand,6mA_both_rand,Gene_type
0,gene_model_hcontig_0000_03.1,0.105145,0.040268,0.013423,0.004474,0.124,0.034,0.018,0.01,0.109,0.035,0.014,0.017,0.048,0.004,0.1165,0.0345,0.016,0.0135,Effector
1,gene_model_hcontig_0000_03.2,0.00199,0.00597,0.01592,0.010945,0.028,0.012,0.012,0.011,0.016,0.028,0.017,0.007,0.0,0.014,0.022,0.02,0.0145,0.009,Effector
2,gene_model_hcontig_0000_03.3,0.005576,0.004647,0.012082,0.008364,0.005,0.002,0.012,0.01,0.013,0.002,0.016,0.005,0.008,0.008,0.009,0.002,0.014,0.0075,Other genes
3,gene_model_hcontig_0000_03.4,0.030516,0.028756,0.01115,0.013498,0.028,0.081,0.011,0.017,0.013,0.039,0.007,0.018,0.03,0.024,0.0205,0.06,0.009,0.0175,Other genes
4,EVM prediction%2hcontig_0000_003.5,0.014035,0.055263,0.007017,0.016667,0.013,0.042,0.007,0.017,0.012,0.039,0.011,0.014,0.058,0.012,0.0125,0.0405,0.009,0.0155,TE gene


In [None]:
# the dictionaries
gene_methyl_overlap_dict
gene_rand_methyl_overlap_dict
upstream_coverage_dict
upstream_rand_coverage_dict
downstream_coverage_dict
downstream_rand_coverage_dict
both_u_d_coverage_dict
both_u_d_rand_coverage_dict
tss_overlap_dict
tss_rand_dict

In [286]:
#gene body
h_contig_df = h_contig_df.join(gene_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_gene_body'}, inplace=True)
h_contig_df = h_contig_df.join(gene_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_gene_body'}, inplace=True)
h_contig_df = h_contig_df.join(gene_rand_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_gene_body_rand'}, inplace=True)
h_contig_df = h_contig_df.join(gene_rand_methyl_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_gene_body_rand'}, inplace=True)

#upstream
h_contig_df = h_contig_df.join(upstream_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_upstream'}, inplace=True)
h_contig_df = h_contig_df.join(upstream_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_upstream'}, inplace=True)
h_contig_df = h_contig_df.join(upstream_rand_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_upstream_rand'}, inplace=True)
h_contig_df = h_contig_df.join(upstream_rand_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_upstream_rand'}, inplace=True)

#downstream
h_contig_df = h_contig_df.join(downstream_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_downstream'}, inplace=True)
h_contig_df = h_contig_df.join(downstream_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_downstream'}, inplace=True)
h_contig_df = h_contig_df.join(downstream_rand_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '5mC_downstream_rand'}, inplace=True)
h_contig_df = h_contig_df.join(downstream_rand_coverage_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_downstream_rand'}, inplace=True)

#tss
h_contig_df = h_contig_df.join(tss_overlap_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_tss'}, inplace=True)
h_contig_df = h_contig_df.join(tss_rand_dict['Pst_104E_v13_h_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
h_contig_df.rename(columns={'overlap_fraction': '6mA_tss_rand'}, inplace=True)

In [234]:
# add the both_u_d column
def add_both_ud_to_df(df, upstream_col, downstream_col, new_col_name):
    """This function adds a column with the average of the upstream and downstream methylation coverage to the dataframe."""
    both_list = []
    for index, row in df.iterrows():
        both = ( row[upstream_col] + row[downstream_col] ) / 2
        both_list.append(both)
    df[new_col_name] = pd.Series(both_list)

In [287]:
add_both_ud_to_df(h_contig_df, '5mC_upstream', '5mC_downstream', '5mC_both')
add_both_ud_to_df(h_contig_df, '6mA_upstream', '6mA_downstream', '6mA_both')
add_both_ud_to_df(h_contig_df, '5mC_upstream_rand', '5mC_downstream_rand', '5mC_both_rand')
add_both_ud_to_df(h_contig_df, '6mA_upstream_rand', '6mA_downstream_rand', '6mA_both_rand')

In [239]:
#add column with gene types

#make a list of genes in each type
busco_list_h = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v12_h_busco.gene_name.bed'), sep='\t', header=None)[3].tolist()
effector_list_h = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v13_h_ctg.effectors.bed'), sep='\t', header=None)[3].tolist()
te_list_h = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v13_h_ctg.TE.sorted.bed'), sep='\t', header=None)[3].tolist()

In [290]:
# add gene types
h_contig_df['Gene_type'] = 'Other genes'
h_contig_df.loc[h_contig_df.name.isin(effector_list_h), 'Gene_type'] = 'Effector'
h_contig_df.loc[h_contig_df.name.isin(busco_list_h), 'Gene_type'] = 'BUSCO'
h_contig_df.loc[h_contig_df.name.isin(te_list_h), 'Gene_type'] = 'TE gene'

In [293]:
# make dataframes for each group
def gene_type_df(df, gene):
    df2 = df[(df['Gene_type'] == gene)]
    return df2

In [332]:
busco_df_h = gene_type_df(h_contig_df, 'BUSCO')
effector_df_h = gene_type_df(h_contig_df, 'Effector')
te_gene_df_h = gene_type_df(h_contig_df, 'TE gene')
other_gene_df_h = gene_type_df(h_contig_df, 'Other genes')

In [292]:
column_list = [ '5mC_gene_body', '6mA_gene_body', '5mC_upstream', '6mA_upstream', '5mC_downstream', '6mA_downstream', '6mA_tss', '5mC_both', '6mA_both']

In [316]:
#run spearman on each gene subtype

def wilcoxon_gene_types(df, columns_list):
    """This function returns a dictionary of spearman's R statistic and p-value for a test of observed and expected sites."""
    wilcoxon_dict = {}
    for column in columns_list:
        obs = df[column]
        rand = column + '_rand'
        exp = df[rand]
        stat, p = wilcoxon(obs, exp)
        wilcoxon_dict[column] = stat, p
    return wilcoxon_dict

In [333]:
# run wilcoxon test to see whether each gene non-randomly associated with methylation
busco_wilcoxon_dict_h = wilcoxon_gene_types(busco_df_h, column_list)
effector_wilcoxon_dict_h = wilcoxon_gene_types(effector_df_h, column_list)
te_gene_wilcoxon_dict_h = wilcoxon_gene_types(te_gene_df_h, column_list)
other_gene_wilcoxon_dict_h = wilcoxon_gene_types(other_gene_df_h, column_list)

In [338]:
# save out dataframes of each wilcoxon dict as tsv

def save_wilcoxon_dict(wdict, name, genome):
    df = pd.DataFrame.from_dict(wdict, orient='index')
    df.rename(columns={0: 'Wilcoxon T statistic', 1: 'p-value'}, inplace=True)
    out_fn = os.path.join(DIRS['FIGURES'], 'coverage', ('%s_wilcoxon_t_table_%s.tsv' % (name, genome)))
    df.to_csv(out_fn, header=True, sep = '\t')

In [339]:
save_wilcoxon_dict(busco_wilcoxon_dict_h, 'busco', 'h')
save_wilcoxon_dict(effector_wilcoxon_dict_h, 'effector', 'h')
save_wilcoxon_dict(te_gene_wilcoxon_dict_h, 'te_gene', 'h')
save_wilcoxon_dict(other_gene_wilcoxon_dict_h, 'other_gene', 'h')

In [358]:
#save out the large df of coverage results
h_contig_df.to_csv((os.path.join(DIRS['FIGURES'], 'coverage', 'coverage_df_h.tsv')), header=True, index=False, sep = '\t')

In [359]:
h_contig_df.head()

Unnamed: 0,name,5mC_gene_body,6mA_gene_body,5mC_gene_body_rand,6mA_gene_body_rand,5mC_upstream,6mA_upstream,5mC_upstream_rand,6mA_upstream_rand,5mC_downstream,6mA_downstream,5mC_downstream_rand,6mA_downstream_rand,6mA_tss,6mA_tss_rand,5mC_both,6mA_both,5mC_both_rand,6mA_both_rand,Gene_type
0,gene_model_hcontig_0000_03.1,0.105145,0.040268,0.013423,0.004474,0.124,0.034,0.018,0.01,0.109,0.035,0.014,0.017,0.048,0.004,0.1165,0.0345,0.016,0.0135,Effector
1,gene_model_hcontig_0000_03.2,0.00199,0.00597,0.01592,0.010945,0.028,0.012,0.012,0.011,0.016,0.028,0.017,0.007,0.0,0.014,0.022,0.02,0.0145,0.009,Effector
2,gene_model_hcontig_0000_03.3,0.005576,0.004647,0.012082,0.008364,0.005,0.002,0.012,0.01,0.013,0.002,0.016,0.005,0.008,0.008,0.009,0.002,0.014,0.0075,Other genes
3,gene_model_hcontig_0000_03.4,0.030516,0.028756,0.01115,0.013498,0.028,0.081,0.011,0.017,0.013,0.039,0.007,0.018,0.03,0.024,0.0205,0.06,0.009,0.0175,Other genes
4,EVM prediction%2hcontig_0000_003.5,0.014035,0.055263,0.007017,0.016667,0.013,0.042,0.007,0.017,0.012,0.039,0.011,0.014,0.058,0.012,0.0125,0.0405,0.009,0.0155,TE gene


# <span style='color:#a347ff'> P contig <span/>

In [340]:
# make one big data frame of gene_ID and all the coverage for every region and their corresponding randomised score

# initial df = just the gene names
# get gene names from a file
p_contig_df = gene_methyl_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name']]

In [342]:
#gene body
p_contig_df = p_contig_df.join(gene_methyl_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_gene_body'}, inplace=True)
p_contig_df = p_contig_df.join(gene_methyl_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_gene_body'}, inplace=True)
p_contig_df = p_contig_df.join(gene_rand_methyl_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_gene_body_rand'}, inplace=True)
p_contig_df = p_contig_df.join(gene_rand_methyl_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_gene_body_rand'}, inplace=True)

#upstream
p_contig_df = p_contig_df.join(upstream_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_upstream'}, inplace=True)
p_contig_df = p_contig_df.join(upstream_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_upstream'}, inplace=True)
p_contig_df = p_contig_df.join(upstream_rand_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_upstream_rand'}, inplace=True)
p_contig_df = p_contig_df.join(upstream_rand_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.upstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_upstream_rand'}, inplace=True)

#downstream
p_contig_df = p_contig_df.join(downstream_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_downstream'}, inplace=True)
p_contig_df = p_contig_df.join(downstream_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_downstream'}, inplace=True)
p_contig_df = p_contig_df.join(downstream_rand_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.5mC_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '5mC_downstream_rand'}, inplace=True)
p_contig_df = p_contig_df.join(downstream_rand_coverage_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.downstream.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_downstream_rand'}, inplace=True)

#tss
p_contig_df = p_contig_df.join(tss_overlap_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_tss'}, inplace=True)
p_contig_df = p_contig_df.join(tss_rand_dict['Pst_104E_v13_p_ctg_combined_sorted_anno.6mA_hc_tombo_sorted.cutoff.0.80.tss.overlap.bed'][['name', 'overlap_fraction']].set_index('name'), on='name').copy()
p_contig_df.rename(columns={'overlap_fraction': '6mA_tss_rand'}, inplace=True)

In [344]:
add_both_ud_to_df(p_contig_df, '5mC_upstream', '5mC_downstream', '5mC_both')
add_both_ud_to_df(p_contig_df, '6mA_upstream', '6mA_downstream', '6mA_both')
add_both_ud_to_df(p_contig_df, '5mC_upstream_rand', '5mC_downstream_rand', '5mC_both_rand')
add_both_ud_to_df(p_contig_df, '6mA_upstream_rand', '6mA_downstream_rand', '6mA_both_rand')

In [346]:
#add column with gene types

#make a list of genes in each type
busco_list_p = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v12_p_busco.gene_name.bed'), sep='\t', header=None)[3].tolist()
effector_list_p = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v13_p_ctg.effectors.bed'), sep='\t', header=None)[3].tolist()
te_list_p = pd.read_csv(os.path.join(DIRS['GFF_INPUT'], 'Pst_104E_v13_p_ctg.TE.sorted.bed'), sep='\t', header=None)[3].tolist()

# add gene types
p_contig_df['Gene_type'] = 'Other genes'
p_contig_df.loc[p_contig_df.name.isin(effector_list_p), 'Gene_type'] = 'Effector'
p_contig_df.loc[p_contig_df.name.isin(busco_list_p), 'Gene_type'] = 'BUSCO'
p_contig_df.loc[p_contig_df.name.isin(te_list_p), 'Gene_type'] = 'TE gene'

#make gene type df
busco_df_p = gene_type_df(p_contig_df, 'BUSCO')
effector_df_p = gene_type_df(p_contig_df, 'Effector')
te_gene_df_p = gene_type_df(p_contig_df, 'TE gene')
other_gene_df_p = gene_type_df(p_contig_df, 'Other genes')

In [349]:
busco_wilcoxon_dict_p = wilcoxon_gene_types(busco_df_p, column_list)
effector_wilcoxon_dict_p = wilcoxon_gene_types(effector_df_p, column_list)
te_gene_wilcoxon_dict_p = wilcoxon_gene_types(te_gene_df_p, column_list)
other_gene_wilcoxon_dict_p = wilcoxon_gene_types(other_gene_df_p, column_list)

save_wilcoxon_dict(busco_wilcoxon_dict_p, 'busco', 'p')
save_wilcoxon_dict(effector_wilcoxon_dict_p, 'effector', 'p')
save_wilcoxon_dict(te_gene_wilcoxon_dict_p, 'te_gene', 'p')
save_wilcoxon_dict(other_gene_wilcoxon_dict_p, 'other_gene', 'p')

In [345]:
p_contig_df.columns

Index(['name', '5mC_gene_body', '6mA_gene_body', '5mC_gene_body_rand',
       '6mA_gene_body_rand', '5mC_upstream', '6mA_upstream',
       '5mC_upstream_rand', '6mA_upstream_rand', '5mC_downstream',
       '6mA_downstream', '5mC_downstream_rand', '6mA_downstream_rand',
       '6mA_tss', '6mA_tss_rand', '5mC_both', '6mA_both', '5mC_both_rand',
       '6mA_both_rand'],
      dtype='object')

In [350]:
p_contig_df.head()

Unnamed: 0,name,5mC_gene_body,6mA_gene_body,5mC_gene_body_rand,6mA_gene_body_rand,5mC_upstream,6mA_upstream,5mC_upstream_rand,6mA_upstream_rand,5mC_downstream,6mA_downstream,5mC_downstream_rand,6mA_downstream_rand,6mA_tss,6mA_tss_rand,5mC_both,6mA_both,5mC_both_rand,6mA_both_rand,Gene_type
0,gene_model_pcontig_000.1,0.152112,0.04469,0.01476,0.01107,0.172324,0.075718,0.01436,0.007833,0.117,0.033,0.012,0.01,0.064,0.016,0.144662,0.054359,0.01318,0.008916,Other genes
1,gene_model_pcontig_000.2,0.082353,0.013235,0.016176,0.010294,0.091,0.024,0.016,0.011,0.02,0.015,0.012,0.008,0.018,0.008,0.0555,0.0195,0.014,0.0095,Other genes
2,gene_model_pcontig_000.3,0.068794,0.046859,0.018943,0.00997,0.028,0.023,0.015,0.009,0.043,0.063,0.012,0.008,0.036,0.012,0.0355,0.043,0.0135,0.0085,Effector
3,gene_model_pcontig_000.4,0.002788,0.002788,0.012082,0.01394,0.0,0.0,0.016,0.017,0.004,0.0,0.015,0.015,0.002,0.012,0.002,0.0,0.0155,0.016,Other genes
4,gene_model_pcontig_000.5,0.006355,0.001733,0.019064,0.009821,0.013,0.002,0.01,0.009,0.02,0.0,0.005,0.01,0.002,0.01,0.0165,0.001,0.0075,0.0095,Other genes


In [357]:
#save out the large df of coverage results
p_contig_df.to_csv((os.path.join(DIRS['FIGURES'], 'coverage', 'coverage_df_p.tsv')), header=True, index=False, sep = '\t')

# Old stuff

In [120]:
# spearman_dict = spearman_coverage(coverage_fn_dict)
spearman_dict

{'5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.bed': (1.0,
  0.0),
 '5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.6mA_hc_tombo_sorted.cutoff.0.99.10kb.overlap.bed': (0.7384667531527627,
  0.0),
 '5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.Pst_104E_v13_ph_ctg.REPET.sorted.filtered.superfamily.10kb.overlap.bed': (0.44630536410739424,
  0.0),
 '5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.Pst_104E_v13_ph_ctg.TE.sorted.10kb.overlap.bed': (0.25369718437406363,
  1.690789556321972e-233),
 '5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.Pst_104E_v13_ph_ctg.anno.sorted.10kb.overlap.bed': (-0.38281643716088243,
  0.0),
 '6mA_hc_tombo_sorted.cutoff.0.99.10kb.overlap.5mC_hc_tombo_sorted.cutoff.0.99.10kb.overlap.bed': (0.7384667531527627,
  0.0),
 '6mA_hc_tombo_sorted.cutoff.0.99.10kb.overlap.6mA_hc_tombo_sorted.cutoff.0.99.10kb.overlap.bed': (1.0,
  0.0),
 '6mA_hc_tombo_sorted.cutoff.0.99.10kb.overlap.Pst_104E_v13_ph_ctg.REPET.sorted.filtered.superfami

In [113]:
spearman_dict

{'5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.bed': (0.9999999999999998,
  0.0),
 '5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.6mA_hc_tombo_sorted.cutoff.0.00.10kb.overlap.bed': (-0.47934594757954724,
  0.0),
 '5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.Pst_104E_v13_ph_ctg.REPET.sorted.filtered.superfamily.10kb.overlap.bed': (-0.12843414371679912,
  7.829840669103839e-60),
 '5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.Pst_104E_v13_ph_ctg.TE.sorted.10kb.overlap.bed': (0.07554648842181759,
  1.0681836180294743e-21),
 '5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.Pst_104E_v13_ph_ctg.anno.sorted.10kb.overlap.bed': (0.14866356711046494,
  9.316874268222433e-80),
 '6mA_hc_tombo_sorted.cutoff.0.00.10kb.overlap.5mC_hc_tombo_sorted.cutoff.0.00.10kb.overlap.bed': (-0.4793459475795472,
  0.0),
 '6mA_hc_tombo_sorted.cutoff.0.00.10kb.overlap.6mA_hc_tombo_sorted.cutoff.0.00.10kb.overlap.bed': (1.0,
  0.0),
 '6mA_hc_tombo_sorted.cutoff.0.00.10kb.overl

### Part 1
1. A. Run everything again for only genes that are not in transposons. 3 files: p, h and ph


1. B. Or just subset the dataframes for "combined" by removing any rows in "TE" and save this df to csv under the right name
2. And add the df to the coverage dictionary :)
3. Just rerun the bed file dictionary generator for each one
4. And add the input file to the gene_anno_dict :)

Need to run this for:
- upstream/input
- upstream/coverage
- upstream/rand
- downstream/input
- downstream/coverage
- downstream/rand
- both_upstream_downstream/input
- both_upstream_downstream/coverage
- both_upstream_downstream/rand
- gene_body/input
- gene_body/coverage
- gene_body/rand

### Part 2
1. subset df to get genes that ONLY have methylation in gene body/ upstream/ downstream/ both/ TSS
2. subset randomisation df for these too
3. test significance
4. use these as section of expression analysis

In [2505]:
# save out the Spearman's rho values as a table
spearman_df = pd.DataFrame.from_dict(ug_spearman_dict, orient='index')
spearman_df.rename(columns={0: 'R', 1: 'p-value'}, inplace=True)
out_fn = os.path.join(DIRS['FIGURES'], 'expression', 'spearman_r_table.tsv')
spearman_df.to_csv(out_fn, header=True, sep = '\t')

In [None]:
# 6mA: Spearman's rho between ranked methylated genes in df and ranked expressed genes in df

In [None]:
# Ignore both_u_d files for now!
# For the both_u_d files:
# iterrows through df and if name == name:
# add the frac1 + frac2 and divide by 2
# append the average to a list
# make a new df of ... ok this is too hard I'll ask Ben on Monday

In [None]:
# Make files ranking genes expressed in each life cycle stage from high to low expression

# Make a new file of the highly expressed genes for each life cycle stage
# Make a new file of the lowly expressed genes for each life cycle stage

# Make a file of genes that are highly expressed in all stages -> for loop that checks whether gene is in all "highly expressed" files
