# <span style="color:#ff1414"> BEDtools analysis. </span>

This is a script to answer research questions outlined elsewhere. In summary, this script:

1. compares methylation results between different methylation-callers, and between different methylation sequencing methods.

2. compares methylation between genes and non-gene regions

3. compares methylation between transposons and non-repetitive regions

4. compares transposons and genes


Note:
- PB/pb = PacBio
- ONT/ont = Oxford Nanopore Technologies
- NP = Nanopolish

In [14]:
import pybedtools
from pybedtools import BedTool
import os
import glob
import pprint
import numpy # need for p-value stats
import scipy

In [103]:
#First we need to define the base dirs
DIRS ={}
DIRS['BASE1'] = '/home/anjuni/methylation_calling/pacbio'
DIRS['BASE2'] = '/home/anjuni/analysis'
DIRS['BED_INPUT'] = os.path.join(DIRS['BASE2'], 'bedtools_output', 'sequencing_comparison')
DIRS['GFF_INPUT'] = os.path.join(DIRS['BASE2'], 'gff_output')
DIRS['WINDOW_OUTPUT'] = os.path.join(DIRS['BASE2'], 'windows')
DIRS['REF'] = '/home/anjuni/Pst_104_v13_assembly/'

In [6]:
#Quick check if directories exist
for value in DIRS.values():
    if not os.path.exists(value):
        print('%s does not exist' % value)

In [8]:
#Make filepaths
bed_file_list = [fn for fn in glob.iglob('%s/*.bed' % DIRS['BED_INPUT'], recursive=True)]
gff_file_list = [fn for fn in glob.iglob('%s/*anno.gff3' % DIRS['GFF_INPUT'], recursive=True)]
te_file_list = [fn for fn in glob.iglob('%s/*.gff' % DIRS['GFF_INPUT'], recursive=True)]

In [9]:
#Check that the list works
print(*bed_file_list, sep='\n')
print(*gff_file_list, sep='\n')
print(*te_file_list, sep='\n')

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/6mA_tombo_sorted.cutoff.0.5.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_tombo_sorted.cutoff.0.4.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/6mA_tombo_sorted.cutoff.0.6.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/6mA_prob_smrtlink_sorted.cutoff.0.95.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_nanopolish_sorted.cutoff.1.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_tombo_sorted.cutoff.0.2.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_nanopolish_sorted.cutoff.0.9.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_tombo_sorted.cutoff.1.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/filtered_bed/5mC_nanopolish_sorted.cutoff.0.4.bed
/home/an

## <span style='color:deeppink'> 1. Comparing methylation sequencing methods </span>

In [8]:
%%bash

# find overlap between 6mA sites called from PacBio and from Nanopore data

pb=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_prob_smrtlink_sorted.bed # use basecall accuracy instead of Phred score
ont=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_hc_tombo_sorted.bed # use sites with non-zero methylation

out1=/home/anjuni/analysis/bedtools_output/sequencing_comparison/6mA_pb_ont.bed
out2=/home/anjuni/analysis/bedtools_output/sequencing_comparison/6mA_ont_pb.bed

echo $pb
echo $ont

bedtools intersect -a $pb -b $ont > $out1
bedtools intersect -a $ont -b $pb > $out2

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_prob_smrtlink_sorted.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_hc_tombo_sorted.bed


In [12]:
%%bash

#check how many overlapping sites there were

cd /home/anjuni/analysis/bedtools_output/sequencing_comparison/
echo PacBio sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_prob_smrtlink_sorted.bed | wc -l

echo Overlapping sites:
less 6mA_pb_ont.bed | wc -l

echo Nanopore sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_hc_tombo_sorted.bed | wc -l

echo Overlapping sites:
less 6mA_ont_pb.bed | wc -l

echo Total adenine sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_tombo_sorted.bed | wc -l

PacBio sites:
88932
Overlapping sites:
84733
Nanopore sites:
83451878
Overlapping sites:
84733
Total adenine sites:
85779879


In [27]:
# Descriptive Statistics

print('Overlap between PacBio and Nanopore as a proportion of PB sites:', (84733/88932))
# = overlap between pb and ont, divided by total PacBio sites

print('Overlap between PacBio and Nanopore as a proportion of ONT sites:', (84733/83451878))
# = overlap between pb and ont, divided by total Nanopore sites

print('Proportion of adenines methylated:', (84733/85779879))
# = overlapping sites, divided by total number of adenines (the number of lines in the Tombo file, since Tombo counts all adenines)

Overlap between PacBio and Nanopore as a proportion of PB sites: 0.9527841496874017
Overlap between PacBio and Nanopore as a proportion of ONT sites: 0.0010153516257596982
Proportion of adenines methylated: 0.0009877957510292129


#### <span style='color:deeppink'> Observations </span>

Relative to PacBio, agreement between Nanopore and PacBio is very high: about 95% of PacBio sites (84733 of 88932) overlap a non-zero Tombo site. However, the PacBio sites are only a small fraction of the Tombo sites, and they include only highly accurate sites.

When PacBio sites were overlapped with all Nanopore (Tombo) sites, including the zero-probability sites, the overlap was higher (88932) than when only non-zero PB and ONT sites were overlapped (84733). This indicates that PB detected sites that Nanopore did not. These missed sites were high-probability sites, because the PB file contains only high-probability sites (>99% basecall accuracy), which suggests that Tombo/Nanopore missed some methylated sites.
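
The proportions in this section can be recomputed from the site counts with a small helper. This is a minimal sketch, not part of the original analysis: the function name `overlap_summary` is hypothetical, and the Jaccard index is an extra symmetric measure added here for illustration. It assumes each overlapping line corresponds to one site in each file.

```python
def overlap_summary(n_overlap, n_a, n_b):
    """Return the overlap as a fraction of each set, plus a Jaccard index."""
    if min(n_a, n_b) <= 0:
        raise ValueError("site counts must be positive")
    union = n_a + n_b - n_overlap  # sites present in either set
    return {
        "frac_of_a": n_overlap / n_a,
        "frac_of_b": n_overlap / n_b,
        "jaccard": n_overlap / union,
    }


# Counts taken from the `wc -l` output in this section.
summary = overlap_summary(n_overlap=84733, n_a=88932, n_b=83451878)
for name, value in summary.items():
    print(f"{name}: {value:.4%}")
```

Reporting the fraction relative to both files makes the asymmetry explicit: the overlap is large relative to the PacBio file and tiny relative to the Tombo file.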

## <span style='color:#ff14ff'> 2. Comparing methylation detection methods </span>

In [3]:
%%bash

# compare overlap between Tombo and Nanopolish for 5mC data
np=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed
tombo=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed

out1=/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_np_tombo.bed
out2=/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_tombo_np.bed

echo $np
echo $tombo

bedtools intersect -a $np -b $tombo > $out1
bedtools intersect -a $tombo -b $np > $out2

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed


In [5]:
%%bash

#check how many overlapping sites there were

cd /home/anjuni/analysis/bedtools_output/sequencing_comparison/

echo Nanopolish sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed | wc -l

echo Overlapping sites:
less 5mC_np_tombo.bed | wc -l

echo Tombo sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed | wc -l

echo Overlapping sites:
less 5mC_tombo_np.bed | wc -l

echo Total cytosine sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_tombo_sorted.bed | wc -l

echo Total CpG sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_nanopolish_sorted.bed | wc -l

Nanopolish sites:
3783438
Overlapping sites:
1681653
Tombo sites:
67308386
Overlapping sites:
1681653
Total cytosine sites:
68536018
Total CpG sites:
5302131


In [14]:
# Descriptive Statistics

print('Overlap between Nanopolish and Tombo as a proportion of NP sites:', (1681653/3783438))
# = overlap between np and tombo, divided by total NP sites

print('Overlap between Nanopolish and Tombo as a proportion of Tombo sites:', (1681653/67308386))
# = overlap between np and tombo, divided by total Tombo sites

print('Proportion of cytosines methylated:', (1681653/68536018))
# = overlapping sites, divided by total number of cytosines (the number of lines in the Tombo file, since Tombo counts all cytosines)

print('Proportion of CpG sites methylated:', (1681653/5302131))
# = overlapping sites, divided by total number of CpG sites (the number of lines in the NP file, since NP counts all CpG sites)

Overlap between Nanopolish and Tombo as a proportion of NP sites: 0.4444774831779984
Overlap between Nanopolish and Tombo as a proportion of Tombo sites: 0.02498430136179465
Proportion of cytosines methylated: 0.02453677714395371
Proportion of CpG sites methylated: 0.3171654944021564


#### <span style='color:#ff14ff'> Observations </span>
While adenine methylation showed high similarity between ONT and PB, cytosine methylation showed only 44% overlap between NP and Tombo (as a proportion of NP sites). This is likely because NP only reports CpG sites, whereas Tombo reports all cytosine sites. Tombo can therefore call potentially methylated sites outside a CpG context, so it starts with far more sites than NP.

In [24]:
%%bash

#check how many cytosine sites and CpG sites there are

cd /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/

echo CpG sites:
less 5mC_nanopolish_sorted.bed | wc -l

echo Cytosine sites:
less 5mC_tombo_sorted.bed | wc -l

CpG sites:
5302131
Cytosine sites:
68536018


In [23]:
# Descriptive Statistics

print('CpG sites as a proportion of cytosine sites:', (5302131/68536018))
# = np sites divided by tombo sites

CpG sites as a proportion of cytosine sites: 0.07736269416761271


#### <span style='color:#ff14ff'> Solution </span>
1. Make a file of methylated CpG sites detected by Tombo.
2. Intersect this with methylated (CpG) sites detected by Nanopolish.

Only CpG sites are compared, because NP reports only CpG sites whereas Tombo reports all cytosine sites.
The first comparison was therefore not like-for-like, because Tombo considers sites that NP does not.
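
As a toy illustration of steps 1 and 2, the sketch below intersects two small site lists in plain Python. It is not bedtools: it only matches identical `(chrom, start, end)` coordinates, and the contig names, positions and scores are made up. It mirrors one convention of `bedtools intersect`, namely that the reported rows carry the columns of the `-a` input, so swapping the inputs gives the same number of sites but different scores.

```python
def intersect_sites(a_rows, b_rows):
    """Keep the rows of A whose (chrom, start, end) also occurs in B.

    The returned rows carry A's columns (including A's score), not B's.
    """
    b_keys = {(chrom, start, end) for chrom, start, end, *_ in b_rows}
    return [row for row in a_rows if row[:3] in b_keys]


# Hypothetical single-base sites: (chrom, start, end, name, score)
tombo = [
    ("ctg1", 10, 11, "5mC", 0.9),
    ("ctg1", 20, 21, "5mC", 0.4),  # not a CpG site in this toy example
    ("ctg1", 35, 36, "5mC", 0.7),
]
nanopolish = [
    ("ctg1", 10, 11, "CpG", 0.6),
    ("ctg1", 35, 36, "CpG", 1.0),
    ("ctg1", 50, 51, "CpG", 0.8),
]

tombo_at_np = intersect_sites(tombo, nanopolish)  # rows carry Tombo scores
np_at_tombo = intersect_sites(nanopolish, tombo)  # rows carry Nanopolish scores

print(len(tombo_at_np), [row[4] for row in tombo_at_np])
print(len(np_at_tombo), [row[4] for row in np_at_tombo])
```

Which file is passed as `-a` therefore matters whenever the output is later filtered on its score column.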

### <span style='color:#ff14ff'> CpG Sites </span>

In [39]:
%%bash

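# make a file of CpG sites that Tombo called as methylated
# note: bedtools intersect reports the -a features, so the columns in the output come from the Nanopolish file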
all_cpg=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_nanopolish_sorted.bed
all_tombo=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed
tombo_cpg=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.CpG.bed

bedtools intersect -a $all_cpg -b $all_tombo > $tombo_cpg

In [19]:
%%bash

# intersect Tombo and NP CpG sites
tombo_cpg=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.CpG.bed
np_cpg=/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed
m_tombo_cpg=/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_CpG_tombo_np.bed
m_np_cpg=/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_CpG_np_tombo.bed

bedtools intersect -a $tombo_cpg -b $np_cpg > $m_tombo_cpg
bedtools intersect -a $np_cpg -b $tombo_cpg > $m_np_cpg

In [8]:
%%bash

#check how many overlapping sites there were
cd /home/anjuni/analysis/bedtools_output/sequencing_comparison/

echo Nanopolish methylated CpG sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed | wc -l

echo Tombo methylated CpG sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.CpG.bed | wc -l

echo Overlapping methylated CpG sites:
less 5mC_CpG_tombo_np.bed | wc -l

echo Total CpG sites:
less /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_nanopolish_sorted.bed | wc -l

Nanopolish methylated CpG sites:
3783438
Tombo methylated CpG sites:
2352724
Overlapping methylated CpG sites:
1681653
Total CpG sites:
5302131


In [9]:
# Descriptive Statistics

print('Overlap between Nanopolish and Tombo as a proportion of NP sites:', (1681653/3783438))
# = overlap between np and tombo, divided by total NP sites

print('Overlap between Nanopolish and Tombo as a proportion of Tombo sites:', (1681653/2352724))
# = overlap between np and tombo, divided by total Tombo sites

print('Proportion of CpG sites methylated:', (1681653/5302131))
# = overlapping sites, divided by total number of CpG sites (the number of lines in the NP file, since NP counts all CpG sites)

Overlap between Nanopolish and Tombo as a proportion of NP sites: 0.4444774831779984
Overlap between Nanopolish and Tombo as a proportion of Tombo sites: 0.7147684981323776
Proportion of CpG sites methylated: 0.3171654944021564


#### <span style='color:#ff14ff'> Observations </span>
Restricting the comparison to CpG sites did not change the number of overlapping sites (1681653), so the overlap is still 44% of NP sites, although it rises to 71% of Tombo CpG sites. The overlap of just under 50% of NP sites may be because NP only considers one strand while Tombo considers both.

This may be resolved by using only the plus file from Tombo and comparing its results to NP.
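
If a separate plus-strand file is not at hand, one alternative is to filter a combined BED file by strand. The sketch below is a minimal plain-Python version under a loud assumption: that the Tombo BED files are BED6 with the strand in the sixth column, which this notebook does not confirm. The function name `keep_strand` and the example rows are hypothetical.

```python
def keep_strand(bed_lines, strand="+"):
    """Yield BED lines whose sixth column (assumed to be the strand) equals `strand`."""
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[5] == strand:
            yield line


# Hypothetical BED6 rows: chrom, start, end, name, score, strand
lines = [
    "ctg1\t10\t11\t5mC\t0.9\t+\n",
    "ctg1\t20\t21\t5mC\t0.4\t-\n",
    "ctg1\t35\t36\t5mC\t0.7\t+\n",
]
plus_only = list(keep_strand(lines))
print("".join(plus_only), end="")
```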

#### <span style='color:#ff14ff'> insert diagram: Venn diagram of Nanopolish CpG and Tombo CpG </span>

## <span style='color:#8a14ff'> 3. Making cutoff files. </span>

### <span style='color:#8a14ff'> 3.A Making cutoff files for overlapping files from previous section. </span>

In [None]:
%%bash

#Copy the tombo hc file to the 'sequencing_comparison' folder with the other overlapped files to continue analysis
cp /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed ~/analysis/bedtools_output/sequencing_comparison/

In [35]:
#Make filepaths for both 6mA files, both CpG files, and the tombo file
bed_file_list = ['/home/anjuni/analysis/bedtools_output/sequencing_comparison/6mA_ont_pb.bed', \
                 '/home/anjuni/analysis/bedtools_output/sequencing_comparison/6mA_pb_ont.bed', \
                 '/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_CpG_tombo_np.bed', \
                 '/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_CpG_np_tombo.bed', \
                 '/home/anjuni/analysis/bedtools_output/sequencing_comparison/5mC_hc_tombo_sorted.bed']

In [36]:
# Make the list of cutoffs
cutoff_list = [1.00, 0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

In [42]:
# Define function to filter
def score_filter(feature, L):
    """Returns True if the feature's score is at least L"""
    return float(feature.score) >= L

def filter_by_cutoffs(bed_files, cutoffs, initial_file_path, final_file_path):
    """Filters each file by every cutoff in the list, and names each output file after its cutoff."""
    os.makedirs(final_file_path, exist_ok=True)  # make sure the output folder exists
    for file in bed_files:
        pybed_object = BedTool(file)
        for x in cutoffs:
            filtered_file = pybed_object.filter(score_filter, x)
            # two decimals, e.g. '.cutoff.0.90.bed', as in the cutoff file listings further down
            cutoff_name = '.cutoff.%.2f.bed' % x
            out_filename = file.replace('.bed', cutoff_name)
            out_filename = out_filename.replace(initial_file_path, final_file_path)
            filtered_file.saveas(out_filename)

In [None]:
#Run the function to filter all files
initial_fp = '/home/anjuni/analysis/bedtools_output/sequencing_comparison/'
final_fp = '/home/anjuni/analysis/bedtools_output/cutoffs_from_intersects/'
filter_by_cutoffs(bed_file_list, cutoff_list, initial_fp, final_fp)
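
To see what the cutoff filter does without touching any files, the sketch below counts how many sites survive each cutoff for a list of scores. It is a plain-Python illustration only: the function name `count_sites_by_cutoff` and the scores are hypothetical, and it uses the same `score >= cutoff` rule as `score_filter`.

```python
def count_sites_by_cutoff(scores, cutoffs):
    """Return {cutoff: number of scores >= cutoff} for every cutoff."""
    return {c: sum(1 for s in scores if s >= c) for c in cutoffs}


# Hypothetical per-site methylation scores between 0 and 1.
example_scores = [1.0, 0.97, 0.5, 0.42, 0.1, 0.0]
example_cutoffs = [1.00, 0.95, 0.50, 0.10]

counts = count_sites_by_cutoff(example_scores, example_cutoffs)
for cutoff, n_sites in counts.items():
    print('cutoff %.2f keeps %d sites' % (cutoff, n_sites))
```

Because the comparison is inclusive, a site with a score exactly equal to the cutoff is kept.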

### <span style='color:#8a14ff'> 3.B Making cutoff files from the original methylation-calling files. </span>

The original methylation-calling files are filtered by cutoff first and then overlapped, so that both files in each intersect share a similar cutoff. This is because the cutoffs from the previous section were based only on the scores of the -a file in bedtools.

In [40]:
# make a list of file paths for the five input files
sorted_bed_files = ['/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_nanopolish_sorted.bed', \
                   '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.CpG.bed', \
                   '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/5mC_hc_tombo_sorted.bed', \
                   '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_prob_smrtlink_sorted.bed', \
                   '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/6mA_hc_tombo_sorted.bed']

In [45]:
#Run the function to filter all the sorted bed files
initial_fp1 = '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/'
final_fp1 = '/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs/'
filter_by_cutoffs(sorted_bed_files, cutoff_list, initial_fp1, final_fp1)

In [61]:
%%bash

#Move the 6mA files and 5mC files to separate folders
cd /home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/
mkdir cutoffs_6mA
mkdir cutoffs_5mC
mv cutoffs/6mA* cutoffs_6mA
mv cutoffs/5mC* cutoffs_5mC
rmdir cutoffs

In [62]:
# record the paths of the 6mA and 5mC cutoff directories
DIRS['BED_CUTOFFS'] = os.path.join(DIRS['BASE1'], 'input', 'sorted_bed_files', 'cutoffs')
DIRS['6MA_CUTOFFS'] = os.path.join(DIRS['BASE1'], 'input', 'sorted_bed_files', 'cutoffs_6mA')
DIRS['5MC_CUTOFFS'] = os.path.join(DIRS['BASE1'], 'input', 'sorted_bed_files', 'cutoffs_5mC')

In [63]:
print(DIRS['BED_CUTOFFS'])
print(DIRS['6MA_CUTOFFS'])
print(DIRS['5MC_CUTOFFS'])

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC


In [69]:
# make a list of 6mA cutoff files from Nanopore and PacBio
ont_6mA = [fn for fn in glob.iglob('%s/6mA_hc_tombo*.bed' % DIRS['6MA_CUTOFFS'], recursive=True)]
pb_6mA = [fn for fn in glob.iglob('%s/6mA_prob_smrtlink*.bed' % DIRS['6MA_CUTOFFS'], recursive=True)]

#test out these lists by printing
print(*ont_6mA, sep='\n')
print(*pb_6mA, sep='\n')

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.90.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.60.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.30.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.80.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.70.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.20.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.10.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.40.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.99.bed
/

In [70]:
# make a list of 5mC cutoff files from Nanopolish and Tombo
np_5mC = [fn for fn in glob.iglob('%s/5mC_hc_nanopolish*.bed' % DIRS['5MC_CUTOFFS'], recursive=True)]
tombo_CpG_5mC = [fn for fn in glob.iglob('%s/5mC_hc_tombo_sorted.CpG*.bed' % DIRS['5MC_CUTOFFS'], recursive=True)]
tombo_5mC = [fn for fn in glob.iglob('%s/5mC_hc_tombo_sorted.c*.bed' % DIRS['5MC_CUTOFFS'], recursive=True)]

#test out these lists by printing
print(*np_5mC, sep='\n')
print(*tombo_CpG_5mC, sep='\n')
print(*tombo_5mC, sep='\n')

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.90.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.60.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.80.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.40.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.50.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.10.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.20.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5mC/5mC_hc_nanopolish_sorted.cutoff.0.70.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_5

In [77]:
# the lists are not sorted, so sort them so that files at the same position share a cutoff before intersecting
ont_6mA = sorted(ont_6mA)
pb_6mA = sorted(pb_6mA)
np_5mC = sorted(np_5mC)
tombo_CpG_5mC = sorted(tombo_CpG_5mC)
tombo_5mC = sorted(tombo_5mC)

In [78]:
#Check if it worked. (It did!) :D
print(*ont_6mA, sep='\n')
print(*pb_6mA, sep='\n')
print(*np_5mC, sep='\n')
print(*tombo_CpG_5mC, sep='\n')
print(*tombo_5mC, sep='\n')

/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.10.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.20.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.30.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.40.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.50.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.60.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.70.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.80.bed
/home/anjuni/methylation_calling/pacbio/input/sorted_bed_files/cutoffs_6mA/6mA_hc_tombo_sorted.cutoff.0.90.bed
/

In [84]:
#make the filepaths for intersects output

DIRS['I_FROM_C'] = os.path.join(DIRS['BASE2'], 'bedtools_output', 'intersects_from_cutoffs')

In [None]:
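# intersect a list of -a files with a list of -b files, pairing files that share a cutoff
def intersect_cutoffs(list_a, list_b, n_elements, out_dir):
    """Take a list of files and intersect them with another list of files, where files are matched by methylation cutoff.

    Both lists must be sorted so that position i in each list has the same cutoff."""
    for i in range(n_elements):
        a_bed = BedTool(list_a[i])
        b_bed = BedTool(list_b[i])
        # output naming is a placeholder convention: <a file name>.x.<b file name>
        out_name = os.path.join(out_dir, os.path.basename(list_a[i]).replace('.bed', '.x.') + os.path.basename(list_b[i]))
        intersected_cutoff = a_bed.intersect(b_bed).saveas(out_name)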
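
Matching files by list position only works if both lists contain exactly the same cutoffs in the same order. A more defensive option is to pair files by the cutoff value in their names. The sketch below is one way to do that, assuming names end in `.cutoff.<value>.bed` as in the listings above; the helper name `pair_by_cutoff` and the `x/` directory are hypothetical.

```python
import re

# Matches the cutoff value in names such as '6mA_hc_tombo_sorted.cutoff.0.90.bed'.
CUTOFF_RE = re.compile(r"\.cutoff\.(\d+(?:\.\d+)?)\.bed$")


def pair_by_cutoff(list_a, list_b):
    """Return (cutoff, a_path, b_path) for every cutoff present in both lists."""
    def index(paths):
        by_cutoff = {}
        for path in paths:
            match = CUTOFF_RE.search(path)
            if match:
                by_cutoff[float(match.group(1))] = path
        return by_cutoff

    a_index, b_index = index(list_a), index(list_b)
    shared = sorted(a_index.keys() & b_index.keys())
    return [(c, a_index[c], b_index[c]) for c in shared]


# Unsorted, unequal-length example lists.
a_files = ['x/6mA_hc_tombo_sorted.cutoff.0.90.bed',
           'x/6mA_hc_tombo_sorted.cutoff.0.10.bed']
b_files = ['x/6mA_prob_smrtlink_sorted.cutoff.0.10.bed',
           'x/6mA_prob_smrtlink_sorted.cutoff.0.90.bed',
           'x/6mA_prob_smrtlink_sorted.cutoff.0.99.bed']

pairs = pair_by_cutoff(a_files, b_files)
for cutoff, a_path, b_path in pairs:
    print(cutoff, a_path, b_path)
```

Cutoffs that appear in only one list (0.99 here) are skipped rather than silently paired with the wrong file.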

In [None]:
# Make lists of cutoffs from each of the 5 initial BED files
smrtlink

## <span style='color:#144fff'> 4. Making windows. </span>

In [110]:
# Make folder for windows. Each BED file will contain a series of windows
#os.mkdir(DIRS['WINDOW_OUTPUT'])
gene_fn = '/home/anjuni/analysis/gff_output/Pst_104E_v13_ph_ctg.anno.sorted.gff3'
te_fn = '/home/anjuni/analysis/gff_output/Pst_104E_v13_ph_ctg.TE.sorted.gff3'
reference_genome = os.path.join(DIRS['REF'], 'Pst_104E_v13_ph_ctg.fa')

In [116]:
# Make the genome size file for windows
!samtools faidx /home/anjuni/Pst_104_v13_assembly/Pst_104E_v13_ph_ctg.fa
!cut -f 1,2 /home/anjuni/Pst_104_v13_assembly/Pst_104E_v13_ph_ctg.fa.fai > /home/anjuni/analysis/gff_output/Pst_104E_v13_ph_ctg.genome_file
# Note: this does put the p contig values before h contig ones, while annotation files put h contig before p contig
# May be a problem in the future but probs not
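
The `cut -f 1,2` step keeps the first two columns of the `.fai` index, which are the sequence name and its length. A plain-Python equivalent is sketched below for reference; the function name `fai_to_genome` and the example index lines are hypothetical.

```python
def fai_to_genome(fai_lines):
    """Yield (sequence name, length) from the lines of a samtools .fai index."""
    for line in fai_lines:
        if not line.strip():
            continue
        # .fai columns: name, length, offset, bases per line, bytes per line
        name, length = line.rstrip("\n").split("\t")[:2]
        yield name, int(length)


# Hypothetical index lines for two short contigs.
example_fai = [
    "ctg_a\t250\t7\t60\t61\n",
    "ctg_b\t100\t269\t60\t61\n",
]
genome = list(fai_to_genome(example_fai))
for name, length in genome:
    print('%s\t%d' % (name, length))
```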

In [97]:
pprint.pprint(DIRS) # for reference

{'BASE1': '/home/anjuni/methylation_calling/pacbio',
 'BASE2': '/home/anjuni/analysis',
 'BED_INPUT': '/home/anjuni/analysis/bedtools_output/sequencing_comparison',
 'GFF_INPUT': '/home/anjuni/analysis/gff_output',
 'I_FROM_C': '/home/anjuni/analysis/bedtools_output/intersects_from_cutoffs',
 'WINDOW_OUTPUT': '/home/anjuni/analysis/windows'}


In [117]:
# Make the window BED files
# Test it out on large windows for a small dataset
# Define all file paths
window_fn_dict = {}
window_bed_dict = {}
window_fn_dict['100kb'] = os.path.join(DIRS['WINDOW_OUTPUT'], 'Pst_104E_v13_ph_ctg_w100kb.bed')
window_fn_dict['30kb'] = os.path.join(DIRS['WINDOW_OUTPUT'], 'Pst_104E_v13_ph_ctg_w30kb.bed')
window_fn_dict['10kb'] = os.path.join(DIRS['WINDOW_OUTPUT'], 'Pst_104E_v13_ph_ctg_w10kb.bed')
genome_size_f_fn = '/home/anjuni/analysis/gff_output/Pst_104E_v13_ph_ctg.genome_file'

In [118]:
#just here to name output files :)
#window_fn_dict['100kb_ont_6mA_0.10'] = ont_6mA[0]
#window_fn_dict['100kb_pb_6mA_0.10'] = pb_6mA[0]
#window_fn_dict['100kb_genes'] = gene_fn
#window_fn_dict['100kb_TE'] = te_fn

In [133]:
# Check whether the dictionary looks nice :) (it does!) :D
pprint.pprint(window_fn_dict)

{'100kb': '/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w100kb.bed',
 '10kb': '/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w10kb.bed',
 '30kb': '/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w30kb.bed'}


In [119]:
# Make the actual windows! :D
!bedtools makewindows -g {genome_size_f_fn} -w 100000 > {window_fn_dict['100kb']}
!bedtools makewindows -g {genome_size_f_fn} -w 30000 > {window_fn_dict['30kb']}
!bedtools makewindows -g {genome_size_f_fn} -w 10000 > {window_fn_dict['10kb']}
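
For intuition, the windowing done by `bedtools makewindows -g ... -w N` can be sketched in plain Python: each sequence is cut into consecutive windows of N bp, and the last window is shortened to end at the sequence length. This is an illustration only, with hypothetical contig names and sizes, not a replacement for the commands above.

```python
def make_windows(chrom_sizes, window):
    """Yield (chrom, start, end) windows of at most `window` bp, BED-style (0-based, end-exclusive)."""
    if window <= 0:
        raise ValueError("window size must be positive")
    for chrom, size in chrom_sizes:
        for start in range(0, size, window):
            # The final window is truncated at the end of the sequence.
            yield chrom, start, min(start + window, size)


# Hypothetical genome file contents: (sequence name, length)
example_sizes = [("ctg_a", 250), ("ctg_b", 100)]
windows = list(make_windows(example_sizes, window=100))
for chrom, start, end in windows:
    print('%s\t%d\t%d' % (chrom, start, end))
```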

In [139]:
#now make a BedTool object for each window file
for key, value in window_fn_dict.items() :
    window_bed_dict[key] = BedTool(value)

In [140]:
pprint.pprint(window_bed_dict)

{'100kb': <BedTool(/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w100kb.bed)>,
 '10kb': <BedTool(/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w10kb.bed)>,
 '30kb': <BedTool(/home/anjuni/analysis/windows/Pst_104E_v13_ph_ctg_w30kb.bed)>}


## <span style='color:#148aff'> 5. Intersecting methylation with gene annotation files. </span>

## <span style='color:#14c4ff'> 6. Analysing gene expression files. </span>

## <span style='color:#15c66e'> 7. Intersecting with transposons expression files. </span>

## <span style='color:#9ac615'> 8. Comparing methylated transposons and genes. </span>

## <span style='color:#ffa347'> 9. Identifying effector genes. </span>

## <span style='color:#ff4f14'> 10. Expression of methylation machinery throughout Pst life cycle. </span>