#### Summary:
This notebook processes the raw ATAC data and calls peaks (first unfiltered with HOMER, then reproducible with IDR). I'll then use the H3K27ac data to call positive and negative peaks from both control and NASH-treated data. Compared to previous iterations, I'm raising the H3K27ac threshold here to 32 (rather than 16).

Output files:
- strain_treat_annot.txt: original peak annotation file 
- strain_treat_filt_annot.txt: filtered peaks with H3K27ac > 32 and annotated as "intron" or "Intergenic"
- strain_treat_filt_annot_cut.bed: BED file of the filtered peaks, with chrY and chrUn peaks removed
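
As a rough sketch of how a filtered annotation file like the one listed above could be derived: keep peaks whose H3K27ac tag count exceeds a threshold and whose annotation is intronic or intergenic. The helper `filter_peaks`, the column names `Annotation` and `H3K27ac_tags`, and the toy table are assumptions for illustration, not this notebook's actual code. The threshold is passed in as a parameter.

```python
import pandas as pd

def filter_peaks(annot, min_tags, tag_col="H3K27ac_tags", annot_col="Annotation"):
    """Keep intronic/intergenic peaks whose tag count exceeds min_tags.

    Column names are hypothetical; adjust them to the real annotation file.
    """
    # Annotation strings are assumed to start with a keyword such as
    # "intron (...)" or "Intergenic", so match on the leading word only.
    is_distal = annot[annot_col].str.contains(
        r"^(?:intron|intergenic)", case=False, na=False)
    is_acetylated = annot[tag_col] > min_tags
    return annot[is_distal & is_acetylated]

# Toy data, invented for this example
toy = pd.DataFrame({
    "PeakID": ["p1", "p2", "p3", "p4"],
    "Annotation": ["intron (NM_001, intron 1 of 5)", "Intergenic",
                   "promoter-TSS (NM_002)", "Intergenic"],
    "H3K27ac_tags": [40.0, 10.0, 80.0, 33.0],
})
filtered = filter_peaks(toy, min_tags=32)
print(filtered.PeakID.tolist())
```

Matching on the leading keyword rather than the full string keeps the filter robust to the gene-specific detail that follows it in each annotation.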

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import csv
import os
from scipy.stats import pearsonr, spearmanr
from os import listdir
import fnmatch

# Import poised and active enhancers from prior analysis

In [2]:
dataDirectory = '/home/h1bennet/strains/data/ATAC/control_cohort2/'
workingDirectory = '/home/h1bennet/strains_machinelearning/results/00_New_ATAC_H3K27Ac_Model/'
if not os.path.isdir(workingDirectory):
    os.mkdir(workingDirectory)
os.chdir(workingDirectory)


In [3]:
poised_enhancers = pd.read_csv(
    '/home/h1bennet/strains/results/06_Strains_Control_Cohort2_ATAC/poised_enhancers/C57Bl6J_poised_enhancer_peaks.txt',
    sep='\t')

active_enhancers = pd.read_csv(
    '/home/h1bennet/strains/results/06b_Strains_Control_Combined_H3K27Ac/active_enhancers/C57Bl6J_active_enhancer_peaks.txt',
    sep='\t')

# Convert peak file to BED

Bed file format requires:
- chrom - name of the chromosome or scaffold. Any valid seq_region_name can be used, and chromosome names can be given with or without the 'chr' prefix.
- chromStart - Start position of the feature in standard chromosomal coordinates (i.e. first base is 0).
- chromEnd - End position of the feature in standard chromosomal coordinates

Optional columns: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts

from: https://m.ensembl.org/info/website/upload/bed.html
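
For illustration, here is a minimal sketch of turning a peak table into the three required BED columns plus a name. The toy table, its column names, and the helper `to_bed` are hypothetical. The sketch also assumes the input starts are 1-based and shifts them to the 0-based BED convention, which may not apply to every peak file.

```python
import io
import pandas as pd

def to_bed(peaks, one_based_start=True):
    """Return a DataFrame in BED column order: chrom, chromStart, chromEnd, name."""
    bed = peaks.loc[:, ["chr", "start", "end", "PeakID"]].copy()
    if one_based_start:
        # BED starts are 0-based, so shift 1-based starts down by one
        # (assumption about the input; set one_based_start=False otherwise)
        bed["start"] = bed["start"] - 1
    return bed

# Toy peaks, invented for this example
toy_peaks = pd.DataFrame({
    "PeakID": ["peak_a", "peak_b"],
    "chr": ["chr1", "chr2"],
    "start": [101, 5001],
    "end": [300, 5200],
})

bed = to_bed(toy_peaks)

# Write tab-separated text with no header and no index, as BED expects
buffer = io.StringIO()
bed.to_csv(buffer, sep="\t", index=False, header=False)
bed_text = buffer.getvalue()
print(bed_text)
```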

In [51]:
if not os.path.isdir('./bed_files/'):
    os.mkdir('./bed_files/')
    
if not os.path.isdir('./bg_files/'):
    os.mkdir('./bg_files/')

### Control

In [52]:
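# Keep the required BED columns plus PeakID and strand
# (column selection assumed to mirror the active enhancer cell below)
save_cols = ['chr', 'start', 'end', 'PeakID', 'strand']
poised_bed = poised_enhancers.loc[:,save_cols]
poised_bed.columns = poised_bed.columns.str.capitalize()

# remove chrY, chrM, chrUn, and *_random contigs
print(poised_enhancers.shape[0], 'poised enhancers')
chr_bool_filter = ~poised_bed.Chr.str.contains('Y|M|random|Un')
poised_bed = poised_bed.loc[chr_bool_filter]
print(poised_bed.shape[0], 'filtered poised enhancers')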

# write out
poised_bed.to_csv('./bed_files/poised_enhancers.bed',
                  sep='\t',
                  index=False,
                  header=False)

37146 poised enhancers
37141 filtered poised enhancers


In [53]:
# Keep the required BED columns plus PeakID and strand (no score column)
save_cols = ['chr', 'start', 'end', 'PeakID', 'strand']
active_bed = active_enhancers.loc[:,save_cols]
active_bed.columns = active_bed.columns.str.capitalize()

# remove chrY, chrM, chrUn, and *_random contigs
print(active_enhancers.shape[0], 'active enhancers')
chr_bool_filter = ~active_bed.Chr.str.contains('Y|M|random|Un')
active_bed = active_bed.loc[chr_bool_filter]
print(active_bed.shape[0], 'filtered active enhancers')

# write out
active_bed.to_csv('./bed_files/active_enhancers.bed',
                  sep='\t',
                  index=False,
                  header=False)

36151 active enhancers
36151 filtered active enhancers
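
Both filtering cells use the same substring regex, which drops chrM and the `_random` contigs in addition to chrY and chrUn. A small sketch of that filter as a reusable helper, run on a toy table, shows which names survive. The helper name `drop_unplaced_chromosomes` and the toy rows are hypothetical.

```python
import pandas as pd

EXCLUDE_PATTERN = "Y|M|random|Un"  # same regex as in the cells above

def drop_unplaced_chromosomes(bed, chrom_col="Chr", pattern=EXCLUDE_PATTERN):
    """Drop rows whose chromosome name contains any excluded substring."""
    keep = ~bed[chrom_col].str.contains(pattern)
    return bed.loc[keep]

# Toy rows, invented for this example
toy_bed = pd.DataFrame({
    "Chr": ["chr1", "chrX", "chrY", "chrM",
            "chr1_GL456210_random", "chrUn_JH584304"],
    "Start": [100, 200, 300, 400, 500, 600],
    "End": [300, 400, 500, 600, 700, 800],
})

kept = drop_unplaced_chromosomes(toy_bed)
print(kept.Chr.tolist())
```

Because the pattern is matched anywhere in the name, any contig whose name happens to contain "Y", "M", "random", or "Un" is removed, so it is worth checking the dropped names if the genome build changes.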


# Generate background sets

## Random matched-GC content sets

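The input BED file is the set of positives chosen above; the background is matched on GC content only. The script expects a BED file without a header row. The files written above use header=False, so they can be passed in directly; if a file does carry column names, remove the top row in the terminal with sed '1d' before running the script. The column names are the same for all the C57 output files, so nothing is lost by deleting them and overwriting the file.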
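
To make the idea of GC matching concrete, here is a minimal sketch. It is not the code in generate_background_coordinates.py, whose internals are not shown here. It assumes a toy random genome and, for each target window, rejection-samples a random non-overlapping window of the same length whose GC fraction falls within a tolerance of the target's. All function names, parameters, and coordinates are hypothetical.

```python
import random

def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

def sample_gc_matched(genome, targets, length=200, tolerance=0.02,
                      max_tries=10000, seed=0):
    """Draw one GC-matched background window per (start, end) target.

    Falls back to the closest window seen if no draw lands within tolerance.
    """
    rng = random.Random(seed)
    background = []
    for start, end in targets:
        target_gc = gc_fraction(genome[start:end])
        best = None
        for _ in range(max_tries):
            bg_start = rng.randrange(0, len(genome) - length + 1)
            bg_end = bg_start + length
            # Skip candidates that overlap any target window
            if any(bg_start < t_end and t_start < bg_end
                   for t_start, t_end in targets):
                continue
            diff = abs(gc_fraction(genome[bg_start:bg_end]) - target_gc)
            if best is None or diff < best[0]:
                best = (diff, (bg_start, bg_end))
            if diff <= tolerance:
                break
        background.append(best[1])
    return background

# Toy genome and targets, invented for this example
genome_rng = random.Random(1)
genome = "".join(genome_rng.choice("ACGT") for _ in range(50000))
targets = [(1000, 1200), (20000, 20200), (40000, 40200)]

matched = sample_gc_matched(genome, targets)
for (t_start, t_end), (b_start, b_end) in zip(targets, matched):
    print(round(gc_fraction(genome[t_start:t_end]), 3),
          round(gc_fraction(genome[b_start:b_end]), 3))
```

Matching each target individually is the simplest variant; a script that reports per-bin target and background GC, as in the logs below, may instead group targets into GC bins and sample per bin.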

### Poised

In [54]:
%%bash
bed_file="/home/h1bennet/strains_machinelearning/results/00_New_ATAC_H3K27Ac_Model/bed_files/poised_enhancers.bed" #+ peaks, no chrY, no chrUn
output=${bed_file/.bed/bg}
output=${output/bed_files/bg_files}
echo $output
python /home/zes017/Spacing/Codes/generate_background_coordinates.py $bed_file $output

/home/h1bennet/strains_machinelearning/results/00_New_ATAC_H3K27Ac_Model/bg_files/poised_enhancersbg
reading genome mm10
fasta file does not exist: chr20
fasta file does not exist: chr21
fasta file does not exist: chr22
done reading genome
0 0
target GC: 0.3869313593539183 background GC: 0.3857134248005751 target length: 200 numTargetPositions 3715 backgroundPositions 3715
0 0
target GC: 0.43455035002686665 background GC: 0.43279295498806014 target length: 200 numTargetPositions 3714 backgroundPositions 3714
0 0
target GC: 0.46161954765744995 background GC: 0.45883801241497696 target length: 200 numTargetPositions 3714 backgroundPositions 3714
0 0
target GC: 0.48474690360790457 background GC: 0.48054423627681725 target length: 200 numTargetPositions 3714 backgroundPositions 3714
0 0
target GC: 0.5067366720516281 background GC: 0.5006028018228057 target length: 200 numTargetPositions 3714 backgroundPositions 3714
0 0
target GC: 0.531763597199713 background GC: 0.5236472457314232 target 

### Active

In [55]:
%%bash
bed_file="/home/h1bennet/strains_machinelearning/results/00_New_ATAC_H3K27Ac_Model/bed_files/active_enhancers.bed" #+ peaks, no chrY, no chrUn
output=${bed_file/.bed/bg}
output=${output/bed_files/bg_files}
echo $output
python /home/zes017/Spacing/Codes/generate_background_coordinates.py $bed_file $output

/home/h1bennet/strains_machinelearning/results/00_New_ATAC_H3K27Ac_Model/bg_files/active_enhancersbg
reading genome mm10
fasta file does not exist: chr20
fasta file does not exist: chr21
fasta file does not exist: chr22
done reading genome
0 0
target GC: 0.38043694690260227 background GC: 0.37887030555183443 target length: 200 numTargetPositions 3616 backgroundPositions 3616
0 0
target GC: 0.4299349930843112 background GC: 0.4270487121790182 target length: 200 numTargetPositions 3615 backgroundPositions 3615
0 0
target GC: 0.4580912863069906 background GC: 0.45702469671002427 target length: 200 numTargetPositions 3615 backgroundPositions 3615
0 0
target GC: 0.48124343015207727 background GC: 0.47530122554578763 target length: 200 numTargetPositions 3615 backgroundPositions 3615
0 0
target GC: 0.5039100968187408 background GC: 0.49764180480715403 target length: 200 numTargetPositions 3615 backgroundPositions 3615
0 0
target GC: 0.5289474412170776 background GC: 0.5208329032568113 target