# Convert ChIP-Seq Signals to a Pattern of Mark States

Given a BedGraph file of "fold-change over control" signals on arbitrary genomic intervals, generate a methylation state for each nucleosome: 0 if neither histone tail is methylated, 1 if one tail is methylated, and 2 if both tails are methylated.

<br>

### Before running this notebook:

- Download the desired BigWig file for an epigenetic mark with "fold signal over control" values from the ENCODE database.
- Move the BigWig file to the directory containing the UCSC sequence tool executables.
- Run the `bigWigToBedGraph` executable from UCSC with `./bigWigToBedGraph <BigWigFileName> <OutputFileName> -chrom=chr##` from the directory containing the executable.
- Move the BedGraph file to the machine containing this notebook.
- Fill in the inputs to this notebook and run all cells.
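
The conversion step above can also be scripted. The following is a minimal sketch, assuming the `bigWigToBedGraph` executable sits in the current directory; the helper name `bigwig_to_bedgraph_cmd` and the file names are hypothetical and only illustrate how the command is assembled.

```python
def bigwig_to_bedgraph_cmd(bigwig_path, bedgraph_path, chrom,
                           exe="./bigWigToBedGraph"):
    """Build the argument list for a single-chromosome BigWig conversion.

    `exe` is an assumed location for the UCSC executable.
    """
    return [exe, bigwig_path, bedgraph_path, f"-chrom={chrom}"]


# Hypothetical file names, used only for illustration
cmd = bigwig_to_bedgraph_cmd("signal.bigWig", "signal.bedGraph", "chr16")
print(" ".join(cmd))

# To actually run the conversion (requires the executable to be present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.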

### Setup

Import necessary modules.

In [1]:
import numpy as np
import bioframe as bf

### Specify Inputs

Conversion from raw ChIP-seq to methylation sequence depends on the following input parameters.

In [2]:
# Fraction of all histone TAILS modified with methylation mark
fraction_methylated = 0.4

# Max num. iters to adjust thresholds to converge on fraction methylated target
max_iters = 1000

# Tolerance around fraction methylated target (compared against the absolute error)
rel_tol = 0.001

# Location of Bed Graph file; should contain "fold-change over control" values
file_path = "/Users/jwakim/Downloads/ENCFF919DOR_H3K27me3_Bedgraph.bed"

# Output file name
out_path = "/Users/jwakim/Downloads/ENCFF919DOR_H3K27me3_methyl.txt"

# Chromosome number for sequence of interest
chromosome = "chr16"

# Bead discretization in units of base pairs
bp_per_nucleosome = 200

### Calibrate Conversion

Signal cutoffs are chosen so that a specified fraction of histone tails is methylated. Initial cutoffs are taken from percentile values of the overall signal distribution and are then adjusted iteratively until the target fraction is reached.

In [3]:
pct_cutoff_width = (2 * fraction_methylated) / 3
pct_cutoffs = np.array([pct_cutoff_width, 2*pct_cutoff_width])

print("Percentile Cutoffs between 0/1 and 1/2 tails methylated:")
print(pct_cutoffs)

Percentile Cutoffs between 0/1 and 1/2 tails methylated:
[0.26666667 0.53333333]
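
One way to read the `2 * fraction_methylated / 3` width: if the top `w` fraction of beads carries two marks and the next `w` fraction carries one mark, the fraction of tails methylated is `(2w + w) / 2 = 3w / 2`, so `w = 2f / 3` for a target `f`. The sketch below checks that arithmetic on synthetic data. The function name `quantile_cutoffs` and the exponential test signal are assumptions, and the cutoffs here are measured from the top of the distribution on a 0-1 quantile scale, which is an illustrative choice rather than a restatement of the cell above.

```python
import numpy as np


def quantile_cutoffs(signal, fraction_methylated):
    """Return (lower, upper) cutoffs so that roughly `fraction_methylated`
    of all tails are marked. Assumes fraction_methylated <= 0.75 so that
    both quantile levels stay within [0, 1]."""
    width = 2 * fraction_methylated / 3
    # Top `width` of beads -> 2 marks; the next `width` of beads -> 1 mark
    lower, upper = np.quantile(signal, [1 - 2 * width, 1 - width])
    return lower, upper


# Synthetic, continuous signal (not real ChIP-seq data)
rng = np.random.default_rng(0)
signal = rng.exponential(size=30000)

lower, upper = quantile_cutoffs(signal, 0.4)
states = (signal >= lower).astype(int) + (signal >= upper).astype(int)
frac = states.sum() / (2 * states.size)
print(f"Fraction of tails methylated: {frac:.4f}")
```

On real data with many tied values (for example, long runs of zeros), the achieved fraction can deviate from the target, which is one reason an iterative adjustment is useful.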


### Redistribute intervals to match desired discretization

The BedGraph file reports signals on intervals of arbitrary width. Redistribute the signals into fixed-width bins, one per nucleosome.

In [4]:
signals = bf.read_table(file_path, schema='bed4')
signals.columns = signals.columns[:-1].tolist() + ['fold_change']
signals.head()

| | chrom | start | end | fold_change |
|---|---|---|---|---|
| 0 | chr16 | 0 | 10279 | 0.0 |
| 1 | chr16 | 10279 | 10489 | 0.78156 |
| 2 | chr16 | 10489 | 10858 | 0.0 |
| 3 | chr16 | 10858 | 11068 | 0.78156 |
| 4 | chr16 | 11068 | 12789 | 0.0 |


In [5]:
# Get the size of the chromosomes
chromsizes = bf.fetch_chromsizes("hg38", as_bed=False)
chromsizes = chromsizes[:22]

# Discretize each chromosome into nucleosome-sized units (bp_per_nucleosome)
bins = bf.binnify(chromsizes, binsize=bp_per_nucleosome)
bins = bins[bins.chrom == chromosome]

# Determine overlap of ChIP-seq signals with evenly-spaced chromosome bins
overlap = bf.overlap(bins, signals, return_overlap="True")

# Scale signals with size of overlap with evenly-spaced chromosome bins
overlap["fold_change_scaled"] = overlap["fold_change_"] / (overlap['True_end']- overlap['True_start'])
overlap["fold_change_scaled"] *= bp_per_nucleosome

# Group signals corresponding to the same region
overlap["fold_change_scaled"] = overlap.groupby(["chrom", "start", "end"])["fold_change_scaled"].transform("sum")
processed_signals = overlap.drop_duplicates(subset=['chrom', 'start', 'end'])
processed_signals = processed_signals[["chrom", "start", "end", "fold_change_scaled"]]

In [6]:
processed_signals.head()

| | chrom | start | end | fold_change_scaled |
|---|---|---|---|---|
| 0 | chr16 | 0 | 200 | 0.0 |
| 1 | chr16 | 200 | 400 | 0.0 |
| 2 | chr16 | 400 | 600 | 0.0 |
| 3 | chr16 | 600 | 800 | 0.0 |
| 4 | chr16 | 800 | 1000 | 0.0 |
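
To make the redistribution idea concrete, here is a small NumPy-only sketch of an overlap-length-weighted average onto fixed-width bins. This is a swapped-in weighting chosen for illustration (each interval contributes its value times the fraction of the bin it covers); it is not a restatement of the scaling used in the cell above. The function name `rebin_weighted` and the toy intervals are hypothetical.

```python
import numpy as np


def rebin_weighted(starts, ends, values, bin_size, chrom_length):
    """Average interval signals onto fixed-width bins, weighting each
    interval by the number of base pairs it overlaps with the bin."""
    n_bins = int(np.ceil(chrom_length / bin_size))
    binned = np.zeros(n_bins)
    for s, e, v in zip(starts, ends, values):
        first_bin = s // bin_size
        last_bin = (e - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            b_start = b * bin_size
            b_end = min(b_start + bin_size, chrom_length)
            overlap_bp = min(e, b_end) - max(s, b_start)
            # Contribution is proportional to the overlapped fraction of the bin
            binned[b] += v * overlap_bp / bin_size
    return binned


# Toy intervals: value 1.0 on [0, 300) and value 3.0 on [300, 600)
binned = rebin_weighted(
    starts=[0, 300], ends=[300, 600], values=[1.0, 3.0],
    bin_size=200, chrom_length=600,
)
print(binned)
```

The middle bin spans both intervals equally, so it receives the mean of the two values. The explicit Python loop is slow for a full chromosome; it is written for readability rather than speed.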


In [7]:
num_beads = len(processed_signals)
fold_change_scaled = processed_signals["fold_change_scaled"].to_numpy().flatten()
avg_fold_change = np.average(fold_change_scaled)
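# Initial guess for the cutoffs, refined iteratively below.
# Note: np.percentile interprets its second argument on a 0-100 scale.
cutoffs = np.percentile(fold_change_scaled, pct_cutoffs)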

for i in range(max_iters):

    if (i+1) % 100 == 0:
        print(f"Cutoffs: {cutoffs}")

    one_mark = np.where((fold_change_scaled >= cutoffs[0]))
    two_marks = np.where((fold_change_scaled >= cutoffs[1]))

    methyl = np.zeros(num_beads, dtype=int)
    methyl[one_mark] = 1
    methyl[two_marks] = 2

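    # Each bead (nucleosome) has two tails, so normalize by 2 * num_beads
    obs_frac_methyl = np.sum(methyl) / (2 * num_beads)
    err = obs_frac_methyl - fraction_methylated

    # Converged when the absolute error is within the tolerance
    if -rel_tol <= err <= rel_tol: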
        print("Convergence Successful!")
        print(f"Num Iters: {i+1}")
        print(f"Cutoffs: {cutoffs}")
        print(f"Fraction Tails Methylated: {obs_frac_methyl}")
        break

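    # Too many tails methylated (err > 0) raises the cutoffs; too few lowers them.
    # The step size grows with the iteration count.
    cutoffs[0] += avg_fold_change * err * (i / max_iters)
    cutoffs[1] += 2 * avg_fold_change * err * (i / max_iters)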

    if i == (max_iters-1):
        print("Failed to converge.")
        print(f"Observed fraction tails methylated: {round(obs_frac_methyl, 3)}")



Cutoffs: [16.209699765431143 32.419399530862286]
Convergence Successful!
Num Iters: 108
Cutoffs: [16.329855360303696 32.65971072060739]
Fraction Tails Methylated: 0.4009535258539004


In [8]:
# Save Output Methylation Sequence
with open(out_path, "w") as f:
    for i in range(num_beads):
        f.write(f"{methyl[i]}\n")
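
As an alternative to the explicit loop, the same one-value-per-line layout can be written and read back with NumPy. This is a sketch using a temporary directory and a hypothetical five-bead sequence, so it does not touch `out_path`.

```python
import os
import tempfile

import numpy as np

# Hypothetical methylation states for five beads
methyl_example = np.array([0, 1, 2, 2, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "methyl_example.txt")
    # fmt="%d" writes one integer per line, matching the loop-based output
    np.savetxt(example_path, methyl_example, fmt="%d")
    # Read the sequence back to confirm the round trip
    reloaded = np.loadtxt(example_path, dtype=int)

print(reloaded)
```

Reading the file back with `np.loadtxt` is a quick way to confirm that a downstream simulation will see the expected number of beads.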

### Summary:

Given ChIP-seq outputs indicating "fold-change over control" signals for various genomic intervals, we redistributed those signals into nucleosome-scale bins and applied cutoffs so that a target fraction of histone tails is methylated. We store the methylation sequence in an output file for use in future simulations.