# ChIP-seq Analysis Pipeline

This pipeline analyzes and visualizes ChIP-seq datasets with a set of downstream tools. It includes optional peak calling with MACS2 (Zhang, Yong, et al., 2008), plots of peak binding around transcription start sites, an interactive genome browser, peak annotation, and enrichment analysis with Enrichr (Kuleshov, Maxim V., et al., 2016) and ChEA3 (Keenan, Alexandra B., et al., 2019).

In [None]:
#%%appyter init
from appyter import magic
magic.init(lambda _=globals: _(), verbose=True)

In [None]:
# Basic libraries
import pandas as pd
import os
import requests, json
import sys
from time import sleep
import time
import numpy as np
import warnings
import re
import shutil
import subprocess

# Visualization
import plotly
from plotly import tools
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
plotly.offline.init_notebook_mode() # To embed plots in the output cell of the notebook

import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib import rcParams
from matplotlib.lines import Line2D
%matplotlib inline

import IPython
from IPython.display import HTML, display, Markdown, IFrame

import chart_studio
import chart_studio.plotly as py


# Data analysis
from itertools import combinations
import scipy.spatial.distance as dist
import scipy.stats as ss
from sklearn.decomposition import PCA
from sklearn.preprocessing import quantile_transform

from rpy2 import robjects
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

# External Code
from utils import *

In [None]:
%%appyter hide_code_exec
{% do SectionField(
    name='Data_Section',
    title='Load your Data',
    subtitle='Load your ChIP-seq dataset and set analysis parameters',
    img='analysis.png'
    
) %}

In [None]:
%%appyter code_exec
{% set treatment_chipseq_filename = FileField(
    name='treatment_chipseq_filename', 
    label='Treatment ChIP-seq file (.bam, .bed, or .narrowpeak)', 
    default='GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed',    
    examples={'GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed': "https://appyters.maayanlab.cloud/storage/ChIPseq/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed"}, section='Data_Section')

%}

{% set background_chipseq_filename = FileField(
    name='background_chipseq_filename', 
    label='(Optional) Background ChIP-seq file (.bam or .bed)', 
    default='',    
    section='Data_Section')

%}

{% set macs = BoolField(
    name='macs', 
    label='Peak calling?', 
    default='false',
    description='Check if you want peak calling analysis (MACS2)', 
    section='Data_Section',
) 
%}

{% set max_genes = IntField(
    name='max_genes', 
    label='Maximum annotated genes from peak calling', 
    min=0, 
    max=10000, 
    default=1000, 
    description='Number of top-scoring annotated genes to submit for enrichment analysis', 
    section='Data_Section')
%}

{% set regionTSS = IntField(
    name='regionTSS', 
    label='TSS region', 
    min=0, 
    max=10000, 
    default=3000, 
    description='Distance in bp upstream and downstream of each TSS that defines the TSS region', 
    section='Data_Section')
%}


In [None]:
%%appyter code_exec
treatment_chipseq_filename = "{{treatment_chipseq_filename.value}}"
background_chipseq_filename = "{{background_chipseq_filename.value}}"

macs = {{macs.value}}
max_genes = {{max_genes.value}}
regionTSS = {{regionTSS.value}}

In [None]:
import random  # not imported in the library cell above; needed for random.seed

warnings.filterwarnings('ignore')
random.seed(0)
pandas2ri.activate()
chart_studio.tools.set_credentials_file(username='mjjeon', api_key='v0rpMa6lhST28Sq7XqtM')
results = {}
table_counter = 1
figure_counter = 1

In [None]:
%%appyter markdown
{% if macs.value == True %}
# Peak Calling using MACS2
Peak calling is a computational method that identifies areas of the genome enriched with aligned reads as a consequence of a ChIP-sequencing experiment. A commonly used tool for identifying transcription factor binding sites is Model-based Analysis of ChIP-seq (MACS) (Zhang, Yong, et al., 2008). The MACS algorithm captures the influence of genome complexity to evaluate the significance of enriched ChIP regions. Although MACS was developed to detect transcription factor binding sites, it is also suited for detecting broad regions. MACS improves the spatial resolution of binding sites by combining the information of both sequencing tag position and orientation. MACS can be run on the ChIP sample alone or together with a control sample, which increases the specificity of the peak calls.
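
The two modes described above (ChIP sample alone, or ChIP sample plus control) differ only in whether a control file is passed with `-c`. The following is a minimal sketch of how such a command line could be assembled; the helper name `build_macs2_command` and the file names `chip.bam` and `input.bam` are hypothetical and used only for illustration.

```python
from typing import List, Optional


def build_macs2_command(treatment: str, control: Optional[str] = None) -> List[str]:
    """Assemble a `macs2 callpeak` argument list (hypothetical helper)."""
    command = ["macs2", "callpeak", "-t", treatment]
    if control:  # None or an empty string means "no control sample"
        command += ["-c", control]
    # --name sets the output file prefix; -B also writes bedGraph pileup tracks
    command += ["--name", treatment, "-B"]
    return command


# Placeholder file names, for illustration only
print(build_macs2_command("chip.bam"))
print(build_macs2_command("chip.bam", "input.bam"))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.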
{% endif %}

In [None]:
if macs:
    # Build the MACS2 command; the control sample (-c) is optional.
    command = ["macs2", "callpeak", "-t", treatment_chipseq_filename]
    if background_chipseq_filename != "":
        command += ["-c", background_chipseq_filename]
    command += ["--name", treatment_chipseq_filename, "-B"]
    # text=True decodes stderr to str, so the log stays readable in the error message.
    result = subprocess.run(command, capture_output=True, text=True)
    error_msg = result.stderr  # MACS2 writes its progress log to stderr
    if "Done!" not in error_msg:
        raise Exception("Error during MACS2 analysis! Please check the input files. See the error message below: \n"+error_msg)
    bed_filename = treatment_chipseq_filename+"_summits.bed"
else:
    bed_filename = treatment_chipseq_filename

# Profile of ChIP Peaks Binding to TSS Regions

In [None]:
%%appyter markdown
A common visualization technique is to obtain a global evaluation of the enrichment around the Transcription Start Site (TSS) (+- {{regionTSS.value}}bp). Here we visualize the input ChIP data as a heatmap and as a profile plot using ChIPseeker (Yu et al., 2015).
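
As a rough illustration of how a profile can be derived from a tag matrix, the sketch below sums each column and scales the result so that the profile adds up to 1. It assumes a matrix with one row per TSS window and one column per base position; the function name `average_profile` and the toy values are hypothetical, and this is not necessarily how ChIPseeker computes its own profile.

```python
from typing import List


def average_profile(tag_matrix: List[List[int]]) -> List[float]:
    """Column sums of a tag matrix, scaled so that the profile sums to 1."""
    if not tag_matrix:
        return []
    column_sums = [sum(column) for column in zip(*tag_matrix)]
    total = sum(column_sums)
    if total == 0:  # no signal anywhere: return a flat zero profile
        return [0.0] * len(column_sums)
    return [value / total for value in column_sums]


# Toy matrix: 3 TSS windows x 5 positions (-2 to +2 bp around the TSS)
toy_matrix = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
positions = list(range(-2, 3))
profile = average_profile(toy_matrix)
for position, frequency in zip(positions, profile):
    print(position, round(frequency, 3))
```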

In [None]:
robjects.r('''tag_matrix <- function(inputfilename, outputfilename, minTSS=-3000, maxTSS=3000) {
    
        # Load packages
        suppressMessages(require(ChIPseeker))
        suppressMessages(require(TxDb.Hsapiens.UCSC.hg19.knownGene))
        suppressMessages(require(clusterProfiler))
        
        txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
        peak <- readPeakFile(inputfilename)
        # upstream is a positive distance, so derive it from minTSS; downstream uses maxTSS
        promoter <- getPromoters(TxDb=txdb, upstream=abs(minTSS), downstream=maxTSS)
        tagMatrix <- getTagMatrix(peak, windows=promoter)
        
        # save
        write.table(as.data.frame(tagMatrix), outputfilename, sep=",")
        return (tagMatrix)
        }''')

In [None]:
chipseeker = robjects.r['tag_matrix']
chipseeker(bed_filename, bed_filename+"_tag_matrix_output.csv", -regionTSS, regionTSS)
peakAnno = pd.read_csv(bed_filename+"_tag_matrix_output.csv", index_col=0)
peakAnno = (peakAnno  # Order rows (TSS windows) by their total signal.
        .assign(sum=peakAnno.sum(axis=1))  # Add temporary 'sum' column to sum rows.
        .sort_values(by='sum', ascending=False)  # Sort by row sum descending order.
        .iloc[:, :-1])  # Remove temporary `sum` column.
peakAnno.columns = [*range(-regionTSS, regionTSS+1, 1)] 

In [None]:
f, ax = plt.subplots(figsize=(5, 7))
ax = sns.heatmap(peakAnno, yticklabels=False, xticklabels=regionTSS, cmap='Reds', cbar=False)
plt.xlabel("Distance (bp)")
plt.ylabel("Peaks")
plt.show()
figure_counter = display_object(figure_counter, "Profile of ChIP peaks binding to TSS regions", istable=False)

In [None]:
fig = px.line(peakAnno.sum(), title="Average Profile of ChIP peaks binding to TSS region", labels={
    "index": "Genomic Region",
    "value": "Read Count Frequency"
})
fig.update_layout(showlegend=False)
fig.show()
figure_counter = display_object(figure_counter, "Average Profile of ChIP peaks binding to TSS region", istable=False)

# Genome Browser Visualization

An embedded IGV-based genome browser (Robinson, James T., et al., 2020) shows the peak locations across the whole genome. Use it to explore the coverage of peak regions on every chromosome and to generate figures that visualize the peaks.

In [None]:
%%appyter code_eval
from IPython.display import IFrame
shutil.copyfile(bed_filename, "./peaks.bed")
IFrame(src="{{ url_for('static', filename='test.html') }}#{{ url_for(_session, filename='peaks.bed', public=True) }}", width=800, height=600)

# Peak Annotation Analysis

In [None]:
%%appyter markdown
Peak annotation is performed by ChIPseeker (Yu et al., 2015), which assigns each peak to its nearest gene and to a genomic feature, for example an intron, an exon, or another region. The transcription start site (TSS) region is user-defined; in this analysis it spans from -{{regionTSS.value}}bp to +{{regionTSS.value}}bp.
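
To make the idea of a TSS region concrete, the sketch below computes a strand-aware distance from a peak position to a TSS and checks whether it falls inside the region. The function names `distance_to_tss` and `in_tss_region` and the coordinates are hypothetical; ChIPseeker's own annotation logic is more involved than this.

```python
def distance_to_tss(peak_pos: int, tss: int, strand: str) -> int:
    """Signed distance in bp: negative is upstream, positive is downstream."""
    if strand == "+":
        return peak_pos - tss
    # On the minus strand, larger coordinates are upstream of the TSS
    return tss - peak_pos


def in_tss_region(peak_pos: int, tss: int, strand: str, region: int = 3000) -> bool:
    """Return True if the peak lies within +/- `region` bp of the TSS."""
    return -region <= distance_to_tss(peak_pos, tss, strand) <= region


# Toy coordinates, for illustration only
print(distance_to_tss(9000, 10000, "+"))   # upstream of a plus-strand TSS
print(in_tss_region(9000, 10000, "+"))     # inside the default 3000 bp region
print(in_tss_region(15000, 10000, "-"))    # 5000 bp upstream on the minus strand
```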

In [None]:
robjects.r('''chipseeker <- function(inputfilename, outputfilename, minTSS=-3000, maxTSS=3000) {
    
        # Load packages
        suppressMessages(require(ChIPseeker))
        suppressMessages(require(TxDb.Hsapiens.UCSC.hg19.knownGene))
        suppressMessages(require(clusterProfiler))
        
        txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
        
        # Peak Annotation
        peakAnno <- annotatePeak(inputfilename, tssRegion=c(minTSS, maxTSS), TxDb=txdb, annoDb="org.Hs.eg.db", verbose=FALSE)
        
        
        # save
        write.table(as.data.frame(peakAnno), outputfilename, sep=",")
        
        }''')

In [None]:
chipseeker = robjects.r['chipseeker']
chipseeker(bed_filename, bed_filename+"_peak_annotation_output.csv", -regionTSS, regionTSS)
peakAnno = pd.read_csv(bed_filename+"_peak_annotation_output.csv")
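# V5 is the score column of the input BED file; sum only that column so text columns are not aggregated.
sorted_peakAnno_groupby_gene = peakAnno.groupby("SYMBOL")["V5"].sum().sort_values(ascending=False)
top_genes = sorted_peakAnno_groupby_gene.iloc[:max_genes].index.tolist()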

In [None]:
display(peakAnno.sort_values("V5", ascending=False))
table_counter = display_object(table_counter, "Peak Annotation Result", istable=True)
display(create_download_link(peakAnno, filename="Peak_Annotation_Result.csv"))

# Visualization of the Genomic Annotations of Peaks

A pie chart is provided to visualize the genomic annotation. Each peak is assigned to one genomic annotation category: TSS, Exon, 5’ UTR, 3’ UTR, Intronic, or Intergenic.
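
Annotation labels for introns and exons carry transcript-specific detail, so they need to be collapsed before counting categories. The sketch below shows one way to do that with a regular expression and `collections.Counter`; the helper name `collapse_annotation` and the example strings are illustrative, not taken from a real annotation table.

```python
import re
from collections import Counter


def collapse_annotation(annotation: str) -> str:
    """Strip transcript-specific detail from Intron and Exon labels."""
    return re.sub(r"(Intron|Exon)[^\n]+", r"\1", annotation)


# Illustrative strings only; real values come from the `annotation` column
examples = [
    "Intron (uc001abc.1/1234, intron 2 of 5)",
    "Exon (uc001abc.1/1234, exon 3 of 5)",
    "Promoter (<=1kb)",
    "Distal Intergenic",
    "Intron (uc002xyz.2/5678, intron 1 of 3)",
]
counts = Counter(collapse_annotation(label) for label in examples)
print(counts.most_common())
```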

In [None]:
def pie_plot(data):    
    fig = px.pie(data, values='count', names=data.index)
    fig.show()

In [None]:
# Collapse transcript-specific labels such as "Intron (..., intron 1 of 2)" to "Intron", and likewise for "Exon".
peakAnno["count"] = [re.sub(r'(Intron|Exon)[^\n]+', r'\1', x) for x in peakAnno["annotation"]]
pie_plot(peakAnno["count"].value_counts())
figure_counter = display_object(figure_counter, "Genomic Annotation of Peaks in Pie Plot", istable=False)

# Enrichment Analysis with Enrichr

Enrichment analysis is a statistical procedure used to identify biological terms that are over-represented in a given gene set. These include signaling pathways, molecular functions, diseases, and a wide variety of other biological terms obtained by integrating prior knowledge of gene function from multiple resources. Enrichr (Kuleshov et al. 2016) is a web-based application that performs enrichment analysis using a large collection of gene-set libraries. Enrichr provides various interactive approaches to display the enrichment results. The pipeline merges the peak annotation results at the gene level and selects the top-ranked genes by their summed scores. This top-ranked gene set is submitted to Enrichr for analysis.
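
The gene selection step can be sketched without any dependencies: sum the peak scores per gene, rank the genes by that total, and keep the first `max_genes`. The function name `top_genes_by_score` and the toy peaks below are hypothetical; the notebook itself performs the equivalent step with a pandas `groupby`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_genes_by_score(peaks: Iterable[Tuple[str, float]], max_genes: int) -> List[str]:
    """Sum peak scores per gene symbol and return the highest-scoring genes."""
    totals: Dict[str, float] = defaultdict(float)
    for symbol, score in peaks:
        totals[symbol] += score
    ranked = sorted(totals, key=lambda gene: totals[gene], reverse=True)
    return ranked[:max_genes]


# Toy (gene symbol, peak score) pairs, for illustration only
toy_peaks = [("GATA1", 50.0), ("TAL1", 30.0), ("GATA1", 20.0), ("KLF1", 45.0)]
print(top_genes_by_score(toy_peaks, max_genes=2))
```

Summing rather than taking the maximum means that a gene with several moderate peaks can outrank a gene with one strong peak.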

In [None]:
results = run_enrichr(geneset=top_genes, signature_label="The annotated genes")
result = results["result"]
display(Markdown("*Enrichment Analysis Result*"))
display_link("https://amp.pharm.mssm.edu/Enrichr/enrich?dataset={}".format(result["shortId"]))
        

# Enrichment Analysis with ChEA3

ChEA3 is a web-based transcription factor (TF) enrichment analysis tool that integrates transcription factor/target knowledge from multiple sources (Keenan, Alexandra B., et al., 2019). ChEA3 can aid in identifying the TFs responsible for regulating the expression of a collection of target genes.

In [None]:
chea3_result = run_chea3(top_genes, "chea3")

# display result tables
for key, item in chea3_result.items():
    df = pd.DataFrame(item).drop(["Query Name"], axis=1)
    display_result_table(df, key, table_counter)

# References

Keenan, Alexandra B., et al. "ChEA3: transcription factor enrichment analysis by orthogonal omics integration." Nucleic acids research 47.W1 (2019): W212-W224.
<br>
Kuleshov, Maxim V., et al. "Enrichr: a comprehensive gene set enrichment analysis web server 2016 update." Nucleic acids research 44.W1 (2016): W90-W97.
<br>
Robinson, James T., et al. "igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV)." bioRxiv (2020).
<br>
Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. "ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization." Bioinformatics 31.14 (2015): 2382-2383.
<br>
Zhang, Yong, et al. "Model-based analysis of ChIP-Seq (MACS)." Genome biology 9.9 (2008): 1-9.
<br>