Hunter Bennett | Glass Lab | Brain Aging Project | 19 Feb 2021

This script takes a basic look at the quality control statistics of H3K27Ac ChIP-seq input libraries. We mainly look at clonality, total reads, mapping efficiency, and IP efficiency (variable peaks are called for a quick-and-dirty assessment of IP efficiency). The script also generates a UCSC browser hub for visualizing the data, to aid sample selection based on ChIP quality. Input selection is particularly important in this pipeline since we do not have ATAC-seq data for neurons, oligodendrocytes, or astrocytes.

In [1]:
### header ###
__author__ = "Hunter Bennett"
__license__ = "BSD"
__email__ = "hunter.r.bennett@gmail.com"
%load_ext autoreload
%autoreload 2
### imports ###
import sys
%matplotlib inline
import os
import re
import glob
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt 
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 200
sns.set(font_scale=1)
sns.set_context('talk')
sns.set_style('white')

# import custom functions (sys is already imported above)
sys.path.insert(0, '/home/h1bennet/code/')
from hbUtils import ngs_qc, quantile_normalize_df

### Set working directory

In [2]:
dataDirectory = '/data/mm10/Brain_MPSIIIA/ChIP/input/PU1/'
workingDirectory = '/home/h1bennet/brain_aging/results/00_PU1_H3K27Ac_QC/'
# create the results directory (and any missing parents) if it does not exist yet
os.makedirs(workingDirectory, exist_ok=True)
os.chdir(workingDirectory)

# Quality control

In [3]:
qc = ngs_qc(dataDirectory, 'atac')

/data/mm10/Brain_MPSIIIA/ChIP/input/PU1//
./PU1_qc/


In [4]:
qc

Unnamed: 0,uniquePositions,fragmentLengthEstimate,tagsPerBP,clonality,GC_Content,totalReads,uniquelyMappedReads,multiMappedReads,unmappedReads,uniquelyMappedFraction,mappedFraction,frac_unmappedReads_mismatch,frac_unmappedReads_short,frac_unmappedReads_other
00_mouse_BL6_PU1_ChIP_input_10_day_AL_l20200911_GGTCACGA_GTATTATG,9049548.0,80.0,0.005367,1.617,0.394,19009364.0,12887738.0,5558455.0,563171.0,0.677968,0.970374,,,
00_mouse_BL6_PU1_ChIP_input_3_week_AL_l20200911_CTAGCGCT_GTGTAGAC,7562887.0,227.0,0.003992,1.439,0.398,13701493.0,9546347.0,3760275.0,394871.0,0.696738,0.97118,,,
00_mouse_MPSIIIAhet_M_P21_PU1_input_1_AL_20191122_GACGAC,11203222.0,82.0,0.004564,1.111,0.397,15794405.0,11583590.0,3262658.0,948157.0,0.733398,0.939969,,,
01_mouse_BL6_M_8week_PU1_ChIP_input_BL6_466_AL_l20191226_CATGGC,8930038.0,80.0,0.003556,1.086,0.402,13812277.0,9254164.0,3132788.0,1425325.0,0.669996,0.896807,,,
01_mouse_BL6_M_8week_PU1_input_1A_JOS_20190801_CGGAAT,4854858.0,177.0,0.002014,1.131,0.391,7815445.0,5126635.0,2013390.0,675420.0,0.655962,0.913579,,,
01_mouse_BL6_M_8week_PU1_input_1B_JOS_20190801_CGGAAT,4854858.0,177.0,0.002014,1.131,0.391,7815445.0,5126635.0,2013390.0,675420.0,0.655962,0.913579,,,
01_mouse_BL6_M_8week_PU1_input_3_AL_20191226_CATGGC,8930038.0,80.0,0.003556,1.086,0.402,13812277.0,9254164.0,3132788.0,1425325.0,0.669996,0.896807,,,
01_mouse_C57_M_8week_PU1_ChIP_input_438_AL_l20191206_GATCAG,5629596.0,182.0,0.002144,1.038,0.414,7792079.0,5388332.0,1846743.0,557004.0,0.691514,0.928517,,,
02_mouse_MPSIIIAhet_M_4month_PU1_ChIP_input_AL_l20200925_ATCCACTG_AGGTGCGT,11537716.0,188.0,0.004466,1.055,0.421,16748351.0,11154706.0,4179234.0,1414411.0,0.666018,0.915549,,,
03_mouse_MPSIIIAhet_M_P240_PU1_input_2_AL_20191122_CTAGCT,12310597.0,80.0,0.004919,1.089,0.405,17981410.0,12682943.0,3764492.0,1533975.0,0.705336,0.914691,,,
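
The fraction columns in this table appear to follow directly from the read counts (`uniquelyMappedFraction` as `uniquelyMappedReads / totalReads`, and `mappedFraction` as `(uniquelyMappedReads + multiMappedReads) / totalReads`), and some rows carry identical statistics. Below is a minimal sketch of checking both with pandas. It uses three rows copied from the table above under shortened, hypothetical index labels. It is a sanity check under those assumptions, not a description of what `ngs_qc` does internally.

```python
import pandas as pd

# Three rows copied from the QC table above; the short index labels are
# hypothetical stand-ins for the full tag directory names.
qc_subset = pd.DataFrame(
    {
        "totalReads": [19009364.0, 7815445.0, 7815445.0],
        "uniquelyMappedReads": [12887738.0, 5126635.0, 5126635.0],
        "multiMappedReads": [5558455.0, 2013390.0, 2013390.0],
        "uniquelyMappedFraction": [0.677968, 0.655962, 0.655962],
        "mappedFraction": [0.970374, 0.913579, 0.913579],
    },
    index=["10_day", "8week_1A", "8week_1B"],
)

# Recompute both fractions from the raw read counts.
recomputed = pd.DataFrame(
    {
        "uniquelyMappedFraction": qc_subset["uniquelyMappedReads"]
        / qc_subset["totalReads"],
        "mappedFraction": (
            qc_subset["uniquelyMappedReads"] + qc_subset["multiMappedReads"]
        )
        / qc_subset["totalReads"],
    }
)

# Largest disagreement between the recomputed and the reported fractions.
max_abs_diff = (recomputed - qc_subset[recomputed.columns]).abs().max().max()

# Samples whose statistics are identical across every column.
duplicated_rows = qc_subset[qc_subset.duplicated(keep=False)].index.tolist()

print("max abs difference:", max_abs_diff)
print("rows with identical statistics:", duplicated_rows)
```

Rows that match on every statistic may point at the same underlying tag directory under two names, which is worth ruling out before picking an input sample.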


### Plot tag count distribution

In [6]:
# tds = glob.glob(dataDirectory+'/*')
# tds = np.sort(tds)

# fig, axs = plt.subplots(2,3, figsize=(15, 10), sharex=True, sharey=True)

# for ax, td in zip(axs.flatten(), tds):
#     df = pd.read_csv(td+'/tagCountDistribution.txt', sep='\t', index_col=0)
#     df.loc[1:10, :].plot.bar(ax=ax, legend=False)
#     ax.set_xlabel('Tags per position')
#     ax.set_ylabel('Fraction of Positions')
#     ax.set_title(td.split('/')[-1].split('_AL')[0], fontsize=8)

# Make browser hub

Browser hub naming strategy (CapitalizeFirstLetters):  
hrb_project_qc/viz_celltype_ChIPTarget/input
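
The naming pattern above can be sketched as a small helper. `build_hub_name` and its arguments are hypothetical names introduced for illustration, and the `QC`/`Viz` spelling is an assumption based on the CapitalizeFirstLetters rule and the hub name used later in this notebook.

```python
def build_hub_name(project, stage, celltype, target, prefix="hrb"):
    """Join the naming fields with underscores.

    `stage` is assumed to be "QC" or "Viz"; `target` is the ChIP target
    or "Input". All names here are illustrative, not part of hbUtils.
    """
    if stage not in ("QC", "Viz"):
        raise ValueError("stage must be 'QC' or 'Viz', got %r" % stage)
    return "_".join([prefix, project, stage, celltype, target])


hub_name = build_hub_name("BrainAging", "QC", "PU1", "Input")
print(hub_name)  # hrb_BrainAging_QC_PU1_Input
```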

Browser color strategy:  
* QC:
    * Sox9: 99,99,99
    * Olig2: 49,163,84
    * NeuN: 222,45,38
    * PU1: 49,130,189
* Visualize: TBD
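
To keep the QC colors consistent across notebooks, the list above could live in one lookup table. This is a minimal sketch: `QC_COLORS` and `rgb_string` are hypothetical names, and only the RGB values themselves come from the list above.

```python
# QC track colors from the list above, keyed by sorting marker.
QC_COLORS = {
    "Sox9": (99, 99, 99),
    "Olig2": (49, 163, 84),
    "NeuN": (222, 45, 38),
    "PU1": (49, 130, 189),
}


def rgb_string(marker):
    """Return the color for `marker` as an 'R,G,B' string."""
    rgb = QC_COLORS[marker]
    # Guard against typos: every channel must be a valid 8-bit value.
    if not all(0 <= channel <= 255 for channel in rgb):
        raise ValueError("invalid RGB value for %s: %r" % (marker, rgb))
    return ",".join(str(channel) for channel in rgb)


print(rgb_string("PU1"))  # 49,130,189
```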

In [5]:
np.sort(os.listdir(dataDirectory))

array(['00_mouse_BL6_PU1_ChIP_input_10_day_AL_l20200911_GGTCACGA_GTATTATG',
       '00_mouse_BL6_PU1_ChIP_input_3_week_AL_l20200911_CTAGCGCT_GTGTAGAC',
       '00_mouse_MPSIIIAhet_M_P21_PU1_input_1_AL_20191122_GACGAC',
       '01_mouse_BL6_M_8week_PU1_ChIP_input_BL6_466_AL_l20191226_CATGGC',
       '01_mouse_BL6_M_8week_PU1_input_1A_JOS_20190801_CGGAAT',
       '01_mouse_BL6_M_8week_PU1_input_1B_JOS_20190801_CGGAAT',
       '01_mouse_BL6_M_8week_PU1_input_3_AL_20191226_CATGGC',
       '01_mouse_C57_M_8week_PU1_ChIP_input_438_AL_l20191206_GATCAG',
       '02_mouse_MPSIIIAhet_M_4month_PU1_ChIP_input_AL_l20200925_ATCCACTG_AGGTGCGT',
       '03_mouse_MPSIIIAhet_M_P240_PU1_input_2_AL_20191122_CTAGCT',
       '04_mouse_BL6_M_26month_PU1_ChIP_month_AL_l20200911_TCATCCTT_AGCGAGCT',
       '05_mouse_MPSIIIA_M_P21_PU1_input_3A_AL_20191122_TCGAAG',
       '05_mouse_MPSIIIA_M_P21_PU1_input_3B_JOS_20191122_TCGAAG',
       '05_mouse_MPSIIIA_M_P21_PU1_input_3_AL_20191122_TCGAAG',
       '06_mouse_MPS

In [8]:
%%bash
# shell command: the %%bash cell magic is needed to run it from a notebook cell
makeMultiWigHub.pl hrb_BrainAging_QC_PU1_Input mm10 \
-gradient 158,202,225 8,81,156 \
-force -d /data/mm10/Brain_MPSIIIA/ChIP/input/PU1/*
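
The hub command is a shell command, not Python, so it has to go through a shell magic or a subprocess call to run from the notebook. Below is a minimal sketch of the subprocess route. `build_hub_command` is a hypothetical helper, the arguments simply mirror the command in this notebook, and the sketch assumes `makeMultiWigHub.pl` is on the `PATH`. The call is skipped when the tool or the tag directories are missing.

```python
import glob
import shutil
import subprocess


def build_hub_command(hub_name, genome, tag_dir_pattern,
                      gradient=("158,202,225", "8,81,156")):
    """Assemble the argument list for the hub command used in this notebook.

    The wildcard is expanded here with glob because no shell is involved
    when subprocess receives a list of arguments.
    """
    tag_dirs = sorted(glob.glob(tag_dir_pattern))
    return [
        "makeMultiWigHub.pl", hub_name, genome,
        "-gradient", *gradient,
        "-force", "-d", *tag_dirs,
    ]


cmd = build_hub_command(
    "hrb_BrainAging_QC_PU1_Input",
    "mm10",
    "/data/mm10/Brain_MPSIIIA/ChIP/input/PU1/*",
)

# Only launch when the tool is installed and at least one tag directory matched.
if shutil.which(cmd[0]) is not None and cmd[-1] != "-d":
    subprocess.run(cmd, check=True)
else:
    print("Skipping hub creation: tool or tag directories not found")
```

Passing a list rather than a single string avoids shell quoting issues, at the cost of doing the wildcard expansion by hand.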