# BST281 Final Project Pipeline

Group 2  
Dongyuan Song, Siquan Wang, Xutao Wang, Linying Zhang

## Set Up
Import packages; set working direcotries.

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42
rcParams['font.sans-serif'] = 'Arial'
import warnings
warnings.filterwarnings("ignore")
import urllib3
urllib3.disable_warnings()
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from IPython.display import FileLinks

In [3]:
current_path = os.getcwd()
print(current_path)

C:\Users\songdongyuan\group02_final_project_packet


Set working directory. Default is this package folder.

In [4]:
os.chdir(current_path)

Enable using R in Jupyter notebook.

In [5]:
%load_ext rpy2.ipython

## RNA-seq analysis

In [6]:
expr_df = pd.read_csv("expressionFile_counts_MM.csv")

In [7]:
expr_df = expr_df.set_index(expr_df.columns[0])
expr_df.head()

Unnamed: 0_level_0,..NM89_RPMI_salmon.quant.sf,..NM90_RPMI_HS5_salmon.quant.sf,..NM91_MM1S_salmon.quant.sf,..NM92_MM1S_HS5_salmon.quant.sf,..NM95_KMS11_salmon.quant.sf,..NM96_KMS11_HS5_salmon.quant.sf
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5_8S_rRNA,0.0,0.0,0.0,0.0,0.0,0.0
5S_rRNA,6.45945,21.44398,10.03,10.11391,0.0,1.01
7SK,3.03,3.26734,0.78,0.0,0.0,2.78045
A1BG,980.97371,1196.1893,38.39037,79.9608,4.6805,20.19474
A1BG-AS1,944.947,1099.25405,3.76547,21.01,1.84924,2.56537


### Quality Control
Filter out none or low expressed genes.

In [8]:
print(expr_df.shape)

(58671, 6)


Here we only keep genes which counts are larger than 1 in each samples.

In [9]:
mask_low_vals = (expr_df > 0).sum(axis=1) == 6
expr_df = expr_df.loc[mask_low_vals, :]
print(expr_df.shape)

(22366, 6)


Save the result in working directory.

In [10]:
expr_df.to_csv('filtered.tsv',sep='\t')

### Normalization and Differential Expression Analysis

This step was finished in R. Use Bioconductor Package *edgeR*, *limma* and *DEseq2*.

In [17]:
!Rscript RNA_seq.R

[1] 151367130 149539132  41809527 107076869 118426651  40173989

out of 22366 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 1, 0.0045% 
LFC < 0 (down)   : 156, 0.7% 
outliers [1]     : 604, 2.7% 
low counts [2]   : 0, 0% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results




载入程辑包：'gplots'

The following object is masked from 'package:stats':

    lowess

载入需要的程辑包：limma
载入需要的程辑包：methods
载入需要的程辑包：AnnotationDbi
载入需要的程辑包：stats4
载入需要的程辑包：BiocGenerics
载入需要的程辑包：parallel

载入程辑包：'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from 'package:limma':

    plotMA

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, 

## Mint-ChIP analysis

### Quality Control

The input Mint-ChIP files are BAM file already. Use **fastqc** to do quality control.

In [8]:
%%bash
sbatch fastqc.sh

#!/bin/bash
#SBATCH -p general
#SBATCH -J fastqc
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -t 0-10:00
#SBATCH --mem 8000
#SBATCH -o fastqc.out
#SBATCH -e fastqc.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=dsong@hsph.harvard.edu

cd /n/home08/songdongyuan/BST281/

source new-modules.sh
module load fastqc/0.11.5-fasrc01

fastqc -o ~/BST281/fastqc_output -t 16 
~/BST281/chip/Alignment_Post_Processing_15005.bam 
~/BST281/chip/Alignment_Post_Processing_15009.bam 
~/BST281/chip/Alignment_Post_Processing_15022.bam 
~/BST281/chip/Alignment_Post_Processing_15175.bam 
~/BST281/chip/Alignment_Post_Processing_15180.bam 
~/BST281/chip/Alignment_Post_Processing_15193.bam 
~/BST281/chip/Alignment_Post_Processing_15223.bam 
~/BST281/chip/Alignment_Post_Processing_15280.bam

-bash: line 13: cd: /n/home08/songdongyuan/BST281/: No such file or directory
-bash: line 15: new-modules.sh: No such file or directory
-bash: line 16: module: command not found
-bash: line 18: fastqc: command not found
-bash: line 19: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15005.bam: No such file or directory
-bash: line 20: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15009.bam: No such file or directory
-bash: line 21: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15022.bam: No such file or directory
-bash: line 22: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15175.bam: No such file or directory
-bash: line 23: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15180.bam: No such file or directory
-bash: line 24: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15193.bam: No such file or directory
-bash: line 25: /home/songdongyuan/BST281/chip/Alignment_Post_Processing_15223.bam: No such file or directory
-bash: lin

Show the fastqc reports.

In [15]:
FileLinks(os.path.join('./fastqc_output'), included_suffixes=['.html'])

The reports show that the quality is fine. Use the BAM file for next step.

### Peak Calling

Use MACS2 do peak calling. Notice some parameters: file type is BAMPE, q = 0.01.

In [17]:
%%bash
sbatch MACS2.sh

#!/bin/bash
#SBATCH -p general
#SBATCH -J macs2
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -t 0-10:00
#SBATCH --mem 8000
#SBATCH -o macs2.out
#SBATCH -e macs2.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=dsong@hsph.harvard.edu

cd /n/home08/songdongyuan/BST281/

source new-modules.sh
module load macs2/2.1.1.20160309-fasrc01

macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15005.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15005 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15009.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15009 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15022.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15022 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15175.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15175 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15180.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15180 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15193.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15193 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15223.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15223 -q 0.01
macs2 callpeak -t ~/BST281/chip/Alignment_Post_Processing_15280.bam --outdir ~/BST281/macs2_output -f BAMPE -g hs -n 15280 -q 0.01

-bash: line 1: sbatch: command not found
-bash: line 15: cd: /n/home08/songdongyuan/BST281/: No such file or directory
-bash: line 17: new-modules.sh: No such file or directory
-bash: line 18: module: command not found
-bash: line 20: macs2: command not found
-bash: line 21: macs2: command not found
-bash: line 22: macs2: command not found
-bash: line 23: macs2: command not found
-bash: line 24: macs2: command not found
-bash: line 25: macs2: command not found
-bash: line 26: macs2: command not found
-bash: line 27: macs2: command not found


In [18]:
FileLinks(os.path.join('./macs2_output'), included_suffixes=['.xls'])

### Differential Binding Analysis