Process_CLIP_data

A collection of R scripts used to process iCLIP or other types of CLIP data

Requirements and installation

Installation of the required software and packages should take < 30 min.

Depends on R and RStudio (optional), freely available to install on Windows, macOS or Linux here: https://cran.r-project.org/index.html https://www.rstudio.com/products/rstudio/

The code has been tested with R >= v4.0.0 and <= v4.1.0, and corresponding package versions. It should also be compatible with R v3.5 onwards although this is less thoroughly tested.

To run all the scripts below, packages will need to be installed as follows (see details of which are required for each script in the corresponding sections):

install.packages("tidyverse") install.packages("parallel") install.packages("BiocManager") BiocManager::install("GenomicRanges") BiocManager::install("rtracklayer") BiocManager::install("GenomicFeatures") BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")

Instructions and test data

Using the provided test data, each script should take < 2 min to run on a standard desktop computer.

iCLIP_Genialis_feature_annotation.R

This script takes data that has been through the iCount pipeline (v2.0.1dev; https://icount.readthedocs.io/en/latest/); our data was processed through iMaps on the Genialis server (https://imaps.genialis.com/iclip). BED files of crosslink sites or clusters can all be annotated using the script, as well as tab-delimited text files containing crosslink sites, such as those generated by iCount peak calling that include all crosslink sites, together with the score and FDR (see info at the start of the script for restrictions on column names).

It requires the tidyverse, rtracklayer, GenomicRanges and GenomicFeatures packages, as well as BiocManager for installation of the latter three.

Whilst some of the output files from iMaps include the genes to which each crosslink site maps, the features are not annotated. This script reannotates either the BED files containing only significant crosslink sites, or the tsv files containing all crosslink sites (with FDR annotated), together with a GTF file, and assigns the most likely gene/feature bound at that site. This is done in a hierarchical way - full details provided in annotation of the script.

The script gives the flexibility to prioritise either 3'UTR (default) or CDS (--prioritise_CDS flag) in the hierarchy.

It also gives the option of running in a more detailed mode (--detailed), which will identify and report all possible genes and features to which the crosslink site could be assigned - note this will take a lot longer to run!

The script can be tested using the Mouse_GRCm38.90_chr17_100kb.gtf and three iCLIP...chr17_100kb.txt files as input; these files include data covering a 100kb region of chromosome 17 (including genes such as Tnf, which is expected to be bound in the datasets provided) for replicate iCLIP datasets. The test can be run with default parameters by setting the working directory to test_data and replacing the line:

Args <- commandArgs(trailingOnly=TRUE)

with:

Args <- c(list.files(pattern = "\\.gtf$"), list.files(pattern = "100kb\\.txt$"))

The expected outputs can be found in the iCLIP...chr17_100kb_annotated.tsv files.

Combine_replicate_datasets.R

This script takes as input a list of replicate datasets for crosslink sites, and outputs the crosslink sites that identified as significant in at least a user-specified number of replicates.

It requires the tidyverse package.

If an 'FDR' column is present in the data, it will be filtered to include only sites with FDR <= 0.05 (unless this value is amended by the user), otherwise, all sites in the dataset are assumed to be significant.

The input data is assumed to come from the iCLIP_Genialis_feature_annotation.R script, and so the column names must match the output from this.

The 'start' and 'end' columns are summarised into a single 'position' column (since they are always identical for crosslink sites), and sum and mean of the score column are calculated (although only including replicates for which the site was significant).

The filtered data including only the sites that are significant in at least the specified number of replicates is saved out as a tab-delimited text file with the user-specified name.

As written, the inputs specified are to take the three iCLIP...chr17_100kb_annotated.tsv files (in test_data) as input (with working directory set to test_data), and report only sites that are significant (FDR <= 0.05) in at least 2 replicates. The expected output file (iCLIP_combined_replicates.txt) is provided.

kmer_analysis.R

This script takes a gtf file together with annotated iCLIP datasets (that have been through iCLIP_Genialis_feature_annotation.R above), and calculates z-scores for kmers of length specified by the user. Ideally, crosslink sites should be limited to only those that are significant, or can be the output from Combine_replicate_datasets.R.

It requires the tidyverse, parallel, rtracklayer, BSgenome.Mmusculus.UCSC.mm10, GenomicRanges and GenomicFeatures packages, as well as BiocManager for installation of the Bioconductor packages.

Z-scores are calculated for the whole gene and for each feature type (3'UTR, CDS, 5'UTR and intron). Sites mapping to non-coding genes or intergenic regions are excluded. By default all sites are analysed, but subsampling to ensure either that an equal number of sites are analysed between different datasets, or between different features within the same datasets, can be done - see details in the script annotation.

The value of k is provided by the user (default 5).

For full datasets, it is recommended to run with parallel processing on a computer cluster. However, the script can be tested on the iCLIP_combined_replicates.txt file, with Mouse_GRCm38.90_chr17_100kb.gtf, by setting the working directory to test_data and replacing the line:

Args <- commandArgs(trailingOnly=TRUE)

with:

Args <- c("iCLIP_combined_replicates.txt", "Mouse_GRCm38.90_chr17_100kb.gtf")

The expected output when run with default parameters (k = 5 and no subsampling) can be found in the iCLIP_combined_replicates_5mer_zscores.txt file. Note that due to the sparsity of the data, this test results in relatively low z scores that are not very robust, and many NA values.

Running on our CLIP data

Fastq files for the CLIP data generated and/or analysed in our manuscript (https://www.biorxiv.org/content/10.1101/2021.06.03.446738v2) are available via the GEO accessions GSE176313 (ZFP36L1 CLIP) and GSE96074 (ZFP36 CLIP).

Before they can be run through these scripts, significant crosslink sites should first be identified using the iCount pipeline (documentation and instructions: https://icount.readthedocs.io/en/latest/).

Our analysis was performed by first using the tab-delimited text files that contain all crosslink sites, generated as an output by iCount peaks (using the parameter --scores) as input to iCLIP_Genialis_feature_annotation.R.

Alternatively, BED files of only significant crosslink sites from iCount peaks, or clusters from iCount clusters can also be used as input.

These can be submitted either by adding an appropriate list.files pattern to detect these files or listing them individually when defining Args to run the iCLIP_Genialis_feature_annotation.R script locally, or providing the file names as trailing arguments when submitting the script via the linux command line (as described in the script).

In addition, the sample GTF file should be replaced by a complete GTF file, matching the genome build (though unnecessary to match annotation release) used by iCount. Our data was annotated using the Ensembl GRCm38 annotation release 90, available here: http://ftp.ensembl.org/pub/release-90/gtf/mus_musculus/Mus_musculus.GRCm38.90.gtf.gz

Combine_replicate_datasets.R should then be run on the output annotated files from iCLIP_Genialis_feature_annotation.R, separately for each group of replicates, with the file names for the annotated replicate data provided as a list, and an appropriate output filename provided. For our analysis, min_Replicates was amended as follows:

CD8 T cell ZFP36L1 CLIP (GSE176313): min_Replicates <- 2 CD4 T cell 4h ZFP36 CLIP (GSE96074): min_Replicates <- 3 CD4 T cell 72h ZFP36 CLIP (GSE96074): min_Replicates <- 3

The kmer_analysis.R script was not run as part of this publication, but could be run on either the individual replicate outputs from iCLIP_Genialis_feature_annotation.R, or on the combined replicate data, using the same GTF file as used for the annotation.

Release v1.0 DOI:

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
test_data		test_data
.gitignore		.gitignore
Combine_replicate_datasets.R		Combine_replicate_datasets.R
LICENSE		LICENSE
Process_CLIP_data.Rproj		Process_CLIP_data.Rproj
README.md		README.md
iCLIP_Genialis_feature_annotation.R		iCLIP_Genialis_feature_annotation.R
kmer_analysis.R		kmer_analysis.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_data

test_data

.gitignore

.gitignore

Combine_replicate_datasets.R

Combine_replicate_datasets.R

LICENSE

LICENSE