INTRODUCTION
================================================================================

Affymetrix microarrays are designed using a reference genome from human, mouse, 
or another organism. Single Nucleotide Polymorphisms (SNPs) in the genomes of 
individual humans and model organisms that differ from the reference sequence 
alter the binding of cDNA to microarray probes, and therefore distort the 
results of gene expression experiments in organisms with distinct genomes. 
Genetic experiments performed using these arrays are subject to systematic bias 
unless the effect of these SNPs is accounted for.

The software package equalizer drastically reduces this problem for the commonly 
used Affymetrix IVT and Gene ST platforms by using whole-genome sequence data to 
remove probes which overlap SNPs. The customized annotation files generated by 
equalizer can be used for normalization in oligo or other packages. 

REQUIREMENTS
================================================================================

1) Python (I have tested versions 2.7.3 and 3.4). 
   Earlier versions of Python may not work. 
   Python is available at http://python.org.
   
2) bedtools (I have used 2.15.0 and 2.17.0) 
   bedtools is available at https://github.com/arq5x/bedtools2

3) A VCF file for each genome you wish to scan for SNPs that overlap 
   probes. Generate this yourself or download a VCF file from your favorite 
   genome provider. A sample VCF file is provided in the example bundle.

4) Affymetrix microarray platform description files. I have made tarballs of 
   several commonly used annotation files available on my personal website, 
   http://davidquigley.com/equalizer.html 

   These files are available individually from Affymetrix's NetAffx website. 
   For IVT arrays, where ARRNAME is your array platform:
    a) ARRNAME.annot.csv
    b) ARRNAME.cdf
    c) ARRNAME.probe_tab
    d) ARRNAME.bed
    e) A CEL file in the same format as ARRNAME
         This is any CEL file from the ARRNAME platform
    
   For ST arrays, where ARRNAME is your array platform:
    a) ARRNAME.probeset.csv
    b) ARRNAME.transcript.csv
    c) ARRNAME.pgf
    d) ARRNAME.mps
    e) ARRNAME.clf
    f) ARRNAME.bed
    
5) R (at the time of writing, I used 3.1.1). 
   R is freely available at http://r-project.org/
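
The file lists in item 4 can be checked before running equalizer. The following 
is a minimal sketch, not part of equalizer: the function name and the 
prefix/suffix matching rule are assumptions made for illustration. Matching on 
prefix and suffix is used because downloaded files may carry extra tags between 
the array name and the extension (for example MoGene-2_0-st-v1.mm10.bed). The 
CEL file needed for IVT arrays is not checked, since it can have any name.

```python
import os

# File suffixes taken from the IVT and ST lists in item 4 of this README.
IVT_SUFFIXES = [".annot.csv", ".cdf", ".probe_tab", ".bed"]
ST_SUFFIXES = [".probeset.csv", ".transcript.csv", ".pgf", ".mps", ".clf", ".bed"]


def missing_annotation_files(directory, arrname, platform):
    """Return the suffixes for which no matching file exists in `directory`.

    Hypothetical helper, not part of equalizer. A file matches when its name
    starts with `arrname` and ends with the suffix. `platform` is "IVT" or "ST".
    """
    if platform == "IVT":
        suffixes = IVT_SUFFIXES
    elif platform == "ST":
        suffixes = ST_SUFFIXES
    else:
        raise ValueError("platform must be 'IVT' or 'ST'")
    names = os.listdir(directory)
    missing = []
    for suffix in suffixes:
        found = any(n.startswith(arrname) and n.endswith(suffix) for n in names)
        if not found:
            missing.append(suffix)
    return missing
```

For example, missing_annotation_files(".", "MoGene-2_0-st-v1", "ST") returns an 
empty list when every ST file is present in the current directory.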


SELF-CONTAINED EXAMPLES
================================================================================

These examples require R, Python, and bedtools. I have verified they work on a 
fresh installation of Ubuntu Linux that contains recent installations of R and 
bedtools. I have run this code on a t2.large instance on Amazon AWS; the 
t2.medium instance has insufficient RAM to run the full example below. 
Start an instance and install bedtools and R with the following commands:

# INSTALL BEGINS
sudo apt-get update
sudo apt-get install r-base -y
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
tar -xzf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0
sudo make 
cd ..
# INSTALL ENDS

RUN A MINIMAL EXAMPLE:
--------------------------------------------------------------------------------

A minimal self-contained example that uses a small VCF file with SNPs in a 
single gene (Cdc26) to rewrite a Mouse Gene ST probe description fileset is 
available at:
http://davidquigley.com/software/equalizer/equalizer_minimal_example.tar.gz

# EXAMPLE BEGINS
wget http://davidquigley.com/software/equalizer/equalizer_minimal_example.tar.gz
tar -xzf equalizer_minimal_example.tar.gz
cd equalizer_minimal
#  ****************************************************************
#  * IMPORTANT: Before continuing, set the value of BEDTOOLS_PATH *
#  ****************************************************************
BEDTOOLS_PATH='/home/ubuntu/bedtools-2.17.0/bin'
sudo Rscript equalizer_minimal_example.R $BEDTOOLS_PATH 
# EXAMPLE ENDS

The last two lines of output from this example should be:
> [1] "Original number of probes for Cdc26: 12"
> [1] "Probes remaining after equalizer (should be 3): 3"


RUN AN EXAMPLE WITH EQTL ANALYSIS:
--------------------------------------------------------------------------------

A self-contained example that uses a small VCF file with SNPs in a single gene 
(Cdc26) to rewrite a Mouse Gene ST probe description fileset and calculate eQTLs 
is available at:
http://davidquigley.com/software/equalizer/equalizer_example.tar.gz

*NOTE* This file is very large (500 MB) because it contains a set of Affymetrix 
CEL files published in Sjolund et al. (PNAS 2014). These files are archived at:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46077

# EXAMPLE BEGINS
wget http://davidquigley.com/software/equalizer/equalizer_example.tar.gz
tar -xzf equalizer_example.tar.gz
cd equalizer_full
#  ****************************************************************
#  * IMPORTANT: Before continuing, set the value of BEDTOOLS_PATH *
#  ****************************************************************
BEDTOOLS_PATH='/home/ubuntu/bedtools-2.17.0/bin'
sudo Rscript equalizer_example.R $BEDTOOLS_PATH 
sh eqtl.sh
# EXAMPLE ENDS



RUNNING NOTES
================================================================================

********************************************
IMPORTANT: IF YOU ARE NOT AN ADMINISTRATOR
********************************************

If you do not have superuser rights on your machine, and wish to generate the R 
package, you will have to modify the create_package.R script created by 
equalizer to install the package locally in a part of the hard drive where you 
have write access, rather than in a location available to the entire system. You 
can also ask an administrator to install the source package at any time.

********************************************
WHAT TRACK OF THE BED FILE SHOULD I USE?
********************************************

The BED file from Affymetrix contains four tracks:

Affymetrix ARRAYNAME_exon probeset(transcriptID_probesetID)
Affymetrix ARRAYNAME_gene level exon(transcriptID)
Affymetrix ARRAYNAME_gene probeset(transcriptID)
Affymetrix ARRAYNAME_probe(transcriptID_probesetID)

The purpose of the BED file in this context is to obtain the mapped positions of each 
individual probe. Using MoGene-2_0-st-v1.mm10.bed as an example, if you call:
grep 17210850_17210851 MoGene-2_0-st-v1.mm10.bed

you'll see:
chr1	3102029	3102110	17210850_17210851	0	+	3102029	3102110	204,102,51	8	24,24,24,24,24,24,24,24,	0,1,5,6,7,51,52,57,
chr1	3102029	3102053	17210850_17210851	0	+	3102029	3102053	0,0,0	1		
chr1	3102030	3102054	17210850_17210851	0	+	3102030	3102054	0,0,0	1		
chr1	3102034	3102058	17210850_17210851	0	+	3102034	3102058	0,0,0	1		
chr1	3102035	3102059	17210850_17210851	0	+	3102035	3102059	0,0,0	1		
chr1	3102036	3102060	17210850_17210851	0	+	3102036	3102060	0,0,0	1		
chr1	3102080	3102104	17210850_17210851	0	+	3102080	3102104	0,0,0	1		
chr1	3102081	3102105	17210850_17210851	0	+	3102081	3102105	0,0,0	1		
chr1	3102086	3102110	17210850_17210851	0	+	3102086	3102110	0,0,0	1

Note that the first entry, from the probeset track, spans chr1:3102029-3102110. 
The 8 on that line indicates 8 probes make up that probeset. The remaining lines
are the locations of the eight individual probes (e.g. chr1:3102029-3102053). If
there is a SNP between chr1:3102029-3102110 that does NOT intersect one of the 
probe locations, it should not affect the hybridization.
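
The reasoning above can be sketched in Python using the eight probe intervals 
from the grep output. This is an illustration only; equalizer itself relies on 
bedtools for the overlap step, and the function name below is hypothetical. The 
sketch assumes the usual coordinate conventions: BED intervals are 0-based with 
an exclusive end, and VCF positions are 1-based.

```python
# Probe intervals copied from the grep output above
# (BED convention: 0-based start, end not included).
PROBES = [
    (3102029, 3102053),
    (3102030, 3102054),
    (3102034, 3102058),
    (3102035, 3102059),
    (3102036, 3102060),
    (3102080, 3102104),
    (3102081, 3102105),
    (3102086, 3102110),
]


def probes_hit_by_snp(vcf_pos, probes):
    """Return the probe intervals that contain a SNP at 1-based position vcf_pos."""
    pos0 = vcf_pos - 1  # convert the 1-based VCF position to a 0-based BED position
    return [(start, end) for start, end in probes if start <= pos0 < end]


# A SNP inside the first cluster of probes overlaps five of them.
print(len(probes_hit_by_snp(3102040, PROBES)))  # prints 5
# A SNP in the gap between the two clusters lies within the probeset span
# but overlaps no individual probe.
print(len(probes_hit_by_snp(3102070, PROBES)))  # prints 0
```

The second SNP would be flagged if the probeset-level interval 
chr1:3102029-3102110 were used, even though no probe sequence covers it.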

Therefore, retain only the last track (the probe track). Do not mix entries 
from the probeset track with entries from the probe track.
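
One way to keep only the last track is sketched below. It is not part of 
equalizer, and the function names are hypothetical. It assumes that each track 
in the BED file begins with a header line starting with the word "track", as in 
the UCSC track format; check your file before relying on this.

```python
def split_bed_tracks(lines):
    """Group BED lines into (header, records) pairs, one pair per track.

    Assumes each track starts with a header line beginning with "track".
    Blank lines, "browser" lines, and "#" comments are skipped.
    """
    tracks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("browser") or line.startswith("#"):
            continue
        if line.startswith("track"):
            tracks.append((line, []))
        elif tracks:
            # Records are attached to the most recent track header.
            tracks[-1][1].append(line)
    return tracks


def last_track_records(lines):
    """Return the record lines of the final track (the probe track)."""
    tracks = split_bed_tracks(lines)
    if not tracks:
        raise ValueError("no track header lines found")
    return tracks[-1][1]


# Small in-memory demonstration with made-up track headers.
demo = [
    "track name=probeset_track\n",
    "chr1\t3102029\t3102110\t17210850_17210851\n",
    "track name=probe_track\n",
    "chr1\t3102029\t3102053\t17210850_17210851\n",
    "chr1\t3102030\t3102054\t17210850_17210851\n",
]
print(len(last_track_records(demo)))  # prints 2
```

To apply this to a real file, pass an open file handle, for example 
last_track_records(open("MoGene-2_0-st-v1.mm10.bed")), and write the returned 
lines to a new BED file.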