# Download & Preprocessing

## 1. Obtaining Raw Data

Search [GEO](https://www.ncbi.nlm.nih.gov/geo/) to find relevant microarray data <b>(in this case, seven CAD-related microarray datasets)</b> for the meta-analysis.

#### `crossmeta`

> `crossmeta` streamlines the cross-platform effect size and pathway meta-analysis of microarray data. For the analysis, you will need a list of Affymetrix, Illumina, and/or Agilent GSE numbers from [GEO](https://www.ncbi.nlm.nih.gov/geo/). All 21 species in the current [homologene](http://1.usa.gov/1TGoIy7) build are supported.

[Bioconductor - crossmeta](https://www.bioconductor.org/packages/release/bioc/html/crossmeta.html)

In [1]:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# BiocManager::install("crossmeta")

In [2]:
data_dir <- "../data/"

In [3]:
library(crossmeta)

In [4]:
# gather all GSEs
gse_names_mRNAs  <- c(
    "GSE34918", 
    "GSE62646", 
    "GSE60993", 
    "GSE61144"
)
gse_names_miRNAs  <- c(
    "GSE24548", 
    "GSE53211", 
    "GSE53675"
)

In [5]:
print(gse_names_mRNAs)
print(gse_names_miRNAs)

[1] "GSE34918" "GSE62646" "GSE60993" "GSE61144"
[1] "GSE24548" "GSE53211" "GSE53675"


In [6]:
# Download raw data to `data_dir` directory.
# get_raw(gse_names=c(gse_names_mRNAs, gse_names_miRNAs), data_dir=data_dir)

<details>
    <summary>Logs</summary>
    
```R
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34918/suppl//GSE34918_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 27443200 bytes (26.2 MB)
==================================================
downloaded 26.2 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34918/suppl//GSE34918_non-normalized.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 1663173 bytes (1.6 MB)
==================================================
downloaded 1.6 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE62nnn/GSE62646/suppl//GSE62646_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 433612800 bytes (413.5 MB)
==================================================
downloaded 413.5 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60993/suppl//GSE60993_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 6584320 bytes (6.3 MB)
==================================================
downloaded 6.3 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60993/suppl//GSE60993_non-normalized.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 10962907 bytes (10.5 MB)
==================================================
downloaded 10.5 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE61nnn/GSE61144/suppl//GSE61144_non_normalized.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 8457250 bytes (8.1 MB)
==================================================
downloaded 8.1 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE24nnn/GSE24548/suppl//GSE24548_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 419840 bytes (410 KB)
==================================================
downloaded 410 KB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE53nnn/GSE53211/suppl//GSE53211_average.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 7820 bytes
==================================================
downloaded 7820 bytes

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE53nnn/GSE53211/suppl//GSE53211_difference.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 5842 bytes
==================================================
downloaded 5842 bytes

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE53nnn/GSE53211/suppl//GSE53211_fold_change.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 5834 bytes
==================================================
downloaded 5834 bytes

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE53nnn/GSE53211/suppl//GSE53211_non-normalized_data.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 23721 bytes (23 KB)
==================================================
downloaded 23 KB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE53nnn/GSE53675/suppl//GSE53675_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 14458880 bytes (13.8 MB)
==================================================
downloaded 13.8 MB

```    
    
</details>
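
Once the download finishes, a quick check that each accession has its own folder under `data_dir` can catch interrupted transfers early. The sketch below uses only base R and assumes one sub-folder per GSE; the helper name `check_gse_dirs` is hypothetical and not part of `crossmeta`. It runs against a temporary directory so that no downloads are needed.

```R
# Hypothetical helper (not part of crossmeta): report which GSE folders
# exist under data_dir and how many files each one contains.
check_gse_dirs <- function(gse_names, data_dir) {
    paths <- file.path(data_dir, gse_names)
    data.frame(
        gse     = gse_names,
        exists  = dir.exists(paths),
        n_files = vapply(paths, function(p) length(list.files(p)),
                         integer(1), USE.NAMES = FALSE),
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

# Demo on a temporary directory with one placeholder file.
demo_dir <- tempfile("geo_demo_")
dir.create(file.path(demo_dir, "GSE34918"), recursive = TRUE)
file.create(file.path(demo_dir, "GSE34918", "GSE34918_non-normalized.txt"))

status <- check_gse_dirs(c("GSE34918", "GSE62646"), demo_dir)
print(status)
```

In the notebook itself, the call would presumably be `check_gse_dirs(c(gse_names_mRNAs, gse_names_miRNAs), data_dir)`.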

## 2. Checking Raw Illumina Data

It is difficult to automate loading raw Illumina data files because they lack a standardized format. `crossmeta` will attempt to fix the headers of raw Illumina data files so that they can be loaded. <b>If `crossmeta` fails, you will have to edit the headers of the raw Illumina data files yourself or omit the offending studies.</b>

To edit raw Illumina data headers, I recommend that you download [Sublime Text 2](https://www.sublimetext.com/2) and set it as your default text editor. It has very nice regular expression capabilities. [Here](https://cheatography.com/davechild/cheat-sheets/regular-expressions/) is a good regular expression cheat sheet.

Raw Illumina files will be in `data_dir`, in a separate folder for each GSE. They are usually `.txt` files and include `non-normalized` in their name. Ensure the following:

- <b>Detection p-values</b>: present (usually every second column)
- <b>File format</b>: tab-separated `.txt` file
- <b>File name</b>: includes `non-normalized`
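
The file-name requirement is easy to check in bulk. The sketch below, in base R only, lists the `.txt` files under a directory and flags those whose names lack the literal `non-normalized` tag. One of the downloads logged earlier is named `GSE61144_non_normalized.txt.gz` (underscore instead of hyphen), so a file like that would be flagged; whether `crossmeta` itself rejects such a name is not confirmed here. The helper name `flag_illum_names` is hypothetical.

```R
# Hypothetical helper (not part of crossmeta): list .txt files under data_dir
# and mark whether each name contains the literal tag "non-normalized".
flag_illum_names <- function(data_dir) {
    files <- list.files(data_dir, pattern = "\\.txt$", recursive = TRUE)
    data.frame(
        file   = files,
        tagged = grepl("non-normalized", basename(files), fixed = TRUE),
        stringsAsFactors = FALSE
    )
}

# Demo with empty placeholder files in a temporary directory.
demo_dir <- tempfile("illum_demo_")
dir.create(file.path(demo_dir, "GSE60993"), recursive = TRUE)
dir.create(file.path(demo_dir, "GSE61144"), recursive = TRUE)
file.create(file.path(demo_dir, "GSE60993", "GSE60993_non-normalized.txt"))
file.create(file.path(demo_dir, "GSE61144", "GSE61144_non_normalized.txt"))

name_check <- flag_illum_names(demo_dir)
print(name_check)
```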

Also ensure that column names have the following format:

- <b>Probe ID</b>: ID_REF
- <b>Expression values</b>: AVG_Signal-sample_name
- <b>Detection p-values</b>: Detection-sample_name
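
The column-name rules above can also be checked programmatically before attempting to load a file. The following base R sketch reads only the first line of a tab-separated file and reports whether `ID_REF` is present, whether every `AVG_Signal-` column has a matching `Detection-` column, and which column names fit none of the three patterns. The helper name `check_illum_header` and the demo header are made up for illustration; this is not how `crossmeta` validates files internally.

```R
# Hypothetical helper (not part of crossmeta): check that a raw Illumina
# header has ID_REF plus AVG_Signal-/Detection- columns for the same samples.
check_illum_header <- function(path) {
    header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]
    signal <- sub("^AVG_Signal-", "", grep("^AVG_Signal-", header, value = TRUE))
    detect <- sub("^Detection-", "", grep("^Detection-", header, value = TRUE))
    list(
        has_id_ref    = "ID_REF" %in% header,
        samples_match = length(signal) > 0 && setequal(signal, detect),
        unrecognised  = setdiff(header, c("ID_REF",
                                          paste0("AVG_Signal-", signal),
                                          paste0("Detection-", detect)))
    )
}

# Demo file with a made-up header: one good sample and one bad column name.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
    "ID_REF\tAVG_Signal-s1\tDetection-s1\ts2.AVG_Signal\tDetection-s2",
    "ILMN_1343291\t120.5\t0.001\t98.2\t0.02"
), demo_file)

header_check <- check_illum_header(demo_file)
str(header_check)
```

Here the second sample's expression column is misnamed, so `samples_match` is `FALSE` and the offending name appears in `unrecognised`.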

To open these files one at a time with your default text editor:

```R
# illum_names: character vector of the Illumina GSE names (not defined in this notebook)
open_raw_illum(illum_names, data_dir)
```
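
As an alternative to editing headers by hand, the same kind of regular expression can be applied in R. The sketch below assumes, purely for illustration, that the original columns look like `<sample>.AVG_Signal` and `<sample>.Detection Pval`; real files vary, so the patterns would need adjusting per study. The helper name `fix_illum_header` is hypothetical and not part of `crossmeta`.

```R
# Hypothetical helper (not part of crossmeta): rewrite headers of the assumed
# form "<sample>.AVG_Signal" / "<sample>.Detection Pval" into the
# "AVG_Signal-<sample>" / "Detection-<sample>" form expected for loading.
fix_illum_header <- function(header) {
    header <- sub("^(.+)\\.AVG_Signal$", "AVG_Signal-\\1", header)
    header <- sub("^(.+)\\.Detection Pval$", "Detection-\\1", header)
    header
}

old_header <- c("ID_REF", "s1.AVG_Signal", "s1.Detection Pval")
new_header <- fix_illum_header(old_header)
print(new_header)
```

Writing the corrected header to a copy of the file, rather than overwriting the download, keeps the original available if a pattern turns out to be wrong.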

In [7]:
# open_raw_illum(illum_names)

In [8]:
# library(lydata)

# # location of raw data
# data_dir <- system.file("extdata", package = "lydata")

## 3. Loading and Annotating Data

After downloading the raw data, <b>it must be loaded and annotated.</b> The necessary Bioconductor annotation data packages will be downloaded as needed.

In [9]:
# Loads and annotates raw data previously downloaded with get_raw
esets_mRNAs  <- load_raw(gse_names=gse_names_mRNAs,  data_dir=data_dir)

In [11]:
esets_mRNAs

$GSE34918
ExpressionSet (storageMode: lockedEnvironment)
assayData: 50605 features, 5 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM857496 GSM857497 ... GSM857500 (5 total)
  varLabels: title geo_accession ... title.raw (41 total)
  varMetadata: labelDescription
featureData
  featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_3311190 (50605
    total)
  fvarLabels: ENTREZID PROBE ... SYMBOL (5 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL10558 

$GSE62646
ExpressionSet (storageMode: lockedEnvironment)
assayData: 40470 features, 98 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1530765 GSM1530766 ... GSM1530862 (98 total)
  varLabels: title geo_accession ... scan_date (39 total)
  varMetadata: labelDescription
featureData
  featureNames: 7896740 8023937 ... 8105991.9 (40470 total)
  fvarLabels: ID GB_LIST ... ENTREZID_HS (16 total)
  fvarMetadata: Column Description la

In [10]:
# Loads and annotates raw data previously downloaded with get_raw
esets_miRNAs <- load_raw(gse_names=gse_names_miRNAs, data_dir=data_dir)

https://ftp.ncbi.nlm.nih.gov/geo/series/GSE24nnn/GSE24548/matrix/

OK

Found 1 file(s)

GSE24548_series_matrix.txt.gz

Using locally cached version: ../data//GSE24548/GSE24548_series_matrix.txt.gz

Setting options('download.file.method.GEOquery'='auto')

Setting options('GEOquery.inmemory.gpl'=FALSE)

Parsed with column specification:
cols(
  ID_REF = col_character(),
  GSM605087 = col_double(),
  GSM605088 = col_double(),
  GSM605089 = col_double(),
  GSM605090 = col_double(),
  GSM605091 = col_double(),
  GSM605092 = col_double(),
  GSM605113 = col_double()
)

Reading file ../data//GSE53211/GSE53211_non-normalized_data_fixed.txt ... ...


Couldn't find raw data for: GSE53675

Couldn't load raw Agilent data for: GSE24548

Couldn't load raw Illumina data for: GSE53211



In [12]:
esets_miRNAs

NULL
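
Since the result of `load_raw` is printed above as a list named by GSE, comparing the requested accessions against those names gives a quick summary of what failed to load. A minimal base R sketch follows; the helper name `missing_gses` is hypothetical, and `esets_demo` stands in for the `NULL` result shown above.

```R
# Hypothetical helper: list requested GSEs missing from the names of the
# loaded list. names(NULL) is NULL, so a NULL result reports every GSE.
missing_gses <- function(requested, esets) {
    setdiff(requested, names(esets))
}

gse_names_miRNAs <- c("GSE24548", "GSE53211", "GSE53675")
esets_demo <- NULL  # stands in for the NULL returned above

not_loaded <- missing_gses(gse_names_miRNAs, esets_demo)
print(not_loaded)
```

In this run all three miRNA studies are reported, which matches the three "Couldn't ..." messages in the log.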

### Reference:

- [Cross-Platform Meta Analysis](http://bioconductor.riken.jp/packages/3.8/bioc/vignettes/crossmeta/inst/doc/crossmeta-vignette.html)