# [`GEOquery`](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html)

Reading NCBI GEO microarray SOFT files in R/Bioconductor.

In [1]:
origin <- getwd()

# Load my own function from the library directory
libdir <- "../lib/"
setwd(libdir)
source("GEO_utils.R")
setwd(origin)

# Change directory to workspace (where to store the data).
basedir <- "../data/"
setwd(basedir)

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min


Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To 

### GEO

- [Gene Expression Omnibus (GEO)](http://www.ncbi.nlm.nih.gov/geo/) is a large public repository of gene expression data hosted by NCBI.
- SOFT stands for Simple Omnibus Format in Text.
- Four types of GEO SOFT file are available:

|abbreviation|GEO types|description|
|:-:|:-:|:-|
|GPL|GEO Platform|These files describe a particular type of microarray. They are annotation files.|
|GSM|GEO Sample|Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the `VALUE` column.|
|GSE|GEO Series| Lists of `GSM` files that together form a single experiment.|
|GDS|GEO Dataset|These are curated files that hold a summarized combination of a `GSE` file and its `GSM` files. They contain normalized expression levels for each gene from each sample (i.e. just the `VALUE` field from the `GSM` file).|
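
Because the accession prefix identifies the record type, a small helper can classify an accession before deciding how to process it. The following is a minimal base-R sketch; `geo_record_type` is a hypothetical name and is not part of GEOquery or `GEO_utils.R`.

```r
# Hypothetical helper: map a GEO accession to its record type by prefix.
geo_record_type <- function(accession) {
  types <- c(GPL = "Platform", GSM = "Sample", GSE = "Series", GDS = "Dataset")
  prefix <- toupper(substr(accession, 1, 3))
  # Unknown prefixes yield NA rather than an error.
  unname(types[prefix])
}

geo_record_type(c("GSE34198", "GPL6102", "GSM605087"))
```

The function is vectorized, so it can be applied directly to a whole list of accessions after `unlist()`.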

#### [Supplementary Table 3.](https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-54603-2/MediaObjects/41598_2019_54603_MOESM2_ESM.pdf) Basic information of the microarray datasets from GEO

|Data source | Platform  | `colname` | `ctrls` | Control (n) | `cases` | Case (n) | 
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|GSE34198|GPL6102 | `"group"` | `c("control")` |48| `c("AIM", "AIMD6")` |49|
|GSE62646|GPL6244 | `"cardiovascular disease state"` | `c("CAD")` |14| `c("STEMI")` |84|
|GSE60993|GPL6884 | `"title"` | `c("Normal")` | 7| `c("STEMI", "NSTEMI")` |17|
|GSE61144|GPL6106 | `"title"` | `c("Normal")` |10| `c("STEMI")` | 7|
|GSE24548|GPL8227 | `"title"` | `c("FAMI Control")` | 3| `c("FAMI patient")` | 4|
|GSE53211|GPL18049| `"title"` | `c("Healthy-control")` | 4| `c("STEMI")` | 9|
|GSE61741|GPL9040 | `"disease"` | `c("normal")` |~~34~~ 94| `c("myocardial_infarction")` |62|
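
The `colname`, `ctrls`, and `cases` columns describe how samples are split into groups: one phenotype column is inspected, and each sample is labelled control, case, or other. The source of `GEO_utils.R` is not shown here, so the following is only a sketch of one plausible way to do this, assuming substring matching. `classify_samples` and the toy sample titles are hypothetical.

```r
# Hypothetical helper: label each sample as control, case, or other by
# searching for the given patterns in one phenotype column.
# Assumption: patterns are matched as fixed substrings, not exact values.
classify_samples <- function(pheno, colname, ctrls, cases) {
  values <- as.character(pheno[[colname]])
  matches_any <- function(patterns) {
    Reduce(`|`, lapply(patterns, function(p) grepl(p, values, fixed = TRUE)))
  }
  is_ctrl <- matches_any(ctrls)
  # A sample already labelled control is never relabelled as a case.
  is_case <- matches_any(cases) & !is_ctrl
  ifelse(is_ctrl, "control", ifelse(is_case, "case", "other"))
}

# Toy phenotype table with made-up titles (not taken from any GEO series).
pheno <- data.frame(title = c("Normal_1", "STEMI_1", "NSTEMI_1", "UA_1"),
                    stringsAsFactors = FALSE)
labels <- classify_samples(pheno, "title",
                           ctrls = c("Normal"), cases = c("STEMI", "NSTEMI"))
table(labels)
```

Samples that match neither pattern set fall into "other", which is how a series can report a non-zero count of other samples.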

In [2]:
gse_names <- list("GSE34198", "GSE62646", "GSE60993", "GSE61144", "GSE24548", "GSE53211", "GSE61741")
gpl_names <- list("GPL6102", "GPL6244", "GPL6884", "GPL6106", "GPL8227", "GPL18049", "GPL9040")
colnames  <- list("group", "cardiovascular disease state", "title", "title", "title", "title", "disease")
controls  <- list(c("control"), c("CAD"), c("Normal"), c("Normal"), c("FAMI Control"), c("Healthy-control"), c("normal"))
cases     <- list(c("AIM", "AIMD6"), c("STEMI"), c("STEMI", "NSTEMI"), c("STEMI"), c("FAMI patient"), c("STEMI"), c("myocardial_infarction"))

In [3]:
print(paste("gse_names:", length(gse_names), "data"))
print(paste("gpl_names:", length(gpl_names), "data"))
print(paste("colnames :", length(colnames ), "data"))
print(paste("controls :", length(controls ), "data"))
print(paste("cases    :", length(cases    ), "data"))

[1] "gse_names: 7 data"
[1] "gpl_names: 7 data"
[1] "colnames : 7 data"
[1] "controls : 7 data"
[1] "cases    : 7 data"
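
The printed lengths have to be compared by eye. As an alternative, the check can be made to fail fast when the parallel lists drift out of alignment. This is a minimal sketch; `check_parallel` is a hypothetical helper, shown here on a two-entry excerpt of the lists.

```r
# Hypothetical helper: stop with a readable message if the named
# arguments do not all have the same length.
check_parallel <- function(...) {
  args <- list(...)
  lens <- lengths(args)
  if (length(unique(lens)) > 1) {
    stop("Length mismatch: ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  # Return the common length invisibly so callers can reuse it.
  invisible(lens[[1]])
}

n <- check_parallel(
  gse = list("GSE34198", "GSE62646"),
  gpl = list("GPL6102", "GPL6244")
)
```

In the notebook, the same call could take all five lists (`gse_names`, `gpl_names`, `colnames`, `controls`, `cases`) as named arguments.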


In [4]:
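for (i in seq_along(gse_names)) {
    # Single-bracket indexing returns length-one lists; they are passed on as-is.
    gse_name <- gse_names[i]
    gpl_name <- gpl_names[i]
    colname  <- colnames [i]
    ctrl <- controls[i]
    case <- cases[i]
    print(paste("Processing", gse_name))
    
    # GSE2MArrayLM is defined in GEO_utils.R; downloads are cached in a per-series directory.
    gset <- GSE2MArrayLM(GSE=gse_name, GPL=gpl_name, destdir=unlist(gse_name), colname=colname, ctrls=ctrl, cases=case)
    # Note: the output is tab-separated despite the .csv extension.
    write.table(gset, file=paste("GEOquery/", "MArrayLM/", gse_name, "_preprocessed.csv", sep=""), sep="\t")
}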

[1] "Processing GSE34198"


Found 1 file(s)

GSE34198_series_matrix.txt.gz

Using locally cached version: GSE34198/GSE34198_series_matrix.txt.gz

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)

See spec(...) for full column specifications.

Using locally cached version of GPL6102 found here:
GSE34198/GPL6102.annot.gz 



[1] "48 control samples."
[1] "49 case samples."
[1] "0 other samples."


“Partial NA coefficients for 1865 probe(s)”


[1] "Processing GSE62646"


Found 1 file(s)

GSE62646_series_matrix.txt.gz

Using locally cached version: GSE62646/GSE62646_series_matrix.txt.gz

Parsed with column specification:
cols(
  .default = col_double()
)

See spec(...) for full column specifications.

Using locally cached version of GPL6244 found here:
GSE62646/GPL6244.annot.gz 



[1] "14 control samples."
[1] "84 case samples."
[1] "0 other samples."
[1] "Processing GSE60993"


Found 1 file(s)

GSE60993_series_matrix.txt.gz

Using locally cached version: GSE60993/GSE60993_series_matrix.txt.gz

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)

See spec(...) for full column specifications.

Using locally cached version of GPL6884 found here:
GSE60993/GPL6884.annot.gz 



[1] "7 control samples."
[1] "17 case samples."
[1] "9 other samples."
[1] "Processing GSE61144"


Found 1 file(s)

GSE61144_series_matrix.txt.gz

Using locally cached version: GSE61144/GSE61144_series_matrix.txt.gz

Parsed with column specification:
cols(
  .default = col_double()
)

See spec(...) for full column specifications.

“cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL6nnn/GPL6106/annot/GPL6106.annot.gz': HTTP status was '404 Not Found'”
Annotation GPL not available, so will use submitter GPL instead

Using locally cached version of GPL6106 found here:
GSE61144/GPL6106.soft 



[1] "10 control samples."
[1] "14 case samples."
[1] "0 other samples."
[1] "Processing GSE24548"


Found 1 file(s)

GSE24548_series_matrix.txt.gz

Parsed with column specification:
cols(
  ID_REF = col_character(),
  GSM605087 = col_double(),
  GSM605088 = col_double(),
  GSM605089 = col_double(),
  GSM605090 = col_double(),
  GSM605091 = col_double(),
  GSM605092 = col_double(),
  GSM605113 = col_double()
)

“cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL8nnn/GPL8227/annot/GPL8227.annot.gz': HTTP status was '404 Not Found'”
Annotation GPL not available, so will use submitter GPL instead

File stored at: 

GSE24548/GPL8227.soft

“Duplicated column names deduplicated: 'SPOT_ID' => 'SPOT_ID_1' [6]”


[1] "3 control samples."
[1] "4 case samples."
[1] "0 other samples."
[1] "Processing GSE53211"


Found 1 file(s)

GSE53211_series_matrix.txt.gz

Parsed with column specification:
cols(
  ID_REF = col_character(),
  GSM1287710 = col_double(),
  GSM1287711 = col_double(),
  GSM1287712 = col_double(),
  GSM1287713 = col_double(),
  GSM1287714 = col_double(),
  GSM1287715 = col_double(),
  GSM1287716 = col_double(),
  GSM1287717 = col_double(),
  GSM1287719 = col_double(),
  GSM1287721 = col_double(),
  GSM1287723 = col_double(),
  GSM1287724 = col_double(),
  GSM1287726 = col_double(),
  GSM1287728 = col_double(),
  GSM1287730 = col_double(),
  GSM1287731 = col_double(),
  GSM1287732 = col_double(),
  GSM1287733 = col_double()
)

“cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL18nnn/GPL18049/annot/GPL18049.annot.gz': HTTP status was '404 Not Found'”
Annotation GPL not available, so

[1] "4 control samples."
[1] "9 case samples."
[1] "5 other samples."


“Partial NA coefficients for 24 probe(s)”


[1] "Processing GSE61741"


Found 1 file(s)

GSE61741_series_matrix.txt.gz

Using locally cached version: GSE61741/GSE61741_series_matrix.txt.gz

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)

See spec(...) for full column specifications.

“cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL9nnn/GPL9040/annot/GPL9040.annot.gz': HTTP status was '404 Not Found'”
Annotation GPL not available, so will use submitter GPL instead

Using locally cached version of GPL9040 found here:
GSE61741/GPL9040.soft 



[1] "94 control samples."
[1] "62 case samples."
[1] "893 other samples."


### Reference

- [GEO - Reading the NCBI's GEO microarray SOFT files in R/BioConductor - (2016-10-03)](https://mdozmorov.github.io/BIOS567/assets/presentation_Bioconductor/GEO.pdf)
- [GEO2R - GEO - NCBI](https://www.ncbi.nlm.nih.gov/geo/geo2r)