# FIT (Fast Import Tool)

FIT downloads a series from NCBI GEO, together with its raw supplementary files, and loads the data into R. The tool itself is written in R.

The Gene Expression Omnibus (GEO) from NCBI is a public repository for genetics-based experimental data. Most of the data are microarray-based results, typically measurements of mRNA, genomic DNA, or protein abundance. At the most basic level of organization, GEO has three entity types that users may submit: Platforms, Samples, and Series. In addition, there is a curated entity called a GEO DataSet.

A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples that have been submitted by multiple submitters.

A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.

A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx).

GEO DataSets (GDSxxx) are curated sets of GEO Sample data. A GDS record represents a collection of biologically and statistically comparable GEO Samples and forms the basis of GEO's suite of data display and analysis tools. Samples within a GDS refer to the same Platform, that is, they share a common set of probe elements. Value measurements for each Sample within a GDS are assumed to be calculated in an equivalent manner, that is, considerations such as background processing and normalization are consistent across the dataset. Information reflecting experimental design is provided through GDS subsets.
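
The accession prefixes described above (GPL, GSM, GSE, GDS) are enough to tell the entity types apart. The following is a minimal sketch in base R; `geo_entity_type` is a hypothetical helper written for illustration and is not part of GEOquery.

```R
# Map a GEO accession to its entity type using the three-letter prefix.
# `geo_entity_type` is a hypothetical helper, not a GEOquery function.
geo_entity_type <- function(accession) {
  prefixes <- c(GPL = "Platform", GSM = "Sample", GSE = "Series", GDS = "DataSet")
  # A valid accession is a known prefix followed only by digits
  valid <- grepl("^(GPL|GSM|GSE|GDS)[0-9]+$", accession)
  type <- unname(prefixes[substr(accession, 1, 3)])
  type[!valid] <- NA_character_
  type
}

# Accessions that appear in this notebook, plus one invalid string
geo_entity_type(c("GSE27562", "GSM681982", "GPL570", "XYZ1"))
```

Anything that does not match a known prefix followed by digits is returned as `NA`, which makes it easy to reject a mistyped series before attempting a download.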

GeneChip RMA (GC-RMA) is an improved form of RMA (Robust Multi-array Average) that uses the sequence-specific probe affinities of GeneChip probes to obtain more accurate gene expression values.
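
As a rough sketch of how GC-RMA could be applied, assuming an `AffyBatch` of raw CEL data is available and the `gcrma` Bioconductor package is installed, the wrapper below calls `gcrma::gcrma()`, which returns an `ExpressionSet` of log2-scale expression values. The wrapper name `normalize_gcrma` is hypothetical, and default settings are assumed throughout.

```R
# Hypothetical wrapper around gcrma::gcrma() with basic input checks.
normalize_gcrma <- function(affy_batch) {
  # Fail early if the input is not raw Affymetrix data
  if (!inherits(affy_batch, "AffyBatch")) {
    stop("Expected an AffyBatch object, e.g. the result of affy::ReadAffy().")
  }
  if (!requireNamespace("gcrma", quietly = TRUE)) {
    stop("The 'gcrma' Bioconductor package is required.")
  }
  # Default settings; returns an ExpressionSet
  gcrma::gcrma(affy_batch)
}
```

Calling the wrapper on an `AffyBatch` read from CEL files would then give one normalized expression value per probe set and sample.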

**Enter the series for import:**

In [1]:
series = "GSE27562"

### Libraries used

To install a CRAN package, run the following from the R console (RColorBrewer is the CRAN package loaded below):
```R
install.packages("RColorBrewer")
```

For Bioconductor packages, first install BiocManager, which replaces the older `biocLite.R` script:
```R
install.packages("BiocManager")
```
Then install the packages. Example:
```R
BiocManager::install("GEOquery")
```

In [2]:
library(GEOquery)
library(affy)
library(gcrma)
library(RColorBrewer)
library(affyPLM)
library(simpleaffy)

Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Welcome to Bioconductor

    

### Get/Create directories

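This assumes the notebook is run from a notebook folder inside the project. Downloads are stored in a `gseRaw` folder next to it, under the parent directory, which is created if it does not exist.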

In [3]:
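# Resolve paths relative to the notebook's working directory
notebook_dir <- getwd()
main_dir <- dirname(notebook_dir)
gse_dir <- file.path(main_dir, "gseRaw")
# Create the download directory on first run
if (!dir.exists(gse_dir)) {
    dir.create(gse_dir)
}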

### Load GEO data

We are interested only in the Series (GSE) record.

In [4]:
gse <- getGEO(GEO = series, destdir = gse_dir)
show(gse)

Found 1 file(s)
GSE27562_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
File stored at: 
/Users/terek/Documents/Github/GenClass-Stability/gseRaw/GPL570.soft


$GSE27562_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 162 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM681982 GSM681983 ... GSM682143 (162 total)
  varLabels: title geo_accession ... tissue:ch1 (40 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 



In [5]:
suppFiles = getGEOSuppFiles(GEO = series, makeDirectory = TRUE, baseDir = gse_dir)

In [6]:
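# The first supplementary file is assumed to be the tar archive of raw CEL files
tarFiles <- rownames(suppFiles)[1]
# Extract into a "data" subfolder next to the archive
untarPath <- file.path(dirname(tarFiles), "data")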
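
To make the path handling above concrete, here is a small base R sketch. The archive path is a hypothetical value standing in for the first row name of `suppFiles`; the actual path depends on where the files were downloaded.

```R
# Hypothetical archive path, standing in for rownames(suppFiles)[1]
tar_file <- "gseRaw/GSE27562/GSE27562_RAW.tar"

# The extraction folder sits next to the archive, in a "data" subfolder
untar_path <- file.path(dirname(tar_file), "data")
untar_path
# [1] "gseRaw/GSE27562/data"
```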

In [9]:
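untar(tarFiles, exdir = untarPath)
# Keep only gzip-compressed files; the character class "[gz]" would match any name containing "g" or "z"
cels <- list.celfiles(untarPath, pattern = "\\.gz$")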
cels
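
The `pattern` argument is a regular expression, so a character class such as `[gz]` matches any file name containing a "g" or a "z", whereas an anchored pattern such as `\\.gz$` matches only names ending in `.gz`. The sketch below shows the difference with base R's `grepl()`; the file names are hypothetical.

```R
# Hypothetical directory listing
files <- c("GSM681982.CEL.gz", "log.txt", "readme.txt")

# Character class: TRUE for any name containing "g" or "z"
grepl("[gz]", files)
# [1]  TRUE  TRUE FALSE

# Anchored extension: TRUE only for names ending in ".gz"
grepl("\\.gz$", files)
# [1]  TRUE FALSE FALSE
```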

In [11]:
# Point ReadAffy at the extraction folder instead of changing the working directory
data <- ReadAffy(filenames = cels, celfile.path = untarPath)

In [12]:
data

“replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’”


AffyBatch object
size of arrays=1164x1164 features (74 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=162
number of genes=54675
annotation=hgu133plus2
notes=
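
Both the series matrix and the `AffyBatch` report 162 samples. One way to check that they refer to the same samples is to compare GSM accessions. The sketch below uses base R only and assumes that each CEL file name starts with its GSM accession; the two vectors are hypothetical stand-ins for the values that `list.celfiles()` and `sampleNames(gse[[1]])` would return.

```R
# Hypothetical CEL file names; real names would come from list.celfiles()
cel_files <- c("GSM681982.CEL.gz", "GSM681983.CEL.gz")
# Hypothetical sample accessions; real ones would come from sampleNames(gse[[1]])
series_samples <- c("GSM681982", "GSM681983")

# Extract the leading GSM accession from each file name
cel_samples <- regmatches(cel_files, regexpr("^GSM[0-9]+", cel_files))

# Both differences are empty when the two sets agree
missing_cel <- setdiff(series_samples, cel_samples)    # samples without a CEL file
extra_cel   <- setdiff(cel_samples, series_samples)    # CEL files not in the series
length(missing_cel) == 0 && length(extra_cel) == 0
```

Comparing accessions rather than counts catches the case where the totals match but individual samples differ.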