# FIT (Fast Import Tool) - Notebook Version 1.0
The FIT imports GSE data from [NCBI GEO](https://www.ncbi.nlm.nih.gov/geo/). This is the entry point for all program data for GenClass-Stability. This program is written in R.


#### Background Information:
The NCBI Gene Expression Omnibus (GEO) is a public repository for functional genomics data. Most of the data are microarray-based results, typically measuring RNA, DNA, or protein abundance. GEO makes three basic types of records available: Platforms, Samples, and Series.

A Platform record describes the array used for an experiment (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx).

A Sample record describes the conditions under which an individual Sample was taken and its measurements (e.g., expression of Rat 1 under a given condition). Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample must reference exactly one Platform and may be included in multiple Series.

A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx).

###### Enter the series for import:
The accession must be all upper case, e.g. `"GSE27562"`.

In [1]:
series = "GSE27562"
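The upper-case requirement can also be enforced in code rather than by convention. The following is a minimal sketch, not part of the original notebook: `normalize_series` is a hypothetical helper, and it assumes that a series accession is always `GSE` followed by digits.

```r
# Hypothetical helper: normalize and validate a GEO series accession.
normalize_series <- function(x) {
    x <- toupper(trimws(x))              # strip whitespace, force upper case
    if (!grepl("^GSE[0-9]+$", x)) {      # assumed format: GSE followed by digits
        stop("Not a GEO series accession: ", x)
    }
    x
}

series <- normalize_series("gse27562")
```

With this in place, a lower-case or padded entry still resolves to the directory name `GSE27562` used in the later cells, and an accession of another type (for example a GSM) stops the notebook early.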

### Libraries
These packages must be pre-installed. See *Extras.ipynb* for more information.

In [2]:
library(GEOquery)
library(affy)
library(simpleaffy)

Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Welcome to Bioconductor

    

### Get/Create directories
Assumes this notebook is in `GenClass-Stability/importTools/notebooks/`

In [3]:
notebook_dir <- getwd() # get the working directory
main_dir <- dirname(dirname(notebook_dir)) # get two levels up
gse_dir = file.path(main_dir,"GSE")
if (!dir.exists(gse_dir)) {
    dir.create(gse_dir)
}
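The directory logic above depends on where the notebook lives. As a sketch under that same layout assumption, the steps can be wrapped in a function; `get_gse_dir` is a hypothetical name introduced here for illustration. Passing `showWarnings = FALSE` to `dir.create` makes the call harmless when the directory already exists, so no explicit `dir.exists` check is needed.

```r
# Hypothetical helper: locate (and create if needed) the GSE directory
# two levels above the notebook directory.
get_gse_dir <- function(notebook_dir = getwd()) {
    main_dir <- dirname(dirname(notebook_dir))
    gse_dir <- file.path(main_dir, "GSE")
    # showWarnings = FALSE keeps the call silent when the directory exists
    dir.create(gse_dir, recursive = TRUE, showWarnings = FALSE)
    gse_dir
}

# Demonstration in a temporary tree that mimics the assumed layout
root <- file.path(tempdir(), "GenClass-Stability")
nb <- file.path(root, "importTools", "notebooks")
dir.create(nb, recursive = TRUE, showWarnings = FALSE)
demo_gse_dir <- get_gse_dir(nb)
```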

### Load GEO data
Note: this function returns a list, since there could be multiple SubSeries. The data used in the GenClass-Stability project should not contain any SubSeries.

In [4]:
gse <- getGEO(GEO = series, destdir = gse_dir)
if(length(gse) > 1) {
    print("WARNING: multiple SubSeries.")
}

Found 1 file(s)
GSE27562_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
File stored at: 
/Users/terek/Documents/Github/GenClass-Stability/GSE/GPL570.soft
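If a series ever does yield more than one list element, one of them has to be chosen explicitly instead of always taking `gse[[1]]`. The sketch below assumes that the list names are the series matrix file names (as in the output above) and that, for multi-platform series, those names contain the platform accession. `pick_series_element` and the mock list are hypothetical and only illustrate the selection logic.

```r
# Hypothetical helper: choose one element of a getGEO()-style named list.
# `platform` is the GPL accession to look for in the element names.
pick_series_element <- function(gse_list, platform) {
    if (length(gse_list) == 1) {
        return(gse_list[[1]])
    }
    idx <- grep(platform, names(gse_list), fixed = TRUE)
    if (length(idx) != 1) {
        stop("Expected exactly one element for platform ", platform,
             ", found ", length(idx))
    }
    gse_list[[idx]]
}

# Mock list standing in for a getGEO() result (the values are placeholders)
mock_gse <- list(
    "GSE00000-GPL96_series_matrix.txt.gz"  = "eset for GPL96",
    "GSE00000-GPL570_series_matrix.txt.gz" = "eset for GPL570"
)
chosen <- pick_series_element(mock_gse, "GPL570")
```

Because the match is a plain substring search, a short accession can match a longer one that starts with the same digits; the helper stops with an error in that case rather than guessing.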


Get the phenotype data.  This is used later to get the column names used for GSMs.

In [5]:
series.pheno <- phenoData(gse[[1]])

### Download Supplementary Files
Get the raw expression data for the samples. Raw data is needed so that every series tested goes through the same normalization process. GSE objects contain data that is already normalized, but the normalization method may differ from series to series.

Note: getGEOSuppFiles doesn't check if the files already exist, so a check is performed.

In [6]:
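if (!dir.exists(file.path(gse_dir, series))) {
    # Download the supplementary files into gse_dir/<series>
    suppFiles = getGEOSuppFiles(GEO = series, makeDirectory = TRUE, baseDir = gse_dir)
    # Row names of the result are the downloaded file paths; the first is assumed to be the tar archive
    tarFiles = rownames(suppFiles)[1]
    # Extract the CEL files into gse_dir/<series>/data
    untarPath = file.path(dirname(tarFiles),"data")
    if (!dir.exists(untarPath)) {
        untar(tarFiles, exdir = untarPath)
    }
}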
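The check above treats an existing series directory as proof that the download and the extraction both finished. A finer-grained check could look at the archive and the extracted files separately. The following sketch uses only the file system; `supp_status` is a hypothetical helper, and it assumes the supplementary archive has a `.tar` extension and is extracted into a `data` subdirectory as in the cell above.

```r
# Hypothetical helper: report which step is still needed for a series directory.
supp_status <- function(series_dir) {
    data_dir <- file.path(series_dir, "data")
    has_tar <- length(list.files(series_dir, pattern = "\\.tar$")) > 0
    has_data <- dir.exists(data_dir) && length(list.files(data_dir)) > 0
    if (has_data) {
        "ready"
    } else if (has_tar) {
        "extract"
    } else {
        "download"
    }
}

# Demonstration with a temporary directory and empty placeholder files
demo_dir <- file.path(tempdir(), "GSE00000")
unlink(demo_dir, recursive = TRUE)       # start from a clean state
dir.create(demo_dir, showWarnings = FALSE)
status_before <- supp_status(demo_dir)
file.create(file.path(demo_dir, "GSE00000_RAW.tar"))
status_tar <- supp_status(demo_dir)
dir.create(file.path(demo_dir, "data"), showWarnings = FALSE)
file.create(file.path(demo_dir, "data", "GSM0000001.CEL.gz"))
status_ready <- supp_status(demo_dir)
```

A notebook cell could then call `getGEOSuppFiles` only for `"download"` and `untar` only for `"extract"`, which lets an interrupted run resume where it stopped.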

### Create AffyBatch from CELs

In [7]:
setwd(file.path(gse_dir,series,"data")) # TODO: convert series to upper case.
celfiles.data = ReadAffy()

### Download Probe Data

This cell does not check whether the probe annotation package is already installed; it simply reinstalls it, overwriting any existing copy. Having the package installed ensures that RMA and other normalization techniques can be used.

In [8]:
probe_name = annotation(celfiles.data)
lib <- paste(probe_name, ".db", sep = "")
source("https://bioconductor.org/biocLite.R")
biocLite(lib)
library(lib, character.only=TRUE)
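One possible way to avoid reinstalling the package on every run is to ask R whether it is already available before installing. The sketch below is an assumption-laden alternative, not the notebook's method: `ensure_package` is a hypothetical helper, and its default installer uses `BiocManager::install`, which has superseded `biocLite` in current Bioconductor releases and must itself be installed.

```r
# Hypothetical helper: install a package only when it is not already available.
# `installer` defaults to BiocManager::install (assumed to be installed).
ensure_package <- function(pkg, installer = function(p) BiocManager::install(p)) {
    if (requireNamespace(pkg, quietly = TRUE)) {
        return(FALSE)                    # already installed, nothing to do
    }
    installer(pkg)
    TRUE                                 # an installation was attempted
}

# Build the annotation package name the same way the notebook does
probe_name <- "hgu133plus2"              # example value of annotation(celfiles.data)
lib <- paste0(probe_name, ".db")

# Demonstration with a package that ships with R, so nothing is installed
installed_now <- ensure_package("stats")
```

In the notebook, `ensure_package(lib)` followed by `library(lib, character.only=TRUE)` would replace the unconditional install.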

### Perform RMA Normalization and Filtering

In [9]:
celfiles.rma <- rma(celfiles.data)
celfiles.filtered_rma <- nsFilter(celfiles.rma, require.entrez=TRUE, remove.dupEntrez=TRUE)

“replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’”


Background correcting
Normalizing
Calculating Expression



Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid


Attaching package: ‘IRanges’

The following object is masked from ‘package:simpleaffy’:

    members
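The two `nsFilter` arguments used above drop probesets without an Entrez Gene ID and keep a single probeset per Entrez ID, retaining the one with the largest value of the variance function (IQR by default). According to the genefilter documentation, `nsFilter` also applies a variance filter by default (`var.filter = TRUE`, `var.cutoff = 0.5`), so the call removes more than unannotated and duplicate probesets. The base R sketch below illustrates only the Entrez-related steps on made-up data; it is not the genefilter implementation.

```r
# Toy expression matrix: 4 probesets x 4 samples (values are made up)
expr <- rbind(
    probe_a = c(1, 2, 3, 4),
    probe_b = c(5, 5, 5, 6),
    probe_c = c(2, 8, 2, 8),
    probe_d = c(7, 7, 7, 7)
)
# Made-up probe-to-Entrez mapping; NA means "no Entrez ID"
entrez <- c(probe_a = "100", probe_b = "100", probe_c = "200", probe_d = NA)

# Step 1: drop probesets without an Entrez ID
keep <- !is.na(entrez)
expr_kept <- expr[keep, , drop = FALSE]
entrez_kept <- entrez[keep]

# Step 2: within each Entrez ID, keep the probeset with the largest IQR
iqr <- apply(expr_kept, 1, IQR)
best <- tapply(names(iqr), entrez_kept, function(p) p[which.max(iqr[p])])
expr_filtered <- expr_kept[as.vector(best), , drop = FALSE]
```

In the notebook itself, the counts of probesets removed at each stage are available from the `filter.log` element of the list returned by `nsFilter`.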





### Get expression data
The expression data is stored in a matrix.

In [11]:
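# The first element of the nsFilter result is the filtered ExpressionSet
rma_mat <- exprs(celfiles.filtered_rma[[1]])
# Label the columns with the sample titles from the phenotype data
colnames(rma_mat) <- pData(series.pheno)$title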
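Assigning the titles positionally relies on the columns of `rma_mat` being in the same order as the rows of the phenotype table. A more defensive variant matches on the GSM accession. The sketch below uses mock inputs and assumes that each CEL file name starts with its GSM accession and that the row names of the phenotype table are GSM accessions; both are assumptions to verify against the actual data.

```r
# Mock inputs standing in for the notebook objects (all values are made up):
# column names as they would be derived from CEL file names ...
cel_names <- c("GSM0000002.CEL.gz", "GSM0000001.CEL.gz", "GSM0000003_extra.CEL.gz")
# ... and a phenotype table with GSM accessions as row names
pheno <- data.frame(
    title = c("Control 1", "Case 1", "Case 2"),
    row.names = c("GSM0000001", "GSM0000002", "GSM0000003"),
    stringsAsFactors = FALSE
)

# Extract the GSM accession from each file name and look up its title
gsm <- sub("^(GSM[0-9]+).*$", "\\1", cel_names)
idx <- match(gsm, rownames(pheno))
stopifnot(!anyNA(idx))                   # every CEL file must have a phenotype row
titles <- pheno$title[idx]
```

Applied to the notebook, `cel_names` would be `colnames(rma_mat)` before relabeling and `pheno` would be `pData(series.pheno)`.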

### Store Expression
The matrix is used in further series-specific processing.

In [12]:
setwd(gse_dir)
result_path = file.path(gse_dir,series,"rma_mat_filt.txt")
write.table(rma_mat,result_path,sep = "\t", row.names=TRUE)
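For downstream steps, the stored file can be read back into a matrix. The following round-trip sketch uses a small made-up matrix and a temporary file in place of `rma_mat` and `result_path`; the probe IDs and titles are illustrative only.

```r
# Small stand-in for rma_mat (values and names are made up)
demo_mat <- matrix(c(1.5, 2.25, 3, 4.75), nrow = 2,
                   dimnames = list(c("1007_s_at", "1053_at"),
                                   c("Control 1", "Case 1")))
demo_path <- tempfile(fileext = ".txt")
write.table(demo_mat, demo_path, sep = "\t", row.names = TRUE)

# The header has one field fewer than the data rows, so read.table()
# uses the first column as row names; check.names = FALSE keeps the
# spaces in the sample titles.
demo_back <- as.matrix(read.table(demo_path, sep = "\t", header = TRUE,
                                  check.names = FALSE))
```

Without `check.names = FALSE`, `read.table` would rewrite titles such as `Control 1` into syntactic names like `Control.1`, which would no longer match the phenotype data.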