# GSE27562 - Notebook Version 1.0

This is a series-specific file that modifies the GSE27562 matrix imported through FIT and exports only the data needed for FaST processing.  See [GSE27562](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE27562) for more information on the series. This program is written in R.

From the series data:
>In total, we collected blood from **57** women with a diagnosis of breast cancer and **37** with a benign diagnosis. We also collected blood from **31** women with normal initial mammograms as negative controls and 15 breast cancer patients following surgery.

We ignore the 15 patients sampled following surgery, since they may or may not still have cancer tissue.  Instead, we consider only three classes: those with breast cancer (*malignant*), those with benign tumors (*benign*), and those with no cancer (*normal*).  This totals 57 + 37 + 31 = **125** samples.

### Get/Create Directories
Assumes this notebook is in `GenClass-Stability/importTools/notebooks/`

In [1]:
notebook_dir <- getwd() # get the working directory
main_dir <- dirname(dirname(notebook_dir)) # get two levels up
gse_dir = file.path(main_dir,"GSE","GSE27562")
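
As a quick illustration of the path logic above, the sketch below applies the same `dirname(dirname(...))` step to a hypothetical notebook location. The `/home/user` prefix and the `example_` variable names are assumptions made for this example only; they are not part of the project.

```r
# Hypothetical notebook location; only the last three components follow the
# layout this notebook assumes.
example_notebook_dir <- "/home/user/GenClass-Stability/importTools/notebooks"

# Two dirname() calls climb from notebooks/ to importTools/ to the repository root.
example_main_dir <- dirname(dirname(example_notebook_dir))

# The series directory is then built relative to the repository root.
example_gse_dir <- file.path(example_main_dir, "GSE", "GSE27562")
print(example_gse_dir)
```

If the notebook is started from a different directory, `getwd()` will not point at `notebooks/`, so checking `dir.exists(gse_dir)` before calling `setwd()` may be worthwhile.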

In [2]:
setwd(gse_dir)

### Import Matrix
Assumes this notebook is in `GenClass-Stability/importTools/notebooks/` and GSE data from SIT is stored in `GenClass-Stability/GSE`.

In [3]:
matrix <- read.table("filteredRMA.txt",header=TRUE,row.names=1)

### Modify Matrix
The remaining code is specific to the GSE and the data that you want to test.  However, the format for classes and expressions should always be the same.

Remove `PBMC_` from column names.

In [4]:
classes <- gsub("PBMC_", "", colnames(matrix))

Remove `_training_` from column names.

In [5]:
# Build the suffixes in descending order so that "_training_10" is
# stripped before "_training_1" can match its prefix.
toRemove <- paste0("_training_", 10:1)
for (suffix in toRemove) {
    classes <- sub(suffix, "", classes, fixed = TRUE)
}

Remove `_validation_` from column names.

In [6]:
# Same approach as above: descending order avoids partial matches
# such as "_validation_4" inside "_validation_47".
toRemove <- paste0("_validation_", 47:1)
for (suffix in toRemove) {
    classes <- sub(suffix, "", classes, fixed = TRUE)
}
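
The two suffix-removal loops could also be collapsed into a single regular expression. The following is a minimal sketch of that alternative, not the method this notebook uses; the `example_labels` values are hypothetical stand-ins for column names after the `PBMC_` prefix has been removed.

```r
# Hypothetical labels in the assumed "<class>_<set>_<number>" format.
example_labels <- c("normal_training_1",
                    "benign_training_10",
                    "malignant_validation_47")

# Strip a trailing "_training_<digits>" or "_validation_<digits>" in one pass.
# Unlike the loops above, this matches any number of digits and is anchored
# to the end of the string.
stripped <- sub("_(training|validation)_[0-9]+$", "", example_labels)
print(stripped)
```

Because the pattern matches the whole numeric suffix at once, the removal order no longer matters.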

Replace the column names with the cleaned class labels.

In [7]:
colnames(matrix) <- classes

Select the gene expressions for the three classes of interest (*normal*, *benign*, *malignant*).

In [8]:
patterns <- c("normal", "benign", "malignant")
# Keep only the columns whose name contains one of the class labels.
expressions = matrix[, grepl(paste(patterns, collapse = "|"), names(matrix))]
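
The per-class counts quoted from the series description (57 malignant, 37 benign, 31 normal) suggest a simple sanity check on the selected columns. The sketch below runs that check on a toy label vector built from the expected counts. Whether the labels in `filteredRMA.txt` produce exactly these counts is an assumption that has not been verified here, and `expected` and `example_labels` are names introduced for this example only.

```r
# Expected samples per class, taken from the series description.
expected <- c(benign = 37, malignant = 57, normal = 31)

# Toy labels standing in for the cleaned column names of `expressions`.
example_labels <- rep(names(expected), times = expected)

# Tabulate the labels and compare against the expected counts.
observed <- table(example_labels)
counts_match <- all(observed[names(expected)] == expected)
total <- sum(observed)
print(observed)
```

In the notebook itself, the toy vector would be replaced by the class labels derived from `colnames(expressions)`.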

### Write Classes
First, remove the numeric suffixes (such as `.1`) that the data frame appends to duplicate column names.

In [22]:
classes = gsub("\\..*","",colnames(expressions))
classes = as.matrix(classes)
classes = t(classes)
write.table(classes,file.path(gse_dir,"classes.txt"),sep = "\t", quote = FALSE, row.names=FALSE, col.names=FALSE)
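
To show what the `gsub("\\..*", "", ...)` step undoes, the sketch below uses `make.unique()` as a stand-in for the renaming that a data frame applies to duplicate column names. The `duplicated_labels` vector is hypothetical.

```r
# Hypothetical class labels containing duplicates.
duplicated_labels <- c("normal", "normal", "benign", "normal")

# make.unique() appends ".1", ".2", ... to repeated names.
unique_labels <- make.unique(duplicated_labels)
print(unique_labels)

# Removing everything from the first "." onward recovers the plain labels.
restored <- gsub("\\..*", "", unique_labels)
print(restored)
```

This only works because none of the class labels contains a `.` of its own; a label with a dot in it would be truncated by the same pattern.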

### Write Expressions

In [11]:
# Transpose so that each row is a sample, in the same order as classes.txt.
expressions = t(expressions)
write.table(expressions,file.path(gse_dir,"exprs.txt"),sep = "\t", row.names=FALSE, col.names=FALSE)