# GSE39582 - Notebook Version 1.0

This series-specific notebook modifies the GSE39582 matrix imported through FIT and exports only the data needed for FaST processing. See [GSE39582](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39582) for more information on the series. The code is written in R.

From the series data:
>**566** samples fulfilled RNA quality requirements. Unsupervised consensus hierarchical clustering applied to gene expression data from a discovery subset of 443 CC samples identified six molecular subtypes...The subtypes C4 and C6, but not the subtypes C1, C2, C3, and C5, were independently associated with shorter relapse-free survival...

From this we know that **566** samples passed RNA quality control and that the classification defines six subtypes (C1 to C6).

### Get/Create Directories
Assumes this notebook is in `GenClass-Stability/importTools/notebooks/`

In [112]:
notebook_dir <- getwd() # directory the notebook runs from
main_dir <- dirname(dirname(notebook_dir)) # two levels up: GenClass-Stability/
gse_dir <- file.path(main_dir, "GSE", "GSE39582")

In [113]:
setwd(gse_dir)
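
`setwd()` stops with an error when the target path does not exist, so it can help to check the directory first. The following is a minimal sketch under the layout assumed above; `check_dir` and the `demo_*` paths are hypothetical and are not part of the original notebook.

```r
# Hypothetical helper: fail early with a readable message if a directory is missing.
check_dir <- function(path) {
  if (!dir.exists(path)) {
    stop("Directory not found: ", path)
  }
  invisible(path)
}

# A temporary directory stands in for the repository so the sketch runs anywhere.
demo_main <- file.path(tempdir(), "GenClass-Stability")
demo_notebooks <- file.path(demo_main, "importTools", "notebooks")
demo_gse <- file.path(demo_main, "GSE", "GSE39582")
dir.create(demo_notebooks, recursive = TRUE, showWarnings = FALSE)
dir.create(demo_gse, recursive = TRUE, showWarnings = FALSE)

# Two levels up from the notebook directory is the repository root.
demo_root <- dirname(dirname(demo_notebooks))
check_dir(file.path(demo_root, "GSE", "GSE39582"))
```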

### Import Matrix
Assumes this notebook is in `GenClass-Stability/importTools/notebooks/` and GSE data from SIT is stored in `GenClass-Stability/GSE`.

In [114]:
matrix <- read.table("filteredRMA.txt",header=TRUE,row.names=1)
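
A quick sanity check after the import can catch a misread header or a non-numeric column. The sketch below uses a small made-up file in place of `filteredRMA.txt`; the probe and sample names are illustrative, and the assumption that the real file is whitespace-delimited with probes as rows is not confirmed by the source.

```r
# Toy stand-in for filteredRMA.txt: probes as rows, samples as columns.
demo_file <- tempfile(fileext = ".txt")
toy <- data.frame(
  GSM1 = c(5.1, 7.3),
  GSM2 = c(4.8, 7.9),
  row.names = c("1007_s_at", "1053_at")
)
write.table(toy, demo_file, sep = "\t", quote = FALSE)

# Same call as in the notebook: first row is the header, first column the row names.
demo_matrix <- read.table(demo_file, header = TRUE, row.names = 1)

dim(demo_matrix)                       # rows = probes, columns = samples
all(sapply(demo_matrix, is.numeric))   # every sample column should be numeric
```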

### Modify Matrix
The remaining code is specific to this GSE and to the data that you want to test. The output format for classes and expressions, however, should always be the same.

Re-import the GSE series with GEOquery to get the sample annotations used to rename the columns.

In [115]:
library(GEOquery)
gse <- getGEO(GEO = 'GSE39582', destdir = dirname(gse_dir))
if(length(gse) > 1) {
    print("WARNING: multiple SubSeries.")
}

Found 1 file(s)
GSE39582_series_matrix.txt.gz
Using locally cached version: /Users/terek/Documents/Github/GenClass-Stability/GSE/GSE39582_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
Using locally cached version of GPL570 found here:
/Users/terek/Documents/Github/GenClass-Stability/GSE/GPL570.soft 


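Rename each column to the subtype annotation of its sample. This assumes the columns of `matrix` are in the same order as the samples in the series.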

In [116]:
pheno <- phenoData(gse[[1]])
colnames(matrix) <- pheno$characteristics_ch1.30

Strip the `cit.molecularsubtype: ` prefix so that each column name is just the class label.

In [123]:
classes <- gsub("cit.molecularsubtype: ", "", colnames(matrix))
colnames(matrix) <- classes
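
The prefix removal can be tried on a few toy labels first. This is a sketch, not the notebook's exact call: it uses `sub()` with `fixed = TRUE` so the prefix is matched literally, and the `N/A` value is a made-up placeholder for a sample without a subtype.

```r
# Toy values in the format of the characteristics column.
raw_labels <- c(
  "cit.molecularsubtype: C1",
  "cit.molecularsubtype: N/A",
  "cit.molecularsubtype: C4"
)

# fixed = TRUE treats the prefix as a literal string, not a regular expression,
# so the "." in the prefix cannot match an arbitrary character.
labels <- sub("cit.molecularsubtype: ", "", raw_labels, fixed = TRUE)
labels
```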

Keep only the samples (columns) that are labeled with one of the six subtypes.

In [124]:
patterns <- c("C1", "C2", "C3", "C4", "C5", "C6")
expressions <- matrix[, grepl(paste(patterns, collapse = "|"), names(matrix))]
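
`grepl()` matches substrings, so a pattern such as `C1` would also match a hypothetical label like `C10`. As an alternative sketch, `%in%` selects only exact matches. The toy data frame and the `N/A` label below are made up for illustration.

```r
# Toy data frame whose column names are class labels, including a duplicate.
toy <- data.frame(s1 = c(1, 2), s2 = c(3, 4), s3 = c(5, 6), s4 = c(7, 8))
colnames(toy) <- c("C1", "N/A", "C1", "C4")

patterns <- c("C1", "C2", "C3", "C4", "C5", "C6")

# Exact matching: TRUE only where the whole column name is one of the labels.
keep <- colnames(toy) %in% patterns
selected <- toy[, keep]

# Subsetting a data frame makes duplicate names unique, so the second "C1"
# comes back with a numeric suffix.
colnames(selected)
```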

### Write Classes
First remove the numeric suffixes (`.1`, `.2`, and so on) that the data frame adds to make duplicate column names unique.

In [127]:
classes <- gsub("\\..*", "", colnames(expressions))
classes <- t(as.matrix(classes)) # one row, one column per sample
write.table(classes, file.path(gse_dir, "classes.txt"), sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
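
The suffix removal and the one-line output format can be checked on toy labels. This sketch writes to a temporary file rather than `classes.txt`; the label values are made up.

```r
# Column names as they look after subsetting a data frame with duplicate names.
suffixed <- c("C1", "C1.1", "C4", "C1.2", "C4.1")

# Drop everything from the first "." onward to recover the class label.
toy_classes <- sub("\\..*$", "", suffixed)
table(toy_classes) # number of samples per class

# A 1 x n matrix is written as a single tab-separated line.
classes_row <- t(as.matrix(toy_classes))
out_file <- tempfile(fileext = ".txt")
write.table(classes_row, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
readLines(out_file)
```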

### Write Expressions

In [128]:
expressions <- t(expressions) # transpose so that each row is one sample
write.table(expressions, file.path(gse_dir, "exprs.txt"), sep = "\t", row.names = FALSE, col.names = FALSE)
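
After both files are written, reading them back confirms that there is one class label per expression row. The sketch below uses toy data and temporary files. That a downstream consumer needs the two files to line up this way is an assumption, not something the source states.

```r
# Toy expressions: 3 probes (rows) x 2 samples (columns).
toy_expr <- data.frame(C1 = c(5.1, 7.3, 6.0), C1.1 = c(4.8, 7.9, 6.2))
toy_classes <- sub("\\..*$", "", colnames(toy_expr))

classes_file <- tempfile(fileext = ".txt")
exprs_file <- tempfile(fileext = ".txt")

# Same write calls as in the notebook, pointed at temporary files.
write.table(t(as.matrix(toy_classes)), classes_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
write.table(t(toy_expr), exprs_file, sep = "\t",
            row.names = FALSE, col.names = FALSE)

# Read both files back and compare their shapes.
classes_back <- scan(classes_file, what = character(), sep = "\t", quiet = TRUE)
exprs_back <- read.table(exprs_file, sep = "\t", header = FALSE)

# One label per sample row; one column per probe.
stopifnot(length(classes_back) == nrow(exprs_back))
dim(exprs_back)
```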