# Original Data Processing

## Overview
This notebook uses the TARGET AML data. The original file, AML_assay_clinical.csv, comes from the ConsensusML Hackathon GitHub repository. It contains both clinical data and RNAseq counts for approximately 187 patient samples.

In this notebook, we divide the data into the following two sets:
1. Patients x RNAseq. The entity_id column was chosen as the unique identifier for each patient sample. Some patients had multiple samples. Since each sample's RNAseq data was different, every sample was kept as a separate data point.
2. Patients x Risk. The data had four risk categories: low, standard, high, and unknown. We removed patient samples with unknown risk. Low risk patients were assigned category 0, while standard and high risk patients were both assigned category 1. One reason for merging the two is that there are far fewer high risk patients than patients in any other known risk category.

The output files are aml.data.RNA.csv and aml.data.labels.csv, respectively.
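
Because some patients contribute more than one sample, it can help to confirm that the chosen identifier really is unique per sample. The sketch below uses a toy table; the `patient_id` column is a hypothetical name introduced only for illustration and is not taken from AML_assay_clinical.csv.

```r
# Toy data: patient P1 has two samples, each with its own entity_id.
# NOTE: `patient_id` is a hypothetical column name, not one from the source file.
toy <- data.frame(
  patient_id = c("P1", "P1", "P2"),
  entity_id  = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every value is unique
n.dup.samples <- anyDuplicated(toy$entity_id)

# Number of samples contributed by each patient
samples.per.patient <- table(toy$patient_id)
```

If `n.dup.samples` is 0, each row can safely be treated as its own data point.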

### Read in Data

In [7]:
all.data <- read.table("AML_assay_clinical.csv", sep=",", header=TRUE)

In [40]:
dim(all.data)

### Find indices of interest:
- 57: column "Risk.group"
- 82: column "entity_id"
- 86: first column of RNAseq count data.
- 21490 to 21492: X_no_feature, X_ambiguous, X_alignment_not_unique columns
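
Hard-coded column numbers break if the input file ever gains or loses a column. As a hedged alternative, the positions can be looked up by name. The sketch below runs on a small toy data frame rather than the real file, and it assumes that every RNAseq column name starts with "ENSG", which holds for the columns shown in this notebook but has not been checked for all of them.

```r
# Toy stand-in for all.data with the same kinds of columns
toy <- data.frame(
  Risk.group             = c("Low", "High"),
  entity_id              = c("a", "b"),
  entity_type            = c("sample", "sample"),
  ENSG00000000003.13     = c(-3.3, -1.8),
  ENSG00000000005.5      = c(-5.4, -6.1),
  X_no_feature           = c(10, 12),
  X_ambiguous            = c(1, 2),
  X_alignment_not_unique = c(3, 4)
)

# Look up single columns by name
col.risk <- match("Risk.group", colnames(toy))
col.id   <- match("entity_id", colnames(toy))

# Assumption: RNAseq columns are exactly those whose names start with "ENSG"
cols.rna <- grep("^ENSG", colnames(toy))
```

On the real data, `cols.rna` would be expected to span columns 86 through 21489 if the assumption holds.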

In [19]:
# Locate the risk and sample-identifier columns by name
which(colnames(all.data) == "Risk.group")
which(colnames(all.data) == "entity_id")
# Index of the last column before the RNAseq data starts
which(colnames(all.data) == "entity_type")
# Index where the three trailing summary columns should start
ncol(all.data) - 2
# Check that this index is the first of the three unnecessary columns
colnames(all.data)[ncol(all.data) - 2]

### Remove patient samples with Unknown risk

In [41]:
# Flag patient samples whose risk group is "Unknown"
is.unknown <- all.data[,57] %in% "Unknown"
# Check how many patient samples we should be removing
sum(is.unknown)
# Remove those samples; a logical mask stays correct even if nothing matches
all.data <- all.data[!is.unknown,]
dim(all.data)
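
One thing to watch when dropping rows by index in R: if `which()` finds no matches, it returns an empty vector, and negative indexing with an empty vector selects zero rows instead of all of them. The toy example below (made-up data, not the AML file) contrasts that behaviour with a logical mask.

```r
# Toy data with no "Unknown" entries
toy <- data.frame(risk = c("Low", "High", "Standard"), stringsAsFactors = FALSE)

# which() returns integer(0) because nothing matches
idx <- which(toy$risk == "Unknown")

# Negative indexing with an empty vector drops every row
dropped.all <- toy[-idx, , drop = FALSE]

# A logical mask keeps every row when nothing matches
is.unknown <- toy$risk %in% "Unknown"
kept <- toy[!is.unknown, , drop = FALSE]
```

Here `dropped.all` has no rows, while `kept` still has all three.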

### Divide data into tables

In [42]:
data.RNA <- all.data[, c(82, 86:21489)]
data.RNA[1:5,1:5]

| | entity_id | ENSG00000000003.13 | ENSG00000000005.5 | ENSG00000000419.11 | ENSG00000000457.12 |
|---|---|---|---|---|---|
| 1 | c78f2949-4050-5f14-a401-beddf3ff9f61 | -3.3177181 | -5.407333 | 5.323569 | 2.961357 |
| 4 | 1507ac24-8ba2-5e57-a05a-86b5157c8377 | -1.7809495 | -6.121355 | 3.640955 | 3.708995 |
| 5 | 5f49d1ae-e832-59de-89e1-93b6327232f0 | -0.1600151 | -6.927145 | 4.30366 | 2.95584 |
| 6 | 9d464669-4fa5-5c48-b1b0-f452bee4682e | -2.0565009 | -4.74701 | 4.011269 | 1.708597 |
| 8 | dcc4d9f2-0f77-58ce-8064-f6e12bffa5e7 | -1.9643428 | -6.08368 | 5.198414 | 2.732599 |


In [88]:
# Build a two-column table: entity_id (copied from all.data) and risk (filled in next)
data.labels <- data.frame(matrix(nrow=nrow(all.data), ncol=2))
data.labels[,1] <- all.data[,82]
colnames(data.labels) <- c("entity_id", "risk")

In [89]:
# Change low risk labels to 0 and high + standard risk labels to 1
# (vectorized, so it also behaves correctly on an empty table)
data.labels[,2] <- ifelse(all.data[,57] == "Low", 0, 1)
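
A quick way to sanity-check the binary labels is to cross-tabulate them against the original risk groups. The sketch below uses a made-up risk vector rather than the real data, so the counts are illustrative only.

```r
# Toy risk groups standing in for all.data[, 57]
risk.group <- c("Low", "Standard", "High", "Low")

# Same mapping as in this notebook: Low -> 0, Standard and High -> 1
risk <- ifelse(risk.group == "Low", 0, 1)

# Rows are original groups, columns are binary labels;
# each row should have a nonzero count in exactly one column
check <- table(risk.group, risk)
print(check)
```

Any row with counts in both columns would indicate a labelling error.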

### Output tables to files

In [94]:
write.csv(data.RNA, file = "aml.data.RNA.csv", row.names=FALSE)
write.csv(data.labels, file = "aml.data.labels.csv", row.names=FALSE)
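
Since the two output files are meant to be used together, it may be worth confirming that their rows line up after writing. The following is a minimal sketch on toy tables written to temporary files; the values are invented and the real files are not touched.

```r
# Toy versions of data.RNA and data.labels (values are made up)
toy.RNA <- data.frame(
  entity_id          = c("a", "b"),
  ENSG00000000003.13 = c(-3.3, -1.8),
  stringsAsFactors   = FALSE
)
toy.labels <- data.frame(
  entity_id = c("a", "b"),
  risk      = c(0, 1),
  stringsAsFactors = FALSE
)

# Write to temporary files so no real output is overwritten
f.rna <- tempfile(fileext = ".csv")
f.lab <- tempfile(fileext = ".csv")
write.csv(toy.RNA, file = f.rna, row.names = FALSE)
write.csv(toy.labels, file = f.lab, row.names = FALSE)

# Read both files back and compare the identifier columns row by row
rna.back <- read.csv(f.rna, stringsAsFactors = FALSE)
lab.back <- read.csv(f.lab, stringsAsFactors = FALSE)
aligned <- identical(rna.back$entity_id, lab.back$entity_id)
```

If `aligned` is `TRUE`, row i of the RNAseq table and row i of the labels table describe the same sample.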