<h3>R notebook for predicting expression and scoring the results</h3>

Reading in Feature Matrix

In [1]:
featureMatrix=read.csv("training_matrix.csv", header = TRUE)

Reshaping featureMatrix so that each promoter is a row and each motif is a column, the layout needed for multivariate linear regression parameter estimation

In [2]:
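# Transpose so that each promoter is a row and each motif is a column
featureMatrix=t(featureMatrix)
# The first row now holds the motif patterns; move them into the column names
motifs=featureMatrix[1,]
featureMatrix=featureMatrix[-1,]
colnames(featureMatrix)=motifs
# Show the motifs
print(motifs)
# Print a sample to check that the reshaping worked
head(featureMatrix)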

 [1] "GTA[TC]GG[GA]TG"       "TTTTTTTTC"             "ATGT[AG]TGGG"         
 [4] "TT[TC]TTTTTT"          "[TG]C[CG]GCCT[AG][GC]" "ATC[CT]GTACA"         
 [7] "TTTTTC[AC]A"           "CCCGGCCC"              "GGCCCTGGC"            
[10] "[TC][CG][GC]CGCGTC"   


Unnamed: 0,GTA[TC]GG[GA]TG,TTTTTTTTC,ATGT[AG]TGGG,TT[TC]TTTTTT,[TG]C[CG]GCCT[AG][GC],ATC[CT]GTACA,TTTTTC[AC]A,CCCGGCCC,GGCCCTGGC,[TC][CG][GC]CGCGTC
RPL10,0,2,0,1,0,0,0,0,0,0
RPL11B,1,1,1,0,0,0,0,0,0,0
RPL12A,0,0,1,0,2,1,0,0,0,1
RPL13A,1,2,1,1,2,0,2,0,0,0
RPL13B,2,1,1,0,1,1,0,0,0,1
RPL14A,1,2,1,0,1,1,1,0,0,1
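Transposing a data frame that mixes a text column with numeric columns yields a character matrix, so the counts above are stored as strings at this point. A minimal sketch of an alternative, assuming the CSV keeps the motifs in its first column and one promoter per remaining column (the inline `csvText` and its header name `motif` are made up for illustration, not taken from the real file): reading the first column as row names keeps the counts numeric through the transpose.

```r
# Hypothetical miniature of training_matrix.csv: motifs in the first column,
# one column per promoter. The values are made up for illustration.
csvText <- "motif,RPL10,RPL11B,RPL12A
TTTTTTTTC,2,1,0
CCCGGCCC,0,0,1"

# row.names = 1 uses the motif column as row names instead of data;
# check.names = FALSE stops R from rewriting the column names.
counts <- read.csv(text = csvText, row.names = 1, check.names = FALSE)

# With only numeric columns left, the transpose is a numeric matrix
# (promoters in rows, motifs in columns).
toyMatrix <- t(as.matrix(counts))
print(toyMatrix)
```

With this layout the motif names arrive as column names directly, so the separate steps of extracting the first row and converting strings to integers would not be needed.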


In [3]:
# Prepend a column of ones so that the first parameter acts as the intercept
featureMatrix=cbind(rep(1,nrow(featureMatrix)),featureMatrix)
colnames(featureMatrix)=c("Ones",motifs)
#Printing a Sample to See if everything is correct
head(featureMatrix)

Unnamed: 0,Ones,GTA[TC]GG[GA]TG,TTTTTTTTC,ATGT[AG]TGGG,TT[TC]TTTTTT,[TG]C[CG]GCCT[AG][GC],ATC[CT]GTACA,TTTTTC[AC]A,CCCGGCCC,GGCCCTGGC,[TC][CG][GC]CGCGTC
RPL10,1,0,2,0,1,0,0,0,0,0,0
RPL11B,1,1,1,1,0,0,0,0,0,0,0
RPL12A,1,0,0,1,0,2,1,0,0,0,1
RPL13A,1,1,2,1,1,2,0,2,0,0,0
RPL13B,1,2,1,1,0,1,1,0,0,0,1
RPL14A,1,1,2,1,0,1,1,1,0,0,1
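The column of ones lets the intercept be estimated like any other parameter. Written out for a single promoter $i$, and assuming $x_{ij}$ denotes the count of motif $j$ in that promoter (notation introduced here for illustration), the model being fitted is:

$$ \hat{y}_i = \theta_0 \cdot 1 + \sum_{j=1}^{10} \theta_j x_{ij} $$

so $\theta_0$ is the predicted expression for a promoter that contains none of the ten motifs.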


Reading in expression values recorded for training set

In [4]:
expressionValues=read.table("DREAM6_ExPred_PromoterActivities.txt",header=FALSE)
head(expressionValues)
promoterNames=expressionValues[,1]
expressionValues=expressionValues[,-1]

Unnamed: 0,V1,V2
1,RPL10,2.84
2,RPL11B,1.59
3,RPL12A,0.92
4,RPL13A,1.2
5,RPL13B,1.66
6,RPL14A,1.62
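The regression that follows pairs row i of the feature matrix with element i of the expression values, so both must list the promoters in the same order. A minimal sketch of a check and realignment, using made-up stand-ins for the two objects (the `toy...` names and the shuffled order are assumptions for illustration):

```r
# Toy stand-ins for the notebook's objects (values made up for illustration).
toyFeatures <- matrix(c(1, 1, 1, 0, 2, 1), nrow = 3,
                      dimnames = list(c("RPL10", "RPL11B", "RPL12A"),
                                      c("Ones", "TTTTTTTTC")))
toyExpression <- data.frame(V1 = c("RPL12A", "RPL10", "RPL11B"),
                            V2 = c(0.92, 2.84, 1.59))

# match() gives, for each feature-matrix row, the position of that promoter
# in the expression table; a missing promoter would show up as NA.
idx <- match(rownames(toyFeatures), toyExpression$V1)
stopifnot(!anyNA(idx))

# Reorder the expression values so that row i of the features and
# element i of the response refer to the same promoter.
yAligned <- toyExpression$V2[idx]
names(yAligned) <- rownames(toyFeatures)
print(yAligned)
```

If the two files are already in the same order, `idx` is simply 1, 2, 3, ... and the reordering changes nothing.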


Now that we have both the feature matrix X and the observed values y, we can use the normal equations to find the closed-form least-squares solution. The normal equations are feasible here because the number of features is small (10 motifs plus the intercept), so only an 11 × 11 matrix has to be inverted.

The parameter vector theta can be estimated as follows:
$$ \theta= (X^{T}X)^{-1}X^{T}y$$
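For reference, a short sketch of where this formula comes from, assuming the usual least-squares cost (the symbol $J$ is introduced here and is not used elsewhere in the notebook). Setting the gradient of the cost to zero gives a linear system in $\theta$:

$$ J(\theta) = \lVert X\theta - y \rVert^{2}, \qquad \nabla_{\theta} J = 2X^{T}(X\theta - y) = 0 \;\Rightarrow\; X^{T}X\,\theta = X^{T}y $$

Solving that system gives the expression above, provided $X^{T}X$ is invertible, which requires the columns of $X$ to be linearly independent.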

In [5]:
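# The matrix still holds the counts as strings; convert every entry to an integer
featureMatrix=t(apply(featureMatrix,1,strtoi))
# Set the column names again after the conversion
colnames(featureMatrix)=c("Ones",motifs)
# Print a sample to check that everything is OK
head(featureMatrix)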

Unnamed: 0,Ones,GTA[TC]GG[GA]TG,TTTTTTTTC,ATGT[AG]TGGG,TT[TC]TTTTTT,[TG]C[CG]GCCT[AG][GC],ATC[CT]GTACA,TTTTTC[AC]A,CCCGGCCC,GGCCCTGGC,[TC][CG][GC]CGCGTC
RPL10,1,0,2,0,1,0,0,0,0,0,0
RPL11B,1,1,1,1,0,0,0,0,0,0,0
RPL12A,1,0,0,1,0,2,1,0,0,0,1
RPL13A,1,1,2,1,1,2,0,2,0,0,0
RPL13B,1,2,1,1,0,1,1,0,0,0,1
RPL14A,1,1,2,1,0,1,1,1,0,0,1
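An alternative way to do this conversion, shown as a sketch on a made-up two-row matrix rather than the real data: `storage.mode<-` changes the stored type of a matrix in place and keeps its row and column names.

```r
# Made-up character matrix standing in for the transposed feature matrix.
toyChar <- matrix(c("1", "1", "2", "0"), nrow = 2,
                  dimnames = list(c("RPL10", "RPL11B"),
                                  c("Ones", "TTTTTTTTC")))

# Convert the stored type from character to integer; dimensions and
# dimnames are attributes of the matrix and are left untouched.
toyInt <- toyChar
storage.mode(toyInt) <- "integer"

print(toyInt)
str(toyInt)
```

Because the names survive the conversion, no follow-up `colnames()` assignment is needed with this approach.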


In [6]:
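# Standard notation: X is the design matrix, y the vector of observed expression values
X=featureMatrix
y=expressionValues
# Normal equations: theta = (X'X)^-1 X'y
theta=solve(t(X)%*%(X))%*%t(X)%*%y
rownames(theta)=NULL
# Print the estimated parameters (row 1 is the intercept, rows 2 to 11 follow the motif order)
print(theta)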

              [,1]
 [1,]  1.691091520
 [2,]  0.037912910
 [3,]  0.118572795
 [4,] -0.184852803
 [5,]  0.086321790
 [6,] -0.129624600
 [7,] -0.081866230
 [8,]  0.007193555
 [9,] -0.070986256
[10,] -0.378224878
[11,] -0.133789078
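Forming the inverse explicitly works for a matrix this small, but the same estimate can also be obtained by solving the linear system directly, and it can be cross-checked against R's built-in `lm()`. A minimal sketch on made-up toy data (the `toy...` and `theta...` names and all values below are illustrative and not part of the notebook's data set):

```r
# Toy design matrix: a column of ones plus two made-up count features.
toyX <- cbind(Ones = 1,
              motifA = c(0, 1, 2, 1, 0, 2),
              motifB = c(1, 0, 1, 2, 2, 0))
toyY <- c(1.9, 1.4, 1.1, 0.8, 1.2, 1.6)

# Solve the linear system X'X theta = X'y directly instead of forming
# the inverse explicitly; crossprod(toyX) is t(toyX) %*% toyX.
thetaNormal <- solve(crossprod(toyX), crossprod(toyX, toyY))

# lm() fits the same least-squares model through a QR decomposition.
# "- 1" drops lm's own intercept because toyX already has the Ones column.
thetaLm <- coef(lm(toyY ~ toyX - 1))

# The two estimates should agree up to floating-point error.
print(cbind(normalEquations = drop(thetaNormal), lm = unname(thetaLm)))
```

Solving the system avoids computing an inverse that is never needed on its own, and the `lm()` comparison is a cheap sanity check that the design matrix was assembled as intended.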


Reading in prediction matrix