# HW4 Problem:  Predicting Solubility -- using Applied Predictive Modeling

For this problem you will analyze a variant of the <i>Solubility</i> dataset studied in [APM]
(the <i>Applied Predictive Modeling</i> course text).  You can use the results about Solubility in this book in any way you choose.

This is easy if you use the APM methodology, which helps automate construction of models.
Some code from the [APM] book is also included below.

Because the solubility data we provide is not the same as the data used in the book,
the best-performing models derived in the book may not yield the best results for you.
However, the models presented in the book should be very good starting points.

<hr style="border-width:20px;">

#  The Goal

In this assignment you are to predict the solubility values for a set of test data:
<ul><li>
Given the file <tt>training_set.csv</tt>, develop a regression model that is as accurate as possible.
</li><li>
Use your model to predict solubility for each row of data in <tt>test_set.csv</tt>.
</li><li>
Put your predictions in a .csv file called  <tt>HW4_Solubility_Predictions.csv</tt> and upload it to CCLE.
</li></ul>

<hr style="border-width:20px;">

## Step 1: download your data, using your UID

<blockquote>

Download the solubility data at:
<br/>
http://datamining.cs.ucla.edu/cs249/hw4/solubility/___PUT_YOUR_UID_HERE___.zip

<br/>
<br/>
<i>For example, if your UID is  123456789, download the file</i>
    http://datamining.cs.ucla.edu/cs249/hw4/solubility/123456789.zip
    
</blockquote>
    
This zip file has two csv data files:  a training set and a test set.
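
The download step can be scripted in R. The following is a minimal sketch, assuming the example UID from above and an output folder named `HW4_solubility_data` (both are placeholders to replace with your own values). The actual download is switched off by default because it needs network access.

```r
# Hypothetical UID (the example value from Step 1); replace with your own.
uid <- "123456789"

# Build the download URL from the pattern given in Step 1.
base_url <- "http://datamining.cs.ucla.edu/cs249/hw4/solubility/"
zip_url  <- paste0(base_url, uid, ".zip")
zip_file <- paste0(uid, ".zip")

# Set to TRUE to actually fetch and unpack the archive (needs network access).
run_download <- FALSE
if (run_download) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = "HW4_solubility_data")  # assumed folder name
  print(list.files("HW4_solubility_data", recursive = TRUE))
}

print(zip_url)
```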

<hr style="border-width:20px;">

## Step 2: construct a model from <tt>training_set.csv</tt>

Using the <tt>training_set.csv</tt> data, construct a regression model.

<br/>
<b>YOU CAN USE ANY ENVIRONMENT YOU LIKE TO BUILD A REGRESSION MODEL.</b>
Please construct the most accurate models you can.
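
Before submitting, it can help to estimate accuracy on data the model has not seen. The sketch below illustrates the idea with a simple hold-out split and a plain linear model on synthetic data; the data frame `toy`, its columns, and the 80/20 split are assumptions made for illustration, not part of the assignment data.

```r
set.seed(1)

# Synthetic stand-in for a training set: two predictors and a response.
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$solubility <- 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n, sd = 0.3)

# Hold out 20% of the rows; fit on the remaining 80%.
holdout <- sample(n, size = 40)
fit     <- lm(solubility ~ ., data = toy[-holdout, ])

# Score the model on the held-out rows only.
pred <- predict(fit, newdata = toy[holdout, ])
obs  <- toy$solubility[holdout]
rmse <- sqrt(mean((obs - pred)^2))
r2   <- cor(obs, pred)^2   # squared correlation between observed and predicted

cat("hold-out RMSE:", round(rmse, 3), " R-squared:", round(r2, 3), "\n")
```

The same pattern applies to any model type: fit on one part of <tt>training_set.csv</tt>, score on the rest, and compare candidates on the held-out score rather than on the training fit.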

<hr style="border-width:20px;">

## Step 3: generate predictions from <tt>test_set.csv</tt>
    
The rows of file <tt>test_set.csv</tt> have input features for a number of molecules.
Using your regression model, produce a solubility prediction for each of them.

<br/>
Put one predicted solubility value per line in a CSV file <tt>HW4_Solubility_Predictions.csv</tt>.
This file should also have the header line "<tt>Solubility</tt>".

<br/>
<i>Your score on this problem will be the R-squared value of these predictions.</i>
<br/>
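
The file format and the score can be sketched as follows. The prediction values here are made up, the file is written to a temporary directory, and the R-squared formula shown (one minus the ratio of residual to total sum of squares) is one common definition; the text above does not say which definition the grading program uses.

```r
# Made-up predictions standing in for the output of predict() on the test set.
preds <- c(-3.21, 0.45, -1.08)

# One value per line under a "Solubility" header, with no row names.
out_file <- file.path(tempdir(), "HW4_Solubility_Predictions.csv")
write.csv(data.frame(Solubility = preds), file = out_file, row.names = FALSE)

# Read the file back to confirm its shape.
check <- read.csv(out_file)
print(check)

# R-squared as 1 - SS_residual / SS_total (an assumed definition).
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

print(r_squared(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8)))  # 0.98
```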

<hr style="border-width:20px;">

## Step 4: upload <tt>HW4_Solubility_Predictions.csv</tt> and your notebook to CCLE

Finally, go to CCLE and upload:
<ul><li>
your output CSV file <tt>HW4_Solubility_Predictions.csv</tt>
</li><li>
your notebook file <tt>HW4_Solubility_Predictions.ipynb</tt>
</li></ul>

We are not planning to run any of the uploaded notebooks.
However, your notebook should contain the commands you used to develop your models,
in order to show your work.
As announced, all assignment grading in this course will be automated,
and the notebook is needed in order to check results of the grading program.

<hr style="border-width:20px;">


In [1]:
not.installed <- function(pkg) !is.element(pkg, installed.packages()[,1])
    
#if (not.installed("caret")) install.packages("caret", repos="http://cran.us.r-project.org")

library(caret)

Loading required package: lattice
Loading required package: ggplot2


In [2]:
#library(help=caret)

### Ridge Regression

In [1]:
solTrain = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/training_set.csv"), header=TRUE )
solTest = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/test_set.csv"), header=TRUE )

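# The first 228 columns are the predictors; the response is the solubility column.
solTrainX = solTrain[,1:228]
solTrainY = solTrain$solubility

# The test set contains predictor columns only.
solTestX = solTest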

#detach(package:caret)  
library(caret)

library(kernlab)

library(doMC)
registerDoMC(4)

ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))

set.seed(100)
ridgeTune <- train(x = solTrainX, y = solTrainY,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = trainControl(method = "repeatedcv", repeats = 5),
                   preProc = c("center", "scale", "BoxCox", "zv"))

ridgeTune

## Test set predictions

predictedClasses <- predict(ridgeTune, solTestX)
predictedClasses

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.2



Ridge Regression 

855 samples
228 predictors

Pre-processing: centered (228), scaled (228), Box-Cox transformation (6) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 769, 770, 769, 769, 768, 770, ... 
Resampling results across tuning parameters:

  lambda       RMSE       Rsquared   RMSE SD     Rsquared SD
  0.000000000  0.7701664  0.8624147  0.08729035  0.03081631 
  0.007142857  0.7128186  0.8808624  0.07383092  0.02527786 
  0.014285714  0.6970521  0.8859298  0.07205995  0.02423736 
  0.021428571  0.6892628  0.8885236  0.07132552  0.02375062 
  0.028571429  0.6851488  0.8900134  0.07096535  0.02348634 
  0.035714286  0.6831351  0.8908953  0.07079528  0.02334170 
  0.042857143  0.6824745  0.8914032  0.07073679  0.02327055 
  0.050000000  0.6827585  0.8916649  0.07074842  0.02324783 
  0.057142857  0.6837402  0.8917572  0.07080972  0.02325987 
  0.064285714  0.6852685  0.8917268  0.07089261  0.02329241 
  0.071428571  0.6872326  0.8916083  0.070999

In [104]:
## Test set predictions

predictedSolubility <- predict(ridgeTune, solTestX)
predictedSolubility

# Write a fresh one-column CSV with the required "Solubility" header.
# write.csv overwrites the file instead of appending to it.
write.csv(data.frame(Solubility = predictedSolubility),
          file="/Users/rutuja/Documents/Current Topics - Data Science/HW4/Solubility_output.csv",
          row.names=FALSE)

### Partial Least Squares

In [1]:
solTrain = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/training_set.csv"), header=TRUE )
solTest = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/test_set.csv"), header=TRUE )

solTrainX = solTrain[,1:228]
solTrainY = solTrain$solubility

solTestX = solTest

#detach(package:caret)  
library(caret)

library(kernlab)

library(doMC)
registerDoMC(4)

set.seed(100)
plsTune <- train(x = solTrainX, y = solTrainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = trainControl(method = "repeatedcv", repeats = 5),
                 preProc = c("center", "scale", "BoxCox"))

plsTune

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: pls

Attaching package: ‘pls’

The following object is masked from ‘package:caret’:

    R2

The following object is masked from ‘package:stats’:

    loadings



Partial Least Squares 

855 samples
228 predictors

Pre-processing: centered (228), scaled (228), Box-Cox transformation (6) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 769, 770, 769, 769, 768, 770, ... 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   RMSE SD     Rsquared SD
   1     1.2308958  0.6400621  0.13412152  0.07756683 
   2     0.9994364  0.7638124  0.08814566  0.03789429 
   3     0.8998337  0.8085406  0.09545138  0.03630578 
   4     0.8094510  0.8444474  0.07933747  0.03082825 
   5     0.7866076  0.8530619  0.07919433  0.02906359 
   6     0.7575826  0.8637842  0.07735704  0.02767871 
   7     0.7371223  0.8710294  0.07605144  0.02697588 
   8     0.7238008  0.8758746  0.07277506  0.02555558 
   9     0.7115137  0.8802124  0.07064076  0.02509093 
  10     0.7024398  0.8829195  0.07178392  0.02573758 
  11     0.6977431  0.8845717  0.06882219  0.02446080 
  12     0.6936742  0.8860622  0.06720331  0.023468

### Cubist - Best Model

In [1]:
solTrain = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/training_set.csv"), header=TRUE )
solTest = read.csv( file("/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_solubility_data/test_set.csv"), header=TRUE )

solTrainX = solTrain[,1:228]
solTrainY = solTrain$solubility

solTestX = solTest

#detach(package:caret)  
library(caret)

library(kernlab)

library(doMC)
registerDoMC(4)

set.seed(100)
cbFit <- train(x = solTrainX, y = solTrainY,
                   method = "cubist",
                   trControl = trainControl(method = "repeatedcv", repeats = 5),
                   preProc = c("center", "scale", "BoxCox"))

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: Cubist


In [2]:
cbFit

Cubist 

855 samples
228 predictors

Pre-processing: centered (228), scaled (228), Box-Cox transformation (6) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 769, 770, 769, 769, 768, 770, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE       Rsquared   RMSE SD     Rsquared SD
   1          0          0.7195140  0.8764835  0.07798379  0.03047806 
   1          5          0.6718760  0.8926382  0.07857420  0.02771545 
   1          9          0.6674753  0.8939317  0.07466411  0.02691704 
  10          0          0.6488811  0.8995890  0.06023732  0.02046697 
  10          5          0.6170116  0.9093927  0.06377073  0.01966855 
  10          9          0.6107867  0.9111466  0.05700288  0.01800022 
  20          0          0.6428688  0.9014879  0.06187159  0.02098340 
  20          5          0.6115506  0.9110348  0.06527973  0.01984940 
  20          9          0.6053685  0.9127894  0.05773374  0.01810042 

RMSE was used t
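
By default, caret keeps the tuning combination with the smallest resampled RMSE. As a small illustration, the same selection can be reproduced in base R from the values printed in the table above; the data frame name `cbResults` is made up for this sketch.

```r
# Resampled RMSE values copied from the Cubist tuning table above.
cbResults <- data.frame(
  committees = c(1, 1, 1, 10, 10, 10, 20, 20, 20),
  neighbors  = c(0, 5, 9, 0, 5, 9, 0, 5, 9),
  RMSE       = c(0.7195140, 0.6718760, 0.6674753,
                 0.6488811, 0.6170116, 0.6107867,
                 0.6428688, 0.6115506, 0.6053685)
)

# Pick the row with the smallest RMSE.
best <- cbResults[which.min(cbResults$RMSE), ]
print(best)
```

On a fitted caret model the chosen parameters are also available directly as `cbFit$bestTune`.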

In [3]:
predictedSolubility <- predict(cbFit, solTestX)

# Write a fresh one-column CSV with the required "Solubility" header.
# write.csv overwrites the file instead of appending to it.
write.csv(data.frame(Solubility = predictedSolubility),
          file="/Users/rutuja/Documents/Current Topics - Data Science/HW4/HW4_Solubility_Predictions.csv",
          row.names=FALSE)