#R Datasets for CS249

This notebook explains how to install datasets from the CS249 texts ([ISL], [ESL], [APM]).
These datasets are modest in size, but well suited to learning modeling concepts.

It also shows how to download a "dataset of datasets": an index of about 750 datasets that are available in various R packages. They vary in size and usefulness, but many are famous datasets that are worth knowing about.

In [None]:
# Older notebooks loaded the R magics with `%load_ext rmagic` (shipped with IPython).
# That extension is now part of rpy2 itself:

%load_ext rpy2.ipython

In [34]:
%%R

# TRUE if `pkg` is not among the locally installed packages
not.installed <- function(pkg) !is.element(pkg, installed.packages()[, "Package"])


#Datasets from ISL (Introduction to Statistical Learning)


[ISL] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning,
Springer-Verlag, 2013, ISBN: 9781461471387.

An introductory companion to [ESL] that aims to make the material more accessible, with examples in R.

home page: http://www-bcf.usc.edu/~gareth/ISL/ (with links to data, R code, errata)

free book PDF download: linked from the home page, http://www-bcf.usc.edu/~gareth/ISL/ (large PDF)



In [35]:
%%R

if (not.installed("ISLR"))  install.packages("ISLR")

library(help=ISLR)

Documentation for package 'ISLR'
		Information on package 'ISLR'

Description:

Package:            ISLR
Type:               Package
Title:              Data for An Introduction to Statistical Learning
                    with Applications in R
Version:            1.0
Date:               2013-06-10
Author:             Gareth James, Daniela Witten, Trevor Hastie and Rob
                    Tibshirani
Maintainer:         Trevor Hastie <hastie@stanford.edu>
Suggests:           MASS
Description:        The collection of datasets used in the book "An
                    Introduction to Statistical Learning with
                    Applications in R"
License:            GPL-2
LazyLoad:           yes
LazyData:           yes
URL:                http://www.StatLearning.com
Packaged:           2013-06-10 19:32:17 UTC; hastie
Depends:            R (>= 2.10)
NeedsCompilation:   no
Repository:         CRAN
Date/Publication:   2013-06-11 00:17:23
Built:              R 3.1.2; ; 2015-01-09 19:21:02 UT

#Datasets used in [ESL] (Elements of Statistical Learning)

    
[ESL] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction,  Springer-Verlag, 2009, ISBN: 9780387848570.

A classic data mining text, written from the perspective of researchers in statistics.

home page: http://statweb.stanford.edu/~tibs/ElemStatLearn (includes data, R code, errata)

free book PDF download: http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html (Large PDF)


In [36]:
%%R

if (not.installed("ElemStatLearn"))  install.packages("ElemStatLearn")

library(help=ElemStatLearn)


Documentation for package 'ElemStatLearn'
		Information on package 'ElemStatLearn'

Description:

Package:               ElemStatLearn
Date:                  2012-04-05
Title:                 Data sets, functions and examples from the book:
                       "The Elements of Statistical Learning, Data
                       Mining, Inference, and Prediction" by Trevor
                       Hastie, Robert Tibshirani and Jerome Friedman.
Version:               2012.04-0
Author:                Material from the book's webpage, R port and
                       packaging by Kjetil Halvorsen
Description:           Useful when reading the book above mentioned, in
                       the documentation referred to as `the book'.
Depends:               R (>= 2.10.0), stats
Suggests:              gam, splines, MASS, class, leaps, mda, lasso2,
                       lars, boot, prim, earth
LazyData:              yes
LazyDataCompression:   xz
Maintainer:            Kjetil Halvorsen <kjeti

#Datasets used in [APM] (Applied Predictive Modeling)

[APM] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer-Verlag, 2013, ISBN: 9781461468486 (print), 9781461468493 (online).

This book is similar to the [ISL] and [ESL] texts, but focuses on practical
model development and evaluation.  It offers a lot of hands-on advice,
including case studies with R scripts.

At the time of writing, the book appears to be available as a free PDF download:

   http://link.springer.com/book/10.1007%2F978-1-4614-6849-3

The UCLA Library also provides access to this, and many other,
books from Springer-Verlag:

   http://www.library.ucla.edu/libraries/sel/e-books-reference-sources

The book provides R packages (caret and AppliedPredictiveModeling)
that are useful for semi-automated model selection and evaluation.
The book's examples use the following packages and models:

 C5.0, J48, M5, Nelder-Mead, PART, avNNet, cforest, ctree, cubist, earth,
 enet, fda, gbm, glm, glmnet, knn, lda, lm, mda, nb, nnet, pam, pcr, pls,
 rf, ridge, rpart, sparseLDA, svmPoly, svmRadial, treebag

Overview article (in PDF) describing the caret package:

<a href="http://www.jstatsoft.org/v28/i05/paper">www.jstatsoft.org/v28/i05/paper</a>


##Install all APM packages, datasets, and R code:


In [44]:
%%R

# Grid Search is often used in APM to search a model's parameter space, and
# some chapters use the "doMC" package to do Multi-Core computation
# (supported only on Linux or MacOS):

if (not.installed("doMC"))  install.packages("doMC")   # multicore computation in R


In [None]:
%%R

if (not.installed("AppliedPredictiveModeling")) install.packages("AppliedPredictiveModeling")

library(help=AppliedPredictiveModeling)


In [None]:
%%R

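library(AppliedPredictiveModeling)   # attach the package; library(help=...) only displays its description

getPackages(1:19)    # download ALL the packages used in the book, including caret

scriptLocation()     # get the directory where book scripts are located

dir(scriptLocation())   # list the chapter scripts in that directory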

In [None]:
%%R

# Chs 10 and 17 evaluate many different models in case studies.
# To run Ch.10:

current_working_directory = getwd()  # remember current directory

chapter_code_directory = scriptLocation()
setwd( chapter_code_directory )
dir()
source("10_Case_Study_Concrete.R", echo=TRUE)

setwd(current_working_directory)  # return to working directory

##APM provides an extensive list of Models (in the caret package)

The chapters build the following models:
<pre>
 02_A_Short_Tour.R           lm, earth
 04_Over_Fitting.R           svmRadial, glm
 06_Linear_Regression.R      lm, pls, pcr, ridge, enet
 07_Non-Linear_Reg.R         avNNet, earth, svmRadial, svmPoly, knn
 08_Regression_Trees.R       rpart, ctree, M5, treebag, rf, cforest, gbm
 10_Case_Study_Concrete.R    lm, pls, enet, earth, svmRadial, avNNet, rpart,
                             treebag, ctree, rf, gbm, cubist, M5, Nelder-Mead
 11_Class_Performance.R      glm
 12_Discriminant_Analysis.R  svmRadial, glm, lda, pls, glmnet, pam
 13_Non-Linear_Class.R       mda, nnet, avNNet, fda, svmRadial, svmPoly, knn, nb
 14_Class_Trees.R            rpart, J48, PART, treebag, rf, gbm, C5.0
 16_Class_Imbalance.R        rf, glm, fda, svmRadial, rpart, C5.0
 17_Job_Scheduling.R         rpart, lda, sparseLDA, nnet, pls, fda, rf, C5.0,
                             treebag, svmRadial
 19_Feature_Select.R         rf, lda, svmRadial, nb, glm, knn

caret's list of available models:

<a href="http://caret.r-forge.r-project.org/modelList.html">http://caret.r-forge.r-project.org/modelList.html</a>  # table of models

<a href="http://caret.r-forge.r-project.org/bytag.html">http://caret.r-forge.r-project.org/bytag.html</a>      # index of models by type

Training control methods used by the scripts:

 04_Over_Fitting.R           repeatedcv, cv, LOOCV, LGOCV, boot, boot632
 06_Linear_Regression.R      cv
 07_Non-Linear_Reg.R         cv
 08_Regression_Trees.R       cv, oob
 10_Case_Study_Concrete.R    repeatedcv
 11_Class_Performance.R      repeatedcv
 12_Discriminant_Analysis.R  cv, LGOCV
 13_Non-Linear_Class.R       LGOCV
 14_Class_Trees.R            LGOCV
 16_Class_Imbalance.R        cv
 17_Job_Scheduling.R         repeatedcv
 19_Feature_Select.R         repeatedcv, cv
 </pre>

#A Dataset of R Datasets

A list of <a href="https://github.com/vincentarelbundock/Rdatasets">about 750 datasets that are included in R packages</a>
was developed by Vincent Arel-Bundock.



There is an <a href="https://github.com/vincentarelbundock/Rdatasets/blob/master/Rdatasets.R">R script for copying and installing this dataset of datasets on your computer from its GitHub site</a>:

In [6]:
%%R

library(R2HTML)

try(dir.create('csv'))
try(dir.create('doc'))
try(dir.create('csv/datasets'))
try(dir.create('doc/datasets'))


packages = c("datasets", "boot", "KMsurv", "robustbase", "car", "cluster", "COUNT", "Ecdat", "gap", "ggplot2", "HistData", "lattice", "MASS", "plm", "plyr", "pscl", "reshape2", "rpart", "sandwich", "sem",  "survival", "vcd", "Zelig", "HSAUR", "psych", "quantreg", "geepack", "texmex", "multgee", "evir")
# Install only the packages that are not already installed.
# Credits: http://stackoverflow.com/a/9345167/756986
new.packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos="http://cran.rstudio.com")
index = data(package=packages)$results[,c(1,3,4)]
index = data.frame(index, stringsAsFactors=FALSE)
index_out = NULL

# Load packages which store datasets
for (i in packages) {
        library(i, character.only=TRUE)
}

# Save datasets
for (i in 1:nrow(index)) {
    dataset = index$Item[i]
    package = index$Package[i]
    # Load data in new environment (very hackish)
    e = new.env(hash = TRUE, parent = parent.frame(), size = 29L)
    cmd = paste('data(', dataset, ', envir=e)', sep='')
    eval(parse(text=cmd))
    d = e[[dataset]]
    # class(d) can have several elements (e.g. c("matrix", "array")), so test only the first
    if(class(d)[1] %in% c('data.frame', 'matrix', 'numeric', 'table', 'ts')){
        cat("Processing data set: ", dataset, "\n")
        if(class(d)[1]=='ts'){
            d = data.frame(time(d), d)
            colnames(d) = c('time', dataset)
        }
        try(dir.create(paste('csv/', package, sep='')))
        try(dir.create(paste('doc/', package, sep='')))
        dest_csv = paste('csv/', package, '/', dataset, '.csv', sep='')
        dest_doc = paste('doc/', package, '/', dataset, '.html', sep='')
        # Save data as CSV
        write.csv(d, dest_csv)
        # Save documentation as HTML
        help.ref = help(eval(dataset), package=eval(package))
        help.file = utils:::.getHelpFile(help.ref)
        tools::Rd2HTML(help.file, out=dest_doc)
        # Add entry to index out
        index_out = rbind(index_out, index[i,])
    }
}

# CSV index
index_out$csv = paste('https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/',
                      index_out$Package, '/', index_out$Item, '.csv', sep='')
index_out$doc = paste('https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/',
                      index_out$Package, '/', index_out$Item, '.html', sep='')
write.csv(index_out, file='datasets.csv', row.names=FALSE)

# HTML index
index_out$csv = paste("<a href='", index_out$csv, "'> CSV </a>", sep='')
index_out$doc = paste("<a href='", index_out$doc, "'> DOC </a>", sep='')
unlink('datasets.html')
rss = '
<style type="text/css">
  tr:nth-child(even){
          background-color: #E5E7E5;
  }
</style>
'
cat(rss, file='datasets.html')
HTML(index_out, file='datasets.html', row.names=FALSE, append=TRUE)


Processing data set:  AirPassengers 
Processing data set:  BJsales 
Processing data set:  BOD 
Processing data set:  Formaldehyde 
Processing data set:  HairEyeColor 
Processing data set:  InsectSprays 
Processing data set:  JohnsonJohnson 
Processing data set:  LakeHuron 
Processing data set:  LifeCycleSavings 
Processing data set:  Nile 
Processing data set:  OrchardSprays 
Processing data set:  PlantGrowth 
Processing data set:  Puromycin 
Processing data set:  Titanic 
Processing data set:  ToothGrowth 
Processing data set:  UCBAdmissions 
Processing data set:  UKDriverDeaths 
Processing data set:  UKgas 
Processing data set:  USAccDeaths 
Processing data set:  USArrests 
Processing data set:  USJudgeRatings 
Processing data set:  USPersonalExpenditure 
Processing data set:  VADeaths 
Processing data set:  WWWusage 
Processing data set:  WorldPhones 
Processing data set:  airmiles 
Processing data set:  airquality 
Processing data set:  anscombe 
Processing data set:  attenu 
Proce
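
The script above gives every dataset a CSV link and a documentation link that follow a fixed pattern: `csv/<package>/<item>.csv` and `doc/<package>/<item>.html`. As a minimal sketch in Python (the notebook's host language), the same links can be built for any package/dataset pair. The helper name `rdatasets_urls` is hypothetical, and the base URL is copied from the script, so it may not match the repository's current layout.

```python
# Base URL copied from the R script above; treat it as an assumption
# about how the repository is laid out.
BASE = "https://raw.github.com/vincentarelbundock/Rdatasets/master"


def rdatasets_urls(package, item):
    """Return (csv_url, doc_url) for one dataset, mirroring the script's paste() calls."""
    csv_url = f"{BASE}/csv/{package}/{item}.csv"
    doc_url = f"{BASE}/doc/{package}/{item}.html"
    return csv_url, doc_url


csv_url, doc_url = rdatasets_urls("datasets", "AirPassengers")
print(csv_url)
print(doc_url)
```

The sketch only builds strings. It does not check that the dataset exists or that the link resolves.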

In [None]:
%%R -w 800

table = data.frame( Item=index_out$Item, Title=substr(index_out$Title, 1,60) )
print(table)
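
The `datasets.csv` index written by the script can also be read from the Python side of the notebook. The sketch below assumes the file has the columns the script writes (`Package`, `Item`, `Title`, `csv`, `doc`). To stay self-contained it parses a small in-memory stand-in whose rows are illustrative only, and the helper names `read_index` and `short_titles` are hypothetical.

```python
import csv
import io
from collections import Counter

# Stand-in for datasets.csv; these rows are illustrative, not copied from the real index.
SAMPLE_INDEX = """\
"Package","Item","Title","csv","doc"
"datasets","AirPassengers","Monthly Airline Passenger Numbers 1949-1960","csv/datasets/AirPassengers.csv","doc/datasets/AirPassengers.html"
"datasets","Nile","Flow of the River Nile","csv/datasets/Nile.csv","doc/datasets/Nile.html"
"boot","acme","Monthly Excess Returns","csv/boot/acme.csv","doc/boot/acme.html"
"""


def read_index(handle):
    """Parse the index into a list of dicts, one per dataset."""
    return list(csv.DictReader(handle))


def short_titles(rows, width=60):
    """Pair each Item with its Title cut to `width` characters, like substr(Title, 1, 60)."""
    return [(row["Item"], row["Title"][:width]) for row in rows]


rows = read_index(io.StringIO(SAMPLE_INDEX))
per_package = Counter(row["Package"] for row in rows)  # number of datasets per package

for item, title in short_titles(rows):
    print(f"{item:<15} {title}")
print(dict(per_package))
```

To use the real index, replace the `io.StringIO(...)` argument with a file handle, for example `open("datasets.csv", newline="")`.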
