# Machine Learning using H2O in R
This demo performs a k-means clustering analysis of a data set using the R interface to the H2O machine learning platform. H2O is a Java-based machine learning platform that provides an R interface. If you don't yet have H2O installed in R, you can install it as follows:

In [None]:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download and install the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/9/R")))

You'll also need a fairly recent and working version of Java installed on your system. This shouldn't be an issue for most people. If you don't yet have R installed on your machine, you've come to the wrong notebook...
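
If you are unsure whether Java is available, a quick check from within R can save some debugging later. The following is a minimal base R sketch, assuming that a usable Java is one reachable as `java` on the system `PATH`; it does not verify that the version is recent enough for H2O.

```r
# Look for a `java` executable on the PATH; Sys.which() returns "" when none is found.
java_path <- unname(Sys.which("java"))

if (nzchar(java_path)) {
  # `java -version` prints the installed version to the console.
  system2(java_path, "-version")
} else {
  message("No java executable found on the PATH; install Java before calling h2o.init().")
}
```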

This tutorial uses data from http://archive.ics.uci.edu/ml/datasets/seeds. The data file `seeds_dataset.txt` contains 210 observations of 7 variables with an *a priori* grouping assignment.

We begin by loading the H2O library and initializing its runtime environment. H2O runs as a separate Java process, and the cluster initialization has many options (use `?h2o.init` to explore these).

In [1]:
library('h2o')

h2o.init(nthreads=1)

Loading required package: statmod

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc
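
As an illustration of the initialization options, the sketch below wraps `h2o.init()` with two of its documented arguments, `nthreads` and `max_mem_size`. The wrapper name `start_h2o` and its default values are assumptions made for this example, not part of the H2O package; check `?h2o.init` for the options supported by your installed version.

```r
# Hypothetical wrapper (not part of h2o): start H2O with explicit resource limits.
start_h2o <- function(n_threads = -1, max_mem = "2g") {
  if (!requireNamespace("h2o", quietly = TRUE)) {
    stop("The h2o package is not installed.")
  }
  h2o::h2o.init(
    nthreads = n_threads,     # -1 asks H2O to use all available cores
    max_mem_size = max_mem    # memory limit for the H2O Java process, e.g. "2g"
  )
}

# Example call (commented out so that nothing starts when the function is defined):
# start_h2o(n_threads = 2, max_mem = "4g")
```

Keeping the call inside a function makes the resource settings explicit and easy to change in one place.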



Next we'll get started with some data. Download the dataset `seeds_dataset.txt` and save it in the directory containing this notebook. To import data into H2O we use the `h2o.importFile()` function. The H2O data structures are separate from R data structures, with some key differences. However, we can export H2O structures as R data frames.

In [None]:
#Import data from text file
seeds.hex <- h2o.importFile('./seeds_dataset.txt')
#create R data frame
seedsDF <- as.data.frame(seeds.hex)
#inspect R data frame
head(seedsDF)
summary(seedsDF)
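
If the summary looks odd (for example, unexpected extra columns or missing values), it can help to check whether every line of the raw file has the same number of fields. The following is a minimal base R sketch, assuming the file is tab-delimited; `field_count_table` is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper: tabulate how many tab-separated fields each line contains.
# A clean file yields a single field count; several counts point to
# inconsistent delimiters (such as repeated tabs) on some lines.
field_count_table <- function(path) {
  table(utils::count.fields(path, sep = "\t"))
}

# Only run the check if the data file is actually present.
if (file.exists("./seeds_dataset.txt")) {
  print(field_count_table("./seeds_dataset.txt"))
}
```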

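If the inspection reveals problems with the raw file, clean up the data, save the result as `seeds_dataset_fixed.txt`, and re-import.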
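
One possible clean-up is sketched below in base R. It assumes that the problem is inconsistent delimiters (for example, repeated tabs between values) and that each row should hold 8 fields: the 7 variables plus the group assignment. The helper `fix_delimiters` is hypothetical, and a different kind of defect in the file would need a different fix.

```r
# Hypothetical helper: rewrite a whitespace-delimited file so that every row
# uses exactly one tab between fields.
fix_delimiters <- function(in_path, out_path, n_fields = 8) {
  raw_lines <- readLines(in_path)
  raw_lines <- raw_lines[nzchar(trimws(raw_lines))]      # drop blank lines

  # Split each line on runs of whitespace, so repeated tabs act as one delimiter.
  fields <- strsplit(trimws(raw_lines), "[[:space:]]+")

  # Stop rather than silently write a malformed file.
  bad <- which(lengths(fields) != n_fields)
  if (length(bad) > 0) {
    stop("Unexpected field count on non-blank line(s): ", paste(bad, collapse = ", "))
  }

  writeLines(vapply(fields, paste, character(1), collapse = "\t"), out_path)
  invisible(length(fields))                              # number of rows written
}

# Only run the clean-up if the raw file is actually present.
if (file.exists("./seeds_dataset.txt")) {
  fix_delimiters("./seeds_dataset.txt", "./seeds_dataset_fixed.txt")
}
```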

In [None]:
#Import data from text file
seeds.hex <- h2o.importFile('./seeds_dataset_fixed.txt')
#create R data frame
seedsDF <- as.data.frame(seeds.hex)
#inspect R data frame
head(seedsDF)
summary(seedsDF)