# Machine Learning with H2O - Tutorial 2: Basic Data Manipulation

<hr>

**Objective**:

- This tutorial demonstrates basic data manipulation with H2O.

<hr>

**Titanic Dataset:**

- Source: https://www.kaggle.com/c/titanic/data

<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-r/h2o_package.pdf

<br>


In [None]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)

In [None]:
# Import Titanic data (local CSV)
titanic = h2o.importFile("kaggle_titanic.csv")

In [None]:
# Preview the first 10 rows of the dataset
head(titanic, 10)

<br>

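'Survived' is imported as a numeric column of 0s and 1s. H2O treats a numeric response as a regression target, so the column has to be converted to a categorical variable before it can be used for classification.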

<br>

In [None]:
# Explore the column 'Survived'
h2o.describe(titanic[, 'Survived'])

In [None]:
# Use h2o.hist() to create a histogram
h2o.hist(titanic[, 'Survived'])

In [None]:
# Use h2o.table() to count the 0s and 1s
h2o.table(titanic[, 'Survived'])

In [None]:
# Convert 'Survived' to categorical variable
titanic[, 'Survived'] = as.factor(titanic[, 'Survived'])

In [None]:
# Look at the summary of 'Survived' again
# The feature is now an 'enum' (H2O's name for a categorical type, borrowed from Java)
h2o.describe(titanic[, 'Survived'])
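
The conversion can also be checked programmatically rather than by reading the `h2o.describe()` output. The following is a minimal sketch, assuming a running local H2O cluster; `toy` is a hypothetical stand-in frame with made-up values, not the Titanic data.

```r
library(h2o)
h2o.init(nthreads = -1)

# Hypothetical toy frame standing in for the Titanic columns (values are made up)
toy <- as.h2o(data.frame(Survived = c(0, 1, 1, 0), Pclass = c(3, 1, 2, 3)))

# Column type before the conversion (numeric)
types_before <- h2o.getTypes(toy[, "Survived"])

# Same conversion as in the cell above
toy[, "Survived"] <- as.factor(toy[, "Survived"])

# Column type and category levels after the conversion
types_after <- h2o.getTypes(toy[, "Survived"])
survived_levels <- h2o.levels(toy[, "Survived"])

print(types_before)
print(types_after)
print(survived_levels)
```

In the tutorial itself, the same three calls can be applied to `titanic[, 'Survived']` in place of the toy frame.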

<br>

'Pclass' (passenger class: 1, 2 or 3) is also imported as a numeric column, so repeat the same steps for it.

<br>

In [None]:
# Explore the column 'Pclass'
h2o.describe(titanic[, 'Pclass'])

In [None]:
# Use h2o.hist() to create a histogram
h2o.hist(titanic[, 'Pclass'])

In [None]:
# Use h2o.table() to count the 1s, 2s and 3s
h2o.table(titanic[, 'Pclass'])

In [None]:
# Convert 'Pclass' to categorical variable
titanic[, 'Pclass'] = as.factor(titanic[, 'Pclass'])

In [None]:
# Look at the summary of 'Pclass' again
# The feature is now an 'enum' (H2O's name for a categorical type, borrowed from Java)
h2o.describe(titanic[, 'Pclass'])
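
When several columns need the same treatment, the conversion can be written once as a loop. This is a sketch under stated assumptions: a running local H2O cluster, and a hypothetical `toy` frame with made-up values standing in for `titanic`; the `cat_cols` vector is an illustrative name, not part of the original tutorial.

```r
library(h2o)
h2o.init(nthreads = -1)

# Hypothetical toy frame standing in for the Titanic data (values are made up)
toy <- as.h2o(data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 2, 3),
  Age      = c(22, 38, 26, 35)
))

# Columns that hold category codes rather than quantities
cat_cols <- c("Survived", "Pclass")

# Convert each listed column to a categorical (enum) column in place
for (col in cat_cols) {
  toy[, col] <- as.factor(toy[, col])
}

# Collect the resulting column types, named by column
types <- unlist(h2o.getTypes(toy))
names(types) <- colnames(toy)
print(types)
```

Columns not listed in `cat_cols`, such as `Age` here, keep their numeric type.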