# Machine Learning with H2O - Tutorial 2: Basic Data Manipulation

<hr>

**Objective**:

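- This tutorial demonstrates basic data manipulation with H2O: importing a CSV file, exploring individual columns, and converting numeric columns to categorical ones.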

<hr>

**Titanic Dataset:**

- Source: https://www.kaggle.com/c/titanic/data

<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html

<br>


In [None]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)

<br>

In [None]:
# Import Titanic data (local CSV)
titanic = h2o.import_file("kaggle_titanic.csv")

In [None]:
# Preview the first 10 rows of the dataset
titanic.head(10)

<br>

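H2O parses 'Survived' and 'Pclass' as numeric columns because their values are integers, but both are really categories. Left as numbers, 'Survived' would be treated as a regression target and 'Pclass' as a continuous quantity, so both columns are converted to categorical (factor) columns below.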
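
To see why a numeric encoding is misleading for a category, here is a plain-Python illustration that does not need H2O. The class labels are made-up toy values, not rows from the Titanic file. Averaging them gives a number that matches no real class, while counting each label gives a meaningful summary.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy passenger classes (illustrative values, not from the dataset)
pclass = [1, 3, 3, 2, 3, 1]

# Treated as numbers: the mean can be computed, but it is not an actual class
print(mean(pclass))

# Treated as categories: count how often each label occurs
counts = Counter(str(c) for c in pclass)
print(counts)
```

The count per label is the same kind of summary that `table()` produces for an H2O column.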

<br>

In [None]:
# Explore the column 'Survived'
titanic['Survived'].summary()

In [None]:
# Use hist() to create a histogram
titanic['Survived'].hist()

In [None]:
# Use table() to summarize 0s and 1s
titanic['Survived'].table()

In [None]:
# Convert 'Survived' to categorical variable
titanic['Survived'] = titanic['Survived'].asfactor()

In [None]:
# Look at the summary of 'Survived' again
# The column type is now 'enum' (H2O's name for a categorical variable, borrowed from Java)
titanic['Survived'].summary()

<br>

The same steps apply to 'Pclass'.

<br>

In [None]:
# Explore the column 'Pclass'
titanic['Pclass'].summary()

In [None]:
# Use hist() to create a histogram
titanic['Pclass'].hist()

In [None]:
# Use table() to summarize 1s, 2s and 3s
titanic['Pclass'].table()

In [None]:
# Convert 'Pclass' to categorical variable
titanic['Pclass'] = titanic['Pclass'].asfactor()

In [None]:
# Explore the column 'Pclass' again
titanic['Pclass'].summary()
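
Both conversions above follow the same pattern, so they could be wrapped in a small helper. This is a minimal sketch: the name `as_factors` is hypothetical and not part of H2O. It relies only on column selection, column assignment, and `asfactor()`, all of which are used earlier in this tutorial.

```python
def as_factors(frame, columns):
    """Convert each named column of an H2OFrame to categorical (enum)."""
    for col in columns:
        # asfactor() returns a categorical copy of the column; assign it back
        frame[col] = frame[col].asfactor()
    return frame
```

With the frame from this tutorial, the call would be `titanic = as_factors(titanic, ['Survived', 'Pclass'])`.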