* H2O is an in-memory platform for distributed, scalable machine learning.

* H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. 

* H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

* H2O is extensible so that developers can add data transformations and custom algorithms of their choice and access them through all of those clients.


#### H2O Python Module

* The H2O Python module provides access to the H2O JVM, as well as its extensions, objects, machine-learning algorithms, and modeling support capabilities, such as basic munging and feature generation.

* The H2O JVM provides a web server so that all communication occurs on a socket (specified by an IP address and a port) via a series of REST calls.

* The H2O Python module is not intended as a replacement for other popular machine learning frameworks such as scikit-learn, pylearn2, and their ilk, but is intended to bring H2O to a wider audience of data and machine learning devotees who work exclusively with Python.

* H2O from Python is a tool for rapidly turning over models, doing data munging, and building applications in a fast, scalable environment without any of the mental anguish about parallelism and distribution of work.
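
Since every client call is a REST request to an IP address and port, the target of a call is just a URL. The sketch below only assembles such a URL and sends nothing. The helper name `h2o_url` is hypothetical. The address `127.0.0.1:54321` and the endpoint `/3/ImportFilesMulti` are taken from the outputs shown later in this notebook.

```python
from urllib.parse import urlunsplit

def h2o_url(ip, port, endpoint):
    """Build the URL of a REST endpoint on an H2O node (hypothetical helper)."""
    return urlunsplit(("http", f"{ip}:{port}", endpoint, "", ""))

# Nothing is sent here; this only shows how the socket address and the
# endpoint path combine into the address of one REST call.
url = h2o_url("127.0.0.1", 54321, "/3/ImportFilesMulti")
print(url)
```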


#### H2O 

* H2O is Java-based software for data modeling and general computing. 

* The primary purpose of H2O is as a distributed (many machines), parallel (many CPUs), in-memory (several hundred GBs Xmx) processing engine.


* There are two levels of parallelism:
    1. within a node
    2. across (or between) nodes

* The goal of H2O is to allow simple horizontal scaling for a given problem in order to produce a solution faster. The conceptual paradigm MapReduce (AKA “divide and conquer and combine”), along with a good concurrent application structure, enables this type of scaling in H2O.

* For application developers and data scientists, the gritty details of thread-safety, algorithm parallelism, and node coherence on a network are concealed by simple-to-use REST calls.
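
The "divide and conquer and combine" idea can be sketched in plain Python. This is a toy illustration using a thread pool on one machine, not H2O's implementation. The function names and the choice of four chunks are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def split_into_chunks(values, n_chunks):
    """Divide: cut the data into roughly equal contiguous pieces."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def map_chunk(chunk):
    """Map: compute a partial (sum, count) on one chunk."""
    return sum(chunk), len(chunk)

def combine(a, b):
    """Reduce: merge two partial results into one."""
    return a[0] + b[0], a[1] + b[1]

data = list(range(1, 101))
chunks = split_into_chunks(data, n_chunks=4)

# Each chunk is processed independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, chunks))

total, count = reduce(combine, partials)
mean = total / count
print(mean)  # 50.5
```

Because each partial result depends only on its own chunk, the same pattern applies whether the chunks live on different cores or on different machines.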


#### H2O Object System

* H2O uses a distributed key-value store (the “DKV”) that contains pointers to the various objects of the H2O ecosystem. 

* Some shared objects are mutable by the client.

* Some shared objects are read-only by the client, but are mutable by H2O (e.g. a model being constructed will change over time) and actions by the client may have side-effects on other clients (multi-tenancy is not a supported model of use, but it is possible for multiple clients to attach to a single H2O cluster).

These objects are:

* Key: A key is an entry in the DKV that maps to an object in H2O.

* Frame: A Frame is a collection of Vec objects. It is a 2D array of elements.

* Vec: A Vec is a collection of Chunk objects. It is a 1D array of elements.

* Chunk: A Chunk holds a fraction of the BigData. It is a 1D array of elements.

* ModelMetrics: A collection of metrics for a given category of model.

* Model: A model is an immutable object having predict and metrics methods.

* Job: A Job is a non-blocking task that performs a finite amount of work.
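
The containment relationship between Key, Frame, Vec, and Chunk can be pictured with a few plain-Python classes. This is only a mental model: the class layouts, the `dkv` dictionary, and the key name `my_frame` are assumptions for illustration and do not mirror H2O's Java classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Chunk:
    """A contiguous slice of one column's values."""
    values: List[float]

@dataclass
class Vec:
    """One column: an ordered collection of Chunks."""
    chunks: List[Chunk] = field(default_factory=list)

    def __len__(self):
        return sum(len(c.values) for c in self.chunks)

    def to_list(self):
        return [v for c in self.chunks for v in c.values]

@dataclass
class Frame:
    """A 2D table: named Vecs of equal length."""
    vecs: Dict[str, Vec] = field(default_factory=dict)

    @property
    def shape(self):
        n_rows = len(next(iter(self.vecs.values()))) if self.vecs else 0
        return n_rows, len(self.vecs)

# The DKV is modeled here as a plain dict that maps a key to an object.
dkv = {}
dkv["my_frame"] = Frame({
    "A": Vec([Chunk([1.0, 2.0]), Chunk([3.0])]),
    "B": Vec([Chunk([4.0, 5.0]), Chunk([6.0])]),
})

frame = dkv["my_frame"]
print(frame.shape)  # (3, 2)
```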

### Installation

In [1]:
# Dependencies that need to be installed
!pip install requests
!pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future

Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.8.9


In [2]:
!pip install h2o

Collecting h2o
  Downloading h2o-3.32.1.3.tar.gz (164.8 MB)
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py): started
  Building wheel for h2o (setup.py): finished with status 'done'
  Created wheel for h2o: filename=h2o-3.32.1.3-py2.py3-none-any.whl size=164854343 sha256=c8ef263ff628a2c52589cf15c7d0af9095d827e5a179964a3e38b04bc29e7052
  Stored in directory: c:\users\dell\appdata\local\pip\cache\wheels\94\de\98\a3badf41ac2c2b02dc1a21c9b8f8d435b5eb68a52f9df8d3c1
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.32.1.3


H2O requires Java, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html

H2O supports Java SDK versions 8-15.

# Supported File Formats

H2O currently supports the following file types:

* CSV (delimited, UTF-8 only) files (including GZipped CSV)

* ORC

* SVMLight

* ARFF

* XLS (BIFF 8 only)

* XLSX (BIFF 8 only)

* Avro version 1.8.0 (without multifile parsing or column type modification)

* Parquet

In [5]:
import h2o

In [9]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
  Starting server from C:\Users\Dell\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Dell\AppData\Local\Temp\tmp9gm19t1o
  JVM stdout: C:\Users\Dell\AppData\Local\Temp\tmp9gm19t1o\h2o_Dell_started_from_python.out
  JVM stderr: C:\Users\Dell\AppData\Local\Temp\tmp9gm19t1o\h2o_Dell_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,05 secs
H2O_cluster_timezone:,America/Los_Angeles
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.3
H2O_cluster_version_age:,26 days
H2O_cluster_name:,H2O_from_python_Dell_z81239
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,864 Mb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


## Basics

**Importing files**

In [10]:
# Import a file from S3:
import h2o
h2o.init()
airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
airlines_df = h2o.import_file(path=airlines)

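# Import a file from HDFS; you must include the node name.
# "node-1" is a placeholder: if the host name does not resolve, the import
# fails with the UnknownHostException shown in the output below.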
import h2o
h2o.init()
airlines = "hdfs://node-1:/user/smalldata/airlines/allyears2k_headers.zip"
airlines_df = h2o.import_file(path=airlines)


Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,26 secs
H2O_cluster_timezone:,America/Los_Angeles
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.3
H2O_cluster_version_age:,26 days
H2O_cluster_name:,H2O_from_python_Dell_z81239
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,855 Mb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


Parse progress: |█████████████████████████████████████████████████████████| 100%
Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,57 secs
H2O_cluster_timezone:,America/Los_Angeles
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.3
H2O_cluster_version_age:,26 days
H2O_cluster_name:,H2O_from_python_Dell_z81239
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,832 Mb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


H2OResponseError: Server error java.lang.IllegalArgumentException:
  Error: java.net.UnknownHostException: node-1
  Request: POST /3/ImportFilesMulti
    data: {'paths': '[hdfs://node-1:/user/smalldata/airlines/allyears2k_headers.zip]'}


**Combining Columns from Two Datasets**

The cbind function allows you to combine datasets by adding columns from one dataset into another. Note that when using cbind, the two datasets must have the same number of rows. In addition, if the datasets contain common column names, H2O appends 0 to the name of the joined column.

In [12]:
import numpy as np

# Generate a random dataset with 10 rows 4 columns.
# Label the columns A, B, C, and D.
cols1_df = h2o.H2OFrame.from_python(np.random.randn(10,4).tolist(), column_names=list('ABCD'))
cols1_df.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


A,B,C,D
-0.769678,-1.04686,1.93747,-0.528779
-0.323689,-0.245458,-0.451048,1.05777
-0.550506,-1.17546,0.363695,0.728127
0.400188,-1.14307,-0.0635538,-0.846182
-1.21806,-1.4712,-0.794863,0.261055
-0.457715,-0.350138,1.08414,-0.0554972
-1.28459,-0.162088,0.607858,1.04999
-1.57106,0.67514,1.10133,0.715775
1.59937,0.217726,-0.414989,-0.180219
-0.145931,0.4119,-0.630669,-1.72855

In [13]:
# Generate a second random dataset with 10 rows and 2 columns.
# Label the columns, Y and Z.
cols2_df = h2o.H2OFrame.from_python(np.random.randn(10,2).tolist(), column_names=list('YZ'))
cols2_df.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


Y,Z
1.23577,-0.569096
-1.32025,0.9988
-0.297733,-1.61445
1.22773,-1.34383
1.21408,0.0146905
0.677902,-0.554999
-0.8872,-0.408345
-0.382518,0.355944
-1.26705,1.78747
0.184916,-0.0350165

In [14]:

# Add the columns from the second dataset into the first.
# H2O will append these as the right-most columns.
colsCombine_df = cols1_df.cbind(cols2_df)
colsCombine_df.head()

A,B,C,D,Y,Z
-0.769678,-1.04686,1.93747,-0.528779,1.23577,-0.569096
-0.323689,-0.245458,-0.451048,1.05777,-1.32025,0.9988
-0.550506,-1.17546,0.363695,0.728127,-0.297733,-1.61445
0.400188,-1.14307,-0.0635538,-0.846182,1.22773,-1.34383
-1.21806,-1.4712,-0.794863,0.261055,1.21408,0.0146905
-0.457715,-0.350138,1.08414,-0.0554972,0.677902,-0.554999
-1.28459,-0.162088,0.607858,1.04999,-0.8872,-0.408345
-1.57106,0.67514,1.10133,0.715775,-0.382518,0.355944
1.59937,0.217726,-0.414989,-0.180219,-1.26705,1.78747
-0.145931,0.4119,-0.630669,-1.72855,0.184916,-0.0350165


**Combining Rows from Two Datasets**


The rbind function combines two similar datasets into a single larger dataset.
This can be used, for example, to create a larger dataset by combining data from a validation dataset with its training or testing dataset.

Note that when using rbind, the two datasets must have the same set of columns.





In [15]:
# Generate a random dataset with 100 rows 4 columns.
# Label the columns A, B, C, and D.
df1 = h2o.H2OFrame.from_python(np.random.randn(100,4).tolist(), column_names=list('ABCD'))
df1.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


A,B,C,D
-0.181347,-0.108001,-0.750324,-0.378007
0.487416,-2.28845,-0.113211,-0.21489
0.376845,0.883027,-0.338396,-0.0671125
-1.40542,0.446497,0.272481,-0.908014
0.00609763,-1.28933,-0.639123,0.580837
-1.70974,0.360339,-1.366,1.18906
0.317223,0.819715,0.608502,-0.825593
-2.13074,0.241734,-0.430647,2.10417
-1.47938,0.122659,0.709468,-1.14868
2.37946,0.95932,-0.738864,-0.827964

In [16]:
# Generate a second random dataset with 100 rows and 4 columns.
# Again, label the columns, A, B, C, and D.
df2 = h2o.H2OFrame.from_python(np.random.randn(100,4).tolist(), column_names=list('ABCD'))
df2.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


A,B,C,D
1.29338,-1.95029,-0.910084,0.464277
0.198808,-0.482703,0.68196,-0.10153
0.553279,-0.723882,-1.7414,0.717428
-0.989289,0.263803,2.95993,-0.241498
-0.995339,0.164363,-0.952606,0.987327
-0.317996,-0.831171,-2.02828,-0.0352861
0.0970458,0.584402,0.3517,-0.573392
-0.832019,1.44989,-0.36712,-1.49503
0.314752,1.03289,0.746285,0.906769
-0.595265,1.26811,0.103703,0.593597

In [17]:

# Bind the rows from the second dataset into the first dataset.
df1.rbind(df2)

A,B,C,D
-0.181347,-0.108001,-0.750324,-0.378007
0.487416,-2.28845,-0.113211,-0.21489
0.376845,0.883027,-0.338396,-0.0671125
-1.40542,0.446497,0.272481,-0.908014
0.00609763,-1.28933,-0.639123,0.580837
-1.70974,0.360339,-1.366,1.18906
0.317223,0.819715,0.608502,-0.825593
-2.13074,0.241734,-0.430647,2.10417
-1.47938,0.122659,0.709468,-1.14868
2.37946,0.95932,-0.738864,-0.827964




**Fill NAs**

Use this function to fill in NA values in a sequential manner up to a specified limit. When using this function, you will specify whether the method to fill the NAs should go forward (default) or backward, whether the NAs should be filled along rows (default) or columns, and the maximum number of consecutive NAs to fill (defaults to 1).

In [18]:
df = h2o.create_frame(rows=10,
                      cols=5,
                      real_fraction=1.0,
                      real_range=100,
                      missing_fraction=0.3,
                      seed=123)

Create Frame progress: |██████████████████████████████████████████████████| 100%


In [19]:
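# Forward fill with axis=0 (the other accepted value is 1) and maxlen=1.
# As the output below shows, each filled NA takes the value from the previous row
# of the same column, and at most one consecutive NA is filled.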
filled = df.fillna(method="forward",axis=0,maxlen=1)
filled

C1,C2,C3,C4,C5
,,,-93.6409,-13.6593
57.4439,-93.71,25.4342,-93.6409,-13.6593
-92.4271,55.4314,84.6372,-43.4759,53.1715
-57.9583,27.4148,-26.9013,83.0921,-62.7819
-91.9426,-77.9814,-26.9013,83.0921,83.3661
-80.6142,12.5466,27.1672,60.5492,-13.2275
-80.6142,-47.3792,23.881,60.5492,-13.2275
-3.90288,31.2924,-96.4446,,
-3.90288,25.5683,-96.4446,37.7971,74.6371
-21.0949,15.0779,-96.6207,37.7971,-64.7896
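
The effect of the `maxlen` limit in the output above can be reproduced on a single column in plain Python. This is a sketch of the behavior described in this section, not H2O's code; the function name `forward_fill` and the sample column are made up for the example.

```python
def forward_fill(values, maxlen=1):
    """Forward-fill None entries, filling at most `maxlen` consecutive gaps."""
    filled = []
    last_seen = None  # most recent non-missing value
    run = 0           # length of the current run of missing values
    for v in values:
        if v is None:
            run += 1
            if last_seen is not None and run <= maxlen:
                filled.append(last_seen)
            else:
                filled.append(None)  # nothing to copy, or limit reached
        else:
            last_seen = v
            run = 0
            filled.append(v)
    return filled

column = [None, 1.5, None, None, 2.0, None]
result = forward_fill(column, maxlen=1)
print(result)  # [None, 1.5, 1.5, None, 2.0, 2.0]
```

The leading NA stays missing because there is no earlier value to copy, and the second NA of a pair stays missing because only one consecutive NA may be filled.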




**Groupby**

The group_by function allows you to group one or more columns and apply a function to the result. Specifically, the group_by function performs the following actions on an H2O Frame:

* splits the data into groups based on some criteria

* applies a function to each group independently

* combines the results into an H2OFrame

The result is a new H2OFrame with columns equivalent to the number of groups created. The returned groups are sorted by the natural group-by column sort.
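
The split, apply, and combine steps can be written out by hand in plain Python to show what a group-by computes. The rows below are made-up toy data. The result column names `nrow` and `sum_Cancelled` follow the names used in the H2O cells further down.

```python
from collections import defaultdict

rows = [
    {"Origin": "SFO", "Cancelled": 0},
    {"Origin": "SFO", "Cancelled": 1},
    {"Origin": "ORD", "Cancelled": 0},
    {"Origin": "SFO", "Cancelled": 0},
    {"Origin": "ORD", "Cancelled": 1},
]

# 1. Split: bucket the rows by the value of the group-by column.
groups = defaultdict(list)
for row in rows:
    groups[row["Origin"]].append(row)

# 2. Apply: compute the aggregates for each group independently.
# 3. Combine: collect one result row per group, sorted by the group key.
summary = [
    {"Origin": key,
     "nrow": len(members),
     "sum_Cancelled": sum(m["Cancelled"] for m in members)}
    for key, members in sorted(groups.items())
]
print(summary)
```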

In [44]:
# Upload the airlines dataset
air = h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
air.dim

Parse progress: |█████████████████████████████████████████████████████████| 100%


[43978, 31]

In [45]:
# Find number of flights by airport
origin_flights = air.group_by("Origin")
origin_flights.count()
origin_flights.get_frame()

In [None]:
# Find number of flights per month based on the origin
cols = ["Origin","Month"]
flights_by_origin_month = air.group_by(by=cols).count(na="all")
flights_by_origin_month.get_frame()

In [None]:
# Find months with the highest cancellation ratio
cancellation_by_month = air.group_by(by='Month').sum('Cancelled', na="all")
flights_by_month = air.group_by('Month').count(na="all")
cancelled = cancellation_by_month.get_frame()['sum_Cancelled']
flights = flights_by_month.get_frame()['nrow']
month_count = flights_by_month.get_frame()['Month']
ratio = cancelled/flights
month_count.cbind(ratio)

In [None]:
# Use group_by with multiple columns. Summarize the destination,
# arrival delays, and departure delays for an origin
cols_1 = ['Origin', 'Dest', 'IsArrDelayed', 'IsDepDelayed']
cols_2 = ["Dest", "IsArrDelayed", "IsDepDelayed"]
air[cols_1].group_by(by='Origin').sum(cols_2, na="ignore").get_frame()

**Imputing Data**

The impute function allows you to perform in-place imputation by filling missing values with aggregates computed on the “na.rm’d” vector (the column with its missing values removed). You can also perform imputation based on groupings of columns from within the dataset. These columns can be passed by index or by column name to the by parameter. Note that if a factor column is supplied, then the method must be mode.
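
Grouped mean imputation can be illustrated on a handful of rows in plain Python. This is a sketch of the idea only, with made-up values. It groups by a single column for brevity, whereas the H2O cell below groups by two.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"Origin": "SFO", "DepTime": 900.0},
    {"Origin": "SFO", "DepTime": None},
    {"Origin": "SFO", "DepTime": 1100.0},
    {"Origin": "ORD", "DepTime": 700.0},
    {"Origin": "ORD", "DepTime": None},
]

# Collect the non-missing values of each group.
observed = defaultdict(list)
for row in rows:
    if row["DepTime"] is not None:
        observed[row["Origin"]].append(row["DepTime"])

group_means = {origin: mean(vals) for origin, vals in observed.items()}

# Fill each missing value in place with the mean of its own group.
for row in rows:
    if row["DepTime"] is None:
        row["DepTime"] = group_means[row["Origin"]]

dep_times = [row["DepTime"] for row in rows]
print(dep_times)  # [900.0, 1000.0, 1100.0, 700.0, 700.0]
```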

In [None]:
# Import the airlines dataset
air_path = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
air = h2o.import_file(path=air_path)
air.dim  # [43978, 31]

# Mean impute the DepTime column based on the Origin and Distance columns
DeptTime_impute = air.impute("DepTime", method = "mean", by = ["Origin", "Distance"])
DeptTime_impute

In [None]:
# Revert imputations
air = h2o.import_file(path=air_path)

# Mode impute the TailNum column based on the Month and Year columns
mode_impute = air.impute("TailNum", method = "mode", by=["Month", "Year"])
mode_impute

**Merging two datasets**

In [35]:
# Create a dataset by inputting raw data.
df1 = h2o.H2OFrame.from_python({'A':['Hello', 'World', 'Welcome', 'To', 'H2O', 'World'],
                                'n': [0,1,2,3,4,5]})
df1.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


A,n
Hello,0
World,1
Welcome,2
To,3
H2O,4
World,5

In [36]:
# Generate a random dataset from python.
df2 = h2o.H2OFrame.from_python([[x] for x in np.random.randint(0, 10, size=20).tolist()], column_names=['n'])
df2.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


n
1
9
2
3
8
2
7
0
1
8

In [37]:
# Merge the first dataset into the second dataset. Note that only columns
# in common are merged (i.e., values in df2 greater than 5 will not be merged).
df3 = df2.merge(df1)
df3.head()

n,A
0,Hello
1,World
1,World
1,World
1,World
2,Welcome
2,Welcome
3,To
3,To
4,H2O

In [38]:
# Merge df1 into df2, keeping every row of df2 (all_x=True). Rows of df2 whose
# n has no match in df1 (values greater than 5) get a missing value in column A.
# Only the first 10 rows are displayed below.
df4 = df2.merge(df1, all_x=True)
df4.head()

n,A
0,Hello
1,World
1,World
1,World
1,World
2,Welcome
2,Welcome
3,To
3,To
4,H2O
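
Because the displayed rows look the same for both merges, the difference made by `all_x=True` is easier to see on a few rows in plain Python. The function below is an illustrative sketch of join semantics, not H2O's implementation, and the short `left` list is made up for the example.

```python
left = [{"n": 1}, {"n": 9}, {"n": 2}, {"n": 0}]
right = {0: "Hello", 1: "World", 2: "Welcome", 3: "To", 4: "H2O", 5: "World"}

def merge(left_rows, lookup, all_x=False):
    """Join rows on the key "n"; with all_x=True keep unmatched left rows."""
    out = []
    for row in left_rows:
        if row["n"] in lookup:
            out.append({"n": row["n"], "A": lookup[row["n"]]})
        elif all_x:
            out.append({"n": row["n"], "A": None})  # no match: missing value
    return out

inner = merge(left, right)
left_outer = merge(left, right, all_x=True)
print(len(inner), len(left_outer))  # 3 4
```

The row with `n` equal to 9 is dropped by the default merge and kept, with a missing `A`, when `all_x=True`.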

**Pivoting Tables**

In [20]:
# Create a simple data frame by inputting values
df = h2o.H2OFrame({'colorID': ['1','2','3','3','1','4'],
                   'value': ['red','orange','yellow','yellow','red','blue'],
                   'amount': ['4','2','4','3','6','3']})

df

Parse progress: |█████████████████████████████████████████████████████████| 100%


colorID,value,amount
1,red,4
2,orange,2
3,yellow,4
3,yellow,3
1,red,6
4,blue,3




In [21]:
# Pivot the table on the colorID column, aligned on the amount column
df2 = df.pivot(index="amount",column="colorID",value="value")
df2

amount,1,2,3,4
2,,1.0,,
3,,,3.0,0.0
4,2.0,,3.0,
6,2.0,,,




**Replacing values in a frame**

In [22]:
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# Replace a single numerical datum. Note that columns and rows start at 0,
# so in the example below, the value in the 15th row and 3rd column will be set to 2.0.
df[14,2] = 2.0

# Replace a whole column. The example below multiplies all values in the first column by 3.
df[0] = 3*df[0]

# Replace by row mask. The example below searches for value less than 4.6 in the
# sepal_len column and replaces those values with 4.6.
df[df["sepal_len"] < 4.6, "sepal_len"] = 4.6

# Replace using ifelse. Similar to the previous example, this replaces values less than 4.6 with 4.6.
df["sepal_len"] = (df["sepal_len"] < 4.6).ifelse(4.6, df["sepal_len"])

# Replace missing values with 0.
df[df["sepal_len"].isna(), "sepal_len"] = 0

# Alternative with ifelse. Note the parentheses.
df["sepal_len"] = (df["sepal_len"].isna()).ifelse(0, df["sepal_len"])

Parse progress: |█████████████████████████████████████████████████████████| 100%


**Slicing Columns**

In [23]:
# Import the iris with headers dataset
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# Slice 1 column by index. The resulting dataset will include only the first column.
c1 = df[:,0]
c1.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


sepal_len
5.1
4.9
4.7
4.6
5.0
5.4
4.6
5.0
4.4
4.9

In [24]:
# Slice 1 column by name. The resulting dataset will include only the sepal_len column
# from the original dataset.
c1_1 = df[:, "sepal_len"]
c1_1.head()

sepal_len
5.1
4.9
4.7
4.6
5.0
5.4
4.6
5.0
4.4
4.9

**Slicing Rows**

In [25]:
# Import the iris with headers dataset
path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# Slice 1 row by index
c1 = df[15,:]

# Slice a range of rows and display the first 10 of them
c1_1 = df[range(25,50,1),:]
c1_1.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


sepal_len,sepal_wid,petal_len,petal_wid,class
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

**Sorting Columns**

Use the sort function in Python or the arrange function in R to create a new frame that is sorted by column(s) in ascending (default) or descending order. Note that when using sort, the original frame cannot contain any string columns.

If only one column is specified in the sort, then the final results are sorted according to that one single column either in ascending (default) or in descending order. However, if you specify more than one column in the sort, then H2O performs as described below:

Assuming two columns, X (first column) and Y (second column):

* H2O will sort on the first specified column, so in the case of [0,1], the X column will be sorted first. Similarly, in the case of [1,0], the Y column will be sorted first.

* H2O will sort on subsequent columns in the order they are specified, but only on those rows that have the same values as the first sorted column. No sorting will be done on subsequent columns if the values are not also duplicated in the first sorted column.
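
The multi-column rule above is the usual lexicographic ordering, which Python's built-in `sorted` reproduces when the key is a tuple. The rows below are made-up (X, Y) pairs used only to show how ties on the first key are broken by the second.

```python
rows = [(2, 20), (1, 30), (2, 10), (1, 5), (3, 0)]

# Sort by X first; Y only decides the order among rows that tie on X.
by_x_then_y = sorted(rows, key=lambda r: (r[0], r[1]))

# Reversing the column order makes Y the primary key instead.
by_y_then_x = sorted(rows, key=lambda r: (r[1], r[0]))

print(by_x_then_y)  # [(1, 5), (1, 30), (2, 10), (2, 20), (3, 0)]
print(by_y_then_x)  # [(3, 0), (1, 5), (2, 10), (2, 20), (1, 30)]
```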

In [40]:
df1 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/synthetic/smallIntFloats.csv.zip")
df1

Parse progress: |█████████████████████████████████████████████████████████| 100%


C1,C10
68379.0,-16186700.0
67108900.0,32768.0
32768.0,-870946000.0
32.0,131072.0
268435000.0,-29.1003
105383000.0,-239721000.0
350191.0,21551.4
-188.0,23987200.0
493.0,525.825
93104100.0,-163828000.0




In [41]:
# Sort on the first column only, in ascending order (the default)
df2 = df1.sort(0)
df2

C1,C10
-1073590000.0,747438.0
-1073560000.0,-2097150.0
-1073520000.0,5110770.0
-1073420000.0,2220940.0
-1073360000.0,-5.7076
-1073360000.0,-4650.33
-1073260000.0,-1048580.0
-1073070000.0,8192.0
-1072910000.0,-1.49017
-1072910000.0,-9337.5




In [43]:
# Sort on the second column in descending order
df4 = df1.sort(1, ascending=False)
df4


C1,C10
321418000.0,1073660000.0
448.0,1073570000.0
85.0,1073290000.0
-4096.0,1072910000.0
28.0,1072890000.0
-4194300.0,1072750000.0
6616880.0,1072540000.0
-50127.0,1072350000.0
-262144.0,1072070000.0
55.0,1071750000.0




**Splitting dataset into Training/Testing/Validating**

When splitting frames, H2O does not give an exact split. It is designed to be efficient on big data, using a probabilistic splitting method rather than an exact split. For example, when specifying a 0.75/0.25 split, H2O will produce a train/test split with an expected value of 0.75/0.25 rather than exactly 0.75/0.25. On small datasets, the sizes of the resulting splits will deviate from the expected value more than on big data, where they will be very close to exact.
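
A probabilistic split can be sketched as one independent random draw per row, which is why the resulting sizes only match the requested ratios in expectation. This is an illustration of the idea and not necessarily the exact procedure H2O uses. The function name, the row count of 1000, and the seed are assumptions made for the example.

```python
import random

def probabilistic_split(n_rows, ratios, seed=None):
    """Assign each row index to a split by an independent random draw."""
    rng = random.Random(seed)
    # Cumulative boundaries; the last split gets the remaining probability.
    bounds = []
    total = 0.0
    for r in ratios:
        total += r
        bounds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, b in enumerate(bounds):
            if u < b:
                splits[k].append(i)
                break
        else:
            splits[-1].append(i)
    return splits

train_idx, test_idx = probabilistic_split(1000, ratios=[0.75], seed=42)
# The sizes are close to 750 and 250 but are not guaranteed to be exact.
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one split, so the sizes always add up to the number of rows even though the individual sizes vary from run to run when no seed is fixed.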

In [29]:
# Import the prostate dataset
prostate = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv"
prostate_df = h2o.import_file(path=prostate)

# Split the data into Train/Test/Validation with Train having 70% and test and validation 15% each
train,test,valid = prostate_df.split_frame(ratios=[.7, .15])

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [30]:
train

ID,CAPSULE,AGE,RACE,DPROS,DCAPS,PSA,VOL,GLEASON
2,0,72,1,3,2,6.7,0.0,7
4,0,76,2,2,1,51.2,20.0,7
5,0,69,1,1,1,12.3,55.9,6
6,1,71,1,3,2,3.3,0.0,8
7,0,68,2,4,2,31.9,0.0,7
8,0,61,2,4,2,66.7,27.2,7
9,0,69,1,1,1,3.9,24.0,7
11,1,68,2,4,2,4.0,0.0,7
12,1,72,1,2,2,21.2,0.0,7
14,1,65,1,4,2,39.0,0.0,7




In [31]:
test

ID,CAPSULE,AGE,RACE,DPROS,DCAPS,PSA,VOL,GLEASON
15,0,75,1,1,1,7.5,0.0,5
31,1,54,1,3,1,8.4,18.3,6
32,0,72,1,2,1,6.5,22.5,7
34,1,60,1,3,2,9.5,0.0,7
35,0,65,1,2,1,11.1,17.7,6
53,0,66,1,3,1,8.8,39.9,6
65,1,59,1,4,1,30.7,0.0,7
69,0,65,1,2,1,1.3,6.8,5
70,1,68,2,3,1,9.6,32.0,6
78,1,62,1,2,1,1.9,0.0,6




In [32]:
valid

ID,CAPSULE,AGE,RACE,DPROS,DCAPS,PSA,VOL,GLEASON
1,0,65,1,2,1,1.4,0.0,6
3,0,70,1,1,2,4.9,0.0,6
10,0,68,2,1,2,13.0,0.0,6
13,1,72,1,4,2,22.7,0.0,9
21,1,58,1,2,1,3.1,0.0,7
26,0,77,1,1,1,8.8,0.0,5
37,0,54,1,2,1,1.0,0.0,6
45,1,65,2,3,1,83.7,32.0,9
56,1,57,1,3,1,7.4,18.3,7
63,0,76,1,2,1,9.5,14.4,7




In [34]:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Generate a GLM model using the training dataset
glm_classifier = H2OGeneralizedLinearEstimator(family="binomial", nfolds=10, alpha=0.5)
glm_classifier.train(y="CAPSULE", x=["AGE", "RACE", "PSA", "DCAPS"], training_frame=train)

# Predict using the GLM model and the testing dataset
predict = glm_classifier.predict(test)

# View a summary of the prediction
predict.head()

glm Model Build progress: |███████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%


predict,p0,p1
0,0.702567,0.297433
0,0.697912,0.302088
0,0.712578,0.287422
1,0.320262,0.679738
0,0.669101,0.330899
0,0.691744,0.308256
1,0.453686,0.546314
0,0.759836,0.240164
0,0.788344,0.211656
0,0.755299,0.244701




**Tokenize Strings**

In [26]:
# Create four simple, single-column Python data frames by inputting values
df1 = h2o.H2OFrame.from_python({'String':[' this is a string ']})
df1 = df1.ascharacter()
df2 = h2o.H2OFrame.from_python({'String':['this is another string']})
df2 = df2.ascharacter()
df3 = h2o.H2OFrame.from_python({'String':['this is a longer string']})
df3 = df3.ascharacter()
df4 = h2o.H2OFrame.from_python({'String':['this is tall, this is taller']})
df4 = df4.ascharacter()


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [27]:
# Combine the datasets into a single dataset.
combined = df1.rbind([df2, df3, df4])
combined



String
this is a string
this is another string
this is a longer string
"this is tall, this is taller"




In [28]:
# Tokenize the dataset.
# Notice that tokenized sentences are separated by empty rows.

tokenized = combined.tokenize(" ")
tokenized.head()

C1
this
is
a
string
this
is
another
string
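
The shape of a tokenized result, one token per row with a separator row after each sentence, can be mimicked in plain Python. This is a sketch under the assumption that `None` stands in for the empty separator rows mentioned in the cell above; the function name `tokenize` here is a local helper, not the H2O method.

```python
import re

def tokenize(strings, pattern):
    """Split each string on a regex and put None after each sentence."""
    tokens = []
    for s in strings:
        tokens.extend(re.split(pattern, s))
        tokens.append(None)  # separator row marking the end of a sentence
    return tokens

sentences = ["this is a string", "this is another string"]
tokens = tokenize(sentences, " ")
print(tokens)
```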