RLadiesMadrid/H2O_Workshop

About H2O

In H2O Docs

About this workshop

WeCodeFest slides

Requirements

About the algorithms

Generalized Linear Models (GLM)

In H2O Docs

Introduction to Generalized Linear Models

Demo H2O World

Generalized Linear Models (GLM) estimate regression models for outcomes that follow distributions in the exponential family. In addition to the Gaussian (i.e., normal) distribution, these include the Poisson, binomial, and gamma distributions. Each serves a different purpose and, depending on the choice of distribution and link function, can be used for either prediction of a numeric outcome or classification.
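
To make the role of the link function concrete, here is a minimal sketch in plain Python (not the H2O API). It assumes the common default links: identity for Gaussian, log for Poisson, and logit for binomial. The gamma family is left out for brevity. The names `linear_predictor`, `predict_mean`, and `INVERSE_LINKS`, as well as the coefficient values, are made up for illustration.

```python
import math

def linear_predictor(coefs, intercept, x):
    """eta = intercept + sum of coefficient * feature."""
    return intercept + sum(c * xi for c, xi in zip(coefs, x))

# Inverse link functions map the linear predictor eta to the expected outcome.
# These pairings are assumed defaults, not taken from the workshop material.
INVERSE_LINKS = {
    "gaussian": lambda eta: eta,                            # identity link
    "poisson": lambda eta: math.exp(eta),                   # log link
    "binomial": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # logit link
}

def predict_mean(family, coefs, intercept, x):
    """Expected outcome for one observation under the given family."""
    return INVERSE_LINKS[family](linear_predictor(coefs, intercept, x))

# Hypothetical coefficients and one observation; here eta = 0.5*2 - 0.25*4 = 0.
coefs, intercept = [0.5, -0.25], 0.0
x = [2.0, 4.0]
for family in INVERSE_LINKS:
    print(family, predict_mean(family, coefs, intercept, x))
```

The same linear predictor gives a numeric estimate under the Gaussian family, an expected count under Poisson, and a class probability under binomial, which is why the family choice decides whether the model predicts a number or classifies.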

Options

  • Datasets are commonly split into training, testing, and validation sets.
    • A training dataset is a set of examples used for learning, that is, to fit the parameters of a model such as a classifier.
    • A validation dataset is a set of examples used to tune the hyperparameters of a classifier. Like the test set, it should follow the same probability distribution as the training dataset.
    • A test dataset is independent of the training dataset but follows the same probability distribution.
  • K-fold cross-validation validates a model internally, i.e., it estimates model performance without setting aside a separate validation split. It also avoids the statistical problems of a single validation split, which might be a “lucky” split, especially for imbalanced data. Good values for K are around 5 to 10. Before “trusting” the main model, it is always a good idea to compare the K validation metrics to check the stability of the estimate.
  • Seed: This option specifies the random number generator (RNG) seed for algorithms that depend on randomization. When a seed is defined, the algorithm behaves deterministically.
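
The options above can be sketched in plain Python, independent of H2O. The helpers `split_indices` and `kfold_indices`, the 70/15/15 ratios, and the seed value are assumptions chosen for illustration; they work on row indices only and train no model.

```python
import random

def split_indices(n, ratios=(0.7, 0.15), seed=1234):
    """Shuffle row indices with a fixed seed, then cut into train/valid/test."""
    rng = random.Random(seed)          # a private RNG, so the seed fully decides the shuffle
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * ratios[0])
    n_valid = round(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

def kfold_indices(n, k=5, seed=1234):
    """Yield (train, holdout) index lists; every row is held out exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

train, valid, test = split_indices(100)
print(len(train), len(valid), len(test))

# Same seed, same split: this is the deterministic behaviour a seed provides.
print(split_indices(100, seed=1234) == split_indices(100, seed=1234))

for fold_train, fold_holdout in kfold_indices(10, k=5):
    print(sorted(fold_holdout))
```

In a real workflow, a model would be fitted on each `fold_train` and scored on the matching `fold_holdout`, and the K scores compared for stability.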

Word2vec

In H2O Docs

The Word2vec algorithm takes a text corpus as input and produces word vectors as output. The algorithm first creates a vocabulary from the training text data and then learns vector representations of the words. The vector space can include hundreds of dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. In addition, words that share similar contexts in the corpus are placed close to one another in the space.
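
The first steps described above can be illustrated with a small plain-Python sketch. It does not train a Word2vec model and is not the H2O implementation. It only builds a vocabulary, collects the (word, context word) pairs inside a window, and defines cosine similarity, a common way to measure how close two word vectors are. The function names, the toy corpus, and the window size are all assumptions for illustration.

```python
import math
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Keep every word that appears at least min_count times."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_count}

def context_pairs(sentences, vocab, window=2):
    """List (center, context) pairs for words within `window` positions."""
    pairs = []
    for s in sentences:
        words = [w for w in s if w in vocab]
        for i, center in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            pairs.extend((center, words[j]) for j in range(lo, hi) if j != i)
    return pairs

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy corpus: "cat" and "dog" appear in the same contexts.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
pairs = context_pairs(corpus, vocab, window=1)

contexts = lambda w: sorted(c for center, c in pairs if center == w)
print(contexts("cat"), contexts("dog"))
```

Because "cat" and "dog" share the same context words here, a trained model would tend to place their vectors close together, which `cosine` would report as a value near 1.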

Vignettes

GLM Booklet R Vignette.
