# Training a tweet sentiment classifier with H2O for use in a Scala environment

In [1]:
import h2o

## Initialize H2O session

In [2]:
h2o.init(max_mem_size='64g')

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_171"; OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-2~14.04-b11); OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
  Starting server from /home/users/qcoic/miniconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpr886e_zj
  JVM stdout: /tmp/tmpr886e_zj/h2o_qcoic_started_from_python.out
  JVM stderr: /tmp/tmpr886e_zj/h2o_qcoic_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,01 secs
H2O cluster version:,3.16.0.2
H2O cluster version age:,11 months and 30 days !!!
H2O cluster name:,H2O_from_python_qcoic_yghi8r
H2O cluster total nodes:,1
H2O cluster free memory:,56.89 Gb
H2O cluster total cores:,48
H2O cluster allowed cores:,48
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://127.0.0.1:54321


## Import the [Sentiment140](http://help.sentiment140.com/for-students/) labelled dataset to train our classifier

In [3]:
tweets_df = h2o.import_file("training.1600000.processed.noemoticon.csv")

Parse progress: |██████████████████████████████████████████████████████████| 100%


## Keep only the needed columns (tweet text and polarity label)

In [4]:
tweet_texts = tweets_df[['C6']]

In [5]:
tweet_texts['Sentiment'] = (tweets_df['C1'] == 4).ifelse("1", "0")  # Keep the labels as strings, not integers, so the response is treated as categorical

In [6]:
tweet_texts.names = ['Text', 'Sentiment']
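
For reference, the column selection and label mapping above can be sketched in plain Python. The sketch assumes the usual Sentiment140 CSV layout (polarity, id, date, query, user, text), which H2O names `C1` to `C6` because the file has no header row. The two sample rows and the helper name `to_text_and_sentiment` are made up for illustration.

```python
import csv
import io

# Two made-up rows in the assumed Sentiment140 layout:
# polarity, id, date, query, user, text
SAMPLE_CSV = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone died again"\n'
    '"4","2","Mon Apr 06 22:20:10 PDT 2009","NO_QUERY","user_b","what a lovely morning"\n'
)

def to_text_and_sentiment(row):
    """Keep the text (6th column) and map polarity 4 to "1", anything else to "0"."""
    polarity, text = row[0], row[5]
    # The csv module yields strings, so the comparison is against "4" here,
    # whereas H2O parses C1 as a number and the notebook compares against 4.
    return {"Text": text, "Sentiment": "1" if polarity == "4" else "0"}

rows = [to_text_and_sentiment(r) for r in csv.reader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    print(r)
```

The labels stay strings on purpose, mirroring the `ifelse("1", "0")` call in the notebook.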

## Import [Stop Words](https://en.wikipedia.org/wiki/Stop_words)

In [7]:
# Load the stop words (this list comes from the nltk package)
import pandas as pd
import os

# Use local data file or download from GitHub
docker_data_path = "/home/h2o/data/nlp/stopwords.csv"
if os.path.isfile(docker_data_path):
  data_path = docker_data_path
else:
  data_path = "https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/nlp/stopwords.csv"

STOP_WORDS = pd.read_csv(data_path, header=0)
STOP_WORDS = list(STOP_WORDS['STOP_WORD'])

## Define a [tokenizer](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) to transform tweet texts into sequences of words characterizing them

In [8]:
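def tokenize(sentences, stop_words=STOP_WORDS):
    # Split on runs of non-word characters and lowercase every token
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Keep tokens of at least 2 characters; NA rows separate sentences and must be kept
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()), :]
    # Drop tokens that contain a digit
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]", invert=True, output_logical=True), :]
    # Drop stop words, using the list passed as argument rather than the global one
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_words)), :]
    return tokenized_words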
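
To see what these filters do to a single tweet, here is a minimal plain-Python sketch of the same steps. It is only an approximation: it works on one string instead of an H2O frame, Python's `\W` may treat non-ASCII characters differently from H2O's regex engine, and `EXAMPLE_STOP_WORDS` and `tokenize_text` are hypothetical names with a tiny stand-in stop-word list.

```python
import re

# Tiny hypothetical stop-word list; the notebook loads a much longer one from stopwords.csv
EXAMPLE_STOP_WORDS = {"the", "is", "at", "to", "and"}

def tokenize_text(text, stop_words=EXAMPLE_STOP_WORDS):
    """Apply the notebook's filters to one string and return the surviving tokens."""
    tokens = re.split(r"\W+", text)                             # split on non-word characters
    tokens = [t.lower() for t in tokens]                        # lowercase
    tokens = [t for t in tokens if len(t) >= 2]                 # drop empty and 1-character tokens
    tokens = [t for t in tokens if not re.search(r"[0-9]", t)]  # drop tokens containing a digit
    return [t for t in tokens if t not in stop_words]           # drop stop words

print(tokenize_text("The match is at 9pm, can't wait to watch!"))
# prints ['match', 'can', 'wait', 'watch']
```

Note how `can't` is split into `can` and `t`, and the single-character `t` is then removed by the length filter.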

## Tokenize tweet texts

In [9]:
words = tokenize(tweet_texts["Text"])

## Train a [Word2Vec](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/word2vec.html) model on texts of the Sentiment140 dataset or load a pre-trained one

In [10]:
# Train Word2Vec Model
from h2o.estimators.word2vec import H2OWord2vecEstimator

# This takes time to run - left commented out
#w2v_model = H2OWord2vecEstimator(vec_size = 100, model_id = "w2v.hex")
#w2v_model.train(training_frame=words)

#Pre-trained model available on s3: https://s3.amazonaws.com/tomk/h2o-world/megan/w2v.hex
w2v_model = h2o.load_model("w2v.hex")

## Export the Word2Vec model in [MOJO](http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#whatisamojo) format

In [11]:
w2v_model.download_mojo()  # Saves the MOJO zip file in the current directory (here, the folder of this notebook)

'/home/users/qcoic/notebooks/w2v_hex.zip'

## Transform the tokenized tweet texts into vectors of floats

In [12]:
tweet_texts = tweet_texts.cbind(w2v_model.transform(words, aggregate_method = "AVERAGE"))
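
The idea behind `aggregate_method = "AVERAGE"` can be sketched in plain Python: each tweet becomes the element-wise mean of the vectors of its words. The sketch assumes that words missing from the vocabulary are skipped and that a tweet with no known word yields a missing value, which would explain the NA-removal step that follows. The 3-dimensional `EMBEDDINGS` table and the `average_vector` helper are hypothetical; the commented-out training call uses `vec_size = 100`.

```python
# Hypothetical 3-dimensional embeddings, for illustration only
EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "morning": [3.0, 2.0, 0.0],
}

def average_vector(tokens, embeddings=EMBEDDINGS):
    """Return the element-wise mean of the known word vectors, or None if there are none."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # plays the role of an NA row in the H2O frame
    return [sum(values) / len(known) for values in zip(*known)]

print(average_vector(["good", "morning", "xyzzy"]))  # prints [2.0, 1.0, 1.0]
print(average_vector(["xyzzy"]))                     # prints None
```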

## Remove NA rows

In [13]:
tweet_texts = tweet_texts.na_omit()

## Split the tweet texts dataset into training and validation frames

In [14]:
train, valid = tweet_texts.split_frame(ratios=[.8])
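
H2O documents `split_frame` as a probabilistic split, so the 80/20 ratio is approximate rather than exact, and the method accepts a `seed` argument when a reproducible split is needed. The plain-Python sketch below illustrates the general idea of a seeded probabilistic split; it is not H2O's implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random

def probabilistic_split(rows, ratio=0.8, seed=42):
    """Send each row to the first list with probability `ratio`, otherwise to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second

train_rows, valid_rows = probabilistic_split(list(range(10000)))
# The sizes are close to 8000 and 2000 but usually not exactly equal to them
print(len(train_rows), len(valid_rows))
```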

## Train a [Gradient Boosting Machine](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html) model on the word-vector features

In [None]:
from h2o.estimators import H2OGradientBoostingEstimator

predictors = tweet_texts.names[2:]
response = 'Sentiment'

gbm_sentiment_classifier = H2OGradientBoostingEstimator(stopping_metric = "AUC", stopping_tolerance = 0.001,
                                                        stopping_rounds = 5, score_tree_interval = 10,
                                                        model_id = "gbm_sentiment_classifier.hex")

gbm_sentiment_classifier.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

gbm Model Build progress: |█
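
The GBM call above enables early stopping on the validation AUC. With `score_tree_interval = 10` the model is scored every 10 trees, and `stopping_rounds = 5` counts scoring events, so the check spans roughly 50 trees. The sketch below is a simplified illustration of a moving-average stopping rule for a metric where higher is better. It is not H2O's exact rule, and `should_stop` and the score lists are hypothetical.

```python
def should_stop(scores, rounds=5, tolerance=0.001):
    """Simplified early-stopping check on a list of per-scoring-event AUC values."""
    if len(scores) < 2 * rounds:
        return False  # not enough scoring events to compare moving averages
    # Moving averages over windows of `rounds` consecutive scores
    averages = [sum(scores[i - rounds:i]) / rounds
                for i in range(rounds, len(scores) + 1)]
    latest = averages[-1]
    reference = max(averages[-rounds - 1:-1])  # best of the previous `rounds` averages
    # Stop when the latest average fails to beat the reference by the relative tolerance
    return latest < reference * (1 + tolerance)

improving = [0.50 + 0.01 * i for i in range(12)]
flat = [0.80] * 10
print(should_stop(improving))  # prints False
print(should_stop(flat))       # prints True
```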

## Model performance measures ([AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) and [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix))

In [None]:
print("gbm sentiment classifier AUC: " + str(round(gbm_sentiment_classifier.auc(valid = True), 3)))

In [None]:
gbm_sentiment_classifier.confusion_matrix(valid = True)
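
As a reminder of what these two measures mean, here is a minimal plain-Python sketch on made-up labels and scores. AUC is computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting as half. The confusion matrix uses a fixed 0.5 threshold, which is an assumption of this sketch; H2O selects a threshold itself. The helper names are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == "1"]
    neg = [s for l, s in zip(labels, scores) if l == "0"]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a fixed threshold."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, score in zip(labels, scores):
        predicted = "1" if score >= threshold else "0"
        if predicted == "1":
            counts["TP" if label == "1" else "FP"] += 1
        else:
            counts["TN" if label == "0" else "FN"] += 1
    return counts

# Made-up labels and predicted probabilities of the positive class
example_labels = ["1", "1", "0", "0"]
example_scores = [0.9, 0.4, 0.6, 0.1]
print(auc(example_labels, example_scores))               # prints 0.75
print(confusion_counts(example_labels, example_scores))  # prints {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```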

This model is deliberately simple: optimizing its performance is not the main goal of this project. Feel free to choose a better training dataset, to use a different Word2Vec model (or another kind of embedding model), and to tune the GBM (or replace it with another model). Keep in mind that complex models quickly grow large, need a lot of memory, and slow down processing.

## Export the Gradient Boosting Machine model in [POJO](http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#whatisapojo) format

In [None]:
gbm_sentiment_classifier.download_pojo('.')  # Saves the Java file in the current directory (here, the folder of this notebook)