## Sentiment Analysis using H2O

Installing the H2O library

In [0]:
! pip install h2o



In [0]:
! apt-get install default-jre
!java -version

Reading package lists... Done
Building dependency tree       
Reading state information... Done
default-jre is already the newest version (2:1.11-68ubuntu1~18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-410
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)


In [0]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split
import h2o
from h2o.automl import H2OAutoML

### H2O

In [0]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,1 hour 12 mins
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.26.0.2
H2O cluster version age:,13 days
H2O cluster name:,H2O_from_python_unknownUser_sw1bms
H2O cluster total nodes:,1
H2O cluster free memory:,2.987 Gb
H2O cluster total cores:,2
H2O cluster allowed cores:,2


Load the documents file into a pandas DataFrame

In [0]:
with open('AllAPI_sentiments_forh2o.json', encoding="utf8") as file:
    h2 = pd.DataFrame(json.load(file))

In [0]:
h2.shape

(811, 5)

Split the data into training and validation sets

In [0]:
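# Positional slices are half-open: rows 0-499 go to train, rows 501-810 to valid.
# Note: row 500 lands in neither split; h2.iloc[500:811] would include it.
train = h2.iloc[0:500]
valid = h2.iloc[501:811]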

In [0]:
print(train.shape)
print(valid.shape)

(500, 5)
(310, 5)
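
The two shapes above add up to 810 rows, one fewer than the 811 rows loaded. The following is a minimal sketch, using only the standard library, of how half-open index ranges can be built so that every row lands in exactly one split. The helper name `contiguous_split` is hypothetical and is not part of the notebook.

```python
def contiguous_split(n_rows, n_train):
    """Return (train, valid) index ranges that cover every row exactly once."""
    if not 0 <= n_train <= n_rows:
        raise ValueError("n_train must be between 0 and n_rows")
    # Half-open ranges: train is [0, n_train), valid is [n_train, n_rows).
    return range(0, n_train), range(n_train, n_rows)


n_rows = 811  # row count reported by h2.shape
train_idx, valid_idx = contiguous_split(n_rows, 500)
print(len(train_idx), len(valid_idx))  # 500 311

# Rows left out by the slices 0:500 and 501:811 used in the notebook.
skipped = sorted(set(range(n_rows)) - set(range(0, 500)) - set(range(501, 811)))
print(skipped)  # [500]
```

In pandas terms these ranges correspond to `h2.iloc[0:500]` and `h2.iloc[500:811]`. The outputs recorded in the rest of this notebook come from the 310-row validation set, so they would shift slightly if row 500 were included.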


Convert the data to H2O format

In [0]:
train_h2o = h2o.H2OFrame(train)
train_h2o.head()

valid_h2o = h2o.H2OFrame(valid)
valid_h2o.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


azure_api_score,google_sentiment_socre,ibm_score,amazon_sentiment_score,sentiment
0.765162,-0.1,0.477986,0.679894,negative
0.5,0.0,-0.391976,0.941681,negative
0.5,0.3,0.792806,0.699461,positive
0.904133,0.0,0.988573,0.674756,positive
0.5,0.0,0.340174,0.740363,positive
0.958334,0.4,0.75921,0.825042,positive
0.5,0.0,0.654095,0.670927,negative
0.5,0.3,0.882205,0.989447,positive
0.969348,0.2,0.872548,0.690706,positive
0.5,0.3,0.615808,0.747006,negative




For a classification problem, it is essential to convert the target variable to a factor with `asfactor()`.

In [0]:
train_h2o['sentiment'] = train_h2o['sentiment'].asfactor()
valid_h2o['sentiment'] = valid_h2o['sentiment'].asfactor()

In [0]:
X_test_h2o = valid_h2o[:,:-1]
y_test_h2o = valid_h2o[:,-1]

In [0]:
f = X_test_h2o.columns
f

['azure_api_score',
 'google_sentiment_socre',
 'ibm_score',
 'amazon_sentiment_score']

In [0]:
target = "sentiment"

Train the model

In [0]:
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = f, y = target,training_frame=train_h2o,validation_frame=valid_h2o)
aml.leaderboard

AutoML progress: |████████████████████████████████████████████████████████| 100%


model_id,auc,logloss,mean_per_class_error,rmse,mse
XGBoost_3_AutoML_20190809_160510,0.705489,0.476937,0.5,0.392509,0.154063
XGBoost_2_AutoML_20190809_160510,0.699436,0.465607,0.5,0.386332,0.149252
GLM_grid_1_AutoML_20190809_160510_model_1,0.697629,0.478484,0.495327,0.392812,0.154301
GBM_grid_1_AutoML_20190809_160510_model_1,0.694264,0.496333,0.5,0.400324,0.160259
DeepLearning_1_AutoML_20190809_160510,0.691482,0.516468,0.488526,0.405548,0.164469
GBM_2_AutoML_20190809_160510,0.689009,0.510342,0.5,0.40737,0.16595
XGBoost_grid_1_AutoML_20190809_160510_model_2,0.688985,0.48278,0.5,0.392985,0.154437
GBM_5_AutoML_20190809_160510,0.688438,0.471274,0.5,0.387042,0.149802
GBM_grid_1_AutoML_20190809_160510_model_2,0.687403,0.499212,0.495327,0.401748,0.161402
GBM_3_AutoML_20190809_160510,0.685132,0.50995,0.5,0.40463,0.163725




The leaderboard ranks an XGBoost model first, with an AUC of about 0.71. Note that the `auc` column reports the area under the ROC curve, not accuracy.
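
The leaderboard's `auc` column is a ranking metric rather than a fraction of correct predictions, so the two numbers can differ noticeably. The sketch below illustrates this with a pairwise definition of AUC on a tiny made-up example. The labels, scores, and helper names are hypothetical and do not come from the notebook's data or from the H2O API.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold):
    """Fraction of correct predictions when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: 1 = positive, 0 = negative.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3]

auc = auc_from_scores(labels, scores)
acc = accuracy_at(labels, scores, threshold=0.5)
print(auc)            # 0.875
print(round(acc, 3))  # 0.667
```

Here seven of the eight positive/negative pairs are ordered correctly, while only four of the six examples are classified correctly at a 0.5 threshold.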

In [0]:
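# Note: H2OFrame row indices are 0-based, so row 1 is the second-ranked model
# (XGBoost_2 above); aml.leader or row 0 gives the top-ranked model.
best_model = h2o.get_model(aml.leaderboard[1,'model_id'])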

In [0]:
best_model.algo

'xgboost'

In [0]:
cm = best_model.confusion_matrix(valid = True)
cm

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4882395565509796: 


,negative,positive,Error,Rate
negative,7.0,43.0,0.86,(43.0/50.0)
positive,5.0,255.0,0.0192,(5.0/260.0)
Total,12.0,298.0,0.1548,(48.0/310.0)
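
The counts in the confusion matrix above are enough to work out accuracy and the per-class error rates by hand. The sketch below does that arithmetic in plain Python. It assumes that "positive" is treated as the positive class when computing precision, recall, and F1, and the variable names are illustrative only.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 7, 43     # actual negative: predicted negative, predicted positive
fn, tp = 5, 255    # actual positive: predicted negative, predicted positive
total = tn + fp + fn + tp  # 310 validation rows

accuracy = (tn + tp) / total
neg_error = fp / (tn + fp)   # share of actual negatives misclassified
pos_error = fn / (fn + tp)   # share of actual positives misclassified
mean_per_class_error = (neg_error + pos_error) / 2

# Accuracy of always predicting the majority class, for comparison.
majority_baseline = max(tn + fp, fn + tp) / total

# Assumption: "positive" is the positive class for these metrics.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy             = {accuracy:.4f}")
print(f"negative-class error = {neg_error:.4f}")
print(f"positive-class error = {pos_error:.4f}")
print(f"mean per-class error = {mean_per_class_error:.4f}")
print(f"majority baseline    = {majority_baseline:.4f}")
print(f"F1 (positive class)  = {f1:.4f}")
```

On these counts the overall accuracy is about 0.845, only slightly above the roughly 0.839 obtained by always predicting "positive", and 43 of the 50 negative examples are misclassified. The class imbalance (50 negative versus 260 positive rows) is why the overall error rate looks much better than the negative-class error.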


