<p style="font-family: Arial; font-size:3.75em;color:purple; font-weight:bold"><br>
Data Analysis - H2O models:
</p><br>

## By Kumar Rahul




In [None]:
# Show which Python executable (environment) the kernel is using
import sys, os

sys.executable

## to open the notebook in presentation mode.

#jupyter nbconvert .ipynb --to slides --post serve

We will use the libraries below for **data import, processing and visualization**. As we progress, we will bring in other libraries for model building and evaluation.

In [None]:
import pandas as pd
import numpy as np
import h2o
import seaborn as sn  # visualization library based on matplotlib
import matplotlib.pyplot as plt

# Start (or connect to) a local H2O cluster
h2o.init()

# Display the output of plotting commands inline within the Jupyter notebook
%matplotlib inline

### Run R code in a Python kernel

To run R code from within a Python kernel, install rpy2 if it is not already installed (`!pip install rpy2`).

Use conda instead of pip if R itself was installed with conda (`!conda install -c r rpy2`).

The rmagic extension has moved into rpy2, which is why rpy2 is needed. Once it is installed, load `rpy2.ipython` with the code below and then run the R commands. You will find older notes that suggest `%load_ext rmagic`, but that no longer works.

#!pip install rpy2
#!conda install -c r rpy2

In [None]:
%reload_ext rpy2.ipython

In [None]:
%R setwd('/Users/Rahul/Documents/Rahul Office/IIMB/Projects @ IIMB/Data')

In [None]:
%R .libPaths()

The Python command to get the working directory:

In [None]:
os.getcwd()


## Data Import and Manipulation

### 1. Importing a data set

This analysis is for customer feedback data on various products used by the customers over a period of time. The feedback was collected between 2017-2018 by the customer care and support division of a company.

Set the `ast_node_interactivity` option of the interactive shell to `"all"` to display the value of every statement in a cell, not only the last one.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

If the file cannot be read, the most likely cause is that it is not UTF-8 encoded.

Open the CSV file in Notepad++ and change the encoding through the Encoding menu -> Convert to UTF-8. Save the file, then run the Python code on it again.

Read the file with `pandas` and then convert it to an `h2o` frame:

In [None]:
raw_df = pd.read_csv( "", 
                        sep = ',', na_values = ['', ' '])
raw_df.columns = raw_df.columns.str.lower().str.replace(' ', '_')


pd.set_option("display.max_columns", None)

#raw_df.head()

#raw_df.columns
raw_h2f = h2o.H2OFrame(raw_df)
raw_h2f.head(4)

The data set is imbalanced: 53% promoters, 43% passives and only 4% detractors. Traditional models may not classify detractors well.

Passives are therefore regrouped with detractors. The assumption is that passive customers could go either way, and if their views lean towards those of detractors, word of mouth (WOM) may turn against the company.
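The regrouping can be sketched in plain Python as follows. This is only an illustration: the label spellings, the `MERGE_MAP` dictionary, the `merge_nps` helper and the sample list are assumptions, not taken from the data file.

```python
from collections import Counter

# Hypothetical mapping: Passive is folded into Detractor, Promoter is unchanged
MERGE_MAP = {"Promoter": "Promoter", "Passive": "Detractor", "Detractor": "Detractor"}

def merge_nps(labels):
    """Map three-class NPS labels onto two classes."""
    return [MERGE_MAP[label] for label in labels]

# Made-up sample with the class mix quoted above (53 / 43 / 4)
sample = ["Promoter"] * 53 + ["Passive"] * 43 + ["Detractor"] * 4
merged = merge_nps(sample)
merged_counts = Counter(merged)
print(merged_counts)
```

With pandas, the same dictionary could be applied through `Series.map`, assuming the source column holds exactly these three labels.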

In [None]:
raw_h2f = h2o.upload_file( "", header = 1,
                         sep = ',', na_strings = ['', ' '])

#raw_h2f.columns = h2o.H2OFrame.tolower().str.replace(' ', '_')
raw_h2f.columns

In [None]:
raw_h2f[1].types

filter_h2f = h2o.deep_copy(raw_h2f, 'filter_h2f')

#raw_h2f.describe()

### Feature lists

Get the numerical features, text features and categorical features in a list.

In [None]:
#numerical_features = []

temp_col_num = filter_h2f.columns_by_type('numeric')
col_num  = [int(elem) for elem in temp_col_num]
numerical_features  = filter_h2f[col_num].columns

In [None]:
# The two features below are redundant
remove_num_feature = ['sl_no','rate_recommend_products_services']

numerical_features = [x for x in numerical_features 
                      if x not in remove_num_feature ]

#print("Numeric features in data")  
#numerical_features

In [None]:
categorical_features = []

temp_col_num = filter_h2f.columns_by_type('categorical')
col_num  = [int(elem) for elem in temp_col_num]
categorical_features  = filter_h2f[col_num].columns


    
categorical_features = [cf for cf in categorical_features if cf not in ['nps_classification']]
print("Categorical features in data")
categorical_features

In [None]:
all_features = list(numerical_features)
all_features.extend(categorical_features)
all_features

In [None]:
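new_h2f = h2o.deep_copy(filter_h2f[all_features], 'new_h2f')
# na_omit() returns a new frame, so assign the result to actually drop rows with missing values
new_h2f = new_h2f.na_omit()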

# Identify predictors and response
response_col = 'merged_nps_classification'
predictors = [x for x in all_features if x not in ['merged_nps_classification']]

#### Significant Feature list
The features below were identified as significant after the first run of the model and are used in the final model:

In [None]:
significant_predictors = [x for x in all_features if x in ['location',
                                                             'number_of_issue_reported',
                                                             'inclusions_exclusions_explained',
                                                             'papers_for_the_new_contracts_received',
                                                             'prmsso_type'
                                                            ]]

In [None]:
significant_predictors

## Model Building: Using **H2O**

### Train and Test split using H2O

In [None]:
train_h2f,test_h2f,valid_h2f = new_h2f.split_frame(ratios=[.70, .15,], seed = 42)

In [None]:
len(train_h2f)
len(test_h2f)
len(valid_h2f)

### Cross Validation Data as in H2O:

In general, for all algos that support the nfolds parameter, H2O’s cross-validation works as follows:

For example, for nfolds=5, 6 models are built. The first 5 models (cross-validation models) are built on 80% of the training data, and a different 20% is held out for each of the 5 models. Then the main model is built on 100% of the training data. This main model is the model you get back from H2O in R, Python and Flow (though the CV models are also stored and available to access later).

This main model contains training metrics and cross-validation metrics (and optionally, validation metrics if a validation frame was provided). The main model also contains pointers to the 5 cross-validation models for further inspection.

All 5 cross-validation models contain training metrics (from the 80% training data) and validation metrics (from their 20% holdout/validation data). To compute their individual validation metrics, each of the 5 cross-validation models had to make predictions on its 20% of the rows of the original training frame, and score them against the true labels of that 20% holdout.

For the main model, this is how the cross-validation metrics are computed: The 5 holdout predictions are combined into one prediction for the full training dataset (i.e., predictions for every row of the training data, but the model making the prediction for a particular row has not seen that row during training). This “holdout prediction” is then scored against the true labels, and the overall cross-validation metrics are computed.

This approach has some implications. Scoring the holdout predictions freshly can result in different metrics than taking the average of the 5 validation metrics of the cross-validation models. For example, if the sizes of the holdout folds differ a lot (e.g., when a user-given fold_column is used), then the average should probably be replaced with a weighted average. Also, if the cross-validation models map to slightly different probability spaces, which can happen for small DL models that converge to different local minima, then the confused rank ordering of the combined predictions would lead to a significantly different AUC than the average.
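The difference between averaging the fold metrics and scoring the pooled holdout predictions can be seen in a small sketch. The two folds and their predictions are made up, and accuracy stands in for whichever metric is reported; this is not H2O's own scoring code.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Made-up holdout folds of unequal size: (true labels, holdout predictions)
folds = [
    ([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1, 0, 0]),  # 8 rows, 7 correct
    ([1, 0], [0, 0]),                                        # 2 rows, 1 correct
]

# Option 1: average the per-fold metrics
per_fold = [accuracy(a, p) for a, p in folds]
mean_of_folds = sum(per_fold) / len(per_fold)

# Option 2: pool all holdout predictions, then score once
pooled_actual = [a for fold_actual, _ in folds for a in fold_actual]
pooled_pred = [p for _, fold_pred in folds for p in fold_pred]
pooled = accuracy(pooled_actual, pooled_pred)

# Option 3: average the per-fold metrics, weighted by fold size
weighted = sum(acc * len(a) for acc, (a, _) in zip(per_fold, folds)) / len(pooled_actual)

print(per_fold, mean_of_folds, pooled, weighted)
```

For accuracy, the size-weighted average coincides with the pooled score. Rank-based metrics such as AUC need not behave this way, which is the second implication described above.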

More about cross-validation at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/cross-validation.html

### AutoML

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

In both the R and Python API, AutoML uses the same data-related arguments, x, y, training_frame, validation_frame, as the other H2O algorithms. Most of the time, all you’ll need to do is specify the data arguments. You can then configure values for max_runtime_secs and/or max_models to set explicit time or number-of-model limits on your run.

More about AutoML at: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

### Build Model

AutoML performs hyperparameter search over a variety of H2O algorithms in order to deliver the best model. In AutoML, the following hyperparameters are supported by grid search. Random Forest and Extremely Randomized Trees are not grid searched (in the current version of AutoML), so they are not included in the list below.

> GBM Hyperparameters: `score_tree_interval`, `histogram_type`, `ntrees`, `max_depth`, `min_rows`, `learn_rate`, `sample_rate`, `col_sample_rate`, `col_sample_rate_per_tree`, `min_split_improvement`

> GLM Hyperparameters: `alpha`, `missing_values_handling`

> Deep Learning Hyperparameters: `epochs`, `adaptive_rate`, `activation`, `rho`, `epsilon`, `input_dropout_ratio`, `hidden`, `hidden_dropout_ratios`



### Frames for Model

If the user doesn’t specify a validation_frame, then one will be created automatically by randomly partitioning the training data. The validation frame is required for early stopping of the individual algorithms, the grid searches and the AutoML process itself.

By default, AutoML uses cross-validation for all models, and therefore we can use cross-validation metrics to generate the leaderboard. If the leaderboard_frame is explicitly specified by the user, then that frame will be used to generate the leaderboard metrics instead of using cross-validation metrics.

For cross-validated AutoML, when the user specifies:

> * training: The training_frame is split into training (80%) and validation (20%).
> * training + leaderboard: The training_frame is split into training (80%) and validation (20%).
> * training + validation: Leave frames as-is.
> * training + validation + leaderboard: Leave frames as-is.

If not using cross-validation (by setting nfolds = 0) in AutoML, then we need to make sure there is a test frame (aka. the “leaderboard frame”) to score on because cross-validation metrics will not be available. So when the user specifies:

> * training: The training_frame is split into training (80%), validation (10%) and leaderboard/test (10%).
> * training + leaderboard: The training_frame is split into training (80%) and validation (20%). Leaderboard frame as-is.
> * training + validation: The validation_frame is split into validation (50%) and leaderboard/test (50%). Training frame as-is.
> * training + validation + leaderboard: Leave frames as-is.
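The two rule lists above can be condensed into a small lookup. The `automl_frame_plan` function is a hypothetical helper written only to summarize the text; it is not part of the H2O API and does no splitting itself.

```python
def automl_frame_plan(has_validation, has_leaderboard, nfolds):
    """Describe how frames are derived, following the rules listed above.

    Hypothetical helper for illustration only; not part of the H2O API.
    """
    if nfolds != 0:  # cross-validated AutoML
        if has_validation:
            return "frames as-is"
        return "training_frame -> 80% training / 20% validation"
    # nfolds == 0: a leaderboard (test) frame is needed for scoring
    if has_validation and has_leaderboard:
        return "frames as-is"
    if has_validation:
        return "validation_frame -> 50% validation / 50% leaderboard"
    if has_leaderboard:
        return "training_frame -> 80% training / 20% validation"
    return "training_frame -> 80% training / 10% validation / 10% leaderboard"

# The run in this notebook: nfolds=0 with validation and leaderboard frames supplied
plan = automl_frame_plan(has_validation=True, has_leaderboard=True, nfolds=0)
print(plan)
```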

In [None]:
from h2o.automl import H2OAutoML
?H2OAutoML

In [None]:
from h2o.automl import H2OAutoML

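# nfolds=0 turns off cross-validation, so the leaderboard is scored on leaderboard_frame.
# The seed gives reproducible results only when max_runtime_secs is not specified.
# Other options tried: max_runtime_secs=300, max_models=10, exclude_algos, balance_classes=True
aml = H2OAutoML(max_models=50, seed=42, nfolds=0)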

#aml.train(x=predictors, y=response_col, training_frame= train_h2f, 
#                 validation_frame=valid_h2f)#, leaderboard_frame=test_h2f)

aml.train(x=predictors, y=response_col, training_frame= train_h2f, 
                 validation_frame=valid_h2f, leaderboard_frame=test_h2f)

### Leaderboard

The AutoML object includes a “leaderboard” of models that were trained in the process, including the 5-fold cross-validated model performance (by default). The number of folds used in the model evaluation process can be adjusted using the nfolds parameter. If the user would like to score the models on a specific dataset, they can specify the leaderboard_frame argument, and then the leaderboard will show scores on that dataset instead.

In [None]:
lb = aml.leaderboard
#lb.head(rows=lb.nrows)
lb

### Specific Models

#### Stacked Ensemble

To understand how the ensemble works, let's take a peek inside the Stacked Ensemble "All Models" model. The "All Models" ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard.

The leader model is stored in `aml.leader`

More about AutoML @ AutoML Tutorial: https://github.com/h2oai/h2o-tutorials/tree/master/h2o-world-2017/automl


In [None]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])

# Get the "All Models" Stacked Ensemble model by name instead of by list position
all_se = [mid for mid in model_ids if "StackedEnsemble" in mid]
all_se
se_model = h2o.get_model([mid for mid in all_se if "AllModels" in mid][0])

In [None]:
# Get the Stacked Ensemble metalearner model if it is the leader
#metalearner = h2o.get_model(aml.leader.metalearner()['name'])

#to get the specific ensemble metalearner model (). Here it is StackedEnsemble_AllModels which is not a leader.
metalearner = h2o.get_model(se_model.metalearner()['name'])
metalearner

Examine the variable importance of the metalearner (combiner) algorithm in the ensemble. This shows us how much each base learner is contributing to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM.
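As a rough illustration of what a non-negative GLM metalearner does, the sketch below combines base-learner probabilities through a logistic function. The three base learners, their weights and the intercept are made up, and the function is a simplified stand-in, not H2O's implementation.

```python
import math

def metalearner_predict(base_probs, weights, intercept):
    """Logistic combination of base-learner probabilities (binomial GLM form)."""
    linear = intercept + sum(w * p for w, p in zip(weights, base_probs))
    return 1.0 / (1.0 + math.exp(-linear))

# Made-up coefficients for three hypothetical base learners; none is negative,
# and a weight of 0 means that learner is effectively dropped from the ensemble
weights = [2.5, 1.5, 0.0]
intercept = -2.0

# Promoter probabilities from the three base learners for one record
base_probs = [0.9, 0.6, 0.1]
ensemble_prob = metalearner_predict(base_probs, weights, intercept)
print(round(ensemble_prob, 3))
```

A larger coefficient means the ensemble leans more heavily on that base learner, which is why the coefficient magnitudes can be read as a measure of contribution.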

In [None]:
metalearner.coef_norm()

%matplotlib inline
metalearner.std_coef_plot()

In [None]:
aml.leader
#rf_model.varimp_plot()

#### DRF

Examine a specific model from the AutoML run (e.g. DRF).

In [None]:
drf_model = h2o.get_model([mid for mid in model_ids if "DRF" in mid][0])
drf_model

In [None]:
#drf_model

In [None]:
drf_model.actual_params

#### GBM

Examine a specific model from the AutoML run (e.g. GBM).

In [None]:
all_gbm_model = ([x for x in model_ids if "GBM" in x])
all_gbm_model

In [None]:
gbm_model = h2o.get_model(all_gbm_model[0])

In [None]:
gbm_model.confusion_matrix()

### Model Performance

View the model performance on the test set.

In [None]:
label_1 ="Promoter"
label_0 = "Detractor"

In [None]:
# Evaluate the leader model on the test set
predict_test_h2f = aml.leader.predict(test_h2f)

#glm_predict_test_h2f = saved_model.predict(test_h2f)

# Put the actual class next to the predicted class and the class probabilities
predict_test_h2f = test_h2f[response_col].cbind(predict_test_h2f)

predict_test_df = predict_test_h2f.as_data_frame()
predict_test_df.head()

A custom cut-off is used to classify the records in the test set.

Threshold `0.26083842534607526` gives the max F1 score on the validation dataset. It is used by default when classifying the records in the test data.

Threshold `0.8214663` gives the max F0.5 score on the validation dataset and is the cut-off applied in the code below.
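To show why the max-F0.5 threshold sits higher than the max-F1 threshold, here is a plain Python sketch that scores a few candidate cut-offs. The labels, scores and candidate thresholds are made up, and `f_beta` and `best_threshold` are illustrative helpers rather than H2O functions; H2O reports these thresholds itself.

```python
def f_beta(actual, scores, threshold, beta):
    """F-beta of the positive class (1) when score > threshold predicts 1."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s > threshold)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s > threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s <= threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(actual, scores, beta, candidates):
    """Candidate threshold with the highest F-beta (first one wins ties)."""
    return max(candidates, key=lambda t: f_beta(actual, scores, t, beta))

# Made-up validation labels (1 = Promoter) and predicted Promoter probabilities
actual = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.75, 0.70, 0.65, 0.60, 0.40, 0.30, 0.10]
candidates = [0.2, 0.5, 0.8]

t_f1 = best_threshold(actual, scores, beta=1.0, candidates=candidates)
t_f05 = best_threshold(actual, scores, beta=0.5, candidates=candidates)
print(t_f1, t_f05)
```

F0.5 weights precision more heavily than recall, so it favours the stricter cut-off: fewer records are labelled Promoter, and more of the doubtful ones fall into the Detractor class.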


In [None]:
predict_test_df['predicted'] = predict_test_df.Promoter.map(lambda x: label_1 if x > 0.8214663 else label_0)
#glm_predict_test_df.columns.values[0] = 'actual'

predict_test_df[0:10]

Define generic functions to report the classification matrix and model statistics.

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
def draw_cm(actual, predicted):
    # Fix the label order so the rows and columns match the tick labels
    plt.figure(figsize=(9, 9))
    cm = metrics.confusion_matrix(actual, predicted, labels=[label_0, label_1])
    sn.heatmap(cm, annot=True, fmt='.0f', xticklabels=[label_0, label_1],
               yticklabels=[label_0, label_1], cmap='Blues_r')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Classification Matrix Plot', size=15)
    plt.show()

In [None]:
def measure_performance(clasf_matrix):
    # Rows are actual classes and columns are predicted classes, ordered [label_0, label_1]
    tn, fp = clasf_matrix[0, 0], clasf_matrix[0, 1]
    fn, tp = clasf_matrix[1, 0], clasf_matrix[1, 1]
    measure = pd.DataFrame({
        'sensitivity': [round(tp / (fn + tp), 2)],
        'specificity': [round(tn / (tn + fp), 2)],
        'recall': [round(tp / (fn + tp), 2)],
        'precision': [round(tp / (fp + tp), 2)],
        'overall_acc': [round((tn + tp) / (tn + fp + fn + tp), 2)]
    })
    return measure

In [None]:
# Confusion matrix for H2O's default classification, then for the custom cut-off
draw_cm(predict_test_df.merged_nps_classification, predict_test_df['predict'])
draw_cm(predict_test_df.merged_nps_classification, predict_test_df['predicted'])
#draw_cm(glm_predict_test_df.actual, glm_predict_test_df.predicted )

cm = metrics.confusion_matrix(predict_test_df.merged_nps_classification, predict_test_df['predicted'],
                              labels=[label_0, label_1])
model_test_metrics = measure_performance(cm)
model_test_metrics

In [None]:
#model_perff

aml.leader.model_performance(test_h2f)

## Save the Model

There are two ways to save the leader model -- binary format and MOJO format. If you're taking your leader model to production, then we'd suggest the MOJO format since it's optimized for production use.

In [None]:
# save the model
#model_path = h2o.save_model(model=aml.leader, path="/Users/Rahul/Documents/", force=True)
#print(model_path)

#mojo
#aml.leader.download_mojo(path="/Users/Rahul/Documents/")


# load the model
#saved_model = h2o.load_model(model_path)


# THANK YOU

***
