<img src="https://drive.google.com/uc?id=1fqvUVxD8GcnwREynD5HQYndqP68SnIBd">

## Objective

**The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival. MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.**

## Data Description 

The data includes:

**Training data** for 91,713 encounters.  
**Unlabeled test data** for 39,308 encounters, which includes all the information in the training data except for the values for hospital_death.  
**WiDS Datathon 2020 Dictionary** with supplemental information about the data, including the category (e.g., identifier, demographic, vitals), unit of measure, data type (e.g., numeric, binary), description, and examples.  
**Sample submission files**

## H2O

**H2O describes itself as ‘the open source in-memory, prediction engine for Big Data science’. It is a feature-rich, open source machine learning platform known for its R and Spark integration and its ease of use. It runs inside a Java virtual machine (JVM) and is optimised for in-memory processing of distributed, parallel machine learning algorithms on clusters.**

**The aim of H2O is to provide a platform that makes it easy for non-experts to experiment with machine learning. Its architecture can be divided into layers: the top layer consists of the client APIs, and the bottom layer is the H2O JVM.**

In [1]:
# importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import lightgbm as lgb
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


## Starting H2O and Inspecting the Cluster  

There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster, which is done with the h2o.init() function. If no cluster is running at the default address, h2o.init() starts a local one, as the output below shows.

In [2]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpk52nfdng
  JVM stdout: /tmp/tmpk52nfdng/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpk52nfdng/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 02 secs |
| H2O cluster timezone | Etc/UTC |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.28.0.1 |
| H2O cluster version age | 1 month and 19 days |
| H2O cluster name | H2O_from_python_unknownUser_vdn4fp |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.556 Gb |
| H2O cluster total cores | 4 |
| H2O cluster allowed cores | 4 |


In [3]:
# loading dataset 
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")

In [4]:
# creating independent features X and dependent feature y
y = pd.DataFrame(training_v2['hospital_death'])
X = training_v2.drop('hospital_death', axis = 1)

In [5]:
# Remove features with more than 60 percent missing values
train_missing = (X.isnull().sum() / len(X)).sort_values(ascending = False)
train_missing = train_missing.index[train_missing > 0.60]
X = X.drop(columns = train_missing)
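
The filter in the cell above can be checked on a small frame. The following is a minimal sketch on made-up data (the column names `a`, `b`, and `c` are hypothetical); it uses `isnull().mean()`, which gives the same per-column fraction as dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is 75% missing, column "c" is 25% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 5.0, 6.0, 7.0],
})

# Fraction of missing values per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Columns whose missing fraction exceeds the 0.60 threshold used in the code above
too_sparse = missing_fraction.index[missing_fraction > 0.60]

filtered = toy.drop(columns=too_sparse)
print(list(too_sparse))
print(list(filtered.columns))
```

Only `b` crosses the threshold here, so `c` survives despite having a missing value; those remaining gaps are what the imputation step later fills.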

In [6]:
#Convert categorical variable into dummy/indicator variables.
X = pd.get_dummies(X)
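
As a rough illustration of what `pd.get_dummies` does to a mixed frame, here is a sketch on made-up data. The column names and values are assumptions chosen for the example, not rows from the competition file.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [70, 54, 61],
    "icu_type": ["MICU", "CCU", "MICU"],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Each distinct category adds one column, so high-cardinality categorical features can widen the frame considerably.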

In [7]:
# Fill missing values with the column mean (the default SimpleImputer strategy)
my_imputer = SimpleImputer()
X = pd.DataFrame(my_imputer.fit_transform(X), columns = X.columns)
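
`SimpleImputer()` with no arguments replaces each missing value with the mean of its column. The sketch below reproduces that idea with plain pandas on made-up data (the column names are hypothetical); it is an equivalent illustration, not the scikit-learn implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "heart_rate": [80.0, np.nan, 100.0],
    "temp": [36.5, 37.5, np.nan],
})

# Mean imputation: replace each NaN with the mean of the observed values in its column
imputed = toy.fillna(toy.mean())
print(imputed)
```

Mean imputation keeps every row but shrinks the variance of heavily imputed columns, which is one reason very sparse columns are dropped beforehand.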

In [8]:
# Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = X.corr().abs()
# Upper triangle of correlations (np.bool was removed from NumPy, so use the built-in bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
#Drop the columns with high correlations
X = X.drop(columns = to_drop)

There are 36 columns to remove.
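
The upper-triangle trick above ensures that, of each highly correlated pair, only the later column is dropped. A small sketch on made-up data (hypothetical column names) shows the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: x_copy is a scaled copy of x, z is only weakly related to x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_copy": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()

# Keep only entries strictly above the diagonal so each pair is examined once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)

# A column is dropped if it correlates above 0.9 with any earlier column
to_drop = [col for col in upper.columns if any(upper[col] > 0.9)]
print(to_drop)
```

Because only the upper triangle is inspected, `x` is kept and `x_copy` is removed; which member of a pair survives therefore depends on column order.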


In [9]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced')
for i in range(2):
    
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(X, y, test_size = 0.25, random_state = i)
    
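    # Train using early stopping (callbacks require LightGBM >= 3.3; the fit
    # keywords early_stopping_rounds and verbose were removed in LightGBM 4.0)
    model.fit(train_features, train_y, eval_set = [(valid_features, valid_y)], eval_metric = 'auc',
              callbacks = [lgb.early_stopping(stopping_rounds = 100), lgb.log_evaluation(period = 200)])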
    
    # Record the feature importances
    feature_importances += model.feature_importances_


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[90]	valid_0's auc: 0.895395	valid_0's binary_logloss: 0.356349
Training until validation scores don't improve for 100 rounds
[200]	valid_0's auc: 0.891263	valid_0's binary_logloss: 0.313702
Early stopping, best iteration is:
[162]	valid_0's auc: 0.892908	valid_0's binary_logloss: 0.326333


In [10]:
# Make sure to average feature importances! 
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
# Drop features with zero importance
X = X.drop(columns = zero_features)

There are 17 features with 0.0 importance
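
The averaging and zero-importance filter can be traced on a tiny example. This is a sketch with invented importance values and hypothetical feature names, standing in for the two LightGBM runs above.

```python
import numpy as np
import pandas as pd

feature_names = ["f1", "f2", "f3"]             # hypothetical feature names
run_importances = [np.array([10, 0, 4]),       # invented importances from run 1
                   np.array([6, 0, 0])]        # invented importances from run 2

# Average the importances over the runs
averaged = sum(run_importances) / len(run_importances)

table = pd.DataFrame({"feature": feature_names, "importance": averaged})
table = table.sort_values("importance", ascending=False)

# Only features that were never used in any run end up with an average of exactly 0.0
zero_features = list(table[table["importance"] == 0.0]["feature"])
print(zero_features)
```

Note that `f3` has zero importance in one run but is kept, because its average is non-zero; averaging over several splits makes the filter less sensitive to a single unlucky split.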


In [11]:
# Put the target back as the first column so it sits at index 0 of the H2OFrame
X = y.join(X)


## H2OFrame

H2OFrame is the primary data store for H2O. It is similar to a pandas DataFrame, with one critical distinction: the data is generally not held in the client's memory. Instead it lives on a (possibly remote) H2O cluster, and the H2OFrame is merely a handle to that data.

In [12]:
X = h2o.H2OFrame(X)

Parse progress: |█████████████████████████████████████████████████████████| 100%


## split_frame()

split_frame() splits a frame into distinct subsets whose sizes are determined by the given ratios. The number of subsets is always one more than the number of ratios given. The split is not exact: to stay efficient on big data, H2O uses a probabilistic splitting method, so the subset sizes only approximate the requested ratios.

In [13]:
# split into train and validation sets
train, valid = X.split_frame(ratios = [.8], seed = 1234)
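
To see why a probabilistic split gives only approximate sizes, consider the following pure-Python sketch. It is not H2O's implementation; the function name `probabilistic_split` and its parameters are hypothetical, and it only illustrates the general idea of assigning each row independently by a random draw.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row index to the first subset with probability `ratio`, else to the second."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(n_rows):
        # Each row is decided independently, so no global row count is needed
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            valid_idx.append(i)
    return train_idx, valid_idx

train_idx, valid_idx = probabilistic_split(n_rows=10_000, ratio=0.8, seed=1234)
print(len(train_idx), len(valid_idx))  # close to an 80/20 split, but usually not exact
```

Since every row is decided on its own, the work parallelises trivially across a cluster, at the cost of subset sizes that fluctuate around the target.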

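**asfactor()** converts columns in the current frame to categoricals. Here it is applied to column 0, the hospital_death target, so that H2O treats the problem as classification rather than regression.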

In [14]:
train[0] = train[0].asfactor()
valid[0] = valid[0].asfactor()

In [15]:
from h2o.estimators import H2OXGBoostEstimator

# XGBoost hyperparameters
param = {
    "ntrees": 100,
    "max_depth": 10,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "min_rows": 5,
    "seed": 4241,
    "score_tree_interval": 100,
}
model = H2OXGBoostEstimator(**param)
# Column 0 is the target (hospital_death); all remaining columns are predictors
model.train(x = list(range(1, train.shape[1])), y = 0, training_frame = train, validation_frame = valid)

xgboost Model Build progress: |███████████████████████████████████████████| 100%


In [16]:
model.model_performance(valid)


ModelMetricsBinomial: xgboost
** Reported on test data. **

MSE: 0.060203122517298036
RMSE: 0.24536324606040333
LogLoss: 0.23250324076518067
Mean Per-Class Error: 0.201726318219548
AUC: 0.88128607601079
AUCPR: 0.5405850313795718
Gini: 0.76257215202158
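
Two of the reported numbers follow directly from others: RMSE is the square root of MSE, and the Gini coefficient equals 2 × AUC − 1. A quick check using the values printed above:

```python
import math

# Values copied from the model_performance output above
mse = 0.060203122517298036
rmse = 0.24536324606040333
auc = 0.88128607601079
gini = 0.76257215202158

rmse_from_mse = math.sqrt(mse)   # RMSE is the square root of MSE
gini_from_auc = 2 * auc - 1      # Gini is a rescaling of AUC to the range [-1, 1]

print(rmse_from_mse, gini_from_auc)
```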

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3183365629778968: 


| Actual / Predicted | 0 | 1 | Error | Rate |
|---|---|---|---|---|
| 0 | 16046.0 | 684.0 | 0.0409 | (684.0/16730.0) |
| 1 | 821.0 | 805.0 | 0.5049 | (821.0/1626.0) |
| Total | 16867.0 | 1489.0 | 0.082 | (1505.0/18356.0) |
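
The error rates and the F1 score at this threshold can be recomputed from the four cell counts of the confusion matrix. The sketch below uses the standard definitions of precision, recall, and F1, with class 1 (hospital death) as the positive class.

```python
# Counts copied from the confusion matrix above
tn, fp = 16046, 684   # actual 0: predicted 0, predicted 1
fn, tp = 821, 805     # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

error_class0 = fp / (tn + fp)                     # share of survivors flagged as deaths
error_class1 = fn / (fn + tp)                     # share of deaths that were missed
overall_error = (fp + fn) / (tn + fp + fn + tp)

print(round(f1, 6), round(error_class0, 4), round(error_class1, 4), round(overall_error, 3))
```

The overall error of about 8% looks small, but roughly half of the actual deaths are missed at this threshold, which is why the per-class errors are more informative than accuracy on such an imbalanced target.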



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.318337 | 0.516854 | 179.0 |
| max f2 | 0.160629 | 0.587529 | 279.0 |
| max f0point5 | 0.473529 | 0.580829 | 120.0 |
| max accuracy | 0.473529 | 0.929015 | 120.0 |
| max precision | 0.912335 | 1.0 | 0.0 |
| max recall | 0.070901 | 1.0 | 399.0 |
| max specificity | 0.912335 | 1.0 | 0.0 |
| max absolute_mcc | 0.39527 | 0.472705 | 148.0 |
| max min_per_class_accuracy | 0.141684 | 0.796653 | 296.0 |
| max mean_per_class_accuracy | 0.139429 | 0.798274 | 298.0 |



Gains/Lift Table: Avg response rate:  8.86 %, avg score: 14.35 %


| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.010024 | 0.739886 | 9.939275 | 9.939275 | 0.880435 | 0.808287 | 0.880435 | 0.808287 | 0.099631 | 0.099631 | 893.927483 | 893.927483 |
| 2 | 0.020048 | 0.636343 | 8.65085 | 9.295063 | 0.766304 | 0.684988 | 0.82337 | 0.746637 | 0.086716 | 0.186347 | 765.085031 | 829.506257 |
| 3 | 0.030017 | 0.548306 | 7.464346 | 8.687039 | 0.661202 | 0.590898 | 0.76951 | 0.694912 | 0.074416 | 0.260763 | 646.434645 | 768.703888 |
| 4 | 0.040041 | 0.483892 | 6.319415 | 8.094328 | 0.559783 | 0.51534 | 0.717007 | 0.649958 | 0.063346 | 0.324108 | 531.941548 | 709.432772 |
| 5 | 0.050011 | 0.427835 | 4.996794 | 7.476845 | 0.442623 | 0.455886 | 0.662309 | 0.611271 | 0.049815 | 0.373924 | 399.67939 | 647.684549 |
| 6 | 0.100022 | 0.272968 | 3.32031 | 5.398578 | 0.294118 | 0.339701 | 0.478214 | 0.475486 | 0.166052 | 0.539975 | 232.030967 | 439.857758 |
| 7 | 0.150033 | 0.205635 | 2.324217 | 4.373791 | 0.205882 | 0.236793 | 0.387436 | 0.395921 | 0.116236 | 0.656212 | 132.421677 | 337.379064 |
| 8 | 0.200044 | 0.167208 | 1.512586 | 3.658489 | 0.133987 | 0.184809 | 0.324074 | 0.343143 | 0.075646 | 0.731857 | 51.258552 | 265.848936 |
| 9 | 0.300011 | 0.127359 | 1.088917 | 2.802276 | 0.096458 | 0.144701 | 0.24823 | 0.27702 | 0.108856 | 0.840713 | 8.891682 | 180.227625 |
| 10 | 0.400033 | 0.106878 | 0.670211 | 2.269187 | 0.059368 | 0.11605 | 0.201008 | 0.236772 | 0.067036 | 0.907749 | -32.978934 | 126.918726 |
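
The first row of the gains/lift table can be reproduced by hand. In the sketch below, the totals come from the confusion matrix, while the group-1 counts (184 rows, 162 deaths) are not printed anywhere in the output: they are inferred from the reported fractions and should be read as an assumption.

```python
# Totals from the confusion matrix above
total_rows, total_positives = 18356, 1626

# Assumed counts for group 1 (the highest-scored ~1% of rows), inferred from the table
group_rows, group_positives = 184, 162

avg_response_rate = total_positives / total_rows   # share of deaths overall
response_rate = group_positives / group_rows       # share of deaths inside the group
lift = response_rate / avg_response_rate           # how much richer the group is than average
capture_rate = group_positives / total_positives   # share of all deaths found in the group
gain = (lift - 1) * 100                            # lift expressed as a percentage improvement

print(round(avg_response_rate, 4), round(response_rate, 6), round(lift, 6),
      round(capture_rate, 6), round(gain, 6))
```

Read this way, the top 1% of patients by predicted risk contains about 10% of all deaths, roughly ten times what a random 1% sample would contain.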







## References

Lee, M., Raffa, J., Ghassemi, M., Pollard, T., Kalanidhi, S., Badawi, O., Matthys, K., & Celi, L. A. (2020). WiDS (Women in Data Science) Datathon 2020: ICU Mortality Prediction. PhysioNet. doi:10.13026/vc0e-th79

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23), e215-e220.

Official H2O documentation