# INTRODUCTION

The data comes from Kaggle and consists of animal shelter outcome records. Outcomes represent the status of animals as they leave the Animal Center. All animals receive a unique Animal ID during intake.

In this competition, we predict the outcome of each animal as it leaves the Animal Center. The possible outcomes are Adoption, Died, Euthanasia, Return to owner, and Transfer.

The dataset can be found at https://www.kaggle.com/c/shelter-animal-outcomes/data

# Using H2O

### IMPORTING LIBRARIES

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import time, warnings, h2o, logging, os, sys, psutil, random
import numpy as np
from h2o.automl import H2OAutoML

In [2]:
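# Give H2O 95% of the currently available RAM, expressed in whole GB
pct_memory = 0.95
virtual_memory = psutil.virtual_memory()
# 1073741824 bytes = 1024**3 = 1 GiB
min_mem_size = int(round(int(pct_memory * virtual_memory.available) / 1073741824, 0))
print(min_mem_size)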

9
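
The sizing arithmetic above can be isolated as a small helper, which makes it easier to check on its own. This is only a sketch: the name `mem_budget_gb` is hypothetical, and the byte count in the example is illustrative rather than taken from the run above.

```python
BYTES_PER_GIB = 1024 ** 3  # 1073741824, the constant used in the cell above


def mem_budget_gb(available_bytes, pct=0.95):
    """Return the share `pct` of the available memory as a whole number of GiB."""
    return int(round(int(pct * available_bytes) / BYTES_PER_GIB, 0))


# Illustrative value: 10 GiB available, 90% of it reserved for H2O
print(mem_budget_gb(10 * BYTES_PER_GIB, pct=0.9))
```

In the notebook, `available_bytes` would be `psutil.virtual_memory().available`.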


In [3]:
# Connect to a cluster on a random port to avoid clashing with other local H2O instances
port_no = random.randint(5555, 55555)

try:
    h2o.init(strict_version_check=False, min_mem_size_GB=min_mem_size, port=port_no)  # start h2o
except Exception:
    # logs_path and logfile are not defined yet at this point, and a failed init
    # leaves no cluster to download logs from or shut down, so just log and exit
    logging.critical('h2o.init failed', exc_info=True)
    sys.exit(2)

Checking whether there is an H2O instance running at http://localhost:12739..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)
  Starting server from /home/nikunj/miniconda3/envs/py3.6/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpodv98b4p
  JVM stdout: /tmp/tmpodv98b4p/h2o_nikunj_started_from_python.out
  JVM stderr: /tmp/tmpodv98b4p/h2o_nikunj_started_from_python.err
  Server is running at http://127.0.0.1:12739
Connecting to H2O server at http://127.0.0.1:12739... successful.


H2O cluster uptime:,01 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.1.2
H2O cluster version age:,1 month and 8 days
H2O cluster name:,H2O_from_python_nikunj_h1l0om
H2O cluster total nodes:,1
H2O cluster free memory:,8.62 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


### READING DATA AND PRE-PROCESSING

In [4]:
# Set the path to the training data (the data directory is expected at ../data)
os.chdir('../data')
data_dir = os.getcwd()
train_data = os.path.join(data_dir, 'train.csv')

In [5]:
#Ingest data
train_data = h2o.import_file(path = train_data, destination_frame = "train_data")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
#Peeking inside the data
train_data.show()

AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan
A677334,Elsa,2014-04-25 13:04:00,Transfer,Partner,Dog,Intact Female,1 month,Cairn Terrier/Chihuahua Shorthair,Black/Tan
A699218,Jimmy,2015-03-28 13:11:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Tabby
A701489,,2015-04-30 17:02:00,Transfer,Partner,Cat,Unknown,3 weeks,Domestic Shorthair Mix,Brown Tabby
A671784,Lucy,2014-02-04 17:17:00,Adoption,,Dog,Spayed Female,5 months,American Pit Bull Terrier Mix,Red/White
A677747,,2014-05-03 07:48:00,Adoption,Offsite,Dog,Spayed Female,1 year,Cairn Terrier,White
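
`AgeuponOutcome` is stored as text such as `1 year`, `3 weeks` or `5 months`, so it is parsed as a categorical column. One possible pre-processing step, which is not part of this notebook's pipeline, is to convert it to an approximate age in days. The sketch below assumes every value has the form `<integer> <unit>`; the helper name `age_to_days` and the 30-day month and 365-day year are assumptions made for illustration.

```python
# Approximate length of each unit in days (assumed, not exact calendar arithmetic)
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def age_to_days(age):
    """Convert a string such as '3 weeks' to days; return None if it cannot be parsed."""
    if not age:
        return None
    parts = age.strip().split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    count = int(parts[0])
    unit = parts[1].lower().rstrip("s")  # 'weeks' -> 'week'
    days = DAYS_PER_UNIT.get(unit)
    return None if days is None else count * days


for sample in ["1 year", "3 weeks", "5 months", ""]:
    print(repr(sample), "->", age_to_days(sample))
```

Returning `None` for missing or unexpected values leaves the choice of imputation to a later step.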


In [7]:
# Show the type, summary statistics and missing-value count for each column
train_data.describe()

Rows:26729
Cols:10




Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
type,string,enum,time,enum,enum,enum,enum,enum,enum,enum
mins,,,1380619860000.0,,,,,,,
mean,,,1418948543956.003,,,,,,,
maxs,,,1456082220000.0,,,,,,,
sigma,,,21433414554.633263,,,,,,,
zeros,0,,0,,,,,,,
missing,0,7691,0,0,13612,0,1,18,0,0
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White


In [8]:
target = 'OutcomeType'

def get_independent_variables(train_data, targ):
    """Return every column except the target, ordered as ints, enums, then all other types."""
    ints, reals, enums = [], [], []
    for key, val in train_data.types.items():
        if key == targ:
            continue
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)
        else:
            # catch-all branch: also collects string and time columns
            reals.append(key)
    return ints + enums + reals

X = get_independent_variables(train_data, target)
print(X)
y = target

['Name', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'AgeuponOutcome', 'Breed', 'Color', 'AnimalID', 'DateTime']
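
The predictor list printed above contains `AnimalID`, a unique identifier, and `OutcomeSubtype`, which is recorded together with the outcome and may therefore leak the target. A variant that drops such columns could look like the sketch below. It works on a plain dict shaped like `train_data.types` so that it runs without a cluster; the helper name `select_predictors` and the choice of excluded columns are assumptions, not what the notebook actually ran.

```python
def select_predictors(col_types, target, exclude=()):
    """Return column names in their original order, minus the target and excluded columns."""
    skip = set(exclude) | {target}
    return [name for name in col_types if name not in skip]


# Column types as reported by train_data.describe() above
col_types = {
    "AnimalID": "string", "Name": "enum", "DateTime": "time",
    "OutcomeType": "enum", "OutcomeSubtype": "enum", "AnimalType": "enum",
    "SexuponOutcome": "enum", "AgeuponOutcome": "enum", "Breed": "enum",
    "Color": "enum",
}

X_alt = select_predictors(col_types, "OutcomeType",
                          exclude=["AnimalID", "OutcomeSubtype"])
print(X_alt)
```

Training on `X_alt` instead of `X` would be expected to give a higher but more honest logloss.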


In [9]:
train_data[y] = train_data[y].asfactor()

In [10]:
train_data.describe()

Rows:26729
Cols:10




Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
type,string,enum,time,enum,enum,enum,enum,enum,enum,enum
mins,,,1380619860000.0,,,,,,,
mean,,,1418948543956.003,,,,,,,
maxs,,,1456082220000.0,,,,,,,
sigma,,,21433414554.633263,,,,,,,
zeros,0,,0,,,,,,,
missing,0,7691,0,0,13612,0,1,18,0,0
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White


In [11]:
# Set up AutoML with a time budget (in seconds) for the whole run
run_time = 333
aml = H2OAutoML(max_runtime_secs=run_time)

In [12]:
# Directory and file name used when downloading H2O logs after a failure
os.chdir('../logs')
logs_path = os.getcwd()
logfile = 'logs.txt'

In [13]:
model_start_time = time.time()

try:
    aml.train(x=X, y=y, training_frame=train_data)
except Exception:
    # On failure, keep the traceback and the cluster logs, then stop the cluster
    logging.critical('aml.train failed', exc_info=True)
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(4)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [14]:
# Record the wall-clock training time in seconds
meta_data = {}
meta_data['model_execution_time'] = {"classification": time.time() - model_start_time}
meta_data

{'model_execution_time': {'classification': 412.43854880332947}}

In [15]:
print(aml.leaderboard)

model_id,mean_per_class_error,logloss,rmse,mse
StackedEnsemble_AllModels_AutoML_20190227_155417,0.126207,0.258704,0.291684,0.0850794
StackedEnsemble_BestOfFamily_AutoML_20190227_155417,0.126207,0.258704,0.291684,0.0850794
DRF_1_AutoML_20190227_155417,0.130335,0.29565,0.298685,0.0892126
XRT_1_AutoML_20190227_155417,0.20958,0.604429,0.442417,0.195732
GLM_grid_1_AutoML_20190227_155417_model_1,0.681595,1.01313,0.613699,0.376627





## Save the leaderboard model

There are two ways to save the leader model: binary format and MOJO format. If you're taking the leader model to production, we'd suggest the MOJO format, since it's optimized for production use.
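
A minimal sketch of both options, assuming a running H2O cluster and a trained model such as the AutoML leader. `h2o.save_model` and `download_mojo` are part of the H2O Python API; the helper name `save_leader` and the output directory are hypothetical. MOJO export is not available for every algorithm in every H2O version, so treat the second call as something to verify for your model.

```python
def save_leader(model, out_dir):
    """Save `model` in binary and MOJO form; return the two paths H2O reports."""
    import h2o  # imported lazily so the helper can be defined before h2o is set up

    # Binary format: reloadable with h2o.load_model(), tied to the H2O version that wrote it
    binary_path = h2o.save_model(model=model, path=out_dir, force=True)
    # MOJO format: a zip archive meant for scoring outside the cluster
    mojo_path = model.download_mojo(path=out_dir)
    return binary_path, mojo_path
```

With the objects from this notebook, a call might look like `save_leader(aml.leader, '../models')`, where `'../models'` is a placeholder directory.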

In [16]:
best_model = h2o.get_model(aml.leaderboard[0,'model_id'])

In [17]:
best_model.algo

'stackedensemble'

In [18]:
print(best_model.logloss(train=True))  # logloss on the training data, not a held-out estimate

0.14849808524524816


## RESULTS

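Our evaluation metric for this dataset is logloss. The best score on the Kaggle leaderboard is logloss = 0.0000, whereas our leader model, a stacked ensemble trained with H2O AutoML, reaches logloss = 0.1485 on the training data and 0.2587 on the AutoML leaderboard above. Compared by value alone, 0.1485 would place 3rd on the Kaggle public leaderboard, within the top 1% of this competition. This comparison is optimistic, however: Kaggle scores are computed on a held-out test set, while 0.1485 is measured on the same rows the model was trained on, and the predictors include OutcomeSubtype, which is recorded together with the outcome and may leak it. The leaderboard for this competition is linked below: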

[Kaggle Leaderboard](https://www.kaggle.com/c/shelter-animal-outcomes/leaderboard)
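
For reference, multi-class logloss is the mean negative log of the probability assigned to the true class. The sketch below computes it in plain Python on two made-up predictions. The probabilities are illustrative and not model output, and the clipping constant `eps` is an assumption added to keep `log(0)` from occurring.

```python
import math


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean of -log(p_true), with p_true clipped to [eps, 1 - eps]."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Two illustrative rows: predicted class probabilities keyed by outcome
y_true = ["Adoption", "Transfer"]
probs = [
    {"Adoption": 0.80, "Died": 0.00, "Euthanasia": 0.05,
     "Return_to_owner": 0.10, "Transfer": 0.05},
    {"Adoption": 0.30, "Died": 0.00, "Euthanasia": 0.10,
     "Return_to_owner": 0.10, "Transfer": 0.50},
]

print(round(multiclass_logloss(y_true, probs), 4))
```

Only the probability of the true class enters the score, so a confident wrong prediction is penalised much more heavily than an uncertain one.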
