# Iceberg Classification Step 6: Model Analysis

Note that this notebook needs the classic Jupyter Notebook for widget visualization.

This notebook was tested with the following configuration on Hopsworks.
<div>
<img src="fig/step6_jupyter_config.png" width="900" align="center"/>
</div>

### What-If Tool in a Jupyter notebook

WARNING: This notebook only runs on "classic" Jupyter, not on JupyterLab.

This notebook shows how to use the [What-If Tool](https://pair-code.github.io/what-if-tool) inside a Jupyter notebook.

This notebook trains a linear classifier on the iceberg dataset loaded below (predicting whether an example is an iceberg from its incidence angle and the differences between its two bands).

It then visualizes the results of the trained classifier on test data using the What-If Tool.


In [1]:
import os
import statistics
import functools
from hops import hdfs
import pandas as pd
import numpy as np
import tensorflow as tf

## Load the original dataset as a pandas DataFrame

In [2]:
DATA_FOLDER = 'eodata'
# get data path
train_ds_path = os.path.join(hdfs.project_path(), DATA_FOLDER,'train.json')
print("train_ds_path:", train_ds_path)

# read the raw data to pandas dataframe
raw_train_df = pd.read_json(train_ds_path)

raw_train_df['inc_angle'] = raw_train_df['inc_angle'].replace('na', '-1').astype('float64')
raw_train_df['is_iceberg'] = raw_train_df['is_iceberg'].astype('int64')

train_ds_path: hdfs://rpc.namenode.service.consul:8020/Projects/ExtremeEarth/eodata/train.json


In [3]:
raw_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1604 entries, 0 to 1603
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          1604 non-null   object 
 1   band_1      1604 non-null   object 
 2   band_2      1604 non-null   object 
 3   inc_angle   1604 non-null   float64
 4   is_iceberg  1604 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 62.8+ KB


## Create some new features.

In [4]:
# helper functions that combine the two bands of a row
def list_avg(row):
    """Take the element-wise average of two lists."""
    return [sum(x)/2 for x in zip(row['band_1'], row['band_2'])]

def elementwise_absolute_difference(row):
    """Take the element-wise absolute difference of two lists."""
    return [abs(x[0] - x[1]) for x in zip(row['band_1'], row['band_2'])]

# element-wise average between band_1 and band_2
raw_train_df['band_avg'] = raw_train_df.apply(list_avg, axis=1)
# max of element-wise absolute difference between band_1 and band_2.
raw_train_df['elementwise_diff_max'] = raw_train_df.apply(lambda row: max(elementwise_absolute_difference(row)), axis=1)
# min of element-wise absolute difference between band_1 and band_2.
raw_train_df['elementwise_diff_min'] = raw_train_df.apply(lambda row: min(elementwise_absolute_difference(row)), axis=1)
# average of element-wise absolute difference between band_1 and band_2.
raw_train_df['elementwise_diff_mean'] = raw_train_df.apply(lambda row: statistics.mean(elementwise_absolute_difference(row)), axis=1)
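
To see what the three scalar features measure, here is a minimal worked example on two made-up three-element bands. The values in `band_1` and `band_2` are illustrative assumptions, not rows from the dataset, and the helper here takes two lists directly rather than a dataframe row.

```python
import statistics

def elementwise_absolute_difference(band_1, band_2):
    """Element-wise absolute difference of two equal-length lists."""
    return [abs(a - b) for a, b in zip(band_1, band_2)]

# Hypothetical toy bands (real rows hold much longer lists).
band_1 = [-20.0, -18.0, -16.0]
band_2 = [-24.0, -18.0, -21.0]

diff = elementwise_absolute_difference(band_1, band_2)
band_avg = [(a + b) / 2 for a, b in zip(band_1, band_2)]

features = {
    "elementwise_diff_max": max(diff),
    "elementwise_diff_min": min(diff),
    "elementwise_diff_mean": statistics.mean(diff),
}
print(diff)      # [4.0, 0.0, 5.0]
print(band_avg)  # [-22.0, -18.0, -18.5]
print(features)
```

A minimum of exactly 0.0 means that at least one position holds the same value in both bands.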

In [5]:
raw_train_df.sample(5)

| | id | band_1 | band_2 | inc_angle | is_iceberg | band_avg | elementwise_diff_max | elementwise_diff_min | elementwise_diff_mean |
|---|---|---|---|---|---|---|---|---|---|
| 945 | 07d60155 | [-17.512577, -14.719253, -13.760696, -12.83898... | [-23.238712, -24.047201, -24.261593, -23.23845... | 31.5671 | 0 | [-20.3756445, -19.383227, -19.0111445, -18.038... | 21.087152 | 0.0 | 8.852034 |
| 121 | 1dc1c160 | [-20.755899, -19.012949, -18.019245, -18.37876... | [-24.277725, -25.033548, -24.774302, -24.03989... | 40.3904 | 1 | [-22.516812, -22.0232485, -21.396773500000002,... | 20.078608 | 0.0 | 7.542646 |
| 1101 | 3e07a2a6 | [-22.152166, -20.476685, -23.338564, -26.96702... | [-25.76689, -25.766947, -26.051815, -27.635498... | 38.1382 | 0 | [-23.959528, -23.121816, -24.6951895, -27.3012... | 21.138098 | 0.0 | 6.059063 |
| 859 | f67babb0 | [-25.127279, -25.687901, -28.381918, -28.38196... | [-28.785894, -33.646702, -33.646744, -32.95155... | 42.5591 | 1 | [-26.9565865, -29.6673015, -31.014331, -30.666... | 18.061802 | 0.0 | 5.387069 |
| 23 | bd1a1bdf | [-14.6148, -14.6148, -16.136662, -15.342532, -... | [-26.656, -26.656, -22.534969, -25.496277, -26... | 37.6866 | 1 | [-20.6354, -20.6354, -19.335815500000002, -20.... | 24.378874 | 0.0 | 9.293626 |


In [6]:
raw_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1604 entries, 0 to 1603
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     1604 non-null   object 
 1   band_1                 1604 non-null   object 
 2   band_2                 1604 non-null   object 
 3   inc_angle              1604 non-null   float64
 4   is_iceberg             1604 non-null   int64  
 5   band_avg               1604 non-null   object 
 6   elementwise_diff_max   1604 non-null   float64
 7   elementwise_diff_min   1604 non-null   float64
 8   elementwise_diff_mean  1604 non-null   float64
dtypes: float64(4), int64(1), object(4)
memory usage: 112.9+ KB


## Split into train and test set.

In [7]:
mask = np.random.rand(len(raw_train_df)) < 0.8
train_df = raw_train_df[mask]
test_df = raw_train_df[~mask]
print('Training dataframe has {} rows.\nTest dataframe has {} rows.'.format(len(train_df), len(test_df)))

Training dataframe has 1276 rows.
Test dataframe has 328 rows.
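
Because the mask above is drawn without a seed, the row counts change from run to run. Below is a minimal sketch of the same 80/20 mask split made reproducible. It uses the standard library's `random` module on plain row indices instead of NumPy and pandas; the seed value 42 and the helper name `split_indices` are arbitrary assumptions.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Split row indices 0..n_rows-1 with a seeded random mask."""
    rng = random.Random(seed)
    # One independent draw per row, as with np.random.rand(len(df)) < 0.8.
    mask = [rng.random() < train_fraction for _ in range(n_rows)]
    train_idx = [i for i, keep in enumerate(mask) if keep]
    test_idx = [i for i, keep in enumerate(mask) if not keep]
    return train_idx, test_idx

train_idx, test_idx = split_indices(1604)
# Sizes are close to, but not exactly, an 80/20 split.
print(len(train_idx), len(test_idx))
```

In the notebook itself, the same effect could be had by drawing the mask from a seeded NumPy generator, for example `np.random.default_rng(42).random(len(raw_train_df)) < 0.8`.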


## Model Analysis Preparation

In [8]:
def df_to_examples(df, columns=None):
    """Converts a dataframe into a list of tf.Example protos."""
    examples = []
    if columns is None:
        columns = df.columns.values.tolist()
    for index, row in df.iterrows():
        example = tf.train.Example()
        for col in columns:
            if df[col].dtype == np.dtype(np.int64):
                example.features.feature[col].int64_list.value.append(int(row[col]))
            elif df[col].dtype == np.dtype(np.float64):
                example.features.feature[col].float_list.value.append(row[col])
            # NaN is not equal to itself, so this skips missing string values.
            elif row[col] == row[col]:
                example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
        examples.append(example)
    return examples

def create_feature_spec(df, columns=None):
    """Creates a tf feature spec from the dataframe and columns specified."""
    feature_spec = {}
    if columns is None:
        columns = df.columns.values.tolist()
    for f in columns:
        if df[f].dtype == np.dtype(np.int64):
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.int64)
        elif df[f].dtype == np.dtype(np.float64):
            # float64 columns are stored in a float_list, which parses as float32.
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.float32)
        else:
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.string)
    return feature_spec


def tfexamples_input_fn(examples, feature_spec, label, mode=tf.estimator.ModeKeys.EVAL,
                        num_epochs=None,
                        batch_size=64):
    """An input function for providing input to a model from tf.Examples."""
    def ex_generator():
        # Yield each example as a serialized proto string.
        for example in examples:
            yield example.SerializeToString()
    dataset = tf.data.Dataset.from_generator(
        ex_generator, tf.dtypes.string, tf.TensorShape([]))
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)
    # Batch first, then parse whole batches of serialized examples at once.
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda tf_example: parse_tf_example(tf_example, label, feature_spec))
    # num_epochs=None repeats the dataset indefinitely.
    dataset = dataset.repeat(num_epochs)
    return dataset


def create_feature_columns(columns, feature_spec, df=None):
    """Creates simple numeric and categorical feature columns 
    from a feature spec and a list of columns from that spec to use.
    
    `df` is only needed for categorical (string) columns, whose vocabulary is read from it.
    
    NOTE: Models might perform better with some feature engineering such as bucketed 
    numeric columns and hash-bucket/embedding columns for categorical features.
    """
    ret = []
    for col in columns:
        if feature_spec[col].dtype in (tf.int64, tf.float32):
            ret.append(tf.feature_column.numeric_column(col))
        else:
            # The vocabulary comes from the dataframe, so it must be passed in.
            if df is None:
                raise ValueError('df is required to build the vocabulary for column {!r}'.format(col))
            ret.append(tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_list(col, list(df[col].unique()))))
    return ret


def parse_tf_example(example_proto, label, feature_spec):
    """Parses Tf.Example protos into features for the input function."""
    parsed_features = tf.io.parse_example(serialized=example_proto, features=feature_spec)
    target = parsed_features.pop(label)
    return parsed_features, target
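
The type dispatch in `df_to_examples` can be sketched without TensorFlow. The function below only returns the name of the `tf.Example` value list that a cell would be appended to; `example_field_for` is a hypothetical helper for illustration, not part of the notebook or of TensorFlow.

```python
def example_field_for(dtype_name, value):
    """Return the tf.Example value-list name for a cell, or None if it is skipped."""
    if dtype_name == "int64":
        return "int64_list"
    if dtype_name == "float64":
        return "float_list"
    # Any other column is treated as a string column. NaN is the only value
    # that is not equal to itself, so `value == value` is False exactly for
    # missing cells, and those are left out of the example.
    if value == value:
        return "bytes_list"
    return None

print(example_field_for("int64", 1))              # int64_list
print(example_field_for("float64", 31.5671))      # float_list
print(example_field_for("object", "07d60155"))    # bytes_list
print(example_field_for("object", float("nan")))  # None
```

Note that the self-equality check only guards the string branch; a NaN in a `float64` column would still be written to the `float_list`.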

In [9]:
# Set the column in the dataset you wish for the model to predict
label_column = 'is_iceberg'

# The label column is already numeric (0 and 1): 'is_iceberg' was cast to int64
# when the data was loaded. Examples with is_iceberg == 1 are in the '1' (iceberg)
# class and all other examples are in the '0' (ship) class.

# Set list of all columns from the dataset we will use for model input.
input_features = ['inc_angle', 'elementwise_diff_max', 'elementwise_diff_min', 'elementwise_diff_mean']

# Create a list containing all input features and the label column
features_and_labels = input_features + [label_column]

print('features_and_labels are {}'.format(features_and_labels))

features_and_labels are ['inc_angle', 'elementwise_diff_max', 'elementwise_diff_min', 'elementwise_diff_mean', 'is_iceberg']


In [10]:
examples = df_to_examples(train_df, features_and_labels)

# number of steps to train
num_steps = 500  #@param {type: "number"}

# Create a feature spec for the classifier
feature_spec = create_feature_spec(train_df, features_and_labels)

# Bind the examples, feature spec, and label to the input function used for training
train_inpf = functools.partial(tfexamples_input_fn, examples, feature_spec, label_column)
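
Since `num_epochs` is left at `None`, the dataset repeats indefinitely and the length of training is set by `num_steps` alone. Because batching happens before `repeat`, each pass over the data ends with one partial batch. The worked example below shows how many passes 500 steps amount to, assuming the 1276 training rows reported for this run (the count varies with the random split).

```python
import math

n_train_rows = 1276  # from the split output above; varies between runs
batch_size = 64      # default of tfexamples_input_fn
num_steps = 500

# batch() runs before repeat(), so every pass yields full batches plus one partial batch.
batches_per_pass = math.ceil(n_train_rows / batch_size)
last_batch_size = n_train_rows - (batches_per_pass - 1) * batch_size
passes = num_steps / batches_per_pass

print(batches_per_pass)  # 20
print(last_batch_size)   # 60
print(passes)            # 25.0
```

The `functools.partial` call does not pass `mode`, so the input function keeps its `EVAL` default and the training data is not shuffled.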

In [11]:
# Define a linear classifier
classifier = tf.estimator.LinearClassifier(feature_columns=create_feature_columns(input_features, feature_spec))

# Train the classifier
classifier.train(train_inpf, steps=num_steps)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp0uymxyoe', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initiali



Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp0uymxyoe/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.6931472, step = 0
INFO:tensorflow:global_step/sec: 24.828
INFO:tensorflow:loss = 0.68952304, step = 100 (4.030 sec)
INFO:tensorflow:global_step/sec: 24.5715
INFO:tensorflow:loss = 0.6611427, step = 200 (4.070 sec)
INFO:tensorflow:global_step/sec: 24.7972
INFO:tensorflow:loss = 0.75395554, step = 300 (4.033 sec)
INFO:tensorflow:global_step/sec: 25.017
INFO:tensorflow:loss = 0.67420125, step = 400 (3.997 sec)
INFO:tensorflow:Cal

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifierV2 at 0x7f709d420510>
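
The loss logged at step 0, 0.6931472, equals ln 2. That is the binary cross-entropy of a model that assigns probability 0.5 to every example, which is what a linear model outputs while all its weights are still zero. Zero initial weights are an assumption about the estimator's starting state, though the logged value is consistent with it. A quick check:

```python
import math

def binary_cross_entropy(label, prob_of_one):
    """Log loss of one example for a predicted probability of class 1."""
    return -(label * math.log(prob_of_one) + (1 - label) * math.log(1 - prob_of_one))

# With a zero logit the sigmoid gives 0.5 regardless of the features.
prob = 1 / (1 + math.exp(-0.0))
loss = binary_cross_entropy(1, prob)

print(prob)            # 0.5
print(round(loss, 7))  # 0.6931472
```

The later log lines stay between roughly 0.66 and 0.75, so after 400 steps the loss has moved little from this baseline.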

In [12]:
#@title Invoke What-If Tool for test data and the trained model {display-mode: "form"}

num_datapoints = 2000  #@param {type: "number"}
tool_height_in_px = 1000  #@param {type: "number"}

from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

# Use at most num_datapoints test rows in the tool
test_examples = df_to_examples(test_df.iloc[:num_datapoints], features_and_labels)

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(test_examples).set_estimator_and_feature_spec(classifier, feature_spec).set_label_vocab(['not iceberg', 'is iceberg'])
WitWidget(config_builder, height=tool_height_in_px)

WitWidget(config={'model_type': 'classification', 'label_vocab': ['not iceberg', 'is iceberg'], 'are_sequence_…

# End of Step 6.