# ZOO

This notebook shows an implementation of the Zoo example. In this task, the attributes of animals (such as `hair`, `feathers`, `eggs`, and `milk`) and their targets (the class each animal belongs to, such as `mammal` or `bird`) are given, along with a domain knowledge base containing rules that relate attributes to targets. The task is to train a model that predicts the target of an animal from its attributes.

Intuitively, we first use a machine learning model (learning part) to convert the input attributes to target classes (we call them pseudo-labels), and then use the knowledge base (reasoning part) to check whether these pseudo-labels are consistent with the rules. Since most of the training data is treated as unlabeled, in Abductive Learning the reasoning part leverages domain knowledge and revises the initial pseudo-labels yielded by the learning part through abductive reasoning. This process enables us to further update the machine learning model.

In [1]:
# Import necessary libraries and modules
import os.path as osp
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from examples.zoo.get_dataset import load_and_preprocess_dataset, split_dataset
from abl.learning import ABLModel
from examples.zoo.kb import ZooKB
from abl.reasoning import Reasoner
from abl.data.evaluation import ReasoningMetric, SymbolMetric
from abl.utils import ABLLogger, print_log, confidence_dist
from abl.bridge import SimpleBridge

## Working with Data

First, we get the training and testing datasets:

In [2]:
# Load and preprocess the Zoo dataset
X, y = load_and_preprocess_dataset(dataset_id=62)

# Split data into labeled/unlabeled/test data
X_label, y_label, X_unlabel, y_unlabel, X_test, y_test = split_dataset(X, y, test_size=0.3)
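The internals of `split_dataset` are not shown in this notebook. As a rough illustration of a labeled/unlabeled/test split, here is a minimal sketch in plain Python. The function name `three_way_split` and the `label_size` and `seed` parameters are assumptions made for this example, not part of the Zoo example code.

```python
import random


def three_way_split(X, y, test_size=0.3, label_size=0.1, seed=0):
    """Hypothetical stand-in for split_dataset: shuffle indices, then cut three parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_size)
    n_label = round(len(idx) * label_size)
    test_idx = idx[:n_test]
    label_idx = idx[n_test:n_test + n_label]
    unlabel_idx = idx[n_test + n_label:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    # Same return order as in the cell above: labeled, unlabeled, test
    return (pick(X, label_idx), pick(y, label_idx),
            pick(X, unlabel_idx), pick(y, unlabel_idx),
            pick(X, test_idx), pick(y, test_idx))


# Made-up data: 20 rows with two attributes each
X_toy = [[i, i % 2] for i in range(20)]
y_toy = [i % 3 for i in range(20)]
X_l, y_l, X_u, y_u, X_t, y_t = three_way_split(X_toy, y_toy)
print(len(X_l), len(X_u), len(X_t))
```

The three parts are disjoint and together cover the whole dataset, so no test row leaks into training.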

`X` holds one row of 16 attributes per animal (boolean attributes plus the numeric `legs` count), and `y` holds the index of each animal's class. The shapes and first few elements are illustrated as follows.

Note: for the unlabeled split, the ground-truth labels (the `gt_pseudo_label` component of the data tuples built below) are only used to evaluate the performance of the learning part, not to train the model.

In [3]:
print("Shape of X and y:", X.shape, y.shape)
print("First five elements of X:")
print(X[:5])
print("First five elements of y:")
print(y[:5])

Shape of X and y: (101, 16) (101,)
First five elements of X:
[[True False False True False False True True True True False False 4
  False False True]
 [True False False True False False False True True True False False 4
  True False True]
 [False False True False False True True True True False False True 0
  True False False]
 [True False False True False False True True True True False False 4
  False False True]
 [True False False True False False True True True True False False 4
  True False True]]
First five elements of y:
[0 0 3 0 0]


Next, we transform the tabular data into the format required by ABL-Package, which is a tuple of (X, gt_pseudo_label, Y).

For tabular data in ABL-Package, each example contains a single instance (a row from the dataset).

For these tabular examples, the reasoning result Y is expected to be 0, indicating that no rules are violated.

In [4]:
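def transform_tab_data(X, y):
    # Wrap each row and each label in a one-element list (one instance per example);
    # the expected reasoning result of every example is 0
    return ([[x] for x in X], [[y_item] for y_item in y], [0] * len(y))


label_data = transform_tab_data(X_label, y_label)
test_data = transform_tab_data(X_test, y_test)
train_data = transform_tab_data(X_unlabel, y_unlabel)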
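To make the resulting structure concrete, the snippet below applies the same transformation to two made-up rows. The toy values are invented for illustration and are not taken from the Zoo dataset.

```python
def transform_tab_data(X, y):
    # Same transformation as in the cell above, repeated so this snippet is standalone
    return ([[x] for x in X], [[y_item] for y_item in y], [0] * len(y))


toy_X = [[1, 0, 4], [0, 1, 2]]  # two made-up rows with three attributes each
toy_y = [0, 1]                  # made-up class indices
X_ex, gt_pseudo_label, Y = transform_tab_data(toy_X, toy_y)

print(X_ex)             # [[[1, 0, 4]], [[0, 1, 2]]]
print(gt_pseudo_label)  # [[0], [1]]
print(Y)                # [0, 0]
```

Each example is a list holding exactly one instance, and each example's expected reasoning result is 0.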

## Building the Learning Part

To build the learning part, we first need to build a machine learning base model. We use a [Random Forest](https://en.wikipedia.org/wiki/Random_forest) as the base model.

In [5]:
base_model = RandomForestClassifier()

However, the base model built above deals with instance-level data and cannot directly deal with example-level data. Therefore, we wrap the base model in `ABLModel`, which enables the learning part to train, test, and predict on example-level data.

In [6]:
model = ABLModel(base_model)
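The difference between instance-level and example-level data can be sketched as follows. This is an illustration of the idea only: `ToyExampleLevelModel` and `first_attribute_label` are hypothetical names, and the code does not reproduce how `ABLModel` is actually implemented.

```python
class ToyExampleLevelModel:
    """Hypothetical wrapper: flattens examples into instances and regroups predictions."""

    def __init__(self, base_predict):
        # base_predict maps a list of instances to a list of labels
        self.base_predict = base_predict

    def predict(self, examples):
        # Flatten all examples into one list of instances
        flat = [instance for example in examples for instance in example]
        flat_pred = self.base_predict(flat)
        # Regroup the flat predictions so they mirror the example structure
        grouped, start = [], 0
        for example in examples:
            grouped.append(flat_pred[start:start + len(example)])
            start += len(example)
        return grouped


def first_attribute_label(instances):
    # Made-up instance-level "model": the label is simply the first attribute
    return [row[0] for row in instances]


toy_model = ToyExampleLevelModel(first_attribute_label)
toy_examples = [[[1, 0]], [[0, 1]], [[1, 1], [0, 0]]]
print(toy_model.predict(toy_examples))  # [[1], [0], [1, 0]]
```

In the Zoo task every example contains a single instance, so the regrouping step produces one-element lists.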

## Building the Reasoning Part

In the reasoning part, we first build a knowledge base that contains rules relating the attributes of an animal to its target class. It is provided by `ZooKB`, which follows the usual pattern of subclassing `KBBase`: the subclass initializes the `pseudo_label_list` parameter, specifying the list of possible pseudo-labels, and overrides the `logic_forward` function, defining how to perform (deductive) reasoning.

In [7]:
kb = ZooKB()

Attribute names are:  ['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize']
Target names are:  ['mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'invertebrate']


The knowledge base can perform logical reasoning (both deductive and abductive). Below is an example of performing (deductive) reasoning; refer to the documentation for details of abductive reasoning.

In [10]:
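# A single example: pseudo-label 0 with the 16 attribute values of the first row of X
pseudo_label = [0]
data_point = [np.array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1])]
print(kb.logic_forward(pseudo_label, data_point))

# Check every row of the dataset against its ground-truth label
for x, y_item in zip(X, y):
    print(x, y_item)
    print(kb.logic_forward([y_item], [x]))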
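The rules inside `ZooKB` are not listed in this notebook. To give a feel for a `logic_forward` that returns 0 when no rule is violated, here is a minimal sketch in plain Python. The three attributes, two targets, and two rules are made up for illustration, and counting violated rules is an assumption about how such a result could be computed.

```python
# Hypothetical toy setup, not the attributes or rules used by ZooKB
ATTRS = ["hair", "feathers", "milk"]
TARGETS = ["mammal", "bird"]

# Each rule returns True when it is satisfied
RULES = [
    lambda a, t: (not a["milk"]) or t == "mammal",    # milk implies mammal
    lambda a, t: (not a["feathers"]) or t == "bird",  # feathers implies bird
]


def logic_forward(pseudo_label, data_point):
    """Deductive reasoning: count the rules violated by (attributes, target)."""
    attrs = dict(zip(ATTRS, data_point[0]))
    target = TARGETS[pseudo_label[0]]
    return sum(0 if rule(attrs, target) else 1 for rule in RULES)


def abduce(data_point):
    """Abductive reasoning: list every pseudo-label that violates no rule."""
    return [[i] for i in range(len(TARGETS)) if logic_forward([i], data_point) == 0]


cow = [[1, 0, 1]]  # hair, no feathers, milk
print(logic_forward([0], cow))  # 0: labeling the cow a mammal violates nothing
print(logic_forward([1], cow))  # 1: labeling it a bird violates "milk implies mammal"
print(abduce(cow))              # [[0]]
```

Deduction maps a pseudo-label to a reasoning result, while abduction goes the other way and searches for pseudo-labels whose result matches the expected value of 0.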



Note: In addition to building a knowledge base based on `KBBase`, we can also establish a knowledge base with a ground KB using `GroundKB`, or a knowledge base implemented based on Prolog files using `PrologKB`. The corresponding code for these implementations can be found in the `main.py` file. Those interested are encouraged to examine it for further insights.

Then, we create a reasoner by instantiating the class `Reasoner`. Because abductive reasoning is nondeterministic, there may be multiple candidates compatible with the knowledge base. When this happens, the reasoner minimizes the inconsistency between the knowledge base and the pseudo-labels predicted by the learning part, and returns only the candidate with the highest consistency. Here, consistency is measured by a custom `dist_func` that adds the confidence-based distance of each candidate to its reasoning result.

In [11]:
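def consistency(data_example, candidates, candidate_idxs, reasoning_results):
    # Distance based on the model's predicted probabilities for each candidate
    pred_prob = data_example.pred_prob
    model_scores = confidence_dist(pred_prob, candidate_idxs)
    # Reasoning result of each candidate (0 means no rules are violated)
    rule_scores = np.array(reasoning_results)
    # A lower total score means a more consistent candidate
    scores = model_scores + rule_scores
    return scores


reasoner = Reasoner(kb, dist_func=consistency)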
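As a worked illustration of this scoring, the sketch below picks the candidate with the lowest combined score on made-up numbers. `toy_confidence_dist` is a hypothetical stand-in that assumes the distance is one minus the product of the predicted probabilities of the candidate's labels; it is not the library's `confidence_dist`, and `pick_candidate` is likewise an invented helper.

```python
def toy_confidence_dist(pred_prob, candidate_idxs):
    # Assumed distance: 1 - product of the probabilities of the candidate's labels
    dists = []
    for candidate in candidate_idxs:
        p = 1.0
        for position, label_idx in enumerate(candidate):
            p *= pred_prob[position][label_idx]
        dists.append(1.0 - p)
    return dists


def pick_candidate(pred_prob, candidate_idxs, reasoning_results):
    # Combined score: model distance plus reasoning result; lower is better
    model_scores = toy_confidence_dist(pred_prob, candidate_idxs)
    scores = [m + r for m, r in zip(model_scores, reasoning_results)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return candidate_idxs[best], scores


toy_pred_prob = [[0.6, 0.3, 0.1]]  # one instance, three classes (made-up probabilities)
toy_candidates = [[0], [1], [2]]   # each candidate assigns one label to the instance
toy_violations = [1, 0, 0]         # made-up reasoning results per candidate

best, scores = pick_candidate(toy_pred_prob, toy_candidates, toy_violations)
print(best)  # [1]
```

Class 0 has the highest predicted probability, but its rule violation adds 1 to its score, so the most confident rule-consistent candidate is chosen instead.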

## Building Evaluation Metrics

Next, we set up evaluation metrics. These metrics will be used to evaluate the model performance during training and testing. Specifically, we use `SymbolMetric` and `ReasoningMetric`, which are used to evaluate the accuracy of the machine learning model’s predictions and the accuracy of the final reasoning results, respectively.

In [12]:
metric_list = [SymbolMetric(prefix="zoo"), ReasoningMetric(kb=kb, prefix="zoo")]
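The two kinds of accuracy can be sketched in plain Python as follows. The functions `symbol_accuracy`, `reasoning_accuracy`, and `toy_logic_forward` are hypothetical illustrations of the idea and do not reproduce the implementations of `SymbolMetric` or `ReasoningMetric`.

```python
def symbol_accuracy(pred_pseudo_label, gt_pseudo_label):
    # Fraction of individual pseudo-labels that match the ground truth
    pairs = [(p, g)
             for pred, gt in zip(pred_pseudo_label, gt_pseudo_label)
             for p, g in zip(pred, gt)]
    return sum(p == g for p, g in pairs) / len(pairs)


def reasoning_accuracy(pred_pseudo_label, X, Y, logic_forward):
    # Fraction of examples whose reasoning result equals the expected result
    hits = sum(logic_forward(pred, x) == y
               for pred, x, y in zip(pred_pseudo_label, X, Y))
    return hits / len(Y)


def toy_logic_forward(pseudo_label, data_point):
    # Made-up rule: if the first attribute is 1, the label must be 0
    return int(data_point[0][0] == 1 and pseudo_label[0] != 0)


toy_X = [[[1, 0]], [[0, 1]], [[1, 1]]]
toy_gt = [[0], [1], [0]]
toy_Y = [0, 0, 0]
toy_pred = [[0], [1], [1]]  # the third prediction is wrong and breaks the rule

sym_acc = symbol_accuracy(toy_pred, toy_gt)
reas_acc = reasoning_accuracy(toy_pred, toy_X, toy_Y, toy_logic_forward)
print(sym_acc, reas_acc)
```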

## Bridging Learning and Reasoning

Now, the last step is to bridge the learning and reasoning parts. We do this by creating an instance of `SimpleBridge`.

In [13]:
bridge = SimpleBridge(model, reasoner, metric_list)

Perform training and testing by invoking the `train` and `test` methods of `SimpleBridge`.

In [None]:
# Build logger
print_log("Abductive Learning on the ZOO example.", logger="current")
log_dir = ABLLogger.get_current_instance().log_dir
weights_dir = osp.join(log_dir, "weights")

# Pre-train the machine learning model
base_model.fit(X_label, y_label)

# Test the initial model
print("------- Test the initial model -----------")
bridge.test(test_data)

# Use ABL to train the model
print("------- Use ABL to train the model -----------")
bridge.train(
    train_data=train_data,
    label_data=label_data,
    loops=3,
    segment_size=len(X_unlabel),
    save_dir=weights_dir,
)

# Test the final model
print("------- Test the final model -----------")
bridge.test(test_data)
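The overall procedure (pre-train on labeled data, predict pseudo-labels, revise them by abduction, retrain) can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the implementation of `SimpleBridge`: the 1-nearest-neighbor model, the single rule, and all helper names are made up for this example.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


class OneNN:
    """Tiny 1-nearest-neighbor classifier standing in for the base model."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)

    def predict(self, X):
        return [self.y[min(range(len(self.X)), key=lambda i: hamming(x, self.X[i]))]
                for x in X]


def violations(label, row):
    # Made-up rule: if the first attribute is 1, the label must be 0
    return int(row[0] == 1 and label != 0)


def abduce(label, row, n_classes=2):
    # Keep a consistent prediction; otherwise pick a label that violates no rule
    if violations(label, row) == 0:
        return label
    return next(c for c in range(n_classes) if violations(c, row) == 0)


def toy_abl_train(model, X_label, y_label, X_unlabel, loops=3):
    model.fit(X_label, y_label)                 # pre-train on labeled data
    for _ in range(loops):
        pseudo = model.predict(X_unlabel)       # learning part predicts
        revised = [abduce(p, x) for p, x in zip(pseudo, X_unlabel)]  # reasoning revises
        model.fit(X_label + X_unlabel, y_label + revised)            # model is updated
    return model


X_lab, y_lab = [[1, 0, 0], [0, 1, 1]], [0, 1]
X_unlab = [[1, 1, 1], [0, 0, 0], [1, 1, 0]]

initial = OneNN()
initial.fit(X_lab, y_lab)
before = initial.predict([[1, 1, 1]])  # [1], which violates the rule

trained = toy_abl_train(OneNN(), X_lab, y_lab, X_unlab)
after = trained.predict([[1, 1, 1]])   # [0] after abductive revision
print(before, after)
```

In this toy run the pre-trained model mislabels `[1, 1, 1]`, abduction revises the pseudo-label to one consistent with the rule, and the retrained model then predicts the revised label.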