Rule Extraction from Fuzzy ARTMAP Model with Reuters-21578 Corpus on the "Earn" Topic

This notebook illustrates using the saved (pickled) tf-idf vectorizer and the saved Fuzzy ARTMAP Model (A and AB weights, and the set of committed nodes) to generate fuzzy If-Then rules of document relevance.
The first half of each weight_a vector is indexed in the same order as the tf-idf feature labels. The second half repeats that ordering, but its weights are complement coded, so they denote the absence of a term rather than its presence. The set of committed nodes limits the rules and categories explored: Fuzzy ARTMAP can add categories at run time and only the input dimensionality is fixed, so the weight matrices may contain nodes that were never used.
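
As an illustration of the complement-coded layout described above, the following minimal sketch builds a coded vector from toy values and splits it back into its presence and absence halves. The helper names `complement_code` and `split_weight` and the input values are hypothetical; they are not taken from the notebook or the saved model.

```python
def complement_code(values):
    """Return [a, 1 - a] for a list of values in [0, 1]."""
    return list(values) + [1.0 - v for v in values]


def split_weight(weight_row, number_of_features):
    """Split one complement-coded row into (presence, absence) halves."""
    presence = weight_row[:number_of_features]
    absence = weight_row[number_of_features:2 * number_of_features]
    return presence, absence


# Toy tf-idf values for three features (assumed, not from the corpus)
tf_idf_row = [0.0, 0.25, 0.75]
coded = complement_code(tf_idf_row)
presence, absence = split_weight(coded, len(tf_idf_row))

print(coded)     # [0.0, 0.25, 0.75, 1.0, 0.75, 0.25]
print(presence)  # [0.0, 0.25, 0.75]
print(absence)   # [1.0, 0.75, 0.25]
```

Index `i` in the presence half and index `number_of_features + i` in the absence half refer to the same tf-idf feature.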

The model in ./example_data was generated by running Fuzzy ARTMAP with the parameters shown in the JSON block below. They describe a run against the Reuters-21578 corpus with the "earn" topic as the relevant category, a baseline vigilance of 0.95 (rho_a_bar), fast learning (committed_beta of 1.0), and random active learning.

```json
{
    "experiments": {
        "famdg-reuters21578-earn-tf_idf": {
            "corpus_params": {
                "corpus_name": "reuters21578",
                "collection_name": "reuters21578",
                "topic_id": "earn",
                "topic_set": "earn"
            },
            "vectorizer_params": null,
            "vectorizer_type": {
                "__enum__": "VectorizerType.tf_idf"
            },
            "run_notes": "vigilance 0.95 with beta 1.0",
            "fuzzy_artmap_params": {
                "rho_a_bar": 0.95,
                "number_of_mapping_nodes": 50,
                "model_type": "famdg",
                "max_nodes": null,
                "committed_beta": 1.0,
                "active_learning_mode": "random",
                "scheduler_address": "localhost:8786"
            }
        }
    }
}
```
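
To make the roles of `rho_a_bar` and `committed_beta` more concrete, here is a minimal sketch of the textbook Fuzzy ART match test and weight update. It assumes the standard formulation (match = |I AND w| / |I|, update = beta * (I AND w) + (1 - beta) * w). The `famdg` model that produced the saved weights may differ in detail, and the function names and toy vectors are hypothetical.

```python
def fuzzy_and(x, w):
    """Element-wise minimum of two equal-length vectors."""
    return [min(xi, wi) for xi, wi in zip(x, w)]


def match_value(x, w):
    """Fraction of the input covered by the category weights (L1 norms)."""
    return sum(fuzzy_and(x, w)) / sum(x)


def passes_vigilance(x, w, rho):
    """True if the category matches the input at least as well as rho."""
    return match_value(x, w) >= rho


def update_weight(x, w, beta):
    """Move w toward (x AND w); beta = 1.0 is fast learning."""
    overlap = fuzzy_and(x, w)
    return [beta * o + (1.0 - beta) * wi for o, wi in zip(overlap, w)]


# Toy complement-coded input and category weights (assumed values)
x = [0.5, 0.25, 0.5, 0.75]
w = [0.5, 0.5, 0.5, 0.5]

print(match_value(x, w))             # 0.875
print(passes_vigilance(x, w, 0.95))  # False
print(update_weight(x, w, 1.0))      # [0.5, 0.25, 0.5, 0.5]
```

With a vigilance of 0.95 this toy category would be rejected. With beta at 1.0 an accepted category's weights jump straight to the element-wise minimum.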

In [85]:
# Retrieve the pickled tf-idf vectorizer
# This enables assignment of words to the feature positions

from pathlib import Path
import pickle
def get_tf_idf_features(file_name):
    pickle_path = (Path.cwd() / "example_data" / file_name).resolve()
    with open(pickle_path, "rb") as pickled_vectorizer_file:
        tfidf_vectorizer = pickle.load(pickled_vectorizer_file)

    return tfidf_vectorizer.get_feature_names_out()

features = get_tf_idf_features("vec_tf_idf_reuters21578.pkl")


In [86]:
# Retrieve the Fuzzy ARTMAP model weights and info
# Get the A and AB mapping weights, and the set of committed (used) nodes
import torch
def get_model(file_name):
    model_path = (Path.cwd() / "example_data" / file_name).resolve()
    return torch.load(model_path)
weight_a, weight_ab, committed_nodes = get_model("reuters21578_earn_tf_idf.pt")


In [87]:
from collections import namedtuple
CategoryFeatures = namedtuple("CategoryFeatures", ["relevant", "features"])
FeatureRange = namedtuple("FeatureRange", ["min", "max"])


In [88]:
# Re-shape the category data
# Convert feature indexes to words, associate the features and weights with the category label, and calculate the range of the features

def get_category_and_feature_info():
    feature_weights_by_category = {}
    for category_index, category_value in enumerate(weight_ab):
        # Stop at the highest committed node index; that node and any later ones are skipped
        if category_index >= max(committed_nodes):
            break
        category_features = CategoryFeatures(bool(category_value[0]), [])
        feature_min = 1.0
        feature_max = 0.0
        for index, feature in enumerate(features):
            presence = weight_a[category_index][index].item()
            # absence = weight_a[category_index][number_of_features + index]
            if presence == 0.0: #and absence == 1.0:
                continue
            category_features.features.append((feature, presence))
            feature_min = min(feature_min, presence)
            feature_max = max(feature_max, presence)
        feature_weights_by_category[category_index] = (category_features, FeatureRange(feature_min, feature_max))
    return feature_weights_by_category
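
The cell above limits the categories by stopping at the highest committed index. For comparison, the sketch below contrasts that cutoff with a membership test on toy data. It assumes the committed nodes are a collection of integer-like indexes, which the notebook does not confirm. The names `iter_below_max` and `iter_committed` are hypothetical. Whether the cutoff is intentional (for example, if the highest index is reserved for an uncommitted node) is not stated in the source.

```python
def iter_below_max(weight_rows, committed):
    """Yield (index, row) for every index below the highest committed one."""
    limit = max(committed)
    for index, row in enumerate(weight_rows):
        if index >= limit:
            break
        yield index, row


def iter_committed(weight_rows, committed):
    """Yield (index, row) only for indexes present in `committed`."""
    committed_indexes = {int(i) for i in committed}
    for index, row in enumerate(weight_rows):
        if index in committed_indexes:
            yield index, row


# Toy stand-ins for weight rows and committed node indexes (assumed values)
rows = ["w0", "w1", "w2", "w3"]
committed = {0, 2, 3}

print(list(iter_below_max(rows, committed)))  # [(0, 'w0'), (1, 'w1'), (2, 'w2')]
print(list(iter_committed(rows, committed)))  # [(0, 'w0'), (2, 'w2'), (3, 'w3')]
```

The two approaches agree only when the committed indexes are contiguous from zero and the highest one is meant to be excluded.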


In [89]:
# Quantize feature weights and produce human-readable rules
# Quantize based on the min and max feature weights into 3 bins: highly, somewhat, and rarely prevalent in the document
# Use the category label for whether or not a document matching the rule is Relevant or Not Relevant

def convert_category_and_feature_info_to_rule(category_and_feature_info, use_latex_formatting = False):
    if len(category_and_feature_info[0].features) == 0:
        return
    predicates = []
    bin_size = (category_and_feature_info[1].max - category_and_feature_info[1].min) / 3
    for feature, weight in category_and_feature_info[0].features:
        quantitized_weight = ""
        formatted_feature = f"and '{feature}'"
        predicate_end = ""
        if (category_and_feature_info[1].max - bin_size) < weight and weight <= category_and_feature_info[1].max:
            quantitized_weight = "highly"
        elif (category_and_feature_info[1].min + bin_size) < weight and weight <= (category_and_feature_info[1].max - bin_size):
            quantitized_weight = "somewhat"
        else:
            quantitized_weight = "rarely"
        if use_latex_formatting:
            quantitized_weight = f"\\textbf{{{quantitized_weight}}}"
            formatted_feature = f"and& \\textit{{{feature}}}"
            predicate_end = "\\\\"
        predicates.append(f"{formatted_feature} is {quantitized_weight} prevalent in document{predicate_end}")
    if len(predicates) > 0:
        if use_latex_formatting:
            predicates[0] = predicates[0][5:]
        else:
            predicates[0] = predicates[0][4:]
    combined_predicates = "\n".join(predicates)
    relevance = "Relevant" if category_and_feature_info[0].relevant else "Not Relevant"
    if use_latex_formatting:
        rule_if = "\\\\\nIF&"
    else:
        rule_if = "\nIF"
    rule = f"Document is {relevance}{rule_if} {combined_predicates}"
    return rule
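
The three-bin quantization used in `convert_category_and_feature_info_to_rule` can be checked in isolation. The sketch below restates the same comparisons in a hypothetical helper, `quantize`, and applies it to toy weights with a minimum of 0.0 and a maximum of 0.75, so each bin is 0.25 wide. The values are assumed for illustration and do not come from the saved model.

```python
def quantize(weight, low, high):
    """Map a weight to a label using three equal-width bins over [low, high]."""
    bin_size = (high - low) / 3
    if high - bin_size < weight <= high:
        return "highly"
    if low + bin_size < weight <= high - bin_size:
        return "somewhat"
    return "rarely"


low, high = 0.0, 0.75
labels = [quantize(weight, low, high) for weight in (0.1, 0.4, 0.6)]
print(labels)                      # ['rarely', 'somewhat', 'highly']

# Bin edges belong to the lower bin, except the maximum itself
print(quantize(0.25, low, high))   # rarely
print(quantize(0.5, low, high))    # somewhat

# When min equals max the bin size is 0 and every weight falls through to "rarely"
print(quantize(0.3, 0.3, 0.3))     # rarely
```

The last case matters for categories with a single feature, or with identical weights, which will always be reported as "rarely" under this scheme.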


In [90]:
# Display rules
# Iterate through the category and feature info, producing and printing the rules
category_and_feature_infos = get_category_and_feature_info()
for category_id, category_and_feature_info in category_and_feature_infos.items():
    rule = convert_category_and_feature_info_to_rule(category_and_feature_info)
    if rule:
        print(f"category {category_id} rule:\n{rule}")


category 3 rule:
Document is Not Relevant
IF 'reuter' is rarely prevalent in document
and 'said' is highly prevalent in document
category 6 rule:
Document is Not Relevant
IF 'dlrs' is highly prevalent in document
and 'reuter' is rarely prevalent in document
and 'said' is rarely prevalent in document
category 7 rule:
Document is Relevant
IF '15' is rarely prevalent in document
and '16' is highly prevalent in document
and '31' is rarely prevalent in document
and 'cri' is highly prevalent in document
and 'crm' is highly prevalent in document
and 'cts' is somewhat prevalent in document
and 'div' is somewhat prevalent in document
and 'insured' is somewhat prevalent in document
and 'investments' is somewhat prevalent in document
and 'lp' is somewhat prevalent in document
and 'march' is rarely prevalent in document
and 'mortgage' is somewhat prevalent in document
and 'mthly' is highly prevalent in document
and 'pay' is rarely prevalent in document
and 'payout' is somewhat prevalent in documen