# The Sparse Tensor Classifier Policy

By representing the input data as a tensor, `SparseTensorClassifier` (STC) is natively suited to multi-valued attributes and multi-labeled data. In such cases, it may be convenient to represent the different attributes as separate dimensions of the tensor, to avoid mixing different aspects of the data items into a single dimension of the feature space. When this is not required, for example with a single multi-valued attribute such as text, or with multiple single-valued attributes such as tabular data, STC can scale down to a matrix-based data representation, which uses a single dimension for all the features.
In this tutorial you'll learn how to represent the data with multi-dimensional features and how to avoid missing predictions that may arise from this representation.


## Colab 

This tutorial and the rest in [this sequence](https://github.com/SparseTensorClassifier/tutorial) can be run in Google Colab. If you'd like to open this notebook in Colab, click [here](https://colab.research.google.com/github/SparseTensorClassifier/tutorial/blob/main/Quickstart_Policy.ipynb).

![](https://colab.research.google.com/assets/colab-badge.svg)

## Setup

Uncomment and run the following cell to install the packages. Then, import the modules.

In [1]:
# !pip install stc numpy pandas scikit-learn

In [2]:
import numpy as np
import pandas as pd
from stc import SparseTensorClassifier
from sklearn.metrics import accuracy_score

np.random.seed(0)

## Read the dataset

The dataset consists of 101 animals from a zoo. There are 16 variables with various traits to describe the animals. The 7 Class Types are: Mammal, Bird, Reptile, Fish, Amphibian, Bug and Invertebrate. Let's read and shuffle the data.

In [3]:
zoo = pd.read_csv('https://raw.githubusercontent.com/SparseTensorClassifier/tutorial/main/data/zoo/zoo.csv')
zoo = zoo.sample(frac=1, random_state=42)
zoo

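| | animal_name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | class_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | squirrel | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | Mammal |
| 55 | oryx | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| 66 | porpoise | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | Mammal |
| 67 | puma | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| 45 | lion | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | pike | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | Fish |
| 71 | rhea | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | Bird |
| 14 | crab | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | Invertebrate |
| 92 | tuna | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | Fish |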


## Transform the data and add noise

To better illustrate how STC deals with (potentially noisy) multi-valued attributes, let's transform the zoo dataset. Convert each row to a JSON-like record with the following dimensions:
- `f1`, containing a clean set of attributes, such as `hair=1` (has hair), `eggs=0` (does not lay eggs), etc.
- `f2`, containing the number of legs plus random noise (a random number of random integers)
- `n1` and `n2`, containing pure noise

In [4]:
items = []
for _, row in zoo.iterrows():
    item = {}
    # f1: clean attributes, i.e. every trait except the number of legs
    item['f1'] = [f+"="+str(row[f]) for f in zoo.columns[1:] if f not in ['class_type','legs']]
    # f2: the number of legs, followed by a random number of random integers (noise)
    item['f2'] = [f+"="+str(row[f]) for f in zoo.columns[1:] if f in ['legs']]
    item['f2'] += list(np.random.binomial(10, 0.5, np.random.binomial(10, 0.7)))
    # n1 and n2: pure noise, unrelated to the animal
    item['n1'] = list(np.random.binomial(10, 0.5, np.random.binomial(10, 0.7)))
    item['n2'] = list(np.random.binomial(10, 0.5, np.random.binomial(10, 0.7)))
    # target: the class to predict, wrapped in a list like the other dimensions
    item['class_type'] = [row['class_type']]
    items.append(item)

items[0]

{'f1': ['hair=1',
  'feathers=0',
  'eggs=0',
  'milk=1',
  'airborne=0',
  'aquatic=0',
  'predator=0',
  'toothed=1',
  'backbone=1',
  'breathes=1',
  'venomous=0',
  'fins=0',
  'tail=1',
  'domestic=0',
  'catsize=0'],
 'f2': ['legs=2', 6, 5, 5, 5, 6, 5, 7],
 'n1': [5, 6, 5, 5],
 'n2': [3, 3, 2, 7, 6],
 'class_type': ['Mammal']}

## Initialize Sparse Tensor Classifier

Let's instruct STC to predict `class_type` based on `f1`, `f2`, `n1`, `n2`. In this example, we keep the 4 multi-valued attributes in separate dimensions of the sparse tensor by setting `collapse=False` upon initialization. Each feature is now represented in a 4-dimensional space, i.e., (`f1`, `f2`, `n1`, `n2`), and each item is represented by the Cartesian product of its attributes.

In [5]:
STC = SparseTensorClassifier(targets=['class_type'], features=['f1','f2','n1','n2'], collapse=False)
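
As a rough illustration of the Cartesian-product representation described above, the sketch below expands a small hand-made item into its 4-dimensional cells using only the standard library. The item values and the helper name `item_cells` are made up for this example, and this is not STC's internal implementation, just a way to see how many cells a single item generates.

```python
from itertools import product

def item_cells(item, dims):
    """Return every combination of one value per dimension (hypothetical helper)."""
    return list(product(*(item[d] for d in dims)))

# A toy item in the same shape as the tutorial's items (values are made up).
toy_item = {'f1': ['hair=1', 'milk=1'], 'f2': ['legs=2', 6], 'n1': [5, 6], 'n2': [3]}

toy_cells = item_cells(toy_item, ['f1', 'f2', 'n1', 'n2'])
print(len(toy_cells))   # 2 * 2 * 2 * 1 = 8 cells
print(toy_cells[0])     # ('hair=1', 'legs=2', 5, 3)
```

The number of cells grows multiplicatively with the number of values per dimension, so a test item is more likely to contain only combinations that never occurred in training as more dimensions are kept separate.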

## Fit the training data

In [6]:
STC.fit(items[0:70])



## Learn the policy and predict

STC may be unable to provide a prediction when none of the (4-dimensional) features of the instance to predict has been seen in the training set. For this case we need to define a fallback mechanism: the *policy*. The policy is represented by a list of lists, e.g.:

```py
[['f1','f2','n1','n2'], ..., ['f1','f2'], ['f1'], []]
```
where each list contains the features to use for prediction. We predict with the first list of features. Then, for the items that could not be predicted, we use the second list of features, then the third list, and so on until the last list. In other words, the lists are applied in order. The empty list `[]` corresponds to ignoring all the features, i.e. predicting with the unconditional distribution of the target(s) in the training set. If the last list is the empty list, then all items are guaranteed to be predicted. In STC, the policy can be specified arbitrarily if prior knowledge is available, or learnt via the function `learn()`.

In [7]:
policy, score = STC.learn(max_iter=1, random_state=42)
policy



[['f2', 'f1'], ['f1'], []]

The policy learnt is `[['f2', 'f1'], ['f1'], []]`. This corresponds to predicting with `f2` and `f1` only (STC was able to filter out the noisy dimensions `n1` and `n2`). Then, for the items that could not be predicted, `f1` alone is used (the cleanest and most informative dimension). If some items still cannot be predicted, all the features are ignored and the prediction comes from the unconditional distribution of the target(s) in the training set. The policy is saved internally and automatically used for prediction.

In [8]:
labels, probability, explainability = STC.predict(items[70:])
accuracy_score(zoo['class_type'][70:], labels)



0.9032258064516129
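
The fallback behaviour of a policy can be sketched in plain Python. The code below is a minimal illustration under simplifying assumptions: it counts target labels per combination of feature values and predicts by majority vote, which is a stand-in for, not a copy of, how STC scores predictions. The function names (`cells`, `fit_counts`, `predict_with_policy`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def cells(item, dims):
    """All combinations of one value per dimension; dims=[] yields the single empty cell ()."""
    return list(product(*(item[d] for d in dims)))

def fit_counts(train, dims, target):
    """Count how often each target label co-occurs with each cell of the chosen dimensions."""
    counts = defaultdict(Counter)
    for item in train:
        for cell in cells(item, dims):
            for label in item[target]:
                counts[cell][label] += 1
    return counts

def predict_with_policy(train, item, policy, target='class_type'):
    """Try each feature list in order and return (label, feature list used)."""
    for dims in policy:
        counts = fit_counts(train, dims, target)
        votes = Counter()
        for cell in cells(item, dims):
            votes.update(counts.get(cell, {}))  # unseen cells contribute nothing
        if votes:  # at least one cell was seen in training
            return votes.most_common(1)[0][0], dims
    return None, None  # only reachable if the policy does not end with []

toy_train = [
    {'f1': ['hair=1'], 'f2': ['legs=4'], 'class_type': ['Mammal']},
    {'f1': ['hair=1'], 'f2': ['legs=2'], 'class_type': ['Mammal']},
    {'f1': ['hair=0'], 'f2': ['legs=2'], 'class_type': ['Bird']},
]
toy_policy = [['f1', 'f2'], ['f1'], []]

seen = predict_with_policy(toy_train, {'f1': ['hair=0'], 'f2': ['legs=2']}, toy_policy)
partial = predict_with_policy(toy_train, {'f1': ['hair=1'], 'f2': ['legs=6']}, toy_policy)
unseen = predict_with_policy(toy_train, {'f1': ['hair=2'], 'f2': ['legs=8']}, toy_policy)
print(seen, partial, unseen)
```

In this toy setting the first item is resolved by `['f1', 'f2']`, the second falls back to `['f1']`, and the third reaches `[]`, where the single empty cell carries the unconditional label counts of the training set.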

The policy can also be specified explicitly. The call below passes a hand-written policy that uses `f1` and `f2` first and then falls back to `f2` alone; on this split it reaches the same accuracy as the learnt policy. This is particularly useful when prior knowledge is available and learning the policy is too compute-intensive.

In [9]:
labels, probability, explainability = STC.predict(items[70:], policy=[['f1','f2'], ['f2'], []])
accuracy_score(zoo['class_type'][70:], labels)



0.9032258064516129

But watch out! In the next example we arbitrarily predict with the noisy dimensions first, and the accuracy collapses.

In [10]:
labels, probability, explainability = STC.predict(items[70:], policy=[['n1','n2'], ['f1','f2'], ['f2'], []])
accuracy_score(zoo['class_type'][70:], labels)



0.0967741935483871
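
To put the two accuracy values in perspective, the short check below converts them back into counts of correct predictions. It assumes the test set is `items[70:]` of the 101 animals mentioned earlier, i.e. 31 items; the dictionary keys are just labels chosen for this example.

```python
n_test = 101 - 70  # items[70:] out of 101 animals (assumption based on the dataset size above)

reported = {
    "f1 and f2 first": 0.9032258064516129,
    "noisy dimensions first": 0.0967741935483871,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
correct = {name: round(acc * n_test) for name, acc in reported.items()}
for name, k in correct.items():
    print(f"{name}: {k} of {n_test} correct")
```

Under that assumption, the policies that start from `f1` and `f2` classify 28 of the 31 test items correctly, while the policy that starts from `n1` and `n2` gets only 3 right.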

## Congratulations!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with Sparse Tensor Classifier, we encourage you to finish the rest of the tutorials in [this series](https://github.com/SparseTensorClassifier/tutorial). Don't forget to [star the repository](https://github.com/SparseTensorClassifier/stc)! 

![GitHub Repo stars](https://img.shields.io/github/stars/SparseTensorClassifier/stc?style=social)

<div>
    Thanks by <a href="https://sparsetensorclassifier.org">https://sparsetensorclassifier.org</a>  
    <span style="float:right">
        Questions? Open an <a href="https://github.com/SparseTensorClassifier/tutorial/issues">issue</a>
    </span> 
</div>