# The Sparse Tensor Classifier Input Data Format

In this tutorial you'll learn how to pass input data to `SparseTensorClassifier` in both tabular and JSON formats.


## Colab 

This tutorial and the rest in [this sequence](https://github.com/SparseTensorClassifier/tutorial) can be run in Google Colab. To open this notebook in Colab, click [here](https://colab.research.google.com/github/SparseTensorClassifier/tutorial/blob/main/Quickstart_Input_Data.ipynb).

![](https://colab.research.google.com/assets/colab-badge.svg)

## Setup

Uncomment and run the following cell to install the packages. Then, import the modules.

In [1]:
# !pip install stc pandas scikit-learn

In [2]:
import pandas as pd
from stc import SparseTensorClassifier
from sklearn.metrics import accuracy_score

## Read the dataset

The dataset describes 101 zoo animals using 16 trait variables. The 7 class types are: Mammal, Bird, Reptile, Fish, Amphibian, Bug and Invertebrate. Let's read and shuffle the data.

In [3]:
zoo = pd.read_csv('./data/zoo/zoo.csv')
zoo = zoo.sample(frac=1, random_state=42)
zoo

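| | animal_name | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize | class_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84 | squirrel | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | Mammal |
| 55 | oryx | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| 66 | porpoise | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | Mammal |
| 67 | puma | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| 45 | lion | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | Mammal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | pike | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | Fish |
| 71 | rhea | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | Bird |
| 14 | crab | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | Invertebrate |
| 92 | tuna | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | Fish |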


## Tabular Input Data

STC accepts tabular input as a pandas ``DataFrame``, where each row represents an item, each column represents a ``key``, and each cell represents a ``value``. **Missing values**, such as `NaN`, are converted to strings (e.g., `nan`) and treated like any other categorical value. STC deals with **categorical data** only, and all ``values`` are internally converted to strings. Continuous features should be discretized first.
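
Because STC works on categorical values only, a continuous column has to be binned before fitting. The sketch below shows one possible way to do that with the standard library. The `weight_kg` values, the bin edges, the labels, and the `discretize` helper are all hypothetical; none of them belong to the zoo dataset or to the STC API. With pandas, `pd.cut` can bin a whole column in a similar way.

```python
from bisect import bisect_right
import math

# Hypothetical bin edges and labels for an illustrative continuous feature.
EDGES = [1.0, 50.0]                      # boundaries between bins, in kg
LABELS = ["small", "medium", "large"]    # one label per bin (len(EDGES) + 1)

def discretize(value, edges=EDGES, labels=LABELS):
    """Map a continuous value to a categorical label; missing values become 'nan'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "nan"  # a missing value is kept as its own category
    # bisect_right counts how many edges are <= value, which is the bin index
    return labels[bisect_right(edges, value)]

weights_kg = [0.3, 12.5, 190.0, float("nan")]   # made-up values
categories = [discretize(w) for w in weights_kg]
print(categories)
```

The resulting strings can be stored in a regular column and passed to STC like any other categorical feature.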

Let's instruct STC to predict `class_type` based on all the other attributes in the dataset, except `animal_name`.

In [4]:
STC = SparseTensorClassifier(targets=['class_type'], features=zoo.columns[1:-1])

Fit the model on the first 70 rows of the tabular data.

In [5]:
STC.fit(zoo[0:70])



Predict the labels of the remaining rows.

In [6]:
labels, probability, explainability = STC.predict(zoo[70:])
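
The fit and predict cells above implement a simple holdout split: the shuffled rows are divided by position into a training part and a test part. The sketch below reproduces that idea on plain Python lists. The `holdout_split` helper is hypothetical, and because it uses Python's `random` module rather than pandas, the resulting order differs from the one in this notebook.

```python
import random

def holdout_split(rows, n_train, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then split by position."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(101))                  # stand-ins for the 101 animals
train, test = holdout_split(rows, n_train=70)
print(len(train), len(test))             # 70 31
```

Fixing the seed keeps the split reproducible across runs, which is the role `random_state=42` plays when the dataset is shuffled.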



Evaluate the predictions

In [7]:
accuracy_score(zoo['class_type'][70:], labels)

0.967741935483871
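
Assuming the CSV holds all 101 animals, the split leaves 101 - 70 = 31 animals for testing, so the score above corresponds to 30 correct predictions out of 31. As a rough sketch of what accuracy means for single-label predictions, the hypothetical `accuracy` helper below counts exact matches; the toy labels are made up and are not the model's real output.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only: 3 of the 4 predictions match.
y_true = ["Mammal", "Bird", "Fish", "Bug"]
y_pred = ["Mammal", "Bird", "Fish", "Reptile"]
toy_score = accuracy(y_true, y_pred)
print(toy_score)   # 0.75
```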

## JSON Input Data

STC also supports JSON input data as a list of dictionaries structured as follows:

```python
data = [
    {'key1': [value1, value2, ..., valueN], 'key2': [], ..., 'keyN': []},
    ...
    {'key1': [value1, value2, ..., valueN], 'key2': [], ..., 'keyN': []},
]
```

Each dictionary represents an item, and each ``key`` is a feature associated with one or more ``values``.
This makes it easy to deal with multi-valued attributes. Consider encoding **missing values** with a string that represents the absence of information (e.g., `""`) instead of providing an empty list `[]` for the `key`. STC deals with **categorical data** only, and all ``values`` are internally converted to strings. Continuous features should be discretized first.
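
To make the format concrete, the sketch below builds two items by hand, one with a multi-valued key and one with a missing value. The `fill_missing` helper and the `colors` key are hypothetical; the helper only illustrates the suggestion above of replacing an empty list with a placeholder string before the items are handed to STC.

```python
MISSING = ""  # placeholder string standing for "no information"

def fill_missing(items, keys, placeholder=MISSING):
    """Return copies of the items where absent or empty keys hold [placeholder]."""
    filled = []
    for item in items:
        new_item = dict(item)                # shallow copy, input stays untouched
        for key in keys:
            if not new_item.get(key):        # key absent or mapped to []
                new_item[key] = [placeholder]
        filled.append(new_item)
    return filled

example_items = [
    {"legs": [4], "colors": ["black", "white"], "class_type": ["Mammal"]},
    {"legs": [2], "colors": [], "class_type": ["Bird"]},
]
clean = fill_missing(example_items, keys=["legs", "colors"])
print(clean[1])
```

With the placeholder in place, the absence of information becomes a categorical value of its own.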

Let's convert the `zoo` dataset into JSON. A possible representation of each animal is as follows. **Note**: This serves as a simple example of how STC deals with JSON datasets. It is generally not recommended to transform tabular data into JSON data.

In [8]:
items = []
for _, row in zoo.iterrows():
    item = {}
    # Single-valued key: the number of legs
    item['legs'] = [row['legs']]
    # Multi-valued key: names of the remaining traits whose value is non-zero
    item['attributes'] = [f for f in zoo.columns[1:] if f not in ['legs','class_type'] and row[f]!=0]
    item['animal_name'] = [row['animal_name']]
    item['class_type'] = [row['class_type']]
    items.append(item)

items[0]

{'legs': [2],
 'attributes': ['hair', 'milk', 'toothed', 'backbone', 'breathes', 'tail'],
 'animal_name': ['squirrel'],
 'class_type': ['Mammal']}
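
A quick structural check can catch malformed items before fitting. The `check_items` helper below is hypothetical and not part of STC; it only verifies the shape described earlier, namely that every item is a dictionary whose values are lists and that the keys you plan to use are present.

```python
def check_items(items, required_keys):
    """Raise ValueError if an item is not a dict of lists or lacks a required key."""
    for position, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {position} is not a dictionary")
        for key in required_keys:
            if key not in item:
                raise ValueError(f"item {position} is missing key {key!r}")
        for key, values in item.items():
            if not isinstance(values, list):
                raise ValueError(f"item {position}: {key!r} must map to a list")
    return True

# The first item shown above, written out by hand.
sample = [{
    "legs": [2],
    "attributes": ["hair", "milk", "toothed", "backbone", "breathes", "tail"],
    "animal_name": ["squirrel"],
    "class_type": ["Mammal"],
}]
ok = check_items(sample, required_keys=["legs", "attributes", "class_type"])
print(ok)
```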

Let's instruct STC to predict `class_type` based on `legs` and `attributes`.

In [9]:
STC = SparseTensorClassifier(targets=['class_type'], features=['legs', 'attributes'])

Fit the model on the first 70 JSON items.

In [10]:
STC.fit(items[0:70])



Predict the labels of the remaining items.

In [11]:
labels, probability, explainability = STC.predict(items[70:])



Evaluate the predictions

In [12]:
accuracy_score(zoo['class_type'][70:], labels)

0.9354838709677419

# Congratulations! 

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with Sparse Tensor Classifier, we encourage you to finish the rest of the tutorials in [this series](https://github.com/SparseTensorClassifier/tutorial). Don't forget to [star the repository](https://github.com/SparseTensorClassifier/stc)! 

![GitHub Repo stars](https://img.shields.io/github/stars/SparseTensorClassifier/stc?style=social)

<div>
    Thanks by <a href="https://sparsetensorclassifier.org">https://sparsetensorclassifier.org</a>  
    <span style="float:right">
        Questions? Open an <a href="https://github.com/SparseTensorClassifier/tutorial/issues">issue</a>
    </span> 
</div>