## Variable Length Sequences

Variable length sequence data can present a significant challenge
to machine learning algorithms and data science analysis.

Part of this problem is driven by the wide variety of variable length
sequence data encountered in the wild.  To that end, we present
a taxonomy of the kinds of variable length sequences that we typically
encounter, along with our suggestions for how to think about them.

We generally find it useful when describing variable length sequence data
to describe what it is a sequence of.  The basic types that we commonly
encounter are: <font color='limegreen'>categorical</font> values, <font color='limegreen'>scalar</font> values and <font color='limegreen'>vector</font> values. Certainly scalar data could be thought of as simple one-dimensional vector data, but given the different techniques that can be, and often are, applied to such data we feel that treating it as a separate data type is warranted.

Next we describe it as either an <font color='magenta'>ordered</font> or an <font color='magenta'>unordered</font> sequence.  Yes, an <font color='magenta'>unordered</font> sequence is an odd turn of phrase, but we find it to be a useful simplifying
notion.  An <font color='magenta'>unordered</font> sequence is often referred to as a bag in data science
literature.  For example, a `bag of words` is the phrase used to describe an
unordered collection of word tokens.  We would describe such a collection as an
<font color='magenta'>unordered</font> <font color='limegreen'>categorical</font> sequence.

Lastly, given an <font color='magenta'>ordered</font> sequence we require one extra piece of information:
is the ordering <font color='sky blue'>regular</font> or <font color='sky blue'>irregular</font>?  <font color='sky blue'>Regular</font> sequences are often
described as heartbeat data and generally assume equal spacing between all
values.  <font color='sky blue'>Irregular</font> sequences are often referred to as event data; each
value is associated with a particular position, allowing variable spacing between
values.

Variable length sequence data comes in a vast variety of forms.  Different forms of variable length sequence data are amenable to different techniques.  To deal with this variety of data we propose this simple taxonomy of variable length sequence data and provide links and suggestions for techniques conducive to each type. 

* <font color='limegreen'>Type</font> of values: <font color='limegreen'>categorical</font>, <font color='limegreen'>scalar</font> or <font color='limegreen'>vector</font>
* <font color='magenta'>Order</font> of values: <font color='magenta'>ordered</font> or <font color='magenta'>unordered</font>
* <font color='sky blue'>Regularity</font> of values: <font color='sky blue'>regular</font> or <font color='sky blue'>irregular</font>
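
To make the three axes concrete, here is a small illustrative sketch of how data in a few cells of the taxonomy might look as plain Python objects. The `SequenceKind` class and the sample values are hypothetical and exist only for this illustration; they are not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceKind:
    """Hypothetical label for one cell of the taxonomy (not a library class)."""

    value_type: str          # "categorical", "scalar" or "vector"
    ordered: bool            # False means the sequence is a bag
    regular: Optional[bool]  # only meaningful when ordered is True

    def describe(self) -> str:
        parts = []
        if self.ordered:
            parts.append("regular" if self.regular else "irregular")
            parts.append("ordered")
        else:
            parts.append("unordered")
        parts.extend([self.value_type, "sequence"])
        return " ".join(parts)


# Unordered categorical: a bag of words, where order carries no meaning.
bag_of_words = ["cat", "sat", "the", "mat", "the"]

# Regular ordered scalar: heartbeat samples with equal spacing implied.
heartbeat = [0.8, 0.9, 1.1, 1.0]

# Irregular ordered vector: every value carries its own position (a timestamp here).
timestamped_locations = [
    (0.0, (45.42, -75.69)),
    (3.5, (45.43, -75.70)),
    (4.1, (45.50, -75.60)),
]

print(SequenceKind("categorical", ordered=False, regular=None).describe())
print(SequenceKind("scalar", ordered=True, regular=True).describe())
print(SequenceKind("vector", ordered=True, regular=False).describe())
```

Note that the irregular example stores an explicit position with each value, while the regular example relies on the list index alone.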

### Sequence Types
#### Categorical
| regularity | order | type | sequence | | example |
| :- | :- | :- | :- | :-: | :-: | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>categorical</font> | sequence | -> | bag of words |
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>categorical</font>  | sequence | -> | text document |
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>categorical</font> | sequence  | ->| time stamped labelled events |


#### Scalar
| regularity | order | type | sequence | | example |
| :- | :- | :- | :- | :-: | :-: | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>Scalar</font> | sequence | -> | random variable |
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Scalar</font>  | sequence | -> | heartbeat time-series |
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Scalar</font> | sequence  | ->| time stamped values or event sequence |

#### Vector
| regularity | order | type | sequence | | example |
| :- | :- | :- | :- | :-: | :-: | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>Vector</font> | sequence | -> | point cloud |
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Vector</font>  | sequence | -> | spatial-trajectory data |
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Vector</font> | sequence  | ->| time stamped locations |


## Vectorizer Functions 

This library adheres to the sklearn transformer paradigm, with most classes providing `fit`, `fit_transform` and `transform` methods.  As such, they can be easily arranged in sklearn pipelines to ensure that all of your data transformation steps are encapsulated cleanly.

For the most part our `vectorizers` take in a sequence of variable length sequences and learn a fixed width representation of these sequences.  Another way of thinking of this is transforming a jagged array of vectors into a fixed width array of vectors.  Fixed width representations are significantly more conducive to traditional machine learning algorithms.
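
As a rough illustration of the jagged-to-fixed-width idea, the toy class below follows the `fit` / `transform` / `fit_transform` pattern using simple token counts. `ToyCountVectorizer` and its `column_index_` attribute are made up for this sketch; it is not one of this library's vectorizers and says nothing about how they are implemented.

```python
class ToyCountVectorizer:
    """Toy illustration of the fit/transform pattern; not a class from this library."""

    def fit(self, sequences):
        # Learn a fixed column index for every token seen during fitting.
        vocabulary = sorted({token for seq in sequences for token in seq})
        self.column_index_ = {token: i for i, token in enumerate(vocabulary)}
        return self

    def transform(self, sequences):
        # Every output row has the same width, whatever the input length was.
        rows = []
        for seq in sequences:
            row = [0] * len(self.column_index_)
            for token in seq:
                column = self.column_index_.get(token)
                if column is not None:  # tokens unseen during fit are ignored
                    row[column] += 1
            rows.append(row)
        return rows

    def fit_transform(self, sequences):
        return self.fit(sequences).transform(sequences)


jagged = [["a", "b", "a"], ["c"], ["b", "c", "c", "c"]]
fixed_width = ToyCountVectorizer().fit_transform(jagged)
print(fixed_width)  # [[2, 1, 0], [0, 0, 1], [0, 1, 3]]
```

The three input sequences have lengths 3, 1 and 4, yet each output row has exactly three columns, one per token learned during `fit`.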

`Transformers` on the other hand are more generic utility functions that massage data in various useful ways.  

Due to the variety of vectorization techniques in this library, a user might find it easiest to first determine the type of variable length sequences they are dealing with and then use the following index to find the relevant functions.

#### Categorical
| regularity | order | type | sequence | | example | functions |
| :- | :- | :- | :- | :-: | :- | :- | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>categorical</font> | sequence | -> | bag of words | [NgramVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.NgramVectorizer.html#vectorizers.NgramVectorizer), [EdgeListVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.EdgeListVectorizer.html) | 
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>categorical</font>  | sequence | -> | text document | NgramVectorizer, [LZCompressionVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.LZCompressionVectorizer.html), [BPEVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.BytePairEncodingVectorizer.html) | 
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>categorical</font> | sequence  | ->| time stamped labelled events | [HistogramVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.HistogramVectorizer.html) |

All of these vectorizers take data in the form of a sequence of variable length sequences of categorical values (such as strings).  All of these methods presume that a user has already decomposed their data into something of this form.  

The most common sources of variable length categorical data are text documents or data frames with categorical columns. In both cases some pre-processing will be necessary to convert such data into sequences of variable length sequences.  

In the case of text documents this often involves tokenization and lemmatization steps.  An example of applying such transformations on text data before vectorization can be found in [document vectorizer](https://vectorizers.readthedocs.io/en/latest/document_vectorization.html).

Good tokenization and lemmatization libraries include: [HuggingFace](https://huggingface.co/docs/transformers/main_classes/tokenizer), [SentencePiece](https://github.com/google/sentencepiece), [spaCy](https://spacy.io/api/tokenizer), and [nltk](https://www.nltk.org/api/nltk.tokenize.html).
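
As a minimal stand-in for those libraries, the sketch below turns raw text documents into a sequence of variable length token sequences using only a regular expression. The `simple_tokenize` helper is hypothetical and performs no lemmatization; in practice one of the libraries above will usually do a better job.

```python
import re


def simple_tokenize(document):
    """Lowercase a document and split it into word tokens (no lemmatization)."""
    return re.findall(r"[a-z0-9']+", document.lower())


documents = [
    "The cat sat on the mat.",
    "Dogs bark!",
]

# One variable length list of tokens per document.
token_sequences = [simple_tokenize(doc) for doc in documents]
print(token_sequences)
# [['the', 'cat', 'sat', 'on', 'the', 'mat'], ['dogs', 'bark']]
```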

In the case of a data frame with multiple categorical columns, one might make use of our library's CategoricalColumnTransformer to transform a data frame with one or more columns into a sequence of variable length categorical sequences.  This is typically done by specifying one categorical column to represent one's objects and another set of categorical columns to describe said objects.
For examples of how one might use this, see an [introduction to CategoricalColumnTransformer](https://vectorizers.readthedocs.io/en/latest/CategoricalColumnTransformer_intro.html) or the more complicated [CategoricalColumnTransformer vignette](https://vectorizers.readthedocs.io/en/latest/categorical_column_transformer_example.html).
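
The underlying idea of grouping descriptor values by object can be sketched in plain Python. This is only an illustration of the concept, not the CategoricalColumnTransformer API or its implementation; the `rows_to_sequences` helper, the toy rows and the column names are all invented for the example.

```python
from collections import defaultdict

# Toy rows standing in for a data frame; the column names are made up.
rows = [
    {"user": "alice", "product": "tea", "store": "north"},
    {"user": "bob", "product": "coffee", "store": "south"},
    {"user": "alice", "product": "milk", "store": "north"},
    {"user": "bob", "product": "tea", "store": "north"},
]


def rows_to_sequences(rows, object_column, descriptor_columns):
    """Collect, per object, the values found in the descriptor columns."""
    sequences = defaultdict(list)
    for row in rows:
        for column in descriptor_columns:
            sequences[row[object_column]].append(row[column])
    return dict(sequences)


sequences = rows_to_sequences(rows, "user", ["product", "store"])
print(sequences)
# {'alice': ['tea', 'north', 'milk', 'north'], 'bob': ['coffee', 'south', 'tea', 'north']}
```

Each object (here a user) ends up with one variable length categorical sequence, which is the input shape the categorical vectorizers above expect.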

#### Scalar
| regularity | order | type | sequence | | example | functions |
| :- | :- | :- | :- | :-: | :- | :- | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>Scalar</font> | sequence | -> | random variable | [HistogramVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.HistogramVectorizer.html), DistributionVectorizer |
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Scalar</font>  | sequence | -> | heartbeat time-series | SlidingWindowTransformer |
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Scalar</font> | sequence  | ->| time stamped values or event sequence | [KDEVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.KDEVectorizer.html#vectorizers.KDEVectorizer) |

One should note that the entry for <font color='sky blue'>regular</font> <font color='magenta'>ordered</font> <font color='limegreen'>scalar</font> sequences references a Transformer rather than a Vectorizer.  That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an <font color='magenta'>unordered</font> <font color='limegreen'>scalar</font> sequence and then apply the appropriate techniques.
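
The general sliding window idea can be sketched as follows. This is a generic illustration, not the SlidingWindowTransformer itself; the `sliding_windows` helper and its `width` and `stride` parameters are assumptions made for the example and may not match the library's parameter names.

```python
def sliding_windows(values, width, stride=1):
    """Return every window of `width` consecutive values, stepping by `stride`."""
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    return [
        tuple(values[start:start + width])
        for start in range(0, len(values) - width + 1, stride)
    ]


heartbeat = [1.0, 2.0, 4.0, 8.0, 16.0]
windows = sliding_windows(heartbeat, width=3)
print(windows)
# [(1.0, 2.0, 4.0), (2.0, 4.0, 8.0), (4.0, 8.0, 16.0)]
```

Because each window already captures the local ordering of its values, the collection of windows can be handled as an unordered bag by downstream techniques.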

#### Vector
| regularity | order | type | sequence | | example | functions |
| :- | :- | :- | :- | :-: | :- | :- | 
|     | <font color='magenta'>unordered</font> | <font color='limegreen'>Vector</font> | sequence | -> | point cloud | [WassersteinVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.WassersteinVectorizer.html#vectorizers.WassersteinVectorizer), [SinkhornVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.SinkhornVectorizer.html), [ApproximateWassersteinVectorizer](https://vectorizers.readthedocs.io/en/latest/generated/vectorizers.ApproximateWassersteinVectorizer.html), DistributionVectorizer |
| <font color='sky blue'>regular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Vector</font>  | sequence | -> | spatial-trajectory data | SlidingWindowTransformer |
| <font color='sky blue'>irregular</font> | <font color='magenta'>ordered</font> | <font color='limegreen'>Vector</font> | sequence  | ->| time stamped locations | `we accept pull requests` |

One should note that the entry for <font color='sky blue'>regular</font> <font color='magenta'>ordered</font> <font color='limegreen'>vector</font> sequences references a Transformer rather than a Vectorizer.  That is because our current recommendation for dealing with such sequences is to use the SlidingWindowTransformer to encode the sequence information into an <font color='magenta'>unordered</font> <font color='limegreen'>vector</font> sequence and then apply the appropriate techniques.

WassersteinVectorizer should be considered the gold standard for vectorizing point clouds of data.  It makes use of linear optimal transport to linearize a point cloud, providing a reasonably scalable vectorization in which Euclidean or cosine distance is a reasonable approximation of the Wasserstein distance between the point cloud distributions.  SinkhornVectorizer can handle much larger distributions of data and is generally more efficient, but this efficiency may come with some loss of quality.  Lastly, we include an ApproximateWassersteinVectorizer, a heuristic linear algebra based solution that approximates our WassersteinVectorizer poorly but is very, very fast.
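
To give some intuition for why a linear embedding can stand in for Wasserstein distance, consider the one-dimensional special case, where the correspondence is exact: for two samples of equal size, sorting each sample gives the optimal matching, so the Euclidean distance between the sorted (and rescaled) samples equals the 2-Wasserstein distance. The sketch below checks this numerically. It illustrates the 1-D case only and is not how WassersteinVectorizer is implemented; the helper names are hypothetical.

```python
import math


def wasserstein2_1d(xs, ys):
    """Exact 2-Wasserstein distance between equal-sized 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


def linear_embedding_1d(xs):
    """Sorted, rescaled sample; Euclidean distance between embeddings equals W2."""
    scale = math.sqrt(len(xs))
    return [x / scale for x in sorted(xs)]


cloud_a = [3.0, 0.0, 1.0]
cloud_b = [2.0, 5.0, 1.0]

embedded_a = linear_embedding_1d(cloud_a)
embedded_b = linear_embedding_1d(cloud_b)
euclidean = math.dist(embedded_a, embedded_b)

# The two quantities agree up to floating point error.
print(abs(euclidean - wasserstein2_1d(cloud_a, cloud_b)) < 1e-12)  # True
```

In higher dimensions no such exact embedding is available in general, which is why the vectorizers above provide approximations of varying cost and quality.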