# Vectorization of short texts

Many NLP tasks today take tweets or tweet-like messages as input, that is, short texts that usually amount to a single sentence. Think of e-commerce product titles, questions in question-answering systems, requests in intent detection, or the individual sentences typically fed to seq2seq and Transformer-based sentence encoders.

However, standard bag-of-words (BoW) representations, originally developed in Information Retrieval and later adopted for Natural Language Processing, generally assume that a system's input consists of whole documents rather than isolated sentences, and longer documents provide far more context for language processing than single sentences do.

For instance, Latent Dirichlet Allocation (LDA)-based topic modelling infers the topics of a document from the words that co-occur in it. Long documents provide a much richer context from which to derive these kinds of inferences, whereas short texts are usually too fragmented to allow for effective modelling with the same technique, and LDA is well known to struggle with short documents.

So, what is the specific impact of short texts on vectorization? If BoW representations are intended for longer texts, what happens if we suddenly start using them for shorter texts? Does it matter? Or does everything remain the same?

In this notebook we want to show that it generally does not matter, except in one possible situation.

For our experiment, we will need to:
1. define the dataset we will be using;
2. define which text vectorization methods we want to compare; and
3. assess the differences between the methods when run on that dataset.

Points 1 and 2 will be covered next. After that, we will use them as inputs for the analysis in point 3.

## 1. Dataset

For our example, we will consider a small toy dataset consisting of 3 classes

`Y = {"cell-phones", "books", "nutrition"}`

with a few documents each:
1. 9 documents for `cell-phones`
2. 6 documents for `nutrition`
3. 6 documents for `books`

The dataset is hard-coded as variable `ads` inside the `Dataset` module.
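
The pipeline further down unpacks each entry of `ads` as a `(document, label)` pair, so the variable is presumably a list of tuples. Below is a minimal sketch of that shape. The name `toy_ads` and its entries are made up for illustration and are not the real dataset:

```python
import collections

# Hypothetical stand-in for Dataset.ads: a list of (document, label) pairs.
# These titles are invented placeholders, not the actual data.
toy_ads = [
    ("Refurbished flip phone with charger", "cell-phones"),
    ("Unlocked smartphone 64GB black", "cell-phones"),
    ("Paperback mystery novel", "books"),
    ("Vitamin C capsules 500mg", "nutrition"),
]

# Count how many documents each class has
docs_per_class = collections.Counter(label for _, label in toy_ads)
print(docs_per_class)
```

Running the same count over the real `ads` should reproduce the per-class totals listed above.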

## 2. Vectorizers

We will compare four types of vectorization strategies:
1. Dictionary vectorization
2. Frequency vectorization
3. TFIDF-weighted vectorization
4. TFICF-weighted vectorization

All of these strategies are implemented as Python classes in the `Vectorizer` module under the following names, respectively:
1. `DictionaryVectorizer`
2. `CountVectorizer`
3. `TfidfVectorizer`
4. `TfidfVectorizer` with the keyword argument `group_by_class` set to `True`
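
The `Vectorizer` module's actual weighting formula is not shown in this notebook, so the following is only a sketch of what `group_by_class` might do. It assumes the textbook inverse frequency `ln(N / df)` and treats each class as a single merged "document" when computing the class-based variant. The helper `inverse_frequencies` and the toy documents are hypothetical:

```python
import math
from collections import Counter, defaultdict

def inverse_frequencies(token_lists):
    """Return ln(N / df) for every token, where N = len(token_lists)
    and df = number of units (documents or classes) containing the token."""
    n_units = len(token_lists)
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log(n_units / count) for tok, count in df.items()}

# Invented, pre-tokenized documents with their class labels
docs = [
    (["flip", "phone"], "cell-phones"),
    (["smart", "phone"], "cell-phones"),
    (["paperback", "book"], "books"),
    (["phone", "book"], "books"),
]

# TF-IDF style: every document is its own unit
idf = inverse_frequencies([tokens for tokens, _ in docs])

# TF-ICF style (assumed): merge all documents of a class into one unit first
tokens_by_class = defaultdict(list)
for tokens, label in docs:
    tokens_by_class[label].extend(tokens)
icf = inverse_frequencies(list(tokens_by_class.values()))

print(round(idf["phone"], 2), round(icf["phone"], 2))
print(round(idf["flip"], 2), round(icf["flip"], 2))
```

Under this assumption, a word that occurs in every class gets a class-based weight of zero, however many individual documents contain it. The values 1.10 and 0.41 printed by the grouped vectorizer in the output further down are at least consistent with `ln(3/1)` and `ln(3/2)` over three classes.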

Here is a one-paragraph description of each of these methods:
1. The `DictionaryVectorizer` takes each document and returns its binary encoding, that is, a vector with as many columns as there are words in our vocabulary, where each column holds a 1 or a 0 depending on whether the document contains that word. This is essentially an indicator function over the vocabulary, applied to the words of an input text (for those familiar with Python's `scikit-learn` library, it is what we would get from the `sklearn.feature_extraction.text.CountVectorizer` class with the `binary` parameter set to `True`).
2. The `CountVectorizer` also returns a vector with as many columns as there are words in our vocabulary but, for each column, the value is **not** 1 or 0 but the **frequency**, that is, the number of times the corresponding word appears in the input text (remember that each column corresponds to exactly one word in our vocabulary).
3. The `TfidfVectorizer`
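
The binary and count encodings described in the first two items can be sketched in plain Python. This is not the code of the `Vectorizer` module: the whitespace tokenizer, the helper names, and the five-word vocabulary are assumptions made for illustration. The title is one of the test documents printed later in this notebook.

```python
from collections import Counter

def tokenize(text):
    # Simplistic lowercase whitespace tokenizer (an assumption)
    return text.lower().split()

def count_vector(text, vocabulary):
    # One column per vocabulary word, holding its frequency in the text
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def binary_vector(text, vocabulary):
    # Same columns, but every non-zero frequency is capped at 1
    return [min(c, 1) for c in count_vector(text, vocabulary)]

# Hypothetical vocabulary; the real one is learned from the training set
vocabulary = ["cell", "flip", "phone", "vintage", "book"]
title = ("Ericsson DF688 Vintage Flip Cell Phone NEW LISTING "
         "Ericsson DF688 Vintage Flip Cell Phone")

print(count_vector(title, vocabulary))   # [2, 2, 2, 2, 0]
print(binary_vector(title, vocabulary))  # [1, 1, 1, 1, 0]
```

Words outside the vocabulary (here `ericsson`, `df688`, `new`, `listing`) are simply ignored by both encodings.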

### Working hypothesis
Based on these descriptions, we can already venture some hypotheses:
1. `CountVectorizer` represents essentially the same information as the `DictionaryVectorizer`, but with raw frequency counts instead of a categorical binary label. If we are working with long documents containing many mentions of the same words, then the values in `CountVectorizer`'s vector will be much larger than the values in `DictionaryVectorizer`'s vector, since the latter are effectively capped at 1, no matter how many times each word appears in the text. However, if we are working with short texts, this no longer applies, since short texts tend not to contain any duplicated words, except for prepositions, determiners, and similar function words with little lexical meaning. Therefore, our hypothesis is that, with most words occurring exactly once in short texts, a short text's `DictionaryVectorizer`-encoded vector must look very similar, or identical, to its `CountVectorizer`-encoded vector (this is definitely **not** true for long documents).
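
One rough way to probe this hypothesis on a single text is to measure the share of distinct words that occur more than once: when that share is 0, the binary and count vectors are necessarily identical. The helper `repeated_share` and its whitespace tokenization are assumptions of this sketch. The two titles are test documents from the output further down.

```python
from collections import Counter

def repeated_share(text):
    """Fraction of distinct tokens that occur more than once in `text`."""
    counts = Counter(text.lower().split())
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)

short_text = ("Kitchen Confidential by Anthony Bourdain FREE SHIPPING "
              "a paperback book")
relisted = ("Ericsson DF688 Vintage Flip Cell Phone NEW LISTING "
            "Ericsson DF688 Vintage Flip Cell Phone")

print(repeated_share(short_text))  # 0.0
print(repeated_share(relisted))    # 0.75
```

The second title shows where the hypothesis can break down even for short texts: a listing that repeats its own product name contains many duplicated words, so its count vector differs from its binary vector.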



I'm not using scikit-learn's versions because I prefer to implement some custom functionality directly from scratch (although scikit-learn's vectorizers could also be extended accordingly); notice the `group_by_class` parameter.

## 3. Analysis

### Python pipeline

Below is the code for the experiment we will be running:

In [1]:
import collections
import random

from Dataset import ads

from Vectorizer import CountVectorizer, DictionaryVectorizer, TfidfVectorizer

random.seed(3)



if __name__ == '__main__':

    vectorizers = [
        CountVectorizer(),
        DictionaryVectorizer(),
        TfidfVectorizer(),
        TfidfVectorizer(group_by_class=True)
    ]
    
    
    # Prepare training and test sets
    doc_ids__by__label = collections.defaultdict(list)
    for i, (_, label) in enumerate(ads):
        doc_ids__by__label[label].append(i)
    
    X_train, X_test, Y_train, Y_test = [], [], [], []
    for label, doc_ids in doc_ids__by__label.items():
        i = random.choice(list(range(len(doc_ids))))
        doc_id = doc_ids.pop(i)
        document, _ = ads[doc_id]
        X_test.append(document)
        Y_test.append(label)

        # After pop(), doc_ids no longer contains the held-out document,
        # so everything left in it goes to the training set
        X_train.extend(ads[_doc_id][0] for _doc_id in doc_ids)
        Y_train.extend(label for _ in doc_ids)

    

    # Fit all the vectorizers on the same dataset
    for vec in vectorizers:
        vec.fit(X_train, Y_train)



    # Vectorize with each one and compare the results
    for doc, label in zip(X_test, Y_test):
        print(doc)
        print(label)
        for vec in vectorizers:
            v = vec.transform([doc])[0]
            print('\t', vec)
            for name, weight in vec.interpret(v):
                print('\t\t%.2f\t%s' % (weight, name))
        print()



Ericsson DF688 Vintage Flip Cell Phone NEW LISTING Ericsson DF688 Vintage Flip Cell Phone
cell-phones
	 CountVectorizer
		2.00	vintage
		2.00	cell
		2.00	phone
		2.00	flip
		1.00	new
		1.00	listing
	 DictionaryVectorizer
		1.00	new
		1.00	vintage
		1.00	cell
		1.00	phone
		1.00	listing
		1.00	flip
	 TfidfVectorizer
		2.83	flip
		2.14	vintage
		2.14	cell
		2.14	listing
		1.45	phone
		1.22	new
	 TfidfVectorizer
		1.10	vintage
		1.10	cell
		1.10	phone
		1.10	flip
		0.41	new
		0.41	listing

Kitchen Confidential by Anthony Bourdain FREE SHIPPING a paperback book
books
	 CountVectorizer
		1.00	by
		1.00	a
		1.00	shipping
		1.00	book
		1.00	free
	 DictionaryVectorizer
		1.00	by
		1.00	a
		1.00	shipping
		1.00	book
		1.00	free
	 TfidfVectorizer
		2.83	by
		2.83	a
		2.83	shipping
		2.83	free
		1.45	book
	 TfidfVectorizer
		1.10	by
		1.10	a
		1.10	shipping
		1.10	book
		1.10	free

Lanes Calm Life Nutrition Supplement For Relaxation And Tranquility Capsules
nutrition
	 CountVectorizer
		1.00	caps