# Representations

In this tutorial, we will work on _numerical representations of text_, which we studied in the last reading. We will try different types of vectorization on easy examples.

## 1. Bag of Words Model

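Find the unique words across the list of documents; these form the vocabulary. Then represent each document as a vector with one entry per vocabulary word: `1` if the word appears in the document, `0` otherwise. Every document vector therefore has the same length as the vocabulary. The same vocabulary is reused to vectorize new documents. Note that splitting on whitespace keeps punctuation attached to tokens (the vocabulary below contains `superb,`), and that the order of the vocabulary can vary between runs because it is built from a `set`.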

In [1]:
# load data
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]

# convert to lowercase then split
words = list(set([
    word for doc in docs for word in doc.lower().split()
]))

# vectorize
vectors = []
for doc in docs:
    vectors.append([1 if word in doc.lower().split() else 0 for word in words])
print('vocabulary:', words)
print('vectors:', vectors)

vocabulary: ['i', 'love', 'this', 'hate', 'in', 'superb,', 'phone', 'am']
vectors: [[1, 1, 1, 0, 1, 1, 1, 1], [1, 0, 1, 1, 0, 0, 1, 0]]
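
To illustrate how a fixed vocabulary is reused for a new document, here is a minimal sketch. The helper names `build_vocabulary` and `binary_vector` and the sentence `"I love this camera"` are assumptions made for illustration; they are not part of the original notebook. The vocabulary is sorted so that the result is the same on every run.

```python
# Hypothetical helpers for illustration; not part of the original notebook.
def build_vocabulary(docs):
    """Collect the unique lowercase, whitespace-separated tokens in sorted order."""
    return sorted({word for doc in docs for word in doc.lower().split()})


def binary_vector(doc, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    tokens = set(doc.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]
vocabulary = build_vocabulary(docs)

# Assumed example sentence; 'camera' is not in the vocabulary, so it is ignored.
new_doc = "I love this camera"
new_vector = binary_vector(new_doc, vocabulary)

print('vocabulary:', vocabulary)
print('new vector:', new_vector)
```

The new vector has the same length as the vocabulary, and the unseen word `camera` contributes nothing to it.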


## 2. Word Counts with `CountVectorizer`

Tokenize the collection of documents and build a vocabulary from it. We can then use this vocabulary to encode new documents. Both steps are handled by `CountVectorizer` from the `sklearn` library. By default, this class also lowercases the documents and strips punctuation during tokenization. Its default tokenizer keeps only tokens of two or more characters, which is why `i` is missing from the vocabulary below.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# list of documents
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocabulary
vectorizer.fit(docs)

print('vocabulary:', vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(docs)

# summarize encoded vector
print('shape:', vector.shape)
print('vectors:', vector.toarray())

vocabulary: {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
shape: (2, 7)
vectors: [[1 0 2 1 1 1 1]
 [0 1 0 0 1 0 1]]


`transform` returns a sparse matrix with one row per document and one column per vocabulary term. Each entry is the number of times that term occurs in the document, so `in` gets a count of `2` in the first document. Words that are not in the vocabulary are ignored.
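
The vocabulary printed above has no `i`, and `superb` has lost its comma. Both effects are consistent with the default `token_pattern` of `CountVectorizer`, which matches runs of two or more word characters. The sketch below imitates that behaviour with the standard library only. It is an approximation for illustration, not `sklearn`'s actual implementation, and the `tokenize` helper is a hypothetical name.

```python
import re
from collections import Counter

# Default token_pattern documented for CountVectorizer: 2+ word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(doc):
    """Lowercase the document, then keep tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, doc.lower())


docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]
tokenized = [tokenize(doc) for doc in docs]

# Sorted vocabulary, so column j corresponds to the j-th term alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row of term counts per document.
counts = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print('vocabulary:', vocabulary)
print('counts:', counts)
```

Under these assumptions, the counts agree with the `CountVectorizer` output shown earlier for both documents.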

## 3. Word Frequencies with `TfidfVectorizer`

Raw word counts are a basic representation. They give the most weight to frequent words, which are often stop words that carry little meaning, while the rarer words that actually distinguish a document get less weight.

**TF-IDF**, short for _Term Frequency-Inverse Document Frequency_, calculates word scores that try to highlight words that are more interesting (i.e. frequent in a particular document but rare across the entire set of documents). There are a few types of _weighting schemes_ for **TF-IDF**. Below is an example.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of documents
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocabulary
vectorizer.fit(docs)

# summarize
print('vocabulary:', vectorizer.vocabulary_)
print('idfs:', vectorizer.idf_)

# encode document
vector = vectorizer.transform([docs[0]])

# summarize encoded vector
print('vectors:', vector.toarray())

vocabulary: {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
idfs: [1.40546511 1.40546511 1.40546511 1.40546511 1.         1.40546511
 1.        ]
vectors: [[0.35327777 0.         0.70655553 0.35327777 0.25136004 0.35327777
  0.25136004]]


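**idf** per _term_ is calculated as follows:
$$ idf(t) = \ln\left(\frac{1 + n_d}{1 + df(t)}\right) + 1 $$

Here $n_d$ is the total number of documents and $df(t)$ is the number of documents that contain the term $t$. With $n_d = 2$, a term that appears in both documents (`phone`, `this`) gets $\ln(3/3) + 1 = 1$, and a term that appears in only one gets $\ln(3/2) + 1 \approx 1.405$, which matches the `idfs` printed above.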

Once the idf values are computed, each term count in a document is multiplied by that term's idf. The final step is vector normalization: by default, `sklearn` applies `l2` normalization to each document vector.
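
As a check on the steps described here, the sketch below recomputes the idf values and the first document's vector by hand, using only the standard library. It assumes the default `TfidfVectorizer` settings (raw counts as term frequency, smoothed idf, `l2` normalization) and hard-codes the count matrix from Section 2. The `tfidf_row` helper is a hypothetical name introduced for illustration.

```python
import math

# Term order and counts copied from the CountVectorizer output in Section 2.
vocabulary = ['am', 'hate', 'in', 'love', 'phone', 'superb', 'this']
counts = [
    [1, 0, 2, 1, 1, 1, 1],  # "SUPERB, I AM IN LOVE IN THIS PHONE"
    [0, 1, 0, 0, 1, 0, 1],  # "I hate this phone"
]

n_docs = len(counts)

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed idf, as in the formula above (natural logarithm).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit l2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]


first = tfidf_row(counts[0])

print('idfs:', [round(w, 8) for w in idf])
print('vector:', [round(x, 8) for x in first])
```

Under these assumptions, the printed values agree with the `idfs` and `vectors` output of `TfidfVectorizer` shown earlier, up to rounding.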

**TF-IDF** is arguably the most useful vectorization method among those discussed here. Unlike **word counts**, **IDF** weights deprioritize stop words and reward distinctive words that are implicitly significant to the documents they appear in.