### Vectorizing

As a simple foundation, the **BOW** (bag-of-words) representation expresses text in mathematical form by describing a document as a dictionary of word frequencies.
<br>
The next step is to go further and represent that textual data as a **vector** of those word counts.

In [1]:
import pandas as pd 
import nltk
from collections import Counter 
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text

In [2]:
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())

In [4]:
token_counts = Counter(tokens)

In [5]:
nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('english')

In [6]:
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)

In [7]:
doc_vector = []
doc_length = len(tokens)
for key, value in kite_counts.most_common():
    doc_vector.append(value/doc_length)

In [9]:
# Normalized term frequencies of the five most common tokens
doc_vector[:5]

[0.07207207207207207,
 0.06756756756756757,
 0.036036036036036036,
 0.02252252252252252,
 0.018018018018018018]

Technical note - as these document vectors get larger, it's best to move away from Python built-ins and use data structures designed for vectorized operations, such as `numpy` arrays.
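As a minimal sketch of that note, the loop that builds `doc_vector` can be replaced by a single array division. The token list below is a hypothetical stand-in (not the kite text), used only so the example runs without `nlpia`.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-in tokens; in the notebook these would be the
# stopword-filtered kite tokens.
tokens = ["kite", "wind", "kite", "line", "kite", "wind", "sail", "kite"]

counts = Counter(tokens)

# most_common() orders terms by descending count, as in the loop above.
terms, raw_counts = zip(*counts.most_common())

# One vectorized division replaces the per-element Python loop.
doc_vector = np.array(raw_counts, dtype=float) / len(tokens)
```

Keeping `terms` alongside `doc_vector` records which token each position refers to, which the plain list of frequencies does not.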

Applying mathematical operations to vectors only makes sense when every vector is expressed relative to the same set of features.
<br>
For those operations to be meaningful, each vector needs to represent a position in a common space - relative to something consistent.
* Vector considerations - Vectors need to have the same origin and share the same scale (and units) on each of their dimensions
<br>
<br>
1) The first step is to normalize the counts by calculating the normalized term frequency instead of the raw count in each document
<br>
2) The second step is to ensure that all the vectors have the same standard length, or dimension

We also want each element of the vector to represent the same word in every document's vector.
* ***lexicon*** - The collection of distinct words in the vocabulary; in this case, every unique word in the union of the token sets from all the documents

In [11]:
from nlpia.data.loaders import harry_docs as docs
docs

['The faster Harry got to the store, the faster and faster Harry would get home.',
 'Harry is hairy and faster than Jill.',
 'Jill is not as hairy as Harry.']

In [12]:
doc_tokens = [] 
for doc in docs:
    doc_tokens.append(sorted(tokenizer.tokenize(doc.lower())))

In [14]:
len(doc_tokens[0])

17

In [17]:
all_doc_tokens = sum(doc_tokens, [])
len(all_doc_tokens)

33

In [18]:
lexicon = sorted(set(all_doc_tokens))

In [23]:
len(lexicon)
lexicon

[',',
 '.',
 'and',
 'as',
 'faster',
 'get',
 'got',
 'hairy',
 'harry',
 'home',
 'is',
 'jill',
 'not',
 'store',
 'than',
 'the',
 'to',
 'would']

Hence, each of the three document vectors needs to hold 18 values - even if its document doesn't contain all 18 tokens in our lexicon.
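The two steps above (normalized term frequency, then a fixed dimension shared by every document) can be sketched as follows. The `tokenize` helper is an assumption: a simple regex stand-in for `TreebankWordTokenizer`, used so the example runs without `nltk`. For these three sentences it should yield the same 18-entry lexicon, but it is not equivalent in general.

```python
import re
from collections import Counter

docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]


def tokenize(text):
    # Assumption: a regex stand-in for TreebankWordTokenizer that
    # separates words from punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


doc_tokens = [tokenize(doc) for doc in docs]

# The lexicon fixes both the length of every vector and the meaning
# of each position in it.
lexicon = sorted(set(token for tokens in doc_tokens for token in tokens))

doc_vectors = []
for tokens in doc_tokens:
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so tokens absent from a
    # document become zeros at their lexicon position.
    doc_vectors.append([counts[term] / len(tokens) for term in lexicon])
```

Because every vector follows the order of `lexicon`, position `lexicon.index('faster')` refers to the same token in all three documents, so the vectors can be compared element by element.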