### Vectorizing

As a simple foundation, the **BOW** (bag-of-words) representation lets us describe a document mathematically as a frequency dictionary of its words.
<br>
The next step is to go further and turn those word counts into a **vector**.

In [4]:
import pandas as pd 
import nltk
from collections import Counter 
from nltk.tokenize import TreebankWordTokenizer
from nlpia.data.loaders import kite_text

In [5]:
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())

In [6]:
token_counts = Counter(tokens)

In [7]:
nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('english')

In [8]:
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)

In [9]:
doc_vector = []
doc_length = len(tokens)
for key, value in kite_counts.most_common():
    doc_vector.append(value/doc_length)

In [10]:
# Normalized term frequencies of the five most common tokens
doc_vector[:5]

[0.07207207207207207,
 0.06756756756756757,
 0.036036036036036036,
 0.02252252252252252,
 0.018018018018018018]

Technical note: as these document vectors grow, it's best to move away from Python built-ins and use data structures designed for vectorized operations, such as `numpy` arrays.
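
As a minimal sketch of that note, the normalized term-frequency vector can be built as a NumPy array with one vectorized division instead of a Python loop. The token list below is a made-up stand-in for the stop-word-filtered `tokens` from `kite_text`, and the `sample_*` names are hypothetical.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-in for the filtered `tokens` list built above.
sample_tokens = ["kite", "wing", "kite", "lift", "kite", "wing"]

sample_counts = Counter(sample_tokens)

# Raw counts, ordered from most to least common (same order as most_common()).
raw_counts = np.array([count for _, count in sample_counts.most_common()])

# One vectorized division replaces the explicit for-loop.
tf_vector = raw_counts / len(sample_tokens)

print(tf_vector)
```

Because every token in the list is counted, the elements of `tf_vector` sum to 1.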

Applying mathematical operations to vectors only makes sense when every vector is expressed relative to the same features.
<br>
In other words, the vectors need to represent positions in a common space, relative to something consistent.
* Vector considerations - Vectors need to have the same origin and share the same scale (and units) on each of their dimensions
<br>
<br>
1) The first step is to normalize the counts by calculating the normalized term frequency instead of the raw counts in the document
<br>
2) The second step is to ensure that all the vectors have the same standard length (dimension)

We also want each element of the vector to represent the same word in every document's vector.
* ***lexicon*** - The collection of distinct words in the vocabulary, found here by taking every unique word in the union of the token sets of all the documents

In [11]:
from nlpia.data.loaders import harry_docs as docs
docs

['The faster Harry got to the store, the faster and faster Harry would get home.',
 'Harry is hairy and faster than Jill.',
 'Jill is not as hairy as Harry.']

In [12]:
doc_tokens = [] 
for doc in docs:
    doc_tokens.append(sorted(tokenizer.tokenize(doc.lower())))

In [13]:
len(doc_tokens[0])

17

In [14]:
all_doc_tokens = sum(doc_tokens, [])
len(all_doc_tokens)

33

In [15]:
lexicon = sorted(set(all_doc_tokens))

In [25]:
print(len(lexicon))
print(lexicon)

18
[',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'store', 'than', 'the', 'to', 'would']


Hence, each of the three document vectors needs 18 values, even if a document doesn't contain all 18 tokens in our lexicon.
* Each token is assigned a *slot* in the vectors corresponding to its position in the lexicon
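
The slot assignment can be sketched with a plain index lookup. This is an illustrative variant that uses a list rather than the `OrderedDict` used in the notebook; `toy_lexicon`, `token_to_slot` and `count_vector` are hypothetical names, and the three-word lexicon is made up for brevity.

```python
from collections import Counter

# A tiny made-up lexicon, already sorted so that slot order is deterministic.
toy_lexicon = ["faster", "harry", "jill"]

# Map each token to its slot (its position in the lexicon).
token_to_slot = {token: slot for slot, token in enumerate(toy_lexicon)}


def count_vector(tokens):
    """Return a raw word-count vector with one slot per lexicon entry."""
    vector = [0] * len(toy_lexicon)
    for token, count in Counter(tokens).items():
        if token in token_to_slot:  # ignore out-of-lexicon tokens
            vector[token_to_slot[token]] = count
    return vector


print(count_vector(["harry", "faster", "faster"]))  # [2, 1, 0]
print(count_vector(["jill"]))                       # [0, 0, 1]
```

Every vector has the same length, and a given slot always refers to the same word, whichever document the vector came from.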

In [17]:
# An OrderedDict remembers insertion order, so every vector lists the lexicon tokens in the same order
from collections import OrderedDict
zero_vector = OrderedDict((token, 0) for token in lexicon)
zero_vector

OrderedDict([(',', 0),
             ('.', 0),
             ('and', 0),
             ('as', 0),
             ('faster', 0),
             ('get', 0),
             ('got', 0),
             ('hairy', 0),
             ('harry', 0),
             ('home', 0),
             ('is', 0),
             ('jill', 0),
             ('not', 0),
             ('store', 0),
             ('than', 0),
             ('the', 0),
             ('to', 0),
             ('would', 0)])

In [18]:
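import copy 
doc_vectors = [] 
for doc in docs:
    # Start each document from a fresh copy of the all-zero lexicon vector
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        # Note: this divides by the lexicon size (18), not by the document length
        vec[key] = value/len(lexicon)
    doc_vectors.append(vec)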

In [19]:
doc_vectors

[OrderedDict([(',', 0.05555555555555555),
              ('.', 0.05555555555555555),
              ('and', 0.05555555555555555),
              ('as', 0),
              ('faster', 0.16666666666666666),
              ('get', 0.05555555555555555),
              ('got', 0.05555555555555555),
              ('hairy', 0),
              ('harry', 0.1111111111111111),
              ('home', 0.05555555555555555),
              ('is', 0),
              ('jill', 0),
              ('not', 0),
              ('store', 0.05555555555555555),
              ('than', 0),
              ('the', 0.16666666666666666),
              ('to', 0.05555555555555555),
              ('would', 0.05555555555555555)]),
 OrderedDict([(',', 0),
              ('.', 0.05555555555555555),
              ('and', 0.05555555555555555),
              ('as', 0),
              ('faster', 0.05555555555555555),
              ('get', 0),
              ('got', 0),
              ('hairy', 0.05555555555555555),
              ('harry', 0.05

This is a coarse way to build word count vectors for each document in Python using only built-ins.

#### Vector spaces

In [20]:
from pathlib import Path
import os 

In [21]:
path = Path().home()/'Desktop'/'nlp-map-project'/'chp3-nlpia-notes'/'img-vects'
os.chdir(path)

<img src="img-vects/NLPIA-vect-2D.png" alt="Vectors in 2D space" width="400" height='400'/>

Figure 1
<br>
*NLPIA* Lane, Howard and Hapke (2019) chp 3.2.1 pp. 234 Apple iBooks.

Vectors are the foundation of linear algebra and its associated computations.
* *Space* - the collection of all possible vectors that could appear in that space
* Representation - an ordered list of numbers (coordinates) in a vector space; the spaces used here are *rectilinear/Euclidean*
* Association - a vector describes a location or position in that space, or a direction and magnitude/distance
* Dimensions - A 2D space, as in Figure 1, holds vectors with two values; by the same logic a 3D space holds vectors with three values, and so on
<br>
<br>
Bear in mind that some pairs of values ***cannot*** be treated as vectors in a rectilinear 2D space like the one above. Geospatial coordinates (longitude and latitude) are an example, because they describe positions on a curved surface.

For a natural language document vector space, the ***dimensionality*** is the **number of distinct words that appear in the whole corpus**.
<br>
The dimensionality is commonly written as follows:
* TF - The dimensionality is denoted **"K"**
* Distinct words/lexicon - The vocabulary size of the corpus, which academic papers denote **"|V|"**
<br>
Each document can then be described within this **K-dimensional vector space** by a **K-dimensional vector**


In our simple example of `doc_vectors`, K = 18 in our three document corpus.

<img src="img-vects/NLPIA-tf-freq-angle-vects.png" alt="term frequency word vectors and corresponding angles in 2D space" width="400" height='400'/>

Figure 2
<br>
*NLPIA* Lane, Howard and Hapke (2019) chp 3.2.1 pp. 238 Apple iBooks.

To make the example easier to visualize, Figure 2 reduces **K** from the 18 dimensions of our original lexicon to two, so the vectors can be drawn in 2D.

**Similarity** - Two vectors are similar when they point in a similar direction
<br>
**Magnitude** - the "length" of the vector; two word count (TF) vectors of similar magnitude correspond to documents of about the same length
<br>
<br>
A good estimate of document similarity should pick up on the same words being used about the same number of times, in similar proportions.
<br>
Such a metric gives us confidence that the documents the vectors represent are likely talking about similar things/topics.

$\cos(\Theta) = \frac{A \cdot B}{|A| \times |B|}$

**Cosine similarity** - In the formula above, $\Theta$ is the angle between two vectors (A and B), and its cosine indicates how 'similar' the TF vectors of different documents are.
<br>
The numerator is the dot product of the two vectors, which is normalized by the product of their Euclidean norms (|A|*|B|).
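
As a small worked example (the vectors are made up for illustration), take $A = (1, 1, 0)$ and $B = (1, 0, 1)$:

$A \cdot B = 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1$
<br>
$|A| = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}$ and $|B| = \sqrt{1^2 + 0^2 + 1^2} = \sqrt{2}$
<br>
$\cos(\Theta) = \frac{1}{\sqrt{2} \times \sqrt{2}} = \frac{1}{2}$, so $\Theta = 60^\circ$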

* Interpreting output - For most ML problems, cosine similarity has a convenient range in terms of output from -1 to +1

In [22]:
# Computing cosine similarity using numpy 
import numpy as np 
def cos_sim(vect_a, vect_b):
    # Dot product divided by the product of the two Euclidean norms
    return np.dot(vect_a, vect_b) / (np.linalg.norm(vect_a) * np.linalg.norm(vect_b))

In [23]:
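# Computing cosine similarity using pure python via the built-in 'math' module
import math 
def cos_sim(vect_a, vect_b):
    """ Convert from dictionary to lists for easier computation and matching """
    # Assumes both dictionaries hold the same keys in the same order (as copies of zero_vector do)
    a, b = list(vect_a.values()), list(vect_b.values())
    dot_prod = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x ** 2 for x in a))
    mag_b = math.sqrt(sum(x ** 2 for x in b))
    return dot_prod / (mag_a * mag_b)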

Computing cosine similarity programmatically starts with the dot product of the two vectors: multiply their elements pairwise and sum those products.
<br>
Then divide by the norm (magnitude) of each vector, which is the Euclidean distance from the tail to the head of each vector as seen in Figure 2.
* ***normalized dot product*** - The result lies between -1 and +1, just like the cosine of the angle between the two vectors
* Value illustration - How much the vectors point in the same direction
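
To see the computation end to end, the sketch below applies it to the three `harry_docs` sentences shown earlier. It rests on a few assumptions: the sentences are copied in as literals, a simple regular expression stands in for `TreebankWordTokenizer` so the block runs without NLTK, raw counts are used (dividing every element of a vector by the same positive constant does not change the cosine), and `tokenize`, `count_vector`, `cosine_similarity` and the `sample_*` names are hypothetical.

```python
import math
import re
from collections import Counter

sample_docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]


def tokenize(text):
    """Split into lowercase word and punctuation tokens (regex stand-in)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


doc_counts = [Counter(tokenize(doc)) for doc in sample_docs]
sample_lexicon = sorted(set().union(*doc_counts))


def count_vector(counts):
    """One slot per lexicon entry; a Counter returns 0 for absent tokens."""
    return [counts[token] for token in sample_lexicon]


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot_product / (norm_a * norm_b)


vectors = [count_vector(counts) for counts in doc_counts]
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

With these counts, the second and third sentences come out as the most similar pair, which fits the intuition that they share "harry", "hairy", "jill" and "is".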

***Cosine similarity of 1:***

A cosine similarity value of 1 indicates vectors that point in exactly the same direction along all dimensions (they are identical once normalized). The vectors may have different lengths/magnitudes, but they point in the same direction.
<br>
<br>
It goes without saying that the closer a cosine similarity value is to 1, the smaller the angle between the two vectors.
<br>
For NLP document vectors, this means the documents use similar words in similar proportions.
<br>
Hence, documents whose vectors are close to each other are likely talking about the same thing/topic.


***Cosine similarity of 0:***

A cosine similarity value of 0 represents two vectors that share no components whatsoever. They are orthogonal/perpendicular in all dimensions (relative to the origin).
<br>
For NLP TF vectors, this occurs when the two documents share no words in common.
<br>
<br>
Because these documents use completely different words, they would seem to be talking about different things.
<br>
However, this doesn't necessarily mean they have different *meanings or topics*; it only means that they use different words.

***Cosine similarity of -1:***

A cosine similarity value of -1 represents two vectors that are completely dissimilar, as they point in opposite directions. This cannot happen for simple word count (term frequency) vectors, or for normalized TF vectors.
<br>
Word counts and term frequencies cannot be negative, so these vectors always lie in the same quadrant of the vector space.
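
The three reference values can be checked on small hand-made vectors. This is only an illustrative sketch: `cosine_similarity` is a hypothetical helper on plain lists, and the vectors are invented rather than derived from any document.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Dot product divided by the product of the Euclidean norms."""
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot_product / (norm_a * norm_b)


# Same direction, different magnitude: similarity is 1 (up to rounding).
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No shared non-zero dimensions (no words in common): similarity is 0.
orthogonal = cosine_similarity([1, 2, 0], [0, 0, 3])

# Opposite direction needs negative components, which counts never have.
opposite = cosine_similarity([1, 2, 0], [-1, -2, 0])

print(same_direction, orthogonal, opposite)
```

Since every term in the dot product of two non-negative vectors is itself non-negative, count and TF vectors can only produce values between 0 and 1.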