<h1 style="text-align: center;"><span style="color: #333399;">Computational Linguistics: Text Vectorization and Word Embedding</span></h1>

<h6 style="text-align: center;">Created by: Michael Gagliano on 2/8/19</h6>
<h6 style="text-align: center;">Last Update: Michael Gagliano 2/12/19</h6>


In [1]:
import numpy as np
import pandas as pd
import pprint as pp

<h1 style="text-align: center;">Background Information</h1>

Computational Linguistics and Natural Language Processing are interdisciplinary fields of study drawing from speech linguistics, phonology, cognitive science, neurolinguistics, computer science, machine learning, deep learning, pragmatics, and many more.


<h3 style="text-align: center;">Why is NLP important?</h3>

One of the most significant problems in NLP (Natural Language Processing) - and a paradigm of programming in general - is the problem of ***ambiguity.***  

Natural language is abstract. Humans are great at inferring meaning from speech and text. Computers? Not so much.

**The goal of Computational Linguistics - in an extremely simplified sense - is making computers and software understand natural language. This means both speech and text.**

If you are designing a car, and someone tells you to "make it fast", how do you determine what fast means? Acceleration? Top speed? Relative to a turtle, my car, or an F1 car? 

Programming works to resolve these abstract ideas and concepts by turning them into something explicitly defined, much like following a recipe.

Below are additional specific examples of linguistic elements that humans are inherently good at understanding, but computers are not.

>**Polysemy** is the coexistence of multiple meanings for one word   
    *e.g. "My mouth is sore" vs. "Watch your mouth"*

>**Implication** is speech that indirectly assumes meaning through inference  
    *e.g. "The sky looks really bad" can imply bad weather is coming*

>**Figurative Language/Tropes** such as metaphors, metonyms, and idioms provide symbolic meaning through comparison. 

___

<h1 style="text-align: center;">Word Embedding</h1>


[Word2Vec](#Word2Vec)
[Doc2Vec](#Doc2Vec)
[GloVe](#GloVe)



> "**Word Embedding** is a representation of text where words that have the same meaning have a similar representation. In other words it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together. In the deep learning frameworks such as TensorFlow, Keras, this part is usually handled by an **embedding layer** which stores a lookup table to map the words represented by numeric indexes to their dense vector representations."

Source: https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456

**Simplified:**

> Word embedding is a language modeling technique that takes words found in text documents and maps them to a vector form of real numbers using distributed representation. Consider this "translating" the words to a form the computer can then use for natural language processing while capturing them with as much of the semantical/morphological/context/hierarchical/etc. information as possible.

---

<h1 style="text-align: center;">How does it work?</h1>

Before we get deep into how word embedding works and the role it plays in deep learning, let's start by understanding how we get text documents into a form that can be interpreted and utilized for natural language processing as a whole. 

Below is an introduction on how text corpora undergo vectorization so they can be interpreted and analyzed, starting with two primitive methods.

<h2 style="text-align: center;">Text Vectorization</h2>

<h4 style="text-align: center;"><span style="color: red">Note: This is not yet word embedding. This section covers the most common methods for vectorizing text corpora (CountVectorizer/Binarized OneHotEncoding, and TF-IDF). These methods do not give a <u>distributed representation</u>, which is essential to word embedding techniques.</span></h4>

**What:** Text Vectorization is a specific type of <i>Feature Extraction</i> in NLP. By allowing us to represent documents numerically, we gain the power to perform analytics on the data as well as create instances that are fed into machine learning algorithms, like Word Embedding. Whether you are using a pre-designed vectorization method (e.g. CountVectorizer) or building a custom feature extraction algorithm, the functional output represents the attributes and properties of the documents.

**Measures:** Word Frequency via unigrams


**Why:** Document classification. Compare document similarities based on the distance between their vectors.  

<span style="color: red"><b>Does not account for:</b> Syntax, semantics</span>




### i. SciKit-Learn Count Vectorizer ([Bag-of-Words Approach][Ref1])



[Ref1]: https://en.wikipedia.org/wiki/Bag-of-words_model

This is a very simple and naive method of text vectorization. SciKit-Learn's CountVectorizer class creates a sparse matrix in which each document is stored as a vector of word counts. 

<i>IMPORTANT: CountVectorizer tokenizes the corpus for you, whereas tools such as NLTK and Gensim generally expect you to preprocess and tokenize/lemmatize/etc. prior to vectorization</i>

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# create CountVectorizer object
vectorizer = CountVectorizer(binary=True) # See note 1 below
corpus = [
          'Text of first document.',
          'Text of the second document made longer.',
          'Number three.',
          'This is number four.',
]
# learn the vocabulary and store CountVectorizer sparse matrix in X
X = vectorizer.fit_transform(corpus)
# columns of X correspond to the result of this method
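# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist()) # See Note 2 Below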
# retrieving the matrix in the numpy form
X.toarray()

['document', 'first', 'four', 'is', 'longer', 'made', 'number', 'of', 'second', 'text', 'the', 'this', 'three']


array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]], dtype=int64)

**Note 1:** Frequency-based encoding methods like Bag of Words disregard the grammar and syntax of the text documents. They also suffer from the long tail, or Zipfian distribution, that characterizes natural language: tokens that occur very frequently are orders of magnitude more “significant” than other, less frequent ones. This can have a significant impact on some models (e.g. generalized linear models) that expect normally distributed features.

The parameter (binary=True) is very important here. It lets CountVectorizer act as a one-hot style encoder for text, recording only whether each word is present in a document, without the input-shape limitation of OneHotEncoder (which expects a 2D array, whereas a corpus is a 1D list of strings). If the parameter is left as the default (False), the matrices will show the frequency of each word, not its binarized presence/absence.
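
To make the difference concrete, here is a minimal sketch comparing the two settings. The one-sentence corpus is made up for illustration (it is not part of the corpus above) and was chosen so that one word repeats.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus in which "four" appears twice
docs = ["Four for four deals."]

counts = CountVectorizer(binary=False).fit_transform(docs).toarray()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

# The learned vocabulary is sorted alphabetically: deals, for, four
print(counts)  # raw term frequencies
print(flags)   # presence/absence only
```

With the default setting the column for "four" holds a 2, while with binary=True every word that is present is recorded as 1.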

**Note 2:** I printed the feature names to make the point that when Bag of Words algorithms generate the lexicon from the corpus (each term that represents a feature), <i>the terms are sorted alphabetically</i>. This gives every document vector the same fixed, reproducible column order, so a given column always refers to the same term and vectors can be compared position by position. 

In [16]:
# transforming a new document according to the 'learned' vocabulary from corpus
vectorizer.transform(['I have to document three of my text messages today.',
                      'I have a document about Wendys Four for Four deals.']).toarray()

array([[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

### Interpreting the Results of CountVectorizer

In the output, each row is a document, each column is a term from the vocabulary (in the order printed above), and each cell records that term in that document: a 1/0 presence flag here because of binary=True, or a word count with the default settings. 

Each of the documents in the corpus is represented by a vector of equal length, and every word gets its own independent column. This means that in a corpus containing the words "good" and "great", those two words are just as different from each other as "car" and "tree". This is clearly not true, and it highlights the issue that the word-count vectors are stripped of context.
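
A small sketch of that point, using a hypothetical four-word vocabulary and plain NumPy: when every word owns its own column, the distance between any two distinct words is identical.

```python
import numpy as np

# Hypothetical vocabulary; row i of the identity matrix is the one-hot vector for vocab[i]
vocab = ["car", "good", "great", "tree"]
one_hot = np.eye(len(vocab))

def distance(a, b):
    """Euclidean distance between the one-hot vectors of two vocabulary words."""
    return float(np.linalg.norm(one_hot[vocab.index(a)] - one_hot[vocab.index(b)]))

print(distance("good", "great"))
print(distance("car", "tree"))
```

Both calls return the same value (the square root of 2), so nothing in this representation says that "good" is closer to "great" than "car" is to "tree".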



<b><u>For very large corpora:</u></b>


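In corpora where the number of documents and the number of features (the vocabulary size) are extremely large, it is much more efficient to use **SciKit-Learn's HashingVectorizer**. Sparse matrices have many benefits, but CountVectorizer must also hold the entire vocabulary dictionary in memory, and that dictionary grows with the corpus. HashingVectorizer instead hashes each token directly to a column index, so no vocabulary is stored. 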
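
As a rough sketch of how that looks in practice, the same toy corpus can be passed through HashingVectorizer. The value of n_features below is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'Text of first document.',
    'Text of the second document made longer.',
    'Number three.',
    'This is number four.',
]

# n_features fixes the number of columns up front (2**10 is arbitrary for this toy corpus)
hasher = HashingVectorizer(n_features=2**10)

# The vectorizer is stateless: there is no fit step and no stored vocabulary
X_hashed = hasher.transform(corpus)

print(X_hashed.shape)
```

The trade-off is that columns can no longer be mapped back to words, and two different words can occasionally collide in the same column.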

<h6 style="text-align: center;"><span style="color: red">Note: sklearn.preprocessing.OneHotEncoder</span></h6>

The class in SciKit-Learn called OneHotEncoder is not the best fit for this specific task, despite its name. One of the goals of word embedding is dimensionality reduction to increase model performance. The OneHotEncoder treats each vector component (column) as an independent categorical variable, expanding the dimensionality of the vector for each observed value in each column. In this case, the component (four, 0) and (four, 1) would be treated as two categorical dimensions rather than as a single binary encoded vector component. 

OneHotEncoder is better used for encoding categorical data in pandas DataFrames with far fewer rows and features.



---

### ii. TF-IDF Vectorization (Term Frequency - Inverse Document Frequency)

The bag-of-words representations only describe a document in a standalone fashion, not taking into account the context (syntax/semantics) of the corpus. A different approach would be to consider the relative frequency or rareness of tokens in the document against their frequency in other documents. 

**What**: TF-IDF is a weighting scheme, often used for similar purposes as Latent Dirichlet Allocation (LDA), in which weights are assigned to each word relative to its frequency in the documents. Words that appear in more of the documents will have a lower weight.

**Measures:** Score of each word's importance (weight) found in the corpus. 

**Why:** Topic Modeling and Document Classification. "What's being talked about" within a document or when comparing multiple corpora

> <b>The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.</b>

Source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
          'Text of first document.',
          'Text of the second document made longer.',
          'Number three.',
          'This is number four.',
]

tfidf  = TfidfVectorizer()
# note: this rebinds `corpus` from the list of strings to the sparse TF-IDF matrix
corpus = tfidf.fit_transform(corpus)
print(corpus)

  (0, 9)	0.4658085493691629
  (0, 7)	0.4658085493691629
  (0, 1)	0.5908190806023349
  (0, 0)	0.4658085493691629
  (1, 9)	0.32555708674977907
  (1, 7)	0.32555708674977907
  (1, 0)	0.32555708674977907
  (1, 10)	0.41292788407934816
  (1, 8)	0.41292788407934816
  (1, 5)	0.41292788407934816
  (1, 4)	0.41292788407934816
  (2, 6)	0.6191302964899972
  (2, 12)	0.7852882757103967
  (3, 6)	0.41428875116588965
  (3, 11)	0.5254727492640658
  (3, 3)	0.5254727492640658
  (3, 2)	0.5254727492640658


When the output is viewed in its current (sparse) format, each line uses the notation:

**((doc, term), TFIDF)**

**Converting the output to a dense array**

In [13]:
# `corpus` was rebound to the sparse TF-IDF matrix in the previous cell, so convert it directly
tfidf_matrix = corpus.toarray()
print(tfidf_matrix)

## Calling tfidf.fit_transform(corpus) again here raises AttributeError "lower not found".
## Reason: `corpus` no longer holds the raw strings, and the vectorizer tries to lowercase/tokenize each item


[[0.46580855 0.59081908 0.         0.         0.         0.
  0.         0.46580855 0.         0.46580855 0.         0.
  0.        ]
 [0.32555709 0.         0.         0.         0.41292788 0.41292788
  0.         0.32555709 0.41292788 0.32555709 0.41292788 0.
  0.        ]
 [0.         0.         0.         0.         0.         0.
  0.6191303  0.         0.         0.         0.         0.
  0.78528828]
 [0.         0.         0.52547275 0.52547275 0.         0.
  0.41428875 0.         0.         0.         0.         0.52547275
  0.        ]]


### Interpreting the Results of TfidfVectorizer:

Without converting it, the output of fit_transform() is a SciPy sparse matrix. Printing it lists only the non-zero entries, one per line, in the form:

**((doc, term), TFIDF)**

Calling .toarray() transforms it into a dense NumPy array where you can see the score of every term for every document, including the zeros. 
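
To see where the scores come from, the sketch below recomputes them by hand. It assumes scikit-learn's default TfidfVectorizer settings (raw term counts, smoothed IDF, L2 row normalization); different settings would give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    'Text of first document.',
    'Text of the second document made longer.',
    'Number three.',
    'This is number four.',
]

# Raw term counts, shape (n_documents, n_terms)
tf = CountVectorizer().fit_transform(docs).toarray()

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (assumed default behaviour)

weighted = tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

In the third document, "number" appears in two of the four documents and "three" in only one, which is why "three" ends up with the higher score.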

---

<h1 style="text-align: center;">So how does this tie into word embedding?</h1>

Methods like the ones shown above are the fundamental components that allow word embedding to work. Vectorization on its own still allows you to build classification models, such as a spam classifier using an SVM, or topic models using Non-negative Matrix Factorization (NMF) or LDA.

However, the true word embedding techniques that will be explained below also bring a concept of ***distributed representations***. You'll notice there are no negative values generated from the methods above, and every term occupies its own independent dimension. This prevents us from meaningfully comparing documents within a corpus that <i>don't</i> share terms. By creating a continuous vector space among all documents, we can measure their similarity/dissimilarity to one another. <b><u>Word embedding maintains syntactical elements and semantics</u></b>, and the resulting vectors can be fed directly into deep learning neural networks for rapid, large-scale application.

There are multiple methods for word embedding. Understanding the differences in functionality, scalability, and performance of each method will allow you to intuitively decide which one is more appropriate given the problem you are working to solve. 

---

<h1 style="text-align: center;">Types of Common Word Embedding Methods:</h1>

# 1. Word2Vec

**Measures:** [Cosine similarity][Ref2] between the two specified words/documents using word vectors (embeddings) of each. 
    - e.g. "puppy" and "dog" are similar, and both tend to be surrounded by words like "cute" and "fluffy". They would be expected to share a similar vector representation. 
    - Model is PREDICTIVE in the sense it predicts what words surround the target word.
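
As a quick illustration of the measure itself, here is a minimal cosine similarity function. The three-dimensional vectors are invented for this sketch; real embeddings are learned from a corpus and usually have tens to hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings; the values are made up for illustration only
toy_vectors = {
    "puppy": np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "river": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(toy_vectors["puppy"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["puppy"], toy_vectors["river"]))
```

With these made-up values, "puppy" scores much closer to "dog" than to "river", which is the behaviour a trained model is expected to produce.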

**Example Applications:** Creating a sentiment lexicon for sentiment analysis for a specific use-case like hotel review or movie reviews.

**Drawbacks:** In corpora with large repetitions of text documents/terms, similarity between documents is disproportionately affected because the weighted frequencies of the terms in each document are not accounted for.

[Word2Vec][Ref4], created by a team of researchers at Google led by Tomáš Mikolov, implements a word embedding model that enables us to create these kinds of distributed representations. It is a shallow, 2-layer neural network that turns raw text into numerical vectors that deep learning networks can understand and then use to predict. It is a ***feed-forward*** neural network that can be optimized using **SGD (Stochastic Gradient Descent)**.

There are two methods in Word2Vec: SkipGram and Continuous Bag of Words (CBOW).

Gensim allows us to use Word2Vec and Doc2Vec in Python, without having to adopt very large machine learning frameworks such as Keras or TensorFlow.

Further explanation can be found [here][Ref3]

[Ref2]: https://www.machinelearningplus.com/nlp/cosine-similarity/
[Ref3]: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
[Ref4]: https://arxiv.org/pdf/1301.3781.pdf

### i. CBOW

This [algorithm][Ref5] treats the context like the traditional Bag of Words algorithm does, ignoring word order within the window, but it learns a distributed representation by predicting the likelihood of a target word given the words adjacent to it. This gives us the ability to maintain context by predicting what words may lie adjacent to one another given a certain input.

**This algorithm is feed-forward and very fast to train and execute.**

[Ref5]: https://cs.stanford.edu/~quocle/paragraph_vector.pdf

### ii. SkipGram

I highly suggest reading this [resource by Chris McCormick][Ref6] that delves deep into how the SkipGram algorithm works.

The primary point I would like to make in this notebook is that, generally, SkipGram is much slower than CBOW, but considered more accurate with infrequent words. In the case of smaller datasets, the performance differential may not be as drastic. 

[Ref6]: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
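
To make the difference between the two methods concrete, the sketch below builds the training examples each one would see for a single sentence. The helper name and the window size of 2 are my own choices for illustration; this only shows how the examples are framed, not the network that learns from them.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ['text', 'of', 'the', 'second', 'document']

# CBOW: the context words together predict the target word
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# SkipGram: the target word predicts each context word separately
skipgram_examples = [(target, c) for target, context in context_windows(tokens) for c in context]

print(cbow_examples[2])       # (['text', 'of', 'second', 'document'], 'the')
print(skipgram_examples[:3])  # [('text', 'of'), ('text', 'the'), ('of', 'text')]
```

The same five-word sentence yields 5 CBOW examples but 14 SkipGram examples, which gives some intuition for why SkipGram takes longer to train.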

# 2. Doc2Vec

The [Doc2Vec][Ref9] algorithm is an extension of Word2Vec.

It proposes a paragraph vector: an unsupervised algorithm that learns fixed-length feature representations from variable-length documents. 

Just like in Word2Vec, this distributed representation attempts to inherit the semantic properties of words such that “red” and “colorful” are more similar to each other than they are to “river” or “governance.” 

The additional paragraph vector takes into consideration the ordering of words within a narrow context, similar to an n-gram model. The combined result is much more effective than a bag-of-words or bag-of-n-grams model because it:
 - Generalizes better
 - Has a lower dimensionality
 - Can consume less memory than Word2Vec, since one variant (PV-DBOW) does not need to store word vectors

[Ref9]: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

# 3. GloVe

Whereas Word2Vec/Doc2Vec are ***predictive*** models, [GloVe][Ref8] is **count-based.**

The algorithm itself is much like [LDA (Latent Dirichlet Allocation)][Ref7] - where it generates statistical matrices of word co-occurrences in a low dimensional space. However, LDA does not maintain the syntactical linear relationships in the vector space. 

**In a very loose sense of "complexity", think of TF-IDF -> LDA -> GloVe**

GloVe forces the model to learn/retain the linear relationship of the text documents based on the co-occurrence matrix. It accomplishes this by using a weighted least squares regression model to account for rare terms/low term frequencies in the documents.

[Ref7]: https://ai.stanford.edu/~ang/papers/jair03-lda.pdf
[Ref8]: https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
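
The two ingredients described above can be sketched in a few lines: counting co-occurrences within a window, and the weighting function that controls how much each count contributes to the least squares loss. The counting here is simplified (the GloVe paper also discounts pairs by their distance within the window), and the x_max=100 and alpha=0.75 defaults are the values reported in the paper. This is not a full GloVe implementation.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function: down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

sentences = [
    ['text', 'of', 'first', 'document'],
    ['text', 'of', 'the', 'second', 'document'],
]
counts = cooccurrence_counts(sentences)

print(counts[('text', 'of')])                 # 2: the pair co-occurs once in each sentence
print(glove_weight(counts[('text', 'of')]))   # small weight, because the pair is rare
```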


---

<h1 style="text-align: center;">Summary of Word Embedding Methods:</h1>

- Word2Vec, Doc2Vec, and GloVe all have advantages/disadvantages depending on their applications 
- Common applications of word embeddings are sentiment analysis, text classification, and topic modeling
- Word2Vec captures co-occurrence of document terms one window at a time
- Doc2Vec functions similarly, but the additional paragraph vector allows for more targeted predictive capabilities
- GloVe is much like LDA, but has built-in functionality to maintain syntax of the documents where LDA cannot do so.