----
everything2vec
====

<img src="images/all_the_things.jpg" style="width: 400px;"/>

---
word2vec for Machine Translation
---

<img src="images/machine_translation.png" style="width: 400px;"/>

Translation between languages can be modeled as a rotation and scaling of the vector space.  

---
How are Machine Translations learned?
----

The translation matrix can be learned by bootstrapping from a small, manually labeled sample of word pairs, then extended to the entire language.

Steps:

1. Create a word embedding in both languages
2. Manually specify pairs (typically, simple concrete nouns)
3. Find the translation matrix
4. Apply translation matrix across entire language
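
Steps 2 to 4 can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the 2-D embeddings and the word pairs are made up, and the matrix is fitted by simple gradient descent on the squared error between translated source vectors and their target vectors. Real systems use embeddings with hundreds of dimensions and thousands of seed pairs.

```python
import math

# Toy 2-D embeddings, made up for illustration (step 1 would normally train these).
english = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "house": [1.0, 1.0], "horse": [2.0, 1.0]}
spanish = {"gato": [0.0, 2.0], "perro": [-2.0, 0.0], "casa": [-2.0, 2.0], "caballo": [-2.0, 4.0]}

# Step 2: a small, manually labeled seed dictionary (hypothetical pairs).
seed_pairs = [("cat", "gato"), ("dog", "perro"), ("house", "casa")]


def apply_matrix(W, x):
    """Return the matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def learn_translation_matrix(pairs, lr=0.1, steps=500):
    """Step 3: fit W by gradient descent on sum_i ||W x_i - z_i||^2."""
    xs = [english[src] for src, _ in pairs]
    zs = [spanish[tgt] for _, tgt in pairs]
    d_src, d_tgt = len(xs[0]), len(zs[0])
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(steps):
        grad = [[0.0] * d_src for _ in range(d_tgt)]
        for x, z in zip(xs, zs):
            pred = apply_matrix(W, x)
            for r in range(d_tgt):
                for c in range(d_src):
                    grad[r][c] += 2.0 * (pred[r] - z[r]) * x[c]
        for r in range(d_tgt):
            for c in range(d_src):
                W[r][c] -= lr * grad[r][c]
    return W


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def translate(word, W):
    """Step 4: project a source word and return the nearest target word."""
    projected = apply_matrix(W, english[word])
    return max(spanish, key=lambda tgt: cosine(projected, spanish[tgt]))


W = learn_translation_matrix(seed_pairs)
print(translate("horse", W))  # prints "caballo"
```

In this toy setup the target space is the source space rotated by 90 degrees and scaled by 2, so the learned matrix recovers exactly that rotation and scaling, and "horse" (never seen as a pair) still lands on "caballo".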

---
word2vec: Check for understanding
---

<details><summary>
What is the goal of word2vec in plain English?
</summary>
Create a dense vector representation of words that models semantic meaning based on context
</details>

<br>

<details><summary>
Why are neural networks powerful in machine learning?
</summary>
Capable of learning complex, arbitrary, non-linear relationships.
</details>

<br>

<details><summary>
What are the inputs and outputs of word2vec neural network during training?
</summary>
word and context
</details>

<br>

<details><summary>
What math operations are most common for word2vec embedding?
</summary>
Arithmetic, distance, and clustering. But any vector operation is meaningful.
</details>

<br>

<details><summary>
Every ML algorithm has to solve 3 basic problems:  
<br>
1. Representation  
2. Evaluation  
3. Optimization  
<br>
How does word2vec solve each of these?
</summary>
1. (dense vector) Representation of word semantics (key breakthrough)
2. Evaluation is through word context (thus needs lots of words)    
3. Optimization is backpropagation (aka, back prop)  
</details>

----
By the end of this Notebook, you should be able to:
----

- Apply word2vec beyond just words
- Understand why t-SNE is a preferred algorithm for high-dimensional visualization
- Describe common extensions to word2vec:
    - Machine translation
    - Dependency parsing extension
    - doc2vec, how word2vec can be extended to paragraphs and documents
    - Cutting-edge algorithms of Word Mover's Distance and Thought Vectors

----
doc2vec, a powerful extension of word2vec
====

<img src="http://img5.picload.org/image/paagccr/doc2vec.png" style="width: 400px;"/>

Doc2vec (aka paragraph2vec or sentence embeddings) extends the word2vec algorithm to larger blocks of text, such as sentences, paragraphs, or entire documents. 

![](images/overview_word.png)

![](images/overview_paragraph.png)

Every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. 
The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. 

The additional context does not have to be a fixed length (because it is vectorized and projected into the same space).

The paragraph vectors add parameters, but the updates are sparse, so training is still efficient.
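
The "averaged or concatenated" step can be made concrete with a small sketch. Everything here is a toy assumption: `D`, `W`, and the output weights `U` hold made-up 2-D values, and the softmax over `U` stands in for the real prediction layer. It is meant to show the data flow, not a trained model.

```python
import math

# Hypothetical toy parameters. D holds paragraph vectors, W holds word vectors,
# U holds output (softmax) weights. All values are made up.
D = {"para_0": [0.5, -0.5]}
W = {"the": [0.1, 0.3], "cat": [0.9, 0.1], "sat": [0.2, 0.8]}
U = {"the": [0.0, 0.1], "cat": [1.0, 0.0], "sat": [0.2, 1.0], "mat": [0.5, 0.5]}


def combine(paragraph_vec, word_vecs, mode="average"):
    """Merge the paragraph vector with the context word vectors."""
    vectors = [paragraph_vec] + word_vecs
    if mode == "average":
        # Same output size no matter how many context words there are.
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    if mode == "concat":
        # Output size grows with the number of context words.
        return [value for vec in vectors for value in vec]
    raise ValueError(f"unknown mode: {mode}")


def next_word_probabilities(hidden, output_weights):
    """Softmax over dot products between the hidden vector and each output vector."""
    scores = {w: sum(h * u for h, u in zip(hidden, vec)) for w, vec in output_weights.items()}
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


context = [W["the"], W["cat"], W["sat"]]
hidden = combine(D["para_0"], context, mode="average")
probs = next_word_probabilities(hidden, U)
print(max(probs, key=probs.get))
```

With `mode="concat"` the hidden vector here would have 8 entries instead of 2, so the output weights would need to be 8-dimensional as well.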

----
doc2vec Example: Descri.beer
---

<img src="https://timebusinessblog.files.wordpress.com/2013/03/85632599-e1364519588629.jpg?w=360&h=240&crop=1" style="width: 400px;"/>

How do you make sense of 1.6M beer reviews?

![](images/beer_space.jpg)

[Source](http://www.slideshare.net/BenEverson/describeer-demo)

----
emoji2vec
----
<img src="https://s3.amazonaws.com/instagram-static/engineering-blog/emoji-hashtags/tsne_map_tight.png" style="width: 400px;"/>

[Image](https://s3.amazonaws.com/instagram-static/engineering-blog/emoji-hashtags/tsne_map_tight.png)  
[Source](http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji)

----
Other notable vectorizations
----

| Name | Embedding  | 
|:-------:|:------:| 
| [Char2Vec](http://arxiv.org/abs/1508.06615) | Character |
| [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) | Word | 
| [GloVe](http://www-nlp.stanford.edu/pubs/glove.pdf) | Word | 
| [Doc2Vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) | Sections of text |
| [Image2Vec](Image2Vec) | Image |
| [Video2Vec](https://www.dropbox.com/s/m99k5md8461xi0s/ICIP_Paper_Revised.pdf) | Video |

[Source](http://datascienceassn.org/content/table-xx2vec-algorithms)

---
Thought Vectors
---

<img src="https://cdn-images-1.medium.com/max/2000/1*KYLrhDHqAAdQaJiN1G4ytA.jpeg" style="width: 400px;"/>

Geoffrey Hinton's (of Google) "Top Secret" new algorithm.

> When Google farts 💨, the rest of the world 💩

Instead of embedding words or documents in vector space, embed thoughts in vector space. Their features will represent how each thought relates to other thoughts. 

It hasn't been released, so this is mostly speculation. Keep an eye out for it.

[Additional general reading](http://deeplearning4j.org/thoughtvectors.html)<br>
[Skip-Thought Vectors paper](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)

---
Summary
---
- Given the properties of word2vec (e.g., large corpus input, straightforward training, and text vector output), it can be applied to a variety of problems.
- Word2vec offers another perspective on Machine Translation: rotation and scaling of the embedding space.
- Other semantic meanings can be captured by using dependency parsing as context.
- Longer pieces of text can also be embedded into the same space as words (i.e., doc2vec).

----
Bonus Material
-----

---
Concept level document similarity
---

> The Sicilian gelato was extremely rich.

vs.

> The Italian ice-cream was very velvety.

The statements reference the __same__ idea but share __no__ content words (only the stop words "The" and "was").
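
A quick check shows why bag-of-words similarity fails here. This sketch assumes a minimal, hand-picked stop-word list (`STOP_WORDS` is hypothetical, not from any library) and a naive whitespace tokenizer.

```python
# Assumption: a tiny hand-picked stop-word list, just enough for these two sentences.
STOP_WORDS = {"the", "was"}


def content_words(sentence):
    """Lowercase, strip the trailing period, and drop stop words."""
    return {word.strip(".").lower() for word in sentence.split()} - STOP_WORDS


a = content_words("The Sicilian gelato was extremely rich.")
b = content_words("The Italian ice-cream was very velvety.")

shared = a & b
jaccard = len(shared) / len(a | b)
print(shared, jaccard)  # prints "set() 0.0"
```

Any similarity measure that only counts overlapping tokens scores this pair as completely unrelated.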

----
Word Mover’s Distance (WMD)
----

![](images/wmd_illustration_1.png)

Represent text documents as a weighted point cloud of embedded words. 

The distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B.
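
The definition above can be tried on the gelato example with a brute-force sketch. Two loud assumptions: the 2-D embeddings are made up, and the code only covers the special case where both documents have the same number of words with equal weights. In that case an optimal flow is a one-to-one matching, so trying every permutation gives the exact answer. The general case needs a transport solver.

```python
import math
from itertools import permutations

# Hypothetical 2-D embeddings, chosen so that related words sit close together.
embedding = {
    "sicilian": (0.0, 0.0), "italian": (0.0, 1.0),
    "gelato": (3.0, 0.0), "ice-cream": (3.0, 1.0),
    "extremely": (0.0, 3.0), "very": (0.0, 4.0),
    "rich": (3.0, 3.0), "velvety": (3.0, 4.0),
}


def word_movers_distance(doc_a, doc_b):
    """Exact WMD for equal-length documents with uniform word weights."""
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles documents of equal length")
    # Each word carries weight 1/n, so the cost of a matching is its mean distance.
    best_total = min(
        sum(math.dist(embedding[a], embedding[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best_total / len(doc_a)


doc_a = ["sicilian", "gelato", "extremely", "rich"]
doc_b = ["italian", "ice-cream", "very", "velvety"]
print(word_movers_distance(doc_a, doc_b))  # prints 1.0
```

The permutation search grows factorially with document length, which is fine for four words and hopeless for real documents.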

----
Earth mover’s distance metric (EMD)
-----

Word Mover’s Distance (WMD) is a special case of the [earth mover’s distance metric (EMD)](https://en.wikipedia.org/wiki/Earth_mover%27s_distance).

EMD is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space where a distance measure between single features, called the ground distance, is given. The EMD 'lifts' this distance from individual features to full distributions.
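
Written out for documents, the "lifting" is a small linear program. The notation below is an assumption chosen for this sketch: $d_i$ and $d'_j$ are the normalized word frequencies of the two documents, $x_i$ is the embedding of word $i$, $c(i, j) = \lVert x_i - x_j \rVert_2$ is the ground distance, and $T_{ij}$ is how much of word $i$ travels to word $j$.

$$
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij} \, c(i, j)
\quad \text{subject to} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j
$$

The two constraints say that all the weight of each word in the first document must leave it, and each word in the second document must receive exactly its own weight.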

[Deep dive on EMD](http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm)

---
Example
----

![](images/WMD_worked_example.png)

WMD achieves state-of-the-art kNN classification accuracy, but it is the slowest metric to compute.

[Source: From Word Embeddings To Document Distances](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf)

[Application to Data Science](http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/)

---
lda2vec 
----

<img src="images/catdog_word2vec_cropped.jpg" style="width: 400px;"/>

---
LDA overview
---

<img src="http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png" style="width: 400px;"/>

<img src="https://i.ytimg.com/vi/Acs_esny-qQ/hqdefault.jpg" style="width: 400px;"/>

----
lda2vec
-----

<img src="images/lda2vec.png" style="width: 400px;"/>

$v_{doc}$ is a mixture:  
$v_{doc}$ = a $v_{topic1}$ + b $v_{topic2}$ + ... 

![](images/doc_vec.png)
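
The mixture formula can be sketched directly. The topic vectors and the unnormalized document scores below are made up, and turning the scores into proportions with a softmax is an assumption of this sketch (it keeps the weights positive and summing to 1, like LDA topic proportions).

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_topics(weights, topic_vecs):
    """Return the weighted sum of topic vectors: a*v_topic1 + b*v_topic2 + ..."""
    dim = len(topic_vecs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, topic_vecs)) for i in range(dim)]


topic_vecs = [[1.0, 0.0], [0.0, 1.0]]  # v_topic1, v_topic2 (made-up values)
doc_scores = [2.0, 0.0]                # hypothetical unnormalized document weights

a, b = softmax(doc_scores)
v_doc = mix_topics([a, b], topic_vecs)
print(a, b, v_doc)
```

Because the weights sum to 1, they can be read as "this document is mostly topic 1 with some topic 2", while `v_doc` still lives in the same space as the word vectors.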

---
word2vec vs. LDA
----

<img src="images/compare_models.png" style="width: 400px;"/>


| algorithm | scope | prediction | numbers | visualization | density | metaphor| 
|:-------:|:------:| :------:| :------:| :------:| :------:|  :------:|
| word2vec | local | one word predicts nearby words | real numbers | bar chart | dense | location  |
| LDA | global | documents predict global words | percentages that sum to 100%  | pie chart | sparse | mixture| 


---
lda2vec Executive Summary
----

![](images/punchline.png)

lda2vec adds additional context to word2vec and defines that context as the topic.

---
lda2vec implementation
----

[GitHub repo](https://github.com/cemoody/lda2vec)

<br>
<br> 
<br>

----