# Word Embeddings with Word2Vec

Word embeddings represent words as vectors in a high-dimensional space so that neural networks can process natural language numerically. The main idea is that words with **similar meanings** lie **closer** together, while unrelated words lie far apart.
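One common way to measure how "close" two word vectors are is cosine similarity. The sketch below is only an illustration: the three-dimensional vectors and the words chosen are invented for this example and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, made up for illustration only.
# Real Word2Vec vectors usually have tens to hundreds of dimensions.
toy_vectors = {
    "merlot": [0.9, 0.8, 0.1],
    "cabernet": [0.85, 0.75, 0.2],
    "keyboard": [0.1, 0.05, 0.95],
}

sim_wines = cosine_similarity(toy_vectors["merlot"], toy_vectors["cabernet"])
sim_unrelated = cosine_similarity(toy_vectors["merlot"], toy_vectors["keyboard"])

print("merlot vs cabernet:", round(sim_wines, 3))
print("merlot vs keyboard:", round(sim_unrelated, 3))
```

With these toy values, the two wine words score much higher than the unrelated pair, which is the behaviour a trained embedding is expected to show for words of similar meaning.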

<img src='./images/w2v_vis.png' width='100%'/>

**Fig 1.** *Word2Vec Word Embeddings created from wine reviews*

Word2vec is a two-layer neural net that processes text. 

Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. 

Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

The purpose of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context in which individual words appear. It does so without human intervention.

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.

## How does it work?

We’re going to train a simple neural network with a single hidden layer to perform a certain task, but then we’re **not going to use that neural network** for the task we trained it on! Instead, the goal is just to **learn the weights of the hidden layer**. We’ll see that these weights are the “word vectors” that we’re trying to learn.
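To see why the hidden-layer weights can serve as word vectors, here is a minimal sketch. It assumes the input word is one-hot encoded, as in the usual skip-gram formulation. The vocabulary, the weight values and the helper names (`one_hot`, `hidden_layer`) are hypothetical and chosen only for illustration.

```python
vocab = ["the", "quick", "brown", "fox"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Hypothetical hidden-layer weights: one row per vocabulary word,
# one column per embedding dimension (3 here). Values are made up.
hidden_weights = [
    [0.1, 0.2, 0.3],  # "the"
    [0.4, 0.5, 0.6],  # "quick"
    [0.7, 0.8, 0.9],  # "brown"
    [1.0, 1.1, 1.2],  # "fox"
]


def one_hot(word):
    """Encode a word as a vector with a single 1.0 at its vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


def hidden_layer(one_hot_vec):
    """Multiply the one-hot row vector by the hidden weight matrix."""
    n_dims = len(hidden_weights[0])
    return [
        sum(one_hot_vec[i] * hidden_weights[i][j] for i in range(len(vocab)))
        for j in range(n_dims)
    ]


fox_vector = hidden_layer(one_hot("fox"))
print(fox_vector)
```

Because the input has a single non-zero entry, the multiplication simply selects one row of the weight matrix. In this sketch, that row is the word's vector, so after training the matrix can be used as a lookup table.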

## Fake Task to Train On

We’re going to train the neural network to do the following:

Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. For every word in our vocabulary, the network will output the probability that it is the “nearby word” we chose.
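A probability for every vocabulary word is typically obtained by passing the network's raw output scores through a softmax. The sketch below assumes a softmax output layer, as in the basic skip-gram model. The vocabulary and the scores are invented for illustration and are not produced by a trained network.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["the", "quick", "brown", "fox"]

# Hypothetical raw output scores for one input word, one score per vocabulary word.
output_scores = [2.0, 0.1, 1.5, 0.3]

probabilities = softmax(output_scores)
for word, p in zip(vocab, probabilities):
    print(f"P(nearby word = {word!r}) = {p:.3f}")
```

Each output position corresponds to one vocabulary word, and the highest score receives the highest probability.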

<img src='./images/w2v.png'/>

The network learns these statistics from the **number of times** each (input word, nearby word) pairing shows up in the training data.
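The pairings can be sketched as follows. The function name `skipgram_pairs`, the window size of 2 and the example sentence are assumptions made for this illustration. The sketch lists every neighbour within the window instead of sampling one at random, which makes the pair counts easy to inspect.

```python
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Return (input word, nearby word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

# Count how often each pairing occurs; these counts are the statistics
# the network picks up during training.
pair_counts = Counter(pairs)

for pair, count in pair_counts.most_common(5):
    print(pair, count)
```

On a large corpus, a pairing such as ("wine", "glass") would appear far more often than an unrelated one, so the trained network would assign it a higher probability.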

<img src='./images/w2v_network.png'/>

## Other notable NLU architectures for language modelling
- ELMo (Embeddings from Language Models)
- BERT (Bidirectional Encoder Representations from Transformers)