# 2. Working With Word Vectors
In the introduction we went over some very basic NLP techniques and, more importantly, gained an understanding of how to apply _basic_ machine learning models and APIs to a problem revolving around text data.

One of the things that we went over was the **term-document matrix**, and how it can be used to convert text data into numerical data that a model can understand. What we did not discuss is that this method has a problem, namely its simple counting process. Words such as "a", "the", "and", "in", and "to" have a high count in ALL documents, no matter what the category is! This adds a large amount of noise that will often overshadow the meaningful words.

These words are known as **stopwords** and one common technique is to just remove them from the dataset before doing any machine learning. 
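As a rough illustration of the problem and of stopword removal, here is a minimal sketch. The stopword list, the sample sentence, and the helper name `count_words` are all made up for this example; real stopword lists are much longer.

```python
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "the", "and", "in", "to"}


def count_words(text, remove_stopwords=False):
    """Count lowercase, whitespace-separated tokens, optionally dropping stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)


text = "the cat and the dog went to the park in the rain"
raw = count_words(text)
filtered = count_words(text, remove_stopwords=True)

print(raw.most_common(1))  # [('the', 4)]
print(sorted(filtered))    # only the content words remain
```

Without filtering, the most frequent word says nothing about what the sentence is about; after filtering, only content words are counted.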

### 1.1 TF-IDF
However, there is another technique that we can utilize: **Term Frequency-Inverse Document Frequency** (TF-IDF). We will not go into the full details of TF-IDF; the gist is as follows. We know that words that appear in many documents are probably less meaningful. With this in mind, we can weight each vector component (in this case a word) by something related to how many documents that word appears in. So, intuitively speaking, we may do something like:

$$\frac{\text{raw word count}}{\text{document count}}$$ 

So, the numerator tells us _how many times this word appears in this document_, and the denominator tells us _how many documents this word appears in, in total_. In practice we apply some transformations to these quantities, like taking the log of the count, smoothing, and so on. However, the specific implementation isn't nearly as important for this course as the general understanding behind the process.
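To make the weighting concrete, here is a minimal sketch of one possible variant: the raw count multiplied by $\log(N / \text{document count})$, where $N$ is the number of documents. The exact formula, the toy corpus, and the function name `tf_idf` are assumptions for illustration; libraries differ in their smoothing and normalization choices.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [doc.split() for doc in documents]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(tokenized)


def tf_idf(tokens):
    """Weight each word of one document by count * log(N / document frequency)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}


weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:>4}: {weight:.3f}")
```

In this toy corpus "the" appears in every document, so its weight is zero even though it has the highest raw count, while "mat" (which appears in only one document) gets the largest weight.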

### 1.2 Key Point
One of the most important things to keep in mind during all subsequent posts is that no matter what technique we are using, we are always interested in a matrix of size $(V \times D)$, where $V$ is the vocabulary size (the total number of distinct words) and $D$ is the vector dimensionality. For example, if we are counting up the total number of times a word appears in each of a set of books, $D$ is the total number of books.
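The book-counting example can be sketched as follows. The three "books" are invented one-line strings and the variable names are assumptions for this example; the point is only the shape of the resulting matrix.

```python
import numpy as np

# Three toy "books", invented for illustration.
books = [
    "the king ruled the land",
    "the queen ruled the land",
    "the farmer grew apples",
]
tokenized = [book.split() for book in books]

# Map every distinct word to a row index.
vocabulary = sorted({word for tokens in tokenized for word in tokens})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

V, D = len(vocabulary), len(books)
counts = np.zeros((V, D), dtype=int)
for d, tokens in enumerate(tokenized):
    for word in tokens:
        counts[word_to_index[word], d] += 1

print(counts.shape)                  # (8, 3)
print(counts[word_to_index["the"]])  # [2 2 1]
```

Each row of `counts` is the $D$-dimensional vector for one word, which is the structure that every technique in this section produces.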

### 1.3 Word Embeddings
A final thing to note: we are going to encounter the term **word embedding** quite a bit. This is just a fancy name for an old and relatively straightforward concept: a feature vector that represents a word. In other words, we take a categorical object (a word, in this case) and map it to a list of numbers (a vector). We say that we have embedded this word into a vector space, and that is why we call them word embeddings.

# 2. Word Analogy
One of the most popular applications of word embeddings is **word analogies**. This is where the famous `king - man = queen - woman` comes from.

<img src="https://drive.google.com/uc?id=1mlFNZ-GeyzawODRG4Fl0p05fXPSBwOLp" width="700">

So, we are now going to focus on two main questions:

1. What are word analogies?
2. How can we calculate them?

First, however, a few examples. We can start with:

> King - Queen ~= Prince - Princess

On each side above, we have a male member of the royal family minus a female member of the royal family from the same generation.

> France - Paris ~= Germany - Berlin

Now, on each side we have a country minus a famous city from that country.

> Japan - Japanese ~= China - Chinese

Here, we have a country minus the term used to refer to the people of that country.

> Brother - Sister ~= Uncle - Aunt

Now, we have a male member of the family minus the corresponding female member of the family.

> Walk - Walking ~= Swim - Swimming

Finally, we can see that the embeddings were able to learn something about verb forms.

### 2.1 Visualizing Analogies
So, how can we actually visualize these analogies? As usual, it is very helpful to think of things geometrically. First, recall that word embedding just means word vector. In other words, if we have a grid, each word is represented by a dot on the grid. So, what happens when we subtract one vector from another? That yields the vector that points from one dot to the other. If the difference of the two vectors on the left is approximately equal to the difference of the two vectors on the right, then we know that the two difference vectors are approximately the same.

Now, we know that a vector has two components: a direction and a magnitude. So, when we say that these two vectors are approximately the same, what we are really saying is that their magnitude and direction are very close to one another. This can be visualized below.

<img src="https://drive.google.com/uc?id=1toGKPnF6d51GBxUB15w0wYI6PDyZc2oU" width="300">
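The geometric picture can be checked numerically. In this sketch the 2-D coordinates are hand-placed toy values (not taken from any trained model), chosen so that the two difference vectors nearly coincide.

```python
import numpy as np

# Toy 2-D word vectors, hand-placed for illustration only.
king, man = np.array([5.0, 4.0]), np.array([1.0, 3.5])
queen, woman = np.array([5.2, 1.0]), np.array([1.0, 0.5])

left = king - man      # difference vector on the left-hand side
right = queen - woman  # difference vector on the right-hand side

# Compare magnitudes (lengths) and direction (cosine of the angle between them).
magnitude_left = np.linalg.norm(left)
magnitude_right = np.linalg.norm(right)
cos_angle = left @ right / (magnitude_left * magnitude_right)

print(f"magnitudes: {magnitude_left:.3f} vs {magnitude_right:.3f}")
print(f"cosine of angle between them: {cos_angle:.5f}")
```

Similar lengths and a cosine close to 1 are what "approximately the same vector" means here.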

### 2.2 How to find analogies?
With that said, how do we actually find word analogies? We know that there are four words in every analogy. So, what we can do is take three of these words and try to find the fourth. We have an input of three words and an output of the fourth word. Notice that because we are dealing entirely with vectors, we can just rearrange our equation as follows:

$$\text{King - Man = ? - Woman}$$

$$\downarrow$$

$$\text{King - Man + Woman = ?}$$

We know that the $?$ represents `Queen`; however, we will refer to it as `SomeVector` for the time being.

$$\text{King - Man + Woman = SomeVector}$$

So, our job is to find the word that is most closely associated with `SomeVector`. We will do this utilizing _distance_. In pseudocode this may look like:

```
closest_distance = infinity
best_word = None
test_vector = king - man + woman
for word, vector in vocabulary:
    distance = get_distance(test_vector, vector)
    if distance < closest_distance:
        closest_distance = distance
        best_word = word
```

Note that utilizing a `for` loop will be very slow, and we will of course want to vectorize this process utilizing `numpy`. 
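A vectorized version of the search might look like the sketch below. The five 3-dimensional embeddings are hand-written toy values, and `find_analogy` is a hypothetical helper name. It uses cosine distance, one of the options for `get_distance` discussed in this section, and it also skips the three input words, since one of them is often the nearest neighbor of the target vector.

```python
import numpy as np

# Toy 3-dimensional embeddings, hand-written for illustration only.
words = ["king", "queen", "man", "woman", "apple"]
embeddings = np.array([
    [0.9,  0.8, 0.1],   # king
    [0.9, -0.8, 0.1],   # queen
    [0.1,  0.8, 0.0],   # man
    [0.1, -0.8, 0.0],   # woman
    [0.0,  0.0, 1.0],   # apple
])
word_to_index = {word: i for i, word in enumerate(words)}


def find_analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c, skipping a, b, c."""
    ia, ib, ic = (word_to_index[w] for w in (a, b, c))
    target = embeddings[ia] - embeddings[ib] + embeddings[ic]

    # Cosine distance from the target to every row at once (no Python loop).
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target)
    distances = 1.0 - (embeddings @ target) / norms

    # Mask the input words so they cannot be returned as the answer.
    distances[[ia, ib, ic]] = np.inf
    return words[int(np.argmin(distances))]


print(find_analogy("king", "man", "woman"))  # queen
```

The matrix-vector product `embeddings @ target` computes all $V$ dot products in one call, which is what replaces the `for` loop.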

Now, we did not define `get_distance` in the above pseudocode. We have a variety of options when deciding how to calculate distance. Sometimes, we will simply use the _Euclidean distance_:

$$\text{Euclidean Distance: } ||a - b|| = \sqrt{\sum_i (a_i - b_i)^2}$$

It is also common to use the _cosine distance_:

$$\text{Cosine Distance: } cosine\_distance(a, b) = 1 - \frac{a^Tb}{||a|| \; ||b||}$$

In this latter form only the angle between the two vectors matters, because:

$$a^Tb = ||a|| \; ||b|| \cos(\theta)$$

where $\theta$ is the angle between $a$ and $b$. Dividing by the two lengths leaves just $\cos(\theta)$, which takes the following values:

$$\cos(0^\circ) = 1, \; \cos(90^\circ) = 0, \; \cos(180^\circ) = -1$$

When two vectors are closer, $\cos(\theta)$ is bigger. So, we want our distance to be:

$$\text{Distance} = 1 - \cos(\theta)$$

In practice, the word vectors are often normalized so that their length is 1. At that point we can say that all of the word embeddings lie on the unit sphere.
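The two distance options can be sketched as small `numpy` functions. The function names and the sample vectors are assumptions for this example. The last check illustrates a standard identity: for unit-length vectors, the squared Euclidean distance equals $2(1 - \cos\theta)$, so both distances rank neighbors in the same order once the embeddings lie on the unit sphere.

```python
import numpy as np


def euclidean_distance(a, b):
    """Length of the difference vector a - b."""
    return np.linalg.norm(a - b)


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def normalize(v):
    """Rescale v to unit length."""
    return v / np.linalg.norm(v)


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine distance ignores length: scaling a vector does not change it.
print(np.isclose(cosine_distance(a, b), cosine_distance(10 * a, b)))

# On the unit sphere: ||a - b||^2 == 2 * (1 - cos(theta)).
a_unit, b_unit = normalize(a), normalize(b)
print(np.isclose(euclidean_distance(a_unit, b_unit) ** 2,
                 2 * cosine_distance(a_unit, b_unit)))
```

Both checks print `True` for these sample vectors.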

### 2.3 Why is this so cool?
One pretty interesting fact about neural word embedding algorithms is that _they can find these analogies at all_. Once we have covered these algorithms, specifically _**word2vec**_ and _**GloVe**_, it will become clear that they have no concept of analogies; in other words, what they optimize is totally unrelated to word analogies. So, the fact that word analogies suddenly emerge out of the training process is very intriguing. Remember, in all cases we are still dealing with a $V \times D$ word embedding matrix. This is the case whether we are using raw word counts, TF-IDF, or word2vec. Yet, raw word counts and TF-IDF do _not_ give us good analogies. So, the fact that good word analogies emerge from the model and training process of word2vec is a very interesting research area.