We can use learned word vectors to translate words from one language to another. Assume that we want to translate English words to French and already know a subset of the translations. Then, we can learn a *rotation matrix* $R$ that aligns the English and French word vector spaces by minimizing the distance between the vectors of words with the same meaning. We can formulate the problem as:

$$
\DeclareMathOperator*{\argmin}{argmin}
XR \approx Y \\
\argmin_{R} ||Y - XR||_F
$$

where $X$ is an English word vector matrix with one word vector in each row and $Y$ is its French counterpart. Note that each known translation pair must occupy the same row in $X$ and $Y$ for a correct alignment. Here, the subscript $F$ denotes the Frobenius norm, which is computed as $||A||_{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|A_{i j}\right|^{2}}$. To find the optimal $R$, we use gradient descent and minimize the squared loss $\frac{1}{m}||XR - Y||_F^2$, where $m$ is the number of rows of $X$, because squaring removes the square root and simplifies the gradient. The gradient of the squared loss is:

$$
g = \frac{d}{dR}\text{Loss} = \frac{2}{m}\left(X^T(XR - Y)\right)
$$
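The update rule can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the learning rate, the step count, and the random initialization are all assumptions chosen for the example. Note also that plain gradient descent does not constrain $R$ to be a strict rotation; it only minimizes the loss.

```python
import numpy as np

def compute_loss(X, Y, R):
    """Squared Frobenius loss, averaged over the m rows of X."""
    m = X.shape[0]
    diff = X @ R - Y
    return np.sum(diff ** 2) / m

def compute_gradient(X, Y, R):
    """Gradient of the squared loss with respect to R: (2/m) X^T (XR - Y)."""
    m = X.shape[0]
    return (2.0 / m) * (X.T @ (X @ R - Y))

def align_embeddings(X, Y, steps=500, learning_rate=0.1, seed=0):
    """Learn R by gradient descent (hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], Y.shape[1]))  # random starting point
    for _ in range(steps):
        R -= learning_rate * compute_gradient(X, Y, R)
    return R
```

With real embeddings, the learning rate and the number of steps would need tuning, and the loss should be monitored to confirm that it decreases.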


**Remark:** This is a supervised learning approach where we know only a subset of the translations in the dictionary and predict the unknown translations based on the known ones. To translate an English word, we multiply the English word vector with $R$ and find the nearest vector in the French word space.
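The prediction step might look like the sketch below. It assumes cosine similarity as the notion of "nearest", and the names `nearest_neighbor`, `translate`, `french_vectors`, and `french_words` are hypothetical, introduced only for illustration.

```python
import numpy as np

def nearest_neighbor(query, candidates):
    """Return the row index of the candidate most similar to the query.

    Similarity is cosine similarity (an assumption for this sketch).
    """
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(C @ q))

def translate(english_vector, R, french_vectors, french_words):
    """Project an English vector with R, then look up the closest French word."""
    projected = english_vector @ R
    return french_words[nearest_neighbor(projected, french_vectors)]
```

This brute-force lookup compares the projected vector against every French vector, which is the cost that hashing aims to reduce.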

Hashing can be used to speed up the nearest neighbor search during prediction. In hashing, we map each entity to a bucket (a number) with a rule called a *hash function*. Thus, hashing divides the word vectors into several buckets, and with a well-designed hash function we need to search for the nearest neighbor only in a single bucket instead of the whole space.

**Locality Sensitive Hashing (LSH)** is a type of hashing where close vectors are mapped to the same or nearby buckets. In other words, their locations in the original space are reflected in the hashes as well. For LSH, we divide the space with planes and, for each vector, encode which side of each plane the vector lies on. We use the sign of the dot product between the vector and the normal vector of the plane to find the side. In turn, we compute the hash value as

$$
\text{hash} = \sum_{i=0}^{H-1} 2^i \times h_i
$$

where $H$ is the number of planes (that is, $\log_2$ of the number of buckets, since $H$ planes yield at most $2^H$ buckets) and $h_i$ is $1$ if the dot product between the vector and the normal vector of the $i^{th}$ plane is non-negative and $0$ otherwise.
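As a small sketch of this hash function, assuming the planes are indexed from $0$ and stored as the rows of a matrix called `plane_normals` (a name chosen for this example):

```python
import numpy as np

def hash_value(vector, plane_normals):
    """Compute the LSH bucket number of a vector.

    plane_normals has shape (H, dim): one normal vector per plane.
    """
    dots = plane_normals @ vector           # one dot product per plane
    h = (dots >= 0).astype(int)             # 1 if non-negative, else 0
    powers = 2 ** np.arange(len(h))         # 2^0, 2^1, ..., 2^(H-1)
    return int(np.sum(powers * h))
```

For example, with the two axis-aligned planes `[[1, 0], [0, 1]]`, the vector `[1, 1]` lies on the non-negative side of both, so its hash is $2^0 + 2^1 = 3$, while `[-1, 1]` gets $2^1 = 2$.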

**Remark:** With this hash function, the vectors are mapped to the same hash value if and only if they lie in the same region, preserving the locality.

Using LSH, we can implement approximate nearest neighbor search by dividing the space with random planes multiple times. Each time, we collect the vectors that fall in the same region as the word vector under consideration, and then combine the results to approximate K-NN in a much faster way.
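One way to sketch this procedure is shown below. The function names, the use of Gaussian random normals for the planes, and cosine similarity for the final ranking are assumptions made for illustration, not details given in these notes.

```python
import numpy as np

def build_tables(vectors, n_planes, n_tables, seed=0):
    """Hash all vectors once per set of random planes."""
    rng = np.random.default_rng(seed)
    powers = 2 ** np.arange(n_planes)
    tables = []
    for _ in range(n_tables):
        planes = rng.normal(size=(n_planes, vectors.shape[1]))
        hashes = ((vectors @ planes.T) >= 0).astype(int) @ powers
        buckets = {}
        for idx, h in enumerate(hashes):
            buckets.setdefault(int(h), []).append(idx)
        tables.append((planes, buckets))
    return tables

def approximate_knn(query, vectors, tables, k=1):
    """Rank only the vectors that share a bucket with the query in some table."""
    candidates = set()
    for planes, buckets in tables:
        powers = 2 ** np.arange(planes.shape[0])
        h = int(((planes @ query) >= 0).astype(int) @ powers)
        candidates.update(buckets.get(h, []))
    if not candidates:
        return []
    cand = np.array(sorted(candidates))
    norms = np.linalg.norm(vectors[cand], axis=1) * np.linalg.norm(query)
    sims = (vectors[cand] @ query) / norms   # cosine similarity to the query
    order = np.argsort(-sims)[:k]            # most similar first
    return [int(i) for i in cand[order]]
```

The result can contain fewer than `k` indices when the combined buckets hold fewer than `k` candidates, which is one way the approximation shows up in practice.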

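**Remark:** Approximate nearest neighbor search is a good example of the trade-off between the precision and the speed of an algorithm. The more times you divide the space with random planes, the more candidates you compare against, so the search becomes more precise but slower. The number of planes per division matters as well: more planes produce smaller regions, which makes each lookup faster but more likely to miss a true neighbor.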

**Author's Note:** LSH and approximate nearest neighbor search are an efficient way to search large vector spaces. For smaller spaces, we can normalize the vectors to unit length and construct two matrices $A$ and $B$ such that $A$ contains the search vectors in its *rows* and $B$ contains the vectors of the search space in its *columns*. With this formulation, $AB$ contains the cosine similarity between each search vector and each vector in the space. Then, we can sort each row of $AB$ to pick as many nearest neighbors as we desire.

This is an exact approach and considerably fast since matrix operations are performed very efficiently. Yet, it can be slower and more memory-consuming than LSH with large vector spaces.
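A minimal NumPy sketch of this exact search, with `exact_knn`, `queries`, and `space` as illustrative names, could be:

```python
import numpy as np

def exact_knn(queries, space, k=1):
    """Exact cosine-similarity search with a single matrix product.

    queries: (q, dim) search vectors, one per row.
    space:   (n, dim) vectors to search, one per row.
    """
    # A holds unit-length search vectors in its rows.
    A = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    # B holds unit-length space vectors in its columns.
    B = (space / np.linalg.norm(space, axis=1, keepdims=True)).T
    sims = A @ B                              # (q, n) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :k]    # k most similar per query
    return top, sims
```

The full similarity matrix has one entry per pair of search vector and space vector, which is where the memory cost mentioned above comes from.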