# Gensim Word Vectors


In [None]:
import numpy as np

# Enable interactive Matplotlib plots in the notebook
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

## What is Gensim
To look at word vectors, we'll use Gensim.
Gensim isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used.

## Stanford GloVe model
Our homegrown Stanford offering is GloVe word vectors. Gensim doesn't give them first-class support, but it lets you convert a file of GloVe vectors into word2vec format. You can download the GloVe vectors from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip).



In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2022-08-05 09:18:43--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-08-05 09:18:44--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-08-05 09:21:27 (5.07 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
glove_file = "./glove.6B.300d.txt"
word2vec_glove_file = "glove.6B.300d.word2vec.txt"
glove2word2vec(glove_file, word2vec_glove_file)

(400000, 300)
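
The tuple printed above is the vocabulary size and the vector dimension. As a rough sketch of what the conversion amounts to: the word2vec text format is the GloVe text format with one extra header line, `<vocab_size> <dimension>`. The helper name `add_word2vec_header` and the toy lines below are made up for illustration; this is not Gensim's implementation.

```python
def add_word2vec_header(glove_lines):
    """Return word2vec-style text lines for GloVe-style lines (illustrative helper)."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vocab_size = len(rows)
    # Each row is "<word> <v1> <v2> ...", so the dimension is the field count minus one.
    dimension = len(rows[0].split(" ")) - 1
    return [f"{vocab_size} {dimension}"] + rows

# Toy 3-dimensional "GloVe" lines (made-up numbers, not real vectors).
toy_glove = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
converted = add_word2vec_header(toy_glove)
print(converted[0])
```

For the real file, the header would carry the same two numbers that `glove2word2vec` reported.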

In [None]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

In [None]:
model.most_similar("arabic")

[('hebrew', 0.6557009220123291),
 ('language', 0.6296008825302124),
 ('urdu', 0.5955149531364441),
 ('languages', 0.5669240355491638),
 ('hindi', 0.5659825801849365),
 ('farsi', 0.5628933906555176),
 ('persian', 0.5627796649932861),
 ('english', 0.5513119697570801),
 ('word', 0.5412472486495972),
 ('pashto', 0.5407745838165283)]
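
The scores returned by `most_similar` are cosine similarities between the query vector and the other vectors in the vocabulary. A minimal NumPy sketch of that ranking follows, using a made-up four-word vocabulary and a hypothetical helper `nearest`; the numbers are not real GloVe values.

```python
import numpy as np

# Toy vocabulary with made-up 3-d vectors (not real GloVe values).
words = ["arabic", "hebrew", "urdu", "banana"]
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.8, 0.5, 0.3],
    [0.0, 0.1, 1.0],
])

def nearest(query, topn=2):
    """Rank all other words by cosine similarity to `query` (illustrative helper)."""
    # Normalize every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)  # indices sorted from most to least similar
    return [(words[i], float(sims[i])) for i in order if words[i] != query][:topn]

neighbours = nearest("arabic")
print(neighbours)
```

With these toy values, 'hebrew' ranks ahead of 'urdu', and the query word itself is left out of the result, as in the output above.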

We define functions to write the word vectors to files in order to visualize them with the [Embedding Projector](http://projector.tensorflow.org).

In [None]:
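from itertools import islice

def vocab_words(model):
    # Gensim 4 exposes the ordered vocabulary as `index_to_key`;
    # Gensim 3 calls it `index2word` (and `model.vocab` no longer exists in Gensim 4).
    words = getattr(model, "index_to_key", None)
    return words if words is not None else model.index2word

def save_words(model, max_vocab_size=10000):
    # Write one word per line: the label (metadata) file for the projector.
    with open("vocab.txt", "w", encoding="utf-8") as vocab_file:
        for word in islice(vocab_words(model), max_vocab_size):
            vocab_file.write(word + "\n")

def save_vectors(model, max_vocab_size=10000):
    # Write one tab-separated vector per line, in the same order as vocab.txt.
    with open("vectors.txt", "w") as vectors_file:
        for word in islice(vocab_words(model), max_vocab_size):
            vector = model.get_vector(word)
            vectors_file.write("\t".join(str(weight) for weight in vector) + "\n")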

In [None]:
save_words(model)

In [None]:
save_vectors(model)
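
The projector pairs the vectors file with the label file row by row, so the two files need the same number of lines, and every vector needs the same number of columns. Below is a small sanity check we could run before uploading. The helper `check_projector_files` is hypothetical, and the demo uses two tiny made-up files in a temporary directory rather than the real `vectors.txt` and `vocab.txt`.

```python
import os
import tempfile

def check_projector_files(vectors_path, vocab_path):
    """Return (rows, dimension) if the two files line up; raise ValueError otherwise."""
    with open(vocab_path, encoding="utf-8") as f:
        labels = f.read().splitlines()
    with open(vectors_path, encoding="utf-8") as f:
        # Tolerate a trailing tab at the end of a row.
        rows = [line.rstrip("\t\n").split("\t") for line in f if line.strip()]
    if len(rows) != len(labels):
        raise ValueError(f"{len(rows)} vectors but {len(labels)} labels")
    dims = {len(row) for row in rows}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(dims)}")
    return len(rows), dims.pop()

# Demo on two tiny made-up files written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
toy_vectors = os.path.join(tmp_dir, "vectors.txt")
toy_vocab = os.path.join(tmp_dir, "vocab.txt")
with open(toy_vectors, "w", encoding="utf-8") as f:
    f.write("0.1\t0.2\t0.3\n0.4\t0.5\t0.6\n")
with open(toy_vocab, "w", encoding="utf-8") as f:
    f.write("the\ncat\n")

shape = check_projector_files(toy_vectors, toy_vocab)
print(shape)
```

On the files written above, the same call would be `check_projector_files("vectors.txt", "vocab.txt")`.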

In [39]:
# We can use word embeddings to solve analogies: w1 is to w2 as y is to ?
def analogy(w1, w2, y):
    return model.most_similar(positive=[y, w2], negative=[w1])[0][0]

In [40]:
analogy('japan', 'japanese', 'australia')

'australian'
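
The analogy works by vector arithmetic: `most_similar(positive=[y, w2], negative=[w1])` looks for the word whose vector is closest to roughly `y + w2 - w1`, skipping the input words. The sketch below reproduces that idea on made-up 2-d vectors. The vocabulary, the values, and the name `toy_analogy` are illustrative assumptions, and Gensim's own normalization details may differ.

```python
import numpy as np

# Made-up 2-d vectors: axis 0 ~ "which country", axis 1 ~ "country vs. demonym".
toy = {
    "japan":      np.array([1.0, 0.0]),
    "japanese":   np.array([1.0, 1.0]),
    "australia":  np.array([-1.0, 0.0]),
    "australian": np.array([-1.0, 1.0]),
    "banana":     np.array([0.0, -1.0]),
}

def unit(v):
    # Scale a vector to unit length.
    return v / np.linalg.norm(v)

def toy_analogy(w1, w2, y):
    """w1 is to w2 as y is to ? (illustrative helper)."""
    # Add the "positive" words, subtract the "negative" one, then renormalize.
    target = unit(unit(toy[y]) + unit(toy[w2]) - unit(toy[w1]))
    # Exclude the input words, then pick the candidate with the highest cosine.
    candidates = [w for w in toy if w not in (w1, w2, y)]
    return max(candidates, key=lambda w: float(unit(toy[w]) @ target))

answer = toy_analogy("japan", "japanese", "australia")
print(answer)
```

Excluding the input words matters: without that step, the nearest vector to the target is often one of the query words themselves.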