# Visualizing <font color= #13c113  >Word2Vec </font>  with <font color= #00c1FF  >t-SNE</font>

![Word Embeddings t-SNE](https://jp.mathworks.com/help/examples/textanalytics/win64/VisualizeWordEmbeddingsUsingTextScatterPlotsExample_05.png)


## Adapted from:

https://medium.com/@aneesha/using-tsne-to-plot-a-subset-of-similar-words-from-word2vec-bb8eeaea6229


<br>


# * [MSTC](http://mstc.ssr.upm.es/big-data-track) and MUIT: <font size=5 color='green'>Deep Learning with Tensorflow & Keras</font>


## 1. Loads a pre-trained word2vec embedding
## 2. Finds similar words and appends each similar word's embedding vector to a matrix
## 3. Applies t-SNE to the matrix to project each word into a 2D space (i.e. dimensionality reduction)
## 4. Plots the 2D position of each word with a label
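
Step 2 can be tried on toy data before the full model is downloaded. The snippet below is a minimal sketch, not the original author's code: `ToyVectors` is a hypothetical stand-in that mimics only `model[word]` and `model.similar_by_word(word, topn=...)`, and its random vectors are made up. With gensim, the loaded `KeyedVectors` object would be passed to `similar_word_matrix` instead.

```python
import numpy as np


class ToyVectors:
    """Hypothetical stand-in for a gensim KeyedVectors object (illustration only)."""

    def __init__(self, words, vectors):
        self.words = list(words)
        self.vectors = np.asarray(vectors, dtype=float)
        self.index = {w: i for i, w in enumerate(self.words)}

    def __getitem__(self, word):
        return self.vectors[self.index[word]]

    def similar_by_word(self, word, topn=10):
        # Rank all other words by cosine similarity to `word`
        unit = self.vectors / np.linalg.norm(self.vectors, axis=1, keepdims=True)
        sims = unit @ unit[self.index[word]]
        order = [i for i in np.argsort(-sims) if self.words[i] != word]
        return [(self.words[i], float(sims[i])) for i in order[:topn]]


def similar_word_matrix(model, query, topn=10):
    """Return the query plus its neighbours, and their vectors stacked row-wise."""
    labels = [query] + [w for w, _ in model.similar_by_word(query, topn=topn)]
    matrix = np.vstack([model[w] for w in labels])
    return labels, matrix


rng = np.random.default_rng(0)
toy = ToyVectors([f"w{i}" for i in range(20)], rng.normal(size=(20, 8)))
labels, matrix = similar_word_matrix(toy, "w0", topn=5)
print(labels, matrix.shape)
```

The resulting `matrix` (one row per word) is what steps 3 and 4 would then project with t-SNE and plot using `labels`.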





---
# <font color=#003950 >We can use the word2vec model pre-trained on the Google News corpus (about 100 billion words), which contains 300-dimensional vectors for 3 million English words and phrases.</font>
---



In [0]:
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

# [Gensim](https://radimrehurek.com/gensim/) is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible
![gensim](https://radimrehurek.com/gensim/_static/images/gensim.png)

In [0]:
! pip install gensim

In [0]:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
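
The full model holds 3 million 300-dimensional vectors and needs several gigabytes of RAM once loaded. If memory is tight, one option is the `limit` argument of gensim's `load_word2vec_format`, which reads only the first vectors in the file. The sketch below assumes the downloaded file is in the working directory; `load_vectors` and the cutoff of 200,000 are illustrative choices, not part of the original notebook.

```python
import os


def load_vectors(path, limit=200_000):
    """Load only the first `limit` vectors from a word2vec binary file."""
    # Imported here so the sketch can be defined even before gensim is installed
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


path = 'GoogleNews-vectors-negative300.bin.gz'
if os.path.exists(path):
    small_model = load_vectors(path)
    print(len(small_model.index_to_key))
else:
    print('Model file not found; run the download cell first.')
```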


---
# <font color=#003950 >Test and visualize the  word2vec model.</font>
---


In [0]:
# Look up the 300-dimensional embedding vector of a single word
vector_of_word = model['computer']

In [0]:
vector_of_word.shape

In [0]:
# Test the loaded word2vec model in gensim
# We will need the raw vector for a word
print(model['computer']) 

## <font color=#003950 >We can explore the most similar (closest) words</font>
---


In [0]:

# We can get the words closest to a word
model.similar_by_word('computer')
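
`similar_by_word` ranks words by the cosine similarity between their vectors. As a minimal sketch of that measure (the helper name `cosine_similarity` and the toy vectors are illustrative, not gensim internals):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

With the loaded model, `cosine_similarity(model['computer'], model['laptop'])` should agree with `model.similarity('computer', 'laptop')` up to rounding.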

## <font color=#003950 >Now let's visualize a LIMITED number of tokens using t-SNE</font>
---


In [0]:
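# gensim >= 4.0 removed `model.vocab`; `key_to_index` maps each word to its row index
type(model.key_to_index)

In [0]:
len(model.key_to_index)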

In [0]:
import numpy as np

# Limit the number of tokens to be visualized
limit = 500
vector_dim = 300

# Take the first `limit` tokens; gensim >= 4.0 exposes the vocabulary as
# `index_to_key` (on gensim < 4.0, use `list(model.vocab)[:limit]` instead)
words = list(model.index_to_key[:limit])

# Stack the word vectors into a (limit, vector_dim) matrix
embedding = np.array([model[word] for word in words])
assert embedding.shape == (limit, vector_dim)

In [0]:
embedding.shape

In [0]:
len(words)

In [0]:
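from sklearn.manifold import TSNE

# Creating the t-SNE projection [Warning: this will take time]
# Note: scikit-learn 1.5 renamed `n_iter` to `max_iter`; use that name on newer versions
tsne = TSNE(perplexity=30.0, n_components=2, init='pca', n_iter=5000)

low_dim_embedding = tsne.fit_transform(embedding)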
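
Because the full run is slow, it can help to try the call on a small synthetic matrix first. The sketch below uses random data (an assumption for illustration, so the layout itself is meaningless) and only confirms that 2D coordinates come back, one row per input vector. Recent scikit-learn versions also require `perplexity` to be smaller than the number of samples, which matters when plotting only a handful of words.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(60, 20))  # 60 fake "words", 20 dimensions each

# A small perplexity suits the small sample; random_state makes the run repeatable
toy_tsne = TSNE(n_components=2, perplexity=10.0, init='pca', random_state=0)
toy_low_dim = toy_tsne.fit_transform(toy_embedding)
print(toy_low_dim.shape)
```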


In [0]:
# Pick one token by index and look up its word
i = 300
label = words[i]

print('Word: ', label)

In [0]:
x, y = low_dim_embedding[i, :]

print('X: ',x,' Y: ',y)

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))  # in inches

x, y = low_dim_embedding[i, :]

plt.scatter(x, y)

plt.annotate(label,
             xy=(x, y),
             xytext=(5, 2),
             textcoords='offset points',
             ha='right',
             va='bottom')

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt


def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(24, 24))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)


In [0]:
# Finally plotting and saving the fig 
plot_with_labels(low_dim_embedding, words)


---

---


## <font color=#003950 >...you could also try other corpora (such as Wikipedia) and other NLP libraries...</font>
---


https://gist.github.com/manashmndl/bd75db5b8eb6f709b7a4c978027cfcd6

---



Pre-trained fastText word vectors: https://fasttext.cc/docs/en/pretrained-vectors.html

Spanish vectors: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.es.vec

The cell below downloads the English vectors:


In [0]:
! wget -c "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec"
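
The fastText `.vec` files are plain text in the word2vec text format: a header line with the vocabulary size and the vector dimension, then one word per line followed by its components. The sketch below writes a tiny made-up file and parses it with a hypothetical helper, `read_vec`, just to show the layout. For the real files, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` reads this format directly.

```python
import os
import tempfile

import numpy as np


def read_vec(path, limit=None):
    """Parse a word2vec-style text file into (words, matrix). Illustrative helper."""
    words, rows = [], []
    with open(path, encoding='utf-8') as f:
        # Header line: "<vocabulary size> <vector dimension>"
        _count, dim = map(int, f.readline().split())
        for line in f:
            if limit is not None and len(words) >= limit:
                break
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            rows.append([float(v) for v in parts[1:dim + 1]])
    return words, np.array(rows, dtype=np.float32)


# Made-up three-word file, only to demonstrate the layout
sample = "3 4\nhola 0.1 0.2 0.3 0.4\nmundo 0.5 0.6 0.7 0.8\ncasa 0.9 1.0 1.1 1.2\n"
with tempfile.NamedTemporaryFile('w', suffix='.vec', delete=False, encoding='utf-8') as tmp:
    tmp.write(sample)

vec_words, vec_matrix = read_vec(tmp.name)
os.remove(tmp.name)
print(vec_words, vec_matrix.shape)
```

The `limit` argument mirrors the 500-token cap used for the plot above, so only the first lines of a large file need to be read.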