# Class 12.2: Sentence Embeddings

### Review: Installing gensim and its dependencies and launching a Jupyter notebook

``python3 -m pip install numpy``

``python3 -m pip install scipy``

``python3 -m pip install gensim``

``python3 -m pip install scikit-learn``

``jupyter notebook``


### Review: Getting a pre-trained word2vec model

You can get a pre-trained word2vec model built on billions of words of Google News text from here:

https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz

Just click on the "download" icon next to where it says "Raw".


## Importing some libraries

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import gensim
import re
import nltk
from sklearn.decomposition import PCA
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine



## Loading and using the pre-trained word2vec model

<b>Note: When you run the code below, it might take a minute to load the model!</b> Wait until you see <code>big model loaded!</code> printed out below the cell. While the cell is still running, the brackets to its left show <code>*</code>.

In [None]:
bigmodel = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin.gz", binary=True)
print("big model loaded!")

Let's try building sentence embeddings by summing the word embeddings of each sentence. Here's a little function that will do that for us, skipping any word that is not in the model's vocabulary:

In [None]:
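def get_sent_embed(sentence):
    """Return the sum of the word2vec vectors of the words in `sentence`."""
    sentembed = np.zeros(300)  # the model's vectors have 300 dimensions
    for w in sentence.split():  # split on whitespace; lookups are case-sensitive
        if w in bigmodel:  # skip words that are not in the vocabulary
            sentembed += bigmodel[w]
    return sentembed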
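
To see what this function does without loading the full model, here is a minimal sketch on a made-up vocabulary. The 3-dimensional vectors in `toy_model` and the name `toy_sent_embed` are hypothetical and exist only for illustration; the real model uses 300 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", made up for this example
toy_model = {
    "dogs": np.array([1.0, 0.0, 2.0]),
    "have": np.array([0.5, 0.5, 0.0]),
    "fur": np.array([0.0, 1.0, 1.0]),
}

def toy_sent_embed(sentence, model, dim=3):
    """Sum the vectors of the words in `sentence`, skipping unknown words."""
    sentembed = np.zeros(dim)
    for w in sentence.split():
        if w in model:
            sentembed += model[w]
    return sentembed

# "purple" is not in toy_model, so it contributes nothing to the sum
toy_vec = toy_sent_embed("dogs have purple fur", toy_model)
print(toy_vec)
```

The result is the element-wise sum (1.5, 1.5, 3.0). A sentence made up only of unknown words comes back as the all-zeros vector.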

In [None]:
s1 = "Dogs have fur and floppy ears"
s2 = "Cats are fluffy and have long tails"
s3 = "Computer science is fun and easy"
s4 = "Programming is an important skill"
s5 = "Click here with the mouse"

sent1 = get_sent_embed(s1)
sent2 = get_sent_embed(s2)
sent3 = get_sent_embed(s3)
sent4 = get_sent_embed(s4)
sent5 = get_sent_embed(s5)

In [None]:
print(sent1)

In [None]:
allsentences = [s1, s2, s3, s4, s5]
allw2v = [sent1, sent2, sent3, sent4, sent5]
for i in range(len(allsentences)):
    for j in range(len(allsentences)):
        print(allsentences[i] + " VS. " + allsentences[j], end=" ||| ")
        # scipy's cosine() returns a distance, so 1 - distance is the similarity
        print(f'{1-cosine(allw2v[i], allw2v[j]):.3f}')
    print()
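
The `1 - cosine(...)` expression above turns SciPy's cosine distance into a cosine similarity. As a rough sketch of what that number measures, the hypothetical helper `cos_sim` below computes it directly with NumPy on two made-up vectors. It also shows that the value depends only on direction, so dividing each summed vector by its sentence length (averaging instead of summing) would not change the result.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 0.0])

sim_sum = cos_sim(a, b)
# Rescaling either vector (e.g. dividing a sum by the number of words)
# leaves the cosine similarity unchanged
sim_mean = cos_sim(a / 6, b / 7)
print(f"{sim_sum:.3f} {sim_mean:.3f}")
```

Both values come out as 0.800. One caveat under these assumptions: an all-zeros vector (a sentence with no known words) has length 0, so the division is undefined for it.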

### Visualizing sentence vectors

The cell below uses PCA to project our 300-dimensional sentence embeddings down to 2D and then plots them, each labeled with a keyword from its sentence.

In [None]:
vecs = [sent1, sent2, sent3, sent4, sent5]
vecwords = ["dog", "cat", "CS", "programming", "mouse"]

    
# Use PCA to reduce the vectors to 2 dimensions
pca = PCA(n_components=2, whiten=True)
vectors2d = pca.fit_transform(vecs)

# Plot each point in red and label it with its keyword
for point, word in zip(vectors2d, vecwords):
    plt.scatter(point[0], point[1], c='r')
    plt.annotate(
        word,
        xy=(point[0], point[1]),
        xytext=(7, 6),
        textcoords='offset points',
        ha='left',
        va='top',
        size="medium",
    )
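
As a rough sketch of what the PCA step in the cell above is doing, the code below projects some made-up vectors onto their top two principal directions using NumPy's SVD. This is a stand-in for illustration, not scikit-learn's implementation: it skips the `whiten=True` rescaling, and the name `project_2d` and the random data are assumptions.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their top two principal directions."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by variance explained
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
fake_vecs = rng.normal(size=(5, 300))  # 5 made-up "sentence vectors"
points = project_2d(fake_vecs)
print(points.shape)  # (5, 2)
```

With `whiten=True`, scikit-learn additionally rescales each output axis to unit variance, so distances in the plot should be read as a coarse picture rather than as the cosine similarities printed earlier.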

### Other kinds of sentence vectors: Sentence BERT (S-BERT)

Summing word vectors ignores word order. Sentence-BERT models instead encode the whole sentence at once. The cell below needs the <code>sentence-transformers</code> package (``python3 -m pip install sentence-transformers``), and the <code>all-MiniLM-L6-v2</code> model is downloaded the first time the cell runs.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

em1 = model.encode(s1, convert_to_tensor=True)
em2 = model.encode(s2, convert_to_tensor=True)
em3 = model.encode(s3, convert_to_tensor=True)
em4 = model.encode(s4, convert_to_tensor=True)
em5 = model.encode(s5, convert_to_tensor=True)


In [None]:
allsentences = [s1, s2, s3, s4, s5]
allsbert = [em1, em2, em3, em4, em5]
allw2v = [sent1, sent2, sent3, sent4, sent5]
# Each row prints: sentence pair ||| S-BERT similarity ||| summed-word2vec similarity
for i in range(len(allsentences)):
    for j in range(len(allsentences)):
        print(allsentences[i] + " VS. " + allsentences[j], end=" ||| ")
        print(f'{float(util.cos_sim(allsbert[i], allsbert[j])):.3f}', end=" ||| ")
        print(f'{1-cosine(allw2v[i], allw2v[j]):.3f}')
    print()
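
The nested loops above compare the sentences one pair at a time. As an alternative sketch, all pairwise cosine similarities can be computed at once by scaling every vector to length 1 and taking a matrix product. The name `cos_sim_matrix` and the toy vectors are made up for illustration; S-BERT tensors would need to be converted to NumPy arrays first.

```python
import numpy as np

def cos_sim_matrix(vectors):
    """Return the matrix of cosine similarities between all pairs of rows."""
    X = np.asarray(vectors, dtype=float)
    # Scale each row to unit length; dot products of unit vectors are cosines
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T

# Made-up 2-dimensional vectors for illustration
toy_vecs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cos_sim_matrix(toy_vecs)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between vectors `i` and `j`: the diagonal is all ones, the first two vectors are perpendicular (similarity 0), and the third sits at 45 degrees to both (similarity about 0.707).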

In [None]:
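# Convert the torch tensors to NumPy arrays before passing them to scikit-learn
vecs = [em.detach().cpu().numpy() for em in (em1, em2, em3, em4, em5)]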
vecwords = ["dog", "cat", "CS", "programming", "mouse"]

    
# Use PCA to reduce the vectors to 2 dimensions
pca = PCA(n_components=2, whiten=True)
vectors2d = pca.fit_transform(vecs)

# Plot each point in red and label it with its keyword
for point, word in zip(vectors2d, vecwords):
    plt.scatter(point[0], point[1], c='r')
    plt.annotate(
        word,
        xy=(point[0], point[1]),
        xytext=(7, 6),
        textcoords='offset points',
        ha='left',
        va='top',
        size="medium",
    )