# Getting Started With Text Embeddings

#### Project environment setup

- Load credentials and the relevant Python libraries.
- If you were running this notebook locally, you would first install the Vertex AI SDK (the `google-cloud-aiplatform` package). In this classroom, it is already installed.
```Python
!pip install google-cloud-aiplatform
```

In [None]:
from utils import authenticate
credentials, PROJECT_ID = authenticate()
import vertexai
from vertexai.language_models import TextEmbeddingModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#### Enter Project Details

In [None]:
print(PROJECT_ID)

In [None]:
REGION = 'us-central1'

In [None]:
# Initialize the Vertex AI Python SDK (imported above)
vertexai.init(project = PROJECT_ID, 
              location = REGION, 
              credentials = credentials)

#### Use The Embeddings Model
- Import and load the model.

In [None]:
embedding_model = TextEmbeddingModel.from_pretrained(
    "textembedding-gecko@001")

- Generate a Word Embedding

In [None]:
embedding = embedding_model.get_embeddings(
    ["life"])

- The returned object is a list containing a single `TextEmbedding` object.
- The `TextEmbedding.values` field stores the embedding vector as a Python list of floats.

In [None]:
vector = embedding[0].values
print(f"Length = {len(vector)}")
print(vector[:10])

- Generate a Sentence Embedding.

In [None]:
embedding = embedding_model.get_embeddings(
    ["What is the meaning of life?"])

In [None]:
vector = embedding[0].values
print(f"Length = {len(vector)}")
print(vector[:10])

#### Similarity

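- Calculate the cosine similarity between two sentence embeddings. In general the score ranges from -1 to 1, with higher values meaning more similar; for these sentence embeddings the values typically fall between 0 and 1.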
- Try out your own sentences and check if the similarity calculations match your intuition.

In [None]:
emb_1 = embedding_model.get_embeddings(
    ["What is the meaning of life?"]) # 42!

emb_2 = embedding_model.get_embeddings(
    ["How does one spend their time well on Earth?"])

emb_3 = embedding_model.get_embeddings(
    ["Would you like a salad?"])

vec_1 = [emb_1[0].values]
vec_2 = [emb_2[0].values]
vec_3 = [emb_3[0].values]

- Note: each embedding (a Python list) is wrapped in another list because the `cosine_similarity` function expects a 2D input: either a 2D numpy array or a list of lists.
```Python
vec_1 = [emb_1[0].values]
```
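
To make the shape requirement concrete, here is a minimal sketch using a made-up three-element list as a stand-in for `emb_1[0].values` (the numbers are not real embeddings). Wrapping the list adds a leading dimension of size 1, so the input becomes one row of a 2D array. The result of `cosine_similarity` is likewise a 2D array, which is why each printed score appears inside nested brackets.

```python
import numpy as np

# Hypothetical stand-in for emb_1[0].values (real embeddings have 768 values)
values = [0.1, 0.2, 0.3]

flat = np.array(values)        # 1D: a bare vector
wrapped = np.array([values])   # 2D: one row holding that vector

print(flat.shape)     # (3,)
print(wrapped.shape)  # (1, 3)
```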

In [None]:
print(cosine_similarity(vec_1,vec_2)) 
print(cosine_similarity(vec_2,vec_3))
print(cosine_similarity(vec_1,vec_3))
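
For intuition about what these scores measure: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand with numpy on made-up 3-dimensional vectors (not real embeddings); `cosine_sim` is a hypothetical helper name, not part of any library used in this notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, chosen to show the extremes
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]     # same direction as toy_a
toy_c = [-1.0, -2.0, -3.0]  # opposite direction
toy_d = [3.0, 0.0, -1.0]    # perpendicular to toy_a (dot product is 0)

print(cosine_sim(toy_a, toy_b))  # close to 1
print(cosine_sim(toy_a, toy_c))  # close to -1
print(cosine_sim(toy_a, toy_d))  # 0
```

Scaling a vector does not change the score, which is why `toy_b` (twice `toy_a`) still scores as a near-perfect match.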

#### From Word to Sentence Embeddings
- One possible way to calculate sentence embeddings from word embeddings is to take the average of the word embeddings.
- This ignores word order and context, so two sentences with different meanings but the same set of words end up with the same sentence embedding.

In [None]:
in_1 = "The kids play in the park."
in_2 = "The play was for kids in the park."

- Remove stop words like ["the", "in", "for", "an", "is"] and punctuation.

In [None]:
in_pp_1 = ["kids", "play", "park"]
in_pp_2 = ["play", "kids", "park"]
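
The word lists above are written out by hand. As a rough sketch, the same lists could be produced programmatically. The `STOP_WORDS` set and the `preprocess` helper below are hypothetical; the set extends the examples in the text with "was" so that the second sentence reduces to the same three words.

```python
import string

# Assumed stop-word set: the examples from the text, plus "was"
STOP_WORDS = {"the", "in", "for", "an", "is", "was"}

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.lower().split() if word not in STOP_WORDS]

print(preprocess("The kids play in the park."))
print(preprocess("The play was for kids in the park."))
```

A real pipeline would more likely use a curated stop-word list from an NLP library; the tiny set here is only enough for these two sentences.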

- Generate one embedding for each word. The result is a list of three lists, one per word.

In [None]:
embeddings_1 = [emb.values for emb in embedding_model.get_embeddings(in_pp_1)]

- Use numpy to convert this list of lists into a 2D array of 3 rows and 768 columns.

In [None]:
emb_array_1 = np.stack(embeddings_1)
print(emb_array_1.shape)

In [None]:
embeddings_2 = [emb.values for emb in embedding_model.get_embeddings(in_pp_2)]
emb_array_2 = np.stack(embeddings_2)
print(emb_array_2.shape)

- Take the average across the 3 word embeddings.
- You'll get a single embedding of length 768.

In [None]:
emb_1_mean = emb_array_1.mean(axis = 0) 
print(emb_1_mean.shape)
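
The choice of `axis = 0` matters: it averages down the rows (across words), leaving one value per column. A minimal sketch on a made-up 3 x 2 array, standing in for the 3 x 768 array above, shows the difference from `axis = 1`.

```python
import numpy as np

# Made-up stand-in: 3 "words", each with a 2-value "embedding"
toy = np.stack([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(toy.shape)            # (3, 2)

across_words = toy.mean(axis=0)   # one value per column
within_word = toy.mean(axis=1)    # one value per row

print(across_words)         # [3. 4.]
print(within_word)          # [1.5 3.5 5.5]
```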

In [None]:
emb_2_mean = emb_array_2.mean(axis = 0)

- Verify that averaging the word embeddings produces identical sentence embeddings for the two inputs.

In [None]:
print(emb_1_mean[:4])
print(emb_2_mean[:4])
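
Printing the first four values is only a spot check. A sketch of comparing the full vectors with `np.allclose` follows, using made-up 4-dimensional word vectors in place of the model's embeddings (the `toy_words` values are invented for illustration). It also shows why the result holds: a mean does not depend on the order of the rows.

```python
import numpy as np

# Invented toy vectors, one per word (not real embeddings)
toy_words = {
    "kids": [0.1, 0.3, -0.2, 0.5],
    "play": [0.4, -0.1, 0.0, 0.2],
    "park": [-0.3, 0.2, 0.6, 0.1],
}

order_1 = ["kids", "play", "park"]
order_2 = ["play", "kids", "park"]

mean_1 = np.stack([toy_words[w] for w in order_1]).mean(axis=0)
mean_2 = np.stack([toy_words[w] for w in order_2]).mean(axis=0)

# Compare every element, allowing for tiny floating-point differences
print(np.allclose(mean_1, mean_2))
```

In the notebook itself, the same check would be `np.allclose(emb_1_mean, emb_2_mean)`.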

#### Get Sentence Embeddings From The Model
- These sentence embeddings account for word order and context.
- Verify that the sentence embeddings are not the same.

In [None]:
print(in_1)
print(in_2)

In [None]:
embedding_1 = embedding_model.get_embeddings([in_1])
embedding_2 = embedding_model.get_embeddings([in_2])

In [None]:
vector_1 = embedding_1[0].values
print(vector_1[:4])
vector_2 = embedding_2[0].values
print(vector_2[:4])