# Lesson 2: Embeddings
- Embeddings are numerical representations of text that computers can more easily process. 
- This makes them one of the most important components of large language models. 

| Embeddings          | Word Embeddings        |
| ---------------------------------- | -------------------------------- |
| ![](images/embeddings.png)  | ![](images/word_embeddings.png) |

- As you can see in this embedding, similar words are grouped together. 
- So in the top left you have sports, in the bottom left you have houses and buildings and castles, in the bottom right you have vehicles like bikes and cars, and in the top right you have fruits. 
- So the apple would go among the fruits. 
- The coordinates for apple here are (5, 5), because we associate each word in the table on the right with two numbers: the horizontal and the vertical coordinate. 
- This is an embedding. 
- Here, the embedding sends each word to just two numbers. 
- In general, an embedding sends each word to many more numbers, and it covers every possible word, not just the handful shown here. 
- Embeddings used in practice can send a word to hundreds or even thousands of numbers. 
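
The idea can be sketched with a hand-made two-dimensional embedding. Only the apple's coordinates (5, 5) come from the figure; the other words, their coordinates, and the helper `nearest_word` are invented for illustration.

```python
import math

# Toy 2D embedding: each word maps to (horizontal, vertical) coordinates.
# Only "apple" at (5, 5) comes from the lesson; the rest are made-up values.
toy_embedding = {
    "apple": (5.0, 5.0),
    "banana": (6.0, 5.0),
    "car": (6.0, -4.0),
    "bike": (5.0, -5.0),
}

def nearest_word(word, embedding):
    """Return the other word whose point is closest in Euclidean distance."""
    others = (w for w in embedding if w != word)
    return min(others, key=lambda w: math.dist(embedding[word], embedding[w]))

print(nearest_word("apple", toy_embedding))
print(nearest_word("car", toy_embedding))
```

With these made-up coordinates, the fruit ends up closest to the other fruit and the vehicle closest to the other vehicle, which is the grouping the figure illustrates.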

## Setup
Load the needed API keys and relevant Python libraries.

In [None]:
# !pip install cohere umap-learn altair datasets

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [None]:
import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])

In [None]:
import pandas as pd

## Word Embeddings

Consider a very small dataset of three words.

In [None]:
three_words = pd.DataFrame({'text':
  [
      'joy',
      'happiness',
      'potato'
  ]})

three_words

Let's create the embeddings for the three words:

In [None]:
three_words_emb = co.embed(texts=list(three_words['text']),
                           model='embed-english-v2.0').embeddings

In [None]:
word_1 = three_words_emb[0]
word_2 = three_words_emb[1]
word_3 = three_words_emb[2]

In [None]:
word_1[:10]
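
One common way to compare vectors such as `word_1`, `word_2`, and `word_3` is cosine similarity. The sketch below uses short made-up vectors so that it runs without an API key; the helper name `cosine_similarity` is an assumption of this example, not part of the Cohere client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up stand-ins for real embeddings; the first two point in similar directions.
vec_joy = [0.9, 0.1, 0.2]
vec_happiness = [0.8, 0.2, 0.1]
vec_potato = [0.1, 0.9, -0.3]

print(cosine_similarity(vec_joy, vec_happiness))
print(cosine_similarity(vec_joy, vec_potato))
```

In the notebook, the same function could be applied to the real vectors, for example by comparing `cosine_similarity(word_1, word_2)` with `cosine_similarity(word_1, word_3)`.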

## Sentence Embeddings

![](images/text_embeddings.png)

In this example here, we have embeddings for sentences. 
- Each sentence gets sent to a vector, that is, a list of numbers. 
- Notice that the first sentence is "Hello, how are you?" and the last one is "Hi, how's it going?" They don't share the same words, but they are very similar in meaning. 
- And because they're very similar, the embedding sends them to numbers that are really close to each other. 

We're going to take a look at a small dataset of sentences. \
This one has eight sentences, as you can see. They come in pairs: in each pair, the second sentence answers the first.

In [None]:
sentences = pd.DataFrame({'text':
  [
   'Where is the world cup?',
   'The world cup is in Qatar',
   'What color is the sky?',
   'The sky is blue',
   'Where does the bear live?',
   'The bear lives in the woods',
   'What is an apple?',
   'An apple is a fruit',
  ]})

sentences

Let's create the embeddings for the eight sentences:

In [None]:
emb = co.embed(texts=list(sentences['text']),
               model='embed-english-v2.0').embeddings

# Print the first 3 entries of each of the 8 sentence embeddings:
for e in emb:
    print(e[:3])

Now, how many numbers are associated with each of the sentences?

In [None]:
len(emb[0])

In this particular case it's 4096, but different embedding models produce vectors of different lengths. 

In [None]:
#import umap
#import altair as alt

In [None]:
from utils import umap_plot

In [None]:
chart = umap_plot(sentences, emb)

In [None]:
chart.interactive()
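
The chart is a two-dimensional projection, so a numerical check in the original space can complement it. As a rough sketch, the hypothetical helper `nearest_neighbors` below finds, for each row of an embedding matrix, the index of the most cosine-similar other row. The toy matrix is invented; in the notebook, `np.array(emb)` could be passed instead.

```python
import numpy as np

def nearest_neighbors(vectors):
    """For each row, return the index of the other row with the highest cosine similarity."""
    x = np.asarray(vectors, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
    sims = x @ x.T                                     # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)                    # exclude each row's match with itself
    return sims.argmax(axis=1)

# Invented vectors arranged as two (question, answer) pairs.
toy = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
print(nearest_neighbors(toy))
```

If the questions and answers pair up as the plot suggests, each question's nearest neighbor would be its own answer, though that depends on the model and is worth checking rather than assuming.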

## Article Embeddings

In [None]:
import pandas as pd
wiki_articles = pd.read_pickle('wikipedia.pkl')
wiki_articles

In [None]:
import numpy as np
from utils import umap_plot_big

In [None]:
articles = wiki_articles[['title', 'text']]
embeds = np.array([d for d in wiki_articles['emb']])

chart = umap_plot_big(articles, embeds)
chart.interactive()