# Embeddings & You - A Brief Introduction to Embeddings in Machine Learning

If you've toyed with LangChain, LlamaIndex, or even OpenAI's `ada` model, you've likely run into the word "embeddings" a few times.

They've had a recent surge in popularity due to the proliferation of Retrieval Augmented Generation, but they've been around for a very long time.

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

## Notebook Table of Contents:

- Breakout Room #1: Training Word2Vec from Scratch
  - Task 1: Dependencies
  - Task 2: Data Collection
  - Task 3: Data Preprocessing
    - 🏗️ Activity #1
    - ❓Question #1
    - 🏗️ Activity #2
    - 👪❓ Discussion Question #1
  - Task 4: Training Word2Vec
    - 🏗️ Activity #3
    - ❓Question #2
- Breakout Room #2:
  - Task 1: Fine-tuning Our Embedding Model
    - ❓Question #3
    - 🏗️ Activity #4
  - Task 2: Evaluating our Embedding Model
    - 👪❓Discussion Question #2

### Why Do We Even Need Embeddings?

In order to fully understand what Embeddings are, we first need to understand why we have them:

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

*They need numeric inputs.*

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

1. Convert non-numeric data into numeric data
2. Capture potential semantic relationships between individual pieces of data
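
To make these two requirements concrete, here is a minimal sketch. The three-word vocabulary and the dense vectors are hand-picked assumptions for illustration only (nothing here is learned). One-hot encoding satisfies the first requirement but not the second, while dense vectors can satisfy both.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical three-word vocabulary, used only for illustration.
vocab = ["cat", "kitten", "car"]

# One-hot encoding: numeric, but every pair of distinct words looks equally unrelated.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Hand-picked dense vectors (NOT learned): related words point in similar directions.
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

print("one-hot cat vs kitten:", cosine_similarity(one_hot["cat"], one_hot["kitten"]))
print("dense   cat vs kitten:", cosine_similarity(dense["cat"], dense["kitten"]))
print("dense   cat vs car:   ", cosine_similarity(dense["cat"], dense["car"]))
```

The one-hot vectors are orthogonal, so their similarity is always zero. The dense vectors place "cat" much closer to "kitten" than to "car", which is the kind of structure an embedding model tries to learn from data.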

## Breakout Room #1: Training Word2Vec from Scratch

Now that we have a bit of background on Embeddings - let's look at what it takes to create our own embeddings using Word2Vec!

We'll be leveraging the `gensim` library, which you can read all about [here](https://pypi.org/project/gensim/).

Before we begin training, however, we need some data!

Let's use the Wikipedia pages for the films Wicked and Gladiator II as examples.

### Task 1: Dependencies
We'll leverage the `wikipedia` library and `langchain`'s `WikipediaLoader` to obtain our Wikipedia data!

In [1]:
!pip install datasets langchain_community wikipedia

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting SQLAlchemy<2.0.36,>=1.4 (from langchain_community)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Colle

In [2]:
import json
import re

from gensim import models
from langchain_community import document_loaders
import nltk
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import sentence_transformers
from sentence_transformers import evaluation, losses
from torch.utils import data



> NOTE: Please reset the Colab environment after running the install cells.

### Task 2: Data Collection



In [3]:
wicked_docs = document_loaders.WikipediaLoader(
    query="Wicked (2024 film)",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
).load()





In [4]:
len(wicked_docs)

4

In [5]:
gladiator_2_docs = document_loaders.WikipediaLoader(
    query="Gladiator II",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
).load()

In [6]:
len(gladiator_2_docs)

5

### Task 3: Data Preprocessing

Now that we have some text, we need to do some preprocessing! That's right - classic NLP!

Let's begin by cleaning up our text. We'll:

- Remove special characters
- Remove stop words
- Remove links
- Convert to lowercase
- Strip whitespace

To do this, we'll need two main modules:

- The `re` standard library module
- `nltk`, an NLP library that provides the tokenizers and the stop word list

In [7]:
nltk.download('stopwords')
# nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Let's take a peek at what these "stopwords" are - for traditional embedding models and NLP.

In [8]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

#### Text Normalization

The first step is to make a helper function that normalizes our text.

##### 🏗️ Activity #1:

What should the output format of the `preprocess_text` function be?

Once you've determined the output format - please complete the code cell and ensure the appropriate format is returned.

In [9]:

stopwords.words

In [10]:
def preprocess_text(text: str) -> list[str]:
    """Normalize raw text into a list of lowercase tokens with stop words removed."""
    # remove links
    text = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)

    # replace every character that is not a letter or a space with a space
    text = re.sub("[^a-zA-Z ]", " ", text)

    # lowercase the text, then split it into word tokens
    tokens = tokenize.word_tokenize(text.lower())

    # filter stop words (tokens are already lowercase at this point)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    return filtered_tokens

Let's see how this works on some of our Wikipedia data!

In [11]:
preprocess_text(wicked_docs[0].page_content[:100])

['wicked',
 'titled',
 'onscreen',
 'wicked',
 'part',
 'american',
 'musical',
 'fantasy',
 'film',
 'directed',
 'jon']

#### Sentence Tokenization:

Now we'll turn our corpus into sets of sentences.

##### 🏗️ Activity #2:

What should the output format of the `sentence_tokenization` function be?

Once you've determined the output format - please complete the code cell and ensure the appropriate format is returned.

> NOTE: NLTK's `sent_tokenize` helper function is available through the `tokenize` module we imported earlier (as `tokenize.sent_tokenize`) - it may be useful. Check out the [docs](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html)!

In [12]:
def sentence_tokenization(text: str) -> list[list[str]]:

    # Tokenize the text into sentences
    sentences = tokenize.sent_tokenize(text)

    # Tokenize each sentence into words and store them in a list of lists
    sentence_tokens = [preprocess_text(sentence) for sentence in sentences]

    return sentence_tokens

In [13]:
sentence_tokenization(wicked_docs[0].page_content[:200])

[['wicked',
  'titled',
  'onscreen',
  'wicked',
  'part',
  'american',
  'musical',
  'fantasy',
  'film',
  'directed',
  'jon',
  'chu',
  'written',
  'winnie',
  'holzman',
  'dana',
  'fox'],
 ['first', 'two', 'part', 'film', 'adaptation']]

Perfect, with that, we're ready to create our corpus!

In [14]:
corpus = []

for doc in wicked_docs:
  corpus += sentence_tokenization(doc.page_content)

for doc in gladiator_2_docs:
  corpus += sentence_tokenization(doc.page_content)

##### ❓ Question #1:

Why is this normalization and tokenization necessary to train a Word2Vec Embedding Model?

##### Answer to question #1:

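Removing stop words strips out very frequent words that add little information about the meaning of the words around them.

The same goes for punctuation and links: they add noise to each word's context without telling the model anything about what the word means.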

##### 👪❓ Discussion Question #1:

When creating training data for Large Language Models, do we need to/should we use text normalization?

What arguments for or against text normalization exist at LLM-scale datasets?

### Task 4: Training Word2Vec

Now that we have our corpus set up, we can train our Word2Vec model.

Training is straightforward, thanks to `gensim`, and more can be understood about the process by reading the paper - but let's see it in code!

It's also worth considering/playing around with the `gensim` parameters.

### An Aside on Skip-gram (SG) and Continuous Bag of Words (CBOW):

**Skip-gram**:

Skip-gram is an approach to teaching computers the meaning of words by predicting the surrounding context from a given word. Think of it as a student who learns by taking a single word and trying to guess what words might appear around it. For example, given the word "sun," Skip-gram would learn to predict related words like "bright," "sky," and "shine." This method is particularly effective at handling rare words in the vocabulary and capturing multiple meanings of words, though it typically requires more training time. The key insight is that words appearing in similar contexts often have related meanings.

**Continuous Bag of Words (CBOW)**:

CBOW takes the opposite approach to Skip-gram by predicting a target word based on its surrounding context words. Imagine playing a fill-in-the-blank game where you see "The ___ is barking at the mailman" and need to predict "dog" based on the surrounding words. CBOW looks at multiple context words at once and tries to understand what word would make sense in the middle. This method tends to be faster to train than Skip-gram and performs particularly well with frequent words in the vocabulary. However, it might not be as effective at handling rare words or capturing multiple word meanings since it averages the context.
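
The difference between the two approaches is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration, not what `gensim` does internally; the helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and real implementations add refinements such as negative sampling.

```python
def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """Build (center, context) pairs: Skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: list[str], window: int) -> list[tuple[list[str], str]]:
    """Build (context words, target) examples: CBOW predicts the target from its whole context."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, target))
    return examples


# A preprocessed toy sentence (stop words already removed).
sentence = ["dog", "barking", "mailman"]

print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces one training example per (center, context) pair, while CBOW produces one example per target word with all of its context words grouped together. That grouping is one way to see why CBOW tends to train faster and why Skip-gram gives rare words more individual attention.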

##### 🏗️ Activity #3:

Set appropriate hyperparameters for the gensim `Word2Vec` model.

> NOTE: Documentation is available [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

##### ❓ Question #2:

What does each of the hyper-parameters mean or relate to:

- `VECTOR_SIZE` -> YOUR ANSWER HERE
- `WINDOW` -> YOUR ANSWER HERE
- `MIN_COUNT` -> YOUR ANSWER HERE
- `SG` -> YOUR ANSWER HERE

In [15]:
VECTOR_SIZE = 128
WINDOW = 10
MIN_COUNT = 2
SG = 1

word_2_vec_model = models.Word2Vec(
    sentences=corpus,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=SG
)

Blink and you'll miss it. You just trained an embeddings model!

Let's try it out and see what we did!

In [16]:
word_2_vec_model.wv["elphaba"]

array([ 0.00934273, -0.23757197,  0.2439895 ,  0.15971881,  0.2526417 ,
       -0.14715204,  0.13578135, -0.09941833, -0.10889397,  0.07052544,
        0.13674077, -0.06756032, -0.07757369, -0.05029253,  0.17961328,
        0.09929936, -0.23639187, -0.02394586, -0.25639224,  0.12575628,
        0.15077488,  0.15196322, -0.29158807, -0.28679764, -0.14743932,
        0.21579881, -0.1574203 ,  0.24075913,  0.04152977, -0.11354606,
       -0.10680997,  0.04884756,  0.09889401, -0.01061296,  0.11582282,
        0.03255213,  0.16497654, -0.15893741,  0.01817768,  0.0407531 ,
       -0.03271148,  0.15662718, -0.16393635, -0.0786014 ,  0.07111156,
       -0.01866215, -0.01182928, -0.06986199, -0.090794  , -0.03390267,
        0.22530603,  0.14156061,  0.02541267,  0.08408847, -0.05957473,
        0.02511497,  0.2223031 , -0.13083285, -0.10989736,  0.34783   ,
       -0.25098202,  0.06095483,  0.19387668,  0.05993149,  0.09443571,
        0.01629171,  0.00867418, -0.10341121, -0.02624717, -0.22

Finally! We see it: An embedding in the wild.

Notice how we input a word, in this case "elphaba", and we got back a 128-dimensional vector of floats (matching our `VECTOR_SIZE`).

Let's see if we can't get back a list of similar vectors to the vector for "Elphaba", and "Maximus"!

In [17]:
word_vectors = word_2_vec_model.wv

In [18]:
word_vectors.most_similar(positive=["elphaba"], topn=3)

[('glinda', 0.9975351691246033),
 ('galinda', 0.9955207705497742),
 ('wizard', 0.9954692721366882)]

In [19]:
word_vectors.most_similar(positive=["maximus"], topn=3)

[('son', 0.9980468153953552),
 ('lucilla', 0.9978277087211609),
 ('commodus', 0.9978188276290894)]

Now, for the moment of truth - let's do some vector math and see what happens!

In [21]:
galinda_vec = word_2_vec_model.wv["galinda"]
good_vec = word_2_vec_model.wv["good"]
mystery_vector = galinda_vec - good_vec

In [22]:
word_vectors.most_similar(positive=[mystery_vector], topn=3)

[('elphaba', 0.3605653643608093),
 ('glinda', 0.34110862016677856),
 ('wizard', 0.3406887352466583)]

And there we have it - embeddings, and a demonstration of what makes them so powerful!

> Note: This is a very small corpus, and while this result is what we'd hope for, it is largely coincidental. This behaviour shows up more reliably in much larger corpora.
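
To see what a similarity lookup over vector arithmetic is doing, here is a minimal sketch that ranks words by cosine similarity to a query vector. It is an illustration under stated assumptions, not `gensim`'s implementation: the function `rank_by_cosine` is hypothetical, and the three-dimensional vectors are hand-picked (roughly "royalty", "male", "female") rather than learned.

```python
import numpy as np


def rank_by_cosine(query: np.ndarray, vectors: dict[str, np.ndarray], topn: int = 3) -> list[tuple[str, float]]:
    """Return the `topn` words whose vectors have the highest cosine similarity to `query`."""
    unit_query = query / np.linalg.norm(query)
    scored = [
        (word, float(np.dot(unit_query, vec / np.linalg.norm(vec))))
        for word, vec in vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-picked toy vectors (NOT learned); dimensions are [royalty, male, female].
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
}

# Vector arithmetic: remove the "male" direction from "king" and add the "female" direction.
mystery = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

for word, score in rank_by_cosine(mystery, toy_vectors):
    print(f"{word}: {score:.3f}")
```

In this toy setup the arithmetic lands exactly on "queen" because the vectors were designed that way. With learned embeddings the result is only approximate, and its quality depends heavily on how much text the model has seen.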

## Breakout Room #2: Fine-tuning a BERT-Style Embedding Model on Question Answer Pairs.

Now that we've seen where embeddings "started", as it were, let's see where they've gotten.

In this section, we'll be fine-tuning Hugging Face's [sentence transformers](https://www.sbert.net/).

Sentence Transformers leverages the work done in the [Sentence-BERT](https://arxiv.org/abs/1908.10084) paper. So while the idea of converting input text into a dense vector representation is the same, the way we got to those embeddings is a bit different.

> NOTE: As the name implies, the following model is an *ENTIRE* transformer model (though Encoder-only, as described by Sentence-BERT).

### Fine-tuning Our Embeddings Model

Finally, the setup is complete - and we can move on to fine-tuning our sentence transformer embedding model!

The process is simplified considerably by how amazing the Hugging Face `sentence-transformers` library is, so let's jump straight in!

We're going to use the `BAAI/bge-small-en` embedding model as an example, but you could use any of the `sentence-transformers` embedding models.

In [24]:
model_id = "BAAI/bge-small-en"
model = sentence_transformers.SentenceTransformer(model_id)




In [25]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Let's load our data into the desired format!

In [26]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 119 (delta 36), reused 40 (delta 10), pack-reused 8 (from 1)
Receiving objects: 100% (119/119), 78.04 MiB | 15.77 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [27]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
VAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [28]:
# the files are only read, so plain read mode is enough
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r') as f:
    val_dataset = json.load(f)

In [29]:
dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = sentence_transformers.InputExample(texts=[query, text])
    examples.append(example)

We're going to be leveraging the `MultipleNegativesRankingLoss` from `sentence_transformers` as our loss function.

You can read more about it in the docs, [here](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss).

Note that there is [research](https://arxiv.org/pdf/1705.00652.pdf) that indicates that performance generally scales with `BATCH_SIZE`, but we're going to stick with an arbitrary 10 for the example in the notebook.
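
As a rough intuition for why batch size matters, here is a NumPy sketch of the in-batch negatives idea behind this kind of loss. It is a simplified illustration, not the library's code: the function name `in_batch_negatives_loss`, the `scale` value of 20, and the random toy embeddings are all assumptions made for the example.

```python
import numpy as np


def in_batch_negatives_loss(query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over a batch where document i is the only positive for query i.

    Every other document in the batch acts as a negative for query i,
    so a larger batch gives each query more negatives to be ranked against.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)

    scores = scale * (q @ d.T)  # shape: (batch, batch)

    # Log-softmax over each row, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

    # The "correct class" for row i is column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))                               # toy document embeddings
aligned_queries = docs + 0.01 * rng.normal(size=docs.shape)  # each query sits next to its own document
shuffled_queries = aligned_queries[::-1]                     # each query is now paired with the wrong document

print("aligned pairs: ", in_batch_negatives_loss(aligned_queries, docs))
print("shuffled pairs:", in_batch_negatives_loss(shuffled_queries, docs))
```

The loss is low when each query is most similar to its own document and high when it is more similar to another document in the batch, which is the ranking behaviour we want from a retrieval model. It also only needs (query, relevant text) pairs, matching the shape of the examples built above.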

##### ❓ Question #3:

What is happening in `MultipleNegativesRankingLoss` that makes it useful for our task?

In [30]:
loss = losses.MultipleNegativesRankingLoss(model)

In [31]:
BATCH_SIZE = 10

loader = data.DataLoader(examples, batch_size=BATCH_SIZE)

We'll set up the `InformationRetrievalEvaluator` to determine performance during training.

In [32]:
corpus = val_dataset['corpus']
queries = val_dataset['queries']
relevant_docs = val_dataset['relevant_docs']

evaluator = evaluation.InformationRetrievalEvaluator(queries, corpus, relevant_docs)

You could train for more epochs here, but for the example in the notebook, we'll stick with 10.

In [33]:
EPOCHS = 10

Nothing left to do but #trainthatmodel!

> NOTE: You'll need to make sure you enter the desired Weights and Biases key - you should be able to simply click the link `https://wandb.ai/authorize` and follow the outlined steps to get the API key.

In [35]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='exp_finetune',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)



ValueError: You have set `args.eval_strategy` to steps but you didn't pass an `eval_dataset` to `Trainer`. Either set `args.eval_strategy` to `no` or pass an `eval_dataset`. 

I got this error and, after debugging with Claude for a long time, gave up. I think I'd need to understand the internals of the `transformers` and `sentence-transformers` libraries better to be able to debug it.

### Task 2: Evaluating our Embeddings Models

Now that we've fine-tuned our embedding model on our data - let's see how it performs compared to the base embeddings!

We're going to be using the `InformationRetrievalEvaluator` to help us determine how well our embedding model is performing on a widely used task: Information Retrieval!

You can dive deeper into the documentation [here](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) to see under the hood.

You'll notice, however, that we have common suffixes for our evaluation metrics:

- `X_accuracy@1`, `X_accuracy@3`, etc.

This is computing metrics by looking at the accuracy, recall, precision, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP) at various numbers of retrieved items.

That is to say:

We look at these scores as we include the first closest document, top three closest documents, etc.

We can think of these `@k` as "top k" metrics.

These will help us guide important hyper-parameters when using these models for Information Retrieval tasks down the road!
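
To make the `@k` idea concrete, here is a small sketch that computes a few of these metrics for a single query. It is an illustration under assumptions, not the evaluator's code: the function `metrics_at_k` and the document IDs are hypothetical, and the real evaluator averages these values over all queries.

```python
def metrics_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and reciprocal rank over the top-k results of one query."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    # Rank (1-based) of the first relevant document within the top k, if any.
    first_hit_rank = next(
        (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant_ids),
        None,
    )

    return {
        f"accuracy@{k}": 1.0 if hits else 0.0,           # was anything relevant retrieved?
        f"precision@{k}": len(hits) / k,                 # share of retrieved docs that are relevant
        f"recall@{k}": len(hits) / len(relevant_ids),    # share of relevant docs that were retrieved
        f"mrr@{k}": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


# Hypothetical retrieval result: documents ordered from most to least similar to the query.
ranked = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant = {"doc_2"}  # a single relevant document for this query

for k in (1, 3):
    print(metrics_at_k(ranked, relevant, k))
```

When every query has exactly one relevant document, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k. That pattern is visible in the base model results later in this notebook, where `precision@3` is one third of `accuracy@3`.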

In [37]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
EVAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [38]:
# the files are only read, so plain read mode is enough
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(EVAL_DATASET_FPATH, 'r') as f:
    eval_dataset = json.load(f)

In [39]:
def evaluate_st(
    dataset: dict,
    model_id: str,
    name: str,
) -> dict:
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = evaluation.InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = sentence_transformers.SentenceTransformer(model_id)
    return evaluator(model, output_path="/content/")

##### 🏗️ Activity #4:

Describe what the `evaluate_st` function is doing in the above cell in natural language.

#### Base Embeddings Model Results

In [40]:
evaluate_st(eval_dataset, "BAAI/bge-small-en", name='bge')

{'bge_cosine_accuracy@1': 0.5067114093959731,
 'bge_cosine_accuracy@3': 0.714765100671141,
 'bge_cosine_accuracy@5': 0.7818791946308725,
 'bge_cosine_accuracy@10': 0.8288590604026845,
 'bge_cosine_precision@1': 0.5067114093959731,
 'bge_cosine_precision@3': 0.23825503355704697,
 'bge_cosine_precision@5': 0.1563758389261745,
 'bge_cosine_precision@10': 0.08288590604026844,
 'bge_cosine_recall@1': 0.5067114093959731,
 'bge_cosine_recall@3': 0.714765100671141,
 'bge_cosine_recall@5': 0.7818791946308725,
 'bge_cosine_recall@10': 0.8288590604026845,
 'bge_cosine_ndcg@10': 0.6710313851865369,
 'bge_cosine_mrr@10': 0.619814637264302,
 'bge_cosine_map@100': 0.6279603491960256,
 'bge_dot_accuracy@1': 0.5067114093959731,
 'bge_dot_accuracy@3': 0.714765100671141,
 'bge_dot_accuracy@5': 0.7818791946308725,
 'bge_dot_accuracy@10': 0.8288590604026845,
 'bge_dot_precision@1': 0.5067114093959731,
 'bge_dot_precision@3': 0.23825503355704697,
 'bge_dot_precision@5': 0.1563758389261745,
 'bge_dot_precisi

#### Fine-tuned Results

In [41]:
evaluate_st(eval_dataset, "exp_finetune", name='finetuned')



OSError: sentence-transformers/exp_finetune is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

### Conclusion

Now we can compare the embeddings models to see which performed the best!

In [42]:
df_st_bge = pd.read_csv('/content/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('/content/Information-Retrieval_evaluation_finetuned_results.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/Information-Retrieval_evaluation_finetuned_results.csv'

In [43]:
df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

NameError: name 'df_st_finetuned' is not defined

##### 👪❓Discussion Question #2:

Discuss the results with your group!