![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/06.1.Modern_Embeddings.ipynb)

# 06.1 Modern Embeddings in Spark NLP

This notebook explores the new generation of sentence embeddings available in Spark NLP. These embeddings are optimized for retrieval, semantic search, and multilingual understanding, bridging the gap between large transformer models and efficient, production-ready embeddings.

## **Introduction: Performance, Efficiency, and Size**

Selecting the right embedding model for a production environment is a critical decision that requires balancing maximum performance against practical constraints such as model size and computational cost. The latest models offer a variety of profiles to meet these diverse needs.

*The chart below, which plots MTEB Arena Performance (Elo) against model parameters, visually summarizes the trade-offs explored in this notebook.*

![MTEB Arena Performance of Embedding Models](https://www.nomic.ai/_next/image?url=%2Fblog%2Farena.png&w=1080&q=75)

*Source: [Nomic Embed's Surprisingly Good MTEB Arena Elo Score](https://www.nomic.ai/blog/posts/evaluating-embedding-models)*


This plot emphasizes that **compact embedding models** can **outperform or match very large models** in real-world, user-preference-driven benchmarks.

## **Key Model Characteristics**

- **E5 Embeddings**
  Positioned at the highest Elo score, the E5 family offers top-tier retrieval performance. That quality comes with a significant model size: E5 performs better, but it is large.

- **Nomic Embeddings**
  Situated within the green "Best perf./cost ratio" zone, Nomic strikes an excellent balance. It achieves performance comparable to the largest models with a much smaller parameter count, making it the most parameter-efficient option on the chart.

- **MiniLM Embeddings**
  With the lowest parameter count on the chart, the MiniLM family is purpose-built for efficiency and speed in resource-constrained settings. It is the lightest option available for embedding generation.



## Colab Setup

In [None]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import pandas as pd
import numpy as np

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

from pyspark.sql import functions as F

spark = sparknlp.start(gpu=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 6.2.0
Apache Spark version: 3.5.1


In [None]:
!ls /root/.ivy2/jars | grep com.johnsnowlabs

com.johnsnowlabs.nlp_jsl-llamacpp-gpu-1.0.2-compat-rc1.jar
com.johnsnowlabs.nlp_jsl-openvino-cpu_2.12-0.2.0.jar
com.johnsnowlabs.nlp_spark-nlp-gpu_2.12-6.2.0.jar
com.johnsnowlabs.nlp_tensorflow-gpu_2.12-0.4.4.jar


Setting `sparknlp.start(gpu=True)` downloads all the necessary GPU jars.

Now we can make better use of **ONNX**, **TensorFlow**, and **GGUF** models.

Let's download a dataset from Kaggle.

In [None]:
!curl -s -L -o fake-and-real-news-dataset.zip https://www.kaggle.com/api/v1/datasets/download/clmentbisaillon/fake-and-real-news-dataset > /dev/null 2>&1
!unzip -q /content/fake-and-real-news-dataset.zip


In [None]:
fake_df = spark.read.csv("/content/Fake.csv", header=True, inferSchema=True)
true_df = spark.read.csv("/content/True.csv", header=True, inferSchema=True)

fake_df = fake_df.withColumn("label", F.lit("FAKE"))
true_df = true_df.withColumn("label", F.lit("TRUE"))

fake_limited = fake_df.orderBy(F.rand()).limit(500)
true_limited = true_df.orderBy(F.rand()).limit(500)

combined_df = fake_limited.unionByName(true_limited)
combined_df.show(5)


+--------------------+--------------------+-----------+-----------------+-----+
|               title|                text|    subject|             date|label|
+--------------------+--------------------+-----------+-----------------+-----+
|WHOA! ‘SESAME STR...|Wrong target audi...|  left-news|      May 3, 2016| FAKE|
|DEMOCRAT CONGRESS...|Congresswoman Gab...|   politics|     Nov 23, 2015| FAKE|
| WATCH: Fox News ...|Even Fox News is ...|       News|February 16, 2017| FAKE|
|Show #137 – SUNDA...|Episode #137 of S...|Middle-east|     May 29, 2016| FAKE|
|STRANGER THAN FIC...|Shawn Helton  21s...|Middle-east| October 19, 2017| FAKE|
+--------------------+--------------------+-----------+-----------------+-----+
only showing top 5 rows



In [None]:
def inspect_dataset(df, label_col_name, name="Dataset"):
    print(f"--- {name} Inspection ---")
    total = df.count()
    print(f"Total Rows: {total}")
    print("Label Distribution:")
    for row in df.groupBy(label_col_name).count().collect():
        label, count = row[label_col_name], row['count']
        print(f"  - {label}: {count} rows ({(count / total) * 100:.2f}%)")
    print()

# E5 Embeddings

## What Are E5 Embeddings?

**E5** (which stands for **EmbEddings from bidirectional Encoder rEpresentations**) is a family of high-performance text embedding models. Their primary goal is to create a single, general-purpose vector representation for any piece of text (from short sentences to longer documents) that performs exceptionally well across a wide range of tasks, including retrieval, clustering, semantic textual similarity (STS), and classification.

A key achievement of the original E5 model was its **zero-shot performance**. The unsupervised pre-trained version (`E5-PT`) was the first model of its kind to outperform the strong BM25 baseline on the BEIR retrieval benchmark, without using any labeled data.

## How to Use E5: The "query:" and "passage:" Prefixes

This is the most critical part for practical application in a notebook. The original E5 models (like `e5-small`, `e5-base`, `e5-large`) use a shared encoder but are trained with an **asymmetric design**. This means you **must add a specific prefix** to your input text to tell the model what kind of text it is embedding.

* **For Retrieval / Search:**
    * Prefix your search query with: `"query: "`.
    * Prefix all the documents in your corpus (your "index") with: `"passage: "`.
    * You then find the document embeddings that have the highest cosine similarity to your query embedding.

* **For All Other Tasks (Clustering, Classification, STS):**
    * Prefix **all** input texts with: `"query: "`.
    * For example, to check the similarity between "Hello world" and "Hi there," you would embed `"query: Hello world"` and `"query: Hi there"` and then compute their cosine similarity.
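
The retrieval recipe above can be sketched in plain Python. This is a minimal illustration, not Spark NLP code: the three-dimensional vectors are hypothetical stand-ins for real E5 embeddings, and `cosine_similarity` and `rank_passages` are helper names introduced here for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Texts as they would be sent to the model, with the role prefixes attached.
query_text = "query: how do vaccines work"
passage_texts = [
    "passage: The stock market closed higher on Friday.",
    "passage: Vaccines train the immune system to recognize a pathogen.",
    "passage: Immune cells circulate through the bloodstream.",
]

# Hypothetical 3-dimensional vectors standing in for the E5 embeddings of the
# texts above (the e5_base vectors used in this notebook have 768 dimensions).
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

order = rank_passages(query_vec, passage_vecs)
print(order)                    # [1, 2, 0]
print(passage_texts[order[0]])  # the vaccine passage ranks first
```

In a real pipeline the vectors would come from the embedding model, and the ranking step would typically be delegated to a vector database rather than a Python loop.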

In [None]:
# Add "query: " prefix to each text entry
combined_df = combined_df.withColumn(
    "prefixed_text",
    F.concat(F.lit("query: "), F.col("text"))
)

combined_df.select("prefixed_text", "label").show(5)


+--------------------+-----+
|       prefixed_text|label|
+--------------------+-----+
|query: This week,...| FAKE|
|query: Donald Tru...| FAKE|
|query: The more a...| FAKE|
|query: Jefferson ...| FAKE|
|query: Donald Tru...| FAKE|
+--------------------+-----+
only showing top 5 rows



In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("prefixed_text") \
    .setOutputCol("document")

embeddings = E5Embeddings.pretrained("e5_base", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("e5")

finisher = EmbeddingsFinisher() \
    .setInputCols(["e5"]) \
    .setOutputCols(["e5_vector"]) \
    .setOutputAsVector(True)

pipeline = Pipeline(stages=[
    document_assembler,
    embeddings,
    finisher
])

model = pipeline.fit(combined_df)
result_df = model.transform(combined_df)

result_df.select("text", "e5_vector").show()


e5_base download started this may take some time.
Approximate size to download 246.7 MB
[OK!]
+--------------------+--------------------+
|                text|           e5_vector|
+--------------------+--------------------+
|This week, comedi...|[[-0.078054353594...|
|Donald Trump is i...|[[-0.044472001492...|
|The more a Trump ...|[[-0.032238207757...|
|Jefferson County ...|[[-0.023467477411...|
|Donald Trump s da...|[[-0.062421932816...|
|Jedediah Bila is ...|[[-0.087177462875...|
|" I just want peo...|[[-0.030944490805...|
|Watch this hilari...|[[-0.050122454762...|
|"Sean Spicer s is...|[[-0.048426404595...|
|It s the lame duc...|[[-0.047007333487...|
|Mark Levin droppe...|[[-0.058899585157...|
|What s happening ...|[[-0.020701564848...|
|The French citize...|[[-0.054747760295...|
|#FlashbackFriday ...|[[-0.032272677868...|
|Saudi Arabia, tha...|[[-0.025831991806...|
|"Donald Trump has...|[[-0.060007635504...|
|This week, a raci...|[[-0.034697193652...|
|It s fascinating ...|[[-0

## Exploratory Clustering and 3D Visualization of E5 Text Embeddings

We start with text and convert it into 768-dimensional vectors (E5 embeddings) that capture its meaning. To see whether the texts separate into FAKE and TRUE groups, we use K-Means to group similar vectors. K-Means is unsupervised, so it does not know the actual labels. Because 768 dimensions are too many to visualize, we first use PCA to reduce the vectors to 3D; in the code below, K-Means then runs on those 3D coordinates. PCA compresses information and may distort distances, but that is acceptable here because the goal is **exploratory visualization and pattern inspection**, not definitive classification, which we cover in the next section.


In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px

# 1. Collect the results from Spark to Pandas
# We only need the original text and the vector
pandas_df = result_df.select("text", "e5_vector").toPandas()

# 2. Convert the vector column into a NumPy array
X = np.array(pandas_df["e5_vector"].to_list())
X = X.squeeze()

# 3. Reduce Dimensions with PCA (768 -> 3)
pca_model = PCA(n_components=3)
X_emb = pca_model.fit_transform(X)

# 4. Run K-Means Clustering to group similar vectors
n_clusters = 2 # FAKE/TRUE
# Ensure n_init is set for the latest sklearn versions
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_emb)

# 5. Create a clean DataFrame for plotting
df_plot = pd.DataFrame(X_emb, columns=['x', 'y', 'z'])
df_plot['cluster'] = clusters.astype(str) # Plotly likes string for discrete colors
df_plot['text'] = pandas_df['text']       # Add the original text for hover-over

# 6. Create the 3D Interactive Plot
fig = px.scatter_3d(
    df_plot,
    x='x',
    y='y',
    z='z',
    color='cluster',
    hover_data=['text'],
    title=f"3D Visualization of E5 Embeddings Clustered into {n_clusters} Groups"
)

fig.update_traces(marker=dict(size=5))
fig.show()
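
Because K-Means never sees the labels, its cluster IDs are arbitrary. One simple way to quantify how well clusters line up with FAKE/TRUE is cluster purity. The sketch below is an illustrative helper, not part of the original notebook; `cluster_purity` and the toy data are assumptions made for the example.

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # For each cluster, count the points that carry its most common label.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)

# Hypothetical toy assignments, not results from this notebook.
toy_clusters = ["0", "0", "0", "1", "1", "1"]
toy_labels = ["FAKE", "FAKE", "TRUE", "TRUE", "TRUE", "FAKE"]

# 4 of the 6 points agree with their cluster's majority label.
print(cluster_purity(toy_clusters, toy_labels))
```

To apply this to the plot above, the `label` column would also have to be selected into `pandas_df`, since the cell above only collects `text` and `e5_vector`.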


## Binary Classification Using E5

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*YWEqFeKKKzDiNWy5UfrTsg.png" alt="Text Classification" width="1000">
</p>
<p align="center">
  <em>Image source: <a href="https://medium.com/">Medium</a> – “Text Classification” illustration</em>
</p>


Let's prepare our data for training and testing a classification model using E5 embeddings.

In [None]:
# This script performs a STRATIFIED 80/20 train-test split on the dataset.
# The split is done separately on the 'fake_df' and 'true_df' DataFrames before combining,
# ensuring that both the final 'train_df' and 'test_df' maintain the original class balance.
# We also add a "query: " prefix to each text entry.

fake_train_df, fake_test_df = fake_df.randomSplit([0.8, 0.2], seed=42)
true_train_df, true_test_df = true_df.randomSplit([0.8, 0.2], seed=42)

train_df = fake_train_df.unionByName(true_train_df)
test_df = fake_test_df.unionByName(true_test_df)

train_df = train_df.withColumn("prefixed_text", F.concat(F.lit("query: "), F.col("text")))
test_df = test_df.withColumn("prefixed_text", F.concat(F.lit("query: "), F.col("text")))


In [None]:
inspect_dataset(train_df, "label", "Training Set")
inspect_dataset(test_df, "label","Testing Set")


--- Training Set Inspection ---
Total Rows: 36133
Label Distribution:
  - FAKE: 18885 rows (52.27%)
  - TRUE: 17248 rows (47.73%)

--- Testing Set Inspection ---
Total Rows: 8773
Label Distribution:
  - FAKE: 4604 rows (52.48%)
  - TRUE: 4169 rows (47.52%)



We're going to use an ONNX model here because it will utilize our GPU better, and training will be faster.

You can find this model [here](https://sparknlp.org/2024/12/17/e5_base_en.html) on our ModelHub

In [None]:
DOWNLOAD_LINK = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/e5_base_en_5.5.1_3.0_1734398004940.zip"
ZIP_PATH = "e5_base_onnx.zip"

!wget -q -O $ZIP_PATH $DOWNLOAD_LINK
!unzip -q $ZIP_PATH -d e5_base_onnx


In [None]:
embeddings_onnx = E5Embeddings.load("e5_base_onnx") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")


In [None]:
embeddings_onnx.getOrDefault("engine")

'onnx'

Compare this with the embeddings we used at the start:

In [None]:
embeddings.getOrDefault("engine")

'openvino'

Let's train our classification model. We use [ClassifierDLApproach](https://sparknlp.org/docs/en/annotators#classifierdl) for this.

In [None]:
!mkdir -p ./classifier_logs

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("prefixed_text") \
    .setOutputCol("document")

classifier = ClassifierDLApproach() \
    .setInputCols(["embeddings"]) \
    .setOutputCol("prediction") \
    .setLabelColumn("label") \
    .setMaxEpochs(20) \
    .setLr(0.001) \
    .setBatchSize(2048) \
    .setDropout(0.4) \
    .setValidationSplit(0.15) \
    .setRandomSeed(42) \
    .setEnableOutputLogs(True) \
    .setOutputLogsPath("./classifier_logs")

classifier_pipeline = Pipeline(stages=[
    document_assembler,
    embeddings_onnx,
    classifier
])


In [None]:
%%time
classifier_model = classifier_pipeline.fit(train_df)


CPU times: user 114 ms, sys: 31.4 ms, total: 145 ms
Wall time: 10min 31s


In [None]:
predictions = classifier_model.transform(test_df)
predictions.select("text", "label", "prediction.result").show(5)


+--------------------+-----+------+
|                text|label|result|
+--------------------+-----+------+
|There are no two ...| FAKE|[FAKE]|
|A new report rele...| FAKE|[FAKE]|
|Former San Franci...| FAKE|[FAKE]|
|Former conservati...| FAKE|[FAKE]|
|Rep. Steve King (...| FAKE|[FAKE]|
+--------------------+-----+------+
only showing top 5 rows



Let's use `classification_report` from `sklearn` to evaluate the final scores

In [None]:
# Before we do that, we need to convert our Spark DataFrame to a pandas DataFrame
preds_df = predictions.select("text", "label", "prediction.result").toPandas()

# As the Spark DataFrame above shows, each predicted value sits inside a list,
# so we drop rows with an empty prediction and unwrap the rest (e.g. [FAKE] --> 'FAKE')
preds_df = preds_df[preds_df['result'].apply(lambda x: not (isinstance(x, list) and len(x) == 0))].copy()
preds_df['result'] = preds_df['result'].apply(lambda x: x[0] if isinstance(x, list) else x)


In [None]:
from sklearn.metrics import classification_report

# classification_report expects the ground-truth labels first, then the predictions
print(classification_report(
    y_true=preds_df['label'],
    y_pred=preds_df['result'],
    zero_division=0
))


              precision    recall  f1-score   support

        FAKE       0.98      0.99      0.99      4587
        TRUE       0.98      0.98      0.98      4180

    accuracy                           0.98      8767
   macro avg       0.98      0.98      0.98      8767
weighted avg       0.98      0.98      0.98      8767
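
To make the report's columns concrete, here is a minimal pure-Python sketch of how precision and recall for one class follow from the true labels and the predictions. The helper `precision_recall` and the toy data are illustrative assumptions, not output from this notebook. Note that `classification_report` takes the ground-truth labels first and the predictions second, and that swapping the two swaps precision and recall.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: one FAKE article is mistakenly predicted as TRUE.
y_true = ["FAKE", "FAKE", "FAKE", "TRUE"]
y_pred = ["FAKE", "FAKE", "TRUE", "TRUE"]

print(precision_recall(y_true, y_pred, "FAKE"))  # precision 1.0, recall 2/3
print(precision_recall(y_pred, y_true, "FAKE"))  # swapped arguments: precision 2/3, recall 1.0
```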



# Nomic Embeddings

## What Are Nomic Embeddings?

[Nomic Embed](https://www.nomic.ai/blog/posts/nomic-embed-text-v1) is a family of truly open text embedding models that achieve state-of-the-art performance on both short-context (MTEB) and long-context (LoCo) tasks.

**Truly Open Philosophy** The "truly open" philosophy means everything is available: the model weights, the training data (curated from ~235 million text pairs), and the full training code are all released under an [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0). This makes the model fully reproducible and auditable, setting it apart from proprietary alternatives.

**Why Nomic Embed is a Game Changer**

- **Longest Context**: It supports a large **8192-token context length**, matching OpenAI's text-embedding-ada-002 while delivering superior performance on key benchmarks.

- **Performance**: It outperforms OpenAI's Ada-002 and text-embedding-3-small on the Massive Text Embedding Benchmark (MTEB) and Long Context (LoCo) benchmarks.

- **Matryoshka Representation Learning (v1.5+)**: Newer versions of Nomic Embed offer the ability to **resize the output dimensionality** (e.g., from 768 down to 64) at inference time. This allows you to trade off a negligible amount of performance for huge savings in storage, memory, and bandwidth.
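
The resizing idea can be sketched as "keep the leading components, then re-normalize". The snippet below is a simplified illustration under stated assumptions: `truncate_embedding` is a hypothetical helper, the input vector is made up, and the exact recipe for a given Nomic version (for example, any normalization applied before truncation) should be taken from its model card. It only makes sense for models trained with Matryoshka Representation Learning, so per the version note above it would not apply to the `nomic_embed_v1` model loaded later in this notebook.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Hypothetical 4-dimensional embedding, shortened to 2 dimensions.
full = [3.0, 4.0, 0.0, 12.0]
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable on the shortened vectors.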


Click the Nomic Atlas map below to visualize a 5M sample of the contrastive pretraining data!

<a rel="nofollow" href="https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample">
  <img
    alt="image/webp"
    src="https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp"
    width="670"
  >
</a>


## How to Use Nomic Embed: The "Task Instruction Prefixes"

Like E5, Nomic Embed is trained to expect a task-specific instruction in its input. Supplying the right one is the most critical step for getting good results from Nomic Embed, especially in a classification pipeline.

You must prefix your input text with one of the following **"Task Instruction Prefixes"** to tell the model the intended use case:

| Prefix           | Purpose                                     | Example                                           | Use Case                       |
|-----------------|---------------------------------------------|--------------------------------------------------|--------------------------------|
| `classification:` | Embed features for a classification model | `"classification: The political debate was heated."` | Classification                |
| `clustering:`     | Embed for grouping, semantic clusters, or topic modeling | `"clustering: The quick brown fox jumps over the lazy dog."` | Topic discovery, deduplication |
| `search_query:`   | Encode a query for a Retrieval-Augmented Generation (RAG) system | `"search_query: What is the latest update on the budget?"` | Querying a Vector Database     |
| `search_document:`| Encode documents for a Vector Database (default behavior) | `"search_document: The new budget was approved on Tuesday."` | Indexing a RAG corpus          |

> **Best Practice Note:** You must always include the appropriate prefix in your input text. If you omit the prefix, the model defaults to the behavior of `search_document:`, which may result in poor performance for tasks like classification or searching with a query.
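
The table above can be turned into a small guard that refuses unknown task names. This is a plain-Python sketch for illustration; `NOMIC_TASK_PREFIXES` and `add_task_prefix` are names introduced here, not part of Spark NLP or the Nomic API.

```python
# The four task names listed in the table above.
NOMIC_TASK_PREFIXES = ("classification", "clustering", "search_query", "search_document")

def add_task_prefix(text, task):
    """Return `text` with the given task instruction prefix attached."""
    if task not in NOMIC_TASK_PREFIXES:
        raise ValueError(f"unknown task {task!r}; expected one of {NOMIC_TASK_PREFIXES}")
    return f"{task}: {text}"

# Retrieval is asymmetric: the query and the indexed documents get different prefixes.
query = add_task_prefix("What is the latest update on the budget?", "search_query")
docs = [
    add_task_prefix(d, "search_document")
    for d in ["The new budget was approved on Tuesday.", "The political debate was heated."]
]

print(query)    # search_query: What is the latest update on the budget?
print(docs[0])  # search_document: The new budget was approved on Tuesday.
```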

In [None]:
inspect_dataset(combined_df, "label")

--- Dataset Inspection ---
Total Rows: 1000
Label Distribution:
  - FAKE: 500 rows (50.00%)
  - TRUE: 500 rows (50.00%)



In [None]:
# Add "classification: " prefix to each text entry
combined_df = combined_df.withColumn(
    "prefixed_text",
    F.concat(F.lit("classification: "), F.col("text"))
)

combined_df.select("prefixed_text", "label").show(5, truncate=80)


+--------------------------------------------------------------------------------+-----+
|                                                                   prefixed_text|label|
+--------------------------------------------------------------------------------+-----+
|classification: 21st Century Wire says The contrast in the numbers protesting...| FAKE|
|classification: Buckle up America Obama still has 6 months to fundamentally t...| FAKE|
|classification: After racists threw a temper tantrum over Old Navy s ad featu...| FAKE|
|classification: Meanwhile, back at the White House, Obama spent the afternoon...| FAKE|
|classification: "Sometimes, politics can get so ridiculous that one is forced...| FAKE|
+--------------------------------------------------------------------------------+-----+
only showing top 5 rows



In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("prefixed_text") \
    .setOutputCol("document")

embeddings = NomicEmbeddings.pretrained("nomic_embed_v1","en") \
      .setInputCols(["document"]) \
      .setOutputCol("nomic")

finisher = EmbeddingsFinisher() \
    .setInputCols(["nomic"]) \
    .setOutputCols(["nomic_vector"]) \
    .setOutputAsVector(True)

pipeline = Pipeline(stages=[
    document_assembler,
    embeddings,
    finisher
])

model = pipeline.fit(combined_df)
result_df = model.transform(combined_df)

result_df.select("text", "nomic_vector").show(5)


nomic_embed_v1 download started this may take some time.
Approximate size to download 243.2 MB
[OK!]
+--------------------+--------------------+
|                text|        nomic_vector|
+--------------------+--------------------+
|21st Century Wire...|[[0.0087867416441...|
|Buckle up America...|[[-0.042215321213...|
|After racists thr...|[[0.0194554198533...|
|Meanwhile, back a...|[[0.0117667634040...|
|"Sometimes, polit...|[[-0.017648920416...|
+--------------------+--------------------+
only showing top 5 rows



# MiniLM Embeddings

## What Are MiniLM Embeddings?

[**MiniLM**](https://github.com/microsoft/unilm/tree/master/minilm) (which stands for **Mini-Language Model**) is a family of highly efficient, compact pre-trained models. The core idea behind MiniLM is to achieve performance comparable to larger, more resource-intensive models (like **BERT** or **RoBERTa**) while dramatically reducing **model size**, **memory footprint**, and **inference latency**.

The most widely used versions for embeddings are the fine-tuned models from the **Sentence Transformers** library, such as **`all-MiniLM-L6-v2`**. This model excels at creating **high-quality, dense vector representations** for sentences and short paragraphs.

### Key Features of MiniLM Embeddings

- **Exceptional Efficiency:** Thanks to its compact architecture (**all-MiniLM-L6-v2 has only 6 layers and a 384-dimensional hidden size**), it is significantly faster and requires much less memory than larger models, making it ideal for deployment in **resource-constrained environments** (e.g., edge devices) or for **large-scale indexing**.

- **Small Vector Size:** The [all-MiniLM-L6-v2](https://sparknlp.org/2025/06/23/minilm_l6_v2_en.html) model outputs a **384-dimensional vector**. This is a considerable reduction compared to many larger models which can output **768 or 1024 dimensions**, leading to **massive savings in storage** and faster **vector database operations** (indexing, search).

- **High Performance-to-Size Ratio:** While smaller, it maintains a strong position on popular benchmarks like **MTEB** for various semantic tasks, establishing it as a top choice for a *"good enough" embedding model* that prioritizes **speed and size**.
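
The storage argument is simple arithmetic. As a rough sketch, assuming each value is stored as a 4-byte float32 and ignoring index overhead, metadata, and compression, the raw size of a vector collection scales linearly with the embedding dimension. The helper `index_size_bytes` is a name introduced here for illustration.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors, assuming float32 (4 bytes) per value."""
    return num_vectors * dim * bytes_per_value

# Compare the dimensions mentioned above for one million stored vectors.
for dim in (384, 768, 1024):
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{dim:>4} dims: {size_gb:.2f} GB per million vectors")
# Prints 1.54 GB, 3.07 GB, and 4.10 GB respectively.
```

Under these assumptions, a 384-dimensional index needs half the raw storage of a 768-dimensional one.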


In [None]:
inspect_dataset(combined_df, "label")


--- Dataset Inspection ---
Total Rows: 1000
Label Distribution:
  - FAKE: 500 rows (50.00%)
  - TRUE: 500 rows (50.00%)



In [None]:
combined_df.cache()

In [None]:
DOWNLOAD_LINK = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/minilm_l6_v2_en_5.5.1_3.0_1750674121132.zip"
ZIP_PATH = "minilm_l6_v2_openvino.zip"

!wget -q -O $ZIP_PATH $DOWNLOAD_LINK
!unzip -q $ZIP_PATH -d minilm_l6_v2_openvino


In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = MiniLMEmbeddings.load("minilm_l6_v2_openvino") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings
]).fit(combined_df)


In [None]:
%%time

result = pipeline.transform(combined_df).count()


CPU times: user 71.3 ms, sys: 17.5 ms, total: 88.8 ms
Wall time: 5min 49s
