## Summary: Understanding Embeddings in LangGraph Workflows  
Overview  
This demo develops intuition about embeddings—dense vector representations of text—and how to use them for similarity search and visualization. It introduces building embeddings with Hugging Face or OpenAI, comparing semantic similarity, and visualizing embeddings in 2D space.
  
Key Steps Covered  
1. Embeddings Factory Setup  
A simple EmbeddingsFactory class is built to support two providers:

Hugging Face models (e.g., all-MiniLM-L6-v2)

OpenAI Embeddings (via API key)


class EmbeddingsFactory:
    def __init__(self, provider):
        ...
Hugging Face models are freely available and downloaded automatically.

OpenAI models require an API key and internet access.

L4_demo_03_embeddings


In [None]:
from typing import Literal, List, Dict
import itertools
from langchain_core.embeddings import Embeddings
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

Embeddings intuition  
2. Sentence List Creation  
Six sample sentences are prepared, grouped into three pairs of similar sentences.
Example:  
"The cat sat on the mat." and "A cat was resting on a mat."
sentences = [
  "The cat sat on the mat.",
  "A cat was resting on a mat.",
  "The sun is bright today.",
  "It’s sunny and warm outside.",
  "I love reading books at night.",
  "Reading at bedtime is my favorite."
]


In [None]:
class EmbeddingsFactory:
    def __init__(self, 
                 provider:Literal["OpenAI", "HugginFace"],
                 **kwargs):
        self.provider = provider
        self.kwargs = kwargs
    
    def create(self) -> Embeddings:
        if self.provider == "OpenAI":
            return OpenAIEmbeddings(**self.kwargs)
        if self.provider == "HugginFace":
            return HuggingFaceEmbeddings(**self.kwargs)
        raise ValueError(f"Unknown embeddings provider: {self.provider}")

In [None]:
sentence_list = [
    "I want to listen to music again",
    "I'm in the mood to hear music once more.",
    "Playstation has been a big part of my childhood",
    "I grew up playing Nintendo games",
    "The place I visited is the same as before",
    "The destination I returned to hasn’t changed over the years",
]

3. Generating Embeddings  
Hugging Face embeddings are generated for each sentence.

Each embedding is a 384-dimensional vector.


embeddings = [embeds.embed_query(sentence) for sentence in sentences]
These vectors encode semantic meaning: similar sentences yield similar vectors.

In [None]:
# HuggingFace
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# try moving this over to openAI
embeddings = EmbeddingsFactory(
    provider="HugginFace",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={'normalize_embeddings': False}
).create()

In [None]:
# OpenAI
from dotenv import load_dotenv
load_dotenv()

embeddings = EmbeddingsFactory(
    provider="OpenAI"
).create()

In [None]:
embeddings_list = [
    embeddings.embed_query(sentence)
    for sentence in sentence_list
]

In [None]:
len(embeddings_list)

In [None]:
len(embeddings_list[0])

In [None]:
embeddings_list[0][:10]

In [None]:
sentence_embeddings_map = [
    {"sentence":sentence, "embeddings":embeddings}
    for sentence,embeddings in zip(sentence_list, embeddings_list)
]

4. Computing Similarities  
Semantic similarity between embeddings is calculated with NumPy's dot product. When the vectors have unit length, the dot product equals cosine similarity.

Pairs of similar sentences have higher similarity scores (e.g., 0.62, 0.58, 0.56).

Non-related sentences yield lower similarity scores.
  

similarity_score = np.dot(embedding1, embedding2)
Results demonstrate that embeddings capture semantic (not just lexical) similarity.
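
The relationship between the raw dot product and cosine similarity can be sketched as follows. This is a minimal illustration on toy 3-dimensional vectors (made-up values, not real embeddings), and `cosine_similarity` is a helper name introduced here, not part of the demo.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (illustrative values only).
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])   # same direction as v1, twice as long
v3 = np.array([0.0, 0.0, 3.0])   # orthogonal to v1

print(np.dot(v1, v2))             # raw dot product grows with vector length
print(cosine_similarity(v1, v2))  # same direction, so close to 1
print(cosine_similarity(v1, v3))  # orthogonal, so close to 0
```

If the embedding model does not return unit-length vectors, dividing by the norms as above keeps scores comparable across sentences.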

In [None]:
def print_similarity(
        sentence_embeddings_map:List[Dict], 
        i1:int=0, 
        i2:int=1)->None:
    s1 = sentence_embeddings_map[i1]["sentence"]
    e1 = sentence_embeddings_map[i1]["embeddings"]
    s2 = sentence_embeddings_map[i2]["sentence"]
    e2 = sentence_embeddings_map[i2]["embeddings"]
    print(f"Score: {np.dot(e1,e2):.2f}\n")
    print(f"Sentence {i1}: {s1}\nSentence {i2}: {s2}")

In [None]:
print_similarity(sentence_embeddings_map,0,1)

In [None]:
print_similarity(sentence_embeddings_map,2,3)

In [None]:
print_similarity(sentence_embeddings_map,4,5)

In [None]:
print_similarity(sentence_embeddings_map,0,3)
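
Rather than checking pairs one at a time, every pair can be scored in one pass. The sketch below uses `itertools.combinations` on toy 2-dimensional unit vectors (illustrative values); `pairwise_scores` is a hypothetical helper, not part of the notebook.

```python
import itertools
import numpy as np

def pairwise_scores(vectors):
    """Return {(i, j): dot product} for every unordered pair of vectors."""
    return {
        (i, j): float(np.dot(vectors[i], vectors[j]))
        for i, j in itertools.combinations(range(len(vectors)), 2)
    }

# Toy unit-length vectors standing in for sentence embeddings.
toy_vectors = [
    np.array([1.0, 0.0]),
    np.array([0.8, 0.6]),
    np.array([0.0, 1.0]),
]

scores = pairwise_scores(toy_vectors)
# Print the pairs from most to least similar.
for (i, j), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{i} vs {j}: {score:.2f}")
```

Passing `embeddings_list` instead of `toy_vectors` should rank the six demo sentences the same way, assuming the embeddings have been generated.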

5. Dimensionality Reduction  
Embeddings (384 dimensions) have too many dimensions to plot directly.

Dimensionality reduction is performed (e.g., via PCA) to map embeddings into 2D space.


from sklearn.decomposition import PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
Although some information is lost, this allows easier visualization.
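
To get a feel for how much information the 2D projection keeps, here is a rough sketch of PCA written with NumPy's SVD instead of scikit-learn. The random matrix is a stand-in for the six 384-dimensional embeddings, and `pca_2d` is a name made up for this example.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)              # PCA works on mean-centered data
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:2].T            # coordinates along the top 2 components
    explained = (S ** 2) / np.sum(S ** 2)        # share of variance per component
    return projected, explained[:2]

# Random data standing in for 6 sentence embeddings of size 384.
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 384))

points, ratio = pca_2d(toy)
print(points.shape)   # one 2D point per input row
print(ratio.sum())    # fraction of total variance kept by the 2D view
```

With scikit-learn, the same fraction is available from the fitted model's `explained_variance_ratio_` attribute. The signs of the projected coordinates may differ between implementations, which does not affect the relative distances in the plot.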

In [None]:
pca_model = PCA(n_components=2)
pca_model.fit(embeddings_list)
new_values = pca_model.transform(embeddings_list)

In [None]:
print(f"shape: {new_values.shape}")
print(new_values)

6. Visualization  
A scatterplot is created showing embeddings in 2D space.

Sentences with similar meaning are plotted close together.
  

plt.scatter(...)
for i, sentence in enumerate(sentences):
    plt.annotate(sentence, ...)
Visualization confirms:

Similar sentences cluster together.

Dissimilar sentences are further apart.

In [None]:
def plot_2d(x_values, y_values, info_list):
    fig, ax = plt.subplots()
    scatter = ax.scatter(
        x_values,
        y_values,
        alpha=0.5,
        edgecolors='k',
        s=40
    )

    ax.set_title("Embeddings Viz in 2D")
    ax.set_xlabel("X_1")
    ax.set_ylabel("X_2")

    for i, info in enumerate(info_list):
        ax.annotate(info, (x_values[i], y_values[i]))
    
    plt.show()

In [None]:
plot_2d(new_values[:,0], new_values[:,1], sentence_list)

7. Switching Providers  
Switching from Hugging Face to OpenAI:

OpenAI embeddings are larger (e.g., 1536 dimensions).

Higher-dimensional embeddings may better capture subtle semantic differences.

Procedure is the same; only provider changes.


embeds = EmbeddingsFactory(provider="OpenAI").create()  
8. Key Concepts Highlighted  
Embeddings represent the meaning of text numerically.

Similarity between embeddings reflects semantic closeness.

Dimensionality reduction discards some information but makes embeddings easier to visualize.

Provider choice affects embedding quality, size, and performance.
  
9. Conclusion  
Embeddings are the backbone of RAG, retrieval, recommendation systems, and clustering tasks.

Understanding how to create, compare, and visualize embeddings is fundamental to building AI systems that understand natural language at a deeper level.

## Next demo: Using ChromaDB
Summary: Using ChromaDB for Embedding Storage and Retrieval
Overview
This demo shows how to apply embeddings in practice using ChromaDB, a vector database. It walks through inserting documents, querying with semantic similarity, switching embedding models, and eventually integrating ChromaDB with LangChain.

Key Steps Covered  
1. Initial Setup  
Five sentences (mini news articles) are created.

Topics include Meta, Nvidia, Google, Intel, and more.


sentences = [
  "Meta drops multimodal Llama model.",
  "Chip giant Nvidia acquires OctoAI.",
  "Google brings Gemini to older Pixel Buds.",
  "Intel Battlemage GPU benchmarks leaked.",
  "Nvidia CEO reveals new AI chip roadmap."
]

In [None]:
import os
import chromadb
from chromadb.utils import embedding_functions
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import numpy as np

In [None]:
sentence_list = [
    "Meta drops multimodal Llama 3.2 — here's why it's such a big deal",
    "Chip giant Nvidia acquires OctoAI, a Seattle startup that helps companies run AI models",
    "Google is bringing Gemini to all older Pixel Buds",
    "The first Intel Battlemage GPU benchmarks have leaked",
    "Dell partners with Nvidia to accelerate AI adoption in telecoms",
]
ids = ["id1", "id2", "id3", "id4", "id5"]

2. Creating a Chroma Collection
A Chroma client is created.

A collection named "udacity" is initialized.


import chromadb
client = chromadb.Client()
collection = client.create_collection(name="udacity")
Documents and IDs are added.

Default embedding model: all-MiniLM-L6-v2 (via Sentence Transformers).


collection.add(documents=sentences, ids=[...])
Validation:

Number of documents = 5.

Peeking into the collection shows embedded vectors.

In [None]:
chroma_client = chromadb.Client()

In [None]:
# To persist in disk, use:
# chroma_client = chromadb.PersistentClient(path="chromadb/")

In [None]:
collection = chroma_client.create_collection(name="udacity")

In [None]:
# By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 
# model to create embeddings.
collection.add(
    documents=sentence_list,
    ids=ids
)

In [None]:
collection._embedding_function

In [None]:
collection.count()

In [None]:
collection.peek(2)

3. Querying the Vector Database  
Queries are performed using keywords like:
  
"GPU"

"CPU"

"memory"

"gadget"

Chroma searches based on semantic similarity, returning:

Closest matching documents

Metadata

Distances (lower values mean closer matches)
  

results = collection.query(query_texts=["GPU"], n_results=2)
Example result:

Top results for "GPU" are articles about Intel GPUs and Nvidia acquisitions.
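
Conceptually, a query embeds the query text and returns the stored vectors nearest to it. The sketch below does this by exhaustive scan over toy 2-dimensional vectors, assuming squared Euclidean distance as the metric. It is an illustration of the idea, not how Chroma is implemented: vector databases typically use an approximate index instead of scanning every vector. `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, squared L2 distance) for the k documents nearest the query."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    diffs = doc_vecs - np.asarray(query_vec, dtype=float)
    distances = np.sum(diffs ** 2, axis=1)       # one distance per stored vector
    order = np.argsort(distances)[:k]            # smallest distance first
    return [(int(i), float(distances[i])) for i in order]

# Toy vectors standing in for stored document embeddings and a query embedding.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)   # indices of the two closest documents with their distances
```

The returned indices play the role of the document IDs in a real collection, and the distances correspond to the `distances` field of the query result.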

In [None]:
collection.query(
    query_texts=["gadget"],
    n_results=2,
    include=['metadatas', 'documents', 'distances']
)

4. Changing Embedding Models  
A new embedding model all-mpnet-base-v2 is introduced to improve semantic search quality.

OpenAI embeddings (text-embedding-ada-002) can also be used.

After changing the embedding model:

The Udacity collection is deleted and recreated.

New embeddings are added.


from chromadb.utils import embedding_functions

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2")
Comparisons using dot products show that articles about Nvidia are highly similar.

In [None]:
embeddings_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

In [None]:
embeddings = embeddings_fn(sentence_list)
len(embeddings)

In [None]:
print(np.dot(embeddings[1], embeddings[4]))
print(sentence_list[1])
print(sentence_list[4])

In [None]:
from dotenv import load_dotenv
load_dotenv()

embeddings_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY")
)

In [None]:
embeddings_fn._model_name

In [None]:
chroma_client.delete_collection(name="udacity")

collection = chroma_client.create_collection(
    name="udacity",
    embedding_function=embeddings_fn
)

In [None]:
collection.add(
    documents=sentence_list,
    ids=ids
)

In [None]:
collection._embedding_function

In [None]:
collection.query(
    query_texts=["gadget"],
    n_results=2,
    include=['metadatas', 'documents', 'distances']
)

5. Switching to LangChain Chroma Integration  
Instead of using raw Chroma, the demo switches to using Chroma vector stores through LangChain.

from langchain_chroma import Chroma
Documents are created with metadata (e.g., company, topic).

Example metadata:  

Company: Meta, Topic: Llama

Company: Nvidia, Topic: OctoAI


docs = [
  Document(page_content="...", metadata={"company": "Meta", "topic": "Llama"}),
  ...
]
Vector store created with:
  
Documents

Embeddings

Persistence to Chroma backend

Semantic search with score:

Top results are returned with both their text content and a score. With Chroma this score is a distance, so lower values indicate closer matches.

vectorstore.similarity_search_with_score(query="GPU", k=3)
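
How to read the score depends on the distance metric the collection is configured with. As a sketch, assuming squared Euclidean distance on unit-length vectors (an assumption about the setup, so check your own collection), the distance and the cosine similarity are tied together by `distance = 2 - 2 * cosine`. The `unit` helper and the toy vectors are made up for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy vectors standing in for a query embedding and a document embedding.
a = unit([3.0, 4.0])
b = unit([4.0, 3.0])

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

print(squared_l2)        # distance between the two unit vectors
print(2 - 2 * cosine)    # matches the distance up to rounding
```

Under these assumptions a score of 0 means identical direction, and larger scores mean less similar text.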

In [None]:
chroma_client.delete_collection(name="udacity")

In [None]:
from dotenv import load_dotenv
load_dotenv()

vector_store = Chroma(
    collection_name="udacity",
    embedding_function=OpenAIEmbeddings(),
)

In [None]:
documents = [
    Document(
        page_content="Meta drops multimodal Llama 3.2 — here's why it's such a big deal",
        metadata={"company":"Meta", "topic": "llama"}
    ),
    Document(
        page_content="Chip giant Nvidia acquires OctoAI, a Seattle startup that helps companies run AI models",
        metadata={"company":"Nvidia", "topic": "acquisition"}
    ),
    Document(
        page_content="Google is bringing Gemini to all older Pixel Buds",
        metadata={"company":"Google", "topic": "gemini"}
    ),
    Document(
        page_content="The first Intel Battlemage GPU benchmarks have leaked",
        metadata={"company":"Intel", "topic": "gpu"}
    ),
    Document(
        page_content="Dell partners with Nvidia to accelerate AI adoption in telecoms",
        metadata={"company":"Dell", "topic": "partnership"}
    ),
]

In [None]:
vector_store.add_documents(documents=documents, ids=ids)

In [None]:
results = vector_store.similarity_search_with_score(query="gpu",k=2)
for doc, score in results:
    print(f"-> {doc.page_content}\n   [Score={score:.2f}]\n   [{doc.metadata}]\n\n")

6. Key Concepts Highlighted  
Vector databases like Chroma are essential for fast semantic search.

Embeddings encode meaning; better models yield better search quality.

Metadata enhances retrieval by enabling filtered or faceted searches.

LangChain integration makes it easy to manage vector stores for downstream tasks.
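
The idea behind metadata filtering can be sketched in plain Python: narrow the candidate documents by their metadata first, then rank what remains by similarity. `ToyDoc` and `filter_by_metadata` are hypothetical stand-ins for this illustration, not LangChain or Chroma classes.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDoc:
    # Hypothetical stand-in for a document with metadata.
    text: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(docs, where):
    """Keep documents whose metadata matches every key/value pair in `where`."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in where.items())
    ]

toy_docs = [
    ToyDoc("Chip giant Nvidia acquires OctoAI",
           {"company": "Nvidia", "topic": "acquisition"}),
    ToyDoc("Google is bringing Gemini to all older Pixel Buds",
           {"company": "Google", "topic": "gemini"}),
    ToyDoc("Dell partners with Nvidia to accelerate AI adoption in telecoms",
           {"company": "Dell", "topic": "partnership"}),
]

nvidia_only = filter_by_metadata(toy_docs, {"company": "Nvidia"})
print([d.text for d in nvidia_only])
```

In LangChain's Chroma integration, the similarity search methods take a `filter` argument for this purpose; the exact behavior may depend on the installed version.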
  
7. Conclusion  
Using ChromaDB (standalone or with LangChain) enables powerful semantic search capabilities.

Combining embeddings, persistent storage, and flexible search methods is crucial for real-world RAG (Retrieval-Augmented Generation) and AI systems.