## Embedding Models

In [None]:
pip install langchain-openai

In [2]:
from dotenv import load_dotenv
load_dotenv()  # call the function; a bare `load_dotenv` only echoes the function object and loads nothing

In [None]:
import os
print(os.getcwd())
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))  # confirm the key is loaded without echoing the secret

In [4]:
from langchain_openai import OpenAIEmbeddings
openai_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)



In [6]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = DirectoryLoader("data/", glob="**/*")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)

doc_chunks = splitter.split_documents(documents)

print("Total Chunks:", len(doc_chunks))


libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Total Chunks: 7
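
To make `chunk_size` and `chunk_overlap` concrete, the sketch below cuts a string into fixed-width windows that share a few characters with their neighbours. It is a deliberately simplified stand-in: `split_fixed` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraphs and sentences rather than at arbitrary character offsets.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_fixed(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.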


In [7]:
embedding = openai_embeddings.embed_query(doc_chunks[0].page_content)

print("Embedding Vector Length:", len(embedding))
print("Sample Values:", embedding[:10])

Embedding Vector Length: 1536
Sample Values: [0.015104802325367928, -0.01100511010736227, -0.008538889698684216, 0.041406888514757156, 0.011498354375362396, -0.05785689875483513, -0.01393895223736763, 0.030568325892090797, -0.04304676502943039, -0.04045883193612099]


Task 2 — Hugging Face Embeddings

In [11]:
pip install -U sentence-transformers


In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

hf_embedding = hf_embeddings.embed_query(doc_chunks[0].page_content)

print("HF Vector Length:", len(hf_embedding))
print("Sample Values:", hf_embedding[:10])


HF Vector Length: 384
Sample Values: [-0.009657028131186962, 0.008480062708258629, -0.002803912851959467, -0.01811068132519722, -0.01980390027165413, -0.00885832030326128, 0.056905534118413925, -0.0323285311460495, -0.08476506918668747, 0.02724147029221058]


Task 3 — OpenAI vs Hugging Face

When to prefer OpenAI?

High quality semantic understanding

Production-grade search

No infra maintenance

When to prefer Hugging Face?

Offline/local usage

Cost-sensitive projects

Full control over deployment

Cost vs Performance

| Factor | OpenAI | Hugging Face |
|---|---|---|
| Cost | Paid API | Free (local compute) |
| Setup | Easy | Slightly more setup |
| Quality | Very strong | Good |
| Speed | API dependent | Local GPU dependent |

Similarity Search (OpenAI)

In [9]:
import numpy as np

# Create embeddings for all chunks
chunk_texts = [doc.page_content for doc in doc_chunks]
chunk_embeddings = openai_embeddings.embed_documents(chunk_texts)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query, top_k=3):
    query_embedding = openai_embeddings.embed_query(query)
    
    similarities = [
        cosine_similarity(query_embedding, emb)
        for emb in chunk_embeddings
    ]
    
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    return [doc_chunks[i].page_content for i in top_indices]
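
The ranking logic in `search` can be checked without any API call. The sketch below repeats the same two steps (score every vector by cosine similarity, then keep the top `k` indices) on hand-made 3-dimensional vectors. The vectors and the `top_k` helper are illustrative assumptions, and plain Python replaces NumPy only to keep the example dependency-free.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scores = [cosine(query_vec, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],   # index 0
    [0.9, 0.1, 0.0],   # index 1: leans slightly toward the second axis
    [0.0, 1.0, 0.0],   # index 2: orthogonal to index 0
    [-1.0, 0.0, 0.0],  # index 3: opposite of index 0
]
toy_query = [0.8, 0.2, 0.0]

ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```

Index 1 ranks ahead of index 0 because cosine similarity compares direction, not length: the query leans toward the second axis, and so does vector 1.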

In [10]:
print(search("What is machine learning?"))
print(search("Explain embeddings"))
print(search("Information about company policy"))

["The quick brown fox jumps over the lazy dog.\n\nThis is the second line of the file.\n\nLangChain's TextLoader is great for simple text files.\n\nIt handles various encodings like UTF-8.", 'Name Age City Occupation Alice 30 New York Software Engineer Bob 25 Los Angeles Data Scientist Charlie 35 Chicago Product Manager David 28 Houston UX Designer', '8. Blog / Thought Leadership Highlights\n\nProfessional headshots for business success\n\n\n\nImportance of visual storytelling in corporate branding\n\nTips for leveraging corporate videos for engagement\n\n9. Contact Information\n\nWebsite: Neuron.in\n\nContact form (highlighted visually)\n\n\n\nLocation: Noida, India\n\n\n\nSocial links (if available)\n\n10. Closing Page\n\n\n\nStrong brand statement: “With Neuron, your brand speaks visually.”\n\nCTA: “Explore our portfolio and elevate your business image today.”']
["The quick brown fox jumps over the lazy dog.\n\nThis is the second line of the file.\n\nLangChain's TextLoader is great 

LangChain Abstraction

In [None]:
pip install faiss-cpu

In [16]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(doc_chunks, openai_embeddings)

results = vectorstore.similarity_search("What is quick brown fox?", k=3)

for r in results:
    print(r.page_content)
    print("-" * 50)

The quick brown fox jumps over the lazy dog.

This is the second line of the file.

LangChain's TextLoader is great for simple text files.

It handles various encodings like UTF-8.
--------------------------------------------------
1. Cover Page



Logo (from website)

Title: Neuron – Corporate Photography & Video Solutions

Tagline: Visual storytelling for modern brands

Hero image from homepage

2. About Neuron



Introduction from homepage: “Elevate your brand's presence with our bespoke corporate video production and professional headshots in Pune. At Neuron, we help you unleash the power of visual storytelling to connect with your audience and showcase your corporate identity like never before.”
--------------------------------------------------
o Mortal, Thug, Yatin Karyekar, Tisca Chopra, Sumedh Mudgalkar, Dynamo, Scout,

Regaltos, etc.

Corporate clients: Leading brands in tech and imaging



Include images of client work and celebrity shoots

7. Why Choose Neuron

Custom packa

Ollama Embeddings (Local)

In [None]:
!ollama pull nomic-embed-text

In [20]:
from langchain_community.embeddings import OllamaEmbeddings

ollama_embeddings = OllamaEmbeddings(
    model="nomic-embed-text"
)

ollama_vector = ollama_embeddings.embed_query(
    doc_chunks[0].page_content
)

print("Ollama Vector Length:", len(ollama_vector))



Ollama Vector Length: 768


| Model | Internet Required | Cost | Speed |
|---|---|---|---|
| OpenAI | Yes | Paid | Fast |
| Ollama | No | Free | Depends on CPU/GPU |

Task 7 — FAISS Vector Store

In [21]:
from langchain_community.vectorstores import FAISS

faiss_store = FAISS.from_documents(doc_chunks, openai_embeddings)

# Similarity search
results = faiss_store.similarity_search("Explain embeddings", k=3)

# Save locally
faiss_store.save_local("faiss_index")

# Reload
loaded_store = FAISS.load_local(
    "faiss_index",
    openai_embeddings,
    allow_dangerous_deserialization=True
)

loaded_store.similarity_search("Explain embeddings", k=3)

[Document(id='6b5f965e-e157-4963-a5dd-edd1e064a128', metadata={'source': 'data\\sample.txt'}, page_content="The quick brown fox jumps over the lazy dog.\n\nThis is the second line of the file.\n\nLangChain's TextLoader is great for simple text files.\n\nIt handles various encodings like UTF-8."),
 Document(id='110700a8-22f9-4587-8598-eebfbc40d7a2', metadata={'source': 'data\\sample.pdf'}, page_content='8. Blog / Thought Leadership Highlights\n\nProfessional headshots for business success\n\n\n\nImportance of visual storytelling in corporate branding\n\nTips for leveraging corporate videos for engagement\n\n9. Contact Information\n\nWebsite: Neuron.in\n\nContact form (highlighted visually)\n\n\n\nLocation: Noida, India\n\n\n\nSocial links (if available)\n\n10. Closing Page\n\n\n\nStrong brand statement: “With Neuron, your brand speaks visually.”\n\nCTA: “Explore our portfolio and elevate your business image today.”'),
 Document(id='33ce6f45-740d-4af3-a776-4cb0caa0b0e3', metadata={'sourc

Task 8 — ChromaDB Vector Store

In [None]:
pip install chromadb

In [24]:
from langchain_community.vectorstores import Chroma

chroma_store = Chroma.from_documents(
    documents=doc_chunks,
    embedding=openai_embeddings,
    persist_directory="chroma_db"
)

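# With chromadb 0.4 or newer, data is written automatically when persist_directory is set,
# so this explicit call is redundant there and only triggers a deprecation warning.
chroma_store.persist()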

# Reload
chroma_loaded = Chroma(
    persist_directory="chroma_db",
    embedding_function=openai_embeddings
)

chroma_loaded.similarity_search("Explain embeddings", k=3)



[Document(metadata={'source': 'data\\sample.txt'}, page_content="The quick brown fox jumps over the lazy dog.\n\nThis is the second line of the file.\n\nLangChain's TextLoader is great for simple text files.\n\nIt handles various encodings like UTF-8."),
 Document(metadata={'source': 'data\\sample.pdf'}, page_content='8. Blog / Thought Leadership Highlights\n\nProfessional headshots for business success\n\n\n\nImportance of visual storytelling in corporate branding\n\nTips for leveraging corporate videos for engagement\n\n9. Contact Information\n\nWebsite: Neuron.in\n\nContact form (highlighted visually)\n\n\n\nLocation: Noida, India\n\n\n\nSocial links (if available)\n\n10. Closing Page\n\n\n\nStrong brand statement: “With Neuron, your brand speaks visually.”\n\nCTA: “Explore our portfolio and elevate your business image today.”'),
 Document(metadata={'source': 'data\\sample.pdf'}, page_content="1. Cover Page\n\n\n\nLogo (from website)\n\nTitle: Neuron – Corporate Photography & Video 

Task 9 — FAISS vs Chroma

In-memory vs Persistent

FAISS → primarily in-memory (can save index manually)

Chroma → written to disk when a persist_directory is given (in-memory otherwise)

Use cases for FAISS

Fast similarity search

Research experiments

Lightweight production

Use cases for Chroma

Persistent applications

RAG systems

Long-running apps
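
The in-memory versus persistent distinction comes down to whether the index survives the Python process. A minimal sketch, assuming a hypothetical `TinyIndex` class that has nothing to do with the real FAISS or Chroma internals, shows the explicit save-then-reload round trip that `save_local` and `load_local` perform for FAISS:

```python
import json
import tempfile
from pathlib import Path


class TinyIndex:
    """Hypothetical in-memory store: a list of (text, vector) pairs, nothing more."""

    def __init__(self):
        self.items = []  # lives only in this Python process

    def add(self, text, vector):
        self.items.append((text, vector))

    def save(self, path):
        # Explicit step, analogous in spirit to calling save_local on a FAISS store.
        Path(path).write_text(json.dumps(self.items), encoding="utf-8")

    @classmethod
    def load(cls, path):
        index = cls()
        raw = json.loads(Path(path).read_text(encoding="utf-8"))
        index.items = [(text, vector) for text, vector in raw]
        return index


index = TinyIndex()
index.add("quick brown fox", [0.1, 0.9])
index.add("corporate video", [0.8, 0.2])

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "tiny_index.json"
    index.save(index_path)                 # without this, the data dies with the process
    restored = TinyIndex.load(index_path)  # a fresh object rebuilt from disk

print(len(restored.items))
```

The sketch serializes to JSON. LangChain's FAISS wrapper pickles part of its state instead, which is why `load_local` asks for `allow_dangerous_deserialization=True`; only load index files you created yourself or otherwise trust.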

PART 5 — End-to-End Pipeline

In [25]:
def build_pipeline(
    documents,
    embedding_type="openai",
    vectorstore_type="faiss"
):
    
    # Select embedding
    if embedding_type == "openai":
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    elif embedding_type == "hf":
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
    elif embedding_type == "ollama":
        embeddings = OllamaEmbeddings(model="nomic-embed-text")
    else:
        raise ValueError("Invalid embedding type")
    
    # Select vector store
    if vectorstore_type == "faiss":
        vectorstore = FAISS.from_documents(documents, embeddings)
    elif vectorstore_type == "chroma":
        vectorstore = Chroma.from_documents(
            documents,
            embeddings,
            persist_directory="chroma_db"
        )
    else:
        raise ValueError("Invalid vector store type")
    
    return vectorstore

In [26]:
vectorstore = build_pipeline(
    doc_chunks,
    embedding_type="openai",
    vectorstore_type="faiss"
)

vectorstore.similarity_search("Explain vector databases", k=3)

[Document(id='9633fbe6-701b-4bcb-a39f-27db37c3f707', metadata={'source': 'data\\sample.txt'}, page_content="The quick brown fox jumps over the lazy dog.\n\nThis is the second line of the file.\n\nLangChain's TextLoader is great for simple text files.\n\nIt handles various encodings like UTF-8."),
 Document(id='f6ccc361-4bc1-4739-9002-38ac07cbce83', metadata={'source': 'data\\sample.pdf'}, page_content='o Mortal, Thug, Yatin Karyekar, Tisca Chopra, Sumedh Mudgalkar, Dynamo, Scout,\n\nRegaltos, etc.\n\nCorporate clients: Leading brands in tech and imaging\n\n\n\nInclude images of client work and celebrity shoots\n\n7. Why Choose Neuron\n\nCustom packages for budget & objectives\n\nTransform corporate image & marketing efforts\n\nExceptional value without compromising quality\n\nProven results reflected in #1 Google ranking\n\n8. Blog / Thought Leadership Highlights\n\nProfessional headshots for business success'),
 Document(id='40497a18-a88e-4beb-ae9e-62daa2a4a4d9', metadata={'source': '

Task 11 — Observations & Insights

Importance of Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. Without them, similarity has to fall back on surface features such as shared keywords, which misses paraphrases and synonyms.
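
A small illustration of where surface matching falls short: the sketch below scores sentence pairs by word overlap (Jaccard similarity). The `jaccard` helper and the example sentences are assumptions made for illustration, not part of the notebook.

```python
def jaccard(a: str, b: str) -> float:
    """Share of distinct lowercase words the two texts have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Same meaning, no shared words.
paraphrase = jaccard("the car is fast", "an automobile moves quickly")
# Different meaning, two shared filler words.
unrelated = jaccard("the car is fast", "the soup is cold")

print(paraphrase, unrelated)
```

The paraphrase scores zero while the unrelated pair scores higher. Word overlap ranks them the wrong way round, which is the gap an embedding model is meant to close.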

Why Vector Databases?

Efficient similarity search

Handles large datasets

Fast retrieval

Enables scalable search systems

This pipeline:

Converts documents → embeddings

Stores embeddings in vector DB

Retrieves relevant chunks

Feeds them to an LLM as context (a step this notebook describes but does not implement)

This is exactly how RAG systems retrieve context.
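
The hand-off from retrieval to generation is not implemented in the notebook. One possible shape for it, as a sketch only: `build_prompt`, its wording, and the `max_chars` budget are all hypothetical choices, and the chunks are hard-coded stand-ins for what `similarity_search` would return.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Join retrieved chunks into a numbered context block, then append the question."""
    context_parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_chars:
            break  # stay within a rough context budget
        context_parts.append(f"[{number}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# In the notebook these would be the page_content values of the retrieved documents;
# here they are hard-coded stand-ins.
retrieved = [
    "The quick brown fox jumps over the lazy dog.",
    "LangChain's TextLoader is great for simple text files.",
]
prompt = build_prompt("What does the fox jump over?", retrieved)
print(prompt)
```

The resulting string would then be sent to whichever chat or completion model the application uses. Numbering the chunks makes it possible to ask the model to cite which one it relied on.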