# Generative AI Assignment: Embeddings & Vector Stores
**Course:** Gen AI  
**Author:** Mehdy Mokhtari
**Date:** 1/8/1404

---

## 📘 Introduction

In this part of the assignment, we focus on **embeddings**, a fundamental concept in Generative AI: they convert text into numerical representations (vectors). Embeddings capture **semantic meaning**, so texts with similar meanings map to nearby vectors, which enables **similarity search**, **information retrieval**, and **contextual understanding**.

You will learn how to generate embeddings locally using a **pre-trained Persian BERT model (ParsBERT)** and store them in a **FAISS vector store** for semantic search.
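
To make the idea of similarity search concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their labels are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", invented for this example.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_unrelated = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])

# The related pair should score much closer to 1 than the unrelated pair.
print(f"cat vs kitten:  {sim_related:.3f}")
print(f"cat vs invoice: {sim_unrelated:.3f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated.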

---

## 🧩 What You'll Learn

- The concept and purpose of **text embeddings**.  
- How to **install and use LangChain with HuggingFace** for embedding generation.  
- How to use **ParsBERT (HooshvareLab/bert-base-parsbert-uncased)** for Persian text embeddings.  
- How to **build and test a FAISS vector store** to perform efficient similarity searches.  
- How to work with embeddings **locally**, without relying on external APIs like OpenAI.

## 1. Download Embedding Model

In [5]:
# !pip install langchain langchain-huggingface langchain_community faiss-cpu

Collecting langchain-huggingface
  Downloading langchain_huggingface-1.0.0-py3-none-any.whl.metadata (2.1 kB)
Collecting langchain_community
  Downloading langchain_community-0.4-py3-none-any.whl.metadata (3.0 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
INFO: pip is looking at multiple versions of langchain-huggingface to determine which version is compatible with other requirements. This could take a while.
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
INFO: pip is looking at multiple versions of langchain-community to determine which version is compatible with other requirements. This could take a while.
Collecting langchain_community
  Downloading langchain_community-0.3.31-py3-none-any.whl.metadata (3.0 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dat

In [1]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [2]:
model_name = "HooshvareLab/bert-base-parsbert-uncased"

model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Embedding model (ParsBERT)
hf_embedding = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
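
The cell above sets `normalize_embeddings` to `False`, so the vectors keep their original lengths. One reason this setting matters: for unit-length vectors, squared L2 distance and cosine similarity are tied together by `||a - b||^2 = 2 - 2 * cos(a, b)`, so an L2 index ranks results exactly as cosine similarity would. Without normalization the two rankings can differ. The sketch below checks that identity on toy vectors; the helper names are illustrative only.

```python
import math


def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

lhs = squared_l2(a, b)    # ||a - b||^2
rhs = 2 - 2 * dot(a, b)   # 2 - 2 * cos(a, b)
print(round(lhs, 6), round(rhs, 6))  # the two values agree
```

If cosine-style ranking is what you want from an L2 index, passing `encode_kwargs = {'normalize_embeddings': True}` is one way to get it.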



In [3]:
# Example data (Persian text). English translation:
# "At Hooshvare we believe that, with the right transfer of knowledge and awareness,
# everyone can use intelligent tools. Our motto is AI for everyone."
text = """ ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی همه افراد میتوانند از ابزار های هوشمند استفاده بکنند. شعار ما هوش مصنوعی برای همه است.

"""
embed = hf_embedding.embed_query(text)
print(len(embed))

768
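
The 768 printed above matches the hidden size of BERT-base, the architecture ParsBERT uses. BERT itself produces one vector per token, so something has to collapse those into a single vector per text. As far as I know, sentence-transformers (the library behind `HuggingFaceEmbeddings`) falls back to mean pooling when a checkpoint ships without a pooling configuration. Treat that as an assumption, since the notebook does not confirm it. A minimal sketch of masked mean pooling on toy data:

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]


# Toy input: 4 token positions, hidden size 3 (BERT-base would use 768).
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)       # [2.0, 2.0, 2.0]
print(len(sentence_vector))  # 3, the same size as each token vector
```

The pooled vector always has the hidden size of the model, which is why every input text, short or long, yields a vector of the same length.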


## 2. FAISS Vector Store

In [4]:
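from langchain_community.vectorstores import FAISS
# The splitter and Document classes are imported from the packages that ship them in
# recent LangChain releases; the older langchain.* paths are deprecated.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import faiss
import os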

In [5]:
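# Standalone FAISS index (L2) sized to the embedding dimension; "سلام" means "hello".
# Note: FAISS.from_texts below builds its own index, so this one is not used by vDB.
index = faiss.IndexFlatL2(len(hf_embedding.embed_query("سلام")))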

# Corpus
texts = [
    # "Semantic vectors are numerical representations of text."
    "بردارهای معنایی نمایش عددی از متن هستند.",
    # "FAISS is used for similarity search over vectors."
    "FAISS برای جستجوی شباهت روی بردارها استفاده می‌شود.",
    # "Language models are useful in summarization and information retrieval."
    "مدل‌های زبانی در خلاصه‌سازی و بازیابی اطلاعات مفیدند."
]

# Build vector store
vDB = FAISS.from_texts(texts, hf_embedding)


# Query: "How do we perform semantic search?"
query = "چطور جستجوی معنایی انجام دهیم؟"
for i, d in enumerate(vDB.similarity_search(query, k=3), 1):
    print(f"{i}. {d.page_content}")

1. مدل‌های زبانی در خلاصه‌سازی و بازیابی اطلاعات مفیدند.
2. FAISS برای جستجوی شباهت روی بردارها استفاده می‌شود.
3. بردارهای معنایی نمایش عددی از متن هستند.
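
To show what a flat L2 index does conceptually, the sketch below runs a brute-force nearest-neighbour search in plain Python: compute the squared L2 distance from the query to every stored vector, sort, and keep the top `k`. The function name `flat_l2_search` and the two-dimensional vectors are hypothetical; FAISS does the same comparison far more efficiently on real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(corpus_vectors, query_vector, k):
    """Return (index, squared distance) pairs for the k nearest vectors."""
    scored = [
        (i, squared_l2(vec, query_vector))
        for i, vec in enumerate(corpus_vectors)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional vectors standing in for document embeddings.
corpus_vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
]
query_vector = [0.9, 1.2]

results = flat_l2_search(corpus_vectors, query_vector, k=3)
for rank, (idx, dist) in enumerate(results, 1):
    print(f"{rank}. vector #{idx} (squared L2 distance {dist:.2f})")
```

With `k=3` and a three-item corpus, as in the query above, every document is returned and only the ordering carries information; a smaller `k` or a larger corpus makes the retrieval step more meaningful.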
