## Embedding Techniques

Embedding techniques convert high-dimensional data (such as words, images, or graphs) into lower-dimensional numerical vectors, called embeddings, that capture semantic meaning and relationships. This lets machine learning models process complex data more efficiently. Key methods include Word2Vec (CBOW, Skip-gram), GloVe, FastText, and transformer-based models like BERT for contextual understanding, alongside traditional techniques like TF-IDF.


***There are three main types of embeddings:***

### Traditional word embedding
- These methods rely on statistical counts of words in a document or corpus rather than neural networks. They are generally sparse and do not capture deep semantic relationships.

Examples: One-Hot Encoding, Bag of Words, TF-IDF
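To make the count-based idea concrete, here is a minimal sketch using only the Python standard library. The three-sentence corpus and the helper names (`one_hot`, `bag_of_words`, `tf_idf`) are made up for illustration, and the TF-IDF formula is the plain textbook form (term frequency times log of N over document frequency); libraries such as scikit-learn use smoothed variants that give different numbers.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index."""
    return [1 if word == v else 0 for v in vocab]

def bag_of_words(doc):
    """Raw term counts over the vocabulary."""
    counts = Counter(doc)
    return [counts[v] for v in vocab]

def tf_idf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for v in vocab:
        tf = counts[v] / len(doc)
        df = sum(1 for d in docs if v in d)  # never 0: vocab is built from docs
        vector.append(tf * math.log(n_docs / df))
    return vector

print(vocab)
print(one_hot("cat"))
print(bag_of_words(docs[0]))
print([round(x, 3) for x in tf_idf(docs[0])])
```

Every vector has one slot per vocabulary word, and most slots are zero. That is what "sparse" means here: "cat" and "cats" occupy unrelated slots, so the representation carries no notion that they are similar.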




### Static word embedding

These techniques use shallow neural networks to learn dense, continuous vector representations where similar words are placed closer together. They are called "static" because once trained, each word has a fixed vector regardless of how it is used in a sentence.

***Examples: Word2Vec, FastText***
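The "fixed vector" property can be illustrated with a plain lookup table. This is only a sketch: the 3-dimensional vectors below are invented by hand, whereas real Word2Vec or FastText vectors are learned from a corpus and usually have a few hundred dimensions. The names `static_vectors` and `embed` are hypothetical.

```python
import math

# Hypothetical hand-written vectors; real static embeddings are learned.
static_vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "bank":  [0.10, 0.20, 0.95],
    "river": [0.05, 0.60, 0.70],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(sentence):
    """Look up each known word; the surrounding context is ignored."""
    return [static_vectors[w] for w in sentence.split() if w in static_vectors]

# "Static": the vector for "bank" is identical in both sentences.
bank_in_river = embed("river bank")[-1]
bank_in_account = embed("bank account")[0]
print(bank_in_river == bank_in_account)

# Similar words sit closer together than unrelated ones.
print(cosine_similarity(static_vectors["king"], static_vectors["queen"]))
print(cosine_similarity(static_vectors["king"], static_vectors["bank"]))
```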




### Contextualized embedding


As of 2026, these are the standard for modern Large Language Models (LLMs). They generate dynamic representations where the same word can have different vectors depending on the surrounding text. For example, the word "bank" in "river bank" will have a different embedding than in "bank account".

Examples:
1. ELMo: Uses deep bidirectional LSTMs to generate context-sensitive embeddings.
2. BERT: Employs Transformers and self-attention mechanisms to weigh the importance of every word in a sentence simultaneously.
3. Modern LLMs (GPT-4, etc.): Use complex transformer architectures to provide highly nuanced embeddings for whole phrases or documents.
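The "bank" example can be reproduced with a toy version of self-attention. This is a deliberately simplified sketch, not how BERT or any real model is implemented: the 2-dimensional input vectors are invented, there are no learned query/key/value projections, and only a single mixing step is applied. The function name `contextualize` is hypothetical.

```python
import math

# Hypothetical static input vectors (2-D for readability).
token_vectors = {
    "river":   [1.0, 0.0],
    "bank":    [0.5, 0.5],
    "account": [0.0, 1.0],
}

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """One attention step with identity projections (no learned weights)."""
    inputs = [token_vectors[t] for t in tokens]
    outputs = []
    for query in inputs:
        # Score every token in the sentence against the current one.
        scores = [sum(q * k for q, k in zip(query, key)) for key in inputs]
        weights = softmax(scores)
        # The output is a weighted mix of all token vectors in the sentence.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, inputs))
            for i in range(len(query))
        ]
        outputs.append(mixed)
    return dict(zip(tokens, outputs))

bank_river = contextualize(["river", "bank"])["bank"]
bank_account = contextualize(["bank", "account"])["bank"]
print(bank_river)
print(bank_account)
```

Because each output vector mixes in the other tokens of the sentence, "bank" is pulled toward "river" in one case and toward "account" in the other, even though its input vector is the same.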





In [1]:
import os 
from dotenv import load_dotenv
load_dotenv()


True

In [2]:
# Re-export the key loaded from .env (raises TypeError if GEMINI_API_KEY is unset)
os.environ["GEMINI_API_KEY"] = os.getenv("GEMINI_API_KEY")

In [3]:
%pip install -U langchain-google-genai


Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001")
embeddings




GoogleGenerativeAIEmbeddings(client=<google.genai.client.Client object at 0x000001AECAAF4EC0>, model='gemini-embedding-001', task_type=None, google_api_key=SecretStr('**********'), credentials=None, vertexai=None, project=None, location=None, base_url=None, additional_headers=None, client_args=None, request_options=None, output_dimensionality=None)

In [5]:
# Example usage
text = "This is a sample text to embed."
query_result = embeddings.embed_query(text)
print(query_result)
print(len(query_result))  # 3072 by default for this model; the dimension can be set when creating the embeddings object

[-0.010710121, 0.015950082, 0.006366068, -0.07438837, 0.014673617, 0.0056567825, 0.0070044906, 0.0071163448, -0.0016675299, 0.0074965656, 0.005235593, -0.023587119, -0.015551125, 0.031038469, 0.10528153, 0.007770921, 0.0014510939, -0.0040626572, 0.00433319, -0.0062873615, -0.009588641, -0.016193885, 0.0066184765, 0.0023628224, 0.0063248877, -0.028992279, 0.020484034, 0.018463444, 0.037057005, 0.010146175, 0.011619997, 0.0059836516, 0.024548437, -0.020876637, 0.0052039474, 0.0007824665, 0.0040742266, 0.010154911, -0.013042971, 0.0098218685, -0.025052138, -0.010258311, -0.017248437, 0.0013779661, -0.0040621646, 0.013679881, 0.014594428, -0.00896207, 0.011924898, 0.013718023, -0.004526353, -0.017148826, 0.0031131126, -0.15400614, -0.0057694255, 0.0045491112, -0.008049591, -0.00032503594, 0.0033308286, 0.011007116, -0.001872798, 0.016410159, -0.008090881, -0.047408774, 0.017519359, 0.022413736, -0.008835018, 0.015053979, -0.008413505, 0.007370773, -0.0019955293, 0.015002608, -0.024100808, 

In [6]:
# Practical example with document loader

# Step 1: Load documents

from langchain_community.document_loaders import TextLoader
loader = TextLoader('speech.txt')
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='Text to speech (TTS) is the use of software to create a sound output in the form of a spoken voice. The program that is used by programs to change text on the page to an audio output of the spoken voice is normally a text to speech engine. Blind people, people who do not see well, and people with reading disabilities can rely on good text-to-speech systems. That way they can listen to pieces of the text. TTS engines are needed for an audio output of machine translation results.\n\nUp until about 2010, there was the analytic approach: This approach uses multiply steps to convert the text to speech. Usually, an input text is transformed into phonetic writing. This says how the words are pronounced, and not how they are written. In the phonetic writing, phonemes can be identified. The system can then produce speech by putting together prerecorded or synthesized diphones. A problem is to make the language flow sound natural, what l

In [7]:
%pip install -U langchain-text-splitters


Note: you may need to restart the kernel to use updated packages.


In [8]:

# Step 2: Split documents into smaller chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_documents = text_splitter.split_documents(docs)
final_documents

[Document(metadata={'source': 'speech.txt'}, page_content='Text to speech (TTS) is the use of software to create a sound output in the form of a spoken voice. The program that is used by programs to change text on the page to an audio output of the spoken voice is normally a text to speech engine. Blind people, people who do not see well, and people with reading disabilities can rely on good text-to-speech systems. That way they can listen to pieces of the text. TTS engines are needed for an audio output of machine translation results.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Up until about 2010, there was the analytic approach: This approach uses multiply steps to convert the text to speech. Usually, an input text is transformed into phonetic writing. This says how the words are pronounced, and not how they are written. In the phonetic writing, phonemes can be identified. The system can then produce speech by putting together prerecorded or synthesized diphones. A
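The roles of `chunk_size` and `chunk_overlap` can be seen in a much simpler splitter. The sketch below cuts fixed-size character windows and is not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on separators such as paragraph and line breaks before falling back to smaller units). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears with some of its context in the next chunk, which tends to help retrieval later on.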

In [9]:
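# %pip installs into the running kernel's environment; restart the kernel if the import still fails
%pip install chromadb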



In [10]:
# Steps 3 and 4 are combined here: embed the chunks and store the vectors in Chroma

from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(final_documents, embeddings, collection_name="speech-embeddings")
db

ImportError: Could not import chromadb python package. Please install it with `pip install chromadb`.
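Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose vectors are most similar to the query vector. The sketch below shows that idea in memory and is independent of Chroma. Everything in it is an assumption made for illustration: `toy_embed` is a crude letter-frequency stand-in for a real embedding model, and `ToyVectorStore` with its `add` and `search` methods is a hypothetical class, not a LangChain or Chroma API.

```python
import math

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: letter-frequency vector."""
    text = text.lower()
    return [text.count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]

def normalize(vec):
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

class ToyVectorStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = []  # list of (unit vector, original text)

    def add(self, texts):
        for text in texts:
            self.entries.append((normalize(self.embed_fn(text)), text))

    def search(self, query, k=2):
        """Brute-force search: score every stored vector against the query."""
        q = normalize(self.embed_fn(query))
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = ToyVectorStore(toy_embed)
store.add([
    "text to speech engine",
    "vector database index",
    "speech synthesis engine",
])
print(store.search("speech engine", k=1))
```

A real vector database replaces the brute-force loop with an approximate nearest-neighbour index and persists the vectors, but the add-then-search flow is the same.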