#Types of Embeddings

#A hands-on Python walkthrough demonstrating:

Sentence-Level Embeddings

Document-Level Embeddings

Contextual Embeddings

Sparse Embeddings

Each section includes code with comments and uses common libraries like sentence-transformers, transformers, and scikit-learn.

#✅ Prerequisites: Install Libraries

In [1]:
!pip install sentence-transformers transformers scikit-learn numpy




#🧪 1. Sentence-Level Embeddings
Uses sentence-transformers to embed single sentences.

In [2]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentences
sentences = [
    "Cats sit on the mat.",
    "Dogs bark loudly."
]

# Get sentence-level embeddings (one per sentence)
sentence_embeddings = model.encode(sentences)

# Print shape and sample embedding
print("Shape of sentence embeddings:", sentence_embeddings.shape)
print("First sentence embedding:\n", sentence_embeddings[0])



Shape of sentence embeddings: (2, 384)
First sentence embedding:
 [ 1.23507969e-01 -5.30569628e-02 -4.98963101e-03  4.08077948e-02
 -5.87625504e-02  3.23616564e-02  1.36526460e-02 -1.89095102e-02
  7.37913372e-03  6.12467825e-02 -1.38610778e-02  4.40199748e-02
  5.11355512e-02  2.96326522e-02 -4.26474139e-02 -1.78456791e-02
 -7.07902461e-02 -2.39863824e-02  5.32077998e-02  5.58119752e-02
 -7.13114366e-02 -1.06898136e-02  2.41438597e-02 -5.29936664e-02
 -3.31550017e-02  7.54715353e-02 -6.43899590e-02 -1.27926674e-02
 -1.01729110e-03  2.22194456e-02 -5.12349121e-02  6.82529900e-03
  3.53012756e-02  4.56747785e-02  3.28019559e-02 -7.41018504e-02
 -5.86832548e-03 -5.69223687e-02  4.62158509e-02  1.82213672e-02
  5.16728722e-02  9.47929267e-03  1.19462768e-02 -5.76193221e-02
  2.74709929e-02 -2.15046126e-02  7.64024407e-02 -1.02675572e-01
  6.06848001e-02  3.82964760e-02 -5.84474467e-02  4.52259444e-02
  3.11363637e-02  3.98165695e-02 -1.10298708e-01 -8.58578980e-02
  5.21038659e-02 -5.2191
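
Embeddings are usually compared rather than read directly. Below is a minimal sketch of cosine similarity, a common way to compare sentence vectors. The short vectors are made-up stand-ins, not real model output, and `cosine_similarity` is a helper name chosen for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional stand-ins for the 384-dimensional embeddings above
vec_a = [1.0, 0.0, 1.0, 0.0]
vec_b = [1.0, 0.0, 0.0, 0.0]
vec_c = [0.0, 1.0, 0.0, 0.0]

sim_ab = cosine_similarity(vec_a, vec_b)  # vectors share one direction
sim_ac = cosine_similarity(vec_a, vec_c)  # vectors share no direction
print("similarity(a, b):", round(sim_ab, 4))
print("similarity(a, c):", round(sim_ac, 4))
```

With the real output from the cell above, `cosine_similarity(sentence_embeddings[0], sentence_embeddings[1])` would score how close the two sample sentences are.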

#📄 2. Document-Level Embeddings
Here, we treat paragraphs or documents as inputs (same model as above, longer input).


In [3]:
# Example documents (longer than individual sentences)
documents = [
    "Cats are small, furry mammals often kept as pets. They are known for their agility and playfulness.",
    "Dogs are loyal animals and are often referred to as man's best friend. They are great companions."
]

# Document-level embeddings
doc_embeddings = model.encode(documents)

print("Shape of document embeddings:", doc_embeddings.shape)
print("First document embedding:\n", doc_embeddings[0])


Shape of document embeddings: (2, 384)
First document embedding:
 [ 1.30502298e-01  4.41324338e-02  2.81173773e-02  6.80499747e-02
 -4.33122590e-02  6.26773853e-03  4.07127589e-02 -1.52002005e-02
 -4.37959880e-02  3.77174206e-02  9.14347917e-03 -1.50090745e-02
 -4.13588323e-02  3.29346023e-02  1.46413324e-02 -9.30430181e-03
 -3.08440849e-02 -1.35069089e-02 -4.31512371e-02  1.22479029e-01
 -6.57049194e-02 -2.63490211e-02 -4.19980362e-02  3.69552597e-02
 -6.97339252e-02  5.04998453e-02 -7.50417337e-02 -4.81876656e-02
 -9.31781903e-03  1.25896065e-02 -4.32693250e-02 -3.01639549e-03
  5.13468869e-02  4.25595343e-02 -5.75995594e-02 -3.13562574e-03
  2.17579934e-03  2.43815426e-02  5.38392365e-02  3.29162329e-02
 -4.74375598e-02 -7.08421925e-03  1.93294547e-02 -3.15704420e-02
 -4.08464819e-02  2.07617842e-02 -5.18498709e-03 -1.12409070e-01
 -2.59006042e-02 -4.12486456e-02 -3.09385005e-02  9.30458978e-02
 -3.63378674e-02  4.40119356e-02  2.81599676e-03 -5.16539924e-02
  3.69923003e-02 -5.9620

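#✅ Note: SentenceTransformer accepts any text (sentence, paragraph, or document) and always returns one fixed-size vector. Input beyond the model's maximum sequence length is truncated, so a very long document is only partly represented.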
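
When a document is longer than the encoder can take in one pass, one common workaround is to split it into chunks, embed each chunk, and average the chunk vectors. The sketch below assumes a simple word-based split. `chunk_words`, `embed_long_text`, and `toy_encode` are names invented for this example, and `toy_encode` is a hypothetical stand-in for `model.encode` so the code runs without downloading a model.

```python
import numpy as np

def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_long_text(text, encode, max_words=100):
    """Embed each chunk with `encode` and average the chunk vectors."""
    chunks = chunk_words(text, max_words)
    chunk_vectors = np.asarray(encode(chunks), dtype=float)  # (n_chunks, dim)
    return chunk_vectors.mean(axis=0)

def toy_encode(texts):
    """Hypothetical stand-in for model.encode: [word count, character count]."""
    return np.array([[len(t.split()), len(t)] for t in texts], dtype=float)

long_text = "cat " * 250  # a 250-word toy document
chunks = chunk_words(long_text, max_words=100)
doc_vector = embed_long_text(long_text, toy_encode, max_words=100)

print("Chunk sizes:", [len(c.split()) for c in chunks])
print("Averaged vector shape:", doc_vector.shape)
```

Passing `model.encode` instead of `toy_encode` would average real chunk embeddings. Whether a plain average is a good document representation depends on the task, so treat this as a starting point.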

#🤖 3. Contextual Embeddings (Token-Level via BERT)
Uses Hugging Face transformers to get a context-aware embedding for each token, so the same word receives a different vector depending on the sentence around it.

In [4]:
from transformers import BertTokenizer, BertModel
import torch

# Load BERT base model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

# Input sentence
sentence = "The bank can guarantee deposits will eventually cover future tuition costs."

# Tokenize and get embeddings
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = bert(**inputs)

# Token embeddings (batch_size, seq_len, hidden_dim)
token_embeddings = outputs.last_hidden_state

print("Shape of token-level embeddings:", token_embeddings.shape)
print("Token-level embedding for 'bank':\n", token_embeddings[0][2])  # position 2 is 'bank' (0 is [CLS], 1 is 'the')



Shape of token-level embeddings: torch.Size([1, 14, 768])
Token-level embedding for 'bank':
 tensor([ 3.2366e-02, -8.8792e-02,  9.7181e-02,  6.4398e-01,  1.6796e+00,
         1.1667e-01, -5.4524e-01,  1.0407e+00, -8.1740e-02,  2.1106e-01,
         1.0294e-01, -4.5629e-01,  2.1476e-01, -1.5781e-01, -3.2740e-01,
        -6.4122e-01,  8.2487e-03,  6.7572e-01,  6.0111e-01,  6.8973e-01,
        -5.5383e-01,  2.4864e-01,  5.9280e-01,  3.7805e-01,  1.9175e-01,
         3.0864e-01,  7.9869e-01,  2.1527e-01,  8.4612e-02, -5.8811e-01,
         9.2081e-01,  5.6642e-01,  3.4433e-01, -7.7252e-02,  9.3918e-02,
        -1.4752e-01,  3.2043e-01, -3.9992e-01, -5.6791e-01,  2.9558e-01,
         9.1975e-02, -6.7627e-01, -1.0869e-01,  4.1766e-01, -7.3077e-01,
        -4.9637e-01, -2.3400e-01,  3.8468e-01, -2.7349e-01, -5.3578e-01,
        -1.7451e-01,  4.9081e-01, -2.9910e-01, -6.6330e-01,  2.8657e-01,
        -2.4094e-01, -8.1191e-01, -8.8894e-01,  1.1608e-01, -8.8811e-01,
         5.3782e-01,  4.8290e-0
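
Token-level vectors can be collapsed into a single sentence vector by averaging them while skipping padding positions. This is often called mean pooling. The sketch below uses made-up toy values rather than real BERT output, and `mean_pool` is a helper name chosen for this example.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0."""
    token_vectors = np.asarray(token_vectors, dtype=float)   # (seq_len, hidden_dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (seq_len, 1)
    return (token_vectors * mask).sum(axis=0) / mask.sum()

# Hypothetical toy values: 4 positions, 3 dimensions; the last position is padding
toy_tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding, must not affect the result
]
toy_mask = [1, 1, 1, 0]

pooled = mean_pool(toy_tokens, toy_mask)
print("Pooled vector:", pooled)
```

For the BERT output above, `mean_pool(token_embeddings[0].numpy(), inputs['attention_mask'][0].numpy())` would give one 768-dimensional vector for the whole sentence.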

#🧠 4. Sparse Embeddings (TF-IDF)
Uses scikit-learn's TfidfVectorizer to create one sparse vector per document, weighting each term by how often it appears in the document and how rare it is across the corpus.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus
corpus = [
    "Machine learning enables computers to learn from data.",
    "Artificial intelligence and machine learning are closely related."
]

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform
sparse_embeddings = vectorizer.fit_transform(corpus)

print("Shape of sparse TF-IDF matrix:", sparse_embeddings.shape)

# Convert to dense for demonstration (not recommended for large corpora)
print("TF-IDF vector for first document:\n", sparse_embeddings[0].toarray())


Shape of sparse TF-IDF matrix: (2, 14)
TF-IDF vector for first document:
 [[0.         0.         0.         0.         0.37762778 0.37762778
  0.37762778 0.37762778 0.         0.37762778 0.26868528 0.26868528
  0.         0.37762778]]
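
To see where values such as 0.37762778 and 0.26868528 come from, the first document's vector can be recomputed by hand. This sketch assumes TfidfVectorizer's default settings: lowercasing, tokens of two or more word characters, smoothed IDF, and L2 normalization. The variable names are chosen for this example.

```python
import math
import re

corpus = [
    "Machine learning enables computers to learn from data.",
    "Artificial intelligence and machine learning are closely related.",
]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
vocabulary = sorted({term for tokens in tokenized for term in tokens})
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Raw term count times IDF for the first document, then L2 normalization
first_doc = tokenized[0]
raw = {term: first_doc.count(term) * idf[term] for term in set(first_doc)}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

print("Vocabulary size:", len(vocabulary))
for term in sorted(weights):
    print(f"{term:10s} {weights[term]:.8f}")
```

Terms that occur in both documents ("machine", "learning") get the lower weight, while terms unique to the first document get the higher one.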
