In [None]:
!pip install langchain langchain-google-genai streamlit sentence-transformers

# Text Embeddings using sentence-transformers

In [8]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is fun!",
    "I love artificial intelligence.",
    "The sky is blue today.",
    "AI and ML are closely related fields."
]

embeddings = model.encode(sentences)

print("Embedding shape:", embeddings[0].shape)
print("Embeddings: ", embeddings[0][:25])  # Each vector has 384 values; print only the first 25

Embedding shape: (384,)
Embeddings:  [-0.00285213 -0.08175396  0.08026796 -0.00358492 -0.03835627 -0.05097928
 -0.06526164 -0.0693114  -0.02728982  0.04648076 -0.02619143 -0.0344762
  0.03024899  0.00664412 -0.05433209 -0.03604044 -0.04957154  0.04955317
 -0.04182218 -0.08995242 -0.02662038 -0.0399776  -0.01725702 -0.03215188
  0.04945602]
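
Embeddings like these are usually compared with cosine similarity. Below is a minimal sketch in plain Python, using made-up 3-dimensional vectors in place of the real 384-dimensional ones. The `cosine_similarity` helper and the toy values are illustrative assumptions, not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (hypothetical values).
ml_vec = [0.9, 0.1, 0.0]
ai_vec = [0.8, 0.2, 0.1]
sky_vec = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(ml_vec, ai_vec)
sim_unrelated = cosine_similarity(ml_vec, sky_vec)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

With the real model output, the same helper should work as `cosine_similarity(embeddings[0], embeddings[3])`. The library also provides `sentence_transformers.util.cos_sim` for this purpose.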


Some **commonly used models** from the `sentence-transformers` library, with their **embedding sizes**, **architecture**, and **key use cases**:


| Model Name                                | Embedding Size | Architecture       | Notes / Use Case                                               |
| ----------------------------------------- | -------------- | ------------------ | -------------------------------------------------------------- |
| **all-MiniLM-L6-v2**                      | **384**        | MiniLM (6 layers)  | Very fast, general-purpose                                     |
| **all-MiniLM-L12-v2**                     | **384**        | MiniLM (12 layers) | Better accuracy than L6                                        |
| **all-mpnet-base-v2**                     | **768**        | MPNet-base         | Very accurate general-purpose                                  |
| **paraphrase-MiniLM-L6-v2**               | **384**        | MiniLM (6 layers)  | Trained specifically for paraphrase similarity                 |
| **paraphrase-multilingual-MiniLM-L12-v2** | **384**        | MiniLM (12 layers) | Multilingual, supports 50+ languages                           |
| **multi-qa-MiniLM-L6-cos-v1**             | **384**        | MiniLM (6 layers)  | Trained for QA retrieval tasks                                 |
| **multi-qa-mpnet-base-dot-v1**            | **768**        | MPNet-base         | English semantic search (QA pairs), dot-product scoring        |
| **distiluse-base-multilingual-cased-v2**  | **512**        | DistilBERT         | Multilingual + general-purpose                                 |
| **gtr-t5-base**                           | **768**        | T5 encoder (base)  | Trained for semantic search (dense retrieval)                  |
| **gtr-t5-large**                          | **768**        | T5 encoder (large) | Higher accuracy, slower and memory-heavy                       |
| **nli-roberta-base-v2**                   | **768**        | RoBERTa-base       | Trained on NLI data for sentence similarity                    |
| **sentence-t5-base**                      | **768**        | T5 encoder (base)  | Tuned for sentence similarity; weaker for semantic search      |

---

### Notes:

* **MiniLM models (384-dim)**: Great balance of speed and performance for real-time tasks.
* **MPNet and RoBERTa (768-dim)**: Larger and slower, but generally more accurate embeddings.
* **Multilingual models**: Models such as `paraphrase-multilingual-MiniLM-L12-v2` or `distiluse-...` support cross-lingual applications.
* **T5 / GTR models**: Built on the encoder half of T5 (the decoder is not used). Strong on retrieval and QA-style semantic tasks, but more resource-intensive.
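
One concrete way to see the size trade-off in these notes is index storage: a float32 value takes 4 bytes, so doubling the embedding dimension doubles the memory needed for the same corpus. The following is a rough sketch, assuming float32 storage and a hypothetical corpus of one million sentences; the `index_size_mib` helper is illustrative.

```python
BYTES_PER_FLOAT32 = 4

def index_size_mib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size in MiB of num_vectors dense vectors of length dim."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

num_sentences = 1_000_000  # hypothetical corpus size
for dim in (384, 512, 768):
    print(f"{dim}-dim: {index_size_mib(num_sentences, dim):,.0f} MiB")
```

This counts only the raw vectors; any index structure built on top of them would add its own overhead.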
