# Embeddings Introduction
## NLP Tasks using Embeddings
[MTEB - Massive Text Embedding Benchmark](https://huggingface.co/spaces/mteb/leaderboard)

- MTEB consists of 58 datasets covering 112 languages
- 8 embedding tasks:
  1. Bitext mining - Match each sentence in one language to its translation in another. Metrics: F1, accuracy, precision, recall
  2. Classification - Embed the text into feature vectors and train a logistic regression classifier on them. Metrics: accuracy, average precision, F1
  3. Clustering - Group sentences into clusters, for example with k-means. Metric: v-measure
  4. Pair classification - Given a pair of sentences, predict its label (for example, duplicate or paraphrase). Metrics: accuracy, average precision, F1, etc.
  5. Reranking - Given a query and a list of relevant and irrelevant reference texts, rank the texts by their relevance score for the query
  6. Retrieval - Find the documents most relevant to a query. Metrics: nDCG@k, MRR@k, MAP@k, precision@k, and recall@k, computed for several values of k
  7. STS (Semantic Textual Similarity) - Score how similar two sentences are. Metrics: Pearson and Spearman correlations, with the Spearman correlation computed on cosine similarity
  8. Summarization - Score a machine-generated summary by comparing it to a human-written summary using cosine similarity
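
As a rough illustration of how the retrieval and reranking tasks above use embeddings, the sketch below ranks a few documents by cosine similarity to a query and computes precision@k. The 3-dimensional vectors, the `relevant` labels, and the helper names `cosine_scores` and `precision_at_k` are made up for this example; they are not part of MTEB or any library.

```python
import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for i in top_k if i in relevant_ids) / k

# Toy 3-dimensional "embeddings" (invented for illustration only).
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0],   # doc 0: nearly parallel to the query
    [0.0, 1.0, 0.0],   # doc 1: orthogonal to the query
    [0.7, 0.7, 0.0],   # doc 2: partly aligned
    [-1.0, 0.0, 0.0],  # doc 3: opposite direction
])
relevant = {0, 2}  # hypothetical ground-truth relevant document ids

scores = cosine_scores(query, docs)
ranking = [int(i) for i in np.argsort(-scores)]  # best match first
p_at_2 = precision_at_k(ranking, relevant, k=2)
print(ranking, p_at_2)
```

Real benchmarks use model-generated embeddings with hundreds or thousands of dimensions, but the ranking step is the same idea.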
 

## Similarity Score Calculation

### 1. Dot Product vs Cosine Similarity

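- OpenAI embedding vectors have magnitude 1.0, so the dot product of two such vectors already equals their cosine similarity.
- If the vectors are not unit-length and magnitude should not affect the score, use cosine similarity, which divides the magnitudes out. If they are already normalized, the cheaper dot product gives the same result.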
-------------------
#### Dot Product
![image.png](attachment:2170f334-9653-4dcd-84ac-3cf3e15f4692.png)

```python
import numpy as np
dot_product = np.dot(vector1, vector2)  # vector1, vector2: 1-D arrays of equal length
```
-------------------

#### Cosine Similarity
![image.png](attachment:092ea6f8-269f-46e0-8de0-c2d2bb8337ca.png)

```python
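import numpy as np

dot_product = np.dot(vector1, vector2)
mag1 = np.linalg.norm(vector1)  # Euclidean (L2) length of vector1
mag2 = np.linalg.norm(vector2)  # Euclidean (L2) length of vector2
similarity_score = dot_product / (mag1 * mag2)  # in [-1, 1]; undefined for zero vectors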
```
-------------------
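
To check the point made at the start of this section, the sketch below compares the two scores on a pair of vectors that point in the same direction but have different lengths. The vectors and the helper names `cosine_similarity` and `normalize` are illustrative assumptions, not taken from any embedding API.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])  # length 5
b = np.array([6.0, 8.0])  # same direction as a, length 10

raw_dot = np.dot(a, b)                         # grows with vector length
cos = cosine_similarity(a, b)                  # depends on direction only
unit_dot = np.dot(normalize(a), normalize(b))  # dot product after normalizing

print(raw_dot, cos, unit_dot)
```

The raw dot product is 50 because it scales with both lengths, while the cosine similarity and the dot product of the normalized vectors agree at about 1.0. This is why a plain dot product is enough when the embeddings are already unit-length.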

