# [Jina-V3](https://huggingface.co/jinaai/jina-embeddings-v3)
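jina-embeddings-v3 is a multilingual, multi-task text embedding model designed for a range of NLP applications. Built on the Jina-XLM-RoBERTa architecture, it uses rotary position embeddings (RoPE) to handle long input sequences of up to 8192 tokens. It also includes five LoRA adapters that efficiently produce task-specific embeddings.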


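## Five task modes
- **retrieval.query**: query embeddings for asymmetric retrieval tasks
- **retrieval.passage**: passage embeddings for asymmetric retrieval tasks
- **separation**: embeddings for clustering and reranking applications
- **classification**: embeddings for classification tasks
- **text-matching**: embeddings for tasks that quantify the similarity between two texts, such as STS or symmetric retrieval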
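
To make the asymmetric retrieval case concrete, here is a minimal sketch of how query and passage embeddings could be combined: the query is encoded with `retrieval.query`, the candidates with `retrieval.passage`, and the candidates are ranked by cosine similarity. The helper `rank_passages` and the toy vectors are illustrative assumptions, not part of the model's API; the commented `model.encode(..., task=...)` calls mirror the usage in the cells of this notebook.

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Return passage indices sorted by descending cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                # cosine similarity of each passage to the query
    order = np.argsort(-scores)   # highest score first
    return order, scores[order]

# With the real model the inputs would come from calls like these (assumed usage):
#   query_emb    = model.encode([query], task="retrieval.query")[0]
#   passage_embs = model.encode(passages, task="retrieval.passage")
# Toy stand-in vectors so the sketch runs without the model:
query_emb = np.array([1.0, 0.0, 0.0])
passage_embs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 0.0],   # nearly parallel to the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])
order, scores = rank_passages(query_emb, passage_embs)
print(order, scores.round(3))
```

The symmetric modes (`text-matching`, `separation`, `classification`) would use the same task name for every input instead of two different ones.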

Next, we test and compare the embeddings produced by each of the five task modes.

1. Initialize the model

In [2]:
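from transformers import AutoModel

# Load a local copy of the model; trust_remote_code=True is required because
# the model ships its own modeling code.
model = AutoModel.from_pretrained("/Users/randy/Downloads/models/jinaai/jina-embeddings-v3", trust_remote_code=True)
model.to('mps')  # Apple Silicon GPU backend; use another device if needed
print("Model initialized")

Model initialized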


2. Define the tasks and test texts

In [3]:
tasks = ['retrieval.query', 'retrieval.passage', 'separation', 'classification', 'text-matching']
texts = [
  "Follow the white rabbit.",  # English
  "Sigue al conejo blanco.",  # Spanish
  "Suis le lapin blanc.",  # French
  "跟着白兔走。",  # Chinese
  "اتبع الأرنب الأبيض.",  # Arabic
  "Folge dem weißen Kaninchen.",  # German
  "在前面那只白色的兔子后面走。",  # Similar Chinese
]
languages = ['English', 'Spanish', 'French', 'Chinese', 'Arabic', 'German', 'Similar']

3. Compute the embeddings and similarities for each task

In [7]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Store the pairwise cosine-similarity matrix for each task
similarities = {}
for task in tasks:
    embeddings = model.encode(texts, task=task)
    sim_matrix = cosine_similarity(embeddings, embeddings)
    similarities[task] = sim_matrix
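
For reference, the cosine similarity used above is the dot product of L2-normalized vectors. The following sketch reproduces that computation with plain NumPy on toy data; `cosine_similarity_matrix` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    x = np.asarray(embeddings, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize each row
    return unit @ unit.T                                  # dot products of unit vectors

# Toy vectors standing in for model embeddings
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim.round(3))
```

This is why every matrix in the outputs that follow has 1.000 on the diagonal (each text compared with itself) and is symmetric.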

- `retrieval.query` task

In [10]:
print(f"\n=== Task mode: {tasks[0]} ===")
df = pd.DataFrame(
    similarities[tasks[0]],
    index=languages,
    columns=languages
)
print(df.round(3))


=== Task mode: retrieval.query ===
         English  Spanish  French  Chinese  Arabic  German  Similar
English    1.000    0.683   0.580    0.706   0.713   0.739    0.687
Spanish    0.683    1.000   0.562    0.629   0.650   0.780    0.585
French     0.580    0.562   1.000    0.451   0.543   0.665    0.432
Chinese    0.706    0.629   0.451    1.000   0.586   0.621    0.785
Arabic     0.713    0.650   0.543    0.586   1.000   0.684    0.563
German     0.739    0.780   0.665    0.621   0.684   1.000    0.601
Similar    0.687    0.585   0.432    0.785   0.563   0.601    1.000


- `retrieval.passage` task

In [11]:

print(f"\n=== Task mode: {tasks[1]} ===")
df = pd.DataFrame(
    similarities[tasks[1]],
    index=languages,
    columns=languages
)
print(df.round(3))


=== Task mode: retrieval.passage ===
         English  Spanish  French  Chinese  Arabic  German  Similar
English    1.000    0.842   0.576    0.661   0.819   0.851    0.691
Spanish    0.842    1.000   0.633    0.566   0.797   0.886    0.592
French     0.576    0.633   1.000    0.287   0.528   0.652    0.359
Chinese    0.661    0.566   0.287    1.000   0.620   0.490    0.778
Arabic     0.819    0.797   0.528    0.620   1.000   0.764    0.630
German     0.851    0.886   0.652    0.490   0.764   1.000    0.529
Similar    0.691    0.592   0.359    0.778   0.630   0.529    1.000


- `separation` task

In [12]:
print(f"\n=== Task mode: {tasks[2]} ===")
df = pd.DataFrame(
    similarities[tasks[2]],
    index=languages,
    columns=languages
)
print(df.round(3))


=== Task mode: separation ===
         English  Spanish  French  Chinese  Arabic  German  Similar
English    1.000    0.895   0.881    0.888   0.897   0.906    0.877
Spanish    0.895    1.000   0.878    0.879   0.885   0.917    0.866
French     0.881    0.878   1.000    0.843   0.851   0.895    0.824
Chinese    0.888    0.879   0.843    1.000   0.855   0.872    0.942
Arabic     0.897    0.885   0.851    0.855   1.000   0.872    0.847
German     0.906    0.917   0.895    0.872   0.872   1.000    0.869
Similar    0.877    0.866   0.824    0.942   0.847   0.869    1.000


- `classification` task

In [13]:
print(f"\n=== Task mode: {tasks[3]} ===")
df = pd.DataFrame(
    similarities[tasks[3]],
    index=languages,
    columns=languages
)
print(df.round(3))


=== Task mode: classification ===
         English  Spanish  French  Chinese  Arabic  German  Similar
English    1.000    0.862   0.868    0.865   0.835   0.882    0.830
Spanish    0.862    1.000   0.894    0.846   0.863   0.904    0.808
French     0.868    0.894   1.000    0.846   0.865   0.912    0.813
Chinese    0.865    0.846   0.846    1.000   0.801   0.855    0.902
Arabic     0.835    0.863   0.865    0.801   1.000   0.844    0.777
German     0.882    0.904   0.912    0.855   0.844   1.000    0.833
Similar    0.830    0.808   0.813    0.902   0.777   0.833    1.000


- `text-matching` task

In [14]:
print(f"\n=== Task mode: {tasks[4]} ===")
df = pd.DataFrame(
    similarities[tasks[4]],
    index=languages,
    columns=languages
)
print(df.round(3))


=== Task mode: text-matching ===
         English  Spanish  French  Chinese  Arabic  German  Similar
English    1.000    0.709   0.538    0.656   0.721   0.776    0.681
Spanish    0.709    1.000   0.603    0.666   0.634   0.821    0.664
French     0.538    0.603   1.000    0.526   0.523   0.616    0.506
Chinese    0.656    0.666   0.526    1.000   0.536   0.643    0.930
Arabic     0.721    0.634   0.523    0.536   1.000   0.641    0.545
German     0.776    0.821   0.616    0.643   0.641   1.000    0.662
Similar    0.681    0.664   0.506    0.930   0.545   0.662    1.000
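
The matrices above are easier to compare when each one is reduced to a few summary numbers. The sketch below shows one possible way to do that: it finds the most similar pair of distinct texts and the mean off-diagonal similarity. Both helpers are hypothetical, and the 3x3 input is an excerpt (English, Chinese, Similar) copied from the `text-matching` output above.

```python
import numpy as np

def most_similar_pair(sim_matrix, labels):
    """Return (label_a, label_b, score) for the highest off-diagonal entry."""
    sim = np.array(sim_matrix, dtype=float)  # copy so the input is not modified
    np.fill_diagonal(sim, -np.inf)           # ignore each text's similarity to itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return labels[i], labels[j], float(sim[i, j])

def mean_offdiag(sim_matrix):
    """Mean similarity over all pairs of distinct texts."""
    sim = np.asarray(sim_matrix, dtype=float)
    mask = ~np.eye(sim.shape[0], dtype=bool)  # True everywhere except the diagonal
    return float(sim[mask].mean())

labels = ["English", "Chinese", "Similar"]
excerpt = np.array([
    [1.000, 0.656, 0.681],
    [0.656, 1.000, 0.930],
    [0.681, 0.930, 1.000],
])
pair = most_similar_pair(excerpt, labels)
print(pair)
print(round(mean_offdiag(excerpt), 3))
```

On the full matrices, the same helpers could be applied as `most_similar_pair(similarities[task], languages)` for each task.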


4. Mean difference between task modes

In [16]:
import numpy as np
print("\n=== Mean difference between task modes ===")
task_differences = {}
for i in range(len(tasks)):
    for j in range(i+1, len(tasks)):
        task1, task2 = tasks[i], tasks[j]
        # Mean absolute element-wise difference over the full matrix (diagonal included)
        diff = np.abs(similarities[task1] - similarities[task2]).mean()
        task_differences[f"{task1} vs {task2}"] = diff

for pair, diff in task_differences.items():
    print(f"{pair}: {diff:.3f}")


=== Mean difference between task modes ===
retrieval.query vs retrieval.passage: 0.060
retrieval.query vs separation: 0.212
retrieval.query vs classification: 0.190
retrieval.query vs text-matching: 0.038
retrieval.passage vs separation: 0.199
retrieval.passage vs classification: 0.178
retrieval.passage vs text-matching: 0.079
separation vs classification: 0.026
separation vs text-matching: 0.198
classification vs text-matching: 0.178
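
One caveat about these numbers: the mean is taken over all 49 matrix entries, including the 7 diagonal entries, where both matrices are 1.000 and the difference is zero. With 7 texts, the reported values are therefore 6/7 of the mean difference over distinct pairs. A minimal sketch of the off-diagonal variant, using a hypothetical helper and toy matrices, might look like this:

```python
import numpy as np

def offdiag_mean_abs_diff(a, b):
    """Mean absolute difference between two similarity matrices, diagonal excluded."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = ~np.eye(a.shape[0], dtype=bool)  # drop self-similarity entries
    return float(np.abs(a - b)[mask].mean())

# Toy 2x2 similarity matrices standing in for two task modes
a = np.array([[1.0, 0.5], [0.5, 1.0]])
b = np.array([[1.0, 0.9], [0.9, 1.0]])
full = float(np.abs(a - b).mean())   # averages over the zero diagonal too
off = offdiag_mean_abs_diff(a, b)    # averages over distinct pairs only
print(round(full, 3), round(off, 3))
```

The ranking of task pairs is unchanged by this correction, since every value is scaled by the same factor.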


## Analysis of the five modes

The experiments above show that Jina's different task modes