# Testing embedding differences between Chinese and English

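Preliminary conclusions:

- bge-m3 called directly through FlagEmbedding behaves as expected: it separates matching from non-matching titles clearly, and better than the Ollama-served bge-m3.
- bge-m3 served through Ollama (via `OllamaEmbedding`) gives questionable scores: on the Chinese data all five titles land between 0.34 and 0.37, and the matching title is not ranked first.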
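
One way to make "questionable" concrete is the margin between the matching title (the first entry of `titles`) and the best non-matching title. A minimal sketch using the `simple_data_zh` scores recorded in this notebook, rounded to four decimals; the helper name `margin` is ours and not part of any library:

```python
# Scores for simple_data_zh copied from the outputs in this notebook
# (rounded to 4 decimals). Index 0 is the title that matches the description.
zh_scores = {
    "bge-m3 (FlagEmbedding)": [0.7095, 0.2084, 0.1761, 0.3816, 0.2766],
    "bge-m3 (Ollama)": [0.3451, 0.3593, 0.3444, 0.3674, 0.3681],
}


def margin(scores):
    """Score of the matching title minus the best non-matching score."""
    return scores[0] - max(scores[1:])


margins = {name: round(margin(s), 4) for name, s in zh_scores.items()}
for name, value in margins.items():
    print(f"{name}: {value:+.4f}")
```

A positive margin means the matching title ranks first. For the Ollama run the margin is negative, so the matching title is not the top hit.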

## Data

### English data

In [2]:
%%time

simple_data_en={
    "titles": ["What is BGE M3?", "Defination of BM25"],
    "desc": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
}

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.58 µs


In [3]:
%%time

simple_data_en2={
    "titles": ["What concerns had many Americans expressed about Joe Biden before Thursday evening?", "How did the heatwave affect Mr. Ram's children and their comfort at home?"],
    "desc": "Before Thursday evening, many Americans had expressed concerns about Joe Biden’s age and fitness for office. To say that this debate did not put those concerns to rest may be one of the greatest understatements of the year."
}

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 4.77 µs


### Chinese data

In [4]:
%%time

simple_data_zh={
    "titles": ["和平共处五项原则为啥到现在还有重要意义？", "丰鸟科技在内蒙古主场地的资产清点和撤出的具体原因是什么？","电池工厂火灾的具体起因是什么？","莫迪计划推出的亲商改革具体包括哪些措施？","金融监管总局支持信托公司和理财公司加大创业投资力度的具体措施有哪些？"],
    "desc": "和平共处五项原则包括互相尊重主权和领土完整、互不侵犯、互不干涉内政、平等互利、和平共处，正式提出于1954年周恩来总理访问印度和缅甸期间。此后70年里，这些原则在新中国的外交实践中不仅未过时，反而历久弥新，作用愈加显著。其背景与新中国成立初期的特殊历史时期紧密相关，当时新政权制定了“另起炉灶”“打扫干净屋子再请客”和“一边倒”的三大外交方针，强调不承认旧政权的外交关系，清除帝国主义特权，并加入社会主义阵营。这些方针中贯穿了不可动摇的平等原则，这是中国民主革命的宗旨，体现了从旧时代走向新时代的转变。和平共处五项原则作为这一方针的产物，始终坚持平等原则，反映了中国反对霸权主义的坚定立场。"
}

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 3.58 µs


## bge-m3

### Using the official API

In [2]:
%%time

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('/models/bge-m3',  
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

CPU times: user 2 s, sys: 1.7 s, total: 3.69 s
Wall time: 21.2 s


#### simple_data_en

In [4]:
%%time

embeddings_1 = model.encode(simple_data_en['titles'], 
                            batch_size=12, 
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']

CPU times: user 398 ms, sys: 27.8 ms, total: 426 ms
Wall time: 478 ms


In [5]:
%%time

embeddings_2 = model.encode(simple_data_en['desc'])['dense_vecs']

CPU times: user 21.9 ms, sys: 11.1 ms, total: 33 ms
Wall time: 35.1 ms


In [6]:
%%time

similarity = embeddings_1 @ embeddings_2.T
similarity

CPU times: user 39 µs, sys: 14 µs, total: 53 µs
Wall time: 55.6 µs


array([0.626 , 0.3499], dtype=float16)
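
The `@` product above is read as cosine similarity. That is only valid if the dense vectors are unit-length, which we believe is the FlagEmbedding default for BGE-M3 dense output but did not verify in this notebook. A small plain-Python sketch with toy vectors (not real embeddings) showing why the two quantities coincide after normalization:

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a title and a description embedding.
title_vec = [3.0, 4.0, 0.0]
desc_vec = [1.0, 2.0, 2.0]

cos = cosine(title_vec, desc_vec)
dot_of_units = dot(normalize(title_vec), normalize(desc_vec))
print(cos, dot_of_units)
```

To check the assumption on the real output, `np.linalg.norm(embeddings_1, axis=1)` should be close to 1, within float16 precision.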

#### simple_data_en2

In [11]:
%%time

embeddings_1 = model.encode(simple_data_en2['titles'], 
                            batch_size=12, 
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']

embeddings_2 = model.encode(simple_data_en2['desc'])['dense_vecs']

similarity = embeddings_1 @ embeddings_2.T
similarity

CPU times: user 32.3 ms, sys: 71 µs, total: 32.4 ms
Wall time: 31.3 ms


array([0.708 , 0.2905], dtype=float16)

#### simple_data_zh

In [13]:
%%time

embeddings_1 = model.encode(simple_data_zh['titles'], 
                            batch_size=12, 
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']

embeddings_2 = model.encode(simple_data_zh['desc'])['dense_vecs']

similarity = embeddings_1 @ embeddings_2.T
similarity

CPU times: user 39.2 ms, sys: 594 µs, total: 39.8 ms
Wall time: 38.5 ms


array([0.7095, 0.2084, 0.1761, 0.3816, 0.2766], dtype=float16)

### Using the LlamaIndex API

In [None]:
%%time

from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name="chatfire/bge-m3:q8_0",
    base_url="http://192.168.0.72:11435",
    ollama_additional_kwargs={"mirostat": 0},
)

#### Simplest official example

In [None]:
pass_embedding = ollama_embedding.get_text_embedding_batch(
    ["This is a passage!", "This is another passage"], show_progress=True
)

query_embedding = ollama_embedding.get_query_embedding("Where is blue?")

Generating embeddings:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 17 ms, sys: 0 ns, total: 17 ms
Wall time: 299 ms


In [24]:
%%time

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Print the embedding sizes (for debugging); these are lengths, not full shapes
print(f"Query embedding shape: {len(query_embedding)}")
print(f"Documents embedding shape: {len(pass_embedding)}")
print(f"Documents embedding shape: {len(pass_embedding[0])}")

# Compute the cosine similarity
similarities = cosine_similarity([query_embedding], pass_embedding)[0]

similarities

Query embedding shape: 1024
Documents embedding shape: 2
Documents embedding shape: 1024
CPU times: user 1.33 ms, sys: 241 µs, total: 1.57 ms
Wall time: 1.29 ms


array([0.37509823, 0.32102673])

In [25]:
%%time

ollama_embedding.similarity(query_embedding,pass_embedding[0])

CPU times: user 184 µs, sys: 0 ns, total: 184 µs
Wall time: 178 µs


0.3750982311495818

In [26]:
ollama_embedding.similarity(query_embedding,pass_embedding[1])

0.3210267275431935

#### simple_data_en

In [33]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 0 ns, sys: 10.2 ms, total: 10.2 ms
Wall time: 313 ms


[0.7353921255077288, 0.44610423796588616]
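
The same three-step cell (embed each title, embed the description, compare) is repeated for every dataset and model below. One possible refactoring is sketched here with a hypothetical helper `score_titles` and a stand-in embedder, so that it runs without an Ollama server. With the real object one would pass `ollama_embedding.get_query_embedding` and `ollama_embedding.get_text_embedding` instead of `toy_embed`:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_titles(data, embed_query, embed_text, similarity=cosine):
    """Return one similarity score per title against data['desc']."""
    doc_vec = embed_text(data["desc"])
    return [similarity(embed_query(t), doc_vec) for t in data["titles"]]


def toy_embed(text):
    """Stand-in embedder: letter counts plus one (NOT a real model)."""
    return [text.lower().count(ch) + 1 for ch in "abem"]


demo = {
    "titles": ["What is BGE M3?", "Defination of BM25"],
    "desc": "BGE M3 is an embedding model.",
}
scores = score_titles(demo, toy_embed, toy_embed)
print(scores)
```

Passing the embedding functions as arguments keeps the comparison logic identical across models, so differences in the scores come from the embeddings rather than from the cell being edited by hand.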

#### simple_data_en2

In [34]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en2['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en2['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 9.61 ms, sys: 565 µs, total: 10.2 ms
Wall time: 314 ms


[0.7294886528560206, 0.6175215888569838]

#### simple_data_zh

In [35]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_zh['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_zh['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 17.4 ms, sys: 1.37 ms, total: 18.7 ms
Wall time: 630 ms


[0.34512684808687755,
 0.3593357783425573,
 0.3444081868084491,
 0.36743581566434685,
 0.3681121091437458]
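
All five Chinese scores fall in a narrow 0.34 to 0.37 band, so the Ollama-served model barely separates the matching title from the others. One diagnostic, not run in this notebook, would be to embed the same text with both backends and compare the two vectors directly. If the Ollama build were faithful to the original weights, we would expect that cosine to be close to 1. A sketch with toy vectors; `backend_agreement` and the sample values are hypothetical:

```python
import math


def backend_agreement(vec_a, vec_b):
    """Cosine similarity between two embeddings of the SAME text
    produced by two different backends."""
    if len(vec_a) != len(vec_b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)


# Toy values only: a reference vector, a slightly perturbed copy
# (what mild quantization noise might look like), and an unrelated one.
reference = [0.12, -0.40, 0.33, 0.85]
perturbed = [0.11, -0.41, 0.34, 0.84]
unrelated = [0.90, 0.10, -0.30, 0.05]

close = backend_agreement(reference, perturbed)
far = backend_agreement(reference, unrelated)
print(f"perturbed: {close:.4f}, unrelated: {far:.4f}")
```

With the real objects, the two inputs would be `model.encode(text)['dense_vecs']` and `ollama_embedding.get_text_embedding(text)` for the same `text`. A low agreement on Chinese input but a high one on English input would point at how the Ollama build handles Chinese text rather than at the similarity computation.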

## bge-large-zh:v1.5

In [41]:
%%time

from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    # model_name="dztech/bge-large-zh:v1.5",
    model_name="quentinz/bge-large-zh-v1.5",
    base_url="http://192.168.0.72:11435",
    ollama_additional_kwargs={"mirostat": 0},
)

CPU times: user 57 µs, sys: 20 µs, total: 77 µs
Wall time: 79.4 µs


### simple_data_en

In [42]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 10.4 ms, sys: 253 µs, total: 10.6 ms
Wall time: 1.32 s


[0.6511125167740869, 0.39602238080864743]

### simple_data_en2

In [43]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en2['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en2['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 8.78 ms, sys: 1.56 ms, total: 10.3 ms
Wall time: 303 ms


[0.7326026137600575, 0.39623643956081306]

### simple_data_zh

In [44]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_zh['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_zh['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 13.7 ms, sys: 5.37 ms, total: 19.1 ms
Wall time: 568 ms


[0.7338371389471868,
 0.4380859030075601,
 0.3091280955158122,
 0.5194359507257739,
 0.5239616908581779]

## gte-qwen2-1.5b-instruct-embed-f16

In [1]:
%%time

from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16",
    base_url="http://192.168.0.72:11435",
    ollama_additional_kwargs={"mirostat": 0},
)

CPU times: user 2.62 s, sys: 359 ms, total: 2.98 s
Wall time: 2.62 s


### simple_data_en

In [5]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 13.1 ms, sys: 26 µs, total: 13.1 ms
Wall time: 2.85 s


[0.6913344372407509, 0.43053375382086784]

### simple_data_en2

In [6]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_en2['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_en2['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 11.8 ms, sys: 234 µs, total: 12 ms
Wall time: 327 ms


[0.7857081872132854, 0.359405007111947]

### simple_data_zh

In [7]:
%%time

query_embeddings = [ollama_embedding.get_query_embedding(title) 
                    for title in simple_data_zh['titles']]
doc_embedding=ollama_embedding.get_text_embedding(simple_data_zh['desc'])

similarity=[ollama_embedding.similarity(query_embedding,doc_embedding)
           for query_embedding in query_embeddings]
similarity

CPU times: user 18.6 ms, sys: 3.48 ms, total: 22.1 ms
Wall time: 624 ms


[0.7387741058141705,
 0.2787999044073892,
 0.21621591390707512,
 0.3526459249675495,
 0.29698117094221566]