---
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - mteb
license: apache-2.0
language:
  - zh
  - en
---

C-MTEB per-dataset results for Dmeta-embedding (condensed from the model-index metadata to one headline metric per task: STS → cos_sim_spearman, Classification → accuracy, Clustering → v_measure, Reranking → map, Retrieval → ndcg_at_10, PairClassification → cos_sim_ap):

| Dataset | Task | Split | Score |
|---|---|---|---|
| C-MTEB/AFQMC | STS | validation | 71.13 |
| C-MTEB/ATEC | STS | test | 64.64 |
| mteb/amazon_reviews_multi (zh) | Classification | test | 44.93 |
| C-MTEB/BQ | STS | test | 72.30 |
| C-MTEB/CLSClusteringP2P | Clustering | test | 40.24 |
| C-MTEB/CLSClusteringS2S | Clustering | test | 39.17 |
| C-MTEB/CMedQAv1-reranking | Reranking | test | 88.49 |
| C-MTEB/CMedQAv2-reranking | Reranking | test | 89.18 |
| C-MTEB/CmedqaRetrieval | Retrieval | dev | 45.83 |
| C-MTEB/CMNLI | PairClassification | validation | 90.66 |
| C-MTEB/CovidRetrieval | Retrieval | dev | 76.77 |
| C-MTEB/DuRetrieval | Retrieval | dev | 81.47 |
| C-MTEB/EcomRetrieval | Retrieval | dev | 68.42 |
| C-MTEB/IFlyTek-classification | Classification | validation | 48.30 |
| C-MTEB/JDReview-classification | Classification | test | 85.07 |
| C-MTEB/LCQMC | STS | test | 78.76 |
| C-MTEB/Mmarco-reranking | Reranking | dev | 24.76 |
| C-MTEB/MMarcoRetrieval | Retrieval | dev | 71.70 |
| mteb/amazon_massive_intent (zh-CN) | Classification | test | 71.16 |
| mteb/amazon_massive_scenario (zh-CN) | Classification | test | 73.54 |
| C-MTEB/MedicalRetrieval | Retrieval | dev | 64.42 |
| C-MTEB/MultilingualSentiment-classification | Classification | validation | 75.16 |
| C-MTEB/OCNLI | PairClassification | validation | 87.18 |
| C-MTEB/OnlineShopping-classification | Classification | test | 93.25 |
| C-MTEB/PAWSX | STS | test | 45.05 |
| C-MTEB/QBQTC | STS | test | 42.62 |
| mteb/sts22-crosslingual-sts (zh) | STS | test | 63.84 |
| C-MTEB/STSB | STS | test | 80.77 |
| C-MTEB/T2Reranking | Reranking | dev | 66.26 |
| C-MTEB/T2Retrieval | Retrieval | dev | 81.33 |
| C-MTEB/TNews-classification | Classification | validation | 52.41 |
| C-MTEB/ThuNewsClusteringP2P | Clustering | test | 65.58 |
| C-MTEB/ThuNewsClusteringS2S | Clustering | test | 58.84 |
| C-MTEB/VideoRetrieval | Retrieval | dev | 73.31 |
| C-MTEB/waimai-classification | Classification | test | 86.21 |

Dmeta-embedding

Dmeta-embedding is a cross-domain, cross-task, out-of-the-box Chinese embedding model. It fits a wide range of scenarios such as search, question answering, intelligent customer service, and LLM+RAG, and can be loaded for inference with tools such as Transformers, Sentence-Transformers, and Langchain.

Key strengths:

  • Excellent multi-task and cross-scenario generalization; currently ranked second on the MTEB Chinese leaderboard (2024.01.25)
  • A parameter footprint of only about 400MB, which greatly reduces inference cost compared with multi-GB models
  • A context window of up to 1024 tokens, a better fit for long-text retrieval, RAG, and similar scenarios

Usage

The model currently supports inference through mainstream frameworks such as Sentence-Transformers, Langchain, and Huggingface Transformers; see the examples below for each framework.

Sentence-Transformers

Dmeta-embedding can be loaded and run via sentence-transformers:

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

model = SentenceTransformer('DMetaSoul/Dmeta-embedding')
embs1 = model.encode(texts1, normalize_embeddings=True)
embs2 = model.encode(texts2, normalize_embeddings=True)

# Compute pairwise similarity (embeddings are normalized, so this is cosine similarity)
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] from most to least similar
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Similar text: {text2}, score: {score}")
    print()

Example output:

Query: 胡子长得太快怎么办?
Similar text: 胡子长得快怎么办?, score: 0.9535336494445801
Similar text: 怎样使胡子不浓密!, score: 0.6776421070098877
Similar text: 香港买手表哪里好, score: 0.2297907918691635
Similar text: 在杭州手机到哪里买, score: 0.11386542022228241

Query: 在香港哪里买手表好
Similar text: 香港买手表哪里好, score: 0.9843372106552124
Similar text: 在杭州手机到哪里买, score: 0.45211508870124817
Similar text: 胡子长得快怎么办?, score: 0.19985519349575043
Similar text: 怎样使胡子不浓密!, score: 0.18558596074581146
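Because encode is called with normalize_embeddings=True, the matrix product embs1 @ embs2.T directly yields cosine similarities. A minimal numpy sketch with made-up vectors (not real model outputs) showing why the normalized dot product equals cosine similarity:

```python
import numpy as np

# Toy vectors standing in for model embeddings (values are made up).
raw1 = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])
raw2 = np.array([[2.0, 2.0, 1.0], [4.0, 0.0, 3.0]])

# L2-normalize rows, as normalize_embeddings=True does.
embs1 = raw1 / np.linalg.norm(raw1, axis=1, keepdims=True)
embs2 = raw2 / np.linalg.norm(raw2, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities.
dot_sim = embs1 @ embs2.T

# Cosine similarity computed directly from the raw vectors.
cos_sim = (raw1 @ raw2.T) / (
    np.linalg.norm(raw1, axis=1, keepdims=True) * np.linalg.norm(raw2, axis=1)
)

assert np.allclose(dot_sim, cos_sim)
print(dot_sim)
```

This is also why the scores in the example output fall in [-1, 1].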

Langchain

Dmeta-embedding can be loaded and run via the LLM tooling framework langchain:

pip install -U langchain

import torch
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "DMetaSoul/Dmeta-embedding"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # set True so dot products are cosine similarities

model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

embs1 = model.embed_documents(texts1)
embs2 = model.embed_documents(texts2)
embs1, embs2 = np.array(embs1), np.array(embs2)

# Compute pairwise similarity
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] from most to least similar
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Similar text: {text2}, score: {score}")
    print()

HuggingFace Transformers

Dmeta-embedding can be loaded and run via the HuggingFace Transformers framework:

pip install -U transformers

import torch
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def cls_pooling(model_output):
    return model_output[0][:, 0]


texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]

tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding')
model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding')
model.eval()

with torch.no_grad():
    inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
    inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')

    model_output1 = model(**inputs1)
    model_output2 = model(**inputs2)
    # CLS pooling is used here; mean_pooling above is a drop-in alternative
    embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
    embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
    embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()

# Compute pairwise similarity
similarity = embs1 @ embs2.T
print(similarity)

# For each texts1[i], rank texts2[j] from most to least similar
for i in range(len(texts1)):
    scores = []
    for j in range(len(texts2)):
        scores.append([texts2[j], similarity[i][j]])
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    print(f"Query: {texts1[i]}")
    for text2, score in scores:
        print(f"Similar text: {text2}, score: {score}")
    print()
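The mean_pooling helper averages token embeddings while ignoring padding positions (the example itself uses CLS pooling; mean pooling is a common alternative). The masking arithmetic can be sketched in plain numpy, with toy values standing in for real model outputs:

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask[:, :, None].astype(float)   # expand to (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # sum only the real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # clamp to avoid division by zero
    return summed / counts

# Two sequences of length 3; the second has one padding position (mask == 0).
emb = np.array([[[1.0, 1.0], [3.0, 5.0], [2.0, 0.0]],
                [[4.0, 2.0], [2.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])

pooled = mean_pooling_np(emb, mask)
print(pooled)  # the padded token [9, 9] is excluded from the second row's average
```

The torch version above does the same thing with unsqueeze/expand in place of numpy broadcasting.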

Evaluation

Dmeta-embedding ranks first among open-source models on the MTEB Chinese leaderboard (as of 2024.01.25; the top-ranked Baichuan model is not open source). For the evaluation data and code, see the official MTEB repository.

MTEB Chinese:

The leaderboard datasets were collected and curated by the Beijing Academy of Artificial Intelligence (BAAI) team: 35 Chinese datasets across 6 classic task types, covering classification, retrieval, reranking, sentence-pair, STS, and related tasks. It is currently the authoritative global benchmark for all-around evaluation of embedding models.

| Model | Vendor | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|---|
| Dmeta-embedding | DMetaSoul | 1024 | 67.51 | 70.41 | 64.09 | 88.92 | 70.00 | 67.17 | 50.96 |
| gte-large-zh | Alibaba DAMO Academy | 1024 | 66.72 | 72.49 | 57.82 | 84.41 | 71.34 | 67.40 | 53.07 |
| BAAI/bge-large-zh-v1.5 | BAAI | 1024 | 64.53 | 70.46 | 56.25 | 81.60 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | BAAI | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| text-embedding-ada-002 | OpenAI | 1536 | 53.02 | 52.00 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| text2vec-base | individual | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | individual | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |

FAQ

1. Why does the model generalize so well across tasks and scenarios, working out of the box for so many applications?

In short, its generalization comes from broad, diverse pretraining data combined with optimization objectives designed for multi-task scenarios.

Specifically, the key technical points are:

1) Large-scale weakly supervised contrastive learning. Industry experience shows that out-of-the-box language models perform poorly on embedding tasks, and because supervised labels are costly to annotate or acquire, large-scale, high-quality weak-label learning is a viable alternative. We extracted weak labels from semi-structured web data such as forums, news, Q&A communities, and encyclopedias, and used a large model to filter out low-quality pairs, yielding on the order of one billion weakly supervised text pairs.

2) High-quality supervised learning. We collected and curated large open-source labeled sentence-pair datasets, 30 million pairs in total, spanning encyclopedias, education, finance, healthcare, law, news, academia, and other domains. We also mined hard negative pairs so that contrastive learning optimizes the model more effectively.

3) Targeted optimization for retrieval. Since search, question answering, and RAG are the main arenas where embedding models are deployed, we optimized specifically for retrieval to strengthen cross-domain and cross-scenario performance: the core idea is to mine hard negatives from Q&A and retrieval data using both sparse and dense retrieval, building a dataset of millions of hard-negative pairs, which significantly improves cross-domain retrieval performance.
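As a rough illustration of the contrastive objective described above (a hypothetical sketch, not the actual training code), an InfoNCE-style loss over in-batch positives plus mined hard negatives can be written as:

```python
import numpy as np

def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """Toy InfoNCE: query i must rank its own positive above the other
    in-batch positives and all mined hard negatives. The temperature
    value here is illustrative, not the one used in training."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    n = hard_negatives / np.linalg.norm(hard_negatives, axis=1, keepdims=True)

    candidates = np.concatenate([p, n], axis=0)        # (B + H, dim)
    logits = (q @ candidates.T) / temperature          # (B, B + H)

    # Numerically stable log-softmax; the correct column for row i is i.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())

rng = np.random.default_rng(0)
B, H, D = 4, 6, 8  # batch size, hard negatives, embedding dim (toy values)
loss = info_nce_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)), rng.normal(size=(H, D)))
print(loss)
```

Mining harder negatives makes the off-diagonal logits larger, which is what forces the model to learn finer-grained distinctions.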

2. Can the model be used commercially?

Our open-source model is licensed under Apache-2.0 and fully supports free commercial use.

3. How can the MTEB results be reproduced?

We provide the script mteb_eval.py in the model repository; you can run it directly to reproduce our evaluation results.

4. What are the future plans?

We will keep working to provide the community with embedding models that perform well, are lightweight to run, and work out of the box across many scenarios. We will also gradually integrate embeddings into our existing technology ecosystem, growing together with the community!

Contact

If you run into any problems while using the model, you are welcome to share feedback in the discussion area.

You can also contact us: Zhao Zhonghao zhongh@dmetasoul.com, Xiao Wenbin xiaowenbin@dmetasoul.com, Sun Kai sunkai@dmetasoul.com

License

Dmeta-embedding is released under the Apache-2.0 License; the open-source model is free for commercial use and private deployment.
