Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
search/embeddings: speed up similarity search by more than 50%
Exploit normalization of embeddings. Document this assumption and fulfill it in the mock embedding model.
Showing 8 changed files with 15 additions and 27 deletions.
2 changes: 1 addition & 1 deletion
...s/SemanticText.package/OpenAIEmbeddingModel.class/instance/getEmbeddingsForAll.config..st
18 changes: 2 additions & 16 deletions
packages/SemanticText.package/SemanticCorpus.class/instance/distanceBetween.and..st
@@ -1,21 +1,7 @@
 private
 distanceBetween: embedding and: anotherEmbedding
-"cosine distance"
+"Answer the cosine distance between both embeddings. The length of embeddings is ignored, so senders have to take care not to compare differences between pairs of vectors with different total scalars."
 
-| abs otherAbs |
 anotherEmbedding ifNil: [^ Float infinity].
 
-abs := embedding squaredLength.
-abs = 0 ifTrue: [^ Float infinity].
-otherAbs := anotherEmbedding squaredLength.
-otherAbs = 0 ifTrue: [^ Float infinity].
-^ 1.0 -
-(
-(embedding dot: anotherEmbedding)
-/
-(
-abs
-*
-otherAbs
-) sqrt
-)
+^ 1.0 - (embedding dot: anotherEmbedding)
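
The simplification in this hunk can be checked with a short derivation. This is a sketch, not part of the commit; the symbols $a$ and $b$ are introduced here for `embedding` and `anotherEmbedding`. The removed code computed the general cosine distance, dividing by the square root of the product of both squared lengths:

$$
d(a, b) = 1 - \frac{a \cdot b}{\sqrt{\lVert a \rVert^{2} \, \lVert b \rVert^{2}}} = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

If both embeddings are normalized, the denominator is 1 and the distance reduces to the expression the new code returns:

$$
\lVert a \rVert = \lVert b \rVert = 1 \quad\Longrightarrow\quad d(a, b) = 1 - a \cdot b
$$

Because the documented guarantee is only that each vector has a length "very close to 1", the result matches the cosine distance up to that tolerance. The removed zero-length guards also mean that an all-zero embedding now yields a distance of 1.0 rather than `Float infinity`.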
1 change: 1 addition & 0 deletions
...ges/SemanticText.package/SemanticCorpus.class/instance/findAllDocuments.nearEmbedding..st
9 changes: 5 additions & 4 deletions
...nticText.package/SemanticMockEmbeddingModel.class/instance/getEmbeddingsForAll.config..st
@@ -1,15 +1,16 @@
 service
 getEmbeddingsForAll: strings config: aConfigOrNil
-"Answer a collection with one embedding for each string. Each embedding vector is an array of numbers, commonly represented as a Float32Array."
+"Answer a collection with one embedding for each string. Each embedding vector is an array of numbers, commonly represented as a Float32Array. Each vector is normalized, i.e., has a length very close to 1."
 
 | config |
 config := self baseConfig.
 aConfigOrNil ifNotNil:
 [config := config updatedWith: aConfigOrNil].
 
 ^ strings collect: [:string |
-| words |
+| words vector |
 words := string substrings collect: [:word | word asLowercaseAlphabetic] as: Bag.
-self keywords
+vector := self keywords
 collect: [:keyword | (words occurrencesOf: keyword) / words size]
-as: Float32Array]
+as: Float32Array.
+vector /= vector length]
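
The normalization step added to the mock model can be written out as follows. This is a sketch with symbols introduced here, not taken from the commit: $v$ stands for the keyword-frequency vector and $\hat{v}$ for the answered embedding.

$$
\hat{v} = \frac{v}{\lVert v \rVert}, \qquad \lVert \hat{v} \rVert = \frac{\lVert v \rVert}{\lVert v \rVert} = 1 \quad \text{for } v \neq 0
$$

This is what lets `distanceBetween:and:` skip the division by the vector lengths. The quotient is undefined for $v = 0$, which would arise if a string contains none of the keywords. How the mock model behaves in that case is not visible in this diff.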