fix: pipelines of embed and rerank (#53)
LongxingTan committed Jun 5, 2024
1 parent 23b9d44 commit 24bdc39
Showing 25 changed files with 425 additions and 336 deletions.
21 changes: 0 additions & 21 deletions .github/workflows/test.yml
@@ -43,33 +43,12 @@ jobs:
shell: bash
run: python -m unittest discover -s ./tests -p 'test_*.py'

- name: Codecov startup
if: success()
run: |
codecovcli create-commit
codecovcli create-report
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}'

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v4.0.1
with:
token: ${{ secrets.CODECOV_TOKEN }}
slug: longxingtan/open-retrievals

# - name: Static Analysis
# run: |
# codecovcli static-analysis --token=${CODECOV_STATIC_TOKEN} \
# --folders-to-exclude .artifacts \
# --folders-to-exclude .github \
# --folders-to-exclude .venv \
# --folders-to-exclude static \
# --folders-to-exclude bin
# env:
# CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
# CODECOV_STATIC_TOKEN: ${{ secrets.CODECOV_STATIC_TOKEN }}


docs:
name: Test docs build
runs-on: ubuntu-latest
54 changes: 13 additions & 41 deletions README.md
@@ -14,8 +14,8 @@
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="520" align=center/>
</h1><br>
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
@@ -27,9 +27,10 @@

**[Documentation](https://open-retrievals.readthedocs.io)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)** | **[日本語](https://github.com/LongxingTan/open-retrievals/blob/master/README_ja-JP.md)**

**Open-retrievals** simplifies text embeddings, retrievals, ranking, and RAG applications using PyTorch and Transformers. This user-friendly framework is designed for information retrieval and LLM-enhanced generation.
- Contrastive learning enhanced embeddings/ LLM embeddings
- Cross-encoder and ColBERT Rerank
**Open-retrievals** simplifies text embedding, retrieval, reranking, and RAG using PyTorch and Transformers. This user-friendly framework is designed for information retrieval and LLM generation.
- Embeddings, retrieval, and reranking all in one: `AutoModelForEmbedding`
- Contrastive-learning- and LLM-enhanced embeddings, with pointwise, pairwise, and listwise training
- Cross-encoder, ColBERT, and LLM-enhanced reranking
- Fast RAG demo integrated with Langchain and LlamaIndex
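The retrieve-then-rerank flow these bullets describe can be sketched in plain Python. This is a hedged illustration only: the two scoring functions below are toy stand-ins for the bi-encoder (`AutoModelForEmbedding`) and the cross-encoder reranker, not the library's implementation.

```python
# Sketch of a retrieve-then-rerank pipeline with toy scoring functions.

def embed_score(query: str, doc: str) -> float:
    # Stand-in for a bi-encoder: word-set overlap (fast, coarse).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def rerank_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: weighs matching words by position
    # (slower, but looks at query and document jointly).
    q = query.lower().split()
    return sum(1.0 / (i + 1) for i, w in enumerate(doc.lower().split()) if w in q)

def retrieve_then_rerank(query, docs, k=2, top_n=1):
    # Stage 1: cheap embedding-based retrieval keeps the top-k candidates.
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:k]
    # Stage 2: the reranker re-orders only those k candidates.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]

docs = ["a man plays guitar", "a dog chases a car", "a dog sleeps"]
print(retrieve_then_rerank("dog chases car", docs))
```

The point of the two stages is cost: the cheap scorer prunes the corpus so the expensive, more accurate scorer only sees a handful of candidates.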


@@ -67,7 +68,7 @@ pip install -e .

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-WBMisdWLeHUKlzJ2DrREXY_kSV8vjP3?usp=sharing)

**Text embedding from Pretrained weights**
**Text embeddings from pretrained weights**
```python
from retrievals import AutoModelForEmbedding

@@ -102,10 +103,10 @@ print(indices)

**Rerank using pretrained weights**
```python
from retrievals import RerankModel
from retrievals import AutoModelForRanking

model_name_or_path: str = "BAAI/bge-reranker-base"
rerank_model = RerankModel.from_pretrained(model_name_or_path)
rerank_model = AutoModelForRanking.from_pretrained(model_name_or_path)
scores_list = rerank_model.compute_score(["In 1974, I won the championship in Southeast Asia in my first kickboxing match", "In 1982, I defeated the heavy hitter Ryu Long."])
print(scores_list)
```
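A cross-encoder scores a query and a document jointly, so its inputs are typically (query, document) pairs. A small sketch of preparing such pairs (the `make_pairs` helper is hypothetical, for illustration only, not the `compute_score` signature):

```python
def make_pairs(query: str, documents: list[str]) -> list[tuple[str, str]]:
    # A cross-encoder consumes query and candidate together,
    # so every document is paired with the same query.
    return [(query, doc) for doc in documents]

pairs = make_pairs(
    "who won the 1974 match?",
    ["In 1974, I won the championship in Southeast Asia in my first kickboxing match",
     "In 1982, I defeated the heavy hitter Ryu Long."],
)
print(len(pairs))
```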
@@ -122,7 +123,7 @@ pip install chromadb

```python
from retrievals.tools.langchain import LangchainEmbedding, LangchainReranker, LangchainLLM
from retrievals import RerankModel
from retrievals import AutoModelForRanking
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma as Vectorstore
from langchain.prompts.prompt import PromptTemplate
@@ -141,7 +142,7 @@ vectordb = Vectorstore(
retrieval_args = {"search_type" :"similarity", "score_threshold": 0.15, "k": 10}
retriever = vectordb.as_retriever(**retrieval_args)

ranker = RerankModel.from_pretrained(rerank_model_name_or_path)
ranker = AutoModelForRanking.from_pretrained(rerank_model_name_or_path)
reranker = LangchainReranker(model=ranker, top_n=3)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker, base_retriever=retriever
@@ -181,20 +182,6 @@ response = qa_chain({"query": user_query})
print(response)
```

[//]: # (**RAG with LLamaIndex**)

[//]: # ()
[//]: # (```shell)

[//]: # (pip install llamaindex)

[//]: # (```)

[//]: # ()
[//]: # (```python)

[//]: # ()
[//]: # (```)

**Text embedding model fine-tuned by contrastive learning**

@@ -244,7 +231,7 @@ trainer.train()

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, RerankModel, RerankTrainer, RerankDataset
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset

model_name_or_path: str = "microsoft/deberta-v3-base"
max_length: int = 128
@@ -254,7 +241,7 @@ epochs: int = 3

train_dataset = RerankDataset('./t2rank.json', positive_key='pos', negative_key='neg')
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = RerankModel.from_pretrained(model_name_or_path, pooling_method="mean")
model = AutoModelForRanking.from_pretrained(model_name_or_path, pooling_method="mean")
optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)
@@ -277,21 +264,6 @@ trainer.scheduler = scheduler
trainer.train()
```
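The scheduler above warms up over 5% of the total training steps. A quick check of that arithmetic (hedged: the batch size and dataset size are truncated out of this diff, so the values below are placeholders for illustration):

```python
# Placeholder values: epochs matches the snippet above; the dataset
# size (6400) and batch size (32) are assumed for illustration.
dataset_size = 6_400
batch_size = 32
epochs = 3

# Mirrors the computation in the training snippet.
num_train_steps = int(dataset_size / batch_size * epochs)
num_warmup_steps = int(0.05 * num_train_steps)  # 5% linear warmup

print(num_train_steps, num_warmup_steps)
```

Note that `get_cosine_schedule_with_warmup` expects an integer warmup-step count, so rounding `0.05 * num_train_steps` explicitly, as above, is the safer form.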

**Semantic search by cosine similarity/KNN**
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval

query_texts = ['A dog is chasing car.']
document_texts = ['A man is playing a guitar.', 'A bee is flying low']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding.from_pretrained(model_name_or_path)
query_embeddings = model.encode(query_texts, convert_to_tensor=True)
document_embeddings = model.encode(document_texts, convert_to_tensor=True)

matcher = AutoModelForRetrieval(method='cosine')
dists, indices = matcher.similarity_search(query_embeddings, document_embeddings, top_k=1)
```
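What `AutoModelForRetrieval(method='cosine')` computes can be illustrated in plain Python: cosine similarity between a query embedding and each document embedding, then an argmax over the scores. This is a sketch over toy vectors, not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings standing in for model.encode(...) output.
query_embedding = [1.0, 0.0, 1.0]
document_embeddings = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
best = max(range(len(scores)), key=scores.__getitem__)  # index of top document
print(best)
```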


## Reference & Acknowledge
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
34 changes: 10 additions & 24 deletions README_ja-JP.md
@@ -14,8 +14,8 @@
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="520" align=center/>
</h1><br>
<img src="./docs/source/_static/logo.svg" width="420" align=center/>
</h1>

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
@@ -28,8 +28,8 @@
**[Documentation](https://open-retrievals.readthedocs.io)** | **[English](https://github.com/LongxingTan/open-retrievals/blob/master/README.md)** | **[Chinese](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)**

**Open-Retrievals** is an easy-to-use Python framework, based on PyTorch and Transformers, for obtaining SOTA text embeddings, oriented toward information retrieval and LLM retrieval-augmented generation.
- Contrastive-learning embeddings
- LLM embeddings
- `AutoModelForEmbedding` unifies vectorization, retrieval, and reranking
- Contrastive-learning embeddings, LLM embeddings
- Fast RAG demo


@@ -95,10 +95,10 @@ print(indices)

**Rerank**
```python
from retrievals import RerankModel
from retrievals import AutoModelForRanking

model_name_or_path: str = "BAAI/bge-reranker-base"
rerank_model = RerankModel.from_pretrained(model_name_or_path)
rerank_model = AutoModelForRanking.from_pretrained(model_name_or_path)
scores_list = rerank_model.compute_score(["In 1974, I won the championship in Southeast Asia in my first kickboxing match", "In 1982, I defeated the heavy hitter Ryu Long."])
print(scores_list)
```
@@ -116,7 +116,7 @@ pip install chromadb
- Server
```python
from retrievals.tools.langchain import LangchainEmbedding, LangchainReranker, LangchainLLM
from retrievals import RerankModel
from retrievals import AutoModelForRanking
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma as Vectorstore
from langchain.prompts.prompt import PromptTemplate
@@ -135,7 +135,7 @@ vectordb = Vectorstore(
retrieval_args = {"search_type" :"similarity", "score_threshold": 0.15, "k": 10}
retriever = vectordb.as_retriever(**retrieval_args)

ranker = RerankModel.from_pretrained(rerank_model_name_or_path)
ranker = AutoModelForRanking.from_pretrained(rerank_model_name_or_path)
reranker = LangchainReranker(model=ranker, top_n=3)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker, base_retriever=retriever
@@ -175,20 +175,6 @@ response = qa_chain({"query": user_query})
print(response)
```

[//]: # (**RAG with LLamaIndex**)

[//]: # ()
[//]: # (```shell)

[//]: # (pip install llamaindex)

[//]: # (```)

[//]: # ()
[//]: # (```python)

[//]: # ()
[//]: # (```)

**Fine-tuning transformers weights with contrastive learning**

@@ -249,7 +235,7 @@ model = AutoModelForEmbedding.from_pretrained(

```python
from transformers import AutoTokenizer, TrainingArguments, get_cosine_schedule_with_warmup, AdamW
from retrievals import RerankCollator, RerankModel, RerankTrainer, RerankDataset
from retrievals import RerankCollator, AutoModelForRanking, RerankTrainer, RerankDataset

model_name_or_path: str = "microsoft/deberta-v3-base"
max_length: int = 128
@@ -259,7 +245,7 @@ epochs: int = 3

train_dataset = RerankDataset('./t2rank.json', positive_key='pos', negative_key='neg')
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
model = RerankModel.from_pretrained(model_name_or_path, pooling_method="mean")
model = AutoModelForRanking.from_pretrained(model_name_or_path, pooling_method="mean")
optimizer = AdamW(model.parameters(), lr=learning_rate)
num_train_steps = int(len(train_dataset) / batch_size * epochs)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0.05 * num_train_steps, num_training_steps=num_train_steps)