<p style="font-size:small; color:gray;"> Author: 鄭永誠, Year: 2024 </p>

# C4 - Advanced RAG Strategy Directions
----------
1. **Document preprocessing**: Converting PDFs to Markdown with dedicated tools such as [Mathpix](https://mathpix.com/) or [Marker](https://github.com/VikParuchuri/marker) can improve RAG results.

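2. **Changing the vector store or architecture**: Use a faster vector search backend such as FAISS, or switch to a different storage architecture altogether, such as GraphRAG.
Supplement: MMR (Maximal Marginal Relevance) retrieval
- Chroma DB supports it
- In very plain terms: for the query “永誠好帥” (roughly "永誠 is so handsome"), the mechanism may look for information related to both “永誠” and “好帥”, which broadens what gets retrieved. (Formally, it scores candidates with a function that balances relevance against diversity.)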
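The relevance-versus-diversity trade-off behind MMR can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by Chroma DB or LangChain: the function name `mmr_select`, the parameter `lambda_mult`, and the 2-D toy vectors are all made up for the example. Each step picks the candidate that maximizes `lambda_mult * sim(doc, query) - (1 - lambda_mult) * max sim(doc, already selected)`.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen greedily by MMR.

    lambda_mult = 1.0 means pure relevance; lower values favor diversity.
    (Hypothetical helper for illustration only.)
    """
    query_vec = np.asarray(query_vec, dtype=float).reshape(1, -1)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    relevance = cosine_sim_matrix(query_vec, doc_vecs)[0]  # similarity to the query
    doc_sim = cosine_sim_matrix(doc_vecs, doc_vecs)        # similarity between documents

    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # Penalize candidates that resemble something already selected
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy data: documents 0 and 1 are near-duplicates, document 2 points elsewhere
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
balanced = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the two near-duplicates are returned together; with `0.5` the second pick is the less similar document, which is the diversity effect described above.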

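3. **Contextual compression and related post-processing**: These can be implemented directly with components from the LangChain framework.
- LLMChainExtractor: extracts only the content relevant to the query, saving tokens
- LLMChainFilter: decides which of the initially retrieved documents to filter out entirely
- EmbeddingsFilter: filters by embedding similarity, the approach covered earlier
- These can be combined with DocumentCompressorPipeline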

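4. **Multi-query retrieval (MultiQueryRetriever)**: Uses an LLM to generate several questions similar to the original query; LangChain also provides an integration for this.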

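5. **Reranker**
- Typically implemented as a cross-encoder: the input is a pair of texts and the output is a relevance score, usually between 0 and 1
- Comparing embedding vectors, as in ordinary RAG, is fast, but it only compares two vectors directly, so some information can be lost
- A reranker is slower and more expensive to run, but more accurate
- When the corpus is large and you want both speed and accuracy, consider pairing it with an embedding model for two-stage retrieval
    - Stage 1: use vector comparison to find the top 100 relevant items
    - Stage 2: use the slower but more accurate reranker to pick the top 5 among them
- The example below uses Jina AI
- Other common reranker options include [Voyage](https://docs.voyageai.com/docs/reranker), [Cohere](https://docs.cohere.com/docs/overview), [Jina](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual), and the open-source [bge](https://huggingface.co/BAAI/bge-reranker-v2-m3) series
- Note that token limits may apply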
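The two-stage retrieval idea in item 5 can be sketched as follows. This is a rough outline under stated assumptions rather than a production pipeline: `two_stage_retrieve` and `toy_rerank` are hypothetical names, the 2-D vectors are hand-made rather than real embeddings, and `toy_rerank` (word-overlap Jaccard) is only a stand-in for a real reranker call such as an API request.

```python
from typing import Callable, List, Sequence
import numpy as np

def two_stage_retrieve(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: Sequence[Sequence[float]],
    rerank_fn: Callable[[str, List[str]], List[float]],
    first_stage_k: int = 100,
    final_k: int = 5,
) -> List[str]:
    """Cheap vector shortlist first, then a slower reranker on the shortlist only."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)

    # Stage 1: cosine similarity against every document vector (fast)
    sims = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    shortlist_idx = np.argsort(-sims)[:first_stage_k]
    shortlist = [docs[i] for i in shortlist_idx]

    # Stage 2: score only the shortlist with the reranker (slow, more accurate)
    scores = rerank_fn(query, shortlist)
    order = sorted(range(len(shortlist)), key=lambda i: scores[i], reverse=True)
    return [shortlist[i] for i in order[:final_k]]

def toy_rerank(query: str, candidates: List[str]) -> List[float]:
    """Stand-in reranker: Jaccard overlap of words. Replace with a real model."""
    q_words = set(query.split())
    return [len(q_words & set(c.split())) / len(q_words | set(c.split()))
            for c in candidates]

docs = ["apple orchard tour", "classic apple pie recipe",
        "car engine repair", "pumpkin pie recipe"]
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.7, 0.2]]  # made-up vectors

result = two_stage_retrieve(
    "apple pie recipe", [1.0, 0.0], docs, doc_vecs,
    rerank_fn=toy_rerank, first_stage_k=3, final_k=2,
)
print(result)
```

Passing the reranker in as a function keeps the sketch independent of any particular provider; the shortlist size also bounds how many documents (and tokens) are sent to the reranker.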












-----------------------
## Advanced RAG - Jina Reranker

### Direct comparison of embedding vectors
(The RAG pipeline is omitted here; this only shows sentence-pair similarity results computed with sentence-transformers.)

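In [3]:
""" Watch out for torch version conflicts. Do not run this cell as-is; adjust the versions for your environment (it is very likely to fail otherwise) """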
%pip show sentence-transformers
%pip install sentence-transformers==3.0.1 -q
%pip uninstall -y torch -q
%pip install torchvision 
%pip install torch==2.0.0 -q
%pip show torch

Note: you may need to restart the kernel to use updated packages.




Defaulting to user installation because normal site-packages is not writeable
Collecting torchvision
  Using cached torchvision-0.19.0-1-cp312-cp312-win_amd64.whl.metadata (6.1 kB)
Collecting torch==2.4.0 (from torchvision)
  Using cached torch-2.4.0-cp312-cp312-win_amd64.whl.metadata (27 kB)
Using cached torchvision-0.19.0-1-cp312-cp312-win_amd64.whl (1.3 MB)
Using cached torch-2.4.0-cp312-cp312-win_amd64.whl (197.8 MB)
Installing collected packages: torch, torchvision
Successfully installed torch-2.4.0 torchvision-0.19.0
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
milvus-model 0.2.0 requires protobuf==3.20.0, but you have protobuf 4.25.4 which is incompatible.

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\PipiHi\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement torch==2.0.0 (from versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0)
ERROR: No matching distribution found for torch==2.0.0



Name: torch
Version: 2.4.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: C:\Users\PipiHi\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages
Requires: filelock, fsspec, jinja2, networkx, setuptools, sympy, typing-extensions
Required-by: accelerate, FlagEmbedding, sentence-transformers, torchvision
Note: you may need to restart the kernel to use updated packages.


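In [4]:
""" A simple example: compare the similarity between "->簡禎富副校長真的好帥" and the other sentences """
import warnings

# Ignore warnings (I just don't want them popping up; they are annoying).
# Set the filter before the import so import-time warnings are silenced too.
warnings.simplefilter(action='ignore')

from sentence_transformers import SentenceTransformer

# Load the model (one chosen for its strength in Chinese)
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')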



In [6]:
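# Define the sentences; the target is the first one, "簡禎富副校長真的好帥"
sentences = [
    "->簡禎富副校長真的好帥",      # "Vice President 簡禎富 is really handsome" (target)
    "很多人都喜歡簡禎富的樣貌",    # "Many people like 簡禎富's looks"
    "大家看到簡禎富都會尖叫",      # "Everyone screams when they see 簡禎富"
    "簡禎富長得超級好看",          # "簡禎富 is extremely good-looking"
    "鄭永誠真的好帥",              # "鄭永誠 is really handsome"
    "副校長真的好帥"               # "The vice president is really handsome"
    ]

# Run the embedding (convert the sentences to vectors)
embedding = model.encode(sentences, convert_to_tensor=False)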

In [7]:
from sentence_transformers import util
cosine_scores = util.cos_sim(embedding, embedding)

d = {}
for i, v1 in enumerate(sentences):
    for j, v2 in enumerate(sentences):
        if i >= j:
            continue
        d[v1 + ' vs. ' + v2] = cosine_scores[i][j].item()

# sort by score
d_sorted = dict(sorted(d.items(), key=lambda x: x[1], reverse=True))
d_sorted

{'->簡禎富副校長真的好帥 vs. 副校長真的好帥': 0.8442525863647461,
 '很多人都喜歡簡禎富的樣貌 vs. 簡禎富長得超級好看': 0.8075946569442749,
 '很多人都喜歡簡禎富的樣貌 vs. 大家看到簡禎富都會尖叫': 0.8003949522972107,
 '大家看到簡禎富都會尖叫 vs. 簡禎富長得超級好看': 0.7775326371192932,
 '簡禎富長得超級好看 vs. 鄭永誠真的好帥': 0.7260526418685913,
 '->簡禎富副校長真的好帥 vs. 簡禎富長得超級好看': 0.7093853950500488,
 '鄭永誠真的好帥 vs. 副校長真的好帥': 0.7027955055236816,
 '->簡禎富副校長真的好帥 vs. 大家看到簡禎富都會尖叫': 0.6851842999458313,
 '->簡禎富副校長真的好帥 vs. 鄭永誠真的好帥': 0.6751537919044495,
 '->簡禎富副校長真的好帥 vs. 很多人都喜歡簡禎富的樣貌': 0.6724876761436462,
 '很多人都喜歡簡禎富的樣貌 vs. 鄭永誠真的好帥': 0.6299742460250854,
 '大家看到簡禎富都會尖叫 vs. 鄭永誠真的好帥': 0.6042143702507019,
 '簡禎富長得超級好看 vs. 副校長真的好帥': 0.5831388235092163,
 '大家看到簡禎富都會尖叫 vs. 副校長真的好帥': 0.5793490409851074,
 '很多人都喜歡簡禎富的樣貌 vs. 副校長真的好帥': 0.5629788637161255}

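The results above show that:
- "->簡禎富副校長真的好帥" vs. "副校長真的好帥": the similarity is very high (about 0.844)
- "->簡禎富副校長真的好帥" vs. "很多人都喜歡簡禎富的樣貌": surprisingly, this is the lowest of the target sentence's comparisons (about 0.672), even though both sentences are about 簡禎富's looks
- "->簡禎富副校長真的好帥" vs. "鄭永誠真的好帥": the similarity is still as high as 0.6751537919044495, even though the sentence is about a different person

(Note: these results also depend heavily on the embedding model used.)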
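For reference, the pairwise scores above are cosine similarities, which is what `util.cos_sim` computes. The calculation itself can be sketched in plain NumPy; the 2-D vectors below are made up for illustration and are not real sentence embeddings, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity: dot products of length-normalized rows."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T

# Toy vectors: same direction scores 1, a 45-degree angle about 0.707, orthogonal 0
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
toy_scores = cosine_similarity_matrix(toy_vectors)
print(toy_scores.round(3))
```

Because each sentence is reduced to a single vector before the comparison, the score reflects overall direction in embedding space rather than which specific words or entities match, which is one way to read the surprising rankings above.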




### Sentence comparison with Jina Reranker
- Jina AI provides a Reranker API
- https://jina.ai/reranker/

In [9]:
# -*- coding: utf-8 -*-

""" Setup Jina Reranker """
import os
import requests
from typing import List
import json

# API URL
JINA_RERANKER_URL = "https://api.jina.ai/v1/rerank"

# Jina Reranker function
def jina_rerank(query: str, text_list: List[str]):
    # Read the API key from an environment variable instead of hard-coding it.
    # (The variable name JINA_API_KEY is an assumption; use whatever name you prefer.)
    api_key = os.environ["JINA_API_KEY"]
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}

    json_data = {
      "model": "jina-reranker-v2-base-multilingual",
      "documents": text_list,
      "query": query,
      "top_n": 5,
    }

    response = requests.post(JINA_RERANKER_URL, headers=headers, json=json_data, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return response.json()
    
# Usage

search_query = "簡禎富副校長真的好帥" # the sentence to search for

# The sentences to compare against
just_case_text = [
    "很多人都喜歡簡禎富的樣貌", 
    "大家看到簡禎富都會尖叫",
    "簡禎富長得超級好看",
    "鄭永誠真的好帥", 
    "副校長真的好帥"
    ]

reranked_results = jina_rerank(search_query, just_case_text)

print(json.dumps(reranked_results["results"], indent=4, ensure_ascii=False))


""" 
In my run, the relevance of "簡禎富副校長真的好帥" to each sentence was:
"副校長真的好帥" = 0.9539660811424255
"簡禎富長得超級好看" = 0.6477982401847839
"很多人都喜歡簡禎富的樣貌" = 0.5940803289413452
"大家看到簡禎富都會尖叫" = 0.3747906982898712
"鄭永誠真的好帥" = 0.07696083933115005
"""


[
    {
        "index": 4,
        "document": {
            "text": "副校長真的好帥"
        },
        "relevance_score": 0.9539660811424255
    },
    {
        "index": 2,
        "document": {
            "text": "簡禎富長得超級好看"
        },
        "relevance_score": 0.6477982401847839
    },
    {
        "index": 0,
        "document": {
            "text": "很多人都喜歡簡禎富的樣貌"
        },
        "relevance_score": 0.5940803289413452
    },
    {
        "index": 1,
        "document": {
            "text": "大家看到簡禎富都會尖叫"
        },
        "relevance_score": 0.3747906982898712
    },
    {
        "index": 3,
        "document": {
            "text": "鄭永誠真的好帥"
        },
        "relevance_score": 0.07696083933115005
    }
]
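Each entry in the output above carries an `index`, a `document.text`, and a `relevance_score`. A small helper, sketched here with values copied from that output, could turn the list into (text, score) pairs and drop weak matches before they are passed to an LLM. The helper name `top_documents` and the 0.5 cutoff are illustrative assumptions, not part of the Jina API.

```python
from typing import Dict, List, Tuple

def top_documents(results: List[Dict], min_score: float = 0.5) -> List[Tuple[str, float]]:
    """Sort rerank results by score (highest first) and keep those at or above min_score."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [
        (r["document"]["text"], r["relevance_score"])
        for r in ranked
        if r["relevance_score"] >= min_score
    ]

# Values copied from the run shown above
sample_results = [
    {"index": 4, "document": {"text": "副校長真的好帥"}, "relevance_score": 0.9539660811424255},
    {"index": 2, "document": {"text": "簡禎富長得超級好看"}, "relevance_score": 0.6477982401847839},
    {"index": 0, "document": {"text": "很多人都喜歡簡禎富的樣貌"}, "relevance_score": 0.5940803289413452},
    {"index": 1, "document": {"text": "大家看到簡禎富都會尖叫"}, "relevance_score": 0.3747906982898712},
    {"index": 3, "document": {"text": "鄭永誠真的好帥"}, "relevance_score": 0.07696083933115005},
]

kept = top_documents(sample_results)
for text, score in kept:
    print(f"{score:.3f}  {text}")
```

Compared with the cosine scores earlier, the reranker separates "鄭永誠真的好帥" (a different person) much more sharply, so even a simple threshold like this removes it.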

